GISR

TidyR – organize your data

Today we will introduce you to the basic funtions of the TidyR library, which is used to change the shape and hierarchy of a dataset. We’ll start, as always, by loading the library:

library(tidyr)

For our example, we will use data that we generate ourselves. For example, let’s take the number of men and women in regions A and B in the two years 2020 and 2021. We will use the tibble function to create a table with this data:

dt = tibble(region = c("A","A","B","B"), 
gender = c("m","f","m","f"),
"2020" = sample(1:1000,4), 
"2021" = sample(1:1000,4))

Our data looks like this:

dt
# A tibble: 4 x 4
  region gender `2020` `2021`
   chr    chr     int    int 
1 A      m         802    668
2 A      f         944    602
3 B      m         503    940
4 B      f         543     45

The first function we will learn is gather, which converts columns to rows:

g = gather(data = dt, '2020', '2021', key = 'year', value = 'count')

This function converts our two columns, 2020 and 2021, into a single column called count, and also adds a column called year that contains the names of the original columns:

g
# A tibble: 8 x 4
  region gender year  count
   chr    chr    chr   int 
1 A      m      2020    802
2 A      f      2020    944
3 B      m      2020    503
4 B      f      2020    543
5 A      m      2021    668
6 A      f      2021    602
7 B      m      2021    940
8 B      f      2021     45

The spread function, on the other hand, converts rows to columns. We will split our data from table g by gender:

s = spread(g,gender,count)
s
# A tibble: 4 x 4
  region year      f     m
   chr    chr   int   int 
1 A      2020    944   802
2 A      2021    602   668
3 B      2020    543   503
4 B      2021     45   940

TidyR has other useful functions to clean up our collection of missing NA data. Let’s add the missing data to our table dt:

dt[3,3] = NA
dt
# A tibble: 4 x 4
  region gender `2020` `2021`
   chr    chr     int    int 
1 A      m         802    668
2 A      f         944    602
3 B      m          NA    940
4 B      f         543     45

The drop_na function removes rows with missing data from our data frame:

drop_na(dt)

The row with missing data is removed:

# A tibble: 3 x 4
  region gender `2020` `2021`
   chr    chr     int    int 
1 A      m         802    668
2 A      f         944    602
3 B      f         543     45

The fill function fills the missing data with values from the top or bottom of the column:

fill(data = dt,'2020',.direction = "up")

The missing value was filled in from the bottom row:

# A tibble: 4 x 4
  region gender `2020` `2021`
   chr    chr     int    int 
1 A      m         802    668
2 A      f         944    602
3 B      m         543    940
4 B      f         543     45

Replace_na is used to replace the missing value with a user-defined value:

replace_na(dt,list('2020'=0))
# A tibble: 4 x 4
  region gender `2020` `2021`
   chr    chr     dbl    int 
1 A      m         802    668
2 A      f         944    602
3 B      m           0    940
4 B      f         543     45

What if our data was stored in a column as a delimited string of numbers:

# A tibble: 4 x 3
  region gender count  
   chr    chr    chr   
1 A      m      24;34  
2 A      f      719;508
3 B      m      819;981
4 B      f      47;631 

We use the seperate_rows function:

separate_rows(dt,count,sep=';')

To separate our columns into individual rows:

# A tibble: 8 x 3
  region gender count
   chr    chr    chr 
1 A      m      24   
2 A      m      34   
3 A      f      719  
4 A      f      508  
5 B      m      819  
6 B      m      981  
7 B      f      47   
8 B      f      631 

Or the separate function:

separate(dt,count,sep=';',into = c('2020','2021'))

To separate the data into two columns:

# A tibble: 4 x 4
  region gender `2020` `2021`
   chr    chr    chr    chr  
1 A      m      24     34    
2 A      f      719    508   
3 B      m      819    981   
4 B      f      47     631

To return to writing with a separator, use the unite function:

unite(dt,'2020','2021',col = "count",sep = "-")
# A tibble: 4 x 3
  region gender count  
   chr    chr    chr   
1 A      m      24-34  
2 A      f      719-508
3 B      m      819-981
4 B      f      47-631 

The tools in the tidyR library are very useful for organising our data so that we can run analyses on it or create plots from it.

Leave a Reply

Your email address will not be published. Required fields are marked *