TidyR – organize your data

15 March 202122 April 2021 0 Comments fill_na, gather, R, separate, spread, tibble, tidyr

Today we will introduce you to the basic funtions of the TidyR library, which is used to change the shape and hierarchy of a dataset. We’ll start, as always, by loading the library:

library(tidyr)

1	library(tidyr)

For our example, we will use data that we generate ourselves. For example, let’s take the number of men and women in regions A and B in the two years 2020 and 2021. We will use the tibble function to create a table with this data:

dt = tibble(region = c("A","A","B","B"), 
gender = c("m","f","m","f"),
"2020" = sample(1:1000,4), 
"2021" = sample(1:1000,4))

dt = tibble(region = c("A","A","B","B"),

gender = c("m","f","m","f"),

"2020" = sample(1:1000,4),

"2021" = sample(1:1000,4))

Our data looks like this:

dt
# A tibble: 4 x 4
  region gender `2020` `2021`
   chr    chr     int    int 
1 A      m         802    668
2 A      f         944    602
3 B      m         503    940
4 B      f         543     45

# A tibble: 4 x 4

region gender `2020` `2021`

chr chr int int

1 A m 802 668

2 A f 944 602

3 B m 503 940

4 B f 543 45

The first function we will learn is gather, which converts columns to rows:

g = gather(data = dt, '2020', '2021', key = 'year', value = 'count')

1	g = gather(data = dt, '2020', '2021', key = 'year', value = 'count')

This function converts our two columns, 2020 and 2021, into a single column called count, and also adds a column called year that contains the names of the original columns:

g
# A tibble: 8 x 4
  region gender year  count
   chr    chr    chr   int 
1 A      m      2020    802
2 A      f      2020    944
3 B      m      2020    503
4 B      f      2020    543
5 A      m      2021    668
6 A      f      2021    602
7 B      m      2021    940
8 B      f      2021     45

# A tibble: 8 x 4

region gender year count

chr chr chr int

1 A m 2020 802

2 A f 2020 944

3 B m 2020 503

4 B f 2020 543

5 A m 2021 668

6 A f 2021 602

7 B m 2021 940

8 B f 2021 45

The spread function, on the other hand, converts rows to columns. We will split our data from table g by gender:

s = spread(g,gender,count)

1	s = spread(g,gender,count)

s
# A tibble: 4 x 4
  region year      f     m
   chr    chr   int   int 
1 A      2020    944   802
2 A      2021    602   668
3 B      2020    543   503
4 B      2021     45   940

# A tibble: 4 x 4

region year f m

chr chr int int

1 A 2020 944 802

2 A 2021 602 668

3 B 2020 543 503

4 B 2021 45 940

TidyR has other useful functions to clean up our collection of missing NA data. Let’s add the missing data to our table dt:

dt[3,3] = NA

1	dt[3,3] = NA

dt
# A tibble: 4 x 4
  region gender `2020` `2021`
   chr    chr     int    int 
1 A      m         802    668
2 A      f         944    602
3 B      m          NA    940
4 B      f         543     45

# A tibble: 4 x 4

region gender `2020` `2021`

chr chr int int

1 A m 802 668

2 A f 944 602

3 B m NA 940

4 B f 543 45

The drop_na function removes rows with missing data from our data frame:

drop_na(dt)

1	drop_na(dt)

The row with missing data is removed:

# A tibble: 3 x 4
  region gender `2020` `2021`
   chr    chr     int    int 
1 A      m         802    668
2 A      f         944    602
3 B      f         543     45

# A tibble: 3 x 4

region gender `2020` `2021`

chr chr int int

1 A m 802 668

2 A f 944 602

3 B f 543 45

The fill function fills the missing data with values from the top or bottom of the column:

fill(data = dt,'2020',.direction = "up")

1	fill(data = dt,'2020',.direction = "up")

The missing value was filled in from the bottom row:

# A tibble: 4 x 4
  region gender `2020` `2021`
   chr    chr     int    int 
1 A      m         802    668
2 A      f         944    602
3 B      m         543    940
4 B      f         543     45

# A tibble: 4 x 4

region gender `2020` `2021`

chr chr int int

1 A m 802 668

2 A f 944 602

3 B m 543 940

4 B f 543 45

Replace_na is used to replace the missing value with a user-defined value:

replace_na(dt,list('2020'=0))

1	replace_na(dt,list('2020'=0))

# A tibble: 4 x 4
  region gender `2020` `2021`
   chr    chr     dbl    int 
1 A      m         802    668
2 A      f         944    602
3 B      m           0    940
4 B      f         543     45

# A tibble: 4 x 4

region gender `2020` `2021`

chr chr dbl int

1 A m 802 668

2 A f 944 602

3 B m 0 940

4 B f 543 45

What if our data was stored in a column as a delimited string of numbers:

# A tibble: 4 x 3
  region gender count  
   chr    chr    chr   
1 A      m      24;34  
2 A      f      719;508
3 B      m      819;981
4 B      f      47;631

# A tibble: 4 x 3

region gender count

chr chr chr

1 A m 24;34

2 A f 719;508

3 B m 819;981

4 B f 47;631

We use the seperate_rows function:

separate_rows(dt,count,sep=';')

1	separate_rows(dt,count,sep=';')

To separate our columns into individual rows:

# A tibble: 8 x 3
  region gender count
   chr    chr    chr 
1 A      m      24   
2 A      m      34   
3 A      f      719  
4 A      f      508  
5 B      m      819  
6 B      m      981  
7 B      f      47   
8 B      f      631

# A tibble: 8 x 3

region gender count

chr chr chr

1 A m 24

2 A m 34

3 A f 719

4 A f 508

5 B m 819

6 B m 981

7 B f 47

8 B f 631

Or the separate function:

separate(dt,count,sep=';',into = c('2020','2021'))

1	separate(dt,count,sep=';',into = c('2020','2021'))

To separate the data into two columns:

# A tibble: 4 x 4
  region gender `2020` `2021`
   chr    chr    chr    chr  
1 A      m      24     34    
2 A      f      719    508   
3 B      m      819    981   
4 B      f      47     631

# A tibble: 4 x 4

region gender `2020` `2021`

chr chr chr chr

1 A m 24 34

2 A f 719 508

3 B m 819 981

4 B f 47 631

To return to writing with a separator, use the unite function:

unite(dt,'2020','2021',col = "count",sep = "-")

1	unite(dt,'2020','2021',col = "count",sep = "-")

# A tibble: 4 x 3
  region gender count  
   chr    chr    chr   
1 A      m      24-34  
2 A      f      719-508
3 B      m      819-981
4 B      f      47-631

# A tibble: 4 x 3

region gender count

chr chr chr

1 A m 24-34

2 A f 719-508

3 B m 819-981

4 B f 47-631

The tools in the tidyR library are very useful for organising our data so that we can run analyses on it or create plots from it.