TidyR – organize your data
Today we will introduce you to the basic funtions of the TidyR library, which is used to change the shape and hierarchy of a dataset. We’ll start, as always, by loading the library:
library(tidyr)
For our example, we will use data that we generate ourselves. For example, let’s take the number of men and women in regions A and B in the two years 2020 and 2021. We will use the tibble function to create a table with this data:
dt = tibble(region = c("A","A","B","B"),
gender = c("m","f","m","f"),
"2020" = sample(1:1000,4),
"2021" = sample(1:1000,4))
Our data looks like this:
dt
# A tibble: 4 x 4
region gender `2020` `2021`
chr chr int int
1 A m 802 668
2 A f 944 602
3 B m 503 940
4 B f 543 45
The first function we will learn is gather, which converts columns to rows:
g = gather(data = dt, '2020', '2021', key = 'year', value = 'count')
This function converts our two columns, 2020 and 2021, into a single column called count, and also adds a column called year that contains the names of the original columns:
g
# A tibble: 8 x 4
region gender year count
chr chr chr int
1 A m 2020 802
2 A f 2020 944
3 B m 2020 503
4 B f 2020 543
5 A m 2021 668
6 A f 2021 602
7 B m 2021 940
8 B f 2021 45
The spread function, on the other hand, converts rows to columns. We will split our data from table g by gender:
s = spread(g,gender,count)
s
# A tibble: 4 x 4
region year f m
chr chr int int
1 A 2020 944 802
2 A 2021 602 668
3 B 2020 543 503
4 B 2021 45 940
TidyR has other useful functions to clean up our collection of missing NA data. Let’s add the missing data to our table dt:
dt[3,3] = NA
dt
# A tibble: 4 x 4
region gender `2020` `2021`
chr chr int int
1 A m 802 668
2 A f 944 602
3 B m NA 940
4 B f 543 45
The drop_na function removes rows with missing data from our data frame:
drop_na(dt)
The row with missing data is removed:
# A tibble: 3 x 4
region gender `2020` `2021`
chr chr int int
1 A m 802 668
2 A f 944 602
3 B f 543 45
The fill function fills the missing data with values from the top or bottom of the column:
fill(data = dt,'2020',.direction = "up")
The missing value was filled in from the bottom row:
# A tibble: 4 x 4
region gender `2020` `2021`
chr chr int int
1 A m 802 668
2 A f 944 602
3 B m 543 940
4 B f 543 45
Replace_na is used to replace the missing value with a user-defined value:
replace_na(dt,list('2020'=0))
# A tibble: 4 x 4
region gender `2020` `2021`
chr chr dbl int
1 A m 802 668
2 A f 944 602
3 B m 0 940
4 B f 543 45
What if our data was stored in a column as a delimited string of numbers:
# A tibble: 4 x 3
region gender count
chr chr chr
1 A m 24;34
2 A f 719;508
3 B m 819;981
4 B f 47;631
We use the seperate_rows function:
separate_rows(dt,count,sep=';')
To separate our columns into individual rows:
# A tibble: 8 x 3
region gender count
chr chr chr
1 A m 24
2 A m 34
3 A f 719
4 A f 508
5 B m 819
6 B m 981
7 B f 47
8 B f 631
Or the separate function:
separate(dt,count,sep=';',into = c('2020','2021'))
To separate the data into two columns:
# A tibble: 4 x 4
region gender `2020` `2021`
chr chr chr chr
1 A m 24 34
2 A f 719 508
3 B m 819 981
4 B f 47 631
To return to writing with a separator, use the unite function:
unite(dt,'2020','2021',col = "count",sep = "-")
# A tibble: 4 x 3
region gender count
chr chr chr
1 A m 24-34
2 A f 719-508
3 B m 819-981
4 B f 47-631
The tools in the tidyR library are very useful for organising our data so that we can run analyses on it or create plots from it.