TidyR – organize your data
Today we will introduce you to the basic funtions of the TidyR library, which is used to change the shape and hierarchy of a dataset. We’ll start, as always, by loading the library:
1 |
library(tidyr) |
For our example, we will use data that we generate ourselves. For example, let’s take the number of men and women in regions A and B in the two years 2020 and 2021. We will use the tibble function to create a table with this data:
1 2 3 4 |
dt = tibble(region = c("A","A","B","B"), gender = c("m","f","m","f"), "2020" = sample(1:1000,4), "2021" = sample(1:1000,4)) |
Our data looks like this:
1 2 3 4 5 6 7 8 |
dt # A tibble: 4 x 4 region gender `2020` `2021` chr chr int int 1 A m 802 668 2 A f 944 602 3 B m 503 940 4 B f 543 45 |
The first function we will learn is gather, which converts columns to rows:
1 |
g = gather(data = dt, '2020', '2021', key = 'year', value = 'count') |
This function converts our two columns, 2020 and 2021, into a single column called count, and also adds a column called year that contains the names of the original columns:
1 2 3 4 5 6 7 8 9 10 11 12 |
g # A tibble: 8 x 4 region gender year count chr chr chr int 1 A m 2020 802 2 A f 2020 944 3 B m 2020 503 4 B f 2020 543 5 A m 2021 668 6 A f 2021 602 7 B m 2021 940 8 B f 2021 45 |
The spread function, on the other hand, converts rows to columns. We will split our data from table g by gender:
1 |
s = spread(g,gender,count) |
1 2 3 4 5 6 7 8 |
s # A tibble: 4 x 4 region year f m chr chr int int 1 A 2020 944 802 2 A 2021 602 668 3 B 2020 543 503 4 B 2021 45 940 |
TidyR has other useful functions to clean up our collection of missing NA data. Let’s add the missing data to our table dt:
1 |
dt[3,3] = NA |
1 2 3 4 5 6 7 8 |
dt # A tibble: 4 x 4 region gender `2020` `2021` chr chr int int 1 A m 802 668 2 A f 944 602 3 B m NA 940 4 B f 543 45 |
The drop_na function removes rows with missing data from our data frame:
1 |
drop_na(dt) |
The row with missing data is removed:
1 2 3 4 5 6 |
# A tibble: 3 x 4 region gender `2020` `2021` chr chr int int 1 A m 802 668 2 A f 944 602 3 B f 543 45 |
The fill function fills the missing data with values from the top or bottom of the column:
1 |
fill(data = dt,'2020',.direction = "up") |
The missing value was filled in from the bottom row:
1 2 3 4 5 6 7 |
# A tibble: 4 x 4 region gender `2020` `2021` chr chr int int 1 A m 802 668 2 A f 944 602 3 B m 543 940 4 B f 543 45 |
Replace_na is used to replace the missing value with a user-defined value:
1 |
replace_na(dt,list('2020'=0)) |
1 2 3 4 5 6 7 |
# A tibble: 4 x 4 region gender `2020` `2021` chr chr dbl int 1 A m 802 668 2 A f 944 602 3 B m 0 940 4 B f 543 45 |
What if our data was stored in a column as a delimited string of numbers:
1 2 3 4 5 6 7 |
# A tibble: 4 x 3 region gender count chr chr chr 1 A m 24;34 2 A f 719;508 3 B m 819;981 4 B f 47;631 |
We use the seperate_rows function:
1 |
separate_rows(dt,count,sep=';') |
To separate our columns into individual rows:
1 2 3 4 5 6 7 8 9 10 11 |
# A tibble: 8 x 3 region gender count chr chr chr 1 A m 24 2 A m 34 3 A f 719 4 A f 508 5 B m 819 6 B m 981 7 B f 47 8 B f 631 |
Or the separate function:
1 |
separate(dt,count,sep=';',into = c('2020','2021')) |
To separate the data into two columns:
1 2 3 4 5 6 7 |
# A tibble: 4 x 4 region gender `2020` `2021` chr chr chr chr 1 A m 24 34 2 A f 719 508 3 B m 819 981 4 B f 47 631 |
To return to writing with a separator, use the unite function:
1 |
unite(dt,'2020','2021',col = "count",sep = "-") |
1 2 3 4 5 6 7 |
# A tibble: 4 x 3 region gender count chr chr chr 1 A m 24-34 2 A f 719-508 3 B m 819-981 4 B f 47-631 |
The tools in the tidyR library are very useful for organising our data so that we can run analyses on it or create plots from it.