R

R Basics – Data analysis Part 2 – Data cleaning and preparation

The first part of the analysis in R showed how we load our files into the environment. Now we need to verify the data to remove any errors that might affect the outcome of our analysis. For this exercise, we have prepared a modified file with the population in districts, in which we have specifically inserted some errors (download).

Let’s start by loading the data file:

data = read.table("D:/population.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE)

Let’s look at the data stored in the data table. We display the first rows from the table:

head(data)
id district voivodeship area_ha area_km2 population population_km2
1 201 boleslawiecki dolnoslaskie 130351 1304 90306 69
2 202 dzierzoniowski dolnoslaskie 47851 479 104613 219
3 203 glogowski dolnoslaskie 44327 443 90302 204
4 204 gorowski dolnoslaskie 73827 738 36391 49
5 205 jaworski dolnoslaskie 58155 582 52070 90
6 206 jeleniogorski dolnoslaskie 62714 627 65171 104

We can scroll through the rows one by one:

data[100,]
id district voivodeship area_ha area_km2 population population_km2
100 1008 pabianicki lodzkie 49218 492 119672 243

As well as the individual columns in two ways:

data[,2]

 [1] „boleslawiecki”           „dzierzoniowski”          „glogowski”

…

data$powiat

 [1] „boleslawiecki”           „dzierzoniowski”          „glogowski”

…

The methods of browsing are presented in more detail in the data storage section of the course.

Sometimes we don’t know what form data is stored in a variable. We can check this using the class function that returns the type of the data container:

class(data)

[1] „data.frame”

Our variable data is a data frame that contains different types of values in columns.

Let’s go back to checking the data we loaded. We know that they contain the population of each district. There are 380 districts in Poland. With this information, we can check if the number of rows in our data is equal to this number. Let’s write an expression that checks if the number of rows in the table is equal to 380 and run it:

nrow(data) == 380
[1] FALSE

The expression returns us the value FALSE, which means that we do not have 380 rows. We have 381 rows:

nrow(data)
[1] 381

This means that some rows are duplicated. Let’s check this with the duplicated function in the identifier column:

duplicated(data$id)

 [1] FALSE FALSE FALSE FALSE …

The function returns a vector containing TRUE /FALSE values for each of the column elements. If one of the elements repeats a second time, it gets the value TRUE. It is easy to find repeated values in our set, but what if we have a very large data set? We can check if any value takes the value TRUE by using any function:

any(duplicated(data$id))
[1] TRUE

We already know that we have repetitions in the data, but where? For this, we use the which function that specifies the position of the element:

which(duplicated(data$id))
[1] 135

Row 135 is a repeat of another row in the table. Let’s display it:

data[135,]
     id district voivodeship area_ha area_km2 population population_km2
135 616    rycki   lubelskie   61453      615      58080             95

The ID of this district is 616. Let’s select the row numbers that have the ID into the sel variable:

sel = which(data$id == 616)

And let’s display them:

data[sel,]
id district voivodeship area_ha area_km2 population population_km2
69 616 rycki lubelskie 61453 615 58080 95
135 616 rycki lubelskie 61453 615 58080 95

Ricki district occurs twice in our table. Now we delete them in a similar way as we display them, only we add a minus and store it in the data variable:

data = data[-135,]

We’ve removed the repeating row. Let’s check again to see if we have repeats:

any(duplicated(data$id))
[1] FALSE

In numeric columns like population, we know that there should be no negative values and no values equal to zero. We can check this by creating an expression:

any(data$population ‹= 0)
[1] TRUE

We got a TRUE value from the expression that means there is a value less than or equal to zero in the data. We store the position of the wrong row in the variable sel:

sel = which(data$population == 0)

Let’s display the row:

data[sel,]
id district voivodeship area_ha area_km2 population population_km2
126 1211 nowotarski malopolskie 147499 1474 0 129

Using the position of the row stored in sel, we correct the incorrect value in the population column:

data$population[sel] = 189623

We display the row again and verify that the change is correct:

data[sel,]
id district voivodeship area_ha area_km2 population population_km2
126 1211 nowotarski malopolskie 147499 1474 189623 129

We have corrected the error. There may also be no data in the table cells. The missing data is recorded as NA. We can search for this type of placeholder using the is.na function:

any(is.na(data$population))
[1] TRUE

We have such an element in the Population column. Let’s find out which row it is:

sel = which(is.na(data$population))

We display it:

data[sel,]
id district voivodeship area_ha area_km2 population population_km2
249 2214 tczewski pomorskie 69711 697 NA 166

We improve:

data$population[sel] = 115962

We’ll check the change:

data[sel,]
id district voivodeship area_ha area_km2 population population_km2
249 2214 tczewski pomorskie 69711 697 115962 166

Leave a Reply

Your email address will not be published. Required fields are marked *