R Basics – Data analysis Part 2 – Data cleaning and preparation

The first part of this series showed how to load files into the R environment. Now we need to check the data and remove any errors that might affect the outcome of our analysis. For this exercise, we have prepared a modified file with district populations into which we deliberately inserted some errors (download).

Let’s start by loading the data file:
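As a minimal sketch of this step, the snippet below reads a CSV file into a data frame with read.csv. The file contents, the temporary path, and the column names id, name and population are assumptions made for illustration; the real file from the download link may be laid out differently.

```r
# Hypothetical stand-in for the downloaded file: a tiny CSV written to a temp path.
path <- tempfile(fileext = ".csv")
writeLines(c("id,name,population",
             "201,District A,90000",
             "202,District B,55000",
             "203,District C,104000"), path)

# Read the file into a data frame; the header row supplies the column names.
data <- read.csv(path, stringsAsFactors = FALSE)
```

With the real file, only the path passed to read.csv would change, possibly along with arguments such as sep if the file does not use commas.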

Let’s look at the data stored in the data table. We display the first rows from the table:

We can display the rows one by one:

We can also display individual columns, in two ways:

The methods of browsing are presented in more detail in the data storage section of the course.
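The browsing steps described above can be sketched on a small stand-in data frame. The column names and values are invented for illustration, and the two column-access forms shown are common choices, not necessarily the ones used in the original article.

```r
# Toy data frame with assumed column names (id, name, population).
data <- data.frame(
  id = c(201, 202, 203),
  name = c("District A", "District B", "District C"),
  population = c(90000, 55000, 104000),
  stringsAsFactors = FALSE
)

head(data)            # first rows of the table (up to six by default)

data[1, ]             # a single row, selected by its position
data[2, ]

data$population       # a column, selected with the $ operator
data[, "population"]  # the same column, selected by name inside brackets
```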

Sometimes we don’t know in what form data is stored in a variable. We can check this with the class function, which returns the type of the data container:

Our variable data is a data frame, which can hold a different type of value in each column.
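A short sketch of the class check, using an invented two-row data frame with assumed column names:

```r
# Toy data frame; names and values are made up for illustration.
data <- data.frame(
  id = c(201, 202),
  name = c("District A", "District B"),
  population = c(90000, 55000),
  stringsAsFactors = FALSE
)

class(data)             # "data.frame"
class(data$name)        # "character"
class(data$population)  # "numeric"
```

Calling class on a single column shows that each column keeps its own type inside the data frame.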

Let’s go back to checking the data we loaded. We know that it contains the population of each district, and that there are 380 districts in Poland. With this information, we can check whether our data has the expected number of rows. Let’s write an expression that checks whether the number of rows in the table equals 380 and run it:

The expression returns FALSE, which means that we do not have 380 rows. We have 381 rows:
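The row-count check can be sketched with nrow. To keep the example small, the toy table below expects 3 rows instead of 380; the variable expected and all values are assumptions for illustration.

```r
# Toy data with one surplus row; the article's real data expects 380 rows.
expected <- 3
data <- data.frame(
  id = c(201, 202, 203, 202),
  population = c(90000, 55000, 104000, 55000)
)

nrow(data) == expected  # FALSE: the row count does not match
nrow(data)              # 4: one row more than expected
```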

One row too many suggests that a row is duplicated. Let’s check this by applying the duplicated function to the identifier column:

The function returns a vector of TRUE/FALSE values, one for each element of the column. An element gets the value TRUE if the same value has already appeared earlier in the column. Repeated values are easy to spot in a small set, but what if the data set is very large? We can check whether any element is TRUE by using the any function:

We already know that the data contains repetitions, but where? For this, we use the which function, which returns the positions of the TRUE elements:

Row 135 is a repeat of another row in the table. Let’s display it:
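The three checks above (duplicated, any, which) can be sketched together on a toy table in which the fourth row repeats the second. Column names and values are invented for illustration.

```r
# Toy data: row 4 repeats the identifier of row 2.
data <- data.frame(
  id = c(201, 202, 203, 202),
  name = c("District A", "District B", "District C", "District B"),
  population = c(90000, 55000, 104000, 55000),
  stringsAsFactors = FALSE
)

dup <- duplicated(data$id)  # FALSE FALSE FALSE TRUE
any(dup)                    # TRUE: at least one identifier is repeated
which(dup)                  # 4: position of the repeated element

data[which(dup), ]          # display the repeated row
```

Note that duplicated marks only the second and later occurrences, so the first copy of the row is not flagged.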

The ID of this district is 616. Let’s store the numbers of the rows that have this ID in the sel variable:

And let’s display them:

The Ryki district (powiat rycki) occurs twice in our table. We now delete the duplicate in much the same way as we displayed it, only with a minus sign in front of the row index, and store the result back in the data variable:

We’ve removed the repeated row. Let’s check again whether any duplicates remain:
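Selecting the rows by ID, removing the duplicate and re-checking might look like the sketch below. The toy identifiers are invented, and dropping only the second occurrence with sel[2] is one possible choice, not necessarily the exact expression from the original article.

```r
# Toy data: identifier 202 appears in rows 2 and 4.
data <- data.frame(
  id = c(201, 202, 203, 202),
  name = c("District A", "District B", "District C", "District B"),
  stringsAsFactors = FALSE
)

sel <- which(data$id == 202)  # 2 4: both rows with this identifier
data[sel, ]                   # display them

# A negative index drops rows; only the second occurrence is removed here,
# because data[-sel, ] would delete both copies.
data <- data[-sel[2], ]

any(duplicated(data$id))      # FALSE: no repeated identifiers remain
```

When many rows are duplicated, data[!duplicated(data$id), ] keeps the first occurrence of every identifier in a single step.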

In a numeric column like population, we know that there should be no negative values and no zeros. We can check this by creating an expression:

The expression returns TRUE, which means that the data contains a value less than or equal to zero. We store the position of the faulty row in the variable sel:

Let’s display the row:

Using the position of the row stored in sel, we correct the incorrect value in the population column:

We display the row again and verify that the change is correct:
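The whole check-locate-fix cycle for non-positive values can be sketched as follows. The toy values, including the corrected population, are invented; in real data the correct value has to come from a trusted source.

```r
# Toy data: the second population value is wrongly negative.
data <- data.frame(
  id = c(201, 202, 203),
  population = c(90000, -55000, 104000)
)

any(data$population <= 0)           # TRUE: there is a non-positive value

sel <- which(data$population <= 0)  # 2: position of the faulty row
data[sel, ]                         # display the row

data$population[sel] <- 55000       # overwrite with the (assumed) correct value
data[sel, ]                         # display again to verify the change
```

If the column also contains NA, any(data$population <= 0) returns NA rather than FALSE once no non-positive values remain; passing na.rm = TRUE to any avoids this.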

We have corrected the error. Table cells may also contain no data at all. Missing data is recorded as NA, and we can search for it using the is.na function:

There is such an element in the population column. Let’s find out which row it is:

We display it:

We correct it:

We check the change:
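The missing-value steps above can be sketched in the same way. The toy table and the replacement value are assumptions for illustration, and colSums(is.na(data)) is just one way to see which column holds the NA.

```r
# Toy data: the second population value is missing.
data <- data.frame(
  id = c(201, 202, 203),
  population = c(90000, NA, 104000)
)

colSums(is.na(data))                 # number of NA values in each column

sel <- which(is.na(data$population)) # 2: position of the missing value
data[sel, ]                          # display the row

data$population[sel] <- 55000        # fill in the (assumed) correct value
data[sel, ]                          # display again to check the change

any(is.na(data))                     # FALSE: no missing values remain
```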
