dplyr: manipulating data frames

In this post we introduce the dplyr library, which makes it easy to manipulate data frames. For our example, we will use the list of geodetic surface areas, which you can download as open data here.

Let’s load the library:
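
Assuming the package is already installed (for example with install.packages("dplyr")), a single call is enough:

```r
# Attach dplyr so that filter(), mutate(), select(), the joins
# and the %>% pipe are available in the session.
library(dplyr)
```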

Let’s load the data:
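
The exact file name, separator and column names depend on the download, so the following is only a sketch. It writes a tiny made-up CSV to a temporary file and reads it back; the column names TERYT, Nazwa_jednostki and Powierzchnia and the variable name areas are assumptions, not the names used in the real file.

```r
# Write a tiny made-up CSV so the example runs without the real download.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "TERYT,Nazwa_jednostki,Powierzchnia",
  "02,Province A,300",
  "02010,County A1,120",
  "02020,County A2,180"
), path)

# Read the TERYT codes as text; read as numbers they would lose leading zeros.
areas <- read.csv(path, colClasses = c(TERYT = "character"))
str(areas)
```

Keeping the codes as character strings matters later, because the steps below count and cut characters in them.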

First we will use the filter function, which lets us select the rows that meet a given condition. In our case, these are the counties, whose TERYT code has 5 characters:
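
A minimal sketch of such a filter, run on a small made-up data frame. The name areas, the column names and the codes themselves are hypothetical and only mimic the shape described above.

```r
library(dplyr)

# Made-up rows: two provinces (2-character codes) and three counties (5-character codes).
areas <- data.frame(
  TERYT           = c("02", "04", "02010", "02020", "04010"),
  Nazwa_jednostki = c("Province A", "Province B", "County A1", "County A2", "County B1"),
  Powierzchnia    = c(300, 250, 120, 180, 250),
  stringsAsFactors = FALSE
)

# Keep only the rows whose TERYT code is exactly 5 characters long.
counties <- areas %>%
  filter(nchar(TERYT) == 5)

print(counties)
```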

As a result we have:

Let’s add the mutate function to our previous line of code; it lets us compute a new column. Using the substr function, we will derive the TERYT of the province (TERYT_woj) from the TERYT of the county:
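
As a sketch, assuming the province code is the first two characters of the county code (the data frame and its column names are again made up):

```r
library(dplyr)

# Made-up rows; column names are assumptions.
areas <- data.frame(
  TERYT           = c("02", "04", "02010", "02020", "04010"),
  Nazwa_jednostki = c("Province A", "Province B", "County A1", "County A2", "County B1"),
  Powierzchnia    = c(300, 250, 120, 180, 250),
  stringsAsFactors = FALSE
)

counties <- areas %>%
  filter(nchar(TERYT) == 5) %>%
  # New column: characters 1 to 2 of the county code.
  mutate(TERYT_woj = substr(TERYT, 1, 2))

print(counties)
```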

As a result we have:

We use the select function to select the columns we are interested in:
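
One way this could look, assuming the goal is a small lookup table of province codes and names (an assumption based on how the result is used in the join that follows; all names are hypothetical):

```r
library(dplyr)

# Made-up rows; column names are assumptions.
areas <- data.frame(
  TERYT           = c("02", "04", "02010", "02020", "04010"),
  Nazwa_jednostki = c("Province A", "Province B", "County A1", "County A2", "County B1"),
  Powierzchnia    = c(300, 250, 120, 180, 250),
  stringsAsFactors = FALSE
)

provinces <- areas %>%
  filter(nchar(TERYT) == 2) %>%     # province rows only
  select(TERYT, Nazwa_jednostki)    # keep just the code and the name

print(provinces)
```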

And we have:

We will use the result of the earlier select to add the province names to our counties, using left_join from the family of join functions:
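
A sketch of the join on made-up data. The key on the county side is TERYT_woj and on the province side TERYT; because both tables have a column called Nazwa_jednostki (a hypothetical name), dplyr appends the suffixes .x and .y to tell them apart.

```r
library(dplyr)

# Made-up county rows with the province code already derived.
counties <- data.frame(
  TERYT           = c("02010", "02020", "04010"),
  Nazwa_jednostki = c("County A1", "County A2", "County B1"),
  Powierzchnia    = c(120, 180, 250),
  TERYT_woj       = c("02", "02", "04"),
  stringsAsFactors = FALSE
)

# Made-up province lookup table.
provinces <- data.frame(
  TERYT           = c("02", "04"),
  Nazwa_jednostki = c("Province A", "Province B"),
  stringsAsFactors = FALSE
)

# Every county row is kept; matching province columns are added to it.
joined <- counties %>%
  left_join(provinces, by = c("TERYT_woj" = "TERYT"))

print(joined)
```

With left_join, a county without a matching province would still appear in the result, with NA in the province columns.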

The finished table looks like this:

Finally, at the end of our now lengthy chain of calls, we keep only the columns of interest in the resulting table (TERYT, unit name, area, and province name) and give them new names (TERYT, Nazwa, Pow, Woj). The result is stored in the variable dt:
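
Put together, the whole chain might look like the sketch below. The input data and its column names are made up; only the output names TERYT, Nazwa, Pow and Woj come from the text. Note that select can rename columns while selecting them.

```r
library(dplyr)

# Made-up rows; column names are assumptions.
areas <- data.frame(
  TERYT           = c("02", "04", "02010", "02020", "04010"),
  Nazwa_jednostki = c("Province A", "Province B", "County A1", "County A2", "County B1"),
  Powierzchnia    = c(300, 250, 120, 180, 250),
  stringsAsFactors = FALSE
)

# Lookup table of provinces.
provinces <- areas %>%
  filter(nchar(TERYT) == 2) %>%
  select(TERYT, Nazwa_jednostki)

dt <- areas %>%
  filter(nchar(TERYT) == 5) %>%                          # counties only
  mutate(TERYT_woj = substr(TERYT, 1, 2)) %>%            # province code
  left_join(provinces, by = c("TERYT_woj" = "TERYT")) %>%
  select(TERYT,                                          # select and rename in one step
         Nazwa = Nazwa_jednostki.x,
         Pow   = Powierzchnia,
         Woj   = Nazwa_jednostki.y)

print(dt)
```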

The variable dt should contain this data:

For the resulting table we will calculate the minimum, mean and maximum county area using the summarize function:
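
A sketch on a made-up version of dt; the values and the output column names Pow_min, Pow_mean and Pow_max are chosen for illustration only.

```r
library(dplyr)

# Made-up stand-in for the dt table built earlier.
dt <- data.frame(
  TERYT = c("02010", "02020", "02030", "04010", "04020"),
  Nazwa = c("County A1", "County A2", "County A3", "County B1", "County B2"),
  Pow   = c(120, 180, 90, 250, 200),
  Woj   = c("Province A", "Province A", "Province A", "Province B", "Province B"),
  stringsAsFactors = FALSE
)

# Collapse the whole table into a single row of summary statistics.
overall <- dt %>%
  summarise(
    Pow_min  = min(Pow),
    Pow_mean = mean(Pow),
    Pow_max  = max(Pow)
  )

print(overall)
```

dplyr accepts both spellings, summarize and summarise, for the same function.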

If we want to calculate these values for each province, we need to add the group_by function to our line of code:
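
A sketch with the grouping step placed before the summary (made-up data, showing only the two columns used here; the output column names are hypothetical):

```r
library(dplyr)

# Made-up stand-in for dt, reduced to the columns needed for this step.
dt <- data.frame(
  Pow = c(120, 180, 90, 250, 200),
  Woj = c("Province A", "Province A", "Province A", "Province B", "Province B"),
  stringsAsFactors = FALSE
)

# One output row per province instead of one row for the whole table.
by_province <- dt %>%
  group_by(Woj) %>%
  summarise(
    Pow_min  = min(Pow),
    Pow_mean = mean(Pow),
    Pow_max  = max(Pow)
  )

print(by_province)
```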

Let’s add the number of counties:
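
Inside summarise, the helper n() returns the number of rows in the current group. A sketch on made-up data, where the output column name n is an assumption:

```r
library(dplyr)

# Made-up stand-in for dt, reduced to the columns needed for this step.
dt <- data.frame(
  Pow = c(120, 180, 90, 250, 200),
  Woj = c("Province A", "Province A", "Province A", "Province B", "Province B"),
  stringsAsFactors = FALSE
)

by_province <- dt %>%
  group_by(Woj) %>%
  summarise(
    Pow_min  = min(Pow),
    Pow_mean = mean(Pow),
    Pow_max  = max(Pow),
    n        = n()          # rows (counties) per province
  )

print(by_province)
```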

And finally, we sort our results by the number of counties with the arrange function:
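
A sketch of the sorting step on made-up data, assuming the count column is called n:

```r
library(dplyr)

# Made-up stand-in for dt, reduced to the columns needed for this step.
dt <- data.frame(
  Pow = c(120, 180, 90, 250, 200),
  Woj = c("Province A", "Province A", "Province A", "Province B", "Province B"),
  stringsAsFactors = FALSE
)

sorted <- dt %>%
  group_by(Woj) %>%
  summarise(Pow_mean = mean(Pow), n = n()) %>%
  arrange(n)                # ascending order by default

print(sorted)
```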

The rows are sorted from the smallest to the largest number of counties per province. To sort in the reverse order, add the following at the end of our code:
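
One common way to do this is to wrap the sorting column in desc(). The summary table below is made up, and n is the assumed name of the count column.

```r
library(dplyr)

# Made-up summary table: number of counties per province.
counts <- data.frame(
  Woj = c("Province A", "Province B", "Province C"),
  n   = c(3, 2, 5),
  stringsAsFactors = FALSE
)

# desc() reverses the order, so the largest counts come first.
sorted_desc <- counts %>%
  arrange(desc(n))

print(sorted_desc)
```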

We have shown you how the most commonly used functions from the dplyr library work:

  • select – select columns,
  • filter – select rows based on an expression,
  • mutate – calculate new data based on existing,
  • left_join – join two tables (there are many types of join),
  • summarise – create summaries from values,
  • group_by – group by the values in a column,
  • arrange – sort rows by values in columns.

The library contains many other functions that are worth exploring on your own.
