R Basics – Data analysis Part 1 – Loading data

We already know how to store data in R and how to process it with conditional statements, loops, and functions. R, however, is designed mainly for analyzing large data sets, and over the next few posts we will introduce its capabilities in this area. We will start with loading data, then move on to computing basic descriptive statistics and plotting the results. To present the tools, we will use a table containing a population overview for Polish counties (download).

The basic tool for loading data is read.table, which loads records stored in tabular form. Before loading a particular file, we need to check its structure and determine:

  • whether it contains a header (the header parameter),
  • how the columns are separated (the sep parameter) – if none of the values in the table contain spaces, we do not need to specify it.

We will show how to use the parameters of this function with an example. Our file contains counties with two-part names, e.g. Jelenia Gora. Let’s first try to load the file without any additional arguments:
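The original listing is not available here, so the following is only a minimal sketch of that first attempt. The temporary file, its column names, and all numbers are invented stand-ins for the real county table; only the two-part name Jelenia Gora comes from the text. The two-part name is placed after the fifth line on purpose, because read.table works out the number of columns from the first few lines of the file.

```r
# Hypothetical stand-in for the county table: invented names and numbers.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "county\tpopulation\tarea",
  "Alfa\t1000\t10",
  "Beta\t2000\t20",
  "Gamma\t3000\t30",
  "Delta\t4000\t40",
  "Epsilon\t5000\t50",
  "Jelenia Gora\t6000\t60"   # two-part name: contains a space
), path)

# With the defaults, any whitespace separates fields, so the last line
# yields one field more than the others and the call stops with an error.
result <- tryCatch(read.table(path), error = function(e) e)

if (inherits(result, "error")) {
  message("read.table failed: ", conditionMessage(result))
}
```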

As a result, we get an error message telling us that line 28 of the file did not have 7 elements, the number found in the lines read earlier. This line contains a two-part county name, which the function automatically splits into two elements:
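One way to see this split for yourself is count.fields, which reports how many fields each line contains for a given separator. This is a sketch on an invented sample file (names and numbers are made up, apart from Jelenia Gora), not the article's original output.

```r
# Hypothetical stand-in for the county table: invented names and numbers.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "county\tpopulation\tarea",
  "Alfa\t1000\t10",
  "Beta\t2000\t20",
  "Gamma\t3000\t30",
  "Delta\t4000\t40",
  "Epsilon\t5000\t50",
  "Jelenia Gora\t6000\t60"
), path)

# Default separator (sep = ""): tabs AND spaces both split fields.
fields_whitespace <- count.fields(path)
# Tab only: the space inside "Jelenia Gora" no longer splits the name.
fields_tab <- count.fields(path, sep = "\t")

print(fields_whitespace)
print(fields_tab)

# Lines whose field count differs from the header line
odd_lines <- which(fields_whitespace != fields_whitespace[1])
print(odd_lines)
```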

We can fix this error by setting the two parameters mentioned earlier. Let’s read the file again, this time declaring that it has a header and that the columns are separated by tabs (“\t”):
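A minimal sketch of the corrected call, again on an invented sample file. The variable name counties and the column names are assumptions made for this illustration; header and sep are the read.table parameters discussed above.

```r
# Hypothetical stand-in for the county table: invented names and numbers.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "county\tpopulation\tarea",
  "Alfa\t1000\t10",
  "Beta\t2000\t20",
  "Gamma\t3000\t30",
  "Delta\t4000\t40",
  "Epsilon\t5000\t50",
  "Jelenia Gora\t6000\t60"
), path)

# header = TRUE: the first line holds the column names.
# sep = "\t":    only tabs separate columns, so spaces stay inside values.
counties <- read.table(path, header = TRUE, sep = "\t")

print(dim(counties))   # number of rows and columns
print(names(counties)) # column names taken from the header line
```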

The data loaded without any problems. Let’s display the first lines with the head function:
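As a sketch, assuming the table was loaded into a data frame called counties (a name chosen for this example, built here from invented sample data), head can be called like this:

```r
# Hypothetical stand-in for the county table: invented names and numbers.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "county\tpopulation\tarea",
  "Alfa\t1000\t10",
  "Beta\t2000\t20",
  "Gamma\t3000\t30",
  "Delta\t4000\t40",
  "Epsilon\t5000\t50",
  "Jelenia Gora\t6000\t60"
), path)
counties <- read.table(path, header = TRUE, sep = "\t")

# head() shows the first six rows by default ...
print(head(counties))

# ... and its second argument changes how many rows are shown.
first_rows <- head(counties, 3)
print(first_rows)
```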

The read.table function has many other parameters, including ones that limit the range of data loaded into a variable. Two simpler functions are derived from it: read.csv and read.delim. A detailed description of the parameters of these functions can be found in the help.
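To make this concrete, here is a short sketch on invented sample data. It uses nrows as one example of a parameter that limits the range of loaded data, and read.delim, whose defaults already assume a header and tab-separated columns. The help pages can be opened with ?read.table.

```r
# Hypothetical stand-in for the county table: invented names and numbers.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "county\tpopulation\tarea",
  "Alfa\t1000\t10",
  "Beta\t2000\t20",
  "Gamma\t3000\t30",
  "Jelenia Gora\t6000\t60"
), path)

# Explicit arguments with read.table ...
via_table <- read.table(path, header = TRUE, sep = "\t")
# ... and the same file through read.delim, relying on its defaults.
via_delim <- read.delim(path)

# nrows limits how many data rows are read (the header is not counted).
first_two <- read.table(path, header = TRUE, sep = "\t", nrows = 2)

print(all.equal(via_table, via_delim))
print(first_two)
```

read.csv works the same way but assumes comma-separated columns.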

If our data is not in tabular form and we want to load it into R, we can do so with the readLines function. This tool reads individual lines of a file as strings, which allows us to edit each line depending on what data it contains. What can this be used for? For example, to edit configuration files where we want to change a particular piece of text.

Let’s load our file with the county populations using the function mentioned above:
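A minimal sketch of this step, using an invented sample file in place of the real one. The variable name data is an assumption based on the surrounding text.

```r
# Hypothetical stand-in for the county table: invented names and numbers.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "county\tpopulation\tarea",
  "Alfa\t1000\t10",
  "Beta\t2000\t20",
  "Jelenia Gora\t6000\t60"
), path)

# readLines returns a character vector with one element per line of the file.
data <- readLines(path)

print(class(data))
print(length(data))
```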

Let’s display the second element of the data variable:
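As a sketch on invented sample data (the variable name data is assumed), indexing with [2] picks out the second line:

```r
# Hypothetical stand-in for the county table: invented names and numbers.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "county\tpopulation\tarea",
  "Alfa\t1000\t10",
  "Beta\t2000\t20"
), path)
data <- readLines(path)

# The second element is the second line of the file, kept as one string;
# the tabs are still part of that string.
second_line <- data[2]
print(second_line)
```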

We see that it displays the entire contents of line 2 from the text file as a string, which we can modify freely.
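One possible way to modify such a line, sketched on invented sample data: sub replaces a piece of text and writeLines saves the result. The replacement text Omega and the output file are made up for this illustration.

```r
# Hypothetical stand-in for the county table: invented names and numbers.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "county\tpopulation\tarea",
  "Alfa\t1000\t10",
  "Beta\t2000\t20"
), path)
data <- readLines(path)

# Replace a fixed piece of text in the second line only.
data[2] <- sub("Alfa", "Omega", data[2], fixed = TRUE)

# Write the edited lines to a new file, leaving the original untouched.
out_path <- tempfile(fileext = ".txt")
writeLines(data, out_path)

print(readLines(out_path))
```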

The above functions are available by default in the environment. R also lets us load other file types through packages that extend its functionality. For example, we used such an extension when loading a shapefile, but we can also install a package for loading the popular Excel spreadsheet format. We first need to install and load the xlsx package. An Excel file is then loaded with the read.xlsx function from this package. Those who use Excel know that data can be stored in different worksheets. The read.xlsx function lets us choose the worksheet to load with the sheetIndex parameter, where we give its index:
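The sketch below assumes the xlsx package (which needs a working Java installation) and skips the Excel part when the package is not available. The workbook is created on the fly with invented data so that the example is self-contained; sheet names and values are made up.

```r
# One-time installation, if needed:
# install.packages("xlsx")

second_sheet <- NULL

if (requireNamespace("xlsx", quietly = TRUE)) {
  xls_path <- tempfile(fileext = ".xlsx")

  # Build a hypothetical two-sheet workbook to read from.
  xlsx::write.xlsx(
    data.frame(county = c("Alfa", "Beta"), population = c(1000, 2000)),
    xls_path, sheetName = "first", row.names = FALSE
  )
  xlsx::write.xlsx(
    data.frame(county = "Jelenia Gora", population = 6000),
    xls_path, sheetName = "second", row.names = FALSE, append = TRUE
  )

  # sheetIndex selects the worksheet by position: 2 is the second sheet.
  second_sheet <- xlsx::read.xlsx(xls_path, sheetIndex = 2)
  print(second_sheet)
} else {
  message("Package 'xlsx' is not installed; skipping the Excel example.")
}
```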

The loaded data should be cleaned and processed before we start analyzing it, in order to remove errors that could affect the result of our analysis. We will write about this phase of data preparation in the next post.
