R Basics – Storing data

In response to the needs our readers expressed in the survey, we are starting a short introductory course in R. We begin with the ways of storing data in this language: vectors, lists, matrices, arrays, factors, and data frames. If you are not yet familiar with the environment we will be working in, see the entries on installing R and taking the first steps in it.

The first way to store data is as a vector. In a vector we can store data of only one type, for example numeric. As an example, let’s create a vector w that contains four numbers:

w = c(1,4,5,3)

Display the created vector w in the console:

w
[1] 1 4 5 3

If we want to display a specific element of the vector, we need to type:

w[3]
[1] 5

We can also display more elements:

w[3:4]
[1] 5 3
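
Indexing is not limited to a single position or a range. The short sketch below uses a hypothetical vector v (with the same values as w, so that w is not overwritten) to show a few other common forms: a vector of positions, a negative index, and a logical condition.

```r
# Hypothetical example vector with the same values as w
v = c(1, 4, 5, 3)

v[c(1, 4)]   # elements at positions 1 and 4: 1 3
v[-2]        # everything except the second element: 1 5 3
v[v > 3]     # only the elements greater than 3: 4 5
length(v)    # number of elements in the vector: 4
```

Note that positions in R are counted from 1, not from 0.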

Let us now create a vector containing consecutive numbers from 1 to 5:

w = 1:5
w
[1] 1 2 3 4 5

A vector can also contain strings, for example five names:

w = c('John','Sophia','Al','Jack','Donald')
w
[1] "John"   "Sophia" "Al"     "Jack"   "Donald"
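
Because a vector can hold only one type, it is worth seeing what happens when values of different types are combined. The sketch below (the names mixed and nums are only illustrative) shows that R silently converts all elements to a common type rather than raising an error.

```r
# Mixing a number, a string and a logical value: everything becomes a string
mixed = c(1, "a", TRUE)
mixed          # "1" "a" "TRUE"
class(mixed)   # "character"

# Mixing numbers and logical values: TRUE becomes 1
nums = c(1, TRUE)
nums           # 1 1
class(nums)    # "numeric"
```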

What if we want to store different types of variables in one object? We can use the list object, which allows us to store different types of data in its elements. Let’s create a list that contains any three values of different types:

l = list(23.2,'John',TRUE)

Display the created variable in the console:

l
[[1]]
[1] 23.2

[[2]]
[1] "John"

[[3]]
[1] TRUE

To display the second element of the list, we need to type:

l[[2]]
[1] "John"
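
List elements can also be given names, which is often easier to read than positions. The example below is a small sketch with a hypothetical list called person; it also shows the difference between single and double brackets, which is a common source of confusion.

```r
# Hypothetical list with named elements
person = list(value = 23.2, name = "John", active = TRUE)

person$name          # access by name: "John"
person[["value"]]    # the same idea with double brackets: 23.2

class(person[2])     # "list": single brackets return a shorter list
class(person[[2]])   # "character": double brackets return the element itself
```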

Two-dimensional data, such as grids in which every pixel has a value, are stored in a matrix. A matrix is created with the matrix function by specifying the value to fill it with and the number of rows and columns:

m = matrix(2, nrow = 3, ncol = 3)
m
     [,1] [,2] [,3]
[1,]    2    2    2
[2,]    2    2    2
[3,]    2    2    2

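We can also fill the matrix with a vector whose number of elements equals the number of rows or the total number of cells in the matrix. The matrix is filled column by column, so a vector as long as the number of rows is repeated in every column: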

m = matrix(c("Bob","John","Jack"),nrow = 3, ncol = 3)
m
     [,1]   [,2]   [,3]  
[1,] "Bob"  "Bob"  "Bob" 
[2,] "John" "John" "John"
[3,] "Jack" "Jack" "Jack"

Selecting any element in the matrix is done by specifying the row and column number:

m[1,2]
[1] "Bob"

An entire row or column can also be selected:

m[1,]
[1] "Bob" "Bob" "Bob"
m[,1]
[1] "Bob"  "John" "Jack"

A matrix, just like a vector, can only store data of one type.
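
The matrices above are filled column by column, which is the default behavior of matrix. As a short sketch (the names by_col and by_row are only illustrative), the byrow argument changes the fill order, and dim reports the size of the result.

```r
# Default: values are placed column by column
by_col = matrix(1:6, nrow = 2, ncol = 3)
# byrow = TRUE: values are placed row by row
by_row = matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

by_col[1, ]   # first row: 1 3 5
by_row[1, ]   # first row: 1 2 3
dim(by_row)   # number of rows and columns: 2 3
```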

In R, you can also use multidimensional matrices called arrays. They are created by specifying the values stored in them and the size of each dimension:

a = array(1:24, dim=c(3,4,2))
a
, , 1

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

, , 2

     [,1] [,2] [,3] [,4]
[1,]   13   16   19   22
[2,]   14   17   20   23
[3,]   15   18   21   24
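
Elements of an array are selected in the same way as in a matrix, with one index per dimension. The sketch below rebuilds the same array under the hypothetical name arr and picks single values and a whole layer.

```r
# Same array as above: 3 rows, 4 columns, 2 layers
arr = array(1:24, dim = c(3, 4, 2))

dim(arr)        # 3 4 2
arr[2, 3, 1]    # row 2, column 3, layer 1: 8
arr[2, 3, 2]    # row 2, column 3, layer 2: 20

# Leaving an index empty selects everything along that dimension
layer = arr[, , 2]
dim(layer)      # the second layer is an ordinary 3 x 4 matrix: 3 4
```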

Another type of data storage is the factor. We will show how it works with an example. Let’s create a vector of town names:

towns = c("Cracow","Warsaw","Gdansk","Warsaw","Cracow","Poznan","Bialystok","Szczecin")

Let’s change this vector into a factor and print it:

towns = factor(towns)
towns
[1] Cracow    Warsaw    Gdansk    Warsaw    Cracow    Poznan    Bialystok Szczecin 
Levels: Bialystok Cracow Gdansk Poznan Szczecin Warsaw

As we can see, the factor consists of a vector of values and a set of levels, which are the unique town names. To show how it works, we convert the factor to a numeric vector:

as.numeric(towns)
[1] 2 6 3 6 2 4 1 5

As we can see, a factor is a sequence of integer codes, each of which points to one of the levels. When a value repeats, the same code appears in the sequence. What is a factor useful for, apart from reducing the amount of memory needed to store this kind of data? For example, for counting the number of occurrences of each value:

table(towns)
towns
Bialystok    Cracow    Gdansk    Poznan  Szczecin    Warsaw 
        1         2         1         1         1         2 
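
The relationship between the codes and the levels can be checked directly. The following sketch uses a shorter, hypothetical factor called cities and shows how the integer codes map back to the original strings.

```r
# Hypothetical, shorter factor of town names
cities = factor(c("Cracow", "Warsaw", "Gdansk", "Warsaw"))

levels(cities)               # unique values, sorted: "Cracow" "Gdansk" "Warsaw"
nlevels(cities)              # number of levels: 3

codes = as.integer(cities)   # position of each value in the levels: 1 3 2 3
levels(cities)[codes]        # looking the codes up restores the strings
as.character(cities)         # the same result in one step
```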

The last structure we consider for storing data is the data frame. Put simply, it resembles an Excel spreadsheet that stores data in rows and columns. Each column can hold a different type of data. Let’s create a simple data frame. Since R 4.0.0, data.frame no longer converts strings to factors by default, so we pass stringsAsFactors = TRUE to reproduce the output shown in the rest of this section:

df = data.frame(N = 1:3, name = c("John","Natalie","Kate"), surname = c("Black","White","Gray"), age = c(15,6,13), adult = c(FALSE,FALSE,FALSE), stringsAsFactors = TRUE)
df
  N    name surname age adult
1 1    John   Black  15 FALSE
2 2 Natalie   White   6 FALSE
3 3    Kate    Gray  13 FALSE

We have already dealt with this type of data in the analysis of vector files, because the attribute table of an SHP layer is loaded in this form. We can extract a column from the data frame just as we did with the matrix:

df[,2]
[1] John    Natalie Kate   
Levels: John Kate Natalie

or rows:

df[1,]
  N name surname age adult
1 1 John   Black  15 FALSE

Each row and column also has a name that we can use for selection:

df$name
[1] John    Natalie Kate   
Levels: John Kate Natalie

df[,'name']
[1] John    Natalie Kate   
Levels: John Kate Natalie

We can also display these names:

colnames(df)
[1] "N"       "name"    "surname" "age"     "adult"  
rownames(df)
[1] "1" "2" "3"
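
Beyond selecting whole rows and columns, a data frame is often filtered by a condition or extended with a new column. The sketch below shows both on a small, hypothetical data frame called people; the threshold of 18 for the adult column is an assumption made only for this illustration.

```r
# Hypothetical data frame with two columns
people = data.frame(name = c("John", "Natalie", "Kate"), age = c(15, 6, 13))

str(people)                         # compact overview of columns and their types

older = people[people$age > 10, ]   # keep only the rows where age is above 10
nrow(older)                         # 2

# Add a new column computed from an existing one (18 is an assumed threshold)
people$adult = people$age >= 18
ncol(people)                        # 3
```

The comma after the condition matters: the condition selects rows, and the empty position after the comma means that all columns are kept.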

The choice of storage structure depends on how the data is stored in the files. For example, as we mentioned earlier, grids will be loaded as a matrix (or as an array when there are more than two dimensions), and spreadsheets will be loaded as a data frame. In the next stage of the course we will show how to use loops on the data.
