R Basics – Data Analysis Part 3 – Statistics

23 January 2017 / 25 April 2021 | aggregate, max, mean, min, R, RStudio, statistics

We have loaded the data and checked that it is correct. Now let's look at a few functions that make it easy to analyze. For the analysis, we will use the file with the population of districts that we loaded in part one (download). Let's load it again:

data = read.table("D:/population.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE)
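
To try the same call without the file, here is a minimal sketch that reads tab-separated text from a string instead of from disk. The three rows and the names `raw` and `toy` are made up for illustration; only the column names (district, voivodeship, population) come from this series.

```r
# Hypothetical tab-separated content standing in for population.txt
raw <- "district\tvoivodeship\tpopulation\nalpha\tnorth\t100\nbeta\tnorth\t400\ngamma\tsouth\t250"

# Same arguments as above, but reading from the 'text' argument
toy <- read.table(text = raw, header = TRUE, sep = "\t", stringsAsFactors = FALSE)

# Inspect column types: district and voivodeship stay character, population is numeric
str(toy)
```

With `stringsAsFactors = FALSE` the text columns are kept as character vectors rather than converted to factors.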

First, we want to find the largest district population, which is stored in the population column. To do this, we use the max function:

max(data$population)
[1] 1724404

Using the expression from the previous post, we can display the name of the district with the highest population:

data$district[which(data$population==max(data$population))]
[1] "m. st. Warszawa"
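
Base R also offers which.max, which returns the index of the first maximum only. The sketch below, on a made-up data frame called `toy` with a deliberate tie, shows how that differs from the comparison with max used above, which returns every matching row.

```r
# Hypothetical data with two districts tied for the largest population
toy <- data.frame(
  district = c("alpha", "beta", "gamma"),
  population = c(100, 400, 400),
  stringsAsFactors = FALSE
)

# which.max gives the position of the first maximum only
first_max <- toy$district[which.max(toy$population)]

# Comparing with max() keeps every district that reaches the maximum
all_max <- toy$district[toy$population == max(toy$population)]

print(first_max)  # "beta"
print(all_max)    # "beta" "gamma"
```

The same distinction applies to which.min and min.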

The minimum population is displayed using the min function:

min(data$population)
[1] 20891

The least populated district is:

data$district[which(data$population==min(data$population))]
[1] "sejnenski"

If we want to know the minimum and maximum value of the data at once, we can use the range function:

range(data$population)
[1]   20891 1724404
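
The result of range is an ordinary numeric vector of length two, so it can be indexed or passed on to other functions. A small sketch with made-up values in a vector `x`:

```r
# Hypothetical population values
x <- c(100, 400, 250, 50, 200)

r <- range(x)      # c(minimum, maximum)
lowest <- r[1]     # same as min(x)
highest <- r[2]    # same as max(x)
spread <- diff(r)  # distance between maximum and minimum

print(r)       # 50 400
print(spread)  # 350
```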

The mean function is used to calculate the average value:

mean(data$population)
[1] 101304.4

We will find the median using the median function:

median(data$population)
[1] 76436
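
In our data the mean is noticeably higher than the median, which is what one would expect when a few very large districts, such as Warszawa, pull the average up. The sketch below illustrates that effect on made-up numbers: adding one extreme value moves the mean a lot and the median only slightly.

```r
# Hypothetical values without and with one very large observation
x <- c(50, 100, 200, 250, 400)
x_outlier <- c(x, 10000)

mean_before <- mean(x)            # 200
median_before <- median(x)        # 200
mean_after <- mean(x_outlier)     # 11000 / 6, about 1833.33
median_after <- median(x_outlier) # average of 200 and 250, i.e. 225

print(c(mean_before, mean_after))
print(c(median_before, median_after))
```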

The sd function gives us the standard deviation:

sd(data$population)
[1] 116999.1

And the var function gives the variance:

var(data$population)
[1] 13688782521
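
The two results are related: the standard deviation is the square root of the variance, and both use the sample formula with n - 1 in the denominator. A short check on made-up values in a vector `x`:

```r
# Hypothetical values
x <- c(50, 100, 200, 250, 400)

# Sample variance computed by hand: squared deviations divided by n - 1
manual_var <- sum((x - mean(x))^2) / (length(x) - 1)

print(manual_var)  # 18750
print(var(x))      # 18750
print(sd(x)^2)     # 18750, up to floating-point rounding
```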

We compute quantiles with the quantile function, passing the probability of the quantile we want (here 0.3, the 30th percentile):

quantile(data$population, 0.3)
    30%
58012.5
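
Since quartiles are the 25%, 50% and 75% quantiles, one way to get all three in a single call is to pass a vector of probabilities. This is a sketch on made-up values in a vector `x`, using the default quantile method.

```r
# Hypothetical values
x <- c(50, 100, 200, 250, 400)

# A vector of probabilities returns one named value per probability
q <- quantile(x, c(0.25, 0.5, 0.75))

print(q)
# 25% 50% 75%
# 100 200 250
```

The result is a named numeric vector, so a single quartile can be picked out with, for example, q["75%"].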

We can use the summary function to display several basic statistics at once. Note that the printed values are rounded, which is why the minimum appears as 20890 rather than 20891:

summary(data$population)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  20890   55720   76440  101300  111900 1724000
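
The object returned by summary is a named vector, so individual statistics can be extracted by name. A minimal sketch on made-up values in a vector `x`:

```r
# Hypothetical values
x <- c(50, 100, 200, 250, 400)

s <- summary(x)

# The element names match the column headers of the printed summary
print(names(s))

# Pull out single statistics by name
med <- s[["Median"]]
top <- s[["Max."]]

print(med)  # 200
print(top)  # 400
```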

The aggregate function is also useful for analyzing values in groups, in our case by voivodeship (province). Let's use it to calculate the average district population in each voivodeship. Its arguments are the values to summarize, the grouping variables given as a list (by), and the function to apply to each group (FUN).

aggregate(data$population, by = list(data$voivodeship), FUN = 'mean')
               Group.1         x
 1        dolnoslaskie  96999.90
 2  kujawsko-pomorskie  90981.04
 3             lodzkie 104712.21
 4           lubelskie  88569.20
 5            lubuskie  72962.14
 6         malopolskie 144134.45
 7         mazowieckie 126591.43
 8            opolskie  83701.33
 9        podkarpackie  85171.76
10           podlaskie  70292.06
11           pomorskie 114790.55
12             slaskie 127762.42
13      swietokrzyskie  90588.50
14 warminsko-mazurskie  68900.71
15       wielkopolskie  99057.60
16  zachodniopomorskie  81850.52
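
The result columns above get the generic names Group.1 and x. As an alternative sketch, aggregate also accepts a formula, which keeps the original column names in the result. The data frame `toy` and its values are made up for illustration.

```r
# Hypothetical data: five districts in two voivodeships
toy <- data.frame(
  voivodeship = c("north", "north", "south", "south", "south"),
  population = c(100, 400, 250, 50, 200),
  stringsAsFactors = FALSE
)

# Formula interface: summarize population, grouped by voivodeship
avg <- aggregate(population ~ voivodeship, data = toy, FUN = mean)

print(avg)
# north: (100 + 400) / 2 = 250
# south: (250 + 50 + 200) / 3, about 166.67
```

Swapping FUN for another function such as median, max or sd gives the corresponding statistic per group.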

R has many functions that are useful for data analysis, since that is what it was created for. We only wanted to show the basic ones. If you want to go deeper, see the documentation, which describes each function in detail. In the next post, we will show how to display data in charts.