# R Basics – Data Analysis Part 3 – Statistics

We have loaded the data and checked that it is valid. Now let's look at a few functions that make it easy to analyze. For the analysis, we will use the file with district populations that we loaded in part one (download). Let's load it again:

```r
data = read.table("D:/population.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE)
```
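If the file is not at hand, the same call can be tried on inline text, since `read.table` accepts a `text` argument. The sketch below is only an illustration: the three rows are a made-up sample, not the real dataset, and `sample_data` is a hypothetical name. The column names `district`, `voivodeship`, and `population` are the ones used throughout this post.

```r
# Illustrative tab-separated sample; the rows are assumptions, not the real file.
sample_text <- "district\tvoivodeship\tpopulation
m. st. Warszawa\tmazowieckie\t1724404
sejnenski\tpodlaskie\t20891
example_a\tpodlaskie\t50000"

# Same arguments as in the post, but reading from a string instead of a file.
sample_data <- read.table(text = sample_text, header = TRUE, sep = "\t",
                          stringsAsFactors = FALSE)

# Check that the columns have the expected types before analyzing them.
str(sample_data)
```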

First, let's find the largest district population in the `population` column. To do this, we use the **max** function:

```r
max(data$population)
# [1] 1724404
```

Using the expression from the previous post, we can display the name of the district with the highest population:

```r
data$district[which(data$population == max(data$population))]
# [1] "m. st. Warszawa"
```
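One detail of this expression is worth seeing on a small example: `which(x == max(x))` returns every position that reaches the maximum, while base R's `which.max` returns only the first one. The sketch below uses a hypothetical toy data frame (`toy`) with a deliberate tie, so the values are illustrative only.

```r
# Hypothetical toy data with a tie for the maximum.
toy <- data.frame(
  district   = c("a", "b", "c", "d"),
  population = c(10, 50, 30, 50),
  stringsAsFactors = FALSE
)

# which(x == max(x)) returns every position that attains the maximum.
all_max <- toy$district[which(toy$population == max(toy$population))]

# which.max(x) returns only the first such position.
first_max <- toy$district[which.max(toy$population)]

print(all_max)    # "b" "d"
print(first_max)  # "b"
```

Which form is preferable depends on whether ties matter for the question being asked.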

The minimum population is displayed using the **min** function:

```r
min(data$population)
# [1] 20891
```

The least populated district is:

```r
data$district[which(data$population == min(data$population))]
# [1] "sejnenski"
```

If we want to know the minimum and maximum value of the data at once, we can use the **range** function:

```r
range(data$population)
# [1]   20891 1724404
```
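Because `range` returns a two-element vector, the spread between the extremes can be computed with `diff`. A minimal sketch, using a short hypothetical vector `pop` built from values quoted in this post (the minimum, the median, and the maximum) rather than the full dataset:

```r
# Three values taken from the outputs in this post, not the whole column.
pop <- c(20891, 76436, 1724404)

# range() gives c(min, max); diff() subtracts the first from the second.
spread <- diff(range(pop))

print(spread)  # 1703513
```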

The **mean** function is used to calculate the average value:

```r
mean(data$population)
# [1] 101304.4
```
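Our data has already been validated, but in other datasets missing values are common, and `mean` (like `min`, `max`, and `range`) returns `NA` as soon as one is present. The sketch below, on a hypothetical vector `x`, shows the `na.rm = TRUE` argument that skips missing values.

```r
# Hypothetical vector with one missing value.
x <- c(10, NA, 30)

# Without na.rm the result is NA, because the missing value propagates.
mean_with_na <- mean(x)

# With na.rm = TRUE the missing value is dropped before averaging.
mean_without_na <- mean(x, na.rm = TRUE)

print(mean_with_na)     # NA
print(mean_without_na)  # 20
```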

We will find the median using the **median** function:

```r
median(data$population)
# [1] 76436
```
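In our data the mean (101304.4) is well above the median (76436). This is the usual pattern when a few very large values pull the mean upward while the median stays near the typical value. A small sketch on a hypothetical vector `skewed` with one extreme entry illustrates the effect:

```r
# Hypothetical values: four similar numbers and one very large one.
skewed <- c(20, 30, 40, 50, 1000)

# The mean is dragged toward the outlier.
skewed_mean <- mean(skewed)

# The median is the middle value and ignores how large the outlier is.
skewed_median <- median(skewed)

print(skewed_mean)    # 228
print(skewed_median)  # 40
```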

The **sd** function gives us the standard deviation:

```r
sd(data$population)
# [1] 116999.1
```

And the **var** function gives the variance:

```r
var(data$population)
# [1] 13688782521
```
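The two results are related: the standard deviation is the square root of the variance, and both `sd` and `var` use the sample formula with an `n - 1` denominator. The sketch below checks this by hand on a hypothetical vector `x`; `manual_var` is an illustrative name.

```r
# Hypothetical sample; the mean is 5.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Sample variance: squared deviations from the mean, divided by n - 1.
manual_var <- sum((x - mean(x))^2) / (length(x) - 1)

# Both comparisons should be TRUE (up to floating-point tolerance).
print(isTRUE(all.equal(manual_var, var(x))))
print(isTRUE(all.equal(sqrt(var(x)), sd(x))))
```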

We compute quantiles with the **quantile** function, passing the desired probability as the second argument. Here we ask for 0.3, the 30th percentile:

```r
quantile(data$population, 0.3)
#     30%
# 58012.5
```
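The second argument can also be a vector of probabilities, which is a convenient way to get all three quartiles in one call. A minimal sketch on a hypothetical vector `x`, using the default quantile method:

```r
# Hypothetical values 1 through 9.
x <- 1:9

# Request the first quartile, the median, and the third quartile at once.
quartiles <- quantile(x, c(0.25, 0.5, 0.75))

# IQR() is the distance between the third and first quartiles.
iqr <- IQR(x)

print(quartiles)  # 25% = 3, 50% = 5, 75% = 7
print(iqr)        # 4
```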

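We can use the **summary** function to display the basic statistics of a set of values in one call. Note that the printed values are rounded to four significant digits, so the minimum appears as 20890 rather than 20891:

```r
summary(data$population)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   20890   55720   76440  101300  111900 1724000
```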
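The result of `summary` on a numeric vector is a named object, so a single statistic can be pulled out by name instead of being read off the printout. A small sketch on a hypothetical vector `x`:

```r
# Hypothetical values 1 through 9.
x <- 1:9

s <- summary(x)

# Elements are named "Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.".
x_median <- s[["Median"]]
x_max    <- s[["Max."]]

print(x_median)  # 5
print(x_max)     # 9
```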

The **aggregate** function is useful for analyzing values in groups, in our case by voivodeship. Let's use it to calculate the average district population in each voivodeship. We pass the values to summarize, the grouping variables as a list (**by**), and the function to apply (**FUN**):

```r
aggregate(data$population, by = list(data$voivodeship), FUN = mean)
#                Group.1         x
# 1         dolnoslaskie  96999.90
# 2   kujawsko-pomorskie  90981.04
# 3              lodzkie 104712.21
# 4            lubelskie  88569.20
# 5             lubuskie  72962.14
# 6          malopolskie 144134.45
# 7          mazowieckie 126591.43
# 8             opolskie  83701.33
# 9         podkarpackie  85171.76
# 10           podlaskie  70292.06
# 11           pomorskie 114790.55
# 12             slaskie 127762.42
# 13      swietokrzyskie  90588.50
# 14 warminsko-mazurskie  68900.71
# 15       wielkopolskie  99057.60
# 16    zachodniopomorskie  81850.52
```
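The result columns above get the generic names `Group.1` and `x`. As an alternative sketch, `aggregate` also has a formula interface that keeps the original column names, and `tapply` returns the same group means as a named vector. The data frame `toy` below is hypothetical and only mirrors the column names used in this post.

```r
# Hypothetical toy data with two groups.
toy <- data.frame(
  voivodeship = c("A", "A", "B", "B", "B"),
  population  = c(10, 30, 20, 40, 60),
  stringsAsFactors = FALSE
)

# Formula interface: "population grouped by voivodeship".
by_formula <- aggregate(population ~ voivodeship, data = toy, FUN = mean)

# tapply: same means, returned as a vector named by group.
by_tapply <- tapply(toy$population, toy$voivodeship, mean)

print(by_formula)  # A: 20, B: 40, with columns voivodeship and population
print(by_tapply)   # A = 20, B = 40
```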

R has many functions that are useful for data analysis, since that is what it was designed for. We have shown only the basic ones here. If you want to go deeper, see the documentation, which describes each function in detail. In the next post we will show how to display data in charts.