4 Descriptive Statistics
4.1 Quick Summary
The easiest way to get a quick summary of a dataset in R is to use the summary() function. This function reports the minimum, maximum, mean, median, and first and third quartiles, either for the entire dataset or for the variables that we select. Let’s take a look at the summary table for a few variables in the medical dataset.
summary(medical[,c("sex", "race", "age", "avg_drinks")])
sex race age avg_drinks
Length:246 Length:246 Min. :20.0 Min. : 0.0
Class :character Class :character 1st Qu.:31.0 1st Qu.: 2.0
Mode :character Mode :character Median :35.0 Median : 12.0
Mean :36.3 Mean : 17.1
3rd Qu.:41.0 3rd Qu.: 24.0
Max. :60.0 Max. :142.0
R often knows which variables are numerical (i.e., quantitative) and which variables are characters (i.e., qualitative). In the output above, we see the summary statistics for the numerical variables (age and avg_drinks). For the character variables (sex and race), we only see the length, which is the number of rows for these variables.
We can see the summary table for the entire dataset using:
summary(medical)
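As noted above, summary() reports only the length for character columns. One common workaround is to convert such a column to a factor first, after which summary() reports the count of each category. The sketch below uses a small made-up data frame, toy, rather than the medical dataset; its column names and values are assumptions for illustration only.

```r
# Hypothetical toy data standing in for the medical dataset
toy <- data.frame(
  sex = c("female", "male", "male", "female", "male"),
  age = c(25, 31, 47, 52, 38),
  stringsAsFactors = FALSE
)

# As a character column, sex is summarised only by its length
summary(toy$sex)

# As a factor, sex is summarised by the count of each category
toy$sex <- factor(toy$sex)
sex_counts <- summary(toy$sex)
sex_counts

# summary() on the whole data frame now shows counts for sex
summary(toy)
```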
4.2 Frequency Tables
Another handy function in base R is table, which tabulates the data and creates frequency tables for variables. If the variable is a character, it will show the frequency of each level; if the variable is numerical, it will show the frequency of each value.
# Frequency tables for homeless status and sex
table(medical$homeless)
homeless housed
118 128
table(medical$sex)
female male
57 189
# Frequency table for age
table(medical$age)
20 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
2 7 3 4 2 6 8 9 4 12 7 19 14 11 16 11 16 6 16 5 8 7 10 3 6 2
47 48 49 50 51 53 54 55 56 57 58 60
10 5 4 2 1 2 1 2 1 1 2 1
The output is kind of messy. The first row shows the age values and the second row shows how many times those values appear in the dataset. Let’s reformat our frequency table into a more readable format.
age_table <- as.data.frame(table(medical$age))
head(age_table)
Var1 Freq
1 20 2
2 22 7
3 23 3
4 24 4
5 25 2
6 26 6
We can rename the columns with better names using the colnames() function.
colnames(age_table) <- c("Age", "Frequency")
head(age_table)
Age Frequency
1 20 2
2 22 7
3 23 3
4 24 4
5 25 2
6 26 6
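By default, table silently drops missing values, so the frequencies may add up to fewer than the number of rows. A minimal sketch, assuming a made-up vector status with one missing value (not taken from the medical dataset), shows how the useNA argument makes the missing values visible:

```r
# Hypothetical vector with one missing value
status <- c("homeless", "housed", NA, "housed", "homeless", "housed")

# Default: NA is left out, so the counts sum to 5 rather than 6
default_counts <- table(status)
default_counts

# useNA = "ifany" adds a separate <NA> count when missing values exist
full_counts <- table(status, useNA = "ifany")
full_counts
```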
Sometimes these raw frequency values are hard to interpret. Therefore, we may prefer to have proportions or percentages rather than raw counts. For this task, we can use prop.table. Let’s see the proportion of male and female participants in the data.
# Proportions
prop.table(table(medical$sex))
female male
0.2317 0.7683
# Percentages
prop.table(table(medical$sex))*100
female male
23.17 76.83
# Percentages rounded (no decimal places)
round(prop.table(table(medical$sex))*100, 0)
female male
23 77
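When reporting results, it is often handy to show counts and percentages side by side. The sketch below is one possible way to do this with cbind; it re-enters the counts for sex shown above as a named vector (sex_counts is a name introduced here for illustration) so that it runs without the medical dataset.

```r
# Counts for sex, copied from the frequency table above
sex_counts <- c(female = 57, male = 189)

# Percentages computed from the counts
sex_percent <- prop.table(sex_counts) * 100

# Counts and rounded percentages side by side
sex_summary <- cbind(n = sex_counts, percent = round(sex_percent, 2))
sex_summary
```

With the real data, sex_counts could be replaced by table(medical$sex).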
We can also use the table function for cross-tabulation. For example, if we want to see the number of homeless and housed patients by sex:
# Frequencies
table(medical$homeless, medical$sex)
female male
homeless 22 96
housed 35 93
# Proportions
prop.table(table(medical$homeless, medical$sex))
female male
homeless 0.08943 0.39024
housed 0.14228 0.37805
# Percentages
prop.table(table(medical$homeless, medical$sex))*100
female male
homeless 8.943 39.024
housed 14.228 37.805
# Percentages rounded to two decimal places
round(prop.table(table(medical$homeless, medical$sex))*100, 2)
female male
homeless 8.94 39.02
housed 14.23 37.80
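The proportions above are computed relative to the grand total, so all four cells together sum to 1. Often we instead want row or column percentages, for example the percentage of females within each housing status. A sketch of this using the margin argument of prop.table follows; the counts are re-entered from the cross-tabulation above, and the object name housing_by_sex is an assumption introduced for illustration.

```r
# Counts copied from the cross-tabulation above
housing_by_sex <- as.table(matrix(
  c(22, 96,
    35, 93),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("homeless", "housed"), c("female", "male"))
))

# Row percentages: each row sums to 100
row_percent <- prop.table(housing_by_sex, margin = 1) * 100
round(row_percent, 2)

# Column percentages: each column sums to 100
col_percent <- prop.table(housing_by_sex, margin = 2) * 100
round(col_percent, 2)
```

With the real data, the same calls would apply to table(medical$homeless, medical$sex).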
4.2.1 Exercise 5
Use the table and prop.table functions to create a cross-tabulation for race and substance. We want to see the percentages. Which group is the largest and which group is the smallest based on their percentages?
4.3 Central Tendency and Dispersion
Central tendency refers to indices or measures that give us an idea about the center of the data. Typical central tendency measures are:
- Mean: The sum of the values for a given variable divided by the number of values (\(n\)). Mean is typically denoted by \(\bar{X}\) (read as “x bar”) or simply \(M\). We can find the mean as:
\[\bar{X} = \frac{X_1 + X_2 + X_3 + ...+ X_n}{n}.\]
- Median: The middle value for a given variable when the values are sorted from smallest to largest.
Dispersion refers to statistics that tell us how dispersed or spread out the values of a variable are. Typical dispersion measures are:
- Standard deviation: A typical difference (deviation) between a particular value and the mean of a variable. Standard deviation is denoted by \(\sigma\) (read as “sigma”). We can find the standard deviation as:
\[\sigma = \sqrt{\frac{(X_1-\bar{X})^2+(X_2-\bar{X})^2+...+(X_n-\bar{X})^2}{n}}\]
- Variance: The squared value of the standard deviation, \(\sigma^2\).
- Quantiles: If we divide a cumulative frequency curve into quarters, the value at the lower quarter is referred to as the lower quartile, the value at the middle gives the median, and the value at the upper quarter is the upper quartile.
- Range: The difference between the biggest value and the smallest value of a variable.
- Interquartile range (IQR): Like the range, but instead of the difference between the biggest and smallest values, it is the difference between the 75th percentile (upper quartile) and the 25th percentile (lower quartile).
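To make the formulas above concrete, the sketch below computes the mean and standard deviation step by step for a small made-up vector x (an assumption for illustration). Note that the formula above divides by \(n\), whereas R’s var and sd functions divide by \(n - 1\) (the sample versions), so the two results differ slightly.

```r
# Hypothetical values for illustration
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Mean: sum of the values divided by n
x_bar <- sum(x) / n

# Standard deviation as in the formula above (divides by n)
sd_n <- sqrt(sum((x - x_bar)^2) / n)

# The version used by R's sd() divides by n - 1 instead
sd_n_minus_1 <- sqrt(sum((x - x_bar)^2) / (n - 1))

# Compare the two hand calculations with R's built-in sd()
c(by_hand_n = sd_n, by_hand_n_minus_1 = sd_n_minus_1, r_sd = sd(x))
```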
In R, we can calculate all of these statistics very easily. A critical point is that if the variable has missing values, most of these functions return NA instead of a number (and quantile stops with an error). Therefore, we need to add na.rm = TRUE inside the functions to remove missing values before the calculations begin. Let’s try the variable age.
# Mean
mean(medical$age, na.rm = TRUE)
[1] 36.31
# Median
median(medical$age, na.rm = TRUE)
[1] 35
# Standard deviation
sd(medical$age, na.rm = TRUE)
[1] 7.984
# Variance
var(medical$age, na.rm = TRUE)
[1] 63.75
# Quantile
quantile(medical$age, na.rm = TRUE)
0% 25% 50% 75% 100%
20 31 35 41 60
# 95th percentile
quantile(medical$age, 0.95, na.rm = TRUE)
95%
49.75
# Range
range(medical$age, na.rm = TRUE)
[1] 20 60
# Min and max values
min(medical$age, na.rm = TRUE)
[1] 20
max(medical$age, na.rm = TRUE)
[1] 60
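The interquartile range defined earlier has its own function, IQR, which gives the same result as subtracting the 25th percentile from the 75th. The sketch below uses a made-up vector ages containing one missing value (an assumption, since age in the medical dataset appears to be complete) to also show what happens when na.rm = TRUE is left out.

```r
# Hypothetical ages with one missing value
ages <- c(20, 31, 35, NA, 41, 60)

# Without na.rm = TRUE, the result is NA
mean(ages)

# With na.rm = TRUE, the missing value is dropped first
mean(ages, na.rm = TRUE)

# Interquartile range, computed two equivalent ways
iqr_direct <- IQR(ages, na.rm = TRUE)
q <- quantile(ages, c(0.25, 0.75), na.rm = TRUE)
iqr_by_hand <- unname(q[2] - q[1])
c(iqr_direct, iqr_by_hand)
```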
We can also calculate central tendency and dispersion by grouping variables, using the tapply function. Let’s take a look at average and median age by sex.
tapply(medical$age, medical$sex, mean)
female male
37.07 36.08
tapply(medical$age, medical$sex, median)
female male
35 36
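The calls above work because age has no missing values. If the variable did contain missing values, extra arguments such as na.rm = TRUE can be placed after the function name, and tapply passes them on to that function. A minimal sketch with made-up vectors age and sex (assumptions for illustration):

```r
# Hypothetical data with one missing age
age <- c(25, NA, 47, 52, 38, 30)
sex <- c("female", "female", "male", "male", "male", "female")

# Without na.rm, any group containing an NA gets NA
mean_with_na <- tapply(age, sex, mean)
mean_with_na

# Extra arguments after the function are passed on to it
mean_by_sex <- tapply(age, sex, mean, na.rm = TRUE)
mean_by_sex
```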
We can combine these functions using the summarise function from the dplyr package.
library(dplyr)  # provides %>%, summarise(), and group_by()
medical %>%
  summarise(mean_age = mean(age, na.rm = TRUE),
            median_age = median(age, na.rm = TRUE),
            sd_age = sd(age, na.rm = TRUE),
            var_age = var(age, na.rm = TRUE))
mean_age median_age sd_age var_age
1 36.31 35 7.984 63.75
We can also create summaries by grouping variables using the group_by function from the dplyr package. Let’s take a look at the summary of age by sex.
medical %>%
  group_by(sex) %>%
  summarise(n = n(),                                 # Count by sex
            mean_age = mean(age, na.rm = TRUE),      # Mean
            median_age = median(age, na.rm = TRUE),  # Median
            sd_age = sd(age, na.rm = TRUE),          # Standard deviation
            var_age = var(age, na.rm = TRUE))        # Variance
# A tibble: 2 x 6
sex n mean_age median_age sd_age var_age
<chr> <int> <dbl> <int> <dbl> <dbl>
1 female 57 37.1 35 8.51 72.4
2 male 189 36.1 36 7.83 61.3
Another convenient way to summarize a dataset descriptively is to use the skim function from the skimr package (Waring et al. 2020). Let’s try it with our medical dataset.
# Let's install and activate the package
install.packages("skimr")
library("skimr")
To summarize the entire dataset:
skim(medical)
To summarize some variables:
skim(medical, mental1, mental2, avg_drinks, max_drinks)
-- Variable type: numeric --------------------------------------------------------------------
# A tibble: 4 x 10
  skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
* <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 mental1 0 1 31.7 12.5 6.76 22.0 29.1 40.6 60.5
2 mental2 0 1 41.0 13.8 6.68 30.2 42.4 52.7 69.9
3 avg_drinks 0 1 17.1 21.2 0 2 12 24 142
4 max_drinks 0 1 24.1 31.1 0 3 13 32 184
To summarize the dataset by grouping variables:
medical %>%
  group_by(sex) %>%
  select(sex, mental1, mental2, avg_drinks, max_drinks) %>%
  skim()
-- Variable type: numeric --------------------------------------------------------------------
# A tibble: 8 x 11
  skim_variable sex n_missing complete_rate mean sd p0 p25 p50 p75 p100
* <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 mental1 female 0 1 29.3 13.2 7.04 19.8 26.3 37.4 60.5
2 mental1 male 0 1 32.4 12.2 6.76 22.9 30.1 41.6 59.5
3 mental2 female 0 1 38.8 13.0 6.68 29.6 38.3 48.7 64.3
4 mental2 male 0 1 41.6 14.0 7.09 30.5 43.6 53.5 69.9
5 avg_drinks female 0 1 13.7 18.1 0 0 6 19 71
6 avg_drinks male 0 1 18.2 22.0 0 3 13 25 142
7 max_drinks female 0 1 20.2 31.7 0 0 8 26 164
8 max_drinks male 0 1 25.3 30.9 0 4 19 33 184
4.3.1 Exercise 6
Using the summarise function from the dplyr package or the skim function from the skimr package, create a summary of the variable depression1 by race. If you decide to use summarise, you need to include count, mean, standard deviation, minimum, and maximum values.