4 Descriptive Statistics

4.1 Quick Summary

The easiest way to get a quick summary of a dataset in R is to use the summary() function. For numerical variables, this function reports the minimum, maximum, mean, median, and first and third quartiles, either for the entire dataset or for the variables that we select. Let’s take a look at the summary table for a few variables in the medical dataset.

summary(medical[,c("sex", "race", "age", "avg_drinks")])

     sex                race                age         avg_drinks   
 Length:246         Length:246         Min.   :20.0   Min.   :  0.0  
 Class :character   Class :character   1st Qu.:31.0   1st Qu.:  2.0  
 Mode  :character   Mode  :character   Median :35.0   Median : 12.0  
                                       Mean   :36.3   Mean   : 17.1  
                                       3rd Qu.:41.0   3rd Qu.: 24.0  
                                       Max.   :60.0   Max.   :142.0

R usually knows which variables are numerical (i.e., quantitative) and which are characters (i.e., qualitative). In the output above, we see summary statistics for each numerical variable (age and avg_drinks). For character variables (sex and race), we only see the length, which is the number of rows, along with the class and mode.

We can see the summary table for the entire dataset using:

summary(medical)
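If we want counts for a character variable instead of just its length, one option is to convert it to a factor before calling summary. The sketch below uses a small made-up data frame (toy is a hypothetical name and its values are not from the medical dataset) to show the difference.

```r
# Toy data frame with hypothetical values (not the medical dataset)
toy <- data.frame(
  sex = c("female", "male", "male", "female", "male"),
  age = c(31, 45, 28, 39, 52),
  stringsAsFactors = FALSE
)

# As a character column, summary() reports only length, class, and mode
summary(toy$sex)

# As a factor, summary() reports the count of each level
sex_counts <- summary(as.factor(toy$sex))
sex_counts
```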

4.2 Frequency Tables

Another handy function in base R is table, which tabulates the data and creates frequency tables for variables. If the variable is a character, it will show the frequency of each category; if the variable is numerical, it will show the frequency of each distinct value.

# Frequency tables for homeless status and sex
table(medical$homeless)


homeless   housed 
     118      128

table(medical$sex)


female   male 
    57    189

# Frequency table for age
table(medical$age)


20 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 
 2  7  3  4  2  6  8  9  4 12  7 19 14 11 16 11 16  6 16  5  8  7 10  3  6  2 
47 48 49 50 51 53 54 55 56 57 58 60 
10  5  4  2  1  2  1  2  1  1  2  1

The output is hard to read. The first row shows the age values and the second row shows how many times those values appear in the dataset. Let’s reformat our frequency table into a more readable format.

age_table <- as.data.frame(table(medical$age))
head(age_table)

  Var1 Freq
1   20    2
2   22    7
3   23    3
4   24    4
5   25    2
6   26    6

We can rename the columns with better names using the colnames( ) function.

colnames(age_table) <- c("Age", "Frequency")
head(age_table)

  Age Frequency
1  20         2
2  22         7
3  23         3
4  24         4
5  25         2
6  26         6
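One detail worth knowing is that as.data.frame(table(...)) stores the tabulated values as a factor, not as numbers. If we want to do arithmetic with them, we can convert them back first. The sketch below uses a short made-up vector (ages is a hypothetical name) and recovers the mean from the frequency table.

```r
# Hypothetical ages (not the medical dataset)
ages <- c(20, 22, 22, 25, 25, 25)

age_tab <- as.data.frame(table(ages))
colnames(age_tab) <- c("Age", "Frequency")

# The Age column is a factor at this point
class(age_tab$Age)

# Convert factor labels back to numbers (go through character first)
age_tab$Age <- as.numeric(as.character(age_tab$Age))

# Mean recovered from the frequency table: sum(value * count) / sum(count)
mean_age <- sum(age_tab$Age * age_tab$Frequency) / sum(age_tab$Frequency)
mean_age
```

Calling as.numeric directly on a factor would return the internal level codes (1, 2, 3, ...) rather than the original values, which is why the conversion goes through as.character.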

Sometimes these raw frequency values are hard to interpret. Therefore, we may prefer to have proportions or percentages rather than actual frequency values. For this task, we need to use prop.table. Let’s see the proportion of male and female participants in the data.

# Proportions
prop.table(table(medical$sex))


female   male 
0.2317 0.7683

# Percentages
prop.table(table(medical$sex))*100


female   male 
 23.17  76.83

# Percentages rounded (no decimal points)
round(prop.table(table(medical$sex))*100, 0)


female   male 
    23     77

We can also use the table function for cross-tabulation. For example, if we want to see the number of homeless and housed patients by sex:

# Frequencies
table(medical$homeless, medical$sex)

          
           female male
  homeless     22   96
  housed       35   93

# Proportions
prop.table(table(medical$homeless, medical$sex))

          
            female    male
  homeless 0.08943 0.39024
  housed   0.14228 0.37805

# Percentages
prop.table(table(medical$homeless, medical$sex))*100

          
           female   male
  homeless  8.943 39.024
  housed   14.228 37.805

# Percentages rounded to the 2nd decimal point
round(prop.table(table(medical$homeless, medical$sex))*100, 2)

          
           female  male
  homeless   8.94 39.02
  housed    14.23 37.80
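The proportions above are computed relative to the whole table, so all four cells add up to 1 (or 100%). The margin argument of prop.table lets us compute proportions within each row or each column instead. The following sketch uses two short made-up vectors (hypothetical values, not the medical dataset).

```r
# Hypothetical data (not the medical dataset)
homeless <- c("homeless", "homeless", "housed", "housed", "housed", "homeless")
sex      <- c("female", "male", "male", "female", "male", "male")

tab <- table(homeless, sex)

# margin = 1: proportions within each row (each row sums to 1)
row_prop <- prop.table(tab, margin = 1)
round(row_prop * 100, 1)

# margin = 2: proportions within each column (each column sums to 1)
col_prop <- prop.table(tab, margin = 2)
round(col_prop * 100, 1)
```

Row proportions answer questions such as "what share of homeless patients are female?", while column proportions answer "what share of female patients are homeless?".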

4.2.1 Exercise 5

Use the table and prop.table functions to create a cross-tabulation for race and substance. We want to see the percentages. Which group is the largest and which group is the smallest based on their percentages?

4.3 Central Tendency and Dispersion

Central tendency refers to indices or measures that give us an idea about the center of the data. Typical central tendency measures are:

Mean: The sum of the values for a given variable divided by the number of values (\(n\)). The mean is typically denoted by \(\bar{X}\) (read as “x bar”) or simply \(M\). We can find the mean as:

\[\bar{X} = \frac{X_1 + X_2 + X_3 + \dots + X_n}{n}.\]

Median: The middle value for a given variable when the values are sorted from smallest to largest.

Dispersion refers to statistics that tell us how dispersed or spread out the values of a variable are. Typical dispersion measures are:

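Standard deviation: A typical difference (deviation) between a particular value and the mean of a variable. Standard deviation is denoted by \(\sigma\) (read as “sigma”). We can find the standard deviation as:

\[\sigma = \sqrt{\frac{(X_1-\bar{X})^2+(X_2-\bar{X})^2+\dots+(X_n-\bar{X})^2}{n}}.\]

Note that this formula divides by \(n\), whereas the sd and var functions in R divide by \(n - 1\) (the sample versions), so their results are slightly larger.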

Variance: The square of the standard deviation, \(\sigma^2\).
Quantiles: Values that cut the sorted data at given proportions. If we divide a cumulative frequency curve into quarters, the value at the lower quarter is referred to as the lower quartile, the value at the middle gives the median, and the value at the upper quarter is the upper quartile.
Range: Difference between the biggest value and the smallest value of a variable.
Interquartile range (IQR): Like the range, but instead of calculating the difference between the biggest and smallest values, it calculates the difference between the 75th percentile (upper quartile) and the 25th percentile (lower quartile).
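To make the formulas above concrete, here is a minimal sketch that computes the mean and standard deviation by hand on a small made-up vector (x is hypothetical, not from the medical dataset) and compares them with R's built-in functions.

```r
# Hypothetical values chosen so the arithmetic is easy to follow
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Mean: sum of the values divided by n
x_bar <- sum(x) / n

# Standard deviation as in the formula above (divides by n)
sigma <- sqrt(sum((x - x_bar)^2) / n)

# Sample standard deviation (divides by n - 1); this is what sd() computes
s <- sqrt(sum((x - x_bar)^2) / (n - 1))

x_bar           # same as mean(x)
sigma           # formula with n in the denominator
c(s, sd(x))     # these two agree
c(s^2, var(x))  # variance is the squared standard deviation
```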

In R, we can calculate all of these statistics very easily. A critical point is that if the variable has missing values, these functions return NA (or stop with an error, in the case of quantile) by default. Therefore, we need to add na.rm = TRUE inside the functions to remove missing values before calculations begin. Let’s try the variable age.

# Mean
mean(medical$age, na.rm = TRUE)

[1] 36.31

# Median
median(medical$age, na.rm = TRUE)

[1] 35

# Standard deviation
sd(medical$age, na.rm = TRUE)

[1] 7.984

# Variance
var(medical$age, na.rm = TRUE)

[1] 63.75

# Quantile
quantile(medical$age, na.rm = TRUE)

  0%  25%  50%  75% 100% 
  20   31   35   41   60

# 95th percentile
quantile(medical$age, probs = 0.95, na.rm = TRUE)

  95% 
49.75

# Range (returns the smallest and largest values, not their difference)
range(medical$age, na.rm = TRUE)

[1] 20 60

# Min and max values
min(medical$age, na.rm = TRUE)

[1] 20

max(medical$age, na.rm = TRUE)

[1] 60
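The examples above do not show the range as a single number or the interquartile range. A minimal sketch of both, using a made-up vector with one missing value (age here is hypothetical, not the medical dataset), also shows what happens when na.rm = TRUE is left out.

```r
# Hypothetical ages with one missing value
age <- c(23, 35, NA, 41, 29, 60)

# Without na.rm = TRUE the result is NA
mean(age)

# With na.rm = TRUE the missing value is dropped first
mean(age, na.rm = TRUE)

# Range as a single number: largest value minus smallest value
age_range <- diff(range(age, na.rm = TRUE))
age_range

# Interquartile range: 75th percentile minus 25th percentile
age_iqr <- IQR(age, na.rm = TRUE)
age_iqr
```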

We can also calculate central tendency and dispersion by grouping variables, using the tapply function. Let’s take a look at average and median age by sex.

tapply(medical$age, medical$sex, mean)

female   male 
 37.07  36.08

tapply(medical$age, medical$sex, median)

female   male 
    35     36
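Any extra arguments given to tapply are passed on to the summary function, which is how we can handle missing values within groups. The sketch below uses made-up vectors (hypothetical values, not the medical dataset) in which one group contains an NA.

```r
# Hypothetical data: the female group contains a missing age
age <- c(30, 40, NA, 50, 20, 60)
sex <- c("female", "female", "female", "male", "male", "male")

# Without na.rm, the group with a missing value returns NA
tapply(age, sex, mean)

# na.rm = TRUE is passed through to mean() for each group
mean_by_sex <- tapply(age, sex, mean, na.rm = TRUE)
mean_by_sex

# The same pattern works for other functions, such as sd()
sd_by_sex <- tapply(age, sex, sd, na.rm = TRUE)
sd_by_sex
```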

We can compute several of these statistics in a single call using the summarise function from the dplyr package.

medical %>%
  summarise(mean_age = mean(age, na.rm = TRUE),
            median_age = median(age, na.rm = TRUE),
            sd_age = sd(age, na.rm = TRUE),
            var_age = var(age, na.rm = TRUE))

  mean_age median_age sd_age var_age
1    36.31         35  7.984   63.75

We can also create summaries by grouping variables using the group_by function from the dplyr package. Let’s take a look at the summary of age by sex.

medical %>%
  group_by(sex) %>%
  summarise(n = n(), # Count by sex
            mean_age = mean(age, na.rm = TRUE), # Mean
            median_age = median(age, na.rm = TRUE), # Median
            sd_age = sd(age, na.rm = TRUE), # Standard deviation
            var_age = var(age, na.rm = TRUE)) # Variance

# A tibble: 2 x 6
  sex        n mean_age median_age sd_age var_age
  <chr>  <int>    <dbl>      <int>  <dbl>   <dbl>
1 female    57     37.1         35   8.51    72.4
2 male     189     36.1         36   7.83    61.3

Another convenient way to summarize a dataset descriptively is to use the skim function from the skimr package (Waring et al. 2020). Let’s try it with our medical dataset.

# Let's install and activate the package
install.packages("skimr")
library("skimr")

To summarize the entire dataset:

skim(medical)

To summarize some variables:

skim(medical, mental1, mental2, avg_drinks, max_drinks)

-- Variable type: numeric --------------------------------------------------------------------
# A tibble: 4 x 10
  skim_variable n_missing complete_rate  mean    sd    p0   p25   p50   p75  p100
* <chr>             <int>         <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 mental1               0             1  31.7  12.5  6.76  22.0  29.1  40.6  60.5
2 mental2               0             1  41.0  13.8  6.68  30.2  42.4  52.7  69.9
3 avg_drinks            0             1  17.1  21.2  0      2    12    24   142
4 max_drinks            0             1  24.1  31.1  0      3    13    32   184

To summarize the dataset by grouping variables:

medical %>%
  group_by(sex) %>%
  select(sex, mental1, mental2, avg_drinks, max_drinks) %>%
  skim()

-- Variable type: numeric --------------------------------------------------------------------
# A tibble: 8 x 11
  skim_variable sex    n_missing complete_rate  mean    sd    p0   p25   p50   p75  p100
* <chr>         <chr>      <int>         <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 mental1       female         0             1  29.3  13.2  7.04  19.8  26.3  37.4  60.5
2 mental1       male           0             1  32.4  12.2  6.76  22.9  30.1  41.6  59.5
3 mental2       female         0             1  38.8  13.0  6.68  29.6  38.3  48.7  64.3
4 mental2       male           0             1  41.6  14.0  7.09  30.5  43.6  53.5  69.9
5 avg_drinks    female         0             1  13.7  18.1  0      0     6    19    71  
6 avg_drinks    male           0             1  18.2  22.0  0      3    13    25   142  
7 max_drinks    female         0             1  20.2  31.7  0      0     8    26   164  
8 max_drinks    male           0             1  25.3  30.9  0      4    19    33   184

4.3.1 Exercise 6

Using the summarise function from the dplyr package or the skim function from the skimr package, create a summary of the variable depression1 by race. If you decide to use summarise, you need to include count, mean, standard deviation, minimum, and maximum values.