Summarizing data, a Review

Lucy D’Agostino McGowan

Learning objectives

  • Recall how to summarize one continuous variable
  • Identify variables where a mean is a good summary measure (or not)
  • Explain why we summarize data (what is the big picture?)

Application Exercise

bit.ly/sta-112-f22-appex-05

One continuous variable


How can we visualize a single continuous variable?

Histogram

Code
library(tidyverse)  # loads dplyr (starwars), tidyr (drop_na), and ggplot2

starwars %>%
  drop_na(height) %>%
  ggplot(aes(x = height)) +
  geom_histogram(bins = 30, fill = "#86a293")

Density

Code
starwars %>%
  drop_na(height) %>%
  ggplot(aes(x = height)) +
  geom_density(color = "#86a293")

Boxplot

Code
starwars %>%
  drop_na(height) %>%
  ggplot(aes(x = height, y = 1)) +
  geom_boxplot(outlier.shape = NA, color = "#86a293") + 
  geom_jitter(color = "#86a293") + 
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

One continuous variable

How can we numerically summarize a single continuous variable?


starwars %>%
  summarise(mean = mean(height, na.rm = TRUE))
# A tibble: 1 × 1
   mean
  <dbl>
1  174.
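
The `na.rm = TRUE` argument matters because `height` has missing values (the plotting code drops them with `drop_na(height)`). Here is a minimal sketch of what happens without it, using a made-up vector named `heights` that is hypothetical and not taken from `starwars`:

```r
# Hypothetical heights (in cm) with one missing value; not from starwars
heights <- c(172, 167, NA, 202)

# Without na.rm, a single NA makes the whole mean NA
mean(heights)

# With na.rm = TRUE, the NA is dropped before averaging the remaining values
mean(heights, na.rm = TRUE)
```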

One continuous variable

Code
library(geomtextpath)
starwars %>%
  drop_na(height) %>%
  ggplot(aes(x = height)) +
  geom_histogram(bins = 30, fill = "#86a293") +
  geom_textvline(xintercept = 174, 
                 size = 6,      # text size of the label; linewidth below sets the line
                 linewidth = 2, 
                 label = "mean = 174",
                 hjust = 0.25)

One continuous variable

Why do we calculate a mean?

  • To reduce the dimensionality of the data (from n values to 1)
  • To get a sense of a “typical” observation
    • When is the mean an accurate representation of a typical observation?

Meaningful means

Symmetric

Code
set.seed(1)

d1 <- tibble(x = rnorm(1000, mean = 10))
ggplot(d1, aes(x = x)) + 
  geom_histogram(bins = 30, fill = "#86a293")

Bimodal

Code
d2 <- tibble(x = c(rnorm(500, mean = 10),
                   rnorm(500, mean = 20)))
ggplot(d2, aes(x = x)) + 
  geom_histogram(bins = 30, fill = "#86a293")

Skewed

Code
d3 <- tibble(x = rbeta(1000, 2, 5))
ggplot(d3, aes(x = x)) + 
  geom_histogram(bins = 30, fill = "#86a293")

Guess the mean for each of these variables.

Meaningful means

Symmetric

Code
ggplot(d1, aes(x = x)) + 
  geom_histogram(bins = 30, fill = "#86a293") + 
  geom_vline(xintercept = mean(d1$x), lwd = 2)

Bimodal

Code
ggplot(d2, aes(x = x)) + 
  geom_histogram(bins = 30, fill = "#86a293") + 
  geom_vline(xintercept = mean(d2$x), lwd = 2)

Skewed

Code
ggplot(d3, aes(x = x)) + 
  geom_histogram(bins = 30, fill = "#86a293") + 
  geom_vline(xintercept = mean(d3$x), lwd = 2)

Does this value represent a “typical” observation?
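
One rough way to probe this question for the bimodal case is to count how many observations fall close to the mean. The sketch below regenerates data with the same recipe as the bimodal example. The names `x_bimodal`, `bimodal_mean`, and `share_near_mean` and the cutoff of 1 unit are assumptions made for illustration. Because the seed is reset here, the draws will not match `d2` exactly.

```r
set.seed(1)

# Same recipe as the bimodal example: two groups centered at 10 and 20
x_bimodal <- c(rnorm(500, mean = 10), rnorm(500, mean = 20))

# The mean lands between the two groups
bimodal_mean <- mean(x_bimodal)

# Share of observations within 1 unit of the mean (the cutoff is arbitrary)
share_near_mean <- mean(abs(x_bimodal - bimodal_mean) < 1)

bimodal_mean
share_near_mean
```

If almost no observations sit near the mean, it is hard to call the mean a “typical” value for this variable.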

Math speak

\[\Large\bar{x} =\sum_{i=1}^n \frac{x_i}{n}\]

Math speak

\[\Large{\require{color}\colorbox{#86a293}{$\bar{x}$}} =\sum_{i=1}^n \frac{x_i}{n}\]

the mean of the variable \(x\)

Math speak

\[\Large\bar{x} ={\require{color}\colorbox{#86a293}{$\sum$}}_{i=1}^n \frac{x_i}{n}\]

add up the observations

Math speak

\[\Large\bar{x} =\sum_{{\require{color}\colorbox{#86a293}{$i=1$}}}^n \frac{x_i}{n}\]

starting from the first observation (\(i = 1\))

Math speak

\[\Large\bar{x} =\sum_{i=1}^{\require{color}\colorbox{#86a293}{$n$}} \frac{x_i}{{\require{color}\colorbox{#86a293}{$n$}}}\]

\(n\) is the total number of observations

Math speak

\[\Large\bar{x} =\sum_{i=1}^n \frac{{\require{color}\colorbox{#86a293}{$x_i$}}}{n}\]

the value of the continuous variable for observation \(i\)

Math speak

\[\Large\bar{x} =\sum_{i=1}^n \frac{x_i}{\require{color}\colorbox{#86a293}{${n}$}}\]

divide by the total number of observations
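
The pieces of the formula can be translated into R one at a time. This is a small sketch on a made-up vector (the values in `x` are hypothetical), comparing the formula with R's built-in `mean()`:

```r
# Hypothetical data, chosen only for illustration
x <- c(2, 4, 9)

n <- length(x)        # total number of observations
x_bar <- sum(x / n)   # add up x_i / n for i = 1, ..., n

x_bar
all.equal(x_bar, mean(x))  # the formula agrees with mean()
```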

Application Exercise

data
\(x_1\) 3
\(x_2\) 5
\(x_3\) 1
\(x_4\) 7
\(x_5\) 8

  1. Using the data above, what is \(n\)?
  2. What is \(\bar{x}\)?

Data = model + error

Data

Code
d <- tibble(
  i = 1:5,
  x = c(3, 5, 1, 7, 8),
  model = mean(x),
  error = x - model
) 

knitr::kable(d)

|  i|  x| model| error|
|--:|--:|-----:|-----:|
|  1|  3|   4.8|  -1.8|
|  2|  5|   4.8|   0.2|
|  3|  1|   4.8|  -3.8|
|  4|  7|   4.8|   2.2|
|  5|  8|   4.8|   3.2|
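
Written out with the mean formula, the model column for these five observations works out as follows:

\[\bar{x} = \sum_{i=1}^5 \frac{x_i}{5} = \frac{3 + 5 + 1 + 7 + 8}{5} = \frac{24}{5} = 4.8\]

The errors are the leftovers after subtracting the mean, and for the mean they add up to zero:

\[\sum_{i=1}^5 (x_i - \bar{x}) = -1.8 + 0.2 - 3.8 + 2.2 + 3.2 = 0\]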

Data

Code
ggplot(d, aes(x = 1, y = x)) + 
  geom_point() + 
  geom_texthline(yintercept = mean(d$x), label = "mean = 4.8") + 
  theme(axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor.x = element_blank())

Data

Code
ggplot(d, aes(x = i, y = x)) + 
  geom_point() + 
  geom_texthline(yintercept = mean(d$x), label = "mean = 4.8") + 
  geom_segment(aes(y = x, yend = mean(x), x = i, xend = i), color = "blue") + 
  theme(axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor.x = element_blank())

Math speak

\[\Large x = \beta_0 + \varepsilon\]

Math speak

\[\Large {\require{color}\colorbox{#86a293}{$x$}} = \beta_0 + \varepsilon\]

This is the vector \(x=\{x_1,\dots,x_n\}\)

Math speak

\[\Large x = {\require{color}\colorbox{#86a293}{$\beta_0$}} + \varepsilon\]

We call this the “intercept”. When there are no other variables in the model, it is just the mean, \(\bar{x}\).

Math speak

\[\Large x = \beta_0 + {\require{color}\colorbox{#86a293}{$\varepsilon$}}\]

the error: the difference between each observation and \(\beta_0\)
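
The decomposition can be checked directly on the five data values from the earlier table. This is a minimal base R sketch. The names `beta_0` and `epsilon` are chosen here to mirror the notation and are not from the original code.

```r
x <- c(3, 5, 1, 7, 8)

beta_0 <- mean(x)       # the "model": one number used for every observation
epsilon <- x - beta_0   # the error: what the model misses for each x_i

# Model plus error gives back the data
all.equal(beta_0 + epsilon, x)
```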

Data

Code
ggplot(d, aes(x = i, y = x)) + 
  geom_point() + 
  geom_texthline(yintercept = mean(d$x), size = 5, hjust = 0.1,
                 label = as.character(expression(beta[0])), parse = TRUE) + 
  geom_segment(aes(y = x, yend = mean(x), x = i, xend = i), color = "blue") +
  theme(axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor.x = element_blank()) 

Data

Code
ggplot(d, aes(x = i, y = x)) + 
  geom_point() + 
  geom_texthline(yintercept = mean(d$x), size = 5, hjust = 0.1,
                 label = as.character(expression(beta[0])), parse = TRUE) + 
  geom_textsegment(aes(y = x, yend = mean(x), x = i, xend = i), color = "blue",
                   label = as.character(expression(epsilon)), parse = TRUE,
                   size = 5) +
  theme(axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor.x = element_blank())

Data

Code
ggplot(d, aes(x = 1, y = x)) + 
  geom_point() + 
  geom_texthline(yintercept = mean(d$x), size = 5, hjust = 0.1,
                 label = as.character(expression(beta[0])), parse = TRUE) + 
  geom_segment(aes(y = x, yend = mean(x), x = 1, xend = 1), color = "blue") + 
  theme(axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor.x = element_blank())  

Data

Code
ggplot(d, aes(x = 1, y = x)) + 
  geom_point() + 
  geom_texthline(yintercept = mean(d$x), size = 5, hjust = 0.1,
                 label = as.character(expression(beta[0])), parse = TRUE) + 
  geom_segment(aes(y = x, yend = mean(x), x = 1, xend = 1), color = "blue") + 
  theme(axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_blank()) 

Code
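# Keep the x, model, and error columns and relabel them with math notation.
# A new name is used so the bimodal data stored in d2 is not overwritten.
d_math <- d[2:4]
names(d_math) <- c("$\\mathbf{x}$", "$\\beta_0$", "$\\varepsilon$")

knitr::kable(d_math)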

| \(\mathbf{x}\) | \(\beta_0\) | \(\varepsilon\) |
|---:|---:|---:|
| 3 | 4.8 | -1.8 |
| 5 | 4.8 | 0.2 |
| 1 | 4.8 | -3.8 |
| 7 | 4.8 | 2.2 |
| 8 | 4.8 | 3.2 |

Calculating the mean in R

summarise(d, mean_x = mean(x))
# A tibble: 1 × 1
  mean_x
   <dbl>
1    4.8
lm(x ~ 1, data = d)

Call:
lm(formula = x ~ 1, data = d)

Coefficients:
(Intercept)  
        4.8  


  • “intercept only model”
  • lm: linear model
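
To see that the two approaches agree, the fitted intercept can be pulled out of the model and compared with the mean. This is a minimal base R sketch. The object name `fit` is an assumption for illustration.

```r
x <- c(3, 5, 1, 7, 8)

fit <- lm(x ~ 1)   # intercept-only model: no predictors

coef(fit)          # the estimated intercept
mean(x)            # the sample mean

# Residuals are the errors: each observation minus the mean
all.equal(unname(residuals(fit)), x - mean(x))
```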

Application Exercise

Open your 04-appex.qmd file. Load the packages by running the top R chunk of code.

  1. Copy the code below into an R chunk at the bottom of the file:
d <- tibble(
  x = c(3, 5, 1, 7, 8)
)

What do you think this code does? Try typing ?tibble in the Console - what does this function do?

  2. Calculate the mean of x. Do this two ways: using the summarise function and using the lm function.
  3. Add a new variable called error to the data set d that is equal to x minus the mean of x.

Recap

When is the mean an appropriate summary measure to calculate?

What assumptions need to be true in order to use a mean to represent your single continuous variable?