Simple Linear Regression Part 2

Lucy D’Agostino McGowan

Data

library(tidyverse)

starwars_nojabba <- starwars %>%
  drop_na(mass, height) %>%
  filter(mass < 1000)

Simple linear regression

mod <- lm(mass ~ height, data = starwars_nojabba)


starwars_nojabba %>%
  mutate(y_hat = fitted(mod)) %>%
  select(y_hat, mass, height)
# A tibble: 58 × 3
   y_hat  mass height
   <dbl> <dbl>  <int>
 1  74.3    77    172
 2  71.2    75    167
 3  27.1    32     96
 4  93.0   136    202
 5  60.7    49    150
 6  78.1   120    178
 7  70.0    75    165
 8  27.7    32     97
 9  81.2    84    183
10  80.5    77    182
# … with 48 more rows

Where did these \(\hat{y}\) values come from?

Fitted values

lm(mass ~ height, 
   data = starwars_nojabba)

Call:
lm(formula = mass ~ height, data = starwars_nojabba)

Coefficients:
(Intercept)       height  
   -32.5408       0.6214  
starwars_nojabba %>%
  mutate(y_hat = fitted(mod)) %>%
  select(y_hat, mass, height) %>%
  slice(1:3)
# A tibble: 3 × 3
  y_hat  mass height
  <dbl> <dbl>  <int>
1  74.3    77    172
2  71.2    75    167
3  27.1    32     96

\(\hat{y}_1 = -32.5 + 0.62 \times 172\)

\(\hat{y}_2 = -32.5 + 0.62 \times 167\)

\(\hat{y}_3 = -32.5 + 0.62 \times 96\)

Simple linear regression

How did we decide on this line?

Code
starwars_nojabba <- starwars_nojabba %>%
  mutate(fitted = fitted(lm(mass ~ height, data = starwars_nojabba)))

ggplot(starwars_nojabba, aes(x = height, mass)) +
  geom_point(color = "#86a293") +
  geom_segment(aes(
    x = height,
    y = mass,
    xend = height,
    yend = fitted
  ),
  color = "blue") +
  geom_smooth(
    method = "lm",
    se = FALSE,
    formula = "y ~ x",
    color = "#86a293"
  ) +
  labs(title = "The relationship between mass and height for Star Wars characters",
       caption = "Data from SWAPI (swapi.dev)")

Minimize Least Squares

Code
ggplot(starwars_nojabba, aes(x = height, mass)) +
  geom_rect(
    aes(
      xmin = height,
      xmax = height + mass - fitted,
      ymin = mass,
      ymax = fitted
    ),
    fill = "blue",
    color = "blue",
    alpha = 0.2
  ) +
  geom_smooth(
    method = "lm",
    se = FALSE,
    formula = "y ~ x",
    color = "#86a293"
  ) +
  geom_point(color = "#86a293") +
  coord_fixed() +
  labs(title = "The relationship between mass and height for Star Wars characters",
       caption = "Data from SWAPI (swapi.dev)")

“Squared Residuals”

Code
ggplot(starwars_nojabba, aes(x = height, mass)) +
  geom_rect(
    aes(
      xmin = height,
      xmax = height + mass - fitted,
      ymin = mass,
      ymax = fitted
    ),
    fill = "blue",
    color = "blue",
    alpha = 0.2
  ) +
  geom_smooth(
    method = "lm",
    se = FALSE,
    formula = "y ~ x",
    color = "#86a293"
  ) +
  geom_point(color = "#86a293") +
  coord_fixed() +
  labs(title = "The relationship between mass and height for Star Wars characters",
       caption = "Data from SWAPI (swapi.dev)")

“Residuals”

Code
ggplot(starwars_nojabba, aes(x = height, mass)) +
  geom_point(color = "#86a293") +
  geom_segment(aes(
    x = height,
    y = mass,
    xend = height,
    yend = fitted
  ),
  color = "blue") +
  geom_smooth(
    method = "lm",
    se = FALSE,
    formula = "y ~ x",
    color = "#86a293"
  ) +
  labs(title = "The relationship between mass and height for Star Wars characters",
       caption = "Data from SWAPI (swapi.dev)")

Let’s pause for definitions

Definitions

  • residual \((e)\)
  • squared residual \((e^2)\)
  • sum of squared residuals (SSE)
  • n
  • degrees of freedom
  • standard deviation of the errors \((\sigma_\varepsilon)\)

  • \(y - \hat{y}\)
  • \((y-\hat{y})^2\)
  • \(\sum(y-\hat{y})^2\)
  • sample size
  • \(n - p\) for simple linear regression: \(n - 2\)
  • estimated by \(\hat{\sigma}_\varepsilon=\sqrt{\frac{\textrm{SSE}}{df}}\)

A note about notation

\[\Large \sum(y-\hat{y})^2\]

\[\Large \sum_{i=1}^n(y_i - \hat{y}_i)^2\]

A note about notation

the \(i\) indicates a single individual

\[\Large e_i = y_i - \hat{y}_i\]

A note about notation

for the 1st observation, \(i = 1\)

\[\Large e_1 = y_1 - \hat{y}_1\]

Application Exercise

  1. Go to lucy.shinyapps.io/least-squares/
  2. This shows a scatter plot of 10 data points with a line estimating the relationship between x and y. Drag the blue points to change the line.
  3. See if you can find a line that minimizes the sum of square errors
03:00

Let’s do this in R!

Calculate the residual

mod <- lm(mass ~ height,
          data = starwars_nojabba)

starwars_nojabba %>%
  mutate(
    y_hat = fitted(mod),
    residual = mass - y_hat) %>%
  select(mass, height, y_hat, residual)
# A tibble: 58 × 4
    mass height y_hat residual
   <dbl>  <int> <dbl>    <dbl>
 1    77    172  74.3     2.67
 2    75    167  71.2     3.77
 3    32     96  27.1     4.89
 4   136    202  93.0    43.0 
 5    49    150  60.7   -11.7 
 6   120    178  78.1    41.9 
 7    75    165  70.0     5.02
 8    32     97  27.7     4.27
 9    84    183  81.2     2.83
10    77    182  80.5    -3.55
# … with 48 more rows

Calculate the residual squared

How could I add the residual squared to this data frame?

mod <- lm(mass ~ height,
          data = starwars_nojabba)

starwars_nojabba %>%
  mutate(
    y_hat = fitted(mod),
    residual = mass - y_hat) %>%
  select(mass, height, y_hat, residual)
# A tibble: 58 × 4
    mass height y_hat residual
   <dbl>  <int> <dbl>    <dbl>
 1    77    172  74.3     2.67
 2    75    167  71.2     3.77
 3    32     96  27.1     4.89
 4   136    202  93.0    43.0 
 5    49    150  60.7   -11.7 
 6   120    178  78.1    41.9 
 7    75    165  70.0     5.02
 8    32     97  27.7     4.27
 9    84    183  81.2     2.83
10    77    182  80.5    -3.55
# … with 48 more rows

Calculate the residual squared

How could I add the residual squared to this data frame?

mod <- lm(mass ~ height, 
          data = starwars_nojabba)

starwars_nojabba %>%
  mutate(
    y_hat = fitted(mod),
    residual = mass - y_hat,
    residual_2 = residual^2) %>%
  select(mass, height, y_hat, residual_2)
# A tibble: 58 × 4
    mass height y_hat residual_2
   <dbl>  <int> <dbl>      <dbl>
 1    77    172  74.3       7.11
 2    75    167  71.2      14.2 
 3    32     96  27.1      23.9 
 4   136    202  93.0    1851.  
 5    49    150  60.7     136.  
 6   120    178  78.1    1759.  
 7    75    165  70.0      25.2 
 8    32     97  27.7      18.2 
 9    84    183  81.2       8.02
10    77    182  80.5      12.6 
# … with 48 more rows

Calculate the SSE

How can I summarize this dataset to calculate the sum of the squared residuals?

mod <- lm(mass ~ height, 
          data = starwars_nojabba)

starwars_nojabba %>%
  mutate(
    y_hat = fitted(mod),
    residual = mass - y_hat,
    residual_2 = residual^2)  %>%
  select(mass, height, y_hat, residual_2)
# A tibble: 58 × 4
    mass height y_hat residual_2
   <dbl>  <int> <dbl>      <dbl>
 1    77    172  74.3       7.11
 2    75    167  71.2      14.2 
 3    32     96  27.1      23.9 
 4   136    202  93.0    1851.  
 5    49    150  60.7     136.  
 6   120    178  78.1    1759.  
 7    75    165  70.0      25.2 
 8    32     97  27.7      18.2 
 9    84    183  81.2       8.02
10    77    182  80.5      12.6 
# … with 48 more rows

Calculate the SSE

How can I summarize this dataset to calculate the sum of the squared residuals?

mod <- lm(mass ~ height, 
          data = starwars_nojabba)

starwars_nojabba %>%
  mutate(
    y_hat = fitted(mod),
    residual = mass - y_hat,
    residual_2 = residual^2) %>%
  summarize(
    sse = sum(residual_2)
    )
# A tibble: 1 × 1
     sse
   <dbl>
1 20509.

Calculate the sample size

How can I add the total sample size?

mod <- lm(mass ~ height, 
          data = starwars_nojabba)

starwars_nojabba %>%
  mutate(
    y_hat = fitted(mod),
    residual = mass - y_hat,
    residual_2 = residual^2) %>%
  summarize(
    sse = sum(residual_2)
    )
# A tibble: 1 × 1
     sse
   <dbl>
1 20509.

Calculate the sample size

How can I add the total sample size?

mod <- lm(mass ~ height, 
          data = starwars_nojabba)

starwars_nojabba %>%
  mutate(
    y_hat = fitted(mod),
    residual = mass - y_hat,
    residual_2 = residual^2) %>%
  summarize(
    sse = sum(residual_2),
    n = n()
    )
# A tibble: 1 × 2
     sse     n
   <dbl> <int>
1 20509.    58

Calculate the degrees of freedom

How can I add the degrees of freedom \((n-p)\)?

mod <- lm(mass ~ height, 
          data = starwars_nojabba)

starwars_nojabba %>%
  mutate(
    y_hat = fitted(mod),
    residual = mass - y_hat,
    residual_2 = residual^2) %>%
  summarize(
    sse = sum(residual_2),
    n = n()
    )
# A tibble: 1 × 2
     sse     n
   <dbl> <int>
1 20509.    58

Calculate the degrees of freedom

How can I add the degrees of freedom \((n-p)\)?

mod <- lm(mass ~ height, 
          data = starwars_nojabba)

starwars_nojabba %>%
  mutate(
    y_hat = fitted(mod),
    residual = mass - y_hat,
    residual_2 = residual^2) %>%
  summarize(
    sse = sum(residual_2),
    n = n(),
    df = n - 2
    )
# A tibble: 1 × 3
     sse     n    df
   <dbl> <int> <dbl>
1 20509.    58    56

Calculate the residual standard error

How can I add the total \(\hat{\sigma}_\varepsilon= \sqrt{\frac{\textrm{SSE}}{df}}\)?

mod <- lm(mass ~ height, 
          data = starwars_nojabba)

starwars_nojabba %>%
  mutate(
    y_hat = fitted(mod),
    residual = mass - y_hat,
    residual_2 = residual^2) %>%
  summarize(
    sse = sum(residual_2),
    n = n(),
    df = n - 2
    )
# A tibble: 1 × 3
     sse     n    df
   <dbl> <int> <dbl>
1 20509.    58    56

Calculate the residual standard error

How can I add the total \(\hat{\sigma}_\varepsilon= \sqrt{\frac{\textrm{SSE}}{df}}\)?

mod <- lm(mass ~ height, 
          data = starwars_nojabba)

starwars_nojabba %>%
  mutate(
    y_hat = fitted(mod),
    residual = mass - y_hat,
    residual_2 = residual^2) %>%
  summarize(
    sse = sum(residual_2),
    n = n(),
    df = n - 2,
    sigma = sqrt(sse / df)
    )
# A tibble: 1 × 4
     sse     n    df sigma
   <dbl> <int> <dbl> <dbl>
1 20509.    58    56  19.1

Find residual standard error in lm output

mod <- lm(mass ~ height, data = starwars_nojabba)

summary(mod)

Call:
lm(formula = mass ~ height, data = starwars_nojabba)

Residuals:
    Min      1Q  Median      3Q     Max 
-39.382  -8.212   0.211   3.846  57.327 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -32.54076   12.56053  -2.591   0.0122 *  
height        0.62136    0.07073   8.785 4.02e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.14 on 56 degrees of freedom
Multiple R-squared:  0.5795,    Adjusted R-squared:  0.572 
F-statistic: 77.18 on 1 and 56 DF,  p-value: 4.018e-12

Application Exercise

  1. Create a new project from this template in RStudio Pro:
https://github.com/sta-112-f22/appex-07.git
  1. Load the packages and data by running the top chunk of R code
  2. Learn about the PorschePrice data by running ?PorschePrice in your Console
  3. Fit a linear model predicting Price from Mileage
  4. Add a variable called y_hat to the PorschePrice dataset with the predicted y values
  5. Add a variable called residual to the PorschePrice dataset with the residuals
07:00