Simple Linear Regression Part 2

Lucy D’Agostino McGowan

Data

library(tidyverse)

starwars_nojabba <- starwars %>%
  drop_na(mass, height) %>%
  filter(mass < 1000)

Simple linear regression

mod <- lm(mass ~ height, data = starwars_nojabba)

starwars_nojabba %>%
  mutate(y_hat = fitted(mod)) %>%
  select(y_hat, mass, height)

# A tibble: 58 × 3
   y_hat  mass height
   <dbl> <dbl>  <int>
 1  74.3    77    172
 2  71.2    75    167
 3  27.1    32     96
 4  93.0   136    202
 5  60.7    49    150
 6  78.1   120    178
 7  70.0    75    165
 8  27.7    32     97
 9  81.2    84    183
10  80.5    77    182
# … with 48 more rows

Where did these \(\hat{y}\) values come from?

Fitted values

lm(mass ~ height, 
   data = starwars_nojabba)


Call:
lm(formula = mass ~ height, data = starwars_nojabba)

Coefficients:
(Intercept)       height  
   -32.5408       0.6214

starwars_nojabba %>%
  mutate(y_hat = fitted(mod)) %>%
  select(y_hat, mass, height) %>%
  slice(1:3)

# A tibble: 3 × 3
  y_hat  mass height
  <dbl> <dbl>  <int>
1  74.3    77    172
2  71.2    75    167
3  27.1    32     96

\(\hat{y}_1 = -32.5 + 0.62 \times 172\)

\(\hat{y}_2 = -32.5 + 0.62 \times 167\)

\(\hat{y}_3 = -32.5 + 0.62 \times 96\)

Simple linear regression

How did we decide on this line?

Code

starwars_nojabba <- starwars_nojabba %>%
  mutate(fitted = fitted(lm(mass ~ height, data = starwars_nojabba)))

ggplot(starwars_nojabba, aes(x = height, mass)) +
  geom_point(color = "#86a293") +
  geom_segment(aes(
    x = height,
    y = mass,
    xend = height,
    yend = fitted
  ),
  color = "blue") +
  geom_smooth(
    method = "lm",
    se = FALSE,
    formula = "y ~ x",
    color = "#86a293"
  ) +
  labs(title = "The relationship between mass and height for Star Wars characters",
       caption = "Data from SWAPI (swapi.dev)")

Minimize Least Squares

Code

ggplot(starwars_nojabba, aes(x = height, mass)) +
  geom_rect(
    aes(
      xmin = height,
      xmax = height + mass - fitted,
      ymin = mass,
      ymax = fitted
    ),
    fill = "blue",
    color = "blue",
    alpha = 0.2
  ) +
  geom_smooth(
    method = "lm",
    se = FALSE,
    formula = "y ~ x",
    color = "#86a293"
  ) +
  geom_point(color = "#86a293") +
  coord_fixed() +
  labs(title = "The relationship between mass and height for Star Wars characters",
       caption = "Data from SWAPI (swapi.dev)")

“Squared Residuals”

Code

ggplot(starwars_nojabba, aes(x = height, mass)) +
  geom_rect(
    aes(
      xmin = height,
      xmax = height + mass - fitted,
      ymin = mass,
      ymax = fitted
    ),
    fill = "blue",
    color = "blue",
    alpha = 0.2
  ) +
  geom_smooth(
    method = "lm",
    se = FALSE,
    formula = "y ~ x",
    color = "#86a293"
  ) +
  geom_point(color = "#86a293") +
  coord_fixed() +
  labs(title = "The relationship between mass and height for Star Wars characters",
       caption = "Data from SWAPI (swapi.dev)")

“Residuals”

Code

ggplot(starwars_nojabba, aes(x = height, mass)) +
  geom_point(color = "#86a293") +
  geom_segment(aes(
    x = height,
    y = mass,
    xend = height,
    yend = fitted
  ),
  color = "blue") +
  geom_smooth(
    method = "lm",
    se = FALSE,
    formula = "y ~ x",
    color = "#86a293"
  ) +
  labs(title = "The relationship between mass and height for Star Wars characters",
       caption = "Data from SWAPI (swapi.dev)")

Let’s pause for definitions

Definitions

residual \((e)\)
squared residual \((e^2)\)
sum of squared residuals (SSE)
n
degrees of freedom
standard deviation of the errors \((\sigma_\varepsilon)\)

\(y - \hat{y}\)
\((y-\hat{y})^2\)
\(\sum(y-\hat{y})^2\)
sample size
\(n - p\) for simple linear regression: \(n - 2\)
estimated by \(\hat{\sigma}_\varepsilon=\sqrt{\frac{\textrm{SSE}}{df}}\)

A note about notation

\[\Large \sum(y-\hat{y})^2\]

\[\Large \sum_{i=1}^n(y_i - \hat{y}_i)^2\]

A note about notation

the \(i\) indicates a single individual

\[\Large e_i = y_i - \hat{y}_i\]

A note about notation

for the 1st observation, \(i = 1\)

\[\Large e_1 = y_1 - \hat{y}_1\]

`Application Exercise`

Go to lucy.shinyapps.io/least-squares/
This shows a scatter plot of 10 data points with a line estimating the relationship between x and y. Drag the blue points to change the line.
See if you can find a line that minimizes the sum of square errors

03:00

Let’s do this in R!

Calculate the residual

mod <- lm(mass ~ height,
          data = starwars_nojabba)

starwars_nojabba %>%
  mutate(
    y_hat = fitted(mod),
    residual = mass - y_hat) %>%
  select(mass, height, y_hat, residual)

# A tibble: 58 × 4
    mass height y_hat residual
   <dbl>  <int> <dbl>    <dbl>
 1    77    172  74.3     2.67
 2    75    167  71.2     3.77
 3    32     96  27.1     4.89
 4   136    202  93.0    43.0 
 5    49    150  60.7   -11.7 
 6   120    178  78.1    41.9 
 7    75    165  70.0     5.02
 8    32     97  27.7     4.27
 9    84    183  81.2     2.83
10    77    182  80.5    -3.55
# … with 48 more rows

Calculate the residual squared

How could I add the residual squared to this data frame?

mod <- lm(mass ~ height,
          data = starwars_nojabba)

starwars_nojabba %>%
  mutate(
    y_hat = fitted(mod),
    residual = mass - y_hat) %>%
  select(mass, height, y_hat, residual)

# A tibble: 58 × 4
    mass height y_hat residual
   <dbl>  <int> <dbl>    <dbl>
 1    77    172  74.3     2.67
 2    75    167  71.2     3.77
 3    32     96  27.1     4.89
 4   136    202  93.0    43.0 
 5    49    150  60.7   -11.7 
 6   120    178  78.1    41.9 
 7    75    165  70.0     5.02
 8    32     97  27.7     4.27
 9    84    183  81.2     2.83
10    77    182  80.5    -3.55
# … with 48 more rows

Calculate the residual squared

How could I add the residual squared to this data frame?

mod <- lm(mass ~ height, 
          data = starwars_nojabba)

starwars_nojabba %>%
  mutate(
    y_hat = fitted(mod),
    residual = mass - y_hat,
    residual_2 = residual^2) %>%
  select(mass, height, y_hat, residual_2)

# A tibble: 58 × 4
    mass height y_hat residual_2
   <dbl>  <int> <dbl>      <dbl>
 1    77    172  74.3       7.11
 2    75    167  71.2      14.2 
 3    32     96  27.1      23.9 
 4   136    202  93.0    1851.  
 5    49    150  60.7     136.  
 6   120    178  78.1    1759.  
 7    75    165  70.0      25.2 
 8    32     97  27.7      18.2 
 9    84    183  81.2       8.02
10    77    182  80.5      12.6 
# … with 48 more rows

Calculate the SSE

How can I summarize this dataset to calculate the sum of the squared residuals?

mod <- lm(mass ~ height, 
          data = starwars_nojabba)

starwars_nojabba %>%
  mutate(
    y_hat = fitted(mod),
    residual = mass - y_hat,
    residual_2 = residual^2)  %>%
  select(mass, height, y_hat, residual_2)

# A tibble: 58 × 4
    mass height y_hat residual_2
   <dbl>  <int> <dbl>      <dbl>
 1    77    172  74.3       7.11
 2    75    167  71.2      14.2 
 3    32     96  27.1      23.9 
 4   136    202  93.0    1851.  
 5    49    150  60.7     136.  
 6   120    178  78.1    1759.  
 7    75    165  70.0      25.2 
 8    32     97  27.7      18.2 
 9    84    183  81.2       8.02
10    77    182  80.5      12.6 
# … with 48 more rows

Calculate the SSE

How can I summarize this dataset to calculate the sum of the squared residuals?

mod <- lm(mass ~ height, 
          data = starwars_nojabba)

starwars_nojabba %>%
  mutate(
    y_hat = fitted(mod),
    residual = mass - y_hat,
    residual_2 = residual^2) %>%
  summarize(
    sse = sum(residual_2)
    )

# A tibble: 1 × 1
     sse
   <dbl>
1 20509.

Calculate the sample size

How can I add the total sample size?

mod <- lm(mass ~ height, 
          data = starwars_nojabba)

starwars_nojabba %>%
  mutate(
    y_hat = fitted(mod),
    residual = mass - y_hat,
    residual_2 = residual^2) %>%
  summarize(
    sse = sum(residual_2)
    )

# A tibble: 1 × 1
     sse
   <dbl>
1 20509.

Calculate the sample size

How can I add the total sample size?

mod <- lm(mass ~ height, 
          data = starwars_nojabba)

starwars_nojabba %>%
  mutate(
    y_hat = fitted(mod),
    residual = mass - y_hat,
    residual_2 = residual^2) %>%
  summarize(
    sse = sum(residual_2),
    n = n()
    )

# A tibble: 1 × 2
     sse     n
   <dbl> <int>
1 20509.    58

Calculate the degrees of freedom

How can I add the degrees of freedom \((n-p)\)?

mod <- lm(mass ~ height, 
          data = starwars_nojabba)

starwars_nojabba %>%
  mutate(
    y_hat = fitted(mod),
    residual = mass - y_hat,
    residual_2 = residual^2) %>%
  summarize(
    sse = sum(residual_2),
    n = n()
    )

# A tibble: 1 × 2
     sse     n
   <dbl> <int>
1 20509.    58

Calculate the degrees of freedom

How can I add the degrees of freedom \((n-p)\)?

mod <- lm(mass ~ height, 
          data = starwars_nojabba)

starwars_nojabba %>%
  mutate(
    y_hat = fitted(mod),
    residual = mass - y_hat,
    residual_2 = residual^2) %>%
  summarize(
    sse = sum(residual_2),
    n = n(),
    df = n - 2
    )

# A tibble: 1 × 3
     sse     n    df
   <dbl> <int> <dbl>
1 20509.    58    56

Calculate the residual standard error

How can I add the total \(\hat{\sigma}_\varepsilon= \sqrt{\frac{\textrm{SSE}}{df}}\)?

mod <- lm(mass ~ height, 
          data = starwars_nojabba)

starwars_nojabba %>%
  mutate(
    y_hat = fitted(mod),
    residual = mass - y_hat,
    residual_2 = residual^2) %>%
  summarize(
    sse = sum(residual_2),
    n = n(),
    df = n - 2
    )

# A tibble: 1 × 3
     sse     n    df
   <dbl> <int> <dbl>
1 20509.    58    56

Calculate the residual standard error

How can I add the total \(\hat{\sigma}_\varepsilon= \sqrt{\frac{\textrm{SSE}}{df}}\)?

mod <- lm(mass ~ height, 
          data = starwars_nojabba)

starwars_nojabba %>%
  mutate(
    y_hat = fitted(mod),
    residual = mass - y_hat,
    residual_2 = residual^2) %>%
  summarize(
    sse = sum(residual_2),
    n = n(),
    df = n - 2,
    sigma = sqrt(sse / df)
    )

# A tibble: 1 × 4
     sse     n    df sigma
   <dbl> <int> <dbl> <dbl>
1 20509.    58    56  19.1

Find residual standard error in `lm` output

mod <- lm(mass ~ height, data = starwars_nojabba)

summary(mod)


Call:
lm(formula = mass ~ height, data = starwars_nojabba)

Residuals:
    Min      1Q  Median      3Q     Max 
-39.382  -8.212   0.211   3.846  57.327 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -32.54076   12.56053  -2.591   0.0122 *  
height        0.62136    0.07073   8.785 4.02e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.14 on 56 degrees of freedom
Multiple R-squared:  0.5795,    Adjusted R-squared:  0.572 
F-statistic: 77.18 on 1 and 56 DF,  p-value: 4.018e-12

`Application Exercise`

Create a new project from this template in RStudio Pro:

https://github.com/sta-112-f22/appex-07.git

Load the packages and data by running the top chunk of R code
Learn about the PorschePrice data by running ?PorschePrice in your Console
Fit a linear model predicting Price from Mileage
Add a variable called y_hat to the PorschePrice dataset with the predicted y values
Add a variable called residual to the PorschePrice dataset with the residuals

07:00

Simple Linear Regression Part 2

Data

Simple linear regression

Where did these \(\hat{y}\) values come from?

Fitted values

\(\hat{y}_1 = -32.5 + 0.62 \times 172\)

\(\hat{y}_2 = -32.5 + 0.62 \times 167\)

\(\hat{y}_3 = -32.5 + 0.62 \times 96\)

Simple linear regression

Minimize Least Squares

“Squared Residuals”

“Residuals”

Let’s pause for definitions

Definitions

A note about notation

A note about notation

the \(i\) indicates a single individual

A note about notation

for the 1st observation, \(i = 1\)

Application Exercise

Let’s do this in R!

Calculate the residual

Calculate the residual squared

Calculate the residual squared

Calculate the SSE

Calculate the SSE

Calculate the sample size

Calculate the sample size

Calculate the degrees of freedom

Calculate the degrees of freedom

Calculate the residual standard error

Calculate the residual standard error

Find residual standard error in lm output

Application Exercise

`Application Exercise`

Find residual standard error in `lm` output

`Application Exercise`