Evaluating Multiple Linear Regression Models

Lucy D’Agostino McGowan

Conditions for multiple linear regression

Assumption | What it means | How do you check? | How do you fix?
Linearity | The relationship between the outcome and each explanatory variable (predictor) is linear, holding all other variables constant | Residuals vs. fits plot; marginal effects plots | Fit a better model (transformations, polynomial terms, more / different variables, etc.)


Transforming the outcome variable

  • So far, we’ve talked about transforming an explanatory variable (for example, including polynomial terms)
  • It is also common to transform the outcome variable by taking its log or square root. This is particularly useful for constant-variance violations
  • Be careful when interpreting coefficients after a transformation, as the transformation must be accounted for (for example, with a log-transformed outcome, \(\hat\beta\) is the expected change in the log of \(y\) for a one-unit change in \(x\))


Marginal effects plots

  • With simple linear regression we could examine a scatterplot of \(x\) and \(y\), but how can we do this when we have multiple explanatory variables in our model?
  • One option is to predict new values of \(\hat{y}\) over a range of a particular \(x\) variable, holding all other variables constant: at their reference values for categorical variables and at their average values for continuous variables.

Marginal effects plots

library(Stat2Data)
library(tidyverse) # for %>%, mutate(), and ggplot()
data("Diamonds")

# Model TotalPrice with three explanatory variables
mod <- lm(TotalPrice ~ Carat + Depth + Color, data = Diamonds)

# Residuals vs. fitted values: a first check of the linearity condition
Diamonds %>%
  mutate(y_hat = fitted(mod),
         resid = residuals(mod)) %>%
  ggplot(aes(x = y_hat, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  labs(x = "Fitted values", y = "Residuals")

Marginal effects plots

# Predict over a range of Carat, holding Depth at its mean
# and Color at its reference level ("D")
new_data <- data.frame(
  Carat = seq(.1, 3.4, by = 0.1),
  Depth = mean(Diamonds$Depth),
  Color = "D"
)

ggplot(Diamonds, aes(x = Carat, y = TotalPrice)) +
  geom_point() +
  geom_line(data = new_data, aes(x = Carat, y = predict(mod, new_data)))

# Predict over a range of Depth, holding Carat at its mean
# and Color at its reference level
new_data2 <- data.frame(
  Carat = mean(Diamonds$Carat),
  Depth = seq(58, 79, by = 1),
  Color = "D"
)

ggplot(Diamonds, aes(x = Depth, y = TotalPrice)) +
  geom_point() +
  geom_line(data = new_data2, aes(x = Depth, y = predict(mod, new_data2)))
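The same recipe works for a categorical variable: hold the continuous variables at their means and predict across the category levels. A minimal sketch (new_data3 is a name introduced here for illustration):

# Predict across the levels of Color, holding Carat and Depth at their means
new_data3 <- data.frame(
  Carat = mean(Diamonds$Carat),
  Depth = mean(Diamonds$Depth),
  Color = unique(Diamonds$Color)
)
new_data3$y_hat <- predict(mod, new_data3)

ggplot(new_data3, aes(x = Color, y = y_hat)) +
  geom_point() +
  labs(y = "Predicted TotalPrice")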

Marginal effects plots

# Add a quadratic term for Carat to capture the curvature
mod2 <- lm(TotalPrice ~ Carat + I(Carat^2) + Depth + Color, data = Diamonds)

ggplot(Diamonds, aes(x = Carat, y = TotalPrice)) +
  geom_point() +
  geom_line(data = new_data, aes(x = Carat, y = predict(mod2, new_data)))
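One way to check whether the quadratic term improved the fit is to re-draw the residuals vs. fits plot for mod2; a minimal sketch mirroring the diagnostic above:

# Residuals vs. fitted values for the model with the quadratic term
Diamonds %>%
  mutate(y_hat = fitted(mod2),
         resid = residuals(mod2)) %>%
  ggplot(aes(x = y_hat, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  labs(x = "Fitted values", y = "Residuals")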

Conditions for multiple linear regression

Assumption | What it means | How do you check? | How do you fix?
Linearity | The relationship between the outcome and each explanatory variable (predictor) is linear, holding all other variables constant | Residuals vs. fits plot; marginal effects plots | Fit a better model (transformations, polynomial terms, more / different variables, etc.)
Zero Mean | The error distribution is centered at zero | Holds by default (least squares residuals average to zero) | —
Constant Variance | The variability in the errors is the same for all values of the predictor variable | Residuals vs. fits plot | Fit a better model (try taking the log or square root of the outcome)

Transformation of y example

data("MetroHealth83")

mod <- lm(NumMDs ~ NumHospitals, data = MetroHealth83)
MetroHealth83 %>%
  mutate(y_hat = fitted(mod),
         resid = residuals(mod)) %>%
  ggplot(aes(x = y_hat, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0) + 
  labs(x = "Fitted values", y = "Residuals")

Transformation of y example

The residual spread is not the same across the fitted values, so the constant-variance condition looks violated. Try taking the log of the outcome.

mod2 <- lm(log(NumMDs) ~ NumHospitals, data = MetroHealth83)
MetroHealth83 %>%
  mutate(y_hat = fitted(mod2),
         resid = residuals(mod2)) %>%
  ggplot(aes(x = y_hat, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0) + 
  labs(x = "Fitted values", y = "Residuals")

summary(mod2)

Call:
lm(formula = log(NumMDs) ~ NumHospitals, data = MetroHealth83)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.24906 -0.55027 -0.03063  0.49255  1.11936 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.75892    0.10046   57.32   <2e-16 ***
NumHospitals  0.14499    0.01047   13.84   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6054 on 81 degrees of freedom
Multiple R-squared:  0.7029,    Adjusted R-squared:  0.6992 
F-statistic: 191.6 on 1 and 81 DF,  p-value: < 2.2e-16

How do you interpret \(\hat\beta_1\)?
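One way to answer: because the outcome is on the log scale, a one-unit increase in NumHospitals adds 0.145 to the predicted log count, which multiplies the predicted count itself by \(e^{0.145} \approx 1.16\). A quick check in R:

# Back-transform the slope: exp(0.14499) is about 1.16, so each additional
# hospital multiplies the predicted number of MDs by roughly 1.16 (~16%)
exp(coef(mod2)["NumHospitals"])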

Transformation of y example

Try taking the square root of the outcome.

mod3 <- lm(sqrt(NumMDs) ~ NumHospitals, data = MetroHealth83)
MetroHealth83 %>%
  mutate(y_hat = fitted(mod3),
         resid = residuals(mod3)) %>%
  ggplot(aes(x = y_hat, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0) + 
  labs(x = "Fitted values", y = "Residuals")

summary(mod3)

Call:
lm(formula = sqrt(NumMDs) ~ NumHospitals, data = MetroHealth83)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.086  -5.845  -2.030   7.001  17.994 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   14.0329     1.4686   9.555 6.36e-15 ***
NumHospitals   2.9148     0.1531  19.036  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.85 on 81 degrees of freedom
Multiple R-squared:  0.8173,    Adjusted R-squared:  0.8151 
F-statistic: 362.4 on 1 and 81 DF,  p-value: < 2.2e-16

How do you interpret \(\hat\beta_1\)?
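One way to answer: the slope is on the square-root scale, so each additional hospital adds about 2.91 to the predicted \(\sqrt{\text{NumMDs}}\). On the original scale the effect is not constant, so square the predictions to interpret them. A minimal sketch (the NumHospitals values are arbitrary):

# Predicted sqrt(NumMDs) for 5 vs. 6 hospitals, squared to return to
# the original scale; the gap between predictions grows with NumHospitals
predict(mod3, data.frame(NumHospitals = c(5, 6)))^2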

Conditions for multiple linear regression

Assumption | What it means | How do you check? | How do you fix?
Linearity | The relationship between the outcome and each explanatory variable (predictor) is linear, holding all other variables constant | Residuals vs. fits plot; marginal effects plots | Fit a better model (transformations, polynomial terms, more / different variables, etc.)
Zero Mean | The error distribution is centered at zero | Holds by default (least squares residuals average to zero) | —
Constant Variance | The variability in the errors is the same for all values of the predictor variable | Residuals vs. fits plot | Fit a better model (try taking the log or square root of the outcome)
Independence | The errors are assumed to be independent from one another | 👀 data generation | Find better data or fit a fancier model
Random | The data are obtained using a random process | 👀 data generation | Find better data
Normality | The random errors follow a normal distribution | QQ-plot / residual histogram | Fit a better model
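For the normality check named in the last row, a minimal sketch of a QQ-plot of the residuals (here reusing mod3 from the square-root example):

# QQ-plot: residual quantiles vs. normal quantiles; points near the
# reference line suggest the normality condition is reasonable
ggplot(data.frame(resid = residuals(mod3)), aes(sample = resid)) +
  geom_qq() +
  geom_qq_line()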