Model Comparisons

Lucy D’Agostino McGowan

🛠 toolkit for comparing models

👉 F-test

👉 \(\Large R^2\)

🛠 F-test for Multiple Linear Regression

  • Comparing the full model to the intercept-only model: \(H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0\) vs. \(H_A: \textrm{at least one } \beta_i \neq 0\)

🛠 F-test for Multiple Linear Regression

  • \(\Large F = \frac{MSModel}{MSE}\)
  • df for the Model?
    • k
  • df for the errors?
    • n - k - 1
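Putting the pieces together, with the degrees of freedom made explicit, the statistic and its reference distribution are:

\[F = \frac{MSModel}{MSE} = \frac{SSModel/k}{SSE/(n-k-1)} \sim F_{k,\, n-k-1} \textrm{ under } H_0\]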

🛠 Nested F-test for Multiple Linear Regression

  • What does “nested” mean?
    • You have a “small” model and a “large” model where the “small” model is completely contained in the “large” model
  • The F-test we have learned so far is one example of this, comparing:
    • \(y = \beta_0 + \epsilon\) (small)
    • \(y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots +\beta_kx_k + \epsilon\) (large)
  • The full (large) model has \(k\) predictors, the reduced (small) model has \(k - p\) predictors

🛠 Nested F-test for Multiple Linear Regression

  • The full (large) model has \(k\) predictors, the reduced (small) model has \(k - p\) predictors
  • What is \(H_0\)?
    • \(H_0:\) \(\beta_i=0\) for all \(p\) predictors being dropped from the full model
  • What is \(H_A\)?
    • \(H_A:\) \(\beta_i\neq 0\) for at least one of the \(p\) predictors dropped from the full model
  • Does the full model do a (statistically significant) better job of explaining the variability in the response than the reduced model?

🛠 Nested F-test for Multiple Linear Regression

  • The full (large) model has \(k\) predictors, the reduced (small) model has \(k - p\) predictors
  • \(F = \frac{(SSModel_{Full} - SSModel_{Reduced})/p}{SSE_{Full}/(n-k-1)}\)
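Because the full and reduced models share the same \(SSTotal\), the numerator can equivalently be written in terms of the SSEs (this is the form software often reports):

\[SSModel_{Full} - SSModel_{Reduced} = SSE_{Reduced} - SSE_{Full}\]

\[F = \frac{(SSE_{Reduced} - SSE_{Full})/p}{SSE_{Full}/(n-k-1)}\]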

🛠 Nested F-test for Multiple Linear Regression

  • Which of these are nested models?
  1. \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\)
  2. \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \epsilon\)
  3. \(y = \beta_0 + \beta_1 x_3 + \epsilon\)
  4. \(y = \beta_0 + \beta_1 x_1 + \epsilon\)
  5. \(y = \beta_0 + \beta_1 x_4 + \epsilon\)

((4) is nested in (1), which is nested in (2))

🛠 Nested F-test for Multiple Linear Regression

  1. \(y = \beta_0 + \beta_1 x_1 + \epsilon\)
  2. \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \epsilon\)
  • Comparing these two models, what is \(p\)?
    • \(p = 2\)
  • What is \(k\)?
    • \(k = 3\)

🛠 Nested F-test for Multiple Linear Regression

  • Goal: Trying to predict the weight of fish based on their length and width
library(Stat2Data) # the Perch dataset is in the Stat2Data package
data("Perch")
model1 <- lm(
  Weight ~ Length + Width + Length * Width,
  data = Perch
  )
model2 <- lm(
  Weight ~ Length + Width + Length * Width + I(Length ^ 2) + I(Width ^ 2),
  data = Perch
  )
  • What is the equation for model1?
  • What is the equation for model2?

🛠 Nested F-test for Multiple Linear Regression

data("Perch")
model1 <- lm(
  Weight ~ Length + Width + Length * Width,
  data = Perch
  )
model2 <- lm(
  Weight ~ Length + Width + Length * Width + I(Length ^ 2) + I(Width ^ 2),
  data = Perch
  )
  • If we want to do a nested F-test, what is \(H_0\)?
    • \(H_0: \beta_4 = \beta_5 = 0\)
  • What is \(H_A\)?
    • \(H_A: \beta_4\neq 0\) or \(\beta_5\neq 0\)
  • What are the degrees of freedom of this test? (n = 56)
    • 2, 50
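The degrees of freedom follow directly from the formula: model2 has \(k = 5\) predictors, and \(p = 2\) of them are dropped to get model1, so

\[p = 2 \qquad\textrm{and}\qquad n - k - 1 = 56 - 5 - 1 = 50\]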

🛠 Nested F-test for Multiple Linear Regression

anova(model1)
Analysis of Variance Table

Response: Weight
             Df  Sum Sq Mean Sq F value  Pr(>F)    
Length        1 6118739 6118739  3126.6 < 2e-16 ***
Width         1  110593  110593    56.5 7.4e-10 ***
Length:Width  1  314997  314997   161.0 < 2e-16 ***
Residuals    52  101765    1957                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(SSModel1 <- 6118739 + 110593 + 314997)
[1] 6544329

🛠 Nested F-test for Multiple Linear Regression

anova(model2)
Analysis of Variance Table

Response: Weight
             Df  Sum Sq Mean Sq F value  Pr(>F)    
Length        1 6118739 6118739 3289.64 < 2e-16 ***
Width         1  110593  110593   59.46 4.7e-10 ***
I(Length^2)   1  314899  314899  169.30 < 2e-16 ***
I(Width^2)    1    5381    5381    2.89   0.095 .  
Length:Width  1    3482    3482    1.87   0.177    
Residuals    50   93000    1860                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(SSModel1 <- 6118739 + 110593 + 314997)
[1] 6544329
(SSModel2 <- 6118739 + 110593 + 314899 + 5381 + 3482)
[1] 6553094

🛠 Nested F-test for Multiple Linear Regression

  • \(F = \frac{(SSModel_{Full} - SSModel_{Reduced})/p}{SSE_{Full}/(n-k-1)}\)
  • \(SSModel_{Full} - SSModel_{Reduced}\):
SSModel2 - SSModel1
[1] 8765
  • What is \(p\)?

🛠 Nested F-test for Multiple Linear Regression

  • \(F = \frac{(SSModel_{Full} - SSModel_{Reduced})/p}{SSE_{Full}/(n-k-1)}\)
  • \((SSModel_{Full} - SSModel_{Reduced})/p\):
(SSModel2 - SSModel1) / 2
[1] 4382

🛠 Nested F-test for Multiple Linear Regression

  • \(F = \frac{(SSModel_{Full} - SSModel_{Reduced})/p}{SSE_{Full}/(n-k-1)}\)
  • \(SSE_{Full}/(n-k-1)\):
anova(model2)
Analysis of Variance Table

Response: Weight
             Df  Sum Sq Mean Sq F value  Pr(>F)    
Length        1 6118739 6118739 3289.64 < 2e-16 ***
Width         1  110593  110593   59.46 4.7e-10 ***
I(Length^2)   1  314899  314899  169.30 < 2e-16 ***
I(Width^2)    1    5381    5381    2.89   0.095 .  
Length:Width  1    3482    3482    1.87   0.177    
Residuals    50   93000    1860                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

🛠 Nested F-test for Multiple Linear Regression

  • \(F = \frac{(SSModel_{Full} - SSModel_{Reduced})/p}{SSE_{Full}/(n-k-1)}\)
((SSModel2 - SSModel1) / 2) /
  1860
[1] 2.36
  • What are the degrees of freedom for this test?
    • 2, 50
pf(2.356183, 2, 50, lower.tail = FALSE)
[1] 0.105
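The same arithmetic can be cross-checked outside R (a Python sketch, assuming scipy is available; the sums of squares are the ones from the anova tables above):

```python
from scipy.stats import f

# Sums of squares from the anova() tables for the Perch models
ss_model_full = 6553094     # SSModel for model2 (full)
ss_model_reduced = 6544329  # SSModel for model1 (reduced)
sse_full = 93000            # residual SS for model2
p = 2                       # predictors dropped: I(Length^2), I(Width^2)
df_error = 50               # n - k - 1 = 56 - 5 - 1

F = ((ss_model_full - ss_model_reduced) / p) / (sse_full / df_error)
p_value = f.sf(F, p, df_error)  # upper-tail probability, like pf(..., lower.tail = FALSE)
print(round(F, 2), round(p_value, 3))
```

This reproduces the F statistic of 2.36 and the p-value of about 0.105 computed with pf() above.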

🛠 Nested F-test for Multiple Linear Regression

An easier way

anova(model1, model2)
Analysis of Variance Table

Model 1: Weight ~ Length + Width + Length * Width
Model 2: Weight ~ Length + Width + Length * Width + I(Length^2) + I(Width^2)
  Res.Df    RSS Df Sum of Sq    F Pr(>F)
1     52 101765                         
2     50  93000  2      8765 2.36   0.11

🛠 \(R^2\) for Multiple Linear Regression

  • \(R^2= \frac{SSModel}{SSTotal}\)
  • \(R^2= 1 - \frac{SSE}{SSTotal}\)
  • As is, \(R^2\) can never decrease when you add a predictor. Therefore, we use \(R^2_{adj}\), which adds a small “penalty” for each additional predictor
  • \(R^2_{adj} = 1 - \frac{SSE/(n-k-1)}{SSTotal / (n-1)}\)
  • \(\frac{SSTotal}{n-1} = \frac{\sum(y - \bar{y})^2}{n-1}\) What is this?
    • Sample variance! \(S_Y^2\)
  • \(R^2_{adj} = 1 - \frac{\hat\sigma^2_\epsilon}{S_Y^2}\)
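As a numeric check of these formulas (a Python sketch of the same arithmetic; the sums of squares come from the anova(model1) output shown earlier):

```python
# model1 on the Perch data: sums of squares from the anova(model1) table
ss_model = 6544329
sse = 101765
ss_total = ss_model + sse   # SSTotal = SSModel + SSE
n, k = 56, 3                # 56 fish, 3 predictors (Length, Width, Length:Width)

r2 = ss_model / ss_total
r2_adj = 1 - (sse / (n - k - 1)) / (ss_total / (n - 1))
print(round(r2, 3), round(r2_adj, 3))
```

Both values match the Multiple and Adjusted R-squared reported by summary(model1).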

🛠 \(R^2_{adj}\) for Multiple Linear Regression

  • \(R^2_{adj} = 1 - \frac{SSE/(n-k-1)}{SSTotal / (n-1)}\)
  • The denominator stays the same for all models fit to the same response variable and data
  • The numerator can actually increase when a new predictor is added to a model, if the decrease in the SSE is not sufficient to offset the decrease in the error degrees of freedom
  • So \(R^2_{adj}\) can 👇 when a weak predictor is added to a model
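To see the contrast concretely, here is a small simulated sketch in Python (numpy assumed; x2 is pure noise, unrelated to y). \(R^2\) cannot go down when x2 is added, while \(R^2_{adj}\) may or may not, depending on the draw:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
x1 = rng.normal(size=n)
y = 2 * x1 + rng.normal(size=n)
x2 = rng.normal(size=n)  # pure noise predictor, unrelated to y

def r2s(predictors, y):
    """Return (R^2, adjusted R^2) for an OLS fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ beta) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    k = X.shape[1] - 1
    return 1 - sse / sst, 1 - (sse / (n - k - 1)) / (sst / (n - 1))

r2_small, adj_small = r2s([x1], y)
r2_full, adj_full = r2s([x1, x2], y)
print(r2_small, r2_full)    # R^2 for the larger model is never smaller
print(adj_small, adj_full)  # adjusted R^2 can move in either direction
```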

🛠 \(R^2_{adj}\) for Multiple Linear Regression

summary(model1)

Call:
lm(formula = Weight ~ Length + Width + Length * Width, data = Perch)

Residuals:
    Min      1Q  Median      3Q     Max 
-140.11  -12.23    1.23    8.49  181.41 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   113.935     58.784    1.94    0.058 .  
Length         -3.483      3.152   -1.10    0.274    
Width         -94.631     22.295   -4.24  9.1e-05 ***
Length:Width    5.241      0.413   12.69  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 44.2 on 52 degrees of freedom
Multiple R-squared:  0.985, Adjusted R-squared:  0.984 
F-statistic: 1.11e+03 on 3 and 52 DF,  p-value: <2e-16
summary(model2)

Call:
lm(formula = Weight ~ Length + Width + Length * Width + I(Length^2) + 
    I(Width^2), data = Perch)

Residuals:
    Min      1Q  Median      3Q     Max 
-117.17  -11.90    2.82   11.56  157.60 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept)   156.349     61.415    2.55    0.014 *
Length        -25.001     14.273   -1.75    0.086 .
Width          20.977     82.588    0.25    0.801  
I(Length^2)     1.572      0.724    2.17    0.035 *
I(Width^2)     34.406     18.745    1.84    0.072 .
Length:Width   -9.776      7.145   -1.37    0.177  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 43.1 on 50 degrees of freedom
Multiple R-squared:  0.986, Adjusted R-squared:  0.985 
F-statistic:  705 on 5 and 50 DF,  p-value: <2e-16

Model Comparison Criteria

  • We are looking for reasonable ways to balance “goodness of fit” (how well the model fits the data) with “parsimony”
  • \(R^2_{adj}\) gets at this by adding a penalty for adding variables
  • AIC and BIC are two more methods that balance goodness of fit and parsimony

Log Likelihood

  • Both AIC and BIC are calculated using the log likelihood
  • \(\log(\mathcal{L}) = -\frac{n}{2}[\log(2\pi) + \log(SSE/n) + 1]\)
    • \(\log = \log_e\), log() in R
logLik(model1)
'log Lik.' -290 (df=5)
-56 / 2 * (log(2 * pi) + log(101765 / 56) + 1)
[1] -290
  • “goodness of fit” measure
  • higher log likelihood is better
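The formula can be verified for model1 with the same arithmetic as the R chunk above (a Python sketch; n = 56 and SSE = 101765 from the Perch output):

```python
from math import log, pi

n, sse = 56, 101765  # model1: 56 perch, residual SS from anova(model1)
log_lik = -n / 2 * (log(2 * pi) + log(sse / n) + 1)
print(round(log_lik))  # matches logLik(model1): -290
```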

Log Likelihood

What I want you to remember

  • Both AIC and BIC are calculated using the log likelihood

\[\log(\mathcal{L}) = -\frac{n}{2}[\log(SSE/n) ]+\textrm{some constant}\]

  • \(\log = \log_e\), log() in R
  • “goodness of fit” measure
  • higher log likelihood is better

AIC

  • Akaike’s Information Criterion
  • \(AIC = 2(k+1) - 2\log(\mathcal{L})\)
  • \(k\) is the number of predictors in the model
  • lower AIC values are better
AIC(model1)
[1] 589
AIC(model2)
[1] 588

BIC

  • Bayesian Information Criterion
  • \(BIC = \log(n)(k+1) - 2\log(\mathcal{L})\)
  • \(k\) is the number of predictors in the model
  • lower BIC values are better
BIC(model1)
[1] 599
BIC(model2)
[1] 602
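A numeric check of both criteria for model1 (Python sketch). One detail worth noting: R’s AIC()/BIC() count the error variance \(\hat\sigma^2_\epsilon\) as an extra parameter, so the parameter count here is \(k + 2 = 5\), matching the df=5 printed by logLik(model1):

```python
from math import log, pi

n, sse, k = 56, 101765, 3  # model1: n, residual SS, number of predictors
log_lik = -n / 2 * (log(2 * pi) + log(sse / n) + 1)
n_par = k + 2              # slopes + intercept + error variance (R's convention)

aic = 2 * n_par - 2 * log_lik
bic = log(n) * n_par - 2 * log_lik
print(round(aic), round(bic))  # matches AIC(model1) and BIC(model1): 589, 599
```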

AIC and BIC can disagree!

AIC(model1)
[1] 589
AIC(model2)
[1] 588
BIC(model1)
[1] 599
BIC(model2)
[1] 602
  • The penalty term is larger in BIC than in AIC
  • What to do? Both are valid; pre-specify in the methods section of your analysis which one you will use, before running your models

🛠 toolkit for comparing models

👉 F-test

👉 \(\Large R^2\)

👉 \(AIC\)

👉 \(BIC\)

Application Exercise

  1. Copy the following template into RStudio Pro:
https://github.com/sta-112-f22/appex-15.git
  2. Fit a model predicting GPA using high school GPA (HSGPA) and verbal SAT score (SATV). Save this as model1.

  3. Fit a model predicting GPA using high school GPA (HSGPA), verbal SAT score (SATV), and math SAT score (SATM). Save this as model2.

  4. Fit a model predicting GPA using high school GPA (HSGPA), verbal SAT score (SATV), math SAT score (SATM), and number of humanities credits taken in high school (HU). Save this as model3.

  5. Conduct a “nested F-test” comparing Model 1 to Model 2. What are the null and alternative hypotheses? Is model 2 significantly better than model 1?

  6. Conduct a “nested F-test” comparing Model 2 to Model 3. What are the null and alternative hypotheses? Is model 3 significantly better than model 2?

  7. Choose AIC or BIC to compare models 1, 2, and 3. Rank the models.

08:00