Variable Transformations

Lucy D’Agostino McGowan

Adjusting for confounders

  • What is the relationship between average SAT scores and average teacher salaries?
  • Are we doing inference or prediction?

Adjusting for confounders

  • I fit a linear model for \(\hat{sat} = \hat\beta_0 + \hat\beta_1 salary\)
lm(sat ~ salary, SAT)

Call:
lm(formula = sat ~ salary, data = SAT)

Coefficients:
(Intercept)       salary  
    1158.86        -5.54  
  • How do we interpret this result?

Adjusting for confounders

  • There is a third variable, the fraction of students that took the SAT in that state. It is grouped as “Low”, “Medium”, and “High”.
SAT <- SAT %>%
  mutate(frac_group = case_when(
    frac < 22 ~ "LOW",
    frac < 49 ~ "MED",
    TRUE ~ "HIGH"
  ))
lm(sat ~ salary + frac_group, SAT)

Call:
lm(formula = sat ~ salary + frac_group, data = SAT)

Coefficients:
  (Intercept)         salary  frac_groupLOW  frac_groupMED  
      851.866          1.089        150.379         38.636  
  • What is the referent category?
  • How do you interpret the \(\hat{\beta}\) for frac_groupLOW?
  • How do you interpret the \(\hat{\beta}\) for salary now?
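
One way to check the referent category: when `lm()` is given a character variable it treats it as a factor, and by default the first level in sorted order is the referent. The short vector below is a hypothetical stand-in for `frac_group`, not the SAT data.

```r
# Hypothetical stand-in for the frac_group column (not the real SAT data)
frac_group <- c("LOW", "MED", "HIGH", "LOW", "HIGH")

# factor() sorts the unique values, so the levels are HIGH, LOW, MED
f <- factor(frac_group)
levels(f)

# The first level is the referent: it gets no coefficient of its own
referent <- levels(f)[1]

# relevel() chooses a different referent, for example "LOW"
f_low <- relevel(f, ref = "LOW")
levels(f_low)
```

This matches the output above, where only `frac_groupLOW` and `frac_groupMED` appear: each is a difference from the HIGH group.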

\(\hat\beta\) interpretation in multiple linear regression

The coefficient for \(x\) is \(\hat\beta\) (95% CI: \(LB_{\hat\beta}, UB_{\hat\beta}\)). A one-unit increase in \(x\) yields an expected increase in \(y\) of \(\hat\beta\), holding all other variables constant.

\(\hat\beta\) interpretation in multiple linear regression

The coefficient for average salary is 1.09 (95% CI: -0.90, 3.08). A $1,000 increase in average salary yields an expected increase in average SAT score of 1.09, holding the fraction of students that took the SAT constant.

Adjusting for confounders

ggplot(SAT, aes(salary, sat)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)

Adjusting for confounders

ggplot(SAT, aes(salary, sat, color = frac_group, group = frac_group)) + 
  geom_point() + 
  geom_line(aes(y = predict(lm(sat ~ salary + frac_group, data = SAT)))) +
  labs(color = "Fraction took SAT")
  • What is it called when the direction of a relationship reverses after adjusting for a third variable?
  • Notice that the lines here are parallel: holding the group constant, the common slope is the effect of salary that we see.
  • 😱 what if the lines aren’t parallel?
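
The reversal can be reproduced with a small simulated data set. This is a sketch under stated assumptions: the numbers are invented for illustration (they are not the SAT data), and the within-group slope is set to exactly 1 by construction.

```r
# Simulated data (hypothetical, not the SAT data): within each group,
# sat rises by exactly 1 per unit of salary, but the groups with
# higher salaries start from lower baselines.
toy <- data.frame(
  salary = c(20:30, 30:40, 40:50),
  frac_group = rep(c("LOW", "MED", "HIGH"), each = 11)
)
baseline <- c(LOW = 1100, MED = 1000, HIGH = 900)
toy$sat <- unname(baseline[toy$frac_group]) + 1 * toy$salary

# Unadjusted slope: negative, driven by the differences between groups
unadjusted <- unname(coef(lm(sat ~ salary, data = toy))["salary"])

# Adjusted slope: positive (1 by construction), the within-group effect
adjusted <- unname(coef(lm(sat ~ salary + frac_group, data = toy))["salary"])

c(unadjusted = unadjusted, adjusted = adjusted)
```

Because the model has no interaction term, the adjusted fit forces one common slope for all three groups, which is why the fitted lines are parallel.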

Interactions

  • Data looking at the growth rate for kids

Interactions

ggplot(Kids198, aes(Age, Weight)) +
  geom_point()
  • Will \(\hat\beta_{age}\) be positive or negative?

Interactions

  • Let’s look at this relationship split by sex (blue: Girl, black: Boy)
ggplot(Kids198, aes(Age, Weight, color = Sex)) +
  geom_point() +
  theme(legend.position = "none")

Interactions

  • Let’s look at this relationship split by sex (blue: Girl, black: Boy)
ggplot(Kids198, aes(Age, Weight, color = Sex, group = Sex)) +
  geom_point() +
  theme(legend.position = "none") + 
  geom_smooth(method = "lm", se = FALSE)
  • 😱 The lines cross! That means there is an interaction; that is, the slopes differ by group.

Interactions

  • Let’s look at this relationship split by sex (blue: Girl, black: Boy)
ggplot(Kids198, aes(Age, Weight, color = Sex, group = Sex)) +
  geom_point() +
  theme(legend.position = "none") + 
  geom_smooth(method = "lm", se = FALSE)
  • What is the equation for this relationship?

Interactions

\(Weight = \beta_0 + \beta_1 Age + \beta_2 Girl + \beta_3 Age \times Girl + \epsilon\)

lm(Weight ~ Age + Sex + Age * Sex, data = Kids198)

Call:
lm(formula = Weight ~ Age + Sex + Age * Sex, data = Kids198)

Coefficients:
(Intercept)          Age          Sex      Age:Sex  
   -33.6925       0.9087      31.8506      -0.2812  
  • What does this model become for boys (when Sex = 0)?
    • \(Weight = \beta_0 + \beta_1 Age + \epsilon\)
  • What does this model become for girls (when Sex = 1)?
    • \(Weight = \beta_0 + \beta_1 Age + \beta_2 1 + \beta_3 Age \times 1 + \epsilon\)
    • \(Weight = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) Age + \epsilon\)
  • How do you interpret \(\hat\beta_0\) now?
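
A note on the formula syntax: in an R formula, `Age * Sex` already expands to `Age + Sex + Age:Sex`, so the longer formula used on this slide fits the same model. The sketch below checks this on simulated data; the data frame and its coefficients are hypothetical, not Kids198.

```r
# Simulated data (hypothetical, not Kids198); Sex is a 0/1 indicator
set.seed(1)
toy <- data.frame(
  Age = rep(seq(100, 200, by = 10), times = 2),
  Sex = rep(c(0, 1), each = 11)
)
toy$Weight <- -30 + 0.9 * toy$Age + 30 * toy$Sex -
  0.3 * toy$Age * toy$Sex + rnorm(nrow(toy), sd = 5)

# Three ways to write the same interaction model
fit_long  <- lm(Weight ~ Age + Sex + Age * Sex, data = toy)
fit_short <- lm(Weight ~ Age * Sex, data = toy)
fit_colon <- lm(Weight ~ Age + Sex + Age:Sex, data = toy)

# All three give identical coefficients
all.equal(coef(fit_long), coef(fit_short))
all.equal(coef(fit_long), coef(fit_colon))
```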

Interactions

lm(Weight ~ Age + Sex + Age * Sex, data = Kids198)

Call:
lm(formula = Weight ~ Age + Sex + Age * Sex, data = Kids198)

Coefficients:
(Intercept)          Age          Sex      Age:Sex  
   -33.6925       0.9087      31.8506      -0.2812  
  • What does this model become for boys (when Sex = 0)?
    • \(Weight = \beta_0 + \beta_1 Age + \epsilon\)
  • What does this model become for girls (when Sex = 1)?
    • \(Weight = \beta_0 + \beta_1 Age + \beta_2 1 + \beta_3 Age \times 1 + \epsilon\)
    • \(Weight = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) Age + \epsilon\)
  • How do you interpret \(\hat\beta_{2}\) now?
  • The difference in intercepts between boys and girls

Interactions

lm(Weight ~ Age + Sex + Age * Sex, data = Kids198)

Call:
lm(formula = Weight ~ Age + Sex + Age * Sex, data = Kids198)

Coefficients:
(Intercept)          Age          Sex      Age:Sex  
   -33.6925       0.9087      31.8506      -0.2812  
  • What does this model become for boys (when Sex = 0)?
    • \(Weight = \beta_0 + \beta_1 Age + \epsilon\)
  • What does this model become for girls (when Sex = 1)?
    • \(Weight = \beta_0 + \beta_1 Age + \beta_2 1 + \beta_3 Age \times 1 + \epsilon\)
    • \(Weight = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) Age + \epsilon\)
  • How do you interpret \(\hat\beta_{3}\) now?
  • How much the slope changes as we move from the regression line for boys to that for girls
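
The two group-specific lines can be computed directly from the fitted coefficients. The sketch below copies the rounded estimates from the `lm()` output above; the variable names are just for illustration.

```r
# Coefficients copied from the lm() output above
b0 <- -33.6925  # (Intercept)
b1 <- 0.9087    # Age
b2 <- 31.8506   # Sex
b3 <- -0.2812   # Age:Sex

# Boys (Sex = 0): intercept b0, slope b1
boys <- c(intercept = b0, slope = b1)

# Girls (Sex = 1): intercept b0 + b2, slope b1 + b3
girls <- c(intercept = b0 + b2, slope = b1 + b3)

rbind(boys, girls)
# girls: intercept -1.8419, slope 0.6275
```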

Interactions

\(Weight = \beta_0 + \beta_1 Age + \beta_2 Girl + \beta_3 Age \times Girl + \epsilon\)

  • Hypothesis testing: What if you want to test whether the slope is different between groups?
  • Is the growth rate different for boys and girls?
  • What is \(H_0\)?
  • \(H_0: \beta_3 = 0\)
  • What is \(H_A\)?
    • \(H_A:\beta_3 \neq 0\)

Interactions

lm(Weight ~ Age + Sex + Age * Sex, data = Kids198) %>%
  summary()

Call:
lm(formula = Weight ~ Age + Sex + Age * Sex, data = Kids198)

Residuals:
    Min      1Q  Median      3Q     Max 
-46.884 -12.055  -2.782  10.185  58.581 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -33.69254   10.00727  -3.367 0.000917 ***
Age           0.90871    0.06106  14.882  < 2e-16 ***
Sex          31.85057   13.24269   2.405 0.017106 *  
Age:Sex      -0.28122    0.08164  -3.445 0.000700 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.19 on 194 degrees of freedom
Multiple R-squared:  0.6683,    Adjusted R-squared:  0.6631 
F-statistic: 130.3 on 3 and 194 DF,  p-value: < 2.2e-16
  • What is the result of our hypothesis test?
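
The test statistic and p-value in the `Age:Sex` row can be reproduced by hand. This sketch uses only the estimate, standard error, and residual degrees of freedom printed in the summary above.

```r
# Estimate and standard error for Age:Sex, copied from summary() above
estimate  <- -0.28122
std_error <- 0.08164
df_resid  <- 194

# t statistic for H0: beta_3 = 0
t_stat <- estimate / std_error

# Two-sided p-value from the t distribution with 194 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

c(t = t_stat, p = p_value)
```

The p-value is well below 0.05, so at that level we reject \(H_0: \beta_3 = 0\).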

\(\hat\beta\) interpretation for interactions between \(x\) and a binary indicator \(I\)

The coefficient for the interaction between \(x\) and \(I\) is \(\hat\beta\) (95% CI: \(LB_{\hat\beta}, UB_{\hat\beta}\)). This means that the effect of \(x\) on \(y\) differs by \(\hat\beta\) when \(I = 1\) compared to \(I = 0\), holding all other variables constant*.

  • You must include this line if there are additional variables in your model.

\(\hat\beta\) interpretation for interactions between \(x\) and a binary indicator \(I\)

The coefficient for the interaction between Age and Sex is -0.28 (95% CI: -0.44, -0.12). This means that the expected change in Weight for a one-unit increase in Age is lower by 0.28 among girls compared to boys.
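
The confidence interval quoted here can be reconstructed from the summary output as estimate ± critical value × standard error. The sketch below does this by hand; calling `confint()` on the fitted model should give the same interval.

```r
# Estimate, standard error, and residual df for Age:Sex from summary()
estimate  <- -0.28122
std_error <- 0.08164
df_resid  <- 194

# 95% CI: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_error

round(ci, 2)
# -0.44 -0.12
```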

Non-linear relationships

  • Sometimes the relationships between the outcome \(y\) and \(x\) variables are nonlinear.
  • We can use polynomials to address this!
  • Returning to the Diamonds data, let’s say we are interested in predicting Total Price from the Carats.
  • Is this an example of inference or prediction?

Non-linear relationships

data("Diamonds")
ggplot(Diamonds, aes(Carat, TotalPrice)) +
  geom_point()

Non-linear relationships

lm(TotalPrice ~ Carat, data = Diamonds)
data("Diamonds")
ggplot(Diamonds, aes(Carat, TotalPrice)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)

Non-linear relationships

lm(TotalPrice ~ Carat + I(Carat^2), data = Diamonds)

Call:
lm(formula = TotalPrice ~ Carat + I(Carat^2), data = Diamonds)

Coefficients:
(Intercept)        Carat   I(Carat^2)  
     -522.7       2386.0       4498.2  
data("Diamonds")
ggplot(Diamonds, aes(Carat, TotalPrice)) +
  geom_point() + 
  geom_line(aes(y = predict(lm(TotalPrice ~ Carat + I(Carat^2), data = Diamonds))), lwd = 1.5, color = "blue")
  • What is the equation for this relationship?
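
A note on `I(Carat^2)`: inside a formula, `^` is a formula operator rather than arithmetic, so the square has to be wrapped in `I()`. The sketch below illustrates this on simulated data that follows an exact quadratic; the data frame is hypothetical, not Diamonds.

```r
# Simulated data (hypothetical, not Diamonds): y is an exact quadratic in x
toy <- data.frame(x = seq(0.2, 3, by = 0.2))
toy$y <- 1 + 2 * toy$x + 3 * toy$x^2

# With I(), x^2 is treated as arithmetic and gets its own coefficient
fit_quad <- lm(y ~ x + I(x^2), data = toy)
coef(fit_quad)

# Without I(), x^2 collapses to x, so no quadratic term is fit
fit_wrong <- lm(y ~ x + x^2, data = toy)
names(coef(fit_wrong))
```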

Interpreting \(\hat\beta\)s in the presence of polynomials

\(Total Price = \beta_0 + \beta_1 Carat + \beta_2 Carat^2 + \epsilon\)

  • What is the interpretation of \(\hat\beta_1\)?
  • Typically, in multiple linear regression, the interpretation of \(\hat\beta_i\) is: a one-unit change in \(x\) yields an expected change in \(y\) of \(\hat\beta_i\) holding all other variables constant.
    • What does it mean to see a change in Carat holding \(Carat^2\) constant?
  • When you have a polynomial term, you need to specify the values you are changing between, since the change is no longer constant across all values of \(x\).

Interpreting \(\hat\beta\) in the presence of polynomials

lm(TotalPrice ~ Carat + I(Carat^2), data = Diamonds)

Call:
lm(formula = TotalPrice ~ Carat + I(Carat^2), data = Diamonds)

Coefficients:
(Intercept)        Carat   I(Carat^2)  
     -522.7       2386.0       4498.2  

What is the expected change in TotalPrice for a one-unit change in Carat, changing from 0.8 to 1.8?

(-522.7 + 2386 * 1.8 + 4498.2 * 1.8^2) - 
  (-522.7 + 2386 * 0.8 + 4498.2 * 0.8^2)
[1] 14081.32
2386 * (1.8 - 0.8) + 
  4498.2 * (1.8^2 - 0.8^2)
[1] 14081.32

Interpreting \(\hat\beta\) in the presence of polynomials

lm(TotalPrice ~ Carat + I(Carat^2), data = Diamonds) 

Call:
lm(formula = TotalPrice ~ Carat + I(Carat^2), data = Diamonds)

Coefficients:
(Intercept)        Carat   I(Carat^2)  
     -522.7       2386.0       4498.2  

What is the expected change in TotalPrice for a one-unit change in Carat, changing from 1.8 to 2.8?

2386 * (2.8 - 1.8) + 4498.2 * (2.8^2 - 1.8^2)
[1] 23077.72
  • Can we talk about \(\hat\beta_1\) and \(\hat\beta_2\) in the context of a one-unit change in Carat?

Interpreting \(\hat\beta\) in the presence of polynomials

  • \(\hat\beta\) coefficients that are transformations of the same \(x\) variable must be interpreted together
  • You must first choose two values of \(x\) to change between, and then report the change.
  • A sensible choice for the two \(x\) values can be the 25th and 75th percentiles (the 0.25 and 0.75 quantiles).
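
The two comparison values can be pulled out with `quantile()`. The vector below is a hypothetical set of carat values chosen for illustration, not the Diamonds data.

```r
# Hypothetical carat values (not the Diamonds data)
carat <- c(0.3, 0.5, 0.7, 0.9, 1.0, 1.1, 1.24, 1.5, 2.0)

# 0.25 and 0.75 quantiles, using R's default quantile type
q <- quantile(carat, probs = c(0.25, 0.75))

# Values to change between when interpreting the polynomial terms
a <- unname(q[1])
b <- unname(q[2])
c(a = a, b = b)
```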

General \(\hat\beta\) interpretation with quadratic terms

The linear term in the model for \(x\) has a coefficient of \(\hat\beta_1\) (95% CI: \((LB_{\hat\beta_1}, UB_{\hat\beta_1})\)). The quadratic term in the model for \(x\) has a coefficient of \(\hat\beta_2\) (95% CI: \((LB_{\hat\beta_2}, UB_{\hat\beta_2})\)). A change in \(x\) from \(a\) to \(b\) yields an expected change in \(y\) of \(\hat\beta_1 (b - a) + \hat\beta_2 (b^2 - a^2)\) holding all other variables constant*.

  • You must include this line if there are additional variables in your model.
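
The expected change \(\hat\beta_1 (b - a) + \hat\beta_2 (b^2 - a^2)\) can be wrapped in a small helper. `quad_change()` is a hypothetical name introduced here for illustration; the coefficients are the rounded Diamonds estimates shown earlier.

```r
# Expected change in y when x moves from a to b in a model with
# a linear term (b1) and a quadratic term (b2) for x.
# quad_change() is a hypothetical helper, not part of any package.
quad_change <- function(b1, b2, a, b) {
  b1 * (b - a) + b2 * (b^2 - a^2)
}

# Rounded coefficients from the Diamonds model above
b1 <- 2386
b2 <- 4498.2

# The same one-unit change in Carat gives different expected changes
quad_change(b1, b2, a = 0.8, b = 1.8)  # 14081.32
quad_change(b1, b2, a = 1.8, b = 2.8)  # 23077.72
```

The intercept cancels in the difference, which is why it does not appear in the helper.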

Specific \(\hat\beta\) interpretation for \(y = \beta_0 + \beta_1 Carat + \beta_2 Carat^2 + \epsilon\) model

The linear term in the model for Carat has a coefficient of 2386 (95% CI: \((906, 3866)\)). The quadratic term in the model for Carat has a coefficient of \(4498\) (95% CI: \((3981, 5016)\)). A change in Carat from \(0.7\) to \(1.24\) yields an expected change in TotalPrice of \(6000.5\).

  • Why didn’t I say holding all other variables constant?

Takeaways

  • The interpretation of \(\hat\beta\) in multiple linear regression
    • A one-unit change in \(x\) yields an expected change in \(y\) of \(\hat\beta\) holding all other included variables constant
  • If the slope differs between groups (the fitted lines are not parallel in a scatterplot), an interaction is present
  • You can include polynomial terms to address non-linear relationships
    • The coefficients for a polynomial must be interpreted together

Application Exercise

  1. Open appex-14.qmd
  2. Fit the model \(TotalPrice = \beta_0 + \beta_1Carat + \beta_2 Carat^2 + \beta_3 Color+\epsilon\)
  3. Find the 0.25 quantile and 0.75 quantile of Carat
  4. What is the interpretation of \(\hat\beta_1\), \(\hat\beta_2\), and \(\hat\beta_3\)?