\[y = \beta_0 + \beta_1X + \epsilon\]
\[\epsilon\sim N(0,\sigma_\epsilon)\]
\[y = \beta_0 + \beta_1X_1 + \beta_2X_2+\dots+\beta_kX_k+ \epsilon\]
\[\epsilon\sim N(0,\sigma_\epsilon)\]
How are these coefficients estimated?
\[\hat{y} = \hat\beta_0 + \hat\beta_1X_1 + \hat\beta_2X_2+\dots+\hat\beta_kX_k\]
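As a reminder rather than something derived in these slides: lm() fits the model by ordinary least squares, which picks the estimates \(\hat\beta_0, \dots, \hat\beta_k\) that make the sum of squared residuals as small as possible. The observation index \(i\) and the notation \(X_{ij}\) (value of predictor \(j\) for observation \(i\)) are introduced here for illustration only.

\[SSE = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 = \sum_{i=1}^{n}\left(y_i - \hat\beta_0 - \hat\beta_1X_{i1} - \dots - \hat\beta_kX_{ik}\right)^2\]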
What is my response variable? What are my explanatory variables?
What is different between this and the lm() calls we have been running previously?
Call:
lm(formula = Price ~ Mileage + Age, data = PorschePrice)

Residuals:
    Min      1Q  Median      3Q     Max
-18.930  -3.795  -0.309   4.116  12.811

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  70.9192     2.4835  28.556  < 2e-16 ***
Mileage      -0.5613     0.1141  -4.921 3.76e-05 ***
Age          -0.1302     0.4568  -0.285    0.778
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.291 on 27 degrees of freedom
Multiple R-squared:  0.7951,  Adjusted R-squared:  0.7799
F-statistic: 52.39 on 2 and 27 DF,  p-value: 5.073e-10
How would we get the predicted values for \(\hat{y}\)?
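library(dplyr)  # provides %>% and mutate()

# Fit the two-predictor model, then store its fitted values as a new column
model <- lm(Price ~ Mileage + Age, data = PorschePrice)
PorschePrice <- PorschePrice %>%
  mutate(y_hat = fitted(model))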
  Price Age Mileage    y_hat
1  69.4   3    21.5 58.45976
2  56.9   3    43.0 46.39104
3  49.9   2    19.9 59.48812
4  47.4   4    36.0 50.19016
5  42.9   4    44.0 45.69948
6  36.9   6    49.8 42.18328
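To see where a fitted value comes from, the first row can be worked by hand: plug that car's Mileage (21.5) and Age (3) into the fitted equation using the rounded estimates from the summary output. The small gap from the printed 58.45976 is only rounding in the coefficients.

\[\hat{y}_1 = 70.9192 - 0.5613(21.5) - 0.1302(3) \approx 58.46\]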
The sample size is \(n = 30\). What are the degrees of freedom for the SSE now?
\[\Large \sqrt{\frac{SSE}{??}}\]
With \(k = 2\) predictors plus an intercept, the SSE has \(n - k - 1 = 30 - 2 - 1 = 27\) degrees of freedom:
\[\Large\sqrt{ \frac{SSE}{30 - 2 - 1}}\]
Analysis of Variance Table

Response: Price
          Df Sum Sq Mean Sq  F value    Pr(>F)
Mileage    1 5565.7  5565.7 104.7023 8.653e-11 ***
Age        1    4.3     4.3   0.0813    0.7778
Residuals 27 1435.2    53.2
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
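The ANOVA table ties back to the summary output. Taking the Residuals row as the SSE (1435.2 on 27 degrees of freedom) reproduces the residual standard error, and pooling the Mileage and Age rows into a single model sum of squares on 2 degrees of freedom reproduces the overall F-statistic. The symbol \(\hat\sigma_\epsilon\) is used here for the residual standard error as an assumed notation.

\[\hat\sigma_\epsilon = \sqrt{\frac{SSE}{n - k - 1}} = \sqrt{\frac{1435.2}{27}} \approx 7.29\]

\[F = \frac{(5565.7 + 4.3)/2}{1435.2/27} \approx 52.4\]

Both agree with the reported values (7.291 and 52.39) up to rounding.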
Goal: Discover the relationship between a response (outcome, \(y\)) and an explanatory variable (\(x\))
Goal: Discover the relationship between a response (outcome, \(y\)) and an explanatory variable (\(x\)), adjusting for all known confounders
What is a confounder?
A confounder is a variable that is associated with both the response variable (\(y\)) and the explanatory variable (\(x\)). If it is not accounted for, it can produce a spurious relationship between \(x\) and \(y\).
Application Exercise
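library(tibble)
library(ggplot2)

# Simulate four groups whose x values and intercepts both shift with group
data <- tibble(
  x = c(rnorm(25), rnorm(25, 2), rnorm(25, 4), rnorm(25, 6)),
  group = rep(1:4, each = 25),
  y = 5 + 2.5 * x - 10 * group + rnorm(100, 0, 5)
)

# Plot y against x with a separate least-squares line for each group
ggplot(data, aes(x, y, color = group)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, aes(group = group)) +
  theme(legend.position = "none")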
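The effect of the confounder can also be seen numerically. The sketch below regenerates the same kind of simulated data in base R (so it runs without extra packages) and compares the slope on x with and without group in the model. The seed and the object names `sim`, `fit_unadjusted`, and `fit_adjusted` are illustrative choices, not part of the original slides.

```r
# Same simulation design as the plot: x rises with group, while group lowers y.
set.seed(1)  # arbitrary seed, used only so the run is reproducible

sim <- data.frame(
  x = c(rnorm(25), rnorm(25, 2), rnorm(25, 4), rnorm(25, 6)),
  group = rep(1:4, each = 25)
)
sim$y <- 5 + 2.5 * sim$x - 10 * sim$group + rnorm(100, 0, 5)

# Model that ignores the confounder
fit_unadjusted <- lm(y ~ x, data = sim)

# Model that adjusts for the confounder
fit_adjusted <- lm(y ~ x + group, data = sim)

slope_unadjusted <- coef(fit_unadjusted)[["x"]]
slope_adjusted <- coef(fit_adjusted)[["x"]]

print(c(unadjusted = slope_unadjusted, adjusted = slope_adjusted))
```

Within each group the data were generated with a slope of 2.5, so the adjusted estimate should land in that neighborhood, while the unadjusted slope typically comes out negative because group is left to masquerade as an effect of x.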
How do you calculate a t statistic for \(\hat{\beta}_2\)?
lm(formula = Price ~ Mileage + Age, data = PorschePrice)
(Intercept)      Mileage          Age
    70.9192      -0.5613      -0.1302
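For the usual test against a null value of zero, the t statistic is the estimate divided by its standard error. Using the Age row of the summary output (estimate -0.1302, standard error 0.4568):

\[t = \frac{\hat\beta_2}{SE(\hat\beta_2)} = \frac{-0.1302}{0.4568} \approx -0.285\]

This matches the t value that summary() prints for Age.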
What are the null and alternative hypotheses?
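For the two-sided test that lm() reports by default, the hypotheses for the Age coefficient (with Mileage already in the model) can be written as:

\[H_0: \beta_2 = 0 \qquad H_A: \beta_2 \neq 0\]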
What would the degrees of freedom be for the t-distribution used to calculate a p-value?
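A minimal sketch of the calculation, assuming the two-sided test of \(\beta_2 = 0\) and copying the rounded t value for Age from the summary output. The variable names are illustrative.

```r
n <- 30   # sample size stated in the slides
k <- 2    # number of predictors: Mileage and Age

# Residual degrees of freedom: one is lost for each predictor and one for the intercept
df_resid <- n - k - 1

t_age <- -0.285   # t value for Age, copied from the summary output

# Two-sided p-value: twice the lower-tail area beyond |t|
p_value <- 2 * pt(-abs(t_age), df = df_resid)

print(df_resid)
print(round(p_value, 3))
```

The result should agree with the Pr(>|t|) column for Age, up to the rounding of the t value.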
How would you calculate a confidence interval for \(\beta_i\)?
                 2.5 %     97.5 %
(Intercept) 65.8234089 76.0149140
Mileage     -0.7953816 -0.3272903
Age         -1.0675929  0.8071415
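Each interval in the table has the form estimate ± t* × standard error, where t* is the 97.5th percentile of a t-distribution with the residual degrees of freedom. The sketch below rebuilds the Mileage interval by hand from the rounded summary output; the variable names are illustrative.

```r
estimate <- -0.5613   # Mileage coefficient from the summary output
std_error <- 0.1141   # its standard error
df_resid <- 27        # n - k - 1 = 30 - 2 - 1

# Critical value for a 95% interval
t_star <- qt(0.975, df = df_resid)

# Lower and upper limits: estimate -/+ margin of error
ci <- estimate + c(-1, 1) * t_star * std_error

print(round(ci, 3))
```

Because the inputs are rounded, the limits agree with the Mileage row above only to about three decimal places.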
Application Exercise
Open appex-13.qmd
Using the NFL2007Standings data, create a model that predicts WinPct from PointsFor.
Examine the \(R^2\) and \(R^2_{adj}\) values.
Using the NFL2007Standings data, create a model that predicts WinPct from PointsFor AND PointsAgainst.
Examine the \(R^2\) and \(R^2_{adj}\) values.
Which model do you think is better for predicting win percent?
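When comparing the two models, it helps to recall how the two measures differ: \(R^2\) can never decrease when a predictor is added, whereas \(R^2_{adj}\) charges for each extra predictor through the degrees of freedom. A common form of the adjusted version, written with \(SSTotal\) as an assumed name for the total sum of squares, is:

\[R^2_{adj} = 1 - \frac{SSE/(n - k - 1)}{SSTotal/(n - 1)}\]

As a check against the Porsche model from earlier, where \(SSTotal = 5565.7 + 4.3 + 1435.2 = 7005.2\):

\[R^2_{adj} = 1 - \frac{1435.2/27}{7005.2/29} \approx 0.78\]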