Lucy D’Agostino McGowan
\[y = \beta_0 + \beta_1X + \epsilon\]
\[\epsilon\sim N(0,\sigma_\epsilon)\]
\[y = \beta_0 + \beta_1X_1 + \beta_2X_2+\dots+\beta_kX_k+ \epsilon\]
\[\epsilon\sim N(0,\sigma_\epsilon)\]
How are these coefficients estimated?
\[\hat{y} = \hat\beta_0 + \hat\beta_1X_1 + \hat\beta_2X_2+\dots+\hat\beta_kX_k\]
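One way to see how the estimates are obtained: the \(\hat\beta\) values are chosen to minimize the sum of squared residuals. The sketch below is a minimal illustration on simulated data (the variables x1 and x2 and the true coefficient values are made up for this example, not taken from the slides). It solves the normal equations directly and compares the result with lm(), which uses a QR decomposition internally but arrives at the same estimates.

```r
set.seed(42)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)  # hypothetical true model

# Design matrix: a column of 1s for the intercept, then one column per predictor
X <- cbind(1, x1, x2)

# Least squares estimates solve the normal equations (X'X) beta = X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# lm() minimizes the same sum of squared residuals
fit <- lm(y ~ x1 + x2)

# Side-by-side comparison of the two sets of estimates
cbind(by_hand = as.vector(beta_hat), lm = unname(coef(fit)))
```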
What is my response variable? What are my explanatory variables?
What is different between this and the lm() functions we have been previously running?
Call:
lm(formula = Price ~ Mileage + Age, data = PorschePrice)
Residuals:
    Min      1Q  Median      3Q     Max
-18.930  -3.795  -0.309   4.116  12.811
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  70.9192     2.4835  28.556  < 2e-16 ***
Mileage      -0.5613     0.1141  -4.921 3.76e-05 ***
Age          -0.1302     0.4568  -0.285    0.778
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.291 on 27 degrees of freedom
Multiple R-squared: 0.7951, Adjusted R-squared: 0.7799
F-statistic: 52.39 on 2 and 27 DF, p-value: 5.073e-10
How would we get the predicted values for \(\hat{y}\)?
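library(tidyverse)  # provides %>% and mutate()
model <- lm(Price ~ Mileage + Age, data = PorschePrice)
# fitted() returns the predicted value (y-hat) for each row used in the fit
PorschePrice <- PorschePrice %>%
  mutate(y_hat = fitted(model))
head(PorschePrice)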
  Price Age Mileage    y_hat
1  69.4   3    21.5 58.45976
2  56.9   3    43.0 46.39104
3  49.9   2    19.9 59.48812
4  47.4   4    36.0 50.19016
5  42.9   4    44.0 45.69948
6  36.9   6    49.8 42.18328
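As a quick check on what fitted() returns, the first fitted value can be reproduced by hand from the rounded coefficients in the summary output and the first row of the data (Mileage = 21.5, Age = 3). This is only a sketch. The variable names are illustrative, and the small discrepancy from 58.45976 comes from using rounded coefficients.

```r
# Rounded coefficients copied from the summary output
b0        <- 70.9192
b_mileage <- -0.5613
b_age     <- -0.1302

# First car in the data: Mileage = 21.5, Age = 3
y_hat_1 <- b0 + b_mileage * 21.5 + b_age * 3
y_hat_1
```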
The sample size is \(n = 30\). What would the degrees of freedom for the SSE be now?
\[\Large \sqrt{\frac{SSE}{??}}\]
\[\Large \sqrt{\frac{SSE}{n - k - 1}}\]
\[\Large\sqrt{ \frac{SSE}{30 - 2 - 1}}\]
Analysis of Variance Table
Response: Price
          Df Sum Sq Mean Sq  F value    Pr(>F)
Mileage    1 5565.7  5565.7 104.7023 8.653e-11 ***
Age        1    4.3     4.3   0.0813    0.7778
Residuals 27 1435.2    53.2
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
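The pieces of the ANOVA table connect back to the summary output shown earlier. As a sketch using only the rounded numbers printed in the table (so the results agree with the summary only up to rounding), the residual standard error, \(R^2\), adjusted \(R^2\), and overall F statistic can be recomputed as follows. The variable names are illustrative.

```r
n <- 30  # sample size
k <- 2   # number of predictors (Mileage, Age)

ss_mileage <- 5565.7
ss_age     <- 4.3
sse        <- 1435.2                      # Residuals row
sst        <- ss_mileage + ss_age + sse   # total sum of squares

rse    <- sqrt(sse / (n - k - 1))                          # residual standard error
r2     <- 1 - sse / sst                                    # multiple R-squared
r2_adj <- 1 - (sse / (n - k - 1)) / (sst / (n - 1))        # adjusted R-squared
f_stat <- ((ss_mileage + ss_age) / k) / (sse / (n - k - 1))  # overall F statistic

round(c(rse = rse, r2 = r2, r2_adj = r2_adj, F = f_stat), 4)
```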
Goal: Discover the relationship between a response (outcome, \(y\)) and an explanatory variable (\(x\))
Goal: Discover the relationship between a response (outcome, \(y\)) and an explanatory variable (\(x\)), adjusting for all known confounders
What is a confounder?
A confounder is a variable that is associated with both the response variable (\(y\)) and the explanatory variable (\(x\)). If it is not accounted for, it can produce a spurious relationship between \(x\) and \(y\).
Armstrong, K. A. (2012). Methods in comparative effectiveness research. Journal of Clinical Oncology, 30(34), 4208-14.
data("data set")
.csv file? read_csv()
read_csv() is a function from the readr package, which is included when you load the tidyverse
Slides adapted from datasciencebox.org by Dr. Lucy D’Agostino McGowan
Application Exercise
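library(tidyverse)  # provides tibble() and ggplot()
set.seed(1)
# Simulate four groups: x shifts up by 2 per group while y shifts down by 10 per group
data <- tibble(
  x = c(rnorm(25), rnorm(25, 2), rnorm(25, 4), rnorm(25, 6)),
  group = rep(1:4, each = 25),
  y = 5 + 2.5 * x - 10 * group + rnorm(100, 0, 5)
)
# Scatterplot with one least squares line per group
ggplot(data, aes(x, y, color = group)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, aes(group = group)) +
  theme(legend.position = "none")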
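To put numbers on what the plot suggests, the sketch below refits the same kind of simulated data with and without group in the model. It uses a base R data.frame in place of tibble() so that it runs without extra packages, and the object names sim, unadjusted, and adjusted are illustrative. In the simulation the slope for x within a group is 2.5, but group pushes x up and y down, so leaving it out can flip the sign of the estimated slope.

```r
set.seed(1)
x     <- c(rnorm(25), rnorm(25, 2), rnorm(25, 4), rnorm(25, 6))
group <- rep(1:4, each = 25)
y     <- 5 + 2.5 * x - 10 * group + rnorm(100, 0, 5)
sim   <- data.frame(x, group, y)

# Ignoring the confounder: slope for x mixes the x effect with the group effect
unadjusted <- lm(y ~ x, data = sim)

# Adjusting for the confounder: slope for x is estimated within levels of group
adjusted <- lm(y ~ x + group, data = sim)

coef(unadjusted)["x"]  # expected to be negative here
coef(adjusted)["x"]    # expected to be positive, near the simulated 2.5
```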
How do you calculate a t statistic for \(\hat{\beta}_2\)?
Call:
lm(formula = Price ~ Mileage + Age, data = PorschePrice)
Coefficients:
(Intercept)      Mileage          Age
    70.9192      -0.5613      -0.1302
What are the null and alternative hypotheses?
What would the degrees of freedom be for the t-distribution used to calculate a p-value?
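A sketch of the calculation for the Age coefficient, using the rounded estimate and standard error from the summary output shown earlier: the t statistic is the estimate divided by its standard error, and the two-sided p-value comes from a t-distribution with \(n - k - 1\) degrees of freedom, testing \(H_0: \beta_2 = 0\) against \(H_A: \beta_2 \neq 0\). The variable names are illustrative.

```r
estimate  <- -0.1302   # Age coefficient from the summary output
std_error <- 0.4568    # its standard error
df        <- 30 - 2 - 1  # n - k - 1

t_stat  <- estimate / std_error
p_value <- 2 * pt(-abs(t_stat), df = df)  # two-sided p-value

round(c(t = t_stat, p = p_value), 3)
```

The results should agree with the t value and Pr(>|t|) columns of the summary output up to rounding.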
How would you calculate a confidence interval for \(\beta_i\)?
                 2.5 %     97.5 %
(Intercept) 65.8234089 76.0149140
Mileage     -0.7953816 -0.3272903
Age         -1.0675929  0.8071415
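The intervals printed above are consistent with the form \(\hat\beta_i \pm t^* \times SE_{\hat\beta_i}\), where \(t^*\) is the 97.5th percentile of a t-distribution with \(n - k - 1 = 27\) degrees of freedom. The sketch below rebuilds the Mileage and Age rows from the rounded summary output, so it matches the printed limits only up to rounding. The variable names are illustrative.

```r
estimate  <- c(Mileage = -0.5613, Age = -0.1302)
std_error <- c(Mileage = 0.1141, Age = 0.4568)

# Critical value for a 95% interval with 27 degrees of freedom
t_star <- qt(0.975, df = 30 - 2 - 1)

ci <- cbind(lower = estimate - t_star * std_error,
            upper = estimate + t_star * std_error)
round(ci, 4)
```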
Application Exercise
Open appex-13.qmd
Using the NFL2007Standings data, create a model that predicts WinPct from PointsFor.
Examine the \(R^2\) and \(R^2_{adj}\) values.
Using the NFL2007Standings data, create a model that predicts WinPct from PointsFor AND PointsAgainst.
Examine the \(R^2\) and \(R^2_{adj}\) values.
Which model do you think is better for predicting win percent?
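For the comparison, both values can be pulled directly from summary(). The sketch below uses the built-in mtcars data rather than NFL2007Standings, so the variables mpg, wt, and qsec and the object names are stand-ins; the same pattern should carry over to the exercise models.

```r
# One-predictor and two-predictor models on a stand-in data set
one_pred <- lm(mpg ~ wt, data = mtcars)
two_pred <- lm(mpg ~ wt + qsec, data = mtcars)

# Collect R-squared and adjusted R-squared for each model
r2_table <- data.frame(
  model  = c("one predictor", "two predictors"),
  r2     = c(summary(one_pred)$r.squared, summary(two_pred)$r.squared),
  r2_adj = c(summary(one_pred)$adj.r.squared, summary(two_pred)$adj.r.squared)
)
r2_table
```

\(R^2\) never decreases when a predictor is added, while \(R^2_{adj}\) penalizes extra predictors, which is why the adjusted value is the more useful one when comparing models of different sizes.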