Prediction intervals

Lucy D’Agostino McGowan

Confidence intervals

If we used the same sampling method to select many different samples and computed an interval estimate from each sample, we would expect the true population parameter ( \(\beta_1\) ) to fall within the interval estimates 95% of the time.
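The repeated-sampling idea above can be checked with a small simulation. This is only a sketch: the true slope, intercept, noise level, and sample size below are made-up values chosen for illustration, not estimates from the magnolia data.

```r
# Hypothetical simulation: all population values here are assumptions
set.seed(1)
true_beta1 <- 0.5
n_sims <- 1000

covers <- replicate(n_sims, {
  # Draw a fresh sample from the same (simulated) population
  x <- runif(50, min = 2, max = 12)
  y <- 10 + true_beta1 * x + rnorm(50, sd = 5)
  # 95% confidence interval for the slope from this sample
  ci <- confint(lm(y ~ x))["x", ]
  # Does this interval contain the true slope?
  ci[1] <= true_beta1 && true_beta1 <= ci[2]
})

# Proportion of intervals that contain the true slope
coverage <- mean(covers)
coverage
```

With many simulated samples, `coverage` should land close to 0.95, though it will not be exactly 0.95 in any one run.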

Confidence interval for \(\beta_1\)

How do we calculate the confidence interval for the slope?

\[\hat\beta_1\pm t^*SE_{\hat\beta_1}\]

How do we calculate it in R?

With the confint function:

mod <- lm(leaf_length ~ leaf_width, magnolia_data)
summary(mod)


Call:
lm(formula = leaf_length ~ leaf_width, data = magnolia_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.4544  -3.2196  -0.0287   3.1761  12.6086 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  11.8362     1.3956   8.481 2.36e-13 ***
leaf_width    0.4386     0.1552   2.826  0.00571 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.327 on 98 degrees of freedom
Multiple R-squared:  0.07537,   Adjusted R-squared:  0.06593 
F-statistic: 7.988 on 1 and 98 DF,  p-value: 0.005707

confint(mod)

                2.5 %     97.5 %
(Intercept) 9.0666420 14.6058257
leaf_width  0.1306539  0.7466103

How do we calculate it in R?

“by hand”

t_star <- qt(0.025, df = nrow(magnolia_data) - 2, lower.tail = FALSE)
# or
t_star <- qt(0.975, df =  nrow(magnolia_data) - 2)

0.4386 - t_star * 0.1552

[1] 0.1306107

0.4386 + t_star * 0.1552

[1] 0.7465893
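The by-hand limits differ from the confint output in the fourth decimal place because 0.4386 and 0.1552 are rounded. One way to avoid that is to pull the unrounded estimate and standard error from the fitted model. The sketch below uses a simulated stand-in, `sim_data`, since the real magnolia_data is not included here; the names `sim_data` and `sim_mod` and the simulated values are assumptions.

```r
# Hypothetical stand-in for magnolia_data (simulated, not the real data)
set.seed(2)
sim_data <- data.frame(leaf_width = runif(100, min = 2, max = 15))
sim_data$leaf_length <- 12 + 0.45 * sim_data$leaf_width + rnorm(100, sd = 5)

sim_mod <- lm(leaf_length ~ leaf_width, data = sim_data)

# Unrounded slope estimate and standard error from the coefficient table
coef_table <- coef(summary(sim_mod))
b1 <- coef_table["leaf_width", "Estimate"]
se_b1 <- coef_table["leaf_width", "Std. Error"]

# Critical value with n - 2 degrees of freedom
t_star <- qt(0.975, df = df.residual(sim_mod))

by_hand <- c(b1 - t_star * se_b1, b1 + t_star * se_b1)
by_hand

# Should agree with the by-hand limits
confint(sim_mod)["leaf_width", ]
```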

Confidence intervals

There are ✌️ other types of intervals we may want to calculate

The confidence interval for the mean response in \(y\) for a given \(x^*\) value
The prediction interval for an individual response \(y\) for a given \(x^*\) value
Why are these different? Which do you think is easier to estimate?
It is harder to predict one response than to predict a mean response.
What does this mean in terms of the standard error?
The SE of the prediction interval is going to be larger.

Confidence intervals

confidence interval for \(\mu_y\) and prediction interval

\[ \hat{y}\pm t^* SE\]

\(\hat{y}\) is the predicted \(y\) for a given \(x^*\)
\(t^*\) is the critical value for the \(t_{n-2}\) density curve
\(SE\) takes ✌️ different values depending on which interval you’re interested in

\(SE_{\hat\mu}\)

\(SE_{\hat{y}}\)

Which will be larger?

\(SE_{\hat\mu} = \hat{\sigma}_\epsilon\sqrt{\frac{1}{n}+\frac{(x^*-\bar{x})^2}{\Sigma(x-\bar{x})^2}}\)
\(SE_{\hat{y}}=\hat{\sigma}_\epsilon\sqrt{1 + \frac{1}{n}+\frac{(x^*-\bar{x})^2}{\Sigma(x-\bar{x})^2}}\)

What is the difference between these two equations?

\(SE_{\hat{y}}=\hat{\sigma}_\epsilon\sqrt{{\color{red}1} + \frac{1}{n}+\frac{(x^*-\bar{x})^2}{\Sigma(x-\bar{x})^2}}\)

An individual response varies around the mean response \(\mu_y\) with a standard deviation of \(\sigma_\epsilon\), which is why \(SE_{\hat{y}}\) has an extra 1 under the square root.
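Both standard errors can be computed directly from their formulas and checked against predict. This is a sketch on simulated data: `sim_data`, `sim_mod`, and the choice \(x^* = 5\) are assumptions made for illustration, standing in for the magnolia data.

```r
# Hypothetical stand-in for magnolia_data (simulated, not the real data)
set.seed(3)
sim_data <- data.frame(leaf_width = runif(100, min = 2, max = 15))
sim_data$leaf_length <- 12 + 0.45 * sim_data$leaf_width + rnorm(100, sd = 5)
sim_mod <- lm(leaf_length ~ leaf_width, data = sim_data)

x_star <- 5                     # the x value we want an interval at
x <- sim_data$leaf_width
n <- nrow(sim_data)
sigma_hat <- sigma(sim_mod)     # residual standard error

# Term shared by both formulas: 1/n + (x* - xbar)^2 / sum((x - xbar)^2)
shared <- 1 / n + (x_star - mean(x))^2 / sum((x - mean(x))^2)
se_mu <- sigma_hat * sqrt(shared)       # SE for the mean response
se_y <- sigma_hat * sqrt(1 + shared)    # SE for an individual response

new_x <- data.frame(leaf_width = x_star)
y_hat <- unname(predict(sim_mod, newdata = new_x))
t_star <- qt(0.975, df = n - 2)

ci_by_hand <- c(y_hat - t_star * se_mu, y_hat + t_star * se_mu)
pi_by_hand <- c(y_hat - t_star * se_y, y_hat + t_star * se_y)

# Compare with the built-in intervals
ci_r <- predict(sim_mod, newdata = new_x, interval = "confidence")
pi_r <- predict(sim_mod, newdata = new_x, interval = "prediction")
rbind(ci_by_hand, ci_r[1, c("lwr", "upr")])
rbind(pi_by_hand, pi_r[1, c("lwr", "upr")])
```

The only difference between `se_mu` and `se_y` is the `1 +` inside the square root, so `se_y` is always the larger of the two.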

Let’s do it in R!

mod <- lm(leaf_length ~ leaf_width, data = magnolia_data)
predict(mod)

       1        2        3 
14.07326 15.70058 15.43740

mod <- lm(leaf_length ~ leaf_width, data = magnolia_data)
predict(mod, interval = "confidence")

       fit      lwr      upr
1 14.07326 12.62545 15.52106
2 15.70058 14.63233 16.76884
3 15.43740 14.37976 16.49505

mod <- lm(leaf_length ~ leaf_width, data = magnolia_data)
predict(mod, interval = "prediction")

## WARNING predictions on current data refer to _future_ responses

       fit      lwr      upr
1 14.07326 3.402757 24.74376
2 15.70058 5.074925 26.32624
3 15.43740 4.812806 26.06200

Let’s do it in R!

What if we have new data?

new_magnolia_data <- data.frame(
  leaf_width = c(5, 7.2, 4.3)
)
new_magnolia_data

  leaf_width
1        5.0
2        7.2
3        4.3

predict(
  mod, 
  newdata = new_magnolia_data, 
  interval = "prediction")

       fit      lwr      upr
1 14.02939 3.355995 24.70279
2 14.99438 4.364317 25.62445
3 13.72235 3.026197 24.41851
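To see the two interval types side by side for the same new observations, one option is to call predict twice and compare widths. The sketch below again uses simulated data as a stand-in; `sim_data`, `sim_mod`, and `new_widths` are hypothetical names, and the numbers it produces will not match the magnolia output above.

```r
# Hypothetical stand-in for magnolia_data (simulated, not the real data)
set.seed(4)
sim_data <- data.frame(leaf_width = runif(100, min = 2, max = 15))
sim_data$leaf_length <- 12 + 0.45 * sim_data$leaf_width + rnorm(100, sd = 5)
sim_mod <- lm(leaf_length ~ leaf_width, data = sim_data)

new_widths <- data.frame(leaf_width = c(5, 7.2, 4.3))

# Interval for the mean response vs. interval for an individual response
conf_int <- predict(sim_mod, newdata = new_widths, interval = "confidence")
pred_int <- predict(sim_mod, newdata = new_widths, interval = "prediction")

comparison <- data.frame(
  leaf_width = new_widths$leaf_width,
  fit = conf_int[, "fit"],
  conf_width = conf_int[, "upr"] - conf_int[, "lwr"],
  pred_width = pred_int[, "upr"] - pred_int[, "lwr"]
)
comparison
```

The fitted values are the same in both calls; only the interval widths change, and the prediction interval is wider in every row.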

`Application Exercise`

Open appex-12.qmd
You are interested in the predicted Porsche Price for Porsche cars that have 50,000 miles previously driven on average. Calculate this value with an appropriate confidence interval.
You are interested in the predicted Porsche Price for a particular Porsche with 40,000 miles previously driven. Calculate this value with an appropriate confidence interval.
