Types of variables

Lucy D’Agostino McGowan

Variable types

There are two major classes of variables:
- numeric (quantitative)
- categorical

Variable types

Recall from the first week of class that you can use the glimpse() function to see all of your variables and their types

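library(tidyverse)  # provides glimpse()
library(Stat2Data)  # assumed source of the PorschePrice data
data("PorschePrice")
glimpse(PorschePrice)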

Rows: 30
Columns: 3
$ Price   <dbl> 69.4, 56.9, 49.9, 47.4, 42.9, 36.9, 83.0, 72.9, 69.9, 67.9, 66…
$ Age     <int> 3, 3, 2, 4, 4, 6, 0, 0, 2, 0, 2, 2, 4, 3, 10, 11, 4, 4, 10, 3,…
$ Mileage <dbl> 21.50, 43.00, 19.90, 36.00, 44.00, 49.80, 1.30, 0.67, 13.40, 9…

What are the variables here?
fct: “factor”, a type of categorical variable

Variable types

Recall from the first week of class that you can use the glimpse() function to see all of your variables and their types

glimpse(starwars)

Rows: 87
Columns: 5
$ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
$ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
$ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
$ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…

chr: “character”, a type of categorical variable

Variable types

So far, our models have only included numeric (quantitative) variables
What would the equation be for predicting \(y\) from \(x\) when \(x\) is numeric?
What would happen if \(x\) is categorical?
- What would the equation be for predicting \(y\) from \(x\) if \(x\) is categorical with 2 levels?
- What would the equation be for predicting \(y\) from \(x\) if \(x\) is categorical with 3 levels?
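
As a sketch of where these questions lead, the fitted equations can be written in the usual regression notation. The indicator names \(x_1\) and \(x_2\) and the choice of which level is left out are illustrative assumptions, not something fixed by the slides:

\[
\begin{aligned}
\text{numeric } x: \quad & \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \\
\text{2 levels}: \quad & \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1, \qquad x_1 = 1 \text{ for the second level, } 0 \text{ otherwise} \\
\text{3 levels}: \quad & \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2, \qquad x_1, x_2 \text{ mark the second and third levels}
\end{aligned}
\]

In each categorical case the level with no indicator is absorbed into \(\hat{\beta}_0\).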

Indicator variable

An indicator variable uses two values, usually 0 and 1, to indicate whether a data case does (1) or does not (0) belong to a specific category
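
A minimal sketch of this definition in base R. The vector `color` is hypothetical toy data made up for illustration, not taken from any dataset in these slides:

```r
# Hypothetical toy data: the colors of five diamonds
color <- c("D", "E", "D", "J", "E")

# 1 if the diamond belongs to category "D", 0 if it does not
color_d <- ifelse(color == "D", 1, 0)

color_d
```

Each position in `color_d` answers one yes/no question: is this case in category D?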

Indicator variables

What does this code do?

Diamonds <- Diamonds %>%
  mutate(
    ColorD = ifelse(Color == "D", 1, 0), 
    ColorE = ifelse(Color == "E", 1, 0),
    ColorF = ifelse(Color == "F", 1, 0),
    ColorG = ifelse(Color == "G", 1, 0),
    ColorH = ifelse(Color == "H", 1, 0),
    ColorI = ifelse(Color == "I", 1, 0),
    ColorJ = ifelse(Color == "J", 1, 0)
  )

Indicator variables

What if I wanted to model the relationship between TotalPrice and Color?

Indicator variables

Why is ColorJ NA?

lm(TotalPrice ~ ColorD + ColorE + ColorF + ColorG + ColorH + ColorI + ColorJ,
   data = Diamonds)


Call:
lm(formula = TotalPrice ~ ColorD + ColorE + ColorF + ColorG + 
    ColorH + ColorI + ColorJ, data = Diamonds)

Coefficients:
(Intercept)       ColorD       ColorE       ColorF       ColorG       ColorH  
       1936         3632         2423         7224         7623         6732  
     ColorI       ColorJ  
       5704           NA

When including indicator variables in a model for a variable with k categories, always include k - 1 of them.
The one that is left out is the “reference” category.
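
The k - 1 rule can be checked on a small example. This is a sketch using hypothetical toy data with k = 3 categories (the data frame `toy` and its values are made up for illustration), not the Diamonds data:

```r
# Hypothetical toy data with k = 3 categories
toy <- data.frame(
  price = c(10, 12, 20, 22, 30, 32),
  color = c("D", "D", "E", "E", "J", "J")
)

# One indicator per category: k = 3 indicators
toy$color_d <- ifelse(toy$color == "D", 1, 0)
toy$color_e <- ifelse(toy$color == "E", 1, 0)
toy$color_j <- ifelse(toy$color == "J", 1, 0)

# Every row belongs to exactly one category, so the indicators sum to 1
# in every row, which duplicates the intercept column
indicator_sums <- rowSums(toy[, c("color_d", "color_e", "color_j")])
indicator_sums

# With all k indicators, the last coefficient cannot be estimated (NA)
fit_all <- lm(price ~ color_d + color_e + color_j, data = toy)
coef(fit_all)

# With k - 1 indicators, every coefficient is estimable; J is the reference
fit_k1 <- lm(price ~ color_d + color_e, data = toy)
coef(fit_k1)
```

Because the k indicators add up to the intercept column, the model with all of them has redundant information, and R reports NA for the coefficient it cannot separate out.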

Indicator variables

What is the reference category?

lm(TotalPrice ~ ColorD + ColorE + ColorF + ColorG + ColorH + ColorI,
   data = Diamonds)


Call:
lm(formula = TotalPrice ~ ColorD + ColorE + ColorF + ColorG + 
    ColorH + ColorI, data = Diamonds)

Coefficients:
(Intercept)       ColorD       ColorE       ColorF       ColorG       ColorH  
       1936         3632         2423         7224         7623         6732  
     ColorI  
       5704

Interpretation: A diamond with Color D has an expected total price 3632 higher than a diamond with Color J.
Interpretation: A diamond with Color E has an expected total price 2423 higher than a diamond with Color J.
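
One way to see where these interpretations come from is to compare the coefficients with group means. The sketch below uses hypothetical toy data (made up for illustration, not the Diamonds data) with J left out as the reference:

```r
# Hypothetical toy data
toy <- data.frame(
  price = c(10, 12, 20, 22, 30, 32),
  color = c("D", "D", "E", "E", "J", "J")
)
toy$color_d <- ifelse(toy$color == "D", 1, 0)
toy$color_e <- ifelse(toy$color == "E", 1, 0)

# J has no indicator in the model, so it is the reference category
fit <- lm(price ~ color_d + color_e, data = toy)

# Mean price within each category
mean_d <- mean(toy$price[toy$color == "D"])
mean_j <- mean(toy$price[toy$color == "J"])

# Intercept: mean price in the reference category
c(intercept = coef(fit)[["(Intercept)"]], mean_j = mean_j)

# Coefficient for color_d: mean for D minus mean for the reference J
c(color_d = coef(fit)[["color_d"]], difference = mean_d - mean_j)
```

When the only predictors are indicators for one categorical variable, each coefficient is a difference in group means relative to the reference category.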

What is the interpretation for a diamond with Color F?

R is smart

lm(TotalPrice ~ Color, data = Diamonds)


Call:
lm(formula = TotalPrice ~ Color, data = Diamonds)

Coefficients:
(Intercept)       ColorE       ColorF       ColorG       ColorH       ColorI  
       5569        -1209         3592         3990         3100         2071  
     ColorJ  
      -3632

What is the reference category?
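
To look at the indicator columns R builds from a factor, one option is model.matrix(). This is a sketch on a hypothetical toy factor (made up for illustration), not the Diamonds data:

```r
# Hypothetical toy factor with three levels
toy <- data.frame(color = factor(c("D", "E", "J", "E", "D")))

# model.matrix() returns the design matrix lm() would use for this formula:
# an intercept column plus one indicator per level except the first
X <- model.matrix(~ color, data = toy)

colnames(X)
X
```

The level that does not get its own column is the one R treats as the reference.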

What is the interpretation for Color E now?
What if we wanted a different reference category?
- We could code the indicators ourselves
- We could relevel the factor

Relevel

levels(Diamonds$Color)

[1] "D" "E" "F" "G" "H" "I" "J"

new_levels <- c("J", "D", "E", "F", "G", "H", "I")
Diamonds <- Diamonds %>%
  mutate(Color = fct_relevel(Color, new_levels))

levels(Diamonds$Color)

[1] "J" "D" "E" "F" "G" "H" "I"
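
As an alternative sketch, base R's relevel() changes only the reference level without listing every level. The factor `color` below is hypothetical toy data, not the Diamonds data:

```r
# Hypothetical toy factor; levels default to alphabetical order
color <- factor(c("D", "E", "J", "E", "D"))
levels(color)

# relevel() moves the chosen level to the front, making it the reference;
# the remaining levels keep their existing order
color_j_first <- relevel(color, ref = "J")
levels(color_j_first)
```

The data values themselves are unchanged; only the ordering of the levels, and therefore the reference category in a model, is different.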

R is smart

lm(TotalPrice ~ Color, data = Diamonds)


Call:
lm(formula = TotalPrice ~ Color, data = Diamonds)

Coefficients:
(Intercept)       ColorD       ColorE       ColorF       ColorG       ColorH  
       1936         3632         2423         7224         7623         6732  
     ColorI  
       5704

What is the reference category?

What if the variable is binary?

A binary variable is a special type of categorical variable with two levels

ICU example

A sample of 200 patients in an ICU
We want to see whether a patient’s heart rate is related to whether they were admitted via the emergency room
- y: Heart rate (beats per minute)
- x: indicator for emergency room admission
Aside: Is this inference or prediction?

Binary x variable

data("ICU")
lm(Pulse ~ Emergency, data = ICU)


Call:
lm(formula = Pulse ~ Emergency, data = ICU)

Coefficients:
(Intercept)    Emergency  
      91.11        10.63

How can we interpret \(\hat{\beta}_0\) now?
How can we interpret \(\hat{\beta}_1\)?
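
A sketch for exploring these two questions with a 0/1 predictor. The data frame `toy` is hypothetical and made up for illustration; it is not the ICU data:

```r
# Hypothetical toy data: pulse and a 0/1 indicator for emergency admission
toy <- data.frame(
  pulse     = c(80, 90, 100, 95, 105, 115),
  emergency = c(0, 0, 0, 1, 1, 1)
)

fit <- lm(pulse ~ emergency, data = toy)
coef(fit)

# Group means for comparison
mean_0 <- mean(toy$pulse[toy$emergency == 0])  # 90
mean_1 <- mean(toy$pulse[toy$emergency == 1])  # 105

# The intercept matches the mean when emergency = 0, and the slope
# matches the difference in means (emergency = 1 minus emergency = 0)
c(intercept = coef(fit)[["(Intercept)"]], mean_0 = mean_0)
c(slope = coef(fit)[["emergency"]], difference = mean_1 - mean_0)
```

With a single binary predictor, the fitted line passes through the two group means, so both coefficients can be read off as statements about those means.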

Application Exercise

Copy the following template into RStudio Pro:

https://github.com/sta-112-f22/appex-14.git

What are the variables in the Diamonds dataset?
What are the levels of the Clarity variable in the Diamonds data?
Fit a model with TotalPrice as the outcome and Clarity as the explanatory variable
Change the reference category to SI1 and refit the model
Add the variable Depth to your model. How do you interpret the coefficient for this parameter?
