Date: November 10, 2025
Final Project Prep
Some helpful tools (autosave & new library)
Regression Review
Categorical Predictors Review
Logistic Regression
Model Diagnostics 🤷
Tools >> Global Options >> Code >> Saving
genzplyr: dplyr but make it bussin fr fr no cap
Regression does a vibe check on the data. Lowkey, it just draws a line through the points.
What is the equation for a regression??
\[ Y_i = b_0 + b_1X_{1i} + b_2X_{2i} + ... + b_nX_{ni}+ e_i \]
How do we interpret the regression coefficients (Intercepts & slopes)?
INTERCEPT: When all predictor variables are set to 0, our expected value (predicted \(\hat{Y}\)) will be this value.
SLOPES: For every 1-unit change in our \(X_n\) variable, there will be a \(b\) (or \(\beta\)) unit change in our Y (outcome) variable, holding all other variables constant.
Typically identified by a grouping variable that may be a character
chr [1:1000] "Democrat" "Democrat" "Independent" "Republican" "Democrat" ...
Need to change the variable from character to factor, which will assign a number to each group
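A minimal sketch of that conversion, assuming the variable lives in the cah_data data frame as pol (names taken from the model output later in these slides):

```r
# Convert the character variable to a factor so R treats it as categorical
cah_data$pol <- factor(cah_data$pol)

# Levels sort alphabetically by default, so "Democrat" becomes the first level
levels(cah_data$pol)
```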
Numerical placeholders used to represent categorical variables
Converting a categorical variable with \(k\) levels (e.g., Democrat, Independent, Republican) into \(k-1\) binary variables.
| political_affiliation | becomes | Ind (binary1) | Rep (binary2) |
|---|---|---|---|
| Democrat | -> | 0 | 0 |
| Independent | -> | 1 | 0 |
| Republican | -> | 0 | 1 |
| … | … | … | … |
For the variable with \(k\) levels, we use \(k-1\) binary variables…why not use all of them?
Including all three binary variables would result in perfect multicollinearity: knowing any two of them tells you the value of the third (e.g., if Ind = 0 and Rep = 0, the person must be a Democrat)
By including 2 binary variables, we are able to obtain all information about group membership
The group that has 0’s for all the binary variables is the reference group
For interpretation, we will make our statements in reference to this group
| political_affiliation | becomes | Ind (binary1) | Rep (binary2) |
|---|---|---|---|
| Democrat | -> | 0 | 0 |
| Independent | -> | 1 | 0 |
| Republican | -> | 0 | 1 |
| … | … | … | … |
Going back to our Transformers dataset, let’s see how our political affiliation variable can predict the number of Transformers movies watched
Usually you have to create the separate dummy variables, but not in R. As long as your predictor is set as a factor, R will automatically dummy code the variable.
You can also double check the dummy/contrast coding
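One way to do that check (a sketch; the data frame and variable names are assumed from the output on the next slide):

```r
# Show the 0/1 dummy coding R will use for each level of pol;
# the reference group (Democrat) is the row of all zeros
contrasts(cah_data$pol)
```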
We are shown coefficients for the different levels of our predictor variable, but you will not see one for the reference group…that is the intercept!
Every interpretation is ALWAYS in reference to the reference group
Go to next slide for a pretty table
Call:
lm(formula = transformers ~ pol, data = cah_data)
Residuals:
Min 1Q Median 3Q Max
-1.3447 -1.1804 -0.3447 0.8196 3.8367
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.18039 0.09399 12.559 <0.0000000000000002 ***
polIndependent 0.16434 0.12349 1.331 0.184
polRepublican -0.01713 0.14257 -0.120 0.904
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.501 on 799 degrees of freedom
(198 observations deleted due to missingness)
Multiple R-squared: 0.003244, Adjusted R-squared: 0.0007487
F-statistic: 1.3 on 2 and 799 DF, p-value: 0.2731
A linear regression investigated the relationship between political affiliation and the number of Transformers movies watched. The Democrat group mean (the intercept) was significantly different from 0 (b = 1.18, p < .001), but neither Independents (p = .18) nor Republicans (p = .90) differed significantly from Democrats. The findings suggest that there are no significant differences between groups.
Note
How does this sound?
Maybe we want to compare to a specific group
We then need to update the reference group
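A minimal sketch of how the reference group could be switched using relevel() (the pol2 name matches the output below; the model object name is just for illustration):

```r
# Make Independent the reference group instead of Democrat
cah_data$pol2 <- relevel(cah_data$pol, ref = "Independent")

# Refit the model with the new reference group
model_ind <- lm(transformers ~ pol2, data = cah_data)
summary(model_ind)
```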
Call:
lm(formula = transformers ~ pol2, data = cah_data)
Residuals:
Min 1Q Median 3Q Max
-1.3447 -1.1804 -0.3447 0.8196 3.8367
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.34473 0.08011 16.786 <0.0000000000000002 ***
pol2Democrat -0.16434 0.12349 -1.331 0.184
pol2Republican -0.18146 0.13383 -1.356 0.175
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.501 on 799 degrees of freedom
(198 observations deleted due to missingness)
Multiple R-squared: 0.003244, Adjusted R-squared: 0.0007487
F-statistic: 1.3 on 2 and 799 DF, p-value: 0.2731
lm() or aov()? We have the same setup between a linear model and a one-way ANOVA.
\[ Outcome_{continuous} = Predictor_{categorical} + error \]
Why would we pick one over another?
If we have a reason to have a reference group –> Regression
If we just expect a difference somewhere –> ANOVA
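Either way, the underlying model is the same; a quick sketch of both framings (object names are just for illustration):

```r
# Regression framing: coefficients compare each group to the reference group
fit_lm <- lm(transformers ~ pol, data = cah_data)
summary(fit_lm)

# ANOVA framing: one omnibus F-test for "a difference somewhere"
fit_aov <- aov(transformers ~ pol, data = cah_data)
summary(fit_aov)

# The F-statistic is identical in both
anova(fit_lm)
```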
Goal: Gain greater familiarity with dummy coding and categorical predictors
Scenario: We are researchers examining the impact of a new intervention on reducing the vocalization “6️⃣7️⃣” in the youths. We have done classroom observations to collect data on how many times students say “6️⃣7️⃣” after receiving the intervention. Youths have been randomly assigned to one of 3 groups, and we need to determine which had the biggest impact.
Note
Dr. Haraden needs to write the equation on the board 🧑🏫. Calculate the expected value for each group (worked values follow the output below).
Call:
lm(formula = post_score ~ group, data = six_seven)
Residuals:
Min 1Q Median 3Q Max
-5.000 -2.375 0.000 1.500 7.000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.5000 0.9487 12.122 0.00000000000197 ***
groupOrange 1.5000 1.3416 1.118 0.2734
groupYellow 3.0000 1.3416 2.236 0.0338 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3 on 27 degrees of freedom
Multiple R-squared: 0.1562, Adjusted R-squared: 0.09375
F-statistic: 2.5 on 2 and 27 DF, p-value: 0.1009
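Using the coefficients in the output above, the expected (mean) post-score for each group works out to (the reference group is whichever level sorts first; it is not named in the output):

\[ \hat{Y}_{ref} = 11.50 \qquad \hat{Y}_{Orange} = 11.50 + 1.50 = 13.00 \qquad \hat{Y}_{Yellow} = 11.50 + 3.00 = 14.50 \]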
Import the dataset, then include age and pre-scores to see how that changes the interpretation
Warning
Dr. Haraden is going to start opening up R and doing a follow-along thing. You have been warned
In our last example, we were able to see how these categorical variables could predict a continuous variable. This is perfect for linear regression.
What if we want to see if students have “recovered” from 6️⃣7️⃣?
We would then ask: “Which participants dropped below the clinical threshold for 6️⃣7️⃣ at follow-up? Now, our outcome is either recovered or not recovered.”
Now we have a binary outcome: Yes/No recovery
What happens when we fit a linear regression? What are the chances of someone with a pre-score of 12 recovering by the follow-up?
A simple line is not going to appropriately capture the data
Introducing Logistic Regression 🌟
Probability (p): The chance of an event happening. Ranges from 0 to 1
Odds: The ratio of the probability of an event happening to it not happening.
\(Odds = \frac{p}{1-p}\)
Ranges from 0 to ∞. An odds of 4 means the event is 4 times more likely to happen than not.
Log-Odds (logit): The natural log of the odds
\(Logit(p) = ln(\frac{p}{1-p})\)
Ranges from -∞ to +∞
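A quick worked example tying these together (numbers chosen to match the odds-of-4 example above):

```r
p <- 0.8            # probability of the event
odds <- p / (1 - p) # 0.8 / 0.2 = 4: four times more likely to happen than not
logit <- log(odds)  # natural log of the odds, ~1.39 (unbounded)
```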
Important
This step transforms our bounded outcome variable (0/1) to an unbounded one!
A generalization of a linear model (duh) that is used when the response variable has a non-normal error distribution
Most commonly used when there is a binary (0-1) or count variable as the outcome (we will focus on the binary)
Ultimately, we are modeling the probability that the outcome takes the value 1 (“success”) in relation to the predictor variables
\[ transformation(p_i)=\beta_0+\beta_1x_{1,i} + \beta_2x_{2,i} + \cdots+\beta_kx_{k,i} \]
We have to apply a transformation to the left side so that it can take values beyond just 0 and 1
A common transformation is the \(logit\ transformation\)
\[ \log_{e}\left( \frac{p_i}{1-p_i} \right) = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \cdots + \beta_k x_{k,i} \]
Now that we have a tool to figure this out, let’s see if the pre-score can predict recovery!
Note
Interpretation is still the same as linear regression, except we are dealing with log-odds of the outcome. What does that mean??
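A sketch of how this model could be fit with glm(); the formula and data match the output below, while the object name is just for illustration:

```r
# Logistic regression: recovery (0/1) predicted by pre-score
recovery_model <- glm(recover ~ pre_score, family = "binomial", data = six_seven)
summary(recovery_model)
```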
Call:
glm(formula = recover ~ pre_score, family = "binomial", data = six_seven)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 20.1873 7.0700 2.855 0.00430 **
pre_score -1.2002 0.4268 -2.812 0.00492 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 38.191 on 29 degrees of freedom
Residual deviance: 18.216 on 28 degrees of freedom
AIC: 22.216
Number of Fisher Scoring iterations: 6
We want to be able to interpret these coefficients more easily, so we convert them into odds ratios (ORs)
sjPlot is always coming in with the good tables
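A couple of ways to get odds ratios, assuming the model object from the sketch above (recovery_model):

```r
# Exponentiate the log-odds coefficients to get odds ratios
exp(coef(recovery_model))
exp(confint(recovery_model))  # confidence intervals on the OR scale

# sjPlot's tab_model() reports odds ratios by default for logistic models
sjPlot::tab_model(recovery_model)
```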
OR > 1: The predictor increases the odds of the outcome. (e.g., OR of 2.5 means the odds of believing in ghosts are 2.5 times higher).
OR < 1: The predictor decreases the odds of the outcome. (e.g., OR of 0.4 means the odds of believing in ghosts are 60% lower).
OR = 1: The predictor has no effect on the odds of the outcome.
Extract the model-implied probabilities for each individual
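One way to do this (a sketch using broom; the probs and .fitted names match the plotting code below, and the model object name is assumed):

```r
library(broom)

# augment() with type.predict = "response" adds a .fitted column of
# predicted probabilities (one per person) to the data used in the model
probs <- augment(recovery_model, type.predict = "response")
head(probs)
```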
Plotting the predicted probabilities
probs %>%
ggplot(aes(pre_score, .fitted)) +
geom_line(color = "blue", linewidth = 0.5) +
geom_jitter(width = 0.2, alpha = 0.5) +
labs(
title = "Predicted Probability of Recovery from 6-7 by Pre-Score",
x = "Pre-Score",
y = "Predicted Probability of Recovery"
) +
ylim(0, 1) + # Keep the y-axis bounded at 0 and 1
theme_minimal()

| Feature | Linear Regression | Logistic Regression |
|---|---|---|
| Outcome Variable | Continuous | Categorical (Binary) |
| Equation | \(Y=β_0+β_1X\) | \(ln(\frac{p}{1-p})=β_0+β_1X\) |
| Key Interpretation | \(\beta_1\) is the change in the mean of Y | \(\exp(\beta_1)\) is the odds ratio |
| R Function | lm() | glm(..., family = "binomial") |
Follow along to apply these methods to a new dataset!
Predicting the probability of believing in ghosts. Hopefully we have time to go through this example 👻