Regression: The Empire Strikes Back

PSYC 640 - Fall 2023

Dustin Haraden, PhD

Last Time

Today…

  • More Regression (but more details)

    • Coefficient of Determination

    • Coefficient of Alienation

    • Omnibus test

  • Multiple Regression? Comparing Models?

  • Lab 5!!! The Final!

options(scipen = 999) # turn off scientific notation in output
library(tidyverse)    # data wrangling and plotting
library(rio)          # data import
library(broom)        # tidy model output (e.g., augment())
library(psych)        # descriptive statistics

Data for Today

school <- read_csv("https://raw.githubusercontent.com/dharaden/dharaden.github.io/main/data/example2-chisq.csv") %>% 
  mutate(Sleep_Hours_Non_Schoolnight = as.numeric(Sleep_Hours_Non_Schoolnight)) %>% 
  filter(Sleep_Hours_Non_Schoolnight < 24) #removing impossible values

Statistical Inference

  • The way the world is = our model + error

  • How good is our model? Does it “fit” the data well?

To assess how well our model fits the data, we will examine the proportion of variance in our outcome variable that can be “explained” by our model.

To do so, we need to partition the variance into different categories. For now, we will partition it into two categories: the variability that is captured by (explained by) our model, and variability that is not.

Partitioning variation

Let’s start with the formula defining the relationship between observed \(Y\) and fitted \(\hat{Y}\):

\[Y_i = \hat{Y}_i + e_i\]

One of the properties that we love about variance is that variances are additive when two variables are independent. In other words, if we create some variable, C, by adding together two other variables, A and B, then the variance of C is equal to the sum of the variances of A and B.
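As a quick illustration (a minimal simulation; the variables a and b and the seed are hypothetical), independently generated variables behave exactly this way:

set.seed(640)   # arbitrary seed for reproducibility
a <- rnorm(10000, mean = 0, sd = 2)  # variance 4
b <- rnorm(10000, mean = 5, sd = 3)  # variance 9
var(a + b)      # ≈ 13
var(a) + var(b) # also ≈ 13 (4 + 9)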

Why can we use that rule in this case?

Partitioning variation

\(\hat{Y}_i\) and \(e_i\) are independent of each other. Thus, the variance of \(Y\) is equal to the sum of the variance of \(\hat{Y}\) and the variance of \(e\).

\[\large s^2_Y = s^2_{\hat{Y}} + s^2_{e}\]

Recall that a variance is a sum of squares divided by \(N-1\). Because all three variances here are computed on the same sample size, we can multiply through by \(N-1\) and express the same relationship in terms of sums of squares:

\[\large SS_Y = SS_{\hat{Y}} + SS_{\text{e}}\]

A quick note about terminology: I demonstrated these calculations using \(Y\), \(\hat{Y}\) and \(e\). However, you may also see the same terms written differently, to more clearly indicate the source of the variance…

\[ SS_Y = SS_{\hat{Y}} + SS_{\text{e}}\] \[ SS_Y = SS_{\text{Model}} + SS_{\text{Residual}}\]

The relative magnitudes of these sums of squares provide a way of identifying particularly large and important sources of variability.

Coefficient of Determination

\[\Large \frac{SS_{Model}}{SS_{Y}} = \frac{s_{Model}^2}{s_{y}^2} = R^2\]

\(R^2\) represents the proportion of variance in Y that is explained by the model.

\(\sqrt{R^2} = R\) is the correlation between the predicted values of \(Y\) from the model and the actual values of \(Y\):

\[\large \sqrt{R^2} = r_{Y\hat{Y}}\]

Example

fit.1 <- lm(Sleep_Hours_Non_Schoolnight ~ Ageyears, 
           data = school)
summary(fit.1) 

Call:
lm(formula = Sleep_Hours_Non_Schoolnight ~ Ageyears, data = school)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.3947 -0.7306  0.3813  1.2694  4.5974 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept) 10.52256    0.90536  11.623 <0.0000000000000002 ***
Ageyears    -0.11199    0.05887  -1.902              0.0585 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.204 on 204 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.01743,   Adjusted R-squared:  0.01261 
F-statistic: 3.619 on 1 and 204 DF,  p-value: 0.05854
model_info <- augment(fit.1) # data plus .fitted and .resid columns
summary(fit.1)$r.squared
[1] 0.01742975
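
As a sanity check (a minimal sketch using the model_info object created above), we can rebuild \(R^2\) from the sums of squares directly:

y_bar    <- mean(model_info$Sleep_Hours_Non_Schoolnight)
ss_y     <- sum((model_info$Sleep_Hours_Non_Schoolnight - y_bar)^2)
ss_model <- sum((model_info$.fitted - y_bar)^2)
ss_resid <- sum(model_info$.resid^2)
ss_model + ss_resid # should equal ss_y
ss_model / ss_y     # should match summary(fit.1)$r.squared (≈ 0.0174)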

Example

school %>% 
  ggplot(aes(x = Ageyears, y = Sleep_Hours_Non_Schoolnight)) + 
  geom_point() + geom_smooth(method = "lm", se = FALSE)

Example

The correlation between X and Y is:

cor(school$Sleep_Hours_Non_Schoolnight, school$Ageyears, use = "pairwise")
[1] -0.1320218

. . .

If we square the correlation, we get:

cor(school$Sleep_Hours_Non_Schoolnight, school$Ageyears, use = "pairwise")^2
[1] 0.01742975

. . .

Ta da!

summary(fit.1)$r.squared
[1] 0.01742975
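
We can also verify the other interpretation: \(R\) is the correlation between the observed values of \(Y\) and the values predicted by the model (a quick check using model_info from above):

# should be ≈ 0.132, i.e., the square root of R^2
cor(model_info$Sleep_Hours_Non_Schoolnight, model_info$.fitted)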

Coefficient of Alienation

\(1-R^2\) is sometimes referred to as the coefficient of alienation. It represents the proportion of variance in Y that is unexplained by our model, or left over after accounting for X (and other predictors).
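In R, this is just one minus the value we already computed:

1 - summary(fit.1)$r.squared # proportion of variance unexplained; ≈ 0.983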

Residuals carry information about where and how the model fails to fit the data. However, it’s important to note that residuals (like all other aspects of our data) are estimates.

In fact, residuals are latent variables: we do not observe them directly during data collection, but instead infer their presence and values from other data.

We can use residuals to estimate true score error.

Standard Error of the Estimate

  • aka residual standard error

\[s_{Y|X} = \sqrt{\frac{SS_{\text{Residual}}}{df_{\text{Residual}}}} = \sqrt{\frac{\Sigma(Y_i -\hat{Y_i})^2}{N-2}}\]

  • interpreted in original units (unlike \(R^2\))

  • We interpret the standard error of the estimate to represent the spread of observed data around the regression line.
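
Applying the formula above to our model (a minimal check using the residuals stored in model_info):

# sqrt(SS_residual / df_residual); should match the residual standard
# error reported by summary(fit.1), ≈ 2.204
sqrt(sum(model_info$.resid^2) / df.residual(fit.1))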

Standard Error of the Estimate


Call:
lm(formula = Sleep_Hours_Non_Schoolnight ~ Ageyears, data = school)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.3947 -0.7306  0.3813  1.2694  4.5974 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept) 10.52256    0.90536  11.623 <0.0000000000000002 ***
Ageyears    -0.11199    0.05887  -1.902              0.0585 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.204 on 204 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.01743,   Adjusted R-squared:  0.01261 
F-statistic: 3.619 on 1 and 204 DF,  p-value: 0.05854

Standard Error of the Estimate

summary(fit.1)$sigma 
[1] 2.203647
sd(model_info$.resid)
[1] 2.198265

Note: these are not the same! sd() divides by \(N-1\) (here, 205), while the residual standard error divides by the residual degrees of freedom, \(N-2\) (here, 204).
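
Rescaling reconciles the two (here \(N = 206\), so the denominators are 205 and 204):

# convert the N-1 denominator used by sd() to the N-2 denominator
sd(model_info$.resid) * sqrt(205 / 204) # should match summary(fit.1)$sigma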

\(1-R^2\) and standard error of the estimate

  • two sides of the same coin

  • one in original units (standard error of the estimate), the other standardized \((1-R^2)\)

Inferential tests

NHST is about making decisions:

  • these two means are/are not different
  • this correlation is/is not significant
  • the distribution of this categorical variable is/is not different between these groups

In regression, there are several inferential tests being conducted at once. The first is called the omnibus test – this is a test of whether the model fits the data.

Omnibus test

Historically, we use the F distribution to evaluate the significance of our model because it is built around ratios of variances, which fits naturally with our ability to partition variance.

What is our null hypothesis?

. . .

The model does not account for variance in \(Y\) (spoiler…ANOVA)

F Distributions and regression

F statistics do not test the likelihood of differences; they test the likelihood of ratios. In this case, we want to determine whether the variance explained by our model is large relative to another variance.

Which variance?

\[\Large F_{\nu_1\nu_2} = \frac{\frac{\chi^2_{\nu_1}}{\nu_1}}{\frac{\chi^2_{\nu_2}}{\nu_2}}\]

\[\Large F_{\nu_1\nu_2} = \frac{\frac{\text{Variance}_{\text{Model}}}{\nu_1}}{\frac{\text{Variance}_{\text{Residual}}}{\nu_2}}\]

\[\Large F = \frac{MS_{Model}}{MS_{residual}}\]

The degrees of freedom for our model are

\[DF_1 = k\] \[DF_2 = N-k-1\]

where \(k\) is the number of predictors (IVs) in your model and \(N\) is the sample size. In our example, \(k = 1\) and \(N = 206\), so \(DF_1 = 1\) and \(DF_2 = 204\).

Mean squares are calculated by taking the relevant Sums of Squares and dividing by their respective degrees of freedom.

  • \(SS_{\text{Model}}\) is divided by \(DF_1\)

  • \(SS_{\text{Residual}}\) is divided by \(DF_2\)

anova(fit.1)
Analysis of Variance Table

Response: Sleep_Hours_Non_Schoolnight
           Df Sum Sq Mean Sq F value  Pr(>F)  
Ageyears    1  17.57 17.5728  3.6187 0.05854 .
Residuals 204 990.64  4.8561                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fit.1)

Call:
lm(formula = Sleep_Hours_Non_Schoolnight ~ Ageyears, data = school)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.3947 -0.7306  0.3813  1.2694  4.5974 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept) 10.52256    0.90536  11.623 <0.0000000000000002 ***
Ageyears    -0.11199    0.05887  -1.902              0.0585 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.204 on 204 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.01743,   Adjusted R-squared:  0.01261 
F-statistic: 3.619 on 1 and 204 DF,  p-value: 0.05854
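
The F statistic in both outputs is just the ratio of the mean squares from the ANOVA table; a minimal check by hand:

ms <- anova(fit.1)[["Mean Sq"]] # c(17.5728, 4.8561)
f_stat <- ms[1] / ms[2]         # ≈ 3.619, matching the output above
pf(f_stat, df1 = 1, df2 = 204, lower.tail = FALSE) # ≈ 0.0585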

Mean square error (MSE)

  • aka mean square residual and mean square within

  • unbiased estimate of error variance

    • measure of discrepancy between the data and the model
  • the MSE is the variance around the fitted regression line

  • Note: it is simply the square of the standard error of the estimate (the residual standard error)!

anova(fit.1)
Analysis of Variance Table

Response: Sleep_Hours_Non_Schoolnight
           Df Sum Sq Mean Sq F value  Pr(>F)  
Ageyears    1  17.57 17.5728  3.6187 0.05854 .
Residuals 204 990.64  4.8561                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
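
To see that relationship concretely, square the residual standard error (a quick check):

summary(fit.1)$sigma^2 # ≈ 4.8561, the Mean Sq for Residuals above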

Next time…

  • More Regression!
  • Group Work!

LAB TIME