Examines the impact of 2 nominal/categorical variables on a continuous outcome
We can now examine:
The impact of variable 1 on the outcome (Main Effect)
The impact of variable 2 on the outcome (Main Effect)
The interaction of variable 1 & 2 on the outcome (Interaction Effect)
The effect of variable 1 depends on the level of variable 2
Main Effect & Interactions
Main Effect: Basically a one-way ANOVA
Assumes the effect of variable 1 is the same across all levels of variable 2
Interaction:
Able to examine the effect of variable 1 across different levels of variable 2
Put simply, the effect of variable 1 on our outcome DEPENDS on the level of variable 2 (see the sketch below)
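A minimal sketch of a two-way (factorial) ANOVA in R, using the built-in ToothGrowth data rather than the lecture data, just to show the syntax that produces both main effects and the interaction:

# ToothGrowth ships with R; supp and dose are the two categorical predictors
# supp * factor(dose) expands to supp + factor(dose) + supp:factor(dose)
fit_2way <- aov(len ~ supp * factor(dose), data = ToothGrowth)
summary(fit_2way)  # rows for each main effect, the interaction, and residuals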
Ghost Data
Since there are 2 header rows, we only want to keep the first one. To do that, we read the file once to grab the column names, then read it again, skipping the first two rows, and assign those names back to the data.
We also need to specify what the missing values are. Typically we have been working with NA, which is more traditional. However, missing values in this dataset are coded as DK/REF or left blank, so this needs to be specified in the import function.
# read_csv() comes from readr (tidyverse), here() from the here package,
# and clean_names() from janitor
library(tidyverse)
library(here)
library(janitor)

# get the names of your columns, which are in the first row
ghost_data_names <- read_csv(here("lectures", "data", "ghosts.csv")) %>%
  names()

# import a second time: skip the first 2 rows, assign the saved column names
# via col_names, and tell read_csv which values count as missing
ghost_data <- read_csv(here("lectures", "data", "ghosts.csv"),
                       skip = 2,
                       col_names = ghost_data_names,
                       na = c("DK/REF", "", " ")) %>%
  clean_names()
Running the Test
Let’s take a look at Income by Political Affiliation
aov1 <- aov(income ~ political_affiliation, data = ghost_data)
summary(aov1)
Df Sum Sq Mean Sq F value Pr(>F)
political_affiliation 2 44252069121 22126034561 4.007 0.0189 *
Residuals 393 2170016726333 5521671059
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
604 observations deleted due to missingness
# Welch's F-test
oneway.test(income ~ political_affiliation, data = ghost_data)
One-way analysis of means (not assuming equal variances)
data: income and political_affiliation
F = 3.4289, num df = 2.00, denom df = 260.52, p-value = 0.03389
# Don't know if I'm using all of these, but including them here anyway
library(tidyverse)
library(rio)
library(broom)
library(psych)
library(gapminder)
library(psychTools)

# Remove scientific notation
options(scipen = 999)
Overview of Regression
Regression is a general data analytic system
Lots of things fall under the umbrella of regression
This system can handle a variety of forms of relations and types of variables
The output of regression includes both effect sizes and statistical significance
We can also incorporate multiple influences (IVs) and account for their intercorrelations
Uses for regression
Adjustment: Take into account (control) known effects in a relationship
Prediction: Develop a model based on what has happened previously to predict what will happen in the future
Explanation: Examine the influence of one or more variables on some outcome
Study Design & Collection
Design - When data are collected
Retrospective/Prospective
Longitudinal
Cross-Sectional
Collection - How data are collected
Experimental
Field
Observational
Meta-analysis
Neuroimaging/Psychophys
Survey
Quasi-Experimental
Regression Equation
With regression, we are building a model that we think best represents the data at hand
In its simplest form, we are drawing a line to characterize the linear relationship between the variables, so that for any value of X we have an estimate of Y (a toy example follows the equation below)
\[
Y = mX + b
\]
Y = Outcome Variable (DV)
m = Slope Term
X = Predictor (IV)
b = Intercept
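As a toy illustration (made-up numbers, not taken from any dataset), plugging values of X into the line gives the corresponding estimates of Y:

# made-up slope and intercept, purely to illustrate the equation
m <- 0.5  # slope
b <- 2    # intercept
x <- c(0, 10, 20)
m * x + b  # estimated Y values: 2, 7, 12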
Regression Equation
Overall, we are building a model that gives us a "best guess" at predicting the outcome
Let’s “science up” the equation a little bit:
\[
Y_i = b_0 + b_1X_i + e_i
\]
This equation is capturing how we are able to calculate each observation ( \(Y_i\) )
\[
\hat{Y_i} = b_0 + b_1X_i
\]
This one will give us the “best guess” or expected value of \(Y\) given \(X\)
Regression Equation
There are two ways to think about our regression equation. They’re similar to each other, but they produce different outputs. \[Y_i = b_{0} + b_{1}X_i +e_i\] \[\hat{Y_i} = b_{0} + b_{1}X_i\]
The model we are building, and any new variables we add to it, are meant to explain variance in our outcome
Expected vs. Actual
\[Y_i = b_{0} + b_{1}X_i + e_i\]
\[\hat{Y_i} = b_{0} + b_{1}X_i\]
\(\hat{Y}\) signifies the model's prediction with no error term: it is the exact value the line predicts for a given X, and we interpret it as the value of Y "on average" at that X
It is important to recognize that \(Y_i - \hat{Y_i} = e_i\): the residual is the gap between the actual and the expected value.
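A minimal sketch (using the built-in mtcars data as a stand-in, not the lecture data) showing that subtracting the expected values from the actual values reproduces the residuals:

fit <- lm(mpg ~ wt, data = mtcars)
y_hat <- fitted(fit)                      # expected values, Y-hat
e <- mtcars$mpg - y_hat                   # actual minus expected
all.equal(unname(e), unname(resid(fit)))  # TRUE: these are the model's residuals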
OLS
How do we find the regression estimates?
Ordinary Least Squares (OLS) estimation
Minimizes deviations
\[ \min\sum_i (Y_{i} - \hat{Y_i})^{2} \]
Other estimation procedures possible (and necessary in some cases)
In order to find the OLS solution, you could try many different coefficients \((b_0 \text{ and } b_{1})\) until you find the one with the smallest sum squared deviation. Luckily, there are simple calculations that will yield the OLS solution every time.
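A minimal sketch of that idea (again using mtcars as stand-in data): numerically searching for the coefficients that minimize the sum of squared deviations lands on essentially the same values that lm() gets from the closed-form OLS calculations:

# sum of squared deviations for a candidate intercept b[1] and slope b[2]
sse <- function(b, x, y) sum((y - (b[1] + b[2] * x))^2)

# numerically search for the minimizing coefficients
opt <- optim(par = c(0, 0), fn = sse, x = mtcars$wt, y = mtcars$mpg)
opt$par                            # approximately 37.3 and -5.3
coef(lm(mpg ~ wt, data = mtcars))  # OLS solution: about 37.285 and -5.344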
In the standardized regression equation (X and Y both expressed as z-scores), when \(X = 0\), \(\hat{Y} = 0\). Our interpretation of the coefficient is that a one-standard-deviation increase in X is associated with a \(b_{yx}^*\) standard-deviation increase in Y. With only one predictor in the model, this standardized regression coefficient is equivalent to the correlation coefficient.
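A quick sketch (mtcars again as stand-in data) checking that claim: standardize both variables, fit the regression, and compare the slope to the correlation:

z_fit <- lm(scale(mpg) ~ scale(wt), data = mtcars)
coef(z_fit)[2]              # standardized slope, about -0.868
cor(mtcars$mpg, mtcars$wt)  # correlation, about -0.868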
Estimating the intercept, \(b_0\)
The intercept serves to adjust for differences in means between X and Y
Put another way, the intercept is where the regression line crosses the y-axis, i.e., the predicted value of Y when X = 0
The intercept adjusts the location of the regression line to ensure that it runs through the point \(\large (\bar{X}, \bar{Y}).\) We can calculate this value using the equation: \[b_0 = \bar{Y} - b_1\bar{X}\]
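The descriptives and correlation below use a log_gdp variable that is not created in the code shown here. As a hedged guess (the exact subset is not shown; an n of 33 is consistent with the 2007 Asian countries in gapminder), the prep might look something like:

library(gapminder)
library(tidyverse)

# Assumed prep, for illustration only: restrict to one year and continent (n = 33)
# and log-transform GDP per capita
gapminder <- gapminder %>%
  filter(year == 2007, continent == "Asia") %>%
  mutate(log_gdp = log(gdpPercap))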
vars n mean sd median min max range skew kurtosis se
log_gdp 1 33 8.74 1.24 8.41 6.85 10.76 3.91 0.21 -1.37 0.22
lifeExp 2 33 70.73 7.96 72.40 43.83 82.60 38.77 -1.07 1.79 1.39
cor(gapminder$log_gdp, gapminder$lifeExp)
[1] 0.8003474
If we regress lifeExp onto log_gdp:
r = cor(gapminder$log_gdp, gapminder$lifeExp)
m_log_gdp = mean(gapminder$log_gdp)
m_lifeExp = mean(gapminder$lifeExp)
s_log_gdp = sd(gapminder$log_gdp)
s_lifeExp = sd(gapminder$lifeExp)
(b1 = r*(s_lifeExp/s_log_gdp))
[1] 5.157259
(b0 = m_lifeExp - b1*m_log_gdp)
[1] 25.65011
How will this change if we regress GDP onto life expectancy?
(b1 = r*(s_lifeExp/s_log_gdp))
[1] 5.157259
(b0 = m_lifeExp - b1*m_log_gdp)
[1] 25.65011
(b1 = r*(s_log_gdp/s_lifeExp))
[1] 0.1242047
(b0 = m_log_gdp - b1*m_lifeExp)
[1] -0.04405086
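As a quick check (assuming a log_gdp variable like the one in the prep sketch above), lm() with the predictor and outcome reversed should reproduce these hand-calculated values:

coef(lm(log_gdp ~ lifeExp, data = gapminder))  # intercept ~ -0.044, slope ~ 0.124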
In R
fit.1 <- lm(lifeExp ~ log_gdp, data = gapminder)
summary(fit.1)
Call:
lm(formula = lifeExp ~ log_gdp, data = gapminder)
Residuals:
Min 1Q Median 3Q Max
-17.314 -1.650 -0.040 3.428 8.370
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.6501 6.1234 4.189 0.000216 ***
log_gdp 5.1573 0.6939 7.433 0.0000000226 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.851 on 31 degrees of freedom
Multiple R-squared: 0.6406, Adjusted R-squared: 0.629
F-statistic: 55.24 on 1 and 31 DF, p-value: 0.00000002263
An observation about heights was part of the motivation to develop the regression equation: If you selected a parent who was exceptionally tall (or short), their child was almost always not as tall (or as short).
library(psychTools)
library(tidyverse)

# Galton's parent/child height data (in inches)
heights = psychTools::galton
mod = lm(child ~ parent, data = heights)
point = 902
heights = broom::augment(mod)

# reference lines at 72 inches mark exceptionally tall parents and children
heights %>%
  ggplot(aes(x = parent, y = child)) +
  geom_jitter(alpha = .3) +
  geom_hline(aes(yintercept = 72), color = "red") +
  geom_vline(aes(xintercept = 72), color = "red") +
  theme_bw(base_size = 20)
Regression to the mean
This phenomenon is known as regression to the mean: when a random variable produces an extreme score on a first measurement, it tends to produce a less extreme score (closer to the mean) on a second measurement.
Regression to the mean
This can be a threat to internal validity if interventions are applied based on first measurement scores.