PSYC 640 - Fall 2023
Linear Relationships - Correlation
Pearson Correlation
Spearman’s Rank Correlation
Missing Data
Creating correlation matrices
Examine the relationship between two continuous variables
Like the mean and standard deviation, it is a descriptive statistic, but it describes the relationship between two variables rather than a single variable
Typically displayed as a scatterplot
Before we talk about correlation, we need to take a look at covariance
\[ cov_{xy} = \frac{\sum(x-\bar{x})(y-\bar{y})}{N-1} \]
Covariance can be thought of as the “average cross product” between two variables
It captures the raw/unstandardized relationship between two variables
Covariance matrix is the basis for many statistical analyses
Let’s take a look back at the data from before and compute the covariance
What does having a covariance of 7.3 actually mean though?
We have to interpret the covariance in terms of the units present (x = # of houses and y = amount of candy)
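A quick sketch in R; the numbers here are made up for illustration, so the covariance will not be the 7.3 from the slide:

# Hypothetical data: x = # of houses visited, y = amount of candy collected
x <- c(3, 5, 7, 9, 12)
y <- c(10, 14, 20, 24, 33)

# Covariance "by hand", following the formula above
sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)

# Same value from the built-in function
cov(x, y)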
The Pearson correlation coefficient \(r\) addresses this by standardizing the covariance
It is done in the same way that we would create a \(z-score\)…by dividing by the standard deviation
\[ r_{xy} = \frac{Cov(x,y)}{sd_x sd_y} \]
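Reusing the hypothetical vectors from the covariance sketch, we can check that standardizing the covariance reproduces cor():

x <- c(3, 5, 7, 9, 12)
y <- c(10, 14, 20, 24, 33)

cov(x, y) / (sd(x) * sd(y))  # covariance divided by both standard deviations
cor(x, y)                    # identical value from the built-in function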
Tells us: How much 2 variables are linearly related
Range: -1 to +1
Most common and basic effect size measure
Is used to build the regression model
Correlation | Strength | Direction |
---|---|---|
-1.0 to -0.9 | Very Strong | Negative |
-0.9 to -0.7 | Strong | Negative |
-0.7 to -0.4 | Moderate | Negative |
-0.4 to -0.2 | Weak | Negative |
-0.2 to 0 | Negligible | Negative |
0 to 0.2 | Negligible | Positive |
0.2 to 0.4 | Weak | Positive |
0.4 to 0.7 | Moderate | Positive |
0.7 to 0.9 | Strong | Positive |
0.9 to 1.0 | Very Strong | Positive |
We typically compare our correlations to the null value of 0
Hypotheses:
\(H_0: r_{xy} = 0\)
\(H_1:r_{xy} \neq 0\)
Assumptions:
Observations are independent
Linear Relationship
When comparing to 0, we can use the steps similar to a t-test
Calculate the test statistic (df = N - 2):
\[ t = \frac{r}{\sqrt{\frac{1-r^2}{N-2}}} \]
Then follow the typical steps for a t-test!
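As a sketch with made-up values (r = .30, N = 50):

r <- 0.30  # hypothetical correlation
N <- 50    # hypothetical sample size

t_stat <- r / sqrt((1 - r^2) / (N - 2))    # test statistic from the formula above
p_val  <- 2 * pt(-abs(t_stat), df = N - 2) # two-tailed p-value
t_stat
p_val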
It isn’t always that easy though…
We were able to use the \(t-distribution\) previously because we assumed the null was 0. However, we cannot do that when:
The null \(\neq\) 0
Calculating Confidence Intervals for correlations
Comparing two correlations against each other
Will need to transform the \(r\) to a \(z\) using Fisher’s r to z’ transformation (beyond this class)
\[ z' = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right) \]
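Although the details are beyond this class, the transformation is one line in R: atanh() computes exactly the formula above. A sketch using the same hypothetical r = .30 and N = 50, building a 95% CI on the z' scale (standard error 1/sqrt(N - 3)) and back-transforming with tanh():

r <- 0.30
N <- 50

z_prime <- atanh(r)                        # same as 0.5 * log((1 + r) / (1 - r))
se_z    <- 1 / sqrt(N - 3)                 # standard error of z'
ci_z    <- z_prime + c(-1.96, 1.96) * se_z # 95% CI on the z' scale
tanh(ci_z)                                 # back-transformed CI for r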
Now how do we get a correlation value in R?
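A sketch, assuming the survey data are loaded in a data frame called sleep_data (a hypothetical name) with the variables used in the matrices below:

cor(sleep_data$ageyears, sleep_data$sleep_hours_schoolnight,
    use = "complete.obs")  # returns just the coefficient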
That will give us the correlation, but we also want to know how to get our p-value
To get the test of a single pair of variables, we will use the cor.test() function:
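Continuing the hypothetical sleep_data example:

cor.test(sleep_data$ageyears, sleep_data$sleep_hours_schoolnight)
# output includes r, the t statistic, df = N - 2, the p-value, and a 95% CI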
So far we have been looking at single variables, but we often care about the relationships between multiple variables in a dataset
| | ageyears | sleep_hours_schoolnight | sleep_hours_non_schoolnight | reaction_time | score_in_memory_game |
|---|---|---|---|---|---|
| ageyears | 1 | NA | NA | NA | NA |
| sleep_hours_schoolnight | NA | 1 | NA | NA | NA |
| sleep_hours_non_schoolnight | NA | NA | 1 | NA | NA |
| reaction_time | NA | NA | NA | 1 | NA |
| score_in_memory_game | NA | NA | NA | NA | 1 |
Listwise Deletion (complete cases): drop a participant entirely if they are missing any of the variables in the matrix
Pairwise Deletion: use all available observations for each pair of variables, so each correlation can be based on a different N (both are shown in the sketch and matrices below)
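Both options are available through the use argument of cor(); a sketch, again assuming the numeric survey variables live in the hypothetical sleep_data:

# Listwise: drop a row from ALL correlations if it has any missing value
cor(sleep_data, use = "complete.obs")

# Pairwise: use every available observation for each pair of variables
cor(sleep_data, use = "pairwise.complete.obs")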
Listwise deletion:

| | ageyears | sleep_hours_schoolnight | sleep_hours_non_schoolnight | reaction_time | score_in_memory_game |
|---|---|---|---|---|---|
| ageyears | 1.0000000 | -0.3771818 | 0.0259458 | 0.0582001 | -0.1140799 |
| sleep_hours_schoolnight | -0.3771818 | 1.0000000 | 0.0563271 | 0.0549937 | 0.0533009 |
| sleep_hours_non_schoolnight | 0.0259458 | 0.0563271 | 1.0000000 | -0.0908150 | 0.0726865 |
| reaction_time | 0.0582001 | 0.0549937 | -0.0908150 | 1.0000000 | -0.0069244 |
| score_in_memory_game | -0.1140799 | 0.0533009 | 0.0726865 | -0.0069244 | 1.0000000 |
Pairwise deletion:

| | ageyears | sleep_hours_schoolnight | sleep_hours_non_schoolnight | reaction_time | score_in_memory_game |
|---|---|---|---|---|---|
| ageyears | 1.0000000 | -0.3756825 | 0.0203699 | 0.0525420 | -0.0531333 |
| sleep_hours_schoolnight | -0.3756825 | 1.0000000 | 0.0593331 | 0.0548024 | 0.0532172 |
| sleep_hours_non_schoolnight | 0.0203699 | 0.0593331 | 1.0000000 | -0.0894077 | 0.0726865 |
| reaction_time | 0.0525420 | 0.0548024 | -0.0894077 | 1.0000000 | -0.0069010 |
| score_in_memory_game | -0.0531333 | 0.0532172 | 0.0726865 | -0.0069010 | 1.0000000 |
Pearson’s r focuses on linear relationships: how closely the data fall along a single straight line
Sometimes a relationship is consistently increasing or decreasing without being linear; we need to be able to capture this different (ordinal) “relationship”
Instead of using the raw values of the variables (e.g., “hours studied”), we rank the values from least (rank = 1) to most (rank = N)
Then we correlate the rankings with one another
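In R this is just the method argument; a sketch with made-up values where the relationship is monotonic but not linear:

hours <- c(1, 2, 4, 8, 16, 32)     # hypothetical hours studied
score <- c(50, 62, 70, 76, 80, 82) # hypothetical exam scores

cor(hours, score, method = "spearman") # correlates the ranks: exactly 1 here
cor(hours, score)                      # Pearson is lower; the trend is not a straight line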
Who were those white dudes that started this?
The concept of the correlation is primarily attributed to Sir Francis Galton
The correlation coefficient was developed by his student, Karl Pearson, and adapted into the ANOVA framework by Sir Ronald Fisher
Should we never use the correlation, or the later techniques built on it? Of course not.
Acknowledge this history? Certainly.
Understand how the perspectives of Galton, Pearson, Fisher, and others shaped our practices? We must! These practices are not set in stone, nor are they necessarily the best way to move forward.
Statistics are often thought of as being free from bias…they are “just numbers”
Statistical significance was a way to avoid talking about nuance or degree.
“Correlation does not imply causation” was a refutation of work demonstrating associations between environment and poverty.
Need to be particularly mindful of our goals as scientists and how they can influence the way we interpret the findings
Earlier, we used the cor() function to create a correlation matrix of our variables
But what is missing?
| | ageyears | sleep_hours_schoolnight | sleep_hours_non_schoolnight | reaction_time | score_in_memory_game |
|---|---|---|---|---|---|
| ageyears | 1.0000000 | -0.3771818 | 0.0259458 | 0.0582001 | -0.1140799 |
| sleep_hours_schoolnight | -0.3771818 | 1.0000000 | 0.0563271 | 0.0549937 | 0.0533009 |
| sleep_hours_non_schoolnight | 0.0259458 | 0.0563271 | 1.0000000 | -0.0908150 | 0.0726865 |
| reaction_time | 0.0582001 | 0.0549937 | -0.0908150 | 1.0000000 | -0.0069244 |
| score_in_memory_game | -0.1140799 | 0.0533009 | 0.0726865 | -0.0069244 | 1.0000000 |
| | ageyears | sleep_hours_schoolnight | sleep_hours_non_schoolnight | reaction_time | score_in_memory_game |
|---|---|---|---|---|---|
| ageyears | | | | | |
| sleep_hours_schoolnight | -0.377*** | | | | |
| sleep_hours_non_schoolnight | 0.026 | 0.056 | | | |
| reaction_time | 0.058 | 0.055 | -0.091 | | |
| score_in_memory_game | -0.114 | 0.053 | 0.073 | -0.007 | |

Computed correlation used pearson-method with listwise-deletion.
So many different customizations for this type of table
Can add titles, and indicate the missingness handling and correlation method (see the sketch below)
Saves you a TON of time when putting it into a manuscript
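The footnote under the table above (“Computed correlation used pearson-method with listwise-deletion.”) matches the default output of sjPlot::tab_corr(), so a sketch with that function (sleep_data is still a hypothetical name, and treat the argument details as assumptions):

library(sjPlot)

tab_corr(sleep_data,
         na.deletion = "listwise", # or "pairwise"
         corr.method = "pearson",  # or "spearman"
         title = "Correlations among age, sleep, and memory variables")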
It is always important to visualize our data, even after getting the correlations and other descriptives!
Let’s go back to the data that we had in a previous lecture
library(rio)        # for import()
library(tidyverse)  # for %>% and mutate()

data1 <- import("https://raw.githubusercontent.com/dharaden/dharaden.github.io/main/data/data1.csv") %>%
  mutate(dataset = "data1")

data2 <- import("https://raw.githubusercontent.com/dharaden/dharaden.github.io/main/data/data2.csv") %>%
  mutate(dataset = "data2")

data3 <- import("https://raw.githubusercontent.com/dharaden/dharaden.github.io/main/data/data3.csv") %>%
  mutate(dataset = "data3")
And then combine them to make it easier
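The combining code is not shown above; a minimal sketch, assuming dplyr’s bind_rows():

# Stack the three data frames, keeping the dataset label created above
three_data <- bind_rows(data1, data2, data3)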
three_data %>%
group_by(dataset) %>%
summarize(
mean_x = mean(x),
mean_y = mean(y),
std_x = sd(x),
std_y = sd(y),
cor_xy = cor(x,y)
)
# A tibble: 3 × 6
dataset mean_x mean_y std_x std_y cor_xy
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 data1 54.3 47.8 16.8 26.9 -0.0641
2 data2 54.3 47.8 16.8 26.9 -0.0683
3 data3 54.3 47.8 16.8 26.9 -0.0645
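The three datasets have nearly identical means, standard deviations, and correlations, so only plotting reveals how different they are; a minimal sketch with ggplot2 (loaded with the tidyverse):

three_data %>%
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ dataset)  # one scatterplot per dataset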
Let’s try it out in R!