Date: September 15, 2025
Last week: We learned to describe and visualize patterns in our sample.
The Rest of the Course: We want to make claims about the population.
The Problem: How do we know if a pattern we see in our sample (e.g., a difference between groups) is a “real” effect or just random sampling error?
The Solution: We build a statistical model to test our question against the backdrop of randomness.
All research starts with a question about the world.
But how do we get from a messy, real-world question to a clean, statistical answer?
Our Journey Today:
Part 1: From Concept to Data (How do we measure what we care about?)
Part 2: The Blueprint for Claims (How do we structure our study?)
Part 3: From Design to Model (How do we analyze our data?)
Before we can analyze anything, we need data. This means turning an abstract concept into a concrete, measurable variable.
This process is called operationalization.
Your operationalization is your argument for what a concept means in the context of your study.
Let’s imagine a new study. Our research question is:
Does being aggressively attacked by a campus squirrel for your sandwich increase a student’s acute stress?
Predictor: squirrel_incident (Attacked vs. Not Attacked).
Outcome: acute_stress.
But what is “acute stress”? As the researcher, how do you measure it?
Turn to the person/people next to you and identify how you would define and measure “stress”.
| Method of Measurement | How It Works | Data Generated | What It Really Measures |
|---|---|---|---|
| Subjective Self-Report | “On a scale of 0-100, how stressed do you feel right now?” | Continuous (0-100) | The feeling of stress. |
| Autonomic Arousal | Measure heart rate in beats per minute (BPM). | Continuous (e.g., 115 BPM) | The body’s alarm for stress. |
| Cognitive Impairment | “Count backward from 1,084 by 7s.” Measure speed & accuracy. | Continuous (# of errors) | The brain’s processing under stress. |
| Coded Facial Expression | Analyze video for micro-expressions of fear or anger. | Categorical or Ordinal | The face’s expression of stress. |
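To make this concrete, here is a minimal R sketch of what the operationalized data might look like; the variable names (squirrel_incident, acute_stress, heart_rate_bpm) and the values are hypothetical.

```r
# A hypothetical sketch of the operationalized squirrel-stress data.
# Variable names and values are illustrative choices, not fixed by the study.
squirrel_data <- data.frame(
  participant       = 1:6,
  squirrel_incident = factor(c("Attacked", "Attacked", "Attacked",
                               "Not Attacked", "Not Attacked", "Not Attacked")),
  acute_stress      = c(82, 74, 90, 35, 41, 28),    # self-report, 0-100
  heart_rate_bpm    = c(118, 109, 125, 72, 80, 76)  # autonomic arousal, BPM
)
str(squirrel_data)
```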
Now that we know what we’re measuring, let’s talk about how we structure the study. Your design determines the kind of claim you can make.
An Associational Claim: Two variables are related.
A Causal Claim: A change in one variable causes a change in another.
To earn the right to say “X causes Y,” you must demonstrate three things:
Covariance: X and Y are related.
Temporal Precedence: The cause (X) must happen before the effect (Y).
Internal Validity: There are no other plausible explanations (we’ve ruled out confounds).
A confound is a “third variable” that creates a spurious relationship between your two variables. This is the #1 threat to causal claims.
Classic Example: Ice cream sales are positively correlated with drowning deaths.
The Confound: Summer heat! Hot weather causes both more ice cream sales and more swimming.
A True Experiment is the most powerful tool for establishing causality. It has two magic ingredients:
Manipulation: The researcher actively manipulates the Independent Variable (IV).
Random Assignment: Every participant has an equal chance of being in any condition. This breaks the link to potential confounds by distributing them evenly (on average) across groups, as in the sketch below.
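A minimal sketch of random assignment in R, assuming 20 hypothetical participants and our two squirrel conditions:

```r
# Random assignment: shuffle condition labels so that each of the
# 20 hypothetical participants has an equal chance of either condition.
set.seed(42)  # only so the example is reproducible
n <- 20
conditions <- rep(c("Attacked", "Not Attacked"), each = n / 2)
assignment <- sample(conditions)  # random permutation of the labels
table(assignment)
```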
Between-Subjects: Different groups of people get different conditions. We compare Group A vs. Group B.
Within-Subjects: The same group of people experiences all conditions. We compare people to themselves.
Cross-Sectional: All data is collected at a single point in time. (Cannot establish temporal precedence.)
Longitudinal: Data is collected from the same people over multiple time points. (Establishes temporal precedence).
So far, we’ve focused on what we do with our participants (the design).
But there’s another crucial question: Who are our participants and who do they represent?
Population: The entire group we want to draw conclusions about.
Sample: The specific subset of the population that we actually collect data from.
The central question of sampling is generalizability: How well do the findings from our sample apply to the broader population?
Random Sampling (The Ideal)
Every single member of the target population has an equal chance of being selected for the study.
Why it’s the gold standard: It minimizes systematic bias. In the long run, it produces samples that are highly representative of the population.
Convenience Sampling (The Reality)
The researcher includes people who are easy to reach (e.g., using undergraduate students, posting a flier in the clinic, etc.).
The Limitation: Our convenience sample may differ from the population in systematic ways (e.g., younger, wealthier, more college-educated). We must be thoughtful when generalizing from these samples.
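For contrast, a minimal sketch of a simple random sample in R, assuming a hypothetical roster of 5,000 students as the sampling frame:

```r
# Simple random sample: every member of the (hypothetical) population
# roster has an equal chance of being selected.
set.seed(7)
population_roster <- paste0("student_", 1:5000)  # hypothetical sampling frame
random_sample <- sample(population_roster, size = 100)
head(random_sample)
```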
We have our perfectly designed study, and we collect stress data by observing students on the squirrel hunt (let’s ignore the ethics of knowing squirrels might attack and not intervening).
We’ve moved beyond just describing our sample with means and standard deviations.
Now we have to make an inference. We see a difference in stress between the groups… but how do we know if it’s “real” or just random chance?
To answer this, we need to understand the logic of randomness itself. We need to understand probability.
Moving from simply describing our data (with means and standard deviations) to drawing conclusions about the population (using inferential statistics).
There are several possible interpretations of probability, but they (almost) completely agree on the mathematical rules that probability must follow:
\(P(A)\) = probability of event A
\(0 \le P(A) \le 1\)
The Law of Large Numbers (LLN): as more observations are collected, the proportion of occurrences with a particular outcome, \(\hat{p}_n\), converges to the probability of that outcome, \(p\).
Example:
As the sample size increases, the sample mean tends to get closer to the population mean
And as the sample size approaches infinity ♾️, the sample mean approaches the population mean
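A quick simulation makes this visible; the sketch below flips a fair coin 10,000 times and tracks the running proportion of heads:

```r
# Law of Large Numbers: the running proportion of heads in repeated
# fair-coin tosses converges to the true probability of 0.5.
set.seed(123)
n_tosses <- 10000
tosses <- rbinom(n_tosses, size = 1, prob = 0.5)    # 1 = heads, 0 = tails
running_prop <- cumsum(tosses) / seq_len(n_tosses)  # p-hat after each toss

running_prop[c(10, 100, 1000, 10000)]  # wobbly early, close to 0.5 late
plot(running_prop, type = "l", ylim = c(0, 1),
     xlab = "Number of tosses", ylab = "Proportion of heads")
abline(h = 0.5, lty = 2)
```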
When tossing a fair coin, if heads comes up on each of the first 10 tosses, what do you think the chance is that another head will come up on the next toss? 0.5, less than 0.5, or more than 0.5?
H H H H H H H H H H ?
Probability is still 0.5
Coin is not “due” for a tail
Important
The common misunderstanding of the LLN is that random processes are supposed to compensate for whatever happened in the past. This is not true; the mistaken belief is known as the gambler’s fallacy (or the “law of averages”).
Think about statistics reported on the news
We assume the sample is representative of the larger population, but how representative?
Probability theory provides tools to assess how likely sample results are if they differ from the true population parameter
Two main interpretations:
Frequentist: probability is the long-run frequency of an event.
Degree of belief: the probability of an event reflects how strongly we believe it will occur.
Limitations of the frequentist view:
Infinity: an infinite sequence of repetitions is not possible in the physical world.
Narrow in scope: we often want to make statements about a single event, but the long-run definition does not allow it (see the point about infinity above).
So, How Does This Help Us?
The Frequentist approach gives us a powerful framework: We can test our research hypothesis by asking, “How surprising would our sample data be if there were truly no effect in the long run?”
This question is the foundation for the rest of our journey today.
Your measurement and design choices create the blueprint for your analysis. They lead you directly to the correct statistical model.
All models we learn this semester can be expressed in R with a simple formula:
Outcome ~ Predictor
If your goal is to… COMPARE GROUPS:
Predictor (IV) is Categorical
Outcome (DV) is Continuous
👉 You are in the t-test / ANOVA family of models.
The model looks like: Continuous_Outcome ~ Categorical_Predictor
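For example (a sketch with simulated, hypothetical stress scores, not the course dataset):

```r
# Categorical predictor, continuous outcome: the t-test / ANOVA family.
# Simulated, hypothetical stress scores (0-100) for two groups of 15 students.
set.seed(1)
squirrel_study <- data.frame(
  squirrel_incident = rep(c("Attacked", "Not Attacked"), each = 15),
  acute_stress = c(rnorm(15, mean = 70, sd = 12),
                   rnorm(15, mean = 45, sd = 12))
)

t.test(acute_stress ~ squirrel_incident, data = squirrel_study)
summary(aov(acute_stress ~ squirrel_incident, data = squirrel_study))
```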
If your goal is to… ASSESS AN ASSOCIATION:
Predictor (IV) is Continuous
Outcome (DV) is Continuous
👉 You are in the Correlation / Regression family of models.
The model looks like: Continuous_Outcome ~ Continuous_Predictor
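For example (again a sketch with simulated, hypothetical values):

```r
# Continuous predictor, continuous outcome: the correlation / regression family.
# Simulated, hypothetical heart rate (BPM) and stress scores for 30 students.
set.seed(2)
heart_rate_bpm <- rnorm(30, mean = 90, sd = 15)
acute_stress   <- 20 + 0.5 * heart_rate_bpm + rnorm(30, sd = 10)

cor.test(heart_rate_bpm, acute_stress)
lm(acute_stress ~ heart_rate_bpm)
```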
So we have a model, like: depression_score ~ therapy_group. How do we test it?
We use the Null Hypothesis Significance Testing (NHST) framework:
The Null Hypothesis \(H_0\): A statement that the predictor has no relationship with the outcome in the population. It is the “null model.”
\(H_0\): therapy_group does not predict depression_score.
The Alternative Hypothesis \(H_A\): Our research hypothesis. The predictor does have a relationship with the outcome.
\(H_A\): therapy_group does predict depression_score.
The p-value tells us how surprising our sample data would be if the null hypothesis (the “no relationship” model) were true for the population.
It’s a measure of the incompatibility between our data and the null model.
The statistical test you run (t.test(), cor.test(), lm()) is just the engine that calculates the p-value for your specific model. The logic is always the same.
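A minimal sketch of the whole chain, using simulated data for the hypothetical therapy example; note that lm() and t.test() are two engines for the same two-group model and give equivalent p-values:

```r
# Simulate the hypothetical therapy example and test
# depression_score ~ therapy_group under the NHST framework.
set.seed(2025)
therapy_group <- rep(c("therapy", "control"), each = 30)
depression_score <- rnorm(60,
                          mean = ifelse(therapy_group == "therapy", 10, 14),
                          sd = 4)

fit <- lm(depression_score ~ therapy_group)
summary(fit)  # the p-value on therapy_group tests H0: no relationship

# t.test() is a different engine for the same two-group model and gives
# an equivalent p-value (with the equal-variance assumption).
t.test(depression_score ~ therapy_group, var.equal = TRUE)
```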
But…
Your Study Design tells you if you can call that relationship causal (requires random assignment, temporal precedence).
Your Sampling Method tells you if you can generalize that relationship to a broader population.
A p-value or statistical test cannot fix a flawed design!