Final Project
Instructions
The goal of the final project is to apply the data analysis skills you have learned in this course. You will begin with a research question that you will attempt to answer using an open dataset. You will then document your process and report the appropriate analyses. All information should be self-contained and 100% reproducible from the file/folder that you upload.
General Overview
We will be breaking the final project up into sections to help structure the data analytic process.
Part 1 (Due 11/16): Identify 5 datasets that you could work with for the final project
- These cannot be data that we have used in class before
- Data can be from your mentor/lab, but cannot be directly used for your thesis (I just don’t want a 1 to 1 copy from this to your thesis)
Part 2 (Due 11/23): Narrow down the datasets and generate a testable research question/hypothesis
- This should also include the analyses that you would like to perform
Part 3 (Due 12/08): Share your data and analyses in a brief presentation to the group
- You will also submit a written report that includes your data analytic plan and the results.
Project Steps
Part 1 (Due 11/16)
For the first part of your final project, you will need to identify five potential datasets that you could use to answer a research question. These datasets cannot be ones that we have used in class previously. While you may use data from your mentor or lab, the project you submit for this course cannot be a direct copy of your thesis work.
When searching for datasets, consider the following resources:
- Open Access Repositories: There are several online repositories that house freely accessible datasets. Some of these that are specific to psychology include PsychArchives and PsyArXiv. The Open Science Framework (OSF) is another excellent resource for finding datasets across various disciplines, including psychology.
- University and Government-Funded Collections: Many universities and government agencies maintain data repositories. The Inter-university Consortium for Political and Social Research (ICPSR), sponsored by the University of Michigan, is a large archive of social and behavioral science data. The U.S. government also has many open datasets on topics like education and substance abuse that can be useful for psychological research.
- https://guides.library.unt.edu/health-data-analytics/open-datasets
Document Submission: Please submit a document that has the following information for each dataset:
- Source of data (include links)
- Brief description (2-3 sentences) of the data (i.e., what is it measuring)
- Sample size
- Brief rationale for why you think it would be a good choice for this project (3-5 sentences)
Part 2 (Due 11/23)
For the second part of the project, you will select one of the five datasets you identified in Part 1 and develop a testable research question/hypothesis. Your research question should be specific and able to be answered using the data in your chosen dataset.
Your proposal for Part 2 should include the following sections:
- Introduction: Briefly introduce the dataset you have chosen and the general topic you will be investigating. Be sure to include 3+ citations to support your investigation.
- Research Question and Hypothesis: State your research question clearly and concisely. If applicable, formulate a specific, testable hypothesis that you will evaluate with your data analysis.
- Proposed Analyses: Describe the statistical analyses you plan to use to test your hypothesis/research question. Your analyses must include at least 2 tests that we have gone over in class. One of these tests must be a regression.
If your selected dataset has more than 400 participants, you will need to create a random subset of N = 400 for your analyses.
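One way to draw the random subset is sketched below. This is a minimal illustration, not a required method: `full_data` is a simulated placeholder standing in for your imported dataset, and the seed value is arbitrary. Setting a seed before sampling is what makes the same 400 rows come back every time the script is re-run.

```r
# Arbitrary seed (an assumption for this sketch); any fixed value works,
# as long as it is set before sampling so the draw is reproducible.
set.seed(2024)

# Hypothetical placeholder for a large dataset; replace with your own data.
full_data <- data.frame(
  id    = seq_len(1000),
  score = rnorm(1000, mean = 50, sd = 10)
)

target_n <- 400

if (nrow(full_data) > target_n) {
  # Sample row indices without replacement, then keep only those rows
  keep_rows     <- sample(nrow(full_data), size = target_n)
  analysis_data <- full_data[keep_rows, , drop = FALSE]
} else {
  # Datasets with 400 or fewer participants are used as-is
  analysis_data <- full_data
}

nrow(analysis_data)
```

Keeping the subsetting step in the analysis script, rather than saving a pre-cut file by hand, lets a reader see exactly how the subset was produced.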
Part 3 (Due 12/8 - In Class)
The final part of the project consists of a brief presentation to the class and a written report of your data analysis plan and results. Your submission should be completely reproducible: a zipped folder containing all relevant materials.
Presentation (5 mins & 1-2 mins of Q & A)
Your presentation should be a brief overview of your project. It should be well-organized and engaging. This can take the form of a slideshow (5 slides maximum), or any other way that you would like to convey your message. Be sure to include the following:
- Introduction: Briefly introduce your research question and why it is important.
- Methods: Describe the dataset you used and the statistical analyses you performed. Reminder: Your analyses must include at least 2 tests that we have gone over in class. One of these tests must be a regression.
- Results: Present your main findings, using graphs and tables to help illustrate your results.
- Discussion: Interpret your results and discuss their implications.
Do not put walls of text on your slides. Your slides/materials should focus on the visualizations of your data (e.g., scatterplots with regression lines, interaction plots). You should narrate the methods and interpretation, rather than reading them off the slide.
Written Report
12 point font, 1 inch margins, double-spaced, 5 page minimum (excluding tables, figures, references and anything else that isn’t your writing), 7 page maximum
Your written report should be a comprehensive summary of your project. It should be well-structured and include the following sections:
- Introduction: Provide an overview of your research problem and state your research question and hypothesis. Include 3+ citations.
- Methods: Detail the data you used, including its source and any cleaning or preparation steps you took. You should also describe the statistical methods you used in enough detail that another researcher could replicate your analyses.
- Results: Present the results of your analyses in a clear and organized manner. Use tables and figures (APA formatting, or the standard style in your field) to display your results, and be sure to include relevant statistical information (e.g., p-values, confidence intervals, effect sizes).
- Discussion: Interpret your findings in the context of your research question. Discuss the limitations of your study and suggest directions for future research.
- Conclusion: Briefly summarize your main findings and their significance.
In the Methods or Results section, provide information on how you checked the assumptions for your regression (e.g., linearity, homoscedasticity, normality of residuals) and what the results of those checks were. If assumptions were violated, explain how you addressed them. Use check_model() from easystats for these checks.
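The assumption checks described above might look like the following sketch. It fits a placeholder model on R's built-in `mtcars` data (an assumption for illustration, not part of the assignment) and calls `check_model()` from the performance package, which is one of the packages in the easystats collection. If performance or its plotting companion see is not installed, the sketch falls back to base R diagnostic plots.

```r
# Placeholder model on built-in data; swap in your own outcome and predictors.
model <- lm(mpg ~ wt + hp, data = mtcars)

has_easystats_tools <- requireNamespace("performance", quietly = TRUE) &&
  requireNamespace("see", quietly = TRUE)

if (has_easystats_tools) {
  # Draws a panel of diagnostic plots (linearity, homogeneity of variance,
  # influential observations, normality of residuals, and others)
  print(performance::check_model(model))
} else {
  # Base R fallback: the four standard lm diagnostic plots
  old_par <- par(mfrow = c(2, 2))
  plot(model)
  par(old_par)
}
```

The plots are a starting point; the report should still state in words what each check showed and what was done about any violation.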
APA Style: “APA Formatting” for results is strict. Do not copy-paste raw software output (like summary()) directly into the text. You must generate a clean table (using packages like sjPlot, apaTables, stargazer, or gt) that reports b, SE, t, p, and R².
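As a rough sketch of what goes into such a table, the code below pulls b, SE, t, and p out of a fitted model and collects them in a tidy data frame, with R² stored separately. The model and the `mtcars` data are placeholders chosen for illustration. The final step hands the data frame to `gt::gt()` only if gt is installed; any of the packages named above could be used instead, and the result will still need APA polish (for example, p-value formatting).

```r
# Placeholder model on built-in data; replace with your own regression.
model         <- lm(mpg ~ wt + hp, data = mtcars)
model_summary <- summary(model)

# Coefficient matrix with columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- model_summary$coefficients

results_table <- data.frame(
  Predictor = rownames(coefs),
  b         = round(coefs[, "Estimate"], 2),
  SE        = round(coefs[, "Std. Error"], 2),
  t         = round(coefs[, "t value"], 2),
  p         = format.pval(coefs[, "Pr(>|t|)"], digits = 2, eps = 0.001),
  row.names = NULL
)

# R-squared is reported once for the whole model, e.g., in a table note
r_squared <- round(model_summary$r.squared, 2)

print(results_table)

# Optional: turn the data frame into a formatted table if gt is available
if (requireNamespace("gt", quietly = TRUE)) {
  formatted_table <- gt::gt(results_table)
  formatted_table <- gt::tab_source_note(
    formatted_table,
    source_note = paste0("R2 = ", r_squared)
  )
}
```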
Your final submission should be a single folder that is 100% reproducible, meaning that all of your data and code should be included and organized in a way that someone else could easily re-run your analyses.
Helpful Tips
The “Run-All” Test: Before zipping your folder, test your code in a fresh R session (close everything and re-open the project). If your code references a file path like C:/Users/YourName/Downloads/Project, it will not work on anyone else’s computer.
- Tip: Use relative paths (e.g., import(here("Data", "myDataFile.csv"))) and a project-based workflow.
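The relative-path tip can be sketched as follows. The file name `myDataFile.csv` is the placeholder from the tip above, and the `Data` folder name is an assumption. The sketch uses base `read.csv()` rather than `import()` so it runs without extra packages, and it falls back to `file.path()` if here is not installed.

```r
# Build the path relative to the project root instead of hard-coding
# something like C:/Users/YourName/...
if (requireNamespace("here", quietly = TRUE)) {
  data_path <- here::here("Data", "myDataFile.csv")
} else {
  # Fallback: relative to the current working directory
  data_path <- file.path("Data", "myDataFile.csv")
}

if (file.exists(data_path)) {
  my_data <- read.csv(data_path)
} else {
  # The placeholder file will not exist until you add your own data
  message("No file found at: ", data_path)
}
```

Folder names are case-sensitive on some operating systems, so the name in the code should match the folder on disk exactly.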
Folder Structure: Here is a suggested file structure:
- /Data (Raw data files)
- /Scripts (Analysis code, e.g., .R, .Rmd, .qmd)
- /Output (The final report in PDF or Word)
Code Quality: Annotate your code. Comments should explain why a step is being taken (e.g., # Log transforming income to correct for skewness) rather than just what the code is doing.
Page Count Clarification: The 5-7 page requirement refers to prose text. Large figures, massive tables, and bibliography do not count toward the minimum.
Rubric
Total Points: 200
Reproducibility & Code Quality (40 Points)
Focus: Can the instructor download the folder and run the analysis without editing a single line of code?
| Criteria | Proficient (34–40 pts) | Competent (26–33 pts) | Needs Improvement (0–25 pts) |
| Functionality & Organization | Code runs immediately upon download without error on a fresh environment. Uses relative paths or here() correctly. | Code runs but requires minor troubleshooting (e.g., installing a missing package, fixing a hard-coded path like C:/Users/...). | Code breaks significantly, references local hard drives, or necessary files are missing from the folder. |
Data Preparation (20 Points)
Focus: Did you prepare the data correctly before analyzing?
| Criteria | Proficient (18–20 pts) | Competent (14–17 pts) | Needs Improvement (0–13 pts) |
| Cleaning & Sampling | Random subset (N=400) created correctly (if applicable). Cleaning steps (handling NAs, recoding variables) are justified and executed efficiently. | Subset created but method is unclear/not reproducible. Minor errors in data cleaning (e.g., missed an obvious outlier). | No subsetting performed on large data. Data is “dirty” (e.g., typos in factor levels) affecting the analysis results. |
Statistical Rigor (50 Points)
Focus: Does the logic follow and build a cohesive statistical argument?
| Criteria | Proficient (44–50 pts) | Competent (34–43 pts) | Needs Improvement (0–33 pts) |
| Analysis Selection | Chosen analyses (must include regression) are perfectly aligned with the research question and data type. | Analyses are generally appropriate, but a better model existed (e.g., used standard linear regression on a binary outcome). | Analysis does not address the hypothesis or is statistically invalid for the data type. |
| Assumption Checks | All relevant assumptions (normality, linearity, homoscedasticity, etc.) are explicitly tested, reported, and handled if violated. | Assumptions are mentioned but not thoroughly tested, or violations are noted but ignored in the final model. | Assumptions are completely ignored or misunderstood. |
| Interpretation | Interpretation of coefficients, p-values, confidence intervals, and effect sizes is accurate and nuanced. | Interpretation is generally correct but relies too heavily on “significance” (p<.05) rather than effect size or practical significance. | Fundamental misunderstanding of statistical output (e.g., interpreting p>.05 as “proof” of no effect). |
Presentation (30 Points)
Focus: Can you synthesize the project in a concise way to an audience?
| Criteria | Proficient (26–30 pts) | Competent (20–25 pts) | Needs Improvement (0–19 pts) |
| Communication | Engaging, clear, and well-paced (within 7 min). Speaker demonstrates mastery of the material during Q&A. | Clear but relies heavily on reading notes. Slightly over or under the time limit. | Reading directly from slides (wall of text). Unable to answer basic questions about their own study. |
| Visual Aids | Slides are visual-heavy and support the narrative. Graphs are clean, large, and legible. | Slides are text-heavy. Graphs are present but small, pixelated, or unformatted default outputs. | Slides are disorganized, confusing, or missing. |
Written Report Content (60 Points)
Focus: Can you communicate the findings in a written format?
| Criteria | Proficient (52–60 pts) | Competent (40–51 pts) | Needs Improvement (0–39 pts) |
| Introduction & Logic | Clear narrative arc from problem identification to hypothesis. Citations (3+) are highly relevant and integrated well. | Hypothesis is stated, but the background justification is weak, disjointed, or citations are loosely related. | Hypothesis is missing or unclear. Introduction is disorganized or lacks citations. |
| Formatting (APA) | Tables and Figures are perfect APA style. No raw code output (e.g., R console text) in the document. Prose is professional and academic. | Minor APA errors in tables/figures. Some raw software output included in the text. | Major formatting errors. Figures are unreadable, missing labels, or screen-capped from software. |
| Discussion | Deeply contextualizes results within the field. Discusses limitations honestly and provides insightful future directions. | Summarizes results well but lacks depth in “implications.” Limitations are generic (e.g., “sample size could be bigger”). | Discussion just repeats the results section in words. No limitations mentioned. |