Lab 11: Categories & Regression
Instructions
Here is the data you will need for this lab:
Data is pulled from Kaggle.
When you are finished, click the Knit button to turn your work into an HTML document. You will submit both this .Rmd file and the 🧶knitted .html file.
Scenario & Goal
We are getting back to using R! 🥳 This lab focuses on examining categorical predictors, similar to what we did in class this week.
The data for the lab contains information by country (Entity) and year for the overall prevalence of various forms of mental health/psychopathology. We are seeking to examine differences that may arise as a result of country and year.
Task 1: Setup & Data Inspection (15 points)
Your first and most important task is always to understand and inspect your data before running any analyses.
Tasks:
Create a new R Markdown file named Week11_Lab.Rmd.
Load Libraries and Data
Initial Inspection:
Use glimpse() or summary() to see the structure of your data.
What are all the countries included in this dataset? What is the year range?
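The inspection questions above can be answered with a few one-liners. The sketch below is only an illustration on a made-up toy data frame: Entity comes from the lab description, while Year, depression, and anxiety are assumed column names that may differ in the real file. It uses base R's str(), which plays a similar role to glimpse() from dplyr.

```r
# Toy stand-in for the lab data. All values are made up, and every column
# name except Entity is an assumption about the real file.
mh <- data.frame(
  Entity     = rep(c("Canada", "Japan", "Brazil"), each = 4),
  Year       = rep(c(2000, 2005, 2010, 2015), times = 3),
  depression = c(3.9, 4.0, 4.1, 4.2, 2.9, 3.0, 3.1, 3.0, 4.4, 4.5, 4.6, 4.7),
  anxiety    = c(5.1, 5.2, 5.2, 5.3, 3.0, 3.1, 3.1, 3.2, 6.0, 6.1, 6.2, 6.3)
)

str(mh)      # column types and first few values
summary(mh)  # min, quartiles, mean, max for each numeric column

# Which countries are present, and what span of years do they cover?
countries  <- sort(unique(as.character(mh$Entity)))
year_range <- range(mh$Year)
countries
year_range
```

With the real data, the same two calls (unique() on the country column and range() on the year column) should give what the two questions ask for.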
Visualization:
Generate a visualization of the distribution (histogram or bar plot) across all years for each of the mental health variables (5 total plots).
- These plots will not be separated by country. We want to see the overall distribution of each variable for the whole dataset.
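One way to produce a pooled histogram per variable is to loop over the variable names. This is a minimal sketch on simulated values, not the lab data: the two variable names are placeholders, and the real dataset has five mental health variables. It uses base R's hist(); a ggplot2 histogram would work equally well.

```r
# Simulated toy values; the variable names are hypothetical placeholders.
set.seed(11)
mh <- data.frame(
  depression = rnorm(60, mean = 4, sd = 0.6),
  anxiety    = rnorm(60, mean = 5, sd = 0.9)
)
mh_vars <- c("depression", "anxiety")  # the real lab has 5 such variables

# One histogram per variable, pooled over every country and year.
hists <- lapply(mh_vars, function(v) {
  hist(mh[[v]], main = paste("Distribution of", v), xlab = v)
})
names(hists) <- mh_vars
```

Because the data are not grouped before plotting, each histogram reflects every country-year row at once, which is what this task asks for.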
Summary Stats (2 tables):
- Provide descriptive statistics (Means & SD) in a formatted table (not just using describe()) that is separated by country and includes the mental health variables. This will be collapsed across all years.
- Include a correlation table examining the relationships among all mental health variables for the year 2015. Be sure to include a title and update the labels.
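The numbers behind both tables can be computed as sketched below. This is an illustration on a made-up toy data frame with assumed column names (Year, depression, anxiety), and it stops at the raw numbers: the formatted tables themselves would still need a table function of your choice, for example knitr::kable() with a caption.

```r
# Toy stand-in for the lab data. All values are made up, and every column
# name except Entity is an assumption about the real file.
mh <- data.frame(
  Entity     = rep(c("Canada", "Japan", "Brazil"), each = 4),
  Year       = rep(c(2000, 2005, 2010, 2015), times = 3),
  depression = c(3.9, 4.0, 4.1, 4.2, 2.9, 3.0, 3.1, 3.0, 4.4, 4.5, 4.6, 4.7),
  anxiety    = c(5.1, 5.2, 5.2, 5.3, 3.0, 3.1, 3.1, 3.2, 6.0, 6.1, 6.2, 6.3)
)

# Mean and SD of each variable per country, collapsed across all years.
desc <- aggregate(cbind(depression, anxiety) ~ Entity, data = mh,
                  FUN = function(x) c(mean = mean(x), sd = sd(x)))
desc <- do.call(data.frame, desc)  # flatten to depression.mean, depression.sd, ...
desc

# Correlations among the variables, restricted to rows from 2015.
r_2015 <- round(cor(mh[mh$Year == 2015, c("depression", "anxiety")]), 2)
dimnames(r_2015) <- list(c("Depression", "Anxiety"),
                         c("Depression", "Anxiety"))  # reader-friendly labels
r_2015
```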
Task 2: Categorical Regression (20 points)
We are now interested in seeing if there are differences among the years and countries when it comes to the prevalence of these mental health variables.
Regression by country:
Include 3 different countries and perform 2 regressions to separately predict depression and anxiety (ignoring the year variable; meaning you don’t have to filter by year). For 1 of the regressions, provide a formatted table and write up the results in APA format.
- We should see two regressions, each with the country variable at 3 levels, followed by a table and write-up for 1 of the regressions.
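The general shape of this step is sketched below on a made-up toy data frame. The country names, values, and the column names depression and anxiety are all assumptions for illustration; substitute the three countries you choose and the real variable names.

```r
# Toy stand-in for the lab data. All values are made up, and every column
# name except Entity is an assumption about the real file.
mh <- data.frame(
  Entity     = rep(c("Brazil", "Canada", "Japan", "Kenya"), each = 3),
  Year       = rep(c(2005, 2010, 2015), times = 4),
  depression = c(4.4, 4.6, 4.7, 3.9, 4.1, 4.2, 2.9, 3.1, 3.0, 3.5, 3.6, 3.8),
  anxiety    = c(6.0, 6.2, 6.3, 5.1, 5.2, 5.3, 3.0, 3.1, 3.2, 3.3, 3.4, 3.6)
)

# Keep 3 countries and store country as a factor with exactly 3 levels.
mh3 <- subset(mh, Entity %in% c("Brazil", "Canada", "Japan"))
mh3$Entity <- factor(mh3$Entity)

# Two separate regressions, each with country as the only predictor.
fit_dep <- lm(depression ~ Entity, data = mh3)
fit_anx <- lm(anxiety ~ Entity, data = mh3)

summary(fit_dep)$coefficients  # estimates, SEs, t values, p values
summary(fit_anx)$coefficients
```

Calling factor() after filtering matters when the column was already a factor, since unused levels would otherwise linger in the model.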
Regression by Time:
- Using all countries, include 4 different years and perform 2 regressions to predict the depression and anxiety variables. As above, select one of the regressions, provide a formatted table, and write up the results in APA format.
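If the year column is stored as a number, lm() will fit a single slope for it rather than treating each year as a category. The sketch below, on a made-up toy data frame with assumed column names (Year, depression), shows one way to convert the year to a factor first and how the two models differ.

```r
# Toy stand-in for the lab data. All values are made up, and every column
# name except Entity is an assumption about the real file.
mh <- data.frame(
  Entity     = rep(c("Brazil", "Canada", "Japan"), each = 5),
  Year       = rep(c(1995, 2000, 2005, 2010, 2015), times = 3),
  depression = c(4.2, 4.4, 4.5, 4.6, 4.7,
                 3.8, 3.9, 4.0, 4.1, 4.2,
                 2.8, 2.9, 3.0, 3.1, 3.0)
)

# Keep 4 years across all countries, then make year categorical.
mh4 <- subset(mh, Year %in% c(2000, 2005, 2010, 2015))
mh4$Year_f <- factor(mh4$Year)

fit_cat <- lm(depression ~ Year_f, data = mh4)  # year as a 4-level category
fit_num <- lm(depression ~ Year, data = mh4)    # year as a numeric slope

coef(fit_cat)  # intercept plus one coefficient per non-reference year
coef(fit_num)  # intercept plus a single slope
```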
Task 3: Follow-up questions (10 points)
Please answer the following questions about this dataset and categorical predictors.
For each of your regressions, which group was your reference group? How do you know?
Based on the results that you have identified with your regressions, report an overarching conclusion about mental health rates over time/countries.
A researcher is studying the effect of a student’s undergraduate major on their starting salary after graduation. The major variable has four levels: “Psychology”, “Biology”, “Engineering”, and “English”.
If you were to include this variable in a regression model using dummy coding, how many new dummy variables would you need to create? If you set “Psychology” as the reference group, what values would each dummy variable have for a student who majored in “Engineering”?
In a linear regression model predicting students’ self-reported well_being scores from their year_in_school (a categorical variable with levels: “First-year”, “Sophomore”, “Junior”, “Senior”), “First-year” is the reference group.
The model output shows a coefficient of -2.5 for the year_in_schoolSenior dummy variable. In one clear sentence, what is the correct interpretation of this coefficient?
Imagine you refit the model from the previous question, but this time you set "Senior" as the reference group instead of "First-year".
Would you expect the coefficient for the year_in_schoolFirst-year dummy variable in this new model to be positive, negative, or zero? Explain your reasoning in one sentence.
Formatting (5 points)
Formatting includes having clear code and statements in your documents. Avoid including unnecessary information.
Be sure to submit both a complete .Rmd file and a complete HTML file. Failure to include both will result in an automatic 10-point deduction.