Unit 4 - Data Wrangling

PSYC 640 - Fall 2024

Dustin Haraden, PhD

Reminders

  • Journal Entries

  • Reverse Results - Due 9/17

Last Class

  • Entry into the tidyverse with dplyr

Today…

Continuing on this trend to get more comfortable with all that dplyr can do!

This is a nice website that walks through a lot of this as well

# File management
library(here)
# for dplyr, ggplot2
library(tidyverse)
#Loading data
library(rio)

#Remove Scientific Notation 
options(scipen=999)

But first…

We need to revisit how to read the ggplot2 cheatsheet!

A lot of cheatsheets will have something that looks like this:

c <- ggplot(mpg, aes(hwy))

## Then later on it will say

c + geom_histogram(binwidth = 5)

WHAT IS c?!

c is just an object that is referring to the first line that we have in a ggplot function

c <- ggplot(mpg, aes(hwy))

## So this...
c + geom_histogram(binwidth = 5)

## Is the same as this...
mpg %>% 
  ggplot(aes(hwy)) + 
  geom_histogram(binwidth = 5)

Back to regular scheduled programming

Let’s first start by opening our Project

Then, create a new Notebook/Markdown Document that we will use for today

Setup the libraries and bring in the data

  • We will use the “Sleep Data” file
#import sleep data - Your path may be different
sleep_data <- import(here("lectures", "data", "Sleep_Data.csv"))

select()

Variable Renaming

Ooof…These variable names look terrible! How do we update those? Our brains would break having to remember how to connect them and having to reference another document. 

First, let’s get rid of the variables that we don’t need for the moment: 

  • Use select() to remove the following variables: 

    sleep_data <- sleep_data %>% 
      select(-c('StartDate', 'EndDate', 'Status', 'Progress', 
      'Duration__in_seconds_', 'Finished', 'RecordedDate',
      'DistributionChannel', 'UserLanguage', 'SC0', 'SC1',
      'SC2', 'SC3', 'SC4', 'SC5', 'Attention'))

Now we can update the names

  • Using the names() command, we can rename all columns in the file 

  • Here is a list so that it makes it easier to do instead of having to type everything out…that would just be cruel

names(sleep_data) <- c('age', 'gender', 'roommate', 'other_sleep', 'bed_read', 'bed_study', 'bed_hw', 'attention1', 'bed_internet',
                       'bed_tv', 'bed_eat', 'bed_friends', 'bed_videogame', 'bed_readp', 'BL_1', 'BL_2', 'BL_3', 'BL_4',
                       'BL_5', 'attention2', 'BL_6', 'BL_7', 'sleepsat1', 'sleepsat2', 'sleepsat3', 'sleepsat4', 'sleepsat5',
                       'sleepsat6', 'ESS1', 'ESS2', 'ESS3', 'ESS4', 'ESS5', 'ESS6', 'ESS7', 'ESS8', 'ESS00', 'ashs1', 'ashs2', 
                       'ashs3', 'ashs4', 'ashs5', 'attention3', 'ashs6', 'ashs7', 'ashs8', 'ashs9', 'ashs10', 'ashs11', 
                       'ashs12', 'ashs13', 'ashs14', 'ashs15', 'ashs16', 'ashs17', 'attention4', 'ashs18', 'ashs19', 
                       'ashs20', 'ashs21', 'ashs22', 'ashs23', 'ashs24', 'ashs26', 'ashs27', 'ashs28', 'attention5', 
                       'ashs29', 'ashs30', 'ashs31', 'ashs32', 'ashs33', 'id')

filter()

Extract Rows

The ‘filter()’ function is used to subset observations based on their values.

The result of filtering is a data frame with the same number of columns as before but fewer rows.

The first argument is data and subsequent arguments are logical expressions that tell you which observations to retain in the data frame.

filter(starwars, hair_color == "none" | eye_color == "black")
# A tibble: 39 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 2 Greedo      173    74 <NA>       green      black           44   male  mascu…
 3 IG-88       200   140 none       metal      red             15   none  mascu…
 4 Bossk       190   113 none       green      red             53   male  mascu…
 5 Lobot       175    79 none       light      blue            37   male  mascu…
 6 Ackbar      180    83 none       brown mot… orange          41   male  mascu…
 7 Nien Nu…    160    68 none       grey       black           NA   male  mascu…
 8 Nute Gu…    191    90 none       mottled g… red             NA   male  mascu…
 9 Jar Jar…    196    66 none       orange     orange          52   male  mascu…
10 Roos Ta…    224    82 none       grey       orange          NA   male  mascu…
# ℹ 29 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Filter observations

We can now generate a subset of observations based on a particular value

Since the age variable has been giving us some trouble, let’s filter based on that variable to only include appropriate ages for “college aged” participants

# Assigning it to a new variable
mod_sleep_data <- sleep_data %>% 
  filter(age >= 17, age <= 25)

Filter Obs. - Check

Generating a bar chart to see if it worked out!

mod_sleep_data %>% 
  ggplot(aes(age)) + 
  geom_bar()

mutate()

Compute new variable

We often need to make a sum/mean score for a variable of interest

The mutate() function is most commonly used to add new columns to your data frame that are functions of existing columns.

mutate() requires data as its first argument, followed by a set of expressions defining new columns.

  • Note: New variables are automatically added at the end of the data frame (scroll to the right to see them)

For example, in the mod_sleep_data, we have the Epworth Sleepiness Scale

Take a look at the scoring of the ESS and compute the total score labelled ess_total

ess_total

It is a sum score with a scale of 0 - 24

# We will assign it to the same dataset
mod_sleep_data <- mod_sleep_data %>% 
  mutate(ess_total = ESS1 + ESS2 + ESS3 + 
           ESS4 + ESS5 + ESS6 + ESS7 + ESS8)

Visualize

mod_sleep_data %>% 
  ggplot(aes(ess_total)) + 
  geom_bar()
  • But why are there some above 24??

ESS Variables

The original items are supposed to be on a 0-3 scale

The data that we have are on a 1-4 scale

What should we do?!

ESS Rescale

We could subtract 1 from each item before creating another sum

mod_sleep_data <- mod_sleep_data %>% 
  mutate(ESS1m1 = ESS1 - 1, 
         ESS2m1 = ESS2 - 1,
         ESS3m1 = ESS3 - 1,
         ESS4m1 = ESS4 - 1,
         ESS5m1 = ESS5 - 1,
         ESS6m1 = ESS6 - 1,
         ESS7m1 = ESS7 - 1, 
         ESS8m1 = ESS8 - 1) %>% 
  mutate(ess_total2 = ESS1m1 + ESS2m1 + ESS3m1 + 
           ESS4m1 + ESS5m1 + ESS6m1 + ESS7m1 + ESS8m1)

ESS Rescale

Or we could combine all the 1’s that we are subtracting and just take that from the total score

mod_sleep_data <- mod_sleep_data %>% 
  mutate(ess_total = ESS1 + ESS2 + ESS3 + 
           ESS4 + ESS5 + ESS6 + ESS7 + ESS8) %>% 
  mutate(ess_total3 = ess_total-8)

ESS Rescale

Let’s see what the differences are between these methods

mod_sleep_data %>% 
  select(contains("_total")) %>% 
  head()
  ess_total ess_total2 ess_total3
1        20         12         12
2        20         12         12
3        16          8          8
4        20         12         12
5        13          5          5
6        22         14         14

Rescaling Conclusion

It is good to always know what the original scaling needs to be (e.g., 1-4 or 0-3)

Extremely important to visualize the data! ALWAYS VISUALIZE

There is also another way to allow for the sum() function in mutate(), but we’ll get to that later

Attention Checks

How do we know participants were paying attention?

You may have noticed the attention variables throughout.

These are “attention checks” that researchers put in to…um….check the attention of the participants. It’s right in the name.

For this survey, they were nonsensical questions that were embedded to look like the others.

Var Name Question Responses
attention1 How often do you breathe in bed? 1-5 (Never to Always)
attention2 How often have you used a screen device to steal over a million dollars? 1-5 (Never to Every Day)
attention3 After 6:00 in the evening…Choose 60%

1-6 (Never to Always)

4 = 60%

attention4 I go to bed and know that a magical train will literally take me to see Santa. 1-6 (Never to Always)
attention5 I have gone 30 days or more without sleeping at all. 1-6 (Never to Always)

Recoding Variables

Using recode()

We can combine our mutate() with this new function to produce the results we want

Nice website to explain the process

  • When using recode(), we have to first tell it the variable we want to recode

  • The next arguments are how we want to recode the values

    recode(variable_name, "old_value" = "new_value")

Putting it together

Now let’s put this into action by recoding the attention variables

Getting a 1 indicates that they were paying attention

mod_sleep_data %>% 
  mutate(attention1r = recode(attention1, 
                              `1`=0, #NOTICE THE BACKTICKS
                              `2`=0,
                              `3`=0,
                              `4`=0,
                              `5`=1)) #be sure to finish the rest

Create composite attention score

Let’s make a total score for the attention variable so that we can get rid of folks who are not paying attention

mod_sleep_data <- mod_sleep_data %>% 
  mutate(att_sum = (attention1r + attention2r + attention3r +
                      attention4r + attention5r))

OR

mod_sleep_data <- mod_sleep_data %>% 
  rowwise() %>% 
  mutate(att_sum = sum(attention1r, attention2r, attention3r,
                      attention4r, attention5r)) %>% 
  ungroup()

Visualize

mod_sleep_data %>% 
  ggplot(aes(att_sum)) + 
  geom_bar() + 
  geom_text(stat = "count", #Tells it to calculate the statistic
            aes(label=after_stat(count))) + #Puts label of the count
  theme_minimal() #or theme_bw

Your turn!

Steps to do if we have time

Create two new datasets labeled (1) “data_attend” and (2) “data_distract”. In each dataset have those who were paying attention in the “data_attend” and those who were not in the “data_distract”. Paying attention is operationalized as having a score of 5 on the aggregated variable.

Get the sample size of each of these datasets (use Google to search for things like number of rows)

What is the mean/average of ess_total in each of the samples?

Next Class

Starting to look at descriptive statistics and how to make these nice tables

Will begin Lab 1 (as long as we have time to do so)