Figures: Using ggplot2

PSYC 640 - Fall 2023

Dustin Haraden, PhD

Last Class

  • Comparing Means: \(t\)-test
    • Independent Samples \(t\)-test Review
    • Paired Samples \(t\)-test

Looking Ahead

  • Plan to have 2 more labs that will be similar to the last lab

    • Likely take place on 10/25 and sometime the week of 11/13
  • Outside of these labs, I am going to plan on having additional mini-labs

    • Likely to take place on 11/1, 11/22 and 11/29 (will update based on how things are going in class)

Today…

Working with ggplot2 to get some really fancy visualizations!

Maybe integrating some generative AI (ChatGPT) to help us out too

# File management
library(here)
# for dplyr, ggplot2
library(tidyverse)
#Loading data
library(rio)

#for the penguins dataset
#install.packages('palmerpenguins')
library(palmerpenguins)

#Remove Scientific Notation 
options(scipen=999)

Take a look at the data

Will be using a dataset from the palmerpenguins library (link) which is a dataset about…penguins. This function will pull that data into our environment:

data(penguins)

Now we can get some descriptive statistics:

Code
penguins %>% 
  group_by(species) %>% 
  summarize(
    mean_flipper = mean(flipper_length_mm, na.rm = TRUE),
    mean_mass = mean(body_mass_g, na.rm = TRUE),
    std_flipper = sd(flipper_length_mm, na.rm = TRUE), 
    std_mass = sd(body_mass_g, na.rm = TRUE), 
    cor_flip_mass = cor(flipper_length_mm, body_mass_g)
  )
# A tibble: 3 × 6
  species   mean_flipper mean_mass std_flipper std_mass cor_flip_mass
  <fct>            <dbl>     <dbl>       <dbl>    <dbl>         <dbl>
1 Adelie            190.     3701.        6.54     459.        NA    
2 Chinstrap         196.     3733.        7.13     384.         0.642
3 Gentoo            217.     5076.        6.48     504.        NA    

ggplot2

ggplot2 from the tidyverse

Since we have already installed and loaded the library, we don’t have to do anything else at this point!

ggplot2 follows the “grammar of graphics”

  • Theoretical framework for creating data visualizations
  • Breaks the process down into separate components:

Data

Aesthetics (aes)

Geometric Objects (geoms)

Faceting

Themes

Grammar of Graphics

ggplot2 cheatsheet

ggplot2 syntax

There is a basic structure to create a plot within ggplot2, and consists of at least these three things:

  1. A Data Set
  2. Coordinate System
  3. Geoms - visual marks to represent the data points

In R it looks like this:

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

#or how I like to do it
<DATA> %>% 
  ggplot(aes(<MAPPINGS>)) + 
  <GEOM_FUNCTION>()

ggplot2 syntax

Let’s start with a basic figure with palmerpenguins

First we will define the data that we are using and the variables we are visualizing

#the dataset is called penguins

penguins %>% 
  ggplot(aes(x = flipper_length_mm, 
             y = body_mass_g))

What happens?

We forgot to tell it what to do with the data!

Need to add the appropriate geom to have it plot points for each observation

penguins %>% 
  ggplot(aes(x = flipper_length_mm, 
             y = body_mass_g)) + 
  geom_point()


Note:
the geom_point() layer will inherit what is in the aes() in the previous layer

Adding in Color

Maybe we would like to have each of the points colored by their respective species

This information will be added to the aes() within the geom_point() layer

penguins %>% 
  ggplot(aes(x = flipper_length_mm, 
             y = body_mass_g)) + 
  geom_point(aes(color = species))

Including a fit line

Why don’t we put in a line that represents the relationship between these variables?

We will want to add another layer/geom

penguins %>% 
  ggplot(aes(x = flipper_length_mm, 
             y = body_mass_g)) + 
  geom_point(aes(color = species)) + 
  geom_smooth()


That looks a little wonky…why is that? Did you get a note in the console?

Including a fit line

The geom_smooth() defaults to using a loess line to fit to the data

In order to update that, we need to change some of the defaults for that layer and specify that we want a “linear model” or lm function to the data

penguins %>% 
  ggplot(aes(x = flipper_length_mm, 
             y = body_mass_g)) + 
  geom_point(aes(color = species)) + 
  geom_smooth(method = 'lm')


Did that look a little better?

Individual fit lines

It might make more sense to have individual lines for each of the species instead of something that is across all

penguins %>% 
  ggplot(aes(x = flipper_length_mm, 
             y = body_mass_g, 
             color = species)) + 
  geom_point() + 
  geom_smooth(method = 'lm')


What did we move around from the last set of code?

Updating Labels/Title

It will default to including the variable names as the x and y labels, but that isn’t something that makes sense. Also would be good to have a title!

We add on another layer called labs() for our labels (link)

penguins %>% 
  ggplot(aes(x = flipper_length_mm, 
             y = body_mass_g, 
             color = species)) + 
  geom_point() + 
  geom_smooth(method = 'lm') + 
  labs(
    title = "Palmer Penguins",
    subtitle = "Body Mass by Flipper Length", 
    x = "Flipper Length (mm)", 
    y = "Body Mass (g)", 
    color = "Species"
  )

Penguin Histogram

Taken from the website for palmerpenguins (link)

penguins %>% 
  ggplot(aes(x = flipper_length_mm)) +
    geom_histogram(aes(fill = species), 
                   alpha = 0.5, 
                   position = "identity")

Example 2

I have uploaded three datasets. I would like us to explore the datasets together to see what is going on with them.

The Three Datasets

Code
data1 <- import("https://raw.githubusercontent.com/dharaden/dharaden.github.io/main/data/data1.csv") %>% 
  mutate(dataset = "data1")

data2 <- import("https://raw.githubusercontent.com/dharaden/dharaden.github.io/main/data/data2.csv") %>% 
  mutate(dataset = "data2")

data3 <- import("https://raw.githubusercontent.com/dharaden/dharaden.github.io/main/data/data3.csv") %>% 
  mutate(dataset = "data3")

We need to combine them together just to make things easier:

three_data <- bind_rows(data1, data2, data3)

Descriptive Stats on the 3

Code
three_data %>%
  group_by(dataset) %>% 
  summarize(
    mean_x = mean(x),
    mean_y = mean(y),
    std_x = sd(x), 
    std_y = sd(y), 
    cor_xy = cor(x,y)
  )
# A tibble: 3 × 6
  dataset mean_x mean_y std_x std_y  cor_xy
  <chr>    <dbl>  <dbl> <dbl> <dbl>   <dbl>
1 data1     54.3   47.8  16.8  26.9 -0.0641
2 data2     54.3   47.8  16.8  26.9 -0.0683
3 data3     54.3   47.8  16.8  26.9 -0.0645

Visualize the datasets

Create a scatterplot for each of the datasets

But I didn’t talk about how to do that for multiple datasets…

Check out ChatGPT

Try it out on data

Take a look at the data that we have been using and try making various visualizations for two of the variables

state_school <- import("https://raw.githubusercontent.com/dharaden/dharaden.github.io/main/data/NM-NY_CAS.csv")

Next time…

  • ANOVA!