PSYC 640 - Fall 2024
Journal Entries
Reverse Results - Due 9/17
Working with ggplot2 to get some really fancy visualizations!
Maybe integrating some generative AI (ChatGPT) to help us out too
Let’s first start by opening our Project
Then, create a new Notebook/Markdown Document that we will use for today
Set up the libraries and bring in the data
We will be using a dataset from the palmerpenguins library (link), which is a dataset about…penguins. This function will pull that data into our environment:
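A minimal sketch of what this setup chunk might look like, assuming the tidyverse and palmerpenguins packages are already installed. The slide does not show which function was used; data() is one common way to load a packaged dataset, so treat that choice as an assumption.

```r
# Load the tidyverse (which includes ggplot2) and the package that ships the data
library(tidyverse)
library(palmerpenguins)

# Assumption: data() is the loading function meant on the slide.
# Naming the package avoids picking up a different object called penguins.
data("penguins", package = "palmerpenguins")

# Quick look at the columns and their types
glimpse(penguins)
```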
ggplot2
ggplot2 from the tidyverse
Since we have already installed and loaded the library, we don’t have to do anything else at this point!
ggplot2 follows the “grammar of graphics”
Data
Aesthetics (aes)
Geometric Objects (geoms)
Faceting
Themes
ggplot2 cheatsheet
ggplot2 syntax
There is a basic structure to create a plot within ggplot2, and it consists of at least these three things:
In R it looks like this:
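As a rough sketch, the general pattern is shown in the comments below, followed by one runnable instance. The instance uses mtcars, a dataset built into R, purely as a stand-in; the choice of dataset and variables is an assumption for illustration.

```r
library(ggplot2)

# General pattern (placeholders, not runnable as written):
#   ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +
#     <GEOM_FUNCTION>()

# One runnable instance of the pattern, using the built-in mtcars data
p_template <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point()
p_template
```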
ggplot2 syntax
Let’s start with a basic figure with palmerpenguins
First we will define the data that we are using and the variables we are visualizing
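A sketch of this first step. The slide does not say which variables were plotted, so bill length and bill depth are assumed here for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Define the data and map variables to the x and y axes.
# Assumption: bill_length_mm and bill_depth_mm are the variables of interest.
p_base <- ggplot(data = penguins,
                 mapping = aes(x = bill_length_mm, y = bill_depth_mm))
p_base
```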
What happens?
We forgot to tell it what to do with the data!
We need to add the appropriate geom to have it plot points for each observation
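Adding the geom might look like the sketch below, again assuming bill length and bill depth as the two variables. Expect a warning about rows with missing values being removed, since a few penguins have no bill measurements.

```r
library(ggplot2)
library(palmerpenguins)

# Same data and mappings as before, now with a layer that draws one point per row
p_points <- ggplot(data = penguins,
                   mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point()
p_points
```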
Note: the geom_point() layer will inherit what is in the aes() of the ggplot() call
Maybe we would like to have each of the points colored by their respective species
This information will be added to the aes() within the geom_point() layer
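A sketch of coloring the points by species, with the color mapping placed inside geom_point() as described. The x and y variables are still an assumption.

```r
library(ggplot2)
library(palmerpenguins)

# color = species sits in the aes() of the point layer only,
# so it applies to the points and not to any other layer
p_color <- ggplot(data = penguins,
                  mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(mapping = aes(color = species))
p_color
```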
Why don’t we put in a line that represents the relationship between these variables?
We will want to add another layer/geom
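One way this could look is to add geom_smooth() with its default settings. This is a sketch using the same assumed variables.

```r
library(ggplot2)
library(palmerpenguins)

# Points colored by species, plus a single smoothed line across all penguins
p_smooth <- ggplot(data = penguins,
                   mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth()  # no arguments: ggplot2 picks the smoothing method itself
p_smooth
```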
That looks a little wonky…why is that? Did you get a note in the console?
For a dataset of this size, geom_smooth() defaults to fitting a loess curve to the data
In order to update that, we need to change some of the defaults for that layer and specify that we want to fit a “linear model”, or lm, function to the data
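A sketch of switching the smoother to a linear model through the method argument of geom_smooth(). The variables are still assumed.

```r
library(ggplot2)
library(palmerpenguins)

# method = "lm" replaces the default smoother with a straight-line fit
p_lm <- ggplot(data = penguins,
               mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth(method = "lm")
p_lm
```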
Did that look a little better?
It might make more sense to have individual lines for each of the species instead of one line across all of them
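One way to get a line per species is sketched below, with the same assumed variables. The color mapping now lives in the ggplot() call, so every layer inherits it.

```r
library(ggplot2)
library(palmerpenguins)

# color = species is set once in ggplot(), so both the points and the
# fitted lines are split by species
p_by_species <- ggplot(data = penguins,
                       mapping = aes(x = bill_length_mm,
                                     y = bill_depth_mm,
                                     color = species)) +
  geom_point() +
  geom_smooth(method = "lm")
p_by_species
```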
What did we move around from the last set of code?
By default, the x and y labels are the variable names, which are not very reader-friendly. It would also be good to have a title!
We add on another layer called labs() for our labels (link)
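A sketch of the labs() layer. The title and axis text below are hypothetical wording chosen for illustration, and the variables are the same assumed ones.

```r
library(ggplot2)
library(palmerpenguins)

p_labeled <- ggplot(data = penguins,
                    mapping = aes(x = bill_length_mm,
                                  y = bill_depth_mm,
                                  color = species)) +
  geom_point() +
  geom_smooth(method = "lm") +
  # Hypothetical label text; replace with whatever suits your figure
  labs(title = "Bill depth by bill length",
       x = "Bill length (mm)",
       y = "Bill depth (mm)",
       color = "Species")
p_labeled
```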
Taken from the website for palmerpenguins (link)
Let’s start by looking at our data. You can either click on the dataset in the Environment or use the View(sleep_data) command. Here, I am using the head() command just to show the first few rows of the data for the slides.
StartDate EndDate Status Progress Duration__in_seconds_
1 2017-08-16 13:13:06 2017-08-16 13:15:13 0 100 126
2 2017-08-16 13:17:16 2017-08-16 13:19:03 0 100 106
3 2017-11-27 16:59:07 2017-11-27 17:07:27 0 100 499
4 2017-11-27 18:57:16 2017-11-27 19:13:00 0 100 943
5 2017-11-27 18:54:35 2017-11-27 19:14:12 0 100 1177
6 2017-11-27 19:41:14 2017-11-27 19:46:48 0 100 333
Finished RecordedDate DistributionChannel UserLanguage Q5 Q7 Q8 Q12
1 1 2017-08-16 13:15:14 anonymous EN 12 2 2 3
2 1 2017-08-16 13:19:03 anonymous EN 1 2 2 1
3 1 2017-11-27 17:07:27 anonymous EN 19 1 2 1
4 1 2017-11-27 19:13:00 anonymous EN 18 2 2 2
5 1 2017-11-27 19:14:13 anonymous EN 21 1 2 2
6 1 2017-11-27 19:46:48 anonymous EN 18 2 1 3
Q13_1 Q13_2 Q13_3 Q13_4 Q13_5 Q13_6 Q13_7 Q13_8 Q13_9 Q13_10 Q14_1 Q14_2
1 2 2 2 NA 2 2 2 2 2 2 2 2
2 1 1 1 NA 2 3 2 3 2 3 4 2
3 1 2 2 5 3 4 1 1 2 1 5 5
4 3 4 5 5 5 5 5 5 5 1 5 5
5 1 2 2 5 4 2 2 1 1 2 5 5
6 1 2 4 5 4 2 4 4 5 4 5 5
Q14_3 Q14_4 Q14_5 Q14_6 Q14_7 Q14_8 Q11_1 Q11_2 Q11_3 Q11_4 Q11_5 Q11_6 Q17_1
1 2 2 2 NA 2 2 2 3 3 4 4 3 2
2 4 4 4 NA 4 4 4 4 4 4 4 4 2
3 5 5 1 1 1 1 4 5 4 4 5 4 4
4 5 5 5 1 5 5 2 3 4 3 1 1 4
5 5 5 2 1 3 4 2 1 4 3 4 2 2
6 5 5 4 1 4 3 1 4 2 2 4 5 2
Q17_2 Q17_3 Q17_4 Q18_1 Q18_2 Q18_3 Q18_4 Q18_5 Q20_1 Q20_2 Q21_1 Q21_2 Q21_3
1 3 3 2 2 2 2 3 1 1 2 4 3 3
2 2 1 2 3 1 1 1 1 2 3 5 4 4
3 2 1 4 3 1 4 1 2 3 3 1 2 3
4 4 1 3 4 1 2 1 1 4 2 5 2 1
5 2 2 4 3 1 1 1 2 3 3 4 1 3
6 4 3 4 4 1 1 1 4 6 5 4 4 5
Q21_4 Q21_5 Q21_6 Q22_1 Q22_2 Q22_3 Q22_4 Q22_5 Q23_1 Q23_2 Q23_3 Q23_4 Q23_5
1 NA 5 3 3 2 1 2 3 4 5 3 3 2
2 NA 2 2 2 3 4 5 5 5 5 4 3 3
3 4 1 1 5 4 3 4 4 5 6 2 5 5
4 4 1 1 2 5 1 6 1 6 6 2 5 6
5 4 1 2 3 4 1 4 2 2 6 3 5 5
6 4 1 2 2 2 3 4 3 4 5 2 3 3
Q23_6 Q23_7 Q23_8 Q24_1 Q24_2 Q24_3 Q24_4 Q24_5 Q25_1 Q25_2 Q25_3 Q25_4 Q25_5
1 NA 3 3 3 3 3 3 3 3 3 3 NA 4
2 NA 3 3 3 3 2 2 1 1 2 1 NA 1
3 1 1 3 1 1 1 1 2 3 4 2 1 1
4 1 3 1 1 6 2 1 1 4 2 6 1 6
5 1 2 2 1 2 2 1 2 5 4 2 1 1
6 1 2 4 4 5 1 2 6 6 3 5 1 3
Q26_1 Q26_2 Q27_1 Q27_2 SC0 SC1 SC2 SC3 SC4 SC5 id Attention
1 3 3 3 3 19 11 19 14 82 0 15516 NA
2 1 1 1 1 24 5 24 26 79 0 15516 NA
3 3 1 4 1 26 13 26 23 70 0 17915 NA
4 4 4 1 1 14 12 14 35 89 0 17648 NA
5 5 2 5 5 16 9 16 29 76 0 17799 NA
6 5 1 6 6 18 15 18 31 99 0 18003 NA
What do these variables mean? Who does this?
It is always important to have appropriate data documentation!
If you can’t look at your data and know what it means right away, you aren’t going to remember what it means later on.
Sleep Data Documentation - myCourses
“Q5” - What is your age in years? (open text)
This is a free text field…is that a good way to get quality data?
Let’s see if everyone followed directions. Check the “structure” of the variable
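A small sketch of this check. The course file is not reproduced here, so toy_sleep is a hypothetical stand-in with made-up values; with the real data you would run str() on sleep_data$Q5 instead.

```r
# Hypothetical stand-in for the real data: a free-text age column
# where one person typed words instead of digits
toy_sleep <- data.frame(Q5 = c("19", "18", "twenty-one", "21"),
                        stringsAsFactors = FALSE)

# str() reports the type R assigned to the column
str(toy_sleep$Q5)

# Converting to numeric turns any non-numeric entry into NA,
# which gives a quick count of responses that did not follow directions
age_num <- suppressWarnings(as.numeric(toy_sleep$Q5))
sum(is.na(age_num))
```

If str() reports a character column where numbers were expected, that is usually a sign that at least one response contains something other than digits.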
Use the ggplot cheatsheet to identify an appropriate way to visualize the data
Add some color
Update the title and axes
When you are done, post your creation here!