Learning persona

hex sticker, female student reading the book 'an adventure in statistics' by Andy Field.

This class is made up of two different learners; we'll call them Alicia and Abi. Both are enrolled on a Psychology degree program. They expected some statistics training (although perhaps not as much as they are getting) but were not expecting to learn any form of coding.

Alicia is 19 years old. She has a post-16 Maths qualification (a UK A-level) but nothing related to computing. Alicia went straight to University from school. Alicia is very engaged, capable, and sees the interconnect between statistics and psychology as a science. However, she lacks natural confidence and gets very anxious about her academic performance. She worries that her dyslexia will affect her ability to code.

Alicia needs knowledge to be built up gradually so that she doesn't feel overwhelmed and can build her confidence. She needs reassurance that she is making progress and wants her hard work to be noticed. She responds particularly well to praise and positivity, and is very sensitive to criticism.

Abi is 22 years old. She has only an age-16 Maths qualification (a UK GCSE) and nothing related to computing. Abi has been out of education for a couple of years, having gone travelling for a year after school and then worked to save money for university. Abi is a reluctant learner: she doesn't see the need to learn statistics on a Psychology degree because she thought that Psychology was about 'people not numbers'.

Abi needs motivation! She responds well to engaging examples and enthusiasm. She becomes disinterested if explanations are too long or technical. She struggles with computing and needs hands-on examples of concepts to grasp them. She responds better to active learning and 'trying things out', although she gets overwhelmed by error messages. In the face of too many obstacles she tends to shut down and think 'I can't do it'.

Tips

Remember to use hints and solutions to guide you through the exercises (Figure 1).

Each codebox has a hints or solution button that activates a popup window containing code and text to guide you through each exercise.
Figure 1: In a code exercise click the hints button to guide you through the exercise.

Good luck - you'll be amazing!

Recap

In the last session we

  • Discovered the key concepts within ggplot2: geoms, stats, scales, themes.
  • Used a data set of audio features of the artists Taylor Swift and, my favourite band, Iron Maiden scraped using the spotifyr package.
  • Found out how to use ggplot2 to:
      • Plot a violin plot.
      • Add x- and y-labels.
      • Apply a built-in theme.

Code example

We finished last week's session by creating a violin plot of danceability scores for Iron Maiden and Taylor Swift.

ggplot(spotify_tib, aes(x = artist_name, y = danceability)) +
  geom_violin(fill = "#94c5e8") +
  labs(x = "Artist", y = "Danceability (0-1)") +
  theme_minimal()

Coding challenge

Try out the code above to remind yourself of what the plot from last week looked like.

ggplot(spotify_tib, aes(x = artist_name, y = danceability)) +
  geom_violin(fill = "#94c5e8") +
  labs(x = "Artist", y = "Danceability (0-1)") +
  theme_minimal()

Learning outcomes

This session builds on last week's to look at the concept of layers. Specifically, we develop last week's plot to:

  • Adjust the y-axis breaks using scale_y_continuous().
  • Add a statistical summary (mean and standard deviation) using stat_summary().
  • Look at how plots are built up using layers.

By the end of the session you should be able to

  • Adjust the y-axis breaks using scale_y_continuous().
  • Add a mean and standard deviation to a plot with stat_summary().
  • Describe what a layer is in the context of ggplot2.
  • Differentiate commands that add layers from those that manipulate layers.
  • Understand that when creating new layers the order of commands in your code matters.

Layers (1)

Figure 2 shows how ggplot2 works. You begin with some data and you initialize a plot with the ggplot() function within which you name the tibble and set the variables for the x-axis and y-axis. This initiates a blank plot. Once the plot is initialized you add layers to the plot that control what the plot shows and its visual properties. You can think of a layer as a plastic transparency with something printed on it. That ‘something’ could be text, data points, lines, bars, pictures of chickens, or pretty much whatever you like. To make a final image, these transparencies are placed on top of each other. Layers can also be manipulated, that is, once a layer is added its visual properties can be changed.

  • You display the data, or summaries of the data, by adding layers. Things that add layers:
      • geoms (e.g., geom_point(), geom_violin())
      • stats (e.g., stat_summary())
  • Each layer can be manipulated to change its appearance. For example:
      • Change axis labels with labs()
      • Change the limits of the scale (coord_cartesian()) or breaks (e.g., scale_y_continuous())
      • Change the fill colour (e.g., aes(fill = album_name))
      • Apply a theme (e.g., theme_minimal())

In Figure 2, initializing the plot creates a base layer showing only the axes, their labels and tick marks but nothing else. Next, we manipulate how that layer looks by applying a theme. This doesn't add anything to the plot (it doesn't create a layer), but it affects how the existing layer looks.

To display a data summary we add a new layer with geom_violin(). Imagine the 'violins' were printed on a transparent sheet. It's as though we place this sheet on top of the base layer so the violins now sit on top of the axes and gridlines.

Next we manipulate the layer by adding a fill colour to the violins using aes(). This process doesn't add a new layer, but it changes the appearance of an existing one.

Then we use stat_summary() to add another layer that places the mean and an error bar on top of each violin. Again, imagine these error bars are on a transparent sheet that we lay on top of the two layers underneath. Finally, we use labs() and scale_y_continuous() to change the appearance of the base layer by changing the axis label text and the axis breaks.

See main text for description.
Figure 2: A ggplot is made up of layers.
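
If you'd like to check the difference between adding and manipulating layers for yourself, the sketch below counts the layers that ggplot2 has stored at each step. It assumes that a plot object keeps its layers in a list you can reach with $layers (an internal detail that could change between versions), and it uses a small made-up data set, toy_data, rather than spotify_tib. Note that this list only counts layers added by geoms and stats, so the 'base layer' of axes described above isn't included in the count.

```r
library(ggplot2)

# Hypothetical toy data standing in for spotify_tib (the values are made up)
toy_data <- data.frame(
  artist_name  = rep(c("Iron Maiden", "Taylor Swift"), each = 5),
  danceability = c(0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70)
)

# Build the plot one step at a time, saving each step
base_plot     <- ggplot(toy_data, aes(x = artist_name, y = danceability))
themed_plot   <- base_plot + theme_minimal()    # manipulates, adds no layer
violin_plot   <- themed_plot + geom_violin()    # adds a layer
labelled_plot <- violin_plot +
  labs(x = "Artist", y = "Danceability (0-1)")  # manipulates, adds no layer

length(base_plot$layers)      # 0
length(themed_plot$layers)    # 0
length(violin_plot$layers)    # 1
length(labelled_plot$layers)  # 1
```

Only geom_violin() changes the count, which matches the idea that themes and labels change how things look without adding a new transparency.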

Let's try this using our Spotify data.

Coding challenge

The box below displays the code to initiate the plot from last week. Execute this code.

ggplot(spotify_tib, aes(x = artist_name, y = danceability))

Note that you see a blank plot - the base layer. Now add theme_minimal().

De-bug: don't forget +

A common cause of error messages when using ggplot() is forgetting to put a + at the end of each line (except the last). If you get an error message, check that each line that builds up a plot has a + at the end of it (i.e., each function is separated by +). I make this mistake all the time!

ggplot(spotify_tib, aes(x = artist_name, y = danceability))

ggplot(spotify_tib, aes(x = artist_name, y = danceability)) +
  theme_minimal()

Note that the base layer looks different, but we can't see any data! Use geom_violin() to add a layer showing data violins to the plot.

ggplot(spotify_tib, aes(x = artist_name, y = danceability)) +
  theme_minimal()

ggplot(spotify_tib, aes(x = artist_name, y = danceability)) +
  theme_minimal() +
  geom_violin()

You should now see the data violins, but we like colour, so let's use aes(fill = artist_name) within geom_violin() to change the colours of the violin layer.

ggplot(spotify_tib, aes(x = artist_name, y = danceability)) +
  theme_minimal() +
  geom_violin()

ggplot(spotify_tib, aes(x = artist_name, y = danceability)) +
  theme_minimal() +
  geom_violin(aes(fill = artist_name))

Great, you should see some colour! We want some summary statistics though, so add a layer showing the mean and standard deviation using stat_summary(fun.data = "mean_sdl").

ggplot(spotify_tib, aes(x = artist_name, y = danceability)) +
  theme_minimal() +
  geom_violin(aes(fill = artist_name))

ggplot(spotify_tib, aes(x = artist_name, y = danceability)) +
  theme_minimal() +
  geom_violin(aes(fill = artist_name)) +
  stat_summary(fun.data = "mean_sdl")

Finally let's adjust the axis labels and breaks by adding labs(x = "Artist", y = "Danceability (0-1)") and scale_y_continuous(breaks = seq(0, 1, 0.1)) to the plot.

ggplot(spotify_tib, aes(x = artist_name, y = danceability)) +
  theme_minimal() +
  geom_violin(aes(fill = artist_name)) +
  stat_summary(fun.data = "mean_sdl")

ggplot(spotify_tib, aes(x = artist_name, y = danceability)) +
  theme_minimal() +
  geom_violin(aes(fill = artist_name)) +
  stat_summary(fun.data = "mean_sdl") +
  labs(x = "Artist", y = "Danceability (0-1)") +
  scale_y_continuous(breaks = seq(0, 1, 0.1))
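
A quick note on what "mean_sdl" draws: as far as I know, it relies on the Hmisc package being installed, and by default its bars span the mean plus and minus two standard deviations (adding fun.args = list(mult = 1) to stat_summary() should give one standard deviation instead). The sketch below is a hand-written alternative to show the idea, not the function this tutorial uses: mean_sd() is a hypothetical helper, and toy_data is made-up data standing in for spotify_tib.

```r
library(ggplot2)

# Hypothetical toy data standing in for spotify_tib (the values are made up)
toy_data <- data.frame(
  artist_name  = rep(c("Iron Maiden", "Taylor Swift"), each = 5),
  danceability = c(0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70)
)

# Hypothetical helper: the mean plus/minus `mult` standard deviations,
# returned in the y / ymin / ymax format that stat_summary(fun.data = ) expects
mean_sd <- function(y, mult = 1) {
  data.frame(
    y    = mean(y),
    ymin = mean(y) - mult * sd(y),
    ymax = mean(y) + mult * sd(y)
  )
}

# Check the helper on three easy numbers: mean 0.4, SD 0.2
mean_sd(c(0.2, 0.4, 0.6))

# Use the helper in place of "mean_sdl"
sd_plot <- ggplot(toy_data, aes(x = artist_name, y = danceability)) +
  geom_violin(aes(fill = artist_name)) +
  stat_summary(fun.data = mean_sd)

sd_plot
```

Writing the summary function yourself makes it explicit what the error bars represent, which is worth knowing before you describe them in a figure caption.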

When order matters (2)

The commands in your code to create a plot are processed in the order you write them. Like I said, adding a layer is like adding a transparency to whatever layers have already been created. So the order of layers matters.

Coding challenge

The exercise box below shows the code for the plot we just created. Run the code, then change the order so that the geom_violin() layer is created after the stat_summary() layer. Run the code again.

ggplot(spotify_tib, aes(x = artist_name, y = danceability)) +
  theme_minimal() +
  geom_violin(aes(fill = artist_name)) +
  stat_summary(fun.data = "mean_sdl") +
  labs(x = "Artist", y = "Danceability (0-1)") +
  scale_y_continuous(breaks = seq(0, 1, 0.1))

ggplot(spotify_tib, aes(x = artist_name, y = danceability)) +
  theme_minimal() +
  stat_summary(fun.data = "mean_sdl") +
  geom_violin(aes(fill = artist_name)) +
  labs(x = "Artist", y = "Danceability (0-1)") +
  scale_y_continuous(breaks = seq(0, 1, 0.1))

What do you notice?

Quiz time

You should find that your error bars have vanished. But why? The code is the same as before, you defined a layer using stat_summary() exactly as before. Surely, you must have done something wrong?

Let's find out the source of this strange wizardry. Below is the code that you just used, except that I have included alpha = 1 within geom_violin(). This argument sets the transparency of the geom, and by default the violins are fully opaque (equivalent to alpha = 1). Run this code and note that it does exactly the same thing as the code for the task above. Now change alpha = 1 to alpha = 0.9. This makes the violins very slightly transparent. You should now see the error bars behind the violins. Try running the code again with values of alpha of 0.5 (half-transparent) and 0 (fully transparent). As the violins get more transparent, the error bars behind become more visible.

ggplot(spotify_tib, aes(x = artist_name, y = danceability)) +
  theme_minimal() +
  stat_summary(fun.data = "mean_sdl") +
  geom_violin(aes(fill = artist_name), alpha = 1) +
  labs(x = "Artist", y = "Danceability (0-1)") +
  scale_y_continuous(breaks = seq(0, 1, 0.1))

This exercise illustrates why the error bars disappeared from our plot. The violin geom is filled (the space between the lines isn't transparent), so by adding geom_violin() after stat_summary() the violins are layered on top of the error bars and hide them. The error bars are still there, and there's no error in your code; they're just hidden by the violins because of the order in which you added the layers. When adding layers, order matters!
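
To see the stacking order rather than take my word for it, here is a minimal sketch that lists the geom used by each layer, from the bottom sheet to the top one. It assumes the plot's layers can be reached with $layers and that each layer records its geom (internal details that may differ between ggplot2 versions). layer_geoms() is a hypothetical helper, toy_data is made-up data, and I've used a simple mean point instead of "mean_sdl" to keep the example free of extra packages.

```r
library(ggplot2)

# Hypothetical toy data standing in for spotify_tib (the values are made up)
toy_data <- data.frame(
  artist_name  = rep(c("Iron Maiden", "Taylor Swift"), each = 5),
  danceability = c(0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70)
)

# Hypothetical helper: name the geom in each layer, bottom layer first
layer_geoms <- function(plot) {
  vapply(plot$layers, function(layer) class(layer$geom)[1], character(1))
}

# Violins added first, so the mean points sit on top of them
violins_first <- ggplot(toy_data, aes(x = artist_name, y = danceability)) +
  geom_violin(aes(fill = artist_name)) +
  stat_summary(fun = mean, geom = "point")

# Violins added last, so they cover the mean points
violins_last <- ggplot(toy_data, aes(x = artist_name, y = danceability)) +
  stat_summary(fun = mean, geom = "point") +
  geom_violin(aes(fill = artist_name))

layer_geoms(violins_first)  # "GeomViolin" "GeomPoint"
layer_geoms(violins_last)   # "GeomPoint" "GeomViolin"
```

Both plots contain the same two layers; only their position in the stack differs, which is why the summary is hidden in the second plot rather than missing.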

When order doesn't matter (1)

What about commands that manipulate layers? Does order matter then? Let's see.

Coding challenge

The exercise box below shows the code for the plot from the first part of the tutorial again (i.e., with stat_summary() and geom_violin() in the correct order). Run the code, then change the order of the code so that theme_minimal() is applied last. Run the code again. (Don't forget to add + to the end of the penultimate line of code and to remove the + after theme_minimal().)

ggplot(spotify_tib, aes(x = artist_name, y = danceability)) +
  theme_minimal() +
  geom_violin(aes(fill = artist_name)) +
  stat_summary(fun.data = "mean_sdl") +
  labs(x = "Artist", y = "Danceability (0-1)") +
  scale_y_continuous(breaks = seq(0, 1, 0.1))

ggplot(spotify_tib, aes(x = artist_name, y = danceability)) +
  geom_violin(aes(fill = artist_name)) +
  stat_summary(fun.data = "mean_sdl") +
  labs(x = "Artist", y = "Danceability (0-1)") +
  scale_y_continuous(breaks = seq(0, 1, 0.1)) +
  theme_minimal()

What do you notice?

Quiz time

Now try moving either scale_y_continuous(breaks = seq(0, 1, 0.1)) or labs(x = "Artist", y = "Danceability (0-1)") to before geom_violin().

ggplot(spotify_tib, aes(x = artist_name, y = danceability)) +
  geom_violin(aes(fill = artist_name)) +
  stat_summary(fun.data = "mean_sdl") +
  labs(x = "Artist", y = "Danceability (0-1)") +
  scale_y_continuous(breaks = seq(0, 1, 0.1)) +
  theme_minimal()

ggplot(spotify_tib, aes(x = artist_name, y = danceability)) +
  labs(x = "Artist", y = "Danceability (0-1)") +
  scale_y_continuous(breaks = seq(0, 1, 0.1)) +
  geom_violin(aes(fill = artist_name)) +
  stat_summary(fun.data = "mean_sdl") +
  theme_minimal()

Again, the plot shouldn't change because scale_y_continuous() and labs() manipulate existing layers (the base layer) rather than creating new ones, so we can move them without affecting the plot.

But (1)

So far we have seen that when we create layers the order of our code matters, but when we manipulate existing layers it doesn't. Except ...

Coding challenge

The exercise box below shows the code for the plot again. Add this final line of code (remembering to add + after theme_minimal()):

labs(x = "Think of a label later", y = "You can do this!")

ggplot(spotify_tib, aes(x = artist_name, y = danceability)) +
  geom_violin(aes(fill = artist_name)) +
  stat_summary(fun.data = "mean_sdl") +
  labs(x = "Artist", y = "Danceability (0-1)") +
  scale_y_continuous(breaks = seq(0, 1, 0.1)) +
  theme_minimal()

ggplot(spotify_tib, aes(x = artist_name, y = danceability)) +
  geom_violin(aes(fill = artist_name)) +
  stat_summary(fun.data = "mean_sdl") +
  labs(x = "Artist", y = "Danceability (0-1)") +
  scale_y_continuous(breaks = seq(0, 1, 0.1)) +
  theme_minimal() +
  labs(x = "Think of a label later", y = "You can do this!")

What do you notice?

Quiz time

You should find that the axis labels change, which demonstrates that the code is still processed in order. In this case, the second labs() command (the one we added to the end) overwrites the first one and the axis labels reflect the latter of the two commands.
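
The overwriting can also be checked without looking at the plot. The sketch below assumes that the labels you set with labs() are stored on the plot object and can be read with $labels (an internal detail, so treat it as an assumption that may not hold in every ggplot2 version). As before, toy_data is made-up data standing in for spotify_tib.

```r
library(ggplot2)

# Hypothetical toy data standing in for spotify_tib (the values are made up)
toy_data <- data.frame(
  artist_name  = rep(c("Iron Maiden", "Taylor Swift"), each = 5),
  danceability = c(0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70)
)

# Plot with a single labs() command
one_labs <- ggplot(toy_data, aes(x = artist_name, y = danceability)) +
  geom_violin() +
  labs(x = "Artist", y = "Danceability (0-1)")

# Same plot with a second labs() command added at the end
two_labs <- one_labs +
  labs(x = "Think of a label later", y = "You can do this!")

one_labs$labels$x  # "Artist"
two_labs$labels$x  # "Think of a label later"
two_labs$labels$y  # "You can do this!"
```

The second labs() replaces the values set by the first, so only the most recent label for each axis survives.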

Try your code out below

ggplot(spotify_tib, aes(x = artist_name, y = valence)) +
  geom_violin() +
  geom_point(position = position_jitter(width = 0.2)) +
  labs(x = "Artist", y = "Valence (0-1)") +
  theme_minimal()

discovr package hex sticker, female space pirate with gun. Gunsmoke forms the letter R.

Well done!

Well done on completing phase 5 of your mission! Visualizing data is an essential skill, and being able to produce plots is a key part of communicating about data. Good work!

Transfer task (1)

Filter the data in spotify_tib to look only at the Iron Maiden or Taylor Swift albums (it doesn't matter which). Produce a violin plot, with error bars, of either song danceability or valence (again, your choice) against the name of the album.

Workflow

  • This tutorial is self-contained (you practice code in code boxes). However, so you get practice at working in R, this transfer task should be done using an R Markdown file. I strongly recommend that you create this file within an R project. To get an idea of the workflow see

Packages

You will need to load the following packages:

  • here (Müller 2017)
  • tidyverse (Wickham 2017)

Data

You need to download the following data file:

Set up an R project in the way that I recommend in this tutorial, and save the data file to the folder within your project called data. Place this code in the first code chunk in your Markdown document:

spotify_tib <- here::here("data/iron_swift.csv") %>%
  readr::read_csv()

Alternatively, load it directly from the URL:

spotify_tib <- readr::read_csv("https://www.discovr.rocks/csv/iron_swift.csv")

Resources

Statistics

  • The tutorials typically follow examples described in detail in Field (2021). That book covers the theoretical side of the statistical models, and has more depth on conducting and interpreting the models in these tutorials.
  • If any of the statistical content doesn't make sense, you could try my more introductory book An adventure in statistics (Field 2016).
  • There are free lectures and screencasts on my YouTube channel.
  • There are free statistical resources on my websites www.discoveringstatistics.com and milton-the-cat.rocks.

Acknowledgement

I'm extremely grateful to Allison Horst for her very informative blog post on styling learnr tutorials with CSS and also for sending me a CSS template file and allowing me to adapt it. Without Allison, these tutorials would look a lot worse (but she can't be blamed for my colour scheme).

References

Field, Andy P. 2016. An Adventure in Statistics: The Reality Enigma. London: Sage.

———. 2021. Discovering Statistics Using R and RStudio. Second. London: Sage.

Müller, Kirill. 2017. Here: A Simpler Way to Find Your Files. https://CRAN.R-project.org/package=here.

Wickham, Hadley. 2017. Tidyverse: Easily Install and Load the ’Tidyverse’. https://CRAN.R-project.org/package=tidyverse.

Layers in ggplot

Andy Field