GLM 8: repeated measures designs

The general linear model: repeated measures designs

Overview

This tutorial is one of a series that accompanies Discovering Statistics Using IBM SPSS Statistics (Field 2017) by me, Andy Field. These tutorials contain abridged sections from the book (so there are some copyright considerations).1

  • Who is the tutorial aimed at?
    • Students enrolled on my Discovering Statistics module at the University of Sussex, or anyone reading my textbook Discovering Statistics Using IBM SPSS Statistics (Field 2017)
  • What is covered?
    • This tutorial develops the material from the previous tutorial to look at comparing means using IBM SPSS Statistics when the research design uses repeated measures. We explore one- and two-way designs.
    • This tutorial does not teach the background theory: it is assumed you have either attended my lecture or read the relevant chapter in my book (or someone else’s)
    • The aim of this tutorial is to augment the theory that you already know by guiding you through fitting linear models using IBM SPSS Statistics and asking you questions to test your knowledge along the way.
  • Want more information?
    • The main tutorial follows the example described in detail in Field (2017), so there’s a thorough account in there.
    • You can access free lectures and screencasts on my YouTube channel
    • There are more statistical resources on my website www.discoveringstatistics.com

One predictor variable

The first example in this tutorial is from Field (2017), who uses an example based on the TV show I’m a Celebrity, Get Me Out of Here! in which z-list celebrities, in a pitiful attempt to salvage their careers, live in the jungle for a few weeks. During the show, they are subjected to various humiliating and degrading tasks to win food for their camp mates. One such task is the bushtucker trial in which to win stars the celebrities must eat various noxious things such as live stick insects or witchetty grubs, fish eyes, and kangaroo testicles. Seeing a fish eye exploding in someone’s mouth is a mental scar that’s hard to shake off.

For those of you who enjoy implanting disturbing images in your mind, the show’s official YouTube channel has a particularly traumatic clip that culminates in someone eating a massive live spider (you have been warned).

I’ve often wondered (perhaps a little too much) which of the bushtucker foods is the most revolting. Imagine that I answered this question by getting eight celebrities and forcing them to eat four different animals (the aforementioned stick insect, kangaroo testicle, fish eye and witchetty grub) in counterbalanced order. On each occasion I measured the time it took the celebrity to retch, in seconds. This design is a repeated-measures design because every celebrity eats every food. The predictor/independent variable is the type of food eaten and the outcome/dependent variable is the time taken to retch. The data are in Bushtucker.sav and are shown in Figure 1. The eight rows show the different celebrities, and the columns indicate their time to retch after eating each animal.

Figure 1: Bushtucker.sav

Fitting the model

To fit the model we use the Analyze > General Linear Model > Repeated Measures … menu. The following video shows how.

Interpreting a one-way design

Descriptive statistics

Output 1 shows that, on average, the time taken to retch was longest after eating the stick insect, and quickest after eating a testicle or eyeball. These means are useful for interpreting the main analysis.

Output 1

Correcting for sphericity

Output 2 shows Mauchly’s test. The significance value (0.047) is less than the critical value of 0.05, which implies that the assumption of sphericity has been violated, but in my book I suggest you ignore this test and instead routinely apply a correction for whatever deviation from sphericity is present in the data. The more informative part of the table contains the Greenhouse–Geisser (\(\hat{\epsilon}\) = 0.533) and the Huynh–Feldt (\(\tilde{\epsilon}\) = 0.666) estimates of sphericity (Greenhouse and Geisser 1959; Huynh and Feldt 1976). If the data are perfectly spherical then these estimates will be 1. Therefore, both estimates indicate a departure from sphericity, so we may as well correct for it regardless of what Mauchly’s test says. These estimates are used to correct the degrees of freedom for the F-statistic.

Output 2
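To make that last point concrete, here is a minimal Python sketch (outside SPSS, with a function name of my own invention) of how a sphericity estimate scales the degrees of freedom. It assumes the usual one-way repeated-measures degrees of freedom, k − 1 and (k − 1)(n − 1), with k = 4 foods and n = 8 celebrities from the example; the two estimates are the ones quoted above.

```python
def corrected_df(epsilon, n_levels, n_participants):
    """Scale the repeated-measures degrees of freedom by a sphericity estimate.

    Hypothetical helper for illustration; SPSS does this for you.
    """
    df_model = n_levels - 1
    df_residual = (n_levels - 1) * (n_participants - 1)
    return epsilon * df_model, epsilon * df_residual


# Estimates quoted in the text: Greenhouse-Geisser = 0.533, Huynh-Feldt = 0.666
gg = corrected_df(0.533, n_levels=4, n_participants=8)
hf = corrected_df(0.666, n_levels=4, n_participants=8)

print(f"Greenhouse-Geisser: {gg[0]:.2f}, {gg[1]:.2f}")  # 1.60, 11.19
print(f"Huynh-Feldt: {hf[0]:.2f}, {hf[1]:.2f}")         # 2.00, 13.99
```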

The F-statistic

Output 3 shows the summary information for the F-statistic that tests whether we can significantly predict retching times from the group means (i.e., are the means significantly different?).

Output 3

Explanation

The answers to the quiz highlight the problems with resorting to an ‘all-or-nothing’ decision rule about significance dependent on whether an observed p is one side of the 0.05 threshold or the other. The unadjusted p-value associated with the F-statistic is 0.026, which is significant because it is less than the criterion value of 0.05. However, when you adjust the degrees of freedom the conclusions depend on another arbitrary decision: which correction you apply. The adjustments result in the observed F being non-significant when using the Greenhouse–Geisser correction (because p > 0.05) but significant using the Huynh–Feldt correction (because the probability value of 0.048 is just below the criterion value of 0.05). This leaves us with the puzzling dilemma of whether to accept this F-statistic as significant. It’s easy to see how the decision rule applied to p-values can lead to results that don’t replicate, conclusions that have been influenced by researcher degrees of freedom, and a lot of noise in the scientific literature.
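If it helps to see where the uncorrected F-statistic comes from, the following is a rough Python sketch of the one-way repeated-measures calculation. The function name and the tiny data set are made up for illustration (they are not the values in Bushtucker.sav), and the sketch stops at F and its degrees of freedom rather than computing a p-value.

```python
def rm_anova_f(scores):
    """One-way repeated-measures F.

    scores: list of rows, one row per participant, one column per condition.
    Returns (F, df_model, df_residual) with uncorrected degrees of freedom.
    """
    n = len(scores)        # participants
    k = len(scores[0])     # conditions
    grand_mean = sum(sum(row) for row in scores) / (n * k)
    condition_means = [sum(row[j] for row in scores) / n for j in range(k)]
    person_means = [sum(row) / k for row in scores]

    # Variation explained by the conditions
    ss_model = n * sum((m - grand_mean) ** 2 for m in condition_means)
    # Variation left over once person and condition effects are removed
    ss_residual = sum(
        (scores[i][j] - person_means[i] - condition_means[j] + grand_mean) ** 2
        for i in range(n)
        for j in range(k)
    )

    df_model = k - 1
    df_residual = (k - 1) * (n - 1)
    f = (ss_model / df_model) / (ss_residual / df_residual)
    return f, df_model, df_residual


# Made-up scores: 3 participants (rows) by 3 conditions (columns)
example = [[1, 2, 6], [2, 4, 6], [3, 3, 9]]
f, df1, df2 = rm_anova_f(example)
print(f"F({df1}, {df2}) = {f:.2f}")  # F(2, 4) = 21.00
```

The sphericity corrections leave this F untouched: they shrink the two degrees of freedom, which is why the same F can sit on either side of 0.05 depending on the correction.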

Contrasts

Output 4 lists each contrast and its F-statistic, which compares the two chunks of variation within the contrast. We can conclude that celebrities took significantly longer to retch after eating the stick insect compared to the kangaroo testicle, p = 0.002 (Level 1 vs. Level 2), but that the time to retch was roughly the same after eating the kangaroo testicle and the fish eyeball, p = 0.920 (Level 2 vs. Level 3) and after eating a fish eyeball compared to eating a witchetty grub, p = 0.402 (Level 3 vs. Level 4). It’s worth remembering that, by some criteria, our main effect of the type of animal eaten was not significant, and if this is the case then we really shouldn’t look at these contrasts.

Output 4
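The comparisons above (Level 1 vs. Level 2, Level 2 vs. Level 3, Level 3 vs. Level 4) follow the pattern of what SPSS calls a repeated contrast: each level is compared with the next one. As a small sketch of that pattern, assuming the foods were entered in the order used in the text (the helper name is hypothetical):

```python
def repeated_contrast_pairs(levels):
    """Pair each level with the one that follows it."""
    return list(zip(levels, levels[1:]))


foods = ["stick insect", "kangaroo testicle", "fish eye", "witchetty grub"]
for first, second in repeated_contrast_pairs(foods):
    print(f"{first} vs. {second}")
```

With four levels this gives three comparisons, which matches the three contrasts reported above.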

Post hoc tests

If you selected post hoc tests for the repeated-measures variable, then Output 5 is produced. Based on the significance values and the means we can conclude that the time to retch was significantly longer after eating a stick insect compared to a kangaroo testicle (p = 0.012) and a fish eye (p = 0.006), but not compared to a witchetty grub (p = 1). The time to retch after eating a kangaroo testicle was not significantly different compared to after eating a fish eye or witchetty grub (both ps = 1). Finally, the time to retch was not significantly different after eating a fish eyeball compared to a witchetty grub (p = 1). Again, it’s worth noting that (1) we wouldn’t interpret these effects if we decide that the main effect of the type of animal eaten wasn’t significant, and (2) we wouldn’t normally conduct post hoc tests and contrasts, we’d do one or the other.

Output 5
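The p-values of exactly 1 are what you would expect from a Bonferroni adjustment, which multiplies each p-value by the number of comparisons and caps the result at 1. The sketch below assumes that this is the adjustment that was selected (the text above doesn’t say so explicitly for Output 5). The unadjusted p-values are invented for illustration and are not taken from the output.

```python
from itertools import combinations


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


foods = ["stick insect", "kangaroo testicle", "fish eye", "witchetty grub"]
pairs = list(combinations(foods, 2))  # every pairwise comparison: 6 in total

# Hypothetical unadjusted p-values, one per pair (NOT from Output 5)
raw_p = [0.002, 0.001, 0.40, 0.35, 0.50, 0.45]
adjusted = bonferroni(raw_p)

for (a, b), p in zip(pairs, adjusted):
    print(f"{a} vs. {b}: adjusted p = {p:.3f}")
```

Note that an unadjusted p of 0.002 becomes 0.012 after multiplying by 6, which is consistent with the stick insect vs. kangaroo testicle values reported for the contrast and the post hoc test.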

Two predictor variables

The second example (again from Field (2017)) is based on evidence that attitudes towards stimuli can be changed using positive and negative imagery (e.g., Stuart, Shimp, and Engle (1987); Hofmann et al. (2010)). As part of an initiative to stop binge drinking in teenagers, the government funded scientists to look at whether negative imagery could be used to make teenagers’ attitudes towards alcohol more negative. The scientists compared the effects of negative imagery against positive and neutral imagery for different types of drinks. Participants viewed a total of nine videos over three sessions. In one session, they saw three videos: (1) a brand of beer (Strange Brew) presented alongside negative imagery (a bunch of inanimate dead bodies in a trendy bar with the slogan ‘Strange Brew: who needs a liver?’); (2) a brand of wine (Liquid Fire) presented within positive imagery (a bunch of sexy hipster types in a trendy bar with the slogan ‘Liquid Fire: your life would be so much better if you were a sexy hipster type’); and (3) a brand of water (Backwater) presented with neutral imagery (some completely average people in a trendy bar accompanied by the slogan ‘Backwater: it will make no difference to your life one way or another’). In a second session (a week later), the participants saw the same three brands, but this time Strange Brew was accompanied by the positive imagery, Liquid Fire by the neutral image and Backwater by the negative. In a third session, the participants saw Strange Brew accompanied by the neutral image, Liquid Fire by the negative image and Backwater by the positive. After each advert participants rated the drinks from −100 (dislike very much) through 0 (neutral) to 100 (like very much). The order of adverts was randomized, as was the order in which people participated in the three sessions. This design is quite complex. 
There are two predictor/independent variables: the type of drink (beer, wine or water) and the type of imagery used (positive, negative or neutral). These two variables completely cross over, producing nine experimental conditions represented by 9 columns in the data editor. Figure 2 shows the data in the file Attitude.sav. The 20 rows show the different participants, and the columns indicate the ratings of the drinks across the 9 experimental conditions.

Figure 2: Attitude.sav
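To see why two fully crossed predictors with three levels each produce nine columns, here is a tiny Python sketch. The column names it builds are hypothetical (the actual variable names in Attitude.sav may differ); the ordering of levels follows the one used later in this tutorial.

```python
from itertools import product

drinks = ["beer", "wine", "water"]
imagery = ["positive", "negative", "neutral"]

# Every drink paired with every type of imagery (hypothetical column names)
conditions = [f"{drink}_{image}" for drink, image in product(drinks, imagery)]

print(len(conditions))  # 9
print(conditions)
```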

Fitting the model

To fit the model we use the Analyze > General Linear Model > Repeated Measures … menu. The following video shows how.

Interpreting a two-way design

Descriptives

Output 6 contains the means and standard deviations across the nine conditions. The names in this table are the variable labels in the data editor. The descriptives tell us that the variability among scores was greatest when beer was used as a product (compare the standard deviations of the beer variables against the others). Also, when dead bodies were used as imagery the ratings given to the products were negative (as expected) for wine and water but not for beer (for some reason negative imagery didn’t have the expected effect when beer was used as a stimulus).

Output 6

Sphericity

Output 7 shows the sphericity estimates for each of the three effects in the model (two main effects and one interaction). All three effects have estimates less than 1, indicating some deviation from sphericity, so we may as well correct for these.

Output 7

F-statistics

Output 8 shows the F-statistics (with corrections). The table is quite mind-blowing, but we can stay calm by focussing on the information that we plan to use. For example, if, like me, you want to routinely report Greenhouse–Geisser corrected values then you can focus on these values. The significance values tell us that there is a significant main effect of the type of drink used as a stimulus, a significant main effect of the type of imagery used and a significant interaction between these two variables. I will examine each of these effects in turn.

Output 8

The main effect of drink

Quiz

Explanation

The main effect of the type of drink was significant, which tells us that if we ignore the type of imagery that was used, participants rated some drinks significantly differently than others. We requested estimated marginal means for the effects in the model and these are shown in Output 9. The levels of Drink are labelled 1, 2 and 3, so we must think back to the order in which we assigned variables to know which row of the table relates to which drink. We entered the beer condition first and the water condition last, so level 1 is beer, level 2 is wine and level 3 is water. The means show that beer and wine were rated higher than water (with beer being rated most highly).

Output 9
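‘Ignoring the type of imagery’ means averaging over it. As a rough sketch, in a balanced design like this one the marginal mean for a drink is the average of its three cell means. The numbers below are invented purely to show the arithmetic; they are not the values in Output 9.

```python
# Hypothetical cell means (NOT the values from the SPSS output)
cell_means = {
    "beer": {"positive": 20.0, "negative": 5.0, "neutral": 11.0},
    "wine": {"positive": 25.0, "negative": -10.0, "neutral": 12.0},
    "water": {"positive": 18.0, "negative": -9.0, "neutral": 3.0},
}

# Marginal mean for each drink: average across the imagery conditions
drink_means = {
    drink: sum(by_imagery.values()) / len(by_imagery)
    for drink, by_imagery in cell_means.items()
}

for drink, value in drink_means.items():
    print(f"{drink}: {value:.2f}")
```

The same idea, averaging down the other dimension, gives the marginal means for imagery.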

Output 10 shows the Bonferroni adjusted pairwise comparisons for the main effect of Drink. The significant main effect seems to reflect a significant difference (p = 0.001) between levels 2 and 3 (wine and water). Curiously, the difference between the beer and water conditions is larger than that for wine and water yet this effect is non-significant (p = 0.066). This inconsistency can be explained by looking at the standard error in the beer condition, which is large compared to the wine condition, indicating a lot of uncertainty about the value of the mean for beer.

Output 10
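The logic here is that each pairwise test is, roughly speaking, a mean difference divided by its standard error, so a big difference measured imprecisely can produce a smaller test statistic than a modest difference measured precisely. A minimal sketch with invented numbers (not those in Output 10):

```python
def t_ratio(mean_difference, standard_error):
    """Test statistic as signal (difference) over noise (standard error)."""
    return mean_difference / standard_error


# Hypothetical values: the first difference is larger but much noisier
beer_vs_water = t_ratio(mean_difference=8.0, standard_error=4.0)
wine_vs_water = t_ratio(mean_difference=5.0, standard_error=1.0)

print(f"beer vs. water: t = {beer_vs_water:.1f}")  # 2.0
print(f"wine vs. water: t = {wine_vs_water:.1f}")  # 5.0
```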

The main effect of imagery

Quiz

Explanation

The main effect of the type of imagery had a significant influence on participants’ ratings of the drinks (Output 8). This effect tells us that if we ignore the type of drink that was used, participants’ ratings of those drinks were different according to the type of imagery that was used. Output 11 shows the means that we requested when we fit the model. The levels of imagery are labelled 1, 2 and 3 so we need to again think back to how we assigned variables. We assigned the positive condition to the first level and the neutral condition to the last. Positive imagery resulted in very positive ratings (compared to neutral imagery) and negative imagery resulted in negative ratings (compared to neutral imagery). Output 12 shows the Bonferroni adjusted pairwise comparisons, which show that the significant main effect reflects significant differences (all ps < 0.001) between levels 1 and 2 (positive and negative), levels 1 and 3 (positive and neutral) and levels 2 and 3 (negative and neutral).

Output 11

Output 12

The interaction effect (drink × imagery)

Explanation

The type of imagery interacted significantly with the type of drink used as a stimulus to affect ratings (Output 8). This effect tells us that the type of imagery used had a different effect depending on which type of drink was being rated. We can use the means in Output 13 to unpick this interaction (these values are essentially the same as the initial descriptive statistics in Output 6, except that the standard errors are displayed rather than the standard deviations).

Output 13
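If you want to move between the two tables, the standard error of a mean is the standard deviation divided by the square root of the sample size. The sketch below uses n = 20 (the number of participants in this example) and a made-up standard deviation; I’m assuming the standard errors in the table are computed this way for each condition.

```python
import math


def standard_error(sd, n):
    """Standard error of a mean from a standard deviation and sample size."""
    return sd / math.sqrt(n)


# sd is hypothetical; n = 20 participants as in the example
se = standard_error(sd=10.0, n=20)
print(f"SE = {se:.2f}")  # SE = 2.24
```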

The graph shows that the pattern of response across drinks was similar when positive and neutral imagery were used (blue and grey lines). That is, ratings were positive for beer, they were slightly higher for wine and they were lower for water. The fact that the (blue) line representing positive imagery is higher than the neutral (grey) line indicates that positive imagery produced higher ratings than neutral imagery across all drinks. The red line (representing negative imagery) shows a different pattern: ratings were lowest for wine and water but quite high for beer. Therefore, negative imagery had the desired effect on attitudes towards wine and water, but much less impact on ratings of beer. In other words, the interaction is likely to reflect the fact that imagery has the expected effect for wine and water (that is, ratings are highest for positive imagery, lowest for negative imagery and neutral falls somewhere in between) but not for beer (where ratings after negative imagery do not seem to be particularly negative). To verify the interpretation of the interaction effect, we can look at the contrasts.

Contrasts for the main effects

We requested simple contrasts for both the Drink (water was used as the control category) and Imagery variables (neutral imagery was used as the control category). Output 14 shows these contrasts. The table is split into main effects and interactions, and within each are the contrasts. If you are confused as to which level is which, think back to how we specified them when we fit the model. For the main effect of drink, the first contrast shows a significant difference between level 1 (beer) and level 3 (water), F(1, 19) = 6.22, p = 0.022, which contradicts the equivalent post hoc test (see Output 10).

Output 14

The next contrast shows a significant difference between level 2 (wine) and level 3 (water), F(1, 19) = 18.61, p < 0.001. For the imagery main effect, level 1 (positive) is significantly different than level 3 (neutral), F(1, 19) = 142.19, p < 0.001, and level 2 (negative imagery) is significantly different than level 3 (neutral), F(1, 19) = 47.07, p < 0.001.

Contrasts for the interaction effect

The contrasts for the interaction term are more interesting. By interesting I mean hellish. To help us interpret these contrasts Figure 3 breaks the interaction into the four contrasts. The first contrast for the interaction looks at level 1 of Drink (beer) compared to level 3 (water), when positive imagery (level 1) is used compared to neutral (level 3). This contrast is non-significant, p = 0.225. This result tells us that the higher ratings when positive imagery is used (compared to neutral imagery) are similar for beer and water. Figure 3 (top left) shows this contrast: the non-significance means that the distance between the lines in the beer condition is not significantly different from the distance between the lines in the water condition. We could conclude that the improvement of ratings due to positive imagery compared to neutral is not noticeably affected by whether people are evaluating beer or water.

Figure 3: Figures illustrating the contrasts for the interaction term

The second contrast for the interaction term looks at level 1 of Drink (beer) compared to level 3 (water), when negative imagery (level 2) is used compared to neutral (level 3). This contrast is significant, F(1, 19) = 6.75, p = 0.018. Figure 3 (top right) shows the contrast. The significance means that the distance between the red and grey line in the beer condition is significantly smaller than the distance between the red and grey line in the water condition. When beer is being rated the ratings are similar regardless of the type of imagery, but for water ratings are much lower with negative imagery than neutral imagery.

The third contrast looks at level 2 of Drink (wine) compared to level 3 (water), when positive imagery (level 1) is used compared to neutral (level 3). This contrast is non-significant, p = 0.633, indicating that the higher ratings when positive imagery is used (compared to neutral imagery) are similar for wine and water. Figure 3 (bottom left) shows this contrast. The non-significance implies that the distance between the grey and blue lines in the wine condition is similar to the distance between the lines in the water condition.

The final contrast for the interaction term looks at level 2 of Drink (wine) compared to level 3 (water), when negative imagery (level 2) is used compared to neutral (level 3). This contrast is significant, F(1, 19) = 26.91, p < 0.001. Figure 3 (bottom right) shows this contrast. The significance implies that the distance between the red and grey lines in the wine condition is significantly larger than the distance between the lines in the water condition. In short, the lower ratings due to negative imagery (compared to neutral) are significantly greater for wine than for water.
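Each of these interaction contrasts boils down to a ‘difference of differences’. As a rough sketch of the idea (not a reproduction of what SPSS does internally), you can compute one contrast score per participant and test whether its mean differs from zero; the squared one-sample t is then an F with 1 and n − 1 degrees of freedom, which is why the contrasts above are reported as F(1, 19) for 20 participants. The ratings below are invented for five imaginary participants and are not taken from Attitude.sav.

```python
from statistics import mean, variance

# Hypothetical ratings for five participants (NOT from Attitude.sav)
beer_negative = [4, 6, 5, 7, 3]
beer_neutral = [6, 7, 8, 8, 6]
water_negative = [-9, -8, -10, -7, -11]
water_neutral = [2, 3, 1, 4, 0]

# Per participant: (negative - neutral for beer) minus (negative - neutral for water)
contrast_scores = [
    (b_neg - b_neu) - (w_neg - w_neu)
    for b_neg, b_neu, w_neg, w_neu in zip(
        beer_negative, beer_neutral, water_negative, water_neutral
    )
]

n = len(contrast_scores)
# Squared one-sample t: (mean / (sd / sqrt(n)))**2 = n * mean**2 / variance
f_statistic = n * mean(contrast_scores) ** 2 / variance(contrast_scores)
df = (1, n - 1)

print(contrast_scores)                               # [9, 10, 8, 10, 8]
print(f"F({df[0]}, {df[1]}) = {f_statistic:.2f}")    # F(1, 4) = 405.00
```

Positive contrast scores here mean that the drop in ratings from neutral to negative imagery was smaller for beer than for water, which is the pattern described for the second contrast.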

These contrasts tell us nothing about the differences between the beer and wine conditions (or the positive and negative conditions), and different contrasts would have to be run to find out more. However, they do tell us that, relative to the neutral condition, positive imagery increased liking for the products regardless of the product, whereas negative imagery affected ratings of wine but not so much beer. These differences were not predicted.

Interpreting interaction terms is complex, and even some well-respected researchers struggle with them, so don’t feel disheartened if you find them hard. Try to be thorough and break each effect down using contrasts and graphs, and you will get there.

Unguided example

Let’s look at another example from Field (2017). In the previous tutorial we came across the beer-goggles effect and saw that it was stronger for unattractive faces. We took a follow-up sample of 26 people and gave them doses of alcohol (0 pints, 2 pints, 4 pints and 6 pints of lager) over four different weeks. We asked them to rate a bunch of photos of unattractive faces in either dim or bright lighting. The outcome measure was the mean attractiveness rating (out of 100) of the faces and the predictors were the dose of alcohol and the lighting conditions.

The data are in the file BeerGogglesLighting.sav. The variables dim0, dim2, dim4 and dim6 contain the median rating of the faces in dim lighting after no alcohol, 2, 4 and 6 pints respectively, and bright0, bright2, bright4 and bright6 contain the same information for ratings made in bright lighting. Fit a model to see whether alcohol dose and lighting interact to magnify the beer-goggles effect. Use a repeated contrast on the effect of alcohol.

Quiz

Next tutorial

The next tutorial will look at analysing mixed designs using the general linear model.

References

Field, Andy P. 2017. Discovering Statistics Using IBM SPSS Statistics: And Sex and Drugs and Rock ’N’ Roll. 5th ed. London: Sage.

Greenhouse, S. W., and S. Geisser. 1959. “On Methods in the Analysis of Profile Data.” Psychometrika 24: 95–112.

Hofmann, Wilhelm, Jan De Houwer, Marco Perugini, Frank Baeyens, and Geert Crombez. 2010. “Evaluative Conditioning in Humans: A Meta-Analysis.” Psychological Bulletin 136 (3): 390–421. doi:10.1037/a0018916.

Huynh, H., and L. S. Feldt. 1976. “Estimation of the Box Correction for Degrees of Freedom from Sample Data in Randomised Block and Split-Plot Designs.” Journal of Educational Statistics 1 (1): 69–82.

Stuart, E. W., T. A. Shimp, and R. W. Engle. 1987. “Classical-Conditioning of Consumer Attitudes - Four Experiments in an Advertising Context.” Journal of Consumer Research 14 (3): 334–49.


  1. This tutorial is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License: basically, you can use it for teaching and non-profit activities but not meddle with it.

Andy Field