GLM 3: the beast of bias

The general linear model: bias

Overview

This tutorial is one of a series that accompanies Discovering Statistics Using IBM SPSS Statistics (Field 2017) by me, Andy Field. These tutorials contain abridged sections from the book (so there are some copyright considerations).1

  • Who is the tutorial aimed at?
    • Students enrolled on my Discovering Statistics module at the University of Sussex, or anyone reading my textbook Discovering Statistics Using IBM SPSS Statistics (Field 2017)
  • What is covered?
    • This tutorial develops the material from the previous tutorial to look at bias in the linear model using IBM SPSS Statistics. We will look at how to test and correct for things that bias model parameters and significance tests.
    • This tutorial does not teach the background theory: it is assumed you have either attended my lecture or read the relevant chapter in my book (or someone else’s)
    • The aim of this tutorial is to augment the theory that you already know by guiding you through fitting linear models using IBM SPSS Statistics and asking you questions to test your knowledge along the way.
  • Want more information?
    • The main tutorial follows the example described in detail in Field (2017), so there’s a thorough account in there.
    • You can access free lectures and screencasts on my YouTube channel
    • There are more statistical resources on my website www.discoveringstatistics.com

Assumptions of the linear model

First, let’s see whether you understood the lecture and book chapter with a quiz, because everyone likes to start a tutorial with a quiz. Or is that chocolate? “Everyone likes to start a tutorial with chocolate?” does sound plausible, but I get so confused between the two. No, I’m pretty sure it’s quizzes that people like, not chocolate, so here goes:

Quiz

Fitting the model

We continue the example from Field (2017) used in previous tutorials, which predicts physical and downloaded album sales (the outcome variable) from the amount (in thousands of pounds/dollars/euros/whatever currency you use) spent promoting the album before release (Adverts), airplay of songs from the album in the week before release (Airplay), and how attractive people found the band’s image out of 10 (Image). The data are in the file Album Sales.sav and, although you’ll be familiar with the data set if you have done the previous tutorials, it looks like this:

Figure 1: The data in IBM SPSS Statistics


As revision from previous tutorials, the model we’re fitting is:

\[ \text{Sales}_i = b_0 + b_1\text{Advertising}_i + b_2\text{Airplay}_i + b_3\text{Image}_i + \varepsilon_i \]

We fit the model hierarchically in two blocks, with advertising budget entered in the first block and the other two predictors entered in a second block. We’re going to fit the model as we have done in previous tutorials, but this time look at options to give us diagnostic information about the model. The following video recaps how to fit this model using SPSS Statistics and how to ask for diagnostic information.
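If you like to cross-check results outside SPSS, here is a minimal sketch of the same two-block hierarchical model using Python’s statsmodels. It assumes the data file can be read into a data frame with variables named Sales, Adverts, Airplay and Image (check the variable names in your copy of Album Sales.sav); it is an illustration, not part of the SPSS workflow shown in the video.

```python
# A sketch of the hierarchical model in Python (statsmodels), assuming
# Album Sales.sav contains variables named Sales, Adverts, Airplay and Image.
import pandas as pd
import statsmodels.formula.api as smf

album = pd.read_spss("Album Sales.sav")  # needs the pyreadstat package

# Block 1: advertising budget only
block1 = smf.ols("Sales ~ Adverts", data=album).fit()

# Block 2: add airplay and image
block2 = smf.ols("Sales ~ Adverts + Airplay + Image", data=album).fit()

print(block1.summary())
print(block2.summary())

# The change in R-squared between blocks mirrors SPSS's "R Square Change"
print("R-squared change:", block2.rsquared - block1.rsquared)
```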

Residual plots

We requested several plots of the residuals from the model. Remember from the lecture that we can use these plots to identify potential problems with linearity and spherical errors (i.e., independent and homoscedastic errors). The figure below (taken from Field (2017)) shows that we’re looking for a random scatter of dots: curvature in the plot indicates a lack of linearity, and a funnel shape (residuals that ‘fan out’) indicates heteroscedasticity.

Figure 2: Residual plots


Let’s look at the plots in our output to see whether we can see any of these patterns. The first plot shows the standardized predicted values from the model (zpred) against the standardized residuals from the model (zresid). The plots included here should match your SPSS output (if they don’t then one of us has fit the model incorrectly) but I have added an overlay to help you to interpret them.
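If you want to reproduce a comparable plot outside SPSS, the sketch below (assuming the statsmodels fit block2 from the earlier sketch) plots standardized predicted values against standardized residuals, using z-scores as an approximation to SPSS’s ZPRED and ZRESID. SPSS’s own version, with my overlay, is shown in Figure 3.

```python
# A sketch of the zpred vs. zresid plot, assuming the 'block2' fit from
# the earlier statsmodels sketch. z-scores approximate SPSS's ZPRED/ZRESID.
import matplotlib.pyplot as plt
from scipy import stats

zpred = stats.zscore(block2.fittedvalues)  # standardized predicted values
zresid = stats.zscore(block2.resid)        # standardized residuals

plt.scatter(zpred, zresid)
plt.axhline(0, linestyle="--")
plt.xlabel("Standardized predicted value (zpred)")
plt.ylabel("Standardized residual (zresid)")
plt.title("Look for random scatter: curvature or a funnel shape is a warning sign")
plt.show()
```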

Figure 3: zpred vs. zresid for the model


Quiz

Now let’s look at the partial plots to see whether we can see any of these patterns. There are three plots: one for each predictor. Again I have added an overlay on each plot to help you to interpret them.

Figure 4: Partial plots for the model


Quiz

Finally, even though normality is not a major concern (especially with a sample size of 200), we can check whether the residuals are normally distributed using the histogram and P-P plot that we requested.
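As before, here is a hedged sketch of how you could draw equivalent plots outside SPSS (again assuming the block2 fit from the earlier sketch); the plots SPSS produces are in Figure 5.

```python
# A sketch of the normality checks: histogram and P-P plot of the
# standardized residuals, assuming the 'block2' fit from earlier.
import matplotlib.pyplot as plt
from scipy import stats
from statsmodels.graphics.gofplots import ProbPlot

zresid = stats.zscore(block2.resid)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(zresid, bins=20)
axes[0].set_title("Histogram of standardized residuals")

# ProbPlot.ppplot draws a P-P plot against the normal distribution
ProbPlot(zresid).ppplot(line="45", ax=axes[1])
axes[1].set_title("Normal P-P plot of standardized residuals")
plt.show()
```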

Figure 5: Normality plots for the model


Quiz

Outliers and influential cases

As a bare minimum we should check the model for outliers using the standardized residuals and use Cook’s distance to identify influential cases. Field (2017) describes a much wider battery of values that you can use to check for these things, so if you’re starting to get the stats bug (?!) then check that out. Otherwise, we can apply the following heuristics (a sketch showing how to apply them follows the list):

  • Standardized residuals: in an average sample, 95% of standardized residuals should lie between −2 and 2, 99% should lie between −2.5 and 2.5, and any case for which the absolute value of the standardized residual is 3 or more is likely to be an outlier.
  • Cook’s distance: measures the influence of a single case on the model as a whole. Values greater than 1 may be cause for concern (Cook and Weisberg 1982).
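Here is that sketch, again assuming the statsmodels fit block2 from earlier; get_influence() provides standardized (internally studentized) residuals and Cook’s distances that play the same role as the diagnostics SPSS saves.

```python
# A sketch applying the heuristics above to the 'block2' fit from earlier.
influence = block2.get_influence()
std_resid = influence.resid_studentized_internal  # standardized residuals
cooks_d = influence.cooks_distance[0]             # Cook's distance per case

n = len(std_resid)
print(f"|std. residual| > 2:   {(abs(std_resid) > 2).sum()} cases "
      f"(about {0.05 * n:.0f} expected by chance in a sample of {n})")
print(f"|std. residual| > 2.5: {(abs(std_resid) > 2.5).sum()} cases "
      f"(about {0.01 * n:.0f} expected by chance)")
print(f"|std. residual| > 3:   {(abs(std_resid) > 3).sum()} cases")
print(f"Cook's distance > 1:   {(cooks_d > 1).sum()} cases")
```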

We asked for a summary of standardized residuals when we selected these options:

Figure 9: Requesting casewise diagnostics for the model


The resulting output (which should match that in your viewer window) is:

Output 1: Casewise diagnostic summary for the model


When attempting the following quiz remember that there were 200 cases in total and that the absolute value is the value when you ignore the plus or minus sign. The standardized residuals are labelled Std. Residual in the output:

Quiz

For Cook’s distance, it is a matter of screening the column in the data editor in which you saved these values and noting any cases with values greater than 1.

Figure 10: Values of Cook’s distance in the data editor


Quiz

Bootstrap confidence intervals

The next output contains the bootstrap confidence intervals for each model parameter. These bootstrap confidence intervals2 and significance values do not rely on assumptions of normality or homoscedasticity, so they give us an accurate estimate of the population value of b for each predictor (assuming our sample is one of the 95% with confidence intervals that contain the population value).
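For context, the sketch below shows a simple percentile bootstrap of the model parameters in Python (assuming the album data frame and model formula from the first sketch). It is conceptually what SPSS’s bootstrap option does rather than its exact algorithm, and because cases are resampled at random your numbers will not match Output 2 exactly.

```python
# A sketch of a percentile bootstrap for the model parameters, assuming
# the 'album' data frame from the first sketch. Results vary run to run
# because cases are resampled at random.
import numpy as np
import statsmodels.formula.api as smf

boot_coefs = []
for _ in range(2000):
    # resample cases with replacement and refit the full model
    resample = album.sample(n=len(album), replace=True)
    fit = smf.ols("Sales ~ Adverts + Airplay + Image", data=resample).fit()
    boot_coefs.append(fit.params)

boot_coefs = np.array(boot_coefs)
lower, upper = np.percentile(boot_coefs, [2.5, 97.5], axis=0)
for name, lo, hi in zip(fit.params.index, lower, upper):
    print(f"{name}: 95% percentile bootstrap CI [{lo:.2f}, {hi:.2f}]")
```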

Output 2: Bootstrapped model parameter estimates


Quiz

The quiz told you about the confidence interval for image. For the remaining predictors, assuming that each confidence interval is one of the 95% that contain the population parameter, the confidence intervals tell us that:

  • The true size of the relationship between advertising budget and album sales lies somewhere between 0.07 and 0.10.
  • The true size of the relationship between airplay and album sales lies somewhere between 2.77 and 3.97.

To sum up, the bootstrapped statistics tell us that advertising, b = 0.09 [0.07, 0.10], p = 0.001, airplay, b = 3.37 [2.77, 3.97], p = 0.001, and the band’s image, b = 11.09 [6.26, 15.28], p = 0.001, all significantly predict album sales. Basically, we interpret bootstrap confidence intervals and significance tests in the same way as regular ones, but the bootstrapped ones should be robust to violations of the assumptions of the model.

Unguided example

We’ll use the same unguided example as in the last tutorial. To recap, the data are in the file SocialAnxietyRegression.sav. This file contains three variables of interest to us:

  • The Social Phobia and Anxiety Inventory (SPAI), which measures levels of social anxiety (Turner, Beidel, and Dancu 1996).
  • Obsessive Beliefs Questionnaire (OBQ), which measures the degree to which people experience obsessive beliefs like those found in OCD (Steketee et al. 2001).
  • The Test of Self-Conscious Affect (TOSCA), which measures shame (Tangney et al. 2000).

In the previous tutorial we fitted a hierarchical linear model with two blocks:

  1. Block 1: the first block contains any predictors that we expect to predict social anxiety. In this example we have only one variable that we expect, theoretically, to predict social anxiety: shame (measured by the TOSCA).
  2. Block 2: the second block contains OBQ, the predictor variable that we don’t necessarily expect to predict social anxiety.

Use what you have learned to fit this model, this time saving diagnostic information. Use the output to answer these questions.

Plots

The plots that you should get are displayed below so you can check your output.

Figure 11: Plots from the social anxiety model (zpred vs. zresid, partial plot for shame, partial plot for OBQ, and P-P plot of the standardized residuals)


Quiz

Casewise diagnostics

To answer these questions remember that the sample size was 134.

Quiz

Bootstrap confidence intervals

The nature of bootstrapping means that I can’t ask questions about the specific values. Here are some general questions:

Quiz

Next tutorial

The next tutorial will look at using categorical predictors in the linear model.

Useful resources

These are useful resources for understanding some of the concepts in this tutorial. They are not written or hosted by me, so I take no responsibility for whether they work; if they are working, though, you might find them useful.

  • Click here for an interactive app that illustrates how diagnostics from the model change as you fit a linear model to a linear relationship (select Linear up or Linear down), non-linear relationships (select Curved up or Curved down), and heteroscedastic data (select Fan-shaped).

References

Cook, R. D., and S. Weisberg. 1982. Residuals and Influence in Regression. New York: Chapman & Hall.

Field, Andy P. 2017. Discovering Statistics Using IBM SPSS Statistics: And Sex and Drugs and Rock ’n’ Roll. 5th ed. London: Sage.

Steketee, G., R. Frost, N. Amir, M. Bouvard, C. Carmin, D. A. Clark, J. Cottraux, et al. 2001. “Development and Initial Validation of the Obsessive Beliefs Questionnaire and the Interpretation of Intrusions Inventory.” Behaviour Research and Therapy 39 (8): 987–1006.

Tangney, J. P., R. Dearing, P. E. Wagner, and R. Gramzow. 2000. The Test of Self-Conscious Affect-3 (TOSCA-3). Fairfax, VA: George Mason University.

Turner, S. M., D. C. Beidel, and C. V. Dancu. 1996. Social Phobia and Anxiety Inventory: Manual. Toronto: Multi-Health Systems Inc.


  1. This tutorial is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License: basically, you can use it for teaching and non-profit activities but not meddle with it.

  2. Because of how bootstrapping works, the values in your output will be different from mine, and will be different again if you re-run the analysis.

Andy Field