Tidy Data Frames of Marginal Effects

Daniel Lüdecke

2018-08-11

Aim of the ggeffects-package

The aim of this package is similar to the broom-package: transforming “untidy” input into a tidy data frame, especially for further use with ggplot. However, ggeffects does not return model-summaries; rather, this package computes marginal effects at the mean or average marginal effects from statistical models and returns the result as tidy data frame (as tibbles, to be more precisely).

Since the focus lies on plotting the data (the marginal effects), at least one model term needs to be specified for which the effects are computed. It is also possible to compute marginal effects for model terms, grouped by the levels of another model’s predictor. The package also allows plotting marginal effects for two- or three-way-interactions, or for specific values of a model term only. Examples are shown below.

Consistent and tidy structure

The returned data frames always have the same, consistent structure and column names, so it’s easy to create ggplot-plots without the need to re-write the arguments to be mapped in each ggplot-call. x and predicted are the values for the x- and y-axis. conf.low and conf.high could be used as ymin and ymax aesthetics for ribbons to add confidence bands to the plot. group can be used as grouping-aesthetics, or for faceting.

Marginal effects at the mean

ggpredict() computes predicted values for all possible levels and values from a model’s predictors. In the simplest case, a fitted model is passed as first argument, and the term in question as second argument:

library(ggeffects)
data(efc)
fit <- lm(barthtot ~ c12hour + neg_c_7 + c161sex + c172code, data = efc)

ggpredict(fit, terms = "c12hour")
#> # A tibble: 35 x 5
#>        x predicted conf.low conf.high group
#>    <dbl>     <dbl>    <dbl>     <dbl> <fct>
#>  1     0      75.4     73.3      77.6 1    
#>  2     5      74.2     72.1      76.3 1    
#>  3    10      72.9     70.9      74.9 1    
#>  4    15      71.6     69.8      73.5 1    
#>  5    20      70.4     68.6      72.2 1    
#>  6    25      69.1     67.4      70.9 1    
#>  7    30      67.8     66.1      69.5 1    
#>  8    35      66.6     64.9      68.2 1    
#>  9    40      65.3     63.7      67.0 1    
#> 10    45      64.0     62.4      65.7 1    
#> # ... with 25 more rows

The output shows the predicted values for the response at each value from the term c12hour. The data is already in shape for ggplot:

library(ggplot2)
theme_set(theme_bw())

mydf <- ggpredict(fit, terms = "c12hour")
ggplot(mydf, aes(x, predicted)) + geom_line()

Marginal effects at the mean for different groups

The terms-argument accepts up to three model terms, where the second and third term indicate grouping levels. This allows predictions for the term in question at different levels for other model terms:

ggpredict(fit, terms = c("c12hour", "c172code"))
#> # A tibble: 105 x 5
#>        x predicted conf.low conf.high group                          
#>    <dbl>     <dbl>    <dbl>     <dbl> <fct>                          
#>  1     0      74.7     71.3      78.2 low level of education         
#>  2     0      75.5     73.3      77.6 intermediate level of education
#>  3     0      76.2     72.8      79.5 high level of education        
#>  4     5      73.5     70.1      76.9 low level of education         
#>  5     5      74.2     72.1      76.3 intermediate level of education
#>  6     5      74.9     71.6      78.2 high level of education        
#>  7    10      72.2     68.9      75.5 low level of education         
#>  8    10      72.9     71.0      74.9 intermediate level of education
#>  9    10      73.7     70.4      76.9 high level of education        
#> 10    15      70.9     67.7      74.2 low level of education         
#> # ... with 95 more rows

Creating a ggplot is pretty straightforward: the colour-aesthetics is mapped with the group-column:

mydf <- ggpredict(fit, terms = c("c12hour", "c172code"))
ggplot(mydf, aes(x, predicted, colour = group)) + geom_line()

Finally, a second grouping structure can be defined, which will create another column named facet, which - as the name implies - might be used to create a facted plot:

mydf <- ggpredict(fit, terms = c("c12hour", "c172code", "c161sex"))
mydf
#> # A tibble: 210 x 6
#>        x predicted conf.low conf.high group                           facet     
#>    <dbl>     <dbl>    <dbl>     <dbl> <fct>                           <fct>     
#>  1     0      75.0     71.4      78.6 low level of education          [2] Female
#>  2     0      74.0     69.4      78.6 low level of education          [1] Male  
#>  3     0      75.7     73.3      78.1 intermediate level of education [2] Female
#>  4     0      74.7     71.1      78.3 intermediate level of education [1] Male  
#>  5     0      76.4     72.9      80.0 high level of education         [2] Female
#>  6     0      75.4     71.0      79.7 high level of education         [1] Male  
#>  7     5      73.7     70.2      77.2 low level of education          [2] Female
#>  8     5      72.7     68.1      77.2 low level of education          [1] Male  
#>  9     5      74.4     72.1      76.7 intermediate level of education [2] Female
#> 10     5      73.4     69.8      77.0 intermediate level of education [1] Male  
#> # ... with 200 more rows
ggplot(mydf, aes(x, predicted, colour = group)) + 
  geom_line() + 
  facet_wrap(~facet)

Marginal effects for each model term

If the term argument is either missing or NULL, marginal effects for each model term are calculated. The result is returned as a list, which can be plotted manually (or using the plot() function).

mydf <- ggpredict(fit)
mydf
#> $c12hour
#> # A tibble: 35 x 5
#>        x predicted conf.low conf.high group  
#>    <dbl>     <dbl>    <dbl>     <dbl> <chr>  
#>  1     0      75.4     73.3      77.6 c12hour
#>  2     5      74.2     72.1      76.3 c12hour
#>  3    10      72.9     70.9      74.9 c12hour
#>  4    15      71.6     69.8      73.5 c12hour
#>  5    20      70.4     68.6      72.2 c12hour
#>  6    25      69.1     67.4      70.9 c12hour
#>  7    30      67.8     66.1      69.5 c12hour
#>  8    35      66.6     64.9      68.2 c12hour
#>  9    40      65.3     63.7      67.0 c12hour
#> 10    45      64.0     62.4      65.7 c12hour
#> # ... with 25 more rows
#> 
#> $neg_c_7
#> # A tibble: 12 x 5
#>        x predicted conf.low conf.high group  
#>    <int>     <dbl>    <dbl>     <dbl> <chr>  
#>  1     6      78.2     75.1      81.2 neg_c_7
#>  2     8      73.6     71.2      75.9 neg_c_7
#>  3    10      69.0     67.1      70.8 neg_c_7
#>  4    12      64.4     62.7      66.0 neg_c_7
#>  5    14      59.8     57.9      61.7 neg_c_7
#>  6    16      55.2     52.7      57.7 neg_c_7
#>  7    18      50.6     47.4      53.8 neg_c_7
#>  8    20      46.0     42.0      50.0 neg_c_7
#>  9    22      41.4     36.6      46.2 neg_c_7
#> 10    24      36.8     31.2      42.4 neg_c_7
#> 11    26      32.2     25.8      38.7 neg_c_7
#> 12    28      27.6     20.3      34.9 neg_c_7
#> 
#> $c161sex
#> # A tibble: 2 x 5
#>       x predicted conf.low conf.high group  
#>   <dbl>     <dbl>    <dbl>     <dbl> <chr>  
#> 1     1      64.0     60.6      67.3 c161sex
#> 2     2      65.0     63.1      66.9 c161sex
#> 
#> $c172code
#> # A tibble: 3 x 5
#>       x predicted conf.low conf.high group   
#>   <dbl>     <dbl>    <dbl>     <dbl> <chr>   
#> 1     1      64.1     61.0      67.1 c172code
#> 2     2      64.8     63.1      66.4 c172code
#> 3     3      65.5     62.3      68.7 c172code
#> 
#> attr(,"class")
#> [1] "ggalleffects" "list"

Average marginal effects

ggaverage() compute average marginal effects. While ggpredict() creates a data-grid (using expand.grid()) for all possible combinations of values (even if some combinations are not present in the data), ggaverage() computes predicted values based on the given data. This means that different predicted values for the outcome may occure at the same value or level for the term in question. The predicted values are then averaged for each value of the term in question and the linear trend is smoothed accross the averaged predicted values. This means that the line representing the marginal effects may cross or diverge, and are not necessarily in paralell to each other.

mydf <- ggaverage(fit, terms = c("c12hour", "c172code"))
ggplot(mydf, aes(x, predicted, colour = group)) + geom_line()

Two- and Three-Way-Interactions

To plot the marginal effects of interaction terms, simply specify these terms in the terms-argument.

library(sjmisc)
data(efc)

# make categorical
efc$c161sex <- to_factor(efc$c161sex)

# fit model with interaction
fit <- lm(neg_c_7 ~ c12hour + barthtot * c161sex, data = efc)

# select only levels 30, 50 and 70 from continuous variable Barthel-Index
mydf <- ggpredict(fit, terms = c("barthtot [30,50,70]", "c161sex"))
ggplot(mydf, aes(x, predicted, colour = group)) + geom_line()

Since the terms-argument accepts up to three model terms, you can also compute marginal effects for a 3-way-interaction.

To plot the marginal effects of interaction terms, simply specify these terms in the terms-argument.

# fit model with 3-way-interaction
fit <- lm(neg_c_7 ~ c12hour * barthtot * c161sex, data = efc)

# select only levels 30, 50 and 70 from continuous variable Barthel-Index
mydf <- ggpredict(fit, terms = c("c12hour", "barthtot [30,50,70]", "c161sex"))

ggplot(mydf, aes(x, predicted, colour = group)) + 
  geom_line() +
  facet_wrap(~facet)

Polynomial terms and splines

ggpredict() also works for models with polynomial terms or splines. Following code reproduces the plot from ?splines::bs:

library(splines)
data(women)

fm1 <- lm(weight ~ bs(height, df = 5), data = women)
dat <- ggpredict(fm1, "height")

ggplot(dat, aes(x, predicted)) + 
  geom_line() +
  geom_point()

Labelling the data

ggeffects makes use of the sjlabelled-package and supports labelled data. If the data from the fitted models is labelled, the value and variable label attributes are usually copied to the model frame stored in the model object. ggeffects provides various getter-functions to access these labels, which are returned as character vector and can be used in ggplot’s lab()- or scale_*()-functions.

The data frame returned by ggpredict() or ggaverage() must be used as argument to one of the above function calls.

get_x_title(mydf)
#> [1] "average number of hours of care per week"
get_y_title(mydf)
#> [1] "Negative impact with 7 items"

ggplot(mydf, aes(x, predicted, colour = group)) + 
  geom_line() +
  facet_wrap(~facet) +
  labs(
    x = get_x_title(mydf),
    y = get_y_title(mydf),
    colour = get_legend_title(mydf)
  )