This documents reanalysis response time data from an Experiment performed by Freeman, Heathcote, Chalmers, and Hockley (2010) using the mixed model functionality of **afex** implemented in function `mixed`

followed by post-hoc tests using package **emmeans** (Lenth, 2017). After a brief description of the data set and research question, the code and results are presented.

The data are lexical decision and word naming latencies for 300 words and 300 nonwords from 45 participants presented in Freeman et al. (2010). The 300 items in each `stimulus`

condition were selected to form a balanced \(2 \times 2\) design with factors neighborhood `density`

(low versus high) and `frequency`

(low versus high). The `task`

was a between subjects factor: 25 participants worked on the lexical decision task and 20 participants on the naming task. After excluding erroneous responses each participants responded to between 135 and 150 words and between 124 and 150 nonwords. We analyzed log RTs which showed an approximately normal picture.

We start with loading some packages we will need throughout this example. For data manipulation we will be using the `dplyr`

and `tidyr`

packages from the `tidyverse`

. A thorough introduction to these packages is beyond this example, but well worth it, and can be found in ‘R for Data Science’ by Wickham and Grolemund. For plotting we will be diverging from the `tidyverse`

and use `lattice`

instead. At some later point in time I will change this to `ggplot2`

plots.

After loading the packages, we will load the data (which comes with `afex`

), remove the errors, and take a look at the variables in the data.

```
library("afex") # needed for mixed() and attaches lme4 automatically.
library("emmeans") # emmeans is needed for follow-up tests (and not anymore loaded automatically).
library("multcomp") # for advanced control for multiple testing/Type 1 errors.
library("dplyr") # for working with data frames
library("tidyr") # for transforming data frames from wide to long and the other way round.
library("lattice") # for plots
library("latticeExtra") # for combining lattice plots, etc.
lattice.options(default.theme = standard.theme(color = FALSE)) # black and white
lattice.options(default.args = list(as.table = TRUE)) # better ordering
data("fhch2010") # load
fhch <- droplevels(fhch2010[ fhch2010$correct,]) # remove errors
str(fhch2010) # structure of the data
```

```
## 'data.frame': 13222 obs. of 10 variables:
## $ id : Factor w/ 45 levels "N1","N12","N13",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ task : Factor w/ 2 levels "naming","lexdec": 1 1 1 1 1 1 1 1 1 1 ...
## $ stimulus : Factor w/ 2 levels "word","nonword": 1 1 1 2 2 1 2 2 1 2 ...
## $ density : Factor w/ 2 levels "low","high": 2 1 1 2 1 2 1 1 1 1 ...
## $ frequency: Factor w/ 2 levels "low","high": 1 2 2 2 2 2 1 2 1 2 ...
## $ length : Factor w/ 3 levels "4","5","6": 3 3 2 2 1 1 3 2 1 3 ...
## $ item : Factor w/ 600 levels "abide","acts",..: 363 121 202 525 580 135 42 368 227 141 ...
## $ rt : num 1.091 0.876 0.71 1.21 0.843 ...
## $ log_rt : num 0.0871 -0.1324 -0.3425 0.1906 -0.1708 ...
## $ correct : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
```

To make sure our expectations about the data match the data we use some `dplyr`

magic to confirm the number of participants per condition and items per participant.

```
## are all participants in only one task?
fhch2010 %>% group_by(id) %>%
summarise(task = n_distinct(task)) %>%
as.data.frame() %>%
{.$task == 1} %>%
all()
```

`## [1] TRUE`

```
## participants per condition:
fhch2010 %>% group_by(id) %>%
summarise(task = first(task)) %>%
ungroup() %>%
group_by(task) %>%
summarise(n = n())
```

```
## # A tibble: 2 x 2
## task n
## <fct> <int>
## 1 naming 20
## 2 lexdec 25
```

```
## number of different items per participant:
fhch2010 %>% group_by(id, stimulus) %>%
summarise(items = n_distinct(item)) %>%
ungroup() %>%
group_by(stimulus) %>%
summarise(min = min(items),
max = max(items),
mean = mean(items))
```

```
## # A tibble: 2 x 4
## stimulus min max mean
## <fct> <dbl> <dbl> <dbl>
## 1 word 139 150 147.
## 2 nonword 134 150 146.
```

Before running the analysis we should make sure that our dependent variable looks roughly normal. To compare `rt`

with `log_rt`

within the same graph using `lattice`

we first need to transform the data from the wide format (where both rt types occupy one column each) into the long format (in which the two rt types are combined into a single column with an additional indicator column). To do so we use `tidyr::gather`

. Then we simply call the `histogram`

function on the new `data.frame`

and make a few adjustments to the defaults to obtain a nice looking output. The plot shows that `log_rt`

looks clearly more normal than `rt`

, although not perfectly so. An interesting exercise could be to rerun the analysis below using a transformation that provides an even better ‘normalization’.

```
fhch_long <- fhch %>% gather("rt_type", "rt", rt, log_rt)
histogram(~rt|rt_type, fhch_long, breaks = "Scott", type = "density",
scale = list(x = list(relation = "free")))
```

The main factors in the experiment were the between-subjects factor `task`

(`naming`

vs. `lexdec`

), and the within-subjects factors `stimulus`

(`word`

vs. `nonword`

), `density`

(`low`

vs. `high`

), and `frequency`

(`low`

vs. `high`

). Before running an analysis it is a good idea to visually inspect the data to gather some expectations regarding the results. Should the statistical results dramatically disagree with the expectations this suggests some type of error along the way (e.g., model misspecification) or at least encourages a thorough check to make sure everything is correct. We first begin by plotting the data aggregated by-participant.

In each plot we plot the raw data in the background. To make the individual data points visible we add some `jitter`

on the x-axis and choose `pch`

and `alpha`

values such that we see where more data points are (i.e., where plot overlaps it is darker). Then we add the mean as a x in a circle. Both of this is done in the same call to `xyplot`

using a custom panel function. Finally, we combine this plot with a simple boxplot using `bwplot`

.

```
agg_p <- fhch %>% group_by(id, task, stimulus, density, frequency) %>%
summarise(mean = mean(log_rt)) %>%
ungroup()
xyplot(mean ~ density:frequency|task+stimulus, agg_p, jitter.x = TRUE, pch = 20, alpha = 0.5,
panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
tmp <- aggregate(y, by = list(x), mean)
panel.points(tmp$x, tmp$y, pch = 13, cex =1.5)
}) +
bwplot(mean ~ density:frequency|task+stimulus, agg_p, pch="|", do.out = FALSE)
```

Now we plot the same data but aggregated across items:

```
agg_i <- fhch %>% group_by(item, task, stimulus, density, frequency) %>%
summarise(mean = mean(log_rt)) %>%
ungroup()
xyplot(mean ~ density:frequency|task+stimulus, agg_i, jitter.x = TRUE, pch = 20, alpha = 0.2,
panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
tmp <- aggregate(y, by = list(x), mean)
panel.points(tmp$x, tmp$y, pch = 13, cex =1.5)
}) +
bwplot(mean ~ density:frequency|task+stimulus, agg_i, pch="|", do.out = FALSE)
```

These two plots show a very similar pattern and suggest several things:

- Responses to
`nonwords`

appear slower than responses to`words`

, at least for the`naming`

task. `lexdec`

responses appear to be slower than`naming`

responses, particularly in the`word`

condition.- In the
`nonword`

and`naming`

condition we see a clear effect of`frequency`

with slower responses to`high`

than`low`

`frequency`

words. - In the
`word`

conditions the`frequency`

pattern appears to be in the opposite direction to the pattern described in the previous point: faster responses to`low`

`frequency`

than to`high`

`frequency`

words. `density`

appears to have no effect, perhaps with the exception of the`nonword`

`lexdec`

condition.

To set up a mixed model it is important to identify which factors vary within which grouping factor generating random variability (i.e., grouping factors are sources of stochastic variability). The two grouping factors are participants (`id`

) and items (`item`

). The within-participant factors are `stimulus`

, `density`

, and `frequency`

. The within-item factor is `task`

. The ‘maximal model’ (Barr, Levy, Scheepers, and Tily, 2013) therefore is the model with by-participant random slopes for `stimulus`

, `density`

, and `frequency`

and their interactions and by-item random slopes for `task`

.

Occasionally, the maximal model does not converge successfully. In this case a good first approach for dealing with this problem is to remove the corelations among the random terms. In our example, there are two sets of correlations, one for each random effect grouping variable. Consequently, we can build four model that have the maximal structure in terms of random-slopes and only differ in which correlations among random terms are calculated:

- With all correlations.
- No correlation among by-item random effects (i.e., no correlation between random intercept and
`task`

random slope). - No correlation among by-participant random effect terms (i.e., no correlation among random slopes themselves and between the random slopes and the random intercept).
- No correlation among either random grouping factor.

The next decision to be made is which method to use for obtaining \(p\)-values. The default method is `KR`

(=Kenward-Roger) which provides the best control against anti-conservative results. However, `KR`

needs quite a lot of RAM, especially with complicated random effect structures and large data sets. As in this case we have both, relatively large data (i.e., many levels on each random effect, especially the item random effect) and a complicated random effect structure, it seems a reasonable decision to choose another method. The second ‘best’ method (in terms of controlling for Type I errors) is the ‘Satterthwaite’ approximation, `method='S'`

. It provides a similar control of Type I errors as the Kenward-Roger approximation and needs less RAM, however one downside is that it simply fails in some cases.

The following code fits the four models using the Satterthwaite method. To suppress random effects we use the `||`

notation. Note that it is necessary to set `expand_re=TRUE`

when suppressing random effects among variables that are entered as factors and not as numerical variables (as done here). Also note that `mixed`

automatically uses appropriate contrast codings if factors are included in interactions (`contr.sum`

) in contrast to the `R`

default (which is `contr.treatment`

). To make sure the estimation does not end prematurely we set the allowed number of function evaluations to a very high value (using `lmerControl`

).

```
m1s <- mixed(log_rt ~ task*stimulus*density*frequency + (stimulus*density*frequency|id)+
(task|item), fhch, method = "S",
control = lmerControl(optCtrl = list(maxfun = 1e6)))
m2s <- mixed(log_rt ~ task*stimulus*density*frequency + (stimulus*density*frequency|id)+
(task||item), fhch, method = "S",
control = lmerControl(optCtrl = list(maxfun = 1e6)), expand_re = TRUE)
m3s <- mixed(log_rt ~ task*stimulus*density*frequency + (stimulus*density*frequency||id)+
(task|item), fhch, method = "S",
control = lmerControl(optCtrl = list(maxfun = 1e6)), expand_re = TRUE)
m4s <- mixed(log_rt ~ task*stimulus*density*frequency + (stimulus*density*frequency||id)+
(task||item), fhch, method = "S",
control = lmerControl(optCtrl = list(maxfun = 1e6)), expand_re = TRUE)
```

As the estimation of these model may take some time, `afex`

inlcudes the estimated models which can be loaded with the following code. Note that when using the `print`

or `anova`

method for `mixed`

objects, the `warnings`

emitted during estimation of the model by `lmer`

will be printed again. So there is no downside of loading the already estimated models. We then inspect the four results.

```
load(system.file("extdata/", "freeman_models.rda", package = "afex"))
m1s
```

```
## Warning: lme4 reported (at least) the following warnings for 'full':
## * unable to evaluate scaled gradient
## * Model failed to converge: degenerate Hessian with 1 negative eigenvalues
```

```
## Mixed Model Anova Table (Type 3 tests, S-method)
##
## Model: log_rt ~ task * stimulus * density * frequency + (stimulus *
## Model: density * frequency | id) + (task | item)
## Data: fhch
## Effect df F p.value
## 1 task 1, NA 128.72 <NA>
## 2 stimulus 1, NA 117.02 <NA>
## 3 density 1, NA 1.20 <NA>
## 4 frequency 1, NA 1.30 <NA>
## 5 task:stimulus 1, NA 63.74 <NA>
## 6 task:density 1, NA 10.59 <NA>
## 7 stimulus:density 1, NA 0.39 <NA>
## 8 task:frequency 1, NA 55.81 <NA>
## 9 stimulus:frequency 1, NA 85.68 <NA>
## 10 density:frequency 1, NA 0.03 <NA>
## 11 task:stimulus:density 1, NA 12.87 <NA>
## 12 task:stimulus:frequency 1, NA 119.08 <NA>
## 13 task:density:frequency 1, NA 5.22 <NA>
## 14 stimulus:density:frequency 1, NA 2.45 <NA>
## 15 task:stimulus:density:frequency 1, NA 10.16 <NA>
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
```

`m2s`

```
## Warning: lme4 reported (at least) the following warnings for 'full':
## * unable to evaluate scaled gradient
## * Model failed to converge: degenerate Hessian with 1 negative eigenvalues
```

```
## Mixed Model Anova Table (Type 3 tests, S-method)
##
## Model: log_rt ~ task * stimulus * density * frequency + (stimulus *
## Model: density * frequency | id) + (task || item)
## Data: fhch
## Effect df F p.value
## 1 task 1, 43.54 13.69 *** .0006
## 2 stimulus 1, 51.06 150.61 *** <.0001
## 3 density 1, 192.25 0.31 .58
## 4 frequency 1, 72.78 0.52 .47
## 5 task:stimulus 1, 52.03 71.20 *** <.0001
## 6 task:density 1, 201.56 15.92 *** <.0001
## 7 stimulus:density 1, 287.88 1.06 .30
## 8 task:frequency 1, 76.76 80.05 *** <.0001
## 9 stimulus:frequency 1, 177.48 55.45 *** <.0001
## 10 density:frequency 1, 235.01 0.12 .73
## 11 task:stimulus:density 1, 300.16 14.21 *** .0002
## 12 task:stimulus:frequency 1, 190.61 109.33 *** <.0001
## 13 task:density:frequency 1, 248.09 5.46 * .02
## 14 stimulus:density:frequency 1, 104.15 3.72 + .06
## 15 task:stimulus:density:frequency 1, 111.32 10.07 ** .002
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
```

`m3s`

```
## Mixed Model Anova Table (Type 3 tests, S-method)
##
## Model: log_rt ~ task * stimulus * density * frequency + (stimulus *
## Model: density * frequency || id) + (task | item)
## Data: fhch
## Effect df F p.value
## 1 task 1, 43.52 13.68 *** .0006
## 2 stimulus 1, 50.57 151.33 *** <.0001
## 3 density 1, 584.49 0.36 .55
## 4 frequency 1, 70.26 0.56 .46
## 5 task:stimulus 1, 51.50 71.29 *** <.0001
## 6 task:density 1, 578.65 17.89 *** <.0001
## 7 stimulus:density 1, 584.50 1.19 .28
## 8 task:frequency 1, 74.11 82.66 *** <.0001
## 9 stimulus:frequency 1, 584.68 63.34 *** <.0001
## 10 density:frequency 1, 584.54 0.11 .74
## 11 task:stimulus:density 1, 578.66 14.86 *** .0001
## 12 task:stimulus:frequency 1, 578.82 124.10 *** <.0001
## 13 task:density:frequency 1, 578.69 5.92 * .02
## 14 stimulus:density:frequency 1, 88.40 4.16 * .04
## 15 task:stimulus:density:frequency 1, 94.42 10.60 ** .002
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
```

`m4s`

```
## Mixed Model Anova Table (Type 3 tests, S-method)
##
## Model: log_rt ~ task * stimulus * density * frequency + (stimulus *
## Model: density * frequency || id) + (task || item)
## Data: fhch
## Effect df F p.value
## 1 task 1, 43.54 13.67 *** .0006
## 2 stimulus 1, 51.05 150.79 *** <.0001
## 3 density 1, 587.35 0.35 .55
## 4 frequency 1, 71.90 0.53 .47
## 5 task:stimulus 1, 52.02 71.50 *** <.0001
## 6 task:density 1, 582.30 17.50 *** <.0001
## 7 stimulus:density 1, 587.35 1.13 .29
## 8 task:frequency 1, 75.90 81.49 *** <.0001
## 9 stimulus:frequency 1, 587.51 62.27 *** <.0001
## 10 density:frequency 1, 587.39 0.11 .74
## 11 task:stimulus:density 1, 582.31 14.61 *** .0001
## 12 task:stimulus:frequency 1, 582.45 121.11 *** <.0001
## 13 task:density:frequency 1, 582.34 5.84 * .02
## 14 stimulus:density:frequency 1, 90.80 3.90 + .05
## 15 task:stimulus:density:frequency 1, 97.08 10.52 ** .002
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
```

Before looking at the results we can see that for models 1 and 2, `lmer`

emmited a warning that the model failed to converge. These warnings do not necessarily mean that the results cannot be used. As we will see below, model 2 (`m2s`

) produces essentially the same results as models 3 and 4 suggesting that this warning is indeed a false positive. However, the results also show that estimating the Satterthwaite approximation failed for `m1s`

, we have no denominator degrees of freedom and no \(p\)-values. If this happens, we can only try another method or a reduced model.

Models 2 to 4 produce results and the results are extremely similar across models. A total of 9 or 10 effects reached significance. We found main effects for `task`

and `stimulus`

, two-way interactions of `task:stimulus`

, `task:density`

, `task:frequency`

, and `stimulus:frequency`

, three-way interactions of `task:stimulus:density`

, `task:stimulus:frequency`

, and `task:density:frequency`

, a marginal three-way interaction (for two of three models) of `stimulus:density:frequency`

, and the four-way interaction of `task:stimulus:density:frequency`

. Additionally, all \(F\) and \(p\) values are very similar to each other across the three models.

The only difference in terms of significant versus non-significant effects between the three models is the three-way interaction of `stimulus:density:frequency`

which is only significant for model 3 with \(F(1, 88.40) = 4.16\), \(p = .04\), and only reaches marginal significance for the other two models with \(p > .05\) and a very similar \(F\)-value.

It is instructive to compare those results with results obtained using the comparatively ‘worst’ method for obtaining \(p\)-value simplmeneted in `afex`

, likelihood ratio tests. Likelihood ratio-tests should in principle deliver reasonable results for large data sets such as the current one, so we should expect not too many deviations. We again fit all four models, this time using `method='LRT'`

.

```
m1lrt <- mixed(log_rt ~ task*stimulus*density*frequency + (stimulus*density*frequency|id)+
(task|item), fhch, method = "LRT",
control = lmerControl(optCtrl = list(maxfun = 1e6)))
m2lrt <- mixed(log_rt ~ task*stimulus*density*frequency + (stimulus*density*frequency|id)+
(task||item), fhch, method = "LRT",
control = lmerControl(optCtrl = list(maxfun = 1e6)), expand_re = TRUE)
m3lrt <- mixed(log_rt ~ task*stimulus*density*frequency + (stimulus*density*frequency||id)+
(task|item), fhch, method = "LRT",
control = lmerControl(optCtrl = list(maxfun = 1e6)), expand_re = TRUE)
m4lrt <- mixed(log_rt ~ task*stimulus*density*frequency + (stimulus*density*frequency||id)+
(task||item), fhch, method = "LRT",
control = lmerControl(optCtrl = list(maxfun = 1e6)), expand_re = TRUE)
```

Because the resulting `mixed`

objects are of considerable size, we do not include the full objects, but only the resulting ANOVA tables and `data.frames`

(`nice_lrt`

is a list containing the result from calling `nice`

on the objects, `anova_lrt`

contains the result from calling `anova`

).

Before considering the results, we again first consider the warnings emitted when fitting the models. Because methods `'LRT'`

and `'PB'`

fit one `full_model`

and one `restricted_model`

for each effect (i.e., term), there can be more warnings than for methods `'KR'`

and `'S'`

which only fit one model (the `full_model`

). And this is exactly what happens. For `m1lrt`

there are 11 convergence warnings, almost one per fitted model. However, none of those immediately invalidates the results. This is different for models 2 and 3 for both of which warnings indicate that `nested model(s) provide better fit than full model`

. What this warning means is that the `full_model`

does not provide a better fit than at least one of the `restricted_model`

, which is mathematically impossible as the `restricted_models`

are nested within the full model (i.e., they result from setting one or several parameters equal to 0, so the `full_model`

can always provide an at least as good account as the `restricted_models`

). Model 4 finally shows no warnings.

The following code produces a single output comparing models 1 and 4 next to each other. The results show basically the same pattern as obtained with the Satterthwaite approximation. Even the \(p\)-values are extremely similar to the \(p\)-values of the Satterthwaite models. The only ‘difference’ is that the `stimulus:density:frequency`

three-way interaction is significant in each case now, although only barely so.

```
res_lrt <- cbind(nice_lrt[[1]], " " = " ",
nice_lrt[[4]][,-(1:2)])
colnames(res_lrt)[c(3,4,6,7)] <- paste0(
rep(c("m1_", "m4_"), each =2), colnames(res_lrt)[c(3,4)])
res_lrt
```

```
## Effect df m1_Chisq m1_p.value m4_Chisq m4_p.value
## 1 task 1 12.18 *** .0005 12.43 *** .0004
## 2 stimulus 1 70.00 *** <.0001 70.03 *** <.0001
## 3 density 1 0.01 .91 0.35 .55
## 4 frequency 1 0.57 .45 0.54 .46
## 5 task:stimulus 1 45.06 *** <.0001 45.68 *** <.0001
## 6 task:density 1 15.50 *** <.0001 17.43 *** <.0001
## 7 stimulus:density 1 0.82 .36 1.14 .29
## 8 task:frequency 1 52.87 *** <.0001 53.51 *** <.0001
## 9 stimulus:frequency 1 45.44 *** <.0001 50.45 *** <.0001
## 10 density:frequency 1 0.11 .73 0.12 .73
## 11 task:stimulus:density 1 14.15 *** .0002 14.59 *** .0001
## 12 task:stimulus:frequency 1 73.40 *** <.0001 77.83 *** <.0001
## 13 task:density:frequency 1 5.59 * .02 5.88 * .02
## 14 stimulus:density:frequency 1 4.00 * .05 3.92 * .05
## 15 task:stimulus:density:frequency 1 9.91 ** .002 10.24 ** .001
```

We can also compare this with the results from model 3. Although the `full_model`

cannot be the maximum-likelihood estimate (as it provides a worse than the `density:frequency`

model), the difference seems to be minimal as it also shows exactly the same pattern as the other models.

`nice_lrt[[2]]`

```
## Mixed Model Anova Table (Type 3 tests, LRT-method)
##
## Model: log_rt ~ task * stimulus * density * frequency + (stimulus *
## Model: density * frequency | id) + (task || item)
## Data: fhch
## Df full model: 55
## Effect df Chisq p.value
## 1 task 1 12.15 *** .0005
## 2 stimulus 1 70.00 *** <.0001
## 3 density 1 0.32 .57
## 4 frequency 1 0.24 .63
## 5 task:stimulus 1 45.55 *** <.0001
## 6 task:density 1 15.27 *** <.0001
## 7 stimulus:density 1 0.78 .38
## 8 task:frequency 1 52.71 *** <.0001
## 9 stimulus:frequency 1 45.54 *** <.0001
## 10 density:frequency 1 0.00 >.99
## 11 task:stimulus:density 1 14.33 *** .0002
## 12 task:stimulus:frequency 1 72.79 *** <.0001
## 13 task:density:frequency 1 5.28 * .02
## 14 stimulus:density:frequency 1 3.45 + .06
## 15 task:stimulus:density:frequency 1 9.57 ** .002
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
```

Fortunately, the results from all models that actually produced results and converged without a critical warning (e.g., one critical warning is that a `restricted_model`

provides a better fit than the `full_model`

) agreed very strongly providing a high degree of confidence in the results. This might not be too surprising given the comparatively large number of total data points and the fact that each random effect grouping factor has a considerable number of levels (way above 20 for both participants and items). This also suggests that the convergence warnings are likely false positives; the models seem to have converged successfully to the maximum likelihood estimate, or at least to a value very near the maximum likelihood estimate. How further reducing the random effects structure (e.g., removing the random slopes for the highest interaction) affects the results is left as an exercise for the reader.

In terms of the significant findings, there are many that seem to be in line with the descriptive results described above. For example, the highly significant effect of `task:stimulus:frequency`

with \(F(1, 190.61) = 109.33\), \(p < .0001\) (values from `m2s`

), appears to be in line with the observation that the frequency effect appears to change its sign depending on the `task:stimulus`

cell (with `nonword`

and `naming`

showing the opposite patterns than the other three conditions). Consequently, we start by investigating this interaction further below.

Before investigating the significant interaction in detail it is a good idea to remind oneself what a significant interaction represents on a conceptual level; that one or multiple of the variables in the interaction moderate (i.e., affect) the effect of the other variable or variables. Consequently, there are several ways to investigate a significant interaction. Each of the involved variables can be seen as the moderating variables and each of the variables can be seen as the effect of interest. Which one of those possible interpretations is of interest in a given situation highly depends on the actual data and research question and multiple views can be ‘correct’ in a given situation.

In addition to this conceptual issue, there are also multiple technical ways to investigate a significant interaction. One approach not followed here is to split the data according to the moderating variables and compute the statistical model again for the splitted data sets with the effect variable(s) as remaining fixed effect. This approach, also called *simple effects* analysis, is, for example, recommended by Maxwell and Delaney (2004) as it does not assume variance homogeneity and is faithful to the data at each level. The approach taken here is to simply perform the test on the fitted full model. This approach assumes variance homogeneity (i.e., that the variances in all groups are homogeneous) and has the added benefit that it is computationally relatively simple. In addition, it can all be achieved using the framework provided by `emmeans`

(Lenth, 2017).

Our interest in the beginning is on the effect of `frequency`

by `task:stimulus`

combination. So let us first look at the estimated marginal means os this effect. In `emmeans`

parlance these estimated means are called ‘least-square means’ because of historical reasons, but because of the lack of least-square estimation in mixed models we prefer the term estimated marginal means, or EMMs for short. Those can be obtained in the following way. To prevent `emmeans`

from calculating the *df* for the EMMs (which can be quite costly), we use asymptotic *df*s (i.e., \(z\) values and tests). `emmeans`

requires to first specify the variable(s) one wants to treat as the effect variable(s) (here `frequency`

) and then allows to specify condition variables.

```
emm_options(lmer.df = "asymptotic") # also possible: 'satterthwaite', 'kenward-roger'
emm_i1 <- emmeans(m2s, "frequency", by = c("stimulus", "task"))
```

`## NOTE: Results may be misleading due to involvement in interactions`

`emm_i1`

```
## stimulus = word, task = naming:
## frequency emmean SE df asymp.LCL asymp.UCL
## low -0.323260819 0.04154237 Inf -0.40468237 -0.24183927
## high -0.381928410 0.04571233 Inf -0.47152293 -0.29233389
##
## stimulus = nonword, task = naming:
## frequency emmean SE df asymp.LCL asymp.UCL
## low -0.143143471 0.04584628 Inf -0.23300052 -0.05328642
## high 0.063627305 0.04965563 Inf -0.03369594 0.16095054
##
## stimulus = word, task = lexdec:
## frequency emmean SE df asymp.LCL asymp.UCL
## low 0.023231980 0.03730914 Inf -0.04989260 0.09635656
## high -0.039959811 0.04099368 Inf -0.12030595 0.04038633
##
## stimulus = nonword, task = lexdec:
## frequency emmean SE df asymp.LCL asymp.UCL
## low 0.104057173 0.04115486 Inf 0.02339514 0.18471921
## high -0.006455104 0.04451011 Inf -0.09369331 0.08078310
##
## Results are averaged over the levels of: density
## Degrees-of-freedom method: asymptotic
## Confidence level used: 0.95
```

The returned values are in line with our observation that the `nonword`

and `naming`

condition diverges from the other three. But is there actual evidence that the effect flips? We can test this using additional `emmeans`

functionality. Specifically, we first use the `pairs`

function which provides us with a pairwise test of the effect of `frequency`

in each `task:stimulus`

combination. Then we need to combine the four tests within one object to obtain a familywise error rate correction which we do via `update(..., by = NULL)`

(i.e., we revert the effect of the `by`

statement from the earlier `emmeans`

call) and finally we select the `holm`

method for controlling for family wise error rate (the Holm method is uniformly more powerful than the Bonferroni).

`update(pairs(emm_i1), by = NULL, adjust = "holm")`

```
## contrast stimulus task estimate SE df z.ratio p.value
## low - high word naming 0.05866759 0.01620454 Inf 3.620 0.0003
## low - high nonword naming -0.20677078 0.01766787 Inf -11.703 <.0001
## low - high word lexdec 0.06319179 0.01517331 Inf 4.165 0.0001
## low - high nonword lexdec 0.11051228 0.01644788 Inf 6.719 <.0001
##
## Results are averaged over the levels of: density
## P value adjustment: holm method for 4 tests
```

We could also use a slightly more powerful method than the Holm method, method `free`

from package `multcomp`

, which takes the correlation of the model parameters into account (note that due a bug in the current emmenas version this is currently deactivated):

`summary(as.glht(update(pairs(emm_i1), by = NULL)), test = adjusted("free"))`

We see that the results are exactly as expected. In the `nonword`

and `naming`

condition we have a clear negative effect of frequency while in the other three conditions it is clearly positive. We could now also use the EMMs and retransform them onto the response scale (i.e., RTs) which we could use for plotting. Note that the \(p\)-values in this ouput are for the \(z\) test of whether or not a value is significantly above 0 on the `log_rt`

-scale (i.e., above 1 second on the response scale). That seems not the most interesting test, but the output is interesting because of the EMMs and standard errors that could be used for printing.

```
emm_i1b <- summary(contrast(emm_i1, by = NULL))
emm_i1b[,c("estimate", "SE")] <- exp(emm_i1b[,c("estimate", "SE")])
emm_i1b
```

```
## contrast estimate SE df z.ratio p.value
## low,word,naming effect 0.7903480 1.029560 Inf -8.076 <.0001
## high,word,naming effect 0.7453141 1.033128 Inf -9.019 <.0001
## low,nonword,naming effect 0.9463294 1.033081 Inf -1.695 0.1029
## high,nonword,naming effect 1.1637019 1.035842 Inf 4.305 <.0001
## low,word,lexdec effect 1.1176306 1.029734 Inf 3.796 0.0002
## high,word,lexdec effect 1.0491907 1.032572 Inf 1.498 0.1341
## low,nonword,lexdec effect 1.2117142 1.032575 Inf 5.991 <.0001
## high,nonword,lexdec effect 1.0849390 1.034791 Inf 2.384 0.0228
##
## Results are averaged over the levels of: density
## P value adjustment: fdr method for 8 tests
```

As the last example, let us take a look at the significant four-way interaction of `task:stimulus:density:frequency`

, \(F(1, 111.32) = 10.07\), \(p = .002\). Here we might be interested in a slightly more difficult question namely whether the `density:frequency`

interaction varies across `task:stimulus`

conditions. If we again look at the figures above, it appears that there is a difference between `low:low`

and `high:low`

in the `nonword`

and `lexdec`

condition, but not in the other conditions. We again first begin by obtaining the EMMs. However, the actual values are not interesting at the moment, we are basically only interested in the interaction for each `task:stimulus`

condition. Therefore, we use the EMMs to create two consecutive contrasts, the first one for `density`

and then for `frequency`

using the fist contrast. Then we run a joint test conditional on the `task:stimulus`

conditions.

```
emm_i2 <- emmeans(m2s, c("density", "frequency"), by = c("stimulus", "task"))
con1 <- contrast(emm_i2, "trt.vs.ctrl1", by = c("frequency", "stimulus", "task")) # density
con2 <- contrast(con1, "trt.vs.ctrl1", by = c("contrast", "stimulus", "task"))
test(con2, joint = TRUE, by = c("stimulus", "task"))
```

```
## stimulus task df1 df2 F.ratio p.value
## word naming 1 Inf 0.105 0.7464
## nonword naming 1 Inf 2.537 0.1112
## word lexdec 1 Inf 1.790 0.1809
## nonword lexdec 1 Inf 16.198 0.0001
```

This test indeed shows that the `density:frequency`

interaction is only significant in the `nonword`

and `lexdec`

condition. Next, let’s see if we can unpack this interaction in a meaningful manner. For this we compare `low:low`

and `high:low`

in each of the four groups. And just for the sake of making the example more complex, we also compare `low:high`

and `high:high`

. This can simply be done by specifying a list of custom contrasts on the EMMs (or reference grid in `emmeans`

parlance) which can be passed again to the `contrast`

function. The contrasts are a `list`

where each element should sum to one (i.e., be a proper contrast) and be of length equal to the number of EMMs (although we have 16 EMMs in total, we only need to specify it for a length of four due to conditiong by `c("stimulus", "task")`

). To control for the family wise error rate across all tests, we again use `update(..., by = NULL)`

on the result this time again specifying `by = NULL`

to revert the effect of conditiong. Note that although we entered the variables into `emmeans`

in the same order as into our plot call above, the order of the four EMMs differs.

`emm_i2`

```
## stimulus = word, task = naming:
## density frequency emmean SE df asymp.LCL asymp.UCL
## low low -0.313843596 0.04478638 Inf -0.40162328 -0.22606391
## high low -0.332678043 0.04081261 Inf -0.41266929 -0.25268679
## low high -0.377406551 0.04662971 Inf -0.46879910 -0.28601400
## high high -0.386450269 0.04721965 Inf -0.47899907 -0.29390146
##
## stimulus = nonword, task = naming:
## density frequency emmean SE df asymp.LCL asymp.UCL
## low low -0.103989924 0.04996611 Inf -0.20192171 -0.00605814
## high low -0.182297019 0.04415685 Inf -0.26884286 -0.09575118
## low high 0.078231630 0.05201926 Inf -0.02372424 0.18018750
## high high 0.049022979 0.04947457 Inf -0.04794540 0.14599136
##
## stimulus = word, task = lexdec:
## density frequency emmean SE df asymp.LCL asymp.UCL
## low low 0.037141777 0.04035764 Inf -0.04195775 0.11624130
## high low 0.009322184 0.03679361 Inf -0.06279198 0.08143634
## low high -0.045127696 0.04192675 Inf -0.12730261 0.03704722
## high high -0.034791927 0.04243206 Inf -0.11795723 0.04837338
##
## stimulus = nonword, task = lexdec:
## density frequency emmean SE df asymp.LCL asymp.UCL
## low low 0.044799574 0.04490982 Inf -0.04322205 0.13282120
## high low 0.163314772 0.03986203 Inf 0.08518664 0.24144291
## low high -0.007277516 0.04669875 Inf -0.09880539 0.08425036
## high high -0.005632691 0.04446001 Inf -0.09277271 0.08150732
##
## Degrees-of-freedom method: asymptotic
## Confidence level used: 0.95
```

```
# desired contrats:
des_c <- list(
ll_hl = c(1, -1, 0, 0),
lh_hh = c(0, 0, 1, -1)
)
update(contrast(emm_i2, des_c), by = NULL, adjust = "holm")
```

```
## contrast stimulus task estimate SE df z.ratio p.value
## ll_hl word naming 0.018834447 0.02097390 Inf 0.898 1.0000
## lh_hh word naming 0.009043717 0.02120335 Inf 0.427 1.0000
## ll_hl nonword naming 0.078307095 0.02203079 Inf 3.554 0.0027
## lh_hh nonword naming 0.029208650 0.02108904 Inf 1.385 0.9763
## ll_hl word lexdec 0.027819593 0.01992811 Inf 1.396 0.9763
## lh_hh word lexdec -0.010335769 0.01986794 Inf -0.520 1.0000
## ll_hl nonword lexdec -0.118515198 0.02090111 Inf -5.670 <.0001
## lh_hh nonword lexdec -0.001644825 0.01975685 Inf -0.083 1.0000
##
## P value adjustment: holm method for 8 tests
```

In contrast to our expectation, the results show two significant effects and not only one. In line with our expectations, in the `nonword`

and `lexdec`

condition the EMM of `low:high`

is smaller than the EMM for `high:high`

, \(z = -6.30\), \(p < .0001\). However, in the `nonword`

and `naming`

condition we found the opposite pattern; the EMM of `low:high`

is larger than the EMM for `high:high`

, \(z = 3.65\), \(p = .002\). For all other effects \(|z| < 1.3\), \(p > .99\). In addition, there is no difference between `low:high`

and `high:high`

in any condition.

- Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal.
*Journal of Memory and Language*, 68(3), 255-278. https://doi.org/10.1016/j.jml.2012.11.001 - Bretz, F., Hothorn, T., & Westfall, P. H. (2011).
*Multiple comparisons using R*. Boca Raton, FL: CRC Press. https://CRAN.R-project.org/package=multcomp - Freeman, E., Heathcote, A., Chalmers, K., & Hockley, W. (2010). Item effects in recognition memory for words.
*Journal of Memory and Language*, 62(1), 1-18. https://doi.org/10.1016/j.jml.2009.09.004 - Lenth, R. (2017).
*emmeans: Estimated Marginal Means, aka Least-Squares Means*. R package version 0.9.1. https://CRAN.R-project.org/package=emmeans - Maxwell, S. E., & Delaney, H. D. (2004).
*Designing experiments and analyzing data: a model-comparisons perspective*. Mahwah, N.J.: Lawrence Erlbaum Associates.