# Using the nauf package

## The nauf package

It is often the case that a factor only makes sense in a subset of a dataset (i.e. for some observations a factor may simply not be meaningful or contrastive), or that with observational datasets there are no observations in some levels of an interaction term. There are also cases where a random effects grouping factor is only applicable in a subset the data, and it is desireable to model the noise introduced by the repeated measures on the group members within the subset of the data where the repeated measures exist. The nauf package allows unordered factors and random effects grouping factors to be coded as NA in the subsets of the data where they are not applicable or otherwise not contrastive. Sum contrasts are used for all unordered factors (using named_contr_sum in the standardize package), and then NA values are set to 0. This allows all of the data to be modeled together without creating collinearity or making the output difficult to interpret. It is highly recommended that regression variables be put on the same scale with the standardize function in the standardize package prior to using nauf functions (though this is not required for the functions to work). The “Using the standardize package” vignette also provides useful information on the difference between unordered and ordered factors.

## NA values in random effects grouping factors

The plosives dataset (Eager, 2017) included in the nauf package contains measures of plosive strength for instances of intervocalic Spanish /p/, /t/, /k/, /b/, /d/ and /g/ from speakers from three dialects. For this first example we will focus on the voiceless plosives /ptk/ from the 30 speakers in the Cuzco dialect. The dataset contains a combination of experimental data (with exactly repeated measures on items from a read speech task) and observational data (without exactly repeated measures; from interviews with the same speakers). For the spontaneous speech (as indicated by the logical variable spont) the item factor is coded as NA.

library(nauf)

summary(plosives)
#>       cdur            vdur             vpct           intdiff
#>  Min.   :  0.0   Min.   :  0.00   Min.   :0.0000   Min.   : 0.000
#>  1st Qu.: 93.0   1st Qu.:  0.00   1st Qu.:0.0000   1st Qu.: 5.951
#>  Median :127.0   Median :  0.00   Median :0.0000   Median :22.452
#>  Mean   :123.9   Mean   : 39.21   Mean   :0.2437   Mean   :22.201
#>  3rd Qu.:161.0   3rd Qu.: 81.00   3rd Qu.:0.5213   3rd Qu.:36.277
#>  Max.   :298.0   Max.   :212.00   Max.   :0.9079   Max.   :73.952
#>
#>      intvel            voicing          place             stress
#>  Min.   :0.0000   Voiced   :2694   Bilabial:1752   Post-Tonic:1629
#>  1st Qu.:0.1974   Voiceless:2587   Dental  :1862   Tonic     :2007
#>  Median :0.7440                    Velar   :1667   Unstressed:1645
#>  Mean   :0.8341
#>  3rd Qu.:1.2870
#>  Max.   :5.5594
#>
#>  prevowel posvowel    wordpos        wordfreq        speechrate
#>  a:1675   a:1942   Initial:1166   Min.   :     1   Min.   : 1.000
#>  e:1237   e: 900   Medial :4115   1st Qu.:  1057   1st Qu.: 4.115
#>  i:1280   i: 782                  Median :  4579   Median : 5.000
#>  o: 708   o:1217                  Mean   : 20189   Mean   : 5.236
#>  u: 381   u: 440                  3rd Qu.: 20368   3rd Qu.: 6.000
#>                                   Max.   :247340   Max.   :10.000
#>
#>    spont               dialect         sex            age
#>  Mode :logical   Cuzco     :2872   Female:2599   Older  :1362
#>  FALSE:2019      Lima      : 862   Male  :2682   Younger:3919
#>  NA's :0
#>
#>
#>
#>           ed                ling         speaker          item
#>  Secondary :1445   Bilingual  :1379   s34    : 133   i01    :  38
#>  University:3836   Monolingual:3902   s09    : 124   i02    :  38
#>                                       s32    : 124   i03    :  38
#>                                       s22    : 120   i04    :  38
#>                                       s38    : 116   i05    :  38
#>                                       s25    : 114   (Other):1829
#>                                       (Other):4550   NA's   :3262

dat <- droplevels(subset(plosives, dialect == "Cuzco" & voicing == "Voiceless"))

xtabs(~ is.na(item) + spont, dat)
#>            spont
#> is.na(item) FALSE TRUE
#>       FALSE   795    0
#>       TRUE      0  620

For our example, we want to model the voiceless duration of the plosives as a function of place of articulation, stress, and whether or not the speech was spontaneous. When modeling the read speech data (spont = FALSE), we want to account for noise introduced by the repeated measures on item, but we can’t apply this random effects structure to the interview data (spont = TRUE). In addition to this, we want to account for the noise introduced by the individual speakers. Rather than leaving out the item effects to analyze all the data together, or keeping the item effects and analyzing the read and spontaneous speech separately, we can model both subsets together and have the item effects apply only within the read speech data using nauf. We just need to have item coded as NA when it is not applicable, which it already is. Then the nauf_lmer function takes care of the rest.

sobj <- standardize(vdur ~ spont + place + stress +
(1 + spont + place + stress | speaker) + (1 | item), dat)

mod <- nauf_lmer(sobj$formula, sobj$data)

summary(mod)
#> Linear mixed model fit by REML ['nauf.lmerMod']
#> Formula: vdur ~ spont + place + stress + (1 + spont + place + stress |
#>     speaker) + (1 | item)
#>    Data: sobj$data #> #> REML criterion at convergence: 3543 #> #> Scaled residuals: #> Min 1Q Median 3Q Max #> -5.1474 -0.5000 0.0235 0.5397 4.6395 #> #> Random effects: #> Groups Name Variance Std.Dev. Corr #> speaker (Intercept) 0.215923 0.46468 #> spontTRUE 0.002052 0.04530 -0.09 #> placeBilabial 0.016284 0.12761 -0.44 0.49 #> placeDental 0.008709 0.09332 0.01 -0.93 -0.14 #> stressPost-Tonic 0.024526 0.15661 0.35 -0.92 -0.78 0.73 #> stressTonic 0.020310 0.14251 -0.41 0.51 1.00 -0.16 -0.79 #> item (Intercept) 0.109717 0.33123 #> Residual 0.621677 0.78846 #> Number of obs: 1415, groups: speaker, 30; item, 27 #> #> Fixed effects: #> Estimate Std. Error t value #> (Intercept) -0.011606 0.093282 -0.124 #> spontTRUE -0.182422 0.039642 -4.602 #> placeBilabial -0.014944 0.048795 -0.306 #> placeDental 0.001387 0.044381 0.031 #> stressPost-Tonic 0.098136 0.054977 1.785 #> stressTonic 0.143076 0.047996 2.981 #> #> Correlation of Fixed Effects: #> (Intr) spTRUE plcBlb plcDnt strP-T #> spontTRUE -0.266 #> placeBilabl -0.176 0.080 #> placeDental -0.008 -0.102 -0.450 #> strssPst-Tn 0.206 -0.005 -0.162 0.058 #> stressTonic -0.235 -0.021 0.292 -0.021 -0.613 This way, we are making use of all of the information in the dataset. We can obtain a more principled statistical test for the spont factor, and also get better information about the other fixed effects and the individual speakers, since the same random effects for speaker apply in both the read speech and spontaneous subsets. We can obtain Type III tests using anova (here with method = “S” to indicate Satterthwaite approximation for the denominator degrees of freedom in the F tests). anova(mod, method = "S") #> Mixed Model Anova Table (Type III tests, S-method) #> #> Model: vdur ~ spont + place + stress + (1 + spont + place + stress | #> Model: speaker) + (1 | item) #> Data:$
#> Data: sobj
#> Data: data
#>        num Df den Df      F    Pr(>F)
#> spont       1 36.900 21.176 4.817e-05 ***
#> place       2 80.249  0.054    0.9475
#> stress      2 74.356 14.893 3.635e-06 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that stress is significant, and can then get predicted marginal means (often called least-squares means) and pairwise comparisons for its levels using nauf_ref.grid to create a reference grid, and then calling nauf_pmmeans.

rg <- nauf_ref.grid(mod)
nauf_pmmeans(rg, "stress", pairwise = TRUE)
#>
#> Predicted marginal means for 'stress'
#>
#> Factors averaged over: 'spont' 'place'
#>
#> $pmmeans #> stress pmmean SE df lower.CL upper.CL #> Post-Tonic 0.08653011 0.11764046 42.24 -0.15083790 0.32389813 #> Tonic 0.13146972 0.09434706 41.10 -0.05905417 0.32199361 #> Unstressed -0.25281748 0.10383491 41.03 -0.46251171 -0.04312325 #> #> Confidence level used: 0.95 #> #>$contrasts
#>  contrast                   estimate         SE    df t.ratio p.value
#>  Post-Tonic - Tonic      -0.04493961 0.09252431 59.26  -0.486  0.8783
#>  Post-Tonic - Unstressed  0.33934759 0.08901834 87.70   3.812  0.0007
#>  Tonic - Unstressed       0.38428720 0.07594482 82.73   5.060  <.0001
#>
#> P value adjustment: tukey method for comparing a family of 3 estimates

## NA values in the fixed effects

The fricatives dataset (Hualde and Prieto, 2014) included in the nauf package contains measures of duration and voicing for intervocalic alveolar fricatives in Spanish and Catalan. Spanish has only one such fricative, /s/ (underlyingly voiceless), which can occur in any position in the word (initial, medial, or final). In Catalan, the situation is much more complicated. At the beginning of a word and in the middle of a word, both /s/ (underlyingly voiceless) and /z/ (underlyingly voiced) can occur (though word-initial /z/ is rare, and does not occur in the dataset). In word-final position, there is no contrast between /s/ and /z/ (labeled /S/, underlyingly neutralized), with the voicing of the fricative determined by the following sound. Because all of the fricatives in the dataset are intervocalic, /S/ ought to be like /z/ (according to traditional descriptions of Catalan), but may be shorter and possibly less voiced. That is, in the dataset, we have the following set of unique values.

summary(fricatives)
#>       dur              pvoi             lang        wordpos
#>  Min.   : 29.71   Min.   :0.0625   Catalan:974   Final  :296
#>  1st Qu.: 71.14   1st Qu.:0.5263   Spanish:648   Initial:397
#>  Median : 86.29   Median :0.6471                 Medial :929
#>  Mean   : 90.35   Mean   :0.6937
#>  3rd Qu.:105.53   3rd Qu.:0.9231
#>  Max.   :250.78   Max.   :1.0000
#>
#>           uvoi         speaker
#>  Neutralized: 160   s12    : 142
#>  Voiced     : 327   s09    : 141
#>  Voiceless  :1135   s10    : 102
#>                     s20    :  71
#>                     s36    :  70
#>                     s19    :  64
#>                     (Other):1032

dat <- fricatives

u <- unique(dat[, c("lang", "wordpos", "uvoi")])
u <- u[order(u$lang, u$wordpos, u$uvoi), ] rownames(u) <- NULL u #> lang wordpos uvoi #> 1 Catalan Final Neutralized #> 2 Catalan Initial Voiceless #> 3 Catalan Medial Voiced #> 4 Catalan Medial Voiceless #> 5 Spanish Final Voiceless #> 6 Spanish Initial Voiceless #> 7 Spanish Medial Voiceless This raises a problem for the regression analysis, since underlying voicing uvoi is only contrastive within a subset of the data (specifically, when lang = “Catalan” and wordpos = “Medial”), and everywhere else is uniquely determined by lang and wordpos. Ideally, we would like to be able to have uvoi apply only where it is contrastive, and to be able to include random slopes for uvoi for the Catalan speakers, but not the Spanish speakers. Using traditional methods won’t help us here. If we were to take the full interaction of the three factors as a new factor with 7 levels, we can’t include slopes. If we run separate regressions on different subsets of the data, then we will have to limit the comparisons we can make, and won’t be making use of all of the information the data is providing us. Note also that this issue has nothing to do with the way the data were collected; the nature of the languages and the research questions creates these imbalances. With nauf, we can solve this problem by coding uvoi as NA in the subsets of the data where it is not contrastive. That is we can code the unique possible combinations as: u$uvoi[!(u$lang == "Catalan" & u$wordpos == "Medial")] <- NA
u
#>      lang wordpos      uvoi
#> 1 Catalan   Final      <NA>
#> 2 Catalan Initial      <NA>
#> 3 Catalan  Medial    Voiced
#> 4 Catalan  Medial Voiceless
#> 5 Spanish   Final      <NA>
#> 6 Spanish Initial      <NA>
#> 7 Spanish  Medial      <NA>

dat$uvoi[!(dat$lang == "Catalan" & dat$wordpos == "Medial")] <- NA In this way, we are recognizing that word-final Catalan fricatives are limited to one value for uvoi, as are all Spanish fricatives, etc. The meaning of NA in this table is not always the same, so we have to keep track of it, and we will definitely want to include an interaction between lang and wordpos since, for example, “Catalan:Final:NA” means neutralized /S/ while “Spanish:Final:NA” means voiceless /s/. As for being able to include uvoi slopes for Catalan speakers and not Spanish speakers, we can create two new random effects grouping factors, one for each language, where the factor is the speaker identifier or NA based on the language the observation comes from: dat$ca_speaker <- dat$sp_speaker <- dat$speaker
dat$ca_speaker[dat$lang != "Catalan"] <- NA
dat$sp_speaker[dat$lang != "Spanish"] <- NA

With these NA values, we can then fit a model with nauf_lmer to predict the duration of the fricatives based on lang, wordpos, and uvoi:

sobj <- standardize(dur ~ lang * wordpos + uvoi +
(1 + wordpos + uvoi | ca_speaker) + (1 + wordpos | sp_speaker), dat)

mod <- nauf_lmer(sobj$formula, sobj$data)

summary(mod)
#> Linear mixed model fit by REML ['nauf.lmerMod']
#> Formula:
#> dur ~ lang * wordpos + uvoi + (1 + wordpos + uvoi | ca_speaker) +
#>     (1 + wordpos | sp_speaker)
#>    Data: sobj$data #> #> REML criterion at convergence: 4003.3 #> #> Scaled residuals: #> Min 1Q Median 3Q Max #> -3.6998 -0.6383 -0.0850 0.5507 6.6149 #> #> Random effects: #> Groups Name Variance Std.Dev. Corr #> ca_speaker (Intercept) 0.057124 0.2390 #> wordposFinal 0.076718 0.2770 -0.75 #> wordposInitial 0.040187 0.2005 0.59 -0.98 #> uvoiVoiced 0.013167 0.1147 -1.00 0.69 -0.52 #> sp_speaker (Intercept) 0.068574 0.2619 #> wordposFinal 0.004475 0.0669 0.87 #> wordposInitial 0.021641 0.1471 -0.72 -0.97 #> Residual 0.645343 0.8033 #> Number of obs: 1622, groups: ca_speaker, 26; sp_speaker, 16 #> #> Fixed effects: #> Estimate Std. Error t value #> (Intercept) -0.007564 0.048331 -0.157 #> langCatalan 0.029407 0.048331 0.608 #> wordposFinal -0.505660 0.048465 -10.433 #> wordposInitial 0.408757 0.045600 8.964 #> uvoiVoiced -0.514556 0.042640 -12.067 #> langCatalan:wordposFinal -0.231350 0.048465 -4.774 #> langCatalan:wordposInitial 0.338058 0.045600 7.414 #> #> Correlation of Fixed Effects: #> (Intr) lngCtl wrdpsF wrdpsI uvoVcd lngC:F #> langCatalan -0.285 #> wordposFinl 0.028 -0.297 #> wordposIntl -0.067 0.290 -0.768 #> uvoiVoiced -0.292 -0.292 0.216 -0.083 #> lngCtln:wrF -0.297 0.028 0.289 -0.154 0.216 #> lngCtln:wrI 0.290 -0.067 -0.154 -0.077 -0.083 -0.768 To understand how nauf treats the NA values in uvoi, we call nauf_contrasts: nauf_contrasts(mod) #>$lang
#>         Catalan
#> Catalan       1
#> Spanish      -1
#>
#> $wordpos #> Final Initial #> Final 1 0 #> Initial 0 1 #> Medial -1 -1 #> #>$uvoi
#>           Voiced
#> Voiced         1
#> Voiceless     -1
#> <NA>           0

The function returns a list with an element for each of the three factors. Sum contrasts have been applied for all three with named_contr_sum from the standardize package, but for uvoi there is an additional row indicating that a value of 0 is used when it is NA. In this way, the uvoiVoiced coefficient in the summary never contributes to the fitted value for any observation that belongs to a subset where underlying voicing is not contrastive, and its estimate represents half the difference between underlyingly voiced and voiceless Catalan word-medial observations. We can run an anova just as we did in the Cuzco example (though in this case it is not as useful since we are interested in making specific comparisons), and also create a reference grid:

anova(mod, method = "S")
#> Mixed Model Anova Table (Type III tests, S-method)
#>
#> Model: dur ~ lang * wordpos + uvoi + (1 + wordpos + uvoi | ca_speaker) +
#> Model:     (1 + wordpos | sp_speaker)
#> Data: $#> Data: sobj #> Data: data #> num Df den Df F Pr(>F) #> lang 1 27.374 0.3702 0.5479 #> wordpos 2 26.862 55.5357 2.863e-10 *** #> uvoi 1 44.259 145.6220 1.332e-15 *** #> lang:wordpos 2 26.862 28.5074 2.282e-07 *** #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 rg <- nauf_ref.grid(mod) We can now use the nauf_pmmeans function’s subset, by, and na_as_level arguments to test hypotheses in the subset of the reference grid where comparisons are valid. For example, if we want to test whether Spanish fricatives are longer than Catalan fricatives, the only just comparison we can make is for word-initial and word-medial /s/. To do this, we provide nauf_pmmeans with a list of valid groups, with each group specified as a named list of unordered factor levels: nauf_pmmeans(rg, "lang", pairwise = TRUE, subset = list( list(lang = "Catalan", wordpos = "Initial", uvoi = NA), list(lang = "Catalan", wordpos = "Medial", uvoi = "Voiceless"), list(lang = "Spanish", wordpos = c("Initial", "Medial"), uvoi = NA) ) ) #> #> Predicted marginal means for 'lang' #> Note: 'lang' is included in higher order interaction(s) 'lang:wordpos' #> #> Factors conditioned on: 'lang' 'wordpos' 'uvoi' #> #> See the 'subset' element of the 'specs' attribute for subsetted groups #> #>$pmmeans
#>  lang       pmmean         SE    df    lower.CL  upper.CL
#>  Catalan 0.6476254 0.09069385 26.54  0.46138613 0.8338646
#>  Spanish 0.1001844 0.07289153 13.23 -0.05700457 0.2573734
#>
#> Confidence level used: 0.95
#>
#> $contrasts #> contrast estimate SE df t.ratio p.value #> Catalan - Spanish 0.547441 0.1163553 39.15 4.705 <.0001 Each of the lists specified in the subset list represents a set of valid combinations of factor levels where language is truly contrastive. Any row in the reference grid which belongs to at least one of the specified groups is kept and the others are ignored. So the Catalan estimate is an average of Catalan:Initial:NA and Catalan:Medial:Voiceless, and the Spanish estimate is an average of Spanish:Initial:NA and Spanish:Medial:NA. To test the effect of word position in the two languages separately, we can make use of the by argument, and specify the groups for each language where word position is truly contrastive: nauf_pmmeans(rg, c("lang", "wordpos"), pairwise = TRUE, by = "lang", subset = list( list(lang = "Catalan", wordpos = "Initial", uvoi = NA), list(lang = "Catalan", wordpos = "Medial", uvoi = "Voiceless"), list(lang = "Spanish", uvoi = NA) ) ) #> #> Predicted marginal means for 'lang:wordpos' #> Pairwise comparisons by 'lang' #> #> Factors conditioned on: 'lang' 'wordpos' 'uvoi' #> #> See the 'subset' element of the 'specs' attribute for subsetted groups #> #>$pmmeans
#>  lang    wordpos      pmmean         SE    df    lower.CL    upper.CL
#>  Catalan Initial  0.76865796 0.09560913 26.21  0.57220781  0.96510810
#>  Catalan Medial   0.52659283 0.10499041 26.58  0.31100913  0.74217652
#>  Spanish Final   -0.31128082 0.11129830 13.47 -0.55087547 -0.07168617
#>  Spanish Initial  0.03372864 0.08563586 10.03 -0.15700539  0.22446267
#>  Spanish Medial   0.16664021 0.09218783 12.00 -0.03421486  0.36749527
#>
#> Confidence level used: 0.95
#>
#> $contrasts_1 #> contrast estimate SE df t.ratio #> Catalan,Initial - Catalan,Medial 0.2420651 0.08617821 29.03 2.809 #> p.value #> 0.0088 #> #> #>$contrasts_2
#>  contrast                           estimate         SE     df t.ratio
#>  Spanish,Final - Spanish,Initial  -0.3450095 0.11508620  12.33  -2.998
#>  Spanish,Final - Spanish,Medial   -0.4779210 0.08357926 120.38  -5.718
#>  Spanish,Initial - Spanish,Medial -0.1329116 0.10203672   5.96  -1.303
#>  p.value
#>   0.0272
#>   <.0001
#>   0.4444
#>
#> P value adjustment: tukey method for comparing a family of 3 estimates

In this way we obtain 5 predicted marginal means, and pairwise comparisons within-language. Underlying voicing is only contrastive for Catalan medial fricatives, and so we can call:

nauf_pmmeans(rg, "uvoi", pairwise = TRUE,
subset = list(lang = "Catalan", wordpos = "Medial")
)
#>
#> Predicted marginal means for 'uvoi'
#> NA not considered a level for: 'uvoi'
#>
#> Factors conditioned on: 'lang' 'wordpos'
#>
#> See the 'subset' element of the 'specs' attribute for subsetted groups
#>
#> $pmmeans #> uvoi pmmean SE df lower.CL upper.CL #> Voiced -0.5025186 0.06253773 31.46 -0.6299897 -0.3750475 #> Voiceless 0.5265928 0.10499041 26.58 0.3110091 0.7421765 #> #> Confidence level used: 0.95 #> #>$contrasts
#>  contrast            estimate         SE    df t.ratio p.value
#>  Voiced - Voiceless -1.029111 0.08528034 44.26 -12.067  <.0001

In this call, we are only specifying one group in subset, so we don’t need to double-list it (i.e. nauf will understnd list(lang = “Catalan”, wordpos = “Medial”) as list(list(lang = “Catalan”, wordpos = “Medial”))). For Catalan word-final neutralized /S/, we may be interested in making a comparison to see whether it is somewhere in between /s/ and /z/ in terms of duration, or not significantly different from /z/. This case requires the additional argument na_as_level to be specified since we want an estimate for a group where uvoi is NA (the default is that estiamtes are not generated for any group that has NA’s in the estimate table):

nauf_pmmeans(rg, c("wordpos", "uvoi"), pairwise = TRUE, na_as_level = "uvoi",
subset = list(
list(lang = "Catalan", wordpos = "Medial", uvoi = c("Voiced", "Voiceless")),
list(lang = "Catalan", wordpos = "Final", uvoi = NA)
)
)
#>
#> Predicted marginal means for 'wordpos:uvoi'
#> NA considered a level for: 'uvoi'
#> Note: The interaction term 'wordpos:uvoi' is not in the model.
#>
#> Factors conditioned on: 'lang' 'wordpos' 'uvoi'
#>
#> See the 'subset' element of the 'specs' attribute for subsetted groups
#>
#> $pmmeans #> wordpos uvoi pmmean SE df lower.CL upper.CL #> Final NA -0.7151669 0.08290231 14.38 -0.8925304 -0.5378035 #> Medial Voiced -0.5025186 0.06253773 31.46 -0.6299897 -0.3750475 #> Medial Voiceless 0.5265928 0.10499041 26.58 0.3110091 0.7421765 #> #> Confidence level used: 0.95 #> #>$contrasts
#>  contrast                           estimate         SE    df t.ratio
#>  Final,NA - Medial,Voiced         -0.2126484 0.10320044 19.49  -2.061
#>  Final,NA - Medial,Voiceless      -1.2417598 0.12851501 24.82  -9.662
#>  Medial,Voiced - Medial,Voiceless -1.0291114 0.08528034 44.26 -12.067
#>  p.value
#>   0.1245
#>   <.0001
#>   <.0001
#>
#> P value adjustment: tukey method for comparing a family of 3 estimates

As a final example, we could test if Spanish word-final /s/ (being shorter than Spanish word-medial /s/ above) is as short as Catalan word-final /S/:

nauf_pmmeans(rg, "lang", pairwise = TRUE,
subset = list(wordpos = "Final", uvoi = NA)
)
#>
#> Predicted marginal means for 'lang'
#> Note: 'lang' is included in higher order interaction(s) 'lang:wordpos'
#>
#> Factors conditioned on: 'wordpos' 'uvoi'
#>
#> See the 'subset' element of the 'specs' attribute for subsetted groups
#>
#> $pmmeans #> lang pmmean SE df lower.CL upper.CL #> Catalan -0.7151669 0.08290231 14.38 -0.8925304 -0.53780347 #> Spanish -0.3112808 0.11129830 13.47 -0.5508755 -0.07168617 #> #> Confidence level used: 0.95 #> #>$contrasts
#>  contrast            estimate        SE    df t.ratio p.value
#>  Catalan - Spanish -0.4038861 0.1387808 25.26   -2.91  0.0074

In this way, we are able to use all of the information in the dataset when fitting the model, account for the uncertainty introduced by repeated measures on the subjects, assigning different random effects to the two langauges, and then test any variety of hypotheses we might have about the effects of the factors within the subsets of the data where the comparisons make sense.

## Multiple sets of contrasts for a factor

There are two situations in which unordered factors will need more than one set of contrasts: (1) when an unordered factor with NA values interacts with another unordered factor, and some levels are collinear with NA; and (2) when an unordered factor is included as a slope for a random effects grouping factor that has NA values, but only a subset of the levels for the slope factor occur when the grouping factor is applicable. Both of these situations occur when we consider all three dialects in the plosives dataset jointly (here we will look at the voiceless subset):

dat <- subset(plosives, voicing == "Voiceless")

xtabs(~ dialect + spont, dat)
#>             spont
#> dialect      FALSE TRUE
#>   Cuzco        795  620
#>   Lima         215  206
#>   Valladolid     0  751

The data for Cuzco and Lima consist of two subsets: read speech and spontaneous speech (with the read speech task being the same for both dialects). The Valladolid data, however, come from a different corpus consisting only of spontaneous speech, with the measurements taken in the same way as for Cuzco and Lima. While it would of course be ideal to have read speech for the Valladolid speakers as well, this doesn’t mean that we need to split up the data and run multiple regressions. We can simply code spont as NA for Valladolid, split the speaker random effects by dialect, and only include a spont slope for Cuzco and Lima:

dat$spont[dat$dialect == "Valladolid"] <- NA
dat$c_speaker <- dat$l_speaker <- dat$v_speaker <- dat$speaker
dat$c_speaker[dat$dialect != "Cuzco"] <- NA
dat$l_speaker[dat$dialect != "Lima"] <- NA
dat$v_speaker[dat$dialect != "Valladolid"] <- NA

sobj <- standardize(cdur ~ spont * dialect +
(1 + spont | c_speaker) + (1 + spont | l_speaker) + (1 | v_speaker) +
(1 + dialect | item),
dat)

mod <- nauf_lmer(sobj$formula, sobj$data)

summary(mod)
#> Linear mixed model fit by REML ['nauf.lmerMod']
#> Formula: cdur ~ spont * dialect + (1 + spont | c_speaker) + (1 + spont |
#>     l_speaker) + (1 | v_speaker) + (1 + dialect | item)
#>    Data: sobj$data #> #> REML criterion at convergence: 6242.3 #> #> Scaled residuals: #> Min 1Q Median 3Q Max #> -3.0413 -0.6728 -0.1214 0.5382 4.5849 #> #> Random effects: #> Groups Name Variance Std.Dev. Corr #> c_speaker (Intercept) 0.0846787 0.29100 #> spontTRUE 0.0065212 0.08075 -0.53 #> item (Intercept) 0.2805902 0.52971 #> dialect.c2.Cuzco 0.0009203 0.03034 -1.00 #> v_speaker (Intercept) 0.0633240 0.25164 #> l_speaker (Intercept) 0.0428048 0.20689 #> spontTRUE 0.0158277 0.12581 -1.00 #> Residual 0.6030539 0.77657 #> Number of obs: 2587, groups: #> c_speaker, 30; item, 27; v_speaker, 18; l_speaker, 8 #> #> Fixed effects: #> Estimate Std. Error t value #> (Intercept) -0.180515 0.052463 -3.441 #> spontTRUE -0.152609 0.060147 -2.537 #> dialectCuzco 0.574724 0.053716 10.699 #> dialectLima -0.203724 0.065241 -3.123 #> spontTRUE:dialect.c2.Cuzco -0.004251 0.032064 -0.133 #> #> Correlation of Fixed Effects: #> (Intr) spTRUE dlctCz dlctLm #> spontTRUE -0.738 #> dialectCuzc -0.011 -0.092 #> dialectLima 0.392 -0.517 -0.421 #> spTRUE:.2.C 0.343 -0.434 -0.356 0.569 In this way, speaker effects are accounted for in each dialect, and within the spont = FALSE subset, item effects are additionally accounted for. However, note that for the interaction term spont:dialect, .c2. appears before Cuzco, and there is only one coefficient for the interaction term rather than two as we would normally expect. The same goes for the item slope for dialect. This is because using the main effect contrasts for dialect in the spont:dialect interaction term and item dialect slope would lead to collinearity issues (as explained in detail in the help page for the nauf_contrasts function). The nauf package automatically recognizes when this happens, and creates additional sets of contrasts which it uses only when it needs to. To see how the contrasts are coded, we can call nauf_contrasts: nauf_contrasts(mod) #>$spont
#>       TRUE
#> TRUE     1
#> FALSE   -1
#> <NA>     0
#>
#> \$dialect
#>            Cuzco Lima .c2.Cuzco
#> Cuzco          1    0         1
#> Lima           0    1        -1
#> Valladolid    -1   -1         0

## Regression functions

There are nauf regression functions for fixed effects regressions which would normally be fit with lm, glm, and glm.nb (from the stats and MASS packages), mixed effects regressions which would normally be fit with lmer, glmer, and glmer.nb (from the lme4 package), and Bayesian versions of these six regression functions which would normally be fit with stan_lm, stan_glm, stan_glm.nb, stan_lmer, stan_glmer, and stan_glmer.nb (from the rstanarm package). In each case, the nauf regression function has the same name but preceded by nauf_ (e.g. nauf_lm, nauf_glmer, nauf_stan_glmer.nb, etc.). For Bayesian regressions, there are methods for the generic functions in the rstantools package, as well as functions nauf_kfold and nauf_launch_shinystan, and nauf_ref.grid and nauf_pmmeans work in the same way as illustrated above, but returning summaries based on computing posterior marginal means at each iteration of the model.

## Conclusion

The nauf package allows unordered factors and random effects grouping factors to be used even when they are only applicable within a subset of a dataset. The user only needs to code them as NA when they are not applicable/contrastive, and nauf regression fitting functions take care of the rest. Different random effects structures for the same grouping variable can be fit in different subsets by creating new factors from the grouping factor, and setting them to NA in the appropriate subsets. The nauf_pmmeans function can then be used to test hypotheses conditioning on subsets of the data where a just comparison can be made.

## References

If you use the nauf package in a publication, please cite:

Eager, Christopher D. (2017). nauf: Regression with NA Values in Unordered Factors. R package version 1.1.0. https://CRAN.R-project.org/package=nauf

If you analyze the plosives dataset in a publication, please cite:

Eager, Christopher D. (2017). Contrast preservation and constraints on individual phonetic variation. Doctoral thesis. University of Illinois at Urbana-Champaign.

If you analyze the fricatives dataset in a publication, please cite:

Hualde, J. I., & Prieto, P. (2014). Lenition of intervocalic alveolar fricatives in Catalan and Spanish. Phonetica, 71(2), 109-127.