How are categorical predictors handled in recipes?

Recipes can behave differently from their base R counterparts, such as model.matrix. This vignette describes the different methods for encoding categorical predictors, with special attention to interaction terms.

Creating Dummy Variables

Let’s start, of course, with the iris data. This has four numeric columns and a single factor column with three levels: 'setosa', 'versicolor', and 'virginica'. Our initial recipe will have no outcome:

library(recipes)
iris_rec <- recipe( ~ ., data = iris)
summary(iris_rec)
#> # A tibble: 5 x 4
#>   variable     type    role      source  
#>   <chr>        <chr>   <chr>     <chr>   
#> 1 Sepal.Length numeric predictor original
#> 2 Sepal.Width  numeric predictor original
#> 3 Petal.Length numeric predictor original
#> 4 Petal.Width  numeric predictor original
#> 5 Species      nominal predictor original

A contrast function in R is a method for translating a column with categorical values into one or more numeric columns that take the place of the original. This can also be known as an encoding method or a parameterization function.

The default approach is to create dummy variables using the “reference cell” parameterization. This means that, if there are C levels of the factor, there will be C - 1 dummy variables created and all but the first factor level are made into new columns:

ref_cell <- iris_rec %>% 
  step_dummy(Species) %>%
  prep(training = iris, retain = TRUE)
summary(ref_cell)
#> # A tibble: 6 x 4
#>   variable           type    role      source  
#>   <chr>              <chr>   <chr>     <chr>   
#> 1 Sepal.Length       numeric predictor original
#> 2 Sepal.Width        numeric predictor original
#> 3 Petal.Length       numeric predictor original
#> 4 Petal.Width        numeric predictor original
#> 5 Species_versicolor numeric predictor derived 
#> 6 Species_virginica  numeric predictor derived

# Get a row for each factor level
rows <- c(1, 51, 101)
juice(ref_cell, starts_with("Species"))[rows,]
#> # A tibble: 3 x 2
#>   Species_versicolor Species_virginica
#>                <dbl>             <dbl>
#> 1               0                 0   
#> 2               1.00              0   
#> 3               0                 1.00

Note that the original column (Species) is no longer there.
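For reference, the reference-cell encoding can also be inspected directly in base R. The following is a minimal sketch using contr.treatment from the stats package; the object names lvls and ref_contrasts are illustrative and are not part of recipes.

```r
# Levels of the factor, in order; the first one acts as the reference level
lvls <- levels(iris$Species)

# Contrast matrix for the reference-cell (treatment) parameterization:
# one row per factor level, one column per non-reference level
ref_contrasts <- contr.treatment(lvls)
ref_contrasts
```

Each row gives the dummy variable values assigned to one level; the row for the first level contains only zeros, which is why no column is created for it.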

There are different types of contrasts that can be used for different types of factors. The defaults are:

param <- getOption("contrasts")
param
#>         unordered           ordered 
#> "contr.treatment"      "contr.poly"

Looking at ?contrast, there are other options. One alternative is the little-known Helmert contrast:

contr.helmert returns Helmert contrasts, which contrast the second level with the first, the third with the average of the first two, and so on.
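To make that description concrete, the contrast matrix for a three-level factor can be printed in base R. This is a small sketch; the object name helmert_contrasts is only illustrative.

```r
# Helmert contrast matrix for the three levels of Species
lvls <- levels(iris$Species)
helmert_contrasts <- contr.helmert(lvls)
helmert_contrasts

# Each contrast column sums to zero across the levels
colSums(helmert_contrasts)
```

The first column compares the second level with the first, and the second column compares the third level with the first two.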

To get this encoding, the global option for the contrasts can be changed and saved. step_dummy picks up on this and makes the correct calculations:

# change it:
new_cont <- param
new_cont["unordered"] <- "contr.helmert"
options(contrasts = new_cont)

# now make dummy variables with new parameterization
helmert <- iris_rec %>% 
  step_dummy(Species) %>%
  prep(training = iris, retain = TRUE)
summary(helmert)
#> # A tibble: 6 x 4
#>   variable     type    role      source  
#>   <chr>        <chr>   <chr>     <chr>   
#> 1 Sepal.Length numeric predictor original
#> 2 Sepal.Width  numeric predictor original
#> 3 Petal.Length numeric predictor original
#> 4 Petal.Width  numeric predictor original
#> 5 Species_X1   numeric predictor derived 
#> 6 Species_X2   numeric predictor derived

juice(helmert, starts_with("Species"))[rows,]
#> # A tibble: 3 x 2
#>   Species_X1 Species_X2
#>        <dbl>      <dbl>
#> 1      -1.00      -1.00
#> 2       1.00      -1.00
#> 3       0          2.00

# Yuk; go back to the original method
options(contrasts = param)
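Because changing a global option by hand makes it easy to forget the reset, one possible pattern is to wrap the change in a small helper that restores the previous value on exit. The sketch below uses only base R; with_contrasts is a hypothetical helper written for illustration, not a function from recipes, and model.matrix stands in for whatever code should see the temporary setting.

```r
# Hypothetical helper: evaluate `code` with a temporary unordered contrast,
# then restore whatever the option was before
with_contrasts <- function(unordered, code) {
  new_cont <- getOption("contrasts")
  new_cont["unordered"] <- unordered
  old <- options(contrasts = new_cont)  # options() returns the old values
  on.exit(options(old), add = TRUE)     # reset even if `code` fails
  force(code)                           # `code` is evaluated lazily, here
}

before <- getOption("contrasts")
mm <- with_contrasts("contr.helmert", model.matrix(~ Species, data = iris))
after <- getOption("contrasts")

# The global option is unchanged once the helper returns
identical(before, after)
```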

Interactions with Dummy Variables

Creating interactions with recipes requires the use of a model formula, such as

iris_int <- iris_rec %>%
  step_interact( ~ Sepal.Width:Sepal.Length) %>%
  prep(training = iris, retain = TRUE)
summary(iris_int)
#> # A tibble: 6 x 4
#>   variable                   type    role      source  
#>   <chr>                      <chr>   <chr>     <chr>   
#> 1 Sepal.Length               numeric predictor original
#> 2 Sepal.Width                numeric predictor original
#> 3 Petal.Length               numeric predictor original
#> 4 Petal.Width                numeric predictor original
#> 5 Species                    nominal predictor original
#> 6 Sepal.Width_x_Sepal.Length numeric predictor derived

In R model formulae, using a * between two variables expands to a*b = a + b + a:b, so the main effects are included. In step_interact, you can use *, but only the interaction terms are recorded as columns that need to be created.
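The base R expansion of * can be checked with terms(). This is a minimal sketch; a and b are placeholder variable names and no data are needed.

```r
# Term labels produced by the two operators in a base R formula
star_terms  <- attr(terms(~ a * b), "term.labels")
colon_terms <- attr(terms(~ a:b), "term.labels")

star_terms   # main effects plus the interaction
colon_terms  # the interaction only
```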

One thing that recipes does differently than base R is to construct the design matrix in sequential iterations. This is relevant when thinking about interactions between continuous and categorical predictors.

For example, if you were to use the standard formula interface, the creation of the dummy variables happens at the same time as the interactions are created:

model.matrix(~ Species*Sepal.Length, data = iris)[rows,]
#>     (Intercept) Speciesversicolor Speciesvirginica Sepal.Length
#> 1             1                 0                0          5.1
#> 51            1                 1                0          7.0
#> 101           1                 0                1          6.3
#>     Speciesversicolor:Sepal.Length Speciesvirginica:Sepal.Length
#> 1                                0                           0.0
#> 51                               7                           0.0
#> 101                              0                           6.3

With recipes, you create them sequentially. This raises an issue: do I have to type out all of the interaction effects by their specific names when using dummy variables?

# Must I do this?
iris_rec %>%
  step_interact( ~ Species_versicolor:Sepal.Length + 
                   Species_virginica:Sepal.Length) 

Not only is this a pain, but it may not be obvious what dummy variables are available (especially when step_other is used).

The solution is to use a selector:

iris_int <- iris_rec %>% 
  step_dummy(Species) %>%
  step_interact( ~ starts_with("Species"):Sepal.Length) %>%
  prep(training = iris, retain = TRUE)
summary(iris_int)
#> # A tibble: 8 x 4
#>   variable                          type    role      source  
#>   <chr>                             <chr>   <chr>     <chr>   
#> 1 Sepal.Length                      numeric predictor original
#> 2 Sepal.Width                       numeric predictor original
#> 3 Petal.Length                      numeric predictor original
#> 4 Petal.Width                       numeric predictor original
#> 5 Species_versicolor                numeric predictor derived 
#> 6 Species_virginica                 numeric predictor derived 
#> 7 Species_versicolor_x_Sepal.Length numeric predictor derived 
#> 8 Species_virginica_x_Sepal.Length  numeric predictor derived

What happens here is that starts_with("Species") is executed on the data that are available after the previous steps have been applied. That means that the dummy variable columns are present. The columns matched by this selector are then combined into an additive expression. In this case, that means that

starts_with("Species")

becomes

(Species_versicolor + Species_virginica)
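The translation can be imitated in base R to see which terms result. The sketch below is only an illustration of the idea, not the code that recipes runs internally; the vector of column names is assumed to match what is available after the dummy variable step, and startsWith stands in for the starts_with selector.

```r
# Column names assumed to be available after the dummy variable step
available <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width",
               "Species_versicolor", "Species_virginica")

# Mimic starts_with("Species") with a base R prefix match
selected <- available[startsWith(available, "Species")]

# Combine the selected columns additively and cross them with Sepal.Length
rhs <- paste0("(", paste(selected, collapse = " + "), "):Sepal.Length")
int_formula <- as.formula(paste("~", rhs))

# The additive expression expands to one interaction term per dummy variable
int_terms <- attr(terms(int_formula), "term.labels")
int_terms
```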

The entire interaction formula is shown here:

iris_int
#> Data Recipe
#> 
#> Inputs:
#> 
#>       role #variables
#>  predictor          5
#> 
#> Training data contained 150 data points and no missing data.
#> 
#> Operations:
#> 
#> Dummy variables from Species [trained]
#> Interactions with (Species_versicolor + Species_virginica):Sepal.Length [trained]
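As a sanity check, the interaction columns produced by the sequential approach can be compared with those from model.matrix. The sketch below assumes the default treatment contrasts are in effect and that the installed version of recipes still provides juice() and the "_x_" naming shown above; the object names rec_int, rec_cols, and mm_cols are illustrative.

```r
library(recipes)

# Sequential approach: dummy variables first, then interactions via a selector
rec_int <- recipe(~ ., data = iris) %>%
  step_dummy(Species) %>%
  step_interact(~ starts_with("Species"):Sepal.Length) %>%
  prep(training = iris, retain = TRUE)
rec_cols <- juice(rec_int, contains("_x_"))

# Base R approach: dummy variables and interactions created in one pass
mm <- model.matrix(~ Species * Sepal.Length, data = iris)
mm_cols <- mm[, grep(":", colnames(mm)), drop = FALSE]

# Compare the values only; the column names differ between the two approaches
same_values <- all.equal(unname(as.matrix(rec_cols)), unname(mm_cols))
same_values
```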

Warning!

Would it work if I didn’t convert Species to dummy variables and used the interactions step directly?

iris_int <- iris_rec %>% 
  step_interact( ~ Species:Sepal.Length) %>%
  prep(training = iris, retain = TRUE)
#> Warning in prep.step_interact(x$steps[[i]], training = training, info = x
#> $term_info): Categorical variables used in `step_interact` should probably
#> be avoided; This can lead to differences in dummy variable values that are
#> produced by `step_dummy`.
summary(iris_int)
#> # A tibble: 7 x 4
#>   variable                         type    role      source  
#>   <chr>                            <chr>   <chr>     <chr>   
#> 1 Sepal.Length                     numeric predictor original
#> 2 Sepal.Width                      numeric predictor original
#> 3 Petal.Length                     numeric predictor original
#> 4 Petal.Width                      numeric predictor original
#> 5 Species                          nominal predictor original
#> 6 Speciesversicolor_x_Sepal.Length numeric predictor derived 
#> 7 Speciesvirginica_x_Sepal.Length  numeric predictor derived

The column Species isn’t affected and a warning is issued. Basically, you only get half of what model.matrix does: the interaction columns are created, but the main-effect dummy variables are not. That could be really problematic in subsequent steps.