On Skipping Steps

When steps are created in a recipe, they can be applied to data (i.e. baked) at two distinct times:

  1. During the process of preparing the recipe, each step is estimated via prep and then applied to the training set using bake before proceeding to the next step.
  2. After the recipe has been prepared, bake can be used with any data set to apply the preprocessing to those data.
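
As a minimal sketch of these two times, the code below prepares a recipe on a training set and then bakes a separate data set. The row split of mtcars and the object names (train, test, prepped, test_baked) are assumptions made purely for illustration.

```r
library(recipes)

# Hypothetical train/test split of mtcars, for illustration only
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]

rec <- recipe(mpg ~ ., data = train) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())

# Time 1: each step is estimated from, and applied to, the training set
prepped <- prep(rec, training = train, retain = TRUE)

# Time 2: the already-estimated steps are applied to another data set
test_baked <- bake(prepped, new_data = test)
```

The centering and scaling values come only from train; test is transformed with those same values.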

There are times when we would like to circumvent baking on a new data set (i.e., #2 above). For example:

Example: Class Imbalance Sampling and Skipping Steps

As an example of the second case, consider the problem of a severe class imbalance. Suppose that there are two classes to be predicted and the event of interest occurs only 5% of the time. Many models will quickly optimize accuracy by overfitting to the majority class, predicting everything to be a non-event. One method to compensate for this is to down-sample the training set so that the class frequencies are about equal. Although somewhat counter-intuitive, this can often lead to better models.

The important consideration is that this preprocessing is applied only to the training set so that it can impact the model fit. The test set should be unaffected by this operation. If the recipe is used to create the design matrix for the model, down-sampling would remove rows. This would be a bad idea for the test set, since these data should represent what the population of samples looks like “in the wild.” Based on this, a recipe that includes a down-sampling step should skip that step when data are baked for the test set.
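
The following sketch illustrates the idea on simulated data. It assumes that the themis package is installed and supplies step_downsample() (in older releases this step was part of recipes itself); the simulated data frame and the object names are hypothetical.

```r
library(recipes)
library(themis)  # assumption: step_downsample() is provided by themis

set.seed(1)
# Simulated two-class data with roughly 5% events (illustrative only)
n <- 1000
sim <- data.frame(
  x = rnorm(n),
  class = factor(sample(c("event", "non_event"), n,
                        replace = TRUE, prob = c(0.05, 0.95)))
)
train <- sim[1:800, ]
test  <- sim[801:1000, ]

rec <- recipe(class ~ x, data = train) %>%
  step_downsample(class, skip = TRUE) %>%
  prep(training = train, retain = TRUE)

# Training set: the majority class has been down-sampled
table(juice(rec)$class)

# Test set: the step is skipped, so no rows are removed
nrow(bake(rec, new_data = test))
```

The training set returned by juice has balanced classes, while the baked test set keeps all of its rows and its original class frequencies.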

How To Skip Steps

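As of recipes version 0.1.2, each step has an optional logical argument called skip. In almost every case, the default is FALSE. When skip = TRUE, the step is still estimated and applied to the training set during prep, but it is not applied when bake is called on a data set.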

Recall that there are two ways of getting the results for the training set with recipes. First, bake can be used as usual. Second, juice is a shortcut that returns the already processed data stored in the recipe when prep(recipe, retain = TRUE) is used. juice is much faster and is the way to get the training set with all of the steps applied to the data. For this reason, you should almost always use retain = TRUE if any steps are skipped (a warning is produced otherwise).
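
A small sketch of the two routes, assuming a recipe with no skipped steps so that both should agree. The object names are illustrative.

```r
library(recipes)

rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_center(all_predictors()) %>%
  prep(training = mtcars, retain = TRUE)

# Route 1: re-apply every step to the training data
from_bake <- bake(rec, new_data = mtcars)

# Route 2: return the processed training set retained by prep
from_juice <- juice(rec)

# With no skipped steps, the processed columns match
all.equal(from_bake$disp, from_juice$disp)
```

Once a step is marked with skip = TRUE, only the second route reflects every step.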

Be Careful!

Skipping is a necessary feature but can be dangerous if used carelessly.

As an example, skipping an operation whose variables are used later might be an issue:

library(recipes)
car_recipe <- recipe(mpg ~ ., data = mtcars) %>%
  step_log(disp, skip = TRUE) %>%
  step_center(all_predictors()) %>%
  prep(training = mtcars, retain = TRUE)

# These *should* produce the same results (as they do for `hp`)
juice(car_recipe) %>% head() %>% select(disp, hp)
#> # A tibble: 6 x 2
#>     disp    hp
#>    <dbl> <dbl>
#> 1 -0.210 -36.7
#> 2 -0.210 -36.7
#> 3 -0.603 -53.7
#> 4  0.268 -36.7
#> 5  0.601  28.3
#> 6  0.131 -41.7
bake(car_recipe, new_data = mtcars) %>% head() %>% select(disp, hp)
#> # A tibble: 6 x 2
#>    disp    hp
#>   <dbl> <dbl>
#> 1  155. -36.7
#> 2  155. -36.7
#> 3  103. -53.7
#> 4  253. -36.7
#> 5  355.  28.3
#> 6  220. -41.7

This should emphasize that juice should be used to get the training set values whenever a step is skipped.
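
One way to check a prepared recipe for skipped steps before relying on bake for training set values is sketched below. It assumes that the tibble returned by tidy() on a recipe includes a logical skip column, which may not hold for every version of recipes; step_info is a hypothetical name.

```r
library(recipes)

car_recipe <- recipe(mpg ~ ., data = mtcars) %>%
  step_log(disp, skip = TRUE) %>%
  step_center(all_predictors()) %>%
  prep(training = mtcars, retain = TRUE)

# One row per step; `skip` flags steps that bake will ignore
step_info <- tidy(car_recipe)
step_info[step_info$skip, c("number", "type")]
```

If any rows are returned, the training set values should come from juice rather than from bake.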