Selecting Variables

When steps are added to a recipe, there are several ways to select which variables (features) each step should operate on.

Three main characteristics of a variable can be queried: its name, its data type (e.g. numeric or nominal), and the role that was declared for it in the recipe.

The manual pages for ?selections and ?has_role have details about the available selection methods.
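As a rough sketch of selecting by name, by type, and by role, the snippet below uses a small made-up data frame (toy, which only mimics a few columns of the credit data) and a hypothetical helper, selected(), that reports which columns the first step of a recipe picked up once the recipe is trained. It assumes a recipes version that provides tidy() methods for steps.

```r
library(recipes)

# Hypothetical toy data that mimics a few columns of the credit data
toy <- data.frame(
  Status    = factor(c("good", "good", "bad", "good", "bad", "good")),
  Seniority = c(9L, 17L, 10L, 0L, 0L, 1L),
  Time      = c(60L, 60L, 36L, 60L, 36L, 60L),
  Age       = c(30L, 58L, 46L, 24L, 26L, 36L),
  Records   = factor(c("no", "no", "yes", "no", "no", "yes"))
)

toy_rec <- recipe(Status ~ ., data = toy)

# By name: bare column names or helpers such as starts_with()
by_name <- toy_rec %>% step_center(Age, starts_with("Sen"))

# By type: all_numeric() or all_nominal()
by_type <- toy_rec %>% step_center(all_numeric())

# By role: all_predictors() or all_outcomes(), here minus the nominal columns
by_role <- toy_rec %>% step_center(all_predictors(), -all_nominal())

# Hypothetical helper: train the recipe and list the columns used by step 1
selected <- function(r) tidy(prep(r, training = toy), number = 1)$terms

selected(by_name)
selected(by_type)
selected(by_role)
```

In this toy setup the type-based and role-based selections should resolve to the same three numeric predictors, while the name-based one picks only Age and Seniority.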

To illustrate this, the credit data will be used:

library(recipes)
data("credit_data")
str(credit_data)
#> 'data.frame':    4454 obs. of  14 variables:
#>  $ Status   : Factor w/ 2 levels "bad","good": 2 2 1 2 2 2 2 2 2 1 ...
#>  $ Seniority: int  9 17 10 0 0 1 29 9 0 0 ...
#>  $ Home     : Factor w/ 6 levels "ignore","other",..: 6 6 3 6 6 3 3 4 3 4 ...
#>  $ Time     : int  60 60 36 60 36 60 60 12 60 48 ...
#>  $ Age      : int  30 58 46 24 26 36 44 27 32 41 ...
#>  $ Marital  : Factor w/ 5 levels "divorced","married",..: 2 5 2 4 4 2 2 4 2 2 ...
#>  $ Records  : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 1 1 1 1 ...
#>  $ Job      : Factor w/ 4 levels "fixed","freelance",..: 2 1 2 1 1 1 1 1 2 4 ...
#>  $ Expenses : int  73 48 90 63 46 75 75 35 90 90 ...
#>  $ Income   : int  129 131 200 182 107 214 125 80 107 80 ...
#>  $ Assets   : int  0 0 3000 2500 0 3500 10000 0 15000 0 ...
#>  $ Debt     : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ Amount   : int  800 1000 2000 900 310 650 1600 200 1200 1200 ...
#>  $ Price    : int  846 1658 2985 1325 910 1645 1800 1093 1957 1468 ...

rec <- recipe(Status ~ Seniority + Time + Age + Records, data = credit_data)
rec
#> Data Recipe
#> 
#> Inputs:
#> 
#>       role #variables
#>    outcome          1
#>  predictor          4

Before any steps are added, the information on the original variables is:

summary(rec, original = TRUE)
#> # A tibble: 5 x 4
#>   variable  type    role      source  
#>   <chr>     <chr>   <chr>     <chr>   
#> 1 Seniority numeric predictor original
#> 2 Time      numeric predictor original
#> 3 Age       numeric predictor original
#> 4 Records   nominal predictor original
#> 5 Status    nominal outcome   original

We can add a step to compute dummy variables on the non-numeric data (in a complete analysis, this would come after imputing any missing data):

dummied <- rec %>% step_dummy(all_nominal())

This will capture any variables that are either character strings or factors: Status and Records. However, since Status is our outcome, we might want to keep it as a factor, so we can exclude that variable either by name or by role:

dummied <- rec %>% step_dummy(Records) # or
dummied <- rec %>% step_dummy(all_nominal(), -Status) # or
dummied <- rec %>% step_dummy(all_nominal(), -all_outcomes())

Using the last definition:

dummied <- prep(dummied, training = credit_data)
with_dummy <- bake(dummied, newdata = credit_data)
with_dummy
#> # A tibble: 4,454 x 5
#>    Status Seniority  Time   Age Records_X1
#>    <fct>      <int> <int> <int>      <dbl>
#>  1 good           9    60    30        -1.
#>  2 good          17    60    58        -1.
#>  3 bad           10    36    46         1.
#>  4 good           0    60    24        -1.
#>  5 good           0    36    26        -1.
#>  6 good           1    60    36        -1.
#>  7 good          29    60    44        -1.
#>  8 good           9    12    27        -1.
#>  9 good           0    60    32        -1.
#> 10 bad            0    48    41        -1.
#> # ... with 4,444 more rows

Status is unaffected.
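One way to convince ourselves that the three selections are interchangeable here is to train each version and compare the results. The sketch below does this on a small made-up data frame (toy, not the credit data); process() is a hypothetical helper, and the data are passed to bake() by position so that the argument name does not matter.

```r
library(recipes)

# Hypothetical toy data that mimics a few columns of the credit data
toy <- data.frame(
  Status    = factor(c("good", "good", "bad", "good", "bad", "good")),
  Seniority = c(9L, 17L, 10L, 0L, 0L, 1L),
  Time      = c(60L, 60L, 36L, 60L, 36L, 60L),
  Age       = c(30L, 58L, 46L, 24L, 26L, 36L),
  Records   = factor(c("no", "no", "yes", "no", "no", "yes"))
)

toy_rec <- recipe(Status ~ Seniority + Time + Age + Records, data = toy)

# The three ways of leaving the outcome out of the dummy variable step
by_name    <- toy_rec %>% step_dummy(Records)
minus_name <- toy_rec %>% step_dummy(all_nominal(), -Status)
minus_role <- toy_rec %>% step_dummy(all_nominal(), -all_outcomes())

# Hypothetical helper: train a recipe on `toy` and return the processed data
process <- function(r) bake(prep(r, training = toy), toy)

results <- lapply(list(by_name, minus_name, minus_role), process)

# TRUE when every version produced the same set of column names
same_names <- vapply(
  results,
  function(x) setequal(names(x), names(results[[1]])),
  logical(1)
)
all(same_names)

# Status should still be a factor in each result
vapply(results, function(x) is.factor(x$Status), logical(1))
```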

One important aspect of selecting variables in steps is that the variable names and types may change as steps are executed. In the above example, Records is a factor variable before the step is executed. Afterwards, Records is gone and a single numeric column is in its place (Records_X1 in the output above; with the default treatment contrasts it would be the 0/1 indicator Records_yes). One reason to have general selection routines like all_predictors() or contains() is to be able to select variables that have not been created yet.
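To make this concrete, here is a minimal sketch in which a later step uses all_predictors() to pick up a column that only exists once the dummy variable step has run. It again relies on a made-up data frame (toy) and assumes a recipes version that provides tidy() methods for steps.

```r
library(recipes)

# Hypothetical toy data that mimics a few columns of the credit data
toy <- data.frame(
  Status    = factor(c("good", "good", "bad", "good", "bad", "good")),
  Seniority = c(9L, 17L, 10L, 0L, 0L, 1L),
  Time      = c(60L, 60L, 36L, 60L, 36L, 60L),
  Age       = c(30L, 58L, 46L, 24L, 26L, 36L),
  Records   = factor(c("no", "no", "yes", "no", "no", "yes"))
)

toy_rec <- recipe(Status ~ ., data = toy) %>%
  # Step 1 replaces the factor Records with a numeric dummy column
  step_dummy(all_nominal(), -all_outcomes()) %>%
  # Step 2 is resolved only when the recipe is trained, so all_predictors()
  # also matches the column created by step 1
  step_center(all_predictors())

prepped <- prep(toy_rec, training = toy)

# Columns that the centering step actually operated on
centered <- tidy(prepped, number = 2)$terms
centered

# Variable information after all steps have been applied
summary(prepped)
```

Had the centering step named its columns explicitly, the selection would have needed to anticipate the name of the dummy column, which depends on the factor levels and the contrast settings.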