C5.0 Classification Models

The C50 package contains an interface to the C5.0 classification model. The two main modes for this model are:

* a basic tree-based model
* a rule-based model

Many of the details of this model can be found in Quinlan (1993) although the model has new features that are described in Kuhn and Johnson (2013). The main public resource on this model comes from the RuleQuest website.

To demonstrate a simple model, we’ll use the credit data that can be accessed in the recipes package:

library(recipes)
data(credit_data)

The outcome is in a column called Status and, to demonstrate a simple model, the Home and Seniority predictors will be used.

vars <- c("Home", "Seniority")
str(credit_data[, c(vars, "Status")])
## 'data.frame':    4454 obs. of  3 variables:
##  $ Home     : Factor w/ 6 levels "ignore","other",..: 6 6 3 6 6 3 3 4 3 4 ...
##  $ Seniority: int  9 17 10 0 0 1 29 9 0 0 ...
##  $ Status   : Factor w/ 2 levels "bad","good": 2 2 1 2 2 2 2 2 2 1 ...
# a simple split
set.seed(2411)
in_train <- sample(1:nrow(credit_data), size = 3000)
train_data <- credit_data[ in_train,]
test_data  <- credit_data[-in_train,]

Classification Trees

To fit a simple classification tree model, we can start with the non-formula method:

library(C50)
tree_mod <- C5.0(x = train_data[, vars], y = train_data$Status)
tree_mod
## 
## Call:
## C5.0.default(x = train_data[, vars], y = train_data$Status)
## 
## Classification Tree
## Number of samples: 3000 
## Number of predictors: 2 
## 
## Tree size: 3 
## 
## Non-standard options: attempt to group attributes

To understand the model, the summary method can be used to get the default C5.0 command-line output:

summary(tree_mod)
## 
## Call:
## C5.0.default(x = train_data[, vars], y = train_data$Status)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Fri Dec  1 13:22:24 2017
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 3000 cases (3 attributes) from undefined.data
## 
## Decision tree:
## 
## Seniority > 2: good (1971/396)
## Seniority <= 2:
## :...Home in {ignore,other,priv,rent}: bad (411.4/171)
##     Home in {owner,parents}: good (617.6/226.6)
## 
## 
## Evaluation on training data (3000 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##       3  794(26.5%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     240   623    (a): class bad
##     171  1966    (b): class good
## 
## 
##  Attribute usage:
## 
##  100.00% Seniority
##   34.27% Home
## 
## 
## Time: 0.0 secs

A graphical method for examining the model can be generated by the plot method:

plot(tree_mod)

A variety of options are outlined in the documentation for the C5.0Control function. Another option that can be used is the trials argument, which enables a boosting procedure. This method is more similar to the classical AdaBoost algorithm than to more statistical approaches such as stochastic gradient boosting.
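As a sketch of how a control object is passed, the chunk below sets two of the documented C5.0Control options (the value 20 for minCases is arbitrary, chosen only for illustration):

# hypothetical illustration: require at least 20 samples per split and
# enable predictor winnowing via a C5.0Control object
ctrl_mod <- C5.0(x = train_data[, vars], y = train_data$Status,
                 control = C5.0Control(minCases = 20, winnow = TRUE))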

For example, using three iterations of boosting:

tree_boost <- C5.0(x = train_data[, vars], y = train_data$Status, trials = 3)
summary(tree_boost)
## 
## Call:
## C5.0.default(x = train_data[, vars], y = train_data$Status, trials = 3)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Fri Dec  1 13:22:24 2017
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 3000 cases (3 attributes) from undefined.data
## 
## -----  Trial 0:  -----
## 
## Decision tree:
## 
## Seniority > 2: good (1971/396)
## Seniority <= 2:
## :...Home in {ignore,other,priv,rent}: bad (411.4/171)
##     Home in {owner,parents}: good (617.6/226.6)
## 
## -----  Trial 1:  -----
## 
## Decision tree:
## 
## Seniority > 5: good (1331.5/339.5)
## Seniority <= 5:
## :...Seniority <= 0: bad (392.4/173)
##     Seniority > 0: good (1276.1/542.7)
## 
## -----  Trial 2:  -----
## 
## Decision tree:
##  good (2426/564.2)
## 
## 
## Evaluation on training data (3000 cases):
## 
## Trial        Decision Tree   
## -----      ----------------  
##    Size      Errors  
## 
##    0      3  794(26.5%)
##    1      3  843(28.1%)
##    2      1  863(28.8%)
## boost            803(26.8%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     103   760    (a): class bad
##      43  2094    (b): class good
## 
## 
##  Attribute usage:
## 
##  100.00% Seniority
##   34.27% Home
## 
## 
## Time: 0.0 secs

Note that the trial counting is zero-based. The plot method can also show a specific tree in the ensemble using the trial option.
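For example, to display the second boosting iteration (trial 1, given the zero-based numbering):

plot(tree_boost, trial = 1)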

Rule-Based Models

C5.0 can create an initial tree model then decompose the tree structure into a set of mutually exclusive rules. These rules can then be pruned and modified into a smaller set of potentially overlapping rules. The rules can be created using the rules option:

rule_mod <- C5.0(x = train_data[, vars], y = train_data$Status, rules = TRUE)
rule_mod
## 
## Call:
## C5.0.default(x = train_data[, vars], y = train_data$Status, rules = TRUE)
## 
## Rule-Based Model
## Number of samples: 3000 
## Number of predictors: 2 
## 
## Number of Rules: 3 
## 
## Non-standard options: attempt to group attributes
summary(rule_mod)
## 
## Call:
## C5.0.default(x = train_data[, vars], y = train_data$Status, rules = TRUE)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Fri Dec  1 13:22:25 2017
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 3000 cases (3 attributes) from undefined.data
## 
## Rules:
## 
## Rule 1: (411/171, lift 2.0)
##  Home in {ignore, other, priv, rent}
##  Seniority <= 2
##  ->  class bad  [0.584]
## 
## Rule 2: (1971/396, lift 1.1)
##  Seniority > 2
##  ->  class good  [0.799]
## 
## Rule 3: (1940/423, lift 1.1)
##  Home in {owner, parents}
##  ->  class good  [0.782]
## 
## Default class: good
## 
## 
## Evaluation on training data (3000 cases):
## 
##          Rules     
##    ----------------
##      No      Errors
## 
##       3  794(26.5%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     240   623    (a): class bad
##     171  1966    (b): class good
## 
## 
##  Attribute usage:
## 
##   79.40% Seniority
##   78.37% Home
## 
## 
## Time: 0.0 secs

Note that no pruning was warranted for this model.

There is no plot method for rule-based models.

Predictions

The predict method can be used to get hard class predictions or class probability estimates (referred to as “confidence values” in the C5.0 documentation).

predict(rule_mod, newdata = test_data[1:3, vars])
## [1] good bad  good
## Levels: bad good
predict(tree_boost, newdata = test_data[1:3, vars], type = "prob")
##         bad      good
## 3 0.0000000 1.0000000
## 4 0.5981465 0.4018535
## 7 0.0000000 1.0000000

Cost-Sensitive Models

A cost matrix can also be used to emphasize certain classes over others. For example, to penalize misclassifying the “bad” samples more heavily:

cost_mat <- matrix(c(0, 2, 1, 0), nrow = 2)
rownames(cost_mat) <- colnames(cost_mat) <- c("bad", "good")
cost_mat
##      bad good
## bad    0    1
## good   2    0
cost_mod <- C5.0(x = train_data[, vars], y = train_data$Status, 
                 costs = cost_mat)
summary(cost_mod)
## 
## Call:
## C5.0.default(x = train_data[, vars], y = train_data$Status, costs
##  = cost_mat)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Fri Dec  1 13:22:25 2017
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 3000 cases (3 attributes) from undefined.data
## Read misclassification costs from undefined.costs
## 
## Decision tree:
## 
## Seniority <= 2: bad (1029/562)
## Seniority > 2:
## :...Home in {ignore,owner,parents}: good (1331.7/199.4)
##     Home in {other,priv,rent}:
##     :...Seniority > 15: good (130/21)
##         Seniority <= 15:
##         :...Seniority <= 5: bad (199.6/116.3)
##             Seniority > 5: good (309.6/92.3)
## 
## 
## Evaluation on training data (3000 cases):
## 
##         Decision Tree       
##    -----------------------  
##    Size      Errors   Cost  
## 
##       5  991(33.0%)   0.43   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     550   313    (a): class bad
##     678  1459    (b): class good
## 
## 
##  Attribute usage:
## 
##  100.00% Seniority
##   65.57% Home
## 
## 
## Time: 0.0 secs
# more samples predicted as "bad"
table(predict(cost_mod, test_data[, vars]))
## 
##  bad good 
##  569  885
# compared to the previous tree model
table(predict(tree_mod, test_data[, vars]))
## 
##  bad good 
##  190 1264
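
To see how the additional “bad” predictions line up with the observed classes, the cost-sensitive predictions can be cross-tabulated against the test-set outcome (the counts will depend on the random split above):

# cross-tabulate cost-sensitive predictions against the observed classes
table(predicted = predict(cost_mod, test_data[, vars]),
      observed  = test_data$Status)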