FFTrees tree construction algorithms

Nathaniel Phillips

2017-04-18

Default FFT construction algorithm “m”

Trees are built using the wrapper function FFTrees(), which calls the functions cuerank() and grow.FFTrees() to carry out the individual steps of creating the trees. The default tree construction algorithm, algorithm = "m", is very simple. It can be summarised in four steps.

The four steps in growing FFTs with algorithm = "m".

| Step | Function | Description |
|------|----------|-------------|
| 1 | cuerank() | For each cue, calculate a classification threshold that maximizes the balanced accuracy (average of sensitivity and specificity) of classifications of all data based on that cue (that is, ignoring all other cues). If the cue is numeric, the threshold is a number. If the cue is a factor, the threshold is one or more factor levels. |
| 2 | grow.FFTrees() | Rank cues in order of their highest balanced accuracy value, calculated using the classification threshold determined in step 1. |
| 3 | grow.FFTrees() | Create all possible trees by varying the exit direction (left or right) at each level, up to a maximum of X levels (default of max.levels = 4). |
| 4 | grow.FFTrees() | Reduce the size of trees by removing (pruning) lower levels containing fewer than X% (default of stopping.par = .10) of the cases in the original data. |
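
To make the balanced accuracy in step 1 concrete, here is a minimal sketch in base R. It is not the code used inside cuerank(): the helper name bacc_for_threshold and the toy data are assumptions made up for illustration, and only the "greater than" direction for a numeric cue is handled.

```r
# Hypothetical helper (not part of FFTrees): balanced accuracy of the rule
# "predict TRUE when cue > threshold" against a logical criterion.
bacc_for_threshold <- function(cue, criterion, threshold) {
  prediction <- cue > threshold

  # Cells of the 2 x 2 confusion table
  hits               <- sum(prediction & criterion)
  misses             <- sum(!prediction & criterion)
  false_alarms       <- sum(prediction & !criterion)
  correct_rejections <- sum(!prediction & !criterion)

  sensitivity <- hits / (hits + misses)
  specificity <- correct_rejections / (correct_rejections + false_alarms)

  # Balanced accuracy is the average of sensitivity and specificity
  (sensitivity + specificity) / 2
}

# Toy data, made up for illustration (not the heartdisease data)
age       <- c(40, 45, 50, 52, 58, 60, 63, 70)
diagnosis <- c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)

toy_bacc <- bacc_for_threshold(age, diagnosis, threshold = 55)
toy_bacc
```

Because sensitivity and specificity are weighted equally, balanced accuracy does not favor the more frequent class the way overall accuracy would.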

Example: Heart Disease

First, we’ll calculate a classification threshold for each cue using cuerank():

heartdisease.ca <- FFTrees::cuerank(formula = diagnosis ~., 
                                    data = heartdisease)

# Print key results
heartdisease.ca[c("cue", "threshold", "direction", "bacc")]
##         cue            threshold direction      bacc
## 1       age                   55         > 0.6347166
## 2       sex                    0         > 0.6295841
## 3        cp                    a         = 0.7587954
## 4  trestbps                  140         > 0.5579707
## 5      chol                  242         > 0.5677092
## 6       fbs                    0         > 0.5090147
## 7   restecg hypertrophy,abnormal         = 0.5881953
## 8   thalach                  148         < 0.7042902
## 9     exang                    0         > 0.7032593
## 10  oldpeak                  0.8         > 0.6978856
## 11    slope            flat,down         = 0.6936743
## 12       ca                    0         > 0.7308738
## 13     thal                rd,fd         = 0.7596508

Here, we see, for each cue, the decision threshold that maximizes its balanced accuracy (bacc) when applied to the entire dataset (independently of other cues). For example, for the age cue, the best threshold is age > 55, which leads to a balanced accuracy of 0.63. In other words, if we only had the age cue, the best decision rule would be: “If age > 55, predict heart disease; otherwise, predict no heart disease”.

Let’s confirm that this threshold makes sense. To do this, we can plot the bacc value for all possible thresholds, as in the figure below:

Plotting the balanced accuracy of each decision threshold for the age cue.
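
A plot of this kind can be produced with a few lines of base R. The sketch below is an assumption about how such a figure could be drawn, not the code behind the figure above; it uses made-up toy data rather than the heartdisease data and only considers rules of the form age > threshold.

```r
# Toy data, made up for illustration (not the heartdisease data)
age       <- c(40, 45, 50, 52, 58, 60, 63, 70)
diagnosis <- c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)

# Candidate thresholds: every observed value of the cue
thresholds <- sort(unique(age))

# Balanced accuracy of "predict TRUE when age > threshold" for each candidate
bacc <- sapply(thresholds, function(th) {
  prediction  <- age > th
  sensitivity <- mean(prediction[diagnosis])    # share of true cases flagged
  specificity <- mean(!prediction[!diagnosis])  # share of non-cases not flagged
  (sensitivity + specificity) / 2
})

# Threshold with the highest balanced accuracy (first one in case of ties)
best <- thresholds[which.max(bacc)]

plot(thresholds, bacc, type = "b",
     xlab = "Threshold (age > x)", ylab = "Balanced accuracy")
abline(v = best, lty = 2)  # mark the best threshold
```

In this toy example the curve has a single peak; with real data, several thresholds can tie, in which case which.max() returns the first of them.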


Next, the cues are ranked by their balanced accuracy. Let’s do that with the heart disease cues:

# Rank heartdisease cues by balanced accuracy
heartdisease.ca <- heartdisease.ca[order(heartdisease.ca$bacc, decreasing = TRUE),]

# Print the key columns
heartdisease.ca[c("cue", "threshold", "direction", "bacc")]
##         cue            threshold direction      bacc
## 13     thal                rd,fd         = 0.7596508
## 3        cp                    a         = 0.7587954
## 12       ca                    0         > 0.7308738
## 8   thalach                  148         < 0.7042902
## 9     exang                    0         > 0.7032593
## 10  oldpeak                  0.8         > 0.6978856
## 11    slope            flat,down         = 0.6936743
## 1       age                   55         > 0.6347166
## 2       sex                    0         > 0.6295841
## 7   restecg hypertrophy,abnormal         = 0.5881953
## 5      chol                  242         > 0.5677092
## 4  trestbps                  140         > 0.5579707
## 6       fbs                    0         > 0.5090147

Now, we can see that the top five cues are thal, cp, ca, thalach and exang. Because FFTs rarely exceed 5 cues, we can expect that the trees will use a subset (not necessarily all) of these 5 cues.

We can also plot the cue accuracies in ROC space using the showcues() function:

# Show the accuracy of cues in ROC space
showcues(cue.accuracies = heartdisease.ca)
Cue accuracies for the heartdisease dataset. The top 5 cues in terms of balanced accuracy are highlighted.


Next, grow.FFTrees() will grow several trees from these cues using different exit structures:

# Grow FFTs
heartdisease.ffts <- grow.FFTrees(formula = diagnosis ~., 
                                  data = heartdisease)

# Print the tree definitions
heartdisease.ffts$tree.definitions
##   tree               cues nodes classes     exits    thresholds directions
## 1    1 thal;cp;ca;thalach     4 c;c;n;n 0;0;0;0.5 rd,fd;a;0;148    =;=;>;<
## 5    2 thal;cp;ca;thalach     4 c;c;n;n 0;0;1;0.5 rd,fd;a;0;148    =;=;>;<
## 3    3         thal;cp;ca     3   c;c;n   0;1;0.5     rd,fd;a;0      =;=;>
## 2    4         thal;cp;ca     3   c;c;n   1;0;0.5     rd,fd;a;0      =;=;>
## 6    5 thal;cp;ca;thalach     4 c;c;n;n 1;0;1;0.5 rd,fd;a;0;148    =;=;>;<
## 4    6 thal;cp;ca;thalach     4 c;c;n;n 1;1;0;0.5 rd,fd;a;0;148    =;=;>;<
## 7    7 thal;cp;ca;thalach     4 c;c;n;n 1;1;1;0.5 rd,fd;a;0;148    =;=;>;<

Here, we see that we have 7 different trees, each using some combination of the top 5 cues we identified earlier. For example, tree 1 uses the top 4 cues, while tree 3 uses only the top 3 cues. Why is that? The reason is that the algorithm also prunes lower branches of the tree if too few cases are classified at lower levels. By default, the algorithm removes any lower levels that classify fewer than 10% of the original cases. The pruning criteria can be controlled using the stopping.rule, stopping.par and max.levels arguments of grow.FFTrees().
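
The two ideas at work here, varying the exits and pruning sparse levels, can be sketched in base R. This is only an illustration under stated assumptions, not the internals of grow.FFTrees(): it assumes that each non-final node exits in one direction (coded 0 or 1) and that the final node exits both ways (coded 0.5, as in the exits column above), and the case counts used for pruning are made up.

```r
# --- Candidate exit structures (before any pruning) ---
max.levels <- 4

# Every combination of 0/1 exits for the non-final nodes
non_final <- expand.grid(rep(list(c(0, 1)), max.levels - 1))

# Append the two-way exit (0.5) of the final node and format like the
# "exits" column of tree.definitions
exit_structures <- unname(apply(non_final, 1, function(x) {
  paste(c(x, 0.5), collapse = ";")
}))
exit_structures

# --- Pruning sketch ---
# Hypothetical numbers of cases classified at each of the 4 levels (made up)
n_classified <- c(150, 80, 50, 20)
stopping.par <- 0.10

prop_classified <- n_classified / sum(n_classified)

# Keep levels from the top down until one falls below stopping.par
keep        <- cumprod(prop_classified >= stopping.par) == 1
levels_kept <- sum(keep)
levels_kept
```

With 4 levels, the enumeration yields 2^3 = 8 candidate exit structures. In the pruning part, the fourth level holds fewer than 10% of the hypothetical cases, so only three levels are kept.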

Now let’s use the wrapper function FFTrees() to create the trees all at once. We will then plot tree #4, which, according to our results above, should contain the cues thal, cp, and ca:

library(FFTrees)

# Create trees
heart.fft <- FFTrees(formula = diagnosis ~., data = heartdisease)

# Plot tree # 4
plot(heart.fft, 
     stats = FALSE,    # Don't include statistics
     tree = 4)

Alternative algorithms