Trees are built using the wrapper function FFTrees(), which calls grow.FFTrees() to complete the steps of creating the trees. The default algorithm used to create trees, algorithm = "m", is very simple. It can be summarised in four steps:
1. For each cue, calculate a classification threshold that maximizes the balanced accuracy (the average of sensitivity and specificity) of classifications of all data based on that cue alone (that is, ignoring all other cues). If the cue is numeric, the threshold is a number. If the cue is a factor, the threshold is one or more factor levels.
2. Rank cues in order of their highest balanced accuracy value, calculated using the classification threshold determined in step 1.
3. Create all possible trees by varying the exit direction (left or right) at each level to a maximum of X levels (default of
4. Reduce the size of trees by removing (pruning) lower levels containing less than X% (default of
First, we’ll calculate a classification threshold for each cue using cuerank():
heartdisease.ca <- FFTrees::cuerank(formula = diagnosis ~ .,
                                    data = heartdisease)

# Print key results
heartdisease.ca[c("cue", "threshold", "direction", "bacc")]
##         cue            threshold direction      bacc
##  1      age                   55         > 0.6347166
##  2      sex                    0         > 0.6295841
##  3       cp                    a         = 0.7587954
##  4 trestbps                  140         > 0.5579707
##  5     chol                  242         > 0.5677092
##  6      fbs                    0         > 0.5090147
##  7  restecg hypertrophy,abnormal         = 0.5881953
##  8  thalach                  148         < 0.7042902
##  9    exang                    0         > 0.7032593
## 10  oldpeak                  0.8         > 0.6978856
## 11    slope            flat,down         = 0.6936743
## 12       ca                    0         > 0.7308738
## 13     thal                rd,fd         = 0.7596508
Here, we see the best decision threshold for each cue that maximizes its balanced accuracy (bacc) when applied to the entire dataset (independently of other cues). For example, for the age cue, the best threshold is age > 55, which leads to a balanced accuracy of 0.63. In other words, if we only had the age cue, then the best decision rule is: “If age > 55, predict heart disease; otherwise, predict no heart disease”.
Let’s confirm that this threshold makes sense. To do this, we can plot the bacc value for all possible thresholds as in Figure @ref(fig:agethreshold):
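A threshold scan of this kind can be sketched in base R. The block below is only an illustration under stated assumptions: it uses simulated data rather than heartdisease, it considers only rules of the form "cue > threshold", and bacc_at() is a hypothetical helper, not a function from FFTrees.

```r
# Hypothetical helper: balanced accuracy of the rule
# "predict TRUE if cue > threshold"
bacc_at <- function(threshold, cue, criterion) {
  decision <- cue > threshold
  sens <- mean(decision[criterion])     # hit rate among true cases
  spec <- mean(!decision[!criterion])   # correct rejections among false cases
  (sens + spec) / 2
}

# Simulated data (an assumption for illustration, not the heartdisease data)
set.seed(1)
criterion <- rep(c(TRUE, FALSE), each = 100)
cue <- c(rnorm(100, mean = 60, sd = 8),   # cases where criterion is TRUE
         rnorm(100, mean = 50, sd = 8))   # cases where criterion is FALSE

# Evaluate every observed cue value as a candidate threshold
thresholds <- sort(unique(cue))
bacc <- vapply(thresholds, bacc_at, numeric(1),
               cue = cue, criterion = criterion)

# Threshold with the highest balanced accuracy
best <- thresholds[which.max(bacc)]
```

Calling plot(thresholds, bacc, type = "l") followed by abline(v = best, lty = 2) would then draw the accuracy curve with the best threshold marked.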
Next, the cues are ranked by their balanced accuracy. Let’s do that with the heart disease cues:
# Rank heartdisease cues by balanced accuracy
heartdisease.ca <- heartdisease.ca[order(heartdisease.ca$bacc, decreasing = TRUE), ]

# Print the key columns
heartdisease.ca[c("cue", "threshold", "direction", "bacc")]
##         cue            threshold direction      bacc
## 13     thal                rd,fd         = 0.7596508
##  3       cp                    a         = 0.7587954
## 12       ca                    0         > 0.7308738
##  8  thalach                  148         < 0.7042902
##  9    exang                    0         > 0.7032593
## 10  oldpeak                  0.8         > 0.6978856
## 11    slope            flat,down         = 0.6936743
##  1      age                   55         > 0.6347166
##  2      sex                    0         > 0.6295841
##  7  restecg hypertrophy,abnormal         = 0.5881953
##  5     chol                  242         > 0.5677092
##  4 trestbps                  140         > 0.5579707
##  6      fbs                    0         > 0.5090147
Now, we can see that the top five cues are thal, cp, ca, thalach, and exang. Because FFTs rarely exceed 5 cues, we can expect that the trees will use a subset (not necessarily all) of these 5 cues.
We can also plot the cue accuracies in ROC space using the showcues() function:
# Show the accuracy of cues in ROC space
showcues(cue.accuracies = heartdisease.ca)
grow.FFTrees() will grow several trees from these cues using different exit structures:
# Grow FFTs
heartdisease.ffts <- grow.FFTrees(formula = diagnosis ~ .,
                                  data = heartdisease)

# Print the tree definitions
heartdisease.ffts$tree.definitions
##   tree               cues nodes classes     exits    thresholds directions
## 1    1 thal;cp;ca;thalach     4 c;c;n;n 0;0;0;0.5 rd,fd;a;0;148    =;=;>;<
## 5    2 thal;cp;ca;thalach     4 c;c;n;n 0;0;1;0.5 rd,fd;a;0;148    =;=;>;<
## 3    3         thal;cp;ca     3   c;c;n   0;1;0.5     rd,fd;a;0      =;=;>
## 2    4         thal;cp;ca     3   c;c;n   1;0;0.5     rd,fd;a;0      =;=;>
## 6    5 thal;cp;ca;thalach     4 c;c;n;n 1;0;1;0.5 rd,fd;a;0;148    =;=;>;<
## 4    6 thal;cp;ca;thalach     4 c;c;n;n 1;1;0;0.5 rd,fd;a;0;148    =;=;>;<
## 7    7 thal;cp;ca;thalach     4 c;c;n;n 1;1;1;0.5 rd,fd;a;0;148    =;=;>;<
Here, we see that we have 7 different trees, each using some combination of the top 5 cues we identified earlier. For example, tree 1 uses the top 4 cues, while tree 3 uses only the top 3 cues. Why is that? The reason is that the algorithm also prunes lower branches of the tree if too few cases are classified at lower levels. By default, the algorithm will remove any lower levels that classify fewer than 10% of the original cases. The pruning criteria can be controlled using the
max.levels arguments in
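To make the exits column in the tree definitions more concrete, here is a minimal base R sketch of how the candidate exit structures could be enumerated before any pruning. It is not the package's implementation: enumerate_exits() is a hypothetical helper, and the 0/1/0.5 coding is simply copied from the output above (0 or 1 for a single exit direction, 0.5 for the final node, which exits both ways).

```r
# Hypothetical helper: list every exit structure for a tree with n.levels cues
enumerate_exits <- function(n.levels) {
  # Every node except the last exits in one direction, coded 0 or 1
  grid <- expand.grid(rep(list(0:1), n.levels - 1))
  # The last node exits in both directions, coded 0.5
  structures <- apply(grid, 1, function(x) paste(c(x, 0.5), collapse = ";"))
  unname(structures)
}

# Candidate exit structures for a 4-level tree
exit.structures <- enumerate_exits(4)
exit.structures
```

A 4-level tree has 2^3 = 8 candidate structures. Fewer distinct trees can remain after pruning, because structures that differ only in a removed lower level become identical.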
Now let’s use the wrapper function
FFTrees() to create the trees all at once. We will then plot tree #4 which, according to our results above, should contain the cues
thal, cp, and ca:
library(FFTrees)

# Create trees
heart.fft <- FFTrees(formula = diagnosis ~ .,
                     data = heartdisease)

# Plot tree #4
plot(heart.fft,
     stats = FALSE,  # Don't include statistics
     tree = 4)
algorithm = "c": The "c" algorithm is identical to the "m" algorithm with one (important) exception: in algorithm = "c", the thresholds and rankings of cues are recalculated for each level in the FFT, conditioned on the exemplars that were not classified at higher levels in the tree. For example, in the heart disease example, algorithm = "c" would first classify some cases using the thal cue at the first level, and would then calculate new accuracies for the remaining cues on the remaining cases that were not yet classified. This algorithm is appropriate for datasets where cue validities differ systematically across different (and predictable) subsets of data. However, because it calculates cue thresholds for increasingly smaller samples of data as the tree grows, it is also potentially more prone to overfitting than algorithm = "m".
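The conditional re-ranking idea can be sketched in base R as follows. This is a simplified illustration, not the FFTrees implementation. It rests on several assumptions: the data are simulated, all cues are numeric, every level exits the cases that fall above its threshold, and each cue is used at most once. best_rule() is a hypothetical helper.

```r
# Hypothetical helper: best "cue > threshold" rule by balanced accuracy
best_rule <- function(cue, criterion) {
  thresholds <- sort(unique(cue))
  bacc <- vapply(thresholds, function(th) {
    decision <- cue > th
    (mean(decision[criterion]) + mean(!decision[!criterion])) / 2
  }, numeric(1))
  list(threshold = thresholds[which.max(bacc)], bacc = max(bacc))
}

# Simulated data (an assumption for illustration)
set.seed(2)
n <- 300
cues <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n))
criterion <- (cues$a + 0.5 * cues$b + rnorm(n)) > 0

remaining <- rep(TRUE, n)   # cases not yet classified
used <- character(0)        # cues chosen so far, in tree order

for (level in 1:2) {
  # Re-rank the unused cues on the still-unclassified cases only
  rules <- lapply(cues[remaining, , drop = FALSE], best_rule,
                  criterion = criterion[remaining])
  bacc <- vapply(rules, function(r) r$bacc, numeric(1))
  top <- names(which.max(bacc))
  used <- c(used, top)
  # Assumed exit: cases above the threshold are classified and leave the tree
  remaining <- remaining & !(cues[[top]] > rules[[top]]$threshold)
  cues[[top]] <- NULL       # assumption: a cue is not reused at lower levels
}

used
sum(remaining)              # cases still unclassified after two levels
```

Each pass through the loop works with fewer cases than the one before, which is why thresholds estimated at lower levels rest on less data.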