Creating FFTs

Nathaniel Phillips

2017-05-04

The FFTrees() function is at the heart of the FFTrees package. The function takes a training dataset as an argument and generates several fast and frugal trees that attempt to classify cases into one of two classes based on cues (a.k.a. features).

heartdisease example

Let’s start with an example: we’ll create FFTrees fitted to the heartdisease dataset. This dataset contains data from 303 patients suspected of having heart disease. Here’s how the dataset looks:

head(heartdisease)
##   age sex cp trestbps chol fbs     restecg thalach exang oldpeak slope ca
## 1  63   1 ta      145  233   1 hypertrophy     150     0     2.3  down  0
## 2  67   1  a      160  286   0 hypertrophy     108     1     1.5  flat  3
## 3  67   1  a      120  229   0 hypertrophy     129     1     2.6  flat  2
## 4  37   1 np      130  250   0      normal     187     0     3.5  down  0
## 5  41   0 aa      130  204   0 hypertrophy     172     0     1.4    up  0
## 6  56   1 aa      120  236   0      normal     178     0     0.8    up  0
##     thal diagnosis
## 1     fd         0
## 2 normal         1
## 3     rd         1
## 4 normal         0
## 5 normal         0
## 6 normal         0

The critical dependent variable is diagnosis, which indicates whether a patient has heart disease or not. The other variables in the dataset (e.g., sex, age, and several biological measurements) will be used as predictors.

Now we’ll split the original dataset into a training dataset and a testing dataset. We will create the trees with the training set, then test their performance in the test dataset:

set.seed(100) # For replication
heart.rand <- heartdisease[sample(nrow(heartdisease)),]
heart.train <- heart.rand[1:150,]
heart.test <- heart.rand[151:303,]
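
As a quick sanity check on the indexing above, the same shuffle-and-split pattern can be sketched on a stand-in data frame. The toy data frame and its id column are hypothetical; only the row count (303, as implied by the 151:303 index) is taken from the example.

```r
# Stand-in for heartdisease: only the number of rows matters here (assumption)
toy <- data.frame(id = 1:303)

set.seed(100)  # for replication
toy.rand  <- toy[sample(nrow(toy)), , drop = FALSE]  # shuffle the rows
toy.train <- toy.rand[1:150, , drop = FALSE]
toy.test  <- toy.rand[151:303, , drop = FALSE]

# The two sets should be disjoint and together cover every row
n.train <- nrow(toy.train)
n.test  <- nrow(toy.test)
overlap <- length(intersect(toy.train$id, toy.test$id))
c(train = n.train, test = n.test, overlap = overlap)
```

Because the rows are shuffled once and then cut by position, no case can land in both sets.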

We’ll create a new FFTrees object called heart.fft using the FFTrees() function. We’ll specify diagnosis as the (binary) dependent variable, and include all independent variables with formula = diagnosis ~ .:

heart.fft <- FFTrees(formula = diagnosis ~.,
                    data = heart.train,
                    data.test = heart.test)

Elements of an FFTrees object

FFTrees() returns an object with the FFTrees class. An FFTrees object has many elements. Here are their names:

names(heart.fft)
##  [1] "formula"          "data.desc"        "cue.accuracies"  
##  [4] "tree.definitions" "tree.stats"       "level.stats"     
##  [7] "decision"         "levelout"         "auc"             
## [10] "params"           "comp"             "data"

Printing an FFTrees object

You can view basic information about the FFTrees object by printing its name. This will give you a quick summary of the object, including how many trees it has, which cues the tree(s) use, and how well they performed.

heart.fft
## [1] "7 FFTs predicting diagnosis"
## [1] "FFT #4 {thal,cp,ca} maximizes training wacc:"
##                    train   test
## cases       :n    150.00 153.00
## speed       :mcu    1.74   1.73
## frugality   :pci    0.88   0.88
## accuracy    :acc    0.80   0.82
## balanced    :bacc   0.80   0.82
## weighted    :wacc   0.80   0.82
## sensitivity :sens   0.82   0.88
## specificity :spec   0.79   0.76

Cue accuracy statistics: cue.accuracies

You can obtain marginal cue accuracy statistics from the cue.accuracies list. The list contains dataframes with marginal cue accuracies. That is, for each cue, the threshold that maximizes the v-statistic (HR - FAR) in the training dataset is chosen. If the object has test data, you can see the marginal cue accuracies in the test dataset (using the thresholds calculated from the training data):

heart.fft$cue.accuracies
## $train
##         cue     class            threshold direction   n hi mi fa cr
## 1       age   numeric                   54         > 150 47 19 31 53
## 2       sex   numeric                    0         > 150 53 13 48 36
## 3        cp character                    a         = 150 48 18 18 66
## 4  trestbps   numeric                  138         > 150 26 40 21 63
## 5      chol   numeric                  223         > 150 49 17 51 33
## 6       fbs   numeric                    0         > 150 10 56  9 75
## 7   restecg character hypertrophy,abnormal         = 150 40 26 34 50
## 8   thalach   numeric                  156         < 150 45 21 29 55
## 9     exang   numeric                    0         > 150 31 35 14 70
## 10  oldpeak   numeric                  0.9         > 150 41 25 21 63
## 11    slope character            flat,down         = 150 45 21 27 57
## 12       ca   numeric                    0         > 150 47 19 19 65
## 13     thal character                rd,fd         = 150 47 19 16 68
##         sens      spec       far       acc      bacc      wacc    dprime
## 1  0.7121212 0.6309524 0.3690476 0.6666667 0.6715368 0.6715368 0.8939691
## 2  0.8030303 0.4285714 0.5714286 0.5933333 0.6158009 0.6158009 0.6724827
## 3  0.7272727 0.7857143 0.2142857 0.7600000 0.7564935 0.7564935 1.3962240
## 4  0.3939394 0.7500000 0.2500000 0.5933333 0.5719697 0.5719697 0.4054236
## 5  0.7424242 0.3928571 0.6071429 0.5466667 0.5676407 0.5676407 0.3789573
## 6  0.1515152 0.8928571 0.1071429 0.5666667 0.5221861 0.5221861 0.2119100
## 7  0.6060606 0.5952381 0.4047619 0.6000000 0.6006494 0.6006494 0.5101065
## 8  0.6818182 0.6547619 0.3452381 0.6666667 0.6682900 0.6682900 0.8709980
## 9  0.4696970 0.8333333 0.1666667 0.6733333 0.6515152 0.6515152 0.8913899
## 10 0.6212121 0.7500000 0.2500000 0.6933333 0.6856061 0.6856061 0.9831556
## 11 0.6818182 0.6785714 0.3214286 0.6800000 0.6801948 0.6801948 0.9364969
## 12 0.7121212 0.7738095 0.2261905 0.7466667 0.7429654 0.7429654 1.3110438
## 13 0.7121212 0.8095238 0.1904762 0.7666667 0.7608225 0.7608225 1.4357351
## 
## $test
##         cue     class            threshold direction   n hi mi fa cr
## 1       age   numeric                   54         > 153 48 25 34 46
## 2       sex   numeric                    0         > 153 61 12 44 36
## 3        cp character                    a         = 153 57 16 21 59
## 4  trestbps   numeric                  138         > 153 28 45 23 57
## 5      chol   numeric                  223         > 153 51 22 47 33
## 6       fbs   numeric                    0         > 153 12 61 14 66
## 7   restecg character hypertrophy,abnormal         = 153  0 73  0 80
## 8   thalach   numeric                  156         < 153 56 17 33 47
## 9     exang   numeric                    0         > 153 45 28  9 71
## 10  oldpeak   numeric                  0.9         > 153 51 22 24 56
## 11    slope character            flat,down         = 153  0 73  0 80
## 12       ca   numeric                    0         > 153 46 27 15 65
## 13     thal character                rd,fd         = 153  0 73  0 80
##         sens   spec    far       acc      bacc      wacc      dprime
## 1  0.6575342 0.5750 0.4250 0.6143791 0.6162671 0.6162671  0.59486134
## 2  0.8356164 0.4500 0.5500 0.6339869 0.6428082 0.6428082  0.85093885
## 3  0.7808219 0.7375 0.2625 0.7581699 0.7591610 0.7591610  1.41062908
## 4  0.3835616 0.7125 0.2875 0.5555556 0.5480308 0.5480308  0.26456319
## 5  0.6986301 0.4125 0.5875 0.5490196 0.5555651 0.5555651  0.29934599
## 6  0.1643836 0.8250 0.1750 0.5098039 0.4946918 0.4946918 -0.04201090
## 7  0.0000000 1.0000 0.0000 0.5228758 0.5000000 0.5000000  0.03216494
## 8  0.7671233 0.5875 0.4125 0.6732026 0.6773116 0.6773116  0.95052459
## 9  0.6164384 0.8875 0.1125 0.7581699 0.7519692 0.7519692  1.50947947
## 10 0.6986301 0.7000 0.3000 0.6993464 0.6993151 0.6993151  1.04486521
## 11 0.0000000 1.0000 0.0000 0.5228758 0.5000000 0.5000000  0.03216494
## 12 0.6301370 0.8125 0.1875 0.7254902 0.7213185 0.7213185  1.21936274
## 13 0.0000000 1.0000 0.0000 0.5228758 0.5000000 0.5000000  0.03216494
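
The statistics in these tables follow from the four counts hi, mi, fa, and cr. The sketch below recomputes them in base R for the cue age in the training data, using the counts printed above. The d-prime line uses the standard signal-detection formula, which reproduces the printed value for this cue. The rows with sens = 0 in the test table suggest the package adjusts extreme rates, and that adjustment is not reproduced here.

```r
# Counts for the cue `age` in the training data, copied from the table above
hi <- 47; mi <- 19; fa <- 31; cr <- 53

sens <- hi / (hi + mi)                    # hit rate (sensitivity)
spec <- cr / (cr + fa)                    # specificity
far  <- 1 - spec                          # false-alarm rate
acc  <- (hi + cr) / (hi + mi + fa + cr)   # overall accuracy
bacc <- (sens + spec) / 2                 # balanced accuracy
v    <- sens - far                        # HR - FAR, the quantity maximized per cue

# Standard d-prime; undefined when sens or far is exactly 0 or 1
dprime <- qnorm(sens) - qnorm(far)

round(c(sens = sens, spec = spec, far = far, acc = acc,
        bacc = bacc, dprime = dprime), 7)
```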

You can also view the cue accuracies in an ROC-type plot with plot() combined with the what = "cues" argument:

plot(heart.fft, 
     main = "Heartdisease Cue Accuracy",
     what = "cues")

Tree definitions and accuracy statistics

The tree.definitions dataframe contains definitions (cues, classes, exits, thresholds, and directions) of all trees in the object:

heart.fft$tree.definitions
##   tree               cues nodes classes     exits    thresholds directions
## 1    1 thal;cp;ca;oldpeak     4 c;c;n;n 0;0;0;0.5 rd,fd;a;0;0.9    =;=;>;>
## 5    2         thal;cp;ca     3   c;c;n   0;0;0.5     rd,fd;a;0      =;=;>
## 3    3         thal;cp;ca     3   c;c;n   0;1;0.5     rd,fd;a;0      =;=;>
## 2    4         thal;cp;ca     3   c;c;n   1;0;0.5     rd,fd;a;0      =;=;>
## 6    5 thal;cp;ca;oldpeak     4 c;c;n;n 1;0;1;0.5 rd,fd;a;0;0.9    =;=;>;>
## 4    6 thal;cp;ca;oldpeak     4 c;c;n;n 1;1;0;0.5 rd,fd;a;0;0.9    =;=;>;>
## 7    7 thal;cp;ca;oldpeak     4 c;c;n;n 1;1;1;0.5 rd,fd;a;0;0.9    =;=;>;>
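
To make the notation concrete, here is a hand-written sketch of tree 4 (cues thal;cp;ca, exits 1;0;0.5) applied to the six patients shown by head(heartdisease). The function apply_tree4() is hypothetical and not part of the package. The reading of the exits column is an assumption: 1 means decide TRUE when the node's condition holds, 0 means decide FALSE when it fails, and 0.5 marks the final node, which decides either way.

```r
# First six patients, copied from head(heartdisease) above
# (only the cues that tree 4 uses)
patients <- data.frame(
  thal = c("fd", "normal", "rd", "normal", "normal", "normal"),
  cp   = c("ta", "a", "a", "np", "aa", "aa"),
  ca   = c(0, 3, 2, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Hypothetical hand-coded version of tree 4 under the assumed reading of exits
apply_tree4 <- function(thal, cp, ca) {
  if (thal %in% c("rd", "fd")) return(TRUE)   # node 1 (exit 1): condition holds -> TRUE
  if (cp != "a")               return(FALSE)  # node 2 (exit 0): condition fails -> FALSE
  ca > 0                                      # node 3 (exit 0.5): decides both ways
}

decisions <- mapply(apply_tree4, patients$thal, patients$cp, patients$ca,
                    USE.NAMES = FALSE)
decisions
```

Under this reading, the six decisions agree with the first six values returned by predict() in the prediction section of this document.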

The tree.stats list contains classification statistics for all trees applied to both the training data (tree.stats$train) and the test data (tree.stats$test). Here are the training statistics:

heart.fft$tree.stats$train
##   tree               cues nodes classes     exits    thresholds directions
## 1    1 thal;cp;ca;oldpeak     4 c;c;n;n 0;0;0;0.5 rd,fd;a;0;0.9    =;=;>;>
## 2    2         thal;cp;ca     3   c;c;n   0;0;0.5     rd,fd;a;0      =;=;>
## 3    3         thal;cp;ca     3   c;c;n   0;1;0.5     rd,fd;a;0      =;=;>
## 4    4         thal;cp;ca     3   c;c;n   1;0;0.5     rd,fd;a;0      =;=;>
## 5    5 thal;cp;ca;oldpeak     4 c;c;n;n 1;0;1;0.5 rd,fd;a;0;0.9    =;=;>;>
## 6    6 thal;cp;ca;oldpeak     4 c;c;n;n 1;1;0;0.5 rd,fd;a;0;0.9    =;=;>;>
## 7    7 thal;cp;ca;oldpeak     4 c;c;n;n 1;1;1;0.5 rd,fd;a;0;0.9    =;=;>;>
##     n hi mi fa cr      sens      spec        far       acc      bacc
## 1 150 21 45  0 84 0.3181818 1.0000000 0.00000000 0.7000000 0.6590909
## 2 150 28 38  2 82 0.4242424 0.9761905 0.02380952 0.7333333 0.7002165
## 3 150 44 22  7 77 0.6666667 0.9166667 0.08333333 0.8066667 0.7916667
## 4 150 54 12 18 66 0.8181818 0.7857143 0.21428571 0.8000000 0.8019481
## 5 150 56 10 21 63 0.8484848 0.7500000 0.25000000 0.7933333 0.7992424
## 6 150 59  7 32 52 0.8939394 0.6190476 0.38095238 0.7400000 0.7564935
## 7 150 64  2 52 32 0.9696970 0.3809524 0.61904762 0.6400000 0.6753247
##        wacc   dprime       pci  mcu
## 1 0.6590909 2.053928 0.8642857 1.90
## 2 0.7002165 1.789700 0.8785714 1.70
## 3 0.7916667 1.813721 0.8885714 1.56
## 4 0.8019481 1.700096 0.8757143 1.74
## 5 0.7992424 1.704447 0.8685714 1.84
## 6 0.7564935 1.550734 0.8485714 2.12
## 7 0.6753247 1.573378 0.8357143 2.30
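
The derived columns can be rebuilt from the counts, which also shows which tree the printed summary picks as best. This is a sketch on numbers copied from the table above. The weight sens.w = 0.5 is an assumption (with equal weights, wacc equals bacc, as in the table). The pci formula 1 - mcu / 14 is inferred from the printed values, where 14 is the number of columns in the data frame; it is not taken from the package documentation.

```r
# Training statistics copied from heart.fft$tree.stats$train above
stats <- data.frame(
  tree = 1:7,
  hi   = c(21, 28, 44, 54, 56, 59, 64),
  mi   = c(45, 38, 22, 12, 10,  7,  2),
  fa   = c( 0,  2,  7, 18, 21, 32, 52),
  cr   = c(84, 82, 77, 66, 63, 52, 32),
  mcu  = c(1.90, 1.70, 1.56, 1.74, 1.84, 2.12, 2.30)
)

sens.w <- 0.5  # assumed sensitivity weight; 0.5 makes wacc equal to bacc
stats$sens <- stats$hi / (stats$hi + stats$mi)
stats$spec <- stats$cr / (stats$cr + stats$fa)
stats$wacc <- sens.w * stats$sens + (1 - sens.w) * stats$spec

# pci as inferred from the printed numbers (assumption, see above)
stats$pci <- 1 - stats$mcu / 14

# Tree with the highest weighted accuracy in training
best.tree <- stats$tree[which.max(stats$wacc)]
best.tree
```

The tree selected this way is the one named in the printed summary of heart.fft.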

decision

The decision list contains the raw classification decisions for each tree for each training (and test) case.

Here is how each tree classified the first five cases in the training data:

heart.fft$decision$train[1:5,]
##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
## [3,] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
## [4,] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
## [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
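
Each column of this matrix belongs to one tree, so column sums count how often each tree decided TRUE. The sketch below rebuilds the five printed rows by hand. With the full object, the same call would presumably work on heart.fft$decision$train directly.

```r
# The five rows of heart.fft$decision$train printed above, rebuilt by hand
dec <- matrix(c(
  FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
  FALSE, FALSE, FALSE, FALSE, FALSE, TRUE,  TRUE,
  FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE,
  FALSE, FALSE, FALSE, TRUE,  TRUE,  TRUE,  TRUE,
  FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE
), nrow = 5, byrow = TRUE)

# Number of TRUE decisions per tree (columns are trees 1 to 7)
n.true <- colSums(dec)
n.true
```

The counts never decrease from tree 1 to tree 7, which fits the training statistics, where false alarms rise from 0 for tree 1 to 52 for tree 7.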

levelout

The levelout list contains the levels at which each case was classified for each tree.

Here are the levels at which the first 5 test cases were classified:

heart.fft$levelout$test[1:5,]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]    1    1    1    2    2    3    4
## [2,]    2    2    3    1    1    1    1
## [3,]    4    3    2    1    1    1    1
## [4,]    4    3    2    1    1    1    1
## [5,]    1    1    1    3    4    2    2
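
If mcu stands for mean cues used, as the abbreviation suggests (an assumption), then averaging a column of levelout should give that tree's mcu. The sketch below does this for the five printed rows only, so the values will not match the full-sample statistics.

```r
# The five rows of heart.fft$levelout$test printed above, rebuilt by hand
lev <- matrix(c(
  1, 1, 1, 2, 2, 3, 4,
  2, 2, 3, 1, 1, 1, 1,
  4, 3, 2, 1, 1, 1, 1,
  4, 3, 2, 1, 1, 1, 1,
  1, 1, 1, 3, 4, 2, 2
), nrow = 5, byrow = TRUE)

# Average level at which these five cases were classified, per tree
mean.level <- colMeans(lev)
mean.level
```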

Predicting new data with predict()

Once you’ve created an FFTrees object, you can use it to predict new data using predict(). In this example, I’ll use the heart.fft object to make predictions for cases 1 through 50 in the heartdisease dataset (see ?predict.FFTrees for how to specify which tree is used):

predict(heart.fft,
        data = heartdisease[1:50,])
##  [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
## [12] FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE
## [34]  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE
## [45] FALSE  TRUE FALSE  TRUE FALSE FALSE
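
Predictions can be scored against the known diagnosis by counting hits, misses, false alarms, and correct rejections. Here is a minimal sketch for the first six cases only, using the predictions printed above and the diagnosis values from head(heartdisease). Some of these cases may have been in the training set, so this is not an out-of-sample estimate.

```r
# First six predictions (from the output above) and the matching diagnosis
# values (from head(heartdisease) earlier in this document)
pred  <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
truth <- c(0, 1, 1, 0, 0, 0) == 1

counts <- c(
  hi = sum(pred & truth),     # hits
  mi = sum(!pred & truth),    # misses
  fa = sum(pred & !truth),    # false alarms
  cr = sum(!pred & !truth)    # correct rejections
)
counts
```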

Visualizing trees

Once you’ve created an FFTrees object using FFTrees(), you can visualize the tree (and ROC curves) using plot(). The following code will visualize the best training tree (tree 4, according to the printed summary above) applied to the test data:

plot(heart.fft,
     main = "Heart Disease",
     decision.names = c("Healthy", "Disease"))

See the vignette on plotting trees for more details on visualizing trees.

Additional arguments

The FFTrees() function has several additional arguments that change how trees are built.