Creating FFTs for heart disease

Nathaniel Phillips

2017-11-01

The following example follows the tutorial presented in Phillips et al. (2017), "FFTrees: A toolbox to create, visualize, and evaluate fast-and-frugal decision trees," published in Judgment and Decision Making.

Step 1: Install and load the FFTrees package

You can install FFTrees from CRAN using install.packages() (you only need to do this once):

# Install the package from CRAN
install.packages("FFTrees")

To use the package, load it into your current R session with library():

# Load the package
library(FFTrees)

The package contains several guides (like this one). To open the main guide, run FFTrees.guide():

# Open the main package guide
FFTrees.guide() 

Step 2: Create FFTs from training data (and test on testing data)

In this example, we will create FFTs from a heart disease data set. The training data are in an object called heart.train, and the testing data are in an object called heart.test. For these data, we will predict diagnosis, a binary criterion that indicates whether each patient does or does not have heart disease (i.e., is at high risk or low risk).
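Before fitting any trees, it can help to take a quick look at the training data (heart.train and heart.test ship with the FFTrees package; this is just an illustrative check):

```r
library(FFTrees)

# First few rows of the training data
head(heart.train)

# Distribution of the binary criterion we will predict
table(heart.train$diagnosis)
```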

To create the FFTrees object, we’ll use the function FFTrees() with two main arguments: formula, a formula indicating the binary criterion as a function of one or more predictors to be considered for the tree (the shorthand formula = diagnosis ~ . means to include all predictors), and data, the training data.

# Create an FFTrees object

heart.fft <- FFTrees(formula = diagnosis ~ .,           # Criterion and (all) predictors
                     data = heart.train,                # Training data
                     data.test = heart.test,            # Testing data
                     main = "Heart Disease",            # General label
                     decision.labels = c("Low-Risk", "High-Risk"))  # Labels for decisions

The resulting trees, decisions, and accuracy statistics are now stored in the FFTrees object called heart.fft.

Other arguments

Several additional arguments control how the trees are built; the pars line of the printed output in Step 3 shows their default values. Arguments such as goal, goal.chase, and max.levels apply to the "ifan" and "dfan" tree-construction algorithms only.

Step 3: Inspect and summarize FFTs

Now we can inspect and summarize the trees. We will start by printing the object to return basic information to the console:

heart.fft   # Print the object
## Heart Disease
## FFT #1 predicts diagnosis using 3 cues: {thal,cp,ca}
## 
## [1] If thal = {rd,fd}, predict High-Risk.
## [2] If cp != {a}, predict Low-Risk.
## [3] If ca <= 0, predict Low-Risk, otherwise, predict High-Risk.
## 
##                    train   test
## cases       :n    150.00 153.00
## speed       :mcu    1.74   1.73
## frugality   :pci    0.88   0.88
## accuracy    :acc    0.80   0.82
## weighted    :wacc   0.80   0.82
## sensitivity :sens   0.82   0.88
## specificity :spec   0.79   0.76
## 
## pars: algorithm = 'ifan', goal = 'wacc', goal.chase = 'bacc', sens.w = 0.5, max.levels = 4

The output tells us several pieces of information: the tree's cues (here, FFT #1 uses thal, cp, and ca), its decision rules, and accuracy statistics (such as sensitivity and specificity) for both the training and test data.

To summarize performance statistics for all trees in the object, use the summary() function:

# Print summary statistics of all trees
summary(heart.fft)
## $train
##   tree   n hi mi fa cr  sens  spec   ppv   npv    far   acc  bacc  wacc
## 1    1 150 54 12 18 66 0.818 0.786 0.750 0.846 0.2143 0.800 0.802 0.802
## 2    2 150 56 10 21 63 0.848 0.750 0.727 0.863 0.2500 0.793 0.799 0.799
## 3    3 150 44 22  7 77 0.667 0.917 0.863 0.778 0.0833 0.807 0.792 0.792
## 4    4 150 59  7 32 52 0.894 0.619 0.648 0.881 0.3810 0.740 0.756 0.756
## 5    5 150 28 38  2 82 0.424 0.976 0.933 0.683 0.0238 0.733 0.700 0.700
## 6    6 150 64  2 52 32 0.970 0.381 0.552 0.941 0.6190 0.640 0.675 0.675
## 7    7 150 21 45  0 84 0.318 1.000 1.000 0.651 0.0000 0.700 0.659 0.659
##     bpv dprime  cost   pci  mcu
## 1 0.798   1.70 0.200 0.876 1.74
## 2 0.795   1.70 0.207 0.869 1.84
## 3 0.820   1.81 0.193 0.889 1.56
## 4 0.765   1.55 0.260 0.849 2.12
## 5 0.808   1.79 0.267 0.879 1.70
## 6 0.746   1.57 0.360 0.836 2.30
## 7 0.826   2.05 0.300 0.864 1.90
## 
## $test
##   tree   n hi mi fa cr  sens  spec   ppv   npv   far   acc  bacc  wacc
## 1    1 153 64  9 19 61 0.877 0.762 0.771 0.871 0.237 0.817 0.820 0.820
## 2    2 153 66  7 24 56 0.904 0.700 0.733 0.889 0.300 0.797 0.802 0.802
## 3    3 153 49 24  8 72 0.671 0.900 0.860 0.750 0.100 0.791 0.786 0.786
## 4    4 153 69  4 35 45 0.945 0.562 0.663 0.918 0.438 0.745 0.754 0.754
## 5    5 153 28 45  0 80 0.384 1.000 1.000 0.640 0.000 0.706 0.692 0.692
## 6    6 153 72  1 56 24 0.986 0.300 0.562 0.960 0.700 0.627 0.643 0.643
## 7    7 153 23 50  0 80 0.315 1.000 1.000 0.615 0.000 0.673 0.658 0.658
##     bpv dprime  cost   pci  mcu
## 1 0.821   1.87 0.183 0.877 1.73
## 2 0.811   1.83 0.203 0.868 1.85
## 3 0.805   1.72 0.209 0.884 1.63
## 4 0.791   1.76 0.255 0.861 1.95
## 5 0.820   2.21 0.294 0.873 1.78
## 6 0.761   1.68 0.373 0.849 2.11
## 7 0.808   2.03 0.327 0.859 1.97

All statistics can be derived from a 2 x 2 confusion table like the one below. For definitions of all accuracy statistics, look at the accuracy statistic definitions vignette.
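As a quick check, tree #1's training statistics from the summary above can be reproduced by hand from its four confusion-table frequencies (hi = 54, mi = 12, fa = 18, cr = 66):

```r
# Reproduce tree #1's training statistics from its 2 x 2 frequencies
hi <- 54   # hits:               truly high-risk, predicted high-risk
mi <- 12   # misses:             truly high-risk, predicted low-risk
fa <- 18   # false alarms:       truly low-risk,  predicted high-risk
cr <- 66   # correct rejections: truly low-risk,  predicted low-risk

sens <- hi / (hi + mi)                   # sensitivity: 54 / 66  = 0.818
spec <- cr / (cr + fa)                   # specificity: 66 / 84  = 0.786
acc  <- (hi + cr) / (hi + mi + fa + cr)  # accuracy:   120 / 150 = 0.800
bacc <- (sens + spec) / 2                # balanced accuracy     = 0.802
```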

Confusion table illustrating frequencies of the 4 possible outcomes.

Step 4: Visualize the final FFT

Plotting FFTrees

To visualize a tree, use plot():

# Plot the best FFT when applied to the test data

plot(heart.fft,              # An FFTrees object
     data = "test")          # Which data to plot? "train" or "test"

Other arguments

# Plot only the tree without accuracy statistics
plot(heart.fft, 
     stats = FALSE)

# Show marginal cue accuracies in ROC space
plot(heart.fft, 
     what = "cues")

Additional Steps

Accessing outputs

An FFTrees object contains many different outputs. To see them all, run names():

# Show the names of all of the outputs in heart.fft

names(heart.fft)
##  [1] "formula"          "data.desc"        "cue.accuracies"  
##  [4] "tree.definitions" "tree.stats"       "cost"            
##  [7] "level.stats"      "decision"         "levelout"        
## [10] "tree.max"         "inwords"          "auc"             
## [13] "params"           "comp"             "data"

Here is a brief description of each of the outputs:

Output Description
formula The formula used to generate the object
data.desc Descriptions of the original training and test data
cue.accuracies Cue thresholds and accuracies
tree.definitions Definitions of all trees, including cues, thresholds and exit directions
tree.stats Performance statistics for trees
cost Cost statistics for each case and tree
level.stats Cumulative performance statistics for all trees
decision Classification decisions
levelout The level at which each case is classified
tree.max The best-performing training tree in the object
inwords A verbal description of the trees
auc Area under the curve statistics
params A list of parameters used in building the trees
comp Models and statistics for competing algorithms (e.g., regression, (non-frugal) decision trees, support vector machines)
data The original training and test data
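Since an FFTrees object is stored as a list, any of these outputs can be extracted with $. For example:

```r
# Extract individual outputs from the FFTrees object
heart.fft$tree.definitions   # cues, thresholds, and exit directions for each tree
heart.fft$auc                # area under the curve statistics
```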

Predicting new data

To predict classifications for a new dataset, use the standard predict() function. For example, here's how to predict the classifications for the data in the heartdisease object (which is simply a combination of heart.train and heart.test):

# Predict classifications for a new dataset
predict(heart.fft, 
        data = heartdisease)
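predict() returns one classification decision per case, so you can store the result and, for example, tabulate the decisions (a minimal sketch):

```r
# Store the predicted classifications and count the decisions
heart.pred <- predict(heart.fft, data = heartdisease)
table(heart.pred)
```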

Defining an FFT in words

If you want to define a specific FFT and apply that tree to data, you can specify it verbally using the my.tree argument:

# Create an FFT manually
my.heart.fft <- FFTrees(formula = diagnosis ~.,
                        data = heart.train,
                        data.test = heart.test,
                        main = "My custom Heart Disease FFT",
                        my.tree = "If chol > 350, predict True. 
                                   If cp != {a}, predict False. 
                                   If age <= 35, predict False. Otherwise, predict True")

Here is the result. (It's actually not too bad, although the first node is pretty worthless.)

plot(my.heart.fft)

Create a forest of FFTs

The FFForest() function conducts a bootstrapped simulation on the training data, creating a forest of many FFTs. This can give you insight into how important different cues are in the dataset:

# Create an FFForest object (can take a few minutes)
heart.fff <- FFForest(diagnosis ~., 
                      data = heartdisease, 
                      ntree = 100, 
                      train.p = .5, 
                      cpus = 4)

Plotting the result shows cue importance and co-occurrence relationships:

plot(heart.fff)

Here, we see that the three cues cp, thal, and ca occur most often in the forest and thus appear to be the three most important cues in the dataset.

References

Martignon, Laura, Konstantinos V Katsikopoulos, and Jan K Woike. 2008. “Categorization with Limited Resources: A Family of Simple Heuristics.” Journal of Mathematical Psychology 52 (6). Elsevier: 352–61.

Phillips, Nathaniel, Hansjoerg Neth, Wolfgang Gaissmaier, and Jan Woike. 2017. "FFTrees: A Toolbox to Create, Visualize, and Evaluate Fast-and-Frugal Decision Trees." Judgment and Decision Making 12 (4): 344–68.