Introduction to RFinfer

Cole Brokamp

2016-06-10

Overview

This vignette is an introductory overview of the RFinfer package for generating prediction variances from random forest predictions. The first section covers how to install the package in R. Next, we cover a simple example showing how to use the main function, rfPredVar(), to generate predictions with accompanying variance estimates. Finally, we cover a more in-depth example using NHANES data, comparing the different types of random forests and using plots to carry out some statistical inference. These inferences are compared to those from a standard linear model.

For a detailed introduction to the infinitesimal jackknife and how it is applied to the random forest to generate prediction variances, see the other vignette, titled Infinitesimal Jackknife.

Installation

Install the package from CRAN:

> install.packages('RFinfer')

Alternatively, install the latest version from GitHub:

> install_github('cole-brokamp/RFinfer')

Formerly, the package required a custom version of the randomForest package that counted the number of times each observation was selected in each resample, rather than only indicating whether it was selected at least once. This information is necessary for the implementation of the IJ and has since been included in the latest CRAN version of the randomForest package (v4.6-12). If you run into problems related to setting keep.inbag=TRUE, make sure you have the most recent CRAN version of the randomForest package installed (install.packages('randomForest')); however, it should be installed automatically if necessary when installing the RFinfer package.
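As a quick illustration of what this buys (a sketch assuming randomForest v4.6-12 or later), the in-bag matrix stored by keep.inbag=TRUE now holds resampling counts rather than 0/1 indicators:

```r
# Sketch: with keep.inbag = TRUE, the fitted forest stores an n x ntree
# matrix of in-bag counts. Under bootstrap resampling these counts can
# exceed 1 -- the information the infinitesimal jackknife needs -- rather
# than just indicating inclusion at least once.
library(randomForest)
set.seed(1)
d <- na.omit(airquality)
rf <- randomForest(Ozone ~ ., data = d, ntree = 25, keep.inbag = TRUE)
dim(rf$inbag)  # one row per observation, one column per tree
max(rf$inbag)  # greater than 1 under bootstrap (counts, not indicators)
```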

Random Forest Prediction Variance Function

At the heart of the package is the rfPredVar function, which trains a random forest using either conditional inference (CI) trees or traditional CART trees and produces prediction estimates with accompanying estimates of the prediction variance. The function takes a random forest trained with keep.inbag=TRUE in order to detect the resampling scheme (either bootstrap or subsample). This way, the user can supply the same random forest to rfPredVar and the same resamples will be used when training the forest with either type of tree, facilitating a true comparison. The function defaults to making predictions on the original training data, which is useful for making inferences about the model, but it can also be supplied with new data to predict on. The function returns a data.frame with the predictions and prediction variances. If requested in the call with CI=TRUE, the 95% confidence intervals will also be returned. Finally, since training forests with conditional inference trees can be computationally intensive, a progress bar can be requested with prog.bar=TRUE. See ?rfPredVar for more details, including examples of specifying forests built with subsampling instead of bootstrap sampling or with conditional inference trees instead of traditional CART trees.
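As a hedged sketch of these options (argument names as described above; the built-in airquality data stands in for any training set), a subsampled forest evaluated with conditional inference trees might be requested like this:

```r
# Sketch: a forest built with subsampling (replace = FALSE, sampsize < n)
# rather than the bootstrap; rfPredVar detects the scheme from the in-bag
# matrix and reuses the same resamples for the CI trees. A small ntree is
# used here only to keep the example fast.
library(RFinfer)
set.seed(7)
d <- na.omit(airquality)
rf <- randomForest(Ozone ~ ., data = d, ntree = 20, keep.inbag = TRUE,
                   sampsize = floor(nrow(d)^0.7), replace = FALSE)
preds <- rfPredVar(rf, rf.data = d, CI = TRUE,
                   tree.type = 'ci', prog.bar = FALSE)
str(preds)  # pred, pred.ij.var, and (with CI = TRUE) l.ci and u.ci
```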

Simple Example

First, load the package and some example data included with R. The New York Air Quality Measurements dataset contains daily ozone concentrations in New York from May to September of 1973, along with meteorological covariates (see ?airquality for more details). Because randomForest does not allow missing data, we will use only complete cases here for the sake of example. We'll also exclude two very high outliers.

> library(RFinfer)
> data('airquality')
> d.aq <- na.omit(airquality)
> d.aq <- d.aq[d.aq$Ozone < 100, ]

Next, train a random forest with the keep.inbag=TRUE option specified.

> rf <- randomForest(Ozone ~ .,data=d.aq,keep.inbag=TRUE)

Extract the prediction variances for the training data along with their 95% confidence intervals. The values are returned in a data.frame.

> rf.preds <- rfPredVar(rf,rf.data=d.aq,CI=TRUE)
> str(rf.preds)
## 'data.frame':    104 obs. of  4 variables:
##  $ pred       : num  37.2 29.5 19.8 21.9 23.9 ...
##  $ pred.ij.var: num  -1.29 15.7 20.31 4.94 3.78 ...
##  $ l.ci       : num  39.71 -1.25 -19.97 12.25 16.47 ...
##  $ u.ci       : num  34.7 60.3 59.6 31.6 31.3 ...

Plot the predictions with their 95% confidence intervals according to the actual values.

> library(ggplot2)
> ggplot(rf.preds,aes(d.aq$Ozone,pred)) +
+   geom_abline(intercept=0,slope=1,lty=2,color='#999999') +
+   geom_point() +
+   geom_errorbar(aes(ymin=l.ci,ymax=u.ci),width=0.15) +
+   xlab('Actual') + ylab('Predicted') +
+   theme_bw()

Here, we can see that the random forest is generally less confident about its more inaccurate predictions. This can also be visualized by plotting the prediction variance as a function of the prediction error.

> qplot(d.aq$Ozone - rf.preds$pred,rf.preds$pred.ij.var,
+      xlab='prediction error',ylab='prediction variance') + theme_bw()

Using Prediction Variance for Inference Example

Here we will explore a more in-depth example using real NHANES data, showing how the random forest prediction variance might be used for statistical inference. We will compare these inferences to those from a linear model.

Get NHANES Data

We will use the nhanesA package to retrieve information on BMI and some possibly related demographic characteristics.

> library(nhanesA)
> bmx_d <- nhanes('BMX_D')
## Processing SAS dataset BMX_D      ..
> demo_d <- nhanes('DEMO_D')
## Processing SAS dataset DEMO_D     ..
> d_merged <- merge(bmx_d,demo_d)
> d <- data.frame('bmi' = d_merged$BMXBMI,
+                 'age' = d_merged$RIDAGEYR,
+                 'race' = factor(d_merged$RIDRETH1),
+                 'poverty_income_ratio' = d_merged$INDFMPIR,
+                 'edu' = factor(d_merged$DMDHREDU,levels=1:9))

For the sake of this example, we will omit any observations with missing data and take a random subsample of size 100. Here, we have the numeric BMI value for each observation along with the subject's age in years, race and education (each defined as a categorical variable), and their income-to-poverty ratio.

> d <- na.omit(d)
> set.seed(24512399)
> d <- d[sample(1:nrow(d),size=100), ]
> summary(d)
##       bmi             age        race   poverty_income_ratio      edu    
##  Min.   :13.63   Min.   : 2.00   1:28   Min.   :0.000        4      :38  
##  1st Qu.:17.34   1st Qu.:10.00   2: 6   1st Qu.:1.147        5      :23  
##  Median :23.65   Median :19.00   3:34   Median :2.255        3      :19  
##  Mean   :23.77   Mean   :27.30   4:28   Mean   :2.524        1      :13  
##  3rd Qu.:28.68   3rd Qu.:41.25   5: 4   3rd Qu.:3.965        2      : 7  
##  Max.   :41.99   Max.   :85.00          Max.   :5.000        6      : 0  
##                                                              (Other): 0

Train Random Forests

Next, we will train an initial random forest and use the rfPredVar function to extract predictions and prediction variances for forests built using both CART and CI trees. Since we aim to plot the prediction confidence intervals, we supply the CI=TRUE option.

> library(RFinfer)
> rf <- randomForest(bmi ~ .,data=d,keep.inbag=TRUE,
+                    sampsize=floor(nrow(d)^0.7),replace=FALSE,ntree=5000)
> rf.preds_cart <- rfPredVar(rf,rf.data=d,CI=TRUE,tree.type='rf')
> rf.preds_ci <- rfPredVar(rf,rf.data=d,CI=TRUE,tree.type='ci',prog.bar=FALSE)

To plot these, we will bind both prediction data.frames together.

> preds <- rbind(data.frame(rf.preds_cart,'tree'='CART'),
+                data.frame(rf.preds_ci,'tree'='CI'))
> preds$pred.ij.var <- NULL

Fit Linear Model

Next, fit a linear model using the same data and add its predictions with intervals to the data.frame of model predictions. We specify interval='prediction' since we are interested in intervals for future observations rather than for the mean response.

> LM <- lm(bmi ~ .,data=d)
> LM.preds <- data.frame(predict(LM,interval='prediction'))
> names(LM.preds) <- c('pred','l.ci','u.ci')
> LM.preds$tree <- 'LM'
> preds <- rbind(preds,LM.preds)
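To see why interval='prediction' matters, a small base-R illustration (using the built-in cars data rather than the NHANES data) compares the two interval types:

```r
# Sketch: predict.lm() distinguishes a confidence interval (uncertainty in
# the mean response) from a prediction interval (uncertainty in a new
# observation). Prediction intervals are always wider, making them the
# fair comparison to per-observation random forest intervals.
fit <- lm(dist ~ speed, data = cars)
ci <- predict(fit, interval = 'confidence')
pi <- suppressWarnings(predict(fit, interval = 'prediction'))
all((pi[, 'upr'] - pi[, 'lwr']) > (ci[, 'upr'] - ci[, 'lwr']))  # TRUE
```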

Also add the measured BMI, the prediction deviance (predicted minus measured), the confidence interval length, and some of the covariates for plotting purposes.

> preds$measured <- d[ ,'bmi']
> 
> preds$deviance <- preds$pred - preds$measured
> preds$CI_length <- (preds$u.ci - preds$l.ci)
> 
> preds$age <- rep(d$age,3)

Compare Model Predictions

To compare the three model predictions, we can plot the predicted BMI according to the measured BMI for each model type, including the prediction confidence intervals.

> ggplot(preds,aes(measured,pred,color=tree)) +
+   geom_point(size=0.5) +
+   geom_abline(intercept=0,slope=1,lty=2,color='#999999') +
+   geom_errorbar(aes(ymin=l.ci,ymax=u.ci)) +
+   xlab('Actual') + ylab('Predicted') +
+   theme_bw() +
+   facet_grid(~tree)

To compare the model performances, calculate the root mean squared error.

> tapply(preds$deviance,preds$tree,function(x) sqrt(sum(x^2)/length(x)))
##     CART       CI       LM 
## 5.101114 5.111245 5.774192

As expected, the random forest models perform better than the linear model, and there doesn't seem to be a large difference between using CART or CI trees within the random forest. It is encouraging to see that the random forest is less confident about its less accurate predictions. To visualize this, we can plot the length of the confidence intervals against the prediction deviance. The same pattern does not hold for the linear model, where some of the highest uncertainty is associated with its most accurate predictions.

> ggplot(preds,aes(deviance,CI_length,color=tree)) +
+   geom_point() +
+   theme_bw() +
+   facet_wrap(~tree)

Interpreting Effect of Age on BMI

For a linear model, we can simply interpret the regression coefficient to estimate the effect of age on BMI.

> summary(LM)
## 
## Call:
## lm(formula = bmi ~ ., data = d)
## 
## Residuals:
## Body Mass Index (kg/m**2) 
##     Min      1Q  Median      3Q     Max 
## -12.315  -3.516  -1.024   2.800  21.307 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          16.85028    1.99268   8.456 5.01e-13 ***
## age                   0.16110    0.03059   5.267 9.56e-07 ***
## race2                 0.91166    2.82339   0.323   0.7475    
## race3                -2.94528    1.76826  -1.666   0.0993 .  
## race4                -1.62225    1.81169  -0.895   0.3730    
## race5                -4.78713    3.44657  -1.389   0.1683    
## poverty_income_ratio  0.66159    0.48804   1.356   0.1787    
## edu2                  2.18427    2.92488   0.747   0.4572    
## edu3                  4.59506    2.42236   1.897   0.0611 .  
## edu4                  2.30395    2.26884   1.015   0.3126    
## edu5                  2.37433    2.58314   0.919   0.3605    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.121 on 89 degrees of freedom
## Multiple R-squared:  0.2947, Adjusted R-squared:  0.2155 
## F-statistic: 3.719 on 10 and 89 DF,  p-value: 0.0003416

We can see that age is indeed a significant predictor of BMI, with an estimated increase of 0.16 in BMI per one-year increase in age. However, we don't necessarily expect age to have an exactly linear relationship with BMI.

Since we do not have model coefficients for the random forest models, we can instead use the predictions to interpret the effect of age on BMI.

> ggplot(preds,aes(age,pred,color=tree)) +
+   geom_point(size=0.5) +
+   geom_smooth() +
+   geom_errorbar(aes(ymin=l.ci,ymax=u.ci)) +
+   xlab('Age') + ylab('Predicted BMI') +
+   theme_bw() +
+   facet_grid(~tree)

Here, we can see that the relationship between age and BMI flattens out at about age 30 and may even become negative in later life (after age 60). Although the random forest predictions cannot be directly interpreted like a regression coefficient, where "all other variables are held constant," because the other variables also change with age, we can still see how our sample behaves with changing age. The linear model predictions do not reveal the same nuanced relationship between age and BMI.

To make the interpretations of the random forest more like regression coefficients, we can predict BMI over a range of ages while holding the other covariates at fixed values. Make sure to define the classes of the objects in newdat so that they match the classes of the data used to train the original random forest.

> newdat <- expand.grid('age'=2:85,
+                       'race'=factor(1,levels=1:5),
+                       'poverty_income_ratio'=median(d$poverty_income_ratio),
+                       'edu'=factor(5,levels=1:9))
> class(newdat$age) <- c('labelled','integer')
> class(newdat$poverty_income_ratio) <- c('labelled','numeric')
> newdat.preds <- rfPredVar(rf,rf.data=d,pred.data=newdat,
+                           CI=TRUE,tree.type='ci',prog.bar=FALSE)
> newdat.preds.plot <- cbind(newdat,newdat.preds)

Now, we can plot the predictions for different values of age at fixed values of race, education, and income-to-poverty ratio. A locally smoothed polynomial regression fit is overlaid for interpretation purposes. Although this prediction occurs at fixed values of the other covariates, it is still more insightful than a regression coefficient that assumes a linear relationship. Furthermore, we did not have to explore higher-order terms or interactions and make the multiple model comparisons that would be required to uncover this relationship using a linear model.

> ggplot(newdat.preds.plot,aes(age,pred)) +
+   geom_point(size=0.5) +
+   geom_smooth() +
+   geom_errorbar(aes(ymin=l.ci,ymax=u.ci)) +
+   xlab('Age') + ylab('Predicted BMI') +
+   theme_bw()