Using mipred for the analysis of a binary outcome.

Bart J. A. Mertens

2019-07-05

Vignette Info

This vignette is part of the R package mipred. It documents data analysis for the calibration of predictive models using mipred functions for binary outcome prediction when predictors contain missing values.

Outline and objectives

The package mipred contains two basic functions. The first is mipred.cv, which estimates cross-validated predictions using multiple imputation when predictors contain missing values. The second is mipred, which applies the same methodology to predict outcomes for novel observations, based on past calibration data. Both the new observations and the calibration data may contain missing values in the predictors. This document describes data analysis approaches using the mipred package functions for the above objectives. We first discuss cross-validation of prediction with mipred.cv, using both the averaging and rubin methods described by Mertens, Banzato, and Wreede (2018), to estimate the expected prediction performance on future data. We subsequently describe use of the mipred function to estimate predictions on new patient data, based on past data. Finally, mipred package functionality and options are discussed.

library(mipred)

Data preparation

The CLL data are included in the mipred package as a data frame. Refer to the published paper (Mertens, Banzato, and Wreede 2018) for more detail.

data(cll)
head(cll)
#>   id age10     perfstat remstat           cyto          asct
#> 1  1  0.08         <NA>      CR          other no prior ASCT
#> 2  2  0.82 Karnofsky 90      PR no abnormality no prior ASCT
#> 3  3 -0.28 Karnofsky 80   SD/PD          other no prior ASCT
#> 4  4 -0.40 Karnofsky 90      PR          other no prior ASCT
#> 5  5 -0.36         <NA>   SD/PD         del11q no prior ASCT
#> 6  6  0.89 Karnofsky 90      PR           <NA>    prior ASCT
#>             donor          sex_match cond srv5y srv5y_s
#> 1 Matched Related PATfemaleDONfemale  NMA 60.02       0
#> 2      matched UD   PATfemaleDONmale  RIC 39.97       0
#> 3      matched UD PATfemaleDONfemale  NMA  8.09       1
#> 4      matched UD   PATfemaleDONmale  RIC 47.79       0
#> 5 Matched Related   PATmaleDONfemale  RIC  6.11       1
#> 6      matched UD     PATmaleDONmale  RIC 31.36       0

The CLL data concern a survival problem subject to (right) censoring. To obtain the same data as discussed in the paper for binary outcome analysis, we first apply administrative censoring at 12 months and generate the one-year binary outcome as below.

cll_bin <- cll # Generate a new data copy
cll_bin$srv5y_s[cll_bin$srv5y>12] <- 0  # Apply administrative censoring at t=12 months
cll_bin$srv5y[cll_bin$srv5y>12] <- 12

cll_bin$Status[cll_bin$srv5y_s==1] <- 1  # Define the new binary "Status" outcome variable
cll_bin$Status[cll_bin$srv5y_s==0] <- 0  # Encoding is 1:Dead, 0:Alive

Cross-tabulate the number of patients artificially censored at 12 months versus the new binary outcome status indicator.

binary.outcome <- cll_bin$Status
artificially.censored <- (cll_bin$Status==0 & cll_bin$srv5y<12)
tableout <- table(artificially.censored, binary.outcome)
tableout
#>                      binary.outcome
#> artificially.censored   0   1
#>                 FALSE 465 184
#>                 TRUE   45   0

The percentage artificially censored is 100*45/694 = 6.5%, where 694 = 465+184+45 is the total sample size. There are 184 events, corresponding to 26.5 percent of the dataset.

Remove the original survival outcomes from the data frame before proceeding.

cll_bin$srv5y <- NULL # Remove the original censoring and follow-up time 
cll_bin$srv5y_s <- NULL

This completely defines the data.frame for use in further analysis. Aside from the (binary) outcome variable (Status), we have a single continuous predictor (age10); all other predictor variables (7 in total) are categorical, which gives 8 predictors in total. The first column contains an identification variable (id) with a unique value for each of the 694 individuals.

Data exploration

It makes sense to first check the missing data pattern as well as some simple exploratory summaries of the predictors. Load the mice package if it is not already loaded.

library(mice)
md.pattern(cll_bin, rotate.names = TRUE)

#>     id age10 asct donor Status sex_match cond remstat perfstat cyto    
#> 453  1     1    1     1      1         1    1       1        1    1   0
#> 137  1     1    1     1      1         1    1       1        1    0   1
#> 36   1     1    1     1      1         1    1       1        0    1   1
#> 12   1     1    1     1      1         1    1       1        0    0   2
#> 27   1     1    1     1      1         1    1       0        1    1   1
#> 10   1     1    1     1      1         1    1       0        1    0   2
#> 1    1     1    1     1      1         1    1       0        0    1   2
#> 2    1     1    1     1      1         1    1       0        0    0   3
#> 1    1     1    1     1      1         1    0       1        0    1   2
#> 5    1     1    1     1      1         1    0       1        0    0   3
#> 1    1     1    1     1      1         1    0       0        0    1   3
#> 1    1     1    1     1      1         1    0       0        0    0   4
#> 1    1     1    1     1      1         0    1       1        1    1   1
#> 3    1     1    1     1      1         0    1       1        1    0   2
#> 2    1     1    1     1      1         0    1       1        0    1   2
#> 1    1     1    1     1      1         0    1       1        0    0   3
#> 1    1     1    1     1      1         0    0       1        0    1   3
#>      0     0    0     0      0         8    9      42       63  171 293

There are 293 missing values in total. There are 241=694-453 records containing missing values. Most missing values occur in variables cyto (171), perfstat (63) and remstat (42). Age is the only continuous predictor. Since we wish to investigate predictions and (cross-)validation in particular, it makes sense to explore the distribution of factor levels for the categorical predictors.

table(cll_bin$cyto)
#> 
#>         del17p         del11q          other no abnormality 
#>            144            142            166             71
table(cll_bin$perfstat)
#> 
#>  Karnofsky 100   Karnofsky 90   Karnofsky 80 Karnofsky <=70 
#>            186            307            113             25
table(cll_bin$remstat)
#> 
#>    CR    PR SD/PD 
#>    84   344   224
table(cll_bin$asct)
#> 
#> no prior ASCT    prior ASCT 
#>           620            74
table(cll_bin$donor)
#> 
#>         Matched Related              matched UD partially mismatched UD 
#>                     283                     327                      84
table(cll_bin$sex_match)
#> 
#>     PATmaleDONmale   PATmaleDONfemale   PATfemaleDONmale 
#>                363                145                 98 
#> PATfemaleDONfemale 
#>                 80
table(cll_bin$cond)
#> 
#> NMA RIC MAC 
#> 223 360 102

These tables by default exclude the missing values. Note how the lowest factor level of perfstat (Karnofsky <=70) is very sparse, with only 25 observations. Similar output which retains the information on missing values can also be obtained from

summary(cll_bin)
#>       id                age10                    perfstat    remstat   
#>  Length:694         Min.   :-3.58000   Karnofsky 100 :186   CR   : 84  
#>  Class :character   1st Qu.:-0.58750   Karnofsky 90  :307   PR   :344  
#>  Mode  :character   Median : 0.04000   Karnofsky 80  :113   SD/PD:224  
#>                     Mean   :-0.06703   Karnofsky <=70: 25   NA's : 42  
#>                     3rd Qu.: 0.51000   NA's          : 63              
#>                     Max.   : 1.91000                                   
#>              cyto                asct                         donor    
#>  del17p        :144   no prior ASCT:620   Matched Related        :283  
#>  del11q        :142   prior ASCT   : 74   matched UD             :327  
#>  other         :166                       partially mismatched UD: 84  
#>  no abnormality: 71                                                    
#>  NA's          :171                                                    
#>                                                                        
#>               sex_match     cond         Status      
#>  PATmaleDONmale    :363   NMA :223   Min.   :0.0000  
#>  PATmaleDONfemale  :145   RIC :360   1st Qu.:0.0000  
#>  PATfemaleDONmale  : 98   MAC :102   Median :0.0000  
#>  PATfemaleDONfemale: 80   NA's:  9   Mean   :0.2651  
#>  NA's              :  8              3rd Qu.:1.0000  
#>                                      Max.   :1.0000

Model assessment via internal validation

Estimate predictive performance using mipred.cv functions

We first estimate prediction results using internal validation on the cll_bin data, using all predictors and running 10-fold cross-validation (argument folds) with 10 imputations (argument nimp) for each prediction. Note that the -1 in cll_bin[,-1] removes the first column, which contains the observation identification number.
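A minimal sketch of this call is shown below; the object name predcv_av and the seed are choices made here for illustration.

set.seed(1234)  # arbitrary seed, for reproducibility only
predcv_av <- mipred.cv(Status ~ ., family = binomial, data = cll_bin[, -1],
  nimp = 10, folds = 10)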

The above code implements cross-validation for the averaging approach as described by Mertens, Banzato, and Wreede (2018). In brief, to generate 10 imputations, 10 copies of the data matrix are made and a separate (in this case 10-fold) fold-partition is defined for each copy. Within each copy, each fold is considered in turn as a validation set (with the remainder of the data as calibration set) and the outcomes in the left-out fold are removed (artificially set to missing). A single imputation is then generated for this data matrix. The imputed training portion of the data is used to fit a (logistic regression) model, which is then applied to the imputed data in the left-out (validation) fold. This process is repeated for all folds. Applying the above calculation to all matrix copies yields 10 cross-validated predictions for each observation, each based on a single imputation.

The returned object is a list containing three elements: the function call, the predictions on the response scale (fitted probabilities in this case, in matrix pred) and the linear predictors (linpred). The prediction method used is the averaging method by default. Both pred and linpred are matrices of 694 rows by 10 columns, with each column containing prediction results from a model calibrated on a single imputation. Rows correspond to the observations in the data matrix, in the same order. The cross-validation fold-definitions differ from column to column in the averaging approach. The matrices pred and linpred are linked by the logit function (the default link of the family specification family = binomial).

To obtain predictions calculated from pooled Rubin’s rules models, we need to set the method option explicitly to “rubin”.
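A sketch of the corresponding call (again with an assumed object name), followed by an inspection of the first rows of the fitted probability matrix:

predcv_rb <- mipred.cv(Status ~ ., family = binomial, data = cll_bin[, -1],
  nimp = 10, folds = 10, method = "rubin")
head(predcv_rb$pred)  # first observations; one column per imputation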

For both the “rubin” and “averaging” methods, the above code generates 10 predictions per observation (one for each of the 10 imputations). With the averaging method, these will typically differ, because a separate prediction model is calibrated on each singly-imputed dataset. The “rubin” approach uses the pooled model as the single prediction model. Nevertheless, any record to be predicted which itself contains missing data must also be imputed, so there will generally also be 10 distinct predictions for such observations with the (pooled) rubin approach. Fully observed records will have identical predictions across all imputations, as the same pooled model is applied to the same fully observed record of predictors.

Note how in the above prediction matrices pred and linpred the entries in rows 2-4 are the same across all 10 columns, because these predictions are calculated from the same pooled model on observations without missing values. Observations 1, 5 and 6, however, do contain missing values, and hence the imputations for these missing data change across the 10 predictions, even though the prediction model itself is fixed. The final prediction for any record can be obtained by taking means, medians or another suitable statistic across the distinct predictions. Here we use means, as shown below; other applications may well require another combination rule.
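A sketch of this combination step, assuming the object names introduced above:

pred_av_comb <- rowMeans(predcv_av$pred)  # combined predictions, averaging method
pred_rb_comb <- rowMeans(predcv_rb$pred)  # combined predictions, rubin method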

We can investigate some plots to get a feel for the calibrated predictions. Before proceeding, we first generate a missingness indicator for each observation, to distinguish predictions on fully observed records from those on records containing missing values in tables and figures.
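One possible construction of such an indicator is shown below; the name miss matches its use in the plotting code later in this vignette.

# 1 if a record contains at least one missing predictor value, 0 otherwise
miss <- as.numeric(!complete.cases(cll_bin[, -1]))
table(miss)  # 241 incomplete records expected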

The figure below plots the combined predictions versus the status outcome and compares predictions between the averaging and rubin combination methods. Plotted predictions are colored red for observations containing missing predictor data and black otherwise. In the second row of plots, the left-most figure plots predictions from the rubin method versus those from the averaging approach. The right-most figure plots the differences between predictions from the two approaches versus the average of the combined predictions from the rubin and averaging methods.

Predicted values from the averaging and rubin methods based on 10 imputations.

The combined predictions obtained from the mean appear similar between the two approaches, but some variation between the methods' predictions is visible. Most of this variation is due to between-imputation variation. This raises some obvious questions.

  1. How large is the variation of predictions associated with a specific calibration approach (either averaging or rubin)?
  2. Does this variation differ between the two methodologies?
  3. To what extent can such variation be reduced by increasing the number of imputations?
  4. Finally, can we estimate the number of imputations sufficient to reduce the variation to acceptable levels?

We can investigate the impact of increasing the number of imputations for each prediction combination approach by running multiple replicates of the above analysis with different numbers of imputations. Note that the code below can take considerable time to run.
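The sketch below illustrates one way to organize these replications. The object name repslist_av and the loop structure are choices made here; the numbers of imputations (1, 10 and 100) and the 10 replicates match those discussed below. Each list component is a three-dimensional array (observations by imputations by replicates), the shape assumed by the plotting code that follows.

nreps <- 10             # number of replicate analyses per setting
nimps <- c(1, 10, 100)  # numbers of imputations investigated
repslist_av <- lapply(nimps, function(m) {
  preds <- array(NA, dim = c(nrow(cll_bin), m, nreps))
  for (r in 1:nreps) {
    preds[, , r] <- mipred.cv(Status ~ ., family = binomial,
      data = cll_bin[, -1], nimp = m, folds = 10)$pred
  }
  preds
})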

Now run the same analysis for the “rubin” method.
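The same sketch, now with method = "rubin":

repslist_rb <- lapply(nimps, function(m) {
  preds <- array(NA, dim = c(nrow(cll_bin), m, nreps))
  for (r in 1:nreps) {
    preds[, , r] <- mipred.cv(Status ~ ., family = binomial,
      data = cll_bin[, -1], nimp = m, folds = 10, method = "rubin")$pred
  }
  preds
})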

Summarize results using plots

We first investigate the averaging method. To investigate the spread of individual predictions calculated from models on a single imputation, we first plot these constituent (individual) predictions (from repslist_av) versus the combined predictions (the means) for the averaging method. Use distinct plotting symbols for predictions based on fully observed records versus records containing missing values. Use only the first replicate for nimp=100 (the third component of repslist_av, corresponding to results for 100 imputations). Calculate the final prediction for each patient for this replicate by combining the individual predictions using means.
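A sketch of this step (object names assumed):

preds_av <- repslist_av[[3]][, , 1]  # first replicate of the nimp = 100 analysis
meanpreds_av <- rowMeans(preds_av)   # combined prediction per patient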

Now make the plot.
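A sketch of the plotting code, in the same style as the plotting blocks shown later in this vignette; red points mark records containing missing values.

par(mfrow = c(1, 2), pty = "s")
# Left: individual single-imputation predictions versus the mean prediction
matplot(meanpreds_av, preds_av, pch = 19, type = "n",
  xlab = "mean probability", ylab = "individual imputed predictions")
matpoints(meanpreds_av[miss == 1], preds_av[miss == 1, ], col = 2, pch = 1)
matpoints(meanpreds_av[miss == 0], preds_av[miss == 0, ], col = 1, pch = 1)
# Right: mean-centered predictions versus the mean prediction
matplot(meanpreds_av, preds_av - meanpreds_av, pch = 19, type = "n",
  xlab = "mean probability", ylab = "mean-centered predictions")
matpoints(meanpreds_av[miss == 1], (preds_av - meanpreds_av)[miss == 1, ],
  col = 2, pch = 1)
matpoints(meanpreds_av[miss == 0], (preds_av - meanpreds_av)[miss == 0, ],
  col = 1, pch = 1)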

The left plot shows individual predictions (from single-imputation-based models) versus the mean prediction. The right plot shows mean-centered predictions versus the mean prediction. As can be seen, the individual predictions have very high levels of variation around the mean prediction. Predictions on partially observed records (plotted red) are much more variable than those on fully observed records. Variation tends to decrease as the mean prediction approaches either 0 or 1.

Now investigate the rubin method results in the same manner. Again consider only a single replicate for nimp=100. Plot the constituent (individual) predictions (from repslist_rb) versus the combined predictions (the means) for the rubin method. Use distinct plotting symbols for predictions based on fully observed records versus records containing missing values.

First, calculate the final prediction for each patient for this replicate by combining the individual predictions using means.
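A sketch, as before:

preds_rb <- repslist_rb[[3]][, , 1]  # first replicate of the nimp = 100 analysis
meanpreds_rb <- rowMeans(preds_rb)   # combined prediction per patient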

Now make the plots.
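The same plotting sketch, applied to the rubin results:

par(mfrow = c(1, 2), pty = "s")
matplot(meanpreds_rb, preds_rb, pch = 19, type = "n",
  xlab = "mean probability", ylab = "individual imputed predictions")
matpoints(meanpreds_rb[miss == 1], preds_rb[miss == 1, ], col = 2, pch = 1)
matpoints(meanpreds_rb[miss == 0], preds_rb[miss == 0, ], col = 1, pch = 1)
matplot(meanpreds_rb, preds_rb - meanpreds_rb, pch = 19, type = "n",
  xlab = "mean probability", ylab = "mean-centered predictions")
matpoints(meanpreds_rb[miss == 1], (preds_rb - meanpreds_rb)[miss == 1, ],
  col = 2, pch = 1)
matpoints(meanpreds_rb[miss == 0], (preds_rb - meanpreds_rb)[miss == 0, ],
  col = 1, pch = 1)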

The left plot shows individual predictions (from the Rubin-rules pooled model applied to each of the, possibly imputed, records) versus the mean prediction. The right plot shows mean-centered predictions versus the mean prediction. As can be seen, the individual predictions have very high levels of variation (around the mean level) when predicting on partially observed records (those containing missing values). Note how there is now no variability between imputations in predictions on fully observed records, because we are predicting from a single pooled model. Predictions on partially observed records (plotted red) are however variable, because the missing data must be imputed each time we predict (so there are nimp=100 distinct predictions in that case). Variation tends to decrease as the mean prediction approaches either 0 or 1.

Next we study the variation between predictions obtained from the averaging method when the analysis is replicated with the same number of imputations but distinct imputation draws. We repeat this analysis for nimp=1, nimp=10 and nimp=100 imputations. In other words, we replicate the prediction analysis and investigate how increasing the number of imputations affects the stability of the predictions.

In the plots below, the top row shows the replicated predictions versus the mean prediction and the bottom row shows the mean-centered replicate predictions versus the mean prediction. First calculate the mean across all imputations and replications.
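The plotting code below expects these per-observation means in the vectors avcv_means1 to avcv_means3 (one per imputation setting); a sketch:

avcv_means1 <- apply(repslist_av[[1]], 1, mean)  # nimp = 1
avcv_means2 <- apply(repslist_av[[2]], 1, mean)  # nimp = 10
avcv_means3 <- apply(repslist_av[[3]], 1, mean)  # nimp = 100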

Now make the plots.

# Averaging method: columns are nimp = 1, 10, 100; red = records with missing values
par(mfrow=c(2,3),pty="s")
matplot(avcv_means1,apply(repslist_av[[1]],c(1,3),mean),pch=19,type="n",
  xlab="ordered mean probability",ylab="individual imputed predictions")
matpoints(avcv_means1[miss==1],apply(repslist_av[[1]][miss==1,,,drop=FALSE],c(1,3),mean),
  col=2,pch=1)
matpoints(avcv_means1[miss==0],apply(repslist_av[[1]][miss==0,,,drop=FALSE],c(1,3),mean),
  col=1,pch=1)
title('single imputation')

matplot(avcv_means2,apply(repslist_av[[2]],c(1,3),mean),pch=19,type="n",
  xlab="ordered mean probability",ylab="individual imputed predictions")
matpoints(avcv_means2[miss==1],apply(repslist_av[[2]][miss==1,,],c(1,3),mean),
  col=2,pch=1)
matpoints(avcv_means2[miss==0],apply(repslist_av[[2]][miss==0,,],c(1,3),mean),
  col=1,pch=1)
title('10 imputations')

matplot(avcv_means3,apply(repslist_av[[3]],c(1,3),mean),pch=19,type="n",
  xlab="ordered mean probability",ylab="individual imputed predictions")
matpoints(avcv_means3[miss==1],apply(repslist_av[[3]][miss==1,,],c(1,3),mean),
  col=2,pch=1)
matpoints(avcv_means3[miss==0],apply(repslist_av[[3]][miss==0,,],c(1,3),mean),col=1,pch=1)
title('100 imputations')

matplot(avcv_means1,apply(repslist_av[[1]],c(1,3),mean)-avcv_means1%*%matrix(1,nrow=1,ncol=10),
  pch=19,type="n",xlab="ordered mean probability",ylab="individual imputed predictions")
matpoints(avcv_means1[miss==1],
  apply(repslist_av[[1]][miss==1,,,drop=FALSE],c(1,3),mean)-
    avcv_means1[miss==1]%*%matrix(1,nrow=1,ncol=10),
  col=2,pch=1)
matpoints(avcv_means1[miss==0],
  apply(repslist_av[[1]][miss==0,,,drop=FALSE],c(1,3),mean)-
    avcv_means1[miss==0]%*%matrix(1,nrow=1,ncol=10),
  col=1,pch=1)

matplot(avcv_means2,
  apply(repslist_av[[2]],c(1,3),mean)-avcv_means2%*%matrix(1,nrow=1,ncol=10),
  pch=19,type="n",xlab="ordered mean probability",ylab="individual imputed predictions")
matpoints(avcv_means2[miss==1],
  apply(repslist_av[[2]][miss==1,,],c(1,3),mean)-
    avcv_means2[miss==1]%*%matrix(1,nrow=1,ncol=10),
  col=2,pch=1)
matpoints(avcv_means2[miss==0],
  apply(repslist_av[[2]][miss==0,,],c(1,3),mean)-
    avcv_means2[miss==0]%*%matrix(1,nrow=1,ncol=10),
  col=1,pch=1)

matplot(avcv_means3,
  apply(repslist_av[[3]],c(1,3),mean)-avcv_means3%*%matrix(1,nrow=1,ncol=10),
  pch=19,type="n",xlab="ordered mean probability",ylab="individual imputed predictions")
matpoints(avcv_means3[miss==1],
  apply(repslist_av[[3]][miss==1,,],c(1,3),mean)-
    avcv_means3[miss==1]%*%matrix(1,nrow=1,ncol=10),
  col=2,pch=1)
matpoints(avcv_means3[miss==0],
  apply(repslist_av[[3]][miss==0,,],c(1,3),mean)-
    avcv_means3[miss==0]%*%matrix(1,nrow=1,ncol=10),
  col=1,pch=1)

We observe that predictions on partially observed records have higher levels of (between-replication) variation than predictions on fully observed data. Variation reduces substantially as the number of imputations increases. Observations with mean predicted probabilities close to either 0 or 1 have lower between-replication variation.

Next investigate the dispersion between predictions for replicates of the analyses with different numbers of imputations, for the rubin method. The top row of plots shows replicate predictions versus the mean prediction; the bottom row shows mean-centered replicate predictions versus the mean prediction. First calculate the mean across all imputations and replications.
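A sketch, mirroring the averaging calculation:

rbcv_means1 <- apply(repslist_rb[[1]], 1, mean)  # nimp = 1
rbcv_means2 <- apply(repslist_rb[[2]], 1, mean)  # nimp = 10
rbcv_means3 <- apply(repslist_rb[[3]], 1, mean)  # nimp = 100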

Now make the plots.

# Rubin method: columns are nimp = 1, 10, 100; red = records with missing values
par(mfrow=c(2,3),pty="s")
matplot(rbcv_means1,apply(repslist_rb[[1]],c(1,3),mean),
  pch=19,type="n",xlab="ordered mean probability",ylab="individual imputed predictions")
matpoints(rbcv_means1[miss==1],apply(repslist_rb[[1]][miss==1,,,drop=FALSE],c(1,3),mean),
  col=2,pch=1)
matpoints(rbcv_means1[miss==0],apply(repslist_rb[[1]][miss==0,,,drop=FALSE],c(1,3),mean),
  col=1,pch=1)
title('single imputation')

matplot(rbcv_means2,apply(repslist_rb[[2]],c(1,3),mean),
  pch=19,type="n",xlab="ordered mean probability",ylab="individual imputed predictions")
matpoints(rbcv_means2[miss==1],apply(repslist_rb[[2]][miss==1,,],c(1,3),mean),col=2,pch=1)
matpoints(rbcv_means2[miss==0],apply(repslist_rb[[2]][miss==0,,],c(1,3),mean),col=1,pch=1)
title('10 imputations')

matplot(rbcv_means3,apply(repslist_rb[[3]],c(1,3),mean),
  pch=19,type="n",xlab="ordered mean probability",ylab="individual imputed predictions")
matpoints(rbcv_means3[miss==1],apply(repslist_rb[[3]][miss==1,,],c(1,3),mean),col=2,pch=1)
matpoints(rbcv_means3[miss==0],apply(repslist_rb[[3]][miss==0,,],c(1,3),mean),col=1,pch=1)
title('100 imputations')

matplot(rbcv_means1,
  apply(repslist_rb[[1]],c(1,3),mean)-rbcv_means1%*%matrix(1,nrow=1,ncol=10),
  pch=19,type="n",xlab="ordered mean probability",ylab="individual imputed predictions")
matpoints(rbcv_means1[miss==1],
  apply(repslist_rb[[1]][miss==1,,,drop=FALSE],c(1,3),mean)-
    rbcv_means1[miss==1]%*%matrix(1,nrow=1,ncol=10),
  col=2,pch=1)
matpoints(rbcv_means1[miss==0],
  apply(repslist_rb[[1]][miss==0,,,drop=FALSE],c(1,3),mean)-
    rbcv_means1[miss==0]%*%matrix(1,nrow=1,ncol=10),
  col=1,pch=1)

matplot(rbcv_means2,
  apply(repslist_rb[[2]],c(1,3),mean)-rbcv_means2%*%matrix(1,nrow=1,ncol=10),
  pch=19,type="n",xlab="ordered mean probability",ylab="individual imputed predictions")
matpoints(rbcv_means2[miss==1],
  apply(repslist_rb[[2]][miss==1,,],c(1,3),mean)-
    rbcv_means2[miss==1]%*%matrix(1,nrow=1,ncol=10),
  col=2,pch=1)
matpoints(rbcv_means2[miss==0],
  apply(repslist_rb[[2]][miss==0,,],c(1,3),mean)-
    rbcv_means2[miss==0]%*%matrix(1,nrow=1,ncol=10),
  col=1,pch=1)

matplot(rbcv_means3,
  apply(repslist_rb[[3]],c(1,3),mean)-rbcv_means3%*%matrix(1,nrow=1,ncol=10),
  pch=19,type="n",xlab="ordered mean probability",ylab="individual imputed predictions")
matpoints(rbcv_means3[miss==1],
  apply(repslist_rb[[3]][miss==1,,],c(1,3),mean)-
    rbcv_means3[miss==1]%*%matrix(1,nrow=1,ncol=10),
  col=2,pch=1)
matpoints(rbcv_means3[miss==0],
  apply(repslist_rb[[3]][miss==0,,],c(1,3),mean)-
    rbcv_means3[miss==0]%*%matrix(1,nrow=1,ncol=10),
  col=1,pch=1)

We again observe that predictions on partially observed records have higher levels of (between-replication) variation than predictions on fully observed data. Variation reduces substantially as the number of imputations increases. Observations with mean predicted probabilities close to either 0 or 1 have lower between-replication variation. Note however that the dispersion appears larger than for the averaging method, and this effect is particularly noticeable for the higher numbers of imputations.

Summarize predictive performance using variance functions

To further study the variation due to imputation, we summarize it using the “R” statistics defined in the paper (Mertens, Banzato, and Wreede 2018). First define a function that calculates the required statistics on the replicated analyses.
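The exact definition of the R statistics is given in the paper; the sketch below (function name chosen here) is one plausible implementation consistent with the table columns shown further on: the range, 10th and 90th percentiles and median of the deviations of the per-replicate combined predictions from their mean across replicates, computed separately for records with and without missing values.

predsummary <- function(predarray, miss) {
  repmeans <- apply(predarray, c(1, 3), mean)  # observations x replicates
  dev <- repmeans - rowMeans(repmeans)         # deviation from mean across replicates
  t(sapply(c(1, 0), function(m) {
    d <- dev[miss == m, ]
    c(R = diff(range(d)), q10 = unname(quantile(d, 0.10)),
      median = median(d), q90 = unname(quantile(d, 0.90)),
      missing = m, replicates = ncol(repmeans), mean = mean(d))
  }))
}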

Now apply the function to the replicated results from the averaging approach and combine the summary measures in a matrix. Multiply the summary statistics by a factor of 100 so they can be read on a percentage scale.
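A sketch of this step, reusing the predsummary function above; only the deviation summaries are rescaled, so that the missing and replicates indicator columns remain readable.

summ_av <- rbind(predsummary(repslist_av[[1]], miss),
                 predsummary(repslist_av[[2]], miss),
                 predsummary(repslist_av[[3]], miss))
summ_av[, c(1:4, 7)] <- 100 * summ_av[, c(1:4, 7)]  # percentage scale
round(summ_av, 2)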

Now do the same for the replicated results from the rubin approach and combine the summary measures in a matrix. Again multiply the summary statistics by a factor of 100 so they can be read on a percentage scale.
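The same sketch for the rubin results:

summ_rb <- rbind(predsummary(repslist_rb[[1]], miss),
                 predsummary(repslist_rb[[2]], miss),
                 predsummary(repslist_rb[[3]], miss))
summ_rb[, c(1:4, 7)] <- 100 * summ_rb[, c(1:4, 7)]  # percentage scale
round(summ_rb, 2)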

Produce the variance “decay” plots for both approaches.
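One possible layout for these plots (the exact layout in the original analysis may differ): trace the R statistic against the number of imputations for both methods.

par(mfrow = c(1, 2), pty = "s")
matplot(log10(nimps), matrix(summ_av[, "R"], ncol = 2, byrow = TRUE),
  type = "b", pch = 1, xlab = "log10(number of imputations)",
  ylab = "R (percentage scale)", main = "averaging")
legend("topright", c("missing", "observed"), col = 1:2, lty = 1:2)
matplot(log10(nimps), matrix(summ_rb[, "R"], ncol = 2, byrow = TRUE),
  type = "b", pch = 1, xlab = "log10(number of imputations)",
  ylab = "R (percentage scale)", main = "rubin")
legend("topright", c("missing", "observed"), col = 1:2, lty = 1:2)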

Present the results as a table, for approach 1 (prediction-averaging).

Variance summaries of deviation relative to the mean prediction between replicate predictions for different imputations with the averaging approach. All statistics were multiplied by 100.
                       R (range)    q10  median   q90  missing  replicates  mean
nimp = 1    missing        15.37  -7.74   -0.36  7.63        1          10     0
nimp = 1    observed        9.13  -4.62    0.02  4.51        0          10     0
nimp = 10   missing         4.51  -2.26    0.01  2.24        1          10     0
nimp = 10   observed        2.84  -1.39   -0.02  1.45        0          10     0
nimp = 100  missing         1.43  -0.72   -0.01  0.72        1          10     0
nimp = 100  observed        0.90  -0.45    0.00  0.45        0          10     0

Present the results as a table, for approach 2 (rubin).

Variance summaries of deviation relative to the mean prediction between predictions for different imputations with the rubin approach. All statistics were multiplied by 100.
                       R (range)    q10  median   q90  missing  replicates  mean
nimp = 1    missing        14.56  -6.82   -0.53  7.74        1          10     0
nimp = 1    observed        9.06  -4.40   -0.04  4.66        0          10     0
nimp = 10   missing         7.72  -3.66   -0.17  4.06        1          10     0
nimp = 10   observed        7.27  -3.62   -0.01  3.65        0          10     0
nimp = 100  missing         6.17  -3.16    0.05  3.02        1          10     0
nimp = 100  observed        6.92  -3.45   -0.01  3.47        0          10     0

Next, we investigate Brier scores and AUCs. We use the pROC package for the AUC calculations. We again first define a function which calculates the required measures.
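A minimal sketch of such a function (the name accmeasures is chosen here; pROC::auc is the only non-base call):

library(pROC)
# Brier score and AUC for predicted probabilities against the binary outcome
accmeasures <- function(pred, outcome) {
  c(Brier = mean((pred - outcome)^2),
    AUC = as.numeric(auc(outcome, pred, quiet = TRUE)))
}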