Which observations are outliers?

Regression use case - dragons data

To illustrate applications of auditor to regression problems, we will use the artificial dragons dataset available in the DALEX package. Our goal is to predict the life length of dragons.

library(DALEX)
data("dragons")
head(dragons)
##   year_of_birth   height   weight scars colour year_of_discovery
## 1         -1291 59.40365 15.32391     7    red              1700
## 2          1589 46.21374 11.80819     5    red              1700
## 3          1528 49.17233 13.34482     6    red              1700
## 4          1645 48.29177 13.27427     5  green              1700
## 5            -8 49.99679 13.08757     1    red              1700
## 6           915 45.40876 11.48717     2    red              1700
##   number_of_lost_teeth life_length
## 1                   25   1368.4331
## 2                   28   1377.0474
## 3                   38   1603.9632
## 4                   33   1434.4222
## 5                   18    985.4905
## 6                   20    969.5682

Models

Linear model

lm_model <- lm(life_length ~ ., data = dragons)

Random forest

library("randomForest")
set.seed(59)
rf_model <- randomForest(life_length ~ ., data = dragons)

Preparation for error analysis

Each analysis begins with the creation of an explainer object using the DALEX package. The explainer wraps the model, the data, and the target values, and is the object that auditor uses to audit a model.

lm_exp <- DALEX::explain(lm_model, label = "lm", data = dragons, y = dragons$life_length)
## Preparation of a new explainer is initiated
##   -> model label       :  lm 
##   -> data              :  2000  rows  8  cols 
##   -> target variable   :  2000  values 
##   -> predict function  :  yhat.lm  will be used (default)
##   -> predicted values  :  numerical, min =  540.9447 , mean =  1370.986 , max =  3925.691  
##   -> residual function :  difference between y and yhat (default)
##   -> residuals         :  numerical, min =  -108.2062 , mean =  -3.701928e-12 , max =  113.8603  
## A new explainer has been created!
rf_exp <- DALEX::explain(rf_model, label = "rf", data = dragons, y = dragons$life_length)
## Preparation of a new explainer is initiated
##   -> model label       :  rf 
##   -> data              :  2000  rows  8  cols 
##   -> target variable   :  2000  values 
##   -> predict function  :  yhat.randomForest  will be used (default)
##   -> predicted values  :  numerical, min =  610.9752 , mean =  1370.181 , max =  3292.296  
##   -> residual function :  difference between y and yhat (default)
##   -> residuals         :  numerical, min =  -135.4756 , mean =  0.8047108 , max =  720.0888  
## A new explainer has been created!

Audit of observations

In this section we give a short overview of the visual validation of model errors and propose scores for that validation. auditor helps to answer questions that may be crucial for further analyses.

In the following sections, we give an overview of auditor functions for the analysis of model residuals. They are discussed in alphabetical order.

model_cooksdistance()

First, we need to create an auditor_model_cooksdistance object.

library(auditor)
lm_cd <- model_cooksdistance(lm_exp)

Some plots may require a specified variable or fitted values for the model_residual object.

Cook's distances

Cook's distance is used to estimate the influence of a single observation. It is a tool for identifying observations that may negatively affect the model.

Data points indicated by Cook's distances are worth checking for validity. Cook's distances may also be used to indicate regions of the design space where it would be good to obtain more observations.

Cook's distances are calculated by removing the i-th observation from the data and refitting the model. The distance for the i-th observation shows how much all the fitted values of the model change when that observation is removed.
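For reference, the usual textbook form of this definition for a linear model can be written as below. The symbols are assumptions introduced here and do not come from the vignette: n is the number of observations, p the number of estimated coefficients, s^2 the residual variance estimate of the full model, and \hat{y}_{j(i)} the prediction for observation j from the model fitted without observation i.

```latex
D_i = \frac{\sum_{j=1}^{n} \left( \hat{y}_j - \hat{y}_{j(i)} \right)^2}{p \, s^2}
```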

In the case of models of classes other than lm and glm the distances are computed directly from the definition, so this may take a while. In this example we will compute them for a linear model.
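To make "directly from the definition" concrete, here is a minimal sketch in base R that refits a linear model once per observation and compares the result with stats::cooks.distance(). It is an illustration, not the code auditor runs internally. The 200-row subset and the reduced formula with numeric predictors only are assumptions chosen to keep the loop fast and the fit full rank.

```r
library(DALEX)
data("dragons")

# Assumption: a small subset and numeric predictors only, to keep refits cheap.
dragons_small <- dragons[1:200, ]
fit <- lm(life_length ~ height + weight + scars + number_of_lost_teeth,
          data = dragons_small)

n     <- nrow(dragons_small)
p     <- fit$rank                                # number of estimated coefficients
s2    <- sum(residuals(fit)^2) / df.residual(fit) # residual variance of the full model
y_hat <- fitted(fit)

# Leave-one-out loop: drop row i, refit, predict for all rows, and measure
# how far the fitted values moved.
cooks_by_definition <- sapply(seq_len(n), function(i) {
  fit_i   <- lm(formula(fit), data = dragons_small[-i, ])
  y_hat_i <- predict(fit_i, newdata = dragons_small)
  sum((y_hat - y_hat_i)^2) / (p * s2)
})

# For lm the closed-form values should agree up to numerical error.
all.equal(unname(cooks_by_definition), unname(cooks.distance(fit)))
```

The loop needs n model fits, which is why the computation can be slow for models that have no closed-form shortcut.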

plot(lm_cd)
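The observations that stand out in such a plot can also be listed numerically. The sketch below uses stats::cooks.distance() from base R on the same linear model. The choice of five rows and the 4 / n cutoff are assumptions for illustration: the cutoff is a common rule of thumb, not a formal test and not something auditor prescribes.

```r
library(DALEX)
data("dragons")
lm_model <- lm(life_length ~ ., data = dragons)

# Closed-form Cook's distances for the linear model.
cd <- cooks.distance(lm_model)

# Five observations with the largest distances, shown with a few columns.
top <- order(cd, decreasing = TRUE)[1:5]
cbind(dragons[top, c("height", "weight", "life_length")],
      cooks_distance = cd[top])

# Rule-of-thumb flag: distances above 4 / n deserve a closer look.
flagged <- which(cd > 4 / nrow(dragons))
length(flagged)
```

Rows flagged this way are candidates for a validity check, not automatic deletions.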


Other methods

Other methods and plots are described in vignettes: