Preprocomb

Markus Vattulainen

2016-06-26

Preprocessing is often the most time-consuming phase in data analysis, and preprocessing transformations are interdependent in unexpected ways. This package helps to make preprocessing faster and more effective. It provides an S4 framework for creating and evaluating preprocessing combinations for classification, clustering and outlier detection. The framework supports adding user-defined preprocessors and preprocessing phases. Default preprocessors can be used for low variance removal, missing value imputation, scaling, outlier removal, noise smoothing, feature selection and class imbalance correction.

Test data

Let’s start by adding contaminations to the iris data to simulate the need for preprocessing:

set.seed(1)
testdata <- iris
testdata[sample(1:150,40),3] <- NA # add missing values to the third variable
testdata[,4] <- rnorm(150, testdata[,4], 2) # add noise to the fourth variable
testdata$Irrelevant <- runif(150, 0, 1) # add an irrelevant feature
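A quick sanity check of the contaminated data can be useful before moving on. The sketch below uses only base R and repeats the setup so that it runs on its own; the names na_count and noise_sd are introduced here for illustration and are not part of the package.

```r
set.seed(1)
testdata <- iris
testdata[sample(1:150, 40), 3] <- NA          # missing values in the third variable
testdata[, 4] <- rnorm(150, testdata[, 4], 2) # noise in the fourth variable
testdata$Irrelevant <- runif(150, 0, 1)       # irrelevant feature

# sample() draws without replacement, so exactly 40 rows are affected
na_count <- sum(is.na(testdata[, 3]))

# spread of the added noise relative to the original petal widths
noise_sd <- sd(testdata[, 4] - iris[, 4])

na_count
noise_sd
dim(testdata)
```

The noise standard deviation of 2 is large compared to the petal widths themselves, which is why the fourth variable is a natural candidate for smoothing.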

Interactive mode

In the interactive mode, preprocessing techniques can be applied in sequence with the function prepro(). The resulting object contains the preprocessing call history, computations and the fitness of the preprocessed data for model fitting. In the example below, missing values are first imputed with meanimpute and then outliers are removed with the ORH algorithm (orhoutlier). The support vector machine svmRadial from the kernlab package is used as the classifier. The default classifier is rpart from the rpart package.

library(preprocomb)
step1 <- prepro(testdata, "meanimpute", model="svmRadial")
step2 <- prepro(step1, "orhoutlier", model="svmRadial")
step2
## # OBJECT: 
## # class: orhoutlier 
## # call history: meanimpute orhoutlier 
## 
## # COMPUTATIONS: 
## # classification accuracy: 0.81 
## # hopkins statistic, clustering tendency: 0.31 
## # skewness of ORH scores, outlier tendency: -0.22 
## 
## # FITNESS FOR MODEL FITTING: 
## # variance in all variables: TRUE 
## # only finite values: TRUE 
## # complete observations: TRUE 
## # class balance: TRUE 
## # n to p ratio more than 2: TRUE 
## # 3 or more predictors and more than 20 observations: TRUE
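Most of the fitness checks printed above can be approximated in base R. The following is a rough sketch only: the exact definitions used by the package are assumptions inferred from the printed labels, the mean imputation step is an assumed stand-in for meanimpute, and the class balance check is left out because its criterion is not stated here.

```r
set.seed(1)
testdata <- iris
testdata[sample(1:150, 40), 3] <- NA
testdata[, 4] <- rnorm(150, testdata[, 4], 2)
testdata$Irrelevant <- runif(150, 0, 1)

# Assumed behaviour of mean imputation: replace NA with the column mean
x <- testdata[, names(testdata) != "Species"]
x[] <- lapply(x, function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
})

n <- nrow(x)  # number of observations
p <- ncol(x)  # number of predictors

# Hypothetical re-implementation of the printed checks
fitness <- c(
  variance_in_all_variables = all(vapply(x, function(col) var(col) > 0, logical(1))),
  only_finite_values        = all(vapply(x, function(col) all(is.finite(col)), logical(1))),
  complete_observations     = all(complete.cases(x)),
  n_to_p_ratio_more_than_2  = n / p > 2,
  enough_predictors_and_obs = p >= 3 && n > 20
)
fitness
```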

Programmatic mode

In the programmatic mode, a search for the best combinations is executed. First, a grid of preprocessing combinations and the corresponding preprocessed data sets is created. Second, the preprocessed data sets are evaluated for classification accuracy, clustering tendency and skewness of outlier scores. In the example below, the preprocessing pipeline consists of 540 combinations and their evaluations.

examplegrid <- setgrid(phases=c("imputation", "outliers", "scaling", "smoothing", "selection"), data=testdata)
exampleresult <- preprocomb(grid=examplegrid, models=c("svmRadial"), nholdout=10, cluster=TRUE, outlier=TRUE, cores=2)
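The figure of 540 combinations can be understood as a Cartesian product of the techniques in each phase. The sketch below illustrates the idea with base R's expand.grid(); the grouping of techniques into phases is an assumption made for illustration (it is not read from the package), and setgrid() may build its grid differently.

```r
# Assumed grouping of the default techniques into phases (not taken from the package)
phases <- list(
  imputation = c("naomit", "meanimpute", "meanclassimpute", "knnimpute", "randomforestimpute"),
  outliers   = c("noaction", "orhoutlier"),
  scaling    = c("noaction", "basicscale", "centerscale", "minmaxscale", "decimalscale", "softmaxscale"),
  smoothing  = c("noaction", "lowesssmooth", "coarsesmooth"),
  selection  = c("noaction", "rfselect75", "rfselect50")
)

# One row per combination: the Cartesian product of the phases
combinations <- expand.grid(phases, stringsAsFactors = FALSE)

nrow(combinations)     # number of combinations
prod(lengths(phases))  # the same number, as a product of the phase sizes
head(combinations, 3)
```

Under this grouping the product is 5 * 2 * 6 * 3 * 3 = 540. Because the count is multiplicative, adding one technique to a phase, or one more phase, increases the number of data sets to evaluate quickly.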

Extracting the wall-clock time of execution in minutes:

exampleresult@walltime
## [1] 48

Extracting the best combinations for classification:

exampleresult@bestclassification
##          imputation   outliers     scaling    smoothing  selection
## 253 meanclassimpute   noaction  basicscale coarsesmooth rfselect50
## 273 meanclassimpute   noaction minmaxscale coarsesmooth rfselect50
## 218 meanclassimpute orhoutlier minmaxscale     noaction rfselect50
## 208 meanclassimpute orhoutlier centerscale     noaction rfselect50
## 258 meanclassimpute orhoutlier  basicscale coarsesmooth rfselect50
## 278 meanclassimpute orhoutlier minmaxscale coarsesmooth rfselect50
##      svmRadial   ALL_MEAN
## 253 0.98+-0.02 0.98+-0.02
## 273 0.98+-0.01 0.98+-0.01
## 218 0.97+-0.04 0.97+-0.04
## 208 0.97+-0.02 0.97+-0.02
## 258 0.97+-0.02 0.97+-0.02
## 278 0.97+-0.02 0.97+-0.02
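The accuracy columns are printed as strings of the form "0.98+-0.02". If numeric values are needed, for instance for custom sorting, such strings can be split in base R. This is a minimal sketch on values copied from the output above; the column names mean and spread are chosen here, and reading the second number as a measure of spread over the holdout rounds is an assumption.

```r
# Example strings copied from the printed result
acc <- c("0.98+-0.02", "0.98+-0.01", "0.97+-0.04")

# Split on the literal separator "+-"
parts <- strsplit(acc, "+-", fixed = TRUE)

parsed <- data.frame(
  mean   = as.numeric(vapply(parts, `[`, character(1), 1)),
  spread = as.numeric(vapply(parts, `[`, character(1), 2))
)

# Order by highest mean first, then by smallest spread
parsed[order(-parsed$mean, parsed$spread), ]
```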

Default options

The package is intended to be used with domain-specific preprocessing phases and techniques. There is, however, a set of default options available. Phases:

Each of the phases has two or more preprocessing techniques, including “noaction”. The available preprocessing techniques can be shown with:

getpreprocessor()
##  [1] "nearzerovar"        "naomit"             "meanimpute"        
##  [4] "meanclassimpute"    "knnimpute"          "randomforestimpute"
##  [7] "basicscale"         "centerscale"        "minmaxscale"       
## [10] "decimalscale"       "softmaxscale"       "orhoutlier"        
## [13] "oversample"         "undersample"        "rfselect75"        
## [16] "rfselect50"         "lowesssmooth"       "coarsesmooth"      
## [19] "noaction"

and the preprocessor function definition by giving the name of the preprocessing technique as an argument:

getpreprocessor("basicscale")
## {dataobject <- initializedataclassobject(data.frame(x = scale(dataobject@x, center = FALSE), dataobject@y))
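The definition above relies on scale() with center = FALSE. In base R this divides each column by its root mean square (computed with n - 1 in the denominator) rather than by its standard deviation. The sketch below checks this on the numeric iris columns; the names rms and manual are introduced only for this illustration.

```r
x <- as.matrix(iris[, 1:4])

# Scaling without centering, as in the basicscale definition
scaled <- scale(x, center = FALSE)

# Root mean square per column, with n - 1 in the denominator
rms <- sqrt(colSums(x^2) / (nrow(x) - 1))

# Divide each column by its root mean square
manual <- sweep(x, 2, rms, "/")

# Compare the two results element by element
all.equal(as.vector(scaled), as.vector(manual))
```

Because the columns are not centered, the scaled values keep their original sign and relative offsets, which distinguishes this from standardization with scale(x).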

Customization

Preprocessing techniques can be added to the system in two steps:

Step 1: Function definition

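scaleexample <- function(dataobject) {
  # center and scale the predictors, then rebuild the DataClass object
  dataobject <- initializedataclassobject(data.frame(x=scale(dataobject@x), y=dataobject@y))
  dataobject # return the transformed object explicitly
}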

Notice that both the input and the output of an added preprocessing technique are DataClass objects. The slot “y” is a factor vector containing the class labels and the slot “x” contains the other variables, all of which must be numeric.
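The contract described above (numeric variables in “x”, a factor in “y”) can be illustrated with a small stand-alone S4 class. The class ToyDataClass below is hypothetical and is not the package's DataClass; it only sketches how such constraints could be enforced with a validity function.

```r
library(methods)

# Hypothetical stand-in for a data container with slots x and y
setClass("ToyDataClass",
  representation(x = "data.frame", y = "factor"),
  validity = function(object) {
    if (!all(vapply(object@x, is.numeric, logical(1))))
      return("all columns of x must be numeric")
    if (nrow(object@x) != length(object@y))
      return("x and y must have the same number of observations")
    TRUE
  }
)

toy <- new("ToyDataClass", x = iris[, 1:4], y = iris$Species)

# A transformation takes an object in and returns an object of the same class
toyscale <- function(dataobject) {
  new("ToyDataClass", x = as.data.frame(scale(dataobject@x)), y = dataobject@y)
}
scaled <- toyscale(toy)

# A non-numeric column in x is rejected by the validity function
invalid <- try(new("ToyDataClass", x = iris, y = iris$Species), silent = TRUE)
inherits(invalid, "try-error")
```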

Step 2: Adding the function to the system

setpreprocessor("scaleexample", "scaleexample(dataobject)")
## [1] "transformdata"
step3 <- prepro(step2, "scaleexample", model="svmRadial") # continues the example above
step3
## # OBJECT: 
## # class: scaleexample 
## # call history: meanimpute orhoutlier scaleexample 
## 
## # COMPUTATIONS: 
## # classification accuracy: 0.88 
## # hopkins statistic, clustering tendency: 0.37 
## # skewness of ORH scores, outlier tendency: -0.29 
## 
## # FITNESS FOR MODEL FITTING: 
## # variance in all variables: TRUE 
## # only finite values: TRUE 
## # complete observations: TRUE 
## # class balance: TRUE 
## # n to p ratio more than 2: TRUE 
## # 3 or more predictors and more than 20 observations: TRUE

Added preprocessing techniques can be assigned to phases and used in creating a new grid of combinations:

newscaling <- setphase("newscaling", c("noaction", "scaleexample"), TRUE)
newexamplegrid <- setgrid(phases=c("imputation", "newscaling"), data=testdata)