Preproviz

Markus Vattulainen

2016-07-09

Data quality issues such as missing values and outliers are often interdependent, which makes preprocessing time-consuming and can lead to suboptimal performance in knowledge discovery tasks. This package supports preprocessing decisions by visualizing interdependent data quality issues by means of feature construction. Users can define their own domain-specific constructed features that express the quality of a data point, such as the number of missing values in the point, or use the nine default features. The outcome can be explored with plot methods, and the feature-constructed data can be retrieved with get methods.

Quick start

Simple exploration can be done by passing a data frame as an argument. The data frame must contain exactly one factor variable; all other variables must be numeric.
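The input requirement can be checked before calling preproviz. The following is a minimal sketch in base R; the helper name check_preproviz_input is hypothetical and is not part of the package.

```r
# Hypothetical helper (not part of preproviz): returns TRUE when a data frame
# has exactly one factor column and every other column is numeric.
check_preproviz_input <- function(df) {
  stopifnot(is.data.frame(df))
  is_factor  <- vapply(df, is.factor, logical(1))
  is_numeric <- vapply(df, is.numeric, logical(1))
  sum(is_factor) == 1 && all(is_numeric[!is_factor])
}

check_preproviz_input(iris)    # TRUE: four numeric columns plus Species
check_preproviz_input(mtcars)  # FALSE: no factor column
```

A data frame that fails the check would typically need its class column converted with factor() and any character columns dropped or recoded first.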

library(preproviz)
result <- preproviz(iris)
## [1] "Data in process: controlobject"

The resulting object can be plotted with various plot functions.

plotDENSITY(result)

plotHEATMAP(result)

Comparisons

The package supports comparison of multiple data sets or of different versions of the same data set.

Let’s make some test data.

iris2 <- iris
iris2[sample(1:150,30), 1] <- NA # adding missing values
iris2[sample(1:150,30), 5] <- levels(iris2$Species)[2] # adding inconsistency 
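To confirm what these two lines changed, the modified copy can be compared with the original. This is a sketch in base R; the fixed seed is an assumption added only to make the run repeatable.

```r
set.seed(1)  # assumption: fixed seed so the example is repeatable
iris2 <- iris
iris2[sample(1:150, 30), 1] <- NA                        # missing values
iris2[sample(1:150, 30), 5] <- levels(iris2$Species)[2]  # inconsistency

# Missing values per column: only the first column is affected.
colSums(is.na(iris2))

# Number of rows whose class label changed. This is at most 30, because
# sampled rows that were already "versicolor" keep their label.
sum(as.character(iris2$Species) != as.character(iris$Species))
```

Because the two samples are drawn independently, some rows may carry both a missing value and a changed label, which is the kind of interdependence the package is meant to expose.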

Then set up a comparison between iris and iris2.

result <- preproviz(list(iris, iris2))
## [1] "Data in process: A"
## [1] "Data in process: B"

Plot how the constructed features cluster (that is, which features are linearly dependent on each other):

plotVARCLUST(result)

Then plot how the constructed feature data points cluster when reduced to two dimensions:

plotCMDS(result)
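The name plotCMDS suggests classical multidimensional scaling. Assuming that is the technique involved, the idea of reducing data points to two dimensions can be sketched with base R's cmdscale. The sketch runs on the raw numeric columns of iris rather than on the package's constructed features, and it is not the package's own implementation.

```r
# Classical MDS on the numeric columns of iris (illustration only).
x <- scale(iris[, 1:4])        # standardize so no column dominates
d <- dist(x)                   # pairwise Euclidean distances
coords <- cmdscale(d, k = 2)   # 150 x 2 matrix of coordinates

# Color the points by class to see how they group in two dimensions.
plot(coords, col = as.integer(iris$Species),
     xlab = "Dimension 1", ylab = "Dimension 2")
```

Points that lie close together in the plot have similar values across the original columns.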

Customization

Finally, the setups can be customized in detail:

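# Select the constructed features to compute
customparameters <- initializeparameterclassobject(list("LOFScore", "ScatterCounter"))
# Define one setup per data set; otherdataframehere is a placeholder for your own data frame
setup1 <- initializesetupclassobject("setup1", customparameters, initializedataobject(iris))
setup2 <- initializesetupclassobject("setup2", customparameters, initializedataobject(otherdataframehere))
# Combine the setups into a control object and run it
control <- initializecontrolclassobject(list("setup1", "setup2"))
result <- preproviz(control)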

New constructed features can also be added to the system:

constructfeature("MissingValueShare", "apply(data, 1, function(x) sum(is.na(x))/ncol(data))", impute=TRUE)
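The expression string passed to constructfeature computes, for each row, the share of values that are missing. Evaluated directly in base R on a small hypothetical data frame, it yields one value per row. This sketch assumes the package binds the data set to the name data when it evaluates the expression, as the string suggests.

```r
# Hypothetical toy data: 3 rows, 4 columns.
data <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, 6),
  c = c(7, 8, 9),
  d = factor(c("x", "y", "x"))
)

# Same expression as in the constructfeature call above.
share <- apply(data, 1, function(x) sum(is.na(x)) / ncol(data))
share  # 0.25, 0.50 and 0.00 for rows 1 to 3
```

Testing an expression this way on a few rows is a cheap check that it returns exactly one numeric value per data point before registering it.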

Default constructed features

There are nine default constructed features: