ordinalClust

Description

ordinalClust is an R package for classification, clustering and co-clustering of ordinal data. It also handles variables with different numbers of levels, as well as missing values. The ordinal data are assumed to follow a BOS distribution (Biernacki and Jacques 2016), which is designed specifically for this kind of data. Co-clustering relies on the Latent Block Model (Jacques and Biernacki 2017).

Installation

# install the package from CRAN if it is not already available
# install.packages("ordinalClust")
library(ordinalClust)
set.seed(5)

Datasets

The package contains real datasets derived from Anota et al. (2017). They contain responses to quality-of-life questionnaires from patients affected by breast cancer.

Univariate Ordinal Data Simulation

To simulate a sample of ordinal data following the BOS distribution, the function pejSim is used.

Basic example code

This snippet creates a sample of ordinal data with 7 categories that follows a BOS distribution parametrized by mu=5 and pi=0.5:

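m <- 7        # number of categories
nr <- 10000   # sample size
mu <- 5       # position parameter
pi <- 0.5     # precision parameter (note: this masks R's built-in constant pi)

# probability of each category under the BOS distribution
probaBOS <- rep(0, m)
for (im in 1:m) probaBOS[im] <- pejSim(im, m, mu, pi)
# draw nr observations according to these probabilities
M <- sample(1:m, nr, prob = probaBOS, replace = TRUE)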
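A quick way to check a sample like this one is to compare the empirical frequency of each level with the probabilities used to draw it. The sketch below is only illustrative: so that it runs without the package, probaBOS is replaced by a hypothetical stand-in vector named probs, which was not computed by pejSim.

```r
set.seed(5)
m  <- 7
nr <- 10000

# Hypothetical stand-in for probaBOS (NOT computed by pejSim):
# any probability vector over the m levels works for this check.
probs <- c(0.05, 0.05, 0.10, 0.15, 0.40, 0.15, 0.10)

M <- sample(1:m, nr, prob = probs, replace = TRUE)

# tabulate() keeps levels that were never drawn, unlike table()
freq <- tabulate(M, nbins = m) / nr

comparison <- data.frame(level = 1:m, theoretical = probs, empirical = freq)
print(comparison)

# largest absolute gap between theoretical and empirical frequencies
max_gap <- max(abs(freq - probs))
print(max_gap)
```

With the real probaBOS vector in place of probs, the same comparison applies unchanged.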

Plotting

To plot the resulting distribution, the ggplot2 library can be used.
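As a minimal sketch of such a plot, the block below draws a bar chart of the level counts with ggplot2. To keep it runnable on its own, M is drawn from a hypothetical stand-in probability vector (probs) rather than from pejSim. The axis labels and title are illustrative choices.

```r
library(ggplot2)

set.seed(5)
m <- 7

# Hypothetical stand-in for probaBOS (NOT computed by pejSim)
probs <- c(0.05, 0.05, 0.10, 0.15, 0.40, 0.15, 0.10)
M <- sample(1:m, 10000, prob = probs, replace = TRUE)

# a factor with explicit levels keeps the ordering and any empty level
df <- data.frame(level = factor(M, levels = 1:m))

p <- ggplot(df, aes(x = level)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE) +
  labs(x = "Ordinal level", y = "Count",
       title = "Simulated ordinal sample")
print(p)
```

Replacing the stand-in sample with the M simulated from probaBOS gives the plot of the BOS sample itself.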

Perform clustering

In this section, clustering is performed on the dataqol dataset. The purpose of clustering is to reveal a group structure among the rows of the matrix.

Example code

library(ordinalClust)
data("dataqol")
set.seed(5)

# loading the ordinal data
M <- as.matrix(dataqol[,2:29])

m = 4

krow = 4

nbSEM=50
nbSEMburn=40
nbindmini=2
init = "random"

object <- bosclust(x=M,kr=krow, m=m, nbSEM=nbSEM,
    nbSEMburn=nbSEMburn, nbindmini=nbindmini, init=init)

Plotting the result

plot(object)

Perform co-clustering

Example code

In this example, co-clustering is performed on the dataqol dataset. Here, the aim of co-clustering is to detect an internal structure in both the rows and the columns of the data.

library(ordinalClust)

# loading the real dataset
data("dataqol")
set.seed(5)

# loading the ordinal data
M <- as.matrix(dataqol[,2:29])


# defining the number of categories:
m=4


# defining number of row and column clusters
krow = 5
kcol = 4

# configuration for the inference
nbSEM=50
nbSEMburn=40
nbindmini=2
init = "kmeans"

# Co-clustering execution
object <- boscoclust(x=M,kr=krow,kc=kcol,m=m,nbSEM=nbSEM,
          nbSEMburn=nbSEMburn, nbindmini=nbindmini, init=init)

Plotting the result

This snippet shows how to visualize the resulting co-clustering, with the plot function:

plot(object)

Perform classification

In this section, the dataset dataqol.classif is used. It contains the responses to a questionnaire for 40 patients affected by breast cancer. Furthermore, a column called death indicates whether the patient died from the disease (2) or not (1). The aim of this section is to predict the classes of a validation dataset using a model fitted on a training dataset.

Choosing a good kc parameter with cross-validation

The classification function bosclassif offers two classification models. The first one (chosen with the option kc=0) is a multivariate BOS model which assumes that, conditionally on the class of the observations, the features are independent. The second model is a parsimonious version of the first. Parsimony is introduced by grouping the features into clusters (as in co-clustering) and assuming that the features of a cluster share a common distribution. The number L of clusters of features is set with the option kc=L. In practice, L can be chosen by cross-validation, as in the following example:
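To make the gain in parsimony concrete, the following sketch counts the BOS parameter pairs implied by each model. It assumes that each BOS distribution carries one (mu, pi) pair. Class proportions and the assignment of features to clusters are deliberately left out of the count. The values kr = 2 and d = 28 match the example that follows.

```r
kr   <- 2             # number of classes
d    <- 28            # number of features (questionnaire items)
kcol <- c(0, 1, 2, 3) # candidate values for kc

# kc = 0 : one (mu, pi) pair per class and per feature
# kc = L : one (mu, pi) pair per class and per cluster of features
n_pairs <- ifelse(kcol == 0, kr * d, kr * kcol)
names(n_pairs) <- paste0("kc=", kcol)
print(n_pairs)
```

Under these assumptions, the full model (kc=0) needs 56 pairs, while the parsimonious models need only 2, 4 or 6.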

library(ordinalClust)
# loading the real dataset
data("dataqol.classif")

set.seed(5)

# loading the ordinal data
M <- as.matrix(dataqol.classif[,2:29])


# creating the classes values
y <- as.vector(dataqol.classif$death)


# sampling datasets for training and to predict
nb.sample <- ceiling(nrow(M)*2/3)
sample.train <- sample(1:nrow(M), nb.sample, replace=FALSE)

M.train <- M[sample.train,]
M.validation <- M[-sample.train,]
nb.missing.validation <- length(which(M.validation==0))


y.train <- y[sample.train]
y.validation <- y[-sample.train]

# number of classes to predict
kr <- 2

# configuration for SEM algorithm
nbSEM=50
nbSEMburn=40
nbindmini=2
init="kmeans"


# different kc to test with cross-validation
kcol <- c(0,1,2,3)
m <- 4


# matrix which contains the predictions for all different kc
predictions <- matrix(0,nrow=length(kcol),ncol=nrow(M.validation))

for(kc in 1:length(kcol)){
  res <- bosclassif(x=M.train, y=y.train, idx_list=c(0),
                    kr=kr, kc=kcol[kc], init=init, m=m, nbSEM=nbSEM, 
                    nbSEMburn=nbSEMburn, nbindmini=nbindmini)
  new.prediction <- predict(res, M.validation)
  predictions[kc,] <- new.prediction@zr_topredict
}

predictions <- as.data.frame(predictions)
# label each row with the kc value it corresponds to
rownames(predictions) <- paste0("kc=", kcol)

Computing the precision, recall, sensitivity and specificity rates for each kc

library(caret)

actual <- y.validation -1


precisions <- rep(0,length(kcol))
recalls <- rep(0,length(kcol))
sensitivities <- rep(0,length(kcol))
specificities <- rep(0,length(kcol))

for(i in 1:length(kcol)){
  prediction <- unlist(as.vector(predictions[i,])) -1
  conf_matrix<-table(prediction,actual)
  precisions[i] <- precision(conf_matrix)
  recalls[i] <- recall(conf_matrix)
  sensitivities[i] <- sensitivity(conf_matrix)
  specificities[i] <- specificity(conf_matrix)
}
precisions
## [1] 0.6666667 0.8181818 0.8000000 0.8000000
recalls
## [1] 0.6666667 1.0000000 0.8888889 0.8888889
sensitivities
## [1] 0.6666667 1.0000000 0.8888889 0.8888889
specificities
## [1] 0.25 0.50 0.50 0.50
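The same rates can be computed by hand from a 2x2 confusion table, which also shows why recall and sensitivity coincide. The sketch below uses a hypothetical table whose counts were chosen to be consistent with the kc=1 figures reported above. They are an assumption, not output from the package. Following caret's default for tables, the first level ("0") is treated as the positive class.

```r
# Hypothetical confusion table (rows: prediction, columns: actual).
# Counts are illustrative, chosen to agree with the kc=1 rates above.
conf <- matrix(c(9, 0, 2, 2), nrow = 2,
               dimnames = list(prediction = c("0", "1"),
                               actual     = c("0", "1")))

tp <- conf["0", "0"]  # predicted positive, actually positive
fp <- conf["0", "1"]  # predicted positive, actually negative
fn <- conf["1", "0"]  # predicted negative, actually positive
tn <- conf["1", "1"]  # predicted negative, actually negative

precision_hand   <- tp / (tp + fp)
recall_hand      <- tp / (tp + fn)  # same formula as sensitivity
specificity_hand <- tn / (tn + fp)

print(c(precision = precision_hand,
        recall = recall_hand,
        specificity = specificity_hand))
```

Because recall and sensitivity share the formula TP / (TP + FN), the two vectors printed above are identical.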

Handling different numbers of categories

The package allows the user to deal with ordinal data that have different numbers of categories. This section shows how to handle such datasets in the co-clustering context.

Example code

In this example, co-clustering is performed on the dataqol dataset, including both the questions with 4 categories and the questions with 7 categories. The function boscoclust is called with the idx_list argument, and execution might take a few minutes.

library(ordinalClust)

# loading the real dataset
data("dataqol")
set.seed(5)

# loading the ordinal data
M <- as.matrix(dataqol[,2:31])


# defining different number of categories:
m=c(4,7)


# defining number of row and column clusters
krow = 3
kcol = c(3,1)

# configuration for the inference
nbSEM=50
nbSEMburn=40
nbindmini=2
init='kmeans'

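# index of the first column of each block of features in M, passed as idx_list
# (assumed 0-based here: the 4-level questions start at 0, the 7-level ones at 28)
d.list <- c(0,28)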

# Co-clustering execution
object <- boscoclust(x=M,kr=krow,kc=kcol,m=m, idx_list=d.list,
                    nbSEM=nbSEM,nbSEMburn=nbSEMburn,
                     nbindmini=nbindmini, init=init)
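To see how the columns of M are split between the two groups of questions, the block boundaries can be derived from d.list. This sketch assumes that d.list holds the 0-based index of the first column of each block, as the values c(0, 28) suggest. The helper names (starts, ends, blocks) are illustrative and not part of the package.

```r
d.list <- c(0, 28)  # assumed 0-based start of each block of columns
m      <- c(4, 7)   # number of categories in each block
n_col  <- 30        # number of columns of M (dataqol[, 2:31])

starts <- d.list + 1                  # convert to R's 1-based indexing
ends   <- c(starts[-1] - 1, n_col)    # a block ends where the next begins

blocks <- data.frame(m = m, first_col = starts, last_col = ends,
                     n_questions = ends - starts + 1)
print(blocks)
```

Under this reading, the first 28 columns of M are the 4-category questions and the last 2 are the 7-category questions, which matches kcol = c(3,1): three column clusters for the first block and one for the second.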

References

Anota, Amelie, Marion Savina, Caroline Bascoul-Mollevi, and Franck Bonnetain. 2017. “QoLR: An R Package for the Longitudinal Analysis of Health-Related Quality of Life in Oncology.” Journal of Statistical Software, Articles 77 (12): 1–30. doi:10.18637/jss.v077.i12.

Biernacki, Christophe, and Julien Jacques. 2016. “Model-Based Clustering of Multivariate Ordinal Data Relying on a Stochastic Binary Search Algorithm.” Statistics and Computing 26 (5). Springer Verlag (Germany): 929–43. https://hal.inria.fr/hal-01052447.

Jacques, Julien, and Christophe Biernacki. 2017. “Model-Based Co-clustering for Ordinal Data.” https://hal.inria.fr/hal-01448299.