Cluster Analysis using diceR

Introduction

Cluster analysis is a way of “slicing and dicing” data to allow the grouping together of similar entities and the separation of dissimilar ones. Issues arise due to the existence of a diverse number of clustering algorithms, each with different techniques and inputs, and with no universally optimal methodology. Thus, a framework for cluster analysis and validation methods are needed. Our approach is to use cluster ensembles from a diverse set of algorithms so that the final class labels. This ensures that the data has been considered from several angles and using a variety of methods. We have currently implemented about 15 clustering algorithms, and we provide a simple framework to add additional algorithms (see example("consensus_cluster")). Although results are dependent on the subset of algorithms chosen for the ensemble, the intent is that we capture a variety of clustering and select those that are most consistent with the data.

You can install diceR from CRAN with:

install.packages("diceR")

Or get the latest development version from GitHub:

# install.packages("devtools")
devtools::install_github("AlineTalhouk/diceR")
library(diceR)
library(dplyr)
library(ggplot2)
library(knitr)
data(hgsc)
hgsc <- hgsc[1:100, 1:50]

We load an excerpt of an expression data set from TCGA of 100 high grade serous carcinoma samples measured on 50 genes.

Consensus Clustering

When Monti et al. (2003) first introduced consensus clustering, the idea was to use one clustering algorithm on bootstrapped subsamples of the data. We implement some extensions where a consensus is reached across subsamples and across algorithms. The final cluster assignment is then computed using statistical transformations on the ensemble cluster.

The base function of this package is consensus_cluster(), which outputs cluster assignments across subsamples and algorithms, for different k (number of clusters). For example, let’s say we were interested in clustering the hgsc data into 3 or 4 clusters, using 80% resampling on 5 replicates, for these clustering algorithms: Hierarchical Clustering, PAM, and DIvisive ANAlysis Clustering (DIANA). Euclidean distance is used for all algorithms.

CC <- consensus_cluster(hgsc, nk = 3:4, p.item = 0.8, reps = 5,
                        algorithms = c("hc", "pam", "diana"))

The output is a 4-dimensional array: rows are samples, columns are different bootstrap subsample replicates, slices are algorithms, and each “box” (4th dimension) is for a different k. Below are the first few cluster assignments for each replicate in the DIANA algorithm for k = 3.

str(CC)
#>  int [1:100, 1:5, 1:3, 1:2] 1 3 1 3 1 1 1 NA 1 1 ...
#>  - attr(*, "dimnames")=List of 4
#>   ..$ : chr [1:100] "TCGA.04.1331_PRO.C5" "TCGA.04.1332_MES.C1" "TCGA.04.1336_DIF.C4" "TCGA.04.1337_MES.C1" ...
#>   ..$ : chr [1:5] "R1" "R2" "R3" "R4" ...
#>   ..$ : chr [1:3] "HC_Euclidean" "PAM_Euclidean" "DIANA_Euclidean"
#>   ..$ : chr [1:2] "3" "4"
kable(head(CC[, , "DIANA_Euclidean", "3"]))
R1 R2 R3 R4 R5
TCGA.04.1331_PRO.C5 1 2 NA NA NA
TCGA.04.1332_MES.C1 1 2 1 1 NA
TCGA.04.1336_DIF.C4 2 NA NA 1 NA
TCGA.04.1337_MES.C1 2 1 3 1 2
TCGA.04.1338_MES.C1 2 1 3 1 2
TCGA.04.1341_PRO.C5 1 2 1 2 1

Note the unavoidable presence of NAs because we used 80% subsampling. This can be problematic in certain downstream ensemble methods, so we can impute missing values using K-Nearest Neighbours beforehand. There might still be NAs after KNN because of how the decision threshold was set. The remaining missing values can be imputed using majority voting.

CC <- apply(CC, 2:4, impute_knn, data = hgsc, seed = 1)
CC_imputed <- impute_missing(CC, hgsc, nk = 4)
sum(is.na(CC))
#> [1] 5
sum(is.na(CC_imputed))
#> [1] 0

We can carry on the analysis using either CC or CC_imputed.

Consensus Functions

diceR provides functions for retrieving useful summaries and other results for consensus clustering.

Compute consensus matrix with consensus_matrix()

The consensus matrix is an n by n matrix, where n is the number of samples. Each element is a real-valued number between 0 and 1 inclusive, representing the proportion of times that two samples were clustered together out of the times that the same samples were chosen in the bootstrap resampling. The diagonal are all one’s. Suppose we wanted to compute the consensus matrix for PAM, k = 4, and visualize using graph_heatmap():

pam.4 <- CC[, , "PAM_Euclidean", "4", drop = FALSE]
cm <- consensus_matrix(pam.4)
dim(cm)
#> [1] 100 100
hm <- graph_heatmap(pam.4)

Combine consensus summaries with consensus_combine()

If we wish to separately extract consensus matrices and consensus classes for every algorithm, consensus_combine() is a convenient wrapper to do so. Setting element = "matrix" returns a list of consensus matrices. On the other hand, setting element = "class" returns a matrix with rows as samples, and columns as clustering assignments for each algorithm.

ccomb_matrix <- consensus_combine(CC, element = "matrix")
ccomb_class <- consensus_combine(CC, element = "class")
str(ccomb_matrix, max.level = 2)
#> List of 2
#>  $ 3:List of 3
#>   ..$ HC_Euclidean   : num [1:100, 1:100] 1 0.4 1 0.4 1 1 1 1 1 0.8 ...
#>   ..$ PAM_Euclidean  : num [1:100, 1:100] 1 1 0.2 1 0 1 1 0.8 0 0 ...
#>   ..$ DIANA_Euclidean: num [1:100, 1:100] 1 0.6 0 0 0 1 1 0 0 1 ...
#>  $ 4:List of 3
#>   ..$ HC_Euclidean   : num [1:100, 1:100] 1 0.2 1 0 1 1 1 1 1 0.8 ...
#>   ..$ PAM_Euclidean  : num [1:100, 1:100] 1 0.8 0 0.4 0 1 1 0.8 0 0 ...
#>   ..$ DIANA_Euclidean: num [1:100, 1:100] 1 0.6 0 0 0 1 1 0 0 1 ...
kable(head(ccomb_class$`4`))
HC_Euclidean PAM_Euclidean DIANA_Euclidean
1 1 1
2 1 1
1 2 2
3 1 2
1 2 2
1 1 1

One can feed in ccomb_class (instead of CC) into consensus_matrix() to obtain a consensus matrix across both subsamples and algorithms.

A situation might also arise where we initially decided on using 3 clustering algorithms for the ensemble, but later wish to add additional algorithms for analysis. consensus_combine() takes in any number of ensemble objects (e.g. CC and CC2) and combines the results.

CC2 <- consensus_cluster(hgsc, nk = 3:4, p.item = 0.8, reps = 5,
                         algorithms = "km")
ccomb_class2 <- consensus_combine(CC, CC2, element = "class")
kable(head(ccomb_class2$`4`))
HC_Euclidean PAM_Euclidean DIANA_Euclidean KM_Euclidean
1 1 1 1
2 1 1 1
1 2 2 2
3 1 2 3
1 2 2 2
1 1 1 1

Evaluate, trim, and reweigh algortihms with consensus_evaluate()

Internal cluster validation indices assess the performance of results by taking into account the compactness and separability of the clusters. We choose a variety of indices on which to compare the collection of clustering algorithms. We use the PAC (Proportion of Ambiguous Clusters), the proportion of entries in a consensus matrix that are strictly between lower (defaults to 0) and upper (defaults to 1), to give a measure of cluster stability. In addition, if no reference class is provided, we calculate the average PAC across algorithms within each k, and choose the k with the greatest average PAC. If there is a reference class, k is the number of distinct classes in the reference.

ccomp <- consensus_evaluate(hgsc, CC, CC2, plot = FALSE)
kable(ccomp$internal)
Algorithms c_index calinski_harabasz davies_bouldin dunn mcclain_rao pbm sd_dis ray_turi tau gamma g_plus silhouette s_dbw Compactness Connectivity
HC_Euclidean 0.2628654 4.410259 0.4882016 0.3811705 0.7501849 15.658647 0.2770320 0.7461296 0.2795918 0.6686529 0.0289668 NaN NaN 7.810255 12.65913
PAM_Euclidean 0.2050865 20.228140 2.1384974 0.2763511 0.7985953 10.366476 0.4987234 1.2721484 0.3549500 0.5249518 0.1085933 0.1255940 NaN 6.944100 59.22500
DIANA_Euclidean 0.2054155 17.267393 2.3827644 0.2833219 0.7948289 5.481678 0.5285429 1.9552072 0.3678170 0.5261285 0.1158006 0.0888720 NaN 7.118123 55.26071
KM_Euclidean 0.1550889 21.207103 1.9314614 0.3016047 0.7637774 11.444229 0.4713559 1.1327433 0.4287440 0.6325554 0.0844035 0.1203563 NaN 6.908208 57.88492
Algorithms c_index calinski_harabasz davies_bouldin dunn mcclain_rao pbm sd_dis ray_turi tau gamma g_plus silhouette s_dbw Compactness Connectivity
HC_Euclidean 0.2092373 3.948713 0.5132453 0.4523993 0.7277567 13.247761 0.2665937 0.5898774 0.3188450 0.7035632 0.0304408 NaN NaN 7.782720 15.93095
PAM_Euclidean 0.1795640 15.085862 2.0559155 0.2979097 0.7820270 8.455729 0.4753546 1.2251523 0.3755686 0.5701096 0.0932802 0.0814430 NaN 6.875115 72.86746
DIANA_Euclidean 0.1908600 12.512635 1.5074910 0.2663142 0.7848426 5.698152 0.4672917 1.6854933 0.3863081 0.5516210 0.1099515 0.0274828 NaN 7.070627 50.82460
KM_Euclidean 0.1195002 15.838879 1.9122514 0.2541215 0.7397945 8.839085 0.4449419 1.1424613 0.4682810 0.7021381 0.0662450 0.0639874 NaN 6.840458 73.99008

We see that the biclustering algorithm is the least ambiguous and also most well-clustered (high compactness and separability).

Some algorithms perform too poorly to deserve membership in the cluster ensemble. We consider the relative ranks of each algorithm across all internal indices, and compute their sum. All algorithms below a certain quantile for the sum rank are trimmed (removed). By default this quantile is 75%.

After trimming, we can optionally choose to reweigh the algorithms based on the internal index magnitudes. Of course, we take into account the direction of optimality (higher is better is lower is better). Algorithms reweighed are then fed into the consensus functions. This is done by replicating each algorithm by a scalar factor that is proportional to its weight. For example, if we have two algorithms A and B, and A is given a weight of 80% and B is given a weight of 20%, then we make 4 copies of A and 1 copy of B. To minimize computational time, the total number of copies out of all algorithms has an upper bound of 100. Without reweighing, each algorithm is given equal weight.

ctrim <- consensus_evaluate(hgsc, CC, CC2, trim = TRUE, reweigh = FALSE, n = 2)
str(ctrim, max.level = 2)
#> List of 5
#>  $ k       : int 3
#>  $ pac     :'data.frame':    2 obs. of  5 variables:
#>   ..$ k              : chr [1:2] "3" "4"
#>   ..$ HC_Euclidean   : num [1:2] 0.487 0.432
#>   ..$ PAM_Euclidean  : num [1:2] 0.248 0.251
#>   ..$ DIANA_Euclidean: num [1:2] 0.215 0.242
#>   ..$ KM_Euclidean   : num [1:2] 0.275 0.242
#>  $ internal:List of 2
#>   ..$ 3:'data.frame':    4 obs. of  16 variables:
#>   ..$ 4:'data.frame':    4 obs. of  16 variables:
#>  $ external: NULL
#>  $ trim    :List of 5
#>   ..$ alg.keep  : chr [1:2] "HC_Euclidean" "KM_Euclidean"
#>   ..$ alg.remove: chr [1:2] "PAM_Euclidean" "DIANA_Euclidean"
#>   ..$ rank.agg  :List of 1
#>   ..$ top.list  :List of 1
#>   ..$ data.new  :List of 1

The return value shows which algorithms were kept, removed (if any), and the trimmed (and potentially reweighed) cluster ensemble.

Significance Testing

To test whether there are four statistically distinct clusters (k = 4) versus no clusters (k = 1) using the PAM algorithm, we run sigclust with 50 simulations and generate a p-value for this significance test.

set.seed(1)
pam_4 <- ccomb_class2$`4`[, "PAM_Euclidean"]
sig_obj <- sigclust(hgsc, k = 4, nsim = 100, labflag = 0, label = pam_4)
str(sig_obj)
#> Formal class 'sigclust' [package "sigclust"] with 10 slots
#>   ..@ raw.data  : num [1:100, 1:50] -0.0107 -0.7107 0.8815 -1.0851 -0.9322 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   .. .. ..$ : chr [1:100] "TCGA.04.1331_PRO.C5" "TCGA.04.1332_MES.C1" "TCGA.04.1336_DIF.C4" "TCGA.04.1337_MES.C1" ...
#>   .. .. ..$ : chr [1:50] "ABAT" "ABHD2" "ACTB" "ACTR2" ...
#>   ..@ veigval   : num [1:50] 11.81 4.51 2.66 2.29 1.84 ...
#>   ..@ vsimeigval: num [1:50] 11.81 4.51 2.66 2.29 1.84 ...
#>   ..@ simbackvar: num 0.42
#>   ..@ icovest   : num 2
#>   ..@ nsim      : num 100
#>   ..@ simcindex : num [1:100] 0.67 0.634 0.641 0.637 0.606 ...
#>   ..@ pval      : num 0.14
#>   ..@ pvalnorm  : num 0.163
#>   ..@ xcindex   : num 0.63

The p-value is 0.14, indicating we do not have sufficient evidence to conclude there are four distinct clusters. Note that we did not use the full hgsc data set in this example, so the underlying biological mechanisms may not be fully captured.