Infinite Mixtures of Infinite Factor Analysers

Keefe Murphy

2017-07-07

Introduction

IMIFA is an R package that provides flexible efficient Gibbs sampling functions for fitting Infinite Mixtures of Infinite Factor Analysers (IMIFA) and related models. The main model, IMIFA itself, conducts Bayesian nonparametric clustering via latent Gaussian models. While the package offers a Bayesian implementation of the Factor Analysis (FA) and Mixtures of Factor Analysers (MFA) models, among others, these require pre-specification of the number of latent factors &/or the number of components, which must remain fixed, the main advantages of IMIFA are that a) these quantities are estimated automatically, and b) the number of latent factors is allowed to be cluster-specific.

Typically one would run FA or MFA models over ranges of values for the numbers of clusters and factors, with the pair which optimises some model selection criteria typically chosen. IMIFA instead enables Bayesian nonparametric model-based clustering with factor analytic covariance structures, without recourse to model selection criteria, to automatically choose the number of clusters &/or cluster-specific latent factors.

The main features of the IMIFA model are the multiplicative gamma process shrinkage prior on the factor loadings, which allows theoretically infinitely many factors (this can also be employed in a MIFA context, for instance, where the number of clusters is fixed but cluster-specific factors are estimated) and the Dirichlet process prior, which utilises the stick-breaking construction and slice-efficient sampling, and allows theoretically infinitely many clusters. Tools are provided for soliciting sensible hyperparameters for these priors. As of IMIFA v1.2.0, it’s now also possible to assume a Pitman-Yor process prior on the number of mixture components, and to learn its concentration and discount parameters via Metropolis-Hastings updates.

Model-specific diagnostic tools are also provided, as well as many extensive options for plotting results and conducting posterior inference on parameters of interest. The functions are typically verbose, offering plenty of messages and warnings to the user where appropriate. Please see Murphy et al. (2017) for more info.

If you find bugs or want to suggest new features please visit the IMIFA GitHub issues page. This vignette aims to reproduce some results in the Murphy et al. (2017) paper using the mcmc_IMIFA() and get_IMIFA_results() functions and demonstrates how the plots therein were created using the dedicated S3 plot method, while also demonstrating how to fit other models in the IMIFA family.

Installing IMIFA

IMIFA will run in Windows, Mac OS X or Linux. To install IMIFA you first need to install R. I would also recommend installing Rstudio as a nice desktop environment for using R.

Once in R you can type:

install.packages('devtools')
devtools::install_github('Keefe-Murphy/IMIFA')

at the R command prompt to install the latest development version of the package from the IMIFA GitHub page.

To instead install the latest stable official release of the package from CRAN go to R and type:

install.packages('IMIFA')

In either case, if you then type:

library(IMIFA)

it will load in all the IMIFA functions.

The GitHub version contains a few more features but some of these may not yet be fully tested, and occasionally this version might be liable to break when it is in the process of being updated.

The three main functions

There exist several utility functions in the package to solicit good prior hyperparameters (e.g. G_priorDensity(), psi_hyper(), MGP_check()) and to prepare results for producing nice plots (e.g. mat2cols()), this vignette focuses only on the three most important functions:

1.  mcmc_IMIFA() 
2.  get_IMIFA_results()
3.  and a dedicated S3 plot() method for objects of class "Results_IMIFA"

While it is possible to simulate data from a factor analytic mixture using the sim_IMIFA_data() function, specifying, among other things, the sample size N, the number of groups G, and the number of variables P, e.g.

# Simulate 100 observations from 3 balanced groups with group-specific numbers of latent factors
sim.data <- sim_IMIFA_data(N=100, G=3, P=20, Q=c(2, 2, 5))

the well-known Italian olive oil data set will be used throughout this vignette instead. You can load this data set after loading the IMIFA package by typing

data(olive)

and learn more about this data set by typing

?olive

Fitting the model & running the MCMC chain

The mcmc_IMIFA function provides an adaptive Gibbs sampler for nonparameteric model-based clustering using models from the IMIFA family. The function facilitates model-based clustering with dimensionally reduced factor-analytic covariance structures, with automatic estimation of the number of clusters and cluster-specific factors as appropriate to the method employed. Factor analysis with one group (FA/IFA), finite mixtures (MFA/MIFA), overfitted mixtures (OMFA/OMIFA), infinite factor models which employ the multiplicative gamma process (MGP) shrinkage prior (IFA/MIFA/OMIFA/IMIFA), and infinite mixtures which employ Dirichlet Process Mixture Models (IMFA/IMIFA) are all provided. The function creates a raw object of class 'IMIFA' from which the optimal/modal model can be extracted by get_IMIFA_results.

There are many, many options for specifying hyperparameters, specifying running conditions and pre-processing the data that are deferred, for brevity, to the function’s help file. Great care was taken to ensure the default function arguments would be appropriate in most applications, but you can nevertheless access further helpful instructions by typing

?mcmc_IMIFA

Be warned that the mcmc_IMIFA function calls in this section may take quite some time to run. Let’s begin by fitting a Mixture of Factor Analysers model (MFA) to the unit-scaled olive data. For this, we must specify sequences of values for range.G, the number of clusters, and range.Q, the number of latent factors. Let’s assume that uniquenesses are isotropic rather than unconstrained. This isotropic constraint provides the link between factor analysis and the probabilistic principal component analysis model: note that we could also constrain uniqueness across groups (but still be diagonal within each group) by specifiying uni.type="constrained" or constrain uniquenesses to a single value (i.e. equal across all groups and all variables) by specifying uni.type="single". Let’s also allow diagonal covariance as a special case where range.Q is 0, and accept all other defaults (for instance, cluster labels will be initialised by mclust).

simMFA   <- mcmc_IMIFA(olive, method="MFA", n.iters=10000, range.G=3:6, range.Q=0:3, 
                       centering=FALSE, scaling="unit", uni.type="isotropic")

Now let’s instead have the numbers of cluster-specific latent factors be estimated automatically using a Mixture of Infinite Factor Analysers model (MIFA). This time, we’ll also mean-center the data and initialise the cluster labels using kmeans instead. Note that range.Q no longer needs to be specified, but it can be given as a conservatively high starting value and upper limit. Let’s accept the default, and also include the Infinite Factor Analysis model (IFA) as a special case where range.G is 1.

simMIFA  <- mcmc_IMIFA(olive, method="MIFA", n.iters=10000, centering=TRUE, 
                       range.G=1:3, z.init="kmeans")

MIFA doesn’t entirely solve the issue of model choice, as you can see; range.G still needs to be specified. We can allow the number of clusters to instead/also be estimated automatically by fitting one of the overfitted mixture models (OMFA/OMIFA) or one of the infinite mixture models (IMFA/IMIFA). Let’s fit an Overfitted Mixture of Infinite Factor Analysers, and override the default value for the starting value / upper limit for the number of clusters (range.G) and supply a sufficiently small dirichlet hyperparameter (alpha) for the cluster mixing proportions. We can enforce additional shrinkage by varying other hyperparameters.

simOMIFA <- mcmc_IMIFA(olive, method="OMIFA", n.iters=10000, range.G=10, nu=3, alpha=0.8,
                       alpha.d1=3.5, alpha.d2=7, prop=0.6, epsilon=0.12)

Finally, let’s run the flagship IMIFA model, on which all subsequent demonstrations and results will be based, for a greater number of iterations, accepting the defaults for most arguments. Note that the verbose argument, which defaults to TRUE will ordinarily print a progress bar to the console window. We can avoid scoring the latent scores if we wish, as this can be a huge drain on memory, with the caveat that posterior inference on the scores won’t be possible. The default implementation uses the independent rather than dependent slice-efficient sampler; we could override the default for the parameter governing the rate of geometric decay by specifying rho. Furthermore, the default implementation assumes a Dirichlet process prior - let’s override this and assume instead a Pitman-Yor process prior, and also learn its discount parameter \(d\) via the learn.d argument.

simIMIFA <- mcmc_IMIFA(olive, method="IMIFA", n.iters=50000, verbose=FALSE, s.sw=FALSE, learn.d=TRUE)

Postprocessing and extracing optimum results

In order to extract results, conduct posterior inference and compute performance metrics for MCMC samples of models from the IMIFA family, we can pass the output of mcmc_IMIFA to the function get_IMIFA_results(). If, for instance, simMFA above was supplied, this function would find the pair of \(G\) and \(Q\) values which optimises a model selection criteria of our choosing and prepare results from that model only. If simIMIFA is supplied, this function finds the modal estimates of \(G\) and each \(Q_g\) (the cluster-specific number of latent factors), and likewise prepares results accordingly.

This function can be re-ran at little computational cost in order to extract different models explored by the sampler used by mcmc_IMIFA, without having to re-run the model itself. New results objects using different numbers of clusters and different numbers of factors (if visited by the model in question), or using different model selection criteria (if necessary) can be generated with ease. The function also performs post-hoc corrections for label switching, as well as post-hoc Procrustes rotation, to ensure sensible posterior parameter estimates, and constructs credible intervals. Please see the function’s help manual by typing ?get_IMIFA_results for further assistance with the various function arguments.

If we wanted to choose the optimum MFA model, we would simply type

resMFA  <- get_IMIFA_results(simMFA)

If we instead wanted to explore the 3-cluster solution, construct 90% credible intervals, and have the number of latent factors chosen by another criteria, we could try

resMFA2 <- get_IMIFA_results(simMFA, G=3, criterion="aic.mcmc")

For now, let’s just extract results from our IMIFA run above so we can proceed to visually examine them. Though the IMIFA model obviates the need for model selection criteria, the syntax for extracting results is exactly the same. However, this time, let’s also summarise the clustering via the N x N similarity matrix obtained by averaging the adjacency matrices (admittedly at the expense of slowing the function down considerably!), so that we can visualise it later.

resIMIFA <- get_IMIFA_results(simIMIFA, z.avgsim=TRUE)

Before we examine the results in great detail, we can quickly summarise the solution as follows

summary(resIMIFA)
## Call:    get_IMIFA_results.IMIFA(sims = simPY, z.avgsim = TRUE, zlabels = newlabs)
## 
## The chosen IMIFA model has 4 groups, with 6, 2, 3 and 2 factors respectively

Visualing IMIFA results

The object resIMIFA above is of the class Results_IMIFA. We can call plot an objects of this class to access a dedicated function for visualising output and parameters of inferential interest for IMIFA and related models. The two most important arguments, beyond the Results_IMIFA object itself, are plot.meth and param, where the former dictates the type of plot to be produced (one of c("all", "correlation", "density", "errors", "GQ", "means", "parallel.coords", "trace", "zlabels"), depending on the method employed originally by mcmc_IMIFA) for the parameter of interest (one of c("means", "scores", "loadings", "uniquenesses", "pis", "alpha"), depending on the method employed originally by mcmc_IMIFA). Note that many of the function calls below will also print relevant output to the console window that is not always shown here. Please see the function’s help manual by typing plot.Results_IMIFA for further assistance with the various function arguments.

Let’s examine the posterior distribution of \(G\) and the posterior distribution of \(Q_g\) for each of the 4 groups. The third plot below, depicting the trace of the numbers of active and non-empty groups, allows us to examine mixing of the chain with respect to \(G\).

plot(resIMIFA, plot.meth="GQ")

Let’s examine clustering performance against the known group labels. Note that the group labels could have been already supplied to get_IMIFA_results too, but plotting allows clustering performance to be evaluated against new labels without having to extract the full results object all over again. More specifically, the code below allows us to visualise the clustering uncertainty (with or without the labels being supplied, in fact). In this case, g=2 would mean to instead visualise the uncertainties in the form of a histogram. We can visualise the N * N similarity matrix by supplying g=3 when plot.meth="zlabels". Had we not specified g, the function would cycle through the available plots.

plot(resIMIFA, plot.meth="zlabels", zlabels=olive$area, g=1)
## confusion.matrix :
##          Observed
## Predicted   1   2   3
##         1 323   0   0
##         2   0  97   0
##         3   0   1 103
##         4   0   0  48
## 
## rand :
## [1] 0.9685009
## 
## crand :
## [1] 0.934535
## 
## misclassified :
##  [1] 390 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437
## [18] 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454
## [35] 455 456 457 458 459 461 462 463 464 465 467 469 470 471 472
## 
## error.rate :
## [1] "8.57%"

plot(resIMIFA, plot.meth="zlabels", g=3)

To examine the posterior mean estimates of the cluster means in a given cluster (say the 1st), we can set both plot.meth and param to "means". If the cluster isn’t specified using the argument g, the user will be prompted to hit <Return> at the onset of each plot in order to cycle through similar plots for all clusters. In the code below mat=TRUE means, in this case, to plot all variables simultaneously. By default, credible intervals are also plotted. Note that the data were originally mean-centered and unit-scaled when mcmc_IMIFA was called.

plot(resIMIFA, plot.meth="means", param="means", mat=TRUE, g=1)

Had the factor scores been stored, we could examine the trace plots for them using the code below, where mat=TRUE and by.fac=FALSE (the default) means, in this case, to plot all factors simultaneously for a given observation, say the 1st (however this is not shown, as the scores were not stored).

plot(resIMIFA, plot.meth="trace", param="scores", mat=TRUE, ind=1)

We could instead plot all observations simultaneously for a given factor, say the 2nd (also not shown).

plot(resIMIFA, plot.meth="trace", param="scores", mat=TRUE, by.fac=TRUE, fac=2)

The code below will produce only a heatmap of the loadings matrix in the first cluster: note, however, that heat.map=TRUE by default for loadings (whereas the opposite is true for the scores). Darker colours correspond to entries which are more negatively loaded and vice versa.

plot(resIMIFA, plot.meth="means", param="loadings", heat.map=TRUE, g=1)

To examine uniquenesses from all clusters in the form of a parellel coordinates plot, type

plot(resIMIFA, plot.meth="parallel.coords", param="uniquenesses")

The error between the empirical and estimated covariance matrices (by group and averaged across groups) can sometimes provide a useful indicator of the validity of a clustering solution, and can be visualised as follows:

plot(resIMIFA, plot.meth="errors")
## Average Error Metrics :
##          MSE          MAE        MEDSE        MEDAE         RMSE 
## 7.380864e-04 1.443915e-02 7.000663e-05 7.689004e-03 2.429045e-02 
##        NRMSE 
## 2.096497e-02

Finally, inference can be conducted on the Pitman-Yor concentration parameter \(\alpha\) and discount parameter \(d\) (assuming learn.alpha and learn.d, respectively, were not set to FALSE when mcmc_IMIFA was initially run using the IMFA/IMIFA methods), where all below refers to trace, density, posterior means, and ACF (correlation) plots. Note that the density for discount accounts for the point-mass at zero built into its prior.

plot(resIMIFA, plot.meth="all", param="alpha")

plot(resIMIFA, plot.meth="all", param="discount")

References

For a more detailed description of the kind of things IMIFA can do and the purposes for which the model was developed, please consult the followng papers:

Murphy, K., Gormley, I. C. and Viroli, C. (2017) Infinite Mixtures of Infinite Factor Analysers: Nonparametric Model-Based Clustering via Latent Gaussian Models, to appear, arXiv:1701.07010.

Bhattacharya, A. and Dunson, D. B. (2011). Sparse Bayesian infinite factor models. Biometrika, 98(2): 291-306.