Frequently Asked Questions

Thomas Quinn

2018-05-09

Why does proportionality change when I remove features?

We use the term sub-compositional coherence to describe a method in which the results do not change when features (e.g., genes) are added or removed. The numerator portion of \(\rho\) (i.e., the VLR) is sub-compositionally coherent. However, \(\rho\) itself is not sub-compositionally coherent because the denominator portion of \(\rho\) (and \(\phi\) and \(\phi_s\)) scales the VLR by the variance of the centered log-ratio (clr) transformed features. The clr-transformation is not sub-compositionally coherent because the removal of a feature will change geometric mean reference for each composition (i.e., each sample) and therefore its transformation; consequently, the variance of each transformed feature vector changes too. An exception to this is when using the additive log-ratio (alr) transformation instead of clr.

Isn’t the centered log-ratio bad then?

The centered log-ratio is not perfect, but it is maybe good enough. Still, this is the major limitation of proportionality analysis: it inherits all the limitations of the choice in log-ratio transformation. However, as long as the majority of genes remain unchanged across the conditions, the clr is appropriate. Excerpted from Erb et al. 2016:

This is known as the centered log-ratio transformation with g(x) the geometric mean over the genes for the given condition. The problem with this transformation is that it is sub-compositionally incoherent Aitchison (2003), so results will change to some extent when using subsets of genes for the analysis. In some cases, g(x) can approximate an unchanged reference. This applies whenever the majority of genes remains unchanged across conditions, so the unchanged genes will dominate the behaviour of the reference. Note that this is also the condition needed for a normalization by effective library size Robinson and Oshlack (2012).

What if the centered log-ratio isn’t for me?

One option is to use a house-keeping gene or RNA spike-in that has an a priori known fixed abundance. If this is available, you can use the additive log-ratio (alr) transformation to effectively “back-calculate” absolute abundances. In practice, this is sometimes infeasible. Another option is to use a novel transformation introduced in Fernandes et al. 2013 called the inter-quartile log-ratio (iqlr) transformation. Excerpted from the ALDEx2 vignette:

The [approach] is to include as the denominator for the geometric mean those features that are relatively invariant across all samples. This is termed the ‘iqlr’ method, and takes as the denominator the geometric mean of those features with variance calculated from the clr that are between the first and third quartile. This approach can be used until the asymmetry becomes so severe that more than 25% of the features are asymmetric between the groups. The iqlr approach has the advantage that it gives essentially the same answer as using the entire set of features in symmetric datasets.

Greg Gloor and I have coordinated to make the iqlr transformation and the Monte Carlo (MC) instances from ALDEx2 available for proportionality analysis in propr. You can use the aldex2propr function to build a propr object from an aldex.clr object. In practice, the code might look something like this:

data(caneToad.counts)
data(caneToad.groups)
counts <- as.data.frame(t(caneToad.counts))
x <- ALDEx2::aldex.clr(counts, caneToad.groups, denom = "iqlr")
rho <- propr::aldex2propr(x, how = "rho")

Note that this will average \(\rho\) (or \(\phi\) or \(\phi_s\)) across many MC instances which may or may not be desired. In theory, using MC instances should lessen the impact of zero or low counts on your final result. In practice, proportionality analysis will take a lot longer to run. Note, however, that you can also use the iqlr transformation without any MC instances:

rho <- perb(caneToad.counts, ivar = "iqlr")

When using aldex2propr (or the iqlr transformation), please make sure to cite the appropriate ALDEx2 paper(s) to thank Greg for his generous contribution!

Why can’t we just use the VLR by itself?

Quoting Lovell et al. 2015 (quoting Friedman et al. 2012):

Aitchison proposed logratio variance, var(log(x/y)), as a measure of association for variables that carry only relative information. When x and y are exactly proportional var(log(x/y)) = 0, but when x and y are not exactly proportional, “it is hard to interpret as it lacks a scale. That is, it is unclear what constitutes a large or small value… (does a value of 0.1 indicate strong dependence, weak dependence, or no dependence?)”

Why does my smear plot show perfectly proportional pairs?

We call this phenomena auto-proportionality. Depending on how genes (or transcripts) get mapped, annotations with similar open reading frames (ORFs) can wrongly appear as if they have proportional abundance only because some sequences map to multiple annotations equally well. This is especially a problem when analyzing transcriptomic data: transcript isoforms tend to appear proportional with one another by virtue of their shared ORF. The discovery of proportional isoforms could reflect a real biological event (i.e., transcript isoforms under shared regulatory control) or an artifact of mapping. We recommend removing such pairs to avoid swamping interesting biological signals with artifactual noise. You can do this by removing the set of extremely proportional pairs (e.g., \(\rho > .995\)) from the set of highly proportional pairs (e.g., \(\rho > .95\)):

data(caneToad.counts)
rho <- perb(caneToad.counts)
autoprop <- rho[">", .995]@pairs
highprop <- rho[">", .95]@pairs
rho@pairs <- setdiff(highprop, autoprop)

How can I analyze multi-omics data?

This is an area of active research. At a minimum, we recommend making sure that (a) all combined multi-omics data derive from the same samples and (b) log-ratio transformations are applied to each data source individually. The functions propr:::lr2rho, propr:::lr2phi, and propr:::lr2phs, used in conjunction with new("propr"), offer a way forward. The propr initialization method will perform the log-ratio transformation without building a proportionality matrix:

data(caneToad.counts)
prop <- new("propr", counts = caneToad.counts, ivar = "iqlr")
prop@logratio

You could then cbind any number of separate @logratio slots into a single data.frame to use as input for an “lr2” function.

How can I analyze multiple groups?

This is another area of active research. At a minimum, we recommend making sure that data filtering (e.g., removal of low and zero counts) and log-ratio transformation is applied identically to all data prior to analysis. From there, the usual pipeline would likely involve separating the data into several pairwise (i.e., two-way) comparisons. Users should take care when crafting their hypotheses and consider whether one-vs-all or one-vs-one comparisons better suit their goals.

References

  1. Erb, Ionas, and Cedric Notredame. “How Should We Measure Proportionality on Relative Gene Expression Data?” Theory in Biosciences = Theorie in Den Biowissenschaften 135, no. 1-2 (June 2016): 21-36. http://dx.doi.org/10.1007/s12064-015-0220-8.

  2. Fernandes, Andrew D., Jean M. Macklaim, Thomas G. Linn, Gregor Reid, and Gregory B. Gloor. “ANOVA-like Differential Expression (ALDEx) Analysis for Mixed Population RNA-Seq.” PloS One 8, no. 7 (2013): e67019. http://dx.doi.org/10.1371/journal.pone.0067019.

  3. Friedman, Jonathan, and Eric J. Alm. “Inferring Correlation Networks from Genomic Survey Data.” PLoS Computational Biology 8, no. 9 (2012): e1002687. http://dx.doi.org/10.1371/journal.pcbi.1002687.

  4. Lovell, David, Vera Pawlowsky-Glahn, Juan José Egozcue, Samuel Marguerat, and Jürg Bähler. “Proportionality: A Valid Alternative to Correlation for Relative Data.” PLoS Computational Biology 11, no. 3 (March 2015): e1004075. http://dx.doi.org/10.1371/journal.pcbi.1004075.