An Introduction to Proportionality

Thomas Quinn

2018-05-09

Introduction

The bioinformatic evaluation of gene co-expression often begins with correlation-based analyses. However, correlation lacks validity when applied to relative data. This includes data generated by high-throughput RNA-sequencing, chromatin immunoprecipitation (ChIP), ChIP-sequencing, Methyl-Capture sequencing, and more. This package implements three measures of proportionality to measure dependence between relative features using compositional data analysis: \(\phi\) (Lovell 2015), \(\rho\) (Lovell 2015; Erb 2016), and \(\phi_s\) (Erb 2016; Quinn 2017). Unlike correlation, proportionality gives the same result for both relative and absolute data. Meanwhile, pairs that are strongly proportional in relative space are also strongly correlated in absolute space. Proportionality helps avoid the pitfall of spurious correlation.

Theory

Let \(A_i\) and \(A_j\) each represent a log-ratio transformed feature vector measured across \(N\) samples. We then define the metrics \(\phi\), \(\rho\), and \(\phi_s\):

\[\phi(A_i, A_j) = \frac{var(A_i - A_j)}{var(A_i)}\]

\[\rho(A_i, A_j) = 1 - \frac{var(A_i - A_j)}{var(A_i) + var(A_j)}\]

\[\phi_s(A_i, A_j) = \frac{var(A_i - A_j)}{var(A_i + A_j)}\]

Log-ratio transformation is applied to each subject vector, \(x\), containing \(D\) features (e.g., genes). In this vignette, we consider the centered log-ratio (clr) transformation and the additive log-ratio (alr) transformation. In clr-transformation, sample vectors undergo a transformation based on the logarithm of the ratio between the individual elements and the geometric mean of the vector, \(g(\textrm{x}) = \sqrt[D]{x_i...x_D}\). In alr-transformation, sample vectors undergo a transformation based on the logarithm of the ratio between the individual elements and chosen reference feature, \(x_D\). We define these accordingly:

\[\textrm{clr(x)} = \left[\ln\frac{x_i}{g(\textrm{x})};...;\ln\frac{x_D}{g(\textrm{x})}\right]\]

\[\textrm{alr(x)} = \left[\ln\frac{x_i}{x_D};...;\ln\frac{x_{D-1}}{x_D}\right]\]

Note that this package also implements the interquartile log-ratio transformation (iqlr) as used by the ALDEx2 package. For more information on integrating propr with ALDEx2, we refer the reader to the “Frequently Asked Questions” vignette.

Calculating proportionality

We provide three principal functions for calculating proportionality. The first, phit, calculates \(\phi\). The second, perb, calculates \(\rho\). The third, phis, calculates \(\phi_s\).

The measures \(\phi\) and \(\rho\) differ in three ways. First, the values of \(\phi\) range from \([0, \infty)\) (with lower \(\phi\) values indicating more proportionality) while the values of \(\rho\) range from \([-1, 1]\) (with greater \(|\rho|\) values indicating more proportionality and negative \(\rho\) values indicating inverse proportionality). Second, \(\phi\) lacks symmetry, although one can force symmetry by reflecting the lower left triangle of the matrix across the diagonal (toggled by the argument symmetrize = TRUE). Third, \(\rho\) corrects for the individual variance of each feature in the pair, rather than for just one of the features. On the other hand, \(\phi_s\) is a naturally symmetric variant of \(\phi\) that also corrects for the individual variance of each feature in the pair.

Let us begin by building an arbitrary dataset of 4 features (e.g., genes) measured across 100 subjects. In this example, the feature pairs “a” and “b” will show proportional change as will the feature pairs “c” and “d”.

set.seed(12345)
N <- 100
X <- data.frame(a=(1:N), b=(1:N) * rnorm(N, 10, 0.1),
                c=(N:1), d=(N:1) * rnorm(N, 10, 1.0))

Let \(D\) represent any number of features measured across \(N\) observations exposed to a binary or continuous event \(E\). For example, \(E\) could represent differences in case-control status, treatment status, treatment dose, or time. These functions convert a “count matrix” with \(N\) rows and \(D\) columns into a proportionality matrix of \(D\) rows and \(D\) columns measuring proportionality for each feature pair. One can think of this matrix as analogous to a dissimilarity matrix (in the case of \(\phi\)) or a correlation matrix (in the case of \(\rho\)). Both functions return the proportionality matrix bundled within an object of the class propr. This object contains four key slots:

library(propr)
phi <- phit(X, symmetrize = TRUE)
rho <- perb(X, ivar = 0)
phs <- phis(X, ivar = 0)

Note that the log-ratio transformation, by its nature, fails if the input data contain any zero values. To avoid an error in this case, these functions automatically replace all zero values with 1. However, the topic of zero replacement is controversial. Proceed carefully when analyzing data that contain zero values.

Subsetting propr objects

We have provided methods for indexing and subsetting objects belonging to the propr class. Using the familiar [ method, we can efficiently index the proportionality matrix (i.e., the @matrix slot) based on an inequality operator and a reference value.

In this first example, we use [ to index the matrix by \(\rho > .95\). This indexes the location of all values satisfying that inequality (i.e., in the lower left triangle of the matrix), and saves those indices to the @pairs slot. Indexing helps guide some of the bundled visualization methods in lieu of copy-on-modify subsetting. Note that indexing an already indexed object appends the new index to the previous index (via a union merge).

rho99 <- rho[">", .95]
rho99@pairs
## [1]  2 12

Alternatively, using the subset method, we can subset an entire propr object based on a vector of feature indices or names. However, this method does copy-on-modify the proportionality matrix, making it potentially unsuitable for large datasets.

In this second example, we subset by the feature names “a” and “b”.

rhoab <- subset(rho, select = c("a", "b"))
rhoab@matrix
##           a         b
## a 1.0000000 0.9999151
## b 0.9999151 1.0000000

The convenience function, simplify, can subset an entire propr object based on the index saved in its @pairs slot. This function converts the saved index into a paired list of coordinates and passes them along to the subset method. As such, this method does copy-on-modify the proportionality matrix, making it potentially unsuitable for large datasets. Unlike subset, simplify returns an object with the @pairs slot updated. Most users will find simplify preferable to subset.

simplify(rho99)
## @counts summary: 100 subjects by 4 features
## @logratio summary: 100 subjects by 4 features
## @matrix summary: 4 features by 4 features
## @pairs summary: 2 feature pairs

Visualizing pairs

Each feature belonging to a highly proportional data pair should have about the same log-ratio transformed abundance across all subjects. The method plot (or, equivalently, smear) provides a means by which to visually inspect whether this holds true. Since this function will plot all pairs unless indexed with the [ method, we recommend the user first index or subset the propr object before plotting. A “noisy” relationship between some feature pairs could suggest that the proportionality cutoff is too lenient. We include this plot as a handy “sanity check” when working with high-dimensional datasets.

plot(rho99)
## Alert: Generating plot using indexed feature pairs.

Computational burden

High-throughput genomic sequencing has the ability to measure tens of thousands of features for each subject. Since calculating proportionality generates a matrix sized \(D^2\), this method uses a lot of RAM when applied to real biological datasets. To address this, propr harnesses the power of C++ (via the Rcpp package) to minimize the run-time and RAM overhead. Below, we provide a table that estimates the approximate amount of RAM needed to render a proportionality matrix based on the number of features studied. The user should account for up to 25% more RAM for subsequent [ indexing and visualization.

Features Peak RAM (MiB)
1000 8
2000 31
4000 123
8000 491
16000 1959
24000 4405
32000 7829
64000 31301
100000 76406

Limitations

Although we developed this package with biological count data in mind, compositional count data is not truly compositional in that count data contain integer values only. As such, measuring “Gene A” as \(1\) in one subject and “Gene B” as \(2\) in another subject (i.e., the feature vector \([1, 2]\)) does not carry the same information as measuring “Gene A” as \(1000\) in one subject and “Gene B” as \(2000\) in another subject (i.e., the feature vector \([1000, 2000]\)) due to how additive variation affects the relative abundance of small counts more than large counts (Quinn 2017). We advise the investigator to proceed with caution when analyzing low counts.

References

  1. Erb, Ionas, and Cedric Notredame. “How Should We Measure Proportionality on Relative Gene Expression Data?” Theory in Biosciences = Theorie in Den Biowissenschaften 135, no. 1-2 (June 2016): 21-36. http://dx.doi.org/10.1007/s12064-015-0220-8.

  2. Lovell, David, Vera Pawlowsky-Glahn, Juan José Egozcue, Samuel Marguerat, and Jürg Bähler. “Proportionality: A Valid Alternative to Correlation for Relative Data.” PLoS Computational Biology 11, no. 3 (March 2015): e1004075. http://dx.doi.org/10.1371/journal.pcbi.1004075.

  3. Quinn, Thomas, Mark F. Richardson, David Lovell, and Tamsyn Crowley. “Propr: An R-Package for Identifying Proportionally Abundant Features Using Compositional Data Analysis.” bioRxiv, February 1, 2017, 104935. http://dx.doi.org/10.1101/104935.