Our understanding of dynamical biological systems such as developmental processes or transitions into disease states is limited by our ability to reverse-engineer these systems using available data. Most medium- and high-throughput experimental protocols (e.g. single cell RNA-seq) are destructive in nature, generating cross-sectional time series in which it is not possible to track the progress of one cell through the system. In typical systems, individual cells progress at different rates and a sample’s experimental capture time will not accurately reflect how far it has progressed. In these cases, cross-sectional data can appear particularly noisy.

We propose a probabilistic model that uses smoothness assumptions to estimate and correct for this effect. Each cell is assigned a pseudotime that represents its progress through the system. These pseudotimes are related to but not determined by the cells’ experimental capture times. Replacing capture times with pseudotimes gives us a more representative view of the underlying system, improving downstream analyses.

We model the smoothness assumption on each gene’s expression profile \(y_g\) using a Gaussian process over pseudotime. Gene-specific parameters in the covariance function represent intrinsic measurement noise \(\omega_g\) and variation of the profile over time \(\psi_g\). Each sample’s pseudotime \(\tau_c\) is given a normal prior centred on its capture time. \[ \begin{aligned} y_{g} &\sim \mathcal{GP}(\phi_g, \Sigma_g) \\ \Sigma_g(\tau_1, \tau_2) &= \psi_g \Sigma_\tau(\tau_1, \tau_2) + \omega_g \delta_{\tau_1,\tau_2} \\ \log \psi_g &\sim \mathcal{N}(\mu_\psi, \sigma_\psi) \\ \log \omega_g &\sim \mathcal{N}(\mu_\omega, \sigma_\omega) \\ \Sigma_\tau(\tau_1, \tau_2) &= \textrm{Matern}_{3/2}\bigg(r=\frac{|\tau_1 - \tau_2|}{l}\bigg) = (1 + \sqrt{3}r) \exp[-\sqrt{3}r] \\ \tau_c &\sim \mathcal{N}(k_c, \sigma_\tau) \end{aligned} \] This model is effectively a one-dimensional Gaussian process latent variable model with a structured prior on the latent variable (pseudotime).

Guo et al. assayed single cell expression values at 7 time points in mouse embryonic cells and the data is contained in the `DeLorean`

package. We will load this data and analyse a subset of it corresponding to the epiblast lineage. First we must build a `de.lorean`

object with the correct data. Load the data.

```
library(DeLorean)
library(dplyr)
data(GuoDeLorean)
# Limit number of cores to 2 for CRAN
options(DL.num.cores=min(default.num.cores(), 2))
```

Create a de.lorean object from the full data set.

`dl <- de.lorean(guo.expr, guo.gene.meta, guo.cell.meta)`

Estimate hyperparameters for the model from the whole data set. Here we set the width of the normal prior on the pseudotimes to be `0.5`

.

```
dl <- estimate.hyper(
dl,
sigma.tau=0.5,
length.scale=1.5,
model.name='exact-sizes')
```

DeLorean also offers slight variations of the model, we could use `model.name='lowrank'`

or `model.name='exact'`

. See the documentation for `estimate.hyper`

for more details.

Choose a few cells from each capture point from the epiblast lineage.

```
num.at.each.stage <- 5
epi.sampled.cells <- guo.cell.meta %>%
filter(capture < "32C" |
"EPI" == cell.type |
"ICM" == cell.type) %>%
group_by(capture) %>%
do(sample_n(., num.at.each.stage))
dl <- filter_cells(dl, cells=epi.sampled.cells$cell)
```

We only have data for a few genes and can easily model them all which is typical for qPCR data. RNA-seq data often has far too many genes for the model to fit. In any case most are probably irrelevant. In these cases we recommend an analysis of variance across the capture times to choose those genes whose means vary most across time. These are most likely to be relevant for fitting the model.

`dl <- aov.dl(dl)`

The most temporally variable genes (by p-value) are at the head of the result:

`head(dl$aov)`

```
## # A tibble: 6 x 7
## gene term df sumsq meansq statistic p.value
## <fctr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Tspan8 capture 6 318.94068 53.15678 49.64417 1.214143e-13
## 2 Gata4 capture 6 681.80135 113.63356 48.63760 1.572470e-13
## 3 Sox2 capture 6 427.32840 71.22140 46.67877 2.637983e-13
## 4 Grhl1 capture 6 763.76926 127.29488 44.08173 5.402009e-13
## 5 DppaI capture 6 485.72076 80.95346 40.90517 1.368646e-12
## 6 Tcfap2c capture 6 77.95811 12.99302 39.01037 2.457585e-12
```

The least temporally variable genes (by p-value) are at the tail of the result:

`tail(dl$aov)`

```
## # A tibble: 6 x 7
## gene term df sumsq meansq statistic p.value
## <fctr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Cdx2 capture 6 41.53234 6.922057 2.4769523 0.04763159
## 2 Esrrb capture 6 50.05827 8.343046 2.1950267 0.07348559
## 3 Tcf23 capture 6 80.61247 13.435411 2.1187753 0.08267670
## 4 Msx2 capture 6 61.07005 10.178341 2.0900376 0.08643484
## 5 Atp12a capture 6 24.90279 4.150465 1.8682503 0.12183422
## 6 Tcfap2a capture 6 28.01346 4.668910 0.9040465 0.50604553
```

and for instance you could run the model on the 20 most variable genes by executing

`dl <- filter_genes(dl, genes=head(dl$aov, 20)$gene)`

otherwise do not call `filter_genes`

and `DeLorean`

will use all the genes.

Now we have the data we can fit our model using Stan’s ADVI variational Bayes algorithm. To run the No-U-Turn sampler use `method='sample'`

.

`dl <- fit.dl(dl, method='vb')`

If running a sampler, Stan provides \(\hat{R}\) statistics that can aid detecting convergence problems. This makes no sense for ADVI but we show how to produce the boxplots here for users of the samplers.

```
dl <- examine.convergence(dl)
plot(dl, type='Rhat')
```

Plot the pseudotimes from the best sample (best in the sense of highest likelihood). The prior means for the capture points are shown as dashed lines.

`plot(dl, type='pseudotime')`

Plot the expression data over the pseudotimes from the best sample.

```
dl <- make.predictions(dl)
plot(dl, type='profiles')
```