# Introduction to cytometree

cytometree is a package that implements a binary tree algorithm for the analysis of cytometry data.

Its core algorithm is based on the construction of a binary tree, whose nodes represents cell subpopulations.

## Binary tree construction

1. At each node, observed cells (or “events”) and markers are modeled by both a normal distribution (so unimodal), and a mixture of 2 normal distributions (so bimodal).

2. Splitting of the events at each node is done according to a normalized difference of AIC between the two distributional fit (unimodal or bimodal), allowing to pick the marker that best splits those data.

3. When AIC noralized differences $$D$$ are not significant anymore, the tree has been constructed, and the cells have been automatically gated (i.e. partitioned).

## Post-hoc annotation

Given the unsupervised nature of the binary tree, some of the available markers may not be used to find the different cell populations present in a given sample. To recover a complete annotation, we defined, as a post processing procedure, an annotation method which allows the user to distinguish two (or three) expression levels per marker.

# Examples of analysis with cytometree

## Diffuse large B-cell lymphoma dataset with 3 markers

In this example, we will use a diffuse large B-cell lymphoma dataset (from the flowCAP-I challenge, with only 3 markers.

### Automatic gating step

First, we need to load the package cytometree and we can have a look at the data:

library(cytometree)
data(DLBCL)
dim(DLBCL)
## [1] 5524    4
head(DLBCL)
##   FL1 FL2 FL4 label
## 1 416 251 293     2
## 2 210  93 184     2
## 3 387 300 233     2
## 4 403 438 288     2
## 5 399 128 116     2
## 6 538 288 189     2

We have 3 markers measured FL1, FL2, FL4 as well as the label obtained from manual gating, and 5524 cells.

# getting the only the cell events with the 3 markers measured:
cellevents <- DLBCL[, c("FL1", "FL2", "FL4")]
# storing the maanual gating reference from FlowCAP-I:
manual_labels <- DLBCL[, "label"]

The function CytomeTree builds the binary tree according to our algorithm, that gives the automatic gating (given in the resulting labels attribute):

# Build the binary tree:
Tree <- CytomeTree(cellevents, minleaf = 1, t = 0.1)
# Retreive the resulting partition (i.e. automatic gating):
Tree_Partition <- Tree$labels One can fiddle with the t threshold to change the desired depth of the tree. The function plot_graph plots a graph representing this binary tree, showing which markers were used in which order to splits the events: # Plot a graph of the tree (with specific graphical parameters): plot_graph(Tree, edge.arrow.size = 0.3, Vcex = 0.8, vertex.size = 30) The function plot_nodes allows to graphically evaluate the fit of the gaussian family distribution at each node. It can also be used to investigate a particular node: # Plot the distribution fit for each node: plot_nodes(Tree) # Plot the distribution fit for a particular node: plot_nodes(Tree, "FL4.1") ### Post-processing annotation of automatically gated subpopulations The function Annotation annotates the subpopulations partitionned by the binary tree: # Run the annotation algorithm: Annot <- Annotation(Tree, plot = FALSE) Annot$combinations
##   FL1 FL2 FL4 leaves count   prop
## 1  NA   1   0      1  4905 0.8879
## 2  NA   0   1      2   590 0.1068
## 3  NA   1   1      3    29 0.0052

The post-procesing annotation can be compared to the uncomplete annotation obtained from the tree alone:

# Annotation gotten from the tree:
Tree$annotation ## FL1 FL2 FL4 labels count prop ## 1 NA NA 0 1 4905 0.8879 ## 2 NA 0 1 2 590 0.1068 ## 3 NA 1 1 3 29 0.0052 This completed annotation eases the search for specific cell types of interest, where 1 represents a postive measurement of a marker, while 0 corresponds to its absence: # seeked cell type: FL2+FL4+ pheno_ex1 <- RetrievePops(Annot, phenotypes = list(rbind(c("FL2", 1), c("FL4", 1)))) pheno_ex1$phenotypesinfo
## [[1]]
## [[1]]$phenotype ## [1] "FL2-1" "FL4-1" ## ## [[1]]$proportion
## [1] 0.0052
##
## [[1]]$label ## [1] 3 # One can look for several cell types at once: phenotypes <- list() # FL2+FL4- phenotypes[[1]] <- rbind(c("FL2", 1), c("FL4", 0)) # FL2-FL4+ phenotypes[[2]] <- rbind(c("FL2", 0), c("FL4", 1)) # Retreive corresponding cell populations: PhenoInfos <- RetrievePops(Annot, phenotypes) PhenoInfos$phenotypesinfo
## [[1]]
## [[1]]$phenotype ## [1] "FL2-1" "FL4-0" ## ## [[1]]$proportion
## [1] 0.8879
##
## [[1]]$label ## [1] 1 ## ## ## [[2]] ## [[2]]$phenotype
## [1] "FL2-0" "FL4-1"
##
## [[2]]$proportion ## [1] 0.1068 ## ## [[2]]$label
## [1] 2

### Comparison between manual and automatic gating

Finally, we can compare the automatic gating obtained using cytometree to the original manual gating (which had 47 events unconsitantly gated accross different operators performing the manual gating on those very same data - so those 47 cells were identified as outliers in the reference label and were kept out of the evaluation criteria in the flowCAP-I challenge):

## Loading required package: viridisLite

## PBMC sample from the Human Immunology Project

In this example, we will use a dataset from the HIPC (Human Immunology Project Consortium program where a PBMC sample from patient 12828 was analysed by Stanford.

### Automatic gating step

First, we need to load the package cytometree and we can have a look at the data:

data(HIPC)
dim(HIPC)
## [1] 33992     7
head(HIPC)
##           CCR7       CD4   CD45RA      HLADR      CD38      CD8 label
## [1,] 1561.2429 1052.6527 3169.944  889.99200 1326.2701 2733.810     1
## [2,] 1362.2820  978.1219 2826.469 1810.43335 1294.4667 3054.555     1
## [3,] 1321.3787 1180.9497 2780.128  -57.02759  211.0515 2450.657     1
## [4,]  779.8231 1097.8448 3054.914  -67.70583  579.6187 1898.014     1
## [5,]  956.2596  942.8704 2612.188   60.35059 1776.3430 3262.799     1
## [6,] 1100.1621 1080.4724 2560.610  702.39624 1179.5389 2836.238     1

We have 6 markers measured CCR7, CD4, CD45RA, HLADR, CD38, CD8 as well as the label obtained from manual gating, and 33992 cell.

# getting the only the cell events with the 3 markers measured:
cellevents <- HIPC[, c("CCR7", "CD4", "CD45RA", "HLADR", "CD38", "CD8")]
# storing the maanual gating reference from FlowCAP-I:
manual_labels <- HIPC[, "label"]

The function CytomeTree builds the binary tree according to our algorithm, that gives the automatic gating (given in the resulting labels attribute):

# Build the binary tree:
Tree <- CytomeTree(cellevents, minleaf = 1, t = 0.2)
# Retreive the resulting partition (i.e. automatic gating):
Tree_Partition <- Tree$labels One can fiddle with the t threshold to change the desired depth of the tree. The function plot_graph plots a graph representing this binary tree, showing which markers were used in which order to splits the events: # Plot a graph of the tree (with specific graphical parameters): plot_graph(Tree, edge.arrow.size = 0.4, Vcex = 0.45) The function plot_nodes allows to graphically evaluate the fit of the gaussian family distribution at each node. It can also be used to investigate a particular node: # Plot the fit for 2 specific nodes: plot_nodes(Tree, c("CD4.1", "CD45RA.7")) ### Post-processing annotation of automatically gated subpopulations The function Annotation annotates the subpopulations partitionned by the binary tree: # Run the annotation algorithm: Annot <- Annotation(Tree, plot = FALSE) Annot$combinations
##    CCR7 CD4 CD45RA HLADR CD38 CD8 leaves count   prop
## 5     1   1      1     0    1   0      5 12123 0.3566
## 4     1   1      0     0    0   0      4  8213 0.2416
## 3     1   0      1     0    1   1      3  4589 0.1350
## 8     0   1      0     0    0   0      8  4073 0.1198
## 1     0   0      0     0    0   1      1  3136 0.0923
## 2     0   0      1     0    0   1      2   831 0.0244
## 9     0   1      0     0    1   0      9   478 0.0141
## 6     0   0      0     1    1   1      6   193 0.0057
## 10    0   1      0     1    1   0     10   142 0.0042
## 7     1   0      0     1    1   1      7   102 0.0030
## 11    0   1      0     1    0   0     11    84 0.0025
## 12    0   1      1     1    0   0     12    28 0.0008

This completed annotation eases the search for specific cell types of interest, where 1 represents a postive measurement of a marker, while 0 corresponds to its absence:

# One can look for several cell types at once:
phenotypes <- list()
# CD8-CD4+CCR7-CD45RA+
CD4effector <- rbind(c("CD8", 0), c("CD4", 1), c("CCR7", 0), c("CD45RA", 1))
phenotypes[[1]] <- CD4effector

# CD8-CD4+CCR7+CD45RA+
CD4naive <- rbind(c("CD8", 0), c("CD4", 1), c("CCR7", 1), c("CD45RA", 1))
phenotypes[[2]] <- CD4naive

# CD8-CD4+CCR7+CD45RA-
CD4CM <- rbind(c("CD8", 0), c("CD4", 1), c("CCR7", 1), c("CD45RA", 0))
phenotypes[[3]] <- CD4CM

# CD8-CD4+CCR7+CD45RA+
CD4effectorM <- rbind(c("CD8", 0), c("CD4", 1), c("CCR7", 0), c("CD45RA", 0))
phenotypes[[4]] <- CD4effectorM

# Retreive corresponding cell populations:
PhenoInfos <- RetrievePops(Annot, phenotypes)
# One can find informations about the seeked phenotypes in
PhenoInfos$phenotypesinfo ## [[1]] ## [[1]]$phenotype
## [1] "CD8-0"    "CD4-1"    "CCR7-0"   "CD45RA-1"
##
## [[1]]$proportion ## [1] 8e-04 ## ## [[1]]$label
## [1] 12
##
##
## [[2]]
## [[2]]$phenotype ## [1] "CD8-0" "CD4-1" "CCR7-1" "CD45RA-1" ## ## [[2]]$proportion
## [1] 0.3566
##
## [[2]]$label ## [1] 5 ## ## ## [[3]] ## [[3]]$phenotype
## [1] "CD8-0"    "CD4-1"    "CCR7-1"   "CD45RA-0"
##
## [[3]]$proportion ## [1] 0.2416 ## ## [[3]]$label
## [1] 4
##
##
## [[4]]
## [[4]]$phenotype ## [1] "CD8-0" "CD4-1" "CCR7-0" "CD45RA-0" ## ## [[4]]$proportion
## [1] 0.1406
##
## [[4]]$Mergedlabels ## [1] 8 9 10 11 ## ## [[4]]$Newlabel
## [1] 13
# According to PhenoInfos$phenotypesinfo, CD8-CD4+CCR7-CD45RA+ population is # labeled 12 According to PhenoInfos$phenotypesinfo, CD8-CD4+CCR7+CD45RA+
# population is labeled 5 According to PhenoInfos$phenotypesinfo, # CD8-CD4+CCR7+CD45RA- population is labeled 4 According to # PhenoInfos$phenotypesinfo, CD8-CD4+CCR7+CD45RA+ population is comprised of
# 4 clusters labeled 8,9,10, 11. They were merged into a new cluster labeled
# 13.

# The new partition, with merged clusters is to be found in
# PhenoInfos$Mergedleaves. auto_labels_merged <- PhenoInfos$Mergedleaves

# Get the indices of the subpopulation comprised of classes labeled 12, 5,
# 4,13.
subpopulation_indices <- auto_labels_merged %in% c(12, 5, 4, 13)

# Scatter plot the data.
CD45RA <- cellevents[subpopulation_indices, "CD45RA"]
CCR7 <- cellevents[subpopulation_indices, "CCR7"]
auto_lab <- factor(auto_labels_merged[subpopulation_indices])
levels(auto_lab) <- viridis::viridis(length(levels(auto_lab)))
auto_lab <- as.character(auto_lab)
plot(CD45RA, CCR7, col = auto_lab, main = "Automatic gating")