## A Basic Deconvolution Example

In this vignette we will work through a simple example of deconvolving cell type proportions from DNA microarray data. We work with a data set created from rats and introduced by Shen-Orr et al. This is available on GEO with accession GSE19830. The data set we will work with is subset of the Shen-Orr data and is included in the dtangle package under the name shen_orr_ex. Alternatively, we can access this and other data sets data set through the supplementary dtangle.data package we have made available here. More information about the data set is available as part of the R help, ?shen_orr_ex. First load up the data set.

library('dtangle')
names(shen_orr_ex)

## [1] "data"       "annotation" "name"


In this data set rat brain, liver and lung cells have been mixed together in various proportions the resulting mixtures were analyzed with DNA microarrays. The mixing proportions are encoded in the mixture matrix

truth = shen_orr_ex$annotation$mixture

##           Liver Brain Lung
## GSM495209     1     0    0
## GSM495210     1     0    0
## GSM495211     1     0    0
## GSM495212     0     1    0
## GSM495213     0     1    0
## GSM495214     0     1    0


Each row of this matrix is a sample and each column gives the mixing proportions of the cell types in each sample. From this we can extract out the pure samples of each of the three cell types.

pure_samples <- lapply(1:3, function(i) {
which(truth[, i] == 1)
})
names(pure_samples) = colnames(truth)
pure_samples

## $Liver ## GSM495209 GSM495210 GSM495211 ## 1 2 3 ## ##$Brain
## GSM495212 GSM495213 GSM495214
##         4         5         6
##
## $Lung ## GSM495215 GSM495216 GSM495217 ## 7 8 9  The RMA-summarized gene expression data generated as part of the Shen-Orr experiment is accessible under data$log,

Y <- shen_orr_ex$data$log
Y[1:4,1:4]

##           X1367566_at X1367568_a_at X1367570_at X1367584_at
## GSM495209    3.396192      7.685769    5.722330    6.628653
## GSM495210    2.882626      7.759002    6.005583    6.771917
## GSM495211    3.072980      7.598871    5.741630    6.564820
## GSM495212    3.168440      7.209959    6.396841    7.040779


Each row is a different individual and each column is a particular gene. The values of the matrix are $$\log_2$$ RMA-summarized gene expressions.

The first step in running dtangle is to identify marker genes for each cell type. These may be provided by the scientist if they are already known or may be determined by dtangle or another algorithm. To find marker genes using dtangle we pass the following arugments to the find_markers function:

1. the data matrix, Y,

2. the list of pure samples for each type, pure_samples,

3. the data type, data_type,

4. the method used to rank markers, marker_method.

marker_list = find_markers(Y,pure_samples,data_type="microarray-gene",marker_method='ratio')


The function find_markers will determine which genes to use as markers of each cell type. The function returns a list of two elements. The first element is L which is a list where the $$i^{th}$$ element is a vector of marker indices (columns of $$Y$$) for the $$i^{th}$$ type ranked in decreasing order of utility (best markers listed first) according to the chosen method.

lapply(marker_list$L,head)  ## [[1]] ## [1] 120 253 479 446 529 90 ## ## [[2]] ## [1] 180 95 429 14 451 430 ## ## [[3]] ## [1] 1 168 481 109 454 206  The second element of the list returned by find_markers is a list V (with the same structure as L). For each of the indices in L the list V contains the marker score as determined by the method utilized in find_markers. Larger values are better. The meaning of this value in V depends on the choice of marker_method. For marker_method=ratio (the default method) the values in V are the ratio of the estimated amount the particular gene is expressed in each cell type over the sum of the expression of this gene in all other cell types. lapply(marker_list$V,head)

## [[1]]
## [1] 242.4604 222.9145 215.2806 212.6259 190.3624 179.9314
##
## [[2]]
## [1] 390.1903 294.5253 289.4251 211.6035 179.4170 158.1940
##
## [[3]]
## [1] 599.3597 415.0520 399.9639 381.2483 374.7274 256.4239


After we have ranked our markers with find_markers we need to determine how many markers to use for each cell type. The simplest way to do this is to choose, say, the top 10% of all marker genes for each type.

q = .1
quantiles = lapply(marker_list$V,function(x)quantile(x,1-q)) K = length(pure_samples) n_choose = sapply(1:K,function(i){max(which(marker_list$V[[i]] > quantiles[[i]]))})
n_choose

## [1] 20 20 20


Now that we have ranked the genes as markers for each type and chosen how many marker genes to use for each cell type we can run the dtangle deconvolution algorithm.

marks = marker_list$L dc <- dtangle(Y,pure_samples,n_choose,data_type='microarray-gene',markers=marks)  providing to the dtangle function the arguments: 1. Y, our microarray data 2. the list of pure_samples 3. the number of markers to use for each cell type, n_choose 4. the data_type 5. the list of ranked markers (output from find_markers) to the markers argument. The dtangle function returns to us a list with elements 1. estimates, the estimated mixing proportions for each type for each sample 2. markers, the markers used for each cell type 3. n_choose, how many markers we used for each cell type 4. gamma, the value of the sensitivity parameter used. dc  ##$estimates
##                   [,1]         [,2]         [,3]
## GSM495209 0.9993246993 0.0003234491 0.0003518517
## GSM495210 0.9993615673 0.0003062240 0.0003322088
## GSM495211 0.9992811725 0.0003360633 0.0003827643
## GSM495212 0.0003185096 0.9994276052 0.0002538853
## GSM495213 0.0003218458 0.9994205963 0.0002575578
## GSM495214 0.0003147015 0.9994680759 0.0002172226
## GSM495215 0.0001288243 0.0002892625 0.9995819132
## GSM495216 0.0001221291 0.0003182323 0.9995596386
## GSM495217 0.0001288553 0.0003294425 0.9995417022
## GSM495218 0.0192531359 0.2558945703 0.7248522938
## GSM495219 0.0214541993 0.2559384615 0.7226073392
## GSM495220 0.0204171551 0.2507900905 0.7287927544
## GSM495221 0.7167919114 0.0414526933 0.2417553953
## GSM495222 0.7105361086 0.0415612499 0.2479026415
## GSM495223 0.7179906568 0.0385809496 0.2434283936
## GSM495224 0.1930051131 0.7851689141 0.0218259728
## GSM495225 0.1843333285 0.7936060070 0.0220606645
## GSM495226 0.1966666812 0.7812835228 0.0220497960
## GSM495227 0.6922122805 0.2760794970 0.0317082225
## GSM495228 0.6952670365 0.2733095167 0.0314234468
## GSM495229 0.6916797759 0.2762819556 0.0320382685
## GSM495230 0.4384612509 0.5001041434 0.0614346057
## GSM495231 0.4405804738 0.4947203283 0.0646991979
## GSM495232 0.4454183226 0.4927342521 0.0618474253
## GSM495233 0.5670251353 0.2067197106 0.2262551541
## GSM495234 0.5741769982 0.2023947529 0.2234282490
## GSM495235 0.5692060787 0.2035292004 0.2272647209
## GSM495236 0.5079495806 0.3240765201 0.1679738993
## GSM495237 0.5110793396 0.3204845439 0.1684361165
## GSM495238 0.5115437773 0.3237950257 0.1646611969
## GSM495239 0.5653753667 0.3196889291 0.1149357042
## GSM495240 0.5581525417 0.3259303424 0.1159171159
## GSM495241 0.5598210657 0.3240142598 0.1161646745
## GSM495242 0.4993009659 0.4382637175 0.0624353167
## GSM495243 0.5049839675 0.4318495632 0.0631664692
## GSM495244 0.4989121159 0.4353970891 0.0656907950
## GSM495245 0.5994200903 0.3751932645 0.0253866452
## GSM495246 0.5938136391 0.3794653529 0.0267210080
## GSM495247 0.5986306090 0.3741729394 0.0271964516
## GSM495248 0.6377419320 0.3587503014 0.0035077665
## GSM495249 0.6355750333 0.3610877624 0.0033372043
## GSM495250 0.6358441038 0.3608750179 0.0032808783
##
## $markers ##$markers[[1]]
##  [1] 120 253 479 446 529  90  61  66 174  77 316 162 215 551 488 473 574
## [18]  91 185  41
##
## $markers[[2]] ## [1] 180 95 429 14 451 430 56 396 598 349 165 313 259 23 593 12 21 ## [18] 30 151 84 ## ##$markers[[3]]
##  [1]   1 168 481 109 454 206 368 487 179 476 227 405 262 504 586 339 552
## [18] 362 447   7
##
##
## $n_choose ## [1] 20 20 20 ## ##$gamma
## [1] 0.6999978


We can plot our estimates against the known truth as follows

phats <- dc$estimates plot(truth,phats,xlab="Truth",ylab="Estimates",xlim=c(0,1),ylim=c(0,1)) abline(coef=c(0,1))  If desired, we can specify the value of the sensivity parameter gamma numerically instead of letting dtangle choose it based upon the data_type. For example, dc <- dtangle(Y,pure_samples,n_choose,gamma=.7,markers=marks) phats <- dc$estimates
plot(truth,phats,xlab="Truth",ylab="Estimates",xlim=c(0,1),ylim=c(0,1))
abline(coef=c(0,1))


We can view the pre-selected values for gamma for each data type by using the function get_gamma

get_gamma('microarray-probe')

## [1] 0.4522564

get_gamma('microarray-gene')

## [1] 0.6999978

get_gamma('rna-seq')

## [1] 0.9433902


We can also specify the number of markers to be the same for each cell type by providing a single number to n_choose

dc <- dtangle(Y,pure_samples,n_choose=5,data_type='microarray-gene',markers=marks)
phats <- dc$estimates plot(truth,phats,xlab="Truth",ylab="Estimates",xlim=c(0,1),ylim=c(0,1)) abline(coef=c(0,1))  Alternatively, we can manually specify the number of markers to use for each cell type manually, dc <- dtangle(Y,pure_samples,n_choose=c(5,6,7),data_type='microarray-gene',markers=marks) phats <- dc$estimates
plot(truth,phats,xlab="Truth",ylab="Estimates",xlim=c(0,1),ylim=c(0,1))
abline(coef=c(0,1))


We can test different methods of choosing markers by specifying the marker_method argument. Notice that if we don't calculate the markers in advance (e.g. by using find_markers) then dtangle handles the markers internally by using find_markers with method='ratio'. A description of the marker choosing methods can be found in the help pages for dtangle.

dc <- dtangle(Y, pure_samples,n_choose,data_type='microarray-gene',marker_method = 'ratio')
phats <- dc$estimate plot(truth,phats,xlab="Truth",ylab="Estimates",xlim=c(0,1),ylim=c(0,1)) abline(coef=c(0,1)) dc2 <- dtangle(Y, pure_samples,n_choose,data_type='microarray-gene',marker_method = 'diff') phats2 <- dc2$estimates
points(truth,phats2,col='blue')

dc3 <- dtangle(Y, pure_samples,n_choose,data_type='microarray-gene',marker_method = 'regression')
phats3 <- dc3\$estimates
points(truth,phats3,col='red')