MW-HR-SIP

Nick Youngblut

2018-05-14


Introduction

HR-SIP method workflow:

MW-HR-SIP method workflow:

Dataset

First, let’s load some packages including HTSSIP.

library(dplyr)
library(ggplot2)
library(HTSSIP)

See HTSSIP introduction vignette for a description on why dataset parsing (all treatment-control comparisons) is needed.

Let’s see the already parsed dataset

physeq_S2D2_l
## $`(Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Cel' & Day == '3')`
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1072 taxa and 46 samples ]
## sample_data() Sample Data:       [ 46 samples by 17 sample variables ]
## tax_table()   Taxonomy Table:    [ 1072 taxa by 8 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1072 tips and 1071 internal nodes ]
## 
## $`(Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Cel' & Day == '14')`
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1072 taxa and 46 samples ]
## sample_data() Sample Data:       [ 46 samples by 17 sample variables ]
## tax_table()   Taxonomy Table:    [ 1072 taxa by 8 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1072 tips and 1071 internal nodes ]
## 
## $`(Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Glu' & Day == '14')`
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1072 taxa and 47 samples ]
## sample_data() Sample Data:       [ 47 samples by 17 sample variables ]
## tax_table()   Taxonomy Table:    [ 1072 taxa by 8 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1072 tips and 1071 internal nodes ]
## 
## $`(Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Glu' & Day == '3')`
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1072 taxa and 46 samples ]
## sample_data() Sample Data:       [ 46 samples by 17 sample variables ]
## tax_table()   Taxonomy Table:    [ 1072 taxa by 8 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1072 tips and 1071 internal nodes ]

HR-SIP

Let’s set some parameters used later.

# adjusted P-value cutoff 
padj_cutoff = 0.1
# number of cores for parallel processing (increase depending on your computational hardware)
ncores = 2

One treatment-control comparison

First, we’ll just run HR-SIP on 1 treatment-control comparison. Let’s get the individual phyloseq object.

physeq = physeq_S2D2_l[[1]]
physeq
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1072 taxa and 46 samples ]
## sample_data() Sample Data:       [ 46 samples by 17 sample variables ]
## tax_table()   Taxonomy Table:    [ 1072 taxa by 8 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1072 tips and 1071 internal nodes ]

Let’s check that the samples belong to either a 13C-treatment or 12C-control.

physeq %>% sample_data %>% .$Substrate %>% table
## .
## 12C-Con 13C-Cel 
##      23      23

OK, we should be ready to run HR-SIP!

Note that the design parameter for HRSIP() is the experimental design parameter for calculating log2 fold change (l2fc) values with DESeq. Here, it’s used to distinguish label-treatment and unlabel-control samples.

df_l2fc = HRSIP(physeq,
                design = ~Substrate,
                padj_cutoff = padj_cutoff,
                sparsity_threshold = c(0,0.15,0.3))   # just using 3 thresholds to reduce time
## Sparsity threshold: 0 
## Density window: 1.7-1.75 
## Sparsity threshold: 0.15 
## Density window: 1.7-1.75 
## Sparsity threshold: 0.3 
## Density window: 1.7-1.75 
## Sparsity threshold with the most rejected hypotheses: 0
df_l2fc %>% head(n=3)
## # A tibble: 3 x 17
##       OTU log2FoldChange           p       padj    Rank1            Rank2
##     <chr>          <dbl>       <dbl>      <dbl>   <fctr>           <fctr>
## 1 OTU.514     -0.1326734 0.744892882 0.99999999 Bacteria __Proteobacteria
## 2 OTU.729      1.3853310 0.048577337 0.42707575 Bacteria  __Acidobacteria
## 3 OTU.322      1.5298136 0.003258379 0.09756653 Bacteria  __Acidobacteria
## # ... with 11 more variables: Rank3 <fctr>, Rank4 <fctr>, Rank5 <fctr>,
## #   Rank6 <fctr>, Rank7 <fctr>, Rank8 <fctr>, density_min <dbl>,
## #   density_max <dbl>, sparsity_threshold <dbl>, sparsity_apply <chr>,
## #   l2fc_threshold <dbl>

How many “incorporators”" (rejected hypotheses)?

df_l2fc %>% 
  filter(padj < padj_cutoff) %>%
  group_by() %>%
  summarize(n_incorp_OTUs = OTU %>% unique %>% length)
## # A tibble: 1 x 1
##   n_incorp_OTUs
##           <int>
## 1            36

Let’s plot a breakdown of incorporators for each phylum.

# summarizing
df_l2fc_s = df_l2fc %>% 
  filter(padj < padj_cutoff) %>%
  mutate(Rank2 = gsub('^__', '', Rank2)) %>%
  group_by(Rank2) %>%
  summarize(n_incorp_OTUs = OTU %>% unique %>% length) %>%
  ungroup()

# plotting
ggplot(df_l2fc_s, aes(Rank2, n_incorp_OTUs)) +
    geom_bar(stat='identity') +
    labs(x='Phylum', y='Number of incorporators') +
    theme_bw() +
    theme(
      axis.text.x = element_text(angle=45, hjust=1)
    )

All treatment-control comparisons

Let’s now run HR-SIP on all treatment-control comparisons in the dataset:

# Number of comparisons
physeq_S2D2_l %>% length
## [1] 4

The function plyr::ldply() is useful (compared to lapply()) beccause it can be run in parallel and returns a data.frame object.

# Running in parallel; you may need to change the backend for your environment.
# Or you can just set .parallel=FALSE. 
doParallel::registerDoParallel(ncores)

df_l2fc = plyr::ldply(physeq_S2D2_l, 
                      HRSIP, 
                      design = ~Substrate, 
                      padj_cutoff = padj_cutoff,
                      sparsity_threshold = c(0,0.15,0.3),  # just using 3 thresholds to reduce run time
                      .parallel=TRUE)
df_l2fc %>% head(n=3)
##                                                                       .id
## 1 (Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Cel' & Day == '3')
## 2 (Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Cel' & Day == '3')
## 3 (Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Cel' & Day == '3')
##       OTU log2FoldChange           p       padj    Rank1            Rank2
## 1 OTU.514     -0.1326734 0.744892882 0.99999999 Bacteria __Proteobacteria
## 2 OTU.729      1.3853310 0.048577337 0.42707575 Bacteria  __Acidobacteria
## 3 OTU.322      1.5298136 0.003258379 0.09756653 Bacteria  __Acidobacteria
##                   Rank3                                Rank4
## 1 __Deltaproteobacteria                  __Desulfobacterales
## 2                __RB25 __uncultured_Acidobacteria_bacterium
## 3               __DA023               __uncultured_bacterium
##              Rank5        Rank6                  Rank7 Rank8 density_min
## 1 __Nitrospinaceae __uncultured __uncultured_bacterium  <NA>         1.7
## 2             <NA>         <NA>                   <NA>  <NA>         1.7
## 3             <NA>         <NA>                   <NA>  <NA>         1.7
##   density_max sparsity_threshold sparsity_apply l2fc_threshold
## 1        1.75                  0            all           0.25
## 2        1.75                  0            all           0.25
## 3        1.75                  0            all           0.25

Each specific phyloseq subset (treatment-control comparison) is delimited with the “.id” column.

df_l2fc %>% .$.id %>% unique
## [1] "(Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Cel' & Day == '3')"  
## [2] "(Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Cel' & Day == '14')"
## [3] "(Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Glu' & Day == '14')"
## [4] "(Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Glu' & Day == '3')"

For clarity, let’s edit these long strings to make them more readable when plotted.

df_l2fc = df_l2fc %>%
  mutate(.id = gsub(' \\| ', '\n', .id))
df_l2fc %>% .$.id %>% unique
## [1] "(Substrate=='12C-Con' & Day=='3')\n(Substrate=='13C-Cel' & Day == '3')"  
## [2] "(Substrate=='12C-Con' & Day=='14')\n(Substrate=='13C-Cel' & Day == '14')"
## [3] "(Substrate=='12C-Con' & Day=='14')\n(Substrate=='13C-Glu' & Day == '14')"
## [4] "(Substrate=='12C-Con' & Day=='3')\n(Substrate=='13C-Glu' & Day == '3')"

How many incorporators (rejected hypotheses) & which sparsity cutoff was used for each comparison?

Note: you could set one sparsity cutoff for all comparisons by setting the sparsity_cutoff to a specific value.

df_l2fc %>% 
  filter(padj < padj_cutoff) %>%
  group_by(.id, sparsity_threshold) %>%
  summarize(n_incorp_OTUs = OTU %>% unique %>% length) %>%
  as.data.frame
##                                                                        .id
## 1 (Substrate=='12C-Con' & Day=='14')\n(Substrate=='13C-Cel' & Day == '14')
## 2 (Substrate=='12C-Con' & Day=='14')\n(Substrate=='13C-Glu' & Day == '14')
## 3   (Substrate=='12C-Con' & Day=='3')\n(Substrate=='13C-Cel' & Day == '3')
## 4   (Substrate=='12C-Con' & Day=='3')\n(Substrate=='13C-Glu' & Day == '3')
##   sparsity_threshold n_incorp_OTUs
## 1                0.3            97
## 2                0.0           189
## 3                0.0            36
## 4                0.3            65

How about a breakdown of incorporators for each phylum in each comparision.

# summarizing
df_l2fc_s = df_l2fc %>% 
  filter(padj < padj_cutoff) %>%
  mutate(Rank2 = gsub('^__', '', Rank2)) %>%
  group_by(.id, Rank2) %>%
  summarize(n_incorp_OTUs = OTU %>% unique %>% length) %>%
  ungroup()

# plotting
ggplot(df_l2fc_s, aes(Rank2, n_incorp_OTUs)) +
    geom_bar(stat='identity') +
    labs(x='Phylum', y='Number of incorporators') +
    facet_wrap(~ .id, scales='free') +
    theme_bw() +
    theme(
      axis.text.x = element_text(angle=55, hjust=1)
    )

MW-HR-SIP

MW-HR-SIP is run very similarly to HRSIP, but it uses multiple buoyant density (BD) windows. MW-HR-SIP is performed with the HRSIP() function, but multiple BD windows are specified.

Let’s use 3 buoyant density windows (g/ml):

1.70-1.73, 1.72-1.75, 1.74-1.77

windows = data.frame(density_min=c(1.70, 1.72, 1.74), 
                     density_max=c(1.73, 1.75, 1.77))
windows
##   density_min density_max
## 1        1.70        1.73
## 2        1.72        1.75
## 3        1.74        1.77

Running HRSIP with all 3 BD windows. Let’s run this in parallel to speed things up.

You can turn off parallel processing by setting the parallel option to FALSE. Also, there’s 2 different levels that could be parallelized: either the ldply() or HRSIP(). Here, I’m running HRSIP in parallel, but it may make sense in other situations (eg., many comparisons but few density windows and/or sparsity cutoffs) to use ldply in parallel only.

doParallel::registerDoParallel(ncores)

df_l2fc = plyr::ldply(physeq_S2D2_l, 
                      HRSIP, 
                      density_windows = windows,
                      design = ~Substrate, 
                      padj_cutoff = padj_cutoff,
                      sparsity_threshold = c(0,0.15,0.3), # just using 3 thresholds to reduce run time
                      .parallel = TRUE)
df_l2fc %>% head(n=3)
##                                                                       .id
## 1 (Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Cel' & Day == '3')
## 2 (Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Cel' & Day == '3')
## 3 (Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Cel' & Day == '3')
##       OTU log2FoldChange          p      padj    Rank1            Rank2
## 1 OTU.514    -0.06192716 0.64317475 1.0000000 Bacteria __Proteobacteria
## 2 OTU.729     1.40750346 0.11190135 0.6588185 Bacteria  __Acidobacteria
## 3 OTU.386     1.43831256 0.03408535 0.3867674 Bacteria  __Acidobacteria
##                   Rank3                                Rank4
## 1 __Deltaproteobacteria                  __Desulfobacterales
## 2                __RB25 __uncultured_Acidobacteria_bacterium
## 3                __RB25               __uncultured_bacterium
##              Rank5        Rank6                  Rank7 Rank8 density_min
## 1 __Nitrospinaceae __uncultured __uncultured_bacterium  <NA>         1.7
## 2             <NA>         <NA>                   <NA>  <NA>         1.7
## 3             <NA>         <NA>                   <NA>  <NA>         1.7
##   density_max sparsity_threshold sparsity_apply l2fc_threshold
## 1        1.73                  0            all           0.25
## 2        1.73                  0            all           0.25
## 3        1.73                  0            all           0.25

Let’s check that we have all treatment-control comparisons.

df_l2fc %>% .$.id %>% unique
## [1] "(Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Cel' & Day == '3')"  
## [2] "(Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Cel' & Day == '14')"
## [3] "(Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Glu' & Day == '14')"
## [4] "(Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Glu' & Day == '3')"

How many incorporators (rejected hypotheses) & which sparsity cutoff was used for each comparison?

Note: one sparsity cutoff could be set for all comparisons by setting the sparsity_cutoff to a specific value.

df_l2fc %>% 
  filter(padj < padj_cutoff) %>%
  group_by(.id, sparsity_threshold) %>%
  summarize(n_incorp_OTUs = OTU %>% unique %>% length) %>%
  as.data.frame
##                                                                         .id
## 1 (Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Cel' & Day == '14')
## 2 (Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Glu' & Day == '14')
## 3   (Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Cel' & Day == '3')
## 4   (Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Glu' & Day == '3')
##   sparsity_threshold n_incorp_OTUs
## 1                0.0           119
## 2                0.0           245
## 3                0.0            43
## 4                0.3            85

The density windows can vary for each OTU. Let’s plot which density windows were used for the OTUs in the dataset.

# summarizing
df_l2fc_s = df_l2fc %>% 
  mutate(.id = gsub(' \\| ', '\n', .id)) %>%
  filter(padj < padj_cutoff) %>%
  mutate(density_range = paste(density_min, density_max, sep='-')) %>% 
  group_by(.id, sparsity_threshold, density_range) %>%
  summarize(n_incorp_OTUs = OTU %>% unique %>% length) 

#plotting
ggplot(df_l2fc_s, aes(.id, n_incorp_OTUs, fill=density_range)) +
    geom_bar(stat='identity', position='fill') +
    labs(x='Control-treatment comparision', y='Fraction of incorporators') +
    scale_y_continuous(expand=c(0,0)) +
    theme_bw() +
    theme(
      axis.text.x = element_text(angle=55, hjust=1)
    )

Different BD windows were used for different treatment-control comparisons because the amount of BD shift likely varied among taxa. For example, if a taxon incorporates 100% 13C isotope, then a very ‘heavy’ BD window may show a larger l2fc than a less ‘heavy’ BD window.

Let’s look at a breakdown of incorporators for each phylum in each comparision.

# summarizing
df_l2fc_s = df_l2fc %>% 
  mutate(.id = gsub(' \\| ', '\n', .id)) %>%
  filter(padj < padj_cutoff) %>%
  mutate(Rank2 = gsub('^__', '', Rank2)) %>%
  group_by(.id, Rank2) %>%
  summarize(n_incorp_OTUs = OTU %>% unique %>% length) %>%
  ungroup()

# plotting
ggplot(df_l2fc_s, aes(Rank2, n_incorp_OTUs)) +
    geom_bar(stat='identity') +
    labs(x='Phylum', y='Number of incorporators') +
    facet_wrap(~ .id, scales='free') +
    theme_bw() +
    theme(
      axis.text.x = element_text(angle=55, hjust=1)
    )

Note that the MW-HR-SIP method identifies more incorporators than the HR-SIP method (which uses just one BD window).

MW-HR-SIP detects more taxa for 2 main reasons. First, taxa vary in G+C content, so using only 1 BD window likely encompasses BD shifts for taxa of certain G+C contents (eg., ~50% G+C), but may miss other taxa with higher or lower G+C content. Second, taxa can vary in how much isotope was incorporated, which will affect where each taxon’s DNA is in the density gradient.

Session info

sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
## LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] bindrcpp_0.2    phyloseq_1.20.0 HTSSIP_1.4.0    ggplot2_2.2.1  
## [5] tidyr_0.7.2     dplyr_0.7.4    
## 
## loaded via a namespace (and not attached):
##   [1] nlme_3.1-131               bitops_1.0-6              
##   [3] matrixStats_0.52.2         bit64_0.9-7               
##   [5] doParallel_1.0.11          RColorBrewer_1.1-2        
##   [7] rprojroot_1.2              GenomeInfoDb_1.12.3       
##   [9] tools_3.4.3                backports_1.1.1           
##  [11] R6_2.2.2                   coenocliner_0.2-2         
##  [13] vegan_2.4-4                rpart_4.1-11              
##  [15] Hmisc_4.0-3                DBI_0.7                   
##  [17] lazyeval_0.2.0             BiocGenerics_0.22.1       
##  [19] mgcv_1.8-22                colorspace_1.3-2          
##  [21] permute_0.9-4              ade4_1.7-8                
##  [23] nnet_7.3-12                tidyselect_0.2.2          
##  [25] gridExtra_2.3              DESeq2_1.16.1             
##  [27] bit_1.1-12                 compiler_3.4.3            
##  [29] Biobase_2.36.2             htmlTable_1.9             
##  [31] DelayedArray_0.2.7         labeling_0.3              
##  [33] scales_0.5.0               checkmate_1.8.5           
##  [35] genefilter_1.58.1          stringr_1.2.0             
##  [37] digest_0.6.12              foreign_0.8-69            
##  [39] rmarkdown_1.6              XVector_0.16.0            
##  [41] base64enc_0.1-3            pkgconfig_2.0.1           
##  [43] htmltools_0.3.6            htmlwidgets_0.9           
##  [45] rlang_0.1.2                RSQLite_2.0               
##  [47] bindr_0.1                  jsonlite_1.5              
##  [49] BiocParallel_1.10.1        acepack_1.4.1             
##  [51] RCurl_1.95-4.8             magrittr_1.5              
##  [53] GenomeInfoDbData_0.99.0    Formula_1.2-2             
##  [55] biomformat_1.4.0           Matrix_1.2-11             
##  [57] Rcpp_0.12.16               munsell_0.4.3             
##  [59] S4Vectors_0.14.7           ape_4.1                   
##  [61] stringi_1.1.5              yaml_2.1.14               
##  [63] MASS_7.3-47                SummarizedExperiment_1.6.5
##  [65] zlibbioc_1.22.0            rhdf5_2.20.0              
##  [67] plyr_1.8.4                 blob_1.1.0                
##  [69] grid_3.4.3                 parallel_3.4.3            
##  [71] lattice_0.20-35            Biostrings_2.44.2         
##  [73] splines_3.4.3              multtest_2.32.0           
##  [75] annotate_1.54.0            locfit_1.5-9.1            
##  [77] knitr_1.17                 igraph_1.1.2              
##  [79] GenomicRanges_1.28.6       geneplotter_1.54.0        
##  [81] reshape2_1.4.2             codetools_0.2-15          
##  [83] stats4_3.4.3               XML_3.98-1.9              
##  [85] glue_1.1.1                 evaluate_0.10.1           
##  [87] latticeExtra_0.6-28        data.table_1.10.4-2       
##  [89] foreach_1.4.3              gtable_0.2.0              
##  [91] purrr_0.2.4                assertthat_0.2.0          
##  [93] xtable_1.8-2               survival_2.41-3           
##  [95] tibble_1.3.4               iterators_1.0.8           
##  [97] memoise_1.1.0              AnnotationDbi_1.38.2      
##  [99] IRanges_2.10.5             cluster_2.0.6