Please report comments or bugs to Pierre Lefeuvre - pierre.lefeuvre@cirad.fr

BoSSA CRAN page

Summary

A phylogenetic placement corresponds to the position of a query sequence in a reference tree. Different tools exits to infer phylogenetic placements, such as pplacer, EPA or RAPPAS. Importantly, these three programs produce placements under a common file format. Placements can later be analysed using the guppy software from the pplacer suite to obtain statistically based taxonomic classification of sequences. The BoSSA package implements functions to reads, plots and summarizes phylogentic placements. This vignette is intended to provide examples of placements analyses using BoSSA.

Important note

How to obtain phylogentic placement file suitable for analysis with BoSSA ?

The process to obtain placement files is dependent of the program you use. Assuming you are using pplacer, the process would be (1) build a reference package that contains an align set of reference sequences and a reference phylogenetic tree, (2) align query sequences to the reference alignment, (3) use pplacer to infer placements (jplace file output, format describe here) and optionally (4) infer the classification of each sequences using guppy (sqlite file output).

Let’s say you have obtained a reference package (refpkg), a placement file (jplace file) and a guppy classification output (sqlite file). The example files presented here are derived from other reference packages and jplaces files from the Matsen group pplacer tutorials. The sqlite file was obtained using the following command:

guppy classify --multiclass-min 0 --cutoff 0.5 -c example.refpkg --sqlite example.sqlite example.jplace

Exploration of a reference package

Let’s start by loading the `BoSSA-package:

library("BoSSA")

A good practice would be to inspect the refpkg content.

refpkg_path <- paste(find.package("BoSSA"),"/extdata/example.refpkg",sep="")
refpkg(refpkg_path)
## ### Reference package summary
## 
## Path:/tmp/Rtmpbwc4y6/Rinst44cef328f02/BoSSA/extdata/example.refpkg
## 
## Tree with 652 tips 650 nodes
## 
## Classification:
## root 1 
## below_root 1 
## superkingdom 1 
## below_superkingdom 1 
## below_below_superkingdom 1 
## superphylum 1 
## phylum 6 
## subphylum 1 
## class 11 
## subclass 2 
## order 15 
## below_order 3 
## below_below_order 1 
## suborder 3 
## family 28 
## below_family 5 
## genus 45 
## species_group 6 
## species_subgroup 1 
## species 138

It is possible to extract the taxonomy of the sequences included in the refpkg.

taxo <- refpkg(refpkg_path,type="taxonomy")
head(taxo)
##   tax_id parent_id                     rank                     tax_name
## 1      1         1                     root                         root
## 2 131567         1               below_root           cellular organisms
## 3      2    131567             superkingdom                     Bacteria
## 4   2323         2       below_superkingdom        unclassified Bacteria
## 5  95818      2323 below_below_superkingdom       candidate division TM7
## 6  68336         2              superphylum Bacteroidetes/Chlorobi group
##   root below_root superkingdom below_superkingdom below_below_superkingdom
## 1    1                                                                    
## 2    1     131567                                                         
## 3    1     131567            2                                            
## 4    1     131567            2               2323                         
## 5    1     131567            2               2323                    95818
## 6    1     131567            2                                            
##   superphylum phylum subphylum class subclass order below_order
## 1                                                              
## 2                                                              
## 3                                                              
## 4                                                              
## 5                                                              
## 6       68336                                                  
##   below_below_order suborder family below_family genus species_group
## 1                                                                   
## 2                                                                   
## 3                                                                   
## 4                                                                   
## 5                                                                   
## 6                                                                   
##   species_subgroup species below_species
## 1                                       
## 2                                       
## 3                                       
## 4                                       
## 5                                       
## 6

or display a pie chart that summarize the taxonomy…

refpkg(refpkg_path,type="pie",cex.text=0.5)

… or a subset of the taxonomy levels. Here, an example with the “class”, “order” and “family” levels. Note there is a slight decay between the text labels and slices… this will need a fix in a future package update.

refpkg(refpkg_path,type="pie",rank_pie=c("class","order","family"),cex.text=0.6)

Finally, a tree display with branch colored according to a given taxonomic level is available. Here tips are colored according to the “order” classification.

refpkg(refpkg_path,type="tree",rank_tree="class",cex.text=0.5)

Loading the example data

The BoSSA package comes along with examples of phylogenetic placements from the Masten group.

sqlite_file <- system.file("extdata", "example.sqlite", package = "BoSSA")
jplace_file <- system.file("extdata", "example.jplace", package = "BoSSA")

To read the data, use the read_sqlite function.

pplace <- read_sqlite(sqlite_file,jplace_file)
pplace
## pplace object
## run: 1
## call run 1: /home/lefeuvre/prgrm_phylo/pplacer/pplacer-Linux-v1.1.alpha18-2-gcb55169/guppy classify --multiclass-min 0 --cutoff 0.5 -c example.refpkg --sqlite example.sqlite example.jplace
## Placement on a phylogenetic tree with 652 tips and 650 internal nodes.
## sequence nb: 100
## placement nb: 30

A summary of the object is printed with the number of runs, the command line, a short description of the phylogenetic tree, the number of placements and the number of sequences being placed. Pplace objects are stored in a list of 15 components, with 12 components being outputs from a guppy classify run and 3 components corresponding to the phylogenetic tree used for placement:

str(pplace)
## List of 15
##  $ multiclass                 :'data.frame': 100 obs. of  6 variables:
##   ..$ placement_id: int [1:100] 28 22 1 28 5 3 9 28 17 28 ...
##   ..$ name        : chr [1:100] "GLKT0ZE01A0Y5X" "GLKT0ZE01A1B65" "GLKT0ZE01A608I" "GLKT0ZE01A6F1M" ...
##   ..$ want_rank   : chr [1:100] "species" "species" "species" "species" ...
##   ..$ rank        : chr [1:100] "species" "genus" "species" "species" ...
##   ..$ tax_id      : chr [1:100] "40543" "168808" "906_1" "40543" ...
##   ..$ likelihood  : num [1:100] 1 1 1 1 1 ...
##  $ placement_classifications  :'data.frame': 280 obs. of  3 variables:
##   ..$ placement_id: int [1:280] 1 1 1 1 1 1 1 1 1 2 ...
##   ..$ tax_id      : chr [1:280] "1" "131567" "2" "1239" ...
##   ..$ likelihood  : num [1:280] 1 1 1 1 1 ...
##  $ placement_evidence         :'data.frame': 630 obs. of  4 variables:
##   ..$ placement_id: int [1:630] 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ rank        : chr [1:630] "root" "below_root" "superkingdom" "below_superkingdom" ...
##   ..$ evidence    : num [1:630] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ bayes_factor: num [1:630] NA NA NA NA NA NA NA NA NA NA ...
##  $ placement_median_identities:'data.frame': 0 obs. of  3 variables:
##   ..$ placement_id           : int(0) 
##   ..$ tax_id                 : chr(0) 
##   ..$ median_percent_identity: num(0) 
##  $ placement_names            :'data.frame': 100 obs. of  4 variables:
##   ..$ placement_id: int [1:100] 1 2 2 3 3 3 4 4 4 5 ...
##   ..$ name        : chr [1:100] "GLKT0ZE01A608I" "GLKT0ZE01DNHR0" "GLKT0ZE01EDYPL" "GLKT0ZE01AHXIQ" ...
##   ..$ origin      : chr [1:100] "example" "example" "example" "example" ...
##   ..$ mass        : num [1:100] 1 1 1 1 1 1 1 1 1 1 ...
##  $ placement_nbc              :'data.frame': 0 obs. of  3 variables:
##   ..$ placement_id: int(0) 
##   ..$ tax_id      : chr(0) 
##   ..$ bootstrap   : num(0) 
##  $ placement_positions        :'data.frame': 193 obs. of  9 variables:
##   ..$ placement_id      : int [1:193] 1 1 1 1 1 1 1 2 2 2 ...
##   ..$ location          : num [1:193] 555 556 553 552 557 554 550 111 115 113 ...
##   ..$ ml_ratio          : num [1:193] 0.143 0.143 0.143 0.143 0.143 ...
##   ..$ log_like          : num [1:193] -11608 -11608 -11608 -11608 -11608 ...
##   ..$ distal_bl         : num [1:193] 5.15e-07 5.15e-07 5.15e-07 7.80e-06 5.78e-06 ...
##   ..$ pendant_bl        : num [1:193] 0.0584 0.0584 0.0584 0.0584 0.0584 ...
##   ..$ tax_id            : chr [1:193] "906_1" "906_1" "906_1" "906_1" ...
##   ..$ map_identity_ratio: num [1:193] NA NA NA NA NA NA NA NA NA NA ...
##   ..$ map_identity_denom: int [1:193] NA NA NA NA NA NA NA NA NA NA ...
##  $ placements                 :'data.frame': 30 obs. of  3 variables:
##   ..$ placement_id: int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ classifier  : chr [1:30] "pplacer" "pplacer" "pplacer" "pplacer" ...
##   ..$ run_id      : int [1:30] 1 1 1 1 1 1 1 1 1 1 ...
##  $ ranks                      :'data.frame': 21 obs. of  2 variables:
##   ..$ rank      : chr [1:21] "root" "below_root" "superkingdom" "below_superkingdom" ...
##   ..$ rank_order: int [1:21] 0 1 2 3 4 5 6 7 8 9 ...
##  $ runs                       :'data.frame': 1 obs. of  2 variables:
##   ..$ run_id: int 1
##   ..$ params: chr "/home/lefeuvre/prgrm_phylo/pplacer/pplacer-Linux-v1.1.alpha18-2-gcb55169/guppy classify --multiclass-min 0 --cutoff 0.5 -c exam"| __truncated__
##  $ sqlite_sequence            :'data.frame': 2 obs. of  2 variables:
##   ..$ name: chr [1:2] "runs" "placements"
##   ..$ seq : int [1:2] 1 30
##  $ taxa                       :'data.frame': 281 obs. of  3 variables:
##   ..$ tax_id  : chr [1:281] "1" "103621" "109790" "113286" ...
##   ..$ tax_name: chr [1:281] "root" "Actinomyces urogenitalis" "Lactobacillus jensenii" "Pseudoramibacter" ...
##   ..$ rank    : chr [1:281] "root" "species" "species" "genus" ...
##  $ arbre                      :List of 5
##   ..$ edge       : int [1:1301, 1:2] 653 654 655 656 657 658 658 657 656 659 ...
##   ..$ edge.length: num [1:1301] 0.0014 0.2102 0.0728 0.1295 0.1131 ...
##   ..$ Nnode      : int 650
##   ..$ tip.label  : chr [1:652] "S000010758" "S002034883" "S000006177" "S001188095" ...
##   ..$ root.edge  : num 0
##   ..- attr(*, "class")= chr "phylo"
##   ..- attr(*, "order")= chr "cladewise"
##  $ edge_key                   : num [1:2, 1:1301] 1 1298 2 1296 3 ...
##  $ original_tree              : chr "((((((S000010758:0.0192957{0},S002034883:0.0325555{1}):0.113088{2},S000006177:0.204859{3}):0.129507{4},(S001188095:0.00721428{5"| __truncated__
##  - attr(*, "class")= chr "pplace"

Among these:

Some plots

Four different plots are available to display placements on a phylogenetic tree:

plot(pplace,type="number",main="number",cex.number=1.5)

plot(pplace,type="color",main="color",edge.width=2)

plot(pplace,type="fattree",main="fattree")

plot(pplace,type="precise",main="precise")

Note that it is possible to apply a function to modify the dot size using the transfo option. In the following example, the dot size is multiplied by 2. In some other cases log or log10 transformations could be usefull. Beware that when using the transfo option, the legend does not anymore correspond to the placement size but to the transform dot size (i.e. the transform function applied to the dot size).

plot(pplace,type="precise",main="precise",transfo=function(X){X*2})

Subsetting the pplace object

Placement object can be subseted. This could be done using placements ids…

sub1 <- sub_pplace(pplace,placement_id=1:100)
sub1
## pplace object
## run: 1
## call run 1: /home/lefeuvre/prgrm_phylo/pplacer/pplacer-Linux-v1.1.alpha18-2-gcb55169/guppy classify --multiclass-min 0 --cutoff 0.5 -c example.refpkg --sqlite example.sqlite example.jplace
## Placement on a phylogenetic tree with 652 tips and 650 internal nodes.
## sequence nb: 100
## placement nb: 30

…or using placements names.

ids <- sample(pplace$multiclass$name,50)
sub2 <- sub_pplace(pplace,ech_id=ids)
sub2
## pplace object
## run: 1
## call run 1: /home/lefeuvre/prgrm_phylo/pplacer/pplacer-Linux-v1.1.alpha18-2-gcb55169/guppy classify --multiclass-min 0 --cutoff 0.5 -c example.refpkg --sqlite example.sqlite example.jplace
## Placement on a phylogenetic tree with 652 tips and 650 internal nodes.
## sequence nb: 50
## placement nb: 20

Conversion

To a table

Using the pplace_to_table function produces a table that contains the placement information along with the classification for each sequence. The output can be limited to the “best” placement (as in the example, i.e. the placements with the highest likelihood for each sequence).

pplace_table <- pplace_to_table(pplace,type="best")
head(pplace_table,n=3)
##   placement_id           name want_rank    rank tax_id_multilcass
## 1            1 GLKT0ZE01A608I   species species             906_1
## 2            2 GLKT0ZE01EDYPL   species   genus             84111
## 3            3 GLKT0ZE01AHXIQ   species species              2702
##   likelihood location  ml_ratio  log_like    distal_bl   pendant_bl
## 1          1      555 0.1428572 -11607.93 5.149085e-07 5.842901e-02
## 2          1      111 0.1428571 -11011.07 5.149085e-07 6.113515e-06
## 3          1       69 0.1428591 -11606.63 5.149085e-07 6.113515e-06
##   tax_id_placement map_identity_ratio map_identity_denom
## 1            906_1                 NA                 NA
## 2            84111                 NA                 NA
## 3             2702                 NA                 NA

To a contingency matrix

The pplace_to_matrix produces a contingency table. Let say the first 50 sequences in the multiclass table correspond to sequence from “sample 1” and the following 50 correspond to “sample 2”, the function output a contingency table for these two samples. You can either have the taxonomic names (tax_name=TRUE, in the example) or keep the taxonomic ids (tax_name=FALSE).

example_contingency <- pplace_to_matrix(pplace,c(rep("sample1",50),rep("sample2",50)),tax_name=TRUE)
example_contingency
##         Sneathia sanguinegens Sneathia Megasphaera sp. type 1
## sample1                    17       12                      6
## sample2                    14       14                      5
##         Atopobium vaginae Gardnerella vaginalis Dialister sp. type 2
## sample1                 1                     2                    1
## sample2                 1                     3                    0
##         Fusobacteriaceae Prevotella genogroup 1 Lactobacillus iners BVAB1
## sample1                1                      3                   4     1
## sample2                0                      5                   0     2
##         Prevotella genogroup 2 Prevotella genogroup 3
## sample1                      1                      1
## sample2                      0                      1
##         Aerococcus christensenii Parvimonas micra Eggerthella
## sample1                        0                0           0
## sample2                        1                2           2

To a taxonomy

Using the pplace_to_taxonomy function, a taxonomy table is obtained for each sequences with the taxonomy levels defined in the reference package. The taxonomy levels can be limited to a set of levels using the rank option.

example_taxo <- pplace_to_taxonomy(pplace,taxo,tax_name=TRUE,rank=c("order","family","genus","species"))
head(example_taxo)
##                order               family               genus        
## GLKT0ZE01A0Y5X "Fusobacteriales"   "Fusobacteriaceae"   "Sneathia"   
## GLKT0ZE01A1B65 "Fusobacteriales"   "Fusobacteriaceae"   "Sneathia"   
## GLKT0ZE01A608I "Clostridiales"     "Veillonellaceae"    "Megasphaera"
## GLKT0ZE01A6F1M "Fusobacteriales"   "Fusobacteriaceae"   "Sneathia"   
## GLKT0ZE01AH4NR "Coriobacteriales"  "Coriobacteriaceae"  "Atopobium"  
## GLKT0ZE01AHXIQ "Bifidobacteriales" "Bifidobacteriaceae" "Gardnerella"
##                species                 
## GLKT0ZE01A0Y5X "Sneathia sanguinegens" 
## GLKT0ZE01A1B65 "Unclassified"          
## GLKT0ZE01A608I "Megasphaera sp. type 1"
## GLKT0ZE01A6F1M "Sneathia sanguinegens" 
## GLKT0ZE01AH4NR "Atopobium vaginae"     
## GLKT0ZE01AHXIQ "Gardnerella vaginalis"

Make a phyloseq object

Assuming the sequences in the pplace object represent centroids of sequence cluster obtained from multiple samples, using the taxonomy table and an appropriate OTU file, you can create a phyloseq object.

example_OTU <- matrix(sample(1:100, 500, replace = TRUE), nrow = 100, ncol = 5,dimnames=list(pplace$multiclass$name,paste("sample",1:5,sep="_")))
head(example_OTU)
##                sample_1 sample_2 sample_3 sample_4 sample_5
## GLKT0ZE01A0Y5X       84       90       34       34       38
## GLKT0ZE01A1B65       73       79       84       28       40
## GLKT0ZE01A608I       74       23      100       99       73
## GLKT0ZE01A6F1M       48        2       90       36       69
## GLKT0ZE01AH4NR       12       31        3       30       73
## GLKT0ZE01AHXIQ        4       90       40       58       30

The exemple below is not run (commented) due to errors/warnings triggered by the used of Bioconductor packages (i.e. phyloseq) in CRAN vignette on some platform. Just uncomment the code if you like to have a try.

#library(phyloseq)
#example_phyloseq <- phyloseq(otu_table(example_OTU,taxa_are_rows=TRUE),tax_table(example_taxo))
#example_phyloseq

Citation

If you find BoSSA and/or its tutorials useful, you may cite:

citation("BoSSA")
## 
## To cite package 'BoSSA' in publications use:
## 
##   Pierre Lefeuvre (2018). BoSSA: A Bunch of Structure and Sequence
##   Analysis. R package version 3.4.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {BoSSA: A Bunch of Structure and Sequence Analysis},
##     author = {Pierre Lefeuvre},
##     year = {2018},
##     note = {R package version 3.4},
##   }
## 
## ATTENTION: This citation information has been auto-generated from
## the package DESCRIPTION file and may need manual editing, see
## 'help("citation")'.

Other resources

On phylogenetic placements

pplacer website and documentation

taxtastic

R packages used by BoSSA

RSQLite and jsonlite to read files, ape and phangorn to manipulate phylogenetic trees and plotrix for pie charts.