BoSSA CRAN page

# Summary

Phylogenetic placements are sets of mapping of query sequences to a reference phylogenetic tree. Phylogenetic placement inference using pplacer requires the construction of “reference packages”, i.e. informations on the phylogeny and taxonomy of a given group of organisms. This vignette presents example of reference package construction using taxtastic, rppr and R.

# Important note

The reference packages shiped with BoSSA are incomplete (they lack the alignment file) in order to reduce the package size. Whereas, the information available is sufficient to draw summary statistics, it won’t be enough to perform actual phylogenetic placement.

# The polerovirus example

Imagine you obtained some viral contigs that using BLAST (or some related similarity based search tools) were assigned to the Polerovirus genus. You want to obtain phylogenetic placements of these sequences using pplacer and now have to build a reference package for this genus.

A search on the ICTV website informs us that polerovirus belongs to the Luteovidiae family and that their “genome size is fairly uniform ranging from 5.6 kb to 6.0 kb”.

We will use the rentrez package to query NCBI from R. Along with a vignette available at the rentrez CRAN webpage, examples of rentrez use are available here.

library("rentrez")
library("XML")
library("ape")
library("BoSSA")

### How many polerovirus sequences are available on NCBI ?

We first search for the taxonomic id associated to the Polerovirus genus in NCBI…

r_search <- entrez_search(db="taxonomy", term="polerovirus")
r_search$ids ## [1] "119164" … and use this id (119164) to query the nucleotide database. Note that we remove the reference sequence set from the search (using “NOT srcdb_refseq[PROP]”) as these are duplicates from records already available in the nucleotide database. r_search <- entrez_search(db="nucleotide", term="txid119164[orgn] NOT srcdb_refseq[PROP]") r_search$count
## [1] 2063

This number represents the total number of sequences from the genus polerovirus. We will now search for full genomes using the “AND 5000:6500[slen]” query.

r_search <- entrez_search(db="nucleotide", term="txid119164[orgn] NOT srcdb_refseq[PROP] AND 5000:6500[slen]")
r_search$count ## [1] 233 To recover all the GenBank identifiers (gi), we have to increase the retmax parameter (default is 20) to something superior to 218. r_search <- entrez_search(db="nucleotide", term="txid119164[orgn] NOT srcdb_refseq[PROP] AND 5000:6500[slen]",retmax=500) gi <- r_search$ids
gi[1:10]
##  [1] "1153684428" "1316131152" "1316131147" "1316131143" "1316131139"
##  [6] "1316131135" "1316131123" "1316131119" "1316131115" "1316131107"

#### Get the genbank informations in XML format

We now use these gi to download the information relative to the sequences in XML format. XML is a handy format as it allow to directly extract specific fields form the genbank file.

all_recs <- entrez_fetch(db="nuccore", id=gi, rettype="gbc",retmode="xml",parsed=TRUE)
rec_list <- xmlToList(all_recs)

#### For a large set of sequences

Alternatively, when the number of sequence to recover is large (>500), you can use the use history option and fetch the sequence 500 by 500.

r_search <- entrez_search(db="nucleotide", term="txid119164[orgn] NOT srcdb_refseq[PROP] AND 5000:6500[slen]",use_history=TRUE)
all_recs <- NULL
for(seq_start in seq(1,ceiling(r_search$count/500)*500,500)){ all_recs <- c(all_recs,entrez_fetch(db="nuccore", web_history=r_search$web_history,rettype="gb",retmax=500,retstart=seq_start))
}

#### Extract the fasta from the XML

We now extract the sequences and write a fasta formated file on the disk.

write(sapply(rec_list,function(X){paste(">",X$INSDSeq_locus,"\n",X$INSDSeq_sequence,sep="")}),"polerovirus_from_genbank.fasta")

#### Extract the taxonomy from the XML

Also, we need to extract the polerovirus classification of each sequence.

source_name <- t(sapply(rec_list,function(X){c(X$INSDSeq_locus,X$INSDSeq_organism)}))

# Alignement and phylogeny

Now we have the sequences on the disk we can align the fasta (here using MAFFT) and construct a phylogenetic tree (here using FastTree)…

mafft --adjustdirection --reorder polerovirus_from_genbank.fasta > polerovirus_from_genbank_MAFFT.fasta

FastTree -nt -gamma -gtr -log polerovirus_from_genbank_MAFFT.log polerovirus_from_genbank_MAFFT.fasta > polerovirus_from_genbank_MAFFT.tre

…and import the tree in R for control.

plot(tree,cex=0.4,no.margin=TRUE)
nodelabels(cex=0.8)

Note that using the “adjustdirection” during alignment, MAFFT generates reverse complement sequences and align them together with the remaining sequences. The best orientation is conserved for each sequence. When the reverse complement is aligned, the program add the “R” prefix in the name of the sequence. In this case and in order to keep the correspondance between informations files and sequences names, you’ll have to edit back sequences names.

After plotting the tree, and the node labels, it is apparent that the node 231 may be a good rooting point (may change depending on the sequence set you download i.e. as more poleroviruses sequences are upload to NCBI, the tree structure and node numbering will change).

root_tree <- root(tree,node=231,resolve=TRUE)
plot(root_tree,cex=0.4,no.margin=TRUE)

The rooted tree is then written to the disk.

write.tree(root_tree,"polerovirus_ROOTED.tre")

# The Taxonomy

We need to compile the complete taxonomy information for the polerovirus. Taxastic has a function for that. We first download the full taxonomy from NCBI using taxit and subset it for the poleroviruses.

tax_ids <- tax_ids$links$nuccore_taxonomy
write(tax_ids,"taxonomy_polerovirus.id")
taxit new_database -d taxonomy.db
taxit taxtable -d taxonomy.db -t taxonomy_polerovirus.id -o polerovirus_taxonomy.csv

### The information file

We can now create and write the information file for the sequences from our phylogenetic tree.

tax_id <- taxo$tax_id[match(source_name[,2],taxo$tax_name)]
info <- data.frame(seqname=source_name[,1],accession=source_name[,1],tax_id=tax_id,species=source_name[,2],is_type=rep("no",nrow(source_name)))
write.table(info,"polerovirus_info.csv",sep=",",row.names=FALSE)

# Create the refpkg using taxit

The reference package is created using the taxit create command.

taxit create -l polerovirus -P polerovirus.refpkg \
--taxonomy polerovirus_taxonomy.csv \
--aln-fasta polerovirus_from_genbank_MAFFT.fasta \
--seq-info polerovirus_info.csv \
--tree-stats polerovirus_from_genbank_MAFFT.log \
--tree-file polerovirus_ROOTED.tre --no-reroot

### Check the refpkg using rppr

Rppr offers the possibility to check the reference package using the following commands.

rppr check -c polerovirus.refpkg
rppr info -c polerovirus.refpk

### Refpkg stats using BoSSA

Using BoSSA, we can extract some summary statistics and draw some plots.

• A reference package summary
refpkg_path <- paste(find.package("BoSSA"),"/extdata/polerovirus.refpkg",sep="")

refpkg(refpkg_path)
## ### Reference package summary
##
## Path:/tmp/RtmpuSwo8m/Rinst269fa345d6c/BoSSA/extdata/polerovirus.refpkg
##
## Tree with 204 tips 203 nodes
##
## Classification:
## root 1
## superkingdom 1
## below_superkingdom 1
## below_below_superkingdom 1
## family 1
## genus 1
## below_genus 1
## species 33
## below_species 2
• A tree with tuips colored according to a taxonomic level, here the species
refpkg(refpkg_path,type="tree",cex.text=0.3)

• A pie chart summarizing the taxonomy with here again, the species level. Note there is a slight decay between the text labels and slices…
refpkg(refpkg_path,type="pie",rank_pie="species",cex.text=0.6)

• Alternatively, a input file for KronaTools to generate a krona chart could be generated.
refpkg(refpkg_path,type="krona")

A file (default is for_krona.txt) is generated and can be used to generate a krona chart with the command:

ImportText.pl for_krona.txt

# Subsample the refpkg

It could be interesting to reduce the size of the reference package. Whereas, there is still a decent number of sequences in this polerovirus dataset, this number could be way higher and represent a computationnal burden for the rest of the analysis. Here is some code example to subset each species to a maximum of 10 sequences.

d <- cophenetic.phylo(root_tree)

ids <- NULL
for(usp in unique(info$species)){ sub <- info[as.character(info$species)==usp,]
if(nrow(sub)<=10){
ids <- c(ids,as.character(sub[,1]))
}
if(nrow(sub)>10){
acc <- as.character(sub[,1])
# the following code line prevent the code from crashing
# i.e. new sequences not available in the example may be uploaded when you will run the code
acc <- acc[!is.na(match(acc,colnames(d)))]
h <- hclust(as.dist(d[acc,acc]))
grp <- cutree(h,k=10)
ids <- c(ids,names(grp)[match(1:10,grp)])
}
}

to_remove <- root_tree$tip.label[!root_tree$tip.label%in%ids]

Note that the highly recommanded practice is to subset the alignement, re-align and compute a new phylogenetic tree, one can also directly subset the already available phylogeny (in a “quick and dirty way”).

root_tree2 <- drop.tip(root_tree,to_remove)
write.tree(root_tree2,"polerovirus_ROOTED_SUBSAMPLED.tre")

Using these new files, you can create another smaller refpkg using the “taxit create” command as described above.

# Citation

If you find phyloseq and/or its tutorials useful, you may cite BoSSA:

citation("BoSSA")
##
## To cite package 'BoSSA' in publications use:
##
##   Pierre Lefeuvre (2018). BoSSA: A Bunch of Structure and Sequence
##   Analysis. R package version 3.1.
##
## A BibTeX entry for LaTeX users is
##
##   @Manual{,
##     title = {BoSSA: A Bunch of Structure and Sequence Analysis},
##     author = {Pierre Lefeuvre},
##     year = {2018},
##     note = {R package version 3.1},
##   }
##
## ATTENTION: This citation information has been auto-generated from
## the package DESCRIPTION file and may need manual editing, see
## 'help("citation")'.

# Other resources

taxtastic

krona

### R packages used by BoSSA

RSQLite and jsonlite to read files, ape and phangorn to manipulate phylogenetic trees and plotrix for pie charts.