Meta-Genome Retrieval

2018-06-27

Perform Meta-Genome Retrieval

The number of genome assemblies generated and stored in sequence databases is growing exponentially every year. With this growing amount of genomic data, meta-genomics studies are becoming more and more popular. By comparing these genomes with thousands of other genomes, new structural patterns and evolutionary insights can be obtained. However, the first step in any meta-genomics study is the retrieval of the genomes, proteomes, coding sequences or annotation files that shall be compared and investigated. For this purpose, the meta.retrieval() and meta.retrieval.all() functions allow users to perform straightforward meta-genome retrieval of hundreds of genomes, proteomes, CDS, etc. in R.

The meta.retrieval() function generates a *.csv overview file in the doc folder. This file stores genome version, date, origin, etc. for all downloaded organisms and can be used directly as a Supplementary Data file in publications to increase computational and biological reproducibility in genomics studies.

Getting Started

The meta.retrieval() and meta.retrieval.all() functions aim to simplify the genome retrieval process for subsequent meta-genomics studies. They allow users to download genomes, proteomes, CDS, etc. either for species within a specific kingdom or subgroup of life (meta.retrieval()) or for all species of all kingdoms (meta.retrieval.all()). Usually this step is performed with shell scripts. However, since many meta-genomics packages exist for the R programming language, I implemented this functionality for easy integration into existing workflows.

For example, the pipeline logic of the magrittr package can be used with meta.retrieval() and meta.retrieval.all() as follows.

# download all mammalian vertebrate genomes, then apply ...
meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "genome") %>% ...

Here ... denotes any subsequent meta-genomics analysis. Hence, meta.retrieval() enables the pipelining methodology for meta-genomics.

Retrieve Genomic Sequences

To list all kingdoms available in the NCBI RefSeq, NCBI Genbank, ENSEMBL, and ENSEMBLGENOMES databases, users can consult the getKingdoms() function, which returns the available kingdoms of life for the corresponding database.

Example NCBI RefSeq:

getKingdoms(db = "refseq")
[1] "archaea"              "bacteria"             "fungi"                "invertebrate"        
[5] "plant"                "protozoa"             "vertebrate_mammalian" "vertebrate_other"    
[9] "viral"

Example NCBI Genbank:

getKingdoms(db = "genbank")
[1] "archaea"              "bacteria"             "fungi"               
[4] "invertebrate"         "plant"                "protozoa"            
[7] "vertebrate_mammalian" "vertebrate_other"

In these examples, the difference between db = "refseq" and db = "genbank" is that db = "genbank" does not store viral information.

Example ENSEMBL:

getKingdoms(db = "ensembl")
[1] "Ensembl"

The ENSEMBL database does not differentiate between kingdoms, but specializes in storing high-quality reference genomes of diverse biological disciplines.

Example ENSEMBLGENOMES:

getKingdoms(db = "ensemblgenomes")
[1] "EnsemblBacteria" "EnsemblFungi"    "EnsemblMetazoa"  "EnsemblPlants"  
[5] "EnsemblProtists"

The ENSEMBLGENOMES database is divided into the specialised sub-databases EnsemblBacteria, EnsemblFungi, EnsemblMetazoa, EnsemblPlants, and EnsemblProtists.

When choosing the database from which species shall be retrieved (e.g. db = "refseq"), the kingdom argument must match one of the available kingdoms returned by, for example, getKingdoms(db = "refseq").
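The matching rule above can be checked before a long download starts. The following is a minimal sketch, not part of biomartr: check_kingdom() is a hypothetical helper, and the kingdom names are hard-coded from the getKingdoms() outputs shown above so that the example runs without network access. In a live session the lookup list could instead be filled with the result of getKingdoms(db = ...).

```r
# Kingdom names copied from the getKingdoms() outputs shown above.
available_kingdoms <- list(
  refseq = c("archaea", "bacteria", "fungi", "invertebrate", "plant",
             "protozoa", "vertebrate_mammalian", "vertebrate_other", "viral"),
  genbank = c("archaea", "bacteria", "fungi", "invertebrate", "plant",
              "protozoa", "vertebrate_mammalian", "vertebrate_other"),
  ensembl = "Ensembl",
  ensemblgenomes = c("EnsemblBacteria", "EnsemblFungi", "EnsemblMetazoa",
                     "EnsemblPlants", "EnsemblProtists")
)

# Hypothetical helper (not a biomartr function): stop early when the
# kingdom is not available for the chosen database.
check_kingdom <- function(kingdom, db, available = available_kingdoms) {
  if (!db %in% names(available)) {
    stop("unknown database: ", db, call. = FALSE)
  }
  if (!kingdom %in% available[[db]]) {
    stop("kingdom '", kingdom, "' is not available for db = '", db, "'",
         call. = FALSE)
  }
  invisible(TRUE)
}

check_kingdom("viral", db = "refseq")     # passes silently
# check_kingdom("viral", db = "genbank")  # would stop with an error
```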

Retrieval from NCBI RefSeq

Download all mammalian vertebrate genomes from NCBI RefSeq.

# download all mammalian vertebrate genomes
meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "genome")

All genomes are stored in a folder named according to the kingdom, in this case vertebrate_mammalian. Alternatively, users can specify the out.folder argument to define a custom output folder path.

Example Bacteria

# download all bacteria genomes
meta.retrieval(kingdom = "bacteria", db = "refseq", type = "genome")

Example Viruses

# download all virus genomes
meta.retrieval(kingdom = "viral", db = "refseq", type = "genome")

Example Archaea

# download all archaea genomes
meta.retrieval(kingdom = "archaea", db = "refseq", type = "genome")

Example Fungi

# download all fungi genomes
meta.retrieval(kingdom = "fungi", db = "refseq", type = "genome")

Example Plants

# download all plant genomes
meta.retrieval(kingdom = "plant", db = "refseq", type = "genome")

Example Invertebrates

# download all invertebrate genomes
meta.retrieval(kingdom = "invertebrate", db = "refseq", type = "genome")

Example Protozoa

# download all protozoa genomes
meta.retrieval(kingdom = "protozoa", db = "refseq", type = "genome")
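Since the calls above differ only in the kingdom argument, they can also be driven by a loop. The sketch below assumes a hypothetical wrapper, retrieve_kingdoms(), that is not part of biomartr. It takes the retrieval function as an argument, so the example runs with a dry-run stand-in that only builds the call text instead of downloading anything.

```r
# Hypothetical wrapper (not part of biomartr): apply a retrieval function
# to several kingdoms and keep going when one kingdom fails.
retrieve_kingdoms <- function(kingdoms, db, type, retrieve_fun) {
  results <- lapply(kingdoms, function(k) {
    tryCatch(
      retrieve_fun(kingdom = k, db = db, type = type),
      error = function(e) {
        message("retrieval failed for kingdom '", k, "': ",
                conditionMessage(e))
        NULL
      }
    )
  })
  names(results) <- kingdoms
  results
}

# Dry-run stand-in: returns the call as text instead of downloading.
dry_run <- function(kingdom, db, type) {
  sprintf("meta.retrieval(kingdom = \"%s\", db = \"%s\", type = \"%s\")",
          kingdom, db, type)
}

planned <- retrieve_kingdoms(c("archaea", "fungi", "protozoa"),
                             db = "refseq", type = "genome",
                             retrieve_fun = dry_run)
print(unlist(planned))
```

For real downloads, retrieve_fun = biomartr::meta.retrieval can be passed instead of dry_run, provided the package is installed and a network connection is available.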

Retrieval from NCBI Genbank

Alternatively, download all mammalian vertebrate genomes from NCBI Genbank:

# download all mammalian vertebrate genomes
meta.retrieval(kingdom = "vertebrate_mammalian", db = "genbank", type = "genome")

Example Bacteria

# download all bacteria genomes
meta.retrieval(kingdom = "bacteria", db = "genbank", type = "genome")

Example Archaea

# download all archaea genomes
meta.retrieval(kingdom = "archaea", db = "genbank", type = "genome")

Example Fungi

# download all fungi genomes
meta.retrieval(kingdom = "fungi", db = "genbank", type = "genome")

Example Plants

# download all plant genomes
meta.retrieval(kingdom = "plant", db = "genbank", type = "genome")

Example Invertebrates

# download all invertebrate genomes
meta.retrieval(kingdom = "invertebrate", db = "genbank", type = "genome")

Example Protozoa

# download all protozoa genomes
meta.retrieval(kingdom = "protozoa", db = "genbank", type = "genome")

Retrieval from ENSEMBL

# download all genomes from ENSEMBL
meta.retrieval(kingdom = "Ensembl", db = "ensembl", type = "genome")

Retrieval from ENSEMBLGENOMES

Alternatively, download all species for a particular group from ENSEMBLGENOMES.

Example Metazoa:

# download all metazoa genomes
meta.retrieval(kingdom = "EnsemblMetazoa", db = "ensemblgenomes", type = "genome")

Example Fungi:

# download all fungi genomes
meta.retrieval(kingdom = "EnsemblFungi", db = "ensemblgenomes", type = "genome")

The available kingdom values for db = "ensemblgenomes" are EnsemblBacteria, EnsemblFungi, EnsemblMetazoa, EnsemblPlants, and EnsemblProtists.

Retrieve groups or subgroups of species

If users do not wish to retrieve genomes from an entire kingdom, but rather from a group or subgroup (e.g. from species belonging to the Gammaproteobacteria class, a subgroup of the bacteria kingdom), they can use the following workflow.

Example retrieval of all Gammaproteobacteria genomes from NCBI RefSeq:

First, users can again consult the getKingdoms() function to retrieve kingdom information.

getKingdoms(db = "refseq")
[1] "archaea"              "bacteria"             "fungi"                "invertebrate"        
[5] "plant"                "protozoa"             "vertebrate_mammalian" "vertebrate_other"    
[9] "viral"

In this example, we will choose the bacteria kingdom. Now, the getGroups() function allows users to obtain available subgroups of the bacteria kingdom.

getGroups(db = "refseq", kingdom = "bacteria")
 [1] "Acidithiobacillia"                     "Acidobacteriia"                       
 [3] "Actinobacteria"                        "Alphaproteobacteria"                  
 [5] "Aquificae"                             "Armatimonadetes"                      
 [7] "Bacteroidetes/Chlorobi group"          "Balneolia"                            
 [9] "Betaproteobacteria"                    "Blastocatellia"                       
[11] "Candidatus Kryptonia"                  "Chlamydiae"                           
[13] "Chloroflexi"                           "Cyanobacteria/Melainabacteria group"  
[15] "Deinococcus-Thermus"                   "delta/epsilon subdivisions"           
[17] "Endomicrobia"                          "Fibrobacteres"                        
[19] "Firmicutes"                            "Fusobacteriia"                        
[21] "Gammaproteobacteria"                   "Gemmatimonadetes"                     
[23] "Kiritimatiellaeota"                    "Nitrospira"                           
[25] "Planctomycetes"                        "Spirochaetia"                         
[27] "Synergistia"                           "Tenericutes"                          
[29] "Thermodesulfobacteria"                 "Thermotogae"                          
[31] "unclassified Acidobacteria"            "unclassified Bacteria (miscellaneous)"
[33] "unclassified Proteobacteria"           "Verrucomicrobia"                      
[35] "Zetaproteobacteria" 

Please note that the kingdom argument specified in getGroups() needs to match an available kingdom retrieved with getKingdoms(). It is also important that the same database is specified in both getKingdoms() and getGroups().

Now we choose the group Gammaproteobacteria and specify the group argument in the meta.retrieval() function.

meta.retrieval(kingdom = "bacteria", group = "Gammaproteobacteria", db = "refseq", type = "genome")

Using this command, all bacterial (kingdom = "bacteria") genomes (type = "genome") that belong to the group Gammaproteobacteria (group = "Gammaproteobacteria") will be retrieved from NCBI RefSeq (db = "refseq").

Alternatively, Gammaproteobacteria genomes can be retrieved from NCBI Genbank by changing db = "refseq" to db = "genbank". If users wish to download proteome, CDS, or GFF files instead of genomes, they can specify type = "proteome", type = "cds", or type = "gff".
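When several related groups are of interest, the group names can be selected from the getGroups() result by pattern instead of being typed one by one. The sketch below uses a hard-coded subset of the group names listed above so that it runs offline. match_groups() is a hypothetical helper, not a biomartr function.

```r
# Subset of the getGroups(db = "refseq", kingdom = "bacteria") output
# shown above, hard-coded so the example needs no network access.
bacteria_groups <- c("Alphaproteobacteria", "Betaproteobacteria",
                     "Firmicutes", "Gammaproteobacteria",
                     "unclassified Proteobacteria", "Zetaproteobacteria")

# Hypothetical helper: case-sensitive, literal substring match.
match_groups <- function(pattern, groups) {
  grep(pattern, groups, value = TRUE, fixed = TRUE)
}

proteo <- match_groups("proteobacteria", bacteria_groups)
print(proteo)

# Each selected name could then be passed on as the group argument, e.g.
# for (g in proteo) {
#   meta.retrieval(kingdom = "bacteria", group = g,
#                  db = "refseq", type = "genome")
# }
```

Because the match is case-sensitive, "unclassified Proteobacteria" is not selected here. Whether that is desirable depends on the study.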

Example retrieval of all Adenoviridae genomes from NCBI RefSeq:

Retrieve groups for viruses.

getGroups(db = "refseq", kingdom = "viral")
  [1] "Adenoviridae"                                        "Alloherpesviridae"                                  
  [3] "Alphaflexiviridae"                                   "Alphatetraviridae"                                  
  [5] "Alvernaviridae"                                      "Amalgaviridae"                                      
  [7] "Ampullaviridae"                                      "Anelloviridae"                                      
  [9] "Apple fruit crinkle viroid"                          "Apple hammerhead viroid-like circular RNA"          
 [11] "Apscaviroid"                                         "Arenaviridae"                                       
 [13] "Arteriviridae"                                       "Ascoviridae"                                        
 [15] "Asfarviridae"                                        "Astroviridae"                                       
 [17] "Avsunviroid"                                         "Baculoviridae"                                      
 [19] "Barnaviridae"                                        "Benyviridae"                                        
 [21] "Betaflexiviridae"                                    "Bicaudaviridae"                                     
 [23] "Birnaviridae"                                        "Bornaviridae"                                       
 [25] "Bromoviridae"                                        "Bunyaviridae"                                       
 [27] "Caliciviridae"                                       "Carmotetraviridae"                                  
 [29] "Caulimoviridae"                                      "Cherry leaf scorch small circular viroid-like RNA 1"
 [31] "Cherry small circular viroid-like RNA"               "Chrysoviridae"                                      
 [33] "Circoviridae"                                        "Closteroviridae"                                    
 [35] "Cocadviroid"                                         "Coleviroid"                                         
 [37] "Coronaviridae"                                       "Corticoviridae"                                     
 [39] "Cystoviridae"                                        "Dicistroviridae"                                    
 [41] "Elaviroid"                                           "Endornaviridae"                                     
 [43] "Filoviridae"                                         "Flaviviridae"                                       
 [45] "Fusarividae"                                         "Fuselloviridae"                                     
 [47] "Gammaflexiviridae"                                   "Geminiviridae"                                      
 [49] "Genomoviridae"                                       "Globuloviridae"                                     
 [51] "Grapevine latent viroid"                             "Guttaviridae"                                       
 [53] "Hepadnaviridae"                                      "Hepeviridae"                                        
 [55] "Herpesviridae"                                       "Hostuviroid"                                        
 [57] "Hypoviridae"                                         "Hytrosaviridae"                                     
 [59] "Iflaviridae"                                         "Inoviridae"                                         
 [61] "Iridoviridae"                                        "Lavidaviridae"                                      
 [63] "Leviviridae"                                         "Lipothrixviridae"                                   
 [65] "Luteoviridae"                                        "Malacoherpesviridae"                                
 [67] "Marnaviridae"                                        "Marseilleviridae"                                   
 [69] "Megabirnaviridae"                                    "Mesoniviridae"                                      
 [71] "Microviridae"                                        "Mimiviridae"                                        
 [73] "Mulberry small circular viroid-like RNA 1"           "Mymonaviridae"                                      
 [75] "Myoviridae"                                          "Nanoviridae"                                        
 [77] "Narnaviridae"                                        "Nimaviridae"                                        
 [79] "Nodaviridae"                                         "Nudiviridae"                                        
 [81] "Nyamiviridae"                                        "Ophioviridae"                                       
 [83] "Orthomyxoviridae"                                    "Other"                                              
 [85] "Papillomaviridae"                                    "Paramyxoviridae"                                    
 [87] "Partitiviridae"                                      "Parvoviridae"                                       
 [89] "Pelamoviroid"                                        "Permutotetraviridae"                                
 [91] "Persimmon viroid"                                    "Phycodnaviridae"                                    
 [93] "Picobirnaviridae"                                    "Picornaviridae"                                     
 [95] "Plasmaviridae"                                       "Pneumoviridae"                                      
 [97] "Podoviridae"                                         "Polydnaviridae"                                     
 [99] "Polyomaviridae"                                      "Pospiviroid"                                        
[101] "Potyviridae"                                         "Poxviridae"                                         
[103] "Quadriviridae"                                       "Reoviridae"                                         
[105] "Retroviridae"                                        "Rhabdoviridae"                                      
[107] "Roniviridae"                                         "Rubber viroid India/2009"                           
[109] "Rudiviridae"                                         "Secoviridae"                                        
[111] "Siphoviridae"                                        "Sphaerolipoviridae"                                 
[113] "Sunviridae"                                          "Tectiviridae"                                       
[115] "Togaviridae"                                         "Tombusviridae"                                      
[117] "Totiviridae"                                         "Turriviridae"                                       
[119] "Tymoviridae"                                         "unclassified"                                       
[121] "unclassified Pospiviroidae"                          "Virgaviridae"

Now we can choose Adenoviridae as group argument for the meta.retrieval() function.

meta.retrieval(kingdom = "viral", group = "Adenoviridae", db = "refseq", type = "genome")

Again, by replacing type = "genome" with either type = "proteome", type = "cds", type = "rna", type = "assemblystats", or type = "gff", users can retrieve proteome, CDS, RNA, genome assembly statistics or GFF files instead of genomes.

Meta retrieval of genome assembly quality information

Although much effort is invested in increasing genome assembly quality when new genomes are published or new versions are released, the influence of genome assembly quality on downstream analyses cannot be neglected. As a rule of thumb, the larger the genome, the more prone it is to assembly errors and therefore to reduced assembly quality.

In Veeckman et al., 2016 the authors conclude:

As yet, no uniform metrics or standards are in place to estimate the completeness of a genome assembly or the annotated gene space, despite their importance for downstream analyses

In most metagenomics studies, however, the influence or bias of genome assembly quality on the outcome of the analysis (e.g. comparative genomics, annotation, etc.) is neglected. To better grasp the genome assembly quality, the NCBI databases store genome assembly statistics of some species for which genome assemblies are available. An example assembly statistics report can be found at: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.36_GRCh38.p10/GCF_000001405.36_GRCh38.p10_assembly_stats.txt.

The biomartr package allows users to retrieve these genome assembly stats files in an automated way by specifying the arguments type = "assemblystats" and combine = TRUE. Please make sure that combine = TRUE is set when selecting type = "assemblystats".

# show all elements of the data.frame
options(tibble.print_max = Inf)
# retrieve genome assembly stats for all mammal genome assemblies
# and store these stats in a data.frame
mammals.gc <- meta.retrieval(kingdom = "vertebrate_mammalian", 
                             db      = "refseq", 
                             type    = "assemblystats", 
                             combine = TRUE)

mammals.gc
                    species total_length spanned_gaps unspanned_gaps region_count scaffold_count
                      <chr>        <int>        <int>          <int>        <int>          <int>
1  Ornithorhynchus anatinus   1995607322       243698            137            0         200283
2      Sarcophilus harrisii           NA       201317              0            0          35974
3      Dasypus novemcinctus           NA       268413              0            0          46559
4       Erinaceus europaeus           NA       219764              0            0           5803
5         Echinops telfairi           NA       269444              0            0           8402
6           Pteropus alecto   1985975446       104566              0            0          65598
7     Rousettus aegyptiacus   1910250568          559              0            0             NA
8        Callithrix jacchus           NA       184972           2242            0          16399
9  Cebus capucinus imitator           NA       133441              0            0           7156
10          Cercocebus atys           NA        65319              0            0          11433
# ... with 89 more rows, and 9 more variables: scaffold_N50 <int>, scaffold_L50 <int>,
#   scaffold_N75 <int>, scaffold_N90 <int>, contig_count <int>, contig_N50 <int>, total_gap_length <int>,
#   molecule_count <int>, top_level_count <int>

Analogously, this information can be retrieved for each kingdom other than kingdom = "vertebrate_mammalian". Please consult getKingdoms() for available kingdoms.
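Once the combined statistics are available as a data.frame, simple quality indicators can be derived from its columns. The sketch below rebuilds a few rows from the output shown above by hand (columns species, total_length and spanned_gaps) so that it runs without a download. The gaps_per_mb column is an illustrative measure chosen for this example, not a metric defined by biomartr or NCBI.

```r
# A few rows copied from the printed output above; NA marks a missing
# total_length, exactly as in that output.
stats <- data.frame(
  species      = c("Ornithorhynchus anatinus", "Sarcophilus harrisii",
                   "Pteropus alecto", "Rousettus aegyptiacus"),
  total_length = c(1995607322, NA, 1985975446, 1910250568),
  spanned_gaps = c(243698, 201317, 104566, 559),
  stringsAsFactors = FALSE
)

# Illustrative indicator: spanned gaps per megabase of assembly.
stats$gaps_per_mb <- stats$spanned_gaps / (stats$total_length / 1e6)

# Rank assemblies from fewest to most gaps per megabase. Rows where the
# indicator cannot be computed are placed last.
ranked <- stats[order(stats$gaps_per_mb, na.last = TRUE), ]
print(ranked[, c("species", "gaps_per_mb")])
```

The same steps should apply to the full object returned with combine = TRUE, as long as it contains the columns used here.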

Metagenome project retrieval from NCBI Genbank

NCBI Genbank stores metagenome projects in addition to species-specific genome, proteome or CDS sequences. To retrieve these metagenomes, users can perform the following combination of commands:

First, users can list the project names of available metagenomes by typing

# list available metagenomes at NCBI Genbank
listMetaGenomes()
 [1] "metagenome"                     "human gut metagenome"           "epibiont metagenome"           
 [4] "marine metagenome"              "soil metagenome"                "mine drainage metagenome"      
 [7] "mouse gut metagenome"           "marine sediment metagenome"     "termite gut metagenome"        
[10] "hot springs metagenome"         "human lung metagenome"          "fossil metagenome"             
[13] "freshwater metagenome"          "saltern metagenome"             "stromatolite metagenome"       
[16] "coral metagenome"               "mosquito metagenome"            "fish metagenome"               
[19] "bovine gut metagenome"          "chicken gut metagenome"         "wastewater metagenome"         
[22] "microbial mat metagenome"       "freshwater sediment metagenome" "human metagenome"              
[25] "hydrothermal vent metagenome"   "compost metagenome"             "wallaby gut metagenome"        
[28] "groundwater metagenome"         "gut metagenome"                 "sediment metagenome"           
[31] "ant fungus garden metagenome"   "food metagenome"                "hypersaline lake metagenome"   
[34] "hydrocarbon metagenome"         "activated sludge metagenome"    "viral metagenome"              
[37] "bioreactor metagenome"          "wasp metagenome"                "permafrost metagenome"         
[40] "sponge metagenome"              "aquatic metagenome"             "insect gut metagenome"         
[43] "activated carbon metagenome"    "anaerobic digester metagenome"  "rock metagenome"               
[46] "terrestrial metagenome"         "rock porewater metagenome"      "seawater metagenome"           
[49] "scorpion gut metagenome"        "soda lake metagenome"           "glacier metagenome"

Internally, the listMetaGenomes() function downloads the assembly_summary.txt file from ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/metagenomes/ to retrieve available metagenome information. This procedure might take a few seconds during the first run of listMetaGenomes(). Subsequently, the assembly_summary.txt file is stored in the tempdir() directory, which makes later calls to listMetaGenomes() much faster.
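The download-once-then-reuse behaviour described above follows a common caching pattern. The sketch below illustrates that pattern in general terms and is not the actual implementation inside listMetaGenomes(). cached_file() and fake_fetch() are hypothetical names, and the fetch step is replaced by a stand-in that writes a placeholder file so that the example runs offline.

```r
# Hypothetical helper: return the path of a cached file and call
# fetch(path) only when the file is not there yet.
cached_file <- function(file_name, fetch, cache_dir = tempdir()) {
  path <- file.path(cache_dir, file_name)
  if (!file.exists(path)) {
    fetch(path)
  }
  path
}

# Stand-in for a download that counts how often it is called.
n_fetches <- 0
fake_fetch <- function(dest) {
  n_fetches <<- n_fetches + 1
  writeLines("placeholder content", dest)
}

# Use a fresh cache folder so the demonstration starts from an empty cache.
cache_dir <- tempfile("cache_")
dir.create(cache_dir)

first  <- cached_file("assembly_summary.txt", fake_fetch, cache_dir)
second <- cached_file("assembly_summary.txt", fake_fetch, cache_dir)
print(n_fetches)  # the second call reuses the cached file
```

In a real setting, the fetch argument could be a function such as function(dest) utils::download.file(url, dest), where url points to the summary file.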

In case users wish to retrieve detailed information about available metagenome projects they can specify the details = TRUE argument.

# show all elements of the data.frame
options(tibble.print_max = Inf)
# detailed information on available metagenomes at NCBI Genbank
listMetaGenomes(details = TRUE)
# A tibble: 857 x 21
   assembly_accession bioproject    biosample     wgs_master refseq_category  taxid species_taxid
                <chr>      <chr>        <chr>          <chr>           <chr>  <int>         <int>
1     GCA_000206185.1 PRJNA32359 SAMN02954317 AAGA00000000.1              na 256318        256318
2     GCA_000206205.1 PRJNA32355 SAMN02954315 AAFZ00000000.1              na 256318        256318
3     GCA_000206225.1 PRJNA32357 SAMN02954316 AAFY00000000.1              na 256318        256318
4     GCA_000208265.2 PRJNA17779 SAMN02954240 AASZ00000000.1              na 256318        256318
5     GCA_000208285.1 PRJNA17657 SAMN02954268 AATO00000000.1              na 256318        256318
6     GCA_000208305.1 PRJNA17659 SAMN02954269 AATN00000000.1              na 256318        256318
7     GCA_000208325.1 PRJNA16729 SAMN02954263 AAQL00000000.1              na 256318        256318
8     GCA_000208345.1 PRJNA16729 SAMN02954262 AAQK00000000.1              na 256318        256318
9     GCA_000208365.1 PRJNA13699 SAMN02954283 AAFX00000000.1              na 256318        256318
10    GCA_900010595.1 PRJEB11544 SAMEA3639840 CZPY00000000.1              na 256318        256318
# ... with 847 more rows, and 14 more variables: organism_name <chr>, infraspecific_name <chr>,
#   isolate <chr>, version_status <chr>, assembly_level <chr>, release_type <chr>, genome_rep <chr>,
#   seq_rel_date <date>, asm_name <chr>, submitter <chr>, gbrs_paired_asm <chr>, paired_asm_comp <chr>,
#   ftp_path <chr>, excluded_from_refseq <chr>
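The detailed table can be summarised with base R before anything is downloaded, for example to see how many assemblies each metagenome project contains. The sketch below uses a small hand-made data.frame with three of the column names shown above (assembly_accession, organism_name, assembly_level). Its rows are toy values for illustration only and are not real NCBI records.

```r
# Toy stand-in for the result of listMetaGenomes(details = TRUE).
# Accessions and rows are invented for illustration.
details <- data.frame(
  assembly_accession = sprintf("TOY_%02d", 1:6),
  organism_name      = c(rep("human gut metagenome", 3),
                         rep("marine metagenome", 2),
                         "soil metagenome"),
  assembly_level     = c("Contig", "Scaffold", "Contig",
                         "Contig", "Contig", "Scaffold"),
  stringsAsFactors = FALSE
)

# Number of assemblies per metagenome project, largest project first.
project_counts <- sort(table(details$organism_name), decreasing = TRUE)
print(project_counts)

# All rows that belong to one project of interest.
human_gut <- details[details$organism_name == "human gut metagenome", ]
print(human_gut)
```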

Finally, users can retrieve available metagenomes using getMetaGenomes(). The name argument receives the metagenome project name retrieved with listMetaGenomes(). The path argument specifies the folder path in which corresponding genomes shall be stored.

# retrieve all genomes belonging to the human gut metagenome project
getMetaGenomes(name = "human gut metagenome", path = file.path("_ncbi_downloads","human_gut"))
[1] "The metagenome of 'human gut metagenome' has been downloaded to '_ncbi_downloads/human_gut'."
  [1] "_ncbi_downloads/human_gut/GCA_000205525.2_ASM20552v2_genomic.fna.gz"
  [2] "_ncbi_downloads/human_gut/GCA_000205765.1_ASM20576v1_genomic.fna.gz"
  [3] "_ncbi_downloads/human_gut/GCA_000205785.1_ASM20578v1_genomic.fna.gz"
  [4] "_ncbi_downloads/human_gut/GCA_000207925.1_ASM20792v1_genomic.fna.gz"
  [5] "_ncbi_downloads/human_gut/GCA_000207945.1_ASM20794v1_genomic.fna.gz"
  [6] "_ncbi_downloads/human_gut/GCA_000207965.1_ASM20796v1_genomic.fna.gz"
  [7] "_ncbi_downloads/human_gut/GCA_000207985.1_ASM20798v1_genomic.fna.gz"
  [8] "_ncbi_downloads/human_gut/GCA_000208005.1_ASM20800v1_genomic.fna.gz"
  [9] "_ncbi_downloads/human_gut/GCA_000208025.1_ASM20802v1_genomic.fna.gz"
 [10] "_ncbi_downloads/human_gut/GCA_000208045.1_ASM20804v1_genomic.fna.gz"
 [11] "_ncbi_downloads/human_gut/GCA_000208065.1_ASM20806v1_genomic.fna.gz"
 [12] "_ncbi_downloads/human_gut/GCA_000208085.1_ASM20808v1_genomic.fna.gz"
 [13] "_ncbi_downloads/human_gut/GCA_000208105.1_ASM20810v1_genomic.fna.gz"
 [14] "_ncbi_downloads/human_gut/GCA_000208125.1_ASM20812v1_genomic.fna.gz"
 [15] "_ncbi_downloads/human_gut/GCA_000208145.1_ASM20814v1_genomic.fna.gz"
 [16] "_ncbi_downloads/human_gut/GCA_000208165.1_ASM20816v1_genomic.fna.gz"
 ...

Internally, getMetaGenomes() creates the folder specified in the path argument. Genomes associated with the metagenome project specified in the name argument are then downloaded and stored in this folder. getMetaGenomes() returns the file paths to the genome files, which can then be used as input to the read*() functions.
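Before the returned paths are passed on, it can be useful to know which assembly each file belongs to. The sketch below takes two of the file paths shown above as plain strings and extracts the assembly accession from each file name. extract_accession() is a hypothetical helper, and the pattern assumes the GCA_/GCF_ naming visible in the output above.

```r
# Two of the file paths shown above, as returned by getMetaGenomes().
genome_paths <- c(
  "_ncbi_downloads/human_gut/GCA_000205525.2_ASM20552v2_genomic.fna.gz",
  "_ncbi_downloads/human_gut/GCA_000205765.1_ASM20576v1_genomic.fna.gz"
)

# Hypothetical helper: keep the leading accession (e.g. GCA_000205525.2)
# and drop the rest of the file name.
extract_accession <- function(paths) {
  sub("^(GC[AF]_[0-9]+\\.[0-9]+)_.*$", "\\1", basename(paths))
}

accessions <- extract_accession(genome_paths)
print(accessions)
```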

Alternatively or subsequent to the metagenome retrieval, users can retrieve annotation files of genomes belonging to a metagenome project selected with listMetaGenomes() by using the getMetaGenomeAnnotations() function.

# retrieve all annotation files belonging to the human gut metagenome project
getMetaGenomeAnnotations(name = "human gut metagenome", path = file.path("_ncbi_downloads","human_gut","annotations"))
[1] "The annotations of metagenome 'human gut metagenome' have been downloaded and stored at '_ncbi_downloads/human_gut/annotations'."
  [1] "_ncbi_downloads/human_gut/annotations/GCA_000205525.2_ASM20552v2_genomic.gff.gz"
  [2] "_ncbi_downloads/human_gut/annotations/GCA_000205765.1_ASM20576v1_genomic.gff.gz"
  [3] "_ncbi_downloads/human_gut/annotations/GCA_000205785.1_ASM20578v1_genomic.gff.gz"
  [4] "_ncbi_downloads/human_gut/annotations/GCA_000207925.1_ASM20792v1_genomic.gff.gz"
  [5] "_ncbi_downloads/human_gut/annotations/GCA_000207945.1_ASM20794v1_genomic.gff.gz"
  [6] "_ncbi_downloads/human_gut/annotations/GCA_000207965.1_ASM20796v1_genomic.gff.gz"
  [7] "_ncbi_downloads/human_gut/annotations/GCA_000207985.1_ASM20798v1_genomic.gff.gz"
  [8] "_ncbi_downloads/human_gut/annotations/GCA_000208005.1_ASM20800v1_genomic.gff.gz"
  [9] "_ncbi_downloads/human_gut/annotations/GCA_000208025.1_ASM20802v1_genomic.gff.gz"
 [10] "_ncbi_downloads/human_gut/annotations/GCA_000208045.1_ASM20804v1_genomic.gff.gz"
 [11] "_ncbi_downloads/human_gut/annotations/GCA_000208065.1_ASM20806v1_genomic.gff.gz"
 [12] "_ncbi_downloads/human_gut/annotations/GCA_000208085.1_ASM20808v1_genomic.gff.gz"
 [13] "_ncbi_downloads/human_gut/annotations/GCA_000208105.1_ASM20810v1_genomic.gff.gz"
 [14] "_ncbi_downloads/human_gut/annotations/GCA_000208125.1_ASM20812v1_genomic.gff.gz"
 [15] "_ncbi_downloads/human_gut/annotations/GCA_000208145.1_ASM20814v1_genomic.gff.gz"
 [16] "_ncbi_downloads/human_gut/annotations/GCA_000208165.1_ASM20816v1_genomic.gff.gz"
 ...

The file paths of the downloaded *.gff files are returned by getMetaGenomeAnnotations() and can be used as input for the read.gff() function in the seqreadr package.
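When both genomes and annotations have been downloaded, the two path vectors can be matched by file name so that each genome is analysed together with its annotation. The following is a sketch under the assumption that file names follow the _genomic.fna.gz and _genomic.gff.gz pattern shown in the outputs above. pair_files() is a hypothetical helper, not part of biomartr.

```r
# Hypothetical helper: join genome and annotation paths on the shared
# part of the file name. Genomes without annotation get NA.
pair_files <- function(genome_paths, gff_paths) {
  stem <- function(paths, suffix) sub(suffix, "", basename(paths), fixed = TRUE)
  genomes <- data.frame(assembly = stem(genome_paths, "_genomic.fna.gz"),
                        genome = genome_paths, stringsAsFactors = FALSE)
  gffs <- data.frame(assembly = stem(gff_paths, "_genomic.gff.gz"),
                     annotation = gff_paths, stringsAsFactors = FALSE)
  merge(genomes, gffs, by = "assembly", all.x = TRUE)
}

genome_paths <- c(
  "_ncbi_downloads/human_gut/GCA_000205525.2_ASM20552v2_genomic.fna.gz",
  "_ncbi_downloads/human_gut/GCA_000205765.1_ASM20576v1_genomic.fna.gz"
)
# Only one annotation file is listed here to show the unmatched case.
gff_paths <- "_ncbi_downloads/human_gut/annotations/GCA_000205525.2_ASM20552v2_genomic.gff.gz"

paired <- pair_files(genome_paths, gff_paths)
print(paired)
```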

Retrieve Protein Sequences

Download all mammalian vertebrate proteomes.

Retrieval from NCBI RefSeq:

# download all mammalian vertebrate proteomes
meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "proteome")

Retrieval from NCBI Genbank:

# download all mammalian vertebrate proteomes
meta.retrieval(kingdom = "vertebrate_mammalian", db = "genbank", type = "proteome")

Retrieval from ENSEMBL:

# download all Ensembl proteome sequences
meta.retrieval(kingdom = "Ensembl", db = "ensembl", type = "proteome")

Retrieval from ENSEMBLGENOMES:

# download all EnsemblGenomes Bacteria proteome sequences
meta.retrieval(kingdom = "EnsemblBacteria", db = "ensemblgenomes", type = "proteome")

Retrieve CDS Sequences

Download all mammalian vertebrate CDS from RefSeq (Genbank does not store CDS data).

Retrieval from NCBI RefSeq:

# download all mammalian vertebrate CDS
meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "cds")

Retrieval from NCBI Genbank:

# download all mammalian vertebrate CDS
meta.retrieval(kingdom = "vertebrate_mammalian", db = "genbank", type = "cds")

Retrieval from ENSEMBL:

# download all Ensembl CDS sequences
meta.retrieval(kingdom = "Ensembl", db = "ensembl", type = "cds")

Retrieval from ENSEMBLGENOMES:

# download all EnsemblGenomes Bacteria CDS sequences
meta.retrieval(kingdom = "EnsemblBacteria", db = "ensemblgenomes", type = "cds")

Retrieve GFF files

Download all mammalian vertebrate gff files.

Example NCBI RefSeq:

# download all vertebrate gff files
meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "gff")

Example NCBI Genbank:

# download all vertebrate gff files
meta.retrieval(kingdom = "vertebrate_mammalian", db = "genbank", type = "gff")

Retrieve GTF files

Download gtf files from ENSEMBL and ENSEMBLGENOMES.

Example ENSEMBL:

# download all Ensembl gtf files
meta.retrieval(kingdom = "Ensembl", db = "ensembl", type = "gtf")

Example ENSEMBLGENOMES:

# download all EnsemblGenomes Plants gtf files
meta.retrieval(kingdom = "EnsemblPlants", db = "ensemblgenomes", type = "gtf")

Retrieve RNA sequences

Download all mammalian vertebrate RNA sequences from NCBI RefSeq and NCBI Genbank.

Retrieval from NCBI RefSeq:

# download all mammalian vertebrate RNA sequences
meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "rna")

Retrieval from NCBI Genbank:

# download all mammalian vertebrate RNA sequences
meta.retrieval(kingdom = "vertebrate_mammalian", db = "genbank", type = "rna")

Retrieval from ENSEMBL:

# download all Ensembl RNA sequences
meta.retrieval(kingdom = "Ensembl", db = "ensembl", type = "rna")

Retrieval from ENSEMBLGENOMES:

# download all EnsemblGenomes Bacteria RNA sequences
meta.retrieval(kingdom = "EnsemblBacteria", db = "ensemblgenomes", type = "rna")

Retrieve Repeat Masker Sequences

Download all mammalian vertebrate Repeat Masker Annotation files from NCBI RefSeq and NCBI Genbank.

Retrieval from NCBI RefSeq:

# download all mammalian vertebrate Repeat Masker annotation files
meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "rm")

Retrieval from NCBI Genbank:

# download all mammalian vertebrate Repeat Masker annotation files
meta.retrieval(kingdom = "vertebrate_mammalian", db = "genbank", type = "rm")

Users can obtain alternative kingdoms using getKingdoms().

Retrieve Individual Genomes for all Species in the Tree of Life

If users wish to download all genome, proteome, CDS, or gff files for all species available in RefSeq or Genbank, they can use the meta.retrieval.all() function.

Genome Retrieval

Example RefSeq:

# download all genomes stored in RefSeq
meta.retrieval.all(db = "refseq", type = "genome")

Example Genbank:

# download all genomes stored in Genbank
meta.retrieval.all(db = "genbank", type = "genome")

Proteome Retrieval

Example RefSeq:

# download all proteomes stored in RefSeq
meta.retrieval.all(db = "refseq", type = "proteome")

Example Genbank:

# download all proteomes stored in Genbank
meta.retrieval.all(db = "genbank", type = "proteome")

Again, by replacing type = "proteome" with either type = "genome", type = "cds", type = "rna", type = "assemblystats", or type = "gff", users can retrieve genome, CDS, RNA, genome assembly statistics or GFF files instead of proteomes.
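Because the accepted type values differ between databases, a small lookup can help to catch an unsupported combination before a long download starts. The table below lists only the db and type combinations that appear in the examples of this document. It is not a complete statement of what biomartr supports, and is_shown() is a hypothetical helper.

```r
# db/type combinations that appear in the examples of this document.
# This is not a complete support matrix.
types_shown <- list(
  refseq         = c("genome", "proteome", "cds", "gff", "rna", "rm",
                     "assemblystats"),
  genbank        = c("genome", "proteome", "cds", "gff", "rna", "rm"),
  ensembl        = c("genome", "proteome", "cds", "gtf", "rna"),
  ensemblgenomes = c("genome", "proteome", "cds", "gtf", "rna")
)

# Hypothetical helper: TRUE when the combination is among the examples.
is_shown <- function(db, type, shown = types_shown) {
  db %in% names(shown) && type %in% shown[[db]]
}

print(is_shown("ensembl", "gtf"))
print(is_shown("refseq", "gtf"))
```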