Sequence Retrieval

2018-06-27

Biological Sequence Retrieval

The biomartr package allows users to retrieve biological sequences in a very simple and intuitive way.

Using biomartr, users can retrieve either genomes, proteomes, CDS, RNA, GFF, and genome assembly statistics data using the specialized functions:

Getting Started with Sequence Retrieval

First users can check whether or not the genome, proteome, CDS, RNA, GFF, GTF, or genome assembly statistics of their interest is available for download.

Using the scientific name of the organism of interest, users can check whether the corresponding genome is available via the is.genome.available() function.

Please note that the first time you run this command it might take a while, because during the initial execution of this function all necessary information is retrieved from NCBI and then stored locally. All further runs are then much faster.

Example NCBI RefSeq (?is.genome.available):

# checking whether or not the Homo sapiens 
# genome is avaialable for download
is.genome.available(db = "refseq", organism = "Homo sapiens")
A reference or representative genome assembly is available for 'Homo sapiens'.
[1] TRUE

In the case of the human genome, there are more than one entry in the NCBI RefSeq database (see message). By specifying the argument ‘details = TRUE’ users can retrieve all information for ‘Homo sapiens’ that is stored in NCBI RefSeq.

# checking whether or not the Homo sapiens 
# genome is avaialable for download
is.genome.available(db = "refseq", organism = "Homo sapiens", details = TRUE)
  assembly_accessi… bioproject biosample wgs_master refseq_category taxid
  <chr>             <chr>      <chr>     <chr>      <chr>           <int>
1 GCF_000001405.38  PRJNA168   NA        NA         reference geno…  9606
# ... with 15 more variables: species_taxid <int>, organism_name <chr>,
#   infraspecific_name <chr>, isolate <chr>, version_status <chr>,
#   assembly_level <chr>, release_type <chr>, genome_rep <chr>,
#   seq_rel_date <date>, asm_name <chr>, submitter <chr>,
#   gbrs_paired_asm <chr>, paired_asm_comp <chr>, ftp_path <chr>,
#   excluded_from_refseq <chr>

Here, we find one possible versions of the human genome having the assembly_accession ID GCF_000001405.38.

When retrieving a genome with e.g. getGenome() the organism argument can also be specified using the assembly_accession ID instead of the scientific name of the organism. This is true for all get*() functions. Hence, instead of writing getGenome(db = "refseq", organism = "Homo sapiens"), users can specify getGenome(db = "refseq", organism = "GCF_000001405.38"). This is particularly useful when more than one entry is available for one organism.

Please note that the assembly_accession id might change internally in external databases such as NCBI RefSeq. Thus when writing automated scripts using the assembly_accession id they might stop working due to the internal change of ids at RefSeq. Recently, I had the case where the assembly_accession id was changed at RefSeq from GCF_000001405.37 to GCF_000001405.38. Thus, scripts based on screening for entries with is.genome.available() stopped working, because the id GCF_000001405.37 couldn’t be found anymore.

In some cases users will find several entries for the same scientific name. This might be due to the fact that ecotypes, strains, or other types of sub-species are available in the respective databases.

For example. when we search for the bacterium Mycobacterium tuberculosis in NCBI RefSeq we get 5377 hits.

is.genome.available(organism = "Mycobacterium tuberculosis", db = "refseq", details = TRUE)
A tibble: 5,377 x 21
   assembly_accessi… bioproject biosample wgs_master refseq_category  taxid
   <chr>             <chr>      <chr>     <chr>      <chr>            <int>
 1 GCF_000008585.1   PRJNA2241… SAMN0260… NA         na               83331
 2 GCF_000016145.1   PRJNA2241… SAMN0260… NA         na              419947
 3 GCF_000016925.1   PRJNA2241… SAMN0010… NA         na              336982
 4 GCF_000023625.1   PRJNA2241… SAMN0076… NA         na              478434
 5 GCF_000152105.1   PRJNA2241… SAMN0259… ADHQ00000… na              675512
 6 GCF_000153685.2   PRJNA2241… SAMN0308… NA         na              395095
 7 GCF_000154585.2   PRJNA2241… SAMN0308… NA         na              478433
 8 GCF_000154605.2   PRJNA2241… SAMN0010… NA         na              478435
 9 GCF_000155795.1   PRJNA2241… SAMN0259… ABVM00000… na              555461
10 GCF_000159755.1   PRJNA2241… SAMN0259… ACHP00000… na              611303
# ... with 5,367 more rows, and 15 more variables: species_taxid <int>,
#   organism_name <chr>, infraspecific_name <chr>, isolate <chr>,
#   version_status <chr>, assembly_level <chr>, release_type <chr>,
#   genome_rep <chr>, seq_rel_date <date>, asm_name <chr>, submitter <chr>,
#   gbrs_paired_asm <chr>, paired_asm_comp <chr>, ftp_path <chr>,
#   excluded_from_refseq <chr>

When looking at the names of these organisms we see that they consist of different strains of Mycobacterium tuberculosis.

head(is.genome.available(organism = "Mycobacterium tuberculosis", 
                         db = "refseq", 
                         details = TRUE)$organism_name)
[1] "Mycobacterium tuberculosis CDC1551"     
[2] "Mycobacterium tuberculosis H37Ra"       
[3] "Mycobacterium tuberculosis F11"         
[4] "Mycobacterium tuberculosis KZN 1435"    
[5] "Mycobacterium tuberculosis SUMu001"     
[6] "Mycobacterium tuberculosis str. Haarlem"

Now we can use the assembly_accession id to retrieve the Mycobacterium tuberculosis strain we are interested in, e.g. Mycobacterium tuberculosis CDC1551.

MtbCDC1551 <- getGenome(db = "refseq", 
                        organism = "GCF_000008585.1",  
                        path   = file.path("_ncbi_downloads","genomes"), 
                        reference = FALSE)
                        
MtbCDC1551_genome <- read_genome(MtbCDC1551)

MtbCDC1551_genome
A DNAStringSet instance of length 1
      width seq                             names               
[1] 4403837 TTGACCGATGACCC...AGGGAGATACGTCG NC_002755.2 Mycob...

Using the NCBI Taxonomy ID instead of the scientific name to screen for organism availability

Instead of specifying the scientific name of the organism of interest users can specify the NCBI Taxonomy identifier (= taxid) of the corresponding organism. For example, the taxid of Homo sapiens is 9606. Now users can specify organism = "9606" to retrieve entries for Homo sapiens:

# checking availability for Homo sapiens using its taxid
is.genome.available(db = "refseq", organism = "9606", details = TRUE)
  assembly_accession bioproject biosample wgs_master refseq_category  taxid
  <chr>              <chr>      <chr>     <chr>      <chr>            <int>
1 GCF_000001405.38   PRJNA168   NA        NA         reference genome  9606
# ... with 15 more variables: species_taxid <int>, organism_name <chr>,
#   infraspecific_name <chr>, isolate <chr>, version_status <chr>,
#   assembly_level <chr>, release_type <chr>, genome_rep <chr>,
#   seq_rel_date <date>, asm_name <chr>, submitter <chr>,
#   gbrs_paired_asm <chr>, paired_asm_comp <chr>, ftp_path <chr>,
#   excluded_from_refseq <chr>

Using the accession ID instead of the scientific name or taxid to screen for organism availability

Finally, instead of specifying either the scientific name of the organism of interest nor the taxid, users can specify the accession ID of the organism of interest. In the following example we use the accession ID of Homo sapiens (= GCF_000001405.38):

# checking availability for Homo sapiens using its taxid
is.genome.available(db = "refseq", organism = "GCF_000001405.38", details = TRUE)
assembly_accession bioproject biosample wgs_master refseq_category  taxid
  <chr>              <chr>      <chr>     <chr>      <chr>            <int>
1 GCF_000001405.38   PRJNA168   NA        NA         reference genome  9606
# ... with 15 more variables: species_taxid <int>, organism_name <chr>,
#   infraspecific_name <chr>, isolate <chr>, version_status <chr>,
#   assembly_level <chr>, release_type <chr>, genome_rep <chr>,
#   seq_rel_date <date>, asm_name <chr>, submitter <chr>,
#   gbrs_paired_asm <chr>, paired_asm_comp <chr>, ftp_path <chr>,
#   excluded_from_refseq <chr>

A small negative example

In some cases your organism of interest will not be available in NCBI RefSeq. Here an example:

# check genome availability for Candida glabrata
is.genome.available(db = "refseq", organism = "Candida glabrata")
Unfortunatey, no entry for 'Candida glabrata' was found in the 'refseq' database. 
Please consider specifying 'db = genbank' or 'db = ensembl' or 'db = ensemblgenomes 
to check whether 'Candida glabrata' is availabe in these databases.
[1] FALSE

When now checking the availability in NCBI Genbank we find that indeed ‘Candida glabrata’ is availabe:

# check genome availability for Candida glabrata
is.genome.available(db = "genbank", organism = "Candida glabrata")
Only a non-reference genome assembly is available for 'Candida glabrata'. 
Please make sure to specify the argument 'reference = FALSE' when running any get*() function.
[1] TRUE

Although, an entry is available the is.genome.available() warns us that only a non-reference genome is available for ‘Candida glabrata’. To then retrieve the e.g. genome etc. files users need to speficy the reference = FALSE argument in the get*() functions.

For example:

# retrieve non-reference genome 
getGenome(db = "genbank", organism = "Candida glabrata", reference = FALSE)
Only a non-reference genome assembly is available for 'Candida glabrata'. 
Please make sure to specify the argument 'reference = FALSE' when running any get*() function.
Genome download is completed!
Checking md5 hash of file: _ncbi_downloads/genomes/Candida_glabrata_md5checksums.txt ...
The md5 hash of file '_ncbi_downloads/genomes/Candida_glabrata_md5checksums.txt' matches!
The genome of 'Candida_glabrata' has been downloaded to '_ncbi_downloads/genomes' 
and has been named 'Candida_glabrata_genomic_genbank.fna.gz'.
[1] "_ncbi_downloads/genomes/Candida_glabrata_genomic_genbank.fna.gz"

Example NCBI Genbank (?is.genome.available):

Test whether or not the genome of Homo sapiens is present at NCBI Genbank.

# checking whether or not the Homo sapiens 
# genome is avaialable for download
is.genome.available(db = "genbank", organism = "Homo sapiens")
A reference or representative genome assembly is available for 'Homo sapiens'.
More than one entry was found for 'Homo sapiens'. 
Please consider to re-run this funtion and specify 'details = TRUE'. 
This will allow you to select the 'assembly_accession' identifier 
that can then be specified in all get*() functions.
[1] TRUE

Now with details = TRUE.

# checking whether or not the Homo sapiens 
# genome is avaialable for download
is.genome.available(db = "genbank", organism = "Homo sapiens", details = TRUE)
 assembly_accessi… bioproject biosample   wgs_master  refseq_category taxid
   <chr>             <chr>      <chr>       <chr>       <chr>           <int>
 1 GCA_000001405.27  PRJNA31257 NA          NA          reference geno…  9606
 2 GCA_000002115.2   PRJNA1431  SAMN029812… AADD000000… na               9606
 3 GCA_000002125.2   PRJNA19621 SAMN029812… ABBA000000… na               9606
 4 GCA_000002135.3   PRJNA10793 NA          NA          na               9606
 5 GCA_000004845.2   PRJNA42199 SAMN000033… ADDF000000… na               9606
 6 GCA_000005465.1   PRJNA42201 NA          DAAB000000… na               9606
 7 GCA_000181135.1   PRJNA28335 SAMN000016… ABKV000000… na               9606
 8 GCA_000185165.1   PRJNA59877 SAMN029812… AEKP000000… na               9606
 9 GCA_000212995.1   PRJNA19621 SAMN029812… ABSL000000… na               9606
10 GCA_000252825.1   PRJNA19621 SAMN029812… ABBA000000… na               9606
# ... with 70 more rows, and 15 more variables: species_taxid <int>,
#   organism_name <chr>, infraspecific_name <chr>, isolate <chr>,
#   version_status <chr>, assembly_level <chr>, release_type <chr>,
#   genome_rep <chr>, seq_rel_date <date>, asm_name <chr>, submitter <chr>,
#   gbrs_paired_asm <chr>, paired_asm_comp <chr>, ftp_path <chr>,
#   excluded_from_refseq <chr>

As you can see there are several versions of the Homo sapiens genome available for download from NCBI Genbank. Using the assembly_accession id will now allow to specify which version shall be retrieved.

Using is.genome.available() with ENSEMBL and ENSEMBLGENOMES

Users can also specify db = "ensembl" or db = "ensemblgenomes" to retrieve available organisms provided by ENSEMBL or ENSEMBLGENOMES. Again, users might experience a delay in the execution of this function when running it for the first time. This is due to the download of ENSEMBL or ENSEMBLGENOMES information which is then stored internally to enable a much faster execution of this function in following runs. The corresponding information files are stored at file.path(tempdir(), "ensembl_summary.txt") and file.path(tempdir(), "ensemblgenomes_summary.txt").

Example ENSEMBL (?is.genome.available):

# cheking whether Homo sapiens is available in the ENSEMBL database
is.genome.available(db = "ensembl", organism = "Homo sapiens")
A reference or representative genome assembly is available for 'Homo sapiens'.
[1] TRUE
# retrieve details for Homo sapiens
is.genome.available(db = "ensembl", organism = "Homo sapiens", details = TRUE)
  division taxon_id name    release display_name accession   strain_collecti…
  <chr>       <int> <chr>     <int> <chr>        <chr>       <chr>           
1 Ensembl      9606 homo_s…      92 Human        GCA_000001… NA              
# ... with 3 more variables: common_name <chr>, strain <chr>, assembly <chr>

Again, users can either specify the taxid or accession id when searching for organism entries.

# retrieve details for Homo sapiens using taxid
is.genome.available(db = "ensembl", organism = "9606", details = TRUE)
division taxon_id name    release display_name accession   strain_collecti…
  <chr>       <int> <chr>     <int> <chr>        <chr>       <chr>           
1 Ensembl      9606 homo_s…      92 Human        GCA_000001… NA              
# ... with 3 more variables: common_name <chr>, strain <chr>, assembly <chr>
> 
# retrieve details for Homo sapiens using accession id
is.genome.available(organism = "GCA_000001405.27", db = "ensembl", details = TRUE)
  division taxon_id name    release display_name accession  strain_collecti…
  <chr>       <int> <chr>     <int> <chr>        <chr>      <chr>           
1 Ensembl      9606 homo_s…      92 Human        GCA_00000… NA              
# ... with 3 more variables: common_name <chr>, strain <chr>, assembly <chr>

Please note that the accession id can change internally at ENSEMBL. E.g. in a recent case the accession id changed from GCA_000001405.25 to GCA_000001405.27. Hence, please be careful and take this issue into account when you build automated retrieval scripts that are based on accession ids.

Example ENSEMBLGENOMES (?is.genome.available):

For example, some species that cannot be found at db = "ensembl" might be available at db = "ensemblgenomes" and vice versa. So I recommend users to to check in both databases ENSEMBL and ESEMBLGENOMES whether or not a particular species is present. In case of "Homo sapiens", the genome is available at db = "ensembl" but not at db = "ensemblgenomes" whereas the genome of "Arabidopsis thaliana" is available at db = "ensemblgenomes" but not at db = "ensembl".

# cheking whether Homo sapiens is available in the ENSEMBLGENOMES database
is.genome.available(db = "ensemblgenomes", organism = "Homo sapiens")
Unfortunatey, no entry for 'Homo sapiens' was found in the 'ensemblgenomes' database.
Please consider specifying 'db = refseq' or 'db = genbank' or 'db = ensembl' or 'db = uniprot' 
to check whether 'Homo sapiens' is available in these databases.
[1] FALSE
# cheking whether Arabidopsis thaliana is available in the ENSEMBLGENOMES database
is.genome.available(db = "ensemblgenomes", organism = "Arabidopsis thaliana")
A reference or representative genome assembly is available for 'Arabidopsis thaliana'.
[1] TRUE
# show details for Arabidopsis thaliana
is.genome.available(db = "ensemblgenomes", organism = "Arabidopsis thaliana", details = TRUE)
 division  taxon_id name    release display_name accession strain_collecti…
  <chr>        <int> <chr>     <int> <chr>        <chr>     <chr>           
1 EnsemblP…     3702 arabid…      92 Arabidopsis… GCA_0000… NA              
# ... with 3 more variables: common_name <chr>, strain <chr>, assembly <chr>

Please note that the detailed information provided by ENSEMBL or ENSEMBL genomes differs from the information provided by NCBI.

Also for ENSEMBLGENOMES users can specify taxid or accession id instead of the scientific name in the argument organism:

# retrieve information 
is.genome.available(db = "ensemblgenomes", "3702", details = TRUE)
  division  taxon_id name    release display_name accession strain_collecti…
  <chr>        <int> <chr>     <int> <chr>        <chr>     <chr>           
1 EnsemblP…     3702 arabid…      92 Arabidopsis… GCA_0000… NA              
# ... with 3 more variables: common_name <chr>, strain <chr>, assembly <chr>

Example UniProt (?is.genome.available):

Users can also check the availability of proteomes in the UniProt database by specifying:

# retrieve information from UniProt
is.genome.available(db = "uniprot", "Homo sapiens", details = FALSE)
A reference or representative genome assembly is available for 'Homo sapiens'.
[1] TRUE

or with details:

# retrieve information from UniProt
is.genome.available(db = "uniprot", "Homo sapiens", details = TRUE)
 name   description isReferenceProt… isRepresentativ… dbReference component
* <chr>  <chr>       <lgl>            <lgl>            <list>      <list>   
1 Homo … Homo sapie… TRUE             TRUE             <data.fram… <data.fr…
# ... with 7 more variables: reference <list>, upid <chr>, modified <dbl>,
#   taxonomy <int>, source <chr>, sourceTaxonomy <int>, superregnum <chr>

Users can also search available species at UniProt via taxid or upid id.

Here 9606 defines the taxonomy id for Homo sapiens.

# retrieve information from UniProt
is.genome.available(db = "uniprot", "9606", details = TRUE)
 name   description isReferenceProt… isRepresentativ… dbReference component
* <chr>  <chr>       <lgl>            <lgl>            <list>      <list>   
1 Homo … Homo sapie… TRUE             TRUE             <data.fram… <data.fr…
# ... with 7 more variables: reference <list>, upid <chr>, modified <dbl>,
#   taxonomy <int>, source <chr>, sourceTaxonomy <int>, superregnum <chr>

Here UP000005640 defines the upid for Homo sapiens.

# retrieve information from UniProt
is.genome.available(db = "uniprot", "UP000005640", details = TRUE)
 name   description isReferenceProt… isRepresentativ… dbReference component
* <chr>  <chr>       <lgl>            <lgl>            <list>      <list>   
1 Homo … Homo sapie… TRUE             TRUE             <data.fram… <data.fr…
# ... with 7 more variables: reference <list>, upid <chr>, modified <dbl>,
#   taxonomy <int>, source <chr>, sourceTaxonomy <int>, superregnum <chr>

In general, the argument db specifies from which database (refseq, genbank, ensembl or ensemblgenomes, uniprot) organism information shall be retrieved. Options are:

Listing the total number of available genomes

In some cases it might be useful to check how many genomes (in total) are available in the different databases.

Users can determine the total number of available genomes using the listGenomes() function.

Example refseq:

length(listGenomes(db = "refseq"))
[1] 14264

Example genbank:

length(listGenomes(db = "genbank"))
[1] 21770

Example ensembl:

length(listGenomes(db = "ensembl"))
[1] 113

Example ensemblgenomes:

length(listGenomes(db = "ensemblgenomes"))
[1] 45172

Hence, currently 14264 genomes (including all kingdoms of life) are stored on NCBI RefSeq (as of 31/05/2018).

Retrieving kingdom, group and subgroup information

Using this example users can retrieve the number of available species for each kingdom of life:

Example refseq:

# the number of genomes available for each kingdom
listKingdoms(db = "refseq")
    Archaea  Bacteria Eukaryota   Viroids   Viruses 
      169      6321       561        47      7166 

Example genbank:

# the number of genomes available for each kingdom
listKingdoms(db = "genbank")
  Archaea  Bacteria Eukaryota 
     1249     18083      2438

Example ENSEMBL:

# the number of genomes available for each kingdom
listKingdoms(db = "ensembl")
Ensembl 
    113

Example ENSEMBLGENOMES:

# the number of genomes available for each kingdom
listKingdoms(db = "ensemblgenomes")
EnsemblBacteria    EnsemblFungi  EnsemblMetazoa   EnsemblPlants EnsemblProtists 
          44048             811              71              53             189 

Analogous computations can be performed for groups and subgroups

Unfortunately, ENSEMBL and ENSEMBLGENOMES do not provide group or subgroup information. Therefore, group and subgroup listings are limited to refseq and genbank.

Example refseq:

# the number of genomes available for each group
listGroups(db = "refseq")
                 Abditibacteriota                                   Acidithiobacillia 
                                                  1                                                   5 
                                     Acidobacteriia                                    Ackermannviridae 
                                                 16                                                  21 
                                     Actinobacteria                                        Adenoviridae 
                                               1226                                                  62 
                                  Alloherpesviridae                                   Alphaflexiviridae 
                                                  8                                                  56 
                                Alphaproteobacteria                                   Alphasatellitidae 
                                                857                                                  67 
                                  Alphatetraviridae                                      Alvernaviridae 
                                                  3                                                   1 
                                      Amalgaviridae                                          Amphibians 
                                                 10                                                   3 
                                     Ampullaviridae                                       Anelloviridae 
                                                  3                                                  61 
                                      Apicomplexans                          Apple fruit crinkle viroid 
                                                 18                                                   1 
          Apple hammerhead viroid-like circular RNA                                         Apscaviroid 
                                                  1                                                  12 
                                          Aquificae                                        Archaeoglobi 
                                                  6                                                   3 
                                       Arenaviridae                                     Armatimonadetes 
                                                 37                                                   4 
                                      Arteriviridae                                         Ascomycetes 
                                                 13                                                  45 
                                        Ascoviridae                                        Asfarviridae 
                                                  4                                                   1 
                                        Aspiviridae                                        Astroviridae 
                                                  3                                                  40 
                                        Avsunviroid                                   Bacilladnaviridae 
                                                  1                                                   7 
                           Bacteria candidate phyla                        Bacteroidetes/Chlorobi group 
                                                  3                                                 654 
                                      Baculoviridae                                           Balneolia 
                                                 66                                                   4 
                                       Barnaviridae                                      Basidiomycetes 
                                                  1                                                   5 
                                        Benyviridae                                    Betaflexiviridae 
                                                  3                                                  92 
                                 Betaproteobacteria                                      Bicaudaviridae 
                                                407                                                   4 
                                              Birds                                        Birnaviridae 
                                                 54                                                   8 
                                     Blastocatellia                                        Bornaviridae 
                                                  1                                                   5 
                                       Bromoviridae                                       Caliciviridae 
                                                 35                                                  26 
                               Candidatus Kryptonia                            Candidatus Micrarchaeota 
                                                  3                                                   1 
                           Candidatus Tectomicrobia                                   Carmotetraviridae 
                                                  1                                                   1 
                                     Caulimoviridae Cherry leaf scorch small circular viroid-like RNA 1 
                                                 67                                                   1 
              Cherry small circular viroid-like RNA                                          Chlamydiae 
                                                  1                                                  19 
                                        Chloroflexi                                       Chrysoviridae 
                                                 23                                                   9 
                                       Circoviridae                                     Closteroviridae 
                                                148                                                  40 
                                        Cocadviroid                                          Coleviroid 
                                                  4                                                   6 
                               Coprothermobacterota                                       Coronaviridae 
                                                  1                                                  47 
                                     Corticoviridae                                       Crenarchaeota 
                                                  1                                                   9 
                Cyanobacteria/Melainabacteria group                                        Cystoviridae 
                                                 22                                                   5 
                                Deinococcus-Thermus                          delta/epsilon subdivisions 
                                                 23                                                 142 
                                  Deltaflexiviridae                                     Dicistroviridae 
                                                  1                                                  27 
                                          Elaviroid                                        Endomicrobia 
                                                  1                                                   2 
                                     Endornaviridae                                       Fibrobacteres 
                                                 29                                                   2 
                                        Filoviridae                                         Fimoviridae 
                                                  7                                                   6 
                                         Firmicutes                                              Fishes 
                                               1378                                                  52 
                                          Flatworms                                        Flaviviridae 
                                                  4                                                 107 
                                      Fusariviridae                                      Fuselloviridae 
                                                  7                                                  10 
                                      Fusobacteriia                                   Gammaflexiviridae 
                                                 20                                                   1 
                                Gammaproteobacteria                                       Geminiviridae 
                                               1285                                                 406 
                                   Gemmatimonadetes                                       Genomoviridae 
                                                  2                                                  62 
                                     Globuloviridae                             Grapevine latent viroid 
                                                  2                                                   1 
                                        Green Algae                                        Guttaviridae 
                                                  8                                                   1 
                                       Halobacteria                                        Hantaviridae 
                                                 80                                                  24 
                                     Hepadnaviridae                                         Hepeviridae 
                                                 16                                                   3 
                                      Herpesviridae                                         Hostuviroid 
                                                 72                                                   2 
                                  Hydrogenophilalia                                         Hypoviridae 
                                                  1                                                  15 
                                     Hytrosaviridae                                         Iflaviridae 
                                                  2                                                  34 
                                         Inoviridae                                             Insects 
                                                 42                                                 110 
                                       Iridoviridae                                        Kinetoplasts 
                                                 20                                                   5 
                                 Kiritimatiellaeota                                         Land Plants 
                                                  1                                                  80 
                                      Lavidaviridae                                       Lentisphaerae 
                                                  2                                                   1 
                                        Leviviridae                                    Lipothrixviridae 
                                                 11                                                   8 
                                       Luteoviridae                                 Malacoherpesviridae 
                                                 48                                                   2 
                                            Mammals                                        Marnaviridae 
                                                100                                                   1 
                                   Marseilleviridae                                    Megabirnaviridae 
                                                  6                                                   3 
                                      Mesoniviridae                                     Methanobacteria 
                                                  7                                                  18 
                                       Methanococci                                     Methanomicrobia 
                                                  2                                                  21 
                              Methanonatronarchaeia                                        Microviridae 
                                                  1                                                  42 
                                        Mimiviridae           Mulberry small circular viroid-like RNA 1 
                                                  5                                                   1 
                                      Mymonaviridae                                          Myoviridae 
                                                  1                                                 516 
                                       Nairoviridae                                         Nanoviridae 
                                                 13                                                   8 
                                       Narnaviridae                                         Nimaviridae 
                                                 39                                                   1 
                                         Nitrospira                                         Nodaviridae 
                                                 10                                                  15 
                                        Nudiviridae                                        Nyamiviridae 
                                                  5                                                   4 
                                        Oligoflexia                                    Orthomyxoviridae 
                                                  3                                                   6 
                                              Other                                       Other Animals 
                                                988                                                  35 
                                        Other Fungi                                      Other Protists 
                                                  3                                                  16 
                                   Papillomaviridae                                     Paramyxoviridae 
                                                134                                                  54 
                                     Partitiviridae                                        Parvoviridae 
                                                 55                                                  90 
                                       Pelamoviroid                                    Peribunyaviridae 
                                                  2                                                  32 
                                Permutotetraviridae                                    Persimmon viroid 
                                                  1                                                   1 
                                      Phasmaviridae                                       Phenuiviridae 
                                                  4                                                  28 
                                    Phycodnaviridae                                    Picobirnaviridae 
                                                 26                                                   5 
                                     Picornaviridae                                        Pithoviridae 
                                                104                                                   2 
                                     Planctomycetes                                       Plasmaviridae 
                                                 18                                                   1 
                                    Pleolipoviridae                                       Pneumoviridae 
                                                  8                                                   7 
                                        Podoviridae                                     Polycipiviridae 
                                                343                                                   6 
                                     Polydnaviridae                                      Polyomaviridae 
                                                  5                                                  83 
                                        Pospiviroid                                         Potyviridae 
                                                 11                                                 146 
                                         Poxviridae                                       Quadriviridae 
                                                 47                                                   1 
                                         Reoviridae                                            Reptiles 
                                                 67                                                  11 
                                       Retroviridae                                       Rhabdoviridae 
                                                 52                                                 150 
                                       Rhodothermia                                         Roniviridae 
                                                  4                                                   1 
                                         Roundworms                            Rubber viroid India/2009 
                                                  8                                                   1 
                                        Rudiviridae                                         Secoviridae 
                                                 15                                                  53 
                                       Siphoviridae                                       Solemoviridae 
                                               1061                                                  20 
                                       Solibacteres                                      Solinviviridae 
                                                  3                                                   2 
                                 Sphaerolipoviridae                                        Spirochaetia 
                                                  7                                                  39 
                                         Sunviridae                                         Synergistia 
                                                  1                                                   4 
                                       Tectiviridae                                         Tenericutes 
                                                  5                                                  69 
                                     Thaumarchaeota                                         Thermococci 
                                                  8                                                  19 
                              Thermodesulfobacteria                                      Thermoplasmata 
                                                  4                                                   7 
                                        Thermotogae                                         Togaviridae 
                                                 10                                                  24 
                                 Tolecusatellitidae                                       Tombusviridae 
                                                101                                                  68 
                                       Tospoviridae                                         Totiviridae 
                                                 17                                                  66 
                                   Tristromaviridae                                        Turriviridae 
                                                  1                                                   2 
                                        Tymoviridae                                        unclassified 
                                                 37                                                 606 
                         unclassified Acidobacteria               unclassified Bacteria (miscellaneous) 
                                                  1                                                  13 
                           unclassified Nitrospirae                         unclassified Proteobacteria 
                                                  1                                                   2 
                                    Verrucomicrobia                                        Virgaviridae 
                                                 25                                                  52 
                                 Zetaproteobacteria 
                                                  5 

Example genbank:

# the number of genomes available for each group
listGroups(db = "genbank")
                       Abditibacteriota                       Acidithiobacillia 
                                      1                                       8 
                         Acidobacteriia                          Actinobacteria 
                                     24                                    1596 
                    Alphaproteobacteria                              Amphibians 
                                   1604                                       6 
                          Apicomplexans                               Aquificae 
                                     47                                       7 
                           Archaeoglobi                         Armatimonadetes 
                                      5                                      38 
                            Ascomycetes                Bacteria candidate phyla 
                                    689                                    3532 
           Bacteroidetes/Chlorobi group                               Balneolia 
                                   2247                                      44 
                         Basidiomycetes                      Betaproteobacteria 
                                    204                                     751 
                                  Birds                          Blastocatellia 
                                     80                                       2 
                           Caldisericia                  candidate division WS1 
                                      1                                       1 
        candidate division Zixibacteria               Candidatus Aegiribacteria 
                                     17                                       1 
             Candidatus Aenigmarchaeota               Candidatus Bathyarchaeota 
                                     14                                      42 
               Candidatus Cloacimonetes               Candidatus Diapherotrites 
                                     88                                      11 
            Candidatus Fermentibacteria            Candidatus Geothermarchaeota 
                                      8                                       3 
           Candidatus Heimdallarchaeota              Candidatus Hydrogenedentes 
                                      4                                      10 
                Candidatus Korarchaeota                    Candidatus Kryptonia 
                                      1                                       4 
        Candidatus Lambdaproteobacteria              Candidatus Latescibacteria 
                                      6                                      13 
               Candidatus Lokiarchaeota               Candidatus Marinimicrobia 
                                      2                                      92 
               Candidatus Marsarchaeota                Candidatus Micrarchaeota 
                                     15                                      35 
                Candidatus Moduliflexus             Candidatus Muproteobacteria 
                                      1                                      14 
           Candidatus Nanohaloarchaeota                Candidatus Odinarchaeota 
                                     16                                       1 
                Candidatus Omnitrophica                Candidatus Pacearchaeota 
                                    126                                      41 
               Candidatus Parvarchaeota                Candidatus Tectomicrobia 
                                     11                                       6 
               Candidatus Thorarchaeota                 Candidatus Vecturithrix 
                                      7                                       1 
         Candidatus Verstraetearchaeota               Candidatus Woesearchaeota 
                                      5                                      36 
                             Chlamydiae                             Chloroflexi 
                                     43                                     403 
                         Chrysiogenetes                    Coprothermobacterota 
                                      1                                       1 
                          Crenarchaeota     Cyanobacteria/Melainabacteria group 
                                     54                                     172 
                        Deferribacteres                     Deinococcus-Thermus 
                                      8                                      27 
             delta/epsilon subdivisions                            Dictyoglomia 
                                    615                                       1 
                          Elusimicrobia                            Endomicrobia 
                                      1                                       2 
                  environmental samples                           Fibrobacteres 
                                      3                                      15 
                             Firmicutes                                  Fishes 
                                   2663                                     168 
                              Flatworms                           Fusobacteriia 
                                     35                                      29 
                    Gammaproteobacteria                        Gemmatimonadetes 
                                   2041                                      82 
                            Green Algae                            Hadesarchaea 
                                     30                                       5 
                           Halobacteria                              Holophagae 
                                    162                                       9 
                      Hydrogenophilalia                                 Insects 
                                     32                                     280 
                           Kinetoplasts                      Kiritimatiellaeota 
                                     33                                       3 
                            Land Plants                           Lentisphaerae 
                                    282                                      32 
                                Mammals                         Methanobacteria 
                                    145                                      31 
                           Methanococci                         Methanomicrobia 
                                      2                                      77 
                  Methanonatronarchaeia                             Methanopyri 
                                      2                                       1 
                          Nanoarchaeota nitrifying bacterium enrichment culture 
                                     11                                       1 
                            Nitrospinae                              Nitrospira 
                                     19                                      31 
                            Oligoflexia                                   Other 
                                     87                                      15 
                          Other Animals                             Other Fungi 
                                    125                                      58 
                           Other Plants                          Other Protists 
                                      1                                     128 
                         Planctomycetes                                Reptiles 
                                    171                                      21 
                           Rhodothermia                              Roundworms 
                                      4                                      91 
                           Solibacteres                            Spirochaetia 
                                      7                                     115 
                            Synergistia                             Tenericutes 
                                     43                                     141 
                         Thaumarchaeota                           Theionarchaea 
                                     92                                       2 
                            Thermococci                   Thermodesulfobacteria 
                                     21                                       7 
                         Thermoplasmata                             Thermotogae 
                                    135                                      23 
             unclassified Acidobacteria    unclassified Archaea (miscellaneous) 
                                     94                                      52 
  unclassified Bacteria (miscellaneous)                unclassified Caldiserica 
                                    197                                      11 
           unclassified Calditrichaeota              unclassified Elusimicrobia 
                                      2                                     101 
             unclassified Euryarchaeota                unclassified Nitrospirae 
                                    348                                      90 
            unclassified Proteobacteria            unclassified Rhodothermaeota 
                                    130                                       4 
              unclassified Spirochaetes              unclassified Synergistetes 
                                     37                                       3 
               unclassified Thermotogae                     uncultured archaeon 
                                      1                                       1 
            uncultured archaeon A07HB70             uncultured archaeon A07HN63 
                                      1                                       1 
            uncultured archaeon A07HR60             uncultured archaeon A07HR67 
                                      1                                       1 
                   uncultured bacterium                         Verrucomicrobia 
                                      1                                     296 
                     Zetaproteobacteria 
                                     41 

Note that when running the listGenomes() function for the first time, it might take a while until the function returns any results, because necessary information need to be downloaded from NCBI and ENSEMBL databases. All subsequent executions of listGenomes() will then respond very fast, because they will access the corresponding files stored on your hard drive.

Downloading Biological Sequences and Annotations

After checking for the availability of sequence information for an organism of interest, the next step is to download the corresponding genome, proteome, CDS, or GFF file. The following functions allow users to download proteomes, genomes, CDS and GFF files from several database resources such as: NCBI RefSeq, NCBI Genbank, ENSEMBL and ENSEMBLGENOMES. When a corresponding proteome, genome, CDS or GFF file was loaded to your hard-drive, a documentation *.txt file is generated storing File Name, Organism, Database, URL, DATE, assembly_accession, bioproject, biosample, taxid, version_status, release_type, seq_rel_date etc. information of the retrieved file. This way a better reproducibility of proteome, genome, CDS and GFF versions used for subsequent data analyses can be achieved.

Genome Retrieval

The easiest way to download a genome is to use the getGenome() function.

In this example we will download the genome of Homo sapiens.

The getGenome() function is an interface function to the NCBI RefSeq, NCBI Genbank, ENSEMBL and ENSEMBLGENOMES databases from which corresponding genomes can be retrieved.

The db argument specifies from which database genome assemblies in *.fasta file format shall be retrieved.

Options are:

Furthermore, users need to specify the scientific name, the taxid (= NCBI Taxnonomy identifier), or the accession identifier of the organism of interest for which a genome assembly shall be downloaded, e.g. organism = "Homo sapiens" or organism = "9606" or organism = "GCF_000001405.37". Finally, the path argument specifies the folder path in which the corresponding assembly shall be locally stored. In case users would like to store the genome file at a different location, they can specify the path = file.path("put","your","path","here") argument (e.g. file.path("_ncbi_downloads","genomes")).

Example NCBI RefSeq:

# download the genome of Homo sapiens from refseq
# and store the corresponding genome file in '_ncbi_downloads/genomes'
HS.genome.refseq <- getGenome( db       = "refseq", 
                               organism = "Homo sapiens",
                               path     = file.path("_ncbi_downloads","genomes") )

In this example, getGenome() creates a directory named '_ncbi_downloads/genomes' into which the corresponding genome named GCF_000001405.34_GRCh38.p8_genomic.fna.gz is downloaded. The return value of getGenome() is the folder path to the downloaded genome file that can then be used as input to the read_genome() function. The variable HS.genome.refseq stores the path to the downloaded genome.

Users can also omit the path argument if they wish to store the genome in their current working directory. E.g.:

# download the genome of Homo sapiens from refseq
# and store the corresponding genome file in '_ncbi_downloads/genomes'
HS.genome.refseq <- getGenome( db       = "refseq", 
                               organism = "Homo sapiens")

Subsequently, users can use the read_genome() function to import the genome into the R session. Users can choose to work with the genome sequence in R either as Biostrings object (obj.type = "Biostrings") or data.table object (obj.type = "data.table") by specifying the obj.type argument of the read_genome() function.

# import downloaded genome as Biostrings object
Human_Genome <- read_genome(file = HS.genome.refseq)
# look at the Biostrings object
Human_Genome
  A DNAStringSet instance of length 551
          width seq                                                names               
  [1] 248956422 NNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNN NC_000001.11 Homo...
  [2]    175055 GAATTCAGCTGAGAAGAACAGGCA...TGTTTGTCAGTCACATAGAATTC NT_187361.1 Homo ...
  [3]     32032 AGGGGTCTGCTTAGAGAGGGTCTC...TGACTTACGTTGACGTGGAATTC NT_187362.1 Homo ...
  [4]    127682 GATCGAGACTATCCTGGCTAACAC...ATTGTCAATTGGGACCTTTGATC NT_187363.1 Homo ...
  [5]     66860 GAATTCATTCGATGACGATTCCAT...AAAAAACTCTCAGCCACGAATTC NT_187364.1 Homo ...
  ...       ... ...
[547]    170148 TTTCTTTCTTTTTTTTTTTTTTGT...GTCACAGGACTCATGGGGAATTC NT_187685.1 Homo ...
[548]    215732 TGTGGTGAGGACCCTTAAGATCTA...GTCACAGGACTCATGGGGAATTC NT_187686.1 Homo ...
[549]    170537 TCTACTCTCCCATGCTTGCCTCGG...GTCACAGGACTCATGGGGAATTC NT_187687.1 Homo ...
[550]    177381 GATCTATCTGTATCTCCACAGGTG...GTCACAGGACTCATGGGGAATTC NT_113949.2 Homo ...
[551]     16569 GATCACAGGTCTATCACCCTATTA...CCCTTAAATAAGACATCACGATG NC_012920.1 Homo ...

Internally, a text file named doc_Homo_sapiens_db_refseq.txt is generated. The information stored in this log file is structured as follows:

File Name: Homo_sapiens_genomic_refseq.fna.gz
Organism Name: Homo_sapiens
Database: NCBI refseq
URL: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/
GCF_000001405.35_GRCh38.p9/GCF_000001405.35_GRCh38.p9_genomic.fna.gz
Download_Date: Sat Oct 22 12:41:07 2016
refseq_category: reference genome
assembly_accession: GCF_000001405.35
bioproject: PRJNA168
biosample: NA
taxid: 9606
infraspecific_name: NA
version_status: latest
release_type: Patch
genome_rep: Full
seq_rel_date: 2016-09-26
submitter: Genome Reference Consortium

In summary, the getGenome() and read_genome() functions allow users to retrieve genome assemblies by specifying the scientific name of the organism of interest and allow them to import the retrieved genome assembly e.g. as Biostrings object. Thus, users can then perform the Biostrings notation to work with downloaded genomes and can rely on the log file generated by getGenome() to better document the source and version of genome assemblies used for subsequent studies.

Alternatively, users can perform the pipeline logic of the magrittr package:

# install.packages("magrittr")
library(magrittr)
# import genome as Biostrings object
Human_Genome <- getGenome( db       = "refseq", 
                           organism = "Homo sapiens",
                           path     = file.path("_ncbi_downloads","genomes")) %>%
    read_genome()
Human_Genome
  A DNAStringSet instance of length 551
          width seq                                                names               
  [1] 248956422 NNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNN NC_000001.11 Homo...
  [2]    175055 GAATTCAGCTGAGAAGAACAGGCA...TGTTTGTCAGTCACATAGAATTC NT_187361.1 Homo ...
  [3]     32032 AGGGGTCTGCTTAGAGAGGGTCTC...TGACTTACGTTGACGTGGAATTC NT_187362.1 Homo ...
  [4]    127682 GATCGAGACTATCCTGGCTAACAC...ATTGTCAATTGGGACCTTTGATC NT_187363.1 Homo ...
  [5]     66860 GAATTCATTCGATGACGATTCCAT...AAAAAACTCTCAGCCACGAATTC NT_187364.1 Homo ...
  ...       ... ...
[547]    170148 TTTCTTTCTTTTTTTTTTTTTTGT...GTCACAGGACTCATGGGGAATTC NT_187685.1 Homo ...
[548]    215732 TGTGGTGAGGACCCTTAAGATCTA...GTCACAGGACTCATGGGGAATTC NT_187686.1 Homo ...
[549]    170537 TCTACTCTCCCATGCTTGCCTCGG...GTCACAGGACTCATGGGGAATTC NT_187687.1 Homo ...
[550]    177381 GATCTATCTGTATCTCCACAGGTG...GTCACAGGACTCATGGGGAATTC NT_113949.2 Homo ...
[551]     16569 GATCACAGGTCTATCACCCTATTA...CCCTTAAATAAGACATCACGATG NC_012920.1 Homo ...

Use taxid id for genome retrieval

Alternatively, instead of specifying the scientific name in the argument organism users can specify the taxonomy id of the corresponding organism. Here, we specify the taxonomy id 559292 which encodes the species Saccharomyces cerevisiae.

# install.packages("magrittr")
library(magrittr)
# import genome as Biostrings object
Scerevisiae_Genome <- getGenome(
            db       = "refseq",
            organism = "559292") %>%
    read_genome()
Scerevisiae_Genome
A DNAStringSet instance of length 17
       width seq                                      names               
 [1]  230218 CCACACCACACCCACACAC...GGGTGTGGTGTGTGTGGG NC_001133.9 Sacch...
 [2]  813184 AAATAGCCCTCATGTACGT...TGTGGTGTGTGGGTGTGT NC_001134.8 Sacch...
 [3]  316620 CCCACACACCACACCCACA...GTGGGTGTGGTGTGTGTG NC_001135.5 Sacch...
 [4] 1531933 ACACCACACCCACACCACA...GTAGTAAGTAGCTTTTGG NC_001136.10 Sacc...
 [5]  576874 CGTCTCCTCCAAGCCCTGT...ATTTTCATTTTTTTTTTT NC_001137.3 Sacch...
 ...     ... ...
[13]  924431 CCACACACACACCACACCC...TGTGGTGTGTGTGTGGGG NC_001145.3 Sacch...
[14]  784333 CCGGCTTTCTGACCGAAAT...TGTGGGTGTGGTGTGGGT NC_001146.8 Sacch...
[15] 1091291 ACACCACACCCACACCACA...TGTGTGGGTGTGGTGTGT NC_001147.6 Sacch...
[16]  948066 AAATAGCCCTCATGTACGT...TTTAATTTCGGTCAGAAA NC_001148.4 Sacch...
[17]   85779 TTCATAATTAATTTTTTAT...TATAATATAATATCCATA NC_001224.1 Sacch...

Use assembly_accession id for genome retrieval

Alternatively, instead of specifying the scientific name or taxonomy in the argument organism users can specify the assembly_accession id of the corresponding organism. Here, we specify the assembly_accession id GCF_000146045.2 which encodes the species Saccharomyces cerevisiae.

# install.packages("magrittr")
library(magrittr)
# import genome as Biostrings object
Scerevisiae_Genome <- getGenome(
            db       = "refseq",
            organism = "GCF_000146045.2") %>%
    read_genome()
Scerevisiae_Genome
 A DNAStringSet instance of length 17
       width seq                                                    names               
 [1]  230218 CCACACCACACCCACACACCCACACA...GTGGTGTGGGTGTGGTGTGTGTGGG NC_001133.9 Sacch...
 [2]  813184 AAATAGCCCTCATGTACGTCTCCTCC...GTGTGGGTGTGGTGTGTGGGTGTGT NC_001134.8 Sacch...
 [3]  316620 CCCACACACCACACCCACACCACACC...GGGTGTGGTGGGTGTGGTGTGTGTG NC_001135.5 Sacch...
 [4] 1531933 ACACCACACCCACACCACACCCACAC...AATAAAGGTAGTAAGTAGCTTTTGG NC_001136.10 Sacc...
 [5]  576874 CGTCTCCTCCAAGCCCTGTTGTCTCT...GGGTTTCATTTTCATTTTTTTTTTT NC_001137.3 Sacch...
 ...     ... ...
[13]  924431 CCACACACACACCACACCCACACCAC...GTGTGGGTGTGGTGTGTGTGTGGGG NC_001145.3 Sacch...
[14]  784333 CCGGCTTTCTGACCGAAATTAAAAAA...GGGTGTGTGTGGGTGTGGTGTGGGT NC_001146.8 Sacch...
[15] 1091291 ACACCACACCCACACCACACCCACAC...GAGAGAGTGTGTGGGTGTGGTGTGT NC_001147.6 Sacch...
[16]  948066 AAATAGCCCTCATGTACGTCTCCTCC...TTTTTTTTTTAATTTCGGTCAGAAA NC_001148.4 Sacch...
[17]   85779 TTCATAATTAATTTTTTATATATATA...GCTTAATTATAATATAATATCCATA NC_001224.1 Sacch...

Example NCBI Genbank:

Genome retrieval from NCBI Genbank.

# download the genome of Homo sapiens from Genbank
# and store the corresponding genome file in '_ncbi_downloads/genomes'
HS.genome.genbank <- getGenome( db       = "genbank", 
                                organism = "Homo sapiens",
                                path     = file.path("_ncbi_downloads","genomes") )
# import downloaded genome as Biostrings object
Human_Genome <- read_genome(file = HS.genome.genbank)
# look at the Biostrings object
Human_Genome
  A DNAStringSet instance of length 551
          width seq                                                names               
  [1] 248956422 NNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNN CM000663.2 Homo s...
  [2]    175055 GAATTCAGCTGAGAAGAACAGGCA...TGTTTGTCAGTCACATAGAATTC KI270706.1 Homo s...
  [3]     32032 AGGGGTCTGCTTAGAGAGGGTCTC...TGACTTACGTTGACGTGGAATTC KI270707.1 Homo s...
  [4]    127682 GATCGAGACTATCCTGGCTAACAC...ATTGTCAATTGGGACCTTTGATC KI270708.1 Homo s...
  [5]     66860 GAATTCATTCGATGACGATTCCAT...AAAAAACTCTCAGCCACGAATTC KI270709.1 Homo s...
  ...       ... ...
[547]    170148 TTTCTTTCTTTTTTTTTTTTTTGT...GTCACAGGACTCATGGGGAATTC KI270931.1 Homo s...
[548]    215732 TGTGGTGAGGACCCTTAAGATCTA...GTCACAGGACTCATGGGGAATTC KI270932.1 Homo s...
[549]    170537 TCTACTCTCCCATGCTTGCCTCGG...GTCACAGGACTCATGGGGAATTC KI270933.1 Homo s...
[550]    177381 GATCTATCTGTATCTCCACAGGTG...GTCACAGGACTCATGGGGAATTC GL000209.2 Homo s...
[551]     16569 GATCACAGGTCTATCACCCTATTA...CCCTTAAATAAGACATCACGATG J01415.2 Homo sap...

Use taxonomy id for genome retrieval

Alternatively, instead of specifying the scientific name in the argument organism users can specify the taxonomy id of the corresponding organism. Here, we specify the taxonomy id 559292 which encodes the species Saccharomyces cerevisiae.

# install.packages("magrittr")
library(magrittr)
# import genome as Biostrings object
Scerevisiae_Genome <- getGenome(
            db       = "genbank",
            organism = "559292") %>%
    read_genome()
Scerevisiae_Genome
A DNAStringSet instance of length 16
       width seq                                names               
 [1]  230218 CCACACCACACCCACA...TGTGGTGTGTGTGGG BK006935.2 TPA_in...
 [2]  813184 AAATAGCCCTCATGTA...GGTGTGTGGGTGTGT BK006936.2 TPA_in...
 [3]  316620 CCCACACACCACACCC...GGTGTGGTGTGTGTG BK006937.2 TPA_in...
 [4] 1531933 ACACCACACCCACACC...GTAAGTAGCTTTTGG BK006938.2 TPA_in...
 [5]  576874 CGTCTCCTCCAAGCCC...TTCATTTTTTTTTTT BK006939.2 TPA_in...
 ...     ... ...
[12] 1078177 CACACACACACACCAC...ACATGAGGGCTATTT BK006945.2 TPA_in...
[13]  924431 CCACACACACACCACA...GGTGTGTGTGTGGGG BK006946.2 TPA_in...
[14]  784333 CCGGCTTTCTGACCGA...GGGTGTGGTGTGGGT BK006947.3 TPA_in...
[15] 1091291 ACACCACACCCACACC...GTGGGTGTGGTGTGT BK006948.2 TPA_in...
[16]  948066 AAATAGCCCTCATGTA...AATTTCGGTCAGAAA BK006949.2 TPA_in...

Use assembly_accession id for genome retrieval

Alternatively, instead of specifying the scientific name or taxonomy in the argument organism users can specify the assembly_accession id of the corresponding organism. Here, we specify the assembly_accession id GCA_000146045.2 which encodes the species Saccharomyces cerevisiae.

# install.packages("magrittr")
library(magrittr)
# import genome as Biostrings object
Scerevisiae_Genome <- getGenome(
            db       = "genbank",
            organism = "GCA_000146045.2") %>%
    read_genome()
Scerevisiae_Genome
A DNAStringSet instance of length 16
       width seq                                  names               
 [1]  230218 CCACACCACACCCACAC...GTGTGGTGTGTGTGGG BK006935.2 TPA_in...
 [2]  813184 AAATAGCCCTCATGTAC...TGGTGTGTGGGTGTGT BK006936.2 TPA_in...
 [3]  316620 CCCACACACCACACCCA...GGGTGTGGTGTGTGTG BK006937.2 TPA_in...
 [4] 1531933 ACACCACACCCACACCA...AGTAAGTAGCTTTTGG BK006938.2 TPA_in...
 [5]  576874 CGTCTCCTCCAAGCCCT...TTTCATTTTTTTTTTT BK006939.2 TPA_in...
 ...     ... ...
[12] 1078177 CACACACACACACCACC...TACATGAGGGCTATTT BK006945.2 TPA_in...
[13]  924431 CCACACACACACCACAC...TGGTGTGTGTGTGGGG BK006946.2 TPA_in...
[14]  784333 CCGGCTTTCTGACCGAA...TGGGTGTGGTGTGGGT BK006947.3 TPA_in...
[15] 1091291 ACACCACACCCACACCA...TGTGGGTGTGGTGTGT BK006948.2 TPA_in...
[16]  948066 AAATAGCCCTCATGTAC...TAATTTCGGTCAGAAA BK006949.2 TPA_in...

Example ENSEMBL:

# download the genome of Homo sapiens from ENSEMBL
# and store the corresponding genome file in '_ncbi_downloads/genomes'
HS.genome.ensembl <- getGenome( db       = "ensembl", 
                                organism = "Homo sapiens",
                                path     = file.path("_ncbi_downloads","genomes") )
# import downloaded genome as Biostrings object
Human_Genome <- read_genome(file = HS.genome.ensembl)
# look at the Biostrings object
Human_Genome
  A DNAStringSet instance of length 524
          width seq                                                names               
  [1] 248956422 NNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNN 1 dna:chromosome ...
  [2] 242193529 NNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNN 2 dna:chromosome ...
  [3] 198295559 NNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNN 3 dna:chromosome ...
  [4] 190214555 NNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNN 4 dna:chromosome ...
  [5] 181538259 NNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNN 5 dna:chromosome ...
  ...       ... ...
[520]       993 GCCCCACGTCCGGGAGGGAGGTGG...GAGGGAGGTGGGGGGTCAGCCCT KI270539.1 dna:sc...
[521]       990 TTTCATAGAGCATGTTTGAAACAC...TCAGAAACTTGTTGTGATGTGTG KI270385.1 dna:sc...
[522]       981 AGATTTCGTTGGAACGGGATAAAC...TCAGCTTTCAAACACTCTTTTTG KI270423.1 dna:sc...
[523]       971 ATTTGCGATGTGTGTTCTCAACTA...TTGGATAGCTTTGAAGTTTTCGT KI270392.1 dna:sc...
[524]       970 AAGTGGATATTTGGATAGCTTTGA...TCCTCAATAACAGAGTTGAACCT KI270394.1 dna:sc...

Use taxonomy id for genome retrieval

Alternatively, instead of specifying the scientific name in the argument organism users can specify the taxonomy id of the corresponding organism. Here, we specify the taxonomy id 4932 which encodes the species Saccharomyces cerevisiae.

# install.packages("magrittr")
library(magrittr)
# import genome as Biostrings object
Scerevisiae_Proteome <- getProteome(
            db       = "ensembl",
            organism = "4932") %>%
    read_proteome()
Scerevisiae_Proteome
  A AAStringSet instance of length 6692
       width seq                                names               
   [1]    61 MFSELINFQNEGHECQ...GNKSEETKKSCCSGK YHR055C pep chrom...
   [2]   657 MSDNGSPAVLPKTEFN...DESKEFQNSDIADLY YPR161C pep chrom...
   [3]  1341 MSLSPHVENASIPKGS...NEQECPGGCPGVAFI YOL138C pep chrom...
   [4]   944 MVQEQAILSCIEQTMV...KRDSLQVILEFVSQH YDR395W pep chrom...
   [5]   215 MDFYKLDEKLKELKRK...KQFNEKLSRESKGSE YGR129W pep chrom...
   ...   ... ...
[6688]   869 MTESAIDDQRFNLTKE...EQYIFCTLFIHTQIL YPL268W pep chrom...
[6689]   166 MSRYGKNLVHYIIVEH...TSQTRRVAARKGKLS YEL035C pep chrom...
[6690]   106 MSTLLKSAKSIVPLMD...ILFRDAEILAKIAKD YOR020C pep chrom...
[6691]   152 MWFPQIIAGMAAGGAA...TYVNAEKAAGQKKED YLL053C pep chrom...
[6692]   692 MAKRLLESSQNDQANR...EYGNVGNNEETLDDH YDR314C pep chrom...

Use assembly_accession id for genome retrieval

Alternatively, instead of specifying the scientific name or taxonomy in the argument organism users can specify the assembly_accession id of the corresponding organism. Here, we specify the assembly_accession id GCA_000146045.2 which encodes the species Saccharomyces cerevisiae.

# install.packages("magrittr")
library(magrittr)
# import genome as Biostrings object
Scerevisiae_Genome <- getGenome(
            db       = "ensembl",
            organism = "GCA_000146045.2") %>%
    read_genome()
Scerevisiae_Genome
  A DNAStringSet instance of length 17
       width seq                                names               
 [1]  230218 CCACACCACACCCACA...TGTGGTGTGTGTGGG I dna:chromosome ...
 [2]  813184 AAATAGCCCTCATGTA...GGTGTGTGGGTGTGT II dna:chromosome...
 [3]  316620 CCCACACACCACACCC...GGTGTGGTGTGTGTG III dna:chromosom...
 [4] 1531933 ACACCACACCCACACC...GTAAGTAGCTTTTGG IV dna:chromosome...
 [5]  439888 CACACACACCACACCC...GTGTGGTGTGTGTGT IX dna:chromosome...
 ...     ... ...
[13] 1078177 CACACACACACACCAC...ACATGAGGGCTATTT XII dna:chromosom...
[14]  924431 CCACACACACACCACA...GGTGTGTGTGTGGGG XIII dna:chromoso...
[15]  784333 CCGGCTTTCTGACCGA...GGGTGTGGTGTGGGT XIV dna:chromosom...
[16] 1091291 ACACCACACCCACACC...GTGGGTGTGGTGTGT XV dna:chromosome...
[17]  948066 AAATAGCCCTCATGTA...AATTTCGGTCAGAAA XVI dna:chromosom...

Example ENSEMBLGENOMES:

Due to the unavailability of "Homo sapiens" at db = "ensemblgenomes" here we choose "Arabidopsis thaliana" as example.

# download the genome of Arabidopsis thaliana from ENSEMBLGENOMES
# and store the corresponding genome file in '_ncbi_downloads/genomes'
AT.genome.ensemblgenomes <- getGenome( db       = "ensemblgenomes", 
                                       organism = "Arabidopsis thaliana",
                                       path     = file.path("_ncbi_downloads","genomes") )
# import downloaded genome as Biostrings object
Cress_Genome <- read_genome(file = AT.genome.ensemblgenomes)
# look at the Biostrings object
Cress_Genome
  A DNAStringSet instance of length 7
       width seq                                                   names               
[1] 30427671 CCCTAAACCCTAAACCCTAAACCCT...AGGGTTTAGGGTTTAGGGTTTAGGG 1 dna:chromosome ...
[2] 19698289 NNNNNNNNNNNNNNNNNNNNNNNNN...AGGGTTTAGGGTTTAGGGTTTAGGG 2 dna:chromosome ...
[3] 23459830 NNNNNNNNNNNNNNNNNNNNNNNNN...ACCCTAAACCCTAAACCCTAAACCC 3 dna:chromosome ...
[4] 18585056 NNNNNNNNNNNNNNNNNNNNNNNNN...AAGGGTTTAGGGTTTAGGGTTTAGG 4 dna:chromosome ...
[5] 26975502 TATACCATGTACCCTCAACCTTAAA...GTTTAGGATTTAGGGTTTTTAGATC 5 dna:chromosome ...
[6]   366924 GGATCCGTTCGAAACAGGTTAGCCT...TCGCAGAATGGAAACAAACCGGATT Mt dna:chromosome...
[7]   154478 ATGGGCGAACGACGGGAATTGAACC...TCATAATAACTTGGTCCCGGGCATC Pt dna:chromosome...

Proteome Retrieval

The getProteome() function is an interface function to the NCBI RefSeq, NCBI Genbank, ENSEMBL, ENSEMBLGENOMES, and UniProt databases from which corresponding proteomes can be retrieved. It works analogous to getGenome().

The db argument specifies from which database proteomes in *.fasta file format shall be retrieved.

Options are:

Furthermore, again users need to specify the scientific name of the organism of interest for which a proteomes shall be downloaded, e.g. organism = "Homo sapiens". Finally, the path argument specifies the folder path in which the corresponding proteome shall be locally stored. In case users would like to store the proteome file at a different location, they can specify the path = file.path("put","your","path","here") argument.

Example NCBI RefSeq:

# download the proteome of Homo sapiens from refseq
# and store the corresponding proteome file in '_ncbi_downloads/proteomes'
HS.proteome.refseq <- getProteome( db       = "refseq", 
                                   organism = "Homo sapiens",
                                   path     = file.path("_ncbi_downloads","proteomes"))

In this example, getProteome() creates a directory named '_ncbi_downloads/proteomes' into which the corresponding genome named GCF_000001405.34_GRCh38.p8_protein.faa.gz is downloaded. The return value of getProteome() is the folder path to the downloaded proteome file that can then be used as input to the read_proteome() function. The variable HS.proteome.refseq stores the path to the downloaded proteome. Subsequently, users can use the read_proteome() function to import the proteome into the R session. Users can choose to work with the proteome sequence in R either as Biostrings object (obj.type = "Biostrings") or data.table object (obj.type = "data.table") by specifying the obj.type argument of the read_proteome() function.

# import proteome as Biostrings object
Human_Proteome <- read_proteome(file = HS.proteome.refseq)
Human_Proteome
A AAStringSet instance of length 113620
         width seq                          names               
     [1]  1474 MGKNKLLHPSLVL...YNAPCSKDLGNA NP_000005.2 alpha...
     [2]   290 MDIEAYFERIGYK...LVPKPGDGSLTI NP_000006.2 aryla...
     [3]   421 MAAGFGRCCRVLR...IVAREHIDKYKN NP_000007.1 mediu...
     [4]   412 MAAALLARASGPA...VIAGHLLRSYRS NP_000008.1 short...
     [5]   655 MQAARMAASLGRQ...RGGVVTSNPLGF NP_000009.1 very ...
     ...   ... ...
[113616]    98 MPLIYMNIMLAFT...LDYVHNLNLLQC YP_003024034.1 NA...
[113617]   459 MLKLIVPTIMLLP...SLNPDIITGFSS YP_003024035.1 NA...
[113618]   603 MTMHTTMTTLTLT...FFPLILTLLLIT YP_003024036.1 NA...
[113619]   174 MMYALFLLSVGLV...GVYIVIEIARGN YP_003024037.1 NA...
[113620]   380 MTPMRKTNPLMKL...ISLIENKMLKWA YP_003024038.1 cy...

Alternatively, users can perform the pipeline logic of the magrittr package:

# install.packages("magrittr")
library(magrittr)
# import proteome as Biostrings object
Human_Proteome <- getProteome( db       = "refseq", 
                               organism = "Homo sapiens",
                               path     = file.path("_ncbi_downloads","proteomes")) %>%
    read_proteome()
Human_Proteome
 A AAStringSet instance of length 113620
         width seq                          names               
     [1]  1474 MGKNKLLHPSLVL...YNAPCSKDLGNA NP_000005.2 alpha...
     [2]   290 MDIEAYFERIGYK...LVPKPGDGSLTI NP_000006.2 aryla...
     [3]   421 MAAGFGRCCRVLR...IVAREHIDKYKN NP_000007.1 mediu...
     [4]   412 MAAALLARASGPA...VIAGHLLRSYRS NP_000008.1 short...
     [5]   655 MQAARMAASLGRQ...RGGVVTSNPLGF NP_000009.1 very ...
     ...   ... ...
[113616]    98 MPLIYMNIMLAFT...LDYVHNLNLLQC YP_003024034.1 NA...
[113617]   459 MLKLIVPTIMLLP...SLNPDIITGFSS YP_003024035.1 NA...
[113618]   603 MTMHTTMTTLTLT...FFPLILTLLLIT YP_003024036.1 NA...
[113619]   174 MMYALFLLSVGLV...GVYIVIEIARGN YP_003024037.1 NA...
[113620]   380 MTPMRKTNPLMKL...ISLIENKMLKWA YP_003024038.1 cy...

Example NCBI Genbank:

# download the proteome of Homo sapiens from genbank
# and store the corresponding proteome file in '_ncbi_downloads/proteomes'
HS.proteome.genbank <- getProteome( db       = "genbank", 
                                   organism = "Homo sapiens",
                                   path     = file.path("_ncbi_downloads","proteomes"))
# import proteome as Biostrings object
Human_Proteome <- read_proteome(file = HS.proteome.genbank)
Human_Proteome
A AAStringSet instance of length 13
     width seq                              names               
 [1]   318 MPMANLLLLIVPILI...VSMPITISSIPPQT AAB58943.1 NADH d...
 [2]   347 MNPLAQPVIYSTIFA...TLLLPISPFMLMIL AAB58944.1 NADH d...
 [3]   513 MFADRWLFSTNHKDI...PPYHTFEEPVYMKS AAB58945.1 cytoch...
 [4]   227 MAHAAQVGLQDATSP...IPLKIFEMGPVFTL AAB58946.1 cytoch...
 [5]    68 MPQLNTTVWPTMITP...WTKICSLHSLPPQS AAB58947.1 ATPase...
 ...   ... ...
 [9]    98 MPLIYMNIMLAFTIS...YGLDYVHNLNLLQC AAB58951.1 NADH d...
[10]   459 MLKLIVPTIMLLPLT...LLSLNPDIITGFSS AAB58952.1 NADH d...
[11]   603 MTMHTTMTTLTLTSL...SFFFPLILTLLLIT AAB58953.1 NADH d...
[12]   174 MMYALFLLSVGLVMG...FVGVYIVIEIARGN AAB58954.1 NADH d...
[13]   380 MTPMRKTNPLMKLIN...PTISLIENKMLKWA AAB58955.3 cytoch...

Example ENSEMBL:

# download the proteome of Homo sapiens from ENSEMBL
# and store the corresponding proteome file in '_ncbi_downloads/proteomes'
HS.proteome.ensembl <- getProteome( db       = "ensembl", 
                                   organism = "Homo sapiens",
                                   path     = file.path("_ncbi_downloads","proteomes"))
# import proteome as Biostrings object
Human_Proteome <- read_proteome(file = HS.proteome.ensembl)
Human_Proteome
  A AAStringSet instance of length 107844
         width seq                          names               
     [1]     3 PSY                          ENSP00000451515.1...
     [2]     4 TGGY                         ENSP00000452494.1...
     [3]     2 EI                           ENSP00000451042.1...
     [4]     4 GTGG                         ENSP00000487941.1...
     [5]     4 GTGG                         ENSP00000488240.1...
     ...   ... ...
[107840]   135 MLQKKIEEKDLKV...LNHICKVPLAIK ENSP00000495237.1...
[107841]   166 MEHAFTPLEPLLS...IIQEESLIYLLQ ENSP00000496198.1...
[107842]    42 MEVPTAYMISPKE...GLKSENTMLLRC ENSP00000495723.1...
[107843]   508 MPSMLERISKNLV...GTLSLLQQLAEA ENSP00000496548.1...
[107844]   508 MPSMLERISKNLV...GTLSLLQQLAEA ENSP00000494855.1...

Example ENSEMBLGENOMES:

Due to the unavailability of "Homo sapiens" at db = "ensemblgenomes" here we choose "Arabidopsis thaliana" as example.

# download the proteome of Arabidopsis thaliana from ENSEMBLGENOMES
# and store the corresponding proteome file in '_ncbi_downloads/proteomes'
AT.proteome.ensemblgenomes <- getProteome( db       = "ensemblgenomes", 
                                           organism = "Arabidopsis thaliana",
                                           path     = file.path("_ncbi_downloads","proteomes"))
# import proteome as Biostrings object
Cress_Proteome <- read_proteome(file = AT.proteome.ensemblgenomes)
Cress_Proteome
  A AAStringSet instance of length 48321
        width seq                           names               
    [1]   924 MMPKRFNTSGFDT...YEKIFDLAFNYDH AT3G05780.1 pep c...
    [2]    61 MDSGLSWADQWDY...GFKWMKLGKKSDK AT1G24405.1 pep c...
    [3]   232 MRIVGLTGGIASG...LGLSVCKQLKIGS AT2G27490.4 pep c...
    [4]   232 MRIVGLTGGIASG...LGLSVCKQLKIGS AT2G27490.1 pep c...
    [5]   232 MRIVGLTGGIASG...LGLSVCKQLKIGS AT2G27490.3 pep c...
    ...   ... ...
[48317]   305 MKDHERAILEERI...DKTEEAGNKKIRL AT1G30814.1 pep c...
[48318]   321 MRAAINRANSLGG...KNGFEMAKYWRSQ AT2G43120.1 pep c...
[48319]   372 MAHNQFKDSLDPL...KNGFEMAKYWRSQ AT2G43120.2 pep c...
[48320]   345 MTATNKQVILKDY...KNVGKQVVVVARE AT5G16970.1 pep c...
[48321]   122 MATNACKFLCLVL...FDLKVVDGYIACS AT4G32100.1 pep c...

Example Retrieval Uniprot:

Another way of retrieving proteome sequences is from the UniProt database.

# download the proteome of Mus musculus from UniProt
# and store the corresponding proteome file in '_uniprot_downloads/proteomes'
Mm.proteome.uniprot<- getProteome( db       = "uniprot", 
                                   organism = "Mus musculus",
                                   path     = file.path("_uniprot_downloads","proteomes"))
# import proteome as Biostrings object
Mouse_Proteome <- read_proteome(file = Mm.proteome.uniprot)
Mouse_Proteome
A AAStringSet instance of length 84522
        width seq                                  names               
    [1]   781 MATQADLMELDMAMEPD...LPPGDSNQLAWFDTDL sp|Q02248|CTNB1_M...
    [2]  2531 MPRLLTPLLCLTLLPAL...PTTMPSQITHIPEAFK sp|Q01705|NOTC1_M...
    [3]   437 MLLLLARCFLVILASSL...LDSETMHPLGMAVKSS sp|Q62226|SHH_MOU...
    [4]   380 MKKPIGILSPGVALGTA...VKCKKCTEIVDQFVCK sp|P22725|WNT5A_M...
    [5]   387 MEESQSDISLELPLSQE...SRHKKTMVKKVGPDSD sp|P02340|P53_MOU...
    ...   ... ...
[84518]   459 MLKIILPSLMLLPLTWL...LILLTTSPKLITGLTM tr|A0A0F6PZ84|A0A...
[84519]   380 MTNMRKTHPLFKIINHS...LMPISGIIEDKMLKLY tr|U3TEV9|U3TEV9_...
[84520]   381 MTNMRKTHPLFKIINHS...MPISGIIEDKMLKLYP tr|A0A0F6PXN8|A0A...
[84521]   172 MNNYIFVLSSLFLVGCL...WSLFAGIFIIIEITRD tr|A0A0F6PXK9|A0A...
[84522]   381 MTNMRKTHPLFKIINHS...MPISGIIEDKMLKLYP tr|A0A0F6PYR5|A0A...

CDS Retrieval

The getCDS() function is an interface function to the NCBI RefSeq, NCBI Genbank, ENSEMBL and ENSEMBLGENOMES databases from which corresponding CDS files can be retrieved. It works analogous to getGenome() and getProteome().

The db argument specifies from which database proteomes in *.fasta file format shall be retrieved.

Options are:

Furthermore, again users need to specify the scientific name of the organism of interest for which a proteomes shall be downloaded, e.g. organism = "Homo sapiens". Finally, the path argument specifies the folder path in which the corresponding CDS file shall be locally stored. In case users would like to store the CDS file at a different location, they can specify the path = file.path("put","your","path","here") argument.

Example NCBI RefSeq:

# download the genome of Homo sapiens from refseq
# and store the corresponding genome CDS file in '_ncbi_downloads/CDS'
HS.cds.refseq <- getCDS( db       = "refseq", 
                         organism = "Homo sapiens",
                         path     = file.path("_ncbi_downloads","CDS"))

In this example, getCDS() creates a directory named '_ncbi_downloads/CDS' into which the corresponding genome named Homo_sapiens_cds_from_genomic_refseq.fna.gz is downloaded. The return value of getCDS() is the folder path to the downloaded genome file that can then be used as input to the read_cds() function. The variable HS.cds.refseq stores the path to the downloaded CDS file. Subsequently, users can use the read_cds() function to import the genome into the R session. Users can choose to work with the genome sequence in R either as Biostrings object (obj.type = "Biostrings") or data.table object (obj.type = "data.table") by specifying the obj.type argument of the read_cds() function.

# import downloaded CDS as Biostrings object
Human_CDS <- read_cds(file     = HS.cds.refseq, 
                      obj.type = "Biostrings")
# look at the Biostrings object
Human_CDS
 A BStringSet instance of length 114967
          width seq                                                names               
     [1]    918 ATGGTGACTGAATTCATTTTTCTG...CACATTCTAGTGTAAAGTTTTAG lcl|NC_000001.11_...
     [2]    402 ATGAGTGACAGCATCAACTTCTCT...CAGGACCCAGGCACAGGCATTAG lcl|NC_000001.11_...
     [3]    402 ATGAGTGACAGCATCAACTTCTCT...CAGGACCCAGGCACAGGCATTAG lcl|NC_000001.11_...
     [4]    402 ATGAGTGACAGCATCAACTTCTCT...CAGGACCCAGGCACAGGCATTAG lcl|NC_000001.11_...
     [5]    402 ATGAGTGACAGCATCAACTTCTCT...CAGGACCCAGGCACAGGCATTAG lcl|NC_000001.11_...
     ...    ... ...
[114963]    297 ATGCCCCTCATTTACATAAATATT...ACCTAAACCTACTCCAATGCTAA lcl|NC_012920.1_c...
[114964]   1378 ATGCTAAAACTAATCGTCCCAACA...CATCATTACCGGGTTTTCCTCTT lcl|NC_012920.1_c...
[114965]   1812 ATAACCATGCACACTACTATAACC...TAACCCTACTCCTAATCACATAA lcl|NC_012920.1_c...
[114966]    525 ATGATGTATGCTTTGTTTCTGTTG...TTGAGATTGCTCGGGGGAATAGG lcl|NC_012920.1_c...
[114967]   1141 ATGACCCCAATACGCAAAACTAAC...AAACAAAATACTCAAATGGGCCT lcl|NC_012920.1_c...

Internally, a text file named doc_Homo_sapiens_db_refseq.txt is generated. The information stored in this log file is structured as follows:

File Name: Homo_sapiens_cds_from_genomic_refseq.fna.gz
Organism Name: Homo_sapiens
Database: NCBI refseq
URL: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/
GCF_000001405.35_GRCh38.p9/GCF_000001405.35_GRCh38.p9_cds_from_genomic.fna.gz
Download_Date: Sun Oct 23 17:19:05 2016
refseq_category: reference genome
assembly_accession: GCF_000001405.35
bioproject: PRJNA168
biosample: NA
taxid: 9606
infraspecific_name: NA
version_status: latest
release_type: Patch
genome_rep: Full
seq_rel_date: 2016-09-26
submitter: Genome Reference Consortium

In summary, the getCDS() and read_cds() functions allow users to retrieve CDS files by specifying the scientific name of the organism of interest and allow them to import the retrieved CDS file e.g. as Biostrings object. Thus, users can then perform the Biostrings notation to work with downloaded CDS and can rely on the log file generated by getCDS() to better document the source and version of CDS used for subsequent studies.

Alternatively, users can perform the pipeline logic of the magrittr package:

# install.packages("magrittr")
library(magrittr)
# import CDS as Biostrings object
Human_CDS <- getCDS( db       = "refseq", 
                     organism = "Homo sapiens",
                     path     = file.path("_ncbi_downloads","CDS")) %>%
    read_cds(obj.type = "Biostrings")
Human_CDS
 A BStringSet instance of length 114967
          width seq                                                names               
     [1]    918 ATGGTGACTGAATTCATTTTTCTG...CACATTCTAGTGTAAAGTTTTAG lcl|NC_000001.11_...
     [2]    402 ATGAGTGACAGCATCAACTTCTCT...CAGGACCCAGGCACAGGCATTAG lcl|NC_000001.11_...
     [3]    402 ATGAGTGACAGCATCAACTTCTCT...CAGGACCCAGGCACAGGCATTAG lcl|NC_000001.11_...
     [4]    402 ATGAGTGACAGCATCAACTTCTCT...CAGGACCCAGGCACAGGCATTAG lcl|NC_000001.11_...
     [5]    402 ATGAGTGACAGCATCAACTTCTCT...CAGGACCCAGGCACAGGCATTAG lcl|NC_000001.11_...
     ...    ... ...
[114963]    297 ATGCCCCTCATTTACATAAATATT...ACCTAAACCTACTCCAATGCTAA lcl|NC_012920.1_c...
[114964]   1378 ATGCTAAAACTAATCGTCCCAACA...CATCATTACCGGGTTTTCCTCTT lcl|NC_012920.1_c...
[114965]   1812 ATAACCATGCACACTACTATAACC...TAACCCTACTCCTAATCACATAA lcl|NC_012920.1_c...
[114966]    525 ATGATGTATGCTTTGTTTCTGTTG...TTGAGATTGCTCGGGGGAATAGG lcl|NC_012920.1_c...
[114967]   1141 ATGACCCCAATACGCAAAACTAAC...AAACAAAATACTCAAATGGGCCT lcl|NC_012920.1_c...

Example NCBI Genbank:

# download the genome of Homo sapiens from genbank
# and store the corresponding genome CDS file in '_ncbi_downloads/CDS'
HS.cds.genbank <- getCDS( db       = "genbank", 
                         organism = "Homo sapiens",
                         path     = file.path("_ncbi_downloads","CDS"))
# import downloaded CDS as Biostrings object
Human_CDS <- read_cds(file     = HS.cds.genbank, 
                      obj.type = "Biostrings")
# look at the Biostrings object
Human_CDS
  A BStringSet instance of length 13
     width seq                                                     names               
 [1]   956 ATACCCATGGCCAACCTCCTACTCCT...ATCTCCAGCATTCCCCCTCAAACCTA lcl|J01415.2_cds_...
 [2]  1042 ATTAATCCCCTGGCCCAACCCGTCAT...CTCCCCTTTTATACTAATAATCTTAT lcl|J01415.2_cds_...
 [3]  1542 ATGTTCGCCGACCGTTGACTATTCTC...AAGAACCCGTATACATAAAATCTAGA lcl|J01415.2_cds_...
 [4]   684 ATGGCACATGCAGCGCAAGTAGGTCT...AAATAGGGCCCGTATTTACCCTATAG lcl|J01415.2_cds_...
 [5]   207 ATGCCCCAACTAAATACTACCGTATG...TTCATTCATTGCCCCCACAATCCTAG lcl|J01415.2_cds_...
 ...   ... ...
 [9]   297 ATGCCCCTCATTTACATAAATATTAT...ATAACCTAAACCTACTCCAATGCTAA lcl|J01415.2_cds_...
[10]  1378 ATGCTAAAACTAATCGTCCCAACAAT...CGACATCATTACCGGGTTTTCCTCTT lcl|J01415.2_cds_...
[11]  1812 ATAACCATGCACACTACTATAACCAC...TCCTAACCCTACTCCTAATCACATAA lcl|J01415.2_cds_...
[12]   525 ATGATGTATGCTTTGTTTCTGTTGAG...TAATTGAGATTGCTCGGGGGAATAGG lcl|J01415.2_cds_...
[13]  1141 ATGACCCCAATACGCAAAACTAACCC...TGAAAACAAAATACTCAAATGGGCCT lcl|J01415.2_cds_...

Example ENSEMBL:

# download the genome of Homo sapiens from ensembl
# and store the corresponding genome CDS file in '_ncbi_downloads/CDS'
HS.cds.ensembl <- getCDS( db       = "ensembl", 
                         organism = "Homo sapiens",
                         path     = file.path("_ncbi_downloads","CDS"))
# import downloaded CDS as Biostrings object
Human_CDS <- read_cds(file     = HS.cds.ensembl, 
                      obj.type = "Biostrings")
# look at the Biostrings object
Human_CDS
  A BStringSet instance of length 102915
          width seq                                                names               
     [1]     13 ACTGGGGGATACG                                      ENST00000448914.1...
     [2]     12 GGGACAGGGGGC                                       ENST00000631435.1...
     [3]     12 GGGACAGGGGGC                                       ENST00000632684.1...
     [4]      9 CCTTCCTAC                                          ENST00000434970.2...
     [5]      8 GAAATAGT                                           ENST00000415118.1...
     ...    ... ...
[102911]   1665 ATGCTACTGCCACTGCTGCTGTCC...ATGCAGAAGTCAAGTTCCAATGA ENST00000436984.6...
[102912]   1920 ATGCTACTGCCACTGCTGCTGTCC...ATGCAGAAGTCAAGTTCCAATGA ENST00000439889.6...
[102913]   2094 ATGCTACTGCCACTGCTGCTGTCC...ATGCAGAAGTCAAGTTCCAATGA ENST00000339313.9...
[102914]    466 ATGCTACTGCCACTGCTGCTGTCC...AGCCCTGGACCTCTCTGTGCAGT ENST00000529627.1...
[102915]    559 ATGCGGAGATGCTACTGCCACTGC...CCTCACCTGCCATGTGGACTTCT ENST00000530476.1...

Example ENSEMBLGENOMES:

Due to the inavailability of "Homo sapiens" at db = "ensemblgenomes" here we choose "Arabidopsis thaliana" as example.

# download the genome of Homo sapiens from ensemblgenomes
# and store the corresponding genome CDS file in '_ncbi_downloads/CDS'
AT.cds.ensemblgenomes <- getCDS( db       = "ensemblgenomes", 
                                 organism = "Arabidopsis thaliana",
                                 path     = file.path("_ncbi_downloads","CDS"))
# import downloaded CDS as Biostrings object
Cress_CDS <- read_cds(file     = AT.cds.ensemblgenomes, 
                      obj.type = "Biostrings")
# look at the Biostrings object
Cress_CDS
  A BStringSet instance of length 35386
        width seq                                                  names               
    [1]  1248 ATGGGGAGAGATGAAACAGAGACGT...ACTACTCATATTATGCCTTTTTGA AT3G18710.1 cds:k...
    [2]  2568 ATGGCAACTGAGAATCCTATTAGGA...GAAAACCAAGAATTGAGGAGATGA AT4G25880.2 cds:k...
    [3]  2577 ATGGCAACTGAGAATCCTATTAGGA...TTCCCCAATAAAACCAAGAATTGA AT4G25880.3 cds:k...
    [4]  2586 ATGGCAACTGAGAATCCTATTAGGA...GAAAACCAAGAATTGAGGAGATGA AT4G25880.1 cds:k...
    [5]  1077 ATGACAAAGGCTTATTCAACACGTG...GAGGAAGCTATTTCCATGATCTAA AT1G71695.1 cds:k...
    ...   ... ...
[35382]  1125 ATGCATTCGCGCTCCGCCTTGCTCT...GCTTCTCCTTCTACATCCCCGTAG AT2G20860.1 cds:k...
[35383]  1179 ATGGCAGACAATTTGAATTTGGTGA...TTGGGCCTCGCCGCCTATTATTAG AT3G14210.1 cds:k...
[35384]  1488 ATGGCCTCTCTTCTCTCTCCCGCCA...TCAAAGCTCAGTTTCAGCAAGTGA AT5G01920.1 cds:k...
[35385]  1689 ATGAGTTTAACAAAGAAAGCAAGTG...GAACCACAGGCCGGTCTCCTTAGA AT2G26280.1 cds:k...
[35386]  1362 ATGGGTGTTTCTCTTCTGAAACAAC...GGTTTGGCCAATAATGTTATCTAG AT4G32600.1 cds:k...

RNA Retrieval

The getRNA() function is an interface function to the NCBI RefSeq, NCBI Genbank, ENSEMBL and ENSEMBLGENOMES databases from which corresponding RNA files can be retrieved. It works analogous to getGenome(), getProteome(), and getCDS().

The db argument specifies from which database proteomes in *.fasta file format shall be retrieved.

Options are:

Furthermore, again users need to specify the scientific name of the organism of interest for which a proteomes shall be downloaded, e.g. organism = "Homo sapiens". Finally, the path argument specifies the folder path in which the corresponding RNA file shall be locally stored. In case users would like to store the RNA file at a different location, they can specify the path = file.path("put","your","path","here") argument.

Example NCBI RefSeq:

# download the RNA of Homo sapiens from refseq
# and store the corresponding RNA file in '_ncbi_downloads/RNA'
HS.rna.refseq <- getRNA( db       = "refseq", 
                         organism = "Homo sapiens",
                         path     = file.path("_ncbi_downloads","RNA"))

In this example, getRNA() creates a directory named '_ncbi_downloads/RNA' into which the corresponding RNA file named Homo_sapiens_rna_from_genomic_refseq.fna.gz is downloaded. The return value of getRNA() is the folder path to the downloaded genome file that can then be used as input to the read_rna() function. The variable HS.rna.refseq stores the path to the downloaded RNA file. Subsequently, users can use the read_cds() function to import the genome into the R session. Users can choose to work with the genome sequence in R either as Biostrings object (obj.type = "Biostrings") or data.table object (obj.type = "data.table") by specifying the obj.type argument of the read_rna() function.

# import downloaded RNA as Biostrings object
Human_rna <- read_rna(file     = HS.rna.refseq, 
                      obj.type = "Biostrings")
# look at the Biostrings object
Human_rna
   A BStringSet instance of length 164136
          width seq                                                                                       names               
     [1]   1652 CTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCATGT...CACAGCTAGAGATCCTTTATTAAAAGCACACTGTTGGTTTCTG lcl|NC_000001.11_...
     [2]   1769 TCCGGCAGAGCGGAAGCGGCGGCGGGAGCTTCCGGGAGGGCGG...ACCAACAGTGTGCTTTTAATAAAGGATCTCTAGCTGTGCAGGA lcl|NC_000001.11_...
     [3]     68 TGTGGGAGAGGAACATGGGCTCAGGACAGCGGGTGTCAGCTTGCCTGACCCCCATGTCGCCTCTGTAG                      lcl|NC_000001.11_...
     [4]     23 TGACCCCCATGTCGCCTCTGTAG                                                                   lcl|NC_000001.11_...
     [5]     23 GAGAGGAACATGGGCTCAGGACA                                                                   lcl|NC_000001.11_...
     ...    ... ...
[164132]     59 GAGAAAGCTCACAAGAACTGCTAACTCATGCCCCCATGTCTAACAACATGGCTTTCTCA                               lcl|NC_012920.1_t...
[164133]     71 ACTTTTAAAGGATAACAGCTATCCATTGGTCTTAGGCCCCAAAAATTTTGGTGCAACTCCAAATAAAAGTA                   lcl|NC_012920.1_t...
[164134]     69 GTTCTTGTAGTTGAAATACAACGATGGTTTTTCATATCATTGGTCGTGGTTGTAGTCCGTGCGAGAATA                     lcl|NC_012920.1_t...
[164135]     66 GTCCTTGTAGTATAAACTAATACACCAGTCTTGTAAACCGGAGATGAAAACCTTTTTCCAAGGACA                        lcl|NC_012920.1_t...
[164136]     68 CAGAGAATAGTTTAAATTAGAATCTTAGCTTTGGGTGCTAATGGTGGAGTTAAAGACTTTTTCTCTGA                      lcl|NC_012920.1_t...

Internally, a text file named doc_Homo_sapiens_db_refseq.txt is generated. The information stored in this log file is structured as follows:

File Name: Homo_sapiens_rna_from_genomic_refseq.fna.gz
Organism Name: Homo_sapiens
Database: NCBI refseq
URL: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.36_GRCh38.p10/GCF_000001405.36_GRCh38.p10_rna_from_genomic.fna.gz
Download_Date: Wed Mar 15 16:46:45 2017
refseq_category: reference genome
assembly_accession: GCF_000001405.36
bioproject: PRJNA168
biosample: NA
taxid: 9606
infraspecific_name: NA
version_status: latest
release_type: Patch
genome_rep: Full
seq_rel_date: 2017-01-06
submitter: Genome Reference Consortium

In summary, the getRNA() and read_rna() functions allow users to retrieve RNA files by specifying the scientific name of the organism of interest and allow them to import the retrieved RNA file e.g. as Biostrings object. Thus, users can then perform the Biostrings notation to work with downloaded RNA and can rely on the log file generated by getRNA() to better document the source and version of RNA used for subsequent studies.

Alternatively, users can perform the pipeline logic of the magrittr package:

# install.packages("magrittr")
library(magrittr)
# import RNA as Biostrings object
Human_rna <- getRNA( db       = "refseq", 
                     organism = "Homo sapiens",
                     path     = file.path("_ncbi_downloads","RNA")) %>%
    read_cds(obj.type = "Biostrings")
Human_rna
   A BStringSet instance of length 164136
          width seq                                                                                       names               
     [1]   1652 CTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCATGT...CACAGCTAGAGATCCTTTATTAAAAGCACACTGTTGGTTTCTG lcl|NC_000001.11_...
     [2]   1769 TCCGGCAGAGCGGAAGCGGCGGCGGGAGCTTCCGGGAGGGCGG...ACCAACAGTGTGCTTTTAATAAAGGATCTCTAGCTGTGCAGGA lcl|NC_000001.11_...
     [3]     68 TGTGGGAGAGGAACATGGGCTCAGGACAGCGGGTGTCAGCTTGCCTGACCCCCATGTCGCCTCTGTAG                      lcl|NC_000001.11_...
     [4]     23 TGACCCCCATGTCGCCTCTGTAG                                                                   lcl|NC_000001.11_...
     [5]     23 GAGAGGAACATGGGCTCAGGACA                                                                   lcl|NC_000001.11_...
     ...    ... ...
[164132]     59 GAGAAAGCTCACAAGAACTGCTAACTCATGCCCCCATGTCTAACAACATGGCTTTCTCA                               lcl|NC_012920.1_t...
[164133]     71 ACTTTTAAAGGATAACAGCTATCCATTGGTCTTAGGCCCCAAAAATTTTGGTGCAACTCCAAATAAAAGTA                   lcl|NC_012920.1_t...
[164134]     69 GTTCTTGTAGTTGAAATACAACGATGGTTTTTCATATCATTGGTCGTGGTTGTAGTCCGTGCGAGAATA                     lcl|NC_012920.1_t...
[164135]     66 GTCCTTGTAGTATAAACTAATACACCAGTCTTGTAAACCGGAGATGAAAACCTTTTTCCAAGGACA                        lcl|NC_012920.1_t...
[164136]     68 CAGAGAATAGTTTAAATTAGAATCTTAGCTTTGGGTGCTAATGGTGGAGTTAAAGACTTTTTCTCTGA                      lcl|NC_012920.1_t...

Example NCBI Genbank:

# download the RNA of Homo sapiens from genbank
# and store the corresponding genome RNA file in '_ncbi_downloads/RNA'
HS.rna.genbank <- getRNA( db       = "genbank", 
                         organism = "Homo sapiens",
                         path     = file.path("_ncbi_downloads","RNA"))
# import downloaded RNA as Biostrings object
Human_rna <- read_cds(file     = HS.rna.genbank, 
                      obj.type = "Biostrings")
# look at the Biostrings object
Human_rna
  A BStringSet instance of length 24
     width seq                                                                                            names               
 [1]    71 GTTTATGTAGCTTACCTCCTCAAAGCAATACACTGAAAATGTTTAGACGGGCTCACATCACCCCATAAACA                        lcl|J01415.2_trna...
 [2]   954 AATAGGTTTGGTCCTAGCCTTTCTATTAGCTCTTAGTAAGATTACA...AAGTCGTAACATGGTAAGTGTACTGGAAAGTGCACTTGGACGAAC lcl|J01415.2_rrna...
 [3]    69 CAGAGTGTAGCTTAACACAAAGCACCCAACTTACACTTAGGAGATTTCAACTTAACTTGACCGCTCTGA                          lcl|J01415.2_trna...
 [4]  1559 GCTAAACCTAGCCCCAAACCCACTCCACCTTACTACCAGACAACCT...ATCTCAACTTAGTATTATACCCACACCCACCCAAGAACAGGGTTT lcl|J01415.2_rrna...
 [5]    75 GTTAAGATGGCAGAGCCCGGTAATCGCATAAAACTTAAAACTTTACAGTCAGAGGTTCAATTCCTCTTCTTAACA                    lcl|J01415.2_trna...
 ...   ... ...
[20]    59 GAGAAAGCTCACAAGAACTGCTAACTCATGCCCCCATGTCTAACAACATGGCTTTCTCA                                    lcl|J01415.2_trna...
[21]    71 ACTTTTAAAGGATAACAGCTATCCATTGGTCTTAGGCCCCAAAAATTTTGGTGCAACTCCAAATAAAAGTA                        lcl|J01415.2_trna...
[22]    69 GTTCTTGTAGTTGAAATACAACGATGGTTTTTCATATCATTGGTCGTGGTTGTAGTCCGTGCGAGAATA                          lcl|J01415.2_trna...
[23]    66 GTCCTTGTAGTATAAACTAATACACCAGTCTTGTAAACCGGAGATGAAAACCTTTTTCCAAGGACA                             lcl|J01415.2_trna...
[24]    68 CAGAGAATAGTTTAAATTAGAATCTTAGCTTTGGGTGCTAATGGTGGAGTTAAAGACTTTTTCTCTGA                           lcl|J01415.2_trna...

Example ENSEMBL:

# download the RNA of Homo sapiens from ensembl
# and store the corresponding genome RNA file in '_ncbi_downloads/RNA'
HS.rna.ensembl <- getRNA( db       = "ensembl", 
                          organism = "Homo sapiens",
                          path     = file.path("_ncbi_downloads","RNA"))
# import downloaded RNA as Biostrings object
Human_rna <- read_cds(file     = HS.rna.ensembl, 
                      obj.type = "Biostrings")
# look at the Biostrings object
Human_rna
  A BStringSet instance of length 36701
         width seq                                                                                        names               
    [1]    104 GTGCTCACTTTGGCAACATACATACTAAAATTGGACGGATACAG...GCACAAGGATGACATGCAAATTCATGAAGCATTCCATATTTTT ENST00000516494.2...
    [2]    164 TTAACTACCTGACAGAGGAGATACTGTGATCATGAAAGTGGTTT...CATAATTTCTGGTGGTAGGGGACTGCGTTCATGTTCTCCCCTA ENST00000627793.1...
    [3]    114 ACACTGGTTTCTCTTCAGATCGAATAAATCTTTCGCCTTTTACT...AGTTATAAGCTAATTTTTTGTAAGCCTTGCCCTGGGGAGGCAG ENST00000629478.1...
    [4]     64 CAGTGTTACACCTCTTTTAGAATTTATCTATCAGGTTTTCCAGTGTTCACTGAAATTTGTCTCT                           ENST00000629187.1...
    [5]    107 GTGTTGGCCTGGGCAGCACGTATACTAAAGTTGGAATGACACAG...GCGCAAGGATGATGTGCAAATTCGTGACAAGTTCCATATTTTT ENST00000631612.1...
    ...    ... ...
[36697]     90 GGCTAGTCCAAATGTAGTGGTGTTCCAAACTAATTAATCACAACCAGTTACAGATTTCTTGTTTTCTTTTCCACTCACACTTAGCCTTAA ENST00000410951.1...
[36698]    109 GGCTGATCTGAAGATGATGAGTTATCTCAATTGATTGTTCAGCC...TCTATTCTTTCCTCTCTTCTCACTACTGCACTTGGCTAGGAAA ENST00000410462.2...
[36699]    109 GGTGGGTCCAAAGGTAGCAAGTTATCTCAATTGATCACAGTCAG...CATTCTATCACCCCTTCTCATTACTGTACTTGACTAGTCTTTT ENST00000364501.1...
[36700]    293 GGATATGAGGGCGATCTGGCAGGGACATCTGTCACCCCACTGAT...AAAATTAGCTGGGCATAGTGGCGTGCACCTGTCGTCCTAGCTA ENST00000365097.1...
[36701]    172 TTGAAGGCGTGGAGACTGAAGTCCTCTCTATATCCACAGAACAG...AAGAGGGCTGTTCAGTCTCCATGCCCTTCAATCCTTGGCTACT ENST00000615087.1...

Example ENSEMBLGENOMES:

Due to the inavailability of "Homo sapiens" at db = "ensemblgenomes" here we choose "Arabidopsis thaliana" as example.

# download the genome of Arabidopsis thaliana from ensemblgenomes
# and store the corresponding genome RNA file in '_ncbi_downloads/RNA'
AT.rna.ensemblgenomes <- getRNA( db       = "ensemblgenomes", 
                                 organism = "Arabidopsis thaliana",
                                 path     = file.path("_ncbi_downloads","RNA"))
# import downloaded RNA as Biostrings object
Cress_rna <- read_cds(file     = AT.rna.ensemblgenomes, 
                      obj.type = "Biostrings")
# look at the Biostrings object
Cress_rna
  A BStringSet instance of length 1398
       width seq                                                                                          names               
   [1]   101 TAGTGAGCACAAAGCGAACTATTCTTTCGCCTTTTACTAAAGAAT...CCACGCTAAGTGGCATACGCCTATTTTTGGAGGGATCTGCCTTC AT1G05853.1 ncrna...
   [2]   195 ATACCTTTCTCGGCCTTTTGGCTAAGATCAAGTGTAGTATCTGTT...AGTCGCCCATGCGTTGCACTACTGCATGGGCCTGGCTCAACCCG AT2G04455.1 ncrna...
   [3]   195 TTTTTCTAATTACTTGCAAGCGATGTGTAAATTAGATTAATTCAA...TCATAAATTTAGATAATATTTCACTAAATAAGTTGAAATTGACC AT5G09945.1 ncrna...
   [4]   101 CGCAGCTATGTGGTGAGCACAAAGCGAACTATTCTTTCGCCTTTT...CGTGTGCTCTCCACGCCAAGTGGCTTACGCCTATTTTTGGAGGG AT4G04185.1 ncrna...
   [5]    55 GTGATCAATAAGACAAGTGGCCTAGGCTAGTGATCGCATTGCACATATCGTTTGG                                      AT4G06405.1 ncrna...
   ...   ... ...
[1394]   118 ACACATAAAACTAGTATATTGCCTTGAACATGGTTTATTAGGAAA...TCCTTATATACAATGTTCAAAGTAATGAATTAGTTTTATTTTGT at4g21362 ncrna c...
[1395]   121 CTTACAGAGATCTTTGGCATTCTGTCCACCTCCTCTCTCTATATT...TTTCGTAAGAGGAGGTGGGCATACTGCCAATAGAGATCTGTTAG at1g76135 ncrna c...
[1396]   100 GGTTGGATTACTGGGCGAATACTCCTATGGCAGATCGCATTGGCT...TAAAATGCTTCTCTGCCAAAGGAGATTTGCCCCGCAATTCATCC at2g34202 ncrna c...
[1397]   174 TTGACCCAGCCAAATCGACTCACCGTTAAAACTTCTTAACAGCTC...CGCTGTTAAGGAGTGTTAACGGTGAGTCGATTTGGCCGGGTCAA at4g07565 ncrna c...
[1398]   189 CTTAAGGCAATTTGACCTATATCAAATTTGGCACCGTTGCCGGGG...CCCGGCAACGGCGCCAAATTTGATATAGCTCAAATCGCCTTAAG at2g05455 ncrna c...

Retrieve the annotation file of a particular genome

Finally, users can download the corresponding annotation .gff files for particular genomes of interest using the getGFF() or alternatively for ensembl and ensemblgenomes the getGTF() function.

Example NCBI RefSeq:

# download the GFF file of Homo sapiens from refseq
# and store the corresponding file in '_ncbi_downloads/annotation'
HS.gff.refseq <- getGFF( db       = "refseq", 
                         organism = "Homo sapiens", 
                         path = file.path("_ncbi_downloads","annotation"))

After downloading the .gff file, users can import the .gff file with read_gff().

# import downloaded GFF file
Human_GFF <- read_gff(file = HS.gff.refseq)
Human_GFF
          seqid     source       type start       end score strand phase
          <chr>      <chr>      <chr> <int>     <int> <dbl>  <chr> <dbl>
1  NC_000001.11     RefSeq     region     1 248956422     0      +     0
2  NC_000001.11 BestRefSeq       gene 11874     14409     0      +     0
3  NC_000001.11 BestRefSeq transcript 11874     14409     0      +     0
4  NC_000001.11 BestRefSeq       exon 11874     12227     0      +     0
5  NC_000001.11 BestRefSeq       exon 12613     12721     0      +     0
6  NC_000001.11 BestRefSeq       exon 13221     14409     0      +     0
7  NC_000001.11 BestRefSeq       gene 14362     29370     0      -     0
8  NC_000001.11 BestRefSeq transcript 14362     29370     0      -     0
9  NC_000001.11 BestRefSeq       exon 29321     29370     0      -     0
10 NC_000001.11 BestRefSeq       exon 24738     24891     0      -     0

Example NCBI Genbank:

# download the GFF file of Homo sapiens from genbank
# and store the corresponding file in '_ncbi_downloads/annotation'
HS.gff.genbank <- getGFF( db       = "genbank", 
                         organism = "Homo sapiens", 
                         path = file.path("_ncbi_downloads","annotation"))

After downloading the .gff file, users can import the .gff file with read_gff().

# import downloaded GFF file
Human_GFF <- read_gff(file = HS.gff.genbank)
# show all elements of the data.frame
# options(tibble.print_max = Inf)
Human_GFF
        seqid  source       type     start       end score strand phase
        <chr>   <chr>      <chr>     <int>     <int> <dbl>  <chr> <dbl>
1  CM000663.2 Genbank     region         1 248956422     0      +     0
2  CM000663.2 Genbank centromere 122026460 125184587     0      +     0
3  KI270706.1 Genbank     region         1    175055     0      +     0
4  KI270707.1 Genbank     region         1     32032     0      +     0
5  KI270708.1 Genbank     region         1    127682     0      +     0
6  KI270709.1 Genbank     region         1     66860     0      +     0
7  KI270710.1 Genbank     region         1     40176     0      +     0
8  KI270711.1 Genbank     region         1     42210     0      +     0
9  KI270712.1 Genbank     region         1    176043     0      +     0
10 KI270713.1 Genbank     region         1     40745     0      +     0

Example ENSEMBL:

# download the GFF file of Homo sapiens from ENSEMBL
# and store the corresponding file in 'ensembl/annotation'
HS.gff.ensembl <- getGFF( db       = "ensembl", 
                         organism = "Homo sapiens", 
                         path = file.path("ensembl","annotation"))

After downloading the .gff file, users can import the .gff file with read_gff().

# import downloaded GFF file
Human_GFF <- read_gff(file = HS.gff.ensembl)
# show all elements of the data.frame
# options(tibble.print_max = Inf)
Human_GFF
   seqid source              type start       end   score strand phase
   <int>  <chr>             <chr> <int>     <int>   <chr>  <chr> <dbl>
1      1 GRCh38        chromosome     1 248956422       .      .     0
2      1      . biological_region 10469     11240 1.3e+03      .     0
3      1      . biological_region 10650     10657   0.999      +     0
4      1      . biological_region 10655     10657   0.999      -     0
5      1      . biological_region 10678     10687   0.999      +     0
6      1      . biological_region 10681     10688   0.999      -     0
7      1      . biological_region 10707     10716   0.999      +     0
8      1      . biological_region 10708     10718   0.999      -     0
9      1      . biological_region 10735     10747   0.999      -     0
10     1      . biological_region 10737     10744   0.999      +     0

Alternatively for getGTF():

# download the GTF file of Homo sapiens from ENSEMBL
# and store the corresponding file in 'ensembl/annotation'
HS.gtf.ensembl <- getGTF( db       = "ensembl", 
                         organism = "Homo sapiens", 
                         path = file.path("ensembl","annotation"))

Example ENSEMBLGENOMES:

Due to the inavailability of "Homo sapiens" at db = "ensemblgenomes" here we choose "Arabidopsis thaliana" as example.

# download the GFF file of Arabidopsis thaliana from ENSEMBLGENOMES
# and store the corresponding file in 'ensemblgenomes/annotation'
AT.gff.ensemblgenomes <- getGFF( db       = "ensemblgenomes", 
                                 organism = "Arabidopsis thaliana", 
                                 path = file.path("ensemblgenomes","annotation"))

After downloading the .gff file, users can import the .gff file with read_gff().

# import downloaded GFF file
Cress_GFF <- read_gff(file = AT.gff.ensemblgenomes)
# show all elements of the data.frame
options(tibble.print_max = Inf)
Cress_GFF
   seqid source           type start      end score strand phase
   <int>  <chr>          <chr> <int>    <int> <dbl>  <chr> <dbl>
1      1   TAIR     chromosome     1 30427671     0      .     0
2      1   tair           gene  3631     5899     0      +     0
3      1   tair     transcript  3631     5899     0      +     0
4      1   tair five_prime_UTR  3631     3759     0      +     0
5      1   tair           exon  3631     3913     0      +     0
6      1   tair            CDS  3760     3913     0      +     0
7      1   tair           exon  3996     4276     0      +     0
8      1   tair            CDS  3996     4276     0      +     2
9      1   tair           exon  4486     4605     0      +     0
10     1   tair            CDS  4486     4605     0      +     0

Alternatively for getGTF():

# download the GTF file of Arabidopsis thaliana from ENSEMBLGENOMES
# and store the corresponding file in 'ensemblgenomes/annotation'
AT.gtf.ensemblgenomes <- getGTF( db       = "ensemblgenomes", 
                                 organism = "Arabidopsis thaliana", 
                                 path = file.path("ensemblgenomes","annotation"))

Taken together, getGFF() or getGTF() in combination with getGenome(), getProteome(), getRNA() and getCDS() allows users to retrieve the genome information together with the corresponding .gff or gtf annotation file to make sure that they both have the same version and origin.

Repeat Masker Retrieval

Repeat Masker is a tool for screening DNA sequences for interspersed repeats and low complexity DNA sequences. NCBI stores the Repeat Masker for sevel species in their databases and can be retrieved using getRepeatMasker() and imported via read_rm().

Example NCBI RefSeq:

# download repeat masker annotation file for Homo sapiens
Hsapiens_rm <- getRepeatMasker( db       = "refseq", 
                                 organism = "Homo sapiens", 
                                 path = file.path("refseq","TEannotation"))

Now users can import the TE annotation file using read_rm().

# import TE annotation file
Hsapiens_rm_import <- read_rm("refseq/TEannotation/Homo_sapiens_rm_refseq.out.gz")
# look at data
Hsapiens_rm_import

Genome Assembly Stats Retrieval

By specifying the scientific name of an organism of interest the corresponding genome assembly stats file storing the assembly statistics of the organism of interest can be downloaded and stored locally.

Example NCBI RefSeq:

# download genome assembly stats file for Homo sapiens
Hsapiens_stats <- getAssemblyStats( db       = "refseq", 
                                 organism = "Homo sapiens", 
                                 path = file.path("refseq","AssemblyStats"))

Now users can import the TE annotation file using read_rm().

# import TE annotation file
Hsapiens_stats_import <- read_assemblystats(Hsapiens_stats)
# look at data
Hsapiens_stats_import
 A tibble: 1,196 x 6
   unit_name molecule_name molecule_type sequence_type statistic         value
   <chr>     <chr>         <chr>         <chr>         <chr>             <int>
 1 all       all           all           all           total-length         NA
 2 all       all           all           all           spanned-gaps        661
 3 all       all           all           all           unspanned-gaps      349
 4 all       all           all           all           region-count        317
 5 all       all           all           all           scaffold-count      875
 6 all       all           all           all           scaffold-N50   59364414
 7 all       all           all           all           scaffold-L50         17
 8 all       all           all           all           scaffold-N75   31699399
 9 all       all           all           all           scaffold-N90    7557127
10 all       all           all           all           contig-count       1536

Example NCBI Genbank:

# download genome assembly stats file for Homo sapiens
Hsapiens_stats <- getAssemblyStats( db       = "genbank", 
                                 organism = "Homo sapiens", 
                                 path = file.path("genbank","AssemblyStats"))

Now users can import the TE annotation file using read_rm().

# import TE annotation file
Hsapiens_stats_import <- read_assemblystats(Hsapiens_stats)
# look at data
Hsapiens_stats_import
 A tibble: 1,196 x 6
   unit_name molecule_name molecule_type sequence_type statistic         value
   <chr>     <chr>         <chr>         <chr>         <chr>             <int>
 1 all       all           all           all           total-length         NA
 2 all       all           all           all           spanned-gaps        661
 3 all       all           all           all           unspanned-gaps      349
 4 all       all           all           all           region-count        317
 5 all       all           all           all           scaffold-count      875
 6 all       all           all           all           scaffold-N50   59364414
 7 all       all           all           all           scaffold-L50         17
 8 all       all           all           all           scaffold-N75   31699399
 9 all       all           all           all           scaffold-N90    7557127
10 all       all           all           all           contig-count       1536

Collection Retrieval

The automated retrieval of collections (= Genome, Proteome, CDS, RNA, GFF, Repeat Masker, AssemblyStats) will make sure that the genome file of an organism will match the CDS, proteome, RNA, GFF, etc file and was generated using the same genome assembly version. One aspect of why genomics studies fail in computational and biological reproducibility is that it is not clear whether CDS, proteome, RNA, GFF, etc files used in a proposed analysis were generated using the same genome assembly file denoting the same genome assembly version. To avoid this seemingly trivial mistake we encourage users to retrieve genome file collections using the biomartr function getCollection() and attach the corresponding output as Supplementary Data to the respective genomics study to ensure computational and biological reproducibility.

By specifying the scientific name of an organism of interest a collection consisting of the genome file, proteome file, CDS file, RNA file, GFF file, Repeat Masker file, AssemblyStats file of the organism of interest can be downloaded and stored locally.

Example NCBI RefSeq:

# download collection for Saccharomyces cerevisiae
getCollection( db = "refseq", 
               organism = "Saccharomyces cerevisiae", 
               path = file.path("refseq","Collections"))

Internally, the getCollection() function will now generate a folder named refseq/Collection/Saccharomyces_erevisiae and will store all genome and annotation files for Saccharomyces cerevisiae in the same folder. In addition, the exact genoem and annotation version will be logged in the doc folder.

Genome download is completed!
Checking md5 hash of file: refseq/Collections/Saccharomyces_cerevisiae_md5checksums.txt ...
The md5 hash of file 'refseq/Collections/Saccharomyces_cerevisiae_md5checksums.txt' matches!
The genome of 'Saccharomyces_cerevisiae' has been downloaded to 'refseq/Collections' and has been named 'Saccharomyces_cerevisiae_genomic_refseq.fna.gz'.
Starting proteome retrieval of 'Saccharomyces cerevisiae' from refseq ...


Proteome download is completed!
Checking md5 hash of file: refseq/Collections/Saccharomyces_cerevisiae_md5checksums.txt ...
The md5 hash of file 'refseq/Collections/Saccharomyces_cerevisiae_md5checksums.txt' matches!
The proteome of 'Saccharomyces_cerevisiae' has been downloaded to 'refseq/Collections' and has been named 'Saccharomyces_cerevisiae_protein_refseq.faa.gz' .
Starting CDS retrieval of 'Saccharomyces cerevisiae' from refseq ...


CDS download is completed!
Checking md5 hash of file: refseq/Collections/Saccharomyces_cerevisiae_md5checksums.txt ...
The md5 hash of file 'refseq/Collections/Saccharomyces_cerevisiae_md5checksums.txt' matches!
The genomic CDS of 'Saccharomyces_cerevisiae' has been downloaded to 'refseq/Collections' and has been named 'Saccharomyces_cerevisiae_cds_from_genomic_refseq.fna.gz' .
Starting GFF retrieval of 'Saccharomyces cerevisiae' from refseq ...


GFF download is completed!
Checking md5 hash of file: refseq/Collections/Saccharomyces_cerevisiae_md5checksums.txt ...
The md5 hash of file 'refseq/Collections/Saccharomyces_cerevisiae_md5checksums.txt' matches!
The *.gff annotation file of 'Saccharomyces_cerevisiae' has been downloaded to 'refseq/Collections' and has been named 'Saccharomyces_cerevisiae_genomic_refseq.gff.gz'.
Starting RNA retrieval of 'Saccharomyces cerevisiae' from refseq ...


RNA download is completed!
Checking md5 hash of file: refseq/Collections/Saccharomyces_cerevisiae_md5checksums.txt ...
The md5 hash of file 'refseq/Collections/Saccharomyces_cerevisiae_md5checksums.txt' matches!
The genomic RNA of 'Saccharomyces_cerevisiae' has been downloaded to 'refseq/Collections' and has been named 'Saccharomyces_cerevisiae_rna_from_genomic_refseq.fna.gz' .
Starting Repeat Masker retrieval of 'Saccharomyces cerevisiae' from refseq ...


RepeatMasker download is completed!
Checking md5 hash of file: refseq/Collections/Saccharomyces_cerevisiae_md5checksums.txt ...
The md5 hash of file 'refseq/Collections/Saccharomyces_cerevisiae_md5checksums.txt' matches!
The Repeat Masker output file of 'Saccharomyces_cerevisiae' has been downloaded to 'refseq/Collections' and has been named 'Saccharomyces_cerevisiae_rm_refseq.out.gz'.
Starting assembly quality stats retrieval of 'Saccharomyces cerevisiae' from refseq ...


Genome assembly quality stats file download completed!
The assembly statistics file of 'Saccharomyces_cerevisiae' has been downloaded to 'refseq/Collections' and has been named 'Saccharomyces_cerevisiae_assembly_stats_refseq.txt'.
Collection retrieval finished successfully!


We retrieved the genome assembly and checked the annotation for 'Saccharomyces cerevisiae' (taxid: 559292, strain=S288C) from 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_assembly_stats.txt' (accession: GCF_000146045.2, bioproject: PRJNA128, biosample: NA) using the biomartr R package (Drost and Paszkowski, 2017).

Users can now simply attach the output folder `` as supplementary data in their study and state in the materials sections:

We retrieved the genome assembly and checked the annotation for ‘Saccharomyces cerevisiae’ (taxid: 559292, strain=S288C) from ‘ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_assembly_stats.txt’ (accession: GCF_000146045.2, bioproject: PRJNA128, biosample: NA) using the biomartr R package (Drost and Paszkowski, 2017).

This way, computational and biological reproducibility can be standardized and indeed will become trivial in the context of genome version and annotation version.