NCBI Database Retrieval

2018-01-02

Retrieve Sequence Databases from NCBI

NCBI stores a variety of specialized database such as Genbank, RefSeq, Taxonomy, SNP, etc. on their servers. The download.database() and download.database.all() functions implemented in biomartr allows users to download these databases from NCBI.

Search for available databases

Before downloading specific databases from NCBI users might want to list available databases. Using the listNCBIDatabases() function users can retrieve a list of available databases stored on NCBI.

# retrieve a list of available sequence databases at NCBI
biomartr::listNCBIDatabases(db = "all")
#>  [1] "16SMicrobial"            "FASTA"                   "cdd_delta"               "cloud"                   "env_nr"                  "env_nt"                  "est"                     "est_human"               "est_mouse"               "est_others"              "gene_info"               "gss"                     "gss_annot"               "htgs"                    "human_genomic"           "landmark"                "nr"                      "nt"                      "other_genomic"           "pataa"                   "patnt"                   "pdbaa"                   "pdbnt"                   "ref_prok_rep_genomes"    "ref_viroids_rep_genomes" "ref_viruses_rep_genomes" "refseq_genomic"          "refseq_protein"         
#> [29] "refseq_rna"              "refseqgene"              "sts"                     "swissprot"               "taxdb"                   "tsa_nr"                  "tsa_nt"                  "vector"

However, in case users already know which database they would like to retrieve they can filter for the exact files by specifying the NCBI database name. In the following example all sequence files that are part of the NCBI nr database shall be retrieved.

First, the listNCBIDatabases(db = "nr") allows to list all files corresponding to the nr database.

# show all NCBI nr files
biomartr::listNCBIDatabases(db = "nr")
#>  [1] "nr.00.tar.gz" "nr.75.tar.gz" "nr.01.tar.gz" "nr.02.tar.gz" "nr.03.tar.gz" "nr.04.tar.gz" "nr.05.tar.gz" "nr.76.tar.gz" "nr.16.tar.gz" "nr.54.tar.gz" "nr.55.tar.gz" "nr.56.tar.gz" "nr.06.tar.gz" "nr.15.tar.gz" "nr.30.tar.gz" "nr.57.tar.gz" "nr.58.tar.gz" "nr.59.tar.gz" "nr.60.tar.gz" "nr.61.tar.gz" "nr.62.tar.gz" "nr.63.tar.gz" "nr.64.tar.gz" "nr.65.tar.gz" "nr.66.tar.gz" "nr.67.tar.gz" "nr.68.tar.gz" "nr.69.tar.gz" "nr.70.tar.gz" "nr.71.tar.gz" "nr.72.tar.gz" "nr.73.tar.gz" "nr.74.tar.gz" "nr.07.tar.gz" "nr.77.tar.gz" "nr.78.tar.gz" "nr.08.tar.gz" "nr.09.tar.gz" "nr.10.tar.gz" "nr.11.tar.gz" "nr.12.tar.gz" "nr.13.tar.gz" "nr.14.tar.gz" "nr.28.tar.gz" "nr.29.tar.gz" "nr.31.tar.gz" "nr.17.tar.gz" "nr.18.tar.gz" "nr.19.tar.gz"
#> [50] "nr.20.tar.gz" "nr.21.tar.gz" "nr.22.tar.gz" "nr.23.tar.gz" "nr.32.tar.gz" "nr.24.tar.gz" "nr.25.tar.gz" "nr.26.tar.gz" "nr.27.tar.gz" "nr.33.tar.gz" "nr.34.tar.gz" "nr.35.tar.gz" "nr.36.tar.gz" "nr.37.tar.gz" "nr.38.tar.gz" "nr.39.tar.gz" "nr.40.tar.gz" "nr.41.tar.gz" "nr.42.tar.gz" "nr.43.tar.gz" "nr.44.tar.gz" "nr.45.tar.gz" "nr.46.tar.gz" "nr.47.tar.gz" "nr.48.tar.gz" "nr.49.tar.gz" "nr.53.tar.gz" "nr.50.tar.gz" "nr.51.tar.gz" "nr.52.tar.gz"

The output illustrates that the NCBI nr database has been separated into 41 files.

Further examples are:

# show all NCBI nt files
biomartr::listNCBIDatabases(db = "nt")
#>  [1] "nt.00.tar.gz" "nt.01.tar.gz" "nt.02.tar.gz" "nt.03.tar.gz" "nt.04.tar.gz" "nt.05.tar.gz" "nt.06.tar.gz" "nt.07.tar.gz" "nt.08.tar.gz" "nt.09.tar.gz" "nt.10.tar.gz" "nt.40.tar.gz" "nt.11.tar.gz" "nt.41.tar.gz" "nt.42.tar.gz" "nt.43.tar.gz" "nt.44.tar.gz" "nt.45.tar.gz" "nt.46.tar.gz" "nt.47.tar.gz" "nt.48.tar.gz" "nt.49.tar.gz" "nt.50.tar.gz" "nt.12.tar.gz" "nt.51.tar.gz" "nt.13.tar.gz" "nt.14.tar.gz" "nt.15.tar.gz" "nt.16.tar.gz" "nt.26.tar.gz" "nt.27.tar.gz" "nt.17.tar.gz" "nt.18.tar.gz" "nt.28.tar.gz" "nt.29.tar.gz" "nt.19.tar.gz" "nt.20.tar.gz" "nt.21.tar.gz" "nt.22.tar.gz" "nt.23.tar.gz" "nt.24.tar.gz" "nt.25.tar.gz" "nt.30.tar.gz" "nt.31.tar.gz" "nt.32.tar.gz" "nt.33.tar.gz" "nt.34.tar.gz" "nt.35.tar.gz" "nt.36.tar.gz"
#> [50] "nt.37.tar.gz" "nt.39.tar.gz" "nt.38.tar.gz"
# show all NCBI ESTs others
biomartr::listNCBIDatabases(db = "est_others")
#>  [1] "est_others.00.tar.gz" "est_others.01.tar.gz" "est_others.02.tar.gz" "est_others.03.tar.gz" "est_others.04.tar.gz" "est_others.05.tar.gz" "est_others.06.tar.gz" "est_others.07.tar.gz" "est_others.08.tar.gz" "est_others.09.tar.gz" "est_others.10.tar.gz"
# show all NCBI RefSeq (only genomes)
head(biomartr::listNCBIDatabases(db = "refseq_genomic"), 20)
#>  [1] "refseq_genomic.218.tar.gz" "refseq_genomic.219.tar.gz" "refseq_genomic.220.tar.gz" "refseq_genomic.00.tar.gz"  "refseq_genomic.01.tar.gz"  "refseq_genomic.02.tar.gz"  "refseq_genomic.03.tar.gz"  "refseq_genomic.04.tar.gz"  "refseq_genomic.05.tar.gz"  "refseq_genomic.06.tar.gz"  "refseq_genomic.07.tar.gz"  "refseq_genomic.08.tar.gz"  "refseq_genomic.09.tar.gz"  "refseq_genomic.10.tar.gz"  "refseq_genomic.11.tar.gz"  "refseq_genomic.12.tar.gz"  "refseq_genomic.13.tar.gz"  "refseq_genomic.14.tar.gz"  "refseq_genomic.15.tar.gz"  "refseq_genomic.16.tar.gz"
# show all NCBI RefSeq (only proteomes)
biomartr::listNCBIDatabases(db = "refseq_protein")
#>  [1] "refseq_protein.25.tar.gz" "refseq_protein.00.tar.gz" "refseq_protein.01.tar.gz" "refseq_protein.02.tar.gz" "refseq_protein.03.tar.gz" "refseq_protein.26.tar.gz" "refseq_protein.27.tar.gz" "refseq_protein.28.tar.gz" "refseq_protein.29.tar.gz" "refseq_protein.04.tar.gz" "refseq_protein.05.tar.gz" "refseq_protein.06.tar.gz" "refseq_protein.07.tar.gz" "refseq_protein.08.tar.gz" "refseq_protein.09.tar.gz" "refseq_protein.14.tar.gz" "refseq_protein.15.tar.gz" "refseq_protein.16.tar.gz" "refseq_protein.10.tar.gz" "refseq_protein.11.tar.gz" "refseq_protein.17.tar.gz" "refseq_protein.12.tar.gz" "refseq_protein.13.tar.gz" "refseq_protein.18.tar.gz" "refseq_protein.19.tar.gz" "refseq_protein.20.tar.gz" "refseq_protein.21.tar.gz"
#> [28] "refseq_protein.22.tar.gz" "refseq_protein.23.tar.gz" "refseq_protein.24.tar.gz" "refseq_protein.30.tar.gz" "refseq_protein.31.tar.gz" "refseq_protein.32.tar.gz" "refseq_protein.33.tar.gz" "refseq_protein.34.tar.gz" "refseq_protein.35.tar.gz" "refseq_protein.36.tar.gz" "refseq_protein.37.tar.gz"
# show all NCBI RefSeq (only RNA)
biomartr::listNCBIDatabases(db = "refseq_rna")
#>  [1] "refseq_rna.00.tar.gz" "refseq_rna.01.tar.gz" "refseq_rna.10.tar.gz" "refseq_rna.02.tar.gz" "refseq_rna.05.tar.gz" "refseq_rna.03.tar.gz" "refseq_rna.06.tar.gz" "refseq_rna.04.tar.gz" "refseq_rna.07.tar.gz" "refseq_rna.08.tar.gz" "refseq_rna.09.tar.gz" "refseq_rna.11.tar.gz" "refseq_rna.12.tar.gz"
# show NCBI swissprot
biomartr::listNCBIDatabases(db = "swissprot")
#> [1] "swissprot.tar.gz"
# show NCBI PDB
biomartr::listNCBIDatabases(db = "pdb")
#>  [1] "pdbnt.00.tar.gz" "pdbnt.40.tar.gz" "pdbnt.41.tar.gz" "pdbnt.42.tar.gz" "pdbnt.43.tar.gz" "pdbnt.44.tar.gz" "pdbnt.45.tar.gz" "pdbnt.46.tar.gz" "pdbnt.47.tar.gz" "pdbnt.48.tar.gz" "pdbnt.49.tar.gz" "pdbnt.50.tar.gz" "pdbnt.51.tar.gz" "pdbnt.26.tar.gz" "pdbnt.27.tar.gz" "pdbnt.01.tar.gz" "pdbnt.02.tar.gz" "pdbnt.03.tar.gz" "pdbnt.04.tar.gz" "pdbnt.05.tar.gz" "pdbnt.06.tar.gz" "pdbnt.07.tar.gz" "pdbnt.08.tar.gz" "pdbnt.09.tar.gz" "pdbnt.10.tar.gz" "pdbnt.11.tar.gz" "pdbnt.12.tar.gz" "pdbnt.13.tar.gz" "pdbnt.14.tar.gz" "pdbnt.15.tar.gz" "pdbnt.16.tar.gz" "pdbnt.17.tar.gz" "pdbnt.18.tar.gz" "pdbnt.28.tar.gz" "pdbnt.19.tar.gz" "pdbnt.20.tar.gz" "pdbnt.21.tar.gz" "pdbnt.22.tar.gz" "pdbnt.23.tar.gz" "pdbnt.24.tar.gz" "pdbnt.29.tar.gz"
#> [42] "pdbnt.25.tar.gz" "pdbaa.tar.gz"    "pdbnt.30.tar.gz" "pdbnt.31.tar.gz" "pdbnt.32.tar.gz" "pdbnt.33.tar.gz" "pdbnt.34.tar.gz" "pdbnt.35.tar.gz" "pdbnt.36.tar.gz" "pdbnt.37.tar.gz" "pdbnt.39.tar.gz" "pdbnt.38.tar.gz"
# show NCBI Human database
biomartr::listNCBIDatabases(db = "human")
#>  [1] "human_genomic.00.tar.gz" "human_genomic.01.tar.gz" "human_genomic.02.tar.gz" "human_genomic.03.tar.gz" "human_genomic.04.tar.gz" "human_genomic.05.tar.gz" "human_genomic.06.tar.gz" "human_genomic.07.tar.gz" "human_genomic.08.tar.gz" "human_genomic.10.tar.gz" "human_genomic.11.tar.gz" "human_genomic.12.tar.gz" "human_genomic.13.tar.gz" "human_genomic.14.tar.gz" "human_genomic.15.tar.gz" "human_genomic.16.tar.gz" "human_genomic.18.tar.gz" "human_genomic.19.tar.gz" "human_genomic.20.tar.gz" "human_genomic.21.tar.gz" "human_genomic.22.tar.gz"
# show NCBI EST databases
biomartr::listNCBIDatabases(db = "est")
#>  [1] "est.tar.gz"           "est_human.00.tar.gz"  "est_human.01.tar.gz"  "est_mouse.tar.gz"     "est_others.00.tar.gz" "est_others.01.tar.gz" "est_others.02.tar.gz" "est_others.03.tar.gz" "est_others.04.tar.gz" "est_others.05.tar.gz" "est_others.06.tar.gz" "est_others.07.tar.gz" "est_others.08.tar.gz" "est_others.09.tar.gz" "est_others.10.tar.gz"

Please not that all lookup and retrieval function will only work properly when a sufficient internet connection is provided.

In a next step users can use the listNCBIDatabases() and download.database.all() functions to retrieve all files corresponding to a specific NCBI database.

Download available databases

Using the same search strategy by specifying the database name as described above, users can now download these databases using the download.database() function.

For downloading only single files users can type:

# select NCBI nr version 27 =  "nr.27.tar.gz"
# and download it into the folder
biomartr::download.database(db = "nr.27.tar.gz", path = "nr")

Using this command first a folder named nr is created and the file nr.27.tar.gz is downloaded to this folder. This command will download the pre-formatted (by makeblastdb formatted) database version is retrieved.

Alternatively, users can retrieve all nr files with one command by typing:

# download the entire NCBI nr database
biomartr::download.database.all(db = "nr", path = "nr")

Using this command, all NCBI nr files are loaded into the nr folder (path = "nr").

The same approach can be applied to all other databases mentioned above, e.g.:

# download the entire NCBI nt database
biomartr::download.database.all(db = "nt", path = "nt")
# download the entire NCBI refseq (protein) database
biomartr::download.database.all(db = "refseq_protein", path = "refseq_protein")
# download the entire NCBI PDB database
biomartr::download.database.all(db = "pdb", path = "pdb")

Please notice that most of these databases are very large, so users should take of of providing a stable internect connection throughout the download process.