mglR: an R package to integrate genomic data for candidate genes

Katherine Hartmann

2017-01-18

The package mglR standing for “Master Gene List” was developed to download and organize large-scale, publicly available genomic studies on a candidate gene scale. It also includes functions to integrate these datasources and compare features across candidate genes.


This package makes use of the list structure in ‘R’ to organize datasources in an identical fashion for each gene of interest. It is building a nested list from which data can be pulled in a very systematic way. Practically speaking, the package can be divided into two components:

The first section of this vignette covers how to build the list structure, the second discusses some pecularities of the list structure, and the third presents the functions introduced to integrate these datasources. A separate vignette “Building on mglR” presents some ideas about how this package may be expanded by the user to include additional datasources or functions to accomodate their specific interests.

This package relies on several packages that have been developed by others to handle many of these datasources. Perhaps the most valuable, and the only one that is truly essential, is biomaRt. Although biomaRt does a phenomenal job in facilitating navigation between gene names, postions, etc. in a way that is critical for this package, it at times will throw error messages “Request to BioMart web service failed. Verify if you are still connected to the internet. Alternatively the BioMart web service is temporarily down.”. If this occurs please try the function again.

Also, please note that a sample mgl list myMgl is included as a dataset in the package. As this list is complete it can be used to try any of the make functions.


Building the list structure

Step 1: Build an empty list.

An mgl list can be generated from gene names, ENSG ids, or genomic coordinates using one of the following three functions.

  1. colloquial gene names - buildFromNames
  2. ENSG gene identifiers - buildFromEnsgs
  3. a single gene region (chromosome and hg19 position range) - buildFromRegion

Each of these functions will generate a standard list in R with length corresponding to the number of candidate genes. Each element of this list is itself a list with length of twenty corresponding to the number of datasources.

The following are all equivalent.

# Load the library
library(mglR)

# Build an empty list from gene names
buildFromNames(c("RP11-109G10.2", "RP11-481A12.2", "MSMB", "NCOA4", "TIMM23")) -> exampleMgl

# Build an empty list from ENSG identifiers
buildFromEnsgs(c("ENSG00000228326", "ENSG00000230553", "ENSG00000138294", "ENSG00000138293", "ENSG00000138297")) -> exampleMgl

# Build an empty list from a single gene region
buildFromRegion(10,51518775,51600147) -> exampleMgl

For each of the above, the output is an empty, nested list. Where the first level of heirarchy is the gene and the second is the datasource (i.e. mgl[[gene]][[datasource]])

# The output is a nested list
class(exampleMgl)
## [1] "list"
# 
# The length of the list corresponds to the number of candidate genes
length(exampleMgl)
## [1] 5
names(exampleMgl)
## [1] "RP11-109G10.2" "RP11-481A12.2" "MSMB"          "NCOA4"        
## [5] "TIMM23"
#
# Each element corresponding to a gene is itself a list
class(exampleMgl[[1]])
## [1] "list"
names(exampleMgl[[1]])
##  [1] "name"                   "enst"                  
##  [3] "location"               "antisense"             
##  [5] "go"                     "pubmed"                
##  [7] "gtex.normalized"        "gtex.gene.counts"      
##  [9] "gtex.transcript.counts" "gtex.gene.rpkm"        
## [11] "gtex.transcript.rpkm"   "dnase"                 
## [13] "transEqtls"             "cisEqtls"              
## [15] "sqtlSeek"               "sqtlAltrans"           
## [17] "pqtl"                   "gwasCatalog"           
## [19] "grasp"                  "aei"
#
# Currently these are empty placeholders for a particular data source
exampleMgl[[1]][2]
## $enst
## [1] 2

Step 2: Add elements of interest to the list.

Functions beginning with add can be used to fill in the various elements for each gene. Only those datasources of interest need be filled in, it is no problem for elements to remain empty.

Note: the example mgl ‘myMgl’ was too large to include in the package so the function exMgl can be used to download and load it

# Load `myMgl`, example mgl list
exMgl() -> myMgl

Element 2: enst

## # Add transcript information
## addEnst(myMgl) -> myMgl
## 
## # example mgl[[gene]][[datasource]]
myMgl[[1]][2]
## $enst
##   external_transcript_id ensembl_transcript_id   transcript_biotype
## 1      RP11-109G10.2-001       ENST00000439573 processed_pseudogene

Element 3: location

## # Add position information
## addLoc(myMgl) -> myMgl
## 
## # example mgl[[gene]][[datasource]]
myMgl[[1]][3]
## $location
##   chromosome_name start_position end_position strand
## 1              10       51532102     51532572     -1

Element 4: antisense

## # Add information about transcripts in antisense position
## addAntisense(myMgl) -> myMgl
## 
## # example mgl[[gene]][[datasource]]. note: none of these genes have antisense transcripts
myMgl[[1]][4]
## $antisense
## [1] external_gene_id      ensembl_transcript_id transcript_biotype   
## [4] chromosome_name       start_position        end_position         
## [7] strand               
## <0 rows> (or 0-length row.names)

Element 5: go

## # Add gene ontology terms
## addGo(myMgl) -> myMgl
## 
## # example mgl[[gene]][[datasource]]
myMgl[[3]][5]
## $go
##             name_1006
## 1  biological_process
## 2 extracellular space
## 3             nucleus
## 4  molecular_function

Element 6: pubmed

## # Add number of papers cataloged in PubMed
## addPubmed(myMgl) -> myMgl
## 
## # example mgl[[gene]][[datasource]]
myMgl[[4]][6]
## $pubmed
## [1] 152

Elements 7-11: gtex.normalized, gtex.gene.counts, gtex.transcript.counts, gtex.gene.rpkm, gtex.transcript.rpkm Even with relatively small candidate gene lists (< 10 genes), adding expression data to the list (elements 7-11) can be time and memory intensive. The function works by downloading the data locally (which can be saved for future use using the flag saveDownload = TRUE), subsetting the datasets for the genes of interest using grep commands run by R through unix, loading the subsetted data into R and incorporating the data into the mgl list. The most time consuming parts are downloading and subsetting the data. Both of these steps can be run outside of R using the command line if preferred.

## # Add expression data from GTEx
## addExpression(myMgl, download = TRUE, saveDownload = FALSE, normalized = TRUE, xpGeneReads = TRUE, xpTranscriptReads = TRUE, xpGeneRpkm = TRUE, xpTranscriptRpkm = TRUE) -> myMgl
## 
## # examples mgl[[gene]][[datasource]][[tissue]][expression&covariates, people]
## # normalized expression data
myMgl[[3]][[7]][[1]][1:3,1:3]
##                   GTEX.111CU GTEX.111FC GTEX.111VG
## ENSG00000138294.9  0.9388143  -1.116486  0.1049847
## C1                -0.0180000  -0.015600 -0.0197000
## C2                 0.0010000   0.016800  0.0108000
## # gene reads
myMgl[[3]][[8]][[1]][,1:2]
##                 GTEX-111CU-1826-SM-5GZYN GTEX-111FC-0226-SM-5N9B8
## ENSG00000138294                        4                        0
## # transcript RPKM
myMgl[[3]][[11]][[1]][,1:2]
##                 GTEX-111CU-1826-SM-5GZYN GTEX-111FC-0226-SM-5N9B8
## ENST00000358559                  0.42109                        0
## ENST00000487536                  0.00000                        0
## ENST00000474170                  0.00000                        0
## ENST00000478719                  0.00000                        0
## ENST00000466268                  0.00000                        0
## ENST00000298239                  0.00000                        0

Element 12: dnase

## # Add results from Maurano Nat. Genetics 2015
## addDnase(myMgl) -> myMgl
## 
## # example mgl[[gene]][[datasource]]
myMgl[[3]][12]
## $dnase
##   chrom chromStart chromEnd         rs numhets allele.1 readsA allele.2
## 1 chr10   51553064 51553065 rs10994201       4        G     86        A
## 2 chr10   51558124 51558125  rs7076948       7        C     71        T
## 3 chr10   51559468 51559469  rs7904463       8        T    177        C
## 4 chr10   51561798 51561799 rs17178655      14        A    235        G
##   readsB totalReads pctRef  q.value    significance.level
## 1     76        162  0.469 7.63e-01        not_imbalanced
## 2     64        135  0.474 8.53e-01        not_imbalanced
## 3     71        248  0.286 5.17e-10 imbalanced_(0.1%_FDR)
## 4    126        361  0.349 2.70e-07   imbalanced_(5%_FDR)

Element 13: transEqtls

## # Add transEqtl data from GTEx
## addTransEqtl(myMgl) -> myMgl
## 
## # example mgl[[gene]][[datasource]][[tissue]]
myMgl[[4]][[13]][20]
## $`Heart - Left Ventricle`
##  [1] DiscoveryMode            rs_id_dbSNP142_GRCh37p13
##  [3] variant_id               snp_chr                 
##  [5] snp_pos                  gene_id                 
##  [7] gene_name                tissue_site_detail      
##  [9] pvalue                   FDR                     
## [11] cis.eGene_id             cis.eGene_name          
## [13] GWAS_pvalue              trait                   
## [15] PMID                    
## <0 rows> (or 0-length row.names)

Element 14: cisEqtls

## # Add cisEqtl data from GTEx
## addCisEqtl(myMgl, download = TRUE, saveDownload = FALSE) -> myMgl
## 
## # example mgl[[gene]][[datasource]][[tissue]]
myMgl[[4]][[14]][[44]][1:3,]
##        tissue          variant_id            gene_id tss_distance
## 1 Whole_Blood 10_51489257_T_C_b37 ENSG00000138293.15       -75851
## 2 Whole_Blood 10_51489667_T_C_b37 ENSG00000138293.15       -75441
## 3 Whole_Blood 10_51490628_T_C_b37 ENSG00000138293.15       -74480
##   pval_nominal     slope  slope_se slope_fpkm slope_fpkm_se
## 1  8.87850e-06 -0.143856 0.0318135   -1.92895       2.39498
## 2  2.45591e-05 -0.129756 0.0302692   -1.78156       2.27139
## 3  1.45179e-05 -0.137201 0.0311152   -1.50279       2.33963
##   pval_nominal_threshold min_pval_nominal   pval_beta       rsId
## 1            7.60232e-05      1.12164e-07 0.000288916 rs11595194
## 2            7.60232e-05      1.12164e-07 0.000288916 rs12783102
## 3            7.60232e-05      1.12164e-07 0.000288916 rs10825136

Element 15: sqtlSeek

## # Add splicing QTL data from GTEx sqtlSeek algorithm
## addSqtlSeek(myMgl) -> myMgl
## 
## # example mgl[[gene]][[datasource]][[tissue]]. note: none of these genes have splicing QTls as defined by sqtlSeek
myMgl[[1]][[15]][1]
## $AdiposeTissue
## [1] snpId    pv       transId1 transId2
## <0 rows> (or 0-length row.names)

Element 16: sqtlAltrans

## # Add splicing QTL data from GTEx Altrans algorithm
## addSqtlAltrans(myMgl) -> myMgl
## 
## # example mgl[[gene]][[datasource]][[tissue]]
myMgl[[4]][[16]][2]
## $ArteryTibial
##       Variation X.log10.p.
## 2921  rs4519043     6.4089
## 2922 rs79490275     6.1975

Element 17: pqtl

## # Add protein truncating variants from GTEx
## addPtv(myMgl) -> myMgl
## 
## # example mgl[[gene]][[datasource]][[tissue]]
myMgl[[4]][[17]][[1]][1:2,]
##       identifier of snp alternative identifier of SNP
## 13458       rs117273914                             -
## 13935        rs72795897                             -
##       name of exon that exon belongs to           chr_start_end
## 13458                             NCOA4 chr10_51580880_51580968
## 13935                             NCOA4 chr10_51580880_51580968
##       chromosome of SNP chromosome of exon location of SNP
## 13458                10                 10        51580469
## 13935                10                 10        51581889
##       location of middle point of exon distance between 8 and 9
## 13458                         51580924                      455
## 13935                         51580924                      965
##       spearman correlation    pvalue -log10(pvalue) NA
## 13458                0.135 0.1940369         0.7121 NA
## 13935                0.132 0.2062249         0.6857 NA

Element 18: gwasCatalog

## # Add GWAS data from the NHGRI-EBI GWAS Catalog
## addGwasCatalog(myMgl, range = 10, download = TRUE, saveDownload = FALSE) -> myMgl
## 
## # example mgl[[gene]][[datasource]][GWASEntry, Details]
myMgl[[3]][[18]][1:2,]
##    DATE.ADDED.TO.CATALOG PUBMEDID FIRST.AUTHOR       DATE   JOURNAL
## 27            2008-06-16 18264097     Eeles RA 2008-02-10 Nat Genet
## 34            2008-06-16 18264096     Thomas G 2008-02-10 Nat Genet
##                                    LINK
## 27 www.ncbi.nlm.nih.gov/pubmed/18264097
## 34 www.ncbi.nlm.nih.gov/pubmed/18264096
##                                                                              STUDY
## 27  Multiple newly identified loci associated with prostate cancer susceptibility.
## 34 Multiple loci identified in a genome-wide association study of prostate cancer.
##      DISEASE.TRAIT
## 27 Prostate cancer
## 34 Prostate cancer
##                                                INITIAL.SAMPLE.SIZE
## 27 1,854 European ancestry cases, 1,894 European ancestry controls
## 34 1,172 European ancestry cases, 1,157 European ancestry controls
##                                            REPLICATION.SAMPLE.SIZE
## 27 3,268 European ancestry cases, 3,366 European ancestry controls
## 34 3,941 European ancestry cases, 3,964 European ancestry controls
##      REGION CHR_ID  CHR_POS REPORTED.GENE.S.      MAPPED_GENE
## 27 10q11.22     10 46046326             MSMB MSMB - RPL23AP61
## 34 10q11.22     10 46046326             MSMB MSMB - RPL23AP61
##    UPSTREAM_GENE_ID DOWNSTREAM_GENE_ID SNP_GENE_IDS UPSTREAM_GENE_DISTANCE
## 27             4477             728484                                  57
## 34             4477             728484                                  57
##    DOWNSTREAM_GENE_DISTANCE STRONGEST.SNP.RISK.ALLELE       SNPS MERGED
## 27                    16924              rs10993994-T rs10993994      0
## 34                    16924              rs10993994-T rs10993994      0
##    SNP_ID_CURRENT               CONTEXT INTERGENIC RISK.ALLELE.FREQUENCY
## 27       10993994 upstream_gene_variant          1                  0.40
## 34       10993994 upstream_gene_variant          1                  0.40
##    P.VALUE PVALUE_MLOG P.VALUE..TEXT. OR.or.BETA X95..CI..TEXT.
## 27   9e-29    28.04576                      1.25    [1.17-1.34]
## 34   7e-13    12.15490                      1.16    [1.04-1.29]
##    PLATFORM..SNPS.PASSING.QC. CNV
## 27          Illumina [541129]   N
## 34          Illumina [527869]   N

Element 19: grasp

# Add GWAS data from GRASP
addGrasp(myMgl, range = 0) -> myMgl

# example
#myMgl[[1]][[19]][1:3,]

Element 20: aei Please note that access to allelic ratios from GTEx requires approval and download through dbGaP. See http:/www.ncbi.nlm.nih.gov/gap

# Add allelic ratios from GTEx
addAei(myMgl, fp = '~/Downloads') -> myMgl

# example mgl[[gene]][[datasource]][[tissue]][SNP,Details]
myMgl[[4]][[20]][[1]][1:3,]

Quirks of the list

It’s a list!

The list that is created can be compared to filing cabinet. Where the cabinet represents the entirety of the list, each drawer a different gene, and each datasource a folder within the drawer. If you simply print the list to the screen in R, it would be like dumping an entire filing cabinet full of papers all over the floor. Even printing one element/gene is too much and in some cases the datasource itself (i.e. the folder) is really big and even this is too much to print to the screen. To cautiously work your way through the list, use class and dim or length to preview and subsetting either of lists [[#]] or of dataframes [row#, column#].

# check class
class(myMgl[[3]][[12]])
## [1] "data.frame"
# check dimensions
dim(myMgl[[3]][[12]])
## [1]  4 13
# print subset
myMgl[[3]][[12]][1:3,]
##   chrom chromStart chromEnd         rs numhets allele.1 readsA allele.2
## 1 chr10   51553064 51553065 rs10994201       4        G     86        A
## 2 chr10   51558124 51558125  rs7076948       7        C     71        T
## 3 chr10   51559468 51559469  rs7904463       8        T    177        C
##   readsB totalReads pctRef  q.value    significance.level
## 1     76        162  0.469 7.63e-01        not_imbalanced
## 2     64        135  0.474 8.53e-01        not_imbalanced
## 3     71        248  0.286 5.17e-10 imbalanced_(0.1%_FDR)

Working with the list

Included are two functions to make navigating the mgl list a bit easier.

(1) Which elements have been added? Use listElements to display which elements have been added or which ones are still missing.

# display missing elements
listElements(exampleMgl, added = FALSE)
##                   enst               location              antisense 
##                      2                      3                      4 
##                     go                 pubmed        gtex.normalized 
##                      5                      6                      7 
##       gtex.gene.counts gtex.transcript.counts         gtex.gene.rpkm 
##                      8                      9                     10 
##   gtex.transcript.rpkm                  dnase             transEqtls 
##                     11                     12                     13 
##               cisEqtls               sqtlSeek            sqtlAltrans 
##                     14                     15                     16 
##                   pqtl            gwasCatalog                  grasp 
##                     17                     18                     19 
##                    aei 
##                     20
#
# display elements that have already been added
listElements(exampleMgl, added = TRUE)
## name 
##    1

(2) One element at a time. Use listSeparate to subset the list with just one datasource for all genes (i.e. to get ensts only for all genes)

# subset the second element with ENST information
listSeparate(myMgl, 2) -> subset
#
subset
## $`RP11-109G10.2`
##   external_transcript_id ensembl_transcript_id   transcript_biotype
## 1      RP11-109G10.2-001       ENST00000439573 processed_pseudogene
## 
## $`RP11-481A12.2`
##   external_transcript_id ensembl_transcript_id   transcript_biotype
## 1      RP11-481A12.2-001       ENST00000449598 processed_pseudogene
## 
## $MSMB
##   external_transcript_id ensembl_transcript_id   transcript_biotype
## 1               MSMB-001       ENST00000358559       protein_coding
## 2               MSMB-003       ENST00000487536 processed_transcript
## 3               MSMB-006       ENST00000474170 processed_transcript
## 4               MSMB-004       ENST00000478719 processed_transcript
## 5               MSMB-005       ENST00000466268 processed_transcript
## 6               MSMB-002       ENST00000298239       protein_coding
## 
## $NCOA4
##    external_transcript_id ensembl_transcript_id   transcript_biotype
## 1               NCOA4-005       ENST00000498586 processed_transcript
## 2               NCOA4-001       ENST00000374087       protein_coding
## 3               NCOA4-003       ENST00000344348       protein_coding
## 4               NCOA4-002       ENST00000374082       protein_coding
## 5               NCOA4-004       ENST00000431200       protein_coding
## 6               NCOA4-203       ENST00000438493       protein_coding
## 7               NCOA4-205       ENST00000452682       protein_coding
## 8               NCOA4-204       ENST00000443446       protein_coding
## 9               NCOA4-202       ENST00000430396       protein_coding
## 10              NCOA4-201       ENST00000414907       protein_coding
## 
## $TIMM23
##   external_transcript_id ensembl_transcript_id   transcript_biotype
## 1             TIMM23-007       ENST00000444743       protein_coding
## 2             TIMM23-001       ENST00000260867       protein_coding
## 3             TIMM23-006       ENST00000485812 processed_transcript
## 4             TIMM23-002       ENST00000374065       protein_coding
## 5             TIMM23-003       ENST00000476778 processed_transcript
## 6             TIMM23-004       ENST00000469116 processed_transcript
## 7             TIMM23-005       ENST00000480051 processed_transcript
## 8             TIMM23-201       ENST00000374064       protein_coding

Gene Names

Colloquial gene names (e.g. RP11-109G10.2) are often far more convienent for every day use than ENSG identifiers (e.g. ENSG00000228326), although colloquial names are not always specific. biomaRt does a phenomenal job converting between different gene names and identifiers. However, in the case that biomaRt is not able to find a corresponding ENSG id for a given gene a warning message will be displayed. This should be fixed using missNames and fixNames as various functions in the package rely on ENSG ids. bklIn the cases where these ENSG ids are not identified through biomaRt and must be fixed, they can often be found by googling the gene name. GeneCards http://www.genecards.org/ does a particularly good job cataloging various colloquial names for a given gene and always provides the ENSG id.

# Build an empty list from gene names
buildFromNames(c("RP11-109G10.2", "RP11-481A12.2", "MSMB", "NCOA4", "TIMM23")) -> newMgl
## Warning in buildFromNames(c("RP11-109G10.2", "RP11-481A12.2", "MSMB",
## "NCOA4", : Gene names missing: RP11-109G10.2, RP11-481A12.2
#
# Gene name was not found by *biomaRt*
newMgl[[1]][1]
## $name
## [1] "NA"
#
# Identify missing names
missNames(newMgl) 
## [1] "RP11-109G10.2" "RP11-481A12.2"
#
# Fix missing names
fixNames(newMgl, c('ENSG00000228326', 'ENSG00000220508')) -> newMgl
#
# Gene names now filled in
newMgl[[1]][1]
## $name
##   external_gene_id ensembl_gene_id description
## 1    RP11-109G10.2 ENSG00000228326          NA

SNP Ids

The cisEqtl data released by GTEx uses chr_pos… instead of rs numbers to identify SNPs. In order to add rs ids use fixSnpIds. It is preferable to use rs Ids for functions like makeSnps, makeMultiEqtl, makeOverlap, and makeOverlapTable.

# add rs Id
fixSnpIds(myMgl) -> myMgl

Dependent Elements

Some elements are dependent on one another. For example adding information from the GWAS Catalog (element 18) requires that position information for each gene (element 3) has been filled in because GWAS-based associations are pulled not only based on the assigned gene name but also on the position range of the gene. In these instances of inter-dependent elements, if the required element has not yet been filled in an error message appears directing the user to add the required information.

# Build list
## buildFromRegion(10,51518775,51600147) -> exampleMgl
#
# Add GWAS data from the NHGRI-EBI GWAS Catalog [element #18]
addGwasCatalog(exampleMgl, range = 10, download = TRUE, saveDownload = FALSE) -> exampleMgl
## Error in addGwasCatalog(exampleMgl, range = 10, download = TRUE, saveDownload = FALSE): location data not available. see function addLoc
#
# Add position information [element #3]
addLoc(exampleMgl) -> exampleMgl
#
exampleMgl[[1]][[3]]
##   chromosome_name start_position end_position strand
## 1              10       51532102     51532572     -1
#
# Add GWAS data from the NHGRI-EBI GWAS Catalog [element #18]
addGwasCatalog(exampleMgl, range = 10, download = TRUE, saveDownload = FALSE) -> exampleMgl
#
# example mgl[[gene]][[datasource]][GWASEntry, Details]
exampleMgl[[3]][[18]][1:2,1:5]
##   DATE.ADDED.TO.CATALOG PUBMEDID FIRST.AUTHOR       DATE   JOURNAL
## 1            2008-06-16 18264097     Eeles RA 2008-02-10 Nat Genet
## 2            2008-06-16 18264096     Thomas G 2008-02-10 Nat Genet

Using the list

Gene Ontology

Two functions makeGo and makeGoSearch are included to organize Gene Ontology terms and determine those GO terms that are shared among candidate genes.

# summarize gene ontology (GO) terms
makeGo(myMgl, saveFile = F) -> myGo
#
# output of makeGo is a list with three elements
class(myGo)
## [1] "list"
names(myGo)
## [1] "goRes"   "goTable" "goCount"
#
#goRes is a list with GO terms for each gene (subset of element 5 from mgl)
myGo[[1]][[4]]
##                                             name_1006
## 1                 androgen receptor signaling pathway
## 2                                 response to hormone
## 3                              male gonad development
## 4 positive regulation of transcription, DNA-dependent
## 5                        transcription, DNA-templated
## 6                                             nucleus
## 7                           androgen receptor binding
## 8                  transcription coactivator activity
#
# goTable is a dataframe with GO term and gene
myGo[[2]][1:2,]
##                GoTerm Gene
## 1  biological_process MSMB
## 2 extracellular space MSMB
#
# goCount is a ranking of GO terms and the number of genes they are assigned to
myGo[[3]][1:2]
## 
##                   nucleus androgen receptor binding 
##                         2                         1
#
# to search for genes with a specific GO term
makeGoSearch(myMgl, 'nucleus', go = myGo, saveFile = F)
## [1] "MSMB"  "NCOA4"

Phenotypes

Similar to the functions for summarizing Gene Ontology terms, two functions makePhenotypes and makePhenotypeSearch are included to organize GWAS-based trait associations and determine those phenotypes that are shared among candidate genes.

# summarize phenotypes
makePhenotypes(myMgl, saveFile = F) -> myPhenotypes
#
# output of makePhenotypes is a list with four elements
class(myPhenotypes)
## [1] "list"
names(myPhenotypes)
## [1] "phenRes"   "phenList"  "phenTable" "phenCount"
#
# phenRes is a list with GWAS-based trait associations for each gene split by datasource (NHGRI-EBI GWAS Catalog and GRASP)
myPhenotypes[[1]][[5]]
## $GwasCatalog
## [1] "Mean corpuscular hemoglobin" "Mean corpuscular volume"    
## [3] "Prostate cancer"            
## 
## $GRASP
##  [1] "HDL cholesterol"                                                              
##  [2] "Mean corpuscular hemoglobin (MCH)"                                            
##  [3] "Mean corpuscular volume (MCV)"                                                
##  [4] "Neuroblastoma (brain cancer)"                                                 
##  [5] "Gene expression of TIMM23 [probe ILMN_22871] in osteoblasts treated with BMP2"
##  [6] "TIMM23 gene expression in liver tissue"                                       
##  [7] "Comorbid depressive syndrome and alcohol dependence"                          
##  [8] "Mean corpuscular hemoglobin concentration (MCHC)"                             
##  [9] "Infant head circumference"                                                    
## [10] "Gene expression of NCOA4 in blood"
#
# phenList is a list with GWAS-based trait associations for each gene 
myPhenotypes[[2]][[4]][1:5]
## [1] "Birth weight"                                                                
## [2] "Carboplatin-induced neutropenia/leukopenia in cancer patients"               
## [3] "Differential exon level expression of CTGLF5 [probe 3288865] in brain cortex"
## [4] "Differential exon level expression of CTGLF5 [probe 3288879] in brain cortex"
## [5] "Differential exon level expression of CTGLF5 [probe 3288880] in brain cortex"
#
# phenTable is a dataframe with phenotype and gene
myPhenotypes[[3]][1:2,]
##      Phenotype         Gene           
## [1,] "Fasting insulin" "RP11-109G10.2"
## [2,] "HDL cholesterol" "RP11-109G10.2"
#
# phenCount is a ranking of phenotypes and the number of genes they are assigned to
myPhenotypes[[4]][1:2]
## 
## HDL cholesterol Fasting insulin 
##               4               3
#
# to search for genes with a specific GO term
makePhenotypeSearch(myMgl, 'Mean corpuscular hemoglobin', phen = myPhenotypes, saveFile = F)
## [1] "MSMB"   "NCOA4"  "TIMM23"

SNPs

Again similar to Gene Ontology terms and Phenotypes, two functions makeSnps and makeSnpSearch are included to organize genetic variants associated with candidate genes and determine those SNPs that are shared among candidate genes.

# summarize SNPs
makeSnps(myMgl, saveFile = F) -> mySnps
#
# output of makeSnps is a list with four elements
class(mySnps)
## [1] "list"
names(mySnps)
## [1] "snpRes"   "snpList"  "snpTable" "snpCount"
#
# snpRes is a list with SNPs for each gene split by datasource
mySnps[[1]][[5]]
## $TranseQTL
## [1] "None"
## 
## $CiseQTL
##  [1] "rs17720205"             "rs78937603"            
##  [3] "rs11592181"             "rs11597667"            
##  [5] "rs61847076"             "rs61847077"            
##  [7] "10_51593069_GACA_G_b37" "rs10994675"            
##  [9] "rs41306524"             "rs7085433"             
## [11] "rs3122258"              "rs188856704"           
## [13] "rs142782792"            "rs181655860"           
## [15] "rs117979646"            "10_51507845_GA_G_b37"  
## [17] "rs75901817"             "rs17720193"            
## [19] "rs17776547"             "rs41306524"            
## [21] "rs74877435"             "rs7085433"             
## 
## $sqtlSeek
## [1] "None"
## 
## $sqtlAltrans
## [1] "rs7085005" "rs2177370"
## 
## $Ptv
## [1] "None"
## 
## $GwasCatalog
## [1] "rs7085433"  "rs10993994"
## 
## $GRASP
## [1] "rs7085433" "rs7350420"
## 
## $DNAse
## [1] "None"
#
# snpList is a list with SNPs for each gene 
mySnps[[2]][[5]]
##  [1] "10_51507845_GA_G_b37"   "10_51593069_GACA_G_b37"
##  [3] "rs10993994"             "rs10994675"            
##  [5] "rs11592181"             "rs11597667"            
##  [7] "rs117979646"            "rs142782792"           
##  [9] "rs17720193"             "rs17720205"            
## [11] "rs17776547"             "rs181655860"           
## [13] "rs188856704"            "rs2177370"             
## [15] "rs3122258"              "rs41306524"            
## [17] "rs61847076"             "rs61847077"            
## [19] "rs7085005"              "rs7085433"             
## [21] "rs7350420"              "rs74877435"            
## [23] "rs75901817"             "rs78937603"
#
# snpTable is a dataframe with SNP and gene
mySnps[[3]][1:3,]
##      SNP         Gene           
## [1,] "rs2611504" "RP11-109G10.2"
## [2,] "rs4581397" "RP11-109G10.2"
## [3,] "rs4630240" "RP11-109G10.2"
#
# snpCount is a ranking of SNPs and the number of genes they are assigned to
mySnps[[4]][1:3]
## 
## rs2611504 rs4581397 rs4630240 
##         3         3         3
#
# to search for genes with a specific SNP
makeSnpSearch(myMgl, 'rs2611504', snp = mySnps, saveFile = F)
## [1] "RP11-109G10.2" "MSMB"          "NCOA4"

Overlap

Two functions makeOverlap and makeOverlapTable have been developed to identify SNPs that are assigned to a given gene for multiple reasons, e.g. as both a cis-eQTL and a GWAS based variant. makeOverlap identifies overlapping SNPs for two user-defined groups, while makeOverlapTable gives the details for these SNPs.

# find SNPs that are both cis-eQTLs and GWAS-based variants
makeOverlap(myMgl, snpsA = 'cisEqtls', snpsB = 'gwasCatalog', saveFile = FALSE) -> myOverlap
#
myOverlap
## $overlap
## $overlap$`RP11-109G10.2`
## [1] "None"
## 
## $overlap$`RP11-481A12.2`
## [1] "None"
## 
## $overlap$MSMB
## [1] "rs10993994" "rs3123078"  "rs4631830" 
## 
## $overlap$NCOA4
## [1] "None"
## 
## $overlap$TIMM23
## [1] "rs7085433"
## 
## 
## $snpsA
## [1] "cisEqtls"
## 
## $snpsB
## [1] "gwasCatalog"
# get details for these SNPs
makeOverlapTable(myMgl, myOverlap, saveFile = FALSE) -> myOverlapTable
#[[gene]][[snp]][[datasource]]
myOverlapTable[[1]][[1]][[1]][,1:3]
##                            tissue          variant_id           gene_id
## 1                            Lung 10_51549496_T_C_b37 ENSG00000138294.9
## 2 Skin_Not_Sun_Exposed_Suprapubic 10_51549496_T_C_b37 ENSG00000138294.9
## 3      Skin_Sun_Exposed_Lower_leg 10_51549496_T_C_b37 ENSG00000138294.9
## 4                         Stomach 10_51549496_T_C_b37 ENSG00000138294.9
myOverlapTable[[1]][[1]][[2]][,1:3]
##    DATE.ADDED.TO.CATALOG PUBMEDID  FIRST.AUTHOR
## 1             2008-06-16 18264097      Eeles RA
## 2             2008-06-16 18264096      Thomas G
## 3             2011-07-30 21743057 Schumacher FR
## 4             2010-09-11 20676098      Takata R
## 5             2011-01-13 21160077 Gudmundsson J
## 6             2013-03-22 23269536         Sun J
## 7             2013-09-18 23555189        Chen Z
## 8             2014-11-09 24740154      Lange EM
## 9             2014-11-07 24753544      Knipe DW
## 10            2016-06-13 26443449        Wang M
## 11            2016-02-12 25939597     Berndt SI
## 12            2016-07-15 26034056   Hoffmann TJ
## 13            2016-07-15 26034056   Hoffmann TJ

Co-expression

Two functions makeCoXpGene and makeCoXpTranscript are included to test co-expression between gene or transcript pairs in specific tissues. For each pair of genes or transcripts in the specified tissues, co-expression is tested using cor.test. Two flags makePlot and saveFile can be used to save a pdf with corresponding plots and/or a csv with the results of cor.test. The plot includes expression of one gene on the x-axis and the other gene on the y-axis with results of the cor.test as part of the figure legend. The csv includes the cor.test output for all combinations of genes or transcripts in the specified tissues.

# plot co-expressions between genes using count normalized data
makeCoXpGene(myMgl, gene = c("RP11-109G10.2", "NCOA4"), makePlot = FALSE, saveFile = FALSE, data = 7, tissue = c('Prostate'))
## $`RP11-109G10.2|NCOA4`
## $`RP11-109G10.2|NCOA4`$Prostate
## 
##  Pearson's product-moment correlation
## 
## data:  g1 and g2
## t = 7.0002, df = 85, p-value = 5.546e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4515997 0.7232408
## sample estimates:
##       cor 
## 0.6047216

DNAse hypersensitivity sites

To show only those results from Maurano et. al that are significant

# filter DNAse results for significant p-values
makeDnaseSig(myMgl) -> results
#
results
## $`RP11-109G10.2`
## character(0)
## 
## $`RP11-481A12.2`
## character(0)
## 
## $MSMB
## [1] "rs7904463"  "rs17178655"
## 
## $NCOA4
## character(0)
## 
## $TIMM23
## character(0)

Allelic ratios

Allelic ratios can be plotted for one gene at a time using makeAeiPlot. A single pdf entitle ‘AeiPlot_gene.pdf’ will be generated with each individual as a separate page in the document. Tissues are displayed on the x-axis and allelic ratios on the y-axis. Each do represents an individual SNP. Dot colors correspond to the SNP (identified by its position), while dot shapes correspond to the class of SNP - intron, exon, etc. The results used to generate the plot can be saved in an csv file using the ‘saveFile’ flag.

# plot allelic ratios
makeAeiPlot(myMgl, gene = c('NCOA4'), saveFile = FALSE) -> myAeiPlot

MultiEqtl

The function makeMultiEqtl was designed to identify eQTL variants that serve as eQTLs for multiple candidate genes. The search can be done in a tissue-specific fashion by specifying a particular tissue using the tissue flag or irrespective of tissue by using tissue = 'All'. There are two optional flags makePlot and saveFile that will generate, respectively, either a pdf or a csv summarizing the results.

# if there are no MultiEqtls in a given tissue an error message appears 
makeMultiEqtl(myMgl, makePlot = FALSE, saveFile = FALSE, cis = TRUE, trans = TRUE, tissue = c('All', 'Prostate')) -> myMultiEqtl
#
# summarize MultiEqtls
makeMultiEqtl(myMgl, makePlot = FALSE, saveFile = FALSE, cis = TRUE, trans = TRUE, tissue = c('All')) -> myMultiEqtl
#
# output of makeMultiEqtl is a list with four elements
# note in the event of multiple tissues the output is a list with length corresponding to the number tissues where each element is a list with four elements as described below
class(myMultiEqtl)
## [1] "list"
names(myMultiEqtl)
## [1] "fullTable"  "ranking"    "topSnps"    "topSummary"
#
# fullTable is a dataframe where each row is an eQTL
myMultiEqtl[[1]][1:3,]
##              V1       V2        V3  V4
## 1 RP11-109G10.2 Prostate rs2611504 cis
## 2 RP11-109G10.2 Prostate rs4630240 cis
## 3          MSMB     Lung rs2611504 cis
#
# ranking is a list of tables ranking the number of genes each eQTL is assigned to  
myMultiEqtl[[2]][[1]][1:3]
## 
##              rs2611504              rs4630240 10_51501963_AGTT_A_b37 
##                      3                      3                      2
#
# topSnps is a list of vectors of eQTLs appearing for the maximum number of genes (i.e. if the most number of genes an eQTL applies to is 3 all eQTLs assinged to 3 genes will be displayed)
myMultiEqtl[[3]]
## [[1]]
## 
## rs2611504 rs4630240 
##         3         3
#
# topSummary shows the information from fullTable (gene, tissue, snp) for those snps included in topSnps
myMultiEqtl[[4]]
## [[1]]
##                V1                              V2        V3  V4
## 1   RP11-109G10.2                        Prostate rs2611504 cis
## 3            MSMB                            Lung rs2611504 cis
## 81           MSMB Skin_Not_Sun_Exposed_Suprapubic rs2611504 cis
## 207          MSMB      Skin_Sun_Exposed_Lower_leg rs2611504 cis
## 368          MSMB                         Stomach rs2611504 cis
## 545         NCOA4                     Whole_Blood rs2611504 cis
## 2   RP11-109G10.2                        Prostate rs4630240 cis
## 4            MSMB                            Lung rs4630240 cis
## 399          MSMB                         Stomach rs4630240 cis
## 575         NCOA4                     Whole_Blood rs4630240 cis

Summary

The function makeSummary was designed to give a quick overview of the number of SNPs identified for each candidate gene. Of particular interest maybe those genes with molecular-based SNPs (e.g eQTLs) that do not have GWAS hits and vice versa.

# summarize associated SNPs
makeSummary(myMgl, saveFile = F) -> summary
#
summary
##               DNAse TranseQTL CiseQTL sqtlSeek sqtlAltrans Ptv GwasCatalog
## RP11-109G10.2 0     0         2       0        0           0   0          
## RP11-481A12.2 0     0         1       0        0           0   0          
## MSMB          2     0         454     0        0           0   5          
## NCOA4         0     0         163     0        6           6   4          
## TIMM23        0     0         22      0        2           0   2          
##               GRASP
## RP11-109G10.2 2    
## RP11-481A12.2 0    
## MSMB          9    
## NCOA4         11   
## TIMM23        2