Simultaneous analysis of genetic associations with multiple phenotypes may reveal shared genetic susceptibility across traits (pleiotropy). CPBayes is a Bayesian meta-analysis method for studying cross-phenotype genetic associations. It uses summary-level data across multiple phenotypes to simultaneously measure the evidence of aggregate-level pleiotropic association and estimate an optimal subset of traits associated with the risk locus. CPBayes is based on a spike and slab prior and is implemented using Gibbs sampling, a Markov chain Monte Carlo (MCMC) technique.

This R package provides five main functions:

- cpbayes_uncor: It implements CPBayes for uncorrelated summary statistics. The summary statistics across traits/studies are uncorrelated when the studies have no overlapping subjects.
- cpbayes_cor: It implements CPBayes for correlated summary statistics. The summary statistics across traits/studies are correlated when the studies have overlapping subjects or the phenotypes were measured in a cohort study.
- post_summaries: It summarizes the MCMC data produced by cpbayes_uncor or cpbayes_cor. It computes additional summaries to provide a better insight into a pleiotropic signal. It works in the same way for both cpbayes_uncor and cpbayes_cor.
- forest_cpbayes: It creates a forest plot presenting the pleiotropy result obtained by cpbayes_uncor or cpbayes_cor. It works in the same way for both cpbayes_uncor and cpbayes_cor.
- estimate_corln: It computes an approximate correlation matrix of the beta-hat vector for multiple overlapping case-control studies using the sample-overlap matrices.

You can install CPBayes from CRAN.

```
install.packages("CPBayes")
library("CPBayes")
```

The function estimate_corln estimates the correlation matrix of the beta-hat vector for multiple overlapping case-control studies from the sample-overlap matrices. These matrices describe the number of cases or controls shared between studies/traits, and the number of subjects who are cases for one study/trait but controls for another. For a cohort study, the phenotypic correlation matrix should be a reasonable substitute for this correlation matrix.

```
# Example data of sample-overlap matrices
SampleOverlapMatrixFile <- system.file("extdata", "SampleOverlapMatrix.rda", package = "CPBayes")
load(SampleOverlapMatrixFile)
SampleOverlapMatrix
## $n11
## trait1 trait2 trait3 trait4 trait5
## trait1 9048 4647 2985 2835 1812
## trait2 4647 13565 3873 4245 2419
## trait3 2985 3873 14681 6285 2044
## trait4 2835 4245 6285 16697 2059
## trait5 1812 2419 2044 2059 7121
##
## $n00
## trait1 trait2 trait3 trait4 trait5
## trait1 44683 35765 32987 30821 39374
## trait2 35765 40166 29358 27714 35464
## trait3 32987 29358 39050 28638 33973
## trait4 30821 27714 28638 37034 31972
## trait5 39374 35464 33973 31972 46610
##
## $n10
## trait1 trait2 trait3 trait4 trait5
## trait1 0 4401 6063 6213 7236
## trait2 8918 0 9692 9320 11146
## trait3 11696 10808 0 8396 12637
## trait4 13862 12452 10412 0 14638
## trait5 5309 4702 5077 5062 0
```

SampleOverlapMatrix is a list that contains an example of the sample-overlap matrices for five different diseases in the Kaiser GERA cohort (real data). The list consists of three matrices. SampleOverlapMatrix$n11 provides the number of cases shared between all possible pairs of studies/traits. SampleOverlapMatrix$n00 provides the number of controls shared between all possible pairs of studies/traits. SampleOverlapMatrix$n10 provides the number of subjects who are cases for one study/trait and controls for another; in the example above, the row gives the trait for which the subjects are cases and the column gives the trait for which they are controls, and the diagonal is zero. For a more detailed explanation, see the Arguments section of estimate_corln in the CPBayes manual.
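
The layout of the three matrices can be sanity-checked with a short sketch such as the one below. It copies the values printed above so that it runs without the package, and it assumes that every subject has a case/control status for all five traits, as in a single cohort. The helper `as_overlap` is a hypothetical convenience function and is not part of CPBayes.

```r
# Values copied from the printed SampleOverlapMatrix example above.
traits <- paste0("trait", 1:5)
# Hypothetical helper (not part of CPBayes): build a named 5 x 5 matrix row by row.
as_overlap <- function(v) {
  matrix(v, nrow = 5, byrow = TRUE, dimnames = list(traits, traits))
}
n11 <- as_overlap(c(
   9048,  4647,  2985,  2835,  1812,
   4647, 13565,  3873,  4245,  2419,
   2985,  3873, 14681,  6285,  2044,
   2835,  4245,  6285, 16697,  2059,
   1812,  2419,  2044,  2059,  7121))
n00 <- as_overlap(c(
  44683, 35765, 32987, 30821, 39374,
  35765, 40166, 29358, 27714, 35464,
  32987, 29358, 39050, 28638, 33973,
  30821, 27714, 28638, 37034, 31972,
  39374, 35464, 33973, 31972, 46610))
n10 <- as_overlap(c(
      0,  4401,  6063,  6213,  7236,
   8918,     0,  9692,  9320, 11146,
  11696, 10808,     0,  8396, 12637,
  13862, 12452, 10412,     0, 14638,
   5309,  4702,  5077,  5062,     0))

K <- nrow(n11)
# Row k of these helper matrices repeats the number of cases (controls) for trait k.
cases_per_trait    <- matrix(diag(n11), K, K)
controls_per_trait <- matrix(diag(n00), K, K)

# Assumption: single cohort. A case for trait k is either a case or a control for trait l.
cases_ok <- all(n11 + n10 == cases_per_trait)
# Likewise, a control for trait k is either a control for trait l or a case for trait l.
controls_ok <- all(n00 + t(n10) == controls_per_trait)

cases_ok
controls_ok
```

Both checks rely on the single-cohort assumption; for separate studies that only partially overlap, these identities need not hold.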

```
# Estimate the correlation matrix of correlated beta-hat vector
n11 <- SampleOverlapMatrix$n11
n00 <- SampleOverlapMatrix$n00
n10 <- SampleOverlapMatrix$n10
cor <- estimate_corln(n11, n00, n10)
cor
## trait1 trait2 trait3 trait4 trait5
## trait1 1.000000000 0.270490702 0.05723195 0.002505875 0.08989408
## trait2 0.270490702 1.000000000 0.01601813 0.002744961 0.07849158
## trait3 0.057231953 0.016018131 1.00000000 0.155476792 0.01211052
## trait4 0.002505875 0.002744961 0.15547679 1.000000000 -0.01824859
## trait5 0.089894085 0.078491585 0.01211052 -0.018248589 1.00000000
```
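
Before passing an estimated matrix on to further analysis, it can be worth confirming that it behaves like a correlation matrix. The sketch below is only an illustration: `is_valid_corr` is a hypothetical helper, not a CPBayes function, and the matrix is re-entered from the rounded values printed above, so symmetry is checked up to a small tolerance.

```r
# Rounded values copied from the printed estimate_corln output above.
traits <- paste0("trait", 1:5)
corln <- matrix(c(
  1.000000000, 0.270490702, 0.05723195,  0.002505875,  0.08989408,
  0.270490702, 1.000000000, 0.01601813,  0.002744961,  0.07849158,
  0.057231953, 0.016018131, 1.00000000,  0.155476792,  0.01211052,
  0.002505875, 0.002744961, 0.15547679,  1.000000000, -0.01824859,
  0.089894085, 0.078491585, 0.01211052, -0.018248589,  1.00000000),
  nrow = 5, byrow = TRUE, dimnames = list(traits, traits))

# Hypothetical helper (not part of CPBayes): basic checks on a correlation matrix.
is_valid_corr <- function(m, tol = 1e-6) {
  sym_part <- (m + t(m)) / 2  # symmetrise before the eigenvalue check
  eig <- eigen(sym_part, symmetric = TRUE, only.values = TRUE)$values
  c(symmetric     = max(abs(m - t(m))) < tol,
    unit_diagonal = all(abs(diag(m) - 1) < tol),
    bounded       = all(abs(m) <= 1 + tol),
    positive_def  = min(eig) > 0)
}

checks <- is_valid_corr(corln)
checks
```

The tolerance of 1e-6 is an arbitrary choice made here to absorb rounding in the printed output.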

The correlation matrix that estimate_corln computes from the sample-overlap matrices is an approximation, as is the phenotypic correlation matrix used as a substitute in a cohort study. These approximations of the correlation structure are accurate when none of the diseases/traits is associated with the environmental covariates or the genetic variant. While demonstrating cpbayes_cor, we used simulated data for 10 overlapping case-control studies, each with a distinct set of 7000 cases and a common set of 10000 controls shared across all studies. For these data, we used estimate_corln to estimate the correlation matrix of the correlated beta-hat vector from the sample-overlap matrices.

**Important note on estimating the correlation structure of the correlated beta-hat vector:** In general, environmental covariates are expected to be present in a study and to be associated with the phenotypes of interest. A small proportion of genome-wide genetic variants is also expected to be associated with the phenotypes. Hence the above approximations of the correlation matrix may not be accurate, and in general we recommend an alternative strategy that estimates the correlation matrix from genome-wide summary statistics across traits, as follows. First, extract all SNPs whose trait-specific univariate association p-values are > 0.1 for every trait. These p-values are obtained from the beta-hat and standard error for each trait. Each SNP selected in this way is either weakly associated or not associated with any of the phenotypes (a null SNP). Next, select a set of independent null SNPs from the initial set of null SNPs using a threshold of r^2 < 0.01 (r: the correlation between the genotypes at a pair of SNPs). In the absence of in-sample linkage disequilibrium (LD) information, one can use LD information from a reference panel for this screening. Finally, compute the correlation matrix of the effect estimates (beta-hat vector) as the sample correlation matrix of the beta-hat vector across all selected independent null SNPs.

This strategy is more general and is applicable to a cohort study or to multiple overlapping studies for binary or quantitative traits with arbitrary distributions. It is also useful when the beta-hat vectors for multiple non-overlapping studies become correlated because of genetically related individuals across studies. Misspecification of the correlation structure can affect the results produced by CPBayes to some extent. Hence, if genome-wide summary statistics across traits are available, we strongly recommend using this alternative strategy to estimate the correlation matrix of the beta-hat vector.
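
The null-SNP strategy can be sketched in base R as follows. This is a minimal illustration on simulated data, not code from the package: the object names (`beta_hat`, `se`, `true_corr`, `corln_null`) are hypothetical, the simulated SNPs are assumed to be independent already (so the LD screening step at r^2 < 0.01 is taken as done), and the standard error is assumed to be constant within each trait.

```r
set.seed(1)
n_snps <- 20000  # simulated genome-wide SNPs, assumed already LD-pruned
K <- 3           # number of traits

# Hypothetical correlation between effect estimates, used only to simulate data.
true_corr <- matrix(c(1.0, 0.5, 0.2,
                      0.5, 1.0, 0.3,
                      0.2, 0.3, 1.0), K, K)

# Simulate correlated null z-scores, then convert them to beta-hat = z * SE.
z  <- matrix(rnorm(n_snps * K), n_snps, K) %*% chol(true_corr)
se <- matrix(rep(c(0.020, 0.025, 0.030), each = n_snps), n_snps, K)
beta_hat <- z * se
colnames(beta_hat) <- colnames(se) <- paste0("trait", 1:K)

# Step 1: trait-specific two-sided p-values from beta-hat and standard error.
p_val <- 2 * pnorm(-abs(beta_hat / se))

# Step 2: keep SNPs whose p-value is > 0.1 for every trait (null SNPs).
is_null <- rowSums(p_val > 0.1) == K

# Step 3: sample correlation matrix of beta-hat across the selected null SNPs.
# stats::cor is called explicitly in case an object named `cor` exists in the session.
corln_null <- stats::cor(beta_hat[is_null, , drop = FALSE])

sum(is_null)
round(corln_null, 3)
```

With real data, `beta_hat` and `se` would instead hold the genome-wide summary statistics (one row per SNP, one column per trait), and the LD screening would be carried out with in-sample or reference-panel LD before the final step.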

See our paper for more details: Arunabha Majumdar, Tanushree Haldar, Sourabh Bhattacharya, John Witte. An efficient Bayesian meta-analysis approach for studying cross-phenotype genetic associations (submitted), available at: http://biorxiv.org/content/early/2017/01/18/101543.