starmie: Admixture models

Stuart Lee, Gerry Tonkin-Hill

2016-11-07

In this vignette we use starmie to perform visualisation and inference on a sequence of ADMIXTURE models fit on the sample data supplied by the creators of ADMIXTURE.

Loading ADMIXTURE output and the ‘admix’ object

To begin we assume ADMIXTURE has been run on an example PED/BED file to analyse and there is a directory containing the resulting .P, .Q and logging files (if you want to perform inference).

To read a single ADMIXTURE run first supply the names of .P and .Q file and save it as an admix object using the loadAdmixture function.

library(starmie)
# get Q and P files for K = 3
p3 <- system.file("extdata/hapmap3_files", "hapmap3.3.P", package = "starmie")
q3 <- system.file("extdata/hapmap3_files", "hapmap3.3.Q", package = "starmie")

k3_admix <- loadAdmixture(q3, p3)
k3_admix
## admix object containing run information for k = 3 
## Model parameters:
##    No. samples: 324 
##    No. markers: 13928 
## Model diagnostics available: FALSE

Optionally, if you would like to look at model fit statstics you can read in a log file as well. Now the admix object contains the estimated log-likelihood and cross-validation error.

k3_log <- system.file("extdata/hapmap3_files", "log3.out", package = "starmie")

k3_admix <- loadAdmixture(q3, p3, k3_log)
k3_admix
## admix object containing run information for k = 3 
## Model parameters:
##    No. samples: 324 
##    No. markers: 13928 
## Model diagnostics available: TRUE 
## Model fit statistics:
##   K     logL CVerror
## 1 3 -3799887 0.47835

In general an admix object consists of the following elements about a single ADMIXTURE run:

attributes description
K K parameter supplied to ADMIXTURE
nsamples Number of samples
nmarkers Number of markers
Q_df Individual ancestral probability of membership to cluster
P_df Estimated ancestral allele frequencies for each cluster
log_info Model fit statistics

Constructing a barplot for a single admix object

The plotBar function works on admix objects and produces a facetted barplot in the same manner as for struct objects.

plotBar(k3_admix)

A regular barplot can be constructed by setting facet = FALSE.

plotBar(k3_admix, facet = FALSE)

Putting admix objects together the admixList object

If you have tried running ADMIXTURE with many different values of \(K\) then you can use an admixList object to maninpulate them. To construct an admixList object use loadAdmixture for each pair of Q and P files and then pass the results to the admixList constructor function.

For example we have run ADMIXTURE on the sample data set with \(K = 5\) using five-fold cross-validation.

admix_multi <- exampleAdmixture()
admix_multi
## admixList object containing 5  ADMIXTURE runs.
## Number of Ks by number of runs:
## 1 2 3 4 5 
## 1 1 1 1 1 
## Model fit information by K:
##   K     logL CVerror
## 1 1 -4169323 0.55248
## 2 2 -3835365 0.48190
## 3 3 -3799887 0.47835
## 4 4 -3789059 0.48236
## 5 5 -3778956 0.49001

You can also plot multiple admix objects to see how cluster memberships change for different values of \(K\).

plotMultiK(admix_multi)

Using ADMIXTURE output to perform inference.

The bestK method can be used to determine which value of K best explains the observed data using an elbow plot on the cross-validation error.

bestK(admix_multi)
## Creating diagnositc plots for admixture runs

##   K     logL CVerror
## 1 1 -4169323 0.55248
## 2 2 -3835365 0.48190
## 3 3 -3799887 0.47835
## 4 4 -3789059 0.48236
## 5 5 -3778956 0.49001

References

Alexander, D. H., Novembre, J. & Lange, K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655–1664 (2009).

sessionInfo

sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.6 (El Capitan)
## 
## locale:
## [1] C/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] starmie_0.1.2
## 
## loaded via a namespace (and not attached):
##  [1] gmp_0.5-12          Rcpp_0.12.7         highr_0.6          
##  [4] formatR_1.4         plyr_1.8.4          iterators_1.0.8    
##  [7] tools_3.3.1         digest_0.6.10       evaluate_0.10      
## [10] tibble_1.2          gtable_0.2.0        lattice_0.20-34    
## [13] Matrix_1.2-7.1      ggrepel_0.5         yaml_2.1.13        
## [16] parallel_3.3.1      expm_0.999-0        ggdendro_0.1-20    
## [19] gridExtra_2.2.1     stringr_1.1.0       knitr_1.14         
## [22] combinat_0.0-8      grid_3.3.1          data.table_1.9.6   
## [25] MCL_1.0             rmarkdown_1.1       reshape2_1.4.1     
## [28] ggplot2_2.1.0       purrr_0.2.2         readr_1.0.0        
## [31] tidyr_0.6.0         magrittr_1.5        scales_0.4.0       
## [34] iterpc_0.3.0        htmltools_0.3.5     MASS_7.3-45        
## [37] assertthat_0.1      lpSolve_5.6.13      colorspace_1.2-7   
## [40] labeling_0.3        label.switching_1.6 stringi_1.1.2      
## [43] proxy_0.4-16        munsell_0.4.3       chron_2.3-47