# this vignette is not created if snpStats is not installed
if (!require("snpStats")) {
  knitr::opts_chunk$set(eval = FALSE)
}
## Loading required package: snpStats
## Loading required package: survival
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:S4Vectors':
## 
##     expand

Introduction

In this vignette we demonstrate the use of snpClust function in the adjclust package. snpClust performs adjacency-constrained hierarchical clustering of single nucleotide polymorphisms (SNPs), where the similarity between SNPs is defined by linkage disequilibrium (LD).

This function implements the algorithm described in the third chapter of [2]. It is an extension of the algorithm described in [3]. Denoting by \(p\) the number of SNPs to cluster and assuming that the similarity between SNPs whose indices are more distant than \(h\), its time complexity is \(O(p (\log(p) + h))\), and its space complexity is \(O(hp)\).

library("adjclust")

Loading and displaying genotype data

The begining of this vignette closely follows the “LD vignette” of the SnpStats package [1]. First, we load genotype data.

library("matrixStats")
data("ld.example", package = "snpStats")

We focus on the ceph.1mb data.

geno <- ceph.1mb[, -316]  ## drop one SNP leading to one missing LD value
p <- ncol(geno)
nSamples <- nrow(geno)
geno
## A SnpMatrix with  90 rows and  602 columns
## Row names:  NA06985 ... NA12892 
## Col names:  rs5993821 ... rs5747302

These data are drawn from the International HapMap Project and concern 602 SNPs1 over a 1Mb region of chromosome 22 in sample of 90 Europeans.

We can compute and display the LD between these SNPs.

ld.ceph <- snpStats::ld(geno, stats = "R.squared", depth = p-1)
image(ld.ceph, lwd = 0)

Adjacency-constrained Hierarchical Agglomerative Clustering

The snpClust function can handle genotype data as an input:

fit <- snpClust(geno, stats = "R.squared")
## Note: 134 merges with non increasing heights.

The above figure suggests that the LD signal is concentrated close to the diagonal. We can focus on a diagonal band with the bandwidth parameter h:

fitH <- snpClust(geno, h = 100, stats = "R.squared")
## Note: 132 merges with non increasing heights.
fitH
## 
## Call:
## adjClust(mat = x, type = "similarity", h = h)
## 
## Cluster method   : snpClust 
## Number of objects: 602

Output

The output of the snpClust is of class chac. In particular, it can be plotted as a dendrogram silently relying on the function plot.dendrogram:

plot(fitH, type = "rectangle", leaflab = "perpendicular")
## Warning in plot.chac(fitH, type = "rectangle", leaflab = "perpendicular"): 
## Detected reversals in dendrogram: mode = 'corrected', 'within-disp' or 'total-disp' might be more relevant.

Moreover, the output contains an element named merge which describes the successive merges of the clustering, and an element gains which gives the improvement in the criterion optimized by the clustering at each successive merge.

head(cbind(fitH$merge, fitH$gains))
##      [,1] [,2]
## [1,]   -1   -2
## [2,] -255 -256
## [3,] -488 -489
## [4,] -487    3
## [5,] -486    4
## [6,] -234 -235

Other types of input

In this section we show how the snpClust function can also be applied directly to LD values.

h <- 100
ld.ceph <- snpStats::ld(geno, stats = "R.squared", depth = h)
image(ld.ceph, lwd = 0)

Due to numerical errors in the LD estimation, some of the estimated values are slightly larger than 1, so we round the estimates to 1e-10:

max(ld.ceph - 1)
## [1] 1.332268e-15
ld.ceph <- round(ld.ceph, digits = 10)

Note that when snpClust is called directly, this sanity check is performed internally before applying the classification algorithm.

We can now apply snpClust to this LD matrix (of class Matrix::dgCMatrix):

fitL <- snpClust(ld.ceph, h)
## Note: forcing the diagonal of the LD similarity matrix to be 1
## Note: 139 merges with non increasing heights.

snpClust also handles inputs of class base::matrix:

gmat <- as(geno, "matrix")
fitM <- snpClust(geno, h, stats = "R.squared")
## Note: 132 merges with non increasing heights.

References

[1] Clayton D. (2015). snpStats: SnpMatrix and XSnpMatrix classes and methods. R package version 1.20.0

[2] Dehman A. (2015). Spatial clustering of linkage disequilibrium blocks for genome-wide association studies. Phd Thesis, Université Paris Saclay.

[3] Dehman A., Ambroise C., Neuvial P. (2015). Performance of a blockwise approach in variable selection using linkage disequilibrium information. BMC Bioinformatics, 16, 148.

Session information

sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] matrixStats_0.52.2   snpStats_1.26.0      Matrix_1.2-12       
##  [4] survival_2.41-3      adjclust_0.5.6       HiTC_1.20.0         
##  [7] GenomicRanges_1.28.4 GenomeInfoDb_1.12.2  IRanges_2.10.2      
## [10] S4Vectors_0.14.3     BiocGenerics_0.22.0 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.13               compiler_3.4.3            
##  [3] RColorBrewer_1.1-2         XVector_0.16.0            
##  [5] bitops_1.0-6               tools_3.4.3               
##  [7] zlibbioc_1.22.0            digest_0.6.12             
##  [9] evaluate_0.10.1            lattice_0.20-35           
## [11] DelayedArray_0.2.7         yaml_2.1.14               
## [13] GenomeInfoDbData_0.99.0    rtracklayer_1.36.4        
## [15] stringr_1.2.0              knitr_1.17                
## [17] Biostrings_2.44.2          rprojroot_1.2             
## [19] grid_3.4.3                 Biobase_2.36.2            
## [21] XML_3.98-1.9               BiocParallel_1.10.1       
## [23] rmarkdown_1.7              magrittr_1.5              
## [25] backports_1.1.1            Rsamtools_1.28.0          
## [27] htmltools_0.3.6            GenomicAlignments_1.12.2  
## [29] splines_3.4.3              SummarizedExperiment_1.6.3
## [31] stringi_1.1.5              RCurl_1.95-4.8

  1. We have dropped SNP rs2401075 because it produced a missing value due to the lack of genetic diversity in the considered sample.