Use clusterhap

Sebastian Simondi,Victoria Bonnecarrere, Lucia Gutierrez, Gaston Quero


What is clusterhap?

clusterhap function identifies haplotypes within QTL (Quantitative Trait Loci). One haplotype is a combination of SNP (Single Nucleotide Polymorphisms) within the QTL. This function groups together all individuals of a population with the same haplotype. Each group contains individual with the same allele in each SNP, whether or not missing data. Thus, clusterhap groups individuals, that to be imputed, have a non-zero probability of having the same alleles in the entire sequence of SNP’s. clusterhap does not impute missing data automatically as if they do packages such as this. ( alleHap,hapassoc,hsphase)

The function return five reports: a. haplotypes (all haplotypes identified), b. haplotypes_frequencies (the frequency of each haplotype), c. duplicates (individuals assigned to more than one haplotype), d. haplotypes_result (a matrix with individuals assigned to each haplotype, and sorted haplotypes), the function plot this matrix, e. underterminates (not assigned individuals)

Note: the user must decide which haplotype assign to the individuals which were assigned to more than one haplotype. Eventually, the user must manually remove from this matrix the duplicate haplotype. A decision criterion might be looking at the frequency of haplotypes and selects the most frequent haplotype.

Data Input Format

clusterhap uses a simple numeric matrix of individual (row) and SNP (columns). Given a QTL, the user should transform all SNP alleles of each individual in a number, in this way: the bases A, C, G and T are changed to 1, 2, 3 and 4, respectively. A zero (0) is assigned to all SNP with missing data (NA). Therefore, the matrix coordinates are 0, 1, 2, 3 or 4.

Theoretical Description

The clusterhap function first builds the set Y, composed of all the complete sequences of SNP’s the QTL, whose elements call haplotypes. With this objective, the function first counts all zeros within the vector. When there are not zero, the individual have not missing data in all SNP and it is assigned as an element in set Y. Once built the set Y, clusterhap function assigns to each haplotype the population`s vector which contains the same numbers in all coordinates nonzero. i.e. clusterhap associates to each QTL’s SNP complete sequence, all individuals which in each locus with data, has the same allele. For example, clusterhap assigns the individual 1 defined by [0,2,0,1,1,4,2] to haplotype 1 defined by [1,2,3,1,1,4,2] since it match in all nonzero coordinates. On the other hand, clusterhap does not associate individual 2 defined by [0,2,0,1,1,4,3] to this haplotype since the last coordinate differs. Originally, haplotype 1 had a C in the last SNP and individual 2 a G. Therefore, having been imputed individual 2 never coincides with haplotype 1. An individual may be associated with more than one haplotype or none, in which case they will be labeled as indeterminate. Thus, most individuals are univocally determined and associated with a single haplotype.

clusterhap examples:

Simple simulated data

The sim_qtl data.frame included with the package has simulated results for 7 SNPs on 5 individuals.

ind SNP.1 SNP.2 SNP.3 SNP.4 SNP.5 SNP.6 SNP.7
ind.1 A C G A A T C
ind.2 NA C NA A A T C
ind.3 C C T A A T C
ind.4 NA C NA A A T G
ind.5 C NA NA A A T C

It is transformed to:

ind SNP.1 SNP.2 SNP.3 SNP.4 SNP.5 SNP.6 SNP.7
ind.1 1 2 3 1 1 4 2
ind.2 0 2 0 1 1 4 2
ind.3 2 2 4 1 1 4 2
ind.4 0 2 0 1 1 4 3
ind.5 2 0 0 1 1 4 2

Only individuals 1 and 3 have complete SNP sequence for this QTL. Therefore:

    Y = {H.1 = [1,2,3,1,1,4,2], H.2 = [2,2,4,1,1,4,2]}

Individuals 1 and 2 is assigned to H.1 . The first has the complete sequence and the second in all nonzero data has the same alleles. None other individual is assigned to this haplotype since all remaining individuals differs in the identity of at least one SNP. Following the reasoning, individuals 2, 3 and 5 are assigned to H.2 . Notice that individual 2 was assigned to both haplotypes given that nonzero SNP coincide with both H.1 and H.2. On the other hand, individual 4 was not assigned to any haplotype due to SNP.7 does not match with no element of set Y. The clusterhap function classifies individual 4 as indeterminate.

clusterhap(sim_qtl, Print=TRUE)
#> $h.result
#>    id.geno SNP.1 SNP.2 SNP.3 SNP.4 SNP.5 SNP.6 SNP.7    haplo.qtl
#> 1    ind.1     1     2     3     1     1     4     2            1
#> 2    ind.2     0     2     0     1     1     4     2            1
#> 21   ind.2     0     2     0     1     1     4     2            2
#> 3    ind.3     2     2     4     1     1     4     2            2
#> 5    ind.5     2     0     0     1     1     4     2            2
#> 4    ind.4     0     2     0     1     1     4     3 undetermined
#> $haplotypes
#> SNP.1 SNP.2 SNP.3 SNP.4 SNP.5 SNP.6 SNP.7
#> 1 haplo 1     1     2     3     1     1     4     2
#> 3 haplo 2     2     2     4     1     1     4     2
#> $duplicates
#>    id.geno SNP.1 SNP.2 SNP.3 SNP.4 SNP.5 SNP.6 SNP.7 haplo.qtl
#> 2    ind.2     0     2     0     1     1     4     2         1
#> 21   ind.2     0     2     0     1     1     4     2         2
#> $freq
#>       freq haplo.1 haplo.2 undetermined
#> 1 freq.abs     2.0       3          1.0
#> 2 freq.rel    33.3      50         16.7
#> $und
#>     ind SNP.1 SNP.2 SNP.3 SNP.4 SNP.5 SNP.6 SNP.7
#> 4 ind.4     0     2     0     1     1     4     3

Real experimental data

The rice_qtl data.frame included with the package The Uruguayan Rice Breeding GWAS (URiB) population is composed of 637 genotypes from the INIA’s Rice Breeding Program (IRBP). In this example we use the information in one of those populations. The population has 324 indica lines, and 2 indica cultivars (El Paso 144 and INIA Olimar). The population was genotyped by SNPs obteined by Genotyping-by sequencing (GBS).The data is a QTL for Grain Quality.



1. clusterhap function calculates the amount of null components (cq) in each individual of the database.

H_data <- x # x is the data frame
cq <- NULL
Q <- H_data[, -1]
  for (i in 1:nrow(Q)) {
        for (j in 1:nrow(Y)) {
            cq <- rowSums(Q[i, ] == 0)

2. On the other hand, clusterhap function calculates vector coordinate to coordinate subtracts, between the individual and each haplotype or element of set Y, counting the number of zeros (cr).

y <- which(c.Q == 0)
b <- Q[y, ]
Y <- b[!duplicated(b), ]
 for (i in 1:nrow(Q)) {
        for (j in 1:nrow(Y)) {
            cq <- rowSums(Q[i, ] == 0)
            w <- Q[i, ] - Y[j, ]
            cr <- rowSums(w == 0)

3. If the sum of cq+cr, is equal to the amount of SNP within the QTL, then associate the individual to the haplotype. Note that the subtraction vector has zeros in coordinates where both vectors (individual and haplotype) contain the same number. When two SNP differs, the resulting vector presents no null coordinates; particularly when the individual contains zeros, due to the haplotype is a complete SNP sequence with no missing data. The only way that the sum cq+cr is equal to the amount of QTL and SNP is when individual and haplotype coordinates differ only in null coordinate of individual. i.e. individual not null coordinate coincide with haplotype coordinates. Hence, of being imputed missing SNP of the individual, this SNP has non-zero probability to match the haplotypes that was assigned by the function. Those individuals assigned to non haplotype are classified as indeterminate.

for (i in 1:nrow(Q)) {
        for (j in 1:nrow(Y)) {
            cq <- rowSums(Q[i, ] == 0)
            w <- Q[i, ] - Y[j, ]
            cr <- rowSums(w == 0)
            zeros <- cq + cr
            if (zeros == ncol(Q)) {
                hp <- cbind(hp, i)
                hp.1 <- cbind(hp.1, j)
                hpl <- cbind(Q[i, ], j)
                hpl1 <- cbind(id[i], hpl)
                hplq <- rbind(hplq, hpl1)


Burkett, K. et al. (2006) hapassoc: Software for Like lihood Inference of Trait Associations with SNP Haplotypes and Other Attributes. J Stat Soft., 16, 1-19.

Ferdosi, M. H. et al .(2014) hsphase: an R package for pedigree reconstruction, detection of recombination events, phasing and imputation of half-sib family groups. BMC Bioinformatics 15, 172.

Medina-Rodriguez, Nathan and Santana, A. (2015) alleHap: Allele Imputation and Haplotype Reconstruction from Pedigree Databases, R,package version 0.9.2.