**aphid** is an R package for the development and application of hidden Markov models and profile HMMs for biological sequence analysis. It contains functions for multiple and pairwise sequence alignment, model construction and parameter optimization, file import/export, implementation of the forward, backward and Viterbi algorithms for conditional sequence probabilities, tree-based sequence weighting, and sequence simulation. The package has a wide variety of uses including database searching, gene-finding and annotation, phylogenetic analysis and sequence classification.

Hidden Markov models (HMMs) underlie many of the most important tasks in computational biology, including multiple sequence alignment, genome annotation, and increasingly, sequence database searching. Originally developed for speech recognition algorithms, their application to the field of molecular biology has increased dramatically since advances in computational capacity have enabled full probabilistic analysis in place of heuristic approximations. Pioneering this transition are two groups lead by Anders Krogh and Sean Eddy, whose respective software packages SAM and HMMER have underpinned HMM-based bioinformatic analysis for over two decades.

Here, we present the **aphid** package for analysis with profile hidden Markov models in the R environment (R Core Team 2017). The package contains functions for developing, plotting, importing and exporting both standard and profile HMMs, as well as implementations of the forward, backward and Viterbi algorithms for computing full and optimal conditional sequence probabilities. The package also features a multiple sequence alignment tool that produces high quality alignments *via* profile HMM training.

The **aphid** package is designed to work in conjunction with the “DNAbin” and “AAbin” object types produced using the **ape** package (Paradis, Claude, and Strimmer 2004; Paradis 2012). These object types, in which sequences are represented in a bit-level coding scheme, are preferred over standard character-type sequences for maximizing memory and speed efficiency. While we recommend using **ape** alongside **aphid**, it is not a requisite and as such is listed in the “Suggests” rather than “Imports” section of the package description. Indeed, any sequence of standard ASCII characters is supported, making **aphid** suitable for other applications outside of biological sequence analysis. However, it should be noted that if DNA and/or amino acid sequences are input as character vectors, the functions may not recognize the ambiguity codes and therefore are not guaranteed to treat them appropriately.

To maximize speed, the low-level dynamic programming functions (including the forward, backward, Viterbi, and maximum *a posteriori* algorithms) are written in C++ linking to the Rcpp package (Eddelbuettel and Francois 2011). R versions of these functions are also maintained for the purposes of debugging, experimentation and code interpretation. This package also relies on the **openssl** package (Ooms 2016) for sequence and alignment comparisons using the MD5 hash algorithm.

Two primary object classes, “HMM” (hidden Markov model) and “PHMM” (profile hidden Markov model) are generated using the **aphid** functions deriveHMM and derivePHMM, respectively. These objects are lists consisting of emission and transition probability matrices (elements named “E” and “A”), and non-mandatory elements that may include vectors of background emission and transition probabilities (“qe” and “qa”, respectively) and other model metadata including “name”, “description”, “size” (the number of modules in the model), and “alphabet” (the set of symbols/residues emitted by the model). Objects of class “DPA” (dynamic programming array) are also generated by the Viterbi and forward/backward functions. These are predominantly created for the purposes of succinct console-printing.

HMMs and PHMMs are explained in more detail throughout the following sections using **aphid** package functions to demonstrate their utility. The examples are borrowed from Durbin et al (1998), to which users are encouraged to refer for a more in-depth explanation on the theory and application of these models. Book chapter numbers are provided wherever possible for ease of reference.

The **aphid** package can be used to produce high-quality multiple sequence alignments using the iterative model training method outlined above. The function `align`

takes as its primary argument a list of sequences either as a “DNAbin” object, an “AAbin” object, or a list of character sequences. An object of class “PHMM” can be passed to the function as an optional secondary argument (“model”), in which case the sequences are simply aligned to the model to produce the alignment matrix. If “model” is NULL, a preliminary model is first derived using the ‘seed’ sequence method outlined above, after which the model is trained using either the Baum Welch or Viterbi training algorithm (specified *via* the “method” argument). The sequences are then aligned to the model in the usual fashion to produce the alignment. Note that if only two sequences are present inthe input list, the `align`

function will perform a pairwise alignment without a profile HMM (Smith-Waterman or Needleman-Wunch alignment).

In this final example, we will deconstruct the original globin alignment and re-align the sequences using the original PHMM as a guide.

```
globins <- unalign(globins)
align(globins, model = globins.PHMM, seqweights = NULL, residues = "AMINO")
```

```
## 1 2 3 I I 4 5 6 7 8
## HBA_HUMAN "V" "G" "A" "-" "-" "H" "A" "G" "E" "Y"
## HBB_HUMAN "V" "-" "-" "-" "-" "N" "V" "D" "E" "V"
## MYG_PHYCA "V" "E" "A" "-" "-" "D" "V" "A" "G" "H"
## GLB3_CHITP "V" "K" "G" "-" "-" "-" "-" "-" "-" "D"
## GLB5_PETMA "V" "Y" "S" "-" "-" "T" "Y" "E" "T" "S"
## LGB2_LUPLU "F" "N" "A" "-" "-" "N" "I" "P" "K" "H"
## GLB1_GLYDI "I" "A" "G" "A" "D" "N" "G" "A" "G" "V"
```

Note that the column names show the progressive positions along the model and where residues were predicted to have been emitted by insert states (e.g. the 4th and 5th residues of sequence 7).

This package was written based on the algorithms described in Durbin et al (1998). This book offers an in depth explanation of hidden Markov models and profile HMMs for users of all levels of familiarity. Many of the examples and datasets in the package are directly derived from the text, which serves as a useful primer for this package. There are also excellent resources available for those wishing to use profile HMMs outside of the R environment. The aphid package maintains compatibility with the HMMER software suite through the file input and output functions readPHMM and writePHMM. Those interested are further encouraged to check out the SAM software package, which also features a comprehensive suite of functions and tutorials.

This software was developed at Victoria University of Wellington, NZ, with funding from a Rutherford Foundation Postdoctoral Research Fellowship award from the Royal Society of New Zealand.

Durbin, Richard, Sean Eddy, Anders Krogh, and Graeme Mitchison. 1998. *Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids*. Cambridge: Cambridge University Press.

Eddelbuettel, Dirk, and Romain Francois. 2011. “Rcpp: seamless R and C++ integration.” *Journal of Statistical Software* 40 (8): 1–18. http://www.jstatsoft.org/v40/i08/.

Ooms, Jeroen. 2016. *openssl: toolkit for encryption, signatures and certificates based on OpenSSL*. R package. https://cran.r-project.org/package=openssl.

Paradis, Emmanuel. 2012. *Analysis of Phylogenetics and Evolution with R*. Second Edi. New York: Springer.

Paradis, Emmanuel, Julien Claude, and Korbinian Strimmer. 2004. “APE: analyses of phylogenetics and evolution in R language.” *Bioinformatics* 20: 289–90. doi:10.1093/bioinformatics/btg412.

R Core Team. 2017. “R: A Language and Environment for Statistical Computing.” Vienna, Austria: R Foundation for Statistical Computing. https://cran.r-project.org/.