effectR: An R package to call oomycete effectors

2018-01-16

The effectR package is an R package designed to call oomycete RxLR and CRN effectors by searching for the motifs of interest using regular expression searches and hidden markov models (HMM).

Overview

The effectR packages searches for the motifs of interest (RxLR-EER motif for RxLR effectors and LFLAK motif for CRN effectors) using a regular expression search (REGEX). These motifs used by the REGEX effectR search have been reported in the literature (Haas et al., 2009, Stam et al., 2004).

The effectR package aligns the REGEX search results using MAFFT, and builds a HMM profile based on the multiple sequence alignment result using the hmmbuild program from HMMER. The HMM profile is used to search across ORF of the genome of interest using the hmmsearch binary from HMMER. The search step will retain sequences with significant hits to the profile of interest. effectR also combines the redundant sequences found in both REGEX and HMM searches into a single dataset that can be easily exported. In addition, effectR reads and returns the HMM profile to the user and allows for the creation of a MOTIF logo-like plot using ggplot2.

External software (REQUIRED)

The effectR package uses MAFFT and HMMER3 to perform the hidden markov model seach across the results from the REGEX step. These two packages should be installed before running any of the effectR functions.

MAFFT is a multiple sequence alignment program that uses Fourier-transform algorithms to align multiple sequences. We recommend downloading and installing MAFFT by following the instructions and steps in the MAFFT installation web site.

Linux/OS X Users

Make sure that you remember the directory in which MAFFT is installed, of if the installation is sucessful, make sure to obtain the path via bash/tsh/console:

which mafft
/usr/local/bin/mafft

Windows Users

MAFFT comes in two main distributions for windows:

HMMER is used for searching sequence databases for sequence homologs. It uses hidden Markov models (profile HMMs) to search for sequences with hits to similar patterns than the profile. We use three main HMMER tools:

• hmmbuild to create the HMM database from the sequences ontained in the REGEX step of effectR
• hmmpress converts the HMM database into a format usable by other HMMER programs
• hmmsearch to excecute the HMM search in our sequence queries basde on the HMM profile

The effectR package requires all of these tools. A correct HMMER installation will install all three programs.

Linux/OS X users

We recommend downloading and installing HMMER by following the instructions and steps in the HMMER installation web site. Make sure that you remember the directory in which HMMER is installed, of if the installation is sucessful, make sure to obtain the path via bash/tsh/console:

which hmmbuild
which hmmpress
which hmmsearch
/usr/local/bin/hmmbuild
/usr/local/bin/hmmpress
/usr/local/bin/hmmsearch

Windows users

To use the effectR package in Windows, the user must download the Windows binaries of HMMER. effectR will not work with any other version of HMMER.

Data input

The effectR package is designed to work with amino acid sequences in FASTA format representing the six-frame translation of every open reading frame (ORF) of an oomycete genome. Using the six-frame translation of all ORF’s in a genome is recommended in order to obtain as many effectors as possible from a proteome. To obtain the ORF for a genome, we recommend the use of EMBOSS’ getorf.

effectR uses a list of sequences of the class SeqFastadna in order to perform the effector searches. The function read.fasta from the seqinr package reads the FASTA amino acid file into R, creating a list of SeqFastadna objects that represent each of the translated ORF’s from the original FASTA file. We will begin our example using a subset of translated ORF’s from the P. infestans genome sequenced by Haas et al., (2009):

library(effectR)
pkg <- "effectR"
fasta.file <- system.file("extdata", "test_infestans.fasta", package = pkg)
library(seqinr)
head(ORF, n = 2)
## $DS028118_1098 ## [1] "m" "r" "l" "a" "m" "i" "l" "l" "s" "i" "p" "l" "f" "v" "s" "g" "n" ## [18] "v" "l" "h" "v" "p" "t" "q" "v" "t" "k" "s" "h" "a" "v" "s" "p" "d" ## [35] "a" "q" "f" "v" "v" "a" "m" "g" "r" "r" "s" "l" "r" "t" "s" "g" "e" ## [52] "a" "n" "e" "e" "r" "t" "r" "l" "n" "t" "l" "l" "l" "l" "d" "d" "v" ## [69] "t" "e" "a" "e" "m" "s" "s" "i" "k" "k" "l" "a" "l" "t" "f" "a" "k" ## [86] "l" "e" "n" "r" "n" "d" "g" "a" "a" "d" "l" "f" "n" "m" "l" "r" "r" ## [103] "q" "g" "h" "t" "k" "e" "s" "a" "r" "n" "a" "g" "n" "l" "y" "t" "k" ## [120] "y" "l" "q" "n" "p" "s" "a" "f" "h" "t" ## attr(,"name") ## [1] "DS028118_1098" ## attr(,"Annot") ## [1] ">DS028118_1098 [2041764 - 2042150] | organism=Phytophthora_infestans_T30-4 | version=2015-03-12 | length=6928287 | SO=supercontig" ## attr(,"class") ## [1] "SeqFastadna" ## ##$DS028118_4814
##   [1] "m" "r" "l" "l" "y" "v" "l" "i" "a" "a" "a" "a" "s" "l" "l" "a" "n"
##  [18] "i" "e" "a" "t" "t" "s" "a" "d" "s" "a" "k" "m" "i" "s" "v" "q" "v"
##  [35] "s" "k" "t" "t" "r" "r" "f" "l" "r" "s" "p" "k" "p" "y" "e" "n" "n"
##  [52] "e" "e" "e" "r" "a" "p" "t" "n" "i" "l" "t" "l" "v" "d" "d" "a" "a"
##  [69] "k" "a" "l" "p" "s" "a" "d" "e" "a" "a" "k" "v" "l" "p" "t" "l" "d"
##  [86] "d" "l" "f" "r" "i" "q" "k" "m" "d" "e" "a" "p" "n" "t" "k" "k" "a"
## [103] "m" "e" "i" "v" "k" "p" "l" "f" "d" "k" "m" "s" "s" "y" "p" "k" "i"
## [120] "k" "e" "a" "a" "l" "k" "e" "m" "e" "t" "n" "q" "n" "l" "k" "s" "l"
## [137] "w" "s" "h" "f" "i" "l" "s" "k" "r" "v" "n" "f" "d" "v" "k" "e" "i"
## [154] "s" "k" "q" "r" "p" "s" "a" "s"
## attr(,"name")
## [1] "DS028118_4814"
## attr(,"Annot")
## [1] ">DS028118_4814 [4150993 - 4150511] (REVERSE SENSE) | organism=Phytophthora_infestans_T30-4 | version=2015-03-12 | length=6928287 | SO=supercontig"
## attr(,"class")
## [1] "SeqFastadna"

We have created a ORF object that includes the list of translated ORF’s from the subset of XX ORF’s from the P. infestans genome. For more information on the SeqFastadna objects please read the seqinr manual.

Obtaining non-redundant effectors and motif summaries

The user can extract all of the non-redundant sequences and a summary table with the information about the motifs using the effector.summary function. This function uses the results from either hmm.seach or regex.search functions to generate a table that includes the name of the candidate effector sequence, the number of motifs of interest (RxLR-EER or LFLAK-HVLV) per sequence and its location within the sequence. In addition, when the effector.summary function is used in an object that contains the results of hmm.search, the user will obtain a list of the non-reduntant sequences. If the user provides the results from regex.search, the function will return the motif summary table.

We will use the effector.summary function with our hmm.search results (the candidate.rxlr object):

summary.list <- effector.summary(candidate.rxlr)

knitr::kable(summary.list$motif.table) The motif table has a column called MOTIF. This column summarizes the candidate ORF into one of 4 categories: • Complete: The candidate ORF has both motifs of interest (RxLR + EER or LFLAK + HVLV) • Only RxLR/Only LFLAK: The candidate ORF only has the translocation domain • Only EER/HVLV: The candidate ORF only has the second motif of interest • No motifs: The sequence has a hit with the HMM profile but does not have any motif of interest Non-redundant sequences head(summary.list$motif.table, n = 2)
length(summary.list$consensus.sequences) The summary.list$consensus.sequences has all 27 RxLR candidate genes found in our searches.

Exporting the non-redundant effector candidates

To export the non-redundant effector candidates that resulted from the hmm.search or regex.search functions, we use the write.fasta function of the seqinr package. We recomend the users to read the documentation of the seqinr package Since the objects that result from the hmm.search or regex.search function are of the SeqFastadna class, we can use any of the function of the seqinr package that use this class as well.

To save the results from our example file, we would use the following command:

write.fasta(sequences = getSequence(summary.list$consensus.sequences), names = getName(summary.list$consensus.sequences), file.out = "RxLR_candidates.fasta")

Visualizing the HMM profile using a sequence logo-like plot

To determine if the HMM profile includes the motifs of interest, we have created the function hmm.logo. The function hmm.logo reads the HMM profile (obtained from the hmm.search step) and uses ggplot2 to create a bar-plot. The bar-plot will illustrate the bits (amino acid scores) of each amino acid used to construct the HMM profile according to its consensus position in the HMM profile. To learn more about sequence logo plots visit this wikipedia article.

The hmm.logo is a wrapper that parses the HMM profile table and plots the parsed table results in ggplot2.

To visualize the sequence logo-like plot in our example data set, we use our candidate.rxlr$HMM_Table object in the hmm.search function: hmm.logo(hmm.table = candidate.rxlr$HMM_Table)