Using the Open Tree synthesis in a comparative analysis

David Winter

2018-04-02

Phylogenetic Comparative Methods

The development of phylogenetic comparative methods has made phylogenies an important source of data in fields as diverse as ecology, genomics and medicine. Comparative methods can be used to investigate patterns in the evolution of traits or the diversification of lineages. In other cases a phylogeny is treated as a “nuisance parameter”, allowing the autocorrelation created by the shared evolutionary history of the species included to be controlled for.

In many cases finding a tree that relates the species for which trait data are available is a rate-limiting step in such comparative analyses. Here we show how the synthetic tree provided by Open Tree of Life (and made available in R via rotl) can help to fill this gap.

A phylogenetic meta-analysis

To demonstrate the use of rotl in a comparative analysis, we will partially reproduce the results of Rutkowska et al. (2014). Very briefly, this study is a meta-analysis summarising the results of multiple studies testing for systematic differences in the size of eggs that contain male and female offspring. Such a difference might mean that birds invest more heavily in one sex than the other.

Because this study involves data from 51 different species, Rutkowska et al. used a phylogenetic comparative approach to account for the shared evolutionary history among some of the studied species.

Gather the data

If we are going to reproduce this analysis, we will first need to gather the data. Thankfully, the data are available as supplementary material from the publisher’s website. We can collect the data using fulltext (with the paper’s DOI as input) and read them into memory with readxl::read_excel:

library(rotl)

## This should work, but Wiley has currently broken the URLs to access the
## SI. 
## if (require(readxl) && require(fulltext)) {
## doi <- "10.1111/jeb.12282"
## xl_file <- try(ft_get_si(doi, 1, save.name="egg.xls"), silent = TRUE)    
## egg_data <- read_excel(xl_file)
## } else {
egg_data <- read.csv(system.file("extdata", "egg.csv", package = "rotl"),
                     stringsAsFactors = FALSE)
## }
head(egg_data)
##                   animal                   Spp       Lndim Measure Neggs
## 1 Zonotrichia_leucophrys White-crowned sparrow 0.000000000  volume   294
## 2      Passer_domesticus         House sparrow 0.009407469  volume   149
## 3        Serinus_canaria                Canary 0.000000000  volume    52
## 4          Turdus_merula    European blackbird 0.021189299  volume    82
## 5    Agelaius_phoeniceus  Red-winged blackbird 0.218316086  volume   394
## 6    Quiscalus_mexicanus  Great-tailed grackle 0.281894985    mass   822
##   Nclutches        ESr Type         StudyID Year        D        EN
## 1        73 0.14004594 stat        Mead1987 1987 3.421918  85.91673
## 2        31 0.11175203 stat     Cordero2000 2000 4.045161  36.83413
## 3        21 0.49679140 stat     Leitner2006 2006 2.180952  23.84279
## 4        54 0.38598540 stat     Martyka2010 2010 1.414815  57.95812
## 5       106 0.07410136  raw Weatherhead1985 1985 3.173585 124.14982
## 6       205 0.05178834  raw     Teather1989 1989 3.407805 241.21099
##           Zr         VZr
## 1 0.14097244 0.012060292
## 2 0.11222075 0.029555954
## 3 0.54503712 0.047978211
## 4 0.40707397 0.018195675
## 5 0.07423744 0.008254242
## 6 0.05183471 0.004197959

The most important variable in this dataset is Zr, which is a normalized effect size for the difference in size between eggs that contain males and females. Values close to zero come from studies that found the sex of an egg’s inhabitant had little effect on its size, while large positive or negative values correspond to studies with substantial sex biases (towards males and females respectively). Since this is a meta-analysis we should produce the classic funnel plot with effect size on the y-axis and precision (the inverse of the sample standard error) on the x-axis. Here we calculate precision from the sample variance (VZr):

plot(1/sqrt(egg_data$VZr), egg_data$Zr, pch=16,
     ylab="Effect size (Zr)",
     xlab="Precision (1/SE)",
     main="Effect sizes for sex bias in egg size among 51 bird species" )
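
As a quick check of the x-axis values, the following sketch computes the precision for the first three rows shown in the head(egg_data) output above. The names vzr, se and precision are illustrative and are not used elsewhere in the analysis.

```r
# Sampling variances (VZr) copied from the first three rows of head(egg_data)
vzr <- c(0.012060292, 0.029555954, 0.047978211)

# The standard error is the square root of the sampling variance,
# and precision is the inverse of the standard error
se <- sqrt(vzr)
precision <- 1 / se

# Studies with a smaller sampling variance sit further right on the funnel plot
data.frame(VZr = vzr, SE = se, precision = precision)
```

In a funnel plot, the scatter of effect sizes is expected to narrow as precision increases.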

In order to use this data later on we need to first convert it to a standard data.frame. We can also convert the animal column (the species names) to lower case which will make it easier to match names later on:

egg_data <- as.data.frame(egg_data)
egg_data$animal <- tolower(egg_data$animal)

Find the species in OTT

We can use the OTL synthesis tree to relate these species. To do so we first need to find Open Tree Taxonomy (OTT) IDs for each species. We can do that with the Taxonomic Name Resolution Service function tnrs_match_names:

taxa <- tnrs_match_names(unique(egg_data$animal), context="Animals")
head(taxa)
##            search_string            unique_name approximate_match ott_id
## 1 zonotrichia_leucophrys Zonotrichia leucophrys              TRUE 265553
## 2      passer_domesticus      Passer domesticus              TRUE 745175
## 3        serinus_canaria        Serinus canaria              TRUE 464865
## 4          turdus_merula          Turdus merula              TRUE 568572
## 5    agelaius_phoeniceus    Agelaius phoeniceus              TRUE 226605
## 6    quiscalus_mexicanus    Quiscalus mexicanus              TRUE 743411
##   is_synonym          flags number_matches
## 1      FALSE                             1
## 2      FALSE                             1
## 3      FALSE SIBLING_HIGHER              2
## 4      FALSE                             1
## 5      FALSE                             2
## 6      FALSE                             1
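
Before relying on these matches it can be worth flagging the rows that may need a manual look. The sketch below does this on a small hand-built data frame (taxa_example, a hypothetical stand-in whose values are copied from the output above) so that it runs without a network connection; the same filter should apply to the real taxa object, assuming it has the columns shown.

```r
# Toy stand-in for the result of tnrs_match_names();
# values are copied from rows 1, 3 and 5 of head(taxa)
taxa_example <- data.frame(
    search_string  = c("zonotrichia_leucophrys", "serinus_canaria",
                       "agelaius_phoeniceus"),
    unique_name    = c("Zonotrichia leucophrys", "Serinus canaria",
                       "Agelaius phoeniceus"),
    ott_id         = c(265553, 464865, 226605),
    flags          = c("", "SIBLING_HIGHER", ""),
    number_matches = c(1, 2, 2),
    stringsAsFactors = FALSE
)

# Rows with more than one candidate match, or with a non-empty flag,
# are the ones most worth checking by hand
needs_review <- taxa_example$number_matches > 1 | nzchar(taxa_example$flags)
taxa_example[needs_review, c("search_string", "unique_name",
                             "flags", "number_matches")]
```

For names flagged this way, rotl's inspect() function can list the alternative matches for a given search string.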

All of these species are in OTT, but a few of them go by different names in the Open Tree than we have in our data set. Because the tree rotl fetches will have Open Tree names, we need to create a named vector that maps the names we have for each species to the names Open Tree uses for them:

taxon_map <- structure(taxa$search_string, names=taxa$unique_name)

Now we can use this map to retrieve “data set names” from “OTT names”:

taxon_map["Anser caerulescens"]
##  Anser caerulescens 
## "chen_caerulescens"
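
To make the direction of the mapping explicit, here is a minimal sketch using a two-species toy map (toy_map is an illustrative name; the species pairs are taken from the output above). Names of the vector are OTT names and values are the names used in our data.

```r
# The name we searched for and the name OTT uses, for two species
search_string <- c("chen_caerulescens", "passer_domesticus")
unique_name   <- c("Anser caerulescens", "Passer domesticus")

# Values are our data set names, names are the OTT names.
# setNames() builds the same kind of named vector as
# structure(..., names = ...)
toy_map <- setNames(search_string, unique_name)

# Look up several OTT names at once; a name missing from the map gives NA
toy_map[c("Passer domesticus", "Anser caerulescens", "Not a bird")]
```

Checking the result of such a lookup for NA values is a cheap way to catch names that were never matched.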

Get a tree

Now we can get the tree. There are really too many tips here to show nicely, so we will leave the tip labels out of this plot:

tr <- tol_induced_subtree(ott_id(taxa)[is_in_tree(ott_id(taxa))])
plot(tr, show.tip.label=FALSE)

There are a few things to note here. First, the tree has no branch lengths. At present this is true for the whole of the Open Tree synthetic tree. Some comparative methods require either branch lengths or an ultrametric tree. Before you can use one of those methods you will need to get a tree with branch lengths. You could try looking for published trees made available by the Open Tree with studies_find_trees. Alternatively, you could estimate branch lengths from the topology of a phylogeny returned by tol_induced_subtree, perhaps by downloading DNA sequences from the NCBI with rentrez or “hanging” the tree on nodes of known age using the penalized likelihood method in ape::chronos. In this case, we will use only the topology of the tree as input to our comparative analysis, so we can skip these steps.
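
If a method only needs some set of branch lengths and no sequence or dating information is at hand, one stopgap (not used in this analysis, and not one of the approaches listed above) is to assign arbitrary branch lengths with ape::compute.brlen. The sketch below applies it to a small made-up tree, toy_tree, standing in for the output of tol_induced_subtree.

```r
library(ape)

# Small topology-only tree standing in for the synthetic subtree
toy_tree <- read.tree(text = "((a,b),(c,(d,e)));")
is.null(toy_tree$edge.length)   # check that there are no branch lengths yet

# Assign arbitrary branch lengths using Grafen's method.
# These are placeholders, not estimates of time or divergence.
toy_tree_bl <- compute.brlen(toy_tree, method = "Grafen")

# Check whether the resulting tree is ultrametric
is.ultrametric(toy_tree_bl)
```

Results based on such placeholder lengths should be treated with caution, since they reflect only the shape of the tree.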

Second, the tip labels contain OTT IDs, which means they will not perfectly match the species names in our dataset or the taxon map that we created earlier:

tr$tip.label[1:4]
## [1] "Passer_domesticus_ott745175"    "Serinus_canaria_ott464865"     
## [3] "Haemorhous_mexicanus_ott711865" "Agelaius_phoeniceus_ott226605"

Finally, the tree contains node labels for those nodes that match a higher taxonomic group, and empty strings ("") for all other nodes. Some comparative methods either do not expect node labels at all, or require all labeled nodes to have a unique name (meaning multiple “empty” labels will cause an error).

We can deal with all these details easily. rotl provides the convenience function strip_ott_ids to remove the extra information from the tip labels. With the IDs removed, we can use our taxon map to replace the tip labels in the tree with the species names from our dataset.

otl_tips <- strip_ott_ids(tr$tip.label, remove_underscores=TRUE)
tr$tip.label <- taxon_map[ otl_tips ]
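
To see what this relabelling does, the sketch below mimics it in base R on two of the tip labels shown earlier. The regular expression is only an approximation of what strip_ott_ids does, written under the assumption that every label ends in "_ott" followed by digits; toy_map is a hypothetical two-entry version of taxon_map.

```r
# Tip labels as returned by tol_induced_subtree() (copied from the output above)
tip_labels <- c("Passer_domesticus_ott745175", "Serinus_canaria_ott464865")

# Drop the trailing "_ott<digits>" suffix, then swap underscores for spaces
no_ids <- sub("_ott[0-9]+$", "", tip_labels)
clean  <- gsub("_", " ", no_ids)
clean

# The cleaned labels can now index a map keyed by OTT names
toy_map <- c("Passer domesticus" = "passer_domesticus",
             "Serinus canaria"   = "serinus_canaria")
unname(toy_map[clean])
```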

Finally, we can remove the node labels by setting the node.label attribute of the tree to NULL. We also keep only the rows of the data whose species appear among the tip labels of the tree.

tr$node.label <- NULL
egg_data <- egg_data[egg_data$animal %in% tr$tip.label, ]
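
A quick way to confirm that the tree and the data agree is to compare the two sets of names directly. The following sketch uses made-up vectors (tip_labels, data_names) and made-up Zr values purely for illustration; with the real objects you would compare tr$tip.label with egg_data$animal.

```r
# Hypothetical tip labels and data set names, for illustration only
tip_labels <- c("passer_domesticus", "turdus_merula", "serinus_canaria")
data_names <- c("passer_domesticus", "turdus_merula", "parus_major")

# Species in the data but missing from the tree, and the reverse
setdiff(data_names, tip_labels)
setdiff(tip_labels, data_names)

# Keep only the rows whose species appear in the tree
# (the Zr values here are invented)
toy_data <- data.frame(animal = data_names, Zr = c(0.11, 0.41, 0.05),
                       stringsAsFactors = FALSE)
toy_data[toy_data$animal %in% tip_labels, ]
```

Both setdiff calls returning empty vectors would indicate that every species in the data has a tip in the tree and vice versa.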

Perform the meta-analysis

Now we have data and a tree, and we know the names in the tree match the ones in the data. It’s time to do the comparative analysis. Rutkowska et al. used MCMCglmm, a Bayesian MCMC approach to fitting multi-level models, to perform their meta-analysis, and we will do the same. Of course, to properly analyse these data you would take some care in deciding on the appropriate priors to use and inspect the results carefully. In this case, we are really interested in using this as a demonstration, so we will just run a simple model.

Specifically we will fit a model where the only variable that might explain the values of Zr is the random factor animal, which corresponds to the phylogenetic relationships among species. We also provide VZr as the measurement error variance, effectively adding extra weight to the results of more powerful studies. Here’s how we specify and fit that model with MCMCglmm:

set.seed(123)
if (require(MCMCglmm, quietly = TRUE)) {

    pr <- list(R=list(V=1,nu=0.002),
               G=list(G1=list(V=1,nu=0.002))
               )

    model <- MCMCglmm(Zr~1,random=~animal,
                      pedigree=tr,
                      mev=egg_data$VZr,
                      prior=pr,
                      data=egg_data,
                      verbose=FALSE)
} else {
    model <- readRDS(file = system.file("extdata", "mcmcglmm_model.rds", package = "rotl"))
}
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'MCMCglmm'
## Warning: namespace 'MCMCglmm' is not available and has been replaced
## by .GlobalEnv when processing object '.meta-analysis_cache/html/birds_6633d27879a698757101208337b07521'

Now that we have a result we can find out how much phylogenetic signal exists for sex-biased differences in egg size. In a multi-level model we can use variance components to look at this. Specifically, the proportion of the total variance that can be explained by phylogeny is called the phylogenetic reliability, H. Let’s calculate H for this model:

var_comps <- colMeans(model$VCV )
var_comps["animal"] / sum(var_comps)
##     animal 
## 0.00313885
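
The value above is a ratio of posterior means. If you also want a sense of the uncertainty, one option is to compute the same ratio for every posterior sample and summarise the resulting distribution. The sketch below illustrates the idea on simulated draws: toy_vcv is a hypothetical stand-in for model$VCV, and its column names and values are assumptions made for the example, not output from the fitted model.

```r
set.seed(1)

# Simulated posterior samples standing in for model$VCV
# (1000 draws of two variance components)
toy_vcv <- cbind(animal = rgamma(1000, shape = 2, rate = 400),
                 units  = rgamma(1000, shape = 20, rate = 20))

# Proportion of variance attributed to phylogeny, computed for each draw
h_draws <- toy_vcv[, "animal"] / rowSums(toy_vcv)

# Posterior mean and a 95% quantile interval
mean(h_draws)
quantile(h_draws, c(0.025, 0.975))
```

Note that the mean of the per-sample ratios will generally differ slightly from the ratio of the means used above.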

It appears there is almost no phylogenetic signal in the data. The relationships among species explain much less than one percent of the total variance in the data. If you were wondering, Rutkowska et al. report a similar result: even after adding more predictors to their model, most of the variance in Zr was left unexplained.

What other comparative methods can I use in R?

Here we have demonstrated just one comparative analysis that you might do in R. There is an ever-growing number of packages that allow an ever-growing number of analyses to be performed in R. Some “classics” like ancestral state reconstruction, phylogenetic independent contrasts and lineage-through-time plots are implemented in ape. Packages like phytools, caper and diversitree provide extensions to these methods. The CRAN Phylogenetics Task View gives a good idea of the diversity of packages and analyses that can be completed in R.