BCB 503: RevBayes Intro

Sixth session: Tree Building & Dating

Orlando Schwery, 19. October 2021

This is based on the following tutorials:

Tree Building

Tree Dating

Molecular Dating (Intro and sub-tutorials on the topics of the other dating tutorials, but simplified)
Relaxed Clocks & Time Trees (similar to dating with BEAST, plus some model selection)

Additional tutorials related to the topic that we won’t cover today:

Tree Building

Partitioned Data Analysis (with video walkthrough)
Discrete Morphology - Tree Inference (Mk model, ascertainment bias, unequal rates)
Discrete Morphology - Multistate Characters (parallel to above, but with characters having different numbers of states)
Polymorphism-Aware Phylogenetic Models (incorporating allele frquencies in populations)

Tree Dating

We open RevBayes and set our working directory as we’re used to now:

rb                                         # or maybe ./rb depending on your system
getwd()
setwd("/mnt/c/Users/oschwery/Documents/Teaching/RevBayes_Workshop_UI2021/6-TreeBuildingDating/")

Tree Building from Molecular Sequence Data

Load data:

data <- readDiscreteCharacterData("data/primates_and_galeopterus_cytb.nex")
data

We could make variables of data properties to use later, but we shall instead make those where we need them, to keep the overview:

# num_taxa <- data.ntaxa()
# num_branches <- 2 * num_taxa - 3
# taxa <- data.taxa()

Initialise vectors for moves and monitors:

moves    = VectorMoves()
monitors = VectorMonitors()

Nucleotide Substitution Model

We’re staying simple and use Jukes Cantor (JC) here, setting the number of states to 4, since DNA has 4 bases:

Q <- fnJC(4)
Q

Tree Topology & Branch Lengths

As recommended in revbayes, we’re setting an outgroup using the function clade. We’ll supply it to dnUniformTopology, which will assume all possible trees to be equally likely:

out_group = clade("Galeopterus_variegatus")
taxa <- data.taxa()                                     # get taxon labels from data
topology ~ dnUniformTopology(taxa, outgroup=out_group)

The topology is a stochastic node that we’re trying to optimise, thus we’re setting moves for it. Two of the possible moves we can use are mvNNI (nearest-neighbour-interchange) and mvSPR (subtree-prune and regraft). We can tune them to the size of our tree by using the number of taxa in the moves’ weight:

num_taxa <- data.ntaxa()                                # get number of taxa from data
moves.append( mvNNI(topology, weight=num_taxa) )
moves.append( mvSPR(topology, weight=num_taxa/10.0) )   # we don't want to move large chunks as often as nearest neighbours...

We can give the length of each branch in the tree its own exponential prior and move in a for loop (which will be a plate in the DAG):

num_branches <- 2 * num_taxa - 3
for (i in 1:num_branches) {
   br_lens[i] ~ dnExponential(10.0)
   moves.append( mvScale(br_lens[i]) )
}

Just as with the net_diversification rate in the diversification models, we can use deterministic nodes to keep track of useful things, such as overal tree length and the resulting tree (topology plus branch lenghts) - though we’ll also need the latter to pass on to the core function:

TL := sum(br_lens)
psi := treeAssembly(topology, br_lens)

The Continuous Time Markov Chain

We can now draw sequences from a distribution made from the tree psi and the rate matrix Q:

seq ~ dnPhyloCTMC(tree=psi, Q=Q, type="DNA")

As usual, we .clamp our data to it, and wrap it all up in a model object:

seq.clamp(data)
mymodel = model(Q)
mymodel

For our monitors, we have our usual ones to keep track of the mcmc parameters (mnModel) and screen output (mnScreen), plus an additional one that will output the inferred/evaluated tree at each generation (mnFile):

monitors.append( mnModel(filename="output/primates_cytb_JC.log", printgen=10) )
monitors.append( mnScreen(printgen=100, TL) )
monitors.append( mnFile(filename="output/primates_cytb_JC.trees", printgen=10, psi) )

Finally, we initialise the mcmc and run it:

mymcmc = mcmc(mymodel, monitors, moves)
mymcmc.run(generations=20000)

Summarize MCMC Samples

We can inspect the output file primates_cytb_JC.log e.g. in Tracer. To get one single tree, we can load the trace back in, and use mapTree() to get a MAP ( maximum a posteriori ) tree:

treetrace = readTreeTrace("output/primates_cytb_JC.trees", treetype="non-clock")
map_tree = mapTree(treetrace,"output/primates_cytb_JC_MAP.tree")

We can also use that object to e.g. get the probability that the Lemurs are monophyletic:

Lemuroidea <- clade("Cheirogaleus_major",
                    "Daubentonia_madagascariensis",
                    "Lemur_catta",
                    "Lepilemur_hubbardorum",
                    "Microcebus_murinus",
                    "Propithecus_coquereli",
                    "Varecia_variegata_variegata")

treetrace.cladeProbability( Lemuroidea )

Advanced Options:

Consult the other tutorials for different substitution models, alternative priors for tree, and branch lengths, as well as settings like gamma site variation, invariant sites, or partitioning.

Tree Dating

Once we have a tree with branch lengths, we can use different ways to transform the latter from number of changes, to time units.

We start again by loading sequence data, assigning helper variables, and initialising moves and monitors:

cytb <- readDiscreteCharacterData("data/bears_cytb.nex")
n_taxa <- cytb.size()
taxa <- cytb.taxa()
moves    = VectorMoves()
monitors = VectorMonitors()

We could probably try and date a fixed tree, but we shall jump straight in and date the tree jointly with the tree inference. Thus we set up substitution and tree/bl models as before:

Substitution Model

This time, we go for a GTR plus Gamma model, which consists of four stationary frequencies (sf) and six exchangeability rates (er), which we will draw from an uniform Dirichlet prior, and assign the appropriate moves:

sf_hp <- v(1,1,1,1)
sf ~ dnDirichlet(sf_hp)

er_hp <- v(1,1,1,1,1,1)
er ~ dnDirichlet(er_hp)

moves.append( mvBetaSimplex(er, alpha=10.0, weight=3.0) )
moves.append( mvBetaSimplex(sf, alpha=10.0, weight=2.0) )

Now we can pass these to the GTR function to get our Q matrix:

Q_cytb := fnGTR(er,sf)

Finally, we create a parameter alpha for the shape of the gamma distribution, which is then set to be discretized into 4 categories:

alpha_cytb ~ dnUniform(0.0,1E6)
alpha_cytb.setValue( 1.0 )

moves.append( mvScale(alpha_cytb, lambda=0.5, tune=true, weight=2.0) )

rates_cytb := fnDiscretizeGamma(alpha_cytb, alpha_cytb, 4)

The Tree Model

Here, instead of just assuming all possible tree topologies to be equally likely, we shall actually use a birth-death model of speciation and extinction to model the proces that generated the tree.

As we’re used to, we’re setting priors and moves for a speciation and extinction rate:

speciation_rate ~ dnExponential(10)
extinction_rate ~ dnExponential(10)

moves.append( mvScale(speciation_rate, lambda=0.5, tune=true, weight=3.0) )
moves.append( mvScale(extinction_rate, lambda=0.5, tune=true, weight=3.0) )

We have all extant bears in our tree, thus we can set the sampling probability to 1:

rho <- 1.0

We don’t know the root age of the tree, but since we’re only inferring relative ages, we can set that to 1 as well:

extant_mrca <- 1.0

Having all the parts of the birth-death model together, we can pass them to the corresponding function that lets us draw from a distribution of trees, but this time we assign it with a = instead of an ~. We’re also setting the condition to nTaxa, because we only want trees with as many extant species as we have in or data set:

tree_dist = dnBDP(lambda=speciation_rate, mu=extinction_rate, rho=rho, rootAge=extant_mrca, samplingStrategy="uniform", condition="nTaxa", taxa=taxa)

This is to allow us to include a topology constraint before drawing trees. We want the Ursidae to be monophyletic, so we use clade to define that group, and then the function for drawing constrained topologies to finally draw trees.

clade_ursinae = clade("Melursus_ursinus", "Ursus_arctos", "Ursus_maritimus", "Helarctos_malayanus", "Ursus_americanus", "Ursus_thibetanus")                      
constraints = v(clade_ursinae)
timetree ~ dnConstrainedTopology(tree_dist, constraints=constraints)

Note that this time, we’re not clamping any data to the timetree, since our data isn’t a tree that we’re trying to fit a bd model to, but instead this is simply our prior for the tree estimation!

We add a number of tree moves again, including some for topology (mvFNPR), and node ages (mvNodeTimeSlideUniform):

moves.append( mvNarrow(timetree, weight=n_taxa) )
moves.append( mvFNPR(timetree, weight=n_taxa/4) )
moves.append( mvNodeTimeSlideUniform(timetree, weight=n_taxa) )
moves.append( mvSubtreeScale(timetree, weight=n_taxa/5.0) )

Finally, we can monitor the age of our constrained clade, e.g. to see it progress in the analysis:

age_ursinae := tmrca(timetree, clade_ursinae)

The Clock Model

Global Molekular Clock

As the simplest clock model, we can take a global molecular clock, which assumes one single rate across all branches and constant through time. We can set an exponential prior with a rate parameter of 10:

branch_rates ~ dnExponential(10.0)
moves.append( mvScale(branch_rates, lambda=0.5, tune=true, weight=3.0) )

The Continuous Time Markov Chain

At last, we can link the tree model, substitution model, and clock model to the core function that gives us the distribution of sequences, and clamp the data to it:

phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q_cytb, siteRates=rates_cytb, branchRates=branch_rates, type="DNA")
phySeq.clamp(cytb)

MCMC

As usual, we wrap it all up in a model object, add the monitors, create an mcmc object and run it. Again we use the mnFile monitor to save the trees at each generation:

mymodel = model(sf)
monitors.append( mnModel(filename="output/bears_global.log", printgen=10) )
monitors.append( mnFile(filename="output/bears_global.trees", printgen=10, timetree) )
monitors.append( mnScreen(printgen=10, extant_mrca, age_ursinae, branch_rates) )
mymcmc = mcmc(mymodel, monitors, moves, nruns=2, combine="mixed")
mymcmc.run(generations=20000, tuningInterval=200)

Summarize Results

Once this is run, we can again load the trace, and summarize the tree topologies and node ages from the mcmc to one tree. This time we want the Maximum Clade Credibility tree:

trace = readTreeTrace("output/bears_global.trees")
mccTree(trace, file="output/bears_global.mcc.tre")

Tree Dating - Relaxed Clock

The most convenient way of trying those different clock models is to turn the different separate models (clock, tree, substitution) and runfile into different files to be sourced.

We might think that a global clock across the whole tree might not be a very realistic assumption - clearly the evolutionary rates could have been higher or lower at some times or for some taxa?

We can relax the constant rate assumption by changing the clock model from estimating a global branch rate to estimating a mean branch rate and individual rates for each branch.

The Clock Model

Uncorrelated Exponential Relaxed Clock

We replace the previous clock model with this new one. We start with a prior on the mean branch rate:

branch_rates_mean ~ dnExponential(10.0)
moves.append( mvScale(branch_rates_mean, lambda=0.5, tune=true, weight=3.0) )

As before for the branches of the tree building model, we’ll use a for loop based on the number of branches to assign each one its own exponential prior and move:

n_branches <- 2 * n_taxa - 2
for(i in 1:n_branches){
    branch_rates[i] ~ dnExp(1/branch_rates_mean)
    moves.append( mvScale(branch_rates[i], lambda=0.5, tune=true, weight=1.0) )
}

We optimize the mcmc by adding two more moves - the VectorScale move will allow to change all branch rates at once, while the RateAgeBetaShift move to change node ages and branch rates at the same time (meaning their branch length stays the same):

moves.append( mvVectorScale(branch_rates, lambda=0.5, tune=true, weight=4.0) )
moves.append( mvRateAgeBetaShift(tree=timetree, rates=branch_rates, tune=true, weight=n_taxa) )

Running and Summarising

We’d also change the output-names in our runfile, in order to not overwrite the previous ones:

monitors.append( mnModel(filename="output/bears_relaxed_exponential.log", printgen=10) )
monitors.append( mnFile(filename="output/bears_relaxed_exponential.trees", printgen=10, timetree) )

trace = readTreeTrace("output/bears_relaxed_exponential.trees")
mccTree(trace, file="output/bears_relaxed_exponential.mcc.tre" )

Tree Dating - Node Dating

Eventually, and if possible, we’d probably want to estimate absolute node ages (in million years), instead of relative ones, possibly using some fossil calibrations, similar to what one would do in BEAST.

If we just want to extend our last model to include two node calibrations, we can do so by simply modifying the tree model.

The Tree Model

Root Calibration

We want to constrain the age of the root, that is the MRCA of extant bears. We know from fossils that the earliest crown group bear appeared first at 1.84Ma, meaning the MRCA of living bears can’t be younger than that. To get a maximum bound on that node, we can make use of a secondary calibration, meaning an age taken from another dated phylogeny - here the estimage of the age of Caniforms (a clade that includes the bears) at 49Ma. If that higher taxonomic group is at least that old, the bears, which are nested within them, can’t be older.

Since we don’t know anything about where within that bound the MRCA of bears should fall, we will put a uniform prior on that node. We’ll replace the object extant_mrca from before with this uniform prior, and add a move as well:

extant_mrca_min <- 1.84
extant_mrca_max <- 49.0
    
extant_mrca ~ dnUniform(extant_mrca_min, extant_mrca_max)
    
moves.append( mvScale(extant_mrca, lambda=1, tune=true, weight=5.0) )

Internal Node Calibration

Now we also want to constrain the previously defined clade Ursinae. We’ve already set the clade constraint, and also assigned the node age_ursinae that keeps track of the age of that node. So now we just have to put a prior on that node.

The earliest fossil crown group bear is also the earliest member of Ursinae we know off, wo we’ll set an exponential prior that starts at that fossil’s age ( i.e. the exponential distribution is offset by the fossil age). This means it won’t be able to be younger than this age. Unlike other tree dating tools, the fossil age is treated like data, thus we create a stochastic (DAG-)node for the age of this (tree-)node, and clamp the known minimal age to it:

obs_age_ursinae ~ dnExponential(1.0, offset = -age_ursinae)
obs_age_ursinae.clamp(-1.84)

Running and Summarising

Again we change the output names in our runfile, and we’re ready to go:

monitors.append( mnModel(filename="output/bears_nodedate.log", printgen=10) )
monitors.append( mnFile(filename="output/bears_nodedate.trees", printgen=10, timetree) )

trace = readTreeTrace("output/bears_nodedate.trees")
mccTree(trace, file="output/bears_nodedate.mcc.tre" )