Orlando Schwery, 19. October 2021
This is based on the following tutorials:
Additional tutorials related to the topic that we won’t cover today:
We open RevBayes and set our working directory as we’re used to now:
rb # or maybe ./rb depending on your system
getwd()
setwd("/mnt/c/Users/oschwery/Documents/Teaching/RevBayes_Workshop_UI2021/6-TreeBuildingDating/")
Load data:
data <- readDiscreteCharacterData("data/primates_and_galeopterus_cytb.nex")
data
We could make variables of data
properties to use later, but we shall instead make those where we need them, to keep the overview:
# num_taxa <- data.ntaxa()
# num_branches <- 2 * num_taxa - 3
# taxa <- data.taxa()
Initialise vectors for moves
and monitors
:
moves = VectorMoves()
monitors = VectorMonitors()
We’re staying simple and use Jukes Cantor (JC) here, setting the number of states to 4
, since DNA has 4 bases:
Q <- fnJC(4)
Q
As recommended in revbayes
, we’re setting an outgroup using the function clade
. We’ll supply it to dnUniformTopology
, which will assume all possible trees to be equally likely:
out_group = clade("Galeopterus_variegatus")
taxa <- data.taxa() # get taxon labels from data
topology ~ dnUniformTopology(taxa, outgroup=out_group)
The topology
is a stochastic node that we’re trying to optimise, thus we’re setting moves for it. Two of the possible moves we can use are mvNNI
(nearest-neighbour-interchange) and mvSPR
(subtree-prune and regraft). We can tune them to the size of our tree by using the number of taxa in the moves’ weight:
num_taxa <- data.ntaxa() # get number of taxa from data
moves.append( mvNNI(topology, weight=num_taxa) )
moves.append( mvSPR(topology, weight=num_taxa/10.0) ) # we don't want to move large chunks as often as nearest neighbours...
We can give the length of each branch in the tree its own exponential prior and move in a for
loop (which will be a plate in the DAG):
num_branches <- 2 * num_taxa - 3
for (i in 1:num_branches) {
br_lens[i] ~ dnExponential(10.0)
moves.append( mvScale(br_lens[i]) )
}
Just as with the net_diversification
rate in the diversification models, we can use deterministic nodes to keep track of useful things, such as overal tree length and the resulting tree (topology plus branch lenghts) - though we’ll also need the latter to pass on to the core function:
TL := sum(br_lens)
psi := treeAssembly(topology, br_lens)
We can now draw sequences from a distribution made from the tree psi
and the rate matrix Q
:
seq ~ dnPhyloCTMC(tree=psi, Q=Q, type="DNA")
As usual, we .clamp
our data to it, and wrap it all up in a model object:
seq.clamp(data)
mymodel = model(Q)
mymodel
For our monitors
, we have our usual ones to keep track of the mcmc
parameters (mnModel
) and screen output (mnScreen
), plus an additional one that will output the inferred/evaluated tree at each generation (mnFile
):
monitors.append( mnModel(filename="output/primates_cytb_JC.log", printgen=10) )
monitors.append( mnScreen(printgen=100, TL) )
monitors.append( mnFile(filename="output/primates_cytb_JC.trees", printgen=10, psi) )
Finally, we initialise the mcmc
and run it:
mymcmc = mcmc(mymodel, monitors, moves)
mymcmc.run(generations=20000)
We can inspect the output file primates_cytb_JC.log
e.g. in Tracer. To get one single tree, we can load the trace back in, and use mapTree()
to get a MAP ( maximum a posteriori ) tree:
treetrace = readTreeTrace("output/primates_cytb_JC.trees", treetype="non-clock")
map_tree = mapTree(treetrace,"output/primates_cytb_JC_MAP.tree")
We can also use that object to e.g. get the probability that the Lemurs are monophyletic:
Lemuroidea <- clade("Cheirogaleus_major",
"Daubentonia_madagascariensis",
"Lemur_catta",
"Lepilemur_hubbardorum",
"Microcebus_murinus",
"Propithecus_coquereli",
"Varecia_variegata_variegata")
treetrace.cladeProbability( Lemuroidea )
Consult the other tutorials for different substitution models, alternative priors for tree, and branch lengths, as well as settings like gamma site variation, invariant sites, or partitioning.
Once we have a tree with branch lengths, we can use different ways to transform the latter from number of changes, to time units.
We start again by loading sequence data, assigning helper variables, and initialising moves and monitors:
cytb <- readDiscreteCharacterData("data/bears_cytb.nex")
n_taxa <- cytb.size()
taxa <- cytb.taxa()
moves = VectorMoves()
monitors = VectorMonitors()
We could probably try and date a fixed tree, but we shall jump straight in and date the tree jointly with the tree inference. Thus we set up substitution and tree/bl models as before:
This time, we go for a GTR plus Gamma model, which consists of four stationary frequencies (sf
) and six exchangeability rates (er
), which we will draw from an uniform Dirichlet prior, and assign the appropriate moves
:
sf_hp <- v(1,1,1,1)
sf ~ dnDirichlet(sf_hp)
er_hp <- v(1,1,1,1,1,1)
er ~ dnDirichlet(er_hp)
moves.append( mvBetaSimplex(er, alpha=10.0, weight=3.0) )
moves.append( mvBetaSimplex(sf, alpha=10.0, weight=2.0) )
Now we can pass these to the GTR function to get our Q
matrix:
Q_cytb := fnGTR(er,sf)
Finally, we create a parameter alpha
for the shape of the gamma distribution, which is then set to be discretized into 4 categories:
alpha_cytb ~ dnUniform(0.0,1E6)
alpha_cytb.setValue( 1.0 )
moves.append( mvScale(alpha_cytb, lambda=0.5, tune=true, weight=2.0) )
rates_cytb := fnDiscretizeGamma(alpha_cytb, alpha_cytb, 4)
Here, instead of just assuming all possible tree topologies to be equally likely, we shall actually use a birth-death model of speciation and extinction to model the proces that generated the tree.
As we’re used to, we’re setting priors and moves for a speciation and extinction rate:
speciation_rate ~ dnExponential(10)
extinction_rate ~ dnExponential(10)
moves.append( mvScale(speciation_rate, lambda=0.5, tune=true, weight=3.0) )
moves.append( mvScale(extinction_rate, lambda=0.5, tune=true, weight=3.0) )
We have all extant bears in our tree, thus we can set the sampling probability to 1
:
rho <- 1.0
We don’t know the root age of the tree, but since we’re only inferring relative ages, we can set that to 1
as well:
extant_mrca <- 1.0
Having all the parts of the birth-death model together, we can pass them to the corresponding function that lets us draw from a distribution of trees, but this time we assign it with a =
instead of an ~
. We’re also setting the condition
to nTaxa
, because we only want trees with as many extant species as we have in or data set:
tree_dist = dnBDP(lambda=speciation_rate, mu=extinction_rate, rho=rho, rootAge=extant_mrca, samplingStrategy="uniform", condition="nTaxa", taxa=taxa)
This is to allow us to include a topology constraint before drawing trees. We want the Ursidae to be monophyletic, so we use clade
to define that group, and then the function for drawing constrained topologies to finally draw trees.
clade_ursinae = clade("Melursus_ursinus", "Ursus_arctos", "Ursus_maritimus", "Helarctos_malayanus", "Ursus_americanus", "Ursus_thibetanus")
constraints = v(clade_ursinae)
timetree ~ dnConstrainedTopology(tree_dist, constraints=constraints)
Note that this time, we’re not clamping any data to the timetree
, since our data isn’t a tree that we’re trying to fit a bd model to, but instead this is simply our prior for the tree estimation!
We add a number of tree moves again, including some for topology (mvFNPR
), and node ages (mvNodeTimeSlideUniform
):
moves.append( mvNarrow(timetree, weight=n_taxa) )
moves.append( mvFNPR(timetree, weight=n_taxa/4) )
moves.append( mvNodeTimeSlideUniform(timetree, weight=n_taxa) )
moves.append( mvSubtreeScale(timetree, weight=n_taxa/5.0) )
Finally, we can monitor the age of our constrained clade, e.g. to see it progress in the analysis:
age_ursinae := tmrca(timetree, clade_ursinae)
As the simplest clock model, we can take a global molecular clock, which assumes one single rate across all branches and constant through time. We can set an exponential prior with a rate parameter of 10:
branch_rates ~ dnExponential(10.0)
moves.append( mvScale(branch_rates, lambda=0.5, tune=true, weight=3.0) )
At last, we can link the tree model, substitution model, and clock model to the core function that gives us the distribution of sequences, and clamp the data to it:
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q_cytb, siteRates=rates_cytb, branchRates=branch_rates, type="DNA")
phySeq.clamp(cytb)
As usual, we wrap it all up in a model
object, add the monitors
, create an mcmc
object and run it. Again we use the mnFile
monitor to save the trees at each generation:
mymodel = model(sf)
monitors.append( mnModel(filename="output/bears_global.log", printgen=10) )
monitors.append( mnFile(filename="output/bears_global.trees", printgen=10, timetree) )
monitors.append( mnScreen(printgen=10, extant_mrca, age_ursinae, branch_rates) )
mymcmc = mcmc(mymodel, monitors, moves, nruns=2, combine="mixed")
mymcmc.run(generations=20000, tuningInterval=200)
Once this is run, we can again load the trace, and summarize the tree topologies and node ages from the mcmc to one tree. This time we want the Maximum Clade Credibility tree:
trace = readTreeTrace("output/bears_global.trees")
mccTree(trace, file="output/bears_global.mcc.tre")
The most convenient way of trying those different clock models is to turn the different separate models (clock, tree, substitution) and runfile into different files to be sourced.
We might think that a global clock across the whole tree might not be a very realistic assumption - clearly the evolutionary rates could have been higher or lower at some times or for some taxa?
We can relax the constant rate assumption by changing the clock model from estimating a global branch rate to estimating a mean branch rate and individual rates for each branch.
We’d also change the output-names in our runfile, in order to not overwrite the previous ones:
monitors.append( mnModel(filename="output/bears_relaxed_exponential.log", printgen=10) )
monitors.append( mnFile(filename="output/bears_relaxed_exponential.trees", printgen=10, timetree) )
trace = readTreeTrace("output/bears_relaxed_exponential.trees")
mccTree(trace, file="output/bears_relaxed_exponential.mcc.tre" )
Eventually, and if possible, we’d probably want to estimate absolute node ages (in million years), instead of relative ones, possibly using some fossil calibrations, similar to what one would do in BEAST.
If we just want to extend our last model to include two node calibrations, we can do so by simply modifying the tree model.
We want to constrain the age of the root, that is the MRCA of extant bears. We know from fossils that the earliest crown group bear appeared first at 1.84Ma, meaning the MRCA of living bears can’t be younger than that. To get a maximum bound on that node, we can make use of a secondary calibration, meaning an age taken from another dated phylogeny - here the estimage of the age of Caniforms (a clade that includes the bears) at 49Ma. If that higher taxonomic group is at least that old, the bears, which are nested within them, can’t be older.
Since we don’t know anything about where within that bound the MRCA of bears should fall, we will put a uniform prior on that node. We’ll replace the object extant_mrca
from before with this uniform prior, and add a move as well:
extant_mrca_min <- 1.84
extant_mrca_max <- 49.0
extant_mrca ~ dnUniform(extant_mrca_min, extant_mrca_max)
moves.append( mvScale(extant_mrca, lambda=1, tune=true, weight=5.0) )
Now we also want to constrain the previously defined clade Ursinae. We’ve already set the clade constraint, and also assigned the node age_ursinae
that keeps track of the age of that node. So now we just have to put a prior on that node.
The earliest fossil crown group bear is also the earliest member of Ursinae we know off, wo we’ll set an exponential prior that starts at that fossil’s age ( i.e. the exponential distribution is offset by the fossil age). This means it won’t be able to be younger than this age. Unlike other tree dating tools, the fossil age is treated like data, thus we create a stochastic (DAG-)node for the age of this (tree-)node, and clamp the known minimal age to it:
obs_age_ursinae ~ dnExponential(1.0, offset = -age_ursinae)
obs_age_ursinae.clamp(-1.84)
Again we change the output names in our runfile, and we’re ready to go:
monitors.append( mnModel(filename="output/bears_nodedate.log", printgen=10) )
monitors.append( mnFile(filename="output/bears_nodedate.trees", printgen=10, timetree) )
trace = readTreeTrace("output/bears_nodedate.trees")
mccTree(trace, file="output/bears_nodedate.mcc.tre" )