Orlando Schwery, 7. September 2021
This is based on the following tutorials:
Additional tutorials related to the topic that we won’t cover today:
We open RevBayes and set our working directory as we’re used to now:
rb # or maybe ./rb depending on your system
getwd()
setwd("/mnt/c/Users/oschwery/Documents/Teaching/RevBayes_Workshop_UI2021/3-BioGeo/")
This time we’re using example data from the Hawaiian Silversword Alliance. The data set we’re using right now is a simplified one where we only consider 4 areas: K (Kauai & Niihau), O (Oahu), M (Maui-Nui, Lanai & Molokai), and H (Hawaii).
# we start with a coding-practice that might be a bit gimmicky here, but can be useful otherwise:
# we make our file paths into variables which we can call when needed
# (there's also some merit to having all changeable variables of a script at the top...)
range_fn = "data/silversword.n4.range.nex"
tree_fn = "data/silversword.tre"
out_fn = "output/simple"
# we'll see later with the output, that we can use those variables as stubs to add on to
Now we use those to load our tree and data:
# we actually have to specify which tree with square brackets `[1]` even if we only got one
T = readTrees(tree_fn)[1]
dat_range_01 = readDiscreteCharacterData(range_fn)
dat_range_01
dat_range_01.show()
The data is in a ‘human-readable’ format of presence-absence strings for each species’ range, where every position stands for one of the areas. We’re now turning this into computer readable format (which is simply done by labelling them with natural numbers):
dat_range_n = formatDiscreteCharacterData(dat_range_01, "DEC")
# to compare:
dat_range_01[1]
dat_range_n[1]
# we save the number of different characters/areas for later:
n_areas = dat_range_01.nchar()
Essentially, each combination of four 0s and 1s now has it’s own ‘label’ in the form of a number. We can now output a file that lets us look up later which human-readable ranges correspond to which label (which can save us some sanity later):
# we get the list of states, notably from the natural-numbers object, which still kept track of them
state_desc = dat_range_n.getStateDescriptions()
# we start a string with column names:
state_desc_str = "state,range\n"
# then we loop over them and append the corresponding natural numbers from 0 to n
for (i in 1:state_desc.size()){
state_desc_str += (i-1) + "," + state_desc[i] + "\n"
}
# and write this to file
write(state_desc_str, file=out_fn+".state_labels.txt")
# or if we prefer to look at this in excel...
write(state_desc_str, file=out_fn+".state_labels.csv")
Note how we’ve used the output path we specified before as a stub and added on to it. All our files will start with simple
now, meaning we’ll know they belong to the same analysis.
The anagenetic rate matrix tells us the rates of dispersal between and local extirpation within the different ranges. For this simple model, we assume the rates to be constant through time, and to apply globally across all lineages in the tree. This means we’ll just have a dispersal rate
and an extirpation rate
.
For methods-y reasons, we’ll be using relative dispersal and extinction rates, and will add a biogeographic rate parameter (similar to a molecular clock in dating) later. Think about it as one part of the model determining what the balance between dispersal and extirpation is, and then another determining how often those then get to act along the branches.
# we therefore fix the dispersal rate...
dispersal_rate <- 1.0
# ...and populate a matrix (n-Areas x n-Areas) with it.
for (i in 1:n_areas) {
for (j in 1:n_areas) {
dr[i][j] <- dispersal_rate
}
}
dr
Now we create the same kind of matrix for extirpation, except that we only populate the diagonal, as extirpation happens within and not between the areas (and an extirpation event is limited to the areas and not ranges). This time we’re following the good practice to add moves to our parameters as we define them, so let’s initiate them first:
moves = VectorMoves()
Since the dispersal rate is fixed, we do try to infer the relative extirpation in relation to it. We set the prior in a way that we’re expecting it on average to be 1.0
, i.e. the same as the dispersal rate:
log_sd <- 0.5
log_mean <- ln(1) - 0.5*log_sd^2
extirpation_rate ~ dnLognormal(mean=log_mean, sd=log_sd)
moves.append( mvScale(extirpation_rate, weight=2) )
# and turn into a matrix
for (i in 1:n_areas) {
for (j in 1:n_areas) {
er[i][j] <- 0.0 # Zeros everywhere
}
er[i][i] := extirpation_rate # but the rate to be inferred in the diagonal
}
er
Note how the actual extirpation rate in the matrix is now a deterministic node based the stochastic node extirpation_rate
we crated before - we draw from the prior distribution and then plug whatever that is into the matrix.
Now we use a dedicated function to turn those into our relative rate matrix for DEC:
Q_DEC := fnDECRateMatrix(dispersalRates=dr, extirpationRates=er)
Q_DEC
If we inspect this matrix and compare it with the one from our slide ( i.e. from the Intro Tutorial), we can see how the values we set were used. The transition rates are in the upper diagonal, the extirpation rates are in the lower one. Note how the extirpation rates are either 0
or all the same (reflecting how we set them), while the transition rates are not just either 0
or 1.0
, but for larger ranges get to be 2.0
or 3.0
. The reason for this is that if a species e.g. moves from AB
to ABC
, then the rate of getting to C
is both the rate coming from A
and the rate coming from B
.
We could also look at the anagenetic transition probabilities for a given branch length now, e.g. 0.2
Q_DEC.getTransitionProbabilities(rate=0.2)
Note how some moves that were set to 0
in the rate matrix still have positive probabilities of happening. This is because over that branch length of 0.2
, more than one event can take place, so the probabilities for those moves are the result of two or more different moves happening (thus those probabilities are relatively low),
Now we need to include what happens biogeography-wise when a lineage splits into two. To this end we make a transition probability matrix (as opposed to a rate matrix before), consisting of the different types of cladogenic events we want to consider - in our case subset sympatry s
and allopatry a
- and the weights with which we’re expecting them to happen in relation to each other (equiprobable in our case). Other possible events could be full sympatry f
or jump dispersal j
. A dedicated function will then turn them into our probability matrix:
clado_event_types <- [ "s", "a" ]
clado_event_probs <- simplex(1, 1)
P_DEC := fnDECCladoProbs(eventProbs=clado_event_probs,
eventTypes=clado_event_types,
numCharacters=n_areas)
P_DEC
You can see that P_DEC
is essentially a list of possible combinations of ranges before and after the cladogenic event, together with the probability of that particular event to happen for that range.
Finally, we’re setting the biogeographic rate parameter mentioned above:
rate_bg ~ dnLoguniform(1E-4,1E2)
rate_bg.setValue(1E-2)
moves.append( mvSlide(rate_bg, weight=4) )
Again we are using a loguniform prior to be agnostic about the orders of magnitude. We are now expecting a uniform probability from 0.0001 to 100 events per million years.
Let’s also not forget to assign our tree as an actual (here constant) node to the model:
tree <- T
And at last, we combine our tree, matrices, and biogeographic rate into the distribution (dn
) for the phylogenetic Continuous Time Markow Chain (CTMC
) for Cladogenetic models, clamp our data to it, and package it all into a model object.
m_bg ~ dnPhyloCTMCClado(tree=tree,
Q=Q_DEC,
cladoProbs=P_DEC,
branchRates=rate_bg,
nSites=1,
type="NaturalNumbers")
m_bg
m_bg.clamp(dat_range_n)
m_bg
mymodel = model(m_bg)
Note how the number of Sites is 1
, and not 4
or 15
, because we have coded all 15 ranges as the states of one character.
We have set our moves already, so now we still need to set our monitors to save our outputs. We initialise a monitors vector and add the usual monitors for mnScreen
(so we can see what’s going on) and mnModel
(to save the inferred parameters):
monitors = VectorMonitors()
monitors.append( mnScreen(rate_bg, extirpation_rate, printgen=100) )
monitors.append( mnModel(file=out_fn+".params.log", printgen=10) )
But as I mentioned, there’s other types of monitors one can set, in order to save the corresponding output. Here, we may not only care about the inferred rates, but also the ancestral areas. Thus we set the monitor mnJointConditionalAncestralState
, which will at each generation of the MCMC sample ancestral states under the model and conditional on our tip-data. The result is a trace file where each node of the phylogeny has its ancestral states for each MCMC sample recorded.
The File
monitor will save the tree at each generation. This is more crucial for when we estimate the tree jointly too. Here, it will just save the same tree over and over, which will make summarizing the ancestral states easier. The mnStochasticCharacterMap
monitor will save stochastic character mappings at each generation. We could look at those to demonstrate how our ancestral area reconstruction differs from a mere trait reconstruction.
monitors.append( mnFile(tree, file=out_fn+".trees", printgen=10) )
monitors.append( mnJointConditionalAncestralState(tree=tree,
ctmc=m_bg,
filename=out_fn+".states.log",
type="NaturalNumbers",
printgen=10,
withTips=true,
withStartStates=true) )
monitors.append( mnStochasticCharacterMap(ctmc=m_bg,
filename=out_fn+".stoch.log",
printgen=100) )
Now we can combine our model, moves, and monitors to an MCMC object and run it:
mymcmc = mcmc(mymodel, moves, monitors)
mymcmc.run(3000)
Notably we didn’t have our nice model graph to go off of, but we can generate it using RevBayes from our actual model (which can be a nice way to confirm all is how we intended):
mymodel.graph(out_fn + ".DEC_dag.dot")
As RevBayes can’t plot itself, it outputs a file that can then be interpreted by other tools into a plot.
But now we only have raw outputs, and we want to actually associate those with our tree. We start a new RevBayes session:
rb # or maybe ./rb depending on your system
getwd()
setwd("/mnt/c/Users/oschwery/Documents/Teaching/RevBayes_Workshop_UI2021/3-BioGeo/")
We set our paths:
out_str = "output/simple"
out_state_fn = out_str + ".states.log"
out_tree_fn = out_str + ".trees"
out_mcc_fn = out_str + ".mcc.tre"
This step would be more useful if we had estimated the tree too and had to summarize it, but now it just gives us a file with the tree inthe right format:
tree_trace = readTreeTrace(file=out_tree_fn, treetype="clock")
tree_trace.setBurnin(0.25)
n_burn = tree_trace.getBurnin()
mcc_tree = mccTree(tree_trace, file=out_mcc_fn)
We load the node state trace:
state_trace = readAncestralStateTrace(file=out_state_fn)
We load the tree trace again, but this time as a different object:
tree_trace = readAncestralStateTreeTrace(file=out_tree_fn, treetype="clock")
And finally summarize the ancestral state results on the MCC tree:
anc_tree = ancestralStateTree(tree=mcc_tree,
ancestral_state_trace_vector=state_trace,
tree_trace=tree_trace,
include_start_states=true,
file=out_str+".ase.tre",
burnin=n_burn,
site=1)