BCB 503: RevBayes Intro

Second session: Trait Evolution and MCMC

Orlando Schwery, 31 August 2021

This is based on the following tutorials:


Additional tutorials related to the topic that we won’t cover today:

Trait Evolution:

MCMC:


Simple BM

We open RevBayes and set our working directory as established last time:

rb                                         # or maybe ./rb depending on your system
getwd()
setwd("/mnt/c/Users/oschwery/Documents/Teaching/RevBayes_Workshop_UI2021/2-Traits/")


Data

Now we want to load our data. The files contain 66 trees and 66 matching character matrices, respectively. We only want the whales though (what else…):

# whales are set 11, so we make a variable for that, to call again whenever needed
dataset = 11
# we use the tree reading function to load the trees, subsetting for the one we want
T = readTrees("data/trees.nex")[dataset]
# we can look at it and see the raw tree - downside of RevBayes is that we can't plot it
T
# we do the same for the continuous character data...
data = readContinuousCharacterData("data/traits.nex")[dataset]
# and look at it too, though by default, we won't see the raw data but a summary
data

Rev thankfully confirms the successful loading of the data. Note that I used the = assignment here, since the data isn’t part of our model right now (though using <- would have worked too). But what if we want to see the actual data? How do we find help for that? The loading function gives us a good hint for what to search:

?ContinuousCharacterData    # This shows us a host of different things...

We can also see the same thing in the online documentation, where on the left we can see that ContinuousCharacterData is one of several Model Objects we can get info on. Further down there are entries for Distributions, Functions, Moves, Monitors, and Workspace Objects (we’ll get to know some of each along the way). So we could also look at the entry for Tree and find all kinds of useful tree utility functions. Sometimes browsing this list can let you find things you weren’t aware you needed…
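For example, a quick (and hedged) peek at the Tree entry suggests things like the following; the exact method names are best double-checked in the help page:

?Tree                       # the corresponding help entry for tree objects
T.ntips()                   # number of tips in our whale tree
T.taxa()                    # the taxa it contains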

Back to our character data: first, we can index it as we might be used to from R:

data[15]

The other entries are function names that are associated with this object type, and we can call them with object.function(arguments). Like so:

data.size()
data.taxa()
data.filename()             # RevBayes even remembers where the data came from...

# and finally:
data.show()                 # though this also demonstrates why the summary table is maybe a nicer default...


The BM Model

Now that we have a sense of how to handle our data, we can set up our simple BM model as we know it should look (from the DAG).

We want the rate parameter sigma2 to be drawn from a log-uniform distribution (the distribution is uniform on the log scale, which apparently represents ignorance about the order of magnitude of the rate…), which needs a lower and an upper limit, L and U. This is the whole top bit of the graph.

L <- 0.001
U <- 1                      # these are just fixed values, thus we specify them as constant nodes

sigma2 ~ dnLoguniform(L, U) # as a distribution, this is of course a stochastic node again

Now unlike in our previous toy examples, we do have a phylogeny here too. Since we assume it to be known without error, it is essentially just a value (or set of values) that we provide.

tree <- T

Note that the tree here is technically considered part of the model (like a parameter), rather than data.

The last bit is the node X, which is the actual phylogenetic BM model, where tree and sigma2 are used to evaluate the trait data:

X ~ dnPhyloBrownianREML(tree, branchRates=sigma2^0.5)

The gray shading of X in the DAG indicates that it is a clamped node, so we would clamp our data to it, but let’s take a moment to look at it a bit more first.

Note two things: X is a stochastic node, and (accordingly) the function we drew it from, by its nomenclature starting in dn*, is a distribution. So according to this, we should be able to draw something from it, and in fact, as we know from last time, we already have drawn something by initialising X! So let’s look at it:

X                   # looks familiar
data

X.show()

So apparently we’ve ‘drawn’ continuous character data from something like a BM distribution based on our tree and a rate parameter! We have essentially simulated trait data under BM on our tree. In fact, this is how RevBayes lets you build a model and then use it to both simulate and infer. However, usually, if we simulate, we would use a particular parameter value, rather than drawing from a distribution like here for sigma2, e.g. like so:

Y ~ dnPhyloBrownianREML(tree, branchRates=0.5)
Y

This might be a good moment to rethink what we’re doing and ask ourselves why we’ve been drawing parameters from distributions all along. We clearly don’t need to in order to simulate, where we ‘know’ which value to use, but in this particular case, sigma2 is actually our inference goal, and by drawing values from a distribution for it, we provide the MCMC with candidates to evaluate.

The distributions we’ve specified are our priors.
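Purely as an illustration (we don’t need this for our analysis), if we had different prior beliefs about sigma2, we could simply pick a different distribution, e.g. an exponential one; we use a throwaway name here so it doesn’t interfere with our actual model:

sigma2_alt ~ dnExponential(10.0)    # a hypothetical alternative prior on the rate
sigma2_alt                          # initialising it already drew a value, as before
clear(sigma2_alt)                   # remove it again so it doesn't linger in our workspace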

Now back to our BM - as we said, the gray X node needs our data clamped to it for us to make inferences, so let’s do that:

X.clamp(data)
X
data                # again we can see that X now has the same values as our data had
X.show()
data.show()

Now we want to wrap this all into one single object that contains all that belongs to our model:

mymodel = model(sigma2)

Note that again we’re using =, since mymodel isn’t PART of the model. The function model starts from sigma2 and traverses all the nodes attached to it (via <-, ~, :=, or .clamp()), thus finding our whole DAG. Let’s see that:

mymodel

We get a list of all the nodes, what kind they are, their values, and particularly also _parents and _children, which indicate which way the nodes are connected in the DAG. This is why the correct assignment of nodes is important! But uh oh, there are some bits that shouldn’t be there!

Because we made Y use sigma2 as well, it is now part of the same DAG and thereby of our model. We’ll have to remove it and re-initialise the model object.

clear(Y)
mymodel = model(sigma2)
mymodel

Now we should have one with 8 nodes instead of 11, and only the nodes we want:

  1. sigma2
  2. L
  3. U
  4. X
  5. tree
  6. X.siteRates
  7. <0x7fffc46f65b0>
  8. <0x7fffc32a34e0>

We don’t really recognize the entries past no. 5 though, so what are these?

A quick look at the documentation shows us that siteRates is an argument of dnPhyloBrownianREML that we didn’t use, and that apparently has the default value 1. We can also see here that there’s an argument used for simulations, nSites, which is by default 10 - this should explain why we got a set of 10 characters when we simulated values for X earlier!
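As a hedged aside, knowing about nSites, we could have simulated just a single character instead of ten, e.g. like this (using a throwaway node and a fixed rate again, so nothing gets attached to our model):

Z ~ dnPhyloBrownianREML(tree, branchRates=0.5, nSites=1)    # overriding the nSites default
Z
clear(Z)                                                    # tidy up again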

So after all, RevBayes does have some defaults…

What we don’t find here is a node of the same kind for the argument we did set, e.g. something like X.branchRates. But if we have a closer look at the two oddly named nodes, we’ll find that they represent what went in there: one is the 0.5 that we exponentiated sigma2 with, the other is the result of that calculation. Note the _parents/_children (or lack thereof) and the _type of those nodes!

This shows that it can be useful to assign all variables to objects and pass those to the functions, rather than the values directly.
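To sketch what that would look like (purely illustrative, since our X is already set up and clamped), we could have given the branch rate its own named deterministic node:

rate := sigma2^0.5                                 # a deterministic node with a readable name
# X ~ dnPhyloBrownianREML(tree, branchRates=rate)  # we would then pass the named node (not run here,
                                                   # so we don't overwrite our clamped X)
clear(rate)                                        # tidy up again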

This is now a great moment to introduce our next step, which is the MCMC analysis.


Running an MCMC

We still have our model object from before. Keeping in mind how MCMCs work, what else do we need to run one now?

First, we need a way for the ‘robot’ to move through parameter space; these are called moves. A glance at the documentation reveals that RevBayes offers a LOT of different moves to choose from, some of which are very specific to a particular type of parameter (and not all of them are really explained there…). Some of the ones we’ll commonly encounter are a sliding move (mvSlide, essentially a sliding-window move, where a new value is chosen from within a uniform distribution around the current value) and a scaling move (mvScale, where the new value is chosen by multiplying the current one by an exponentiated random number from a uniform distribution).

We first have to initialise a moves object, and then append a move for each parameter we want the MCMC to optimise:

moves    = VectorMoves()

moves.append( mvScale(sigma2, weight=1.0) )

The method append allows us to add on to moves without needing to know how many elements there already are. But we could also have used subsetting with square brackets, like this: moves[1] = mvScale(sigma2, weight=1.0).

We would usually initialise this earlier on, and append the corresponding moves right after specifying the respective nodes, to keep it all tidy. The argument weight determines how often, on average, this move will be used per iteration. This allows the different parameters we are trying to optimise to be updated at different rates. Note that you can have several different moves on the same parameter, if your analysis requires it.
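Purely as an illustration (we don’t need it for this analysis), a second move on sigma2 could be appended like so, with its own weight; if you do run this line, the MCMC will simply alternate between both moves according to their weights:

moves.append( mvSlide(sigma2, delta=0.1, weight=0.5) )    # a hypothetical additional sliding move on sigma2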

The second thing we need for our MCMC is a way to actually keep track of the results we’re getting whenever the MCMC evaluates a set of proposed parameters (i.e. the states of our Markov chain, or our trace). These are called monitors, and again, RevBayes offers us a good number of different ones to use (again, some rather specific to certain analyses). The most important ones, however, are definitely mnModel, which keeps track of the parameters in our model (and saves them to a file), and mnScreen, which sends us real-time updates of the same to the screen (as the name suggests).

Again, we first initialise a monitors object and then append all the monitors we need to it:

monitors = VectorMonitors()

monitors.append( mnModel(filename="output/simple_BM.log", printgen=10) )
monitors.append( mnScreen(printgen=1000, sigma2) )

In both cases, printgen determines the interval at which the results are saved. If we run Markov chains with a lot of generations, we’d rather not have every single step recorded, lest we drown in data. On the other hand, we want to save enough to be able to determine convergence, get a proper sense of the posterior distribution, etc. For the screen monitor, it’s more a question of our patience, and of getting a good sense of whether it is running and how fast.
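As another aside, if we only wanted to record specific variables rather than the whole model, there is (if I recall correctly) also mnFile, to which we pass the variables explicitly, e.g.:

# hypothetical extra log file tracking only sigma2
monitors.append( mnFile(filename="output/simple_BM_sigma2.log", printgen=10, sigma2) )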

Note that we want to save the log files into a new directory “output”, which Rev will automatically create if it doesn’t exist yet.

Now that we have our ingredients, we can get to the MCMC.

mymcmc = mcmc(mymodel, monitors, moves, nruns=2, combine="mixed")

The function mcmc doesn’t run the analysis, but rather creates an object that contains all the necessary info, and which we then run. Apart from the model, monitors, and moves, we opted to let it run two separate chains, and then combine those traces by mixing them.

If we look at the documentation for mcmc, we see that it allows for some more tweaking (e.g. heated chains), and that there are a number of methods we can then still apply to that object. The two most important ones for us now are .burnin, which runs a burnin phase that is then discarded, and .run, which simply runs the actual analysis. The latter also takes the argument generations, specifying how long our chain should be.

mymcmc.burnin(generations = 500, tuningInterval=100)

We notice that it runs the burnin as an extra step before the main analysis, instead of removing it from the whole chain after the fact.

mymcmc.run(generations=50000)

As a sanity check, Rev tells us what it’s running, how many replicates, and, importantly, also how many moves it uses and how often. We should now find a new directory called output, containing three log files: one for each separate chain, and a combined one. If you open them, you’ll notice that they contain one more entry than we might expect (or two more in the combined file), because in addition to the 5000 samples we expect (generations/printgen), each run starts with generation 0, the initial values.

If we’re interested in some more analytics, we could also get the operator summary now:

mymcmc.operatorSummary()

It shows us, for all the different moves (in our case only the scaling move on sigma2), how often they were accepted, what tuning parameters were used, etc. If your analysis comes out weird, checking whether the moves have acceptance rates around 0.4 is a good place to start looking into the problem.


Final MCMC Notes:

We should now check our output files to see whether our analysis went well, by checking for convergence, etc., and then actually analyse our trace to see what the posterior distribution of our parameters is, what that means, and so on. The tutorials on convergence testing and plotting with RevScripter give good advice for that, but the log files can e.g. also be loaded into programs like Tracer, or be analysed in R (e.g. with the coda package).
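Within RevBayes itself, a simple first step is to read the trace file back in with readTrace (a hedged sketch; see its help entry for the exact arguments and what can be done with the resulting object):

trace = readTrace("output/simple_BM.log")    # read the combined trace back into RevBayes
trace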

Another thing one might want to try is to run the MCMC without data, i.e. under the prior, to see how much the results are shaped by the choice of priors rather than by the data. For both .burnin and .run, this can easily be done by adding the argument underPrior=TRUE.
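Concretely, with the settings from above, that would look like this:

mymcmc.burnin(generations=500, tuningInterval=100, underPrior=TRUE)
mymcmc.run(generations=50000, underPrior=TRUE)    # same analysis, but ignoring the data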

If uncertain about the proper settings for your moves, e.g. the window size delta of mvSlide, or the proposal strength lambda of mvScale, you can auto-tune those parameters. In fact, we did that before. It has two components:

# 1. enable tuning for all the moves you want to (this is actually on by default)
moves.append( mvScale(sigma2, tune=TRUE, weight=1.0) )
# 2. set a tuning-interval during burnin
mymcmc.burnin(generations = 500, tuningInterval=100)

It is also possible to set a tuning interval during .run, and tune those parameters throughout the run, but I would use that with caution, because it can lead to unwanted behaviours…
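For completeness, that would presumably look something like this:

mymcmc.run(generations=50000, tuningInterval=200)    # keep tuning throughout the run - use with caution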