lorad-gtrig-vignette

In this vignette, we explain how to estimate the marginal likelihood using the LoRaD method (Wang et al. 2023) for a sample of parameters from a fixed tree topology and using the GTR+I+G model.

Data

The data set green5.nex comprises the following rbcL sequences:

Green Plant Species GenBank Accession Number
Chara connivens L13476
Sphagnum palustre L13485
Asplenium nidus AF525270
Picea pungens AF456382
Avena sativa L15300

The alignment of these five DNA sequences is 1296 sites in length.

Model

The program RevBayes (version 1.2.1) was used to obtain a sample from the Bayesian posterior distribution under a GTR+I+G model. This model has 6 parameters, 3 of which are multivariate, yielding a total model dimension of 18:

Parameter Dimension Prior Description
alpha 1 InverseGamma(1, 0.1) Shape parameter of the among-site rate heterogeneity Gamma distribution
pinvar 1 Beta(1, 1) Proportion of invariable sites
tree length 1 Gamma(1, 0.1) Sum of all 2*n-3 = 7 edge lengths in the unrooted phylogenetic tree (n = 5 sequences)
edge length proportions 7 Dirichlet(1,1,1,1,1,1,1) Proportion of tree length accounted for by each edge length
exchangeabilities 5 Dirichlet(1,1,1,1,1,1) The relative rate for each type of transition (rAC, rAG, rAT, rCG, rCT, rGT)
nucleotide relative frequencies 3 Dirichlet(1,1,1,1) The stationary distribution of nucleotide states (piA, piC, piG, piT)

Note: Gamma and InverseGamma distributions above use the shape, rate parameterization.

Reading in Data

First, load the lorad package.

library(lorad)

Read in the file containing the posterior sample. Note that RevBayes saves several columns that are functions of parameters (i.e. edgelens, site_rates). These parameters will be given a column specification of “ignore”.

# Dimensions of gtrigsamples
dim(gtrigsamples)
#> [1] 10001    35

# Column names of gtrigsamples
colnames(gtrigsamples)
#>  [1] "Iteration"                  "Posterior"                 
#>  [3] "Likelihood"                 "Prior"                     
#>  [5] "alpha"                      "edge_length_proportions.1."
#>  [7] "edge_length_proportions.2." "edge_length_proportions.3."
#>  [9] "edge_length_proportions.4." "edge_length_proportions.5."
#> [11] "edge_length_proportions.6." "edge_length_proportions.7."
#> [13] "edgelens.1."                "edgelens.2."               
#> [15] "edgelens.3."                "edgelens.4."               
#> [17] "edgelens.5."                "edgelens.6."               
#> [19] "edgelens.7."                "er.1."                     
#> [21] "er.2."                      "er.3."                     
#> [23] "er.4."                      "er.5."                     
#> [25] "er.6."                      "pi.1."                     
#> [27] "pi.2."                      "pi.3."                     
#> [29] "pi.4."                      "pinvar"                    
#> [31] "site_rates.1."              "site_rates.2."             
#> [33] "site_rates.3."              "site_rates.4."             
#> [35] "tree_length"

Column Specification

Next, we must create a named vector to associate each column name in params with a column specification. Here are the possible column specifications:

All columns labeled posterior will be summed to create the log posterior kernel (sum of log likelihood and log joint prior).

Here are the column specifications appropriate for this example. Note that we ignore the Likelihood and Prior columns because RevBayes has already combined these into the Posterior column.

Specification Column
Iteration iteration
Posterior posterior
Likelihood ignore
Prior ignore
alpha positive
edge_length_proportions.1. simplex
edge_length_proportions.2. simplex
edge_length_proportions.3. simplex
edge_length_proportions.4. simplex
edge_length_proportions.5. simplex
edge_length_proportions.6. simplex
edge_length_proportions.7. simplexfinal
edgelens.1. ignore
edgelens.2. ignore
edgelens.3. ignore
edgelens.4. ignore
edgelens.5. ignore
edgelens.6. ignore
edgelens.7. ignore
er.1. simplex
er.2. simplex
er.3. simplex
er.4. simplex
er.5. simplex
er.6. simplexfinal
pi.1. simplex
pi.2. simplex
pi.3. simplex
pi.4. simplexfinal
pinvar proportion
site_rates.1. ignore
site_rates.2. ignore
site_rates.3. ignore
site_rates.4. ignore
tree_length positive
# Create a named vector holding the column specifications
colspec <- c("Iteration"                  = "iteration", 
             "Posterior"                  = "posterior", 
             "Likelihood"                 = "ignore", 
             "Prior"                      = "ignore",
             "alpha"                          = "positive",          
             "edge_length_proportions.1." = "simplex",  
             "edge_length_proportions.2." = "simplex",  
             "edge_length_proportions.3." = "simplex",  
             "edge_length_proportions.4." = "simplex",  
             "edge_length_proportions.5." = "simplex",  
             "edge_length_proportions.6." = "simplex",  
             "edge_length_proportions.7." = "simplexfinal", 
             "edgelens.1."                = "ignore",               
             "edgelens.2."                = "ignore",               
             "edgelens.3."                = "ignore",                 
             "edgelens.4."                = "ignore",                 
             "edgelens.5."                = "ignore",                 
             "edgelens.6."                = "ignore",             
             "edgelens.7."                = "ignore",                 
             "er.1."                      = "simplex",                    
             "er.2."                      = "simplex",                      
             "er.3."                      = "simplex",                    
             "er.4."                      = "simplex",                      
             "er.5."                      = "simplex",                    
             "er.6."                      = "simplexfinal",                     
             "pi.1."                      = "simplex",                    
             "pi.2."                      = "simplex",                      
             "pi.3."                      = "simplex",                    
             "pi.4."                      = "simplexfinal",                     
             "pinvar"                     = "proportion",
             "site_rates.1."              = "ignore",
             "site_rates.2."              = "ignore",
             "site_rates.3."              = "ignore",
             "site_rates.4."              = "ignore",
             "tree_length"                = "positive")

Computing an Estimate of the Marginal Likelihood

To run the LoRaD function we need to specify the training fraction, the training sample selection method, and the coverage fraction. The training fraction and the coverage fraction must be between 0 and 1. The training sample selection method can be left (the first part of the sample), right (the second part of the sample), or random (a random part of the sample).

For this example we used 0.5 for the training fraction, 0.1 for the coverage fraction, and “left” for the training sample selection method.

results <- lorad_estimate(gtrigsamples, colspec, 0.5, "left", 0.1)
lorad_summary(results)
#> This is lorad 0.0.1.0 
#>     Parameter sample comprises 10001 sampled points
#>     Each sample point is a vector of 17 parameter values
#> 
#>    Training fraction is 0.5
#>     Coverage specified is 0.1
#>  
#> Partitioning samples into training and estimation:
#>     Sample size is 10001
#>     Training sample size is 5000
#>     Estimation sample size 5001
#>  
#> Processing training sample...
#>     Lowest radial distance is 3.005026965
#>     Log Delta -2.801976312
#>  
#> Processing estimation sample...
#>     Number of samples used is 448
#>     Nominal coverage specified is 0.100000
#>     Realized coverage is 0.089582
#>     Log marginal likelihood is -4440.521454
#> 

The log marginal likelihood is -4440.521454.

Literature Cited

Wang, Yu-Bo, Analisa Milkey, Aolan Li, Ming-Hui Chen, Lynn Kuo, and Paul O. Lewis. 2023. “LoRaD: Marginal Likelihood Estimation with Haste (but No Waste).” Systematic Biology 72 (3): 639–48. https://doi.org/10.1093/sysbio/syad007.