Starting secsse

Thijs Janzen

2023-10-20

Secsse introduction

secsse is an R package designed for multistate data sets under a concealed state and speciation (‘hisse’) framework. In this sense, it is parallel to the ‘MuSSE’ functionality implemented in ‘diversitree’, but it accounts for finding possible spurious relationships between traits and diversification rates (‘false positives’, Rabosky & Goldberg 2015) by testing against a ‘hidden trait’ (Beaulieu et al. 2013), which is responsible for more variation in diversification rates than the trait being investigated.

Secsse input files

Similar to the ‘diversitree’ (Fitzjohn et al. 2012) and ‘hisse’ (Beaulieu & O’Meara 2016) packages, secsse uses two input files: a rooted, ultrametric tree in nexus format (for conversion of other formats to nexus, we refer to the documentation in package ‘ape’) and a data file with two columns, the first containing taxa names and the second a numeric code for trait state with a header (usually 0, 1, 2, 3, etc., but notice that ‘NA’ is a valid code too, if you are not sure what trait state to assign to a taxon). Here, we will use a simple trait dataset with only values 0 and 1, indicating presence and absence of a trait. A comma-separated value file (.csv) generated in MsExcel works particularly well. The *.csv file can be loaded into R using the read.csv() function. and should look like this:

library(secsse)
data(traits)
tail(traits)
##     species trait
## t46     t46     1
## t56     t56     1
## t7       t7     0
## t10     t10     0
## t24     t24     0
## t4       t4     0

This data set (here we see only the bottom lines of the data frame) has two character states labeled as 0 and 1. Ambiguity about trait state (you are not sure which trait state to assign a taxon too, or you have no data on trait state for a particular taxon), can be assigned using ‘NA’. secsse handles ‘NA’ differently from a full trait state, in that it assigns probabilities to all trait states for a taxon demarcated with ‘NA’.

The second object we need is an ultrametric phylogenetic tree, that is rooted and has labelled tips. One can load it in R by using read.nexus(). In our example we load a prepared phylogeny named “phylo_vignette”:

data("phylo_vignette")

For running secsse it is important that tree tip labels agree with taxon names in the data file, but also that these are in the same order. For this purpose, we run the following piece of code prior to any analysis:

sorted_traits <- sortingtraits(traits, phylo_vignette)

If there is a mismatch in the number of taxa between data and tree file, you will receive an error message. However, to then identify which taxa are causing issues and if they are in the tree or data file, you can use the name.check function in the ‘geiger’(Harmon et al. 2008) package:

library(geiger)
## Loading required package: ape
## Loading required package: phytools
## Loading required package: maps
#pick out all elements that do not agree between tree and data
mismat <- name.check(phylo_vignette, traits)
#this will call all taxa that are in the tree, but not the data file
#mismat$tree_not_data
#and conversely,
#mismat$data_not_tree

If you have taxa in your tree file that do not appear in your trait file, it is worth adding them with value NA for trait state. You can visualise the tip states using the package diversitree:

if (requireNamespace("diversitree")) {
  for_plot <- data.frame(trait = traits$trait,
                         row.names = phylo_vignette$tip.label)
diversitree::trait.plot(phylo_vignette, dat = for_plot,
                        cols = list("trait" = c("blue", "red")),
                        type = "p")
}
## Loading required namespace: diversitree

After you are done properly setting up your data, you can proceed to setting parameters and constraints.

Note on assigning ambiguity to taxon trait states

If the user wishes to assign a taxon to multiple trait states, because he/she is unsure which state best describes the taxon, he/she can use NA. NA is used when there is no information on possible state at all; for example when a state was not measured or a taxon is unavailable for inspection. NA means a taxon is equally likely to pertain to any state. In case the user does have some information, for example if a taxon can pertain to multiple states, or if there is uncertainty regarding state but one or multiple states can with certainty be excluded, secsse offers flexibility to handle ambiguity. In this case, the user only needs to supply a trait file, with at least four columns, one for the taxon name, and three for trait state. Below, we show an example of what the trait info should be like (the column with species’ names has been removed). If a taxon may pertain to trait state 1 or 3, but not to 2, the three columns should have at least the values 1 and a 3, but never 2 (species in the third row). On the other hand, the species in the fifth row can pertain to all states: the first column would have a 1, the second a 2, the third a 3 (although if you only have this type of ambiguity, it is easier to assign NA and use a single-column data file).

#       traits traits traits
# [1,]      2      2      2
# [2,]      1      1      1
# [3,]      2      2      2
# [4,]      3      1      1
# [5,]      1      2      3

Setting up an analysis

To perform a Maximum Likelihood analysis, secsse makes use of the function DDD::optimize(), which in turn, typically, uses the subplex package to perform the Maximum Likelihood optimization. In such an analysis, we need to specify which parameters we want to optimize, which parameters to keep fix, and the initial values per parameter. We do so by providing the structure of the input parameters (e.g. in vector, matrix or list form), and within this structure we highlight values that stay at zero with a 0, and parameters to be inferred with indexes 1, 2, … n. The optimizer will then use these indexes to fill in the associated parameters and perform the optimization. If this all seems a bit unclear, please continue reading and look at the fully set up parameterization for the maximum likelihood below to gain more insight.

ETD

In the ETD model, we assume that the examined trait affects diversification. In a secsse analysis we need to specify the structure of three distinct properties: the lambda list, the mu vector and the transition (Q) matrix. Each of these informs properties of the model of speciation, extinction and trait-shifts respectively.

Lambda matrices

Speciation in a secsse model is defined using a list of matrices, where each matrix highlights the state of the daughter species resulting from a speciation event. In our case, we have a trait with two states, and thus we will have to specify a list with two matrices, one for each state, where each matrix in turn will then specify the daughter states. We can do so by hand, but secsse includes functionality to do this in a more organized manner - this is especially useful if you have a trait with more than two states for instance. In this more organized manner, we can provide secsse with a matrix specifying the potential speciation results, and secsse will construct the lambda list accordingly:

spec_matrix <- c()
spec_matrix <- rbind(spec_matrix, c(0, 0, 0, 1))
spec_matrix <- rbind(spec_matrix, c(1, 1, 1, 2))
lambda_list <- secsse::create_lambda_list(state_names = c(0, 1),
                                          num_concealed_states = 2,
                                          transition_matrix = spec_matrix,
                                          model = "ETD")
lambda_list
## $`0A`
##    0A 1A 0B 1B
## 0A  1  0  0  0
## 1A  0  0  0  0
## 0B  0  0  0  0
## 1B  0  0  0  0
## 
## $`1A`
##    0A 1A 0B 1B
## 0A  0  0  0  0
## 1A  0  2  0  0
## 0B  0  0  0  0
## 1B  0  0  0  0
## 
## $`0B`
##    0A 1A 0B 1B
## 0A  0  0  0  0
## 1A  0  0  0  0
## 0B  0  0  1  0
## 1B  0  0  0  0
## 
## $`1B`
##    0A 1A 0B 1B
## 0A  0  0  0  0
## 1A  0  0  0  0
## 0B  0  0  0  0
## 1B  0  0  0  2

Let’s see what the code has done. First, we create a spec_matrix, where the first column indicates the parent species (0 or 1) and the second and third column indicate the identities of the two daughter species. In this case, we choose for symmetric speciation without a change of trait, e.g. the daughters have the same trait as the parent. If you have evidence of perhaps asymmetric inheritance, you can specify this here. The fourth column indicates the associated rate indicator. In this case we choose two different speciation rates. We choose two concealed states, as it is good practice to have the same number of concealed states as observed states. The resulting lambda_list then contains four entries, one for each unique state (see the names of the entries in the list), that is, for each combination of observed and concealed states, where the concealed states are indicated with a capital letter. Looking at the first entry in the list, e.g. the result of a speciation event starting with a parent in state 0A, will result with rate 1 in two daughter species of state 0A as well. The way to read this, is by looking at the row and column identifiers of the entered rate. Similarly, for a speciation event starting in state 1A (lambda_list[[2]]), the two daughter species are 1A as well, but this time with rate 2, as we specified that species with trait 1 will have a different speciation rate. Note that here, rates 1 and 2 are ordered with the observed trait, we will later explore the CTD model, where the rates will be sorted according to the concealed state.

Mu vector

Having the speciation rates set, we can move on to extinction rates. Since we are using the ETD model, here we also expect the extinction rates to be different:

mu_vec <- secsse::create_mu_vector(state_names = c(0, 1),
                                   num_concealed_states = 2,
                                   model = "ETD",
                                   lambda_list = lambda_list)
mu_vec
## 0A 1A 0B 1B 
##  3  4  3  4

The function create_mus_vector() takes the same standard information we provided earlier, with as addition our previously made lambda_list. It uses the lambda_list to identify the rate indicators (in this case 1 and 2) that are already used and to thus pick new rates. We see that secsse has created a named vector with two extinction rates (3 and 4), which are associated with our observed traits 0 and 1.

Transition matrix

Lastly, we need to specify our transition matrix. Often, Q matrices can get quite large and complicated, the more states you are analyzing. We have devised a tool to more easily put together Q matrices. This tool starts from the so-called shift_matrix, the basic matrix in which we only find information on transitions between examined states. The information contained in this shift_matrix is then automatically mimicked for inclusion in the full matrix, to ensure that the same complexity in examined state transitions is also found in concealed states. Instead of specifying the entire shift_matrix, instead it suffices to only specify the non-zero transitions. In this case these are from state 0 to 1, and vice versa:

shift_matrix <- c()
shift_matrix <- rbind(shift_matrix, c(0, 1, 5))
shift_matrix <- rbind(shift_matrix, c(1, 0, 6))

q_matrix <- secsse::create_q_matrix(state_names = c(0, 1),
                                    num_concealed_states = 2,
                                    shift_matrix = shift_matrix,
                                    diff.conceal = TRUE)
q_matrix
##    0A 1A 0B 1B
## 0A NA  5  7  0
## 1A  6 NA  0  7
## 0B  8  0 NA  5
## 1B  0  8  6 NA

Thus, we first specify a matrix containing the potential state transitions, here 0->1 and 1->0. Then, we use create_q_matrix() to create the q-matrix. By setting diff.conceal to TRUE, we ensure that the concealed states will get their own rates specified. Setting this to FALSE would set their rates equal to the observed rates (5 and 6). The way to read the transition matrix is column-row, e.g. starting at state 0A, with rate 5 the species will shift to state 1A and with rate 7 it will shift to state 0B. We intentionally ignore ‘double’ shifts, e.g. from 0A to 1B, where both the observed and the concealed trait shift at the same time. If you have good evidence to include such shifts in your model, you can modify the trans_matrix by hand of course.

Maximum Likelihood

We have now specified the required ingredients to perform Maximum Likelihood analyses. Prerequisites for performing Maximum Likelihood analyses with secsse are that we specify the ids of the rates we want optimized, and provide initial values. We can do so as follows:

idparsopt <- 1:8 # our maximum rate parameter was 8
idparsfix <- c(0) # we want to keep all zeros at zero
initparsopt <- rep(0.1, 8)
initparsfix <- c(0.0) # all zeros remain at zero.
sampling_fraction <- c(1, 1)

Here, we specify that we want to optimize all parameters with rates 1, 2, …, 8. We set these at initial values at 0.1 for all parameters. Here, we will only use one starting point, but in practice it is often advisable to explore multiple different initial values to avoid getting stuck in a local optimum and missing the global optimum. idparsfix and initparsfix indicate that all entries with a zero are to be kept at the value zero. Lastly, we set the sampling fraction to be c(1, 1), this indicates to secsse that we have sampled per trait all species with that trait in our dataset. Alternatively, if we know that perhaps some species with trait 0 are missing, we could specify that as c(0.8, 1.0). Thus, note that the sampling fraction does not add up to 1 across traits, but within traits.

And now we can perform maximum likelihood:

idparslist <- list()
idparslist[[1]] <- lambda_list
idparslist[[2]] <- mu_vec
idparslist[[3]] <- q_matrix

answ <- secsse::cla_secsse_ml(phy = phylo_vignette,
                              traits = traits$trait,
                              num_concealed_states = 2,
                              idparslist = idparslist,
                              idparsopt = idparsopt,
                              initparsopt = initparsopt,
                              idparsfix = idparsfix,
                              parsfix = initparsfix,
                              sampling_fraction = sampling_fraction,
                              verbose = FALSE)
## Warning in check_ml_conditions(traits, idparslist, initparsopt, idparsopt, :
## Note: you set some transitions as impossible to happen.

We can now extract several pieces of information from the returned answer:

ML_ETD <- answ$ML
ETD_par <- secsse::extract_par_vals(idparslist, answ$MLpars)
ML_ETD
## [1] -96.32138
ETD_par
## [1] 4.429929e-01 8.810607e-01 5.201400e-07 7.764175e-07 7.770646e-02
## [6] 1.570195e-09 1.410419e-01 6.549122e-02
spec_rates <- ETD_par[1:2]
ext_rates <- ETD_par[3:4]
Q_Examined <- ETD_par[5:6]
Q_Concealed <- ETD_par[7:8]
spec_rates
## [1] 0.4429929 0.8810607
ext_rates
## [1] 5.201400e-07 7.764175e-07
Q_Examined
## [1] 7.770646e-02 1.570195e-09
Q_Concealed
## [1] 0.14104187 0.06549122

The function extract_par_vals() goes over the list answ$MLpars and places the found parameter values back in consecutive vector 1:8 in this case. Here, we find that the speciation rate of trait 1 is higher than the speciation rate of trait 0.

CTD

Let’s compare our findings with a CTD model, e.g. a model centered around the concealed trait. Again, we need to specify our lambda list, mu vector and transition matrix. We will see that this is quite straightforward now that we have gotten the hang of how this works.

Lambda matrices

Again, we specify two distinct rates, indicating that the observed state inherits faithfully to the daughter species. However, this time, we set the model indicator to “CTD”:

spec_matrix <- c()
spec_matrix <- rbind(spec_matrix, c(0, 0, 0, 1))
spec_matrix <- rbind(spec_matrix, c(1, 1, 1, 2))
lambda_list <- secsse::create_lambda_list(state_names = c(0, 1),
                                          num_concealed_states = 2,
                                          transition_matrix = spec_matrix,
                                          model = "CTD")
lambda_list
## $`0A`
##    0A 1A 0B 1B
## 0A  1  0  0  0
## 1A  0  0  0  0
## 0B  0  0  0  0
## 1B  0  0  0  0
## 
## $`1A`
##    0A 1A 0B 1B
## 0A  0  0  0  0
## 1A  0  1  0  0
## 0B  0  0  0  0
## 1B  0  0  0  0
## 
## $`0B`
##    0A 1A 0B 1B
## 0A  0  0  0  0
## 1A  0  0  0  0
## 0B  0  0  2  0
## 1B  0  0  0  0
## 
## $`1B`
##    0A 1A 0B 1B
## 0A  0  0  0  0
## 1A  0  0  0  0
## 0B  0  0  0  0
## 1B  0  0  0  2

The resulting lambda_list now has the chosen rates 1 and 2 sorted differently across the matrices, with matrices 1 and 2 containing rate 1, and matrices 3 and 4 containing rate 2. Looking at the column names of the matrices, states 1 and 2 are states 0A and 1A, and states 3 and 4 are states 0B and 1B, in other words, speciation rate 1 is now associated with all states with concealed state A, and speciation rate 2 is now associated with all states with concealed state B.

Mu vector

For the mu vector, we repeat the same we did for the ETD model:

mu_vec <- secsse::create_mu_vector(state_names = c(0, 1),
                                   num_concealed_states = 2,
                                   model = "CTD",
                                   lambda_list = lambda_list)
mu_vec
## 0A 1A 0B 1B 
##  3  3  4  4

Here, again, we see that whereas previously extinction rate 3 was associated with states 0A and 0B (e.g. all states with state 0), it is now associated with states 0A and 1A, e.g. all states associated with concealed state A.

Transition matrix

Setting up the transition matrix is not different from the ETD model, the same transitions are possible:

shift_matrix <- c()
shift_matrix <- rbind(shift_matrix, c(0, 1, 5))
shift_matrix <- rbind(shift_matrix, c(1, 0, 6))

q_matrix <- secsse::create_q_matrix(state_names = c(0, 1),
                                    num_concealed_states = 2,
                                    shift_matrix = shift_matrix,
                                    diff.conceal = TRUE)
q_matrix
##    0A 1A 0B 1B
## 0A NA  5  7  0
## 1A  6 NA  0  7
## 0B  8  0 NA  5
## 1B  0  8  6 NA

Maximum Likelihood

Now that we have specified our matrices, we can use the same code we used for the ETD model to perform our maximum likelihood:

idparsopt <- 1:8 # our maximum rate parameter was 8
idparsfix <- c(0) # we want to keep all zeros at zero
initparsopt <- rep(0.1, 8)
initparsfix <- c(0.0) # all zeros remain at zero.
sampling_fraction <- c(1, 1)

idparslist <- list()
idparslist[[1]] <- lambda_list
idparslist[[2]] <- mu_vec
idparslist[[3]] <- q_matrix

answ <- secsse::cla_secsse_ml(phy = phylo_vignette,
                              traits = traits$trait,
                              num_concealed_states = 2,
                              idparslist = idparslist,
                              idparsopt = idparsopt,
                              initparsopt = initparsopt,
                              idparsfix = idparsfix,
                              parsfix = initparsfix,
                              sampling_fraction = sampling_fraction,
                              verbose = FALSE)
## Warning in check_ml_conditions(traits, idparslist, initparsopt, idparsopt, :
## Note: you set some transitions as impossible to happen.
ML_CTD <- answ$ML
CTD_par <- secsse::extract_par_vals(idparslist, answ$MLpars)
ML_CTD
## [1] -98.41269
CTD_par
## [1] 1.967792e+00 2.939082e-01 3.637021e-04 4.716578e-05 7.756585e-02
## [6] 6.941423e-07 1.319752e+01 3.715673e+00
spec_rates <- CTD_par[1:2]
ext_rates <- CTD_par[3:4]
Q_Examined <- CTD_par[5:6]
Q_Concealed <- CTD_par[7:8]
spec_rates
## [1] 1.9677916 0.2939082
ext_rates
## [1] 3.637021e-04 4.716578e-05
Q_Examined
## [1] 7.756585e-02 6.941423e-07
Q_Concealed
## [1] 13.197515  3.715673

Here we now find that state A has a very low speciation rate, in contrast to a much higher speciation rate for state B (remember that speciation rate 1 is now associated with A, and not with state 0!). Similarly, extinction rates for both states are also quite different, with state A having a much lower extinction rate than state B. Examined trait shifts (Q_Examined) are quite low, whereas concealed trait shifts seem to be quite high. The LogLikelihood seems to be lower than what we found for the ETD model.

CR

As a check, we will also fit a model where there is no trait effect - perhaps we are looking for an effect when there is none. This is always a good sanity check.

Lambda matrices

To specify the lambda matrices, this time we choose the same rate indicator across both states.

spec_matrix <- c()
spec_matrix <- rbind(spec_matrix, c(0, 0, 0, 1))
spec_matrix <- rbind(spec_matrix, c(1, 1, 1, 1))
lambda_list <- secsse::create_lambda_list(state_names = c(0, 1),
                                          num_concealed_states = 2,
                                          transition_matrix = spec_matrix,
                                          model = "CR")
lambda_list
## $`0A`
##    0A 1A 0B 1B
## 0A  1  0  0  0
## 1A  0  0  0  0
## 0B  0  0  0  0
## 1B  0  0  0  0
## 
## $`1A`
##    0A 1A 0B 1B
## 0A  0  0  0  0
## 1A  0  1  0  0
## 0B  0  0  0  0
## 1B  0  0  0  0
## 
## $`0B`
##    0A 1A 0B 1B
## 0A  0  0  0  0
## 1A  0  0  0  0
## 0B  0  0  1  0
## 1B  0  0  0  0
## 
## $`1B`
##    0A 1A 0B 1B
## 0A  0  0  0  0
## 1A  0  0  0  0
## 0B  0  0  0  0
## 1B  0  0  0  1

Mu vector

The mu vector follows closely from this, having a shared extinction rate across all states:

mu_vec <- secsse::create_mu_vector(state_names = c(0, 1),
                                   num_concealed_states = 2,
                                   model = "CR",
                                   lambda_list = lambda_list)
mu_vec
## 0A 1A 0B 1B 
##  2  2  2  2

Transition matrix

We will use the same transition matrix as used before, although one could perhaps argue that without a trait effect, all rates in the transition matrix (both forward and reverse trait shifts) should share the same rate. Here, we will choose the more parameter-rich version (Home assignment: try to modify the code to perform an analysis in which all rates in the transition matrix are the same).

shift_matrix <- c()
shift_matrix <- rbind(shift_matrix, c(0, 1, 3))
shift_matrix <- rbind(shift_matrix, c(1, 0, 4))

q_matrix <- secsse::create_q_matrix(state_names = c(0, 1),
                                    num_concealed_states = 2,
                                    shift_matrix = shift_matrix,
                                    diff.conceal = TRUE)
q_matrix
##    0A 1A 0B 1B
## 0A NA  3  5  0
## 1A  4 NA  0  5
## 0B  6  0 NA  3
## 1B  0  6  4 NA

Maximum Likelihood

idparsopt <- 1:6 # our maximum rate parameter was 6
idparsfix <- c(0) # we want to keep all zeros at zero
initparsopt <- rep(0.1, 6)
initparsfix <- c(0.0) # all zeros remain at zero.
sampling_fraction <- c(1, 1)

idparslist <- list()
idparslist[[1]] <- lambda_list
idparslist[[2]] <- mu_vec
idparslist[[3]] <- q_matrix

answ <- secsse::cla_secsse_ml(phy = phylo_vignette,
                              traits = traits$trait,
                              num_concealed_states = 2,
                              idparslist = idparslist,
                              idparsopt = idparsopt,
                              initparsopt = initparsopt,
                              idparsfix = idparsfix,
                              parsfix = initparsfix,
                              sampling_fraction = sampling_fraction,
                              verbose = FALSE)
## Warning in check_ml_conditions(traits, idparslist, initparsopt, idparsopt, :
## Note: you set some transitions as impossible to happen.
ML_CR <- answ$ML
CR_par <- secsse::extract_par_vals(idparslist, answ$MLpars)
ML_CR
## [1] -99.64176
CR_par
## [1] 6.923591e-01 1.444426e-07 7.760335e-02 5.258368e-10 1.000000e-01
## [6] 1.000000e-01
spec_rate <- CR_par[1]
ext_rate <-  CR_par[2]
Q_Examined <- CR_par[3:4]
Q_Concealed <- CR_par[5:6]
spec_rate
## [1] 0.6923591
ext_rate
## [1] 1.444426e-07
Q_Examined
## [1] 7.760335e-02 5.258368e-10
Q_Concealed
## [1] 0.1 0.1

We now recover a non-zero extinction rate, and much higher transition rates for the concealed than for the observed states.

Model comparisong using AIC

Having collected the different log likelihoods, we can directly compare the models using AIC. Remembering that the AIC is 2k - 2LL, where k is the number of parameters of each model and LL is the Log Likelihood, we can calculate this as follows:

res <- data.frame(ll = c(ML_ETD, ML_CTD, ML_CR),
                  k  = c(8, 8, 6),
                  model = c("ETD", "CTD", "CR"))
res$AIC <- 2 * res$k - 2 * res$ll
res
##          ll k model      AIC
## 1 -96.32138 8   ETD 208.6428
## 2 -98.41269 8   CTD 212.8254
## 3 -99.64176 6    CR 211.2835

I can now reveal to you that the tree we used was generated using an ETD model, which we have correctly recovered!

Further help

If after reading these vignettes, you still have questions, please feel free to create an issue at the package’s GitHub repository https://github.com/rsetienne/secsse/issues or e-mail the authors for help with this R package. Additionally, bug reports and feature requests are welcome by the same means.

References

Beaulieu, J. M., O’Meara, B. C., & Donoghue, M. J. (2013). Identifying hidden rate changes in the evolution of a binary morphological character: the evolution of plant habit in campanulid angiosperms. Systematic biology, 62(5), 725-737.

Beaulieu, J. M., & O’Meara, B. C. (2016). Detecting hidden diversification shifts in models of trait-dependent speciation and extinction. Systematic biology, 65(4), 583-601.

FitzJohn, R. G. (2012). Diversitree: comparative phylogenetic analyses of diversification in R. Methods in Ecology and Evolution, 3(6), 1084-1092.

Harmon, L. J., Weir, J. T., Brock, C. D., Glor, R. E., & Challenger, W. (2008). GEIGER: investigating evolutionary radiations. Bioinformatics, 24(1), 129-131.

Rabosky, D. L., & Goldberg, E. E. (2015). Model inadequacy and mistaken inferences of trait-dependent speciation. Systematic Biology, 64(2), 340-355.