SVDQuartets Tutorial for Molecular Evolution Workshop 2024

David Swofford and Laura Kubatko
28 May 2024

Preliminaries

In the instructions which follow, the “$” represents the Unix prompt. For example, if the instruction says:
         $ cat filename.txt
you would type “cat filename.txt” and hit return. Likewise, “paup>” represents the prompt used by the PAUP* program.

If you have not already done so, install FigTree on your local machine (see instructions on Slack). FigTree is not a critical component of the tutorial however, so don’t sweat it if you run into problems.

Move to the directory with the SVDQuartets tutorial:

$ cd moledata/svdquartets_tutorial


Part 1: Analysis of a data set simulated from the “anomaly zone”

We will begin with an example used by Liu and Edwards (2009), calling attention to problems that can arise due to gene trees for individual loci conflicting with the species tree because of “incomplete lineage sorting” (ILS). When internal branches of the species tree are very short (or effective population sizes are extremely large), the most probable gene tree can be inconsistent with the species tree, which can cause “concatenation” methods for inferring the tree to fail. See Kubatko and Degnan (2007), Liu and Edwards (2009), and Roch and Steel (2015) for details.

The data file anomaly_zone.nex represents one replicate of a simulation described in Fig. 2 in Liu and Edwards (2009). This file contains data for 10,000 loci, with 500 sites per locus. We will use the PAUP* program (http://paup.phylosolutions.com) to analyze this data set (PAUP* will subsequently be referred to as PAUP, without the asterisk). It is often convenient to run PAUP from a Nexus-file script containing the commands used to perform the analysis, but we will usually issue the commands interactively here for pedagogic reasons.

Start PAUP and load the data file.

$ paup anomaly_zone.nex -L anomaly.log

The first command argument above is the name of the data file, and the “-L” option opens a log file that will preserve all of the output of your run (you can use any name you like).

We’ll begin with a “concatenated” maximum-likelihood analysis to demonstrate the problem. In order to perform ML in PAUP, you first need to estimate a substitution model, which specifies the rate of substitution from each nucleotide to each other nucleotide, the equilibrium base frequencies, and the distribution of rate variation across sites. PAUP provides a tool to automatically compare models using the AIC or BIC model selection criteria. A reasonable starting tree is needed (but the details of the tree topology have little effect on the chosen model). We’ll just do a quick neighbor-joining analysis using LogDet distances (which are a good general purpose distance when you don’t know what else to use):

paup> dset distance=logdet;
paup> nj;
paup> automodel;

Using the default options, automodel will evaluate a set of 56 models and choose the one with that minimizes the AIC. On completion, the model parameters will be set to their maximum-likelihood estimates, and we can use this as the starting point for a maximum-likelihood search:

paup> set criterion=likelihood;
paup> hsearch;

(Note that you can abbreviate PAUP commands as long as they do not become ambiguous. For instance, instead of the above, we could have instead used set crit=like; hs;)

One way to see the tree found by the ML search is to use the describetrees command, with branch lengths drawn proportionally to the estimated number of substitutions/site:

paup> describe/plot=p;

Note the relationships among the taxa implied by the topology of this tree. Now we’ll do an svdquartets analysis (Chifman and Kubatko, 2014, 2015) for the same data. For any command in PAUP, you can see the available options by typing “command-name ?;”

Before we start our next analysis, look at the options available for the svdquartets command:

paup> help svdquartets;

For example, you can see that it is possible to evaluate quartets exhaustively (“evalquartets=all”) or sample quartets randomly (“evalquartets=random nquartets=n”). Typically, you will leave most options set at their default settings, and enter values only for those you wish to override. One exception is that the current default setting for evalquartets is random (this may change in a future version), so we will explicitly request evaluation of all possible quartets. We’ll also do a bootstrap analysis to assess confidence in the result:

paup> svdq evalq=all bootstrap;

The analysis finishes quickly and outputs the estimated tree, which is the one compatible with all 5 of the inferred quartet relationships. In fact, this is the correct tree that generated the data (see Liu and Edwards, 2009). The concatenated ML method estimated the tree incorrectly despite the large number of sites (500,000). Write down the bootstrap proportions for the two internal nodes for discussion later.

ASTRAL (Mirarab et al., 2014, 2015) is a “summary” method for estimating species trees from a set of gene trees estimated for each locus. Although it is very fast once the gene trees have been estimated, the preliminary phase of gene tree estimation (outside of ASTRAL) can take a long time. We’ll use the RAxML program (https://sco.h-its.org/exelixis/web/software/raxml/index.html) to estimate these trees (PAUP could also be used to estimate the trees, but if you have a data set with many tips, RAxML is much faster, and this example will show you how to do it.)

PAUP now provides a wrapper for RAxML that simplifies the process of obtaining the gene trees. We wrote a small Python script to generate the PAUP commands; take a quick look at it:

paup> !cat astral_prep.py

Commands typed at the command line beginning with “!” are passed to a Unix shell, so this command just shows the contents of the file on your terminal. The Python program just loops through the loci and outputs commands that analyze the data for each locus.

You will remain in the shell until you hit the return key at the “!” prompt, so do this now. Now run the python script from the “paup>” prompt to generate a Nexus command file, and hit return to go back to PAUP.

paup> !python astral_prep.py > run_astral.nex

The created Nexus file simply defines a “character partition” specifying the site numbers for each locus and runs a set of three commands for each of the 10,000 loci, e.g.:
        include loci.locus_130/only;
        raxml;
        savetree file=az_tree.tre format=newick append;
For each locus, all of the sites except those belonging to that locus are excluded, RAxML is executed using PAUP’s current model settings, and the resulting tree is appended to a growing treefile that will ultimately contain one tree for each locus.

Now run the Nexus command file generated by the Python script (you will be warned that you are going to reset the active data file; this is fine). Running the full analysis would take several hours, so read ahead rather than waiting for the command to complete.

paup> execute run_astral.nex;

Abort the command by typing ctrl-C after a few trees have been generated. Scroll back in the output and notice how different the trees are from one locus to the next. This variation comes from two sources: (1) the true variation in gene trees coming from the multispecies coalescent process, and (2) error in estimating the gene trees using a relatively small amount of sequence information per locus.

We will pretend that the analysis actually completed, and then quit the program.

paup> quit;

As in all good cooking shows, we have pre-baked the result so that we can pull it out of the oven immediately. It is stored in the file az_10000_trees.tre, which contains one tree for each of the 10,000 trees, written in the “Newick” (nested parentheses) format. It looks like this:

        (A,(B,(D,E)),C);
        (A,((B,D),E),C);
        (A,((B,D),C),E);
        (A,(B,(D,E)),C);
        (A,((B,E),D),C);
             .
             .
             .

We’ll now run this file in ASTRAL. First, look at the options available:

$ astral --help

For now, we will simply use the default settings to estimate a species tree, using the treefile we generated above as input.

$ astral -i az_10000_trees.tre

The analysis runs quickly and outputs a few lines of output. The estimated species tree is found near the end of the output, and looks something like this:

(A,(B,(C,(E,D)1:0.025494729020277836)1:0.033915343495063886));

Like SVDQuartets, ASTRAL infers the (unrooted) species tree correctly. Note the line near the top: “All output trees will be arbitrarily rooted at A”. Both the input to and output from ASTRAL are unrooted trees; the rooting implied by the result is meaningless. In our case, the outgroup is tip E, so we would need to reroot the tree appropriately for a figure.

To view the tree, copy the output tree description the clipboard (e.g., ctrl-C), launch FigTree and paste the clipboard contents (e.g., ctrl-V). Select the terminal branch leading to tip E, and click the Reroot tool.

If you want to gain more familiarity with ASTRAL (after our practical session ends), you can run the tutorial at https://github.com/smirarab/ASTRAL/blob/master/astral-tutorial.md. On the MBL cluster, you can use the command “astral” rather than “java -jar astral”5.6.3.jar” as given in the tutorial.

Part 2: Analysis of a real data set

The data set is for members of the family Canidae (dogs, wolves, coyotes, foxes, jackals, etc.). Lindblad-Toh et al. (2005) sequenced a collection of 16 nuclear loci. The alignments used here are from a *BEAST2 tutorial written by Huw Ogilvie (https://github.com/genomescale/starbeast2/releases/download/v0.14.0/StarBEAST2-tutorial.zip). We reformatted them into the Nexus format and renamed the taxa from scientific to common names.

Start PAUP and load the canid sequence data file:

$ paup canids.nex -L canids.log

This file contains data for two individuals from each species (total 16 sequences), so we use a taxon-partition to assign each individual tip to a species. This definition is already in the file and you do not need to type it here. It looks like this:

    taxpartition species =
        sidestriped_jackal:  sidestriped_jackal_a sidestriped_jackal_b,
        African_golden_wolf: African_golden_wolf_a African_golden_wolf_b,
        coyote:              coyote_a coyote_b,
        gray_wolf:           gray_wolf_a gray_wolf_b,
        blackbacked_jackal:  blackbacked_jackal_a blackbacked_jackal_b,
        Ethiopian_wolf:      Ethiopian_wolf_a Ethiopian_wolf_b,
        Dhole:               Dhole_a Dhole_b,
        African_wild_dog:    African_wild_dog_a African_wild_dog_b
        ;


Analysis 1: Using SVDQuartets and qAge

NOTE for MOLE 2024: The version of PAUP* running on the MOLE virtual machines has a bug that sometimes causes crashes when multithreading is used with SVDQuartets. (This is probably a long-standing bug that we are are just noticing on very fast CPUs.) To avoid this problem, we will specify “nthreads=1” to suppress multithreading when running SVDQuartets. This is only a temporary limitation, and this notice will be removed once the bug is fixed.

Perform an SVDQuartets analysis, referencing this taxon partition:

paup> svdq nthreads=1 evalq=all taxpartition=species;

Notice that the (North American) gray wolf (Canis lupus) joins with the African golden wolf (Canis anthus) rather than the (North American) coyote (Canis latrans), unlike the result of the *BEAST tutorial (see StarBEAST2-tutorial.pdf ). However, this result is not well supported, as we can see by doing a multilocus nonparametric bootstrap, which resamples (with replacement) both loci and sites within loci:

paup> svdq nthreads=1 evalq=all taxpartition=species bootstrap=multilocus loci=combined;

(The “combined” above is the name of the character partition defining genes in the input file.) Examining the output bootstrap consensus tree, we see that the bootstrap support for the (gray wolf, African golden wolf) grouping is weak, and many bootstrap runs will actually find the (gray wolf, coyote) grouping that is probably correct. Interestingly, the DensiTree plot in the *BEAST tutorial also demonstrates that support for this grouping is very weak.

PAUP also includes a method for estimating branch lengths within a fixed species tree. The method is run with the “qAge”” command, and you can find help with this command as above:

paup> help qAge;

To estimate the branch lengths under the JC69 model, we can use the commands below. The first command converts the tree estimated by SVDQuartets to a rooted tree, and the second calls qAge to estimate the branch lengths on the rooted tree. Bootstrapping is used to obtain 95% confidence intervals. We’ll take a look at the output as a group.

paup> roottrees rootmethod=outgroup;
paup> qage taxpartition=species outunits=both bootstrap=multilocus loci=combined;

qAge also includes an option to carry out estimation under more complex substitution models. This option is invoked by changing “patProb=exactJC” to “patProb=expBL”. Prior to doing this, however, you must define a model using the “lset” command and provide the name of that model to the qAge command. Below is sample code. In the first line, we specify a GTR model, with parameters in the model to be estimated. The second line asks PAUP* to estimate those parameters using our data and the current tree in memory. In the third line, we set the model with the estimated values of the parameters, and we name it “MyGTR”. The final line uses qAge to carry out estimation under that model. Note that this analysis is more computationally intensive than the previous one that assumes the JC69 model. If you plan to try this analysis now, you may want to reduce the number of bootstrap replicates (currently set as nreps=100).

paup> lset nst=6 rmatrix=estimate baseFreq=empirical;
paup> lscores;
paup> lset nst=6 rmatrix=(0.749 2.410 0.365 1.066 4.465 1.000) baseFreq=(0.2476 0.2679 0.2558 0.2286) modelName=MyGTR;
paup> qage patProb=expBL taxpartition=species loci=combined bootstrap=multilocus treefile=test.tre modelName=MyGTR poolJC=no nreps=100;


Analysis 2: Using ASTRAL

We will again run ASTRAL so that you can see how to apply ASTRAL to an empirical data set. As before the first step is to estimate gene trees for each of the 16 loci. We can do this with the following two commands:

paup> !python astral_prep_canids.py > run_astral_canids.nex
paup> exec run_astral_canids.nex;

(In an interactive session you will need to hit return at the “!” prompt to return to the PAUP prompt). You can ignore warnings from PAUP* about the ingroup not being monophyletic; the trees are considered to be unrooted when input to ASTRAL.

After issuing these two commands, you will have a file called “canids_genetrees.tre” in your directory. This file contains a gene tree for each of the 16 genes, and will be used as the input to ASTRAL. There is one more file needed to run PAUP for this data set. We need to create a file that maps individual lineages to species, similar to the “taxpartition” we created in PAUP. This file is called “astral_canid_mapfile”. To run ASTRAL, you can now quit PAUP, and issue the following command:

$ astral -i canids_genetrees.tre -a astral_canid_mapfile

Again, you can locate the tree in the output written to the screen. Copying and pasting the estimated tree into FigTree and rooting along the branch that separates the sidestriped jackal and the blackbacked jackal from the rest of the species, we see that this tree agrees with that estimated with SVDQuartets using the bootstrap and with *BEAST2.

Part 3: Doing your own simulations

PAUP* now supports simulating data under the multispecies coalescent model. We will use the file sim-6tips.nex, and we will walk you through the components of this file:

#NEXUS

begin taxa;
    dimensions ntax=6;
    taxlabels A B C D E F;
end;

begin trees;
    tree 1 = [&R] (A:4.2,((B:1.0,C:1.0):2.2,(D:3.1,(E:3.0,F:3.0):0.1):0.1):1.0);
end;

begin paup;
    showtrees/userbrlens;
end;

begin dnasim;
    simdata multilocus=y nloci=100 nsitesperlocus=200;
    truetree source=memory treenum=1  units=2Ngen
             scalebrlen=1
             mscoal=y Ne=100000 mu=1e-8
             showtruetree=brlens showgenetrees=n storetruetrees=n
             seed=1;
    lset model=jc nst=1 basefreq=eq;    [= Jukes-Cantor model]
    beginsim nreps=100 seed=0;
        svdq nthreads=1;
        tally 'svdq';
    endsim;
end;

The copy of this file that you have in your directory will not run initially, because we have replaced the nloci=100, nsitesperlocus=200, and scalebrlen=1 with “@” symbols. Each of you will perform one (or more) simulation(s) by substituting values for the number of loci, number of sites per locus, and branch-length scaling factor. You should choose either 1 or 200 for nsitesperlocus. If you choose 1 site/locus, choose a value for nloci from the set {10000, 100000, 500000, or 1000000}. If you choose 200 sites/locus, choose a value for nloci from the set {50, 500, 2500, or 5000}. You will now run 3 simulations, with branch-length scaling factors of 0.5 (high ILS), 1 (moderate ILS), and 1.5 (low ILS). (NOTE: the 10000 loci x 200 sites/locus combinations will take a while, so randomly choose one of the 3 scaling factors for your first simulation so that if you run out of time we’ll collectively have results from all 3 ILS levels.)

Using a text editor, replace each “@” symbol with your assigned value. If you are not already comfortable using a Unix text editor, we recommend using “nano” (refer to the Unix tutorial). You can also edit using a local text editor such as BBEdit that allows editing of remote files via an SFTP connection, or you can download the file, edit it, and re-upload.

After you have modified and saved the file, execute it in PAUP:

$ paup sim-6tips.nex

As your runs complete, enter the results into the Google Sheet here: SVDQuartets results.

Feel free to run as many simulation conditions as you like, entering the results from each into the spreadsheet. We’ll discuss the results once most people have entered their numbers.


Running the tutorial locally

Download the materials for this practical and expand the archive:

$ wget https://phylosolutions.com/tutorials/wh2023-svdq-astral/svdquartets_tutorial.zip
$ unzip svdquartets_tutorial.zip

After the workshop, you can run the tutorial on your own machines by downloading and installing PAUP* (http://phylosolutions.com/paup-test/), ASTRAL (https://github.com/smirarab/ASTRAL/blob/master/astral-tutorial.md#installation), and RAxML (https://sco.h-its.org/exelixis/web/software/raxml/). Mac and Windows users should use the command-line (console) version of PAUP* rather than the GUI version. The tutorial assumes that PAUP*, ASTRAL, and RAxML can be started on your system using the commands paup, astral, and raxmlHPC, respectively. It will be easiest to set up symbolic links (symlinks) for PAUP* and RAxML; if you do not do this, you may need to modify the commands needed to start the programs from what is given in the tutorial instructions below. If you do not know how to create a symbolic link, “use the google.” For ASTRAL, Linux/macOS users can create a script containing the following two lines, substituting appropriately for /path/to/astral and modifying the ASTRAL version number if necessary:
        #!/bin/sh
        java -jar /path/to/astral/astral.5.6.3.jar  $@
Windows users would need to create a batch file that runs the “java -jar” command above and passes all arguments from the command line analogously to the ” $@” (again, google it).

References

Chifman J., Kubatko L.S. 2014. Quartet inference from SNP data under the coalescent model. 30:3317–3324.

Chifman J., Kubatko L.S. 2015. Identifiability of the unrooted species tree topology under the coalescent model with time-reversible substitution processes, site-specific rate variation, and invariable sites. J Theor Biol. 374:35–47.

Kubatko L.S., Degnan J.H. 2007. Inconsistency of phylogenetic estimates from concatenated data under coalescence. Syst Biol. 56:17–24.

Lindblad-Toh, K., Wade, C. M., Mikkelsen, T. S., Karlsson, E. K., Jaffe, D. B., Kamal, M., Clamp, M., Chang, J. L., Kulbokas, E. J., Zody, M. C., et al. 2005. Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature, 438(7069): 803–819.

Liu L., Edwards S.V. 2009. Phylogenetic analysis in the anomaly zone. Syst Biol. 58:452–460.

Mirarab S., Reaz R., Bayzid M.S., Zimmermann T., Swenson M.S., Warnow T.J. 2014. ASTRAL: genome-scale coalescent-based species tree estimation. Bioinformatics 30:i541–i548.

Mirarab S., Warnow T.J. 2015. ASTRAL-II: coalescent-based species tree estimation with many hundreds of taxa and thousands of genes. Bioinformatics 31:i44–i52.

Roch, S. and M. Steel. 2015. Likelihood-based tree reconstruction on a concatenation of aligned sequence data sets can be statistically inconsistent. Theoret. Pop. Biol. 100:56-62.

Swofford, D. L. and L. S. Kubatko. Species tree estimation using site pattern frequencies, Chapter 4, pgs. 68-88 in Species Tree Inference: A Guide to Methods and Applications, edited by L. Kubatko and L. Knowles, 2024.