David Swofford and Laura Kubatko
21 May 2026
In the instructions which follow, the “$” represents the Unix prompt.
For example, if the instruction says:
$ cat
filename.txt
you would type “cat filename.txt” and hit return.
Likewise, “paup>” represents the prompt used by the PAUP*
program.
Note: if you have not already done so, you may want to install FigTree on your local machine. FigTree is not a critical component of the tutorial however, so don’t sweat it if you run into problems.
Move to the directory with the SVDQuartets tutorial:
$ cd moledata/svdquartets_tutorial
The data set is for members of the family Canidae (dogs, wolves, coyotes, foxes, jackals, etc.). Lindblad-Toh et al. (2005) sequenced a collection of 16 nuclear loci. The alignments used here are from a *BEAST2 tutorial written by Huw Ogilvie (https://github.com/genomescale/starbeast2/releases/download/v0.14.0/StarBEAST2-tutorial.zip). We reformatted them into the Nexus format and renamed the taxa from scientific to common names.
The data are contained in the file canids.nex. This file contains data for two individuals from each species (total 16 sequences), so we use a taxon-partition to assign each individual tip to a species. This definition is already in the file and you do not need to type it here. It looks like this:
taxpartition species =
sidestriped_jackal: sidestriped_jackal_a sidestriped_jackal_b,
African_golden_wolf: African_golden_wolf_a African_golden_wolf_b,
coyote: coyote_a coyote_b,
gray_wolf: gray_wolf_a gray_wolf_b,
blackbacked_jackal: blackbacked_jackal_a blackbacked_jackal_b,
Ethiopian_wolf: Ethiopian_wolf_a Ethiopian_wolf_b,
Dhole: Dhole_a Dhole_b,
African_wild_dog: African_wild_dog_a African_wild_dog_b
;
You can take a look at the file by typing
$ cat canids.nex
Now you can start PAUP and load the canid sequence data file:
$ paup canids.nex -L canids.log
We will first estimate the species tree topology by performing an SVDQuartets analysis, referencing the taxon partition described above and included in the canids.nex file:
paup> svdq evalq=all taxpartition=species;
Notice that the (North American) gray wolf (Canis lupus) joins with the African golden wolf (Canis anthus) rather than the (North American) coyote (Canis latrans), unlike the result of the *BEAST tutorial (see StarBEAST2-tutorial.pdf ). However, this result is not well supported, as we can see by doing a multilocus nonparametric bootstrap, which resamples (with replacement) both loci and sites within loci:
paup> svdq evalq=all taxpartition=species bootstrap=multilocus loci=combined;
(The “combined” above is the name of the character partition defining genes in the input file.) Examining the output bootstrap consensus tree, we see that the bootstrap support for the (gray wolf, African golden wolf) grouping is weak, and many bootstrap runs will actually find the (gray wolf, coyote) grouping that is probably correct. Interestingly, the DensiTree plot in the *BEAST tutorial also demonstrates that support for this grouping is very weak.
PAUP also includes a method for estimating branch lengths within a fixed species tree. The method is run with the “qAge”” command, and you can find help with this command by typing:
paup> help qAge;
To estimate the branch lengths under the JC69 model, we can use the commands below. The first command converts the tree estimated by SVDQuartets to a rooted tree, and the second calls qAge to estimate the branch lengths on the rooted tree. Bootstrapping is used to obtain 95% confidence intervals. We’ll take a look at the output as a group.
paup> roottrees rootmethod=outgroup;
paup> qage taxpartition=species outunits=both bootstrap=multilocus loci=combined;
qAge also includes an option to carry out estimation under more complex substitution models. This option is invoked by changing “patProb=exactJC” to “patProb=expBL”. Prior to doing this, however, you must define a model using the “lset” command and provide the name of that model to the qAge command. Below is sample code. In the first line, we specify a GTR model, with parameters in the model to be estimated. The second line asks PAUP* to estimate those parameters using our data and the current tree in memory. In the third line, we set the model with the estimated values of the parameters, and we name it “MyGTR”. The final line uses qAge to carry out estimation under that model. Note that this analysis is more computationally intensive than the previous one that assumes the JC69 model. If you plan to try this analysis now, you may want to reduce the number of bootstrap replicates (currently set as nreps=100).
paup> lset nst=6 rmatrix=estimate baseFreq=empirical;
paup> lscores;
paup> lset nst=6 rmatrix=(0.749 2.410 0.365 1.066 4.465 1.000) baseFreq=(0.2476 0.2679 0.2558 0.2286) modelName=MyGTR;
paup> qage patProb=expBL taxpartition=species loci=combined bootstrap=multilocus treefile=test.tre modelName=MyGTR poolJC=no nreps=100;
As mentioned in the lecture this morning, ASTRAL is the most popular method for species tree inference. We'll run ASTRAL for the Canidae data set to get a feel for how it works. The first step is to estimate gene trees for each of the 16 loci. We can do this with the following two commands:
paup> !python3 astral_prep_canids.py > run_astral_canids.nex
paup> exec run_astral_canids.nex;
(In an interactive session you will need to hit return at the “!” prompt to return to the PAUP prompt). You can ignore warnings from PAUP* about the ingroup not being monophyletic; the trees are considered to be unrooted when input to ASTRAL.
After issuing these two commands, you will have a file called “canids_genetrees.tre” in your directory. This file contains a gene tree for each of the 16 genes, and will be used as the input to ASTRAL. There is one more file needed to run PAUP for this data set. We need to create a file that maps individual lineages to species, similar to the “taxpartition” we created in PAUP. This file is called “astral_canid_mapfile”. To run ASTRAL, you can now quit PAUP, and issue the following command:
$ astral -i canids_genetrees.tre -a astral_canid_mapfile
Again, you can locate the tree in the output written to the screen. Copying and pasting the estimated tree into FigTree and rooting along the branch that separates the sidestriped jackal and the blackbacked jackal from the rest of the species, we see that this tree agrees with that estimated with SVDQuartets using the bootstrap and with *BEAST2.
If you want to gain more familiarity with ASTRAL (after our practical
session ends), you can run the tutorial at https://github.com/smirarab/ASTRAL/blob/master/astral-tutorial.md.
On the MBL virtual machines, you can use the command “astral” rather than “java
-jar astral”5.6.3.jar” as given in the tutorial.
PAUP* supports the simulation of data under the multispecies coalescent model. We will use the file sim-6tips.nex, and we will walk you through the components of this file:
#NEXUS
begin taxa;
dimensions ntax=6;
taxlabels A B C D E F;
end;
begin trees;
tree 1 = [&R] (A:4.2,((B:1.0,C:1.0):2.2,(D:3.1,(E:3.0,F:3.0):0.1):0.1):1.0);
end;
begin paup;
showtrees/userbrlens;
end;
begin dnasim;
simdata multilocus=y nloci=100 nsitesperlocus=200;
truetree source=memory treenum=1 units=2Ngen
scalebrlen=1
mscoal=y Ne=100000 mu=1e-8
showtruetree=brlens showgenetrees=n storetruetrees=n
seed=1;
lset model=jc nst=1 basefreq=eq; [= Jukes-Cantor model]
beginsim nreps=100 seed=0;
svdq;
tally 'svdq';
endsim;
end;
The copy of this file that you have in your directory will not run initially, because we have replaced the nloci=100, nsitesperlocus=200, and scalebrlen=1 with “@” symbols. Replacing the "@" symbols with numbers will allow you to explore the performance of SVDQuartets over a range of conditions. We recommend trying the following settings:
Note the interaction between the number of sites per locus and the number of loci. The choices above are designed to let you make comparisons between "CIS" data (one site per locus) and true multilocus data (200 sites per loci) when the total data set size is fixed. For example, nsitesperlocus = 1 and nloci=10000 corresponds to 10,000bp, as does nsitesperlocus=200 and nloci=50.
The scalebrlen option re-scales the total length of the tree by whatever factor is given, which controls the amount of incomplete lineage sorting (ILS) in the gene trees that are generated. A factor of 0.5 shortens the total tree length, resulting in more ILS; a factor of 1 corresponds to moderate ILS; and a factor of 1.5 corresponds to low levels of ILS.
After you have modified and saved the file, execute it in PAUP:
$ paup sim-6tips.nex
If you have the time and interest, run all combinations of the conditions above and plot your results. For example, plot the data set size on the x-axis, the proportion of trees estimated correctly on the y-axis, and plot two lines: one for nsiterperlocus=1 and one for nsitesperlocus=200. Compare the performance of SVDQuartets in these two cases, and note how it performs overall as a function of total data set size.
Though we have run PAUP on the MBL virtual machines, it is easy to download PAUP to your own machine at http://phylosolutions.com/paup-test/. You can copy the files from the svdquartets_tutorial folder to your own machine as well, and then you should be able to repeat this lab exercise locally. If you'd like to run the ASTRAL analyses, you will also need to download ASTRAL: https://github.com/smirarab/ASTRAL/blob/master/astral-tutorial.md#installation.
Chifman J., Kubatko L.S. 2014. Quartet inference from SNP data under the coalescent model. 30:3317–3324.
Chifman J., Kubatko L.S. 2015. Identifiability of the unrooted species tree topology under the coalescent model with time-reversible substitution processes, site-specific rate variation, and invariable sites. J Theor Biol. 374:35–47.
Kubatko L.S., Degnan J.H. 2007. Inconsistency of phylogenetic estimates from concatenated data under coalescence. Syst Biol. 56:17–24.
Lindblad-Toh, K., Wade, C. M., Mikkelsen, T. S., Karlsson, E. K., Jaffe, D. B., Kamal, M., Clamp, M., Chang, J. L., Kulbokas, E. J., Zody, M. C., et al. 2005. Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature, 438(7069): 803–819.
Liu L., Edwards S.V. 2009. Phylogenetic analysis in the anomaly zone. Syst Biol. 58:452–460.
Mirarab S., Reaz R., Bayzid M.S., Zimmermann T., Swenson M.S., Warnow T.J. 2014. ASTRAL: genome-scale coalescent-based species tree estimation. Bioinformatics 30:i541–i548.
Mirarab S., Warnow T.J. 2015. ASTRAL-II: coalescent-based species tree estimation with many hundreds of taxa and thousands of genes. Bioinformatics 31:i44–i52.
Roch, S. and M. Steel. 2015. Likelihood-based tree reconstruction on a concatenation of aligned sequence data sets can be statistically inconsistent. Theoret. Pop. Biol. 100:56-62.
Swofford, D. L. and L. S. Kubatko. Species tree estimation using site pattern frequencies, Chapter 4, pgs. 68-88 in Species Tree Inference: A Guide to Methods and Applications, edited by L. Kubatko and L. Knowles, 2024.