README.txt Ziheng Yang, January 2004 (A) Overview This folder includes the files used in Yang (2004) and this file describes the use of the AHRS algorithm described in that paper. Two options (clock = 5 and 6) are implemented in baseml and codeml. clock = 5 is for the global clock, and clock = 6 implements the heuristic rate smoothing (AHRS) algorithm combined with maximum likelihood estimation of divergence dates. The mouse lemur files are used as an example. The control files for the nucleotide, amino acid, and codon based analysis published in Yang (2004) are baseml2.ctl, aaml2.ctl, and codonml2.ctl. You can open those files to check them. The models in Yang (2004) have two features compared with the previous models (clock = 1, 2, 3), which were described in Yang and Yoder (2003). The first is that the new models deal with multiple-loci data sets in which some species are missing at some loci. The second is that the AHRS algorithm attempts to automatically assign branches to rate groups, which are then used for maximum likelihood estimation of divergence dates using the method of Yang and Yoder (2003). So the new global-clock option (clock = 5) is identical to the old global clock option (clock = 3) if every locus has all the species. Note however that there are some differences in the data formats between the new models and the old ones, due to the need to deal with missing species at some loci. See section (B) for details. (B) Data Format Have a look at the sequence data file MouseLemurs123.nuc and the tree file MouseLemurs.trees in this folder. The new models, clock = 5 (global clock or simply clock) and 6 (local clock) are for combined analysis of data from multiple loci (or site partitions). An important difference from the old options, clock = 0, 1, 2, 3, is that some species can be missing at some loci. Specify the number of loci using the variable ndata in the control file baseml.ctl or codeml.ctl. Prepare a tree file, including a tree for all species that occur in the sequence data. This will be called the species tree or master tree. The beginning of the tree file should have the number of species and the number of trees (which is usually 1) on the first line. Also prepare a sequence data file, with the sequence alignments listed one after another for the loci. The example file MouseLemurs123.nuc lists data from the 3 codon positions as if they were three different genes. When the program runs, it reads the master tree first, and then the sequence alignments locus by locus. For each locus, it constructs (extracts) the sub-tree from the master tree. For example, suppose the master tree is for six species with two fossil calibration points: (((S1, S2), S3) @0.10, ((S4, S5) @0.20, S6)); Suppose we have sequences for 4 species at the first locus and 5 species at the second locus (with ndata = 2). The data file looks as follows. 4 100 S1 TCTATGTTATATGTATGAGTATGA ... S3 TCGATATTATGTGTATGAGTATGA ... S4 TCTATATTACATGTATGAGTATGA ... S6 TCCATATTAAATGTATGAGTATGA ... 5 200 S1 GTTATATGTATGAGTATGATCTAT ... S2 GTTATATGCGTGAGTATGAACTAT ... S4 ATTATGTGTATGAGTATGATCGAT ... S5 GCTACATGTATGAGTATGATCTAT ... S6 ATTAAATGTATGAGTATGATCCAT ... The program will then construct the "gene tree" for locus 1 to be ((S1, S2), S3) @0.10; You should have at least one calibration node for each locus. If two "sister" species never occur simultaneously at one locus so that there is no way of identifying the date of their divergence, you might want to consider them as one "species". For example, if donkey is sequenced at some loci and horse at others and at no locus both are sequenced, you can rename the two species equus in the tree and sequence data files. (C) Models clock = 5 (clock) and clock = 6 (local clock) When you run clock = 5, a global clock is assumed. If all species are present in every locus, this analysis can be achieved by using clock = 3 (combined analysis) as in Yang & Yoder (2003). This is the case for the included example data file MouseLemurs123.nuc. Note again however that the data formats are different. clock = 3 uses MouseLemurs.nuc while clock = 5 uses MouseLemurs124.nuc. If some species are missing at some loci in your data, you can't use clock = 3 anymore and clock = 5 is the only choice. The AHRS algorithm (clock = 6) works in 3 steps. Here is a brief description. See Yang (2004) for more details. Step 1: Estimate branch lengths on each gene tree without the clock. Step 2: Heuristic rate smoothing, to estimate one set of divergence times for the nodes in the species tree and as many sets of branch rates as the number of loci. Each branch in the gene tree is assigned and estimated one rate. For the above toy example, the program estimates 3 divergence times (five ancestral nodes minus two calibration nodes), plus 6 rates (six branches in a rooted tree of four species) for the first locus and 8 rates for the second locus (eight branches in a rooted tree of five species). A smoothing parameter called \nu is also estimatd for each locus. Rate estimation is achieved by minimizing the discrepancy between the predicted branch lengths and the estimates obtained in Step 1, and by minimizing rate changes over time (across lineages). The estimated rates for branches in the gene trees are then collapsed into a few branch rate categories. The number of rate categories is set by a variable called nbrate=4 in the routine DatingHeteroData() in the file treesub.c. You can change this and recompile. The program also allows you to manually classify the rates into groups. When it asks for the number of rate groups, you can reply with 0: meaning no change, 1: meaning one rate (same as using clock = 5), or say 5 (in which case the program will further ask to input 4 cutting points to separate the rates into 5 groups). Step 3: Maximum likelihood estimation of divergence times using the assigned branch rate groups. Divergence times are estimated simultaneously with the rates for branch rate groups. For the toy example mentioned above, this estimates 3 = 5 - 2 node ages in the species tree, plus 3 rates at each locus, with 9 time and rate parameters in all. (D) Output The output is fairly self-explanatory. For clock = 6, look at the output on the screen at each of the three steps: Step 1: Estimating branch lengths under no clock. Step 2: Ad hoc rate smoothing to estimate branch rates. Step 3: ML estimation of times and rates. In particular, read in the output at the end of step 2 and check the branch labels (say, using Rod Page's TreeView) and see whether the assignment is reasonable and whether you want to use your own classification scheme. The programs print out a file named RateDist.txt, which contains a distance matrix for each locus, with the distance calculated as the difference between two rates. You can use an algorithm such as UPGMA to cluster the rates into groups (clades) to help with the classification. I used the neighbor program in J. Felsenstein's phylip package to do this (figure 5b and the 4-rate manual (4RM) model in Yang 2004). (E) Notes 1. The example control files are baseml2.ctl, aaml2.ctl, codonml2.ctl. 2. In early versions of baseml and codeml, the order of the control variables in the control files does not matter to the specification. Unfortunately that is not the case anymore for clock = 5 or 6. The variables clock and ndata should be specified in the control file before some other variables such as fix_kappa, kappa, fix_alpha, alpha etc. See the example files baseml2.ctl and codonml2.ctl etc. included in the same folder. The reason for this is that you can fix the kappa values for different genes at certain values during Step 1, and for the program to know how many kappa values to read, it needs to know the number of genes (ndata); see note 5 below. 3. The options are implemented in baseml for nucleotide sequences and in codeml for amino acid and codon sequences. 4. Only rooted binary trees are accepted in the tree file. 5. Estimating and fixing substitution parameters in step 1. If the model involves substitution parameters such as kappa, alpha, you can run the program multiple times to guarantee getting correct estimates for them Step 1 of the algorithm and then use them as fixed constants in both Step 1 (when you rerun the program) and Step 3. (Step 2 does not use a substitution model and so those parameters are irrelevant.) Step 1 in the clock = 6 model always uses the algorithm of iterating one branch at a time, and may be problematic if the gamma shape parameter (for rate variation among sites) is estimated at the same time. Note that the calculation in step 1 is just the old clock = 0 analysis and is in older versions of the programs, if the coccrect tree files are supplied for each locus. In that case you can use clock = 0 method = 0 to try and estimate the substitution parameters and then fix them when you run clock = 6. If you have different numbers of species at different loci, the tree for each locus is different and it would be awkward to use clock = 0 to complete step 1. In that case you can use clock = 6 so that the correct trees are extracted and used for each locus, but you run the algorithms multiple times to make sure the correct MLEs are obtained. Either way, when step 1 is successful, you can fix the parameters in the control file, by say, data = 3 fix_alpha = 1 alpha = 0.29169 0.16392 1.24726 which means that the alpha parameter is fixed at the three values above for the three genes (codon positions in this case). Other parameters that can be fixed this way and that can vary for site partitions or genes or proteins include kappa and alpha for baseml, alpha and aaRatefile for amino acid sequences, and icode (genetic code table) kappa, and omega for codon-based analsis. 6. I notice that minor differences in the rate estimates might lead to assignment of one or two branches in a different rate group, leading to differences in final time estimates. As a result, multiple runs using the same setting may lead to slightly different time estimates. You should watch out for failure of the iteration algorithm in any of the three steps, which can cause differences between runs. 7. For protein sequences, you can use different substitution rate matrices. In the following, three nuclear proteins (using wag) are followed by one mitochondrial protein (using mtmam). aaRatefile = wag.dat wag.dat wag.dat mtmam.dat For codon sequences, you can use different genetic codes. In the following, three nuclear genes are followed by one mitochondrial gene. icode = 0 0 0 1 8. Separating codon positions into different files. You can use Mgene = 1 option in baseml to separate the three codon positions into three different files (named Gene1.seq, Gene2.seq, etc.). You need to use verbose = 1 to do this, I think. Then you can copy and paste those files into one sequence data file, which will be in the format required by the options here. Good luck. References Yang, Z. 2004. A heuristic rate smoothing procedure for maximum likelihood estimation of species divergence times. Acta Zoologica Sinica 50:645-656. //end of file