History and bug fixes of PAML (Phylogenetic Analysis by Maximum Likelihood) Ziheng Yang If you find problems or have questions, please visit the PAML discussion site https://groups.google.com/forum/#!forum/pamlsoftware. Version 4.8a, August 2014 (*) baseml (nhomo = 3 or 4). I have added the nonhomogeneous GTR (REV) model with a set of abcde parameters for every branch. The specifications are as follows. model=7 nhomo=4 fix_kappa=0: one set of rate parameters abcde in GTR for the whole tree. model=7 nhomo=4 fix_kappa=1: one set of rate parameters abcde in GTR for every branch. In version 4.8, only the second option (one set of abcde parameters for the whole tree) is implemented, no matter whether you choose 0 or 1 for fix_kappa. After this change, the use of fix_kappa in the nhomo models is consistent between HKY and GTR. For HKY, the specifications have always been as follows. model=4 nhomo=4 fix_kappa=0: one kappa in HKY for the whole tree. model=4 nhomo=4 fix_kappa=1: one kappa in HKY for every branch. Note that nhomo=4 means one set of frequency parameters for every branch on the tree. I have also added a method for calculating the expected number of changes from any nucleotide i to nucleotide j along every branch under the nonhomogeneous models (nhomo=3 or 4). This accounts for the fact that the base frequencies are changing over time, and so does the rate of substitution. I tested only the nhomo=4 option carefully and the options nhomo=3 and 5 are not well tested. Version 4.8, March 2014 (*) mcmctree. The prior for rates for loci (with ndata = 2 or more) has changed in this version, and the gamma-Dirichlet prior is used. The old i.i.d. prior is replaced and unavailable anymore. The new gamma-Dirichlet prior is decribed in a paper dos Reis M, Zhu T, Yang Z. 2014. The impact of the rate prior on Bayesian estimation of divergence times with multiple loci. Syst Biol. in press. (*) codeml. Added a Bayesian method for estimating distance t and dN/dS ratio w in pairwise comparisons. The option is runmode = -3 * -3: pairwise Bayesian runmode = -3 1.1 1.1 1.1 2.2 * -3: pairwise Bayesian The four numbers are the parameters in the gamma priors for t and w (a_t b_t a_w b_w). The method is described in a manuscript Angelis K, dos Reis M, Yang Z. 2014. Bayesian estimation of nonsynonymous/synonymous rate ratios for pairwise sequence comparisons. Mol Biol Evol. (*) baseml, codeml, mcmctree). In version 4.7a, I disabled the option of reading species indexes in the tree file (for example, the number 19, instead of the sequence name, may be used in the tree file to mean the 19th sequence in the sequence alignment). I have added this back. Version 4.7 does not have this problem. (*) codeml in paml 4.6 and 4.7 has a bug if your sequences have stop codons. Stop codons have always been disallowed in the codon models implemented in codeml. In versions 4.6-4.7, if there is a stop codon in any sequence, the codons in all sequences at that site (column in alignment) are changed into ??? and then the sequences are processed in the usual way. The likelihood should then be the same as the likelihood if that column is removed from the alignment. However, the implementation of the idea was incorrect and gives wrong results if the sequences do not contain any ambiguity characters and if you choose cleandata = 0. I have now fixed this. The idea is not really useful, so the good advice is that you delete the column of stop codons before the data are analyzed using codeml. Version 4.7a, October 2013 (*) codeml. The mutation-selection model of Yang and Nielsen (2008) is noted to work for the site and branch-site models but not for the branch models. This is now fixed. (*) mcmctree. I fixed and modified the infinitesites program, which generates the limiting distribution when the amount of sequence data approaches infinity. This works for both the clock (clock=1) and relaxed clock models (clock=2 or 3). The mcmctree tutorial explains how to run this program. Version 4.7, January 2013 (*) mcmctree. This version includes a method for dating divergences using sampled tips, as in viral datasets. The prior for node ages in the given rooted tree is generated using a birth-death-sequential-sampling model, described in Stadler and Yang (2012 Syst Biol, submitted). The SIV/HIV-2 example dataset analyzed in the paper is in the examples/TipDate/ folder. Look at the readme file there. (*) codeml. Version 4.6 crashes if you have a lot of ambiguity codons in the alignment, or precisely if the total number of fully determined codons (61 in the standard code) and the ambiguous codons exceed 127. I was using "unsigned char" to represent the codons, which go from 0 to 255 and can represent 256 distinct codons. Somehow I changed this to "char", which on some systems goes from -128 to 127. Can't remember why I made the change. Anyway this is now fixed. Versions 4.5 or earlier do not have this problem. Thanks to David Yu for pointing out the problem. Version 4.6, September 2012 (*) baseml. The model of autocorrelated-rates among sites (Yang 1995 Genetics 139:993-1005) is broken in paml versions 4.3 to 4.5 inclusive. Earlier versions were apparently fine. When you select the auto-discrete-gamma model (alpha > 0, rho > 0) or the nonparametric autocorrelated-rates model (fix_alpha = 0, ncatG = 3, nparK = 4), the program aborts with the following error message: "Error: use 10, 20, 32, 64, 128, 512, 1024 for npoints for legendre.." This is now fixed. The option of nparK = 3 seems to have problems in all versions, and the program prints out the following warning message. "nparK==3, double stochastic, may not work. Use nparK=4?" This option is supposed to fit a Markov-dependent model with equal proportions in the rate categies. I have not got time to investigate this option, but a glance at table 3 in that paper indicates that this model was not used in the table and probably never properly implemented, so please follow the advice and use nparK = 4 instead. (*) codeml. I have now added the site model M2a_rel of Weadick & Chang (2012 Mol. Biol. Evol.), in which w2 > 0. This is specified using NSsites = 22 while model M2a is specified as NSsites = 2 (with w2 > 1). M2a_rel is the null model for the likelihood ratio test based on clade model C, suggested by Weadick & Chang (2012 Mol. Biol. Evol.). Note that the very old site model M2 from Nielsen & Yang (1998 Genetics) assumes w0 = 0. Yang, Wong, Nielsen (2005 MBE) changed the model so that 0 < w0 < 1 and renamed the model M2a. My tests using the versions around that time reveals the following: 3.13d (M2): w0 = 0, w2 > 0 3.14a (M2a): w0 > 0, w2 >= 1 3.14b (M2a): w0 > 0, w2 >= 1 If seems that when we changed M2 into M2a, we also changed the constraint on w2. I don't have a copy of version 3.14 to check. (*) basseml & codeml. To following option bootstrap = 1000 should produce 1000 bootstrap samples in the file boot.txt. A bug is introduced in versions 4.4 and 4.5, so that garbiage is created instead of bootstrap datasets. Versions 4.3 and earlier were correct. This bug is now fixed. (*) evolver. The simulation of codon sequences (evolver 6 MCcodon.dat) is incorrect when a genetic code different from the universal code is used. This is a conceptual error I made and it has been in all previous versions when simulation of codon sequences is allowed. I have assumed incorrectly that setting the frequencies of stop codons to 0 should be sufficient to get correct results. The zero frequencies allow the program to identify the stop codons correctly, but one still needs the code table to know which changes are synonymous and which nonsynonymous. evolver has been using the universal code to simulate codon sequences. Simulation using the universal code has been correct, and simulation using the mitochondrial code (for example) is affcted and the results should be incorrect although the impact may be slight. This bug is now fixed. A variable is added in the control file MCcodon.dat right below the codon frequencies, which tells the program which genetic code to use. 1 * genetic code (0:universal; 1:mammalian mt; 2-10:see below) The program also checks that the stop codons have frequency 0 and that the frequencies of the sense codons sum to 1. Thanks to Mario dos Reis for pointing this out. (*) mcmctree. A bug is found in the calculation of the prior on rates under the correlated-rates model (clock=3). This bug has been in version 4 (July 2007) to version 4.5 (December 2011), inclusive. The prior is calculated by multiplying the conditional densities of equation (7) in Rannala & Yand (2007) over all interior nodes on the master tree (see also Figure 1 in that paper). However, the rA used in the program was not for the current branch: it was instead for the ancestral branch. The correct formula used was correct for the root of the master tree, but incorrect for other interior nodes. The impact of the bug on posterior time estimates may be expected to be small, but if the clock is seriously violated and rates change over neighboring branches, this may be a concern. Version 4.5, December 2011 baseml. I have added the joint ancestral reconstruction under the nonhomogeneous models (nhomo = 3 or 4). This works for the model of one rate for all sites, and does not work for the model of gamma rates for sites or the partition models. Also I implemented the joint reconstruction only and not the marginal reconstruction. mcmctree. The program now prints out a file called FigTree.tre in the current working directory that is readable by FigTree. The branch lengths show the posterior means of times while the bars generated by FigTree will indicate the 95% CIs. I have also added an option to deal with TipDate data, such as viral sequences with dates. See examples/TipDate/README.txt for more notes. Version 4.4e, May 2011 mcmctree. I have added an option of automatically adjusting the step lengths for proposals in the MCMC algorithm. The program will use the burn in to automatically adjust the step lengths, using those values specified in the control file as initial values. codeml: The iteration method that updates one branch length at a time (method = 1) is broken for the free-ratios model (model = 1) from versions 4.3 to 4.4d inclusive. When codeml runs, you will see messages like the following on the screen: Warning rounding error? b=3 cycle=0 lnL=6036.9413757 != 6070.1069284 Version 4.2 and earlier are correct. This bug is now fixed. Thanks to Tae-Kun Seo for reporting the bug. Version 4.4d, March 2011 yn00. The option of using weighting (weighting = 1) in the calculation of dN and dS by the YN00 method (Yang and Nielsen 2000) has been broken since Version 4.4 (February 2010). A bug was introduced when I changed the character coding in Version 4.4, so that this bug affects Versions 4.4, 4.4a, 4.4b, and 4.4c. The bug typically causes a crash. The result for weighting = 0 is not affected. Versions before 4.4 are correct. This bug is now fixed. Thanks for Charlie (http://www.ucl.ac.uk/discussions/viewtopic.php?t=8726&highlight=yn00). Version 4.4c, August 2010 mcmctree. A bug causes the exotic calibrations (gamma, skew normal and skew t) to be sometimes ignored, if you have multiple loci (partitions) and if some species/sequences are missing at some loci. If all calibrations are bounds (minimum, maximum, and joint), or if the partitions have the same number of species/sequences, the calculation is still correct. This bug is in versions 4.1-4.4b. This is now fixed. Thanks to Mike Steiper. Version 4.4b, July 2010 mcmctree. A serious bug has been discovered which affects the option of approximate likelihood calculation (usedata = 2). If the first child of the root in the tree for any locus is a tip, the MLEs of branch lengths in the in.BV file are assigned to wrong branches, so that the results are entirely incorrect, and also very different from those obtained using the exact likelihood calculation (usedata = 1). Suppose the tree for a locus is (A, B), with a bifurcation at the root, and with A and B representing species or clades. If A is a tip, the program is incorrect, while if A is an internal node, the program is correct. This bug is in versions 4.2b - 4.4a. Hopefully this is now corrected. Thanks to Mario dos Reis. mcmctree. I also changed the specification on L, U, and B bounds to allow the user to specify the tail probabilities (which used to be fixed at 2.5%) for each fossil calibration. See table 8 in pamlDOC.pdf for details. You will still have to run the program without data to get the effective prior of times used by the program. Version 4.4a, June 2010 codeml (RateAncestor=1). For ancestral reconstruction under codon models, the program prints out the translated amino acids when it prints the codons. A bug caused the translation to be incorrect, so instead of the amino acid, the program prints something like h or H. This bug was introduced in Version 4.4, February 2010 when I changed the way that the characters are coded. This is fixed now. baseml (clock = 1, 2, or 3 Mgene option). If you assign different rates for different genes (or site parations) and the model assumes global clock or local clock, the program fails to print out the relative rates for genes. This bug seems to be introduced between 3.13 and 3.14 and has since. The log likelihood value and estimates of parameters are all correct, and the problem is only that MLEs of rates for genes are not printed in the output file properly. Version 4.4a, April 2010 In versions 4.2, 4.3, 4.4, Clade model D does not print the w estimates for the site classes correctly. For the example run in examples/datasets2/, the incorrect output looks like the following site class 0 1 2 proportion 0.48590 0.10718 0.40692 branch type 0: 0.10557 4.17186 4.17186 branch type 1: 0.10557 4.17186 3.05340 while the correct results (from version 4.1) are site class 0 1 2 proportion 0.48590 0.10718 0.40692 background w 0.10557 4.17186 3.05340 foreground w 0.10557 4.17186 0.95081 The ratios w2 and w3 for site class 2 are incorrectly printed, with w2 equal to w1. The branch lengths and lnL values etc. are all correct. This bug affects versions 4.2, 4.3, and 4.4, and version 4.1 and earlier versions appear to be correct. Also the bug affects clade model D, and clade model C appears to be fine. This is fixed now. Thanks to cajawe for reporting the bug (see http://www.ucl.ac.uk/discussions/viewtopic.php?p=28240#28240). Version 4.4, February 2010 I changed the way that the characters are coded in the programs, which means that essentially every program in the package is modified. It is likely that some bugs are introduced during the process. codeml. The iteration algorithm specified by method = 1 when applied to amino acid model of gamma rates among sites (seqtype = 2, alpha > 0, method = 1) is broken in versions 4.2-4.3b. The bug causes the iteration to fail to converge, with messages like the following printed on the monitor: "Warning rounding error? b=5 cycle=0 lnL=1718.5667685 != 1749.4724545". mcmctree. If the gamma prior for rates has alpha <1/3, the initial values are generated incorrectly. This is now fixed. Version 4.3b, November 2009 mcmctree. A bug was introduced in version 4.2, which causes the program to ignore the minimum age constraint on the root age and uses the maximum bound specified by RootAge in the control file, if there is a calibration on the root and it is a minimum-age constraint. If you have no calibration on the root, or if the calibration is an maximum bound, both minimum and maximum bound or gamma distribution, there is no problem. The error affects versions 4.2 and 4.3, and version 4.1 is correct. This is now fixed. Thanks to Mathieu Groussin. evolver. I have now added back the option of printing out the site pattern counts instead of the sequences (specified at the beginning of the control file such as MCbase.dat). Use of pattern counts takes less disk space. The option was removed in version 4.2b after the algorithm for collaspsing sites into patterns was rewritten. Version 4.3a, October 2009 mcmctree. A bug was introduced in version 4.3, which causes the root age to be proposed incorrectly. The effect of the bug appears to be subtle and is hard to assess. It means that the results from all three clock models (clock = 0, 1, 2) are incorrect. Here is a bit more detail about the bug. The internal maximum node age was set too high, and caused loss of significant digits in a smart reflection algorithm which reflects proposed values into the feasible interval. This update fixes the bug. Please use this version to reanalyze your data if you have used version 4.3. Version 4.2 does not have this bug. mcmctree. I now supplies compiling options so that you can use hard minimum hard maximum, or hard minimum soft maximum. The executables are called mcmctreeHH and mcmctreeHS. See the commands for compiling at the beginning of the file mcmctree.c. Version 4.3, August 2009 mcmctree. The approximate likelihood calculation was unreliable, because the Hessian matrix was not calculated reliably, especially if the maximum likelihood estimates of some branch lengths are 0. I have now implemented the approach of Seo, Kishino and Thorne (2004), which appears to be more stable. baseml & codeml: I changed these two programs to read fasta alignments directly, working out the number of sequences and the sequence length automatically. However, this does not work when you have multiple alignments in one file, that is, if ndata = 2 or higher. Version 4.2b, April 2009 mcmctree. There is a bug that leads to crash when usedata = 3 and when the data at some loci have ambiguity characters while other loci do not. usedata = 3 means that the program invokes baseml or codeml to calculate the maximum likelihood estimates (MLEs) of branch lengths and the variances of those MLEs. The temporary sequence alignment files for the loci (tmp?.txt) generated by mcmctree have garbiage and are unreadable by baseml (or codeml). This bug is now corrected. (You can replace the wrong alignment file for the locus by your original alignment and run baseml outside mcmctree.) All programs. I rewrote the code for collapsing sites into patterns. This should have no impact on user files, except that evolver does not produce data files in the "pattern-count" format anymore. Now two formats (paml/phylip format and PAUP nexus format) are used, specified by the first line of the control data file (e.g., MCbase.dat, MCcodon.dat, MCaa.dat). This pattern-count format is still accepted as input by baseml, codeml, etc. (See the section on "Site pattern counts" in the documentation). baseml & codeml. A bug was introduced to the local clock model (clock=6), so that the program generates the following error message: "need fossils for this locus". For example, running codeml on the control file /examples/MouseLemurs/codonml2.ctl will do so. This bug is now fixed. codeml: The bug introduced in version 4.2 concerning the free-ratios model (Yang 1998 MBE) was not fixed properly in version 4.2a (see the notes for 4.2a below). The results are incorrect when there are exactly 7 branch types (for example when you run the free-ratios model on a 5-species tree) but are correct for other numbers of branch types. I hope this is now finally fixed. Version 4.2a, February 2009 codeml: Two bugs were known to have been introduced in version 4.2 when I rewrote the code for the branch-site and clade models. The first is the null hypothesis in the branch-site test (model = 2 NSsites = 2, fix_omega = 1 omega = 1). The reference for this model is Yang, Wong & Nielsen (2005), and Zhang et al. (2005). The alternative model (fix_omega = 0) is fine. The second bug is in the free-ratios model (Yang 1998 MBE). The results from those two models are incorrect and should be ignored if you used version 4.2. Both mistakes are fixed in this update. Thanks to W.P. Zhou and BYLee. Version 4.2, December 2008 MCMCtree: The implementation of the lower bound and of the upper bound is changed. The improper density of equation (15) of Yang and Rannala (2006) for the lower bound is noted to push up divergence times, a feature we consider undesirable. Now a truncated Cauchy distribution is used instead. This will be described in more detail somewhere. The default file name for the MCMC samples is mcmc.out. To change this, add the following line in the control file mcmctree.ctl: mcmcfile = mcmc.out codeml: Clade models C and D are extended to accommodate more than two branch types. The old models accept two branch types only, referred to as foreground and background branches. The structure of the new models are below. You use branch/node numbers to specify the branch types (with #0 to be the default, followed by #1, #2, #3, ...). As before, the BEB calculation is implemented for clade model C only, and is not available for clade model D. The BEB calculation is expensive, so I expect clade model C will work if you have only a few branch types. For each additional branch type, the BEB computation is 10 times more expensive. Site class Proportion w ratio 0 p0 0 < w0 < 1 1 p1 w1 (w1 = 1 is fixed in model C) 2 p2 w2, w3, w4, ... I also added an option of fixing the last w to 1 in clade model C, specified in the control file as follows. For example, if there are three branch types in the tree (see the table above), the last w fixed will be w4. fix_omega = 1 omega = 1 The two versions of the clade model can be compared to test whether the last w is > 1. codeml: branch model with amino acid chemical properties, specified using model = 2 and aaDist = 7. Under this model, the types of branches are specified just like the branch model (by labelling branches). For each branch type, the program estimates a few w's, depending on specifications in the file OmegaAA.dat. This model seems to be broken for a long time since its implementation. Look at the readme.txt file in examples/mtCDNAape/. Version 4.1, August 2008 MCMCtree: A few changes have been introduced to this program. The format for specifying fossil calibrations in the tree file has been changed, so that the gamma distribution is specified as 'G(alpha, beta)' and the old format '>0.6=0.7<0.8' does not work anymore. The bounds can be specified using either the old form '<0.8' or the new form 'U(0.8)'. Two exotic distributions, skew normal and skew t, are added, using the format 'SN(location, scale, shape)' and 'ST(location, scale, shape, df)'. I edited the documentation pamlDOC.pdf and rewrote the part for mcmctree. The format of the MCMC output file mcmc.out has been modified so that it is readable by Andrew Rambaut's Tracer program. Note that MCMCtree starts taking samples and writing into the file only after burnin. codeml. The mutation-selection models of Yang and Nielsen (2008 Mol. Biol. Evol. 25, 568-579) are specified using the options CodonFreq = 6 or 7 * 6:FMutSel0, 7:FMutSel estFreq = 1 Use examples/mtCDNAape/codeml.HC.ctl to duplicate results in table 1 in the paper. It seems that some results are in the mlc file while some other results are in the rst and rst1 files. codeml (branch model): When codon sequences are analyzed under the branch model (model = 1 or 2, NSsites = 0), the main output file includes a tree under the line "w ratios as labels for TreeView:". This line was introduced in version 3.15. However, the branch lengths in the tree had the dN values, while I intended them to be the original branch lengths (which are t, the number of nucleotide substitutions per codon). This bug may also affect the results of ancestral reconstruction under the same branch model. This bug is fixed now. baseml and codeml (discrete-gamma model): My intention has been to use the mean and not the median to represent all rates in each category when the discrete-gamma model is implemented. See Yang (1994 J. Mol. Evol. 39:306-314) for the distinction. However, a bug caused a few recent versions to use the median instead. This affects baseml versions 3.14c and 3.15, while version 3.14b or earlier and version 4 or later are correct. codeml in version 4a & 4b were incorrect while other versions are correct. I hope both baseml and codeml are correct from version 4c. baseml and codeml (branch and clade labels): The local clock models (clock = 2) in the two programs and also the branch or branch-site models in codeml use labels to identify branches. When nested clade label ($) are used, the program may get confused so that the branches are not labeled correctly. I think this is not fixed in version 4c. Version 4b, January 2008 mcmctree. Changed a variable name in the control file from MaxRootAge to RootAge. This should be specified in any of the following formats: RootAge = <1.0 * constraint on root age, used if no fossil for root. RootAge = '<1.0' * constraint on root age, used if no fossil for root. RootAge = <1.0>0.8 * constraint on root age, used if no fossil for root. RootAge = '<1.0>0.8' * constraint on root age, used if no fossil for root. Added some text around the output, to make the results more comprehensible. evolver (option 8). option 8 for calculating bipartition distances between trees is broken. I changed the program to prepare for a figure and forgot to remove the changes afterwards. This is now fixed. Version 4, July 2007 mcmctree. The MCMC algorithm of Rannala and Yang (2007) dealing with rate drift is added. The section of manual on mcmctree is now rewritten. An example data set is included in the examples/DatingSoftBound/ folder. codeml. This now implements the mutation-selection models of codon substitution of Yang and Nielsen (2008). I should include more details after the models are better tested, for example, after the paper is written. baseml. The log likelihood under the gamma-rates model (that is, when alpha > 0 in the control file) is not calculated correctly, or not as intended. Version 3.14a is correct, but versions 3.14c and 3.15 are incorrect. I am not sure about version 3.14b, as I have not kept a copy. The bug causes the median to be used to represent all rates in each category of the discre-gamma model whereas the mean was intended. See the discussion around equation (10) and the paragraph below it in Yang (1994. J. Mol. Evol. 39:306-314). The bug is not expected to have much effect on tree topologies or branch lengths. Thanks to Simon Whelan for reporting the bug. Version 3.15, March 2006 evolverNSbranches (simulation program to generate data under the branch models of codon substitution). A bug was introduced after version 3.14a so that versions 3.14b (?), 3.14c, and 3.15 have a bug that makes the program abort prematurely. This is now corrected. Thanks to N. Clark for reporting the bug. Version 3.15, January 2006 evolver. The simulation option of the evolver program in version 3.14b (?), 3.14c, and 3.15 has a serious bug that makes the program generate nonsensible sequences when the user tree has 0 internal branch lengths. The bug was introduced between versions 3.14a (which is correct) and 3.14c (which is wrong). I don't seem to have kept 3.14b to check whether it is in error. In this case, the smart idea that led to the bug was the thought that no evolution occurs if the branch length is 0 (or less than 1e-20, to be more precise). If the branch length is 0, there is no need to evolve the sequence along the branch, so I decided to copy the sequence at the start of the branch to the node at the end of the branch. This would be fine, except that I introduced a bug at the same time that caused the program to skip generating sequences at all nodes that are descendents of the zero-length branch. As a result, sequences at tips that are descendents of the zero-length branch have no data. On some systems the uninitialized space is initialized as 0, and then the sequences will be printed out as consisting of all Ts (T is the first nucleotide in my program). At the same time, I also turned on the option of simulating data on random trees in evolver. You change fixtree=1 into 0, and recompile. Then use MCbaseRandomTree.dat to simulate data instead of MCbase.dat: evolver 5 MCbaseRandomTree.dat Version 3.15, November 2005 mcmctree: This implements the MCMC algorithm of Yang and Rannala (2005 MBE) for estimating species divergence times using soft bounds. The example file is in examples/SoftBound/. Look at the readme file and see whether you can duplicate our results in the paper. Note that this program works with DNA sequence data only, and assumes the molecular clock (clock = 1). Note that assumption of the clock can lead to seriously biased results if the clock is violated. Work is under way to relax the clock assumption. yn00: Added three methods for estimating dS and dN: LWL85 (Li, Wu & Luo 1985), LPB93 (Li 1993; Pamilo and Bianchi 1993), and LWL85m (Yang 2006 Computaional Molecular Evolution, Oxford University Press. Eqs. 2.12 & 2.13). codonml. Added example files for clade model C, in folder examples/CladeModelC. Thanks for Joe Bielawski for preparing these. Added a sequence data format, in which the numbers or frequencies of site patterns are inputed instead of the sequences. The format is indicated by the option variable P on the first line of the data file, used in the same way as option variables I (for interleaved format) and S (for sequential format, which is the default). See the section on sequence data format in the paml documentation for details. The old control variable readpattf for baseml and codeml is now deleted. You can also use evolver to simulate data using this format. Look at the control files MCbase.dat, MCcodon.dat, and MCaa.dat. This format is useful for very large data sets with >1M sites, when collapsing sites into patterns can take a long time. Version 3.14b, May 2005 pamp crashes when there are more than 127 or 255 (depending on the system) species in the data. The bug is due to my declaration of the variable visit[] in the routine PathwayMP1 in pamp.c as char rather than int. Another problem is that the formula used to calculate distances under REV (GTR) is sometimes inapplicable and the program prints out messages like "err: DistanceREV". I introduced some ad hoc way of dealing with the problem. Version 3.14b, April 2005 aaml (codeml with seqtype = 2): Under the REV model (model = 9), codeml failed to print out the estimated amino acid substitution rate matrix. This is fixed. aaml (codeml with seqtype = 2 or 3): If you have multiple data sets (ndata > 1) and analysing protein data sets under the empirical models such as wag, jtt, dayhoff. The results for the first data set are correct, but all later data sets are analyzed incorrectly under the corresponding +F models, that is, wag+F, jtt+F, dayhoff+F. A bug in the program means that for the second and later data sets, the equilibrium amino acid frequencies are from the real data and not correctly set to those specified by the empirical models. Thanks to Ben Evans. Version 3.14b, February 2005 codonml and aaml: There is a bug that affects the joint reconstruction of ancestral sequences when some branch lengths are equal to each other and when those branches are visited one after the other in the program. For example, if some branch lengths are estimated to be 0, such a problem might occur. I think this bug is in the program for a long time, although I note that joint ancestral reconstruction is turned off in some versions for other reasons. The marginal reconstruction is not affected by this bug. The update fixes this bug. baseml and codeml: The program crashes on some systems for the local clock models (clock = 5 and 6) after the calculation is finished. If you manage to run the model, the results should be fine. There are bugs in the program when freeing memory at the end of the calculation. They are fixed in this update. See the discussion site at http://www.rannala.org/phpBB2/viewtopic.php?t=228 Thanks to fjvera. Version 3.14a, December 2004 codonml (codeml for codons): The mechanisctic amino acid substitution models (table 3 in Yang et al. 1998 MBBE 15: 1600-1611) appear to be broken in paml/codeml versions 3.13 and 3.14. They were correct in version 3.12. This is now fixed. If you use those models, please run the examples in the folder examples/mtCDNA/ and compare results with those published in the paper to confirm the program. The results in the paper are correct, but the program implementation may be broken due to lack of maintenance. codonml (codeml for codons) When you fit the branch models with three or more branch types (omega ratios), the program aborts with an error message saying that only two omega ratios are allowed. This is due to a bug in the program and is now fixed. Version 3.14, September 2004 1. codonml (codeml for codons): (1a) NSsites models M1 (neutral), M2 (selection), and M8 (beta& ) have been changed. The references are two manuscripts that are not yet published (Wong, Yang, Goldman & Nielsen, 2004 Genetics 168: 1041-1051; Yang, Wong, and Nielsen 2005 MBE). The details are as in the following table. [table missing or never written.] The old models M1 and M2 are noted to be rather unrealistic as they do not account for sites with 0 < w0 < 1. Thus in the new models 0 < w0 < 1 is estimated from the data. Also insisting on a class of sites with 0 = 1 appears to help avoid false positives as sites under weak selection will be absorbed in this neutral class rather than being claimed to be under positive selection. Under M8, the constraint s > 1 is now automatically enforced to help avoid multiple local peaks and also to make the model a positive-selection model. Note that the newer models replace the old ones, which are available in older versions of paml/codeml only. After the changes, the models are nested as follows: M0 (one ratio) < M1a (NearlyNeutral) < M2a (PositiveSelection) < M3 (discrete), and M7 (beta) < M8 (beta& ), where "<" means the model on the left is a special case of the model on the right:. Two LRTs can be constructed using those models. The first one compares M1a against M2a, using a chi square with df 2, and the second one compares M7 against M8, using a chi square with d.f. = 2. (1b) Similarly branch-site model A (Yang and Nielsen 2002) is changed. The old model assumes 0 = 0 and is unrealistic. This is replaced by 0 < 0 < 1, estimated from the data. The new model is still called branch-site model A. It can be compared with the new M1a (NearlyNeutral) to form a likelihood ratio test, with d.f. 2. This is called test 1. Another test, called test 2, uses a null model A but with 2 = 1 fixed (use fix_omega = 1 and omega = 1). Test 2 is very conservative, but test 1 might mistake relaxed selective constraint on the foreground branch as positive selection. (1c) Similarly the clade model C ( see also Forsberg and Christiansen 2003; Bielawski and Yang 2004) is changed as follows. The new model C replaces 0 = 0 by 0 < 0 < 1 and has five parameters in the distribution: p0, p1, 0, 2, and 3. The new model C can be compared with the new M1a (NearlyNeutral), with d.f. 3. (1d) The empirical Bayes procedure for calculating posterior probabilities of site classes under the site models, branch-site models, and clade models are called naïve empirical Bayes (NEB). They are naïve because they use the MLEs of parameters without accounting for sampling errors in them. This problem is now fixed using a procedure called Bayes empirical Bayes (BEB) (Yang, Wong, and Nielsen 2005). The BEB procedure is implemented for the site models M2a and M8, the (modified) branch-site model A, and the (modified) clade model C. The old NEB calculation is kept in the output and the new BEB results follow the NEB result. Perhaps I will remove the NEB calculation in later versions. baseml/codeml: rewrote likelihood clock and local clock models. Implemented models for combined analysis of multiple genes incorporating multiple calibration points (Yang and Yoder 2003). The ad hoc rate smoothing procedure of Yang (2004) is implemented for nucleotide, amino acid, and codon substitution models, and the implementation also deals with missing species at some loci. The option variable clock in the control files is now used differently from before. See later in this documentation and also the readme and readme2 files in the folder examples/MouseLemurs/. baseml/codeml local clock models. clock = 5 and 6 are added in 3.14beta5. codeml: added branch-site models C and D (Bielawski and Yang in press). mcmctree: this program is disabled in this release. The old program died and the new program is still under construction. The main body of this documentation may not be up to date. Bug fixes and minor corrections baseml/codeml. The output in the file rates under the gamma model lists inferred rate for each site. The output is incorrect if the tree is large and scaling is used to avoid underflow. The use of scaling is indicated by two lines of output on the monitor like the following: "2 node(s) used for scaling (Yang 2000 J Mol Evol 51:423-432): 155 350". The results are clearly wrong as the probabilities are much greater than 1, and the rates are many orders of magnitude too large, etc. Also under the same problematic condition the expected numbers of sites with certain site patterns in the file lnf are incorrect. When scaling is not used, the results should be o.k. and look reasonable. This affects versions 3.13 and 3.14beta1-3. This is fixed in version 3.14. Thanks for Nick Goldman for pointing out the error. yn00. The program crashes for large datasets with many codons due to a memory allocation error. This affects versions 3.13 and some versions of 3.14beta1. The problem is fixed in version 3.14. aaml (codeml for amino acids). Model REVaa_0 is broken due to lack of maintenance in versions 3.1, 3.11, 3,12, 3,13, and 3,14beta1-4. It seems to be working in version 3. This is fixed in 3.14beta5. 2 April 2004. codonml (codeml for codons) in branch-site models (model = 2 NSsites = 2 or 3) prior to 3.14beta3 have a scaling problem, which makes the length of the foreground branch to be incorrectly estimated. Other branch lengths and substitution parameters, such as the parameters in the distribution, are corrected calculated, as well as the log likelihood values and the posterior probabilities. 27 March 2004 pamp: The gamma parameter for variable rates among sites using the method of Yang & Kumar (1996) is not estimated correctly in 3.14beta1, beta2, beta3. Versions 3.13d and earlier seem fine. codonml (codeml for codons) in version 3.14beta and 3.14beta1 does not work when fix_omega = 1 is used for branch models. The results are incorrect. Please use newer versions 3.14beta3 or later. Also version 3.13 is fine. baseml: Models TN93 and REV are wrong when used with Mgene = 3 or 4. This seems correct in version 3.12 but went wrong in 3.13 and 3.14 since I inserted TN92 between HKY85 and TN93. Thanks for Lee Bofkin for reporting the error. codonml (codeml for codons): When codon models are used to reconstruct ancestral sequences (with RateAncestor = 1), the program lists synonymous and nonsynonymous changes at each site under the heading "Changes at sites (syn nonsyn)." This listing is incorrect due to a bug in the program. This bug affects versions 3.12, 3.13 and 3.14, and version 3.11 seems correct. Thanks to Joe Bielawski. baseml/codeml: The SE's for divergence times under the clock models are calculated incorrectly. This happens when you use clock = 1 or 2, supply fossil date to calculate absolute times, and request standard errors for times. The estimates of times themselves are correct, but standard errors for times are wrong. The SEs for times and for the rate under the TipDate model (clock = 3) are wrong as well. The programs print out the variances after the +-, instead of their square roots. This error was introduced in version 3.13. Versions prior to 3.13 are correct. I posted an update 3.13d to fix this bug. evolver: the simulation program can now accept species names in the tree. Note that the data file formats for evolver (MCbase.dat, MCcodon.dat, MCaa.dat) have changed and you might have to change your own data files. Look at the files included in the package. The documentation in paml v3.13 from August 2002 - 12 December 2002 had a mistake about the critical values for the newer test using a modified M8. The critical values for the test are 2.71 at the 5% significance level and 5.41 at the 1% level, rather than 1.95 and 3.32 as in the documentation. Version 3.14beta2, November 2003 (1) baseml: Models TN93 and REV are wrong when used with Mgene = 3 or 4. This seems correct in version 3.12 but went wrong in 3.13 and 3.14 since I inserted TN92 between HKY85 and TN93. Thanks for Lee Bofkin for reporting the error. (2) codonml (codeml for codons): When codon models are used to reconstruct ancestral sequences (with RateAncestor = 1), the program lists synonymous and nonsynonymous changes at each site under the heading "Changes at sites (syn nonsyn)." This listing is incorrect due to a bug in the program. This bug affects versions 3.12, 3.13 and 3.14, and version 3.11 seems correct. Thanks to Joe Bielawski. Version 3.14beta, 20 July 2003 (1) baseml/codeml: rewrote likelihood clock and local clock models. Implemented models for combined analysis of multiple genes using multiple calibration points (Yang and Yoder 2003). The option variable clock in the control files is now used differently from before. See documentation below and also the readme and example files in the folder paml/examples/MouseLemurs/. (2) codeml: added branch-site models C and D (Bielawski and Yang submitted). (3) baseml/codeml: The SE's for divergence times under the clock models are calculated incorrectly. This happens when you use clock = 1 or 2, supply fossil date to calculate absolute times, and request standard errors for times. The estimates of times themselves are correct, but standard errors for times are wrong. The SEs for times and for the rate under the TipDate model (clock = 3) are wrong as well. The programs print out the variances after the +-, instead of their square roots. This error was introduced in version 3.13. Versions prior to 3.13 are correct. I posted an update 3.13d to fix this bug. (4) mcmctree: this program is decommissioned in this release. The old program died and the new program is still under construction. (5) evolver: the simulation program can now accept species names in the tree. Note that the data file formats for evolver (MCbase.dat, MCcodon.dat, MCaa.dat) have changed and you might have to change your own data files. Look at the files included in the package. (6) Improved the iteration algorithm a little. No change to the interface. (7) The documentation in paml v3.13 from August 2002 - 12 December 2002 had a mistake about the critical values for the newer test using a modified M8. The critical values for the test are 2.71 at the 5% significance level and 5.41 at the 1% level, rather than 1.95 and 3.32 as in the documentation. (8) Edited the manual pamlDOC.pdf. 15 May 2003 paml3.13d baseml/codeml: The SE's for times under the clock models are calculated incorrectly. This happens when you use clock = 1 or 2, supply fossil date to calculate absolute times, and request standard errors for times. The estimates of times themselves are correct, but standard errors for times are wrong. The SEs for times and for the rate under the TipDate model (clock = 3) are wrong as well. The programs print out the variances after the +-, instead of their square roots. This error was introduced in version 3.13. Versions prior to 3.13 are correct. 12 December 2002 correction of documentation The paml documentation in paml v3.13 from August 2002 - 12 December 2002 has a mistake about the critical values for the newer test using a modified M8. The critical values for the test are 2.71 at the 5% significance level and 5.41 at the 1% level, rather than 1.95 and 3.32 as in the documentation. I made a mistake and quickly discovered it. I thought I corrected this a few months ago, but apparently the document included in the paml release had this mistake. I am sorry. I'll correct it now. Version 3.13, August 2002 (1) baseml: nhomo = 1 and model = 6 (for both the one-rate and gamma-rates models) in version 3.12 gives incorrect results. The error was introduced in version 3.12 and is now fixed. Sorry for the trouble and thanks to Joshua Scott Rest. (2) baseml: Tamura's (1992) model is inserted between HKY85 and TN93. Unfortunately this means that the model code for TN93 and REV are shifted, and you have to change your control file for those two models. I also added user-specified REV or UNREST models. 0:JC69, 1:K80, 2:F81, 3:F84, 4:HKY85, 5:T92, 6:TN93, 7:REV, 8:UNREST, 9:REVu; 10:UNRESTu (4) I implemented the nonhomogeneous models of Galtier and Gouy (1998). The options are model = 5 (T92) and nhomo = 3 (N1) or 4 (N2) or 5 (user). nhomo = 5 is added for the model of Yang and Roberts (1995) as well. It means that the user specifies, in the tree, how many sets of base frequencies (or how many GC content parameters in the model of Galtier and Gouy 1998) should be used and which branches has which sets. (5) codeml: For pairwise amino acid sequence comparison (seqtype = 2 and runmode = -2), the program crashes. This is fixed now. Thanks to Dennis Paul Wall. (6) codeml: The mechanistic models of amino acid substitution (model = 6 or 7) are broken for quite some time due to lack of maintenance. I have now fixed them. See the file examples/mtCDNA/readme.txt. (7) codemlsites is now removed, which is for batch run of multiple NSsites models. This is now done by specifying several NSsites models in the control file codeml.ctl for codeml, for example: NSsites = 0 1 7 8. (8) baseml & codeml used to ask the user what to do if the tree has branch lengths. I have now introduced a variable fix_blength in the control file that does the same thing. fix_blength = -1 * 0: ignore, -1: random, 1: initial, 2: fixed Version 3.12, March 2002 (1) baseml and codeml, Mgene=2 or 4 (different pi's) and RateAncestor = 1 Ancestral reconstruction does not work properly although the ML iteration is fine. The output likelihood values (in the form "Node 20: lnL = -1047.965624") are different from the optimum likelihood already calculated. The option was fine in version 3.0e and 3.1, and was broken in version 3.11. (2) codeml stops with an error message "pi=0" when some amino acids are missing in the sequences. Earlier versions of the program are correct. (3) yn00 now accepts multiple data sets (specified by ndata). Version 3.11, September 2001 (1) baseml and codeml (clock = 1 or clock = 2) sometimes print out ridiculous (e.g., large negative) tree lengths and branch lengths. I thought this was fixed in version 3.1, but it was not. Hopefully it is now fixed. (2) baseml and codeml: Malpha (a separate alpha for each site partition) does not work with Mgene = 0. The output will give you the same alpha value for each gene as you specify in the control file. This is now fixed. (3) baseml: When the REV or UNREST models (model = 6 or 7) are used together with Mgene = 2, 3, 4, the program outputs one Q matrix, when each site partition should have a different Q matrix. The ML parameter estimates are correct, but I did not bother to format the output correctly. The single Q matrix in the output is the one for the first partition. I have added the output for the other matrices now. (4) baseml and codeml. Ancestral reconstruction was disabled for large trees with more than 100 or 200 sequences in previous versions. It is now allowed. On large trees, the multiplication of transition probabilities across branches in the tree cause the product to become too small to be represented in the computer. This is known as underflows. I implemented a scaling algorithm to avoid underflows some time ago but did not bother to do it with the ancestral reconstruction algorithm at the time. The algorithm should now work, up to the upper limit set in the program, that is, 1000 or 2000 sequences. Version 3.1, June 2001 Changed the way of counting base or amino acid frequencies so that this version of baseml or codeml might generate different results from previous versions under the following model specifications. . baseml & aaml: if you have multiple genes and the model uses the same frequencies in the model for the different site partitions. . codonml: if you have ambiguity characters at all. Ambiguity characters were ignored when counting frequencies in previous versions but are used now. There is no difference if your data do not contain any alignment gaps or ambiguity characters or if you choose to remove them before any analysis (cleandata = 1). The early versions used two ad hoc methods in dealing with such ambiguity characters in different places of the programs. The new version uses another ad hoc method and in all places. I think it will be fine which ever version you use, but do not use different versions in the same analysis. codonml (codeml with seqtype = 1) with Mgene = 1 does not work correctly if the sequences have alignment gaps or ambiguity characters. Mgene = 1 means separate analysis. The bug causes data for the genes except the first to be read out of frames, so that most likely you will see a lot of messages "stop codon XXX" before the program aborts. This is now fixed. codonml (codeml with seqtype = 1) with fix_alpha=0: In the final output for the codon + gamma model, parameter alpha is not correctly identified. If you have fix_kappa = 0, fix_omega =0, and fix_alpha=0, the last three numbers should be kappa, omega (dn/ds), and alpha. The detailed output did not identify alpha correctly, although the likelihood value and the outputted rates and frequencies (r and f) are correct. I believe this is now corrected. Version 3.0e, 14 May 2001 codonml (codeml with seqtype = 1) with model = 1: The program crashes when you have a small tree with <= 5 branches. This is a new error introduced in version 3.0c. It is now fixed. baseml and codeml (clock = 1 or clock = 2) sometimes print out ridiculous tree lengths and branch lengths. yn00 in version 3.0c ignores icode and uses the universal code only. It was correct in vesion 3.0b. This is fixed in version 3.0d. Versions 3.0c-3.0d, 20 February 2001 We bought a Pentium III cluster running redhat linux 6.2. The gcc compiler 2.61 seems to have bugs and codeml does not run for NSsites = 7 or 8. The program might output "points out of order" or "Det goes to 0" error messages. 3.0d seems to be safer for these models. codonml (codeml with seqtype = 1) with NSsites = 3 might generate a message "error: sortwM3." and then stop with no output after the iteration. This is now fixed. During the iteration, there was no restriction on the w values, so that w0 can be greater than w1 and so on. At the end of the iteration, I sort the three w values so that w0 < w1 < w2 before outputting. 3.0c has a bug at sorting. If you don't see the error message, everything is fine. 10 November 2000 The probabilities for Joint ancestral reconstruction are not correct. Many values are larger than 1. Fixed in version 3.0c. Version 3.0c, 30 October 2000 (1) Changed parameterization of clock models (clock = 1, 2, or 3), so that the output now lists node ages, and the SEs are for node ages as well. Node ages are measured by the expected number of nucleotide or amino acid substitutions per site (nucleotide, amino acid, or codon) from the node to the present time, and are proportional to the divergence times. In earlier versions, the output list proportions. (2) codonml: problem with NSsites model with multiple gene data (Mgene=1) is now fixed(?). In the earlier version, the likelihood calculation was o.k., but identification of sites under positive selection was incorrect as the program attempted to list sites in another gene when it was analyzing data for one gene. (1) Mgene options for codons and amino acids: Version 3.0b, 6 October 2000 The local clock models (clock = 2) are not implemented correctly in paml3.0 or 3.0a. Yoder, A. D., and Z. Yang. 2000. Estimation of primate speciation dates using local molecular clocks. Molecular Biology and Evolution 17 (7): 1081-1090. I introduced a serious error before I made the models available to the public, in ver3.0 (May 2000). Two symptoms are noted right now: (1) crashes (which are good as they make you realise that something is wrong immediately) and (2) the local clock model (clock=2) having a worse likelihood than the global clock model (clock = 1). The results in the paper are correct except for a typo ("37.30 - 48.48" in table 3 for the cetacean-whale divergence using the C3 calibration should be "57.30 - 48.48"). I now include the example data sets and results from that paper to help you prepare your own data files. Sorry for any damages done. Many thanks to Toby Johnson and Philip Awadalla for spotting the error. Version 3, 4 February 2000 The program evolver may not generate sequences correctly if branch lengths are very large, say >10. Scaling by nodes to avoid underflow. When there are many sequences in the data (say > 200 or 300), the probability of observing data at a site can become too small to represent in the machine. This is because the probabilities are multiplied along branches in the tree and a large tree has many branches. Since version 2.0, this underflow is dealt with by scaling, and it works with both the one-rate and gamma-rates models. It did not work with the NSsites models in codonml, but probably nobody has analysed large data sets to notice this. Anyway, this version hopefully fixes that problem, when I was analysing the data set of Bush et al. (1999 MBE 16:1457-1465), which has about 350 sequences. If you use the program on very large data sets, watch out for anything strange and let me know. I suppose with the option of using supplied branch lengths, data sets of such sizes will become manageable, and this scaling will become important. Version 2.0h, Oct 14, 1999 Fixed the following errors, all simple errors. The RemoveIndel routine in 2.0e and 20.f has a bug that removes most sites in the sequence except those with all T's and C's for baseml. The same routine in 2.0g has a different bug that removes most sites except those will all T's and C's in the sequence for aaml. In either case, the results will be grossly wrong. The bugs affect the analysis only if you have gaps or ambiguity characters in the data and use runmode = 1 for aaml in 2.0e and 2.0f, and runmode = 1 for codonml for 2.0g. The newly introduced option for amino acid reconstruction by baseml (icode) does not work. The program aborts with an error message. This is now fixed. Version 2.0g, Sept 21, 1999 Fixed a bug in 2.0f that causes crash in codeml when getSE = 1. If the tree has branch lengths, codeml and baseml in 2.0f will simply use the branch lengths rather than estimating them by iteration. I have added a question for the user to choose to ignore the branch lengths, use them as fixed or use them as initial values for the iteration. Version 2.0d, June 30, 1999 Fixed some convergence problems in ML pairwise estimation of dS and dN, in particular, in presence of data partitions (the G option). Fixed a bug in evolver. The bug means that no sensible data are generated if the tree has branch lengths smaller than 1e-7. At some point, I did something "clever" (taking advantage of the fact that at such small branch lengths, no evolution would occur), which destroyed the calculation. Version 2.0c, May 12, 1999 Corrected a few problems with the evolver program, and with running multiple data sets using baseml. Added REV in MCbase.dat. Thanks to John Mercer. Note--Uncomment ndata in baseml or codeml to analyse multiple data sets generated from evolver. April 16, 1999 Although posted only a few days ago, paml2.0 has an error in the program codeml, which invalidates most analyses based on codon substitution models. The symptom is relatively easy to spot and is indicated by one of the parameters (say omega) not being estimated at all and in the output you will have exactly the same number as you specified in the control file. Amino acid-based analysis or pairwise comparison of codon sequences are not affected. Use paml2.0a to redo the analysis. Apologies for the trouble. version 2.0, 31 March 1999 Added back the faster calculation when there is no missing data. Enabled the programs/analysis that broke in the temporary versions, such as pamp, joint reconstruction. Renamed listtree as evolver, and added simulation options. PAML 1.4 Temporary versions Dec98 & Jan99 pamlTMP.Jan99.notes =================== Ziheng Yang Fri Jan 15 10:01:42 GMT 1999 .Type make to compile the programs. If you got error messages, see whether you know how to change the file Makefile. If it does not work either, look at the files paml.cc, paml.gcc, etc. .At the suggestion of David Posada, I added ML estimation of pairwise distances between amino acid sequences in the program codeml. The relavant variables in the control file codeml.ctl are runmode = -2 for pairwise distance calculation aaRatefile = jones.dat * only used for aa seqs with model=empirical(_F) * dayhoff.dat, jones.dat, mtmam.dat, or your own model = 3 alpha = 0 model specifies the amino acid substitution model. You should choose a value from 0, 1, 2, 3. If you want to estimate the rate matrix from your data, you should do that using many sequences and then move the estimated substitution rate matrix (in the output file mlc) into a file and change the variable aaRatefile. Choosing alpha>0 means using gamma distance with that specified alpha parameter. If you see anything strange, please let me know. .I have included only three programs baseml, codeml, and listtree. I have not tested the others and so they are not included. pamlTMP.Dec98.notes =================== Ziheng Yang Wed Dec 9 10:44:37 GMT 1998 This temporary version of paml has involved a number of changes, and is quite likely to contain errors. Please let me know if you notice anything strange. Some of the options do not work for now. Details follow. I will compile the Win32 and PowerMac versions after everything is fixed and tested. . Missing data: baseml and codeml now handle missing data. Choose DelMissing = 1 in the control file to remove sites with ambiguity characters or alignment gaps, as did previous versions. . Clock: The clock is now fixed. It may still crash, but the option is not worse than the no-clock options. The definitions of the node times or branch lengths are now different from the documentation (pamlDOC.html), and so check the branch lengths in the output tree topology. Thanks to Jeff Thorne. . Nexus file formats: Paml programs now read sequence files and tree structure files in the Nexus format used by paup and MacClade. Only the data or tree are read, and everything else in the file is ignored. . Ancestral reconstruction: Marginal ancestral reconstructions are now done under the gamma-rates model as well as the constant-rate model. For protein-coding genes, reconstructions can now be done at the nucleotide level with baseml (in particular, by using models that account for differences at the three codon positions), at the amino acid level with aaml, and at the codon level with aaml. According to Belinda Chang, those different reconstructions tend to be similar. Results for ancestral reconstruction are now listed by site, and not by site pattern as in earlier versions. Missing data are now handled in those analysis. See instructions in this doc about baseml.ctl for more options. Thanks to Belinda Chang. I would like to take this opportunity to point out that the reconstructed ancestral sequences are not the real observed sequences and may contain errors. . I have disabled the option for variable rates among sites in combination with variable dN/dS rate ratios among lineages in codonml. This option has never been implemented in the program but was not disabled. . Distance matrices from codonml. The program codonml outputs matrices of synonymous and nonsynonymous rates in all pairwise comparisons using the method of Nei & Gojobori (1986) into two files named DistanceNG.dN & DistanceNG.dN. The matrices are lower-diagonal and can be used with programs, such as neighbor and fitch, in phylip (and possibly paup* too) for tree reconstruction or branch length estimation by distance methods. If you choose runmode = -2 for ML pairwise comparison, the program will also output two matrices of ML estimates into two files named DistanceML.dS & DistanceML.dN, for synonymous and nonsynonymous rates. Since many users seem interested in looking at dN/dS rate ratios among lineages, examination of the tree shapes indicated by branch lengths calculated from the two rates may be interesting although the analysis is intuitive. Obviously, you should NOT name your own data files with those names; otherwise they will get overwritten. If your species name has more than 10 characters, you will have to edit the files and cut the names short before you can use Phylip programs. I have decided to leave this work to the user. I plan to create some other distance matrix files for nucleotides and amino acids as well. . An minor prompt error when running codonml with model = 2 (variable dN/dS ratios among lineages) is fixed. The example data set lysozymeSmall.nuc is included for the user to duplicate my results in Yang (1998 Molecular Biology and Evolution 15: 568-573). The control file codemlLysozyme.ctl explains in more detail how to specify the options. I would like to remind you that it is not correct to use the option model = 1 to estimate dN/dS ratios for branches to find out which ratios are greater than one, and then to use model = 2 to test whether that ratio is significantly greater than one. The problem with such an analysis is that the hypothesis being tested is derived from the data which you use to test the hypothesis, and as a result, you tend to get significantly results too often. Strictly speaking, the hypothesis should be formulated before the data are analysed. Things that do not work in this version include . All parsimony-based analyses are now broken. The program pamp is not included. Programs baseml and codeml used to calculate the MP score for each tree before doing an ML analysis on that tree. Now a -1 is used to indicate this is not possible. Version 1.4 (UCL) July 1998 1. Most of the changes are in the program codeml (aaml for amino acid sequences and codonml for codon sequences). A few new models of codon substitution are implemented (see Yang and Nielsen 1998; Nielsen and Yang 1998; Yang 1998). 2. Some problems relating to ancestral sequence reconstruction (in baseml and codeml) are fixed. The earlier algorithm does not work properly if the data contain more than several sequences. The algorithm makes a guess at the likely character states at the interior nodes of the tree, and then uses those to generate reconstructions (joint reconstructions, see eq. 2 in Yang et al. 1995) to be evaluated. Sometimes this strategy misses important reconstructions, and as a result, the probabilities for reconstructed characters at specific interior nodes (marginal reconstructions; see eq. 4 in Yang et al. 1995) are substantially underestimated if the number of sequences is not very small. I have now written separate codes to do the marginal reconstruction, evaluating all possible character states for each interior node. This works also under the gamma model of substitution rates among sites, a feature that was not implemented before. The joint reconstruction works with one rate for all sites only and has the old problem of possibly missing important reconstructions. However, the posterior probabilities for those joint reconstructions that do get evaluated are accurate. I have added an option (choose RateAncestor = 2 rather than 1) for the user to specify the reconstruction to be evaluated. The two approaches are expected to produce very similar results, and since the marginal reconstruction is always reliable, perhaps it can always be used for data analysis. (Thanks to Belinda Chang.) 3. The output file for estimated rates for sites under the variable-rates models is now "Rates.out" instead of "rst". 4. I have implemented ancestral sequence reconstructions based on codon-substitution models (codonml). This has not been tested carefully, and I would appreciate your comments if you use it. 5. A number of minor changes have been made. I have fixed several floating exception errors that seem to occur on DEC Alpha machines only. 6. The documentation is now changed into the html format. Version 1.3a,b,c (UCL) June 1998 1. Basemlg was put back at the request of a user. 2. Some changes are made to codeml concerning codon-substitution models (Yang and Nielsen 1998 JME; Yang 1998 MBE; Nielsen and Yang 1998 Genetics) PAML Version 1.3: (UC Berkeley) July 1997 1. Starting with this version, the program basemlg will not be included in the package. This program implements the (continuous) gamma model of substitution rate variation among nucleotide sites (Yang 1993). It involves intensive computation and has been superseded by the discrete-gamma model implemented in the program baseml. If you want to use this program, you should use previous versions of paml, which can be obtained from me if nowhere else. 2. The simulation program mcml.c is removed from the package. This is now superced by the program listtree.c, which does different things besides simulating data sets. 3. I have implemented the stepwise addition algorithm in the programs baseml and codeml. This algorithm is faster than the star-decomposition algorithm implemented before, but my implementation is crude and inefficient. The NNI perturbation is implemented too. These work without the molecular clock. 4. Many changes are made with the program codonml (codeml with seqtype=1). A few new codon-based models are implemented. One assumes different nonsynonymous/synonymous (dN/dS) rate ratios among branches in the phylogeny, and may be useful for detecting episodic positive selection. Another allows the dN/dS ratio to vary among sites and may be useful for testing neutrality and detecting positively-selected sites in the sequence. These are based on my collaborative work with Rasmus Nielsen. I made some changes to the output of estimates of dS and dN in pairwise comparisons (codonml with runmode=-2), so that the results are easy to understand. Another change with this program is in the codon-substitution model of Goldman and Yang (1994). We used the amino acid distance matrix of Grantham (1974) to modify the rate of nonsynonymous substitutions, but unfortunately the model does not fit data well. It does not always fit the data better than if we simply assume equal distance between any two amino acids. So I have added the option of ignoring the amino acid differences. Goldman and Yang's original formula is still available but not recommended for use. As a result of these changes, the formulation of the Goldman and Yang model is now somewhat different from the paper and previous versions of the program. See Section 6.2 for details. 5. I have again made some changes with the user interface, based on my guess of what the "user" would like to see. Version 1.1a-d: (University of California, Berkeley) 1995-1996 1.1a No records of changes remain. 1.1b November, 1995: The following line at the beginning of the file tools.h #define __cplusplus is removed. I added this line to avoid some (unjustified) warning messages from a compiler I used to use and forgot to remove it when I was preparing paml1.1. This line made some compilers unable to compile. A bug that caused baseml and basemlg to give wrong estimates of kappa under the F84 model is now fixed. Estimates of kappa under the F84 models listed in the paml 1.0 and 1.1 documents are all incorrect and are a bit too small. For example, the estimate of kappa under the F84 model for the mtDNA data of Brown et al. (1982) should be 4.344 while the wrong estimate given in the large table of the document is 3.420. Likelihood value, branch lengths and other parameters seem to have been correctly calculated by paml 1.0 and 1.1. Versions prior to 1.0 do not have this bug, and results in Yang (J. Mol. Evol. 1994, 39:306-314) and Yang, Lauder, and Lin (J. Mol. Evol. 1995, 41:587-596), where the F84 model was used, are correct. 1.1c March 1996 Representation of tree topologies using branches (see the document) are now allowed (restored) in the tree structure file. This is for coping with cases where some taxa are ancestors of others. If this representation is used, the line starts with a dollar sign. So the following line in the file trees.5s (the tree file for 5 species) $ 5 6 3 6 4 6 5 3 1 3 2 represents the tree ((12)34), with 5 to be the ancestor of 1 and 2. Alignment gaps are now allowed when option G is used (for multiple genes). The gaps are removed before analysis. A small program listtree lists all the bifurcating trees for a given (small) number of species. codonml (codeml.c with seqtype=1) now has an option for calculating the number of synonymous substitutions per synonymous site (Ks) and the number of nonsynonymous substitutions per nonsynonymous site (Ka) and their ratio for each pairwise comparison, using the method of Goldman and Yang (1994). This is implemented with runmode=-2. Estimates of Ks and Ka are collected in the file rst and estimates of model parameters in the file mlc. This option produces table 2 in the Goldman & Yang paper. The interior nodes were incorrectly identified when the ancestral sequences were listed in the file rst. They were all one less than their correct numbers. Nodes 7, 8, 9, 10 for the stewart.aa data were incorrectly identified as nodes 6, 7, 8, 9. This is corrected now. 1.1d. July 1996 An unintended version of codeml.c was included in version 1.1c, which causes the program to produce different results under the codon-based model of Goldman and Yang (1994). The small number in baseml.c was set too small, making the program do unnecessary iterations; this is fixed now. I also changed the program pamp so that it will estimate the substitution rate matrix without using a tree. I have included a few other genetic code tables in this version of codeml; see the file codeml.ctl. PAML Version 1.2: November 1996 (UC Berkeley) 1. I have done some changes with the user interface. Name of files (sequence data file, control file, tree structure file) are now specified in the control files (The default are baseml.ctl, codeml.ctl, pamp.ctl, mcmctree.ctl). The command line for executing a program is now ProgramName ControlFileName with ControlFileName optional. I have added some text in the output file of baseml so that the results are eassier to understand. Choose verbose=0 in the control file if you do not want the detailed output. 2. A new program mcmctree is included, which performs Bayesian estimation of phylogenies using Markov chain Monte Carlo methods (Rannala and Yang 1996; Yang and Rannala 1997). 3. A number of minor changes are made, such as using species names in the parenthesis tree representation. July 1995 Version 1.1: (IMEG PennState) Models for analyzing data of multiple genes (Yang 1996 JME) are implemented in baseml and codeml. Two more programs are included: pamp for parsimony-based analysis implementing the methods of Yang and Kumar (1996 MBE) and mcml for Monte Carlo simulation (prepared for Arndt von Haeseler but apparently not used). February 1995 Version 1.0 (IMEG, PennState) This is the first official version of paml. Three main programs are included: baseml, basemlg and codeml. Control files are used for all of them. The program codeml is formed by merging two programs codonml and aaml, for codon sequences and amino acid sequences, respectively. One rate, gamma rates, and auto-discrete-gamma rates for sites are implemented for nucleotide, amino acid, and codon-based models. The document is provided as postscript files (paml1_3.ps). October 1994 Version 0.12, no name for package (BAU and IMEG PennState) The bracket notation of tree topologies is introduced. Tree search by star decomposition is implemented in baseml and codonml. The options for baseml are moved into an option file baseml.ctl. The molecular clock seems to work with the automatic tree search model (runmode=1 or 2). The old readme file is renamed manual. March 1994 Version 0.10, no name for package: (NHM, England) Programs tripml and comptree are renamed codonml and rell. The readme file is expanded and the numerical optimization routine, ming1, used by dnaml and codonml, is improved. The discrete-gamma model of Yang (1994 JME) is implemented in both baseml and codonml. Preliminary version 0.11, no name for package: April 1994 (NHM, England) Programs dnaml and dnamls are renamed baseml and basemlg to avoid confusion with Joe Felsenstein's programs. The auto-discrete-gamma model and the nonparametric models of rate variation and correlation mong sites of Yang (1995 Genetics) are added in baseml. Nonhomogeneous-process models of Yang and Roberts (1995 MBE) are implemented in baseml. April 1993 Version 0.01, no name for package (Cambridge). Three programs are included in the package: 1. dnamls implements the model of gamma rates among sites of Yang (1993) and works with arbitrary topologies and substitution models (JC69, K80, F81, HKY85). 2. dnaml implements the model of a single rate for all sites. (The same name as Joe Felsenstein's program was used.) 3. tripml implements the codon-based model of Goldman and Yang (1994). Trees are represented by the "branch notation". // end of file