examples/MouseLemurs/README2.txt

   1 README.txt\r
   2 Ziheng Yang, January 2004\r
   3 \r
   4 (A)  Overview\r
   5 \r
   6 This folder includes the files used in Yang (2004) and this file\r
   7 describes the use of the AHRS algorithm described in that paper.  Two\r
   8 options (clock = 5 and 6) are implemented in baseml and codeml.  clock\r
   9 = 5 is for the global clock, and clock = 6 implements the heuristic\r
  10 rate smoothing (AHRS) algorithm combined with maximum likelihood\r
  11 estimation of divergence dates.  The mouse lemur files are used as an\r
  12 example.  The control files for the nucleotide, amino acid, and codon\r
  13 based analysis published in Yang (2004) are baseml2.ctl, aaml2.ctl,\r
  14 and codonml2.ctl.  You can open those files to check them.\r
  15 \r
  16 The models in Yang (2004) have two features compared with the previous\r
  17 models (clock = 1, 2, 3), which were described in Yang and Yoder\r
  18 (2003).  The first is that the new models deal with multiple-loci data\r
  19 sets in which some species are missing at some loci.  The second is\r
  20 that the AHRS algorithm attempts to automatically assign branches to\r
  21 rate groups, which are then used for maximum likelihood estimation of\r
  22 divergence dates using the method of Yang and Yoder (2003).  So the\r
  23 new global-clock option (clock = 5) is identical to the old global\r
  24 clock option (clock = 3) if every locus has all the species.  Note\r
  25 however that there are some differences in the data formats between\r
  26 the new models and the old ones, due to the need to deal with missing\r
  27 species at some loci.  See section (B) for details.\r
  28 \r
  29 \r
  30 (B) Data Format\r
  31 \r
  32 Have a look at the sequence data file MouseLemurs123.nuc and the tree\r
  33 file MouseLemurs.trees in this folder.\r
  34 \r
  35 The new models, clock = 5 (global clock or simply clock) and 6 (local\r
  36 clock) are for combined analysis of data from multiple loci (or site\r
  37 partitions).  An important difference from the old options, clock = 0,\r
  38 1, 2, 3, is that some species can be missing at some loci.  Specify\r
  39 the number of loci using the variable ndata in the control file\r
  40 baseml.ctl or codeml.ctl.  Prepare a tree file, including a tree for\r
  41 all species that occur in the sequence data.  This will be called the\r
  42 species tree or master tree.  The beginning of the tree file should\r
  43 have the number of species and the number of trees (which is usually\r
  44 1) on the first line.  Also prepare a sequence data file, with the\r
  45 sequence alignments listed one after another for the loci.  The\r
  46 example file MouseLemurs123.nuc lists data from the 3 codon\r
  47 positions as if they were three different genes.  When the program\r
  48 runs, it reads the master tree first, and then the sequence alignments\r
  49 locus by locus.  For each locus, it constructs (extracts) the sub-tree\r
  50 from the master tree.\r
  51 \r
  52 For example, suppose the master tree is for six species with two\r
  53 fossil calibration points:\r
  54 \r
  55       (((S1, S2), S3) @0.10, ((S4, S5) @0.20, S6));\r
  56 \r
  57 Suppose we have sequences for 4 species at the first locus and 5\r
  58 species at the second locus (with ndata = 2).  The data file looks as\r
  59 follows.\r
  60 \r
  61      4 100\r
  62      S1       TCTATGTTATATGTATGAGTATGA ...\r
  63      S3       TCGATATTATGTGTATGAGTATGA ...\r
  64      S4       TCTATATTACATGTATGAGTATGA ...\r
  65      S6       TCCATATTAAATGTATGAGTATGA ...\r
  66 \r
  67      5 200\r
  68      S1       GTTATATGTATGAGTATGATCTAT ...\r
  69      S2       GTTATATGCGTGAGTATGAACTAT ...\r
  70      S4       ATTATGTGTATGAGTATGATCGAT ...\r
  71      S5       GCTACATGTATGAGTATGATCTAT ...\r
  72      S6       ATTAAATGTATGAGTATGATCCAT ...\r
  73 \r
  74 \r
  75 The program will then construct the "gene tree" for locus 1 to be \r
  76       ((S1, S2), S3) @0.10;\r
  77 \r
  78 You should have at least one calibration node for each locus.  If two\r
  79 "sister" species never occur simultaneously at one locus so that there\r
  80 is no way of identifying the date of their divergence, you might want\r
  81 to consider them as one "species".  For example, if donkey is\r
  82 sequenced at some loci and horse at others and at no locus both are\r
  83 sequenced, you can rename the two species equus in the tree and\r
  84 sequence data files.\r
  85 \r
  86 \r
  87 (C) Models clock = 5 (clock) and clock = 6 (local clock)\r
  88 \r
  89 When you run clock = 5, a global clock is assumed.  If all species are\r
  90 present in every locus, this analysis can be achieved by using clock =\r
  91 3 (combined analysis) as in Yang & Yoder (2003).  This is the case for\r
  92 the included example data file MouseLemurs123.nuc.  Note again however\r
  93 that the data formats are different.  clock = 3 uses MouseLemurs.nuc\r
  94 while clock = 5 uses MouseLemurs124.nuc.  If some species are missing\r
  95 at some loci in your data, you can't use clock = 3 anymore and clock =\r
  96 5 is the only choice.\r
  97 \r
  98 The AHRS algorithm (clock = 6) works in 3 steps.  Here is a brief\r
  99 description.  See Yang (2004) for more details.\r
 100 \r
 101    Step 1: Estimate branch lengths on each gene tree without the\r
 102            clock.\r
 103 \r
 104    Step 2: Heuristic rate smoothing, to estimate one set of divergence\r
 105            times for the nodes in the species tree and as many sets of\r
 106            branch rates as the number of loci.  Each branch in the\r
 107            gene tree is assigned and estimated one rate.  For the\r
 108            above toy example, the program estimates 3 divergence times\r
 109            (five ancestral nodes minus two calibration nodes), plus 6\r
 110            rates (six branches in a rooted tree of four species) for\r
 111            the first locus and 8 rates for the second locus (eight\r
 112            branches in a rooted tree of five species).  A smoothing\r
 113            parameter called \nu is also estimatd for each locus.  Rate\r
 114            estimation is achieved by minimizing the discrepancy\r
 115            between the predicted branch lengths and the estimates\r
 116            obtained in Step 1, and by minimizing rate changes over\r
 117            time (across lineages).  The estimated rates for branches\r
 118            in the gene trees are then collapsed into a few branch rate\r
 119            categories.  The number of rate categories is set by a\r
 120            variable called nbrate=4 in the routine DatingHeteroData()\r
 121            in the file treesub.c.  You can change this and recompile.\r
 122            The program also allows you to manually classify the rates\r
 123            into groups.  When it asks for the number of rate groups,\r
 124            you can reply with 0: meaning no change, 1: meaning one\r
 125            rate (same as using clock = 5), or say 5 (in which case the\r
 126            program will further ask to input 4 cutting points to\r
 127            separate the rates into 5 groups).\r
 128 \r
 129    Step 3: Maximum likelihood estimation of divergence times using the\r
 130            assigned branch rate groups.  Divergence times are estimated\r
 131            simultaneously with the rates for branch rate groups.  For\r
 132            the toy example mentioned above, this estimates 3 = 5 - 2\r
 133            node ages in the species tree, plus 3 rates at each locus,\r
 134            with 9 time and rate parameters in all.\r
 135 \r
 136 \r
 137 (D) Output\r
 138 \r
 139 The output is fairly self-explanatory.  For clock = 6, look at the\r
 140 output on the screen at each of the three steps:\r
 141 \r
 142    Step 1: Estimating branch lengths under no clock.\r
 143    Step 2: Ad hoc rate smoothing to estimate branch rates.\r
 144    Step 3: ML estimation of times and rates.\r
 145 \r
 146 In particular, read in the output at the end of step 2 and check the\r
 147 branch labels (say, using Rod Page's TreeView) and see whether the\r
 148 assignment is reasonable and whether you want to use your own\r
 149 classification scheme.  The programs print out a file named\r
 150 RateDist.txt, which contains a distance matrix for each locus, with\r
 151 the distance calculated as the difference between two rates.  You can\r
 152 use an algorithm such as UPGMA to cluster the rates into groups\r
 153 (clades) to help with the classification.  I used the neighbor program\r
 154 in J. Felsenstein's phylip package to do this (figure 5b and the\r
 155 4-rate manual (4RM) model in Yang 2004).\r
 156 \r
 157 \r
 158 (E)  Notes\r
 159 \r
 160  1.  The example control files are baseml2.ctl, aaml2.ctl, codonml2.ctl.\r
 161 \r
 162  2.  In early versions of baseml and codeml, the order of the control\r
 163      variables in the control files does not matter to the\r
 164      specification.  Unfortunately that is not the case anymore for\r
 165      clock = 5 or 6.  The variables clock and ndata should be\r
 166      specified in the control file before some other variables such as\r
 167      fix_kappa, kappa, fix_alpha, alpha etc.  See the example files\r
 168      baseml2.ctl and codonml2.ctl etc. included in the same folder.\r
 169      The reason for this is that you can fix the kappa values for\r
 170      different genes at certain values during Step 1, and for the\r
 171      program to know how many kappa values to read, it needs to know\r
 172      the number of genes (ndata); see note 5 below.\r
 173 \r
 174  3.  The options are implemented in baseml for nucleotide sequences\r
 175      and in codeml for amino acid and codon sequences.  \r
 176 \r
 177  4.  Only rooted binary trees are accepted in the tree file.\r
 178 \r
 179  5.  Estimating and fixing substitution parameters in step 1.  If the\r
 180      model involves substitution parameters such as kappa, alpha, you\r
 181      can run the program multiple times to guarantee getting correct\r
 182      estimates for them Step 1 of the algorithm and then use them as\r
 183      fixed constants in both Step 1 (when you rerun the program) and\r
 184      Step 3.  (Step 2 does not use a substitution model and so those\r
 185      parameters are irrelevant.)  Step 1 in the clock = 6 model always\r
 186      uses the algorithm of iterating one branch at a time, and may be\r
 187      problematic if the gamma shape parameter (for rate variation\r
 188      among sites) is estimated at the same time.  Note that the\r
 189      calculation in step 1 is just the old clock = 0 analysis and is\r
 190      in older versions of the programs, if the coccrect tree files are\r
 191      supplied for each locus.  In that case you can use clock = 0\r
 192      method = 0 to try and estimate the substitution parameters and\r
 193      then fix them when you run clock = 6.  If you have different\r
 194      numbers of species at different loci, the tree for each locus is\r
 195      different and it would be awkward to use clock = 0 to complete\r
 196      step 1.  In that case you can use clock = 6 so that the correct\r
 197      trees are extracted and used for each locus, but you run the\r
 198      algorithms multiple times to make sure the correct MLEs are\r
 199      obtained.  Either way, when step 1 is successful, you can fix the\r
 200      parameters in the control file, by say,\r
 201 \r
 202                   data = 3\r
 203              fix_alpha = 1\r
 204                  alpha = 0.29169  0.16392  1.24726\r
 205 \r
 206      which means that the alpha parameter is fixed at the three values\r
 207      above for the three genes (codon positions in this case).  Other\r
 208      parameters that can be fixed this way and that can vary for site\r
 209      partitions or genes or proteins include kappa and alpha for\r
 210      baseml, alpha and aaRatefile for amino acid sequences, and icode\r
 211      (genetic code table) kappa, and omega for codon-based analsis.\r
 212  \r
 213  6.  I notice that minor differences in the rate estimates might lead\r
 214      to assignment of one or two branches in a different rate group,\r
 215      leading to differences in final time estimates.  As a result,\r
 216      multiple runs using the same setting may lead to slightly\r
 217      different time estimates.  You should watch out for failure of\r
 218      the iteration algorithm in any of the three steps, which can\r
 219      cause differences between runs.\r
 220 \r
 221  7.  For protein sequences, you can use different substitution rate\r
 222      matrices.  In the following, three nuclear proteins (using wag)\r
 223      are followed by one mitochondrial protein (using mtmam).\r
 224 \r
 225         aaRatefile = wag.dat wag.dat wag.dat mtmam.dat\r
 226 \r
 227      For codon sequences, you can use different genetic codes.  In the\r
 228      following, three nuclear genes are followed by one mitochondrial\r
 229      gene.\r
 230 \r
 231              icode = 0 0 0 1\r
 232 \r
 233  8.  Separating codon positions into different files.  You can use\r
 234      Mgene = 1 option in baseml to separate the three codon positions\r
 235      into three different files (named Gene1.seq, Gene2.seq, etc.).\r
 236      You need to use verbose = 1 to do this, I think.  Then you can\r
 237      copy and paste those files into one sequence data file, which\r
 238      will be in the format required by the options here.\r
 239 \r
 240 Good luck.\r
 241 \r
 242 References\r
 243 \r
 244 Yang, Z. 2004. A heuristic rate smoothing procedure for maximum likelihood estimation of species divergence times. Acta Zoologica Sinica 50:645-656.\r
 245 \r
 246 //end of file\r