Technical/Simulation/Codon/README.txt

   1 Codon Simulation Programs\r
   2 \r
   3 README.txt \r
   4 by Ziheng Yang, July 2004\r
   5 Last modified:  June 2005\r
   6 \r
   7 \r
   8 (A) evolver: program for simulating codon sequences\r
   9 \r
  10 This document explains how to compile evolver to simulate codon\r
  11 sequences under the branch (Yang 1998), site (Nielsen & Yang 1998;\r
  12 Yang et al. 2000), and branch-site models (Yang & Nielsen 2002; Yang\r
  13 et al. 2005).  By default, option 6 of evolver simulates codon\r
  14 sequences under the codon model assuming the same omega (dN/dS) ratio\r
  15 for all sites and lineages (model M0).  The folder also includes a\r
  16 program called PositiveSites, which can be used to compare evolver and\r
  17 codeml outputs to calculate performance measurements of the inference\r
  18 under the site or branch-site models.\r
  19 \r
  20 The included .exe files are MS Windows executables.  If you use\r
  21 unix/linux/osx, you should delete the .exe files, and compile the\r
  22 programs yourself.  cd to the paml/src/ folder.  Then type one of the \r
  23 following commands to compile.  If you want, you can add some other \r
  24 compiler switches specific to your OS/compiler.\r
  25 \r
  26 evolverNSbranches:\r
  27      cc -O2 -DCodonNSbranches    -o evolverNSbranches evolver.c tools.c -lm\r
  28 \r
  29 evolverNSsites:\r
  30      cc -O2 -DCodonNSsites       -o evolverNSsites evolver.c tools.c -lm\r
  31 \r
  32 evolverNSbranchsites:\r
  33      cl -O2 -DCodonNSbranchsites -o evolverNSbranchsites evolver.c tools.c -lm\r
  34 \r
  35 \r
  36 (A1) evolverNSbranches takes input from MCcodonNSbranches.dat and simulate\r
  37 codon sequences with different omega ratios (dN/dS) for different\r
  38 branches on the tree, that is, under the model of Yang (1998).  Check\r
  39 that file for details, a copy of which is included in this folder.\r
  40 Note that you specify the branch lengths in the tree using the symbol\r
  41 ":", as in other programs, and you specify the omega ratio for the\r
  42 branch using the symbol "#", a notation not used by other programs.\r
  43 \r
  44 (A2) evolverNSsites takes input from MCcodonNSsites.dat and simulates codon\r
  45 sequences with a few classes of sites with different omega ratios\r
  46 (dN/dS).  This is the discrete (M3) model of Yang et al. (2000).  You\r
  47 can of course use this format to simulate other models as well,\r
  48 because they are its special cases.  For example, model M2a\r
  49 (PositiveSelecion) uses three site classes with omega < 1, = 1, and >\r
  50 1, so you can use a omega = 1 in the data file.  Check the file for\r
  51 details, a copy of which is included in this folder.\r
  52 \r
  53 Note that the branch length under a codon model is defined as the\r
  54 expected number of nucleotide substitutions per codon.  Under NSsites\r
  55 models, the silent rate is assumed to be constant along the sequence,\r
  56 and only the replacement rates are assumed to vary.  The branch length\r
  57 is then defined as the expected number of nucleotide substitutions per\r
  58 codon, averaged over the site classes.\r
  59 \r
  60 (A3) evolverNSbranchsites takes input from MCcodonNSbranchsites.dat and\r
  61 simulates codon sequences under variations of the branch-site models\r
  62 (Yang & Nielsen 2002).  Look at the file MCcodonNSbranchsites.dat and\r
  63 the documentation.  In this case, there are several site classes, and\r
  64 in each site class, the different branches on the tree have different\r
  65 omega ratios.  For example, the following might be part of the\r
  66 specification in MCcodonNSbranchsites.dat for simulating 4 site\r
  67 partitions (classes) on a tree of five species: A - E.  This specifies\r
  68 the tree topology and branch lengths, 4 site classes in the\r
  69 proportions 0.1, 0.2, 0.3, and 0.4.  Next the omega values for\r
  70 branches are specified for each of the four site partitions.  Viewed\r
  71 another way, each branch has a few site partitions with the same\r
  72 silent rate but variable replacement rates and thus omega's.  The\r
  73 branch length, say 0.1 for the branch leading to species A, is the\r
  74 expected number of nucleotide substitutions per codon, averaged over\r
  75 the site classes.\r
  76 \r
  77 ==================================================================\r
  78 (((A :0.1, B :0.2) :0.012, C :0.3) :0.123, D :0.4, E :0.5);  * tree and branch lengths\r
  79 \r
  80 4             * 4 site partitions\r
  81 .1 .2 .3 .4   * proportions of the 4 partitions; must sum to 1\r
  82 \r
  83 ((A #0.2, B #0.2) #0.2, C #0.2, D #0.2);  * omega for partition 1\r
  84 ((A #0.2, B #0.2) #1.0, C #0.2, D #0.2);  * omega for partition 2\r
  85 ((A #0.2, B #0.0) #0.2, C #0.5, D #0.2);  * omega for partition 3\r
  86 ((A #0.2, B #0.0) #2.0, C #0.0, D #0.8);  * omega for partition 4\r
  87 ==================================================================\r
  88 \r
  89 Note that evolverNSbranchsites will read the same tree topology 5\r
  90 times to simulate 4 site partitions, the first time to read the branch\r
  91 lengths and four more times to read the omega values for branches in\r
  92 each site partition.  Do not include omega values in the first tree\r
  93 and do not include branch lengths in the second to last trees.  This\r
  94 set up allows you to simulate under much more general models than the\r
  95 branch-site model A implemented in paml/codeml (see Yang & Nielsen\r
  96 2002; Yang et al. 2005).\r
  97 \r
  98 evolverNSbranchsites samples sites from the site partitions at random,\r
  99 so that the number of sites in each partition varies among replicate\r
 100 datasets.  This is assumed by the branch-site models of Yang and\r
 101 Nielsen (2002).  In the simulations of Zhang (2004) and Zhang, Nielsen\r
 102 & Yang (2005), the number of sites in each partition is fixed.  These\r
 103 authors also considered various violations of the branch-site models\r
 104 of Yang and Nielsen (2002).\r
 105 \r
 106 \r
 107 (B) PositiveSites: Program for summarizing simulation results\r
 108 concerning inference of sites under positive selection under site\r
 109 and branch-site models.\r
 110 \r
 111 When you use evolverNSsites (see above) to simulate data sets, the\r
 112 program will generate a file called siterates.txt, listing the sites\r
 113 that have omega > 1 and are thus truly under positive selection in\r
 114 each simulated data set.  When the data sets are analyzed using\r
 115 codeml, codeml will print results in a file called mlc (and in rst),\r
 116 which may include a list of sites inferred to be under positive\r
 117 selection with calculated empirical Bayes posterior probabilities.\r
 118 PositiveSites compares siterates.txt and mlc to calculate a few\r
 119 measures of performance of the codeml analysis.  This is the kind of\r
 120 analyses used in the simulation studies of Anisimova et al. (2002) and\r
 121 Wong et al. (2004).\r
 122 \r
 123 Currently codeml uses two procedures to calculate the posterior\r
 124 probabilities: the Naive Empirical Bayes (NEB: Nielsen & Yang 1998; Yang et\r
 125 al. 2000) and Bayes empirical Bayes (BEB: Yang, Wong, Nielsen\r
 126 2005).  NEB uses maximum likelihood estimates of parameters\r
 127 without accounting for their errors, while BEB uses a prior to correct\r
 128 for uncertainties in the parameter estimates.  codeml prints out the\r
 129 NEB results, followed by the BEB results in both mlc and rst, while\r
 130 PositiveSites processes mlc only.  The included windows executables\r
 131 PositiveSitesNEB and PositiveSitesBEB are for processing the NEB and\r
 132 BEB results, respectively.  To compile for unix, use the following\r
 133 command to compile the included source file.\r
 134 \r
 135    cc -O2 -DNEB        -o PositiveSitesNEB PositiveSites.c -lm\r
 136    cc -O2 -DBEB        -o PositiveSitesBEB PositiveSites.c -lm\r
 137    cc -O2 -DBranchSite -o PositiveSitesBS  PositiveSites.c -lm\r
 138 \r
 139 The program relies on a unique string in the codeml output mlc to\r
 140 identify results for each simulated replicate.  NEB relies on the\r
 141 string "Naive Empirical".  BEB relies on the string "Bayes Empirical".\r
 142 The source file PositiveSites.c is rather short so you can check the\r
 143 source file.\r
 144 \r
 145 PositiveSitesNEB, PositiveSitesBEB, and PositiveSitesBS are all\r
 146 supposed to work with codeml in paml 3.14 or later.  (Note that BEB was not\r
 147 implemented prior to version 3.14.)  \r
 148 \r
 149 The program should be run as follows:\r
 150 \r
 151    PositiveSitesBEB <#sites> <#repl>\r
 152    PositiveSitesBEB <#sites> <#repl> <Evolverf> <Codemlf>\r
 153    PositiveSitesBS  <#sites> <#repl> <Evolverf> <Codemlf> <positive site classes>\r
 154 \r
 155 \r
 156 \r
 157 The whole analysis should be run in the following order: simulate data sets, analyze \r
 158 datasets, and summarize the results.  \r
 159 \r
 160         evolverNSsites\r
 161         codeml\r
 162         PositiveSitesNEB 100 500\r
 163 \r
 164 The last line above tells PositiveSites the number of codons in the sequence (100) \r
 165 and the number of simulated replicates on the command line.\r
 166 \r
 167 \r
 168 Now about the measurements.  PositiveSites calculates the accuracy,\r
 169 power, and false positive rate of codeml inference of positive\r
 170 selection sites.  The measures are defined as follows (Anisimova et\r
 171 al. 2002; Wong et al. 2004).\r
 172 \r
 173                              codeml inference\r
 174                               +        -         Total\r
 175         evolver    +          N++      N+-       N+.\r
 176                    -          N-+      N--       N-.\r
 177                    Total      N.+      N.-       N\r
 178 \r
 179    Accuracy      = N++/N.+\r
 180    Power         = N++/N+.\r
 181    FalsePositive = N-+/N-.\r
 182 \r
 183 The program collects N++ (NmatchB & NmatchC), N+. (NEvolver), and N.+\r
 184 (NCodemlB & NcodemlC), and then calculates the three measures as\r
 185 above.  Note that codeml inference depends on a cutoff P, hence the B\r
 186 (for binned) and C (for cumulative) difference.  All proportions are\r
 187 calculated as the ratio of averages, taking the ratio after counting\r
 188 sites over replicate data sets.  Output is on the screen.  There are\r
 189 more descriptions of those measures in Yang, Wong & Nielsen\r
 190 (submitted).\r
 191 \r
 192 \r
 193 \r
 194 (C) Appendix: Testing the NEB algorithm.  This may not interest you\r
 195 anymore since it is about NEB, which is now superceded by the BEB.  It\r
 196 is not deleted yet as I spent time writing it and as it does explain\r
 197 what the posterior probabilities calculated by the NEB and BEB\r
 198 procedures mean.  You can confirm the calculation of codeml and also\r
 199 the simulation program by fixing parameters at their true values\r
 200 (values used in the simulation).  For example, you can use a small\r
 201 tree, simulated 100 codons in sequence and 500 datasets, using fixed\r
 202 parameters for the w distribution (say under model 2A).  Then analyze\r
 203 the data using codeml but with all parameters including branch\r
 204 lengths, kappa, and parameters in the w distribution fixed at their\r
 205 true values.  You can do this by using in.codeml.  See the paml\r
 206 Documentation about how to set this up.  Then when codeml says a site\r
 207 is under positive selection with probability 0.9, that site is under\r
 208 positive selection with probability 0.9.  The following is a sample\r
 209 output from PositiveSitesNEB in one such simulation.  Consider the\r
 210 second line of output.  1645 sites were listed by codeml as being\r
 211 under positive selection with posterior probability in the bin\r
 212 0.55-0.60.  Among them, 55.1% of them are truly under positive\r
 213 selection (on the evolver list).  This proportion is in the column\r
 214 "AccuracyBin", accuracy binned.  If both evolver and codeml are\r
 215 correct, we should expect this proportion to be between 0.55 and 0.60,\r
 216 and in this case the match seems to be good enough.  There are 65422\r
 217 sites with posterior probability exceeding 0.55 according to codeml,\r
 218 and 93.2% of them are truly under positive selection.  This proportion\r
 219 is in the column "AccuracyCum", accuracy accumulated.  There is no\r
 220 theory to predict what this proportion should equal to except that it\r
 221 should exceed 0.55.  The last column "Power" is the power of the\r
 222 codeml analysis.  At the 55% cutoff, 88.2% of the 69098 sites truly\r
 223 under positive selection were picked up by codeml.  Another measure,\r
 224 used by Wong et al. (2004), is the false positive rate.  This is\r
 225 defined as the number of sites falsely identified to be under posiitve\r
 226 selection by codeml among all sites not under positive selection.\r
 227 \r
 228  P              AccuracyBin      Pcut     AccuracyCum     Power\r
 229 \r
 230  0.50 - 0.55:   0.519 ( 1863)    >0.50:   0.920 (67285)   0.896 (69098)\r
 231  0.55 - 0.60:   0.551 ( 1645)    >0.55:   0.932 (65422)   0.882 (69098)\r
 232  0.60 - 0.65:   0.611 ( 1742)    >0.60:   0.941 (63777)   0.869 (69098)\r
 233  0.65 - 0.70:   0.660 ( 1757)    >0.65:   0.951 (62035)   0.854 (69098)\r
 234  0.70 - 0.75:   0.713 ( 1813)    >0.70:   0.959 (60278)   0.837 (69098)\r
 235  0.75 - 0.80:   0.775 ( 1995)    >0.75:   0.967 (58465)   0.818 (69098)\r
 236  0.80 - 0.85:   0.821 ( 2466)    >0.80:   0.974 (56470)   0.796 (69098)\r
 237  0.85 - 0.90:   0.876 ( 3057)    >0.85:   0.981 (54004)   0.766 (69098)\r
 238  0.90 - 0.95:   0.925 ( 4785)    >0.90:   0.987 (50947)   0.728 (69098)\r
 239  0.95 - 0.99:   0.975 (10202)    >0.95:   0.993 (46162)   0.664 (69098)\r
 240  0.99 - 1.00:   0.998 (35960)    >0.99:   0.998 (35960)   0.520 (69098)\r
 241 \r
 242 A test like this only confirms that the programs are likely to be\r
 243 correctly coded.  It does not tell much about the performance of the\r
 244 codeml analysis of real data sets.  In real data analysis, the model\r
 245 used by codeml might be wrong and the true values of parameters\r
 246 are unknown.\r
 247 \r
 248 A similar test can be conducted to confirm the BEB calculation.  In\r
 249 this case the data should be simulated by drawing parameter values\r
 250 (such as the proportions and omega ratios) from the prior.  Yang,\r
 251 Wong, Nielsen (2005: fig. 2) inlcuded an example of this test, which\r
 252 is also used to illustrate the Bayesian and frequentist measures of\r
 253 performance.\r
 254 \r
 255 \r
 256 References\r
 257 \r
 258 Anisimova, M., J. P. Bielawski, and Z. Yang. 2002. Accuracy and power\r
 259 of Bayes prediction of amino acid sites under positive\r
 260 selection. Molecular Biology and Evolution 19:950-958.\r
 261 \r
 262 Nielsen, R., and Z. Yang. 1998. Likelihood models for detecting\r
 263 positively selected amino acid sites and applications to the HIV-1\r
 264 envelope gene. Genetics 148:929-936.\r
 265 \r
 266 Wong, W. S. W., Z. Yang, N. Goldman, and R. Nielsen. 2004. Accuracy\r
 267 and power of statistical methods for detecting adaptive evolution in\r
 268 protein coding sequences and for identifying positively selected\r
 269 sites.  Genetics 168:1041-1051.\r
 270 \r
 271 Yang, Z. 1998. Likelihood ratio tests for detecting positive selection\r
 272 and application to primate lysozyme evolution. Molecular Biology and\r
 273 Evolution 15:568-573.\r
 274 \r
 275 Yang, Z., and R. Nielsen. 2002. Codon-substitution models for\r
 276 detecting molecular adaptation at individual sites along specific\r
 277 lineages.  Mol. Biol. Evol. 19:908-917.\r
 278 \r
 279 Yang, Z., R. Nielsen, N. Goldman, and A.-M. K. Pedersen. 2000. \r
 280 Codon-substitution models for heterogeneous selection pressure at amino acid \r
 281 sites. Genetics 155:431-449.\r
 282 \r
 283 Yang, Z., W. S. W. Wong, and R. Nielsen.  2005.  Bayes empirical Bayes\r
 284 inference of amino acid sites under positive selection.\r
 285 Mol. Biol. Evol. 22:1107-1118.\r
 286 \r
 287 Zhang, J. 2004. Frequent false detection of positive selection by the\r
 288 likelihood method with branch-site models.  Mol. Biol. Evol. 21:1332-1339.\r
 289 \r
 290 Zhang, J., R. Nielsen, and Z. Yang.  2005.  Evaluation of an improved\r
 291 branch-site likelihood method for detecting positive selection at the\r
 292 molecular level.  Mol. Biol. Evol. 22:2472-2479.\r