multiruns
=========

Commands for compiling the program (cl is the Microsoft Visual C++
compiler; cc is the usual Unix C compiler):

     cl -O2 -Ot -W4 multiruns.c
     cc -o multiruns -O2 multiruns.c -lm

Usage:

     multiruns <rstfile1> <rstfile2> ... <lnLColumn>

Examples for running the program (comparing three runs, with lnL in
column 19 of rst1):

     multiruns rst1.a rst1.b rst1.c 19
     multiruns a/rst1 b/rst1 c/rst1 19


March 2003, Ziheng Yang
September 2005, changed tworuns into multiruns, Ziheng Yang

This program compares the outputs from multiple separate ML runs analyzing
many data sets (using ndata) and assembles a single result file.  Because of
local peaks and convergence problems, multiple runs of the same analysis may
not generate the same results, in which case we should use the results
corresponding to the highest lnL.  Each input file holds the summary results
from one run, with one line for each data set.  The program takes one line
from each of the input files and compares two fields: the first field, which
is an index column and should be identical between the input files, and the
lnL column.  It then decides which run generated the highest lnL and copies
the line from that run into the output file, out.txt.
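
The selection step described above can be sketched in C as follows.  This
is a minimal illustration and not the code in multiruns.c: the function
names field_as_double and best_run are hypothetical, fields are assumed to
be separated by blanks or tabs, and ties are assumed to go to the earliest
run.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Parse whitespace-separated field number col (1-based) of line as a
   double.  Returns 1 on success, 0 if the field is missing or not numeric.
   (Hypothetical helper, not part of multiruns.c.) */
int field_as_double(const char *line, int col, double *value)
{
    const char *p = line;
    char *end;
    int i;

    assert(col >= 1);
    for (i = 1; ; i++) {
        p += strspn(p, " \t\r\n");      /* skip separators before field i */
        if (*p == '\0') return 0;       /* ran out of fields */
        if (i == col) break;
        p += strcspn(p, " \t\r\n");     /* skip over field i */
    }
    *value = strtod(p, &end);
    /* numeric only if something was consumed up to a separator or the end */
    return end != p && (*end == '\0' || strchr(" \t\r\n", *end) != NULL);
}

/* Return the 0-based run whose line has the highest lnL, or -1 if a line
   cannot be parsed or the index columns (field 1) disagree.
   Assumption: ties go to the earliest run. */
int best_run(const char *lines[], int nruns, int lnLcol)
{
    double index0 = 0.0, index, lnL, best = 0.0;
    int r, ibest = -1;

    for (r = 0; r < nruns; r++) {
        if (!field_as_double(lines[r], 1, &index)) return -1;
        if (!field_as_double(lines[r], lnLcol, &lnL)) return -1;
        if (r == 0) index0 = index;
        else if (index != index0) return -1;   /* records out of step */
        if (ibest < 0 || lnL > best) { best = lnL; ibest = r; }
    }
    return ibest;
}
```

For example, given the three lines for one data set whose lnL values in
column 8 are -825.758, -815.759 and -820.759, best_run(lines, 3, 8) returns
1, that is, the second run.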

This is useful when you analyze the same set of simulated replicate data
sets multiple times, using different starting values.  For example, codeml
may write a line of output in rst1 for each data set, including parameter
estimates and lnL.  You can then use this program to compare the rst1 output
files from multiple runs to generate one output file.  The program allows the
fields to be either numerical or text, but the first (index) and lnL columns
should be numerical.

A scenario is the following.  You simulate 1000 data sets and want to
analyze them under a particular model using codeml, but you are
worried that the algorithm may not converge for some data sets.  So
you run the analysis three times, using different starting values.
You can use ndata = 1000 in the control file to analyze the 1000 data
sets one by one.  codeml may choose different starting values
automatically; to confirm this, watch the initial log likelihood lnL0
printed on the screen for the first data set and check that it differs
between runs.  Each of the multiple runs generates a file called rst1,
with 1000 lines of output, like the following.


rst1.a

1       167     81      3.340   0.907   0.022   0.012   -899.000
2       143     82      2.459   0.000   0.000   0.037   -825.758
3       117     76      2.137   1.000   0.000   0.005   -622.806

rst1.b

1       167     81      3.340   0.907   0.022   0.012   -890.000
2       143     82      3.9     0.000   0.000   0.037   -815.759
3       117     76      2.137   1.000   0.000   0.005   -622.806

rst1.c

1       167     81      3.340   0.907   0.022   0.012   -890.000
2       143     82      2.459   0.000   0.000   0.037   -820.759
3       117     76      2.137   1.000   0.000   0.005   -622.806


Column 8 has the lnL, while the other columns on each line are
estimated branch lengths and kappa, etc.  You then run multiruns as
follows.

    multiruns rst1.a rst1.b rst1.c 8

The screen output will look like the following.

Usage:
        multiruns <file1> <file2> ... <lnLcolumn>

r.a  r.b  r.c    ==>  out.txt

record    1  (+++)   -899.000 (1) -   -890.000 (2) =   -9.000
record    2  (+++)   -825.758 (1) -   -815.759 (2) =   -9.999
record    3  (+++)   -622.806 (1) -   -622.806 (1) =    0.000
record    4  (+++)  -2789.741 (1) -  -2789.741 (1) =    0.000

wrote 5 records into out.txt

The output file out.txt has the following.

1       167     81      3.340   0.907   0.022   0.012   -890.000
2       143     82      3.9     0.000   0.000   0.037   -815.759
3       117     76      2.137   1.000   0.000   0.005   -622.806
4       625     228     3.595   0.891   0.079   0.035   -2789.741


A similar scenario might be that you have 2000 gene alignments from the
same set of species and want to analyze them under the same model, in
which case you can use multiruns to assemble the results after running
the same model a few times.

Below are some notes about rst1, produced by codeml (and also by baseml).
I use this file to print out results in simulations and then port the
results into Excel for plotting etc.  The output in this file is
rather volatile and is basically in the state that it happens to be
in.  You should open codeml.c (or baseml.c) in a text editor and
search for frst1 to view the printf statements that were commented
out, that is, bracketed inside /* */.  You may want to uncomment them
(that is, remove the /* */) and then recompile.  For example, the
following block in codeml.c prints out all the MLEs of parameters under
the model and the log likelihood, and then flushes the file buffer,
after the iteration is finished for each data set.  If the block is
bracketed by /* and */, you can remove the lines containing /* and */
and recompile.  The same output is printed in the main result file mlc,
so it should be easy for you to figure out what the results in rst1 mean.

for(i=0; i<com.np; i++) fprintf(frst1,"\t%.5f",x[i]);
fprintf(frst1,"\t%.3f",-lnL);
fflush(frst1);
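
To see how a block like this determines the column layout of rst1, here is
a small self-contained sketch.  It is not taken from codeml.c: the function
write_record and its arguments are hypothetical, and it assumes that an
index is written in column 1 before the parameter estimates.  It prints the
value it is given, whereas the block above passes -lnL.  The return value
is the column number that one would pass to multiruns as <lnLColumn>.

```c
#include <assert.h>
#include <stdio.h>

/* Write one rst1-style record: an index, np parameter estimates, then lnL,
   separated by tabs.  All names are hypothetical; codeml's own variables
   differ.  Returns the 1-based column number in which lnL was written. */
int write_record(FILE *fout, int index, const double x[], int np, double lnL)
{
    int i;

    assert(fout != NULL && np >= 0);
    fprintf(fout, "%d", index);                  /* column 1: index */
    for (i = 0; i < np; i++)
        fprintf(fout, "\t%.5f", x[i]);           /* columns 2 .. np+1 */
    fprintf(fout, "\t%.3f\n", lnL);              /* column np+2: lnL */
    fflush(fout);                                /* keep file current per data set */
    return np + 2;
}
```

Under these assumptions, a record with 6 parameter estimates has its lnL in
column 8.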

Ziheng Yang