Updated README.md and WHAT_IS_NEW

diff --git a/README.md b/README.md
index 5c6943bd42f32d5631228267cd19aa1cffd2d2ec..4184f5d44caf4d94ebb8c5445f25d531517931d8 100644
--- a/README.md
+++ b/README.md
@@ -46,15 +46,29 @@ To compile RSEM, simply run
     
      make
  
+Cygwin users: please uncomment the 3rd and 7th lines in
+'sam/Makefile' before running 'make'.
+
+To compile EBSeq, which is included in the RSEM package, run
+
+    make ebseq
+
  To install, simply put the rsem directory in your environment's PATH
  variable.
  
+If you prefer to put all RSEM executables in a bin directory, please
+also remember to put 'rsem_perl_utils.pm' and 'WHAT_IS_NEW' in the
+same bin directory. 'rsem_perl_utils.pm' is required by most of RSEM's
+Perl scripts and 'WHAT_IS_NEW' contains the RSEM version information.
+
  ### Prerequisites
  
  C++, Perl and R are required to be installed. 
  
-To take advantage of RSEM's built-in support for the Bowtie alignment
-program, you must have [Bowtie](http://bowtie-bio.sourceforge.net) installed.
+To take advantage of RSEM's built-in support for the Bowtie/Bowtie 2
+alignment programs, you must have
+[Bowtie](http://bowtie-bio.sourceforge.net) and/or [Bowtie
+2](http://bowtie-bio.sourceforge.net/bowtie2) installed.
  
  ## <a name="usage"></a> Usage
  
@@ -100,7 +114,13 @@ consideration.
  #### Using an alternative aligner
  
  By default, RSEM automates the alignment of reads to reference
-transcripts using the Bowtie alignment program.  To use an alternative
+transcripts using the Bowtie alignment program. Turning on '--bowtie2'
+for 'rsem-prepare-reference' and 'rsem-calculate-expression'
+allows RSEM to use the Bowtie 2 alignment program instead. Please note
+that indel alignments, local alignments and discordant alignments are
+disallowed when RSEM uses Bowtie 2 since RSEM currently cannot handle
+them. See the description of the '--bowtie2' option in
+'rsem-calculate-expression' for more details. To use an alternative
  alignment program, align the input reads against the file
  'reference_name.idx.fa' generated by 'rsem-prepare-reference', and
  format the alignment output in SAM or BAM format.  Then, instead of
@@ -194,7 +214,7 @@ Here are some guidance for visualizing transcript coordinate files using IGV:
  
  1) Import the transcript sequences as a genome 
  
-Select File -> Import Genome, then fill in ID, Name and Fasta file. Fasta file should be 'reference_name.transcripts.fa'. After that, click Save button. Suppose ID is filled as 'reference_name', a file called 'reference_name.genome' will be generated. Next time, we can use: File -> Load Genome, then select 'reference_name.genome'.
+Select File -> Import Genome, then fill in ID, Name and Fasta file. Fasta file should be 'reference_name.idx.fa'. After that, click Save button. Suppose ID is filled as 'reference_name', a file called 'reference_name.genome' will be generated. Next time, we can use: File -> Load Genome, then select 'reference_name.genome'.
  
  2) Load visualization files
  
@@ -279,9 +299,9 @@ to get usage information or read the following subsections.
  
  __reference_name:__ The name of RSEM references, which should be already generated by 'rsem-prepare-reference'              
  
-__estimated_model_file:__ This file describes how the RNA-Seq reads will be sequenced given the expression levels. It determines what kind of reads will be simulated (single-end/paired-end, w/o quality score) and includes parameters for fragment length distribution, read start position distribution, sequencing error models, etc. Normally, this file should be learned from real data using 'rsem-calculate-expression'. The file can be found under the 'sample_name.stat' folder with the name of 'sample_name.model'    
+__estimated_model_file:__ This file describes how the RNA-Seq reads will be sequenced given the expression levels. It determines what kind of reads will be simulated (single-end/paired-end, w/o quality score) and includes parameters for fragment length distribution, read start position distribution, sequencing error models, etc. Normally, this file should be learned from real data using 'rsem-calculate-expression'. The file can be found under the 'sample_name.stat' folder with the name of 'sample_name.model'. 'model_file_description.txt' describes the format and meaning of this file.    
  
-__estimated_isoform_results:__ This file contains expression levels for all isoforms recorded in the reference. It can be learned using 'rsem-calculate-expression' from real data. The corresponding file users want to use is 'sample_name.isoforms.results'. If simulating from user-designed expression profile is desired, start from a learned 'sample_name.isoforms.results' file and only modify the 'TPM' column. The simulator only reads the TPM column. But keeping the file format the same is required.   
+__estimated_isoform_results:__ This file contains expression levels for all isoforms recorded in the reference. It can be learned using 'rsem-calculate-expression' from real data. The corresponding file users want to use is 'sample_name.isoforms.results'. If simulating from user-designed expression profile is desired, start from a learned 'sample_name.isoforms.results' file and only modify the 'TPM' column. The simulator only reads the TPM column. But keeping the file format the same is required. If the RSEM references are aware of allele-specific transcripts, 'sample_name.alleles.results' should be used instead.   
  
  __theta0:__ This parameter determines the fraction of reads that are coming from background "noise" (instead of from a transcript). It can also be estimated using 'rsem-calculate-expression' from real data. Users can find it as the first value of the third line of the file 'sample_name.stat/sample_name.theta'.   
  
@@ -289,11 +309,14 @@ __N:__ The total number of reads to be simulated. If 'rsem-calculate-expression'
  
  __output_name:__ Prefix for all output files.   
  
+__--seed seed:__ Set the seed for the random number generator used in simulation. The seed should be a 32-bit unsigned integer.
+
  __-q:__ Set it will stop outputting intermediate information.   
  
  ### Outputs:
  
  output_name.sim.isoforms.results, output_name.sim.genes.results: Expression levels estimated by counting where each simulated read comes from.
+output_name.sim.alleles.results: Allele-specific expression levels estimated by counting where each simulated read comes from.
  
  output_name.fa if single-end without quality score;   
  output_name.fq if single-end with quality score;   
@@ -430,7 +453,7 @@ be sent to <a href="mailto:nleng@wisc.edu">Ning Leng</a>.
  
  ## <a name="authors"></a> Authors
  
-RSEM is developed by Bo Li, with substaintial technical input from Colin Dewey.
+The RSEM algorithm is developed by Bo Li and Colin Dewey. The RSEM software is mainly implemented by Bo Li.
  
  ## <a name="acknowledgements"></a> Acknowledgements
  
@@ -439,9 +462,9 @@ RSEM uses the [Boost C++](http://www.boost.org) and
  [EBSeq](http://www.biostat.wisc.edu/~ningleng/EBSeq_Package/) for
  differential expression analysis.
  
-We thank earonesty for contributing patches.
+We thank earonesty and Dr. Samuel Arvidsson for contributing patches.
  
-We thank Han Lin for suggesting possible fixes. 
+We thank Han Lin, j.miller, Jo&euml;l Fillon, Dr. Samuel G. Younkin and Malcolm Cook for suggesting possible fixes. 
  
  ## <a name="license"></a> License