Fixed a bug in perl scripts for printing error messages

diff --git a/README.md b/README.md

index a4c5ce504a0f284bb594e9a57807a17969a4fc51..8d3c15f80f90ed0f52755582af952ebd971b7da4 100644 (file)
--- a/README.md
+++ b/README.md
@@ -14,6 +14,7 @@ Table of Contents
  * [Example](#example)
  * [Simulation](#simulation)
  * [Generate Transcript-to-Gene-Map from Trinity Output](#gen_trinity)
+* [Differential Expression Analysis](#de)
  * [Acknowledgements](#acknowledgements)
  * [License](#license)
  
@@ -22,15 +23,21 @@ Table of Contents
  ## <a name="introduction"></a> Introduction
  
  RSEM is a software package for estimating gene and isoform expression
-levels from RNA-Seq data.  The new RSEM package (rsem-1.x) provides an
-user-friendly interface, supports threads for parallel computation of
-the EM algorithm, single-end and paired-end read data, quality scores,
-variable-length reads and RSPD estimation. It can also generate
-genomic-coordinate BAM files and UCSC wiggle files for
-visualization. In addition, it provides posterior mean and 95%
-credibility interval estimates for expression levels. For
-visualization, it can also generate transcript-coordinate BAM files
-and visualize them and also models learned.
+levels from RNA-Seq data. The RSEM package provides a user-friendly
+interface, supports threads for parallel computation of the EM
+algorithm, single-end and paired-end read data, quality scores,
+variable-length reads and RSPD estimation. In addition, it provides
+posterior mean and 95% credibility interval estimates for expression
+levels. For visualization, it can generate BAM and Wiggle files in
+both transcript and genomic coordinates. Genomic-coordinate
+files can be visualized by both the UCSC Genome Browser and the Broad
+Institute's Integrative Genomics Viewer (IGV). Transcript-coordinate
+files can be visualized by IGV. RSEM also has its own scripts to
+generate transcript read depth plots in PDF format. A unique feature
+of RSEM is that the read depth plots can be stacked, with read depth
+contributed by unique reads shown in black and read depth contributed
+by multi-reads shown in red. In addition, models learned from data can
+also be visualized. Last but not least, RSEM contains a simulator.
  
  ## <a name="compilation"></a> Compilation & Installation
  
@@ -84,8 +91,8 @@ documentation page](http://deweylab.biostat.wisc.edu/rsem/rsem-calculate-express
  #### Calculating expression values from single-end data
  
  For single-end models, users have the option of providing a fragment
-length distribution via the --fragment-length-mean and
---fragment-length-sd options.  The specification of an accurate fragment
+length distribution via the '--fragment-length-mean' and
+'--fragment-length-sd' options.  The specification of an accurate fragment
  length distribution is important for the accuracy of expression level
  estimates from single-end data.  If the fragment length mean and sd are
  not provided, RSEM will not take a fragment length distribution into
@@ -96,12 +103,30 @@ consideration.
  By default, RSEM automates the alignment of reads to reference
  transcripts using the Bowtie alignment program.  To use an alternative
  alignment program, align the input reads against the file
-'reference_name.idx.fa' generated by rsem-prepare-reference, and format
-the alignment output in SAM or BAM format.  Then, instead of providing
-reads to rsem-calculate-expression, specify the --sam or --bam option
-and provide the SAM or BAM file as an argument.  When using an
-alternative aligner, you may also want to provide the --no-bowtie option
-to rsem-prepare-reference so that the Bowtie indices are not built.
+'reference_name.idx.fa' generated by 'rsem-prepare-reference', and
+format the alignment output in SAM or BAM format.  Then, instead of
+providing reads to 'rsem-calculate-expression', specify the '--sam' or
+'--bam' option and provide the SAM or BAM file as an argument.  When
+using an alternative aligner, you may also want to provide the
+'--no-bowtie' option to 'rsem-prepare-reference' so that the Bowtie
+indices are not built.
+
+RSEM requires all alignments of the same read to be grouped together.
+For paired-end reads, RSEM also requires the two mates of any alignment
+to be adjacent. To check whether your SAM/BAM file satisfies these
+requirements, please run
+
+    rsem-sam-validator <input.sam/input.bam>
+
+If your file does not satisfy the requirements, you can use
+'convert-sam-for-rsem' to convert it into a BAM file which RSEM can
+process. Please run
+ 
+    convert-sam-for-rsem --help
+
+to get usage information or visit the [convert-sam-for-rsem
+documentation
+page](http://deweylab.biostat.wisc.edu/rsem/convert-sam-for-rsem.html).
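
The ordering requirements described above can be illustrated with a small Python sketch. This is not the logic of 'rsem-sam-validator' itself; it is a minimal check under stated assumptions: the input is uncompressed SAM text, each read name must occupy one contiguous run of lines, and for paired-end records mate 1 is assumed to come immediately before mate 2 (the mate order is an assumption of this sketch, not something the text above states). The function names are hypothetical.

```python
def parse_sam_records(lines):
    """Yield (read_name, flag) for each alignment line of SAM text."""
    for line in lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header and empty lines
        fields = line.rstrip("\n").split("\t")
        yield fields[0], int(fields[1])


def is_grouped(records):
    """Return True if alignments of each read are contiguous and mates adjacent."""
    seen = set()          # read names whose run has already started
    prev = None           # read name of the previous record
    waiting_for_mate2 = False
    for qname, flag in records:
        if qname != prev:
            if waiting_for_mate2:
                return False  # a mate 1 was not followed by its mate 2
            if qname in seen:
                return False  # this read's alignments are split into two runs
            seen.add(qname)
            prev = qname
        if flag & 0x1:  # SAM flag bit 0x1: template has multiple segments
            is_mate1 = bool(flag & 0x40)  # bit 0x40: first segment
            if waiting_for_mate2:
                if is_mate1:
                    return False
                waiting_for_mate2 = False
            else:
                if not is_mate1:
                    return False  # assumption: mate 1 is listed first
                waiting_for_mate2 = True
    return not waiting_for_mate2


if __name__ == "__main__":
    sam = ["@HD\tVN:1.0", "r1\t99", "r1\t147", "r2\t99", "r2\t147"]
    print(is_grouped(parse_sam_records(sam)))
```

A file that fails a check of this kind is the sort of input that 'convert-sam-for-rsem' is meant to reorder.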
  
  However, please note that RSEM does ** not ** support gapped
  alignments. So make sure that your aligner does not produce alignments
@@ -123,25 +148,41 @@ unsorted BAM file, 'sample_name.genome.sorted.bam' and
  generated by the samtools included. All these files are in genomic
  coordinates.
  
-#### a) Generating a UCSC Wiggle file
+#### a) Generating a Wiggle file
  
  A wiggle plot representing the expected number of reads overlapping
-each position in the genome can be generated from the sorted genome
-BAM file output.  To generate the wiggle plot, run the 'rsem-bam2wig'
-program on the 'sample_name.genome.sorted.bam' file.
+each position in the genome/transcript set can be generated from the
+sorted genome/transcript BAM file output.  To generate the wiggle
+plot, run the 'rsem-bam2wig' program on the
+'sample_name.genome.sorted.bam'/'sample_name.transcript.sorted.bam' file.
  
  Usage:    
  
-    rsem-bam2wig bam_input wig_output wiggle_name
+    rsem-bam2wig sorted_bam_input wig_output wiggle_name [--no-fractional-weight]
+
+sorted_bam_input        : Input BAM format file, must be sorted  
+wig_output              : Output wiggle file's name, e.g. output.wig  
+wiggle_name             : the name of this wiggle plot  
+--no-fractional-weight  : If this is set, RSEM will not look for the "ZW" tag, and each alignment appearing in the BAM file is given weight 1. Set this if your BAM file was not generated by RSEM. Please note that this option must be at the end of the command line.
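
To make the effect of '--no-fractional-weight' concrete, here is a minimal Python sketch of a read depth calculation. It is an illustration only, not the 'rsem-bam2wig' implementation: alignments are represented as hypothetical `(start, length, weight)` tuples with 0-based starts, and `weight` stands in for the value RSEM stores in the "ZW" tag.

```python
def read_depth(alignments, ref_length, use_fractional_weight=True):
    """Sum per-position depth over alignments given as (start, length, weight)."""
    depth = [0.0] * ref_length
    for start, length, weight in alignments:
        # Without fractional weights, every alignment counts as one full read.
        w = weight if use_fractional_weight else 1.0
        for pos in range(start, min(start + length, ref_length)):
            depth[pos] += w
    return depth


if __name__ == "__main__":
    # One multi-read alignment with weight 0.25 and one unique alignment.
    alns = [(0, 3, 0.25), (1, 3, 1.0)]
    print(read_depth(alns, 5))
    print(read_depth(alns, 5, use_fractional_weight=False))
```

With fractional weights the multi-read contributes only its expected share to the depth; with weight 1 it is counted in full at every location it aligns to.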
+
+#### b) Loading a BAM and/or Wiggle file into the UCSC Genome Browser or Integrative Genomics Viewer (IGV)
+
+For the UCSC Genome Browser, please refer to the [UCSC custom track help page](http://genome.ucsc.edu/goldenPath/help/customTrack.html).
+
+For the Integrative Genomics Viewer, please refer to the [IGV home page](http://www.broadinstitute.org/software/igv/home). Note: Although IGV can generate a read depth plot from a given BAM file, it does not recognize the "ZW" tag that RSEM adds. IGV therefore counts each alignment with weight 1 instead of its expected weight in the plot it generates. For this reason, we recommend using the wiggle file generated by RSEM for read depth visualization.
+
+Here is some guidance for visualizing transcript-coordinate files using IGV:
  
-bam_input: sorted bam file   
-wig_output: output file name, e.g. output.wig   
-wiggle_name: the name the user wants to use for this wiggle plot  
+1) Import the transcript sequences as a genome 
  
-#### b) Loading a BAM and/or Wiggle file into the UCSC Genome Browser
+Select File -> Import Genome, then fill in ID, Name and Fasta file. The Fasta file should be 'reference_name.transcripts.fa'. After that, click the Save button. If the ID is set to 'reference_name', a file called 'reference_name.genome' will be generated. Next time, you can use File -> Load Genome and select 'reference_name.genome'.
  
-Refer to the [UCSC custom track help page](http://genome.ucsc.edu/goldenPath/help/customTrack.html).
+2) Load visualization files
  
+Select File -> Load from File, then choose a transcript-coordinate visualization file generated by RSEM. IGV might require you to convert the wiggle file to a tdf file. You can use igvtools to perform this task. One way to perform the conversion is to use the following command:
+
+    igvtools tile reference_name.transcript.wig reference_name.transcript.tdf reference_name.genome   
+ 
  #### c) Generating Transcript Wiggle Plots
  
  To generate transcript wiggle plots, you should run the
@@ -240,12 +281,85 @@ For Trinity users, RSEM provides a perl script to generate transcript-to-gene-ma
  
  trinity_fasta_file: the fasta file produced by trinity, which contains all transcripts assembled.    
  map_file: transcript-to-gene-map file's name.    
+
+## <a name="de"></a> Differential Expression Analysis
+
+Popular differential expression (DE) analysis tools such as edgeR and
+DESeq do not take variance due to read mapping uncertainty into
+consideration. Because read mapping ambiguity is prevalent among
+isoforms and de novo assembled transcripts, these tools are not ideal
+for DE detection in such conditions. 
+
+**EBSeq**, an empirical Bayesian DE
+analysis tool developed at UW-Madison, can take variance due to read
+mapping ambiguity into consideration by grouping isoforms according to
+the number of isoforms of their parent gene. In addition, it is more
+robust to outliers. RSEM includes the newest version of EBSeq in the
+folder named 'EBSeq'.
+
+For more information about EBSeq (including the paper describing its
+method), please visit the <a
+href="http://www.biostat.wisc.edu/~ningleng/EBSeq_Package">EBSeq
+website</a>. You can also find a local copy of the vignette under
+'EBSeq/inst/doc/EBSeq_Vignette.pdf'.
+
+EBSeq requires gene-isoform relationships for its isoform DE
+detection. However, for a de novo assembled transcriptome, it is hard
+to obtain accurate gene-isoform relationships. Instead, RSEM provides a
+script, 'rsem-generate-ngvector', which clusters isoforms based on
+measures directly relating to read mapping ambiguity. First, it
+calculates the 'unmappability' of each transcript. The 'unmappability'
+of a transcript is the ratio between the number of k-mers with at
+least one perfect match to other transcripts and the total number of
+k-mers of this transcript, where k is a parameter. Then, the Ng vector
+is generated by applying the k-means algorithm to the 'unmappability'
+values with the number of clusters set to 3. This program makes sure
+the mean 'unmappability' scores of the clusters are in ascending order.
+All transcripts whose lengths are less than k are assigned to cluster
+3. Run
+
+    rsem-generate-ngvector --help
+
+to get usage information or visit the [rsem-generate-ngvector
+documentation
+page](http://deweylab.biostat.wisc.edu/rsem/rsem-generate-ngvector.html).
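
The 'unmappability' definition and clustering step described above can be sketched in Python as follows. This is a toy illustration under several assumptions, not the code of 'rsem-generate-ngvector': transcripts are held in memory as a dict of name to sequence, only the forward strand is considered, and the k-means step is a simple deterministic 1-D version whose initialization is chosen here for reproducibility. All function names are hypothetical.

```python
from collections import defaultdict


def unmappability(transcripts, k):
    """Return {name: ratio}, with None for transcripts shorter than k."""
    owners = defaultdict(set)  # k-mer -> names of transcripts containing it
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            owners[seq[i:i + k]].add(name)
    scores = {}
    for name, seq in transcripts.items():
        total = len(seq) - k + 1
        if total <= 0:
            scores[name] = None
            continue
        # A k-mer is "shared" if some other transcript also contains it.
        shared = sum(1 for i in range(total) if len(owners[seq[i:i + k]]) > 1)
        scores[name] = shared / total
    return scores


def kmeans_1d(values, n_clusters=3, n_iter=100):
    """Cluster 1-D values; labels are 1..n_clusters by ascending cluster mean."""
    lo, hi = min(values), max(values)
    centres = [lo + (hi - lo) * (j + 0.5) / n_clusters for j in range(n_clusters)]
    labels = []
    for _ in range(n_iter):
        labels = [min(range(n_clusters), key=lambda j: abs(v - centres[j]))
                  for v in values]
        updated = []
        for j in range(n_clusters):
            members = [v for v, lab in zip(values, labels) if lab == j]
            updated.append(sum(members) / len(members) if members else centres[j])
        if updated == centres:
            break
        centres = updated
    order = sorted(range(n_clusters), key=lambda j: centres[j])
    rank = {j: r + 1 for r, j in enumerate(order)}
    return [rank[lab] for lab in labels]


def ng_vector(transcripts, k):
    """One cluster label per transcript, in input order; short ones get 3."""
    scores = unmappability(transcripts, k)
    names = [n for n in transcripts if scores[n] is not None]
    labels = kmeans_1d([scores[n] for n in names]) if names else []
    by_name = dict(zip(names, labels))
    return [by_name.get(n, 3) for n in transcripts]


if __name__ == "__main__":
    toy = {"t1": "ACGTACGTAA", "t2": "ACGTACGTCC", "t3": "GGGGGGTTTT", "t4": "AC"}
    print(unmappability(toy, 4))
    print(ng_vector(toy, 4))
```

In this toy input, 't1' and 't2' share most of their 4-mers and so land in a high-unmappability cluster, 't3' shares none, and 't4' is shorter than k and is therefore placed in cluster 3.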
+
+If your reference is a de novo assembled transcript set, you should
+run 'rsem-generate-ngvector' first. Then load the resulting
+'output_name.ngvec' into R. For example, you can use 
+
+    NgVec <- scan(file="output_name.ngvec", what=0, sep="\n")
+
+After that, replace 'IsoNgTrun' with 'NgVec' in the second line of
+section 3.2.5 (Page 10) of EBSeq's vignette:
+
+    IsoEBres=EBTest(Data=IsoMat, NgVector=NgVec, ...)
+
+For users' convenience, RSEM also provides a script,
+'rsem-form-counts-matrix', to extract the input matrix from expression
+results:
+
+    rsem-form-counts-matrix sampleA.[genes/isoforms].results sampleB.[genes/isoforms].results ... > output_name.counts.matrix
+
+The results files must be either all gene-level results or
+all isoform-level results. You can load the matrix into R with
+
+    IsoMat <- read.table(file="output_name.counts.matrix")
+
+before running function 'EBTest'.
+
+Questions related to EBSeq should be sent to <a href="mailto:nleng@wisc.edu">Ning Leng</a>.
   
  ## <a name="acknowledgements"></a> Acknowledgements
  
  RSEM uses the [Boost C++](http://www.boost.org) and
-[samtools](http://samtools.sourceforge.net) libraries.
+[samtools](http://samtools.sourceforge.net) libraries. RSEM includes
+[EBSeq](http://www.biostat.wisc.edu/~ningleng/EBSeq_Package/) for
+differential expression analysis.
+
+We thank earonesty for contributing patches.
  
  ## <a name="license"></a> License
  
-RSEM is licensed under the [GNU General Public License v3](http://www.gnu.org/licenses/gpl-3.0.html).
+RSEM is licensed under the [GNU General Public License
+v3](http://www.gnu.org/licenses/gpl-3.0.html).