Fixed a minor bug which only affects paired-end reads for reporting how many alignmen...

diff --git a/README.md b/README.md

index bb66d5498be3014ccab93ef382d6d05536e6e047..7b9a4b83ca83a7ec87926af31cab8bf292a29e8c 100644
--- a/README.md
+++ b/README.md
@@ -14,6 +14,8 @@ Table of Contents
  * [Example](#example)
  * [Simulation](#simulation)
  * [Generate Transcript-to-Gene-Map from Trinity Output](#gen_trinity)
+* [Differential Expression Analysis](#de)
+* [Authors](#authors)
  * [Acknowledgements](#acknowledgements)
  * [License](#license)
  
@@ -49,13 +51,11 @@ variable.
  
  ### Prerequisites
  
-C++ and Perl are required to be installed. 
+C++, Perl and R are required to be installed. 
  
  To take advantage of RSEM's built-in support for the Bowtie alignment
  program, you must have [Bowtie](http://bowtie-bio.sourceforge.net) installed.
  
-If you want to plot model learned by RSEM, you should also install R. 
-
  ## <a name="usage"></a> Usage
  
  ### I. Preparing Reference Sequences
@@ -102,17 +102,26 @@ consideration.
  By default, RSEM automates the alignment of reads to reference
  transcripts using the Bowtie alignment program.  To use an alternative
  alignment program, align the input reads against the file
-'reference_name.idx.fa' generated by 'rsem-prepare-reference', and format
-the alignment output in SAM or BAM format.  Then, instead of providing
-reads to 'rsem-calculate-expression', specify the '--sam' or '--bam' option
-and provide the SAM or BAM file as an argument.  When using an
-alternative aligner, you may also want to provide the '--no-bowtie' option
-to 'rsem-prepare-reference' so that the Bowtie indices are not built.
-
-Some aligners' (other than Bowtie) output might need to be converted
-so that RSEM can use. For conversion, please run
+'reference_name.idx.fa' generated by 'rsem-prepare-reference', and
+format the alignment output in SAM or BAM format.  Then, instead of
+providing reads to 'rsem-calculate-expression', specify the '--sam' or
+'--bam' option and provide the SAM or BAM file as an argument.  When
+using an alternative aligner, you may also want to provide the
+'--no-bowtie' option to 'rsem-prepare-reference' so that the Bowtie
+indices are not built.
+
+RSEM requires all alignments of the same read to be grouped together.
+For paired-end reads, RSEM also requires the two mates of any alignment
+to be adjacent. To check whether your SAM/BAM file satisfies these
+requirements, please run
+
+    rsem-sam-validator <input.sam/input.bam>
+
+If your file does not satisfy the requirements, you can use
+'convert-sam-for-rsem' to convert it into a BAM file which RSEM can
+process. Please run
   
-   convert-sam-for-rsem --help
+    convert-sam-for-rsem --help
  
  to get usage information or visit the [convert-sam-for-rsem
  documentation
@@ -138,7 +147,27 @@ unsorted BAM file, 'sample_name.genome.sorted.bam' and
  generated by the samtools included. All these files are in genomic
  coordinates.
  
-#### a) Generating a Wiggle file
+#### a) Converting transcript BAM file into genome BAM file
+
+Normally, RSEM will do this for you via the '--output-genome-bam'
+option of 'rsem-calculate-expression'. However, if you have run
+'rsem-prepare-reference' and used 'reference_name.idx.fa' to build
+indices for your aligner, you can use 'rsem-tbam2gbam' to convert your
+transcript-coordinate BAM alignment file into a genomic-coordinate
+BAM alignment file without running the whole RSEM
+pipeline. Please note that 'rsem-prepare-reference' converts every
+'N' into 'G' by default in 'reference_name.idx.fa'. If you do not
+want this to happen, please use the '--no-ntog' option.
+
+Usage:
+
+    rsem-tbam2gbam reference_name unsorted_transcript_bam_input genome_bam_output
+
+reference_name                   : The name of reference built by 'rsem-prepare-reference'                             
+unsorted_transcript_bam_input    : This file should satisfy: 1) the alignments of the same read are grouped together, 2) for any paired-end alignment, the two mates are adjacent to each other, 3) the file is not sorted by samtools  
+genome_bam_output                : The output genomic coordinate BAM file's name
+
+#### b) Generating a Wiggle file
  
  A wiggle plot representing the expected number of reads overlapping
  each position in the genome/transcript set can be generated from the
@@ -148,19 +177,32 @@ plot, run the 'rsem-bam2wig' program on the
  
  Usage:    
  
-    rsem-bam2wig sorted_bam_input wig_output wiggle_name
+    rsem-bam2wig sorted_bam_input wig_output wiggle_name [--no-fractional-weight]
  
-sorted_bam_input: sorted bam file   
-wig_output: output file name, e.g. output.wig   
-wiggle_name: the name the user wants to use for this wiggle plot  
+sorted_bam_input        : Input BAM format file, must be sorted  
+wig_output              : Output wiggle file's name, e.g. output.wig  
+wiggle_name             : The name of this wiggle plot  
+--no-fractional-weight  : If this is set, RSEM will not look for the "ZW" tag, and each alignment in the BAM file is given weight 1. Set this if your BAM file was not generated by RSEM. Please note that this option must be at the end of the command line.
  
-#### b) Loading a BAM and/or Wiggle file into the UCSC Genome Browser or Integrative Genomics Viewer(IGV)
+#### c) Loading a BAM and/or Wiggle file into the UCSC Genome Browser or Integrative Genomics Viewer(IGV)
  
  For UCSC genome browser, please refer to the [UCSC custom track help page](http://genome.ucsc.edu/goldenPath/help/customTrack.html).
  
  For integrative genomics viewer, please refer to the [IGV home page](http://www.broadinstitute.org/software/igv/home). Note: Although IGV can generate read depth plot from the BAM file given, it cannot recognize "ZW" tag RSEM puts. Therefore IGV counts each alignment as weight 1 instead of the expected weight for the plot it generates. So we recommend to use the wiggle file generated by RSEM for read depth visualization.
  
-#### c) Generating Transcript Wiggle Plots
+Here is some guidance for visualizing transcript coordinate files using IGV:
+
+1) Import the transcript sequences as a genome 
+
+Select File -> Import Genome, then fill in ID, Name and Fasta file. The Fasta file should be 'reference_name.transcripts.fa'. After that, click the Save button. If ID is set to 'reference_name', a file called 'reference_name.genome' will be generated. Next time, you can use File -> Load Genome and select 'reference_name.genome'.
+
+2) Load visualization files
+
+Select File -> Load from File, then choose a transcript coordinate visualization file generated by RSEM. IGV might require you to convert the wiggle file to a tdf file. You should use igvtools to perform this task. One way to perform the conversion is to use the following command:
+
+    igvtools tile reference_name.transcript.wig reference_name.transcript.tdf reference_name.genome   
+ 
+#### d) Generating Transcript Wiggle Plots
  
  To generate transcript wiggle plots, you should run the
  'rsem-plot-transcript-wiggles' program.  Run 
@@ -170,7 +212,7 @@ To generate transcript wiggle plots, you should run the
  to get usage information or visit the [rsem-plot-transcript-wiggles
  documentation page](http://deweylab.biostat.wisc.edu/rsem/rsem-plot-transcript-wiggles.html).
  
-#### d) Visualize the model learned by RSEM
+#### e) Visualize the model learned by RSEM
  
  RSEM provides an R script, 'rsem-plot-model', for visulazing the model learned.
  
@@ -258,14 +300,124 @@ For Trinity users, RSEM provides a perl script to generate transcript-to-gene-ma
  
  trinity_fasta_file: the fasta file produced by trinity, which contains all transcripts assembled.    
  map_file: transcript-to-gene-map file's name.    
- 
+
+## <a name="de"></a> Differential Expression Analysis
+
+Popular differential expression (DE) analysis tools such as edgeR and
+DESeq do not take variance due to read mapping uncertainty into
+consideration. Because read mapping ambiguity is prevalent among
+isoforms and de novo assembled transcripts, these tools are not ideal
+for DE detection in such conditions. 
+
+**EBSeq**, an empirical Bayesian DE analysis tool developed at
+UW-Madison, can take variance due to read mapping ambiguity into
+consideration by grouping isoforms according to their parent gene's
+number of isoforms. In addition, it is more robust to outliers. For more
+information about EBSeq (including the paper describing its method),
+please visit the <a
+href="http://www.biostat.wisc.edu/~ningleng/EBSeq_Package">EBSeq
+website</a>.
+
+RSEM includes the newest version of EBSeq in its folder
+named 'EBSeq'. To use it, first type
+
+    make ebseq
+
+to compile the EBSeq-related code.
+
+EBSeq requires gene-isoform relationships for its isoform DE
+detection. However, for a de novo assembled transcriptome, it is hard to
+obtain an accurate gene-isoform relationship. Instead, RSEM provides a
+script, 'rsem-generate-ngvector', which clusters transcripts based on
+measures directly related to read mapping ambiguity. First, it
+calculates the 'unmappability' of each transcript. The 'unmappability'
+of a transcript is the ratio between the number of its k-mers with at
+least one perfect match to other transcripts and the total number of
+k-mers of this transcript, where k is a parameter. Then, the Ng vector is
+generated by applying the k-means algorithm to the 'unmappability' values
+with the number of clusters set to 3. This program makes sure the mean
+'unmappability' scores of the clusters are in ascending order. All
+transcripts whose lengths are less than k are assigned to cluster
+3. Run
+
+    rsem-generate-ngvector --help
+
+to get usage information or visit the [rsem-generate-ngvector
+documentation
+page](http://deweylab.biostat.wisc.edu/rsem/rsem-generate-ngvector.html).
+
+If your reference is a de novo assembled transcript set, you should
+run 'rsem-generate-ngvector' first. Then load the resulting
+'output_name.ngvec' into R. For example, you can use
+
+    NgVec <- scan(file="output_name.ngvec", what=0, sep="\n")
+
+After that, replace 'IsoNgTrun' with 'NgVec' in the second line of
+section 3.2.5 (Page 10) of EBSeq's vignette:
+
+    IsoEBres=EBTest(Data=IsoMat, NgVector=NgVec, ...)
+
+For users' convenience, RSEM also provides a script,
+'rsem-generate-data-matrix', to extract an input matrix from expression
+results:
+
+    rsem-generate-data-matrix sampleA.[genes/isoforms].results sampleB.[genes/isoforms].results ... > output_name.counts.matrix
+
+The results files are required to be either all gene level results or
+all isoform level results. You can load the matrix into R by
+
+    IsoMat <- data.matrix(read.table(file="output_name.counts.matrix"))
+
+before running function 'EBTest'.
+
+Finally, RSEM provides an R script, 'rsem-find-DE', which runs EBSeq for
+you.
+
+Usage: 
+
+    rsem-find-DE data_matrix_file [--ngvector ngvector_file] number_of_samples_in_condition_1 FDR_rate output_file
+
+This script calls EBSeq to find differentially expressed genes/transcripts in two conditions.
+
+data_matrix_file: an m by n matrix containing expected counts, where m is the number of transcripts/genes and n is the total number of samples.   
+[--ngvector ngvector_file]: optional field. 'ngvector_file' is calculated by 'rsem-generate-ngvector'. Providing this field is recommended for transcript data.   
+number_of_samples_in_condition_1: the number of samples in condition 1. A condition's samples must be adjacent. The left group of samples is defined as condition 1.   
+FDR_rate: the false discovery rate.   
+output_file: the output file. Three files will be generated: 'output_file', 'output_file.hard_threshold' and 'output_file.all'. The first file reports all DE genes/transcripts using a soft threshold (calculated by crit_func in EBSeq). The second file reports all DE genes/transcripts using a hard threshold (only report if PPEE <= fdr). The third file reports all genes/transcripts. We recommend using the first file as the DE results because it generally contains more called genes/transcripts.   
+
+The results are written as a matrix with row and column names. The row names are the differentially expressed transcripts'/genes' ids. The column names are 'PPEE', 'PPDE', 'PostFC' and 'RealFC'.
+
+PPEE: posterior probability of being equally expressed.   
+PPDE: posterior probability of being differentially expressed.   
+PostFC: posterior fold change (condition 1 over condition 2).   
+RealFC: real fold change (condition 1 over condition 2).   
+
+To get the above usage information, type 
+
+    rsem-find-DE
+
+Note: any invalid parameter setting will cause 'rsem-find-DE' to print
+usage information and halt.
+
+Questions related to EBSeq should
+be sent to <a href="mailto:nleng@wisc.edu">Ning Leng</a>.
+
+## <a name="authors"></a> Authors
+
+RSEM is developed by Bo Li, with substantial technical input from Colin Dewey.
+
  ## <a name="acknowledgements"></a> Acknowledgements
  
  RSEM uses the [Boost C++](http://www.boost.org) and
-[samtools](http://samtools.sourceforge.net) libraries.
+[samtools](http://samtools.sourceforge.net) libraries. RSEM includes
+[EBSeq](http://www.biostat.wisc.edu/~ningleng/EBSeq_Package/) for
+differential expression analysis.
  
  We thank earonesty for contributing patches.
  
+We thank Han Lin for suggesting possible fixes. 
+
  ## <a name="license"></a> License
  
-RSEM is licensed under the [GNU General Public License v3](http://www.gnu.org/licenses/gpl-3.0.html).
+RSEM is licensed under the [GNU General Public License
+v3](http://www.gnu.org/licenses/gpl-3.0.html).
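
Reviewer sketch (not part of the patch): the 'unmappability' clustering that the new Differential Expression Analysis section describes for 'rsem-generate-ngvector' can be illustrated in a few lines of Python. This is not RSEM's implementation. The function names (`unmappability`, `kmeans_1d`, `ng_vector`), the evenly spaced k-means initialization, the choice to count k-mers per position, and the toy sequences are all assumptions made for illustration only.

```python
from collections import Counter


def unmappability(transcripts, k):
    """Return {name: score}; None marks transcripts shorter than k."""
    owners = Counter()  # k-mer -> number of distinct transcripts containing it
    kmers_of = {}
    for name, seq in transcripts.items():
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        kmers_of[name] = kmers
        owners.update(set(kmers))
    scores = {}
    for name, kmers in kmers_of.items():
        if not kmers:
            scores[name] = None  # transcript is shorter than k
            continue
        # A k-mer is "shared" if at least one other transcript contains it.
        shared = sum(1 for km in kmers if owners[km] > 1)
        scores[name] = shared / len(kmers)
    return scores


def kmeans_1d(values, n_clusters=3, n_iter=100):
    """Plain Lloyd's algorithm on scalars; returns centers in ascending order."""
    lo, hi = min(values), max(values)
    # Assumption: evenly spaced starting centers over the observed range.
    centers = [lo + (hi - lo) * (i + 0.5) / n_clusters for i in range(n_clusters)]
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(n_clusters), key=lambda c: abs(v - centers[c]))
            groups[nearest].append(v)
        # Empty clusters keep their previous center.
        updated = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
        if updated == centers:
            break
        centers = updated
    return sorted(centers)


def ng_vector(transcripts, k):
    """Assign each transcript to cluster 1..3 by ascending mean unmappability."""
    scores = unmappability(transcripts, k)
    known = [s for s in scores.values() if s is not None]
    centers = kmeans_1d(known) if known else []
    ng = {}
    for name, s in scores.items():
        if s is None:
            ng[name] = 3  # transcripts shorter than k go to cluster 3
        else:
            ng[name] = 1 + min(range(len(centers)), key=lambda c: abs(s - centers[c]))
    return ng


# Hypothetical toy transcripts, chosen only to exercise each case.
toy = {"a": "AAAAAA", "b": "CCCCCC", "c": "AAAACC", "d": "GG"}
print(ng_vector(toy, 3))  # {'a': 3, 'b': 1, 'c': 2, 'd': 3}
```

With k = 3, transcript 'a' shares every k-mer with 'c' (score 1.0), 'b' shares none (0.0), 'c' shares half (0.5), and 'd' is shorter than k, so it falls into cluster 3 by rule rather than by score.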