+## <a name="gen_trinity"></a> Generate Transcript-to-Gene-Map from Trinity Output
+
+For Trinity users, RSEM provides a Perl script that generates a transcript-to-gene-map file from the FASTA file produced by Trinity.
+
+### Usage:
+
+ extract-transcript-to-gene-map-from-trinity trinity_fasta_file map_file
+
+trinity_fasta_file: the FASTA file produced by Trinity, which contains all assembled transcripts.
+map_file: the name of the transcript-to-gene-map file to generate.
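
The mapping can be pictured with a minimal Python sketch. It rests on assumptions that are not stated above: each Trinity FASTA header starts with a transcript ID such as `comp0_c0_seq1`, the gene ID is that ID with its final underscore-delimited suffix removed, and each map line holds a gene ID and a transcript ID separated by a tab. Trinity naming differs between versions, so the hypothetical function `trinity_gene_map` is an illustration only, not the script's actual logic.

```python
def trinity_gene_map(fasta_lines):
    """Yield (gene_id, transcript_id) pairs from Trinity-style FASTA headers.

    Assumption: the gene ID is the transcript ID minus its last
    underscore-delimited component (e.g. comp0_c0_seq1 -> comp0_c0).
    """
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        fields = line[1:].split()
        if not fields:
            continue  # skip a bare ">" that carries no ID
        transcript_id = fields[0]
        gene_id = transcript_id.rsplit("_", 1)[0]
        yield gene_id, transcript_id


# Made-up headers and sequences, for illustration only.
example = [">comp0_c0_seq1 len=310", "ACGT", ">comp0_c0_seq2 len=250", "TTGA"]
map_lines = ["\t".join(pair) for pair in trinity_gene_map(example)]
```

Both example transcripts end up under the same gene ID, `comp0_c0`, which is the grouping a transcript-to-gene map is meant to record.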
+
+## <a name="de"></a> Differential Expression Analysis
+
+Popular differential expression (DE) analysis tools such as edgeR and
+DESeq do not take variance due to read mapping uncertainty into
+consideration. Because read mapping ambiguity is prevalent among
+isoforms and de novo assembled transcripts, these tools are not ideal
+for DE detection in such conditions.
+
+**EBSeq**, an empirical Bayesian DE analysis tool developed at
+UW-Madison, can take variance due to read mapping ambiguity into
+consideration by grouping isoforms according to their parent gene's
+number of isoforms. In addition, it is more robust to outliers. For more
+information about EBSeq (including the paper describing its method),
+please visit the <a
+href="http://www.biostat.wisc.edu/~ningleng/EBSeq_Package">EBSeq
+website</a>.
+
+RSEM includes the newest version of EBSeq in its folder
+named 'EBSeq'. To use it, first type
+
+ make ebseq
+
+to compile the EBSeq-related code.
+
+EBSeq requires gene-isoform relationships for its isoform DE
+detection. However, for a de novo assembled transcriptome, it is hard to
+obtain an accurate gene-isoform relationship. Instead, RSEM provides a
+script, 'rsem-generate-ngvector', which clusters transcripts based on
+measures directly related to read mapping ambiguity. First, it
+calculates the 'unmappability' of each transcript. The 'unmappability'
+of a transcript is the ratio between the number of k-mers with at
+least one perfect match to other transcripts and the total number of
+k-mers of this transcript, where k is a parameter. Then, the Ng vector
+is generated by applying the k-means algorithm to the 'unmappability'
+values with the number of clusters set to 3. The program makes sure
+that the mean 'unmappability' scores of the clusters are in ascending
+order. All transcripts whose lengths are less than k are assigned to
+cluster 3.
+Run
+
+ rsem-generate-ngvector --help
+
+to get usage information or visit the [rsem-generate-ngvector
+documentation
+page](http://deweylab.biostat.wisc.edu/rsem/rsem-generate-ngvector.html).
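
The clustering described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code that 'rsem-generate-ngvector' runs: the function names are hypothetical, reverse-complement matches are ignored, and the k-means initialization (centers evenly spaced over the score range) is a choice made here for determinism.

```python
from collections import defaultdict


def unmappability(transcripts, k):
    """Return {name: fraction of k-mers that also occur in another transcript}.

    Transcripts shorter than k have no k-mers and are mapped to None.
    """
    owners = defaultdict(set)  # k-mer -> names of transcripts containing it
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            owners[seq[i:i + k]].add(name)

    scores = {}
    for name, seq in transcripts.items():
        total = len(seq) - k + 1
        if total <= 0:
            scores[name] = None
            continue
        shared = sum(1 for i in range(total) if len(owners[seq[i:i + k]]) > 1)
        scores[name] = shared / total
    return scores


def kmeans_1d(values, n_clusters=3, n_iter=100):
    """One-dimensional Lloyd's algorithm on a non-empty list of values.

    Returns labels 1..n_clusters, numbered in ascending order of cluster mean.
    """
    lo, hi = min(values), max(values)
    # Assumed initialization: centers evenly spaced across the value range.
    centers = [lo + (hi - lo) * j / (n_clusters - 1) for j in range(n_clusters)]
    for _ in range(n_iter):
        labels = [min(range(n_clusters), key=lambda j: abs(v - centers[j]))
                  for v in values]
        new_centers = []
        for j in range(n_clusters):
            members = [v for v, lab in zip(values, labels) if lab == j]
            # An empty cluster keeps its previous center.
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    order = sorted(range(n_clusters), key=lambda j: centers[j])
    rank = {j: r + 1 for r, j in enumerate(order)}
    return [rank[lab] for lab in labels]


def ng_vector(transcripts, k):
    """Cluster label per transcript; transcripts shorter than k get cluster 3."""
    scores = unmappability(transcripts, k)
    names = [n for n, s in scores.items() if s is not None]
    labels = dict(zip(names, kmeans_1d([scores[n] for n in names])))
    return [labels.get(n, 3) for n in transcripts]


# Toy sequences, for illustration only.
toy = {"a": "ACGTACGT", "b": "ACGTTTTT", "c": "GGG"}
toy_scores = unmappability(toy, k=4)
toy_ngvec = ng_vector(toy, k=4)
```

In the toy input, transcript `a` shares the k-mer `ACGT` with `b` at two of its five positions, so its score is 0.4, while `c` is shorter than k and is placed in cluster 3 directly.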
+
+If your reference is a de novo assembled transcript set, you should
+run 'rsem-generate-ngvector' first. Then load the resulting
+'output_name.ngvec' into R. For example, you can use
+
+ NgVec <- scan(file="output_name.ngvec", what=0, sep="\n")
+
+After that, replace 'IsoNgTrun' with 'NgVec' in the second line of
+section 3.2.5 (Page 10) of EBSeq's vignette:
+
+ IsoEBres=EBTest(Data=IsoMat, NgVector=NgVec, ...)
+
+For users' convenience, RSEM also provides a script,
+'rsem-generate-data-matrix', to extract an input matrix from expression
+results:
+
+ rsem-generate-data-matrix sampleA.[genes/isoforms].results sampleB.[genes/isoforms].results ... > output_name.counts.matrix
+
+The results files are required to be either all gene level results or
+all isoform level results. You can load the matrix into R by
+
+ IsoMat <- data.matrix(read.table(file="output_name.counts.matrix"))
+
+before running function 'EBTest'.
+
+Finally, RSEM provides an R script, 'rsem-find-DE', which runs EBSeq for
+you.
+
+Usage:
+
+ rsem-find-DE data_matrix_file [--ngvector ngvector_file] number_of_samples_in_condition_1 FDR_rate output_file
+
+This script calls EBSeq to find differentially expressed genes/transcripts in two conditions.
+
+data_matrix_file: an m by n matrix containing expected counts, where m is the number of transcripts/genes and n is the total number of samples.
+[--ngvector ngvector_file]: optional field. 'ngvector_file' is generated by 'rsem-generate-ngvector'. Providing this field is recommended for transcript data.
+number_of_samples_in_condition_1: the number of samples in condition 1. A condition's samples must be adjacent. The left group of samples is defined as condition 1.
+FDR_rate: false discovery rate.
+output_file: the output file. Three files will be generated: 'output_file', 'output_file.hard_threshold' and 'output_file.all'. The first file reports all DE genes/transcripts using a soft threshold (calculated by crit_func in EBSeq). The second file reports all DE genes/transcripts using a hard threshold (a gene/transcript is reported only if PPEE <= fdr). The third file reports all genes/transcripts. Using the first file as the DE results is recommended because it generally contains more called genes/transcripts.
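
To make the column convention concrete, here is a minimal Python sketch of how the columns of the data matrix divide into two conditions given 'number_of_samples_in_condition_1'. The function name and the labels `C1` and `C2` are hypothetical; the script itself is written in R and may represent conditions differently.

```python
def condition_labels(n_total, n_condition_1):
    """Label matrix columns left to right: condition 1 first, then condition 2."""
    if not 0 < n_condition_1 < n_total:
        raise ValueError("each condition needs at least one sample")
    return ["C1"] * n_condition_1 + ["C2"] * (n_total - n_condition_1)


# Five samples in total, the leftmost two belonging to condition 1.
labels = condition_labels(5, 2)
```

Because the split is purely positional, samples from the same condition have to occupy adjacent columns, as noted above.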
+
+The results are written as a matrix with row and column names. The row names are the differentially expressed transcripts'/genes' ids. The column names are 'PPEE', 'PPDE', 'PostFC' and 'RealFC'.
+
+PPEE: posterior probability of being equally expressed.
+PPDE: posterior probability of being differentially expressed.
+PostFC: posterior fold change (condition 1 over condition 2).
+RealFC: real fold change (condition 1 over condition 2).
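
The hard-threshold rule (report only if PPEE <= fdr) amounts to a simple filter on the PPEE column. A minimal Python sketch follows, using a hypothetical in-memory table with made-up values rather than a parsed output file, since the exact file layout is not specified here.

```python
def hard_threshold(results, fdr):
    """Keep rows whose posterior probability of equal expression is <= fdr."""
    return {name: row for name, row in results.items() if row["PPEE"] <= fdr}


# Made-up values, for illustration only.
results = {
    "gene_a": {"PPEE": 0.01, "PPDE": 0.99},
    "gene_b": {"PPEE": 0.40, "PPDE": 0.60},
}
called = hard_threshold(results, fdr=0.05)
```

Only `gene_a` passes the filter in this example. The soft threshold computed by crit_func is not reproduced here.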
+
+To get the above usage information, type
+
+ rsem-find-DE
+
+Note: any incorrect parameter setting causes 'rsem-find-DE' to print
+usage information and halt.
+
+Questions related to EBSeq should
+be sent to <a href="mailto:nleng@wisc.edu">Ning Leng</a>.
+
+## <a name="authors"></a> Authors
+
+RSEM is developed by Bo Li, with substantial technical input from Colin Dewey.
+
+## <a name="acknowledgements"></a> Acknowledgements
+
+RSEM uses the [Boost C++](http://www.boost.org) and
+[samtools](http://samtools.sourceforge.net) libraries. RSEM includes
+[EBSeq](http://www.biostat.wisc.edu/~ningleng/EBSeq_Package/) for
+differential expression analysis.
+
+We thank earonesty for contributing patches.