README for RSEM =============== [Bo Li](http://pages.cs.wisc.edu/~bli) \(bli at cs dot wisc dot edu\) * * * Table of Contents ----------------- * [Introduction](#introduction) * [Compilation & Installation](#compilation) * [Usage](#usage) * [Example](#example) * [Simulation](#simulation) * [Acknowledgements](#acknowledgements) * [License](#license) * * *

Introduction

RSEM is a software package for estimating gene and isoform expression levels from RNA-Seq data. The new RSEM package (rsem-1.x) provides an user-friendly interface, supports threads for parallel computation of the EM algorithm, single-end and paired-end read data, quality scores, variable-length reads and RSPD estimation. It can also generate genomic-coordinate BAM files and UCSC wiggle files for visualization. In addition, it provides posterior mean and 95% credibility interval estimates for expression levels.

Compilation & Installation

To compile RSEM, simply run make To install, simply put the rsem directory in your environment's PATH variable. ### Prerequisites To take advantage of RSEM's built-in support for the Bowtie alignment program, you must have [Bowtie](http://bowtie-bio.sourceforge.net) installed.

Usage

### I. Preparing Reference Sequences RSEM can extract reference transcripts from a genome if you provide it with gene annotations in a GTF file. Alternatively, you can provide RSEM with transcript sequences directly. Please note that GTF files generated from the UCSC Table Browser do not contain isoform-gene relationship information. However, if you use the UCSC Genes annotation track, this information can be recovered by downloading the knownIsoforms.txt file for the appropriate genome. To prepare the reference sequences, you should run the 'rsem-prepare-reference' program. Run rsem-prepare-reference --help to get usage information or visit the [rsem-prepare-reference documentation page](rsem-prepare-reference.html). ### II. Calculating Expression Values To prepare the reference sequences, you should run the 'rsem-calculate-expression' program. Run rsem-calculate-expression --help to get usage information or visit the [rsem-calculate-expression documentation page](rsem-calculate-expression.html). Note: RSEM no longer provides nu values. Instead, RSEM provides nrf(normalized read fraction), which is a normalized version of theta vector excluding theta_0. #### Calculating expression values from single-end data For single-end models, users have the option of providing a fragment length distribution via the --fragment-length-mean and --fragment-length-sd options. The specification of an accurate fragment length distribution is important for the accuracy of expression level estimates from single-end data. If the fragment length mean and sd are not provided, RSEM will not take a fragment length distribution into consideration. #### Using an alternative aligner By default, RSEM automates the alignment of reads to reference transcripts using the Bowtie alignment program. To use an alternative alignment program, align the input reads against the file 'reference_name.idx.fa' generated by rsem-prepare-reference, and format the alignment output in SAM or BAM format. Then, instead of providing reads to rsem-calculate-expression, specify the --sam or --bam option and provide the SAM or BAM file as an argument. When using an alternative aligner, you may also want to provide the --no-bowtie option to rsem-prepare-reference so that the Bowtie indices are not built. ### III. Visualization RSEM contains a version of samtools in the 'sam' subdirectory. When users specify the --out-bam option RSEM will produce three files: 'sample_name.bam', the unsorted BAM file, 'sample_name.sorted.bam' and 'sample_name.sorted.bam.bai' the sorted BAM file and indices generated by the samtools included. #### a) Generating a UCSC Wiggle file A wiggle plot representing the expected number of reads overlapping each position in the genome can be generated from the sorted BAM file output. To generate the wiggle plot, run the 'rsem-bam2wig' program on the 'sample_name.sorted.bam' file. Usage: rsem-bam2wig bam_input wig_output wiggle_name bam_input: sorted bam file wig_output: output file name, e.g. output.wig wiggle_name: the name the user wants to use for this wiggle plot #### b) Loading a BAM and/or Wiggle file into the UCSC Genome Browser Refer to the [UCSC custom track help page](http://genome.ucsc.edu/goldenPath/help/customTrack.html).

Example

Suppose we download the mouse genome from UCSC Genome Browser. We will use a reference_name of 'mm9'. We have a FASTQ-formatted file, 'mmliver.fq', containing single-end reads from one sample, which we call 'mmliver_single_quals'. We want to estimate expression values by using the single-end model with a fragment length distribution. We know that the fragment length distribution is approximated by a normal distribution with a mean of 150 and a standard deviation of 35. We wish to generate 95% credibility intervals in addition to maximum likelihood estimates. RSEM will be allowed 1G of memory for the credibility interval calculation. We will visualize the probabilistic read mappings generated by RSEM. The commands for this scenario are as follows: rsem-prepare-reference --gtf mm9.gtf --mapping knownIsoforms.txt --bowtie-path /sw/bowtie /data/mm9 /ref/mm9 rsem-calculate-expression --bowtie-path /sw/bowtie --phred64-quals --fragment-length-mean 150.0 --fragment-length-sd 35.0 -p 8 --out-bam --calc-ci --memory-allocate 1024 /data/mmliver.fq /ref/mm9 mmliver_single_quals rsem-bam2wig mmliver_single_quals.sorted.bam mmliver_single_quals.sorted.wig mmliver_single_quals

Simulation

### Usage: rsem-simulate-reads reference_name estimated_model_file estimated_isoform_results theta0 N output_name [-q] estimated_model_file: File containing model parameters. Generated by rsem-calculate-expression. estimated_isoform_results: File containing isoform expression levels. Generated by rsem-calculate-expression. theta0: fraction of reads that are "noise" (not derived from a transcript). N: number of reads to simulate. output_name: prefix for all output files. [-q] : set it will stop outputting intermediate information. ### Outputs: output_name.fa if single-end without quality score; output_name.fq if single-end with quality score; output_name_1.fa & output_name_2.fa if paired-end without quality score; output_name_1.fq & output_name_2.fq if paired-end with quality score. output_name.sim.isoforms.results, output_name.sim.genes.results : Results estimated based on sample values.

Acknowledgements

RSEM uses randomc.h and mersenne.cpp from for random number generation. RSEM also uses the [Boost C++](http://www.boost.org) and [samtools](http://samtools.sourceforge.net) libraries.

License

RSEM is licensed under the [GNU General Public License v3](http://www.gnu.org/licenses/gpl-3.0.html).