X-Git-Url: https://git.donarmstrong.com/?a=blobdiff_plain;f=README.md;h=77c0693af045ab7daa551055dff244905bca111f;hb=97554bbac838f2ed578d81f98e421dac0669e74e;hp=b7a5204d1f904937deb5fb30c6b2ee1463bb8136;hpb=f112f29d305c73d9626562fc0b0df233beb04502;p=rsem.git

diff --git a/README.md b/README.md
index b7a5204..77c0693 100644
--- a/README.md
+++ b/README.md
@@ -13,23 +13,32 @@ Table of Contents
 * [Usage](#usage)
 * [Example](#example)
 * [Simulation](#simulation)
+* [Generate Transcript-to-Gene-Map from Trinity Output](#gen_trinity)
 * [Acknowledgements](#acknowledgements)
 * [License](#license)
 
 * * *
 
-## Introduction
+## Introduction
 
 RSEM is a software package for estimating gene and isoform expression
-levels from RNA-Seq data. The new RSEM package (rsem-1.x) provides an
-user-friendly interface, supports threads for parallel computation of
-the EM algorithm, single-end and paired-end read data, quality scores,
-variable-length reads and RSPD estimation. It can also generate
-genomic-coordinate BAM files and UCSC wiggle files for visualization. In
-addition, it provides posterior mean and 95% credibility interval
-estimates for expression levels.
-
-## Compilation & Installation
+levels from RNA-Seq data. The RSEM package provides a user-friendly
+interface, supports threads for parallel computation of the EM
+algorithm, single-end and paired-end read data, quality scores,
+variable-length reads and RSPD estimation. In addition, it provides
+posterior mean and 95% credibility interval estimates for expression
+levels. For visualization, it can generate BAM and Wiggle files in
+both transcript and genomic coordinates. Genomic-coordinate
+files can be visualized by both the UCSC Genome Browser and the Broad
+Institute's Integrative Genomics Viewer (IGV). Transcript-coordinate
+files can be visualized by IGV. RSEM also has its own scripts to
+generate transcript read depth plots in PDF format. A unique feature
+of RSEM is that the read depth plots can be stacked, with read depth
+contributed by unique reads shown in black and read depth contributed by
+multi-reads shown in red. In addition, models learned from data can
+also be visualized. Last but not least, RSEM contains a simulator.
+
+## Compilation & Installation
 
 To compile RSEM, simply run
 
@@ -40,10 +49,14 @@ variable.
 
 ### Prerequisites
 
+A C++ compiler and Perl must be installed.
+
 To take advantage of RSEM's built-in support for the Bowtie alignment
 program, you must have [Bowtie](http://bowtie-bio.sourceforge.net) installed.
 
-## Usage
+If you want to plot the model learned by RSEM, you should also install R.
+
+## Usage
 
 ### I. Preparing Reference Sequences
 
@@ -77,8 +90,8 @@ documentation page](http://deweylab.biostat.wisc.edu/rsem/rsem-calculate-express
 #### Calculating expression values from single-end data
 
 For single-end models, users have the option of providing a fragment
-length distribution via the --fragment-length-mean and
---fragment-length-sd options. The specification of an accurate fragment
+length distribution via the '--fragment-length-mean' and
+'--fragment-length-sd' options. The specification of an accurate fragment
 length distribution is important for the accuracy of expression level
 estimates from single-end data. If the fragment length mean and sd are
 not provided, RSEM will not take a fragment length distribution into
@@ -89,69 +102,138 @@ consideration.
 By default, RSEM automates the alignment of reads to reference
 transcripts using the Bowtie alignment program. To use an alternative
 alignment program, align the input reads against the file
-'reference_name.idx.fa' generated by rsem-prepare-reference, and format
+'reference_name.idx.fa' generated by 'rsem-prepare-reference', and format
 the alignment output in SAM or BAM format. Then, instead of providing
-reads to rsem-calculate-expression, specify the --sam or --bam option
+reads to 'rsem-calculate-expression', specify the '--sam' or '--bam' option
 and provide the SAM or BAM file as an argument. When using an
-alternative aligner, you may also want to provide the --no-bowtie option
-to rsem-prepare-reference so that the Bowtie indices are not built.
+alternative aligner, you may also want to provide the '--no-bowtie' option
+to 'rsem-prepare-reference' so that the Bowtie indices are not built.
+
+RSEM requires all alignments of the same read to be grouped together. For
+paired-end reads, RSEM also requires the two mates of any alignment be
+adjacent. If the alternative aligner does not satisfy the first
+requirement, you can use 'convert-sam-for-rsem' for conversion. Please run
+
+    convert-sam-for-rsem --help
+
+to get usage information or visit the [convert-sam-for-rsem
+documentation
+page](http://deweylab.biostat.wisc.edu/rsem/convert-sam-for-rsem.html).
+
+However, please note that RSEM does **not** support gapped
+alignments. So make sure that your aligner does not produce alignments
+with insertions/deletions. Also, please make sure that you use
+'reference_name.idx.fa', which is generated by RSEM, to build your
+aligner's indices.
 
 ### III. Visualization
 
-RSEM contains a version of samtools in the 'sam' subdirectory. When
-users specify the --out-bam option RSEM will produce three files:
-'sample_name.bam', the unsorted BAM file, 'sample_name.sorted.bam' and
-'sample_name.sorted.bam.bai' the sorted BAM file and indices generated
-by the samtools included.
+RSEM contains a version of samtools in the 'sam' subdirectory. RSEM
+will always produce three files: 'sample_name.transcript.bam', the
+unsorted BAM file, 'sample_name.transcript.sorted.bam' and
+'sample_name.transcript.sorted.bam.bai', the sorted BAM file and
+its index generated by the included samtools. All three files are in
+transcript coordinates. When users specify the '--output-genome-bam'
+option, RSEM will also produce three files: 'sample_name.genome.bam', the
+unsorted BAM file, 'sample_name.genome.sorted.bam' and
+'sample_name.genome.sorted.bam.bai', the sorted BAM file and its index
+generated by the included samtools. All these files are in genomic
+coordinates.
 
-#### a) Generating a UCSC Wiggle file
+#### a) Generating a Wiggle file
 
 A wiggle plot representing the expected number of reads overlapping
-each position in the genome can be generated from the sorted BAM file
-output. To generate the wiggle plot, run the 'rsem-bam2wig' program on
-the 'sample_name.sorted.bam' file.
+each position in the genome/transcript set can be generated from the
+sorted genome/transcript BAM file output. To generate the wiggle
+plot, run the 'rsem-bam2wig' program on the
+'sample_name.genome.sorted.bam'/'sample_name.transcript.sorted.bam' file.
 
 Usage:
 
-    rsem-bam2wig bam_input wig_output wiggle_name
+    rsem-bam2wig sorted_bam_input wig_output wiggle_name
 
-bam_input: sorted bam file
+sorted_bam_input: sorted bam file
 wig_output: output file name, e.g. output.wig
 wiggle_name: the name the user wants to use for this wiggle plot
 
-#### b) Loading a BAM and/or Wiggle file into the UCSC Genome Browser
+#### b) Loading a BAM and/or Wiggle file into the UCSC Genome Browser or Integrative Genomics Viewer (IGV)
+
+For the UCSC Genome Browser, please refer to the [UCSC custom track help page](http://genome.ucsc.edu/goldenPath/help/customTrack.html).
+
+For the Integrative Genomics Viewer, please refer to the [IGV home page](http://www.broadinstitute.org/software/igv/home). Note: although IGV can generate a read depth plot from a given BAM file, it does not recognize the "ZW" tag that RSEM writes. IGV therefore counts each alignment with weight 1, rather than its expected weight, in the plot it generates. We recommend using the wiggle file generated by RSEM for read depth visualization.
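The difference between weight-1 counting and expected-weight counting described in the IGV note can be illustrated with a small Python sketch. This is not RSEM or IGV code: the `read_depth` function, the `(start, length, weight)` tuples and the example weights are all hypothetical, with `weight` standing in for the value carried by the "ZW" tag and the two alignments of the multi-read assumed to have weights that sum to 1.

```python
from collections import defaultdict


def read_depth(alignments, weighted=True):
    """Return {position: depth} for (start, length, weight) alignments.

    `weight` is a stand-in for the expected weight stored in the ZW tag
    (an assumption for this sketch). With weighted=False every alignment
    counts as 1, which is the behaviour the note attributes to IGV.
    """
    depth = defaultdict(float)
    for start, length, weight in alignments:
        w = weight if weighted else 1.0
        for pos in range(start, start + length):
            depth[pos] += w
    return dict(depth)


# Hypothetical data: one multi-read with two candidate locations,
# plus one uniquely aligned read.
alignments = [
    (0, 4, 0.75),   # multi-read, first location
    (10, 4, 0.25),  # same multi-read, second location
    (2, 4, 1.0),    # unique read
]

expected = read_depth(alignments, weighted=True)
naive = read_depth(alignments, weighted=False)

print(expected[2], naive[2])    # 1.75 2.0
print(expected[10], naive[10])  # 0.25 1.0
```

In this toy case the unweighted plot overstates depth wherever a multi-read has more than one candidate location, which is why a wiggle file built from expected weights is the safer input for visualization.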
+
+#### c) Generating Transcript Wiggle Plots
+
+To generate transcript wiggle plots, you should run the
+'rsem-plot-transcript-wiggles' program. Run
+
+    rsem-plot-transcript-wiggles --help
+
+to get usage information or visit the [rsem-plot-transcript-wiggles
+documentation page](http://deweylab.biostat.wisc.edu/rsem/rsem-plot-transcript-wiggles.html).
+
+#### d) Visualize the model learned by RSEM
 
-Refer to the [UCSC custom track help page](http://genome.ucsc.edu/goldenPath/help/customTrack.html).
+RSEM provides an R script, 'rsem-plot-model', for visualizing the model learned.
 
-## Example
+Usage:
+
+    rsem-plot-model sample_name output_plot_file
 
-Suppose we download the mouse genome from UCSC Genome Browser. We will
-use a reference_name of 'mm9'. We have a FASTQ-formatted file,
-'mmliver.fq', containing single-end reads from one sample, which we call
-'mmliver_single_quals'. We want to estimate expression values by using
-the single-end model with a fragment length distribution. We know that
-the fragment length distribution is approximated by a normal
-distribution with a mean of 150 and a standard deviation of 35. We wish
-to generate 95% credibility intervals in addition to maximum likelihood
-estimates. RSEM will be allowed 1G of memory for the credibility
-interval calculation. We will visualize the probabilistic read mappings
-generated by RSEM.
+sample_name: the name of the sample analyzed
+output_plot_file: the file name for plots generated from the model. It is a PDF file
+
+The plots generated depend on read type and user configuration. They
+may include fragment length distribution, mate length distribution,
+read start position distribution (RSPD), quality score vs observed
+quality given a reference base, position vs percentage of sequencing
+error given a reference base and histogram of reads with different
+number of alignments.
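The quality-score plots in this list are on the Phred scale, which relates a quality score Q to a per-base error probability P by Q = -10 log10(P). A minimal Python sketch of that conversion follows; the function names are illustrative and are not part of RSEM.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability P."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to a Phred score Q."""
    return -10 * math.log10(p)


# Q = 20 corresponds to a 1% chance that the base call is wrong,
# and a 0.1% error rate corresponds to Q = 30.
print(phred_to_error_prob(20))
print(error_prob_to_phred(0.001))
```

Read this way, a point below the diagonal in the quality score vs. observed quality plot would mean that the error rate learned from the data is higher than the one the reported score implies.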
+
+Fragment length distribution and mate length distribution: x-axis is fragment/mate length, y-axis is the probability of generating a fragment/mate with the associated length
+
+RSPD: Read Start Position Distribution. x-axis is bin number, y-axis is the probability of each bin. RSPD can be used as an indicator of 3' bias
+
+Quality score vs. observed quality given a reference base: x-axis is the Phred quality score associated with the data, y-axis is the "observed quality", the Phred quality score learned by RSEM from the data. Q = -10log_10(P), where Q is the Phred quality score and P is the probability of sequencing error for a particular base
+
+Position vs. percentage sequencing error given a reference base: x-axis is position and y-axis is percentage sequencing error
+
+Histogram of reads with different number of alignments: x-axis is the number of alignments a read has and y-axis is the number of such reads. The 'inf' bin on the x-axis counts the reads filtered out because they have too many alignments
+
+## Example
+
+Suppose we download the mouse genome from the UCSC Genome Browser. We
+will use a reference_name of 'mm9'. We have a FASTQ-formatted file,
+'mmliver.fq', containing single-end reads from one sample, which we
+call 'mmliver_single_quals'. We want to estimate expression values by
+using the single-end model with a fragment length distribution. We
+know that the fragment length distribution is approximated by a normal
+distribution with a mean of 150 and a standard deviation of 35. We
+wish to generate 95% credibility intervals in addition to maximum
+likelihood estimates. RSEM will be allowed 1G of memory for the
+credibility interval calculation. We will visualize the probabilistic
+read mappings generated by RSEM on the UCSC Genome Browser. We will
+generate transcript wiggle plots for a list of genes in 'output.pdf'. The
+list is 'gene_ids.txt'. We will visualize the models learned in
+'mmliver_single_quals.models.pdf'.
 
 The commands for this scenario are as follows:
 
     rsem-prepare-reference --gtf mm9.gtf --mapping knownIsoforms.txt --bowtie-path /sw/bowtie /data/mm9 /ref/mm9
-    rsem-calculate-expression --bowtie-path /sw/bowtie --phred64-quals --fragment-length-mean 150.0 --fragment-length-sd 35.0 -p 8 --out-bam --calc-ci --memory-allocate 1024 /data/mmliver.fq /ref/mm9 mmliver_single_quals
+    rsem-calculate-expression --bowtie-path /sw/bowtie --phred64-quals --fragment-length-mean 150.0 --fragment-length-sd 35.0 -p 8 --output-genome-bam --calc-ci --memory-allocate 1024 /data/mmliver.fq /ref/mm9 mmliver_single_quals
     rsem-bam2wig mmliver_single_quals.sorted.bam mmliver_single_quals.sorted.wig mmliver_single_quals
+    rsem-plot-transcript-wiggles --gene-list --show-unique mmliver_single_quals gene_ids.txt output.pdf
+    rsem-plot-model mmliver_single_quals mmliver_single_quals.models.pdf
 
-## Simulation
+## Simulation
 
 ### Usage:
 
     rsem-simulate-reads reference_name estimated_model_file estimated_isoform_results theta0 N output_name [-q]
 
-estimated_model_file: File containing model parameters. Generated by
+estimated_model_file: file containing model parameters. Generated by
 rsem-calculate-expression.
-estimated_isoform_results: File containing isoform expression levels.
+estimated_isoform_results: file containing isoform expression levels.
 Generated by rsem-calculate-expression.
 theta0: fraction of reads that are "noise" (not derived from a transcript).
 N: number of reads to simulate.
@@ -168,13 +250,24 @@ output_name_1.fq & output_name_2.fq if paired-end with quality score.
 
 output_name.sim.isoforms.results, output_name.sim.genes.results : Results estimated based on sample values.
 
-## Acknowledgements
+## Generate Transcript-to-Gene-Map from Trinity Output
 
-RSEM uses randomc.h and mersenne.cpp from
- for random number generation. RSEM
-also uses the [Boost C++](http://www.boost.org) and
+For Trinity users, RSEM provides a Perl script to generate a transcript-to-gene-map file from the FASTA file produced by Trinity.
+
+### Usage:
+
+    extract-transcript-to-gene-map-from-trinity trinity_fasta_file map_file
+
+trinity_fasta_file: the FASTA file produced by Trinity, which contains all transcripts assembled.
+map_file: transcript-to-gene-map file's name.
+
+## Acknowledgements
+
+RSEM uses the [Boost C++](http://www.boost.org) and
 [samtools](http://samtools.sourceforge.net) libraries.
 
-## License
+We thank earonesty for contributing patches.
+
+## License
 
 RSEM is licensed under the [GNU General Public License v3](http://www.gnu.org/licenses/gpl-3.0.html).
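As a rough illustration of what a transcript-to-gene map built from a Trinity FASTA file might look like, here is a Python sketch. It is not the Perl script shipped with RSEM. It rests on two assumptions that this README does not state: that Trinity transcript IDs have the hypothetical form `compN_cM_seqK`, with the gene being the ID minus its `_seqK` suffix, and that the map file holds one tab-separated `gene_id`, `transcript_id` pair per line.

```python
import re


def transcript_to_gene_map(fasta_lines):
    """Return (gene_id, transcript_id) pairs from FASTA header lines.

    Assumption for this sketch: the gene ID is the transcript ID with a
    trailing '_seq<number>' removed. IDs without that suffix map to
    themselves.
    """
    pairs = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split()
        if not fields:
            continue  # bare '>' with no identifier
        transcript_id = fields[0]
        gene_id = re.sub(r"_seq\d+$", "", transcript_id)
        pairs.append((gene_id, transcript_id))
    return pairs


# Hypothetical Trinity-style records.
example = [
    ">comp0_c0_seq1 len=300",
    "ACGTACGT",
    ">comp0_c0_seq2 len=250",
    "GGCCGGCC",
    ">comp1_c0_seq1 len=410",
    "TTAATTAA",
]

pairs = transcript_to_gene_map(example)
for gene_id, transcript_id in pairs:
    print(f"{gene_id}\t{transcript_id}")
```

For real data, the script named in the usage line above remains the authoritative source for both the ID convention and the output layout.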