#### Calculating expression values from single-end data
For single-end models, users have the option of providing a fragment
-length distribution via the --fragment-length-mean and
---fragment-length-sd options. The specification of an accurate fragment
+length distribution via the '--fragment-length-mean' and
+'--fragment-length-sd' options. The specification of an accurate fragment
length distribution is important for the accuracy of expression level
estimates from single-end data. If the fragment length mean and sd are
not provided, RSEM will not take a fragment length distribution into
By default, RSEM automates the alignment of reads to reference
transcripts using the Bowtie alignment program. To use an alternative
alignment program, align the input reads against the file
-'reference_name.idx.fa' generated by rsem-prepare-reference, and format
+'reference_name.idx.fa' generated by 'rsem-prepare-reference', and format
the alignment output in SAM or BAM format. Then, instead of providing
-reads to rsem-calculate-expression, specify the --sam or --bam option
+reads to 'rsem-calculate-expression', specify the '--sam' or '--bam' option
and provide the SAM or BAM file as an argument. When using an
-alternative aligner, you may also want to provide the --no-bowtie option
-to rsem-prepare-reference so that the Bowtie indices are not built.
+alternative aligner, you may also want to provide the '--no-bowtie' option
+to 'rsem-prepare-reference' so that the Bowtie indices are not built.
+
+Some aligners' (other than Bowtie) output might need to be converted
+so that RSEM can use. For conversion, please run
+
+ convert-sam-for-rsem --help
+
+to get usage information or visit the [convert-sam-for-rsem
+documentation
+page](http://deweylab.biostat.wisc.edu/rsem/convert-sam-for-rsem.html).
However, please note that RSEM does ** not ** support gapped
alignments. So make sure that your aligner does not produce alignments
int numA = 0, numT = 0, numAO = 0, numTO = 0; // numAO : number of A in overlap seed region
int threshold_1, threshold_2;
-
threshold_1 = int(0.9 * len - 1.5 * sqrt(len * 1.0) + 0.5);
threshold_2 = (OLEN - 1) / 2 + 1;
for (int i = 0; i < len; i++) {
This program converts the SAM file generated by user's aligner into a SAM file which RSEM can process. However, users should make sure their aligners use 'reference_name.idx.fa' generated by 'rsem-prepare-reference' as their references. In addition, their aligners should output header information and make two mates of the same alignment adjacent to each other for paired-end data. This program will output the converted file into standard output by default for the purpose of piping. By setting '-o' option, users can make the converted file written into disk.
-Note: You do not need to run this script if Bowtie (not Bowtie 2) is used.
+Note: You do not need to run this script if Bowtie (not Bowtie 2) is used, or the order of @SQ lines is the same as 'reference_name.idx.fa' and the alignment lines of a same read group together and the mates of the same alignment are adjacent each other for paired-end reads.
+
+Note: This program can only recognize SAM files. See ARGUMENTS section.
=head1 EXAMPLES
In its default mode, this program aligns input reads against a reference transcriptome with Bowtie and calculates expression values using the alignments. RSEM assumes the data are single-end reads with quality scores, unless the '--paired-end' or '--no-qualities' options are specified. Users may use an alternative aligner by specifying one of the --sam and --bam options, and providing an alignment file in the specified format. However, users should make sure that they align against the indices generated by 'rsem-prepare-reference' and the alignment file satisfies the requirements mentioned in ARGUMENTS section.
-One simple way to make the alignment file (e.g. input.sam) satisfying RSEM's requirements (assuming the aligner used put mates in a paired-end read adjacent) is to use the following command:
+One simple way to make the alignment file satisfying RSEM's requirements (assuming the aligner used put mates in a paired-end read adjacent) is to use 'convert-sam-for-rsem' script. This script only accept SAM format files as input. If a BAM format file is obtained, please use samtools to convert it to a SAM file first. For example, if '/ref/mouse_125' is the 'reference_name' and the SAM file is named 'input.sam', you can run the following command:
- sort -k 1,1 -s input.sam > input.sorted.sam
+ convert-sam-for-rsem /ref/mouse_125 input.sam -o input_for_rsem.sam
-The SAM/BAM format RSEM uses is v1.4. However, it is compatible with old SAM/BAM format. However, RSEM cannot recognize 0x100 in the FLAG field. In addition, RSEM requires SEQ and QUAL not be '*'.
+For details, please refer to 'convert-sam-for-rsem's documentation page.
+
+The SAM/BAM format RSEM uses is v1.4. However, it is compatible with old SAM/BAM format. However, RSEM cannot recognize 0x100 in the FLAG field. In addition, RSEM requires SEQ and QUAL are not '*'.
The user must run 'rsem-prepare-reference' with the appropriate reference before using this program.
The temporary directory and all intermediate files will be removed when RSEM finishes unless '--keep-intermediate-files' is specified.
-With the "--calc-ci" option, 95% credibility intervals and posterior mean estimates will be calculated in addition to maximum likelihood estimates.
+With the '--calc-ci' option, 95% credibility intervals and posterior mean estimates will be calculated in addition to maximum likelihood estimates.
=head1 OUTPUT
=head1 EXAMPLES
-Assume the path to the bowtie executables is in the user's PATH environment variable. Reference files are under '/ref' with name 'mm9'.
+Assume the path to the bowtie executables is in the user's PATH environment variable. Reference files are under '/ref' with name 'mouse_125'.
1) '/data/mmliver.fq', single-end reads with quality scores. Quality scores are encoded as for 'GA pipeline version >= 1.3'. We want to use 8 threads and generate a genome BAM file:
-p 8 \
--output-genome-bam \
/data/mmliver.fq \
- /ref/mm9 \
+ /ref/mouse_125 \
mmliver_single_quals
2) '/data/mmliver_1.fq' and '/data/mmliver_2.fq', paired-end reads with quality scores. Quality scores are in SANGER format. We want to use 8 threads and do not generate a genome BAM file:
--paired-end \
/data/mmliver_1.fq \
/data/mmliver_2.fq \
- /ref/mm9 \
+ /ref/mouse_125 \
mmliver_paired_end_quals
3) '/data/mmliver.fa', single-end reads without quality scores. We want to use 8 threads:
rsem-calculate-expression -p 8 \
--no-qualities \
/data/mmliver.fa \
- /ref/mm9 \
+ /ref/mouse_125 \
mmliver_single_without_quals
4) Data are the same as 1). We want to take a fragment length distribution into consideration. We set the fragment length mean to 150 and the standard deviation to 35. In addition to a BAM file, we also want to generate credibility intervals. We allow RSEM to use 1GB of memory for CI calculation:
--calc-ci \
--ci-memory 1024 \
/data/mmliver.fq \
- /ref/mm9 \
+ /ref/mouse_125 \
mmliver_single_quals
5) '/data/mmliver_paired_end_quals.bam', paired-end reads with quality scores. We want to use 8 threads:
--bam \
-p 8 \
/data/mmliver_paired_end_quals.bam \
- /ref/mm9 \
+ /ref/mouse_125 \
mmliver_paired_end_quals
=cut
=head1 EXAMPLES
-1) Suppose we have mouse RNA-Seq data and want to use the UCSC mm9 version of the mouse genome. We have downloaded the UCSC Genes transcript annotations in GTF format (as mm9.gtf) using the Table Browser and the knownIsoforms.txt file for mm9 from the UCSC Downloads. We also have all chromosome files for mm9 in the directory '/data/mm9'. We want to put the generated reference files under '/ref' with name 'mm9'. We'll add poly(A) tails with length 125. Please note that GTF files generated from UCSC's Table Browser do not contain isoform-gene relationship information. For the UCSC Genes annotation, this information can be obtained from the knownIsoforms.txt file. Suppose we want to build Bowtie indices and Bowtie executables are found in '/sw/bowtie'.
+1) Suppose we have mouse RNA-Seq data and want to use the UCSC mm9 version of the mouse genome. We have downloaded the UCSC Genes transcript annotations in GTF format (as mm9.gtf) using the Table Browser and the knownIsoforms.txt file for mm9 from the UCSC Downloads. We also have all chromosome files for mm9 in the directory '/data/mm9'. We want to put the generated reference files under '/ref' with name 'mouse_125'. We'll add poly(A) tails with length 125. Please note that GTF files generated from UCSC's Table Browser do not contain isoform-gene relationship information. For the UCSC Genes annotation, this information can be obtained from the knownIsoforms.txt file. Suppose we want to build Bowtie indices and Bowtie executables are found in '/sw/bowtie'.
There are two ways to write the command:
--transcript-to-gene-map knownIsoforms.txt \
--bowtie-path /sw/bowtie \
/data/mm9/chr1.fa,/data/mm9/chr2.fa,...,/data/mm9/chrM.fa \
- /ref/mm9
+ /ref/mouse_125
OR
--transcript-to-gene-map knownIsoforms.txt \
--bowtie-path /sw/bowtie \
/data/mm9 \
- /ref/mm9
+ /ref/mouse_125
-2) Suppose we only have transcripts from EST tags in 'mm9.fasta'. In addition, we also have isoform-gene information in 'mapping.txt'. We do not want to add any poly(A) tails. The reference_name will be set to 'mm9'. In addition, we do not want to build Bowtie indices, and will use an alternative aligner to align reads against the 'mm9.idx.fa' output file:
+2) Suppose we only have transcripts from EST tags in 'mm9.fasta'. In addition, we also have isoform-gene information in 'mapping.txt'. We do not want to add any poly(A) tails. The reference_name will be set to 'mouse_0'. In addition, we do not want to build Bowtie indices, and will use an alternative aligner to align reads against the 'mouse_0.idx.fa' output file:
rsem-prepare-reference --transcript-to-gene-map mapping.txt \
--no-polyA \
--no-bowtie \
mm9.fasta \
- mm9
+ mouse_0
=cut