From 4a435fbee229af1dd5f76b4feb0ca71398ac8796 Mon Sep 17 00:00:00 2001 From: Bo Li Date: Sat, 10 Dec 2011 17:44:50 -0600 Subject: [PATCH] updated some description --- README.md | 21 +++++++++++++++------ SingleReadQ.h | 1 - convert_sam_for_rsem | 4 +++- rsem-calculate-expression | 22 ++++++++++++---------- rsem-prepare-reference | 10 +++++----- 5 files changed, 35 insertions(+), 23 deletions(-) diff --git a/README.md b/README.md index a4c5ce5..b46826a 100644 --- a/README.md +++ b/README.md @@ -84,8 +84,8 @@ documentation page](http://deweylab.biostat.wisc.edu/rsem/rsem-calculate-express #### Calculating expression values from single-end data For single-end models, users have the option of providing a fragment -length distribution via the --fragment-length-mean and ---fragment-length-sd options. The specification of an accurate fragment +length distribution via the '--fragment-length-mean' and +'--fragment-length-sd' options. The specification of an accurate fragment length distribution is important for the accuracy of expression level estimates from single-end data. If the fragment length mean and sd are not provided, RSEM will not take a fragment length distribution into @@ -96,12 +96,21 @@ consideration. By default, RSEM automates the alignment of reads to reference transcripts using the Bowtie alignment program. To use an alternative alignment program, align the input reads against the file -'reference_name.idx.fa' generated by rsem-prepare-reference, and format +'reference_name.idx.fa' generated by 'rsem-prepare-reference', and format the alignment output in SAM or BAM format. Then, instead of providing -reads to rsem-calculate-expression, specify the --sam or --bam option +reads to 'rsem-calculate-expression', specify the '--sam' or '--bam' option and provide the SAM or BAM file as an argument. When using an -alternative aligner, you may also want to provide the --no-bowtie option -to rsem-prepare-reference so that the Bowtie indices are not built. +alternative aligner, you may also want to provide the '--no-bowtie' option +to 'rsem-prepare-reference' so that the Bowtie indices are not built. + +Some aligners' (other than Bowtie) output might need to be converted +so that RSEM can use. For conversion, please run + + convert-sam-for-rsem --help + +to get usage information or visit the [convert-sam-for-rsem +documentation +page](http://deweylab.biostat.wisc.edu/rsem/convert-sam-for-rsem.html). However, please note that RSEM does ** not ** support gapped alignments. So make sure that your aligner does not produce alignments diff --git a/SingleReadQ.h b/SingleReadQ.h index 8e7d898..9fd3dd1 100644 --- a/SingleReadQ.h +++ b/SingleReadQ.h @@ -72,7 +72,6 @@ void SingleReadQ::calc_lq(bool hasPolyA, int seedLen) { int numA = 0, numT = 0, numAO = 0, numTO = 0; // numAO : number of A in overlap seed region int threshold_1, threshold_2; - threshold_1 = int(0.9 * len - 1.5 * sqrt(len * 1.0) + 0.5); threshold_2 = (OLEN - 1) / 2 + 1; for (int i = 0; i < len; i++) { diff --git a/convert_sam_for_rsem b/convert_sam_for_rsem index ee035f0..a6bc124 100755 --- a/convert_sam_for_rsem +++ b/convert_sam_for_rsem @@ -165,7 +165,9 @@ Show help information. This program converts the SAM file generated by user's aligner into a SAM file which RSEM can process. However, users should make sure their aligners use 'reference_name.idx.fa' generated by 'rsem-prepare-reference' as their references. In addition, their aligners should output header information and make two mates of the same alignment adjacent to each other for paired-end data. This program will output the converted file into standard output by default for the purpose of piping. By setting '-o' option, users can make the converted file written into disk. -Note: You do not need to run this script if Bowtie (not Bowtie 2) is used. +Note: You do not need to run this script if Bowtie (not Bowtie 2) is used, or the order of @SQ lines is the same as 'reference_name.idx.fa' and the alignment lines of a same read group together and the mates of the same alignment are adjacent each other for paired-end reads. + +Note: This program can only recognize SAM files. See ARGUMENTS section. =head1 EXAMPLES diff --git a/rsem-calculate-expression b/rsem-calculate-expression index cbfef53..f90ec48 100755 --- a/rsem-calculate-expression +++ b/rsem-calculate-expression @@ -558,11 +558,13 @@ Show help information. In its default mode, this program aligns input reads against a reference transcriptome with Bowtie and calculates expression values using the alignments. RSEM assumes the data are single-end reads with quality scores, unless the '--paired-end' or '--no-qualities' options are specified. Users may use an alternative aligner by specifying one of the --sam and --bam options, and providing an alignment file in the specified format. However, users should make sure that they align against the indices generated by 'rsem-prepare-reference' and the alignment file satisfies the requirements mentioned in ARGUMENTS section. -One simple way to make the alignment file (e.g. input.sam) satisfying RSEM's requirements (assuming the aligner used put mates in a paired-end read adjacent) is to use the following command: +One simple way to make the alignment file satisfying RSEM's requirements (assuming the aligner used put mates in a paired-end read adjacent) is to use 'convert-sam-for-rsem' script. This script only accept SAM format files as input. If a BAM format file is obtained, please use samtools to convert it to a SAM file first. For example, if '/ref/mouse_125' is the 'reference_name' and the SAM file is named 'input.sam', you can run the following command: - sort -k 1,1 -s input.sam > input.sorted.sam + convert-sam-for-rsem /ref/mouse_125 input.sam -o input_for_rsem.sam -The SAM/BAM format RSEM uses is v1.4. However, it is compatible with old SAM/BAM format. However, RSEM cannot recognize 0x100 in the FLAG field. In addition, RSEM requires SEQ and QUAL not be '*'. +For details, please refer to 'convert-sam-for-rsem's documentation page. + +The SAM/BAM format RSEM uses is v1.4. However, it is compatible with old SAM/BAM format. However, RSEM cannot recognize 0x100 in the FLAG field. In addition, RSEM requires SEQ and QUAL are not '*'. The user must run 'rsem-prepare-reference' with the appropriate reference before using this program. @@ -572,7 +574,7 @@ Please note that some of the default values for the Bowtie parameters are not th The temporary directory and all intermediate files will be removed when RSEM finishes unless '--keep-intermediate-files' is specified. -With the "--calc-ci" option, 95% credibility intervals and posterior mean estimates will be calculated in addition to maximum likelihood estimates. +With the '--calc-ci' option, 95% credibility intervals and posterior mean estimates will be calculated in addition to maximum likelihood estimates. =head1 OUTPUT @@ -647,7 +649,7 @@ This is a folder instead of a file. All model related statistics are stored in t =head1 EXAMPLES -Assume the path to the bowtie executables is in the user's PATH environment variable. Reference files are under '/ref' with name 'mm9'. +Assume the path to the bowtie executables is in the user's PATH environment variable. Reference files are under '/ref' with name 'mouse_125'. 1) '/data/mmliver.fq', single-end reads with quality scores. Quality scores are encoded as for 'GA pipeline version >= 1.3'. We want to use 8 threads and generate a genome BAM file: @@ -655,7 +657,7 @@ Assume the path to the bowtie executables is in the user's PATH environment vari -p 8 \ --output-genome-bam \ /data/mmliver.fq \ - /ref/mm9 \ + /ref/mouse_125 \ mmliver_single_quals 2) '/data/mmliver_1.fq' and '/data/mmliver_2.fq', paired-end reads with quality scores. Quality scores are in SANGER format. We want to use 8 threads and do not generate a genome BAM file: @@ -664,7 +666,7 @@ Assume the path to the bowtie executables is in the user's PATH environment vari --paired-end \ /data/mmliver_1.fq \ /data/mmliver_2.fq \ - /ref/mm9 \ + /ref/mouse_125 \ mmliver_paired_end_quals 3) '/data/mmliver.fa', single-end reads without quality scores. We want to use 8 threads: @@ -672,7 +674,7 @@ Assume the path to the bowtie executables is in the user's PATH environment vari rsem-calculate-expression -p 8 \ --no-qualities \ /data/mmliver.fa \ - /ref/mm9 \ + /ref/mouse_125 \ mmliver_single_without_quals 4) Data are the same as 1). We want to take a fragment length distribution into consideration. We set the fragment length mean to 150 and the standard deviation to 35. In addition to a BAM file, we also want to generate credibility intervals. We allow RSEM to use 1GB of memory for CI calculation: @@ -686,7 +688,7 @@ Assume the path to the bowtie executables is in the user's PATH environment vari --calc-ci \ --ci-memory 1024 \ /data/mmliver.fq \ - /ref/mm9 \ + /ref/mouse_125 \ mmliver_single_quals 5) '/data/mmliver_paired_end_quals.bam', paired-end reads with quality scores. We want to use 8 threads: @@ -695,7 +697,7 @@ Assume the path to the bowtie executables is in the user's PATH environment vari --bam \ -p 8 \ /data/mmliver_paired_end_quals.bam \ - /ref/mm9 \ + /ref/mouse_125 \ mmliver_paired_end_quals =cut diff --git a/rsem-prepare-reference b/rsem-prepare-reference index 072bb61..78743e9 100755 --- a/rsem-prepare-reference +++ b/rsem-prepare-reference @@ -211,7 +211,7 @@ B<'reference_name.transcripts.fa'> contains the extracted reference transcripts =head1 EXAMPLES -1) Suppose we have mouse RNA-Seq data and want to use the UCSC mm9 version of the mouse genome. We have downloaded the UCSC Genes transcript annotations in GTF format (as mm9.gtf) using the Table Browser and the knownIsoforms.txt file for mm9 from the UCSC Downloads. We also have all chromosome files for mm9 in the directory '/data/mm9'. We want to put the generated reference files under '/ref' with name 'mm9'. We'll add poly(A) tails with length 125. Please note that GTF files generated from UCSC's Table Browser do not contain isoform-gene relationship information. For the UCSC Genes annotation, this information can be obtained from the knownIsoforms.txt file. Suppose we want to build Bowtie indices and Bowtie executables are found in '/sw/bowtie'. +1) Suppose we have mouse RNA-Seq data and want to use the UCSC mm9 version of the mouse genome. We have downloaded the UCSC Genes transcript annotations in GTF format (as mm9.gtf) using the Table Browser and the knownIsoforms.txt file for mm9 from the UCSC Downloads. We also have all chromosome files for mm9 in the directory '/data/mm9'. We want to put the generated reference files under '/ref' with name 'mouse_125'. We'll add poly(A) tails with length 125. Please note that GTF files generated from UCSC's Table Browser do not contain isoform-gene relationship information. For the UCSC Genes annotation, this information can be obtained from the knownIsoforms.txt file. Suppose we want to build Bowtie indices and Bowtie executables are found in '/sw/bowtie'. There are two ways to write the command: @@ -219,7 +219,7 @@ There are two ways to write the command: --transcript-to-gene-map knownIsoforms.txt \ --bowtie-path /sw/bowtie \ /data/mm9/chr1.fa,/data/mm9/chr2.fa,...,/data/mm9/chrM.fa \ - /ref/mm9 + /ref/mouse_125 OR @@ -227,14 +227,14 @@ OR --transcript-to-gene-map knownIsoforms.txt \ --bowtie-path /sw/bowtie \ /data/mm9 \ - /ref/mm9 + /ref/mouse_125 -2) Suppose we only have transcripts from EST tags in 'mm9.fasta'. In addition, we also have isoform-gene information in 'mapping.txt'. We do not want to add any poly(A) tails. The reference_name will be set to 'mm9'. In addition, we do not want to build Bowtie indices, and will use an alternative aligner to align reads against the 'mm9.idx.fa' output file: +2) Suppose we only have transcripts from EST tags in 'mm9.fasta'. In addition, we also have isoform-gene information in 'mapping.txt'. We do not want to add any poly(A) tails. The reference_name will be set to 'mouse_0'. In addition, we do not want to build Bowtie indices, and will use an alternative aligner to align reads against the 'mouse_0.idx.fa' output file: rsem-prepare-reference --transcript-to-gene-map mapping.txt \ --no-polyA \ --no-bowtie \ mm9.fasta \ - mm9 + mouse_0 =cut -- 2.39.2