Provided a more detailed description for how to simulate RNA-Seq data using 'rsem...

author Bo Li <bli@cs.wisc.edu>

Fri, 22 Nov 2013 12:06:00 +0000 (06:06 -0600)

committer Bo Li <bli@cs.wisc.edu>

Fri, 22 Nov 2013 12:06:00 +0000 (06:06 -0600)
diff --git a/README.md b/README.md

index f729ef8136adfa47c293ff4d6d0aada39dd30416..5c6943bd42f32d5631228267cd19aa1cffd2d2ec 100644 (file)
--- a/README.md
+++ b/README.md
@@ -181,7 +181,7 @@ Usage:
  
  sorted_bam_input        : Input BAM format file, must be sorted  
  wig_output              : Output wiggle file's name, e.g. output.wig  
-wiggle_name             : the name of this wiggle plot  
+wiggle_name             : The name of this wiggle plot  
  --no-fractional-weight  : If this is set, RSEM will not look for "ZW" tag and each alignment appeared in the BAM file has weight 1. Set this if your BAM file is not generated by RSEM. Please note that this option must be at the end of the command line
  
  #### c) Loading a BAM and/or Wiggle file into the UCSC Genome Browser or Integrative Genomics Viewer(IGV)
@@ -243,7 +243,7 @@ Histogram of reads with different number of alignments: x-axis is the number of
  ## <a name="example"></a> Example
  
  Suppose we download the mouse genome from UCSC Genome Browser.  We
-will use a reference_name of 'mm9'.  We have a FASTQ-formatted file,
+will use a reference_name of 'mouse_125'.  We have a FASTQ-formatted file,
  'mmliver.fq', containing single-end reads from one sample, which we
  call 'mmliver_single_quals'.  We want to estimate expression values by
  using the single-end model with a fragment length distribution. We
@@ -259,36 +259,69 @@ list is 'gene_ids.txt'. We will visualize the models learned in
  
  The commands for this scenario are as follows:
  
-    rsem-prepare-reference --gtf mm9.gtf --mapping knownIsoforms.txt --bowtie-path /sw/bowtie /data/mm9 /ref/mm9
-    rsem-calculate-expression --bowtie-path /sw/bowtie --phred64-quals --fragment-length-mean 150.0 --fragment-length-sd 35.0 -p 8 --output-genome-bam --calc-ci --memory-allocate 1024 /data/mmliver.fq /ref/mm9 mmliver_single_quals
+    rsem-prepare-reference --gtf mm9.gtf --mapping knownIsoforms.txt --bowtie-path /sw/bowtie /data/mm9 /ref/mouse_125
+    rsem-calculate-expression --bowtie-path /sw/bowtie --phred64-quals --fragment-length-mean 150.0 --fragment-length-sd 35.0 -p 8 --output-genome-bam --calc-ci --memory-allocate 1024 /data/mmliver.fq /ref/mouse_125 mmliver_single_quals
      rsem-bam2wig mmliver_single_quals.sorted.bam mmliver_single_quals.sorted.wig mmliver_single_quals
      rsem-plot-transcript-wiggles --gene-list --show-unique mmliver_single_quals gene_ids.txt output.pdf 
      rsem-plot-model mmliver_single_quals mmliver_single_quals.models.pdf
  
  ## <a name="simulation"></a> Simulation
  
+RSEM provides users the 'rsem-simulate-reads' program to simulate RNA-Seq data based on parameters learned from real data sets. Run
+
+    rsem-simulate-reads
+
+to get usage information or read the following subsections.
+ 
  ### Usage: 
  
      rsem-simulate-reads reference_name estimated_model_file estimated_isoform_results theta0 N output_name [-q]
  
-estimated_model_file:  file containing model parameters.  Generated by
-rsem-calculate-expression.   
-estimated_isoform_results: file containing isoform expression levels.
-Generated by rsem-calculate-expression.   
-theta0: fraction of reads that are "noise" (not derived from a transcript).   
-N: number of reads to simulate.   
-output_name: prefix for all output files.   
-[-q] : set it will stop outputting intermediate information.   
+__reference_name:__ The name of RSEM references, which should be already generated by 'rsem-prepare-reference'              
+
+__estimated_model_file:__ This file describes how the RNA-Seq reads will be sequenced given the expression levels. It determines what kind of reads will be simulated (single-end/paired-end, with or without quality scores) and includes parameters for fragment length distribution, read start position distribution, sequencing error models, etc. Normally, this file should be learned from real data using 'rsem-calculate-expression'. The file can be found under the 'sample_name.stat' folder with the name of 'sample_name.model'    
+
+__estimated_isoform_results:__ This file contains expression levels for all isoforms recorded in the reference. It can be learned using 'rsem-calculate-expression' from real data. The corresponding file users want to use is 'sample_name.isoforms.results'. If simulating from user-designed expression profile is desired, start from a learned 'sample_name.isoforms.results' file and only modify the 'TPM' column. The simulator only reads the TPM column. But keeping the file format the same is required.   
+
+__theta0:__ This parameter determines the fraction of reads that are coming from background "noise" (instead of from a transcript). It can also be estimated using 'rsem-calculate-expression' from real data. Users can find it as the first value of the third line of the file 'sample_name.stat/sample_name.theta'.   
+
+__N:__ The total number of reads to be simulated. If 'rsem-calculate-expression' is executed on a real data set, the total number of reads can be found as the 4th number of the first line of the file 'sample_name.stat/sample_name.cnt'.   
+
+__output_name:__ Prefix for all output files.   
+
+__-q:__ If set, RSEM will stop outputting intermediate information.   
  
  ### Outputs:
  
+output_name.sim.isoforms.results, output_name.sim.genes.results: Expression levels estimated by counting where each simulated read comes from.
+
  output_name.fa if single-end without quality score;   
  output_name.fq if single-end with quality score;   
  output_name_1.fa & output_name_2.fa if paired-end without quality
  score;   
  output_name_1.fq & output_name_2.fq if paired-end with quality score.   
  
-output_name.sim.isoforms.results, output_name.sim.genes.results : Results estimated based on sample values.
+**Format of the header line**: Each simulated read's header line encodes where it comes from. The header line has the format:
+
+    {>/@}_rid_dir_sid_pos[_insertL]
+
+__{>/@}:__ Either '>' or '@' must appear. '>' appears if FASTA files are generated and '@' appears if FASTQ files are generated
+
+__rid:__ Simulated read's index, numbered from 0   
+
+__dir:__ The direction of the simulated read. 0 refers to forward strand ('+') and 1 refers to reverse strand ('-')   
+
+__sid:__ Represents the transcript this read is simulated from. It ranges between 0 and M, where M is the total number of transcripts. If sid=0, the read is simulated from the background noise. Otherwise, the read is simulated from the transcript with index sid. Transcript sid's name can be found in the 'transcript_id' column of the 'sample_name.isoforms.results' file (at line sid + 1, since line 1 holds the column names)   
+
+__pos:__ The start position of the simulated read in strand dir of transcript sid. It is numbered from 0   
+
+__insertL:__ Only appears for paired-end reads. It gives the insert length of the simulated read.   
+
+### Example:
+
+Suppose we want to simulate 50 million single-end reads with quality scores using the parameters learned in [Example](#example). In addition, we set theta0 to 0.2 and output_name to 'simulated_reads'. The command is:
+
+    rsem-simulate-reads /ref/mouse_125 mmliver_single_quals.stat/mmliver_single_quals.model mmliver_single_quals.isoforms.results 0.2 50000000 simulated_reads
  
  ## <a name="gen_trinity"></a> Generate Transcript-to-Gene-Map from Trinity Output
  
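The header format documented in the README hunk above, `{>/@}_rid_dir_sid_pos[_insertL]`, can be decoded with a small helper when checking where each simulated read came from. The following is a minimal sketch, not part of RSEM: `SimReadOrigin` and `parseSimHeader` are hypothetical names. It assumes the underscore right after the `>`/`@` marker may or may not be present, and it tolerates a trailing `/1` or `/2` mate suffix in case one is appended to paired-end read names; both points are assumptions rather than documented behavior.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Fields encoded in a simulated read's header line (hypothetical struct).
struct SimReadOrigin {
    long rid;      // read index, numbered from 0
    int dir;       // 0 = forward strand ('+'), 1 = reverse strand ('-')
    long sid;      // 0 = background noise, otherwise transcript index in 1..M
    long pos;      // 0-based start position on strand dir of transcript sid
    long insertL;  // insert length; -1 when the field is absent (single-end)
};

// Parses "{>/@}_rid_dir_sid_pos[_insertL]". Returns false on malformed input.
bool parseSimHeader(const std::string& header, SimReadOrigin& out) {
    if (header.empty() || (header[0] != '>' && header[0] != '@')) return false;

    std::size_t start = 1;
    // Assumption: the underscore after the marker is optional.
    if (start < header.size() && header[start] == '_') ++start;

    std::string body = header.substr(start);
    // Assumption: drop a "/1" or "/2" mate suffix if one is present.
    std::size_t n = body.size();
    if (n >= 2 && body[n - 2] == '/' && (body[n - 1] == '1' || body[n - 1] == '2'))
        body.erase(n - 2);

    std::vector<long> fields;
    std::stringstream ss(body);
    std::string token;
    while (std::getline(ss, token, '_')) {
        // Accept only short, non-empty runs of digits so std::stol cannot throw.
        if (token.empty() || token.size() > 18 ||
            token.find_first_not_of("0123456789") != std::string::npos)
            return false;
        fields.push_back(std::stol(token));
    }

    if (fields.size() != 4 && fields.size() != 5) return false;
    if (fields[1] != 0 && fields[1] != 1) return false;  // dir must be 0 or 1

    out.rid = fields[0];
    out.dir = static_cast<int>(fields[1]);
    out.sid = fields[2];
    out.pos = fields[3];
    out.insertL = (fields.size() == 5) ? fields[4] : -1;
    return true;
}
```

A caller would pass each header line of the simulated FASTA/FASTQ output to `parseSimHeader` and, for `sid > 0`, look up the transcript name at line `sid + 1` of 'sample_name.isoforms.results'.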
diff --git a/WHAT_IS_NEW b/WHAT_IS_NEW

index 09f4dd9c70d44f7f788ea1abd1b44752b02ba204..6645796d2423bf91107b4d55af9f2a2c17cda372 100644 (file)
--- a/WHAT_IS_NEW
+++ b/WHAT_IS_NEW
@@ -1,3 +1,10 @@
+RSEM v1.2.8
+
+- Provided a more detailed description for how to simulate RNA-Seq data using 'rsem-simulate-reads'
+- Provided more user-friendly error message if RSEM fails to extract transcript sequences due to the failure of reading certain chromosome sequences
+
+--------------------------------------------------------------------------------------------
+
  RSEM v1.2.7
  
  - 'rsem-find-DE' is replaced by 'rsem-run-ebseq' and 'rsem-control-fdr' for a more friendly user experience
diff --git a/extractRef.cpp b/extractRef.cpp

index 1a56cc15e8c615f02e14c7cd15f4e8c787e5a172..2d2b17cea4fde25be35e6c9ab6925ba404144a74 100644 (file)
--- a/extractRef.cpp
+++ b/extractRef.cpp
@@ -304,8 +304,8 @@ int main(int argc, char* argv[]) {
         for (int i = 1; i <= M; i++) {
                 if (seqs[i] == "") {
                         const Transcript& transcript = transcripts.getTranscriptAt(i);
-                       fprintf(stderr, "Cannot extract transcript %s's sequence from chromosome %s, whose information might not be provided! Please check if the chromosome directory is set correctly or the list of chromosome files is complete.\n", \
-                                       transcript.getTranscriptID().c_str(), transcript.getSeqName().c_str());
+                       fprintf(stderr, "Cannot extract transcript %s's sequence from chromosome %s! Failed to load chromosome %s's sequence. Please check that 1) the chromosome directory is set correctly; 2) the list of chromosome files is complete; 3) the FASTA files containing chromosome sequences are not truncated or in the wrong format.\n", \
+                               transcript.getTranscriptID().c_str(), transcript.getSeqName().c_str(), transcript.getSeqName().c_str());
                         exit(-1);
                 }
         }
diff --git a/simulation.cpp b/simulation.cpp

index 1288c65178afa360f89f4cff3ee5a2f78b7de5cf..0073c3ec1eee70249f087e4edeb1d769c48b61a5 100644 (file)
--- a/simulation.cpp
+++ b/simulation.cpp
@@ -260,7 +260,29 @@ int main(int argc, char* argv[]) {
         FILE *fi = NULL;
  
         if (argc != 7 && argc != 8) {
-               printf("Usage: rsem-simulate-reads reference_name estimated_model_file estimated_isoform_results theta0 N output_name [-q]\n");
+               printf("Usage: rsem-simulate-reads reference_name estimated_model_file estimated_isoform_results theta0 N output_name [-q]\n\n");
+               printf("Parameters:\n\n");
+               printf("reference_name: The name of RSEM references, which should be already generated by 'rsem-prepare-reference'\n");
+               printf("estimated_model_file: This file describes how the RNA-Seq reads will be sequenced given the expression levels. It determines what kind of reads will be simulated (single-end/paired-end, with or without quality scores) and includes parameters for fragment length distribution, read start position distribution, sequencing error models, etc. Normally, this file should be learned from real data using 'rsem-calculate-expression'. The file can be found under the 'sample_name.stat' folder with the name of 'sample_name.model'\n");
+               printf("estimated_isoform_results: This file contains expression levels for all isoforms recorded in the reference. It can be learned using 'rsem-calculate-expression' from real data. The corresponding file users want to use is 'sample_name.isoforms.results'. If simulating from user-designed expression profile is desired, start from a learned 'sample_name.isoforms.results' file and only modify the 'TPM' column. The simulator only reads the TPM column. But keeping the file format the same is required.\n");
+               printf("theta0: This parameter determines the fraction of reads that are coming from background \"noise\" (instead of from a transcript). It can also be estimated using 'rsem-calculate-expression' from real data. Users can find it as the first value of the third line of the file 'sample_name.stat/sample_name.theta'.\n");
+               printf("N: The total number of reads to be simulated. If 'rsem-calculate-expression' is executed on a real data set, the total number of reads can be found as the 4th number of the first line of the file 'sample_name.stat/sample_name.cnt'.\n");
+               printf("output_name: Prefix for all output files.\n");
+               printf("-q: If set, RSEM will stop outputting intermediate information.\n\n");
+               printf("Outputs:\n\n");
+               printf("output_name.sim.isoforms.results, output_name.sim.genes.results: Expression levels estimated by counting where each simulated read comes from.\n\n");
+               printf("output_name.fa if single-end without quality score;\noutput_name.fq if single-end with quality score;\noutput_name_1.fa & output_name_2.fa if paired-end without quality score;\noutput_name_1.fq & output_name_2.fq if paired-end with quality score.\n\n");
+               printf("Format of the header line: Each simulated read's header line encodes where it comes from. The header line has the format:\n\n");
+               printf("\t{>/@}_rid_dir_sid_pos[_insertL]\n\n");
+               printf("{>/@}: Either '>' or '@' must appear. '>' appears if FASTA files are generated and '@' appears if FASTQ files are generated\n");
+               printf("rid: Simulated read's index, numbered from 0\n");
+               printf("dir: The direction of the simulated read. 0 refers to forward strand ('+') and 1 refers to reverse strand ('-')\n");
+               printf("sid: Represents the transcript this read is simulated from. It ranges between 0 and M, where M is the total number of transcripts. If sid=0, the read is simulated from the background noise. Otherwise, the read is simulated from the transcript with index sid. Transcript sid's name can be found in the 'transcript_id' column of the 'sample_name.isoforms.results' file (at line sid + 1, since line 1 holds the column names)\n");
+               printf("pos: The start position of the simulated read in strand dir of transcript sid. It is numbered from 0\n");
+               printf("insertL: Only appears for paired-end reads. It gives the insert length of the simulated read.\n\n");
+               printf("Example:\n\n");
+               printf("Suppose we want to simulate 50 million single-end reads with quality scores using the parameters learned in [Example](#example). In addition, we set theta0 to 0.2 and output_name to 'simulated_reads'. The command is:\n\n");
+               printf("\trsem-simulate-reads /ref/mouse_125 mmliver_single_quals.stat/mmliver_single_quals.model mmliver_single_quals.isoforms.results 0.2 50000000 simulated_reads\n");
                 exit(-1);
         }
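The usage text above says that theta0 is the first value on the third line of 'sample_name.stat/sample_name.theta' and that N is the 4th number on the first line of 'sample_name.stat/sample_name.cnt'. A minimal sketch of pulling both values out programmatically follows. `readTheta0` and `readTotalReads` are hypothetical helpers, not RSEM functions, and they assume only the line and field positions stated in the usage text, with whitespace-separated numbers.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Reads theta0 as the first value on the third line of a '.theta' stream.
// Returns false if the stream has fewer than three lines or no number there.
bool readTheta0(std::istream& in, double& theta0) {
    std::string line;
    for (int i = 0; i < 3; ++i) {
        if (!std::getline(in, line)) return false;
    }
    std::istringstream fields(line);
    return static_cast<bool>(fields >> theta0);
}

// Reads N as the 4th number on the first line of a '.cnt' stream.
// Returns false if the first line holds fewer than four numbers.
bool readTotalReads(std::istream& in, long& totalReads) {
    std::string line;
    if (!std::getline(in, line)) return false;
    std::istringstream fields(line);
    long value = 0;
    for (int i = 0; i < 4; ++i) {
        if (!(fields >> value)) return false;
    }
    totalReads = value;
    return true;
}
```

With the sample from the README example, a caller would open 'mmliver_single_quals.stat/mmliver_single_quals.theta' and 'mmliver_single_quals.stat/mmliver_single_quals.cnt' with `std::ifstream` and pass the resulting values as the `theta0` and `N` arguments of 'rsem-simulate-reads'.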