a program to print smith-waterman scores against a query sequence

[fastq-tools.git] / src / swsse2 / README
diff --git a/src/swsse2/README b/src/swsse2/README

new file mode 100644 (file)

index 0000000..11f01ac
--- /dev/null
+++ b/src/swsse2/README
@@ -0,0 +1,100 @@
+Introduction
+
+The Smith-Waterman [1] algorithm is one of the most sensitive sequencing 
+algorithms in use today.  It is also the slowest due to the number of 
+calculations needed to perform the search.  To speed up the algorithm, it 
+has been adapted to use Single Instruction Multiple Data, SIMD, instructions 
+found on many common microprocessors today.  SIMD instructions are able to 
+perform the same operation on multiple pieces of data parallel.
+
+The program swsse2 introduces a new SIMD implementation of the Smith-Waterman 
+algorithm for the X86 processor.  The weights are precomputed parallel to the 
+query sequence, like the Rognes [2] implementation, but are accessed in the 
+striped pattern.  The new implementation reached speeds six times faster than 
+other SIMD implementations.
+
+Below is a graph comparing the total search times of 11 queries, 3806 residues,
+against the Swiss-Prot 49.1 database, 75,841,138 residues.  The tests were run
+on a PC with a 2.00GHz Intel Xeon Core 2 Duo processor with 2 GB RAM.  The 
+program is singlely threaded, so the number of cores has no affect on the run 
+times.  The Wozniak, Rognes and striped implementations were run with the 
+scoring matrices BLOSUM50 and BLOSUM62 and four different gap penalties, 10-k,
+10-2k, 14-2k and 40-2k.  Since the Wozniak's runtime does not change depending
+on the scoring matrix, one line is used for both scoring matrices.
+
+Build Instructions
+
+    * Download the zip file with the swsse2 sources.
+    * Unzip the sources.
+    * Load the swsse2.vcproj file into Microsoft Visual C++ 2005.
+    * Build the project (F7).  For optimized code, be sure to change the 
+      configuration to a Release build. 
+    * The swsse2.exe file is in the Release directory ready to be run.
+
+Running
+
+To run swsse2 three files must be provided, the scoring matrix, query sequence
+and the database sequence.  Four scoring matrices are provided with the 
+release, BLOSUM45, BLOSUM50, BLOSUM62 and BLOSUM80.  The query sequence and 
+database sequence must be in the FASTA format.  For example, to run with the 
+default gap penalties 10-2k, the scoring matrix BLOSUM50, the query sequence 
+ptest1.fasta and the sequence database db.fasta use:
+
+     c:\swsse2>.\Release\swsse2.exe blosum50.mat ptest1.fasta db.fasta
+     ptest1.fasta vs db.fasta
+     Matrix: blosum50.mat, Init: -10, Ext: -2
+
+     Score  Description
+        53  108_LYCES Protein 108 precursor.
+        53  10KD_VIGUN 10 kDa protein precursor (Clone PSAS10).
+        32  1431_ECHGR 14-3-3 protein homolog 1.
+        32  1431_ECHMU 14-3-3 protein homolog 1 (Emma14-3-3.1).
+        27  110K_PLAKN 110 kDa antigen (PK110) (Fragment).
+        26  1432_ECHGR 14-3-3 protein homolog 2.
+        25  13S1_FAGES 13S globulin seed storage protein 1
+        25  13S3_FAGES 13S globulin seed storage protein 3
+        25  13S2_FAGES 13S globulin seed storage protein 2
+        23  12S1_ARATH 12S seed storage protein CRA1
+        22  13SB_FAGES 13S globulin basic chain.
+        21  12AH_CLOS4 12-alpha-hydroxysteroid dehydrogenase
+        21  140U_DROME RPII140-upstream protein.
+        21  12S2_ARATH 12S seed storage protein CRB
+        21  1431_LYCES 14-3-3 protein 1.
+        20  1431_ARATH 14-3-3-like protein GF14
+
+     21 residues in query string
+     2014 residues in 25 library sequences
+     Scan time:  0.000 (Striped implementation)
+
+Options
+
+    Usage: swsse2 [-h] [-(n|w|r|s)] [-i num] [-e num] [-t num] [-c num] 
+                  matrix query db
+
+        -h       : this help message
+        -n       : run a non-vectorized Smith-Waterman search
+        -w       : run a vectorized Wozniak search
+        -r       : run a vectorized Rognes search (NOT SUPPORTED)
+        -s       : run a vectorized striped search (default)
+        -i num   : gap init penalty (default -10)
+        -e num   : gap extension penalty (default -2)
+        -t num   : minimum score threshold (default 20)
+        -c num   : number of scores to be displayed (default 250)
+        matrix   : scoring matrix file
+        query    : query sequence file (fasta format)
+        db       : sequence database file (fasta format) 
+
+Note
+
+The Rognes implementation is not released as part of the swsse2 package due to
+patent concerns. 
+
+References
+
+[1] Smith, T. F. and Waterman, M. S. (1981) Identification of common molecular subsequences. J. Mol. Biol., 147, 195-197.
+
+[2] Rognes, T. and Seeberg, E. (2000) Six-fold speed-up of the Smith-Waterman sequence database searches using parallel processing on common microprocessors.  Bioinformatics, 16, 699-706. 
+
+ 
+
+