src/swsse2/README

Introduction

The Smith-Waterman [1] algorithm is one of the most sensitive sequence
alignment algorithms in use today.  It is also one of the slowest, because of
the number of calculations needed to perform the search.  To speed up the
algorithm, it has been adapted to use Single Instruction Multiple Data (SIMD)
instructions found on many common microprocessors today.  SIMD instructions
perform the same operation on multiple pieces of data in parallel.

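For orientation, the calculation that the SIMD versions accelerate can be
written as a plain dynamic-programming loop.  The Python sketch below is an
illustration only, not the swsse2 source.  The function names and the toy
scoring function are hypothetical, and it assumes that a gap of length k costs
gap_init + k * gap_ext, with both penalties given as negative numbers.

```python
NEG_INF = -10**9  # stands in for minus infinity while staying an integer


def toy_score(a, b):
    """Hypothetical scoring function: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1


def smith_waterman_score(query, target, score=toy_score,
                         gap_init=-10, gap_ext=-2):
    """Return the best local alignment score of query against target.

    Assumption: a gap of length k costs gap_init + k * gap_ext.
    """
    gap_open = gap_init + gap_ext      # cost of the first residue of a gap
    cols = len(target) + 1
    h_prev = [0] * cols                # H values of the previous query row
    f_prev = [NEG_INF] * cols          # gaps that extend down a column
    best = 0
    for q in query:
        h_cur = [0] * cols
        f_cur = [NEG_INF] * cols
        e = NEG_INF                    # gap that extends along the current row
        for j in range(1, cols):
            e = max(h_cur[j - 1] + gap_open, e + gap_ext)
            f_cur[j] = max(h_prev[j] + gap_open, f_prev[j] + gap_ext)
            diag = h_prev[j - 1] + score(q, target[j - 1])
            h_cur[j] = max(0, diag, e, f_cur[j])   # local scores never drop below 0
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best
```

Every cell depends on its left, upper and upper-left neighbours, so the work
grows with the product of the two sequence lengths.  That is the cost the SIMD
implementations try to spread across vector lanes.
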
The program swsse2 introduces a new SIMD implementation of the Smith-Waterman
algorithm for the x86 processor.  The weights are precomputed parallel to the
query sequence, as in the Rognes [2] implementation, but are accessed in a
striped pattern.  The new implementation reached speeds six times faster than
other SIMD implementations.

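One way to picture the striped access pattern is sketched below in Python.
This is an assumption-laden illustration, not the swsse2 source: it assumes
that a query of length m is cut into `lanes` equal segments of length
t = ceil(m / lanes), that vector j holds query positions j, j + t, j + 2t and
so on, and that unused lanes are padded with a score of 0.  All names here
(striped_positions, striped_profile, lanes) are hypothetical; lanes=8 would
correspond to 16-bit values in a 128-bit register.

```python
def striped_positions(query_len, lanes=8):
    """Return, for each vector, the query position held by each lane.

    Lanes that fall past the end of the query are marked with None.
    """
    seg_len = (query_len + lanes - 1) // lanes   # vectors needed per residue
    layout = []
    for j in range(seg_len):
        vector = []
        for k in range(lanes):
            pos = j + k * seg_len                # striped index
            vector.append(pos if pos < query_len else None)
        layout.append(vector)
    return layout


def striped_profile(query, residues, score, lanes=8):
    """Precompute weights parallel to the query, stored in striped order.

    Returns {residue: [vector, ...]}; padding lanes get a weight of 0
    (an assumption made for this sketch).
    """
    layout = striped_positions(len(query), lanes)
    profile = {}
    for r in residues:
        profile[r] = [
            [score(r, query[pos]) if pos is not None else 0 for pos in vector]
            for vector in layout
        ]
    return profile
```

For a query of 8 residues and 4 lanes, striped_positions(8, 4) gives
[[0, 2, 4, 6], [1, 3, 5, 7]]: neighbouring query positions land in different
vectors rather than in neighbouring lanes of the same vector.
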
The benchmark graph compares the total search times of 11 queries (3806
residues) against the Swiss-Prot 49.1 database (75,841,138 residues).  The
tests were run on a PC with a 2.00 GHz Intel Xeon Core 2 Duo processor and
2 GB of RAM.  The program is single-threaded, so the number of cores has no
effect on the run times.  The Wozniak, Rognes and striped implementations were
run with the scoring matrices BLOSUM50 and BLOSUM62 and four different gap
penalties: 10-k, 10-2k, 14-2k and 40-2k.  Since the runtime of the Wozniak
implementation does not depend on the scoring matrix, one line is used for
both scoring matrices.

Build Instructions

    * Download the zip file with the swsse2 sources.
    * Unzip the sources.
    * Load the swsse2.vcproj file into Microsoft Visual C++ 2005.
    * Build the project (F7).  For optimized code, be sure to change the
      configuration to a Release build.
    * The swsse2.exe file is in the Release directory, ready to be run.

Running

To run swsse2, three files must be provided: the scoring matrix, the query
sequence and the database sequence.  Four scoring matrices are provided with
the release: BLOSUM45, BLOSUM50, BLOSUM62 and BLOSUM80.  The query sequence and
database sequence must be in FASTA format.  For example, to run with the
default gap penalties (10-2k), the scoring matrix BLOSUM50, the query sequence
ptest1.fasta and the sequence database db.fasta, use:

     c:\swsse2>.\Release\swsse2.exe blosum50.mat ptest1.fasta db.fasta
     ptest1.fasta vs db.fasta
     Matrix: blosum50.mat, Init: -10, Ext: -2

     Score  Description
        53  108_LYCES Protein 108 precursor.
        53  10KD_VIGUN 10 kDa protein precursor (Clone PSAS10).
        32  1431_ECHGR 14-3-3 protein homolog 1.
        32  1431_ECHMU 14-3-3 protein homolog 1 (Emma14-3-3.1).
        27  110K_PLAKN 110 kDa antigen (PK110) (Fragment).
        26  1432_ECHGR 14-3-3 protein homolog 2.
        25  13S1_FAGES 13S globulin seed storage protein 1
        25  13S3_FAGES 13S globulin seed storage protein 3
        25  13S2_FAGES 13S globulin seed storage protein 2
        23  12S1_ARATH 12S seed storage protein CRA1
        22  13SB_FAGES 13S globulin basic chain.
        21  12AH_CLOS4 12-alpha-hydroxysteroid dehydrogenase
        21  140U_DROME RPII140-upstream protein.
        21  12S2_ARATH 12S seed storage protein CRB
        21  1431_LYCES 14-3-3 protein 1.
        20  1431_ARATH 14-3-3-like protein GF14

     21 residues in query string
     2014 residues in 25 library sequences
     Scan time:  0.000 (Striped implementation)

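The query and database files are plain FASTA text, and the summary lines in
the output count residues and sequences.  The Python sketch below shows one
minimal way to read such a file and produce the same kind of counts.  It is
not the parser used by swsse2; the function name read_fasta and the sample
records are hypothetical, and it assumes each record starts with a ">" line
followed by one or more sequence lines.

```python
def read_fasta(lines):
    """Yield (description, sequence) pairs from an iterable of FASTA lines."""
    description, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue                         # ignore blank lines
        if line.startswith(">"):
            if description is not None:      # finish the previous record
                yield description, "".join(chunks)
            description, chunks = line[1:], []
        elif description is not None:
            chunks.append(line)              # sequence may span several lines
    if description is not None:              # finish the last record
        yield description, "".join(chunks)


# Hypothetical two-record example, not taken from any real database.
sample = [
    ">SEQ1 first made-up record",
    "MKTAY",
    "IAK",
    ">SEQ2 second made-up record",
    "QRQIS",
]
records = list(read_fasta(sample))
total_residues = sum(len(seq) for _, seq in records)
print(f"{total_residues} residues in {len(records)} library sequences")
```

To read from disk, the same function can be given an open file object, since
iterating over a text file yields its lines.
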
Options

    Usage: swsse2 [-h] [-(n|w|r|s)] [-i num] [-e num] [-t num] [-c num]
                  matrix query db

        -h       : this help message
        -n       : run a non-vectorized Smith-Waterman search
        -w       : run a vectorized Wozniak search
        -r       : run a vectorized Rognes search (NOT SUPPORTED)
        -s       : run a vectorized striped search (default)
        -i num   : gap init penalty (default -10)
        -e num   : gap extension penalty (default -2)
        -t num   : minimum score threshold (default 20)
        -c num   : number of scores to be displayed (default 250)
        matrix   : scoring matrix file
        query    : query sequence file (fasta format)
        db       : sequence database file (fasta format)

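The effect of the -t and -c options on the result list can be pictured with a
few lines of Python.  This is a sketch of the described behaviour, not the
swsse2 source.  The function name select_hits is hypothetical.  It assumes the
threshold is inclusive, which matches the sample listing, where a score of 20
is shown at the default threshold of 20.  It also assumes that hits with equal
scores keep their input order.

```python
def select_hits(hits, threshold=20, count=250):
    """Filter and order (score, description) pairs for display.

    threshold mirrors -t (minimum score) and count mirrors -c (number of
    scores displayed).  Inclusive comparison and tie order are assumptions.
    """
    kept = [hit for hit in hits if hit[0] >= threshold]
    kept.sort(key=lambda hit: hit[0], reverse=True)   # stable: ties keep order
    return kept[:count]


# Scores and descriptions taken from the sample listing, plus one made-up
# low-scoring entry to show the threshold at work.
hits = [
    (32, "1431_ECHGR 14-3-3 protein homolog 1."),
    (53, "108_LYCES Protein 108 precursor."),
    (12, "hypothetical low-scoring entry"),
    (20, "1431_ARATH 14-3-3-like protein GF14"),
]
for score, description in select_hits(hits, threshold=20, count=250):
    print(f"{score:6d}  {description}")
```
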
Note

The Rognes implementation is not released as part of the swsse2 package due to
patent concerns.

References

[1] Smith, T. F. and Waterman, M. S. (1981) Identification of common molecular subsequences.  J. Mol. Biol., 147, 195-197.

[2] Rognes, T. and Seeberg, E. (2000) Six-fold speed-up of Smith-Waterman sequence database searches using parallel processing on common microprocessors.  Bioinformatics, 16, 699-706.