1 %% LyX 1.5.1 created this file. For more info, see http://www.lyx.org/.
2 %% Do not edit unless you really know what you are doing.
3 \documentclass[english]{scrartcl}
4 \usepackage[T1]{fontenc}
5 \usepackage[latin9]{inputenc}
6 \setlength{\parskip}{\medskipamount}
7 \setlength{\parindent}{0pt}
10 \IfFileExists{url.sty}{\usepackage{url}}
11 {\newcommand{\url}{\texttt}}
15 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% LyX specific LaTeX commands.
16 %% Because html converters don't know tabularnewline
17 \providecommand{\tabularnewline}{\\}
19 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Textclass specific LaTeX commands.
20 \newenvironment{lyxcode}
22 \setlength{\rightmargin}{\leftmargin}
23 \setlength{\listparindent}{0pt}% needed for AMS classes
25 \setlength{\itemsep}{0pt}
26 \setlength{\parsep}{0pt}
27 \normalfont\ttfamily}%
31 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% User specified LaTeX commands.
32 \usepackage[colorlinks=true, urlcolor=blue, linkcolor=black]{hyperref}
39 \title{Biotools Cookbook}
42 \author{Martin Asser Hansen}
45 \publishers{John Mattick Group\\
46 Institute for Molecular Bioscience\\
47 University of Queensland\\
49 E-mail: mail@maasha.dk}
64 \section{Introduction}
66 Biotools is a selection of simple tools that can be linked together
67 (piped as we shall call it) in a very flexible manner to perform both
68 simple and complex tasks. The fundamental idea is that biotools work
69 on a data stream that will only terminate at the end of an analysis
70 and that this data stream can be passed through several different
71 biotools, each performing one specific task. The advantage of this
72 approach is that a user can perform simple and complex tasks without
73 having to write advanced code. Moreover, since the data format used
74 to pass data between biotools is text based, biotools can be written
75 by different developers in their favorite programming language ---
76 and still the biotools will be able to work together.
78 In the most simple form bioools can be piped together on the command
79 line like this (using the pipe character '|'):
82 read\_data~|~calculate\_something~|~write\_result
84 However, a more comprehensive analysis could be composed:
87 read\_data~|~select\_entries~|~convert\_entries~|~search\_database~~
89 evaluate\_results~|~plot\_diagram~|~plot\_another\_diagram~|
93 The data stream that is piped through the biotools consists of records
94 of key/value pairs in the same way a hash does in order to keep as
95 simple a structure as possible. An example record can be seen below:
118 The '-\/-\/-' denotes the delimiter of the records, and each key
119 is a word followed by a ':' and a white-space and then the value.
120 By convention the biotools only uses upper case keys (a list of used
121 keys can be seen in Appendix~\ref{sec:Keys}). Since the records
122 basically are hash structures this mean that the order of the keys
123 in the stream is unordered, and in the above example it is pure coincidence
124 that HIT\_BEG is displayed before HIT\_END, however, when the order
125 of the keys is importent, the biotools will automagically see to that.
127 All of the biotools are able to read and write a data stream to and
128 from file as long as the records are in the biotools format. This
129 means that if you are undertaking a lengthy analysis where one of
130 the steps is time consuming, you may save the stream after this step,
131 and subsequently start one or more analysis from that last step%
132 \footnote{It is a goal that the biotools at some point will be able to dump
133 the data stream to file in case one of the tools fail critically.%
134 }. If you are running a lengthy analysis it is highly recommended that
135 you create a small test sample of the data and run that through the
136 pipeline --- and once you are satisfied with the result proceed with
137 the full data set (see~\ref{sub:How-to-select-a-few-records}).
140 \section{The Data Stream}
143 \subsection{How to read the data stream from file?\label{sub:How-to-read-stream}}
145 You want to read a data stream that you previously have saved to file
146 in biotools format. This can be done implicetly or explicitly. The
147 implicit way uses the 'stdout' stream of the Unix terminal:
152 cat is the Unix command that reads a file and output the result to
153 'stdout' --- which in this case is piped to any biotool represented
154 by the <biotool>. It is also possible to read the data stream using
155 '<' to direct the 'stdout' stream into the biotool like this:
160 However, that will not work if you pipe more biotools together. Then
161 it is much safer to read the stream from a file explicitly like this:
164 <biotool>~-{}-stream\_in=<file>
166 Here the filename <file> is explicetly given to the biotool <biotool>
167 with the switch -\/-stream\_in. This switch works with all biotools.
168 It is also possible to read in data from multiple sources by repeating
169 the explicit read step:
172 <biotool>~-{}-stream\_in=<file1>~|~<biotool>~-{}-stream\_in=<file2>
175 \subsection{How to write the data stream to file?\label{sub:How-to-write-stream}}
177 In order to save the output stream from a biotool to file, so you
178 can read in the stream again at a later time, you can do one of two
184 All, the biotools write the data stream to 'stdout' by default which
185 can be written to a file by redirecting 'stdout' to file using '>'
186 , however, if one of the biotools for writing other formats is used
187 then the both the biotools records as well as the result output will
188 go to 'stdout' in a mixture causing havock! To avoid this you must
189 use the switch -\/-stream\_out that explictly tells the biotool to
190 write the output stream to file:
193 <biotool>~-{}-stream\_out=<file>
195 The -\/-stream\_out switch works with all biotools.
198 \subsection{How to terminate the data stream?}
200 The data stream is never stops unless the user want to save the stream
201 or by supplying the -\/-no\_stream switch that will terminate the
205 <biotool>~-{}-no\_stream
207 The -\/-no\_stream switch only works with those biotools where it
208 makes sense that the user might want to terminale the data stream,
209 \emph{i.e}. after an analysis step where the user wants to output
210 the result, but not the data stream.
213 \subsection{How to write my results to file?\label{sub:How-to-write-result}}
215 Saving the result of an analysis to file can be done implicitly or
216 explicitly. The implicit way:
219 <biotool>~-{}-no\_stream~>~<file>
221 If you use '>' to redirect 'stdout' to file then it is important to
222 use the -\/-no\_stream switch to avoid writing a mix of biotools
223 records and result to the same file causing havock. The safe way is
224 to use the -\/-result\_out switch which explicetly tells the biotool
225 to write the result to a given file:
228 <biotool>~-{}-result\_out=<file>
230 Using the above method will not terminate the stream, so it is possible
231 to pipe that into another biotool generating different results:
234 <biotool1>~-{}-result\_out=<file1>~|~<biotool2>~-{}-result\_out=<file2>
236 And still the data stream will continue unless terminated with -\/-no\_stream:
239 <biotool>~-{}-result\_out=<file>~-{}-no\_stream
241 Or written to file using implicitly or explicity \eqref{sub:How-to-write-result}.
245 <biotool>~-{}-result\_out=<file1>~-{}-stream\_out=<file2>
248 \subsection{How to read data from multiple sources?}
250 To read multiple data sources, with the same type or different type
254 <biotool1>~-{}-data\_in=<file1>~|~<biotool2>~-{}-data\_in=<file2>
256 where type is the data type a specific biotool reads.
259 \section{Reading input}
262 \subsection{How to read biotools input?}
264 See \eqref{sub:How-to-read-stream}.
267 \subsection{How to read in data?}
269 Data in different formats can be read with the appropriate biotool
270 for that format. The biotools are typicalled named 'read\_<data type>'
271 such as \textbf{read\_fasta}, \textbf{read\_bed}, \textbf{read\_tab},
272 etc., and all behave in a similar manner. Data can be read by supplying
273 the -\/-data\_in switch and a file name to the file containing the
277 <biotool>~-{}-data\_in=<file>
279 It is also possible to read in a saved biotools stream (see \ref{sub:How-to-read-stream})
280 as well as reading data in one go:
283 <biotool>~-{}-stream\_in=<file1>~-{}-data\_in=<file2>
285 If you want to read data from several files you can do this:
288 <biotool>~-{}-data\_in=<file1>~|~<biotool>~-{}-data\_in=<file2>
290 If you have several data files you can read in all explicitly with
291 a comma separated list:
294 <biotool>~-{}-data\_in=file1,file2,file3
296 And it is also possible to use file globbing:
299 <biotool>~-{}-data\_in={*}.fna
304 <biotool>~-{}-data\_in=file1,/dir/{*}.fna
306 Finally, it is possible to read in data in different formats using
307 the appropriate biotool for each format:
310 <biotool1>~-{}-data\_in=<file1>~|~<biotool2>~-{}-data\_in=<file2>~...
313 \subsection{How to read FASTA input?}
315 Sequences in FASTA format can be read explicitly using \textbf{read\_fasta}:
318 read\_fasta~-{}-data\_in=<file>
321 \subsection{How to read alignment input?}
323 If your alignment if FASTA formatted then you can \textbf{read\_align}.
324 It is also possible to use \textbf{read\_fasta} since the data is
325 FASTA formatted, however, with \textbf{read\_fasta} the key ALIGN
326 will be omitted. The ALIGN key is used to determine which sequences
327 belong to what alignment which is required for \textbf{write\_align}.
330 read\_align~-{}-data\_in=<file>
333 \subsection{How to read tabular input?\label{sub:How-to-read-table}}
335 Tabular input can be read with \textbf{read\_tab} which will read
336 in all rows and chosen columns (separated by a given delimter) from
337 a table in text format.
341 \noindent \begin{center}
343 Human & ATACGTCAG & 23524\tabularnewline
344 Dog & AGCATGAC & 2442\tabularnewline
345 Mouse & GACTG & 234\tabularnewline
346 Cat & AAATGCA & 2342\tabularnewline
350 Can be read using the command:
353 read\_tab~-{}-data\_in=<file>
355 Which will result in four records, one for each row, where the keys
356 V0, V1, V2 are the default keys for the organism, sequence, and count,
357 respectively. It is possible to select a subset of colums to read
358 by using the -\/-cols switch which takes a comma separated list of
359 columns numbers (first column is designated 0) as argument. So to
360 read in only the sequence and the count so that the count comes before
364 read\_tab~-{}-data\_in=<file>~-{}-cols=2,1
366 It is also possible to name the columns with the -\/-keys switch:
369 read\_tab~-{}-data\_in=<file>~-{}-cols=2,1~-{}-keys=COUNT,SEQ
372 \subsection{How to read BED input?}
374 The BED (Browser Extensible Data%
375 \footnote{\url{http://genome.ucsc.edu/FAQ/FAQformat}%
376 }) format is a tabular format for data pertaining to one of the Eukaryotic
377 genomes in the UCSC genome brower%
378 \footnote{\url{http://genome.ucsc.edu/}%
379 }. The BED format consists of up to 12 columns, where the first three
380 are mandatory CHR, CHR\_BEG, and CHR\_END. The mandatory columns and
381 any of the optional columns can all be read in easily with the \textbf{read\_bed}
385 read\_bed~-{}-data\_in=<file>
387 It is also possible to read the BED file with \textbf{read\_tab} (see~\ref{sub:How-to-read-table}),
388 however, that will be more cumbersome because you need to specify
392 read\_tab~-{}-data\_in=<file>~-{}-keys=CHR,CHR\_BEG,CHR\_END~...
395 \subsection{How to read PSL input?}
397 The PSL format is the output from BLAT and contains 21 mandatory fields
398 that can be read with \textbf{read\_psl}:
401 read\_psl~-{}-data\_in=<file>
404 \section{Writing output}
406 All result output can be written explicitly to file using the -\/-result\_out
407 switch which all result generating biotools have. It is also possible
408 to write the result to file implicetly by directing 'stdout' to file
409 using '>', however, that requires the -\/-no\_stream swich to prevent
410 a mixture of data stream and results in the file. The explicit (and
414 ...~|~<biotool>~-{}-result\_out=<file>
419 ...~|~<biotool>~-{}-no\_stream~>~<file>
422 \subsection{How to write biotools output?}
424 See \eqref{sub:How-to-write-stream}.
427 \subsection{How to write FASTA output?\label{sub:How-to-write-fasta}}
429 FASTA output can be written with \textbf{write\_fasta}.
432 ...~|~write\_fasta~-{}-result\_out=<file>
434 It is also possible to wrap the sequences to a given width using the
435 -\/-wrap switch allthough wrapping of sequence is generally an evil
439 ...~|~write\_fasta~-{}-no\_stream~-{}-wrap=80
442 \subsection{How to write alignment output?\label{sub:How-to-write-alignment}}
444 Pretty alignments with ruler%
445 \footnote{'.' for every 10 residues, ':' for every 50, and '|' for every 100%
446 } and consensus sequence can be created with \textbf{write\_align},
447 what also have the optional -\/-wrap switch to break the alignment
448 into blocks of a given width:
451 ...~|~write\_align~-{}-result\_out=<file>~-{}-wrap=80
453 If the number of sequnces in the alignment is 2 then a pairwise alignment
454 will be output otherwise a multiple alignment. And if the sequence
455 type, determined automagically, is protein, then residues and symbols
456 (+,~:,~.) will be used to show consensus according to the Blosum62
460 \subsection{How to write tabular output?\label{sub:How-to-write-tab}}
462 Outputting the data stream as a table can be done with \textbf{write\_tab},
463 which will write generate one row per record with the values as columns.
464 If you supply the optional -\/-comment switch, when the first row
465 in the table will be a 'comment' line prefixed with a '\#':
468 ...~|~write\_tab~-{}-result\_out=<file>~-{}-comment
470 You can also change the delimiter from the default (tab) to \emph{e.g.}
474 ...~|~write\_tab~-{}-result\_out=<file>~-{}-delimit=','
476 If you want the values output in a specific order you have to supply
477 a comma separated list using the -\/-keys switch that will print
478 only those keys in that order:
481 ...~|~write\_tab~-{}-result\_out=<file>~-{}-keys=SEQ\_NAME,COUNT
483 Alternatively, if you have some keys that you don't want in the tabular
484 output, use the -\/-no\_keys switch. So to print all keys except
485 SEQ and SEQ\_TYPE do:
488 ...~|~write\_tab~-{}-result\_out=<file>~-{}-no\_keys=SEQ,SEQ\_TYPE
490 Finally, if you have a stream containing a mix of different records
491 types, \emph{e.g.} records with sequences and records with matches,
492 then you can use \textbf{write\_tab} to output all the records in
493 tabluar format, however, the -\/-comment, -\/-keys, and -\/-no\_keys
494 switches will only respond to records of the first type encountered.
495 The reason is that outputting mixed records is probably not what you
496 want anyway, and you should remove all the unwanted records from the
497 stream before outputting the table: \textbf{grab} is your friend (see~\ref{sub:How-to-grab}).
500 \subsection{How to write a BED output?\label{sub:How-to-write-BED}}
502 Data in BED format can be output if the records contain the mandatory
503 keys CHR, CHR\_BEG, and CHR\_END using \textbf{write\_bed}. If the
504 optional keys are also present, they will be output as well:
507 write\_bed~-{}-result\_out=<file>
510 \subsection{How to write PSL output?\label{sub:How-to-write-PSL}}
512 Data in PSL format can be output using \textbf{write\_psl:}
515 write\_psl~-{}-result\_out=<file>
518 \section{Manipulating Records}
521 \subsection{How to select a few records?\label{sub:How-to-select-a-few-records}}
523 To quickly get an overview of your data you can limit the data stream
524 to show a few records. This also very useful to test the pipeline
525 with a few records if you are setting up a complex analysis using
526 several biotools. That way you can inspect that all goes well before
527 analyzing and waiting for the full data set. All of the read\_<type>
528 biotools have the -\/-num switch which will take a number as argument
529 and only that number of records will be read. So to read in the first
530 10 FASTA entries from a file:
533 read\_fasta~-{}-data\_in=test.fna~-{}-num=10
535 Another way of doing this is to use \textbf{head\_records} will limit
536 the stream to show the first 10 records (default):
541 Using \textbf{head\_records} directly after one of the read\_<type>
542 biotools will be a lot slower than using the -\/-num switch with
543 the read\_<type> biotools, however, \textbf{head\_records} can also
544 be used to limit the output from all the other biotools. It is also
545 possible to give \textbf{head\_records} a number of records to show
546 using the -\/-num switch. So to display the first 100 records do:
549 ...~|~head\_records~-{}-num=100
552 \subsection{How to count all records in the data stream?}
554 To count all the records in the data stream use \textbf{count\_records},
555 which adds one record (which is not included in the count) to the
556 data stream. So to count the number of sequences in a FASTA file you
560 cat~test.fna~|~read\_fasta~|~count\_records~-{}-no\_stream
562 Which will write the last record containing the count to 'stdout':
569 It is also possible to write the count to file using the -\/-result\_out
573 \subsection{How to grab specific records?\label{sub:How-to-grab}}
575 The biotool \textbf{grab} is related to the Unix grep and locates
576 records based on matching keys and/or values using either a pattern,
577 a Perl regex, or a numerical evaluation. To easily \textbf{grab} all
578 records in the stream that has any mentioning of the pattern 'human'
579 just pipe the data stream through \textbf{grab} like this:
582 ...~|~grab~-{}-pattern=human
584 This will search for the pattern 'human' in all keys and all values.
585 The -\/-pattern switch takes a comma separated list of patterns,
586 so in order to match multiple patterns do:
589 ...~|~grab~-{}-pattern=human,mouse
591 It is also possible to use the -\/-pattern\_in switch instead of
592 -\/-pattern. -\/-pattern\_in is used to read a file with one pattern
596 ...~|~grab~-{}-pattern\_in=patterns.txt
598 If you want the opposite result --- to find all records that does
599 not match the patterns, add the -\/-invert switch, which not only
600 works with the -\/-pattern switch, but also with -\/-regex and -\/-eval:
603 ...~|~grab~-{}-pattern=human~-{}-invert
605 If you want to search the record keys only, \emph{e.g.} to find all
606 records containing the key SEQ you can add the -\/-keys\_only switch.
607 This will prevent matching of SEQ in any record value, and in fact
608 SEQ is a not uncommon peptide sequence you could get an unwanted record.
609 Also, this will give an increase in speed since only the keys are
613 ...~|~grab~-{}-pattern=SEQ~-{}-keys\_only
615 However, if you are interested in finding the peptide sequence SEQ
616 and not the SEQ key, just add the -\/-vals\_only switch instead:
619 ...~|~grab~-{}-pattern=SEQ~-{}-vals\_only
621 Also, if you want to grab for certain key/value pairs you can supply
622 a comma separated list of keys whos values will then be searched using
623 the -\/-keys switch. This is handy if your records contain large
624 genomic sequences and you dont want to search the entire sequence
625 for \emph{e.g.} the organism name --- it is much faster to tell \textbf{grab}
626 which keys to search the value for:
629 ...~|~grab~-{}-pattern=human~-{}-keys=SEQ\_NAME
633 It is also possible to invoke flexible matching using regex (regular
634 expressions) instead of simple pattern matching. In \textbf{grab}
635 the regex engine is Perl based and allows use of different type of
636 wild cards, alternatives, \emph{etc}%
637 \footnote{\url{http://perldoc.perl.org/perlreref.html}%
638 }. If you want to \textbf{grab} records withs the sequence ATCG or
639 GCTA you can do this:
642 ...~|~grab~-{}-regex='ATCG|GCTA'
644 Or if you want to find sequences beginning with ATCG:
647 ...~|~grab~-{}-regex='\textasciicircum{}ATCG'
649 You can also use \textbf{grab} to locate records that fulfill a numerical
650 property using the -\/-eval switch witch takes an expression in three
651 parts. The first part is the key that holds the number we want to
652 evaluate, the second part holds one the six operators:
655 \item Greater than: >
656 \item Greater than or equal to: >=
658 \item Less than or equal to: <=
660 \item Not equal to: !=
662 And finally comes the number used in the evaluation. So to \textbf{grab}
663 all records with a sequence length greater than 30:
666 ...~length\_seq~|~grab~-{}-eval='SEQ\_LEN~>~30'
668 If you want to locate all records containing the pattern 'human' and
669 where the sequence length is greater that 30, you do this by running
670 the stream through \textbf{grab} twice:
673 ...~|~grab~-{}-pattern='human'~|~length\_seq~|~grab~-{}-eval='SEQ\_LEN~>~30'
675 To get the best speed performance, use the most restrictive \textbf{grab}
679 \subsection{How to remove keys from records?}
681 To remove one or more specific keys from all records in the data stream
682 use \textbf{remove\_keys} like this:
685 ...~|~remove\_keys~-{}-keys=SEQ,SEQ\_NAME
687 In the above example SEQ and SEQ\_NAME will be removed from all records
688 if they exists in these. If all keys are removed from a record, then
689 the record will be removed.
692 \subsection{How to rename keys in records?}
694 Sometimes you want to rename a record key, \emph{e.g.} if you have
695 read in a two column table with sequence name and sequence in each
696 column (see \ref{sub:How-to-read-table}) without specifying the key
697 names, then the sequence name will be called V0 and the sequence V1
698 as default in the \textbf{read\_tab} biotool. To rename the V0 and
699 V1 keys we need to run the stream through \textbf{rename\_keys} twice
700 (one for each key to rename):
703 ...~|~rename\_keys~-{}-keys=V0,SEQ\_NAME~|~rename\_keys~-{}-keys=V1,SEQ
705 The first instance of \textbf{rename\_keys} replaces all the V0 keys
706 with SEQ\_NAME, and the second instance of \textbf{rename\_keys} replaces
707 all the V1 keys with SEQ. \emph{Et viola} the data can now be used
708 in the biotools that requires these keys.
711 \section{Manipulating Sequences}
714 \subsection{How to get sequence lengths?}
716 The length for sequences in records can be determined with \textbf{length\_seq},
717 which adds the key SEQ\_LEN to each record with the sequence length
718 as the value. It also generates an extra record that is emitted last
719 with the key TOTAL\_SEQ\_LEN showing the total length of all the sequences.
722 read\_fasta~-{}-data\_in=<file>~|~length\_seq
724 It is also possible to determine the sequence length using the generic
725 tool \textbf{length\_vals} (see \#\#\#), which determines the length
726 of the values for a given list of keys:
729 read\_fasta~-{}-data\_in=<file>~|~length\_vals~-{}-keys=SEQ
731 To obtain the total length of all sequences use \textbf{sum\_vals}
735 read\_fasta~-{}-data\_in=<file>~|~length\_vals~-{}-keys=SEQ
737 |~sum\_vals~-{}-keys=SEQ\_LEN
739 The biotool \textbf{analyze\_seq} will also determine the length of
740 each sequence (see~\ref{sub:How-to-analyze}).
743 \subsection{How to analyze sequence composition?\label{sub:How-to-analyze}}
745 If you want to find out the sequence type, composition, length, as
746 well as GC content, indel content and proportions of soft and hard
747 masked sequence, then use \textbf{analyze\_seq}. This handy biotool
748 will determine all these things per sequence from which it is easy
749 to get an overview using the \textbf{write\_tab} biotool to output
750 a table (see~\ref{sub:How-to-write-tab}). So in order to determine
751 the sequence composition of a FASTA file with just one entry containing
752 the sequence 'ATCG' we just read the data with \textbf{read\_fasta}
753 and run the output through \textbf{analyze\_seq} which will add the
754 analysis to the record like this:
757 read\_fasta~-{}-data\_in=test.fna~|~analyze\_seq~...
803 RES:\textasciitilde{}:~0
815 Now to make a table of how may As, Ts, Cs, and Gs you can add the
819 ...~|~analyze\_seq~|~write\_tab~-{}-keys=RES:A,RES:T,RES:C,RES:G
821 Or if you want to see the proportions of hard and soft masked sequence:
824 ...~|~analyse\_seq~|~write\_tab~-{}-keys=HARD\_MASK\%,SOFT\_MASK\%
826 If you have a stack of sequences in one file and you want to determine
827 the mean GC content you can do it using the \textbf{mean\_vals} biotool:
830 read\_fasta~-{}-data\_in=test.fna~|~analyze\_seq~|~mean\_vals~-{}-keys=GC\%
832 Or if you want the total count of Ns you can use \textbf{sum\_vals}
836 read\_fasta~-{}-data\_in=test.fna~|~analyze\_seq~|~sum\_vals~-{}-keys=RES:N
839 \subsection{How to extract subsequences?\label{sub:How-to-extract}}
841 In order to extract a subsequence from a longer sequence use the biotool
842 extract\_seq, which will replace the sequence in the record with the
843 subsequence (this behaviour should probably be modified to be dependant
844 of a -\/-replace or a -\/-no\_replace switch). So to extract the
845 first 20 residues from all sequences do (first residue is designated
849 ...~|~extract\_seq~-{}-beg=1~-{}-len=20
851 You can also specify a begin and end coordinate set:
854 ...~|~extract\_seq~-{}-beg=20~-{}-end=40
856 If you want the subsequences from position 20 to the sequence end
860 ...~|~extract\_seq~-{}-beg=20
862 If you want to extract subsequences a given distance from the sequence
863 end you can do this by reversing the sequence with the biotool \textbf{reverse\_seq}
864 \eqref{sub:How-to-reverse-seq}, followed by \textbf{extract\_seq}
865 to get the subsequence, and then \textbf{reverse\_seq} again to get
866 the subsequence back in the original orientation:
869 read\_fasta~-{}-data\_in=test.fna~|~reverse\_seq
871 |~extract\_seq~-{}-beg=10~-{}-len=10~|~reverse\_seq
874 \subsection{How to get genomic sequence?\label{sub:How-to-get-genomic-sequence}}
876 The biotool \textbf{get\_genomic\_seq} can extract subsequences for
877 a given genome specified with the -\/-genome switch explicitly using
878 the -\/-beg and -\/-end/-\/-len switches:
881 get\_genome\_seq~-{}-genome=<genome>~-{}-beg=1~-{}-len=100
883 Alternatively, \textbf{get\_genome\_seq} can be used to append the
884 corresponding sequence to BED, PSL, and BLAST records:
887 read\_bed~-{}-data\_in=<BED~file>~|~get\_genome\_seq~-{}-genome=<genome>
890 \subsection{How to upper-case sequences?}
892 Sequences can be shifted from lower case to upper case using \textbf{uppercase\_seq}:
898 \subsection{How to reverse sequences?\label{sub:How-to-reverse-seq}}
900 The order of residues in a sequence can be reversed using reverse\_seq:
905 Note that in order to reverse/complement a sequence you also need
906 the \textbf{complement\_seq} biotool (see~\ref{sub:How-to-complement}).
909 \subsection{How to complement sequences?\label{sub:How-to-complement}}
911 DNA and RNA sequences can be complemented with \textbf{complement\_seq},
912 which automagically determines the sequence type:
915 ...~|~complement\_seq
917 Note that in order to reverse/complement a sequence you also need
918 the \textbf{reverse\_seq} biotool (see~\ref{sub:How-to-reverse-seq}).
921 \subsection{How to remove indels from sequnces?}
923 Indels can be removed from sequences with the \textbf{remove\_indels}
924 biotool. This is useful if you have aligned some sequences (see~\ref{sub:How-to-align})
925 and extracted (see~\ref{sub:How-to-extract}) a block of subsequences
926 from the alignment and you want to use these sequence in a search
927 where you need to remove the indels first. '-', '\textasciitilde{}',
928 and '.' are considered indels:
934 \subsection{How to split sequences into overlapping subsequences?}
936 Sequences can be slit into overlapping subsequences with the \textbf{split\_seq}
940 ...~|~split\_seq~-{}-word\_size=20~-{}-uniq
943 \subsection{How to determine the oligo frequency?}
945 In order to determine if any oligo usage is over represented in one
946 or more sequences you can determine the frequency of oligos of a given
947 size with \textbf{oligo\_freq}:
950 ...~|~oligo\_freq~-{}-word\_size=4
952 And if you have more than one sequence and want to accumulate the
953 frequences you need the -\/-all switch:
956 ...~|~oligo\_freq~-{}-word\_size=4~-{}-all
958 To get a meaningful result you need to write the resulting frequencies
959 as a table with \textbf{write\_tab} (see~\ref{sub:How-to-write-tab}),
960 but first it is important to \textbf{grab} (see~\ref{sub:How-to-grab})
961 the records with the frequencies to avoid full length sequences in
965 ...~|~oligo\_freq~-{}-word\_size=4~-{}-all~|~grab~-{}-pattern=OLIGO~-{}-keys\_only
967 |~write\_tab~-{}-no\_stream
969 And the resulting frequency table can be sorted with Unix sort (man
973 \subsection{How to search for sequences in genomes?}
975 See the following biotool:
978 \item \textbf{patscan\_seq} \eqref{sub:How-to-use-patscan}
979 \item \textbf{blat\_seq} \eqref{sub:How-to-use-BLAT}
980 \item \textbf{blast\_seq} \eqref{sub:How-to-use-BLAST}
981 \item \textbf{vmatch\_seq} \eqref{sub:How-to-use-Vmatch}
984 \subsection{How to search sequences for a pattern?\label{sub:How-to-use-patscan}}
986 It is possible to search sequences in the data stream for patterns
987 using the \textbf{patscan\_seq} biotool which utilizes the powerful
988 scan\_for\_matches engine. Consult the documentation for scan\_for\_matches
989 in order to learn how to define patterns (the documentation is included
990 in Appendix~\ref{sec:scan_for_matches-README}).
992 To search all sequences for a simple pattern consisting of the sequence
993 ATCGATCG allowing for 3 mismatches, 2 insertions and 1 deletion:
996 read\_fasta~-{}-data\_in=<file>~|~patscan\_seq~-{}-pattern='ATCGATCG{[}3,2,1]'
998 The -\/-pattern switch takes a comma seperated list of patterns,
999 so if you want to search for more that one pattern do:
1002 ...~|~patscan\_seq~-{}-pattern='ATCGATCG{[}3,2,1],GCTAGCTA{[}3,2,1]'
1004 It is also possible to have a list of different patterns to search
1005 for in a file with one pattern per line. In order to get \textbf{patscan\_seq}
1006 to read these patterns use the -\/-pattern\_in switch:
1009 ...~|~patscan\_seq~-{}-pattern\_in=<file>
1011 To also scan the complementary strand in nucleotide sequences (\textbf{patscan\_seq}
1012 automagically determines the sequence type) you need to add the -\/-comp
1016 ...~|~patscan\_seq~-{}-pattern=<pattern>~-{}-comp
1018 It is also possible to use \textbf{patscan\_seq} to output those records
1019 that does not contain a certain pattern by using the -\/-invert switch:
1022 ...~|~patscan\_seq~-{}-pattern=<pattern>~-{}-invert
1024 Finally, \textbf{patscan\_seq} can also scan for patterns in a given
1025 genome sequence, instead of sequences in the stream, using the -\/-genome
1029 patscan~-{}-pattern=<pattern>~-{}-genome=<genome>
1032 \subsection{How to use BLAT for sequence search?\label{sub:How-to-use-BLAT}}
1034 Sequences in the data stream can be matched against supported genomes
1035 using \textbf{blat\_seq} which is a biotool using BLAT as the name
1036 might suggest. Currently only Mouse and Human genomes are available
1037 and it is not possible to use OOC files since there is still a need
1038 for a local repository for genome files. Otherwise it is just:
1041 read\_fasta~-{}-data\_in=<file>~|~blat\_seq~-{}-genome=<genome>
1043 The search results can then be written to file with \textbf{write\_psl}
1044 (see~\ref{sub:How-to-write-PSL}) or \textbf{write\_bed} (see~\ref{sub:How-to-write-BED})
1045 allthough with \textbf{write\_bed} some information will be lost).
1046 It is also possible to plot chromosome distribution of the search
1047 results using \textbf{plot\_chrdist} (see~\ref{sub:How-to-plot-chrdist})
1048 or the distribution of the match lengths using \textbf{plot\_lendist}
1049 (see~\ref{sub:How-to-plot-lendist}) or a karyogram with the hits
1050 using \textbf{plot\_karyogram} (see~\ref{sub:How-to-plot-karyogram}).
1053 \subsection{How to use BLAST for sequence search?\label{sub:How-to-use-BLAST}}
1055 Two biotools exist for blasting sequences: \textbf{create\_blast\_db}
1056 is used to create the BLAST database required for BLAST which is queried
1057 using the biotool \textbf{blast\_seq}. So in order to create a BLAST
1058 database from sequences in the data stream you simple run:
1061 ...~|~create\_blast\_db~-{}-database=my\_database~-{}-no\_stream
1063 The type of sequence to use for the database is automagically determined
1064 by \textbf{create\_blast\_db}, but don't have a mixture of peptide
1065 and nucleic acids sequences in the stream. The -\/-database switch
1066 takes a path as argument, but will default to 'blastdb\_<time\_stamp>
1069 The resulting database can now be queried with sequences in another
1070 data stream using \textbf{blast\_seq}:
1073 ...~|~blast\_seq~-{}-database=my\_database
1075 Again, the sequence type is determined automagically and the appropriate
1076 BLAST program is guessed (see below table), however, the program name
1077 can be overruled with the -\/-program switch.
1079 \noindent \begin{center}
1080 \begin{tabular}{ccc}
1081 Subject sequence & Query sequence & Program guess\tabularnewline
1083 Nucleotide & Nucleotide & blastn\tabularnewline
1084 Protein & Protein & blastp\tabularnewline
1085 Protein & Nucleotide & blastx\tabularnewline
1086 Nucleotide & Protein & tblastn\tabularnewline
1090 Finally, it is also possible to use \textbf{blast\_seq} for blasting
1091 sequences agains a preformatted genome using the -\/-genome switch
1092 instead of the -\/-database switch:
1095 ...~|~blast\_seq~-{}-genome=<genome>
1098 \subsection{How to use Vmatch for sequence search?\label{sub:How-to-use-Vmatch}}
1100 The powerful suffix array software package Vmatch%
1101 \footnote{\url{http://www.vmatch.de/}%
1102 } can be used for exact mapping of sequences against indexed genomes
1103 using the biotool \textbf{vmatch\_seq}, which will e.g. map 700000
1104 ESTs to the human genome locating all 160 mio hits in less than an
1108 ...~|~vmatch\_seq~-{}-genome=<genome>
1110 Only nucleotide sequences and sequences longer than 11 nucleotides
1111 will be mapped. The resulting SCORE key will hold the number of genome
1112 matches of a given sequence (multi-mappers).
1115 \subsection{How to find all matches between sequences?\label{sub:How-to-find-matches}}
1117 All matches between two sequences can be determined with the biotool
1118 \textbf{match\_seq}. The match finding engine underneath the hood
1119 of \textbf{match\_seq} is the super fast suffix tree program MUMmer%
1120 \footnote{\url{http://mummer.sourceforge.net/}%
1121 }, which will locate all forward and reverse matches between huge sequences
1122 in a matter of minutes (if the repeat count is not too high and if
1123 the word size used is appropriate). Matching two \emph{Helicobacter
1124 pylori} genomes (1.7Mbp) takes around 10 seconds:
1127 ...~|~match\_seq~-{}-word\_size=20~-{}-direction=both
1129 The output from \textbf{match\_seq} can be used to generate a dot
1130 plot with \textbf{plot\_matches} (see~\ref{sub:How-to-generate-dotplot}).
1133 \subsection{How to align sequences?\label{sub:How-to-align}}
1135 Sequences in the stream can be aligned with the \textbf{align\_seq}
1136 biotool that uses Muscle%
1137 \footnote{\url{http://www.drive5.com/muscle/muscle.html}%
1138 } as aligment engine. Currently you cannot change any of the Muscle
1139 alignment parameters and \textbf{align\_seq} will create an alignment
1140 based on the defaults (which are really good!):
1145 The aligned output can be written to file in FASTA format using \textbf{write\_fasta}
1146 (see~\ref{sub:How-to-write-fasta}) or in pretty text using \textbf{write\_align}
1147 (see~\ref{sub:How-to-write-alignment}).
1150 \subsection{How to create a weight matrix?}
1152 If you want a weight matrix to show the sequence composition of a
1153 stack of sequences you can use the biotool create\_weight\_matrix:
1156 ...~|~create\_weight\_matrix
1158 The result can be output in percent using the -\/-percent switch:
1161 ...~|~create\_weight\_matrix~-{}-percent
1163 The weight matrix can be written as tabular output with \textbf{write\_tab}
1164 (see~\ref{sub:How-to-write-tab}) after removeing the records containing
1165 SEQ with \textbf{grab} (see~\ref{sub:How-to-grab}):
1168 ...~|~create\_weight\_matrix~|~grab~-{}-invert~-{}-keys=SEQ~-{}-keys\_only
1170 |~write\_tab~-{}-no\_stream
1172 The V0 column will hold the residue, while the rest of the columns
1173 will hold the frequencies for each sequence position.
1178 There exists several biotools for plotting. Some of these are based
1180 \footnote{\url{http://www.gnuplot.info/}%
1181 }, which is an extremely powerful platform to generate all sorts of
1182 plots and even though GNUplot has quite a steep learning curve, the
1183 biotools utilizing GNUplot are simple to use. GNUplot is able to output
1184 a lot of different formats (called terminals in GNUplot), but the
1185 biotools focusses on three formats only:
1188 \item The 'dumb' terminal is default to the GNUplot based biotools and will
1189 output a plot in crude ASCII text (Fig.~\ref{fig:Dumb-terminal}).
1190 This is quite nice for a quick and dirty plot to get an overview of
1192 \item The 'post' or 'postscript' terminal output postscript code which is
1193 publication grade graphics that can be viewed with applications such
1194 as Ghostview, Photoshop, and Preview.
1195 \item The 'svg' terminal output's scalable vector graphics (SVG) which is
1196 a vector based format. SVG is great because you can edit the resulting
1197 plot using Photoshop or Inkscape%
1198 \footnote{Inkscape is a really handy drawing program that is free and open source.
1199 Availble at \url{http://www.inkscape.org}%
1200 } if you want to add additional labels, captions, arrows, and so on
1201 and then save the result in different formats, such as postscript
1202 without loosing resolution.
1204 The biotools for plotting that are not based on GNUplot only output
1205 SVG (that may change in the future).
1209 \noindent \begin{centering}
1210 \includegraphics[width=12cm]{lendist_ascii}
1213 \caption{\label{fig:Dumb-terminal}Dumb terminal}
1217 The output of a length distribution plot in the default 'dumb terminal'
1218 to the terminal window.
1225 \subsection{How to plot a histogram?\label{How-to-plot-histogram}}
1227 A generic histogram for a given value can be plotted with the biotool
1228 \textbf{plot\_histogram} (Fig.~\ref{fig:Histogram}):
1231 ...~|~plot\_histogram~-{}-key=TISSUE~-{}-no\_stream
1235 \noindent \begin{flushleft}
1238 \noindent \begin{centering}
1239 \includegraphics[width=12cm]{histogram}
1242 \caption{\label{fig:Histogram}Histogram}
1249 \subsection{How to plot a length distribution?\label{sub:How-to-plot-lendist}}
1251 Plotting of length distributions, weather sequence lengths, patterns
1252 lengths, hit lengths, \emph{etc.} is a really handy thing and can
1253 be done with the the biotool \textbf{plot\_lendist}. If you have a
1254 file with FASTA entries and want to plot the length distribution you
1258 read\_fasta~-{}-data\_in=<file>~|~length\_seq
1260 |~plot\_lendist~-{}-key=SEQ\_LEN~-{}-no\_stream
1262 The result will be written to the default dumb terminal and will look
1263 like Fig.~\ref{fig:Dumb-terminal}.
1265 If you instead want the result in postscript format you can do:
1268 ...~|~plot\_lendist~-{}-key=SEQ\_LEN~-{}-terminal=post~-{}-result\_out=file.ps
1270 That will generate the plot and save it to file, but not interrupt
1271 the data stream which can then be used in further analysis. You can
1272 also save the plot implicetly using '>', however, it is then important
1273 to terminate the stream with the -\/-no\_stream switch:
1276 ...~|~plot\_lendist~-{}-key=SEQ\_LEN~-{}-terminal=post~-{}-no\_stream~>~file.ps
1278 The resulting plot can be seen in Fig.~\ref{fig:Length-distribution}.
1284 \noindent \begin{centering}
1285 \includegraphics[width=12cm]{lendist}
1288 \caption{\label{fig:Length-distribution}Length distribution}
1292 Length distribution of 630 piRNA like RNAs.
1299 \subsection{How to plot a chromosome distribution?\label{sub:How-to-plot-chrdist}}
1301 If you have the result of a sequence search against a multi chromosome
1302 genome, it is very practical to be able to plot the distribution of
1303 search hits on the different chromosomes. This can be done with \textbf{plot\_chrdist}:
1306 read\_fasta~-{}-data\_in=<file>~|~blat\_genome~|~plot\_chrdist~-{}-no\_stream
1308 The above example will result in a crude plot using the 'dumb' terminal,
1309 and if you want to mess around with the results from the BLAT search
1310 you probably want to save the result to file first (see~\ref{sub:How-to-write-PSL}).
1311 To plot the chromosome distribution from the saved search result you
1315 read\_bed~-{}-data\_in=file.bed~|~plot\_chrdist~-{}-terminal=post~-{}-result\_out=plot.ps
1317 That will result in the output show in Fig.~\ref{fig:Chromosome-distribution}.
1323 \noindent \begin{centering}
1324 \includegraphics[angle=90,width=12cm]{chrdist}
1327 \caption{\label{fig:Chromosome-distribution}Chromosome distribution}
1333 \subsection{How to generate a dotplot?\label{sub:How-to-generate-dotplot}}
1335 A dotplot is a powerful way to get an overview of the size and location
1336 of sequence insertions, deletions, and duplications between two sequences.
1337 Generating a dotplot with biotools is a two step process where you
1338 initially find all matches between two sequences using the tool \textbf{match\_seq}
1339 (see~\ref{sub:How-to-find-matches}) and plot the resulting matches
1340 with \textbf{plot\_matches}. Matching and plotting two \emph{Helicobacter
1341 pylori} genomes (1.7Mbp) takes around 10 seconds:
1344 ...~|~match\_seq~|~plot\_matches~-{}-terminal=post~-{}-result\_out=plot.ps
1346 The resulting dotplot is in Fig.~\ref{fig:Dotplot}.
1350 \noindent \begin{centering}
1351 \includegraphics[width=12cm]{dotplot}
1354 \caption{\label{fig:Dotplot}Dotplot}
1358 Forward matches are displayed in green while reverse matches are displayed
1366 \subsection{How to plot a sequence logo?}
1368 Sequence logos can be generate with \textbf{plot\_seqlogo}. The sequnce
1369 type is determined automagically and an entropy scale of 2 bits and
1370 4 bits is used for nucleotide and peptide sequences, respectively%
1371 \footnote{\url{http://www.ccrnp.ncifcrf.gov/~toms/paper/hawaii/latex/node5.html}%
1375 ...~|~plot\_seqlogo~-{}-no\_stream~-{}-result\_out=seqlogo.svg
1377 An example of a sequence logo can be seen in Fig.~\ref{fig:Sequence-logo}.
1381 \noindent \begin{centering}
1382 \includegraphics[width=12cm]{seqlogo}
1385 \caption{\label{fig:Sequence-logo}Sequence logo}
1391 \subsection{How to plot a karyogram?\label{sub:How-to-plot-karyogram}}
1393 To plot search hits on genomes use \textbf{plot\_karyogram}, which
1394 will output a nice karyogram in SVG graphics:
1397 ...~|~plot\_karyogram~-{}-result\_out=karyogram.svg
1399 The banding data is taken from the UCSC genome browser database and
1400 currently only Human and Mouse is supported. Fig.~\ref{fig:Karyogram}
1401 shows the distribution of piRNA like RNAs matched to the Human genome.
1405 \noindent \begin{centering}
1406 \includegraphics[width=12cm]{karyogram}
1409 \caption{\label{fig:Karyogram}Karyogram}
1413 Hits from a search of piRNA like RNAs in the Human genome is displayed
1414 as short horizontal bars.
1421 \section{Uploading Results}
1424 \subsection{How do I display my results in the UCSC Genome Browser?}
1426 Results from the list of biotools below can be uploaded directly to
1427 a local mirror of the UCSC Genome Browser using the biotool \textbf{upload\_to\_ucsc}:
1430 \item patscan\_seq \eqref{sub:How-to-use-patscan}
1431 \item blat\_seq \eqref{sub:How-to-use-BLAT}
1432 \item blast\_seq \eqref{sub:How-to-use-BLAST}
1433 \item vmatch\_seq \eqref{sub:How-to-use-Vmatch}
1435 The syntax for uploading data the most simple way requires two mandatory
1436 switches: -\/-database, which is the UCSC database name (such as
1437 hg18, mm9, etc.) and-\/-table which should be the users initials
1438 followed by an underscore and a short description of the data:
1441 ...~|~upload\_to\_ucsc~-{}-database=hg18~-{}-table=mah\_snoRNAs
1443 The \textbf{upload\_to\_ucsc} biotool modifies the users \textasciitilde{}/ucsc/my\_tracks.ra
1444 file automagically (a backup is created with the name \textasciitilde{}/ucsc/my\_tracks.ra\textasciitilde{})
1445 with default values that can be overridden using the following switches:
1448 \item -\/-short\_label - Short label for track - Default=database->table
1449 \item -\/-long\_label - Long label for track - Default=database->table
1450 \item -\/-group - Track group name - Default=<user name as defined in env>
1451 \item -\/-priority - Track display priority - Default=1
1452 \item -\/-color - Track color - Default=147,73,42
1453 \item -\/-chunk\_size - Chunks for loading - Default=10000000
1454 \item -\/-visibility - Track visibility - Default=pack
1456 Also, data in BED or PSL format can be uploaded with \textbf{upload\_to\_ucsc}
1457 as long as these reference to genomes and chromosomes existing in
1458 the UCSC Genome Browser:
1461 read\_bed~-{}-data\_in=<bed~file>~|~upload\_to\_ucsc~...
1465 read\_psl~-{}-data\_in=<psl~file>~|~upload\_to\_ucsc~...
1468 \section{Trouble shooting}
1470 Shoot the messenger!
1474 \section{Keys\label{sec:Keys}}
1489 \section{Switches\label{sec:Switches}}
1504 \section{scan\_for\_matches README\label{sec:scan_for_matches-README}}
1507 ~~~~~~~~~~~~~~~~~~~~~~~~~~scan\_for\_matches:
1509 ~~~~A~Program~to~Scan~Nucleotide~or~Protein~Sequences~for~Matching~Patterns
1511 ~~~~~~~~~~~~~~~~~~~~~~~~Ross~Overbeek
1513 ~~~~~~~~~~~~~~~~~~~~~~~~MCS
1515 ~~~~~~~~~~~~~~~~~~~~~~~~Argonne~National~Laboratory
1517 ~~~~~~~~~~~~~~~~~~~~~~~~Argonne,~IL~60439
1519 ~~~~~~~~~~~~~~~~~~~~~~~~USA
1521 Scan\_for\_matches~is~a~utility~that~we~have~written~to~search~for
1523 patterns~in~DNA~and~protein~sequences.~~I~wrote~most~of~the~code,
1525 although~David~Joerg~and~Morgan~Price~wrote~sections~of~an
1527 earlier~version.~~The~whole~notion~of~pattern~matching~has~a~rich
1529 history,~and~we~borrowed~liberally~from~many~sources.~~However,~it~is
1531 worth~noting~that~we~were~strongly~influenced~by~the~elegant~tools
1533 developed~and~distributed~by~David~Searls.~~My~intent~is~to~make~the
1535 existing~tool~available~to~anyone~in~the~research~community~that~might
1537 find~it~useful.~~I~will~continue~to~try~to~fix~bugs~and~make~suggested
1539 enhancements,~at~least~until~I~feel~that~a~superior~tool~exists.
1541 Hence,~I~would~appreciate~it~if~all~bug~reports~and~suggestions~are
1543 directed~to~me~at~Overbeek@mcs.anl.gov.~~
1545 I~will~try~to~log~all~bug~fixes~and~report~them~to~users~that~send~me
1547 their~email~addresses.~~I~do~not~require~that~you~give~me~your~name
1549 and~address.~~However,~if~you~do~give~it~to~me,~I~will~try~to~notify
1551 you~of~serious~problems~as~they~are~discovered.
1555 ~~~~The~distribution~should~contain~at~least~the~following~programs:
1557 ~~~~~~~~~~~~~~~~README~~~~~~~~~~~~~~~~~~-~~~~~This~document
1559 ~~~~~~~~~~~~~~~~ggpunit.c~~~~~~~~~~~~~~~-~~~~~One~of~the~two~source~files
1561 ~~~~~~~~~~~~~~~~scan\_for\_matches.c~~~~~~-~~~~~The~second~source~file
1565 ~~~~~~~~~~~~~~~~run\_tests~~~~~~~~~~~~~~~-~~~~~A~perl~script~to~test~things
1567 ~~~~~~~~~~~~~~~~show\_hits~~~~~~~~~~~~~~~-~~~~~A~handy~perl~script
1569 ~~~~~~~~~~~~~~~~test\_dna\_input~~~~~~~~~~-~~~~~Test~sequences~for~DNA
1571 ~~~~~~~~~~~~~~~~test\_dna\_patterns~~~~~~~-~~~~~Test~patterns~for~DNA~scan
1573 ~~~~~~~~~~~~~~~~test\_output~~~~~~~~~~~~~-~~~~~Desired~output~from~test
1575 ~~~~~~~~~~~~~~~~test\_prot\_input~~~~~~~~~-~~~~~Test~protein~sequences
1577 ~~~~~~~~~~~~~~~~test\_prot\_patterns~~~~~~-~~~~~Test~patterns~for~proteins
1579 ~~~~~~~~~~~~~~~~testit~~~~~~~~~~~~~~~~~~-~~~~~a~perl~script~used~for~test
1581 ~~~~Only~the~first~three~files~are~required.~~The~others~are~useful,
1583 ~~~~but~only~if~you~have~Perl~installed~on~your~system.~~If~you~do
1585 ~~~~have~Perl,~I~suggest~that~you~type
1589 ~~~~~~~~~~~~~~~~which~perl
1591 ~~~~to~find~out~where~it~installed.~~On~my~system,~I~get~the~following
1597 ~~~~~~~~~~~~~~~~clone\%~which~perl
1599 ~~~~~~~~~~~~~~~~/usr/local/bin/perl
1601 ~~~~indicating~that~Perl~is~installed~in~/usr/local/bin.~~Anyway,~once
1603 ~~~~you~know~where~it~is~installed,~edit~the~first~line~of~files~
1609 ~~~~replacing~/usr/local/bin/perl~with~the~appropriate~location.~~I
1611 ~~~~will~assume~that~you~can~do~this,~although~it~is~not~critical~(it
1613 ~~~~is~needed~only~to~test~the~installation~and~to~use~the~\char`\"{}show\_hits\char`\"{}
1615 ~~~~utility).~~Perl~is~not~required~to~actually~install~and~run
1617 ~~~~scan\_for\_matches.~
1619 ~~~~If~you~do~not~have~Perl,~I~suggest~you~get~it~and~install~it~(it
1621 ~~~~is~a~wonderful~utility).~~Information~about~Perl~and~how~to~get~it
1623 ~~~~can~be~found~in~the~book~\char`\"{}Programming~Perl\char`\"{}~by~Larry~Wall~and
1625 ~~~~Randall~L.~Schwartz,~published~by~O'Reilly~\&~Associates,~Inc.
1627 ~~~~To~get~started,~you~will~need~to~compile~the~program.~~~I~do~this
1631 ~~~~~~~~gcc~-O~-o~scan\_for\_matches~~ggpunit.c~scan\_for\_matches.c
1633 ~~~~If~you~do~not~use~GNU~C,~use~
1635 ~~~~~~~~cc~-O~-DCC~-o~scan\_for\_matches~~ggpunit.c~scan\_for\_matches.c
1637 ~~~~which~works~on~my~Sun.~~
1639 ~~~~Once~you~have~compiled~scan\_for\_matches,~you~can~verify~that~it
1643 ~~~~~~~~clone\%~run\_tests~tmp
1645 ~~~~~~~~clone\%~diff~tmp~test\_output
1647 ~~~~You~may~get~a~few~strange~lines~of~the~sort
1649 ~~~~~~~~clone\%~run\_tests~tmp
1651 ~~~~~~~~rm:~tmp:~No~such~file~or~directory
1653 ~~~~~~~~clone\%~diff~tmp~test\_output
1655 ~~~~These~should~cause~no~concern.~~However,~if~the~\char`\"{}diff\char`\"{}~shows~that
1657 ~~~~tmp~and~test\_output~are~different,~contact~me~(you~have~a
1661 ~~~~You~should~now~be~able~to~use~scan\_for\_matches~by~following~the
1663 ~~~~instructions~given~below~(which~is~all~the~normal~user~should~have
1665 ~~~~to~understand,~once~things~are~installed~properly).
1667 ~==============================================================
1669 How~to~run~scan\_for\_matches:
1671 ~~~~To~run~the~program,~you~type~need~to~create~two~files
1673 ~~~~1.~~the~first~file~contains~the~pattern~you~wish~to~scan~for;~I'll
1675 ~~~~~~~~call~this~file~pat\_file~in~what~follows~(but~any~name~is~ok)
1677 ~~~~2.~~the~second~file~contains~a~set~of~sequences~to~scan.~~These
1679 ~~~~~~~~should~be~in~\char`\"{}fasta~format\char`\"{}.~~Just~look~at~the~contents~of
1681 ~~~~~~~~test\_dna\_input~to~see~examples~of~this~format.~~Basically,
1683 ~~~~~~~~each~sequence~begins~with~a~line~of~the~form
1685 ~~~~~~~~~~~>sequence\_id
1687 ~~~~~~~~and~is~followed~by~one~or~more~lines~containing~the~sequence.
1689 ~~~~Once~these~files~have~been~created,~you~just~use
1691 ~~~~~~~~scan\_for\_matches~pat\_file~<~input\_file
1693 ~~~~to~scan~all~of~the~input~sequences~for~the~given~pattern.~~As~an
1695 ~~~~example,~suppose~that~pat\_file~contains~a~single~line~of~the~form
1697 ~~~~~~~~~~~~~~~~p1=4...7~3...8~\textasciitilde{}p1
1701 ~~~~~~~~~~~~~~~~scan\_for\_matches~pat\_file~<~test\_dna\_input
1703 ~~~~should~produce~two~\char`\"{}hits\char`\"{}.~~When~I~run~this~on~my~machine,~I~get
1705 ~~~~~~~~clone\%~scan\_for\_matches~pat\_file~<~test\_dna\_input
1707 ~~~~~~~~>tst1:{[}6,27]
1709 ~~~~~~~~cguaacc~ggttaacc~gguuacg~
1711 ~~~~~~~~>tst2:{[}6,27]
1713 ~~~~~~~~CGUAACC~GGTTAACC~GGUUACG~
1717 Simple~Patterns~Built~by~Matching~Ranges~and~Reverse~Complements
1719 ~~~~Let~me~first~explain~this~simple~pattern:
1723 ~~~~~~~~~~~~~~~~p1=4...7~3...8~\textasciitilde{}p1
1725 ~~~~The~pattern~consists~of~three~\char`\"{}pattern~units\char`\"{}~separated~by~spaces.
1727 ~~~~The~first~pattern~unit~is
1729 ~~~~~~~~~~~~~~~~p1=4...7
1731 ~~~~which~means~\char`\"{}match~4~to~7~characters~and~call~them~p1\char`\"{}.~~The
1733 ~~~~second~pattern~unit~is
1735 ~~~~~~~~~~~~~~~~3...8
1737 ~~~~which~means~\char`\"{}then~match~3~to~8~characters\char`\"{}.~~The~last~pattern~unit
1741 ~~~~~~~~~~~~~~~~\textasciitilde{}p1
1743 ~~~~which~means~\char`\"{}match~the~reverse~complement~of~p1\char`\"{}.~~The~first
1745 ~~~~reported~hit~is~shown~as
1747 ~~~~~~~~>tst1:{[}6,27]
1749 ~~~~~~~~cguaacc~ggttaacc~gguuacg~
1751 ~~~~which~states~that~characters~6~through~27~of~sequence~tst1~were
1753 ~~~~matched.~~\char`\"{}cguaac\char`\"{}~matched~the~first~pattern~unit,~\char`\"{}ggttaacc\char`\"{}~the
1755 ~~~~second,~and~\char`\"{}gguuacg\char`\"{}~the~third.~~This~is~an~example~of~a~common
1757 ~~~~type~of~pattern~used~to~search~for~sections~of~DNA~or~RNA~that
1759 ~~~~would~fold~into~a~hairpin~loop.
1761 Searching~Both~Strands
1763 ~~~~Now~for~a~short~aside:~scan\_for\_matches~only~searched~the
1765 ~~~~sequences~in~the~input~file;~it~did~not~search~the~opposite
1767 ~~~~strand.~~With~a~pattern~of~the~sort~we~just~used,~there~is~not
1769 ~~~~need~o~search~the~opposite~strand.~~However,~it~is~normally~the
1771 ~~~~case~that~you~will~wish~to~search~both~the~sequence~and~the
1773 ~~~~opposite~strand~(i.e.,~the~reverse~complement~of~the~sequence).
1775 ~~~~To~do~that,~you~would~just~use~the~\char`\"{}-c\char`\"{}~command~line.~~For~example,
1777 ~~~~~~~~scan\_for\_matches~-c~pat\_file~<~test\_dna\_input
1779 ~~~~Hits~on~the~opposite~strand~will~show~a~beginning~location~greater
1781 ~~~~than~te~end~location~of~the~match.
1783 Defining~Pairing~Rules~and~Allowing~Mismatches,~Insertions,~and~Deletions
1785 ~~~~Let~us~stop~now~and~ask~\char`\"{}What~additional~features~would~one~need~to
1787 ~~~~really~find~the~kinds~of~loop~structures~that~characterize~tRNAs,
1789 ~~~~rRNAs,~and~so~forth?\char`\"{}~~I~can~immediately~think~of~two:
1791 ~~~~~~~~a)~you~will~need~to~be~able~to~allow~non-standard~pairings
1793 ~~~~~~~~~~~(those~other~than~G-C~and~A-U),~and
1795 ~~~~~~~~b)~you~will~need~to~be~able~to~tolerate~some~number~of
1797 ~~~~~~~~~~~mismatches~and~bulges.
1801 ~~~~Let~me~first~show~you~how~to~handle~non-standard~\char`\"{}rules~for
1803 ~~~~pairing~in~reverse~complements\char`\"{}.~~Consider~the~following~pattern,
1805 ~~~~which~I~show~as~two~line~(you~may~use~as~many~lines~as~you~like~in
1807 ~~~~forming~a~pattern,~although~you~can~only~break~a~pattern~at~points
1809 ~~~~where~space~would~be~legal):
1811 ~~~~~~~~~~~~r1=\{au,ua,gc,cg,gu,ug,ga,ag\}~
1813 ~~~~~~~~~~~~p1=2...3~0...4~p2=2...5~1...5~r1\textasciitilde{}p2~0...4~\textasciitilde{}p1~~~~~~~~
1815 ~~~~The~first~\char`\"{}pattern~unit\char`\"{}~does~not~actually~match~anything;~rather,
1817 ~~~~it~defines~a~\char`\"{}pairing~rule\char`\"{}~in~which~standard~pairings~are
1819 ~~~~allowed,~as~well~as~G-A~and~A-G~(in~case~you~wondered,~Us~and~Ts
1821 ~~~~and~upper~and~lower~case~can~be~used~interchangably;~for~example
1823 ~~~~r1=\{AT,UA,gc,cg\}~could~be~used~to~define~the~\char`\"{}standard~rule\char`\"{}~for
1825 ~~~~pairings).~~The~second~line~consists~of~six~pattern~units~which
1827 ~~~~may~be~interpreted~as~follows:
1829 ~~~~~~~~~~~~p1=2...3~~~~~match~2~or~3~characters~(call~it~p1)
1831 ~~~~~~~~~~~~0...4~~~~~~~~match~0~to~4~characters
1833 ~~~~~~~~~~~~p2=2...5~~~~~match~2~to~5~characters~(call~it~p2)
1835 ~~~~~~~~~~~~1...5~~~~~~~~match~1~to~5~characters
1837 ~~~~~~~~~~~~r1\textasciitilde{}p2~~~~~~~~match~the~reverse~complement~of~p2,
1839 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~allowing~G-A~and~A-G~pairs
1841 ~~~~~~~~~~~~0...4~~~~~~~~match~0~to~4~characters~~~~~~~~
1843 ~~~~~~~~~~~~\textasciitilde{}p1~~~~~~~~~~match~the~reverse~complement~of~p1
1845 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~allowing~only~G-C,~C-G,~A-T,~and~T-A~pairs
1847 ~~~~Thus,~r1\textasciitilde{}p2~means~\char`\"{}match~the~reverse~complement~of~p2~using~rule~r1\char`\"{}.
1849 ~~~~Now~let~us~consider~the~issue~of~tolerating~mismatches~and~bulges.
1851 ~~~~You~may~add~a~\char`\"{}qualifier\char`\"{}~to~the~pattern~unit~that~gives~the
1853 ~~~~tolerable~number~of~\char`\"{}mismatches,~deletions,~and~insertions\char`\"{}.
1857 ~~~~~~~~~~~~~~~~p1=10...10~3...8~\textasciitilde{}p1{[}1,2,1]
1859 ~~~~means~that~the~third~pattern~unit~must~match~10~characters,
1861 ~~~~allowing~one~\char`\"{}mismatch\char`\"{}~(a~pairing~other~than~G-C,~C-G,~A-T,~or
1863 ~~~~T-A),~two~deletions~(a~deletion~is~a~character~that~occurs~in~p1,
1865 ~~~~but~has~been~\char`\"{}deleted\char`\"{}~from~the~string~matched~by~\textasciitilde{}p1),~and~one
1867 ~~~~insertion~(an~\char`\"{}insertion\char`\"{}~is~a~character~that~occurs~in~the~string
1869 ~~~~matched~by~\textasciitilde{}p1,~but~not~for~which~no~corresponding~character
1871 ~~~~occurs~in~p1).~~In~this~case,~the~pattern~would~match
1873 ~~~~~~~~~~~~~~ACGTACGTAC~GGGGGGGG~GCGTTACCT
1875 ~~~~which~is,~you~must~admit,~a~fairly~weak~loop.~~It~is~common~to
1877 ~~~~allow~mismatches,~but~you~will~find~yourself~using~insertions~and
1879 ~~~~deletions~much~more~rarely.~~In~any~event,~you~should~note~that
1881 ~~~~allowing~mismatches,~insertions,~and~deletions~does~force~the
1883 ~~~~program~to~try~many~additional~possible~pairings,~so~it~does~slow
1885 ~~~~things~down~a~bit.
1887 How~Patterns~Are~Matched
1889 ~~~~Now~is~as~good~a~time~as~any~to~discuss~the~basic~flow~of~control
1891 ~~~~when~matching~patterns.~~Recall~that~a~\char`\"{}pattern\char`\"{}~is~a~sequence~of
1893 ~~~~\char`\"{}pattern~units\char`\"{}.~~Suppose~that~the~pattern~units~were
1895 ~~~~~~~~u1~u2~u3~u4~...~un
1897 ~~~~The~scan~of~a~sequence~S~begins~by~setting~the~current~position
1899 ~~~~to~1.~~Then,~an~attempt~is~made~to~match~u1~starting~at~the
1901 ~~~~current~position.~~Each~attempt~to~match~a~pattern~unit~can
1903 ~~~~succeed~or~fail.~~If~it~succeeds,~then~an~attempt~is~made~to~match
1905 ~~~~the~next~unit.~~If~it~fails,~then~an~attempt~is~made~to~find~an
1907 ~~~~alternative~match~for~the~immediately~preceding~pattern~unit.~~If
1909 ~~~~this~succeeds,~then~we~proceed~forward~again~to~the~next~unit.~~If
1911 ~~~~it~fails~we~go~back~to~the~preceding~unit.~~This~process~is~called
1913 ~~~~\char`\"{}backtracking\char`\"{}.~~If~there~are~no~previous~units,~then~the~current
1915 ~~~~position~is~incremented~by~one,~and~everything~starts~again.~~This
1917 ~~~~proceeds~until~either~the~current~position~goes~past~the~end~of
1919 ~~~~the~sequence~or~all~of~the~pattern~units~succeed.~~On~success,
1921 ~~~~scan\_for\_matches~reports~the~\char`\"{}hit\char`\"{},~the~current~position~is~set
1923 ~~~~just~past~the~hit,~and~an~attempt~is~made~to~find~another~hit.
1925 ~~~~If~you~wish~to~limit~the~scan~to~simply~finding~a~maximum~of,~say,
1927 ~~~~10~hits,~you~can~use~the~-n~option~(-n~10~would~set~the~limit~to
1929 ~~~~10~reported~hits).~~For~example,
1931 ~~~~~~~~scan\_for\_matches~-c~-n~1~pat\_file~<~test\_dna\_input
1933 ~~~~would~search~for~just~the~first~hit~(and~would~stop~searching~the
1935 ~~~~current~sequences~or~any~that~follow~in~the~input~file).
1937 Searching~for~repeats:
1939 ~~~~In~the~last~section,~I~discussed~almost~all~of~the~details
1941 ~~~~required~to~allow~you~to~look~for~repeats.~~Consider~the~following
1943 ~~~~set~of~patterns:
1945 ~~~~~~~~p1=6...6~3...8~p1~~~(find~exact~6~character~repeat~separated
1947 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~by~to~8~characters)
1949 ~~~~~~~~p1=6...6~3..8~p1{[}1,0,0]~~~(allow~one~mismatch)
1951 ~~~~~~~~p1=3...3~p1{[}1,0,0]~p1{[}1,0,0]~p1{[}1,0,0]~~
1953 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~(match~12~characters~that~are~the~remains
1955 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~of~a~3-character~sequence~occurring~4~times)
1959 ~~~~~~~~p1=4...8~0...3~p2=6...8~p1~0...3~p2
1961 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~(This~would~match~things~like
1963 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ATCT~G~TCTTT~ATCT~TG~TCTTT
1965 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~)
1967 Searching~for~particular~sequences:
1969 ~~~~Occasionally,~one~wishes~to~match~a~specific,~known~sequence.
1971 ~~~~In~such~a~case,~you~can~just~give~the~sequence~(along~with~an
1973 ~~~~optional~statement~of~the~allowable~mismatches,~insertions,~and
1975 ~~~~deletions).~~Thus,
1977 ~~~~~~~~p1=6...8~GAGA~\textasciitilde{}p1~~~~(match~a~hairpin~with~GAGA~as~the~loop)
1979 ~~~~~~~~RRRRYYYY~~~~~~~~~~~~~(match~4~purines~followed~by~4~pyrimidines)
1981 ~~~~~~~~TATAA{[}1,0,0]~~~~~~~~~(match~TATAA,~allowing~1~mismatch)
1985 Matches~against~a~\char`\"{}weight~matrix\char`\"{}:
1987 ~~~~I~will~conclude~my~examples~of~the~types~of~pattern~units
1989 ~~~~available~for~matching~against~nucleotide~sequences~by~discussing~a
1991 ~~~~crude~implemetation~of~matching~using~a~\char`\"{}weight~matrix\char`\"{}.~~While~I
1993 ~~~~am~less~than~overwhelmed~with~the~syntax~that~I~chose,~I~think~that
1995 ~~~~the~reader~should~be~aware~that~I~was~thinking~of~generating
1997 ~~~~patterns~containing~such~pattern~units~automatically~from
1999 ~~~~alignments~(and~did~not~really~plan~on~typing~such~things~in~by
2001 ~~~~hand~very~often).~~Anyway,~suppose~that~you~wanted~to~match~a
2003 ~~~~sequence~of~eight~characters.~~The~\char`\"{}consensus\char`\"{}~of~these~eight
2005 ~~~~characters~is~GRCACCGS,~but~the~actual~\char`\"{}frequencies~of~occurrence\char`\"{}
2007 ~~~~are~given~in~the~matrix~below.~~Thus,~the~first~character~is~an~A
2009 ~~~~16\%~the~time~and~a~G~84\%~of~the~time.~~The~second~is~an~A~57\%~of
2011 ~~~~the~time,~a~C~10\%~of~the~time,~a~G~29\%~of~the~time,~and~a~T~4\%~of
2015 ~~~~~~~~~~~~~C1~~~~~C2~~~~C3~~~~C4~~~C5~~~~C6~~~~C7~~~~C8
2019 ~~~~~~~A~~~~~16~~~~~57~~~~~0~~~~95~~~~0~~~~18~~~~~0~~~~~0
2021 ~~~~~~~C~~~~~~0~~~~~10~~~~80~~~~~0~~100~~~~60~~~~~0~~~~50
2023 ~~~~~~~G~~~~~84~~~~~29~~~~~0~~~~~0~~~~0~~~~20~~~100~~~~50
2025 ~~~~~~~T~~~~~~0~~~~~~4~~~~20~~~~~5~~~~0~~~~~2~~~~~0~~~~~0~~~
2029 ~~~~One~could~use~the~following~pattern~unit~to~search~for~inexact
2031 ~~~~matches~related~to~such~a~\char`\"{}weight~matrix\char`\"{}:
2033 ~~~~~~~~\{(16,0,84,0),(57,10,29,4),(0,80,0,20),(95,0,0,5),
2035 ~~~~~~~~~(0,100,0,0),(18,60,20,2),(0,0,100,0),(0,50,50,0)\}~>~450
2037 ~~~~This~pattern~unit~will~attempt~to~match~exactly~eight~characters.
2039 ~~~~For~each~character~in~the~sequence,~the~entry~in~the~corresponding
2041 ~~~~tuple~is~added~to~an~accumulated~sum.~~If~the~sum~is~greater~than
2043 ~~~~450,~the~match~succeeds;~else~it~fails.
2045 ~~~~Recently,~this~feature~was~upgraded~to~allow~ranges.~~Thus,
2047 ~~600~>~~\{(16,0,84,0),(57,10,29,4),(0,80,0,20),(95,0,0,5),
2049 ~~~~~~~~~(0,100,0,0),(18,60,20,2),(0,0,100,0),(0,50,50,0)\}~>~450
2051 ~~~~will~work,~as~well.
2053 Allowing~Alternatives:
2055 ~~~~Very~occasionally,~you~may~wish~to~allow~alternative~pattern~units
2057 ~~~~(i.e.,~\char`\"{}match~either~A~or~B\char`\"{}).~~You~can~do~this~using~something
2061 ~~~~~~~~~~~~~~~~(~GAGA~|~GCGCA)
2063 ~~~~which~says~\char`\"{}match~either~GAGA~or~GCGCA\char`\"{}.~~You~may~take
2065 ~~~~alternatives~of~a~list~of~pattern~units,~for~example
2067 ~~~~~~~~(p1=3...3~3...8~\textasciitilde{}p1~|~p1=5...5~4...4~\textasciitilde{}p1~GGG)
2069 ~~~~would~match~one~of~two~sequences~of~pattern~units.~~There~is~one
2071 ~~~~clumsy~aspect~of~the~syntax:~to~match~a~list~of~alternatives,~you
2073 ~~~~need~to~fully~the~request.~~Thus,
2075 ~~~~~~~~(GAGA~|~(GCGCA~|~TTCGA))
2077 ~~~~would~be~needed~to~try~the~three~alternatives.
2081 ~~~~Sometimes~a~pattern~will~contain~a~sequence~of~distinct~ranges,
2083 ~~~~and~you~might~wish~to~limit~the~sum~of~the~lengths~of~the~matched
2085 ~~~~subsequences.~~~For~example,~suppose~that~you~basically~wanted~to
2087 ~~~~match~something~like
2089 ~~~~ARRYYTT~p1=0...5~GCA{[}1,0,0]~p2=1...6~\textasciitilde{}p1~4...8~\textasciitilde{}p2~p3=4...10~CCT
2091 ~~~~but~that~the~sum~of~the~lengths~of~p1,~p2,~and~p3~must~not~exceed
2093 ~~~~eight~characters.~~To~do~this,~you~could~add~
2095 ~~~~~~~~length(p1+p2+p3)~<~9
2097 ~~~~as~the~last~pattern~unit.~~It~will~just~succeed~or~fail~(but~does
2099 ~~~~not~actually~match~any~characters~in~the~sequence).
2103 Matching~Protein~Sequences
2105 ~~~~Suppose~that~the~input~file~contains~protein~sequences.~~In~this
2107 ~~~~case,~you~must~invoke~scan\_for\_matches~with~the~\char`\"{}-p\char`\"{}~option.~~You
2109 ~~~~cannot~use~aspects~of~the~language~that~relate~directly~to
2111 ~~~~nucleotide~sequences~(e.g.,~the~-c~command~line~option~or~pattern
2113 ~~~~constructs~referring~to~the~reverse~complement~of~a~previously
2115 ~~~~matched~unit).~~
2117 ~~~~You~also~have~two~additional~constructs~that~allow~you~to~match
2119 ~~~~either~\char`\"{}one~of~a~set~of~amino~acids\char`\"{}~or~\char`\"{}any~amino~acid~other~than
2121 ~~~~those~a~given~set\char`\"{}.~~For~example,
2123 ~~~~~~~~p1=0...4~any(HQD)~1...3~notany(HK)~p1
2125 ~~~~would~successfully~match~a~string~like
2127 ~~~~~~~~~~~YWV~D~AA~C~YWV
2129 Using~the~show\_hits~Utility
2131 ~~~~When~viewing~a~large~set~of~complex~matches,~you~might~find~it
2133 ~~~~convenient~to~post-process~the~scan\_for\_matches~output~to~get~a
2135 ~~~~more~readable~version.~~We~provide~a~simple~post-processor~called
2137 ~~~~\char`\"{}show\_hits\char`\"{}.~~To~see~its~effect,~just~pipe~the~output~of~a
2139 ~~~~scan\_for\_matches~into~show\_hits:
2143 ~~~~~~~~clone\%~scan\_for\_matches~-c~pat\_file~<~tmp
2145 ~~~~~~~~>tst1:{[}1,28]
2147 ~~~~~~~~gtacguaacc~~ggttaac~cgguuacgtac~
2149 ~~~~~~~~>tst1:{[}28,1]
2151 ~~~~~~~~gtacgtaacc~~ggttaac~cggttacgtac~
2153 ~~~~~~~~>tst2:{[}2,31]
2155 ~~~~~~~~CGTACGUAAC~C~GGTTAACC~GGUUACGTACG~
2157 ~~~~~~~~>tst2:{[}31,2]
2159 ~~~~~~~~CGTACGTAAC~C~GGTTAACC~GGTTACGTACG~
2161 ~~~~~~~~>tst3:{[}3,32]
2163 ~~~~~~~~gtacguaacc~g~gttaactt~cgguuacgtac~
2165 ~~~~~~~~>tst3:{[}32,3]
2167 ~~~~~~~~gtacgtaacc~g~aagttaac~cggttacgtac~
2169 ~~~~~Piped~Through~show\_hits:
2173 ~~~~~~~~clone\%~scan\_for\_matches~-c~pat\_file~<~tmp~|~show\_hits
2175 ~~~~~~~~tst1:{[}1,28]:~~gtacguaacc~~~ggttaac~~cgguuacgtac
2177 ~~~~~~~~tst1:{[}28,1]:~~gtacgtaacc~~~ggttaac~~cggttacgtac
2179 ~~~~~~~~tst2:{[}2,31]:~~CGTACGUAAC~C~GGTTAACC~GGUUACGTACG
2181 ~~~~~~~~tst2:{[}31,2]:~~CGTACGTAAC~C~GGTTAACC~GGTTACGTACG
2183 ~~~~~~~~tst3:{[}3,32]:~~gtacguaacc~g~gttaactt~cgguuacgtac
2185 ~~~~~~~~tst3:{[}32,3]:~~gtacgtaacc~g~aagttaac~cggttacgtac
2189 ~~~~Optionally,~you~can~specify~which~of~the~\char`\"{}fields\char`\"{}~in~the~matches
2191 ~~~~you~wish~to~sort~on,~and~show\_hits~will~sort~them.~~The~field
2193 ~~~~numbers~start~with~0.~~So,~you~might~get~something~like
2195 ~~~~~~~~clone\%~scan\_for\_matches~-c~pat\_file~<~tmp~|~show\_hits~2~1
2197 ~~~~~~~~tst2:{[}2,31]:~~CGTACGUAAC~C~GGTTAACC~GGUUACGTACG
2199 ~~~~~~~~tst2:{[}31,2]:~~CGTACGTAAC~C~GGTTAACC~GGTTACGTACG
2201 ~~~~~~~~tst3:{[}32,3]:~~gtacgtaacc~g~aagttaac~cggttacgtac
2203 ~~~~~~~~tst1:{[}1,28]:~~gtacguaacc~~~ggttaac~~cgguuacgtac
2205 ~~~~~~~~tst1:{[}28,1]:~~gtacgtaacc~~~ggttaac~~cggttacgtac
2207 ~~~~~~~~tst3:{[}3,32]:~~gtacguaacc~g~gttaactt~cgguuacgtac
2211 ~~~~In~this~case,~the~hits~have~been~sorted~on~fields~2~and~1~(that~is,
2213 ~~~~the~third~and~second~matched~subfields).
2215 ~~~~show\_hits~is~just~one~possible~little~post-processor,~and~you
2217 ~~~~might~well~wish~to~write~a~customized~one~for~yourself.
2219 Reducing~the~Cost~of~a~Search
2221 ~~~~The~scan\_for\_matches~utility~uses~a~fairly~simple~search,~and~may
2223 ~~~~consume~large~amounts~of~CPU~time~for~complex~patterns.~~Someday,
2225 ~~~~I~may~decide~to~optimize~the~code.~~However,~until~then,~let~me
2227 ~~~~mention~one~useful~technique.~~
2229 ~~~~When~you~have~a~complex~pattern~that~includes~a~number~of~varying
2231 ~~~~ranges,~imprecise~matches,~and~so~forth,~it~is~useful~to
2233 ~~~~\char`\"{}pipeline\char`\"{}~matches.~~That~is,~form~a~simpler~pattern~that~can~be
2235 ~~~~used~to~scan~through~a~large~database~extracting~sections~that
2237 ~~~~might~be~matched~by~the~more~complex~pattern.~~Let~me~illustrate
2239 ~~~~with~a~short~example.~~Suppose~that~you~really~wished~to~match~the
2243 ~~~~p1=3...5~0...8~\textasciitilde{}p1{[}1,1,0]~p2=6...7~3...6~AGC~3...5~RYGC~\textasciitilde{}p2{[}1,0,0]
2245 ~~~~In~this~case,~the~pattern~units~AGC~3...5~RYGC~can~be~used~to~rapidly
2247 ~~~~constrain~the~overall~search.~~You~can~preprocess~the~overall
2249 ~~~~database~using~the~pattern:
2251 ~~~~~~~~~~31...31~AGC~3...5~RYGC~7...7
2253 ~~~~Put~the~complex~pattern~in~pat\_file1~and~the~simpler~pattern~in
2255 ~~~~pat\_file2.~~Then~use,
2257 ~~~~~~~~scan\_for\_matches~-c~pat\_file2~<~nucleotide\_database~|
2259 ~~~~~~~~scan\_for\_matches~pat\_file1
2261 ~~~~The~output~will~show~things~like
2263 ~~~~>seqid:{[}232,280]{[}2,47]
2267 ~~~~Then,~the~actual~section~of~the~sequence~that~was~matched~can~be
2269 ~~~~easily~computed~as~{[}233,278]~(remember,~the~positions~start~from
2273 ~~~~Let~me~finally~add,~you~should~do~a~few~short~experiments~to~see
2275 ~~~~whether~or~not~such~pipelining~actually~improves~performance~-{}-~it
2277 ~~~~is~not~always~obvious~where~the~time~is~going,~and~I~have
2279 ~~~~sometimes~found~that~the~added~complexity~of~pipelining~actually
2281 ~~~~slowed~things~up.~~It~gets~its~best~improvements~when~there~are
2283 ~~~~exact~matches~of~more~than~just~a~few~characters~that~can~be
2285 ~~~~rapidly~used~to~eliminate~large~sections~of~the~database.
2291 Feb~9,~1995:~~~the~pattern~units~\textasciicircum{}~and~\$~now~work~as~in~normal~regular
2293 ~~~~~~~~~~~~~~~expressions.~~That~is
2295 ~~~~~~~~~~~~~~~~~~~~~~~~TTF~\$
2297 ~~~~~~~~~~~~~~~matches~only~TTF~at~the~end~of~the~string~and~
2299 ~~~~~~~~~~~~~~~~~~~~~~~~\textasciicircum{}~TTF~
2301 ~~~~~~~~~~~~~~~matches~only~an~initial~TTF
2303 ~~~~~~~~~~~~~~~The~pattern~unit~
2305 ~~~~~~~~~~~~~~~~~~~~~~~~<p1
2307 ~~~~~~~~~~~~~~~matches~the~reverse~of~the~string~named~p1.~~That~is,
2309 ~~~~~~~~~~~~~~~if~p1~matched~GCAT,~then~<p1~would~match~TACG.~~Thus,
2311 ~~~~~~~~~~~~~~~~~~~p1=6...6~<p1
2313 ~~~~~~~~~~~~~~~matches~a~real~palindrome~(not~the~biologically~common
2315 ~~~~~~~~~~~~~~~meaning~of~\char`\"{}reverse~complement\char`\"{})