]> git.donarmstrong.com Git - biopieces.git/blob - bp_doc/biotools_cookbook.tex
b84da8811ed6e1f9804688b375a8456b41e02bfa
[biopieces.git] / bp_doc / biotools_cookbook.tex
1 %% LyX 1.5.1 created this file.  For more info, see http://www.lyx.org/.
2 %% Do not edit unless you really know what you are doing.
3 \documentclass[english]{scrartcl}
4 \usepackage[T1]{fontenc}
5 \usepackage[latin9]{inputenc}
6 \setlength{\parskip}{\medskipamount}
7 \setlength{\parindent}{0pt}
8 \usepackage{amsmath}
9 \usepackage{graphicx}
10 \IfFileExists{url.sty}{\usepackage{url}}
11                       {\newcommand{\url}{\texttt}}
12
13 \makeatletter
14
15 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% LyX specific LaTeX commands.
16 %% Because html converters don't know tabularnewline
17 \providecommand{\tabularnewline}{\\}
18
19 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Textclass specific LaTeX commands.
20 \newenvironment{lyxcode}
21 {\begin{list}{}{
22 \setlength{\rightmargin}{\leftmargin}
23 \setlength{\listparindent}{0pt}% needed for AMS classes
24 \raggedright
25 \setlength{\itemsep}{0pt}
26 \setlength{\parsep}{0pt}
27 \normalfont\ttfamily}%
28  \item[]}
29 {\end{list}}
30
31 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% User specified LaTeX commands.
32 \usepackage[colorlinks=true, urlcolor=blue, linkcolor=black]{hyperref}
33
34 \usepackage{babel}
35 \makeatother
36
37 \begin{document}
38
39 \title{Biotools Cookbook}
40
41
42 \author{Martin Asser Hansen}
43
44
45 \publishers{John Mattick Group\\
46 Institute for Molecular Bioscience\\
47 University of Queensland\\
48 Australia\\
49 E-mail: mail@maasha.dk}
50
51 \maketitle
52 \thispagestyle{empty}
53
54 \newpage{}
55
56 \tableofcontents{}
57
58 \listoffigures
59
60
61 \newpage{}
62
63
64 \section{Introduction}
65
66 Biotools is a selection of simple tools that can be linked together
67 (piped as we shall call it) in a very flexible manner to perform both
68 simple and complex tasks. The fundamental idea is that biotools work
69 on a data stream that will only terminate at the end of an analysis
70 and that this data stream can be passed through several different
71 biotools, each performing one specific task. The advantage of this
72 approach is that a user can perform simple and complex tasks without
73 having to write advanced code. Moreover, since the data format used
74 to pass data between biotools is text based, biotools can be written
75 by different developers in their favorite programming language ---
76 and still the biotools will be able to work together.
77
78 In the most simple form bioools can be piped together on the command
79 line like this (using the pipe character '|'):
80
81 \begin{lyxcode}
82 read\_data~|~calculate\_something~|~write\_result
83 \end{lyxcode}
84 However, a more comprehensive analysis could be composed:
85
86 \begin{lyxcode}
87 read\_data~|~select\_entries~|~convert\_entries~|~search\_database~~
88
89 evaluate\_results~|~plot\_diagram~|~plot\_another\_diagram~|
90
91 load\_to\_database
92 \end{lyxcode}
93 The data stream that is piped through the biotools consists of records
94 of key/value pairs in the same way a hash does in order to keep as
95 simple a structure as possible. An example record can be seen below:
96
97 \begin{lyxcode}
98
99
100 REC\_TYPE:~PATSCAN
101
102 MATCH:~AGATCAAGTG
103
104 S\_BEG:~7
105
106 S\_END:~16
107
108 ALIGN\_LEN:~9
109
110 S\_ID:~piR-t6
111
112 STRAND:~+
113
114 PATTERN:~AGATCAAGTG
115
116 -{}-{}-
117 \end{lyxcode}
118 The '-\/-\/-' denotes the delimiter of the records, and each key
119 is a word followed by a ':' and a white-space and then the value.
120 By convention the biotools only uses upper case keys (a list of used
121 keys can be seen in Appendix~\ref{sec:Keys}). Since the records
122 basically are hash structures this mean that the order of the keys
123 in the stream is unordered, and in the above example it is pure coincidence
124 that HIT\_BEG is displayed before HIT\_END, however, when the order
125 of the keys is importent, the biotools will automagically see to that.
126
127 All of the biotools are able to read and write a data stream to and
128 from file as long as the records are in the biotools format. This
129 means that if you are undertaking a lengthy analysis where one of
130 the steps is time consuming, you may save the stream after this step,
131 and subsequently start one or more analysis from that last step%
132 \footnote{It is a goal that the biotools at some point will be able to dump
133 the data stream to file in case one of the tools fail critically.%
134 }. If you are running a lengthy analysis it is highly recommended that
135 you create a small test sample of the data and run that through the
136 pipeline --- and once you are satisfied with the result proceed with
137 the full data set (see~\ref{sub:How-to-select-a-few-records}).
138
139
140 \section{The Data Stream}
141
142
143 \subsection{How to read the data stream from file?\label{sub:How-to-read-stream}}
144
145 You want to read a data stream that you previously have saved to file
146 in biotools format. This can be done implicetly or explicitly. The
147 implicit way uses the 'stdout' stream of the Unix terminal:
148
149 \begin{lyxcode}
150 cat~|~<biotool>
151 \end{lyxcode}
152 cat is the Unix command that reads a file and output the result to
153 'stdout' --- which in this case is piped to any biotool represented
154 by the <biotool>. It is also possible to read the data stream using
155 '<' to direct the 'stdout' stream into the biotool like this:
156
157 \begin{lyxcode}
158 <biotool>~<~<file>
159 \end{lyxcode}
160 However, that will not work if you pipe more biotools together. Then
161 it is much safer to read the stream from a file explicitly like this:
162
163 \begin{lyxcode}
164 <biotool>~-{}-stream\_in=<file>
165 \end{lyxcode}
166 Here the filename <file> is explicetly given to the biotool <biotool>
167 with the switch -\/-stream\_in. This switch works with all biotools.
168 It is also possible to read in data from multiple sources by repeating
169 the explicit read step:
170
171 \begin{lyxcode}
172 <biotool>~-{}-stream\_in=<file1>~|~<biotool>~-{}-stream\_in=<file2>
173 \end{lyxcode}
174
175 \subsection{How to write the data stream to file?\label{sub:How-to-write-stream}}
176
177 In order to save the output stream from a biotool to file, so you
178 can read in the stream again at a later time, you can do one of two
179 things:
180
181 \begin{lyxcode}
182 <biotool>~>~<file>
183 \end{lyxcode}
184 All, the biotools write the data stream to 'stdout' by default which
185 can be written to a file by redirecting 'stdout' to file using '>'
186 , however, if one of the biotools for writing other formats is used
187 then the both the biotools records as well as the result output will
188 go to 'stdout' in a mixture causing havock! To avoid this you must
189 use the switch -\/-stream\_out that explictly tells the biotool to
190 write the output stream to file:
191
192 \begin{lyxcode}
193 <biotool>~-{}-stream\_out=<file>
194 \end{lyxcode}
195 The -\/-stream\_out switch works with all biotools.
196
197
198 \subsection{How to terminate the data stream?}
199
200 The data stream is never stops unless the user want to save the stream
201 or by supplying the -\/-no\_stream switch that will terminate the
202 stream:
203
204 \begin{lyxcode}
205 <biotool>~-{}-no\_stream
206 \end{lyxcode}
207 The -\/-no\_stream switch only works with those biotools where it
208 makes sense that the user might want to terminale the data stream,
209 \emph{i.e}. after an analysis step where the user wants to output
210 the result, but not the data stream.
211
212
213 \subsection{How to write my results to file?\label{sub:How-to-write-result}}
214
215 Saving the result of an analysis to file can be done implicitly or
216 explicitly. The implicit way:
217
218 \begin{lyxcode}
219 <biotool>~-{}-no\_stream~>~<file>
220 \end{lyxcode}
221 If you use '>' to redirect 'stdout' to file then it is important to
222 use the -\/-no\_stream switch to avoid writing a mix of biotools
223 records and result to the same file causing havock. The safe way is
224 to use the -\/-result\_out switch which explicetly tells the biotool
225 to write the result to a given file:
226
227 \begin{lyxcode}
228 <biotool>~-{}-result\_out=<file>
229 \end{lyxcode}
230 Using the above method will not terminate the stream, so it is possible
231 to pipe that into another biotool generating different results:
232
233 \begin{lyxcode}
234 <biotool1>~-{}-result\_out=<file1>~|~<biotool2>~-{}-result\_out=<file2>
235 \end{lyxcode}
236 And still the data stream will continue unless terminated with -\/-no\_stream:
237
238 \begin{lyxcode}
239 <biotool>~-{}-result\_out=<file>~-{}-no\_stream
240 \end{lyxcode}
241 Or written to file using implicitly or explicity \eqref{sub:How-to-write-result}.
242 The explicit way:
243
244 \begin{lyxcode}
245 <biotool>~-{}-result\_out=<file1>~-{}-stream\_out=<file2>
246 \end{lyxcode}
247
248 \subsection{How to read data from multiple sources?}
249
250 To read multiple data sources, with the same type or different type
251 of data do:
252
253 \begin{lyxcode}
254 <biotool1>~-{}-data\_in=<file1>~|~<biotool2>~-{}-data\_in=<file2>
255 \end{lyxcode}
256 where type is the data type a specific biotool reads.
257
258
259 \section{Reading input}
260
261
262 \subsection{How to read biotools input?}
263
264 See \eqref{sub:How-to-read-stream}.
265
266
267 \subsection{How to read in data?}
268
269 Data in different formats can be read with the appropriate biotool
270 for that format. The biotools are typicalled named 'read\_<data type>'
271 such as \textbf{read\_fasta}, \textbf{read\_bed}, \textbf{read\_tab},
272 etc., and all behave in a similar manner. Data can be read by supplying
273 the -\/-data\_in switch and a file name to the file containing the
274 data:
275
276 \begin{lyxcode}
277 <biotool>~-{}-data\_in=<file>
278 \end{lyxcode}
279 It is also possible to read in a saved biotools stream (see \ref{sub:How-to-read-stream})
280 as well as reading data in one go:
281
282 \begin{lyxcode}
283 <biotool>~-{}-stream\_in=<file1>~-{}-data\_in=<file2>
284 \end{lyxcode}
285 If you want to read data from several files you can do this:
286
287 \begin{lyxcode}
288 <biotool>~-{}-data\_in=<file1>~|~<biotool>~-{}-data\_in=<file2>
289 \end{lyxcode}
290 If you have several data files you can read in all explicitly with
291 a comma separated list:
292
293 \begin{lyxcode}
294 <biotool>~-{}-data\_in=file1,file2,file3
295 \end{lyxcode}
296 And it is also possible to use file globbing:
297
298 \begin{lyxcode}
299 <biotool>~-{}-data\_in={*}.fna
300 \end{lyxcode}
301 Or in a combination:
302
303 \begin{lyxcode}
304 <biotool>~-{}-data\_in=file1,/dir/{*}.fna
305 \end{lyxcode}
306 Finally, it is possible to read in data in different formats using
307 the appropriate biotool for each format:
308
309 \begin{lyxcode}
310 <biotool1>~-{}-data\_in=<file1>~|~<biotool2>~-{}-data\_in=<file2>~...
311 \end{lyxcode}
312
313 \subsection{How to read FASTA input?}
314
315 Sequences in FASTA format can be read explicitly using \textbf{read\_fasta}:
316
317 \begin{lyxcode}
318 read\_fasta~-{}-data\_in=<file>
319 \end{lyxcode}
320
321 \subsection{How to read alignment input?}
322
323 If your alignment if FASTA formatted then you can \textbf{read\_align}.
324 It is also possible to use \textbf{read\_fasta} since the data is
325 FASTA formatted, however, with \textbf{read\_fasta} the key ALIGN
326 will be omitted. The ALIGN key is used to determine which sequences
327 belong to what alignment which is required for \textbf{write\_align}.
328
329 \begin{lyxcode}
330 read\_align~-{}-data\_in=<file>
331 \end{lyxcode}
332
333 \subsection{How to read tabular input?\label{sub:How-to-read-table}}
334
335 Tabular input can be read with \textbf{read\_tab} which will read
336 in all rows and chosen columns (separated by a given delimter) from
337 a table in text format.
338
339 The table below:
340
341 \noindent \begin{center}
342 \begin{tabular}{lll}
343 Human & ATACGTCAG & 23524\tabularnewline
344 Dog & AGCATGAC & 2442\tabularnewline
345 Mouse & GACTG & 234\tabularnewline
346 Cat & AAATGCA & 2342\tabularnewline
347 \end{tabular}
348 \par\end{center}
349
350 Can be read using the command:
351
352 \begin{lyxcode}
353 read\_tab~-{}-data\_in=<file>
354 \end{lyxcode}
355 Which will result in four records, one for each row, where the keys
356 V0, V1, V2 are the default keys for the organism, sequence, and count,
357 respectively. It is possible to select a subset of colums to read
358 by using the -\/-cols switch which takes a comma separated list of
359 columns numbers (first column is designated 0) as argument. So to
360 read in only the sequence and the count so that the count comes before
361 the sequence do:
362
363 \begin{lyxcode}
364 read\_tab~-{}-data\_in=<file>~-{}-cols=2,1
365 \end{lyxcode}
366 It is also possible to name the columns with the -\/-keys switch:
367
368 \begin{lyxcode}
369 read\_tab~-{}-data\_in=<file>~-{}-cols=2,1~-{}-keys=COUNT,SEQ
370 \end{lyxcode}
371
372 \subsection{How to read BED input?}
373
374 The BED (Browser Extensible Data%
375 \footnote{\url{http://genome.ucsc.edu/FAQ/FAQformat}%
376 }) format is a tabular format for data pertaining to one of the Eukaryotic
377 genomes in the UCSC genome brower%
378 \footnote{\url{http://genome.ucsc.edu/}%
379 }. The BED format consists of up to 12 columns, where the first three
380 are mandatory CHR, CHR\_BEG, and CHR\_END. The mandatory columns and
381 any of the optional columns can all be read in easily with the \textbf{read\_bed}
382 biotool.
383
384 \begin{lyxcode}
385 read\_bed~-{}-data\_in=<file>
386 \end{lyxcode}
387 It is also possible to read the BED file with \textbf{read\_tab} (see~\ref{sub:How-to-read-table}),
388 however, that will be more cumbersome because you need to specify
389 the keys:
390
391 \begin{lyxcode}
392 read\_tab~-{}-data\_in=<file>~-{}-keys=CHR,CHR\_BEG,CHR\_END~...
393 \end{lyxcode}
394
395 \subsection{How to read PSL input?}
396
397 The PSL format is the output from BLAT and contains 21 mandatory fields
398 that can be read with \textbf{read\_psl}:
399
400 \begin{lyxcode}
401 read\_psl~-{}-data\_in=<file>
402 \end{lyxcode}
403
404 \section{Writing output}
405
406 All result output can be written explicitly to file using the -\/-result\_out
407 switch which all result generating biotools have. It is also possible
408 to write the result to file implicetly by directing 'stdout' to file
409 using '>', however, that requires the -\/-no\_stream swich to prevent
410 a mixture of data stream and results in the file. The explicit (and
411 safe) way:
412
413 \begin{lyxcode}
414 ...~|~<biotool>~-{}-result\_out=<file>
415 \end{lyxcode}
416 The implicit way:
417
418 \begin{lyxcode}
419 ...~|~<biotool>~-{}-no\_stream~>~<file>
420 \end{lyxcode}
421
422 \subsection{How to write biotools output?}
423
424 See \eqref{sub:How-to-write-stream}.
425
426
427 \subsection{How to write FASTA output?\label{sub:How-to-write-fasta}}
428
429 FASTA output can be written with \textbf{write\_fasta}.
430
431 \begin{lyxcode}
432 ...~|~write\_fasta~-{}-result\_out=<file>
433 \end{lyxcode}
434 It is also possible to wrap the sequences to a given width using the
435 -\/-wrap switch allthough wrapping of sequence is generally an evil
436 thing:
437
438 \begin{lyxcode}
439 ...~|~write\_fasta~-{}-no\_stream~-{}-wrap=80
440 \end{lyxcode}
441
442 \subsection{How to write alignment output?\label{sub:How-to-write-alignment}}
443
444 Pretty alignments with ruler%
445 \footnote{'.' for every 10 residues, ':' for every 50, and '|' for every 100%
446 } and consensus sequence can be created with \textbf{write\_align},
447 what also have the optional -\/-wrap switch to break the alignment
448 into blocks of a given width: 
449
450 \begin{lyxcode}
451 ...~|~write\_align~-{}-result\_out=<file>~-{}-wrap=80
452 \end{lyxcode}
453 If the number of sequnces in the alignment is 2 then a pairwise alignment
454 will be output otherwise a multiple alignment. And if the sequence
455 type, determined automagically, is protein, then residues and symbols
456 (+,~:,~.) will be used to show consensus according to the Blosum62
457 matrix.
458
459
460 \subsection{How to write tabular output?\label{sub:How-to-write-tab}}
461
462 Outputting the data stream as a table can be done with \textbf{write\_tab},
463 which will write generate one row per record with the values as columns.
464 If you supply the optional -\/-comment switch, when the first row
465 in the table will be a 'comment' line prefixed with a '\#':
466
467 \begin{lyxcode}
468 ...~|~write\_tab~-{}-result\_out=<file>~-{}-comment
469 \end{lyxcode}
470 You can also change the delimiter from the default (tab) to \emph{e.g.}
471 ',':
472
473 \begin{lyxcode}
474 ...~|~write\_tab~-{}-result\_out=<file>~-{}-delimit=','
475 \end{lyxcode}
476 If you want the values output in a specific order you have to supply
477 a comma separated list using the -\/-keys switch that will print
478 only those keys in that order:
479
480 \begin{lyxcode}
481 ...~|~write\_tab~-{}-result\_out=<file>~-{}-keys=SEQ\_NAME,COUNT
482 \end{lyxcode}
483 Alternatively, if you have some keys that you don't want in the tabular
484 output, use the -\/-no\_keys switch. So to print all keys except
485 SEQ and SEQ\_TYPE do:
486
487 \begin{lyxcode}
488 ...~|~write\_tab~-{}-result\_out=<file>~-{}-no\_keys=SEQ,SEQ\_TYPE
489 \end{lyxcode}
490 Finally, if you have a stream containing a mix of different records
491 types, \emph{e.g.} records with sequences and records with matches,
492 then you can use \textbf{write\_tab} to output all the records in
493 tabluar format, however, the -\/-comment, -\/-keys, and -\/-no\_keys
494 switches will only respond to records of the first type encountered.
495 The reason is that outputting mixed records is probably not what you
496 want anyway, and you should remove all the unwanted records from the
497 stream before outputting the table: \textbf{grab} is your friend (see~\ref{sub:How-to-grab}).
498
499
500 \subsection{How to write a BED output?\label{sub:How-to-write-BED}}
501
502 Data in BED format can be output if the records contain the mandatory
503 keys CHR, CHR\_BEG, and CHR\_END using \textbf{write\_bed}. If the
504 optional keys are also present, they will be output as well:
505
506 \begin{lyxcode}
507 write\_bed~-{}-result\_out=<file>
508 \end{lyxcode}
509
510 \subsection{How to write PSL output?\label{sub:How-to-write-PSL}}
511
512 Data in PSL format can be output using \textbf{write\_psl:}
513
514 \begin{lyxcode}
515 write\_psl~-{}-result\_out=<file>
516 \end{lyxcode}
517
518 \section{Manipulating Records}
519
520
521 \subsection{How to select a few records?\label{sub:How-to-select-a-few-records}}
522
523 To quickly get an overview of your data you can limit the data stream
524 to show a few records. This also very useful to test the pipeline
525 with a few records if you are setting up a complex analysis using
526 several biotools. That way you can inspect that all goes well before
527 analyzing and waiting for the full data set. All of the read\_<type>
528 biotools have the -\/-num switch which will take a number as argument
529 and only that number of records will be read. So to read in the first
530 10 FASTA entries from a file:
531
532 \begin{lyxcode}
533 read\_fasta~-{}-data\_in=test.fna~-{}-num=10
534 \end{lyxcode}
535 Another way of doing this is to use \textbf{head\_records} will limit
536 the stream to show the first 10 records (default):
537
538 \begin{lyxcode}
539 ...~|~head\_records
540 \end{lyxcode}
541 Using \textbf{head\_records} directly after one of the read\_<type>
542 biotools will be a lot slower than using the -\/-num switch with
543 the read\_<type> biotools, however, \textbf{head\_records} can also
544 be used to limit the output from all the other biotools. It is also
545 possible to give \textbf{head\_records} a number of records to show
546 using the -\/-num switch. So to display the first 100 records do:
547
548 \begin{lyxcode}
549 ...~|~head\_records~-{}-num=100
550 \end{lyxcode}
551
552 \subsection{How to count all records in the data stream?}
553
554 To count all the records in the data stream use \textbf{count\_records},
555 which adds one record (which is not included in the count) to the
556 data stream. So to count the number of sequences in a FASTA file you
557 can do this:
558
559 \begin{lyxcode}
560 cat~test.fna~|~read\_fasta~|~count\_records~-{}-no\_stream
561 \end{lyxcode}
562 Which will write the last record containing the count to 'stdout':
563
564 \begin{lyxcode}
565 -{}-{}-
566
567 count\_records:~630
568 \end{lyxcode}
569 It is also possible to write the count to file using the -\/-result\_out
570 switch.
571
572
573 \subsection{How to grab specific records?\label{sub:How-to-grab}}
574
575 The biotool \textbf{grab} is related to the Unix grep and locates
576 records based on matching keys and/or values using either a pattern,
577 a Perl regex, or a numerical evaluation. To easily \textbf{grab} all
578 records in the stream that has any mentioning of the pattern 'human'
579 just pipe the data stream through \textbf{grab} like this:
580
581 \begin{lyxcode}
582 ...~|~grab~-{}-pattern=human
583 \end{lyxcode}
584 This will search for the pattern 'human' in all keys and all values.
585 The -\/-pattern switch takes a comma separated list of patterns,
586 so in order to match multiple patterns do:
587
588 \begin{lyxcode}
589 ...~|~grab~-{}-pattern=human,mouse
590 \end{lyxcode}
591 It is also possible to use the -\/-pattern\_in switch instead of
592 -\/-pattern. -\/-pattern\_in is used to read a file with one pattern
593 per line:
594
595 \begin{lyxcode}
596 ...~|~grab~-{}-pattern\_in=patterns.txt
597 \end{lyxcode}
598 If you want the opposite result --- to find all records that does
599 not match the patterns, add the -\/-invert switch, which not only
600 works with the -\/-pattern switch, but also with -\/-regex and -\/-eval:
601
602 \begin{lyxcode}
603 ...~|~grab~-{}-pattern=human~-{}-invert
604 \end{lyxcode}
605 If you want to search the record keys only, \emph{e.g.} to find all
606 records containing the key SEQ you can add the -\/-keys\_only switch.
607 This will prevent matching of SEQ in any record value, and in fact
608 SEQ is a not uncommon peptide sequence you could get an unwanted record.
609 Also, this will give an increase in speed since only the keys are
610 searched:
611
612 \begin{lyxcode}
613 ...~|~grab~-{}-pattern=SEQ~-{}-keys\_only
614 \end{lyxcode}
615 However, if you are interested in finding the peptide sequence SEQ
616 and not the SEQ key, just add the -\/-vals\_only switch instead:
617
618 \begin{lyxcode}
619 ...~|~grab~-{}-pattern=SEQ~-{}-vals\_only
620 \end{lyxcode}
621 Also, if you want to grab for certain key/value pairs you can supply
622 a comma separated list of keys whos values will then be searched using
623 the -\/-keys switch. This is handy if your records contain large
624 genomic sequences and you dont want to search the entire sequence
625 for \emph{e.g.} the organism name --- it is much faster to tell \textbf{grab}
626 which keys to search the value for:
627
628 \begin{lyxcode}
629 ...~|~grab~-{}-pattern=human~-{}-keys=SEQ\_NAME
630
631
632 \end{lyxcode}
633 It is also possible to invoke flexible matching using regex (regular
634 expressions) instead of simple pattern matching. In \textbf{grab}
635 the regex engine is Perl based and allows use of different type of
636 wild cards, alternatives, \emph{etc}%
637 \footnote{\url{http://perldoc.perl.org/perlreref.html}%
638 }. If you want to \textbf{grab} records withs the sequence ATCG or
639 GCTA you can do this:
640
641 \begin{lyxcode}
642 ...~|~grab~-{}-regex='ATCG|GCTA'
643 \end{lyxcode}
644 Or if you want to find sequences beginning with ATCG:
645
646 \begin{lyxcode}
647 ...~|~grab~-{}-regex='\textasciicircum{}ATCG'
648 \end{lyxcode}
649 You can also use \textbf{grab} to locate records that fulfill a numerical
650 property using the -\/-eval switch witch takes an expression in three
651 parts. The first part is the key that holds the number we want to
652 evaluate, the second part holds one the six operators:
653
654 \begin{enumerate}
655 \item Greater than: >
656 \item Greater than or equal to: >=
657 \item Less than: <
658 \item Less than or equal to: <=
659 \item Equal to: =
660 \item Not equal to: !=
661 \end{enumerate}
662 And finally comes the number used in the evaluation. So to \textbf{grab}
663 all records with a sequence length greater than 30:
664
665 \begin{lyxcode}
666 ...~length\_seq~|~grab~-{}-eval='SEQ\_LEN~>~30'
667 \end{lyxcode}
668 If you want to locate all records containing the pattern 'human' and
669 where the sequence length is greater that 30, you do this by running
670 the stream through \textbf{grab} twice:
671
672 \begin{lyxcode}
673 ...~|~grab~-{}-pattern='human'~|~length\_seq~|~grab~-{}-eval='SEQ\_LEN~>~30'
674 \end{lyxcode}
675 To get the best speed performance, use the most restrictive \textbf{grab}
676 first.
677
678
679 \subsection{How to remove keys from records?}
680
681 To remove one or more specific keys from all records in the data stream
682 use \textbf{remove\_keys} like this:
683
684 \begin{lyxcode}
685 ...~|~remove\_keys~-{}-keys=SEQ,SEQ\_NAME
686 \end{lyxcode}
687 In the above example SEQ and SEQ\_NAME will be removed from all records
688 if they exists in these. If all keys are removed from a record, then
689 the record will be removed.
690
691
692 \subsection{How to rename keys in records?}
693
694 Sometimes you want to rename a record key, \emph{e.g.} if you have
695 read in a two column table with sequence name and sequence in each
696 column (see \ref{sub:How-to-read-table}) without specifying the key
697 names, then the sequence name will be called V0 and the sequence V1
698 as default in the \textbf{read\_tab} biotool. To rename the V0 and
699 V1 keys we need to run the stream through \textbf{rename\_keys} twice
700 (one for each key to rename):
701
702 \begin{lyxcode}
703 ...~|~rename\_keys~-{}-keys=V0,SEQ\_NAME~|~rename\_keys~-{}-keys=V1,SEQ
704 \end{lyxcode}
705 The first instance of \textbf{rename\_keys} replaces all the V0 keys
706 with SEQ\_NAME, and the second instance of \textbf{rename\_keys} replaces
707 all the V1 keys with SEQ. \emph{Et viola} the data can now be used
708 in the biotools that requires these keys.
709
710
711 \section{Manipulating Sequences}
712
713
714 \subsection{How to get sequence lengths?}
715
716 The length for sequences in records can be determined with \textbf{length\_seq},
717 which adds the key SEQ\_LEN to each record with the sequence length
718 as the value. It also generates an extra record that is emitted last
719 with the key TOTAL\_SEQ\_LEN showing the total length of all the sequences.
720
721 \begin{lyxcode}
722 read\_fasta~-{}-data\_in=<file>~|~length\_seq
723 \end{lyxcode}
724 It is also possible to determine the sequence length using the generic
725 tool \textbf{length\_vals} (see \#\#\#), which determines the length
726 of the values for a given list of keys:
727
728 \begin{lyxcode}
729 read\_fasta~-{}-data\_in=<file>~|~length\_vals~-{}-keys=SEQ
730 \end{lyxcode}
731 To obtain the total length of all sequences use \textbf{sum\_vals}
732 like this:
733
734 \begin{lyxcode}
735 read\_fasta~-{}-data\_in=<file>~|~length\_vals~-{}-keys=SEQ
736
737 |~sum\_vals~-{}-keys=SEQ\_LEN
738 \end{lyxcode}
739 The biotool \textbf{analyze\_seq} will also determine the length of
740 each sequence (see~\ref{sub:How-to-analyze}).
741
742
743 \subsection{How to analyze sequence composition?\label{sub:How-to-analyze}}
744
745 If you want to find out the sequence type, composition, length, as
746 well as GC content, indel content and proportions of soft and hard
747 masked sequence, then use \textbf{analyze\_seq}. This handy biotool
748 will determine all these things per sequence from which it is easy
749 to get an overview using the \textbf{write\_tab} biotool to output
750 a table (see~\ref{sub:How-to-write-tab}). So in order to determine
751 the sequence composition of a FASTA file with just one entry containing
752 the sequence 'ATCG' we just read the data with \textbf{read\_fasta}
753 and run the output through \textbf{analyze\_seq} which will add the
754 analysis to the record like this:
755
756 \begin{lyxcode}
757 read\_fasta~-{}-data\_in=test.fna~|~analyze\_seq~...
758
759
760
761 -{}-{}-
762
763 GC\%:~50.00
764
765 HARD\_MASK\%:~0.00
766
767 RES:-:~0
768
769 RES:.:~0
770
771 RES:A:~1
772
773 RES:B:~0
774
775 RES:C:~1
776
777 RES:D:~0
778
779 RES:G:~1
780
781 RES:H:~0
782
783 RES:K:~0
784
785 RES:M:~0
786
787 RES:N:~0
788
789 RES:R:~0
790
791 RES:S:~0
792
793 RES:T:~1
794
795 RES:U:~0
796
797 RES:V:~0
798
799 RES:W:~0
800
801 RES:Y:~0
802
803 RES:\textasciitilde{}:~0
804
805 SEQ:~ATCG
806
807 SEQ\_LEN:~4
808
809 SEQ\_NAME:~test
810
811 SEQ\_TYPE:~DNA
812
813 SOFT\_MASK\%:~0.00
814 \end{lyxcode}
815 Now to make a table of how may As, Ts, Cs, and Gs you can add the
816 following:
817
818 \begin{lyxcode}
819 ...~|~analyze\_seq~|~write\_tab~-{}-keys=RES:A,RES:T,RES:C,RES:G
820 \end{lyxcode}
821 Or if you want to see the proportions of hard and soft masked sequence:
822
823 \begin{lyxcode}
824 ...~|~analyse\_seq~|~write\_tab~-{}-keys=HARD\_MASK\%,SOFT\_MASK\%
825 \end{lyxcode}
826 If you have a stack of sequences in one file and you want to determine
827 the mean GC content you can do it using the \textbf{mean\_vals} biotool:
828
829 \begin{lyxcode}
830 read\_fasta~-{}-data\_in=test.fna~|~analyze\_seq~|~mean\_vals~-{}-keys=GC\%
831 \end{lyxcode}
832 Or if you want the total count of Ns you can use \textbf{sum\_vals}
833 like this:
834
835 \begin{lyxcode}
836 read\_fasta~-{}-data\_in=test.fna~|~analyze\_seq~|~sum\_vals~-{}-keys=RES:N
837 \end{lyxcode}
838
839 \subsection{How to extract subsequences?\label{sub:How-to-extract}}
840
841 In order to extract a subsequence from a longer sequence use the biotool
842 extract\_seq, which will replace the sequence in the record with the
843 subsequence (this behaviour should probably be modified to be dependant
844 of a -\/-replace or a -\/-no\_replace switch). So to extract the
845 first 20 residues from all sequences do (first residue is designated
846 1): 
847
848 \begin{lyxcode}
849 ...~|~extract\_seq~-{}-beg=1~-{}-len=20
850 \end{lyxcode}
851 You can also specify a begin and end coordinate set:
852
853 \begin{lyxcode}
854 ...~|~extract\_seq~-{}-beg=20~-{}-end=40
855 \end{lyxcode}
856 If you want the subsequences from position 20 to the sequence end
857 do:
858
859 \begin{lyxcode}
860 ...~|~extract\_seq~-{}-beg=20
861 \end{lyxcode}
862 If you want to extract subsequences a given distance from the sequence
863 end you can do this by reversing the sequence with the biotool \textbf{reverse\_seq}
864 \eqref{sub:How-to-reverse-seq}, followed by \textbf{extract\_seq}
865 to get the subsequence, and then \textbf{reverse\_seq} again to get
866 the subsequence back in the original orientation:
867
868 \begin{lyxcode}
869 read\_fasta~-{}-data\_in=test.fna~|~reverse\_seq
870
871 |~extract\_seq~-{}-beg=10~-{}-len=10~|~reverse\_seq
872 \end{lyxcode}
873
874 \subsection{How to get genomic sequence?\label{sub:How-to-get-genomic-sequence}}
875
876 The biotool \textbf{get\_genomic\_seq} can extract subsequences for
877 a given genome specified with the -\/-genome switch explicitly using
878 the -\/-beg and -\/-end/-\/-len switches:
879
880 \begin{lyxcode}
881 get\_genome\_seq~-{}-genome=<genome>~-{}-beg=1~-{}-len=100
882 \end{lyxcode}
883 Alternatively, \textbf{get\_genome\_seq} can be used to append the
884 corresponding sequence to BED, PSL, and BLAST records:
885
886 \begin{lyxcode}
887 read\_bed~-{}-data\_in=<BED~file>~|~get\_genome\_seq~-{}-genome=<genome>
888 \end{lyxcode}
889
890 \subsection{How to upper-case sequences?}
891
892 Sequences can be shifted from lower case to upper case using \textbf{uppercase\_seq}:
893
894 \begin{lyxcode}
895 ...~|~uppercase\_seq
896 \end{lyxcode}
897
898 \subsection{How to reverse sequences?\label{sub:How-to-reverse-seq}}
899
900 The order of residues in a sequence can be reversed using reverse\_seq:
901
902 \begin{lyxcode}
903 ...~|~reverse\_seq
904 \end{lyxcode}
905 Note that in order to reverse/complement a sequence you also need
906 the \textbf{complement\_seq} biotool (see~\ref{sub:How-to-complement}).
907
908
909 \subsection{How to complement sequences?\label{sub:How-to-complement}}
910
911 DNA and RNA sequences can be complemented with \textbf{complement\_seq},
912 which automagically determines the sequence type:
913
914 \begin{lyxcode}
915 ...~|~complement\_seq
916 \end{lyxcode}
917 Note that in order to reverse/complement a sequence you also need
918 the \textbf{reverse\_seq} biotool (see~\ref{sub:How-to-reverse-seq}).
919
920
921 \subsection{How to remove indels from sequnces?}
922
923 Indels can be removed from sequences with the \textbf{remove\_indels}
924 biotool. This is useful if you have aligned some sequences (see~\ref{sub:How-to-align})
925 and extracted (see~\ref{sub:How-to-extract}) a block of subsequences
926 from the alignment and you want to use these sequence in a search
927 where you need to remove the indels first. '-', '\textasciitilde{}',
928 and '.' are considered indels:
929
930 \begin{lyxcode}
931 ...~|~remove\_indels
932 \end{lyxcode}
933
934 \subsection{How to split sequences into overlapping subsequences?}
935
936 Sequences can be slit into overlapping subsequences with the \textbf{split\_seq}
937 biotool.
938
939 \begin{lyxcode}
940 ...~|~split\_seq~-{}-word\_size=20~-{}-uniq
941 \end{lyxcode}
942
943 \subsection{How to determine the oligo frequency?}
944
945 In order to determine if any oligo usage is over represented in one
946 or more sequences you can determine the frequency of oligos of a given
947 size with \textbf{oligo\_freq}:
948
949 \begin{lyxcode}
950 ...~|~oligo\_freq~-{}-word\_size=4
951 \end{lyxcode}
952 And if you have more than one sequence and want to accumulate the
953 frequences you need the -\/-all switch:
954
955 \begin{lyxcode}
956 ...~|~oligo\_freq~-{}-word\_size=4~-{}-all
957 \end{lyxcode}
958 To get a meaningful result you need to write the resulting frequencies
959 as a table with \textbf{write\_tab} (see~\ref{sub:How-to-write-tab}),
960 but first it is important to \textbf{grab} (see~\ref{sub:How-to-grab})
961 the records with the frequencies to avoid full length sequences in
962 the table:
963
964 \begin{lyxcode}
965 ...~|~oligo\_freq~-{}-word\_size=4~-{}-all~|~grab~-{}-pattern=OLIGO~-{}-keys\_only
966
967 |~write\_tab~-{}-no\_stream
968 \end{lyxcode}
969 And the resulting frequency table can be sorted with Unix sort (man
970 sort).
971
972
973 \subsection{How to search for sequences in genomes?}
974
975 See the following biotool:
976
977 \begin{itemize}
978 \item \textbf{patscan\_seq} \eqref{sub:How-to-use-patscan}
979 \item \textbf{blat\_seq} \eqref{sub:How-to-use-BLAT}
980 \item \textbf{blast\_seq} \eqref{sub:How-to-use-BLAST}
981 \item \textbf{vmatch\_seq} \eqref{sub:How-to-use-Vmatch}
982 \end{itemize}
983
984 \subsection{How to search sequences for a pattern?\label{sub:How-to-use-patscan}}
985
986 It is possible to search sequences in the data stream for patterns
987 using the \textbf{patscan\_seq} biotool which utilizes the powerful
988 scan\_for\_matches engine. Consult the documentation for scan\_for\_matches
989 in order to learn how to define patterns (the documentation is included
990 in Appendix~\ref{sec:scan_for_matches-README}).
991
992 To search all sequences for a simple pattern consisting of the sequence
993 ATCGATCG allowing for 3 mismatches, 2 insertions and 1 deletion:
994
995 \begin{lyxcode}
996 read\_fasta~-{}-data\_in=<file>~|~patscan\_seq~-{}-pattern='ATCGATCG{[}3,2,1]'
997 \end{lyxcode}
998 The -\/-pattern switch takes a comma seperated list of patterns,
999 so if you want to search for more that one pattern do:
1000
1001 \begin{lyxcode}
1002 ...~|~patscan\_seq~-{}-pattern='ATCGATCG{[}3,2,1],GCTAGCTA{[}3,2,1]'
1003 \end{lyxcode}
1004 It is also possible to have a list of different patterns to search
1005 for in a file with one pattern per line. In order to get \textbf{patscan\_seq}
1006 to read these patterns use the -\/-pattern\_in switch:
1007
1008 \begin{lyxcode}
1009 ...~|~patscan\_seq~-{}-pattern\_in=<file>
1010 \end{lyxcode}
1011 To also scan the complementary strand in nucleotide sequences (\textbf{patscan\_seq}
1012 automagically determines the sequence type) you need to add the -\/-comp
1013 switch:
1014
1015 \begin{lyxcode}
1016 ...~|~patscan\_seq~-{}-pattern=<pattern>~-{}-comp
1017 \end{lyxcode}
1018 It is also possible to use \textbf{patscan\_seq} to output those records
1019 that does not contain a certain pattern by using the -\/-invert switch:
1020
1021 \begin{lyxcode}
1022 ...~|~patscan\_seq~-{}-pattern=<pattern>~-{}-invert
1023 \end{lyxcode}
1024 Finally, \textbf{patscan\_seq} can also scan for patterns in a given
1025 genome sequence, instead of sequences in the stream, using the -\/-genome
1026 switch:
1027
1028 \begin{lyxcode}
1029 patscan~-{}-pattern=<pattern>~-{}-genome=<genome>
1030 \end{lyxcode}
1031
1032 \subsection{How to use BLAT for sequence search?\label{sub:How-to-use-BLAT}}
1033
1034 Sequences in the data stream can be matched against supported genomes
1035 using \textbf{blat\_seq} which is a biotool using BLAT as the name
1036 might suggest. Currently only Mouse and Human genomes are available
1037 and it is not possible to use OOC files since there is still a need
1038 for a local repository for genome files. Otherwise it is just:
1039
1040 \begin{lyxcode}
1041 read\_fasta~-{}-data\_in=<file>~|~blat\_seq~-{}-genome=<genome>
1042 \end{lyxcode}
1043 The search results can then be written to file with \textbf{write\_psl}
1044 (see~\ref{sub:How-to-write-PSL}) or \textbf{write\_bed} (see~\ref{sub:How-to-write-BED})
1045 allthough with \textbf{write\_bed} some information will be lost).
1046 It is also possible to plot chromosome distribution of the search
1047 results using \textbf{plot\_chrdist} (see~\ref{sub:How-to-plot-chrdist})
1048 or the distribution of the match lengths using \textbf{plot\_lendist}
1049 (see~\ref{sub:How-to-plot-lendist}) or a karyogram with the hits
1050 using \textbf{plot\_karyogram} (see~\ref{sub:How-to-plot-karyogram}).
1051
1052
1053 \subsection{How to use BLAST for sequence search?\label{sub:How-to-use-BLAST}}
1054
1055 Two biotools exist for blasting sequences: \textbf{create\_blast\_db}
1056 is used to create the BLAST database required for BLAST which is queried
1057 using the biotool \textbf{blast\_seq}. So in order to create a BLAST
1058 database from sequences in the data stream you simple run:
1059
1060 \begin{lyxcode}
1061 ...~|~create\_blast\_db~-{}-database=my\_database~-{}-no\_stream
1062 \end{lyxcode}
1063 The type of sequence to use for the database is automagically determined
1064 by \textbf{create\_blast\_db}, but don't have a mixture of peptide
1065 and nucleic acids sequences in the stream. The -\/-database switch
1066 takes a path as argument, but will default to 'blastdb\_<time\_stamp>
1067 if not set.
1068
1069 The resulting database can now be queried with sequences in another
1070 data stream using \textbf{blast\_seq}:
1071
1072 \begin{lyxcode}
1073 ...~|~blast\_seq~-{}-database=my\_database
1074 \end{lyxcode}
1075 Again, the sequence type is determined automagically and the appropriate
1076 BLAST program is guessed (see below table), however, the program name
1077 can be overruled with the -\/-program switch.
1078
1079 \noindent \begin{center}
1080 \begin{tabular}{ccc}
1081 Subject sequence & Query sequence & Program guess\tabularnewline
1082 \hline
1083 Nucleotide & Nucleotide & blastn\tabularnewline
1084 Protein & Protein & blastp\tabularnewline
1085 Protein & Nucleotide & blastx\tabularnewline
1086 Nucleotide & Protein & tblastn\tabularnewline
1087 \end{tabular}
1088 \par\end{center}
1089
1090 Finally, it is also possible to use \textbf{blast\_seq} for blasting
1091 sequences agains a preformatted genome using the -\/-genome switch
1092 instead of the -\/-database switch:
1093
1094 \begin{lyxcode}
1095 ...~|~blast\_seq~-{}-genome=<genome>
1096 \end{lyxcode}
1097
1098 \subsection{How to use Vmatch for sequence search?\label{sub:How-to-use-Vmatch}}
1099
1100 The powerful suffix array software package Vmatch%
1101 \footnote{\url{http://www.vmatch.de/}%
1102 } can be used for exact mapping of sequences against indexed genomes
1103 using the biotool \textbf{vmatch\_seq}, which will e.g. map 700000
1104 ESTs to the human genome locating all 160 mio hits in less than an
1105 hour.
1106
1107 \begin{lyxcode}
1108 ...~|~vmatch\_seq~-{}-genome=<genome>
1109 \end{lyxcode}
1110 Only nucleotide sequences and sequences longer than 11 nucleotides
1111 will be mapped. The resulting SCORE key will hold the number of genome
1112 matches of a given sequence (multi-mappers).
1113
1114
1115 \subsection{How to find all matches between sequences?\label{sub:How-to-find-matches}}
1116
1117 All matches between two sequences can be determined with the biotool
1118 \textbf{match\_seq}. The match finding engine underneath the hood
1119 of \textbf{match\_seq} is the super fast suffix tree program MUMmer%
1120 \footnote{\url{http://mummer.sourceforge.net/}%
1121 }, which will locate all forward and reverse matches between huge sequences
1122 in a matter of minutes (if the repeat count is not too high and if
1123 the word size used is appropriate). Matching two \emph{Helicobacter
1124 pylori} genomes (1.7Mbp) takes around 10 seconds:
1125
1126 \begin{lyxcode}
1127 ...~|~match\_seq~-{}-word\_size=20~-{}-direction=both
1128 \end{lyxcode}
1129 The output from \textbf{match\_seq} can be used to generate a dot
1130 plot with \textbf{plot\_matches} (see~\ref{sub:How-to-generate-dotplot}).
1131
1132
1133 \subsection{How to align sequences?\label{sub:How-to-align}}
1134
1135 Sequences in the stream can be aligned with the \textbf{align\_seq}
1136 biotool that uses Muscle%
1137 \footnote{\url{http://www.drive5.com/muscle/muscle.html}%
1138 } as aligment engine. Currently you cannot change any of the Muscle
1139 alignment parameters and \textbf{align\_seq} will create an alignment
1140 based on the defaults (which are really good!):
1141
1142 \begin{lyxcode}
1143 ...~|~align\_seq
1144 \end{lyxcode}
1145 The aligned output can be written to file in FASTA format using \textbf{write\_fasta}
1146 (see~\ref{sub:How-to-write-fasta}) or in pretty text using \textbf{write\_align}
1147 (see~\ref{sub:How-to-write-alignment}).
1148
1149
1150 \subsection{How to create a weight matrix?}
1151
1152 If you want a weight matrix to show the sequence composition of a
1153 stack of sequences you can use the biotool create\_weight\_matrix:
1154
1155 \begin{lyxcode}
1156 ...~|~create\_weight\_matrix
1157 \end{lyxcode}
1158 The result can be output in percent using the -\/-percent switch:
1159
1160 \begin{lyxcode}
1161 ...~|~create\_weight\_matrix~-{}-percent
1162 \end{lyxcode}
1163 The weight matrix can be written as tabular output with \textbf{write\_tab}
1164 (see~\ref{sub:How-to-write-tab}) after removeing the records containing
1165 SEQ with \textbf{grab} (see~\ref{sub:How-to-grab}):
1166
1167 \begin{lyxcode}
1168 ...~|~create\_weight\_matrix~|~grab~-{}-invert~-{}-keys=SEQ~-{}-keys\_only
1169
1170 |~write\_tab~-{}-no\_stream
1171 \end{lyxcode}
1172 The V0 column will hold the residue, while the rest of the columns
1173 will hold the frequencies for each sequence position.
1174
1175
1176 \section{Plotting}
1177
1178 There exists several biotools for plotting. Some of these are based
1179 on GNUplot%
1180 \footnote{\url{http://www.gnuplot.info/}%
1181 }, which is an extremely powerful platform to generate all sorts of
1182 plots and even though GNUplot has quite a steep learning curve, the
1183 biotools utilizing GNUplot are simple to use. GNUplot is able to output
1184 a lot of different formats (called terminals in GNUplot), but the
1185 biotools focusses on three formats only:
1186
1187 \begin{enumerate}
1188 \item The 'dumb' terminal is default to the GNUplot based biotools and will
1189 output a plot in crude ASCII text (Fig.~\ref{fig:Dumb-terminal}).
1190 This is quite nice for a quick and dirty plot to get an overview of
1191 your data .
1192 \item The 'post' or 'postscript' terminal output postscript code which is
1193 publication grade graphics that can be viewed with applications such
1194 as Ghostview, Photoshop, and Preview.
1195 \item The 'svg' terminal output's scalable vector graphics (SVG) which is
1196 a vector based format. SVG is great because you can edit the resulting
1197 plot using Photoshop or Inkscape%
1198 \footnote{Inkscape is a really handy drawing program that is free and open source.
1199 Availble at \url{http://www.inkscape.org}%
1200 } if you want to add additional labels, captions, arrows, and so on
1201 and then save the result in different formats, such as postscript
1202 without loosing resolution.
1203 \end{enumerate}
1204 The biotools for plotting that are not based on GNUplot only output
1205 SVG (that may change in the future).
1206
1207 %
1208 \begin{figure}
1209 \noindent \begin{centering}
1210 \includegraphics[width=12cm]{lendist_ascii}
1211 \par\end{centering}
1212
1213 \caption{\label{fig:Dumb-terminal}Dumb terminal}
1214
1215
1216 \begin{quote}
1217 The output of a length distribution plot in the default 'dumb terminal'
1218 to the terminal window. 
1219 \end{quote}
1220
1221 \end{figure}
1222
1223
1224
1225 \subsection{How to plot a histogram?\label{How-to-plot-histogram}}
1226
1227 A generic histogram for a given value can be plotted with the biotool
1228 \textbf{plot\_histogram} (Fig.~\ref{fig:Histogram}):
1229
1230 \begin{lyxcode}
1231 ...~|~plot\_histogram~-{}-key=TISSUE~-{}-no\_stream
1232 \end{lyxcode}
1233 (Figure missing)
1234
1235 \noindent \begin{flushleft}
1236 %
1237 \begin{figure}
1238 \noindent \begin{centering}
1239 \includegraphics[width=12cm]{histogram}
1240 \par\end{centering}
1241
1242 \caption{\label{fig:Histogram}Histogram}
1243
1244 \end{figure}
1245
1246 \par\end{flushleft}
1247
1248
1249 \subsection{How to plot a length distribution?\label{sub:How-to-plot-lendist}}
1250
1251 Plotting of length distributions, weather sequence lengths, patterns
1252 lengths, hit lengths, \emph{etc.} is a really handy thing and can
1253 be done with the the biotool \textbf{plot\_lendist}. If you have a
1254 file with FASTA entries and want to plot the length distribution you
1255 do it like this:
1256
1257 \begin{lyxcode}
1258 read\_fasta~-{}-data\_in=<file>~|~length\_seq
1259
1260 |~plot\_lendist~-{}-key=SEQ\_LEN~-{}-no\_stream
1261 \end{lyxcode}
1262 The result will be written to the default dumb terminal and will look
1263 like Fig.~\ref{fig:Dumb-terminal}.
1264
1265 If you instead want the result in postscript format you can do:
1266
1267 \begin{lyxcode}
1268 ...~|~plot\_lendist~-{}-key=SEQ\_LEN~-{}-terminal=post~-{}-result\_out=file.ps
1269 \end{lyxcode}
1270 That will generate the plot and save it to file, but not interrupt
1271 the data stream which can then be used in further analysis. You can
1272 also save the plot implicetly using '>', however, it is then important
1273 to terminate the stream with the -\/-no\_stream switch:
1274
1275 \begin{lyxcode}
1276 ...~|~plot\_lendist~-{}-key=SEQ\_LEN~-{}-terminal=post~-{}-no\_stream~>~file.ps
1277 \end{lyxcode}
1278 The resulting plot can be seen in Fig.~\ref{fig:Length-distribution}.
1279
1280 %
1281 \begin{figure}
1282
1283
1284 \noindent \begin{centering}
1285 \includegraphics[width=12cm]{lendist}
1286 \par\end{centering}
1287
1288 \caption{\label{fig:Length-distribution}Length distribution}
1289
1290
1291 \begin{quote}
1292 Length distribution of 630 piRNA like RNAs.
1293 \end{quote}
1294
1295 \end{figure}
1296
1297
1298
1299 \subsection{How to plot a chromosome distribution?\label{sub:How-to-plot-chrdist}}
1300
1301 If you have the result of a sequence search against a multi chromosome
1302 genome, it is very practical to be able to plot the distribution of
1303 search hits on the different chromosomes. This can be done with \textbf{plot\_chrdist}:
1304
1305 \begin{lyxcode}
1306 read\_fasta~-{}-data\_in=<file>~|~blat\_genome~|~plot\_chrdist~-{}-no\_stream
1307 \end{lyxcode}
1308 The above example will result in a crude plot using the 'dumb' terminal,
1309 and if you want to mess around with the results from the BLAT search
1310 you probably want to save the result to file first (see~\ref{sub:How-to-write-PSL}).
1311 To plot the chromosome distribution from the saved search result you
1312 can do:
1313
1314 \begin{lyxcode}
1315 read\_bed~-{}-data\_in=file.bed~|~plot\_chrdist~-{}-terminal=post~-{}-result\_out=plot.ps
1316 \end{lyxcode}
1317 That will result in the output show in Fig.~\ref{fig:Chromosome-distribution}.
1318
1319 %
1320 \begin{figure}
1321
1322
1323 \noindent \begin{centering}
1324 \includegraphics[angle=90,width=12cm]{chrdist}
1325 \par\end{centering}
1326
1327 \caption{\label{fig:Chromosome-distribution}Chromosome distribution}
1328
1329 \end{figure}
1330
1331
1332
1333 \subsection{How to generate a dotplot?\label{sub:How-to-generate-dotplot}}
1334
1335 A dotplot is a powerful way to get an overview of the size and location
1336 of sequence insertions, deletions, and duplications between two sequences.
1337 Generating a dotplot with biotools is a two step process where you
1338 initially find all matches between two sequences using the tool \textbf{match\_seq}
1339 (see~\ref{sub:How-to-find-matches}) and plot the resulting matches
1340 with \textbf{plot\_matches}. Matching and plotting two \emph{Helicobacter
1341 pylori} genomes (1.7Mbp) takes around 10 seconds:
1342
1343 \begin{lyxcode}
1344 ...~|~match\_seq~|~plot\_matches~-{}-terminal=post~-{}-result\_out=plot.ps
1345 \end{lyxcode}
1346 The resulting dotplot is in Fig.~\ref{fig:Dotplot}.
1347
1348 %
1349 \begin{figure}
1350 \noindent \begin{centering}
1351 \includegraphics[width=12cm]{dotplot}
1352 \par\end{centering}
1353
1354 \caption{\label{fig:Dotplot}Dotplot}
1355
1356
1357 \begin{quote}
1358 Forward matches are displayed in green while reverse matches are displayed
1359 in red.
1360 \end{quote}
1361
1362 \end{figure}
1363
1364
1365
1366 \subsection{How to plot a sequence logo?}
1367
1368 Sequence logos can be generate with \textbf{plot\_seqlogo}. The sequnce
1369 type is determined automagically and an entropy scale of 2 bits and
1370 4 bits is used for nucleotide and peptide sequences, respectively%
1371 \footnote{\url{http://www.ccrnp.ncifcrf.gov/~toms/paper/hawaii/latex/node5.html}%
1372 }.
1373
1374 \begin{lyxcode}
1375 ...~|~plot\_seqlogo~-{}-no\_stream~-{}-result\_out=seqlogo.svg
1376 \end{lyxcode}
1377 An example of a sequence logo can be seen in Fig.~\ref{fig:Sequence-logo}.
1378
1379 %
1380 \begin{figure}
1381 \noindent \begin{centering}
1382 \includegraphics[width=12cm]{seqlogo}
1383 \par\end{centering}
1384
1385 \caption{\label{fig:Sequence-logo}Sequence logo}
1386
1387 \end{figure}
1388
1389
1390
1391 \subsection{How to plot a karyogram?\label{sub:How-to-plot-karyogram}}
1392
1393 To plot search hits on genomes use \textbf{plot\_karyogram}, which
1394 will output a nice karyogram in SVG graphics:
1395
1396 \begin{lyxcode}
1397 ...~|~plot\_karyogram~-{}-result\_out=karyogram.svg
1398 \end{lyxcode}
1399 The banding data is taken from the UCSC genome browser database and
1400 currently only Human and Mouse is supported. Fig.~\ref{fig:Karyogram}
1401 shows the distribution of piRNA like RNAs matched to the Human genome.
1402
1403 %
1404 \begin{figure}
1405 \noindent \begin{centering}
1406 \includegraphics[width=12cm]{karyogram}
1407 \par\end{centering}
1408
1409 \caption{\label{fig:Karyogram}Karyogram}
1410
1411
1412 \begin{quote}
1413 Hits from a search of piRNA like RNAs in the Human genome is displayed
1414 as short horizontal bars.
1415 \end{quote}
1416
1417 \end{figure}
1418
1419
1420
1421 \section{Uploading Results}
1422
1423
1424 \subsection{How do I display my results in the UCSC Genome Browser?}
1425
1426 Results from the list of biotools below can be uploaded directly to
1427 a local mirror of the UCSC Genome Browser using the biotool \textbf{upload\_to\_ucsc}:
1428
1429 \begin{itemize}
1430 \item patscan\_seq \eqref{sub:How-to-use-patscan}
1431 \item blat\_seq \eqref{sub:How-to-use-BLAT}
1432 \item blast\_seq \eqref{sub:How-to-use-BLAST}
1433 \item vmatch\_seq \eqref{sub:How-to-use-Vmatch}
1434 \end{itemize}
1435 The syntax for uploading data the most simple way requires two mandatory
1436 switches: -\/-database, which is the UCSC database name (such as
1437 hg18, mm9, etc.) and-\/-table which should be the users initials
1438 followed by an underscore and a short description of the data:
1439
1440 \begin{lyxcode}
1441 ...~|~upload\_to\_ucsc~-{}-database=hg18~-{}-table=mah\_snoRNAs
1442 \end{lyxcode}
1443 The \textbf{upload\_to\_ucsc} biotool modifies the users \textasciitilde{}/ucsc/my\_tracks.ra
1444 file automagically (a backup is created with the name \textasciitilde{}/ucsc/my\_tracks.ra\textasciitilde{})
1445 with default values that can be overridden using the following switches:
1446
1447 \begin{itemize}
1448 \item -\/-short\_label - Short label for track - Default=database->table
1449 \item -\/-long\_label - Long label for track - Default=database->table
1450 \item -\/-group - Track group name - Default=<user name as defined in env>
1451 \item -\/-priority - Track display priority - Default=1
1452 \item -\/-color - Track color - Default=147,73,42
1453 \item -\/-chunk\_size - Chunks for loading - Default=10000000
1454 \item -\/-visibility - Track visibility - Default=pack
1455 \end{itemize}
1456 Also, data in BED or PSL format can be uploaded with \textbf{upload\_to\_ucsc}
1457 as long as these reference to genomes and chromosomes existing in
1458 the UCSC Genome Browser:
1459
1460 \begin{lyxcode}
1461 read\_bed~-{}-data\_in=<bed~file>~|~upload\_to\_ucsc~...
1462
1463
1464
1465 read\_psl~-{}-data\_in=<psl~file>~|~upload\_to\_ucsc~...
1466 \end{lyxcode}
1467
1468 \section{Trouble shooting}
1469
1470 Shoot the messenger!
1471
1472 \appendix
1473
1474 \section{Keys\label{sec:Keys}}
1475
1476 HIT
1477
1478 HIT\_BEG
1479
1480 HIT\_END
1481
1482 HIT\_LEN
1483
1484 HIT\_NAME
1485
1486 PATTERN
1487
1488
1489 \section{Switches\label{sec:Switches}}
1490
1491 -\/-stream\_in
1492
1493 -\/-stream\_out
1494
1495 -\/-no\_stream
1496
1497 -\/-data\_in
1498
1499 -\/-result\_out
1500
1501 -\/-num
1502
1503
1504 \section{scan\_for\_matches README\label{sec:scan_for_matches-README}}
1505
1506 \begin{lyxcode}
1507 ~~~~~~~~~~~~~~~~~~~~~~~~~~scan\_for\_matches:
1508
1509 ~~~~A~Program~to~Scan~Nucleotide~or~Protein~Sequences~for~Matching~Patterns
1510
1511 ~~~~~~~~~~~~~~~~~~~~~~~~Ross~Overbeek
1512
1513 ~~~~~~~~~~~~~~~~~~~~~~~~MCS
1514
1515 ~~~~~~~~~~~~~~~~~~~~~~~~Argonne~National~Laboratory
1516
1517 ~~~~~~~~~~~~~~~~~~~~~~~~Argonne,~IL~60439
1518
1519 ~~~~~~~~~~~~~~~~~~~~~~~~USA
1520
1521 Scan\_for\_matches~is~a~utility~that~we~have~written~to~search~for
1522
1523 patterns~in~DNA~and~protein~sequences.~~I~wrote~most~of~the~code,
1524
1525 although~David~Joerg~and~Morgan~Price~wrote~sections~of~an
1526
1527 earlier~version.~~The~whole~notion~of~pattern~matching~has~a~rich
1528
1529 history,~and~we~borrowed~liberally~from~many~sources.~~However,~it~is
1530
1531 worth~noting~that~we~were~strongly~influenced~by~the~elegant~tools
1532
1533 developed~and~distributed~by~David~Searls.~~My~intent~is~to~make~the
1534
1535 existing~tool~available~to~anyone~in~the~research~community~that~might
1536
1537 find~it~useful.~~I~will~continue~to~try~to~fix~bugs~and~make~suggested
1538
1539 enhancements,~at~least~until~I~feel~that~a~superior~tool~exists.
1540
1541 Hence,~I~would~appreciate~it~if~all~bug~reports~and~suggestions~are
1542
1543 directed~to~me~at~Overbeek@mcs.anl.gov.~~
1544
1545 I~will~try~to~log~all~bug~fixes~and~report~them~to~users~that~send~me
1546
1547 their~email~addresses.~~I~do~not~require~that~you~give~me~your~name
1548
1549 and~address.~~However,~if~you~do~give~it~to~me,~I~will~try~to~notify
1550
1551 you~of~serious~problems~as~they~are~discovered.
1552
1553 Getting~Started:
1554
1555 ~~~~The~distribution~should~contain~at~least~the~following~programs:
1556
1557 ~~~~~~~~~~~~~~~~README~~~~~~~~~~~~~~~~~~-~~~~~This~document
1558
1559 ~~~~~~~~~~~~~~~~ggpunit.c~~~~~~~~~~~~~~~-~~~~~One~of~the~two~source~files
1560
1561 ~~~~~~~~~~~~~~~~scan\_for\_matches.c~~~~~~-~~~~~The~second~source~file
1562
1563 ~~~~~~~~~~~~~~~~
1564
1565 ~~~~~~~~~~~~~~~~run\_tests~~~~~~~~~~~~~~~-~~~~~A~perl~script~to~test~things
1566
1567 ~~~~~~~~~~~~~~~~show\_hits~~~~~~~~~~~~~~~-~~~~~A~handy~perl~script
1568
1569 ~~~~~~~~~~~~~~~~test\_dna\_input~~~~~~~~~~-~~~~~Test~sequences~for~DNA
1570
1571 ~~~~~~~~~~~~~~~~test\_dna\_patterns~~~~~~~-~~~~~Test~patterns~for~DNA~scan
1572
1573 ~~~~~~~~~~~~~~~~test\_output~~~~~~~~~~~~~-~~~~~Desired~output~from~test
1574
1575 ~~~~~~~~~~~~~~~~test\_prot\_input~~~~~~~~~-~~~~~Test~protein~sequences
1576
1577 ~~~~~~~~~~~~~~~~test\_prot\_patterns~~~~~~-~~~~~Test~patterns~for~proteins
1578
1579 ~~~~~~~~~~~~~~~~testit~~~~~~~~~~~~~~~~~~-~~~~~a~perl~script~used~for~test
1580
1581 ~~~~Only~the~first~three~files~are~required.~~The~others~are~useful,
1582
1583 ~~~~but~only~if~you~have~Perl~installed~on~your~system.~~If~you~do
1584
1585 ~~~~have~Perl,~I~suggest~that~you~type
1586
1587 ~~~~~~~~
1588
1589 ~~~~~~~~~~~~~~~~which~perl
1590
1591 ~~~~to~find~out~where~it~installed.~~On~my~system,~I~get~the~following
1592
1593 ~~~~response:
1594
1595 ~~~~~~~~
1596
1597 ~~~~~~~~~~~~~~~~clone\%~which~perl
1598
1599 ~~~~~~~~~~~~~~~~/usr/local/bin/perl
1600
1601 ~~~~indicating~that~Perl~is~installed~in~/usr/local/bin.~~Anyway,~once
1602
1603 ~~~~you~know~where~it~is~installed,~edit~the~first~line~of~files~
1604
1605 ~~~~~~~~testit
1606
1607 ~~~~~~~~show\_hits
1608
1609 ~~~~replacing~/usr/local/bin/perl~with~the~appropriate~location.~~I
1610
1611 ~~~~will~assume~that~you~can~do~this,~although~it~is~not~critical~(it
1612
1613 ~~~~is~needed~only~to~test~the~installation~and~to~use~the~\char`\"{}show\_hits\char`\"{}
1614
1615 ~~~~utility).~~Perl~is~not~required~to~actually~install~and~run
1616
1617 ~~~~scan\_for\_matches.~
1618
1619 ~~~~If~you~do~not~have~Perl,~I~suggest~you~get~it~and~install~it~(it
1620
1621 ~~~~is~a~wonderful~utility).~~Information~about~Perl~and~how~to~get~it
1622
1623 ~~~~can~be~found~in~the~book~\char`\"{}Programming~Perl\char`\"{}~by~Larry~Wall~and
1624
1625 ~~~~Randall~L.~Schwartz,~published~by~O'Reilly~\&~Associates,~Inc.
1626
1627 ~~~~To~get~started,~you~will~need~to~compile~the~program.~~~I~do~this
1628
1629 ~~~~using~
1630
1631 ~~~~~~~~gcc~-O~-o~scan\_for\_matches~~ggpunit.c~scan\_for\_matches.c
1632
1633 ~~~~If~you~do~not~use~GNU~C,~use~
1634
1635 ~~~~~~~~cc~-O~-DCC~-o~scan\_for\_matches~~ggpunit.c~scan\_for\_matches.c
1636
1637 ~~~~which~works~on~my~Sun.~~
1638
1639 ~~~~Once~you~have~compiled~scan\_for\_matches,~you~can~verify~that~it
1640
1641 ~~~~works~with
1642
1643 ~~~~~~~~clone\%~run\_tests~tmp
1644
1645 ~~~~~~~~clone\%~diff~tmp~test\_output
1646
1647 ~~~~You~may~get~a~few~strange~lines~of~the~sort
1648
1649 ~~~~~~~~clone\%~run\_tests~tmp
1650
1651 ~~~~~~~~rm:~tmp:~No~such~file~or~directory
1652
1653 ~~~~~~~~clone\%~diff~tmp~test\_output
1654
1655 ~~~~These~should~cause~no~concern.~~However,~if~the~\char`\"{}diff\char`\"{}~shows~that
1656
1657 ~~~~tmp~and~test\_output~are~different,~contact~me~(you~have~a
1658
1659 ~~~~problem).~
1660
1661 ~~~~You~should~now~be~able~to~use~scan\_for\_matches~by~following~the
1662
1663 ~~~~instructions~given~below~(which~is~all~the~normal~user~should~have
1664
1665 ~~~~to~understand,~once~things~are~installed~properly).
1666
1667 ~==============================================================
1668
1669 How~to~run~scan\_for\_matches:
1670
1671 ~~~~To~run~the~program,~you~type~need~to~create~two~files
1672
1673 ~~~~1.~~the~first~file~contains~the~pattern~you~wish~to~scan~for;~I'll
1674
1675 ~~~~~~~~call~this~file~pat\_file~in~what~follows~(but~any~name~is~ok)
1676
1677 ~~~~2.~~the~second~file~contains~a~set~of~sequences~to~scan.~~These
1678
1679 ~~~~~~~~should~be~in~\char`\"{}fasta~format\char`\"{}.~~Just~look~at~the~contents~of
1680
1681 ~~~~~~~~test\_dna\_input~to~see~examples~of~this~format.~~Basically,
1682
1683 ~~~~~~~~each~sequence~begins~with~a~line~of~the~form
1684
1685 ~~~~~~~~~~~>sequence\_id
1686
1687 ~~~~~~~~and~is~followed~by~one~or~more~lines~containing~the~sequence.
1688
1689 ~~~~Once~these~files~have~been~created,~you~just~use
1690
1691 ~~~~~~~~scan\_for\_matches~pat\_file~<~input\_file
1692
1693 ~~~~to~scan~all~of~the~input~sequences~for~the~given~pattern.~~As~an
1694
1695 ~~~~example,~suppose~that~pat\_file~contains~a~single~line~of~the~form
1696
1697 ~~~~~~~~~~~~~~~~p1=4...7~3...8~\textasciitilde{}p1
1698
1699 ~~~~Then,
1700
1701 ~~~~~~~~~~~~~~~~scan\_for\_matches~pat\_file~<~test\_dna\_input
1702
1703 ~~~~should~produce~two~\char`\"{}hits\char`\"{}.~~When~I~run~this~on~my~machine,~I~get
1704
1705 ~~~~~~~~clone\%~scan\_for\_matches~pat\_file~<~test\_dna\_input
1706
1707 ~~~~~~~~>tst1:{[}6,27]
1708
1709 ~~~~~~~~cguaacc~ggttaacc~gguuacg~
1710
1711 ~~~~~~~~>tst2:{[}6,27]
1712
1713 ~~~~~~~~CGUAACC~GGTTAACC~GGUUACG~
1714
1715 ~~~~~~~~clone\%~
1716
1717 Simple~Patterns~Built~by~Matching~Ranges~and~Reverse~Complements
1718
1719 ~~~~Let~me~first~explain~this~simple~pattern:
1720
1721 ~~~~~~~~~~~~~~~~
1722
1723 ~~~~~~~~~~~~~~~~p1=4...7~3...8~\textasciitilde{}p1
1724
1725 ~~~~The~pattern~consists~of~three~\char`\"{}pattern~units\char`\"{}~separated~by~spaces.
1726
1727 ~~~~The~first~pattern~unit~is
1728
1729 ~~~~~~~~~~~~~~~~p1=4...7
1730
1731 ~~~~which~means~\char`\"{}match~4~to~7~characters~and~call~them~p1\char`\"{}.~~The
1732
1733 ~~~~second~pattern~unit~is
1734
1735 ~~~~~~~~~~~~~~~~3...8
1736
1737 ~~~~which~means~\char`\"{}then~match~3~to~8~characters\char`\"{}.~~The~last~pattern~unit
1738
1739 ~~~~is~
1740
1741 ~~~~~~~~~~~~~~~~\textasciitilde{}p1
1742
1743 ~~~~which~means~\char`\"{}match~the~reverse~complement~of~p1\char`\"{}.~~The~first
1744
1745 ~~~~reported~hit~is~shown~as
1746
1747 ~~~~~~~~>tst1:{[}6,27]
1748
1749 ~~~~~~~~cguaacc~ggttaacc~gguuacg~
1750
1751 ~~~~which~states~that~characters~6~through~27~of~sequence~tst1~were
1752
1753 ~~~~matched.~~\char`\"{}cguaac\char`\"{}~matched~the~first~pattern~unit,~\char`\"{}ggttaacc\char`\"{}~the
1754
1755 ~~~~second,~and~\char`\"{}gguuacg\char`\"{}~the~third.~~This~is~an~example~of~a~common
1756
1757 ~~~~type~of~pattern~used~to~search~for~sections~of~DNA~or~RNA~that
1758
1759 ~~~~would~fold~into~a~hairpin~loop.
1760
1761 Searching~Both~Strands
1762
1763 ~~~~Now~for~a~short~aside:~scan\_for\_matches~only~searched~the
1764
1765 ~~~~sequences~in~the~input~file;~it~did~not~search~the~opposite
1766
1767 ~~~~strand.~~With~a~pattern~of~the~sort~we~just~used,~there~is~not
1768
1769 ~~~~need~o~search~the~opposite~strand.~~However,~it~is~normally~the
1770
1771 ~~~~case~that~you~will~wish~to~search~both~the~sequence~and~the
1772
1773 ~~~~opposite~strand~(i.e.,~the~reverse~complement~of~the~sequence).
1774
1775 ~~~~To~do~that,~you~would~just~use~the~\char`\"{}-c\char`\"{}~command~line.~~For~example,
1776
1777 ~~~~~~~~scan\_for\_matches~-c~pat\_file~<~test\_dna\_input
1778
1779 ~~~~Hits~on~the~opposite~strand~will~show~a~beginning~location~greater
1780
1781 ~~~~than~te~end~location~of~the~match.
1782
1783 Defining~Pairing~Rules~and~Allowing~Mismatches,~Insertions,~and~Deletions
1784
1785 ~~~~Let~us~stop~now~and~ask~\char`\"{}What~additional~features~would~one~need~to
1786
1787 ~~~~really~find~the~kinds~of~loop~structures~that~characterize~tRNAs,
1788
1789 ~~~~rRNAs,~and~so~forth?\char`\"{}~~I~can~immediately~think~of~two:
1790
1791 ~~~~~~~~a)~you~will~need~to~be~able~to~allow~non-standard~pairings
1792
1793 ~~~~~~~~~~~(those~other~than~G-C~and~A-U),~and
1794
1795 ~~~~~~~~b)~you~will~need~to~be~able~to~tolerate~some~number~of
1796
1797 ~~~~~~~~~~~mismatches~and~bulges.
1798
1799 ~~~~~~~~
1800
1801 ~~~~Let~me~first~show~you~how~to~handle~non-standard~\char`\"{}rules~for
1802
1803 ~~~~pairing~in~reverse~complements\char`\"{}.~~Consider~the~following~pattern,
1804
1805 ~~~~which~I~show~as~two~line~(you~may~use~as~many~lines~as~you~like~in
1806
1807 ~~~~forming~a~pattern,~although~you~can~only~break~a~pattern~at~points
1808
1809 ~~~~where~space~would~be~legal):
1810
1811 ~~~~~~~~~~~~r1=\{au,ua,gc,cg,gu,ug,ga,ag\}~
1812
1813 ~~~~~~~~~~~~p1=2...3~0...4~p2=2...5~1...5~r1\textasciitilde{}p2~0...4~\textasciitilde{}p1~~~~~~~~
1814
1815 ~~~~The~first~\char`\"{}pattern~unit\char`\"{}~does~not~actually~match~anything;~rather,
1816
1817 ~~~~it~defines~a~\char`\"{}pairing~rule\char`\"{}~in~which~standard~pairings~are
1818
1819 ~~~~allowed,~as~well~as~G-A~and~A-G~(in~case~you~wondered,~Us~and~Ts
1820
1821 ~~~~and~upper~and~lower~case~can~be~used~interchangably;~for~example
1822
1823 ~~~~r1=\{AT,UA,gc,cg\}~could~be~used~to~define~the~\char`\"{}standard~rule\char`\"{}~for
1824
1825 ~~~~pairings).~~The~second~line~consists~of~six~pattern~units~which
1826
1827 ~~~~may~be~interpreted~as~follows:
1828
1829 ~~~~~~~~~~~~p1=2...3~~~~~match~2~or~3~characters~(call~it~p1)
1830
1831 ~~~~~~~~~~~~0...4~~~~~~~~match~0~to~4~characters
1832
1833 ~~~~~~~~~~~~p2=2...5~~~~~match~2~to~5~characters~(call~it~p2)
1834
1835 ~~~~~~~~~~~~1...5~~~~~~~~match~1~to~5~characters
1836
1837 ~~~~~~~~~~~~r1\textasciitilde{}p2~~~~~~~~match~the~reverse~complement~of~p2,
1838
1839 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~allowing~G-A~and~A-G~pairs
1840
1841 ~~~~~~~~~~~~0...4~~~~~~~~match~0~to~4~characters~~~~~~~~
1842
1843 ~~~~~~~~~~~~\textasciitilde{}p1~~~~~~~~~~match~the~reverse~complement~of~p1
1844
1845 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~allowing~only~G-C,~C-G,~A-T,~and~T-A~pairs
1846
1847 ~~~~Thus,~r1\textasciitilde{}p2~means~\char`\"{}match~the~reverse~complement~of~p2~using~rule~r1\char`\"{}.
1848
1849 ~~~~Now~let~us~consider~the~issue~of~tolerating~mismatches~and~bulges.
1850
1851 ~~~~You~may~add~a~\char`\"{}qualifier\char`\"{}~to~the~pattern~unit~that~gives~the
1852
1853 ~~~~tolerable~number~of~\char`\"{}mismatches,~deletions,~and~insertions\char`\"{}.
1854
1855 ~~~~Thus,
1856
1857 ~~~~~~~~~~~~~~~~p1=10...10~3...8~\textasciitilde{}p1{[}1,2,1]
1858
1859 ~~~~means~that~the~third~pattern~unit~must~match~10~characters,
1860
1861 ~~~~allowing~one~\char`\"{}mismatch\char`\"{}~(a~pairing~other~than~G-C,~C-G,~A-T,~or
1862
1863 ~~~~T-A),~two~deletions~(a~deletion~is~a~character~that~occurs~in~p1,
1864
1865 ~~~~but~has~been~\char`\"{}deleted\char`\"{}~from~the~string~matched~by~\textasciitilde{}p1),~and~one
1866
1867 ~~~~insertion~(an~\char`\"{}insertion\char`\"{}~is~a~character~that~occurs~in~the~string
1868
1869 ~~~~matched~by~\textasciitilde{}p1,~but~not~for~which~no~corresponding~character
1870
1871 ~~~~occurs~in~p1).~~In~this~case,~the~pattern~would~match
1872
1873 ~~~~~~~~~~~~~~ACGTACGTAC~GGGGGGGG~GCGTTACCT
1874
1875 ~~~~which~is,~you~must~admit,~a~fairly~weak~loop.~~It~is~common~to
1876
1877 ~~~~allow~mismatches,~but~you~will~find~yourself~using~insertions~and
1878
1879 ~~~~deletions~much~more~rarely.~~In~any~event,~you~should~note~that
1880
1881 ~~~~allowing~mismatches,~insertions,~and~deletions~does~force~the
1882
1883 ~~~~program~to~try~many~additional~possible~pairings,~so~it~does~slow
1884
1885 ~~~~things~down~a~bit.
1886
1887 How~Patterns~Are~Matched
1888
1889 ~~~~Now~is~as~good~a~time~as~any~to~discuss~the~basic~flow~of~control
1890
1891 ~~~~when~matching~patterns.~~Recall~that~a~\char`\"{}pattern\char`\"{}~is~a~sequence~of
1892
1893 ~~~~\char`\"{}pattern~units\char`\"{}.~~Suppose~that~the~pattern~units~were
1894
1895 ~~~~~~~~u1~u2~u3~u4~...~un
1896
1897 ~~~~The~scan~of~a~sequence~S~begins~by~setting~the~current~position
1898
1899 ~~~~to~1.~~Then,~an~attempt~is~made~to~match~u1~starting~at~the
1900
1901 ~~~~current~position.~~Each~attempt~to~match~a~pattern~unit~can
1902
1903 ~~~~succeed~or~fail.~~If~it~succeeds,~then~an~attempt~is~made~to~match
1904
1905 ~~~~the~next~unit.~~If~it~fails,~then~an~attempt~is~made~to~find~an
1906
1907 ~~~~alternative~match~for~the~immediately~preceding~pattern~unit.~~If
1908
1909 ~~~~this~succeeds,~then~we~proceed~forward~again~to~the~next~unit.~~If
1910
1911 ~~~~it~fails~we~go~back~to~the~preceding~unit.~~This~process~is~called
1912
1913 ~~~~\char`\"{}backtracking\char`\"{}.~~If~there~are~no~previous~units,~then~the~current
1914
1915 ~~~~position~is~incremented~by~one,~and~everything~starts~again.~~This
1916
1917 ~~~~proceeds~until~either~the~current~position~goes~past~the~end~of
1918
1919 ~~~~the~sequence~or~all~of~the~pattern~units~succeed.~~On~success,
1920
1921 ~~~~scan\_for\_matches~reports~the~\char`\"{}hit\char`\"{},~the~current~position~is~set
1922
1923 ~~~~just~past~the~hit,~and~an~attempt~is~made~to~find~another~hit.
1924
1925 ~~~~If~you~wish~to~limit~the~scan~to~simply~finding~a~maximum~of,~say,
1926
1927 ~~~~10~hits,~you~can~use~the~-n~option~(-n~10~would~set~the~limit~to
1928
1929 ~~~~10~reported~hits).~~For~example,
1930
1931 ~~~~~~~~scan\_for\_matches~-c~-n~1~pat\_file~<~test\_dna\_input
1932
1933 ~~~~would~search~for~just~the~first~hit~(and~would~stop~searching~the
1934
1935 ~~~~current~sequences~or~any~that~follow~in~the~input~file).
1936
1937 Searching~for~repeats:
1938
1939 ~~~~In~the~last~section,~I~discussed~almost~all~of~the~details
1940
1941 ~~~~required~to~allow~you~to~look~for~repeats.~~Consider~the~following
1942
1943 ~~~~set~of~patterns:
1944
1945 ~~~~~~~~p1=6...6~3...8~p1~~~(find~exact~6~character~repeat~separated
1946
1947 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~by~to~8~characters)
1948
1949 ~~~~~~~~p1=6...6~3..8~p1{[}1,0,0]~~~(allow~one~mismatch)
1950
1951 ~~~~~~~~p1=3...3~p1{[}1,0,0]~p1{[}1,0,0]~p1{[}1,0,0]~~
1952
1953 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~(match~12~characters~that~are~the~remains
1954
1955 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~of~a~3-character~sequence~occurring~4~times)
1956
1957 ~~~~~~~~~~~~~~~~
1958
1959 ~~~~~~~~p1=4...8~0...3~p2=6...8~p1~0...3~p2
1960
1961 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~(This~would~match~things~like
1962
1963 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ATCT~G~TCTTT~ATCT~TG~TCTTT
1964
1965 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~)
1966
1967 Searching~for~particular~sequences:
1968
1969 ~~~~Occasionally,~one~wishes~to~match~a~specific,~known~sequence.
1970
1971 ~~~~In~such~a~case,~you~can~just~give~the~sequence~(along~with~an
1972
1973 ~~~~optional~statement~of~the~allowable~mismatches,~insertions,~and
1974
1975 ~~~~deletions).~~Thus,
1976
1977 ~~~~~~~~p1=6...8~GAGA~\textasciitilde{}p1~~~~(match~a~hairpin~with~GAGA~as~the~loop)
1978
1979 ~~~~~~~~RRRRYYYY~~~~~~~~~~~~~(match~4~purines~followed~by~4~pyrimidines)
1980
1981 ~~~~~~~~TATAA{[}1,0,0]~~~~~~~~~(match~TATAA,~allowing~1~mismatch)
1982
1983 ~~~~~~~~
1984
1985 Matches~against~a~\char`\"{}weight~matrix\char`\"{}:
1986
1987 ~~~~I~will~conclude~my~examples~of~the~types~of~pattern~units
1988
1989 ~~~~available~for~matching~against~nucleotide~sequences~by~discussing~a
1990
1991 ~~~~crude~implemetation~of~matching~using~a~\char`\"{}weight~matrix\char`\"{}.~~While~I
1992
1993 ~~~~am~less~than~overwhelmed~with~the~syntax~that~I~chose,~I~think~that
1994
1995 ~~~~the~reader~should~be~aware~that~I~was~thinking~of~generating
1996
1997 ~~~~patterns~containing~such~pattern~units~automatically~from
1998
1999 ~~~~alignments~(and~did~not~really~plan~on~typing~such~things~in~by
2000
2001 ~~~~hand~very~often).~~Anyway,~suppose~that~you~wanted~to~match~a
2002
2003 ~~~~sequence~of~eight~characters.~~The~\char`\"{}consensus\char`\"{}~of~these~eight
2004
2005 ~~~~characters~is~GRCACCGS,~but~the~actual~\char`\"{}frequencies~of~occurrence\char`\"{}
2006
2007 ~~~~are~given~in~the~matrix~below.~~Thus,~the~first~character~is~an~A
2008
2009 ~~~~16\%~the~time~and~a~G~84\%~of~the~time.~~The~second~is~an~A~57\%~of
2010
2011 ~~~~the~time,~a~C~10\%~of~the~time,~a~G~29\%~of~the~time,~and~a~T~4\%~of
2012
2013 ~~~~the~time.~~
2014
2015 ~~~~~~~~~~~~~C1~~~~~C2~~~~C3~~~~C4~~~C5~~~~C6~~~~C7~~~~C8
2016
2017 ~~~~
2018
2019 ~~~~~~~A~~~~~16~~~~~57~~~~~0~~~~95~~~~0~~~~18~~~~~0~~~~~0
2020
2021 ~~~~~~~C~~~~~~0~~~~~10~~~~80~~~~~0~~100~~~~60~~~~~0~~~~50
2022
2023 ~~~~~~~G~~~~~84~~~~~29~~~~~0~~~~~0~~~~0~~~~20~~~100~~~~50
2024
2025 ~~~~~~~T~~~~~~0~~~~~~4~~~~20~~~~~5~~~~0~~~~~2~~~~~0~~~~~0~~~
2026
2027 ~~~~
2028
2029 ~~~~One~could~use~the~following~pattern~unit~to~search~for~inexact
2030
2031 ~~~~matches~related~to~such~a~\char`\"{}weight~matrix\char`\"{}:
2032
2033 ~~~~~~~~\{(16,0,84,0),(57,10,29,4),(0,80,0,20),(95,0,0,5),
2034
2035 ~~~~~~~~~(0,100,0,0),(18,60,20,2),(0,0,100,0),(0,50,50,0)\}~>~450
2036
2037 ~~~~This~pattern~unit~will~attempt~to~match~exactly~eight~characters.
2038
2039 ~~~~For~each~character~in~the~sequence,~the~entry~in~the~corresponding
2040
2041 ~~~~tuple~is~added~to~an~accumulated~sum.~~If~the~sum~is~greater~than
2042
2043 ~~~~450,~the~match~succeeds;~else~it~fails.
2044
2045 ~~~~Recently,~this~feature~was~upgraded~to~allow~ranges.~~Thus,
2046
2047 ~~600~>~~\{(16,0,84,0),(57,10,29,4),(0,80,0,20),(95,0,0,5),
2048
2049 ~~~~~~~~~(0,100,0,0),(18,60,20,2),(0,0,100,0),(0,50,50,0)\}~>~450
2050
2051 ~~~~will~work,~as~well.
2052
2053 Allowing~Alternatives:
2054
2055 ~~~~Very~occasionally,~you~may~wish~to~allow~alternative~pattern~units
2056
2057 ~~~~(i.e.,~\char`\"{}match~either~A~or~B\char`\"{}).~~You~can~do~this~using~something
2058
2059 ~~~~like
2060
2061 ~~~~~~~~~~~~~~~~(~GAGA~|~GCGCA)
2062
2063 ~~~~which~says~\char`\"{}match~either~GAGA~or~GCGCA\char`\"{}.~~You~may~take
2064
2065 ~~~~alternatives~of~a~list~of~pattern~units,~for~example
2066
2067 ~~~~~~~~(p1=3...3~3...8~\textasciitilde{}p1~|~p1=5...5~4...4~\textasciitilde{}p1~GGG)
2068
2069 ~~~~would~match~one~of~two~sequences~of~pattern~units.~~There~is~one
2070
2071 ~~~~clumsy~aspect~of~the~syntax:~to~match~a~list~of~alternatives,~you
2072
2073 ~~~~need~to~fully~the~request.~~Thus,
2074
2075 ~~~~~~~~(GAGA~|~(GCGCA~|~TTCGA))
2076
2077 ~~~~would~be~needed~to~try~the~three~alternatives.
2078
2079 One~Minor~Extension
2080
2081 ~~~~Sometimes~a~pattern~will~contain~a~sequence~of~distinct~ranges,
2082
2083 ~~~~and~you~might~wish~to~limit~the~sum~of~the~lengths~of~the~matched
2084
2085 ~~~~subsequences.~~~For~example,~suppose~that~you~basically~wanted~to
2086
2087 ~~~~match~something~like
2088
2089 ~~~~ARRYYTT~p1=0...5~GCA{[}1,0,0]~p2=1...6~\textasciitilde{}p1~4...8~\textasciitilde{}p2~p3=4...10~CCT
2090
2091 ~~~~but~that~the~sum~of~the~lengths~of~p1,~p2,~and~p3~must~not~exceed
2092
2093 ~~~~eight~characters.~~To~do~this,~you~could~add~
2094
2095 ~~~~~~~~length(p1+p2+p3)~<~9
2096
2097 ~~~~as~the~last~pattern~unit.~~It~will~just~succeed~or~fail~(but~does
2098
2099 ~~~~not~actually~match~any~characters~in~the~sequence).
2100
2101 ~~~~
2102
2103 Matching~Protein~Sequences
2104
2105 ~~~~Suppose~that~the~input~file~contains~protein~sequences.~~In~this
2106
2107 ~~~~case,~you~must~invoke~scan\_for\_matches~with~the~\char`\"{}-p\char`\"{}~option.~~You
2108
2109 ~~~~cannot~use~aspects~of~the~language~that~relate~directly~to
2110
2111 ~~~~nucleotide~sequences~(e.g.,~the~-c~command~line~option~or~pattern
2112
2113 ~~~~constructs~referring~to~the~reverse~complement~of~a~previously
2114
2115 ~~~~matched~unit).~~
2116
2117 ~~~~You~also~have~two~additional~constructs~that~allow~you~to~match
2118
2119 ~~~~either~\char`\"{}one~of~a~set~of~amino~acids\char`\"{}~or~\char`\"{}any~amino~acid~other~than
2120
2121 ~~~~those~a~given~set\char`\"{}.~~For~example,
2122
2123 ~~~~~~~~p1=0...4~any(HQD)~1...3~notany(HK)~p1
2124
2125 ~~~~would~successfully~match~a~string~like
2126
2127 ~~~~~~~~~~~YWV~D~AA~C~YWV
2128
2129 Using~the~show\_hits~Utility
2130
2131 ~~~~When~viewing~a~large~set~of~complex~matches,~you~might~find~it
2132
2133 ~~~~convenient~to~post-process~the~scan\_for\_matches~output~to~get~a
2134
2135 ~~~~more~readable~version.~~We~provide~a~simple~post-processor~called
2136
2137 ~~~~\char`\"{}show\_hits\char`\"{}.~~To~see~its~effect,~just~pipe~the~output~of~a
2138
2139 ~~~~scan\_for\_matches~into~show\_hits:
2140
2141 ~~~~~Normal~Output:
2142
2143 ~~~~~~~~clone\%~scan\_for\_matches~-c~pat\_file~<~tmp
2144
2145 ~~~~~~~~>tst1:{[}1,28]
2146
2147 ~~~~~~~~gtacguaacc~~ggttaac~cgguuacgtac~
2148
2149 ~~~~~~~~>tst1:{[}28,1]
2150
2151 ~~~~~~~~gtacgtaacc~~ggttaac~cggttacgtac~
2152
2153 ~~~~~~~~>tst2:{[}2,31]
2154
2155 ~~~~~~~~CGTACGUAAC~C~GGTTAACC~GGUUACGTACG~
2156
2157 ~~~~~~~~>tst2:{[}31,2]
2158
2159 ~~~~~~~~CGTACGTAAC~C~GGTTAACC~GGTTACGTACG~
2160
2161 ~~~~~~~~>tst3:{[}3,32]
2162
2163 ~~~~~~~~gtacguaacc~g~gttaactt~cgguuacgtac~
2164
2165 ~~~~~~~~>tst3:{[}32,3]
2166
2167 ~~~~~~~~gtacgtaacc~g~aagttaac~cggttacgtac~
2168
2169 ~~~~~Piped~Through~show\_hits:
2170
2171 ~~~~
2172
2173 ~~~~~~~~clone\%~scan\_for\_matches~-c~pat\_file~<~tmp~|~show\_hits
2174
2175 ~~~~~~~~tst1:{[}1,28]:~~gtacguaacc~~~ggttaac~~cgguuacgtac
2176
2177 ~~~~~~~~tst1:{[}28,1]:~~gtacgtaacc~~~ggttaac~~cggttacgtac
2178
2179 ~~~~~~~~tst2:{[}2,31]:~~CGTACGUAAC~C~GGTTAACC~GGUUACGTACG
2180
2181 ~~~~~~~~tst2:{[}31,2]:~~CGTACGTAAC~C~GGTTAACC~GGTTACGTACG
2182
2183 ~~~~~~~~tst3:{[}3,32]:~~gtacguaacc~g~gttaactt~cgguuacgtac
2184
2185 ~~~~~~~~tst3:{[}32,3]:~~gtacgtaacc~g~aagttaac~cggttacgtac
2186
2187 ~~~~~~~~clone\%~
2188
2189 ~~~~Optionally,~you~can~specify~which~of~the~\char`\"{}fields\char`\"{}~in~the~matches
2190
2191 ~~~~you~wish~to~sort~on,~and~show\_hits~will~sort~them.~~The~field
2192
2193 ~~~~numbers~start~with~0.~~So,~you~might~get~something~like
2194
2195 ~~~~~~~~clone\%~scan\_for\_matches~-c~pat\_file~<~tmp~|~show\_hits~2~1
2196
2197 ~~~~~~~~tst2:{[}2,31]:~~CGTACGUAAC~C~GGTTAACC~GGUUACGTACG
2198
2199 ~~~~~~~~tst2:{[}31,2]:~~CGTACGTAAC~C~GGTTAACC~GGTTACGTACG
2200
2201 ~~~~~~~~tst3:{[}32,3]:~~gtacgtaacc~g~aagttaac~cggttacgtac
2202
2203 ~~~~~~~~tst1:{[}1,28]:~~gtacguaacc~~~ggttaac~~cgguuacgtac
2204
2205 ~~~~~~~~tst1:{[}28,1]:~~gtacgtaacc~~~ggttaac~~cggttacgtac
2206
2207 ~~~~~~~~tst3:{[}3,32]:~~gtacguaacc~g~gttaactt~cgguuacgtac
2208
2209 ~~~~~~~~clone\%~
2210
2211 ~~~~In~this~case,~the~hits~have~been~sorted~on~fields~2~and~1~(that~is,
2212
2213 ~~~~the~third~and~second~matched~subfields).
2214
2215 ~~~~show\_hits~is~just~one~possible~little~post-processor,~and~you
2216
2217 ~~~~might~well~wish~to~write~a~customized~one~for~yourself.
2218
2219 Reducing~the~Cost~of~a~Search
2220
2221 ~~~~The~scan\_for\_matches~utility~uses~a~fairly~simple~search,~and~may
2222
2223 ~~~~consume~large~amounts~of~CPU~time~for~complex~patterns.~~Someday,
2224
2225 ~~~~I~may~decide~to~optimize~the~code.~~However,~until~then,~let~me
2226
2227 ~~~~mention~one~useful~technique.~~
2228
2229 ~~~~When~you~have~a~complex~pattern~that~includes~a~number~of~varying
2230
2231 ~~~~ranges,~imprecise~matches,~and~so~forth,~it~is~useful~to
2232
2233 ~~~~\char`\"{}pipeline\char`\"{}~matches.~~That~is,~form~a~simpler~pattern~that~can~be
2234
2235 ~~~~used~to~scan~through~a~large~database~extracting~sections~that
2236
2237 ~~~~might~be~matched~by~the~more~complex~pattern.~~Let~me~illustrate
2238
2239 ~~~~with~a~short~example.~~Suppose~that~you~really~wished~to~match~the
2240
2241 ~~~~pattern~
2242
2243 ~~~~p1=3...5~0...8~\textasciitilde{}p1{[}1,1,0]~p2=6...7~3...6~AGC~3...5~RYGC~\textasciitilde{}p2{[}1,0,0]
2244
2245 ~~~~In~this~case,~the~pattern~units~AGC~3...5~RYGC~can~be~used~to~rapidly
2246
2247 ~~~~constrain~the~overall~search.~~You~can~preprocess~the~overall
2248
2249 ~~~~database~using~the~pattern:
2250
2251 ~~~~~~~~~~31...31~AGC~3...5~RYGC~7...7
2252
2253 ~~~~Put~the~complex~pattern~in~pat\_file1~and~the~simpler~pattern~in
2254
2255 ~~~~pat\_file2.~~Then~use,
2256
2257 ~~~~~~~~scan\_for\_matches~-c~pat\_file2~<~nucleotide\_database~|
2258
2259 ~~~~~~~~scan\_for\_matches~pat\_file1
2260
2261 ~~~~The~output~will~show~things~like
2262
2263 ~~~~>seqid:{[}232,280]{[}2,47]
2264
2265 ~~~~matches~pieces
2266
2267 ~~~~Then,~the~actual~section~of~the~sequence~that~was~matched~can~be
2268
2269 ~~~~easily~computed~as~{[}233,278]~(remember,~the~positions~start~from
2270
2271 ~~~~1,~not~0).
2272
2273 ~~~~Let~me~finally~add,~you~should~do~a~few~short~experiments~to~see
2274
2275 ~~~~whether~or~not~such~pipelining~actually~improves~performance~-{}-~it
2276
2277 ~~~~is~not~always~obvious~where~the~time~is~going,~and~I~have
2278
2279 ~~~~sometimes~found~that~the~added~complexity~of~pipelining~actually
2280
2281 ~~~~slowed~things~up.~~It~gets~its~best~improvements~when~there~are
2282
2283 ~~~~exact~matches~of~more~than~just~a~few~characters~that~can~be
2284
2285 ~~~~rapidly~used~to~eliminate~large~sections~of~the~database.
2286
2287 =============
2288
2289 Additions:
2290
2291 Feb~9,~1995:~~~the~pattern~units~\textasciicircum{}~and~\$~now~work~as~in~normal~regular
2292
2293 ~~~~~~~~~~~~~~~expressions.~~That~is
2294
2295 ~~~~~~~~~~~~~~~~~~~~~~~~TTF~\$
2296
2297 ~~~~~~~~~~~~~~~matches~only~TTF~at~the~end~of~the~string~and~
2298
2299 ~~~~~~~~~~~~~~~~~~~~~~~~\textasciicircum{}~TTF~
2300
2301 ~~~~~~~~~~~~~~~matches~only~an~initial~TTF
2302
2303 ~~~~~~~~~~~~~~~The~pattern~unit~
2304
2305 ~~~~~~~~~~~~~~~~~~~~~~~~<p1
2306
2307 ~~~~~~~~~~~~~~~matches~the~reverse~of~the~string~named~p1.~~That~is,
2308
2309 ~~~~~~~~~~~~~~~if~p1~matched~GCAT,~then~<p1~would~match~TACG.~~Thus,
2310
2311 ~~~~~~~~~~~~~~~~~~~p1=6...6~<p1
2312
2313 ~~~~~~~~~~~~~~~matches~a~real~palindrome~(not~the~biologically~common
2314
2315 ~~~~~~~~~~~~~~~meaning~of~\char`\"{}reverse~complement\char`\"{})
2316
2317
2318 \end{lyxcode}
2319
2320 \end{document}