bp_doc/biotools_cookbook.tex

   1 %% LyX 1.5.1 created this file.  For more info, see http://www.lyx.org/.
   2 %% Do not edit unless you really know what you are doing.
   3 \documentclass[english]{scrartcl}
   4 \usepackage[T1]{fontenc}
   5 \usepackage[latin9]{inputenc}
   6 \setlength{\parskip}{\medskipamount}
   7 \setlength{\parindent}{0pt}
   8 \usepackage{amsmath}
   9 \usepackage{graphicx}
  10 \IfFileExists{url.sty}{\usepackage{url}}
  11                       {\newcommand{\url}{\texttt}}
  12
  13 \makeatletter
  14
  15 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% LyX specific LaTeX commands.
  16 %% Because html converters don't know tabularnewline
  17 \providecommand{\tabularnewline}{\\}
  18
  19 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Textclass specific LaTeX commands.
  20 \newenvironment{lyxcode}
  21 {\begin{list}{}{
  22 \setlength{\rightmargin}{\leftmargin}
  23 \setlength{\listparindent}{0pt}% needed for AMS classes
  24 \raggedright
  25 \setlength{\itemsep}{0pt}
  26 \setlength{\parsep}{0pt}
  27 \normalfont\ttfamily}%
  28  \item[]}
  29 {\end{list}}
  30
  31 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% User specified LaTeX commands.
  32 \usepackage[colorlinks=true, urlcolor=blue, linkcolor=black]{hyperref}
  33
  34 \usepackage{babel}
  35 \makeatother
  36
  37 \begin{document}
  38
  39 \title{Biotools Cookbook}
  40
  41
  42 \author{Martin Asser Hansen}
  43
  44
  45 \publishers{John Mattick Group\\
  46 Institute for Molecular Bioscience\\
  47 University of Queensland\\
  48 Australia\\
  49 E-mail: mail@maasha.dk}
  50
  51 \maketitle
  52 \thispagestyle{empty}
  53
  54 \newpage{}
  55
  56 \tableofcontents{}
  57
  58 \listoffigures
  59
  60
  61 \newpage{}
  62
  63
  64 \section{Introduction}
  65
  66 Biotools is a selection of simple tools that can be linked together
  67 (piped as we shall call it) in a very flexible manner to perform both
  68 simple and complex tasks. The fundamental idea is that biotools work
  69 on a data stream that will only terminate at the end of an analysis
  70 and that this data stream can be passed through several different
  71 biotools, each performing one specific task. The advantage of this
  72 approach is that a user can perform simple and complex tasks without
  73 having to write advanced code. Moreover, since the data format used
  74 to pass data between biotools is text based, biotools can be written
  75 by different developers in their favorite programming language ---
  76 and still the biotools will be able to work together.
  77
  78 In the most simple form bioools can be piped together on the command
  79 line like this (using the pipe character '|'):
  80
  81 \begin{lyxcode}
  82 read\_data~|~calculate\_something~|~write\_result
  83 \end{lyxcode}
  84 However, a more comprehensive analysis could be composed:
  85
  86 \begin{lyxcode}
  87 read\_data~|~select\_entries~|~convert\_entries~|~search\_database~~
  88
  89 evaluate\_results~|~plot\_diagram~|~plot\_another\_diagram~|
  90
  91 load\_to\_database
  92 \end{lyxcode}
  93 The data stream that is piped through the biotools consists of records
  94 of key/value pairs in the same way a hash does in order to keep as
  95 simple a structure as possible. An example record can be seen below:
  96
  97 \begin{lyxcode}
  98
  99
 100 REC\_TYPE:~PATSCAN
 101
 102 MATCH:~AGATCAAGTG
 103
 104 S\_BEG:~7
 105
 106 S\_END:~16
 107
 108 ALIGN\_LEN:~9
 109
 110 S\_ID:~piR-t6
 111
 112 STRAND:~+
 113
 114 PATTERN:~AGATCAAGTG
 115
 116 -{}-{}-
 117 \end{lyxcode}
 118 The '-\/-\/-' denotes the delimiter of the records, and each key
 119 is a word followed by a ':' and a white-space and then the value.
 120 By convention the biotools only uses upper case keys (a list of used
 121 keys can be seen in Appendix~\ref{sec:Keys}). Since the records
 122 basically are hash structures this mean that the order of the keys
 123 in the stream is unordered, and in the above example it is pure coincidence
 124 that HIT\_BEG is displayed before HIT\_END, however, when the order
 125 of the keys is importent, the biotools will automagically see to that.
 126
 127 All of the biotools are able to read and write a data stream to and
 128 from file as long as the records are in the biotools format. This
 129 means that if you are undertaking a lengthy analysis where one of
 130 the steps is time consuming, you may save the stream after this step,
 131 and subsequently start one or more analysis from that last step%
 132 \footnote{It is a goal that the biotools at some point will be able to dump
 133 the data stream to file in case one of the tools fail critically.%
 134 }. If you are running a lengthy analysis it is highly recommended that
 135 you create a small test sample of the data and run that through the
 136 pipeline --- and once you are satisfied with the result proceed with
 137 the full data set (see~\ref{sub:How-to-select-a-few-records}).
 138
 139
 140 \section{The Data Stream}
 141
 142
 143 \subsection{How to read the data stream from file?\label{sub:How-to-read-stream}}
 144
 145 You want to read a data stream that you previously have saved to file
 146 in biotools format. This can be done implicetly or explicitly. The
 147 implicit way uses the 'stdout' stream of the Unix terminal:
 148
 149 \begin{lyxcode}
 150 cat~|~<biotool>
 151 \end{lyxcode}
 152 cat is the Unix command that reads a file and output the result to
 153 'stdout' --- which in this case is piped to any biotool represented
 154 by the <biotool>. It is also possible to read the data stream using
 155 '<' to direct the 'stdout' stream into the biotool like this:
 156
 157 \begin{lyxcode}
 158 <biotool>~<~<file>
 159 \end{lyxcode}
 160 However, that will not work if you pipe more biotools together. Then
 161 it is much safer to read the stream from a file explicitly like this:
 162
 163 \begin{lyxcode}
 164 <biotool>~-{}-stream\_in=<file>
 165 \end{lyxcode}
 166 Here the filename <file> is explicetly given to the biotool <biotool>
 167 with the switch -\/-stream\_in. This switch works with all biotools.
 168 It is also possible to read in data from multiple sources by repeating
 169 the explicit read step:
 170
 171 \begin{lyxcode}
 172 <biotool>~-{}-stream\_in=<file1>~|~<biotool>~-{}-stream\_in=<file2>
 173 \end{lyxcode}
 174
 175 \subsection{How to write the data stream to file?\label{sub:How-to-write-stream}}
 176
 177 In order to save the output stream from a biotool to file, so you
 178 can read in the stream again at a later time, you can do one of two
 179 things:
 180
 181 \begin{lyxcode}
 182 <biotool>~>~<file>
 183 \end{lyxcode}
 184 All, the biotools write the data stream to 'stdout' by default which
 185 can be written to a file by redirecting 'stdout' to file using '>'
 186 , however, if one of the biotools for writing other formats is used
 187 then the both the biotools records as well as the result output will
 188 go to 'stdout' in a mixture causing havock! To avoid this you must
 189 use the switch -\/-stream\_out that explictly tells the biotool to
 190 write the output stream to file:
 191
 192 \begin{lyxcode}
 193 <biotool>~-{}-stream\_out=<file>
 194 \end{lyxcode}
 195 The -\/-stream\_out switch works with all biotools.
 196
 197
 198 \subsection{How to terminate the data stream?}
 199
 200 The data stream is never stops unless the user want to save the stream
 201 or by supplying the -\/-no\_stream switch that will terminate the
 202 stream:
 203
 204 \begin{lyxcode}
 205 <biotool>~-{}-no\_stream
 206 \end{lyxcode}
 207 The -\/-no\_stream switch only works with those biotools where it
 208 makes sense that the user might want to terminale the data stream,
 209 \emph{i.e}. after an analysis step where the user wants to output
 210 the result, but not the data stream.
 211
 212
 213 \subsection{How to write my results to file?\label{sub:How-to-write-result}}
 214
 215 Saving the result of an analysis to file can be done implicitly or
 216 explicitly. The implicit way:
 217
 218 \begin{lyxcode}
 219 <biotool>~-{}-no\_stream~>~<file>
 220 \end{lyxcode}
 221 If you use '>' to redirect 'stdout' to file then it is important to
 222 use the -\/-no\_stream switch to avoid writing a mix of biotools
 223 records and result to the same file causing havock. The safe way is
 224 to use the -\/-result\_out switch which explicetly tells the biotool
 225 to write the result to a given file:
 226
 227 \begin{lyxcode}
 228 <biotool>~-{}-result\_out=<file>
 229 \end{lyxcode}
 230 Using the above method will not terminate the stream, so it is possible
 231 to pipe that into another biotool generating different results:
 232
 233 \begin{lyxcode}
 234 <biotool1>~-{}-result\_out=<file1>~|~<biotool2>~-{}-result\_out=<file2>
 235 \end{lyxcode}
 236 And still the data stream will continue unless terminated with -\/-no\_stream:
 237
 238 \begin{lyxcode}
 239 <biotool>~-{}-result\_out=<file>~-{}-no\_stream
 240 \end{lyxcode}
 241 Or written to file using implicitly or explicity \eqref{sub:How-to-write-result}.
 242 The explicit way:
 243
 244 \begin{lyxcode}
 245 <biotool>~-{}-result\_out=<file1>~-{}-stream\_out=<file2>
 246 \end{lyxcode}
 247
 248 \subsection{How to read data from multiple sources?}
 249
 250 To read multiple data sources, with the same type or different type
 251 of data do:
 252
 253 \begin{lyxcode}
 254 <biotool1>~-{}-data\_in=<file1>~|~<biotool2>~-{}-data\_in=<file2>
 255 \end{lyxcode}
 256 where type is the data type a specific biotool reads.
 257
 258
 259 \section{Reading input}
 260
 261
 262 \subsection{How to read biotools input?}
 263
 264 See \eqref{sub:How-to-read-stream}.
 265
 266
 267 \subsection{How to read in data?}
 268
 269 Data in different formats can be read with the appropriate biotool
 270 for that format. The biotools are typicalled named 'read\_<data type>'
 271 such as \textbf{read\_fasta}, \textbf{read\_bed}, \textbf{read\_tab},
 272 etc., and all behave in a similar manner. Data can be read by supplying
 273 the -\/-data\_in switch and a file name to the file containing the
 274 data:
 275
 276 \begin{lyxcode}
 277 <biotool>~-{}-data\_in=<file>
 278 \end{lyxcode}
 279 It is also possible to read in a saved biotools stream (see \ref{sub:How-to-read-stream})
 280 as well as reading data in one go:
 281
 282 \begin{lyxcode}
 283 <biotool>~-{}-stream\_in=<file1>~-{}-data\_in=<file2>
 284 \end{lyxcode}
 285 If you want to read data from several files you can do this:
 286
 287 \begin{lyxcode}
 288 <biotool>~-{}-data\_in=<file1>~|~<biotool>~-{}-data\_in=<file2>
 289 \end{lyxcode}
 290 If you have several data files you can read in all explicitly with
 291 a comma separated list:
 292
 293 \begin{lyxcode}
 294 <biotool>~-{}-data\_in=file1,file2,file3
 295 \end{lyxcode}
 296 And it is also possible to use file globbing:
 297
 298 \begin{lyxcode}
 299 <biotool>~-{}-data\_in={*}.fna
 300 \end{lyxcode}
 301 Or in a combination:
 302
 303 \begin{lyxcode}
 304 <biotool>~-{}-data\_in=file1,/dir/{*}.fna
 305 \end{lyxcode}
 306 Finally, it is possible to read in data in different formats using
 307 the appropriate biotool for each format:
 308
 309 \begin{lyxcode}
 310 <biotool1>~-{}-data\_in=<file1>~|~<biotool2>~-{}-data\_in=<file2>~...
 311 \end{lyxcode}
 312
 313 \subsection{How to read FASTA input?}
 314
 315 Sequences in FASTA format can be read explicitly using \textbf{read\_fasta}:
 316
 317 \begin{lyxcode}
 318 read\_fasta~-{}-data\_in=<file>
 319 \end{lyxcode}
 320
 321 \subsection{How to read alignment input?}
 322
 323 If your alignment if FASTA formatted then you can \textbf{read\_align}.
 324 It is also possible to use \textbf{read\_fasta} since the data is
 325 FASTA formatted, however, with \textbf{read\_fasta} the key ALIGN
 326 will be omitted. The ALIGN key is used to determine which sequences
 327 belong to what alignment which is required for \textbf{write\_align}.
 328
 329 \begin{lyxcode}
 330 read\_align~-{}-data\_in=<file>
 331 \end{lyxcode}
 332
 333 \subsection{How to read tabular input?\label{sub:How-to-read-table}}
 334
 335 Tabular input can be read with \textbf{read\_tab} which will read
 336 in all rows and chosen columns (separated by a given delimter) from
 337 a table in text format.
 338
 339 The table below:
 340
 341 \noindent \begin{center}
 342 \begin{tabular}{lll}
 343 Human & ATACGTCAG & 23524\tabularnewline
 344 Dog & AGCATGAC & 2442\tabularnewline
 345 Mouse & GACTG & 234\tabularnewline
 346 Cat & AAATGCA & 2342\tabularnewline
 347 \end{tabular}
 348 \par\end{center}
 349
 350 Can be read using the command:
 351
 352 \begin{lyxcode}
 353 read\_tab~-{}-data\_in=<file>
 354 \end{lyxcode}
 355 Which will result in four records, one for each row, where the keys
 356 V0, V1, V2 are the default keys for the organism, sequence, and count,
 357 respectively. It is possible to select a subset of colums to read
 358 by using the -\/-cols switch which takes a comma separated list of
 359 columns numbers (first column is designated 0) as argument. So to
 360 read in only the sequence and the count so that the count comes before
 361 the sequence do:
 362
 363 \begin{lyxcode}
 364 read\_tab~-{}-data\_in=<file>~-{}-cols=2,1
 365 \end{lyxcode}
 366 It is also possible to name the columns with the -\/-keys switch:
 367
 368 \begin{lyxcode}
 369 read\_tab~-{}-data\_in=<file>~-{}-cols=2,1~-{}-keys=COUNT,SEQ
 370 \end{lyxcode}
 371
 372 \subsection{How to read BED input?}
 373
 374 The BED (Browser Extensible Data%
 375 \footnote{\url{http://genome.ucsc.edu/FAQ/FAQformat}%
 376 }) format is a tabular format for data pertaining to one of the Eukaryotic
 377 genomes in the UCSC genome brower%
 378 \footnote{\url{http://genome.ucsc.edu/}%
 379 }. The BED format consists of up to 12 columns, where the first three
 380 are mandatory CHR, CHR\_BEG, and CHR\_END. The mandatory columns and
 381 any of the optional columns can all be read in easily with the \textbf{read\_bed}
 382 biotool.
 383
 384 \begin{lyxcode}
 385 read\_bed~-{}-data\_in=<file>
 386 \end{lyxcode}
 387 It is also possible to read the BED file with \textbf{read\_tab} (see~\ref{sub:How-to-read-table}),
 388 however, that will be more cumbersome because you need to specify
 389 the keys:
 390
 391 \begin{lyxcode}
 392 read\_tab~-{}-data\_in=<file>~-{}-keys=CHR,CHR\_BEG,CHR\_END~...
 393 \end{lyxcode}
 394
 395 \subsection{How to read PSL input?}
 396
 397 The PSL format is the output from BLAT and contains 21 mandatory fields
 398 that can be read with \textbf{read\_psl}:
 399
 400 \begin{lyxcode}
 401 read\_psl~-{}-data\_in=<file>
 402 \end{lyxcode}
 403
 404 \section{Writing output}
 405
 406 All result output can be written explicitly to file using the -\/-result\_out
 407 switch which all result generating biotools have. It is also possible
 408 to write the result to file implicetly by directing 'stdout' to file
 409 using '>', however, that requires the -\/-no\_stream swich to prevent
 410 a mixture of data stream and results in the file. The explicit (and
 411 safe) way:
 412
 413 \begin{lyxcode}
 414 ...~|~<biotool>~-{}-result\_out=<file>
 415 \end{lyxcode}
 416 The implicit way:
 417
 418 \begin{lyxcode}
 419 ...~|~<biotool>~-{}-no\_stream~>~<file>
 420 \end{lyxcode}
 421
 422 \subsection{How to write biotools output?}
 423
 424 See \eqref{sub:How-to-write-stream}.
 425
 426
 427 \subsection{How to write FASTA output?\label{sub:How-to-write-fasta}}
 428
 429 FASTA output can be written with \textbf{write\_fasta}.
 430
 431 \begin{lyxcode}
 432 ...~|~write\_fasta~-{}-result\_out=<file>
 433 \end{lyxcode}
 434 It is also possible to wrap the sequences to a given width using the
 435 -\/-wrap switch allthough wrapping of sequence is generally an evil
 436 thing:
 437
 438 \begin{lyxcode}
 439 ...~|~write\_fasta~-{}-no\_stream~-{}-wrap=80
 440 \end{lyxcode}
 441
 442 \subsection{How to write alignment output?\label{sub:How-to-write-alignment}}
 443
 444 Pretty alignments with ruler%
 445 \footnote{'.' for every 10 residues, ':' for every 50, and '|' for every 100%
 446 } and consensus sequence can be created with \textbf{write\_align},
 447 what also have the optional -\/-wrap switch to break the alignment
 448 into blocks of a given width:
 449
 450 \begin{lyxcode}
 451 ...~|~write\_align~-{}-result\_out=<file>~-{}-wrap=80
 452 \end{lyxcode}
 453 If the number of sequnces in the alignment is 2 then a pairwise alignment
 454 will be output otherwise a multiple alignment. And if the sequence
 455 type, determined automagically, is protein, then residues and symbols
 456 (+,~:,~.) will be used to show consensus according to the Blosum62
 457 matrix.
 458
 459
 460 \subsection{How to write tabular output?\label{sub:How-to-write-tab}}
 461
 462 Outputting the data stream as a table can be done with \textbf{write\_tab},
 463 which will write generate one row per record with the values as columns.
 464 If you supply the optional -\/-comment switch, when the first row
 465 in the table will be a 'comment' line prefixed with a '\#':
 466
 467 \begin{lyxcode}
 468 ...~|~write\_tab~-{}-result\_out=<file>~-{}-comment
 469 \end{lyxcode}
 470 You can also change the delimiter from the default (tab) to \emph{e.g.}
 471 ',':
 472
 473 \begin{lyxcode}
 474 ...~|~write\_tab~-{}-result\_out=<file>~-{}-delimit=','
 475 \end{lyxcode}
 476 If you want the values output in a specific order you have to supply
 477 a comma separated list using the -\/-keys switch that will print
 478 only those keys in that order:
 479
 480 \begin{lyxcode}
 481 ...~|~write\_tab~-{}-result\_out=<file>~-{}-keys=SEQ\_NAME,COUNT
 482 \end{lyxcode}
 483 Alternatively, if you have some keys that you don't want in the tabular
 484 output, use the -\/-no\_keys switch. So to print all keys except
 485 SEQ and SEQ\_TYPE do:
 486
 487 \begin{lyxcode}
 488 ...~|~write\_tab~-{}-result\_out=<file>~-{}-no\_keys=SEQ,SEQ\_TYPE
 489 \end{lyxcode}
 490 Finally, if you have a stream containing a mix of different records
 491 types, \emph{e.g.} records with sequences and records with matches,
 492 then you can use \textbf{write\_tab} to output all the records in
 493 tabluar format, however, the -\/-comment, -\/-keys, and -\/-no\_keys
 494 switches will only respond to records of the first type encountered.
 495 The reason is that outputting mixed records is probably not what you
 496 want anyway, and you should remove all the unwanted records from the
 497 stream before outputting the table: \textbf{grab} is your friend (see~\ref{sub:How-to-grab}).
 498
 499
 500 \subsection{How to write a BED output?\label{sub:How-to-write-BED}}
 501
 502 Data in BED format can be output if the records contain the mandatory
 503 keys CHR, CHR\_BEG, and CHR\_END using \textbf{write\_bed}. If the
 504 optional keys are also present, they will be output as well:
 505
 506 \begin{lyxcode}
 507 write\_bed~-{}-result\_out=<file>
 508 \end{lyxcode}
 509
 510 \subsection{How to write PSL output?\label{sub:How-to-write-PSL}}
 511
 512 Data in PSL format can be output using \textbf{write\_psl:}
 513
 514 \begin{lyxcode}
 515 write\_psl~-{}-result\_out=<file>
 516 \end{lyxcode}
 517
 518 \section{Manipulating Records}
 519
 520
 521 \subsection{How to select a few records?\label{sub:How-to-select-a-few-records}}
 522
 523 To quickly get an overview of your data you can limit the data stream
 524 to show a few records. This also very useful to test the pipeline
 525 with a few records if you are setting up a complex analysis using
 526 several biotools. That way you can inspect that all goes well before
 527 analyzing and waiting for the full data set. All of the read\_<type>
 528 biotools have the -\/-num switch which will take a number as argument
 529 and only that number of records will be read. So to read in the first
 530 10 FASTA entries from a file:
 531
 532 \begin{lyxcode}
 533 read\_fasta~-{}-data\_in=test.fna~-{}-num=10
 534 \end{lyxcode}
 535 Another way of doing this is to use \textbf{head\_records} will limit
 536 the stream to show the first 10 records (default):
 537
 538 \begin{lyxcode}
 539 ...~|~head\_records
 540 \end{lyxcode}
 541 Using \textbf{head\_records} directly after one of the read\_<type>
 542 biotools will be a lot slower than using the -\/-num switch with
 543 the read\_<type> biotools, however, \textbf{head\_records} can also
 544 be used to limit the output from all the other biotools. It is also
 545 possible to give \textbf{head\_records} a number of records to show
 546 using the -\/-num switch. So to display the first 100 records do:
 547
 548 \begin{lyxcode}
 549 ...~|~head\_records~-{}-num=100
 550 \end{lyxcode}
 551
 552 \subsection{How to count all records in the data stream?}
 553
 554 To count all the records in the data stream use \textbf{count\_records},
 555 which adds one record (which is not included in the count) to the
 556 data stream. So to count the number of sequences in a FASTA file you
 557 can do this:
 558
 559 \begin{lyxcode}
 560 cat~test.fna~|~read\_fasta~|~count\_records~-{}-no\_stream
 561 \end{lyxcode}
 562 Which will write the last record containing the count to 'stdout':
 563
 564 \begin{lyxcode}
 565 -{}-{}-
 566
 567 count\_records:~630
 568 \end{lyxcode}
 569 It is also possible to write the count to file using the -\/-result\_out
 570 switch.
 571
 572
 573 \subsection{How to grab specific records?\label{sub:How-to-grab}}
 574
 575 The biotool \textbf{grab} is related to the Unix grep and locates
 576 records based on matching keys and/or values using either a pattern,
 577 a Perl regex, or a numerical evaluation. To easily \textbf{grab} all
 578 records in the stream that has any mentioning of the pattern 'human'
 579 just pipe the data stream through \textbf{grab} like this:
 580
 581 \begin{lyxcode}
 582 ...~|~grab~-{}-pattern=human
 583 \end{lyxcode}
 584 This will search for the pattern 'human' in all keys and all values.
 585 The -\/-pattern switch takes a comma separated list of patterns,
 586 so in order to match multiple patterns do:
 587
 588 \begin{lyxcode}
 589 ...~|~grab~-{}-pattern=human,mouse
 590 \end{lyxcode}
 591 It is also possible to use the -\/-pattern\_in switch instead of
 592 -\/-pattern. -\/-pattern\_in is used to read a file with one pattern
 593 per line:
 594
 595 \begin{lyxcode}
 596 ...~|~grab~-{}-pattern\_in=patterns.txt
 597 \end{lyxcode}
 598 If you want the opposite result --- to find all records that does
 599 not match the patterns, add the -\/-invert switch, which not only
 600 works with the -\/-pattern switch, but also with -\/-regex and -\/-eval:
 601
 602 \begin{lyxcode}
 603 ...~|~grab~-{}-pattern=human~-{}-invert
 604 \end{lyxcode}
 605 If you want to search the record keys only, \emph{e.g.} to find all
 606 records containing the key SEQ you can add the -\/-keys\_only switch.
 607 This will prevent matching of SEQ in any record value, and in fact
 608 SEQ is a not uncommon peptide sequence you could get an unwanted record.
 609 Also, this will give an increase in speed since only the keys are
 610 searched:
 611
 612 \begin{lyxcode}
 613 ...~|~grab~-{}-pattern=SEQ~-{}-keys\_only
 614 \end{lyxcode}
 615 However, if you are interested in finding the peptide sequence SEQ
 616 and not the SEQ key, just add the -\/-vals\_only switch instead:
 617
 618 \begin{lyxcode}
 619 ...~|~grab~-{}-pattern=SEQ~-{}-vals\_only
 620 \end{lyxcode}
 621 Also, if you want to grab for certain key/value pairs you can supply
 622 a comma separated list of keys whos values will then be searched using
 623 the -\/-keys switch. This is handy if your records contain large
 624 genomic sequences and you dont want to search the entire sequence
 625 for \emph{e.g.} the organism name --- it is much faster to tell \textbf{grab}
 626 which keys to search the value for:
 627
 628 \begin{lyxcode}
 629 ...~|~grab~-{}-pattern=human~-{}-keys=SEQ\_NAME
 630
 631
 632 \end{lyxcode}
 633 It is also possible to invoke flexible matching using regex (regular
 634 expressions) instead of simple pattern matching. In \textbf{grab}
 635 the regex engine is Perl based and allows use of different type of
 636 wild cards, alternatives, \emph{etc}%
 637 \footnote{\url{http://perldoc.perl.org/perlreref.html}%
 638 }. If you want to \textbf{grab} records withs the sequence ATCG or
 639 GCTA you can do this:
 640
 641 \begin{lyxcode}
 642 ...~|~grab~-{}-regex='ATCG|GCTA'
 643 \end{lyxcode}
 644 Or if you want to find sequences beginning with ATCG:
 645
 646 \begin{lyxcode}
 647 ...~|~grab~-{}-regex='\textasciicircum{}ATCG'
 648 \end{lyxcode}
 649 You can also use \textbf{grab} to locate records that fulfill a numerical
 650 property using the -\/-eval switch witch takes an expression in three
 651 parts. The first part is the key that holds the number we want to
 652 evaluate, the second part holds one the six operators:
 653
 654 \begin{enumerate}
 655 \item Greater than: >
 656 \item Greater than or equal to: >=
 657 \item Less than: <
 658 \item Less than or equal to: <=
 659 \item Equal to: =
 660 \item Not equal to: !=
 661 \end{enumerate}
 662 And finally comes the number used in the evaluation. So to \textbf{grab}
 663 all records with a sequence length greater than 30:
 664
 665 \begin{lyxcode}
 666 ...~length\_seq~|~grab~-{}-eval='SEQ\_LEN~>~30'
 667 \end{lyxcode}
 668 If you want to locate all records containing the pattern 'human' and
 669 where the sequence length is greater that 30, you do this by running
 670 the stream through \textbf{grab} twice:
 671
 672 \begin{lyxcode}
 673 ...~|~grab~-{}-pattern='human'~|~length\_seq~|~grab~-{}-eval='SEQ\_LEN~>~30'
 674 \end{lyxcode}
 675 To get the best speed performance, use the most restrictive \textbf{grab}
 676 first.
 677
 678
 679 \subsection{How to remove keys from records?}
 680
 681 To remove one or more specific keys from all records in the data stream
 682 use \textbf{remove\_keys} like this:
 683
 684 \begin{lyxcode}
 685 ...~|~remove\_keys~-{}-keys=SEQ,SEQ\_NAME
 686 \end{lyxcode}
 687 In the above example SEQ and SEQ\_NAME will be removed from all records
 688 if they exists in these. If all keys are removed from a record, then
 689 the record will be removed.
 690
 691
 692 \subsection{How to rename keys in records?}
 693
 694 Sometimes you want to rename a record key, \emph{e.g.} if you have
 695 read in a two column table with sequence name and sequence in each
 696 column (see \ref{sub:How-to-read-table}) without specifying the key
 697 names, then the sequence name will be called V0 and the sequence V1
 698 as default in the \textbf{read\_tab} biotool. To rename the V0 and
 699 V1 keys we need to run the stream through \textbf{rename\_keys} twice
 700 (one for each key to rename):
 701
 702 \begin{lyxcode}
 703 ...~|~rename\_keys~-{}-keys=V0,SEQ\_NAME~|~rename\_keys~-{}-keys=V1,SEQ
 704 \end{lyxcode}
 705 The first instance of \textbf{rename\_keys} replaces all the V0 keys
 706 with SEQ\_NAME, and the second instance of \textbf{rename\_keys} replaces
 707 all the V1 keys with SEQ. \emph{Et viola} the data can now be used
 708 in the biotools that requires these keys.
 709
 710
 711 \section{Manipulating Sequences}
 712
 713
 714 \subsection{How to get sequence lengths?}
 715
 716 The length for sequences in records can be determined with \textbf{length\_seq},
 717 which adds the key SEQ\_LEN to each record with the sequence length
 718 as the value. It also generates an extra record that is emitted last
 719 with the key TOTAL\_SEQ\_LEN showing the total length of all the sequences.
 720
 721 \begin{lyxcode}
 722 read\_fasta~-{}-data\_in=<file>~|~length\_seq
 723 \end{lyxcode}
 724 It is also possible to determine the sequence length using the generic
 725 tool \textbf{length\_vals} (see \#\#\#), which determines the length
 726 of the values for a given list of keys:
 727
 728 \begin{lyxcode}
 729 read\_fasta~-{}-data\_in=<file>~|~length\_vals~-{}-keys=SEQ
 730 \end{lyxcode}
 731 To obtain the total length of all sequences use \textbf{sum\_vals}
 732 like this:
 733
 734 \begin{lyxcode}
 735 read\_fasta~-{}-data\_in=<file>~|~length\_vals~-{}-keys=SEQ
 736
 737 |~sum\_vals~-{}-keys=SEQ\_LEN
 738 \end{lyxcode}
 739 The biotool \textbf{analyze\_seq} will also determine the length of
 740 each sequence (see~\ref{sub:How-to-analyze}).
 741
 742
 743 \subsection{How to analyze sequence composition?\label{sub:How-to-analyze}}
 744
 745 If you want to find out the sequence type, composition, length, as
 746 well as GC content, indel content and proportions of soft and hard
 747 masked sequence, then use \textbf{analyze\_seq}. This handy biotool
 748 will determine all these things per sequence from which it is easy
 749 to get an overview using the \textbf{write\_tab} biotool to output
 750 a table (see~\ref{sub:How-to-write-tab}). So in order to determine
 751 the sequence composition of a FASTA file with just one entry containing
 752 the sequence 'ATCG' we just read the data with \textbf{read\_fasta}
 753 and run the output through \textbf{analyze\_seq} which will add the
 754 analysis to the record like this:
 755
 756 \begin{lyxcode}
 757 read\_fasta~-{}-data\_in=test.fna~|~analyze\_seq~...
 758
 759
 760
 761 -{}-{}-
 762
 763 GC\%:~50.00
 764
 765 HARD\_MASK\%:~0.00
 766
 767 RES:-:~0
 768
 769 RES:.:~0
 770
 771 RES:A:~1
 772
 773 RES:B:~0
 774
 775 RES:C:~1
 776
 777 RES:D:~0
 778
 779 RES:G:~1
 780
 781 RES:H:~0
 782
 783 RES:K:~0
 784
 785 RES:M:~0
 786
 787 RES:N:~0
 788
 789 RES:R:~0
 790
 791 RES:S:~0
 792
 793 RES:T:~1
 794
 795 RES:U:~0
 796
 797 RES:V:~0
 798
 799 RES:W:~0
 800
 801 RES:Y:~0
 802
 803 RES:\textasciitilde{}:~0
 804
 805 SEQ:~ATCG
 806
 807 SEQ\_LEN:~4
 808
 809 SEQ\_NAME:~test
 810
 811 SEQ\_TYPE:~DNA
 812
 813 SOFT\_MASK\%:~0.00
 814 \end{lyxcode}
 815 Now to make a table of how may As, Ts, Cs, and Gs you can add the
 816 following:
 817
 818 \begin{lyxcode}
 819 ...~|~analyze\_seq~|~write\_tab~-{}-keys=RES:A,RES:T,RES:C,RES:G
 820 \end{lyxcode}
 821 Or if you want to see the proportions of hard and soft masked sequence:
 822
 823 \begin{lyxcode}
 824 ...~|~analyse\_seq~|~write\_tab~-{}-keys=HARD\_MASK\%,SOFT\_MASK\%
 825 \end{lyxcode}
 826 If you have a stack of sequences in one file and you want to determine
 827 the mean GC content you can do it using the \textbf{mean\_vals} biotool:
 828
 829 \begin{lyxcode}
 830 read\_fasta~-{}-data\_in=test.fna~|~analyze\_seq~|~mean\_vals~-{}-keys=GC\%
 831 \end{lyxcode}
 832 Or if you want the total count of Ns you can use \textbf{sum\_vals}
 833 like this:
 834
 835 \begin{lyxcode}
 836 read\_fasta~-{}-data\_in=test.fna~|~analyze\_seq~|~sum\_vals~-{}-keys=RES:N
 837 \end{lyxcode}
 838
 839 \subsection{How to extract subsequences?\label{sub:How-to-extract}}
 840
 841 In order to extract a subsequence from a longer sequence use the biotool
 842 extract\_seq, which will replace the sequence in the record with the
 843 subsequence (this behaviour should probably be modified to be dependant
 844 of a -\/-replace or a -\/-no\_replace switch). So to extract the
 845 first 20 residues from all sequences do (first residue is designated
 846 1):
 847
 848 \begin{lyxcode}
 849 ...~|~extract\_seq~-{}-beg=1~-{}-len=20
 850 \end{lyxcode}
 851 You can also specify a begin and end coordinate set:
 852
 853 \begin{lyxcode}
 854 ...~|~extract\_seq~-{}-beg=20~-{}-end=40
 855 \end{lyxcode}
 856 If you want the subsequences from position 20 to the sequence end
 857 do:
 858
 859 \begin{lyxcode}
 860 ...~|~extract\_seq~-{}-beg=20
 861 \end{lyxcode}
 862 If you want to extract subsequences a given distance from the sequence
 863 end you can do this by reversing the sequence with the biotool \textbf{reverse\_seq}
 864 \eqref{sub:How-to-reverse-seq}, followed by \textbf{extract\_seq}
 865 to get the subsequence, and then \textbf{reverse\_seq} again to get
 866 the subsequence back in the original orientation:
 867
 868 \begin{lyxcode}
 869 read\_fasta~-{}-data\_in=test.fna~|~reverse\_seq
 870
 871 |~extract\_seq~-{}-beg=10~-{}-len=10~|~reverse\_seq
 872 \end{lyxcode}
 873
 874 \subsection{How to get genomic sequence?\label{sub:How-to-get-genomic-sequence}}
 875
 876 The biotool \textbf{get\_genomic\_seq} can extract subsequences for
 877 a given genome specified with the -\/-genome switch explicitly using
 878 the -\/-beg and -\/-end/-\/-len switches:
 879
 880 \begin{lyxcode}
 881 get\_genome\_seq~-{}-genome=<genome>~-{}-beg=1~-{}-len=100
 882 \end{lyxcode}
 883 Alternatively, \textbf{get\_genome\_seq} can be used to append the
 884 corresponding sequence to BED, PSL, and BLAST records:
 885
 886 \begin{lyxcode}
 887 read\_bed~-{}-data\_in=<BED~file>~|~get\_genome\_seq~-{}-genome=<genome>
 888 \end{lyxcode}
 889
 890 \subsection{How to upper-case sequences?}
 891
 892 Sequences can be shifted from lower case to upper case using \textbf{uppercase\_seq}:
 893
 894 \begin{lyxcode}
 895 ...~|~uppercase\_seq
 896 \end{lyxcode}
 897
 898 \subsection{How to reverse sequences?\label{sub:How-to-reverse-seq}}
 899
 900 The order of residues in a sequence can be reversed using reverse\_seq:
 901
 902 \begin{lyxcode}
 903 ...~|~reverse\_seq
 904 \end{lyxcode}
 905 Note that in order to reverse/complement a sequence you also need
 906 the \textbf{complement\_seq} biotool (see~\ref{sub:How-to-complement}).
 907
 908
 909 \subsection{How to complement sequences?\label{sub:How-to-complement}}
 910
 911 DNA and RNA sequences can be complemented with \textbf{complement\_seq},
 912 which automagically determines the sequence type:
 913
 914 \begin{lyxcode}
 915 ...~|~complement\_seq
 916 \end{lyxcode}
 917 Note that in order to reverse/complement a sequence you also need
 918 the \textbf{reverse\_seq} biotool (see~\ref{sub:How-to-reverse-seq}).
 919
 920
 921 \subsection{How to remove indels from sequnces?}
 922
 923 Indels can be removed from sequences with the \textbf{remove\_indels}
 924 biotool. This is useful if you have aligned some sequences (see~\ref{sub:How-to-align})
 925 and extracted (see~\ref{sub:How-to-extract}) a block of subsequences
 926 from the alignment and you want to use these sequence in a search
 927 where you need to remove the indels first. '-', '\textasciitilde{}',
 928 and '.' are considered indels:
 929
 930 \begin{lyxcode}
 931 ...~|~remove\_indels
 932 \end{lyxcode}
 933
 934 \subsection{How to split sequences into overlapping subsequences?}
 935
 936 Sequences can be slit into overlapping subsequences with the \textbf{split\_seq}
 937 biotool.
 938
 939 \begin{lyxcode}
 940 ...~|~split\_seq~-{}-word\_size=20~-{}-uniq
 941 \end{lyxcode}
 942
 943 \subsection{How to determine the oligo frequency?}
 944
 945 In order to determine if any oligo usage is over represented in one
 946 or more sequences you can determine the frequency of oligos of a given
 947 size with \textbf{oligo\_freq}:
 948
 949 \begin{lyxcode}
 950 ...~|~oligo\_freq~-{}-word\_size=4
 951 \end{lyxcode}
 952 And if you have more than one sequence and want to accumulate the
 953 frequences you need the -\/-all switch:
 954
 955 \begin{lyxcode}
 956 ...~|~oligo\_freq~-{}-word\_size=4~-{}-all
 957 \end{lyxcode}
 958 To get a meaningful result you need to write the resulting frequencies
 959 as a table with \textbf{write\_tab} (see~\ref{sub:How-to-write-tab}),
 960 but first it is important to \textbf{grab} (see~\ref{sub:How-to-grab})
 961 the records with the frequencies to avoid full length sequences in
 962 the table:
 963
 964 \begin{lyxcode}
 965 ...~|~oligo\_freq~-{}-word\_size=4~-{}-all~|~grab~-{}-pattern=OLIGO~-{}-keys\_only
 966
 967 |~write\_tab~-{}-no\_stream
 968 \end{lyxcode}
 969 And the resulting frequency table can be sorted with Unix sort (man
 970 sort).
 971
 972
 973 \subsection{How to search for sequences in genomes?}
 974
 975 See the following biotool:
 976
 977 \begin{itemize}
 978 \item \textbf{patscan\_seq} \eqref{sub:How-to-use-patscan}
 979 \item \textbf{blat\_seq} \eqref{sub:How-to-use-BLAT}
 980 \item \textbf{blast\_seq} \eqref{sub:How-to-use-BLAST}
 981 \item \textbf{vmatch\_seq} \eqref{sub:How-to-use-Vmatch}
 982 \end{itemize}
 983
 984 \subsection{How to search sequences for a pattern?\label{sub:How-to-use-patscan}}
 985
 986 It is possible to search sequences in the data stream for patterns
 987 using the \textbf{patscan\_seq} biotool which utilizes the powerful
 988 scan\_for\_matches engine. Consult the documentation for scan\_for\_matches
 989 in order to learn how to define patterns (the documentation is included
 990 in Appendix~\ref{sec:scan_for_matches-README}).
 991
 992 To search all sequences for a simple pattern consisting of the sequence
 993 ATCGATCG allowing for 3 mismatches, 2 insertions and 1 deletion:
 994
 995 \begin{lyxcode}
 996 read\_fasta~-{}-data\_in=<file>~|~patscan\_seq~-{}-pattern='ATCGATCG{[}3,2,1]'
 997 \end{lyxcode}
 998 The -\/-pattern switch takes a comma seperated list of patterns,
 999 so if you want to search for more that one pattern do:
1000
1001 \begin{lyxcode}
1002 ...~|~patscan\_seq~-{}-pattern='ATCGATCG{[}3,2,1],GCTAGCTA{[}3,2,1]'
1003 \end{lyxcode}
1004 It is also possible to have a list of different patterns to search
1005 for in a file with one pattern per line. In order to get \textbf{patscan\_seq}
1006 to read these patterns use the -\/-pattern\_in switch:
1007
1008 \begin{lyxcode}
1009 ...~|~patscan\_seq~-{}-pattern\_in=<file>
1010 \end{lyxcode}
1011 To also scan the complementary strand in nucleotide sequences (\textbf{patscan\_seq}
1012 automagically determines the sequence type) you need to add the -\/-comp
1013 switch:
1014
1015 \begin{lyxcode}
1016 ...~|~patscan\_seq~-{}-pattern=<pattern>~-{}-comp
1017 \end{lyxcode}
1018 It is also possible to use \textbf{patscan\_seq} to output those records
1019 that does not contain a certain pattern by using the -\/-invert switch:
1020
1021 \begin{lyxcode}
1022 ...~|~patscan\_seq~-{}-pattern=<pattern>~-{}-invert
1023 \end{lyxcode}
1024 Finally, \textbf{patscan\_seq} can also scan for patterns in a given
1025 genome sequence, instead of sequences in the stream, using the -\/-genome
1026 switch:
1027
1028 \begin{lyxcode}
1029 patscan~-{}-pattern=<pattern>~-{}-genome=<genome>
1030 \end{lyxcode}
1031
1032 \subsection{How to use BLAT for sequence search?\label{sub:How-to-use-BLAT}}
1033
1034 Sequences in the data stream can be matched against supported genomes
1035 using \textbf{blat\_seq} which is a biotool using BLAT as the name
1036 might suggest. Currently only Mouse and Human genomes are available
1037 and it is not possible to use OOC files since there is still a need
1038 for a local repository for genome files. Otherwise it is just:
1039
1040 \begin{lyxcode}
1041 read\_fasta~-{}-data\_in=<file>~|~blat\_seq~-{}-genome=<genome>
1042 \end{lyxcode}
1043 The search results can then be written to file with \textbf{write\_psl}
1044 (see~\ref{sub:How-to-write-PSL}) or \textbf{write\_bed} (see~\ref{sub:How-to-write-BED})
1045 allthough with \textbf{write\_bed} some information will be lost).
1046 It is also possible to plot chromosome distribution of the search
1047 results using \textbf{plot\_chrdist} (see~\ref{sub:How-to-plot-chrdist})
1048 or the distribution of the match lengths using \textbf{plot\_lendist}
1049 (see~\ref{sub:How-to-plot-lendist}) or a karyogram with the hits
1050 using \textbf{plot\_karyogram} (see~\ref{sub:How-to-plot-karyogram}).
1051
1052
1053 \subsection{How to use BLAST for sequence search?\label{sub:How-to-use-BLAST}}
1054
1055 Two biotools exist for blasting sequences: \textbf{create\_blast\_db}
1056 is used to create the BLAST database required for BLAST which is queried
1057 using the biotool \textbf{blast\_seq}. So in order to create a BLAST
1058 database from sequences in the data stream you simple run:
1059
1060 \begin{lyxcode}
1061 ...~|~create\_blast\_db~-{}-database=my\_database~-{}-no\_stream
1062 \end{lyxcode}
1063 The type of sequence to use for the database is automagically determined
1064 by \textbf{create\_blast\_db}, but don't have a mixture of peptide
1065 and nucleic acids sequences in the stream. The -\/-database switch
1066 takes a path as argument, but will default to 'blastdb\_<time\_stamp>
1067 if not set.
1068
1069 The resulting database can now be queried with sequences in another
1070 data stream using \textbf{blast\_seq}:
1071
1072 \begin{lyxcode}
1073 ...~|~blast\_seq~-{}-database=my\_database
1074 \end{lyxcode}
1075 Again, the sequence type is determined automagically and the appropriate
1076 BLAST program is guessed (see below table), however, the program name
1077 can be overruled with the -\/-program switch.
1078
1079 \noindent \begin{center}
1080 \begin{tabular}{ccc}
1081 Subject sequence & Query sequence & Program guess\tabularnewline
1082 \hline
1083 Nucleotide & Nucleotide & blastn\tabularnewline
1084 Protein & Protein & blastp\tabularnewline
1085 Protein & Nucleotide & blastx\tabularnewline
1086 Nucleotide & Protein & tblastn\tabularnewline
1087 \end{tabular}
1088 \par\end{center}
1089
1090 Finally, it is also possible to use \textbf{blast\_seq} for blasting
1091 sequences agains a preformatted genome using the -\/-genome switch
1092 instead of the -\/-database switch:
1093
1094 \begin{lyxcode}
1095 ...~|~blast\_seq~-{}-genome=<genome>
1096 \end{lyxcode}
1097
1098 \subsection{How to use Vmatch for sequence search?\label{sub:How-to-use-Vmatch}}
1099
1100 The powerful suffix array software package Vmatch%
1101 \footnote{\url{http://www.vmatch.de/}%
1102 } can be used for exact mapping of sequences against indexed genomes
1103 using the biotool \textbf{vmatch\_seq}, which will e.g. map 700000
1104 ESTs to the human genome locating all 160 mio hits in less than an
1105 hour.
1106
1107 \begin{lyxcode}
1108 ...~|~vmatch\_seq~-{}-genome=<genome>
1109 \end{lyxcode}
1110 Only nucleotide sequences and sequences longer than 11 nucleotides
1111 will be mapped. The resulting SCORE key will hold the number of genome
1112 matches of a given sequence (multi-mappers).
1113
1114
1115 \subsection{How to find all matches between sequences?\label{sub:How-to-find-matches}}
1116
1117 All matches between two sequences can be determined with the biotool
1118 \textbf{match\_seq}. The match finding engine underneath the hood
1119 of \textbf{match\_seq} is the super fast suffix tree program MUMmer%
1120 \footnote{\url{http://mummer.sourceforge.net/}%
1121 }, which will locate all forward and reverse matches between huge sequences
1122 in a matter of minutes (if the repeat count is not too high and if
1123 the word size used is appropriate). Matching two \emph{Helicobacter
1124 pylori} genomes (1.7Mbp) takes around 10 seconds:
1125
1126 \begin{lyxcode}
1127 ...~|~match\_seq~-{}-word\_size=20~-{}-direction=both
1128 \end{lyxcode}
1129 The output from \textbf{match\_seq} can be used to generate a dot
1130 plot with \textbf{plot\_matches} (see~\ref{sub:How-to-generate-dotplot}).
1131
1132
1133 \subsection{How to align sequences?\label{sub:How-to-align}}
1134
1135 Sequences in the stream can be aligned with the \textbf{align\_seq}
1136 biotool that uses Muscle%
1137 \footnote{\url{http://www.drive5.com/muscle/muscle.html}%
1138 } as aligment engine. Currently you cannot change any of the Muscle
1139 alignment parameters and \textbf{align\_seq} will create an alignment
1140 based on the defaults (which are really good!):
1141
1142 \begin{lyxcode}
1143 ...~|~align\_seq
1144 \end{lyxcode}
1145 The aligned output can be written to file in FASTA format using \textbf{write\_fasta}
1146 (see~\ref{sub:How-to-write-fasta}) or in pretty text using \textbf{write\_align}
1147 (see~\ref{sub:How-to-write-alignment}).
1148
1149
1150 \subsection{How to create a weight matrix?}
1151
1152 If you want a weight matrix to show the sequence composition of a
1153 stack of sequences you can use the biotool create\_weight\_matrix:
1154
1155 \begin{lyxcode}
1156 ...~|~create\_weight\_matrix
1157 \end{lyxcode}
1158 The result can be output in percent using the -\/-percent switch:
1159
1160 \begin{lyxcode}
1161 ...~|~create\_weight\_matrix~-{}-percent
1162 \end{lyxcode}
1163 The weight matrix can be written as tabular output with \textbf{write\_tab}
1164 (see~\ref{sub:How-to-write-tab}) after removeing the records containing
1165 SEQ with \textbf{grab} (see~\ref{sub:How-to-grab}):
1166
1167 \begin{lyxcode}
1168 ...~|~create\_weight\_matrix~|~grab~-{}-invert~-{}-keys=SEQ~-{}-keys\_only
1169
1170 |~write\_tab~-{}-no\_stream
1171 \end{lyxcode}
1172 The V0 column will hold the residue, while the rest of the columns
1173 will hold the frequencies for each sequence position.
1174
1175
1176 \section{Plotting}
1177
1178 There exists several biotools for plotting. Some of these are based
1179 on GNUplot%
1180 \footnote{\url{http://www.gnuplot.info/}%
1181 }, which is an extremely powerful platform to generate all sorts of
1182 plots and even though GNUplot has quite a steep learning curve, the
1183 biotools utilizing GNUplot are simple to use. GNUplot is able to output
1184 a lot of different formats (called terminals in GNUplot), but the
1185 biotools focusses on three formats only:
1186
1187 \begin{enumerate}
1188 \item The 'dumb' terminal is default to the GNUplot based biotools and will
1189 output a plot in crude ASCII text (Fig.~\ref{fig:Dumb-terminal}).
1190 This is quite nice for a quick and dirty plot to get an overview of
1191 your data .
1192 \item The 'post' or 'postscript' terminal output postscript code which is
1193 publication grade graphics that can be viewed with applications such
1194 as Ghostview, Photoshop, and Preview.
1195 \item The 'svg' terminal output's scalable vector graphics (SVG) which is
1196 a vector based format. SVG is great because you can edit the resulting
1197 plot using Photoshop or Inkscape%
1198 \footnote{Inkscape is a really handy drawing program that is free and open source.
1199 Availble at \url{http://www.inkscape.org}%
1200 } if you want to add additional labels, captions, arrows, and so on
1201 and then save the result in different formats, such as postscript
1202 without loosing resolution.
1203 \end{enumerate}
1204 The biotools for plotting that are not based on GNUplot only output
1205 SVG (that may change in the future).
1206
1207 %
1208 \begin{figure}
1209 \noindent \begin{centering}
1210 \includegraphics[width=12cm]{lendist_ascii}
1211 \par\end{centering}
1212
1213 \caption{\label{fig:Dumb-terminal}Dumb terminal}
1214
1215
1216 \begin{quote}
1217 The output of a length distribution plot in the default 'dumb terminal'
1218 to the terminal window.
1219 \end{quote}
1220
1221 \end{figure}
1222
1223
1224
1225 \subsection{How to plot a histogram?\label{How-to-plot-histogram}}
1226
1227 A generic histogram for a given value can be plotted with the biotool
1228 \textbf{plot\_histogram} (Fig.~\ref{fig:Histogram}):
1229
1230 \begin{lyxcode}
1231 ...~|~plot\_histogram~-{}-key=TISSUE~-{}-no\_stream
1232 \end{lyxcode}
1233 (Figure missing)
1234
1235 \noindent \begin{flushleft}
1236 %
1237 \begin{figure}
1238 \noindent \begin{centering}
1239 \includegraphics[width=12cm]{histogram}
1240 \par\end{centering}
1241
1242 \caption{\label{fig:Histogram}Histogram}
1243
1244 \end{figure}
1245
1246 \par\end{flushleft}
1247
1248
1249 \subsection{How to plot a length distribution?\label{sub:How-to-plot-lendist}}
1250
1251 Plotting of length distributions, weather sequence lengths, patterns
1252 lengths, hit lengths, \emph{etc.} is a really handy thing and can
1253 be done with the the biotool \textbf{plot\_lendist}. If you have a
1254 file with FASTA entries and want to plot the length distribution you
1255 do it like this:
1256
1257 \begin{lyxcode}
1258 read\_fasta~-{}-data\_in=<file>~|~length\_seq
1259
1260 |~plot\_lendist~-{}-key=SEQ\_LEN~-{}-no\_stream
1261 \end{lyxcode}
1262 The result will be written to the default dumb terminal and will look
1263 like Fig.~\ref{fig:Dumb-terminal}.
1264
1265 If you instead want the result in postscript format you can do:
1266
1267 \begin{lyxcode}
1268 ...~|~plot\_lendist~-{}-key=SEQ\_LEN~-{}-terminal=post~-{}-result\_out=file.ps
1269 \end{lyxcode}
1270 That will generate the plot and save it to file, but not interrupt
1271 the data stream which can then be used in further analysis. You can
1272 also save the plot implicetly using '>', however, it is then important
1273 to terminate the stream with the -\/-no\_stream switch:
1274
1275 \begin{lyxcode}
1276 ...~|~plot\_lendist~-{}-key=SEQ\_LEN~-{}-terminal=post~-{}-no\_stream~>~file.ps
1277 \end{lyxcode}
1278 The resulting plot can be seen in Fig.~\ref{fig:Length-distribution}.
1279
1280 %
1281 \begin{figure}
1282
1283
1284 \noindent \begin{centering}
1285 \includegraphics[width=12cm]{lendist}
1286 \par\end{centering}
1287
1288 \caption{\label{fig:Length-distribution}Length distribution}
1289
1290
1291 \begin{quote}
1292 Length distribution of 630 piRNA like RNAs.
1293 \end{quote}
1294
1295 \end{figure}
1296
1297
1298
1299 \subsection{How to plot a chromosome distribution?\label{sub:How-to-plot-chrdist}}
1300
1301 If you have the result of a sequence search against a multi chromosome
1302 genome, it is very practical to be able to plot the distribution of
1303 search hits on the different chromosomes. This can be done with \textbf{plot\_chrdist}:
1304
1305 \begin{lyxcode}
1306 read\_fasta~-{}-data\_in=<file>~|~blat\_genome~|~plot\_chrdist~-{}-no\_stream
1307 \end{lyxcode}
1308 The above example will result in a crude plot using the 'dumb' terminal,
1309 and if you want to mess around with the results from the BLAT search
1310 you probably want to save the result to file first (see~\ref{sub:How-to-write-PSL}).
1311 To plot the chromosome distribution from the saved search result you
1312 can do:
1313
1314 \begin{lyxcode}
1315 read\_bed~-{}-data\_in=file.bed~|~plot\_chrdist~-{}-terminal=post~-{}-result\_out=plot.ps
1316 \end{lyxcode}
1317 That will result in the output show in Fig.~\ref{fig:Chromosome-distribution}.
1318
1319 %
1320 \begin{figure}
1321
1322
1323 \noindent \begin{centering}
1324 \includegraphics[angle=90,width=12cm]{chrdist}
1325 \par\end{centering}
1326
1327 \caption{\label{fig:Chromosome-distribution}Chromosome distribution}
1328
1329 \end{figure}
1330
1331
1332
1333 \subsection{How to generate a dotplot?\label{sub:How-to-generate-dotplot}}
1334
1335 A dotplot is a powerful way to get an overview of the size and location
1336 of sequence insertions, deletions, and duplications between two sequences.
1337 Generating a dotplot with biotools is a two step process where you
1338 initially find all matches between two sequences using the tool \textbf{match\_seq}
1339 (see~\ref{sub:How-to-find-matches}) and plot the resulting matches
1340 with \textbf{plot\_matches}. Matching and plotting two \emph{Helicobacter
1341 pylori} genomes (1.7Mbp) takes around 10 seconds:
1342
1343 \begin{lyxcode}
1344 ...~|~match\_seq~|~plot\_matches~-{}-terminal=post~-{}-result\_out=plot.ps
1345 \end{lyxcode}
1346 The resulting dotplot is in Fig.~\ref{fig:Dotplot}.
1347
1348 %
1349 \begin{figure}
1350 \noindent \begin{centering}
1351 \includegraphics[width=12cm]{dotplot}
1352 \par\end{centering}
1353
1354 \caption{\label{fig:Dotplot}Dotplot}
1355
1356
1357 \begin{quote}
1358 Forward matches are displayed in green while reverse matches are displayed
1359 in red.
1360 \end{quote}
1361
1362 \end{figure}
1363
1364
1365
1366 \subsection{How to plot a sequence logo?}
1367
1368 Sequence logos can be generate with \textbf{plot\_seqlogo}. The sequnce
1369 type is determined automagically and an entropy scale of 2 bits and
1370 4 bits is used for nucleotide and peptide sequences, respectively%
1371 \footnote{\url{http://www.ccrnp.ncifcrf.gov/~toms/paper/hawaii/latex/node5.html}%
1372 }.
1373
1374 \begin{lyxcode}
1375 ...~|~plot\_seqlogo~-{}-no\_stream~-{}-result\_out=seqlogo.svg
1376 \end{lyxcode}
1377 An example of a sequence logo can be seen in Fig.~\ref{fig:Sequence-logo}.
1378
1379 %
1380 \begin{figure}
1381 \noindent \begin{centering}
1382 \includegraphics[width=12cm]{seqlogo}
1383 \par\end{centering}
1384
1385 \caption{\label{fig:Sequence-logo}Sequence logo}
1386
1387 \end{figure}
1388
1389
1390
1391 \subsection{How to plot a karyogram?\label{sub:How-to-plot-karyogram}}
1392
1393 To plot search hits on genomes use \textbf{plot\_karyogram}, which
1394 will output a nice karyogram in SVG graphics:
1395
1396 \begin{lyxcode}
1397 ...~|~plot\_karyogram~-{}-result\_out=karyogram.svg
1398 \end{lyxcode}
1399 The banding data is taken from the UCSC genome browser database and
1400 currently only Human and Mouse is supported. Fig.~\ref{fig:Karyogram}
1401 shows the distribution of piRNA like RNAs matched to the Human genome.
1402
1403 %
1404 \begin{figure}
1405 \noindent \begin{centering}
1406 \includegraphics[width=12cm]{karyogram}
1407 \par\end{centering}
1408
1409 \caption{\label{fig:Karyogram}Karyogram}
1410
1411
1412 \begin{quote}
1413 Hits from a search of piRNA like RNAs in the Human genome is displayed
1414 as short horizontal bars.
1415 \end{quote}
1416
1417 \end{figure}
1418
1419
1420
1421 \section{Uploading Results}
1422
1423
1424 \subsection{How do I display my results in the UCSC Genome Browser?}
1425
1426 Results from the list of biotools below can be uploaded directly to
1427 a local mirror of the UCSC Genome Browser using the biotool \textbf{upload\_to\_ucsc}:
1428
1429 \begin{itemize}
1430 \item patscan\_seq \eqref{sub:How-to-use-patscan}
1431 \item blat\_seq \eqref{sub:How-to-use-BLAT}
1432 \item blast\_seq \eqref{sub:How-to-use-BLAST}
1433 \item vmatch\_seq \eqref{sub:How-to-use-Vmatch}
1434 \end{itemize}
1435 The syntax for uploading data the most simple way requires two mandatory
1436 switches: -\/-database, which is the UCSC database name (such as
1437 hg18, mm9, etc.) and-\/-table which should be the users initials
1438 followed by an underscore and a short description of the data:
1439
1440 \begin{lyxcode}
1441 ...~|~upload\_to\_ucsc~-{}-database=hg18~-{}-table=mah\_snoRNAs
1442 \end{lyxcode}
1443 The \textbf{upload\_to\_ucsc} biotool modifies the users \textasciitilde{}/ucsc/my\_tracks.ra
1444 file automagically (a backup is created with the name \textasciitilde{}/ucsc/my\_tracks.ra\textasciitilde{})
1445 with default values that can be overridden using the following switches:
1446
1447 \begin{itemize}
1448 \item -\/-short\_label - Short label for track - Default=database->table
1449 \item -\/-long\_label - Long label for track - Default=database->table
1450 \item -\/-group - Track group name - Default=<user name as defined in env>
1451 \item -\/-priority - Track display priority - Default=1
1452 \item -\/-color - Track color - Default=147,73,42
1453 \item -\/-chunk\_size - Chunks for loading - Default=10000000
1454 \item -\/-visibility - Track visibility - Default=pack
1455 \end{itemize}
1456 Also, data in BED or PSL format can be uploaded with \textbf{upload\_to\_ucsc}
1457 as long as these reference to genomes and chromosomes existing in
1458 the UCSC Genome Browser:
1459
1460 \begin{lyxcode}
1461 read\_bed~-{}-data\_in=<bed~file>~|~upload\_to\_ucsc~...
1462
1463
1464
1465 read\_psl~-{}-data\_in=<psl~file>~|~upload\_to\_ucsc~...
1466 \end{lyxcode}
1467
1468 \section{Trouble shooting}
1469
1470 Shoot the messenger!
1471
1472 \appendix
1473
1474 \section{Keys\label{sec:Keys}}
1475
1476 HIT
1477
1478 HIT\_BEG
1479
1480 HIT\_END
1481
1482 HIT\_LEN
1483
1484 HIT\_NAME
1485
1486 PATTERN
1487
1488
1489 \section{Switches\label{sec:Switches}}
1490
1491 -\/-stream\_in
1492
1493 -\/-stream\_out
1494
1495 -\/-no\_stream
1496
1497 -\/-data\_in
1498
1499 -\/-result\_out
1500
1501 -\/-num
1502
1503
1504 \section{scan\_for\_matches README\label{sec:scan_for_matches-README}}
1505
1506 \begin{lyxcode}
1507 ~~~~~~~~~~~~~~~~~~~~~~~~~~scan\_for\_matches:
1508
1509 ~~~~A~Program~to~Scan~Nucleotide~or~Protein~Sequences~for~Matching~Patterns
1510
1511 ~~~~~~~~~~~~~~~~~~~~~~~~Ross~Overbeek
1512
1513 ~~~~~~~~~~~~~~~~~~~~~~~~MCS
1514
1515 ~~~~~~~~~~~~~~~~~~~~~~~~Argonne~National~Laboratory
1516
1517 ~~~~~~~~~~~~~~~~~~~~~~~~Argonne,~IL~60439
1518
1519 ~~~~~~~~~~~~~~~~~~~~~~~~USA
1520
1521 Scan\_for\_matches~is~a~utility~that~we~have~written~to~search~for
1522
1523 patterns~in~DNA~and~protein~sequences.~~I~wrote~most~of~the~code,
1524
1525 although~David~Joerg~and~Morgan~Price~wrote~sections~of~an
1526
1527 earlier~version.~~The~whole~notion~of~pattern~matching~has~a~rich
1528
1529 history,~and~we~borrowed~liberally~from~many~sources.~~However,~it~is
1530
1531 worth~noting~that~we~were~strongly~influenced~by~the~elegant~tools
1532
1533 developed~and~distributed~by~David~Searls.~~My~intent~is~to~make~the
1534
1535 existing~tool~available~to~anyone~in~the~research~community~that~might
1536
1537 find~it~useful.~~I~will~continue~to~try~to~fix~bugs~and~make~suggested
1538
1539 enhancements,~at~least~until~I~feel~that~a~superior~tool~exists.
1540
1541 Hence,~I~would~appreciate~it~if~all~bug~reports~and~suggestions~are
1542
1543 directed~to~me~at~Overbeek@mcs.anl.gov.~~
1544
1545 I~will~try~to~log~all~bug~fixes~and~report~them~to~users~that~send~me
1546
1547 their~email~addresses.~~I~do~not~require~that~you~give~me~your~name
1548
1549 and~address.~~However,~if~you~do~give~it~to~me,~I~will~try~to~notify
1550
1551 you~of~serious~problems~as~they~are~discovered.
1552
1553 Getting~Started:
1554
1555 ~~~~The~distribution~should~contain~at~least~the~following~programs:
1556
1557 ~~~~~~~~~~~~~~~~README~~~~~~~~~~~~~~~~~~-~~~~~This~document
1558
1559 ~~~~~~~~~~~~~~~~ggpunit.c~~~~~~~~~~~~~~~-~~~~~One~of~the~two~source~files
1560
1561 ~~~~~~~~~~~~~~~~scan\_for\_matches.c~~~~~~-~~~~~The~second~source~file
1562
1563 ~~~~~~~~~~~~~~~~
1564
1565 ~~~~~~~~~~~~~~~~run\_tests~~~~~~~~~~~~~~~-~~~~~A~perl~script~to~test~things
1566
1567 ~~~~~~~~~~~~~~~~show\_hits~~~~~~~~~~~~~~~-~~~~~A~handy~perl~script
1568
1569 ~~~~~~~~~~~~~~~~test\_dna\_input~~~~~~~~~~-~~~~~Test~sequences~for~DNA
1570
1571 ~~~~~~~~~~~~~~~~test\_dna\_patterns~~~~~~~-~~~~~Test~patterns~for~DNA~scan
1572
1573 ~~~~~~~~~~~~~~~~test\_output~~~~~~~~~~~~~-~~~~~Desired~output~from~test
1574
1575 ~~~~~~~~~~~~~~~~test\_prot\_input~~~~~~~~~-~~~~~Test~protein~sequences
1576
1577 ~~~~~~~~~~~~~~~~test\_prot\_patterns~~~~~~-~~~~~Test~patterns~for~proteins
1578
1579 ~~~~~~~~~~~~~~~~testit~~~~~~~~~~~~~~~~~~-~~~~~a~perl~script~used~for~test
1580
1581 ~~~~Only~the~first~three~files~are~required.~~The~others~are~useful,
1582
1583 ~~~~but~only~if~you~have~Perl~installed~on~your~system.~~If~you~do
1584
1585 ~~~~have~Perl,~I~suggest~that~you~type
1586
1587 ~~~~~~~~
1588
1589 ~~~~~~~~~~~~~~~~which~perl
1590
1591 ~~~~to~find~out~where~it~installed.~~On~my~system,~I~get~the~following
1592
1593 ~~~~response:
1594
1595 ~~~~~~~~
1596
1597 ~~~~~~~~~~~~~~~~clone\%~which~perl
1598
1599 ~~~~~~~~~~~~~~~~/usr/local/bin/perl
1600
1601 ~~~~indicating~that~Perl~is~installed~in~/usr/local/bin.~~Anyway,~once
1602
1603 ~~~~you~know~where~it~is~installed,~edit~the~first~line~of~files~
1604
1605 ~~~~~~~~testit
1606
1607 ~~~~~~~~show\_hits
1608
1609 ~~~~replacing~/usr/local/bin/perl~with~the~appropriate~location.~~I
1610
1611 ~~~~will~assume~that~you~can~do~this,~although~it~is~not~critical~(it
1612
1613 ~~~~is~needed~only~to~test~the~installation~and~to~use~the~\char`\"{}show\_hits\char`\"{}
1614
1615 ~~~~utility).~~Perl~is~not~required~to~actually~install~and~run
1616
1617 ~~~~scan\_for\_matches.~
1618
1619 ~~~~If~you~do~not~have~Perl,~I~suggest~you~get~it~and~install~it~(it
1620
1621 ~~~~is~a~wonderful~utility).~~Information~about~Perl~and~how~to~get~it
1622
1623 ~~~~can~be~found~in~the~book~\char`\"{}Programming~Perl\char`\"{}~by~Larry~Wall~and
1624
1625 ~~~~Randall~L.~Schwartz,~published~by~O'Reilly~\&~Associates,~Inc.
1626
1627 ~~~~To~get~started,~you~will~need~to~compile~the~program.~~~I~do~this
1628
1629 ~~~~using~
1630
1631 ~~~~~~~~gcc~-O~-o~scan\_for\_matches~~ggpunit.c~scan\_for\_matches.c
1632
1633 ~~~~If~you~do~not~use~GNU~C,~use~
1634
1635 ~~~~~~~~cc~-O~-DCC~-o~scan\_for\_matches~~ggpunit.c~scan\_for\_matches.c
1636
1637 ~~~~which~works~on~my~Sun.~~
1638
1639 ~~~~Once~you~have~compiled~scan\_for\_matches,~you~can~verify~that~it
1640
1641 ~~~~works~with
1642
1643 ~~~~~~~~clone\%~run\_tests~tmp
1644
1645 ~~~~~~~~clone\%~diff~tmp~test\_output
1646
1647 ~~~~You~may~get~a~few~strange~lines~of~the~sort
1648
1649 ~~~~~~~~clone\%~run\_tests~tmp
1650
1651 ~~~~~~~~rm:~tmp:~No~such~file~or~directory
1652
1653 ~~~~~~~~clone\%~diff~tmp~test\_output
1654
1655 ~~~~These~should~cause~no~concern.~~However,~if~the~\char`\"{}diff\char`\"{}~shows~that
1656
1657 ~~~~tmp~and~test\_output~are~different,~contact~me~(you~have~a
1658
1659 ~~~~problem).~
1660
1661 ~~~~You~should~now~be~able~to~use~scan\_for\_matches~by~following~the
1662
1663 ~~~~instructions~given~below~(which~is~all~the~normal~user~should~have
1664
1665 ~~~~to~understand,~once~things~are~installed~properly).
1666
1667 ~==============================================================
1668
1669 How~to~run~scan\_for\_matches:
1670
1671 ~~~~To~run~the~program,~you~type~need~to~create~two~files
1672
1673 ~~~~1.~~the~first~file~contains~the~pattern~you~wish~to~scan~for;~I'll
1674
1675 ~~~~~~~~call~this~file~pat\_file~in~what~follows~(but~any~name~is~ok)
1676
1677 ~~~~2.~~the~second~file~contains~a~set~of~sequences~to~scan.~~These
1678
1679 ~~~~~~~~should~be~in~\char`\"{}fasta~format\char`\"{}.~~Just~look~at~the~contents~of
1680
1681 ~~~~~~~~test\_dna\_input~to~see~examples~of~this~format.~~Basically,
1682
1683 ~~~~~~~~each~sequence~begins~with~a~line~of~the~form
1684
1685 ~~~~~~~~~~~>sequence\_id
1686
1687 ~~~~~~~~and~is~followed~by~one~or~more~lines~containing~the~sequence.
1688
1689 ~~~~Once~these~files~have~been~created,~you~just~use
1690
1691 ~~~~~~~~scan\_for\_matches~pat\_file~<~input\_file
1692
1693 ~~~~to~scan~all~of~the~input~sequences~for~the~given~pattern.~~As~an
1694
1695 ~~~~example,~suppose~that~pat\_file~contains~a~single~line~of~the~form
1696
1697 ~~~~~~~~~~~~~~~~p1=4...7~3...8~\textasciitilde{}p1
1698
1699 ~~~~Then,
1700
1701 ~~~~~~~~~~~~~~~~scan\_for\_matches~pat\_file~<~test\_dna\_input
1702
1703 ~~~~should~produce~two~\char`\"{}hits\char`\"{}.~~When~I~run~this~on~my~machine,~I~get
1704
1705 ~~~~~~~~clone\%~scan\_for\_matches~pat\_file~<~test\_dna\_input
1706
1707 ~~~~~~~~>tst1:{[}6,27]
1708
1709 ~~~~~~~~cguaacc~ggttaacc~gguuacg~
1710
1711 ~~~~~~~~>tst2:{[}6,27]
1712
1713 ~~~~~~~~CGUAACC~GGTTAACC~GGUUACG~
1714
1715 ~~~~~~~~clone\%~
1716
1717 Simple~Patterns~Built~by~Matching~Ranges~and~Reverse~Complements
1718
1719 ~~~~Let~me~first~explain~this~simple~pattern:
1720
1721 ~~~~~~~~~~~~~~~~
1722
1723 ~~~~~~~~~~~~~~~~p1=4...7~3...8~\textasciitilde{}p1
1724
1725 ~~~~The~pattern~consists~of~three~\char`\"{}pattern~units\char`\"{}~separated~by~spaces.
1726
1727 ~~~~The~first~pattern~unit~is
1728
1729 ~~~~~~~~~~~~~~~~p1=4...7
1730
1731 ~~~~which~means~\char`\"{}match~4~to~7~characters~and~call~them~p1\char`\"{}.~~The
1732
1733 ~~~~second~pattern~unit~is
1734
1735 ~~~~~~~~~~~~~~~~3...8
1736
1737 ~~~~which~means~\char`\"{}then~match~3~to~8~characters\char`\"{}.~~The~last~pattern~unit
1738
1739 ~~~~is~
1740
1741 ~~~~~~~~~~~~~~~~\textasciitilde{}p1
1742
1743 ~~~~which~means~\char`\"{}match~the~reverse~complement~of~p1\char`\"{}.~~The~first
1744
1745 ~~~~reported~hit~is~shown~as
1746
1747 ~~~~~~~~>tst1:{[}6,27]
1748
1749 ~~~~~~~~cguaacc~ggttaacc~gguuacg~
1750
1751 ~~~~which~states~that~characters~6~through~27~of~sequence~tst1~were
1752
1753 ~~~~matched.~~\char`\"{}cguaac\char`\"{}~matched~the~first~pattern~unit,~\char`\"{}ggttaacc\char`\"{}~the
1754
1755 ~~~~second,~and~\char`\"{}gguuacg\char`\"{}~the~third.~~This~is~an~example~of~a~common
1756
1757 ~~~~type~of~pattern~used~to~search~for~sections~of~DNA~or~RNA~that
1758
1759 ~~~~would~fold~into~a~hairpin~loop.
1760
1761 Searching~Both~Strands
1762
1763 ~~~~Now~for~a~short~aside:~scan\_for\_matches~only~searched~the
1764
1765 ~~~~sequences~in~the~input~file;~it~did~not~search~the~opposite
1766
1767 ~~~~strand.~~With~a~pattern~of~the~sort~we~just~used,~there~is~not
1768
1769 ~~~~need~o~search~the~opposite~strand.~~However,~it~is~normally~the
1770
1771 ~~~~case~that~you~will~wish~to~search~both~the~sequence~and~the
1772
1773 ~~~~opposite~strand~(i.e.,~the~reverse~complement~of~the~sequence).
1774
1775 ~~~~To~do~that,~you~would~just~use~the~\char`\"{}-c\char`\"{}~command~line.~~For~example,
1776
1777 ~~~~~~~~scan\_for\_matches~-c~pat\_file~<~test\_dna\_input
1778
1779 ~~~~Hits~on~the~opposite~strand~will~show~a~beginning~location~greater
1780
1781 ~~~~than~te~end~location~of~the~match.
1782
1783 Defining~Pairing~Rules~and~Allowing~Mismatches,~Insertions,~and~Deletions
1784
1785 ~~~~Let~us~stop~now~and~ask~\char`\"{}What~additional~features~would~one~need~to
1786
1787 ~~~~really~find~the~kinds~of~loop~structures~that~characterize~tRNAs,
1788
1789 ~~~~rRNAs,~and~so~forth?\char`\"{}~~I~can~immediately~think~of~two:
1790
1791 ~~~~~~~~a)~you~will~need~to~be~able~to~allow~non-standard~pairings
1792
1793 ~~~~~~~~~~~(those~other~than~G-C~and~A-U),~and
1794
1795 ~~~~~~~~b)~you~will~need~to~be~able~to~tolerate~some~number~of
1796
1797 ~~~~~~~~~~~mismatches~and~bulges.
1798
1799 ~~~~~~~~
1800
1801 ~~~~Let~me~first~show~you~how~to~handle~non-standard~\char`\"{}rules~for
1802
1803 ~~~~pairing~in~reverse~complements\char`\"{}.~~Consider~the~following~pattern,
1804
1805 ~~~~which~I~show~as~two~line~(you~may~use~as~many~lines~as~you~like~in
1806
1807 ~~~~forming~a~pattern,~although~you~can~only~break~a~pattern~at~points
1808
1809 ~~~~where~space~would~be~legal):
1810
1811 ~~~~~~~~~~~~r1=\{au,ua,gc,cg,gu,ug,ga,ag\}~
1812
1813 ~~~~~~~~~~~~p1=2...3~0...4~p2=2...5~1...5~r1\textasciitilde{}p2~0...4~\textasciitilde{}p1~~~~~~~~
1814
1815 ~~~~The~first~\char`\"{}pattern~unit\char`\"{}~does~not~actually~match~anything;~rather,
1816
1817 ~~~~it~defines~a~\char`\"{}pairing~rule\char`\"{}~in~which~standard~pairings~are
1818
1819 ~~~~allowed,~as~well~as~G-A~and~A-G~(in~case~you~wondered,~Us~and~Ts
1820
1821 ~~~~and~upper~and~lower~case~can~be~used~interchangably;~for~example
1822
1823 ~~~~r1=\{AT,UA,gc,cg\}~could~be~used~to~define~the~\char`\"{}standard~rule\char`\"{}~for
1824
1825 ~~~~pairings).~~The~second~line~consists~of~six~pattern~units~which
1826
1827 ~~~~may~be~interpreted~as~follows:
1828
1829 ~~~~~~~~~~~~p1=2...3~~~~~match~2~or~3~characters~(call~it~p1)
1830
1831 ~~~~~~~~~~~~0...4~~~~~~~~match~0~to~4~characters
1832
1833 ~~~~~~~~~~~~p2=2...5~~~~~match~2~to~5~characters~(call~it~p2)
1834
1835 ~~~~~~~~~~~~1...5~~~~~~~~match~1~to~5~characters
1836
1837 ~~~~~~~~~~~~r1\textasciitilde{}p2~~~~~~~~match~the~reverse~complement~of~p2,
1838
1839 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~allowing~G-A~and~A-G~pairs
1840
1841 ~~~~~~~~~~~~0...4~~~~~~~~match~0~to~4~characters~~~~~~~~
1842
1843 ~~~~~~~~~~~~\textasciitilde{}p1~~~~~~~~~~match~the~reverse~complement~of~p1
1844
1845 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~allowing~only~G-C,~C-G,~A-T,~and~T-A~pairs
1846
1847 ~~~~Thus,~r1\textasciitilde{}p2~means~\char`\"{}match~the~reverse~complement~of~p2~using~rule~r1\char`\"{}.
1848
1849 ~~~~Now~let~us~consider~the~issue~of~tolerating~mismatches~and~bulges.
1850
1851 ~~~~You~may~add~a~\char`\"{}qualifier\char`\"{}~to~the~pattern~unit~that~gives~the
1852
1853 ~~~~tolerable~number~of~\char`\"{}mismatches,~deletions,~and~insertions\char`\"{}.
1854
1855 ~~~~Thus,
1856
1857 ~~~~~~~~~~~~~~~~p1=10...10~3...8~\textasciitilde{}p1{[}1,2,1]
1858
1859 ~~~~means~that~the~third~pattern~unit~must~match~10~characters,
1860
1861 ~~~~allowing~one~\char`\"{}mismatch\char`\"{}~(a~pairing~other~than~G-C,~C-G,~A-T,~or
1862
1863 ~~~~T-A),~two~deletions~(a~deletion~is~a~character~that~occurs~in~p1,
1864
1865 ~~~~but~has~been~\char`\"{}deleted\char`\"{}~from~the~string~matched~by~\textasciitilde{}p1),~and~one
1866
1867 ~~~~insertion~(an~\char`\"{}insertion\char`\"{}~is~a~character~that~occurs~in~the~string
1868
1869 ~~~~matched~by~\textasciitilde{}p1,~but~not~for~which~no~corresponding~character
1870
1871 ~~~~occurs~in~p1).~~In~this~case,~the~pattern~would~match
1872
1873 ~~~~~~~~~~~~~~ACGTACGTAC~GGGGGGGG~GCGTTACCT
1874
1875 ~~~~which~is,~you~must~admit,~a~fairly~weak~loop.~~It~is~common~to
1876
1877 ~~~~allow~mismatches,~but~you~will~find~yourself~using~insertions~and
1878
1879 ~~~~deletions~much~more~rarely.~~In~any~event,~you~should~note~that
1880
1881 ~~~~allowing~mismatches,~insertions,~and~deletions~does~force~the
1882
1883 ~~~~program~to~try~many~additional~possible~pairings,~so~it~does~slow
1884
1885 ~~~~things~down~a~bit.
1886
1887 How~Patterns~Are~Matched
1888
1889 ~~~~Now~is~as~good~a~time~as~any~to~discuss~the~basic~flow~of~control
1890
1891 ~~~~when~matching~patterns.~~Recall~that~a~\char`\"{}pattern\char`\"{}~is~a~sequence~of
1892
1893 ~~~~\char`\"{}pattern~units\char`\"{}.~~Suppose~that~the~pattern~units~were
1894
1895 ~~~~~~~~u1~u2~u3~u4~...~un
1896
1897 ~~~~The~scan~of~a~sequence~S~begins~by~setting~the~current~position
1898
1899 ~~~~to~1.~~Then,~an~attempt~is~made~to~match~u1~starting~at~the
1900
1901 ~~~~current~position.~~Each~attempt~to~match~a~pattern~unit~can
1902
1903 ~~~~succeed~or~fail.~~If~it~succeeds,~then~an~attempt~is~made~to~match
1904
1905 ~~~~the~next~unit.~~If~it~fails,~then~an~attempt~is~made~to~find~an
1906
1907 ~~~~alternative~match~for~the~immediately~preceding~pattern~unit.~~If
1908
1909 ~~~~this~succeeds,~then~we~proceed~forward~again~to~the~next~unit.~~If
1910
1911 ~~~~it~fails~we~go~back~to~the~preceding~unit.~~This~process~is~called
1912
1913 ~~~~\char`\"{}backtracking\char`\"{}.~~If~there~are~no~previous~units,~then~the~current
1914
1915 ~~~~position~is~incremented~by~one,~and~everything~starts~again.~~This
1916
1917 ~~~~proceeds~until~either~the~current~position~goes~past~the~end~of
1918
1919 ~~~~the~sequence~or~all~of~the~pattern~units~succeed.~~On~success,
1920
1921 ~~~~scan\_for\_matches~reports~the~\char`\"{}hit\char`\"{},~the~current~position~is~set
1922
1923 ~~~~just~past~the~hit,~and~an~attempt~is~made~to~find~another~hit.
1924
1925 ~~~~If~you~wish~to~limit~the~scan~to~simply~finding~a~maximum~of,~say,
1926
1927 ~~~~10~hits,~you~can~use~the~-n~option~(-n~10~would~set~the~limit~to
1928
1929 ~~~~10~reported~hits).~~For~example,
1930
1931 ~~~~~~~~scan\_for\_matches~-c~-n~1~pat\_file~<~test\_dna\_input
1932
1933 ~~~~would~search~for~just~the~first~hit~(and~would~stop~searching~the
1934
1935 ~~~~current~sequences~or~any~that~follow~in~the~input~file).
1936
1937 Searching~for~repeats:
1938
1939 ~~~~In~the~last~section,~I~discussed~almost~all~of~the~details
1940
1941 ~~~~required~to~allow~you~to~look~for~repeats.~~Consider~the~following
1942
1943 ~~~~set~of~patterns:
1944
1945 ~~~~~~~~p1=6...6~3...8~p1~~~(find~exact~6~character~repeat~separated
1946
1947 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~by~to~8~characters)
1948
1949 ~~~~~~~~p1=6...6~3..8~p1{[}1,0,0]~~~(allow~one~mismatch)
1950
1951 ~~~~~~~~p1=3...3~p1{[}1,0,0]~p1{[}1,0,0]~p1{[}1,0,0]~~
1952
1953 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~(match~12~characters~that~are~the~remains
1954
1955 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~of~a~3-character~sequence~occurring~4~times)
1956
1957 ~~~~~~~~~~~~~~~~
1958
1959 ~~~~~~~~p1=4...8~0...3~p2=6...8~p1~0...3~p2
1960
1961 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~(This~would~match~things~like
1962
1963 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ATCT~G~TCTTT~ATCT~TG~TCTTT
1964
1965 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~)
1966
1967 Searching~for~particular~sequences:
1968
1969 ~~~~Occasionally,~one~wishes~to~match~a~specific,~known~sequence.
1970
1971 ~~~~In~such~a~case,~you~can~just~give~the~sequence~(along~with~an
1972
1973 ~~~~optional~statement~of~the~allowable~mismatches,~insertions,~and
1974
1975 ~~~~deletions).~~Thus,
1976
1977 ~~~~~~~~p1=6...8~GAGA~\textasciitilde{}p1~~~~(match~a~hairpin~with~GAGA~as~the~loop)
1978
1979 ~~~~~~~~RRRRYYYY~~~~~~~~~~~~~(match~4~purines~followed~by~4~pyrimidines)
1980
1981 ~~~~~~~~TATAA{[}1,0,0]~~~~~~~~~(match~TATAA,~allowing~1~mismatch)
1982
1983 ~~~~~~~~
1984
1985 Matches~against~a~\char`\"{}weight~matrix\char`\"{}:
1986
1987 ~~~~I~will~conclude~my~examples~of~the~types~of~pattern~units
1988
1989 ~~~~available~for~matching~against~nucleotide~sequences~by~discussing~a
1990
1991 ~~~~crude~implemetation~of~matching~using~a~\char`\"{}weight~matrix\char`\"{}.~~While~I
1992
1993 ~~~~am~less~than~overwhelmed~with~the~syntax~that~I~chose,~I~think~that
1994
1995 ~~~~the~reader~should~be~aware~that~I~was~thinking~of~generating
1996
1997 ~~~~patterns~containing~such~pattern~units~automatically~from
1998
1999 ~~~~alignments~(and~did~not~really~plan~on~typing~such~things~in~by
2000
2001 ~~~~hand~very~often).~~Anyway,~suppose~that~you~wanted~to~match~a
2002
2003 ~~~~sequence~of~eight~characters.~~The~\char`\"{}consensus\char`\"{}~of~these~eight
2004
2005 ~~~~characters~is~GRCACCGS,~but~the~actual~\char`\"{}frequencies~of~occurrence\char`\"{}
2006
2007 ~~~~are~given~in~the~matrix~below.~~Thus,~the~first~character~is~an~A
2008
2009 ~~~~16\%~the~time~and~a~G~84\%~of~the~time.~~The~second~is~an~A~57\%~of
2010
2011 ~~~~the~time,~a~C~10\%~of~the~time,~a~G~29\%~of~the~time,~and~a~T~4\%~of
2012
2013 ~~~~the~time.~~
2014
2015 ~~~~~~~~~~~~~C1~~~~~C2~~~~C3~~~~C4~~~C5~~~~C6~~~~C7~~~~C8
2016
2017 ~~~~
2018
2019 ~~~~~~~A~~~~~16~~~~~57~~~~~0~~~~95~~~~0~~~~18~~~~~0~~~~~0
2020
2021 ~~~~~~~C~~~~~~0~~~~~10~~~~80~~~~~0~~100~~~~60~~~~~0~~~~50
2022
2023 ~~~~~~~G~~~~~84~~~~~29~~~~~0~~~~~0~~~~0~~~~20~~~100~~~~50
2024
2025 ~~~~~~~T~~~~~~0~~~~~~4~~~~20~~~~~5~~~~0~~~~~2~~~~~0~~~~~0~~~
2026
2027 ~~~~
2028
2029 ~~~~One~could~use~the~following~pattern~unit~to~search~for~inexact
2030
2031 ~~~~matches~related~to~such~a~\char`\"{}weight~matrix\char`\"{}:
2032
2033 ~~~~~~~~\{(16,0,84,0),(57,10,29,4),(0,80,0,20),(95,0,0,5),
2034
2035 ~~~~~~~~~(0,100,0,0),(18,60,20,2),(0,0,100,0),(0,50,50,0)\}~>~450
2036
2037 ~~~~This~pattern~unit~will~attempt~to~match~exactly~eight~characters.
2038
2039 ~~~~For~each~character~in~the~sequence,~the~entry~in~the~corresponding
2040
2041 ~~~~tuple~is~added~to~an~accumulated~sum.~~If~the~sum~is~greater~than
2042
2043 ~~~~450,~the~match~succeeds;~else~it~fails.
2044
2045 ~~~~Recently,~this~feature~was~upgraded~to~allow~ranges.~~Thus,
2046
2047 ~~600~>~~\{(16,0,84,0),(57,10,29,4),(0,80,0,20),(95,0,0,5),
2048
2049 ~~~~~~~~~(0,100,0,0),(18,60,20,2),(0,0,100,0),(0,50,50,0)\}~>~450
2050
2051 ~~~~will~work,~as~well.
2052
2053 Allowing~Alternatives:
2054
2055 ~~~~Very~occasionally,~you~may~wish~to~allow~alternative~pattern~units
2056
2057 ~~~~(i.e.,~\char`\"{}match~either~A~or~B\char`\"{}).~~You~can~do~this~using~something
2058
2059 ~~~~like
2060
2061 ~~~~~~~~~~~~~~~~(~GAGA~|~GCGCA)
2062
2063 ~~~~which~says~\char`\"{}match~either~GAGA~or~GCGCA\char`\"{}.~~You~may~take
2064
2065 ~~~~alternatives~of~a~list~of~pattern~units,~for~example
2066
2067 ~~~~~~~~(p1=3...3~3...8~\textasciitilde{}p1~|~p1=5...5~4...4~\textasciitilde{}p1~GGG)
2068
2069 ~~~~would~match~one~of~two~sequences~of~pattern~units.~~There~is~one
2070
2071 ~~~~clumsy~aspect~of~the~syntax:~to~match~a~list~of~alternatives,~you
2072
2073 ~~~~need~to~fully~the~request.~~Thus,
2074
2075 ~~~~~~~~(GAGA~|~(GCGCA~|~TTCGA))
2076
2077 ~~~~would~be~needed~to~try~the~three~alternatives.
2078
2079 One~Minor~Extension
2080
2081 ~~~~Sometimes~a~pattern~will~contain~a~sequence~of~distinct~ranges,
2082
2083 ~~~~and~you~might~wish~to~limit~the~sum~of~the~lengths~of~the~matched
2084
2085 ~~~~subsequences.~~~For~example,~suppose~that~you~basically~wanted~to
2086
2087 ~~~~match~something~like
2088
2089 ~~~~ARRYYTT~p1=0...5~GCA{[}1,0,0]~p2=1...6~\textasciitilde{}p1~4...8~\textasciitilde{}p2~p3=4...10~CCT
2090
2091 ~~~~but~that~the~sum~of~the~lengths~of~p1,~p2,~and~p3~must~not~exceed
2092
2093 ~~~~eight~characters.~~To~do~this,~you~could~add~
2094
2095 ~~~~~~~~length(p1+p2+p3)~<~9
2096
2097 ~~~~as~the~last~pattern~unit.~~It~will~just~succeed~or~fail~(but~does
2098
2099 ~~~~not~actually~match~any~characters~in~the~sequence).
2100
2101 ~~~~
2102
2103 Matching~Protein~Sequences
2104
2105 ~~~~Suppose~that~the~input~file~contains~protein~sequences.~~In~this
2106
2107 ~~~~case,~you~must~invoke~scan\_for\_matches~with~the~\char`\"{}-p\char`\"{}~option.~~You
2108
2109 ~~~~cannot~use~aspects~of~the~language~that~relate~directly~to
2110
2111 ~~~~nucleotide~sequences~(e.g.,~the~-c~command~line~option~or~pattern
2112
2113 ~~~~constructs~referring~to~the~reverse~complement~of~a~previously
2114
2115 ~~~~matched~unit).~~
2116
2117 ~~~~You~also~have~two~additional~constructs~that~allow~you~to~match
2118
2119 ~~~~either~\char`\"{}one~of~a~set~of~amino~acids\char`\"{}~or~\char`\"{}any~amino~acid~other~than
2120
2121 ~~~~those~a~given~set\char`\"{}.~~For~example,
2122
2123 ~~~~~~~~p1=0...4~any(HQD)~1...3~notany(HK)~p1
2124
2125 ~~~~would~successfully~match~a~string~like
2126
2127 ~~~~~~~~~~~YWV~D~AA~C~YWV
2128
2129 Using~the~show\_hits~Utility
2130
2131 ~~~~When~viewing~a~large~set~of~complex~matches,~you~might~find~it
2132
2133 ~~~~convenient~to~post-process~the~scan\_for\_matches~output~to~get~a
2134
2135 ~~~~more~readable~version.~~We~provide~a~simple~post-processor~called
2136
2137 ~~~~\char`\"{}show\_hits\char`\"{}.~~To~see~its~effect,~just~pipe~the~output~of~a
2138
2139 ~~~~scan\_for\_matches~into~show\_hits:
2140
2141 ~~~~~Normal~Output:
2142
2143 ~~~~~~~~clone\%~scan\_for\_matches~-c~pat\_file~<~tmp
2144
2145 ~~~~~~~~>tst1:{[}1,28]
2146
2147 ~~~~~~~~gtacguaacc~~ggttaac~cgguuacgtac~
2148
2149 ~~~~~~~~>tst1:{[}28,1]
2150
2151 ~~~~~~~~gtacgtaacc~~ggttaac~cggttacgtac~
2152
2153 ~~~~~~~~>tst2:{[}2,31]
2154
2155 ~~~~~~~~CGTACGUAAC~C~GGTTAACC~GGUUACGTACG~
2156
2157 ~~~~~~~~>tst2:{[}31,2]
2158
2159 ~~~~~~~~CGTACGTAAC~C~GGTTAACC~GGTTACGTACG~
2160
2161 ~~~~~~~~>tst3:{[}3,32]
2162
2163 ~~~~~~~~gtacguaacc~g~gttaactt~cgguuacgtac~
2164
2165 ~~~~~~~~>tst3:{[}32,3]
2166
2167 ~~~~~~~~gtacgtaacc~g~aagttaac~cggttacgtac~
2168
2169 ~~~~~Piped~Through~show\_hits:
2170
2171 ~~~~
2172
2173 ~~~~~~~~clone\%~scan\_for\_matches~-c~pat\_file~<~tmp~|~show\_hits
2174
2175 ~~~~~~~~tst1:{[}1,28]:~~gtacguaacc~~~ggttaac~~cgguuacgtac
2176
2177 ~~~~~~~~tst1:{[}28,1]:~~gtacgtaacc~~~ggttaac~~cggttacgtac
2178
2179 ~~~~~~~~tst2:{[}2,31]:~~CGTACGUAAC~C~GGTTAACC~GGUUACGTACG
2180
2181 ~~~~~~~~tst2:{[}31,2]:~~CGTACGTAAC~C~GGTTAACC~GGTTACGTACG
2182
2183 ~~~~~~~~tst3:{[}3,32]:~~gtacguaacc~g~gttaactt~cgguuacgtac
2184
2185 ~~~~~~~~tst3:{[}32,3]:~~gtacgtaacc~g~aagttaac~cggttacgtac
2186
2187 ~~~~~~~~clone\%~
2188
2189 ~~~~Optionally,~you~can~specify~which~of~the~\char`\"{}fields\char`\"{}~in~the~matches
2190
2191 ~~~~you~wish~to~sort~on,~and~show\_hits~will~sort~them.~~The~field
2192
2193 ~~~~numbers~start~with~0.~~So,~you~might~get~something~like
2194
2195 ~~~~~~~~clone\%~scan\_for\_matches~-c~pat\_file~<~tmp~|~show\_hits~2~1
2196
2197 ~~~~~~~~tst2:{[}2,31]:~~CGTACGUAAC~C~GGTTAACC~GGUUACGTACG
2198
2199 ~~~~~~~~tst2:{[}31,2]:~~CGTACGTAAC~C~GGTTAACC~GGTTACGTACG
2200
2201 ~~~~~~~~tst3:{[}32,3]:~~gtacgtaacc~g~aagttaac~cggttacgtac
2202
2203 ~~~~~~~~tst1:{[}1,28]:~~gtacguaacc~~~ggttaac~~cgguuacgtac
2204
2205 ~~~~~~~~tst1:{[}28,1]:~~gtacgtaacc~~~ggttaac~~cggttacgtac
2206
2207 ~~~~~~~~tst3:{[}3,32]:~~gtacguaacc~g~gttaactt~cgguuacgtac
2208
2209 ~~~~~~~~clone\%~
2210
2211 ~~~~In~this~case,~the~hits~have~been~sorted~on~fields~2~and~1~(that~is,
2212
2213 ~~~~the~third~and~second~matched~subfields).
2214
2215 ~~~~show\_hits~is~just~one~possible~little~post-processor,~and~you
2216
2217 ~~~~might~well~wish~to~write~a~customized~one~for~yourself.
2218
2219 Reducing~the~Cost~of~a~Search
2220
2221 ~~~~The~scan\_for\_matches~utility~uses~a~fairly~simple~search,~and~may
2222
2223 ~~~~consume~large~amounts~of~CPU~time~for~complex~patterns.~~Someday,
2224
2225 ~~~~I~may~decide~to~optimize~the~code.~~However,~until~then,~let~me
2226
2227 ~~~~mention~one~useful~technique.~~
2228
2229 ~~~~When~you~have~a~complex~pattern~that~includes~a~number~of~varying
2230
2231 ~~~~ranges,~imprecise~matches,~and~so~forth,~it~is~useful~to
2232
2233 ~~~~\char`\"{}pipeline\char`\"{}~matches.~~That~is,~form~a~simpler~pattern~that~can~be
2234
2235 ~~~~used~to~scan~through~a~large~database~extracting~sections~that
2236
2237 ~~~~might~be~matched~by~the~more~complex~pattern.~~Let~me~illustrate
2238
2239 ~~~~with~a~short~example.~~Suppose~that~you~really~wished~to~match~the
2240
2241 ~~~~pattern~
2242
2243 ~~~~p1=3...5~0...8~\textasciitilde{}p1{[}1,1,0]~p2=6...7~3...6~AGC~3...5~RYGC~\textasciitilde{}p2{[}1,0,0]
2244
2245 ~~~~In~this~case,~the~pattern~units~AGC~3...5~RYGC~can~be~used~to~rapidly
2246
2247 ~~~~constrain~the~overall~search.~~You~can~preprocess~the~overall
2248
2249 ~~~~database~using~the~pattern:
2250
2251 ~~~~~~~~~~31...31~AGC~3...5~RYGC~7...7
2252
2253 ~~~~Put~the~complex~pattern~in~pat\_file1~and~the~simpler~pattern~in
2254
2255 ~~~~pat\_file2.~~Then~use,
2256
2257 ~~~~~~~~scan\_for\_matches~-c~pat\_file2~<~nucleotide\_database~|
2258
2259 ~~~~~~~~scan\_for\_matches~pat\_file1
2260
2261 ~~~~The~output~will~show~things~like
2262
2263 ~~~~>seqid:{[}232,280]{[}2,47]
2264
2265 ~~~~matches~pieces
2266
2267 ~~~~Then,~the~actual~section~of~the~sequence~that~was~matched~can~be
2268
2269 ~~~~easily~computed~as~{[}233,278]~(remember,~the~positions~start~from
2270
2271 ~~~~1,~not~0).
2272
2273 ~~~~Let~me~finally~add,~you~should~do~a~few~short~experiments~to~see
2274
2275 ~~~~whether~or~not~such~pipelining~actually~improves~performance~-{}-~it
2276
2277 ~~~~is~not~always~obvious~where~the~time~is~going,~and~I~have
2278
2279 ~~~~sometimes~found~that~the~added~complexity~of~pipelining~actually
2280
2281 ~~~~slowed~things~up.~~It~gets~its~best~improvements~when~there~are
2282
2283 ~~~~exact~matches~of~more~than~just~a~few~characters~that~can~be
2284
2285 ~~~~rapidly~used~to~eliminate~large~sections~of~the~database.
2286
2287 =============
2288
2289 Additions:
2290
2291 Feb~9,~1995:~~~the~pattern~units~\textasciicircum{}~and~\$~now~work~as~in~normal~regular
2292
2293 ~~~~~~~~~~~~~~~expressions.~~That~is
2294
2295 ~~~~~~~~~~~~~~~~~~~~~~~~TTF~\$
2296
2297 ~~~~~~~~~~~~~~~matches~only~TTF~at~the~end~of~the~string~and~
2298
2299 ~~~~~~~~~~~~~~~~~~~~~~~~\textasciicircum{}~TTF~
2300
2301 ~~~~~~~~~~~~~~~matches~only~an~initial~TTF
2302
2303 ~~~~~~~~~~~~~~~The~pattern~unit~
2304
2305 ~~~~~~~~~~~~~~~~~~~~~~~~<p1
2306
2307 ~~~~~~~~~~~~~~~matches~the~reverse~of~the~string~named~p1.~~That~is,
2308
2309 ~~~~~~~~~~~~~~~if~p1~matched~GCAT,~then~<p1~would~match~TACG.~~Thus,
2310
2311 ~~~~~~~~~~~~~~~~~~~p1=6...6~<p1
2312
2313 ~~~~~~~~~~~~~~~matches~a~real~palindrome~(not~the~biologically~common
2314
2315 ~~~~~~~~~~~~~~~meaning~of~\char`\"{}reverse~complement\char`\"{})
2316
2317
2318 \end{lyxcode}
2319
2320 \end{document}