#LyX 1.5.1 created this file. For more info see http://www.lyx.org/
\lyxformat 276
\begin_document
\begin_header
\textclass scrartcl
\begin_preamble
\usepackage[colorlinks=true, urlcolor=blue, linkcolor=black]{hyperref}
\end_preamble
\language english
\inputencoding auto
\font_roman default
\font_sans default
\font_typewriter default
\font_default_family default
\font_sc false
\font_osf false
\font_sf_scale 100
\font_tt_scale 100
\graphics default
\paperfontsize default
\spacing single
\papersize default
\use_geometry false
\use_amsmath 1
\use_esint 1
\cite_engine basic
\use_bibtopic false
\paperorientation portrait
\secnumdepth 3
\tocdepth 3
\paragraph_separation skip
\defskip medskip
\quotes_language english
\papercolumns 1
\papersides 1
\paperpagestyle default
\tracking_changes false
\output_changes false
\author "" 
\author "" 
\end_header

\begin_body

\begin_layout Title
Biopieces Cookbook
\end_layout

\begin_layout Author
Martin Asser Hansen
\end_layout

\begin_layout Publishers
John Mattick Group
\newline
Institute for Molecular Bioscience
\newline
University of Queensland
\newline
Aust
ralia
\newline
E-mail: mail@maasha.dk
\end_layout

\begin_layout Standard
\begin_inset ERT
status open

\begin_layout Standard


\backslash
thispagestyle{empty}
\end_layout

\end_inset


\end_layout

\begin_layout Standard

\newpage

\end_layout

\begin_layout Standard
\begin_inset LatexCommand tableofcontents

\end_inset


\end_layout

\begin_layout Standard
\begin_inset FloatList figure

\end_inset


\end_layout

\begin_layout Standard

\newpage

\end_layout

\begin_layout Section
Introduction
\end_layout

\begin_layout Standard
Biopieces is a collection of simple tools that can be linked together (piped
 as we shall call it) in a very flexible manner to perform both simple and
 complex tasks.
 The fundamental idea is that biopieces work on a data stream that will
 only terminate at the end of an analysis and that this data stream can
 be passed through several different biopieces, each performing one specific
 task.
 The advantage of this approach is that a user can perform simple and complex
 tasks without having to write advanced code.
 Moreover, since the data format used to pass data between biopieces is
 text based, biopieces can be written by different developers in their favorite
 programming language --- and still the biopieces will be able to work together.
\end_layout

\begin_layout Standard
In the most simple form bioools can be piped together on the command line
 like this (using the pipe character '|'):
\end_layout

\begin_layout LyX-Code
read_data | calculate_something | write_result
\end_layout

\begin_layout Standard
However, a more comprehensive analysis could be composed:
\end_layout

\begin_layout LyX-Code
read_data | select_entries | convert_entries | search_database  
\end_layout

\begin_layout LyX-Code
evaluate_results | plot_diagram | plot_another_diagram |
\end_layout

\begin_layout LyX-Code
load_to_database
\end_layout

\begin_layout Standard
The data stream that is piped through the biopieces consists of records
 of key/value pairs in the same way a hash does in order to keep as simple
 a structure as possible.
 An example record can be seen below:
\end_layout

\begin_layout LyX-Code

\end_layout

\begin_layout LyX-Code

\size scriptsize
REC_TYPE: PATSCAN
\end_layout

\begin_layout LyX-Code

\size scriptsize
MATCH: AGATCAAGTG
\end_layout

\begin_layout LyX-Code

\size scriptsize
S_BEG: 7
\end_layout

\begin_layout LyX-Code

\size scriptsize
S_END: 16
\end_layout

\begin_layout LyX-Code

\size scriptsize
ALIGN_LEN: 9
\end_layout

\begin_layout LyX-Code

\size scriptsize
S_ID: piR-t6
\end_layout

\begin_layout LyX-Code

\size scriptsize
STRAND: +
\end_layout

\begin_layout LyX-Code

\size scriptsize
PATTERN: AGATCAAGTG
\end_layout

\begin_layout LyX-Code

\size scriptsize
---
\end_layout

\begin_layout Standard
The '
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\backslash
/-
\end_layout

\end_inset

' denotes the delimiter of the records, and each key is a word followed
 by a ':' and a white-space and then the value.
 By convention the biopieces only uses upper case keys (a list of used keys
 can be seen in Appendix\InsetSpace ~

\begin_inset LatexCommand ref
reference "sec:Keys"

\end_inset

).
 Since the records basically are hash structures this mean that the order
 of the keys in the stream is unordered, and in the above example it is
 pure coincidence that HIT_BEG is displayed before HIT_END, however, when
 the order of the keys is importent, the biopieces will automagically see
 to that.
\end_layout

\begin_layout Standard
All of the biopieces are able to read and write a data stream to and from
 file as long as the records are in the biopieces format.
 This means that if you are undertaking a lengthy analysis where one of
 the steps is time consuming, you may save the stream after this step, and
 subsequently start one or more analysis from that last step
\begin_inset Foot
status collapsed

\begin_layout Standard
It is a goal that the biopieces at some point will be able to dump the data
 stream to file in case one of the tools fail critically.
\end_layout

\end_inset

.
 If you are running a lengthy analysis it is highly recommended that you
 create a small test sample of the data and run that through the pipeline
 --- and once you are satisfied with the result proceed with the full data
 set (see\InsetSpace ~

\begin_inset LatexCommand ref
reference "sub:How-to-select-a-few-records"

\end_inset

).
\end_layout

\begin_layout Standard
All of the biopieces can be supplied with long arguments prefixed with 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

 switches or single character arguments prefixed with - switches that can
 be grouped together (e.g.
 -xok).
 In this cookbook only the long switches are used to emphasize what these
 switches do.
\end_layout

\begin_layout Section
Setup
\end_layout

\begin_layout Standard
In order to get the biopieces to work, you need to add environment settings
 to include the code binaries, scripts, and modules that constitute the
 biopieces package.
 Assuming that you are using bash, add the following to your '~/.bashrc'
 file using your favorite editor.
 After the changes has been saved you need to either run 'source ~/.bashrc'
 or relogin.
\end_layout

\begin_layout LyX-Code

\size scriptsize
if [ -f "/home/m.hansen/maasha/conf/bashrc" ]; then
\end_layout

\begin_layout LyX-Code

\size scriptsize
    source "/home/m.hansen/maasha/conf/bashrc"
\end_layout

\begin_layout LyX-Code

\size scriptsize
fi
\end_layout

\begin_layout Section
Getting Started
\end_layout

\begin_layout Standard
The biopiece 
\series bold
list_biopieces
\series default
 lists all the biopieces along with a description:
\end_layout

\begin_layout LyX-Code

\size scriptsize
list_biopieces
\end_layout

\begin_layout LyX-Code

\size scriptsize
align_seq             Align sequences in stream using Muscle.
\end_layout

\begin_layout LyX-Code

\size scriptsize
analyze_seq           Analysis the residue composition of each sequence
 in stream.
\end_layout

\begin_layout LyX-Code

\size scriptsize
analyze_vals          Determine type, count, min, max, sum and mean for
 values in stream.
\end_layout

\begin_layout LyX-Code

\size scriptsize
blast_seq             BLAST sequences in stream against a specified database.
\end_layout

\begin_layout LyX-Code

\size scriptsize
blat_seq              BLAT sequences in stream against a specified genome.
\end_layout

\begin_layout LyX-Code

\size scriptsize
complement_seq        Complement sequences in stream.
\end_layout

\begin_layout LyX-Code

\size scriptsize
count_records         Count the number of records in stream.
\end_layout

\begin_layout LyX-Code

\size scriptsize
count_seq             Count sequences in stream.
\end_layout

\begin_layout LyX-Code

\size scriptsize
count_vals            Count the number of times values of given keys exists
 in stream.
\end_layout

\begin_layout LyX-Code

\size scriptsize
create_blast_db       Create a BLAST database from sequences in stream for
 use with BLAST.
\end_layout

\begin_layout LyX-Code

\size scriptsize
...
\end_layout

\begin_layout Standard
To list the biopieces for writing different formats, you can use unix's
 grep like this:
\end_layout

\begin_layout LyX-Code

\size scriptsize
list_biopieces | grep write
\end_layout

\begin_layout LyX-Code

\size scriptsize
write_align           Write aligned sequences in pretty alignment format.
\end_layout

\begin_layout LyX-Code

\size scriptsize
write_bed             Write records from stream as BED lines.
\end_layout

\begin_layout LyX-Code

\size scriptsize
write_blast           Write BLAST records from stream in BLAST tabular format
 (-m8 and 9).
\end_layout

\begin_layout LyX-Code

\size scriptsize
write_fasta           Write sequences in FASTA format.
\end_layout

\begin_layout LyX-Code

\size scriptsize
write_psl             Write records from stream in PSL format.
\end_layout

\begin_layout LyX-Code

\size scriptsize
write_tab             Write records from stream as tab separated table.
\end_layout

\begin_layout Standard
In order to find out how a specific biopiece works, you just type the program
 name without any arguments and press return and the usage of the biopiece
 will be displayed.
 E.g.
 
\series bold
read_fasta
\series default
 <return>:
\end_layout

\begin_layout Standard
\begin_inset Box Frameless
position "t"
hor_pos "c"
has_inner_box 1
inner_pos "t"
use_parbox 0
width "100col%"
special "none"
height "1in"
height_special "totalheight"
status open

\begin_layout LyX-Code

\size scriptsize
Program name:   read_fasta
\end_layout

\begin_layout LyX-Code

\size scriptsize
Author:         Martin Asser Hansen - Copyright (C) - All rights reserved
\end_layout

\begin_layout LyX-Code

\size scriptsize
Contact:        mail@maasha.dk
\end_layout

\begin_layout LyX-Code

\size scriptsize
Date:           August 2007
\end_layout

\begin_layout LyX-Code

\size scriptsize
License:        GNU General Public License version 2 (http://www.gnu.org/copyleft/
gpl.html)
\end_layout

\begin_layout LyX-Code

\size scriptsize
Description:    Read FASTA entries.
\end_layout

\begin_layout LyX-Code

\size scriptsize
Usage:          read_fasta [options] -i <FASTA file(s)>
\end_layout

\begin_layout LyX-Code

\size scriptsize
Options:
\end_layout

\begin_layout LyX-Code

\size scriptsize
   [-i <file(s)> | --data_in=<file(s)>]  -  Comma separated list of files
 to read.
\end_layout

\begin_layout LyX-Code

\size scriptsize
   [-n <int>     | --num=<int>]          -  Limit number of records to read.
\end_layout

\begin_layout LyX-Code

\size scriptsize
   [-I <file>    | --stream_in=<file>]   -  Read input stream from file
  -  Default=STDIN
\end_layout

\begin_layout LyX-Code

\size scriptsize
   [-O <file>    | --stream_out=<file>]  -  Write output stream to file
  -  Default=STDOUT
\end_layout

\begin_layout LyX-Code

\end_layout

\begin_layout LyX-Code

\size scriptsize
Examples:
\end_layout

\begin_layout LyX-Code

\size scriptsize
   read_fasta -i test.fna             -  Read FASTA entries from file.
\end_layout

\begin_layout LyX-Code

\size scriptsize
   read_fasta -i test1.fna,test2,fna  -  Read FASTA entries from files.
\end_layout

\begin_layout LyX-Code

\size scriptsize
   read_fasta -i '*.fna'              -  Read FASTA entries from files.
\end_layout

\begin_layout LyX-Code

\size scriptsize
   read_fasta -i test.fna -n 10       -  Read first 10 FASTA entries from
 file.
\end_layout

\end_inset


\end_layout

\begin_layout Section
The Data Stream
\end_layout

\begin_layout Subsection
How to read the data stream from file?
\begin_inset LatexCommand label
name "sub:How-to-read-stream"

\end_inset


\end_layout

\begin_layout Standard
You want to read a data stream that you previously have saved to file in
 biopieces format.
 This can be done implicetly or explicitly.
 The implicit way uses the 'stdout' stream of the Unix terminal:
\end_layout

\begin_layout LyX-Code
cat | <biopiece>
\end_layout

\begin_layout Standard
cat is the Unix command that reads a file and output the result to 'stdout'
 --- which in this case is piped to any biopiece represented by the <biopiece>.
 It is also possible to read the data stream using '<' to direct the 'stdout'
 stream into the biopiece like this:
\end_layout

\begin_layout LyX-Code
<biopiece> < <file>
\end_layout

\begin_layout Standard
However, that will not work if you pipe more biopieces together.
 Then it is much safer to read the stream from a file explicitly like this:
\end_layout

\begin_layout LyX-Code
<biopiece> --stream_in=<file>
\end_layout

\begin_layout Standard
Here the filename <file> is explicetly given to the biopiece <biopiece>
 with the switch 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

stream_in.
 This switch works with all biopieces.
 It is also possible to read in data from multiple sources by repeating
 the explicit read step:
\end_layout

\begin_layout LyX-Code
<biopiece> --stream_in=<file1> | <biopiece> --stream_in=<file2>
\end_layout

\begin_layout Subsection
How to write the data stream to file?
\begin_inset LatexCommand label
name "sub:How-to-write-stream"

\end_inset


\end_layout

\begin_layout Standard
In order to save the output stream from a biopiece to file, so you can read
 in the stream again at a later time, you can do one of two things:
\end_layout

\begin_layout LyX-Code
<biopiece> > <file>
\end_layout

\begin_layout Standard
All, the biopieces write the data stream to 'stdout' by default which can
 be written to a file by redirecting 'stdout' to file using '>' , however,
 if one of the biopieces for writing other formats is used then the both
 the biopieces records as well as the result output will go to 'stdout'
 in a mixture causing havock! To avoid this you must use the switch 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

stream_out that explictly tells the biopiece to write the output stream
 to file:
\end_layout

\begin_layout LyX-Code
<biopiece> --stream_out=<file>
\end_layout

\begin_layout Standard
The 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

stream_out switch works with all biopieces.
\end_layout

\begin_layout Subsection
How to terminate the data stream?
\end_layout

\begin_layout Standard
The data stream is never stops unless the user want to save the stream or
 by supplying the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

no_stream switch that will terminate the stream:
\end_layout

\begin_layout LyX-Code
<biopiece> --no_stream
\end_layout

\begin_layout Standard
The 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

no_stream switch only works with those biopieces where it makes sense that
 the user might want to terminale the data stream, 
\emph on
i.e
\emph default
.
 after an analysis step where the user wants to output the result, but not
 the data stream.
\end_layout

\begin_layout Subsection
How to write my results to file?
\begin_inset LatexCommand label
name "sub:How-to-write-result"

\end_inset


\end_layout

\begin_layout Standard
Saving the result of an analysis to file can be done implicitly or explicitly.
 The implicit way:
\end_layout

\begin_layout LyX-Code
<biopiece> --no_stream > <file>
\end_layout

\begin_layout Standard
If you use '>' to redirect 'stdout' to file then it is important to use
 the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

no_stream switch to avoid writing a mix of biopieces records and result
 to the same file causing havock.
 The safe way is to use the 
\begin_inset ERT
status open

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

result_out switch which explicetly tells the biopiece to write the result
 to a given file:
\end_layout

\begin_layout LyX-Code
<biopiece> --result_out=<file>
\end_layout

\begin_layout Standard
Using the above method will not terminate the stream, so it is possible
 to pipe that into another biopiece generating different results:
\end_layout

\begin_layout LyX-Code
<biopiece1> --result_out=<file1> | <biopiece2> --result_out=<file2>
\end_layout

\begin_layout Standard
And still the data stream will continue unless terminated with 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

no_stream:
\end_layout

\begin_layout LyX-Code
<biopiece> --result_out=<file> --no_stream
\end_layout

\begin_layout Standard
Or written to file using implicitly or explicity 
\begin_inset LatexCommand eqref
reference "sub:How-to-write-result"

\end_inset

.
 The explicit way:
\end_layout

\begin_layout LyX-Code
<biopiece> --result_out=<file1> --stream_out=<file2>
\end_layout

\begin_layout Subsection
How to read data from multiple sources?
\end_layout

\begin_layout Standard
To read multiple data sources, with the same type or different type of data
 do:
\end_layout

\begin_layout LyX-Code
<biopiece1> --data_in=<file1> | <biopiece2> --data_in=<file2>
\end_layout

\begin_layout Standard
where type is the data type a specific biopiece reads.
\end_layout

\begin_layout Section
Reading input
\end_layout

\begin_layout Subsection
How to read biopieces input?
\end_layout

\begin_layout Standard
See 
\begin_inset LatexCommand eqref
reference "sub:How-to-read-stream"

\end_inset

.
\end_layout

\begin_layout Subsection
How to read in data?
\end_layout

\begin_layout Standard
Data in different formats can be read with the appropriate biopiece for
 that format.
 The biopieces are typicalled named 'read_<data type>' such as 
\series bold
read_fasta
\series default
, 
\series bold
read_bed
\series default
, 
\series bold
read_tab
\series default
, etc., and all behave in a similar manner.
 Data can be read by supplying the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

data_in switch and a file name to the file containing the data:
\end_layout

\begin_layout LyX-Code
<biopiece> --data_in=<file>
\end_layout

\begin_layout Standard
It is also possible to read in a saved biopieces stream (see 
\begin_inset LatexCommand ref
reference "sub:How-to-read-stream"

\end_inset

) as well as reading data in one go:
\end_layout

\begin_layout LyX-Code
<biopiece> --stream_in=<file1> --data_in=<file2>
\end_layout

\begin_layout Standard
If you want to read data from several files you can do this:
\end_layout

\begin_layout LyX-Code
<biopiece> --data_in=<file1> | <biopiece> --data_in=<file2>
\end_layout

\begin_layout Standard
If you have several data files you can read in all explicitly with a comma
 separated list:
\end_layout

\begin_layout LyX-Code
<biopiece> --data_in=file1,file2,file3
\end_layout

\begin_layout Standard
And it is also possible to use file globbing
\begin_inset Foot
status open

\begin_layout Standard
using the short option will only work if you quote the argument -i '*.fna'
\end_layout

\end_inset

:
\end_layout

\begin_layout LyX-Code
<biopiece> --data_in=*.fna
\end_layout

\begin_layout Standard
Or in a combination:
\end_layout

\begin_layout LyX-Code
<biopiece> --data_in=file1,/dir/*.fna
\end_layout

\begin_layout Standard
Finally, it is possible to read in data in different formats using the appropria
te biopiece for each format:
\end_layout

\begin_layout LyX-Code
<biopiece1> --data_in=<file1> | <biopiece2> --data_in=<file2> ...
\end_layout

\begin_layout Subsection
How to read FASTA input?
\end_layout

\begin_layout Standard
Sequences in FASTA format can be read explicitly using 
\series bold
read_fasta
\series default
:
\end_layout

\begin_layout LyX-Code
read_fasta --data_in=<file>
\end_layout

\begin_layout Subsection
How to read alignment input?
\end_layout

\begin_layout Standard
If your alignment if FASTA formatted then you can 
\series bold
read_align
\series default
.
 It is also possible to use 
\series bold
read_fasta
\series default
 since the data is FASTA formatted, however, with 
\series bold
read_fasta
\series default
 the key ALIGN will be omitted.
 The ALIGN key is used to determine which sequences belong to what alignment
 which is required for 
\series bold
write_align
\series default
.
\end_layout

\begin_layout LyX-Code
read_align --data_in=<file>
\end_layout

\begin_layout Subsection
How to read tabular input?
\begin_inset LatexCommand label
name "sub:How-to-read-table"

\end_inset


\end_layout

\begin_layout Standard
Tabular input can be read with 
\series bold
read_tab
\series default
 which will read in all rows and chosen columns (separated by a given delimter)
 from a table in text format.
\end_layout

\begin_layout Standard
The table below:
\end_layout

\begin_layout Standard
\noindent
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="4" columns="3">
<features>
<column alignment="left" valignment="top" width="0">
<column alignment="left" valignment="top" width="0">
<column alignment="left" valignment="top" width="0">
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Standard
Human
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Standard
ATACGTCAG
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Standard
23524
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Standard
Dog
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Standard
AGCATGAC
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Standard
2442
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Standard
Mouse
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Standard
GACTG
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Standard
234
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Standard
Cat
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Standard
AAATGCA
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Standard
2342
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\end_layout

\begin_layout Standard
Can be read using the command:
\end_layout

\begin_layout LyX-Code
read_tab --data_in=<file>
\end_layout

\begin_layout Standard
Which will result in four records, one for each row, where the keys V0,
 V1, V2 are the default keys for the organism, sequence, and count, respectively.
 It is possible to select a subset of colums to read by using the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

cols switch which takes a comma separated list of columns numbers (first
 column is designated 0) as argument.
 So to read in only the sequence and the count so that the count comes before
 the sequence do:
\end_layout

\begin_layout LyX-Code
read_tab --data_in=<file> --cols=2,1
\end_layout

\begin_layout Standard
It is also possible to name the columns with the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

keys switch:
\end_layout

\begin_layout LyX-Code
read_tab --data_in=<file> --cols=2,1 --keys=COUNT,SEQ
\end_layout

\begin_layout Subsection
How to read BED input?
\end_layout

\begin_layout Standard
The BED (Browser Extensible Data
\begin_inset Foot
status open

\begin_layout Standard
\begin_inset LatexCommand url
target "http://genome.ucsc.edu/FAQ/FAQformat"

\end_inset


\end_layout

\end_inset

) format is a tabular format for data pertaining to one of the Eukaryotic
 genomes in the UCSC genome brower
\begin_inset Foot
status collapsed

\begin_layout Standard
\begin_inset LatexCommand url
target "http://genome.ucsc.edu/"

\end_inset


\end_layout

\end_inset

.
 The BED format consists of up to 12 columns, where the first three are
 mandatory CHR, CHR_BEG, and CHR_END.
 The mandatory columns and any of the optional columns can all be read in
 easily with the 
\series bold
read_bed
\series default
 biopiece.
\end_layout

\begin_layout LyX-Code
read_bed --data_in=<file>
\end_layout

\begin_layout Standard
It is also possible to read the BED file with 
\series bold
read_tab
\series default
 (see\InsetSpace ~

\begin_inset LatexCommand ref
reference "sub:How-to-read-table"

\end_inset

), however, that will be more cumbersome because you need to specify the
 keys:
\end_layout

\begin_layout LyX-Code
read_tab --data_in=<file> --keys=CHR,CHR_BEG,CHR_END ...
\end_layout

\begin_layout Subsection
How to read PSL input?
\end_layout

\begin_layout Standard
The PSL format is the output from BLAT and contains 21 mandatory fields
 that can be read with 
\series bold
read_psl
\series default
:
\end_layout

\begin_layout LyX-Code
read_psl --data_in=<file>
\end_layout

\begin_layout Section
Writing output
\end_layout

\begin_layout Standard
All result output can be written explicitly to file using the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

result_out switch which all result generating biopieces have.
 It is also possible to write the result to file implicetly by directing
 'stdout' to file using '>', however, that requires the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

no_stream swich to prevent a mixture of data stream and results in the file.
 The explicit (and safe) way:
\end_layout

\begin_layout LyX-Code
...
 | <biopiece> --result_out=<file>
\end_layout

\begin_layout Standard
The implicit way:
\end_layout

\begin_layout LyX-Code
...
 | <biopiece> --no_stream > <file>
\end_layout

\begin_layout Subsection
How to write biopieces output?
\end_layout

\begin_layout Standard
See 
\begin_inset LatexCommand eqref
reference "sub:How-to-write-stream"

\end_inset

.
\end_layout

\begin_layout Subsection
How to write FASTA output?
\begin_inset LatexCommand label
name "sub:How-to-write-fasta"

\end_inset


\end_layout

\begin_layout Standard
FASTA output can be written with 
\series bold
write_fasta
\series default
.
\end_layout

\begin_layout LyX-Code
...
 | write_fasta --result_out=<file>
\end_layout

\begin_layout Standard
It is also possible to wrap the sequences to a given width using the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

wrap switch allthough wrapping of sequence is generally an evil thing:
\end_layout

\begin_layout LyX-Code
...
 | write_fasta --no_stream --wrap=80
\end_layout

\begin_layout Subsection
How to write alignment output?
\begin_inset LatexCommand label
name "sub:How-to-write-alignment"

\end_inset


\end_layout

\begin_layout Standard
Pretty alignments with ruler
\begin_inset Foot
status collapsed

\begin_layout Standard
'.' for every 10 residues, ':' for every 50, and '|' for every 100
\end_layout

\end_inset

 and consensus sequence
\begin_inset Note Note
status collapsed

\begin_layout Standard
which reminds me to make that an option.
\end_layout

\end_inset

 can be created with 
\series bold
write_align
\series default
, what also have the optional 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

wrap switch to break the alignment into blocks of a given width: 
\end_layout

\begin_layout LyX-Code
...
 | write_align --result_out=<file> --wrap=80
\end_layout

\begin_layout Standard
If the number of sequnces in the alignment is 2 then a pairwise alignment
 will be output otherwise a multiple alignment.
 And if the sequence type, determined automagically, is protein, then residues
 and symbols (+,\InsetSpace ~
:,\InsetSpace ~
.) will be used to show consensus according to the Blosum62
 matrix.
\end_layout

\begin_layout Subsection
How to write tabular output?
\begin_inset LatexCommand label
name "sub:How-to-write-tab"

\end_inset


\end_layout

\begin_layout Standard
Outputting the data stream as a table can be done with 
\series bold
write_tab
\series default
, which will write generate one row per record with the values as columns.
 If you supply the optional 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

comment switch, when the first row in the table will be a 'comment' line
 prefixed with a '#':
\end_layout

\begin_layout LyX-Code
...
 | write_tab --result_out=<file> --comment
\end_layout

\begin_layout Standard
You can also change the delimiter from the default (tab) to 
\emph on
e.g.

\emph default
 ',':
\end_layout

\begin_layout LyX-Code
...
 | write_tab --result_out=<file> --delimit=','
\end_layout

\begin_layout Standard
If you want the values output in a specific order you have to supply a comma
 separated list using the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

keys switch that will print only those keys in that order:
\end_layout

\begin_layout LyX-Code
...
 | write_tab --result_out=<file> --keys=SEQ_NAME,COUNT
\end_layout

\begin_layout Standard
Alternatively, if you have some keys that you don't want in the tabular
 output, use the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

no_keys switch.
 So to print all keys except SEQ and SEQ_TYPE do:
\end_layout

\begin_layout LyX-Code
...
 | write_tab --result_out=<file> --no_keys=SEQ,SEQ_TYPE
\end_layout

\begin_layout Standard
Finally, if you have a stream containing a mix of different records types,
 
\emph on
e.g.

\emph default
 records with sequences and records with matches, then you can use 
\series bold
write_tab
\series default
 to output all the records in tabluar format, however, the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

comment, 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

keys, and 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

no_keys switches will only respond to records of the first type encountered.
 The reason is that outputting mixed records is probably not what you want
 anyway, and you should remove all the unwanted records from the stream
 before outputting the table: 
\series bold
grab
\series default
 is your friend (see\InsetSpace ~

\begin_inset LatexCommand ref
reference "sub:How-to-grab"

\end_inset

).
\end_layout

\begin_layout Subsection
How to write a BED output?
\begin_inset LatexCommand label
name "sub:How-to-write-BED"

\end_inset


\end_layout

\begin_layout Standard
Data in BED format can be output if the records contain the mandatory keys
 CHR, CHR_BEG, and CHR_END using 
\series bold
write_bed
\series default
.
 If the optional keys are also present, they will be output as well:
\end_layout

\begin_layout LyX-Code
write_bed --result_out=<file>
\end_layout

\begin_layout Subsection
How to write PSL output?
\begin_inset LatexCommand label
name "sub:How-to-write-PSL"

\end_inset


\end_layout

\begin_layout Standard
Data in PSL format can be output using 
\series bold
write_psl:
\end_layout

\begin_layout LyX-Code
write_psl --result_out=<file>
\end_layout

\begin_layout Section
Manipulating Records
\end_layout

\begin_layout Subsection
How to select a few records?
\begin_inset LatexCommand label
name "sub:How-to-select-a-few-records"

\end_inset


\end_layout

\begin_layout Standard
To quickly get an overview of your data you can limit the data stream to
 show a few records.
 This also very useful to test the pipeline with a few records if you are
 setting up a complex analysis using several biopieces.
 That way you can inspect that all goes well before analyzing and waiting
 for the full data set.
 All of the read_<type> biopieces have the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

num switch which will take a number as argument and only that number of
 records will be read.
 So to read in the first 10 FASTA entries from a file:
\end_layout

\begin_layout LyX-Code
read_fasta --data_in=test.fna --num=10
\end_layout

\begin_layout Standard
Another way of doing this is to use 
\series bold
head_records
\series default
 will limit the stream to show the first 10 records (default):
\end_layout

\begin_layout LyX-Code
...
 | head_records
\end_layout

\begin_layout Standard
Using 
\series bold
head_records
\series default
 directly after one of the read_<type> biopieces will be a lot slower than
 using the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

num switch with the read_<type> biopieces, however, 
\series bold
head_records
\series default
 can also be used to limit the output from all the other biopieces.
 It is also possible to give 
\series bold
head_records
\series default
 a number of records to show using the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

num switch.
 So to display the first 100 records do:
\end_layout

\begin_layout LyX-Code
...
 | head_records --num=100
\end_layout

\begin_layout Subsection
How to select random records?
\begin_inset LatexCommand label
name "sub:How-to-select-random-records"

\end_inset


\end_layout

\begin_layout Standard
If you want to inspect a number of random records from the stream this can
 be done with the 
\series bold
random_records
\series default
 biopiece.
 So if you have 1 mio records in the stream and you want to select 1000
 random records do:
\end_layout

\begin_layout LyX-Code
...
 | random_records --num=1000
\end_layout

\begin_layout Subsection
How to count all records in the data stream?
\end_layout

\begin_layout Standard
To count all the records in the data stream use 
\series bold
count_records
\series default
, which adds one record (which is not included in the count) to the data
 stream.
 So to count the number of sequences in a FASTA file you can do this:
\end_layout

\begin_layout LyX-Code
cat test.fna | read_fasta | count_records --no_stream
\end_layout

\begin_layout Standard
Which will write the last record containing the count to 'stdout':
\end_layout

\begin_layout LyX-Code

\size scriptsize
count_records: 630
\end_layout

\begin_layout LyX-Code

\size scriptsize
---
\end_layout

\begin_layout Standard
It is also possible to write the count to file using the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

result_out switch.
\end_layout

\begin_layout Subsection
How to get the length of record values?
\begin_inset LatexCommand label
name "sub:How-to-get-value_length"

\end_inset


\end_layout

\begin_layout Standard
Use the 
\series bold
length_vals
\series default
 biopiece to get the length of each value for a comma separated list of
 keys:
\end_layout

\begin_layout LyX-Code
...
 | length_vals --keys=HIT,PATTERN
\end_layout

\begin_layout Subsection
How to grab specific records?
\begin_inset LatexCommand label
name "sub:How-to-grab"

\end_inset


\end_layout

\begin_layout Standard
The biopiece 
\series bold
grab
\series default
 is related to the Unix grep and locates records based on matching keys
 and/or values using either a pattern, a Perl regex, or a numerical evaluation.
 To easily 
\series bold
grab
\series default
 all records in the stream that has any mentioning of the pattern 'human'
 just pipe the data stream through 
\series bold
grab
\series default
 like this:
\end_layout

\begin_layout LyX-Code
...
 | grab --pattern=human
\end_layout

\begin_layout Standard
This will search for the pattern 'human' in all keys and all values.
 The 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

pattern switch takes a comma separated list of patterns, so in order to
 match multiple patterns do:
\end_layout

\begin_layout LyX-Code
...
 | grab --pattern=human,mouse
\end_layout

\begin_layout Standard
It is also possible to use the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

pattern_in switch instead of 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

pattern.
 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

pattern_in is used to read a file with one pattern per line:
\end_layout

\begin_layout LyX-Code
...
 | grab --pattern_in=patterns.txt
\end_layout

\begin_layout Standard
If you want the opposite result --- to find all records that does not match
 the patterns, add the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

invert switch, which not only works with the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

pattern switch, but also with 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

regex and 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

eval:
\end_layout

\begin_layout LyX-Code
...
 | grab --pattern=human --invert
\end_layout

\begin_layout Standard
If you want to search the record keys only, 
\emph on
e.g.

\emph default
 to find all records containing the key SEQ you can add the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

keys_only switch.
 This will prevent matching of SEQ in any record value, and in fact SEQ
 is a not uncommon peptide sequence you could get an unwanted record.
 Also, this will give an increase in speed since only the keys are searched:
\end_layout

\begin_layout LyX-Code
...
 | grab --pattern=SEQ --keys_only
\end_layout

\begin_layout Standard
However, if you are interested in finding the peptide sequence SEQ and not
 the SEQ key, just add the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

vals_only switch instead:
\end_layout

\begin_layout LyX-Code
...
 | grab --pattern=SEQ --vals_only
\end_layout

\begin_layout Standard
Also, if you want to grab for certain key/value pairs you can supply a comma
 separated list of keys whos values will then be searched using the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

keys switch.
 This is handy if your records contain large genomic sequences and you dont
 want to search the entire sequence for 
\emph on
e.g.

\emph default
 the organism name --- it is much faster to tell 
\series bold
grab
\series default
 which keys to search the value for:
\end_layout

\begin_layout LyX-Code
...
 | grab --pattern=human --keys=SEQ_NAME
\end_layout

\begin_layout LyX-Code

\end_layout

\begin_layout Standard
It is also possible to invoke flexible matching using regex (regular expressions
) instead of simple pattern matching.
 In 
\series bold
grab
\series default
 the regex engine is Perl based and allows use of different type of wild
 cards, alternatives, 
\emph on
etc
\emph default

\begin_inset Foot
status open

\begin_layout Standard
\begin_inset LatexCommand url
target "http://perldoc.perl.org/perlreref.html"

\end_inset


\end_layout

\end_inset

.
 If you want to 
\series bold
grab
\series default
 records withs the sequence ATCG or GCTA you can do this:
\end_layout

\begin_layout LyX-Code
...
 | grab --regex='ATCG|GCTA'
\end_layout

\begin_layout Standard
Or if you want to find sequences beginning with ATCG:
\end_layout

\begin_layout LyX-Code
...
 | grab --regex='^ATCG'
\end_layout

\begin_layout Standard
You can also use 
\series bold
grab
\series default
 to locate records that fulfill a numerical property using the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

eval switch witch takes an expression in three parts.
 The first part is the key that holds the value we want to evaluate, the
 second part holds one the six operators:
\end_layout

\begin_layout Enumerate
Greater than: >
\end_layout

\begin_layout Enumerate
Greater than or equal to: >=
\end_layout

\begin_layout Enumerate
Less than: <
\end_layout

\begin_layout Enumerate
Less than or equal to: <=
\end_layout

\begin_layout Enumerate
Equal to: =
\end_layout

\begin_layout Enumerate
Not equal to: !=
\end_layout

\begin_layout Enumerate
String wise equal to: eq
\end_layout

\begin_layout Enumerate
String wise not equal to: ne
\end_layout

\begin_layout Standard
And finally comes the number used in the evaluation.
 So to 
\series bold
grab
\series default
 all records with a sequence length greater than 30:
\end_layout

\begin_layout LyX-Code
...
 length_seq | grab --eval='SEQ_LEN > 30'
\end_layout

\begin_layout Standard
If you want to locate all records containing the pattern 'human' and where
 the sequence length is greater that 30, you do this by running the stream
 through 
\series bold
grab
\series default
 twice:
\end_layout

\begin_layout LyX-Code
...
 | grab --pattern='human' | length_seq | grab --eval='SEQ_LEN > 30'
\end_layout

\begin_layout Standard
Finally, it is possible to do fast matching of expressions from a file using
 the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

exact switch.
 Each of these expressions has to be matched exactly over the entrie length,
 which if useful if you have a file with accession numbers, that you want
 to locate in the stream:
\end_layout

\begin_layout LyX-Code
...
 | grab --exact acc_no.txt | ...
\end_layout

\begin_layout Standard
Using 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

exact is much faster than using 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

pattern_in, because with 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

exact the expression has to be complete matches, where 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

pattern_in looks for subpatterns.
\end_layout

\begin_layout Standard
NB! To get the best speed performance, use the most restrictive 
\series bold
grab
\series default
 first.
\end_layout

\begin_layout Subsection
How to remove keys from records?
\end_layout

\begin_layout Standard
To remove one or more specific keys from all records in the data stream
 use 
\series bold
remove_keys
\series default
 like this:
\end_layout

\begin_layout LyX-Code
...
 | remove_keys --keys=SEQ,SEQ_NAME
\end_layout

\begin_layout Standard
In the above example SEQ and SEQ_NAME will be removed from all records if
 they exists in these.
 If all keys are removed from a record, then the record will be removed.
\end_layout

\begin_layout Subsection
How to rename keys in records?
\end_layout

\begin_layout Standard
Sometimes you want to rename a record key, 
\emph on
e.g.

\emph default
 if you have read in a two column table with sequence name and sequence
 in each column (see 
\begin_inset LatexCommand ref
reference "sub:How-to-read-table"

\end_inset

) without specifying the key names, then the sequence name will be called
 V0 and the sequence V1 as default in the 
\series bold
read_tab
\series default
 biopiece.
 To rename the V0 and V1 keys we need to run the stream through 
\series bold
rename_keys
\series default
 twice (one for each key to rename):
\end_layout

\begin_layout LyX-Code
...
 | rename_keys --keys=V0,SEQ_NAME | rename_keys --keys=V1,SEQ
\end_layout

\begin_layout Standard
The first instance of 
\series bold
rename_keys
\series default
 replaces all the V0 keys with SEQ_NAME, and the second instance of 
\series bold
rename_keys
\series default
 replaces all the V1 keys with SEQ.
 
\emph on
Et viola
\emph default
 the data can now be used in the biopieces that requires these keys.
\end_layout

\begin_layout Section
Manipulating Sequences
\end_layout

\begin_layout Subsection
How to get sequence lengths?
\end_layout

\begin_layout Standard
The length for sequences in records can be determined with 
\series bold
length_seq
\series default
, which adds the key SEQ_LEN to each record with the sequence length as
 the value.
 It also generates an extra record that is emitted last with the key TOTAL_SEQ_L
EN showing the total length of all the sequences.
\end_layout

\begin_layout LyX-Code
read_fasta --data_in=<file> | length_seq
\end_layout

\begin_layout Standard
It is also possible to determine the sequence length using the generic tool
 
\series bold
length_vals
\series default
 
\begin_inset LatexCommand eqref
reference "sub:How-to-get-value_length"

\end_inset

, which determines the length of the values for a given list of keys:
\end_layout

\begin_layout LyX-Code
read_fasta --data_in=<file> | length_vals --keys=SEQ
\end_layout

\begin_layout Standard
To obtain the total length of all sequences use 
\series bold
sum_vals
\series default
 like this:
\end_layout

\begin_layout LyX-Code
read_fasta --data_in=<file> | length_vals --keys=SEQ
\end_layout

\begin_layout LyX-Code
| sum_vals --keys=SEQ_LEN
\end_layout

\begin_layout Standard
The biopiece 
\series bold
analyze_seq
\series default
 will also determine the length of each sequence (see\InsetSpace ~

\begin_inset LatexCommand ref
reference "sub:How-to-analyze"

\end_inset

).
\end_layout

\begin_layout Subsection
How to analyze sequence composition?
\begin_inset LatexCommand label
name "sub:How-to-analyze"

\end_inset


\end_layout

\begin_layout Standard
If you want to find out the sequence type, composition, length, as well
 as GC content, indel content and proportions of soft and hard masked sequence,
 then use 
\series bold
analyze_seq
\series default
.
 This handy biopiece will determine all these things per sequence from which
 it is easy to get an overview using the 
\series bold
write_tab
\series default
 biopiece to output a table (see\InsetSpace ~

\begin_inset LatexCommand ref
reference "sub:How-to-write-tab"

\end_inset

).
 So in order to determine the sequence composition of a FASTA file with
 just one entry containing the sequence 'ATCG' we just read the data with
 
\series bold
read_fasta
\series default
 and run the output through 
\series bold
analyze_seq
\series default
 which will add the analysis to the record like this:
\end_layout

\begin_layout LyX-Code
read_fasta --data_in=test.fna | analyze_seq ...
\end_layout

\begin_layout LyX-Code

\end_layout

\begin_layout LyX-Code

\size scriptsize
RES:D: 0
\end_layout

\begin_layout LyX-Code

\size scriptsize
MIX_INDEX: 0.55
\end_layout

\begin_layout LyX-Code

\size scriptsize
RES:W: 0
\end_layout

\begin_layout LyX-Code

\size scriptsize
RES:G: 16
\end_layout

\begin_layout LyX-Code

\size scriptsize
SOFT_MASK%: 63.75
\end_layout

\begin_layout LyX-Code

\size scriptsize
RES:B: 0
\end_layout

\begin_layout LyX-Code

\size scriptsize
RES:V: 0
\end_layout

\begin_layout LyX-Code

\size scriptsize
HARD_MASK%: 0.00
\end_layout

\begin_layout LyX-Code

\size scriptsize
RES:H: 0 
\end_layout

\begin_layout LyX-Code

\size scriptsize
RES:S: 0 
\end_layout

\begin_layout LyX-Code

\size scriptsize
RES:N: 0
\end_layout

\begin_layout LyX-Code

\size scriptsize
RES:.: 0 
\end_layout

\begin_layout LyX-Code

\size scriptsize
GC%: 35.00 
\end_layout

\begin_layout LyX-Code

\size scriptsize
RES:A: 8 
\end_layout

\begin_layout LyX-Code

\size scriptsize
RES:Y: 0 
\end_layout

\begin_layout LyX-Code

\size scriptsize
RES:M: 0 
\end_layout

\begin_layout LyX-Code

\size scriptsize
RES:T: 44 
\end_layout

\begin_layout LyX-Code

\size scriptsize
SEQ_TYPE: DNA 
\end_layout

\begin_layout LyX-Code

\size scriptsize
RES:K: 0 
\end_layout

\begin_layout LyX-Code

\size scriptsize
RES:~: 0 
\end_layout

\begin_layout LyX-Code

\size scriptsize
SEQ: TTTCAGTTTGGGACGGAGTAAGGCCTTCCtttttttttttttttttttttttttttttgagaccgagtcttgctc
tgtcg 
\end_layout

\begin_layout LyX-Code

\size scriptsize
SEQ_LEN: 
\end_layout

\begin_layout LyX-Code

\size scriptsize
80 RES:R: 0 
\end_layout

\begin_layout LyX-Code

\size scriptsize
RES:C: 12 
\end_layout

\begin_layout LyX-Code

\size scriptsize
RES:-: 0 
\end_layout

\begin_layout LyX-Code

\size scriptsize
RES:U: 0
\end_layout

\begin_layout LyX-Code

\size scriptsize
---
\end_layout

\begin_layout LyX-Code

\end_layout

\begin_layout Standard
Now to make a table of how may As, Ts, Cs, and Gs you can add the following:
\end_layout

\begin_layout LyX-Code
...
 | analyze_seq | write_tab --keys=RES:A,RES:T,RES:C,RES:G
\end_layout

\begin_layout Standard
Or if you want to see the proportions of hard and soft masked sequence:
\end_layout

\begin_layout LyX-Code
...
 | analyse_seq | write_tab --keys=HARD_MASK%,SOFT_MASK%
\end_layout

\begin_layout Standard
If you have a stack of sequences in one file and you want to determine the
 mean GC content you can do it using the 
\series bold
mean_vals
\series default
 biopiece:
\end_layout

\begin_layout LyX-Code
read_fasta --data_in=test.fna | analyze_seq | mean_vals --keys=GC%
\end_layout

\begin_layout Standard
Or if you want the total count of Ns you can use 
\series bold
sum_vals
\series default
 like this:
\end_layout

\begin_layout LyX-Code
read_fasta --data_in=test.fna | analyze_seq | sum_vals --keys=RES:N
\end_layout

\begin_layout Standard
The MIX_INDEX key is calculated as the count of the most common residue
 over the sequence length, and can be used as a cut-off for removing sequence
 tags consisting of mostly one nucleotide:
\end_layout

\begin_layout LyX-Code
read_fasta --data_in=test.fna | analyze_seq | grab --eval='MIX_INDEX<0.85'
\end_layout

\begin_layout Subsection
How to extract subsequences?
\begin_inset LatexCommand label
name "sub:How-to-extract"

\end_inset


\end_layout

\begin_layout Standard
In order to extract a subsequence from a longer sequence use the biopiece
 extract_seq, which will replace the sequence in the record with the subsequence
 (this behaviour should probably be modified to be dependant of a 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

replace or a 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

no_replace switch
\begin_inset Note Note
status collapsed

\begin_layout Standard
also in split_seq
\end_layout

\end_inset

).
 So to extract the first 20 residues from all sequences do (first residue
 is designated 1): 
\end_layout

\begin_layout LyX-Code
...
 | extract_seq --beg=1 --len=20
\end_layout

\begin_layout Standard
You can also specify a begin and end coordinate set:
\end_layout

\begin_layout LyX-Code
...
 | extract_seq --beg=20 --end=40
\end_layout

\begin_layout Standard
If you want the subsequences from position 20 to the sequence end do:
\end_layout

\begin_layout LyX-Code
...
 | extract_seq --beg=20
\end_layout

\begin_layout Standard
If you want to extract subsequences a given distance from the sequence end
 you can do this by reversing the sequence with the biopiece 
\series bold
reverse_seq
\series default
 
\begin_inset LatexCommand eqref
reference "sub:How-to-reverse-seq"

\end_inset

, followed by 
\series bold
extract_seq
\series default
 to get the subsequence, and then 
\series bold
reverse_seq
\series default
 again to get the subsequence back in the original orientation:
\end_layout

\begin_layout LyX-Code
read_fasta --data_in=test.fna | reverse_seq
\end_layout

\begin_layout LyX-Code
| extract_seq --beg=10 --len=10 | reverse_seq
\end_layout

\begin_layout Subsection
How to get genomic sequence?
\begin_inset LatexCommand label
name "sub:How-to-get-genomic-sequence"

\end_inset


\end_layout

\begin_layout Standard
The biopiece 
\series bold
get_genomic_seq
\series default
 can extract subsequences for a given genome specified with the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

genome switch explicitly using the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

beg and 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

end/
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

len switches:
\end_layout

\begin_layout LyX-Code
get_genome_seq --genome=<genome> --beg=1 --len=100
\end_layout

\begin_layout Standard
Alternatively, 
\series bold
get_genome_seq
\series default
 can be used to append the corresponding sequence to BED, PSL, and BLAST
 records:
\end_layout

\begin_layout LyX-Code
read_bed --data_in=<BED file> | get_genome_seq --genome=<genome>
\end_layout

\begin_layout Standard
It is also possible to include flaking sequence using the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

flank switch.
 So to include 50 nucleotides upstream and 50 nucleotides downstream for
 each BED entry do:
\end_layout

\begin_layout LyX-Code
read_bed --data_in=<BED file> | get_genome_seq --genome=<genome> --flank=50
\end_layout

\begin_layout Subsection
How to upper-case sequences?
\end_layout

\begin_layout Standard
Sequences can be shifted from lower case to upper case using 
\series bold
uppercase_seq
\series default
:
\end_layout

\begin_layout LyX-Code
...
 | uppercase_seq
\end_layout

\begin_layout Subsection
How to reverse sequences?
\begin_inset LatexCommand label
name "sub:How-to-reverse-seq"

\end_inset


\end_layout

\begin_layout Standard
The order of residues in a sequence can be reversed using reverse_seq:
\end_layout

\begin_layout LyX-Code
...
 | reverse_seq
\end_layout

\begin_layout Standard
Note that in order to reverse/complement a sequence you also need the 
\series bold
complement_seq
\series default
 biopiece (see\InsetSpace ~

\begin_inset LatexCommand ref
reference "sub:How-to-complement"

\end_inset

).
\end_layout

\begin_layout Subsection
How to complement sequences?
\begin_inset LatexCommand label
name "sub:How-to-complement"

\end_inset


\end_layout

\begin_layout Standard
DNA and RNA sequences can be complemented with 
\series bold
complement_seq
\series default
, which automagically determines the sequence type:
\end_layout

\begin_layout LyX-Code
...
 | complement_seq
\end_layout

\begin_layout Standard
Note that in order to reverse/complement a sequence you also need the 
\series bold
reverse_seq
\series default
 biopiece (see\InsetSpace ~

\begin_inset LatexCommand ref
reference "sub:How-to-reverse-seq"

\end_inset

).
\end_layout

\begin_layout Subsection
How to remove indels from sequnces?
\end_layout

\begin_layout Standard
Indels can be removed from sequences with the 
\series bold
remove_indels
\series default
 biopiece.
 This is useful if you have aligned some sequences (see\InsetSpace ~

\begin_inset LatexCommand ref
reference "sub:How-to-align"

\end_inset

) and extracted (see\InsetSpace ~

\begin_inset LatexCommand ref
reference "sub:How-to-extract"

\end_inset

) a block of subsequences from the alignment and you want to use these sequence
 in a search where you need to remove the indels first.
 '-', '~', and '.' are considered indels:
\end_layout

\begin_layout LyX-Code
...
 | remove_indels
\end_layout

\begin_layout Subsection
How to shuffle sequences?
\end_layout

\begin_layout Standard
All residues in sequences in the stream can be shuffled to random positions
 using the 
\series bold
shuffle_seq
\series default
 biopiece:
\end_layout

\begin_layout LyX-Code
...
 | shuffle_seq
\end_layout

\begin_layout Subsection
How to split sequences into overlapping subsequences?
\end_layout

\begin_layout Standard
Sequences can be slit into overlapping subsequences with the 
\series bold
split_seq
\series default
 biopiece.
\end_layout

\begin_layout LyX-Code
...
 | split_seq --word_size=20 --uniq
\end_layout

\begin_layout Subsection
How to determine the oligo frequency?
\end_layout

\begin_layout Standard
In order to determine if any oligo usage is over represented in one or more
 sequences you can determine the frequency of oligos of a given size with
 
\series bold
oligo_freq
\series default
:
\end_layout

\begin_layout LyX-Code
...
 | oligo_freq --word_size=4
\end_layout

\begin_layout Standard
And if you have more than one sequence and want to accumulate the frequences
 you need the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

all switch:
\end_layout

\begin_layout LyX-Code
...
 | oligo_freq --word_size=4 --all
\end_layout

\begin_layout Standard
To get a meaningful result you need to write the resulting frequencies as
 a table with 
\series bold
write_tab
\series default
 (see\InsetSpace ~

\begin_inset LatexCommand ref
reference "sub:How-to-write-tab"

\end_inset

), but first it is important to 
\series bold
grab
\series default
 (see\InsetSpace ~

\begin_inset LatexCommand ref
reference "sub:How-to-grab"

\end_inset

) the records with the frequencies to avoid full length sequences in the
 table:
\end_layout

\begin_layout LyX-Code
...
 | oligo_freq --word_size=4 --all | grab --pattern=OLIGO --keys_only
\end_layout

\begin_layout LyX-Code
| write_tab --no_stream
\end_layout

\begin_layout Standard
And the resulting frequency table can be sorted with Unix sort (man sort).
\end_layout

\begin_layout Subsection
How to search for sequences in genomes?
\end_layout

\begin_layout Standard
See the following biopiece:
\end_layout

\begin_layout Itemize

\series bold
patscan_seq
\series default
 
\begin_inset LatexCommand eqref
reference "sub:How-to-use-patscan"

\end_inset


\end_layout

\begin_layout Itemize

\series bold
blat_seq
\series default
 
\begin_inset LatexCommand eqref
reference "sub:How-to-use-BLAT"

\end_inset


\end_layout

\begin_layout Itemize

\series bold
blast_seq
\series default
 
\begin_inset LatexCommand eqref
reference "sub:How-to-use-BLAST"

\end_inset


\end_layout

\begin_layout Itemize

\series bold
vmatch_seq
\series default
 
\begin_inset LatexCommand eqref
reference "sub:How-to-use-Vmatch"

\end_inset


\end_layout

\begin_layout Subsection
How to search sequences for a pattern?
\begin_inset LatexCommand label
name "sub:How-to-use-patscan"

\end_inset


\end_layout

\begin_layout Standard
It is possible to search sequences in the data stream for patterns using
 the 
\series bold
patscan_seq
\series default
 biopiece which utilizes the powerful scan_for_matches engine.
 Consult the documentation for scan_for_matches in order to learn how to
 define patterns (the documentation is included in Appendix\InsetSpace ~

\begin_inset LatexCommand ref
reference "sec:scan_for_matches-README"

\end_inset

).
\end_layout

\begin_layout Standard
To search all sequences for a simple pattern consisting of the sequence
 ATCGATCG allowing for 3 mismatches, 2 insertions and 1 deletion:
\end_layout

\begin_layout LyX-Code
read_fasta --data_in=<file> | patscan_seq --pattern='ATCGATCG[3,2,1]'
\end_layout

\begin_layout Standard
The 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

pattern switch takes a comma seperated list of patterns, so if you want
 to search for more that one pattern do:
\end_layout

\begin_layout LyX-Code
...
 | patscan_seq --pattern='ATCGATCG[3,2,1],GCTAGCTA[3,2,1]'
\end_layout

\begin_layout Standard
It is also possible to have a list of different patterns to search for in
 a file with one pattern per line.
 In order to get 
\series bold
patscan_seq
\series default
 to read these patterns use the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

pattern_in switch:
\end_layout

\begin_layout LyX-Code
...
 | patscan_seq --pattern_in=<file>
\end_layout

\begin_layout Standard
To also scan the complementary strand in nucleotide sequences (
\series bold
patscan_seq
\series default
 automagically determines the sequence type) you need to add the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

comp switch:
\end_layout

\begin_layout LyX-Code
...
 | patscan_seq --pattern=<pattern> --comp
\end_layout

\begin_layout Standard
It is also possible to use 
\series bold
patscan_seq
\series default
 to output those records that does not contain a certain pattern by using
 the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

invert switch:
\end_layout

\begin_layout LyX-Code
...
 | patscan_seq --pattern=<pattern> --invert
\end_layout

\begin_layout Standard
Finally, 
\series bold
patscan_seq
\series default
 can also scan for patterns in a given genome sequence, instead of sequences
 in the stream, using the 
\begin_inset ERT
status open

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

genome switch:
\end_layout

\begin_layout LyX-Code
patscan --pattern=<pattern> --genome=<genome>
\end_layout

\begin_layout Subsection
How to use BLAT for sequence search?
\begin_inset LatexCommand label
name "sub:How-to-use-BLAT"

\end_inset


\end_layout

\begin_layout Standard
Sequences in the data stream can be matched against supported genomes using
 
\series bold
blat_seq
\series default
 which is a biopiece using BLAT as the name might suggest.
 Currently only Mouse and Human genomes are available and it is not possible
 to use OOC files since there is still a need for a local repository for
 genome files.
 Otherwise it is just:
\end_layout

\begin_layout LyX-Code
read_fasta --data_in=<file> | blat_seq --genome=<genome>
\end_layout

\begin_layout Standard
The search results can then be written to file with 
\series bold
write_psl
\series default
 (see\InsetSpace ~

\begin_inset LatexCommand ref
reference "sub:How-to-write-PSL"

\end_inset

) or 
\series bold
write_bed
\series default
 (see\InsetSpace ~

\begin_inset LatexCommand ref
reference "sub:How-to-write-BED"

\end_inset

) allthough with 
\series bold
write_bed
\series default
 some information will be lost).
 It is also possible to plot chromosome distribution of the search results
 using 
\series bold
plot_chrdist
\series default
 (see\InsetSpace ~

\begin_inset LatexCommand ref
reference "sub:How-to-plot-chrdist"

\end_inset

) or the distribution of the match lengths using 
\series bold
plot_lendist
\series default
 (see\InsetSpace ~

\begin_inset LatexCommand ref
reference "sub:How-to-plot-lendist"

\end_inset

) or a karyogram with the hits using 
\series bold
plot_karyogram
\series default
 (see\InsetSpace ~

\begin_inset LatexCommand ref
reference "sub:How-to-plot-karyogram"

\end_inset

).
\end_layout

\begin_layout Subsection
How to use BLAST for sequence search?
\begin_inset LatexCommand label
name "sub:How-to-use-BLAST"

\end_inset


\end_layout

\begin_layout Standard
Two biopieces exist for blasting sequences: 
\series bold
create_blast_db
\series default
 is used to create the BLAST database required for BLAST which is queried
 using the biopiece 
\series bold
blast_seq
\series default
.
 So in order to create a BLAST database from sequences in the data stream
 you simple run:
\end_layout

\begin_layout LyX-Code
...
 | create_blast_db --database=my_database --no_stream
\end_layout

\begin_layout Standard
The type of sequence to use for the database is automagically determined
 by 
\series bold
create_blast_db
\series default
, but don't have a mixture of peptide and nucleic acids sequences in the
 stream.
 The 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

database switch takes a path as argument, but will default to 'blastdb_<time_sta
mp> if not set.
\end_layout

\begin_layout Standard
The resulting database can now be queried with sequences in another data
 stream using 
\series bold
blast_seq
\series default
:
\end_layout

\begin_layout LyX-Code
...
 | blast_seq --database=my_database
\end_layout

\begin_layout Standard
Again, the sequence type is determined automagically and the appropriate
 BLAST program is guessed (see below table), however, the program name can
 be overruled with the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

program switch.
\end_layout

\begin_layout Standard
\noindent
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="5" columns="3">
<features>
<column alignment="center" valignment="top" width="0">
<column alignment="center" valignment="top" width="0">
<column alignment="center" valignment="top" width="0">
<row bottomline="true">
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Standard
Subject sequence
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Standard
Query sequence
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Standard
Program guess
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Standard
Nucleotide
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Standard
Nucleotide
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Standard
blastn
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Standard
Protein
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Standard
Protein
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Standard
blastp
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Standard
Protein
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Standard
Nucleotide
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Standard
blastx
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Standard
Nucleotide
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Standard
Protein
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Standard
tblastn
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\end_layout

\begin_layout Standard
Finally, it is also possible to use 
\series bold
blast_seq
\series default
 for blasting sequences agains a preformatted genome using the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

genome switch instead of the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

database switch:
\end_layout

\begin_layout LyX-Code
...
 | blast_seq --genome=<genome>
\end_layout

\begin_layout Subsection
How to use Vmatch for sequence search?
\begin_inset LatexCommand label
name "sub:How-to-use-Vmatch"

\end_inset


\end_layout

\begin_layout Standard
The powerful suffix array software package Vmatch
\begin_inset Foot
status collapsed

\begin_layout Standard
\begin_inset LatexCommand url
target "http://www.vmatch.de/"

\end_inset


\end_layout

\end_inset

 can be used for exact mapping of sequences against indexed genomes using
 the biopiece 
\series bold
vmatch_seq
\series default
, which will e.g.
 map 700000 ESTs to the human genome locating all 160 mio hits in less than
 an hour.
 Only nucleotide sequences and sequences longer than 11 nucleotides will
 be mapped.
 It is recommended that sequences consisting of mostly one nucleotide type
 are removed.
 This can be done with the 
\series bold
analyze_seq
\series default
 biopiece 
\begin_inset LatexCommand eqref
reference "sub:How-to-analyze"

\end_inset

.
\end_layout

\begin_layout LyX-Code
...
 | vmatch_seq --genome=<genome>
\end_layout

\begin_layout Standard
It is also possible to allow for mismatches using the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

hamming_dist switch.
 So to allow for 2 mismatches:
\end_layout

\begin_layout LyX-Code
...
 | vmatch_seq --genome=<genome> --hamming_dist=2
\end_layout

\begin_layout Standard
Or to allow for 10% mismathing nucleotides:
\end_layout

\begin_layout LyX-Code
...
 | vmatch_seq --genome=<genome> --hamming_dist=10p
\end_layout

\begin_layout Standard
To allow both indels and mismatches use the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

edit_dist switch.
 So to allow for one mismatch or one indel:
\end_layout

\begin_layout LyX-Code
...
 | vmatch_seq --genome=<genome> --hamming_dist=1
\end_layout

\begin_layout Standard
Or to allow for 5% indels or mismatches:
\end_layout

\begin_layout LyX-Code
...
 | vmatch_seq --genome=<genome> --hamming_dist=5p
\end_layout

\begin_layout Standard
Note that using 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

hamming_dist or
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

edit_dist greatly slows down vmatch considerably --- use with care.
\end_layout

\begin_layout Standard
The resulting SCORE key can be replaced to hold the number of genome matches
 of a given sequence (multi-mappers) is the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

count switch is given.
\end_layout

\begin_layout Subsection
How to find all matches between sequences?
\begin_inset LatexCommand label
name "sub:How-to-find-matches"

\end_inset


\end_layout

\begin_layout Standard
All matches between two sequences can be determined with the biopiece 
\series bold
match_seq
\series default
.
 The match finding engine underneath the hood of 
\series bold
match_seq
\series default
 is the super fast suffix tree program MUMmer
\begin_inset Foot
status collapsed

\begin_layout Standard
\begin_inset LatexCommand url
target "http://mummer.sourceforge.net/"

\end_inset


\end_layout

\end_inset

, which will locate all forward and reverse matches between huge sequences
 in a matter of minutes (if the repeat count is not too high and if the
 word size used is appropriate).
 Matching two 
\emph on
Helicobacter pylori
\emph default
 genomes (1.7Mbp) takes around 10 seconds:
\end_layout

\begin_layout LyX-Code
...
 | match_seq --word_size=20 --direction=both
\end_layout

\begin_layout Standard
The output from 
\series bold
match_seq
\series default
 can be used to generate a dot plot with 
\series bold
plot_matches
\series default
 (see\InsetSpace ~

\begin_inset LatexCommand ref
reference "sub:How-to-generate-dotplot"

\end_inset

).
\end_layout

\begin_layout Subsection
How to align sequences?
\begin_inset LatexCommand label
name "sub:How-to-align"

\end_inset


\end_layout

\begin_layout Standard
Sequences in the stream can be aligned with the 
\series bold
align_seq
\series default
 biopiece that uses Muscle
\begin_inset Foot
status open

\begin_layout Standard
\begin_inset LatexCommand url
target "http://www.drive5.com/muscle/muscle.html"

\end_inset


\end_layout

\end_inset

 as aligment engine.
 Currently you cannot change any of the Muscle alignment parameters and
 
\series bold
align_seq
\series default
 will create an alignment based on the defaults (which are really good!):
\end_layout

\begin_layout LyX-Code
...
 | align_seq
\end_layout

\begin_layout Standard
The aligned output can be written to file in FASTA format using 
\series bold
write_fasta
\series default
 (see\InsetSpace ~

\begin_inset LatexCommand ref
reference "sub:How-to-write-fasta"

\end_inset

) or in pretty text using 
\series bold
write_align
\series default
 (see\InsetSpace ~

\begin_inset LatexCommand ref
reference "sub:How-to-write-alignment"

\end_inset

).
\end_layout

\begin_layout Subsection
How to create a weight matrix?
\end_layout

\begin_layout Standard
If you want a weight matrix to show the sequence composition of a stack
 of sequences you can use the biopiece create_weight_matrix:
\end_layout

\begin_layout LyX-Code
...
 | create_weight_matrix
\end_layout

\begin_layout Standard
The result can be output in percent using the 
\begin_inset ERT
status open

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

percent switch:
\end_layout

\begin_layout LyX-Code
...
 | create_weight_matrix --percent
\end_layout

\begin_layout Standard
The weight matrix can be written as tabular output with 
\series bold
write_tab
\series default
 (see\InsetSpace ~

\begin_inset LatexCommand ref
reference "sub:How-to-write-tab"

\end_inset

) after removeing the records containing SEQ with 
\series bold
grab
\series default
 (see\InsetSpace ~

\begin_inset LatexCommand ref
reference "sub:How-to-grab"

\end_inset

):
\end_layout

\begin_layout LyX-Code
...
 | create_weight_matrix | grab --invert --keys=SEQ --keys_only
\end_layout

\begin_layout LyX-Code
| write_tab --no_stream
\end_layout

\begin_layout Standard
The V0 column will hold the residue, while the rest of the columns will
 hold the frequencies for each sequence position.
\end_layout

\begin_layout Section
Plotting
\end_layout

\begin_layout Standard
There exists several biopieces for plotting.
 Some of these are based on GNUplot
\begin_inset Foot
status open

\begin_layout Standard
\begin_inset LatexCommand url
target "http://www.gnuplot.info/"

\end_inset


\end_layout

\end_inset

, which is an extremely powerful platform to generate all sorts of plots
 and even though GNUplot has quite a steep learning curve, the biopieces
 utilizing GNUplot are simple to use.
 GNUplot is able to output a lot of different formats (called terminals
 in GNUplot), but the biopieces focusses on three formats only:
\end_layout

\begin_layout Enumerate
The 'dumb' terminal is default to the GNUplot based biopieces and will output
 a plot in crude ASCII text (Fig.\InsetSpace ~

\begin_inset LatexCommand ref
reference "fig:Dumb-terminal"

\end_inset

).
 This is quite nice for a quick and dirty plot to get an overview of your
 data .
\end_layout

\begin_layout Enumerate
The 'post' or 'postscript' terminal output postscript code which is publication
 grade graphics that can be viewed with applications such as Ghostview,
 Photoshop, and Preview.
\end_layout

\begin_layout Enumerate
The 'svg' terminal output's scalable vector graphics (SVG) which is a vector
 based format.
 SVG is great because you can edit the resulting plot using Photoshop or
 Inkscape
\begin_inset Foot
status collapsed

\begin_layout Standard
Inkscape is a really handy drawing program that is free and open source.
 Availble at 
\begin_inset LatexCommand htmlurl
target "http://www.inkscape.org"

\end_inset


\end_layout

\end_inset

 if you want to add additional labels, captions, arrows, and so on and then
 save the result in different formats, such as postscript without loosing
 resolution.
\end_layout

\begin_layout Standard
The biopieces for plotting that are not based on GNUplot only output SVG
 (that may change in the future).
\end_layout

\begin_layout Standard
\begin_inset Float figure
wide false
sideways false
status open

\begin_layout Standard
\noindent
\align center
\begin_inset Graphics
	filename lendist_ascii.png
	lyxscale 70
	width 12cm

\end_inset


\end_layout

\begin_layout Standard
\begin_inset Caption

\begin_layout Standard
\begin_inset LatexCommand label
name "fig:Dumb-terminal"

\end_inset

Dumb terminal
\end_layout

\end_inset


\end_layout

\begin_layout Quote
The output of a length distribution plot in the default 'dumb terminal'
 to the terminal window.
 
\end_layout

\end_inset


\end_layout

\begin_layout Subsection
How to plot a histogram?
\begin_inset LatexCommand label
name "How-to-plot-histogram"

\end_inset


\end_layout

\begin_layout Standard
A generic histogram for a given value can be plotted with the biopiece 
\series bold
plot_histogram
\series default
 (Fig.\InsetSpace ~

\begin_inset LatexCommand ref
reference "fig:Histogram"

\end_inset

):
\end_layout

\begin_layout LyX-Code
...
 | plot_histogram --key=TISSUE --no_stream
\end_layout

\begin_layout Standard
(Figure missing)
\end_layout

\begin_layout Standard
\noindent
\align left
\begin_inset Float figure
wide false
sideways false
status open

\begin_layout Standard
\noindent
\align center
\begin_inset Graphics
	filename histogram.png
	lyxscale 70
	width 12cm

\end_inset


\end_layout

\begin_layout Standard
\begin_inset Caption

\begin_layout Standard
\begin_inset LatexCommand label
name "fig:Histogram"

\end_inset

Histogram
\end_layout

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Subsection
How to plot a length distribution?
\begin_inset LatexCommand label
name "sub:How-to-plot-lendist"

\end_inset


\end_layout

\begin_layout Standard
Plotting of length distributions, weather sequence lengths, patterns lengths,
 hit lengths, 
\emph on
etc.

\emph default
 is a really handy thing and can be done with the the biopiece 
\series bold
plot_lendist
\series default
.
 If you have a file with FASTA entries and want to plot the length distribution
 you do it like this:
\end_layout

\begin_layout LyX-Code
read_fasta --data_in=<file> | length_seq
\end_layout

\begin_layout LyX-Code
| plot_lendist --key=SEQ_LEN --no_stream
\end_layout

\begin_layout Standard
The result will be written to the default dumb terminal and will look like
 Fig.\InsetSpace ~

\begin_inset LatexCommand ref
reference "fig:Dumb-terminal"

\end_inset

.
\end_layout

\begin_layout Standard
If you instead want the result in postscript format you can do:
\end_layout

\begin_layout LyX-Code
...
 | plot_lendist --key=SEQ_LEN --terminal=post --result_out=file.ps
\end_layout

\begin_layout Standard
That will generate the plot and save it to file, but not interrupt the data
 stream which can then be used in further analysis.
 You can also save the plot implicetly using '>', however, it is then important
 to terminate the stream with the 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

no_stream switch:
\end_layout

\begin_layout LyX-Code
...
 | plot_lendist --key=SEQ_LEN --terminal=post --no_stream > file.ps
\end_layout

\begin_layout Standard
The resulting plot can be seen in Fig.\InsetSpace ~

\begin_inset LatexCommand ref
reference "fig:Length-distribution"

\end_inset

.
\end_layout

\begin_layout Standard
\begin_inset Float figure
wide false
sideways false
status open

\begin_layout Standard

\end_layout

\begin_layout Standard
\noindent
\align center
\begin_inset Graphics
	filename lendist.ps
	lyxscale 50
	width 12cm

\end_inset


\end_layout

\begin_layout Standard
\begin_inset Caption

\begin_layout Standard
\begin_inset LatexCommand label
name "fig:Length-distribution"

\end_inset

Length distribution
\end_layout

\end_inset


\end_layout

\begin_layout Quote
Length distribution of 630 piRNA like RNAs.
\end_layout

\end_inset


\end_layout

\begin_layout Subsection
How to plot a chromosome distribution?
\begin_inset LatexCommand label
name "sub:How-to-plot-chrdist"

\end_inset


\end_layout

\begin_layout Standard
If you have the result of a sequence search against a multi chromosome genome,
 it is very practical to be able to plot the distribution of search hits
 on the different chromosomes.
 This can be done with 
\series bold
plot_chrdist
\series default
:
\end_layout

\begin_layout LyX-Code
read_fasta --data_in=<file> | blat_genome | plot_chrdist --no_stream
\end_layout

\begin_layout Standard
The above example will result in a crude plot using the 'dumb' terminal,
 and if you want to mess around with the results from the BLAT search you
 probably want to save the result to file first (see\InsetSpace ~

\begin_inset LatexCommand ref
reference "sub:How-to-write-PSL"

\end_inset

).
 To plot the chromosome distribution from the saved search result you can
 do:
\end_layout

\begin_layout LyX-Code
read_bed --data_in=file.bed | plot_chrdist --terminal=post --result_out=plot.ps
\end_layout

\begin_layout Standard
That will result in the output show in Fig.\InsetSpace ~

\begin_inset LatexCommand ref
reference "fig:Chromosome-distribution"

\end_inset

.
\end_layout

\begin_layout Standard
\begin_inset Float figure
wide false
sideways false
status open

\begin_layout Standard

\end_layout

\begin_layout Standard
\noindent
\align center
\begin_inset Graphics
	filename chrdist.ps
	lyxscale 50
	width 12cm
	rotateAngle 90

\end_inset


\end_layout

\begin_layout Standard
\begin_inset Caption

\begin_layout Standard
\begin_inset LatexCommand label
name "fig:Chromosome-distribution"

\end_inset

Chromosome distribution
\end_layout

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Subsection
How to generate a dotplot?
\begin_inset LatexCommand label
name "sub:How-to-generate-dotplot"

\end_inset


\end_layout

\begin_layout Standard
A dotplot is a powerful way to get an overview of the size and location
 of sequence insertions, deletions, and duplications between two sequences.
 Generating a dotplot with biopieces is a two step process where you initially
 find all matches between two sequences using the tool 
\series bold
match_seq
\series default
 (see\InsetSpace ~

\begin_inset LatexCommand ref
reference "sub:How-to-find-matches"

\end_inset

) and plot the resulting matches with 
\series bold
plot_matches
\series default
.
 Matching and plotting two 
\emph on
Helicobacter pylori
\emph default
 genomes (1.7Mbp) takes around 10 seconds:
\end_layout

\begin_layout LyX-Code
...
 | match_seq | plot_matches --terminal=post --result_out=plot.ps
\end_layout

\begin_layout Standard
The resulting dotplot is in Fig.\InsetSpace ~

\begin_inset LatexCommand ref
reference "fig:Dotplot"

\end_inset

.
\end_layout

\begin_layout Standard
\begin_inset Float figure
wide false
sideways false
status open

\begin_layout Standard
\noindent
\align center
\begin_inset Graphics
	filename dotplot.ps
	lyxscale 50
	width 12cm

\end_inset


\end_layout

\begin_layout Standard
\begin_inset Caption

\begin_layout Standard
\begin_inset LatexCommand label
name "fig:Dotplot"

\end_inset

Dotplot
\end_layout

\end_inset


\end_layout

\begin_layout Quote
Forward matches are displayed in green while reverse matches are displayed
 in red.
\end_layout

\end_inset


\end_layout

\begin_layout Subsection
How to plot a sequence logo?
\end_layout

\begin_layout Standard
Sequence logos can be generate with 
\series bold
plot_seqlogo
\series default
.
 The sequnce type is determined automagically and an entropy scale of 2
 bits and 4 bits is used for nucleotide and peptide sequences, respectively
\begin_inset Foot
status collapsed

\begin_layout Standard
\begin_inset LatexCommand htmlurl
target "http://www.ccrnp.ncifcrf.gov/~toms/paper/hawaii/latex/node5.html"

\end_inset


\end_layout

\end_inset

.
\end_layout

\begin_layout LyX-Code
...
 | plot_seqlogo --no_stream --result_out=seqlogo.svg
\end_layout

\begin_layout Standard
An example of a sequence logo can be seen in Fig.\InsetSpace ~

\begin_inset LatexCommand ref
reference "fig:Sequence-logo"

\end_inset

.
\end_layout

\begin_layout Standard
\begin_inset Float figure
wide false
sideways false
status open

\begin_layout Standard
\noindent
\align center
\begin_inset Graphics
	filename seqlogo.png
	lyxscale 50
	width 12cm

\end_inset


\end_layout

\begin_layout Standard
\begin_inset Caption

\begin_layout Standard
\begin_inset LatexCommand label
name "fig:Sequence-logo"

\end_inset

Sequence logo
\end_layout

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Subsection
How to plot a karyogram?
\begin_inset LatexCommand label
name "sub:How-to-plot-karyogram"

\end_inset


\end_layout

\begin_layout Standard
To plot search hits on genomes use 
\series bold
plot_karyogram
\series default
, which will output a nice karyogram in SVG graphics:
\end_layout

\begin_layout LyX-Code
...
 | plot_karyogram --result_out=karyogram.svg
\end_layout

\begin_layout Standard
The banding data is taken from the UCSC genome browser database and currently
 only Human and Mouse is supported.
 Fig.\InsetSpace ~

\begin_inset LatexCommand ref
reference "fig:Karyogram"

\end_inset

 shows the distribution of piRNA like RNAs matched to the Human genome.
\end_layout

\begin_layout Standard
\begin_inset Float figure
wide false
sideways false
status open

\begin_layout Standard
\noindent
\align center
\begin_inset Graphics
	filename karyogram.png
	lyxscale 35
	width 12cm

\end_inset


\end_layout

\begin_layout Standard
\begin_inset Caption

\begin_layout Standard
\begin_inset LatexCommand label
name "fig:Karyogram"

\end_inset

Karyogram
\end_layout

\end_inset


\end_layout

\begin_layout Quote
Hits from a search of piRNA like RNAs in the Human genome is displayed as
 short horizontal bars.
\end_layout

\end_inset


\end_layout

\begin_layout Section
Uploading Results
\end_layout

\begin_layout Subsection
How do I display my results in the UCSC Genome Browser?
\end_layout

\begin_layout Standard
Results from the list of biopieces below can be uploaded directly to a local
 mirror of the UCSC Genome Browser using the biopiece 
\series bold
upload_to_ucsc
\series default
:
\end_layout

\begin_layout Itemize
patscan_seq 
\begin_inset LatexCommand eqref
reference "sub:How-to-use-patscan"

\end_inset


\end_layout

\begin_layout Itemize
blat_seq 
\begin_inset LatexCommand eqref
reference "sub:How-to-use-BLAT"

\end_inset


\end_layout

\begin_layout Itemize
blast_seq 
\begin_inset LatexCommand eqref
reference "sub:How-to-use-BLAST"

\end_inset


\end_layout

\begin_layout Itemize
vmatch_seq 
\begin_inset LatexCommand eqref
reference "sub:How-to-use-Vmatch"

\end_inset


\end_layout

\begin_layout Standard
The syntax for uploading data the most simple way requires two mandatory
 switches: 
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

database, which is the UCSC database name (such as hg18, mm9, etc.) and
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

table which should be the users initials followed by an underscore and a
 short description of the data:
\end_layout

\begin_layout LyX-Code
...
 | upload_to_ucsc --database=hg18 --table=mah_snoRNAs
\end_layout

\begin_layout Standard
The 
\series bold
upload_to_ucsc
\series default
 biopiece modifies the users ~/ucsc/my_tracks.ra file automagically (a backup
 is created with the name ~/ucsc/my_tracks.ra~) with default values that
 can be overridden using the following switches:
\end_layout

\begin_layout Itemize
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

short_label - Short label for track - Default=database->table
\end_layout

\begin_layout Itemize
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

long_label - Long label for track - Default=database->table
\end_layout

\begin_layout Itemize
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

group - Track group name - Default=<user name as defined in env>
\end_layout

\begin_layout Itemize
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

priority - Track display priority - Default=1
\end_layout

\begin_layout Itemize
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

color - Track color - Default=147,73,42
\end_layout

\begin_layout Itemize
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

chunk_size - Chunks for loading - Default=10000000
\end_layout

\begin_layout Itemize
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

visibility - Track visibility - Default=pack
\end_layout

\begin_layout Standard
Also, data in BED or PSL format can be uploaded with 
\series bold
upload_to_ucsc
\series default
 as long as these reference to genomes and chromosomes existing in the UCSC
 Genome Browser:
\end_layout

\begin_layout LyX-Code
read_bed --data_in=<bed file> | upload_to_ucsc ...
\end_layout

\begin_layout LyX-Code

\end_layout

\begin_layout LyX-Code
read_psl --data_in=<psl file> | upload_to_ucsc ...
\end_layout

\begin_layout Section
Power Scripting
\end_layout

\begin_layout Standard
It is possible to do commandline scripting of biopiece records using Perl.
 Because a biopiece record essentially is a hash structure, you can pass
 records to 
\series bold
bioscript
\series default
 command, which is a wrapper around the Perl executable that allows direct
 manipulations of the records using the power of Perl.
\end_layout

\begin_layout Standard
In the below example we replace in all records the value to the CHR key
 with a forthrunning number:
\end_layout

\begin_layout LyX-Code
...
 | bioscript 'while($r=get_record(
\backslash
*STDIN)){$r->{CHR}=$i++; put_record($r)}'
\end_layout

\begin_layout Standard
Something more useful would probably be to create custom FASTA headers.
 E.g.
 if we read in a BED file, lookup the genomic sequence, create a custom
 FASTA header with 
\series bold
bioscript
\series default
 and output FASTA entries:
\end_layout

\begin_layout LyX-Code
...
 | bioscript 'while($r=get_record(
\backslash
*STDIN)){$r->{SEQ_NAME}= //
\end_layout

\begin_layout LyX-Code
join("_",$r->{CHR},$r->{CHR_BEG},$r->{CHR_END}); put_record($r)}'
\end_layout

\begin_layout Standard
And the output:
\end_layout

\begin_layout LyX-Code
>chr2L_21567527_21567550 
\end_layout

\begin_layout LyX-Code
taccaaacggatgcctcagacatc
\end_layout

\begin_layout LyX-Code
>chr2L_693380_693403 
\end_layout

\begin_layout LyX-Code
taccaaacggatgcctcagacatc 
\end_layout

\begin_layout LyX-Code
>chr2L_13859534_13859557 
\end_layout

\begin_layout LyX-Code
taccaaacggatgcctcagacatc 
\end_layout

\begin_layout LyX-Code
>chr2L_9005090_9005113 
\end_layout

\begin_layout LyX-Code
taccaaacggatgcctcagacatc 
\end_layout

\begin_layout LyX-Code
>chr2L_2106825_2106848 
\end_layout

\begin_layout LyX-Code
taccaaacggatgcctcagacatc 
\end_layout

\begin_layout LyX-Code
>chr2L_14649031_14649054 
\end_layout

\begin_layout LyX-Code
taccaaacggatgcctcagacatc
\end_layout

\begin_layout Section
Trouble shooting
\end_layout

\begin_layout Standard
Shoot the messenger!
\end_layout

\begin_layout Section
\start_of_appendix
Keys
\begin_inset LatexCommand label
name "sec:Keys"

\end_inset


\end_layout

\begin_layout Standard
HIT
\end_layout

\begin_layout Standard
HIT_BEG
\end_layout

\begin_layout Standard
HIT_END
\end_layout

\begin_layout Standard
HIT_LEN
\end_layout

\begin_layout Standard
HIT_NAME
\end_layout

\begin_layout Standard
PATTERN
\end_layout

\begin_layout Section
Switches
\begin_inset LatexCommand label
name "sec:Switches"

\end_inset


\end_layout

\begin_layout Standard
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

stream_in
\end_layout

\begin_layout Standard
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

stream_out
\end_layout

\begin_layout Standard
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

no_stream
\end_layout

\begin_layout Standard
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

data_in
\end_layout

\begin_layout Standard
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

result_out
\end_layout

\begin_layout Standard
\begin_inset ERT
status collapsed

\begin_layout Standard

-
\backslash
/-
\end_layout

\end_inset

num
\end_layout

\begin_layout Section
scan_for_matches README
\begin_inset LatexCommand label
name "sec:scan_for_matches-README"

\end_inset


\end_layout

\begin_layout LyX-Code
                          scan_for_matches:
\end_layout

\begin_layout LyX-Code
    A Program to Scan Nucleotide or Protein Sequences for Matching Patterns
\end_layout

\begin_layout LyX-Code
                        Ross Overbeek
\end_layout

\begin_layout LyX-Code
                        MCS
\end_layout

\begin_layout LyX-Code
                        Argonne National Laboratory
\end_layout

\begin_layout LyX-Code
                        Argonne, IL 60439
\end_layout

\begin_layout LyX-Code
                        USA
\end_layout

\begin_layout LyX-Code
Scan_for_matches is a utility that we have written to search for
\end_layout

\begin_layout LyX-Code
patterns in DNA and protein sequences.
  I wrote most of the code,
\end_layout

\begin_layout LyX-Code
although David Joerg and Morgan Price wrote sections of an
\end_layout

\begin_layout LyX-Code
earlier version.
  The whole notion of pattern matching has a rich
\end_layout

\begin_layout LyX-Code
history, and we borrowed liberally from many sources.
  However, it is
\end_layout

\begin_layout LyX-Code
worth noting that we were strongly influenced by the elegant tools
\end_layout

\begin_layout LyX-Code
developed and distributed by David Searls.
  My intent is to make the
\end_layout

\begin_layout LyX-Code
existing tool available to anyone in the research community that might
\end_layout

\begin_layout LyX-Code
find it useful.
  I will continue to try to fix bugs and make suggested
\end_layout

\begin_layout LyX-Code
enhancements, at least until I feel that a superior tool exists.
\end_layout

\begin_layout LyX-Code
Hence, I would appreciate it if all bug reports and suggestions are
\end_layout

\begin_layout LyX-Code
directed to me at Overbeek@mcs.anl.gov.
  
\end_layout

\begin_layout LyX-Code
I will try to log all bug fixes and report them to users that send me
\end_layout

\begin_layout LyX-Code
their email addresses.
  I do not require that you give me your name
\end_layout

\begin_layout LyX-Code
and address.
  However, if you do give it to me, I will try to notify
\end_layout

\begin_layout LyX-Code
you of serious problems as they are discovered.
\end_layout

\begin_layout LyX-Code
Getting Started:
\end_layout

\begin_layout LyX-Code
    The distribution should contain at least the following programs:
\end_layout

\begin_layout LyX-Code
                README                  -     This document
\end_layout

\begin_layout LyX-Code
                ggpunit.c               -     One of the two source files
\end_layout

\begin_layout LyX-Code
                scan_for_matches.c      -     The second source file
\end_layout

\begin_layout LyX-Code
                
\end_layout

\begin_layout LyX-Code
                run_tests               -     A perl script to test things
\end_layout

\begin_layout LyX-Code
                show_hits               -     A handy perl script
\end_layout

\begin_layout LyX-Code
                test_dna_input          -     Test sequences for DNA
\end_layout

\begin_layout LyX-Code
                test_dna_patterns       -     Test patterns for DNA scan
\end_layout

\begin_layout LyX-Code
                test_output             -     Desired output from test
\end_layout

\begin_layout LyX-Code
                test_prot_input         -     Test protein sequences
\end_layout

\begin_layout LyX-Code
                test_prot_patterns      -     Test patterns for proteins
\end_layout

\begin_layout LyX-Code
                testit                  -     a perl script used for test
\end_layout

\begin_layout LyX-Code
    Only the first three files are required.
  The others are useful,
\end_layout

\begin_layout LyX-Code
    but only if you have Perl installed on your system.
  If you do
\end_layout

\begin_layout LyX-Code
    have Perl, I suggest that you type
\end_layout

\begin_layout LyX-Code
        
\end_layout

\begin_layout LyX-Code
                which perl
\end_layout

\begin_layout LyX-Code
    to find out where it installed.
  On my system, I get the following
\end_layout

\begin_layout LyX-Code
    response:
\end_layout

\begin_layout LyX-Code
        
\end_layout

\begin_layout LyX-Code
                clone% which perl
\end_layout

\begin_layout LyX-Code
                /usr/local/bin/perl
\end_layout

\begin_layout LyX-Code
    indicating that Perl is installed in /usr/local/bin.
  Anyway, once
\end_layout

\begin_layout LyX-Code
    you know where it is installed, edit the first line of files 
\end_layout

\begin_layout LyX-Code
        testit
\end_layout

\begin_layout LyX-Code
        show_hits
\end_layout

\begin_layout LyX-Code
    replacing /usr/local/bin/perl with the appropriate location.
  I
\end_layout

\begin_layout LyX-Code
    will assume that you can do this, although it is not critical (it
\end_layout

\begin_layout LyX-Code
    is needed only to test the installation and to use the "show_hits"
\end_layout

\begin_layout LyX-Code
    utility).
  Perl is not required to actually install and run
\end_layout

\begin_layout LyX-Code
    scan_for_matches.
 
\end_layout

\begin_layout LyX-Code
    If you do not have Perl, I suggest you get it and install it (it
\end_layout

\begin_layout LyX-Code
    is a wonderful utility).
  Information about Perl and how to get it
\end_layout

\begin_layout LyX-Code
    can be found in the book "Programming Perl" by Larry Wall and
\end_layout

\begin_layout LyX-Code
    Randall L.
 Schwartz, published by O'Reilly & Associates, Inc.
\end_layout

\begin_layout LyX-Code
    To get started, you will need to compile the program.
   I do this
\end_layout

\begin_layout LyX-Code
    using 
\end_layout

\begin_layout LyX-Code
        gcc -O -o scan_for_matches  ggpunit.c scan_for_matches.c
\end_layout

\begin_layout LyX-Code
    If you do not use GNU C, use 
\end_layout

\begin_layout LyX-Code
        cc -O -DCC -o scan_for_matches  ggpunit.c scan_for_matches.c
\end_layout

\begin_layout LyX-Code
    which works on my Sun.
  
\end_layout

\begin_layout LyX-Code
    Once you have compiled scan_for_matches, you can verify that it
\end_layout

\begin_layout LyX-Code
    works with
\end_layout

\begin_layout LyX-Code
        clone% run_tests tmp
\end_layout

\begin_layout LyX-Code
        clone% diff tmp test_output
\end_layout

\begin_layout LyX-Code
    You may get a few strange lines of the sort
\end_layout

\begin_layout LyX-Code
        clone% run_tests tmp
\end_layout

\begin_layout LyX-Code
        rm: tmp: No such file or directory
\end_layout

\begin_layout LyX-Code
        clone% diff tmp test_output
\end_layout

\begin_layout LyX-Code
    These should cause no concern.
  However, if the "diff" shows that
\end_layout

\begin_layout LyX-Code
    tmp and test_output are different, contact me (you have a
\end_layout

\begin_layout LyX-Code
    problem).
 
\end_layout

\begin_layout LyX-Code
    You should now be able to use scan_for_matches by following the
\end_layout

\begin_layout LyX-Code
    instructions given below (which is all the normal user should have
\end_layout

\begin_layout LyX-Code
    to understand, once things are installed properly).
\end_layout

\begin_layout LyX-Code
 ==============================================================
\end_layout

\begin_layout LyX-Code
How to run scan_for_matches:
\end_layout

\begin_layout LyX-Code
    To run the program, you type need to create two files
\end_layout

\begin_layout LyX-Code
    1.
  the first file contains the pattern you wish to scan for; I'll
\end_layout

\begin_layout LyX-Code
        call this file pat_file in what follows (but any name is ok)
\end_layout

\begin_layout LyX-Code
    2.
  the second file contains a set of sequences to scan.
  These
\end_layout

\begin_layout LyX-Code
        should be in "fasta format".
  Just look at the contents of
\end_layout

\begin_layout LyX-Code
        test_dna_input to see examples of this format.
  Basically,
\end_layout

\begin_layout LyX-Code
        each sequence begins with a line of the form
\end_layout

\begin_layout LyX-Code
           >sequence_id
\end_layout

\begin_layout LyX-Code
        and is followed by one or more lines containing the sequence.
\end_layout

\begin_layout LyX-Code
    Once these files have been created, you just use
\end_layout

\begin_layout LyX-Code
        scan_for_matches pat_file < input_file
\end_layout

\begin_layout LyX-Code
    to scan all of the input sequences for the given pattern.
  As an
\end_layout

\begin_layout LyX-Code
    example, suppose that pat_file contains a single line of the form
\end_layout

\begin_layout LyX-Code
                p1=4...7 3...8 ~p1
\end_layout

\begin_layout LyX-Code
    Then,
\end_layout

\begin_layout LyX-Code
                scan_for_matches pat_file < test_dna_input
\end_layout

\begin_layout LyX-Code
    should produce two "hits".
  When I run this on my machine, I get
\end_layout

\begin_layout LyX-Code
        clone% scan_for_matches pat_file < test_dna_input
\end_layout

\begin_layout LyX-Code
        >tst1:[6,27]
\end_layout

\begin_layout LyX-Code
        cguaacc ggttaacc gguuacg 
\end_layout

\begin_layout LyX-Code
        >tst2:[6,27]
\end_layout

\begin_layout LyX-Code
        CGUAACC GGTTAACC GGUUACG 
\end_layout

\begin_layout LyX-Code
        clone% 
\end_layout

\begin_layout LyX-Code
Simple Patterns Built by Matching Ranges and Reverse Complements
\end_layout

\begin_layout LyX-Code
    Let me first explain this simple pattern:
\end_layout

\begin_layout LyX-Code
                
\end_layout

\begin_layout LyX-Code
                p1=4...7 3...8 ~p1
\end_layout

\begin_layout LyX-Code
    The pattern consists of three "pattern units" separated by spaces.
\end_layout

\begin_layout LyX-Code
    The first pattern unit is
\end_layout

\begin_layout LyX-Code
                p1=4...7
\end_layout

\begin_layout LyX-Code
    which means "match 4 to 7 characters and call them p1".
  The
\end_layout

\begin_layout LyX-Code
    second pattern unit is
\end_layout

\begin_layout LyX-Code
                3...8
\end_layout

\begin_layout LyX-Code
    which means "then match 3 to 8 characters".
  The last pattern unit
\end_layout

\begin_layout LyX-Code
    is 
\end_layout

\begin_layout LyX-Code
                ~p1
\end_layout

\begin_layout LyX-Code
    which means "match the reverse complement of p1".
  The first
\end_layout

\begin_layout LyX-Code
    reported hit is shown as
\end_layout

\begin_layout LyX-Code
        >tst1:[6,27]
\end_layout

\begin_layout LyX-Code
        cguaacc ggttaacc gguuacg 
\end_layout

\begin_layout LyX-Code
    which states that characters 6 through 27 of sequence tst1 were
\end_layout

\begin_layout LyX-Code
    matched.
  "cguaac" matched the first pattern unit, "ggttaacc" the
\end_layout

\begin_layout LyX-Code
    second, and "gguuacg" the third.
  This is an example of a common
\end_layout

\begin_layout LyX-Code
    type of pattern used to search for sections of DNA or RNA that
\end_layout

\begin_layout LyX-Code
    would fold into a hairpin loop.
\end_layout

\begin_layout LyX-Code
Searching Both Strands
\end_layout

\begin_layout LyX-Code
    Now for a short aside: scan_for_matches only searched the
\end_layout

\begin_layout LyX-Code
    sequences in the input file; it did not search the opposite
\end_layout

\begin_layout LyX-Code
    strand.
  With a pattern of the sort we just used, there is not
\end_layout

\begin_layout LyX-Code
    need o search the opposite strand.
  However, it is normally the
\end_layout

\begin_layout LyX-Code
    case that you will wish to search both the sequence and the
\end_layout

\begin_layout LyX-Code
    opposite strand (i.e., the reverse complement of the sequence).
\end_layout

\begin_layout LyX-Code
    To do that, you would just use the "-c" command line.
  For example,
\end_layout

\begin_layout LyX-Code
        scan_for_matches -c pat_file < test_dna_input
\end_layout

\begin_layout LyX-Code
    Hits on the opposite strand will show a beginning location greater
\end_layout

\begin_layout LyX-Code
    than te end location of the match.
\end_layout

\begin_layout LyX-Code
Defining Pairing Rules and Allowing Mismatches, Insertions, and Deletions
\end_layout

\begin_layout LyX-Code
    Let us stop now and ask "What additional features would one need to
\end_layout

\begin_layout LyX-Code
    really find the kinds of loop structures that characterize tRNAs,
\end_layout

\begin_layout LyX-Code
    rRNAs, and so forth?"  I can immediately think of two:
\end_layout

\begin_layout LyX-Code
        a) you will need to be able to allow non-standard pairings
\end_layout

\begin_layout LyX-Code
           (those other than G-C and A-U), and
\end_layout

\begin_layout LyX-Code
        b) you will need to be able to tolerate some number of
\end_layout

\begin_layout LyX-Code
           mismatches and bulges.
\end_layout

\begin_layout LyX-Code
        
\end_layout

\begin_layout LyX-Code
    Let me first show you how to handle non-standard "rules for
\end_layout

\begin_layout LyX-Code
    pairing in reverse complements".
  Consider the following pattern,
\end_layout

\begin_layout LyX-Code
    which I show as two line (you may use as many lines as you like in
\end_layout

\begin_layout LyX-Code
    forming a pattern, although you can only break a pattern at points
\end_layout

\begin_layout LyX-Code
    where space would be legal):
\end_layout

\begin_layout LyX-Code
            r1={au,ua,gc,cg,gu,ug,ga,ag} 
\end_layout

\begin_layout LyX-Code
            p1=2...3 0...4 p2=2...5 1...5 r1~p2 0...4 ~p1        
\end_layout

\begin_layout LyX-Code
    The first "pattern unit" does not actually match anything; rather,
\end_layout

\begin_layout LyX-Code
    it defines a "pairing rule" in which standard pairings are
\end_layout

\begin_layout LyX-Code
    allowed, as well as G-A and A-G (in case you wondered, Us and Ts
\end_layout

\begin_layout LyX-Code
    and upper and lower case can be used interchangably; for example
\end_layout

\begin_layout LyX-Code
    r1={AT,UA,gc,cg} could be used to define the "standard rule" for
\end_layout

\begin_layout LyX-Code
    pairings).
  The second line consists of six pattern units which
\end_layout

\begin_layout LyX-Code
    may be interpreted as follows:
\end_layout

\begin_layout LyX-Code
            p1=2...3     match 2 or 3 characters (call it p1)
\end_layout

\begin_layout LyX-Code
            0...4        match 0 to 4 characters
\end_layout

\begin_layout LyX-Code
            p2=2...5     match 2 to 5 characters (call it p2)
\end_layout

\begin_layout LyX-Code
            1...5        match 1 to 5 characters
\end_layout

\begin_layout LyX-Code
            r1~p2        match the reverse complement of p2,
\end_layout

\begin_layout LyX-Code
                            allowing G-A and A-G pairs
\end_layout

\begin_layout LyX-Code
            0...4        match 0 to 4 characters        
\end_layout

\begin_layout LyX-Code
            ~p1          match the reverse complement of p1
\end_layout

\begin_layout LyX-Code
                            allowing only G-C, C-G, A-T, and T-A pairs
\end_layout

\begin_layout LyX-Code
    Thus, r1~p2 means "match the reverse complement of p2 using rule r1".
\end_layout

\begin_layout LyX-Code
    Now let us consider the issue of tolerating mismatches and bulges.
\end_layout

\begin_layout LyX-Code
    You may add a "qualifier" to the pattern unit that gives the
\end_layout

\begin_layout LyX-Code
    tolerable number of "mismatches, deletions, and insertions".
\end_layout

\begin_layout LyX-Code
    Thus,
\end_layout

\begin_layout LyX-Code
                p1=10...10 3...8 ~p1[1,2,1]
\end_layout

\begin_layout LyX-Code
    means that the third pattern unit must match 10 characters,
\end_layout

\begin_layout LyX-Code
    allowing one "mismatch" (a pairing other than G-C, C-G, A-T, or
\end_layout

\begin_layout LyX-Code
    T-A), two deletions (a deletion is a character that occurs in p1,
\end_layout

\begin_layout LyX-Code
    but has been "deleted" from the string matched by ~p1), and one
\end_layout

\begin_layout LyX-Code
    insertion (an "insertion" is a character that occurs in the string
\end_layout

\begin_layout LyX-Code
    matched by ~p1, but not for which no corresponding character
\end_layout

\begin_layout LyX-Code
    occurs in p1).
  In this case, the pattern would match
\end_layout

\begin_layout LyX-Code
              ACGTACGTAC GGGGGGGG GCGTTACCT
\end_layout

\begin_layout LyX-Code
    which is, you must admit, a fairly weak loop.
  It is common to
\end_layout

\begin_layout LyX-Code
    allow mismatches, but you will find yourself using insertions and
\end_layout

\begin_layout LyX-Code
    deletions much more rarely.
  In any event, you should note that
\end_layout

\begin_layout LyX-Code
    allowing mismatches, insertions, and deletions does force the
\end_layout

\begin_layout LyX-Code
    program to try many additional possible pairings, so it does slow
\end_layout

\begin_layout LyX-Code
    things down a bit.
\end_layout

\begin_layout LyX-Code
How Patterns Are Matched
\end_layout

\begin_layout LyX-Code
    Now is as good a time as any to discuss the basic flow of control
\end_layout

\begin_layout LyX-Code
    when matching patterns.
  Recall that a "pattern" is a sequence of
\end_layout

\begin_layout LyX-Code
    "pattern units".
  Suppose that the pattern units were
\end_layout

\begin_layout LyX-Code
        u1 u2 u3 u4 ...
 un
\end_layout

\begin_layout LyX-Code
    The scan of a sequence S begins by setting the current position
\end_layout

\begin_layout LyX-Code
    to 1.
  Then, an attempt is made to match u1 starting at the
\end_layout

\begin_layout LyX-Code
    current position.
  Each attempt to match a pattern unit can
\end_layout

\begin_layout LyX-Code
    succeed or fail.
  If it succeeds, then an attempt is made to match
\end_layout

\begin_layout LyX-Code
    the next unit.
  If it fails, then an attempt is made to find an
\end_layout

\begin_layout LyX-Code
    alternative match for the immediately preceding pattern unit.
  If
\end_layout

\begin_layout LyX-Code
    this succeeds, then we proceed forward again to the next unit.
  If
\end_layout

\begin_layout LyX-Code
    it fails we go back to the preceding unit.
  This process is called
\end_layout

\begin_layout LyX-Code
    "backtracking".
  If there are no previous units, then the current
\end_layout

\begin_layout LyX-Code
    position is incremented by one, and everything starts again.
  This
\end_layout

\begin_layout LyX-Code
    proceeds until either the current position goes past the end of
\end_layout

\begin_layout LyX-Code
    the sequence or all of the pattern units succeed.
  On success,
\end_layout

\begin_layout LyX-Code
    scan_for_matches reports the "hit", the current position is set
\end_layout

\begin_layout LyX-Code
    just past the hit, and an attempt is made to find another hit.
\end_layout

\begin_layout LyX-Code
    If you wish to limit the scan to simply finding a maximum of, say,
\end_layout

\begin_layout LyX-Code
    10 hits, you can use the -n option (-n 10 would set the limit to
\end_layout

\begin_layout LyX-Code
    10 reported hits).
  For example,
\end_layout

\begin_layout LyX-Code
        scan_for_matches -c -n 1 pat_file < test_dna_input
\end_layout

\begin_layout LyX-Code
    would search for just the first hit (and would stop searching the
\end_layout

\begin_layout LyX-Code
    current sequences or any that follow in the input file).
\end_layout

\begin_layout LyX-Code
Searching for repeats:
\end_layout

\begin_layout LyX-Code
    In the last section, I discussed almost all of the details
\end_layout

\begin_layout LyX-Code
    required to allow you to look for repeats.
  Consider the following
\end_layout

\begin_layout LyX-Code
    set of patterns:
\end_layout

\begin_layout LyX-Code
        p1=6...6 3...8 p1   (find exact 6 character repeat separated
\end_layout

\begin_layout LyX-Code
                             by to 8 characters)
\end_layout

\begin_layout LyX-Code
        p1=6...6 3..8 p1[1,0,0]   (allow one mismatch)
\end_layout

\begin_layout LyX-Code
        p1=3...3 p1[1,0,0] p1[1,0,0] p1[1,0,0]  
\end_layout

\begin_layout LyX-Code
                            (match 12 characters that are the remains
\end_layout

\begin_layout LyX-Code
                             of a 3-character sequence occurring 4 times)
\end_layout

\begin_layout LyX-Code
                
\end_layout

\begin_layout LyX-Code
        p1=4...8 0...3 p2=6...8 p1 0...3 p2
\end_layout

\begin_layout LyX-Code
                            (This would match things like
\end_layout

\begin_layout LyX-Code
                                ATCT G TCTTT ATCT TG TCTTT
\end_layout

\begin_layout LyX-Code
                            )
\end_layout

\begin_layout LyX-Code
Searching for particular sequences:
\end_layout

\begin_layout LyX-Code
    Occasionally, one wishes to match a specific, known sequence.
\end_layout

\begin_layout LyX-Code
    In such a case, you can just give the sequence (along with an
\end_layout

\begin_layout LyX-Code
    optional statement of the allowable mismatches, insertions, and
\end_layout

\begin_layout LyX-Code
    deletions).
  Thus,
\end_layout

\begin_layout LyX-Code
        p1=6...8 GAGA ~p1    (match a hairpin with GAGA as the loop)
\end_layout

\begin_layout LyX-Code
        RRRRYYYY             (match 4 purines followed by 4 pyrimidines)
\end_layout

\begin_layout LyX-Code
        TATAA[1,0,0]         (match TATAA, allowing 1 mismatch)
\end_layout

\begin_layout LyX-Code
        
\end_layout

\begin_layout LyX-Code
Matches against a "weight matrix":
\end_layout

\begin_layout LyX-Code
    I will conclude my examples of the types of pattern units
\end_layout

\begin_layout LyX-Code
    available for matching against nucleotide sequences by discussing a
\end_layout

\begin_layout LyX-Code
    crude implemetation of matching using a "weight matrix".
  While I
\end_layout

\begin_layout LyX-Code
    am less than overwhelmed with the syntax that I chose, I think that
\end_layout

\begin_layout LyX-Code
    the reader should be aware that I was thinking of generating
\end_layout

\begin_layout LyX-Code
    patterns containing such pattern units automatically from
\end_layout

\begin_layout LyX-Code
    alignments (and did not really plan on typing such things in by
\end_layout

\begin_layout LyX-Code
    hand very often).
  Anyway, suppose that you wanted to match a
\end_layout

\begin_layout LyX-Code
    sequence of eight characters.
  The "consensus" of these eight
\end_layout

\begin_layout LyX-Code
    characters is GRCACCGS, but the actual "frequencies of occurrence"
\end_layout

\begin_layout LyX-Code
    are given in the matrix below.
  Thus, the first character is an A
\end_layout

\begin_layout LyX-Code
    16% the time and a G 84% of the time.
  The second is an A 57% of
\end_layout

\begin_layout LyX-Code
    the time, a C 10% of the time, a G 29% of the time, and a T 4% of
\end_layout

\begin_layout LyX-Code
    the time.
  
\end_layout

\begin_layout LyX-Code
             C1     C2    C3    C4   C5    C6    C7    C8
\end_layout

\begin_layout LyX-Code
    
\end_layout

\begin_layout LyX-Code
       A     16     57     0    95    0    18     0     0
\end_layout

\begin_layout LyX-Code
       C      0     10    80     0  100    60     0    50
\end_layout

\begin_layout LyX-Code
       G     84     29     0     0    0    20   100    50
\end_layout

\begin_layout LyX-Code
       T      0      4    20     5    0     2     0     0   
\end_layout

\begin_layout LyX-Code
    
\end_layout

\begin_layout LyX-Code
    One could use the following pattern unit to search for inexact
\end_layout

\begin_layout LyX-Code
    matches related to such a "weight matrix":
\end_layout

\begin_layout LyX-Code
        {(16,0,84,0),(57,10,29,4),(0,80,0,20),(95,0,0,5),
\end_layout

\begin_layout LyX-Code
         (0,100,0,0),(18,60,20,2),(0,0,100,0),(0,50,50,0)} > 450
\end_layout

\begin_layout LyX-Code
    This pattern unit will attempt to match exactly eight characters.
\end_layout

\begin_layout LyX-Code
    For each character in the sequence, the entry in the corresponding
\end_layout

\begin_layout LyX-Code
    tuple is added to an accumulated sum.
  If the sum is greater than
\end_layout

\begin_layout LyX-Code
    450, the match succeeds; else it fails.
\end_layout

\begin_layout LyX-Code
    Recently, this feature was upgraded to allow ranges.
  Thus,
\end_layout

\begin_layout LyX-Code
  600 >  {(16,0,84,0),(57,10,29,4),(0,80,0,20),(95,0,0,5),
\end_layout

\begin_layout LyX-Code
         (0,100,0,0),(18,60,20,2),(0,0,100,0),(0,50,50,0)} > 450
\end_layout

\begin_layout LyX-Code
    will work, as well.
\end_layout

\begin_layout LyX-Code
Allowing Alternatives:
\end_layout

\begin_layout LyX-Code
    Very occasionally, you may wish to allow alternative pattern units
\end_layout

\begin_layout LyX-Code
    (i.e., "match either A or B").
  You can do this using something
\end_layout

\begin_layout LyX-Code
    like
\end_layout

\begin_layout LyX-Code
                ( GAGA | GCGCA)
\end_layout

\begin_layout LyX-Code
    which says "match either GAGA or GCGCA".
  You may take
\end_layout

\begin_layout LyX-Code
    alternatives of a list of pattern units, for example
\end_layout

\begin_layout LyX-Code
        (p1=3...3 3...8 ~p1 | p1=5...5 4...4 ~p1 GGG)
\end_layout

\begin_layout LyX-Code
    would match one of two sequences of pattern units.
  There is one
\end_layout

\begin_layout LyX-Code
    clumsy aspect of the syntax: to match a list of alternatives, you
\end_layout

\begin_layout LyX-Code
    need to fully the request.
  Thus,
\end_layout

\begin_layout LyX-Code
        (GAGA | (GCGCA | TTCGA))
\end_layout

\begin_layout LyX-Code
    would be needed to try the three alternatives.
\end_layout

\begin_layout LyX-Code
One Minor Extension
\end_layout

\begin_layout LyX-Code
    Sometimes a pattern will contain a sequence of distinct ranges,
\end_layout

\begin_layout LyX-Code
    and you might wish to limit the sum of the lengths of the matched
\end_layout

\begin_layout LyX-Code
    subsequences.
   For example, suppose that you basically wanted to
\end_layout

\begin_layout LyX-Code
    match something like
\end_layout

\begin_layout LyX-Code
    ARRYYTT p1=0...5 GCA[1,0,0] p2=1...6 ~p1 4...8 ~p2 p3=4...10 CCT
\end_layout

\begin_layout LyX-Code
    but that the sum of the lengths of p1, p2, and p3 must not exceed
\end_layout

\begin_layout LyX-Code
    eight characters.
  To do this, you could add 
\end_layout

\begin_layout LyX-Code
        length(p1+p2+p3) < 9
\end_layout

\begin_layout LyX-Code
    as the last pattern unit.
  It will just succeed or fail (but does
\end_layout

\begin_layout LyX-Code
    not actually match any characters in the sequence).
\end_layout

\begin_layout LyX-Code
    
\end_layout

\begin_layout LyX-Code
Matching Protein Sequences
\end_layout

\begin_layout LyX-Code
    Suppose that the input file contains protein sequences.
  In this
\end_layout

\begin_layout LyX-Code
    case, you must invoke scan_for_matches with the "-p" option.
  You
\end_layout

\begin_layout LyX-Code
    cannot use aspects of the language that relate directly to
\end_layout

\begin_layout LyX-Code
    nucleotide sequences (e.g., the -c command line option or pattern
\end_layout

\begin_layout LyX-Code
    constructs referring to the reverse complement of a previously
\end_layout

\begin_layout LyX-Code
    matched unit).
  
\end_layout

\begin_layout LyX-Code
    You also have two additional constructs that allow you to match
\end_layout

\begin_layout LyX-Code
    either "one of a set of amino acids" or "any amino acid other than
\end_layout

\begin_layout LyX-Code
    those a given set".
  For example,
\end_layout

\begin_layout LyX-Code
        p1=0...4 any(HQD) 1...3 notany(HK) p1
\end_layout

\begin_layout LyX-Code
    would successfully match a string like
\end_layout

\begin_layout LyX-Code
           YWV D AA C YWV
\end_layout

\begin_layout LyX-Code
Using the show_hits Utility
\end_layout

\begin_layout LyX-Code
    When viewing a large set of complex matches, you might find it
\end_layout

\begin_layout LyX-Code
    convenient to post-process the scan_for_matches output to get a
\end_layout

\begin_layout LyX-Code
    more readable version.
  We provide a simple post-processor called
\end_layout

\begin_layout LyX-Code
    "show_hits".
  To see its effect, just pipe the output of a
\end_layout

\begin_layout LyX-Code
    scan_for_matches into show_hits:
\end_layout

\begin_layout LyX-Code
     Normal Output:
\end_layout

\begin_layout LyX-Code
        clone% scan_for_matches -c pat_file < tmp
\end_layout

\begin_layout LyX-Code
        >tst1:[1,28]
\end_layout

\begin_layout LyX-Code
        gtacguaacc  ggttaac cgguuacgtac 
\end_layout

\begin_layout LyX-Code
        >tst1:[28,1]
\end_layout

\begin_layout LyX-Code
        gtacgtaacc  ggttaac cggttacgtac 
\end_layout

\begin_layout LyX-Code
        >tst2:[2,31]
\end_layout

\begin_layout LyX-Code
        CGTACGUAAC C GGTTAACC GGUUACGTACG 
\end_layout

\begin_layout LyX-Code
        >tst2:[31,2]
\end_layout

\begin_layout LyX-Code
        CGTACGTAAC C GGTTAACC GGTTACGTACG 
\end_layout

\begin_layout LyX-Code
        >tst3:[3,32]
\end_layout

\begin_layout LyX-Code
        gtacguaacc g gttaactt cgguuacgtac 
\end_layout

\begin_layout LyX-Code
        >tst3:[32,3]
\end_layout

\begin_layout LyX-Code
        gtacgtaacc g aagttaac cggttacgtac 
\end_layout

\begin_layout LyX-Code
     Piped Through show_hits:
\end_layout

\begin_layout LyX-Code
    
\end_layout

\begin_layout LyX-Code
        clone% scan_for_matches -c pat_file < tmp | show_hits
\end_layout

\begin_layout LyX-Code
        tst1:[1,28]:  gtacguaacc   ggttaac  cgguuacgtac
\end_layout

\begin_layout LyX-Code
        tst1:[28,1]:  gtacgtaacc   ggttaac  cggttacgtac
\end_layout

\begin_layout LyX-Code
        tst2:[2,31]:  CGTACGUAAC C GGTTAACC GGUUACGTACG
\end_layout

\begin_layout LyX-Code
        tst2:[31,2]:  CGTACGTAAC C GGTTAACC GGTTACGTACG
\end_layout

\begin_layout LyX-Code
        tst3:[3,32]:  gtacguaacc g gttaactt cgguuacgtac
\end_layout

\begin_layout LyX-Code
        tst3:[32,3]:  gtacgtaacc g aagttaac cggttacgtac
\end_layout

\begin_layout LyX-Code
        clone% 
\end_layout

\begin_layout LyX-Code
    Optionally, you can specify which of the "fields" in the matches
\end_layout

\begin_layout LyX-Code
    you wish to sort on, and show_hits will sort them.
  The field
\end_layout

\begin_layout LyX-Code
    numbers start with 0.
  So, you might get something like
\end_layout

\begin_layout LyX-Code
        clone% scan_for_matches -c pat_file < tmp | show_hits 2 1
\end_layout

\begin_layout LyX-Code
        tst2:[2,31]:  CGTACGUAAC C GGTTAACC GGUUACGTACG
\end_layout

\begin_layout LyX-Code
        tst2:[31,2]:  CGTACGTAAC C GGTTAACC GGTTACGTACG
\end_layout

\begin_layout LyX-Code
        tst3:[32,3]:  gtacgtaacc g aagttaac cggttacgtac
\end_layout

\begin_layout LyX-Code
        tst1:[1,28]:  gtacguaacc   ggttaac  cgguuacgtac
\end_layout

\begin_layout LyX-Code
        tst1:[28,1]:  gtacgtaacc   ggttaac  cggttacgtac
\end_layout

\begin_layout LyX-Code
        tst3:[3,32]:  gtacguaacc g gttaactt cgguuacgtac
\end_layout

\begin_layout LyX-Code
        clone% 
\end_layout

\begin_layout LyX-Code
    In this case, the hits have been sorted on fields 2 and 1 (that is,
\end_layout

\begin_layout LyX-Code
    the third and second matched subfields).
\end_layout

\begin_layout LyX-Code
    show_hits is just one possible little post-processor, and you
\end_layout

\begin_layout LyX-Code
    might well wish to write a customized one for yourself.
\end_layout

\begin_layout LyX-Code
Reducing the Cost of a Search
\end_layout

\begin_layout LyX-Code
    The scan_for_matches utility uses a fairly simple search, and may
\end_layout

\begin_layout LyX-Code
    consume large amounts of CPU time for complex patterns.
  Someday,
\end_layout

\begin_layout LyX-Code
    I may decide to optimize the code.
  However, until then, let me
\end_layout

\begin_layout LyX-Code
    mention one useful technique.
  
\end_layout

\begin_layout LyX-Code
    When you have a complex pattern that includes a number of varying
\end_layout

\begin_layout LyX-Code
    ranges, imprecise matches, and so forth, it is useful to
\end_layout

\begin_layout LyX-Code
    "pipeline" matches.
  That is, form a simpler pattern that can be
\end_layout

\begin_layout LyX-Code
    used to scan through a large database extracting sections that
\end_layout

\begin_layout LyX-Code
    might be matched by the more complex pattern.
  Let me illustrate
\end_layout

\begin_layout LyX-Code
    with a short example.
  Suppose that you really wished to match the
\end_layout

\begin_layout LyX-Code
    pattern 
\end_layout

\begin_layout LyX-Code
    p1=3...5 0...8 ~p1[1,1,0] p2=6...7 3...6 AGC 3...5 RYGC ~p2[1,0,0]
\end_layout

\begin_layout LyX-Code
    In this case, the pattern units AGC 3...5 RYGC can be used to rapidly
\end_layout

\begin_layout LyX-Code
    constrain the overall search.
  You can preprocess the overall
\end_layout

\begin_layout LyX-Code
    database using the pattern:
\end_layout

\begin_layout LyX-Code
          31...31 AGC 3...5 RYGC 7...7
\end_layout

\begin_layout LyX-Code
    Put the complex pattern in pat_file1 and the simpler pattern in
\end_layout

\begin_layout LyX-Code
    pat_file2.
  Then use,
\end_layout

\begin_layout LyX-Code
        scan_for_matches -c pat_file2 < nucleotide_database |
\end_layout

\begin_layout LyX-Code
        scan_for_matches pat_file1
\end_layout

\begin_layout LyX-Code
    The output will show things like
\end_layout

\begin_layout LyX-Code
    >seqid:[232,280][2,47]
\end_layout

\begin_layout LyX-Code
    matches pieces
\end_layout

\begin_layout LyX-Code
    Then, the actual section of the sequence that was matched can be
\end_layout

\begin_layout LyX-Code
    easily computed as [233,278] (remember, the positions start from
\end_layout

\begin_layout LyX-Code
    1, not 0).
\end_layout

\begin_layout LyX-Code
    Let me finally add, you should do a few short experiments to see
\end_layout

\begin_layout LyX-Code
    whether or not such pipelining actually improves performance -- it
\end_layout

\begin_layout LyX-Code
    is not always obvious where the time is going, and I have
\end_layout

\begin_layout LyX-Code
    sometimes found that the added complexity of pipelining actually
\end_layout

\begin_layout LyX-Code
    slowed things up.
  It gets its best improvements when there are
\end_layout

\begin_layout LyX-Code
    exact matches of more than just a few characters that can be
\end_layout

\begin_layout LyX-Code
    rapidly used to eliminate large sections of the database.
\end_layout

\begin_layout LyX-Code
=============
\end_layout

\begin_layout LyX-Code
Additions:
\end_layout

\begin_layout LyX-Code
Feb 9, 1995:   the pattern units ^ and $ now work as in normal regular
\end_layout

\begin_layout LyX-Code
               expressions.
  That is
\end_layout

\begin_layout LyX-Code
                        TTF $
\end_layout

\begin_layout LyX-Code
               matches only TTF at the end of the string and 
\end_layout

\begin_layout LyX-Code
                        ^ TTF 
\end_layout

\begin_layout LyX-Code
               matches only an initial TTF
\end_layout

\begin_layout LyX-Code
               The pattern unit 
\end_layout

\begin_layout LyX-Code
                        <p1
\end_layout

\begin_layout LyX-Code
               matches the reverse of the string named p1.
  That is,
\end_layout

\begin_layout LyX-Code
               if p1 matched GCAT, then <p1 would match TACG.
  Thus,
\end_layout

\begin_layout LyX-Code
                   p1=6...6 <p1
\end_layout

\begin_layout LyX-Code
               matches a real palindrome (not the biologically common
\end_layout

\begin_layout LyX-Code
               meaning of "reverse complement")
\end_layout

\begin_layout LyX-Code

\end_layout

\end_body
\end_document