1 #LyX 1.5.1 created this file. For more info see http://www.lyx.org/
7 \usepackage[colorlinks=true, urlcolor=blue, linkcolor=black]{hyperref}
13 \font_typewriter default
14 \font_default_family default
20 \paperfontsize default
28 \paperorientation portrait
31 \paragraph_separation skip
33 \quotes_language english
36 \paperpagestyle default
37 \tracking_changes false
53 \begin_layout Publishers
56 Institute for Molecular Bioscience
58 University of Queensland
63 E-mail: mail@maasha.dk
66 \begin_layout Standard
70 \begin_layout Standard
82 \begin_layout Standard
88 \begin_layout Standard
89 \begin_inset LatexCommand tableofcontents
96 \begin_layout Standard
97 \begin_inset FloatList figure
104 \begin_layout Standard
110 \begin_layout Section
114 \begin_layout Standard
115 Biotools is a selection of simple tools that can be linked together (piped
116 as we shall call it) in a very flexible manner to perform both simple and
118 The fundamental idea is that biotools work on a data stream that will only
119 terminate at the end of an analysis and that this data stream can be passed
120 through several different biotools, each performing one specific task.
121 The advantage of this approach is that a user can perform simple and complex
122 tasks without having to write advanced code.
123 Moreover, since the data format used to pass data between biotools is text
124 based, biotools can be written by different developers in their favorite
125 programming language --- and still the biotools will be able to work together.
128 \begin_layout Standard
129 In the most simple form bioools can be piped together on the command line
130 like this (using the pipe character '|'):
133 \begin_layout LyX-Code
134 read_data | calculate_something | write_result
137 \begin_layout Standard
138 However, a more comprehensive analysis could be composed:
141 \begin_layout LyX-Code
142 read_data | select_entries | convert_entries | search_database
145 \begin_layout LyX-Code
146 evaluate_results | plot_diagram | plot_another_diagram |
149 \begin_layout LyX-Code
153 \begin_layout Standard
154 The data stream that is piped through the biotools consists of records of
155 key/value pairs in the same way a hash does in order to keep as simple
156 a structure as possible.
157 An example record can be seen below:
160 \begin_layout LyX-Code
164 \begin_layout LyX-Code
170 \begin_layout LyX-Code
176 \begin_layout LyX-Code
182 \begin_layout LyX-Code
188 \begin_layout LyX-Code
194 \begin_layout LyX-Code
200 \begin_layout LyX-Code
206 \begin_layout LyX-Code
212 \begin_layout LyX-Code
218 \begin_layout Standard
223 \begin_layout Standard
234 ' denotes the delimiter of the records, and each key is a word followed
235 by a ':' and a white-space and then the value.
236 By convention the biotools only uses upper case keys (a list of used keys
237 can be seen in Appendix\InsetSpace ~
239 \begin_inset LatexCommand ref
245 Since the records basically are hash structures this mean that the order
246 of the keys in the stream is unordered, and in the above example it is
247 pure coincidence that HIT_BEG is displayed before HIT_END, however, when
248 the order of the keys is importent, the biotools will automagically see
252 \begin_layout Standard
253 All of the biotools are able to read and write a data stream to and from
254 file as long as the records are in the biotools format.
255 This means that if you are undertaking a lengthy analysis where one of
256 the steps is time consuming, you may save the stream after this step, and
257 subsequently start one or more analysis from that last step
261 \begin_layout Standard
262 It is a goal that the biotools at some point will be able to dump the data
263 stream to file in case one of the tools fail critically.
269 If you are running a lengthy analysis it is highly recommended that you
270 create a small test sample of the data and run that through the pipeline
271 --- and once you are satisfied with the result proceed with the full data
272 set (see\InsetSpace ~
274 \begin_inset LatexCommand ref
275 reference "sub:How-to-select-a-few-records"
282 \begin_layout Standard
283 All of the biotools can be supplied with long arguments prefixed with
287 \begin_layout Standard
296 switches or single character arguments prefixed with - switches that can
297 be grouped together (e.g.
299 In this cookbook only the long switches are used to emphasize what these
303 \begin_layout Section
307 \begin_layout Standard
308 In order to get the biotools to work, you need to add environment settings
309 to include the code binaries, scripts, and modules that constitute the
311 Assuming that you are using bash, add the following to your '~/.bashrc'
312 file using your favorite editor.
313 After the changes has been saved you need to either run 'source ~/.bashrc'
317 \begin_layout LyX-Code
320 # >>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<
323 \begin_layout LyX-Code
327 \begin_layout LyX-Code
330 # Stuff that enables biotools.
333 \begin_layout LyX-Code
337 \begin_layout LyX-Code
340 export TOOLS_DIR="/home/m.hansen/tools" # Contains binaries for BLAST
344 \begin_layout LyX-Code
347 export INST_DIR="/home/m.hansen/maasha" # Contains scripts and modules.
350 \begin_layout LyX-Code
353 export DATA_DIR="/home/m.hansen/DATA" # Contains genomic data etc.
356 \begin_layout LyX-Code
359 export TMP_DIR="/home/m.hansen/maasha/tmp" # Required temporary directory.
362 \begin_layout LyX-Code
366 \begin_layout LyX-Code
369 export PATH="$PATH:$TOOLS_DIR/blast-2.2.17/bin:$TOOLS_DIR/vmatch.distribution"
372 \begin_layout LyX-Code
375 export PATH="$INST_DIR/bin/:$INST_DIR/perl_scripts/:$INST_DIR/perl_scripts/b
379 \begin_layout LyX-Code
382 export PERL5LIB="$PERL5LIB:$INST_DIR"
385 \begin_layout LyX-Code
389 \begin_layout LyX-Code
392 # Alias allowing power scripting with biotools
395 \begin_layout LyX-Code
399 \begin_layout LyX-Code
402 alias bioscript="perl -I $INST_DIR/Maasha -MBiotools=read_stream,get_recor
406 \begin_layout LyX-Code
410 \begin_layout LyX-Code
413 # >>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<
416 \begin_layout Section
420 \begin_layout Standard
425 lists all the biotools along with a description:
428 \begin_layout LyX-Code
434 \begin_layout LyX-Code
437 align_seq Align sequences in stream using Muscle.
440 \begin_layout LyX-Code
443 analyze_seq Analysis the residue composition of each sequence
447 \begin_layout LyX-Code
450 analyze_vals Determine type, count, min, max, sum and mean for
454 \begin_layout LyX-Code
457 blast_seq BLAST sequences in stream against a specified database.
460 \begin_layout LyX-Code
463 blat_seq BLAT sequences in stream against a specified genome.
466 \begin_layout LyX-Code
469 complement_seq Complement sequences in stream.
472 \begin_layout LyX-Code
475 count_records Count the number of records in stream.
478 \begin_layout LyX-Code
481 count_seq Count sequences in stream.
484 \begin_layout LyX-Code
487 count_vals Count the number of times values of given keys exists
491 \begin_layout LyX-Code
494 create_blast_db Create a BLAST database from sequences in stream for
498 \begin_layout LyX-Code
504 \begin_layout Standard
505 To list the biotools for writing different formats, you can use unix's grep
509 \begin_layout LyX-Code
512 list_biotools | grep write
515 \begin_layout LyX-Code
518 write_align Write aligned sequences in pretty alignment format.
521 \begin_layout LyX-Code
524 write_bed Write records from stream as BED lines.
527 \begin_layout LyX-Code
530 write_blast Write BLAST records from stream in BLAST tabular format
534 \begin_layout LyX-Code
537 write_fasta Write sequences in FASTA format.
540 \begin_layout LyX-Code
543 write_psl Write records from stream in PSL format.
546 \begin_layout LyX-Code
549 write_tab Write records from stream as tab separated table.
552 \begin_layout Standard
553 In order to find out how a specific biotool works, you just type the program
554 name without any arguments and press return and the usage of the biotool
564 \begin_layout Standard
565 \begin_inset Box Frameless
574 height_special "totalheight"
577 \begin_layout LyX-Code
580 Program name: read_fasta
583 \begin_layout LyX-Code
586 Author: Martin Asser Hansen - Copyright (C) - All rights reserved
589 \begin_layout LyX-Code
592 Contact: mail@maasha.dk
595 \begin_layout LyX-Code
601 \begin_layout LyX-Code
604 License: GNU General Public License version 2 (http://www.gnu.org/copyleft/
608 \begin_layout LyX-Code
611 Description: Read FASTA entries.
614 \begin_layout LyX-Code
617 Usage: read_fasta [options] -i <FASTA file(s)>
620 \begin_layout LyX-Code
626 \begin_layout LyX-Code
629 [-i <file(s)> | --data_in=<file(s)>] - Comma separated list of files
633 \begin_layout LyX-Code
636 [-n <int> | --num=<int>] - Limit number of records to read.
639 \begin_layout LyX-Code
642 [-I <file> | --stream_in=<file>] - Read input stream from file
646 \begin_layout LyX-Code
649 [-O <file> | --stream_out=<file>] - Write output stream to file
653 \begin_layout LyX-Code
657 \begin_layout LyX-Code
663 \begin_layout LyX-Code
666 read_fasta -i test.fna - Read FASTA entries from file.
669 \begin_layout LyX-Code
672 read_fasta -i test1.fna,test2,fna - Read FASTA entries from files.
675 \begin_layout LyX-Code
678 read_fasta -i '*.fna' - Read FASTA entries from files.
681 \begin_layout LyX-Code
684 read_fasta -i test.fna -n 10 - Read first 10 FASTA entries from
693 \begin_layout Section
697 \begin_layout Subsection
698 How to read the data stream from file?
699 \begin_inset LatexCommand label
700 name "sub:How-to-read-stream"
707 \begin_layout Standard
708 You want to read a data stream that you previously have saved to file in
710 This can be done implicetly or explicitly.
711 The implicit way uses the 'stdout' stream of the Unix terminal:
714 \begin_layout LyX-Code
718 \begin_layout Standard
719 cat is the Unix command that reads a file and output the result to 'stdout'
720 --- which in this case is piped to any biotool represented by the <biotool>.
721 It is also possible to read the data stream using '<' to direct the 'stdout'
722 stream into the biotool like this:
725 \begin_layout LyX-Code
729 \begin_layout Standard
730 However, that will not work if you pipe more biotools together.
731 Then it is much safer to read the stream from a file explicitly like this:
734 \begin_layout LyX-Code
735 <biotool> --stream_in=<file>
738 \begin_layout Standard
739 Here the filename <file> is explicetly given to the biotool <biotool> with
744 \begin_layout Standard
754 This switch works with all biotools.
755 It is also possible to read in data from multiple sources by repeating
756 the explicit read step:
759 \begin_layout LyX-Code
760 <biotool> --stream_in=<file1> | <biotool> --stream_in=<file2>
763 \begin_layout Subsection
764 How to write the data stream to file?
765 \begin_inset LatexCommand label
766 name "sub:How-to-write-stream"
773 \begin_layout Standard
774 In order to save the output stream from a biotool to file, so you can read
775 in the stream again at a later time, you can do one of two things:
778 \begin_layout LyX-Code
782 \begin_layout Standard
783 All, the biotools write the data stream to 'stdout' by default which can
784 be written to a file by redirecting 'stdout' to file using '>' , however,
785 if one of the biotools for writing other formats is used then the both
786 the biotools records as well as the result output will go to 'stdout' in
787 a mixture causing havock! To avoid this you must use the switch
791 \begin_layout Standard
800 stream_out that explictly tells the biotool to write the output stream to
804 \begin_layout LyX-Code
805 <biotool> --stream_out=<file>
808 \begin_layout Standard
813 \begin_layout Standard
822 stream_out switch works with all biotools.
825 \begin_layout Subsection
826 How to terminate the data stream?
829 \begin_layout Standard
830 The data stream is never stops unless the user want to save the stream or
835 \begin_layout Standard
844 no_stream switch that will terminate the stream:
847 \begin_layout LyX-Code
848 <biotool> --no_stream
851 \begin_layout Standard
856 \begin_layout Standard
865 no_stream switch only works with those biotools where it makes sense that
866 the user might want to terminale the data stream,
871 after an analysis step where the user wants to output the result, but not
875 \begin_layout Subsection
876 How to write my results to file?
877 \begin_inset LatexCommand label
878 name "sub:How-to-write-result"
885 \begin_layout Standard
886 Saving the result of an analysis to file can be done implicitly or explicitly.
890 \begin_layout LyX-Code
891 <biotool> --no_stream > <file>
894 \begin_layout Standard
895 If you use '>' to redirect 'stdout' to file then it is important to use
900 \begin_layout Standard
909 no_stream switch to avoid writing a mix of biotools records and result to
910 the same file causing havock.
911 The safe way is to use the
915 \begin_layout Standard
924 result_out switch which explicetly tells the biotool to write the result
928 \begin_layout LyX-Code
929 <biotool> --result_out=<file>
932 \begin_layout Standard
933 Using the above method will not terminate the stream, so it is possible
934 to pipe that into another biotool generating different results:
937 \begin_layout LyX-Code
938 <biotool1> --result_out=<file1> | <biotool2> --result_out=<file2>
941 \begin_layout Standard
942 And still the data stream will continue unless terminated with
946 \begin_layout Standard
958 \begin_layout LyX-Code
959 <biotool> --result_out=<file> --no_stream
962 \begin_layout Standard
963 Or written to file using implicitly or explicity
964 \begin_inset LatexCommand eqref
965 reference "sub:How-to-write-result"
973 \begin_layout LyX-Code
974 <biotool> --result_out=<file1> --stream_out=<file2>
977 \begin_layout Subsection
978 How to read data from multiple sources?
981 \begin_layout Standard
982 To read multiple data sources, with the same type or different type of data
986 \begin_layout LyX-Code
987 <biotool1> --data_in=<file1> | <biotool2> --data_in=<file2>
990 \begin_layout Standard
991 where type is the data type a specific biotool reads.
994 \begin_layout Section
998 \begin_layout Subsection
999 How to read biotools input?
1002 \begin_layout Standard
1004 \begin_inset LatexCommand eqref
1005 reference "sub:How-to-read-stream"
1012 \begin_layout Subsection
1013 How to read in data?
1016 \begin_layout Standard
1017 Data in different formats can be read with the appropriate biotool for that
1019 The biotools are typicalled named 'read_<data type>' such as
1031 , etc., and all behave in a similar manner.
1032 Data can be read by supplying the
1036 \begin_layout Standard
1045 data_in switch and a file name to the file containing the data:
1048 \begin_layout LyX-Code
1049 <biotool> --data_in=<file>
1052 \begin_layout Standard
1053 It is also possible to read in a saved biotools stream (see
1054 \begin_inset LatexCommand ref
1055 reference "sub:How-to-read-stream"
1059 ) as well as reading data in one go:
1062 \begin_layout LyX-Code
1063 <biotool> --stream_in=<file1> --data_in=<file2>
1066 \begin_layout Standard
1067 If you want to read data from several files you can do this:
1070 \begin_layout LyX-Code
1071 <biotool> --data_in=<file1> | <biotool> --data_in=<file2>
1074 \begin_layout Standard
1075 If you have several data files you can read in all explicitly with a comma
1079 \begin_layout LyX-Code
1080 <biotool> --data_in=file1,file2,file3
1083 \begin_layout Standard
1084 And it is also possible to use file globbing
1088 \begin_layout Standard
1089 using the short option will only work if you quote the argument -i '*.fna'
1097 \begin_layout LyX-Code
1098 <biotool> --data_in=*.fna
1101 \begin_layout Standard
1102 Or in a combination:
1105 \begin_layout LyX-Code
1106 <biotool> --data_in=file1,/dir/*.fna
1109 \begin_layout Standard
1110 Finally, it is possible to read in data in different formats using the appropria
1111 te biotool for each format:
1114 \begin_layout LyX-Code
1115 <biotool1> --data_in=<file1> | <biotool2> --data_in=<file2> ...
1118 \begin_layout Subsection
1119 How to read FASTA input?
1122 \begin_layout Standard
1123 Sequences in FASTA format can be read explicitly using
1130 \begin_layout LyX-Code
1131 read_fasta --data_in=<file>
1134 \begin_layout Subsection
1135 How to read alignment input?
1138 \begin_layout Standard
1139 If your alignment if FASTA formatted then you can
1144 It is also possible to use
1148 since the data is FASTA formatted, however, with
1152 the key ALIGN will be omitted.
1153 The ALIGN key is used to determine which sequences belong to what alignment
1154 which is required for
1161 \begin_layout LyX-Code
1162 read_align --data_in=<file>
1165 \begin_layout Subsection
1166 How to read tabular input?
1167 \begin_inset LatexCommand label
1168 name "sub:How-to-read-table"
1175 \begin_layout Standard
1176 Tabular input can be read with
1180 which will read in all rows and chosen columns (separated by a given delimter)
1181 from a table in text format.
1184 \begin_layout Standard
1188 \begin_layout Standard
1191 \begin_inset Tabular
1192 <lyxtabular version="3" rows="4" columns="3">
1194 <column alignment="left" valignment="top" width="0">
1195 <column alignment="left" valignment="top" width="0">
1196 <column alignment="left" valignment="top" width="0">
1198 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1201 \begin_layout Standard
1207 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1210 \begin_layout Standard
1216 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
1219 \begin_layout Standard
1227 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1230 \begin_layout Standard
1236 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1239 \begin_layout Standard
1245 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
1248 \begin_layout Standard
1256 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1259 \begin_layout Standard
1265 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1268 \begin_layout Standard
1274 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
1277 \begin_layout Standard
1285 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1288 \begin_layout Standard
1294 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1297 \begin_layout Standard
1303 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
1306 \begin_layout Standard
1320 \begin_layout Standard
1321 Can be read using the command:
1324 \begin_layout LyX-Code
1325 read_tab --data_in=<file>
1328 \begin_layout Standard
1329 Which will result in four records, one for each row, where the keys V0,
1330 V1, V2 are the default keys for the organism, sequence, and count, respectively.
1331 It is possible to select a subset of colums to read by using the
1335 \begin_layout Standard
1344 cols switch which takes a comma separated list of columns numbers (first
1345 column is designated 0) as argument.
1346 So to read in only the sequence and the count so that the count comes before
1350 \begin_layout LyX-Code
1351 read_tab --data_in=<file> --cols=2,1
1354 \begin_layout Standard
1355 It is also possible to name the columns with the
1359 \begin_layout Standard
1371 \begin_layout LyX-Code
1372 read_tab --data_in=<file> --cols=2,1 --keys=COUNT,SEQ
1375 \begin_layout Subsection
1376 How to read BED input?
1379 \begin_layout Standard
1380 The BED (Browser Extensible Data
1384 \begin_layout Standard
1385 \begin_inset LatexCommand url
1386 target "http://genome.ucsc.edu/FAQ/FAQformat"
1395 ) format is a tabular format for data pertaining to one of the Eukaryotic
1396 genomes in the UCSC genome brower
1400 \begin_layout Standard
1401 \begin_inset LatexCommand url
1402 target "http://genome.ucsc.edu/"
1412 The BED format consists of up to 12 columns, where the first three are
1413 mandatory CHR, CHR_BEG, and CHR_END.
1414 The mandatory columns and any of the optional columns can all be read in
1422 \begin_layout LyX-Code
1423 read_bed --data_in=<file>
1426 \begin_layout Standard
1427 It is also possible to read the BED file with
1433 \begin_inset LatexCommand ref
1434 reference "sub:How-to-read-table"
1438 ), however, that will be more cumbersome because you need to specify the
1442 \begin_layout LyX-Code
1443 read_tab --data_in=<file> --keys=CHR,CHR_BEG,CHR_END ...
1446 \begin_layout Subsection
1447 How to read PSL input?
1450 \begin_layout Standard
1451 The PSL format is the output from BLAT and contains 21 mandatory fields
1452 that can be read with
1459 \begin_layout LyX-Code
1460 read_psl --data_in=<file>
1463 \begin_layout Section
1467 \begin_layout Standard
1468 All result output can be written explicitly to file using the
1472 \begin_layout Standard
1481 result_out switch which all result generating biotools have.
1482 It is also possible to write the result to file implicetly by directing
1483 'stdout' to file using '>', however, that requires the
1487 \begin_layout Standard
1496 no_stream swich to prevent a mixture of data stream and results in the file.
1497 The explicit (and safe) way:
1500 \begin_layout LyX-Code
1502 | <biotool> --result_out=<file>
1505 \begin_layout Standard
1509 \begin_layout LyX-Code
1511 | <biotool> --no_stream > <file>
1514 \begin_layout Subsection
1515 How to write biotools output?
1518 \begin_layout Standard
1520 \begin_inset LatexCommand eqref
1521 reference "sub:How-to-write-stream"
1528 \begin_layout Subsection
1529 How to write FASTA output?
1530 \begin_inset LatexCommand label
1531 name "sub:How-to-write-fasta"
1538 \begin_layout Standard
1539 FASTA output can be written with
1546 \begin_layout LyX-Code
1548 | write_fasta --result_out=<file>
1551 \begin_layout Standard
1552 It is also possible to wrap the sequences to a given width using the
1556 \begin_layout Standard
1565 wrap switch allthough wrapping of sequence is generally an evil thing:
1568 \begin_layout LyX-Code
1570 | write_fasta --no_stream --wrap=80
1573 \begin_layout Subsection
1574 How to write alignment output?
1575 \begin_inset LatexCommand label
1576 name "sub:How-to-write-alignment"
1583 \begin_layout Standard
1584 Pretty alignments with ruler
1588 \begin_layout Standard
1589 '.' for every 10 residues, ':' for every 50, and '|' for every 100
1594 and consensus sequence
1595 \begin_inset Note Note
1598 \begin_layout Standard
1599 which reminds me to make that an option.
1608 , what also have the optional
1612 \begin_layout Standard
1621 wrap switch to break the alignment into blocks of a given width:
1624 \begin_layout LyX-Code
1626 | write_align --result_out=<file> --wrap=80
1629 \begin_layout Standard
1630 If the number of sequnces in the alignment is 2 then a pairwise alignment
1631 will be output otherwise a multiple alignment.
1632 And if the sequence type, determined automagically, is protein, then residues
1633 and symbols (+,\InsetSpace ~
1635 .) will be used to show consensus according to the Blosum62
1639 \begin_layout Subsection
1640 How to write tabular output?
1641 \begin_inset LatexCommand label
1642 name "sub:How-to-write-tab"
1649 \begin_layout Standard
1650 Outputting the data stream as a table can be done with
1654 , which will write generate one row per record with the values as columns.
1655 If you supply the optional
1659 \begin_layout Standard
1668 comment switch, when the first row in the table will be a 'comment' line
1669 prefixed with a '#':
1672 \begin_layout LyX-Code
1674 | write_tab --result_out=<file> --comment
1677 \begin_layout Standard
1678 You can also change the delimiter from the default (tab) to
1686 \begin_layout LyX-Code
1688 | write_tab --result_out=<file> --delimit=','
1691 \begin_layout Standard
1692 If you want the values output in a specific order you have to supply a comma
1693 separated list using the
1697 \begin_layout Standard
1706 keys switch that will print only those keys in that order:
1709 \begin_layout LyX-Code
1711 | write_tab --result_out=<file> --keys=SEQ_NAME,COUNT
1714 \begin_layout Standard
1715 Alternatively, if you have some keys that you don't want in the tabular
1720 \begin_layout Standard
1730 So to print all keys except SEQ and SEQ_TYPE do:
1733 \begin_layout LyX-Code
1735 | write_tab --result_out=<file> --no_keys=SEQ,SEQ_TYPE
1738 \begin_layout Standard
1739 Finally, if you have a stream containing a mix of different records types,
1745 records with sequences and records with matches, then you can use
1749 to output all the records in tabluar format, however, the
1753 \begin_layout Standard
1766 \begin_layout Standard
1779 \begin_layout Standard
1788 no_keys switches will only respond to records of the first type encountered.
1789 The reason is that outputting mixed records is probably not what you want
1790 anyway, and you should remove all the unwanted records from the stream
1791 before outputting the table:
1795 is your friend (see\InsetSpace ~
1797 \begin_inset LatexCommand ref
1798 reference "sub:How-to-grab"
1805 \begin_layout Subsection
1806 How to write a BED output?
1807 \begin_inset LatexCommand label
1808 name "sub:How-to-write-BED"
1815 \begin_layout Standard
1816 Data in BED format can be output if the records contain the mandatory keys
1817 CHR, CHR_BEG, and CHR_END using
1822 If the optional keys are also present, they will be output as well:
1825 \begin_layout LyX-Code
1826 write_bed --result_out=<file>
1829 \begin_layout Subsection
1830 How to write PSL output?
1831 \begin_inset LatexCommand label
1832 name "sub:How-to-write-PSL"
1839 \begin_layout Standard
1840 Data in PSL format can be output using
1845 \begin_layout LyX-Code
1846 write_psl --result_out=<file>
1849 \begin_layout Section
1850 Manipulating Records
1853 \begin_layout Subsection
1854 How to select a few records?
1855 \begin_inset LatexCommand label
1856 name "sub:How-to-select-a-few-records"
1863 \begin_layout Standard
1864 To quickly get an overview of your data you can limit the data stream to
1866 This also very useful to test the pipeline with a few records if you are
1867 setting up a complex analysis using several biotools.
1868 That way you can inspect that all goes well before analyzing and waiting
1869 for the full data set.
1870 All of the read_<type> biotools have the
1874 \begin_layout Standard
1883 num switch which will take a number as argument and only that number of
1884 records will be read.
1885 So to read in the first 10 FASTA entries from a file:
1888 \begin_layout LyX-Code
1889 read_fasta --data_in=test.fna --num=10
1892 \begin_layout Standard
1893 Another way of doing this is to use
1897 will limit the stream to show the first 10 records (default):
1900 \begin_layout LyX-Code
1905 \begin_layout Standard
1910 directly after one of the read_<type> biotools will be a lot slower than
1915 \begin_layout Standard
1924 num switch with the read_<type> biotools, however,
1928 can also be used to limit the output from all the other biotools.
1929 It is also possible to give
1933 a number of records to show using the
1937 \begin_layout Standard
1947 So to display the first 100 records do:
1950 \begin_layout LyX-Code
1952 | head_records --num=100
1955 \begin_layout Subsection
1956 How to select random records?
1957 \begin_inset LatexCommand label
1958 name "sub:How-to-select-random-records"
1965 \begin_layout Standard
1966 If you want to inspect a number of random records from the stream this can
1972 So if you have 1 mio records in the stream and you want to select 1000
1976 \begin_layout LyX-Code
1978 | random_records --num=1000
1981 \begin_layout Subsection
1982 How to count all records in the data stream?
1985 \begin_layout Standard
1986 To count all the records in the data stream use
1990 , which adds one record (which is not included in the count) to the data
1992 So to count the number of sequences in a FASTA file you can do this:
1995 \begin_layout LyX-Code
1996 cat test.fna | read_fasta | count_records --no_stream
1999 \begin_layout Standard
2000 Which will write the last record containing the count to 'stdout':
2003 \begin_layout LyX-Code
2009 \begin_layout LyX-Code
2015 \begin_layout Standard
2016 It is also possible to write the count to file using the
2020 \begin_layout Standard
2032 \begin_layout Subsection
2033 How to get the length of record values?
2034 \begin_inset LatexCommand label
2035 name "sub:How-to-get-value_length"
2042 \begin_layout Standard
2047 biotool to get the length of each value for a comma separated list of keys:
2050 \begin_layout LyX-Code
2052 | length_vals --keys=HIT,PATTERN
2055 \begin_layout Subsection
2056 How to grab specific records?
2057 \begin_inset LatexCommand label
2058 name "sub:How-to-grab"
2065 \begin_layout Standard
2070 is related to the Unix grep and locates records based on matching keys
2071 and/or values using either a pattern, a Perl regex, or a numerical evaluation.
2076 all records in the stream that has any mentioning of the pattern 'human'
2077 just pipe the data stream through
2084 \begin_layout LyX-Code
2086 | grab --pattern=human
2089 \begin_layout Standard
2090 This will search for the pattern 'human' in all keys and all values.
2095 \begin_layout Standard
2104 pattern switch takes a comma separated list of patterns, so in order to
2105 match multiple patterns do:
2108 \begin_layout LyX-Code
2110 | grab --pattern=human,mouse
2113 \begin_layout Standard
2114 It is also possible to use the
2118 \begin_layout Standard
2127 pattern_in switch instead of
2131 \begin_layout Standard
2145 \begin_layout Standard
2154 pattern_in is used to read a file with one pattern per line:
2157 \begin_layout LyX-Code
2159 | grab --pattern_in=patterns.txt
2162 \begin_layout Standard
2163 If you want the opposite result --- to find all records that does not match
2164 the patterns, add the
2168 \begin_layout Standard
2177 invert switch, which not only works with the
2181 \begin_layout Standard
2190 pattern switch, but also with
2194 \begin_layout Standard
2207 \begin_layout Standard
2219 \begin_layout LyX-Code
2221 | grab --pattern=human --invert
2224 \begin_layout Standard
2225 If you want to search the record keys only,
2230 to find all records containing the key SEQ you can add the
2234 \begin_layout Standard
2244 This will prevent matching of SEQ in any record value, and in fact SEQ
2245 is a not uncommon peptide sequence you could get an unwanted record.
2246 Also, this will give an increase in speed since only the keys are searched:
2249 \begin_layout LyX-Code
2251 | grab --pattern=SEQ --keys_only
2254 \begin_layout Standard
2255 However, if you are interested in finding the peptide sequence SEQ and not
2256 the SEQ key, just add the
2260 \begin_layout Standard
2269 vals_only switch instead:
2272 \begin_layout LyX-Code
2274 | grab --pattern=SEQ --vals_only
2277 \begin_layout Standard
2278 Also, if you want to grab for certain key/value pairs you can supply a comma
2279 separated list of keys whos values will then be searched using the
2283 \begin_layout Standard
2293 This is handy if your records contain large genomic sequences and you dont
2294 want to search the entire sequence for
2299 the organism name --- it is much faster to tell
2303 which keys to search the value for:
2306 \begin_layout LyX-Code
2308 | grab --pattern=human --keys=SEQ_NAME
2311 \begin_layout LyX-Code
2315 \begin_layout Standard
2316 It is also possible to invoke flexible matching using regex (regular expressions
2317 ) instead of simple pattern matching.
2322 the regex engine is Perl based and allows use of different type of wild
2323 cards, alternatives,
2331 \begin_layout Standard
2332 \begin_inset LatexCommand url
2333 target "http://perldoc.perl.org/perlreref.html"
2347 records withs the sequence ATCG or GCTA you can do this:
2350 \begin_layout LyX-Code
2352 | grab --regex='ATCG|GCTA'
2355 \begin_layout Standard
2356 Or if you want to find sequences beginning with ATCG:
2359 \begin_layout LyX-Code
2361 | grab --regex='^ATCG'
2364 \begin_layout Standard
2369 to locate records that fulfill a numerical property using the
2373 \begin_layout Standard
2382 eval switch witch takes an expression in three parts.
2383 The first part is the key that holds the value we want to evaluate, the
2384 second part holds one the six operators:
2387 \begin_layout Enumerate
2391 \begin_layout Enumerate
2392 Greater than or equal to: >=
2395 \begin_layout Enumerate
2399 \begin_layout Enumerate
2400 Less than or equal to: <=
2403 \begin_layout Enumerate
2407 \begin_layout Enumerate
2411 \begin_layout Enumerate
2412 String wise equal to: eq
2415 \begin_layout Enumerate
2416 String wise not equal to: ne
2419 \begin_layout Standard
2420 And finally comes the number used in the evaluation.
2425 all records with a sequence length greater than 30:
2428 \begin_layout LyX-Code
2430 length_seq | grab --eval='SEQ_LEN > 30'
2433 \begin_layout Standard
2434 If you want to locate all records containing the pattern 'human' and where
2435 the sequence length is greater that 30, you do this by running the stream
2443 \begin_layout LyX-Code
2445 | grab --pattern='human' | length_seq | grab --eval='SEQ_LEN > 30'
2448 \begin_layout Standard
2449 Finally, it is possible to do fast matching of expressions from a file using
2454 \begin_layout Standard
2464 Each of these expressions has to be matched exactly over the entrie length,
2465 which if useful if you have a file with accession numbers, that you want
2466 to locate in the stream:
2469 \begin_layout LyX-Code
2471 | grab --exact acc_no.txt | ...
2474 \begin_layout Standard
2479 \begin_layout Standard
2488 exact is much faster than using
2492 \begin_layout Standard
2501 pattern_in, because with
2505 \begin_layout Standard
2514 exact the expression has to be complete matches, where
2518 \begin_layout Standard
2527 pattern_in looks for subpatterns.
2530 \begin_layout Standard
2531 NB! To get the best speed performance, use the most restrictive
2538 \begin_layout Subsection
2539 How to remove keys from records?
2542 \begin_layout Standard
2543 To remove one or more specific keys from all records in the data stream
2551 \begin_layout LyX-Code
2553 | remove_keys --keys=SEQ,SEQ_NAME
2556 \begin_layout Standard
2557 In the above example SEQ and SEQ_NAME will be removed from all records if
2558 they exists in these.
2559 If all keys are removed from a record, then the record will be removed.
2562 \begin_layout Subsection
2563 How to rename keys in records?
2566 \begin_layout Standard
2567 Sometimes you want to rename a record key,
2572 if you have read in a two column table with sequence name and sequence
2574 \begin_inset LatexCommand ref
2575 reference "sub:How-to-read-table"
2579 ) without specifying the key names, then the sequence name will be called
2580 V0 and the sequence V1 as default in the
2585 To rename the V0 and V1 keys we need to run the stream through
2589 twice (one for each key to rename):
2592 \begin_layout LyX-Code
2594 | rename_keys --keys=V0,SEQ_NAME | rename_keys --keys=V1,SEQ
2597 \begin_layout Standard
2598 The first instance of
2602 replaces all the V0 keys with SEQ_NAME, and the second instance of
2606 replaces all the V1 keys with SEQ.
2611 the data can now be used in the biotools that requires these keys.
2614 \begin_layout Section
2615 Manipulating Sequences
2618 \begin_layout Subsection
2619 How to get sequence lengths?
2622 \begin_layout Standard
2623 The length for sequences in records can be determined with
2627 , which adds the key SEQ_LEN to each record with the sequence length as
2629 It also generates an extra record that is emitted last with the key TOTAL_SEQ_L
2630 EN showing the total length of all the sequences.
2633 \begin_layout LyX-Code
2634 read_fasta --data_in=<file> | length_seq
2637 \begin_layout Standard
2638 It is also possible to determine the sequence length using the generic tool
2644 \begin_inset LatexCommand eqref
2645 reference "sub:How-to-get-value_length"
2649 , which determines the length of the values for a given list of keys:
2652 \begin_layout LyX-Code
2653 read_fasta --data_in=<file> | length_vals --keys=SEQ
2656 \begin_layout Standard
2657 To obtain the total length of all sequences use
2664 \begin_layout LyX-Code
2665 read_fasta --data_in=<file> | length_vals --keys=SEQ
2668 \begin_layout LyX-Code
2669 | sum_vals --keys=SEQ_LEN
2672 \begin_layout Standard
2677 will also determine the length of each sequence (see\InsetSpace ~
2679 \begin_inset LatexCommand ref
2680 reference "sub:How-to-analyze"
2687 \begin_layout Subsection
2688 How to analyze sequence composition?
2689 \begin_inset LatexCommand label
2690 name "sub:How-to-analyze"
2697 \begin_layout Standard
2698 If you want to find out the sequence type, composition, length, as well
2699 as GC content, indel content and proportions of soft and hard masked sequence,
2705 This handy biotool will determine all these things per sequence from which
2706 it is easy to get an overview using the
2710 biotool to output a table (see\InsetSpace ~
2712 \begin_inset LatexCommand ref
2713 reference "sub:How-to-write-tab"
2718 So in order to determine the sequence composition of a FASTA file with
2719 just one entry containing the sequence 'ATCG' we just read the data with
2724 and run the output through
2728 which will add the analysis to the record like this:
2731 \begin_layout LyX-Code
2732 read_fasta --data_in=test.fna | analyze_seq ...
2735 \begin_layout LyX-Code
2739 \begin_layout LyX-Code
2745 \begin_layout LyX-Code
2751 \begin_layout LyX-Code
2757 \begin_layout LyX-Code
2763 \begin_layout LyX-Code
2769 \begin_layout LyX-Code
2775 \begin_layout LyX-Code
2781 \begin_layout LyX-Code
2787 \begin_layout LyX-Code
2793 \begin_layout LyX-Code
2799 \begin_layout LyX-Code
2805 \begin_layout LyX-Code
2811 \begin_layout LyX-Code
2817 \begin_layout LyX-Code
2823 \begin_layout LyX-Code
2829 \begin_layout LyX-Code
2835 \begin_layout LyX-Code
2841 \begin_layout LyX-Code
2847 \begin_layout LyX-Code
2853 \begin_layout LyX-Code
2859 \begin_layout LyX-Code
2862 SEQ: TTTCAGTTTGGGACGGAGTAAGGCCTTCCtttttttttttttttttttttttttttttgagaccgagtcttgctc
2866 \begin_layout LyX-Code
2872 \begin_layout LyX-Code
2878 \begin_layout LyX-Code
2884 \begin_layout LyX-Code
2890 \begin_layout LyX-Code
2896 \begin_layout LyX-Code
2902 \begin_layout LyX-Code
2906 \begin_layout Standard
2907 Now to make a table of how may As, Ts, Cs, and Gs you can add the following:
2910 \begin_layout LyX-Code
2912 | analyze_seq | write_tab --keys=RES:A,RES:T,RES:C,RES:G
2915 \begin_layout Standard
2916 Or if you want to see the proportions of hard and soft masked sequence:
2919 \begin_layout LyX-Code
2921 | analyse_seq | write_tab --keys=HARD_MASK%,SOFT_MASK%
2924 \begin_layout Standard
2925 If you have a stack of sequences in one file and you want to determine the
2926 mean GC content you can do it using the
2933 \begin_layout LyX-Code
2934 read_fasta --data_in=test.fna | analyze_seq | mean_vals --keys=GC%
2937 \begin_layout Standard
2938 Or if you want the total count of Ns you can use
2945 \begin_layout LyX-Code
2946 read_fasta --data_in=test.fna | analyze_seq | sum_vals --keys=RES:N
2949 \begin_layout Standard
2950 The MIX_INDEX key is calculated as the count of the most common residue
2951 over the sequence length, and can be used as a cut-off for removing sequence
2952 tags consisting of mostly one nucleotide:
2955 \begin_layout LyX-Code
2956 read_fasta --data_in=test.fna | analyze_seq | grab --eval='MIX_INDEX<0.85'
2959 \begin_layout Subsection
2960 How to extract subsequences?
2961 \begin_inset LatexCommand label
2962 name "sub:How-to-extract"
2969 \begin_layout Standard
2970 In order to extract a subsequence from a longer sequence use the biotool
2971 extract_seq, which will replace the sequence in the record with the subsequence
2972 (this behaviour should probably be modified to be dependant of a
2976 \begin_layout Standard
2989 \begin_layout Standard
2999 \begin_inset Note Note
3002 \begin_layout Standard
3009 So to extract the first 20 residues from all sequences do (first residue
3013 \begin_layout LyX-Code
3015 | extract_seq --beg=1 --len=20
3018 \begin_layout Standard
3019 You can also specify a begin and end coordinate set:
3022 \begin_layout LyX-Code
3024 | extract_seq --beg=20 --end=40
3027 \begin_layout Standard
3028 If you want the subsequences from position 20 to the sequence end do:
3031 \begin_layout LyX-Code
3033 | extract_seq --beg=20
3036 \begin_layout Standard
3037 If you want to extract subsequences a given distance from the sequence end
3038 you can do this by reversing the sequence with the biotool
3043 \begin_inset LatexCommand eqref
3044 reference "sub:How-to-reverse-seq"
3052 to get the subsequence, and then
3056 again to get the subsequence back in the original orientation:
3059 \begin_layout LyX-Code
3060 read_fasta --data_in=test.fna | reverse_seq
3063 \begin_layout LyX-Code
3064 | extract_seq --beg=10 --len=10 | reverse_seq
3067 \begin_layout Subsection
3068 How to get genomic sequence?
3069 \begin_inset LatexCommand label
3070 name "sub:How-to-get-genomic-sequence"
3077 \begin_layout Standard
3082 can extract subsequences for a given genome specified with the
3086 \begin_layout Standard
3095 genome switch explicitly using the
3099 \begin_layout Standard
3112 \begin_layout Standard
3125 \begin_layout Standard
3137 \begin_layout LyX-Code
3138 get_genome_seq --genome=<genome> --beg=1 --len=100
3141 \begin_layout Standard
3146 can be used to append the corresponding sequence to BED, PSL, and BLAST
3150 \begin_layout LyX-Code
3151 read_bed --data_in=<BED file> | get_genome_seq --genome=<genome>
3154 \begin_layout Standard
3155 It is also possible to include flaking sequence using the
3159 \begin_layout Standard
3169 So to include 50 nucleotides upstream and 50 nucleotides downstream for
3173 \begin_layout LyX-Code
3174 read_bed --data_in=<BED file> | get_genome_seq --genome=<genome> --flank=50
3177 \begin_layout Subsection
3178 How to upper-case sequences?
3181 \begin_layout Standard
3182 Sequences can be shifted from lower case to upper case using
3189 \begin_layout LyX-Code
3194 \begin_layout Subsection
3195 How to reverse sequences?
3196 \begin_inset LatexCommand label
3197 name "sub:How-to-reverse-seq"
3204 \begin_layout Standard
3205 The order of residues in a sequence can be reversed using reverse_seq:
3208 \begin_layout LyX-Code
3213 \begin_layout Standard
3214 Note that in order to reverse/complement a sequence you also need the
3218 biotool (see\InsetSpace ~
3220 \begin_inset LatexCommand ref
3221 reference "sub:How-to-complement"
3228 \begin_layout Subsection
3229 How to complement sequences?
3230 \begin_inset LatexCommand label
3231 name "sub:How-to-complement"
3238 \begin_layout Standard
3239 DNA and RNA sequences can be complemented with
3243 , which automagically determines the sequence type:
3246 \begin_layout LyX-Code
3251 \begin_layout Standard
3252 Note that in order to reverse/complement a sequence you also need the
3256 biotool (see\InsetSpace ~
3258 \begin_inset LatexCommand ref
3259 reference "sub:How-to-reverse-seq"
3266 \begin_layout Subsection
3267 How to remove indels from sequnces?
3270 \begin_layout Standard
3271 Indels can be removed from sequences with the
3276 This is useful if you have aligned some sequences (see\InsetSpace ~
3278 \begin_inset LatexCommand ref
3279 reference "sub:How-to-align"
3283 ) and extracted (see\InsetSpace ~
3285 \begin_inset LatexCommand ref
3286 reference "sub:How-to-extract"
3290 ) a block of subsequences from the alignment and you want to use these sequence
3291 in a search where you need to remove the indels first.
3292 '-', '~', and '.' are considered indels:
3295 \begin_layout LyX-Code
3300 \begin_layout Subsection
3301 How to shuffle sequences?
3304 \begin_layout Standard
3305 All residues in sequences in the stream can be shuffled to random positions
3313 \begin_layout LyX-Code
3318 \begin_layout Subsection
3319 How to split sequences into overlapping subsequences?
3322 \begin_layout Standard
3323 Sequences can be slit into overlapping subsequences with the
3330 \begin_layout LyX-Code
3332 | split_seq --word_size=20 --uniq
3335 \begin_layout Subsection
3336 How to determine the oligo frequency?
3339 \begin_layout Standard
3340 In order to determine if any oligo usage is over represented in one or more
3341 sequences you can determine the frequency of oligos of a given size with
3349 \begin_layout LyX-Code
3351 | oligo_freq --word_size=4
3354 \begin_layout Standard
3355 And if you have more than one sequence and want to accumulate the frequences
3360 \begin_layout Standard
3372 \begin_layout LyX-Code
3374 | oligo_freq --word_size=4 --all
3377 \begin_layout Standard
3378 To get a meaningful result you need to write the resulting frequencies as
3385 \begin_inset LatexCommand ref
3386 reference "sub:How-to-write-tab"
3390 ), but first it is important to
3396 \begin_inset LatexCommand ref
3397 reference "sub:How-to-grab"
3401 ) the records with the frequencies to avoid full length sequences in the
3405 \begin_layout LyX-Code
3407 | oligo_freq --word_size=4 --all | grab --pattern=OLIGO --keys_only
3410 \begin_layout LyX-Code
3411 | write_tab --no_stream
3414 \begin_layout Standard
3415 And the resulting frequency table can be sorted with Unix sort (man sort).
3418 \begin_layout Subsection
3419 How to search for sequences in genomes?
3422 \begin_layout Standard
3423 See the following biotool:
3426 \begin_layout Itemize
3432 \begin_inset LatexCommand eqref
3433 reference "sub:How-to-use-patscan"
3440 \begin_layout Itemize
3446 \begin_inset LatexCommand eqref
3447 reference "sub:How-to-use-BLAT"
3454 \begin_layout Itemize
3460 \begin_inset LatexCommand eqref
3461 reference "sub:How-to-use-BLAST"
3468 \begin_layout Itemize
3474 \begin_inset LatexCommand eqref
3475 reference "sub:How-to-use-Vmatch"
3482 \begin_layout Subsection
3483 How to search sequences for a pattern?
3484 \begin_inset LatexCommand label
3485 name "sub:How-to-use-patscan"
3492 \begin_layout Standard
3493 It is possible to search sequences in the data stream for patterns using
3498 biotool which utilizes the powerful scan_for_matches engine.
3499 Consult the documentation for scan_for_matches in order to learn how to
3500 define patterns (the documentation is included in Appendix\InsetSpace ~
3502 \begin_inset LatexCommand ref
3503 reference "sec:scan_for_matches-README"
3510 \begin_layout Standard
3511 To search all sequences for a simple pattern consisting of the sequence
3512 ATCGATCG allowing for 3 mismatches, 2 insertions and 1 deletion:
3515 \begin_layout LyX-Code
3516 read_fasta --data_in=<file> | patscan_seq --pattern='ATCGATCG[3,2,1]'
3519 \begin_layout Standard
3524 \begin_layout Standard
3533 pattern switch takes a comma seperated list of patterns, so if you want
3534 to search for more that one pattern do:
3537 \begin_layout LyX-Code
3539 | patscan_seq --pattern='ATCGATCG[3,2,1],GCTAGCTA[3,2,1]'
3542 \begin_layout Standard
3543 It is also possible to have a list of different patterns to search for in
3544 a file with one pattern per line.
3549 to read these patterns use the
3553 \begin_layout Standard
3565 \begin_layout LyX-Code
3567 | patscan_seq --pattern_in=<file>
3570 \begin_layout Standard
3571 To also scan the complementary strand in nucleotide sequences (
3575 automagically determines the sequence type) you need to add the
3579 \begin_layout Standard
3591 \begin_layout LyX-Code
3593 | patscan_seq --pattern=<pattern> --comp
3596 \begin_layout Standard
3597 It is also possible to use
3601 to output those records that does not contain a certain pattern by using
3606 \begin_layout Standard
3618 \begin_layout LyX-Code
3620 | patscan_seq --pattern=<pattern> --invert
3623 \begin_layout Standard
3628 can also scan for patterns in a given genome sequence, instead of sequences
3629 in the stream, using the
3633 \begin_layout Standard
3645 \begin_layout LyX-Code
3646 patscan --pattern=<pattern> --genome=<genome>
3649 \begin_layout Subsection
3650 How to use BLAT for sequence search?
3651 \begin_inset LatexCommand label
3652 name "sub:How-to-use-BLAT"
3659 \begin_layout Standard
3660 Sequences in the data stream can be matched against supported genomes using
3665 which is a biotool using BLAT as the name might suggest.
3666 Currently only Mouse and Human genomes are available and it is not possible
3667 to use OOC files since there is still a need for a local repository for
3669 Otherwise it is just:
3672 \begin_layout LyX-Code
3673 read_fasta --data_in=<file> | blat_seq --genome=<genome>
3676 \begin_layout Standard
3677 The search results can then be written to file with
3683 \begin_inset LatexCommand ref
3684 reference "sub:How-to-write-PSL"
3694 \begin_inset LatexCommand ref
3695 reference "sub:How-to-write-BED"
3703 some information will be lost).
3704 It is also possible to plot chromosome distribution of the search results
3711 \begin_inset LatexCommand ref
3712 reference "sub:How-to-plot-chrdist"
3716 ) or the distribution of the match lengths using
3722 \begin_inset LatexCommand ref
3723 reference "sub:How-to-plot-lendist"
3727 ) or a karyogram with the hits using
3733 \begin_inset LatexCommand ref
3734 reference "sub:How-to-plot-karyogram"
3741 \begin_layout Subsection
3742 How to use BLAST for sequence search?
3743 \begin_inset LatexCommand label
3744 name "sub:How-to-use-BLAST"
3751 \begin_layout Standard
3752 Two biotools exist for blasting sequences:
3756 is used to create the BLAST database required for BLAST which is queried
3762 So in order to create a BLAST database from sequences in the data stream
3766 \begin_layout LyX-Code
3768 | create_blast_db --database=my_database --no_stream
3771 \begin_layout Standard
3772 The type of sequence to use for the database is automagically determined
3777 , but don't have a mixture of peptide and nucleic acids sequences in the
3783 \begin_layout Standard
3792 database switch takes a path as argument, but will default to 'blastdb_<time_sta
3796 \begin_layout Standard
3797 The resulting database can now be queried with sequences in another data
3805 \begin_layout LyX-Code
3807 | blast_seq --database=my_database
3810 \begin_layout Standard
3811 Again, the sequence type is determined automagically and the appropriate
3812 BLAST program is guessed (see below table), however, the program name can
3813 be overruled with the
3817 \begin_layout Standard
3829 \begin_layout Standard
3832 \begin_inset Tabular
3833 <lyxtabular version="3" rows="5" columns="3">
3835 <column alignment="center" valignment="top" width="0">
3836 <column alignment="center" valignment="top" width="0">
3837 <column alignment="center" valignment="top" width="0">
3838 <row bottomline="true">
3839 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
3842 \begin_layout Standard
3848 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
3851 \begin_layout Standard
3857 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
3860 \begin_layout Standard
3868 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
3871 \begin_layout Standard
3877 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
3880 \begin_layout Standard
3886 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
3889 \begin_layout Standard
3897 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
3900 \begin_layout Standard
3906 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
3909 \begin_layout Standard
3915 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
3918 \begin_layout Standard
3926 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
3929 \begin_layout Standard
3935 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
3938 \begin_layout Standard
3944 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
3947 \begin_layout Standard
3955 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
3958 \begin_layout Standard
3964 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
3967 \begin_layout Standard
3973 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
3976 \begin_layout Standard
3990 \begin_layout Standard
3991 Finally, it is also possible to use
3995 for blasting sequences agains a preformatted genome using the
3999 \begin_layout Standard
4008 genome switch instead of the
4012 \begin_layout Standard
4024 \begin_layout LyX-Code
4026 | blast_seq --genome=<genome>
4029 \begin_layout Subsection
4030 How to use Vmatch for sequence search?
4031 \begin_inset LatexCommand label
4032 name "sub:How-to-use-Vmatch"
4039 \begin_layout Standard
4040 The powerful suffix array software package Vmatch
4044 \begin_layout Standard
4045 \begin_inset LatexCommand url
4046 target "http://www.vmatch.de/"
4055 can be used for exact mapping of sequences against indexed genomes using
4061 map 700000 ESTs to the human genome locating all 160 mio hits in less than
4063 Only nucleotide sequences and sequences longer than 11 nucleotides will
4065 It is recommended that sequences consisting of mostly one nucleotide type
4067 This can be done with the
4072 \begin_inset LatexCommand eqref
4073 reference "sub:How-to-analyze"
4080 \begin_layout LyX-Code
4082 | vmatch_seq --genome=<genome>
4085 \begin_layout Standard
4086 It is also possible to allow for mismatches using the
4090 \begin_layout Standard
4099 hamming_dist switch.
4100 So to allow for 2 mismatches:
4103 \begin_layout LyX-Code
4105 | vmatch_seq --genome=<genome> --hamming_dist=2
4108 \begin_layout Standard
4109 Or to allow for 10% mismathing nucleotides:
4112 \begin_layout LyX-Code
4114 | vmatch_seq --genome=<genome> --hamming_dist=10p
4117 \begin_layout Standard
4118 To allow both indels and mismatches use the
4122 \begin_layout Standard
4132 So to allow for one mismatch or one indel:
4135 \begin_layout LyX-Code
4137 | vmatch_seq --genome=<genome> --hamming_dist=1
4140 \begin_layout Standard
4141 Or to allow for 5% indels or mismatches:
4144 \begin_layout LyX-Code
4146 | vmatch_seq --genome=<genome> --hamming_dist=5p
4149 \begin_layout Standard
4154 \begin_layout Standard
4167 \begin_layout Standard
4176 edit_dist greatly slows down vmatch considerably --- use with care.
4179 \begin_layout Standard
4180 The resulting SCORE key can be replaced to hold the number of genome matches
4181 of a given sequence (multi-mappers) is the
4185 \begin_layout Standard
4194 count switch is given.
4197 \begin_layout Subsection
4198 How to find all matches between sequences?
4199 \begin_inset LatexCommand label
4200 name "sub:How-to-find-matches"
4207 \begin_layout Standard
4208 All matches between two sequences can be determined with the biotool
4213 The match finding engine underneath the hood of
4217 is the super fast suffix tree program MUMmer
4221 \begin_layout Standard
4222 \begin_inset LatexCommand url
4223 target "http://mummer.sourceforge.net/"
4232 , which will locate all forward and reverse matches between huge sequences
4233 in a matter of minutes (if the repeat count is not too high and if the
4234 word size used is appropriate).
4239 genomes (1.7Mbp) takes around 10 seconds:
4242 \begin_layout LyX-Code
4244 | match_seq --word_size=20 --direction=both
4247 \begin_layout Standard
4252 can be used to generate a dot plot with
4258 \begin_inset LatexCommand ref
4259 reference "sub:How-to-generate-dotplot"
4266 \begin_layout Subsection
4267 How to align sequences?
4268 \begin_inset LatexCommand label
4269 name "sub:How-to-align"
4276 \begin_layout Standard
4277 Sequences in the stream can be aligned with the
4281 biotool that uses Muscle
4285 \begin_layout Standard
4286 \begin_inset LatexCommand url
4287 target "http://www.drive5.com/muscle/muscle.html"
4297 Currently you cannot change any of the Muscle alignment parameters and
4302 will create an alignment based on the defaults (which are really good!):
4305 \begin_layout LyX-Code
4310 \begin_layout Standard
4311 The aligned output can be written to file in FASTA format using
4317 \begin_inset LatexCommand ref
4318 reference "sub:How-to-write-fasta"
4322 ) or in pretty text using
4328 \begin_inset LatexCommand ref
4329 reference "sub:How-to-write-alignment"
4336 \begin_layout Subsection
4337 How to create a weight matrix?
4340 \begin_layout Standard
4341 If you want a weight matrix to show the sequence composition of a stack
4342 of sequences you can use the biotool create_weight_matrix:
4345 \begin_layout LyX-Code
4347 | create_weight_matrix
4350 \begin_layout Standard
4351 The result can be output in percent using the
4355 \begin_layout Standard
4367 \begin_layout LyX-Code
4369 | create_weight_matrix --percent
4372 \begin_layout Standard
4373 The weight matrix can be written as tabular output with
4379 \begin_inset LatexCommand ref
4380 reference "sub:How-to-write-tab"
4384 ) after removeing the records containing SEQ with
4390 \begin_inset LatexCommand ref
4391 reference "sub:How-to-grab"
4398 \begin_layout LyX-Code
4400 | create_weight_matrix | grab --invert --keys=SEQ --keys_only
4403 \begin_layout LyX-Code
4404 | write_tab --no_stream
4407 \begin_layout Standard
4408 The V0 column will hold the residue, while the rest of the columns will
4409 hold the frequencies for each sequence position.
4412 \begin_layout Section
4416 \begin_layout Standard
4417 There exists several biotools for plotting.
4418 Some of these are based on GNUplot
4422 \begin_layout Standard
4423 \begin_inset LatexCommand url
4424 target "http://www.gnuplot.info/"
4433 , which is an extremely powerful platform to generate all sorts of plots
4434 and even though GNUplot has quite a steep learning curve, the biotools
4435 utilizing GNUplot are simple to use.
4436 GNUplot is able to output a lot of different formats (called terminals
4437 in GNUplot), but the biotools focusses on three formats only:
4440 \begin_layout Enumerate
4441 The 'dumb' terminal is default to the GNUplot based biotools and will output
4442 a plot in crude ASCII text (Fig.\InsetSpace ~
4444 \begin_inset LatexCommand ref
4445 reference "fig:Dumb-terminal"
4450 This is quite nice for a quick and dirty plot to get an overview of your
4454 \begin_layout Enumerate
4455 The 'post' or 'postscript' terminal output postscript code which is publication
4456 grade graphics that can be viewed with applications such as Ghostview,
4457 Photoshop, and Preview.
4460 \begin_layout Enumerate
4461 The 'svg' terminal output's scalable vector graphics (SVG) which is a vector
4463 SVG is great because you can edit the resulting plot using Photoshop or
4468 \begin_layout Standard
4469 Inkscape is a really handy drawing program that is free and open source.
4471 \begin_inset LatexCommand htmlurl
4472 target "http://www.inkscape.org"
4481 if you want to add additional labels, captions, arrows, and so on and then
4482 save the result in different formats, such as postscript without loosing
4486 \begin_layout Standard
4487 The biotools for plotting that are not based on GNUplot only output SVG
4488 (that may change in the future).
4491 \begin_layout Standard
4492 \begin_inset Float figure
4497 \begin_layout Standard
4500 \begin_inset Graphics
4501 filename lendist_ascii.png
4510 \begin_layout Standard
4511 \begin_inset Caption
4513 \begin_layout Standard
4514 \begin_inset LatexCommand label
4515 name "fig:Dumb-terminal"
4528 The output of a length distribution plot in the default 'dumb terminal'
4529 to the terminal window.
4538 \begin_layout Subsection
4539 How to plot a histogram?
4540 \begin_inset LatexCommand label
4541 name "How-to-plot-histogram"
4548 \begin_layout Standard
4549 A generic histogram for a given value can be plotted with the biotool
4555 \begin_inset LatexCommand ref
4556 reference "fig:Histogram"
4563 \begin_layout LyX-Code
4565 | plot_histogram --key=TISSUE --no_stream
4568 \begin_layout Standard
4572 \begin_layout Standard
4575 \begin_inset Float figure
4580 \begin_layout Standard
4583 \begin_inset Graphics
4584 filename histogram.png
4593 \begin_layout Standard
4594 \begin_inset Caption
4596 \begin_layout Standard
4597 \begin_inset LatexCommand label
4598 name "fig:Histogram"
4615 \begin_layout Subsection
4616 How to plot a length distribution?
4617 \begin_inset LatexCommand label
4618 name "sub:How-to-plot-lendist"
4625 \begin_layout Standard
4626 Plotting of length distributions, weather sequence lengths, patterns lengths,
4632 is a really handy thing and can be done with the the biotool
4637 If you have a file with FASTA entries and want to plot the length distribution
4638 you do it like this:
4641 \begin_layout LyX-Code
4642 read_fasta --data_in=<file> | length_seq
4645 \begin_layout LyX-Code
4646 | plot_lendist --key=SEQ_LEN --no_stream
4649 \begin_layout Standard
4650 The result will be written to the default dumb terminal and will look like
4653 \begin_inset LatexCommand ref
4654 reference "fig:Dumb-terminal"
4661 \begin_layout Standard
4662 If you instead want the result in postscript format you can do:
4665 \begin_layout LyX-Code
4667 | plot_lendist --key=SEQ_LEN --terminal=post --result_out=file.ps
4670 \begin_layout Standard
4671 That will generate the plot and save it to file, but not interrupt the data
4672 stream which can then be used in further analysis.
4673 You can also save the plot implicetly using '>', however, it is then important
4674 to terminate the stream with the
4678 \begin_layout Standard
4690 \begin_layout LyX-Code
4692 | plot_lendist --key=SEQ_LEN --terminal=post --no_stream > file.ps
4695 \begin_layout Standard
4696 The resulting plot can be seen in Fig.\InsetSpace ~
4698 \begin_inset LatexCommand ref
4699 reference "fig:Length-distribution"
4706 \begin_layout Standard
4707 \begin_inset Float figure
4712 \begin_layout Standard
4716 \begin_layout Standard
4719 \begin_inset Graphics
4729 \begin_layout Standard
4730 \begin_inset Caption
4732 \begin_layout Standard
4733 \begin_inset LatexCommand label
4734 name "fig:Length-distribution"
4747 Length distribution of 630 piRNA like RNAs.
4755 \begin_layout Subsection
4756 How to plot a chromosome distribution?
4757 \begin_inset LatexCommand label
4758 name "sub:How-to-plot-chrdist"
4765 \begin_layout Standard
4766 If you have the result of a sequence search against a multi chromosome genome,
4767 it is very practical to be able to plot the distribution of search hits
4768 on the different chromosomes.
4769 This can be done with
4776 \begin_layout LyX-Code
4777 read_fasta --data_in=<file> | blat_genome | plot_chrdist --no_stream
4780 \begin_layout Standard
4781 The above example will result in a crude plot using the 'dumb' terminal,
4782 and if you want to mess around with the results from the BLAT search you
4783 probably want to save the result to file first (see\InsetSpace ~
4785 \begin_inset LatexCommand ref
4786 reference "sub:How-to-write-PSL"
4791 To plot the chromosome distribution from the saved search result you can
4795 \begin_layout LyX-Code
4796 read_bed --data_in=file.bed | plot_chrdist --terminal=post --result_out=plot.ps
4799 \begin_layout Standard
4800 That will result in the output show in Fig.\InsetSpace ~
4802 \begin_inset LatexCommand ref
4803 reference "fig:Chromosome-distribution"
4810 \begin_layout Standard
4811 \begin_inset Float figure
4816 \begin_layout Standard
4820 \begin_layout Standard
4823 \begin_inset Graphics
4834 \begin_layout Standard
4835 \begin_inset Caption
4837 \begin_layout Standard
4838 \begin_inset LatexCommand label
4839 name "fig:Chromosome-distribution"
4843 Chromosome distribution
4856 \begin_layout Subsection
4857 How to generate a dotplot?
4858 \begin_inset LatexCommand label
4859 name "sub:How-to-generate-dotplot"
4866 \begin_layout Standard
4867 A dotplot is a powerful way to get an overview of the size and location
4868 of sequence insertions, deletions, and duplications between two sequences.
4869 Generating a dotplot with biotools is a two step process where you initially
4870 find all matches between two sequences using the tool
4876 \begin_inset LatexCommand ref
4877 reference "sub:How-to-find-matches"
4881 ) and plot the resulting matches with
4886 Matching and plotting two
4890 genomes (1.7Mbp) takes around 10 seconds:
4893 \begin_layout LyX-Code
4895 | match_seq | plot_matches --terminal=post --result_out=plot.ps
4898 \begin_layout Standard
4899 The resulting dotplot is in Fig.\InsetSpace ~
4901 \begin_inset LatexCommand ref
4902 reference "fig:Dotplot"
4909 \begin_layout Standard
4910 \begin_inset Float figure
4915 \begin_layout Standard
4918 \begin_inset Graphics
4928 \begin_layout Standard
4929 \begin_inset Caption
4931 \begin_layout Standard
4932 \begin_inset LatexCommand label
4946 Forward matches are displayed in green while reverse matches are displayed
4955 \begin_layout Subsection
4956 How to plot a sequence logo?
4959 \begin_layout Standard
4960 Sequence logos can be generate with
4965 The sequnce type is determined automagically and an entropy scale of 2
4966 bits and 4 bits is used for nucleotide and peptide sequences, respectively
4970 \begin_layout Standard
4971 \begin_inset LatexCommand htmlurl
4972 target "http://www.ccrnp.ncifcrf.gov/~toms/paper/hawaii/latex/node5.html"
4984 \begin_layout LyX-Code
4986 | plot_seqlogo --no_stream --result_out=seqlogo.svg
4989 \begin_layout Standard
4990 An example of a sequence logo can be seen in Fig.\InsetSpace ~
4992 \begin_inset LatexCommand ref
4993 reference "fig:Sequence-logo"
5000 \begin_layout Standard
5001 \begin_inset Float figure
5006 \begin_layout Standard
5009 \begin_inset Graphics
5010 filename seqlogo.png
5019 \begin_layout Standard
5020 \begin_inset Caption
5022 \begin_layout Standard
5023 \begin_inset LatexCommand label
5024 name "fig:Sequence-logo"
5041 \begin_layout Subsection
5042 How to plot a karyogram?
5043 \begin_inset LatexCommand label
5044 name "sub:How-to-plot-karyogram"
5051 \begin_layout Standard
5052 To plot search hits on genomes use
5056 , which will output a nice karyogram in SVG graphics:
5059 \begin_layout LyX-Code
5061 | plot_karyogram --result_out=karyogram.svg
5064 \begin_layout Standard
5065 The banding data is taken from the UCSC genome browser database and currently
5066 only Human and Mouse is supported.
5069 \begin_inset LatexCommand ref
5070 reference "fig:Karyogram"
5074 shows the distribution of piRNA like RNAs matched to the Human genome.
5077 \begin_layout Standard
5078 \begin_inset Float figure
5083 \begin_layout Standard
5086 \begin_inset Graphics
5087 filename karyogram.png
5096 \begin_layout Standard
5097 \begin_inset Caption
5099 \begin_layout Standard
5100 \begin_inset LatexCommand label
5101 name "fig:Karyogram"
5114 Hits from a search of piRNA like RNAs in the Human genome is displayed as
5115 short horizontal bars.
5123 \begin_layout Section
5127 \begin_layout Subsection
5128 How do I display my results in the UCSC Genome Browser?
5131 \begin_layout Standard
5132 Results from the list of biotools below can be uploaded directly to a local
5133 mirror of the UCSC Genome Browser using the biotool
5140 \begin_layout Itemize
5142 \begin_inset LatexCommand eqref
5143 reference "sub:How-to-use-patscan"
5150 \begin_layout Itemize
5152 \begin_inset LatexCommand eqref
5153 reference "sub:How-to-use-BLAT"
5160 \begin_layout Itemize
5162 \begin_inset LatexCommand eqref
5163 reference "sub:How-to-use-BLAST"
5170 \begin_layout Itemize
5172 \begin_inset LatexCommand eqref
5173 reference "sub:How-to-use-Vmatch"
5180 \begin_layout Standard
5181 The syntax for uploading data the most simple way requires two mandatory
5186 \begin_layout Standard
5195 database, which is the UCSC database name (such as hg18, mm9, etc.) and
5199 \begin_layout Standard
5208 table which should be the users initials followed by an underscore and a
5209 short description of the data:
5212 \begin_layout LyX-Code
5214 | upload_to_ucsc --database=hg18 --table=mah_snoRNAs
5217 \begin_layout Standard
5222 biotool modifies the users ~/ucsc/my_tracks.ra file automagically (a backup
5223 is created with the name ~/ucsc/my_tracks.ra~) with default values that
5224 can be overridden using the following switches:
5227 \begin_layout Itemize
5231 \begin_layout Standard
5240 short_label - Short label for track - Default=database->table
5243 \begin_layout Itemize
5247 \begin_layout Standard
5256 long_label - Long label for track - Default=database->table
5259 \begin_layout Itemize
5263 \begin_layout Standard
5272 group - Track group name - Default=<user name as defined in env>
5275 \begin_layout Itemize
5279 \begin_layout Standard
5288 priority - Track display priority - Default=1
5291 \begin_layout Itemize
5295 \begin_layout Standard
5304 color - Track color - Default=147,73,42
5307 \begin_layout Itemize
5311 \begin_layout Standard
5320 chunk_size - Chunks for loading - Default=10000000
5323 \begin_layout Itemize
5327 \begin_layout Standard
5336 visibility - Track visibility - Default=pack
5339 \begin_layout Standard
5340 Also, data in BED or PSL format can be uploaded with
5344 as long as these reference to genomes and chromosomes existing in the UCSC
5348 \begin_layout LyX-Code
5349 read_bed --data_in=<bed file> | upload_to_ucsc ...
5352 \begin_layout LyX-Code
5356 \begin_layout LyX-Code
5357 read_psl --data_in=<psl file> | upload_to_ucsc ...
5360 \begin_layout Section
5364 \begin_layout Standard
5365 It is possible to do commandline scripting of biotool records using Perl.
5366 Because a biotool record essentially is a hash structure, you can pass
5371 command, which is a wrapper around the Perl executable that allows direct
5372 manipulations of the records using the power of Perl.
5375 \begin_layout Standard
5376 In the below example we replace in all records the value to the CHR key
5377 with a forthrunning number:
5380 \begin_layout LyX-Code
5382 | bioscript 'while($r=get_record(
5384 *STDIN)){$r->{CHR}=$i++; put_record($r)}'
5387 \begin_layout Standard
5388 Something more useful would probably be to create custom FASTA headers.
5390 if we read in a BED file, lookup the genomic sequence, create a custom
5395 and output FASTA entries:
5398 \begin_layout LyX-Code
5400 | bioscript 'while($r=get_record(
5402 *STDIN)){$r->{SEQ_NAME}= //
5405 \begin_layout LyX-Code
5406 join("_",$r->{CHR},$r->{CHR_BEG},$r->{CHR_END}); put_record($r)}'
5409 \begin_layout Standard
5413 \begin_layout LyX-Code
5414 >chr2L_21567527_21567550
5417 \begin_layout LyX-Code
5418 taccaaacggatgcctcagacatc
5421 \begin_layout LyX-Code
5422 >chr2L_693380_693403
5425 \begin_layout LyX-Code
5426 taccaaacggatgcctcagacatc
5429 \begin_layout LyX-Code
5430 >chr2L_13859534_13859557
5433 \begin_layout LyX-Code
5434 taccaaacggatgcctcagacatc
5437 \begin_layout LyX-Code
5438 >chr2L_9005090_9005113
5441 \begin_layout LyX-Code
5442 taccaaacggatgcctcagacatc
5445 \begin_layout LyX-Code
5446 >chr2L_2106825_2106848
5449 \begin_layout LyX-Code
5450 taccaaacggatgcctcagacatc
5453 \begin_layout LyX-Code
5454 >chr2L_14649031_14649054
5457 \begin_layout LyX-Code
5458 taccaaacggatgcctcagacatc
5461 \begin_layout Section
5465 \begin_layout Standard
5466 Shoot the messenger!
5469 \begin_layout Section
5472 \begin_inset LatexCommand label
5480 \begin_layout Standard
5484 \begin_layout Standard
5488 \begin_layout Standard
5492 \begin_layout Standard
5496 \begin_layout Standard
5500 \begin_layout Standard
5504 \begin_layout Section
5506 \begin_inset LatexCommand label
5514 \begin_layout Standard
5518 \begin_layout Standard
5530 \begin_layout Standard
5534 \begin_layout Standard
5546 \begin_layout Standard
5550 \begin_layout Standard
5562 \begin_layout Standard
5566 \begin_layout Standard
5578 \begin_layout Standard
5582 \begin_layout Standard
5594 \begin_layout Standard
5598 \begin_layout Standard
5610 \begin_layout Section
5611 scan_for_matches README
5612 \begin_inset LatexCommand label
5613 name "sec:scan_for_matches-README"
5620 \begin_layout LyX-Code
5624 \begin_layout LyX-Code
5625 A Program to Scan Nucleotide or Protein Sequences for Matching Patterns
5628 \begin_layout LyX-Code
5632 \begin_layout LyX-Code
5636 \begin_layout LyX-Code
5637 Argonne National Laboratory
5640 \begin_layout LyX-Code
5644 \begin_layout LyX-Code
5648 \begin_layout LyX-Code
5649 Scan_for_matches is a utility that we have written to search for
5652 \begin_layout LyX-Code
5653 patterns in DNA and protein sequences.
5654 I wrote most of the code,
5657 \begin_layout LyX-Code
5658 although David Joerg and Morgan Price wrote sections of an
5661 \begin_layout LyX-Code
5663 The whole notion of pattern matching has a rich
5666 \begin_layout LyX-Code
5667 history, and we borrowed liberally from many sources.
5671 \begin_layout LyX-Code
5672 worth noting that we were strongly influenced by the elegant tools
5675 \begin_layout LyX-Code
5676 developed and distributed by David Searls.
5677 My intent is to make the
5680 \begin_layout LyX-Code
5681 existing tool available to anyone in the research community that might
5684 \begin_layout LyX-Code
5686 I will continue to try to fix bugs and make suggested
5689 \begin_layout LyX-Code
5690 enhancements, at least until I feel that a superior tool exists.
5693 \begin_layout LyX-Code
5694 Hence, I would appreciate it if all bug reports and suggestions are
5697 \begin_layout LyX-Code
5698 directed to me at Overbeek@mcs.anl.gov.
5702 \begin_layout LyX-Code
5703 I will try to log all bug fixes and report them to users that send me
5706 \begin_layout LyX-Code
5707 their email addresses.
5708 I do not require that you give me your name
5711 \begin_layout LyX-Code
5713 However, if you do give it to me, I will try to notify
5716 \begin_layout LyX-Code
5717 you of serious problems as they are discovered.
5720 \begin_layout LyX-Code
5724 \begin_layout LyX-Code
5725 The distribution should contain at least the following programs:
5728 \begin_layout LyX-Code
5729 README - This document
5732 \begin_layout LyX-Code
5733 ggpunit.c - One of the two source files
5736 \begin_layout LyX-Code
5737 scan_for_matches.c - The second source file
5740 \begin_layout LyX-Code
5744 \begin_layout LyX-Code
5745 run_tests - A perl script to test things
5748 \begin_layout LyX-Code
5749 show_hits - A handy perl script
5752 \begin_layout LyX-Code
5753 test_dna_input - Test sequences for DNA
5756 \begin_layout LyX-Code
5757 test_dna_patterns - Test patterns for DNA scan
5760 \begin_layout LyX-Code
5761 test_output - Desired output from test
5764 \begin_layout LyX-Code
5765 test_prot_input - Test protein sequences
5768 \begin_layout LyX-Code
5769 test_prot_patterns - Test patterns for proteins
5772 \begin_layout LyX-Code
5773 testit - a perl script used for test
5776 \begin_layout LyX-Code
5777 Only the first three files are required.
5778 The others are useful,
5781 \begin_layout LyX-Code
5782 but only if you have Perl installed on your system.
5786 \begin_layout LyX-Code
5787 have Perl, I suggest that you type
5790 \begin_layout LyX-Code
5794 \begin_layout LyX-Code
5798 \begin_layout LyX-Code
5799 to find out where it installed.
5800 On my system, I get the following
5803 \begin_layout LyX-Code
5807 \begin_layout LyX-Code
5811 \begin_layout LyX-Code
5815 \begin_layout LyX-Code
5819 \begin_layout LyX-Code
5820 indicating that Perl is installed in /usr/local/bin.
5824 \begin_layout LyX-Code
5825 you know where it is installed, edit the first line of files
5828 \begin_layout LyX-Code
5832 \begin_layout LyX-Code
5836 \begin_layout LyX-Code
5837 replacing /usr/local/bin/perl with the appropriate location.
5841 \begin_layout LyX-Code
5842 will assume that you can do this, although it is not critical (it
5845 \begin_layout LyX-Code
5846 is needed only to test the installation and to use the "show_hits"
5849 \begin_layout LyX-Code
5851 Perl is not required to actually install and run
5854 \begin_layout LyX-Code
5859 \begin_layout LyX-Code
5860 If you do not have Perl, I suggest you get it and install it (it
5863 \begin_layout LyX-Code
5864 is a wonderful utility).
5865 Information about Perl and how to get it
5868 \begin_layout LyX-Code
5869 can be found in the book "Programming Perl" by Larry Wall and
5872 \begin_layout LyX-Code
5874 Schwartz, published by O'Reilly & Associates, Inc.
5877 \begin_layout LyX-Code
5878 To get started, you will need to compile the program.
5882 \begin_layout LyX-Code
5886 \begin_layout LyX-Code
5887 gcc -O -o scan_for_matches ggpunit.c scan_for_matches.c
5890 \begin_layout LyX-Code
5891 If you do not use GNU C, use
5894 \begin_layout LyX-Code
5895 cc -O -DCC -o scan_for_matches ggpunit.c scan_for_matches.c
5898 \begin_layout LyX-Code
5899 which works on my Sun.
5903 \begin_layout LyX-Code
5904 Once you have compiled scan_for_matches, you can verify that it
5907 \begin_layout LyX-Code
5911 \begin_layout LyX-Code
5912 clone% run_tests tmp
5915 \begin_layout LyX-Code
5916 clone% diff tmp test_output
5919 \begin_layout LyX-Code
5920 You may get a few strange lines of the sort
5923 \begin_layout LyX-Code
5924 clone% run_tests tmp
5927 \begin_layout LyX-Code
5928 rm: tmp: No such file or directory
5931 \begin_layout LyX-Code
5932 clone% diff tmp test_output
5935 \begin_layout LyX-Code
5936 These should cause no concern.
5937 However, if the "diff" shows that
5940 \begin_layout LyX-Code
5941 tmp and test_output are different, contact me (you have a
5944 \begin_layout LyX-Code
5949 \begin_layout LyX-Code
5950 You should now be able to use scan_for_matches by following the
5953 \begin_layout LyX-Code
5954 instructions given below (which is all the normal user should have
5957 \begin_layout LyX-Code
5958 to understand, once things are installed properly).
5961 \begin_layout LyX-Code
5962 ==============================================================
5965 \begin_layout LyX-Code
5966 How to run scan_for_matches:
5969 \begin_layout LyX-Code
5970 To run the program, you type need to create two files
5973 \begin_layout LyX-Code
5975 the first file contains the pattern you wish to scan for; I'll
5978 \begin_layout LyX-Code
5979 call this file pat_file in what follows (but any name is ok)
5982 \begin_layout LyX-Code
5984 the second file contains a set of sequences to scan.
5988 \begin_layout LyX-Code
5989 should be in "fasta format".
5990 Just look at the contents of
5993 \begin_layout LyX-Code
5994 test_dna_input to see examples of this format.
5998 \begin_layout LyX-Code
5999 each sequence begins with a line of the form
6002 \begin_layout LyX-Code
6006 \begin_layout LyX-Code
6007 and is followed by one or more lines containing the sequence.
6010 \begin_layout LyX-Code
6011 Once these files have been created, you just use
6014 \begin_layout LyX-Code
6015 scan_for_matches pat_file < input_file
6018 \begin_layout LyX-Code
6019 to scan all of the input sequences for the given pattern.
6023 \begin_layout LyX-Code
6024 example, suppose that pat_file contains a single line of the form
6027 \begin_layout LyX-Code
6031 \begin_layout LyX-Code
6035 \begin_layout LyX-Code
6036 scan_for_matches pat_file < test_dna_input
6039 \begin_layout LyX-Code
6040 should produce two "hits".
6041 When I run this on my machine, I get
6044 \begin_layout LyX-Code
6045 clone% scan_for_matches pat_file < test_dna_input
6048 \begin_layout LyX-Code
6052 \begin_layout LyX-Code
6053 cguaacc ggttaacc gguuacg
6056 \begin_layout LyX-Code
6060 \begin_layout LyX-Code
6061 CGUAACC GGTTAACC GGUUACG
6064 \begin_layout LyX-Code
6068 \begin_layout LyX-Code
6069 Simple Patterns Built by Matching Ranges and Reverse Complements
6072 \begin_layout LyX-Code
6073 Let me first explain this simple pattern:
6076 \begin_layout LyX-Code
6080 \begin_layout LyX-Code
6084 \begin_layout LyX-Code
6085 The pattern consists of three "pattern units" separated by spaces.
6088 \begin_layout LyX-Code
6089 The first pattern unit is
6092 \begin_layout LyX-Code
6096 \begin_layout LyX-Code
6097 which means "match 4 to 7 characters and call them p1".
6101 \begin_layout LyX-Code
6102 second pattern unit is
6105 \begin_layout LyX-Code
6109 \begin_layout LyX-Code
6110 which means "then match 3 to 8 characters".
6111 The last pattern unit
6114 \begin_layout LyX-Code
6118 \begin_layout LyX-Code
6122 \begin_layout LyX-Code
6123 which means "match the reverse complement of p1".
6127 \begin_layout LyX-Code
6128 reported hit is shown as
6131 \begin_layout LyX-Code
6135 \begin_layout LyX-Code
6136 cguaacc ggttaacc gguuacg
6139 \begin_layout LyX-Code
6140 which states that characters 6 through 27 of sequence tst1 were
6143 \begin_layout LyX-Code
6145 "cguaac" matched the first pattern unit, "ggttaacc" the
6148 \begin_layout LyX-Code
6149 second, and "gguuacg" the third.
6150 This is an example of a common
6153 \begin_layout LyX-Code
6154 type of pattern used to search for sections of DNA or RNA that
6157 \begin_layout LyX-Code
6158 would fold into a hairpin loop.
6161 \begin_layout LyX-Code
6162 Searching Both Strands
6165 \begin_layout LyX-Code
6166 Now for a short aside: scan_for_matches only searched the
6169 \begin_layout LyX-Code
6170 sequences in the input file; it did not search the opposite
6173 \begin_layout LyX-Code
6175 With a pattern of the sort we just used, there is not
6178 \begin_layout LyX-Code
6179 need o search the opposite strand.
6180 However, it is normally the
6183 \begin_layout LyX-Code
6184 case that you will wish to search both the sequence and the
6187 \begin_layout LyX-Code
6188 opposite strand (i.e., the reverse complement of the sequence).
6191 \begin_layout LyX-Code
6192 To do that, you would just use the "-c" command line.
6196 \begin_layout LyX-Code
6197 scan_for_matches -c pat_file < test_dna_input
6200 \begin_layout LyX-Code
6201 Hits on the opposite strand will show a beginning location greater
6204 \begin_layout LyX-Code
6205 than te end location of the match.
6208 \begin_layout LyX-Code
6209 Defining Pairing Rules and Allowing Mismatches, Insertions, and Deletions
6212 \begin_layout LyX-Code
6213 Let us stop now and ask "What additional features would one need to
6216 \begin_layout LyX-Code
6217 really find the kinds of loop structures that characterize tRNAs,
6220 \begin_layout LyX-Code
6221 rRNAs, and so forth?" I can immediately think of two:
6224 \begin_layout LyX-Code
6225 a) you will need to be able to allow non-standard pairings
6228 \begin_layout LyX-Code
6229 (those other than G-C and A-U), and
6232 \begin_layout LyX-Code
6233 b) you will need to be able to tolerate some number of
6236 \begin_layout LyX-Code
6237 mismatches and bulges.
6240 \begin_layout LyX-Code
6244 \begin_layout LyX-Code
6245 Let me first show you how to handle non-standard "rules for
6248 \begin_layout LyX-Code
6249 pairing in reverse complements".
6250 Consider the following pattern,
6253 \begin_layout LyX-Code
6254 which I show as two line (you may use as many lines as you like in
6257 \begin_layout LyX-Code
6258 forming a pattern, although you can only break a pattern at points
6261 \begin_layout LyX-Code
6262 where space would be legal):
6265 \begin_layout LyX-Code
6266 r1={au,ua,gc,cg,gu,ug,ga,ag}
6269 \begin_layout LyX-Code
6270 p1=2...3 0...4 p2=2...5 1...5 r1~p2 0...4 ~p1
6273 \begin_layout LyX-Code
6274 The first "pattern unit" does not actually match anything; rather,
6277 \begin_layout LyX-Code
6278 it defines a "pairing rule" in which standard pairings are
6281 \begin_layout LyX-Code
6282 allowed, as well as G-A and A-G (in case you wondered, Us and Ts
6285 \begin_layout LyX-Code
6286 and upper and lower case can be used interchangably; for example
6289 \begin_layout LyX-Code
6290 r1={AT,UA,gc,cg} could be used to define the "standard rule" for
6293 \begin_layout LyX-Code
6295 The second line consists of six pattern units which
6298 \begin_layout LyX-Code
6299 may be interpreted as follows:
6302 \begin_layout LyX-Code
6303 p1=2...3 match 2 or 3 characters (call it p1)
6306 \begin_layout LyX-Code
6307 0...4 match 0 to 4 characters
6310 \begin_layout LyX-Code
6311 p2=2...5 match 2 to 5 characters (call it p2)
6314 \begin_layout LyX-Code
6315 1...5 match 1 to 5 characters
6318 \begin_layout LyX-Code
6319 r1~p2 match the reverse complement of p2,
6322 \begin_layout LyX-Code
6323 allowing G-A and A-G pairs
6326 \begin_layout LyX-Code
6327 0...4 match 0 to 4 characters
6330 \begin_layout LyX-Code
6331 ~p1 match the reverse complement of p1
6334 \begin_layout LyX-Code
6335 allowing only G-C, C-G, A-T, and T-A pairs
6338 \begin_layout LyX-Code
6339 Thus, r1~p2 means "match the reverse complement of p2 using rule r1".
6342 \begin_layout LyX-Code
6343 Now let us consider the issue of tolerating mismatches and bulges.
6346 \begin_layout LyX-Code
6347 You may add a "qualifier" to the pattern unit that gives the
6350 \begin_layout LyX-Code
6351 tolerable number of "mismatches, deletions, and insertions".
6354 \begin_layout LyX-Code
6358 \begin_layout LyX-Code
6359 p1=10...10 3...8 ~p1[1,2,1]
6362 \begin_layout LyX-Code
6363 means that the third pattern unit must match 10 characters,
6366 \begin_layout LyX-Code
6367 allowing one "mismatch" (a pairing other than G-C, C-G, A-T, or
6370 \begin_layout LyX-Code
6371 T-A), two deletions (a deletion is a character that occurs in p1,
6374 \begin_layout LyX-Code
6375 but has been "deleted" from the string matched by ~p1), and one
6378 \begin_layout LyX-Code
6379 insertion (an "insertion" is a character that occurs in the string
6382 \begin_layout LyX-Code
6383 matched by ~p1, but not for which no corresponding character
6386 \begin_layout LyX-Code
6388 In this case, the pattern would match
6391 \begin_layout LyX-Code
6392 ACGTACGTAC GGGGGGGG GCGTTACCT
6395 \begin_layout LyX-Code
6396 which is, you must admit, a fairly weak loop.
6400 \begin_layout LyX-Code
6401 allow mismatches, but you will find yourself using insertions and
6404 \begin_layout LyX-Code
6405 deletions much more rarely.
6406 In any event, you should note that
6409 \begin_layout LyX-Code
6410 allowing mismatches, insertions, and deletions does force the
6413 \begin_layout LyX-Code
6414 program to try many additional possible pairings, so it does slow
6417 \begin_layout LyX-Code
6421 \begin_layout LyX-Code
6422 How Patterns Are Matched
6425 \begin_layout LyX-Code
6426 Now is as good a time as any to discuss the basic flow of control
6429 \begin_layout LyX-Code
6430 when matching patterns.
6431 Recall that a "pattern" is a sequence of
6434 \begin_layout LyX-Code
6436 Suppose that the pattern units were
6439 \begin_layout LyX-Code
6444 \begin_layout LyX-Code
6445 The scan of a sequence S begins by setting the current position
6448 \begin_layout LyX-Code
6450 Then, an attempt is made to match u1 starting at the
6453 \begin_layout LyX-Code
6455 Each attempt to match a pattern unit can
6458 \begin_layout LyX-Code
6460 If it succeeds, then an attempt is made to match
6463 \begin_layout LyX-Code
6465 If it fails, then an attempt is made to find an
6468 \begin_layout LyX-Code
6469 alternative match for the immediately preceding pattern unit.
6473 \begin_layout LyX-Code
6474 this succeeds, then we proceed forward again to the next unit.
6478 \begin_layout LyX-Code
6479 it fails we go back to the preceding unit.
6480 This process is called
6483 \begin_layout LyX-Code
6485 If there are no previous units, then the current
6488 \begin_layout LyX-Code
6489 position is incremented by one, and everything starts again.
6493 \begin_layout LyX-Code
6494 proceeds until either the current position goes past the end of
6497 \begin_layout LyX-Code
6498 the sequence or all of the pattern units succeed.
6502 \begin_layout LyX-Code
6503 scan_for_matches reports the "hit", the current position is set
6506 \begin_layout LyX-Code
6507 just past the hit, and an attempt is made to find another hit.
6510 \begin_layout LyX-Code
6511 If you wish to limit the scan to simply finding a maximum of, say,
6514 \begin_layout LyX-Code
6515 10 hits, you can use the -n option (-n 10 would set the limit to
6518 \begin_layout LyX-Code
6523 \begin_layout LyX-Code
6524 scan_for_matches -c -n 1 pat_file < test_dna_input
6527 \begin_layout LyX-Code
6528 would search for just the first hit (and would stop searching the
6531 \begin_layout LyX-Code
6532 current sequences or any that follow in the input file).
6535 \begin_layout LyX-Code
6536 Searching for repeats:
6539 \begin_layout LyX-Code
6540 In the last section, I discussed almost all of the details
6543 \begin_layout LyX-Code
6544 required to allow you to look for repeats.
6545 Consider the following
6548 \begin_layout LyX-Code
6552 \begin_layout LyX-Code
6553 p1=6...6 3...8 p1 (find exact 6 character repeat separated
6556 \begin_layout LyX-Code
6560 \begin_layout LyX-Code
6561 p1=6...6 3..8 p1[1,0,0] (allow one mismatch)
6564 \begin_layout LyX-Code
6565 p1=3...3 p1[1,0,0] p1[1,0,0] p1[1,0,0]
6568 \begin_layout LyX-Code
6569 (match 12 characters that are the remains
6572 \begin_layout LyX-Code
6573 of a 3-character sequence occurring 4 times)
6576 \begin_layout LyX-Code
6580 \begin_layout LyX-Code
6581 p1=4...8 0...3 p2=6...8 p1 0...3 p2
6584 \begin_layout LyX-Code
6585 (This would match things like
6588 \begin_layout LyX-Code
6589 ATCT G TCTTT ATCT TG TCTTT
6592 \begin_layout LyX-Code
6596 \begin_layout LyX-Code
6597 Searching for particular sequences:
6600 \begin_layout LyX-Code
6601 Occasionally, one wishes to match a specific, known sequence.
6604 \begin_layout LyX-Code
6605 In such a case, you can just give the sequence (along with an
6608 \begin_layout LyX-Code
6609 optional statement of the allowable mismatches, insertions, and
6612 \begin_layout LyX-Code
6617 \begin_layout LyX-Code
6618 p1=6...8 GAGA ~p1 (match a hairpin with GAGA as the loop)
6621 \begin_layout LyX-Code
6622 RRRRYYYY (match 4 purines followed by 4 pyrimidines)
6625 \begin_layout LyX-Code
6626 TATAA[1,0,0] (match TATAA, allowing 1 mismatch)
6629 \begin_layout LyX-Code
6633 \begin_layout LyX-Code
6634 Matches against a "weight matrix":
6637 \begin_layout LyX-Code
6638 I will conclude my examples of the types of pattern units
6641 \begin_layout LyX-Code
6642 available for matching against nucleotide sequences by discussing a
6645 \begin_layout LyX-Code
6646 crude implemetation of matching using a "weight matrix".
6650 \begin_layout LyX-Code
6651 am less than overwhelmed with the syntax that I chose, I think that
6654 \begin_layout LyX-Code
6655 the reader should be aware that I was thinking of generating
6658 \begin_layout LyX-Code
6659 patterns containing such pattern units automatically from
6662 \begin_layout LyX-Code
6663 alignments (and did not really plan on typing such things in by
6666 \begin_layout LyX-Code
6668 Anyway, suppose that you wanted to match a
6671 \begin_layout LyX-Code
6672 sequence of eight characters.
6673 The "consensus" of these eight
6676 \begin_layout LyX-Code
6677 characters is GRCACCGS, but the actual "frequencies of occurrence"
6680 \begin_layout LyX-Code
6681 are given in the matrix below.
6682 Thus, the first character is an A
6685 \begin_layout LyX-Code
6686 16% the time and a G 84% of the time.
6687 The second is an A 57% of
6690 \begin_layout LyX-Code
6691 the time, a C 10% of the time, a G 29% of the time, and a T 4% of
6694 \begin_layout LyX-Code
6699 \begin_layout LyX-Code
6700 C1 C2 C3 C4 C5 C6 C7 C8
6703 \begin_layout LyX-Code
6707 \begin_layout LyX-Code
6708 A 16 57 0 95 0 18 0 0
6711 \begin_layout LyX-Code
6712 C 0 10 80 0 100 60 0 50
6715 \begin_layout LyX-Code
6716 G 84 29 0 0 0 20 100 50
6719 \begin_layout LyX-Code
6723 \begin_layout LyX-Code
6727 \begin_layout LyX-Code
6728 One could use the following pattern unit to search for inexact
6731 \begin_layout LyX-Code
6732 matches related to such a "weight matrix":
6735 \begin_layout LyX-Code
6736 {(16,0,84,0),(57,10,29,4),(0,80,0,20),(95,0,0,5),
6739 \begin_layout LyX-Code
6740 (0,100,0,0),(18,60,20,2),(0,0,100,0),(0,50,50,0)} > 450
6743 \begin_layout LyX-Code
6744 This pattern unit will attempt to match exactly eight characters.
6747 \begin_layout LyX-Code
6748 For each character in the sequence, the entry in the corresponding
6751 \begin_layout LyX-Code
6752 tuple is added to an accumulated sum.
6753 If the sum is greater than
6756 \begin_layout LyX-Code
6757 450, the match succeeds; else it fails.
6760 \begin_layout LyX-Code
6761 Recently, this feature was upgraded to allow ranges.
6765 \begin_layout LyX-Code
6766 600 > {(16,0,84,0),(57,10,29,4),(0,80,0,20),(95,0,0,5),
6769 \begin_layout LyX-Code
6770 (0,100,0,0),(18,60,20,2),(0,0,100,0),(0,50,50,0)} > 450
6773 \begin_layout LyX-Code
6777 \begin_layout LyX-Code
6778 Allowing Alternatives:
6781 \begin_layout LyX-Code
6782 Very occasionally, you may wish to allow alternative pattern units
6785 \begin_layout LyX-Code
6786 (i.e., "match either A or B").
6787 You can do this using something
6790 \begin_layout LyX-Code
6794 \begin_layout LyX-Code
6798 \begin_layout LyX-Code
6799 which says "match either GAGA or GCGCA".
6803 \begin_layout LyX-Code
6804 alternatives of a list of pattern units, for example
6807 \begin_layout LyX-Code
6808 (p1=3...3 3...8 ~p1 | p1=5...5 4...4 ~p1 GGG)
6811 \begin_layout LyX-Code
6812 would match one of two sequences of pattern units.
6816 \begin_layout LyX-Code
6817 clumsy aspect of the syntax: to match a list of alternatives, you
6820 \begin_layout LyX-Code
6821 need to fully the request.
6825 \begin_layout LyX-Code
6826 (GAGA | (GCGCA | TTCGA))
6829 \begin_layout LyX-Code
6830 would be needed to try the three alternatives.
6833 \begin_layout LyX-Code
6837 \begin_layout LyX-Code
6838 Sometimes a pattern will contain a sequence of distinct ranges,
6841 \begin_layout LyX-Code
6842 and you might wish to limit the sum of the lengths of the matched
6845 \begin_layout LyX-Code
6847 For example, suppose that you basically wanted to
6850 \begin_layout LyX-Code
6851 match something like
6854 \begin_layout LyX-Code
6855 ARRYYTT p1=0...5 GCA[1,0,0] p2=1...6 ~p1 4...8 ~p2 p3=4...10 CCT
6858 \begin_layout LyX-Code
6859 but that the sum of the lengths of p1, p2, and p3 must not exceed
6862 \begin_layout LyX-Code
6864 To do this, you could add
6867 \begin_layout LyX-Code
6868 length(p1+p2+p3) < 9
6871 \begin_layout LyX-Code
6872 as the last pattern unit.
6873 It will just succeed or fail (but does
6876 \begin_layout LyX-Code
6877 not actually match any characters in the sequence).
6880 \begin_layout LyX-Code
6884 \begin_layout LyX-Code
6885 Matching Protein Sequences
6888 \begin_layout LyX-Code
6889 Suppose that the input file contains protein sequences.
6893 \begin_layout LyX-Code
6894 case, you must invoke scan_for_matches with the "-p" option.
6898 \begin_layout LyX-Code
6899 cannot use aspects of the language that relate directly to
6902 \begin_layout LyX-Code
6903 nucleotide sequences (e.g., the -c command line option or pattern
6906 \begin_layout LyX-Code
6907 constructs referring to the reverse complement of a previously
6910 \begin_layout LyX-Code
6915 \begin_layout LyX-Code
6916 You also have two additional constructs that allow you to match
6919 \begin_layout LyX-Code
6920 either "one of a set of amino acids" or "any amino acid other than
6923 \begin_layout LyX-Code
6928 \begin_layout LyX-Code
6929 p1=0...4 any(HQD) 1...3 notany(HK) p1
6932 \begin_layout LyX-Code
6933 would successfully match a string like
6936 \begin_layout LyX-Code
6940 \begin_layout LyX-Code
6941 Using the show_hits Utility
6944 \begin_layout LyX-Code
6945 When viewing a large set of complex matches, you might find it
6948 \begin_layout LyX-Code
6949 convenient to post-process the scan_for_matches output to get a
6952 \begin_layout LyX-Code
6953 more readable version.
6954 We provide a simple post-processor called
6957 \begin_layout LyX-Code
6959 To see its effect, just pipe the output of a
6962 \begin_layout LyX-Code
6963 scan_for_matches into show_hits:
6966 \begin_layout LyX-Code
6970 \begin_layout LyX-Code
6971 clone% scan_for_matches -c pat_file < tmp
6974 \begin_layout LyX-Code
6978 \begin_layout LyX-Code
6979 gtacguaacc ggttaac cgguuacgtac
6982 \begin_layout LyX-Code
6986 \begin_layout LyX-Code
6987 gtacgtaacc ggttaac cggttacgtac
6990 \begin_layout LyX-Code
6994 \begin_layout LyX-Code
6995 CGTACGUAAC C GGTTAACC GGUUACGTACG
6998 \begin_layout LyX-Code
7002 \begin_layout LyX-Code
7003 CGTACGTAAC C GGTTAACC GGTTACGTACG
7006 \begin_layout LyX-Code
7010 \begin_layout LyX-Code
7011 gtacguaacc g gttaactt cgguuacgtac
7014 \begin_layout LyX-Code
7018 \begin_layout LyX-Code
7019 gtacgtaacc g aagttaac cggttacgtac
7022 \begin_layout LyX-Code
7023 Piped Through show_hits:
7026 \begin_layout LyX-Code
7030 \begin_layout LyX-Code
7031 clone% scan_for_matches -c pat_file < tmp | show_hits
7034 \begin_layout LyX-Code
7035 tst1:[1,28]: gtacguaacc ggttaac cgguuacgtac
7038 \begin_layout LyX-Code
7039 tst1:[28,1]: gtacgtaacc ggttaac cggttacgtac
7042 \begin_layout LyX-Code
7043 tst2:[2,31]: CGTACGUAAC C GGTTAACC GGUUACGTACG
7046 \begin_layout LyX-Code
7047 tst2:[31,2]: CGTACGTAAC C GGTTAACC GGTTACGTACG
7050 \begin_layout LyX-Code
7051 tst3:[3,32]: gtacguaacc g gttaactt cgguuacgtac
7054 \begin_layout LyX-Code
7055 tst3:[32,3]: gtacgtaacc g aagttaac cggttacgtac
7058 \begin_layout LyX-Code
7062 \begin_layout LyX-Code
7063 Optionally, you can specify which of the "fields" in the matches
7066 \begin_layout LyX-Code
7067 you wish to sort on, and show_hits will sort them.
7071 \begin_layout LyX-Code
7072 numbers start with 0.
7073 So, you might get something like
7076 \begin_layout LyX-Code
7077 clone% scan_for_matches -c pat_file < tmp | show_hits 2 1
7080 \begin_layout LyX-Code
7081 tst2:[2,31]: CGTACGUAAC C GGTTAACC GGUUACGTACG
7084 \begin_layout LyX-Code
7085 tst2:[31,2]: CGTACGTAAC C GGTTAACC GGTTACGTACG
7088 \begin_layout LyX-Code
7089 tst3:[32,3]: gtacgtaacc g aagttaac cggttacgtac
7092 \begin_layout LyX-Code
7093 tst1:[1,28]: gtacguaacc ggttaac cgguuacgtac
7096 \begin_layout LyX-Code
7097 tst1:[28,1]: gtacgtaacc ggttaac cggttacgtac
7100 \begin_layout LyX-Code
7101 tst3:[3,32]: gtacguaacc g gttaactt cgguuacgtac
7104 \begin_layout LyX-Code
7108 \begin_layout LyX-Code
7109 In this case, the hits have been sorted on fields 2 and 1 (that is,
7112 \begin_layout LyX-Code
7113 the third and second matched subfields).
7116 \begin_layout LyX-Code
7117 show_hits is just one possible little post-processor, and you
7120 \begin_layout LyX-Code
7121 might well wish to write a customized one for yourself.
7124 \begin_layout LyX-Code
7125 Reducing the Cost of a Search
7128 \begin_layout LyX-Code
7129 The scan_for_matches utility uses a fairly simple search, and may
7132 \begin_layout LyX-Code
7133 consume large amounts of CPU time for complex patterns.
7137 \begin_layout LyX-Code
7138 I may decide to optimize the code.
7139 However, until then, let me
7142 \begin_layout LyX-Code
7143 mention one useful technique.
7147 \begin_layout LyX-Code
7148 When you have a complex pattern that includes a number of varying
7151 \begin_layout LyX-Code
7152 ranges, imprecise matches, and so forth, it is useful to
7155 \begin_layout LyX-Code
7157 That is, form a simpler pattern that can be
7160 \begin_layout LyX-Code
7161 used to scan through a large database extracting sections that
7164 \begin_layout LyX-Code
7165 might be matched by the more complex pattern.
7169 \begin_layout LyX-Code
7170 with a short example.
7171 Suppose that you really wished to match the
7174 \begin_layout LyX-Code
7178 \begin_layout LyX-Code
7179 p1=3...5 0...8 ~p1[1,1,0] p2=6...7 3...6 AGC 3...5 RYGC ~p2[1,0,0]
7182 \begin_layout LyX-Code
7183 In this case, the pattern units AGC 3...5 RYGC can be used to rapidly
7186 \begin_layout LyX-Code
7187 constrain the overall search.
7188 You can preprocess the overall
7191 \begin_layout LyX-Code
7192 database using the pattern:
7195 \begin_layout LyX-Code
7196 31...31 AGC 3...5 RYGC 7...7
7199 \begin_layout LyX-Code
7200 Put the complex pattern in pat_file1 and the simpler pattern in
7203 \begin_layout LyX-Code
7208 \begin_layout LyX-Code
7209 scan_for_matches -c pat_file2 < nucleotide_database |
7212 \begin_layout LyX-Code
7213 scan_for_matches pat_file1
7216 \begin_layout LyX-Code
7217 The output will show things like
7220 \begin_layout LyX-Code
7221 >seqid:[232,280][2,47]
7224 \begin_layout LyX-Code
7228 \begin_layout LyX-Code
7229 Then, the actual section of the sequence that was matched can be
7232 \begin_layout LyX-Code
7233 easily computed as [233,278] (remember, the positions start from
7236 \begin_layout LyX-Code
7240 \begin_layout LyX-Code
7241 Let me finally add, you should do a few short experiments to see
7244 \begin_layout LyX-Code
7245 whether or not such pipelining actually improves performance -- it
7248 \begin_layout LyX-Code
7249 is not always obvious where the time is going, and I have
7252 \begin_layout LyX-Code
7253 sometimes found that the added complexity of pipelining actually
7256 \begin_layout LyX-Code
7258 It gets its best improvements when there are
7261 \begin_layout LyX-Code
7262 exact matches of more than just a few characters that can be
7265 \begin_layout LyX-Code
7266 rapidly used to eliminate large sections of the database.
7269 \begin_layout LyX-Code
7273 \begin_layout LyX-Code
7277 \begin_layout LyX-Code
7278 Feb 9, 1995: the pattern units ^ and $ now work as in normal regular
7281 \begin_layout LyX-Code
7286 \begin_layout LyX-Code
7290 \begin_layout LyX-Code
7291 matches only TTF at the end of the string and
7294 \begin_layout LyX-Code
7298 \begin_layout LyX-Code
7299 matches only an initial TTF
7302 \begin_layout LyX-Code
7306 \begin_layout LyX-Code
7310 \begin_layout LyX-Code
7311 matches the reverse of the string named p1.
7315 \begin_layout LyX-Code
7316 if p1 matched GCAT, then <p1 would match TACG.
7320 \begin_layout LyX-Code
7324 \begin_layout LyX-Code
7325 matches a real palindrome (not the biologically common
7328 \begin_layout LyX-Code
7329 meaning of "reverse complement")
7332 \begin_layout LyX-Code