1 #LyX 1.5.1 created this file. For more info see http://www.lyx.org/
7 \usepackage[colorlinks=true, urlcolor=blue, linkcolor=black]{hyperref}
13 \font_typewriter default
14 \font_default_family default
20 \paperfontsize default
28 \paperorientation portrait
31 \paragraph_separation skip
33 \quotes_language english
36 \paperpagestyle default
37 \tracking_changes false
53 \begin_layout Publishers
56 Institute for Molecular Bioscience
58 University of Queensland
63 E-mail: mail@maasha.dk
66 \begin_layout Standard
70 \begin_layout Standard
82 \begin_layout Standard
88 \begin_layout Standard
89 \begin_inset LatexCommand tableofcontents
96 \begin_layout Standard
97 \begin_inset FloatList figure
104 \begin_layout Standard
110 \begin_layout Section
114 \begin_layout Standard
115 Biotools is a selection of simple tools that can be linked together (piped
116 as we shall call it) in a very flexible manner to perform both simple and
118 The fundamental idea is that biotools work on a data stream that will only
119 terminate at the end of an analysis and that this data stream can be passed
120 through several different biotools, each performing one specific task.
121 The advantage of this approach is that a user can perform simple and complex
122 tasks without having to write advanced code.
123 Moreover, since the data format used to pass data between biotools is text
124 based, biotools can be written by different developers in their favorite
125 programming language --- and still the biotools will be able to work together.
128 \begin_layout Standard
129 In the most simple form bioools can be piped together on the command line
130 like this (using the pipe character '|'):
133 \begin_layout LyX-Code
134 read_data | calculate_something | write_result
137 \begin_layout Standard
138 However, a more comprehensive analysis could be composed:
141 \begin_layout LyX-Code
142 read_data | select_entries | convert_entries | search_database
145 \begin_layout LyX-Code
146 evaluate_results | plot_diagram | plot_another_diagram |
149 \begin_layout LyX-Code
153 \begin_layout Standard
154 The data stream that is piped through the biotools consists of records of
155 key/value pairs in the same way a hash does in order to keep as simple
156 a structure as possible.
157 An example record can be seen below:
160 \begin_layout LyX-Code
164 \begin_layout LyX-Code
170 \begin_layout LyX-Code
176 \begin_layout LyX-Code
182 \begin_layout LyX-Code
188 \begin_layout LyX-Code
194 \begin_layout LyX-Code
200 \begin_layout LyX-Code
206 \begin_layout LyX-Code
212 \begin_layout LyX-Code
218 \begin_layout Standard
223 \begin_layout Standard
234 ' denotes the delimiter of the records, and each key is a word followed
235 by a ':' and a white-space and then the value.
236 By convention the biotools only uses upper case keys (a list of used keys
237 can be seen in Appendix\InsetSpace ~
239 \begin_inset LatexCommand ref
245 Since the records basically are hash structures this mean that the order
246 of the keys in the stream is unordered, and in the above example it is
247 pure coincidence that HIT_BEG is displayed before HIT_END, however, when
248 the order of the keys is importent, the biotools will automagically see
252 \begin_layout Standard
253 All of the biotools are able to read and write a data stream to and from
254 file as long as the records are in the biotools format.
255 This means that if you are undertaking a lengthy analysis where one of
256 the steps is time consuming, you may save the stream after this step, and
257 subsequently start one or more analysis from that last step
261 \begin_layout Standard
262 It is a goal that the biotools at some point will be able to dump the data
263 stream to file in case one of the tools fail critically.
269 If you are running a lengthy analysis it is highly recommended that you
270 create a small test sample of the data and run that through the pipeline
271 --- and once you are satisfied with the result proceed with the full data
272 set (see\InsetSpace ~
274 \begin_inset LatexCommand ref
275 reference "sub:How-to-select-a-few-records"
282 \begin_layout Standard
283 All of the biotools can be supplied with long arguments prefixed with
287 \begin_layout Standard
296 switches or single character arguments prefixed with - switches that can
297 be grouped together (e.g.
299 In this cookbook only the long switches are used to emphasize what these
303 \begin_layout Section
307 \begin_layout Standard
308 In order to get the biotools to work, you need to add environment settings
309 to include the code binaries, scripts, and modules that constitute the
311 Assuming that you are using bash, add the following to your '~/.bashrc'
312 file using your favorite editor.
313 After the changes has been saved you need to either run 'source ~/.bashrc'
317 \begin_layout LyX-Code
320 if [ -f "/home/m.hansen/maasha/conf/bashrc" ]; then
323 \begin_layout LyX-Code
326 source "/home/m.hansen/maasha/conf/bashrc"
329 \begin_layout LyX-Code
335 \begin_layout Section
339 \begin_layout Standard
344 lists all the biotools along with a description:
347 \begin_layout LyX-Code
353 \begin_layout LyX-Code
356 align_seq Align sequences in stream using Muscle.
359 \begin_layout LyX-Code
362 analyze_seq Analysis the residue composition of each sequence
366 \begin_layout LyX-Code
369 analyze_vals Determine type, count, min, max, sum and mean for
373 \begin_layout LyX-Code
376 blast_seq BLAST sequences in stream against a specified database.
379 \begin_layout LyX-Code
382 blat_seq BLAT sequences in stream against a specified genome.
385 \begin_layout LyX-Code
388 complement_seq Complement sequences in stream.
391 \begin_layout LyX-Code
394 count_records Count the number of records in stream.
397 \begin_layout LyX-Code
400 count_seq Count sequences in stream.
403 \begin_layout LyX-Code
406 count_vals Count the number of times values of given keys exists
410 \begin_layout LyX-Code
413 create_blast_db Create a BLAST database from sequences in stream for
417 \begin_layout LyX-Code
423 \begin_layout Standard
424 To list the biotools for writing different formats, you can use unix's grep
428 \begin_layout LyX-Code
431 list_biotools | grep write
434 \begin_layout LyX-Code
437 write_align Write aligned sequences in pretty alignment format.
440 \begin_layout LyX-Code
443 write_bed Write records from stream as BED lines.
446 \begin_layout LyX-Code
449 write_blast Write BLAST records from stream in BLAST tabular format
453 \begin_layout LyX-Code
456 write_fasta Write sequences in FASTA format.
459 \begin_layout LyX-Code
462 write_psl Write records from stream in PSL format.
465 \begin_layout LyX-Code
468 write_tab Write records from stream as tab separated table.
471 \begin_layout Standard
472 In order to find out how a specific biotool works, you just type the program
473 name without any arguments and press return and the usage of the biotool
483 \begin_layout Standard
484 \begin_inset Box Frameless
493 height_special "totalheight"
496 \begin_layout LyX-Code
499 Program name: read_fasta
502 \begin_layout LyX-Code
505 Author: Martin Asser Hansen - Copyright (C) - All rights reserved
508 \begin_layout LyX-Code
511 Contact: mail@maasha.dk
514 \begin_layout LyX-Code
520 \begin_layout LyX-Code
523 License: GNU General Public License version 2 (http://www.gnu.org/copyleft/
527 \begin_layout LyX-Code
530 Description: Read FASTA entries.
533 \begin_layout LyX-Code
536 Usage: read_fasta [options] -i <FASTA file(s)>
539 \begin_layout LyX-Code
545 \begin_layout LyX-Code
548 [-i <file(s)> | --data_in=<file(s)>] - Comma separated list of files
552 \begin_layout LyX-Code
555 [-n <int> | --num=<int>] - Limit number of records to read.
558 \begin_layout LyX-Code
561 [-I <file> | --stream_in=<file>] - Read input stream from file
565 \begin_layout LyX-Code
568 [-O <file> | --stream_out=<file>] - Write output stream to file
572 \begin_layout LyX-Code
576 \begin_layout LyX-Code
582 \begin_layout LyX-Code
585 read_fasta -i test.fna - Read FASTA entries from file.
588 \begin_layout LyX-Code
591 read_fasta -i test1.fna,test2,fna - Read FASTA entries from files.
594 \begin_layout LyX-Code
597 read_fasta -i '*.fna' - Read FASTA entries from files.
600 \begin_layout LyX-Code
603 read_fasta -i test.fna -n 10 - Read first 10 FASTA entries from
612 \begin_layout Section
616 \begin_layout Subsection
617 How to read the data stream from file?
618 \begin_inset LatexCommand label
619 name "sub:How-to-read-stream"
626 \begin_layout Standard
627 You want to read a data stream that you previously have saved to file in
629 This can be done implicetly or explicitly.
630 The implicit way uses the 'stdout' stream of the Unix terminal:
633 \begin_layout LyX-Code
637 \begin_layout Standard
638 cat is the Unix command that reads a file and output the result to 'stdout'
639 --- which in this case is piped to any biotool represented by the <biotool>.
640 It is also possible to read the data stream using '<' to direct the 'stdout'
641 stream into the biotool like this:
644 \begin_layout LyX-Code
648 \begin_layout Standard
649 However, that will not work if you pipe more biotools together.
650 Then it is much safer to read the stream from a file explicitly like this:
653 \begin_layout LyX-Code
654 <biotool> --stream_in=<file>
657 \begin_layout Standard
658 Here the filename <file> is explicetly given to the biotool <biotool> with
663 \begin_layout Standard
673 This switch works with all biotools.
674 It is also possible to read in data from multiple sources by repeating
675 the explicit read step:
678 \begin_layout LyX-Code
679 <biotool> --stream_in=<file1> | <biotool> --stream_in=<file2>
682 \begin_layout Subsection
683 How to write the data stream to file?
684 \begin_inset LatexCommand label
685 name "sub:How-to-write-stream"
692 \begin_layout Standard
693 In order to save the output stream from a biotool to file, so you can read
694 in the stream again at a later time, you can do one of two things:
697 \begin_layout LyX-Code
701 \begin_layout Standard
702 All, the biotools write the data stream to 'stdout' by default which can
703 be written to a file by redirecting 'stdout' to file using '>' , however,
704 if one of the biotools for writing other formats is used then the both
705 the biotools records as well as the result output will go to 'stdout' in
706 a mixture causing havock! To avoid this you must use the switch
710 \begin_layout Standard
719 stream_out that explictly tells the biotool to write the output stream to
723 \begin_layout LyX-Code
724 <biotool> --stream_out=<file>
727 \begin_layout Standard
732 \begin_layout Standard
741 stream_out switch works with all biotools.
744 \begin_layout Subsection
745 How to terminate the data stream?
748 \begin_layout Standard
749 The data stream is never stops unless the user want to save the stream or
754 \begin_layout Standard
763 no_stream switch that will terminate the stream:
766 \begin_layout LyX-Code
767 <biotool> --no_stream
770 \begin_layout Standard
775 \begin_layout Standard
784 no_stream switch only works with those biotools where it makes sense that
785 the user might want to terminale the data stream,
790 after an analysis step where the user wants to output the result, but not
794 \begin_layout Subsection
795 How to write my results to file?
796 \begin_inset LatexCommand label
797 name "sub:How-to-write-result"
804 \begin_layout Standard
805 Saving the result of an analysis to file can be done implicitly or explicitly.
809 \begin_layout LyX-Code
810 <biotool> --no_stream > <file>
813 \begin_layout Standard
814 If you use '>' to redirect 'stdout' to file then it is important to use
819 \begin_layout Standard
828 no_stream switch to avoid writing a mix of biotools records and result to
829 the same file causing havock.
830 The safe way is to use the
834 \begin_layout Standard
843 result_out switch which explicetly tells the biotool to write the result
847 \begin_layout LyX-Code
848 <biotool> --result_out=<file>
851 \begin_layout Standard
852 Using the above method will not terminate the stream, so it is possible
853 to pipe that into another biotool generating different results:
856 \begin_layout LyX-Code
857 <biotool1> --result_out=<file1> | <biotool2> --result_out=<file2>
860 \begin_layout Standard
861 And still the data stream will continue unless terminated with
865 \begin_layout Standard
877 \begin_layout LyX-Code
878 <biotool> --result_out=<file> --no_stream
881 \begin_layout Standard
882 Or written to file using implicitly or explicity
883 \begin_inset LatexCommand eqref
884 reference "sub:How-to-write-result"
892 \begin_layout LyX-Code
893 <biotool> --result_out=<file1> --stream_out=<file2>
896 \begin_layout Subsection
897 How to read data from multiple sources?
900 \begin_layout Standard
901 To read multiple data sources, with the same type or different type of data
905 \begin_layout LyX-Code
906 <biotool1> --data_in=<file1> | <biotool2> --data_in=<file2>
909 \begin_layout Standard
910 where type is the data type a specific biotool reads.
913 \begin_layout Section
917 \begin_layout Subsection
918 How to read biotools input?
921 \begin_layout Standard
923 \begin_inset LatexCommand eqref
924 reference "sub:How-to-read-stream"
931 \begin_layout Subsection
935 \begin_layout Standard
936 Data in different formats can be read with the appropriate biotool for that
938 The biotools are typicalled named 'read_<data type>' such as
950 , etc., and all behave in a similar manner.
951 Data can be read by supplying the
955 \begin_layout Standard
964 data_in switch and a file name to the file containing the data:
967 \begin_layout LyX-Code
968 <biotool> --data_in=<file>
971 \begin_layout Standard
972 It is also possible to read in a saved biotools stream (see
973 \begin_inset LatexCommand ref
974 reference "sub:How-to-read-stream"
978 ) as well as reading data in one go:
981 \begin_layout LyX-Code
982 <biotool> --stream_in=<file1> --data_in=<file2>
985 \begin_layout Standard
986 If you want to read data from several files you can do this:
989 \begin_layout LyX-Code
990 <biotool> --data_in=<file1> | <biotool> --data_in=<file2>
993 \begin_layout Standard
994 If you have several data files you can read in all explicitly with a comma
998 \begin_layout LyX-Code
999 <biotool> --data_in=file1,file2,file3
1002 \begin_layout Standard
1003 And it is also possible to use file globbing
1007 \begin_layout Standard
1008 using the short option will only work if you quote the argument -i '*.fna'
1016 \begin_layout LyX-Code
1017 <biotool> --data_in=*.fna
1020 \begin_layout Standard
1021 Or in a combination:
1024 \begin_layout LyX-Code
1025 <biotool> --data_in=file1,/dir/*.fna
1028 \begin_layout Standard
1029 Finally, it is possible to read in data in different formats using the appropria
1030 te biotool for each format:
1033 \begin_layout LyX-Code
1034 <biotool1> --data_in=<file1> | <biotool2> --data_in=<file2> ...
1037 \begin_layout Subsection
1038 How to read FASTA input?
1041 \begin_layout Standard
1042 Sequences in FASTA format can be read explicitly using
1049 \begin_layout LyX-Code
1050 read_fasta --data_in=<file>
1053 \begin_layout Subsection
1054 How to read alignment input?
1057 \begin_layout Standard
1058 If your alignment if FASTA formatted then you can
1063 It is also possible to use
1067 since the data is FASTA formatted, however, with
1071 the key ALIGN will be omitted.
1072 The ALIGN key is used to determine which sequences belong to what alignment
1073 which is required for
1080 \begin_layout LyX-Code
1081 read_align --data_in=<file>
1084 \begin_layout Subsection
1085 How to read tabular input?
1086 \begin_inset LatexCommand label
1087 name "sub:How-to-read-table"
1094 \begin_layout Standard
1095 Tabular input can be read with
1099 which will read in all rows and chosen columns (separated by a given delimter)
1100 from a table in text format.
1103 \begin_layout Standard
1107 \begin_layout Standard
1110 \begin_inset Tabular
1111 <lyxtabular version="3" rows="4" columns="3">
1113 <column alignment="left" valignment="top" width="0">
1114 <column alignment="left" valignment="top" width="0">
1115 <column alignment="left" valignment="top" width="0">
1117 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1120 \begin_layout Standard
1126 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1129 \begin_layout Standard
1135 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
1138 \begin_layout Standard
1146 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1149 \begin_layout Standard
1155 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1158 \begin_layout Standard
1164 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
1167 \begin_layout Standard
1175 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1178 \begin_layout Standard
1184 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1187 \begin_layout Standard
1193 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
1196 \begin_layout Standard
1204 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1207 \begin_layout Standard
1213 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1216 \begin_layout Standard
1222 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
1225 \begin_layout Standard
1239 \begin_layout Standard
1240 Can be read using the command:
1243 \begin_layout LyX-Code
1244 read_tab --data_in=<file>
1247 \begin_layout Standard
1248 Which will result in four records, one for each row, where the keys V0,
1249 V1, V2 are the default keys for the organism, sequence, and count, respectively.
1250 It is possible to select a subset of colums to read by using the
1254 \begin_layout Standard
1263 cols switch which takes a comma separated list of columns numbers (first
1264 column is designated 0) as argument.
1265 So to read in only the sequence and the count so that the count comes before
1269 \begin_layout LyX-Code
1270 read_tab --data_in=<file> --cols=2,1
1273 \begin_layout Standard
1274 It is also possible to name the columns with the
1278 \begin_layout Standard
1290 \begin_layout LyX-Code
1291 read_tab --data_in=<file> --cols=2,1 --keys=COUNT,SEQ
1294 \begin_layout Subsection
1295 How to read BED input?
1298 \begin_layout Standard
1299 The BED (Browser Extensible Data
1303 \begin_layout Standard
1304 \begin_inset LatexCommand url
1305 target "http://genome.ucsc.edu/FAQ/FAQformat"
1314 ) format is a tabular format for data pertaining to one of the Eukaryotic
1315 genomes in the UCSC genome brower
1319 \begin_layout Standard
1320 \begin_inset LatexCommand url
1321 target "http://genome.ucsc.edu/"
1331 The BED format consists of up to 12 columns, where the first three are
1332 mandatory CHR, CHR_BEG, and CHR_END.
1333 The mandatory columns and any of the optional columns can all be read in
1341 \begin_layout LyX-Code
1342 read_bed --data_in=<file>
1345 \begin_layout Standard
1346 It is also possible to read the BED file with
1352 \begin_inset LatexCommand ref
1353 reference "sub:How-to-read-table"
1357 ), however, that will be more cumbersome because you need to specify the
1361 \begin_layout LyX-Code
1362 read_tab --data_in=<file> --keys=CHR,CHR_BEG,CHR_END ...
1365 \begin_layout Subsection
1366 How to read PSL input?
1369 \begin_layout Standard
1370 The PSL format is the output from BLAT and contains 21 mandatory fields
1371 that can be read with
1378 \begin_layout LyX-Code
1379 read_psl --data_in=<file>
1382 \begin_layout Section
1386 \begin_layout Standard
1387 All result output can be written explicitly to file using the
1391 \begin_layout Standard
1400 result_out switch which all result generating biotools have.
1401 It is also possible to write the result to file implicetly by directing
1402 'stdout' to file using '>', however, that requires the
1406 \begin_layout Standard
1415 no_stream swich to prevent a mixture of data stream and results in the file.
1416 The explicit (and safe) way:
1419 \begin_layout LyX-Code
1421 | <biotool> --result_out=<file>
1424 \begin_layout Standard
1428 \begin_layout LyX-Code
1430 | <biotool> --no_stream > <file>
1433 \begin_layout Subsection
1434 How to write biotools output?
1437 \begin_layout Standard
1439 \begin_inset LatexCommand eqref
1440 reference "sub:How-to-write-stream"
1447 \begin_layout Subsection
1448 How to write FASTA output?
1449 \begin_inset LatexCommand label
1450 name "sub:How-to-write-fasta"
1457 \begin_layout Standard
1458 FASTA output can be written with
1465 \begin_layout LyX-Code
1467 | write_fasta --result_out=<file>
1470 \begin_layout Standard
1471 It is also possible to wrap the sequences to a given width using the
1475 \begin_layout Standard
1484 wrap switch allthough wrapping of sequence is generally an evil thing:
1487 \begin_layout LyX-Code
1489 | write_fasta --no_stream --wrap=80
1492 \begin_layout Subsection
1493 How to write alignment output?
1494 \begin_inset LatexCommand label
1495 name "sub:How-to-write-alignment"
1502 \begin_layout Standard
1503 Pretty alignments with ruler
1507 \begin_layout Standard
1508 '.' for every 10 residues, ':' for every 50, and '|' for every 100
1513 and consensus sequence
1514 \begin_inset Note Note
1517 \begin_layout Standard
1518 which reminds me to make that an option.
1527 , what also have the optional
1531 \begin_layout Standard
1540 wrap switch to break the alignment into blocks of a given width:
1543 \begin_layout LyX-Code
1545 | write_align --result_out=<file> --wrap=80
1548 \begin_layout Standard
1549 If the number of sequnces in the alignment is 2 then a pairwise alignment
1550 will be output otherwise a multiple alignment.
1551 And if the sequence type, determined automagically, is protein, then residues
1552 and symbols (+,\InsetSpace ~
1554 .) will be used to show consensus according to the Blosum62
1558 \begin_layout Subsection
1559 How to write tabular output?
1560 \begin_inset LatexCommand label
1561 name "sub:How-to-write-tab"
1568 \begin_layout Standard
1569 Outputting the data stream as a table can be done with
1573 , which will write generate one row per record with the values as columns.
1574 If you supply the optional
1578 \begin_layout Standard
1587 comment switch, when the first row in the table will be a 'comment' line
1588 prefixed with a '#':
1591 \begin_layout LyX-Code
1593 | write_tab --result_out=<file> --comment
1596 \begin_layout Standard
1597 You can also change the delimiter from the default (tab) to
1605 \begin_layout LyX-Code
1607 | write_tab --result_out=<file> --delimit=','
1610 \begin_layout Standard
1611 If you want the values output in a specific order you have to supply a comma
1612 separated list using the
1616 \begin_layout Standard
1625 keys switch that will print only those keys in that order:
1628 \begin_layout LyX-Code
1630 | write_tab --result_out=<file> --keys=SEQ_NAME,COUNT
1633 \begin_layout Standard
1634 Alternatively, if you have some keys that you don't want in the tabular
1639 \begin_layout Standard
1649 So to print all keys except SEQ and SEQ_TYPE do:
1652 \begin_layout LyX-Code
1654 | write_tab --result_out=<file> --no_keys=SEQ,SEQ_TYPE
1657 \begin_layout Standard
1658 Finally, if you have a stream containing a mix of different records types,
1664 records with sequences and records with matches, then you can use
1668 to output all the records in tabluar format, however, the
1672 \begin_layout Standard
1685 \begin_layout Standard
1698 \begin_layout Standard
1707 no_keys switches will only respond to records of the first type encountered.
1708 The reason is that outputting mixed records is probably not what you want
1709 anyway, and you should remove all the unwanted records from the stream
1710 before outputting the table:
1714 is your friend (see\InsetSpace ~
1716 \begin_inset LatexCommand ref
1717 reference "sub:How-to-grab"
1724 \begin_layout Subsection
1725 How to write a BED output?
1726 \begin_inset LatexCommand label
1727 name "sub:How-to-write-BED"
1734 \begin_layout Standard
1735 Data in BED format can be output if the records contain the mandatory keys
1736 CHR, CHR_BEG, and CHR_END using
1741 If the optional keys are also present, they will be output as well:
1744 \begin_layout LyX-Code
1745 write_bed --result_out=<file>
1748 \begin_layout Subsection
1749 How to write PSL output?
1750 \begin_inset LatexCommand label
1751 name "sub:How-to-write-PSL"
1758 \begin_layout Standard
1759 Data in PSL format can be output using
1764 \begin_layout LyX-Code
1765 write_psl --result_out=<file>
1768 \begin_layout Section
1769 Manipulating Records
1772 \begin_layout Subsection
1773 How to select a few records?
1774 \begin_inset LatexCommand label
1775 name "sub:How-to-select-a-few-records"
1782 \begin_layout Standard
1783 To quickly get an overview of your data you can limit the data stream to
1785 This also very useful to test the pipeline with a few records if you are
1786 setting up a complex analysis using several biotools.
1787 That way you can inspect that all goes well before analyzing and waiting
1788 for the full data set.
1789 All of the read_<type> biotools have the
1793 \begin_layout Standard
1802 num switch which will take a number as argument and only that number of
1803 records will be read.
1804 So to read in the first 10 FASTA entries from a file:
1807 \begin_layout LyX-Code
1808 read_fasta --data_in=test.fna --num=10
1811 \begin_layout Standard
1812 Another way of doing this is to use
1816 will limit the stream to show the first 10 records (default):
1819 \begin_layout LyX-Code
1824 \begin_layout Standard
1829 directly after one of the read_<type> biotools will be a lot slower than
1834 \begin_layout Standard
1843 num switch with the read_<type> biotools, however,
1847 can also be used to limit the output from all the other biotools.
1848 It is also possible to give
1852 a number of records to show using the
1856 \begin_layout Standard
1866 So to display the first 100 records do:
1869 \begin_layout LyX-Code
1871 | head_records --num=100
1874 \begin_layout Subsection
1875 How to select random records?
1876 \begin_inset LatexCommand label
1877 name "sub:How-to-select-random-records"
1884 \begin_layout Standard
1885 If you want to inspect a number of random records from the stream this can
1891 So if you have 1 mio records in the stream and you want to select 1000
1895 \begin_layout LyX-Code
1897 | random_records --num=1000
1900 \begin_layout Subsection
1901 How to count all records in the data stream?
1904 \begin_layout Standard
1905 To count all the records in the data stream use
1909 , which adds one record (which is not included in the count) to the data
1911 So to count the number of sequences in a FASTA file you can do this:
1914 \begin_layout LyX-Code
1915 cat test.fna | read_fasta | count_records --no_stream
1918 \begin_layout Standard
1919 Which will write the last record containing the count to 'stdout':
1922 \begin_layout LyX-Code
1928 \begin_layout LyX-Code
1934 \begin_layout Standard
1935 It is also possible to write the count to file using the
1939 \begin_layout Standard
1951 \begin_layout Subsection
1952 How to get the length of record values?
1953 \begin_inset LatexCommand label
1954 name "sub:How-to-get-value_length"
1961 \begin_layout Standard
1966 biotool to get the length of each value for a comma separated list of keys:
1969 \begin_layout LyX-Code
1971 | length_vals --keys=HIT,PATTERN
1974 \begin_layout Subsection
1975 How to grab specific records?
1976 \begin_inset LatexCommand label
1977 name "sub:How-to-grab"
1984 \begin_layout Standard
1989 is related to the Unix grep and locates records based on matching keys
1990 and/or values using either a pattern, a Perl regex, or a numerical evaluation.
1995 all records in the stream that has any mentioning of the pattern 'human'
1996 just pipe the data stream through
2003 \begin_layout LyX-Code
2005 | grab --pattern=human
2008 \begin_layout Standard
2009 This will search for the pattern 'human' in all keys and all values.
2014 \begin_layout Standard
2023 pattern switch takes a comma separated list of patterns, so in order to
2024 match multiple patterns do:
2027 \begin_layout LyX-Code
2029 | grab --pattern=human,mouse
2032 \begin_layout Standard
2033 It is also possible to use the
2037 \begin_layout Standard
2046 pattern_in switch instead of
2050 \begin_layout Standard
2064 \begin_layout Standard
2073 pattern_in is used to read a file with one pattern per line:
2076 \begin_layout LyX-Code
2078 | grab --pattern_in=patterns.txt
2081 \begin_layout Standard
2082 If you want the opposite result --- to find all records that does not match
2083 the patterns, add the
2087 \begin_layout Standard
2096 invert switch, which not only works with the
2100 \begin_layout Standard
2109 pattern switch, but also with
2113 \begin_layout Standard
2126 \begin_layout Standard
2138 \begin_layout LyX-Code
2140 | grab --pattern=human --invert
2143 \begin_layout Standard
2144 If you want to search the record keys only,
2149 to find all records containing the key SEQ you can add the
2153 \begin_layout Standard
2163 This will prevent matching of SEQ in any record value, and in fact SEQ
2164 is a not uncommon peptide sequence you could get an unwanted record.
2165 Also, this will give an increase in speed since only the keys are searched:
2168 \begin_layout LyX-Code
2170 | grab --pattern=SEQ --keys_only
2173 \begin_layout Standard
2174 However, if you are interested in finding the peptide sequence SEQ and not
2175 the SEQ key, just add the
2179 \begin_layout Standard
2188 vals_only switch instead:
2191 \begin_layout LyX-Code
2193 | grab --pattern=SEQ --vals_only
2196 \begin_layout Standard
2197 Also, if you want to grab for certain key/value pairs you can supply a comma
2198 separated list of keys whos values will then be searched using the
2202 \begin_layout Standard
2212 This is handy if your records contain large genomic sequences and you dont
2213 want to search the entire sequence for
2218 the organism name --- it is much faster to tell
2222 which keys to search the value for:
2225 \begin_layout LyX-Code
2227 | grab --pattern=human --keys=SEQ_NAME
2230 \begin_layout LyX-Code
2234 \begin_layout Standard
2235 It is also possible to invoke flexible matching using regex (regular expressions
2236 ) instead of simple pattern matching.
2241 the regex engine is Perl based and allows use of different type of wild
2242 cards, alternatives,
2250 \begin_layout Standard
2251 \begin_inset LatexCommand url
2252 target "http://perldoc.perl.org/perlreref.html"
2266 records withs the sequence ATCG or GCTA you can do this:
2269 \begin_layout LyX-Code
2271 | grab --regex='ATCG|GCTA'
2274 \begin_layout Standard
2275 Or if you want to find sequences beginning with ATCG:
2278 \begin_layout LyX-Code
2280 | grab --regex='^ATCG'
2283 \begin_layout Standard
2288 to locate records that fulfill a numerical property using the
2292 \begin_layout Standard
2301 eval switch witch takes an expression in three parts.
2302 The first part is the key that holds the value we want to evaluate, the
2303 second part holds one the six operators:
2306 \begin_layout Enumerate
2310 \begin_layout Enumerate
2311 Greater than or equal to: >=
2314 \begin_layout Enumerate
2318 \begin_layout Enumerate
2319 Less than or equal to: <=
2322 \begin_layout Enumerate
2326 \begin_layout Enumerate
2330 \begin_layout Enumerate
2331 String wise equal to: eq
2334 \begin_layout Enumerate
2335 String wise not equal to: ne
2338 \begin_layout Standard
2339 And finally comes the number used in the evaluation.
2344 all records with a sequence length greater than 30:
2347 \begin_layout LyX-Code
2349 length_seq | grab --eval='SEQ_LEN > 30'
2352 \begin_layout Standard
2353 If you want to locate all records containing the pattern 'human' and where
2354 the sequence length is greater that 30, you do this by running the stream
2362 \begin_layout LyX-Code
2364 | grab --pattern='human' | length_seq | grab --eval='SEQ_LEN > 30'
2367 \begin_layout Standard
2368 Finally, it is possible to do fast matching of expressions from a file using
2373 \begin_layout Standard
2383 Each of these expressions has to be matched exactly over the entrie length,
2384 which if useful if you have a file with accession numbers, that you want
2385 to locate in the stream:
2388 \begin_layout LyX-Code
2390 | grab --exact acc_no.txt | ...
2393 \begin_layout Standard
2398 \begin_layout Standard
2407 exact is much faster than using
2411 \begin_layout Standard
2420 pattern_in, because with
2424 \begin_layout Standard
2433 exact the expression has to be complete matches, where
2437 \begin_layout Standard
2446 pattern_in looks for subpatterns.
2449 \begin_layout Standard
2450 NB! To get the best speed performance, use the most restrictive
2457 \begin_layout Subsection
2458 How to remove keys from records?
2461 \begin_layout Standard
2462 To remove one or more specific keys from all records in the data stream
2470 \begin_layout LyX-Code
2472 | remove_keys --keys=SEQ,SEQ_NAME
2475 \begin_layout Standard
2476 In the above example SEQ and SEQ_NAME will be removed from all records if
2477 they exists in these.
2478 If all keys are removed from a record, then the record will be removed.
2481 \begin_layout Subsection
2482 How to rename keys in records?
2485 \begin_layout Standard
2486 Sometimes you want to rename a record key,
2491 if you have read in a two column table with sequence name and sequence
2493 \begin_inset LatexCommand ref
2494 reference "sub:How-to-read-table"
2498 ) without specifying the key names, then the sequence name will be called
2499 V0 and the sequence V1 as default in the
2504 To rename the V0 and V1 keys we need to run the stream through
2508 twice (one for each key to rename):
2511 \begin_layout LyX-Code
2513 | rename_keys --keys=V0,SEQ_NAME | rename_keys --keys=V1,SEQ
2516 \begin_layout Standard
2517 The first instance of
2521 replaces all the V0 keys with SEQ_NAME, and the second instance of
2525 replaces all the V1 keys with SEQ.
2530 the data can now be used in the biotools that requires these keys.
2533 \begin_layout Section
2534 Manipulating Sequences
2537 \begin_layout Subsection
2538 How to get sequence lengths?
2541 \begin_layout Standard
2542 The length for sequences in records can be determined with
2546 , which adds the key SEQ_LEN to each record with the sequence length as
2548 It also generates an extra record that is emitted last with the key TOTAL_SEQ_L
2549 EN showing the total length of all the sequences.
2552 \begin_layout LyX-Code
2553 read_fasta --data_in=<file> | length_seq
2556 \begin_layout Standard
2557 It is also possible to determine the sequence length using the generic tool
2563 \begin_inset LatexCommand eqref
2564 reference "sub:How-to-get-value_length"
2568 , which determines the length of the values for a given list of keys:
2571 \begin_layout LyX-Code
2572 read_fasta --data_in=<file> | length_vals --keys=SEQ
2575 \begin_layout Standard
2576 To obtain the total length of all sequences use
2583 \begin_layout LyX-Code
2584 read_fasta --data_in=<file> | length_vals --keys=SEQ
2587 \begin_layout LyX-Code
2588 | sum_vals --keys=SEQ_LEN
2591 \begin_layout Standard
2596 will also determine the length of each sequence (see\InsetSpace ~
2598 \begin_inset LatexCommand ref
2599 reference "sub:How-to-analyze"
2606 \begin_layout Subsection
2607 How to analyze sequence composition?
2608 \begin_inset LatexCommand label
2609 name "sub:How-to-analyze"
2616 \begin_layout Standard
2617 If you want to find out the sequence type, composition, length, as well
2618 as GC content, indel content and proportions of soft and hard masked sequence,
2624 This handy biotool will determine all these things per sequence from which
2625 it is easy to get an overview using the
2629 biotool to output a table (see\InsetSpace ~
2631 \begin_inset LatexCommand ref
2632 reference "sub:How-to-write-tab"
2637 So in order to determine the sequence composition of a FASTA file with
2638 just one entry containing the sequence 'ATCG' we just read the data with
2643 and run the output through
2647 which will add the analysis to the record like this:
2650 \begin_layout LyX-Code
2651 read_fasta --data_in=test.fna | analyze_seq ...
2654 \begin_layout LyX-Code
2658 \begin_layout LyX-Code
2664 \begin_layout LyX-Code
2670 \begin_layout LyX-Code
2676 \begin_layout LyX-Code
2682 \begin_layout LyX-Code
2688 \begin_layout LyX-Code
2694 \begin_layout LyX-Code
2700 \begin_layout LyX-Code
2706 \begin_layout LyX-Code
2712 \begin_layout LyX-Code
2718 \begin_layout LyX-Code
2724 \begin_layout LyX-Code
2730 \begin_layout LyX-Code
2736 \begin_layout LyX-Code
2742 \begin_layout LyX-Code
2748 \begin_layout LyX-Code
2754 \begin_layout LyX-Code
2760 \begin_layout LyX-Code
2766 \begin_layout LyX-Code
2772 \begin_layout LyX-Code
2778 \begin_layout LyX-Code
2781 SEQ: TTTCAGTTTGGGACGGAGTAAGGCCTTCCtttttttttttttttttttttttttttttgagaccgagtcttgctc
2785 \begin_layout LyX-Code
2791 \begin_layout LyX-Code
2797 \begin_layout LyX-Code
2803 \begin_layout LyX-Code
2809 \begin_layout LyX-Code
2815 \begin_layout LyX-Code
2821 \begin_layout LyX-Code
2825 \begin_layout Standard
2826 Now to make a table of how may As, Ts, Cs, and Gs you can add the following:
2829 \begin_layout LyX-Code
2831 | analyze_seq | write_tab --keys=RES:A,RES:T,RES:C,RES:G
2834 \begin_layout Standard
2835 Or if you want to see the proportions of hard and soft masked sequence:
2838 \begin_layout LyX-Code
2840 | analyse_seq | write_tab --keys=HARD_MASK%,SOFT_MASK%
2843 \begin_layout Standard
2844 If you have a stack of sequences in one file and you want to determine the
2845 mean GC content you can do it using the
2852 \begin_layout LyX-Code
2853 read_fasta --data_in=test.fna | analyze_seq | mean_vals --keys=GC%
2856 \begin_layout Standard
2857 Or if you want the total count of Ns you can use
2864 \begin_layout LyX-Code
2865 read_fasta --data_in=test.fna | analyze_seq | sum_vals --keys=RES:N
2868 \begin_layout Standard
2869 The MIX_INDEX key is calculated as the count of the most common residue
2870 over the sequence length, and can be used as a cut-off for removing sequence
2871 tags consisting of mostly one nucleotide:
2874 \begin_layout LyX-Code
2875 read_fasta --data_in=test.fna | analyze_seq | grab --eval='MIX_INDEX<0.85'
2878 \begin_layout Subsection
2879 How to extract subsequences?
2880 \begin_inset LatexCommand label
2881 name "sub:How-to-extract"
2888 \begin_layout Standard
2889 In order to extract a subsequence from a longer sequence use the biotool
2890 extract_seq, which will replace the sequence in the record with the subsequence
2891 (this behaviour should probably be modified to be dependant of a
2895 \begin_layout Standard
2908 \begin_layout Standard
2918 \begin_inset Note Note
2921 \begin_layout Standard
2928 So to extract the first 20 residues from all sequences do (first residue
2932 \begin_layout LyX-Code
2934 | extract_seq --beg=1 --len=20
2937 \begin_layout Standard
2938 You can also specify a begin and end coordinate set:
2941 \begin_layout LyX-Code
2943 | extract_seq --beg=20 --end=40
2946 \begin_layout Standard
2947 If you want the subsequences from position 20 to the sequence end do:
2950 \begin_layout LyX-Code
2952 | extract_seq --beg=20
2955 \begin_layout Standard
2956 If you want to extract subsequences a given distance from the sequence end
2957 you can do this by reversing the sequence with the biotool
2962 \begin_inset LatexCommand eqref
2963 reference "sub:How-to-reverse-seq"
2971 to get the subsequence, and then
2975 again to get the subsequence back in the original orientation:
2978 \begin_layout LyX-Code
2979 read_fasta --data_in=test.fna | reverse_seq
2982 \begin_layout LyX-Code
2983 | extract_seq --beg=10 --len=10 | reverse_seq
2986 \begin_layout Subsection
2987 How to get genomic sequence?
2988 \begin_inset LatexCommand label
2989 name "sub:How-to-get-genomic-sequence"
2996 \begin_layout Standard
3001 can extract subsequences for a given genome specified with the
3005 \begin_layout Standard
3014 genome switch explicitly using the
3018 \begin_layout Standard
3031 \begin_layout Standard
3044 \begin_layout Standard
3056 \begin_layout LyX-Code
3057 get_genome_seq --genome=<genome> --beg=1 --len=100
3060 \begin_layout Standard
3065 can be used to append the corresponding sequence to BED, PSL, and BLAST
3069 \begin_layout LyX-Code
3070 read_bed --data_in=<BED file> | get_genome_seq --genome=<genome>
3073 \begin_layout Standard
3074 It is also possible to include flaking sequence using the
3078 \begin_layout Standard
3088 So to include 50 nucleotides upstream and 50 nucleotides downstream for
3092 \begin_layout LyX-Code
3093 read_bed --data_in=<BED file> | get_genome_seq --genome=<genome> --flank=50
3096 \begin_layout Subsection
3097 How to upper-case sequences?
3100 \begin_layout Standard
3101 Sequences can be shifted from lower case to upper case using
3108 \begin_layout LyX-Code
3113 \begin_layout Subsection
3114 How to reverse sequences?
3115 \begin_inset LatexCommand label
3116 name "sub:How-to-reverse-seq"
3123 \begin_layout Standard
3124 The order of residues in a sequence can be reversed using reverse_seq:
3127 \begin_layout LyX-Code
3132 \begin_layout Standard
3133 Note that in order to reverse/complement a sequence you also need the
3137 biotool (see\InsetSpace ~
3139 \begin_inset LatexCommand ref
3140 reference "sub:How-to-complement"
3147 \begin_layout Subsection
3148 How to complement sequences?
3149 \begin_inset LatexCommand label
3150 name "sub:How-to-complement"
3157 \begin_layout Standard
3158 DNA and RNA sequences can be complemented with
3162 , which automagically determines the sequence type:
3165 \begin_layout LyX-Code
3170 \begin_layout Standard
3171 Note that in order to reverse/complement a sequence you also need the
3175 biotool (see\InsetSpace ~
3177 \begin_inset LatexCommand ref
3178 reference "sub:How-to-reverse-seq"
3185 \begin_layout Subsection
3186 How to remove indels from sequnces?
3189 \begin_layout Standard
3190 Indels can be removed from sequences with the
3195 This is useful if you have aligned some sequences (see\InsetSpace ~
3197 \begin_inset LatexCommand ref
3198 reference "sub:How-to-align"
3202 ) and extracted (see\InsetSpace ~
3204 \begin_inset LatexCommand ref
3205 reference "sub:How-to-extract"
3209 ) a block of subsequences from the alignment and you want to use these sequence
3210 in a search where you need to remove the indels first.
3211 '-', '~', and '.' are considered indels:
3214 \begin_layout LyX-Code
3219 \begin_layout Subsection
3220 How to shuffle sequences?
3223 \begin_layout Standard
3224 All residues in sequences in the stream can be shuffled to random positions
3232 \begin_layout LyX-Code
3237 \begin_layout Subsection
3238 How to split sequences into overlapping subsequences?
3241 \begin_layout Standard
3242 Sequences can be slit into overlapping subsequences with the
3249 \begin_layout LyX-Code
3251 | split_seq --word_size=20 --uniq
3254 \begin_layout Subsection
3255 How to determine the oligo frequency?
3258 \begin_layout Standard
3259 In order to determine if any oligo usage is over represented in one or more
3260 sequences you can determine the frequency of oligos of a given size with
3268 \begin_layout LyX-Code
3270 | oligo_freq --word_size=4
3273 \begin_layout Standard
3274 And if you have more than one sequence and want to accumulate the frequences
3279 \begin_layout Standard
3291 \begin_layout LyX-Code
3293 | oligo_freq --word_size=4 --all
3296 \begin_layout Standard
3297 To get a meaningful result you need to write the resulting frequencies as
3304 \begin_inset LatexCommand ref
3305 reference "sub:How-to-write-tab"
3309 ), but first it is important to
3315 \begin_inset LatexCommand ref
3316 reference "sub:How-to-grab"
3320 ) the records with the frequencies to avoid full length sequences in the
3324 \begin_layout LyX-Code
3326 | oligo_freq --word_size=4 --all | grab --pattern=OLIGO --keys_only
3329 \begin_layout LyX-Code
3330 | write_tab --no_stream
3333 \begin_layout Standard
3334 And the resulting frequency table can be sorted with Unix sort (man sort).
3337 \begin_layout Subsection
3338 How to search for sequences in genomes?
3341 \begin_layout Standard
3342 See the following biotool:
3345 \begin_layout Itemize
3351 \begin_inset LatexCommand eqref
3352 reference "sub:How-to-use-patscan"
3359 \begin_layout Itemize
3365 \begin_inset LatexCommand eqref
3366 reference "sub:How-to-use-BLAT"
3373 \begin_layout Itemize
3379 \begin_inset LatexCommand eqref
3380 reference "sub:How-to-use-BLAST"
3387 \begin_layout Itemize
3393 \begin_inset LatexCommand eqref
3394 reference "sub:How-to-use-Vmatch"
3401 \begin_layout Subsection
3402 How to search sequences for a pattern?
3403 \begin_inset LatexCommand label
3404 name "sub:How-to-use-patscan"
3411 \begin_layout Standard
3412 It is possible to search sequences in the data stream for patterns using
3417 biotool which utilizes the powerful scan_for_matches engine.
3418 Consult the documentation for scan_for_matches in order to learn how to
3419 define patterns (the documentation is included in Appendix\InsetSpace ~
3421 \begin_inset LatexCommand ref
3422 reference "sec:scan_for_matches-README"
3429 \begin_layout Standard
3430 To search all sequences for a simple pattern consisting of the sequence
3431 ATCGATCG allowing for 3 mismatches, 2 insertions and 1 deletion:
3434 \begin_layout LyX-Code
3435 read_fasta --data_in=<file> | patscan_seq --pattern='ATCGATCG[3,2,1]'
3438 \begin_layout Standard
3443 \begin_layout Standard
3452 pattern switch takes a comma seperated list of patterns, so if you want
3453 to search for more that one pattern do:
3456 \begin_layout LyX-Code
3458 | patscan_seq --pattern='ATCGATCG[3,2,1],GCTAGCTA[3,2,1]'
3461 \begin_layout Standard
3462 It is also possible to have a list of different patterns to search for in
3463 a file with one pattern per line.
3468 to read these patterns use the
3472 \begin_layout Standard
3484 \begin_layout LyX-Code
3486 | patscan_seq --pattern_in=<file>
3489 \begin_layout Standard
3490 To also scan the complementary strand in nucleotide sequences (
3494 automagically determines the sequence type) you need to add the
3498 \begin_layout Standard
3510 \begin_layout LyX-Code
3512 | patscan_seq --pattern=<pattern> --comp
3515 \begin_layout Standard
3516 It is also possible to use
3520 to output those records that does not contain a certain pattern by using
3525 \begin_layout Standard
3537 \begin_layout LyX-Code
3539 | patscan_seq --pattern=<pattern> --invert
3542 \begin_layout Standard
3547 can also scan for patterns in a given genome sequence, instead of sequences
3548 in the stream, using the
3552 \begin_layout Standard
3564 \begin_layout LyX-Code
3565 patscan --pattern=<pattern> --genome=<genome>
3568 \begin_layout Subsection
3569 How to use BLAT for sequence search?
3570 \begin_inset LatexCommand label
3571 name "sub:How-to-use-BLAT"
3578 \begin_layout Standard
3579 Sequences in the data stream can be matched against supported genomes using
3584 which is a biotool using BLAT as the name might suggest.
3585 Currently only Mouse and Human genomes are available and it is not possible
3586 to use OOC files since there is still a need for a local repository for
3588 Otherwise it is just:
3591 \begin_layout LyX-Code
3592 read_fasta --data_in=<file> | blat_seq --genome=<genome>
3595 \begin_layout Standard
3596 The search results can then be written to file with
3602 \begin_inset LatexCommand ref
3603 reference "sub:How-to-write-PSL"
3613 \begin_inset LatexCommand ref
3614 reference "sub:How-to-write-BED"
3622 some information will be lost).
3623 It is also possible to plot chromosome distribution of the search results
3630 \begin_inset LatexCommand ref
3631 reference "sub:How-to-plot-chrdist"
3635 ) or the distribution of the match lengths using
3641 \begin_inset LatexCommand ref
3642 reference "sub:How-to-plot-lendist"
3646 ) or a karyogram with the hits using
3652 \begin_inset LatexCommand ref
3653 reference "sub:How-to-plot-karyogram"
3660 \begin_layout Subsection
3661 How to use BLAST for sequence search?
3662 \begin_inset LatexCommand label
3663 name "sub:How-to-use-BLAST"
3670 \begin_layout Standard
3671 Two biotools exist for blasting sequences:
3675 is used to create the BLAST database required for BLAST which is queried
3681 So in order to create a BLAST database from sequences in the data stream
3685 \begin_layout LyX-Code
3687 | create_blast_db --database=my_database --no_stream
3690 \begin_layout Standard
3691 The type of sequence to use for the database is automagically determined
3696 , but don't have a mixture of peptide and nucleic acids sequences in the
3702 \begin_layout Standard
3711 database switch takes a path as argument, but will default to 'blastdb_<time_sta
3715 \begin_layout Standard
3716 The resulting database can now be queried with sequences in another data
3724 \begin_layout LyX-Code
3726 | blast_seq --database=my_database
3729 \begin_layout Standard
3730 Again, the sequence type is determined automagically and the appropriate
3731 BLAST program is guessed (see below table), however, the program name can
3732 be overruled with the
3736 \begin_layout Standard
3748 \begin_layout Standard
3751 \begin_inset Tabular
3752 <lyxtabular version="3" rows="5" columns="3">
3754 <column alignment="center" valignment="top" width="0">
3755 <column alignment="center" valignment="top" width="0">
3756 <column alignment="center" valignment="top" width="0">
3757 <row bottomline="true">
3758 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
3761 \begin_layout Standard
3767 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
3770 \begin_layout Standard
3776 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
3779 \begin_layout Standard
3787 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
3790 \begin_layout Standard
3796 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
3799 \begin_layout Standard
3805 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
3808 \begin_layout Standard
3816 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
3819 \begin_layout Standard
3825 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
3828 \begin_layout Standard
3834 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
3837 \begin_layout Standard
3845 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
3848 \begin_layout Standard
3854 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
3857 \begin_layout Standard
3863 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
3866 \begin_layout Standard
3874 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
3877 \begin_layout Standard
3883 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
3886 \begin_layout Standard
3892 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
3895 \begin_layout Standard
3909 \begin_layout Standard
3910 Finally, it is also possible to use
3914 for blasting sequences agains a preformatted genome using the
3918 \begin_layout Standard
3927 genome switch instead of the
3931 \begin_layout Standard
3943 \begin_layout LyX-Code
3945 | blast_seq --genome=<genome>
3948 \begin_layout Subsection
3949 How to use Vmatch for sequence search?
3950 \begin_inset LatexCommand label
3951 name "sub:How-to-use-Vmatch"
3958 \begin_layout Standard
3959 The powerful suffix array software package Vmatch
3963 \begin_layout Standard
3964 \begin_inset LatexCommand url
3965 target "http://www.vmatch.de/"
3974 can be used for exact mapping of sequences against indexed genomes using
3980 map 700000 ESTs to the human genome locating all 160 mio hits in less than
3982 Only nucleotide sequences and sequences longer than 11 nucleotides will
3984 It is recommended that sequences consisting of mostly one nucleotide type
3986 This can be done with the
3991 \begin_inset LatexCommand eqref
3992 reference "sub:How-to-analyze"
3999 \begin_layout LyX-Code
4001 | vmatch_seq --genome=<genome>
4004 \begin_layout Standard
4005 It is also possible to allow for mismatches using the
4009 \begin_layout Standard
4018 hamming_dist switch.
4019 So to allow for 2 mismatches:
4022 \begin_layout LyX-Code
4024 | vmatch_seq --genome=<genome> --hamming_dist=2
4027 \begin_layout Standard
4028 Or to allow for 10% mismathing nucleotides:
4031 \begin_layout LyX-Code
4033 | vmatch_seq --genome=<genome> --hamming_dist=10p
4036 \begin_layout Standard
4037 To allow both indels and mismatches use the
4041 \begin_layout Standard
4051 So to allow for one mismatch or one indel:
4054 \begin_layout LyX-Code
4056 | vmatch_seq --genome=<genome> --hamming_dist=1
4059 \begin_layout Standard
4060 Or to allow for 5% indels or mismatches:
4063 \begin_layout LyX-Code
4065 | vmatch_seq --genome=<genome> --hamming_dist=5p
4068 \begin_layout Standard
4073 \begin_layout Standard
4086 \begin_layout Standard
4095 edit_dist greatly slows down vmatch considerably --- use with care.
4098 \begin_layout Standard
4099 The resulting SCORE key can be replaced to hold the number of genome matches
4100 of a given sequence (multi-mappers) is the
4104 \begin_layout Standard
4113 count switch is given.
4116 \begin_layout Subsection
4117 How to find all matches between sequences?
4118 \begin_inset LatexCommand label
4119 name "sub:How-to-find-matches"
4126 \begin_layout Standard
4127 All matches between two sequences can be determined with the biotool
4132 The match finding engine underneath the hood of
4136 is the super fast suffix tree program MUMmer
4140 \begin_layout Standard
4141 \begin_inset LatexCommand url
4142 target "http://mummer.sourceforge.net/"
4151 , which will locate all forward and reverse matches between huge sequences
4152 in a matter of minutes (if the repeat count is not too high and if the
4153 word size used is appropriate).
4158 genomes (1.7Mbp) takes around 10 seconds:
4161 \begin_layout LyX-Code
4163 | match_seq --word_size=20 --direction=both
4166 \begin_layout Standard
4171 can be used to generate a dot plot with
4177 \begin_inset LatexCommand ref
4178 reference "sub:How-to-generate-dotplot"
4185 \begin_layout Subsection
4186 How to align sequences?
4187 \begin_inset LatexCommand label
4188 name "sub:How-to-align"
4195 \begin_layout Standard
4196 Sequences in the stream can be aligned with the
4200 biotool that uses Muscle
4204 \begin_layout Standard
4205 \begin_inset LatexCommand url
4206 target "http://www.drive5.com/muscle/muscle.html"
4216 Currently you cannot change any of the Muscle alignment parameters and
4221 will create an alignment based on the defaults (which are really good!):
4224 \begin_layout LyX-Code
4229 \begin_layout Standard
4230 The aligned output can be written to file in FASTA format using
4236 \begin_inset LatexCommand ref
4237 reference "sub:How-to-write-fasta"
4241 ) or in pretty text using
4247 \begin_inset LatexCommand ref
4248 reference "sub:How-to-write-alignment"
4255 \begin_layout Subsection
4256 How to create a weight matrix?
4259 \begin_layout Standard
4260 If you want a weight matrix to show the sequence composition of a stack
4261 of sequences you can use the biotool create_weight_matrix:
4264 \begin_layout LyX-Code
4266 | create_weight_matrix
4269 \begin_layout Standard
4270 The result can be output in percent using the
4274 \begin_layout Standard
4286 \begin_layout LyX-Code
4288 | create_weight_matrix --percent
4291 \begin_layout Standard
4292 The weight matrix can be written as tabular output with
4298 \begin_inset LatexCommand ref
4299 reference "sub:How-to-write-tab"
4303 ) after removeing the records containing SEQ with
4309 \begin_inset LatexCommand ref
4310 reference "sub:How-to-grab"
4317 \begin_layout LyX-Code
4319 | create_weight_matrix | grab --invert --keys=SEQ --keys_only
4322 \begin_layout LyX-Code
4323 | write_tab --no_stream
4326 \begin_layout Standard
4327 The V0 column will hold the residue, while the rest of the columns will
4328 hold the frequencies for each sequence position.
4331 \begin_layout Section
4335 \begin_layout Standard
4336 There exists several biotools for plotting.
4337 Some of these are based on GNUplot
4341 \begin_layout Standard
4342 \begin_inset LatexCommand url
4343 target "http://www.gnuplot.info/"
4352 , which is an extremely powerful platform to generate all sorts of plots
4353 and even though GNUplot has quite a steep learning curve, the biotools
4354 utilizing GNUplot are simple to use.
4355 GNUplot is able to output a lot of different formats (called terminals
4356 in GNUplot), but the biotools focusses on three formats only:
4359 \begin_layout Enumerate
4360 The 'dumb' terminal is default to the GNUplot based biotools and will output
4361 a plot in crude ASCII text (Fig.\InsetSpace ~
4363 \begin_inset LatexCommand ref
4364 reference "fig:Dumb-terminal"
4369 This is quite nice for a quick and dirty plot to get an overview of your
4373 \begin_layout Enumerate
4374 The 'post' or 'postscript' terminal output postscript code which is publication
4375 grade graphics that can be viewed with applications such as Ghostview,
4376 Photoshop, and Preview.
4379 \begin_layout Enumerate
4380 The 'svg' terminal output's scalable vector graphics (SVG) which is a vector
4382 SVG is great because you can edit the resulting plot using Photoshop or
4387 \begin_layout Standard
4388 Inkscape is a really handy drawing program that is free and open source.
4390 \begin_inset LatexCommand htmlurl
4391 target "http://www.inkscape.org"
4400 if you want to add additional labels, captions, arrows, and so on and then
4401 save the result in different formats, such as postscript without loosing
4405 \begin_layout Standard
4406 The biotools for plotting that are not based on GNUplot only output SVG
4407 (that may change in the future).
4410 \begin_layout Standard
4411 \begin_inset Float figure
4416 \begin_layout Standard
4419 \begin_inset Graphics
4420 filename lendist_ascii.png
4429 \begin_layout Standard
4430 \begin_inset Caption
4432 \begin_layout Standard
4433 \begin_inset LatexCommand label
4434 name "fig:Dumb-terminal"
4447 The output of a length distribution plot in the default 'dumb terminal'
4448 to the terminal window.
4457 \begin_layout Subsection
4458 How to plot a histogram?
4459 \begin_inset LatexCommand label
4460 name "How-to-plot-histogram"
4467 \begin_layout Standard
4468 A generic histogram for a given value can be plotted with the biotool
4474 \begin_inset LatexCommand ref
4475 reference "fig:Histogram"
4482 \begin_layout LyX-Code
4484 | plot_histogram --key=TISSUE --no_stream
4487 \begin_layout Standard
4491 \begin_layout Standard
4494 \begin_inset Float figure
4499 \begin_layout Standard
4502 \begin_inset Graphics
4503 filename histogram.png
4512 \begin_layout Standard
4513 \begin_inset Caption
4515 \begin_layout Standard
4516 \begin_inset LatexCommand label
4517 name "fig:Histogram"
4534 \begin_layout Subsection
4535 How to plot a length distribution?
4536 \begin_inset LatexCommand label
4537 name "sub:How-to-plot-lendist"
4544 \begin_layout Standard
4545 Plotting of length distributions, weather sequence lengths, patterns lengths,
4551 is a really handy thing and can be done with the the biotool
4556 If you have a file with FASTA entries and want to plot the length distribution
4557 you do it like this:
4560 \begin_layout LyX-Code
4561 read_fasta --data_in=<file> | length_seq
4564 \begin_layout LyX-Code
4565 | plot_lendist --key=SEQ_LEN --no_stream
4568 \begin_layout Standard
4569 The result will be written to the default dumb terminal and will look like
4572 \begin_inset LatexCommand ref
4573 reference "fig:Dumb-terminal"
4580 \begin_layout Standard
4581 If you instead want the result in postscript format you can do:
4584 \begin_layout LyX-Code
4586 | plot_lendist --key=SEQ_LEN --terminal=post --result_out=file.ps
4589 \begin_layout Standard
4590 That will generate the plot and save it to file, but not interrupt the data
4591 stream which can then be used in further analysis.
4592 You can also save the plot implicetly using '>', however, it is then important
4593 to terminate the stream with the
4597 \begin_layout Standard
4609 \begin_layout LyX-Code
4611 | plot_lendist --key=SEQ_LEN --terminal=post --no_stream > file.ps
4614 \begin_layout Standard
4615 The resulting plot can be seen in Fig.\InsetSpace ~
4617 \begin_inset LatexCommand ref
4618 reference "fig:Length-distribution"
4625 \begin_layout Standard
4626 \begin_inset Float figure
4631 \begin_layout Standard
4635 \begin_layout Standard
4638 \begin_inset Graphics
4648 \begin_layout Standard
4649 \begin_inset Caption
4651 \begin_layout Standard
4652 \begin_inset LatexCommand label
4653 name "fig:Length-distribution"
4666 Length distribution of 630 piRNA like RNAs.
4674 \begin_layout Subsection
4675 How to plot a chromosome distribution?
4676 \begin_inset LatexCommand label
4677 name "sub:How-to-plot-chrdist"
4684 \begin_layout Standard
4685 If you have the result of a sequence search against a multi chromosome genome,
4686 it is very practical to be able to plot the distribution of search hits
4687 on the different chromosomes.
4688 This can be done with
4695 \begin_layout LyX-Code
4696 read_fasta --data_in=<file> | blat_genome | plot_chrdist --no_stream
4699 \begin_layout Standard
4700 The above example will result in a crude plot using the 'dumb' terminal,
4701 and if you want to mess around with the results from the BLAT search you
4702 probably want to save the result to file first (see\InsetSpace ~
4704 \begin_inset LatexCommand ref
4705 reference "sub:How-to-write-PSL"
4710 To plot the chromosome distribution from the saved search result you can
4714 \begin_layout LyX-Code
4715 read_bed --data_in=file.bed | plot_chrdist --terminal=post --result_out=plot.ps
4718 \begin_layout Standard
4719 That will result in the output show in Fig.\InsetSpace ~
4721 \begin_inset LatexCommand ref
4722 reference "fig:Chromosome-distribution"
4729 \begin_layout Standard
4730 \begin_inset Float figure
4735 \begin_layout Standard
4739 \begin_layout Standard
4742 \begin_inset Graphics
4753 \begin_layout Standard
4754 \begin_inset Caption
4756 \begin_layout Standard
4757 \begin_inset LatexCommand label
4758 name "fig:Chromosome-distribution"
4762 Chromosome distribution
4775 \begin_layout Subsection
4776 How to generate a dotplot?
4777 \begin_inset LatexCommand label
4778 name "sub:How-to-generate-dotplot"
4785 \begin_layout Standard
4786 A dotplot is a powerful way to get an overview of the size and location
4787 of sequence insertions, deletions, and duplications between two sequences.
4788 Generating a dotplot with biotools is a two step process where you initially
4789 find all matches between two sequences using the tool
4795 \begin_inset LatexCommand ref
4796 reference "sub:How-to-find-matches"
4800 ) and plot the resulting matches with
4805 Matching and plotting two
4809 genomes (1.7Mbp) takes around 10 seconds:
4812 \begin_layout LyX-Code
4814 | match_seq | plot_matches --terminal=post --result_out=plot.ps
4817 \begin_layout Standard
4818 The resulting dotplot is in Fig.\InsetSpace ~
4820 \begin_inset LatexCommand ref
4821 reference "fig:Dotplot"
4828 \begin_layout Standard
4829 \begin_inset Float figure
4834 \begin_layout Standard
4837 \begin_inset Graphics
4847 \begin_layout Standard
4848 \begin_inset Caption
4850 \begin_layout Standard
4851 \begin_inset LatexCommand label
4865 Forward matches are displayed in green while reverse matches are displayed
4874 \begin_layout Subsection
4875 How to plot a sequence logo?
4878 \begin_layout Standard
4879 Sequence logos can be generate with
4884 The sequnce type is determined automagically and an entropy scale of 2
4885 bits and 4 bits is used for nucleotide and peptide sequences, respectively
4889 \begin_layout Standard
4890 \begin_inset LatexCommand htmlurl
4891 target "http://www.ccrnp.ncifcrf.gov/~toms/paper/hawaii/latex/node5.html"
4903 \begin_layout LyX-Code
4905 | plot_seqlogo --no_stream --result_out=seqlogo.svg
4908 \begin_layout Standard
4909 An example of a sequence logo can be seen in Fig.\InsetSpace ~
4911 \begin_inset LatexCommand ref
4912 reference "fig:Sequence-logo"
4919 \begin_layout Standard
4920 \begin_inset Float figure
4925 \begin_layout Standard
4928 \begin_inset Graphics
4929 filename seqlogo.png
4938 \begin_layout Standard
4939 \begin_inset Caption
4941 \begin_layout Standard
4942 \begin_inset LatexCommand label
4943 name "fig:Sequence-logo"
4960 \begin_layout Subsection
4961 How to plot a karyogram?
4962 \begin_inset LatexCommand label
4963 name "sub:How-to-plot-karyogram"
4970 \begin_layout Standard
4971 To plot search hits on genomes use
4975 , which will output a nice karyogram in SVG graphics:
4978 \begin_layout LyX-Code
4980 | plot_karyogram --result_out=karyogram.svg
4983 \begin_layout Standard
4984 The banding data is taken from the UCSC genome browser database and currently
4985 only Human and Mouse is supported.
4988 \begin_inset LatexCommand ref
4989 reference "fig:Karyogram"
4993 shows the distribution of piRNA like RNAs matched to the Human genome.
4996 \begin_layout Standard
4997 \begin_inset Float figure
5002 \begin_layout Standard
5005 \begin_inset Graphics
5006 filename karyogram.png
5015 \begin_layout Standard
5016 \begin_inset Caption
5018 \begin_layout Standard
5019 \begin_inset LatexCommand label
5020 name "fig:Karyogram"
5033 Hits from a search of piRNA like RNAs in the Human genome is displayed as
5034 short horizontal bars.
5042 \begin_layout Section
5046 \begin_layout Subsection
5047 How do I display my results in the UCSC Genome Browser?
5050 \begin_layout Standard
5051 Results from the list of biotools below can be uploaded directly to a local
5052 mirror of the UCSC Genome Browser using the biotool
5059 \begin_layout Itemize
5061 \begin_inset LatexCommand eqref
5062 reference "sub:How-to-use-patscan"
5069 \begin_layout Itemize
5071 \begin_inset LatexCommand eqref
5072 reference "sub:How-to-use-BLAT"
5079 \begin_layout Itemize
5081 \begin_inset LatexCommand eqref
5082 reference "sub:How-to-use-BLAST"
5089 \begin_layout Itemize
5091 \begin_inset LatexCommand eqref
5092 reference "sub:How-to-use-Vmatch"
5099 \begin_layout Standard
5100 The syntax for uploading data the most simple way requires two mandatory
5105 \begin_layout Standard
5114 database, which is the UCSC database name (such as hg18, mm9, etc.) and
5118 \begin_layout Standard
5127 table which should be the users initials followed by an underscore and a
5128 short description of the data:
5131 \begin_layout LyX-Code
5133 | upload_to_ucsc --database=hg18 --table=mah_snoRNAs
5136 \begin_layout Standard
5141 biotool modifies the users ~/ucsc/my_tracks.ra file automagically (a backup
5142 is created with the name ~/ucsc/my_tracks.ra~) with default values that
5143 can be overridden using the following switches:
5146 \begin_layout Itemize
5150 \begin_layout Standard
5159 short_label - Short label for track - Default=database->table
5162 \begin_layout Itemize
5166 \begin_layout Standard
5175 long_label - Long label for track - Default=database->table
5178 \begin_layout Itemize
5182 \begin_layout Standard
5191 group - Track group name - Default=<user name as defined in env>
5194 \begin_layout Itemize
5198 \begin_layout Standard
5207 priority - Track display priority - Default=1
5210 \begin_layout Itemize
5214 \begin_layout Standard
5223 color - Track color - Default=147,73,42
5226 \begin_layout Itemize
5230 \begin_layout Standard
5239 chunk_size - Chunks for loading - Default=10000000
5242 \begin_layout Itemize
5246 \begin_layout Standard
5255 visibility - Track visibility - Default=pack
5258 \begin_layout Standard
5259 Also, data in BED or PSL format can be uploaded with
5263 as long as these reference to genomes and chromosomes existing in the UCSC
5267 \begin_layout LyX-Code
5268 read_bed --data_in=<bed file> | upload_to_ucsc ...
5271 \begin_layout LyX-Code
5275 \begin_layout LyX-Code
5276 read_psl --data_in=<psl file> | upload_to_ucsc ...
5279 \begin_layout Section
5283 \begin_layout Standard
5284 It is possible to do commandline scripting of biotool records using Perl.
5285 Because a biotool record essentially is a hash structure, you can pass
5290 command, which is a wrapper around the Perl executable that allows direct
5291 manipulations of the records using the power of Perl.
5294 \begin_layout Standard
5295 In the below example we replace in all records the value to the CHR key
5296 with a forthrunning number:
5299 \begin_layout LyX-Code
5301 | bioscript 'while($r=get_record(
5303 *STDIN)){$r->{CHR}=$i++; put_record($r)}'
5306 \begin_layout Standard
5307 Something more useful would probably be to create custom FASTA headers.
5309 if we read in a BED file, lookup the genomic sequence, create a custom
5314 and output FASTA entries:
5317 \begin_layout LyX-Code
5319 | bioscript 'while($r=get_record(
5321 *STDIN)){$r->{SEQ_NAME}= //
5324 \begin_layout LyX-Code
5325 join("_",$r->{CHR},$r->{CHR_BEG},$r->{CHR_END}); put_record($r)}'
5328 \begin_layout Standard
5332 \begin_layout LyX-Code
5333 >chr2L_21567527_21567550
5336 \begin_layout LyX-Code
5337 taccaaacggatgcctcagacatc
5340 \begin_layout LyX-Code
5341 >chr2L_693380_693403
5344 \begin_layout LyX-Code
5345 taccaaacggatgcctcagacatc
5348 \begin_layout LyX-Code
5349 >chr2L_13859534_13859557
5352 \begin_layout LyX-Code
5353 taccaaacggatgcctcagacatc
5356 \begin_layout LyX-Code
5357 >chr2L_9005090_9005113
5360 \begin_layout LyX-Code
5361 taccaaacggatgcctcagacatc
5364 \begin_layout LyX-Code
5365 >chr2L_2106825_2106848
5368 \begin_layout LyX-Code
5369 taccaaacggatgcctcagacatc
5372 \begin_layout LyX-Code
5373 >chr2L_14649031_14649054
5376 \begin_layout LyX-Code
5377 taccaaacggatgcctcagacatc
5380 \begin_layout Section
5384 \begin_layout Standard
5385 Shoot the messenger!
5388 \begin_layout Section
5391 \begin_inset LatexCommand label
5399 \begin_layout Standard
5403 \begin_layout Standard
5407 \begin_layout Standard
5411 \begin_layout Standard
5415 \begin_layout Standard
5419 \begin_layout Standard
5423 \begin_layout Section
5425 \begin_inset LatexCommand label
5433 \begin_layout Standard
5437 \begin_layout Standard
5449 \begin_layout Standard
5453 \begin_layout Standard
5465 \begin_layout Standard
5469 \begin_layout Standard
5481 \begin_layout Standard
5485 \begin_layout Standard
5497 \begin_layout Standard
5501 \begin_layout Standard
5513 \begin_layout Standard
5517 \begin_layout Standard
5529 \begin_layout Section
5530 scan_for_matches README
5531 \begin_inset LatexCommand label
5532 name "sec:scan_for_matches-README"
5539 \begin_layout LyX-Code
5543 \begin_layout LyX-Code
5544 A Program to Scan Nucleotide or Protein Sequences for Matching Patterns
5547 \begin_layout LyX-Code
5551 \begin_layout LyX-Code
5555 \begin_layout LyX-Code
5556 Argonne National Laboratory
5559 \begin_layout LyX-Code
5563 \begin_layout LyX-Code
5567 \begin_layout LyX-Code
5568 Scan_for_matches is a utility that we have written to search for
5571 \begin_layout LyX-Code
5572 patterns in DNA and protein sequences.
5573 I wrote most of the code,
5576 \begin_layout LyX-Code
5577 although David Joerg and Morgan Price wrote sections of an
5580 \begin_layout LyX-Code
5582 The whole notion of pattern matching has a rich
5585 \begin_layout LyX-Code
5586 history, and we borrowed liberally from many sources.
5590 \begin_layout LyX-Code
5591 worth noting that we were strongly influenced by the elegant tools
5594 \begin_layout LyX-Code
5595 developed and distributed by David Searls.
5596 My intent is to make the
5599 \begin_layout LyX-Code
5600 existing tool available to anyone in the research community that might
5603 \begin_layout LyX-Code
5605 I will continue to try to fix bugs and make suggested
5608 \begin_layout LyX-Code
5609 enhancements, at least until I feel that a superior tool exists.
5612 \begin_layout LyX-Code
5613 Hence, I would appreciate it if all bug reports and suggestions are
5616 \begin_layout LyX-Code
5617 directed to me at Overbeek@mcs.anl.gov.
5621 \begin_layout LyX-Code
5622 I will try to log all bug fixes and report them to users that send me
5625 \begin_layout LyX-Code
5626 their email addresses.
5627 I do not require that you give me your name
5630 \begin_layout LyX-Code
5632 However, if you do give it to me, I will try to notify
5635 \begin_layout LyX-Code
5636 you of serious problems as they are discovered.
5639 \begin_layout LyX-Code
5643 \begin_layout LyX-Code
5644 The distribution should contain at least the following programs:
5647 \begin_layout LyX-Code
5648 README - This document
5651 \begin_layout LyX-Code
5652 ggpunit.c - One of the two source files
5655 \begin_layout LyX-Code
5656 scan_for_matches.c - The second source file
5659 \begin_layout LyX-Code
5663 \begin_layout LyX-Code
5664 run_tests - A perl script to test things
5667 \begin_layout LyX-Code
5668 show_hits - A handy perl script
5671 \begin_layout LyX-Code
5672 test_dna_input - Test sequences for DNA
5675 \begin_layout LyX-Code
5676 test_dna_patterns - Test patterns for DNA scan
5679 \begin_layout LyX-Code
5680 test_output - Desired output from test
5683 \begin_layout LyX-Code
5684 test_prot_input - Test protein sequences
5687 \begin_layout LyX-Code
5688 test_prot_patterns - Test patterns for proteins
5691 \begin_layout LyX-Code
5692 testit - a perl script used for test
5695 \begin_layout LyX-Code
5696 Only the first three files are required.
5697 The others are useful,
5700 \begin_layout LyX-Code
5701 but only if you have Perl installed on your system.
5705 \begin_layout LyX-Code
5706 have Perl, I suggest that you type
5709 \begin_layout LyX-Code
5713 \begin_layout LyX-Code
5717 \begin_layout LyX-Code
5718 to find out where it installed.
5719 On my system, I get the following
5722 \begin_layout LyX-Code
5726 \begin_layout LyX-Code
5730 \begin_layout LyX-Code
5734 \begin_layout LyX-Code
5738 \begin_layout LyX-Code
5739 indicating that Perl is installed in /usr/local/bin.
5743 \begin_layout LyX-Code
5744 you know where it is installed, edit the first line of files
5747 \begin_layout LyX-Code
5751 \begin_layout LyX-Code
5755 \begin_layout LyX-Code
5756 replacing /usr/local/bin/perl with the appropriate location.
5760 \begin_layout LyX-Code
5761 will assume that you can do this, although it is not critical (it
5764 \begin_layout LyX-Code
5765 is needed only to test the installation and to use the "show_hits"
5768 \begin_layout LyX-Code
5770 Perl is not required to actually install and run
5773 \begin_layout LyX-Code
5778 \begin_layout LyX-Code
5779 If you do not have Perl, I suggest you get it and install it (it
5782 \begin_layout LyX-Code
5783 is a wonderful utility).
5784 Information about Perl and how to get it
5787 \begin_layout LyX-Code
5788 can be found in the book "Programming Perl" by Larry Wall and
5791 \begin_layout LyX-Code
5793 Schwartz, published by O'Reilly & Associates, Inc.
5796 \begin_layout LyX-Code
5797 To get started, you will need to compile the program.
5801 \begin_layout LyX-Code
5805 \begin_layout LyX-Code
5806 gcc -O -o scan_for_matches ggpunit.c scan_for_matches.c
5809 \begin_layout LyX-Code
5810 If you do not use GNU C, use
5813 \begin_layout LyX-Code
5814 cc -O -DCC -o scan_for_matches ggpunit.c scan_for_matches.c
5817 \begin_layout LyX-Code
5818 which works on my Sun.
5822 \begin_layout LyX-Code
5823 Once you have compiled scan_for_matches, you can verify that it
5826 \begin_layout LyX-Code
5830 \begin_layout LyX-Code
5831 clone% run_tests tmp
5834 \begin_layout LyX-Code
5835 clone% diff tmp test_output
5838 \begin_layout LyX-Code
5839 You may get a few strange lines of the sort
5842 \begin_layout LyX-Code
5843 clone% run_tests tmp
5846 \begin_layout LyX-Code
5847 rm: tmp: No such file or directory
5850 \begin_layout LyX-Code
5851 clone% diff tmp test_output
5854 \begin_layout LyX-Code
5855 These should cause no concern.
5856 However, if the "diff" shows that
5859 \begin_layout LyX-Code
5860 tmp and test_output are different, contact me (you have a
5863 \begin_layout LyX-Code
5868 \begin_layout LyX-Code
5869 You should now be able to use scan_for_matches by following the
5872 \begin_layout LyX-Code
5873 instructions given below (which is all the normal user should have
5876 \begin_layout LyX-Code
5877 to understand, once things are installed properly).
5880 \begin_layout LyX-Code
5881 ==============================================================
5884 \begin_layout LyX-Code
5885 How to run scan_for_matches:
5888 \begin_layout LyX-Code
5889 To run the program, you type need to create two files
5892 \begin_layout LyX-Code
5894 the first file contains the pattern you wish to scan for; I'll
5897 \begin_layout LyX-Code
5898 call this file pat_file in what follows (but any name is ok)
5901 \begin_layout LyX-Code
5903 the second file contains a set of sequences to scan.
5907 \begin_layout LyX-Code
5908 should be in "fasta format".
5909 Just look at the contents of
5912 \begin_layout LyX-Code
5913 test_dna_input to see examples of this format.
5917 \begin_layout LyX-Code
5918 each sequence begins with a line of the form
5921 \begin_layout LyX-Code
5925 \begin_layout LyX-Code
5926 and is followed by one or more lines containing the sequence.
5929 \begin_layout LyX-Code
5930 Once these files have been created, you just use
5933 \begin_layout LyX-Code
5934 scan_for_matches pat_file < input_file
5937 \begin_layout LyX-Code
5938 to scan all of the input sequences for the given pattern.
5942 \begin_layout LyX-Code
5943 example, suppose that pat_file contains a single line of the form
5946 \begin_layout LyX-Code
5950 \begin_layout LyX-Code
5954 \begin_layout LyX-Code
5955 scan_for_matches pat_file < test_dna_input
5958 \begin_layout LyX-Code
5959 should produce two "hits".
5960 When I run this on my machine, I get
5963 \begin_layout LyX-Code
5964 clone% scan_for_matches pat_file < test_dna_input
5967 \begin_layout LyX-Code
5971 \begin_layout LyX-Code
5972 cguaacc ggttaacc gguuacg
5975 \begin_layout LyX-Code
5979 \begin_layout LyX-Code
5980 CGUAACC GGTTAACC GGUUACG
5983 \begin_layout LyX-Code
5987 \begin_layout LyX-Code
5988 Simple Patterns Built by Matching Ranges and Reverse Complements
5991 \begin_layout LyX-Code
5992 Let me first explain this simple pattern:
5995 \begin_layout LyX-Code
5999 \begin_layout LyX-Code
6003 \begin_layout LyX-Code
6004 The pattern consists of three "pattern units" separated by spaces.
6007 \begin_layout LyX-Code
6008 The first pattern unit is
6011 \begin_layout LyX-Code
6015 \begin_layout LyX-Code
6016 which means "match 4 to 7 characters and call them p1".
6020 \begin_layout LyX-Code
6021 second pattern unit is
6024 \begin_layout LyX-Code
6028 \begin_layout LyX-Code
6029 which means "then match 3 to 8 characters".
6030 The last pattern unit
6033 \begin_layout LyX-Code
6037 \begin_layout LyX-Code
6041 \begin_layout LyX-Code
6042 which means "match the reverse complement of p1".
6046 \begin_layout LyX-Code
6047 reported hit is shown as
6050 \begin_layout LyX-Code
6054 \begin_layout LyX-Code
6055 cguaacc ggttaacc gguuacg
6058 \begin_layout LyX-Code
6059 which states that characters 6 through 27 of sequence tst1 were
6062 \begin_layout LyX-Code
6064 "cguaac" matched the first pattern unit, "ggttaacc" the
6067 \begin_layout LyX-Code
6068 second, and "gguuacg" the third.
6069 This is an example of a common
6072 \begin_layout LyX-Code
6073 type of pattern used to search for sections of DNA or RNA that
6076 \begin_layout LyX-Code
6077 would fold into a hairpin loop.
6080 \begin_layout LyX-Code
6081 Searching Both Strands
6084 \begin_layout LyX-Code
6085 Now for a short aside: scan_for_matches only searched the
6088 \begin_layout LyX-Code
6089 sequences in the input file; it did not search the opposite
6092 \begin_layout LyX-Code
6094 With a pattern of the sort we just used, there is not
6097 \begin_layout LyX-Code
6098 need o search the opposite strand.
6099 However, it is normally the
6102 \begin_layout LyX-Code
6103 case that you will wish to search both the sequence and the
6106 \begin_layout LyX-Code
6107 opposite strand (i.e., the reverse complement of the sequence).
6110 \begin_layout LyX-Code
6111 To do that, you would just use the "-c" command line.
6115 \begin_layout LyX-Code
6116 scan_for_matches -c pat_file < test_dna_input
6119 \begin_layout LyX-Code
6120 Hits on the opposite strand will show a beginning location greater
6123 \begin_layout LyX-Code
6124 than te end location of the match.
6127 \begin_layout LyX-Code
6128 Defining Pairing Rules and Allowing Mismatches, Insertions, and Deletions
6131 \begin_layout LyX-Code
6132 Let us stop now and ask "What additional features would one need to
6135 \begin_layout LyX-Code
6136 really find the kinds of loop structures that characterize tRNAs,
6139 \begin_layout LyX-Code
6140 rRNAs, and so forth?" I can immediately think of two:
6143 \begin_layout LyX-Code
6144 a) you will need to be able to allow non-standard pairings
6147 \begin_layout LyX-Code
6148 (those other than G-C and A-U), and
6151 \begin_layout LyX-Code
6152 b) you will need to be able to tolerate some number of
6155 \begin_layout LyX-Code
6156 mismatches and bulges.
6159 \begin_layout LyX-Code
6163 \begin_layout LyX-Code
6164 Let me first show you how to handle non-standard "rules for
6167 \begin_layout LyX-Code
6168 pairing in reverse complements".
6169 Consider the following pattern,
6172 \begin_layout LyX-Code
6173 which I show as two line (you may use as many lines as you like in
6176 \begin_layout LyX-Code
6177 forming a pattern, although you can only break a pattern at points
6180 \begin_layout LyX-Code
6181 where space would be legal):
6184 \begin_layout LyX-Code
6185 r1={au,ua,gc,cg,gu,ug,ga,ag}
6188 \begin_layout LyX-Code
6189 p1=2...3 0...4 p2=2...5 1...5 r1~p2 0...4 ~p1
6192 \begin_layout LyX-Code
6193 The first "pattern unit" does not actually match anything; rather,
6196 \begin_layout LyX-Code
6197 it defines a "pairing rule" in which standard pairings are
6200 \begin_layout LyX-Code
6201 allowed, as well as G-A and A-G (in case you wondered, Us and Ts
6204 \begin_layout LyX-Code
6205 and upper and lower case can be used interchangably; for example
6208 \begin_layout LyX-Code
6209 r1={AT,UA,gc,cg} could be used to define the "standard rule" for
6212 \begin_layout LyX-Code
6214 The second line consists of six pattern units which
6217 \begin_layout LyX-Code
6218 may be interpreted as follows:
6221 \begin_layout LyX-Code
6222 p1=2...3 match 2 or 3 characters (call it p1)
6225 \begin_layout LyX-Code
6226 0...4 match 0 to 4 characters
6229 \begin_layout LyX-Code
6230 p2=2...5 match 2 to 5 characters (call it p2)
6233 \begin_layout LyX-Code
6234 1...5 match 1 to 5 characters
6237 \begin_layout LyX-Code
6238 r1~p2 match the reverse complement of p2,
6241 \begin_layout LyX-Code
6242 allowing G-A and A-G pairs
6245 \begin_layout LyX-Code
6246 0...4 match 0 to 4 characters
6249 \begin_layout LyX-Code
6250 ~p1 match the reverse complement of p1
6253 \begin_layout LyX-Code
6254 allowing only G-C, C-G, A-T, and T-A pairs
6257 \begin_layout LyX-Code
6258 Thus, r1~p2 means "match the reverse complement of p2 using rule r1".
6261 \begin_layout LyX-Code
6262 Now let us consider the issue of tolerating mismatches and bulges.
6265 \begin_layout LyX-Code
6266 You may add a "qualifier" to the pattern unit that gives the
6269 \begin_layout LyX-Code
6270 tolerable number of "mismatches, deletions, and insertions".
6273 \begin_layout LyX-Code
6277 \begin_layout LyX-Code
6278 p1=10...10 3...8 ~p1[1,2,1]
6281 \begin_layout LyX-Code
6282 means that the third pattern unit must match 10 characters,
6285 \begin_layout LyX-Code
6286 allowing one "mismatch" (a pairing other than G-C, C-G, A-T, or
6289 \begin_layout LyX-Code
6290 T-A), two deletions (a deletion is a character that occurs in p1,
6293 \begin_layout LyX-Code
6294 but has been "deleted" from the string matched by ~p1), and one
6297 \begin_layout LyX-Code
6298 insertion (an "insertion" is a character that occurs in the string
6301 \begin_layout LyX-Code
6302 matched by ~p1, but not for which no corresponding character
6305 \begin_layout LyX-Code
6307 In this case, the pattern would match
6310 \begin_layout LyX-Code
6311 ACGTACGTAC GGGGGGGG GCGTTACCT
6314 \begin_layout LyX-Code
6315 which is, you must admit, a fairly weak loop.
6319 \begin_layout LyX-Code
6320 allow mismatches, but you will find yourself using insertions and
6323 \begin_layout LyX-Code
6324 deletions much more rarely.
6325 In any event, you should note that
6328 \begin_layout LyX-Code
6329 allowing mismatches, insertions, and deletions does force the
6332 \begin_layout LyX-Code
6333 program to try many additional possible pairings, so it does slow
6336 \begin_layout LyX-Code
6340 \begin_layout LyX-Code
6341 How Patterns Are Matched
6344 \begin_layout LyX-Code
6345 Now is as good a time as any to discuss the basic flow of control
6348 \begin_layout LyX-Code
6349 when matching patterns.
6350 Recall that a "pattern" is a sequence of
6353 \begin_layout LyX-Code
6355 Suppose that the pattern units were
6358 \begin_layout LyX-Code
6363 \begin_layout LyX-Code
6364 The scan of a sequence S begins by setting the current position
6367 \begin_layout LyX-Code
6369 Then, an attempt is made to match u1 starting at the
6372 \begin_layout LyX-Code
6374 Each attempt to match a pattern unit can
6377 \begin_layout LyX-Code
6379 If it succeeds, then an attempt is made to match
6382 \begin_layout LyX-Code
6384 If it fails, then an attempt is made to find an
6387 \begin_layout LyX-Code
6388 alternative match for the immediately preceding pattern unit.
6392 \begin_layout LyX-Code
6393 this succeeds, then we proceed forward again to the next unit.
6397 \begin_layout LyX-Code
6398 it fails we go back to the preceding unit.
6399 This process is called
6402 \begin_layout LyX-Code
6404 If there are no previous units, then the current
6407 \begin_layout LyX-Code
6408 position is incremented by one, and everything starts again.
6412 \begin_layout LyX-Code
6413 proceeds until either the current position goes past the end of
6416 \begin_layout LyX-Code
6417 the sequence or all of the pattern units succeed.
6421 \begin_layout LyX-Code
6422 scan_for_matches reports the "hit", the current position is set
6425 \begin_layout LyX-Code
6426 just past the hit, and an attempt is made to find another hit.
6429 \begin_layout LyX-Code
6430 If you wish to limit the scan to simply finding a maximum of, say,
6433 \begin_layout LyX-Code
6434 10 hits, you can use the -n option (-n 10 would set the limit to
6437 \begin_layout LyX-Code
6442 \begin_layout LyX-Code
6443 scan_for_matches -c -n 1 pat_file < test_dna_input
6446 \begin_layout LyX-Code
6447 would search for just the first hit (and would stop searching the
6450 \begin_layout LyX-Code
6451 current sequences or any that follow in the input file).
6454 \begin_layout LyX-Code
6455 Searching for repeats:
6458 \begin_layout LyX-Code
6459 In the last section, I discussed almost all of the details
6462 \begin_layout LyX-Code
6463 required to allow you to look for repeats.
6464 Consider the following
6467 \begin_layout LyX-Code
6471 \begin_layout LyX-Code
6472 p1=6...6 3...8 p1 (find exact 6 character repeat separated
6475 \begin_layout LyX-Code
6479 \begin_layout LyX-Code
6480 p1=6...6 3..8 p1[1,0,0] (allow one mismatch)
6483 \begin_layout LyX-Code
6484 p1=3...3 p1[1,0,0] p1[1,0,0] p1[1,0,0]
6487 \begin_layout LyX-Code
6488 (match 12 characters that are the remains
6491 \begin_layout LyX-Code
6492 of a 3-character sequence occurring 4 times)
6495 \begin_layout LyX-Code
6499 \begin_layout LyX-Code
6500 p1=4...8 0...3 p2=6...8 p1 0...3 p2
6503 \begin_layout LyX-Code
6504 (This would match things like
6507 \begin_layout LyX-Code
6508 ATCT G TCTTT ATCT TG TCTTT
6511 \begin_layout LyX-Code
6515 \begin_layout LyX-Code
6516 Searching for particular sequences:
6519 \begin_layout LyX-Code
6520 Occasionally, one wishes to match a specific, known sequence.
6523 \begin_layout LyX-Code
6524 In such a case, you can just give the sequence (along with an
6527 \begin_layout LyX-Code
6528 optional statement of the allowable mismatches, insertions, and
6531 \begin_layout LyX-Code
6536 \begin_layout LyX-Code
6537 p1=6...8 GAGA ~p1 (match a hairpin with GAGA as the loop)
6540 \begin_layout LyX-Code
6541 RRRRYYYY (match 4 purines followed by 4 pyrimidines)
6544 \begin_layout LyX-Code
6545 TATAA[1,0,0] (match TATAA, allowing 1 mismatch)
6548 \begin_layout LyX-Code
6552 \begin_layout LyX-Code
6553 Matches against a "weight matrix":
6556 \begin_layout LyX-Code
6557 I will conclude my examples of the types of pattern units
6560 \begin_layout LyX-Code
6561 available for matching against nucleotide sequences by discussing a
6564 \begin_layout LyX-Code
6565 crude implemetation of matching using a "weight matrix".
6569 \begin_layout LyX-Code
6570 am less than overwhelmed with the syntax that I chose, I think that
6573 \begin_layout LyX-Code
6574 the reader should be aware that I was thinking of generating
6577 \begin_layout LyX-Code
6578 patterns containing such pattern units automatically from
6581 \begin_layout LyX-Code
6582 alignments (and did not really plan on typing such things in by
6585 \begin_layout LyX-Code
6587 Anyway, suppose that you wanted to match a
6590 \begin_layout LyX-Code
6591 sequence of eight characters.
6592 The "consensus" of these eight
6595 \begin_layout LyX-Code
6596 characters is GRCACCGS, but the actual "frequencies of occurrence"
6599 \begin_layout LyX-Code
6600 are given in the matrix below.
6601 Thus, the first character is an A
6604 \begin_layout LyX-Code
6605 16% the time and a G 84% of the time.
6606 The second is an A 57% of
6609 \begin_layout LyX-Code
6610 the time, a C 10% of the time, a G 29% of the time, and a T 4% of
6613 \begin_layout LyX-Code
6618 \begin_layout LyX-Code
6619 C1 C2 C3 C4 C5 C6 C7 C8
6622 \begin_layout LyX-Code
6626 \begin_layout LyX-Code
6627 A 16 57 0 95 0 18 0 0
6630 \begin_layout LyX-Code
6631 C 0 10 80 0 100 60 0 50
6634 \begin_layout LyX-Code
6635 G 84 29 0 0 0 20 100 50
6638 \begin_layout LyX-Code
6642 \begin_layout LyX-Code
6646 \begin_layout LyX-Code
6647 One could use the following pattern unit to search for inexact
6650 \begin_layout LyX-Code
6651 matches related to such a "weight matrix":
6654 \begin_layout LyX-Code
6655 {(16,0,84,0),(57,10,29,4),(0,80,0,20),(95,0,0,5),
6658 \begin_layout LyX-Code
6659 (0,100,0,0),(18,60,20,2),(0,0,100,0),(0,50,50,0)} > 450
6662 \begin_layout LyX-Code
6663 This pattern unit will attempt to match exactly eight characters.
6666 \begin_layout LyX-Code
6667 For each character in the sequence, the entry in the corresponding
6670 \begin_layout LyX-Code
6671 tuple is added to an accumulated sum.
6672 If the sum is greater than
6675 \begin_layout LyX-Code
6676 450, the match succeeds; else it fails.
6679 \begin_layout LyX-Code
6680 Recently, this feature was upgraded to allow ranges.
6684 \begin_layout LyX-Code
6685 600 > {(16,0,84,0),(57,10,29,4),(0,80,0,20),(95,0,0,5),
6688 \begin_layout LyX-Code
6689 (0,100,0,0),(18,60,20,2),(0,0,100,0),(0,50,50,0)} > 450
6692 \begin_layout LyX-Code
6696 \begin_layout LyX-Code
6697 Allowing Alternatives:
6700 \begin_layout LyX-Code
6701 Very occasionally, you may wish to allow alternative pattern units
6704 \begin_layout LyX-Code
6705 (i.e., "match either A or B").
6706 You can do this using something
6709 \begin_layout LyX-Code
6713 \begin_layout LyX-Code
6717 \begin_layout LyX-Code
6718 which says "match either GAGA or GCGCA".
6722 \begin_layout LyX-Code
6723 alternatives of a list of pattern units, for example
6726 \begin_layout LyX-Code
6727 (p1=3...3 3...8 ~p1 | p1=5...5 4...4 ~p1 GGG)
6730 \begin_layout LyX-Code
6731 would match one of two sequences of pattern units.
6735 \begin_layout LyX-Code
6736 clumsy aspect of the syntax: to match a list of alternatives, you
6739 \begin_layout LyX-Code
6740 need to fully the request.
6744 \begin_layout LyX-Code
6745 (GAGA | (GCGCA | TTCGA))
6748 \begin_layout LyX-Code
6749 would be needed to try the three alternatives.
6752 \begin_layout LyX-Code
6756 \begin_layout LyX-Code
6757 Sometimes a pattern will contain a sequence of distinct ranges,
6760 \begin_layout LyX-Code
6761 and you might wish to limit the sum of the lengths of the matched
6764 \begin_layout LyX-Code
6766 For example, suppose that you basically wanted to
6769 \begin_layout LyX-Code
6770 match something like
6773 \begin_layout LyX-Code
6774 ARRYYTT p1=0...5 GCA[1,0,0] p2=1...6 ~p1 4...8 ~p2 p3=4...10 CCT
6777 \begin_layout LyX-Code
6778 but that the sum of the lengths of p1, p2, and p3 must not exceed
6781 \begin_layout LyX-Code
6783 To do this, you could add
6786 \begin_layout LyX-Code
6787 length(p1+p2+p3) < 9
6790 \begin_layout LyX-Code
6791 as the last pattern unit.
6792 It will just succeed or fail (but does
6795 \begin_layout LyX-Code
6796 not actually match any characters in the sequence).
6799 \begin_layout LyX-Code
6803 \begin_layout LyX-Code
6804 Matching Protein Sequences
6807 \begin_layout LyX-Code
6808 Suppose that the input file contains protein sequences.
6812 \begin_layout LyX-Code
6813 case, you must invoke scan_for_matches with the "-p" option.
6817 \begin_layout LyX-Code
6818 cannot use aspects of the language that relate directly to
6821 \begin_layout LyX-Code
6822 nucleotide sequences (e.g., the -c command line option or pattern
6825 \begin_layout LyX-Code
6826 constructs referring to the reverse complement of a previously
6829 \begin_layout LyX-Code
6834 \begin_layout LyX-Code
6835 You also have two additional constructs that allow you to match
6838 \begin_layout LyX-Code
6839 either "one of a set of amino acids" or "any amino acid other than
6842 \begin_layout LyX-Code
6847 \begin_layout LyX-Code
6848 p1=0...4 any(HQD) 1...3 notany(HK) p1
6851 \begin_layout LyX-Code
6852 would successfully match a string like
6855 \begin_layout LyX-Code
6859 \begin_layout LyX-Code
6860 Using the show_hits Utility
6863 \begin_layout LyX-Code
6864 When viewing a large set of complex matches, you might find it
6867 \begin_layout LyX-Code
6868 convenient to post-process the scan_for_matches output to get a
6871 \begin_layout LyX-Code
6872 more readable version.
6873 We provide a simple post-processor called
6876 \begin_layout LyX-Code
6878 To see its effect, just pipe the output of a
6881 \begin_layout LyX-Code
6882 scan_for_matches into show_hits:
6885 \begin_layout LyX-Code
6889 \begin_layout LyX-Code
6890 clone% scan_for_matches -c pat_file < tmp
6893 \begin_layout LyX-Code
6897 \begin_layout LyX-Code
6898 gtacguaacc ggttaac cgguuacgtac
6901 \begin_layout LyX-Code
6905 \begin_layout LyX-Code
6906 gtacgtaacc ggttaac cggttacgtac
6909 \begin_layout LyX-Code
6913 \begin_layout LyX-Code
6914 CGTACGUAAC C GGTTAACC GGUUACGTACG
6917 \begin_layout LyX-Code
6921 \begin_layout LyX-Code
6922 CGTACGTAAC C GGTTAACC GGTTACGTACG
6925 \begin_layout LyX-Code
6929 \begin_layout LyX-Code
6930 gtacguaacc g gttaactt cgguuacgtac
6933 \begin_layout LyX-Code
6937 \begin_layout LyX-Code
6938 gtacgtaacc g aagttaac cggttacgtac
6941 \begin_layout LyX-Code
6942 Piped Through show_hits:
6945 \begin_layout LyX-Code
6949 \begin_layout LyX-Code
6950 clone% scan_for_matches -c pat_file < tmp | show_hits
6953 \begin_layout LyX-Code
6954 tst1:[1,28]: gtacguaacc ggttaac cgguuacgtac
6957 \begin_layout LyX-Code
6958 tst1:[28,1]: gtacgtaacc ggttaac cggttacgtac
6961 \begin_layout LyX-Code
6962 tst2:[2,31]: CGTACGUAAC C GGTTAACC GGUUACGTACG
6965 \begin_layout LyX-Code
6966 tst2:[31,2]: CGTACGTAAC C GGTTAACC GGTTACGTACG
6969 \begin_layout LyX-Code
6970 tst3:[3,32]: gtacguaacc g gttaactt cgguuacgtac
6973 \begin_layout LyX-Code
6974 tst3:[32,3]: gtacgtaacc g aagttaac cggttacgtac
6977 \begin_layout LyX-Code
6981 \begin_layout LyX-Code
6982 Optionally, you can specify which of the "fields" in the matches
6985 \begin_layout LyX-Code
6986 you wish to sort on, and show_hits will sort them.
6990 \begin_layout LyX-Code
6991 numbers start with 0.
6992 So, you might get something like
6995 \begin_layout LyX-Code
6996 clone% scan_for_matches -c pat_file < tmp | show_hits 2 1
6999 \begin_layout LyX-Code
7000 tst2:[2,31]: CGTACGUAAC C GGTTAACC GGUUACGTACG
7003 \begin_layout LyX-Code
7004 tst2:[31,2]: CGTACGTAAC C GGTTAACC GGTTACGTACG
7007 \begin_layout LyX-Code
7008 tst3:[32,3]: gtacgtaacc g aagttaac cggttacgtac
7011 \begin_layout LyX-Code
7012 tst1:[1,28]: gtacguaacc ggttaac cgguuacgtac
7015 \begin_layout LyX-Code
7016 tst1:[28,1]: gtacgtaacc ggttaac cggttacgtac
7019 \begin_layout LyX-Code
7020 tst3:[3,32]: gtacguaacc g gttaactt cgguuacgtac
7023 \begin_layout LyX-Code
7027 \begin_layout LyX-Code
7028 In this case, the hits have been sorted on fields 2 and 1 (that is,
7031 \begin_layout LyX-Code
7032 the third and second matched subfields).
7035 \begin_layout LyX-Code
7036 show_hits is just one possible little post-processor, and you
7039 \begin_layout LyX-Code
7040 might well wish to write a customized one for yourself.
7043 \begin_layout LyX-Code
7044 Reducing the Cost of a Search
7047 \begin_layout LyX-Code
7048 The scan_for_matches utility uses a fairly simple search, and may
7051 \begin_layout LyX-Code
7052 consume large amounts of CPU time for complex patterns.
7056 \begin_layout LyX-Code
7057 I may decide to optimize the code.
7058 However, until then, let me
7061 \begin_layout LyX-Code
7062 mention one useful technique.
7066 \begin_layout LyX-Code
7067 When you have a complex pattern that includes a number of varying
7070 \begin_layout LyX-Code
7071 ranges, imprecise matches, and so forth, it is useful to
7074 \begin_layout LyX-Code
7076 That is, form a simpler pattern that can be
7079 \begin_layout LyX-Code
7080 used to scan through a large database extracting sections that
7083 \begin_layout LyX-Code
7084 might be matched by the more complex pattern.
7088 \begin_layout LyX-Code
7089 with a short example.
7090 Suppose that you really wished to match the
7093 \begin_layout LyX-Code
7097 \begin_layout LyX-Code
7098 p1=3...5 0...8 ~p1[1,1,0] p2=6...7 3...6 AGC 3...5 RYGC ~p2[1,0,0]
7101 \begin_layout LyX-Code
7102 In this case, the pattern units AGC 3...5 RYGC can be used to rapidly
7105 \begin_layout LyX-Code
7106 constrain the overall search.
7107 You can preprocess the overall
7110 \begin_layout LyX-Code
7111 database using the pattern:
7114 \begin_layout LyX-Code
7115 31...31 AGC 3...5 RYGC 7...7
7118 \begin_layout LyX-Code
7119 Put the complex pattern in pat_file1 and the simpler pattern in
7122 \begin_layout LyX-Code
7127 \begin_layout LyX-Code
7128 scan_for_matches -c pat_file2 < nucleotide_database |
7131 \begin_layout LyX-Code
7132 scan_for_matches pat_file1
7135 \begin_layout LyX-Code
7136 The output will show things like
7139 \begin_layout LyX-Code
7140 >seqid:[232,280][2,47]
7143 \begin_layout LyX-Code
7147 \begin_layout LyX-Code
7148 Then, the actual section of the sequence that was matched can be
7151 \begin_layout LyX-Code
7152 easily computed as [233,278] (remember, the positions start from
7155 \begin_layout LyX-Code
7159 \begin_layout LyX-Code
7160 Let me finally add, you should do a few short experiments to see
7163 \begin_layout LyX-Code
7164 whether or not such pipelining actually improves performance -- it
7167 \begin_layout LyX-Code
7168 is not always obvious where the time is going, and I have
7171 \begin_layout LyX-Code
7172 sometimes found that the added complexity of pipelining actually
7175 \begin_layout LyX-Code
7177 It gets its best improvements when there are
7180 \begin_layout LyX-Code
7181 exact matches of more than just a few characters that can be
7184 \begin_layout LyX-Code
7185 rapidly used to eliminate large sections of the database.
7188 \begin_layout LyX-Code
7192 \begin_layout LyX-Code
7196 \begin_layout LyX-Code
7197 Feb 9, 1995: the pattern units ^ and $ now work as in normal regular
7200 \begin_layout LyX-Code
7205 \begin_layout LyX-Code
7209 \begin_layout LyX-Code
7210 matches only TTF at the end of the string and
7213 \begin_layout LyX-Code
7217 \begin_layout LyX-Code
7218 matches only an initial TTF
7221 \begin_layout LyX-Code
7225 \begin_layout LyX-Code
7229 \begin_layout LyX-Code
7230 matches the reverse of the string named p1.
7234 \begin_layout LyX-Code
7235 if p1 matched GCAT, then <p1 would match TACG.
7239 \begin_layout LyX-Code
7243 \begin_layout LyX-Code
7244 matches a real palindrome (not the biologically common
7247 \begin_layout LyX-Code
7248 meaning of "reverse complement")
7251 \begin_layout LyX-Code