1 #LyX 1.5.1 created this file. For more info see http://www.lyx.org/
7 \usepackage[colorlinks=true, urlcolor=blue, linkcolor=black]{hyperref}
13 \font_typewriter default
14 \font_default_family default
20 \paperfontsize default
28 \paperorientation portrait
31 \paragraph_separation skip
33 \quotes_language english
36 \paperpagestyle default
37 \tracking_changes false
53 \begin_layout Publishers
56 Institute for Molecular Bioscience
58 University of Queensland
63 E-mail: mail@maasha.dk
66 \begin_layout Standard
70 \begin_layout Standard
82 \begin_layout Standard
88 \begin_layout Standard
89 \begin_inset LatexCommand tableofcontents
96 \begin_layout Standard
97 \begin_inset FloatList figure
104 \begin_layout Standard
110 \begin_layout Section
114 \begin_layout Standard
115 Biopieces is a collection of simple tools that can be linked together (piped
116 as we shall call it) in a very flexible manner to perform both simple and
118 The fundamental idea is that biopieces work on a data stream that will
119 only terminate at the end of an analysis and that this data stream can
120 be passed through several different biopieces, each performing one specific
122 The advantage of this approach is that a user can perform simple and complex
123 tasks without having to write advanced code.
124 Moreover, since the data format used to pass data between biopieces is
125 text based, biopieces can be written by different developers in their favorite
126 programming language --- and still the biopieces will be able to work together.
129 \begin_layout Standard
130 In the most simple form bioools can be piped together on the command line
131 like this (using the pipe character '|'):
134 \begin_layout LyX-Code
135 read_data | calculate_something | write_result
138 \begin_layout Standard
139 However, a more comprehensive analysis could be composed:
142 \begin_layout LyX-Code
143 read_data | select_entries | convert_entries | search_database
146 \begin_layout LyX-Code
147 evaluate_results | plot_diagram | plot_another_diagram |
150 \begin_layout LyX-Code
154 \begin_layout Standard
155 The data stream that is piped through the biopieces consists of records
156 of key/value pairs in the same way a hash does in order to keep as simple
157 a structure as possible.
158 An example record can be seen below:
161 \begin_layout LyX-Code
165 \begin_layout LyX-Code
171 \begin_layout LyX-Code
177 \begin_layout LyX-Code
183 \begin_layout LyX-Code
189 \begin_layout LyX-Code
195 \begin_layout LyX-Code
201 \begin_layout LyX-Code
207 \begin_layout LyX-Code
213 \begin_layout LyX-Code
219 \begin_layout Standard
224 \begin_layout Standard
235 ' denotes the delimiter of the records, and each key is a word followed
236 by a ':' and a white-space and then the value.
237 By convention the biopieces only uses upper case keys (a list of used keys
238 can be seen in Appendix\InsetSpace ~
240 \begin_inset LatexCommand ref
246 Since the records basically are hash structures this mean that the order
247 of the keys in the stream is unordered, and in the above example it is
248 pure coincidence that HIT_BEG is displayed before HIT_END, however, when
249 the order of the keys is importent, the biopieces will automagically see
253 \begin_layout Standard
254 All of the biopieces are able to read and write a data stream to and from
255 file as long as the records are in the biopieces format.
256 This means that if you are undertaking a lengthy analysis where one of
257 the steps is time consuming, you may save the stream after this step, and
258 subsequently start one or more analysis from that last step
262 \begin_layout Standard
263 It is a goal that the biopieces at some point will be able to dump the data
264 stream to file in case one of the tools fail critically.
270 If you are running a lengthy analysis it is highly recommended that you
271 create a small test sample of the data and run that through the pipeline
272 --- and once you are satisfied with the result proceed with the full data
273 set (see\InsetSpace ~
275 \begin_inset LatexCommand ref
276 reference "sub:How-to-select-a-few-records"
283 \begin_layout Standard
284 All of the biopieces can be supplied with long arguments prefixed with
288 \begin_layout Standard
297 switches or single character arguments prefixed with - switches that can
298 be grouped together (e.g.
300 In this cookbook only the long switches are used to emphasize what these
304 \begin_layout Section
308 \begin_layout Standard
309 In order to get the biopieces to work, you need to add environment settings
310 to include the code binaries, scripts, and modules that constitute the
312 Assuming that you are using bash, add the following to your '~/.bashrc'
313 file using your favorite editor.
314 After the changes has been saved you need to either run 'source ~/.bashrc'
318 \begin_layout LyX-Code
321 if [ -f "/home/m.hansen/maasha/conf/bashrc" ]; then
324 \begin_layout LyX-Code
327 source "/home/m.hansen/maasha/conf/bashrc"
330 \begin_layout LyX-Code
336 \begin_layout Section
340 \begin_layout Standard
345 lists all the biopieces along with a description:
348 \begin_layout LyX-Code
354 \begin_layout LyX-Code
357 align_seq Align sequences in stream using Muscle.
360 \begin_layout LyX-Code
363 analyze_seq Analysis the residue composition of each sequence
367 \begin_layout LyX-Code
370 analyze_vals Determine type, count, min, max, sum and mean for
374 \begin_layout LyX-Code
377 blast_seq BLAST sequences in stream against a specified database.
380 \begin_layout LyX-Code
383 blat_seq BLAT sequences in stream against a specified genome.
386 \begin_layout LyX-Code
389 complement_seq Complement sequences in stream.
392 \begin_layout LyX-Code
395 count_records Count the number of records in stream.
398 \begin_layout LyX-Code
401 count_seq Count sequences in stream.
404 \begin_layout LyX-Code
407 count_vals Count the number of times values of given keys exists
411 \begin_layout LyX-Code
414 create_blast_db Create a BLAST database from sequences in stream for
418 \begin_layout LyX-Code
424 \begin_layout Standard
425 To list the biopieces for writing different formats, you can use unix's
429 \begin_layout LyX-Code
432 list_biopieces | grep write
435 \begin_layout LyX-Code
438 write_align Write aligned sequences in pretty alignment format.
441 \begin_layout LyX-Code
444 write_bed Write records from stream as BED lines.
447 \begin_layout LyX-Code
450 write_blast Write BLAST records from stream in BLAST tabular format
454 \begin_layout LyX-Code
457 write_fasta Write sequences in FASTA format.
460 \begin_layout LyX-Code
463 write_psl Write records from stream in PSL format.
466 \begin_layout LyX-Code
469 write_tab Write records from stream as tab separated table.
472 \begin_layout Standard
473 In order to find out how a specific biopiece works, you just type the program
474 name without any arguments and press return and the usage of the biopiece
484 \begin_layout Standard
485 \begin_inset Box Frameless
494 height_special "totalheight"
497 \begin_layout LyX-Code
500 Program name: read_fasta
503 \begin_layout LyX-Code
506 Author: Martin Asser Hansen - Copyright (C) - All rights reserved
509 \begin_layout LyX-Code
512 Contact: mail@maasha.dk
515 \begin_layout LyX-Code
521 \begin_layout LyX-Code
524 License: GNU General Public License version 2 (http://www.gnu.org/copyleft/
528 \begin_layout LyX-Code
531 Description: Read FASTA entries.
534 \begin_layout LyX-Code
537 Usage: read_fasta [options] -i <FASTA file(s)>
540 \begin_layout LyX-Code
546 \begin_layout LyX-Code
549 [-i <file(s)> | --data_in=<file(s)>] - Comma separated list of files
553 \begin_layout LyX-Code
556 [-n <int> | --num=<int>] - Limit number of records to read.
559 \begin_layout LyX-Code
562 [-I <file> | --stream_in=<file>] - Read input stream from file
566 \begin_layout LyX-Code
569 [-O <file> | --stream_out=<file>] - Write output stream to file
573 \begin_layout LyX-Code
577 \begin_layout LyX-Code
583 \begin_layout LyX-Code
586 read_fasta -i test.fna - Read FASTA entries from file.
589 \begin_layout LyX-Code
592 read_fasta -i test1.fna,test2,fna - Read FASTA entries from files.
595 \begin_layout LyX-Code
598 read_fasta -i '*.fna' - Read FASTA entries from files.
601 \begin_layout LyX-Code
604 read_fasta -i test.fna -n 10 - Read first 10 FASTA entries from
613 \begin_layout Section
617 \begin_layout Subsection
618 How to read the data stream from file?
619 \begin_inset LatexCommand label
620 name "sub:How-to-read-stream"
627 \begin_layout Standard
628 You want to read a data stream that you previously have saved to file in
630 This can be done implicetly or explicitly.
631 The implicit way uses the 'stdout' stream of the Unix terminal:
634 \begin_layout LyX-Code
638 \begin_layout Standard
639 cat is the Unix command that reads a file and output the result to 'stdout'
640 --- which in this case is piped to any biopiece represented by the <biopiece>.
641 It is also possible to read the data stream using '<' to direct the 'stdout'
642 stream into the biopiece like this:
645 \begin_layout LyX-Code
649 \begin_layout Standard
650 However, that will not work if you pipe more biopieces together.
651 Then it is much safer to read the stream from a file explicitly like this:
654 \begin_layout LyX-Code
655 <biopiece> --stream_in=<file>
658 \begin_layout Standard
659 Here the filename <file> is explicetly given to the biopiece <biopiece>
664 \begin_layout Standard
674 This switch works with all biopieces.
675 It is also possible to read in data from multiple sources by repeating
676 the explicit read step:
679 \begin_layout LyX-Code
680 <biopiece> --stream_in=<file1> | <biopiece> --stream_in=<file2>
683 \begin_layout Subsection
684 How to write the data stream to file?
685 \begin_inset LatexCommand label
686 name "sub:How-to-write-stream"
693 \begin_layout Standard
694 In order to save the output stream from a biopiece to file, so you can read
695 in the stream again at a later time, you can do one of two things:
698 \begin_layout LyX-Code
702 \begin_layout Standard
703 All, the biopieces write the data stream to 'stdout' by default which can
704 be written to a file by redirecting 'stdout' to file using '>' , however,
705 if one of the biopieces for writing other formats is used then the both
706 the biopieces records as well as the result output will go to 'stdout'
707 in a mixture causing havock! To avoid this you must use the switch
711 \begin_layout Standard
720 stream_out that explictly tells the biopiece to write the output stream
724 \begin_layout LyX-Code
725 <biopiece> --stream_out=<file>
728 \begin_layout Standard
733 \begin_layout Standard
742 stream_out switch works with all biopieces.
745 \begin_layout Subsection
746 How to terminate the data stream?
749 \begin_layout Standard
750 The data stream is never stops unless the user want to save the stream or
755 \begin_layout Standard
764 no_stream switch that will terminate the stream:
767 \begin_layout LyX-Code
768 <biopiece> --no_stream
771 \begin_layout Standard
776 \begin_layout Standard
785 no_stream switch only works with those biopieces where it makes sense that
786 the user might want to terminale the data stream,
791 after an analysis step where the user wants to output the result, but not
795 \begin_layout Subsection
796 How to write my results to file?
797 \begin_inset LatexCommand label
798 name "sub:How-to-write-result"
805 \begin_layout Standard
806 Saving the result of an analysis to file can be done implicitly or explicitly.
810 \begin_layout LyX-Code
811 <biopiece> --no_stream > <file>
814 \begin_layout Standard
815 If you use '>' to redirect 'stdout' to file then it is important to use
820 \begin_layout Standard
829 no_stream switch to avoid writing a mix of biopieces records and result
830 to the same file causing havock.
831 The safe way is to use the
835 \begin_layout Standard
844 result_out switch which explicetly tells the biopiece to write the result
848 \begin_layout LyX-Code
849 <biopiece> --result_out=<file>
852 \begin_layout Standard
853 Using the above method will not terminate the stream, so it is possible
854 to pipe that into another biopiece generating different results:
857 \begin_layout LyX-Code
858 <biopiece1> --result_out=<file1> | <biopiece2> --result_out=<file2>
861 \begin_layout Standard
862 And still the data stream will continue unless terminated with
866 \begin_layout Standard
878 \begin_layout LyX-Code
879 <biopiece> --result_out=<file> --no_stream
882 \begin_layout Standard
883 Or written to file using implicitly or explicity
884 \begin_inset LatexCommand eqref
885 reference "sub:How-to-write-result"
893 \begin_layout LyX-Code
894 <biopiece> --result_out=<file1> --stream_out=<file2>
897 \begin_layout Subsection
898 How to read data from multiple sources?
901 \begin_layout Standard
902 To read multiple data sources, with the same type or different type of data
906 \begin_layout LyX-Code
907 <biopiece1> --data_in=<file1> | <biopiece2> --data_in=<file2>
910 \begin_layout Standard
911 where type is the data type a specific biopiece reads.
914 \begin_layout Section
918 \begin_layout Subsection
919 How to read biopieces input?
922 \begin_layout Standard
924 \begin_inset LatexCommand eqref
925 reference "sub:How-to-read-stream"
932 \begin_layout Subsection
936 \begin_layout Standard
937 Data in different formats can be read with the appropriate biopiece for
939 The biopieces are typicalled named 'read_<data type>' such as
951 , etc., and all behave in a similar manner.
952 Data can be read by supplying the
956 \begin_layout Standard
965 data_in switch and a file name to the file containing the data:
968 \begin_layout LyX-Code
969 <biopiece> --data_in=<file>
972 \begin_layout Standard
973 It is also possible to read in a saved biopieces stream (see
974 \begin_inset LatexCommand ref
975 reference "sub:How-to-read-stream"
979 ) as well as reading data in one go:
982 \begin_layout LyX-Code
983 <biopiece> --stream_in=<file1> --data_in=<file2>
986 \begin_layout Standard
987 If you want to read data from several files you can do this:
990 \begin_layout LyX-Code
991 <biopiece> --data_in=<file1> | <biopiece> --data_in=<file2>
994 \begin_layout Standard
995 If you have several data files you can read in all explicitly with a comma
999 \begin_layout LyX-Code
1000 <biopiece> --data_in=file1,file2,file3
1003 \begin_layout Standard
1004 And it is also possible to use file globbing
1008 \begin_layout Standard
1009 using the short option will only work if you quote the argument -i '*.fna'
1017 \begin_layout LyX-Code
1018 <biopiece> --data_in=*.fna
1021 \begin_layout Standard
1022 Or in a combination:
1025 \begin_layout LyX-Code
1026 <biopiece> --data_in=file1,/dir/*.fna
1029 \begin_layout Standard
1030 Finally, it is possible to read in data in different formats using the appropria
1031 te biopiece for each format:
1034 \begin_layout LyX-Code
1035 <biopiece1> --data_in=<file1> | <biopiece2> --data_in=<file2> ...
1038 \begin_layout Subsection
1039 How to read FASTA input?
1042 \begin_layout Standard
1043 Sequences in FASTA format can be read explicitly using
1050 \begin_layout LyX-Code
1051 read_fasta --data_in=<file>
1054 \begin_layout Subsection
1055 How to read alignment input?
1058 \begin_layout Standard
1059 If your alignment if FASTA formatted then you can
1064 It is also possible to use
1068 since the data is FASTA formatted, however, with
1072 the key ALIGN will be omitted.
1073 The ALIGN key is used to determine which sequences belong to what alignment
1074 which is required for
1081 \begin_layout LyX-Code
1082 read_align --data_in=<file>
1085 \begin_layout Subsection
1086 How to read tabular input?
1087 \begin_inset LatexCommand label
1088 name "sub:How-to-read-table"
1095 \begin_layout Standard
1096 Tabular input can be read with
1100 which will read in all rows and chosen columns (separated by a given delimter)
1101 from a table in text format.
1104 \begin_layout Standard
1108 \begin_layout Standard
1111 \begin_inset Tabular
1112 <lyxtabular version="3" rows="4" columns="3">
1114 <column alignment="left" valignment="top" width="0">
1115 <column alignment="left" valignment="top" width="0">
1116 <column alignment="left" valignment="top" width="0">
1118 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1121 \begin_layout Standard
1127 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1130 \begin_layout Standard
1136 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
1139 \begin_layout Standard
1147 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1150 \begin_layout Standard
1156 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1159 \begin_layout Standard
1165 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
1168 \begin_layout Standard
1176 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1179 \begin_layout Standard
1185 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1188 \begin_layout Standard
1194 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
1197 \begin_layout Standard
1205 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1208 \begin_layout Standard
1214 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1217 \begin_layout Standard
1223 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
1226 \begin_layout Standard
1240 \begin_layout Standard
1241 Can be read using the command:
1244 \begin_layout LyX-Code
1245 read_tab --data_in=<file>
1248 \begin_layout Standard
1249 Which will result in four records, one for each row, where the keys V0,
1250 V1, V2 are the default keys for the organism, sequence, and count, respectively.
1251 It is possible to select a subset of colums to read by using the
1255 \begin_layout Standard
1264 cols switch which takes a comma separated list of columns numbers (first
1265 column is designated 0) as argument.
1266 So to read in only the sequence and the count so that the count comes before
1270 \begin_layout LyX-Code
1271 read_tab --data_in=<file> --cols=2,1
1274 \begin_layout Standard
1275 It is also possible to name the columns with the
1279 \begin_layout Standard
1291 \begin_layout LyX-Code
1292 read_tab --data_in=<file> --cols=2,1 --keys=COUNT,SEQ
1295 \begin_layout Subsection
1296 How to read BED input?
1299 \begin_layout Standard
1300 The BED (Browser Extensible Data
1304 \begin_layout Standard
1305 \begin_inset LatexCommand url
1306 target "http://genome.ucsc.edu/FAQ/FAQformat"
1315 ) format is a tabular format for data pertaining to one of the Eukaryotic
1316 genomes in the UCSC genome brower
1320 \begin_layout Standard
1321 \begin_inset LatexCommand url
1322 target "http://genome.ucsc.edu/"
1332 The BED format consists of up to 12 columns, where the first three are
1333 mandatory CHR, CHR_BEG, and CHR_END.
1334 The mandatory columns and any of the optional columns can all be read in
1342 \begin_layout LyX-Code
1343 read_bed --data_in=<file>
1346 \begin_layout Standard
1347 It is also possible to read the BED file with
1353 \begin_inset LatexCommand ref
1354 reference "sub:How-to-read-table"
1358 ), however, that will be more cumbersome because you need to specify the
1362 \begin_layout LyX-Code
1363 read_tab --data_in=<file> --keys=CHR,CHR_BEG,CHR_END ...
1366 \begin_layout Subsection
1367 How to read PSL input?
1370 \begin_layout Standard
1371 The PSL format is the output from BLAT and contains 21 mandatory fields
1372 that can be read with
1379 \begin_layout LyX-Code
1380 read_psl --data_in=<file>
1383 \begin_layout Section
1387 \begin_layout Standard
1388 All result output can be written explicitly to file using the
1392 \begin_layout Standard
1401 result_out switch which all result generating biopieces have.
1402 It is also possible to write the result to file implicetly by directing
1403 'stdout' to file using '>', however, that requires the
1407 \begin_layout Standard
1416 no_stream swich to prevent a mixture of data stream and results in the file.
1417 The explicit (and safe) way:
1420 \begin_layout LyX-Code
1422 | <biopiece> --result_out=<file>
1425 \begin_layout Standard
1429 \begin_layout LyX-Code
1431 | <biopiece> --no_stream > <file>
1434 \begin_layout Subsection
1435 How to write biopieces output?
1438 \begin_layout Standard
1440 \begin_inset LatexCommand eqref
1441 reference "sub:How-to-write-stream"
1448 \begin_layout Subsection
1449 How to write FASTA output?
1450 \begin_inset LatexCommand label
1451 name "sub:How-to-write-fasta"
1458 \begin_layout Standard
1459 FASTA output can be written with
1466 \begin_layout LyX-Code
1468 | write_fasta --result_out=<file>
1471 \begin_layout Standard
1472 It is also possible to wrap the sequences to a given width using the
1476 \begin_layout Standard
1485 wrap switch allthough wrapping of sequence is generally an evil thing:
1488 \begin_layout LyX-Code
1490 | write_fasta --no_stream --wrap=80
1493 \begin_layout Subsection
1494 How to write alignment output?
1495 \begin_inset LatexCommand label
1496 name "sub:How-to-write-alignment"
1503 \begin_layout Standard
1504 Pretty alignments with ruler
1508 \begin_layout Standard
1509 '.' for every 10 residues, ':' for every 50, and '|' for every 100
1514 and consensus sequence
1515 \begin_inset Note Note
1518 \begin_layout Standard
1519 which reminds me to make that an option.
1528 , what also have the optional
1532 \begin_layout Standard
1541 wrap switch to break the alignment into blocks of a given width:
1544 \begin_layout LyX-Code
1546 | write_align --result_out=<file> --wrap=80
1549 \begin_layout Standard
1550 If the number of sequnces in the alignment is 2 then a pairwise alignment
1551 will be output otherwise a multiple alignment.
1552 And if the sequence type, determined automagically, is protein, then residues
1553 and symbols (+,\InsetSpace ~
1555 .) will be used to show consensus according to the Blosum62
1559 \begin_layout Subsection
1560 How to write tabular output?
1561 \begin_inset LatexCommand label
1562 name "sub:How-to-write-tab"
1569 \begin_layout Standard
1570 Outputting the data stream as a table can be done with
1574 , which will write generate one row per record with the values as columns.
1575 If you supply the optional
1579 \begin_layout Standard
1588 comment switch, when the first row in the table will be a 'comment' line
1589 prefixed with a '#':
1592 \begin_layout LyX-Code
1594 | write_tab --result_out=<file> --comment
1597 \begin_layout Standard
1598 You can also change the delimiter from the default (tab) to
1606 \begin_layout LyX-Code
1608 | write_tab --result_out=<file> --delimit=','
1611 \begin_layout Standard
1612 If you want the values output in a specific order you have to supply a comma
1613 separated list using the
1617 \begin_layout Standard
1626 keys switch that will print only those keys in that order:
1629 \begin_layout LyX-Code
1631 | write_tab --result_out=<file> --keys=SEQ_NAME,COUNT
1634 \begin_layout Standard
1635 Alternatively, if you have some keys that you don't want in the tabular
1640 \begin_layout Standard
1650 So to print all keys except SEQ and SEQ_TYPE do:
1653 \begin_layout LyX-Code
1655 | write_tab --result_out=<file> --no_keys=SEQ,SEQ_TYPE
1658 \begin_layout Standard
1659 Finally, if you have a stream containing a mix of different records types,
1665 records with sequences and records with matches, then you can use
1669 to output all the records in tabluar format, however, the
1673 \begin_layout Standard
1686 \begin_layout Standard
1699 \begin_layout Standard
1708 no_keys switches will only respond to records of the first type encountered.
1709 The reason is that outputting mixed records is probably not what you want
1710 anyway, and you should remove all the unwanted records from the stream
1711 before outputting the table:
1715 is your friend (see\InsetSpace ~
1717 \begin_inset LatexCommand ref
1718 reference "sub:How-to-grab"
1725 \begin_layout Subsection
1726 How to write a BED output?
1727 \begin_inset LatexCommand label
1728 name "sub:How-to-write-BED"
1735 \begin_layout Standard
1736 Data in BED format can be output if the records contain the mandatory keys
1737 CHR, CHR_BEG, and CHR_END using
1742 If the optional keys are also present, they will be output as well:
1745 \begin_layout LyX-Code
1746 write_bed --result_out=<file>
1749 \begin_layout Subsection
1750 How to write PSL output?
1751 \begin_inset LatexCommand label
1752 name "sub:How-to-write-PSL"
1759 \begin_layout Standard
1760 Data in PSL format can be output using
1765 \begin_layout LyX-Code
1766 write_psl --result_out=<file>
1769 \begin_layout Section
1770 Manipulating Records
1773 \begin_layout Subsection
1774 How to select a few records?
1775 \begin_inset LatexCommand label
1776 name "sub:How-to-select-a-few-records"
1783 \begin_layout Standard
1784 To quickly get an overview of your data you can limit the data stream to
1786 This also very useful to test the pipeline with a few records if you are
1787 setting up a complex analysis using several biopieces.
1788 That way you can inspect that all goes well before analyzing and waiting
1789 for the full data set.
1790 All of the read_<type> biopieces have the
1794 \begin_layout Standard
1803 num switch which will take a number as argument and only that number of
1804 records will be read.
1805 So to read in the first 10 FASTA entries from a file:
1808 \begin_layout LyX-Code
1809 read_fasta --data_in=test.fna --num=10
1812 \begin_layout Standard
1813 Another way of doing this is to use
1817 will limit the stream to show the first 10 records (default):
1820 \begin_layout LyX-Code
1825 \begin_layout Standard
1830 directly after one of the read_<type> biopieces will be a lot slower than
1835 \begin_layout Standard
1844 num switch with the read_<type> biopieces, however,
1848 can also be used to limit the output from all the other biopieces.
1849 It is also possible to give
1853 a number of records to show using the
1857 \begin_layout Standard
1867 So to display the first 100 records do:
1870 \begin_layout LyX-Code
1872 | head_records --num=100
1875 \begin_layout Subsection
1876 How to select random records?
1877 \begin_inset LatexCommand label
1878 name "sub:How-to-select-random-records"
1885 \begin_layout Standard
1886 If you want to inspect a number of random records from the stream this can
1892 So if you have 1 mio records in the stream and you want to select 1000
1896 \begin_layout LyX-Code
1898 | random_records --num=1000
1901 \begin_layout Subsection
1902 How to count all records in the data stream?
1905 \begin_layout Standard
1906 To count all the records in the data stream use
1910 , which adds one record (which is not included in the count) to the data
1912 So to count the number of sequences in a FASTA file you can do this:
1915 \begin_layout LyX-Code
1916 cat test.fna | read_fasta | count_records --no_stream
1919 \begin_layout Standard
1920 Which will write the last record containing the count to 'stdout':
1923 \begin_layout LyX-Code
1929 \begin_layout LyX-Code
1935 \begin_layout Standard
1936 It is also possible to write the count to file using the
1940 \begin_layout Standard
1952 \begin_layout Subsection
1953 How to get the length of record values?
1954 \begin_inset LatexCommand label
1955 name "sub:How-to-get-value_length"
1962 \begin_layout Standard
1967 biopiece to get the length of each value for a comma separated list of
1971 \begin_layout LyX-Code
1973 | length_vals --keys=HIT,PATTERN
1976 \begin_layout Subsection
1977 How to grab specific records?
1978 \begin_inset LatexCommand label
1979 name "sub:How-to-grab"
1986 \begin_layout Standard
1991 is related to the Unix grep and locates records based on matching keys
1992 and/or values using either a pattern, a Perl regex, or a numerical evaluation.
1997 all records in the stream that has any mentioning of the pattern 'human'
1998 just pipe the data stream through
2005 \begin_layout LyX-Code
2007 | grab --pattern=human
2010 \begin_layout Standard
2011 This will search for the pattern 'human' in all keys and all values.
2016 \begin_layout Standard
2025 pattern switch takes a comma separated list of patterns, so in order to
2026 match multiple patterns do:
2029 \begin_layout LyX-Code
2031 | grab --pattern=human,mouse
2034 \begin_layout Standard
2035 It is also possible to use the
2039 \begin_layout Standard
2048 pattern_in switch instead of
2052 \begin_layout Standard
2066 \begin_layout Standard
2075 pattern_in is used to read a file with one pattern per line:
2078 \begin_layout LyX-Code
2080 | grab --pattern_in=patterns.txt
2083 \begin_layout Standard
2084 If you want the opposite result --- to find all records that does not match
2085 the patterns, add the
2089 \begin_layout Standard
2098 invert switch, which not only works with the
2102 \begin_layout Standard
2111 pattern switch, but also with
2115 \begin_layout Standard
2128 \begin_layout Standard
2140 \begin_layout LyX-Code
2142 | grab --pattern=human --invert
2145 \begin_layout Standard
2146 If you want to search the record keys only,
2151 to find all records containing the key SEQ you can add the
2155 \begin_layout Standard
2165 This will prevent matching of SEQ in any record value, and in fact SEQ
2166 is a not uncommon peptide sequence you could get an unwanted record.
2167 Also, this will give an increase in speed since only the keys are searched:
2170 \begin_layout LyX-Code
2172 | grab --pattern=SEQ --keys_only
2175 \begin_layout Standard
2176 However, if you are interested in finding the peptide sequence SEQ and not
2177 the SEQ key, just add the
2181 \begin_layout Standard
2190 vals_only switch instead:
2193 \begin_layout LyX-Code
2195 | grab --pattern=SEQ --vals_only
2198 \begin_layout Standard
2199 Also, if you want to grab for certain key/value pairs you can supply a comma
2200 separated list of keys whos values will then be searched using the
2204 \begin_layout Standard
2214 This is handy if your records contain large genomic sequences and you dont
2215 want to search the entire sequence for
2220 the organism name --- it is much faster to tell
2224 which keys to search the value for:
2227 \begin_layout LyX-Code
2229 | grab --pattern=human --keys=SEQ_NAME
2232 \begin_layout LyX-Code
2236 \begin_layout Standard
2237 It is also possible to invoke flexible matching using regex (regular expressions
2238 ) instead of simple pattern matching.
2243 the regex engine is Perl based and allows use of different type of wild
2244 cards, alternatives,
2252 \begin_layout Standard
2253 \begin_inset LatexCommand url
2254 target "http://perldoc.perl.org/perlreref.html"
2268 records withs the sequence ATCG or GCTA you can do this:
2271 \begin_layout LyX-Code
2273 | grab --regex='ATCG|GCTA'
2276 \begin_layout Standard
2277 Or if you want to find sequences beginning with ATCG:
2280 \begin_layout LyX-Code
2282 | grab --regex='^ATCG'
2285 \begin_layout Standard
2290 to locate records that fulfill a numerical property using the
2294 \begin_layout Standard
2303 eval switch witch takes an expression in three parts.
2304 The first part is the key that holds the value we want to evaluate, the
2305 second part holds one the six operators:
2308 \begin_layout Enumerate
2312 \begin_layout Enumerate
2313 Greater than or equal to: >=
2316 \begin_layout Enumerate
2320 \begin_layout Enumerate
2321 Less than or equal to: <=
2324 \begin_layout Enumerate
2328 \begin_layout Enumerate
2332 \begin_layout Enumerate
2333 String wise equal to: eq
2336 \begin_layout Enumerate
2337 String wise not equal to: ne
2340 \begin_layout Standard
2341 And finally comes the number used in the evaluation.
2346 all records with a sequence length greater than 30:
2349 \begin_layout LyX-Code
2351 length_seq | grab --eval='SEQ_LEN > 30'
2354 \begin_layout Standard
2355 If you want to locate all records containing the pattern 'human' and where
2356 the sequence length is greater that 30, you do this by running the stream
2364 \begin_layout LyX-Code
2366 | grab --pattern='human' | length_seq | grab --eval='SEQ_LEN > 30'
2369 \begin_layout Standard
2370 Finally, it is possible to do fast matching of expressions from a file using
2375 \begin_layout Standard
2385 Each of these expressions has to be matched exactly over the entrie length,
2386 which if useful if you have a file with accession numbers, that you want
2387 to locate in the stream:
2390 \begin_layout LyX-Code
2392 | grab --exact acc_no.txt | ...
2395 \begin_layout Standard
2400 \begin_layout Standard
2409 exact is much faster than using
2413 \begin_layout Standard
2422 pattern_in, because with
2426 \begin_layout Standard
2435 exact the expression has to be complete matches, where
2439 \begin_layout Standard
2448 pattern_in looks for subpatterns.
2451 \begin_layout Standard
2452 NB! To get the best speed performance, use the most restrictive
2459 \begin_layout Subsection
2460 How to remove keys from records?
2463 \begin_layout Standard
2464 To remove one or more specific keys from all records in the data stream
2472 \begin_layout LyX-Code
2474 | remove_keys --keys=SEQ,SEQ_NAME
2477 \begin_layout Standard
2478 In the above example SEQ and SEQ_NAME will be removed from all records if
2479 they exists in these.
2480 If all keys are removed from a record, then the record will be removed.
2483 \begin_layout Subsection
2484 How to rename keys in records?
2487 \begin_layout Standard
2488 Sometimes you want to rename a record key,
2493 if you have read in a two column table with sequence name and sequence
2495 \begin_inset LatexCommand ref
2496 reference "sub:How-to-read-table"
2500 ) without specifying the key names, then the sequence name will be called
2501 V0 and the sequence V1 as default in the
2506 To rename the V0 and V1 keys we need to run the stream through
2510 twice (one for each key to rename):
2513 \begin_layout LyX-Code
2515 | rename_keys --keys=V0,SEQ_NAME | rename_keys --keys=V1,SEQ
2518 \begin_layout Standard
2519 The first instance of
2523 replaces all the V0 keys with SEQ_NAME, and the second instance of
2527 replaces all the V1 keys with SEQ.
2532 the data can now be used in the biopieces that requires these keys.
2535 \begin_layout Section
2536 Manipulating Sequences
2539 \begin_layout Subsection
2540 How to get sequence lengths?
2543 \begin_layout Standard
2544 The length for sequences in records can be determined with
2548 , which adds the key SEQ_LEN to each record with the sequence length as
2550 It also generates an extra record that is emitted last with the key TOTAL_SEQ_L
2551 EN showing the total length of all the sequences.
2554 \begin_layout LyX-Code
2555 read_fasta --data_in=<file> | length_seq
2558 \begin_layout Standard
2559 It is also possible to determine the sequence length using the generic tool
2565 \begin_inset LatexCommand eqref
2566 reference "sub:How-to-get-value_length"
2570 , which determines the length of the values for a given list of keys:
2573 \begin_layout LyX-Code
2574 read_fasta --data_in=<file> | length_vals --keys=SEQ
2577 \begin_layout Standard
2578 To obtain the total length of all sequences use
2585 \begin_layout LyX-Code
2586 read_fasta --data_in=<file> | length_vals --keys=SEQ
2589 \begin_layout LyX-Code
2590 | sum_vals --keys=SEQ_LEN
2593 \begin_layout Standard
2598 will also determine the length of each sequence (see\InsetSpace ~
2600 \begin_inset LatexCommand ref
2601 reference "sub:How-to-analyze"
2608 \begin_layout Subsection
2609 How to analyze sequence composition?
2610 \begin_inset LatexCommand label
2611 name "sub:How-to-analyze"
2618 \begin_layout Standard
2619 If you want to find out the sequence type, composition, length, as well
2620 as GC content, indel content and proportions of soft and hard masked sequence,
2626 This handy biopiece will determine all these things per sequence from which
2627 it is easy to get an overview using the
2631 biopiece to output a table (see\InsetSpace ~
2633 \begin_inset LatexCommand ref
2634 reference "sub:How-to-write-tab"
2639 So in order to determine the sequence composition of a FASTA file with
2640 just one entry containing the sequence 'ATCG' we just read the data with
2645 and run the output through
2649 which will add the analysis to the record like this:
2652 \begin_layout LyX-Code
2653 read_fasta --data_in=test.fna | analyze_seq ...
2656 \begin_layout LyX-Code
2660 \begin_layout LyX-Code
2666 \begin_layout LyX-Code
2672 \begin_layout LyX-Code
2678 \begin_layout LyX-Code
2684 \begin_layout LyX-Code
2690 \begin_layout LyX-Code
2696 \begin_layout LyX-Code
2702 \begin_layout LyX-Code
2708 \begin_layout LyX-Code
2714 \begin_layout LyX-Code
2720 \begin_layout LyX-Code
2726 \begin_layout LyX-Code
2732 \begin_layout LyX-Code
2738 \begin_layout LyX-Code
2744 \begin_layout LyX-Code
2750 \begin_layout LyX-Code
2756 \begin_layout LyX-Code
2762 \begin_layout LyX-Code
2768 \begin_layout LyX-Code
2774 \begin_layout LyX-Code
2780 \begin_layout LyX-Code
2783 SEQ: TTTCAGTTTGGGACGGAGTAAGGCCTTCCtttttttttttttttttttttttttttttgagaccgagtcttgctc
2787 \begin_layout LyX-Code
2793 \begin_layout LyX-Code
2799 \begin_layout LyX-Code
2805 \begin_layout LyX-Code
2811 \begin_layout LyX-Code
2817 \begin_layout LyX-Code
2823 \begin_layout LyX-Code
2827 \begin_layout Standard
2828 Now to make a table of how may As, Ts, Cs, and Gs you can add the following:
2831 \begin_layout LyX-Code
2833 | analyze_seq | write_tab --keys=RES:A,RES:T,RES:C,RES:G
2836 \begin_layout Standard
2837 Or if you want to see the proportions of hard and soft masked sequence:
2840 \begin_layout LyX-Code
2842 | analyse_seq | write_tab --keys=HARD_MASK%,SOFT_MASK%
2845 \begin_layout Standard
2846 If you have a stack of sequences in one file and you want to determine the
2847 mean GC content you can do it using the
2854 \begin_layout LyX-Code
2855 read_fasta --data_in=test.fna | analyze_seq | mean_vals --keys=GC%
2858 \begin_layout Standard
2859 Or if you want the total count of Ns you can use
2866 \begin_layout LyX-Code
2867 read_fasta --data_in=test.fna | analyze_seq | sum_vals --keys=RES:N
2870 \begin_layout Standard
2871 The MIX_INDEX key is calculated as the count of the most common residue
2872 over the sequence length, and can be used as a cut-off for removing sequence
2873 tags consisting of mostly one nucleotide:
2876 \begin_layout LyX-Code
2877 read_fasta --data_in=test.fna | analyze_seq | grab --eval='MIX_INDEX<0.85'
2880 \begin_layout Subsection
2881 How to extract subsequences?
2882 \begin_inset LatexCommand label
2883 name "sub:How-to-extract"
2890 \begin_layout Standard
2891 In order to extract a subsequence from a longer sequence use the biopiece
2892 extract_seq, which will replace the sequence in the record with the subsequence
2893 (this behaviour should probably be modified to be dependant of a
2897 \begin_layout Standard
2910 \begin_layout Standard
2920 \begin_inset Note Note
2923 \begin_layout Standard
2930 So to extract the first 20 residues from all sequences do (first residue
2934 \begin_layout LyX-Code
2936 | extract_seq --beg=1 --len=20
2939 \begin_layout Standard
2940 You can also specify a begin and end coordinate set:
2943 \begin_layout LyX-Code
2945 | extract_seq --beg=20 --end=40
2948 \begin_layout Standard
2949 If you want the subsequences from position 20 to the sequence end do:
2952 \begin_layout LyX-Code
2954 | extract_seq --beg=20
2957 \begin_layout Standard
2958 If you want to extract subsequences a given distance from the sequence end
2959 you can do this by reversing the sequence with the biopiece
2964 \begin_inset LatexCommand eqref
2965 reference "sub:How-to-reverse-seq"
2973 to get the subsequence, and then
2977 again to get the subsequence back in the original orientation:
2980 \begin_layout LyX-Code
2981 read_fasta --data_in=test.fna | reverse_seq
2984 \begin_layout LyX-Code
2985 | extract_seq --beg=10 --len=10 | reverse_seq
2988 \begin_layout Subsection
2989 How to get genomic sequence?
2990 \begin_inset LatexCommand label
2991 name "sub:How-to-get-genomic-sequence"
2998 \begin_layout Standard
3003 can extract subsequences for a given genome specified with the
3007 \begin_layout Standard
3016 genome switch explicitly using the
3020 \begin_layout Standard
3033 \begin_layout Standard
3046 \begin_layout Standard
3058 \begin_layout LyX-Code
3059 get_genome_seq --genome=<genome> --beg=1 --len=100
3062 \begin_layout Standard
3067 can be used to append the corresponding sequence to BED, PSL, and BLAST
3071 \begin_layout LyX-Code
3072 read_bed --data_in=<BED file> | get_genome_seq --genome=<genome>
3075 \begin_layout Standard
3076 It is also possible to include flaking sequence using the
3080 \begin_layout Standard
3090 So to include 50 nucleotides upstream and 50 nucleotides downstream for
3094 \begin_layout LyX-Code
3095 read_bed --data_in=<BED file> | get_genome_seq --genome=<genome> --flank=50
3098 \begin_layout Subsection
3099 How to upper-case sequences?
3102 \begin_layout Standard
3103 Sequences can be shifted from lower case to upper case using
3110 \begin_layout LyX-Code
3115 \begin_layout Subsection
3116 How to reverse sequences?
3117 \begin_inset LatexCommand label
3118 name "sub:How-to-reverse-seq"
3125 \begin_layout Standard
3126 The order of residues in a sequence can be reversed using reverse_seq:
3129 \begin_layout LyX-Code
3134 \begin_layout Standard
3135 Note that in order to reverse/complement a sequence you also need the
3139 biopiece (see\InsetSpace ~
3141 \begin_inset LatexCommand ref
3142 reference "sub:How-to-complement"
3149 \begin_layout Subsection
3150 How to complement sequences?
3151 \begin_inset LatexCommand label
3152 name "sub:How-to-complement"
3159 \begin_layout Standard
3160 DNA and RNA sequences can be complemented with
3164 , which automagically determines the sequence type:
3167 \begin_layout LyX-Code
3172 \begin_layout Standard
3173 Note that in order to reverse/complement a sequence you also need the
3177 biopiece (see\InsetSpace ~
3179 \begin_inset LatexCommand ref
3180 reference "sub:How-to-reverse-seq"
3187 \begin_layout Subsection
3188 How to remove indels from sequnces?
3191 \begin_layout Standard
3192 Indels can be removed from sequences with the
3197 This is useful if you have aligned some sequences (see\InsetSpace ~
3199 \begin_inset LatexCommand ref
3200 reference "sub:How-to-align"
3204 ) and extracted (see\InsetSpace ~
3206 \begin_inset LatexCommand ref
3207 reference "sub:How-to-extract"
3211 ) a block of subsequences from the alignment and you want to use these sequence
3212 in a search where you need to remove the indels first.
3213 '-', '~', and '.' are considered indels:
3216 \begin_layout LyX-Code
3221 \begin_layout Subsection
3222 How to shuffle sequences?
3225 \begin_layout Standard
3226 All residues in sequences in the stream can be shuffled to random positions
3234 \begin_layout LyX-Code
3239 \begin_layout Subsection
3240 How to split sequences into overlapping subsequences?
3243 \begin_layout Standard
3244 Sequences can be slit into overlapping subsequences with the
3251 \begin_layout LyX-Code
3253 | split_seq --word_size=20 --uniq
3256 \begin_layout Subsection
3257 How to determine the oligo frequency?
3260 \begin_layout Standard
3261 In order to determine if any oligo usage is over represented in one or more
3262 sequences you can determine the frequency of oligos of a given size with
3270 \begin_layout LyX-Code
3272 | oligo_freq --word_size=4
3275 \begin_layout Standard
3276 And if you have more than one sequence and want to accumulate the frequences
3281 \begin_layout Standard
3293 \begin_layout LyX-Code
3295 | oligo_freq --word_size=4 --all
3298 \begin_layout Standard
3299 To get a meaningful result you need to write the resulting frequencies as
3306 \begin_inset LatexCommand ref
3307 reference "sub:How-to-write-tab"
3311 ), but first it is important to
3317 \begin_inset LatexCommand ref
3318 reference "sub:How-to-grab"
3322 ) the records with the frequencies to avoid full length sequences in the
3326 \begin_layout LyX-Code
3328 | oligo_freq --word_size=4 --all | grab --pattern=OLIGO --keys_only
3331 \begin_layout LyX-Code
3332 | write_tab --no_stream
3335 \begin_layout Standard
3336 And the resulting frequency table can be sorted with Unix sort (man sort).
3339 \begin_layout Subsection
3340 How to search for sequences in genomes?
3343 \begin_layout Standard
3344 See the following biopiece:
3347 \begin_layout Itemize
3353 \begin_inset LatexCommand eqref
3354 reference "sub:How-to-use-patscan"
3361 \begin_layout Itemize
3367 \begin_inset LatexCommand eqref
3368 reference "sub:How-to-use-BLAT"
3375 \begin_layout Itemize
3381 \begin_inset LatexCommand eqref
3382 reference "sub:How-to-use-BLAST"
3389 \begin_layout Itemize
3395 \begin_inset LatexCommand eqref
3396 reference "sub:How-to-use-Vmatch"
3403 \begin_layout Subsection
3404 How to search sequences for a pattern?
3405 \begin_inset LatexCommand label
3406 name "sub:How-to-use-patscan"
3413 \begin_layout Standard
3414 It is possible to search sequences in the data stream for patterns using
3419 biopiece which utilizes the powerful scan_for_matches engine.
3420 Consult the documentation for scan_for_matches in order to learn how to
3421 define patterns (the documentation is included in Appendix\InsetSpace ~
3423 \begin_inset LatexCommand ref
3424 reference "sec:scan_for_matches-README"
3431 \begin_layout Standard
3432 To search all sequences for a simple pattern consisting of the sequence
3433 ATCGATCG allowing for 3 mismatches, 2 insertions and 1 deletion:
3436 \begin_layout LyX-Code
3437 read_fasta --data_in=<file> | patscan_seq --pattern='ATCGATCG[3,2,1]'
3440 \begin_layout Standard
3445 \begin_layout Standard
3454 pattern switch takes a comma seperated list of patterns, so if you want
3455 to search for more that one pattern do:
3458 \begin_layout LyX-Code
3460 | patscan_seq --pattern='ATCGATCG[3,2,1],GCTAGCTA[3,2,1]'
3463 \begin_layout Standard
3464 It is also possible to have a list of different patterns to search for in
3465 a file with one pattern per line.
3470 to read these patterns use the
3474 \begin_layout Standard
3486 \begin_layout LyX-Code
3488 | patscan_seq --pattern_in=<file>
3491 \begin_layout Standard
3492 To also scan the complementary strand in nucleotide sequences (
3496 automagically determines the sequence type) you need to add the
3500 \begin_layout Standard
3512 \begin_layout LyX-Code
3514 | patscan_seq --pattern=<pattern> --comp
3517 \begin_layout Standard
3518 It is also possible to use
3522 to output those records that does not contain a certain pattern by using
3527 \begin_layout Standard
3539 \begin_layout LyX-Code
3541 | patscan_seq --pattern=<pattern> --invert
3544 \begin_layout Standard
3549 can also scan for patterns in a given genome sequence, instead of sequences
3550 in the stream, using the
3554 \begin_layout Standard
3566 \begin_layout LyX-Code
3567 patscan --pattern=<pattern> --genome=<genome>
3570 \begin_layout Subsection
3571 How to use BLAT for sequence search?
3572 \begin_inset LatexCommand label
3573 name "sub:How-to-use-BLAT"
3580 \begin_layout Standard
3581 Sequences in the data stream can be matched against supported genomes using
3586 which is a biopiece using BLAT as the name might suggest.
3587 Currently only Mouse and Human genomes are available and it is not possible
3588 to use OOC files since there is still a need for a local repository for
3590 Otherwise it is just:
3593 \begin_layout LyX-Code
3594 read_fasta --data_in=<file> | blat_seq --genome=<genome>
3597 \begin_layout Standard
3598 The search results can then be written to file with
3604 \begin_inset LatexCommand ref
3605 reference "sub:How-to-write-PSL"
3615 \begin_inset LatexCommand ref
3616 reference "sub:How-to-write-BED"
3624 some information will be lost).
3625 It is also possible to plot chromosome distribution of the search results
3632 \begin_inset LatexCommand ref
3633 reference "sub:How-to-plot-chrdist"
3637 ) or the distribution of the match lengths using
3643 \begin_inset LatexCommand ref
3644 reference "sub:How-to-plot-lendist"
3648 ) or a karyogram with the hits using
3654 \begin_inset LatexCommand ref
3655 reference "sub:How-to-plot-karyogram"
3662 \begin_layout Subsection
3663 How to use BLAST for sequence search?
3664 \begin_inset LatexCommand label
3665 name "sub:How-to-use-BLAST"
3672 \begin_layout Standard
3673 Two biopieces exist for blasting sequences:
3677 is used to create the BLAST database required for BLAST which is queried
3683 So in order to create a BLAST database from sequences in the data stream
3687 \begin_layout LyX-Code
3689 | create_blast_db --database=my_database --no_stream
3692 \begin_layout Standard
3693 The type of sequence to use for the database is automagically determined
3698 , but don't have a mixture of peptide and nucleic acids sequences in the
3704 \begin_layout Standard
3713 database switch takes a path as argument, but will default to 'blastdb_<time_sta
3717 \begin_layout Standard
3718 The resulting database can now be queried with sequences in another data
3726 \begin_layout LyX-Code
3728 | blast_seq --database=my_database
3731 \begin_layout Standard
3732 Again, the sequence type is determined automagically and the appropriate
3733 BLAST program is guessed (see below table), however, the program name can
3734 be overruled with the
3738 \begin_layout Standard
3750 \begin_layout Standard
3753 \begin_inset Tabular
3754 <lyxtabular version="3" rows="5" columns="3">
3756 <column alignment="center" valignment="top" width="0">
3757 <column alignment="center" valignment="top" width="0">
3758 <column alignment="center" valignment="top" width="0">
3759 <row bottomline="true">
3760 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
3763 \begin_layout Standard
3769 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
3772 \begin_layout Standard
3778 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
3781 \begin_layout Standard
3789 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
3792 \begin_layout Standard
3798 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
3801 \begin_layout Standard
3807 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
3810 \begin_layout Standard
3818 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
3821 \begin_layout Standard
3827 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
3830 \begin_layout Standard
3836 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
3839 \begin_layout Standard
3847 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
3850 \begin_layout Standard
3856 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
3859 \begin_layout Standard
3865 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
3868 \begin_layout Standard
3876 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
3879 \begin_layout Standard
3885 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
3888 \begin_layout Standard
3894 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
3897 \begin_layout Standard
3911 \begin_layout Standard
3912 Finally, it is also possible to use
3916 for blasting sequences agains a preformatted genome using the
3920 \begin_layout Standard
3929 genome switch instead of the
3933 \begin_layout Standard
3945 \begin_layout LyX-Code
3947 | blast_seq --genome=<genome>
3950 \begin_layout Subsection
3951 How to use Vmatch for sequence search?
3952 \begin_inset LatexCommand label
3953 name "sub:How-to-use-Vmatch"
3960 \begin_layout Standard
3961 The powerful suffix array software package Vmatch
3965 \begin_layout Standard
3966 \begin_inset LatexCommand url
3967 target "http://www.vmatch.de/"
3976 can be used for exact mapping of sequences against indexed genomes using
3982 map 700000 ESTs to the human genome locating all 160 mio hits in less than
3984 Only nucleotide sequences and sequences longer than 11 nucleotides will
3986 It is recommended that sequences consisting of mostly one nucleotide type
3988 This can be done with the
3993 \begin_inset LatexCommand eqref
3994 reference "sub:How-to-analyze"
4001 \begin_layout LyX-Code
4003 | vmatch_seq --genome=<genome>
4006 \begin_layout Standard
4007 It is also possible to allow for mismatches using the
4011 \begin_layout Standard
4020 hamming_dist switch.
4021 So to allow for 2 mismatches:
4024 \begin_layout LyX-Code
4026 | vmatch_seq --genome=<genome> --hamming_dist=2
4029 \begin_layout Standard
4030 Or to allow for 10% mismathing nucleotides:
4033 \begin_layout LyX-Code
4035 | vmatch_seq --genome=<genome> --hamming_dist=10p
4038 \begin_layout Standard
4039 To allow both indels and mismatches use the
4043 \begin_layout Standard
4053 So to allow for one mismatch or one indel:
4056 \begin_layout LyX-Code
4058 | vmatch_seq --genome=<genome> --hamming_dist=1
4061 \begin_layout Standard
4062 Or to allow for 5% indels or mismatches:
4065 \begin_layout LyX-Code
4067 | vmatch_seq --genome=<genome> --hamming_dist=5p
4070 \begin_layout Standard
4075 \begin_layout Standard
4088 \begin_layout Standard
4097 edit_dist greatly slows down vmatch considerably --- use with care.
4100 \begin_layout Standard
4101 The resulting SCORE key can be replaced to hold the number of genome matches
4102 of a given sequence (multi-mappers) is the
4106 \begin_layout Standard
4115 count switch is given.
4118 \begin_layout Subsection
4119 How to find all matches between sequences?
4120 \begin_inset LatexCommand label
4121 name "sub:How-to-find-matches"
4128 \begin_layout Standard
4129 All matches between two sequences can be determined with the biopiece
4134 The match finding engine underneath the hood of
4138 is the super fast suffix tree program MUMmer
4142 \begin_layout Standard
4143 \begin_inset LatexCommand url
4144 target "http://mummer.sourceforge.net/"
4153 , which will locate all forward and reverse matches between huge sequences
4154 in a matter of minutes (if the repeat count is not too high and if the
4155 word size used is appropriate).
4160 genomes (1.7Mbp) takes around 10 seconds:
4163 \begin_layout LyX-Code
4165 | match_seq --word_size=20 --direction=both
4168 \begin_layout Standard
4173 can be used to generate a dot plot with
4179 \begin_inset LatexCommand ref
4180 reference "sub:How-to-generate-dotplot"
4187 \begin_layout Subsection
4188 How to align sequences?
4189 \begin_inset LatexCommand label
4190 name "sub:How-to-align"
4197 \begin_layout Standard
4198 Sequences in the stream can be aligned with the
4202 biopiece that uses Muscle
4206 \begin_layout Standard
4207 \begin_inset LatexCommand url
4208 target "http://www.drive5.com/muscle/muscle.html"
4218 Currently you cannot change any of the Muscle alignment parameters and
4223 will create an alignment based on the defaults (which are really good!):
4226 \begin_layout LyX-Code
4231 \begin_layout Standard
4232 The aligned output can be written to file in FASTA format using
4238 \begin_inset LatexCommand ref
4239 reference "sub:How-to-write-fasta"
4243 ) or in pretty text using
4249 \begin_inset LatexCommand ref
4250 reference "sub:How-to-write-alignment"
4257 \begin_layout Subsection
4258 How to create a weight matrix?
4261 \begin_layout Standard
4262 If you want a weight matrix to show the sequence composition of a stack
4263 of sequences you can use the biopiece create_weight_matrix:
4266 \begin_layout LyX-Code
4268 | create_weight_matrix
4271 \begin_layout Standard
4272 The result can be output in percent using the
4276 \begin_layout Standard
4288 \begin_layout LyX-Code
4290 | create_weight_matrix --percent
4293 \begin_layout Standard
4294 The weight matrix can be written as tabular output with
4300 \begin_inset LatexCommand ref
4301 reference "sub:How-to-write-tab"
4305 ) after removeing the records containing SEQ with
4311 \begin_inset LatexCommand ref
4312 reference "sub:How-to-grab"
4319 \begin_layout LyX-Code
4321 | create_weight_matrix | grab --invert --keys=SEQ --keys_only
4324 \begin_layout LyX-Code
4325 | write_tab --no_stream
4328 \begin_layout Standard
4329 The V0 column will hold the residue, while the rest of the columns will
4330 hold the frequencies for each sequence position.
4333 \begin_layout Section
4337 \begin_layout Standard
4338 There exists several biopieces for plotting.
4339 Some of these are based on GNUplot
4343 \begin_layout Standard
4344 \begin_inset LatexCommand url
4345 target "http://www.gnuplot.info/"
4354 , which is an extremely powerful platform to generate all sorts of plots
4355 and even though GNUplot has quite a steep learning curve, the biopieces
4356 utilizing GNUplot are simple to use.
4357 GNUplot is able to output a lot of different formats (called terminals
4358 in GNUplot), but the biopieces focusses on three formats only:
4361 \begin_layout Enumerate
4362 The 'dumb' terminal is default to the GNUplot based biopieces and will output
4363 a plot in crude ASCII text (Fig.\InsetSpace ~
4365 \begin_inset LatexCommand ref
4366 reference "fig:Dumb-terminal"
4371 This is quite nice for a quick and dirty plot to get an overview of your
4375 \begin_layout Enumerate
4376 The 'post' or 'postscript' terminal output postscript code which is publication
4377 grade graphics that can be viewed with applications such as Ghostview,
4378 Photoshop, and Preview.
4381 \begin_layout Enumerate
4382 The 'svg' terminal output's scalable vector graphics (SVG) which is a vector
4384 SVG is great because you can edit the resulting plot using Photoshop or
4389 \begin_layout Standard
4390 Inkscape is a really handy drawing program that is free and open source.
4392 \begin_inset LatexCommand htmlurl
4393 target "http://www.inkscape.org"
4402 if you want to add additional labels, captions, arrows, and so on and then
4403 save the result in different formats, such as postscript without loosing
4407 \begin_layout Standard
4408 The biopieces for plotting that are not based on GNUplot only output SVG
4409 (that may change in the future).
4412 \begin_layout Standard
4413 \begin_inset Float figure
4418 \begin_layout Standard
4421 \begin_inset Graphics
4422 filename lendist_ascii.png
4431 \begin_layout Standard
4432 \begin_inset Caption
4434 \begin_layout Standard
4435 \begin_inset LatexCommand label
4436 name "fig:Dumb-terminal"
4449 The output of a length distribution plot in the default 'dumb terminal'
4450 to the terminal window.
4459 \begin_layout Subsection
4460 How to plot a histogram?
4461 \begin_inset LatexCommand label
4462 name "How-to-plot-histogram"
4469 \begin_layout Standard
4470 A generic histogram for a given value can be plotted with the biopiece
4476 \begin_inset LatexCommand ref
4477 reference "fig:Histogram"
4484 \begin_layout LyX-Code
4486 | plot_histogram --key=TISSUE --no_stream
4489 \begin_layout Standard
4493 \begin_layout Standard
4496 \begin_inset Float figure
4501 \begin_layout Standard
4504 \begin_inset Graphics
4505 filename histogram.png
4514 \begin_layout Standard
4515 \begin_inset Caption
4517 \begin_layout Standard
4518 \begin_inset LatexCommand label
4519 name "fig:Histogram"
4536 \begin_layout Subsection
4537 How to plot a length distribution?
4538 \begin_inset LatexCommand label
4539 name "sub:How-to-plot-lendist"
4546 \begin_layout Standard
4547 Plotting of length distributions, weather sequence lengths, patterns lengths,
4553 is a really handy thing and can be done with the the biopiece
4558 If you have a file with FASTA entries and want to plot the length distribution
4559 you do it like this:
4562 \begin_layout LyX-Code
4563 read_fasta --data_in=<file> | length_seq
4566 \begin_layout LyX-Code
4567 | plot_lendist --key=SEQ_LEN --no_stream
4570 \begin_layout Standard
4571 The result will be written to the default dumb terminal and will look like
4574 \begin_inset LatexCommand ref
4575 reference "fig:Dumb-terminal"
4582 \begin_layout Standard
4583 If you instead want the result in postscript format you can do:
4586 \begin_layout LyX-Code
4588 | plot_lendist --key=SEQ_LEN --terminal=post --result_out=file.ps
4591 \begin_layout Standard
4592 That will generate the plot and save it to file, but not interrupt the data
4593 stream which can then be used in further analysis.
4594 You can also save the plot implicetly using '>', however, it is then important
4595 to terminate the stream with the
4599 \begin_layout Standard
4611 \begin_layout LyX-Code
4613 | plot_lendist --key=SEQ_LEN --terminal=post --no_stream > file.ps
4616 \begin_layout Standard
4617 The resulting plot can be seen in Fig.\InsetSpace ~
4619 \begin_inset LatexCommand ref
4620 reference "fig:Length-distribution"
4627 \begin_layout Standard
4628 \begin_inset Float figure
4633 \begin_layout Standard
4637 \begin_layout Standard
4640 \begin_inset Graphics
4650 \begin_layout Standard
4651 \begin_inset Caption
4653 \begin_layout Standard
4654 \begin_inset LatexCommand label
4655 name "fig:Length-distribution"
4668 Length distribution of 630 piRNA like RNAs.
4676 \begin_layout Subsection
4677 How to plot a chromosome distribution?
4678 \begin_inset LatexCommand label
4679 name "sub:How-to-plot-chrdist"
4686 \begin_layout Standard
4687 If you have the result of a sequence search against a multi chromosome genome,
4688 it is very practical to be able to plot the distribution of search hits
4689 on the different chromosomes.
4690 This can be done with
4697 \begin_layout LyX-Code
4698 read_fasta --data_in=<file> | blat_genome | plot_chrdist --no_stream
4701 \begin_layout Standard
4702 The above example will result in a crude plot using the 'dumb' terminal,
4703 and if you want to mess around with the results from the BLAT search you
4704 probably want to save the result to file first (see\InsetSpace ~
4706 \begin_inset LatexCommand ref
4707 reference "sub:How-to-write-PSL"
4712 To plot the chromosome distribution from the saved search result you can
4716 \begin_layout LyX-Code
4717 read_bed --data_in=file.bed | plot_chrdist --terminal=post --result_out=plot.ps
4720 \begin_layout Standard
4721 That will result in the output show in Fig.\InsetSpace ~
4723 \begin_inset LatexCommand ref
4724 reference "fig:Chromosome-distribution"
4731 \begin_layout Standard
4732 \begin_inset Float figure
4737 \begin_layout Standard
4741 \begin_layout Standard
4744 \begin_inset Graphics
4755 \begin_layout Standard
4756 \begin_inset Caption
4758 \begin_layout Standard
4759 \begin_inset LatexCommand label
4760 name "fig:Chromosome-distribution"
4764 Chromosome distribution
4777 \begin_layout Subsection
4778 How to generate a dotplot?
4779 \begin_inset LatexCommand label
4780 name "sub:How-to-generate-dotplot"
4787 \begin_layout Standard
4788 A dotplot is a powerful way to get an overview of the size and location
4789 of sequence insertions, deletions, and duplications between two sequences.
4790 Generating a dotplot with biopieces is a two step process where you initially
4791 find all matches between two sequences using the tool
4797 \begin_inset LatexCommand ref
4798 reference "sub:How-to-find-matches"
4802 ) and plot the resulting matches with
4807 Matching and plotting two
4811 genomes (1.7Mbp) takes around 10 seconds:
4814 \begin_layout LyX-Code
4816 | match_seq | plot_matches --terminal=post --result_out=plot.ps
4819 \begin_layout Standard
4820 The resulting dotplot is in Fig.\InsetSpace ~
4822 \begin_inset LatexCommand ref
4823 reference "fig:Dotplot"
4830 \begin_layout Standard
4831 \begin_inset Float figure
4836 \begin_layout Standard
4839 \begin_inset Graphics
4849 \begin_layout Standard
4850 \begin_inset Caption
4852 \begin_layout Standard
4853 \begin_inset LatexCommand label
4867 Forward matches are displayed in green while reverse matches are displayed
4876 \begin_layout Subsection
4877 How to plot a sequence logo?
4880 \begin_layout Standard
4881 Sequence logos can be generate with
4886 The sequnce type is determined automagically and an entropy scale of 2
4887 bits and 4 bits is used for nucleotide and peptide sequences, respectively
4891 \begin_layout Standard
4892 \begin_inset LatexCommand htmlurl
4893 target "http://www.ccrnp.ncifcrf.gov/~toms/paper/hawaii/latex/node5.html"
4905 \begin_layout LyX-Code
4907 | plot_seqlogo --no_stream --result_out=seqlogo.svg
4910 \begin_layout Standard
4911 An example of a sequence logo can be seen in Fig.\InsetSpace ~
4913 \begin_inset LatexCommand ref
4914 reference "fig:Sequence-logo"
4921 \begin_layout Standard
4922 \begin_inset Float figure
4927 \begin_layout Standard
4930 \begin_inset Graphics
4931 filename seqlogo.png
4940 \begin_layout Standard
4941 \begin_inset Caption
4943 \begin_layout Standard
4944 \begin_inset LatexCommand label
4945 name "fig:Sequence-logo"
4962 \begin_layout Subsection
4963 How to plot a karyogram?
4964 \begin_inset LatexCommand label
4965 name "sub:How-to-plot-karyogram"
4972 \begin_layout Standard
4973 To plot search hits on genomes use
4977 , which will output a nice karyogram in SVG graphics:
4980 \begin_layout LyX-Code
4982 | plot_karyogram --result_out=karyogram.svg
4985 \begin_layout Standard
4986 The banding data is taken from the UCSC genome browser database and currently
4987 only Human and Mouse is supported.
4990 \begin_inset LatexCommand ref
4991 reference "fig:Karyogram"
4995 shows the distribution of piRNA like RNAs matched to the Human genome.
4998 \begin_layout Standard
4999 \begin_inset Float figure
5004 \begin_layout Standard
5007 \begin_inset Graphics
5008 filename karyogram.png
5017 \begin_layout Standard
5018 \begin_inset Caption
5020 \begin_layout Standard
5021 \begin_inset LatexCommand label
5022 name "fig:Karyogram"
5035 Hits from a search of piRNA like RNAs in the Human genome is displayed as
5036 short horizontal bars.
5044 \begin_layout Section
5048 \begin_layout Subsection
5049 How do I display my results in the UCSC Genome Browser?
5052 \begin_layout Standard
5053 Results from the list of biopieces below can be uploaded directly to a local
5054 mirror of the UCSC Genome Browser using the biopiece
5061 \begin_layout Itemize
5063 \begin_inset LatexCommand eqref
5064 reference "sub:How-to-use-patscan"
5071 \begin_layout Itemize
5073 \begin_inset LatexCommand eqref
5074 reference "sub:How-to-use-BLAT"
5081 \begin_layout Itemize
5083 \begin_inset LatexCommand eqref
5084 reference "sub:How-to-use-BLAST"
5091 \begin_layout Itemize
5093 \begin_inset LatexCommand eqref
5094 reference "sub:How-to-use-Vmatch"
5101 \begin_layout Standard
5102 The syntax for uploading data the most simple way requires two mandatory
5107 \begin_layout Standard
5116 database, which is the UCSC database name (such as hg18, mm9, etc.) and
5120 \begin_layout Standard
5129 table which should be the users initials followed by an underscore and a
5130 short description of the data:
5133 \begin_layout LyX-Code
5135 | upload_to_ucsc --database=hg18 --table=mah_snoRNAs
5138 \begin_layout Standard
5143 biopiece modifies the users ~/ucsc/my_tracks.ra file automagically (a backup
5144 is created with the name ~/ucsc/my_tracks.ra~) with default values that
5145 can be overridden using the following switches:
5148 \begin_layout Itemize
5152 \begin_layout Standard
5161 short_label - Short label for track - Default=database->table
5164 \begin_layout Itemize
5168 \begin_layout Standard
5177 long_label - Long label for track - Default=database->table
5180 \begin_layout Itemize
5184 \begin_layout Standard
5193 group - Track group name - Default=<user name as defined in env>
5196 \begin_layout Itemize
5200 \begin_layout Standard
5209 priority - Track display priority - Default=1
5212 \begin_layout Itemize
5216 \begin_layout Standard
5225 color - Track color - Default=147,73,42
5228 \begin_layout Itemize
5232 \begin_layout Standard
5241 chunk_size - Chunks for loading - Default=10000000
5244 \begin_layout Itemize
5248 \begin_layout Standard
5257 visibility - Track visibility - Default=pack
5260 \begin_layout Standard
5261 Also, data in BED or PSL format can be uploaded with
5265 as long as these reference to genomes and chromosomes existing in the UCSC
5269 \begin_layout LyX-Code
5270 read_bed --data_in=<bed file> | upload_to_ucsc ...
5273 \begin_layout LyX-Code
5277 \begin_layout LyX-Code
5278 read_psl --data_in=<psl file> | upload_to_ucsc ...
5281 \begin_layout Section
5285 \begin_layout Standard
5286 It is possible to do commandline scripting of biopiece records using Perl.
5287 Because a biopiece record essentially is a hash structure, you can pass
5292 command, which is a wrapper around the Perl executable that allows direct
5293 manipulations of the records using the power of Perl.
5296 \begin_layout Standard
5297 In the below example we replace in all records the value to the CHR key
5298 with a forthrunning number:
5301 \begin_layout LyX-Code
5303 | bioscript 'while($r=get_record(
5305 *STDIN)){$r->{CHR}=$i++; put_record($r)}'
5308 \begin_layout Standard
5309 Something more useful would probably be to create custom FASTA headers.
5311 if we read in a BED file, lookup the genomic sequence, create a custom
5316 and output FASTA entries:
5319 \begin_layout LyX-Code
5321 | bioscript 'while($r=get_record(
5323 *STDIN)){$r->{SEQ_NAME}= //
5326 \begin_layout LyX-Code
5327 join("_",$r->{CHR},$r->{CHR_BEG},$r->{CHR_END}); put_record($r)}'
5330 \begin_layout Standard
5334 \begin_layout LyX-Code
5335 >chr2L_21567527_21567550
5338 \begin_layout LyX-Code
5339 taccaaacggatgcctcagacatc
5342 \begin_layout LyX-Code
5343 >chr2L_693380_693403
5346 \begin_layout LyX-Code
5347 taccaaacggatgcctcagacatc
5350 \begin_layout LyX-Code
5351 >chr2L_13859534_13859557
5354 \begin_layout LyX-Code
5355 taccaaacggatgcctcagacatc
5358 \begin_layout LyX-Code
5359 >chr2L_9005090_9005113
5362 \begin_layout LyX-Code
5363 taccaaacggatgcctcagacatc
5366 \begin_layout LyX-Code
5367 >chr2L_2106825_2106848
5370 \begin_layout LyX-Code
5371 taccaaacggatgcctcagacatc
5374 \begin_layout LyX-Code
5375 >chr2L_14649031_14649054
5378 \begin_layout LyX-Code
5379 taccaaacggatgcctcagacatc
5382 \begin_layout Section
5386 \begin_layout Standard
5387 Shoot the messenger!
5390 \begin_layout Section
5393 \begin_inset LatexCommand label
5401 \begin_layout Standard
5405 \begin_layout Standard
5409 \begin_layout Standard
5413 \begin_layout Standard
5417 \begin_layout Standard
5421 \begin_layout Standard
5425 \begin_layout Section
5427 \begin_inset LatexCommand label
5435 \begin_layout Standard
5439 \begin_layout Standard
5451 \begin_layout Standard
5455 \begin_layout Standard
5467 \begin_layout Standard
5471 \begin_layout Standard
5483 \begin_layout Standard
5487 \begin_layout Standard
5499 \begin_layout Standard
5503 \begin_layout Standard
5515 \begin_layout Standard
5519 \begin_layout Standard
5531 \begin_layout Section
5532 scan_for_matches README
5533 \begin_inset LatexCommand label
5534 name "sec:scan_for_matches-README"
5541 \begin_layout LyX-Code
5545 \begin_layout LyX-Code
5546 A Program to Scan Nucleotide or Protein Sequences for Matching Patterns
5549 \begin_layout LyX-Code
5553 \begin_layout LyX-Code
5557 \begin_layout LyX-Code
5558 Argonne National Laboratory
5561 \begin_layout LyX-Code
5565 \begin_layout LyX-Code
5569 \begin_layout LyX-Code
5570 Scan_for_matches is a utility that we have written to search for
5573 \begin_layout LyX-Code
5574 patterns in DNA and protein sequences.
5575 I wrote most of the code,
5578 \begin_layout LyX-Code
5579 although David Joerg and Morgan Price wrote sections of an
5582 \begin_layout LyX-Code
5584 The whole notion of pattern matching has a rich
5587 \begin_layout LyX-Code
5588 history, and we borrowed liberally from many sources.
5592 \begin_layout LyX-Code
5593 worth noting that we were strongly influenced by the elegant tools
5596 \begin_layout LyX-Code
5597 developed and distributed by David Searls.
5598 My intent is to make the
5601 \begin_layout LyX-Code
5602 existing tool available to anyone in the research community that might
5605 \begin_layout LyX-Code
5607 I will continue to try to fix bugs and make suggested
5610 \begin_layout LyX-Code
5611 enhancements, at least until I feel that a superior tool exists.
5614 \begin_layout LyX-Code
5615 Hence, I would appreciate it if all bug reports and suggestions are
5618 \begin_layout LyX-Code
5619 directed to me at Overbeek@mcs.anl.gov.
5623 \begin_layout LyX-Code
5624 I will try to log all bug fixes and report them to users that send me
5627 \begin_layout LyX-Code
5628 their email addresses.
5629 I do not require that you give me your name
5632 \begin_layout LyX-Code
5634 However, if you do give it to me, I will try to notify
5637 \begin_layout LyX-Code
5638 you of serious problems as they are discovered.
5641 \begin_layout LyX-Code
5645 \begin_layout LyX-Code
5646 The distribution should contain at least the following programs:
5649 \begin_layout LyX-Code
5650 README - This document
5653 \begin_layout LyX-Code
5654 ggpunit.c - One of the two source files
5657 \begin_layout LyX-Code
5658 scan_for_matches.c - The second source file
5661 \begin_layout LyX-Code
5665 \begin_layout LyX-Code
5666 run_tests - A perl script to test things
5669 \begin_layout LyX-Code
5670 show_hits - A handy perl script
5673 \begin_layout LyX-Code
5674 test_dna_input - Test sequences for DNA
5677 \begin_layout LyX-Code
5678 test_dna_patterns - Test patterns for DNA scan
5681 \begin_layout LyX-Code
5682 test_output - Desired output from test
5685 \begin_layout LyX-Code
5686 test_prot_input - Test protein sequences
5689 \begin_layout LyX-Code
5690 test_prot_patterns - Test patterns for proteins
5693 \begin_layout LyX-Code
5694 testit - a perl script used for test
5697 \begin_layout LyX-Code
5698 Only the first three files are required.
5699 The others are useful,
5702 \begin_layout LyX-Code
5703 but only if you have Perl installed on your system.
5707 \begin_layout LyX-Code
5708 have Perl, I suggest that you type
5711 \begin_layout LyX-Code
5715 \begin_layout LyX-Code
5719 \begin_layout LyX-Code
5720 to find out where it installed.
5721 On my system, I get the following
5724 \begin_layout LyX-Code
5728 \begin_layout LyX-Code
5732 \begin_layout LyX-Code
5736 \begin_layout LyX-Code
5740 \begin_layout LyX-Code
5741 indicating that Perl is installed in /usr/local/bin.
5745 \begin_layout LyX-Code
5746 you know where it is installed, edit the first line of files
5749 \begin_layout LyX-Code
5753 \begin_layout LyX-Code
5757 \begin_layout LyX-Code
5758 replacing /usr/local/bin/perl with the appropriate location.
5762 \begin_layout LyX-Code
5763 will assume that you can do this, although it is not critical (it
5766 \begin_layout LyX-Code
5767 is needed only to test the installation and to use the "show_hits"
5770 \begin_layout LyX-Code
5772 Perl is not required to actually install and run
5775 \begin_layout LyX-Code
5780 \begin_layout LyX-Code
5781 If you do not have Perl, I suggest you get it and install it (it
5784 \begin_layout LyX-Code
5785 is a wonderful utility).
5786 Information about Perl and how to get it
5789 \begin_layout LyX-Code
5790 can be found in the book "Programming Perl" by Larry Wall and
5793 \begin_layout LyX-Code
5795 Schwartz, published by O'Reilly & Associates, Inc.
5798 \begin_layout LyX-Code
5799 To get started, you will need to compile the program.
5803 \begin_layout LyX-Code
5807 \begin_layout LyX-Code
5808 gcc -O -o scan_for_matches ggpunit.c scan_for_matches.c
5811 \begin_layout LyX-Code
5812 If you do not use GNU C, use
5815 \begin_layout LyX-Code
5816 cc -O -DCC -o scan_for_matches ggpunit.c scan_for_matches.c
5819 \begin_layout LyX-Code
5820 which works on my Sun.
5824 \begin_layout LyX-Code
5825 Once you have compiled scan_for_matches, you can verify that it
5828 \begin_layout LyX-Code
5832 \begin_layout LyX-Code
5833 clone% run_tests tmp
5836 \begin_layout LyX-Code
5837 clone% diff tmp test_output
5840 \begin_layout LyX-Code
5841 You may get a few strange lines of the sort
5844 \begin_layout LyX-Code
5845 clone% run_tests tmp
5848 \begin_layout LyX-Code
5849 rm: tmp: No such file or directory
5852 \begin_layout LyX-Code
5853 clone% diff tmp test_output
5856 \begin_layout LyX-Code
5857 These should cause no concern.
5858 However, if the "diff" shows that
5861 \begin_layout LyX-Code
5862 tmp and test_output are different, contact me (you have a
5865 \begin_layout LyX-Code
5870 \begin_layout LyX-Code
5871 You should now be able to use scan_for_matches by following the
5874 \begin_layout LyX-Code
5875 instructions given below (which is all the normal user should have
5878 \begin_layout LyX-Code
5879 to understand, once things are installed properly).
5882 \begin_layout LyX-Code
5883 ==============================================================
5886 \begin_layout LyX-Code
5887 How to run scan_for_matches:
5890 \begin_layout LyX-Code
5891 To run the program, you type need to create two files
5894 \begin_layout LyX-Code
5896 the first file contains the pattern you wish to scan for; I'll
5899 \begin_layout LyX-Code
5900 call this file pat_file in what follows (but any name is ok)
5903 \begin_layout LyX-Code
5905 the second file contains a set of sequences to scan.
5909 \begin_layout LyX-Code
5910 should be in "fasta format".
5911 Just look at the contents of
5914 \begin_layout LyX-Code
5915 test_dna_input to see examples of this format.
5919 \begin_layout LyX-Code
5920 each sequence begins with a line of the form
5923 \begin_layout LyX-Code
5927 \begin_layout LyX-Code
5928 and is followed by one or more lines containing the sequence.
5931 \begin_layout LyX-Code
5932 Once these files have been created, you just use
5935 \begin_layout LyX-Code
5936 scan_for_matches pat_file < input_file
5939 \begin_layout LyX-Code
5940 to scan all of the input sequences for the given pattern.
5944 \begin_layout LyX-Code
5945 example, suppose that pat_file contains a single line of the form
5948 \begin_layout LyX-Code
5952 \begin_layout LyX-Code
5956 \begin_layout LyX-Code
5957 scan_for_matches pat_file < test_dna_input
5960 \begin_layout LyX-Code
5961 should produce two "hits".
5962 When I run this on my machine, I get
5965 \begin_layout LyX-Code
5966 clone% scan_for_matches pat_file < test_dna_input
5969 \begin_layout LyX-Code
5973 \begin_layout LyX-Code
5974 cguaacc ggttaacc gguuacg
5977 \begin_layout LyX-Code
5981 \begin_layout LyX-Code
5982 CGUAACC GGTTAACC GGUUACG
5985 \begin_layout LyX-Code
5989 \begin_layout LyX-Code
5990 Simple Patterns Built by Matching Ranges and Reverse Complements
5993 \begin_layout LyX-Code
5994 Let me first explain this simple pattern:
5997 \begin_layout LyX-Code
6001 \begin_layout LyX-Code
6005 \begin_layout LyX-Code
6006 The pattern consists of three "pattern units" separated by spaces.
6009 \begin_layout LyX-Code
6010 The first pattern unit is
6013 \begin_layout LyX-Code
6017 \begin_layout LyX-Code
6018 which means "match 4 to 7 characters and call them p1".
6022 \begin_layout LyX-Code
6023 second pattern unit is
6026 \begin_layout LyX-Code
6030 \begin_layout LyX-Code
6031 which means "then match 3 to 8 characters".
6032 The last pattern unit
6035 \begin_layout LyX-Code
6039 \begin_layout LyX-Code
6043 \begin_layout LyX-Code
6044 which means "match the reverse complement of p1".
6048 \begin_layout LyX-Code
6049 reported hit is shown as
6052 \begin_layout LyX-Code
6056 \begin_layout LyX-Code
6057 cguaacc ggttaacc gguuacg
6060 \begin_layout LyX-Code
6061 which states that characters 6 through 27 of sequence tst1 were
6064 \begin_layout LyX-Code
6066 "cguaac" matched the first pattern unit, "ggttaacc" the
6069 \begin_layout LyX-Code
6070 second, and "gguuacg" the third.
6071 This is an example of a common
6074 \begin_layout LyX-Code
6075 type of pattern used to search for sections of DNA or RNA that
6078 \begin_layout LyX-Code
6079 would fold into a hairpin loop.
6082 \begin_layout LyX-Code
6083 Searching Both Strands
6086 \begin_layout LyX-Code
6087 Now for a short aside: scan_for_matches only searched the
6090 \begin_layout LyX-Code
6091 sequences in the input file; it did not search the opposite
6094 \begin_layout LyX-Code
6096 With a pattern of the sort we just used, there is not
6099 \begin_layout LyX-Code
6100 need o search the opposite strand.
6101 However, it is normally the
6104 \begin_layout LyX-Code
6105 case that you will wish to search both the sequence and the
6108 \begin_layout LyX-Code
6109 opposite strand (i.e., the reverse complement of the sequence).
6112 \begin_layout LyX-Code
6113 To do that, you would just use the "-c" command line.
6117 \begin_layout LyX-Code
6118 scan_for_matches -c pat_file < test_dna_input
6121 \begin_layout LyX-Code
6122 Hits on the opposite strand will show a beginning location greater
6125 \begin_layout LyX-Code
6126 than te end location of the match.
6129 \begin_layout LyX-Code
6130 Defining Pairing Rules and Allowing Mismatches, Insertions, and Deletions
6133 \begin_layout LyX-Code
6134 Let us stop now and ask "What additional features would one need to
6137 \begin_layout LyX-Code
6138 really find the kinds of loop structures that characterize tRNAs,
6141 \begin_layout LyX-Code
6142 rRNAs, and so forth?" I can immediately think of two:
6145 \begin_layout LyX-Code
6146 a) you will need to be able to allow non-standard pairings
6149 \begin_layout LyX-Code
6150 (those other than G-C and A-U), and
6153 \begin_layout LyX-Code
6154 b) you will need to be able to tolerate some number of
6157 \begin_layout LyX-Code
6158 mismatches and bulges.
6161 \begin_layout LyX-Code
6165 \begin_layout LyX-Code
6166 Let me first show you how to handle non-standard "rules for
6169 \begin_layout LyX-Code
6170 pairing in reverse complements".
6171 Consider the following pattern,
6174 \begin_layout LyX-Code
6175 which I show as two line (you may use as many lines as you like in
6178 \begin_layout LyX-Code
6179 forming a pattern, although you can only break a pattern at points
6182 \begin_layout LyX-Code
6183 where space would be legal):
6186 \begin_layout LyX-Code
6187 r1={au,ua,gc,cg,gu,ug,ga,ag}
6190 \begin_layout LyX-Code
6191 p1=2...3 0...4 p2=2...5 1...5 r1~p2 0...4 ~p1
6194 \begin_layout LyX-Code
6195 The first "pattern unit" does not actually match anything; rather,
6198 \begin_layout LyX-Code
6199 it defines a "pairing rule" in which standard pairings are
6202 \begin_layout LyX-Code
6203 allowed, as well as G-A and A-G (in case you wondered, Us and Ts
6206 \begin_layout LyX-Code
6207 and upper and lower case can be used interchangably; for example
6210 \begin_layout LyX-Code
6211 r1={AT,UA,gc,cg} could be used to define the "standard rule" for
6214 \begin_layout LyX-Code
6216 The second line consists of six pattern units which
6219 \begin_layout LyX-Code
6220 may be interpreted as follows:
6223 \begin_layout LyX-Code
6224 p1=2...3 match 2 or 3 characters (call it p1)
6227 \begin_layout LyX-Code
6228 0...4 match 0 to 4 characters
6231 \begin_layout LyX-Code
6232 p2=2...5 match 2 to 5 characters (call it p2)
6235 \begin_layout LyX-Code
6236 1...5 match 1 to 5 characters
6239 \begin_layout LyX-Code
6240 r1~p2 match the reverse complement of p2,
6243 \begin_layout LyX-Code
6244 allowing G-A and A-G pairs
6247 \begin_layout LyX-Code
6248 0...4 match 0 to 4 characters
6251 \begin_layout LyX-Code
6252 ~p1 match the reverse complement of p1
6255 \begin_layout LyX-Code
6256 allowing only G-C, C-G, A-T, and T-A pairs
6259 \begin_layout LyX-Code
6260 Thus, r1~p2 means "match the reverse complement of p2 using rule r1".
6263 \begin_layout LyX-Code
6264 Now let us consider the issue of tolerating mismatches and bulges.
6267 \begin_layout LyX-Code
6268 You may add a "qualifier" to the pattern unit that gives the
6271 \begin_layout LyX-Code
6272 tolerable number of "mismatches, deletions, and insertions".
6275 \begin_layout LyX-Code
6279 \begin_layout LyX-Code
6280 p1=10...10 3...8 ~p1[1,2,1]
6283 \begin_layout LyX-Code
6284 means that the third pattern unit must match 10 characters,
6287 \begin_layout LyX-Code
6288 allowing one "mismatch" (a pairing other than G-C, C-G, A-T, or
6291 \begin_layout LyX-Code
6292 T-A), two deletions (a deletion is a character that occurs in p1,
6295 \begin_layout LyX-Code
6296 but has been "deleted" from the string matched by ~p1), and one
6299 \begin_layout LyX-Code
6300 insertion (an "insertion" is a character that occurs in the string
6303 \begin_layout LyX-Code
6304 matched by ~p1, but not for which no corresponding character
6307 \begin_layout LyX-Code
6309 In this case, the pattern would match
6312 \begin_layout LyX-Code
6313 ACGTACGTAC GGGGGGGG GCGTTACCT
6316 \begin_layout LyX-Code
6317 which is, you must admit, a fairly weak loop.
6321 \begin_layout LyX-Code
6322 allow mismatches, but you will find yourself using insertions and
6325 \begin_layout LyX-Code
6326 deletions much more rarely.
6327 In any event, you should note that
6330 \begin_layout LyX-Code
6331 allowing mismatches, insertions, and deletions does force the
6334 \begin_layout LyX-Code
6335 program to try many additional possible pairings, so it does slow
6338 \begin_layout LyX-Code
6342 \begin_layout LyX-Code
6343 How Patterns Are Matched
6346 \begin_layout LyX-Code
6347 Now is as good a time as any to discuss the basic flow of control
6350 \begin_layout LyX-Code
6351 when matching patterns.
6352 Recall that a "pattern" is a sequence of
6355 \begin_layout LyX-Code
6357 Suppose that the pattern units were
6360 \begin_layout LyX-Code
6365 \begin_layout LyX-Code
6366 The scan of a sequence S begins by setting the current position
6369 \begin_layout LyX-Code
6371 Then, an attempt is made to match u1 starting at the
6374 \begin_layout LyX-Code
6376 Each attempt to match a pattern unit can
6379 \begin_layout LyX-Code
6381 If it succeeds, then an attempt is made to match
6384 \begin_layout LyX-Code
6386 If it fails, then an attempt is made to find an
6389 \begin_layout LyX-Code
6390 alternative match for the immediately preceding pattern unit.
6394 \begin_layout LyX-Code
6395 this succeeds, then we proceed forward again to the next unit.
6399 \begin_layout LyX-Code
6400 it fails we go back to the preceding unit.
6401 This process is called
6404 \begin_layout LyX-Code
6406 If there are no previous units, then the current
6409 \begin_layout LyX-Code
6410 position is incremented by one, and everything starts again.
6414 \begin_layout LyX-Code
6415 proceeds until either the current position goes past the end of
6418 \begin_layout LyX-Code
6419 the sequence or all of the pattern units succeed.
6423 \begin_layout LyX-Code
6424 scan_for_matches reports the "hit", the current position is set
6427 \begin_layout LyX-Code
6428 just past the hit, and an attempt is made to find another hit.
6431 \begin_layout LyX-Code
6432 If you wish to limit the scan to simply finding a maximum of, say,
6435 \begin_layout LyX-Code
6436 10 hits, you can use the -n option (-n 10 would set the limit to
6439 \begin_layout LyX-Code
6444 \begin_layout LyX-Code
6445 scan_for_matches -c -n 1 pat_file < test_dna_input
6448 \begin_layout LyX-Code
6449 would search for just the first hit (and would stop searching the
6452 \begin_layout LyX-Code
6453 current sequences or any that follow in the input file).
6456 \begin_layout LyX-Code
6457 Searching for repeats:
6460 \begin_layout LyX-Code
6461 In the last section, I discussed almost all of the details
6464 \begin_layout LyX-Code
6465 required to allow you to look for repeats.
6466 Consider the following
6469 \begin_layout LyX-Code
6473 \begin_layout LyX-Code
6474 p1=6...6 3...8 p1 (find exact 6 character repeat separated
6477 \begin_layout LyX-Code
6481 \begin_layout LyX-Code
6482 p1=6...6 3..8 p1[1,0,0] (allow one mismatch)
6485 \begin_layout LyX-Code
6486 p1=3...3 p1[1,0,0] p1[1,0,0] p1[1,0,0]
6489 \begin_layout LyX-Code
6490 (match 12 characters that are the remains
6493 \begin_layout LyX-Code
6494 of a 3-character sequence occurring 4 times)
6497 \begin_layout LyX-Code
6501 \begin_layout LyX-Code
6502 p1=4...8 0...3 p2=6...8 p1 0...3 p2
6505 \begin_layout LyX-Code
6506 (This would match things like
6509 \begin_layout LyX-Code
6510 ATCT G TCTTT ATCT TG TCTTT
6513 \begin_layout LyX-Code
6517 \begin_layout LyX-Code
6518 Searching for particular sequences:
6521 \begin_layout LyX-Code
6522 Occasionally, one wishes to match a specific, known sequence.
6525 \begin_layout LyX-Code
6526 In such a case, you can just give the sequence (along with an
6529 \begin_layout LyX-Code
6530 optional statement of the allowable mismatches, insertions, and
6533 \begin_layout LyX-Code
6538 \begin_layout LyX-Code
6539 p1=6...8 GAGA ~p1 (match a hairpin with GAGA as the loop)
6542 \begin_layout LyX-Code
6543 RRRRYYYY (match 4 purines followed by 4 pyrimidines)
6546 \begin_layout LyX-Code
6547 TATAA[1,0,0] (match TATAA, allowing 1 mismatch)
6550 \begin_layout LyX-Code
6554 \begin_layout LyX-Code
6555 Matches against a "weight matrix":
6558 \begin_layout LyX-Code
6559 I will conclude my examples of the types of pattern units
6562 \begin_layout LyX-Code
6563 available for matching against nucleotide sequences by discussing a
6566 \begin_layout LyX-Code
6567 crude implemetation of matching using a "weight matrix".
6571 \begin_layout LyX-Code
6572 am less than overwhelmed with the syntax that I chose, I think that
6575 \begin_layout LyX-Code
6576 the reader should be aware that I was thinking of generating
6579 \begin_layout LyX-Code
6580 patterns containing such pattern units automatically from
6583 \begin_layout LyX-Code
6584 alignments (and did not really plan on typing such things in by
6587 \begin_layout LyX-Code
6589 Anyway, suppose that you wanted to match a
6592 \begin_layout LyX-Code
6593 sequence of eight characters.
6594 The "consensus" of these eight
6597 \begin_layout LyX-Code
6598 characters is GRCACCGS, but the actual "frequencies of occurrence"
6601 \begin_layout LyX-Code
6602 are given in the matrix below.
6603 Thus, the first character is an A
6606 \begin_layout LyX-Code
6607 16% the time and a G 84% of the time.
6608 The second is an A 57% of
6611 \begin_layout LyX-Code
6612 the time, a C 10% of the time, a G 29% of the time, and a T 4% of
6615 \begin_layout LyX-Code
6620 \begin_layout LyX-Code
6621 C1 C2 C3 C4 C5 C6 C7 C8
6624 \begin_layout LyX-Code
6628 \begin_layout LyX-Code
6629 A 16 57 0 95 0 18 0 0
6632 \begin_layout LyX-Code
6633 C 0 10 80 0 100 60 0 50
6636 \begin_layout LyX-Code
6637 G 84 29 0 0 0 20 100 50
6640 \begin_layout LyX-Code
6644 \begin_layout LyX-Code
6648 \begin_layout LyX-Code
6649 One could use the following pattern unit to search for inexact
6652 \begin_layout LyX-Code
6653 matches related to such a "weight matrix":
6656 \begin_layout LyX-Code
6657 {(16,0,84,0),(57,10,29,4),(0,80,0,20),(95,0,0,5),
6660 \begin_layout LyX-Code
6661 (0,100,0,0),(18,60,20,2),(0,0,100,0),(0,50,50,0)} > 450
6664 \begin_layout LyX-Code
6665 This pattern unit will attempt to match exactly eight characters.
6668 \begin_layout LyX-Code
6669 For each character in the sequence, the entry in the corresponding
6672 \begin_layout LyX-Code
6673 tuple is added to an accumulated sum.
6674 If the sum is greater than
6677 \begin_layout LyX-Code
6678 450, the match succeeds; else it fails.
6681 \begin_layout LyX-Code
6682 Recently, this feature was upgraded to allow ranges.
6686 \begin_layout LyX-Code
6687 600 > {(16,0,84,0),(57,10,29,4),(0,80,0,20),(95,0,0,5),
6690 \begin_layout LyX-Code
6691 (0,100,0,0),(18,60,20,2),(0,0,100,0),(0,50,50,0)} > 450
6694 \begin_layout LyX-Code
6698 \begin_layout LyX-Code
6699 Allowing Alternatives:
6702 \begin_layout LyX-Code
6703 Very occasionally, you may wish to allow alternative pattern units
6706 \begin_layout LyX-Code
6707 (i.e., "match either A or B").
6708 You can do this using something
6711 \begin_layout LyX-Code
6715 \begin_layout LyX-Code
6719 \begin_layout LyX-Code
6720 which says "match either GAGA or GCGCA".
6724 \begin_layout LyX-Code
6725 alternatives of a list of pattern units, for example
6728 \begin_layout LyX-Code
6729 (p1=3...3 3...8 ~p1 | p1=5...5 4...4 ~p1 GGG)
6732 \begin_layout LyX-Code
6733 would match one of two sequences of pattern units.
6737 \begin_layout LyX-Code
6738 clumsy aspect of the syntax: to match a list of alternatives, you
6741 \begin_layout LyX-Code
6742 need to fully the request.
6746 \begin_layout LyX-Code
6747 (GAGA | (GCGCA | TTCGA))
6750 \begin_layout LyX-Code
6751 would be needed to try the three alternatives.
6754 \begin_layout LyX-Code
6758 \begin_layout LyX-Code
6759 Sometimes a pattern will contain a sequence of distinct ranges,
6762 \begin_layout LyX-Code
6763 and you might wish to limit the sum of the lengths of the matched
6766 \begin_layout LyX-Code
6768 For example, suppose that you basically wanted to
6771 \begin_layout LyX-Code
6772 match something like
6775 \begin_layout LyX-Code
6776 ARRYYTT p1=0...5 GCA[1,0,0] p2=1...6 ~p1 4...8 ~p2 p3=4...10 CCT
6779 \begin_layout LyX-Code
6780 but that the sum of the lengths of p1, p2, and p3 must not exceed
6783 \begin_layout LyX-Code
6785 To do this, you could add
6788 \begin_layout LyX-Code
6789 length(p1+p2+p3) < 9
6792 \begin_layout LyX-Code
6793 as the last pattern unit.
6794 It will just succeed or fail (but does
6797 \begin_layout LyX-Code
6798 not actually match any characters in the sequence).
6801 \begin_layout LyX-Code
6805 \begin_layout LyX-Code
6806 Matching Protein Sequences
6809 \begin_layout LyX-Code
6810 Suppose that the input file contains protein sequences.
6814 \begin_layout LyX-Code
6815 case, you must invoke scan_for_matches with the "-p" option.
6819 \begin_layout LyX-Code
6820 cannot use aspects of the language that relate directly to
6823 \begin_layout LyX-Code
6824 nucleotide sequences (e.g., the -c command line option or pattern
6827 \begin_layout LyX-Code
6828 constructs referring to the reverse complement of a previously
6831 \begin_layout LyX-Code
6836 \begin_layout LyX-Code
6837 You also have two additional constructs that allow you to match
6840 \begin_layout LyX-Code
6841 either "one of a set of amino acids" or "any amino acid other than
6844 \begin_layout LyX-Code
6849 \begin_layout LyX-Code
6850 p1=0...4 any(HQD) 1...3 notany(HK) p1
6853 \begin_layout LyX-Code
6854 would successfully match a string like
6857 \begin_layout LyX-Code
6861 \begin_layout LyX-Code
6862 Using the show_hits Utility
6865 \begin_layout LyX-Code
6866 When viewing a large set of complex matches, you might find it
6869 \begin_layout LyX-Code
6870 convenient to post-process the scan_for_matches output to get a
6873 \begin_layout LyX-Code
6874 more readable version.
6875 We provide a simple post-processor called
6878 \begin_layout LyX-Code
6880 To see its effect, just pipe the output of a
6883 \begin_layout LyX-Code
6884 scan_for_matches into show_hits:
6887 \begin_layout LyX-Code
6891 \begin_layout LyX-Code
6892 clone% scan_for_matches -c pat_file < tmp
6895 \begin_layout LyX-Code
6899 \begin_layout LyX-Code
6900 gtacguaacc ggttaac cgguuacgtac
6903 \begin_layout LyX-Code
6907 \begin_layout LyX-Code
6908 gtacgtaacc ggttaac cggttacgtac
6911 \begin_layout LyX-Code
6915 \begin_layout LyX-Code
6916 CGTACGUAAC C GGTTAACC GGUUACGTACG
6919 \begin_layout LyX-Code
6923 \begin_layout LyX-Code
6924 CGTACGTAAC C GGTTAACC GGTTACGTACG
6927 \begin_layout LyX-Code
6931 \begin_layout LyX-Code
6932 gtacguaacc g gttaactt cgguuacgtac
6935 \begin_layout LyX-Code
6939 \begin_layout LyX-Code
6940 gtacgtaacc g aagttaac cggttacgtac
6943 \begin_layout LyX-Code
6944 Piped Through show_hits:
6947 \begin_layout LyX-Code
6951 \begin_layout LyX-Code
6952 clone% scan_for_matches -c pat_file < tmp | show_hits
6955 \begin_layout LyX-Code
6956 tst1:[1,28]: gtacguaacc ggttaac cgguuacgtac
6959 \begin_layout LyX-Code
6960 tst1:[28,1]: gtacgtaacc ggttaac cggttacgtac
6963 \begin_layout LyX-Code
6964 tst2:[2,31]: CGTACGUAAC C GGTTAACC GGUUACGTACG
6967 \begin_layout LyX-Code
6968 tst2:[31,2]: CGTACGTAAC C GGTTAACC GGTTACGTACG
6971 \begin_layout LyX-Code
6972 tst3:[3,32]: gtacguaacc g gttaactt cgguuacgtac
6975 \begin_layout LyX-Code
6976 tst3:[32,3]: gtacgtaacc g aagttaac cggttacgtac
6979 \begin_layout LyX-Code
6983 \begin_layout LyX-Code
6984 Optionally, you can specify which of the "fields" in the matches
6987 \begin_layout LyX-Code
6988 you wish to sort on, and show_hits will sort them.
6992 \begin_layout LyX-Code
6993 numbers start with 0.
6994 So, you might get something like
6997 \begin_layout LyX-Code
6998 clone% scan_for_matches -c pat_file < tmp | show_hits 2 1
7001 \begin_layout LyX-Code
7002 tst2:[2,31]: CGTACGUAAC C GGTTAACC GGUUACGTACG
7005 \begin_layout LyX-Code
7006 tst2:[31,2]: CGTACGTAAC C GGTTAACC GGTTACGTACG
7009 \begin_layout LyX-Code
7010 tst3:[32,3]: gtacgtaacc g aagttaac cggttacgtac
7013 \begin_layout LyX-Code
7014 tst1:[1,28]: gtacguaacc ggttaac cgguuacgtac
7017 \begin_layout LyX-Code
7018 tst1:[28,1]: gtacgtaacc ggttaac cggttacgtac
7021 \begin_layout LyX-Code
7022 tst3:[3,32]: gtacguaacc g gttaactt cgguuacgtac
7025 \begin_layout LyX-Code
7029 \begin_layout LyX-Code
7030 In this case, the hits have been sorted on fields 2 and 1 (that is,
7033 \begin_layout LyX-Code
7034 the third and second matched subfields).
7037 \begin_layout LyX-Code
7038 show_hits is just one possible little post-processor, and you
7041 \begin_layout LyX-Code
7042 might well wish to write a customized one for yourself.
7045 \begin_layout LyX-Code
7046 Reducing the Cost of a Search
7049 \begin_layout LyX-Code
7050 The scan_for_matches utility uses a fairly simple search, and may
7053 \begin_layout LyX-Code
7054 consume large amounts of CPU time for complex patterns.
7058 \begin_layout LyX-Code
7059 I may decide to optimize the code.
7060 However, until then, let me
7063 \begin_layout LyX-Code
7064 mention one useful technique.
7068 \begin_layout LyX-Code
7069 When you have a complex pattern that includes a number of varying
7072 \begin_layout LyX-Code
7073 ranges, imprecise matches, and so forth, it is useful to
7076 \begin_layout LyX-Code
7078 That is, form a simpler pattern that can be
7081 \begin_layout LyX-Code
7082 used to scan through a large database extracting sections that
7085 \begin_layout LyX-Code
7086 might be matched by the more complex pattern.
7090 \begin_layout LyX-Code
7091 with a short example.
7092 Suppose that you really wished to match the
7095 \begin_layout LyX-Code
7099 \begin_layout LyX-Code
7100 p1=3...5 0...8 ~p1[1,1,0] p2=6...7 3...6 AGC 3...5 RYGC ~p2[1,0,0]
7103 \begin_layout LyX-Code
7104 In this case, the pattern units AGC 3...5 RYGC can be used to rapidly
7107 \begin_layout LyX-Code
7108 constrain the overall search.
7109 You can preprocess the overall
7112 \begin_layout LyX-Code
7113 database using the pattern:
7116 \begin_layout LyX-Code
7117 31...31 AGC 3...5 RYGC 7...7
7120 \begin_layout LyX-Code
7121 Put the complex pattern in pat_file1 and the simpler pattern in
7124 \begin_layout LyX-Code
7129 \begin_layout LyX-Code
7130 scan_for_matches -c pat_file2 < nucleotide_database |
7133 \begin_layout LyX-Code
7134 scan_for_matches pat_file1
7137 \begin_layout LyX-Code
7138 The output will show things like
7141 \begin_layout LyX-Code
7142 >seqid:[232,280][2,47]
7145 \begin_layout LyX-Code
7149 \begin_layout LyX-Code
7150 Then, the actual section of the sequence that was matched can be
7153 \begin_layout LyX-Code
7154 easily computed as [233,278] (remember, the positions start from
7157 \begin_layout LyX-Code
7161 \begin_layout LyX-Code
7162 Let me finally add, you should do a few short experiments to see
7165 \begin_layout LyX-Code
7166 whether or not such pipelining actually improves performance -- it
7169 \begin_layout LyX-Code
7170 is not always obvious where the time is going, and I have
7173 \begin_layout LyX-Code
7174 sometimes found that the added complexity of pipelining actually
7177 \begin_layout LyX-Code
7179 It gets its best improvements when there are
7182 \begin_layout LyX-Code
7183 exact matches of more than just a few characters that can be
7186 \begin_layout LyX-Code
7187 rapidly used to eliminate large sections of the database.
7190 \begin_layout LyX-Code
7194 \begin_layout LyX-Code
7198 \begin_layout LyX-Code
7199 Feb 9, 1995: the pattern units ^ and $ now work as in normal regular
7202 \begin_layout LyX-Code
7207 \begin_layout LyX-Code
7211 \begin_layout LyX-Code
7212 matches only TTF at the end of the string and
7215 \begin_layout LyX-Code
7219 \begin_layout LyX-Code
7220 matches only an initial TTF
7223 \begin_layout LyX-Code
7227 \begin_layout LyX-Code
7231 \begin_layout LyX-Code
7232 matches the reverse of the string named p1.
7236 \begin_layout LyX-Code
7237 if p1 matched GCAT, then <p1 would match TACG.
7241 \begin_layout LyX-Code
7245 \begin_layout LyX-Code
7246 matches a real palindrome (not the biologically common
7249 \begin_layout LyX-Code
7250 meaning of "reverse complement")
7253 \begin_layout LyX-Code