--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

BamTools: a C++ API & toolkit for reading/writing/manipulating BAM files.

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

BamTools provides both a programmer's API and an end-user's toolkit for handling
BAM files.

----------------------------------------
----------------------------------------

The API consists of two main modules: BamReader and BamWriter. As you would
expect, BamReader provides read access to BAM files, while BamWriter handles
writing data to BAM files. BamReader also provides the interface for random
access (jumping) in a BAM file, as well as for generating BAM index files.

BamMultiReader is an extra module that allows you to manage multiple open BAM
files for reading. It provides some validation & bookkeeping under the hood to
keep all files synchronized for you.

Additional files used by the API:

 - BamAlignment.* : implements the BamAlignment data structure

 - BamAux.h       : contains various constants, data structures and utility
                    methods used throughout the API.

 - BamIndex.*     : implements both the standard BAM format index (".bai") as
                    well as a new BamTools-specific index (".bti").

 - BGZF.*         : contains our implementation of the Broad Institute's BGZF
                    compression format.

----------------------------------------
----------------------------------------

If you've been using BamTools since the early days, you'll notice that our
'toy' API examples (BamConversion, BamDump, BamTrim,...) are now gone. We have
dumped these in favor of a suite of small utilities that we hope both
developers and end-users find useful:

usage: bamtools [--help] COMMAND [ARGS]

Available bamtools commands:

    convert         Converts between BAM and a number of other formats
    count           Prints number of alignments in BAM file(s)
    coverage        Prints coverage statistics from the input BAM file
    filter          Filters BAM file(s) by user-specified criteria
    header          Prints BAM header information
    index           Generates index for BAM file
    merge           Merge multiple BAM files into single file
    random          Select random alignments from existing BAM file(s)
    sort            Sorts the BAM file according to some criteria
    split           Splits a BAM file on user-specified property, creating a
                    new BAM output file for each value found
    stats           Prints some basic statistics from input BAM file(s)

See 'bamtools help COMMAND' for more information on a specific command.

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

** General usage information - perhaps explain common terms, point to SAM/BAM

----------------------------------------
----------------------------------------

The API, as noted above, contains two main modules - BamReader & BamWriter -
for dealing with BAM files. Alignment data is made available through the
BamAlignment data structure.

A simple (read-only) scenario for accessing BAM data would look like the
following:

    // open our BamReader
    BamReader reader;
    reader.Open("someData.bam", "someData.bam.bai");

    // define our region of interest
    // in this example: bases 0-500 on the reference "chrX"
    int id = reader.GetReferenceID("chrX");
    BamRegion region(id, 0, id, 500);
    reader.SetRegion(region);

    // iterate through alignments in this region,
    // ignoring alignments with a MQ below some cutoff
    BamAlignment al;
    while ( reader.GetNextAlignment(al) ) {
        if ( al.MapQuality >= 50 ) {
            // process this alignment here
        }
    }

    // close the reader when finished
    reader.Close();

To use this API in your application, you simply need to do 3 things:

    1 - Drop the BamTools API files somewhere the compiler can find them.

    2 - Import the BamTools API with the following lines of code:

        #include "BamReader.h"     // (or "BamMultiReader.h") as needed
        #include "BamWriter.h"     // as needed
        using namespace BamTools;  // all BamTools classes/methods live in
                                   // this namespace

    3 - Link with '-lz' ('l' as in Lima) to access the ZLIB compression
        library. (For MSVC users, I can provide you modified zlib headers -
        just contact me.)

See any included programs and the Makefile for more specific compiling/usage
examples. See comments in the header files for more detailed API documentation.

----------------------------------------
----------------------------------------

BamTools provides a small but powerful suite of command-line utility programs
for manipulating and querying BAM files.

All BamTools utilities handle I/O operations using a common set of arguments.

The input BAM file(s).

If a tool accepts multiple BAM files as input, each file gets its own "-in"
option on the command line. If no "-in" is provided, the tool will attempt
to read BAM data from stdin.

To read a single BAM file, use a single "-in" option:
    > bamtools *tool* -in myData1.bam ...ARGS...

To read multiple BAM files, use multiple "-in" options:
    > bamtools *tool* -in myData1.bam -in myData2.bam ...ARGS...

To read from stdin (if supported), omit the "-in" option:
    > bamtools *tool* ...ARGS...
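
The "-in" convention above can be sketched in a few lines of C++. This is only
an illustration of the rule as described (every "-in" names one input file, and
no "-in" at all means stdin). The helper name collectInputFiles is hypothetical
and is not part of the BamTools sources.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of BamTools): returns the filename that
// follows each "-in" option. An empty result means "read from stdin".
std::vector<std::string> collectInputFiles(const std::vector<std::string>& args) {
    std::vector<std::string> inputs;
    for (std::size_t i = 0; i + 1 < args.size(); ++i) {
        if (args[i] == "-in") {
            inputs.push_back(args[i + 1]);
            ++i;  // skip the filename we just consumed
        }
    }
    return inputs;
}
```

A caller would pass the command-line arguments that follow the tool name and
fall back to stdin whenever the returned list is empty.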

If a tool outputs a result BAM file, specify the filename using this option.
If none is provided, the tool will typically write to stdout.

*Note: Not all tools output BAM data (e.g. count, header, etc.)

A region of interest. See below for accepted 'REGION string' formats.

Many of the tools accept this option, which allows a user to only consider
alignments that overlap this region (whether counting, filtering, merging,
etc.).

An alignment is considered to overlap a region if any part of the alignment
intersects the left/right boundaries. Thus, a 50bp alignment at position 70
will overlap a region beginning at position 100.
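
As a rough sketch of the overlap rule just described (not the actual BamTools
implementation), the check boils down to an interval intersection test. The
function name overlapsRegion is hypothetical, and treating the right boundary
as exclusive is an assumption made only for this example.

```cpp
#include <cstdint>

// Hypothetical helper, assuming half-open intervals [start, end):
// an alignment overlaps the region if any of its bases fall inside it.
bool overlapsRegion(std::int64_t alignStart, std::int64_t alignLength,
                    std::int64_t regionLeft, std::int64_t regionRight) {
    const std::int64_t alignEnd = alignStart + alignLength;  // one past last base
    return alignStart < regionRight && alignEnd > regionLeft;
}
```

With these assumptions, the 50bp alignment at position 70 covers bases 70-119,
so it overlaps a region that begins at position 100, while a 30bp alignment at
the same position (bases 70-99) does not.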

----------------------
A proper REGION string can be formatted like any of the following examples,
where 'chr1' is the name of a reference (not its ID) and each number is any
valid integer position within that reference.

    chr1               - only alignments on (entire) reference 'chr1'
    chr1:500           - only alignments overlapping the region starting at
                         chr1:500 and continuing to the end of chr1
    chr1:500..1000     - only alignments overlapping the region starting at
                         chr1:500 and continuing to chr1:1000
    chr1:500..chr3:750 - only alignments overlapping the region starting at
                         chr1:500 and continuing to chr3:750. This 'spanning'
                         region assumes that the reference specified as the
                         right boundary will occur somewhere in the file after
                         the left boundary. On a sorted BAM, a REGION of
                         'chr4:500..chr2:1500' will produce undefined
                         (incorrect) results. So don't do it. :)
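
To make the four REGION forms concrete, here is a minimal parsing sketch. It
is not the parser BamTools uses. The ParsedRegion struct and the functions
below are hypothetical, a right position of -1 is an assumed marker for "to the
end of the reference", and reference names containing ':' or '..' are not
handled.

```cpp
#include <cctype>
#include <string>

// Hypothetical illustration type (BamTools' own BamRegion uses reference IDs).
struct ParsedRegion {
    std::string leftRef;
    long long   leftPos  = 0;   // 0 = start of the reference
    std::string rightRef;
    long long   rightPos = -1;  // -1 = assumed marker for "end of rightRef"
};

// Parses a non-negative integer; rejects empty or non-digit input.
bool parsePosition(const std::string& s, long long& pos) {
    if (s.empty() || s.size() > 18) return false;
    for (char c : s)
        if (!std::isdigit(static_cast<unsigned char>(c))) return false;
    pos = std::stoll(s);
    return true;
}

// Parses "ref" or "ref:pos"; pos is left untouched when no position is given.
bool parseEndpoint(const std::string& s, std::string& ref, long long& pos) {
    const std::string::size_type colon = s.find(':');
    ref = s.substr(0, colon);
    if (ref.empty()) return false;
    if (colon == std::string::npos) return true;
    return parsePosition(s.substr(colon + 1), pos);
}

bool parseRegion(const std::string& text, ParsedRegion& out) {
    out = ParsedRegion();
    const std::string::size_type dots = text.find("..");
    if (!parseEndpoint(text.substr(0, dots), out.leftRef, out.leftPos))
        return false;
    out.rightRef = out.leftRef;
    if (dots == std::string::npos) return true;         // chr1, chr1:500
    const std::string right = text.substr(dots + 2);
    if (right.find(':') == std::string::npos)            // chr1:500..1000
        return parsePosition(right, out.rightPos);
    return parseEndpoint(right, out.rightRef, out.rightPos);  // chr1:500..chr3:750
}
```

Note that this sketch cannot check that the right reference comes after the
left one in the file. That check needs the reference order from the BAM header.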

*Note: Most of the tools that accept a REGION string will perform without an
       index file, but typically at great cost to performance (having to
       plow through the entire file until the region of interest is found).
       For optimum speed, be sure that index files are available for your
       BAM files.

Force compression of BAM output.

When tools are piped together (see details below), the default behavior is
to turn off compression. This can greatly increase performance because the
data does not have to be constantly decompressed and recompressed. This option
is ignored any time an output BAM file is specified using "-out".

Many of the tools in BamTools can be chained together by piping. Any tool that
accepts stdin can be piped into, and any that can output to stdout can be piped
from. For example:

    > bamtools filter -in data1.bam -in data2.bam -mapQuality ">50" | bamtools count

will give a count of all alignments in your 2 BAM files with a mapQuality of
greater than 50. And of course, any tool writing to stdout can be piped into
other utilities.

    convert         Converts between BAM and a number of other formats
    count           Prints number of alignments in BAM file(s)
    coverage        Prints coverage statistics from the input BAM file
    filter          Filters BAM file(s) by user-specified criteria
    header          Prints BAM header information
    index           Generates index for BAM file
    merge           Merge multiple BAM files into single file
    random          Select random alignments from existing BAM file(s)
    sort            Sorts the BAM file according to some criteria
    split           Splits a BAM file on user-specified property, creating a
                    new BAM output file for each value found
    stats           Prints some basic statistics from input BAM file(s)

Description: converts BAM to a number of other formats.

Usage: bamtools convert -format <FORMAT> [-in <filename> -in <filename> ...]
                        [-out <filename>] [other options]

  -in <BAM filename>        the input BAM file(s) [stdin]
  -out <BAM filename>       the output BAM file [stdout]
  -format <FORMAT>          the output file format - see below for
                            supported formats

  -region <REGION>          genomic region. Index file is recommended for
                            better performance, and is read
                            automatically if it exists. See 'bamtools
                            help index' for more details on creating
                            one

  -fasta <FASTA filename>   FASTA reference file
  -mapqual                  print the mapping qualities

  -noheader                 omit the SAM header from output

  --help, -h                shows this help text

 - Currently supported output formats ( BAM -> X )

    Format type    FORMAT (command-line argument)
    ------------   -------------------------------

    > bamtools convert -format json -in myData.bam -out myData.json

 - Pileup Options have no effect on formats other than "pileup"
   SAM Options have no effect on formats other than "sam"

Description: prints number of alignments in BAM file(s).

Usage: bamtools count [-in <filename> -in <filename> ...] [-region <REGION>]

  -in <BAM filename>        the input BAM file(s) [stdin]
  -region <REGION>          genomic region. Index file is recommended
                            for better performance, and is used
                            automatically if it exists. See
                            'bamtools help index' for more details

  --help, -h                shows this help text

Description: prints coverage data for a single BAM file.

Usage: bamtools coverage [-in <filename>] [-out <filename>]

  -in <BAM filename>        the input BAM file [stdin]
  -out <filename>           the output file [stdout]

  --help, -h                shows this help text

Description: filters BAM file(s).

Usage: bamtools filter [-in <filename> -in <filename> ...]
                       [-out <filename> | [-forceCompression]]
                       [ [-script <filename>] | [filterOptions] ]

  -in <BAM filename>        the input BAM file(s) [stdin]
  -out <BAM filename>       the output BAM file [stdout]
  -region <REGION>          only read data from this genomic region (see
                            README for more details)
  -script <filename>        the filter script file (see README for more
                            details)
  -forceCompression         if results are sent to stdout (like when
                            piping to another tool), default behavior
                            is to leave output uncompressed. Use this
                            flag to override and force compression

  -alignmentFlag <int>      keep reads with this *exact* alignment flag
                            (for more detailed queries, see below)
  -insertSize <int>         keep reads with insert size that matches
                            pattern
  -mapQuality <[0-255]>     keep reads with map quality that matches
                            pattern
  -name <string>            keep reads with name that matches pattern
  -queryBases <string>      keep reads with motif that matches pattern
  -tag <TAG:VALUE>          keep reads with this key=>value pair

Alignment Flag Filters:
  -isDuplicate <true/false>          keep only alignments that are marked as
                                     duplicate
  -isFailedQC <true/false>           keep only alignments that failed QC [true]
  -isFirstMate <true/false>          keep only alignments marked as first mate
  -isMapped <true/false>             keep only alignments that were mapped [true]
  -isMateMapped <true/false>         keep only alignments with mates that mapped
  -isMateReverseStrand <true/false>  keep only alignments with mate on reverse
                                     strand
  -isPaired <true/false>             keep only alignments that were sequenced as
                                     paired
  -isPrimaryAlignment <true/false>   keep only alignments marked as primary
  -isProperPair <true/false>         keep only alignments that passed paired-end
                                     resolution
  -isReverseStrand <true/false>      keep only alignments on reverse strand
  -isSecondMate <true/false>         keep only alignments marked as second mate

  --help, -h                shows this help text

The BamTools filter tool allows you to use an external filter script to define
complex filtering behavior. This script uses what I'm calling properties,
filters, and a rule - all implemented in a JSON syntax.

A 'property' is a typical JSON entry of the form:

    "propertyName" : "value"

Here are the property names that BamTools will recognize:

For properties with boolean values, use the words "true" or "false".
For example,

    "isMapped" : "true"

will keep only alignments that are flagged as 'mapped'.

For properties with numeric values, use the desired number with optional
comparison operators ( >, >=, <, <=, !). For example,

    "mapQuality" : ">=75"

will keep only alignments with mapQuality greater than or equal to 75.
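
To illustrate how such a numeric property value can be interpreted, here is a
small sketch. It is not the code BamTools uses. The function name
matchesNumeric is hypothetical, and it assumes that a bare number means
"equal to" and that '!' means "not equal to".

```cpp
#include <string>

// Hypothetical helper: evaluates tokens such as ">=75", "<4", "!10" or "42"
// against a value. Assumes a bare number means equality and '!' means
// "not equal". std::stoll throws if no number follows the operator.
bool matchesNumeric(const std::string& token, long long value) {
    std::string op;
    std::string::size_type i = 0;
    if (i < token.size() &&
        (token[i] == '>' || token[i] == '<' || token[i] == '!'))
        op += token[i++];
    if (i < token.size() && token[i] == '=' && (op == ">" || op == "<"))
        op += token[i++];

    const long long target = std::stoll(token.substr(i));
    if (op == ">")  return value >  target;
    if (op == ">=") return value >= target;
    if (op == "<")  return value <  target;
    if (op == "<=") return value <= target;
    if (op == "!")  return value != target;
    return value == target;
}
```

Under these assumptions, matchesNumeric(">=75", 75) is true and
matchesNumeric(">=75", 74) is false.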

If you're familiar with JSON, you know that integers can be bare (without
quotes). However, if you use a comparison operator, be sure to enclose the
value in quotes.

For string-based properties, the above operators are available. In addition,
you can also use some basic pattern-matching operators. For example,

    "reference" : "ALU*"   // reference starts with 'ALU'
    "name"      : "*foo"   // name ends with 'foo'
    "cigar"     : "*D*"    // cigar contains a 'D' anywhere
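
The three pattern forms above behave like simple '*' wildcards. A minimal
sketch of such a matcher follows. The function matchesPattern is hypothetical,
supports only '*' (matching any run of characters, including none), and is not
taken from the BamTools sources.

```cpp
#include <string>

// Hypothetical helper: wildcard match where '*' stands for any run of
// characters (possibly empty). All other characters must match exactly.
bool matchesPattern(const std::string& pattern, const std::string& text) {
    std::string::size_type p = 0, t = 0;
    std::string::size_type starP = std::string::npos, starT = 0;
    while (t < text.size()) {
        if (p < pattern.size() && pattern[p] == '*') {
            starP = p++;   // remember the star, first try it as empty
            starT = t;
        } else if (p < pattern.size() && pattern[p] == text[t]) {
            ++p;
            ++t;
        } else if (starP != std::string::npos) {
            p = starP + 1; // backtrack: let the star absorb one more character
            t = ++starT;
        } else {
            return false;
        }
    }
    while (p < pattern.size() && pattern[p] == '*') ++p;  // trailing stars
    return p == pattern.size();
}
```

For instance, "*D*" matches the cigar string "10M2D30M" but not "42M".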

The reference property refers to the reference name, not the BAM reference
ID.

The tag property has an extra layer, so that the syntax will look like this:

    "tag" : "XX:value"

where XX is the 2-letter SAM/BAM tag and value is, well, the value.
Comparison operators can still apply to values, so tag properties of:

A 'filter' is a JSON container of properties that will be AND-ed together. For
example,

    {
       "reference"  : "chr1",
       "mapQuality" : ">50",
       "tag"        : "NM:<4"
    }

would result in an output BAM file containing only alignments from chr1 with a
mapQuality >50 and edit distance of less than 4.

A single, unnamed filter like this is the minimum necessary for a complete
filter script. Save this file and use as the -script parameter and you should

Moving on to more potent filtering...

You can also define multiple filters.
To do so, you just need to use the "filters" keyword along with JSON array
syntax, like this:

    {
       "filters" :
          [
            {
              "reference"  : "chr1",
              "mapQuality" : ">50"
            },
            {
              "reference"       : "chr1",
              "isReverseStrand" : "true"
            }
          ]
    }

These filters will be (inclusive) OR-ed together by default. So you'd get a
resulting BAM with only alignments from chr1 that had either mapQuality >50 or
were on the reverse strand (or both).
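
The AND-within-a-filter / OR-across-filters behavior can be sketched with
plain predicates. Everything named below (AlignmentInfo, Property, Filter and
the functions) is hypothetical and only mirrors the two-filter script above. It
is not how the BamTools filter engine is actually written.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for the fields used below (not the real BamAlignment).
struct AlignmentInfo {
    std::string reference;
    int         mapQuality;
    bool        isReverseStrand;
};

typedef std::function<bool(const AlignmentInfo&)> Property;
typedef std::vector<Property> Filter;  // properties in one filter are AND-ed

// A filter passes only if every one of its properties passes.
bool passesFilter(const Filter& filter, const AlignmentInfo& al) {
    for (const Property& property : filter)
        if (!property(al)) return false;
    return true;
}

// Anonymous filters are OR-ed: one passing filter is enough.
bool passesAnyFilter(const std::vector<Filter>& filters, const AlignmentInfo& al) {
    for (const Filter& filter : filters)
        if (passesFilter(filter, al)) return true;
    return false;
}

// The two-filter script from the text, written out by hand.
std::vector<Filter> exampleFilters() {
    const Property onChr1    = [](const AlignmentInfo& al) { return al.reference == "chr1"; };
    const Property highMapQ  = [](const AlignmentInfo& al) { return al.mapQuality > 50; };
    const Property isReverse = [](const AlignmentInfo& al) { return al.isReverseStrand; };
    return { Filter{onChr1, highMapQ}, Filter{onChr1, isReverse} };
}
```

A chr1 alignment with mapQuality 10 on the forward strand fails both filters
and is dropped, while the same alignment on the reverse strand is kept.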

As an alternative to anonymous OR-ed filters, you can also provide what I've
called a "rule". By giving each filter an "id" and using the "rule" keyword,
you can describe boolean relationships between your filter sets.

Available rule operators:

    &   AND
    |   OR
    !   NOT

This might sound a little fuzzy at this point, so let's get back to an example:

    {
       "filters" :
          [
            {
              "id"         : "filter1",
              "reference"  : "chr1",
              "mapQuality" : ">50"
            },
            {
              "id"              : "filter2",
              "reference"       : "chr1",
              "isReverseStrand" : "true"
            },
            {
              "id"         : "filter3",
              "reference"  : "chr1",
              "queryBases" : "AGCT*"
            }
          ],

       "rule" : " (filter1 | filter2) & !filter3 "
    }

In this case, we would only retain alignments that passed filter 1 OR filter 2,
AND also NOT filter 3.
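
One way to picture how a rule string like the one above is applied: evaluate
each filter to true/false for an alignment, then evaluate the boolean
expression over those results. The sketch below is a small recursive-descent
evaluator written for illustration only. The class RuleEvaluator and the
function evaluateRule are hypothetical, and the operator precedence ('!' binds
tightest, then '&', then '|') is an assumption.

```cpp
#include <cctype>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical evaluator for rules such as "(filter1 | filter2) & !filter3".
// 'results' maps each filter id to whether the alignment passed that filter.
class RuleEvaluator {
public:
    RuleEvaluator(const std::string& rule, const std::map<std::string, bool>& results)
        : m_rule(rule), m_results(results), m_pos(0) {}

    bool evaluate() {
        const bool value = parseOr();
        skipSpaces();
        if (m_pos != m_rule.size()) throw std::runtime_error("unexpected input in rule");
        return value;
    }

private:
    void skipSpaces() {
        while (m_pos < m_rule.size() &&
               std::isspace(static_cast<unsigned char>(m_rule[m_pos]))) ++m_pos;
    }
    bool consume(char c) {
        skipSpaces();
        if (m_pos < m_rule.size() && m_rule[m_pos] == c) { ++m_pos; return true; }
        return false;
    }
    bool parseOr() {                       // lowest precedence: a | b
        bool value = parseAnd();
        while (consume('|')) { const bool rhs = parseAnd(); value = value || rhs; }
        return value;
    }
    bool parseAnd() {                      // a & b
        bool value = parseNot();
        while (consume('&')) { const bool rhs = parseNot(); value = value && rhs; }
        return value;
    }
    bool parseNot() {                      // !a, (a), or a filter id
        if (consume('!')) return !parseNot();
        if (consume('(')) {
            const bool value = parseOr();
            if (!consume(')')) throw std::runtime_error("missing ')' in rule");
            return value;
        }
        return parseId();
    }
    bool parseId() {
        skipSpaces();
        const std::string::size_type start = m_pos;
        while (m_pos < m_rule.size() &&
               (std::isalnum(static_cast<unsigned char>(m_rule[m_pos])) || m_rule[m_pos] == '_'))
            ++m_pos;
        const std::map<std::string, bool>::const_iterator it =
            m_results.find(m_rule.substr(start, m_pos - start));
        if (it == m_results.end()) throw std::runtime_error("unknown filter id in rule");
        return it->second;
    }

    std::string m_rule;
    std::map<std::string, bool> m_results;
    std::string::size_type m_pos;
};

bool evaluateRule(const std::string& rule, const std::map<std::string, bool>& results) {
    return RuleEvaluator(rule, results).evaluate();
}
```

Both sides of '&' and '|' are always parsed, so a malformed right-hand side is
reported even when the left-hand side already decides the result.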

These are dummy examples, and don't make much sense as an actual query case. But
hopefully this serves as an adequate primer to get you started and discover the
potential flexibility here.

Description: prints header from BAM file(s).

Usage: bamtools header [-in <filename> -in <filename> ...]

  -in <BAM filename>        the input BAM file(s) [stdin]

  --help, -h                shows this help text

Description: creates index for BAM file.

Usage: bamtools index [-in <filename>] [-bti]

  -in <BAM filename>        the input BAM file [stdin]
  -bti                      create (non-standard) BamTools index file
                            (*.bti). Default behavior is to create
                            standard BAM index (*.bai)

  --help, -h                shows this help text

Description: merges multiple BAM files into one.

Usage: bamtools merge [-in <filename> -in <filename> ...]
                      [-out <filename> | [-forceCompression]] [-region <REGION>]

  -in <BAM filename>        the input BAM file(s)
  -out <BAM filename>       the output BAM file
  -forceCompression         if results are sent to stdout (like when
                            piping to another tool), default behavior
                            is to leave output uncompressed. Use this
                            flag to override and force compression
  -region <REGION>          genomic region. See README for more details

  --help, -h                shows this help text

Description: grab a random subset of alignments.

Usage: bamtools random [-in <filename> -in <filename> ...]
                       [-out <filename>] [-forceCompression] [-n]

  -in <BAM filename>        the input BAM file [stdin]
  -out <BAM filename>       the output BAM file [stdout]
  -forceCompression         if results are sent to stdout (like when
                            piping to another tool), default behavior
                            is to leave output uncompressed. Use this
                            flag to override and force compression
  -region <REGION>          only pull random alignments from within this
                            genomic region. Index file is
                            recommended for better performance, and
                            is used automatically if it exists. See
                            'bamtools help index' for more details

  -n <count>                number of alignments to grab. Note that no
                            duplicate checking is performed [10000]

  --help, -h                shows this help text

Description: sorts a BAM file.

Usage: bamtools sort [-in <filename>] [-out <filename>] [sortOptions]

  -in <BAM filename>        the input BAM file [stdin]
  -out <BAM filename>       the output BAM file [stdout]

  -byname                   sort by alignment name

  -n <count>                max number of alignments per tempfile
  -mem <Mb>                 max memory to use [1024]

  --help, -h                shows this help text

Description: splits a BAM file on user-specified property, creating a new BAM
             output file for each value found.

Usage: bamtools split [-in <filename>] [-stub <filename stub>]
                      < -mapped | -paired | -reference | -tag <TAG> >

  -in <BAM filename>        the input BAM file [stdin]
  -stub <filename stub>     prefix stub for output BAM files (default
                            behavior is to use input filename,
                            without .bam extension, as stub). If
                            input is stdin and no stub provided, a
                            timestamp is generated as the stub.

  -mapped                   split mapped/unmapped alignments
  -paired                   split single-end/paired-end alignments
  -reference                split alignments by reference
  -tag <tag name>           splits alignments based on all values of TAG
                            encountered (i.e. -tag RG creates a BAM
                            file for each read group in original
                            BAM file)

  --help, -h                shows this help text

Description: prints general alignment statistics.

Usage: bamtools stats [-in <filename> -in <filename> ...] [statsOptions]

  -in <BAM filename>        the input BAM file [stdin]

  -insert                   summarize insert size data

  --help, -h                shows this help text

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

Both the BamTools API and toolkit are released under the MIT License.
Copyright (c) 2009-2010 Derek Barnett, Erik Garrison, Gabor Marth,

See included file LICENSE for details.

--------------------------------------------------------------------------------
IV. Acknowledgements :
--------------------------------------------------------------------------------

 * Aaron Quinlan for several key feature ideas and bug fix contributions
 * Baptiste Lepilleur for the public-domain JSON parser (JsonCPP)
 * Heng Li, author of SAMtools - the original C-language BAM API/toolkit.

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

Feel free to contact me with any questions, comments, suggestions, bug reports,

    Biology Dept., Boston College

    Email: barnetde@bc.edu
    Project Websites: http://github.com/pezmaster31/bamtools   (ACTIVE SUPPORT)
                      http://sourceforge.net/projects/bamtools (major updates only)