1 --------------------------------------------------------------------------------
3 --------------------------------------------------------------------------------
5 BamTools: a C++ API & toolkit for reading/writing/manipulating BAM files.
23 --------------------------------------------------------------------------------
25 --------------------------------------------------------------------------------
BamTools provides both a programmer's API and an end-user's toolkit for handling
BAM files.
30 ----------------------------------------
32 ----------------------------------------
34 The API consists of 2 main modules: BamReader and BamWriter. As you would
35 expect, BamReader provides read-access to BAM files, while BamWriter handles
36 writing data to BAM files. BamReader provides the interface for random-access
37 (jumping) in a BAM file, as well as generating BAM index files.
39 BamMultiReader is an extra module that allows you to manage multiple open BAM
40 files for reading. It provides some validation & bookkeeping under the hood to
41 keep all files sync'ed up for you.
43 Additional files used by the API:
45 - BamAlignment.* : implements the BamAlignment data structure
- BamAux.h : contains various constants, data structures and utility
  methods used throughout the API.
50 - BamIndex.* : implements both the standard BAM format index (".bai") as
51 well as a new BamTools-specific index (".bti").
- BGZF.* : contains our implementation of the Broad Institute's BGZF
  compression format.
56 ----------------------------------------
58 ----------------------------------------
If you've been using BamTools since the early days, you'll notice that our
61 'toy' API examples (BamConversion, BamDump, BamTrim,...) are now gone. We have
62 dumped these in favor of a suite of small utilities that we hope both
63 developers and end-users find useful:
    usage: bamtools [--help] COMMAND [ARGS]

    Available bamtools commands:

        convert   Converts between BAM and a number of other formats
        count     Prints number of alignments in BAM file(s)
        coverage  Prints coverage statistics from the input BAM file
        filter    Filters BAM file(s) by user-specified criteria
        header    Prints BAM header information
        index     Generates index for BAM file
        merge     Merge multiple BAM files into single file
        random    Select random alignments from existing BAM file(s)
        sort      Sorts the BAM file according to some criteria
        split     Splits a BAM file on user-specified property, creating a
                  new BAM output file for each value found
        stats     Prints some basic statistics from input BAM file(s)

    See 'bamtools help COMMAND' for more information on a specific command.
84 --------------------------------------------------------------------------------
86 --------------------------------------------------------------------------------
88 ----------------------------------------
90 ----------------------------------------
92 BamTools has been migrated to a CMake-based build system. We believe that this
93 should simplify the build process across all platforms, especially as the
94 BamTools API moves into a shared library (that you link to instead of compiling
95 lots of source files directly into your application). CMake is available on all
96 major platforms, and indeed comes *out-of-the-box* with many Linux distributions.
98 To see if you have CMake (and which version), try this command:
102 BamTools requires CMake version >= 2.6.4. If you are missing CMake or have an
103 older version, check your OS package manager (for Linux) or download it here:
104 http://www.cmake.org/cmake/resources/software.html .
106 ----------------------------------------
108 ----------------------------------------
OK, now that you have CMake ready to go, let's build BamTools. A good
practice in building applications is to do an out-of-source build, meaning
that we're going to set up an isolated place to hold all the intermediate
build files.
115 In the top-level directory of BamTools, type the following commands:
This creates a Visual Studio solution file, which can then be built to create
the toolkit executable and API DLLs.
126 After running cmake, just run:
130 Then go back up to the BamTools root directory.
134 ----------------------------------------
136 ----------------------------------------
138 Assuming the build process finished correctly, you should be able to find the
139 toolkit executable here:
143 The BamTools-associated libraries will be found here:
147 The BamTools API headers will be found here:
151 --------------------------------------------------------------------------------
153 --------------------------------------------------------------------------------
155 ** General usage information - perhaps explain common terms, point to SAM/BAM
158 ----------------------------------------
160 ----------------------------------------
162 The API, as noted above, contains 2 main modules - BamReader & BamWriter - for
163 dealing with BAM files. Alignment data is made available through the
164 BamAlignment data structure.
A simple (read-only) scenario for accessing BAM data would look like the
following:

    // open our BamReader
    BamReader reader;
    reader.Open("someData.bam", "someData.bam.bai");

    // define our region of interest
    // in this example: bases 0-500 on the reference "chrX"
    int id = reader.GetReferenceID("chrX");
    BamRegion region(id, 0, id, 500);
    reader.SetRegion(region);

    // iterate through alignments in this region,
    // ignoring alignments with a MQ below some cutoff
    BamAlignment al;
    while ( reader.GetNextAlignment(al) ) {
        if ( al.MapQuality >= 50 ) {
            // do something with 'al'
        }
    }

    // close the reader
    reader.Close();
190 To use this API in your application, you simply need to do the following:
192 1 - Build the BamTools library (see Installation steps above).
194 2 - Import BamTools API functionality as needed, for example:
    #include "api/BamReader.h"
    #include "api/BamWriter.h"
    using namespace BamTools; // all BamTools classes/methods live within
                              // this namespace
201 3 - In your own build step, point your include path to the
202 (BAMTOOLS_ROOT)/include directory. Link your app with '-lbamtools' ('l'
205 * You may need to modify the -L flag (library path) as well to help your linker
206 find the (BAMTOOLS_ROOT)/lib directory.
208 * Depending on your platform and where you install the BamTools API library, you
209 may also need to adjust how your app locates the shared library at runtime. For
210 Windows users, this can be as simple as dropping the DLL in the same folder as
your executable. For *nix users (using gcc at least), you can add the following
to your app's linker flags (e.g. LDFLAGS, or CXXFLAGS if it is also passed when
linking):
214 -Wl,-rpath,$(BAMTOOLS_LIB_DIR)
216 where BAMTOOLS_LIB_DIR is, as you would guess, the directory containing the libs.
217 An alternative is to set your local LD_LIBRARY_PATH environment variable.
219 Another alternative is to use the newly provided static library libbamtools.a and
220 resolve this issue at compile/link time, instead of runtime.
222 See any included programs for more detailed usage examples. See comments in the
223 header files for more detailed API documentation.
225 Note - For users that don't want to bother with the new BamTools shared library
226 scheme: you are certainly free to just compile the API source code directly into
227 your application, but be aware that the files involved are subject to change.
228 Meaning that filenames, number of files, etc. are not fixed. You will also need
229 to be sure to link with '-lz' for ZLIB functionality (linking with '-lbamtools'
230 gives you this automatically).
232 ----------------------------------------
234 ----------------------------------------
236 BamTools provides a small, but powerful suite of command-line utility programs
237 for manipulating and querying BAM files for data.
243 All BamTools utilities handle I/O operations using a common set of arguments.
The input BAM file(s).
250 If a tool accepts multiple BAM files as input, each file gets its own "-in"
251 option on the command line. If no "-in" is provided, the tool will attempt
252 to read BAM data from stdin.
254 To read a single BAM file, use a single "-in" option:
255 > bamtools *tool* -in myData1.bam ...ARGS...
257 To read multiple BAM files, use multiple "-in" options:
258 > bamtools *tool* -in myData1.bam -in myData2.bam ...ARGS...
260 To read from stdin (if supported), omit the "-in" option:
261 > bamtools *tool* ...ARGS...
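
The "-in" convention above can be sketched in a few lines of C++. This is only
an illustration of the idea, not BamTools' actual option parser: the function
name collectInputs is hypothetical, and an empty result is taken to mean "read
from stdin".

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of BamTools): gather every filename that
// follows an "-in" option. An empty result means "read from stdin".
std::vector<std::string> collectInputs(const std::vector<std::string>& args) {
    std::vector<std::string> inputs;
    for (std::size_t i = 0; i + 1 < args.size(); ++i) {
        if (args[i] == "-in") {
            inputs.push_back(args[i + 1]);
            ++i;  // skip the filename that was just consumed
        }
    }
    return inputs;
}
```

Calling collectInputs with the arguments "-in myData1.bam -in myData2.bam"
yields both filenames in order; calling it with no "-in" at all yields an empty
list, which a tool would treat as the stdin case.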
267 If a tool outputs a result BAM file, specify the filename using this option.
268 If none is provided, the tool will typically write to stdout.
270 *Note: Not all tools output BAM data (e.g. count, header, etc.)
274 A region of interest. See below for accepted 'REGION string' formats.
276 Many of the tools accept this option, which allows a user to only consider
277 alignments that overlap this region (whether counting, filtering, merging,
An alignment is considered to overlap a region if any part of the alignment
intersects the left/right boundaries. Thus, a 50bp alignment at position 70
will overlap a region beginning at position 100.
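
The overlap rule can be written as a simple interval test. The sketch below
assumes half-open intervals on a single reference and takes the alignment
length as given (in practice the reference span depends on the CIGAR). The name
overlapsRegion is hypothetical and is not a BamTools function.

```cpp
#include <cassert>

// Hypothetical helper: both the alignment and the region are treated as
// half-open intervals [start, end) on the same reference.
bool overlapsRegion(int alignStart, int alignLength,
                    int regionStart, int regionEnd) {
    const int alignEnd = alignStart + alignLength;  // one past the last base
    return alignStart < regionEnd && alignEnd > regionStart;
}
```

With these assumptions, overlapsRegion(70, 50, 100, 200) is true (the alignment
covers positions 70 through 119), while a 30bp alignment at position 70 stops
just short of the region and does not overlap.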
285 ----------------------
A proper REGION string can be formatted like any of the following examples,
where 'chr1' is the name of a reference (not its ID) and each number is any
valid integer position within that reference.
291 chr1 - only alignments on (entire) reference 'chr1'
292 chr1:500 - only alignments overlapping the region starting at
293 chr1:500 and continuing to the end of chr1
294 chr1:500..1000 - only alignments overlapping the region starting at
295 chr1:500 and continuing to chr1:1000
296 chr1:500..chr3:750 - only alignments overlapping the region starting at
297 chr1:500 and continuing to chr3:750. This 'spanning'
298 region assumes that the reference specified as the
299 right boundary will occur somewhere in the file after
300 the left boundary. On a sorted BAM, a REGION of
301 'chr4:500..chr2:1500' will produce undefined
302 (incorrect) results. So don't do it. :)
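
The four REGION forms listed above can be parsed with a few string searches.
This is a minimal sketch under stated assumptions, not the parser BamTools
uses: ParsedRegion and parseRegion are hypothetical names, -1 marks a position
that was not given, reference names are assumed not to contain ':' or "..",
and malformed numbers are not handled (std::stol would throw).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result type; -1 means "position not specified".
struct ParsedRegion {
    std::string leftRef;
    long leftPos = -1;
    std::string rightRef;
    long rightPos = -1;
};

// Parses "chr1", "chr1:500", "chr1:500..1000" or "chr1:500..chr3:750".
ParsedRegion parseRegion(const std::string& text) {
    ParsedRegion r;
    const std::size_t dots = text.find("..");
    const std::string left = text.substr(0, dots);  // whole string if no ".."
    const std::size_t colon = left.find(':');
    r.leftRef = left.substr(0, colon);
    if (colon != std::string::npos)
        r.leftPos = std::stol(left.substr(colon + 1));

    r.rightRef = r.leftRef;  // default: region ends on the same reference
    if (dots != std::string::npos) {
        const std::string right = text.substr(dots + 2);
        const std::size_t rcolon = right.find(':');
        if (rcolon == std::string::npos) {
            r.rightPos = std::stol(right);          // "chr1:500..1000"
        } else {
            r.rightRef = right.substr(0, rcolon);   // "chr1:500..chr3:750"
            r.rightPos = std::stol(right.substr(rcolon + 1));
        }
    }
    return r;
}
```

For "chr1:500..chr3:750" this gives leftRef "chr1", leftPos 500, rightRef
"chr3" and rightPos 750; for plain "chr1" both positions stay at -1, meaning
the entire reference.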
*Note: Most of the tools that accept a REGION string will perform without an
       index file, but typically at great cost to performance (having to
       plow through the entire file until the region of interest is found).
       For optimum speed, be sure that index files are available for your
       BAM files.
312 Force compression of BAM output.
314 When tools are piped together (see details below), the default behavior is
315 to turn off compression. This can greatly increase performance when the data
316 does not have to be constantly decompressed and recompressed. This is
317 ignored any time an output BAM file is specified using "-out".
Many of the tools in BamTools can be chained together by piping. Any tool that
accepts stdin can be piped into, and any that can output stdout can be piped
from. For example:
327 > bamtools filter -in data1.bam -in data2.bam -mapQuality ">50" | bamtools count
will give a count of all alignments in your 2 BAM files with a mapQuality of
greater than 50. And of course, any tool writing to stdout can be piped into
other utilities.
    convert   Converts between BAM and a number of other formats
    count     Prints number of alignments in BAM file(s)
    coverage  Prints coverage statistics from the input BAM file
    filter    Filters BAM file(s) by user-specified criteria
    header    Prints BAM header information
    index     Generates index for BAM file
    merge     Merge multiple BAM files into single file
    random    Select random alignments from existing BAM file(s)
    sort      Sorts the BAM file according to some criteria
    split     Splits a BAM file on user-specified property, creating a new
              BAM output file for each value found
    stats     Prints some basic statistics from input BAM file(s)
354 Description: converts BAM to a number of other formats
356 Usage: bamtools convert -format <FORMAT> [-in <filename> -in <filename> ...]
357 [-out <filename>] [other options]
360 -in <BAM filename> the input BAM file(s) [stdin]
361 -out <BAM filename> the output BAM file [stdout]
362 -format <FORMAT> the output file format - see below for
366 -region <REGION> genomic region. Index file is recommended for
367 better performance, and is read
368 automatically if it exists. See 'bamtools
369 help index' for more details on creating
373 -fasta <FASTA filename> FASTA reference file
374 -mapqual print the mapping qualities
377 -noheader omit the SAM header from output
380 --help, -h shows this help text
384 - Currently supported output formats ( BAM -> X )
386 Format type FORMAT (command-line argument)
387 ------------ -------------------------------
397 > bamtools convert -format json -in myData.bam -out myData.json
399 - Pileup Options have no effect on formats other than "pileup"
400 SAM Options have no effect on formats other than "sam"
406 Description: prints number of alignments in BAM file(s).
408 Usage: bamtools count [-in <filename> -in <filename> ...] [-region <REGION>]
411 -in <BAM filename> the input BAM file(s) [stdin]
412 -region <REGION> genomic region. Index file is recommended
413 for better performance, and is used
414 automatically if it exists. See
415 'bamtools help index' for more details
419 --help, -h shows this help text
425 Description: prints coverage data for a single BAM file.
427 Usage: bamtools coverage [-in <filename>] [-out <filename>]
430 -in <BAM filename> the input BAM file [stdin]
431 -out <filename> the output file [stdout]
434 --help, -h shows this help text
440 Description: filters BAM file(s).
442 Usage: bamtools filter [-in <filename> -in <filename> ...]
443 [-out <filename> | [-forceCompression]]
                       [ [-script <filename>] | [filterOptions] ]
448 -in <BAM filename> the input BAM file(s) [stdin]
449 -out <BAM filename> the output BAM file [stdout]
450 -region <REGION> only read data from this genomic region (see
451 README for more details)
452 -script <filename> the filter script file (see README for more
454 -forceCompression if results are sent to stdout (like when
455 piping to another tool), default behavior
456 is to leave output uncompressed. Use this
457 flag to override and force compression
460 -alignmentFlag <int> keep reads with this *exact* alignment flag
461 (for more detailed queries, see below)
462 -insertSize <int> keep reads with insert size that matches
464 -mapQuality <[0-255]> keep reads with map quality that matches
466 -name <string> keep reads with name that matches pattern
467 -queryBases <string> keep reads with motif that matches pattern
468 -tag <TAG:VALUE> keep reads with this key=>value pair
470 Alignment Flag Filters:
471 -isDuplicate <true/false> keep only alignments that are marked as
473 -isFailedQC <true/false> keep only alignments that failed QC [true]
474 -isFirstMate <true/false> keep only alignments marked as first mate
476 -isMapped <true/false> keep only alignments that were mapped [true]
477 -isMateMapped <true/false> keep only alignments with mates that mapped
479 -isMateReverseStrand <true/false> keep only alignments with mate on reverse
481 -isPaired <true/false> keep only alignments that were sequenced as
483 -isPrimaryAlignment <true/false> keep only alignments marked as primary
485 -isProperPair <true/false> keep only alignments that passed paired-end
487 -isReverseStrand <true/false> keep only alignments on reverse strand
489 -isSecondMate <true/false> keep only alignments marked as second mate
493 --help, -h shows this help text
499 The BamTools filter tool allows you to use an external filter script to define
500 complex filtering behavior. This script uses what I'm calling properties,
501 filters, and a rule - all implemented in a JSON syntax.
505 A 'property' is a typical JSON entry of the form:
507 "propertyName" : "value"
509 Here are the property names that BamTools will recognize:
534 For properties with boolean values, use the words "true" or "false".
539 will keep only alignments that are flagged as 'mapped'.
541 For properties with numeric values, use the desired number with optional
542 comparison operators ( >, >=, <, <=, !). For example,
544 "mapQuality" : ">=75"
546 will keep only alignments with mapQuality greater than or equal to 75.
If you're familiar with JSON, you know that integers can be bare (without
quotes). However, if you use a comparison operator, be sure to enclose the
value in quotes.
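
To make the operator syntax concrete, here is a rough sketch of how a value
such as ">=75" could be checked against a number. It is an illustration only
and not BamTools' filter engine: matchesNumeric is a hypothetical name, a value
with no operator is assumed to mean an exact match, and "!" is assumed to mean
"not equal".

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: test `value` against a filter string such as ">=75".
bool matchesNumeric(int value, const std::string& spec) {
    // Split the leading operator characters from the number that follows.
    std::size_t pos = 0;
    std::string op;
    while (pos < spec.size() && (spec[pos] == '>' || spec[pos] == '<' ||
                                 spec[pos] == '=' || spec[pos] == '!')) {
        op += spec[pos++];
    }
    const int target = std::stoi(spec.substr(pos));  // throws if not a number

    if (op == ">")  return value >  target;
    if (op == ">=") return value >= target;
    if (op == "<")  return value <  target;
    if (op == "<=") return value <= target;
    if (op == "!")  return value != target;
    return value == target;  // no (or unrecognized) operator: exact match
}
```

Under these assumptions, a mapQuality of 80 passes ">=75", while a mapQuality
of exactly 50 does not pass ">50".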
551 For string-based properties, the above operators are available. In addition,
552 you can also use some basic pattern-matching operators. For example,
554 "reference" : "ALU*" // reference starts with 'ALU'
555 "name" : "*foo" // name ends with 'foo'
556 "cigar" : "*D*" // cigar contains a 'D' anywhere
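
The three pattern forms shown above (prefix, suffix, and "contains") can be
handled by looking only at a leading and a trailing '*'. The sketch below makes
that assumption explicitly: it does not support a '*' in the middle of a
pattern, matchesPattern is a hypothetical name, and the real BamTools matcher
may behave differently.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: supports "abc*", "*abc", "*abc*" and plain "abc".
bool matchesPattern(const std::string& text, const std::string& pattern) {
    if (pattern.empty()) return text.empty();

    const bool openStart = pattern.front() == '*';
    const bool openEnd = pattern.size() > 1 && pattern.back() == '*';
    const std::size_t begin = openStart ? 1 : 0;
    // The literal part of the pattern, with the wildcards stripped off.
    const std::string core =
        pattern.substr(begin, pattern.size() - begin - (openEnd ? 1 : 0));

    if (openStart && openEnd)                       // "*D*": contains
        return text.find(core) != std::string::npos;
    if (openStart)                                  // "*foo": ends with
        return text.size() >= core.size() &&
               text.compare(text.size() - core.size(), core.size(), core) == 0;
    if (openEnd)                                    // "ALU*": starts with
        return text.compare(0, core.size(), core) == 0;
    return text == core;                            // no wildcard: exact match
}
```

For instance, a cigar string of "10M2D38M" matches "*D*", whereas "50M" does
not.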
The reference property refers to the reference name, not the BAM reference
numeric ID.
562 The tag property has an extra layer, so that the syntax will look like this:
566 where XX is the 2-letter SAM/BAM tag and value is, well, the value.
567 Comparison operators can still apply to values, so tag properties of:
576 A 'filter' is a JSON container of properties that will be AND-ed together. For
580 "reference" : "chr1",
581 "mapQuality" : ">50",
585 would result in an output BAM file containing only alignments from chr1 with a
586 mapQuality >50 and edit distance of less than 4.
A single, unnamed filter like this is the minimum necessary for a complete
filter script. Save this file and use it as the -script parameter and you should
be set.
592 Moving on to more potent filtering...
594 You can also define multiple filters.
595 To do so, you just need to use the "filters" keyword along with JSON array
602 "reference" : "chr1",
606 "reference" : "chr1",
607 "isReverseStrand" : "true"
612 These filters will be (inclusive) OR-ed together by default. So you'd get a
613 resulting BAM with only alignments from chr1 that had either mapQuality >50 or
614 on the reverse strand (or both).
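
The AND-within-a-filter, OR-across-filters behavior described above can be
modeled with plain predicates. This is a sketch of the semantics only: the Read
struct is a hypothetical stand-in for the few alignment fields used here (it is
not BamAlignment), and none of these function names come from BamTools.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for the alignment fields used in the example.
struct Read {
    std::string reference;
    int mapQuality;
    bool isReverseStrand;
};

using Property = std::function<bool(const Read&)>;
using Filter = std::vector<Property>;  // all properties must hold (AND)

bool passesFilter(const Read& r, const Filter& f) {
    for (const Property& p : f)
        if (!p(r)) return false;
    return true;
}

// Multiple filters are inclusive-OR-ed: passing any one is enough.
bool passesAny(const Read& r, const std::vector<Filter>& filters) {
    for (const Filter& f : filters)
        if (passesFilter(r, f)) return true;
    return false;
}

// The two filters from the example: chr1 with mapQuality > 50,
// or chr1 on the reverse strand.
std::vector<Filter> exampleFilters() {
    Property onChr1  = [](const Read& r) { return r.reference == "chr1"; };
    Property highMQ  = [](const Read& r) { return r.mapQuality > 50; };
    Property reverse = [](const Read& r) { return r.isReverseStrand; };
    return { {onChr1, highMQ}, {onChr1, reverse} };
}
```

A chr1 read with mapQuality 10 on the reverse strand passes (second filter),
while a chr2 read fails both filters regardless of its other fields.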
As an alternative to anonymous OR-ed filters, you can also provide what I've
called a "rule". By giving each filter an "id" and using the "rule" keyword, you
can describe boolean relationships between your filter sets.
622 Available rule operators:
628 This might sound a little fuzzy at this point, so let's get back to an example:
635 "reference" : "chr1",
640 "reference" : "chr1",
641 "isReverseStrand" : "true"
645 "reference" : "chr1",
646 "queryBases" : "AGCT*"
650 "rule" : " (filter1 | filter2) & !filter3 "
In this case, we would only retain alignments that passed filter 1 OR filter 2,
AND also NOT filter 3.
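
A rule string like the one above can be evaluated with a small recursive-descent
parser once each filter's pass/fail result is known. The sketch below is not
the BamTools implementation: RuleEvaluator and evaluateRule are hypothetical
names, only the operators seen in the example ('|', '&', '!' and parentheses)
are handled, and giving '&' higher precedence than '|' is an assumption.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical evaluator for rules such as "(filter1 | filter2) & !filter3".
class RuleEvaluator {
public:
    RuleEvaluator(const std::string& rule,
                  const std::map<std::string, bool>& results)
        : rule_(rule), results_(results), pos_(0) {}

    bool evaluate() {
        const bool value = parseOr();
        skipSpaces();
        if (pos_ != rule_.size())
            throw std::runtime_error("unexpected trailing text in rule");
        return value;
    }

private:
    void skipSpaces() {
        while (pos_ < rule_.size() &&
               std::isspace(static_cast<unsigned char>(rule_[pos_]))) ++pos_;
    }
    bool accept(char c) {  // consume `c` if it is the next non-space character
        skipSpaces();
        if (pos_ < rule_.size() && rule_[pos_] == c) { ++pos_; return true; }
        return false;
    }
    bool parseOr() {       // or := and ('|' and)*
        bool value = parseAnd();
        while (accept('|')) { const bool rhs = parseAnd(); value = value || rhs; }
        return value;
    }
    bool parseAnd() {      // and := unary ('&' unary)*
        bool value = parseUnary();
        while (accept('&')) { const bool rhs = parseUnary(); value = value && rhs; }
        return value;
    }
    bool parseUnary() {    // unary := '!' unary | '(' or ')' | filter id
        if (accept('!')) return !parseUnary();
        if (accept('(')) {
            const bool value = parseOr();
            if (!accept(')')) throw std::runtime_error("missing ')' in rule");
            return value;
        }
        return parseId();
    }
    bool parseId() {       // look up the stored result for a filter id
        skipSpaces();
        const std::size_t start = pos_;
        while (pos_ < rule_.size() &&
               (std::isalnum(static_cast<unsigned char>(rule_[pos_])) ||
                rule_[pos_] == '_')) ++pos_;
        const std::string id = rule_.substr(start, pos_ - start);
        const auto it = results_.find(id);
        if (it == results_.end())
            throw std::runtime_error("unknown filter id: " + id);
        return it->second;
    }

    std::string rule_;
    std::map<std::string, bool> results_;
    std::size_t pos_;
};

bool evaluateRule(const std::string& rule,
                  const std::map<std::string, bool>& results) {
    return RuleEvaluator(rule, results).evaluate();
}
```

Given results of filter1 = true, filter2 = false and filter3 = false, the rule
" (filter1 | filter2) & !filter3 " evaluates to true; flipping filter3 to true
makes it false. The right-hand side of each operator is always parsed before
the values are combined, so the parser position stays correct even when the
left-hand side already decides the outcome.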
These are dummy examples, and don't make much sense as actual queries. But
hopefully this serves as an adequate primer to get you started and to help you
discover the potential flexibility here.
664 Description: prints header from BAM file(s).
666 Usage: bamtools header [-in <filename> -in <filename> ...]
669 -in <BAM filename> the input BAM file(s) [stdin]
672 --help, -h shows this help text
678 Description: creates index for BAM file.
680 Usage: bamtools index [-in <filename>] [-bti]
683 -in <BAM filename> the input BAM file [stdin]
684 -bti create (non-standard) BamTools index file
685 (*.bti). Default behavior is to create
686 standard BAM index (*.bai)
--help, -h shows this help text
695 Description: merges multiple BAM files into one.
697 Usage: bamtools merge [-in <filename> -in <filename> ...]
698 [-out <filename> | [-forceCompression]] [-region <REGION>]
701 -in <BAM filename> the input BAM file(s)
702 -out <BAM filename> the output BAM file
703 -forceCompression if results are sent to stdout (like when
704 piping to another tool), default behavior
705 is to leave output uncompressed. Use this
706 flag to override and force compression
707 -region <REGION> genomic region. See README for more details
710 --help, -h shows this help text
716 Description: grab a random subset of alignments.
718 Usage: bamtools random [-in <filename> -in <filename> ...]
719 [-out <filename>] [-forceCompression] [-n]
723 -in <BAM filename> the input BAM file [stdin]
724 -out <BAM filename> the output BAM file [stdout]
725 -forceCompression if results are sent to stdout (like when
726 piping to another tool), default behavior
727 is to leave output uncompressed. Use this
728 flag to override and force compression
729 -region <REGION> only pull random alignments from within this
730 genomic region. Index file is
731 recommended for better performance, and
732 is used automatically if it exists. See
733 'bamtools help index' for more details
737 -n <count> number of alignments to grab. Note that no
738 duplicate checking is performed [10000]
741 --help, -h shows this help text
747 Description: sorts a BAM file.
749 Usage: bamtools sort [-in <filename>] [-out <filename>] [sortOptions]
752 -in <BAM filename> the input BAM file [stdin]
753 -out <BAM filename> the output BAM file [stdout]
756 -byname sort by alignment name
759 -n <count> max number of alignments per tempfile
761 -mem <Mb> max memory to use [1024]
764 --help, -h shows this help text
770 Description: splits a BAM file on user-specified property, creating a new BAM
771 output file for each value found.
773 Usage: bamtools split [-in <filename>] [-stub <filename stub>]
774 < -mapped | -paired | -reference | -tag <TAG> >
777 -in <BAM filename> the input BAM file [stdin]
778 -stub <filename stub> prefix stub for output BAM files (default
779 behavior is to use input filename,
780 without .bam extension, as stub). If
781 input is stdin and no stub provided, a
782 timestamp is generated as the stub.
785 -mapped split mapped/unmapped alignments
786 -paired split single-end/paired-end alignments
787 -reference split alignments by reference
788 -tag <tag name> splits alignments based on all values of TAG
789 encountered (i.e. -tag RG creates a BAM
790 file for each read group in original
794 --help, -h shows this help text
800 Description: prints general alignment statistics.
802 Usage: bamtools stats [-in <filename> -in <filename> ...] [statsOptions]
805 -in <BAM filename> the input BAM file [stdin]
808 -insert summarize insert size data
811 --help, -h shows this help text
813 --------------------------------------------------------------------------------
815 --------------------------------------------------------------------------------
817 Both the BamTools API and toolkit are released under the MIT License.
818 Copyright (c) 2009-2010 Derek Barnett, Erik Garrison, Gabor Marth,
821 See included file LICENSE for details.
823 --------------------------------------------------------------------------------
824 V. Acknowledgements :
825 --------------------------------------------------------------------------------
827 * Aaron Quinlan for several key feature ideas and bug fix contributions
828 * Baptiste Lepilleur for the public-domain JSON parser (JsonCPP)
829 * Heng Li, author of SAMtools - the original C-language BAM API/toolkit.
831 --------------------------------------------------------------------------------
833 --------------------------------------------------------------------------------
835 Feel free to contact me with any questions, comments, suggestions, bug reports,
840 Biology Dept., Boston College
842 Email: barnetde@bc.edu
843 Project Websites: http://github.com/pezmaster31/bamtools (ACTIVE SUPPORT)
844 http://sourceforge.net/projects/bamtools (major updates only)