-------------------------------------------------------------------------------- README : BAMTOOLS -------------------------------------------------------------------------------- BamTools: a C++ API & toolkit for reading/writing/manipulating BAM files. I. Introduction a. The API b. The Toolkit II. Installation III. Usage a. The API b. The Toolkit IV. License V. Acknowledgements VI. Contact -------------------------------------------------------------------------------- I. Introduction: -------------------------------------------------------------------------------- BamTools provides both a programmer's API and an end-user's toolkit for handling BAM files. ---------------------------------------- Ia. The API: ---------------------------------------- The API consists of 2 main modules: BamReader and BamWriter. As you would expect, BamReader provides read-access to BAM files, while BamWriter handles writing data to BAM files. BamReader provides the interface for random-access (jumping) in a BAM file, as well as generating BAM index files. BamMultiReader is an extra module that allows you to manage multiple open BAM files for reading. It provides some validation & bookkeeping under the hood to keep all files sync'ed up for you. Additional files used by the API: - BamAlignment.* : implements the BamAlignment data structure - BamAux.h : contains various constants, data structures and utility methods used throught the API. - BamIndex.* : implements both the standard BAM format index (".bai") as well as a new BamTools-specific index (".bti"). - BGZF.* : contains our implementation of the Broad Institute's BGZF compression format. ---------------------------------------- Ib. The Toolkit: ---------------------------------------- If you've been using the BamTools since the early days, you'll notice that our 'toy' API examples (BamConversion, BamDump, BamTrim,...) are now gone. We have dumped these in favor of a suite of small utilities that we hope both developers and end-users find useful: usage: bamtools [--help] COMMAND [ARGS] Available bamtools commands: convert Converts between BAM and a number of other formats count Prints number of alignments in BAM file(s) coverage Prints coverage statistics from the input BAM file filter Filters BAM file(s) by user-specified criteria header Prints BAM header information index Generates index for BAM file merge Merge multiple BAM files into single file random Select random alignments from existing BAM file(s) sort Sorts the BAM file according to some criteria split Splits a BAM file on user-specifed property, creating a new BAM output file for each value found stats Prints some basic statistics from input BAM file(s) See 'bamtools help COMMAND' for more information on a specific command. -------------------------------------------------------------------------------- II. Installation : -------------------------------------------------------------------------------- ---------------------------------------- IIa. Get CMake ---------------------------------------- BamTools has been migrated to a CMake-based build system. We believe that this should simplify the build process across all platforms, especially as the BamTools API moves into a shared library (that you link to instead of compiling lots of source files directly into your application). CMake is available on all major platforms, and indeed comes *out-of-the-box* with many Linux distributions. To see if you have CMake (and which version), try this command: $ cmake --version BamTools requires CMake version >= 2.6.4. If you are missing CMake or have an older version, check your OS package manager (for Linux) or download it here: http://www.cmake.org/cmake/resources/software.html . ---------------------------------------- IIb. Build BamTools ---------------------------------------- Ok, now that you have CMake ready to go, let's build BamTools. A good practice in building applications is to do an out-of-source build, meaning that we're going to set up an isolated place to hold all the intermediate installation steps. In the top-level directory of BamTools, type the following commands: $ mkdir build $ cd build $ cmake .. Windows users: This creates a Visual Studio solution file, which can then be built to create the toolkit executable and API DLL's. Everybody else: After running cmake, just run: $ make Then go back up to the BamTools root directory. $ cd .. ---------------------------------------- IIIb. Check It ---------------------------------------- Assuming the build process finished correctly, you should be able to find the toolkit executable here: ./bin/ The BamTools-associated libraries will be found here: ./lib/ -------------------------------------------------------------------------------- III. Usage : -------------------------------------------------------------------------------- ** General usage information - perhaps explain common terms, point to SAM/BAM spec, etc ** ---------------------------------------- IIIa. The API ---------------------------------------- The API, as noted above, contains 2 main modules - BamReader & BamWriter - for dealing with BAM files. Alignment data is made available through the BamAlignment data structure. A simple (read-only) scenario for accessing BAM data would look like the following: // open our BamReader BamReader reader; reader.Open("someData.bam", "someData.bam.bai"); // define our region of interest // in this example: bases 0-500 on the reference "chrX" int id = reader.GetReferenceID("chrX"); BamRegion region(id, 0, id, 500); reader.SetRegion(region); // iterate through alignments in this region, // ignoring alignments with a MQ below some cutoff BamAlignment al; while ( reader.GetNextAlignment(al) ) { if ( al.MapQuality >= 50 ) // do something } // close the reader reader.Close(); To use this API in your application, you simply need to do 3 things: 1 - Build the BamTools library (see Installation steps above). 2 - Import BamTools API with the following lines of code #include "BamReader.h" // (or "BamMultiReader.h") as needed #include "BamWriter.h" // as needed using namespace BamTools; // all of BamTools classes/methods live in // this namespace 3 - Link with '-lbamtools' ('l' as in Lima). You may need to modify the -L flag (library path) as well to help your linker find the (BAMTOOLS_ROOT)/lib directory. See any included programs for more detailed usage examples. See comments in the header files for more detailed API documentation. Note - For users that don't want to bother with the new BamTools shared library scheme: you are certainly free to just compile the API source code directly into your application, but be aware that the files involved are subject to change. Meaning that filenames, number of files, etc. are not fixed. You will also need to be sure to link with '-lz' for ZLIB functionality (linking with '-lbamtools' gives you this automatically). ---------------------------------------- IIIb. The Toolkit ---------------------------------------- BamTools provides a small, but powerful suite of command-line utility programs for manipulating and querying BAM files for data. -------------------- Input/Output -------------------- All BamTools utilities handle I/O operations using a common set of arguments. These include: -in The input BAM files(s). If a tool accepts multiple BAM files as input, each file gets its own "-in" option on the command line. If no "-in" is provided, the tool will attempt to read BAM data from stdin. To read a single BAM file, use a single "-in" option: > bamtools *tool* -in myData1.bam ...ARGS... To read multiple BAM files, use multiple "-in" options: > bamtools *tool* -in myData1.bam -in myData2.bam ...ARGS... To read from stdin (if supported), omit the "-in" option: > bamtools *tool* ...ARGS... -out The output BAM file. If a tool outputs a result BAM file, specify the filename using this option. If none is provided, the tool will typically write to stdout. *Note: Not all tools output BAM data (e.g. count, header, etc.) -region A region of interest. See below for accepted 'REGION string' formats. Many of the tools accept this option, which allows a user to only consider alignments that overlap this region (whether counting, filtering, merging, etc.). An alignment is considered to overlap a region if any part of the alignments intersects the left/right boundaries. Thus, a 50bp alignment at position 70 will overlap a region beginning at position 100. REGION string format ---------------------- A proper REGION string can be formatted like any of the following examples: where 'chr1' is the name of a reference (not its ID)and '' is any valid integer position within that reference. To read chr1 - only alignments on (entire) reference 'chr1' chr1:500 - only alignments overlapping the region starting at chr1:500 and continuing to the end of chr1 chr1:500..1000 - only alignments overlapping the region starting at chr1:500 and continuing to chr1:1000 chr1:500..chr3:750 - only alignments overlapping the region starting at chr1:500 and continuing to chr3:750. This 'spanning' region assumes that the reference specified as the right boundary will occur somewhere in the file after the left boundary. On a sorted BAM, a REGION of 'chr4:500..chr2:1500' will produce undefined (incorrect) results. So don't do it. :) *Note: Most of the tools that accept a REGION string will perform without an index file, but typically at great cost to performance (having to plow through the entire file until the region of interest is found). For optimum speed, be sure that index files are available for your data. -forceCompression Force compression of BAM output. When tools are piped together (see details below), the default behavior is to turn off compression. This can greatly increase performance when the data does not have to be constantly decompressed and recompressed. This is ignored any time an output BAM file is specified using "-out". -------------------- Piping -------------------- Many of the tools in BamTools can be chained together by piping. Any tool that accepts stdin can be piped into, and any that can output stdout can be piped from. For example: > bamtools filter -in data1.bam -in data2.bam -mapQuality ">50" | bamtools count will give a count of all alignments in your 2 BAM files with a mapQuality of greater than 50. And of course, any tool writing to stdout can be piped into other utilities. -------------------- The Tools -------------------- convert Converts between BAM and a number of other formats count Prints number of alignments in BAM file(s) coverage Prints coverage statistics from the input BAM file filter Filters BAM file(s) by user-specified criteria header Prints BAM header information index Generates index for BAM file merge Merge multiple BAM files into single file random Select random alignments from existing BAM file(s) sort Sorts the BAM file according to some criteria split Splits a BAM file on user-specifed property, creating a new BAM output file for each value found stats Prints some basic statistics from input BAM file(s) ---------- convert ---------- Description: converts BAM to a number of other formats Usage: bamtools convert -format [-in -in ...] [-out ] [other options] Input & Output: -in the input BAM file(s) [stdin] -out the output BAM file [stdout] -format the output file format - see below for supported formats Filters: -region genomic region. Index file is recommended for better performance, and is read automatically if it exists. See 'bamtools help index' for more details on creating one. Pileup Options: -fasta FASTA reference file -mapqual print the mapping qualities SAM Options: -noheader omit the SAM header from output Help: --help, -h shows this help text ** Notes ** - Currently supported output formats ( BAM -> X ) Format type FORMAT (command-line argument) ------------ ------------------------------- BED bed FASTA fasta FASTQ fastq JSON json Pileup pileup SAM sam YAML yaml Usage example: > bamtools convert -format json -in myData.bam -out myData.json - Pileup Options have no effect on formats other than "pileup" SAM Options have no effect on formats other than "sam" ---------- count ---------- Description: prints number of alignments in BAM file(s). Usage: bamtools count [-in -in ...] [-region ] Input & Output: -in the input BAM file(s) [stdin] -region genomic region. Index file is recommended for better performance, and is used automatically if it exists. See 'bamtools help index' for more details on creating one Help: --help, -h shows this help text ---------- coverage ---------- Description: prints coverage data for a single BAM file. Usage: bamtools coverage [-in ] [-out ] Input & Output: -in the input BAM file [stdin] -out the output file [stdout] Help: --help, -h shows this help text ---------- filter ---------- Description: filters BAM file(s). Usage: bamtools filter [-in -in ...] [-out | [-forceCompression]] [-region ] [ [-script the input BAM file(s) [stdin] -out the output BAM file [stdout] -region only read data from this genomic region (see README for more details) -script the filter script file (see README for more details) -forceCompression if results are sent to stdout (like when piping to another tool), default behavior is to leave output uncompressed. Use this flag to override and force compression General Filters: -alignmentFlag keep reads with this *exact* alignment flag (for more detailed queries, see below) -insertSize keep reads with insert size that matches pattern -mapQuality <[0-255]> keep reads with map quality that matches pattern -name keep reads with name that matches pattern -queryBases keep reads with motif that matches pattern -tag keep reads with this key=>value pair Alignment Flag Filters: -isDuplicate keep only alignments that are marked as duplicate [true] -isFailedQC keep only alignments that failed QC [true] -isFirstMate keep only alignments marked as first mate [true] -isMapped keep only alignments that were mapped [true] -isMateMapped keep only alignments with mates that mapped [true] -isMateReverseStrand keep only alignments with mate on reverse strand [true] -isPaired keep only alignments that were sequenced as paired [true] -isPrimaryAlignment keep only alignments marked as primary [true] -isProperPair keep only alignments that passed paired-end resolution [true] -isReverseStrand keep only alignments on reverse strand [true] -isSecondMate keep only alignments marked as second mate [true] Help: --help, -h shows this help text ***************** * Filter Script * ***************** The BamTools filter tool allows you to use an external filter script to define complex filtering behavior. This script uses what I'm calling properties, filters, and a rule - all implemented in a JSON syntax. ** Properties ** A 'property' is a typical JSON entry of the form: "propertyName" : "value" Here are the property names that BamTools will recognize: alignmentFlag cigar insertSize isDuplicate isFailedQC isFirstMate isMapped isMateMapped isMateReverseStrand isPaired isPrimaryAlignment isProperPair isReverseStrand isSecondMate mapQuality matePosition mateReference name position queryBases reference tag For properties with boolean values, use the words "true" or "false". For example, "isMapped" : "true" will keep only alignments that are flagged as 'mapped'. For properties with numeric values, use the desired number with optional comparison operators ( >, >=, <, <=, !). For example, "mapQuality" : ">=75" will keep only alignments with mapQuality greater than or equal to 75. If you're familiar with JSON, you know that integers can be bare (without quotes). However, if you a comparison operator, be sure to enclose in quotes. For string-based properties, the above operators are available. In addition, you can also use some basic pattern-matching operators. For example, "reference" : "ALU*" // reference starts with 'ALU' "name" : "*foo" // name ends with 'foo' "cigar" : "*D*" // cigar contains a 'D' anywhere Notes - The reference property refers to the reference name, not the BAM reference numeric ID. The tag property has an extra layer, so that the syntax will look like this: "tag" : "XX:value" where XX is the 2-letter SAM/BAM tag and value is, well, the value. Comparison operators can still apply to values, so tag properties of: "tag" : "AS:>60" "tag" : "RG:foo*" are perfectly valid. ** Filters ** A 'filter' is a JSON container of properties that will be AND-ed together. For example, { "reference" : "chr1", "mapQuality" : ">50", "tag" : "NM:<4" } would result in an output BAM file containing only alignments from chr1 with a mapQuality >50 and edit distance of less than 4. A single, unnamed filter like this is the minimum necessary for a complete filter script. Save this file and use as the -script parameter and you should be all set. Moving on to more potent filtering... You can also define multiple filters. To do so, you just need to use the "filters" keyword along with JSON array syntax, like this: { "filters" : [ { "reference" : "chr1", "mapQuality" : ">50" }, { "reference" : "chr1", "isReverseStrand" : "true" } ] } These filters will be (inclusive) OR-ed together by default. So you'd get a resulting BAM with only alignments from chr1 that had either mapQuality >50 or on the reverse strand (or both). ** Rule ** Alternatively to anonymous OR-ed filters, you can also provide what I've called a "rule". By giving each filter an "id", using this "rule" keyword you can describe boolean relationships between your filter sets. Available rule operators: & // and | // or ! // not This might sound a little fuzzy at this point, so let's get back to an example: { "filters" : [ { "id" : "filter1", "reference" : "chr1", "mapQuality" : ">50" }, { "id" : "filter2", "reference" : "chr1", "isReverseStrand" : "true" }, { "id" : "filter3", "reference" : "chr1", "queryBases" : "AGCT*" } ], "rule" : " (filter1 | filter2) & !filter3 " } In this case, we would only retain aligments that passed filter 1 OR filter 2, AND also NOT filter 3. These are dummy examples, and don't make much sense as an actual query case. But hopefully this serves an adequate primer to get you started and discover the potential flexibility here. ---------- header ---------- Description: prints header from BAM file(s). Usage: bamtools header [-in -in ...] Input & Output: -in the input BAM file(s) [stdin] Help: --help, -h shows this help text ---------- index ---------- Description: creates index for BAM file. Usage: bamtools index [-in ] [-bti] Input & Output: -in the input BAM file [stdin] -bti create (non-standard) BamTools index file (*.bti). Default behavior is to create standard BAM index (*.bai) Help: --help, -h shows this help tex ---------- merge ---------- Description: merges multiple BAM files into one. Usage: bamtools merge [-in -in ...] [-out | [-forceCompression]] [-region ] Input & Output: -in the input BAM file(s) -out the output BAM file -forceCompression if results are sent to stdout (like when piping to another tool), default behavior is to leave output uncompressed. Use this flag to override and force compression -region genomic region. See README for more details Help: --help, -h shows this help text ---------- random ---------- Description: grab a random subset of alignments. Usage: bamtools random [-in -in ...] [-out ] [-forceCompression] [-n] [-region ] Input & Output: -in the input BAM file [stdin] -out the output BAM file [stdout] -forceCompression if results are sent to stdout (like when piping to another tool), default behavior is to leave output uncompressed. Use this flag to override and force compression -region only pull random alignments from within this genomic region. Index file is recommended for better performance, and is used automatically if it exists. See 'bamtools help index' for more details on creating one Settings: -n number of alignments to grab. Note that no duplicate checking is performed [10000] Help: --help, -h shows this help text ---------- sort ---------- Description: sorts a BAM file. Usage: bamtools sort [-in ] [-out ] [sortOptions] Input & Output: -in the input BAM file [stdin] -out the output BAM file [stdout] Sorting Methods: -byname sort by alignment name Memory Settings: -n max number of alignments per tempfile [10000] -mem max memory to use [1024] Help: --help, -h shows this help text ---------- split ---------- Description: splits a BAM file on user-specified property, creating a new BAM output file for each value found. Usage: bamtools split [-in ] [-stub ] < -mapped | -paired | -reference | -tag > Input & Output: -in the input BAM file [stdin] -stub prefix stub for output BAM files (default behavior is to use input filename, without .bam extension, as stub). If input is stdin and no stub provided, a timestamp is generated as the stub. Split Options: -mapped split mapped/unmapped alignments -paired split single-end/paired-end alignments -reference split alignments by reference -tag splits alignments based on all values of TAG encountered (i.e. -tag RG creates a BAM file for each read group in original BAM file) Help: --help, -h shows this help text ---------- stats ---------- Description: prints general alignment statistics. Usage: bamtools stats [-in -in ...] [statsOptions] Input & Output: -in the input BAM file [stdin] Additional Stats: -insert summarize insert size data Help: --help, -h shows this help text -------------------------------------------------------------------------------- IV. License : -------------------------------------------------------------------------------- Both the BamTools API and toolkit are released under the MIT License. Copyright (c) 2009-2010 Derek Barnett, Erik Garrison, Gabor Marth, Michael Stromberg See included file LICENSE for details. -------------------------------------------------------------------------------- V. Acknowledgements : -------------------------------------------------------------------------------- * Aaron Quinlan for several key feature ideas and bug fix contributions * Baptiste Lepilleur for the public-domain JSON parser (JsonCPP) * Heng Li, author of SAMtools - the original C-language BAM API/toolkit. -------------------------------------------------------------------------------- VI. Contact : -------------------------------------------------------------------------------- Feel free to contact me with any questions, comments, suggestions, bug reports, etc. Derek Barnett Marth Lab Biology Dept., Boston College Email: barnetde@bc.edu Project Websites: http://github.com/pezmaster31/bamtools (ACTIVE SUPPORT) http://sourceforge.net/projects/bamtools (major updates only)