README : BAMTOOLS
--------------------------------------------------------------------------------
-BamTools: a C++ API & toolkit for reading/writing/manipulating BAM files.
-
-I. Introduction
- a. The API
- b. The Toolkit
-
-II. Installation
-
-III. Usage
- a. The API
- b. The Toolkit
-
-IV. License
-
-V. Acknowledgements
-
-VI. Contact
-
---------------------------------------------------------------------------------
-I. Introduction:
---------------------------------------------------------------------------------
-
BamTools provides both a programmer's API and an end-user's toolkit for handling
BAM files.
-----------------------------------------
-Ia. The API:
-----------------------------------------
-
-The API consists of 2 main modules: BamReader and BamWriter. As you would
-expect, BamReader provides read-access to BAM files, while BamWriter handles
-writing data to BAM files. BamReader provides the interface for random-access
-(jumping) in a BAM file, as well as generating BAM index files.
-
-BamMultiReader is an extra module that allows you to manage multiple open BAM
-files for reading. It provides some validation & bookkeeping under the hood to
-keep all files sync'ed up for you.
-
-Additional files used by the API:
-
- - BamAlignment.* : implements the BamAlignment data structure
-
- - BamAux.h : contains various constants, data structures and utility
- methods used throught the API.
-
- - BamIndex.* : implements both the standard BAM format index (".bai") as
- well as a new BamTools-specific index (".bti").
-
- - BGZF.* : contains our implementation of the Broad Institute's BGZF
- compression format.
-
-----------------------------------------
-Ib. The Toolkit:
-----------------------------------------
-
-If you've been using the BamTools since the early days, you'll notice that our
-'toy' API examples (BamConversion, BamDump, BamTrim,...) are now gone. We have
-dumped these in favor of a suite of small utilities that we hope both
-developers and end-users find useful:
-
-usage: bamtools [--help] COMMAND [ARGS]
-
-Available bamtools commands:
-
- convert Converts between BAM and a number of other formats
- count Prints number of alignments in BAM file(s)
- coverage Prints coverage statistics from the input BAM file
- filter Filters BAM file(s) by user-specified criteria
- header Prints BAM header information
- index Generates index for BAM file
- merge Merge multiple BAM files into single file
- random Select random alignments from existing BAM file(s)
- sort Sorts the BAM file according to some criteria
- split Splits a BAM file on user-specifed property, creating a
- new BAM output file for each value found
- stats Prints some basic statistics from input BAM file(s)
-
-See 'bamtools help COMMAND' for more information on a specific command.
-
---------------------------------------------------------------------------------
-II. Installation :
---------------------------------------------------------------------------------
-
-----------------------------------------
-IIa. Get CMake
-----------------------------------------
-
-BamTools has been migrated to a CMake-based build system. We believe that this
-should simplify the build process across all platforms, especially as the
-BamTools API moves into a shared library (that you link to instead of compiling
-lots of source files directly into your application). CMake is available on all
-major platforms, and indeed comes *out-of-the-box* with many Linux distributions.
-
-To see if you have CMake (and which version), try this command:
-
- $ cmake --version
-
-BamTools requires CMake version >= 2.6.4. If you are missing CMake or have an
-older version, check your OS package manager (for Linux) or download it here:
-http://www.cmake.org/cmake/resources/software.html .
-
-----------------------------------------
-IIb. Build BamTools
-----------------------------------------
-
-Ok, now that you have CMake ready to go, let's build BamTools. A good
-practice in building applications is to do an out-of-source build, meaning
-that we're going to set up an isolated place to hold all the intermediate
-installation steps.
-
-In the top-level directory of BamTools, type the following commands:
-
- $ mkdir build
- $ cd build
- $ cmake ..
-
-Windows users:
-This creates a Visual Studio solution file, which can then be built to create
-the toolkit executable and API DLL's.
-
-Everybody else:
-After running cmake, just run:
-
- $ make
+I. Learn More
-Then go back up to the BamTools root directory.
+II. License
- $ cd ..
+III. Acknowledgements
-----------------------------------------
-IIIb. Check It
-----------------------------------------
-
-Assuming the build process finished correctly, you should be able to find the
-toolkit executable here:
-
- ./bin/
-
-The BamTools-associated libraries will be found here:
-
- ./lib/
-
-The BamTools API headers will be found here:
-
- ./include/*
+IV. Contact
--------------------------------------------------------------------------------
-III. Usage :
+I. Learn More:
--------------------------------------------------------------------------------
-** General usage information - perhaps explain common terms, point to SAM/BAM
-spec, etc **
-
-----------------------------------------
-IIIa. The API
-----------------------------------------
-
-The API, as noted above, contains 2 main modules - BamReader & BamWriter - for
-dealing with BAM files. Alignment data is made available through the
-BamAlignment data structure.
-
-A simple (read-only) scenario for accessing BAM data would look like the
-following:
-
- // open our BamReader
- BamReader reader;
- reader.Open("someData.bam", "someData.bam.bai");
-
- // define our region of interest
- // in this example: bases 0-500 on the reference "chrX"
- int id = reader.GetReferenceID("chrX");
- BamRegion region(id, 0, id, 500);
- reader.SetRegion(region);
-
- // iterate through alignments in this region,
- // ignoring alignments with a MQ below some cutoff
- BamAlignment al;
- while ( reader.GetNextAlignment(al) ) {
- if ( al.MapQuality >= 50 )
- // do something
- }
-
- // close the reader
- reader.Close();
-
-To use this API in your application, you simply need to do the following:
-
- 1 - Build the BamTools library (see Installation steps above).
-
- 2 - Import BamTools API functionality as needed, for example:
-
- #include "api/BamReader.h"
- #include "api/BamWriter.h"
- using namespace BamTools; // all BamTools classes/methods live within
- // this namespace
-
- 3 - In your own build step, point your include path to the
- (BAMTOOLS_ROOT)/include directory. Link your app with '-lbamtools' ('l'
- as in Lima).
-
-* You may need to modify the -L flag (library path) as well to help your linker
-find the (BAMTOOLS_ROOT)/lib directory.
-
-* Depending on your platform and where you install the BamTools API library, you
-may also need to adjust how your app locates the library at runtime. For
-Windows users, this can be as simple as dropping the DLL in the same folder as
-your executable. For *nix users (using gcc at least), you can add the following
-to your app's CXXFLAGS:
-
- -Wl,-rpath,$(BAMTOOLS_LIB_DIR)
-
-where BAMTOOLS_LIB_DIR is, as you would guess, the directory containing the libs.
-An alternative is to set your local LD_LIBRARY_PATH environment variable.
-
-See any included programs for more detailed usage examples. See comments in the
-header files for more detailed API documentation.
-
-Note - For users that don't want to bother with the new BamTools shared library
-scheme: you are certainly free to just compile the API source code directly into
-your application, but be aware that the files involved are subject to change.
-Meaning that filenames, number of files, etc. are not fixed. You will also need
-to be sure to link with '-lz' for ZLIB functionality (linking with '-lbamtools'
-gives you this automatically).
-
-----------------------------------------
-IIIb. The Toolkit
-----------------------------------------
-
-BamTools provides a small, but powerful suite of command-line utility programs
-for manipulating and querying BAM files for data.
-
---------------------
-Input/Output
---------------------
-
-All BamTools utilities handle I/O operations using a common set of arguments.
-These include:
-
- -in <BAM file>
-
-The input BAM files(s).
-
- If a tool accepts multiple BAM files as input, each file gets its own "-in"
- option on the command line. If no "-in" is provided, the tool will attempt
- to read BAM data from stdin.
-
- To read a single BAM file, use a single "-in" option:
- > bamtools *tool* -in myData1.bam ...ARGS...
-
- To read multiple BAM files, use multiple "-in" options:
- > bamtools *tool* -in myData1.bam -in myData2.bam ...ARGS...
-
- To read from stdin (if supported), omit the "-in" option:
- > bamtools *tool* ...ARGS...
-
- -out <BAM file>
-
-The output BAM file.
-
- If a tool outputs a result BAM file, specify the filename using this option.
- If none is provided, the tool will typically write to stdout.
-
- *Note: Not all tools output BAM data (e.g. count, header, etc.)
-
- -region <REGION>
-
-A region of interest. See below for accepted 'REGION string' formats.
-
- Many of the tools accept this option, which allows a user to only consider
- alignments that overlap this region (whether counting, filtering, merging,
- etc.).
-
- An alignment is considered to overlap a region if any part of the alignments
- intersects the left/right boundaries. Thus, a 50bp alignment at position 70
- will overlap a region beginning at position 100.
-
- REGION string format
- ----------------------
- A proper REGION string can be formatted like any of the following examples:
- where 'chr1' is the name of a reference (not its ID)and '' is any valid
- integer position within that reference.
-
- To read
- chr1 - only alignments on (entire) reference 'chr1'
- chr1:500 - only alignments overlapping the region starting at
- chr1:500 and continuing to the end of chr1
- chr1:500..1000 - only alignments overlapping the region starting at
- chr1:500 and continuing to chr1:1000
- chr1:500..chr3:750 - only alignments overlapping the region starting at
- chr1:500 and continuing to chr3:750. This 'spanning'
- region assumes that the reference specified as the
- right boundary will occur somewhere in the file after
- the left boundary. On a sorted BAM, a REGION of
- 'chr4:500..chr2:1500' will produce undefined
- (incorrect) results. So don't do it. :)
-
- *Note: Most of the tools that accept a REGION string will perform without an
- index file, but typically at great cost to performance (having to
- plow through the entire file until the region of interest is found).
- For optimum speed, be sure that index files are available for your
- data.
-
- -forceCompression
-
-Force compression of BAM output.
-
- When tools are piped together (see details below), the default behavior is
- to turn off compression. This can greatly increase performance when the data
- does not have to be constantly decompressed and recompressed. This is
- ignored any time an output BAM file is specified using "-out".
-
---------------------
-Piping
---------------------
-
-Many of the tools in BamTools can be chained together by piping. Any tool that
-accepts stdin can be piped into, and any that can output stdout can be piped
-from. For example:
-
-> bamtools filter -in data1.bam -in data2.bam -mapQuality ">50" | bamtools count
-
-will give a count of all alignments in your 2 BAM files with a mapQuality of
-greater than 50. And of course, any tool writing to stdout can be piped into
-other utilities.
-
---------------------
-The Tools
---------------------
-
- convert Converts between BAM and a number of other formats
- count Prints number of alignments in BAM file(s)
- coverage Prints coverage statistics from the input BAM file
- filter Filters BAM file(s) by user-specified criteria
- header Prints BAM header information
- index Generates index for BAM file
- merge Merge multiple BAM files into single file
- random Select random alignments from existing BAM file(s)
- sort Sorts the BAM file according to some criteria
- split Splits a BAM file on user-specifed property, creating a new
- BAM output file for each value found
- stats Prints some basic statistics from input BAM file(s)
-
-----------
-convert
-----------
-
-Description: converts BAM to a number of other formats
-
-Usage: bamtools convert -format <FORMAT> [-in <filename> -in <filename> ...]
- [-out <filename>] [other options]
-
-Input & Output:
- -in <BAM filename> the input BAM file(s) [stdin]
- -out <BAM filename> the output BAM file [stdout]
- -format <FORMAT> the output file format - see below for
- supported formats
-
-Filters:
- -region <REGION> genomic region. Index file is recommended for
- better performance, and is read
- automatically if it exists. See 'bamtools
- help index' for more details on creating
- one.
-
-Pileup Options:
- -fasta <FASTA filename> FASTA reference file
- -mapqual print the mapping qualities
-
-SAM Options:
- -noheader omit the SAM header from output
-
-Help:
- --help, -h shows this help text
-
-** Notes **
-
- - Currently supported output formats ( BAM -> X )
-
- Format type FORMAT (command-line argument)
- ------------ -------------------------------
- BED bed
- FASTA fasta
- FASTQ fastq
- JSON json
- Pileup pileup
- SAM sam
- YAML yaml
-
- Usage example:
- > bamtools convert -format json -in myData.bam -out myData.json
-
- - Pileup Options have no effect on formats other than "pileup"
- SAM Options have no effect on formats other than "sam"
-
-----------
-count
-----------
-
-Description: prints number of alignments in BAM file(s).
-
-Usage: bamtools count [-in <filename> -in <filename> ...] [-region <REGION>]
-
-Input & Output:
- -in <BAM filename> the input BAM file(s) [stdin]
- -region <REGION> genomic region. Index file is recommended
- for better performance, and is used
- automatically if it exists. See
- 'bamtools help index' for more details
- on creating one
-
-Help:
- --help, -h shows this help text
-
-----------
-coverage
-----------
-
-Description: prints coverage data for a single BAM file.
-
-Usage: bamtools coverage [-in <filename>] [-out <filename>]
-
-Input & Output:
- -in <BAM filename> the input BAM file [stdin]
- -out <filename> the output file [stdout]
-
-Help:
- --help, -h shows this help text
-
-----------
-filter
-----------
-
-Description: filters BAM file(s).
-
-Usage: bamtools filter [-in <filename> -in <filename> ...]
- [-out <filename> | [-forceCompression]]
- [-region <REGION>]
- [ [-script <filename] | [filterOptions] ]
-
-Input & Output:
- -in <BAM filename> the input BAM file(s) [stdin]
- -out <BAM filename> the output BAM file [stdout]
- -region <REGION> only read data from this genomic region (see
- README for more details)
- -script <filename> the filter script file (see README for more
- details)
- -forceCompression if results are sent to stdout (like when
- piping to another tool), default behavior
- is to leave output uncompressed. Use this
- flag to override and force compression
-
-General Filters:
- -alignmentFlag <int> keep reads with this *exact* alignment flag
- (for more detailed queries, see below)
- -insertSize <int> keep reads with insert size that matches
- pattern
- -mapQuality <[0-255]> keep reads with map quality that matches
- pattern
- -name <string> keep reads with name that matches pattern
- -queryBases <string> keep reads with motif that matches pattern
- -tag <TAG:VALUE> keep reads with this key=>value pair
-
-Alignment Flag Filters:
- -isDuplicate <true/false> keep only alignments that are marked as
- duplicate [true]
- -isFailedQC <true/false> keep only alignments that failed QC [true]
- -isFirstMate <true/false> keep only alignments marked as first mate
- [true]
- -isMapped <true/false> keep only alignments that were mapped [true]
- -isMateMapped <true/false> keep only alignments with mates that mapped
- [true]
- -isMateReverseStrand <true/false> keep only alignments with mate on reverse
- strand [true]
- -isPaired <true/false> keep only alignments that were sequenced as
- paired [true]
- -isPrimaryAlignment <true/false> keep only alignments marked as primary
- [true]
- -isProperPair <true/false> keep only alignments that passed paired-end
- resolution [true]
- -isReverseStrand <true/false> keep only alignments on reverse strand
- [true]
- -isSecondMate <true/false> keep only alignments marked as second mate
- [true]
-
-Help:
- --help, -h shows this help text
-
- *****************
- * Filter Script *
- *****************
-
-The BamTools filter tool allows you to use an external filter script to define
-complex filtering behavior. This script uses what I'm calling properties,
-filters, and a rule - all implemented in a JSON syntax.
-
- ** Properties **
-
-A 'property' is a typical JSON entry of the form:
-
- "propertyName" : "value"
-
-Here are the property names that BamTools will recognize:
-
- alignmentFlag
- cigar
- insertSize
- isDuplicate
- isFailedQC
- isFirstMate
- isMapped
- isMateMapped
- isMateReverseStrand
- isPaired
- isPrimaryAlignment
- isProperPair
- isReverseStrand
- isSecondMate
- mapQuality
- matePosition
- mateReference
- name
- position
- queryBases
- reference
- tag
-
-For properties with boolean values, use the words "true" or "false".
-For example,
-
- "isMapped" : "true"
-
-will keep only alignments that are flagged as 'mapped'.
-
-For properties with numeric values, use the desired number with optional
-comparison operators ( >, >=, <, <=, !). For example,
-
- "mapQuality" : ">=75"
-
-will keep only alignments with mapQuality greater than or equal to 75.
-
-If you're familiar with JSON, you know that integers can be bare (without
-quotes). However, if you a comparison operator, be sure to enclose in quotes.
-
-For string-based properties, the above operators are available. In addition,
- you can also use some basic pattern-matching operators. For example,
-
- "reference" : "ALU*" // reference starts with 'ALU'
- "name" : "*foo" // name ends with 'foo'
- "cigar" : "*D*" // cigar contains a 'D' anywhere
-
-Notes -
-The reference property refers to the reference name, not the BAM reference
-numeric ID.
-
-The tag property has an extra layer, so that the syntax will look like this:
-
- "tag" : "XX:value"
-
-where XX is the 2-letter SAM/BAM tag and value is, well, the value.
-Comparison operators can still apply to values, so tag properties of:
-
- "tag" : "AS:>60"
- "tag" : "RG:foo*"
-
-are perfectly valid.
-
- ** Filters **
-
-A 'filter' is a JSON container of properties that will be AND-ed together. For
-example,
-
-{
- "reference" : "chr1",
- "mapQuality" : ">50",
- "tag" : "NM:<4"
-}
-
-would result in an output BAM file containing only alignments from chr1 with a
-mapQuality >50 and edit distance of less than 4.
-
-A single, unnamed filter like this is the minimum necessary for a complete
-filter script. Save this file and use as the -script parameter and you should
-be all set.
-
-Moving on to more potent filtering...
-
-You can also define multiple filters.
-To do so, you just need to use the "filters" keyword along with JSON array
-syntax, like this:
-
-{
- "filters" :
- [
- {
- "reference" : "chr1",
- "mapQuality" : ">50"
- },
- {
- "reference" : "chr1",
- "isReverseStrand" : "true"
- }
- ]
-}
-
-These filters will be (inclusive) OR-ed together by default. So you'd get a
-resulting BAM with only alignments from chr1 that had either mapQuality >50 or
-on the reverse strand (or both).
-
- ** Rule **
-
-Alternatively to anonymous OR-ed filters, you can also provide what I've called
-a "rule". By giving each filter an "id", using this "rule" keyword you can
-describe boolean relationships between your filter sets.
-
-Available rule operators:
-
- & // and
- | // or
- ! // not
-
-This might sound a little fuzzy at this point, so let's get back to an example:
-
-{
- "filters" :
- [
- {
- "id" : "filter1",
- "reference" : "chr1",
- "mapQuality" : ">50"
- },
- {
- "id" : "filter2",
- "reference" : "chr1",
- "isReverseStrand" : "true"
- },
- {
- "id" : "filter3",
- "reference" : "chr1",
- "queryBases" : "AGCT*"
- }
- ],
-
- "rule" : " (filter1 | filter2) & !filter3 "
-}
-
-In this case, we would only retain aligments that passed filter 1 OR filter 2,
-AND also NOT filter 3.
-
-These are dummy examples, and don't make much sense as an actual query case. But
-hopefully this serves an adequate primer to get you started and discover the
-potential flexibility here.
-
-----------
-header
-----------
-
-Description: prints header from BAM file(s).
-
-Usage: bamtools header [-in <filename> -in <filename> ...]
-
-Input & Output:
- -in <BAM filename> the input BAM file(s) [stdin]
-
-Help:
- --help, -h shows this help text
-
-----------
-index
-----------
-
-Description: creates index for BAM file.
-
-Usage: bamtools index [-in <filename>] [-bti]
-
-Input & Output:
- -in <BAM filename> the input BAM file [stdin]
- -bti create (non-standard) BamTools index file
- (*.bti). Default behavior is to create
- standard BAM index (*.bai)
-
-Help:
- --help, -h shows this help tex
-
-----------
-merge
-----------
-
-Description: merges multiple BAM files into one.
-
-Usage: bamtools merge [-in <filename> -in <filename> ...]
- [-out <filename> | [-forceCompression]] [-region <REGION>]
-
-Input & Output:
- -in <BAM filename> the input BAM file(s)
- -out <BAM filename> the output BAM file
- -forceCompression if results are sent to stdout (like when
- piping to another tool), default behavior
- is to leave output uncompressed. Use this
- flag to override and force compression
- -region <REGION> genomic region. See README for more details
-
-Help:
- --help, -h shows this help text
-
-----------
-random
-----------
-
-Description: grab a random subset of alignments.
-
-Usage: bamtools random [-in <filename> -in <filename> ...]
- [-out <filename>] [-forceCompression] [-n]
- [-region <REGION>]
-
-Input & Output:
- -in <BAM filename> the input BAM file [stdin]
- -out <BAM filename> the output BAM file [stdout]
- -forceCompression if results are sent to stdout (like when
- piping to another tool), default behavior
- is to leave output uncompressed. Use this
- flag to override and force compression
- -region <REGION> only pull random alignments from within this
- genomic region. Index file is
- recommended for better performance, and
- is used automatically if it exists. See
- 'bamtools help index' for more details
- on creating one
-
-Settings:
- -n <count> number of alignments to grab. Note that no
- duplicate checking is performed [10000]
-
-Help:
- --help, -h shows this help text
-
-----------
-sort
-----------
-
-Description: sorts a BAM file.
-
-Usage: bamtools sort [-in <filename>] [-out <filename>] [sortOptions]
-
-Input & Output:
- -in <BAM filename> the input BAM file [stdin]
- -out <BAM filename> the output BAM file [stdout]
-
-Sorting Methods:
- -byname sort by alignment name
-
-Memory Settings:
- -n <count> max number of alignments per tempfile
- [10000]
- -mem <Mb> max memory to use [1024]
-
-Help:
- --help, -h shows this help text
-
-----------
-split
-----------
-
-Description: splits a BAM file on user-specified property, creating a new BAM
-output file for each value found.
-
-Usage: bamtools split [-in <filename>] [-stub <filename stub>]
- < -mapped | -paired | -reference | -tag <TAG> >
-
-Input & Output:
- -in <BAM filename> the input BAM file [stdin]
- -stub <filename stub> prefix stub for output BAM files (default
- behavior is to use input filename,
- without .bam extension, as stub). If
- input is stdin and no stub provided, a
- timestamp is generated as the stub.
-
-Split Options:
- -mapped split mapped/unmapped alignments
- -paired split single-end/paired-end alignments
- -reference split alignments by reference
- -tag <tag name> splits alignments based on all values of TAG
- encountered (i.e. -tag RG creates a BAM
- file for each read group in original
- BAM file)
-
-Help:
- --help, -h shows this help text
-
-----------
-stats
-----------
-
-Description: prints general alignment statistics.
-
-Usage: bamtools stats [-in <filename> -in <filename> ...] [statsOptions]
+Installation steps, tutorial, API documentation, etc. are all now available
+through the BamTools project wiki:
-Input & Output:
- -in <BAM filename> the input BAM file [stdin]
+https://github.com/pezmaster31/bamtools/wiki
-Additional Stats:
- -insert summarize insert size data
+Join the mailing list(s) to stay informed of updates or get involved with
+contributing:
-Help:
- --help, -h shows this help text
+https://github.com/pezmaster31/bamtools/wiki/Mailing-lists
--------------------------------------------------------------------------------
-IV. License :
+II. License :
--------------------------------------------------------------------------------
Both the BamTools API and toolkit are released under the MIT License.
See included file LICENSE for details.
--------------------------------------------------------------------------------
-V. Acknowledgements :
+III. Acknowledgements :
--------------------------------------------------------------------------------
* Aaron Quinlan for several key feature ideas and bug fix contributions
* Heng Li, author of SAMtools - the original C-language BAM API/toolkit.
--------------------------------------------------------------------------------
-VI. Contact :
+IV. Contact :
--------------------------------------------------------------------------------
Feel free to contact me with any questions, comments, suggestions, bug reports,