X-Git-Url: https://git.donarmstrong.com/?p=bamtools.git;a=blobdiff_plain;f=README;h=498f4be06dc0ecc00162c43cdaa65841cea5dff8;hp=b6a8b0721163e88abe3831d7b0b22a4967df03d9;hb=HEAD;hpb=4c9d2fdc9c556713531bdd2f25ce49685ab218e9

diff --git a/README b/README
index b6a8b07..498f4be 100644
--- a/README
+++ b/README
@@ -2,795 +2,33 @@
 README : BAMTOOLS
 --------------------------------------------------------------------------------
 
-BamTools: a C++ API & toolkit for reading/writing/manipulating BAM files.
-
-I. Introduction
-    a. The API
-    b. The Toolkit
-
-II. Installation
-
-III. Usage
-    a. The API
-    b. The Toolkit
-
-IV. License
-
-V. Acknowledgements
-
-VI. Contact
-
---------------------------------------------------------------------------------
-I. Introduction:
---------------------------------------------------------------------------------
-
 BamTools provides both a programmer's API and an end-user's toolkit for handling
 BAM files.
 
-----------------------------------------
-Ia. The API:
-----------------------------------------
-
-The API consists of 2 main modules: BamReader and BamWriter. As you would
-expect, BamReader provides read-access to BAM files, while BamWriter handles
-writing data to BAM files. BamReader provides the interface for random-access
-(jumping) in a BAM file, as well as generating BAM index files.
-
-BamMultiReader is an extra module that allows you to manage multiple open BAM
-files for reading. It provides some validation & bookkeeping under the hood to
-keep all files sync'ed up for you.
-
-Additional files used by the API:
+I. Learn More
 
-    - BamAlignment.* : implements the BamAlignment data structure
+II. License
 
-    - BamAux.h : contains various constants, data structures and utility
-                 methods used throught the API.
+III. Acknowledgements
 
-    - BamIndex.* : implements both the standard BAM format index (".bai") as
-                   well as a new BamTools-specific index (".bti").
-
-    - BGZF.* : contains our implementation of the Broad Institute's BGZF
-               compression format.
-
-----------------------------------------
-Ib. The Toolkit:
-----------------------------------------
-
-If you've been using the BamTools since the early days, you'll notice that our
-'toy' API examples (BamConversion, BamDump, BamTrim,...) are now gone. We have
-dumped these in favor of a suite of small utilities that we hope both
-developers and end-users find useful:
-
-usage: bamtools [--help] COMMAND [ARGS]
-
-Available bamtools commands:
-
-    convert   Converts between BAM and a number of other formats
-    count     Prints number of alignments in BAM file(s)
-    coverage  Prints coverage statistics from the input BAM file
-    filter    Filters BAM file(s) by user-specified criteria
-    header    Prints BAM header information
-    index     Generates index for BAM file
-    merge     Merge multiple BAM files into single file
-    random    Select random alignments from existing BAM file(s)
-    sort      Sorts the BAM file according to some criteria
-    split     Splits a BAM file on user-specifed property, creating a
-              new BAM output file for each value found
-    stats     Prints some basic statistics from input BAM file(s)
-
-See 'bamtools help COMMAND' for more information on a specific command.
+IV. Contact
 
 --------------------------------------------------------------------------------
-II. Installation :
+I. Learn More:
 --------------------------------------------------------------------------------
 
-----------------------------------------
-IIa. Get CMake
-----------------------------------------
-
-BamTools has been migrated to a CMake-based build system. We believe that this
-should simplify the build process across all platforms, especially as the
-BamTools API moves into a shared library (that you link to instead of compiling
-lots of source files directly into your application). CMake is available on all
-major platforms, and indeed comes *out-of-the-box* with many Linux distributions.
-
-To see if you have CMake (and which version), try this command:
-
-  $ cmake --version
-
-BamTools requires CMake version >= 2.6.4. If you are missing CMake or have an
-older version, check your OS package manager (for Linux) or download it here:
-http://www.cmake.org/cmake/resources/software.html .
-
-----------------------------------------
-IIb. Build BamTools
-----------------------------------------
-
-Ok, now that you have CMake ready to go, let's build BamTools. A good
-practice in building applications is to do an out-of-source build, meaning
-that we're going to set up an isolated place to hold all the intermediate
-installation steps.
-
-In the top-level directory of BamTools, type the following commands:
-
-  $ mkdir build
-  $ cd build
-  $ cmake ..
-
-Windows users:
-This creates a Visual Studio solution file, which can then be built to create
-the toolkit executable and API DLL's.
-
-Everybody else:
-After running cmake, just run:
-
-  $ make
-
-Then go back up to the BamTools root directory.
-
-  $ cd ..
-
-----------------------------------------
-IIIb. Check It
-----------------------------------------
-
-Assuming the build process finished correctly, you should be able to find the
-toolkit executable here:
-
-  ./bin/
-
-The BamTools-associated libraries will be found here:
-
-  ./lib/
-
---------------------------------------------------------------------------------
-III. Usage :
---------------------------------------------------------------------------------
-
-** General usage information - perhaps explain common terms, point to SAM/BAM
-spec, etc **
-
-----------------------------------------
-IIIa. The API
-----------------------------------------
-
-The API, as noted above, contains 2 main modules - BamReader & BamWriter - for
-dealing with BAM files. Alignment data is made available through the
-BamAlignment data structure.
-
-A simple (read-only) scenario for accessing BAM data would look like the
-following:
-
-    // open our BamReader
-    BamReader reader;
-    reader.Open("someData.bam", "someData.bam.bai");
-
-    // define our region of interest
-    // in this example: bases 0-500 on the reference "chrX"
-    int id = reader.GetReferenceID("chrX");
-    BamRegion region(id, 0, id, 500);
-    reader.SetRegion(region);
-
-    // iterate through alignments in this region,
-    // ignoring alignments with a MQ below some cutoff
-    BamAlignment al;
-    while ( reader.GetNextAlignment(al) ) {
-        if ( al.MapQuality >= 50 )
-            // do something
-    }
-
-    // close the reader
-    reader.Close();
-
-To use this API in your application, you simply need to do 3 things:
-
-    1 - Build the BamTools library (see Installation steps above).
-
-    2 - Import BamTools API with the following lines of code
-        #include "BamReader.h"     // (or "BamMultiReader.h") as needed
-        #include "BamWriter.h"     // as needed
-        using namespace BamTools;  // all of BamTools classes/methods live in
-                                   // this namespace
-
-    3 - Link with '-lbamtools' ('l' as in Lima).
-
-You may need to modify the -L flag (library path) as well to help your linker
-find the (BAMTOOLS_ROOT)/lib directory.
-
-See any included programs for more detailed usage examples. See comments in the
-header files for more detailed API documentation.
-
-Note - For users that don't want to bother with the new BamTools shared library
-scheme: you are certainly free to just compile the API source code directly into
-your application, but be aware that the files involved are subject to change.
-Meaning that filenames, number of files, etc. are not fixed. You will also need
-to be sure to link with '-lz' for ZLIB functionality (linking with '-lbamtools'
-gives you this automatically).
-
-----------------------------------------
-IIIb.
The Toolkit ----------------------------------------- - -BamTools provides a small, but powerful suite of command-line utility programs -for manipulating and querying BAM files for data. - --------------------- -Input/Output --------------------- - -All BamTools utilities handle I/O operations using a common set of arguments. -These include: - - -in - -The input BAM files(s). - - If a tool accepts multiple BAM files as input, each file gets its own "-in" - option on the command line. If no "-in" is provided, the tool will attempt - to read BAM data from stdin. - - To read a single BAM file, use a single "-in" option: - > bamtools *tool* -in myData1.bam ...ARGS... - - To read multiple BAM files, use multiple "-in" options: - > bamtools *tool* -in myData1.bam -in myData2.bam ...ARGS... - - To read from stdin (if supported), omit the "-in" option: - > bamtools *tool* ...ARGS... - - -out - -The output BAM file. - - If a tool outputs a result BAM file, specify the filename using this option. - If none is provided, the tool will typically write to stdout. - - *Note: Not all tools output BAM data (e.g. count, header, etc.) - - -region - -A region of interest. See below for accepted 'REGION string' formats. - - Many of the tools accept this option, which allows a user to only consider - alignments that overlap this region (whether counting, filtering, merging, - etc.). - - An alignment is considered to overlap a region if any part of the alignments - intersects the left/right boundaries. Thus, a 50bp alignment at position 70 - will overlap a region beginning at position 100. - - REGION string format - ---------------------- - A proper REGION string can be formatted like any of the following examples: - where 'chr1' is the name of a reference (not its ID)and '' is any valid - integer position within that reference. 
- - To read - chr1 - only alignments on (entire) reference 'chr1' - chr1:500 - only alignments overlapping the region starting at - chr1:500 and continuing to the end of chr1 - chr1:500..1000 - only alignments overlapping the region starting at - chr1:500 and continuing to chr1:1000 - chr1:500..chr3:750 - only alignments overlapping the region starting at - chr1:500 and continuing to chr3:750. This 'spanning' - region assumes that the reference specified as the - right boundary will occur somewhere in the file after - the left boundary. On a sorted BAM, a REGION of - 'chr4:500..chr2:1500' will produce undefined - (incorrect) results. So don't do it. :) - - *Note: Most of the tools that accept a REGION string will perform without an - index file, but typically at great cost to performance (having to - plow through the entire file until the region of interest is found). - For optimum speed, be sure that index files are available for your - data. - - -forceCompression - -Force compression of BAM output. - - When tools are piped together (see details below), the default behavior is - to turn off compression. This can greatly increase performance when the data - does not have to be constantly decompressed and recompressed. This is - ignored any time an output BAM file is specified using "-out". - --------------------- -Piping --------------------- - -Many of the tools in BamTools can be chained together by piping. Any tool that -accepts stdin can be piped into, and any that can output stdout can be piped -from. For example: - -> bamtools filter -in data1.bam -in data2.bam -mapQuality ">50" | bamtools count - -will give a count of all alignments in your 2 BAM files with a mapQuality of -greater than 50. And of course, any tool writing to stdout can be piped into -other utilities. 
- --------------------- -The Tools --------------------- - - convert Converts between BAM and a number of other formats - count Prints number of alignments in BAM file(s) - coverage Prints coverage statistics from the input BAM file - filter Filters BAM file(s) by user-specified criteria - header Prints BAM header information - index Generates index for BAM file - merge Merge multiple BAM files into single file - random Select random alignments from existing BAM file(s) - sort Sorts the BAM file according to some criteria - split Splits a BAM file on user-specifed property, creating a new - BAM output file for each value found - stats Prints some basic statistics from input BAM file(s) - ----------- -convert ----------- - -Description: converts BAM to a number of other formats - -Usage: bamtools convert -format [-in -in ...] - [-out ] [other options] - -Input & Output: - -in the input BAM file(s) [stdin] - -out the output BAM file [stdout] - -format the output file format - see below for - supported formats - -Filters: - -region genomic region. Index file is recommended for - better performance, and is read - automatically if it exists. See 'bamtools - help index' for more details on creating - one. - -Pileup Options: - -fasta FASTA reference file - -mapqual print the mapping qualities - -SAM Options: - -noheader omit the SAM header from output - -Help: - --help, -h shows this help text - -** Notes ** - - - Currently supported output formats ( BAM -> X ) - - Format type FORMAT (command-line argument) - ------------ ------------------------------- - BED bed - FASTA fasta - FASTQ fastq - JSON json - Pileup pileup - SAM sam - YAML yaml - - Usage example: - > bamtools convert -format json -in myData.bam -out myData.json - - - Pileup Options have no effect on formats other than "pileup" - SAM Options have no effect on formats other than "sam" - ----------- -count ----------- - -Description: prints number of alignments in BAM file(s). 
- -Usage: bamtools count [-in -in ...] [-region ] - -Input & Output: - -in the input BAM file(s) [stdin] - -region genomic region. Index file is recommended - for better performance, and is used - automatically if it exists. See - 'bamtools help index' for more details - on creating one - -Help: - --help, -h shows this help text - ----------- -coverage ----------- - -Description: prints coverage data for a single BAM file. - -Usage: bamtools coverage [-in ] [-out ] - -Input & Output: - -in the input BAM file [stdin] - -out the output file [stdout] - -Help: - --help, -h shows this help text - ----------- -filter ----------- - -Description: filters BAM file(s). - -Usage: bamtools filter [-in -in ...] - [-out | [-forceCompression]] - [-region ] - [ [-script the input BAM file(s) [stdin] - -out the output BAM file [stdout] - -region only read data from this genomic region (see - README for more details) - -script the filter script file (see README for more - details) - -forceCompression if results are sent to stdout (like when - piping to another tool), default behavior - is to leave output uncompressed. 
Use this - flag to override and force compression - -General Filters: - -alignmentFlag keep reads with this *exact* alignment flag - (for more detailed queries, see below) - -insertSize keep reads with insert size that matches - pattern - -mapQuality <[0-255]> keep reads with map quality that matches - pattern - -name keep reads with name that matches pattern - -queryBases keep reads with motif that matches pattern - -tag keep reads with this key=>value pair - -Alignment Flag Filters: - -isDuplicate keep only alignments that are marked as - duplicate [true] - -isFailedQC keep only alignments that failed QC [true] - -isFirstMate keep only alignments marked as first mate - [true] - -isMapped keep only alignments that were mapped [true] - -isMateMapped keep only alignments with mates that mapped - [true] - -isMateReverseStrand keep only alignments with mate on reverse - strand [true] - -isPaired keep only alignments that were sequenced as - paired [true] - -isPrimaryAlignment keep only alignments marked as primary - [true] - -isProperPair keep only alignments that passed paired-end - resolution [true] - -isReverseStrand keep only alignments on reverse strand - [true] - -isSecondMate keep only alignments marked as second mate - [true] - -Help: - --help, -h shows this help text - - ***************** - * Filter Script * - ***************** - -The BamTools filter tool allows you to use an external filter script to define -complex filtering behavior. This script uses what I'm calling properties, -filters, and a rule - all implemented in a JSON syntax. 
- - ** Properties ** - -A 'property' is a typical JSON entry of the form: - - "propertyName" : "value" - -Here are the property names that BamTools will recognize: - - alignmentFlag - cigar - insertSize - isDuplicate - isFailedQC - isFirstMate - isMapped - isMateMapped - isMateReverseStrand - isPaired - isPrimaryAlignment - isProperPair - isReverseStrand - isSecondMate - mapQuality - matePosition - mateReference - name - position - queryBases - reference - tag - -For properties with boolean values, use the words "true" or "false". -For example, - - "isMapped" : "true" - -will keep only alignments that are flagged as 'mapped'. - -For properties with numeric values, use the desired number with optional -comparison operators ( >, >=, <, <=, !). For example, - - "mapQuality" : ">=75" - -will keep only alignments with mapQuality greater than or equal to 75. - -If you're familiar with JSON, you know that integers can be bare (without -quotes). However, if you a comparison operator, be sure to enclose in quotes. - -For string-based properties, the above operators are available. In addition, - you can also use some basic pattern-matching operators. For example, - - "reference" : "ALU*" // reference starts with 'ALU' - "name" : "*foo" // name ends with 'foo' - "cigar" : "*D*" // cigar contains a 'D' anywhere - -Notes - -The reference property refers to the reference name, not the BAM reference -numeric ID. - -The tag property has an extra layer, so that the syntax will look like this: - - "tag" : "XX:value" - -where XX is the 2-letter SAM/BAM tag and value is, well, the value. -Comparison operators can still apply to values, so tag properties of: - - "tag" : "AS:>60" - "tag" : "RG:foo*" - -are perfectly valid. - - ** Filters ** - -A 'filter' is a JSON container of properties that will be AND-ed together. 
For -example, - -{ - "reference" : "chr1", - "mapQuality" : ">50", - "tag" : "NM:<4" -} - -would result in an output BAM file containing only alignments from chr1 with a -mapQuality >50 and edit distance of less than 4. - -A single, unnamed filter like this is the minimum necessary for a complete -filter script. Save this file and use as the -script parameter and you should -be all set. - -Moving on to more potent filtering... - -You can also define multiple filters. -To do so, you just need to use the "filters" keyword along with JSON array -syntax, like this: - -{ - "filters" : - [ - { - "reference" : "chr1", - "mapQuality" : ">50" - }, - { - "reference" : "chr1", - "isReverseStrand" : "true" - } - ] -} - -These filters will be (inclusive) OR-ed together by default. So you'd get a -resulting BAM with only alignments from chr1 that had either mapQuality >50 or -on the reverse strand (or both). - - ** Rule ** - -Alternatively to anonymous OR-ed filters, you can also provide what I've called -a "rule". By giving each filter an "id", using this "rule" keyword you can -describe boolean relationships between your filter sets. - -Available rule operators: - - & // and - | // or - ! // not - -This might sound a little fuzzy at this point, so let's get back to an example: - -{ - "filters" : - [ - { - "id" : "filter1", - "reference" : "chr1", - "mapQuality" : ">50" - }, - { - "id" : "filter2", - "reference" : "chr1", - "isReverseStrand" : "true" - }, - { - "id" : "filter3", - "reference" : "chr1", - "queryBases" : "AGCT*" - } - ], - - "rule" : " (filter1 | filter2) & !filter3 " -} - -In this case, we would only retain aligments that passed filter 1 OR filter 2, -AND also NOT filter 3. - -These are dummy examples, and don't make much sense as an actual query case. But -hopefully this serves an adequate primer to get you started and discover the -potential flexibility here. - ----------- -header ----------- - -Description: prints header from BAM file(s). 
- -Usage: bamtools header [-in -in ...] - -Input & Output: - -in the input BAM file(s) [stdin] - -Help: - --help, -h shows this help text - ----------- -index ----------- - -Description: creates index for BAM file. - -Usage: bamtools index [-in ] [-bti] - -Input & Output: - -in the input BAM file [stdin] - -bti create (non-standard) BamTools index file - (*.bti). Default behavior is to create - standard BAM index (*.bai) - -Help: - --help, -h shows this help tex - ----------- -merge ----------- - -Description: merges multiple BAM files into one. - -Usage: bamtools merge [-in -in ...] - [-out | [-forceCompression]] [-region ] - -Input & Output: - -in the input BAM file(s) - -out the output BAM file - -forceCompression if results are sent to stdout (like when - piping to another tool), default behavior - is to leave output uncompressed. Use this - flag to override and force compression - -region genomic region. See README for more details - -Help: - --help, -h shows this help text - ----------- -random ----------- - -Description: grab a random subset of alignments. - -Usage: bamtools random [-in -in ...] - [-out ] [-forceCompression] [-n] - [-region ] - -Input & Output: - -in the input BAM file [stdin] - -out the output BAM file [stdout] - -forceCompression if results are sent to stdout (like when - piping to another tool), default behavior - is to leave output uncompressed. Use this - flag to override and force compression - -region only pull random alignments from within this - genomic region. Index file is - recommended for better performance, and - is used automatically if it exists. See - 'bamtools help index' for more details - on creating one - -Settings: - -n number of alignments to grab. Note that no - duplicate checking is performed [10000] - -Help: - --help, -h shows this help text - ----------- -sort ----------- - -Description: sorts a BAM file. 
- -Usage: bamtools sort [-in ] [-out ] [sortOptions] - -Input & Output: - -in the input BAM file [stdin] - -out the output BAM file [stdout] - -Sorting Methods: - -byname sort by alignment name - -Memory Settings: - -n max number of alignments per tempfile - [10000] - -mem max memory to use [1024] - -Help: - --help, -h shows this help text - ----------- -split ----------- - -Description: splits a BAM file on user-specified property, creating a new BAM -output file for each value found. - -Usage: bamtools split [-in ] [-stub ] - < -mapped | -paired | -reference | -tag > - -Input & Output: - -in the input BAM file [stdin] - -stub prefix stub for output BAM files (default - behavior is to use input filename, - without .bam extension, as stub). If - input is stdin and no stub provided, a - timestamp is generated as the stub. - -Split Options: - -mapped split mapped/unmapped alignments - -paired split single-end/paired-end alignments - -reference split alignments by reference - -tag splits alignments based on all values of TAG - encountered (i.e. -tag RG creates a BAM - file for each read group in original - BAM file) - -Help: - --help, -h shows this help text - ----------- -stats ----------- - -Description: prints general alignment statistics. - -Usage: bamtools stats [-in -in ...] [statsOptions] +Installation steps, tutorial, API documentation, etc. are all now available +through the BamTools project wiki: -Input & Output: - -in the input BAM file [stdin] +https://github.com/pezmaster31/bamtools/wiki -Additional Stats: - -insert summarize insert size data +Join the mailing list(s) to stay informed of updates or get involved with +contributing: -Help: - --help, -h shows this help text +https://github.com/pezmaster31/bamtools/wiki/Mailing-lists -------------------------------------------------------------------------------- -IV. License : +II. 
License :
 --------------------------------------------------------------------------------
 
 Both the BamTools API and toolkit are released under the MIT License.
@@ -800,7 +38,7 @@ Copyright (c) 2009-2010 Derek Barnett, Erik Garrison, Gabor Marth,
 See included file LICENSE for details.
 
 --------------------------------------------------------------------------------
-V. Acknowledgements :
+III. Acknowledgements :
 --------------------------------------------------------------------------------
 
 * Aaron Quinlan for several key feature ideas and bug fix contributions
@@ -808,7 +46,7 @@ V. Acknowledgements :
 * Heng Li, author of SAMtools - the original C-language BAM API/toolkit.
 
 --------------------------------------------------------------------------------
-VI. Contact :
+IV. Contact :
 --------------------------------------------------------------------------------
 
 Feel free to contact me with any questions, comments, suggestions, bug reports,
@@ -818,6 +56,5 @@ Derek Barnett
 Marth Lab
 Biology Dept., Boston College
 
-Email: barnetde@bc.edu
-Project Websites: http://github.com/pezmaster31/bamtools (ACTIVE SUPPORT)
-                  http://sourceforge.net/projects/bamtools (major updates only)
+Email: derekwbarnett@gmail.com
+Project Website: http://github.com/pezmaster31/bamtools