From: derek Date: Sun, 24 Oct 2010 19:26:15 +0000 (-0400) Subject: Updated README X-Git-Url: https://git.donarmstrong.com/?a=commitdiff_plain;h=fe07d12a66f7a43128e3417c95a44732b831e97a;p=bamtools.git Updated README --- diff --git a/README b/README index 6e1a3ab..00d9658 100644 --- a/README +++ b/README @@ -23,100 +23,100 @@ I. Introduction: -------------------------------------------------------------------------------- BamTools provides both a programmer's API and an end-user's toolkit for handling -BAM files. +BAM files. ---------------------------------------- -Ia. The API: +Ia. The API: ---------------------------------------- -The API consists of 2 main modules: BamReader and BamWriter. As you would -expect, BamReader provides read-access to BAM files, while BamWriter handles -writing data to BAM files. BamReader provides the interface for random-access +The API consists of 2 main modules: BamReader and BamWriter. As you would +expect, BamReader provides read-access to BAM files, while BamWriter handles +writing data to BAM files. BamReader provides the interface for random-access (jumping) in a BAM file, as well as generating BAM index files. - -BamMultiReader is an extra module that allows you to manage multiple open BAM -files for reading. It provides some validation & bookkeeping under the hood to + +BamMultiReader is an extra module that allows you to manage multiple open BAM +files for reading. It provides some validation & bookkeeping under the hood to keep all files sync'ed up for you. Additional files used by the API: - - BamAlignment.* : implements the BamAlignment data structure + - BamAlignment.* : implements the BamAlignment data structure - - BamAux.h : contains various constants, data structures and utility + - BamAux.h : contains various constants, data structures and utility methods used throught the API. - - BamIndex.* : implements both the standard BAM format index (".bai") as + - BamIndex.* : implements both the standard BAM format index (".bai") as well as a new BamTools-specific index (".bti"). - - BGZF.* : contains our implementation of the Broad Institute's BGZF + - BGZF.* : contains our implementation of the Broad Institute's BGZF compression format. ---------------------------------------- Ib. The Toolkit: ---------------------------------------- -If you've been using the BamTools since the early days, you'll notice that our -'toy' API examples (BamConversion, BamDump, BamTrim,...) are now gone. We have -dumped these in favor of a suite of small utilities that we hope both +If you've been using the BamTools since the early days, you'll notice that our +'toy' API examples (BamConversion, BamDump, BamTrim,...) are now gone. We have +dumped these in favor of a suite of small utilities that we hope both developers and end-users find useful: usage: bamtools [--help] COMMAND [ARGS] Available bamtools commands: - convert Converts between BAM and a number of other formats - count Prints number of alignments in BAM file(s) - coverage Prints coverage statistics from the input BAM file - filter Filters BAM file(s) by user-specified criteria - header Prints BAM header information - index Generates index for BAM file - merge Merge multiple BAM files into single file - random Select random alignments from existing BAM file(s) - sort Sorts the BAM file according to some criteria - split Splits a BAM file on user-specifed property, creating a + convert Converts between BAM and a number of other formats + count Prints number of alignments in BAM file(s) + coverage Prints coverage statistics from the input BAM file + filter Filters BAM file(s) by user-specified criteria + header Prints BAM header information + index Generates index for BAM file + merge Merge multiple BAM files into single file + random Select random alignments from existing BAM file(s) + sort Sorts the BAM file according to some criteria + split Splits a BAM file on user-specifed property, creating a new BAM output file for each value found - stats Prints some basic statistics from input BAM file(s) + stats Prints some basic statistics from input BAM file(s) See 'bamtools help COMMAND' for more information on a specific command. -------------------------------------------------------------------------------- -II. Usage : +II. Usage : -------------------------------------------------------------------------------- -** General usage information - perhaps explain common terms, point to SAM/BAM +** General usage information - perhaps explain common terms, point to SAM/BAM spec, etc ** ---------------------------------------- IIa. The API ---------------------------------------- -The API, as noted above, contains 2 main modules - BamReader & BamWriter - for -dealing with BAM files. Alignment data is made available through the +The API, as noted above, contains 2 main modules - BamReader & BamWriter - for +dealing with BAM files. Alignment data is made available through the BamAlignment data structure. -A simple (read-only) scenario for accessing BAM data would look like the +A simple (read-only) scenario for accessing BAM data would look like the following: - // open our BamReader - BamReader reader; - reader.Open("someData.bam", "someData.bam.bai"); + // open our BamReader + BamReader reader; + reader.Open("someData.bam", "someData.bam.bai"); - // define our region of interest - // in this example: bases 0-500 on the reference "chrX" - int id = reader.GetReferenceID("chrX"); - BamRegion region(id, 0, id, 500); - reader.SetRegion(region); + // define our region of interest + // in this example: bases 0-500 on the reference "chrX" + int id = reader.GetReferenceID("chrX"); + BamRegion region(id, 0, id, 500); + reader.SetRegion(region); - // iterate through alignments in this region, - // ignoring alignments with a MQ below some cutoff - BamAlignment al; - while ( reader.GetNextAlignment(al) ) { - if ( al.MapQuality >= 50 ) - // do something - } + // iterate through alignments in this region, + // ignoring alignments with a MQ below some cutoff + BamAlignment al; + while ( reader.GetNextAlignment(al) ) { + if ( al.MapQuality >= 50 ) + // do something + } - // close the reader - reader.Close(); + // close the reader + reader.Close(); To use this API in your application, you simply need to do 3 things: @@ -125,22 +125,22 @@ To use this API in your application, you simply need to do 3 things: 2 - Import BamTools API with the following lines of code #include "BamReader.h" // (or "BamMultiReader.h") as needed #include "BamWriter.h" // as needed - using namespace BamTools; // all of BamTools classes/methods live in + using namespace BamTools; // all of BamTools classes/methods live in // this namespace - + 3 - Link with '-lz' ('l' as in Lima) to access ZLIB compression library - (For MSVC users, I can provide you modified zlib headers - just contact + (For MSVC users, I can provide you modified zlib headers - just contact me if needed). -See any included programs and Makefile for more specific compiling/usage +See any included programs and Makefile for more specific compiling/usage examples. See comments in the header files for more detailed API documentation. ---------------------------------------- IIb. The Toolkit ---------------------------------------- -BamTools provides a small, but powerful suite of command-line utility programs -for manipulating and querying BAM files for data. +BamTools provides a small, but powerful suite of command-line utility programs +for manipulating and querying BAM files for data. -------------------- Input/Output @@ -151,10 +151,10 @@ These include: -in -The input BAM files(s). +The input BAM files(s). - If a tool accepts multiple BAM files as input, each file gets its own "-in" - option on the command line. If no "-in" is provided, the tool will attempt + If a tool accepts multiple BAM files as input, each file gets its own "-in" + option on the command line. If no "-in" is provided, the tool will attempt to read BAM data from stdin. To read a single BAM file, use a single "-in" option: @@ -163,78 +163,78 @@ The input BAM files(s). To read multiple BAM files, use multiple "-in" options: > bamtools *tool* -in myData1.bam -in myData2.bam ...ARGS... - To read from stdin (if supported), omit the "-in" option: + To read from stdin (if supported), omit the "-in" option: > bamtools *tool* ...ARGS... -out -The output BAM file. +The output BAM file. If a tool outputs a result BAM file, specify the filename using this option. - If none is provided, the tool will typically write to stdout. + If none is provided, the tool will typically write to stdout. *Note: Not all tools output BAM data (e.g. count, header, etc.) -region -A region of interest. See below for accepted 'REGION string' formats. +A region of interest. See below for accepted 'REGION string' formats. - Many of the tools accept this option, which allows a user to only consider - alignments that overlap this region (whether counting, filtering, merging, - etc.). + Many of the tools accept this option, which allows a user to only consider + alignments that overlap this region (whether counting, filtering, merging, + etc.). An alignment is considered to overlap a region if any part of the alignments - intersects the left/right boundaries. Thus, a 50bp alignment at position 70 + intersects the left/right boundaries. Thus, a 50bp alignment at position 70 will overlap a region beginning at position 100. REGION string format ---------------------- - A proper REGION string can be formatted like any of the following examples: - where 'chr1' is the name of a reference (not its ID)and '' is any valid + A proper REGION string can be formatted like any of the following examples: + where 'chr1' is the name of a reference (not its ID)and '' is any valid integer position within that reference. - To read + To read chr1 - only alignments on (entire) reference 'chr1' - chr1:500 - only alignments overlapping the region starting at + chr1:500 - only alignments overlapping the region starting at chr1:500 and continuing to the end of chr1 - chr1:500..1000 - only alignments overlapping the region starting at + chr1:500..1000 - only alignments overlapping the region starting at chr1:500 and continuing to chr1:1000 - chr1:500..chr3:750 - only alignments overlapping the region starting at - chr1:500 and continuing to chr3:750. This 'spanning' - region assumes that the reference specified as the - right boundary will occur somewhere in the file after - the left boundary. On a sorted BAM, a REGION of - 'chr4:500..chr2:1500' will produce undefined + chr1:500..chr3:750 - only alignments overlapping the region starting at + chr1:500 and continuing to chr3:750. This 'spanning' + region assumes that the reference specified as the + right boundary will occur somewhere in the file after + the left boundary. On a sorted BAM, a REGION of + 'chr4:500..chr2:1500' will produce undefined (incorrect) results. So don't do it. :) *Note: Most of the tools that accept a REGION string will perform without an - index file, but typically at great cost to performance (having to - plow through the entire file until the region of interest is found). - For optimum speed, be sure that index files are available for your + index file, but typically at great cost to performance (having to + plow through the entire file until the region of interest is found). + For optimum speed, be sure that index files are available for your data. -forceCompression Force compression of BAM output. - When tools are piped together (see details below), the default behavior is + When tools are piped together (see details below), the default behavior is to turn off compression. This can greatly increase performance when the data - does not have to be constantly decompressed and recompressed. This is - ignored any time an output BAM file is specified using "-out". + does not have to be constantly decompressed and recompressed. This is + ignored any time an output BAM file is specified using "-out". -------------------- Piping -------------------- -Many of the tools in BamTools can be chained together by piping. Any tool that -accepts stdin can be piped into, and any that can output stdout can be piped +Many of the tools in BamTools can be chained together by piping. Any tool that +accepts stdin can be piped into, and any that can output stdout can be piped from. For example: > bamtools filter -in data1.bam -in data2.bam -mapQuality ">50" | bamtools count -will give a count of all alignments in your 2 BAM files with a mapQuality of -greater than 50. And of course, any tool writing to stdout can be piped into -other utilities. +will give a count of all alignments in your 2 BAM files with a mapQuality of +greater than 50. And of course, any tool writing to stdout can be piped into +other utilities. -------------------- The Tools @@ -242,7 +242,7 @@ The Tools convert Converts between BAM and a number of other formats count Prints number of alignments in BAM file(s) - coverage Prints coverage statistics from the input BAM file + coverage Prints coverage statistics from the input BAM file filter Filters BAM file(s) by user-specified criteria header Prints BAM header information index Generates index for BAM file @@ -259,20 +259,20 @@ convert Description: converts BAM to a number of other formats -Usage: bamtools convert -format [-in -in ...] +Usage: bamtools convert -format [-in -in ...] [-out ] [other options] Input & Output: -in the input BAM file(s) [stdin] -out the output BAM file [stdout] - -format the output file format - see below for + -format the output file format - see below for supported formats Filters: -region genomic region. Index file is recommended for - better performance, and is read + better performance, and is read automatically if it exists. See 'bamtools - help index' for more details on creating + help index' for more details on creating one. Pileup Options: @@ -288,14 +288,14 @@ Help: ** Notes ** - Currently supported output formats ( BAM -> X ) - + Format type FORMAT (command-line argument) ------------ ------------------------------- BED bed FASTA fasta FASTQ fastq JSON json - Pileup pileup + Pileup pileup SAM sam YAML yaml @@ -303,7 +303,7 @@ Help: > bamtools convert -format json -in myData.bam -out myData.json - Pileup Options have no effect on formats other than "pileup" - SAM Options have no effect on formats other than "sam" + SAM Options have no effect on formats other than "sam" ---------- count @@ -314,69 +314,414 @@ Description: prints number of alignments in BAM file(s). Usage: bamtools count [-in -in ...] [-region ] Input & Output: - -in the input BAM file(s) [stdin] - -Filters: - -region genomic region. Index file is required and - is read automatically if it exists. See + -in the input BAM file(s) [stdin] + -region genomic region. Index file is recommended + for better performance, and is used + automatically if it exists. See 'bamtools help index' for more details - on creating one. + on creating one Help: - --help, -h shows this help text - + --help, -h shows this help text ---------- coverage ---------- +Description: prints coverage data for a single BAM file. + +Usage: bamtools coverage [-in ] [-out ] + +Input & Output: + -in the input BAM file [stdin] + -out the output file [stdout] + +Help: + --help, -h shows this help text ---------- filter ---------- +Description: filters BAM file(s). + +Usage: bamtools filter [-in -in ...] + [-out | [-forceCompression]] + [-region ] + [ [-script the input BAM file(s) [stdin] + -out the output BAM file [stdout] + -region only read data from this genomic region (see + README for more details) + -script the filter script file (see README for more + details) + -forceCompression if results are sent to stdout (like when + piping to another tool), default behavior + is to leave output uncompressed. Use this + flag to override and force compression + +General Filters: + -alignmentFlag keep reads with this *exact* alignment flag + (for more detailed queries, see below) + -insertSize keep reads with insert size that matches + pattern + -mapQuality <[0-255]> keep reads with map quality that matches + pattern + -name keep reads with name that matches pattern + -queryBases keep reads with motif that matches pattern + -tag keep reads with this key=>value pair + +Alignment Flag Filters: + -isDuplicate keep only alignments that are marked as + duplicate [true] + -isFailedQC keep only alignments that failed QC [true] + -isFirstMate keep only alignments marked as first mate + [true] + -isMapped keep only alignments that were mapped [true] + -isMateMapped keep only alignments with mates that mapped + [true] + -isMateReverseStrand keep only alignments with mate on reverse + strand [true] + -isPaired keep only alignments that were sequenced as + paired [true] + -isPrimaryAlignment keep only alignments marked as primary + [true] + -isProperPair keep only alignments that passed paired-end + resolution [true] + -isReverseStrand keep only alignments on reverse strand + [true] + -isSecondMate keep only alignments marked as second mate + [true] + +Help: + --help, -h shows this help text + + ***************** + * Filter Script * + ***************** + +The BamTools filter tool allows you to use an external filter script to define +complex filtering behavior. This script uses what I'm calling properties, +filters, and a rule - all implemented in a JSON syntax. + + ** Properties ** + +A 'property' is a typical JSON entry of the form: + + "propertyName" : "value" + +Here are the property names that BamTools will recognize: + + alignmentFlag + cigar + insertSize + isDuplicate + isFailedQC + isFirstMate + isMapped + isMateMapped + isMateReverseStrand + isPaired + isPrimaryAlignment + isProperPair + isReverseStrand + isSecondMate + mapQuality + matePosition + mateReference + name + position + queryBases + reference + tag + +For properties with boolean values, use the words "true" or "false". +For example, + + "isMapped" : "true" + +will keep only alignments that are flagged as 'mapped'. + +For properties with numeric values, use the desired number with optional +comparison operators ( >, >=, <, <=, !). For example, + + "mapQuality" : ">=75" + +will keep only alignments with mapQuality greater than or equal to 75. + +If you're familiar with JSON, you know that integers can be bare (without +quotes). However, if you a comparison operator, be sure to enclose in quotes. + +For string-based properties, the above operators are available. In addition, + you can also use some basic pattern-matching operators. For example, + + "reference" : "ALU*" // reference starts with 'ALU' + "name" : "*foo" // name ends with 'foo' + "cigar" : "*D*" // cigar contains a 'D' anywhere + +Notes - +The reference property refers to the reference name, not the BAM reference +numeric ID. + +The tag property has an extra layer, so that the syntax will look like this: + + "tag" : "XX:value" + +where XX is the 2-letter SAM/BAM tag and value is, well, the value. +Comparison operators can still apply to values, so tag properties of: + + "tag" : "AS:>60" + "tag" : "RG:foo*" + +are perfectly valid. + + ** Filters ** + +A 'filter' is a JSON container of properties that will be AND-ed together. For +example, + +{ + "reference" : "chr1", + "mapQuality" : ">50", + "tag" : "NM:<4" +} + +would result in an output BAM file containing only alignments from chr1 with a +mapQuality >50 and edit distance of less than 4. + +A single, unnamed filter like this is the minimum necessary for a complete +filter script. Save this file and use as the -script parameter and you should +be all set. + +Moving on to more potent filtering... + +You can also define multiple filters. +To do so, you just need to use the "filters" keyword along with JSON array +syntax, like this: + +{ + "filters" : + [ + { + "reference" : "chr1", + "mapQuality" : ">50" + }, + { + "reference" : "chr1", + "isReverseStrand" : "true" + } + ] +} + +These filters will be (inclusive) OR-ed together by default. So you'd get a +resulting BAM with only alignments from chr1 that had either mapQuality >50 or +on the reverse strand (or both). + + ** Rule ** + +Alternatively to anonymous OR-ed filters, you can also provide what I've called +a "rule". By giving each filter an "id", using this "rule" keyword you can +describe boolean relationships between your filter sets. + +Available rule operators: + + & // and + | // or + ! // not + +This might sound a little fuzzy at this point, so let's get back to an example: + +{ + "filters" : + [ + { + "id" : "filter1", + "reference" : "chr1", + "mapQuality" : ">50" + }, + { + "id" : "filter2", + "reference" : "chr1", + "isReverseStrand" : "true" + }, + { + "id" : "filter3", + "reference" : "chr1", + "queryBases" : "AGCT*" + } + ], + + "rule" : " (filter1 | filter2) & !filter3 " +} + +In this case, we would only retain aligments that passed filter 1 OR filter 2, +AND also NOT filter 3. + +These are dummy examples, and don't make much sense as an actual query case. But +hopefully this serves an adequate primer to get you started and discover the +potential flexibility here. ---------- header ---------- +Description: prints header from BAM file(s). + +Usage: bamtools header [-in -in ...] + +Input & Output: + -in the input BAM file(s) [stdin] + +Help: + --help, -h shows this help text ---------- index ---------- +Description: creates index for BAM file. + +Usage: bamtools index [-in ] [-bti] + +Input & Output: + -in the input BAM file [stdin] + -bti create (non-standard) BamTools index file + (*.bti). Default behavior is to create + standard BAM index (*.bai) + +Help: + --help, -h shows this help tex ---------- merge ---------- +Description: merges multiple BAM files into one. + +Usage: bamtools merge [-in -in ...] + [-out | [-forceCompression]] [-region ] + +Input & Output: + -in the input BAM file(s) + -out the output BAM file + -forceCompression if results are sent to stdout (like when + piping to another tool), default behavior + is to leave output uncompressed. Use this + flag to override and force compression + -region genomic region. See README for more details + +Help: + --help, -h shows this help text ---------- random ---------- +Description: grab a random subset of alignments. + +Usage: bamtools random [-in -in ...] + [-out ] [-forceCompression] [-n] + [-region ] + +Input & Output: + -in the input BAM file [stdin] + -out the output BAM file [stdout] + -forceCompression if results are sent to stdout (like when + piping to another tool), default behavior + is to leave output uncompressed. Use this + flag to override and force compression + -region only pull random alignments from within this + genomic region. Index file is + recommended for better performance, and + is used automatically if it exists. See + 'bamtools help index' for more details + on creating one + +Settings: + -n number of alignments to grab. Note that no + duplicate checking is performed [10000] + +Help: + --help, -h shows this help text ---------- sort ---------- +Description: sorts a BAM file. + +Usage: bamtools sort [-in ] [-out ] [sortOptions] + +Input & Output: + -in the input BAM file [stdin] + -out the output BAM file [stdout] + +Sorting Methods: + -byname sort by alignment name + +Memory Settings: + -n max number of alignments per tempfile + [10000] + -mem max memory to use [1024] + +Help: + --help, -h shows this help text ---------- split ---------- +Description: splits a BAM file on user-specified property, creating a new BAM +output file for each value found. + +Usage: bamtools split [-in ] [-stub ] + < -mapped | -paired | -reference | -tag > + +Input & Output: + -in the input BAM file [stdin] + -stub prefix stub for output BAM files (default + behavior is to use input filename, + without .bam extension, as stub). If + input is stdin and no stub provided, a + timestamp is generated as the stub. + +Split Options: + -mapped split mapped/unmapped alignments + -paired split single-end/paired-end alignments + -reference split alignments by reference + -tag splits alignments based on all values of TAG + encountered (i.e. -tag RG creates a BAM + file for each read group in original + BAM file) + +Help: + --help, -h shows this help text ---------- stats ---------- +Description: prints general alignment statistics. + +Usage: bamtools stats [-in -in ...] [statsOptions] + +Input & Output: + -in the input BAM file [stdin] + +Additional Stats: + -insert summarize insert size data + +Help: + --help, -h shows this help text -------------------------------------------------------------------------------- III. License : -------------------------------------------------------------------------------- Both the BamTools API and toolkit are released under the MIT License. -Copyright (c) 2009-2010 Derek Barnett, Erik Garrison, Gabor Marth, +Copyright (c) 2009-2010 Derek Barnett, Erik Garrison, Gabor Marth, Michael Stromberg See included file LICENSE for details. @@ -393,14 +738,13 @@ IV. Acknowledgements : V. Contact : -------------------------------------------------------------------------------- -Feel free to contact me with any questions, comments, suggestions, bug reports, +Feel free to contact me with any questions, comments, suggestions, bug reports, etc. - + Derek Barnett Marth Lab Biology Dept., Boston College -Email: barnetde@bc.edu +Email: barnetde@bc.edu Project Websites: http://github.com/pezmaster31/bamtools (ACTIVE SUPPORT) http://sourceforge.net/projects/bamtools (major updates only) -