Fixed: bug related to accessing data (or regions with no alignments) near the ends of references, when using .bti index files. Required modifying the BamToolsIndex build step. *NOTE: This update invalidates any existing .bti files; please re-generate any that you currently have.* The versioning system in BTI will not allow users to use the older, buggy version, so there is no chance of accidental usage.
On second thought, moved the (non-indexing) constants back to BamAux.h, since they are not technically specific to BamWriter only and clients may be using them in their code
Added new PileupEngine to the toolkit. This is used by CoverageTool as well as ConvertTool for pileup format. Pileup conversion output before was buggy and overall incorrect. Now should match SAMtools output to the best of my knowledge
Added option for users to specify a half-open (1-based) return value for BamAlignment::GetEndPosition(). By default, following a recent modification, it returns a 0-based coordinate.
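A minimal sketch of the two end-coordinate conventions, assuming the end position is derived from the 0-based start plus the reference-consuming CIGAR operations. The names CigarOp and endPosition are hypothetical and are not the BamTools API; "half-open" is taken here to mean one past the last covered base.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a CIGAR operation (not the BamTools type).
struct CigarOp {
    char type;        // one of M, I, D, N, S, H, P, =, X
    unsigned length;  // number of bases the operation spans
};

// Computes the end coordinate of an alignment that starts at the 0-based
// reference coordinate `position`.
//   halfOpen == true  : one past the last covered reference base
//   halfOpen == false : the last covered reference base (0-based, closed)
// Assumption: the CIGAR consumes at least one reference base; otherwise the
// closed form yields position - 1.
int endPosition(int position, const std::vector<CigarOp>& cigar, bool halfOpen) {
    assert(position >= 0);
    int end = position;
    for (const CigarOp& op : cigar) {
        switch (op.type) {
            // Only these operations advance along the reference.
            case 'M': case 'D': case 'N': case '=': case 'X':
                end += static_cast<int>(op.length);
                break;
            default:
                break;  // I, S, H, P do not consume reference bases
        }
    }
    return halfOpen ? end : end - 1;
}
```

For a 50M alignment starting at 100, this sketch gives 150 for the half-open form and 149 for the closed form.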
Reimplemented BamToolsIndex for bug fix and performance upgrade. *** NOTE *** This commit invalidates any previous BamToolsIndex files (.bti). Please re-run 'bamtools index -bti -in <yourBam>' to generate the new index files.
Reorganizing index/jumping calls. Now all BamReader cares about is sending a Jump() request to a BamIndex with a desired region and receiving a success/fail flag. BamIndex-derived classes will now handle all index-format-specific offset calculation, overlap checking, etc., and make sure the associated BGZF stream has seeked as close to the desired region as that index scheme allows.
Relicensed the source code used in bamtools_options. Most of it derived from Mosaik source, which was originally released under GPL. Now under MIT License with author's permission. This puts the entire BamTools codebase securely under one unrestrictive license.
* Moved FileExists() to BamAux.h so that all API classes have access to its functionality.
* Created 2 'factory methods' in BamIndex.h to return a BamIndex subclass, depending on client's specified PreferredIndexType & on what files actually exist on disk.
* Renamed BamDefaultIndex as BamStandardIndex. Hopefully this name should be a clearer description going forward than BamDefaultIndex, since the standardized index may not always be the 'default' in every situation.
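The selection logic behind such a factory could look roughly like the sketch below. Everything here is an assumption made for illustration: the IndexType enum, the chooseIndexType name, the injected `exists` predicate, the `<bam>.bai` / `<bam>.bti` naming, and the fall-back to the other format when the preferred file is missing. The actual BamIndex.h factory methods return a BamIndex subclass, not an enum.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical enum for the sketch (not the BamTools definition).
enum class IndexType { None, Standard, BamTools };

// Picks an index format from the caller's preference and from which index
// files exist. `exists` is injected so the logic can be exercised without
// touching the disk.
IndexType chooseIndexType(const std::string& bamFilename,
                          IndexType preferred,
                          const std::function<bool(const std::string&)>& exists) {
    assert(preferred != IndexType::None);
    const bool hasStandard = exists(bamFilename + ".bai");
    const bool hasBamTools = exists(bamFilename + ".bti");

    if (preferred == IndexType::BamTools) {
        if (hasBamTools) return IndexType::BamTools;
        if (hasStandard) return IndexType::Standard;  // assumed fall-back
    } else {
        if (hasStandard) return IndexType::Standard;
        if (hasBamTools) return IndexType::BamTools;  // assumed fall-back
    }
    return IndexType::None;  // no usable index file found
}
```

Passing the existence check as a predicate keeps the decision table separate from file-system access, which is one way to keep it testable.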
Modified Utilities::FileExists() so it doesn't rely on sys/stat.h. While this header is de facto provided and supported on *most* systems, it really is not standard C/C++, so it can't be trusted to be fully portable.
Cleaned up index file handling throughout toolkit. Did this by adding a FileExists() method to BamMultiReader for determining which index file to load.
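One portable way to avoid sys/stat.h is to try opening the file with a standard stream. The sketch below is an assumption about how such a check might be written, not the actual Utilities::FileExists() code; the fileExists name is hypothetical. Note that it really tests "can be opened for reading", which differs from existence when permissions deny access.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Returns true if `filename` can be opened for reading using only the
// standard library. An unreadable but existing file reports false.
bool fileExists(const std::string& filename) {
    assert(!filename.empty());  // debug-only precondition for this sketch
    std::ifstream stream(filename.c_str(), std::ifstream::in);
    return stream.is_open();
}
```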
Added uncompressed output as default behavior for Filter-, Merge-, and RandomTools when sending results to stdout. User can override this behavior using the command line option: -forceCompression
Derek [Mon, 30 Aug 2010 19:57:55 +0000 (15:57 -0400)]
Fixed: Calls to GetEndPosition() rely on CIGAR data being parsed. Previously this was not set if BamAlignment was retrieved via GetNextAlignmentCore(). Moved CIGAR parsing back to LoadNextAlignment() to ensure this works properly.
Fixed variable length tag data retrieval in BamAlignment::GetTag(). To do this, removed templated GetTag(). Now have explicit overloaded string, unsigned int, signed int, and float flavors of the method.
Added basic implementation of RandomTool. This generates a random (well, pseudo... it's based on rand()) subset of alignments from BAM file(s). User can specify REGION and/or number of alignments to generate. No duplicate checking implemented. TODO: Handle BAM files without existing index - tool currently depends heavily on being able to jump around randomly
Fixed Rewind(). Now using LoadNextAlignment() instead of GetNextAlignmentCore(). GNAC() does region checks which, in this case of clearing prior region data, are unnecessary at best and most likely erroneous.
Modified handling of BamAlignmentSupportData. This fix should allow BamWriter::SaveAlignment to still handle BamAlignments retrieved using the 'standard' GetNextAlignment[Core] correctly, while adding support for BamAlignments **generated directly** in client code. Switched BASD::IsParsed to HasCoreOnly, which is false by default and only set by BamReader::GetNextAlignmentCore() - therefore the client should never need to touch the BASD struct.
Added support for reading FASTA sequences, as well as generating FASTA index (.fai) files. TODO: need to drop FASTA functionality into pileup conversion, as well as create command line feature to generate FASTA indices
Reorganized convert tool code. Restored stdin by default. Implemented FASTA/FASTQ convert methods. Still need to include support for new (.bti) index file format
Erik Garrison [Wed, 7 Jul 2010 19:54:39 +0000 (15:54 -0400)]
Merge BamMultiReader and SetRegion into bamtools convert
This commit merges the BamMultiReader and SetRegion method into the
conversion tool. This greatly simplifies the process of dumping
alignments from regions in a set of bam files.
Breaking in this commit: stdin input by default. To be fixed in a
subsequent commit.
Erik Garrison [Wed, 7 Jul 2010 19:32:48 +0000 (15:32 -0400)]
Remove heavy-handed failure mode in BamMultiReader::SetRegion
In practice a failure of BamReader::SetRegion means that we can't get
alignments from the specified region. It is simpler to ignore failures
of SetRegion as they are gracefully handled by UpdateAlignments, which
simply doesn't add alignments from the readers which don't have
alignments in the target region.
This resolves a bug in which bamtools count (and any other utility using
BamMultiReader::SetRegion) would crash when provided a target region
with no alignments.
Derek [Mon, 28 Jun 2010 19:10:28 +0000 (15:10 -0400)]
Modified BamReader(and BGZF)::Open() to return bool. Tried to eliminate most exit() calls. These changes should allow for more graceful error handling. Some 'code cleanup' in BW, but no logic changes.
Derek [Tue, 22 Jun 2010 02:23:32 +0000 (22:23 -0400)]
Rough implementation of sort tool. Generate lots of smaller sorted (STL sort with 'custom' compare fxn) temp files in one pass, using a specified buffer size. Uses BamMultiReader paired with BamWriter to re-merge all temp files back into a single output BAM, also in one pass. Some work remains on optimizing parameters (e.g. default buffer size), scalability (single pass generates lots of temp files), & parallelization to make tool more sophisticated & robust.
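The two passes described above can be sketched as follows. This is an illustration under stated assumptions, not the sort tool's code: in-memory vectors stand in for the temp BAM files, a priority queue stands in for the BamMultiReader re-merge, and Record, byPosition, makeSortedRuns and mergeRuns are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical minimal alignment record: reference ID and 0-based position.
struct Record { int refId; int position; };

// 'Custom' compare function: order by reference, then by position.
bool byPosition(const Record& a, const Record& b) {
    if (a.refId != b.refId) return a.refId < b.refId;
    return a.position < b.position;
}

// Pass 1: cut the input into buffer-sized chunks and sort each one.
// Each returned run plays the role of one sorted temp file.
std::vector<std::vector<Record>> makeSortedRuns(const std::vector<Record>& input,
                                                std::size_t bufferSize) {
    assert(bufferSize > 0);
    std::vector<std::vector<Record>> runs;
    for (std::size_t i = 0; i < input.size(); i += bufferSize) {
        const std::size_t end = std::min(i + bufferSize, input.size());
        std::vector<Record> run(input.begin() + static_cast<std::ptrdiff_t>(i),
                                input.begin() + static_cast<std::ptrdiff_t>(end));
        std::sort(run.begin(), run.end(), byPosition);
        runs.push_back(run);
    }
    return runs;
}

// Pass 2: k-way merge of the sorted runs into one sorted output.
std::vector<Record> mergeRuns(const std::vector<std::vector<Record>>& runs) {
    typedef std::pair<std::size_t, std::size_t> Cursor;  // (run index, offset)
    // Inverted comparison so the smallest record sits on top of the heap.
    auto later = [&runs](const Cursor& a, const Cursor& b) {
        return byPosition(runs[b.first][b.second], runs[a.first][a.second]);
    };
    std::priority_queue<Cursor, std::vector<Cursor>, decltype(later)> heap(later);
    for (std::size_t r = 0; r < runs.size(); ++r) {
        if (!runs[r].empty()) heap.push(Cursor(r, 0));
    }
    std::vector<Record> output;
    while (!heap.empty()) {
        const Cursor cursor = heap.top();
        heap.pop();
        output.push_back(runs[cursor.first][cursor.second]);
        if (cursor.second + 1 < runs[cursor.first].size()) {
            heap.push(Cursor(cursor.first, cursor.second + 1));
        }
    }
    return output;
}
```

A smaller buffer means less memory per chunk but more runs to merge, which is the scalability trade-off the commit message mentions.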
Erik Garrison [Mon, 21 Jun 2010 15:56:39 +0000 (11:56 -0400)]
Gracefully handle empty files with the BamMultiReader
This commit handles the case where an empty BAM file is passed in the
list of filenames given to BamMultiReader::Open(...). Now a warning is
emitted if the file contains no alignments (or cannot be opened) and the
file is ignored.
Erik Garrison [Fri, 18 Jun 2010 19:13:46 +0000 (15:13 -0400)]
integration of SetRegion into BamMultiReader
Also includes update to bamtools_count which uses the BamMultiReader by
default and no longer requires the specification of an index file on the
command line, as this would be very cumbersome to parse for multiple
input files. Added method to check for file existence using stat to
bamtools_utilities.cpp
Derek [Thu, 17 Jun 2010 21:35:27 +0000 (17:35 -0400)]
Modified Jump() scheme to take better account of specified region and drill down closer to region beginning. Introduced RegionState to BRP in order to allow LoadNextAlignment to quit once an alignment is found beyond region.
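The early-exit idea can be sketched like this, assuming alignments arrive sorted by start coordinate and that both regions and alignments use half-open [start, stop) coordinates. The types and names (RegionState values, Region, Alignment, classify, alignmentsInRegion) are hypothetical and are not the BamReaderPrivate definitions.

```cpp
#include <cassert>
#include <vector>

// Hypothetical types for the sketch; coordinates are 0-based, half-open.
enum class RegionState { BeforeRegion, WithinRegion, AfterRegion };
struct Region { int refId; int start; int stop; };
struct Alignment { int refId; int position; int end; };

// Classifies one alignment relative to the region of interest.
RegionState classify(const Alignment& a, const Region& r) {
    assert(r.start <= r.stop);
    if (a.refId < r.refId) return RegionState::BeforeRegion;
    if (a.refId > r.refId) return RegionState::AfterRegion;
    if (a.position >= r.stop) return RegionState::AfterRegion;  // starts past region
    if (a.end <= r.start) return RegionState::BeforeRegion;     // ends before region
    return RegionState::WithinRegion;
}

// Collects overlapping alignments from a start-sorted list and stops at the
// first alignment beyond the region, since no later one can overlap.
std::vector<Alignment> alignmentsInRegion(const std::vector<Alignment>& sorted,
                                          const Region& region) {
    std::vector<Alignment> hits;
    for (const Alignment& a : sorted) {
        const RegionState state = classify(a, region);
        if (state == RegionState::AfterRegion) break;  // quit early
        if (state == RegionState::WithinRegion) hits.push_back(a);
    }
    return hits;
}
```

Only the AfterRegion state ends the scan: an alignment classified as BeforeRegion can still be followed by a longer or later one that overlaps the region.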
Derek [Thu, 17 Jun 2010 04:01:56 +0000 (00:01 -0400)]
Added concept of a fully specified region of interest to the BamReader API. Added BamRegion struct to BamAux.h. Added SetRegion() methods to BamReader. Reorganized or modified these existing BamReaderPrivate functions: BamReaderPrivate(), Close(), GetNextAlignment/Core(), IsOverlap(), Jump(), & Rewind(). Cleans up a lot of region-checking client code.
Erik Garrison [Thu, 10 Jun 2010 17:33:28 +0000 (13:33 -0400)]
change merger to use GetNextAlignmentCore
This provides a modest performance boost to the merger. A small change
to the BamAlignment copy constructor was required (to copy
BamAlignmentSupportData).
Derek [Thu, 10 Jun 2010 04:50:37 +0000 (00:50 -0400)]
Moved BamAlignmentSupportData into BamAlignment data type. This continues the read/write speedup mentioned in prior commits, but removes the need for clients to manage this additional auxiliary data object. The 'BamAlignment lite' is accessed by calling BamReader::GetNextAlignmentCore() and written by BamWriter::SaveAlignment(), which checks to see how much parsing & packing is needed before writing.
Erik Garrison [Wed, 9 Jun 2010 13:39:07 +0000 (09:39 -0400)]
fixed potential bug with previous commit
The previous commit made assumptions about the ordering of subtags
within @RG header lines. This commit only assumes that the read group
ID is specified by "ID", thus following spec.
Erik Garrison [Wed, 9 Jun 2010 13:31:01 +0000 (09:31 -0400)]
fixed bug with @RG handling
Prior to this commit files merged with bamtools merge would have one @RG
tag for each file. This is undesirable behavior. This commit fixes the
issue by tracking unique @RG tags in our unified header
(BamMultiReader::GetHeaderText) and prevents the MultiReader from
observing more than one @RG tag in the header. Future merges will have
the correct header.
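A minimal sketch of tracking unique @RG lines across several header texts, keyed on the "ID" subtag wherever it appears in the line. This is an illustration only, not the BamMultiReader::GetHeaderText implementation; readGroupId and uniqueReadGroupLines are hypothetical names, and keeping the first line seen for each ID is an assumption.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Extracts the read group ID from one @RG line by scanning its tab-separated
// subtags for "ID:", without assuming where that subtag sits in the line.
std::string readGroupId(const std::string& line) {
    assert(line.compare(0, 3, "@RG") == 0);  // caller passes only @RG lines
    std::istringstream fields(line);
    std::string field;
    while (std::getline(fields, field, '\t')) {
        if (field.compare(0, 3, "ID:") == 0) return field.substr(3);
    }
    return std::string();  // no ID subtag found
}

// Returns the @RG lines from all header texts, keeping only the first line
// seen for each distinct read group ID.
std::vector<std::string> uniqueReadGroupLines(const std::vector<std::string>& headers) {
    std::set<std::string> seenIds;
    std::vector<std::string> unique;
    for (const std::string& header : headers) {
        std::istringstream lines(header);
        std::string line;
        while (std::getline(lines, line)) {
            if (line.compare(0, 3, "@RG") != 0) continue;  // not a read group line
            const std::string id = readGroupId(line);
            if (id.empty()) continue;                      // malformed: skip
            if (seenIds.insert(id).second) unique.push_back(line);
        }
    }
    return unique;
}
```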
Derek [Wed, 9 Jun 2010 03:29:45 +0000 (23:29 -0400)]
Added GetNextAlignmentCore() to BamReader API as well as a corresponding SaveAlignment() in BamWriter. Both utilize the BamAlignmentSupportData structure, which contains the raw character data and lengths, and which has been bumped to BamAux.h. Exposing these methods should allow for quicker read/writes for tools that are only concerned with alignment/positional data, not the actual sequences.
Erik Garrison [Tue, 8 Jun 2010 20:28:58 +0000 (16:28 -0400)]
BamMultiReader data structure rewrite
Rewrite to improve performance of the MultiReader on large sets of
files. Move tracking of readers, alignments, and positions from several
decoupled vectors into a single multimap, allowing rapid acquisition of
the lowest 'current' alignment among the set of open readers. Expect
some performance boost when running the MultiReader on large numbers of
files, as prior to this rewrite each alignment required roughly 3 x N
ops (where N is the number of files) checking all these vectors.
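The multimap idea can be sketched as below, assuming each open reader yields positions in sorted order. Vectors of (refId, position) pairs stand in for the readers, and Position and mergeSortedStreams are hypothetical names; this is not the BamMultiReader code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

typedef std::pair<int, int> Position;  // (reference ID, 0-based position)

// Merges several individually sorted streams. The multimap is keyed on each
// stream's current position, so begin() is always the lowest 'current' entry
// and no per-alignment scan over all streams is needed.
std::vector<Position> mergeSortedStreams(const std::vector<std::vector<Position>>& streams) {
    typedef std::pair<std::size_t, std::size_t> Cursor;  // (stream index, offset)
    std::multimap<Position, Cursor> current;
    for (std::size_t s = 0; s < streams.size(); ++s) {
        assert(std::is_sorted(streams[s].begin(), streams[s].end()));
        if (!streams[s].empty()) {
            current.insert(std::make_pair(streams[s][0], Cursor(s, 0)));
        }
    }
    std::vector<Position> merged;
    while (!current.empty()) {
        std::multimap<Position, Cursor>::iterator lowest = current.begin();
        const Cursor cursor = lowest->second;
        merged.push_back(lowest->first);
        current.erase(lowest);
        // Re-insert the same stream at its next position, if it has one.
        const std::size_t next = cursor.second + 1;
        if (next < streams[cursor.first].size()) {
            current.insert(std::make_pair(streams[cursor.first][next],
                                          Cursor(cursor.first, next)));
        }
    }
    return merged;
}
```

Each step costs a logarithmic insert and erase in the number of open streams, rather than a linear pass over several parallel vectors.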