Derek [Thu, 21 Oct 2010 04:19:40 +0000 (00:19 -0400)]
Implemented index cache mode for both BAI & BTI formats
* Client code can now decide between 3 index cache modes:
Full : save entire index data in memory
Limited (default) : save only index data for current reference
None : save no index data - only load data necessary for a single-
* Required a major overhaul to BamIndex interface and derived classes.
Lots of refactoring to move common code up to BamIndex.
Derived classes now share many of the same method names &
organization. Only implementation details differ, as needed.
* Miscellaneous: moved BAMTOOLS_LFS definitions into BamAux.h & cleaned
up BGZF.h
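A minimal sketch of how the three cache modes could differ in what they keep in memory. Every name here (CacheModeSketch, IndexCacheSketch, OnReferenceLoaded) is hypothetical and not taken from the BamIndex interface, and the "Full" mode is simplified to "keep everything loaded so far" rather than loading the whole index up front.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical names; the real enum and class names in BamIndex may differ.
enum CacheModeSketch { FullCaching, LimitedCaching, NoCaching };

class IndexCacheSketch {
public:
    explicit IndexCacheSketch(CacheModeSketch mode) : m_mode(mode) {}

    // Called after index data for 'refId' has been read from disk.
    // 'data' is a stand-in for whatever the index format stores.
    void OnReferenceLoaded(int refId, const std::string& data) {
        if (m_mode == NoCaching) return;                // keep nothing
        if (m_mode == LimitedCaching) m_cache.clear();  // keep current reference only
        m_cache[refId] = data;                          // Full: keep everything seen
    }

    bool IsCached(int refId) const { return m_cache.count(refId) != 0; }
    std::size_t Size() const { return m_cache.size(); }

private:
    CacheModeSketch m_mode;
    std::map<int, std::string> m_cache;
};
```

The trade-off is memory against repeated disk reads: the limited mode bounds memory to one reference at a time, which suits clients that walk a file reference by reference.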
Derek [Sat, 9 Oct 2010 23:28:42 +0000 (19:28 -0400)]
Fixed: bug(s) related to empty references and regions.
* NOTE - This fix does introduce a slight modification to the *.bti index format.
So any existing BTI index files will need to be rebuilt to support the bug fix (apologies).
Implemented 'compound rule' logic in the FilterTool's script support.
* Previously the script was limited to doing 'OR' comparisons on various
property sets (what I call here filters). Now, by providing each filter
with an id, you can use these ids to define a compound rule.
* Documentation is severely lacking on this end at the moment, but I hope
to have a good explanation up soon. I think this interface could provide
a powerful flexibility in querying BAM files for very specific cases for
further analyses.
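One way to picture a compound rule: each filter id resolves to pass/fail for the current alignment, and the rule combines those results. The sketch below is an illustration only. The rule syntax ('|', '&', '!', parentheses) and the RuleEvaluator class are assumptions, not the FilterTool's actual script format or code.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical evaluator: 'results' maps each filter id to whether the
// current alignment passed that filter.
class RuleEvaluator {
public:
    RuleEvaluator(const std::string& rule, const std::map<std::string, bool>& results)
        : m_rule(rule), m_results(results), m_pos(0) {}

    bool Evaluate() {
        const bool value = ParseOr();
        SkipSpaces();
        if (m_pos != m_rule.size())
            throw std::runtime_error("unexpected trailing input in rule");
        return value;
    }

private:
    void SkipSpaces() {
        while (m_pos < m_rule.size() &&
               std::isspace(static_cast<unsigned char>(m_rule[m_pos])))
            ++m_pos;
    }

    // Consumes 'c' if it is the next non-space character.
    bool Consume(char c) {
        SkipSpaces();
        if (m_pos < m_rule.size() && m_rule[m_pos] == c) { ++m_pos; return true; }
        return false;
    }

    // or := and ('|' and)*
    bool ParseOr() {
        bool value = ParseAnd();
        while (Consume('|')) {
            const bool rhs = ParseAnd();  // always parse so the input is consumed
            value = value || rhs;
        }
        return value;
    }

    // and := unary ('&' unary)*
    bool ParseAnd() {
        bool value = ParseUnary();
        while (Consume('&')) {
            const bool rhs = ParseUnary();
            value = value && rhs;
        }
        return value;
    }

    // unary := '!' unary | '(' or ')' | id
    bool ParseUnary() {
        if (Consume('!')) return !ParseUnary();
        if (Consume('(')) {
            const bool value = ParseOr();
            if (!Consume(')')) throw std::runtime_error("missing ')' in rule");
            return value;
        }
        return ParseId();
    }

    bool ParseId() {
        SkipSpaces();
        const std::size_t start = m_pos;
        while (m_pos < m_rule.size() &&
               (std::isalnum(static_cast<unsigned char>(m_rule[m_pos])) || m_rule[m_pos] == '_'))
            ++m_pos;
        const std::string id = m_rule.substr(start, m_pos - start);
        const std::map<std::string, bool>::const_iterator it = m_results.find(id);
        if (it == m_results.end())
            throw std::runtime_error("unknown filter id: " + id);
        return it->second;
    }

    std::string m_rule;
    std::map<std::string, bool> m_results;
    std::size_t m_pos;
};
```

With filters f1 and f3 passing and f2 failing, the rule "f1 & !f2" accepts the alignment while "(f1 | f2) & !f3" rejects it.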
Added implementation of new SplitTool. This tool splits a single BAM file into multiple BAMs, based on a user-specified property. For now, properties supported are mapped/unmapped, paired/unpaired, split by reference, and split based on a given tag.
Moved BamAlignment data structure out to its own .h/.cpp. BamAux.h was getting over-crowded. *NOTE - This means that if you were using the BamAlignment data structure in code without a reader/writer, you need to include BamAlignment.h instead of BamAux.h. If your code was using reader/writer, no changes should be necessary on your end.
Fixed: bug related to accessing data (or regions with no alignments) near the ends of references, when using .bti index files. Required modifying the BamToolsIndex build step. *NOTE: This update invalidates any existing .bti files, please re-generate any that you have currently.* Versioning system in BTI will not allow users to use the older, buggy version... so no chance of accidental usage.
On second thought, moved the (non-indexing) constants back to BamAux.h, since they are not technically specific to BamWriter only and clients may be using them in their code
Added new PileupEngine to the toolkit. This is used by CoverageTool as well as ConvertTool for pileup format. Previously, pileup conversion output was buggy and overall incorrect. It should now match SAMtools output, to the best of my knowledge.
Added option for users to specify half-open (1-based) return value for BamAlignment::GetEndPosition(). By default, returns a 0-based coordinate after the recent modification.
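To make the two return conventions concrete, here is a rough sketch of computing an end position from CIGAR data. CigarOpSketch and EndPositionSketch are hypothetical stand-ins and not the BamAlignment implementation. The only assumption about the data is that the start position is 0-based and that M, =, X, D and N operations consume reference bases, as in the SAM specification.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one CIGAR operation.
struct CigarOpSketch {
    char Type;            // 'M', 'I', 'D', 'N', 'S', ...
    unsigned int Length;  // number of bases the operation covers
};

// 'position' is the 0-based leftmost aligned position.
// zeroBased == true  : 0-based coordinate of the last aligned base (closed end)
// zeroBased == false : one past that base (half-open end, numerically equal
//                      to the 1-based coordinate of the last aligned base)
int EndPositionSketch(int position,
                      const std::vector<CigarOpSketch>& cigar,
                      bool zeroBased = true) {
    int end = position;
    for (std::size_t i = 0; i < cigar.size(); ++i) {
        switch (cigar[i].Type) {
            case 'M': case '=': case 'X': case 'D': case 'N':
                end += static_cast<int>(cigar[i].Length);  // consumes reference
                break;
            default:
                break;  // insertions, clips, and padding do not advance the reference
        }
    }
    return zeroBased ? end - 1 : end;
}
```

For an alignment starting at position 100 with CIGAR 10M2I5D10M, 25 reference bases are covered, so the sketch returns 124 by default and 125 for the half-open form.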
Reimplemented BamToolsIndex for bug fix and performance upgrade. *** NOTE *** This commit invalidates any previous BamToolsIndex files (.bti). Please re-run 'bamtools index -bti -in <yourBam>' to generate the new index files.
Reorganized index/jumping calls. Now all BamReader cares about is sending a Jump() request to a BamIndex with a desired region and receiving a success/fail flag. BamIndex-derived classes now handle all index-format-specific offset calculation, overlap checking, etc., and make sure the associated BGZF stream has seeked as close to the desired region as that index scheme allows.
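The division of labor described above might look roughly like the following. This is a sketch under stated assumptions: RegionSketch, StreamSketch, IndexSketch and BlockIndexSketch are all hypothetical, and the toy "one offset per fixed-size block" format merely stands in for a real index scheme.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical region: reference id plus 0-based left position.
struct RegionSketch { int RefId; int LeftPosition; };

// Hypothetical stand-in for a seekable compressed stream.
struct StreamSketch {
    std::int64_t Offset;
    StreamSketch() : Offset(0) {}
    bool Seek(std::int64_t offset) { Offset = offset; return true; }
};

// The reader only ever sees this interface: request a jump, get a flag back.
class IndexSketch {
public:
    explicit IndexSketch(StreamSketch* stream) : m_stream(stream) {}
    virtual ~IndexSketch() {}
    virtual bool Jump(const RegionSketch& region) = 0;
protected:
    StreamSketch* m_stream;
};

// Toy index format: one file offset per fixed-size block of each reference.
class BlockIndexSketch : public IndexSketch {
public:
    BlockIndexSketch(StreamSketch* stream, int blockSize,
                     const std::vector<std::vector<std::int64_t> >& offsets)
        : IndexSketch(stream), m_blockSize(blockSize), m_offsets(offsets) {}

    virtual bool Jump(const RegionSketch& region) {
        if (m_blockSize <= 0 || region.RefId < 0 || region.LeftPosition < 0)
            return false;
        const std::size_t ref = static_cast<std::size_t>(region.RefId);
        if (ref >= m_offsets.size() || m_offsets[ref].empty())
            return false;  // unknown or empty reference
        std::size_t block = static_cast<std::size_t>(region.LeftPosition / m_blockSize);
        if (block >= m_offsets[ref].size())
            block = m_offsets[ref].size() - 1;  // clamp to the last indexed block
        return m_stream->Seek(m_offsets[ref][block]);
    }

private:
    int m_blockSize;
    std::vector<std::vector<std::int64_t> > m_offsets;
};
```

Keeping the offset arithmetic inside the derived class means a new index format can be added without touching the reader.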
Relicensed the source code used in bamtools_options. Most of it derived from Mosaik source, which was originally released under GPL. Now under MIT License with author's permission. This puts the entire BamTools codebase securely under one unrestrictive license.
* Moved FileExists() to BamAux.h so that all API classes have access to its functionality.
* Created 2 'factory methods' in BamIndex.h to return a BamIndex subclass, depending on client's specified PreferredIndexType & on what files actually exist on disk.
* Renamed BamDefaultIndex as BamStandardIndex. Hopefully this name should be a clearer description going forward than BamDefaultIndex, since the standardized index may not always be the 'default' in every situation.
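The selection logic behind such a factory could be sketched as follows. The enum values, the ChooseIndex function, the "<bam>.bai" / "<bam>.bti" naming, and the fall-back to the other format when the preferred file is missing are all assumptions made for illustration. The existence check is passed in as a callable so the sketch does not depend on any particular FileExists() implementation.

```cpp
#include <string>

// Hypothetical names; not the identifiers declared in BamIndex.h.
enum PreferredIndexTypeSketch { PreferStandard, PreferBamTools };
enum ChosenIndexSketch { NoIndexFound, StandardIndexChosen, BamToolsIndexChosen };

// 'exists' is any callable taking a filename and returning bool.
template <typename ExistsFn>
ChosenIndexSketch ChooseIndex(const std::string& bamFilename,
                              PreferredIndexTypeSketch preferred,
                              ExistsFn exists) {
    // Assumed naming convention: index filename = BAM filename + extension.
    const bool hasBai = exists(bamFilename + ".bai");
    const bool hasBti = exists(bamFilename + ".bti");

    if (preferred == PreferStandard) {
        if (hasBai) return StandardIndexChosen;
        if (hasBti) return BamToolsIndexChosen;  // fall back to the other format
    } else {
        if (hasBti) return BamToolsIndexChosen;
        if (hasBai) return StandardIndexChosen;  // fall back to the other format
    }
    return NoIndexFound;
}
```

A real factory would return a pointer to the matching BamIndex subclass instead of an enum value; the enum keeps the sketch free of invented class definitions.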
Modified Utilities::FileExists() so it doesn't rely on sys/stat.h. While this header is de facto provided and supported on *most* systems, it really is not standard C/C++, so it can't be trusted to be fully portable.
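A portable check that uses only the standard library might look like the sketch below. It is not necessarily how Utilities::FileExists() is written; FileExistsSketch is a hypothetical name. Note that this approach really tests "can be opened for reading", so an existing but unreadable file is reported as missing.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: returns true if 'filename' can be opened for reading.
bool FileExistsSketch(const std::string& filename) {
    std::ifstream file(filename.c_str());
    return file.good();  // stream closes automatically when 'file' goes out of scope
}
```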
Cleaned up index file handling throughout toolkit. Did this by adding a FileExists() method to BamMultiReader for determining which index file to load.
Added uncompressed output as default behavior for Filter-, Merge-, and RandomTools when sending results to stdout. User can override this behavior using the command line option: -forceCompression
Derek [Mon, 30 Aug 2010 19:57:55 +0000 (15:57 -0400)]
Fixed: Calls to GetEndPosition() rely on CIGAR data being parsed. Previously this was not set if BamAlignment was retrieved via GetNextAlignmentCore(). Moved CIGAR parsing back to LoadNextAlignment() to ensure this works properly.
Fixed variable-length tag data retrieval in BamAlignment::GetTag(). To do this, removed templated GetTag(). Now have explicit overloaded string, unsigned int, signed int, and float flavors of the method.
Added basic implementation of RandomTool. This generates a random (well, pseudo... it's based on rand()) subset of alignments from BAM file(s). User can specify REGION and/or number of alignments to generate. No duplicate checking implemented. TODO: Handle BAM files without existing index - tool currently depends heavily on being able to jump around randomly
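As a rough illustration of rand()-based selection without duplicate checking, the sketch below picks positions in a range. PickRandomIndices is a hypothetical function, not the RandomTool code, and it picks plain indices rather than jumping through an index to file offsets.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical helper: picks 'count' pseudo-random indices in [0, total).
// No duplicate checking, so the same index may appear more than once.
std::vector<std::size_t> PickRandomIndices(std::size_t total,
                                           std::size_t count,
                                           unsigned int seed) {
    std::vector<std::size_t> picks;
    if (total == 0) return picks;  // nothing to pick from
    std::srand(seed);
    picks.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        // Modulo introduces a slight bias when 'total' does not divide
        // RAND_MAX + 1 evenly; acceptable for a sketch.
        picks.push_back(static_cast<std::size_t>(std::rand()) % total);
    }
    return picks;
}
```

Because each pick is an independent jump target, this style of sampling only works when the file can be seeked efficiently, which is why an index is needed.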