– Improve speed of sam-to-compact when log4j is not configured properly.
– Improve alignment-to-text mode a bit. The html format now writes pair and splice flags to text. Since the sam format output was incomplete, we disabled it in this release. Goby 2 will include a fully functional SAM/BAM export. Fix problem reported by Heng Li where only the first chunk of the alignment was processed (option -e -1 was not interpreted correctly).
– Add test and fix for a bug that went back to the start of the alignment file even though the IterateAlignments instance was created for a slice of the input. The problem only affected the IterateAlignments class because it was calling reposition(0,0) and the method did not enforce slice limits.
– Fixes to sam-to-compact mode. Previous versions could fail for a variety of reasons. We have stress tested this mode by throwing various input BAM files at it, sorted or not, and fixed the bugs we found. For instance, the --sorted option would not work in some 220.127.116.11.1- versions of Goby after samtools/picard changed the semantics of the record comparator Goby relied upon to verify the input was indeed sorted by position. (This made it impossible to convert already sorted BAM files to sorted Goby alignments in some previous versions.) Performance of this mode was also drastically improved by disabling verbose debug log statements.
– The RandomAccessSequenceCache had problems with bases that weren’t G/A/T/C/N. Such bases would be skipped silently, causing rare but potentially significant problems (such as on human chr 3 of the 1000g genome reference, where an R base appears). Bases not in the group G/A/T/C/N would introduce position shifts for the bases following the offending character. Now bases other than G/A/T/C are stored as N and the position of the following bases is maintained. Please note that the problem was in a library used by RandomAccessSequenceCache. We updated the library in this release, and no change to the code of RandomAccessSequenceCache was needed to fix the problem.
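The position shift described above can be illustrated with a small sketch. This is not the code of RandomAccessSequenceCache or of the library it uses; the function names and the toy reference are assumptions made for illustration only.

```python
# Hypothetical illustration: dropping an unsupported base shifts every
# following position, while storing it as N keeps positions stable.
VALID_BASES = set("GATCN")

def encode_skipping(sequence):
    """Old behaviour as described: unsupported bases are silently dropped."""
    return "".join(base for base in sequence.upper() if base in VALID_BASES)

def encode_as_n(sequence):
    """New behaviour as described: unsupported bases are stored as N."""
    return "".join(base if base in VALID_BASES else "N" for base in sequence.upper())

reference = "ACGRTCA"  # toy reference with an R (purine) code at index 3
skipped = encode_skipping(reference)
fixed = encode_as_n(reference)

# The base stored at index 4 differs between the two encodings.
print(reference[4], skipped[4], fixed[4])
```

With the skipping encoder, every base after the R is found one position too early; with the N encoder, lookups at any index agree with the original reference except at the R itself.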
– Make it possible to open Goby alignments through HTTP. Simply specify a URL as the basename argument to the goby tools. This is supported broadly by the API, so the concatenation reader also supports URLs, for instance. TMH files currently cannot be loaded remotely. Alignments that require upgrading will also fail to load remotely.
– vcf methylation format: removed space in name of C and Cm group INFO fields.
– Added the ability in alignment-to-text mode to output HTML (-f html), to start/end at offsets (-s/-e) in the alignments and to limit the number of alignment entries to output (-n).
– Fix a bug related to writing paired-end alignments in the GSNAP parser (C API). This release is important if you need to compile the GSNAP binaries with Goby support and will run paired-end alignments with GSNAP. The Java APIs and tools are unchanged compared to 18.104.22.168.
– Added a methylation_region format capable of averaging methylation rates for different cytosine contexts over arbitrarily defined regions.
– Added a diploid genotype filter to use when calling genotypes in a diploid genome.
– discover-sequence-variants format compare_groups: Write distinct Fisher p-values for each comparison pair.
– Fix FDR mode output for TSV format. Make option --column-selection-filter work.
– Fix bug that prevented methylation vcf output from writing any line.
– Fix bug in GenotypesOutputFormat that caused GenotypesOutputFormat to throw an exception when processing some sites.
– Discover-sequence-variants: add ability to describe zero, one or more group comparisons. Syntax is A/B,A/C to compare group A to B and group A to C. Additional pairs can be described, separated by commas.
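A minimal sketch of how the A/B,A/C syntax could be parsed into comparison pairs. The function name and the error handling are assumptions for illustration; this is not Goby's own parser.

```python
def parse_group_comparisons(spec):
    """Parse a string such as 'A/B,A/C' into [('A', 'B'), ('A', 'C')].

    An empty string yields an empty list (zero comparisons).
    """
    pairs = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty input and stray commas
        parts = [part.strip() for part in item.split("/")]
        if len(parts) != 2 or not all(parts):
            raise ValueError(f"expected GROUP1/GROUP2, got {item!r}")
        pairs.append((parts[0], parts[1]))
    return pairs

print(parse_group_comparisons("A/B,A/C"))
```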
– Extend methyl-stats mode to estimate fraction of methylated cytosine observed in CpX contexts.
– Discover-sequence-variants, genotype format: Fix a bug where alleleSet was cleared in each sample, rather than before any sample is processed. This made it possible for some positions to be ignored erroneously when samples were given in a specific order on the command line. Specifically, positions would be ignored if they were not typed (i.e., not enough good bases) in the last sample given on the command line.
– Optimize merging of TMH when the files are large (>100M compressed).
– Fixed a major bug where NonAmbiguousAlignmentReader would stop iterating after encountering an ambiguous alignment. Alignments with shorter reads were much more likely to be affected.
– Fix sam-extract-reads for paired-end BAM files. Each BAM file contains both pairs. To convert to compact reads, the input BAM file must be sorted by read name, since this is the only way we can put the pairs back together in one Goby record.
– Mode discover-sequence-variants now limits the maximum coverage per site in order to limit the impact on peak memory of a few very high coverage sites. The default setting is 500,000x and can be changed with option --max-coverage-per-site.
– Switched IndexedIdentifier to an AVLTreeMap to help scale when we have millions of elements to compare in diff exp.
– Fixed a subtle bug in IterateSortedAlignment that would cause iteration to return partial results for some alignments when restricting results to a window. The problem would manifest more clearly for alignments against genomes where contigs have smaller indices than chromosomes and chromosome sequences are listed in non-increasing order (e.g., chr16 appearing before chr 10) and restricting to window from chr16 to MT (which should include chr 10 in that genome, but returned no result on chr 10).
– Trim mode: Fix exception that could occur when trimming reads with no quality scores.
– Change goby script to request the bash shell explicitly. This is needed on systems where /bin/sh is not a synonym for bash. Thanks to Martin Frith for catching this on Ubuntu.
– Change how targetLengths are concatenated. It turns out that last-to-compact needs alignment entries matching the target to record the length in the alignment. We need to keep any length seen when we concat because the first chunk may just not have the length for the remaining parts.
– Improved logic for --paired-end filename support in the fastaToCompactMode.
– Fix a NPE in suggest-position-slices that could occur with very small alignment files.
– The BaseStats utility was transformed into a Goby mode (base-stats). The new mode has the ability to tally occurrence of CpX motifs in reads. Useful as a proxy to the amount of unconverted Cs in bisulfite converted reads.
– The methyl-stats mode takes a VCF file produced by Goby methylation output and a genome and calculates various statistics about the distribution of fragment lengths between CpGs interrogated by the assay.
– FDR mode now accepts --column-selection-filter to select columns matching string.
– Proof of principle that protocol buffer can seamlessly cohabit with data-specific compression schemes. The --codec option on fasta-to-compact is introduced to activate compression of reads when writing compact reads. The codec provided (called read-codec-1) achieves about 10-12% better compression of read files than pure protocol-buffer encoding. This read-codec-1 codec stores bases and quality scores with an arithmetic coder in a protocol buffer field called ‘compressed_data’. Please note that we do not recommend using this option at this stage since the C/C++ APIs cannot load data encoded with this codec at this time.
– Add ability to run alignment-to-annotation-counts on a specific genomic region (see --start-position and --end-position).
– alignment-to-annotation mode has a new option (--remove-shared-segments). When active, this option will remove annotation segments when they partially overlap with more than one primary annotation id. When this option is selected and the primary id is a gene, and secondary id is an exon, the mode will remove exons that are associated with several genes. When the option is used with transcript id as primary and exon as secondary, exons are removed that are shared across different transcripts of the same gene.
– mode base-stats now supports multiple input files.
– VCFParser will now set column type when reading TSV files by using TabToColumnInfoMode to scan the actual values stored in the TSV file. The first time this is done for each file, a .colinfo file will be created and then used if the file is read again by VCFParser in the future.
– Added the mode tab-to-column-info to read the data from TSV files to determine the column types (double/integer/string). Writes a .colinfo file detailing the column names and types.
– Upgraded to SAM JDK 1.52
– Modes sam-to-compact and sam-extract-reads now set SILENT validation before reading file header. This is required because the SAM JDK validation rules are more stringent than required by the specification. This means that some valid SAM files (per the SAM spec) cannot be parsed without error when the strict validation is used.
– Fixed a bug with ReadsQualityStatsMode when SampleFraction == 1.0d, such as for files with a small number of reads.
– Mode sam-extract-reads now supports extracting reads from paired samples. See the new options --paired-end and --pair-indicator. These options work similarly to the fasta-to-compact options.
– Fix problem with suggest-position-slices that could create empty slices.
– Fix bug in discover-sequence-variants methylation format that wrote methylation rates only for up to two samples.
– Fix bug in alignment-to-counts that caused problems with large alignments.
– Fix allele frequency format to write genotype first in FORMAT per vcf spec.
– Add new INFO fields in compare group vcf format to show allele counts in each group.
– Ability to support short versions of mode names. For example, “compact-file-stats” has the short mode name “cfs”. There is a default short mode name generation implementation in AbstractCommandLineMode.getShortModeName(), but each mode class can override this method in the case of short mode name collisions. In the case of collisions, the command line parser will not offer/accept ANY short mode names for the classes in question.
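The collision rule can be sketched as follows. The default generation scheme used here (first letter of each hyphen-separated word) is an assumption inferred from the “compact-file-stats” to “cfs” example, not a description of AbstractCommandLineMode.getShortModeName(). The mode names foo-bar and fizz-buzz are hypothetical and only serve to create a collision.

```python
from collections import defaultdict

def default_short_name(mode_name):
    """Assumed default: the first letter of each hyphen-separated word."""
    return "".join(word[0] for word in mode_name.split("-") if word)

def build_short_name_table(mode_names):
    """Map short names to mode names, dropping every short name that collides."""
    candidates = defaultdict(list)
    for name in mode_names:
        candidates[default_short_name(name)].append(name)
    # A short name is only offered when exactly one mode claims it.
    return {short: names[0] for short, names in candidates.items() if len(names) == 1}

modes = ["compact-file-stats", "foo-bar", "fizz-buzz"]  # last two are hypothetical
table = build_short_name_table(modes)
print(table)
```

In this sketch neither colliding mode keeps the short name, which mirrors the rule that no short name is accepted for any of the classes involved in a collision.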
– SamToCompact: Generate sorted goby alignments when a sorted BAM file is provided as input (use the --sorted flag to activate this option). Thanks to Bradford Powell for the suggestion and draft implementation.
– Fixed a bug in tally-reads that was triggered by reads of different lengths. Thanks to Adrian Platts for the bug report.
– Fix realignment around indels bug that prevented reads from being realigned to the left in exome data. Now correctly updates the start position of the moving window.
– Renamed AlignmentEntry.splicedAlignmentLink to AlignmentEntry.splicedForwardAlignmentLink and added AlignmentEntry.splicedBackwardAlignmentLink so splice links can be both bidirectional and more than two segments long. This change is included in the C/C++ APIs and makes it possible for GSNAP to write splice information to Goby alignment files.
– FDR mode now supports reporting the top n hits irrespective of corrected q-value threshold (top n hits are defined by the ranking produced by ordering the hits by increasing p-value, for the last column adjusted).
– Significantly reduced memory consumption when performing FDR BH adjustment on hundreds of millions of elements.
– VCFWriter now writes missing value ‘.’ in ID, ALT and FILTER fields, as required by VCF 4.1 documentation (http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41). This change is required to read the files generated by Goby with the latest version of Tribble used in IGV EA.
– AlignmentToTextMode will now display splice information.
– alignment-to-counts now generates indexed base-level histogram files. Indexing makes it possible to jump quickly to a new genomic location in IGV. This is especially useful when viewing coverage for tens of tracks.
– Filter out ambiguous reads from alignment-to-counts base level histogram output. Pre-22.214.171.124 behaviour can be obtained by setting the argument --filter-ambiguous-reads to false.
– alignment-to-counts: also tried a new way to create base-level histograms from sorted alignment files. This turns out to be about 3 times slower than the current approach. We still keep the new approach because it should scale to any size alignment. Mode alignment-to-counts will use the new approach if an alignment is sorted and has more than 50 million aligned reads.
– Filter out ambiguous reads from alignment-to-annotation-counts by default. Pre-126.96.36.199 behaviour can be obtained by setting the argument --filter-ambiguous-reads to false.
– Add ability to switch off the recording of sampleIndex. This is useful when concat is just used to put pieces of a large alignment back together after splitting reads for parallel processing.
– Do not print indices at the end of upgrade. This caused upgrade to fail on some alignments with an exception.
– Extended IterateAlignments to create alignment reader with a configurable AlignmentReaderFactory.
– Set the default normalization method for alignment-to-annotation-counts to Bullard normalization only.
– Fix a bug in VCFParser that affected parsing tab delimited files. Some files would be parsed with a tab in the value of the last column, separating the values of the last two actual columns.
– We have added the capability to perform on the fly realignment around indels. This feature is available in mode discover-sequence-variants and in concatenate-alignments. The feature is activated with the new --processor realign_near_indels option. When the option is provided, a compressed reference genome must also be given on the command line (with the --genome option). This will trigger realignment of reads in regions where candidate indels are found by the aligner. The algorithm is very fast, in fact much faster than previously described approaches, and consumes a reasonable amount of memory (a function of the maximum depth of coverage in the region where candidate indels are observed, but typically <2GB). Realignment correctly removes artefactual SNPs that can be introduced when an aligner fails to align the read ends properly through a read deletion. Please note that this version realigns read deletions. Realignment of read insertions has not been implemented.
– Added a new mode (file-to-attributes) to generate a sample attribute file suitable for loading in IGV. Useful when files are named with the convention attr1-attr2-attr3.counts.
– Make it possible to open an alignment if the header file is present, but the entries file is missing. This allows reading the header only, for instance when we need to load counts and have access to targetIds.
– Add mode to convert annotations to counts archive format.
– Add new coverage mode to calculate coverage stats over annotation regions. When annotation regions are defined with capture regions, this mode outputs enrichment efficiency and depth of coverage for specific proportions of captured sites. The mode uses just .header and .count files and traverses count transitions. The algorithm used to iterate through count transitions is very efficient (for instance, it takes about 20 seconds to estimate coverage stats for an alignment with ~20M aligned reads). Count files are produced with GobyWeb together with the alignment or with the alignment-to-counts mode. See this tutorial for details.
– Add new trim mode to trim adapters. This mode works with single-end or paired-end compact reads files. When trimming a paired-end file, it keeps pair and existence of mate or pair consistent with the input.
– Optimized the performance of VCFParser on files with large number of columns. The VCF format seems designed without performance in mind, so it is hard to come up with a reasonably fast implementation. The current implementation of the Goby VCF parser can process about 8,000 lines of compressed VCF per second on a desktop machine.
– AlignmentEntry schema change: a new field sample_index holds the index of the alignment from which the entry was read. This is useful when concatenating over multiple alignments and realigning reads that span indels, to reliably track the alignment origin of each entry. The concatenation readers have been modified to set sample_index accordingly. Please note that the activeIndex field of the sorted reader is not a reliable way to identify the alignment of origin when realignment is active. Please use the new sample_index field instead.
– Add CountBinningAdaptor, useful to bin counts on the fly at any resolution for display in IGV.
– Added ability to record total number of bases and sites seen in count archive.
– Now using protobuf 2.4.1. Please upgrade your local version of protobuf if you are recompiling from sources.
– Patched VCF output for compatibility with the VCF specification. Specifically, we now write . in the QUAL field and write genotype as the first field in the methylation output format. Additionally, we only write a VCF line if the site can be typed in at least one of the samples. These changes make Goby VCF output compatible with the IGV 2.0 VCFTrack.
– Fix a bug in merge that could trigger an ArrayIndexOutOfBoundsException with some alignments.
– AlignmentWriter now correctly records Goby version in header upon close(). This fixes a problem when alignments read from read-only files would fail upon trying a new upgrade.
– AlignmentReaderImpl now supports full random access to an alignment. Use reposition(ref,pos) followed by skipTo(ref,pos) to obtain the first entry matching at location (ref,pos). Prior to 1.9.6, the reposition method would not reposition to a location already visited forcing clients to close the alignment reader and reopen it (this new behaviour should improve performance in IGV).
– The indexing logic used in versions of Goby up to 1.9.5 (inclusive) had subtle flaws. This could cause the skipTo method to behave incorrectly for some alignments. For instance, if reads matched on target N at a position larger than the length of target N+1, these reads would not be returned by skipTo. Thanks to Alec Chapman for identifying these issues.
We have corrected the problem and added additional unit tests to check the behavior of the implementation in various edge cases. A consequence of this change is that the new indexing logic requires recalculating the .index data structure for alignments sorted and indexed with a version of Goby prior to 1.9.6.
We provide a new mode, goby upgrade, to perform these calculations and fix such alignments. To upgrade alignments off-line, simply do:
goby 3g upgrade [files].
This command will upgrade each alignment corresponding to the filenames provided. It skips those alignments produced by versions of Goby that do not require upgrading. The upgrade process creates a backup of the files that are affected: .index and .header are backed to .index.bak and .header.bak respectively.
The upgrade process is relatively fast: in our tests we upgraded a 750Mb alignment file in 2’30”.
– Version 1.9.6 will try to upgrade alignments on the fly to the new version of the index data structures.
– Detect when FastaToCompact is running in API mode versus command line. Do NOT do System.exit in API mode and instead throw exceptions. Also, API mode doesn’t run conversions in parallel but instead runs them serially for easier exception catching.
– VCFParser now splits headers by tab instead of whitespace so column names that contain spaces are read correctly.
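The difference between the two splitting strategies can be seen on a toy header line. This sketch uses plain Python string methods and is not VCFParser code; the column names are invented for illustration.

```python
# Toy header line with two column names that contain a space (assumed example).
header = "#CHROM\tPOS\tID\tsample A\tsample B"

# Splitting on any whitespace breaks "sample A" into two bogus columns.
by_whitespace = header.lstrip("#").split()

# Splitting on tabs only keeps each column name intact.
by_tab = header.lstrip("#").split("\t")

print(len(by_whitespace), by_whitespace)
print(len(by_tab), by_tab)
```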
– New mode simulate-reads will generate reads artificially against a reference sequence. We use this mode to create simulated datasets of bisulfite converted reads or mutated reads and to test that Goby produces the expected results.
– Show phred scores in DisplaySequenceVariants (tab + base)
– Add a QualityEncoding.PHRED in case one just wants to transfer quality scores without changing quality scale
– Rewritten sam-to-compact mode that handles sequence variations better, handles bsmap sam files better, and handles quality score conversions more flexibly. The old mode is still around, called sam-to-compact-old, for comparison. The new mode has slightly different command line parameters.
– Added a discover-sequence-variants mode format ‘methylation’ to estimate methylation rates for RRBS and Methyl-Seq alignments.
– Dramatically improved TMH loading times for large alignments.
– Completely removed support for queryLength in header. This usage was deprecated in Goby 1.7, complicates the code unnecessarily and is error prone (because we had two ways to store read length in the previous versions of Goby). Note that versions since 1.7 had a concat mode that transferred information from the header to the alignment entries transparently. Use this mode from a pre 1.9.4 release if you need to migrate a 1.6- alignment to work with Goby 1.9.5+.
– Fixed a bug where merge-compact-alignments would throw an ArrayIndexOutOfBounds because a TMH query index was smaller than the first query index in the alignment.
– Changed discover-sequence-variants mode to filter out alignment entries whose read mapped to multiple locations in the reference (as determined by the aligner argument, e.g., -n for gsnap).
– Made AlignmentReader an interface. The previous AlignmentReader class is now called AlignmentReaderImpl.
– ConcatSortedAlignmentReader and ConcatAlignmentReader now support a configurable AlignmentReaderFactory. The factory makes it possible to plug in alignment readers that filter entries as they are read. The default factory returns all reads. However, if the NonAmbiguousAlignmentReader factory is installed, the concatenate reader returns only entries for which the read did not match other locations in the genome. Other filtering behaviour can be implemented in a sub-class of AlignmentReader (see NonAmbiguousAlignmentReader for an example) and a factory created to return instances of this class. This mechanism is used to filter out entries whose reads match several locations on the reference sequence.
– Goby now includes a VCFParser class (see package edu.cornell.med.icb.goby.readers.vcf). VCF stands for Variant Call Format. The VCF format is described at http://www.1000genomes.org/node/101. The Goby VCFParser class implements a VCF 4.0+ parser. Importantly, this implementation can also be used to parse plain TSV files, or VCF files that do not include the fixed VCF columns. It therefore supports an extended version of the VCF format that is as generic as a TSV file, but can also provide meta-information about the columns in the specific file. Another difference with VCF 4.0 is that we support the Group attribute on column fields. This makes it possible to indicate that fields are part of the same group. Such a feature can be used by user interfaces that would like to offer the ability to manipulate multiple column fields as a group (for instance to hide or show an entire group of fields).
– FDR mode now supports VCF input files and outputs. See the option --vcf to activate processing of VCF formatted files.
– Added a VCFWriter class to write files in the VCF4 format. This class is now used by discover-sequence-variants when writing in genotypes format. This should make it possible to use vcf-tools on the genotype files produced.
– Fix logic for IterateSortedAlignments which, in turn, fixes sequence-variation-stats2. The issue primarily dealt with insertions, deletions, and left and/or right padding.
– Fixed the logic for TAB_SINGLE_BASE in display-sequence-variation mode to report the correct read_index and ref_position.
– The C API (used by BWA, GSNAP) has been updated to more accurately write sequence variations (this version fixes problems in reporting of the read index). We have created examples of how sequence variations are encoded in Goby alignment files. These examples are available at http://tinyurl.com/goby-sequence-variations
– Mode concatenate-alignments now propagates names and versions of the aligners that contributed input alignments.
– Mode sort now propagates the name and version of the aligner that produced the alignment.
– Mode compact-file-stats now reports the name and version of the aligner that produced a Goby alignment file.
– Mode discover-sequence-variants has been extended to support multiple types of outputs (see --format flag). One output format prints genotypes (--format genotypes), while another estimates the proportion of the reference allele in each sample (--format allele_frequencies).
– Added a mechanism to support base filters in discover-sequence-variants. To activate these filters, you must provide the --eval option with the “filter” option. Two filters are currently active when --eval filter is used: one filters variant bases by quality score (keeping only bases with q-phred>=30) and another is a simple and efficient strategy to remove bases that do not quite agree across all the observations. Future versions will make it possible to customize the set of filters and their options.
– sequence-variation-stats2 now runs in parallel up to the available number of threads when multiple alignments are given as input.
– display-sequence-variations and sequence-variation-stats modes: Fix problems in the logic to calculate read-index for large insertions/deletions.
– This release has a C API compatible with a development version of GSNAP. A version of GSNAP released after 2011-03-11 should compile with Goby 1.9.3.
– Add new statistics for discover-sequence-variants. Notably, we now record the log odds ratio, the estimated standard error of the log odds ratio, as well as a Z-score for the log odds. Standard error and Z-score are only estimated if more than 10 counts exist in each cell of the contingency table. Also added the proportion of reference allele (refCount / (refCount+varCount)).
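A sketch of how such statistics can be computed from a 2x2 table of reference and variant counts in two groups. The source does not state which estimator Goby uses; the natural-log odds ratio with the usual large-sample standard error sqrt(1/a + 1/b + 1/c + 1/d) is assumed here, and the function names and the orientation of the ratio are illustrative.

```python
import math

def log_odds_stats(ref_a, var_a, ref_b, var_b, min_cell_count=10):
    """Return (log odds ratio, standard error, Z-score) for a 2x2 table.

    Standard error and Z-score are None unless every cell holds more than
    min_cell_count observations, mirroring the rule stated above.
    """
    cells = (ref_a, var_a, ref_b, var_b)
    if min(cells) <= 0:
        raise ValueError("all four cells must be positive to take a log odds ratio")
    # Odds of the reference allele in group A divided by the odds in group B.
    log_odds = math.log((ref_a * var_b) / (var_a * ref_b))
    if min(cells) <= min_cell_count:
        return log_odds, None, None
    std_err = math.sqrt(sum(1.0 / count for count in cells))
    return log_odds, std_err, log_odds / std_err

def reference_proportion(ref_count, var_count):
    """Proportion of reference allele: refCount / (refCount + varCount)."""
    return ref_count / (ref_count + var_count)

print(log_odds_stats(40, 20, 20, 40))
print(reference_proportion(30, 10))
```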
– Fix reformat-compact-reads bug where quality scores were longer by 1 than the sequence.
– Reduce the memory needed by compact-file-stats to determine the number of reads in a compact reads file.
– Changed how the number of reads in an alignment file is determined by compact-file-stats. We now report the number stored in the alignment header.
– Change how log2 fold change was estimated. We used to estimate it as ((log2_rpkm_group_a+1)/ (log2_rpkm_group_b+1)). This can cause problems when log2 RPKMs are negative in one group and positive in the other. We now add 1 to counts before calculating RPKMs and taking the log. Similar changes were made to the fold-change. RPKM columns now return the RPKM of (count+1).
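The revised calculation can be sketched as follows, assuming the standard RPKM definition (reads per kilobase of feature per million mapped reads). The function names and arguments are illustrative and are not taken from Goby's code.

```python
import math

def rpkm_plus_one(count, feature_length_bp, total_mapped_reads):
    """RPKM of (count + 1), so that features with zero counts stay finite in log space."""
    return (count + 1) * 1e9 / (feature_length_bp * total_mapped_reads)

def log2_fold_change(count_a, count_b, feature_length_bp, total_a, total_b):
    """Difference of log2 RPKMs, each computed on count + 1."""
    rpkm_a = rpkm_plus_one(count_a, feature_length_bp, total_a)
    rpkm_b = rpkm_plus_one(count_b, feature_length_bp, total_b)
    return math.log2(rpkm_a) - math.log2(rpkm_b)

# Same feature and library sizes, 3 reads in group A versus 0 in group B.
print(log2_fold_change(3, 0, 1000, 1_000_000, 1_000_000))
```

Adding 1 before normalization keeps the RPKM strictly positive, so the logarithm is always defined and the fold change is a plain difference of logs rather than a ratio of values that may change sign.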
– Mode reformat-compact-reads now takes an optional -f argument to filter reads. This option can be used to remove redundant reads from a compact-reads file (see tally-reads mode to produce the read filter). It is no longer necessary to do round-trips to FASTQ to remove redundant reads.
– Fixed a major bug in discover-sequence-variants that sometimes could cause confusion in the group of origin of a variation. This bug could affect between-group p-values. A JUnit test now checks for the error condition and is part of regression testing.
– ConcatenateAlignmentReader would consume excessive amounts of memory when several large alignments (e.g., with >100 million reads) were concatenated. The reader was trying to allocate very large queryLength arrays, even though each underlying reader indicated that its entries carried the queryLength. The fix consists in detecting that all the concatenated readers support queryLength in entries, and not allocating these arrays at all. This is a major bug fix that makes it possible to run more instances of goby modes on the same server (i.e., differential expression and sequence variant discovery modes have significantly improved memory usage).
– sam-extract-reads mode: append “.compact-reads” to output filename when the extension is missing.
– Mode sam-extract-reads now supports an optional --quality-encoding argument. Default is BAM encoding.
– QualityEncoding now supports BAM encoding (no offset or adjustment, the value of the character in ascii is the Phred score).
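To illustrate what "no offset or adjustment" means, the sketch below contrasts the BAM encoding with two common ASCII-offset encodings (Phred+33 and Phred+64). The table and function are illustrative assumptions and do not reproduce Goby's QualityEncoding class.

```python
# Offsets subtracted from each stored byte to obtain a Phred score.
# BAM stores the Phred score directly, so its offset is zero.
OFFSETS = {"BAM": 0, "SANGER": 33, "ILLUMINA": 64}

def to_phred(quality_bytes, encoding):
    """Convert stored quality bytes to Phred scores for the given encoding."""
    offset = OFFSETS[encoding]
    return [value - offset for value in quality_bytes]

print(to_phred(bytes([30, 40]), "BAM"))
print(to_phred(b"II", "SANGER"))
print(to_phred(b"hh", "ILLUMINA"))
```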
– Fixed sam-extract-reads. Was not extracting sequences from BAM files.
– compact-to-fasta mode: now supports reading an arbitrary slice of input.
– sam-to-compact mode: draft support for importing SAM files produced by BSMAP.
– Fixed a bug that prevented running sam-to-compact mode from the command line. An assertion prevented the code from running from the command line. Clarified the text of the assertion error and read the required parameter from the command line argument so that the mode will run again on SAM files generated outside of Goby.
– reformat-compact-reads must trim quality scores in the same way that it trims the sequence. Quality scores were not trimmed in previous versions. This is now fixed.
– reformat-compact-reads now correctly processes sequence pairs. Sequence pairs and quality scores can now be trimmed in the same way as the primary sequence.
– Expose sampleFraction via API and command line for read-quality-stats mode
– Make fasta-to-compact mode more callable via API
– reformat-compact-reads during ‘mutate’ will no longer complain when there is no sequence pair to mutate (mutation is neither attempted nor complained about if sequence.length is zero).
– fasta-to-compact mode: fix bug that prevented checking that quality scores are in the allowed range. Quality scores must now be converted within the correct score range before the compact-reads file can be written successfully.
– Parallelize the estimation of statistics. This can speed up mode alignment-to-annotation-counts.
– Introduced a field spliced_alignment_link and spliced_flags in AlignmentEntry to represent relation between parts of reads that span exon-exon junctions.
– Introduced insert_size in Alignment entry to represent the size of the insert used when making the sequence library.
– Introduced meta-data in compact-reads files. Meta-data provide a way to document how the sample was obtained. Suggested information to be recorded includes when the library was sequenced (useful to detect batch-effect, as suggested by a participant to the SEQC meeting at the NIH Bethesda campus), as well as sequencing instrument. Modes fasta-to-compact, compact-file-stats and reformat-compact-reads have been updated to define, transfer or display meta-data when appropriate.
– Mode compact-alignment-stats now prints statistics about paired-end reads.
– Removed spurious SAM header when writing alignments in plain text format.
– New fdr mode provides a tool to combine tab delimited files where some columns contain P-values and adjust selected P-values for multiple testing with the Benjamini-Hochberg method. The tool is efficient in that it only keeps the P-values that need to be adjusted in memory and keeps the other columns on disk. This strategy is expected to scale to hundreds of millions of lines of information.
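For reference, the Benjamini-Hochberg adjustment itself can be sketched in a few lines. This is the textbook step-up procedure held entirely in memory, not Goby's disk-backed implementation; the function name is an assumption.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted q-values in the same order as the input p-values."""
    m = len(p_values)
    # Indices of the p-values sorted by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Only the p-values and their original positions are needed for the adjustment, which is why the remaining columns of each line can stay on disk.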
– Add a way to open only a slice of an indexed alignment file by position. This feature makes it possible to retrieve all alignment entries that start between specific position boundaries. See new constructor in AlignmentReader and ConcatSortedAlignmentReader.
– Fix a bug in skipTo that caused some alignment entries to fail to be returned (skipTo previously ignored entries that occurred in the chunk just before where the index points). This behaviour is incorrect because the chunk just before where the index points may contain entries with positions equal to the skipTo requested position. The index contract is to return the chunk that starts with an entry with the requested location. Because chunks contain multiple entries with increasing positions, the chunk immediately before the indexed chunk must be scanned and filtered to remove entries with positions before the skipTo requested position. A new test was written to check for this issue (TestSkipTo.testFewSkips4).
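The scanning rule can be sketched with a toy model in which each chunk is a non-empty, sorted list of (target, position) tuples and the index holds the first entry of every chunk. This is a simplified illustration under those assumptions, not the AlignmentReader implementation.

```python
import bisect

def skip_to(chunks, target, position):
    """Return every entry located at or after (target, position)."""
    key = (target, position)
    first_entries = [chunk[0] for chunk in chunks]
    # First chunk whose leading entry is at or after the requested location.
    indexed = bisect.bisect_left(first_entries, key)
    # The previous chunk starts before the key but may still end with
    # entries at or after it, so scanning starts one chunk earlier.
    start = max(indexed - 1, 0)
    return [entry for chunk in chunks[start:] for entry in chunk if entry >= key]

chunks = [
    [(0, 1), (0, 5)],   # ends with an entry at the requested location
    [(0, 5), (0, 9)],   # the chunk the index points to for (0, 5)
    [(1, 2)],
]
print(skip_to(chunks, 0, 5))
```

Starting the scan at the indexed chunk alone would miss the (0, 5) entry that closes the first chunk, which is the failure mode described above.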
– Provide Building/Installation instructions for the Goby C++/C API.
– Implemented a fast concatenation operation for read files. The new -q flag in ConcatenateCompactReadsMode activates the fast concatenation. Chunks of compressed data are appended without requiring decompression and compression of the entries. This results in much faster concatenation, bounded only by available IO.
– Add mapping_quality field to AlignmentEntry protobuf schema.
– Add aligner name and version in AlignmentHeader protobuf schema.
– Added C/C++ API methods to set aligner name and version, and alignment entry mapping quality.
– Updated the C API to be more generic, less oriented toward any one particular 3rd party tool. The read-API is now more generic, the write-API hasn’t changed. The C API files, including the .h header files, have been renamed.
– In C_Alignments.c/.h & C_CompactHelpers.h, added CSamHelper and samHelper_* methods to assist with adapting BWA to write compact alignments, since the data held by BWA just prior to writing alignments is effectively already in SAM format. These methods make it possible to reconstruct the aligned query and reference so that data can be written in the compact alignment format.
– Goby C/C++ API now requires the pcre (regex) >=8.10 library. See http://www.pcre.org/
– C API introduced to provide native Goby support in GSNAP.
– We now distribute a subset of Goby as the Goby IO API. This subset is packaged in the goby-io.jar
file and released under the LGPL3 license. This was done to make it possible to include Goby format
input output code directly into other software licensed under the LGPL3.
– Fixed a bug that prevented Goby from opening large alignment files (>3Gb).
– Fixed a bug in AlignmentIterator triggered when reading alignment files with targetIndices starting at
numbers larger than zero.
– Removed dependency on colt (because it is not a pure LGPL license by adding restriction in military
– SGE helper scripts bz2compact.sh and keep-unique-reads.sh help process hundreds of lanes in
parallel on an SGE grid. bz2compact extracts fastq files compressed with BZip2 and converts
them to compact-reads format. keep-unique-reads.sh determines the set of reads that are unique
in each input .compact-reads and writes this information to a .uniqset-keep.filter
– Mode concatenate-compact-reads now supports read index filters. This makes it possible to
concatenate and keep only reads that are unique within each file.
– Draft helper to iterate through individual reference positions of a sorted set of alignments
– Alternative implementation of sequence-variation-stats mode (called sequence-variation-stats2)
that determines the number of reference bases matched at a given read index. This info is needed
to call sequence variants, but slows down the stats. The initial implementation is preserved for
– New mode discover-sequence-variants will either (i) identify sequence variants within a group of samples
or (ii) identify variants whose frequency is significantly enriched in one of two groups.
This mode requires sorted/indexed alignments as input.
– SamToCompact mode now populates the read quality scores for sequence variations (toQuality field).
– Update picard/samtools to version 1.25.
– In the mode “alignment-to-annotation-counts” the “–eval” option supports
a new value “counts” which will output a format specifically designed
for use with R’s DESeq and notably for the R script geneDESeqAnalysis.R
which is used with GobyWeb.
– Fix bug in extract sequence variations for SAM format, where matches on the
reverse strand got a read-index larger than one from the correct value.
– By default, don’t use “counts” in DiffExp as it is a specialized output for preparing for DESeq.
– API interface for ReadsToWeightsMode.
– LastToCompactMode wasn’t writing target lengths. Fixed.
– Read TMH in Python using Gzip.
– Fixed Python utilities so -o actually writes to a file.
– Added transcript-align.sh script to assist with aligning via transcripts.
– In MessageChunksWriter, flush logic should occur on a COMPLETELY empty file, but otherwise it
should only occur if entries have been added since the last flush(). This applies to both C++ and Java.
– DiffAlignmentMode can better compare alignments produced by two different aligners
when targets have the same labels but different TargetIndex values. It does so
by building a master TargetIndex and translation maps for the two alignments.
Targets are now shown by label name instead of TargetIndex.
– CompactFileStats –verbose on a compact alignment shows the targetIndex -> targetIdentifier
map and also displays the targetLength for that targetIndex.
– Extended fasta-to-compact and compact-to-fasta to handle paired-end runs. See new command
line arguments –paired-end and pair-indicator in fasta-to-compact and
–pair-output in compact-to-fasta.
– Draft support for paired sequence runs. The compact file format is extended to store
sequence, sequence length and quality scores for the paired run. This extension makes
it possible to store both paired end runs in a single compact file. This should help
keep the data together.
– Implemented translation to and from Solexa quality score encoding in fasta-to-compact
and compact-to-fasta. Thanks to Cock PJA et al NAR 2010 for the clear description of the
Solexa base quality scores.
– The sort mode now supports reading only a slice of an input alignment (see options
–start-position and –end-position).
– Refactored CompactAlignmentToAnnotationCountsMode to use IterateAlignments (provides
large speed ups when working with sorted/indexed alignments and selecting a subset of
reference sequences for DE).
– IterateAlignments now takes advantage of the skipTo method when the alignment is sorted
and indexed. This provides large performance improvements when one needs to access data
for only a few reference sequences in an alignment. All the modes that use
IterateAlignments benefit, including display-sequence-variations, and
– Index alignments that are sorted upon writing. The skipTo method leverages the index
to provide fast semi-random access to entries by genomic location. This feature is used
by the IGV Goby plugin, which requires Goby 1.7+.
– Concatenate alignment now produces sorted alignments if all the input alignments
– Added a mode to sort alignment by reference sequence and then by position
on the reference sequence.
– Support for estimating read weights as described in Hansen KD et al NAR 2010.
In contrast to the initial publication, Goby supports using the weights to
reweight annotation counts and transcript counts.
– Support for estimating GC content weights for reads and for reweighting raw counts to
remove the dependence of counts on GC read content.
– Preliminary support for barcoded reads (barcodes in the sequence), see new
mode decode-barcodes (and tutorial online at
– alignment-to-*-counts: New –eval argument allows specifying which statistics
to evaluate when comparing samples.
– alignment-to-*-counts: New eval option ‘samples’ writes a column per sample
for RPKM, log2(RPKM) and raw counts. RPKM and log2(RPKM) are written once per sample
and global normalization method.
– Reduce memory requirements when concatenating many alignments. A change
introduced in 1.6 caused more memory than needed to be allocated for each
split of an alignment (as much as the number of reads in the file that
was split). Each split now uses only as much memory as needed to keep
query lengths for the split.
– Dramatically improved performance for differential expression tests with millions of
differentially expressed elements (e.g., exon+gene+other). The code previously
incorrectly grew internal arrays from zero to the number of new DE elements described
in the annotation file.
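
The Benjamini-Hochberg adjustment applied by the fdr mode listed above can be illustrated with a short sketch. This is not Goby's implementation: the function name benjamini_hochberg is hypothetical, the example is written in Python rather than Java for brevity, and the on-disk handling of the other columns is omitted. It only shows the step-up adjustment applied to the P-values held in memory.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values, in the input order.

    Hypothetical helper for illustration only; not part of Goby.
    """
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down to the smallest so that the
    # adjusted values never increase as the rank decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Only the P-values and their original order need to be in memory for this step, which is consistent with the strategy described for the fdr mode; the adjusted values can then be written back next to the untouched columns in a second pass over the file.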
Changes that impact the compact alignment format:
– The compact file format is extended to store sequence, sequence length and quality scores
for the paired run. This extension makes it possible to store both paired end runs in a
single compact file. This should help keep the data together.
– Moved query lengths from header to alignment entries. This scales much
better when processing large alignment files (generated from more than
a few hundred million reads).
– The optional ‘sorted’ attribute in header indicates if an alignment has been sorted.
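
For sorted and indexed alignments, the skipTo fix described earlier in these notes can be illustrated with a small sketch. This is a simplified model, not Goby's code: the chunk layout (lists of integer positions), the index built from the first position of each chunk, and the function name skip_to are all assumptions made for illustration. It shows why scanning must start one chunk before the chunk the index points to.

```python
from bisect import bisect_left


def skip_to(chunks, position):
    """Return every entry with a position >= `position`.

    `chunks` is a list of non-empty lists of integer positions, sorted
    within and across chunks (a hypothetical stand-in for compressed
    chunks of alignment entries).
    """
    # The index records the position of the first entry of each chunk.
    index = [chunk[0] for chunk in chunks]
    # First chunk that starts at or after the requested position.
    indexed = bisect_left(index, position)
    # Start one chunk earlier: its tail may hold entries located at or
    # after the requested position.
    start = max(indexed - 1, 0)
    result = []
    for chunk in chunks[start:]:
        # Drop entries located before the requested position.
        result.extend(p for p in chunk if p >= position)
    return result
```

With chunks [[1, 3, 5], [5, 5, 8], [9, 12]] and a requested position of 5, the index points at the second chunk, yet the first chunk also ends with an entry at position 5. Starting the scan at the indexed chunk would miss that entry, which is the kind of omission the fix addresses.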