The change log of the project is listed below. You can also view change logs for previous versions:
Goby 1.0 to 22.214.171.124.1 [Jan 2010 to Jun 2012]
2.3 [apr 26 2013]
- removed LeftOver filter for somatic variation output. The somatic variation format appears quite stable in our tests.
- concatenate-alignments mode: add ability to restrict output to a genomic slice (see -s and -e options).
- API change: AlignmentSliceHelper makes it easier to parse and process genomic slices for sets of alignments.
- concatenate-alignments mode: now transfers read groups to output in the same way that non-sorted concat does.
- concatenate-alignments mode: Add a mechanism to override/define read groups/read origin info on the fly when reading alignments that did not include them. Coupled with changes to compact-to-sam, this makes it possible to get BAM files with read groups directly from Goby alignments.
- compact-to-sam mode: fixed output of read groups, which were not correctly written for platform, platform unit, and library.
- suggest-position-slices: add --restrict-per-chromosome option. When this switch is provided, slices will be restricted to start and end on the same chromosome. This is useful to produce intervals to give Mutect, for instance.
- Trim mode: add --trim-left and --trim-right parameters to control trimming of specific sequence extremities.
- Trim mode: add --verbose flag.
2.2.1 [apr 1 2013]
- FDR mode: add ability to read groups from VCF file and adjust columns/fields marked as p-value. Mark adjusted
columns with group q-value.
- Somatic variation output format: annotate somatic p-value column with ‘p-value’ group. Fix the type of the p-value
column to be a number (was String in release 2.2).
- Somatic variation output format: handle unrecognized sample-ids in the parents column.
- discover-sequence-variants mode: add assertion to give a hint to the user when the syntax of the -s and -e options is incorrect.
- compact-file-stats mode: print progress when scanning reads files. Use a buffered reader to improve read file
- discover-sequence-variants: adjust multiplier for left-over filter for somatic variations output format
- discover-sequence-variants: Add a new filter to remove indels at a site where a sample shows many distinct possible indels. Indels at these sites are very likely to be artefactual. We count the number of samples where three distinct indel genotypes are seen. If more than 1/4 of the samples have likely indel artifacts, we remove all indel candidates at the site. Dynamic options (name: description: default value): maxIndelPerSite: Maximum number of distinct indels at a given genomic site: 1. Additional filter: fractionOfSamples: Maximum fraction of samples that can have an indel candidate for the indel to be considered (indel candidates that occur in many samples are more likely to be spurious): 0.25. This filter is added to the somatic variations output format. See dynamic options for this filter with --x-help.
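The site-level indel filter described in the last entry can be sketched as follows. This is a minimal illustration, not Goby's implementation: the class name, method name, and input representation (one set of distinct indel genotypes per sample) are hypothetical, and the exact comparisons (at least three genotypes, strictly more than the fraction) are assumptions based on the wording of the entry.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the site-level indel filter; not Goby's actual code. */
final class IndelSiteFilterSketch {

    /**
     * Returns true when all indel candidates at a site should be removed.
     *
     * @param indelsPerSample        distinct indel genotypes seen at the site, one set per sample
     * @param distinctIndelThreshold a sample is flagged when it shows at least this many distinct indels
     * @param fractionOfSamples      indels are removed when the flagged fraction exceeds this value
     */
    static boolean shouldRemoveAllIndels(List<Set<String>> indelsPerSample,
                                         int distinctIndelThreshold,
                                         double fractionOfSamples) {
        if (indelsPerSample.isEmpty()) {
            return false; // nothing to filter at a site without samples
        }
        int flaggedSamples = 0;
        for (Set<String> indels : indelsPerSample) {
            if (indels.size() >= distinctIndelThreshold) {
                flaggedSamples++;
            }
        }
        return (double) flaggedSamples / indelsPerSample.size() > fractionOfSamples;
    }

    public static void main(String[] args) {
        List<Set<String>> site = List.of(
                Set.of("-/A", "-/AA", "-/AAA"), // three distinct indels: flagged
                Set.of("-/A"),
                Set.of());
        // 1 of 3 samples is flagged, which exceeds 0.25, so the indels are removed.
        System.out.println(shouldRemoveAllIndels(site, 3, 0.25));
    }
}
```

The thresholds are passed as arguments here so that the values quoted in the entry (three distinct indels, a fraction of 0.25) stay visible at the call site.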
2.2 [mar 19 2013]
- Remove threshold effects when calling genotypes in several samples. Modified the filters to not remove bases in specific samples when the genotype survived filters in at least one other sample. Previous versions reported these threshold edge effects as differences, which could be confusing. This version simply shows the marginal raw base counts in samples where the genotype could have been removed by a filter, which makes it easier to compare the strength of the genotype support across samples. This adjustment was done for both base genotypes and indel genotypes.
- LeftOverFilter: now uses minVariationSupport as minimum threshold.
- Mode suggest-position-slices: add option number-of-bytes to suggest slices with a uniform number of compressed bytes. This option aims to provide more balanced slices in cases where the genome has very non-uniform coverage by position. With this option, the number of slices is determined to yield slices that need to decompress about the number of bytes indicated on the command line.
- Framework API change: introduce class PositionToBasesMap<T> to use as type for positionToBases. The class provides methods to get the range of positions described in the map. This unfortunately requires changes to all clients/implementations of IterateSortedAlignments<T>.
- Mode discover-sequence-variants: Fix various problems that prevented reporting genotypes for deletions (i.e., C/-).
- Fix a potential NPE in GroupAssociations when samples are null.
- Fix for issue #2, see https://github.com/CampagneLaboratory/goby/issues/2
- Expose comparator in SortedAnnotations.
2.1.2 [dec 31 2012]
- Upgrade xstream to version 1.4.3. This fixes the compatibility problem seen when running goby 2.1.1 with java 1.7+. Goby 2.1.2 should run with Java 1.7+, but more testing will be needed to rule out other migration problems. If you are running JDK 1.7+ please let us know any issues you encounter.
- Fix VCFParser issue https://github.com/CampagneLaboratory/goby/issues/1. The issue could be triggered when the FORMAT column changed from line to line.
- VCFWriter: improve support for VCF group associations. The Goby VCF parser makes it possible to associate columns to groups (these associations are written in a ##FieldGroupAssociations field).
- Methylation rate VCF output: mark the context column with group ‘indexed’.
- Do not try to upgrade alignments when reading the header to concatenate permutations. This is not necessary and can open too many files when we are trying to concatenate alignments.
2.1.1 [nov 19 2012]
- Add extract-splicing-events mode. This mode is used by GobyWeb 1.9 to extract splicing events from spliced Goby alignments (generated either by GSNAP or STAR at this time).
- Trim mode: Fix bug that caused quality scores to be duplicated (the bug triggered the assertion that checks that sequence length equals quality length).
- Trim mode: Some sequence must remain after trimming to append to the output.
- Fix bug in alignment-to-annotation-counts where counts would be zero for samples whose name contained a period ‘.’. The code was incorrectly stripping alignment extensions twice.
- alignment-to-annotation-counts: add comparison description to t-test statistic column name (e.g. t-test[A/B] rather than t-test). This change makes it possible to retrieve the t-test p-values when more than one comparison is performed.
- Fix a bug where RandomAccessAnnotations could return results on a different chromosome.
- Add annotation loading test and fix for when annotation file is truncated. Goby now loads annotations up to the truncation and logs truncated lines.
- Correct calculation for fold-change-magnitude column in goby diff exp mode. Previous calculation under-estimated magnitude when comparing low rpkms.
- Fix a problem where AlignmentReaderImpl.canRead would return true when the file ended with an incorrect extension (this problem could create subtle issues when Goby tried to access .info.txt files on a web server that did not return 404 errors for missing content).
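The entry above about sample names containing a period describes a double-stripping bug. One way to avoid it is to strip only known alignment extensions, which makes the operation safe to apply more than once. The sketch below is an assumption-laden illustration, not Goby's code: the class and method names are hypothetical and the extension list is illustrative only.

```java
import java.util.List;

/** Hypothetical sketch: derive a basename by stripping one known alignment extension. */
final class BasenameSketch {
    // Illustrative list only; the real set of Goby alignment extensions may differ.
    private static final List<String> EXTENSIONS = List.of(".entries", ".header", ".tmh");

    /** Strips at most one known extension, so dots inside the sample name survive. */
    static String basename(String filename) {
        for (String extension : EXTENSIONS) {
            if (filename.endsWith(extension)) {
                return filename.substring(0, filename.length() - extension.length());
            }
        }
        return filename; // already a basename
    }

    public static void main(String[] args) {
        System.out.println(basename("sample.1.entries"));           // sample.1
        System.out.println(basename(basename("sample.1.entries"))); // sample.1
    }
}
```

Stripping everything after the last dot would turn sample.1 into sample on the second call, which matches the symptom described in the entry. Matching against a fixed list of extensions does not have that problem.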
2.1 [jul 20 2012]
- Improve compression of hybrid-1 codec by about 8% on average at similar speed. You can enable this improvement with option -x AlignmentCollectionHandler:symbol-modeling=plus. This option will be made the default in a future release. It is not currently the default since Goby 2.1 has just been integrated into IGV and will need time to propagate from IGV dev to production builds.
- Remove import of NH:i bam tags as read-origin-index, since the NH tag seems to contain different types of data depending on the aligner that produced the alignment.
- compact-to-sam mode: fix bug where bam tags containing a colon character (:) would be truncated after the first colon. Thanks to Vadim Zalunin for reporting this problem.
- compact-file-stats: Add a feature to scan only alignment headers.
- VCFParser group associations: Make it possible to look up an INFO column by either INFO/colname or colname.
- NonAmbiguousAlignmentReader: fix an NPE when reading alignments where all entries have the ambiguity field.
- Fix a problem where AlignmentReaderImpl.canRead would return true when the file ended with an incorrect extension (this problem could create subtle issues when goby tried to access .info.txt files on a web server that did not return 404 errors for missing content). Thanks to Jim Robinson and Helga Thorvaldsdottir for reporting this issue.
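The compact-to-sam entry above concerns tag values that contain a colon. SAM optional fields have the form TAG:TYPE:VALUE, so a parser has to stop splitting after the second colon. The sketch below shows one way to do that with String.split and a limit. The class and method names are hypothetical, the example tag is invented, and this is not the code used by Goby.

```java
/** Hypothetical sketch: parse a SAM optional field of the form TAG:TYPE:VALUE. */
final class SamTagSketch {

    /** Returns {tag, type, value}; the value keeps any colons it contains. */
    static String[] parse(String field) {
        // A limit of 3 stops splitting after the second colon.
        String[] parts = field.split(":", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected TAG:TYPE:VALUE but got: " + field);
        }
        return parts;
    }

    public static void main(String[] args) {
        String[] parts = parse("XA:Z:chr1:100:+");
        System.out.println(parts[2]); // chr1:100:+
    }
}
```

Splitting without a limit and keeping only the third element would return chr1 for the same input, which is the truncation described in the entry.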
2.0.1 [jun 30 2012]
- Release Goby C/C++ APIs under the LGPL license version 3 to make it possible for companies to incorporate support
for Goby formats in their tools. Thanks to Collin Hercus for the suggestion. Please note that parts of the Goby
Java APIs are already licensed under the LGPL (anything packaged under the Goby-io.jar file).
- C++ API: Support to set placed unmapped (i.e., mate that does not map is recorded with the read that mapped)
and clipleft/clipright with quality scores.
- Fix problem when using a genome backed by a samtools/picard faidx file. In some cases, read bases would be returned
shifted by one position. Thanks to James Bonfield for reporting this problem.
- SAM/BAM tags start at column 12, index 11. --preserve-all-tags could skip the first tag on some datasets (e.g.,
datasets where the first tag was not an MD:Z or RG:Z). Thanks to James Bonfield for reporting this problem.
- Introduce interface for ReadsWriter. Introduce mock implementation to write reads to text. This is useful to write
more intelligible JUnit tests.
- mode sam-to-compact now supports option --read-names-are-query-indices to indicate that the read names are integers
(typically produced by compact-to-fasta from a chunk of a large file).
- Fix a bug in reformat-compact-reads which did not trim quality scores for paired end reads correctly.
2.0 [jun 15 2012]
Substantial new features introduced in 2.0 are described in this tutorial.
- Introduce chunk codecs for protocol buffer encoded collection messages (supports both reads and alignments).
- Refactor AlignmentWriter to introduce an interface and make it easier to create facades that modify the behaviour
of the default writer. For instance, such a facade is BufferedSortingAlignmentWriter, which keeps a number of entries
in memory to re-sort these entries by genomic position. This feature is used when importing already sorted SAM/BAM
files to create sorted Goby alignments and the files contain spliced alignments that would cause mis-ordering during
- Fixes to sam-to-compact mode. Previous versions could fail for a variety of reasons. We have stress tested this mode
by throwing various input BAM files at it, sorted or not, and fixed the bugs we found. For instance, the --sorted option
would not work in some 1.9 versions of Goby after samtools/picard changed the semantics of the record comparator Goby
relied upon to verify the input was indeed sorted by position. This made it impossible to convert already sorted BAM
files as sorted Goby alignments.
- Added SortLargeMode which can sort compact alignments of any size, multithreaded. -m sort now invokes LargeSortMode.
- Add ability to preserve SAM/BAM read groups. Read groups are automatically preserved if present in the input BAM file.
The concatenate mode automatically reassigns read origin indices (see field read_origin_index) to prevent conflicts
when Goby files from different origins are concatenated. The approach we use is to keep the most specific read origin
information, and let the client decide what origins/groups are equivalent given the type of analysis at hand.
Read groups are supported by the hybrid codec (and therefore stored very efficiently), are imported from BAM with
sam-to-compact and are exported back to SAM/BAM with the compact-to-bam mode.
- Add ability to preserve all BAM attributes during import and export. Use --preserve-all-tags in mode sam-to-compact
to enable this.
- Add ability to preserve all quality scores. Use --preserve-all-mapped-qualities in mode sam-to-compact.
- Supports bzip2 compression in fasta-to-compact mode and sam-extract-reads (use the -x MessageChunksWriter:codec=bzip2
- Moved error messages produced when parsing the command line of a mode to after usage. This is a simple change that
will make it easier to diagnose problems on a command line without having to scroll back up the console.
- Make it possible to open Goby alignments through HTTP. Simply specify a URL as a basename as argument to the goby
tools. This is supported broadly by the API, so the concatenation reader also supports URLs, for instance. TMH files
currently cannot be loaded remotely. Alignments that require upgrading will also fail to load remotely.
- Refactor dynamic options with a central registry, and make GobyDriver handle option parsing.
This removes duplication of code parsing for each mode that would need dynamic options.
- Support multiple group comparisons for RNA-Seq diff exp (mode compact-alignment-to-annotation-counts).
- The mode for methylation region can now estimate empirical p-values. Empirical p-values require biological replicates in at least
one of the groups under analysis. Two passes over the data are required. In the first pass, the empirical null
distribution is observed by comparing pairs of samples in the same group. In the second pass, this distribution is
used to estimate the p-value of observing the between group differences.
- Support empirical p-value for individual bases (VCF output). Write a DMR INFO field that stores how many significant
sites were found in a moving window that ends at the site (significance is judged according to a configurable
threshold on the empirical p-value).
- New empirical-p mode to estimate p-values from data in text files. This makes it easier to derive p-values for
simulated data or counts generated by tools other than Goby.
- Add a draft implementation of random access sequence interface that can read a fasta file indexed with faidx.
- Added the ability in alignment-to-text mode to output HTML (-f html), to start/end at offsets (-s/-e) in the alignments and
to limit the number of alignment entries to output (-n).
- The RandomAccessSequenceCache had problems with bases that weren’t G/A/T/C/N. Such bases would be skipped silently,
causing rare, but potentially significant, problems (such as on human chr 3 of the 1000g genome reference where an
R base appears). Bases not in the group G/A/T/C/N would introduce position shifts for bases immediately following
the offending character. Now bases other than G/A/T/C are stored as N and maintain the position of the following
bases. Please note that the problem was in a library used by RandomAccessSequenceCache, we updated the library in
this release, and no change to the code of RandomAccessSequenceCache was needed to fix the problem.
- Added a mode sam-comparison to compare a source SAM/BAM file with one that generated after sam-to-compact then
- last-to-compact: add option to substitute some bases with others in the aligned read.
- Add test and fix for bug that went back to start of alignment file, even though iterate alignment was created for a
slice of input. The problem only affected the IterateAlignments class because it was calling reposition(0,0) and the
method did not enforce slice limits.
- The code base was simplified by removing the now obsolete align mode.
- Fix a problem where sample names with several dots were stripped of too many extensions. For instance, a.b.c.entries
would be reduced to a, which could be non-unique across the remaining samples. Problem reported by Fang Fang in her
data on GobyWeb.
- DistinctIntValueCounterBitSet now uses LongArrayBitVector as its bit set implementation. The java BitSet implementation
was found to throw java.lang.ArrayIndexOutOfBoundsException for indices that should fit easily in a bit array (e.g.,
2,080,948 which can be stored with about 230 MB).
- AlignmentEntry field insertSize is now stored in protobuf with sint32 rather than uint32 since negative values can be
stored in this field.
- The mode sample-quality-scores now supports .sam, .sam.gz, and .bam files to make a guess at the scale of
the quality scores contained in these files.
- Fixed a problem with concatenate-compact-reads that previously transferred only specific fields of a read to the
output file. concatenate-compact-reads now transfers all fields (including pair sequence and quality score).
- Make default chunk-size dependent on the type of chunk codec used. This is useful because hybrid compression does
better with larger chunk sizes (default chunk size for hybrid is 30000, 20000 for bzip2 and 10000 for gzip). The
default chunk size can be overridden with -x MessageChunksWriter:chunk-size=int. Note that smaller chunk sizes reduce the time
needed to seek at specific positions of sorted alignments. We do not recommend using values larger than 100,000
because performance of interactive visualization of the alignments could suffer (e.g., in IGV).
- version mode now prints an official version number if the jar contains a VERSION.txt file.
- Prevent logging when the log4j system has not been configured. For some reason, LOG.isDebugEnabled can return true
when the logging system is not initialized. For SamHelper, this means calling String.format millions of times to
create debug output that is never shown. This change dramatically improves the performance of the sam-to-compact mode
when logging is not properly configured.
- Fix issues with the barcode-decode mode. Add support for processing fasta/fastq files.
- vcf methylation format: removed space in name of C and Cm group INFO fields.
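The two-pass empirical p-value estimation described in the methylation region entry of this release can be sketched as follows. This is a simplified illustration under stated assumptions, not Goby's implementation: the class and method names are hypothetical, the statistic is a plain absolute difference between two values, the add-one correction is a choice made for this sketch, and the input numbers are invented for demonstration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a two-pass empirical p-value; not Goby's implementation. */
final class EmpiricalPSketch {

    /** Pass 1: absolute differences between all pairs of replicates in one group. */
    static List<Double> nullDistribution(double[] replicates) {
        List<Double> differences = new ArrayList<>();
        for (int i = 0; i < replicates.length; i++) {
            for (int j = i + 1; j < replicates.length; j++) {
                differences.add(Math.abs(replicates[i] - replicates[j]));
            }
        }
        return differences;
    }

    /** Pass 2: fraction of null differences at least as large as the observed one. */
    static double empiricalP(List<Double> nullDifferences, double observedDifference) {
        if (nullDifferences.isEmpty()) {
            throw new IllegalArgumentException("Replicates are required to build the null distribution");
        }
        int atLeastAsExtreme = 0;
        for (double difference : nullDifferences) {
            if (difference >= Math.abs(observedDifference)) {
                atLeastAsExtreme++;
            }
        }
        // Add-one correction (an assumption of this sketch) keeps the estimate above zero.
        return (atLeastAsExtreme + 1.0) / (nullDifferences.size() + 1.0);
    }

    public static void main(String[] args) {
        // Invented methylation rates for three replicates of the same group.
        List<Double> nullDifferences = nullDistribution(new double[] {0.50, 0.75, 0.25});
        // Invented between-group difference to evaluate against the null distribution.
        System.out.println(empiricalP(nullDifferences, 0.5));
    }
}
```

The first pass needs at least two replicates in a group to produce any within-group difference, which is why the entry states that biological replicates are required in at least one of the groups.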