The change log of the project is listed below. You can also view change logs for previous versions:
Goby 1.0 to 18.104.22.168.1 [Jan 2010 to Jun 2012]
2.3.6 [June 27 2016]
This release improves the scalability of the local realignment processor, which is now able to process RNA-Seq reads for entire samples. The previous version had a bug where data was retained in memory, which eventually led to out-of-memory errors or severely degraded performance on such samples.
2.3.5 [march 2 2015]
– Add a mode to infer the sex of samples from data (tested on exome data). Useful as a quality control to check that the data agree with what is known about the samples. See --mode infer-sex. Works faster on sorted alignments, where the index is used to jump quickly to the human sex chromosomes.
– Prevent AbstractAlignmentToCompactMode from printing more than 10 warnings if quality scores are not available in an alignment.
– suggest-position-slices: fix a bug that caused some slices to overlap. Found with a job with hundreds of alignments, so not common.
22.214.171.124 [aug 29 2014]
– Add an option to the fasta-to-compact mode that will convert a set of files and concatenate the result to a single compact-reads file (see new --concat option).
– Add a mode to test that the connection from Goby to R is working (requires JRI and R built with shared library support). The mode is called test-r-connection (tcr).
– Fix a bug that caused some slices to occur within annotations, despite the --annotation option being given on the command line to the suggest-slices mode. The problem was that the chromosome index was not obtained from the genome and was always set to zero. In rare cases, this would cause one annotation to be omitted from the output (when the annotation overlapped with the alignment split by genomic position). Thanks go to Laurent Mesnard for reporting this problem.
– Restore STRICT_SOMATIC filter.
– Close files opened when loading Goby alignment header and index files. This fixes a "too many files" error that could occur when loading hundreds of alignments simultaneously.
– Allow lenient import mode for TSV files. This makes it possible to convert TSV files to lucene.index when they have been created with Goby in the past with a \t character as last character of the column line.
2.3.4 [may 1 2014]
- Optimize the speed of genotyping when some sites have very high coverage (>500M bases). Now sub-sampling to keep a random set of 10,000 bases for such sites. Expose the default sub-sample size with a dynamic option called sub-sample-size in IterateSortedAlignmentsListImpl. (-x IterateSortedAlignmentsListImpl:sub-sample-size <int>)
- LastToCompact mode now supports the import of paired end alignments produced by Last’s last-pair-probs.sh.
- LastToCompact mode now supports the import of quality scores (lastal must be done with -Q1 since the import assumes Phred quality scores on the q lines).
- Add two methods to AlignmentReader to determine the minimum and maximum genomic locations represented in the reader. This is useful when suggesting slices to split a set of alignments. This commit includes a fix for possible null start or end positions in slices generated with suggest-position-slices.
- Fix a problem with run-in-parallel where some threads would never finish when they do not detect the keyword. Now indicate that the thread finished so that others can start when the processing completes.
- Reads-file-stats mode: remove any path from basename in the output.
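The sub-sampling of very high coverage sites mentioned in this release can be illustrated with a short sketch. This is not Goby's code: the changelog only says that a random set of 10,000 bases is kept, so the use of reservoir sampling and the name `sub_sample` are assumptions made for illustration.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def sub_sample(items: Iterable[T], sample_size: int = 10_000,
               rng: Optional[random.Random] = None) -> List[T]:
    """Keep a uniform random subset of at most sample_size items.

    Reservoir sampling (Algorithm R): a single pass, with memory bounded
    by sample_size regardless of how many items the site contributes.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for seen, item in enumerate(items):
        if seen < sample_size:
            # Fill the reservoir with the first sample_size items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability sample_size / (seen + 1).
            slot = rng.randint(0, seen)  # inclusive on both ends
            if slot < sample_size:
                reservoir[slot] = item
    return reservoir
```

Sites with fewer bases than the sample size are returned unchanged, so the sub-sampling only affects sites with extreme coverage.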
2.3.3 [dec 20 2013]
- IterateSortedAlignmentsListImpl: Use a WarningCounter to limit warnings to 10 instances. This is needed to avoid writing gigabytes of log output when the threshold is met.
- discover-sequence-variants somatic output: Make it possible to run a simple trio design by removing the requirement for a germline sample.
- discover-sequence-variants somatic output: Earlier versions reported somatic variation candidates when the two parents were homozygous and the somatic sample was heterozygous (the Fisher p-value with each parent is very significant in this case, but does not indicate a somatic change). This also improves q-values because there are fewer results that need to be corrected.
- discover-sequence-variants somatic output: Add an error message when a sample is mis-spelled in the covariates file.
- Refactor code base to keep base counts for forward and reverse strands separately in SampleCountInfo.
- Normalize somatic priority score by number of mapped reads, and number of parents and germline samples used in the calculation.
- Add a StrandBiasFilter in somatic analyses. The filter rejects variations that are not represented on both strands when at least j reads support the variation. The value of j is set to 9 by default, so a variation with 10 bases needs to have both strands represented. Remove candidate somatic variations that can occur when the germline samples have less coverage than the somatic sample. Now require at least twice as much coverage in the somatic sample as the minimum coverage in the germline samples.
- Add a STRICT_SOMATIC filter that flags genomic sites where some bases appear in support of the variation in the parents or germline samples. Please note the VCF spec semantic: PASS indicates that all filters passed. This means that lines with the STRICT_SOMATIC value in the FILTER column failed that test.
- Fix a bug in FDR mode that would not handle VCF files with non-default FILTER values.
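The StrandBiasFilter rule described in this release can be sketched as follows. This is a minimal illustration, not the Goby implementation: the function and parameter names are hypothetical, and the threshold is read literally from the entry as "at least j supporting reads", with j = 9.

```python
def passes_strand_bias_filter(forward_count: int, reverse_count: int,
                              min_support: int = 9) -> bool:
    """Return True if a variation survives the strand bias check.

    forward_count / reverse_count: reads supporting the variation on each
    strand. min_support plays the role of j in the changelog entry.
    """
    total = forward_count + reverse_count
    if total < min_support:
        # Too few supporting reads to judge strand representation.
        return True
    # With enough support, both strands must show the variation.
    return forward_count > 0 and reverse_count > 0
```

For example, a variation supported by 10 forward reads and no reverse reads is rejected, while one supported by 3 forward reads only is left for the other filters to judge.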
2.3.2 [jul 22 2013]
– run-parallel-mode now supports paired input files.
– fasta-to-compact: add --force-quality-encoding option to force the quality values within the specified encoding range.
– suggest-position-slices: fix problem where first slice of genome was omitted from output (with new split by number of bytes option introduced in 2.3).
2.3.1 [jul 2 2013]
– Fix for https://github.com/CampagneLaboratory/goby/issues/3
– Upgrade commons-io and dsiutils to latest jar versions. Log messages when scanning reads file with cfs mode.
– DistinctValueCounterBitSet: now grows to biggest size at construction time.
– Fixed a subtle performance problem. When reading large reads files (>10GB), the performance of ReadsReader would degrade over time. This was due to caching of data in static protobuf methods of ReadCollection. We now create a builder instance that gets garbage collected when it is no longer used. The same fix has been applied to alignment readers.
2.3 [apr 26 2013]
- removed LeftOver filter for somatic variation output. The somatic variation format appears quite stable in our tests.
- concatenate-alignments mode: add ability to restrict output to a genomic slice (see -s and -e options).
- API change: AlignmentSliceHelper makes it easier to parse and process genomic slices for sets of alignments.
- concatenate-alignments mode: now transfers read groups to output in the same way that non-sorted concat does.
- concatenate-alignments mode: Add a mechanism to override/define read groups/read origin info on the fly when reading alignments that did not include them. Coupled with changes to compact-to-sam, this makes it possible to get BAM files with read groups directly from Goby alignments.
- compact-to-sam mode: fixed output of read groups, which were not correctly written for platform, platform unit, and library.
- suggest-position-slices: add --restrict-per-chromosome option. When this switch is provided, slices will be restricted to start and end on the same chromosome. This is useful to produce intervals to give Mutect, for instance.
- Trim mode: add --trim-left and --trim-right parameters to control trimming of specific sequence extremities.
- Trim mode: add --verbose flag.
2.2.1 [apr 1 2013]
- FDR mode: add ability to read groups from VCF file and adjust columns/fields marked as p-value. Mark adjusted columns with group q-value.
- Somatic variation output format: annotate somatic p-value column with ‘p-value’ group. Fix the type of the p-value column to be a number (was String in release 2.2).
- Somatic variation output format: handle unrecognized sample-ids in the parents column.
- discover-sequence-variants mode: add assertion to give a hint to the user that the syntax is incorrect for the -s and -e options.
- compact-file-stats mode: print progress when scanning reads files. Use a buffered reader to improve read file
- discover-sequence-variants: adjust multiplier for left-over filter for somatic variations output format
- discover-sequence-variants: Add a new filter to remove indels at a site where a sample shows lots of distinct possible indels. Indels at these sites are very likely to be artefactual. We count the number of samples where three distinct indel genotypes are seen. If more than 1/4 of the samples have likely indel artifacts, we remove all indel candidates at the site. The filter has two dynamic options: maxIndelPerSite (maximum number of distinct indels at a given genomic site; default 1) and fractionOfSamples (maximum fraction of samples that can have an indel candidate for the indel to be considered, since indel candidates that occur in many samples are more likely to be spurious; default 0.25). This filter is added to the somatic variations output format. See dynamic options for this filter with --x-help.
2.2 [mar 19 2013]
- Remove threshold effects when calling genotypes in several samples. Modified the filters to not remove bases in specific samples when the genotype survived filters in at least another sample (previous versions reported these threshold edge effects as differences, which could be confusing, this version simply shows the marginal raw base counts in samples where the genotype could have been filtered by a filter, which makes it easier to compare the strength of the genotype support across samples). This adjustment was done for both base genotype and indel genotypes.
- LeftOverFilter: now uses minVariationSupport as minimum threshold.
- Mode suggest-position-slices: add option number-of-bytes to suggest slices with a uniform number of compressed bytes. This option aims to provide more balanced slices in cases where the genome has very non-uniform coverage by position. With this option, the number of slices is determined to yield slices that need to decompress about the number of bytes indicated on the command line.
- Framework API change: introduce class PositionToBasesMap<T> to use as type for positionToBases. The class provides methods to get the range of positions described in the map. This unfortunately requires changes to all clients/implementations of IterateSortedAlignments<T>.
- Mode discover-sequence-variants: Fix various problems that prevented reporting genotypes for deletions (i.e., C/-).
- Fix a potential NPE in GroupAssociations when samples are null.
- Fix for issue #2, see https://github.com/CampagneLaboratory/goby/issues/2
- Expose comparator in SortedAnnotations.
2.1.2 [dec 31 2012]
- Upgrade xstream to version 1.4.3. This fixes the compatibility problem seen when running Goby 2.1.1 with Java 1.7+. Goby 2.1.2 should run with Java 1.7+, but more testing will be needed to rule out other migration problems. If you are running JDK 1.7+, please let us know of any issues you encounter.
- Fix VCFParser issue https://github.com/CampagneLaboratory/goby/issues/1. The issue could be triggered when the FORMAT column changed from line to line.
- VCFWriter: improve support for VCF group associations. The Goby VCF parser makes it possible to associate columns to groups (these associations are written in a ##FieldGroupAssociations field).
- Methylation rate VCF output: mark the context column with group ‘indexed’.
- Do not try to upgrade alignments when reading the header to concatenate permutations. This is not necessary and can open too many files when we are trying to concatenate alignments.
2.1.1 [nov 19 2012]
- Add extract-splicing-events mode. This mode is used by GobyWeb 1.9 to extract splicing events from spliced Goby alignments (generated either by GSNAP or STAR at this time).
- Trim mode: Fix bug that caused quality scores to be duplicated (the bug triggered the assertion that checks that sequence length equals quality length).
- Trim mode: Some sequence must remain after trimming to append to the output.
- Fix bug in alignment-to-annotation-counts where counts would be zero for samples whose name contained a period ‘.’. The code was incorrectly stripping alignment extensions twice.
- alignment-to-annotation-counts: add comparison description to t-test statistic column name (e.g. t-test[A/B] rather than t-test). This change makes it possible to retrieve the t-test p-values when more than one comparison is performed.
- Fix a bug where RandomAccessAnnotations could return results on a different chromosome.
- Add annotation loading test and fix for when annotation file is truncated. Goby now loads annotations up to the truncation and logs truncated lines.
- Correct calculation for fold-change-magnitude column in goby diff exp mode. Previous calculation under-estimated magnitude when comparing low rpkms.
- Fix a problem where AlignmentReaderImpl.canRead would return true when the file ended with an incorrect extension (this problem could create subtle issues when Goby tried to access .info.txt files on a web server that did not return 404 errors for missing content).
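One fix in this release concerns sample names that contain a period, where alignment extensions were stripped twice. The sketch below shows the general idea of stripping a known extension at most once. It is an illustration only: the extension list and the function name `alignment_basename` are assumptions, not Goby's actual API.

```python
# Assumed set of alignment file extensions, for illustration only.
ALIGNMENT_EXTENSIONS = (".entries", ".header", ".index", ".tmh")


def alignment_basename(filename: str) -> str:
    """Strip one known alignment extension, and only one.

    Anything that is not a known extension is left in place, so periods
    inside the sample name survive.
    """
    for extension in ALIGNMENT_EXTENSIONS:
        if filename.endswith(extension):
            return filename[: -len(extension)]
    return filename
```

Applying a generic "drop everything after the last period" step to the result would turn `sample.1` into `sample`, which is the kind of over-stripping that makes distinct samples collide.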
2.1 [jul 20 2012]
- Improve compression of hybrid-1 codec by about 8% on average at similar speed. You can enable this improvement with option -x AlignmentCollectionHandler:symbol-modeling=plus. This option will be made the default in a future release. It is not currently the default since Goby 2.1 has just been integrated into IGV and will need time to propagate from IGV dev to production builds.
- Remove import of NH:i bam tags as read-origin-index, since the NH tag seems to contain different types of data depending on the aligner that produced the alignment.
- compact-to-sam mode: fix bug where bam tags containing a colon character (:) would be truncated after the first colon. Thanks to Vadim Zalunin for reporting this problem.
- compact-file-stats: Add a feature to scan only alignment headers.
- VCFParser group associations: Make it possible to lookup an INFO column by either INFO/colname or colname.
- NonAmbiguousAlignmentReader: fix an NPE when reading alignments where all entries have the ambiguity field.
- Fix a problem where AlignmentReaderImpl.canRead would return true when the file ended with an incorrect extension (this problem could create subtle issues when goby tried to access .info.txt files on a web server that did not return 404 errors for missing content). Thanks to Jim Robinson and Helga Thorvaldsdottir for reporting this issue.
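The compact-to-sam fix for tags containing a colon can be illustrated with a small parser. SAM optional fields have the form TAG:TYPE:VALUE, and the value may itself contain colons, so only the first two colons are separators. The function name `parse_sam_tag` is hypothetical and this is a sketch, not the code used in Goby.

```python
from typing import Tuple


def parse_sam_tag(field: str) -> Tuple[str, str, str]:
    """Split a SAM optional field into (tag, type, value).

    maxsplit=2 keeps any further colons inside the value instead of
    truncating it at the first one.
    """
    tag, type_code, value = field.split(":", 2)
    return tag, type_code, value


def format_sam_tag(tag: str, type_code: str, value: str) -> str:
    """Inverse of parse_sam_tag: rebuild the TAG:TYPE:VALUE text."""
    return f"{tag}:{type_code}:{value}"
```

A round trip through both functions should return the original field unchanged, including for values such as `chr1:+100:50M`.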
2.0.1 [jun 30 2012]
- Release Goby C/C++ APIs under the LGPL license version 3 to make it possible for companies to incorporate support
for Goby formats in their tools. Thanks to Collin Hercus for the suggestion. Please note that parts of the Goby
Java APIs are already licensed under the LGPL (anything packaged under the Goby-io.jar file).
- C++ API: Support to set placed unmapped (i.e., mate that does not map is recorded with the read that mapped)
and clipleft/clipright with quality scores.
- Fix problem when using a genome backed by a samtools/picard faidx file. In some cases, read bases would be returned
shifted by one position. Thanks to James Bonfield for reporting this problem.
- SAM/BAM tags start at column 12, index 11. --preserve-all-tags could skip the first tag on some datasets (e.g.,
datasets where the first tag was not an MD:Z or RG:Z). Thanks to James Bonfield for reporting this problem.
- Introduce interface for ReadsWriter. Introduce mock implementation to write reads to text. This is useful to write
more intelligible JUnit tests.
- mode sam-to-compact now supports option --read-names-are-query-indices to indicate that the read names are integers
(typically produced by compact-to-fasta from a chunk of a large file).
- Fix a bug in reformat-compact-reads which did not trim quality scores for paired end reads correctly.
2.0 [jun 15 2012]
Substantial new features introduced in 2.0 are described in this tutorial.
- Introduce chunk codecs for protocol buffer encoded collection messages (supports both reads and alignments).
- Refactor AlignmentWriter to introduce an interface and make it easier to create facades that modify the behaviour
of the default writer. For instance, such a facade is BufferedSortingAlignmentWriter, which keeps a number of entries
in memory to re-sort these entries by genomic position. This feature is used when importing already sorted SAM/BAM
files to create sorted Goby alignments and the files contain spliced alignments that would cause mis-ordering during
- Fixes to sam-to-compact mode. Previous versions could fail for a variety of reasons. We have stress tested this mode
by throwing various input BAM files at it, sorted or not, and fixed the bugs we found. For instance, the --sorted option
would not work in some 1.9 versions of Goby after samtools/picard changed the semantics of the record comparator Goby
relied upon to verify the input was indeed sorted by position. This made it impossible to convert already sorted BAM
files as sorted Goby alignments.
- Added SortLargeMode which can sort compact alignments of any size, multithreaded. -m sort now invokes LargeSortMode.
- Add ability to preserve SAM/BAM read groups. Read groups are automatically preserved if present in the input BAM file.
The concatenate mode automatically reassigns read origin indices (see field read_origin_index) to prevent conflicts
when Goby files from different origins are concatenated. The approach we use is to keep the most specific read origin
information, and let the client decide what origins/groups are equivalent given the type of analysis at hand.
Read groups are supported by the hybrid codec (and therefore stored very efficiently), are imported from BAM with
sam-to-compact and are exported back to SAM/BAM with the compact-to-bam mode.
- Add ability to preserve all BAM attributes during import and export. Use --preserve-all-tags in mode sam-to-compact
to enable this.
- Add ability to preserve all quality scores. Use --preserve-all-mapped-qualities in mode sam-to-compact.
- Supports bzip2 compression in fasta-to-compact mode and sam-extract-reads (use the -x MessageChunksWriter:codec=bzip2
- Moved error messages produced when parsing the command line of a mode to after usage. This is a simple change that
will make it easier to diagnose problems on a command line without having to scroll back up the console.
- Make it possible to open Goby alignments through HTTP. Simply specify a URL as a basename as argument to the goby
tools. This is supported broadly by the API, so the concatenation reader also supports URLs, for instance. TMH files
currently cannot be loaded remotely. Alignments that require upgrading will also fail to load remotely.
- Refactor dynamic options with a central registry, and make GobyDriver handle option parsing.
This removes duplication of code parsing for each mode that would need dynamic options.
- Support multiple group comparisons for RNA-Seq diff exp (mode compact-alignment-to-annotation-counts).
- The mode for methylation regions can now estimate empirical p-values. Empirical p-values require biological replicates in at least
one of the groups under analysis. Two passes over the data are required. In the first pass, the empirical null
distribution is observed by comparing pairs of samples in the same group. In the second pass, this distribution is
used to estimate the p-value of observing the between group differences.
- Support empirical p-value for individual bases (VCF output). Write a DMR INFO field that stores how many significant
sites were found in a moving window that ends at the site (significance is judged according to a configurable
threshold on the empirical p-value).
- New empirical-p mode to estimate p-values from data in text files. This makes it easier to derive p-values for
simulated data or counts generated by other tools than Goby.
- Add a draft implementation of random access sequence interface that can read a fasta file indexed with faidx.
- Added the ability in alignment-to-text mode to output HTML (-f html), to start/end at offsets (-s/-e) in the alignments and
to limit the number of alignment entries to output (-n).
- The RandomAccessSequenceCache had problems with bases that weren’t G/A/T/C/N. Such bases would be skipped silently,
causing rare, but potentially significant, problems (such as on human chr 3 of the 1000g genome reference where a
R base appears). Bases not in the group G/A/T/C/N would introduce position shifts for bases immediately following
the offending character. Now bases other than G/A/T/C are stored as N and maintain the position of the following
bases. Please note that the problem was in a library used by RandomAccessSequenceCache, we updated the library in
this release, and no change to the code of RandomAccessSequenceCache was needed to fix the problem.
- Added a mode sam-comparison to compare a source SAM/BAM file with one that was generated after sam-to-compact then
- last-to-compact: add option to substitute some bases with others in the aligned read.
- Add test and fix for bug that went back to start of alignment file, even though iterate alignment was created for a
slice of input. The problem only affected the IterateAlignments class because it was calling reposition(0,0) and the
method did not enforce slice limits.
- The code base was simplified by removing the now obsolete align mode.
- Fix a problem where sample names with several dots were stripped of too many extensions. For instance, a.b.c.entries
would be reduced to a, which could be non-unique across the remaining samples. Problem reported by Fang Fang in her
data on GobyWeb.
- DistinctIntValueCounterBitSet now uses LongArrayBitVector as its bit set implementation. The java BitSet implementation
was found to throw java.lang.ArrayIndexOutOfBoundsException for indices that should fit easily in a bit array (e.g.,
2,080,948 which can be stored with about 230 MB).
- AlignmentEntry field insertSize is now stored in protobuf with sint32 rather than uint32 since negative values can be
stored in this field.
- The mode sample-quality-scores now supports .sam, .sam.gz, and .bam files to make a guess at the scale of
the quality scores contained in these files.
- Fixed a problem with concatenate-compact-reads that previously transferred only specific fields of a read to the
output file. concatenate-compact-reads now transfers all fields (including pair sequence and quality score).
- Make default chunk-size dependent on the type of chunk codec used. This is useful because hybrid compression does
better with larger chunk sizes (default chunk size for hybrid is 30000, 20000 for bzip2 and 10000 for gzip). The
default chunk size can be overridden with -x MessageChunksWriter:chunk-size=int. Note that smaller chunk sizes reduce the time
needed to seek at specific positions of sorted alignments. We do not recommend using values larger than 100,000
because performance of interactive visualization of the alignments could suffer (e.g., in IGV).
- version mode now prints an official version number if the jar contains a VERSION.txt file.
- Prevent logging when the log4j system has not been configured. For some reason, LOG.isDebugEnabled can return true
when the logging system is not initialized. For SamHelper, this means calling String.format millions of times to
create debug output that is never shown. This change dramatically improves the performance of the sam-to-compact mode
when logging is not properly configured.
- Fix issues with the barcode-decode mode. Add support for processing fasta/fastq files.
- VCF methylation format: removed space in name of C and Cm group INFO fields.
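The switch of the insertSize field from uint32 to sint32 noted in this release relies on how Protocol Buffers encodes signed values: sint32 uses ZigZag encoding, which maps small negative numbers to small unsigned numbers before they are written as varints. The sketch below reproduces that mapping in plain Python for illustration. The function names are ours, and it does not use the protobuf library.

```python
def zigzag_encode32(n: int) -> int:
    """Map a signed 32-bit integer to an unsigned one (ZigZag encoding)."""
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF


def zigzag_decode32(z: int) -> int:
    """Inverse of zigzag_encode32."""
    return (z >> 1) ^ -(z & 1)


def varint_length(value: int) -> int:
    """Number of bytes needed to write a non-negative integer as a varint."""
    length = 1
    while value >= 0x80:
        value >>= 7
        length += 1
    return length
```

With this mapping an insert size of -200 becomes 399, which fits in a two-byte varint, so negative insert sizes cost about as much to store as positive ones of the same magnitude.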