Goby 1.7 will support sorting and indexing alignments. Sorting arranges alignment entries in order of genomic location. Indexing provides semi-random access to locations in a sorted file. Together, these features provide large performance improvements when software needs to access only a specific window of genomic locations in a very large alignment file (e.g., tens or hundreds of gigabytes).
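
To illustrate why sorting makes windowed access cheap, here is a minimal in-memory sketch. It is not Goby's on-disk format or API: the class SortedAlignmentSketch, its Entry type and the window method are hypothetical names chosen for this example, and a binary search over a list stands in for the real index.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a sorted alignment (an assumption for illustration, not Goby's format). */
final class SortedAlignmentSketch {
    /** Hypothetical entry: a reference sequence index and a 0-based start position. */
    static final class Entry {
        final int targetIndex;
        final int position;

        Entry(int targetIndex, int position) {
            this.targetIndex = targetIndex;
            this.position = position;
        }
    }

    /** Compares two (targetIndex, position) pairs in genomic order. */
    static int compare(int t1, int p1, int t2, int p2) {
        return t1 != t2 ? Integer.compare(t1, t2) : Integer.compare(p1, p2);
    }

    /** Binary search for the first entry located at or after (targetIndex, position). */
    static int firstAtOrAfter(List<Entry> sorted, int targetIndex, int position) {
        int lo = 0;
        int hi = sorted.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            Entry e = sorted.get(mid);
            if (compare(e.targetIndex, e.position, targetIndex, position) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    /** Returns entries on targetIndex with start <= position < end, skipping everything before the window. */
    static List<Entry> window(List<Entry> sorted, int targetIndex, int start, int end) {
        List<Entry> result = new ArrayList<>();
        for (int i = firstAtOrAfter(sorted, targetIndex, start); i < sorted.size(); i++) {
            Entry e = sorted.get(i);
            if (e.targetIndex != targetIndex || e.position >= end) {
                break; // sorted order guarantees nothing later can fall inside the window
            }
            result.add(e);
        }
        return result;
    }
}
```

Calling SortedAlignmentSketch.window(entries, 1, 10, 100) on a sorted list touches only the entries inside the window plus a logarithmic number of probes, which is the effect an index is meant to achieve on a file that is too large to scan.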

Many Goby modes will benefit from these features transparently. For instance, modes that allow the user to restrict an analysis to a subset of reference sequences (i.e., modes with the -r/--include-reference-names option) use the index to load only the parts of the alignment that align within the regions of interest. This results in very large speed-ups (>9x) for analyses that need to process one chromosome at a time in a large alignment file.

End users can sort alignments with the new sort mode (see this tutorial). Concatenating a set of sorted alignments will yield a globally sorted alignment (as usual, concatenating alignments is done with mode concatenate-alignments, which detects sorted inputs automatically). A sorted alignment is automatically indexed when written to disk.
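
One plausible way to obtain a globally sorted result from several sorted inputs is a k-way merge. The sketch below is an assumption about how such a merge could work, not a description of what concatenate-alignments actually does; SortedMergeSketch, Entry and merge are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Illustrative k-way merge of sorted entry lists (not Goby code). */
final class SortedMergeSketch {
    /** Hypothetical entry keyed by reference sequence index and start position. */
    static final class Entry {
        final int targetIndex;
        final int position;

        Entry(int targetIndex, int position) {
            this.targetIndex = targetIndex;
            this.position = position;
        }
    }

    /** Genomic order: by reference sequence first, then by position. */
    static final Comparator<Entry> GENOMIC_ORDER =
            Comparator.comparingInt((Entry e) -> e.targetIndex)
                      .thenComparingInt(e -> e.position);

    /** One input plus the entry it currently offers to the merge. */
    private static final class Cursor {
        final Iterator<Entry> source;
        Entry head;

        Cursor(Iterator<Entry> source) {
            this.source = source;
            this.head = source.next();
        }
    }

    /** Merges inputs that are each already sorted into one globally sorted list. */
    static List<Entry> merge(List<List<Entry>> sortedInputs) {
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>((a, b) -> GENOMIC_ORDER.compare(a.head, b.head));
        for (List<Entry> input : sortedInputs) {
            if (!input.isEmpty()) {
                heap.add(new Cursor(input.iterator()));
            }
        }
        List<Entry> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor smallest = heap.poll();   // input whose current entry comes first
            merged.add(smallest.head);
            if (smallest.source.hasNext()) { // advance that input and re-queue it
                smallest.head = smallest.source.next();
                heap.add(smallest);
            }
        }
        return merged;
    }
}
```

Because only one entry per input is held in the heap at a time, this approach needs memory proportional to the number of inputs rather than to their total size.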

Developers should check out the new skipTo(targetIndex, position) method on edu.cornell.med.icb.goby.alignments.AlignmentReader, as well as the new version of edu.cornell.med.icb.goby.alignments.IterateAlignments, which leverages skipTo for indexed alignments (see the new IterateAlignments tutorial).

The next version of Goby will support reweighting reads with the method recently described by Hansen KD et al. (NAR 2010). Counts can be reweighted before producing wiggle and bedgraph plots, and Goby will also support reweighting reads when estimating gene expression. See our tutorial for a preview of this upcoming feature.

The next version of Goby will support sequence variations. We have added a simple way to represent mutations, insertions and deletions in the compact alignment format. We have also developed parsers to import variations from the SAM format and the MAF format used by Last. Finally, new modes will support converting sequence variations from compact format to summary statistics (e.g., frequency of variation along the read positions) or to tab-delimited files for analysis with other software packages.
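
As an illustration of the kind of summary statistic mentioned above, the following sketch computes the frequency of variation at each read position. The input representation is an assumption made for this example (one array of 0-based variant positions per read, each position listed at most once, all reads of the same length); it does not reflect how Goby stores variations, and VariationFrequencySketch is a hypothetical name.

```java
/** Illustrative per-read-position variation frequency (not Goby code). */
final class VariationFrequencySketch {
    /**
     * Returns, for each read position, the fraction of reads that carry a variation there.
     * variationPositions[i] holds the 0-based read positions at which read i differs
     * from the reference; an empty array means the read matches perfectly.
     */
    static double[] frequencyByReadPosition(int[][] variationPositions, int readLength) {
        double[] frequency = new double[readLength];
        if (variationPositions.length == 0) {
            return frequency; // no reads: all frequencies stay at zero
        }
        for (int[] read : variationPositions) {
            for (int pos : read) {
                if (pos >= 0 && pos < readLength) {
                    frequency[pos] += 1; // count reads with a variation at this position
                }
            }
        }
        for (int i = 0; i < readLength; i++) {
            frequency[i] /= variationPositions.length; // convert counts to fractions
        }
        return frequency;
    }
}
```

A profile that rises toward the end of the read would typically point to position-dependent sequencing errors rather than true variants, which is one reason such a summary is useful before variant calling.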

We are interested in collaborating with groups who are developing statistical models for SNP and indel calling to develop efficient and accurate variant calling solutions. Please let us know if this would be of interest.

We are now writing the manuscript that describes the design principles and solutions implemented in Goby. This manuscript should include storage and performance benchmarks for the compact format.

We’re working on a new lab web site. Stay tuned.