Goby 2.0 is a major milestone of the project. This is a quick guide to some of the most useful features introduced in this release.

SAM/BAM and Goby

Release 2.0 introduces very robust BAM import and export capabilities. We have entirely rewritten the sam-to-compact and compact-to-sam modes and tested these modes with a variety of datasets (DNA-Seq, RNA-Seq, methyl-Seq).

To convert a BAM file to Goby, do:

goby 1g sam-to-compact [--sorted] input.bam -o output-basename [--preserve-all-mapped-qualities]

Use the --sorted option if you know that the BAM file is sorted by genomic position (the index file is not necessary for the import). If you have a SAM file, simply replace .bam with .sam and the mode will read this input appropriately.

Note the --preserve-* options, which make it possible to store information such as quality scores for the entire read, soft-clips and their quality scores, and BAM tags. When --preserve-all-tags is not active, Goby nevertheless preserves BAM read groups.
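As a rough sketch, an import that keeps as much of the original BAM content as possible could combine the two preserve options named in this guide. The helper name import_bam_lossless and the GOBY variable are our own illustration, not part of Goby, and other --preserve-* options may be available in your version.

```shell
# Hypothetical helper (not part of Goby): import a BAM/SAM file while
# preserving qualities for the entire read and all BAM tags.
# GOBY defaults to the goby launcher; set GOBY=echo to preview the command.
import_bam_lossless() {
  input_file="$1"    # e.g. input.bam or input.sam
  output_base="$2"   # Goby output basename
  ${GOBY:-goby} 1g sam-to-compact "$input_file" -o "$output_base" \
    --preserve-all-mapped-qualities --preserve-all-tags
}

# Example: import_bam_lossless input.bam output-basename
```

Preserving everything produces larger files, so you may prefer to enable only the options your downstream analysis needs.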

To convert a Goby alignment to BAM, do (see a more complete description here):

goby 1g compact-to-sam goby-basename -o output.bam -g genome

If the goby-basename alignment was sorted, the mode will try to create a sorted BAM output file. Because Goby and BAM represent spliced alignments differently, some spliced RNA-Seq datasets may require re-sorting after import or export. You will be notified when such a dataset is encountered and invited to sort the file. (In case you wondered, Goby has a more general representation that can represent fusion events naturally, so there is no need for a TopHat fusion SAM format variant.)

Sorting large files

Goby 1.x provided a sort mode, but it was limited by the amount of available memory. This was never a problem for us because our large alignments were produced with GobyWeb, which sorts in parallel during alignment and only needs to sort small pieces one at a time. Goby 2.0 offers a command line sort mode that scales to very large alignments and can run with multiple threads.

The old sort has been renamed sort-1 and you can run the new sort with:

goby 4g sort pre-sort-basename -o sorted-basename

A number of options control memory consumption and parallelization, but the default settings are conservative and work across a range of datasets.

Compressing alignment files

We have introduced novel methods for compressing alignment data that result in state-of-the-art compression over a range of next-generation sequencing datasets. This new capability is supported throughout the framework and Goby tools: you can choose any Goby mode and activate compression of the output it produces, and the compressed files work with any Goby analysis pipeline that you have already developed. The new compression methods yield alignment files that are on average 25% of the size of Goby 1.x alignments. Importantly, the compression methods we introduced in Goby 2.0 preserve your ability to extend the protobuf schema. You can freely extend the Goby alignment schema with new structured data fields (e.g., for specific applications), recompile the project, and immediately get strong compression for most of the alignment data stored.

Goby 2.0 supports pluggable compression/decompression approaches with a codec mechanism (so that you could try out new approaches to compression). Codecs currently offered include: gzip (traditional Goby 1.x compression), bzip2 (stronger compression than GZip, but slower), and hybrid-1 (the new method we developed for Goby 2.0: faster than BZip2, slower than GZip, with a substantial compression advantage compared to BZip2).

An easy way to compare the effectiveness of the codecs is to compress the output of concatenate-alignment. Assume you have an input-basename alignment generated with Goby 1.x and want to compress it with the new bzip2 codec. You would do:

goby 1g concatenate-alignment input-basename -o input-basename-bzip2 -x MessageChunksWriter:codec=bzip2

This will produce input-basename-bzip2.entries, .header, .index (and optionally a .tmh file if the input alignment contained one).

In the command above, -x MessageChunksWriter:codec=bzip2 indicates that the output file should be written with the ‘bzip2’ codec. You can also compress with the ‘hybrid-1’ codec for smaller files and faster compression, or with the ‘gzip’ codec (the default) to yield Goby 1.x compression.
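To compare the three codecs on your own data, the concatenate-alignment command above can be wrapped in a small loop. This is only a sketch: compare_codecs and the GOBY variable are hypothetical helpers we introduce here, and the output basenames are an arbitrary naming choice.

```shell
# Hypothetical helper (not part of Goby): rewrite one alignment with each
# codec named in this guide, then list the resulting .entries files.
# GOBY defaults to the goby launcher.
compare_codecs() {
  input_base="$1"   # basename of an existing Goby alignment
  for codec in gzip bzip2 hybrid-1; do
    ${GOBY:-goby} 1g concatenate-alignment "$input_base" \
      -o "$input_base-$codec" -x "MessageChunksWriter:codec=$codec" || return 1
  done
  # List the .entries file written for each codec so sizes can be compared.
  ls -l "$input_base"-*.entries
}

# Example: compare_codecs input-basename
```

The listing shows one .entries file per codec, which makes the size differences easy to read side by side.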

As mentioned previously, you can use the option -x MessageChunksWriter:codec=<codec>, where <codec> is the name of a codec, with any Goby mode that outputs an alignment. For instance, you can import a BAM file and compress it with the hybrid codec on the fly:

goby 1g sam-to-compact --sorted input.bam -o output-basename -x MessageChunksWriter:codec=hybrid-1

Such compressed files will work with the rest of the Goby tools. For instance, you could compress a collection of BAM files with the hybrid-1 codec and perform RNA-Seq analysis with Goby directly on the compressed alignment files. Importantly, compression codecs can work with either sorted or non-sorted alignments, and you have the flexibility of choosing the method that yields the best compression/performance tradeoff for the analysis at hand.
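Compressing a collection of BAM files could look like the following sketch. The helper compress_bams and the GOBY variable are hypothetical names of ours, and we assume each output basename is simply the BAM file name without its .bam extension.

```shell
# Hypothetical helper (not part of Goby): import every BAM file given as an
# argument and write each alignment with the hybrid-1 codec.
# GOBY defaults to the goby launcher; set GOBY=echo to preview the commands.
compress_bams() {
  for bam in "$@"; do
    output_base="${bam%.bam}"   # sample.bam -> sample
    ${GOBY:-goby} 1g sam-to-compact "$bam" -o "$output_base" \
      -x MessageChunksWriter:codec=hybrid-1 || return 1
  done
}

# Example: compress_bams *.bam
```

The sketch does not pass --sorted because it cannot know whether each input is sorted; add it if all of your BAM files are sorted by genomic position.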

Query index permutations

Goby 2.0 introduces the notion of query index permutation. This is a technique that significantly reduces the size of sorted alignments. If you are compressing an alignment that was produced directly with Goby (typically written to Goby format with BWA or GSNAP, then sorted by genomic order), you can increase compression significantly for the sorted file by specifying the option:

-x AlignmentWriterImpl:permutate-query-indices=true

This will produce a .perm file. The permutation file is not needed for most applications (e.g., it is not needed to view alignments with IGV). The .perm file stores information that helps track alignment entries back to the original reads. Keeping this information in a separate file makes it possible to decide if and when such data need to be transferred or preserved.
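For example, a sorted alignment could be recompressed with both the hybrid-1 codec and query index permutation in a single pass. This sketch assumes that the two -x options can be combined on one concatenate-alignment command line; compress_sorted and the GOBY variable are hypothetical helpers of ours.

```shell
# Hypothetical helper (not part of Goby): recompress a sorted alignment with
# the hybrid-1 codec and query index permutation enabled.
# Assumption: both -x options are accepted together by concatenate-alignment.
# GOBY defaults to the goby launcher; set GOBY=echo to preview the command.
compress_sorted() {
  sorted_base="$1"   # basename of a sorted Goby alignment
  output_base="$2"   # basename for the compressed output
  ${GOBY:-goby} 1g concatenate-alignment "$sorted_base" -o "$output_base" \
    -x MessageChunksWriter:codec=hybrid-1 \
    -x AlignmentWriterImpl:permutate-query-indices=true
}

# Example: compress_sorted sorted-basename sorted-basename-hybrid
```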

Empirical p-values

Goby 2.0 supports a new method to estimate the significance of differences in methylation data. The method requires biological replicates in the groups under comparison and adapts to the distribution of the data. This approach is often much more conservative than a Fisher exact test because it takes into account the consistency of changes within the groups and between the groups under comparison. Empirical p-values are supported in base-level and region analysis of methylation data.

RNA-Seq designs comparing multiple groups

Goby 2.0 makes it possible to generate tables of data comparing multiple groups of samples. In Goby 1.x, the RNA-Seq analysis was limited to a single comparison between two groups (e.g., comparing group A versus B). The mode compact-alignment-to-annotation-counts now accepts several comparisons (e.g., A/B A/C B/D) and will write statistics (p- and q-values) for each comparison.

Git repository

The source code for Goby 2.0 is now available as a repository on GitHub. We welcome feedback, comments and of course contributions to the project.