Goby provides compressed file formats that are time and space efficient. Goby alignment files are typically 80-90% smaller than the corresponding BAM files, while making it possible to perform similar analyses. A complete list of Goby file formats is provided below. Several tools can be used with Goby, including BWA (read/write), GSNAP (read/write), and IGV 2+ (read and display alignments).
In addition to file formats, Goby is a complete software framework for developing NGS analysis programs. A software framework is simply code that makes it easier to develop programs that share some characteristics. BioPerl, BioJava, and Biopython are software frameworks focused on helping developers write programs for sequence analysis. Goby is a framework to help write programs for the analysis of next-generation sequencing data. The framework is designed for performance and to run with the lowest possible amount of memory and disk storage. This is of practical importance given that sequencing capacity approximately doubles every year and that naive computational approaches fail to scale to large datasets. More information about the framework is available on the project Developer’s pages.
Finally, Goby is also a set of tools written with the Goby framework. The tools in the Goby toolbox were written to support the projects that our laboratory is working on. So far we have developed tools for RNA-Seq data analysis, Methyl-Seq or RRBS analysis, and genomic variation analysis. End-users interested in these tools should consult the tutorials.
|compact reads||An alternative to FASTA/FASTQ that is fast to parse, unambiguous, compact, and chunkable. Chunkability means that a very large file can be processed in independent chunks: only the chunk of interest needs to be read, without traversing the entire file. This property is leveraged by GobyWeb to support parallel alignments.|
|compact alignments||An alternative to the Eland text format, MAQ, or SAM/BAM. Goby alignments are chunkable, compact, unambiguous, and fast to parse. They are typically 80-90% smaller than the corresponding BAM files.|
|read sets||Keep track of millions of read indices and multiplicity information in a space-efficient way. Useful for aligning only non-redundant sequences and then reconstituting the alignment that would result from aligning all sequences. Also useful for simply eliminating replicates from a reads file.|
|counts||A representation of the histogram of read count along a reference sequence, at single base pair resolution. This representation is highly space efficient. Each count transition (positions where the value of the count changes along the histogram) is encoded in about 13 bits.|
|count archives||An archive of counts, one histogram per reference sequence in an alignment. Archives can store histogram data for a complete genome. They are very space efficient, with only about 20Mb needed to store a histogram of reads aligned against the human genome at base pair resolution. In contrast, a wiggle plot stored at 20bp resolution needs about 45Mb.|
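
The counts format described above stores only the positions where the histogram value changes, not one value per base. The following Python sketch illustrates that idea under simplifying assumptions: the function names `to_transitions` and `from_transitions` are hypothetical, and the transitions are kept as plain tuples. It is not Goby's actual on-disk encoding, which packs each transition into about 13 bits.

```python
from typing import List, Sequence, Tuple

# A transition is (position, new_count): the count takes the value
# new_count starting at position and keeps it until the next transition.
Transition = Tuple[int, int]


def to_transitions(counts: Sequence[int]) -> List[Transition]:
    """Collapse a per-base count histogram into its count transitions.

    The count before position 0 is assumed to be 0 (an assumption of
    this sketch, not a documented property of the Goby format).
    """
    transitions: List[Transition] = []
    previous = 0
    for position, count in enumerate(counts):
        if count != previous:
            transitions.append((position, count))
            previous = count
    return transitions


def from_transitions(transitions: Sequence[Transition], length: int) -> List[int]:
    """Rebuild the per-base histogram from its transitions."""
    counts: List[int] = []
    current = 0
    pending = iter(transitions)
    upcoming = next(pending, None)
    for position in range(length):
        if upcoming is not None and upcoming[0] == position:
            current = upcoming[1]
            upcoming = next(pending, None)
        counts.append(current)
    return counts


# Toy coverage along a 10 bp reference: ten values, three transitions.
coverage = [0, 0, 3, 3, 3, 5, 5, 0, 0, 0]
transitions = to_transitions(coverage)
restored = from_transitions(transitions, len(coverage))
```

Because read coverage tends to stay constant over runs of adjacent bases, the number of transitions is usually much smaller than the number of positions, which is why a representation of this kind can stay compact even at single base pair resolution.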