This tutorial is meant for programmers interested in writing efficient analysis programs for short-read data in Java. The tutorial starts by explaining basic Goby framework concepts and continues by demonstrating how to scan Goby reads and alignment files for information of interest.

Protocol Buffers

Goby uses Protocol Buffers to represent data structures on disk and in memory. The Protocol Buffers web site provides plenty of detailed information about the technology, so we will simply introduce the basics here.

A data structure is represented by Protocol Buffers as a message. The structure of a protocol buffer message type is defined in a .proto file. Goby defined several message types. The Goby .proto files are stored in the protobuf/ directory in the Goby source distribution.

Two files are currently provided (with links to the current version of the file in our subversion repository):

Before looking at these files in detail, we should note that vanilla Protocol Buffer does not support messages larger than a few megabytes. Goby provides software that effectively removes collection size limitation.

Making Protocol Buffers scale to very large collections of messages

Next-generation data requires to handle million to billions of small data structures. For instance, a sequencing experiment many generate  a few tens of millions of short reads. Figure 1 below illustrates how Goby stores very many messages in a single file with Protocol Buffers.

Figure 1. Encoding of very large collections with GZip compression and Protocol Buffers.

Collection of reads

The Reads.proto file contains the following directives (line numbers shown on the left):

1	package goby;
2
3	option java_package = "edu.cornell.med.icb.goby.reads";
4	option optimize_for = SPEED;
5
6	message ReadCollection {
7
8	     repeated ReadEntry reads =1;
9
10	}

The first line indicates that the messages declared in this file belong to package goby. Protocol Buffers generates parsers and data structures in different programming languages. Line 3 specifies the Java package where classes that Protocol Buffers generates will reside. Line 4 indicates that the generated classes should be optimized for parsing speed.

Line 6 defines a new protocol buffer message type. The type is called ReadCollection. The message type includes the elements enclosed in braces, from line 6 to 10. In this case, ReadCollection has a field called reads, with message type ReadEntry. The repeated keyword indicates that the field reads can contain multiple elements.

ReadCollection is the message written in each Chunk of a compact reads file. Chunks typically include 10,000 or less ReadEntry messages. This keeps the length of the Protocol Buffer serialized message below a few megabytes. Goby provides a ReadsReader class. ReadsReader implements Iterator<Reads.ReadEntry> and Iterable<Reads.ReadEntry>. This allows programs to use the class in a for loop to iterate over entries described in a complete compact reads file. The iterator decodes the chunked structure as it traverses the file and exposes each ReadEntry message, effectively hiding the chunk structure of the files. The process is very transparent to client programs, as illustrated in the following code snippet:

36	            String inputFilename = "input.compact-reads";
37	            final MutableString sequence = new MutableString();
38	            ReadsReader reader =  new ReadsReader(new FileInputStream(inputFilename));
39
40	            for (final Reads.ReadEntry readEntry : reader) {
41
42	                ReadsReader.decodeSequence(readEntry, sequence);
43	                System.out.printf("read-index: %d read-id: %s sequence: %s %n",
44	                        readEntry.getReadIndex(),
45	                        readEntry.hasReadIdentifier() ? readEntry.getReadIdentifier() : "",
46	                        sequence);
47	            }

For those users interested in the Python version of this example, see the Goby Python API page.

On line 36, the filename of the compact reads file is defined to be input.compact-reads. Line 37 defines a MutableString instance to hold the sequence of the reads as they are read from the file. Goby uses MutableString whenever possible for efficiency. Line 38 creates a ReadsReader instance to iterate through the input file.

Line 40 starts a for loop that iterates through ReadEntry instances exposed by the reader. The loop will execute for as long as the hasNext() method of ReadsReader returns true. The Chunk structure of the underlying file is completely hidden from client code.

On line 42, the sequence object is populated with the sequence of the read. The sequence is coded as bytes in the ReadEntry message, but the static decodeSequence method recovers a string representation.

Line 43 prints the read index, read identifier and sequence for each readEntry. Since the read identifier is defined as an optional field (see Reads.proto), some compact files may not have such a field. This code snippet detect the case when readEntry has no read identifier field, and prints an empty string instead.

The complete ReadEntry message type is shown below for reference (from Reads.proto):

10	message ReadEntry {
11	  /*
12	    Index of a read.
13	  */
14	  required uint32 read_index = 1;
15	   /*
16	    Index of the barcode, if any.
17	  */
18	  optional uint32 barcode_index = 10;
19	  /*
20	     Read identifier/name may be present.
21	  */
22	  optional string read_identifier = 23;
23	  /*
24	     Additional description about the read (from Fasta/Q format).
25	   */
26	  optional string description = 22;
27	  /*
28	    Length of the sequence.
29	   */
30	  required uint32 read_length = 2;
31	  /*
32	    Sequence, encoded as ascii characters stored in single bytes.
33	   */
34	  optional bytes sequence = 3;
35	  /*
36	   The second sequence in a pair. Stored the same way as the sequence attribute.
37	  */
38	  optional bytes sequence_pair = 5;
39	  /*
40	    Length of the second sequence in a pair.
41	  */
42	  optional uint32 read_length_pair = 6;
43	  /*
44	    Quality scores in Phred units, stored as single bytes (0-255).
45	  */
46	  optional bytes quality_scores = 4;
47	  /*
48	    Quality scores for the second sequence in a pair. Stored as the 'qualityScores' attribute.
49	   */
50	  optional bytes quality_scores_pair = 7;
51	}

Collection of alignment entries

Goby stores alignments in a set of files named with a specific convention. Assume the alignment is called ‘basename’, Goby will expect the following files to exist:

  • basename.entries [required] A collection of alignment entries, where an entry describes how a read maps to a reference.
  • basename.header [optional] Global information about the alignment, such as sequence identifiers or sorted attribute.
  • basename.tmh [optional] TooManyHits data structure, to record reads that map to too many locations in reference sequences.
  • basename.index [optional] An index from genomic location to offset in the entries file.

In the convention above, basename can be substituted with any valid filename prefix or path/filename prefix as in the following examples:

  1. /home/goby/alignments/illumina-1-2010.entries the path is /home/goby/alignments/ the basename is illumina-1-2010
  2. ./solid-2-2010.header the path is ./ the basename is solid-2-2010
  3. helicos-3-2010.tmh the path is ./, the basename is helicos-3-2010

The format of the entries file is defined in Alignments.proto. Entries are a large collection of AlignmentEntry messages.

1	package goby;
2
3	option java_package = "edu.cornell.med.icb.goby.alignments";
4	option optimize_for = SPEED;
5
6	message AlignmentCollection {
7
8	     repeated AlignmentEntry alignmentEntries =1;
9	}

Similarly to what we described for reads, Goby provides an AlignmentReader. The following code snippet illustrates how to iterate through a Goby compact alignment file:

40	            String inputFilename = "input.entries";
41
42	            AlignmentReader reader = new AlignmentReader(inputFilename);
43
44	            for (final Alignments.AlignmentEntry alignmentEntry : reader) {
45
46	                System.out.printf("query-index: %d target-index: %d score: %f %n",
47	                        alignmentEntry.getQueryIndex(),
48	                        alignmentEntry.getTargetIndex(),
49	                        alignmentEntry.getScore());
50	            }

For those users interested in the Python version of this example, see the Goby Python API page.

Line 40 provides an input filename. The filename will be stripped of valid alignment extensions and interpreted as a basename. Providing the “input” basename as an argument or providing “input.tmh” or “input.header” would work just as well.

Line 42 creates an instance of the AlignmentReader class to parse the specified input alignment.

Line 44 starts iterating over all the alignment entries in the input.

Line 46 prints a few fields of the AlignmentEntry message to standard output.

An alternative method is useful to iterate over a set of several alignments. In this case, we recommend using the helper class IterateAlignments.

The complete AlignmentEntry message type is reproduced below for reference (from Alignments.proto, the version shown below corresponds to Goby 1.7, you can obtain the latest version of this file from our subversion repository).

11	message AlignmentEntry {
12	 /* Multiplicity of this entry. The number of times this  alignment entry would be repeated exactly the same if
13	  query redundancy had not been removed by read factorization.
14	 */
15	 optional uint32 multiplicity=7;
16
17	 /* An integer that uniquely identifies the query (a short read) in a set of alignment runs. When several
18	    alignment runs are made with the same set of query sequences, equality of query index means that the query
19	    sequences were the same. (Comparing integers for equality is much faster than comparing strings.)
20	  */
21	  required uint32 query_index =1;
22	  /* An integer that uniquely identifies the target (e.g., a chromosome) in a set of alignment runs. When several
23	    alignment runs are made with the same set of target sequences, equality of target index means that the target
24	    sequence was the same across the runs. (Comparing integers for equality is much faster than comparing strings.)
25	  */
26	  required uint32 target_index =2;
27	  /*
28	    The position on the target of the start of the alignment between the query and the target.
29	    In the following example, position is 3 because the first base of the query was aligned with
30	    position 3 of the reference. This example shows that the alignment can start at a mismatch if
31	    it was so constructed by the aligner.
32
33	     0123456789
34	     AAAAGTCAAA  target
35	        CGTC     query
36	   */
37	  required uint32 position=3;
38
39	  /*
40	     True when the query matches the target on the reverse strand
41	  */
42	  required bool matching_reverse_strand=6;
43
44	  /*
45	   The position on the query where the alignment starts. This value is different from zero
46	   when some bases/residues of the query could not be aligned with the target.
47	  */
48	  optional uint32 query_position=5;
49
50	  /*
51	    The score of the alignment, where larger scores indicate better matches between the query and the target.
52	    If an aligner outputs only the number of mismatches between query and target, the score is taken to be
53	    -(#mismatches(query,target)).
54	   */
55	  optional float score=4;
56
57	  /*
58	    Number of bases/residues that differ in the alignment between query and target sequences.
59	  */
60	  optional uint32 number_of_mismatches=8;
61
62	  /*
63	    Cumulative number of insertions and/or deletions present in the alignment.
64	   */
65	  optional uint32 number_of_indels=9;
66
67	  /*
68	    Number of bases that have been aligned for the query.
69	   */
70	  optional uint32 query_aligned_length=11;
71
72	   /*
73	    Number of bases that have been aligned for the target.
74	   */
75	  optional uint32 target_aligned_length=12;
76	  repeated SequenceVariation sequence_variations=13;
77	    /*
78	     Length of the query sequence.
79	    */
80	    optional uint32 queryLength=10;
81	}