This page describes the Python API for reading binary data files created with the Goby next-gen data management framework. The Goby Python API provides a subset of the operations in the Java implementation: specifically, it supports parsing Goby compact reads and alignment files that were created with the Java Goby framework.
Specific details about how the Goby framework uses Protocol Buffers can be found in the section called Developing with Goby.
- Make sure you have Python 2.5 or newer. If in doubt, run:
$ python -V
- Download and install the prerequisite Python packages.
- Install the Goby package:
$ python setup.py install
This step may require superuser privileges.
Collection of reads
As with the Java implementation, the Goby Python API provides a ReadsReader class, which is an iterator over ReadEntry objects. Programs can therefore use the class in a for loop to iterate over every entry in a compact reads file. The iterator decodes the chunked structure as it traverses the file and exposes each ReadEntry message, so client programs never deal with chunks directly, as the following code snippet illustrates:
1 from goby.Reads import ReadsReader
2
3 inputFilename = "input.compact-reads"
4 reader = ReadsReader(inputFilename)
5
6 for entry in reader:
7     print "read-index: %d read-id: %s sequence: %s" % (entry.readIndex, entry.readIdentifier, entry.sequence)
Line 1 tells Python that the ReadsReader class from the Reads module in the goby package will be required. As with the Java example, the filename of the compact reads file is set to input.compact-reads on line 3. A ReadsReader instance that iterates through the input file is then created on line 4.
Line 6 starts a for loop that iterates through the ReadEntry instances exposed by the reader. The loop executes for as long as there are more ReadEntry objects to be read. The chunk structure of the underlying file is completely hidden from client code. Finally, line 7 prints the read index, read identifier and sequence for each entry. The read identifier is defined as an optional field (see Reads.proto), so some compact files may not contain it; in that case the Python implementation safely returns an empty string.
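To make the idea of a reader that hides chunk boundaries more concrete, here is a minimal sketch. It is not the Goby implementation: ChunkedEntryReader is a hypothetical class, and the chunks are plain in-memory lists rather than data decoded from a compact reads file. It only shows how a Python iterator can present chunked data as one flat sequence.

```python
class ChunkedEntryReader(object):
    """Hypothetical stand-in for a reader that hides chunk boundaries.

    `chunks` is any iterable of lists of entries. The real Goby readers
    decode their chunks from a binary file instead.
    """

    def __init__(self, chunks):
        self._chunks = chunks

    def __iter__(self):
        # Walk the data chunk by chunk, but hand entries out one at a time,
        # so the caller never sees where one chunk ends and the next begins.
        for chunk in self._chunks:
            for entry in chunk:
                yield entry


# Three made-up in-memory "chunks" (one of them empty); entries are strings.
chunks = [["read-0", "read-1"], [], ["read-2"]]
entries = list(ChunkedEntryReader(chunks))
```

Client code loops over the reader exactly as in the snippet above, and an empty chunk simply contributes no entries.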
Collection of alignment entries
Similar to what we described for reads, the Goby Python API provides an AlignmentReader. The following code snippet illustrates how to iterate through a Goby compact alignment file:
from goby.Alignments import AlignmentReader

inputFilename = "input.entries"
reader = AlignmentReader(inputFilename)

for entry in reader:
    print "query-index: %d target-index: %s score: %f" % (entry.query_index, entry.target_index, entry.score)
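Because the reader is an ordinary iterator, entries can be passed to any function that accepts an iterable. The sketch below groups entries by target index and reports a count and mean score per target. FakeEntry and summarize_by_target are hypothetical names introduced for illustration; FakeEntry models only the three fields used in the snippet above, so the example runs without a Goby file.

```python
from collections import namedtuple

# Stand-in for the entry objects an AlignmentReader yields. Only the three
# fields printed in the snippet above are modelled (an assumption).
FakeEntry = namedtuple("FakeEntry", ["query_index", "target_index", "score"])


def summarize_by_target(entries):
    """Return {target_index: (entry_count, mean_score)} for an iterable of entries."""
    totals = {}
    for entry in entries:
        count, score_sum = totals.get(entry.target_index, (0, 0.0))
        totals[entry.target_index] = (count + 1, score_sum + entry.score)
    # Turn the running sums into means.
    return dict((target, (count, score_sum / count))
                for target, (count, score_sum) in totals.items())


sample = [FakeEntry(0, 1, 30.0), FakeEntry(1, 1, 20.0), FakeEntry(2, 4, 35.0)]
summary = summarize_by_target(sample)
```

With a real file, the reader created in the snippet above would be passed in place of sample, assuming its entries expose the same three attributes.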
While the Goby Python API does not provide the full set of operations found in the Java implementation, a few of the modes have been written using the Python API. These scripts are included in the Goby distribution, and reviewing them is a good way to learn how to use Goby in your own projects.
- Scans a Goby compact alignment file and prints statistics about the alignment. It provides the same information as the Java mode “compact-file-stats”. Like the corresponding Java mode, the script takes the basename of a compact alignment as input. (The files basename.entries and basename.header must exist.)
- Converts a Goby compact alignment to plain text. It provides the same information as the Java mode “alignment-to-text”. Like the corresponding Java mode, the script takes the basename of a compact alignment as input.
- Converts a Goby compact reads file to FASTA or FASTQ format. It is similar to the Java mode “compact-to-fasta”.
GobyCompactToFasta.py [-f|--format <fasta|fastq>] [-o|--output <output-filename>] <filename>
- Scans a Goby compact reads file and prints statistics about the entries. It provides the same information as the Java mode “compact-file-stats”. Like the corresponding Java mode, the script takes the name of a compact reads file as input.
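As an illustration of the GobyCompactToFasta.py usage line above, the following sketch parses the same options and formats one FASTA record. It is not the script shipped with Goby. The fasta default, the fallback to the read index when no identifier is present, and the helper names build_parser and format_fasta are all assumptions made for this example. It uses argparse, which requires Python 2.7 or newer.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the usage line (hypothetical sketch)."""
    parser = argparse.ArgumentParser(prog="GobyCompactToFasta.py")
    # Defaulting to "fasta" is an assumption, not documented behaviour.
    parser.add_argument("-f", "--format", choices=["fasta", "fastq"],
                        default="fasta", help="output format")
    parser.add_argument("-o", "--output", metavar="output-filename",
                        default=None, help="file to write; unset if omitted")
    parser.add_argument("filename", help="compact reads file to convert")
    return parser


def format_fasta(read_index, identifier, sequence):
    """Return one FASTA record as a string.

    Falls back to the read index when the identifier is empty (an assumed
    convention for this sketch).
    """
    name = identifier if identifier else str(read_index)
    return ">%s\n%s\n" % (name, sequence)


args = build_parser().parse_args(
    ["-f", "fastq", "-o", "out.fq", "input.compact-reads"])
record = format_fasta(7, "", "ACGT")
```

A FASTQ writer would additionally need the per-base quality scores of each entry, which this sketch leaves out.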