Since we posted an announcement for Goby 2.0 on the samtools developer mailling list, I received a number of questions about differences between Goby alignments and SAM alignments. Many people start by assuming that Goby stores alignments in a way somewhat similar to SAM, but there are key differences, which I attempt to summarize here:
- Goby supports 5 minute schema changes (i.e., data schema flexibility)
- Alignments that span a splice site are stored differently
- Sequence variations are stored explicitly in Goby
Goby supports 5 minute schema changes
Goby stores data with protocol buffers, according to well-defined data schemas. A key feature of protocol buffers is that it supports “5 minute schema changes”. You can add a new field (e.g., a collection of new data structures) to a schema and recompile the code (typically takes about 5 minutes). The recompiled program will seamlessly read old data files, and old programs will be able to read data generated by the recompiled program. This functionality is quite useful when you are developing new programs, or when a community of developers work on various tools that need a common schema, but often also require tool specific extensions. Goby supports 5 minutes schema changes. SAM is limited to extensions via new tags. You can encode structured data in tags, but you are left to write a serialization mechanism between data structures and strings (this will likely take more than 5 minutes).
Alignments that span a splice site are stored differently
Consider this alignment in SAM format (one alignment block spanning a splice site, from a paired-end file):
61PKHAAXX_HWUSI-EAS627_0007:5:38:17885:13714 97 chr1 1231595 255 66M144N9M = 1231903 0 \
AGCAGCTCAGGCTCCCACGAGTCCAGCGTCAGGGACCGCACCTTGGAGCAGTGGACACCCAGGCTCCTGTGGATG GGGGGFGGGGGDFGGGGGEGGEGGCGGFCGGGFG?GGEEEEEEGEGCEEEBBDDBBDBCED=CACDC@C9@A:<A NM:i:0 XS:A:- NH:i:1 MD:Z:75The same alignment is stored by Goby in data structures that contain the following (the field names are shown for clarity, but are not stored). Please note that both Goby entries are indexed independently. Storing two entries make it possible to represent gene fusions easily. Imaging the fusion is between two chromosomes, Goby will store the fusion as usual, with splice links pointing at positions on different chromosomes.
{ // first part of the alignment block, before the splice site:
query_index: 0
fragment_index: 0 // first part/fragment of the read.
target_index: 0
position: 1231594
query_position: 0
matching_reverse_strand: false
multiplicity: 1
query_length: 75
query_aligned_length: 66 // this alignment matches the first 66 bases
target_aligned_length: 66
mapping_quality: 255
pair_flags: 97
pair_alignment_link { // indicates this block is part of a pair,
// found at this location:
target_index: 0
position: 1231902
fragment_index: 2 // the pair is another fragment of the same read.
}
spliced_forward_alignment_link { // indicates this block is part of a
// spliced alignment, points to second
// entry
target_index: 0 // indicates the target/chromosome of the block
// after the splice
position: 1231804 // the position of the alignment block
// after the slice
fragment_index: 1 // which part of the read we link to
// after the splice
}
bam_attributes: "NH:i:1"
bam_attributes: "NM:i:0"
bam_attributes: "XS:A:-"
}
{ // second part of the alignment block, after the splice site:
query_index: 0 // query index match with first alignment, this is the same read.
fragment_index: 1 // second part of a read
target_index: 0
position: 1231804
query_position: 66
matching_reverse_strand: false
multiplicity: 1
query_length: 75
query_aligned_length: 9 // this alignment block matches only the 9 last bases
target_aligned_length: 9
mapping_quality: 255
pair_flags: 97
pair_alignment_link { // indicates this block is part of a pair,
// found at this location
target_index: 0
position: 1231902
fragment_index: 2
}
spliced_backward_alignment_link { // indicates this block is part of a spliced alignment,
// points back to first entry:
target_index: 0
position: 1231594
fragment_index: 0
}
bam_attributes: "NH:i:1"
bam_attributes: "NM:i:0"
bam_attributes: "XS:A:-"
}Sequence variations are stored explicitly in Goby
SAM/BAM rely on CIGAR strings and MD tags to represent sequence variations. A complicated CIGAR can look like this: 6S23M1I6M1D16M1I47M while the corresponding MD tag will look like this: MD:Z:3G4A7C4C1A2T2^T2C4T3A0G5C44. Goby replaces these codes with explicit data structures.The SAM alignment:
509.6.68.19057.157284 83 chr1 45881869 29 6S23M1I6M1D16M1I47M = 45881519 -443 TTACCCGCTTTCCTTGCCCAAATTTTAAGTTTCNGGAAAAGGGGAGGGAAATGGGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTGACAGAGTGTCAC #######################################################ECGGGGGGGGGGGGGGGGGGGGGGGGHHHHHHHHHHHHHHHHHHH MD:Z:3G4A7C4C1A2T2^T2C4T3A0G5C44 RG:Z:1 XG:i:3 AM:i:29 NM:i:14 SM:i:29 XM:i:10 XO:i:3 XT:A:M
Will be stored as shown below. Please note that while MD tags are optional in SAM, the reference base is typically currently stored in Goby alignments. The read_index field indicates the cycle in the read where the variation was observed.
{
query_index: 0
target_index: 0
position: 45881868
query_position: 6
matching_reverse_strand: true
multiplicity: 1
query_length: 100
query_aligned_length: 94
target_aligned_length: 93
mapping_quality: 29
pair_flags: 83
fragment_index: 0
insert_size: -443
read_origin_index: 0
softClippedBasesLeft: "TTACCC"
softClippedQualityLeft: "\002\002\002\002\002\002"
bam_attributes: "XG:i:3"
bam_attributes: "AM:i:29"
bam_attributes: "NM:i:14"
bam_attributes: "SM:i:29"
bam_attributes: "XM:i:10"
bam_attributes: "XO:i:3"
bam_attributes: "XT:A:M"
sequence_variations {
to: "T" // the read contained a T
from: "G" // the reference had a G
position: 4 // at position +4 from the start of the alignment
to_quality: "\002" // quality of the base was Phred=2
read_index: 91 // the variation was observed at cycle 91 (the read matches the reverse strand, only a fraction of 100bp aligned).
}
sequence_variations {
to: "T"
from: "A"
position: 9
to_quality: "\002"
read_index: 86
}
sequence_variations {
to: "T"
from: "C"
position: 17
to_quality: "\002"
read_index: 78
}
sequence_variations {
to: "A"
from: "C"
position: 22
to_quality: "\002"
read_index: 73
}
sequence_variations {
to: "T"
from: "-"
position: 23
to_quality: "\002"
read_index: 71
}
sequence_variations {
to: "T"
from: "A"
position: 24
to_quality: "\002"
read_index: 70
}
sequence_variations {
to: "N"
from: "T"
position: 27
to_quality: "\002"
read_index: 67
}
sequence_variations {
to: "-"
from: "T"
position: 30
read_index: 64
}
sequence_variations {
to: "A"
from: "C"
position: 33
to_quality: "\002"
read_index: 62
}
sequence_variations {
to: "G"
from: "T"
position: 38
to_quality: "\002"
read_index: 57
}
sequence_variations {
to: "GA"
from: "AG"
position: 42
to_quality: "\002\002"
read_index: 53
}
sequence_variations {
to: "G"
from: "-"
position: 46
to_quality: "\002"
read_index: 48
}
sequence_variations {
to: "T"
from: "C"
position: 49
to_quality: "$"
read_index: 45
}
}Like this:
Like Loading...