We have recently started a large-scale evaluation of the genotype calling features of Goby and GobyWeb. To this end, we decided to obtain exome data from the 1000 genome project, and compare the genotypes called by Goby when all processing is done with GobyWeb (alignment and genotype calls). Since the 1000g project has way more data than we need for this evaluation, we picked two exome samples semi-randomly. Both are paired-end and one has length 76bp, while the other is 90bp long.
Exome data realignments
Realigning the reads to the 1000g reference was no trouble, we simply converted the bam files distributed by the 1000g project to the compact-reads format and uploaded this to GobyWeb. The rest is pretty much automated and was done in a matter of hours.
Extracting a few samples from the 1000g VCF files
The 1000g genome project distributes many versions of the genotype calls, in the VCF format. Locating the version that was produced against the 1000g reference (based on hg19) that we have installed in GobyWeb was a bit tricky since there is really no summary of all the versions. Thanks to Juan Rodriguez-Flores (in Jason Mesei’s lab at the ICB), for recommending this version:
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20111111_old_phase1_release_files/
(we had initially gone directly to the very latest release, being wary of the “old” keyword in this version, but that the most recent version turned out to have been aligned against some ancestral reference reconstructed from primates, as far as we could tell and would not work for this validation).
As one can see form this directory, genotype calls are given in the VCF format, and split in one file per chromosome (excluding the X and Y chromosome and MT). The files are in the tens of GB even though they are compressed. The files are large because they contain genotypes and annotations for hundreds of samples studied in the 1000g genome.
Trying vcf-compare against just one of these files convinced us the comparison would be too slow against the complete files. We decided to extract just the two samples we selected for validation to yield smaller files that could be compared more efficiently.
Fortunately—we initially thought—VCF-tools provides a program called vcf-subset. Let’s just run this program with the names of the two samples we need to extract, on each of these chromosome, then concat the result. It turns out that vcf-subset is incredibly slow for the work it needs to perform. To be more specific, on a fast server, after a day of processing, we had not finished extracting the two samples from the chromosome 1 file. Upon closer inspection, the perl process was running at 100% CPU, but did not appear to make much process through this file (as judged from the speed at which results were added to the output). At this point, rather than throwing more CPUs at the problem and go on vacation (Chrismas time), we decided to go green (consider that inefficient programs are just as detrimental to our environment than other wasteful ways to burn oil).
Since Goby provides an efficient VCF parser, we reasoned we could write a more efficient way to extract the data we needed without too much trouble. To this end, we added a vcf-subset mode in Goby (an early implementation of this mode made it to 1.9.8.1, but we suggest getting the source code directly from our subversion server until we push 1.9.8.2 since the mode has improved a lot since that first release). Re-implementing the mode indeed provided a performance boost, but also offered an opportunity to add new options. One new feature that we added quickly was to process a number of files in parallel. We are now able to subset the 1000g VCF files in a few hours on a multi-threaded server.
Why is vcf-subset so slow?
This is obviously much better, but one has to wonder what is taking so long? After all, the input files are only a few Gigabytes, and we don’t need to do anything complicated, just extract a subset of information. It turns out that the design of the VCF format makes the task very computationally demanding (much more so than it would need to be). First of all, VCF is a text-based format, which by definition is slow to parse. To complicate matters, the format of the file can vary from line to line (look at the specification of the FORMAT and sample column). In my opinion, this is a very poor design decision. Consider whether you have yet to encountered a VCF file that use the feature (e.g., that has different fields in the FORMAT field on different lines)? The need is not common, yet every program must be written to support this “flexibility” and there is a clear computing cost for supporting the feature. This is a typical red flag that should have told the designers of the added flexibility was not worth the compute cost.
Another complication introduced by VCF is that the type of delimiters varies by field (the INFO fields are delimited by ;, FORMAT fields use :, while other fields are delimited by tabs). All these “features” may seem to provide flexibility, but they combine to create significant inefficiencies. The format looks like as if it was designed so that humans can read it, yet is now used to store gigabytes of data. All this suggests to me that the committee/members of the mailing list who have been responsible for the design of VCF could have paid more attention to the actual uses of the format and should have weighted the impact of adding “nice to have, but not that frequent” features against the computational cost of these features. Then again, most people don’t care if a program or format is inefficient, at least until the computational cost makes some needed tasks impractical. It seems that VCF may be ripe for a redesign, it clearly meets requirements, but it is so complicated and inefficient that it would make sense to replace it with a leaner and efficient alternative.
I would not be surprised if the techniques we used in Goby formats could yield a more compressed VCF format alternative that could be subset in minutes rather than hours. Whether the community is feeling enough pain to consider adopting an alternative is a different question. What do you think?

Leave a Comment