Yesterday, we released software to train and evaluate deep learning models to call genotypes in sequencing data. The software is open source and on GitHub.

This first version supports calling SNPs. More work is needed to refine and evaluate the method for indels.

We chose to release at this stage to identify potential collaborators interested in working with us on an efficient and fully open source DL platform for genotype calling.

A couple of weeks ago, a team at Google led by Marc DePristo deposited a preprint showing state-of-the-art performance for a universal deep learning genotype caller. The Google caller (DeepVariant) transforms genomic data into images and models genotype probabilities with an Inception v2 architecture (an established DL model for image analysis developed at Google). While Google plans to release the software, the source code is not currently available.

Building on our previous work [1,2], we took a different approach: we map alignment data to features and train a DL model (a feed-forward neural network with 5 fully connected layers) on them. Alignments can be provided in BAM, CRAM, or Goby format and are processed with version 3.2 of the Goby framework, which can optionally realign variants around indels. Training data are stored in the Sequence Base Information format (extension .sbi), a compact, protocol buffer-based representation. DL models are trained with the variationanalysis project (release 1.2+).
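For readers who want a concrete picture of the model side, here is a minimal sketch, in Python/PyTorch, of what a 5-layer fully connected genotype classifier looks like. It is an illustration only, not the project's training code; the feature count, hidden size, and number of genotype classes are made-up placeholders.

```python
import torch
from torch import nn

class GenotypeNet(nn.Module):
    """Illustrative feed-forward classifier with 5 fully connected layers.

    Maps a per-site feature vector to genotype-class scores. The sizes below
    (32 features, 64 hidden units, 15 genotype classes) are placeholders,
    not values taken from the project.
    """
    def __init__(self, num_features=32, num_classes=15, hidden=64):
        super().__init__()
        layers, in_dim = [], num_features
        for _ in range(4):                        # four hidden layers...
            layers += [nn.Linear(in_dim, hidden), nn.ReLU()]
            in_dim = hidden
        layers.append(nn.Linear(in_dim, num_classes))  # ...plus an output layer = 5 FC layers
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = GenotypeNet()
features = torch.randn(8, 32)                     # a mini-batch of 8 sites
probs = torch.softmax(model(features), dim=1)     # genotype probabilities per site
print(probs.shape)                                # torch.Size([8, 15])
```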

This tool can train genotyping models for a variety of assays and sequencing platforms, as long as true genotypes are available. This is often accomplished by sequencing NA12877 or NA12878, for which ground truth has been previously established [3].
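The labeling step itself is conceptually simple: each candidate site in the training alignment is paired with the genotype reported in the established truth set. The sketch below illustrates the idea only; the dict-based truth set and matching logic are simplifications, not the project's implementation.

```python
# Hypothetical illustration: attach ground-truth labels to candidate training sites.
# In practice the truth genotypes would come from an established call set for
# NA12877/NA12878 (e.g. a truth VCF); the positions below are made up.
truth_genotypes = {
    ("chr1", 1014143): "T/T",
    ("chr1", 1014228): "C/T",
}

candidate_sites = [("chr1", 1014143), ("chr1", 1014200), ("chr1", 1014228)]

labeled = [
    (chrom, pos, truth_genotypes[(chrom, pos)])
    for chrom, pos in candidate_sites
    if (chrom, pos) in truth_genotypes   # keep only sites with a confident truth label
]
print(labeled)   # [('chr1', 1014143, 'T/T'), ('chr1', 1014228, 'C/T')]
```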

How does this tool differ from DeepVariant? The design and implementation were done independently, and while the key idea of applying deep learning to improve genotype calling is similar, our group made different decisions:

Ploidy. A key difference is that the DeepVariant model can only call genotypes in diploid genomes. We think this is too limiting unless you work only with animals; many plants are triploid or tetraploid. Our project places no arbitrary limit on ploidy: it supports any ploidy, as long as you have ground-truth genotypes and matching sequencing data to train with.
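To make the ploidy point concrete, the sketch below enumerates the genotype space at a biallelic site for different ploidies. It illustrates the combinatorics only and is not code from either project.

```python
from itertools import combinations_with_replacement

def genotype_space(alleles, ploidy):
    """Enumerate the unordered genotypes possible for a set of alleles and a ploidy.

    A diploid-only caller fixes ploidy at 2; supporting arbitrary ploidy means
    the set of genotype classes grows with the ploidy of the organism.
    """
    return list(combinations_with_replacement(sorted(alleles), ploidy))

print(genotype_space(["A", "T"], ploidy=2))  # [('A','A'), ('A','T'), ('T','T')]
print(genotype_space(["A", "T"], ploidy=4))  # 5 genotypes for a tetraploid site
```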

Need for preprocessing. DeepVariant requires specialized preprocessing, such as haplotype assembly and realignment. Our project relies on Goby to provide fast realignment of variants around indels but, beyond this, does not require much preprocessing. We aim to train models that perform their own data processing on the fly, learning to do so in the same training run in which they learn to call genotypes. By eliminating the preprocessing pipeline, such models could drastically reduce the computational requirements for calling genotypes in sequencing data without sacrificing quality.

Image or no image? Assembling image data and training DeepVariant models with it seems computationally intensive (each site is represented by a 10,000 by 300 pixel image). More importantly, an image may not be an optimal data representation for the problem. Our project does not convert alignments to images; instead, it maps structured data (alignments) to a small number of informative features, which can be used for efficient DL model training. We can explore different mapping strategies and measure the impact of adding various types of information to the features. We can also explore different DL architectures, because we use a framework that makes it easy to try alternatives.
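As a rough, hypothetical illustration of what mapping an alignment to a small number of informative features can mean, the sketch below summarizes the pileup at one site with a handful of counts and quality statistics. The specific features and normalizations are assumptions made for this example, not the project's actual feature mappers.

```python
import numpy as np

def site_features(base_counts, mean_base_quals, mean_map_qual):
    """Map the pileup at one genomic site to a small, fixed-length feature vector.

    base_counts:     dict of base -> number of supporting reads, e.g. {"A": 30, "T": 12}
    mean_base_quals: dict of base -> mean base quality of the supporting reads
    mean_map_qual:   mean mapping quality of reads covering the site
    """
    bases = ["A", "C", "G", "T", "N"]
    counts = np.array([base_counts.get(b, 0) for b in bases], dtype=np.float32)
    total = counts.sum() or 1.0
    quals = np.array([mean_base_quals.get(b, 0.0) for b in bases], dtype=np.float32)
    return np.concatenate([
        counts / total,           # allele fractions
        [np.log1p(total)],        # coverage, log-scaled
        quals / 40.0,             # base qualities, roughly normalized
        [mean_map_qual / 60.0],   # mapping quality, roughly normalized
    ])

vec = site_features({"A": 30, "T": 12}, {"A": 34.0, "T": 31.5}, 55.0)
print(vec.shape)  # (12,) -- a dozen numbers per site instead of thousands of pixels
```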

Interested in helping train and evaluate models? Training new models will be required as data from new platforms become available, and for this reason we feel that a community effort is best suited to developing these technologies efficiently. We are looking for collaborators to help us train models and rigorously evaluate performance on a variety of sequencing platforms. We hope that opening its development will help variationanalysis become the de facto open source tool for calling genotypes with deep learning approaches.

Please let us know your thoughts on Gitter or the Q/A forum. Note that the official releases of these projects have been pushed back until after the end-of-year holiday, but the source code and SNAPSHOT releases are available (see the links in the tutorial for binaries).