Browsing Posts published by Fabien Campagne

Yesterday, we released software to train and evaluate deep learning models to call genotypes in sequencing data. The software is open source and on GitHub.

This first version supports calling SNPs. More work is needed to refine and evaluate the method for indels.

We chose to release at this stage to identify possible collaborators interested in working together on an efficient and fully open source DL platform for genotypes.

A couple of weeks ago, a team at Google led by Mark DePristo deposited a preprint showing state-of-the-art performance for a universal deep learning genotype caller. The Google caller (DeepVariant) transforms genomic data into images and models genotype probabilities with an Inception v2 architecture (an established DL model for image analysis developed at Google). While Google has plans to release the software, the source code is not currently available.

Building on our previous work [1,2], we took a different approach. We map alignment data to features to train a DL model (feed forward neural network with 5 fully connected layers). Alignments can be provided in BAM, CRAM, or Goby format and processed using version 3.2 of the Goby framework. This framework can optionally realign variants around indels. Training data are stored in the Sequence Base Information format (extension .sbi), providing compact representation in a protocol buffer based storage. DL models are trained with the variation analysis project (release 1.2+).

This tool can train genotyping models for a variety of assays and sequencing platforms, as long as true genotypes are available. This is often accomplished by sequencing NA12877 or NA12878, for which ground truth has been previously established [3].

How does this tool differ from DeepVariant? The design and implementation were done independently, and while the key idea of applying deep learning to improve genotype calling is similar, our group made different decisions:

Ploidy. A key difference is that the DeepVariant model can only call genotypes in diploid genomes. We think that this is too limiting unless you only work with animals. Many plants are triploid or tetraploid. Our project supports arbitrary ploidy, as long as you have ground-truth genotypes and matching sequencing data to train with.
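To make the ploidy point concrete, here is a minimal Python sketch of how the set of candidate genotypes at a site grows with ploidy. It is only an illustration: the function names are hypothetical, and this is not how variationanalysis encodes its model outputs. It simply enumerates unordered allele combinations.

```python
from itertools import combinations_with_replacement
from math import comb


def possible_genotypes(alleles, ploidy):
    """Return every unordered genotype made of `ploidy` copies drawn from `alleles`."""
    return ["/".join(g) for g in combinations_with_replacement(sorted(alleles), ploidy)]


def genotype_count(num_alleles, ploidy):
    """Closed form: number of multisets of size `ploidy` over `num_alleles` alleles."""
    return comb(num_alleles + ploidy - 1, ploidy)
```

With two alleles, `possible_genotypes(["A", "G"], 2)` yields three diploid genotypes, while the same call with a ploidy of 4 yields five. A caller that hard-codes the diploid case cannot represent the extra tetraploid states.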

Need for preprocessing. DeepVariant requires specialized preprocessing, such as haplotype assembly and realignment. Our project relies on Goby to provide fast realignment of variations around indels, but beyond this, does not require much preprocessing. We aim to train models that can perform their own data processing on the fly and learn how to do this in the same round as they learn how to call genotypes. By eliminating the preprocessing pipeline, such models could drastically reduce the computational requirements for calling genotypes in sequencing data, without sacrificing quality.

Image or no image? Assembling image data and training DeepVariant models with them seems computationally intensive (each site is represented by a 10,000 by 300 pixel image). More importantly, it may not be an optimal data representation for the problem. Our project does not convert alignments to images, but instead maps structured data (alignments) to a small number of informative features, which can be used for efficient DL model training. We can explore different mapping strategies and see the impact of adding various types of information to the features. We can also explore different DL architectures because we use a framework that makes it easy to try alternatives.
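As a rough illustration of what mapping alignments to a small number of features can mean, the Python sketch below turns the bases observed at one site into a nine-number vector. The input tuples and the particular features (per-base, per-strand counts plus a mean base quality) are assumptions made for this example. The actual feature mappers in variationanalysis are different and richer.

```python
BASES = "ACGT"


def site_features(observations):
    """Map the bases observed at one site to a fixed-length feature vector.

    `observations` is a list of (base, is_reverse_strand, base_quality)
    tuples, a hypothetical stand-in for what an alignment reader provides.
    The result holds 8 strand-specific base counts followed by the mean
    base quality of the observations that were counted.
    """
    counts = {(base, strand): 0 for base in BASES for strand in (False, True)}
    quality_sum = 0
    used = 0
    for base, is_reverse, quality in observations:
        if (base, is_reverse) not in counts:
            continue  # skip N and any other symbol outside A, C, G, T
        counts[(base, is_reverse)] += 1
        quality_sum += quality
        used += 1
    features = [float(counts[(base, strand)]) for base in BASES for strand in (False, True)]
    features.append(quality_sum / used if used else 0.0)
    return features
```

Nine numbers per site is a very different input size from the 3,000,000 pixels of a 10,000 by 300 image, which is what makes it cheap to try alternative mappings and architectures.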

Interested in helping train and evaluate models? Training new models will be required when data from new platforms become available, and for this reason we feel that a community effort is best suited to efficiently develop these technologies. We are looking for collaborators to help us train models and rigorously evaluate performance on a variety of sequencing platforms. We hope that opening its development will help variationanalysis become the de facto open source tool for calling genotypes with deep learning approaches.

Please let us know your thoughts on gitter or the Q/A forum. Note that the official releases of these projects have been pushed back until after the end-of-year holiday, but the source code and SNAPSHOT releases are available (see links in tutorial for binaries).

We have released Goby 3.0. There are a few novelties in this release that warranted a new major version:

  • Goby 3 estimates probabilities that a genomic site contains a somatic mutation using adaptive models trained with Deeplearning4j. Models are provided for RNA-Seq, paired exome and trio experimental designs (subject and parents, where the subject may carry de novo mutations). See the companion project variationanalysis, which we developed to train these models. New models can be trained with the companion project and used directly with Goby. These models can be used with the somatic_variation format of the discovery-sequence-variants mode. A preprint about the method is being finalized.
  • We have enabled support for BAM and CRAM in the discovery-sequence-variants mode of Goby. This means that you can directly use BAM files to call somatic variations in them. CRAM support has not been well tested, but it is provided by the HTSJDK library that we also use to support BAM, so it will hopefully also work.
  • We have upgraded Goby from an older version of the Java Samtools library to the latest HTSJDK (version 2.2.4). This may cause trouble with alignments in BAM format sorted with samtools 0.x. We discovered that HTSJDK will not recognize that these BAM files are sorted unless you modify the header or sort the BAM again with a recent version of samtools (1.x+).
  • We ported the source code to Maven and consolidated dependencies. As a result you will find the goby-io jar in Maven Central, which makes it easier to use the Goby framework in any Maven project.
  • The source code of the project has been moved to a new repository on GitHub.
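Regarding the BAM sorting issue mentioned above: the SAM specification records the sort order in the SO field of the @HD header line, and we assume here that this is the declaration HTSJDK relies on. The Python sketch below (hypothetical helper names, operating on plain header text) shows how to inspect that field and how to produce a header that declares coordinate sorting. The header text could be obtained with `samtools view -H` and written back with `samtools reheader`.

```python
def declared_sort_order(header_text):
    """Return the SO value of the @HD line, or None when none is declared."""
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
            return None
    return None


def mark_coordinate_sorted(header_text):
    """Return header text whose @HD line declares SO:coordinate.

    Only the declaration changes. The caller must be certain that the
    records really are sorted by coordinate.
    """
    lines = header_text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("@HD"):
            # Drop any existing SO field, then append the new declaration.
            fields = [f for f in line.split("\t") if not f.startswith("SO:")]
            fields.append("SO:coordinate")
            lines[i] = "\t".join(fields)
            break
    else:
        # No @HD line at all: add one as the first header line.
        lines.insert(0, "@HD\tVN:1.0\tSO:coordinate")
    return "\n".join(lines) + "\n"
```

Editing the header is only safe when the file is known to be sorted. When in doubt, sorting the BAM again with samtools 1.x is the more conservative option.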

Version 2.0.0 of MetaR has introduced instant refresh for composable R scripts. This feature is illustrated in a GIF in the following tweet:

As you can see, it has become possible to modify an R script and very quickly see the impact of a change on the plot that the script produces (there is no need to hit run/source, the change is detected automatically, execution triggered and the display updated). Refresh occurs in less than two seconds in this case. The reason instant refresh is fast is that in most cases, the entire script does not need to be re-executed when a change is detected. Instant refresh uses the point of change to calculate the minimum set of instructions necessary to refresh the script outputs.

Instant refresh works well when coupled with another feature of MetaR 2, which enables capturing plots and showing a preview of them inside the R script:

Notice the export plot -> Output { } block at the top of the animation. This block identifies the set of expressions that produce a plot and gives the plot a name. Once the plot is named, the multiplot statement is used to preview the plot content. You will be familiar with multiplot if you have used MetaR before. The multiplot statement helps define multi-panel figures. In this case, we use it to show the content of one plot embedded into the R script.

The features work together because the R script is written in composable R, a version of the R language that can be extended with new kinds of expressions. Composable R is more flexible than the traditional R language since we can embed plots, buttons and other graphical elements directly in the source code of the script.

In addition to graphics, we can extend the language with expressions that offer new semantics. We introduced the Save Session expression. Save Session gives users the ability to save an R session, and gives instant refresh the ability to quickly identify where a session was saved in the script. An extension was also introduced to install or load R packages. Both extensions are used by instant refresh to determine the closest Save Session sufficient to restart execution of the script, as well as the list of packages that must be loaded in the partial execution. The semantics of these new expressions therefore benefit usability (in the case of installOrLoad, users get installation or loading of a package as necessary, so that the script will run well even on machines where a package was never installed), but they also benefit the implementation of instant refresh, since we do not need to perform complex pattern matching to try to guess what packages are loaded and used by a script.


For instance, consider a change made at the blue arrow in the following snapshot:


Only the expressions marked by red brackets need to be re-executed because the state of the environment before Save Session can be recovered by loading the session.
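The idea can be sketched in a few lines of Python. This is a simplified model written for this post, not the MetaR implementation: the statement kinds and the `refresh_plan` function are hypothetical, and the sketch re-runs everything after the restored session rather than computing a finer-grained minimum.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    kind: str  # "package", "save_session" or "code" (hypothetical labels)
    text: str


def refresh_plan(script, changed_index):
    """Decide what to do after the statement at `changed_index` was edited.

    Returns (restore_from, packages, to_run):
      restore_from: index of the closest Save Session before the change,
                    or None when the script must run from the start
      packages:     packages declared before that point, to load again
      to_run:       indices of the statements to re-execute, in order
    """
    restore_from = None
    for i in range(changed_index - 1, -1, -1):
        if script[i].kind == "save_session":
            restore_from = i
            break
    start = 0 if restore_from is None else restore_from + 1
    packages = [s.text for s in script[:start] if s.kind == "package"]
    return restore_from, packages, list(range(start, len(script)))
```

Because Save Session and package loading are explicit statements, the plan is found by scanning statement kinds, with no need to pattern-match R source text.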

These features illustrate the first steps towards a composable R notebook where results and code are mingled and results update as the code is changed.

Instant refresh was developed by Alexander Pann. Alex joined the lab from June to August 2016 for a summer internship and did a fantastic job prototyping instant refresh. He developed two versions of it, the first one with Jupyter (released in MetaR 1.9). The first version supported MetaR analyses only, and was complex because of the client-server code needed to communicate with Jupyter. In the second version, we dropped Jupyter in favor of direct R evaluation. This sped up execution dramatically, simplified the architecture and made it more robust. The second iteration made it possible to support composable R scripts with instant refresh as well as partial re-execution.

We have released MetaR 1.9. This release offers a draft instant refresh feature. This feature was developed by Alexander Pann during the start of his summer internship in the lab. The following animated GIF illustrates how it works:


As you can see in the animation, instant refresh monitors changes in analyses and triggers the execution of the analysis to refresh the content of files produced by an analysis. In this example, we import a table, preview it, rearrange columns by dropping columns with group “ignore”, and preview the result. Finally, we move the LPS columns first. Observe that the previews are refreshed after a few seconds. This feature should work for table previews and plots (i.e., multiplot previews). It is an early release and we expect to improve the speed of the refresh and overall stability. See the new Jupyter Preferences and tool for details. Stop the container to disable instant refresh.

See the Change Log for details about this release. As usual, you can download it from the JetBrains MPS plugin repository (within MPS choose Preferences>Plugin and locate the MetaR plugin to upgrade or install).

In the lab, we often want to show analyses we do with MetaR or NextflowWorkbench to others who may not be familiar with MPS. These analyses are written with MPS, and others have to install MPS on their machine before they can look at them. It probably takes 5 minutes to download and install MPS on a modern computer with a good network connection, but this is still too slow when somebody just wants to have a look at some new technology.

We’ve developed the prototype of a tool, called Circles, that makes it easier to show analyses done with MPS on the web. Once models are published to the web, you can see them in the web app, or even embed a Circles page directly in another web page. For instance, the following shows a MetaR analysis published to Circles:

You can access the web app directly to search for content, or share links to your published models with others. Here’s a link to the NextflowWorkbench Process called Sample_KallistoCountsWithTuples.

You can publish your own MPS models to Circles. See Publishing instructions on the Circles project page.

Any questions? Join the chat at

We have released MetaR 1.8.1. This release has three features:

  1. It enables automatic refresh of images when the content of the image changes. In previous versions of MetaR, you had to click on Hide preview/Preview to refresh multiplots, and had to move the cursor away from a plot and back on it to refresh an inspector plot preview. This feature automatically refreshes the display of images when their content is changed. For the automatic refresh to work, you need to add the RESULTS_R directory as a default model root for the solutions where you develop models. This will enable MPS to monitor the files under the directory and will trigger refreshes of the images when the file content changes.
  2. We have updated image rendering to work nicely when you publish nodes to Circles (more on this in a future post).
  3. It provides a preview table statement, which shows the content of some columns and rows. This statement was developed by Alexander Pann, who joined the lab last week for a summer internship (Alexander knew MPS before joining the lab). Here’s what the feature does:


I skimmed through Daniel Engber’s piece on reproducibility in Biomedicine this morning (titled “Cancer Research Is Broken”). There are several problems, in my opinion, with the way the piece is written.

The first problem is that the piece starts by claiming that there is widespread “waste of time and money”. The first mention of the word “waste” is in this sentence: “Many science funders share Parker’s antsiness over all the waste of time and money.” Note that there is no source listed. I think if you are going to call out waste of time and money, you could be specific about who told you they thought so, and provide data about the extent of the waste and how exactly it is characterized (i.e., what are you comparing to?). In my book, waste happens when one performs an activity in one way, when another way exists that can save time or money. In the example of cancer research, I think many researchers would love to know what way would be better than the one followed, which would save time and money, if indeed “all the waste of time and money” exists.

The next sentence does not bring anything to address this question. We have to read another 11 sentences that do not use the word waste. The word occurs again in the following sentence:

“Given current U.S. spending habits, the resulting waste amounts to more than $28 billion”. We are now given a huge number to clarify the scope of the implied waste, but note that we are no closer to receiving an explanation of the term waste (i.e., what other activity would result in the same outcome but save time and money).

At this point, I skimmed through the rest.

One point this piece tries to make is that all the money spent is “wasted” because of lack of reproducibility. Reproducibility is defined as the ability to redo the exact same experiment and get the same results. There is a broad discussion about reproducibility in biomedicine, and this piece seems to imply that lack of reproducibility is the result of corruption: “data are corrupt” (data don’t care about money, I guarantee you: they are not human) or inability: “The science doesn’t work.” (another example of misplaced anthropomorphism: science does not do anything, it’s a method. Scientists do use the scientific method, and have more success than people who do not use the scientific method).

One phrase that makes sense to me in this piece is “The findings are unstable.” I would have liked this point to be developed, because I believe it is the main reason why some studies’ conclusions or interpretations can end up being shown wrong (often later, after the study has been published). A finding is unstable if you change the material of the plates and you get a different result. A finding is unstable if you try to do the same experiment in a different cell type where you expect the same result, and cannot observe the same effect. Reproducibility has little to do with it. If you change a factor that should be irrelevant to the outcome, but the outcome changes, the outcome is brittle, not robust, not worth writing about, and certainly should not end up being published in the scientific literature.

So I think if Biomedicine is in a crisis, it is one of generalization more than reproducibility. I believe that even if we could devise fully reproducible approaches, publishing results that are true when obtained with only one specific type of plate would not be particularly useful. Instead, I think that scientists continue to focus on understanding what factors should affect the result of an experiment and which should not, and design their experiments to include at least one factor that they don’t expect to change the interpretation. I believe good scientists aim to report robust findings and are already taking such steps to filter results before they send them for publication. Others should take note.


We have just released MetaR 1.8.0 (available from the JetBrains MPS plugin repository, see installation instructions). This release includes a number of new features:

  • Support for generating UpSet plots to look at intersections of gene lists and/or table subsets:


  • Added the ability to create annotated MA plots (thanks go to Kevin Hadi for the R code that this statement generates):


  • New support for Sleuth. You can now read Kallisto results directly with MetaR and call transcripts differentially expressed with Sleuth. The sleuth statement integrates with the Table annotation capabilities of MetaR and supports both Wald and Likelihood Ratio Tests.


Once configured, this is what a Sleuth analysis looks like in MetaR:


  • Add normalized table output for Limma Voom statement (previously normalized counts were written to the adjusted table, but this table was only available if you had more than one factor in the model). This change makes it possible to get normalized expression data in the more common situation of a model with one factor.
  • Add the ability to choose a group to identify the column that contains the names of the genes to show on a heatmap. Migration script will set this new annotation to the default ID group.
  • Improve editing of R code, especially when calling functions.

The documentation has been updated to describe these new features. See the index heading “New in MetaR 1.8” to locate the updates.
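For readers new to UpSet plots, the quantity they display can be sketched in a few lines of Python. This is only an illustration of the counting, not the R code that MetaR generates: the function name and the example list names are made up. Each element is assigned to the exact combination of lists that contain it, and the plot shows one bar per combination.

```python
from collections import Counter


def exclusive_intersections(named_sets):
    """Count elements by the exact combination of sets they belong to.

    `named_sets` maps a list name to a set of identifiers. The result maps
    each observed combination of names (a frozenset) to the number of
    identifiers found in exactly those lists and no others.
    """
    membership = {}
    for name, items in named_sets.items():
        for item in items:
            membership.setdefault(item, set()).add(name)
    return Counter(frozenset(names) for names in membership.values())
```

For example, with three gene lists named `list_a`, `list_b` and `list_c`, a gene present in all three is counted once under the three-way combination and does not inflate the pairwise bars.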

We are finishing a release of NextflowWorkbench (NW) that will support running workflows in the cloud. In practice, this means that NW will:

  • Help you provision a cluster on Google Cloud Platform (we may support other clouds in the future).
  • Let you submit workflows to the cluster running in the cloud and execute them there. We will use this feature in training sessions to make it possible to run workflows from older laptops that don’t have enough memory to run the most recent tools (e.g., Kallisto, Salmon).

Here’s a preview of what these new features will look like:

Provisioning a cluster:

The person who creates the cluster needs a laptop that can run Docker. Creating the cluster will be quite simple:



After following these instructions, you will see:


If you’ve created a cluster to use in a training session, you can paste public keys for each trainee into the MPS editor and press ‘Grant Access’ to give each trainee remote access to the cluster.

Running Workflows in the Cloud:

The ‘List nodes’ button will show the information needed to configure workflow submission in the workbench (e.g., IP address of the front-end and user name). Each trainee needs to configure these attributes in the same way that they would to run on a lab cluster:


With this configuration, workflows can be run in the cloud:


You should expect release 1.5 within a week or so.

We have released NextflowWorkbench 1.3. This new release makes it easier to create frozen docker images with the bioinformatics resources needed by your analysis pipeline. Creating frozen images is useful in two situations:

  1. You are organizing training sessions to show others how to use a workflow and you need a simple way for trainees to install all the required software. Creating a frozen docker image that includes all the dependencies, from software to data indices will help you save time. Trainees download the image and are ready to go without lengthy configurations.
  2. You need to create a frozen pipeline for clinical data analysis (i.e., in the context of CLIA and/or New York State clinical test regulations). You can package the workflow and its dependencies in a docker image that can be tagged and represents a frozen state of the software and data.

Creating such a frozen image got easier with release 1.3 of the NextflowWorkbench. You can use a special instruction (install gobyweb artifacts) inside the Dockerfile and declaratively indicate which resources are needed, as shown in this snapshot:


If you have defined this resource declaration in a Process or Bash script, you can simply type “install gobyweb artifacts” in the dockerfile and paste the resource bit of the statement (copy it from the script).

This special instruction will expand into a series of shell instructions that will install the requested software inside the image. The Process or script, when run, will automatically detect that the resources are present in the image and will run much faster, as these resources will not be installed again.

The following improvements are also included in release 1.3:

  • Improved auto-completion for artifact paths and paths inside the docker image (for instance, auto-completion will now warn you if the interactive container is not running).
  • Provide the ability to specify a current directory for interactive containers, so that relative path auto-completion knows what suggestions to offer when auto-completion is invoked.
  • Add Salmon 0.5.0 to the list of resources that can be installed as a GobyWeb resource.