Following up on ideas mentioned in the previous post, we continued to experiment with Language Workbench technology for data analysis.

Our third prototype is MetaR, a simple language for biological data analysis. As a first goal for this language, I wanted to make it simpler to create heatmaps from tables of read counts. This is one of the top requests we get from people who use GobyWeb to call differentially expressed genes in RNA-Seq: can GobyWeb generate heatmaps? My answer had been that there are many tools to do this and, since we don’t like to reinvent the wheel, I would just provide pointers to R/Bioconductor packages (such as pheatmap), to applications such as GENE-E, or to the more recent web-based HeatMapViewer.

However, I recently had a couple of heatmaps to produce for projects, and using R/Bioconductor quickly became tedious, not to mention error-prone, when projects had tens of samples and I needed to track how each of these samples mapped to a number of analysis covariates. I am not a fan of graphical user interfaces for data analysis, because I tend to redo analyses often when new data is added to a project, and rebuilding visualizations for each update takes as much time as building the first plot. This is not the case with scripts: once they are developed, you just need to execute them again after updating the data to produce the new figures. For these reasons, I wanted a better way to build heatmaps, one that would be faster than writing R code but would still retain the write-once, run-many-times advantage of analysis scripts.

Knowing a few things about MPS, and reusing a few languages we have been building, such as TextOutput to simplify generation of R code, or XChart representations of TSV files, I was able to quickly put together a prototype that allowed me to generate R code using a few high-level abstractions (e.g., Tables, Column groups and Column group usage, Analysis script). The language quickly became very useful. Manuele joined the development, added support for table previews in the editor and typed columns, and contributed many fixes and much polishing. After some brainstorming, we decided to call the language MetaR (it brings meta programming to the R language). MetaR analyses can be put under version control, offer auto-completion, and are much shorter and simpler than equivalent R scripts, mostly because a lot of the configuration that R programmers need to write can be generated automatically, using a few conventions and taking advantage of the structured Table and Column Group concepts. These ideas are simple, but work remarkably well to simplify the production of heatmaps and other visualizations. In practice, we find that MetaR analysis scripts are 5-10 times shorter than the R scripts that they generate.
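To make the idea of generating R code from Table and Column Group abstractions more concrete, here is a minimal sketch in Python. It is not how MetaR is implemented (MetaR is built with MPS), and the class names, the `counts.tsv` file and the sample names are all hypothetical. The R calls it emits (`read.delim`, `data.frame`, `pheatmap` with `annotation_col`) are standard ones, but the script MetaR actually generates may look quite different.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnGroup:
    """A named set of table columns, e.g. the samples of one condition."""
    name: str
    columns: list


@dataclass
class Table:
    """A TSV file of read counts; the first column holds the row identifiers."""
    path: str
    groups: list = field(default_factory=list)


def r_vector(values):
    """Render Python strings as an R character vector literal."""
    return "c(" + ", ".join('"%s"' % v for v in values) + ")"


def heatmap_script(table):
    """Emit an R script that draws a heatmap annotated by column group."""
    # Flatten the groups: one sample name and one group label per column.
    columns = [c for g in table.groups for c in g.columns]
    labels = [g.name for g in table.groups for _ in g.columns]
    return "\n".join([
        "library(pheatmap)",
        'counts <- read.delim("%s", row.names = 1, check.names = FALSE)' % table.path,
        "samples <- %s" % r_vector(columns),
        "annotation <- data.frame(group = %s, row.names = samples)" % r_vector(labels),
        "pheatmap(as.matrix(counts[, samples]), annotation_col = annotation)",
    ])


# Hypothetical table with two column groups of two samples each.
table = Table("counts.tsv", [
    ColumnGroup("treated", ["S1", "S2"]),
    ColumnGroup("control", ["S3", "S4"]),
])
print(heatmap_script(table))
```

The point of the sketch is that the user only states which columns belong to which group; the bookkeeping that maps samples to covariates is derived from that declaration rather than typed by hand.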

At some point, I started using MetaR to put together figures with multiple panels for a manuscript. To do this, I extended the language with a multi-plot statement. This statement helps arrange plots produced by other statements into a matrix of n rows and m columns. You can organize the order of the plots in the matrix by entering references to these plots, and after you run the script, you can get a preview of the resulting multi-plot. This is much more intuitive to use than the lower-level layout feature of R that this statement generates.
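As an illustration of what arranging plots in a matrix involves, the following Python sketch builds a call to R's `layout()` function from a grid of plot references. The function name `layout_call` and the plot names are hypothetical, and the post does not show the exact R code that the multi-plot statement generates, so treat this as one plausible translation only.

```python
def layout_call(rows, cols, cells, draw_order):
    """Build an R layout() call for a rows x cols grid of plots.

    cells: plot names in row-major order, with None for an empty cell.
    draw_order: plot names in the order the script draws them; R's
    layout() refers to figures by this 1-based drawing order.
    """
    if len(cells) != rows * cols:
        raise ValueError("expected %d cells, got %d" % (rows * cols, len(cells)))
    index = {name: i + 1 for i, name in enumerate(draw_order)}
    # 0 marks a cell that layout() leaves empty.
    numbers = [0 if name is None else index[name] for name in cells]
    return "layout(matrix(c(%s), nrow = %d, ncol = %d, byrow = TRUE))" % (
        ", ".join(str(n) for n in numbers), rows, cols)


# Three plots drawn in one order, displayed in another, in a 2 x 2 grid.
print(layout_call(2, 2,
                  ["heatmap", "volcano", "pca", None],
                  ["pca", "heatmap", "volcano"]))
```

Writing the figure numbers by hand is where the lower-level R interface gets awkward: the numbers depend on drawing order, not on where the plot should appear, which is the bookkeeping a multi-plot statement can hide.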

You can see the language in action in the following video (the second part of this tutorial is available here):

Notice how the edgeR language is added to the environment at runtime. This capability is possible because MetaR supports seamless language composition (a key feature of the MPS Language Workbench). Many micro-languages can be developed to extend MetaR. For instance, you could create your own language to add one or more types of statements to MetaR. The statements you define would become available in the analyses where you import your language extension, and would be able to generate R code to produce an executable R script.

I read a very interesting post this morning, by way of Twitter: the transcript of a keynote lecture that Sean Eddy gave at the “High Throughput Sequencing for Neuroscience” conference at the Janelia Farm (Oct 26-29th). While reading the transcript, I thought the content of the lecture was quite on target, especially when it states that biologists need to own the analyses of the data their experiments generate.

Eddy further argues that “We need to rethink how we’re doing bioinformatics”. I also agree with this point. I have been thinking along the same lines for many years, and I believe that many of the developments in our field are geared towards bioinformaticians when they should be oriented more towards biologists, to empower them and make it possible for them to analyze their own data.

Towards the end of the transcript, Eddy argues that biologists should learn Perl scripting, because it is not that hard (he claims), and because once they know it they will be empowered to analyze their own data. I think this part of the argument is the weakest. I am very skeptical that scripting (in Perl, or your favorite scripting language) is the answer to enable the kind of transformation that Sean Eddy is calling for. In fact, I think that scripting, for most data analysis problems, is the wrong kind of tool. I do believe that biologists need elements of programming and computational thinking, but I think the last twenty years offer plenty of evidence that relying on scripting for all problems is a mistake.

In my opinion, scripting languages offer abstractions that are too low level for most analysis problems that biologists are interested in. I strongly believe that biologists need high-level abstractions and user interfaces to facilitate data analysis: abstractions and interfaces that offer higher-level views of a data analysis problem and its solution than scripting languages can offer. We have prototyped such approaches using Language Workbench technology and found the methods quite useful. You can learn more about it in this pre-print, or in this prezi:

Our lab web site gets a fair amount of traffic. For instance, in the last month alone, we had about 1,900 different visitors. When you think about it, this is comparable to the attendance a medium-sized conference usually brings together. A conference runs for a few days while these statistics cover a month, but a given conference is held only once a year, whereas we keep the web site available year-round.

I made plans to attend the new Data Sciences meeting in Cold Spring next month. I submitted an abstract, but unfortunately the abstract was not selected for a talk. I just happened to have given an internal talk a few weeks ago about the same project, so I thought I would post the slides here. If you happen to see this and also attend the meeting next month, I’d be happy to talk about it. I think this is quite novel and would love to hear what others think, or give a demo if you are curious.

In the meantime, I’ll try to think of ideas to present a new interactive data analysis paradigm via a static piece of paper.

We have released build 5 of the NYoSh Analysis Workbench (version 2.0.5). This build includes changes needed to support a workshop that we will start to offer in November at the Weill Cornell Medical College.

Starting with this build, we not only distribute the workbench as a standalone application (see download links here), but also make it available as regular MPS plugins, which you can install in a standard JetBrains MPS distribution. Most plugins are much smaller than the complete standalone distribution, so updating them can save quite a bit of network transfer. To install the NYoSh Analysis Workbench with plugins, open MPS Preferences, select Plugins, and look for the org.campagnelab.Interactive plugin, version 2.0.5 in the remote repository. Installing this plugin will identify required dependencies and give you access to the languages that the workbench uses.

NYoSh Analysis Workbench Plugins v2.0.5

BDVal for MPS

We have just released BDVal for MPS. This MPS plugin provides a nice editor to configure BDVal to develop models from high-throughput datasets and evaluate their performance. While BDVal required configuration files in XML format (Apache Ant build files), BDVal for MPS provides easy-to-use MPS editors. The design principle is to provide high-level abstractions for users to easily configure an analysis. The configured abstraction is translated to executable code (i.e., legacy BDVal XML format). Execution is transparent to the end-user, who only needs to interact with the editor.

The plugin was tested on Mac and PC and should work on Linux. You can download the plugin here, or directly from the JetBrains MPS plugin repository when it clears moderation.

BDVal for MPS was developed by Victoria Benson (an undergraduate at Cornell, Ithaca) during a three-month summer internship in the lab.

We have released a new version of Goby. This new release mostly fixes a few problems identified since the last release.

New Features:

  – Add an option to the fasta-to-compact mode that will convert a set of files and concatenate the result to a single compact-reads file (see new –concat option).

  – Add a mode to test that the connection from Goby to R is working (requires JRI and R built with shared library support). The mode is called test-r-connection (tcr).

Bug fixes:

  – Fix a bug that caused some slices to occur within annotations, despite the –annotation option being given on the command line to the suggest-slices mode. The problem was that the chromosome index was not obtained from the genome and was always set to zero. In rare cases, this would cause one annotation to be omitted from the output (when the annotation overlapped with the alignment split by genomic position). Thanks go to Laurent Mesnard for reporting this problem.

  – Restore STRICT_SOMATIC filter.

  – Close files opened when loading Goby Alignment header and index files. This fixes a too-many-open-files error that could occur when loading hundreds of alignments simultaneously.

  – Allow lenient import mode for TSV files. This makes it possible to convert TSV files to lucene.index when they were created with Goby in the past with a \t character as the last character of the column line.
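To illustrate what a lenient import mode means for a column line that ends with a stray tab, here is a small Python sketch. Goby itself is written in Java and its implementation is not shown in this post, so the function name `parse_column_line` and the `lenient` flag are hypothetical.

```python
def parse_column_line(line, lenient=False):
    """Split the column line of a TSV file into column names.

    In lenient mode a single trailing tab is ignored; in strict mode it
    produces an empty column name, which is rejected.
    """
    line = line.rstrip("\r\n")
    if lenient and line.endswith("\t"):
        line = line[:-1]  # drop the stray trailing tab
    names = line.split("\t")
    if "" in names:
        raise ValueError("empty column name in column line")
    return names


# A column line written with a trailing tab is accepted in lenient mode.
print(parse_column_line("gene\tsampleA\tsampleB\t\n", lenient=True))
```

Keeping the strict behavior as the default means malformed files are still reported unless the user explicitly asks for leniency.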


I finally figured out how to submit a language to the MPS plugin repository. The first language I submitted was the org.campagnelab.mps.ui language. This language was developed as part of NYoSh, but we see ourselves wanting to use it in other MPS-based projects developed in the lab, so a plugin makes a lot of sense.

Here’s a description of what the language does. The UI language is designed to be used in an MPS editor (for more information about MPS, see my book). When you configure an editor facet to use the UI language (see below for instructions), you get access to UI editor cells. At this time, there are three cells:

  • button
  • file selection button
  • single file selection button

When you type the alias ‘button’ in an editor under construction, a cell is inserted in the editor that looks like this:




You can customize the label of the button by clicking on top of ‘Click me!’ and typing a different label content. When you open the inspector, you will see the following:





This concept function will be executed when the user presses the button (the button becomes visible when the editor is rendered). The concept function gives you access to the UI and to the node currently bound to the editor. You can use the node parameter, for instance, to call a behavior method on the node.

When you have customized the label and click behavior, try generating the editor. Here’s an example of how the buttons look in a generated MPS editor:




The second type of button makes it easy to show a file selection dialog next to a string property. Assume one of your concepts has a path property, where you would like the user to enter a path to a file or directory. In addition to showing the property in the editor, you can add a file selection dialog UI next to it.




Selecting the file selection dialog cell will show this in the inspector:



The ‘accept files’ attribute determines whether the dialog will let users select files. When the attribute is false, as shown, only directories can be selected in the dialog. When the attribute is true, all files can be selected. The ‘property’ attribute will initially be empty, but will auto-complete to let you select which property the path to the directory or file selected by the user should go into. As shown above, the file selection dialog will put the directory selected by the user into the artifactRepoPath property.

That’s it. When you generate the editor, you will see the following:



When the user clicks on the button, a file selection dialog is shown that allows only one directory to be selected. When a choice is accepted in this dialog, the value of the path is put into the property of the node shown in the editor, as configured in the inspector.

This is a good example of language composability in MPS: the UI language composes with the editor language to provide custom ways to capture information in the editor. You could do the same thing in MPS by using a Swing component editor cell and entering a bit of boilerplate code to drive the file chooser or button, but a dedicated language makes it much easier to integrate such buttons in any editor.


To use this language, start by installing the org.campagnelab.mps.UI plugin into MPS (open Preferences, look under Plugins and either find the plugin in the public repository, or download the ZIP via the download button, then click ‘Install plugin from disk..’).

Make sure you activate the plugin. You should see the following:




Once the plugin is active, you can find the language under Modules Pool and import it (cmd-L/ctrl-L) into the editor aspect. Note that two things are required to use the language:

  • Used Languages for the editor must contain org.campagnelab.mps.ui
  • Dependencies of the editor aspect must contain org.campagnelab.mps.ui.code.Swing


We have released a new version of Goby. This new release includes several new features and a critical performance improvement for genotyping and methylation pipelines:

New Features:

  • LastToCompact mode now supports the import of paired end alignments produced by Last’s
  • LastToCompact mode now supports the import of quality scores (lastal must be run with -Q1 since the import assumes Phred quality scores on the q lines).
  • Add two methods to AlignmentReader to determine the minimum and maximum genomic locations represented in the reader. This is useful when suggesting slices to split a set of alignments. This commit includes a fix for possible null start or end positions in slices generated with suggest-position-slices.

Critical performance improvement:

  • Optimize the speed of genotyping when some sites have very high coverage (>500M bases). Goby now sub-samples such sites to keep a random set of 10,000 bases. The default sub-sample size is exposed as a dynamic option called sub-sample-size in IterateSortedAlignmentsListImpl (-x IterateSortedAlignmentsListImpl:sub-sample-size <int>).
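The sub-sampling idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions (the bases observed at a site fit in a list, and sampling is uniform without replacement); Goby is written in Java and its actual strategy may differ. The function name `sub_sample` is hypothetical, and its default mirrors the 10,000 bases mentioned above.

```python
import random


def sub_sample(bases, sub_sample_size=10_000, rng=None):
    """Return at most sub_sample_size bases observed at one site.

    Sites at or below the limit are returned unchanged; deeper sites are
    reduced to a uniform random sample drawn without replacement.
    """
    rng = rng or random.Random()
    if len(bases) <= sub_sample_size:
        return list(bases)
    return rng.sample(bases, sub_sample_size)


# A site with 50,000 observed bases is reduced to 10,000.
deep_site = ["A"] * 40_000 + ["G"] * 10_000
sampled = sub_sample(deep_site, rng=random.Random(42))
print(len(sampled))
```

Because the sample is uniform, allele proportions at a deep site are approximately preserved while the cost of processing that site is bounded by the sub-sample size.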

See the Change Log for bug fixes. You can download the latest version here.

We have a post-doctoral position available starting Sept 1st 2014 to work on developing biomarkers for Chronic Fatigue Syndrome with RNA-Seq data. See the advertisement for details.

When I told some of my colleagues I was thinking about writing a book, most advised me against it. The rationale was that it would be a lot of work and that nobody would read it. I have now almost finished this book (it releases in a bit more than a week, on March 20th 2014). I wanted to share a bit about my experience writing it.

First of all, it was not that bad (no really, it was a lot of work, but I enjoyed it). I had used LaTeX for writing my thesis and it came back quickly enough. I was able to find a nice book template that I could customize to my taste. Second, I was writing only during my vacations and weekends, starting shortly after the beginning of winter 2013, over the holidays. I was trying to write one chapter every weekend, or finish one and draft another. This meant that writing was interleaved with work, which made the whole winter a bit intense. Finally, the work was not very complicated, since my goal was to write a detailed reference and introductory book for a tool I had used for some time.


As for who will read it: I hope it will find an audience. The book is about JetBrains MPS, a terrific tool, and a paradigm shift if I have ever seen one. If you are thinking that programming is no longer fun, you should have a look at MPS. By the time you wrap your head around it, you will wonder why it took you so long to find it. MPS is open-source, so anybody can use it, if they can understand how it works (hint: my book is a good place to start). You can also see the free documentation on the project web site, but do yourself a favor and get the book. I promise it will be simpler.

Volume I, on the cover, suggests there may be plans for a second volume. I’ll have to see how well the first volume is received before I start working on a second one (and I could use a few weekends for other activities in the meantime). There’s certainly quite a bit of more advanced MPS material that could go into another volume.