Nick Loman (@pathogenomenick) recently collected data from bioinformaticians with different levels of experience about what they consider most challenging in their work (“What things most frustrate you or limit your ability to carry out bioinformatics analysis?”). Many respondents noted the practical difficulty they experience when installing bioinformatics software. In fact, out of 261 responses, about 10% cite serious difficulties with software installation, generally related to software dependencies. A few of the answers are reproduced here:

  • “Installation of open source packages and their dependencies is often a nightmare”
  • “Dependency hell when installing tools, lack of method description, hardly neutral/unbiased tool benchmarks”
  • “Tools that come with a million dependencies”
  • (You can see the full set of responses here.)

In the lab, we are of course faced with similar challenges. New bioinformatics tools (or new versions of existing ones) keep popping up, and we need to install them before we can use them to process data. Sometimes, installing the software and its dependencies takes as long as trying the tool and discovering that its performance won’t scale (this was common about five years ago). More often, installation is not a one-time problem, because each tool really needs to be installed on every node where you run computations. With today’s datasets, this means installing each tool and its dependencies on every node of the cluster where your analyses will run.

That’s a problem because:

  1. You are not a sys admin whose job it is to install software. You care about using the tool, not at all about the details of its installation.
  2. Some tools are used with indices built from large amounts of data (e.g., genome indices used when aligning short reads). Indices are often not portable from one platform to another, or even, on the same machine, from one version of the tool to the next.
  3. There is often at least one index per species, and frequently several (an index over the genome, but also indices over the transcriptome). If you work with multiple species, you could create an index for each species, but local storage on the cluster nodes will quickly evaporate if everybody does the same thing.

In summary, bioinformaticians need a system to install software, data and indices on the compute nodes where they will run analyses.

To make matters worse, it is not particularly convenient to develop analysis code directly on a cluster, so many in our field develop analysis code on desktop or laptop computers. Unless you are willing to go with Linux on the desktop, you are de facto juggling two environments:

  • the desktop where you develop scripts and workflows (the development environment) and
  • the nodes of the cluster where you run these scripts over large datasets (the production environment).

Ideally, you would want a system that mirrors the production environment in your development environment, so you don’t have to chase down environment differences after you thought the script was ready to go.

We built such a system in NextflowWorkbench 1.2. This workbench comes with a resource installation system that lets you state which resources you need. You do this in a purely declarative way:

[Figure: resource declarations]

The blue box in the snapshot above shows how you would declare that you need an index built with Kallisto over the human transcriptome (GRCh38 build, Ensembl release 82). Declaring this dependency will cause:

  1. installation of Kallisto binaries
  2. download of the Ensembl transcript sequences for the specified build
  3. indexing of the transcriptome with Kallisto

This happens transparently because the resource called KALLISTO_INDEX has a dependency on the ENSEMBL_TRANSCRIPT resource, which requires FETCH_URL to download the Fasta file, which requires the BASH_LIBRARY that provides functions to work with patterns in URLs.

[Figure: KALLISTO_INDEX resource dependencies]

But you don’t need to know about these details. You should only care that you need an index and that this index must be built before the process can start on a node. This is exactly what the require resource clause lets you indicate. The runtime installation is automated and performed using software provided in a docker container. This docker container runs both on your development machine and in the production environment.

The ${artifact path KALLISTO_INDEX.INDEX} syntax is used inside the script to locate the files installed by the resource installation system. This language construct provides auto-completion for the files installed by the resource (you need to run the script once to perform the installation; auto-completion can then list the files of the resource). This is similar to the auto-completion inside a docker container presented in the previous post.

[Figure: docker login]

We have released version 1.2.0 of the NextflowWorkbench. This is a major release that includes:

  • Languages that provide a docker IDE,
  • Languages to easily install dozens of bioinformatics tools useful for high-throughput sequence data analysis.

This post gives an overview of the docker IDE features. We have added two new chapters to the documentation booklet [Tablet, PDF]. Please see the updated documentation for details.

Docker IDE

Nextflow supports docker containers, but suitable Docker images must be constructed to customize the tools available to a workflow, and if you are new to docker, its command line interface is not particularly user-friendly. The workbench now offers a language to configure and build images from Dockerfiles:

[Figure: Dockerfile language in the workbench]

In the snapshot above, you can see this language used to describe how a centos base image should be created. The light bulb provides intentions to merge two successive RUN commands. (You can of course customize intentions to support your workflows, since this language is fully extensible in MPS.) Clicking on the Build image button will instruct docker to construct the image. You can follow the build process directly within MPS (in the info message tab). When the process succeeds, an image root node is created:

[Figure: image root node]

This node links the image back to the Dockerfile that was used to create it. The image is added to the list of images built with the Dockerfile (shown at the top of the previous figure). Using the image root node, you can tag the image, push it to a docker registry, or run a container.

Once the image is ready, or available in a Docker registry (e.g., Docker Hub), you can use it to execute a BASH script or a Nextflow Process:

[Figure: running a BASH script inside a container]
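
If you want to see the equivalent outside the workbench, here is a minimal plain-Nextflow sketch of running a process inside a Docker container. The image name, file pattern and channel names are illustrative only, and Docker support must be enabled in nextflow.config (docker.enabled = true):

#!/usr/bin/env nextflow
// Minimal sketch: run a process inside a Docker container.
// 'centos:7' stands in for an image you built or pulled beforehand.
fastaFiles = Channel.fromPath('data/*.fasta')

process countRecords {
    container 'centos:7'

    input:
    file fasta from fastaFiles

    output:
    stdout into counts

    """
    grep -c '^>' ${fasta}
    """
}

counts.subscribe { println "records: ${it.trim()}" }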

Beyond running scripts, the container can be started for interactive use. This makes it possible to interactively peek at the filesystem in the container:

[Figure: visualizing files in the running container]


This makes it a bit easier to figure out where files are located inside the container filesystem. The next post will describe the features of release 1.2 that are of interest to bioinformaticians.

We have released version 1.1 of the NextflowWorkbench. This major release follows the previous bug fix releases 1.0.1 and 1.0.2. We have continued adding features to the IDE to match the latest version of Nextflow. This post describes key new features:

Support for Closures

Many functions offered by Nextflow accept a closure as a parameter. For instance, the view function accepts an optional closure that determines what should be printed to the console when the channel on the left side of view emits a value. Such a closure is shown in the following code fragment:

[Figure: view closure example]

You can add a closure to the view function by pressing return over the <no closure> text. New closures are created according to the type of the argument and the function where they are used; in this case, the closure inside the view function is created with an int argument called numsCopy. You just need to enter the body of the closure, without having to remember the specific syntax for closure definitions.
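
For reference, the same idea in a plain Nextflow script looks like the sketch below; the channel content and message text are made up for the example:

// Sketch only: the closure passed to view decides what is printed
// for each value emitted by the channel on the left of view.
numsCopy = Channel.from(1, 2, 3)
numsCopy.view { int num -> "value emitted: ${num}" }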

Support for Tuples

We have improved support for tuples: you can now use them inside closures and refer to their elements in a natural way. For instance:

[Figure: improved support for tuples]
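
In plain Nextflow, this corresponds to Groovy’s ability to bind the elements of a tuple to named closure parameters; the values below are made up:

// Sketch only: each element of the emitted tuple is bound to its own
// closure parameter, so it can be used directly in the body.
pairs = Channel.from( [1, 'ACGT'], [2, 'TTGA'], [3, 'GGCC'] )
pairs.map { id, seq -> "sample ${id}: ${seq.reverse()}" }
     .view()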

Support for Constants

Constants are a handy way to define a value that will be used across several process executions. To define a constant at the top of a workflow, define a new input name and remove the right-hand part (the list definition : []). Type constant instead and proceed to define the value of the constant. For instance, the following defines a channel with three values, and one constant:

[Figure: workflow with constants]

The AcceptTwoInts process will be executed three times because the constant value does not affect the number of times a process can run.
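
Conceptually, this corresponds to something like the following plain-Nextflow sketch, where the constant is a single value read by every process execution; the names and values are illustrative:

// Sketch only: 'limit' is one value shared by all executions, while
// 'numbers' drives three separate executions of the process.
numbers = Channel.from(1, 2, 3)
limit   = 10

process AcceptTwoInts {
    input:
    val num from numbers
    val max from limit

    output:
    stdout into results

    """
    echo "num=${num} max=${max}"
    """
}

results.subscribe { print it }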

Improved Workflow Run Configuration

We have extended run configurations (the dialog you see when you request the execution of a workflow) to offer command line parameters and a resume execution option.

[Figure: run configuration features in 1.1]

The resume option causes Nextflow to use cached values for process execution steps that succeeded in prior executions. Using command line parameters is described in the next section.

Support for Workflow Parameters

This release supports workflow parameters. Here’s an example:

[Figure: workflow parameters example]

Notice that you can provide a default value for the parameter. Use --alpha on the command line when running the workflow to set the alpha parameter to a value other than the default.
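
In a plain Nextflow script, this corresponds to the standard params mechanism; the parameter name alpha comes from the example above and the default value is made up:

// Sketch only: params.alpha has a default here and can be overridden
// on the command line, e.g.  nextflow run workflow.nf --alpha 0.05
params.alpha = 0.01

println "running with alpha = ${params.alpha}"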

Support for most Nextflow functions

This release supports most functions offered by Nextflow, whether they take closures, regular expressions, or plain arguments. You can find many examples of function uses on GitHub, which we have created as test cases for these language constructs:

[Figure: closure workflow examples]
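
The following lines give a flavor of the kinds of calls these test cases cover, combining a plain operator call, a regular expression and a closure; the channel content is made up:

// Sketch only: unique() is a plain function call, filter() takes a
// regular expression, and map() takes a closure.
Channel.from('a1', 'a2', 'b1', 'a1')
       .unique()
       .filter( ~/a.*/ )
       .map { it.toUpperCase() }
       .view()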

We are continuing to explore the use of Language Workbench Technology to help with Data Analysis. In this installment, we looked at workflows, that is, analysis pipelines that consist of several computational steps, which are often complex and time consuming, and need to run in parallel on a grid or cluster. Our lab often develops custom workflows to transform data. Several tools have been developed to help with this task. At the start of the project, we evaluated three such tools: Big Data Script, Swift and Nextflow. For various reasons uncovered during our evaluation (which will be presented elsewhere), we picked Nextflow as the most useful abstraction to help us build workflows.

Jason P. Kurs joined our lab over the summer of 2015 for an internship. He was tasked with developing an MPS version of Nextflow that we could use to write pipelines. The results of his efforts are now available as the Nextflow Workbench. Here’s a snapshot of a very simple workflow built with this workbench:

[Figure: a simple workflow built with the Nextflow Workbench]

The workflow refers to two Processes called splitSequence and reverse, which are defined as follows:

[Figure: the splitSequence and reverse processes]

As a first approximation, you could think of the Nextflow Workbench as an integrated development environment (IDE) for Nextflow. However, the workbench exposes a language that is a bit different from Nextflow: we aimed to simplify the language and make it more consistent and easier to learn (see the documentation). In other instances, we added features that we felt were important but not so easy to do with plain Nextflow. For instance, the Workbench makes it possible to reuse Process definitions in several workflows, without having to copy, paste and rename channels. Another extension is explicit data types, which we think help develop and maintain sound pipelines. Despite these simplifications and extensions, the Workbench produces plain Nextflow scripts. Indeed, the above workflow and processes will generate the following script:

#!/usr/bin/env nextflow
import workflow1.workflowBash_Methods;
_fastafile = [file("/Users/fac2003/MPSProjects/git/NextflowWorkbench/data/sample1.fasta")].channel()
process splitSequence {
input:
file 'input' from _fastafile
output:
file 'seq_*' into _splitfile
shell:
 '''
awk '/^>/{f="seq_"++d} {print > f}' < input
 '''
}
process reverse {
input:
file record from _splitfile
output:
file 'finaleres.txt' into _result
shell:
 '''
cat !{record}| rev >>finaleres.txt
 '''
}
_result.subscribe{ c ->
workflowBash_Methods.reportAbout_result(c.toFile());}

We have released MetaR 1.4.0. This release offers substantial new features, including:

  1. New Biomart language for MetaR. The query biomart statement provides a convenient way to query Biomart to retrieve data right from MetaR. The statement offers auto-completion for database and dataset names, as well as attributes and filters. The following snapshot illustrates retrieving information for a pre-defined gene list. The query biomart statement can also retrieve information for identifiers present in tables produced during analysis.

[Figure: query biomart statement in the MetaR editor]

  2. Exposing R functions (from any package) as stubs and using them in analyses. This feature makes it possible to take advantage of any R package available on CRAN or Bioconductor and to use functions defined in these packages inside MetaR. See the documentation for details.

[Figure: R function stubs (pheatmap example)]

  3. Ability to refer to columns from imported Tables when invoking R functions. When using R functions inside MetaR, this feature lets you refer to columns visible in the current MetaR Analysis and use them as function arguments.
  4. Documentation: we have written new chapters describing simulating datasets, using Biomart, and using functions in MetaR (the documentation booklet now has 108 pages in the tablet edition).

You can obtain MetaR 1.4.0 from the JetBrains MPS plugin repository when it clears moderation, or directly as a ZIP file from our download plugins page. Note that, starting with 1.4.0, MetaR requires the ANTLR plugin, which it uses to support pasting R code when using R functions.

We have released MetaR 1.3.1. This version is a major upgrade from MetaR 1.2. Here’s a list of improvements in this version:

  • MetaR is now compatible with MPS 3.2. You will find that the editor is more responsive than in MPS 3.1.
  • Installation and migration instructions have been updated: you can now directly install MetaR from the Jetbrains Plugin Repository (no need for Cygwin installation on Windows, users get notifications for new versions when they open MPS).
  • Support for Limma Voom (for RNA-Seq data analysis with the popular Limma package).
  • Support for continuous covariates in linear models (EdgeR and Limma).
  • Table Viewer Tool to view data in intermediate tables.
  • Support for Venn Diagrams.
  • Docker integration, to run analyses in a controlled environment and prevent any possibility of failure during R package installations (this also removes the need to install R and XQuartz separately). We currently support this feature for Mac OS and Linux. Windows support will be offered when Kitematic is ported to Windows (expected sometime in June 2015).
  • A revised documentation booklet describes these new features.

[Figure: MetaR documentation booklet, “What’s new” section]

We will hold a special training session to help you upgrade safely to MetaR 1.3.1 and MPS 3.2. The session will describe the new 1.3.1 features and explain how you can use the Git source control system with MetaR to make backups of your analyses and/or share them with others (see the online registration form). The first session will be April 17, 2:00-3:30 PM.


The MPS Language Workbench: Volume II. The second volume of the series explains how to customize the MPS platform to better integrate it with the needs of your languages. This volume continues where volume I ended and discusses more advanced features of the MPS platform.

The second volume of the MPS book series is on the final approach. For this second volume, I am taking a different approach to releasing the eBook. Last year, I released the complete book as an ebook on March 21st. This year, I am opting for an iterative release schedule: I will release the first chapters mid-April and will then continue to release new chapters every couple of weeks, as additional chapters are finalized. This will not only give me more time to finalize the complete book, but will also, I hope, give readers more time to provide feedback on the ebook before a print edition is finalized.

Because the ebook will not be complete when it first goes on sale, I am also setting the initial list price to $20. The list price will increase as more chapters are released, to reach the final list price of $50 for the ebook. Note that buying the ebook at any time in its release cycle will give you access to all future updates, including any future editions of the second volume. If you have purchased the first volume and found it useful, I encourage you to lock in the $20 pre-order price for the ebook of the second volume. You will get the second volume at a significant discount to the final list price.

Note that the iterative release is only available from Google Play. The final ebook will be released both on Google Play and Amazon KDP. The main reasons for releasing iteratively on Google Play are that (1) updating the eBook is much easier on my end with Google Play, and (2) I have received reports that the ebook is not easy to search on a desktop computer with the Amazon KDP platform.


We have released MetaR 1.2.0. This version provides redesigned styles. Here’s what styles look like for a heatmap:

[Figure: heatmap style]

Styles can extend other styles, which makes it possible to modularize appearance. This is useful if you need to build collections of plots that share most style attributes but differ in, say, the title or the X or Y variable: you define the common attributes in one style, and extend this style each time you need to specialize the appearance of a plot.

We used the new styles to make it possible to customize the color palette for heatmaps. See the Change Log for other changes.

If you have models built with previous versions of MetaR, you will need to apply the migration called “Meta R: Migrate Styles” (see Migration guide) to transform previous styles into the new style constructs.

We have released MetaR 1.1.6. This version contains several bug fixes:

  • We have streamlined the process for running MetaR scripts, so that you should no longer need to set the R mirror and library folder manually in R the first time you try to execute a MetaR analysis.
  • We fixed the undo/redo feature for Analysis nodes (the problem was introduced in release 1.1.4).
  • Any reference to a table now shows the columns and groups of the referenced table.

Starting with this release, we are distributing a documentation booklet that explains how to use MetaR. The booklet can be downloaded as a PDF file (see the Documentation section), or obtained on Google Play as the tablet edition.

[Figure: MetaR documentation booklet]


We have released MetaR 1.1.5. This release includes polish and fixes for minor problems we encountered during training sessions. A couple of noteworthy features are included:

  • Execution of analyses now produces output that links back to the statement in the editor. For instance, when you run the workshop tutorial, you will see the following in the run console:

[Figure: run console for the edgeR_diff_exp analysis, with links back to the editor]


Clicking on one of these blue links will highlight the statement that was executed in the editor.

  • Improved scoping for Tables, Columns and Group Usages: visible nodes are now better restricted to the current model and context.
  • Scripts now take advantage of R’s tryCatch capabilities to better link warnings and errors to the source Analysis. You will also see a hyperlink immediately before the warning or error message produced by the R code.
  • Statements that write to a file (render and write table) have been refactored for a more consistent user experience. This refactoring means that you should apply a migration if you developed analyses with a previous version of MetaR. Migrations are completely automatic, but you need to invoke them manually. In this case:
  1. Right-click on the model that you want to migrate.
  2. Select Scripts > Enhancements > MetaR: Migrate Output Filenames.
  3. A preview of the nodes affected by the migration will be displayed.
  4. Click ‘Apply Migrations’.

You can also invoke migrations one node at a time if you wish to do so. Nodes that require migration will show a special kind of intention. Running the “MetaR: Migrate Output Filenames” intention will migrate the node.

See the detailed Change Log for a complete list of changes included in MetaR 1.1.5.