Nick Loman (@pathogenomenick) has recently collected data from bioinformaticians with different levels of experience about what they considered most challenging in their work (“What things most frustrate you or limit your ability to carry out bioinformatics analysis?”). Many respondents noted the practical difficulty they experience when installing bioinformatics software. In fact, out of 261 responses, about 10% list important difficulties with software installation, generally related to software dependencies. For instance, a few of the answers are reproduced here:

  • “Installation of open source packages and their dependencies is often a nightmare”
  • “Dependency hell when installing tools, lack of method description, hardly neutral/unbiased tool benchmarks”
  • “Tools that come with a million dependencies”

(You can see the full set of responses here.)

In the lab, we are of course faced with similar challenges. New bioinformatics software (or new versions of existing tools) keeps popping up, and we need to install it before we can use it to process data. Sometimes, installing the software and its dependencies takes as long as trying the tool and figuring out that its performance won’t scale (this was common about 5 years ago). More often, installation is not a one-time thing: with today’s datasets, you need each tool and its dependencies installed on every node of the cluster where your analyses will run.

That’s a problem because:

  1. You are not a sys admin whose job it is to install software. You care about using the tool, not at all about the details of its installation.
  2. Some tools are used with indices built from large amounts of data (e.g., genome indices used when aligning short reads). Indices are often not portable from one platform to another, or in some cases even on the same machine from one version of the tool to the next.
  3. There is often one index per species, and often several indices per species (indices over genomes, but also over transcriptomes). If you work with multiple species, you could create an index for each species, but local storage on the cluster nodes will quickly evaporate if everybody does the same thing.

In summary, bioinformaticians need a system to install software, data and indices on the compute nodes where they will run analyses.

To make matters worse, it is not particularly convenient to develop analysis code directly on a cluster, so many in our field use desktop or laptop computers to develop analysis code. Unless you are willing to go with Linux on the desktop, you are de facto juggling two environments:

  • the desktop where you develop scripts and workflows (the development environment) and
  • the nodes of the cluster where you run these scripts over large datasets (the production environment).

Ideally, you would want a system that will mirror the production environment in your development environment, so you don’t have to address environment differences when you thought the script was ready to go.

We built such a system in NextflowWorkbench 1.2. This workbench comes with a resource installation system that lets you specify that you need a resource. You do this in a purely declarative way:

ResourceDeclarations

The blue box in the snapshot above shows how you would declare that you need an index built with Kallisto over the human transcriptome (GRCh38 build, Ensembl version 82). Declaring this dependency will cause:

  1. installation of Kallisto binaries
  2. download of the Ensembl transcript sequences for the specified build
  3. indexing of the transcriptome with Kallisto
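
For comparison, here is a rough sketch of what steps 2 and 3 would look like if you did them by hand. It only builds and prints the command lines; it does not run them. The function name, the index file name and the Ensembl FTP URL layout are assumptions made for illustration, not what the workbench does internally. Step 1 (installing the Kallisto binaries) is left out because it depends on the platform.

```python
import shlex


def kallisto_index_commands(species="homo_sapiens", build="GRCh38", release=82):
    """Return the commands for steps 2 and 3 as argument lists (illustrative only)."""
    fasta = f"{species.capitalize()}.{build}.cdna.all.fa.gz"
    # Assumed layout of the Ensembl FTP site; check it before relying on it.
    url = f"ftp://ftp.ensembl.org/pub/release-{release}/fasta/{species}/cdna/{fasta}"
    # Hypothetical index file name that records species, build and release.
    index = f"{species}-{build}-{release}.kallisto.idx"
    return [
        ["curl", "-O", url],                        # step 2: download transcripts
        ["kallisto", "index", "-i", index, fasta],  # step 3: build the index
    ]


for command in kallisto_index_commands():
    print(shlex.join(command))
```

Every node that runs the analysis would need the same commands repeated, which is the chore the declarative form is meant to remove.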

This happens transparently because the resource called KALLISTO_INDEX has a dependency on the ENSEMBL_TRANSCRIPT resource, which requires FETCH_URL to download the FASTA file, which requires the BASH_LIBRARY that provides functions to work with patterns in URLs.

KallistoIndexDeps
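
To make the idea of a dependency chain concrete, here is a minimal sketch of how an installer could work out the installation order from such declarations. The resource names come from the chain described above, but the dictionary layout and the install_order function are hypothetical; this is a plain depth-first traversal, not the actual implementation in the workbench.

```python
# Hypothetical dependency table mirroring the chain described in the text.
DEPENDS_ON = {
    "KALLISTO_INDEX": ["ENSEMBL_TRANSCRIPT"],
    "ENSEMBL_TRANSCRIPT": ["FETCH_URL"],
    "FETCH_URL": ["BASH_LIBRARY"],
    "BASH_LIBRARY": [],
}


def install_order(resource, depends_on):
    """Return resources in an order where each one follows its dependencies."""
    order = []
    visiting = set()  # resources on the current path, used to detect cycles

    def visit(name):
        if name in order:
            return  # already scheduled
        if name in visiting:
            raise ValueError(f"dependency cycle at {name}")
        visiting.add(name)
        for dependency in depends_on[name]:
            visit(dependency)
        visiting.discard(name)
        order.append(name)  # all dependencies are scheduled, so this is safe

    visit(resource)
    return order


print(install_order("KALLISTO_INDEX", DEPENDS_ON))
# ['BASH_LIBRARY', 'FETCH_URL', 'ENSEMBL_TRANSCRIPT', 'KALLISTO_INDEX']
```

Asking for KALLISTO_INDEX alone is enough: everything it depends on is scheduled first.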

But you don’t need to know about these details. You should only care that you need an index and that this index must be built before the process can start on a node. This is exactly what the require resource clause lets you indicate. The runtime installation is automated and performed using software provided in a Docker container. This Docker container runs on your development machine and in the production environment.

The ${artifact path KALLISTO_INDEX.INDEX} syntax is used inside the script to locate the files installed by the resource installation system. This language construct provides auto-completion for the files installed by the resource. You need to run the script once to perform the installation, and then auto-completion can list the files of the resource (this is similar to auto-completion inside the Docker container presented in the previous post).

dockerLogin