We are continuing to explore the use of Language Workbench Technology to help with data analysis. In this installment, we look at workflows: analysis pipelines that consist of several computational steps, which are often complex and time-consuming and need to run in parallel on a grid or cluster. Our lab often develops custom workflows to transform data. Several tools have been developed to help with this task. At the start of the project, we evaluated three such tools: Big Data Script, Swift and Nextflow. For various reasons uncovered during our evaluation (which will be presented elsewhere), we picked Nextflow as the most useful abstraction for building workflows.

Jason P. Kurs joined our lab over the summer of 2015 for an internship. He was tasked with developing an MPS version of Nextflow that we could use to write pipelines. The results of his efforts are now available as the Nextflow Workbench. Here’s a snapshot of a very simple workflow built with this workbench:

[Image: WorkflowIllustration-1]

The workflow refers to two Processes called splitSequence and reverse, which are defined as follows:

[Image: ProcessSplitSequence]

[Image: ProcessReverse]

As a first approximation, you can think of the Nextflow Workbench as an integrated development environment (IDE) for Nextflow. However, the workbench exposes a language that differs somewhat from Nextflow. In some places, we simplified the language to make it more consistent and easier to learn (see the documentation). In others, we added features that we felt were important but are not easy to achieve with plain Nextflow. For instance, the Workbench makes it possible to reuse Process definitions in several workflows without having to copy, paste and rename channels. Another extension is explicit data types, which we think help in developing and maintaining sound pipelines. Despite these simplifications and extensions, the Workbench produces plain Nextflow scripts. Indeed, the workflow and processes shown above generate the following script:

#!/usr/bin/env nextflow
import workflow1.workflowBash_Methods;
_fastafile = [file("/Users/fac2003/MPSProjects/git/NextflowWorkbench/data/sample1.fasta")].channel()
process splitSequence {
input:
file 'input' from _fastafile
output:
file 'seq_*' into _splitfile
shell:
 '''
awk '/^>/{f="seq_"++d} {print > f}' < input
 '''
}
process reverse {
input:
file record from _splitfile
output:
file 'finaleres.txt' into _result
shell:
 '''
cat !{record}| rev >>finaleres.txt
 '''
}
_result.subscribe{ c ->
workflowBash_Methods.reportAbout_result(c.toFile());}
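
To make the data transformation in this script easier to follow, here is a minimal Python sketch of what the two shell commands do to the data: the awk command starts a new record at every line beginning with ">", and rev reverses each line character by character. The function names and the sample input are hypothetical and chosen only for illustration. The sketch does not model how Nextflow stages files, passes them through channels, or runs processes in parallel.

```python
def split_records(fasta_text):
    """Group lines into records, starting a new record at each '>' header.

    Mirrors the awk command in splitSequence, except that lines appearing
    before the first header are skipped here (an assumption of this sketch).
    """
    records = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            records.append([line])
        elif records:
            records[-1].append(line)
    return records


def reverse_lines(record):
    """Reverse each line character by character, as the rev command does."""
    return [line[::-1] for line in record]


# Hypothetical two-record input, standing in for sample1.fasta.
sample = ">seq1\nACGT\nTTGA\n>seq2\nGGCC\n"

for number, record in enumerate(split_records(sample), start=1):
    # Records are numbered like the seq_1, seq_2, ... files written by awk.
    print(f"seq_{number}:", reverse_lines(record))
```

Note that rev reverses the header line as well as the sequence lines, so this sketch does the same.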