The GobyWeb FileSet Manager is a utility that helps plugins obtain the files they need as input and store the results of their computation. Running plugin jobs can call the FileSet Manager through the ${FILESET_COMMAND} environment variable.

Interacting with the FileSet Manager

How to query for input FileSets

Tasks have named input and output slots. The slots are defined in the Task config.xml file, and each slot is associated with a specific FileSet. At runtime, a Task Job needs to discover which FileSet instances were bound to its input slots. This can be achieved with the FileSet manager.

For instance, assuming that in the Task configuration the InputSchema declares an input slot named INPUT_READS and that its “type” is a FileSet configuration defining an entry named READS_FILE, a task job can:

  • query to check if the INPUT_READS slot has values (references to FileSet instances of the type defined in the schema)
  • query the referred instances to check if their READS_FILE entries have valid files.

Such queries are performed by invoking the FileSet manager with the --has-fileset option inside script.sh:

    # Check if there are input values (references to FileSet instances) associated to the INPUT_READS slot:
    ${FILESET_COMMAND} --has-fileset INPUT_READS
    RESPONSE=$?

    # Check if there are input files associated to at least one READS_FILE entry in the instances referred by the INPUT_READS slot:
    ${FILESET_COMMAND} --has-fileset INPUT_READS.READS_FILE
    RESPONSE=$?

Queries return 0 if they find valid results and 1 otherwise. These return values can be handled with the built-in function dieUponError(), which is available to all Tasks.
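The return value can also be tested directly in script.sh. The following is a minimal sketch, assuming that the option is spelled with two ASCII hyphens (--has-fileset) and that ${FILESET_COMMAND} is set by the job environment. The helper name slot_is_bound is hypothetical and is not part of GobyWeb.

```shell
# Hypothetical helper (not part of GobyWeb): succeeds when the given slot,
# or SLOT.ENTRY, has at least one match according to the FileSet manager.
slot_is_bound() {
    # FILESET_COMMAND is left unquoted on purpose, so that a multi-word
    # command line stored in the variable is split into separate words.
    # Any output is discarded; only the exit status matters here.
    ${FILESET_COMMAND} --has-fileset "$1" >/dev/null
}

# Possible use inside script.sh (commented out so the sketch has no side effects):
# if slot_is_bound INPUT_READS.READS_FILE; then
#     echo "reads are available"
# else
#     echo "no reads bound to INPUT_READS" >&2
#     exit 1
# fi
```

Wrapping the call in a function keeps the exit status available to an ordinary if statement, without an intermediate RESPONSE variable.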

Lists of parameters and/or entries can be queried at the same time, in arbitrary combinations. In such cases, the FileSet manager returns 0 only if at least one match is found for each element of the list.
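When it matters which element of a list has no match, the elements can instead be queried one at a time. The sketch below reproduces the "every element must match" rule with individual queries. It assumes the two-hyphen spelling --has-fileset, and the helper name all_slots_bound is hypothetical.

```shell
# Hypothetical helper (not part of GobyWeb): query each element on its own and
# report the first one without a match. Returns 0 only if every element matched.
all_slots_bound() {
    for element in "$@"; do
        if ! ${FILESET_COMMAND} --has-fileset "${element}" >/dev/null; then
            echo "no match for ${element}" >&2
            return 1
        fi
    done
    return 0
}

# Possible use inside script.sh:
# all_slots_bound INPUT_READS INPUT_READS.READS_FILE || exit 1
```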

How to fetch input FileSets

Tasks usually consume FileSet instances provided at submission time. To consume them, they first need to fetch the files matching the input slots, using the --fetch option in one of the following ways:

    # Fetch all the READS_FILEs stored in the FileSets defined as value of the INPUT_READS slot:
    READ_FILES_LIST=`${FILESET_COMMAND} --fetch INPUT_READS.READS_FILE`

    # Fetch all the files (regardless of their entry names) stored in the FileSets defined as value of the INPUT_READS slot:
    READ_FILES_LIST=`${FILESET_COMMAND} --fetch INPUT_READS`

In both cases, the command returns a space-separated list of absolute paths to the matching files. A list of slots can be provided if the Task is designed to consume them all at once.
If one or more of the specified input slots are not available, the fetch operation fails even if the others are available; all the specified input slots must be available for the fetch to succeed. The return value can be handled with the built-in function dieUponError(), which is available to all Tasks.
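Because the fetched paths come back as a single space-separated string, a script typically has to split that string before using the files. The sketch below shows one way to do so. It assumes the two-hyphen spelling --fetch, that a failed fetch yields a non-zero exit status, and that no returned path contains whitespace. The helper name fetch_paths is hypothetical.

```shell
# Hypothetical helper (not part of GobyWeb): fetch the files for one slot or
# SLOT.ENTRY and print one path per line. Returns non-zero if the fetch fails.
fetch_paths() {
    # The assignment takes the exit status of the command substitution,
    # so a failed fetch is propagated to the caller.
    fetched=$(${FILESET_COMMAND} --fetch "$1") || return 1
    # The list is space-separated, so word splitting is intentional here.
    # Assumption: no returned path contains whitespace.
    for file_path in ${fetched}; do
        printf '%s\n' "${file_path}"
    done
}

# Possible use inside script.sh:
# fetch_paths INPUT_READS.READS_FILE | while read -r reads; do
#     echo "processing ${reads}"
# done
```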

How to push output FileSets

When a Task produces output files (FileSet instances that need to be bound to the Task output slots), it can store them permanently in the FileSet Area so that they are available to a user or can later be consumed by another plugin. Let’s assume a Task configuration that declares an OutputSchema with a STATS output slot. Let’s also assume that the STATS slot has a FileSet type that accepts tab-delimited files with the extension .tsv (the FileSet entry is called TSV). Finally, assume that the STATS slot is defined with multiplicity 1..n, meaning that the slot can accept one or more files of type TSV. Then, using the FileSet manager, such a Task can bind its output file(s) to the STATS slot in several ways:

    # Bind all the TSV file results matching the pattern to the STATS slot:
    REGISTERED_TAGS=`${FILESET_COMMAND} --push STATS: *.tsv`

    # Bind a single file to the STATS output slot:
    REGISTERED_TAGS=`${FILESET_COMMAND} --push STATS: out.tsv`

    # Try to bind any .tsv file to the output slots of the task. This will only succeed if the assignment is not ambiguous (i.e., there is no other slot in the task that can accept files with .tsv extension):
    REGISTERED_TAGS=`${FILESET_COMMAND} --push *.tsv`

    # The same as above, but with a specific tsv file:
    REGISTERED_TAGS=`${FILESET_COMMAND} --push out.tsv`

The push operation returns a list of space-separated tags, one for each FileSet instance registered with the input files. For example, if *.tsv matches 4 files and the STATS output slot refers to a configuration with one TSV file, 4 instances of that configuration are created and their 4 tags are returned. If the push operation fails, it returns an error exit code that can be handled with the built-in function dieUponError().
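One detail is worth guarding against: when a pattern such as *.tsv matches nothing, the shell passes the literal pattern to the command. The sketch below checks for that case before pushing and then counts the returned tags. It assumes the two-hyphen spelling --push and the "SLOT: files" argument form shown above. The helper name push_to_slot is hypothetical.

```shell
# Hypothetical helper (not part of GobyWeb): push the given files to a slot
# and print the number of tags returned. Returns non-zero if the push fails
# or if the first argument is not an existing file (e.g. an unmatched *.tsv).
push_to_slot() {
    slot=$1
    shift
    # An unmatched glob is passed through literally, so check the first file.
    [ -e "$1" ] || return 1
    tags=$(${FILESET_COMMAND} --push "${slot}:" "$@") || return 1
    # Count the space-separated tags by intentional word splitting.
    set -- ${tags}
    echo "$#"
}

# Possible use inside script.sh:
# TAG_COUNT=$(push_to_slot STATS *.tsv) || exit 1
```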