Starting with release 1.1, BDVAL supports efficient evaluation of biomarker discovery approaches by leveraging the Oracle Grid Engine (SGE, formerly Sun Grid Engine). While BDVAL uses multiple threads to process data in parallel, different endpoints and methods are evaluated serially, one after another. This page describes how to configure BDVAL so that these serial tasks can be spread efficiently across multiple nodes using SGE. Some familiarity with SGE is assumed.

Overview

A typical run of BDVAL may span many endpoint datasets and almost always several feature selection strategies. This often results in many (a few hundred or more) combinations. The SGE setup takes each of these individual combinations and runs it separately on an SGE node. Each one uses its own configuration file, which is generated automatically, once, when the job is set up. Once all the “subjobs” are complete, the results, including final models, are collected and ready for final processing and analysis.

Configuration and setup

Until you are ready to submit your jobs to the grid, configuring BDVAL is no different from what is described in the BDVAL User Guide. Once you have configured BDVAL normally and decided which endpoints and methods to run, the following parameters are needed to configure a BDVAL job for SGE.

Required settings

sge-job-name
The name for the SGE jobs. The job submission scripts and default result directory will use this as a prefix.
sge-job-tag
This is used to set the “save tag” used by BDVAL. When running BDVAL this usually defaults to the date and time of the run. The date and time is not desirable in this case since all the nodes are not likely to start at the same time.
sge-job-description
This sets the description that gets saved along with all the runs.
target
The ant target to run on the cluster. By convention, the target “sge-job” is typically used.

Optional settings

number-of-threads
The number of threads BDVAL will use per node (default: 8)
jvm-max-memory
The amount of memory that will be allocated to the JVM (default: 16g)
sge-max-memory
The total amount of memory allocated “per node” on SGE (default: 20g). Note that this must be larger than the amount allocated to the JVM, to accommodate the Rserve instances used per node (1 per thread) plus any job overhead.
output-directory
Location to place the job submission files (default: the current working directory, i.e., “.”)
rserve-start-port
First port to start Rserve instances on each node. (default: 6311)

These settings, along with any endpoints and methods, should be set in the BDVAL ant script in the target called “setup-sge-job”.
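The memory constraint described above (sge-max-memory must exceed jvm-max-memory) is easy to get wrong when overriding the defaults. The following is a minimal sketch of a pre-flight check, not part of BDVAL: the helper names `to_mb` and `check_memory` are hypothetical, and the sketch assumes sizes are written with a `g` or `m` suffix as in the defaults listed above.

```shell
#!/bin/sh
# Sketch only: to_mb and check_memory are hypothetical helpers, not BDVAL tools.

# Convert a size such as "16g" or "512m" to megabytes.
to_mb() {
    case "$1" in
        *g|*G) echo $(( ${1%[gG]} * 1024 )) ;;
        *m|*M) echo $(( ${1%[mM]} )) ;;
        *) echo "unsupported size: $1" >&2; return 1 ;;
    esac
}

# Succeed only if sge-max-memory ($2) is strictly larger than jvm-max-memory ($1).
check_memory() {
    jvm_mb=$(to_mb "$1") || return 1
    sge_mb=$(to_mb "$2") || return 1
    [ "$sge_mb" -gt "$jvm_mb" ]
}

# Check the documented defaults: jvm-max-memory=16g, sge-max-memory=20g.
if check_memory 16g 20g; then
    echo "ok: sge-max-memory leaves headroom for Rserve and job overhead"
else
    echo "sge-max-memory must be larger than jvm-max-memory" >&2
fi
```

How much headroom is actually needed depends on the number of threads (one Rserve instance each) and on the datasets, so the check only enforces the strict inequality stated above.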

A full example using the maqcii-c.xml script (available as part of the BDVAL source distribution) follows.

MAQCII-C

Here we describe how to set up a job using the maqcii-c data and configurations. This section assumes that bdval.jar, maqcii-c.xml and the other configuration files have been extracted into BDV_INSTALL_DIR and that the datasets needed for the run are available as described in the BDVAL User Manual. It also assumes that BDVAL (i.e., the files in BDV_INSTALL_DIR), any prerequisite software (e.g., Java, Ant, R) and all datasets are accessible from both the submission and execution nodes.

We’ll name this job “maqcii-sge”. As shown below, all the maqcii-c endpoints are configured to run along with the methods “baseline”, “naive-bayes”, “logit-boost”, etc. At this point the target called “setup-sge-job” in BDV_INSTALL_DIR/data/maqcii-c.xml should look something like:

   <target name="setup-sge-job" description="setup a sge job to run on a sge cluster">
       <property name="sge-job-name" value="maqcii-sge"/>
       <property name="sge-job-tag" value="maqcii-sge"/>
       <property name="sge-job-description" value="BDVAL on a SGE cluster"/>
       <property name="do.HamnerWithControl" value="true"/>
       <property name="do.Iconix" value="true"/>
       <property name="do.NIEHS" value="true"/>
       <property name="do.MDACC_PCR" value="true"/>
       <property name="do.MDACC_ERPOS" value="true"/>
       <property name="do.Cologne_OS_MO" value="true"/>
       <property name="do.Cologne_EFS_MO" value="true"/>
       <property name="do.UAMS_EFS_MO" value="true"/>
       <property name="do.UAMS_OS_MO" value="true"/>
       <property name="do.UAMS_CPR1" value="true"/>
       <property name="do.UAMS_CPS1" value="true"/>
       <property name="do.Cologne_NEP_R" value="true"/>
       <property name="do.Cologne_NEP_S" value="true"/>
       <property name="do.GSE8402_FusionYesNo" value="true"/>
       <property name="use-feature-selection-fold=true" value="true"/>
       <property name="use-feature-selection-fold=false" value="true"/>
       <property name="do.baseline" value="true"/>
       <property name="do.naive-bayes" value="true"/>
       <property name="do.logit-boost" value="true"/>
       <property name="do.logistic" value="true"/>
       <property name="do.random-forest" value="true"/>
       <property name="do.k-star" value="true"/>
       <property name="do.foldchange-genetic-algorithm" value="true"/>
       <property name="do.foldchange-svmglobal" value="true"/>
       <property name="do.foldchange-svmiterative" value="true"/>
       <property name="do.full-genetic-algorithm" value="true"/>
       <property name="do.genelist-genetic-algorithm" value="true"/>
       <property name="do.genelist-svmglobal" value="true"/>
       <property name="do.minmax-svmglobal" value="true"/>
       <property name="do.pathways-ttest-svmglobal" value="true"/>
       <property name="do.pathways.baseline" value="true"/>
       <property name="do.svmiterative" value="true"/>
       <property name="do.ttest-genetic-algorithm" value="true"/>
       <property name="do.ttest-svmglobal" value="true"/>
       <property name="do.ttest-svmiterative" value="true"/>
       <tempfile property="save-data-dir" destdir="${java.io.tmpdir}"/>
       <setup-sge-job target="sge-job" job-name="${sge-job-name}"
                      tag="${sge-job-tag}" tag-description="${sge-job-description}"
                      number-of-threads="8" jvm-max-memory="16g" sge-max-memory="20g"/>
   </target>

Now that the job parameters have been defined, the job splits and submission scripts must be created. This is done by executing the following command from the BDV_INSTALL_DIR/data directory:

 cd BDV_INSTALL_DIR/data
 ant -f maqcii-c.xml setup-sge-job

After you type this command, you will see:

 Buildfile: maqcii-c.xml
   [echo] Configuration execution for a server machine.
   [version-info] version for 'file:bdval.jar' is: release bdval_1.1
 setup-sge-job:
    [echo] Running process-splits-all with the following parameters...
    [echo] -------------------------------------------------------
    [echo] endpoint-name=HamnerWithControl
    [echo] dataset-name=HamnerWithControl
    [echo] dataset-root=/home/marko/maqcii-c
    [echo] platform=/home/marko/maqcii-c/platforms/GPL1261_platform.soft.gz
    ...
    [echo] Writing file: /home/marko/bdval_1.1/data/maqcii-sge/maqcii-sge-531.properties
    [echo] Writing file: /home/marko/bdval_1.1/data/maqcii-sge/maqcii-sge-532.properties
    [echo] Wrote a total of 532 files
    [copy] Copying 1 file to /home/marko/bdval_1.1/data
 BUILD SUCCESSFUL
 Total time: 32 seconds

This command creates an SGE submission script called BDV_INSTALL_DIR/data/maqcii-sge.qsub and a directory BDV_INSTALL_DIR/data/maqcii-sge/ (note that the names of the script and the directory match the value of the “sge-job-name” parameter). The directory contains one configuration file for each combination of the endpoints and methods selected. In this case, a total of 532 individual SGE tasks will be executed. This directory must be visible to the execution nodes.
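Before submitting, it can be worth confirming that the number of generated configuration files matches the count reported by the setup step. The following is a minimal sketch, not part of BDVAL; `count_subjobs` is a hypothetical helper, and the file naming pattern (job name, a dash, the subjob number, then `.properties`) is taken from the output shown above.

```shell
#!/bin/sh
# Sketch only: count_subjobs is a hypothetical helper, not a BDVAL tool.

# Count the per-subjob configuration files for a job.
#   $1 = job directory, $2 = job name (the sge-job-name value)
count_subjobs() {
    count=0
    for f in "$1"/"$2"-*.properties; do
        # The guard skips the unexpanded pattern when nothing matches.
        if [ -e "$f" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# Run from BDV_INSTALL_DIR/data; for the example above this should report 532.
if [ -d maqcii-sge ]; then
    echo "subjobs configured: $(count_subjobs maqcii-sge maqcii-sge)"
fi
```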

The job can now be submitted to the SGE queue for execution by running:

 qsub maqcii-sge.qsub

You should get a response similar to:

 Your job-array 14342.1-532:1 ("maqcii-sge") has been submitted

Once these jobs start running, a directory called BDV_INSTALL_DIR/data/maqcii-sge-results (again note the name matches the “sge-job-name”) is created. When the jobs have completed, the directory will contain zip files that contain the results from each array subjob as follows:

 maqcii-sge-results/maqcii-sge-1.zip
 ...
 maqcii-sge-results/maqcii-sge-531.zip
 maqcii-sge-results/maqcii-sge-532.zip

Each zip file in the results directory contains the features, models, predictions and final-models generated by the corresponding subjob. A log file is also produced for each subjob. These files are named “maqcii-sge-oNNNNN.1”, “maqcii-sge-oNNNNN.2”, …, “maqcii-sge-oNNNNN.532”, where NNNNN is the SGE job number (e.g., 14342).

Once all the jobs complete, one more step is required: combining the results from each subjob into a single directory in a form suitable for post-processing. This is done by running a customized script called “extract-results.sh” from the BDV_INSTALL_DIR/data/maqcii-sge/ directory as follows:

 cd BDV_INSTALL_DIR/data/maqcii-sge/
 ./extract-results.sh <output-directory>

where ‘<output-directory>’ is the location where you would like to place the merged results. This directory will contain a single “model-conditions.txt” file and can be used directly to run any further BDVAL tasks.

Notes

  • The directory BDV_INSTALL_DIR/scripts contains the template scripts used to create the SGE job submission files. These scripts assume that all the source datasets and the result output directory are visible from each worker node. If this is not the case, the template scripts need to be modified to replace local copying (e.g., /bin/cp) with something appropriate for your environment (e.g., scp, ftp). You will also need to change or override the submit queue, which is currently hardcoded to “*@@rascals.h”.
  • Instances of the Rserve process are started on each node automatically when an individual BDVAL subtask is run. The RConnectionPool.xml file present during the setup is not used; instead, one is generated based on the number of threads specified when the job is defined.
  • With the exception of the dataset root, the “local” property configuration file (i.e., BDV_INSTALL_DIR/config/maqcii-local.properties) is not used to run on the cluster. This file gets generated based on the parameters given to the setup-sge-job target.
  • All the data for each subjob is self-contained, so if one or two subjobs fail, they can be resubmitted individually without resubmitting all the others. For example, if subjob 302 from the maqcii-c example above failed because of a power loss or some other recoverable problem, it can be resubmitted by executing “qsub -t 302 maqcii-sge.qsub”.
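
With several hundred subjobs, finding the ones to resubmit by hand is tedious. The following is a minimal sketch, not part of BDVAL, that lists the subjobs with no result zip and prints the corresponding resubmission commands. `missing_subjobs` is a hypothetical helper; the sketch assumes the result file naming shown earlier (job name, a dash, the subjob number, then `.zip`) and treats a missing zip as a failed subjob, which is only valid once nothing from the job array is still queued or running.

```shell
#!/bin/sh
# Sketch only: missing_subjobs is a hypothetical helper, not a BDVAL tool.

# Print the index of every subjob in 1..total that has no result zip.
#   $1 = results directory, $2 = job name, $3 = total number of subjobs
missing_subjobs() {
    i=1
    while [ "$i" -le "$3" ]; do
        if [ ! -f "$1/$2-$i.zip" ]; then
            echo "$i"
        fi
        i=$((i + 1))
    done
}

# Dry run from BDV_INSTALL_DIR/data: print the commands rather than run them.
if [ -d maqcii-sge-results ]; then
    for id in $(missing_subjobs maqcii-sge-results maqcii-sge 532); do
        echo "qsub -t $id maqcii-sge.qsub"
    done
fi
```

Printing the commands instead of executing them leaves room to check the subjob log files first. A zip file that exists is not proof that the subjob succeeded, so the logs remain the authoritative record.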