BDVal is designed as an extensible framework that can incorporate new feature selection methods, classifiers, and other options over time.

Here we describe the process of extending the BDVal source code to include a new mode. A basic understanding of Java programming is assumed.

1. Prerequisites

This example assumes the source distribution has been downloaded and extracted to a local directory. In addition to the requirements described in the installation and configuration sections, a Java Development Kit (JDK) and Apache Ant are required. Familiarity with the Java Simple Argument Parser (JSAP) is desirable.

2. Extending BDVal with a new feature selection mode

Suppose we would like to add to BDVal a new mode that uses a Cox regression model (partial likelihood and p-value) for feature selection. The following sections describe which classes need to be modified or extended to integrate this mode. The mode will be referred to as “cox-regression”.

Define Parameters for the Mode

The parameters for the Cox regression mode will be similar to those of the existing T-Test mode. One additional parameter, which specifies a “survival file”, will be required.

Implement the Mode

All modes in BDVal extend org.bdval.DAVMode.java. The implementation of the Cox regression mode will extend this class as well.

Since the new mode is similar to an existing mode, it may be helpful to use the code from the existing class org.bdval.DiscoverWithTTest.java as a reference.
Our new class will be called DiscoverWithCoxRegression and will reside in the org.bdval package (classes in this package are found in the src/org/bdval directory of the source distribution).

Set the mode options

The method defineOptions(JSAP jsap) of DAVMode is responsible for defining the options that a particular mode expects. The class DiscoverWithCoxRegression will override this method to add any parameters required by the new mode. Parameter parsing is done with the JSAP API. For example, adding the option for a “survival” file would be done as follows:

 final Parameter survivalFilenameOption = new FlaggedOption("survival")
         .setStringParser(JSAP.STRING_PARSER)
         .setDefault(JSAP.NO_DEFAULT)
         .setRequired(true)
         .setLongFlag("survival")
         .setHelp("Survival filename. This file contains survival data "
                 + "in tab delimited table; column 1: chipID has to match cids and "
                 + "tmm, column 2: time to event, column 3 censor with 1 as event 0 "
                 + "as censor, column 4 and beyond are all numerical covariates that "
                 + "will be included in the regression model");
 jsap.registerParameter(survivalFilenameOption);

Interpret the mode options

Each mode needs to handle the options it requires. The method interpretArguments(JSAP jsap, JSAPResult result, DAVOptions options) of DAVMode is responsible for interpreting the options that a particular mode expects. DiscoverWithCoxRegression will override this method to validate its options using JSAP and JSAPResult. General BDVal options may be stored in or read from the options parameter. Mode-specific options can be stored in the mode instance. For example, to parse the content of the file given after the --survival option, the following statements would be placed in the method interpretArguments():

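 super.interpretArguments(jsap, result, options);
 final String survivalFileName = result.getString("survival");
 try {
     // Parse the survival table now so that problems are reported before processing starts.
     readSurvival(survivalFileName);
 } catch (IOException e) {
     // Report the file that actually failed to load (the survival file, not options.input).
     LOG.fatal("Cannot read survival file \"" + survivalFileName + "\"", e);
 }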

Note that the actual mode processing is not executed in this method. The method interpretArguments() should perform parameter validation and preprocessing (e.g., parsing input files).
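
The source does not show the body of readSurvival(). As an illustration only, the sketch below parses rows laid out as the help text of the survival option describes (column 1: chip ID, column 2: time to event, column 3: censor flag with 1 for an event and 0 for censored, column 4 and beyond: numerical covariates). The names SurvivalRecord and SurvivalTableParser are hypothetical and are not BDVal classes. The sketch also assumes that the file has no header row and that the caller has already read the file into a list of lines.

```java
import java.util.ArrayList;
import java.util.List;

/** One parsed row of the survival file (hypothetical container, not a BDVal class). */
final class SurvivalRecord {
    final String chipId;
    final double timeToEvent;
    final boolean event;        // true if column 3 is 1 (event), false if 0 (censored)
    final double[] covariates;  // columns 4 and beyond

    SurvivalRecord(String chipId, double timeToEvent, boolean event, double[] covariates) {
        this.chipId = chipId;
        this.timeToEvent = timeToEvent;
        this.event = event;
        this.covariates = covariates;
    }
}

/** Hypothetical parser; assumes tab-delimited rows and no header row. */
final class SurvivalTableParser {
    private SurvivalTableParser() { }

    static List<SurvivalRecord> parse(List<String> lines) {
        List<SurvivalRecord> records = new ArrayList<>();
        for (int n = 0; n < lines.size(); n++) {
            String line = lines.get(n);
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = line.split("\t", -1);
            if (fields.length < 3) {
                throw new IllegalArgumentException(
                        "Line " + (n + 1) + ": expected at least 3 columns");
            }
            String censor = fields[2].trim();
            if (!censor.equals("0") && !censor.equals("1")) {
                throw new IllegalArgumentException(
                        "Line " + (n + 1) + ": censor must be 0 or 1");
            }
            double[] covariates = new double[fields.length - 3];
            for (int i = 0; i < covariates.length; i++) {
                // Double.parseDouble throws NumberFormatException on non-numeric input.
                covariates[i] = Double.parseDouble(fields[i + 3].trim());
            }
            records.add(new SurvivalRecord(fields[0].trim(),
                    Double.parseDouble(fields[1].trim()), censor.equals("1"), covariates));
        }
        return records;
    }
}
```

Failing early on a malformed row fits the role of interpretArguments(): invalid input is reported before any processing starts.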

Process the mode options

Once all options have been interpreted and validated, the method process(DAVOptions options) of DAVMode is invoked. DiscoverWithCoxRegression will override this method to filter features appropriately. By this point, the parsed options will have been stored, during the interpret step, in the options parameter or in the mode instance. We provide a line-by-line explanation of the DiscoverWithTTest process method to help you understand how this method implements feature selection.
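
The source does not specify how the new mode ranks or filters features. As one possible sketch, assuming each feature has already been assigned a p-value by the regression model, the filtering step could keep the features whose p-value does not exceed a cutoff, best first, up to a maximum count. The class PValueFeatureFilter and the parameters alpha and maxFeatures are hypothetical and are not part of BDVal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: selects features by p-value. Not a BDVal class. */
final class PValueFeatureFilter {
    private PValueFeatureFilter() { }

    /**
     * Returns the identifiers of at most maxFeatures features whose p-value is
     * less than or equal to alpha, ordered from smallest to largest p-value.
     */
    static List<String> select(Map<String, Double> pValues, double alpha, int maxFeatures) {
        List<Map.Entry<String, Double>> kept = new ArrayList<>();
        for (Map.Entry<String, Double> entry : pValues.entrySet()) {
            if (entry.getValue() <= alpha) {
                kept.add(entry);
            }
        }
        // List.sort is stable, so features with equal p-values keep their input order.
        kept.sort(Map.Entry.comparingByValue());

        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Double> entry : kept) {
            if (selected.size() == maxFeatures) {
                break;
            }
            selected.add(entry.getKey());
        }
        return selected;
    }
}
```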

Register the Mode

The main class for BDVal is org.bdval.DiscoverAndValidate.java. This class is essentially responsible for taking command line arguments from the user and delegating these to the appropriate mode(s) for processing.

To expose the new mode for use, it must be “registered” so that it is recognized as a valid --mode option. The method that registers modes is called registerMode and is defined in org.bdval.DAVMode.java. DiscoverAndValidate contains a series of registration calls; adding the following call to that list registers the new mode.

 davMode.registerMode("cox-regression", DiscoverWithCoxRegression.class);

The first argument to registerMode is the name of the mode, here “cox-regression”. The second argument is the class that implements the mode. As discussed, this class implements the following methods:

  1. defineOptions(JSAP jsap)
  2. interpretArguments(JSAP jsap, JSAPResult result, DAVOptions options)
  3. process(DAVOptions options)
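
To make the relationship between registration and these three methods concrete, here is a minimal, self-contained sketch of a name-to-class registry that instantiates the selected mode and calls the three methods in order. It is not BDVal code: SketchMode, SketchCoxMode and SketchModeRegistry are hypothetical stand-ins, the method signatures are simplified (no JSAP or DAVOptions types), and the way DiscoverAndValidate actually dispatches to a mode may differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a mode base class; it is not org.bdval.DAVMode. */
abstract class SketchMode {
    final List<String> trace = new ArrayList<>(); // records the order of the calls

    abstract void defineOptions();
    abstract void interpretArguments(Map<String, String> arguments);
    abstract void process();
}

/** Hypothetical mode that only records what it was asked to do. */
final class SketchCoxMode extends SketchMode {
    private String survivalFile;

    @Override
    void defineOptions() {
        trace.add("defineOptions");
    }

    @Override
    void interpretArguments(Map<String, String> arguments) {
        survivalFile = arguments.get("survival");
        if (survivalFile == null) {
            throw new IllegalArgumentException("--survival is required");
        }
        trace.add("interpretArguments");
    }

    @Override
    void process() {
        trace.add("process:" + survivalFile);
    }
}

/** Hypothetical registry mapping a mode name to the class that implements it. */
final class SketchModeRegistry {
    private final Map<String, Class<? extends SketchMode>> modes = new LinkedHashMap<>();

    void registerMode(String name, Class<? extends SketchMode> modeClass) {
        modes.put(name, modeClass);
    }

    SketchMode run(String name, Map<String, String> arguments) {
        Class<? extends SketchMode> modeClass = modes.get(name);
        if (modeClass == null) {
            throw new IllegalArgumentException("Unknown mode: " + name);
        }
        SketchMode mode;
        try {
            // Requires a no-argument constructor on the mode class.
            mode = modeClass.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate mode " + name, e);
        }
        mode.defineOptions();
        mode.interpretArguments(arguments);
        mode.process();
        return mode;
    }
}
```

Registering a class rather than an instance keeps the list of registration calls cheap: a mode is only constructed when its name is actually requested.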

Integrate the Mode into the rest of BDVal

Once the new mode has been compiled into BDVal, it will be available for use in other parts of BDVal. The file coxreg-svmglobal.sequence illustrates how the cox-regression mode can be used with the Sequence mode.

3. Estimating other statistics

BDVal supports two methods for estimating prediction performance:

  • The first method delegates performance estimation to the ROCR package. ROCR is written in R and supports most classification performance measures. BDVal supports any ROCR performance measure, which can be referenced by its ROCR name. The advantage of this method is that ROCR is a well-tested package. The main drawback is that BDVal must communicate with an R process to pass prediction results and known labels and to retrieve performance values. This inter-process communication can be slow and can become a bottleneck when millions of performance estimations must be carried out.
  • As a second, high-performance option, BDVal provides pure Java implementations of classification performance measures. Performance measures that are used in genetic algorithm optimization runs should be implemented natively. The following section describes how to extend BDVal with a new performance statistic.

Adding a native performance statistics calculator

This section demonstrates how to extend BDVal with a pure Java statistics calculator. As an example, we describe how to estimate accuracy.

We start by creating a class that extends the abstract class edu.cornell.med.icb.stat.PredictionStatisticCalculator. We call the new class AccuracyCalculator. Since PredictionStatisticCalculator is abstract, we need to implement two abstract methods. First, String getMeasureName() should return a string that identifies the statistic to the end user. In this case, the method simply returns the string “accuracy”.

Second, we need to implement the method double evaluateStatisticAtThreshold(final double threshold, final double[] decisionValues, final double[] labels). This method takes the value of a decision threshold, an array of decision values, and an array of true labels. It returns the accuracy obtained at the given decision threshold.

A simple way to implement this method is as follows:

47 evaluateContingencyTable(threshold, decisionValues, labels); 
48
49 double value = (TP + TN) / 
50 (TP + TN + FN + FP); 
51 if (value != value) { 
52 // NaN
53 value = 0;
54 }
55 return value;

Line 47 calls a method defined in PredictionStatisticCalculator to populate the variables TP, TN, FN and FP with the counts of true positives, true negatives, false negatives and false positives for the given threshold, decision values and labels. Lines 49 to 50 estimate accuracy from these counts. Line 51 checks whether the result is NaN (only possible when empty arrays are passed in for decisionValues) and, if so, converts it to zero. This is important because performance statistics are averaged across validation splits, and a single NaN would propagate to the final result.

Finally, line 55 returns the computed accuracy value. The abstract class PredictionStatisticCalculator handles the calculation of threshold-independent accuracy, defined as the accuracy calculated with the threshold that maximizes accuracy. The calculation is performed by calling evaluateStatisticAtThreshold for each discrete value of the threshold observed in the array decisionValues and returning the maximum value obtained.
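
The following self-contained sketch reproduces this logic without depending on BDVal, so that it can be run and tested in isolation. It does not extend the real PredictionStatisticCalculator, and the class name AccuracySketch is hypothetical. Two conventions are assumptions, since the source does not state them: a label of 1 marks a positive example, and a decision value greater than or equal to the threshold is predicted positive.

```java
/** Self-contained sketch; it does not extend the real PredictionStatisticCalculator. */
final class AccuracySketch {
    private AccuracySketch() { }

    /**
     * Accuracy at a single threshold.
     * Assumptions: label 1 is positive; decision value >= threshold predicts positive.
     */
    static double accuracyAtThreshold(double threshold, double[] decisionValues, double[] labels) {
        // Counts are kept as doubles so that 0/0 yields NaN instead of an exception.
        double tp = 0, tn = 0, fp = 0, fn = 0;
        for (int i = 0; i < decisionValues.length; i++) {
            boolean predictedPositive = decisionValues[i] >= threshold;
            boolean actuallyPositive = labels[i] == 1.0;
            if (predictedPositive && actuallyPositive) {
                tp++;
            } else if (predictedPositive) {
                fp++;
            } else if (actuallyPositive) {
                fn++;
            } else {
                tn++;
            }
        }
        double value = (tp + tn) / (tp + tn + fn + fp);
        // NaN arises only for empty input; map it to zero so averages stay defined.
        return Double.isNaN(value) ? 0 : value;
    }

    /** Maximum accuracy over every threshold observed in decisionValues. */
    static double thresholdIndependentAccuracy(double[] decisionValues, double[] labels) {
        double best = 0;
        for (double threshold : decisionValues) {
            best = Math.max(best, accuracyAtThreshold(threshold, decisionValues, labels));
        }
        return best;
    }
}
```

Scanning only the observed decision values is sufficient for a statistic such as accuracy, because the contingency table can only change when the threshold crosses one of those values.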

If this strategy is not appropriate for the statistic that you need to implement, you can override the method double thresholdIndependentStatistic(final double[] decisionValues, final double[] labels):

 @Override
 public double thresholdIndependentStatistic(final double[] decisionValues, final double[] labels) {
     // Provide your own logic here and return the computed statistic.
     throw new UnsupportedOperationException("Not implemented yet");
 }

For an example, see the class AreaUnderTheRocCurveCalculator.
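
As an illustration of a statistic for which scanning thresholds is unnecessary, the sketch below computes the area under the ROC curve directly as the fraction of (positive, negative) pairs in which the positive example receives the higher decision value, counting ties as one half. This is a generic pairwise formulation written for this guide, not the code of AreaUnderTheRocCurveCalculator, and the class name AucSketch is hypothetical. It assumes that a label of 1 marks a positive example and that larger decision values indicate the positive class. It returns zero when either class is absent, mirroring the NaN-to-zero convention described above.

```java
/** Self-contained sketch of a threshold-independent statistic (pairwise AUC). */
final class AucSketch {
    private AucSketch() { }

    /**
     * Area under the ROC curve computed over all (positive, negative) pairs.
     * Assumptions: label 1 is positive; larger decision values mean "more positive".
     */
    static double areaUnderRocCurve(double[] decisionValues, double[] labels) {
        double concordant = 0;
        long pairs = 0;
        for (int i = 0; i < decisionValues.length; i++) {
            if (labels[i] != 1.0) {
                continue; // i must index a positive example
            }
            for (int j = 0; j < decisionValues.length; j++) {
                if (labels[j] == 1.0) {
                    continue; // j must index a negative example
                }
                pairs++;
                if (decisionValues[i] > decisionValues[j]) {
                    concordant += 1.0;
                } else if (decisionValues[i] == decisionValues[j]) {
                    concordant += 0.5; // ties count as half
                }
            }
        }
        // With no positive or no negative examples the statistic is undefined; report 0.
        return pairs == 0 ? 0 : concordant / pairs;
    }
}
```

The pairwise loop is quadratic in the number of examples. A rank-based formulation would be faster on large validation sets, at the cost of a slightly longer implementation.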