This page provides a line-by-line description of the process method of DiscoverWithTTest. This method performs feature selection using the Student T-test P-value as the statistic and writes a gene list with the probesets that match the selection criteria.

On line 117, the method starts by calling the process method of the DAVMode superclass. The super.process method takes care of processing steps that are common across all BDVal modes.

Lines 118-119 loop through the classification tasks and gene lists provided on the command line. Line 12 creates an instance of TTestImpl, the Apache Commons Math class that implements the Student T-test.

Lines 123-124 turn off feature normalization and scaling, so that the T-statistic is calculated directly with signal values from the input file.

Lines 125-126 load the part of the input table that corresponds to features in the gene list.

Lines 127-128 obtain the label column (first column of the table).

Lines 129-130 check that the column is of type String.

Line 131 obtains the labels in the label column as one array of Strings.

Lines 133-134 calculate the labelValueGroups variable. This variable is a list of sets. Each set contains the labels (type String) of the samples that belong to a given group. In a two-group comparison, labelValueGroups is initialized with two elements of type Set: the first contains the labels of the samples in group 1 and the second contains the labels of the samples in group 2.
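The list-of-sets structure can be illustrated with a minimal sketch. The class name, the helper groupIndexOf, and the sample labels below are hypothetical and are not taken from the BDVal source; the sketch only shows how a label can be mapped back to its group once the sets exist.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class LabelGroupsSketch {
    /** Returns the index of the first group whose set contains the label, or -1 if none does. */
    static int groupIndexOf(List<Set<String>> labelValueGroups, String label) {
        for (int i = 0; i < labelValueGroups.size(); i++) {
            if (labelValueGroups.get(i).contains(label)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hypothetical labels: in BDVal these come from the input table and the task definition.
        Set<String> group1 = new LinkedHashSet<>(Arrays.asList("sample-A", "sample-B"));
        Set<String> group2 = new LinkedHashSet<>(Arrays.asList("sample-C", "sample-D"));
        List<Set<String>> labelValueGroups = Arrays.asList(group1, group2);

        // Look up which group a given sample label belongs to.
        System.out.println(groupIndexOf(labelValueGroups, "sample-C"));
    }
}
```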

Lines 136-137 create selectedProbesets, where the indices of probesets that meet the selection criteria will be kept. We use a BoundedSizeQueue implementation to keep only the maxProbesToReport lowest P-values.

Lines 138-139 start a for loop to iterate over each feature column in the input table. BDVal identifies each feature by either a String identifier or by an integer feature index. The loop index starts at one because the column at index zero contains the sample labels.

Line 141 converts the featureIndex to the index of the probeset on the platform (subtracting one to account for the label column).

Line 142 gets the column description for the current feature.

Lines 143-144 check that the column has type double.

Line 145 obtains the signal values for the feature as an array of doubles. Elements are indexed by sampleIndex.

Lines 146-147 initialize variables to hold the signal values in each group under comparison (negative label group 1 and positive label group 2).

Lines 148-158 assign signal values in each group using sampleId to find which group a sample value belongs to.
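As an illustration of this assignment step, here is a minimal sketch that splits one feature's signal values by group membership of the sample label. The class and method names are hypothetical, and the real method may organize the loop differently; the sketch only assumes that labels and values are both indexed by sampleIndex.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class GroupValuesSketch {
    /** Collects the signal values of the samples whose label belongs to the given group. */
    static double[] valuesInGroup(String[] labels, double[] values, Set<String> group) {
        double[] buffer = new double[values.length];
        int count = 0;
        for (int sampleIndex = 0; sampleIndex < values.length; sampleIndex++) {
            if (group.contains(labels[sampleIndex])) {
                buffer[count++] = values[sampleIndex];
            }
        }
        // Trim the buffer to the number of values actually assigned to this group.
        return Arrays.copyOf(buffer, count);
    }

    public static void main(String[] args) {
        // Hypothetical labels and signal values for one feature column.
        String[] labels = {"s1", "s2", "s3", "s4"};
        double[] values = {7.1, 3.4, 6.8, 3.9};
        Set<String> group1 = new HashSet<>(Arrays.asList("s1", "s3"));
        Set<String> group2 = new HashSet<>(Arrays.asList("s2", "s4"));

        System.out.println(Arrays.toString(valuesInGroup(labels, values, group1)));
        System.out.println(Arrays.toString(valuesInGroup(labels, values, group2)));
    }
}
```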

Line 159 declares the variable that will hold the T-Test P-value.

Lines 162-164 check that each group has at least two values, a precondition for performing a T-test.

Lines 167-169 use the Apache Commons Math implementation to estimate the P-value.

Lines 172-174 replace NaN return values with a P-value of 1.0.
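The guard logic around the test can be sketched as follows. To keep the sketch free of external dependencies, the statistical test is passed in as a function, standing in for the Commons Math call. The names are hypothetical, and returning 1.0 when a group has fewer than two values is an assumption: the description above does not say what the real method does in that case.

```java
import java.util.function.ToDoubleBiFunction;

final class GuardedPValueSketch {
    /**
     * Applies the supplied two-sample test only when both groups have at least
     * two values, and maps a NaN result to a P-value of 1.0.
     */
    static double guardedPValue(double[] group1, double[] group2,
                                ToDoubleBiFunction<double[], double[]> test) {
        if (group1.length < 2 || group2.length < 2) {
            // Assumption: treat a feature that cannot be tested as not significant.
            return 1.0;
        }
        double pValue = test.applyAsDouble(group1, group2);
        return Double.isNaN(pValue) ? 1.0 : pValue;
    }

    public static void main(String[] args) {
        double[] group1 = {7.1, 6.8, 7.4};
        double[] group2 = {3.4, 3.9, 3.1};
        // Stand-in test that always returns NaN, to show the replacement by 1.0.
        System.out.println(guardedPValue(group1, group2, (a, b) -> Double.NaN));
    }
}
```

Passing the test as a function also makes it easy to try a different statistic without touching the guards.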

Line 178 retrieves the probeset identifier corresponding to the given probesetIndex in the platform.

Line 179 stores the P-value in the probesetPvalues map.

Lines 181-186 enqueue the probeset with its P-value if the P-value is below the significance threshold. The score used to enqueue is 1-P-value, so that the lowest P-values are kept in the priority queue.
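The idea of scoring by 1-P-value in a size-bounded queue can be sketched with java.util.PriorityQueue. This is not the BoundedSizeQueue implementation used by BDVal; the class, its fields, and the strict "less than alpha" comparison are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class BoundedLowestPValuesSketch {
    /** A probeset index paired with its score (1 - P-value). */
    static final class Scored {
        final int probesetIndex;
        final double score;

        Scored(int probesetIndex, double score) {
            this.probesetIndex = probesetIndex;
            this.score = score;
        }
    }

    private final int capacity;
    private final double alpha;
    // Min-heap on score: the head is the retained entry with the highest P-value.
    private final PriorityQueue<Scored> queue =
            new PriorityQueue<>(Comparator.comparingDouble((Scored s) -> s.score));

    BoundedLowestPValuesSketch(int capacity, double alpha) {
        this.capacity = capacity;
        this.alpha = alpha;
    }

    /** Enqueues the probeset if its P-value is below alpha, evicting the worst entry when full. */
    void enqueue(int probesetIndex, double pValue) {
        if (pValue < alpha) {
            queue.add(new Scored(probesetIndex, 1.0 - pValue));
            if (queue.size() > capacity) {
                queue.poll(); // drops the lowest score, i.e. the highest P-value
            }
        }
    }

    /** Returns the retained probeset indices, lowest P-value first. */
    List<Integer> retainedIndices() {
        List<Scored> copy = new ArrayList<>(queue);
        copy.sort(Comparator.comparingDouble((Scored s) -> s.score).reversed());
        List<Integer> indices = new ArrayList<>();
        for (Scored s : copy) {
            indices.add(s.probesetIndex);
        }
        return indices;
    }

    public static void main(String[] args) {
        // Keep at most two probesets with P-value below 0.05 (hypothetical values).
        BoundedLowestPValuesSketch selected = new BoundedLowestPValuesSketch(2, 0.05);
        selected.enqueue(0, 0.04);
        selected.enqueue(1, 0.001);
        selected.enqueue(2, 0.03);
        selected.enqueue(3, 0.2);
        System.out.println(selected.retainedIndices());
    }
}
```

Using a min-heap on the score means the entry evicted when the queue overflows is always the one with the highest P-value, so the queue never holds more than the requested number of probesets.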

This concludes the loop started on line 138.

Lines 187-189 print the number of selected probesets and the significance threshold to standard output.

Lines 190-196 dequeue the features with the highest priority (highest 1-P-value, that is, lowest P-value) and pass each feature to the reporter. Reporter implementations are responsible for writing the feature in the appropriate gene list format.

Writing a custom feature selection strategy can be as simple as duplicating this code and changing the part that estimates the statistic used for feature selection. Wrapper feature selection modes are more complex (see DiscoverWithGeneticAlgorithm) and beyond the scope of this example.