This mode discovers sets of features that maximize a given performance measure, using a genetic algorithm as the optimizer. Classification is performed with a support vector machine (linear or RBF kernel). Starting from the entire set of N features presented as input, the algorithm optimizes the CV10 performance of a set of N*ratio features (a typical choice is ratio r = 0.5, which keeps 50% of the features at each iteration). Various parameters of the optimization affect the computational resources it requires and how close the solution found is to the optimal solution of the optimization problem. Larger values of population size and number of iterations (see the runtime arguments) favor optimal solutions but increase computation time. As usual with optimization algorithms, there is no guarantee that the optimal solution will be found. For biomarker discovery this is probably acceptable, since the fitness function (cross-validation F-1 on a finite training set) is itself only an imperfect measure of performance.
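The trade-off between solution quality and computation time can be made concrete with a rough count of classifier trainings. The sketch below is only an estimate under stated assumptions: every chromosome is scored once per evolution step (as the description of the population size argument says), each score costs one model training per fold per cross-validation repeat, and nothing is cached. The function names are hypothetical and are not part of the tool.

```python
def fitness_evaluations(population_size: int, steps: int) -> int:
    """Fitness evaluations if every chromosome is scored once per evolution step."""
    return population_size * steps


def classifier_trainings(population_size: int, steps: int,
                         folds: int = 10, cv_repeats: int = 1) -> int:
    """Each fitness evaluation trains one model per fold, per CV repeat (assumed)."""
    return fitness_evaluations(population_size, steps) * folds * cv_repeats


# Documented defaults: population size 10, 100 steps, 10 folds, 1 repeat.
default_cost = classifier_trainings(population_size=10, steps=100)

# Doubling the population doubles the number of trainings under these assumptions.
doubled_population = classifier_trainings(population_size=20, steps=100)

print(default_cost, doubled_population)
```

Under these assumptions the cost grows linearly in each of the four arguments, so raising several of them at once multiplies the run time.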

This method performs aggressive feature selection that optimizes cross-validation performance. Additionally, it is capable of optimizing any performance measure for any classifier type. Unfortunately, methods using genetic algorithms tend to scale poorly with the number of features and the training set size.
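To illustrate the general idea, here is a minimal sketch of genetic-algorithm feature selection with a pluggable fitness function. It is not the tool's implementation: the selection scheme (keep the better half as parents, carry the best subset forward), the crossover and mutation operators, and all function names are assumptions chosen for brevity. The toy fitness function stands in for a real cross-validation measure such as F-1.

```python
import random
from typing import Callable, FrozenSet

Subset = FrozenSet[int]


def random_subset(n_features: int, k: int, rng: random.Random) -> Subset:
    """Pick k distinct feature indices at random."""
    return frozenset(rng.sample(range(n_features), k))


def crossover(a: Subset, b: Subset, k: int, rng: random.Random) -> Subset:
    """The child draws k features from the union of both parents."""
    return frozenset(rng.sample(sorted(a | b), k))


def mutate(s: Subset, n_features: int, rng: random.Random) -> Subset:
    """Swap one selected feature for one that is not selected."""
    outside = [f for f in range(n_features) if f not in s]
    if not outside:
        return s
    removed = rng.choice(sorted(s))
    added = rng.choice(outside)
    return (s - {removed}) | {added}


def select_features(n_features: int, ratio: float,
                    fitness: Callable[[Subset], float],
                    population_size: int = 10, steps: int = 100,
                    seed: int = 0) -> Subset:
    """Search for a subset of about n_features * ratio features with high fitness."""
    if population_size < 2:
        raise ValueError("population_size must be at least 2")
    rng = random.Random(seed)
    k = max(1, int(n_features * ratio))
    population = [random_subset(n_features, k, rng) for _ in range(population_size)]
    for _ in range(steps):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: max(2, population_size // 2)]
        children = [ranked[0]]  # elitism: the best subset always survives
        while len(children) < population_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, k, rng), n_features, rng))
        population = children
    return max(population, key=fitness)


# Toy stand-in for a cross-validation score: features 0..4 are "informative".
INFORMATIVE = frozenset(range(5))


def toy_fitness(subset: Subset) -> float:
    return len(subset & INFORMATIVE) / len(INFORMATIVE)


best = select_features(n_features=20, ratio=0.5, fitness=toy_fitness)
print(sorted(best), toy_fitness(best))
```

Because the fitness function is passed in as a parameter, the same loop works for any performance measure and any classifier. In real use each call would run a full cross-validation, which is where the cost comes from.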

It is implemented by org.bdval.DiscoverWithGeneticAlgorithm.java.

Mode Parameters

The following options are available in this mode:

| Flag | Arguments | Required | Description |
|---|---|---|---|
| -r, --ratio | ratio | no | The ratio of the new number of features to the original number of features, for each iteration. (default: 0.5) |
| -n, --number-of-steps | number-of-steps | no | The number of genetic algorithm evolution steps. Larger values increase the chance that the optimal solution will be found, but increase computation time. (default: 100) |
| -s, --population-size | population-size | no | Number of chromosomes for genetic algorithm optimization. The larger the population size, the more diversity can be represented in the population, and the more effective cross-over will be at combining successful solutions into fitter offspring. Larger population sizes are more computationally expensive, since the fitness function must be evaluated for each chromosome at each evolution step. (default: 10) |
| --discrete-parameters | discrete-parameters | no | A list of discrete classifier parameters to optimize at the same time as the feature set. Parameters must be described in the format param1=value1,value2,…[:param2=value1,value2,…]. For instance, alpha=1,2,3,4:beta=0.2,.5,.33 will optimize the parameters alpha and beta along with the feature set. The combination of features and parameter values that optimizes CV performance will be kept. Optimal parameter values will be written to stderr, or to the file named by the argument --optimal-parameters-out. |
| -f, --folds | folds | no | Number of cross-validation folds (CV10 by default). (default: 10) |
| --cv-repeats | cv-repeats | no | Number of cross-validation repeats. The default performs one round of cross-validation. Values larger than one cause the cross-validation to be repeated and the results averaged over the rounds. (default: 1) |
| --output-gene-list | n/a | no | Write features to the output in the tissueinfo gene list format. |
| --roc | n/a | no | Optimize the area under the ROC curve. If neither this option nor --maximize is provided, the F-1 measure (harmonic mean of precision and recall) is maximized. Otherwise, the parameter --maximize names the objective function. |
| --num-features | num-features | no | Number of features to select. (default: 50) |
| --optimal-parameters-out | optimal-parameters-out | no | Name of the file where optimal parameters will be written (as Java properties). |
| --maximize | maximize | no | Select the objective measure that the GA process will try to maximize. Valid measure names include auc, mat, acc. For a complete list of measure names, see the ROCR documentation. |
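
The format accepted by --discrete-parameters can be illustrated with a small parser. This is a sketch for understanding the syntax only, not the tool's own parsing code: the function names are hypothetical, and treating every value as a floating-point number is an assumption. The exhaustive enumeration at the end only shows the size of the search space; the genetic algorithm searches this space rather than listing it.

```python
from itertools import product
from typing import Dict, Iterator, List


def parse_discrete_parameters(spec: str) -> Dict[str, List[float]]:
    """Parse 'param1=v1,v2:param2=v1,v2' into {name: [values]}."""
    grid: Dict[str, List[float]] = {}
    for group in spec.split(":"):
        name, _, values = group.partition("=")
        if not name.strip() or not values.strip():
            raise ValueError(f"malformed parameter group: {group!r}")
        # Assumption: all values are numeric.
        grid[name.strip()] = [float(v) for v in values.split(",")]
    return grid


def parameter_combinations(grid: Dict[str, List[float]]) -> Iterator[Dict[str, float]]:
    """Yield every combination of parameter values in the grid."""
    names = list(grid)
    for values in product(*(grid[n] for n in names)):
        yield dict(zip(names, values))


# The example from the option description above.
grid = parse_discrete_parameters("alpha=1,2,3,4:beta=0.2,.5,.33")
combos = list(parameter_combinations(grid))
print(grid)
print(len(combos), "parameter combinations")
```

Each additional parameter multiplies the number of combinations that can be paired with a feature set, so long value lists enlarge the search space quickly.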