This mode partitions a training dataset into splits for training and testing. A typical split design is cross-validation; other splitting strategies are possible but are not supported at this time. The mode generates a file, the split plan, which describes precisely how the samples in the whole training set are distributed into splits.

The generated file consists of lines of the form:

split-id  repeat-id  fold-id   split-type   sample-id  sample-class-label      sample-index
split-id: an integer that uniquely identifies a split
repeat-id: an integer that identifies a (random) repetition of the split strategy
fold-id: an integer that identifies the fold number
split-type: a string that identifies the role of the listed sample in a given split. Samples with split-type=training should be used to train the model, whereas samples with split-type=test should be used to test it
sample-id: a string that identifies a sample belonging to the split/fold described on this line
sample-class-label: an integer that gives the class label of this sample
sample-index: an integer that uniquely identifies the sample in the original dataset

The last two columns, sample-class-label and sample-index, are optional but are useful for keeping track of class labels and original sample indices.
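
As an illustration of the line format described above, the following Python sketch parses one line of a split plan. It is not part of BDVal: the names SplitRecord and parse_split_line are hypothetical, and the sketch assumes that columns are separated by whitespace, that sample ids contain no whitespace, and that the two optional columns are either both present or both absent.

```python
from typing import NamedTuple, Optional


class SplitRecord(NamedTuple):
    """One line of a split plan (hypothetical helper type, not a BDVal class)."""
    split_id: int
    repeat_id: int
    fold_id: int
    split_type: str
    sample_id: str
    class_label: Optional[int] = None   # optional sixth column
    sample_index: Optional[int] = None  # optional seventh column


def parse_split_line(line: str) -> SplitRecord:
    """Parse one split-plan line.

    Assumption: columns are whitespace-separated, and the two optional
    columns appear together or not at all.
    """
    fields = line.split()
    if len(fields) not in (5, 7):
        raise ValueError(f"expected 5 or 7 columns, got {len(fields)}: {line!r}")
    split_id, repeat_id, fold_id = (int(f) for f in fields[:3])
    split_type, sample_id = fields[3], fields[4]
    if len(fields) == 7:
        return SplitRecord(split_id, repeat_id, fold_id, split_type, sample_id,
                           int(fields[5]), int(fields[6]))
    return SplitRecord(split_id, repeat_id, fold_id, split_type, sample_id)
```

A line such as "1 1 1 training sample2" then yields a record whose class_label and sample_index are None.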

The following encodes a leave-one-out split strategy with three samples:

1 1 1 training sample2
1 1 1 training sample3
1 1 1 test sample1
2 1 2 training sample1
2 1 2 training sample3
2 1 2 test sample2
3 1 3 training sample1
3 1 3 training sample2
3 1 3 test sample3
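
The leave-one-out plan above can be produced with a few lines of Python. This is a minimal sketch rather than the BDVal implementation; the function name leave_one_out_plan is hypothetical, sample ids are assumed to be unique, and columns are written with a single space as in the listing above.

```python
def leave_one_out_plan(sample_ids):
    """Return split-plan lines for a leave-one-out design.

    Split i holds out the i-th sample for testing and trains on the rest.
    Assumption: sample ids are unique.
    """
    lines = []
    for i, held_out in enumerate(sample_ids, start=1):
        # In this design split-id and fold-id coincide and there is one repeat.
        for sample in sample_ids:
            if sample != held_out:
                lines.append(f"{i} 1 {i} training {sample}")
        lines.append(f"{i} 1 {i} test {held_out}")
    return lines


for line in leave_one_out_plan(["sample1", "sample2", "sample3"]):
    print(line)
```

Called with the three sample ids shown, the loop prints the same nine lines as the example.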

This encoding makes it possible to devise strategies that define several partitions of the input samples. For instance, it is possible to define feature-selection, training and test split types in the context of cross-validation with a number of random repeats.
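
To show how a consumer might recover these partitions from a plan, the Python sketch below groups sample ids by split-id and split-type. The function name group_by_split and the four example lines are made up for illustration, and whitespace-separated columns are assumed.

```python
from collections import defaultdict


def group_by_split(lines):
    """Map split-id -> split-type -> list of sample ids.

    Assumption: columns are whitespace-separated, in the order
    split-id, repeat-id, fold-id, split-type, sample-id.
    """
    plan = defaultdict(lambda: defaultdict(list))
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        split_id, split_type, sample_id = int(fields[0]), fields[3], fields[4]
        plan[split_id][split_type].append(sample_id)
    return plan


# Made-up plan with three split types in a single split.
example = [
    "1 1 1 feature-selection sample1",
    "1 1 1 training sample2",
    "1 1 1 training sample3",
    "1 1 1 test sample4",
]
plan = group_by_split(example)
print(dict(plan[1]))
```

Because the split type is an arbitrary string, the same grouping works whether a plan uses two types or more.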

Define Splits is implemented in org.bdval.DefineSplitsMode.java. The split plan can also be generated independently of Define Splits Mode and passed to Execute Splits Mode.

Mode Parameters

The following options are available in this mode:

Flag                      Arguments               Required  Description
(-f|--folds)              folds                   yes       Number of cross-validation folds.
--cv-repeats              cv-repeats              no        Number of cross-validation repeats. A value of 1 performs one round of cross-validation. Values larger than one cause the cross-validation to be repeated and the results averaged over the rounds. (default: 1)
--stratification          stratification          no        When true, each random fold is constrained to contain the same proportion of positive samples as the whole input set (modulo integer rounding errors). (default: true)
--feature-selection-fold  feature-selection-fold  no        When true, one fold is labeled for feature selection (split-type=feature-selection) and excluded from the training split. (default: false)
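
The Python sketch below illustrates how these options could shape a split plan generated outside of BDVal. It is not the DefineSplitsMode implementation. The parameter names mirror the flags, while the seed parameter, the continuous numbering of split-id across repeats, and the choice of the fold following the test fold as the feature-selection fold are all assumptions made for the example.

```python
import random


def define_splits(samples, folds, cv_repeats=1, stratification=True,
                  feature_selection_fold=False, seed=0):
    """Return split-plan lines for repeated k-fold cross-validation.

    `samples` is a list of (sample_id, class_label) pairs with unique ids.
    """
    if folds < 2 or folds > len(samples):
        raise ValueError("folds must be between 2 and the number of samples")
    if feature_selection_fold and folds < 3:
        raise ValueError("a feature-selection fold needs at least 3 folds")
    rng = random.Random(seed)  # hypothetical seed, for reproducible plans
    lines = []
    split_id = 0
    for repeat in range(1, cv_repeats + 1):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        if stratification:
            # Stable sort by label: dealing round-robin below then spreads
            # each class across the folds as evenly as rounding allows.
            shuffled.sort(key=lambda s: s[1])
        fold_of = {sid: i % folds for i, (sid, _) in enumerate(shuffled)}
        for fold in range(folds):
            split_id += 1  # assumption: split ids keep counting across repeats
            # Assumption: the fold after the test fold is used for feature selection.
            fs_fold = (fold + 1) % folds if feature_selection_fold else None
            for sid, _ in samples:
                if fold_of[sid] == fold:
                    split_type = "test"
                elif fold_of[sid] == fs_fold:
                    split_type = "feature-selection"
                else:
                    split_type = "training"
                lines.append(f"{split_id} {repeat} {fold + 1} {split_type} {sid}")
    return lines
```

For example, define_splits(samples, folds=3, cv_repeats=2) produces six splits, three per repeat, and each sample appears as a test sample exactly once in every repeat.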