This mode partitions a training dataset into various splits for training and testing. A typical split design is cross-validation, but other splitting strategies are possible (though not supported at this time). This tool generates a file which precisely describes how the samples in the whole training set should be distributed into splits.

The generated file consists of lines of the form:

split-id  repeat-id  fold-id  split-type  sample-id  sample-class-label  sample-index

split-id            an integer which uniquely identifies a split
repeat-id           an integer that identifies a (random) repetition of the split strategy
fold-id             an integer that identifies the fold number
split-type          a string that identifies the purpose of the fold in a given split. Samples with split-type=training should be used to train the model, whereas samples with split-type=test should be used to test the model
sample-id           a string that identifies the sample that belongs to the split/fold described
sample-class-label  an integer that describes the class label for this particular sample
sample-index        an integer that uniquely identifies the sample in the original dataset

The last two columns, sample-class-label and sample-index, are optional but are useful for keeping track of class labels and original sample indices.
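
As an illustration of the column layout described above, the following sketch parses one line of a split plan. It is not part of BDVal: the SplitRecord type and the parse_split_line function are hypothetical names, and the sketch assumes that columns are separated by whitespace, that sample ids contain no spaces, and that the two optional columns are either both present or both absent.

```python
from typing import NamedTuple, Optional


class SplitRecord(NamedTuple):
    """One line of a split plan (hypothetical helper type, not part of BDVal)."""
    split_id: int
    repeat_id: int
    fold_id: int
    split_type: str
    sample_id: str
    sample_class_label: Optional[int] = None
    sample_index: Optional[int] = None


def parse_split_line(line: str) -> SplitRecord:
    """Parse one split-plan line, assuming whitespace-separated columns."""
    fields = line.split()
    # Assumption: the two optional columns appear together or not at all.
    if len(fields) not in (5, 7):
        raise ValueError(f"expected 5 or 7 columns, got {len(fields)}: {line!r}")
    split_id, repeat_id, fold_id = (int(f) for f in fields[:3])
    split_type, sample_id = fields[3], fields[4]
    if len(fields) == 7:
        return SplitRecord(split_id, repeat_id, fold_id, split_type, sample_id,
                           int(fields[5]), int(fields[6]))
    return SplitRecord(split_id, repeat_id, fold_id, split_type, sample_id)
```

Records without the optional columns carry None for sample_class_label and sample_index, so downstream code can tell the two layouts apart.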

The following encodes a leave-one-out split strategy with three samples:

1  1  1  training  sample2
1  1  1  training  sample3
1  1  1  test      sample1
2  1  2  training  sample1
2  1  2  training  sample3
2  1  2  test      sample2
3  1  3  training  sample1
3  1  3  training  sample2
3  1  3  test      sample3
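
A plan like the one above can be produced with a few lines of code. The sketch below is only an illustration, not the DefineSplitsMode implementation: leave_one_out_plan is a hypothetical function, and the tab separator used when printing is an assumption about the file layout.

```python
def leave_one_out_plan(sample_ids):
    """Return split-plan rows for leave-one-out: one split per held-out sample."""
    rows = []
    for split_id, held_out in enumerate(sample_ids, start=1):
        # A single repetition; the fold number follows the split number.
        repeat_id, fold_id = 1, split_id
        for sample_id in sample_ids:
            if sample_id != held_out:
                rows.append((split_id, repeat_id, fold_id, "training", sample_id))
        rows.append((split_id, repeat_id, fold_id, "test", held_out))
    return rows


# Assumption: columns are written tab-separated.
for row in leave_one_out_plan(["sample1", "sample2", "sample3"]):
    print("\t".join(str(value) for value in row))
```

For three samples this yields nine rows: in each of the three splits, two samples are marked training and the held-out sample is marked test.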

This encoding makes it possible to devise strategies that define several partitions of the input samples. For instance, it is possible to define feature-selection, training and test split types in the context of cross-validation with a number of random repeats.
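
To make this concrete, here is a minimal sketch of repeated cross-validation with an optional feature-selection partition. It rests on assumptions the source does not confirm: that split-id keeps increasing across repeats and folds, and that the feature-selection fold is the fold following the test fold (wrapping around). The function name repeated_cv_plan and its parameters are hypothetical.

```python
import random


def repeated_cv_plan(sample_ids, folds, repeats=1, feature_selection_fold=False, seed=0):
    """Build split-plan rows for repeated k-fold cross-validation.

    Assumptions (not confirmed by the source): split-id increases across all
    repeats and folds, and the feature-selection fold is the fold that follows
    the test fold, wrapping around at the last fold.
    """
    needed = 3 if feature_selection_fold else 2
    if folds < needed or folds > len(sample_ids):
        raise ValueError("invalid number of folds for this sample set")
    rng = random.Random(seed)
    rows = []
    split_id = 0
    for repeat_id in range(1, repeats + 1):
        shuffled = list(sample_ids)
        rng.shuffle(shuffled)
        # Deal samples round-robin so fold sizes differ by at most one.
        fold_of = {s: i % folds + 1 for i, s in enumerate(shuffled)}
        for fold_id in range(1, folds + 1):
            split_id += 1
            fs_fold = fold_id % folds + 1 if feature_selection_fold else None
            for sample_id in sample_ids:
                if fold_of[sample_id] == fold_id:
                    split_type = "test"
                elif fold_of[sample_id] == fs_fold:
                    split_type = "feature-selection"
                else:
                    split_type = "training"
                rows.append((split_id, repeat_id, fold_id, split_type, sample_id))
    return rows
```

Each repeat reshuffles the samples, so every sample is tested exactly once per repeat while the composition of the folds changes from one repeat to the next.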

Define Splits is implemented by org.bdval.DefineSplitsMode.java.  The split plan can also be generated independently of Define Splits Mode and given to Execute Splits Mode.

Mode Parameters

The following options are available in this mode:

Flag                      Arguments               Required  Description
(-f|--folds)              folds                   yes       Number of cross-validation folds.
--cv-repeats              cv-repeats              no        Number of cross-validation repeats. Values larger than one cause the cross-validation to be repeated and the results averaged over the rounds. (default: 1, which does one round of cross-validation)
--stratification          stratification          no        When true, each random fold is constrained to contain the same proportion of positive samples as the whole input set (modulo integer rounding errors). (default: true)
--feature-selection-fold  feature-selection-fold  no        When true, one fold is labeled for feature selection (split-type=feature-selection) and excluded from the training split. (default: false)
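
The stratification constraint can be illustrated with a short sketch. This is one possible way to obtain folds with similar class proportions, not necessarily how DefineSplitsMode does it; stratified_folds is a hypothetical function, and class labels are assumed to be sortable values such as integers.

```python
import random
from collections import defaultdict


def stratified_folds(labels, folds, seed=0):
    """Map each sample index to a fold (1..folds), keeping class proportions similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    fold_of = {}
    offset = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal each class round-robin; the offset carries over between classes
        # so that the overall fold sizes also stay balanced.
        for position, index in enumerate(members):
            fold_of[index] = (offset + position) % folds + 1
        offset += len(members)
    return fold_of
```

With 4 positive and 8 negative samples and 4 folds, each fold receives 1 positive and 2 negative samples. When class counts are not multiples of the fold count, the proportions match only up to integer rounding.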