This mode partitions a training dataset into various splits for training and testing. A typical split design is cross-validation, but other splitting strategies are possible (though not supported at this time). This tool generates a file which precisely describes how the samples in the whole training set should be distributed into splits.
The generated file consists of lines of the form:
split-id repeat-id fold-id split-type sample-id sample-class-label sample-index
| Field | Description |
|---|---|
| split-id | an integer which uniquely identifies a split |
| repeat-id | an integer that identifies a (random) repetition of the split strategy |
| fold-id | an integer that identifies the fold number |
| split-type | a string that identifies the purpose of the fold in a given split. Samples with split-type=training should be used to train the model, whereas samples with split-type=test should be used to test it |
| sample-id | a string that identifies the sample assigned to the split/fold described on this line |
| sample-class-label | an integer that describes the class label for this particular sample |
| sample-index | an integer that uniquely identifies the sample in the original dataset |
The last two columns, sample-class-label and sample-index, are optional but are useful for keeping track of class labels and original sample indices.
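The line format described above can be read with a few lines of Python. This is a minimal sketch, not part of BDVal: the names `SplitPlanLine` and `parse_split_plan_line` are hypothetical, and the code assumes that fields are separated by whitespace (tabs or spaces), which the description above does not specify.

```python
from typing import NamedTuple, Optional


class SplitPlanLine(NamedTuple):
    """One line of a split plan (hypothetical helper type, not a BDVal class)."""
    split_id: int
    repeat_id: int
    fold_id: int
    split_type: str
    sample_id: str
    sample_class_label: Optional[int] = None  # optional column
    sample_index: Optional[int] = None        # optional column


def parse_split_plan_line(line: str) -> SplitPlanLine:
    # Assumption: fields are whitespace-delimited (tabs or spaces).
    fields = line.split()
    if not 5 <= len(fields) <= 7:
        raise ValueError(f"expected 5 to 7 fields, got {len(fields)}: {line!r}")
    split_id, repeat_id, fold_id = (int(value) for value in fields[:3])
    # The last two columns may be absent; missing ones are left as None.
    label = int(fields[5]) if len(fields) > 5 else None
    index = int(fields[6]) if len(fields) > 6 else None
    return SplitPlanLine(split_id, repeat_id, fold_id,
                         fields[3], fields[4], label, index)
```

Calling `parse_split_plan_line("1 1 1 training sample2")` yields a record whose two optional fields are `None`.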
The following encodes a leave-one-out split strategy with three samples:
| split-id | repeat-id | fold-id | split-type | sample-id |
|---|---|---|---|---|
| 1 | 1 | 1 | training | sample2 |
| 1 | 1 | 1 | training | sample3 |
| 1 | 1 | 1 | test | sample1 |
| 2 | 1 | 2 | training | sample1 |
| 2 | 1 | 2 | training | sample3 |
| 2 | 1 | 2 | test | sample2 |
| 3 | 1 | 3 | training | sample1 |
| 3 | 1 | 3 | training | sample2 |
| 3 | 1 | 3 | test | sample3 |
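The leave-one-out plan in the table above can be reproduced programmatically. The sketch below is illustrative only: `leave_one_out_plan` is a hypothetical helper, not a BDVal function, and it assumes (as in the table) that the split-id equals the fold-id and that there is a single repeat.

```python
def leave_one_out_plan(sample_ids):
    """Return (split-id, repeat-id, fold-id, split-type, sample-id) rows."""
    rows = []
    for fold, held_out in enumerate(sample_ids, start=1):
        # Every sample except the held-out one trains the model in this fold.
        for position, sample_id in enumerate(sample_ids, start=1):
            if position != fold:
                rows.append((fold, 1, fold, "training", sample_id))
        # The held-out sample is the only test sample in this fold.
        rows.append((fold, 1, fold, "test", held_out))
    return rows


for row in leave_one_out_plan(["sample1", "sample2", "sample3"]):
    print(" ".join(str(value) for value in row))
```

With three samples this produces nine rows, in the same order as the table.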
This encoding makes it possible to devise strategies that define several partitions of the input samples. For instance, it is possible to define feature-selection, training and test split-types in the context of cross-validation with a number of random repeats.
Define Splits is implemented by org.bdval.DefineSplitsMode.java. The split plan can also be generated independently of Define Splits Mode and given to Execute Splits Mode.
Mode Parameters
The following options are available in this mode:
| Flag | Arguments | Required | Description |
|---|---|---|---|
| (-f\|--folds) | folds | yes | Number of cross-validation folds. |
| --cv-repeats | cv-repeats | no | Number of cross-validation repeats. The default of 1 performs one round of cross-validation. Values larger than one cause the cross-validation to be repeated and the results averaged over the rounds. (default: 1) |
| --stratification | stratification | no | When true, each random fold is constrained to contain the same proportion of positive samples as the whole input set (modulo integer rounding errors). (default: true) |
| --feature-selection-fold | feature-selection-fold | no | When true, one fold is labeled for feature selection (split-type=feature-selection) and excluded from the training split. (default: false) |
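To make the effect of the options above concrete, here is a rough Python sketch of a repeated, optionally stratified cross-validation split plan. It is not BDVal's implementation. The function names, the `seed` parameter, the way split-ids are numbered across repeats, the choice of the fold following the test fold as the feature-selection fold, and the 0-based sample-index are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def assign_folds(labels, folds, stratification, rng):
    """Give each sample a fold number in 1..folds."""
    groups = defaultdict(list)
    for i, label in enumerate(labels):
        # With stratification, samples are dealt to folds one class at a time.
        groups[label if stratification else None].append(i)
    fold_of = [0] * len(labels)
    offset = 0
    for indices in groups.values():
        rng.shuffle(indices)
        for k, i in enumerate(indices):
            fold_of[i] = (offset + k) % folds + 1  # round-robin dealing
        offset += len(indices)
    return fold_of


def define_splits(sample_ids, labels, folds, cv_repeats=1,
                  stratification=True, feature_selection_fold=False, seed=0):
    """Return rows of (split-id, repeat-id, fold-id, split-type,
    sample-id, sample-class-label, sample-index)."""
    if folds < (3 if feature_selection_fold else 2):
        raise ValueError("too few folds for the requested strategy")
    rng = random.Random(seed)
    rows, split_id = [], 0
    for repeat in range(1, cv_repeats + 1):
        fold_of = assign_folds(labels, folds, stratification, rng)
        for test_fold in range(1, folds + 1):
            split_id += 1  # assumption: split-ids keep counting across repeats
            # Assumption: the fold after the test fold is used for feature selection.
            fs_fold = test_fold % folds + 1 if feature_selection_fold else None
            for i, sample_id in enumerate(sample_ids):
                if fold_of[i] == test_fold:
                    split_type = "test"
                elif fold_of[i] == fs_fold:
                    split_type = "feature-selection"
                else:
                    split_type = "training"
                rows.append((split_id, repeat, test_fold, split_type,
                             sample_id, labels[i], i))
    return rows
```

Dealing the shuffled samples of each class round-robin keeps the per-class count in any two folds within one of each other, which is one simple way to approximate the proportion constraint described for --stratification.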