Every model generated by BDVal is assigned a unique identifier: a short letter code (five letters in the examples below, such as YVIKN) derived from a subset of the command-line arguments used to produce that model. The model identifier is used to track the model through each stage of model development and evaluation.
Split Plans
The split-plan file indicates how samples in the input file are assigned to cross-validation folds for each random repeat of cross-validation. The split plan is generated by Define Splits Mode and saved to a .split-plan file, so that different feature selection strategies can be tested with exactly the same split partitions. The format of this file is described under Define Splits Mode.
Model Conditions file
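The model conditions file records the properties of a model as tab-separated property=value pairs, in the form:
<property1=value>[tab]<property2=value>[tab]..........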
| model-id=YVIKN input=data/bdval/GSE8402/norm-data/GSE8402_family.soft.gz overwrite-output=false task-list=data/bdval/GSE8402/tasks/GSE8402-FusionYesNo-TrainingSplit.tasks platform-filenames=data/bdval/GSE8402/platforms/GPL5474_family.soft.gz conditions=data/bdval/GSE8402/cids/GSE8402-FusionYesNo-TrainingSplit.cids pathway-aggregation-method=PCA scale-features=true percentile-scaling=false normalize-features=false classifier=edu.cornell.med.icb.learning.libsvm.LibSvmClassifier gene-features-dir=./ dataset-name=dataset-name dataset-root=ds-root rserve-port=-1 cache-dir=cache pathway-components-dir=pathway-components num-features=10 splits=data/bdval/GSE8402/splits/fusion-cv-5-fs=false.split sequence-file=data/sequences/baseline.sequence evaluate-statistics=true |
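As an illustration of the format, the following minimal Python sketch parses one line of a model conditions file into a dictionary. It assumes fields are separated by tab characters, as the format line states; the function name `parse_model_conditions_line` is hypothetical and not part of BDVal. Note that a value may itself contain an equals sign (for example the `splits` value above), so each field is split on the first `=` only.

```python
def parse_model_conditions_line(line: str) -> dict[str, str]:
    """Parse one tab-separated line of property=value pairs into a dict."""
    properties = {}
    for field in line.rstrip("\n").split("\t"):
        field = field.strip()
        if not field:
            continue  # tolerate empty fields, e.g. a trailing tab
        # Split on the first '=' only: values such as
        # 'fusion-cv-5-fs=false.split' contain '=' themselves.
        key, sep, value = field.partition("=")
        if not sep:
            raise ValueError(f"field has no '=': {field!r}")
        properties[key] = value
    return properties


# A shortened line built from three of the properties in the example above.
example = "\t".join([
    "model-id=YVIKN",
    "num-features=10",
    "splits=data/bdval/GSE8402/splits/fusion-cv-5-fs=false.split",
])
conditions = parse_model_conditions_line(example)
print(conditions["model-id"])
print(conditions["splits"])
```

All values are kept as strings; converting `num-features` to an integer or `scale-features` to a boolean is left to the caller.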
Feature Lists
The output of feature selection is a feature list that describes the most informative features for a particular model. The size of the feature list is determined by the parameter settings of each feature selection method. The feature list is saved in the format <dataset-name>-<classifier>-<model-id>-features-.txt and lists the probeset IDs of the informative features. The feature list file format is the same as the gene list format. If feature pre-filtering is disabled, the feature list contains only probeset IDs. An example is shown below:
| Ensembl Gene ID | EMBL ID | Refseq ID | Probeset ID |
| | | | 205696_s_at |
| | | | 204304_s_at |
| | | | 205380_at |
| | | | 214440_at |
| | | | 205009_at |
Zipped Model File
All output files associated with a particular model share the same filename prefix, a string that combines the names of the methods used to build that model. For example, the model prefix
libSVM_GSE8402_FusionYesNo_TrainingSplit-baseline-global-svm-weights-AGCKW.zip
indicates that the model was built with the LibSVM classifier on the GEO series dataset GSE8402, that the endpoint under consideration was FusionYesNo, that the model was built from training data using the global SVM weights feature selection strategy, and that the unique model identifier is AGCKW.
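The identifier can be recovered from such a filename with a short Python sketch. It assumes that the identifier is always the last hyphen-separated token before the .zip extension, which is inferred only from the single example above; the function name `model_id_from_filename` is hypothetical.

```python
def model_id_from_filename(filename: str) -> str:
    """Return the text after the last hyphen, ignoring a trailing '.zip'."""
    # Assumption: the model identifier is the final hyphen-separated token.
    stem = filename[:-len(".zip")] if filename.endswith(".zip") else filename
    return stem.rsplit("-", 1)[-1]


name = "libSVM_GSE8402_FusionYesNo_TrainingSplit-baseline-global-svm-weights-AGCKW.zip"
model_id = model_id_from_filename(name)
print(model_id)
```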
The unzipped model file contains several component files which are used to reconstitute the model by different BDVal parameters.
<modelFilenamePrefix>.means.properties
<modelFilenamePrefix>.model
<modelFilenamePrefix>.platform.externalIds2TranscriptIndex.id2Index.properties
<modelFilenamePrefix>.platform.externalIds2TranscriptIndex.runningIndex.properties
<modelFilenamePrefix>.platform.externalIdType.properties
<modelFilenamePrefix>.platform.externalIndex2Id.properties
<modelFilenamePrefix>.platform.name.properties
<modelFilenamePrefix>.platform.probeIds2ProbeIndex.id2Index.properties
<modelFilenamePrefix>.platform.probeIds2ProbeIndex.runningIndex.properties
<modelFilenamePrefix>.platform.probeIndex2ExternalIDIndex.properties
<modelFilenamePrefix>.platform.probeIndex2probeId.properties
<modelFilenamePrefix>.properties
<modelFilenamePrefix>.ranges.properties
The <modelFilenamePrefix>.properties file contains information about how the model was built. An example is shown below:
trained.from.dataset = GSE8402-FusionYesNo-TrainingSplit
training.class0.label = NO
training.class0.encoding = -1
training.class1.label = YES
training.class1.encoding = 1
training.classifier.classname = edu.cornell.med.icb.learning.libsvm.LibSvmClassifier
training.classifier.parameters = probability=false,
scaling.use.percentiles = false
scaling.enabled = true
feature-normalization.enabled = false
trained.from.split.split-id = 20
trained.from.split-type = training
trained.from.split-plan = bdval\/GSE8402\/splits\/GSE8402_FusionYesNo_TrainingSplit-split-plan-fs=false-CV-5-R-5.txt
scaling.implementation.classname = edu.cornell.med.icb.learning.MinMaxScalingRowProcessor
pathway.aggregation.method = PCA
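The properties file can be inspected outside BDVal. The Python sketch below is a deliberately simplified reader, not a full implementation of the Java properties format: it assumes one `key = value` pair per line, as in the example above, and does not handle line continuations, `:` separators, or `\uXXXX` escapes. The function name `parse_properties` is hypothetical. It splits on the first `=` so that values such as `probability=false,` survive intact, and it removes the backslash from simple escapes such as `\/`.

```python
import re


def parse_properties(text: str) -> dict[str, str]:
    """Parse simple 'key = value' lines into a dict (simplified format)."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue  # skip blank lines and comment lines
        key, _, value = line.partition("=")
        # Drop the backslash of simple escapes such as '\/'.
        # Simplification: escapes like '\n' or '\t' are not interpreted.
        props[key.strip()] = re.sub(r"\\(.)", r"\1", value.strip())
    return props


# Three lines copied from the example properties file above.
sample = r"""
training.class1.label = YES
training.classifier.parameters = probability=false,
trained.from.split-plan = bdval\/GSE8402\/splits\/GSE8402_FusionYesNo_TrainingSplit-split-plan-fs=false-CV-5-R-5.txt
"""
props = parse_properties(sample)
print(props["training.class1.label"])
print(props["trained.from.split-plan"])
```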
Predict mode outputs a prediction table for each model built. This table contains the following fields:
| Field | Description |
| split id | an integer which uniquely identifies a split |
| split type | a string that identifies the purpose of the fold in a given split |
| repeatId | an integer that identifies a (random) repetition of the split strategy |
| modelFilenamePrefix | model filename excluding any extensions |
| sampleIndex | an integer that uniquely identifies the sample in the original dataset |
| sampleId | a string which indicates that the corresponding sample is part of the split/fold described |
| probabilityOfClass1 | probability assigned to a sample based on the predicted |
| predictedSymbolicLabel | predicted symbolic label |
| probabilityOfPredictedClass | numerical value for the probability of the predicted class |
| probabilityClass1 | assigned class based on the prediction |
| trueLabel | the true class label of the sample |
| numericTrueLabel | numeric value for the true class label |
| correct | flag to indicate whether the model prediction is correct compared to the true label |
| modelNumFeatures | the number of features in the model |
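As a rough illustration of how the prediction table might be used, the Python sketch below computes the fraction of correct predictions per model by comparing predictedSymbolicLabel with trueLabel. It rests on assumptions not stated above: that the table is tab-delimited and that its first row is a header naming the fields. The function name `accuracy_by_model` is hypothetical, and the sample rows are made up for illustration only; they are not BDVal output.

```python
import csv
import io
from collections import defaultdict


def accuracy_by_model(table_text: str) -> dict[str, float]:
    """Return the fraction of correct predictions per modelFilenamePrefix."""
    correct = defaultdict(int)
    total = defaultdict(int)
    # Assumption: tab-delimited text whose first row names the fields.
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    for row in reader:
        model = row["modelFilenamePrefix"]
        total[model] += 1
        if row["predictedSymbolicLabel"] == row["trueLabel"]:
            correct[model] += 1
    return {model: correct[model] / total[model] for model in total}


# Made-up rows using a subset of the fields, for illustration only.
header = ["modelFilenamePrefix", "sampleId", "predictedSymbolicLabel", "trueLabel"]
rows = [
    ["model-A", "s1", "YES", "YES"],
    ["model-A", "s2", "NO", "YES"],
    ["model-B", "s1", "NO", "NO"],
]
table = "\n".join("\t".join(r) for r in [header] + rows)
accuracies = accuracy_by_model(table)
print(accuracies)
```

Comparing the two label columns avoids relying on how the correct flag is encoded, which is not specified here.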
