1. Model Identifiers
2. Split Plans
3. Model Conditions File
4. Feature Lists
5. Zipped Model File
6. Prediction Tables


Model Identifiers

Every model generated by BDVal is assigned a unique identifier. This identifier is a short letter code (five letters in the examples on this page, such as YVIKN) derived from a subset of the command-line arguments used to produce that model. The model identifier is used to keep track of the model at each stage of model development and evaluation.

Split Plans

The split-plan file records how samples in the input file are assigned to cross-validation folds for each random repeat of cross-validation. The split plan is generated by Define Splits Mode and saved to a .split-plan file so that different feature selection strategies can be tested on exactly the same split partitions. The format of this file is described under Define Splits Mode.

Model Conditions File

The model conditions file describes the models that were generated and the parameter values used for each one. Each model is represented by one line of tab-delimited property=value pairs covering all the properties used to generate that model. The general format of the model conditions file is shown below:
<property1=value>[tab]<property2=value>[tab]..........
An example of one model in the model conditions file:

model-id=YVIKN input=data/bdval/GSE8402/norm-data/GSE8402_family.soft.gz overwrite-output=false task-list=data/bdval/GSE8402/tasks/GSE8402-FusionYesNo-TrainingSplit.tasks platform-filenames=data/bdval/GSE8402/platforms/GPL5474_family.soft.gz conditions=data/bdval/GSE8402/cids/GSE8402-FusionYesNo-TrainingSplit.cids pathway-aggregation-method=PCA scale-features=true percentile-scaling=false normalize-features=false classifier=edu.cornell.med.icb.learning.libsvm.LibSvmClassifier gene-features-dir=./ dataset-name=dataset-name dataset-root=ds-root rserve-port=-1 cache-dir=cache pathway-components-dir=pathway-components num-features=10 splits=data/bdval/GSE8402/splits/fusion-cv-5-fs=false.split sequence-file=data/sequences/baseline.sequence evaluate-statistics=true
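
The line format above can be read with a few lines of Python. This is a minimal sketch, not part of BDVal: the function name parse_model_conditions_line is hypothetical, and the sample line is a shortened version of the example above. It assumes fields are separated by tab characters and splits each field on the first "=" only, because values such as the splits path can themselves contain "=".

```python
def parse_model_conditions_line(line):
    """Parse one tab-delimited line of property=value pairs into a dict.

    Hypothetical helper for illustration; it is not part of BDVal.
    """
    properties = {}
    for field in line.rstrip("\n").split("\t"):
        if not field:
            continue  # tolerate empty fields produced by repeated tabs
        # Split on the first "=" only: a value such as
        # "fusion-cv-5-fs=false.split" contains "=" itself.
        key, _, value = field.partition("=")
        properties[key] = value
    return properties


# Shortened version of the example line, joined with real tab characters.
example = "\t".join([
    "model-id=YVIKN",
    "classifier=edu.cornell.med.icb.learning.libsvm.LibSvmClassifier",
    "num-features=10",
    "splits=data/bdval/GSE8402/splits/fusion-cv-5-fs=false.split",
])

conditions = parse_model_conditions_line(example)
print(conditions["model-id"])
print(conditions["splits"])
```

All values are kept as strings; converting num-features to an integer or scale-features to a boolean is left to the caller.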

Feature Lists

The output of feature selection is a feature list that describes the most informative features for a particular model. The size of the feature list is determined by the parameter settings of each feature selection method. The feature list is saved to a file named <dataset-name>-<classifier>-<model-id>-features-.txt and is a list of informative feature probeset IDs. The feature list file format is the same as the gene list format. If feature pre-filtering is disabled, the feature list contains only probeset IDs. An example is shown below:

Ensembl Gene ID    EMBL ID    Refseq ID    Probeset ID
205696_s_at
204304_s_at
205380_at
214440_at
205009_at


Zipped Model File


All output files associated with a particular model share the same filename prefix, a string that combines the methods used to build that model. For example, the model file name

 libSVM_GSE8402_FusionYesNo_TrainingSplit-baseline-global-svm-weights-AGCKW.zip

indicates that this model was built using the LibSVM classifier on the GEO series dataset GSE8402, that the endpoint under consideration was FusionYesNo, that the model was built from training data using the global SVM weights feature selection strategy, and that the unique model identifier is AGCKW.
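
As a small illustration of this naming scheme, the model identifier can be recovered from a zipped model file name. The sketch below is not a BDVal API: the function name model_id_from_filename is hypothetical, and it assumes that the identifier is always the last hyphen-separated token before the .zip extension, as in the example above.

```python
import os


def model_id_from_filename(filename):
    """Return the trailing model identifier of a zipped model file name.

    Hypothetical helper; assumes the identifier is the last
    hyphen-separated token before ".zip".
    """
    stem, extension = os.path.splitext(os.path.basename(filename))
    if extension != ".zip":
        raise ValueError("expected a .zip model file: " + filename)
    # rpartition splits at the last hyphen, so hyphens inside the
    # feature selection name ("global-svm-weights") are left alone.
    _, _, model_id = stem.rpartition("-")
    return model_id


name = "libSVM_GSE8402_FusionYesNo_TrainingSplit-baseline-global-svm-weights-AGCKW.zip"
print(model_id_from_filename(name))  # AGCKW
```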

Unzipping the model file yields several component files, which BDVal uses to reconstitute the model:


<modelFilenamePrefix>.means.properties 
<modelFilenamePrefix>.model  
<modelFilenamePrefix>.platform.externalIds2TranscriptIndex.id2Index.properties
<modelFilenamePrefix>.platform.externalIds2TranscriptIndex.runningIndex.properties
<modelFilenamePrefix>.platform.externalIdType.properties
<modelFilenamePrefix>.platform.externalIndex2Id.properties
<modelFilenamePrefix>.platform.name.properties
<modelFilenamePrefix>.platform.probeIds2ProbeIndex.id2Index.properties
<modelFilenamePrefix>.platform.probeIds2ProbeIndex.runningIndex.properties
<modelFilenamePrefix>.platform.probeIndex2ExternalIDIndex.properties
<modelFilenamePrefix>.platform.probeIndex2probeId.properties
<modelFilenamePrefix>.properties
<modelFilenamePrefix>.ranges.properties

The <modelFilenamePrefix>.properties file contains information about how the model was built.  An example is shown below:


trained.from.dataset = GSE8402-FusionYesNo-TrainingSplit
training.class0.label = NO
training.class0.encoding = -1
training.class1.label = YES
training.class1.encoding = 1
training.classifier.classname = edu.cornell.med.icb.learning.libsvm.LibSvmClassifier
training.classifier.parameters = probability=false,
scaling.use.percentiles = false
scaling.enabled = true
feature-normalization.enabled = false
trained.from.split.split-id = 20
trained.from.split-type = training
trained.from.split-plan = bdval\/GSE8402\/splits\/GSE8402_FusionYesNo_TrainingSplit-split-plan-fs=false-CV-5-R-5.txt
scaling.implementation.classname = edu.cornell.med.icb.learning.MinMaxScalingRowProcessor
pathway.aggregation.method = PCA
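
A file in this "key = value" layout can be inspected outside BDVal with a short script. The sketch below is a simplified reader, not a full implementation of the Java properties format: the function name parse_properties is hypothetical, only the "\/" escape seen in the example above is undone, and line continuations and \uXXXX escapes are not handled.

```python
def parse_properties(text):
    """Parse simple "key = value" lines into a dict.

    Hypothetical, simplified reader: it skips blank lines and comments,
    splits on the first "=", and undoes only the "\\/" escape.
    """
    properties = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(("#", "!")):
            continue  # blank line or comment
        # Split on the first "=" so values like "probability=false," survive.
        key, _, value = line.partition("=")
        properties[key.strip()] = value.strip().replace("\\/", "/")
    return properties


# A few lines copied from the example above.
sample = r"""
training.class1.label = YES
training.classifier.parameters = probability=false,
trained.from.split-plan = bdval\/GSE8402\/splits\/GSE8402_FusionYesNo_TrainingSplit-split-plan-fs=false-CV-5-R-5.txt
"""

props = parse_properties(sample)
print(props["training.class1.label"])
print(props["trained.from.split-plan"])
```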

Prediction Tables

Predict mode outputs a prediction table for each model built. This table contains the following fields:

Field                        Description
split id                     an integer which uniquely identifies a split
split type                   a string that identifies the purpose of the fold in a given split
repeatId                     an integer that identifies a (random) repetition of the split strategy
modelFilenamePrefix          model filename excluding any extensions
sampleIndex                  an integer that uniquely identifies the sample in the original dataset
sampleId                     a string which indicates that the corresponding sample is part of the split/fold described
probabilityOfClass1          probability assigned to a sample based on the predicted
predictedSymbolicLabel       predicted symbolic label
probabilityOfPredictedClass  numerical value for the probability of the predicted class
probabilityClass1            assigned class based on the prediction
trueLabel                    the true class label of the sample
numericTrueLabel             numeric value for the true class label
correct                      flag to indicate whether the model prediction is correct compared to the true label
modelNumFeatures             the number of features in the model
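
The fields above are enough to compute simple summary measures. The sketch below computes accuracy by comparing predictedSymbolicLabel with trueLabel. It rests on assumptions that this page does not confirm: that the prediction table is tab-delimited and that it has a header row using the field names listed above. The function name accuracy is hypothetical and the sample rows are made up for illustration.

```python
import csv
import io


def accuracy(rows):
    """Fraction of rows whose predicted label equals the true label.

    Hypothetical helper; rows are dicts keyed by the field names above.
    """
    rows = list(rows)
    if not rows:
        return float("nan")
    hits = sum(1 for row in rows
               if row["predictedSymbolicLabel"] == row["trueLabel"])
    return hits / len(rows)


# Made-up rows, not real BDVal output. The tab delimiter and the header
# row are assumptions about the file layout.
sample = (
    "sampleId\tpredictedSymbolicLabel\ttrueLabel\n"
    "S1\tYES\tYES\n"
    "S2\tNO\tYES\n"
    "S3\tNO\tNO\n"
    "S4\tYES\tYES\n"
)

reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
result = accuracy(reader)
print(result)  # 0.75
```

Comparing the two label columns directly avoids depending on how the correct flag is encoded in the file.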