Pre-filtering involves eliminating features based on predetermined criteria. Typically, pre-filtering is carried out using gene lists, which makes it a way of building models from biologically specific features.
The gene list method, as implemented in BDVal, uses published gene lists to focus feature selection on genes that are likely to be predictive. When gene lists are selected independently of the dataset, the potential for over-fitting should be reduced. The underlying assumption is that similar phenotypes are likely to be mechanistically related, so a gene list published for a related phenotype may also be informative for the phenotype under study. The method requires probe-to-gene mapping information and potentially relevant gene lists.
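The idea can be illustrated with a minimal Python sketch. This is not BDVal's implementation: the function name `prefilter_probes`, the dictionary layout of the probe-to-gene mapping, and all identifiers in the sample data are hypothetical and chosen only for illustration.

```python
def prefilter_probes(probe_ids, probe_to_genes, gene_list):
    """Keep only probes that map to at least one gene in gene_list.

    probe_to_genes is assumed to be a dict from probe ID to a list of
    gene IDs; probes with no mapping are dropped.
    """
    wanted = set(gene_list)
    kept = []
    for probe in probe_ids:
        genes = probe_to_genes.get(probe, ())
        if wanted.intersection(genes):
            kept.append(probe)
    return kept


# Hypothetical probe-to-gene mapping and gene list (not real identifiers).
probe_to_genes = {
    "probe_1": ["GENE_A"],
    "probe_2": ["GENE_B", "GENE_C"],
    "probe_3": ["GENE_D"],
}
gene_list = ["GENE_A", "GENE_C"]

selected = prefilter_probes(
    ["probe_1", "probe_2", "probe_3", "probe_4"], probe_to_genes, gene_list
)
print(selected)  # ['probe_1', 'probe_2']
```

Because the gene list is fixed before the data are examined, the filter uses no information from the dataset labels, which is the property the method relies on to limit over-fitting.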
Gene list files
Gene list files are tab-delimited text files with one or more columns and one line per feature. The columns appear in the following order:
PrimaryID [tab] GenBankID [tab] RefSeqID [tab] ProbesetID
Lines beginning with the character ‘#’ are ignored. The fourth field is the probe set identifier, which must correspond to a probe set on the chip.
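A reader for this format might look like the following Python sketch. It is an illustration under stated assumptions, not BDVal's parser: the names `GeneListEntry` and `parse_gene_list` are hypothetical, blank lines are skipped, missing trailing columns are represented as `None`, and any columns beyond the fourth are discarded.

```python
from typing import List, NamedTuple, Optional

FIELD_COUNT = 4  # PrimaryID, GenBankID, RefSeqID, ProbesetID


class GeneListEntry(NamedTuple):
    primary_id: Optional[str]
    genbank_id: Optional[str]
    refseq_id: Optional[str]
    probeset_id: Optional[str]


def parse_gene_list(text: str) -> List[GeneListEntry]:
    """Parse tab-delimited gene list text into one entry per feature line."""
    entries = []
    for line in text.splitlines():
        # Lines starting with '#' are ignored; skipping blank lines
        # is an assumption of this sketch.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split("\t")[:FIELD_COUNT]
        # Pad short lines, since a file may have fewer than four columns.
        fields += [""] * (FIELD_COUNT - len(fields))
        entries.append(GeneListEntry(*(f.strip() or None for f in fields)))
    return entries


# Hypothetical file content (identifiers are placeholders).
sample = "# example gene list\nID_1\tGB_1\tRS_1\tPS_1\nID_2\n"
entries = parse_gene_list(sample)
print(entries[0].probeset_id)  # PS_1
print(entries[1].probeset_id)  # None
```

Returning `None` for absent columns lets calling code distinguish a single-column gene list from one that carries probe set identifiers.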