In contrast to feature selection, which eliminates features from the input file, feature aggregation combines input features into a smaller set of features, called aggregated features.

The figure below illustrates how feature aggregation reduces the dimensionality of an input set. Two methods are currently supported for step 3, pathway feature aggregation: signal averaging and projection onto principal components (the figure indicates SVD, but this method is not implemented at this time).

Feature averaging

A simple form of feature aggregation is feature averaging. In this case, n features of the input file are averaged to yield just one aggregated feature per sample. BDVal provides a flexible mechanism to perform feature aggregation. On the command line, this mechanism uses the options --pathways, --gene-to-probes and --pathway-aggregation-method. The option --pathway-aggregation-method average indicates that features should be averaged, while --pathways and --gene-to-probes determine which sets of features will be averaged.
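As an illustration of the averaging step only, here is a minimal Python sketch. It is not BDVal code: the function name average_features, the in-memory data layout, and the probe and sample names are all assumptions made for this example, as is the choice to skip features that are missing from a sample.

```python
from statistics import mean


def average_features(samples, feature_ids):
    """Average the named features within each sample.

    samples: dict mapping sample id -> dict of feature id -> value.
    feature_ids: the n features to combine into one aggregated feature.
    Features missing from a sample are skipped (an assumption of this sketch).
    """
    aggregated = {}
    for sample_id, values in samples.items():
        present = [values[f] for f in feature_ids if f in values]
        # One aggregated value per sample; None when nothing could be averaged.
        aggregated[sample_id] = mean(present) if present else None
    return aggregated


# Hypothetical input: two samples, three features each.
samples = {
    "sample1": {"probeA": 2.0, "probeB": 4.0, "probeC": 9.0},
    "sample2": {"probeA": 1.0, "probeB": 3.0, "probeC": 5.0},
}

# Aggregate probeA and probeB into a single feature per sample.
result = average_features(samples, ["probeA", "probeB"])
print(result)
```

Each sample ends up with a single value for the aggregated feature, which is how averaging reduces n input features to one.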

Pathway information file format

The argument to --pathways is a tab delimited file. Each line of the file represents a pathway. BDVal uses the term pathway to refer to the set of features that will be aggregated because historically we considered feature aggregation of probesets that map to genes of the same pathway. Each line has two tab delimited fields. The first field provides a pathway identifier. The second field is space delimited. Each token of the second field is an (Ensembl) gene id for a gene that belongs to the pathway. When this option is provided, features are aggregated by pathway and computations are performed in aggregated feature space. Some aggregation algorithms may generate several aggregated features per pathway (e.g., projection to principal components can project onto several of the largest principal components).

The following presents an example of two lines of pathway information (obtained from KEGG). The pathway identifier is the text before the <tab> marker on each line. The fragment defines two pathways: Lysine biosynthesis and Alanine and aspartate metabolism. The second field on each line indicates which gene identifiers participate in the pathway (gene identifiers are separated by space characters). The format is pathway-id<tab>[gene-id<space>]+gene-id

Lysine biosynthesis<tab>ENSG00000008311 ENSG00000157426 ENSG00000065427 ENSG00000109576 ENSG00000149313
Alanine and aspartate metabolism<tab>ENSG00000185100 ENSG00000095321 ENSG00000100357 ENSG00000035687 ENSG00000090861 ENSG00000115866 ENSG00000150768 ENSG00000183044 ENSG00000172482 ENSG00000128683 ENSG00000136750 ENSG00000120053 ENSG00000125166 ENSG00000167701 ENSG00000169910 ENSG00000070669 ENSG00000108381 ENSG00000130707 ENSG00000134440 ENSG00000173599 ENSG00000131828 ENSG00000163114 ENSG00000168291 ENSG00000117593 ENSG00000124608 ENSG00000113492 ENSG00000084774 ENSG00000137513 ENSG00000162174 ENSG00000166123 ENSG00000203797 ENSG00000132744
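
The pathway file format described above can be read with a few lines of Python. This is a sketch for illustration, not the parser BDVal uses: the function names parse_pathway_line and read_pathways are hypothetical, and the sketch assumes a real tab character where the example shows <tab>.

```python
def parse_pathway_line(line):
    """Split one line into (pathway id, list of gene ids).

    The first tab separates the pathway identifier from the gene field;
    gene ids inside the second field are separated by spaces.
    """
    pathway_id, genes_field = line.rstrip("\n").split("\t", 1)
    return pathway_id, genes_field.split()


def read_pathways(lines):
    """Build a dict of pathway id -> gene ids, skipping blank lines."""
    pathways = {}
    for line in lines:
        if not line.strip():
            continue
        pathway_id, gene_ids = parse_pathway_line(line)
        pathways[pathway_id] = gene_ids
    return pathways


# The Lysine biosynthesis line from the example, with a literal tab.
example_lines = [
    "Lysine biosynthesis\tENSG00000008311 ENSG00000157426 "
    "ENSG00000065427 ENSG00000109576 ENSG00000149313\n",
]

pathways = read_pathways(example_lines)
for pathway_id, gene_ids in pathways.items():
    print(pathway_id, len(gene_ids))
```

Because the pathway identifier may itself contain spaces (as in Lysine biosynthesis), the sketch splits on the first tab only and then splits the remainder on whitespace.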

Gene-to-probes file format

The argument to --gene-to-probes is the name of a file with one line per gene. Each line is tab delimited. The first field is an Ensembl gene id. The second field is a probe id that measures expression of a transcript of the gene. Several lines may share the same gene id, indicating that multiple probe ids exist for the gene. Gene-to-probe information is used in combination with the platform information to map pathways (i.e., sets of features to be aggregated) to sets of probes in the input data.

The following gene-to-probe example associates probesets with genes. BDVal does not require that every gene identifier maps to a probeset; gene identifiers that do not map to probesets are ignored. These files can often be created with Biomart.

#Ensembl Gene ID AFFY HG U133A
ENSG00000215916 219337_at
ENSG00000187488 220734_s_at
ENSG00000205090 219068_x_at
ENSG00000215914 207118_s_at
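
To make the mapping from pathways to probes concrete, the following Python sketch reads gene-to-probe lines and collects the probes for a list of gene ids. It is an illustration under stated assumptions, not BDVal's implementation: the function names are hypothetical, lines starting with # are assumed to be headers, fields are assumed to be separated by a real tab, and the input is adapted from the example above with one probe field emptied and one made-up gene id (GENE_NOT_IN_FILE) to show that unmapped genes are ignored.

```python
from collections import defaultdict


def read_gene_to_probes(lines):
    """Return a dict of gene id -> list of probe ids.

    Header lines (assumed to start with '#') and lines without a probe id
    are skipped. Several lines may share a gene id.
    """
    gene_to_probes = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 2 or not fields[1].strip():
            continue  # gene with no probe on this platform
        gene_to_probes[fields[0]].append(fields[1].strip())
    return dict(gene_to_probes)


def pathway_to_probes(gene_ids, gene_to_probes):
    """Collect the probes for a pathway's genes, ignoring unmapped genes."""
    probes = []
    for gene_id in gene_ids:
        for probe in gene_to_probes.get(gene_id, []):
            if probe not in probes:
                probes.append(probe)
    return probes


# Adapted from the example; the third line has an empty probe field
# purely for illustration.
lines = [
    "#Ensembl Gene ID\tAFFY HG U133A\n",
    "ENSG00000215916\t219337_at\n",
    "ENSG00000187488\t220734_s_at\n",
    "ENSG00000205090\t\n",
]
mapping = read_gene_to_probes(lines)

# Hypothetical pathway: two mapped genes, one gene with no probe,
# and one gene id that does not appear in the file at all.
gene_ids = ["ENSG00000215916", "ENSG00000205090",
            "GENE_NOT_IN_FILE", "ENSG00000187488"]
probes = pathway_to_probes(gene_ids, mapping)
print(probes)
```

The resulting probe list is the set of input features that would be aggregated for that pathway.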

Potential sources of pathway information

We have used the following sources of data to derive sets of genes for feature aggregation. Formatted pathway files for these sources are available here.


Human protein-protein interactions are clustered to yield densely connected networks of proteins. Proteins are mapped to genes and then to probesets on each chip.
Metabolic pathways in KEGG are mapped to probesets on each chip.
Tissue expression profiles are clustered to yield sets of transcripts that are co-expressed in tissues of the same organism. Transcripts are mapped to genes and then to probesets on each chip.