Contents

  • Input Files
  • Platform Files
  • Condition Identifiers (cids) files
  • Task Files

    Input Files

    Input dataset files for BDVAL contain the measurement data used to discover markers. A dataset provides probe signals for expression of transcripts or proteins in different biological samples. The following file types are supported:

    TMM
    TMM files end with the extension “.tmm”. The file is a tab-delimited table: each line holds the signal values of one probe across all biological samples, and each column corresponds to one biological sample. The first row of the table defines the sample names for the dataset, and the first column defines the name of each probe. The format has no concept of a comment line. A sample dataset is available here. The file parser is implemented by edu.mssm.crover.tables.readers.ColumbiaTmmReader.
    GEO
    Gene Expression Omnibus (GEO) “SOFT” files are expected to end with the extension “.soft”. The SOFT format structure and content are described in detail at the NCBI GEO SOFT Deposit page. The file parser is implemented by edu.mssm.crover.tables.readers.GeoDataSetReader.
    Iconix
    Iconix files are expected to end with the extension “.iconix”. The file parser is implemented by edu.mssm.crover.tables.readers.IconixReader.
    Whitehead
    Whitehead “res” files are expected to end with the extension “.res”. The file parser is implemented by edu.mssm.crover.tables.readers.WhiteheadResReader.
    Cologne
    Cologne files are expected to end with the extension “.cologne”. The file parser is implemented by edu.mssm.crover.tables.readers.CologneReader.

    Files that are compressed with gzip in the form of “ext.gz” are also supported (example: file.tmm.gz).
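
    The TMM layout described above can be illustrated with a minimal Python sketch. It is not the ColumbiaTmmReader implementation; the helper names open_text and read_tmm are hypothetical, and the sketch assumes that the header row carries a leading cell above the probe column and that every signal value parses as a floating-point number.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed (".gz") file as text (hypothetical helper)."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def read_tmm(lines):
    """Parse TMM content from an iterable of text lines.

    Returns (sample_names, signals), where signals maps each probe name
    to its list of signal values, one value per sample.
    """
    rows = iter(lines)
    header = next(rows).rstrip("\r\n").split("\t")
    # Assumption: the first header cell sits above the probe column,
    # so the sample names start at the second cell.
    samples = header[1:]
    signals = {}
    for raw in rows:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # assumption: tolerate blank lines
        # No comment handling: the format has no comment lines.
        fields = line.split("\t")
        probe, values = fields[0], fields[1:]
        if len(values) != len(samples):
            raise ValueError(
                f"probe {probe!r}: expected {len(samples)} values, got {len(values)}"
            )
        signals[probe] = [float(v) for v in values]
    return samples, signals
```

    A file on disk, compressed or not, could then be read by passing the handle returned by open_text("file.tmm.gz") to read_tmm.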

    Platform Files

    Gene Expression Omnibus (GEO) “SOFT” files, as described at the NCBI GEO SOFT Deposit page, are typically used as input for the platforms. Alternatively, a simplified format can be used, as shown below:

     #ID
     #GB_LIST
     !platform_table_begin
     ID    GB_LIST
     A_23_P167452    NM_016442
     Hs12503.1    NM_172200 NM_002189
     Hs3353.1    NM_054025 NM_018644
     Hs246107.1    NM_017770
     ...
     probeset [tab] [gene-id] ([tab] [gene-id])+
     !platform_table_end

    BDVal supports RefSeq ids, Ensembl gene and transcript ids, and EMBL ids. The platform file is used when pathway or gene-list runs are performed.

    Like the input datasets, files that are compressed with gzip in the form of “ext.gz” are supported.
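
    As an illustration of the simplified platform format, the following Python sketch maps each probeset to its gene ids. The function name read_platform is hypothetical and this is not BDVal's own parser. The sketch assumes that the row immediately after !platform_table_begin is the column header, and it splits on any whitespace so that gene ids separated by either tabs or spaces are accepted.

```python
def read_platform(lines):
    """Map each probeset id to its list of gene ids (hypothetical helper)."""
    mapping = {}
    in_table = False
    header_seen = False
    for raw in lines:
        line = raw.strip()
        if line == "!platform_table_begin":
            in_table, header_seen = True, False
            continue
        if line == "!platform_table_end":
            break
        if not in_table or not line:
            continue  # skips the "#ID" / "#GB_LIST" lines before the table
        if not header_seen:
            header_seen = True  # assumption: first table row is "ID GB_LIST"
            continue
        probeset, *gene_ids = line.split()
        mapping[probeset] = gene_ids
    return mapping
```

    Returning a dictionary keeps the lookup from probeset to gene ids direct, which is the direction needed when a gene list has to be matched against the probes of a dataset.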


    Condition Identifiers (cids) files

    Condition identifiers (cids) files are used to group biological samples into distinct “classes”. The format is a simple tab-delimited file with two entries per line. The first column is a string that defines the name of the class; the second column is a biological sample to be included in that class. The name of the sample should match a sample found in the input dataset. The first row of the file is considered to be a description and is therefore ignored. A cids file that corresponds to the sample input data file in tmm format is available here.

     class [tab] sample
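
    The cids layout above can be sketched in Python as follows. The names read_cids and missing_samples are hypothetical and not part of BDVal; the sketch assumes exactly two tab-separated fields per data line and skips the first row as a description.

```python
from collections import defaultdict


def read_cids(lines):
    """Group sample names by class from an iterable of cids lines."""
    classes = defaultdict(list)
    rows = iter(lines)
    next(rows, None)  # the first row is a description and is ignored
    for raw in rows:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # assumption: tolerate blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"expected 'class<TAB>sample', got {line!r}")
        class_name, sample = fields
        classes[class_name].append(sample)
    return dict(classes)


def missing_samples(classes, dataset_samples):
    """Return cids samples that do not appear in the input dataset."""
    known = set(dataset_samples)
    return sorted(
        sample
        for members in classes.values()
        for sample in members
        if sample not in known
    )
```

    The second helper reflects the requirement that every sample named in the cids file should also be present in the input dataset; an empty result means all names match.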


    Task Files

    Task files are used to define distinct groups of evaluations to be executed. Task files are expected to end with the extension “.tasks”. The format is a tab-delimited file with five columns. The first column is the name associated with the evaluation. The next two columns name the classes (as defined in the corresponding cids file) to be used. The fourth and fifth columns give the number of biological samples in the classes named in columns two and three, respectively. These last two columns also serve as a cross-check against the entries in the cids file.

     name [tab] class1 [tab] class2 [tab] #_of_samples_in_class1 [tab] #_of_samples_in_class2
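
    The five-column layout and the cross-check it enables can be sketched in Python as follows. Task, read_tasks and check_task are hypothetical names, not BDVal's API. The sketch assumes the task file has no header row (none is described above) and that the class membership has already been loaded from the cids file into a dictionary mapping each class name to its list of samples.

```python
from typing import NamedTuple


class Task(NamedTuple):
    name: str
    class1: str
    class2: str
    n_class1: int
    n_class2: int


def read_tasks(lines):
    """Parse task definitions from an iterable of tab-delimited lines."""
    tasks = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # assumption: tolerate blank lines
        fields = line.split("\t")
        if len(fields) != 5:
            raise ValueError(f"expected 5 columns, got {len(fields)}: {line!r}")
        name, class1, class2, n1, n2 = fields
        tasks.append(Task(name, class1, class2, int(n1), int(n2)))
    return tasks


def check_task(task, classes):
    """Compare declared sample counts with the cids class sizes.

    Returns a list of mismatch messages; an empty list means the counts agree.
    """
    problems = []
    declared = ((task.class1, task.n_class1), (task.class2, task.n_class2))
    for class_name, expected in declared:
        actual = len(classes.get(class_name, []))
        if actual != expected:
            problems.append(
                f"{class_name}: task file says {expected}, cids file has {actual}"
            )
    return problems
```

    Collecting mismatches in a list, rather than stopping at the first one, lets all inconsistencies between a task file and its cids file be reported in a single pass.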