Experimental design

We will start by assuming that you have designed an RNASeq experiment with two groups and three biological replicates by group. Assume the groups are called A and B, and the replicates

  • A-1, A-2, A-3,
  • B-1, B-2, B-3.

The goal of the experiment is to identify genes differentially expressed between groups A and B.

You should have already aligned the reads to the appropriate reference genome (see how to align on theĀ detailed introductory walk-through). For convenience, we will assume that the alignment files are named with a basename that matches the sample names shown above. This means that you have a number of files, including A-1.entries, A-2.entries, and so on. (To perform alignments, see the Goby demo on the home page.)

Annotation files

The first step is to count how many reads overlap with gene annotations. Gene annotations describe where the various exons of a gene are located and make it possible to sum exon and 3′ and 5′ UTR counts to yield a gene count. The Goby distribution includes annotation files for human and mouse, but it is easy to generate annotation files for a different species or genome. Biomart is the perfect tool for this purpose. Make sure you follow the format of the annotations Goby expects. Briefly, annotation files must be tab delimited and have the following fields:

Chromosome Name<tab>Strand<tab>Ensembl Gene ID<tab>Ensembl Exon ID<tab>Exon Chr Start (bp)<tab>Exon Chr End (bp)
NT_161916     -1       ENSMUSG00000079832      ENSMUSE00000704931      585717  586400
NT_166443      1       ENSMUSG00000074844      ENSMUSE00000614004      14098   14763
NT_166443      1       ENSMUSG00000074844      ENSMUSE00000704983      14765   14830

Please note that <tab> is used to indicate delimiters in the sample above, where spaces could cause confusion. See annotation files under the Goby distribution data directory. The files distributed are straight download from the Ensembl Biomart.

Determine gene counts

The following command will aggregate counts at the gene level across all the experimental conditions.

 java -Xmx3g -jar goby.jar -m alignment-to-annotation-counts A-?.entries B-?.entries \
      --annotation <goby-distribution>/data/biomart-mouse-exons-ensembl55-genes-NCBIM37.txt \
      --include-annotation-types gene

Alignments are processed in turn and new files are added to each basename. After executing this command, the following files should have been created:

 A-1.ann-counts.tsv
 A-2.ann-counts.tsv
 A-3.ann-counts.tsv
 A-1.ann-counts.tsv
 B-2.ann-counts.tsv
 B-3.ann-counts.tsv

The annotation counts file format

Files with extension .ann-counts.tsv are tab delimited files with the following fields:

  1. basename Basename of the aligment where the counts were determined
  2. main-id Main Id of the annotation (i.e., gene id)
  3. secondary-id Secondary id of the annotation (e.g., exon, several secondary-ids per main-id).
  4. type Type of annotation (gene, exon, intron)
  5. chro Chromosome
  6. strand Strand
  7. length Length of the annotation
  8. start Start position on chromosome
  9. end End position on chromosome
  10. in-count Number of reads whose extremities are completely inside the interval defined by the annotation.
  11. over-count Number of reads that partially overlap with the interval defined by annotation.
  12. RPKM Reads per Kilobase per Million Reads
  13. log2(RPKM+1) Log2 of RPKM+1
  14. expression obsolete
  15. num-exons Number of exons (valid for a gene annotation).

Statistical tests across groups

Evaluating statistics for differential expression can be done by specifying how the samples belong to experimental groups, and by indicating which group comparison should be tested. This can be done by specifying the groups parameter on the above command line: –groups A=A-1,A-2,A-3/B=B-1,B-2,B-3 and with the –comparison A/B parameter

Re-running the above command with these extra flags will output a summary stat file (named after the –stats parameters).

 java -Xmx3g -jar goby.jar -m alignment-to-annotation-counts A-?.entries B-?.entries \
      --annotation <goby-distribution>/data/biomart-mouse-exons-ensembl55-genes-NCBIM37.txt \
      --include-annotation-types gene --compare A/B --groups A=A-1,A-2,A-3/B=B-1,B-2,B-3 \
      --stats stats.tsv

Format of the summary stats file

The stats.tsv file produced by the above command is tab delimited with the following fields:

  1. element-id The id of the annotation element tested for differential expression between groups
  2. fold-change The fold change between the specified groups
  3. fold-change-magnitude The magnitude of fold-change (max(fold-change, 1/fold-change), makes up and down-regulation fold change comparable.
  4. average RPKM group A Average of element RPKMs over samples in group A.
  5. average count group A Average of element counts over samples in group A.
  6. average RPKM group B Average of element RPKMs over samples in group B.
  7. average count group B Average of element counts over samples in group B.
  8. t-test t test P-value
  9. t-statistic t test statistic
  10. fisher-exact-test Fisher exact test P-value
  11. fisher-exact-R Fisher exact test P-value calculated using R (if the R libraries are available)
  12. chi-square-test Chi Square test P-value (approximates Fisher’s test for large counts, supports multiple group comparison, invalid if any compared count <5)
  13. t-test-Bonferroni-adjusted Bonferroni adjustment to the t test P-value
  14. fisher-exact-test-Bonferroni-adjusted Bonferroni adjustment to the Fisher exact test P-value
  15. chi-square-test-Bonferroni-adjusted Bonferroni adjustment to the chi-square test P-value
  16. t-test-BH-FDR-q-value Benjamini and Hochberg FDR adjustment (q-value) for t-test
  17. fisher-exact-test-BH-FDR-q-value Benjamini and Hochberg FDR adjustment (q-value) for Fisher exact test
  18. chi-square-test-BH-FDR-q-value Benjamini and Hochberg FDR adjustment (q-value) for chi square test

Three statistical test are currently reported, with their raw P-values and their P-value adjusted for multiple testing. The tests differ in the null hypotheses that they are trying to reject:

  • t-test This tests the null hypothesis that the average of ln(RPKM+1) is unchanged across groups. T-test compares the variability across groups to the variability within group, so if A-1..3 represent distinct biological replicates, the t-test P-value can control for variability between individuals. Please note that more than 3 biological replicates are likely to be required to accurately estimate the variance with each group. The t-test implemented in Goby assumes equal variance between groups, because (i) there is no reason to assume a difference a priori, and (2) because the alternative requires estimating the [degree of freedom from the estimated variance], which itself is a poor estimate (few replicates are still used in most experiments because of the cost of RNASeq).
  • Fisher exact test This tests the null hypothesis that the sum of the counts in each group is generated by random sampling from the total counts observed in each group. Fisher exact test does not account for the intra-group sample variability because all counts are summed within one group. The test is exact, but the implementation used in Goby (from GoMiner) can fail for very large counts observed in some RNASeq experiments. Be cautious when interpreting this P-value when counts are large. We now provide an alternative implementation of the Fisher exact test. This implementation uses the R shared libraries and returns the same P-values that woud be evaluated by R. The R implementation was derived from theĀ FEXACT fortran code published in ACM. FEXACT was ported to the C language and cleanup a bit. The statistic fisher-exact-R is available in the summary stats file when R shared libraries are configured.
  • Chi Square test This tests the same hypothesis as the Fisher exact test, but the chi square test approximates P-values. The approximation holds for large counts in each cell of the contingency table. In such conditions, Fisher and Chi Square test should yield comparable P-values. Do not rely on the Chi Square P-value if any count is less than 5, since the assumptions of the approximation are not met (for such cases, rely on the Fisher test instead).