Goby 2.2 will offer a new VCF output format designed to identify somatic variations. This tutorial describes how to run Goby on somatic and germline samples to produce this output.

If you are not familiar with the Goby mode discover-sequence-variants, please see this tutorial before continuing.

The somatic output format is activated with the option --format SOMATIC_VARIATIONS. This format also requires the --covariates option, which must provide the path to a tab-delimited file with covariate information.

In this example, we assume that you have four DNA-seq samples, S1 … S4, aligned against the same reference sequence. The four samples were obtained from three individuals: two parents (father S1 and mother S2) and a patient (S3 and S4). Blood (S3) and skin (S4) were obtained from the patient. We further assume that we expect to detect somatic variations in the patient's blood, but not in the patient's skin. To address this question, you would provide a covariate information file organized as follows:

sample-id	patient-id	gender	type	kind-of-sample	tissue	parents
S1	P1	Male	Father	Germline	Blood	N/A
S2	P2	Female	Mother	Germline	Blood	N/A
S3	P3	Male	Patient	Somatic	Blood	S1|S2
S4	P3	Male	Patient	Germline	Skin	S1|S2
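One way to produce such a file is with a short script. The sketch below writes the table above as a tab-delimited file using Python's csv module. The function name write_covariates and any file name you pass to it are illustrative choices, not something Goby prescribes.

```python
import csv

# Column names and rows copied from the covariate table above.
COLUMNS = ["sample-id", "patient-id", "gender", "type",
           "kind-of-sample", "tissue", "parents"]
ROWS = [
    ["S1", "P1", "Male", "Father", "Germline", "Blood", "N/A"],
    ["S2", "P2", "Female", "Mother", "Germline", "Blood", "N/A"],
    ["S3", "P3", "Male", "Patient", "Somatic", "Blood", "S1|S2"],
    ["S4", "P3", "Male", "Patient", "Germline", "Skin", "S1|S2"],
]


def write_covariates(path, rows=ROWS):
    """Write the covariate table to `path` as a tab-delimited file."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(COLUMNS)   # header line
        writer.writerows(rows)     # one line per sample
```

Calling write_covariates("covariates.tsv") would create a file whose path can then be given to the --covariates option.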

Given the above table, genotype frequencies will be compared across the following pairs of samples:

S1|S2 vs S3 [determines if S3 has variations not explained by either parent S1 or S2]
S3 vs S4 [determines if S3 has variations not found in germline DNA]

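These comparisons are determined from the covariates: the first because S1 and S2 are listed as the parents of S3, and the second because S3 (somatic) and S4 (germline) were both obtained from patient P3.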
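The rule described above can be sketched in a few lines of Python. This is only an illustration of the logic as stated in this tutorial, not Goby's actual implementation; the function name derive_comparisons and the tuple layout are assumptions made for the example.

```python
# Relevant columns of the covariate table:
# (sample-id, patient-id, kind-of-sample, parents)
SAMPLES = [
    ("S1", "P1", "Germline", "N/A"),
    ("S2", "P2", "Germline", "N/A"),
    ("S3", "P3", "Somatic", "S1|S2"),
    ("S4", "P3", "Germline", "S1|S2"),
]


def derive_comparisons(samples):
    """Return (reference, somatic sample) pairs implied by the covariates."""
    comparisons = []
    for sample_id, patient_id, kind, parents in samples:
        if kind != "Somatic":
            continue
        # Parents vs. the somatic sample, when parents are listed.
        if parents != "N/A":
            comparisons.append((parents, sample_id))
        # Germline samples of the same patient vs. the somatic sample.
        for other_id, other_patient, other_kind, _ in samples:
            if other_patient == patient_id and other_kind == "Germline":
                comparisons.append((other_id, sample_id))
    return comparisons


comparisons = derive_comparisons(SAMPLES)
```

For the four samples above, this yields one pair for the parents against S3 and one pair for S4 against S3, matching the two comparisons listed.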

If you had also extracted PBMC from the patient's whole blood, you could add a line describing this new sample, which instructs Goby to also derive a P-value for the PBMC sample:

sample-id	patient-id	gender	type	kind-of-sample	tissue	parents
S1	P1	Male	Father	Germline	Blood	N/A
S2	P2	Female	Mother	Germline	Blood	N/A
S3	P3	Male	Patient	Somatic	Blood	S1|S2
S4	P3	Male	Patient	Germline	Skin	S1|S2
S5	P3	Male	Patient	Somatic	PBMC	S1|S2

Given the previous covariates, the --format SOMATIC_VARIATIONS option will produce a VCF file where each variation is annotated with two P-values: Somatic-P-value(Fisher)[S3] and Somatic-P-value(Fisher)[S5]. These P-values are stored in the INFO field of the output.
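To work with these annotations downstream, the P-values can be pulled out of the INFO column of each VCF record. The following is a minimal sketch, assuming the INFO keys are spelled exactly as named above and that entries follow the usual VCF layout of semicolon-separated key=value pairs. The function name somatic_p_values and the example INFO string are made up for illustration and are not real Goby output.

```python
def somatic_p_values(info):
    """Map sample id -> P-value for Somatic-P-value(Fisher)[sample] entries."""
    prefix = "Somatic-P-value(Fisher)["
    result = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        # Keep only key=value entries whose key has the expected shape.
        if sep and key.startswith(prefix) and key.endswith("]"):
            sample = key[len(prefix):-1]
            result[sample] = float(value)
    return result


# Hypothetical INFO value, for illustration only.
example_info = (
    "Somatic-P-value(Fisher)[S3]=0.001;"
    "Somatic-P-value(Fisher)[S5]=0.5"
)
p_values = somatic_p_values(example_info)
```

Entries that do not match the expected key shape are ignored, so other INFO annotations in the same record do not interfere.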

This new mode supports defining covariates for a number of somatic samples, and is also useful when analyzing trio datasets to identify variations found in cases, but not in parents.