This tutorial describes how to run Goby and produce somatic variations using custom contig reference sequence from parents reads.

If you are not familiar with the Goby mode “discover-sequence-variants”, please see this tutorial before continuing.

In order to obtain somatic variations based on custom contig reference alignment and map the variation position on contig back to the position on human reference genome (hg19), there are a few parameters needed to setup along with the “discover-sequence-variants” mode in Goby. First, it will need to assign “–format SOMATIC_VARIATIONS” and use dynamic option “-x SomaticVariationOutputFormat:alnTableFilename“.  Second, custom contig reference will also need to be prepared in Goby random access format and the file will be followed by “–genome” parameter. Last, provide family information in the covariate table using “–covariates” parameter.

Example command

goby 4g discover-sequence-variants father.sorted mother.sorted patient_blood.sorted patient_skin.sorted \
-x SomaticVariationOutputFormat:alnTableFilename=aln_mix_to_hg19.pos_mapping.txt \
-o genotypes.vcf \
–genome mix_assembly.contigs \
–minimum-variation-support 9 \
–threshold-distinct-read-indices 3 \
–covariates covariate_table.txt


Input files:
  • Goby sorted alignment files from patient’s family.
    (father.sorted.entries, father.sorted.index, father.sorted.header, father.sorted.tmh )
    (mother.sorted.entries, mother.sorted.index, mother.sorted.header, mother.sorted.tmh )
    (patient_blood.sorted.entries, patient_blood.sorted.index, patient_blood.sorted.header, patient_blood.sorted.tmh )
    (patient_skin.sorted.entries, patient_skin.sorted.index, patient_skin.sorted.header, patient_skin.sorted.tmh )
  • Goby random access index files.
    (mix_assembly.bases, mix_assembly.sizes, mix_assembly.names, mix_assembly.ignore )
  • Custom contig reference and human reference sequence mapping file.
  • Covariates file.
Output files:
  • genotypes.vcf

Format of contig mapping table

In the custom contig reference file, we assume each contig can be aligned back to a unique region in human reference genome (hg19). Since the length of each contig is long enough, we adopt global sequence alignment method to align contig sequence with human reference genome. The table will have five fields, contig ID, which strand in human genome it mapped, chromosome number, start position in chromosome, end position in chromosome. The field will be separated with tab delimit in text format.

….. .. … … …..

Format of covariates table

In this example, we will assume that you have four DNA-seq samples (father, mother, patient_blood, patient_skin ), aligned against the same contig reference sequence: mix_assmbly.fasta (The contig reference was built from the reads mixture of father and mother using  de novo assembly method). We further assume that we expect to detect unique somatic variations in blood, but not in the patient skin and not explained by the somatic variations in father or mother. To address this question, a covariate file will be need in the following text format:


Given the above table, genotype frequencies will be compared across the following pairs of samples:

father|mother vs patient_blood [ determines if patient_blood has variations not explained by either parent, father or mother ]
patient_blood vs patient_skin [ determines if patient_blood has variations not found in germline DNA ]

These comparisons are determined from the covariates because: P1 and P2 are parents of P3.