This tutorial describes how to run Goby and produce somatic variations using custom contig reference sequence from parents reads.

If you are not familiar with the Goby mode “discover-sequence-variants”, please see this tutorial before continuing.

In order to obtain somatic variations based on custom contig reference alignment and map the variation position on contig back to the position on human reference genome (hg19), there are a few parameters needed to setup along with the “discover-sequence-variants” mode in Goby. First, it will need to assign “–format SOMATIC_VARIATIONS” and use dynamic option “-x SomaticVariationOutputFormat:alnTableFilename“.  Second, custom contig reference will also need to be prepared in Goby random access format and the file will be followed by “–genome” parameter. Last, provide family information in the covariate table using “–covariates” parameter.

Example command
==============

goby 4g discover-sequence-variants father.sorted mother.sorted patient_blood.sorted patient_skin.sorted \
–format SOMATIC_VARIATIONS \
-x SomaticVariationOutputFormat:alnTableFilename=aln_mix_to_hg19.pos_mapping.txt \
-o genotypes.vcf \
–genome mix_assembly.contigs \
–minimum-variation-support 9 \
–threshold-distinct-read-indices 3 \
–covariates covariate_table.txt

==============

Input files:
  • Goby sorted alignment files from patient’s family.
    (father.sorted.entries, father.sorted.index, father.sorted.header, father.sorted.tmh )
    (mother.sorted.entries, mother.sorted.index, mother.sorted.header, mother.sorted.tmh )
    (patient_blood.sorted.entries, patient_blood.sorted.index, patient_blood.sorted.header, patient_blood.sorted.tmh )
    (patient_skin.sorted.entries, patient_skin.sorted.index, patient_skin.sorted.header, patient_skin.sorted.tmh )
  • Goby random access index files.
    (mix_assembly.bases, mix_assembly.sizes, mix_assembly.names, mix_assembly.ignore )
  • Custom contig reference and human reference sequence mapping file.
    (aln_mix_to_hg19.pos_mapping.txt)
  • Covariates file.
    (covariate_table.txt)
Output files:
  • genotypes.vcf

Format of contig mapping table

In the custom contig reference file, we assume each contig can be aligned back to a unique region in human reference genome (hg19). Since the length of each contig is long enough, we adopt global sequence alignment method to align contig sequence with human reference genome. The table will have five fields, contig ID, which strand in human genome it mapped, chromosome number, start position in chromosome, end position in chromosome. The field will be separated with tab delimit in text format.

0__len__712chr296298589630577
1__len__681+chr109267205892672738
2__len__1058+chrX118751093118752150
3__len__195+chr1182569118182569312
….. .. … … …..

Format of covariates table

In this example, we will assume that you have four DNA-seq samples (father, mother, patient_blood, patient_skin ), aligned against the same contig reference sequence: mix_assmbly.fasta (The contig reference was built from the reads mixture of father and mother using  de novo assembly method). We further assume that we expect to detect unique somatic variations in blood, but not in the patient skin and not explained by the somatic variations in father or mother. To address this question, a covariate file will be need in the following text format:

sample-idpatient-idgendertypekind-of-sampletissueparents
father.sortedP1MaleFatherGermlineBloodN/A
mother.sortedP2FemaleMotherGermlineBloodN/A
patient_blood.sortedP3MalePatientSomaticBloodP1|P2
patient_skin.sortedP3MalePatientGermlineSkinP1|P2

Given the above table, genotype frequencies will be compared across the following pairs of samples:

father|mother vs patient_blood [ determines if patient_blood has variations not explained by either parent, father or mother ]
patient_blood vs patient_skin [ determines if patient_blood has variations not found in germline DNA ]

These comparisons are determined from the covariates because: P1 and P2 are parents of P3.