Background

Goby supports discovering sequence variants, calling genotypes, estimating allele frequencies, or methylation rates (RRBS or methyl-seq datasets). These analyses are implemented in the mode discover-sequence-variants.

In the following, we call ‘sequence variations’ any sequence difference between reference and sequence variants, including sequencing artifacts such as sequencing errors. Sequence variations include ‘sequence variants’: true sequence differences that you wish to identify between a reference genome and a given sample. Goby offers a mode called display-sequence-variations that let’s you view all the sequence variations represented in an alignment. In contrast to this mode, discover-sequence-variants aims to remove sequencing artifacts to focus on true variants.

Briefly, the discover-sequence-variants mode

  • estimates which sequence variations are true sequence variants in a group of samples (i.e., discarding variation artifacts to prioritize the sequence variants most likely to be present in the sample).
  • can perform various analysis of the allele counts at each genomic location where a variant or reference base is observed.
  • reports analysis results in VCF format. The specific set of columns included in the output depend on the type of analysis discover-sequence-variants is performing.

Worked example

This tutorial assumes that you have alignments in the Goby format (the simplest way to do this is align reads with a version of BWA or GSNAP with Goby output, or to use GobyWeb). These alignment files must be sorted and indexed (you can do this with the Goby sort mode after the alignments finish if you used BWA/GSNAP directly; alignments produced by GobyWeb are already sorted and indexed). For the purpose of this example, we will use the Pickrell 2010 datasets. These RNASeq datasets were produced from RNA from HapMap individuals and can be obtained from SRA/ENA (accession code GSE19480).

Here’s a list of the alignments we will be using (you can download the alignments in Goby format here, alignments generated with GSNAP against the 1000 genome reference):

  • NA18486_yale
  • NA18486_argonne
  • NA19143_yale
  • NA19143_argonne

As the name suggests, alignments are for two HapMap individuals (NA18486 and NA19143), sequenced at two sites (‘yale’ and ‘argonne’).

You will also need the reference genome these alignments were done against. The following two lines (1) download the genome (may take a while, please be patient) and (2) prepares for fast random access:

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz
goby 1g build-sequence-cache human_g1k_v37.fasta.gz

We are now ready to discover variants. In a first step, we call genotypes at each site within each of these alignments:

goby 1g discover-sequence-variants  *.entries --format genotypes  -o genotypes.vcf --genome human_g1k_v37

Let’s take a closer look at this command. We provide the mode with each entries file for the alignments that we want to analyze. If you have downloaded the demo files, ‘*.entries’ on the above command line should match the following filenames:

  • DPMGKHJ-Pickrell-eQTL-NA19143_yale.entries
  • WTFVJVU-Pickrell-eQTL-NA19143_argonne.entries
  • PJCBGUJ-pickrellNA18486_yale.entries
  • ZPVIZVB-pickrellNA18486_argonne.entries

The –format flat specifies that we want to output genotypes. The -o flag indicates that the output file should be called genotypes.vcf (we use the extension .vcf because the genotypes format writes in the Variant Call Format).

When this command is invoked, it generates the following console output:

INFO  GobyDriver           - edu.cornell.med.icb.goby.modes.GobyDriver Implementation-Version: development (20110423180332)
ssociating basename: DPMGKHJ-Pickrell-eQTL-NA19143_yale to group: group1
Associating basename: PJCBGUJ-pickrellNA18486_yale to group: group2
Associating basename: WTFVJVU-Pickrell-eQTL-NA19143_argonne to group: group3
Associating basename: ZPVIZVB-pickrellNA18486_argonne to group: group4
sample: PJCBGUJ-pickrellNA18486_yale group group2
sample: ZPVIZVB-pickrellNA18486_argonne group group4
sample: WTFVJVU-Pickrell-eQTL-NA19143_argonne group group3
sample: DPMGKHJ-Pickrell-eQTL-NA19143_yale group group1
Filtering reads that have these criteria:
q<30
#count(allele) < (2 *#filtered)
count(allele in sample) < 1/4 * max_count over alleles in sample
Loading genome cache human_g1k_v37
INFO  IterateSortedAlignments - Alignment contains 84 reference sequences
INFO  NonAmbiguousAlignmentReader - start reading TMH for DPMGKHJ-Pickrell-eQTL-NA19143_yale
INFO  NonAmbiguousAlignmentReader - finished loading TMH for DPMGKHJ-Pickrell-eQTL-NA19143_yale
INFO  NonAmbiguousAlignmentReader - start reading TMH for DPMGKHJ-Pickrell-eQTL-NA19143_yale
INFO  NonAmbiguousAlignmentReader - done reading TMH for DPMGKHJ-Pickrell-eQTL-NA19143_yale
INFO  NonAmbiguousAlignmentReader - start reading TMH for PJCBGUJ-pickrellNA18486_yale
INFO  NonAmbiguousAlignmentReader - finished loading TMH for PJCBGUJ-pickrellNA18486_yale
INFO  NonAmbiguousAlignmentReader - start reading TMH for PJCBGUJ-pickrellNA18486_yale
INFO  NonAmbiguousAlignmentReader - done reading TMH for PJCBGUJ-pickrellNA18486_yale
INFO  NonAmbiguousAlignmentReader - start reading TMH for WTFVJVU-Pickrell-eQTL-NA19143_argonne
INFO  NonAmbiguousAlignmentReader - finished loading TMH for WTFVJVU-Pickrell-eQTL-NA19143_argonne
INFO  NonAmbiguousAlignmentReader - start reading TMH for WTFVJVU-Pickrell-eQTL-NA19143_argonne
INFO  NonAmbiguousAlignmentReader - done reading TMH for WTFVJVU-Pickrell-eQTL-NA19143_argonne
INFO  NonAmbiguousAlignmentReader - start reading TMH for ZPVIZVB-pickrellNA18486_argonne
INFO  NonAmbiguousAlignmentReader - finished loading TMH for ZPVIZVB-pickrellNA18486_argonne
INFO  NonAmbiguousAlignmentReader - start reading TMH for ZPVIZVB-pickrellNA18486_argonne
INFO  NonAmbiguousAlignmentReader - done reading TMH for ZPVIZVB-pickrellNA18486_argonne
INFO  IterateSortedAlignments - 864,256 items, 10s, 86,391.04 items/s
INFO  IterateSortedAlignments - 1,735,680 items, 20s, 86,753.64 items/s
INFO  IterateSortedAlignments - 2,291,712 items, 30s, 76,334.42 items/s
INFO  IterateSortedAlignments - 2,911,232 items, 40s, 72,622.85 items/s
INFO  IterateSortedAlignments - 3,396,608 items, 50s, 67,641.30 items/s
INFO  IterateSortedAlignments - 4,008,960 items, 1m 0s, 66,563.06 items/s
...

Progress will continue until the tool has finished scanning the genomic positions represented in the alignment (this takes about 4 minutes on a decent laptop). The output file that the genotypes format writes has the following fields:

##fileformat=VCFv4.1
##Goby=development (20110423180332)
##INFO=<ID=BIOMART_COORDS,Number=1,Type=String,Description="Coordinates for use with Biomart.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=BC,Number=1,Type=String,Description="Base counts in format A=?;T=?;C=?;G=?;N=?.">
##FORMAT=<ID=GB,Number=1,Type=String,Description="Number of bases that pass base filters in this sample.">
##FORMAT=<ID=FB,Number=1,Type=String,Description="Number of bases that failed base filters in this sample.">
##FORMAT=<ID=Zygosity,Number=1,Type=String,Description="Zygosity">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  DPMGKHJ-Pickrell-eQTL-NA19143_yale      PJCBGUJ-pickrellNA18486_yale    WTFVJVU-Pickrell-eQTL-NA19143_argonne   ZPVIZVB-pickrellNA18486_argonne
1       564643  .       A                               BIOMART_COORDS=1:564643:564643  GT:BC:GB:FB:Zygosity    ./.:A=0,T=0,C=0,G=0,N=0:0:0:homozygous  0/0:A=329,T=0
,C=0,G=0,N=0:329:3:homozygous      ./.:A=0,T=0,C=0,G=0,N=0:0:0:homozygous  0/0:A=812,T=0,C=0,G=0,N=0:812:10:homozygous
1       564648  .       A                               BIOMART_COORDS=1:564648:564648  GT:BC:GB:FB:Zygosity    ./.:A=0,T=0,C=0,G=0,N=0:0:0:homozygous  0/0:A=405,T=0

Files in the VCF format can be manipulated with VCF-tools. Goby also provides a VCFParser class to read these files efficiently in Java.

Let’s now consider how to compare variations between groups of samples. Since the Pickrell samples were sequenced at two sites, we may want to group alignments according to the individual HapMap identifier in order to pool the technical replicates. To address the question of finding variants that differ between the two individuals, we can do:

goby 1g discover-sequence-variants  *.entries --compare NA19143/NA18486  --format compare_groups
  --groups NA19143=DPMGKHJ-Pickrell-eQTL-NA19143_yale,WTFVJVU-Pickrell-eQTL-NA19143_argonne/NA18486=PJCBGUJ-pickrellNA18486_yale,ZPVIZVB-pickrellNA18486_argonne

This command uses the compare_groups format. This format requires both a –compare and a –group flag. The –compare flag indicates that we want to compare two groups, called ‘NA19143’ and ‘NA18486’. The option –groups associate input files to a group declaration. We define the groups to represent HapMap individuals. It is not necessary to reference each input file in the –groups declaration, but alignments that are defined in groups must be provided in the input file list. Finally, the option –eval filter indicates that we wish to apply filters to cleanup base calls (this is strongly recommended). The output is written to standard out, but you can redirect it with the –output/-o flag, as before. The command prints back the groups definition that it recognized (hint: it is useful to check this information against what you thought you typed on the command line).

The compare_groups format outputs the following VCF file. The column FisherP[NA19143/NA18486] stores the Fisher exact test p-value of the allelic frequencies (number of samples with the allele) being significantly different between the two groups.

##fileformat=VCFv4.1
##Goby=development (20110423181948)
##INFO=<ID=BIOMART_COORDS,Number=1,Type=String,Description="Coordinates for use with Biomart.">
##INFO=<ID=LOD[NA19143/NA18486],Number=1,Type=Float,Description=Log2 of the odds-ratio of observing a variant in group NA19143 versus group NA18486">
##INFO=<ID=LOD_SE[NA19143/NA18486],Number=1,Type=Float,Description="Standard Error of the log2 of the odds-ratio between group NA19143 and group NA18486">
##INFO=<ID=LOD_Z[NA19143/NA18486],Number=1,Type=Float,Description="Z value of the odds-ratio of observing a variant in group NA19143 versus group NA18486">
##INFO=<ID=FisherP[NA19143/NA18486],Number=1,Type=Float,Description="Fisher exact P-value that the allelic frequencies (each sample contributes at least one toward the group count of each allele) differ between group NA19143 and group NA18486.">
##INFO=<ID=refCountGroup[NA19143],Number=1,Type=Integer,Description="Number of reference allele called in group NA19143.">
##INFO=<ID=varCountGroup[NA19143],Number=1,Type=Integer,Description="Number of variant allele(s) called in group NA19143.">
##INFO=<ID=refCountGroup[NA18486],Number=1,Type=Integer,Description="Number of reference allele called in group NA18486.">
##INFO=<ID=varCountGroup[NA18486],Number=1,Type=Integer,Description="Number of variant allele(s) called in group NA18486.">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth of sequencing across groups at this site">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=BC,Number=1,Type=String,Description="Base counts in format A=?;T=?;C=?;G=?;N=?.">
##FORMAT=<ID=GB,Number=1,Type=String,Description="Number of bases that pass base filters in this sample.">
##FORMAT=<ID=FB,Number=1,Type=String,Description="Number of bases that failed base filters in this sample.">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  DPMGKHJ-Pickrell-eQTL-NA19143_yale      PJCBGUJ-pickrellNA18486_yale    WTFVJVU-Pickrell-eQTL-NA19143_argonne   ZPVIZVB-pickrellNA18486_argonne
1       565324          T                               BIOMART_COORDS=1:565324:565324;LOD[NA19143/NA18486]=3.669199037889188;LOD_SE[NA19143/NA18486]=NaN;LOD_Z[NA19143/NA18486]=NaN;FisherP[NA19143/NA18486]=1.0;refCountGroup[NA19143]=78;varCountGroup[NA19143]=0;refCountGroup[NA18486]=1004;varCountGroup[NA18486]=0;DP=1082       GT:BC:GB:FB     0/0:A=0,T=29,C=0,G=0,N=0:29:0   0/0:A=0,T=267,C=0,G=0,N=0:267:1 0/0:A=0,T=49,C=0,G=0,N=0:49:0   0/0:A=0,T=737,C=0,G=0,N=0:737:16

When estimating P-values across many genomic sites, it is a good idea to control for multiple-testing. You can adjust P-values with the Goby fdr mode:

goby 1g fdr compare-groups.vcf --vcf -o adj-compare-groups.vcf -c FisherP[NA19143/NA18486]

The previous command scans one file ‘compare-groups.vcf’, to adjust the P-value in the column called ‘FisherP[NA19143/NA18486]’. The result is written to the output. Both input and output are in VCF format (–vcf flag).

The output now contains a new column ‘fisherp[na19143/na18486]-BH-FDR-q-value’ with the value of the original p-value adjusted for multiple testing with the method of Benjamini-Hochberg.

Other applications

The mode discover-sequence-variants also supports the following applications:

  • estimate allele frequencies (for RNA-Seq samples)
  • estimate methylation rates at base-level resolution (for RRBS or methyl-Seq samples, output-format: methylation)
  • estimate methylation rates over regions of the genome (for RRBS or methyl-Seq samples, output-format: methylation_region)

These applications are implemented in different output formats (for instance using –format allele_frequencies instead of compare_groups in the command line above will estimate the frequency of the reference allele in these RNASeq samples). See the help message for a description of formats supported by the discover-sequence-variants mode:

goby 1g discover-sequence-variants --help

Other options

You may find useful to realign reads around indels. Goby has an ultrafast algorithm to realign reads that span indels. See this tutorial that explains how to activate this option.