This tutorial describes how to estimate enrichment efficiency and depth of coverage for targeted capture experiments, such as sequencing after exome capture. The coverage mode is part of Goby 1.9.7+.

Prerequisites

We assume that you have aligned reads to an appropriate reference sequence and have generated alignments in the Goby format. For this tutorial you will also need to generate counts files. If you aligned with GobyWeb, you can simply download these files after the alignment completes since they are generated automatically. If you aligned manually, you can create counts files with the Goby mode alignment-to-counts.

You will not need to download the entire alignment since the coverage mode below requires only the .header and .counts files. These files are much smaller than the rest of the alignment files and this comes handy if you need to transfer data over the network before estimating coverage.

We will also assume that you need to estimate coverage statistics for two alignments: align1 and align2. You should have the following files available:

  1. align1.header
  2. align1.counts
  3. align2.header
  4. align2.counts

Define capture regions

We start by defining capture regions. We can do this by creating a tab delimited file with three fields:

  1. reference/chromosomeId
  2. start-position (zero-based and inclusive)
  3. end-position (zero-based and inclusive)

The file should contain these three columns (and possibly other columns that will be ignored) and no header.

Let’s take an example. Assume we have designed an experiment to capture the region on chromosome 22 from base 1,000,000 to base 1,000,1000 (one-based coordinates). We create the following capture annotation file (\t indicates a tab character, press tab instead of entering \t when creating the file in an editor):

22\t999999\t1000999

Assume this one line file is called capture-regions.tsv. We convert this annotation file to a counts file with the following command:

goby 1g annotations-to-counts capture-regions.tsv -o capture-regions

The previous step converts the tab delimited file to a compressed binary file in the counts format. The counts format can be processed efficiently with the coverage tool. You can also use counts-to-bedgraph or counts-to-wiggle to create annotation tracks for visual inspection in IGV or the UCSC genome browser.

Note that annotations-to-counts has a flanking size option. This option conveniently expands capture regions by the specified number of bases. This is useful because sequencing often captures up to a few hundred bases around the probes designed to capture the sequence. It is not unusual to report enrichment efficiency when extending capture regions by 100 bases on either side. This can result in merging some capture regions that were closer than the extension size and the –flanking-size option handles this transparently.

Estimate coverage statistics

Estimating coverage statistics can then be done in one command. The command take the capture region annotation as input and any number of alignment basename. By defaults it outputs statistics to the console, but in the following line we redirect these statistics to the file coverage-stats.tsv

goby 1g coverage align1 align2 -a capture-regions -o coverage-stats.tsv

After this command completes (typically expect to wait less than 20 seconds per alignment), you should see output similar to the following in the output file:

average-depth-captured  UNJRBQV-421-mut2-chr22-simulated-triangular     -       273.892
average-depth   UNJRBQV-421-mut2-chr22-simulated-triangular     -       286.249
enrichment-efficiency   UNJRBQV-421-mut2-chr22-simulated-triangular     99.9677%        -
depth-captured  UNJRBQV-421-mut2-chr22-simulated-triangular     90      63
depth-captured  UNJRBQV-421-mut2-chr22-simulated-triangular     75      182
depth-captured  UNJRBQV-421-mut2-chr22-simulated-triangular     50      304
depth-captured  UNJRBQV-421-mut2-chr22-simulated-triangular     10      469
depth-captured  UNJRBQV-421-mut2-chr22-simulated-triangular     1       527
percent-capture-sites-at-depth  UNJRBQV-421-mut2-chr22-simulated-triangular     98      5
percent-capture-sites-at-depth  UNJRBQV-421-mut2-chr22-simulated-triangular     97      10
percent-capture-sites-at-depth  UNJRBQV-421-mut2-chr22-simulated-triangular     96      15
percent-capture-sites-at-depth  UNJRBQV-421-mut2-chr22-simulated-triangular     95      20
percent-capture-sites-at-depth  UNJRBQV-421-mut2-chr22-simulated-triangular     94      30

The output above was generated with a synthetic dataset where reads were sampled from a region on chromosome 22. We generated 10,000 reads over a small region and the coverage is therefore quite high. The first line of output indicates that the average coverage over all sites where a base is observed is ~274. Average depth of coverage over annotation (second line) is higher still: ~286.
The third line indicates that enrichment efficiency was 99.9677%, which is expected since reads were generated that matched only the capture region.
The lines that start with depth-captured contain the basename of the alignment that was studied (useful when you process many alignments in one run of coverage), following by two numbers: p and d. The number p is the percentile of sites that have depth>=d.
Therefore, the data presented above should be understood as

90% of captured sites have depth>= 63
75% of captured sites have depth>= 182
50% of captured sites have depth>= 304
1% of captured sites have depth>= 527

Lines that start with percent-capture-sites-at-depth look at the coverage from a slightly different perspective. They show what fraction of the captured sites (sites within annotation boundaries) have a depth of coverage >=d, where d={5,10,15,20,30}.
For instance, the last line of output indicates that 94% of captured sites have depth>=30.
The default percentiles and depths can be configured with the –depths and –percentiles options.

Note that the output of coverage is grep-friendly. It is easy to get all statistics pertaining to enrichment efficiency with

grep enrichment-efficiency coverage-stats.tsv

(the resulting file is tab delimited).