Goby supports estimating methylation rates over arbitrary regions of the genome. This functionality was developed by Nyasha Chambwe, and is supported by a new output format for the Goby mode discover-sequence-variants.

This tutorial describes how to use this functionality to obtain estimates of methylation rates over regions of the genome.

Before we go into the details of how to run the analysis, here are a few snapshots visualizing data produced in this way. These data were generated by analyzing a published dataset with GobyWeb. Reads were generated for the following study and were retrieved from SRA (GEO Accession GSE26758):

Genes methylated by DNA methyltransferase 3b are similar in mouse intestine and human colon cancer. Steine EJ, Ehrich M, Bell GW, Raj A, Reddy S, van Oudenaarden A, Jaenisch R, Linhart HG. J Clin Invest. 2011 May;121(5):1748-52. doi: 10.1172/JCI43169.

The next plot shows CpG island methylation estimated across the genome.
Methylation estimates over CpG islands estimated with Goby/GobyWeb


GobyWeb can generate estimates of methylation rates for regions, as well as for individual sites. The following view presents these two types of data in IGV for the Dnmt3b dataset. Regions over promoters are shown on top, while individual sites are shown at the bottom of the visualization (the two strands are shown separately in the base-level view). Combining region-level and site-level estimates makes it possible to study DNA methylation changes at different scales.

Visualizing both region and site estimates of methylation rates with IGV

Estimating methylation rates over regions

You can generate region-level methylation rate estimates with the discover-sequence-variants mode of Goby. See the main tutorial for this mode to learn how to load alignment files and associate individual samples with groups. Use the methylation_region output format to generate tab-delimited files. This format requires an annotation file, which you provide with the following command-line option:

-x MethylationRegionsOutputFormat:annotations=<path>/annotations.tsv

See the Annotation files section below for the format expected in annotations.tsv.

Files generated by the methylation_region output format can be renamed with an .igv extension and loaded into IGV. The base-level estimates are produced with the methylation output format and are written in VCF format. Index these files with tabix -p vcf <file> before loading them into IGV.
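
The renaming step can be scripted. The following is a minimal Python sketch, not part of Goby: the function name and the file name in the usage line are hypothetical, and the sketch copies the file rather than renaming it so that the original output is preserved.

```python
import shutil
from pathlib import Path


def copy_with_igv_extension(region_output: str) -> Path:
    """Copy a methylation_region output file next to itself with an .igv extension."""
    source = Path(region_output)
    target = source.with_suffix(".igv")  # e.g. regions.tsv -> regions.igv
    shutil.copyfile(source, target)      # the original file is left untouched
    return target


# Hypothetical usage: copy_with_igv_extension("regions.tsv")
```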

Annotation files

The annotation files must have the following columns (see header in the example below):

Chromosome  Strand  Annotation ID             Segment ID                Start(bp)  End(bp)
1           NA      1_CpG: 116_strong-island  1_CpG: 116_strong-island  28735      29810
1           NA      2_CpG: 30_weak-island     2_CpG: 30_weak-island     135124     135563

Each annotation has an ID and one or more segment IDs. An annotation has several segments when several lines share the same annotation ID but carry distinct segment IDs. The segments of an annotation need not be contiguous, but they must lie on the same chromosome. Methylation rates are estimated by summing the methylated and non-methylated cytosines that overlap the segments of each annotation, and are written under the annotation ID. In the CpG island example above, each annotation has a single segment (the annotation ID and segment ID are equal on every line).
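
To make the aggregation concrete, here is a minimal Python sketch of one way to compute a region-level rate from per-site counts. It is an illustration, not Goby's implementation. It assumes that the rate is the fraction methylated / (methylated + non-methylated), that segment coordinates are inclusive, and that a site overlapping several segments is counted once. The function name, segment coordinates, and site counts are hypothetical.

```python
def region_methylation_rate(segments, sites):
    """Estimate the methylation rate of one annotation.

    segments: list of (start, end) pairs, assumed inclusive, on one chromosome.
    sites: list of (position, methylated_count, unmethylated_count) tuples.
    Returns None when no counted cytosine overlaps the segments.
    """
    methylated = unmethylated = 0
    for position, m, u in sites:
        # Count a site once if it falls inside any segment of the annotation.
        if any(start <= position <= end for start, end in segments):
            methylated += m
            unmethylated += u
    total = methylated + unmethylated
    return methylated / total if total else None


# Hypothetical annotation with two non-contiguous segments.
segments = [(100, 200), (300, 400)]
# The site at position 250 lies between the segments and is ignored.
sites = [(150, 8, 2), (250, 5, 5), (350, 1, 9)]
rate = region_methylation_rate(segments, sites)
print(rate)  # 0.45, i.e. (8 + 1) / (8 + 2 + 1 + 9)
```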

Chromosome  Strand  Annotation ID  Segment ID  Start(bp)  End(bp)
1           NA      GENE_1         EXON_1      28735      29810
1           NA      GENE_1         EXON_2      29735      30810

In the example above, two segments are defined to represent exons attached to GENE_1. The methylation rates will be estimated over the two exon regions and written with the GENE_1 identifier.
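
Before running Goby, it can help to check that an annotation file groups segments as intended. The Python sketch below is a hypothetical helper, not part of Goby. It assumes the file is tab-delimited with one header line and six columns in the order shown above, groups segments by annotation ID, and rejects annotations whose segments lie on different chromosomes.

```python
import csv
import io
from collections import defaultdict


def load_annotations(handle):
    """Return {annotation_id: [(chromosome, segment_id, start, end), ...]}."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header line
    annotations = defaultdict(list)
    for row in reader:
        if not row:
            continue  # ignore blank lines
        if len(row) != 6:
            raise ValueError(f"expected 6 columns, found {len(row)}: {row}")
        chromosome, _strand, annotation_id, segment_id, start, end = row
        segments = annotations[annotation_id]
        # All segments of one annotation must lie on the same chromosome.
        if segments and segments[0][0] != chromosome:
            raise ValueError(f"{annotation_id} has segments on several chromosomes")
        segments.append((chromosome, segment_id, int(start), int(end)))
    return dict(annotations)


# The GENE_1 example from this section, written with tab separators.
example = (
    "Chromosome\tStrand\tAnnotation ID\tSegment ID\tStart(bp)\tEnd(bp)\n"
    "1\tNA\tGENE_1\tEXON_1\t28735\t29810\n"
    "1\tNA\tGENE_1\tEXON_2\t29735\t30810\n"
)
annotations = load_annotations(io.StringIO(example))
print([segment_id for _, segment_id, _, _ in annotations["GENE_1"]])
# ['EXON_1', 'EXON_2']
```

Both exons end up under the single key GENE_1, which mirrors how rates for the two segments are reported under one identifier.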