The following steps demonstrate how Goby handles the simple data analysis task of aligning RNA-Seq reads to a reference and generating a wiggle track for that alignment.

This example assumes that you have already:

  1. Downloaded the Goby source distribution [here]
  2. Installed the C/C++ Goby API [here]
  3. Configured Goby [here]
  4. Downloaded and installed BWA with native Goby support [here]

You may wish to consult the mode reference page for Goby while walking through the examples.

The analysis is carried out as follows:

  1. Dataset configuration
  2. Align reads to a reference genome
  3. Generating count data and wiggle file for sample visualization with the UCSC Genome Browser.

Dataset Configuration

Reads data

We use the sample dataset included with the Goby distribution in data/reads/. The dataset is included in the FASTA and Goby compact-reads formats.  The Goby compact format makes it possible to parallelize the alignment process efficiently.  However, since most aligners also accept FASTQ format directly, if you are not seeking to parallelize the alignment you can just use the aligner directly.

Reference data

For purposes of this example a single chromosome from the UCSC Mouse July 2007 (mm9) assembly is used for simplicity. The full genome could be used instead.  Alternatively you can download dataset from http://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/chr1.fa.gz and place the file in the directory data/reference/mm9. (Note: this file already included as part of the Goby binary distribution)

Index the reference dataset

The BWA aligner process uses an index mechanism for efficiency. This index may be created once and used along with multiple reads datasets.

bwa index data/reference/mm9/chr1.fa.gz

When this process finishes, the directory data/reference-index/mm9 will contain the files that will be used during the alignment phase.

 data/reference-index/mm9/chr1.fa.gz.amb
 data/reference-index/mm9/chr1.fa.gz.ann
 data/reference-index/mm9/chr1.fa.gz.bwt
 data/reference-index/mm9/chr1.fa.gz.pac
 data/reference-index/mm9/chr1.fa.gz.rbwt
 data/reference-index/mm9/chr1.fa.gz.rpac
 data/reference-index/mm9/chr1.fa.gz.rsa
 data/reference-index/mm9/chr1.fa.gz.sa

Align reads to the reference

At this point the the alignment can be executed using the reference index built previously.

Aligning with FASTA/FASTQ to Goby alignment format:

bwa aln -t 16 -f alignment.sai data/reference/mm9/chr1.fa.gz data/reads/goby-mouse-reads-sample.fasta.gz
bwa samse -F goby -f goby-sample data/reference/mm9/chr1.fa.gz alignment.sai data/reads/goby-mouse-reads-sample.fasta.gz

In the bwa samse step above, the -F argument instructs BWA to write the alignment in the Goby format. This produces the following files:

-rw-r--r-- 1 53    goby-sample.header
-rw-r--r-- 1 12K   goby-sample.tmh
-rw-r--r-- 1 181   goby-sample.stats
-rw-r--r-- 1 791K  goby-sample.entries

Together, these files constitute a Goby compact alignment:

goby-sample.header contains detailed information about the reads and reference sequences used to produce the alignment. The header stores reference sequence lengths, read lengths, the number of reads used to align and other information useful to interpret the alignment.

goby-sample.entries contains information about where the reads map to the reference. This file contains a collection of AlignmentEntries.

goby-sample.tmh stores information about ambiguous reads – reads that were mapped to several positions on the reference.

goby-sample.stats gives statistics for the alignment e.g. the number of aligned reads, percentage of mapped and ambiguous reads, and so on.

Goby compact alignments can be analyzed efficiently with many tools offered in the Goby toolbox.

Aligning with Goby compact reads to Goby alignment format:

FASTA/FASTQ input data files can be converted to the Goby compact format. The Goby read format supports parallelization and the definition of meta-data about the experiment. The following shows how to convert FASTA/Fastq to the Goby read format and how to proceed with the alignment.

goby fasta-to-compact data/reads/goby-mouse-reads-sample.fasta.gz
bwa aln -t 16 -f alignment.sai data/reference/mm9/chr1.fa.gz data/reads/goby-mouse-reads-sample.compact-reads
bwa samse -F goby -f goby-sample data/reference/mm9/chr1.fa.gz alignment.sai data/reads/goby-mouse-reads-sample.compact-reads

Generate wiggle file

Count information can be produced from this alignment (goby 3g is a shortcut for java -Xmx3g -jar goby.jar. the shortcut will work if you have installed Goby and included the distribution directory in your path, see configuration instructions for details).

 goby 3g alignment-to-counts goby-sample

This command produces a highly compressed, base-resolution histogram of read coverage:

-rw-r--r-- 1 204K  goby-sample.counts
These counts can be converted to the wiggle format for visualization on the UCSC genome browser.

goby 3g counts-to-wiggle goby-sample --resolution 1
gobyweb icb 3.4M goby-sample-all.wig.gz

Try uploading the file produced (i.e., goby-sample-all.wig.gz) to the UCSC Genome Browser. If the file is too large for upload, you may have to reduce the resolution argument to bin counts in larger windows. For instance, with windows of 20 bp:

goby 3g counts-to-wiggle goby-sample --resolution 20
278K goby-sample-all.wig.gz

If you prefer BED format to the wiggle format, use the mode counts-to-bedgraph, which operate similarly to counts-to-wiggle.