The following steps demonstrate how Goby handles the simple data analysis task of aligning RNA-Seq reads to a reference and generating a wiggle track for that alignment.
This example assumes that you have already:
- Downloaded the Goby source distribution [here]
- Installed the C/C++ Goby API [here]
- Configured Goby [here]
- Downloaded and installed BWA with native Goby support [here]
You may wish to consult the mode reference page for Goby while walking through the examples.
The analysis is carried out as follows:
- Dataset configuration
- Align reads to a reference genome
- Generating count data and wiggle file for sample visualization with the UCSC Genome Browser.
We use the sample dataset included with the Goby distribution in data/reads/. The dataset is included in the FASTA and Goby compact-reads formats. The Goby compact format makes it possible to parallelize the alignment process efficiently. However, since most aligners also accept FASTQ format directly, if you are not seeking to parallelize the alignment you can just use the aligner directly.
For purposes of this example a single chromosome from the UCSC Mouse July 2007 (mm9) assembly is used for simplicity. The full genome could be used instead. Alternatively you can download dataset from http://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/chr1.fa.gz and place the file in the directory data/reference/mm9. (Note: this file already included as part of the Goby binary distribution)
Index the reference dataset
The BWA aligner process uses an index mechanism for efficiency. This index may be created once and used along with multiple reads datasets.
bwa index data/reference/mm9/chr1.fa.gz
When this process finishes, the directory data/reference-index/mm9 will contain the files that will be used during the alignment phase.
data/reference-index/mm9/chr1.fa.gz.amb data/reference-index/mm9/chr1.fa.gz.ann data/reference-index/mm9/chr1.fa.gz.bwt data/reference-index/mm9/chr1.fa.gz.pac data/reference-index/mm9/chr1.fa.gz.rbwt data/reference-index/mm9/chr1.fa.gz.rpac data/reference-index/mm9/chr1.fa.gz.rsa data/reference-index/mm9/chr1.fa.gz.sa
Align reads to the reference
At this point the the alignment can be executed using the reference index built previously.
Aligning with FASTA/FASTQ to Goby alignment format:
bwa aln -t 16 -f alignment.sai data/reference/mm9/chr1.fa.gz data/reads/goby-mouse-reads-sample.fasta.gz bwa samse -F goby -f goby-sample data/reference/mm9/chr1.fa.gz alignment.sai data/reads/goby-mouse-reads-sample.fasta.gz
In the bwa samse step above, the -F argument instructs BWA to write the alignment in the Goby format. This produces the following files:
-rw-r--r-- 1 53 goby-sample.header -rw-r--r-- 1 12K goby-sample.tmh -rw-r--r-- 1 181 goby-sample.stats -rw-r--r-- 1 791K goby-sample.entries
Together, these files constitute a Goby compact alignment:
goby-sample.header contains detailed information about the reads and reference sequences used to produce the alignment. The header stores reference sequence lengths, read lengths, the number of reads used to align and other information useful to interpret the alignment.
goby-sample.entries contains information about where the reads map to the reference. This file contains a collection of AlignmentEntries.
goby-sample.tmh stores information about ambiguous reads – reads that were mapped to several positions on the reference.
goby-sample.stats gives statistics for the alignment e.g. the number of aligned reads, percentage of mapped and ambiguous reads, and so on.
Goby compact alignments can be analyzed efficiently with many tools offered in the Goby toolbox.
Aligning with Goby compact reads to Goby alignment format:
FASTA/FASTQ input data files can be converted to the Goby compact format. The Goby read format supports parallelization and the definition of meta-data about the experiment. The following shows how to convert FASTA/Fastq to the Goby read format and how to proceed with the alignment.
goby fasta-to-compact data/reads/goby-mouse-reads-sample.fasta.gz
bwa aln -t 16 -f alignment.sai data/reference/mm9/chr1.fa.gz data/reads/goby-mouse-reads-sample.compact-reads bwa samse -F goby -f goby-sample data/reference/mm9/chr1.fa.gz alignment.sai data/reads/goby-mouse-reads-sample.compact-reads
Generate wiggle file
Count information can be produced from this alignment (goby 3g is a shortcut for java -Xmx3g -jar goby.jar. the shortcut will work if you have installed Goby and included the distribution directory in your path, see configuration instructions for details).
goby 3g alignment-to-counts goby-sample
This command produces a highly compressed, base-resolution histogram of read coverage:
-rw-r--r-- 1 204K goby-sample.counts
goby 3g counts-to-wiggle goby-sample --resolution 1
gobyweb icb 3.4M goby-sample-all.wig.gz
Try uploading the file produced (i.e., goby-sample-all.wig.gz) to the UCSC Genome Browser. If the file is too large for upload, you may have to reduce the resolution argument to bin counts in larger windows. For instance, with windows of 20 bp:
goby 3g counts-to-wiggle goby-sample --resolution 20