Goby includes a greedy algorithm to realign sequence variations around indels. The algorithm is very fast, allowing to process about 90,000 mapped reads per second. This makes it possible to use realignment near indels not only as a pre-processing step, but also on the fly when comparing sequence variations in groups of samples. This tutorial demonstrates how to enable realignment near indels for pre-processing or on the fly.
Realignment as a pre-processing step
The mode concatenate-alignments takes a set of alignments and writes a single alignment that contains all the information of the input files into one destination file. Since Goby 1.9.7, this mode supports the options –processor and –genome. The –processor option can be used to indicate that aligned reads should be realigned in the proximity of indels. This is accomplished with –processor realign_near_indels. The –genome option must point to the basename of a compressed genome (compressed genomes are built with the mode build-sequence-cache from a reference genome fasta or compact-read file).
For this tutorial, we will assume that you have aligned two sets of reads against the human genome and produced align1 and align2. We start by building a compressed genome from the 1000g reference we used to align the reads against:
goby 3g build-sequence-cache /data/1000g/human_g1k_v37.fasta.gz -b human
Compressing the genome needs to be only once per genome. When the command completes, you see the following files:
Collectively, these files are a compressed genome. You can refer to this genome with the basename, that is the string immediately preceding the various filename extensions (conveniently, this string is “human” in this tutorial).
You are now read to pre-process the alignment files align1 and align2 and generate an output alignment where variations have been realigned around indels. You can do this as follows:
goby 2g concatenate-alignments align1 align2 -o realigned_12 --processor realign_near_indels --genome human
The command will display progress as it concatenates and simultaneously realigns reads around indels. On desktop machines, we have observed about 90,000 reads per seconds, so the command should terminate in a few minutes if you have less than tens of million reads.
Realignment as a processing step is useful if you need to visually inspect the realigned alignments (for instance by loading in in IGV to compare to a concatenate without realignment).
On the fly realignment
Since the algorithm used by Goby is very fast, and because the Goby framework design makes it easy to combine realignment with other operations, it is very simple to perform realignment on the fly with mode discover-sequence-variants.
To learn how to use discover-sequence-variants, please consult these detailed tutorials (discover-sequence-variants, estimate methylation rates). In order to enable on the fly local realignment, you simply provide the same –processor and –genome options that we described for concatenate-alignments, but this time to the appropriate discover-sequence-variants command line.
For instance, if you need to call genotypes, e.g.,
goby 3g discover-sequence-variants -f genotypes align1 align2 ...
to enable local realignment, you would simply enter:
goby 3g discover-sequence-variants -f genotypes align1 align2 ... --processor realign_near_indels --genome human