Sorting and indexing alignment files improves the performance of analysis tools that need to process only subsets of genomic locations.

Sorting small alignments

Goby provides a mode to sort and index an alignment. The mode is called sort and takes an unsorted alignment as input. It produces one alignment as output that is sorted by genomic position and indexed. The following command will sort the alignment identified by input-basename:

goby 4g sort input-basename.entries -o sorted-basename

(The goby command will be available if you have added the Goby installation directory to your execution PATH. If not, you can type java -Xmx4g -jar <path-to-goby>/goby.jar instead of goby 4g.)

The sort mode loads all the entries contained in the input basename into memory before sorting and writing the results. This mode may need as much as 35 times as much memory as the size of the alignment file on disk, because alignment entries have to be held in memory in decompressed form to perform the sort. The sort mode on its own is therefore not practical for sorting large alignment files.
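
The 35x rule of thumb above can be turned into a quick check before choosing a strategy. The following is a minimal sketch only: the helper names estimate_sort_memory_bytes and fits_in_memory are hypothetical and not part of Goby, and the check treats the whole Java heap as available for alignment entries, which is optimistic.

```shell
# Hypothetical helpers, not part of Goby.
# Worst case from the text: up to 35 times the on-disk size of the alignment.
estimate_sort_memory_bytes() {
    # $1: size of the .entries file in bytes
    echo $(( $1 * 35 ))
}

# Succeeds when the worst-case estimate fits in the given heap size.
# $1: file size in bytes, $2: heap size in bytes (4g = 4294967296)
fits_in_memory() {
    [ "$(estimate_sort_memory_bytes "$1")" -le "$2" ]
}

# Example: decide between a single sort and the split strategy.
if fits_in_memory 104857600 4294967296; then
    echo "a single in-memory sort is plausible"
else
    echo "split the alignment before sorting"
fi
```

With these numbers, a 100MB file (104857600 bytes) has a worst-case estimate of 3670016000 bytes, which is below a 4g heap, while a 200MB file would exceed it.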

Sorting large alignments

Sorting a large alignment file can be done in a few steps:

  1. Split the large file into smaller alignments (see reformat mode)
  2. Sort each smaller alignment
  3. Concatenate the sorted alignments to yield a globally sorted output

Goby supports this strategy as follows. Steps 1 and 2 are both implemented in the sort mode, which accepts --start-position and --end-position arguments to specify which subset of the input file should be sorted. For example, assume large.entries is a 30MB file that we need to split into three parts: from 0 to 10MB, from 10MB to 20MB, and from 20MB to the end of the file. The positions are given as byte offsets into the file (10MB = 10485760 bytes). We can split and sort each part with these three commands:

goby 4g sort large.entries -o large-part-1 --start-position 0 --end-position 10485760
goby 4g sort large.entries -o large-part-2 --start-position 10485760 --end-position 20971520
goby 4g sort large.entries -o large-part-3 --start-position 20971520

These commands will yield three sorted alignments: large-part-1, large-part-2 and large-part-3.
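
For other file sizes or part counts, the offsets can be computed rather than typed by hand. The sketch below is an illustration under stated assumptions: print_sort_commands is a hypothetical helper that is not part of Goby, it divides the file into equal byte ranges, and it only prints the commands so they can be reviewed before being run. It uses no Goby options beyond those shown above.

```shell
# Hypothetical helper, not part of Goby: prints one "goby 4g sort" command
# per slice of the input file. Nothing is executed.
# $1: input file, $2: file size in bytes, $3: number of parts, $4: output prefix
print_sort_commands() {
    input=$1; size=$2; parts=$3; prefix=$4
    # Slice width in bytes, rounded up so the slices cover the whole file.
    chunk=$(( (size + parts - 1) / parts ))
    i=1
    while [ "$i" -le "$parts" ]; do
        start=$(( (i - 1) * chunk ))
        if [ "$i" -lt "$parts" ]; then
            end=$(( i * chunk ))
            echo "goby 4g sort $input -o $prefix-$i --start-position $start --end-position $end"
        else
            # As in the example above, the last slice omits --end-position
            # so that it runs to the end of the file.
            echo "goby 4g sort $input -o $prefix-$i --start-position $start"
        fi
        i=$(( i + 1 ))
    done
}

# Reproduces the three commands above for a 30MB (31457280 byte) file.
print_sort_commands large.entries 31457280 3 large-part
```

Choose the number of parts so that each slice stays well within the memory available to a single sort.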

We are left with step 3: concatenating these sorted alignments. This can be done with:

goby 4g concatenate-alignments large-part-1 large-part-2 large-part-3 -o large-sorted
goby 1g compact-file-stats large-sorted.entries

The last command should report that the resulting alignment file is sorted and indexed.
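
When there are many parts, the list of basenames for the concatenation step can be generated as well. This is a minimal sketch: print_concat_command is a hypothetical helper that is not part of Goby, and it assumes the parts were named prefix-1, prefix-2 and so on, as in the example above. It prints the command without running it.

```shell
# Hypothetical helper, not part of Goby: prints the concatenate-alignments
# command for parts named <prefix>-1 .. <prefix>-N.
# $1: part prefix, $2: number of parts, $3: output basename
print_concat_command() {
    prefix=$1; parts=$2; output=$3
    names=""
    i=1
    while [ "$i" -le "$parts" ]; do
        names="$names $prefix-$i"
        i=$(( i + 1 ))
    done
    echo "goby 4g concatenate-alignments$names -o $output"
}

# Reproduces the concatenation command shown above.
print_concat_command large-part 3 large-sorted
```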

Fast positional retrieval in sorted indexed alignments

When an alignment has been sorted by genomic position, positional retrieval becomes much more efficient. For instance, retrieving sequence variations that occur in reads that mapped to chromosomes 1, 2 or X can be done efficiently by using the alignment index:

goby 1g display-sequence-variations sorted-basename --include-reference-names 1,2,X

The previous command will read only the parts of sorted-basename that contain entries that map to the specified reference sequences. The option --include-reference-names (or -r for short) is supported by the following Goby modes:

  • alignment-to-text [*]
  • alignment-to-annotation-counts [*]
  • alignment-to-counts
  • alignment-to-transcript-counts
  • count-archive-to-peak-annotations
  • count-archive-to-peak-union-annotations
  • count-to-bedgraph
  • counts-to-wiggle
  • display-sequence-variations [*]
  • sequence-variation-stats [*]
  • sort [*]
  • tally-bases
  • within-group-variability [*]

The modes marked with an asterisk ([*]) leverage alignment indices when the -r/--include-reference-names option is used.
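
In a wrapper script it can be handy to know whether restricting a mode by reference name will actually benefit from the index. The sketch below simply encodes the asterisked modes from the list above. The function name uses_index_with_r is hypothetical and not part of Goby, and the list reflects this page only, so it may differ in other Goby versions.

```shell
# Hypothetical helper, not part of Goby. The mode names are copied from the
# list above (entries marked [*]).
# $1: Goby mode name. Succeeds if the mode leverages the alignment index
# when the reference-name option is used.
uses_index_with_r() {
    case "$1" in
        alignment-to-text|alignment-to-annotation-counts|\
        display-sequence-variations|sequence-variation-stats|\
        sort|within-group-variability)
            return 0 ;;
        *)
            return 1 ;;
    esac
}

# Example: report whether a mode will skip unrelated parts of the file.
if uses_index_with_r display-sequence-variations; then
    echo "index will be used"
fi
```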