Goby provides an efficient implementation to filter out non-unique reads from a large set of reads.

Proceed as follows.

For the purpose of this tutorial, we will use a small input Fasta file, data/with-redundancy.fasta, the content of which is shown below.

If your reads are in fasta/fastq format, first convert them to the compact format. This can be done as follows:

java -Xmx3g -jar goby.jar -m fasta-to-compact data/with-redundancy.fasta

(The file data/with-redundancy.compact-reads should now have been created.)

Use the tally-reads mode to calculate how many times each sequence appears in the input:

java -Xmx3g -jar goby.jar -m tally-reads -i data/with-redundancy.compact-reads -o myfilter

The tally-reads mode leverages sequence digests and works in two passes to minimize memory usage. Input files can contain tens of millions of reads.
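To make the two-pass idea concrete, here is a minimal Python sketch. It is not Goby's actual code: the function names, the choice of MD5 as the digest, and the read_reads callable are all assumptions made for illustration. The first pass stores only fixed-size digests, and the second pass streams the reads again to assign a multiplicity to each read.

```python
import hashlib
from collections import Counter


def digest(sequence: str) -> bytes:
    """Fixed-size fingerprint of a read sequence (MD5 is an arbitrary choice here)."""
    return hashlib.md5(sequence.encode("ascii")).digest()


def tally(read_reads):
    """Illustrative two-pass tally; read_reads is a hypothetical callable
    that returns a fresh iterator over the read sequences each time.

    Returns one multiplicity per read: the first occurrence of a sequence
    carries the total count, later occurrences get 0.
    """
    # Pass 1: count digests only, so memory depends on the number of
    # distinct digests rather than on the full sequence text.
    counts = Counter(digest(seq) for seq in read_reads())

    # Pass 2: stream the reads again and assign multiplicities.
    seen = set()
    multiplicities = []
    for seq in read_reads():
        d = digest(seq)
        if d in seen:
            multiplicities.append(0)
        else:
            seen.add(d)
            multiplicities.append(counts[d])
    return multiplicities


# Made-up sequences, not the content of data/with-redundancy.fasta.
reads = ["ACGT", "ACGT", "TTGA", "TTGA", "TTGA", "CCCC"]
print(tally(lambda: iter(reads)))  # prints [2, 0, 3, 0, 0, 1]
```

A real implementation would also have to guard against two different sequences sharing a digest; this sketch ignores that case.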

Convert back to fasta, excluding the repeat occurrences of sequences that appear more than once:

java -Xmx3g -jar goby.jar -m compact-to-fasta -i data/with-redundancy.compact-reads -f myfilter-keep.filter -o unique-reads.fa

The file unique-reads.fa correctly excludes repeat occurrences of reads whose sequence appears more than once in the input.
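One way to double-check the result independently of Goby is to look for duplicated sequences in the output file. The following Python sketch is an illustration only; the helper names fasta_sequences and duplicated_sequences are hypothetical, and the parser assumes plain fasta with ">" header lines.

```python
from collections import Counter


def fasta_sequences(lines):
    """Yield one sequence string per fasta record, joining wrapped lines."""
    chunks = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if chunks is not None:
                yield "".join(chunks)
            chunks = []
        elif line and chunks is not None:
            chunks.append(line)
    if chunks is not None:
        yield "".join(chunks)


def duplicated_sequences(lines):
    """Return {sequence: count} for every sequence seen more than once."""
    counts = Counter(fasta_sequences(lines))
    return {seq: n for seq, n in counts.items() if n > 1}


# In-memory example with made-up records; the second record is line-wrapped.
example = [">a", "ACGT", ">b", "AC", "GT", ">c", "TTTT"]
print(duplicated_sequences(example))  # prints {'ACGT': 2}
```

To check the tutorial output, pass an open file instead of a list, for example duplicated_sequences(open("unique-reads.fa")). An empty result means no sequence occurs twice.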

Starting with Goby version 1.4.1, you can also convert the compressed read set to text format, to obtain multiplicity information for each read in the input:

java -jar goby.jar -m set-to-text myfilter -o out.tsv

The file out.tsv should now contain:

queryIndex	multiplicity
0	2
1	0
2	3
3	0
4	0
5	1
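The table can be post-processed with a few lines of Python. The sketch below assumes, based on the output above, that a multiplicity of 0 marks a repeat occurrence that the filter drops, and that a non-zero value is the total count carried by the retained copy. The function name kept_reads is hypothetical.

```python
import csv
import io


def kept_reads(tsv_text: str):
    """Return (queryIndex, multiplicity) pairs for reads with multiplicity > 0."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = [(int(r["queryIndex"]), int(r["multiplicity"])) for r in reader]
    return [(idx, mult) for idx, mult in rows if mult > 0]


# Same content as the out.tsv listing above.
example = "queryIndex\tmultiplicity\n0\t2\n1\t0\n2\t3\n3\t0\n4\t0\n5\t1\n"
kept = kept_reads(example)
print(kept)                        # prints [(0, 2), (2, 3), (5, 1)]
print(sum(m for _, m in kept))     # prints 6, the number of input reads
```

Under that reading, three of the six reads are retained, and the multiplicities of the retained reads add up to the size of the input.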

The read set filter can also be provided directly as an argument to the align mode (option -f|--read-index-filter). In this case, the align mode removes redundant reads and aligns only the unique reads.