Goby provides an efficient implementation to filter out non-unique reads from a large set of reads.
Proceed as follows. The examples below use a small input file, data/with-redundancy.fasta, containing six reads:
>0
AAAAAAA
>1
AAAAAAA
>2
ACACACA
>3
ACACACA
>4
ACACACA
>5
ACATTTT
If your reads are in FASTA or FASTQ format, first convert them to the compact format. This can be done as follows:
java -Xmx3g -jar goby.jar -m fasta-to-compact data/with-redundancy.fasta
(The file with-redundancy.compact-reads should have been created.)
Use the tally-reads mode to calculate how many times each sequence appears in the input:
java -Xmx3g -jar goby.jar -m tally-reads -i data/with-redundancy.compact-reads -o myfilter
The tally-reads mode leverages sequence digests and works in two passes to minimize memory usage. Input files can contain tens of millions of reads.
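To make the two-pass idea concrete, here is a minimal Python sketch. It is not Goby's implementation: the choice of SHA-1 as the digest, the function names, and the convention that the first occurrence of a sequence carries the full count while later duplicates get 0 are all assumptions made for illustration (the last one is inferred from the set-to-text output shown further down this page).

```python
import hashlib
from collections import Counter


def digest(sequence: str) -> bytes:
    """Fixed-size stand-in for a read sequence (SHA-1 is an illustrative choice)."""
    return hashlib.sha1(sequence.encode("ascii")).digest()


def tally_reads(read_source):
    """Return {read_index: multiplicity} using two passes over the reads.

    read_source is a zero-argument callable that returns a fresh iterator of
    sequences, so the input can be streamed twice rather than held in memory.
    """
    # Pass 1: count how often each digest occurs; sequences are not stored.
    counts = Counter(digest(seq) for seq in read_source())

    # Pass 2: the first read with a given digest receives the full count,
    # every later duplicate receives a multiplicity of 0.
    seen = set()
    multiplicity = {}
    for index, seq in enumerate(read_source()):
        d = digest(seq)
        if d in seen:
            multiplicity[index] = 0
        else:
            seen.add(d)
            multiplicity[index] = counts[d]
    return multiplicity


reads = ["AAAAAAA", "AAAAAAA", "ACACACA", "ACACACA", "ACACACA", "ACATTTT"]
print(tally_reads(lambda: iter(reads)))
# prints {0: 2, 1: 0, 2: 3, 3: 0, 4: 0, 5: 1}
```

Memory use in this sketch grows with the number of distinct digests, not with the total length of the reads, which is the general reason digest-based tallies scale to very large inputs.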
Convert back to FASTA, excluding repeat occurrences of sequences that appear more than once:
java -Xmx3g -jar goby.jar -m compact-to-fasta -i data/with-redundancy.compact-reads -f myfilter-keep.filter -o unique-reads.fa
The file unique-reads.fa excludes repeat occurrences of reads whose sequence appears more than once in the input.
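The effect of applying a keep filter can be sketched in a few lines of Python. This is only an illustration of the filtering step, not what compact-to-fasta does internally: the function name is hypothetical, the headers are assumed to be plain read indices such as ">3", and the keep set {0, 2, 5} assumes that the first occurrence of each sequence is the one retained.

```python
def filter_fasta(fasta_text: str, keep: set) -> str:
    """Return only the FASTA records whose numeric header is in `keep`."""
    kept_lines = []
    keeping = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Header line: the first token after '>' is taken as the read index.
            keeping = int(line[1:].split()[0]) in keep
        if keeping:
            kept_lines.append(line)
    return "".join(line + "\n" for line in kept_lines)


fasta = (
    ">0\nAAAAAAA\n>1\nAAAAAAA\n>2\nACACACA\n"
    ">3\nACACACA\n>4\nACACACA\n>5\nACATTTT\n"
)
print(filter_fasta(fasta, keep={0, 2, 5}), end="")
```

Under those assumptions, the result contains one record per distinct sequence: reads 0, 2 and 5.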
Starting with Goby version 1.4.1, you can also convert the compressed read set to text format to obtain multiplicity information for each read in the input:
java -jar goby.jar -m set-to-text myfilter -o out.tsv
The file out.tsv should now contain:
queryIndex	multiplicity
0	2
1	0
2	3
3	0
4	0
5	1
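A table like this is easy to post-process. The Python sketch below summarizes it, assuming the file is tab-separated with a header row naming the columns queryIndex and multiplicity, and assuming that a multiplicity of 0 marks a repeat occurrence of a sequence already counted on an earlier read. The function name and the summary fields are hypothetical.

```python
import csv
import io


def summarize_multiplicity(tsv_text: str) -> dict:
    """Summarize a queryIndex/multiplicity table given as tab-separated text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = [(int(r["queryIndex"]), int(r["multiplicity"])) for r in reader]
    # Reads with multiplicity > 0 are taken to be the retained representatives.
    kept = [index for index, count in rows if count > 0]
    return {
        "total_reads": len(rows),
        "distinct_sequences": len(kept),
        "redundant_reads": len(rows) - len(kept),
        # (queryIndex, multiplicity) of the most frequent sequence, if any.
        "most_frequent": max(rows, key=lambda row: row[1], default=None),
    }


table = "queryIndex\tmultiplicity\n0\t2\n1\t0\n2\t3\n3\t0\n4\t0\n5\t1\n"
print(summarize_multiplicity(table))
# prints {'total_reads': 6, 'distinct_sequences': 3, 'redundant_reads': 3, 'most_frequent': (2, 3)}
```

To run it on a real file, read out.tsv into a string first, for example with open("out.tsv").read().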
The read set filter can also be provided directly as an argument to the align mode (option -f|--read-index-filter). In this case, the align mode removes redundant reads and aligns only the unique reads.