Barcoding makes it possible to multiplex samples of different origins on a single sequencing lane, which reduces the cost of sequencing per sample. Its obvious drawback is that each barcoded sample yields fewer reads than it would if it had been sequenced in its own lane.
Individual samples are processed to add a short barcode tag at an extremity of each DNA fragment of the sample (the extremity targeted is consistent for all reads in a sample as it is determined by the experimental protocol used to append the barcode).
Goby 1.7+ supports decoding barcodes that were added at the 5′ or the 3′ end of the reads. This tutorial describes how to decode barcodes included in the sequence of the reads.
Let’s assume that the file barcoded-sample.compact-reads is available. (As usual, you can obtain compact-reads files by converting FASTQ or other sequencer-specific formats; see the fasta-to-compact mode for details.)
The barcode information file
This file is used to associate barcode and adapter sequences to barcode indices and sample identifiers. It must be provided by the user, since there is no automatic way to know which barcode was associated with each measured sample. This file has three columns, delimited by a tab character.
Let’s call this file barcode-info.tsv. The first column contains the barcode sequence (in italic in the example provided), directly followed by the adapter sequence (constant across all barcodes). When considering a potential match between a barcode and a read, Goby looks for the longest substring shared by column 1 and the read that occurs at the specified extremity of the read. When a match is detected, the read sequence is clipped to remove that shared substring.
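The matching and clipping rule described above can be illustrated with a minimal Python sketch. This is not Goby's implementation. It assumes the 3′ case, exact matching only (as with --max-mismatches 0), and that a match means a leading portion of column 1 coincides with the end of the read. The function names, the sample-ids, and the barcode and adapter sequences are all hypothetical.

```python
from typing import Dict, Optional, Tuple


def clip_3prime(read: str, tag: str, min_match: int = 5) -> Tuple[str, int]:
    """Return (clipped_read, match_length); match_length is 0 when nothing matches.

    `tag` stands for column 1 (barcode followed by adapter). Assumption: a
    3' match is a prefix of `tag` that is also a suffix of `read`.
    """
    longest = min(len(read), len(tag))
    shortest = max(min_match, 1)  # never accept an empty match
    # Try the longest candidate first so the first hit is the maximal match.
    for length in range(longest, shortest - 1, -1):
        if read.endswith(tag[:length]):
            return read[:-length], length
    return read, 0


def decode(read: str, tags: Dict[str, str],
           min_match: int = 5) -> Optional[Tuple[str, str, int]]:
    """Pick the sample whose tag gives the longest match at the 3' end."""
    best: Optional[Tuple[str, str, int]] = None
    for sample_id, tag in tags.items():
        clipped, length = clip_3prime(read, tag, min_match)
        if length and (best is None or length > best[2]):
            best = (sample_id, clipped, length)
    return best


# Hypothetical barcodes (4 bases) followed by a hypothetical shared adapter.
ADAPTER = "TCGTATGCCG"
TAGS = {"patient-1": "ACGT" + ADAPTER, "patient-2": "TTAG" + ADAPTER}

# The read ends partway through the adapter, so only 10 bases of the tag match.
READ = "GGGCCCAAATTT" + "ACGTTCGTAT"
print(decode(READ, TAGS))
```

Trying candidate lengths from longest to shortest keeps the sketch close to the "maximal substring" wording. Tolerating mismatches would require replacing the exact endswith test with a mismatch count.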
Use barcodes to split into multiple files
Suppose you would like to split reads into separate files, such that each file holds the reads that carry the same barcode:
java -jar goby.jar --mode barcode-decoder barcoded-sample.compact-reads --barcode-info barcode-info.tsv --extremity 3_PRIME --minimal-match-length 5 --max-mismatches 0
This command will scan the input file (barcoded-sample.compact-reads) and compare each read to each barcode to determine the read's barcode. When no output is specified on the command line, the reads that belong to each barcode are written to separate files, named after the sample-id in the barcode-info.tsv file.
For instance, if you provided a barcode-info file with two barcodes/sample-ids, as shown above, the previous command will generate two new files, named patient-1.compact-reads and patient-2.compact-reads. Each file will contain only the reads that had the barcode corresponding to sample-id at their 3′ end. The reads are written after clipping the sequence to remove the barcode and adapter.
Reads in the output compact-reads files are recorded with the protocol buffer attribute barcodeIndex.
Writing to a single output file
It is also possible to specify an output argument, as in:
java -jar goby.jar --mode barcode-decoder barcoded-sample.compact-reads -o all.compact-reads --barcode-info barcode-info.tsv --extremity 3_PRIME --minimal-match-length 5 --max-mismatches 0
In this case, all.compact-reads will contain all the reads that could be associated with a barcode. Such reads will have the barcodeIndex attribute set on the Read protocol buffer message used in compact-reads files. This mode of operation can be useful for developers who write programs that use barcodeIndex directly. For instance, a program could keep only the reads whose barcodeIndex belongs to a specific set.
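The kind of filtering mentioned above could look like the following Python sketch. It does not use the Goby API: the Read class here is a hypothetical stand-in for the Read protocol buffer message, of which only the barcodeIndex attribute is named in this tutorial, and the other fields and the function name are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Set


@dataclass
class Read:
    # Hypothetical stand-in for the Read message; only barcode_index
    # mirrors an attribute (barcodeIndex) named in the tutorial.
    identifier: str
    sequence: str
    barcode_index: int


def filter_by_barcode(reads: Iterable[Read], wanted: Set[int]) -> Iterator[Read]:
    """Yield only the reads whose barcode index is in `wanted`."""
    for read in reads:
        if read.barcode_index in wanted:
            yield read


# Made-up reads tagged with made-up barcode indices.
reads = [
    Read("r1", "GGGCCC", 0),
    Read("r2", "AAATTT", 1),
    Read("r3", "CCCGGG", 2),
]
kept = [read.identifier for read in filter_by_barcode(reads, {0, 2})]
print(kept)
```

Writing the filter as a generator means it can be applied to a stream of reads without loading the whole file into memory.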