I have recently been interested in reanalyzing datasets from the study of Ajay et al, Genome Research (‘Accurate and comprehensive sequencing of personal genomes’). The data were deposited to the ENA, and the accession code given in the paper (ERP000765), so all seemed good.
I prefer to start with FastQ files, but these were not available for this study. I located BAM files at ENA under the Submitted files tab.
Goby provides a tool to extract reads from BAM files, so I did not expect major difficulties. For single-end reads, I would typically do:
goby 1g sam-extract-reads input.bam -o output.compact-reads
and proceed to analyze the compact-reads files with GobyWeb. The first problem I experienced was that the study used a paired-end design. This is a great idea when you are sequencing a complete genome, but we did not have an option in sam-extract-reads to load two BAM files and put the read and its mate together into one compact reads file. We do have this option in the tool fasta-to-compact, but since the files are tens of Gigabytes each, I did not feel like creating several huge intermediary files, or waiting for the conversions to run ((from BAM to compact-reads, then to fastq) x 2, then to compact again).
Using the options of fasta-to-compact as a template, I added the same capability to sam-extract-reads (this option will be available in our next release).
Here’s what you can now do:
goby 1g sam-extract-reads SRA_GAIIX_RUN_1.bam SRA_GAIIX_RUN_2.bam --paired-end -o output.compact-reads
This would cause the two input files to be loaded and each read combined with its pair and written to output.compact-reads (Goby stores primary and paired read in same file so that we can be sure these sequences never get out of order).
In implementing these options, I was careful to check that the id of the reads matched for the primary read and its paired read. As soon as I started this new tool on the datasets I obtained from ENA, I got an error indicating that the id of the first read in SRA_GAIIX_RUN_1.bam did not match the id of first read in SRA_GAIIX_RUN_2.bam.
I guess I should have expected these BAM files to be sorted in chromosomal order. Fortunately, I saw that samtools sort has an option to resort a BAM file by read id. My plan was to resort each file by id, then use sam-extract-reads.
Here’s the samtools command to resort by id:
samtools sort -n SRA_GAIIX_RUN_1.bam SRA_GAIIX_RUN_namesorted_1.bam
Since the GAIIX file was 23GB, I started the sort overnight. Here’s the message I got this morning:
[bam_sort_core] merging from 316 files…
open: Too many open files
[bam_merge_core] fail to open file SRA_GAIIX_RUN-namesort_1.bam.0252.bam
This does not look good. Let’s look at what happened here. The program samtools sort handles alignments that are larger than can fit in memory by splitting the input in small pieces, each large enough to fit in memory, then sorts the piece, and finally writes the result to disk. The last step of the process is to sort all pieces to write the final sorted output. The problem I encountered was that the default size for each piece resulted in so many files that the merge step failed (most operating systems have a limit on the number of files a process can open at any one time).
The work-around would consist in increasing the default -m parameter to a few GB and run on a machine with sufficient memory. This will reduce the number of files that need to be kept open in the merge step. However, this means that the whole process is now limited in the amount of memory one can get on a server. Consider that server memory is not growing nearly as fast as sequencing capacity and one can envision a point when the dataset is too large to process on any server. This is unlikely to happen tomorrow however, and the biggest problem is that one has to go through several steps (that do not run in parallel and hence are quite slow on such large files) when trying to reanalyze large paired-end datasets provided in BAM files.
It would have been so much simpler if the reads had been deposited in a format that keeps read pairs together. Goby compact reads does this nicely by keeping everything in one file, FASTQ does the trick with two files. Considering this experience, BAM files sorted by genomic location should probably not be the first choice when considering how to distribute full-genome paired-end NGS read data.

Leave a Comment