Memory-Efficient All-Pair Suffix-Prefix Overlaps on GPU

Abstract

Obtaining overlaps between all pairs from billions of strings is a fundamental computational challenge in de novo whole genome assembly. This paper presents a memory-efficient, massively parallel algorithm that uses GPUs to find overlaps from large sequencing datasets with a very low probability of false positives. Here we use a Rabin-fingerprint-based indexing method which stores the strings with their fingerprints and uses them to generate the fingerprints of suffixes and prefixes. We then sort these fingerprints in hybrid CPU-GPU memory and stream them in GPU to find matches. Experiments show that our implementation can detect a trillion highly probable overlaps within 1.5 billion DNA fragments in just over two hours. Compared to the existing CPU-based approach, it routinely achieves speedups of 5–15x while having a collision rate of 1 in 20 million.

Publication
In The 23rd International Conference on Computational Science.
Sayan Goswami
Sayan Goswami
Postdoctoral Researcher

My research interests include distributed robotics, mobile computing and programmable matter.