A scalable, high-performance framework for bioinformatics workloads, and the Aggregate Genomic Data (AGD) format
| Larus James |
The rapidly advancing genome sequencing technology is generating large and growing datasets. Along with this deluge of data, the technology has now reached a nearly feasible cost trajectory. Compared to a cost of about $100 million to sequence a whole human genome in 2001, it is now down to almost $100. However, existing software systems would require several hours to process a single genome and usually run only on a single computer. Besides, genomic data is stored in file formats that are not conducive for parallel or distributed computation.
To address this challenge, we developed Persona, a scalable, high-performance framework for bioinformatics workloads, and the Aggregate Genomic Data (AGD) format. Persona and AGD have two primary objectives. First, they are designed to provide essential functionalities for bioinformatics computation, such as read alignment, sorting, duplicate marking, filtering, and variant calling. Second, the implementation of both Persona and AGD allows straightforward integration of new capabilities like different alignment algorithms or new data fields.
In our case study on sequence alignment, Persona could sustain 1.353 gigabases aligned per second with 101 base pair reads on a 32-node cluster and align a full genome in ~16.7 seconds using the SNAP algorithm.
Our results demonstrate that:
- Alignment computation with Persona scales linearly across servers with no measurable completion-time imbalance and negligible framework overheads.
- On a single server, sorting with Persona and AGD is up to 2.3× faster than commonly used tools, while duplicate marking is 3× faster.
- With AGD, a 7 node COTS network storage system can service up to 60 alignment compute nodes.
- Server cost dominates for a balanced system running Persona, while long-term data storage dwarfs the cost of computation.
We also introduce ClusterMerge, a new algorithm for precise clustering that uses transitive relationships among the elements to enable parallel and scalable implementations of this approach. Our single-node, shared-memory design scales nearly linearly and achieves a speedup of 21 × on a 24 core machine. Our distributed design achieves a speedup of 604 × while maintaining a strong scaling efficiency of 79% on a distributed cluster of 768 cores (90% on larger datasets).
ClusterMerge and our implementations for protein sequence clustering are open-sourced.