A scalable, high-performance framework for bioinformatics workloads, and the Aggregate Genomic Data (AGD) format

Team

Rapidly advancing genome sequencing technology is generating large and growing datasets. Alongside this deluge of data, sequencing costs have fallen dramatically: sequencing a whole human genome cost about $100 million in 2001 and now costs close to $100. However, existing software systems require several hours to process a single genome and usually run on only a single computer. Moreover, genomic data is stored in file formats that are not conducive to parallel or distributed computation.

To address these challenges, we developed Persona, a scalable, high-performance framework for bioinformatics workloads, and the Aggregate Genomic Data (AGD) format. Persona and AGD have two primary objectives. First, they provide essential functionality for bioinformatics computation, such as read alignment, sorting, duplicate marking, filtering, and variant calling. Second, both are implemented so that new capabilities, such as different alignment algorithms or new data fields, can be integrated in a straightforward way.

In our case study on sequence alignment, Persona sustained 1.353 gigabases aligned per second with 101-base-pair reads on a 32-node cluster, aligning a full genome in ~16.7 seconds using the SNAP algorithm.
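The figures above can be cross-checked with some simple arithmetic. The following is a minimal sketch, not part of Persona: the function name `implied_workload` and the dictionary keys are hypothetical, and the per-node rate assumes work is spread evenly across the 32 nodes and that the quoted throughput is sustained for the whole run.

```python
# Figures quoted in the text; everything derived from them is back-of-envelope.
THROUGHPUT_GBASES_PER_S = 1.353  # sustained alignment rate on the cluster
READ_LENGTH_BP = 101             # base pairs per read
ALIGN_TIME_S = 16.7              # approximate full-genome alignment time
CLUSTER_NODES = 32


def implied_workload(gbases_per_s, seconds, read_length_bp, nodes):
    """Derive dataset size and per-node rates from the quoted figures.

    Assumes the throughput holds for the whole run and that the load is
    balanced evenly across nodes (illustrative assumptions).
    """
    bases_per_s = gbases_per_s * 1e9
    total_bases = bases_per_s * seconds
    return {
        "total_gigabases": total_bases / 1e9,
        "total_reads_millions": total_bases / read_length_bp / 1e6,
        "reads_per_s_millions": bases_per_s / read_length_bp / 1e6,
        "megabases_per_s_per_node": bases_per_s / nodes / 1e6,
    }


if __name__ == "__main__":
    workload = implied_workload(
        THROUGHPUT_GBASES_PER_S, ALIGN_TIME_S, READ_LENGTH_BP, CLUSTER_NODES
    )
    for name, value in workload.items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions, the quoted rate and time imply a dataset of roughly 22.6 gigabases (about 224 million reads), with each node aligning on the order of 42 megabases per second.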

Our results demonstrate that:

Alignment computation with Persona scales linearly across servers with no measurable completion-time imbalance and negligible framework overheads.
On a single server, sorting with Persona and AGD is up to 2.3× faster than commonly used tools, while duplicate marking is 3× faster.
With AGD, a 7-node COTS network storage system can service up to 60 alignment compute nodes.
Server cost dominates for a balanced system running Persona, while long-term data storage dwarfs the cost of computation.

We also introduce ClusterMerge, a new algorithm for precise clustering that exploits transitive relationships among elements to enable parallel and scalable implementations. Our single-node, shared-memory design scales nearly linearly and achieves a speedup of 21× on a 24-core machine. Our distributed design achieves a speedup of 604× while maintaining a strong scaling efficiency of 79% on a distributed cluster of 768 cores (90% on larger datasets).
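The relationship between the speedup and efficiency figures above can be sketched as follows. This assumes the usual definition of strong scaling efficiency (speedup divided by core count, with speedup measured against a single core); the helper name `scaling_efficiency` is hypothetical and is not taken from the ClusterMerge code.

```python
def scaling_efficiency(speedup, cores):
    """Strong scaling efficiency: achieved speedup as a fraction of ideal.

    Ideal (linear) scaling on `cores` cores would give a speedup of `cores`,
    so the efficiency is simply speedup / cores.
    """
    if cores <= 0:
        raise ValueError("cores must be positive")
    return speedup / cores


if __name__ == "__main__":
    # Figures quoted in the text.
    shared_memory = scaling_efficiency(speedup=21, cores=24)
    distributed = scaling_efficiency(speedup=604, cores=768)
    print(f"shared memory, 24 cores: {shared_memory:.1%}")
    print(f"distributed, 768 cores:  {distributed:.1%}")
```

Under this definition, 604× on 768 cores works out to about 79%, which matches the quoted efficiency, and 21× on 24 cores corresponds to 87.5%.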

ClusterMerge and our implementations for protein sequence clustering are open-sourced.

Papers: PDF


