Maximize resource utilization and concurrently achieve high parallelism and load balance


Big data computing has emerged as a powerful paradigm not only for Internet companies but also for conventional business institutions and government agencies. According to a study by International Data Corporation, the market is expected to grow from $130.1 billion in 2016 to more than $203 billion by the end of the decade. That is why cash-rich companies are investing heavily in artificial intelligence and related fields. What has been missing so far is a similar thrust in developing new design principles and software architectures for analyzing large-scale data. That is the core area of a study in progress at EPFL’s Operating Systems Laboratory (LABOS).

Lead researcher Willy Zwaenepoel and his team are developing innovative design principles that could provide a good match between big data algorithms and the underlying computing and storage resources. They have identified two main problem areas in big data analysis: graph-structured data and data skew.

Graph analytics relies on complicated algorithms that consume considerable computational resources when processing peta-scale data in diverse fields such as social networking, medicine, bioinformatics, content analysis, and search engines. Available computing resources and memory, however, have failed to keep up with the increasing scale of data analytics.

The existing approach to processing large graphs is to store them in the main memory of a single machine or of several machines. In contrast to this in-memory approach, Professor Zwaenepoel’s method processes graphs from secondary storage. This approach uses only a fraction of the resources required by the conventional method. The research team has developed two systems to implement this scheme: X-Stream, which processes graphs from secondary storage on a single machine, and SlipStream, which works with distributed storage in a cluster.
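To make the contrast with the in-memory approach concrete, here is a minimal sketch of processing a graph from secondary storage. It is an illustration only, not the X-Stream or SlipStream implementation: the text edge-list format, the in-degree computation, and the names `stream_edges`, `count_in_degrees`, and `write_demo_graph` are all assumptions made for the example. The edges are read sequentially from a file and never held in memory as a whole; only the small per-vertex state is.

```python
import os
import tempfile
from collections import defaultdict


def stream_edges(path):
    """Yield (src, dst) pairs one at a time from a text edge list on disk."""
    with open(path) as f:
        for line in f:
            src, dst = line.split()
            yield int(src), int(dst)


def count_in_degrees(path):
    """Make one sequential pass over the edge file.

    Only the per-vertex counters stay in memory; the edge list itself
    is streamed from storage (hypothetical example computation).
    """
    in_degree = defaultdict(int)
    for _src, dst in stream_edges(path):
        in_degree[dst] += 1
    return dict(in_degree)


def write_demo_graph():
    """Write a tiny made-up edge list to a temporary file and return its path."""
    edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
    fd, path = tempfile.mkstemp(suffix=".edges")
    with os.fdopen(fd, "w") as f:
        for src, dst in edges:
            f.write(f"{src} {dst}\n")
    return path
```

Calling `count_in_degrees(write_demo_graph())` walks the file once from start to end. In this sketch the memory footprint grows with the number of vertices rather than the number of edges, which is typically the far larger quantity in real graphs.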

Professor Zwaenepoel’s research also proposes a new approach to taming data skew, that is, a non-uniform distribution of values in a dataset. Data skew in complex database queries results in poor load balancing and increased response time. By targeting these problems, the research aims to maximize resource utilization and concurrently achieve high parallelism and load balance.
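The effect of skew on load balance can be seen in a small sketch. It assumes, purely for illustration, that records are assigned to workers by key (here, key modulo the number of workers); this partitioning scheme, the sample data, and the names `partition_loads` and `imbalance` are hypothetical and are not taken from the research described above.

```python
def partition_loads(keys, num_workers):
    """Assign each record to a worker by key modulo num_workers.

    Returns the number of records each worker receives.
    """
    loads = [0] * num_workers
    for key in keys:
        loads[key % num_workers] += 1
    return loads


def imbalance(loads):
    """Ratio of the busiest worker's load to the mean load (1.0 is perfectly balanced)."""
    return max(loads) / (sum(loads) / len(loads))


# Two made-up datasets of 80 records each, spread over keys 0..7.
uniform_keys = list(range(8)) * 10               # every key appears 10 times
skewed_keys = [0] * 66 + list(range(1, 8)) * 2   # key 0 dominates

uniform_loads = partition_loads(uniform_keys, 4)
skewed_loads = partition_loads(skewed_keys, 4)
```

With the uniform keys each of the four workers receives 20 records. With the skewed keys one worker receives 68 of the 80 records while the others receive 4 each, so the query can finish no sooner than that one overloaded worker does, and the remaining workers sit mostly idle.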

Suggested Readings

http://drops.dagstuhl.de/opus/volltexte/2017/7072/pdf/LIPIcs-OPODIS-2016-3.pdf
https://labos.epfl.ch/x-stream
https://www.cl.cam.ac.uk/~ey204/pubs/Dagstuhl_14462_Report.pdf

