Co-located Deep Learning Training and Inference


Team

  Babak Falsafi
  Martin Jaggi


Although they have repeatedly made headlines across diverse applications, Deep Neural Networks (DNNs) remain a nascent machine learning methodology, leaving ample scope for exploring different hardware architectures to boost performance. DNNs are gaining traction because of the growing need for accurate analytics of the images, videos, and speech held in datacenter repositories worldwide.

With DNN algorithms evolving rapidly, several studies have asked whether Field-Programmable Gate Arrays (FPGAs) can beat Graphics Processing Units (GPUs) at accelerating next-generation deep learning. Although operators widely use GPUs to improve prediction accuracy, GPUs are not keeping pace with the increasingly complex computations that DNN training requires. Moreover, they compromise the consistency and homogeneity of datacenter hardware and scale poorly.

FPGAs, conversely, offer greater continuity by co-locating inference and training on the same platform. Deploying FPGAs this way, however, brings its own set of challenges: FPGAs are handicapped by low computational density; existing algorithms cannot keep up with the increased communication requirements; and mechanisms are needed that give inference precedence over training.

To address these fundamental problems, a research project called “Coltrain: Co-located Deep Learning Training and Inference” is underway. The project is led by Babak Falsafi and Martin Jaggi of EPFL’s School of Computer and Communication Sciences (IC) and Eric Chung of Microsoft Research. Its main objectives are to restructure training and inference algorithms to:

  • Leverage the tolerance of DNNs for low-precision operations (see the first sketch after this list).
  • Accelerate communication.
  • Scale the training of a single network to an arbitrary number of cores.
  • Implement FPGA-based load-balancing techniques that offer latency guarantees for inference tasks under heavy load.
  • Use idle accelerator cycles to train networks under lighter load (see the second sketch after this list).
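
To make the first objective concrete, here is a minimal NumPy sketch of what "tolerance for low-precision operations" means in practice: weights and activations are rounded onto a coarse 8-bit grid, yet a forward pass stays close to its full-precision result. The shapes, seed, and bit width are illustrative assumptions, not part of the project.

```python
import numpy as np

def quantize(x, num_bits=8):
    # Simulate uniform fixed-point quantization: snap x onto a grid
    # spanning its dynamic range with 2**num_bits levels.
    scale = np.max(np.abs(x)) / (2 ** (num_bits - 1) - 1)
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)            # arbitrary seed
W = 0.1 * rng.standard_normal((64, 128))  # toy weight matrix
x = rng.standard_normal(128)              # toy input activation

full = W @ x                    # full-precision forward pass
low = quantize(W) @ quantize(x) # 8-bit weights and activations

rel_err = np.linalg.norm(full - low) / np.linalg.norm(full)
print(f"relative error at 8 bits: {rel_err:.4f}")
```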

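The last two objectives describe a scheduling policy: inference always takes priority, and training soaks up whatever accelerator cycles are left. The toy Python loop below sketches that policy in software; the queue, function names, and slot-based timing are assumptions made purely for illustration, as the project itself targets FPGA-based mechanisms.

```python
import queue

def run_inference(request):
    # Placeholder for a latency-critical accelerator inference call.
    print(f"inference: served {request}")

def training_step(step):
    # Placeholder for one mini-batch training step on spare cycles.
    print(f"training:  step {step}")

def colocated_loop(requests, total_slots=8, poll_s=0.001):
    # Toy event loop: a pending inference request always preempts
    # training; any slot with an empty queue is donated to training.
    step = 0
    for _ in range(total_slots):
        try:
            run_inference(requests.get(timeout=poll_s))
        except queue.Empty:
            step += 1
            training_step(step)

q = queue.Queue()
for r in ("img-1", "img-2", "img-3"):  # simulate a small request burst
    q.put(r)
colocated_loop(q)  # serves the burst first, then trains on idle slots
```
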
The project was launched earlier this year at a workshop held at the Microsoft Research Cambridge Lab. It is one of six projects undertaken jointly by EPFL and Microsoft as part of the Swiss Joint Research Center.

Suggested reading:

https://actu.epfl.ch/news/six-ic-projects-gain-funding-from-microsoft/