Co-located Deep Learning Training and Inference


  Babak Falsafi
  Martin Jaggi

Despite repeatedly making headlines across various applications, Deep Neural Networks (DNNs) remain a nascent machine learning methodology. This leaves ample scope for exploring hardware architectures that improve their performance. As a subset of machine learning, DNNs are gaining traction because of the growing need for accurate analytics of the images, videos, and speech stored in datacenters worldwide.

With the rapid pace of innovation in DNN algorithms, several studies have attempted to determine whether Field Programmable Gate Arrays (FPGAs) can beat Graphics Processing Units (GPUs) at accelerating next-generation deep learning. Although GPUs are widely used by operators to improve prediction accuracy, they are not keeping pace with the increasingly complex computations that DNN training requires. Moreover, GPUs reduce the consistency and homogeneity of datacenters and scale poorly.

Conversely, FPGAs offer greater continuity by co-locating inference and training on the same platform. However, deploying FPGAs brings its own set of challenges: FPGAs suffer from low computational density; training algorithms cannot keep up with increased communication requirements; and mechanisms are needed that give precedence to inference over training.

To address these fundamental problems, a research project called “Coltrain: Co-located Deep Learning Training and Inference” is underway. The project is led by Babak Falsafi and Martin Jaggi of EPFL’s School of Computer and Communication Sciences (IC) and Eric Chung of Microsoft Research. Its main objectives are to restructure training and inference algorithms to:

  • Leverage the tolerance of DNNs for low-precision operations.
  • Accelerate communication.
  • Scale the training of single networks to an arbitrary number of cores.
  • Implement FPGA-based load-balancing techniques to offer latency guarantees for inference tasks under heavy loads.
  • Enable the use of idle accelerator cycles to train networks when operating under lower loads.
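The project's algorithms are not public, so as an illustration of the first objective only, here is a minimal sketch of how DNN-style training tolerates low-precision arithmetic: a toy model fitted by stochastic gradient descent while every gradient is rounded to a hypothetical 8-bit fixed-point grid (the `quantize` helper and all parameter values are assumptions for illustration, not part of Coltrain).

```python
import random

def quantize(x, bits=8, scale=4.0):
    """Round x to a fixed-point grid with `bits` bits over [-scale, scale].
    (Illustrative quantizer, not Coltrain's actual scheme.)"""
    levels = 2 ** (bits - 1)
    step = scale / levels
    return max(-scale, min(scale, round(x / step) * step))

def train(bits, steps=2000, lr=0.05, seed=0):
    """Fit y = 2x + 1 by SGD on squared error, quantizing each
    gradient to low precision before applying it."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        x = rng.uniform(-1, 1)
        y = 2 * x + 1
        err = (w * x + b) - y           # prediction error
        w -= lr * quantize(2 * err * x, bits)  # low-precision gradient step
        b -= lr * quantize(2 * err, bits)
    return w, b

w, b = train(bits=8)
print(w, b)  # lands close to the true parameters w=2, b=1
```

Even with gradients coarsened to 256 levels, the parameters settle near their true values; the quantization merely leaves a small error floor around the optimum. This tolerance is what makes trading numeric precision for FPGA computational density attractive.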

The project was launched earlier this year in a workshop organized at the Microsoft Research Cambridge Lab. It is one of six projects undertaken jointly by EPFL and Microsoft as part of the Swiss Joint Research Center.