DNN training and inference share the same basic operators but have fundamentally different requirements. The former is throughput-bound and relies on high-precision floating-point arithmetic for convergence, while the latter is latency-bound and tolerant of low-precision arithmetic. Both workloads demand high computational capability and can benefit from hardware accelerators. This disparity in resource requirements forces datacenter operators to choose between deploying separate custom accelerators for training and for inference, or repurposing training accelerators for inference.

However, neither option is optimal. The former results in datacenter heterogeneity and higher management costs, while the latter results in inefficient inference. Moreover, dedicated inference accelerators face load fluctuations, leading to overprovisioning and low average utilization.

The objective of EPFL’s ColTraIn (Co-located DNN Training and Inference) team, a collaboration between PARSA and MLO, is to restore datacenter homogeneity by co-locating training and inference without compromising inference efficiency or quality-of-service (QoS) guarantees. ColTraIn aims to overcome two key challenges: (1) the difference in arithmetic representation between the two workloads, and (2) the scheduling of training tasks on inference-bound accelerators. The recent release of HBFP (Hybrid Block Floating Point) meets the first challenge.

HBFP trains DNNs with dense, fixed-point-like arithmetic for most operations without sacrificing accuracy, thus facilitating effective co-location. More specifically, HBFP offers the accuracy of 32-bit floating-point with the numeric and silicon density of 8-bit fixed-point for many models (ResNet, WideResNet, DenseNet, AlexNet, LSTM, and BERT).
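To make the idea concrete: in block floating point, a group of values shares a single exponent while each value keeps a short fixed-point mantissa, so the multiply-accumulates inside dot products reduce to fixed-point operations. The following NumPy sketch illustrates such a quantizer; the `bfp_quantize` helper, the whole-tensor block, and the mantissa width are illustrative assumptions, not the project's actual implementation.

```python
import numpy as np

def bfp_quantize(block, mantissa_bits=8):
    """Quantize a block of values to block floating point (BFP).

    All values share one exponent, derived from the largest magnitude in the
    block; each value keeps a signed fixed-point mantissa of `mantissa_bits`
    bits. Illustrative sketch only, not the HBFP reference implementation.
    """
    max_abs = np.max(np.abs(block))
    if max_abs == 0:
        return np.zeros_like(block)
    shared_exp = np.ceil(np.log2(max_abs))           # shared power-of-two exponent
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    lo, hi = -(2 ** (mantissa_bits - 1)), 2 ** (mantissa_bits - 1) - 1
    mantissas = np.clip(np.round(block / scale), lo, hi)
    return mantissas * scale                         # dequantized back to float

# Toy usage: here each whole tensor is treated as one block for brevity;
# hardware designs typically use small tiles. In HBFP-style training the
# dot products use the BFP values, while accumulation and the remaining
# operations stay in regular floating point (the "hybrid" part).
w = np.random.randn(4, 8).astype(np.float32)
x = np.random.randn(8, 2).astype(np.float32)
y = bfp_quantize(w) @ bfp_quantize(x)
```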

The open-source project repository is available for ongoing research on training DNNs with HBFP.

The ColTraIn team is now working on the second challenge: a co-locating accelerator. The design adds training capabilities to an inference accelerator and pairs it with a scheduler that accounts for both resource utilization and tasks’ QoS constraints when co-locating DNN training and inference.
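As a rough illustration of what such a scheduler has to weigh, the hypothetical sketch below admits a slice of training work only when the accelerator has spare capacity and the estimated interference still leaves headroom under the inference latency target. The policy, thresholds, and names (`AcceleratorState`, `admit_training_slice`) are assumptions for illustration; the actual ColTraIn scheduler design is not described here.

```python
from dataclasses import dataclass

@dataclass
class AcceleratorState:
    inference_utilization: float   # fraction of compute used by inference right now
    qos_headroom_ms: float         # slack between observed tail latency and the QoS target

def admit_training_slice(state: AcceleratorState,
                         est_interference_ms: float,
                         min_headroom_ms: float = 1.0,
                         max_utilization: float = 0.9) -> bool:
    """Hypothetical co-location policy: fill idle cycles with training work only
    if the projected impact on inference latency keeps the QoS target safe."""
    projected_headroom = state.qos_headroom_ms - est_interference_ms
    return (state.inference_utilization < max_utilization
            and projected_headroom >= min_headroom_ms)

# Example: low load and ample latency slack -> the training slice is admitted.
state = AcceleratorState(inference_utilization=0.35, qos_headroom_ms=4.0)
print(admit_training_slice(state, est_interference_ms=1.5))  # True
```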