Training for Recommendation Models on Heterogeneous Servers

Strategies to automatically select the best accelerator to run a specific DNN training

This project aims to develop strategies to automatically select the best accelerator to run a specific DNN training. We will create the necessary software libraries to allocate workloads efficiently by considering performance, power, and accuracy constraints. We also propose to develop meta-learning algorithms that create and train DL models and configure their hyperparameters in an automated way, outperforming current state-of-the-art approaches. We envision a full RM infrastructure that works with minimal user interaction and allows automated tuning, allocation, and execution of DNN trainings on heterogeneous accelerators.

The proposed framework will select the best underlying hardware based on user-defined constraints and find the most suitable resources to run the workload. In addition to leveraging and extending our existing expertise, we will use knowledge transfer and Automatic ML (AutoML) Neural Architecture Search techniques based on reinforcement learning. We will package the resulting algorithms as libraries and incorporate them into a full software stack that schedules DNN trainings from a single entry point onto the heterogeneous servers and accelerators available in the datacenter.
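As a rough illustration of constraint-based hardware selection, the sketch below filters candidate accelerators by user-defined power and accuracy limits and then picks the fastest feasible one. It is not the project's scheduler (which is based on reinforcement learning); the `AcceleratorProfile` fields, the `select_accelerator` function, and the idea of having per-accelerator estimates available up front are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class AcceleratorProfile:
    """Hypothetical per-accelerator estimates for one training workload."""
    name: str
    est_time_s: float     # estimated total training time, in seconds
    est_power_w: float    # estimated average power draw, in watts
    est_accuracy: float   # expected final model accuracy, in [0, 1]


def select_accelerator(
    profiles: Sequence[AcceleratorProfile],
    max_power_w: float,
    min_accuracy: float,
) -> Optional[AcceleratorProfile]:
    """Return the fastest accelerator that meets both constraints.

    Returns None when no candidate satisfies the constraints.
    """
    feasible = [
        p for p in profiles
        if p.est_power_w <= max_power_w and p.est_accuracy >= min_accuracy
    ]
    if not feasible:
        return None
    # Among feasible candidates, minimize the estimated training time.
    return min(feasible, key=lambda p: p.est_time_s)
```

A learned scheduler would replace the fixed estimates with predictions that are refined as trainings run, but the constraint check followed by an objective is the same basic shape.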

Our work so far has focused on the power and performance characterization of the key traces of DLRM, the deep learning recommendation model provided by Facebook. We have also developed the first version of a reinforcement learning scheduler that can optimize the training phase for heterogeneous servers. In addition, we optimized data loading and assignment, in particular by adding a prefetching mechanism to DLRM that reduces the execution time of each batch (batch loading time plus model time) by more than 30% compared to our initial assessment, on both the Kaggle and Terabyte datasets on an NVIDIA V100 GPU.
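The general idea behind batch prefetching can be sketched as follows: a background thread loads upcoming batches into a bounded queue while the consumer (the training step) works on the current one, so loading time overlaps with model time. This is a generic illustration using only the Python standard library, not the mechanism actually added to DLRM; the `prefetch` function and its `depth` parameter are hypothetical names.

```python
import queue
import threading
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

_DONE = object()  # sentinel marking the end of the batch stream


def prefetch(batches: Iterable[T], depth: int = 2) -> Iterator[T]:
    """Yield items from `batches`, loading up to `depth` of them ahead of time."""
    buffer: queue.Queue = queue.Queue(maxsize=depth)
    errors: List[BaseException] = []

    def producer() -> None:
        try:
            for batch in batches:
                buffer.put(batch)  # blocks while the buffer is full
        except BaseException as exc:  # forward loader errors to the consumer
            errors.append(exc)
        finally:
            buffer.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buffer.get()
        if item is _DONE:
            if errors:
                raise errors[0]
            return
        yield item
```

Wrapping a loader as `for batch in prefetch(loader): ...` keeps the training loop unchanged. How much time this saves depends on how the loading and model times compare, and on whether loading releases the interpreter lock (for example, during file I/O).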

Our short-term objective is to finalize, within the DLRM framework, the complete model pipelining and data parallelization, and to apply them to the reinforcement learning strategy we have developed. As a continuation of this project, we plan to extend the reinforcement learning automatic training methodology to other recommendation models, with a specific focus on graph neural networks, while enhancing its performance by further exploiting heterogeneous hardware resources.

We will improve the reinforcement learning strategy along two axes by adding: (1) dynamic task re-scheduling and server parameter tuning, and (2) the capability to manage several training instances. Finally, we plan to target complex learning cases that can be decomposed into ensembles of neural networks, which can run on distributed devices to cope with the current trend of pruning while still providing guarantees on target output quality.

We expect our approach to yield significant savings in total training time and markedly improved robustness when networks are minimized for designs with smaller memory sizes. We will also explore a heuristic approach to automate the design of the voter-based ensemble architecture for different types of neural networks.
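For readers unfamiliar with voter-based ensembles, the simplest form of the idea is a majority vote over the class labels predicted by the individual networks. The sketch below shows only that combining step, under the assumption that each ensemble member outputs a single label; the `majority_vote` function is a hypothetical name, and the project's actual voter design may differ.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(predictions: Sequence[Hashable]) -> Hashable:
    """Return the label predicted by the most ensemble members.

    Ties are broken in favor of the label that appears first in `predictions`.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    # Counter keeps first-seen order, and most_common preserves it among ties.
    label, _count = Counter(predictions).most_common(1)[0]
    return label
```

Because each member can be small, the members can be placed on different devices and only their predicted labels need to be gathered for the vote.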
