Strategies to automatically select the best accelerator to run a specific DNN training


  Falsafi Babak


This project aims to develop strategies to automatically select the best accelerator to run a specific DNN training job. We will create the necessary software libraries to allocate workloads efficiently by considering performance, power, and accuracy constraints. We also propose to develop meta-learning algorithms that create and train DL models and configure their hyper-parameters in an automated way, outperforming current state-of-the-art approaches. We envision a full resource-management (RM) infrastructure that works with minimal user interaction and allows automated tuning, allocation, and execution of DNN training jobs on heterogeneous accelerators.

The proposed framework will select the best underlying hardware based on user-defined constraints and find the most adequate resources to run the workload. Apart from leveraging and extending our existing expertise, we will use knowledge transfer and automated machine learning (AutoML) Neural Architecture Search techniques based on reinforcement learning. We will deploy the developed algorithms as efficient libraries and incorporate them into a full software stack that enables scheduling DNN training jobs from a single entry point onto the heterogeneous servers and accelerators available in the datacenter.
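As a minimal illustration of reinforcement-learning-driven accelerator selection, the sketch below uses an epsilon-greedy bandit that learns, from measured rewards, which device runs a job fastest. The accelerator names, the reward signal (negative time per batch), and the epsilon value are illustrative assumptions, not the project's actual scheduler design.

```python
import random

class AcceleratorScheduler:
    """Epsilon-greedy bandit over a set of accelerators (hypothetical sketch)."""

    def __init__(self, accelerators, epsilon=0.1, seed=0):
        self.accelerators = list(accelerators)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {a: 0 for a in self.accelerators}
        self.values = {a: 0.0 for a in self.accelerators}  # running mean reward

    def select(self):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.accelerators)
        return max(self.accelerators, key=lambda a: self.values[a])

    def update(self, accelerator, reward):
        # Incremental running-mean update of the value estimate.
        self.counts[accelerator] += 1
        n = self.counts[accelerator]
        self.values[accelerator] += (reward - self.values[accelerator]) / n
```

In use, the scheduler would call `select()` before each training job, measure the resulting time per batch, and feed back `update(accelerator, -time_per_batch)`; the value estimates then converge toward the fastest device for that workload.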

Our work so far has focused on the power and performance characterization of the key traces of the DLRM deep learning recommendation model, provided by Facebook, and we have developed the first version of a reinforcement learning scheduler that optimizes the training phase on heterogeneous servers. We optimized data loading and assignment, in particular adding a prefetching mechanism to DLRM that speeds up the execution time of each batch (data loading time plus model time) by more than 30% compared to our initial assessment, on both the Kaggle and Terabyte datasets on an NVIDIA V100 GPU.
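The idea behind such a prefetching mechanism can be sketched as follows: a background thread loads upcoming batches into a bounded buffer while the current batch is being consumed by the model, so loading and compute overlap. The buffer size and sentinel protocol here are our assumptions, standing in for the actual DLRM integration.

```python
import queue
import threading

def prefetching_loader(batches, buffer_size=2):
    """Yield batches while a background thread pre-loads the next ones
    into a bounded buffer (sketch of batch prefetching, not the DLRM code)."""
    buf = queue.Queue(maxsize=buffer_size)
    sentinel = object()

    def producer():
        for b in batches:      # in practice this loop hides slow I/O
            buf.put(b)
        buf.put(sentinel)      # signal end of the batch stream

    threading.Thread(target=producer, daemon=True).start()
    while True:
        b = buf.get()
        if b is sentinel:
            return
        yield b
```

Because the buffer is bounded, the producer blocks once it is `buffer_size` batches ahead, which caps memory use while still keeping the next batch ready when training finishes the current one.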

Our short-term objective is to finalize complete model pipelining and data parallelization in the DLRM framework and apply them to the reinforcement learning strategy we have developed. As a continuation of this project, we plan to extend the reinforcement learning automatic training methodology to other recommendation models, with a specific focus on graph neural networks, while enhancing its performance by further exploiting heterogeneous hardware resources.
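A core sub-problem of model pipelining is partitioning the layer sequence into stages of roughly equal cost, so no device stalls waiting on another. The greedy heuristic below is a simplified sketch under the assumption that per-layer costs come from profiling; the real partitioner would also account for inter-stage communication.

```python
def partition_layers(layer_costs, num_stages):
    """Greedily split a layer sequence into pipeline stages with roughly
    balanced total cost per stage (hypothetical sketch; ignores
    communication cost between stages)."""
    target = sum(layer_costs) / num_stages
    stages, current, acc = [], [], 0.0
    for i, cost in enumerate(layer_costs):
        current.append(i)
        acc += cost
        # Close the stage once it reaches the per-stage cost target,
        # leaving the remainder for the final stage.
        if acc >= target and len(stages) < num_stages - 1:
            stages.append(current)
            current, acc = [], 0.0
    stages.append(current)
    return stages
```

Each resulting stage (a list of layer indices) would then be placed on one accelerator, with micro-batches flowing through the stages in pipeline fashion.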

We will work towards improving the reinforcement learning strategy along two axes by adding: (1) dynamic task re-scheduling and server parameter tuning, and (2) the capability of managing several training instances. Finally, we plan to target complex learning cases that can be decomposed into ensembles of neural networks, which can run on distributed devices to follow the current trend toward model pruning, while providing guarantees on target output quality.

We expect our approach to yield significant savings in total training time and significantly improved robustness when models are minimized to fit smaller memory budgets. We will also explore the possibility of defining a heuristic approach to automate the design of the voter-based ensemble architecture for different types of neural networks.
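In its simplest form, a voter-based ensemble combines the class predictions of its member networks by majority vote, so the ensemble can tolerate errors from individual (possibly heavily pruned) members. The sketch below shows only this voting step, with members modeled as arbitrary callables; it is not the final ensemble architecture.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common class label among member predictions;
    ties are broken by first-seen order."""
    return Counter(predictions).most_common(1)[0][0]

def ensemble_predict(members, x):
    """Query every ensemble member (any callable returning a class label
    for input x) and combine the answers by majority vote."""
    return majority_vote([member(x) for member in members])
```

With an odd number of members, the vote guarantees a correct output whenever a strict majority of members classify correctly, which is the kind of output-quality guarantee the ensemble decomposition targets.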