Ultron-AutoML: an open-source, distributed, scalable framework for efficient hyper-parameter optimization
Published in IEEE BigData, 2020
Recommended citation: @inproceedings{narayan2020ultron, title={Ultron-AutoML: an open-source, distributed, scalable framework for efficient hyper-parameter optimization}, author={Narayan, Swarnim and Krishna, Chepuri Shri and Mishra, Varun and Rai, Abhinav and Rai, Himanshu and Bharti, Chandrakant and Sodhi, Gursirat Singh and Gupta, Ashish and Singh, Nitinbalaji}, booktitle={2020 IEEE International Conference on Big Data (Big Data)}, pages={1584--1593}, year={2020}, organization={IEEE} }
Abstract: We present Ultron-AutoML, an open-source, distributed framework for efficient hyper-parameter optimization (HPO) of ML models. Considering that hyper-parameter optimization is compute intensive and time-consuming, the framework has been designed for reliability – the ability to successfully complete an HPO Job in a multi-tenant, failure prone environment, as well as efficiency – completing the job with minimum compute cost and wall-clock time. From a user’s perspective, the framework emphasizes ease of use and customizability. The user can declaratively specify and execute an HPO Job, while ancillary tasks – containerizing and running the user’s scripts, model checkpointing, monitoring progress, parallelization – are handled by the framework. At the same time, the user has complete flexibility in composing the code-base for specifying the ML model training algorithm as well as, optionally, any custom HPO algorithm. The framework supports the creation of datapipelines to stream batches of shuffled and augmented data from a distributed file system. This comes in handy for training Deep Learning models based on self-supervised, semi-supervised or representation learning algorithms over large training data-sets. We demonstrate the framework’s reliability and efficiency by running a BERT pre-training job over a large training corpus using pre-emptible GPU compute targets. Despite the inherent unreliability of the underlying compute nodes, the framework is able to complete such long running jobs at 30% of the cost with a marginal increase in wall-clock time. The framework also comes with a service to monitor jobs and ensures reproducibility of any result.