Reducing the Ever-growing Cost of Large-Scale Machine Learning Services


The recent success of machine learning (ML) has benefited dramatically from the exponential growth of ML model capacity. However, this enormous capacity also leads to significantly higher cost. In practice, the high cost of ML comes from three sources: i) the cost of optimizing and deploying ML services on ever-changing hardware; ii) the low utilization of the hardware due to parallel/distributed communication overhead; and iii) the high cost of accessing the hardware.


My work attempts to reduce the cost in all three categories. First, developing and deploying ML workflows in ever-changing execution environments is tedious and time-consuming, and scaling out the computation requires a significant amount of engineering effort; my work proposes new abstractions for ML system design and implementation that offer expressivity, ease of optimization, and high performance. Second, in parallel/distributed ML training, communication is usually the main bottleneck that restricts hardware efficiency; my work explores system relaxations of communication under different parallel ML training paradigms to increase hardware efficiency without compromising statistical efficiency. Third, building on these advances in system optimization and relaxation, my work investigates how to deploy ML services over a decentralized, open, collective environment consisting of much cheaper and underutilized GPUs. The result is promising: when the decentralized interconnections are 100X slower than the data center network, under efficient scheduling, the end-to-end training throughput is only 1.7X to 3.5X slower than that of state-of-the-art solutions inside a data center.