ChonLam Lao and Yixi Chen, master's candidates at IIIS, and Prof. Wenfei Wu received the Best Paper Award at the 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2021) for their paper “ATP: In-network Aggregation for Multi-tenant Learning”. This is the first time that this award has been granted to a Tsinghua member as the first author in China.
Prof. Wenfei Wu’s Research Group
Distributed deep neural network training (DT) systems are widely deployed in clusters where the network is shared across multiple tenants, i.e., multiple DT jobs. Each DT job computes and aggregates gradients. Recent advances in hardware accelerators have shifted the performance bottleneck of training from computation to communication. To speed up communication for DT jobs, Prof. Wu’s group proposed ATP, an in-network aggregation service designed for modern multi-rack, multi-job DT settings.
The aggregation process in ATP
ATP uses emerging programmable switch hardware to support in-network aggregation at multiple rack switches in a cluster, speeding up DT jobs. ATP performs decentralized, dynamic, best-effort aggregation, enables efficient and equitable sharing of limited switch resources across simultaneously running DT jobs, and gracefully accommodates heavy contention for switch resources. ATP outperforms existing systems, accelerating training throughput by up to 38%-66% in a cluster shared by multiple DT jobs.
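The idea of best-effort aggregation over a limited pool of switch resources can be illustrated with a small simulation. This is only a sketch under stated assumptions, not ATP's actual switch program: the `Switch` class, the slot-indexing rule, and the "fallback" outcome (passing a gradient fragment onward unaggregated when its slot is busy with another job) are hypothetical names and simplifications introduced here for illustration.

```python
class Switch:
    """Toy model of a switch with a fixed pool of aggregator slots.

    Hypothetical simplification: each slot holds the running sum for one
    (job, fragment) pair until every worker of that job has contributed.
    """

    def __init__(self, num_slots):
        # Each slot is None (free) or a dict describing the sum in progress.
        self.slots = [None] * num_slots

    def _index(self, job_id, frag_id):
        # Assumed deterministic mapping from (job, fragment) to a slot.
        return (job_id * 31 + frag_id) % len(self.slots)

    def receive(self, job_id, frag_id, worker_id, values, num_workers):
        """Handle one gradient fragment.

        Returns ("held", None) while the sum is incomplete,
        ("aggregated", total) once all workers have contributed, or
        ("fallback", values) if the slot is busy with another job.
        """
        i = self._index(job_id, frag_id)
        key = (job_id, frag_id)
        slot = self.slots[i]

        if slot is None:
            # Free slot: claim it for this (job, fragment).
            slot = {"key": key, "total": list(values), "seen": {worker_id}}
            self.slots[i] = slot
        elif slot["key"] != key:
            # Contention: best effort means we do not wait for the slot.
            return ("fallback", values)
        elif worker_id not in slot["seen"]:
            # Same (job, fragment): add this worker's values element-wise.
            slot["total"] = [a + b for a, b in zip(slot["total"], values)]
            slot["seen"].add(worker_id)

        if len(slot["seen"]) == num_workers:
            self.slots[i] = None  # release the slot for other jobs
            return ("aggregated", slot["total"])
        return ("held", None)


# Two jobs compete for a single slot.
sw = Switch(num_slots=1)
r1 = sw.receive(1, 0, 0, [1.0, 2.0], num_workers=2)  # job 1 claims the slot
r2 = sw.receive(2, 0, 0, [5.0, 5.0], num_workers=2)  # job 2 finds it busy
r3 = sw.receive(1, 0, 1, [3.0, 4.0], num_workers=2)  # job 1 completes
print(r1, r2, r3)
```

In this toy model, a fragment that cannot be aggregated in the switch is simply handed back to the caller; a real system would still need to complete that aggregation somewhere else, which the sketch does not model.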
Model Training Speed of ATP and other baseline architectures
The work was conducted by Assistant Professor Wenfei Wu’s group in collaboration with Prof. Aditya Akella’s group from the University of Wisconsin-Madison. ChonLam Lao is the first author and Prof. Wenfei Wu is the corresponding author.
NSDI is a premier conference on networked systems sponsored by USENIX. It focuses on the design principles, implementation, and practical evaluation of networked and distributed systems. This year, the conference received 369 submissions and accepted 59 papers, an acceptance rate of 16%. Each year, one paper is picked as the Best Paper.
The full paper is available at https://www.usenix.org/conference/nsdi21/presentation/lao
By Yuying Chang