Speaker: Lei Wu, Peking University
Time: 2024-11-19, 10:00-11:30
Venue: FIT 1-222
Abstract:
Training neural networks involves navigating highly non-convex and degenerate loss landscapes, making it challenging to understand the underlying dynamics. In this talk, we introduce a stability-based perspective to explain how stochastic gradient descent (SGD) and its variants explore these high-dimensional landscapes. This perspective offers an explanation of the well-known flat minima hypothesis: SGD tends to converge to flat minima, and flat minima generalize better. In particular, our analysis reveals the crucial roles of a finite learning rate, a small batch size, and the anisotropic structure of gradient noise in finding flat minima. Finally, we present two algorithms derived from this stability perspective: one significantly accelerates the discovery of flat minima, and the other integrates seamlessly into existing deep learning frameworks to enhance large language model (LLM) pretraining.
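To make the stability mechanism concrete, here is a standard linear-stability sketch for plain gradient descent (a textbook-style illustration, not a statement of the talk's results). Near a minimum $\theta^*$, the update $\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$ linearizes to $e_{t+1} = (I - \eta H)\, e_t$ with $H = \nabla^2 L(\theta^*)$, so the minimum can be linearly stable only if

    \lambda_{\max}\!\big(\nabla^2 L(\theta^*)\big) \;\le\; \frac{2}{\eta}.

A finite learning rate therefore already rules out minima that are too sharp; the talk's analysis develops this kind of argument for SGD, where the batch size and the anisotropy of the gradient noise enter as well.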
Short Bio:
Lei Wu is an assistant professor in the School of Mathematical Sciences at Peking University. His research centers on the theoretical aspects of deep learning. He received his Bachelor's degree in pure mathematics from Nankai University in 2012 and his PhD in computational mathematics from Peking University in 2018. From November 2018 to October 2021, he worked as a postdoctoral researcher at Princeton University and the University of Pennsylvania.