First-order methods for numerical optimization are central to modern machine learning due to their scalability and simplicity. In this talk we explore their theory and practice by reconsidering traditional assumptions and common wisdom. First, we introduce a descent method that converges under a novel generalization of Lipschitz smoothness, a ubiquitous assumption in the analysis of first-order methods. Our method is a non-linear preconditioning of gradient descent, and we show how it can be applied to p-norm regression and exponential penalty function minimization. Second, we demonstrate the sensitivity of optimizer comparisons to the metaparameter tuning protocol. As tuning effort grows without bound, more general update rules should never underperform the rules they can approximate (e.g., Adam should never perform worse than momentum), but recent attempts to compare optimizers for deep learning either assume these inclusion relationships are irrelevant in practice or restrict the metaparameters they tune in ways that break the inclusions. In our experiments, we find that the inclusion relationships between optimizers matter in practice and consistently predict the outcomes of optimizer comparisons. In particular, we find that the popular adaptive gradient methods never underperform momentum or gradient descent, challenging commonly held wisdom.
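One way to see the inclusion relationship is that Adam's update collapses to EMA-form momentum when its epsilon term dominates the adaptive denominator. The following is a minimal NumPy sketch of that reduction on a toy quadratic, not the talk's actual experiments; the specific hyperparameter values and the large-epsilon construction are illustrative assumptions.

```python
import numpy as np

def momentum_ema(grad, x0, lr, beta1, steps):
    # EMA-form momentum: m <- beta1*m + (1-beta1)*g; x <- x - lr*m
    x, m = x0.copy(), np.zeros_like(x0)
    for _ in range(steps):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        x = x - lr * m
    return x

def adam(grad, x0, lr, beta1, beta2, eps, steps):
    # Adam without bias correction, to keep the comparison clean
    x, m, v = x0.copy(), np.zeros_like(x0), np.zeros_like(x0)
    for _ in range(steps):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        x = x - lr * m / (np.sqrt(v) + eps)
    return x

# Toy quadratic f(x) = 0.5 * x^T A x (illustrative choice)
A = np.diag([1.0, 10.0])
grad = lambda x: A @ x
x0 = np.array([1.0, 1.0])

# When eps >> sqrt(v), Adam's step is approximately -(lr/eps) * m,
# i.e. momentum with effective learning rate lr/eps.
eps = 1e6
x_mom = momentum_ema(grad, x0, lr=0.01, beta1=0.9, steps=100)
x_adam = adam(grad, x0, lr=0.01 * eps, beta1=0.9, beta2=0.999,
              eps=eps, steps=100)
print(np.max(np.abs(x_mom - x_adam)))  # gap is negligible
```

Under this construction the two trajectories nearly coincide, which is why, given unbounded tuning over all of Adam's metaparameters (including epsilon), Adam should do at least as well as momentum.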
Chris Maddison is a member of the Institute for Advanced Study in Princeton, participating in the Special Year on Theoretical Machine Learning, and he will join the Departments of Computer Science and Statistical Sciences at the University of Toronto and the Vector Institute as an assistant professor in July 2020. His work is on the methodology of machine learning, with an emphasis on methods that work at scale in deep learning applications. He is an Open Philanthropy AI Fellow, received a NIPS Best Paper Award in 2014, and was one of the founding members of the AlphaGo project.