

【SLAI Seminar】Two Perspectives on Muon for Deep Learning Optimization: Preconditioning and Isotropic Curvature (Jan 8, 10:00)

January 08, 2026 Forum Schedule

You are cordially invited to attend the seminar "Two Perspectives on Muon for Deep Learning Optimization: Preconditioning and Isotropic Curvature" at 10:00 AM on January 8th.

Speaker: Prof. Weijie Su (UPenn)

Host: Prof. Ruoyu Sun

Mode of Participation: Hybrid

On-site: B411 Lecture Hall

Remote: Tencent Meeting (Meeting ID: 792-529-745)


About the Speaker: 

Weijie Su is an Associate Professor in the Wharton Statistics and Data Science Department and, by courtesy, in the Departments of Computer and Information Science and Mathematics at the University of Pennsylvania. He is a co-director of the Penn Research in Machine Learning (PRiML) Center. He received his Ph.D. from Stanford University in 2016 and bachelor's degree from Peking University in 2011. His research interests span the mathematical foundations of generative AI, privacy-preserving machine learning, optimization, and high-dimensional statistics. He serves as a founding co-editor of the new journal Statistical Learning and Data Science, an associate editor of the Journal of Machine Learning Research, Journal of the American Statistical Association, Operations Research, Journal of the Operations Research Society of China, the Annals of Applied Statistics, Foundations and Trends in Statistics, and Harvard Data Science Review, and he is currently on the Organizing Committee of ICML 2026 as the Scientific Integrity Chair. His work has been recognized with several awards, such as the Stanford Anderson Dissertation Award, NSF CAREER Award, Sloan Research Fellowship, IMS Peter Hall Prize, SIAM Early Career Prize in Data Science, ASA Noether Early Career Award, ICBS Frontiers of Science Award in Mathematics, IMS Medallion Lectureship, and Outstanding Young Talent Award in the 2025 China Annual Review of Mathematics, and he is a Fellow of the IMS.

Abstract:

Introduced in December 2024, Muon is an optimization method for training language models that updates the weights along the direction of an orthogonalized gradient. The superiority of Muon has been quickly recognized, as demonstrated on industry-scale models; for example, it has been successfully used to train a trillion-parameter frontier language model. In this talk, we offer two perspectives to shed light on this matrix-gradient method. First, we introduce a unifying framework that precisely distinguishes between preconditioning for curvature anisotropy (as in Adam) and gradient anisotropy (as in Muon). This perspective not only offers new insights into Adam's instabilities and Muon's accelerated convergence but also leads to a new extension, PolarGrad. Next, we introduce a second perspective based on an isotropic curvature model. We derive this model by assuming isotropy of curvature (including the Hessian and higher-order terms) across all perturbation directions. We show that under a general growth condition, the optimal update is one that makes the gradient's spectrum more homogeneous; that is, one that brings its singular values closer in ratio. We then show that the orthogonalized gradient becomes optimal for this model when the curvature exhibits a phase transition in growth. Taken together, these results suggest that the gradient orthogonalization employed in Muon is directionally correct but may not be strictly optimal, and we will discuss how to leverage this model for designing new optimization methods. This talk is based on arXiv:2505.21799 and arXiv:2511.00674.
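To make the central operation in the abstract concrete, the sketch below illustrates gradient orthogonalization in NumPy: replacing a matrix gradient with its polar factor, so that all singular values become 1 (a perfectly homogeneous spectrum). This is a minimal illustration only; the function name, learning rate, and random matrices are illustrative, and production Muon implementations approximate the polar factor with a Newton-Schulz iteration rather than an exact SVD.

```python
import numpy as np

def orthogonalize(grad):
    """Return the polar factor of `grad`: same singular vectors,
    all singular values replaced by 1."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))  # weight matrix (illustrative)
G = rng.standard_normal((4, 3))  # its gradient (illustrative)

O = orthogonalize(G)

# The orthogonalized gradient has a flat spectrum: every singular
# value equals 1, the extreme case of "singular values closer in ratio".
print(np.round(np.linalg.svd(O, compute_uv=False), 6))

# A single Muon-style update step with learning rate lr:
lr = 0.02
W_new = W - lr * O
```

The contrast with a plain gradient step is that the update direction no longer over-weights the gradient's dominant singular directions, which is the behavior the talk's isotropic-curvature model analyzes.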
