Edge of Stochastic Stability: SGD does not train neural networks as you expect

Recent findings show that when neural networks are trained with full-batch gradient descent at step size eta, the largest eigenvalue lambda of the full-batch Hessian consistently stabilizes around 2/eta. These results have significant implications for convergence and generalization. This, however, is not the case for mini-batch optimization algorithms, which limits the broader applicability of these findings. We show that mini-batch Stochastic Gradient Descent (SGD) trains in a different regime, which we term the Edge of Stochastic Stability (EoSS). In this regime, the quantity that stabilizes at 2/eta is Batch Sharpness: the expected directional curvature of mini-batch Hessians along their corresponding stochastic gradients. As a consequence, lambda, which is generally smaller than Batch Sharpness, is suppressed, in line with the long-standing empirical observation that smaller batches and larger step sizes favor flatter minima. We further discuss implications for the mathematical modeling of SGD trajectories.
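
For concreteness, Batch Sharpness can be read as the expectation over mini-batches B of g_B^T H_B g_B / ||g_B||^2, where g_B and H_B are the gradient and Hessian of the loss on batch B. Below is a minimal sketch, assuming a PyTorch model, loss function, and data loader, of how such a quantity could be estimated with one Hessian-vector product per batch; the function name and arguments are illustrative, not the paper's code.

import torch

def estimate_batch_sharpness(model, loss_fn, loader, n_batches=10):
    # Average of g_B^T H_B g_B / ||g_B||^2 over a few mini-batches,
    # where g_B and H_B are the gradient and Hessian of the mini-batch loss.
    params = [p for p in model.parameters() if p.requires_grad]
    values = []
    for i, (x, y) in enumerate(loader):
        if i >= n_batches:
            break
        loss = loss_fn(model(x), y)
        # Mini-batch gradient g_B; keep the graph for the second differentiation.
        grads = torch.autograd.grad(loss, params, create_graph=True)
        g = torch.cat([gr.reshape(-1) for gr in grads])
        # Hessian-vector product H_B g_B: differentiate <g, g.detach()> w.r.t. the parameters.
        hvp = torch.autograd.grad(torch.dot(g, g.detach()), params)
        hg = torch.cat([h.reshape(-1) for h in hvp])
        g = g.detach()
        values.append((torch.dot(g, hg) / torch.dot(g, g)).item())
    return sum(values) / len(values)

Under the EoSS picture sketched above, an estimate of this kind is what one would expect to hover around 2/eta during SGD training, while lambda, the top eigenvalue of the full-batch Hessian, stays below it.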
Contact:
paolo.zunino@polimi.it