Empirically Investigating Sharpness-Aware Minimization
DOI:
https://doi.org/10.13021/jssr2022.3393
Abstract
Sharpness-Aware Minimization (SAM) is an optimizer that is claimed to find more generalizable solutions by introducing a “sharpness” penalty into the loss function, causing the model to converge to flatter optima. Several studies have questioned the effectiveness of SAM. One such paper, “On the Maximum Hessian Eigenvalue and Generalization”, finds that larger batch sizes diminish and eventually eliminate the generalization benefits of SAM over Stochastic Gradient Descent (SGD). It also finds that the correlation between generalization and flatness, as measured by the maximum Hessian eigenvalue of the loss function, is highly conditional: for example, the maximum Hessian eigenvalue can be manipulated by scaling the learning rate and batch size without affecting generalization. In preliminary tests on the CIFAR-10 image classification dataset using ResNet18, we found that SAM reaches lower training accuracy than SGD but appears to generalize better to the test set. In this study, we aim to find empirical evidence of SAM’s ability to find flatter optima than SGD, measuring flatness through loss values at projected points of gradient ascent.
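The two ideas named in the abstract, the SAM update and a flatness measure based on the loss at a gradient-ascent point, can be illustrated with a minimal sketch. This is not the code used in the study: the toy quadratic loss, the function names (`ascent_point`, `sam_step`, `sharpness_probe`), and the values of `rho` and `lr` are all assumptions chosen for illustration, and the exact projection used in the study is not specified here.

```python
import math

def loss(w, curvatures):
    # Toy quadratic loss (an assumption): 0.5 * sum(c_i * w_i^2).
    # A larger c_i means a sharper direction.
    return 0.5 * sum(c * x * x for c, x in zip(curvatures, w))

def grad(w, curvatures):
    # Analytic gradient of the toy loss.
    return [c * x for c, x in zip(curvatures, w)]

def ascent_point(w, curvatures, rho):
    # Move a distance rho along the normalized gradient (one ascent step).
    g = grad(w, curvatures)
    norm = math.sqrt(sum(x * x for x in g))
    if norm == 0.0:
        return list(w)  # no ascent direction at a stationary point
    return [x + rho * gx / norm for x, gx in zip(w, g)]

def sam_step(w, curvatures, lr, rho):
    # SAM-style update: descend using the gradient taken at the ascent point.
    w_adv = ascent_point(w, curvatures, rho)
    g_adv = grad(w_adv, curvatures)
    return [x - lr * gx for x, gx in zip(w, g_adv)]

def sharpness_probe(w, curvatures, rho):
    # Hypothetical flatness measure: loss increase at the ascent point.
    return loss(ascent_point(w, curvatures, rho), curvatures) - loss(w, curvatures)

w = [1.0, 1.0]
flat = sharpness_probe(w, [0.1, 0.1], rho=0.05)
sharp = sharpness_probe(w, [10.0, 10.0], rho=0.05)
print(flat < sharp)  # the sharper bowl shows the larger loss increase
```

Under these assumptions, a smaller probe value at a converged point would indicate a flatter optimum, which is the kind of comparison between SAM and SGD that the abstract describes.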
License
Copyright (c) 2022 Alex Li, Michael Crawshaw, Dr. Mingrui Liu
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.