Comparing Learning Rate and Batch Size Transfer Across Different Network Widths for Standard and Maximal Update Parameterization

Authors

  • Yeeyung Li, Department of Computer Science, George Mason University, Fairfax, VA
  • Xiaochuan Gong, Department of Computer Science, George Mason University, Fairfax, VA
  • Mingrui Liu, Department of Computer Science, George Mason University, Fairfax, VA

Abstract

In deep learning, hyperparameters are commonly tuned on small models and then transferred to larger models to reduce the cost of tuning. However, how the optimal values of certain hyperparameters scale from smaller to larger network widths is not well understood. Maximal update parameterization is a recently proposed parameterization for deep learning models that aims to keep optimal hyperparameters stable across model widths. This project compares the training dynamics and the optimal learning rate of neural networks trained on the MNIST dataset under standard parameterization and maximal update parameterization. Under standard parameterization, 0.001 was the optimal learning rate (over 0.01 and 0.1) at network widths 128, 256, and 512, but the optimal batch size varied with network width, indicating hyperparameter instability.
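To make the experimental setup concrete, the sketch below illustrates the kind of width and learning-rate sweep the abstract describes, under standard parameterization only. It is a minimal, hypothetical example assuming PyTorch, torchvision, SGD, a one-hidden-layer MLP, and a single training epoch; the authors' actual architecture, optimizer, batch sizes, and evaluation protocol are not specified here.

```python
# Hypothetical sketch (not the authors' code): a width x learning-rate sweep on MNIST
# under standard parameterization. Assumes PyTorch and torchvision are installed;
# the architecture, optimizer, and epoch count are illustrative choices.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def make_mlp(width):
    # Standard parameterization: PyTorch's default (fan-in based) init for every layer.
    return nn.Sequential(nn.Flatten(),
                         nn.Linear(784, width), nn.ReLU(),
                         nn.Linear(width, 10))

def train_one_epoch(model, loader, lr):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    for x, y in loader:
        opt.zero_grad()
        F.cross_entropy(model(x), y).backward()
        opt.step()

if __name__ == "__main__":
    data = datasets.MNIST("data", train=True, download=True,
                          transform=transforms.ToTensor())
    loader = DataLoader(data, batch_size=64, shuffle=True)
    for width in (128, 256, 512):
        for lr in (0.1, 0.01, 0.001):
            model = make_mlp(width)
            train_one_epoch(model, loader, lr)
            # Evaluate and record accuracy per (width, lr) to study how the
            # optimal lr transfers across widths. Under maximal update
            # parameterization one would additionally rescale the output layer
            # and per-layer learning rates with width (e.g. via libraries such
            # as `mup`) so that the optimal lr stays stable as width grows.
            print(f"width={width} lr={lr} trained")
```

The design choice of sweeping learning rates at each width mirrors the comparison reported in the abstract: if the parameterization transfers well, the same learning rate should remain optimal as the width increases.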

Published

2025-09-25

Section

College of Engineering and Computing: Department of Computer Science