Deep Intrinsic Surprise-Regularized Control: Scaling Temporal-Difference Updates for Stability in Deep Q-Networks

Authors

  • Yash Kini Department of Electrical and Computer Engineering, George Mason University, Fairfax, VA
  • Shiv Davay Department of Electrical and Computer Engineering, George Mason University, Fairfax, VA
  • Shreya Polavarapu Department of Electrical and Computer Engineering, George Mason University, Fairfax, VA
  • Hamed Poursiami Department of Electrical and Computer Engineering, George Mason University, Fairfax, VA
  • Shay Snyder Department of Electrical and Computer Engineering, George Mason University, Fairfax, VA
  • Maryam Parsa Department of Electrical and Computer Engineering, George Mason University, Fairfax, VA

DOI:

https://doi.org/10.13021/jssr2025.5291

Abstract

Deep reinforcement learning (DRL) has driven major advances in autonomous control, but standard Deep Q-Network (DQN) agents, while already scaling updates by temporal-difference (TD) error and gradients, typically use fixed learning rates without an explicit mechanism to modulate overall update magnitudes. Though prioritized experience replay and adaptive optimizers indirectly shape learning dynamics, few methods explicitly adjust each update's scale through an intrinsic signal. We introduce Deep Intrinsic Surprise-Regularized Control (DISRC), a biologically inspired augmentation to DQN that computes a deviation-based surprise score via a moving latent setpoint in a LayerNorm-based encoder, scaling each Q-update in proportion to both TD error and surprise intensity. This design promotes higher plasticity during early exploration and more conservative updates as learning stabilizes. We evaluated DISRC on CartPole-v1 under settings identical to a vanilla DQN across multiple runs. The vanilla DQN achieved a mean reward of 419.68 over the final 100 episodes, reached the 200-reward threshold in 147 episodes, and produced an area under the reward curve (AUC) of 150,605.00. DISRC achieved a mean reward of 159.84, required 556 episodes to reach the threshold, and achieved an AUC of 46,234.50. DISRC's lower reward standard deviation (92.96 vs. 149.22) reflects more uniform, though consistently suboptimal, episode returns. Although DISRC underperforms on this dense-reward benchmark, we hypothesize that strong external rewards may diminish or override the benefits of intrinsic surprise modulation. We plan to explore DISRC in sparse-reward or more complex environments, where intrinsic regulation could more meaningfully guide exploration and learning. This work introduces a novel mechanism for regulating update magnitudes in off-policy agents, positioning DISRC as a promising direction for stability-enhanced, biologically grounded DRL.
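
The sketch below illustrates the kind of surprise-scaled TD update the abstract describes: a LayerNorm-based encoder produces a latent, a moving latent setpoint is maintained as an exponential moving average, and per-sample deviation from that setpoint weights the squared TD error. The network sizes, the EMA setpoint rule, the beta coefficient, and all names (Encoder, QNet, disrc_like_update) are our assumptions for illustration; this is not the authors' implementation.

# Hypothetical sketch of a surprise-modulated Q-update; exact DISRC details may differ.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """LayerNorm-based state encoder whose latent feeds the surprise signal."""
    def __init__(self, obs_dim, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, latent_dim), nn.LayerNorm(latent_dim),
        )
    def forward(self, x):
        return self.net(x)

class QNet(nn.Module):
    """Plain DQN action-value head."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, n_actions)
        )
    def forward(self, x):
        return self.net(x)

def disrc_like_update(q_net, target_net, encoder, setpoint, batch,
                      optimizer, gamma=0.99, beta=0.01):
    """One surprise-scaled Q-update (sketch).

    setpoint: running latent mean (the "moving latent setpoint"), updated by EMA.
    Surprise is each sample's squared deviation from the setpoint; it weights the
    per-sample TD loss so novel (early-exploration) experience drives larger updates.
    """
    s, a, r, s2, done = batch
    with torch.no_grad():
        z = encoder(s)
        surprise = (z - setpoint).pow(2).mean(dim=1)          # per-sample deviation
        setpoint.mul_(1 - beta).add_(beta * z.mean(dim=0))    # EMA setpoint update
        target = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    td_error = target - q
    # Weight each sample's squared TD error by its surprise intensity, so the
    # gradient magnitude grows with both TD error and surprise.
    loss = (surprise * td_error.pow(2)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Minimal usage on random CartPole-shaped data (obs_dim=4, 2 actions).
obs_dim, n_actions = 4, 2
enc, qn, tn = Encoder(obs_dim), QNet(obs_dim, n_actions), QNet(obs_dim, n_actions)
tn.load_state_dict(qn.state_dict())
opt = torch.optim.Adam(qn.parameters(), lr=1e-3)
setpoint = torch.zeros(32)
batch = (torch.randn(8, obs_dim), torch.randint(0, n_actions, (8,)),
         torch.randn(8), torch.randn(8, obs_dim), torch.zeros(8))
disrc_like_update(qn, tn, enc, setpoint, batch, opt)

Weighting the squared TD error by surprise makes the resulting gradient proportional to both quantities, matching the abstract's description at the level of update direction; the specific functional form used in DISRC is an assumption here.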

Published

2025-09-25

Issue

Section

College of Engineering and Computing: Department of Electrical and Computer Engineering