On the Surprising Effectiveness of Masking Updates in Adaptive Optimizers
Abstract
Random parameter update masking achieves superior optimization for large language models by inducing curvature-dependent regularization, with a momentum-aligned variant delivering significant performance improvements over state-of-the-art adaptive optimizers.
Training large language models (LLMs) relies almost exclusively on dense adaptive optimizers with increasingly sophisticated preconditioners. We challenge this practice by showing that randomly masking parameter updates can be highly effective: a masked variant of RMSProp consistently outperforms recent state-of-the-art optimizers. Our analysis reveals that random masking induces a curvature-dependent geometric regularization that smooths the optimization trajectory. Motivated by this finding, we introduce Momentum-aligned gradient masking (Magma), which modulates the masked updates using momentum-gradient alignment. Extensive LLM pre-training experiments show that Magma is a simple drop-in replacement for adaptive optimizers, with consistent gains and negligible computational overhead. Notably, at the 1B model size, Magma reduces perplexity by over 19% and 9% compared to Adam and Muon, respectively.
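A minimal sketch of the core idea the abstract describes: an RMSProp-style step whose elementwise update is multiplied by a random Bernoulli mask. The keep probability `keep_prob`, the per-step resampling of the mask, and the function name are assumptions for illustration; only the high-level idea of randomly masking parameter updates comes from the abstract, and this is not the authors' released code.

```python
import torch

@torch.no_grad()
def masked_rmsprop_step(param, grad, state, lr=1e-3, beta2=0.99, eps=1e-8, keep_prob=0.5):
    """One RMSProp update with random elementwise masking of the update (illustrative sketch)."""
    v = state.setdefault("v", torch.zeros_like(param))   # second-moment EMA
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    update = grad / (v.sqrt() + eps)                      # preconditioned step
    mask = (torch.rand_like(param) < keep_prob).to(param.dtype)  # fresh random mask each step
    param.add_(update * mask, alpha=-lr)                  # apply only the unmasked coordinates
```

In this reading, the function is called once per parameter tensor per step with a per-parameter `state` dict, and a fresh mask is drawn on every call.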
Community
Randomly masking parameter updates in adaptive optimizers induces a curvature-dependent regularization; Magma, a momentum-aligned masking variant, is a drop-in replacement that improves LLM pre-training perplexity (by about 19% vs. Adam and 9% vs. Muon) with minimal overhead.
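A hedged sketch of one plausible instantiation of the Magma variant summarized above: the randomly masked update is additionally scaled by the alignment between the momentum buffer and the current gradient. The cosine-similarity modulation, the clamp to non-negative values, and the hyperparameter defaults are assumptions; the paper's exact rule is not given on this page.

```python
import torch

@torch.no_grad()
def magma_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.99, eps=1e-8, keep_prob=0.5):
    """Momentum-aligned masked update (illustrative sketch, not the authors' exact algorithm)."""
    m = state.setdefault("m", torch.zeros_like(param))    # momentum EMA
    v = state.setdefault("v", torch.zeros_like(param))    # second-moment EMA
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    # Momentum-gradient alignment, clamped to [0, 1]: misaligned steps are suppressed (assumption).
    align = torch.cosine_similarity(m.flatten(), grad.flatten(), dim=0).clamp(min=0.0)
    mask = (torch.rand_like(param) < keep_prob).to(param.dtype)
    param.add_((m / (v.sqrt() + eps)) * mask * align, alpha=-lr)
```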
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Variance-Adaptive Muon: Accelerating LLM Pretraining with NSR-Modulated and Variance-Scaled Momentum (2026)
- TrasMuon: Trust-Region Adaptive Scaling for Orthogonalized Momentum Optimizers (2026)
- Spectra: Rethinking Optimizers for LLMs Under Spectral Anisotropy (2026)
- Mano: Restriking Manifold Optimization for LLM Training (2026)
- Why is Normalization Preferred? A Worst-Case Complexity Theory for Stochastically Preconditioned SGD under Heavy-Tailed Noise (2026)
- Do We Need Adam? Surprisingly Strong and Sparse Reinforcement Learning with SGD in LLMs (2026)
- FISMO: Fisher-Structured Momentum-Orthogonalized Optimizer (2026)