Self-Improving Pretraining: using post-trained models to pretrain better models
Abstract
A reinforcement learning-based pretraining method improves language model safety, factuality, and quality by judging candidate generations drawn from model rollouts, original suffixes, and rewritten suffixes.
Ensuring safety, factuality, and overall quality in the generations of large language models is a critical challenge, especially as these models are increasingly deployed in real-world applications. The prevailing approach to addressing these issues is to collect expensive, carefully curated datasets and apply multiple stages of fine-tuning and alignment. However, even this complex pipeline cannot guarantee that patterns learned during pretraining are corrected. Addressing these issues during pretraining is therefore crucial, since pretraining shapes a model's core behaviors and determines whether unsafe or hallucinated outputs become deeply embedded. To tackle this, we introduce a new pretraining method that streams documents and uses reinforcement learning (RL) to improve the next K generated tokens at each step. A strong, post-trained model judges candidate generations (model rollouts, the original suffix, and a rewritten suffix) for quality, safety, and factuality. Early in training, the process relies on the original and rewritten suffixes; as the model improves, RL rewards high-quality rollouts. This approach builds higher-quality, safer, and more factual models from the ground up. In experiments, our method yields 36.2% and 18.5% relative improvements over standard pretraining in factuality and safety, respectively, and up to an 86.3% win-rate improvement in overall generation quality.
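The abstract describes the training loop only at a high level, so the sketch below is a non-authoritative illustration of how one such step could look rather than the authors' implementation. The class names (PolicyModel, JudgeModel), the helpers (rewrite_suffix, pretrain_step), the value of K, and the progress-based down-weighting of rollouts are all hypothetical placeholders assumed for the sake of the example.

```python
# Minimal sketch of the streaming RL pretraining step described in the abstract.
# All names and mechanisms below are assumptions, not the paper's published code.
import random

K = 32  # number of next tokens improved at each step (illustrative value)


class PolicyModel:
    """Stand-in for the model being pretrained."""

    def generate(self, prefix_tokens, num_tokens):
        # Placeholder rollout: real code would sample num_tokens from the LM.
        return [random.randint(0, 50_000) for _ in range(num_tokens)]

    def rl_update(self, prefix_tokens, candidate_tokens, reward):
        # Placeholder for the RL (e.g. policy-gradient) update on these tokens.
        pass


class JudgeModel:
    """Stand-in for the strong post-trained judge."""

    def score(self, prefix_tokens, candidate_tokens):
        # Placeholder quality/safety/factuality score in [0, 1];
        # real code would prompt the post-trained judge model.
        return random.random()


def rewrite_suffix(judge, prefix_tokens, suffix_tokens):
    # Placeholder: the post-trained model produces an improved version
    # of the document's original suffix.
    return list(suffix_tokens)


def pretrain_step(policy, judge, doc_tokens, progress):
    """One streaming step: score three candidates for the next K tokens and
    update the policy toward the best one. `progress` in [0, 1] is an assumed
    schedule that shifts weight from the (re)written suffixes early in
    training to the model's own rollouts later on."""
    split = max(1, len(doc_tokens) - K)
    prefix, original_suffix = doc_tokens[:split], doc_tokens[split:]

    candidates = {
        "rollout": policy.generate(prefix, K),
        "original": original_suffix,
        "rewritten": rewrite_suffix(judge, prefix, original_suffix),
    }
    scores = {name: judge.score(prefix, toks) for name, toks in candidates.items()}

    # Assumed mechanism: early on, rollouts are down-weighted so learning leans
    # on the original and rewritten suffixes; as the model improves,
    # high-quality rollouts earn full reward.
    scores["rollout"] *= progress

    best = max(scores, key=scores.get)
    policy.rl_update(prefix, candidates[best], reward=scores[best])
    return best, scores


if __name__ == "__main__":
    policy, judge = PolicyModel(), JudgeModel()
    doc = [random.randint(0, 50_000) for _ in range(256)]
    print(pretrain_step(policy, judge, doc, progress=0.1))
```

In a real system the stubs would wrap the pretrained policy, the post-trained judge, and a proper RL update; the point of the sketch is only the three-way candidate comparison and the gradual shift from suffix supervision to rollout rewards.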
Community
Streaming pretraining uses a strong post-trained model to judge candidate continuations and RL to reward the best ones, improving quality, safety, and factuality already during pretraining.
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- Save the Good Prefix: Precise Error Penalization via Process-Supervised RL to Enhance LLM Reasoning (2026)
- P2S: Probabilistic Process Supervision for General-Domain Reasoning Question Answering (2026)
- Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models (2026)
- CE-RM: A Pointwise Generative Reward Model Optimized via Two-Stage Rollout and Unified Criteria (2026)
- PretrainZero: Reinforcement Active Pretraining (2025)
- Long-Chain Reasoning Distillation via Adaptive Prefix Alignment (2026)
- Gained in Translation: Privileged Pairwise Judges Enhance Multilingual Reasoning (2026)