Thinking-Free Policy Initialization Makes Distilled Reasoning Models More Effective and Efficient Reasoners Paper • 2509.26226 • Published Sep 30, 2025 • 32
EntroPIC: Towards Stable Long-Term Training of LLMs via Entropy Stabilization with Proportional-Integral Control Paper • 2511.15248 • Published 23 days ago • 6
Exploration and Anti-Exploration with Distributional Random Network Distillation Paper • 2401.09750 • Published Jan 18, 2024
A Two-stage Reinforcement Learning-based Approach for Multi-entity Task Allocation Paper • 2407.00496 • Published Jun 29, 2024
BATON: Aligning Text-to-Audio Model with Human Preference Feedback Paper • 2402.00744 • Published Feb 1, 2024
Novelty-Guided Data Reuse for Efficient and Diversified Multi-Agent Reinforcement Learning Paper • 2412.15517 • Published Dec 20, 2024
Novelty-based Sample Reuse for Continuous Robotics Control Paper • 2410.13490 • Published Oct 17, 2024
CDSA: Conservative Denoising Score-based Algorithm for Offline Reinforcement Learning Paper • 2406.07541 • Published Jun 11, 2024
Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model Paper • 2311.13231 • Published Nov 22, 2023 • 29