yolay/SPEAR-Sokoban-DrBoT-GiGPO-3B
4B
Checkpoints for "Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning" (arXiv:2509.22601).