arXiv:2512.06835

Decouple to Generalize: Context-First Self-Evolving Learning for Data-Scarce Vision-Language Reasoning

Published on Dec 7 · Submitted by TINGYU LI on Dec 9 · OpenDataLab-RAISER


Authors:

Tingyu Li ,

Jingxuan Wei ,

Conghui He ,

Abstract

DoGe, a dual-decoupling framework, enhances vision-language models by separating context learning from problem solving, using a curriculum learning pipeline to improve reward signals and data diversity.

AI-generated summary

Recent vision-language models (VLMs) achieve remarkable reasoning through reinforcement learning (RL), which offers a feasible route to continuously self-evolving large vision-language models (LVLMs) in the era of experience. However, RL for VLMs requires abundant high-quality multimodal data, which is especially hard to obtain in specialized domains such as chemistry, earth sciences, and multimodal mathematics. Existing strategies such as synthetic data and self-rewarding mechanisms suffer from limited distributions and alignment difficulties, ultimately causing reward hacking: models exploit high-reward patterns, collapsing policy entropy and destabilizing training. We propose DoGe (Decouple to Generalize), a dual-decoupling framework that guides models to learn from context first, rather than from problem solving directly, by refocusing on the problem-context scenarios overlooked by synthetic data methods. First, by decoupling the learning process into two components (Thinker and Solver), we quantify the reward signals of this process and propose a two-stage RL post-training approach that moves from freely exploring context to practically solving tasks. Second, to increase the diversity of training data, DoGe constructs an evolving curriculum learning pipeline consisting of an expanded native domain knowledge corpus and an iteratively evolving pool of seed problems. Experiments show that our method consistently outperforms the baseline across various benchmarks, providing a scalable pathway toward self-evolving LVLMs.
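The abstract links reward hacking to collapsing policy entropy. As a rough illustration of what that quantity measures (this is not code from the paper, and the two toy distributions are made up for the example), the Shannon entropy of a policy's next-token distribution drops sharply once the policy concentrates almost all probability on a single high-reward pattern:

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical next-token distributions over a 4-token vocabulary.
exploring_policy = [0.25, 0.25, 0.25, 0.25]   # still spreads probability mass
collapsed_policy = [0.97, 0.01, 0.01, 0.01]   # locked onto one pattern

print(round(shannon_entropy(exploring_policy), 3))  # about 1.386 (= ln 4)
print(round(shannon_entropy(collapsed_policy), 3))  # about 0.168
```

A policy whose entropy has collapsed like this rarely samples alternative responses, which is one reason exploration-dependent RL training can stall.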


Community

Saito-Karuha (paper author, paper submitter)

Experiment Results 📊

We evaluate DoGe on 7 benchmarks covering:

General visual reasoning & hallucination (MMMU, MMStar, HallBench)
Specialized domain reasoning (MathVision, MathVista, ChemBench, MSEarthMCQ)

3B-level Models Performance

Method	MMMU	MMStar	HallBench	MathVision	MathVista	ChemBench	MSEarthMCQ	Avg.
InternVL2.5-2B	43.6	53.7	42.6	13.5	51.3	-	-	-
Visionary-3B	40.7	50.5	59.8	17.1	54.7	40.8	38.2	43.1
Qwen2.5VL-3B* (Base)	41.0	49.3	60.6	18.7	48.8	43.4	40.8	43.2
DoGe-3B (Iter1)	46.6	54.5	61.5	21.7	🥇57.9	45.8	🥇48.3	48.0
DoGe-3B (Iter2)	48.9	52.5	🥇62.5	23.1	54.2	🥇47.7	46.2	47.9
DoGe-3B (Iter3)	🥇50.2	🥇54.7	61.8	🥇24.2	57.0	46.9	47.3	🥇48.9
⬆️ Max Gain (vs. Base)	+9.2	+5.4	+1.9	+5.5	+9.1	+4.3	+7.5	+5.7
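The "Avg." column and the "Max Gain" row can be recomputed from the scores in the table. The sketch below assumes that "Max Gain" is the best DoGe iteration minus the base model for each benchmark, and that the gain in the "Avg." column is taken between the rounded averages. Both are inferences from the numbers, not something the comment states. The variable names are illustrative.

```python
# Scores copied from the 3B table; column order follows the table header.
BENCHMARKS = ["MMMU", "MMStar", "HallBench", "MathVision",
              "MathVista", "ChemBench", "MSEarthMCQ"]
base = [41.0, 49.3, 60.6, 18.7, 48.8, 43.4, 40.8]  # Qwen2.5VL-3B* (Base)
doge = {
    "Iter1": [46.6, 54.5, 61.5, 21.7, 57.9, 45.8, 48.3],
    "Iter2": [48.9, 52.5, 62.5, 23.1, 54.2, 47.7, 46.2],
    "Iter3": [50.2, 54.7, 61.8, 24.2, 57.0, 46.9, 47.3],
}


def mean(xs):
    return sum(xs) / len(xs)


# Per-benchmark gain: best score over the three iterations minus the base score.
max_gain = {
    name: round(max(scores[i] for scores in doge.values()) - base[i], 1)
    for i, name in enumerate(BENCHMARKS)
}

# Averages rounded to one decimal, as displayed in the table.
base_avg = round(mean(base), 1)
doge_avg = {it: round(mean(scores), 1) for it, scores in doge.items()}

# Gain in the Avg. column, computed from the rounded averages (assumption).
avg_gain = round(max(doge_avg.values()) - base_avg, 1)

print(max_gain)
print(base_avg, doge_avg, avg_gain)
```

Under these assumptions the computed values match the row and column shown in the table.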

7B-level Models Performance

Method	MMMU	MMStar	HallBench	MathVision	MathVista	ChemBench	MSEarthMCQ	Avg.
InternVL2.5-8B	48.9	62.8	50.1	22.0	64.4	-	-	-
Vision-R1-7B	46.9	60.8	66.7	🥇29.0	68.5	46.0	44.1	51.7
Qwen2.5VL-7B* (Base)	49.9	60.7	66.3	23.6	64.1	48.6	43.3	50.9
DoGe-7B (Iter1)	53.1	🥇63.2	54.4	24.3	62.1	48.7	46.4	50.3
DoGe-7B (Iter2)	50.9	60.0	🥇68.3	25.3	🥇68.8	🥇49.0	🥇46.5	52.7
DoGe-7B (Iter3)	🥇53.6	63.0	68.0	25.2	68.3	48.5	45.8	🥇53.2
⬆️ Max Gain (vs. Base)	+3.7	+2.5	+2.0	+1.7	+4.7	+0.4	+3.2	+2.3

Key Takeaways ✨

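Stable Self-Evolution: After 3 iterations, the average score exceeds the base model for both sizes (3B: 43.2 → 48.9; 7B: 50.9 → 53.2), although individual iterations are not strictly monotonic (e.g., DoGe-7B Iter1 averages 50.3)
Domain Generalization:
- 3B models: +5.7 points average gain over the base model
- 7B models: +2.3 points average gain over the base model; DoGe-7B (Iter3) also exceeds Vision-R1-7B on average (53.2 vs. 51.7)
Hallucination Reduction: Up to +1.9 (3B) and +2.0 (7B) points on HallBench, mitigating visual hallucination
Data Efficiency: Gains in data-scarce domains with limited manual annotations: up to +4.3 (3B) and +0.4 (7B) points on ChemBench (chemistry), and up to +7.5 (3B) and +3.2 (7B) points on MSEarthMCQ (earth science)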
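The iteration-to-iteration trend behind these takeaways can be checked directly from the "Avg." columns of the two tables. This is a small sketch with illustrative helper names; the averages are copied from the tables as displayed.

```python
# "Avg." values copied from the 3B and 7B tables, in order Base, Iter1, Iter2, Iter3.
avg_scores = {
    "3B": [43.2, 48.0, 47.9, 48.9],
    "7B": [50.9, 50.3, 52.7, 53.2],
}


def is_monotonic(scores):
    """True if no step in the sequence lowers the score."""
    return all(later >= earlier for earlier, later in zip(scores, scores[1:]))


def summarize(scores):
    """Return (monotonic?, gain from Base to the final iteration)."""
    return is_monotonic(scores), round(scores[-1] - scores[0], 1)


for size, scores in avg_scores.items():
    monotonic, total_gain = summarize(scores)
    print(f"{size}: monotonic={monotonic}, Base->Iter3 gain={total_gain}")
```

Both sizes end above the base model after Iter3, but neither sequence is monotonic: the 3B average dips slightly at Iter2 and the 7B average dips below the base at Iter1.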
