arXiv:2601.09733

Closing the Data Loop: Using OpenDataArena to Engineer Superior Training Datasets

Published on Dec 30, 2025

AI-generated summary

A closed-loop dataset engineering framework using value-anchored rankings and multi-dimensional analysis improves LLM training by systematically constructing high-quality datasets for mathematics reasoning and multi-domain instruction tasks.

Abstract

The construction of Supervised Fine-Tuning (SFT) datasets is a critical yet under-theorized stage in the post-training of Large Language Models (LLMs), as prevalent practices often rely on heuristic aggregation without a systematic understanding of how individual samples contribute to model performance. In this report, we propose a paradigm shift from ad-hoc curation to a closed-loop dataset engineering framework using OpenDataArena (ODA), which leverages value-anchored rankings and multi-dimensional analysis to transform value benchmarking into feedback signals guiding dataset construction. We instantiate this methodology through two new datasets: ODA-Math-460k, a specialized mathematics reasoning dataset that utilizes a novel two-stage difficulty-aware pipeline to achieve State-of-the-Art (SOTA) results on benchmarks such as AIME and HMMT, and ODA-Mixture (100k & 500k), a series of multi-domain instruction datasets built via an "Anchor-and-Patch" strategy that outperforms significantly larger open-source baselines. Our empirical results demonstrate that ODA-driven datasets significantly improve both domain-specific reasoning and general utility while achieving superior data efficiency, validating a transition toward data-centric AI where transparent evaluation serves as the primary engine for engineering high-quality training data.
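The abstract names three mechanisms (value-anchored ranking, a two-stage difficulty-aware pipeline, and "Anchor-and-Patch" mixing) without detailing them. As a minimal, hypothetical sketch of how such pieces could compose, the Python below filters a sample pool by difficulty and then assembles a multi-domain mixture around an anchor domain. Sample, value_score, and all thresholds and quotas are illustrative assumptions, not the paper's actual pipeline or the ODA API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Sample:
    prompt: str
    response: str
    domain: str        # e.g. "math", "code", "chat"
    difficulty: float  # 0.0 (easy) .. 1.0 (hard), e.g. 1 - solver pass rate


def value_score(sample: Sample) -> float:
    # Toy stand-in for a value-anchored ranking signal; in the paper this
    # feedback comes from ODA's benchmark evaluation, not a local heuristic.
    return sample.difficulty


def two_stage_difficulty_filter(pool, lo=0.3, hi=0.9):
    # Stage 1: keep samples inside a target difficulty band.
    banded = [s for s in pool if lo <= s.difficulty <= hi]
    # Stage 2: rank survivors by (proxy) value and keep the top half.
    ranked = sorted(banded, key=value_score, reverse=True)
    return ranked[: max(1, len(ranked) // 2)]


def anchor_and_patch(pool, anchor_domain, quotas):
    # Start from a high-value "anchor" subset in one domain, then "patch"
    # every other domain up to its quota with its best-ranked samples.
    by_domain = defaultdict(list)
    for s in sorted(pool, key=value_score, reverse=True):
        by_domain[s.domain].append(s)
    mixture = by_domain[anchor_domain][: quotas[anchor_domain]]
    for domain, quota in quotas.items():
        if domain != anchor_domain:
            mixture += by_domain[domain][:quota]
    return mixture
```

Per the abstract, the real loop is closed: a candidate mixture is trained on, re-benchmarked in ODA, and the resulting value rankings replace a local proxy like value_score on the next construction round.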
