arxiv:2509.16975

Interpretable Audio Editing Evaluation via Chain-of-Thought Difference-Commonality Reasoning with Multimodal LLMs

Published on Sep 21, 2025

AI-generated summary

Audio editing evaluation framework leveraging multimodal large language models with natural language-based assessment, fine-tuning tasks, and chain-of-thought prompting for improved accuracy and interpretability.

Abstract

Automatic mean opinion score (MOS) prediction provides a more perceptual alternative to objective metrics, offering deeper insights into the evaluated models. With the rapid progress of multimodal large language models (MLLMs), their enhanced perceptual and reasoning abilities enable more comprehensive and interpretable audio quality assessment. In this work, we tackle the challenging task of audio editing evaluation and propose the first natural language-based automated evaluation framework built on MLLMs. Our approach introduces two fine-tuning tasks to boost multi-audio understanding, combined with Chain-of-Thought prompting and lightweight instruction tuning to enhance step-by-step reasoning. Experiments demonstrate that our framework delivers accurate, interpretable, text-based editing evaluation that aligns closely with human judgments and objective metrics while substantially improving over baselines. The code and demo are available at https://github.com/NKU-HLT/Eval_Reasoning.
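The abstract only outlines the idea of difference-commonality chain-of-thought evaluation. The sketch below is a rough, hypothetical illustration of what such a pipeline could look like: it builds a prompt that asks an MLLM to reason over the commonalities and differences between the original and edited audio before emitting a MOS, then parses that score from the free-form response. The prompt wording, the 1-5 scale, and the `query_mllm` callable are illustrative assumptions, not the authors' actual prompts or API; the real implementation is in the linked repository.

```python
import re

# Hypothetical sketch of difference-commonality chain-of-thought prompting for
# audio editing evaluation with a multimodal LLM. The prompt text, the 1-5 MOS
# scale, and the query_mllm placeholder are assumptions for illustration only.

COT_PROMPT = """You are given two audio clips: the ORIGINAL and the EDITED version,
plus the text instruction describing the intended edit.

Edit instruction: {instruction}

Reason step by step:
1. Commonality: describe what the two clips share (content, timbre, background).
2. Difference: describe what changed between them.
3. Judge whether the difference matches the edit instruction and whether the
   commonalities were preserved without unintended artifacts.
4. Finish with a line of the form "MOS: <score>" where <score> is 1-5.
"""


def build_prompt(instruction: str) -> str:
    """Fill the chain-of-thought template with the editing instruction."""
    return COT_PROMPT.format(instruction=instruction)


def parse_mos(response: str) -> float | None:
    """Extract the final MOS score from the model's free-form reasoning."""
    match = re.search(r"MOS:\s*([1-5](?:\.\d+)?)", response)
    return float(match.group(1)) if match else None


def evaluate_edit(query_mllm, original_audio, edited_audio, instruction: str):
    """query_mllm stands in for any MLLM call that accepts two audio inputs and
    a text prompt and returns a text response."""
    prompt = build_prompt(instruction)
    response = query_mllm(audios=[original_audio, edited_audio], prompt=prompt)
    return parse_mos(response), response
```

Parsing a final "MOS:" line is one simple way to reconcile text-based, interpretable reasoning with the numeric scores needed for comparison against human judgments and objective metrics.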
