AudioMCQ-Weak-to-Strong


Overview

This repository contains the Weak-to-Strong model checkpoint from our paper "Measuring Audio's Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models". This model demonstrates state-of-the-art performance on audio question-answering benchmarks through our novel audio-contribution-aware post-training approach.

Training Paradigm

The Weak-to-Strong training paradigm follows a two-stage approach:

Stage 1: SFT on weak audio-contribution data
Stage 2: GRPO (RL) on strong audio-contribution data

This paradigm begins with supervised fine-tuning on samples with weak audio contribution (where visual or textual cues provide substantial information), then applies reinforcement learning on challenging strong audio-contribution samples to enhance audio-specific understanding capabilities.
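The data partitioning behind the two stages can be sketched as follows. This is a hypothetical illustration, not the paper's exact procedure: the `audio_contribution` field, the threshold, and the sample format are all assumptions.

```python
# Hypothetical sketch: partition training samples into weak and strong
# audio-contribution subsets. Field name and threshold are assumptions.
def split_by_audio_contribution(samples, threshold=0.5):
    """Low scores mean text cues largely answer the question (weak);
    high scores mean the audio itself is essential (strong)."""
    weak = [s for s in samples if s["audio_contribution"] < threshold]
    strong = [s for s in samples if s["audio_contribution"] >= threshold]
    return weak, strong

samples = [
    {"id": 1, "audio_contribution": 0.1},  # answerable from text -> Stage 1 (SFT)
    {"id": 2, "audio_contribution": 0.9},  # requires listening -> Stage 2 (GRPO)
]
weak, strong = split_by_audio_contribution(samples)
```

Stage 1 (SFT) would then train on `weak`, and Stage 2 (GRPO) on `strong`.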

Model Details

  • Base Model: Qwen2.5-Omni
  • Training Data: AudioMCQ Dataset (571k samples)
  • Training Stages:
    • Stage 1 (SFT): Weak audio-contribution subset
    • Stage 2 (GRPO): Strong audio-contribution subset
  • System Prompt: "You are an audio understanding model that answers multiple choice questions based on audio content."

Usage

Our model loading and usage methods are identical to those of Qwen2.5-Omni. Please refer to the official documentation.

Input Format

The evaluation input prompt structure is:

[Question] Please choose the answer from the following options: ['Option1', 'Option2', 'Option3', 'Option4']. Output the final answer in <answer> </answer>.
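A minimal helper that assembles this prompt from a question and its options (the function name is ours, for illustration):

```python
# Build the evaluation prompt in the format shown above.
def build_prompt(question, options):
    opts = ", ".join(f"'{o}'" for o in options)
    return (
        f"{question} Please choose the answer from the following options: "
        f"[{opts}]. Output the final answer in <answer> </answer>."
    )

prompt = build_prompt(
    "What instrument is playing?",
    ["Piano", "Violin", "Drums", "Flute"],
)
```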

Example Usage

# 1. Load the checkpoint following the Qwen2.5-Omni official documentation
# 2. Set the system prompt: "You are an audio understanding model that answers multiple choice questions based on audio content."
# 3. Build the user message from the audio clip and the input structure above
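Since the model is prompted to wrap its choice in `<answer> </answer>` tags, the generated text can be post-processed with a simple regex. This helper is a sketch of our own, not part of the released code:

```python
import re

# Extract the option wrapped in <answer> </answer> from a generation,
# returning None when the model did not emit the tags.
def extract_answer(generation):
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", generation, re.DOTALL)
    return match.group(1) if match else None

print(extract_answer("The clip is a piano piece. <answer>Piano</answer>"))
```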

Performance

The Weak-to-Strong model achieves competitive performance across multiple benchmarks:

  • MMAU-test-mini: Strong accuracy on general audio understanding
  • MMAR: Robust performance on music understanding tasks
  • MMSU: Solid results on speech understanding
  • Strong Audio-Contribution Splits: Enhanced performance on challenging samples requiring deep audio understanding

For detailed performance metrics and comparisons, please refer to our paper.

Citation

If you find this model useful in your research, please cite:

@article{he2025audiomcq,
  title={Measuring Audio's Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models},
  author={He, Haolin and others},
  journal={arXiv preprint arXiv:2509.21060},
  year={2025}
}

Acknowledgements

We thank the organizers of DCASE 2025 and the research community for their valuable feedback and support.

Model Format

  • Format: Safetensors
  • Model size: 11B parameters
  • Tensor type: BF16