AudioMCQ-Weak-to-Strong


Overview

This repository contains the Weak-to-Strong model checkpoint from our paper "Measuring Audio's Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models". This model demonstrates state-of-the-art performance on audio question-answering benchmarks through our novel audio-contribution-aware post-training approach.

Training Paradigm

The Weak-to-Strong training paradigm follows a two-stage approach:

Stage 1: SFT on weak audio-contribution data
Stage 2: GRPO (RL) on strong audio-contribution data

This paradigm begins with supervised fine-tuning on samples with weak audio contribution (where visual or textual cues provide substantial information), then applies reinforcement learning on challenging strong audio-contribution samples to enhance audio-specific understanding capabilities.
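The data partitioning behind the two stages can be sketched as follows. This is a hypothetical illustration, not the paper's exact procedure: the `audio_contribution` field, the threshold, and the sample format are all assumptions.

```python
# Hypothetical sketch: partition training samples into weak and strong
# audio-contribution subsets. Field name and threshold are assumptions.
def split_by_audio_contribution(samples, threshold=0.5):
    """Low scores mean text cues largely answer the question (weak);
    high scores mean the audio itself is essential (strong)."""
    weak = [s for s in samples if s["audio_contribution"] < threshold]
    strong = [s for s in samples if s["audio_contribution"] >= threshold]
    return weak, strong

samples = [
    {"id": 1, "audio_contribution": 0.1},  # answerable from text -> Stage 1 (SFT)
    {"id": 2, "audio_contribution": 0.9},  # requires listening -> Stage 2 (GRPO)
]
weak, strong = split_by_audio_contribution(samples)
```

Stage 1 (SFT) would then train on `weak`, and Stage 2 (GRPO) on `strong`.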

Model Details

  • Base Model: Qwen2.5-Omni
  • Training Data: AudioMCQ Dataset (571k samples)
  • Training Stages:
    • Stage 1 (SFT): Weak audio-contribution subset
    • Stage 2 (GRPO): Strong audio-contribution subset
  • System Prompt: "You are an audio understanding model that answers multiple choice questions based on audio content."

Usage

Our model loading and usage methods are identical to those of Qwen2.5-Omni. Please refer to the official documentation.

Input Format

The evaluation input prompt structure is:

[Question] Please choose the answer from the following options: ['Option1', 'Option2', 'Option3', 'Option4']. Output the final answer in <answer> </answer>.
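A minimal helper that assembles this prompt from a question and its options (the function name is ours, for illustration):

```python
# Build the evaluation prompt in the format shown above.
def build_prompt(question, options):
    opts = ", ".join(f"'{o}'" for o in options)
    return (
        f"{question} Please choose the answer from the following options: "
        f"[{opts}]. Output the final answer in <answer> </answer>."
    )

prompt = build_prompt(
    "What instrument is playing?",
    ["Piano", "Violin", "Drums", "Flute"],
)
```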

Example Usage

# 1. Load the checkpoint following the Qwen2.5-Omni official documentation
# 2. Set the system prompt: "You are an audio understanding model that answers multiple choice questions based on audio content."
# 3. Build the user message from the audio clip and the input structure above
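Since the model is prompted to wrap its choice in `<answer> </answer>` tags, the generated text can be post-processed with a simple regex. This helper is a sketch of our own, not part of the released code:

```python
import re

# Extract the option wrapped in <answer> </answer> from a generation,
# returning None when the model did not emit the tags.
def extract_answer(generation):
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", generation, re.DOTALL)
    return match.group(1) if match else None

print(extract_answer("The clip is a piano piece. <answer>Piano</answer>"))
```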

Performance

The Weak-to-Strong model achieves competitive performance across multiple benchmarks:

  • MMAU-test-mini: Strong accuracy on general audio understanding
  • MMAR: Robust performance on music understanding tasks
  • MMSU: Solid results on speech understanding
  • Strong Audio-Contribution Splits: Enhanced performance on challenging samples requiring deep audio understanding

For detailed performance metrics and comparisons, please refer to our paper.

Citation

If you find this model useful in your research, please cite:

@article{he2025audiomcq,
  title={Measuring Audio's Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models},
  author={He, Haolin and others},
  journal={arXiv preprint arXiv:2509.21060},
  year={2025}
}

Acknowledgements

We thank the organizers of DCASE 2025 and the research community for their valuable feedback and support.

Model Format

  • Format: Safetensors
  • Model size: 11B parameters
  • Tensor type: BF16