XACLE-TMU-2026

Large Audio Language Model for Audio-Text Alignment Score Prediction

This model was developed for the XACLE Challenge by TMU.

For detailed usage instructions, please refer to the GitHub repository.

Model Description

XACLE-TMU is a Large Audio Language Model (LALM) that predicts alignment scores between audio and text captions. The model combines:

  • BEATs audio encoder (90M params, frozen)
  • SwiGLU MLP audio projection with gated residual (10M params; see the sketch below)
  • Qwen2.5-0.5B-Instruct LLM backbone (494M params)
  • MLP Score Head for score prediction

Total: ~594M parameters
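
The audio projection bridges the BEATs feature space and the LLM embedding space, reducing 500 frames of 768-dim features to 100 tokens of 896-dim embeddings. Below is a minimal sketch of one way a SwiGLU MLP with a gated residual could do this; the 5x average pooling, hidden width, and gate design are assumptions for illustration, not the released implementation.

import torch
import torch.nn as nn

class SwiGLUProjection(nn.Module):
    """Sketch of a SwiGLU MLP audio projection with a gated residual.

    Maps BEATs features [B, 500, 768] to LLM-sized embeddings [B, 100, 896].
    The 5x temporal pooling, hidden width, and gate design are assumptions,
    not the released implementation.
    """

    def __init__(self, in_dim=768, out_dim=896, hidden_dim=2048, pool=5):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel_size=pool, stride=pool)  # 500 -> 100 frames
        self.in_proj = nn.Linear(in_dim, out_dim)                # residual path
        self.w_gate = nn.Linear(out_dim, hidden_dim)             # SwiGLU gate branch
        self.w_up = nn.Linear(out_dim, hidden_dim)               # SwiGLU value branch
        self.w_down = nn.Linear(hidden_dim, out_dim)
        self.gate = nn.Parameter(torch.zeros(1))                 # learned residual gate

    def forward(self, x):                                   # x: [B, 500, 768]
        x = self.pool(x.transpose(1, 2)).transpose(1, 2)     # [B, 100, 768]
        h = self.in_proj(x)                                  # [B, 100, 896]
        swiglu = self.w_down(nn.functional.silu(self.w_gate(h)) * self.w_up(h))
        return h + torch.tanh(self.gate) * swiglu            # gated residual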

Performance

Split        SRCC
Validation   0.6746
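
SRCC is the Spearman rank correlation coefficient between predicted alignment scores and human ratings on the validation split. A minimal illustration of how the metric is computed; the numbers are placeholders, not the challenge's evaluation script.

from scipy.stats import spearmanr

# Spearman rank correlation between model predictions and human ratings.
predicted = [7.1, 2.3, 5.5, 9.0]
human     = [6.8, 1.9, 6.2, 8.7]
srcc, _ = spearmanr(predicted, human)
print(f"SRCC: {srcc:.4f}")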

Usage

from tmu_xacle.model.xacle_model import XACLEModel

# Load model
model = XACLEModel.from_pretrained("Atotti/xacle-tmu-2026", device="cuda")

# Predict alignment score
score = model.predict("audio.wav", "A dog barking in the park")
print(f"Alignment Score: {score:.2f}")  # Score in [0, 10]

Architecture

Audio Waveform (16kHz)
       |
  BEATs Encoder (frozen)
  [B, 500, 768]
       |
  SwiGLU MLP + Gated Residual
  [B, 100, 896]
       |
  [TEXT] [AUDIO_START] [AUDIO] [AUDIO_END] [SCORE] [EOS]
       |
  Qwen2.5-0.5B-Instruct
       |
  [SCORE] Token Hidden State
  [B, 896]
       |
  MLP Score Head (896 -> 512 -> 128)
       |
  Linear (128 -> 1)
       |
  Alignment Score [-1, 1] -> [0, 10]
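
A minimal sketch of the score head following the diagram above; only the layer sizes and the [-1, 1] -> [0, 10] rescaling come from the figure, while the activation functions and the tanh squashing are assumptions.

import torch
import torch.nn as nn

class ScoreHead(nn.Module):
    """Sketch of the MLP score head (896 -> 512 -> 128 -> 1).

    Activations and tanh squashing are assumptions; layer sizes and the
    [-1, 1] -> [0, 10] rescaling follow the architecture diagram.
    """

    def __init__(self, in_dim=896):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 512), nn.GELU(),
            nn.Linear(512, 128), nn.GELU(),
            nn.Linear(128, 1),
        )

    def forward(self, score_hidden):                           # [B, 896] hidden state of [SCORE]
        raw = torch.tanh(self.mlp(score_hidden)).squeeze(-1)   # score in [-1, 1]
        return (raw + 1.0) * 5.0                               # rescaled to [0, 10]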

Training

The model was trained in 3 stages:

  1. Stage 1: Audio Captioning Pretraining (skipped, using pretrained components)
  2. Stage 2: CLAP Pseudo-Label Pretraining
  3. Stage 3: XACLE Fine-tuning with ListNet loss

Training details:

  • Optimizer: AdamW (lr=6.2e-6)
  • Loss: ListNet Top-1 Loss (see the sketch after this list)
  • SpecAugment: freqm=15, timem=30
  • Epochs: 50
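
ListNet Top-1 loss treats the scores of a list of audio-caption pairs as a top-1 probability distribution and minimizes the cross-entropy between the distributions induced by the ground-truth and predicted scores. Below is a generic sketch of this loss; the temperature, scaling, and list construction used in the actual training code are not specified here.

import torch
import torch.nn.functional as F

def listnet_top1_loss(pred_scores, target_scores):
    """Generic ListNet Top-1 loss for one ranked list.

    pred_scores, target_scores: [B] tensors of scores over the same B items.
    Cross-entropy between the softmax (top-1) distributions of targets and
    predictions; scaling choices of the actual training code are unknown.
    """
    p_target = F.softmax(target_scores, dim=-1)
    log_p_pred = F.log_softmax(pred_scores, dim=-1)
    return -(p_target * log_p_pred).sum()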

Citation

Work in progress.

License

CC-BY-NC-4.0
