TagSpeech
TagSpeech is a fully end-to-end multi-speaker ASR and diarization model.
Given a raw waveform of a multi-speaker conversation, the model directly outputs speaker-attributed transcriptions with timestamps and gender labels, without requiring a separate diarization or clustering stage.
Paper: TagSpeech: Unified E2E Multi-Speaker ASR and Diarization Model
Available checkpoints:
- English (AMI): AudenAI/TagSpeech-AMI
- Mandarin (AliMeeting): AudenAI/TagSpeech-Alimeeting
What Can This Model Do?
- Multi-speaker speech recognition
- Speaker diarization
- Timestamped utterances
- Gender prediction
- Single forward pass (no external diarization model required)
The model is designed for meeting-style conversational audio with overlapping speakers.
Quick Inference
import torch
from model import TagSpeech
from utils.xml_utils import xml_to_json

device = "cuda" if torch.cuda.is_available() else "cpu"
model = TagSpeech.from_pretrained("AudenAI/TagSpeech-Alimeeting").to(device)

wav_files = ["assets/test_example_Alimeeting_R8009_M8028-11-0-116.wav"]

audio_token = model.config.audio_token
messages = [
    [{"role": "user", "content": f"<text>{audio_token}</text>\n<speaker>{audio_token}</speaker>"}]
    for _ in wav_files
]

outputs = model.generate(wav_files, messages, max_new_tokens=800, num_beams=1, do_sample=False)

# Print outputs in XML and JSON formats
for i, output in enumerate(outputs, 1):
    print(f"\n{'='*80}\nOutput {i}/{len(outputs)} - XML:\n{'='*80}\n{output}\n{'='*80}")
    json_output = xml_to_json(output)
    if json_output:
        print(f"\nOutput {i}/{len(outputs)} - JSON:\n{'='*80}\n{json_output}\n{'='*80}")
    else:
        print(f"\nWarning: Output {i} could not be parsed as valid XML\n{'='*80}")
Example Output
{
  "segments": [
    {
      "start": 0.0,
      "end": 1.52,
      "text": "嗯行行行",
      "speaker_id": "1",
      "speaker_gender": "female"
    },
    {
      "start": 1.42,
      "end": 6.48,
      "text": "然后你们需要什么材料需要哪些材料现在就可以着手准备了吗",
      "speaker_id": "2",
      "speaker_gender": "male"
    },
    {
      "start": 6.35,
      "end": 6.96,
      "text": "嗯",
      "speaker_id": "1",
      "speaker_gender": "female"
    },
    {
      "start": 6.95,
      "end": 8.62,
      "text": "就刚刚咱们说的这个",
      "speaker_id": "2",
      "speaker_gender": "male"
    }
  ]
}
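Because each segment carries start/end times, a speaker ID, and a gender label, simple diarization statistics can be computed directly from the parsed output. The sketch below assumes xml_to_json returns a JSON string with exactly the fields shown above (if it returns a dict, drop the json.loads call); it totals speaking time per speaker and lists overlapping cross-speaker regions.

import json

def diarization_stats(json_output):
    # Parse the structure shown above; field names assumed exactly as documented
    segments = json.loads(json_output)["segments"]

    # Total speaking time per speaker_id
    talk_time = {}
    for seg in segments:
        spk = seg["speaker_id"]
        talk_time[spk] = talk_time.get(spk, 0.0) + (seg["end"] - seg["start"])

    # Time regions where two different speakers overlap
    overlaps = []
    for i, a in enumerate(segments):
        for b in segments[i + 1:]:
            if a["speaker_id"] != b["speaker_id"] and a["start"] < b["end"] and b["start"] < a["end"]:
                overlaps.append((a["speaker_id"], b["speaker_id"],
                                 max(a["start"], b["start"]),
                                 min(a["end"], b["end"])))
    return talk_time, overlaps

talk_time, overlaps = diarization_stats(json_output)
print(talk_time)  # for the example above: {"1": 2.13, "2": 6.73}
print(overlaps)   # [("1", "2", 1.42, 1.52), ("2", "1", 6.35, 6.48), ("1", "2", 6.95, 6.96)]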
Model Characteristics
- Input: Raw audio waveform (16 kHz recommended; see the resampling sketch after this list)
- Output: Speaker-attributed transcription with timestamps, emitted as XML that can be parsed into JSON
- Backend LLM: Qwen2.5-7B-Instruct (frozen)
- Architecture: Dual encoders (semantic + voice) with numeric time anchors
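Since 16 kHz input is recommended, recordings at other sample rates are best resampled before inference. A minimal sketch using torchaudio (torchaudio is not a stated dependency of this repository; any resampler will do):

import torchaudio

def to_16k_mono(in_path, out_path):
    wav, sr = torchaudio.load(in_path)
    # Collapse multi-channel recordings to mono
    wav = wav.mean(dim=0, keepdim=True)
    # Resample to the recommended 16 kHz if needed
    if sr != 16000:
        wav = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)(wav)
    torchaudio.save(out_path, wav, 16000)

# Example: prepare a recording before passing its path to model.generate
# to_16k_mono("meeting_44k.wav", "meeting_16k.wav")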
Limitations
This checkpoint is trained only on approximately 103 hours of AliMeeting speech and is primarily optimized for noisy, far-field, multi-speaker meeting scenarios. Performance may degrade on out-of-domain audio (e.g., clean close-talk speech or other acoustic conditions). For best results, we recommend fine-tuning on in-domain data.
The model is recommended for short-form inference (≤ 30 seconds per input). For long-form recordings, chunk-based inference is required; chunking and post-processing logic are not provided in this repository (a rough sketch follows).
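The sketch below is illustrative only: it cuts a recording into ≤ 30-second pieces so each piece can be run through the model separately, leaving the timestamps to be shifted by each chunk's offset. Chunk boundaries can split words, and speaker IDs are assigned per chunk, so reconciling speakers across chunks requires additional matching that is not covered here (torchaudio is assumed only for audio I/O).

import torchaudio

def chunk_wav(in_path, chunk_sec=30.0):
    # Split a long recording into <= 30 s pieces; returns (chunk_path, offset_seconds) pairs
    wav, sr = torchaudio.load(in_path)
    chunk_len = int(chunk_sec * sr)
    chunks = []
    for idx, start in enumerate(range(0, wav.shape[1], chunk_len)):
        path = f"{in_path}.chunk{idx}.wav"
        torchaudio.save(path, wav[:, start:start + chunk_len], sr)
        chunks.append((path, start / sr))
    return chunks

# Per-chunk inference, mirroring the Quick Inference call above; afterwards,
# add each chunk's offset to its segment timestamps and reconcile speaker IDs yourself.
# for path, offset in chunk_wav("long_meeting.wav"):
#     chunk_out = model.generate([path], messages, max_new_tokens=800, num_beams=1, do_sample=False)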
This model is designed for offline inference only and does not support real-time or streaming ASR.
Speaker attributes such as gender are currently experimental and may not produce reliable results, due to the limited amount of training data and annotation.
Citation
If you use TagSpeech in your research, please cite:
@article{huo2026tagspeech,
  title={TagSpeech: End-to-End Multi-Speaker ASR and Diarization with Fine-Grained Temporal Grounding},
  author={Huo, Mingyue and Shao, Yiwen and Zhang, Yuheng},
  journal={arXiv preprint arXiv:2601.06896},
  year={2026}
}