TagSpeech

TagSpeech is a fully end-to-end multi-speaker ASR and diarization model.
Given a raw waveform of a multi-speaker conversation, the model directly outputs speaker-attributed transcriptions with timestamps and gender labels, without requiring a separate diarization or clustering stage.

πŸ”— Paper: TagSpeech: Unified E2E Multi-Speaker ASR and Diarization Model

Available checkpoints:

  • English (AMI): AudenAI/TagSpeech-AMI
  • Mandarin (AliMeeting): AudenAI/TagSpeech-Alimeeting

πŸ” What Can This Model Do?

  • πŸŽ™οΈ Multi-speaker speech recognition
  • πŸ§‘β€πŸ€β€πŸ§‘ Speaker diarization
  • ⏱️ Timestamped utterances
  • 🚻 Gender prediction
  • 🧩 Single forward pass (no external diarization model required)

The model is designed for meeting-style conversational audio with overlapping speakers.

Quick Inference

import torch
from model import TagSpeech
from utils.xml_utils import xml_to_json

device = "cuda" if torch.cuda.is_available() else "cpu"
model = TagSpeech.from_pretrained("AudenAI/TagSpeech-Alimeeting").to(device)

wav_files = ["assets/test_example_Alimeeting_R8009_M8028-11-0-116.wav"]

# Build one chat-style prompt per audio file; the audio placeholder token
# fills both the <text> and <speaker> fields of the prompt template.
audio_token = model.config.audio_token
messages = [
    [{"role": "user", "content": f"<text>{audio_token}</text>\n<speaker>{audio_token}</speaker>"}]
    for _ in wav_files
]

# Greedy decoding; the model emits speaker-attributed segments as XML.
outputs = model.generate(wav_files, messages, max_new_tokens=800, num_beams=1, do_sample=False)

# Print each output as XML and, when it parses cleanly, as JSON
for i, output in enumerate(outputs, 1):
    print(f"\n{'='*80}\nOutput {i}/{len(outputs)} - XML:\n{'='*80}\n{output}\n{'='*80}")
    
    json_output = xml_to_json(output)
    if json_output:
        print(f"\nOutput {i}/{len(outputs)} - JSON:\n{'='*80}\n{json_output}\n{'='*80}")
    else:
        print(f"\n⚠️  Warning: Output {i} could not be parsed as valid XML\n{'='*80}")

Example Output

{
  "segments": [
    {
      "start": 0.0,
      "end": 1.52,
      "text": "ε—―θ‘Œθ‘Œθ‘Œ",
      "speaker_id": "1",
      "speaker_gender": "female"
    },
    {
      "start": 1.42,
      "end": 6.48,
      "text": "η„ΆεŽδ½ δ»¬ιœ€θ¦δ»€δΉˆζζ–™ιœ€θ¦ε“ͺδΊ›ζζ–™ηŽ°εœ¨ε°±ε―δ»₯着手准倇了吗",
      "speaker_id": "2",
      "speaker_gender": "male"
    },
    {
      "start": 6.35,
      "end": 6.96,
      "text": "ε—―",
      "speaker_id": "1",
      "speaker_gender": "female"
    },
    {
      "start": 6.95,
      "end": 8.62,
      "text": "ε°±εˆšεˆšε’±δ»¬θ―΄ηš„θΏ™δΈͺ",
      "speaker_id": "2",
      "speaker_gender": "male"
    }
  ]
}
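
The segments list maps directly onto standard diarization formats. As a minimal sketch (not part of this repository), the snippet below converts the parsed JSON into RTTM speaker turns for scorers such as dscore or pyannote.metrics. Here segments_to_rttm and the file ID are hypothetical names chosen for illustration, and json_output is assumed to be the JSON string returned by xml_to_json.

import json

def segments_to_rttm(json_output: str, file_id: str) -> str:
    """Render TagSpeech JSON segments as RTTM speaker turns (hypothetical helper)."""
    segments = json.loads(json_output)["segments"]
    lines = []
    for seg in segments:
        duration = seg["end"] - seg["start"]  # RTTM stores onset + duration, not end time
        lines.append(
            f"SPEAKER {file_id} 1 {seg['start']:.2f} {duration:.2f} "
            f"<NA> <NA> spk{seg['speaker_id']} <NA> <NA>"
        )
    return "\n".join(lines)

# Example: print(segments_to_rttm(json_output, "R8009_M8028"))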

πŸ“Œ Model Characteristics

  • Input: Raw audio waveform (16 kHz recommended; see the resampling sketch after this list)
  • Output: Speaker-attributed transcriptions with timestamps, emitted as XML that can be parsed into JSON
  • Backend LLM: Qwen2.5-7B-Instruct (frozen)
  • Architecture: Dual encoders (semantic + voice) with numeric time anchors
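
Since 16 kHz input is recommended, recordings at other sample rates should be resampled before inference. A minimal sketch using torchaudio (an assumption on our part; any resampler works, and meeting.wav is a hypothetical path):

import torchaudio

# Resample an arbitrary-rate recording to 16 kHz before passing it to the model.
waveform, sr = torchaudio.load("meeting.wav")
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16000)
torchaudio.save("meeting_16k.wav", waveform, 16000)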

⚠️ Limitations

  • This checkpoint is trained on approximately 103 hours of AliMeeting speech only and is primarily optimized for noisy, far-field, multi-speaker meeting scenarios. Performance may degrade on out-of-domain audio (e.g., clean close-talk speech or other acoustic conditions). For best results, we recommend fine-tuning on in-domain data.

  • The model is recommended for short audio inputs (≤ 30 seconds). Long-form recordings require chunk-based inference; chunking and post-processing logic are not provided in this repository, but a naive sketch is given after this list.

  • This model is designed for offline inference only and does not support real-time or streaming ASR.

  • Speaker attributes such as gender are currently experimental and may be unreliable due to the limited amount of training data and annotations.
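
As referenced above, one naive long-form strategy is to split the recording into fixed 30-second windows, run the Quick Inference pipeline per window, and shift timestamps back onto the global timeline. The sketch below is an illustration under assumptions, not the authors' method: it reuses model, audio_token, and xml_to_json from the Quick Inference example, assumes xml_to_json returns a JSON string matching the schema above, and deliberately omits overlap handling. Note that speaker IDs are not consistent across chunks; cross-chunk speaker re-identification is left to the reader.

import json
import torchaudio

CHUNK_SECONDS = 30  # stay within the recommended ≤ 30 s input length

def chunk_wav(path, target_sr=16000):
    """Split a long recording into fixed-length 16 kHz chunks; return paths and offsets."""
    waveform, sr = torchaudio.load(path)
    if sr != target_sr:
        waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=target_sr)
    chunk_len = CHUNK_SECONDS * target_sr
    paths, offsets = [], []
    for i, start in enumerate(range(0, waveform.shape[1], chunk_len)):
        piece_path = f"chunk_{i:04d}.wav"
        torchaudio.save(piece_path, waveform[:, start:start + chunk_len], target_sr)
        paths.append(piece_path)
        offsets.append(start / target_sr)
    return paths, offsets

def transcribe_long(path):
    """Run per-chunk inference and merge segments onto a global timeline."""
    merged = []
    for piece_path, offset in zip(*chunk_wav(path)):
        messages = [[{"role": "user",
                      "content": f"<text>{audio_token}</text>\n<speaker>{audio_token}</speaker>"}]]
        output = model.generate([piece_path], messages,
                                max_new_tokens=800, num_beams=1, do_sample=False)[0]
        json_output = xml_to_json(output)
        if not json_output:
            continue  # skip chunks whose XML output did not parse
        for seg in json.loads(json_output)["segments"]:
            seg["start"] += offset  # shift chunk-local times to the global timeline
            seg["end"] += offset
            merged.append(seg)
    return merged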

Citation

If you use TagSpeech in your research, please cite:

@article{huo2026tagspeech,
  title={TagSpeech: End-to-End Multi-Speaker ASR and Diarization with Fine-Grained Temporal Grounding},
  author={Huo, Mingyue and Shao, Yiwen and Zhang, Yuheng},
  journal={arXiv preprint arXiv:2601.06896},
  year={2026}
}