TagSpeech
TagSpeech is a fully end-to-end multi-speaker ASR and diarization model.
Given a raw waveform of a multi-speaker conversation, the model directly outputs speaker-attributed transcriptions with timestamps and gender labels, without requiring a separate diarization or clustering stage.
Paper: TagSpeech: Unified E2E Multi-Speaker ASR and Diarization Model
Available checkpoints:
- English (AMI): AudenAI/TagSpeech-AMI
- Mandarin (AliMeeting): AudenAI/TagSpeech-Alimeeting
What Can This Model Do?
- Multi-speaker speech recognition
- Speaker diarization
- Timestamped utterances
- Gender prediction
- Single forward pass (no external diarization model required)
The model is designed for meeting-style conversational audio with overlapping speakers.
Quick Inference
import torch
from model import TagSpeech
from utils.xml_utils import xml_to_json

device = "cuda" if torch.cuda.is_available() else "cpu"
model = TagSpeech.from_pretrained("AudenAI/TagSpeech-Alimeeting").to(device)

wav_files = ["assets/test_example_Alimeeting_R8009_M8028-11-0-116.wav"]

audio_token = model.config.audio_token
messages = [
    [{"role": "user", "content": f"<text>{audio_token}</text>\n<speaker>{audio_token}</speaker>"}]
    for _ in wav_files
]

outputs = model.generate(wav_files, messages, max_new_tokens=800, num_beams=1, do_sample=False)

# Print outputs in XML and JSON formats
for i, output in enumerate(outputs, 1):
    print(f"\n{'='*80}\nOutput {i}/{len(outputs)} - XML:\n{'='*80}\n{output}\n{'='*80}")
    json_output = xml_to_json(output)
    if json_output:
        print(f"\nOutput {i}/{len(outputs)} - JSON:\n{'='*80}\n{json_output}\n{'='*80}")
    else:
        print(f"\nWarning: Output {i} could not be parsed as valid XML\n{'='*80}")
Example Output
{
  "segments": [
    {
      "start": 0.0,
      "end": 1.52,
      "text": "嗯行行行",
      "speaker_id": "1",
      "speaker_gender": "female"
    },
    {
      "start": 1.42,
      "end": 6.48,
      "text": "然后你们需要什么材料需要哪些材料现在就可以着手准备了吗",
      "speaker_id": "2",
      "speaker_gender": "male"
    },
    {
      "start": 6.35,
      "end": 6.96,
      "text": "嗯",
      "speaker_id": "1",
      "speaker_gender": "female"
    },
    {
      "start": 6.95,
      "end": 8.62,
      "text": "就刚刚咱们说的这个",
      "speaker_id": "2",
      "speaker_gender": "male"
    }
  ]
}
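Because each segment carries start/end times, a speaker ID, and a gender label, simple diarization statistics can be computed directly from the parsed output. The sketch below assumes xml_to_json returns a JSON string with exactly the fields shown above (if it returns a dict, drop the json.loads call); it totals speaking time per speaker and lists overlapping cross-speaker regions.

import json

def diarization_stats(json_output):
    # Parse the structure shown above; field names assumed exactly as documented
    segments = json.loads(json_output)["segments"]

    # Total speaking time per speaker_id
    talk_time = {}
    for seg in segments:
        spk = seg["speaker_id"]
        talk_time[spk] = talk_time.get(spk, 0.0) + (seg["end"] - seg["start"])

    # Time regions where two different speakers overlap
    overlaps = []
    for i, a in enumerate(segments):
        for b in segments[i + 1:]:
            if a["speaker_id"] != b["speaker_id"] and a["start"] < b["end"] and b["start"] < a["end"]:
                overlaps.append((a["speaker_id"], b["speaker_id"],
                                 max(a["start"], b["start"]),
                                 min(a["end"], b["end"])))
    return talk_time, overlaps

talk_time, overlaps = diarization_stats(json_output)
print(talk_time)  # for the example above: {"1": 2.13, "2": 6.73}
print(overlaps)   # [("1", "2", 1.42, 1.52), ("2", "1", 6.35, 6.48), ("1", "2", 6.95, 6.96)]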
Model Characteristics
- Input: Raw audio waveform (16 kHz recommended; see the resampling sketch after this list)
- Output: Speaker-attributed transcription with timestamps, emitted as XML that can be parsed into JSON
- Backend LLM: Qwen2.5-7B-Instruct (frozen)
- Architecture: Dual encoders (semantic + voice) with numeric time anchors
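Since 16 kHz input is recommended, recordings at other sample rates are best resampled before inference. A minimal sketch using torchaudio (torchaudio is not a stated dependency of this repository; any resampler will do):

import torchaudio

def to_16k_mono(in_path, out_path):
    wav, sr = torchaudio.load(in_path)
    # Collapse multi-channel recordings to mono
    wav = wav.mean(dim=0, keepdim=True)
    # Resample to the recommended 16 kHz if needed
    if sr != 16000:
        wav = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)(wav)
    torchaudio.save(out_path, wav, 16000)

# Example: prepare a recording before passing its path to model.generate
# to_16k_mono("meeting_44k.wav", "meeting_16k.wav")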
Limitations
This checkpoint is trained only on approximately 103 hours of AliMeeting speech and is primarily optimized for noisy, far-field, multi-speaker meeting scenarios. Performance may degrade on out-of-domain audio (e.g., clean close-talk speech or other acoustic conditions). For best results, we recommend fine-tuning on in-domain data.
The model is recommended for short-form inference (≤ 30 seconds per input). For long-form recordings, chunk-based inference is required; chunking and post-processing logic are not provided in this repository (a rough sketch follows).
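The sketch below is illustrative only: it cuts a recording into ≤ 30-second pieces so each piece can be run through the model separately, leaving the timestamps to be shifted by each chunk's offset. Chunk boundaries can split words, and speaker IDs are assigned per chunk, so reconciling speakers across chunks requires additional matching that is not covered here (torchaudio is assumed only for audio I/O).

import torchaudio

def chunk_wav(in_path, chunk_sec=30.0):
    # Split a long recording into <= 30 s pieces; returns (chunk_path, offset_seconds) pairs
    wav, sr = torchaudio.load(in_path)
    chunk_len = int(chunk_sec * sr)
    chunks = []
    for idx, start in enumerate(range(0, wav.shape[1], chunk_len)):
        path = f"{in_path}.chunk{idx}.wav"
        torchaudio.save(path, wav[:, start:start + chunk_len], sr)
        chunks.append((path, start / sr))
    return chunks

# Per-chunk inference, mirroring the Quick Inference call above; afterwards,
# add each chunk's offset to its segment timestamps and reconcile speaker IDs yourself.
# for path, offset in chunk_wav("long_meeting.wav"):
#     chunk_out = model.generate([path], messages, max_new_tokens=800, num_beams=1, do_sample=False)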
This model is designed for offline inference only and does not support real-time or streaming ASR.
Speaker attributes such as gender are currently experimental and may not produce reliable results, due to the limited amount of training data and annotation.
Citation
If you use TagSpeech in your research, please cite:
@article{huo2026tagspeech,
  title={TagSpeech: End-to-End Multi-Speaker ASR and Diarization with Fine-Grained Temporal Grounding},
  author={Huo, Mingyue and Shao, Yiwen and Zhang, Yuheng},
  journal={arXiv preprint arXiv:2601.06896},
  year={2026}
}