TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation

TTA is a multilingual model that jointly supports the transcribe, translate, and align tasks. It provides strong multilingual ASR/ST performance and cross-lingual speech retrieval capability.

🔗 Paper: https://arxiv.org/abs/2511.14410
🔗 Model: https://huggingface.co/AudenAI/auden-tta-m10
🔗 Encoder: https://huggingface.co/AudenAI/auden-encoder-tta-m10
🔗 Code: https://github.com/AudenAI/Auden/tree/main/examples/tta

πŸ” What Can This Model Do?

  • 🎙️ Multilingual ASR (transcribe)
  • 🌍 Speech translation (translate)
  • 🧩 Audio–text alignment (align)
  • 🔎 Cross-lingual speech retrieval

Quick Start

TTA model

from auden.auto.auto_model import AutoModel

# 1) Load a model checkpoint directory (contains config.json + weights)
model_dir = "AudenAI/auden-tta-m10"  # or any exported directory / HF repo id
model = AutoModel.from_pretrained(model_dir)
model = model.to("cuda")
model.eval()

# 2) Prepare input features (x, x_lens). If you have raw audio, you can use
#    model.speech_encoder.extract_feature(wav) to get (x, x_lens).
x, x_lens = ...  # Tensor shapes: (B, T, F), (B,)

inputs = (x, x_lens)
# Alternatively, you can pass WAV inputs directly:
# - List of WAV paths (str):
#   inputs = ["/abs/a.wav", "/abs/b.wav"]
# - List of mono waveforms (Tensor/ndarray), 16 kHz:
#   inputs = [torch.randn(16000*5), torch.randn(16000*3)]

# 3a) Transcribe (RNNT greedy)
out = model.generate(inputs, task="transcribe", blank_penalty=0.0, return_timestamps=False)
print(out["hypotheses"])  # list[str]

# 3b) Translate (attention beam search). Language can be a single str or a list[str] per utterance
out = model.generate(
    inputs,
    task="translate",
    beam_size=5,
    source_language=["zh"] * x.size(0),
    target_language=["en"] * x.size(0),
)
print(out["hypotheses"])      # list[str]
print(out["source_language"]) # list[str], model-predicted or provided
print(out["target_language"]) # list[str], model-predicted or provided

# 3c) Align (audio-text similarity)
texts = ["hello world", "good morning"]
out = model.generate(inputs, task="align", texts=texts)
print(out["similarities"])  # (B, len(texts))
print(out["audio_emb"]) # (B, emb_dim)
print(out["text_emb"]) # (B, emb_dim)

TTA encoder

from auden.auto.auto_model import AutoModel
encoder = AutoModel.from_pretrained("AudenAI/auden-encoder-tta-m10")
encoder = encoder.to("cuda")
encoder.eval()

# Prepare input features (x, x_lens). If you have raw audio, you can use
#    encoder.extract_feature(wav) to get (x, x_lens).
x, x_lens = ...  # Tensor shapes: (B, T, F), (B,)

encoder_output = encoder(x, x_lens)
print(encoder_output["encoder_out"])      # (B, T//4, D)
print(encoder_output["encoder_out_lens"]) # (B,)

📌 Model Characteristics

  • Input: Raw audio waveform (16 kHz recommended)
  • Output: Transcription, translation, or alignment scores
  • Encoder: TTA encoder (AudenAI/auden-encoder-tta-m10)
  • Tasks: transcribe / translate / align

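Since 16 kHz mono input is recommended, here is a minimal preprocessing sketch using torchaudio; the file path is a placeholder.

import torchaudio

# Load an audio file, downmix to mono, and resample to the recommended 16 kHz.
wav, sr = torchaudio.load("/abs/a.wav")  # placeholder path
wav = wav.mean(dim=0)  # (channels, T) -> (T,)
if sr != 16000:
    wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=16000)
# `wav` can now be passed in a list, e.g. model.generate([wav], task="transcribe")
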
📊 Evaluation

Multilingual ASR & ST

| Model | #Params | AISHELL1/2 (CER↓) | Wenet (CER↓) | LibriSpeech (WER↓) | CommonVoice (WER↓) | MLS (WER↓) | VoxPopuli (WER↓) | FLEURS (WER↓) | CoVoSTv2 (BLEU↑) |
|---|---|---|---|---|---|---|---|---|---|
| Whisper Medium | 762M | 6.74 / 6.23 | 11.00 / 22.68 | 2.88 / 6.08 | 11.86 | 7.27 | 12.08 | 6.62 | 35.12 |
| Whisper Large-v2 | 1.54B | 5.90 / 5.24 | 9.47 / 22.77 | 2.64 / 5.14 | 9.70 | 5.65 | 11.90 | 5.20 | 38.80 |
| Whisper Large-v3 | 1.54B | 5.33 / 4.76 | 9.00 / 15.68 | 2.01 / 3.89 | 8.30 | 4.48 | 13.78 | 4.51 | 37.60 |
| ZT (ASR) | 199M | 1.89 / 3.14 | 6.91 / 6.08 | 1.58 / 3.62 | 6.92 | 5.82 | 11.12 | 6.35 | – |
| ZT-AED (ASR) | 246M | 1.82 / 3.07 | 6.89 / 6.18 | 1.54 / 3.59 | 6.70 | 5.71 | 10.78 | 6.18 | – |
| ZT-AED (Full) | 246M | 1.80 / 3.03 | 6.96 / 5.94 | 1.56 / 3.76 | 6.69 | 5.72 | 10.88 | 6.17 | 34.72 |
| 🔥 TTA (Ours) | 247M | 1.85 / 3.09 | 7.06 / 6.44 | 1.58 / 3.85 | 6.76 | 5.74 | 10.87 | 6.19 | 35.28 |

TTA Encoder (LLM-ASR Encoder Evaluation)

| Encoder | AISHELL (CER↓) | LibriSpeech (WER↓) |
|---|---|---|
| Whisper-Medium | 5.47 | 4.66 |
| Whisper-Large | 4.87 | 3.64 |
| ZT-AED | 2.92 | 2.30 |
| TTA (Ours) | 1.92 | 1.95 |

Training Data

Full data composition (open-source corpora plus in-house data):

| Language | Data Source | Type | Hours | Total Hours | Share |
|---|---|---|---|---|---|
| Chinese (Zh) | WenetSpeech | Open Source | 10,005 | 129,265 | 37.1% |
| | AISHELL-2 | Open Source | 1,000 | | |
| | AISHELL-1 | Open Source | 150 | | |
| | Common Voice | Open Source | 237 | | |
| | Yodas | Open Source | 222 | | |
| | In-house Data | In-house | 117,651 | | |
| Code-Switch | TALCS | Open Source | 555 | 8,924 | 2.6% |
| | In-house Data | In-house | 8,369 | | |
| English (En) | Libriheavy | Open Source | 45,751 | 107,626 | 30.9% |
| | Multilingual LibriSpeech (MLS) | Open Source | 44,659 | | |
| | GigaSpeech | Open Source | 10,000 | | |
| | Yodas | Open Source | 3,426 | | |
| | Common Voice | Open Source | 1,778 | | |
| | LibriSpeech | Open Source | 960 | | |
| | VoxPopuli | Open Source | 522 | | |
| | TED-LIUM | Open Source | 453 | | |
| | AMI Corpus | Open Source | 77 | | |
| Japanese (Ja) | ReazonSpeech | Open Source | 35,389 | 40,426 | 11.6% |
| | Yodas | Open Source | 499 | | |
| | Common Voice | Open Source | 19 | | |
| | In-house Data | In-house | 4,519 | | |
| Korean (Ko) | KsponSpeech (AIHub) | Open Source | 965 | 20,095 | 5.8% |
| | KrespSpeech (AIHub) | Open Source | 2,906 | | |
| | KconfSpeech (AIHub) | Open Source | 2,928 | | |
| | MeetingSpeech (AIHub) | Open Source | 4,962 | | |
| | GyeongsangSpeech (AIHub) | Open Source | 2,481 | | |
| | Yodas | Open Source | 1,528 | | |
| | Common Voice | Open Source | 1 | | |
| | In-house Data (Aggregated) | In-house | 4,324 | | |
| Russian (Ru) | Golos | Open Source | 1,221 | 15,246 | 4.4% |
| | Public Speech & Radio | Open Source | 1,651 | | |
| | Buriy Audiobook | Open Source | 874 | | |
| | Public Youtube Dataset | Open Source | 809 | | |
| | Yodas | Open Source | 2,606 | | |
| | Common Voice | Open Source | 37 | | |
| | In-house Data | In-house | 8,048 | | |
| Vietnamese (Vi) | GigaSpeech 2 | Open Source | 6,048 | 8,390 | 2.4% |
| | Bud500 | Open Source | 324 | | |
| | VLSP 2020 | Open Source | 101 | | |
| | ViMD | Open Source | 81 | | |
| | LSVSC | Open Source | 80 | | |
| | Yodas | Open Source | 140 | | |
| | Common Voice | Open Source | 2 | | |
| | In-house Data | In-house | 1,614 | | |
| Indonesian (Id) | GigaSpeech 2 | Open Source | 6,352 | 8,238 | 2.4% |
| | Yodas | Open Source | 442 | | |
| | Common Voice | Open Source | 7 | | |
| | In-house Data | In-house | 1,437 | | |
| French (Fr) | Multilingual LibriSpeech (MLS) | Open Source | 1,076 | 4,124 | 1.2% |
| | Yodas | Open Source | 1,423 | | |
| | Common Voice | Open Source | 831 | | |
| | VoxPopuli | Open Source | 205 | | |
| | In-house Data | In-house | 589 | | |
| Spanish (Es) | Multilingual LibriSpeech (MLS) | Open Source | 917 | 4,596 | 1.3% |
| | Yodas | Open Source | 2,399 | | |
| | Common Voice | Open Source | 502 | | |
| | VoxPopuli | Open Source | 151 | | |
| | In-house Data | In-house | 627 | | |
| Portuguese (Pt) | Multilingual LibriSpeech (MLS) | Open Source | 160 | 1,602 | 0.5% |
| | Yodas | Open Source | 852 | | |
| | Common Voice | Open Source | 25 | | |
| | In-house Data | In-house | 565 | | |

Language totals from the same table:

| Language | Total Hours | Share |
|---|---|---|
| Chinese (Zh) | 129,265 | 37.1% |
| English (En) | 107,626 | 30.9% |
| Japanese (Ja) | 40,426 | 11.6% |
| Korean (Ko) | 20,095 | 5.8% |
| Russian (Ru) | 15,246 | 4.4% |
| Code-Switch | 8,924 | 2.6% |
| Vietnamese (Vi) | 8,390 | 2.4% |
| Indonesian (Id) | 8,238 | 2.4% |
| Spanish (Es) | 4,596 | 1.3% |
| French (Fr) | 4,124 | 1.2% |
| Portuguese (Pt) | 1,602 | 0.5% |

⚠️ Limitations

  • Performance depends on audio quality and recording conditions.
  • For long-form audio, chunking and post-processing might be required for optimal performance; a naive chunking sketch follows this list.
  • Not designed for safety-critical applications.

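A naive long-form sketch for the chunking point above, reusing the model loaded in Quick Start: fixed 30-second windows with no overlap (both arbitrary choices; overlapping windows plus merge logic usually perform better). The waveform is a placeholder.

import torch

# Split long audio into fixed 30 s windows and transcribe each chunk.
long_wav = torch.randn(16000 * 120)  # placeholder: 2 min of 16 kHz mono audio
win = 30 * 16000
chunks = [long_wav[i:i + win] for i in range(0, long_wav.numel(), win)]
out = model.generate(chunks, task="transcribe")
print(" ".join(out["hypotheses"]))
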
Citation

If you use this model in your research, please cite:

@article{liu2025tta,
  title={TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation},
  author={Liu, Wei and Li, Jiahong and Shao, Yiwen and Yu, Dong},
  journal={arXiv preprint arXiv:2511.14410},
  year={2025}
}