TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation
TTA is a multilingual model that jointly supports three tasks: transcribe, translate, and align. It achieves strong multilingual ASR/ST performance and provides cross-lingual speech retrieval capability.
- Paper: https://arxiv.org/abs/2511.14410
- Model: https://huggingface.co/AudenAI/auden-tta-m10
- Encoder: https://huggingface.co/AudenAI/auden-encoder-tta-m10
- Code: https://github.com/AudenAI/Auden/tree/main/examples/tta
```python
from auden.auto.auto_model import AutoModel
# 1) Load a model checkpoint directory (contains config.json + weights)
model_dir = "AudenAI/auden-tta-m10" # or any exported directory / HF repo id
model = AutoModel.from_pretrained(model_dir)
model = model.to("cuda")
model.eval()
# 2) Prepare input features (x, x_lens). If you have raw audio, you can use
# model.speech_encoder.extract_feature(wav) to get (x, x_lens).
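# For example (a hedged sketch; assumes torchaudio and a 16 kHz mono file,
# and that extract_feature accepts a 1-D waveform; check the repo examples):
#     import torchaudio
#     wav, sr = torchaudio.load("/abs/a.wav")  # wav: (1, num_samples)
#     x, x_lens = model.speech_encoder.extract_feature(wav.squeeze(0))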
x, x_lens = ... # Tensor shapes: (B, T, F), (B,)
inputs = (x, x_lens)
# Alternatively, you can pass WAV inputs directly:
# - List of WAV paths (str):
#     inputs = ["/abs/a.wav", "/abs/b.wav"]
# - List of mono waveforms (Tensor/ndarray), 16 kHz:
#     inputs = [torch.randn(16000*5), torch.randn(16000*3)]
# 3a) Transcribe (RNNT greedy)
out = model.generate(inputs, task="transcribe", blank_penalty=0.0, return_timestamps=False)
print(out["hypotheses"]) # list[str]
# 3b) Translate (attention beam search). Language can be a single str or a list[str] per utterance
out = model.generate(
    inputs,
    task="translate",
    beam_size=5,
    source_language=["zh"] * x.size(0),
    target_language=["en"] * x.size(0),
)
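# Equivalently (per the note above), a single language code applies to the whole batch:
#     out = model.generate(inputs, task="translate", beam_size=5,
#                          source_language="zh", target_language="en")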
print(out["hypotheses"]) # list[str]
print(out["source_language"]) # list[str], model-predicted or provided
print(out["target_language"]) # list[str], model-predicted or provided
# 3c) Align (audio-text similarity)
texts = ["hello world", "good morning"]
out = model.generate(inputs, task="align", texts=texts)
print(out["similarities"]) # (B, len(texts))
print(out["audio_emb"]) # (B, emb_dim)
print(out["text_emb"]) # (B, emb_dim)
```python
from auden.auto.auto_model import AutoModel

# 1) Load the standalone encoder checkpoint
encoder = AutoModel.from_pretrained("AudenAI/auden-encoder-tta-m10")
encoder = encoder.to("cuda")
encoder.eval()
# 2) Prepare input features (x, x_lens). If you have raw audio, you can use
# encoder.extract_feature(wav) to get (x, x_lens).
x, x_lens = ... # Tensor shapes: (B, T, F), (B,)
# 3) Forward pass
encoder_output = encoder(x, x_lens)
print(encoder_output["encoder_out"]) # (B, T//4, D)
print(encoder_output["encoder_out_lens"]) # (B,)
```
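The frame-level encoder output can be pooled into one embedding per utterance, e.g. for clustering or retrieval experiments. A minimal sketch, assuming the `encoder_out` / `encoder_out_lens` shapes printed above; the masking and pooling logic is illustrative, not part of the Auden API:

```python
import torch

# Mean-pool over valid frames only, masking padded positions via encoder_out_lens.
h = encoder_output["encoder_out"]          # (B, T', D)
lens = encoder_output["encoder_out_lens"]  # (B,)
mask = torch.arange(h.size(1), device=h.device)[None, :] < lens[:, None]  # (B, T')
pooled = (h * mask.unsqueeze(-1)).sum(dim=1) / lens.unsqueeze(-1).to(h.dtype)  # (B, D)
```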
Multilingual ASR/ST results:

| Model | #Params | AISHELL-1/2 (CER↓) | WenetSpeech (CER↓) | LibriSpeech (WER↓) | CommonVoice (WER↓) | MLS (WER↓) | VoxPopuli (WER↓) | FLEURS (WER↓) | CoVoSTv2 (BLEU↑) |
|---|---|---|---|---|---|---|---|---|---|
| Whisper Medium | 762M | 6.74 / 6.23 | 11.00 / 22.68 | 2.88 / 6.08 | 11.86 | 7.27 | 12.08 | 6.62 | 35.12 |
| Whisper Large-v2 | 1.54B | 5.90 / 5.24 | 9.47 / 22.77 | 2.64 / 5.14 | 9.70 | 5.65 | 11.90 | 5.20 | 38.80 |
| Whisper Large-v3 | 1.54B | 5.33 / 4.76 | 9.00 / 15.68 | 2.01 / 3.89 | 8.30 | 4.48 | 13.78 | 4.51 | 37.60 |
| ZT (ASR) | 199M | 1.89 / 3.14 | 6.91 / 6.08 | 1.58 / 3.62 | 6.92 | 5.82 | 11.12 | 6.35 | – |
| ZT-AED (ASR) | 246M | 1.82 / 3.07 | 6.89 / 6.18 | 1.54 / 3.59 | 6.70 | 5.71 | 10.78 | 6.18 | – |
| ZT-AED (Full) | 246M | 1.80 / 3.03 | 6.96 / 5.94 | 1.56 / 3.76 | 6.69 | 5.72 | 10.88 | 6.17 | 34.72 |
| **TTA (Ours)** | 247M | 1.85 / 3.09 | 7.06 / 6.44 | 1.58 / 3.85 | 6.76 | 5.74 | 10.87 | 6.19 | 35.28 |
Encoder comparison:

| Encoder | AISHELL CER↓ | LibriSpeech WER↓ |
|---|---|---|
| Whisper-Medium | 5.47 | 4.66 |
| Whisper-Large | 4.87 | 3.64 |
| ZT-AED | 2.92 | 2.30 |
| TTA (Ours) | 1.92 | 1.95 |
Full data composition (open-source links + in-house aggregation):
| Language | Data Source | Type | Hours | Total Hours | Share |
|---|---|---|---|---|---|
| Chinese (Zh) | WenetSpeech | Open Source | 10,005 | 129,265 | 37.1% |
| | AISHELL-2 | Open Source | 1,000 | | |
| | AISHELL-1 | Open Source | 150 | | |
| | Common Voice | Open Source | 237 | | |
| | Yodas | Open Source | 222 | | |
| | In-house Data | In-house | 117,651 | | |
| Code-Switch | TALCS | Open Source | 555 | 8,924 | 2.6% |
| | In-house Data | In-house | 8,369 | | |
| English (En) | Libriheavy | Open Source | 45,751 | 107,626 | 30.9% |
| | Multilingual LibriSpeech (MLS) | Open Source | 44,659 | | |
| | GigaSpeech | Open Source | 10,000 | | |
| | Yodas | Open Source | 3,426 | | |
| | Common Voice | Open Source | 1,778 | | |
| | LibriSpeech | Open Source | 960 | | |
| | VoxPopuli | Open Source | 522 | | |
| | TED-LIUM | Open Source | 453 | | |
| | AMI Corpus | Open Source | 77 | | |
| Japanese (Ja) | ReazonSpeech | Open Source | 35,389 | 40,426 | 11.6% |
| | Yodas | Open Source | 499 | | |
| | Common Voice | Open Source | 19 | | |
| | In-house Data | In-house | 4,519 | | |
| Korean (Ko) | KsponSpeech (AIHub) | Open Source | 965 | 20,095 | 5.8% |
| | KrespSpeech (AIHub) | Open Source | 2,906 | | |
| | KconfSpeech (AIHub) | Open Source | 2,928 | | |
| | MeetingSpeech (AIHub) | Open Source | 4,962 | | |
| | GyeongsangSpeech (AIHub) | Open Source | 2,481 | | |
| | Yodas | Open Source | 1,528 | | |
| | Common Voice | Open Source | 1 | | |
| | In-house Data (Aggregated) | In-house | 4,324 | | |
| Russian (Ru) | Golos | Open Source | 1,221 | 15,246 | 4.4% |
| | Public Speech & Radio | Open Source | 1,651 | | |
| | Buriy Audiobook | Open Source | 874 | | |
| | Public Youtube Dataset | Open Source | 809 | | |
| | Yodas | Open Source | 2,606 | | |
| | Common Voice | Open Source | 37 | | |
| | In-house Data | In-house | 8,048 | | |
| Vietnamese (Vi) | GigaSpeech 2 | Open Source | 6,048 | 8,390 | 2.4% |
| | Bud500 | Open Source | 324 | | |
| | VLSP 2020 | Open Source | 101 | | |
| | ViMD | Open Source | 81 | | |
| | LSVSC | Open Source | 80 | | |
| | Yodas | Open Source | 140 | | |
| | Common Voice | Open Source | 2 | | |
| | In-house Data | In-house | 1,614 | | |
| Indonesian (Id) | GigaSpeech 2 | Open Source | 6,352 | 8,238 | 2.4% |
| | Yodas | Open Source | 442 | | |
| | Common Voice | Open Source | 7 | | |
| | In-house Data | In-house | 1,437 | | |
| French (Fr) | Multilingual LibriSpeech (MLS) | Open Source | 1,076 | 4,124 | 1.2% |
| | Yodas | Open Source | 1,423 | | |
| | Common Voice | Open Source | 831 | | |
| | VoxPopuli | Open Source | 205 | | |
| | In-house Data | In-house | 589 | | |
| Spanish (Es) | Multilingual LibriSpeech (MLS) | Open Source | 917 | 4,596 | 1.3% |
| | Yodas | Open Source | 2,399 | | |
| | Common Voice | Open Source | 502 | | |
| | VoxPopuli | Open Source | 151 | | |
| | In-house Data | In-house | 627 | | |
| Portuguese (Pt) | Multilingual LibriSpeech (MLS) | Open Source | 160 | 1,602 | 0.5% |
| | Yodas | Open Source | 852 | | |
| | Common Voice | Open Source | 25 | | |
| | In-house Data | In-house | 565 | | |
Language totals from the same table (≈348.5k hours overall):
| Language | Total Hours | Share |
|---|---|---|
| Chinese (Zh) | 129,265 | 37.1% |
| English (En) | 107,626 | 30.9% |
| Japanese (Ja) | 40,426 | 11.6% |
| Korean (Ko) | 20,095 | 5.8% |
| Russian (Ru) | 15,246 | 4.4% |
| Code-Switch | 8,924 | 2.6% |
| Vietnamese (Vi) | 8,390 | 2.4% |
| Indonesian (Id) | 8,238 | 2.4% |
| Spanish (Es) | 4,596 | 1.3% |
| French (Fr) | 4,124 | 1.2% |
| Portuguese (Pt) | 1,602 | 0.5% |
If you use this model in your research, please cite:
```bibtex
@article{liu2025tta,
  title={TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation},
  author={Liu, Wei and Li, Jiahong and Shao, Yiwen and Yu, Dong},
  journal={arXiv preprint arXiv:2511.14410},
  year={2025}
}
```