Evoxtral LoRA: Expressive Tagged Transcription
A LoRA adapter for Voxtral-Mini-3B-2507 that produces transcriptions enriched with inline expressive audio tags from the ElevenLabs v3 tag set.
Built for the Mistral AI Online Hackathon 2026 (W&B Fine-Tuning Track).
Two model variants are available:
- Evoxtral SFT: best overall transcription accuracy (lowest WER)
- Evoxtral RL: best expressive tag accuracy (highest Tag F1)
What It Does
Standard ASR:
So I was thinking maybe we could try that new restaurant downtown. I mean if you're free this weekend.
Evoxtral:
[nervous] So... [stammers] I was thinking maybe we could... [clears throat] try that new restaurant downtown? [laughs nervously] I mean, if you're free this weekend?
Training Pipeline
Base Voxtral-Mini-3B → SFT (LoRA, 3 epochs) → RL (RAFT, 1 epoch)
- SFT: LoRA fine-tuning on 808 synthetic audio samples with expressive tags (lr=2e-4, 3 epochs)
- RL (RAFT): rejection sampling: generate 4 completions per sample, score them with a rule-based reward (WER accuracy + Tag F1 − hallucination penalty), keep the best, then SFT on the curated data (lr=5e-5, 1 epoch)
This follows the approach from GRPO for Speech Recognition and Voxtral's own SFT→DPO training recipe.
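The rule-based reward can be sketched in plain Python. This is an illustrative reimplementation, not the released scoring code: `wer` is assumed to be precomputed elsewhere (e.g. with jiwer), tags are compared as multisets, and the 0.4/0.4/0.2 weights match the reward function listed under RL Stage below.

```python
import re
from collections import Counter

# Matches bracketed expressive tags such as [laughs] or [clears throat].
TAG_RE = re.compile(r"\[[a-z ]+\]")

def tag_f1(reference: str, prediction: str) -> float:
    """Micro F1 over bracketed expressive tags, counted as multisets."""
    ref = Counter(TAG_RE.findall(reference))
    pred = Counter(TAG_RE.findall(prediction))
    if not ref and not pred:
        return 1.0
    overlap = sum((ref & pred).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def hallucination_rate(reference: str, prediction: str) -> float:
    """Fraction of predicted tags that do not appear in the reference."""
    ref = Counter(TAG_RE.findall(reference))
    pred = Counter(TAG_RE.findall(prediction))
    if not pred:
        return 0.0
    return sum((pred - ref).values()) / sum(pred.values())

def raft_reward(wer: float, reference: str, prediction: str) -> float:
    """0.4*(1 - WER) + 0.4*Tag_F1 + 0.2*(1 - hallucination)."""
    return (
        0.4 * (1 - wer)
        + 0.4 * tag_f1(reference, prediction)
        + 0.2 * (1 - hallucination_rate(reference, prediction))
    )
```

A perfect transcription (zero WER, identical tags) scores 1.0 under this reward.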
Evaluation Results
Evaluated on 50 held-out test samples. Full benchmark (Evoxtral-Bench) with 7 metrics:
Core Metrics: Base vs SFT vs RL
| Metric | Base Voxtral | Evoxtral SFT | Evoxtral RL | Best |
|---|---|---|---|---|
| WER | 6.64% | 4.47% | 5.12% | SFT |
| CER | 2.72% | 1.23% | 1.48% | SFT |
| Tag F1 | 22.0% | 67.2% | 69.4% | RL |
| Tag Precision | 22.0% | 67.4% | 68.5% | RL |
| Tag Recall | 22.0% | 69.4% | 72.7% | RL |
| Emphasis F1 | 42.0% | 84.0% | 86.0% | RL |
| Tag Hallucination | 0.0% | 19.3% | 20.2% | SFT |
SFT achieves the best raw transcription accuracy (lowest WER/CER). RL further improves expressive tag generation (+2.2 points Tag F1, +3.3 points Tag Recall, +2.0 points Emphasis F1) at a small cost to WER.
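WER as reported above is the standard word-level edit distance divided by reference length. A minimal stdlib-only version, assuming plain whitespace tokenization and no text normalization (unlike production scorers such as jiwer):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Single-row dynamic-programming edit distance over words.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,          # delete reference word
                      d[j - 1] + 1,      # insert hypothesis word
                      prev + (r != h))   # substitute (or match)
            prev, d[j] = d[j], cur
    return d[len(hyp)] / max(len(ref), 1)
```

For example, one substitution in a three-word reference yields a WER of 1/3.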
Per-Tag F1 Breakdown (SFT → RL)
| Tag | SFT F1 | RL F1 | Change | Support |
|---|---|---|---|---|
| [sighs] | 1.000 | 1.000 | – | 9 |
| [clears throat] | 0.889 | 1.000 | +12.5% | 8 |
| [gasps] | 0.957 | 0.957 | – | 12 |
| [pause] | 0.885 | 0.902 | +1.9% | 25 |
| [nervous] | 0.800 | 0.846 | +5.8% | 13 |
| [stammers] | 0.889 | 0.842 | -5.3% | 8 |
| [laughs] | 0.800 | 0.815 | +1.9% | 12 |
| [sad] | 0.667 | 0.750 | +12.4% | 4 |
| [whispers] | 0.636 | 0.667 | +4.9% | 13 |
| [crying] | 0.750 | 0.571 | -23.9% | 5 |
| [excited] | 0.615 | 0.571 | -7.2% | 5 |
| [shouts] | 0.400 | 0.500 | +25.0% | 3 |
| [calm] | 0.200 | 0.400 | +100% | 6 |
| [frustrated] | 0.444 | 0.444 | – | 3 |
| [angry] | 0.667 | 0.667 | – | 2 |
| [confused] | 0.000 | 0.000 | – | 1 |
| [scared] | 0.000 | 0.000 | – | 1 |
RL improved 8 tags, kept 6 stable, and regressed 3. Biggest gains on [calm] (+100%), [shouts] (+25.0%), [clears throat] (+12.5%), and [sad] (+12.4%).
Training Details
SFT Stage
| Parameter | Value |
|---|---|
| Base model | mistralai/Voxtral-Mini-3B-2507 |
| Method | LoRA (PEFT) |
| LoRA rank | 64 |
| LoRA alpha | 128 |
| LoRA dropout | 0.05 |
| Target modules | q/k/v/o_proj, gate/up/down_proj, multi_modal_projector |
| Learning rate | 2e-4 |
| Scheduler | Cosine |
| Epochs | 3 |
| Batch size | 2 (effective 16 with grad accum 8) |
| NEFTune noise alpha | 5.0 |
| Precision | bf16 |
| GPU | NVIDIA A10G (24GB) |
| Training time | ~25 minutes |
| Trainable params | 124.8M / 4.8B (2.6%) |
RL Stage (RAFT)
| Parameter | Value |
|---|---|
| Method | Rejection sampling + SFT (RAFT) |
| Samples per input | 4 (temperature=0.7, top_p=0.9) |
| Reward function | 0.4×(1−WER) + 0.4×Tag_F1 + 0.2×(1−hallucination) |
| Curated samples | 727 (bottom 10% filtered, reward > 0.954) |
| Avg reward | 0.980 |
| Learning rate | 5e-5 |
| Epochs | 1 |
| Final loss | 0.021 |
| Training time | ~7 minutes |
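The curation step above (best of 4 completions per input, then dropping the lowest-reward samples) can be sketched as follows. The data layout and names are illustrative, not the project's actual code:

```python
def curate(samples, keep_fraction=0.9):
    """RAFT curation sketch: keep the best-scored completion per input,
    then keep only the top `keep_fraction` of inputs by reward."""
    # samples: list of (input_id, completion, reward) tuples,
    # with several completions per input_id.
    best = {}
    for input_id, completion, reward in samples:
        if input_id not in best or reward > best[input_id][1]:
            best[input_id] = (completion, reward)
    ranked = sorted(best.items(), key=lambda kv: kv[1][1], reverse=True)
    cutoff = max(1, int(len(ranked) * keep_fraction))
    return [(i, c) for i, (c, _) in ranked[:cutoff]]
```

With `keep_fraction=0.9` this mirrors the "bottom 10% filtered" step; the curated pairs are then used for a second round of SFT.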
Dataset
Custom synthetic dataset of 1,010 audio samples generated with ElevenLabs TTS v3:
- 808 train / 101 validation / 101 test
- Each sample has audio + tagged transcription with inline ElevenLabs v3 expressive tags
- Tags include [sighs], [laughs], [whispers], [nervous], [frustrated], [clears throat], [pause], [excited], and more
- The audio encoder (Whisper-based) was frozen during training
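A record in such a dataset might look like the following; the field names are assumptions for illustration, not the released schema:

```python
import re

# Hypothetical record shape (illustrative field names).
sample = {
    "audio_path": "audio/train_0001.wav",
    "text": "[nervous] So... [clears throat] try that new restaurant downtown?",
    "split": "train",
}

# Subset of the tag vocabulary, for the validity check below.
KNOWN_TAGS = {"nervous", "clears throat", "laughs", "sighs", "pause"}

def tags_in(text: str) -> list[str]:
    """Extract bracketed tag names from a tagged transcription."""
    return re.findall(r"\[([^\]]+)\]", text)

def is_valid(record: dict) -> bool:
    """Every tag in the text must come from the known tag set."""
    return all(t in KNOWN_TAGS for t in tags_in(record["text"]))
```

A check like `is_valid` is useful when generating synthetic data, since TTS prompts can otherwise leak malformed or out-of-vocabulary tags into the references.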
Usage
```python
import torch
from transformers import VoxtralForConditionalGeneration, AutoProcessor
from peft import PeftModel

repo_id = "mistralai/Voxtral-Mini-3B-2507"
# Use "YongkangZOU/evoxtral-lora" for SFT or "YongkangZOU/evoxtral-rl" for RL
adapter_id = "YongkangZOU/evoxtral-rl"

processor = AutoProcessor.from_pretrained(repo_id)
base_model = VoxtralForConditionalGeneration.from_pretrained(
    repo_id, dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, adapter_id)

# Transcribe audio with expressive tags
inputs = processor.apply_transcription_request(
    language="en",
    audio=["path/to/audio.wav"],
    format=["WAV"],
    model_id=repo_id,
    return_tensors="pt",
)
inputs = inputs.to(model.device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
transcription = processor.batch_decode(
    outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(transcription)
# [nervous] So... I was thinking maybe we could [clears throat] try that new restaurant downtown?
```
API
A serverless API with Swagger UI is available on Modal:
```shell
curl -X POST https://yongkang-zou1999--evoxtral-api-evoxtralmodel-web.modal.run/transcribe \
  -F "file=@audio.wav"
```
W&B Tracking
All training and evaluation runs are tracked on Weights & Biases.
Supported Tags
The model can produce any tag from the ElevenLabs v3 expressive tag set, including:
[laughs] [sighs] [gasps] [clears throat] [whispers] [sniffs] [pause] [nervous] [frustrated] [excited] [sad] [angry] [calm] [stammers] [yawns] and more.
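Because every tag is bracketed, downstream code can recover a plain transcript from the model's tagged output with a small regex. This helper is a convenience sketch, not part of the released code:

```python
import re

def strip_tags(tagged: str) -> str:
    """Remove bracketed expressive tags and collapse leftover whitespace."""
    plain = re.sub(r"\[[^\]]+\]", "", tagged)
    return re.sub(r"\s+", " ", plain).strip()
```

This is handy when comparing Evoxtral output against a plain-ASR reference, since standard WER scoring should not count the tags as words.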
Limitations
- Trained on synthetic (TTS-generated) audio, not natural speech recordings
- ~20% tag hallucination rate: the model occasionally predicts tags not in the reference
- Rare/subtle tags ([calm], [confused], [scared]) have low accuracy due to limited training examples
- The RL variant trades ~0.65 points of WER (4.47% → 5.12%) for better tag accuracy
- English only
- Best results on conversational and emotionally expressive speech
Citation
```bibtex
@misc{evoxtral2026,
  title={Evoxtral: Expressive Tagged Transcription with Voxtral},
  author={Yongkang Zou},
  year={2026},
  url={https://huggingface.co/YongkangZOU/evoxtral-lora}
}
```