whisper-large-v3-LoS
Model Description
The LoS Whisper-large-v3 model is a multilingual automatic speech recognition (ASR) system designed to transcribe speech in Spanish, Catalan, Galician, and Euskera (LoS, Languages of Spain). It is the result of fine-tuning the model openai/whisper-large-v3 on a combination of public and institutional datasets.
The model was trained on 8,110 hours of meticulously preprocessed data. This preprocessing ensures high-quality, readable transcriptions while preserving linguistic consistency. The training hours were equalized across the four languages to ensure comparable performance. Speed perturbation (0.9× and 1.1×) was applied to every Basque training audio file; the Basque data, with a total duration of 2027h 56m 11s, then served as the reference for balancing the remaining languages.
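As a rough illustration of what speed perturbation does to an audio file, the sketch below resamples a list of samples so that it plays `factor` times faster. The helper name `speed_perturb` and the use of linear interpolation are assumptions made for this example; the source does not state which tool or resampler was used to build the training data.

```python
def speed_perturb(samples, factor):
    """Return `samples` resampled so that playback is `factor` times faster.

    Hypothetical helper: linear interpolation is an assumption made for this
    sketch, not necessarily the resampler used to build the training data.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    n_out = int(round(len(samples) / factor))
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * factor              # fractional position in the original signal
        lo = min(int(pos), last)      # sample just before that position
        hi = min(lo + 1, last)        # sample just after it
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


# One second of a dummy ramp signal at 16 kHz
one_second = [i / 16000 for i in range(16000)]
slow = speed_perturb(one_second, 0.9)  # longer than the original
fast = speed_perturb(one_second, 1.1)  # shorter than the original
print(len(slow) / 16000, len(fast) / 16000)
```

A 0.9× copy is about 11% longer than the original and a 1.1× copy about 9% shorter, so perturbed copies add roughly as many hours as the files they are derived from.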
Intended Uses and Limitations
This model can be used for automatic speech recognition in the four languages mentioned.
Limitations: Speakers’ demographic information is not available; biases may exist due to institutional content.
How to Get Started with the Model
For a working version of this code, see our Notebook. To use this model there, replace every occurrence of "projecte-aina/whisper-large-v3-ca-3catparla" with "BSC-LT/whisper-large-v3-LoS".
Installation
To use this model, you may install datasets and transformers:
Create a virtual environment:
python -m venv /path/to/venv
Activate the environment:
source /path/to/venv/bin/activate
Install the modules:
pip install datasets transformers
For Inference
To transcribe audio with this model and compute the WER on a test set, you can follow this example:
#Install Prerequisites
pip install torch
pip install datasets
pip install 'transformers[torch]'
pip install evaluate
pip install jiwer
# This code requires a CUDA GPU.
# Note (November 2024): load_metric is no longer part of datasets;
# the WER metric is loaded with evaluate's load instead.
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor
#Load the processor and model.
MODEL_NAME="BSC-LT/whisper-large-v3-LoS"
processor = WhisperProcessor.from_pretrained(MODEL_NAME)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME).to("cuda")
#Load the dataset
from datasets import load_dataset, Audio
ds=load_dataset("projecte-aina/parlament_parla",split='test')
#Downsample to 16kHz
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
#Process the dataset
def map_to_pred(batch):
    audio = batch["audio"]
    input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
    batch["reference"] = processor.tokenizer._normalize(batch['normalized_text'])
    with torch.no_grad():
        predicted_ids = model.generate(input_features.to("cuda"))[0]
    transcription = processor.decode(predicted_ids)
    batch["prediction"] = processor.tokenizer._normalize(transcription)
    return batch
#Do the evaluation
result = ds.map(map_to_pred)
#Compute the overall WER now.
from evaluate import load
wer = load("wer")
WER=100 * wer.compute(references=result["reference"], predictions=result["prediction"])
print(WER)
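For readers unfamiliar with the metric, word error rate is the number of word-level substitutions, deletions and insertions needed to turn the prediction into the reference, divided by the number of reference words. The following is a minimal pure-Python sketch of that definition, pooled over all utterances. The function name `word_error_rate` is hypothetical, and this is not the implementation used by `evaluate` or `jiwer`; it is only meant to show what the number printed above measures.

```python
def word_error_rate(references, predictions):
    """Pooled WER: total word-level edit distance / total reference words."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()
        # Levenshtein distance over words, keeping only the previous row.
        prev = list(range(len(hyp_words) + 1))
        for i, ref_word in enumerate(ref_words, 1):
            cur = [i]
            for j, hyp_word in enumerate(hyp_words, 1):
                cur.append(min(
                    prev[j] + 1,                           # deletion
                    cur[j - 1] + 1,                        # insertion
                    prev[j - 1] + (ref_word != hyp_word),  # substitution or match
                ))
            prev = cur
        total_edits += prev[-1]
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_edits / total_words


# Toy example: one missing word out of eight reference words
refs = ["bon dia a tothom", "el gat menja peix"]
hyps = ["bon dia tothom", "el gat menja peix"]
print(100 * word_error_rate(refs, hyps))
```

Both the reference and the prediction should go through the same text normalization before scoring, as the example above does, since casing and punctuation differences otherwise count as errors.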
Training Details
Training data
The specific datasets used to create the model are:
In Catalan:
- "3CatParla" (To be published soon)
- Parlament-Parla-v3 (Only the anonymized version of the dataset is public. We trained the model with the non-anonymized version)
- Corts Valencianes (Only the anonymized version of the dataset is public. We trained the model with the non-anonymized version)
- IB3 (To be published soon)
- Common Voice ca 17 Benchmark
In Spanish:
- ciempiess light
- ciempiess fem
- ciempiess complementary
- ciempiess balance
- CHM150
- Tedx spanish
- librivox spanish
- Wikipedia spanish
- voxforge spanish
- Tele con ciencia
- Argentinian Spanish Speech Dataset
- Dimex100 light
- Glissando Spanish
- Herico
- Latino40
- Common voice 17 es
In Galician:
- fleurs-galician
- google_crowdsourced
- Nos_ParlaSpeech-GL (clean)
- Nos_ParlaSpeech-GL (other)
- Nos_TranscriSpeech-GL
- Nos_RG-Podcast-GL
- FalAI (validated split)
- Common Voice 22.0
In Euskera:
Training procedure
This model is the result of fine-tuning the model "openai/whisper-large-v3" by following this tutorial provided by Hugging Face.
Training Hyperparameters
- languages: Catalan, Spanish, Basque, Galician
- hours of training audio: 8110h 28m
- learning rate: 1e-5
- sample rate: 16000 Hz
- train batch size: 32 (×4 GPUs)
- gradient accumulation: 1
- eval batch size: 32
- max steps: 172,294
- warmup steps: 17,229
- eval steps: 17,229
- save steps: 17,229
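The relationships between these values can be checked with a few lines of arithmetic. This sketch assumes that "32 (×4 GPUs)" means a per-device batch size of 32 on each of 4 GPUs; the variable names are illustrative and are not taken from the training script.

```python
# Values copied from the hyperparameter list above
per_device_batch = 32   # assumed to be the per-GPU batch size
num_gpus = 4
grad_accum = 1
max_steps = 172_294
warmup_steps = 17_229
eval_steps = 17_229

# Examples consumed per optimizer step across all GPUs
effective_batch = per_device_batch * num_gpus * grad_accum

# Share of training spent in learning-rate warmup
warmup_fraction = warmup_steps / max_steps

# Number of full evaluation (and checkpoint) intervals that fit in the run
num_evaluations = max_steps // eval_steps

# Total training examples seen, counting repeats across epochs
examples_seen = max_steps * effective_batch

print(effective_batch, round(warmup_fraction, 3), num_evaluations, examples_seen)
```

Under that assumption the effective batch size is 128, warmup covers about 10% of the run, and evaluation and checkpointing happen 10 times.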
Citation
If this model contributes to your research, please cite the work:
@misc{LoS_Whisper2025,
title={Acoustic Model in Language of Spain: whisper-large-v3-LoS.},
author={Hernandez Mena, Carlos Daniel and Messaoudi, Abir and Solito, Sarah and España-Bonet, Cristina},
organization={Barcelona Supercomputing Center},
url={https://huggingface.co/BSC-LT/whisper-large-v3-LoS},
year={2025}
}
Additional Information
Author
The fine-tuning process was performed in November 2025 at the Language Technologies Laboratory of the Barcelona Supercomputing Center.
Contact
For further information, please email langtech@bsc.es.
Copyright
Copyright (c) 2025 by Language Technologies Laboratory, Barcelona Supercomputing Center.
License
Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project ILENIA with reference 2022/TL22/00215337. The training of the model was possible thanks to the computing time provided by Barcelona Supercomputing Center through MareNostrum 5.