# W2V-BERT 2.0 ASR Adapters

This repository contains 14 per-language bottleneck adapters for automatic speech recognition (ASR), trained on top of facebook/w2v-bert-2.0.
## Model Description

- Base Model: facebook/w2v-bert-2.0 (600M parameters, frozen)
- Adapter Architecture: MMS-style bottleneck adapters (dim=64)
- Decoder: Lightweight transformer decoder (2 layers)
- Training: CTC loss with an extended vocabulary for double vowels (see the vocabulary sketch after this list)
- Average WER: 45.13%
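The card does not spell out the extended vocabulary, so the snippet below is only a rough sketch of how a character-level CTC vocabulary could be extended with dedicated tokens for doubled vowels. The token list, helper name, and special tokens are illustrative assumptions; the authoritative vocabularies are the per-adapter `vocab.json` files.

```python
import json

def build_extended_vocab(corpus_lines):
    """Illustrative only: character vocab plus assumed double-vowel tokens."""
    # Base character inventory from the training text.
    chars = sorted({c for line in corpus_lines for c in line.lower() if c != " "})
    # Assumed extension: dedicated tokens for doubled (long) vowels.
    double_vowels = ["aa", "ee", "ii", "oo", "uu"]
    vocab = {tok: i for i, tok in enumerate(chars + double_vowels)}
    vocab["|"] = len(vocab)      # word delimiter used by Wav2Vec2-style CTC tokenizers
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)  # PAD doubles as the CTC blank token
    return vocab

print(json.dumps(build_extended_vocab(["kaa hapa", "jua kali"]), indent=2))
```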
## Trained Adapters
| Adapter | Language | WER | Train Samples |
|---|---|---|---|
| ach_Latn | Acholi | 22.08% | 4825 |
| eng_Latn_salt | English (SALT) | 99.33% | 4804 |
| eng_Latn_tts | English (TTS) | 99.80% | 3030 |
| ful_Latn | Fulah | 99.02% | 2355 |
| kam_Latn | Kamba | 31.91% | 14968 |
| kik_Latn | Kikuyu | 15.36% | 14966 |
| lug_Latn_salt | Luganda (SALT) | 28.15% | 5002 |
| luo_Latn | Luo | 17.69% | 14922 |
| mer_Latn | Kimeru | 34.70% | 14981 |
| nyn_Latn | Runyankole | 30.46% | 4884 |
| swh_Latn_salt | Swahili (SALT) | 95.23% | 3835 |
| swh_Latn_v1 | Swahili (Filtered) | 20.94% | 15000 |
| swh_Latn_v2 | Swahili (Bible) | 3.31% | 10458 |
| teo_Latn | Ateso | 33.88% | 4901 |
## Architecture

The model uses:

- Frozen w2v-bert-2.0 encoder - extracts audio representations
- Bottleneck adapter - language-specific adaptation (trainable)
- Lightweight decoder - transformer decoder block (trainable)
- LM head - per-language vocabulary projection (trainable)

```
Audio → Encoder (frozen) → Adapter → Decoder → LayerNorm → LM Head → Text
```
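The adapter code itself is not included in this card, so the module below is a minimal PyTorch sketch of an MMS-style bottleneck adapter with the stated dimension of 64: layer norm, down-projection, non-linearity, up-projection, and a residual connection. The class name, the use of GELU, and the placement of the layer norm are assumptions for illustration.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Minimal MMS-style bottleneck adapter (sketch, not the exact implementation)."""

    def __init__(self, hidden_size: int = 1024, adapter_dim: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.down = nn.Linear(hidden_size, adapter_dim)  # project 1024 -> 64
        self.act = nn.GELU()
        self.up = nn.Linear(adapter_dim, hidden_size)    # project 64 -> 1024

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        residual = hidden_states
        x = self.norm(hidden_states)
        x = self.up(self.act(self.down(x)))
        return residual + x  # residual keeps the frozen encoder output intact

# Quick shape check on a dummy encoder output (batch=2, frames=100, hidden=1024).
adapter = BottleneckAdapter()
print(adapter(torch.randn(2, 100, 1024)).shape)  # torch.Size([2, 100, 1024])
```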
## Usage

Each adapter folder contains:

- `adapter_weights.pt` - Bottleneck adapter weights
- `decoder_weights.pt` - Decoder block weights
- `lm_head_weights.pt` - Language model head weights
- `final_norm_weights.pt` - Final layer norm weights
- `vocab.json` - Language-specific vocabulary
- `adapter_config.json` - Adapter configuration
- `metrics.json` - Training metrics
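To check which of these files ship with a particular adapter, the repository listing can be filtered by subfolder; this is a small convenience snippet rather than part of the original card.

```python
from huggingface_hub import list_repo_files

repo_id = "mutisya/w2v-bert-adapters-14lang-e10-25_52-v6"
adapter_id = "kik_Latn"

# Keep only the files under this adapter's folder.
files = [f for f in list_repo_files(repo_id) if f.startswith(f"{adapter_id}/")]
print("\n".join(files))
```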
### Loading an Adapter

```python
import json

import torch
from huggingface_hub import hf_hub_download
from transformers import Wav2Vec2BertProcessor

repo_id = "mutisya/w2v-bert-adapters-14lang-e10-25_52-v6"

# Load the processor for a specific language (e.g., kik_Latn for Kikuyu).
adapter_id = "kik_Latn"
processor = Wav2Vec2BertProcessor.from_pretrained(repo_id, subfolder=adapter_id)

# Load the adapter configuration.
config_path = hf_hub_download(repo_id, f"{adapter_id}/adapter_config.json")
with open(config_path) as f:
    adapter_config = json.load(f)

# Load the trainable weights: adapter, decoder, LM head, and final layer norm.
adapter_weights = torch.load(
    hf_hub_download(repo_id, f"{adapter_id}/adapter_weights.pt"),
    map_location="cpu",
)
decoder_weights = torch.load(
    hf_hub_download(repo_id, f"{adapter_id}/decoder_weights.pt"),
    map_location="cpu",
)
lm_head_weights = torch.load(
    hf_hub_download(repo_id, f"{adapter_id}/lm_head_weights.pt"),
    map_location="cpu",
)
final_norm_weights = torch.load(
    hf_hub_download(repo_id, f"{adapter_id}/final_norm_weights.pt"),
    map_location="cpu",
)
```
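The card stops at loading the state dicts, so the continuation below is only a sketch of how the pieces could be wired together for inference, following the flow in the Architecture section (frozen encoder, adapter, decoder, final norm, LM head, then greedy CTC decoding). It reuses `processor`, `adapter_config`, and the loaded weights from the snippet above plus the illustrative `BottleneckAdapter` class from the Architecture section. The decoder, final-norm, and LM-head definitions (and the `adapter_dim` config key) are assumptions, the real state-dict keys may not match these stand-in modules, and `audio_array` is expected to be a 16 kHz mono waveform loaded by the reader.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2BertModel

# Frozen base encoder (600M parameters).
encoder = Wav2Vec2BertModel.from_pretrained("facebook/w2v-bert-2.0").eval()

# Rebuild the trainable pieces; module definitions are stand-ins, not the originals.
hidden_size = encoder.config.hidden_size                  # 1024 for w2v-bert-2.0
vocab_size = len(processor.tokenizer)                     # per-language vocabulary
adapter = BottleneckAdapter(hidden_size, adapter_config.get("adapter_dim", 64))
decoder_layer = nn.TransformerEncoderLayer(hidden_size, nhead=8, batch_first=True)
decoder = nn.TransformerEncoder(decoder_layer, num_layers=2)  # "lightweight decoder" stand-in
final_norm = nn.LayerNorm(hidden_size)
lm_head = nn.Linear(hidden_size, vocab_size)

# These calls assume the saved key names line up with the stand-in modules.
adapter.load_state_dict(adapter_weights)
decoder.load_state_dict(decoder_weights)
final_norm.load_state_dict(final_norm_weights)
lm_head.load_state_dict(lm_head_weights)

# Greedy CTC decoding for one 16 kHz mono waveform (audio_array).
inputs = processor(audio_array, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state
    logits = lm_head(final_norm(decoder(adapter(hidden))))
pred_ids = logits.argmax(dim=-1)
print(processor.batch_decode(pred_ids)[0])
```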
## Training Configuration

- Epochs: 10
- Learning Rate: 0.0005
- Batch Size: 48 × 1 (effective: 48)
- Extended Vocabulary: True
- Adapter Dimension: 64
- Decoder Layers: 2
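For orientation, the sketch below shows what a single training step with this configuration might look like: the encoder stays frozen, only the adapter, decoder, final norm, and LM head receive gradients, and CTC loss is computed against the per-language vocabulary with the pad token as blank. It reuses the module names from the earlier sketches, assumes AdamW and unpadded audio within a batch, and is not taken from the actual training code.

```python
import torch
import torch.nn as nn

# Hypothetical training step; reuses encoder/adapter/decoder/final_norm/lm_head
# and processor from the sketches above.
ctc_loss = nn.CTCLoss(blank=processor.tokenizer.pad_token_id, zero_infinity=True)
trainable_params = [
    *adapter.parameters(), *decoder.parameters(),
    *final_norm.parameters(), *lm_head.parameters(),
]
optimizer = torch.optim.AdamW(trainable_params, lr=5e-4)  # listed learning rate

def train_step(batch):
    """batch: dict with 'input_features', padded 'labels', and 'label_lengths'."""
    with torch.no_grad():                                   # encoder is frozen
        hidden = encoder(input_features=batch["input_features"]).last_hidden_state
    logits = lm_head(final_norm(decoder(adapter(hidden))))  # (B, T, V)
    log_probs = logits.log_softmax(dim=-1).transpose(0, 1)  # CTCLoss expects (T, B, V)
    # Assumes no audio padding within the batch; otherwise derive lengths from the mask.
    input_lengths = torch.full((logits.size(0),), logits.size(1), dtype=torch.long)
    loss = ctc_loss(log_probs, batch["labels"], input_lengths, batch["label_lengths"])
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```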
## Supported Languages

The following languages have trained adapters:

- Acholi (`ach_Latn`): WER 22.08%
- English (SALT) (`eng_Latn_salt`): WER 99.33%
- English (TTS) (`eng_Latn_tts`): WER 99.80%
- Fulah (`ful_Latn`): WER 99.02%
- Kamba (`kam_Latn`): WER 31.91%
- Kikuyu (`kik_Latn`): WER 15.36%
- Luganda (SALT) (`lug_Latn_salt`): WER 28.15%
- Luo (`luo_Latn`): WER 17.69%
- Kimeru (`mer_Latn`): WER 34.70%
- Runyankole (`nyn_Latn`): WER 30.46%
- Swahili (SALT) (`swh_Latn_salt`): WER 95.23%
- Swahili (Filtered) (`swh_Latn_v1`): WER 20.94%
- Swahili (Bible) (`swh_Latn_v2`): WER 3.31%
- Ateso (`teo_Latn`): WER 33.88%
## License

Apache 2.0
## Citation

```bibtex
@misc{w2vbert-asr-adapters,
  author    = {Mutisya},
  title     = {W2V-BERT 2.0 ASR Adapters for African Languages},
  year      = {2024},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/mutisya/w2v-bert-adapters-14lang-e10-25_52-v6}
}
```