🗣️ Kiswahili Sahihi ASR Adapted 1 (Adapter-Fused Whisper Model)
Overview
keystats/kiswahili_sahihi_asr_adapted_1 is a refined Swahili automatic speech recognition (ASR) model optimized for on-device use and low-resource settings.
It extends the Whisper Medium architecture through parameter-efficient fine-tuning (PEFT) with LoRA adapters, reaching a validation WER of roughly 11–12% (see the training snapshot below) while keeping the model lightweight and deployable.
This model builds upon the foundation of Kiswahili Sahihi ASR and was developed for the Your Voice, Your Device, Your Language Challenge, advancing accessible speech technology for over 200 million Swahili speakers.
🧩 Base Model
- Base architecture: keystats/kiswahili_sahihi_asr
- Training framework: PEFT (LoRA)
- Tokenizer vocab size: 51,866 tokens
- Trainable parameters: 2.36M (≈0.31% of the total 766M)
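The exact LoRA hyperparameters are not listed in this card, but a rank-8 adapter on the attention query/value projections is consistent with the ~2.36M trainable-parameter count above. A minimal sketch of such a setup (alpha, dropout, and the other values are assumptions):

```python
from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration

# Load the Swahili Whisper base model that the adapters were trained on
base = WhisperForConditionalGeneration.from_pretrained("keystats/kiswahili_sahihi_asr")

# Hypothetical adapter settings: r=8 on q_proj/v_proj matches the ~2.36M
# trainable-parameter figure; alpha and dropout are guesses.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
)

peft_model = get_peft_model(base, lora_config)
peft_model.print_trainable_parameters()
# trainable params: ~2.36M || all params: ~766M || trainable%: ~0.31
```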
⚙️ Training Configuration
| Parameter | Value |
|---|---|
| Dataset | Sunbird/salt (studio-swa) |
| Noise Augmentation | Sunbird/urban-noise-uganda-61k |
| Effective Batch Size | 8 (per_device_train_batch_size=4, gradient_accumulation_steps=2) |
| Learning Rate | 1e-5 |
| Warmup Steps | 500 |
| Epochs | 3 |
| Precision | Mixed precision (fp16) |
| Quantization | 8-bit via bitsandbytes |
| Memory Optimization | gradient_checkpointing=True |
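For reference, these settings map roughly onto `Seq2SeqTrainingArguments` as sketched below. The output directory, eval/save cadence, and logging values are illustrative assumptions, and the 8-bit bitsandbytes quantization is applied when loading the base model rather than through these arguments.

```python
from transformers import Seq2SeqTrainingArguments

# Mirrors the table above; output_dir, eval/save cadence, and logging are assumptions.
training_args = Seq2SeqTrainingArguments(
    output_dir="./kiswahili_sahihi_asr_adapted_1",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,   # effective batch size of 8
    learning_rate=1e-5,
    warmup_steps=500,
    num_train_epochs=3,
    fp16=True,                       # mixed precision
    gradient_checkpointing=True,     # memory optimization
    eval_strategy="steps",
    eval_steps=200,
    save_steps=200,
    logging_steps=50,
    predict_with_generate=True,      # needed to compute WER/CER during evaluation
    report_to=["tensorboard"],
)
```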
🧠 Training Summary
✅ Added <|pad|> token to tokenizer
✅ Resized embeddings to 51,866 tokens
✅ PEFT + LoRA adapters merged
✅ Dataset: Train = 3,758 | Validation = 77
✅ 1,000 noise clips used for augmentation
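The exact augmentation pipeline is not included here; the sketch below shows one common way to mix urban-noise clips into a speech waveform at a chosen signal-to-noise ratio (the function name and SNR range are assumptions).

```python
import numpy as np

def mix_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float = 10.0) -> np.ndarray:
    """Mix a 16 kHz mono noise clip into a speech waveform at a target SNR (dB)."""
    # Tile or trim the noise clip to match the speech length
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    # Scale the noise so that speech power / noise power hits the requested SNR
    speech_power = np.mean(speech ** 2) + 1e-10
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# e.g. noisy = mix_noise(clean_waveform, noise_clip, snr_db=np.random.uniform(5, 20))
```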
Final Training Snapshot:
| Step | Training Loss | Validation Loss | WER (%) | CER (%) |
|---|---|---|---|---|
| 200 | 0.6747 | 0.6675 | 16.23 | 4.82 |
| 400 | 0.5860 | 0.5616 | 15.56 | 4.74 |
| 600 | 0.5368 | 0.4852 | 12.09 | 4.11 |
| 800 | 0.4646 | 0.4447 | 12.58 | 4.23 |
| 1000 | 0.4267 | 0.4154 | 12.42 | 4.23 |
| 1200 | 0.4279 | 0.3949 | 11.42 | 4.03 |
| 1400 | 0.3965 | 0.3902 | 11.59 | 4.11 |
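The evaluation script itself is not shown, but WER and CER figures like those above can be computed from decoded predictions with the jiwer package (installed in the usage snippet below). A minimal sketch with toy strings:

```python
import jiwer

def wer_cer(references: list[str], predictions: list[str]) -> tuple[float, float]:
    """Return word error rate and character error rate as percentages."""
    return jiwer.wer(references, predictions) * 100, jiwer.cer(references, predictions) * 100

# Toy example
refs = ["habari ya asubuhi", "karibu sana"]
hyps = ["habari ya asubuhi", "karibu san"]
print(wer_cer(refs, hyps))  # (WER %, CER %)
```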
🧩 Model Usage
# ============================================
# Full Swahili ASR Transcription with Adapted Model
# ============================================
# 📦 Installation
!pip install -q transformers "datasets<4.0.0" peft
!pip install -q torchaudio==2.6.0 torchvision==0.21.0 jiwer evaluate
!pip install -q soundfile librosa "accelerate>=0.26.0" tensorboard bitsandbytes
!pip install -q pydub imageio-ffmpeg
!apt-get -y install ffmpeg
# ============================================
# 1️⃣ Imports
# ============================================
import torch
import librosa
import numpy as np
from pydub import AudioSegment
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from peft import PeftModel, PeftConfig
import imageio_ffmpeg as ffmpeg_lib
import os
# ============================================
# 2️⃣ Register ffmpeg for pydub (no sudo)
# ============================================
from pydub.utils import which
ffmpeg_path = ffmpeg_lib.get_ffmpeg_exe()
AudioSegment.converter = ffmpeg_path
AudioSegment.ffmpeg = ffmpeg_path
AudioSegment.ffprobe = ffmpeg_path
print("β
ffmpeg and ffprobe linked successfully!")
# ============================================
# 3️⃣ Load Adapted Model (with vocab fix)
# ============================================
base_model_id = "keystats/kiswahili_sahihi_asr"
adapter_model_path = "keystats/kiswahili_sahihi_asr_adapted_1"
print(f"πΉ Loading processor from: {adapter_model_path}")
processor = WhisperProcessor.from_pretrained(adapter_model_path)
vocab_size = len(processor.tokenizer)
print(f"πΉ Tokenizer vocab size: {vocab_size}")
# Load PEFT config
peft_config = PeftConfig.from_pretrained(adapter_model_path)
print(f"πΉ PEFT base model: {peft_config.base_model_name_or_path}")
# Load base Whisper model
base_model = WhisperForConditionalGeneration.from_pretrained(
peft_config.base_model_name_or_path,
ignore_mismatched_sizes=True,
)
# Fix vocab size mismatch
base_model.resize_token_embeddings(vocab_size)
print(f"β
Resized token embeddings to match vocab size ({vocab_size})")
# Load and merge adapter
model = PeftModel.from_pretrained(base_model, adapter_model_path)
print("✅ Adapter loaded successfully")
model = model.merge_and_unload()
print("✅ Adapter merged and unloaded")
# Move to device
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device, dtype=torch.float32)
model.eval()
print(f"π Model ready on {device.upper()}")
# ============================================
# 4️⃣ Convert Any Format to WAV
# ============================================
def convert_to_wav(input_path, output_path="converted.wav"):
    """Convert MP3, M4A, or any audio file to mono 16kHz WAV."""
    try:
        audio = AudioSegment.from_file(input_path)
        audio = audio.set_frame_rate(16000).set_channels(1)
        audio.export(output_path, format="wav")
        return output_path
    except Exception as e:
        raise RuntimeError(f"❌ Could not convert file. Error: {e}")

# Replace this with your Swahili audio file
audio_path = "Your audio here"
wav_path = convert_to_wav(audio_path)
print(f"✅ Converted to: {wav_path}")
# ============================================
# 5️⃣ Load Audio and Chunk
# ============================================
audio_input, sr = librosa.load(wav_path, sr=16000, mono=True)
chunk_length_s = 30  # seconds; Whisper's feature extractor processes 30-second windows
chunk_size = chunk_length_s * sr
num_chunks = int(np.ceil(len(audio_input) / chunk_size))
print(f"πΉ Total length: {len(audio_input)/sr:.2f}s | Splitting into {num_chunks} chunks...")
# ============================================
# 6️⃣ Transcribe (without forced_decoder_ids)
# ============================================
full_transcription = []
# Optional: look up the language/task prompt tokens in case you want to force them
# (e.g. via forced_decoder_ids). They are not passed to generate() below, which
# relies on the model's default decoder prompt.
lang_token = processor.tokenizer.convert_tokens_to_ids("<|swahili|>")
task_token = processor.tokenizer.convert_tokens_to_ids("<|transcribe|>")
start_tokens = torch.tensor([[lang_token, task_token]], device=device)
for i in range(num_chunks):
    start = i * chunk_size
    end = min((i + 1) * chunk_size, len(audio_input))
    chunk = audio_input[start:end]

    inputs = processor(
        chunk,
        sampling_rate=16000,
        return_tensors="pt",
        return_attention_mask=True,
    ).to(device, dtype=torch.float32)

    with torch.no_grad():
        generated_ids = model.generate(
            **inputs,
            max_new_tokens=256,
            num_beams=2,
            repetition_penalty=1.1,
        )

    text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    full_transcription.append(text.strip())
    print(f"🟢 Chunk {i+1}/{num_chunks} done")
# ============================================
# 7️⃣ Combine Final Transcript
# ============================================
final_text = " ".join(full_transcription)
print("\nπ Final Transcription:\n")
print(final_text)
💡 Key Features
- 🧠 Adapter-based fine-tuning (LoRA) – trains <1% of parameters
- 🪶 Lightweight & efficient – runs on edge or T4 GPUs
- Noise-robust – trained with real-world East African noise
- Privacy-preserving – suitable for offline/edge deployment (see the export sketch below)
- Focused on accessibility – targets Swahili and low-resource regions
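For offline or edge deployment, the merged model produced in the usage snippet above can be exported once and then reloaded without network access; the directory name below is illustrative.

```python
# Export the merged model and processor from the usage snippet above so they
# can be loaded fully offline later. The directory name is an assumption.
export_dir = "./kiswahili_sahihi_asr_merged"
model.save_pretrained(export_dir)
processor.save_pretrained(export_dir)

# Later, without network access:
# from transformers import WhisperProcessor, WhisperForConditionalGeneration
# model = WhisperForConditionalGeneration.from_pretrained(export_dir)
# processor = WhisperProcessor.from_pretrained(export_dir)
```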
🧾 Acknowledgments
This model builds upon the architecture and open-source contributions of Salifou Abdourahamane, creator of the excellent swahili_asr_sota_model.
Citation
If you use this model, please credit:
Jackson Kahungu, Kiswahili Sahihi ASR Adapted 1. Hugging Face: keystats/kiswahili_sahihi_asr_adapted_1