🗣️ Kiswahili Sahihi ASR Adapted 1 (Adapter-Fused Whisper Model)
Overview
keystats/kiswahili_sahihi_asr_adapted_1 is a refined Swahili automatic speech recognition (ASR) model optimized for on-device use and low-resource settings.
It extends the Whisper Medium architecture through parameter-efficient fine-tuning (PEFT) with LoRA adapters, reaching a validation WER of roughly 11–12% (see the training snapshot below) while keeping the model lightweight and deployable.
This model builds upon the foundation of Kiswahili Sahihi ASR and was developed for the Your Voice, Your Device, Your Language Challenge, advancing accessible speech technology for over 200 million Swahili speakers.
🧩 Base Model
- Base architecture: keystats/kiswahili_sahihi_asr
- Training framework: PEFT (LoRA)
- Tokenizer vocab size: 51,866 tokens
- Trainable parameters: 2.36M (≈0.31% of the total 766M)
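The exact LoRA hyperparameters are not listed in this card, but a rank-8 adapter on the attention query/value projections is consistent with the ~2.36M trainable-parameter count above. A minimal sketch of such a setup (alpha, dropout, and the other values are assumptions):

```python
from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration

# Load the Swahili Whisper base model that the adapters were trained on
base = WhisperForConditionalGeneration.from_pretrained("keystats/kiswahili_sahihi_asr")

# Hypothetical adapter settings: r=8 on q_proj/v_proj matches the ~2.36M
# trainable-parameter figure; alpha and dropout are guesses.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
)

peft_model = get_peft_model(base, lora_config)
peft_model.print_trainable_parameters()
# trainable params: ~2.36M || all params: ~766M || trainable%: ~0.31
```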
⚙️ Training Configuration
| Parameter | Value |
|---|---|
| Dataset | Sunbird/salt (studio-swa) |
| Noise Augmentation | Sunbird/urban-noise-uganda-61k |
| Effective Batch Size | 8 (per_device_train_batch_size=4, gradient_accumulation_steps=2) |
| Learning Rate | 1e-5 |
| Warmup Steps | 500 |
| Epochs | 3 |
| Precision | Mixed precision (fp16) |
| Quantization | 8-bit via bitsandbytes |
| Memory Optimization | gradient_checkpointing=True |
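For reference, these settings map roughly onto `Seq2SeqTrainingArguments` as sketched below. The output directory, eval/save cadence, and logging values are illustrative assumptions, and the 8-bit bitsandbytes quantization is applied when loading the base model rather than through these arguments.

```python
from transformers import Seq2SeqTrainingArguments

# Mirrors the table above; output_dir, eval/save cadence, and logging are assumptions.
training_args = Seq2SeqTrainingArguments(
    output_dir="./kiswahili_sahihi_asr_adapted_1",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,   # effective batch size of 8
    learning_rate=1e-5,
    warmup_steps=500,
    num_train_epochs=3,
    fp16=True,                       # mixed precision
    gradient_checkpointing=True,     # memory optimization
    eval_strategy="steps",
    eval_steps=200,
    save_steps=200,
    logging_steps=50,
    predict_with_generate=True,      # needed to compute WER/CER during evaluation
    report_to=["tensorboard"],
)
```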
🧠 Training Summary
✅ Added <|pad|> token to tokenizer
✅ Resized embeddings to 51,866 tokens
✅ PEFT + LoRA adapters merged
✅ Dataset: Train = 3,758 | Validation = 77
✅ 1,000 noise clips used for augmentation
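The exact augmentation pipeline is not included here; the sketch below shows one common way to mix urban-noise clips into a speech waveform at a chosen signal-to-noise ratio (the function name and SNR range are assumptions).

```python
import numpy as np

def mix_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float = 10.0) -> np.ndarray:
    """Mix a 16 kHz mono noise clip into a speech waveform at a target SNR (dB)."""
    # Tile or trim the noise clip to match the speech length
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    # Scale the noise so that speech power / noise power hits the requested SNR
    speech_power = np.mean(speech ** 2) + 1e-10
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# e.g. noisy = mix_noise(clean_waveform, noise_clip, snr_db=np.random.uniform(5, 20))
```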
Final Training Snapshot:
| Step | Training Loss | Validation Loss | WER (%) | CER (%) |
|---|---|---|---|---|
| 200 | 0.6747 | 0.6675 | 16.23 | 4.82 |
| 400 | 0.5860 | 0.5616 | 15.56 | 4.74 |
| 600 | 0.5368 | 0.4852 | 12.09 | 4.11 |
| 800 | 0.4646 | 0.4447 | 12.58 | 4.23 |
| 1000 | 0.4267 | 0.4154 | 12.42 | 4.23 |
| 1200 | 0.4279 | 0.3949 | 11.42 | 4.03 |
| 1400 | 0.3965 | 0.3902 | 11.59 | 4.11 |
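The evaluation script itself is not shown, but WER and CER figures like those above can be computed from decoded predictions with the jiwer package (installed in the usage snippet below). A minimal sketch with toy strings:

```python
import jiwer

def wer_cer(references: list[str], predictions: list[str]) -> tuple[float, float]:
    """Return word error rate and character error rate as percentages."""
    return jiwer.wer(references, predictions) * 100, jiwer.cer(references, predictions) * 100

# Toy example
refs = ["habari ya asubuhi", "karibu sana"]
hyps = ["habari ya asubuhi", "karibu san"]
print(wer_cer(refs, hyps))  # (WER %, CER %)
```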
🧩 Model Usage
# ============================================
# Full Swahili ASR Transcription with Adapted Model
# ============================================
# 📦 Installation
!pip install -q transformers "datasets<4.0.0" peft
!pip install -q torchaudio==2.6.0 torchvision==0.21.0 jiwer evaluate
!pip install -q soundfile librosa "accelerate>=0.26.0" tensorboard bitsandbytes
!pip install -q pydub imageio-ffmpeg
!apt-get -y install ffmpeg
# ============================================
# 1️⃣ Imports
# ============================================
import torch
import librosa
import numpy as np
from pydub import AudioSegment
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from peft import PeftModel, PeftConfig
import imageio_ffmpeg as ffmpeg_lib
import os
# ============================================
# 2️⃣ Register ffmpeg for pydub (no sudo)
# ============================================
from pydub.utils import which
ffmpeg_path = ffmpeg_lib.get_ffmpeg_exe()
AudioSegment.converter = ffmpeg_path
AudioSegment.ffmpeg = ffmpeg_path
AudioSegment.ffprobe = ffmpeg_path
print("β
ffmpeg and ffprobe linked successfully!")
# ============================================
# 3️⃣ Load Adapted Model (with vocab fix)
# ============================================
base_model_id = "keystats/kiswahili_sahihi_asr"
adapter_model_path = "keystats/kiswahili_sahihi_asr_adapted_1"
print(f"πΉ Loading processor from: {adapter_model_path}")
processor = WhisperProcessor.from_pretrained(adapter_model_path)
vocab_size = len(processor.tokenizer)
print(f"πΉ Tokenizer vocab size: {vocab_size}")
# Load PEFT config
peft_config = PeftConfig.from_pretrained(adapter_model_path)
print(f"πΉ PEFT base model: {peft_config.base_model_name_or_path}")
# Load base Whisper model
base_model = WhisperForConditionalGeneration.from_pretrained(
peft_config.base_model_name_or_path,
ignore_mismatched_sizes=True,
)
# Fix vocab size mismatch
base_model.resize_token_embeddings(vocab_size)
print(f"β
Resized token embeddings to match vocab size ({vocab_size})")
# Load and merge adapter
model = PeftModel.from_pretrained(base_model, adapter_model_path)
print("✅ Adapter loaded successfully")
model = model.merge_and_unload()
print("✅ Adapter merged and unloaded")
# Move to device
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device, dtype=torch.float32)
model.eval()
print(f"π Model ready on {device.upper()}")
# ============================================
# 4️⃣ Convert Any Format to WAV
# ============================================
def convert_to_wav(input_path, output_path="converted.wav"):
    """Convert MP3, M4A, or any audio file to mono 16kHz WAV."""
    try:
        audio = AudioSegment.from_file(input_path)
        audio = audio.set_frame_rate(16000).set_channels(1)
        audio.export(output_path, format="wav")
        return output_path
    except Exception as e:
        raise RuntimeError(f"❌ Could not convert file. Error: {e}")

# Replace this with your Swahili audio file
audio_path = "Your audio here"
wav_path = convert_to_wav(audio_path)
print(f"✅ Converted to: {wav_path}")
# ============================================
# 5️⃣ Load Audio and Chunk
# ============================================
audio_input, sr = librosa.load(wav_path, sr=16000, mono=True)
chunk_length_s = 30  # seconds; Whisper's feature extractor processes 30-second windows
chunk_size = chunk_length_s * sr
num_chunks = int(np.ceil(len(audio_input) / chunk_size))
print(f"πΉ Total length: {len(audio_input)/sr:.2f}s | Splitting into {num_chunks} chunks...")
# ============================================
# 6️⃣ Transcribe (without forced_decoder_ids)
# ============================================
full_transcription = []
# Optional: look up the language/task prompt tokens in case you want to force them
# (e.g. via forced_decoder_ids). They are not passed to generate() below, which
# relies on the model's default decoder prompt.
lang_token = processor.tokenizer.convert_tokens_to_ids("<|swahili|>")
task_token = processor.tokenizer.convert_tokens_to_ids("<|transcribe|>")
start_tokens = torch.tensor([[lang_token, task_token]], device=device)
for i in range(num_chunks):
    start = i * chunk_size
    end = min((i + 1) * chunk_size, len(audio_input))
    chunk = audio_input[start:end]

    inputs = processor(
        chunk,
        sampling_rate=16000,
        return_tensors="pt",
        return_attention_mask=True,
    ).to(device, dtype=torch.float32)

    with torch.no_grad():
        generated_ids = model.generate(
            **inputs,
            max_new_tokens=256,
            num_beams=2,
            repetition_penalty=1.1,
        )

    text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    full_transcription.append(text.strip())
    print(f"🟢 Chunk {i+1}/{num_chunks} done")
# ============================================
# 7️⃣ Combine Final Transcript
# ============================================
final_text = " ".join(full_transcription)
print("\nπ Final Transcription:\n")
print(final_text)
💡 Key Features
- 🧠 Adapter-based fine-tuning (LoRA) – trains <1% of parameters
- 🪶 Lightweight & efficient – runs on edge or T4 GPUs
- Noise-robust – trained with real-world East African noise
- Privacy-preserving – suitable for offline/edge deployment (see the export sketch below)
- Focused on accessibility – targets Swahili and low-resource regions
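For offline or edge deployment, the merged model produced in the usage snippet above can be exported once and then reloaded without network access; the directory name below is illustrative.

```python
# Export the merged model and processor from the usage snippet above so they
# can be loaded fully offline later. The directory name is an assumption.
export_dir = "./kiswahili_sahihi_asr_merged"
model.save_pretrained(export_dir)
processor.save_pretrained(export_dir)

# Later, without network access:
# from transformers import WhisperProcessor, WhisperForConditionalGeneration
# model = WhisperForConditionalGeneration.from_pretrained(export_dir)
# processor = WhisperProcessor.from_pretrained(export_dir)
```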
🧾 Acknowledgments
This model builds upon the architecture and open-source contributions of Salifou Abdourahamane, creator of the excellent swahili_asr_sota_model.
Citation
If you use this model, please credit:
Jackson Kahungu, Kiswahili Sahihi ASR Adapted 1. Hugging Face: keystats/kiswahili_sahihi_asr_adapted_1