🗣️ Kiswahili Sahihi ASR Adapted 1 (Adapter-Fused Whisper Model)

Overview

keystats/kiswahili_sahihi_asr_adapted_1 is a refined Swahili automatic speech recognition (ASR) model optimized for on-device use and low-resource settings.
It extends the Whisper Medium architecture through parameter-efficient fine-tuning (PEFT) with LoRA adapters, reaching roughly 11.4% WER on the validation split while keeping the model lightweight and deployable.

This model builds upon the foundation of Kiswahili Sahihi ASR and was developed for the Your Voice, Your Device, Your Language Challenge, advancing accessible speech technology for over 200 million Swahili speakers.
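
Because this repository ships LoRA adapter weights rather than a full standalone checkpoint, inference means loading the base model, resizing its embeddings to the adapted vocabulary, and attaching (and optionally merging) the adapter. The condensed sketch below mirrors the full walkthrough in the Model Usage section further down:

# Condensed load path (see Model Usage below for the full, step-by-step version)
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from peft import PeftModel

adapter_id = "keystats/kiswahili_sahihi_asr_adapted_1"

processor = WhisperProcessor.from_pretrained(adapter_id)
base = WhisperForConditionalGeneration.from_pretrained(
    "keystats/kiswahili_sahihi_asr", ignore_mismatched_sizes=True
)
base.resize_token_embeddings(len(processor.tokenizer))   # account for the added <|pad|> token
model = PeftModel.from_pretrained(base, adapter_id).merge_and_unload()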


🧩 Base Model


⚙️ Training Configuration

| Parameter | Value |
|---|---|
| Dataset | Sunbird/salt (studio-swa) |
| Noise Augmentation | Sunbird/urban-noise-uganda-61k |
| Effective Batch Size | 8 (per_device_train_batch_size=4, gradient_accumulation_steps=2) |
| Learning Rate | 1e-5 |
| Warmup Steps | 500 |
| Epochs | 3 |
| Precision | Mixed precision (fp16) |
| Quantization | 8-bit via bitsandbytes |
| Memory Optimization | gradient_checkpointing=True |
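
The training script itself is not part of this card. As a rough illustration, the sketch below shows how the configuration above would typically be expressed with bitsandbytes 8-bit loading, a PEFT LoRA config, and Seq2SeqTrainingArguments; the LoRA rank, alpha, dropout, and target modules are assumptions for illustration, not the values actually used.

# Hypothetical mapping of the configuration above onto a LoRA training setup
# (LoRA rank/alpha/dropout/target_modules are illustrative assumptions)
from transformers import WhisperForConditionalGeneration, Seq2SeqTrainingArguments, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base = WhisperForConditionalGeneration.from_pretrained(
    "keystats/kiswahili_sahihi_asr",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit via bitsandbytes
)
base = prepare_model_for_kbit_training(base)  # prepares the quantized model for adapter training

lora_cfg = LoraConfig(r=32, lora_alpha=64, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, lora_cfg)

training_args = Seq2SeqTrainingArguments(
    output_dir="./kiswahili_sahihi_lora",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,   # effective batch size 8
    learning_rate=1e-5,
    warmup_steps=500,
    num_train_epochs=3,
    fp16=True,                       # mixed precision
    gradient_checkpointing=True,
)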

🧠 Training Summary

✅ Added <|pad|> token to tokenizer
✅ Resized embeddings to 51866 tokens
✅ PEFT + LoRA adapters merged
✅ Dataset: Train=3,758 | Validation=77
✅ 1,000 noise clips used for augmentation (see the mixing sketch below)
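
The augmentation code is likewise not included in the card. A minimal sketch of how clips from Sunbird/urban-noise-uganda-61k could be overlaid on the speech at a chosen signal-to-noise ratio is shown below; the function name and SNR range are illustrative assumptions.

# Hypothetical noise-mixing step (function name and SNR range are illustrative assumptions)
import numpy as np

def mix_with_noise(speech, noise, snr_db=10.0):
    """Overlay a noise clip on a speech clip at the requested signal-to-noise ratio."""
    if len(noise) < len(speech):                      # loop the noise clip if it is too short
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2) + 1e-10
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# e.g. augmented = mix_with_noise(example["audio"]["array"], noise_clip, snr_db=np.random.uniform(5, 20))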

Final Training Snapshot:

| Step | Training Loss | Validation Loss | WER (%) | CER (%) |
|---|---|---|---|---|
| 200 | 0.6747 | 0.6675 | 16.23 | 4.82 |
| 400 | 0.5860 | 0.5616 | 15.56 | 4.74 |
| 600 | 0.5368 | 0.4852 | 12.09 | 4.11 |
| 800 | 0.4646 | 0.4447 | 12.58 | 4.23 |
| 1000 | 0.4267 | 0.4154 | 12.42 | 4.23 |
| 1200 | 0.4279 | 0.3949 | 11.42 | 4.03 |
| 1400 | 0.3965 | 0.3902 | 11.59 | 4.11 |
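
The evaluation code is not shown in this card; the snippet below is a minimal sketch of computing WER and CER with the evaluate library (included in the install list further down), using placeholder prediction/reference strings.

# Computing WER/CER with the evaluate library (example strings are placeholders)
import evaluate

wer = evaluate.load("wer")
cer = evaluate.load("cer")

predictions = ["habari ya asubuhi"]   # model transcriptions
references  = ["habari za asubuhi"]   # ground-truth transcripts

wer_score = 100 * wer.compute(predictions=predictions, references=references)
cer_score = 100 * cer.compute(predictions=predictions, references=references)
print(f"WER (%): {wer_score:.2f} | CER (%): {cer_score:.2f}")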

🧩 Model Usage

# ============================================
# 🪄 Full Swahili ASR Transcription with Adapted Model
# ============================================
# 📦 Installation
!pip install -q transformers "datasets<4.0.0" peft
!pip install -q torchaudio==2.6.0 torchvision==0.21.0 jiwer evaluate
!pip install -q soundfile librosa "accelerate>=0.26.0" tensorboard bitsandbytes
!pip install -q pydub imageio-ffmpeg
!apt-get -y install ffmpeg

# ============================================
# 1️⃣ Imports
# ============================================
import torch
import librosa
import numpy as np
from pydub import AudioSegment
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from peft import PeftModel, PeftConfig
import imageio_ffmpeg as ffmpeg_lib
import os

# ============================================
# 2️⃣ Register ffmpeg for pydub (no sudo)
# ============================================
from pydub.utils import which

ffmpeg_path = ffmpeg_lib.get_ffmpeg_exe()
AudioSegment.converter = ffmpeg_path
AudioSegment.ffmpeg = ffmpeg_path
AudioSegment.ffprobe = ffmpeg_path

print("βœ… ffmpeg and ffprobe linked successfully!")

# ============================================
# 3️⃣ Load Adapted Model (with vocab fix)
# ============================================
base_model_id = "keystats/kiswahili_sahihi_asr"
adapter_model_path = "keystats/kiswahili_sahihi_asr_adapted_1"

print(f"πŸ”Ή Loading processor from: {adapter_model_path}")
processor = WhisperProcessor.from_pretrained(adapter_model_path)
vocab_size = len(processor.tokenizer)
print(f"πŸ”Ή Tokenizer vocab size: {vocab_size}")

# Load PEFT config
peft_config = PeftConfig.from_pretrained(adapter_model_path)
print(f"πŸ”Ή PEFT base model: {peft_config.base_model_name_or_path}")

# Load base Whisper model
base_model = WhisperForConditionalGeneration.from_pretrained(
    peft_config.base_model_name_or_path,
    ignore_mismatched_sizes=True,
)

# Fix vocab size mismatch
base_model.resize_token_embeddings(vocab_size)
print(f"βœ… Resized token embeddings to match vocab size ({vocab_size})")

# Load and merge adapter
model = PeftModel.from_pretrained(base_model, adapter_model_path)
print("βœ… Adapter loaded successfully")

model = model.merge_and_unload()
print("βœ… Adapter merged and unloaded")

# Move to device
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device, dtype=torch.float32)
model.eval()
print(f"πŸš€ Model ready on {device.upper()}")

# ============================================
# 4️⃣ Convert Any Format to WAV
# ============================================
def convert_to_wav(input_path, output_path="converted.wav"):
    """Convert MP3, M4A, or any audio file to mono 16kHz WAV."""
    try:
        audio = AudioSegment.from_file(input_path)
        audio = audio.set_frame_rate(16000).set_channels(1)
        audio.export(output_path, format="wav")
        return output_path
    except Exception as e:
        raise RuntimeError(f"❌ Could not convert file. Error: {e}")

# 🎧 Replace this with your Swahili audio file
audio_path = "Your audio here"
wav_path = convert_to_wav(audio_path)
print(f"βœ… Converted to: {wav_path}")

# ============================================
# 5️⃣ Load Audio and Chunk
# ============================================
audio_input, sr = librosa.load(wav_path, sr=16000, mono=True)
chunk_length_s = 60  # seconds
chunk_size = chunk_length_s * sr
num_chunks = int(np.ceil(len(audio_input) / chunk_size))
print(f"πŸ”Ή Total length: {len(audio_input)/sr:.2f}s | Splitting into {num_chunks} chunks...")

# ============================================
# 6️⃣ Transcribe (without forced_decoder_ids)
# ============================================
full_transcription = []

# Pin decoding to Swahili transcription; Whisper's language token is <|sw|>, so we pass
# language/task to generate() below instead of building decoder input ids by hand.
lang = "sw"
task = "transcribe"

for i in range(num_chunks):
    start = i * chunk_size
    end = min((i + 1) * chunk_size, len(audio_input))
    chunk = audio_input[start:end]

    inputs = processor(
        chunk,
        sampling_rate=16000,
        return_tensors="pt",
        return_attention_mask=True,   
    ).to(device, dtype=torch.float32)

    with torch.no_grad():
        generated_ids = model.generate(
            **inputs,
            language=lang,
            task=task,
            max_new_tokens=256,
            num_beams=2,
            repetition_penalty=1.1,
        )

    text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    full_transcription.append(text.strip())

    print(f"🟒 Chunk {i+1}/{num_chunks} done")


# ============================================
# 7️⃣ Combine Final Transcript
# ============================================
final_text = " ".join(full_transcription)
print("\nπŸ“ Final Transcription:\n")
print(final_text)

💡 Key Features

  • 🧠 Adapter-based fine-tuning (LoRA): trains <1% of parameters
  • 🪶 Lightweight & efficient: runs on edge or T4 GPUs
  • 🔊 Noise-robust: trained with real-world East African noise
  • 🕊️ Privacy-preserving: suitable for offline/edge deployment (see the export sketch below)
  • 🌍 Focused on accessibility: targets Swahili and low-resource regions
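
For offline or edge deployment, the merged model produced by the usage script above can be saved locally so that transcription needs no network access. A minimal sketch, continuing from that script (the directory name is arbitrary):

# Save the merged model and processor for fully offline use (directory name is arbitrary)
save_dir = "kiswahili_sahihi_asr_adapted_1_merged"
model.save_pretrained(save_dir)
processor.save_pretrained(save_dir)

# Reload later without network access:
# model = WhisperForConditionalGeneration.from_pretrained(save_dir)
# processor = WhisperProcessor.from_pretrained(save_dir)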

🧾 Acknowledgments

This model builds upon the architecture and open-source contributions of Salifou Abdourahamane, creator of the excellent swahili_asr_sota_model.


📚 Citation

If you use this model, please credit:

Jackson Kahungu, Kiswahili Sahihi ASR Adapted 1. Hugging Face: keystats/kiswahili_sahihi_asr_adapted_1

