---
base_model: keystats/kiswahili_sahihi_asr
library_name: peft
tags:
- base_model:adapter:keystats/kiswahili_sahihi_asr
- lora
- transformers
license: apache-2.0
datasets:
- Sunbird/salt
- Sunbird/urban-noise-uganda-61k
language:
- sw
metrics:
- wer
- cer
pipeline_tag: automatic-speech-recognition
new_version: keystats/kiswahili_sahihi_asr_adapted_2
---

## 🗣️ Kiswahili Sahihi ASR Adapted 1 (Adapter-Fused Whisper Model)

### Overview

**`keystats/kiswahili_sahihi_asr_adapted_1`** is a refined **Swahili automatic speech recognition (ASR)** model optimized for **on-device use** and **low-resource settings**. It extends the [Whisper Medium](https://huggingface.co/openai/whisper-medium) architecture through **parameter-efficient fine-tuning (PEFT)** with **LoRA adapters**, achieving high transcription accuracy while keeping the model lightweight and deployable.

This model builds on [Kiswahili Sahihi ASR](https://huggingface.co/keystats/kiswahili_sahihi_asr) and was developed for the [Your Voice, Your Device, Your Language Challenge](https://zindi.africa/competitions/your-voice-your-device-your-language-challenge), advancing **accessible speech technology for over 200 million Swahili speakers**.

---

### 🧩 Base Model

* **Base architecture:** [`keystats/kiswahili_sahihi_asr`](https://huggingface.co/keystats/kiswahili_sahihi_asr)
* **Training framework:** [PEFT (LoRA)](https://github.com/huggingface/peft)
* **Tokenizer vocab size:** `51866` tokens
* **Trainable parameters:** `2.36M` (≈0.31% of the 766M total)

---

### ⚙️ Training Configuration

| Parameter | Value |
| ------------------------ | ------------------------------------------------------------------------------------------------ |
| **Dataset** | [Sunbird/salt](https://huggingface.co/datasets/Sunbird/salt) (studio-swa) |
| **Noise Augmentation** | [Sunbird/urban-noise-uganda-61k](https://huggingface.co/datasets/Sunbird/urban-noise-uganda-61k) |
| **Effective Batch Size** | 8 (`per_device_train_batch_size=4`, `gradient_accumulation_steps=2`) |
| **Learning Rate** | `1e-5` |
| **Warmup Steps** | `500` |
| **Epochs** | 3 |
| **Precision** | Mixed precision (`fp16`) |
| **Quantization** | 8-bit via `bitsandbytes` |
| **Memory Optimization** | `gradient_checkpointing=True` |
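For orientation, here is a minimal sketch of how these hyperparameters map onto a PEFT + Transformers training setup. The LoRA rank, alpha, dropout, and target modules are illustrative assumptions (the card only states that ≈2.36M parameters were trainable); the `Seq2SeqTrainingArguments` values come from the table above.

```python
from transformers import (
    BitsAndBytesConfig,
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 8-bit base model via bitsandbytes, as listed under "Quantization"
base = WhisperForConditionalGeneration.from_pretrained(
    "keystats/kiswahili_sahihi_asr",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)
base = prepare_model_for_kbit_training(base)

# LoRA adapter config -- rank/alpha/dropout and target modules are
# assumptions for illustration, chosen so that only a small fraction
# of the weights (~0.3%) ends up trainable
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # e.g. "trainable params: 2,359,296 ..."

# Hyperparameters taken from the table above
training_args = Seq2SeqTrainingArguments(
    output_dir="./kiswahili_sahihi_asr_adapted_1",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,  # effective batch size 8
    learning_rate=1e-5,
    warmup_steps=500,
    num_train_epochs=3,
    fp16=True,
    gradient_checkpointing=True,
)
```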
---

### 🧠 Training Summary

```text
✅ Added <|pad|> token to tokenizer
✅ Resized embeddings to 51866 tokens
✅ PEFT + LoRA adapters merged
✅ Dataset: Train=3,758 | Validation=77
✅ 1,000 noise clips used for augmentation
```

**Final Training Snapshot:**

| Step | Training Loss | Validation Loss | WER (%) | CER (%) |
| ---- | ------------- | --------------- | ------- | ------- |
| 200  | 0.6747        | 0.6675          | 16.23   | 4.82    |
| 400  | 0.5860        | 0.5616          | 15.56   | 4.74    |
| 600  | 0.5368        | 0.4852          | 12.09   | 4.11    |
| 800  | 0.4646        | 0.4447          | 12.58   | 4.23    |
| 1000 | 0.4267        | 0.4154          | 12.42   | 4.23    |
| 1200 | 0.4279        | 0.3949          | 11.42   | 4.03    |
| 1400 | 0.3965        | 0.3902          | 11.59   | 4.11    |

---

### 🧩 Model Usage

```python
# ============================================
# 🪄 Full Swahili ASR Transcription with Adapted Model
# ============================================

# 📦 Installation (Colab/Jupyter)
!pip install -q transformers "datasets<4.0.0"
!pip install -q torchaudio==2.6.0 torchvision==0.21.0 jiwer evaluate
!pip install -q soundfile librosa "accelerate>=0.26.0" tensorboard bitsandbytes
!pip install -q pydub imageio-ffmpeg
!apt-get -y install ffmpeg

# ============================================
# 1️⃣ Imports
# ============================================
import os

import librosa
import numpy as np
import torch
import imageio_ffmpeg as ffmpeg_lib
from pydub import AudioSegment
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from peft import PeftModel, PeftConfig

# ============================================
# 2️⃣ Register ffmpeg for pydub (no sudo)
# ============================================
ffmpeg_path = ffmpeg_lib.get_ffmpeg_exe()
AudioSegment.converter = ffmpeg_path
AudioSegment.ffmpeg = ffmpeg_path
AudioSegment.ffprobe = ffmpeg_path
print("✅ ffmpeg and ffprobe linked successfully!")

# ============================================
# 3️⃣ Load Adapted Model (with vocab fix)
# ============================================
base_model_id = "keystats/kiswahili_sahihi_asr"
adapter_model_path = "keystats/kiswahili_sahihi_asr_adapted_1"

print(f"🔹 Loading processor from: {adapter_model_path}")
processor = WhisperProcessor.from_pretrained(adapter_model_path)
vocab_size = len(processor.tokenizer)
print(f"🔹 Tokenizer vocab size: {vocab_size}")

# Load PEFT config
peft_config = PeftConfig.from_pretrained(adapter_model_path)
print(f"🔹 PEFT base model: {peft_config.base_model_name_or_path}")

# Load base Whisper model
base_model = WhisperForConditionalGeneration.from_pretrained(
    peft_config.base_model_name_or_path,
    ignore_mismatched_sizes=True,
)

# Fix vocab size mismatch (the adapter adds a <|pad|> token)
base_model.resize_token_embeddings(vocab_size)
print(f"✅ Resized token embeddings to match vocab size ({vocab_size})")

# Load and merge adapter
model = PeftModel.from_pretrained(base_model, adapter_model_path)
print("✅ Adapter loaded successfully")

model = model.merge_and_unload()
print("✅ Adapter merged and unloaded")

# Move to device
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device, dtype=torch.float32)
model.eval()
print(f"🚀 Model ready on {device.upper()}")

# ============================================
# 4️⃣ Convert Any Format to WAV
# ============================================
def convert_to_wav(input_path, output_path="converted.wav"):
    """Convert MP3, M4A, or any audio file to mono 16kHz WAV."""
    try:
        audio = AudioSegment.from_file(input_path)
        audio = audio.set_frame_rate(16000).set_channels(1)
        audio.export(output_path, format="wav")
        return output_path
    except Exception as e:
        raise RuntimeError(f"❌ Could not convert file. Error: {e}")

# 🎧 Replace this with your Swahili audio file
audio_path = "Your audio here"
wav_path = convert_to_wav(audio_path)
print(f"✅ Converted to: {wav_path}")

# ============================================
# 5️⃣ Load Audio and Chunk
# ============================================
audio_input, sr = librosa.load(wav_path, sr=16000, mono=True)

# Whisper's feature extractor operates on 30-second windows; longer
# chunks would be silently truncated during feature extraction
chunk_length_s = 30  # seconds
chunk_size = chunk_length_s * sr
num_chunks = int(np.ceil(len(audio_input) / chunk_size))
print(f"🔹 Total length: {len(audio_input)/sr:.2f}s | Splitting into {num_chunks} chunks...")

# ============================================
# 6️⃣ Transcribe (without forced_decoder_ids)
# ============================================
full_transcription = []

for i in range(num_chunks):
    start = i * chunk_size
    end = min((i + 1) * chunk_size, len(audio_input))
    chunk = audio_input[start:end]

    inputs = processor(
        chunk,
        sampling_rate=16000,
        return_tensors="pt",
        return_attention_mask=True,
    ).to(device, dtype=torch.float32)

    with torch.no_grad():
        generated_ids = model.generate(
            **inputs,
            language="sw",      # pin Swahili instead of relying on language detection
            task="transcribe",
            max_new_tokens=256,
            num_beams=2,
            repetition_penalty=1.1,
        )

    text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    full_transcription.append(text.strip())
    print(f"🟢 Chunk {i+1}/{num_chunks} done")

# ============================================
# 7️⃣ Combine Final Transcript
# ============================================
final_text = " ".join(full_transcription)
print("\n📝 Final Transcription:\n")
print(final_text)
```
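The checkpoint table above reports WER and CER. If you want to score the transcript produced by this script against your own ground truth, here is a minimal sketch using the `jiwer` library installed above; the `reference` string is a hypothetical placeholder.

```python
import jiwer

# Hypothetical ground-truth transcript for the audio transcribed above
reference = "habari ya asubuhi, karibu kwenye kipindi chetu"
hypothesis = final_text.strip()

# WER: word-level substitutions, insertions, and deletions,
# divided by the number of reference words
wer = jiwer.wer(reference, hypothesis)

# CER: the same edit distance computed over characters
cer = jiwer.cer(reference, hypothesis)

print(f"WER: {wer:.2%} | CER: {cer:.2%}")
```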
Error: {e}") # ๐ŸŽง Replace this with your Swahili audio file audio_path = "Your audio here" wav_path = convert_to_wav(audio_path) print(f"โœ… Converted to: {wav_path}") # ============================================ # 5๏ธโƒฃ Load Audio and Chunk # ============================================ audio_input, sr = librosa.load(wav_path, sr=16000, mono=True) chunk_length_s = 60 # seconds chunk_size = chunk_length_s * sr num_chunks = int(np.ceil(len(audio_input) / chunk_size)) print(f"๐Ÿ”น Total length: {len(audio_input)/sr:.2f}s | Splitting into {num_chunks} chunks...") # ============================================ # 6๏ธโƒฃ Transcribe (without forced_decoder_ids) # ============================================ full_transcription = [] # Add language token manually for safety lang_token = processor.tokenizer.convert_tokens_to_ids("<|swahili|>") task_token = processor.tokenizer.convert_tokens_to_ids("<|transcribe|>") start_tokens = torch.tensor([[lang_token, task_token]], device=device) for i in range(num_chunks): start = i * chunk_size end = min((i + 1) * chunk_size, len(audio_input)) chunk = audio_input[start:end] inputs = processor( chunk, sampling_rate=16000, return_tensors="pt", return_attention_mask=True, ).to(device, dtype=torch.float32) with torch.no_grad(): generated_ids = model.generate( **inputs, max_new_tokens=256, num_beams=2, repetition_penalty=1.1, ) text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0] full_transcription.append(text.strip()) print(f"๐ŸŸข Chunk {i+1}/{num_chunks} done") # ============================================ # 7๏ธโƒฃ Combine Final Transcript # ============================================ final_text = " ".join(full_transcription) print("\n๐Ÿ“ Final Transcription:\n") print(final_text) ``` --- ### ๐Ÿ’ก Key Features * ๐Ÿง  **Adapter-based fine-tuning (LoRA)** โ€” trains <1% of parameters * ๐Ÿชถ **Lightweight & efficient** โ€” runs on edge or T4 GPUs * ๐Ÿ”Š **Noise-robust** โ€” trained with real-world East African noise * ๐Ÿ•Š๏ธ **Privacy-preserving** โ€” suitable for offline/edge deployment * ๐ŸŒ **Focused on accessibility** โ€” targets Swahili and low-resource regions --- ### ๐Ÿงพ Acknowledgments This model builds upon the architecture and open-source contributions of **[Salifou Abdourahamane](https://github.com/SalifouAbdourahamane)** โ€” creator of the excellent [swahili_asr_sota_model](https://github.com/SalifouAbdourahamane/swahili_asr_sota_model). --- ### ๐Ÿ“š Citation If you use this model, please credit: **Jackson Kahungu**, *Kiswahili Sahihi ASR Adapted 1* Hugging Face: [keystats/kiswahili_sahihi_asr_adapted_1](https://huggingface.co/keystats/kiswahili_sahihi_asr_adapted_1) ```