---
base_model: keystats/kiswahili_sahihi_asr
library_name: peft
tags:
- base_model:adapter:keystats/kiswahili_sahihi_asr
- lora
- transformers
license: apache-2.0
datasets:
- Sunbird/salt
- Sunbird/urban-noise-uganda-61k
language:
- sw
metrics:
- wer
- cer
pipeline_tag: automatic-speech-recognition
new_version: keystats/kiswahili_sahihi_asr_adapted_2
---

## 🗣️ Kiswahili Sahihi ASR Adapted 1 (Adapter-Fused Whisper Model)

### Overview

**`keystats/kiswahili_sahihi_asr_adapted_1`** is a refined **Swahili automatic speech recognition (ASR)** model optimized for **on-device use** and **low-resource settings**. It extends the [Whisper Medium](https://huggingface.co/openai/whisper-medium) architecture through **parameter-efficient fine-tuning (PEFT)** with **LoRA adapters**, achieving high transcription accuracy while keeping the model lightweight and deployable.

This model builds on [Kiswahili Sahihi ASR](https://huggingface.co/keystats/kiswahili_sahihi_asr) and was developed for the [Your Voice, Your Device, Your Language Challenge](https://zindi.africa/competitions/your-voice-your-device-your-language-challenge), advancing **accessible speech technology for over 200 million Swahili speakers**.

---

### 🧩 Base Model

* **Base architecture:** [`keystats/kiswahili_sahihi_asr`](https://huggingface.co/keystats/kiswahili_sahihi_asr)
* **Training framework:** [PEFT (LoRA)](https://github.com/huggingface/peft)
* **Tokenizer vocab size:** `51866` tokens
* **Trainable parameters:** `2.36M` (≈0.31% of the 766M total)

---

### ⚙️ Training Configuration

| Parameter | Value |
| ------------------------ | ------------------------------------------------------------------------------------------------ |
| **Dataset** | [Sunbird/salt](https://huggingface.co/datasets/Sunbird/salt) (studio-swa) |
| **Noise Augmentation** | [Sunbird/urban-noise-uganda-61k](https://huggingface.co/datasets/Sunbird/urban-noise-uganda-61k) |
| **Effective Batch Size** | 8 (`per_device_train_batch_size=4`, `gradient_accumulation_steps=2`) |
| **Learning Rate** | `1e-5` |
| **Warmup Steps** | `500` |
| **Epochs** | 3 |
| **Precision** | Mixed precision (`fp16`) |
| **Quantization** | 8-bit via `bitsandbytes` |
| **Memory Optimization** | `gradient_checkpointing=True` |
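For orientation, here is a minimal sketch of how these hyperparameters map onto a PEFT + Transformers training setup. The LoRA rank, alpha, dropout, and target modules are illustrative assumptions (the card only states that ≈2.36M parameters were trainable); the `Seq2SeqTrainingArguments` values come from the table above.

```python
from transformers import (
    BitsAndBytesConfig,
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 8-bit base model via bitsandbytes, as listed under "Quantization"
base = WhisperForConditionalGeneration.from_pretrained(
    "keystats/kiswahili_sahihi_asr",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)
base = prepare_model_for_kbit_training(base)

# LoRA adapter config -- rank/alpha/dropout and target modules are
# assumptions for illustration, chosen so that only a small fraction
# of the weights (~0.3%) ends up trainable
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # e.g. "trainable params: 2,359,296 ..."

# Hyperparameters taken from the table above
training_args = Seq2SeqTrainingArguments(
    output_dir="./kiswahili_sahihi_asr_adapted_1",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,  # effective batch size 8
    learning_rate=1e-5,
    warmup_steps=500,
    num_train_epochs=3,
    fp16=True,
    gradient_checkpointing=True,
)
```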
---

### 🧠 Training Summary

```text
✅ Added <|pad|> token to tokenizer
✅ Resized embeddings to 51866 tokens
✅ PEFT + LoRA adapters merged
✅ Dataset: Train=3,758 | Validation=77
✅ 1,000 noise clips used for augmentation
```

**Final Training Snapshot:**

| Step | Training Loss | Validation Loss | WER (%) | CER (%) |
| ---- | ------------- | --------------- | ------- | ------- |
| 200  | 0.6747        | 0.6675          | 16.23   | 4.82    |
| 400  | 0.5860        | 0.5616          | 15.56   | 4.74    |
| 600  | 0.5368        | 0.4852          | 12.09   | 4.11    |
| 800  | 0.4646        | 0.4447          | 12.58   | 4.23    |
| 1000 | 0.4267        | 0.4154          | 12.42   | 4.23    |
| 1200 | 0.4279        | 0.3949          | 11.42   | 4.03    |
| 1400 | 0.3965        | 0.3902          | 11.59   | 4.11    |

---

### 🧩 Model Usage

```python
# ============================================
# 🪄 Full Swahili ASR Transcription with Adapted Model
# ============================================

# 📦 Installation (Colab/Jupyter)
!pip install -q transformers "datasets<4.0.0"
!pip install -q torchaudio==2.6.0 torchvision==0.21.0 jiwer evaluate
!pip install -q soundfile librosa "accelerate>=0.26.0" tensorboard bitsandbytes
!pip install -q pydub imageio-ffmpeg
!apt-get -y install ffmpeg

# ============================================
# 1️⃣ Imports
# ============================================
import os

import librosa
import numpy as np
import torch
import imageio_ffmpeg as ffmpeg_lib
from pydub import AudioSegment
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from peft import PeftModel, PeftConfig

# ============================================
# 2️⃣ Register ffmpeg for pydub (no sudo)
# ============================================
ffmpeg_path = ffmpeg_lib.get_ffmpeg_exe()
AudioSegment.converter = ffmpeg_path
AudioSegment.ffmpeg = ffmpeg_path
AudioSegment.ffprobe = ffmpeg_path
print("✅ ffmpeg and ffprobe linked successfully!")

# ============================================
# 3️⃣ Load Adapted Model (with vocab fix)
# ============================================
base_model_id = "keystats/kiswahili_sahihi_asr"
adapter_model_path = "keystats/kiswahili_sahihi_asr_adapted_1"

print(f"🔹 Loading processor from: {adapter_model_path}")
processor = WhisperProcessor.from_pretrained(adapter_model_path)
vocab_size = len(processor.tokenizer)
print(f"🔹 Tokenizer vocab size: {vocab_size}")

# Load PEFT config
peft_config = PeftConfig.from_pretrained(adapter_model_path)
print(f"🔹 PEFT base model: {peft_config.base_model_name_or_path}")

# Load base Whisper model
base_model = WhisperForConditionalGeneration.from_pretrained(
    peft_config.base_model_name_or_path,
    ignore_mismatched_sizes=True,
)

# Fix vocab size mismatch (the adapter adds a <|pad|> token)
base_model.resize_token_embeddings(vocab_size)
print(f"✅ Resized token embeddings to match vocab size ({vocab_size})")

# Load and merge adapter
model = PeftModel.from_pretrained(base_model, adapter_model_path)
print("✅ Adapter loaded successfully")

model = model.merge_and_unload()
print("✅ Adapter merged and unloaded")

# Move to device
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device, dtype=torch.float32)
model.eval()
print(f"🚀 Model ready on {device.upper()}")

# ============================================
# 4️⃣ Convert Any Format to WAV
# ============================================
def convert_to_wav(input_path, output_path="converted.wav"):
    """Convert MP3, M4A, or any audio file to mono 16kHz WAV."""
    try:
        audio = AudioSegment.from_file(input_path)
        audio = audio.set_frame_rate(16000).set_channels(1)
        audio.export(output_path, format="wav")
        return output_path
    except Exception as e:
        raise RuntimeError(f"❌ Could not convert file. Error: {e}")

# 🎧 Replace this with your Swahili audio file
audio_path = "Your audio here"
wav_path = convert_to_wav(audio_path)
print(f"✅ Converted to: {wav_path}")

# ============================================
# 5️⃣ Load Audio and Chunk
# ============================================
audio_input, sr = librosa.load(wav_path, sr=16000, mono=True)

# Whisper's feature extractor operates on 30-second windows; longer
# chunks would be silently truncated during feature extraction
chunk_length_s = 30  # seconds
chunk_size = chunk_length_s * sr
num_chunks = int(np.ceil(len(audio_input) / chunk_size))
print(f"🔹 Total length: {len(audio_input)/sr:.2f}s | Splitting into {num_chunks} chunks...")

# ============================================
# 6️⃣ Transcribe (without forced_decoder_ids)
# ============================================
full_transcription = []

for i in range(num_chunks):
    start = i * chunk_size
    end = min((i + 1) * chunk_size, len(audio_input))
    chunk = audio_input[start:end]

    inputs = processor(
        chunk,
        sampling_rate=16000,
        return_tensors="pt",
        return_attention_mask=True,
    ).to(device, dtype=torch.float32)

    with torch.no_grad():
        generated_ids = model.generate(
            **inputs,
            language="sw",      # pin Swahili instead of relying on language detection
            task="transcribe",
            max_new_tokens=256,
            num_beams=2,
            repetition_penalty=1.1,
        )

    text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    full_transcription.append(text.strip())
    print(f"🟢 Chunk {i+1}/{num_chunks} done")

# ============================================
# 7️⃣ Combine Final Transcript
# ============================================
final_text = " ".join(full_transcription)
print("\n📝 Final Transcription:\n")
print(final_text)
```
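The checkpoint table above reports WER and CER. If you want to score the transcript produced by this script against your own ground truth, here is a minimal sketch using the `jiwer` library installed above; the `reference` string is a hypothetical placeholder.

```python
import jiwer

# Hypothetical ground-truth transcript for the audio transcribed above
reference = "habari ya asubuhi, karibu kwenye kipindi chetu"
hypothesis = final_text.strip()

# WER: word-level substitutions, insertions, and deletions,
# divided by the number of reference words
wer = jiwer.wer(reference, hypothesis)

# CER: the same edit distance computed over characters
cer = jiwer.cer(reference, hypothesis)

print(f"WER: {wer:.2%} | CER: {cer:.2%}")
```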
Error: {e}") # ๐ŸŽง Replace this with your Swahili audio file audio_path = "Your audio here" wav_path = convert_to_wav(audio_path) print(f"โœ… Converted to: {wav_path}") # ============================================ # 5๏ธโƒฃ Load Audio and Chunk # ============================================ audio_input, sr = librosa.load(wav_path, sr=16000, mono=True) chunk_length_s = 60 # seconds chunk_size = chunk_length_s * sr num_chunks = int(np.ceil(len(audio_input) / chunk_size)) print(f"๐Ÿ”น Total length: {len(audio_input)/sr:.2f}s | Splitting into {num_chunks} chunks...") # ============================================ # 6๏ธโƒฃ Transcribe (without forced_decoder_ids) # ============================================ full_transcription = [] # Add language token manually for safety lang_token = processor.tokenizer.convert_tokens_to_ids("<|swahili|>") task_token = processor.tokenizer.convert_tokens_to_ids("<|transcribe|>") start_tokens = torch.tensor([[lang_token, task_token]], device=device) for i in range(num_chunks): start = i * chunk_size end = min((i + 1) * chunk_size, len(audio_input)) chunk = audio_input[start:end] inputs = processor( chunk, sampling_rate=16000, return_tensors="pt", return_attention_mask=True, ).to(device, dtype=torch.float32) with torch.no_grad(): generated_ids = model.generate( **inputs, max_new_tokens=256, num_beams=2, repetition_penalty=1.1, ) text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0] full_transcription.append(text.strip()) print(f"๐ŸŸข Chunk {i+1}/{num_chunks} done") # ============================================ # 7๏ธโƒฃ Combine Final Transcript # ============================================ final_text = " ".join(full_transcription) print("\n๐Ÿ“ Final Transcription:\n") print(final_text) ``` --- ### ๐Ÿ’ก Key Features * ๐Ÿง  **Adapter-based fine-tuning (LoRA)** โ€” trains <1% of parameters * ๐Ÿชถ **Lightweight & efficient** โ€” runs on edge or T4 GPUs * ๐Ÿ”Š **Noise-robust** โ€” trained with real-world East African noise * ๐Ÿ•Š๏ธ **Privacy-preserving** โ€” suitable for offline/edge deployment * ๐ŸŒ **Focused on accessibility** โ€” targets Swahili and low-resource regions --- ### ๐Ÿงพ Acknowledgments This model builds upon the architecture and open-source contributions of **[Salifou Abdourahamane](https://github.com/SalifouAbdourahamane)** โ€” creator of the excellent [swahili_asr_sota_model](https://github.com/SalifouAbdourahamane/swahili_asr_sota_model). --- ### ๐Ÿ“š Citation If you use this model, please credit: **Jackson Kahungu**, *Kiswahili Sahihi ASR Adapted 1* Hugging Face: [keystats/kiswahili_sahihi_asr_adapted_1](https://huggingface.co/keystats/kiswahili_sahihi_asr_adapted_1) ```