---
base_model: unsloth/gemma-3n-E4B-it
tags:
- text-generation-inference
- transformers
- unsloth
- gemma3n
- egyptian
- codeswitch
- arabic
- masry
license: apache-2.0
language:
- en
- ar
datasets:
- MohamedRashad/arabic-english-code-switching
pipeline_tag: automatic-speech-recognition
---

# 🇪🇬🎙 MasriSwitch-Gemma3n-Transcriber-v1  

**MasriSwitch-Gemma3n-Transcriber** is an automatic speech transcription model specialized for **Egyptian Arabic** with strong **English code-switching** capabilities.

It is **one of the few publicly available models** explicitly optimized for:
- Egyptian Arabic dialect transcription  
- Natural Arabic ↔ English code-switching  
- Short and medium-length real-world audio

The model was fine-tuned on:
- **MohamedRashad/arabic-english-code-switching** dataset  
- **A private Egyptian speech dataset** containing real conversational audio, voice notes, and mixed Arabic/English speech recordings  


---

## 🔍 Overview

**MasriSwitch-Gemma3n-Transcriber** is built on the Gemma3n conditional generation architecture and fine-tuned to understand natural Egyptian speech patterns, including mixed Arabic/English utterances commonly used in daily life, workplaces, and online content.

It is suitable for:
- Social media content transcription  
- Customer support calls  
- Meetings, voice notes, and interviews  
- Research in dialectal ASR  
- Multilingual speech processing  

---

## ✨ Features

- 🗣 Egyptian Arabic dialect-aware transcription  
- 🔀 Accurate English code-switching support  
- 🎧 Strong performance on informal, real-world speech  
- ⚡ Optimized for short (10–30s) audio segments  
- 🤖 Built using the Gemma3n generation-based ASR pipeline  

---

## 🎯 Intended Use

Use this model for:
- Speech-to-text systems  
- Captioning and subtitling  
- Chat or voice assistant pipelines  
- Indexing/searching Arabic audio content  
- Research and experimentation  

---

## ⚠️ Limitations

- Best results with clean audio and single speakers  
- Not optimized for Gulf, Levantine, or MSA-only speech  
- Struggles with:
  - Heavy noise  
  - Overlapping speakers  
  - Fast speech  
- Long recordings should be segmented (20–30s recommended)  
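
The segmentation recommendation above can be sketched as follows. This is a minimal illustration, not part of the model's own tooling: `split_into_segments` is a hypothetical helper name, and it assumes the audio has already been decoded into a 1-D sequence of samples (for example a NumPy array) with a known sample rate. It cuts at fixed intervals, so a word may be split at a boundary; splitting on silence would likely give better results.

```python
def split_into_segments(samples, sample_rate, max_seconds=30.0):
    """Split a 1-D waveform into consecutive chunks of at most `max_seconds`.

    `samples` can be any sliceable sequence (list, NumPy array, ...).
    The last chunk may be shorter than the others.
    """
    if sample_rate <= 0 or max_seconds <= 0:
        raise ValueError("sample_rate and max_seconds must be positive")

    # Number of samples that fit into one segment.
    chunk_len = int(sample_rate * max_seconds)

    # Walk through the waveform in steps of chunk_len and slice out each piece.
    return [
        samples[start:start + chunk_len]
        for start in range(0, len(samples), chunk_len)
    ]


if __name__ == "__main__":
    # Hypothetical input: 70 seconds of silence at a toy rate of 100 Hz.
    rate = 100
    waveform = [0.0] * (70 * rate)
    segments = split_into_segments(waveform, rate, max_seconds=30.0)
    print([len(s) / rate for s in segments])
```

Each chunk can then be written to its own audio file and passed to `transcribe_file` from the inference example, with the resulting texts joined in order.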

---

## 🛡 Safety & Privacy

- Transcriptions may include sensitive user data; handle them with care.  
- The model should not be used for high-stakes decisions without human review.  
- Biases in training data may affect accuracy.  

---

## 🧪 Inference Example (Python)

```python
import torch
from transformers import AutoProcessor, Gemma3nForConditionalGeneration

MODEL_ID = "oddadmix/egyptian-code-switching-b4-g2-merged"

def load_model_and_processor(model_id=MODEL_ID, device=None):
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"

    print(f"Loading model {model_id} to device {device}...")
    
    model = Gemma3nForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16 if device == "cuda" else None,
        device_map="auto" if device == "cuda" else None,
    ).eval()

    if not any(p.device.type == "cuda" for p in model.parameters()) and device == "cuda":
        model.to("cuda")

    processor = AutoProcessor.from_pretrained(model_id)
    return model, processor, device


def transcribe_file(model, processor, audio_path, max_new_tokens=128):
    if not audio_path:
        raise ValueError("audio_path must point to an audio file")

    messages = [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are an assistant that transcribes speech accurately."}
            ],
        },
        {
            "role": "user",
            "content": [
                {"type": "audio", "url": audio_path},
                {"type": "text", "text": "Please transcribe this audio."}
            ],
        },
    ]

    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    )

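    # Move inputs to the model's device. BatchFeature.to casts only the
    # floating-point tensors (e.g. audio features) to the model dtype, so they
    # match bfloat16 weights on GPU while token IDs stay integers.
    inputs = inputs.to(model.device, dtype=model.dtype)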
    input_len = inputs["input_ids"].shape[-1]

    with torch.inference_mode():
        generated = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
        )

    gen_tokens = generated[0][input_len:]
    text = processor.decode(gen_tokens, skip_special_tokens=True)
    return text


if __name__ == "__main__":
    audio_path = "path/to/audio.wav"
    model, processor, device = load_model_and_processor()
    transcription = transcribe_file(model, processor, audio_path, max_new_tokens=256)
    print("Transcription:", transcription)
```