Urdu Turn Detection (Audio-Only V2)

This model is a high-speed, audio-native turn detection system designed for real-time Urdu voice applications. It uses a Whisper-Tiny encoder combined with an attention pooling mechanism to detect whether a speaker has finished their turn or is merely pausing.

Key Features

  • Low Latency: Optimized for real-time inference (~45 ms on GPU; see Performance below).
  • Audio-Only: No ASR or text pipeline required, making it faster and more privacy-friendly.
  • Attention-Based: Uses cross-frame attention to focus on prosodic cues like intonation and sentence-ending phonemes.
  • Robustness: Trained specifically to handle silence and "thinking" pauses without false positives.
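On the application side, robustness to "thinking" pauses is often reinforced by debouncing the model's frame-level decisions: only commit an end-of-turn after several consecutive positive predictions. The sketch below is illustrative and not part of the model; the `patience` value is an assumption you would tune for your pipeline.

```python
def debounce_end_of_turn(labels, patience=3):
    # Commit an end-of-turn only after `patience` consecutive positive
    # predictions, so a brief pause does not cut the speaker off.
    streak = 0
    for i, is_end in enumerate(labels):
        streak = streak + 1 if is_end else 0
        if streak >= patience:
            return i  # index at which the turn is confirmed finished
    return None  # no confirmed end-of-turn in this window
```

Here `labels` would be the per-chunk boolean predictions produced by the detector over a rolling audio buffer.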

Architecture

  • Backbone: openai/whisper-tiny (Encoder only).
  • Pooling: Masked Attention Pooling (ignores padding/silence).
  • Classifier: 2-layer MLP head.
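The masked attention pooling step can be illustrated with a small NumPy sketch: attention scores are computed per encoder frame, padded/silent frames are masked out before the softmax, and the remaining weights produce a single utterance embedding for the classifier head. The scoring vector `w` here stands in for the learned attention parameters; this is a sketch of the mechanism, not the model's actual weights or code.

```python
import numpy as np

def masked_attention_pool(frames, mask, w):
    # frames: (T, D) encoder outputs; mask: (T,) 1 for real frames, 0 for padding
    # w: (D,) scoring vector standing in for the learned attention parameters
    scores = frames @ w                                    # (T,) raw attention scores
    scores = np.where(mask.astype(bool), scores, -np.inf)  # drop padded/silent frames
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                               # softmax over valid frames only
    return weights @ frames                                # (D,) pooled utterance embedding
```

Because masked frames receive zero weight, padding never leaks into the pooled embedding, which is what lets the model handle variable-length, silence-padded inputs.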

Performance

| Metric                    | Value             |
|---------------------------|-------------------|
| Inference Latency (CPU)   | ~120 ms           |
| Inference Latency (CUDA)  | ~45 ms            |
| F1 Score (Turn Detection) | 95%+ (estimated)  |

Usage

πŸš€ High-Level API (Recommended)

The model can be used directly via the urdu-turn-detector library:

```shell
pip install urdu-turn-detector
```

```python
from urdu_turn_detection import UrduTurnDetector

# Auto-downloads from the Hub
detector = UrduTurnDetector.from_pretrained("PuristanLabs1/urdu-turn-v2")

# Predict on a file path or audio buffer
result = detector.predict("audio.wav")
print(f"Turn is {result.label} (Conf: {result.confidence})")
```

☁️ Hugging Face Inference API

This model is also compatible with Hugging Face Inference Endpoints via the included `handler.py`.
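For reference, Inference Endpoints expect a custom `handler.py` to expose a class named `EndpointHandler` with `__init__(path)` and `__call__(data)`. The skeleton below shows that interface; the `UrduTurnDetector` calls in the comments are assumptions carried over from the usage example above, not the repository's actual handler.

```python
# Skeleton of the interface Hugging Face Inference Endpoints expect from
# a custom handler.py. Model loading is stubbed so the sketch runs standalone.

class EndpointHandler:
    def __init__(self, path: str = ""):
        # The real handler would load weights here, e.g. (assumed API):
        # self.detector = UrduTurnDetector.from_pretrained(path)
        self.detector = None

    def __call__(self, data: dict) -> list:
        audio = data["inputs"]  # Endpoints deliver the request payload here
        if self.detector is None:
            # Placeholder response so the skeleton is self-contained
            return [{"label": "end_of_turn", "score": 0.0}]
        result = self.detector.predict(audio)
        return [{"label": result.label, "score": result.confidence}]
```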

Dataset

Trained on a combination of Common Voice 13 (Urdu) and synthetically augmented samples simulating natural turn transitions and interruptions.
