Hinglish Parakeet FastConformer RNNT (110M)

Model Details

  • Architecture: FastConformer RNNT
  • Parameters: ~110M
  • Streaming: ✅ Yes
  • Language: Hinglish (Hindi-English code-mixed)
  • Framework: NVIDIA NeMo
  • Training: Stage-1 RNNT fine-tuning

Results

  • LibriSpeech: 0.05 WER
  • Common Voice: 0.06 WER
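
For context, the sketch below shows one way WER numbers like these can be checked with the jiwer library. This is not the evaluation script used for this card: the manifest name and field names are assumptions, and the checkpoint shown is the base NVIDIA model (swap in this model's .nemo file to evaluate it).

import json
import jiwer
import nemo.collections.asr as nemo_asr

# Illustrative WER check (placeholder manifest name and fields).
asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(
    model_name="nvidia/stt_en_fastconformer_hybrid_large_streaming_multi"
)

audio_files, references = [], []
with open("test_manifest.json") as f:  # NeMo-style JSON-lines manifest (assumed name)
    for line in f:
        entry = json.loads(line)
        audio_files.append(entry["audio_filepath"])
        references.append(entry["text"])

hypotheses = [h.text for h in asr_model.transcribe(audio_files)]
print("WER:", jiwer.wer(references, hypotheses))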

Training

The model was trained on approximately 1,100 hours of data for around 20 epochs.

Training notebook: https://www.kaggle.com/code/nijajohww/script-stage-1-53ebaf-19bb76-819eda

Training Infrastructure: The model was trained entirely on a Kaggle-provided NVIDIA Tesla P100 GPU. Due to compute constraints, the full training process took approximately 200 hours to complete, covering all epochs of Stage-1 RNNT fine-tuning.
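
The exact training setup is in the notebook linked above; the following is only a rough sketch of how Stage-1 RNNT fine-tuning is typically wired up in NeMo with PyTorch Lightning. Manifest paths, batch size, and precision are placeholders, and for Hinglish the tokenizer would normally be swapped as well — see the notebook for the real configuration.

# Rough fine-tuning sketch (NOT the exact training configuration;
# paths and hyperparameters below are placeholders).
import pytorch_lightning as pl
from omegaconf import OmegaConf
import nemo.collections.asr as nemo_asr

# Start from the pre-trained streaming FastConformer checkpoint.
asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(
    model_name="nvidia/stt_en_fastconformer_hybrid_large_streaming_multi"
)

# Point the model at NeMo-style JSON manifests (hypothetical file names).
asr_model.setup_training_data(OmegaConf.create({
    "manifest_filepath": "train_manifest.json",
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": True,
}))
asr_model.setup_validation_data(OmegaConf.create({
    "manifest_filepath": "dev_manifest.json",
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": False,
}))

# Roughly 20 epochs on a single GPU, as described above.
trainer = pl.Trainer(accelerator="gpu", devices=1, max_epochs=20, precision=16)
asr_model.set_trainer(trainer)
trainer.fit(asr_model)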

How to Use this Model

The model is available for use in the NeMo toolkit [3], and can be used as a pre-trained checkpoint for streaming or for fine-tuning on another dataset. You will need to install NVIDIA NeMo; we recommend installing it after you have installed the latest PyTorch version.

pip install nemo_toolkit['all']
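
A quick way to confirm the installation (optional; this simply checks that PyTorch and NeMo import correctly):

# Sanity check that PyTorch and NeMo are importable.
import torch
import nemo
print("PyTorch:", torch.__version__)
print("NeMo:", nemo.__version__)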

Transcribing using Python

Cache-aware models are designed so that the model's predictions are the same in both offline and streaming mode.

So you may use the regular transcribe function to get the transcriptions. First, let's get a sample:

wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav

Then simply do:

import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(model_name="nvidia/stt_en_fastconformer_hybrid_large_streaming_multi")

# Optional: change the default latency. Default latency is 1040ms. Supported latencies: {0: 0ms, 1: 80ms, 16: 480ms, 33: 1040ms}.
# Note: These are worst-case latencies; the average latency is roughly half of these values.
asr_model.encoder.set_default_att_context_size([70,13]) 

# Optional: change the default decoder. Default decoder is Transducer (RNNT). Supported decoders: {ctc, rnnt}.
asr_model.change_decoding_strategy(decoder_type='rnnt')

output = asr_model.transcribe(['2086-149220-0033.wav'])
print(output[0].text)
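
The snippet above loads the base NVIDIA checkpoint. To run this fine-tuned Hinglish checkpoint instead, the same API can restore a locally downloaded .nemo file; the path below is a placeholder, and nemo_asr.models.ASRModel.restore_from can be used if you prefer not to name the exact model class.

import nemo.collections.asr as nemo_asr

# Placeholder path to the downloaded fine-tuned checkpoint.
asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.restore_from(
    restore_path="hinglish_parakeet_fastconformer_rnnt.nemo"
)
output = asr_model.transcribe(['2086-149220-0033.wav'])
print(output[0].text)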