---
language:
- de
license: cc-by-4.0
tags:
- audio
- automatic-speech-recognition
- speech
- medical
- german
- parakeet
- nemo
base_model: nvidia/parakeet-tdt-0.6b-v3
pipeline_tag: automatic-speech-recognition
datasets:
- custom
metrics:
- wer
model-index:
- name: Parakeet-DE-Med
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: German Medical Documentation
      type: custom
    metrics:
    - type: wer
      value: 3.28
      name: Word Error Rate
---

# Parakeet-DE-Med: German Medical ASR

Fine-tuned NVIDIA Parakeet-TDT-0.6B for German medical documentation transcription.

This model is a fine-tuned derivative of [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3), which is licensed under CC-BY-4.0.

## Model Description

This model is a parameter-efficient fine-tuned (PEFT) version of NVIDIA's Parakeet-TDT-0.6B, specialized for German medical documentation (Arztbriefe, i.e. physicians' letters). It uses a decoder+joint training strategy: the encoder stays frozen and only 2.89% of the model parameters are trained, which lowers WER on the medical test set from 11.73% to 3.28%.

- **Base Model:** nvidia/parakeet-tdt-0.6b-v3
- **Language:** German (de-DE)
- **Domain:** Medical documentation
- **Training Method:** PEFT (decoder+joint strategy)
- **Parameters Trained:** 18.1M / 627M (2.89%)

## Performance

Evaluated on a German medical documentation test set (122 samples):

| Model | WER |
|-------|-----|
| Base Parakeet-TDT-0.6B | 11.73% |
| **Parakeet-DE-Med** | **3.28%** |

**Improvement:** 72% relative WER reduction (from 11.73% to 3.28%)

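For readers who want to check the figures, here is a minimal sketch of how a word error rate and the relative reduction can be computed. It assumes plain whitespace tokenization with no text normalization; the scoring actually used for this evaluation is not documented here and may differ. The function names and the sentence pair are illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


# Illustrative pair (not from the test set): one substitution in four reference words
print(word_error_rate("der patient hat fieber", "der patient hatte fieber"))  # 0.25
# WER values from the table above
print(round(relative_reduction(11.73, 3.28)))  # 72
```

The reduction is relative to the base model's WER; the absolute difference is 8.45 percentage points.
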
## Training Details

- **Training Data:** 976 German medical documentation samples
- **Training Epochs:** 5
- **Training Strategy:** Freeze encoder, train decoder and joint network only
- **Precision:** BF16 mixed precision
- **Batch Size:** 4 (effective batch size 16 with gradient accumulation)
- **Learning Rate:** 2e-4

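The numbers above imply a few quantities that are not stated explicitly. The following back-of-the-envelope sketch derives them under two assumptions that are not confirmed by this card: training ran on a single device, and the last partial batch was not dropped. All variable names are illustrative.

```python
import math

micro_batch_size = 4        # per-step batch size from the list above
effective_batch_size = 16   # batch size after gradient accumulation
num_samples = 976
num_epochs = 5

# Assumption: a single device, so accumulation alone explains the larger effective batch
accumulation_steps = effective_batch_size // micro_batch_size
micro_batches_per_epoch = math.ceil(num_samples / micro_batch_size)
optimizer_steps_per_epoch = math.ceil(micro_batches_per_epoch / accumulation_steps)
total_optimizer_steps = optimizer_steps_per_epoch * num_epochs

# Share of trainable parameters, from the 18.1M / 627M figure in the model description
trainable_percent = 100 * 18.1e6 / 627e6

print(accumulation_steps, optimizer_steps_per_epoch, total_optimizer_steps)  # 4 61 305
print(f"{trainable_percent:.2f}%")  # 2.89%
```

Under these assumptions the whole fine-tuning run amounts to roughly 305 optimizer updates.
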
## Usage

### Prerequisites

```bash
# Quote the requirement so shells such as zsh do not expand the brackets
pip install "nemo_toolkit[asr]"
# Or for the latest version:
pip install "git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]"
```

### Basic Transcription

```python
import nemo.collections.asr as nemo_asr

# Load the model from Hugging Face
model = nemo_asr.models.ASRModel.from_pretrained("johannhartmann/parakeet_de_med")

# Transcribe a single audio file
transcription = model.transcribe(["path/to/audio.wav"])

# Recent NeMo releases return Hypothesis objects (use .text); older ones return plain strings
result = transcription[0]
print(getattr(result, "text", result))
```

### Batch Transcription

```python
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("johannhartmann/parakeet_de_med")

# Transcribe multiple files
audio_files = [
    "patient_recording_1.wav",
    "patient_recording_2.wav",
    "patient_recording_3.wav",
]

transcriptions = model.transcribe(audio_files, batch_size=4)
for i, result in enumerate(transcriptions, start=1):
    # Hypothesis objects expose .text; plain strings are printed as-is
    print(f"File {i}: {getattr(result, 'text', result)}")
```

### Transcribing In-Memory Audio

This example transcribes audio that is already loaded as a NumPy array. It runs on the complete recording and does not perform incremental streaming decoding.

```python
import nemo.collections.asr as nemo_asr
import soundfile as sf

model = nemo_asr.models.ASRModel.from_pretrained("johannhartmann/parakeet_de_med")

# Load audio as float32; soundfile returns shape (frames, channels) for multi-channel files
audio, sample_rate = sf.read("medical_dictation.wav", dtype="float32")

# Downmix to mono if needed
if audio.ndim > 1:
    audio = audio.mean(axis=1)

# Resample to 16 kHz if needed
if sample_rate != 16000:
    import librosa
    audio = librosa.resample(audio, orig_sr=sample_rate, target_sr=16000)

# Transcribe from a NumPy array (needs a NeMo release whose transcribe() accepts arrays)
transcription = model.transcribe(audio, batch_size=1)
result = transcription[0]
print(getattr(result, "text", result))
```

### Advanced Configuration

```python
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("johannhartmann/parakeet_de_med")

# Configure transcription parameters
transcription = model.transcribe(
    ["recording.wav"],        # passed positionally; the keyword name differs across NeMo versions
    batch_size=1,
    return_hypotheses=True,   # return Hypothesis objects instead of plain text
    num_workers=4,
    channel_selector=0,       # for multi-channel audio: use the first channel
    augmentor=None,
)

# Access detailed results
for hyp in transcription:
    print(f"Text: {hyp.text}")
    print(f"Score: {hyp.score}")  # decoder score of the hypothesis, not a calibrated confidence
```

### Using with GPU

```python
import nemo.collections.asr as nemo_asr
import torch

# Use the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Load the model and move it to the selected device explicitly
model = nemo_asr.models.ASRModel.from_pretrained("johannhartmann/parakeet_de_med")
model = model.to(device)
model.eval()

# Transcribe
transcription = model.transcribe(["audio.wav"])
result = transcription[0]
print(getattr(result, "text", result))
```

### Transcribing from Microphone

```python
import nemo.collections.asr as nemo_asr
import pyaudio
import wave
import tempfile

def record_audio(duration=5, sample_rate=16000):
    """Record audio from the microphone and return the path to a temporary WAV file."""
    p = pyaudio.PyAudio()
    stream = p.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=sample_rate,
        input=True,
        frames_per_buffer=1024,
    )

    print(f"Recording for {duration} seconds...")
    frames = []
    for _ in range(0, int(sample_rate / 1024 * duration)):
        data = stream.read(1024)
        frames.append(data)

    print("Recording finished.")
    stream.stop_stream()
    stream.close()
    # Query the sample width before releasing PortAudio
    sample_width = p.get_sample_size(pyaudio.paInt16)
    p.terminate()

    # Save to a temporary file; the caller is responsible for deleting it
    temp_file = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    temp_file.close()
    with wave.open(temp_file.name, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(sample_width)
        wf.setframerate(sample_rate)
        wf.writeframes(b"".join(frames))

    return temp_file.name

# Load model
model = nemo_asr.models.ASRModel.from_pretrained("johannhartmann/parakeet_de_med")

# Record and transcribe
audio_file = record_audio(duration=5)
transcription = model.transcribe([audio_file])
result = transcription[0]
print(f"Transcription: {getattr(result, 'text', result)}")
```

### Expected Input Format

- **Sample Rate:** 16 kHz (the model works with other rates, but 16 kHz is optimal)
- **Channels:** Mono (single channel)
- **Format:** WAV, FLAC, MP3, or any format supported by soundfile
- **Bit Depth:** 16-bit PCM recommended

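As a quick pre-flight check, the recommendations above can be verified for a WAV file with Python's standard library alone. This is a minimal sketch: `check_wav` is a hypothetical helper, not part of NeMo, and the `wave` module only reads uncompressed PCM WAV files, so FLAC or MP3 inputs would need a different reader.

```python
import wave


def check_wav(path, expected_rate=16000):
    """Return a list of deviations from 16 kHz, mono, 16-bit PCM (empty if none)."""
    problems = []
    with wave.open(path, "rb") as wf:
        if wf.getframerate() != expected_rate:
            problems.append(f"sample rate is {wf.getframerate()} Hz, expected {expected_rate} Hz")
        if wf.getnchannels() != 1:
            problems.append(f"{wf.getnchannels()} channels, expected mono")
        if wf.getsampwidth() != 2:  # sample width is reported in bytes
            problems.append(f"{8 * wf.getsampwidth()}-bit samples, expected 16-bit")
    return problems
```

For example, `check_wav("medical_dictation.wav")` returns an empty list when the file already matches the recommended format, and a list of human-readable deviations otherwise.
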
## Medical Domain Coverage

The model is trained on comprehensive medical documentation, including:

- Patient presentation and admission
- Medical history and examinations
- Vital signs and lab results
- Diagnoses and treatments
- Medications and dosages
- Discharge summaries
- Follow-up recommendations

## Limitations

- Optimized for German medical documentation speech
- Trained on synthetic speech data
- May have reduced accuracy on non-medical German content
- Performance may vary with different audio conditions and accents

## Intended Use

This model is designed for:

- ✅ German medical documentation transcription
- ✅ Clinical note-taking assistance
- ✅ Medical dictation systems
- ✅ Research in medical ASR

Not recommended for:

- ❌ Critical medical decisions without human review
- ❌ General-purpose German ASR (use base model instead)
- ❌ Languages other than German

## License

This model is licensed under [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/), the same license as the base model [NVIDIA Parakeet-TDT-0.6B-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3).

**You are free to:**

- ✅ Use commercially
- ✅ Modify and create derivatives
- ✅ Distribute and share

**Under the following terms:**

- **Attribution:** You must give appropriate credit to both NVIDIA (original model) and this fine-tuned version, provide a link to the license, and indicate if changes were made.

## Citation

Base model:

```
@misc{parakeet-tdt-0.6b-v3,
  author = {NVIDIA},
  title = {Parakeet-TDT-0.6B},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3}
}
```

## Contact

For questions or issues, please open an issue on the repository.