---
language:
- de
license: cc-by-4.0
tags:
- audio
- automatic-speech-recognition
- speech
- medical
- german
- parakeet
- nemo
base_model: nvidia/parakeet-tdt-0.6b-v3
pipeline_tag: automatic-speech-recognition
datasets:
- custom
metrics:
- wer
model-index:
- name: Parakeet-DE-Med
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: German Medical Documentation
      type: custom
    metrics:
    - type: wer
      value: 3.28
      name: Word Error Rate
---
# Parakeet-DE-Med: German Medical ASR
Fine-tuned NVIDIA Parakeet-TDT-0.6B for German medical documentation transcription.
This model is a fine-tuned derivative of [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3), which is licensed under CC-BY-4.0.
## Model Description
This model is a parameter-efficient fine-tuned (PEFT) version of NVIDIA's Parakeet-TDT-0.6B specialized for German medical documentation (Arztbriefe). It uses the decoder+joint training strategy, training only 2.89% of the model parameters while achieving significant improvements in medical domain accuracy.
- **Base Model:** nvidia/parakeet-tdt-0.6b-v3
- **Language:** German (de-DE)
- **Domain:** Medical documentation
- **Training Method:** PEFT (decoder+joint strategy)
- **Parameters Trained:** 18.1M / 627M (2.89%); see the sketch below
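The decoder+joint freeze behind these numbers can be expressed with plain PyTorch. A minimal sketch, assuming the NeMo TDT model exposes `encoder`, `decoder`, and `joint` submodules (as NeMo RNNT/TDT models do); the printed fraction should roughly match the figures above:
```python
import nemo.collections.asr as nemo_asr

# Load the base model that was fine-tuned
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v3")

# Freeze the encoder; the decoder (prediction network) and joint network stay trainable
for p in model.encoder.parameters():
    p.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
# Should roughly match the 18.1M / 627M (2.89%) reported above
print(f"Trainable: {trainable / 1e6:.1f}M / {total / 1e6:.0f}M ({100 * trainable / total:.2f}%)")
```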
## Performance
Evaluated on a German medical documentation test set (122 samples):
| Model | WER |
|-------|-----|
| Base Parakeet-TDT-0.6B | 11.73% |
| **Parakeet-DE-Med** | **3.28%** |
**Improvement:** 72% relative WER reduction
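The WER figures can be reproduced with any standard WER tool. A minimal sketch using the `jiwer` package (an assumption, not part of this repository); the file names and reference transcripts below are placeholders:
```python
import jiwer
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("johannhartmann/parakeet_de_med")

# Placeholder test set: parallel lists of audio paths and reference transcripts
audio_paths = ["test_0001.wav", "test_0002.wav"]
references = ["Der Patient stellte sich mit Dyspnoe vor.", "Die Laborwerte waren unauffällig."]

hypotheses = model.transcribe(audio_paths, batch_size=4)
# Depending on the NeMo version, transcribe() returns plain strings or Hypothesis objects
hypotheses = [h.text if hasattr(h, "text") else h for h in hypotheses]

print(f"WER: {100 * jiwer.wer(references, hypotheses):.2f}%")
```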
## Training Details
- **Training Data:** 976 German medical documentation samples
- **Training Epochs:** 5
- **Training Strategy:** Freeze encoder, train decoder and joint network only (see the sketch after this list)
- **Precision:** BF16 mixed precision
- **Batch Size:** 4 (effective batch size 16 with gradient accumulation)
- **Learning Rate:** 2e-4
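A hedged sketch of how these settings map onto a PyTorch Lightning trainer. This is not the original training script: the Lightning import path (`pytorch_lightning` vs. `lightning.pytorch`), the `setup_optimization` call, and the data-setup step are assumptions based on common NeMo usage.
```python
import pytorch_lightning as pl
import nemo.collections.asr as nemo_asr

# Load the base model and freeze its encoder; decoder and joint network remain trainable
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v3")
for p in model.encoder.parameters():
    p.requires_grad = False

# Optimizer with the learning rate from this card (the optimizer choice here is a placeholder)
model.setup_optimization({"name": "adamw", "lr": 2e-4})

trainer = pl.Trainer(
    max_epochs=5,                # 5 training epochs
    precision="bf16-mixed",      # BF16 mixed precision (older Lightning versions use "bf16")
    accumulate_grad_batches=4,   # micro-batch 4 -> effective batch size 16
    accelerator="auto",
    devices=1,
)
# Attaching the training manifest via model.setup_training_data(...) and calling
# trainer.fit(model) would complete the run; the training data itself is not included here.
```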
## Usage
### Prerequisites
```bash
pip install "nemo_toolkit[asr]"
# Or for the latest version:
pip install "git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]"
```
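A quick import check to confirm the installation worked (the printed version string is informational only):
```python
import nemo
import nemo.collections.asr as nemo_asr  # fails here if the ASR extras are missing
print(nemo.__version__)
```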
### Basic Transcription
```python
import nemo.collections.asr as nemo_asr
# Load the model from HuggingFace
model = nemo_asr.models.ASRModel.from_pretrained("johannhartmann/parakeet_de_med")
# Transcribe a single audio file
transcription = model.transcribe(["path/to/audio.wav"])
print(transcription[0])
```
### Batch Transcription
```python
import nemo.collections.asr as nemo_asr
model = nemo_asr.models.ASRModel.from_pretrained("johannhartmann/parakeet_de_med")
# Transcribe multiple files
audio_files = [
    "patient_recording_1.wav",
    "patient_recording_2.wav",
    "patient_recording_3.wav"
]
transcriptions = model.transcribe(audio_files, batch_size=4)
for i, text in enumerate(transcriptions):
    print(f"File {i+1}: {text}")
```
### Transcribing In-Memory Audio (NumPy Arrays)
```python
import nemo.collections.asr as nemo_asr
import soundfile as sf
model = nemo_asr.models.ASRModel.from_pretrained("johannhartmann/parakeet_de_med")
# Load audio file
audio, sample_rate = sf.read("medical_dictation.wav")
# Resample to 16 kHz if needed
if sample_rate != 16000:
    import librosa
    audio = librosa.resample(audio, orig_sr=sample_rate, target_sr=16000)
# Transcribe from numpy array
transcription = model.transcribe(audio, batch_size=1)
print(transcription[0])
```
### Advanced Configuration
```python
import nemo.collections.asr as nemo_asr
model = nemo_asr.models.ASRModel.from_pretrained("johannhartmann/parakeet_de_med")
# Configure decoding parameters
transcription = model.transcribe(
    audio=["recording.wav"],  # recent NeMo releases use `audio` instead of `paths2audio_files`
    batch_size=1,
    return_hypotheses=True,   # return Hypothesis objects with decoding scores
    num_workers=4,
    channel_selector=0,       # pick one channel from multi-channel audio
    augmentor=None
)
# Access detailed results
for hyp in transcription:
    print(f"Text: {hyp.text}")
    print(f"Score: {hyp.score}")
```
### Using with GPU
```python
import nemo.collections.asr as nemo_asr
import torch
# Pick a GPU if one is available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
# Load the model and move it to the selected device
model = nemo_asr.models.ASRModel.from_pretrained("johannhartmann/parakeet_de_med")
model = model.to(device)
# Transcribe
transcription = model.transcribe(["audio.wav"])
print(transcription[0])
```
### Transcribing from Microphone
```python
import nemo.collections.asr as nemo_asr
import pyaudio
import wave
import tempfile

def record_audio(duration=5, sample_rate=16000):
    """Record audio from microphone"""
    p = pyaudio.PyAudio()
    stream = p.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=sample_rate,
        input=True,
        frames_per_buffer=1024
    )
    print(f"Recording for {duration} seconds...")
    frames = []
    for _ in range(0, int(sample_rate / 1024 * duration)):
        data = stream.read(1024)
        frames.append(data)
    print("Recording finished.")
    stream.stop_stream()
    stream.close()
    p.terminate()
    # Save to temporary file
    temp_file = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    wf = wave.open(temp_file.name, 'wb')
    wf.setnchannels(1)
    wf.setsampwidth(p.get_sample_size(pyaudio.paInt16))
    wf.setframerate(sample_rate)
    wf.writeframes(b''.join(frames))
    wf.close()
    return temp_file.name

# Load model
model = nemo_asr.models.ASRModel.from_pretrained("johannhartmann/parakeet_de_med")
# Record and transcribe
audio_file = record_audio(duration=5)
transcription = model.transcribe([audio_file])
print(f"Transcription: {transcription[0]}")
```
### Expected Input Format
- **Sample Rate:** 16 kHz (audio files at other rates are resampled when loaded; in-memory arrays should be resampled to 16 kHz before transcription)
- **Channels:** Mono (single channel)
- **Format:** WAV, FLAC, MP3, or any format supported by soundfile
- **Bit Depth:** 16-bit PCM recommended
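If a recording does not match this format, a small conversion step brings it to 16 kHz mono 16-bit WAV. A sketch using `soundfile` and `librosa` (both already used in the examples above); the helper name and file paths are placeholders:
```python
import soundfile as sf
import librosa

def to_16k_mono_wav(src_path, dst_path="converted_16k_mono.wav"):
    """Convert an audio file to 16 kHz mono 16-bit WAV for transcription."""
    audio, sr = sf.read(src_path)
    # Downmix multi-channel recordings to mono
    if audio.ndim > 1:
        audio = audio.mean(axis=1)
    # Resample to 16 kHz if needed
    if sr != 16000:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
    sf.write(dst_path, audio, 16000, subtype="PCM_16")
    return dst_path
```
The returned path can then be passed to `model.transcribe([...])` as in the examples above.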
## Medical Domain Coverage
The model is trained on comprehensive medical documentation including:
- Patient presentation and admission
- Medical history and examinations
- Vital signs and lab results
- Diagnoses and treatments
- Medications and dosages
- Discharge summaries
- Follow-up recommendations
## Limitations
- Optimized for German medical documentation speech
- Trained on synthetic speech data
- May have reduced accuracy on non-medical German content
- Performance may vary with different audio conditions and accents
## Intended Use
This model is designed for:
- ✅ German medical documentation transcription
- ✅ Clinical note-taking assistance
- ✅ Medical dictation systems
- ✅ Research in medical ASR
Not recommended for:
- ❌ Critical medical decisions without human review
- ❌ General-purpose German ASR (use base model instead)
- ❌ Languages other than German
## License
This model is licensed under [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/), the same license as the base model [NVIDIA Parakeet-TDT-0.6B-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3).
**You are free to:**
- ✅ Use commercially
- ✅ Modify and create derivatives
- ✅ Distribute and share
**Under the following terms:**
- **Attribution:** You must give appropriate credit to both NVIDIA (original model) and this fine-tuned version, provide a link to the license, and indicate if changes were made.
## Citation
Base model:
```
@misc{parakeet-tdt-0.6b-v3,
  author    = {NVIDIA},
  title     = {Parakeet-TDT-0.6B},
  year      = {2024},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3}
}
```
## Contact
For questions or issues, please open an issue on the repository.