	---
	language:
	- de
	license: cc-by-4.0
	tags:
	- audio
	- automatic-speech-recognition
	- speech
	- medical
	- german
	- parakeet
	- nemo
	base_model: nvidia/parakeet-tdt-0.6b-v3
	pipeline_tag: automatic-speech-recognition
	datasets:
	- custom
	metrics:
	- wer
	model-index:
	- name: Parakeet-DE-Med
	  results:
	  - task:
	      type: automatic-speech-recognition
	      name: Automatic Speech Recognition
	    dataset:
	      name: German Medical Documentation
	      type: custom
	    metrics:
	    - type: wer
	      value: 3.28
	      name: Word Error Rate
	---

	# Parakeet-DE-Med: German Medical ASR

	Fine-tuned NVIDIA Parakeet-TDT-0.6B for German medical documentation transcription.

	This model is a fine-tuned derivative of [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3), which is licensed under CC-BY-4.0.

	## Model Description

	This model is a parameter-efficient fine-tuned (PEFT) version of NVIDIA's Parakeet-TDT-0.6B-v3, specialized for German medical documentation (Arztbriefe, i.e. physicians' letters). It uses a decoder+joint training strategy: the encoder is frozen and only the decoder and joint network are trained. This updates 2.89% of the model parameters and reduces WER on the medical test set from 11.73% to 3.28%.

	- Base Model: nvidia/parakeet-tdt-0.6b-v3
	- Language: German (de-DE)
	- Domain: Medical documentation
	- Training Method: PEFT (decoder+joint strategy)
	- Parameters Trained: 18.1M / 627M (2.89%)

	## Performance

	Evaluated on a German medical documentation test set (122 samples):

	| Model | WER |
	|-------|-----|
	| Base Parakeet-TDT-0.6B | 11.73% |
	| Parakeet-DE-Med | 3.28% |

	Improvement: 72% relative WER reduction (from 11.73% to 3.28%)
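
The relative reduction follows from the two WER values in the table: (11.73 - 3.28) / 11.73 ≈ 0.72. The sketch below shows one way to compute WER and that relative figure. It assumes whitespace tokenization and no text normalization, which may differ from the evaluation script actually used for this model; the function names are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER removed by the new model."""
    return (baseline_wer - new_wer) / baseline_wer


# Values from the table above
print(f"{relative_wer_reduction(11.73, 3.28):.1%}")  # 72.0%
```

When scoring a whole test set, WER is usually computed as total word errors divided by total reference words, rather than as the mean of per-utterance WER values.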

	## Training Details

	- Training Data: 976 German medical documentation samples
	- Training Epochs: 5
	- Training Strategy: Freeze encoder, train decoder and joint network only
	- Precision: BF16 mixed precision
	- Batch Size: 4 (effective batch size 16 with gradient accumulation)
	- Learning Rate: 2e-4
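
The figures above imply a few derived quantities. The sketch below works them out, assuming 4 gradient-accumulation steps (16 / 4) and that every epoch uses all 976 samples. The training script itself is not part of this card, so the variable names are illustrative.

```python
import math

train_samples = 976
epochs = 5
micro_batch_size = 4
effective_batch_size = 16

# Gradient-accumulation steps needed to reach the effective batch size
accumulation_steps = effective_batch_size // micro_batch_size

# Optimizer updates per epoch and over the whole run
updates_per_epoch = math.ceil(train_samples / effective_batch_size)
total_updates = updates_per_epoch * epochs

# Share of parameters updated when only the decoder and joint network train
trainable_fraction = 18.1e6 / 627e6

print(accumulation_steps)           # 4
print(updates_per_epoch)            # 61
print(total_updates)                # 305
print(f"{trainable_fraction:.2%}")  # 2.89%
```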

	## Usage

	### Prerequisites

	```bash
	pip install "nemo_toolkit[asr]"
	# Or install the latest development version from GitHub:
	pip install "nemo_toolkit[asr] @ git+https://github.com/NVIDIA/NeMo.git@main"
	```

	### Basic Transcription

	```python
	import nemo.collections.asr as nemo_asr

	# Load the model from HuggingFace
	model = nemo_asr.models.ASRModel.from_pretrained("johannhartmann/parakeet_de_med")

	# Transcribe a single audio file
	transcription = model.transcribe(["path/to/audio.wav"])
	# Recent NeMo releases return Hypothesis objects; the transcript is in .text
	print(transcription[0].text)
	```

	### Batch Transcription

	```python
	import nemo.collections.asr as nemo_asr

	model = nemo_asr.models.ASRModel.from_pretrained("johannhartmann/parakeet_de_med")

	# Transcribe multiple files
	audio_files = [
	    "patient_recording_1.wav",
	    "patient_recording_2.wav",
	    "patient_recording_3.wav",
	]

	transcriptions = model.transcribe(audio_files, batch_size=4)
	for i, hyp in enumerate(transcriptions, start=1):
	    # Recent NeMo releases return Hypothesis objects; the transcript is in .text
	    print(f"File {i}: {hyp.text}")
	```

	### Transcribing In-Memory Audio (NumPy Array)

	```python
	import nemo.collections.asr as nemo_asr
	import soundfile as sf

	model = nemo_asr.models.ASRModel.from_pretrained("johannhartmann/parakeet_de_med")

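	# Load audio as float32; soundfile returns shape (frames, channels) for multi-channel files
	audio, sample_rate = sf.read("medical_dictation.wav", dtype="float32")

	# Downmix to mono if needed
	if audio.ndim > 1:
	    audio = audio.mean(axis=1)

	# Resample to 16 kHz if needed
	if sample_rate != 16000:
	    import librosa
	    audio = librosa.resample(audio, orig_sr=sample_rate, target_sr=16000)

	# Transcribe from a NumPy array (offline inference on the full recording, not streaming)
	transcription = model.transcribe(audio, batch_size=1)
	print(transcription[0].text)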
	```

	### Advanced Configuration

	```python
	import nemo.collections.asr as nemo_asr

	model = nemo_asr.models.ASRModel.from_pretrained("johannhartmann/parakeet_de_med")

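	# Configure transcription options. The audio list is passed positionally because
	# the keyword was renamed from paths2audio_files to audio in newer NeMo releases.
	transcription = model.transcribe(
	    ["recording.wav"],
	    batch_size=1,
	    return_hypotheses=True,  # Return Hypothesis objects (text plus decoding score)
	    num_workers=4,
	    channel_selector=0,  # Use the first channel of multi-channel audio
	    augmentor=None,
	)

	# Access detailed results
	for hyp in transcription:
	    print(f"Text: {hyp.text}")
	    # hyp.score is the decoder's hypothesis score, not a calibrated confidence
	    print(f"Score: {hyp.score}")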
	```

	### Using with GPU

	```python
	import nemo.collections.asr as nemo_asr
	import torch

	# Pick the GPU if one is available, otherwise fall back to CPU
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
	print(f"Using device: {device}")

	# Load the model and move it to the selected device explicitly
	model = nemo_asr.models.ASRModel.from_pretrained("johannhartmann/parakeet_de_med")
	model = model.to(device)
	model.eval()

	# Transcribe
	transcription = model.transcribe(["audio.wav"])
	print(transcription[0].text)
	```

	### Transcribing from Microphone

	```python
	import nemo.collections.asr as nemo_asr
	import pyaudio
	import wave
	import tempfile

	def record_audio(duration=5, sample_rate=16000, chunk=1024):
	    """Record mono 16-bit audio from the default microphone and return the path to a WAV file."""
	    p = pyaudio.PyAudio()
	    stream = p.open(
	        format=pyaudio.paInt16,
	        channels=1,
	        rate=sample_rate,
	        input=True,
	        frames_per_buffer=chunk,
	    )

	    print(f"Recording for {duration} seconds...")
	    frames = []
	    for _ in range(int(sample_rate / chunk * duration)):
	        frames.append(stream.read(chunk))

	    print("Recording finished.")
	    stream.stop_stream()
	    stream.close()
	    p.terminate()

	    # Reserve a temporary file name; the caller is responsible for deleting the file
	    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as temp_file:
	        wav_path = temp_file.name

	    # Write the recorded frames as 16-bit mono PCM
	    with wave.open(wav_path, "wb") as wf:
	        wf.setnchannels(1)
	        wf.setsampwidth(pyaudio.get_sample_size(pyaudio.paInt16))
	        wf.setframerate(sample_rate)
	        wf.writeframes(b"".join(frames))

	    return wav_path

	# Load model
	model = nemo_asr.models.ASRModel.from_pretrained("johannhartmann/parakeet_de_med")

	# Record and transcribe
	audio_file = record_audio(duration=5)
	transcription = model.transcribe([audio_file])
	# Recent NeMo releases return Hypothesis objects; the transcript is in .text
	print(f"Transcription: {transcription[0].text}")
	```

	### Expected Input Format

	- Sample Rate: 16 kHz (the model will work with other rates, but 16 kHz is optimal)
	- Channels: Mono (single channel)
	- Format: WAV, FLAC, MP3, or any format supported by soundfile
	- Bit Depth: 16-bit PCM recommended
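
A quick way to check a WAV file against these recommendations is Python's standard `wave` module. The helper below is a sketch: `check_wav` is a hypothetical name, and it only handles uncompressed PCM WAV files, not FLAC or MP3.

```python
import wave


def check_wav(path: str, expected_rate: int = 16000) -> list[str]:
    """Return a list of deviations from the recommended input format."""
    problems = []
    with wave.open(path, "rb") as wf:
        if wf.getframerate() != expected_rate:
            problems.append(
                f"sample rate is {wf.getframerate()} Hz, expected {expected_rate} Hz"
            )
        if wf.getnchannels() != 1:
            problems.append(f"{wf.getnchannels()} channels, expected mono")
        if wf.getsampwidth() != 2:
            problems.append(f"{8 * wf.getsampwidth()}-bit samples, 16-bit recommended")
    return problems
```

An empty list means the file matches the recommended format; otherwise each entry names one property to fix before transcription.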

	## Medical Domain Coverage

	The training data covers the following parts of medical documentation:

	- Patient presentation and admission
	- Medical history and examinations
	- Vital signs and lab results
	- Diagnoses and treatments
	- Medications and dosages
	- Discharge summaries
	- Follow-up recommendations

	## Limitations

	- Optimized for German medical documentation speech
	- Trained on synthetic speech data
	- May have reduced accuracy on non-medical German content
	- Performance may vary with different audio conditions and accents

	## Intended Use

	This model is designed for:
	- ✅ German medical documentation transcription
	- ✅ Clinical note-taking assistance
	- ✅ Medical dictation systems
	- ✅ Research in medical ASR

	Not recommended for:
	- ❌ Critical medical decisions without human review
	- ❌ General-purpose German ASR (use base model instead)
	- ❌ Languages other than German

	## License

	This model is licensed under [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/), the same license as the base model [NVIDIA Parakeet-TDT-0.6B-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3).

	You are free to:
	- ✅ Use commercially
	- ✅ Modify and create derivatives
	- ✅ Distribute and share

	Under the following terms:
	- Attribution: You must give appropriate credit to both NVIDIA (original model) and this fine-tuned version, provide a link to the license, and indicate if changes were made.

	## Citation

	Base model:
	```
	@misc{parakeet-tdt-0.6b-v3,
	  author = {NVIDIA},
	  title = {Parakeet-TDT-0.6B},
	  year = {2024},
	  publisher = {HuggingFace},
	  url = {https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3}
	}
	```

	## Contact

	For questions or issues, please open an issue on the repository.