MedASR Model Card
Model documentation: MedASR
Resources:
- Model on Google Cloud Model Garden: MedASR
- Model on Hugging Face: MedASR
- GitHub repository (supporting code, Colab notebooks, discussions, and issues): MedASR
- Quick start notebook: GitHub
- Fine-tuning notebook: GitHub
- Support: See Contact
License: The use of MedASR is governed by the Health AI Developer Foundations terms of use.
Author: Google
Model information
This section describes the MedASR (Medical Automated Speech Recognition) model and how to use it.
Description
MedASR is a speech-to-text model based on the Conformer architecture and pre-trained for medical dictation. MedASR is intended as a starting point for developers and is well suited for dictation tasks involving medical terminology, such as radiology dictation, and for transcribing physician-patient conversations. While MedASR has been extensively pre-trained on a corpus of medical audio data, its performance may vary on terms outside of its pre-training data, such as non-standard medication names, and it may handle temporal data (dates, times, or durations) inconsistently.
How to use
The following are some example code snippets to help you quickly get started running the model locally. If you want to use the model at scale, we recommend that you create a production version using Model Garden.
First, install the Transformers library. MedASR is supported starting from transformers 5.0.0, so you may need to install transformers from GitHub.
$ uv pip install git+https://github.com/huggingface/transformers.git
Run the model with the pipeline API
from transformers import pipeline
import huggingface_hub

# Download a sample audio file from the model repository.
audio = huggingface_hub.hf_hub_download('google/medasr', 'test_audio.wav')

model_id = "google/medasr"
pipe = pipeline("automatic-speech-recognition", model=model_id)

# chunk_length_s sets how many seconds of audio MedASR batches at a time;
# stride_length_s sets the overlap between consecutive chunks.
result = pipe(audio, chunk_length_s=20, stride_length_s=2)
print(result)
Run the model directly
from transformers import AutoModelForCTC, AutoProcessor
import huggingface_hub
import librosa
import torch

# Download a sample audio file from the model repository.
audio = huggingface_hub.hf_hub_download('google/medasr', 'test_audio.wav')

model_id = "google/medasr"
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCTC.from_pretrained(model_id).to(device)

# Load the audio, resampling to the 16 kHz rate the model expects.
speech, sample_rate = librosa.load(audio, sr=16000)
inputs = processor(speech, sampling_rate=sample_rate, return_tensors="pt", padding=True)
inputs = inputs.to(device)

# Greedy (argmax) CTC decoding: pick the most likely token at each frame.
with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)
decoded_text = processor.batch_decode(predicted_ids)[0]
print(f"result={decoded_text}")
Examples
See the following tutorial notebooks for examples of how to use MedASR:
- To give the model a quick try, running it locally with weights from Hugging Face, see the Quick start notebook in Colab.
- For an example of fine-tuning the model, see the Fine-tuning notebook in Colab.
Model architecture overview
The MedASR model is based on the Conformer architecture.
Technical specifications
- Model type: Automated speech recognition (ASR)
- Input modality: Mono-channel 16 kHz audio, int16 waveform
- Output modality: Text only
- Number of parameters: 105M
- Key publication: LAST: Scalable Lattice-Based Speech Modelling in JAX
- Model created: November 4, 2025
- Model version: 1.0.0
Citation
When using this model, cite:
@inproceedings{wu2023last,
title={Last: Scalable Lattice-Based Speech Modelling in Jax},
author={Wu, Ke and Variani, Ehsan and Bagby, Tom and Riley, Michael},
booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP)},
pages={1--5},
year={2023},
organization={IEEE}
}
Performance and Evaluations
Our evaluation methods include evaluating the word error rate (WER) of MedASR against held-out medical audio examples. We also evaluate a medical-specific WER, computed only over words that have a medical context. These audio samples were transcribed by human experts, although such transcriptions always contain some noise.
Key performance metrics
Word error rate of MedASR versus other models*
| Dataset name | Dataset description | MedASR with greedy decoding | MedASR + 6-gram language model | Gemini 2.5 Pro | Gemini 3 Pro | Gemini 2.5 Flash | Whisper v3 Large |
|---|---|---|---|---|---|---|---|
| RAD-DICT | Private radiologist dictation dataset | 7.1% | 5.8% | 14.2% | 12.7% | 26.0% | 25.2% |
| GENERAL-DICT | Private general and internal medicine dataset | 9.8% | 7.9% | 17.9% | 18.6% | 27.8% | 32.4% |
| FM-DICT | Private family medicine dataset | 8.7% | 7.2% | 16.3% | 16.8% | 21.2% | 32.6% |
| Eye Gaze | Dictation of audio from 998 MIMIC cases (multiple speakers) | 7.2% | 5.2% | 5.9% | 5.5% | 9.3% | 12.8% |
*All results except "MedASR + 6-gram language model" in the preceding table use greedy decoding. "MedASR + 6-gram language model" uses beam search with beam size 8.
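The "MedASR + 6-gram language model" results were produced with an internal decoder. As a rough illustration of the same technique, the following is a minimal sketch of beam-search CTC decoding against an external KenLM-format n-gram model using the open-source pyctcdecode library. The medical_6gram.arpa path is a hypothetical placeholder (no language model ships with MedASR), and the vocabulary extraction assumes a standard CTC tokenizer layout:

import huggingface_hub
import librosa
import torch
from pyctcdecode import build_ctcdecoder
from transformers import AutoModelForCTC, AutoProcessor

model_id = "google/medasr"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCTC.from_pretrained(model_id)

# Order the tokenizer vocabulary by token ID to get the CTC label list.
vocab = sorted(processor.tokenizer.get_vocab().items(), key=lambda kv: kv[1])
labels = [token for token, _ in vocab]

# Build a beam-search decoder backed by an n-gram language model
# (the .arpa file below is a hypothetical placeholder).
decoder = build_ctcdecoder(labels, kenlm_model_path="medical_6gram.arpa")

audio = huggingface_hub.hf_hub_download(model_id, "test_audio.wav")
speech, _ = librosa.load(audio, sr=16000)
inputs = processor(speech, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0].cpu().numpy()

# Beam search with beam size 8, matching the table above.
print(decoder.decode(logits, beam_width=8))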
Safety evaluation
Our evaluation methods include structured evaluations and internal red-teaming of relevant safety policies. The model was evaluated across various dimensions to assess safety. Human evaluations were conducted on 100 example outputs to assess potential safety impact, specifically incorrect transcriptions involving medication names, dosages, diagnoses, semantic changes, and medical terminology. The results of these evaluations were determined to be acceptable with regard to internal policies for overall safety.
Data card
Dataset overview
Training
The MedASR model is specifically trained on a diverse set of de-identified medical speech data. Its training utilizes approximately 5000 hours of physician dictations across a range of specialties (proprietary dataset 1) and de-identified medical conversations, primarily physician-patient dialogue (proprietary dataset 2). The model is trained on audio segments paired with corresponding transcripts and metadata, with subsets of the conversational data also including extensive annotations for medical named entities such as symptoms, medications, and conditions. MedASR therefore has a strong understanding of vocabulary used in medical contexts.
Evaluation
MedASR has been evaluated using a mix of internal and public datasets, as noted in the Key performance metrics section. We take the argmax of the model's posterior probabilities (greedy decoding) to obtain the model's hypothesis tokens. The hypothesis is compared against the ground-truth transcript using the jiwer library to calculate the word error rate.
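For reference, a minimal sketch of this WER computation with the jiwer library (the transcript strings are hypothetical placeholders):

import jiwer

# Hypothetical ground-truth transcript and model hypothesis.
reference = "patient denies chest pain and shortness of breath"
hypothesis = "patient denies chest pain in shortness of breath"

# WER = (substitutions + deletions + insertions) / number of reference words.
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")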
Source
The datasets used to train MedASR include a public dataset for pre-training and a proprietary dataset that was licensed and incorporated (described in the following section).
Data ownership and documentation
Pre-training used the full LibriHeavy training set. Fine-tuning was conducted on the de-identified, licensed datasets described in the following section.
Private Medical Dict: Google internal dataset consisting of de-identified dictations made by physicians of different specialties, including radiology, internal medicine, family medicine, and other subspecialties, totaling more than 5000 hours of audio. This dataset was split into test sets that constitute the RAD-DICT, GENERAL-DICT, and FM-DICT sets referenced previously in Performance and Evaluations.
Data citation
Eye Gaze Data for Chest X-rays (evaluation set described previously in Performance and Evaluations) was derived from:
MIMIC-CXR Database v1.0.0 and MIMIC-IV v0.4
De-identification/anonymization:
Google and its partners utilize datasets that have been rigorously anonymized or de-identified to ensure the protection of individual research participants and patient privacy.
Implementation Information
Details about the model internals.
Hardware
Tensor Processing Unit (TPU) hardware (TPUv4p, TPUv5p and TPUv5e). Training speech-to-text models requires significant computational power. TPUs, designed specifically for matrix operations common in machine learning, offer several advantages in this domain:
- Performance: TPUs are specifically designed to handle the massive computations involved in training speech recognition models. They can speed up training considerably compared to CPUs.
- Memory: TPUs often come with large amounts of high-bandwidth memory, allowing for the handling of large models and batch sizes during training. This can lead to better model quality.
- Scalability: TPU Pods (large clusters of TPUs) provide a scalable solution for handling the growing complexity of large foundation models. You can distribute training across multiple TPU devices for faster and more efficient processing.
- Cost-effectiveness: In many scenarios, TPUs can provide a more cost-effective solution for training large models compared to CPU-based infrastructure, especially when considering the time and resources saved due to faster training.
These advantages are aligned with Google's commitments to operate sustainably.
Software
Training was done using JAX and ML Pathways. JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models. ML Pathways is Google's latest effort to build artificially intelligent systems capable of generalizing across multiple tasks. This is especially suitable for foundation models, including speech models like MedASR.
Together, JAX and ML Pathways are used as described in the paper about the Gemini family of models: "the 'single controller' programming model of JAX and Pathways allows a single Python process to orchestrate the entire training run, dramatically simplifying the development workflow."
Usage and Limitations
The MedASR model has certain limitations that users should be aware of.
Intended Use
MedASR is a speech-to-text model intended to be used as a starting point that enables more efficient development of downstream healthcare applications requiring speech as input. MedASR is intended for developers in the healthcare and life sciences space. Developers are responsible for training, adapting, and making meaningful changes to MedASR to accomplish their specific intended use. The MedASR model can be fine-tuned by developers using their own proprietary data for their specific tasks or solutions.
MedASR is trained on a large volume of medical audio, speech, and text data, and enables further development and integration with generative models like MedGemma: MedASR converts speech to text, which can then be used as input for a text-to-text model. Full details of all the tasks MedASR has been evaluated and pre-trained on can be found in the MedASR model card.
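A minimal sketch of that speech-to-text-to-LLM pattern is shown below; the MedGemma checkpoint ID, audio filename, and prompt are illustrative assumptions, not a documented pairing:

from transformers import pipeline

# Step 1: transcribe the dictation audio with MedASR.
asr = pipeline("automatic-speech-recognition", model="google/medasr")
transcript = asr("dictation.wav", chunk_length_s=20, stride_length_s=2)["text"]

# Step 2: pass the transcript to a text-to-text model such as MedGemma
# (checkpoint ID assumed for illustration; see the MedGemma model card).
llm = pipeline("text-generation", model="google/medgemma-27b-text-it")
prompt = f"Summarize this dictation as a structured report:\n\n{transcript}"
print(llm(prompt, max_new_tokens=256)[0]["generated_text"])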
MedASR is not intended to be used without appropriate validation, adaptation, or making meaningful modification by developers for their specific use case. The outputs generated by MedASR may include transcription errors and are not intended to directly inform clinical diagnosis, patient management decisions, treatment recommendations, or any other direct clinical practice applications. All outputs from MedASR should be considered preliminary and require independent verification, clinical correlation, and further investigation through established research and development methodologies.
Limitations
- Training data
  - English-only: All training data is in English.
  - Speaker diversity: Most training data comes from speakers whose first language is English and who were raised in the United States. The base model's performance may be lower for other speakers, and fine-tuning may be necessary.
  - Speaker sex/gender: Training data included both men and women but had a higher proportion of men.
  - Audio quality: Training data mostly comes from high-quality microphones. The base model's performance may deteriorate on low-quality audio with background noise, and fine-tuning may be necessary.
  - Specialized medical terminology: Although MedASR has specialized medical audio training, its training data may not include all medications, procedures, or terminology, especially those that have come into use in the past 10 years.
  - Dates: MedASR has been trained on de-identified data, so its performance on different date formats may be lacking. This can be rectified with further fine-tuning or with alternative decoding approaches, such as debiasing the output via language model decoding.
Benefits
At the time of release, MedASR is a high-performing open speech-to-text model with specific training for medical applications. Users can update its vocabulary with few-shot fine-tuning or by decoding with external language models, as in the sketch below.
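As one sketch of the external-language-model route, the pyctcdecode decoder from the earlier n-gram example also supports boosting user-supplied "hotwords" at decode time, which can help with out-of-vocabulary terms (the drug names below are arbitrary examples):

# `decoder` and `logits` are as in the n-gram decoding sketch under
# "Key performance metrics"; hotword boosting also works without a
# KenLM model attached.
text = decoder.decode(
    logits,
    beam_width=8,
    hotwords=["apixaban", "empagliflozin"],  # example domain terms to boost
    hotword_weight=10.0,
)
print(text)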
Based on the benchmark evaluation metrics in this document, MedASR represents a significant leap forward in medical speech-to-text performance relative to other comparably sized open model alternatives.