MIRIAD

MIRIAD is a curated, million-scale Medical Instruction and RetrIeval Dataset. It contains 5.8 | 4.4 million medical question-answer pairs, distilled from peer-reviewed biomedical literature using LLMs. MIRIAD provides structured, high-quality QA pairs that support diverse downstream tasks such as RAG, medical retrieval, hallucination detection, and instruction tuning. Any follow-up work will also be hosted here. We hope you find it helpful. Have fun building!

The dataset was introduced in our arXiv preprint (arXiv:2506.06091).
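
To make the retrieval and RAG use cases mentioned above more concrete, here is a minimal sketch of retrieving the closest QA pair for a query and assembling it into a prompt. It is an illustration under stated assumptions, not the method used in the paper: the records are invented placeholders that are not drawn from MIRIAD, the field names `question` and `answer` are hypothetical and may differ from the released schema, and a simple bag-of-words cosine similarity stands in for whatever embedding model a real pipeline would use.

```python
import math
import re
from collections import Counter

# Toy records: invented placeholders, NOT drawn from MIRIAD.
# The field names "question" and "answer" are assumptions for this sketch.
QA_PAIRS = [
    {"question": "Which vitamin deficiency causes scurvy?",
     "answer": "Vitamin C deficiency."},
    {"question": "Which organ produces insulin?",
     "answer": "The pancreas."},
    {"question": "What does a sphygmomanometer measure?",
     "answer": "Blood pressure."},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, pairs, k=1):
    """Return the k QA pairs most similar to the query."""
    q_vec = Counter(tokenize(query))
    scored = [
        (cosine(q_vec, Counter(tokenize(p["question"] + " " + p["answer"]))), p)
        for p in pairs
    ]
    # Sort on the score only, so the dicts themselves are never compared.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pair for _, pair in scored[:k]]


def build_prompt(query, pairs, k=1):
    """Assemble retrieved QA pairs and the query into a RAG-style prompt."""
    context = "\n".join(
        f"Q: {p['question']}\nA: {p['answer']}" for p in retrieve(query, pairs, k)
    )
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


if __name__ == "__main__":
    print(build_prompt("What organ makes insulin?", QA_PAIRS))
```

In a real setup, the toy list would be replaced by the dataset's records and the word-overlap score by a dense embedding index; the retrieve-then-prompt structure stays the same.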

Licensing

In this paper, we use the Semantic Scholar Open Research Corpus (S2ORC) as the source of documents to generate our dataset. These documents are made available under the Open Data Commons Attribution License (ODC-By) v1.0 (https://opendatacommons.org/licenses/by/1-0/), which permits reuse and modification of the dataset, including for commercial use, provided that proper attribution is given. To construct our dataset, we used S2ORC documents as input to OpenAI’s language models. The resulting model-generated outputs are owned by us, as per OpenAI’s Terms of Use, which also specify that outputs must not be used for medical diagnosis or decision-making about real individuals (https://openai.com/policies/terms-of-use/). Since our outputs are generated using both S2ORC documents and OpenAI’s models, we release the dataset under the ODC-By v1.0 license, subject to the usage restrictions in OpenAI’s Terms of Use.

Intended use

At this stage, the outputs of this study and the provided assets are supplied exclusively for academic research and educational exploration. They have not been reviewed or cleared by any regulatory body, and accordingly must not be used for clinical decision-making or considered a certified medical device.

📖 Cite

@misc{zheng2025miriadaugmentingllmsmillions,
      title={MIRIAD: Augmenting LLMs with millions of medical query-response pairs}, 
      author={Qinyue Zheng and Salman Abdullah and Sam Rawal and Cyril Zakka and Sophie Ostmeier and Maximilian Purk and Eduardo Reis and Eric J. Topol and Jure Leskovec and Michael Moor},
      year={2025},
      eprint={2506.06091},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.06091}, 
}

Dataset feedback

Please email us any corrections to factual errors or other issues in this dataset, with the keyword "MIRIAD Edit" in the subject line. We will collect these reports and apply updates in batches. Thank you!
