---
license: apache-2.0
---

# MedRAGChecker Claim Extractor · LoRA Adapter

Biomedical claim-triple extractor fine-tuned from a medical LLM using GPT-4.1 teacher labels. This adapter is part of the **MedRAGChecker** pipeline for claim-level verification in biomedical RAG.

> **Task:** given a medical question and its answer, extract factual triples of the form
> `[subject, relation, object]` as a pure JSON array.

---

## Model summary

- **Base model:** `<BASE_MODEL>` (for example: `med42-llama3-8b`, `Meditron3-8B`, `PMC_LLaMA_13B`, or `qwen2-med-7b`)
- **Adapter type:** LoRA (rank = 16, alpha = 32, dropout = 0.0) via PEFT
- **Architecture:** same as the base causal LM (LLaMA-style or Qwen-style)
- **Task:** biomedical claim-triple extraction
- **Input:** question text + model answer (plain text)
- **Output:** JSON array of triples, e.g.

```json
[
  ["Psoriasis", "is", "chronic inflammatory skin disease"],
  ["Psoriasis", "is associated with", "systemic comorbidities"]
]
```

You can either:

- keep one Hugging Face repo per adapter (recommended), or
- store several adapters in one repo and refer to specific subfolders.

Replace `<BASE_MODEL>` and any placeholder names below with your actual base model and repo id (for example: `JoyDaJun/MedRAGChecker-Extractor-Meditron3-8B`).

---

## Intended use

- Post-hoc analysis of biomedical QA systems at the *claim level*.
- Use inside a RAG or QA evaluation pipeline to:
  - extract atomic factual statements from a generated answer;
  - feed those triples to a checker model (e.g. MedRAGChecker NLI+KG).

This adapter is **not** a general-purpose chat model and **must not** be used as a standalone medical assistant.

---

## How to use

### 1. LLaMA-style base models (Meditron, Med42, PMC-LLaMA, etc.)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch, json

base_model_id = ""  # e.g. "med42-llama3-8b"
adapter_id = ""     # e.g. "JoyDaJun/MedRAGChecker-Extractor-Med42-8B"

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter_id)

def build_prompt(question: str, answer: str) -> str:
    system_part = (
        "You are an information extraction assistant. "
        "Given a medical question and its answer, extract all factual triples "
        "as [subject, relation, object]. "
        "Return a pure JSON array of triples, with no explanations, no extra text, "
        "no comments. If there are no clear factual triples, return an empty JSON array []."
    )
    qa_part = f"Question: {question}\nAnswer: {answer}"
    return (
        system_part
        + "\n\n"
        + qa_part
        + '\n\nTriples (JSON only, e.g. [["subj", "rel", "obj"], ...]):\n'
    )

question = "Does hypercholesterolemia increase leukotriene B4 in neutrophils?"
answer = "Hypercholesterolemia increases 5-LO activity in neutrophils..."

prompt = build_prompt(question, answer)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    gen_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
    )

# Decode only the newly generated tokens; the prompt itself contains example
# brackets, which would otherwise confuse the JSON extraction below.
gen_only = gen_ids[0][inputs["input_ids"].shape[-1]:]
text = tokenizer.decode(gen_only, skip_special_tokens=True)

# Optional: keep only the JSON array
start = text.find("[")
end = text.rfind("]")
json_str = text[start:end + 1] if start != -1 and end != -1 else "[]"
triples = json.loads(json_str)
print(triples)
```

### 2. Chat-style base models (Qwen2-med, etc.)

For chat-style models, wrap the same prompt inside the chat template.
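If you are unsure which of the two examples applies to your base model, you can inspect the tokenizer first. This is a minimal sketch, assuming a recent `transformers` release in which tokenizers expose a `chat_template` attribute (set to `None` when no template is bundled); `base_model_id` is a placeholder exactly as in the examples:

```python
from transformers import AutoTokenizer

base_model_id = ""  # placeholder, e.g. "qwen2-med-7b"
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# Chat-style models ship a Jinja chat template; plain causal LMs usually do not.
if getattr(tokenizer, "chat_template", None):
    print("Chat template found: use the chat-style example below.")
else:
    print("No chat template: use the plain-prompt example from section 1.")
```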
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch, json

base_model_id = ""  # e.g. "qwen2-med-7b"
adapter_id = ""     # e.g. "JoyDaJun/MedRAGChecker-Extractor-Qwen2-med-7B"

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter_id)

def build_prompt(question: str, answer: str) -> str:
    system_part = (
        "Given a medical question and its answer, extract all factual triples "
        "as [subject, relation, object]. "
        "Return only a JSON array of triples."
    )
    qa_part = f"Question: {question}\nAnswer: {answer}"
    return (
        system_part
        + "\n\n"
        + qa_part
        + '\n\nTriples (JSON only, e.g. [["subj", "rel", "obj"], ...]):\n'
    )

question = "Does hypercholesterolemia increase leukotriene B4 in neutrophils?"
answer = "Hypercholesterolemia increases 5-LO activity in neutrophils..."

messages = [
    {"role": "system", "content": "You are an information extraction assistant."},
    {"role": "user", "content": build_prompt(question, answer)},
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    gen_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
    )

# Decode only the newly generated tokens; the prompt itself contains example
# brackets, which would otherwise confuse the JSON extraction below.
gen_only = gen_ids[0][inputs["input_ids"].shape[-1]:]
text = tokenizer.decode(gen_only, skip_special_tokens=True)

start = text.find("[")
end = text.rfind("]")
json_str = text[start:end + 1] if start != -1 and end != -1 else "[]"
triples = json.loads(json_str)
print(triples)
```

---

## Training details

This adapter was trained with the `DistillExtractor/train_extractor_sft.py` script in the MedRAGChecker codebase.

- **Teacher model:** GPT-4.1 as claim-triple annotator.
- **Training data:**
  - JSONL file `extractor_sft.jsonl` with fields:
    - `instruction`: system prompt + `Question:` + `Answer:` (from biomedical QA datasets and RAG outputs).
    - `output`: pure JSON array of `[subject, relation, object]` triples labeled by GPT-4.1.
  - Sources include consumer and research-style biomedical QA (e.g., MedQuAD, PubMedQA, LiveQA Medical, CSIRO MedRedQA, and AskDocs-style Reddit threads).
- **Preprocessing:**
  - Parse `Question:` and `Answer:` from the `instruction` field using regex.
  - Rebuild a canonical prompt with an explicit `Triples (JSON only, e.g. [["subj", "rel", "obj"], ...]):` header.
- **Fine-tuning setup (example):**
  - Epochs: `10`
  - Batch size: `1` with gradient accumulation `32` (effective batch size 32).
  - Max input length: `2048`.
  - Optimizer: AdamW, learning rate `1e-4`.
  - LoRA config: `r = 16`, `alpha = 32`, `dropout = 0.0`.
  - Precision: `bfloat16` on GPUs with `device_map="auto"`.

Example training command:

```bash
export WANDB_PROJECT=MedRAGChecker
export WANDB_NAME=extractor_<MODEL_NAME>

BASE=/path/to/<BASE_MODEL>

CUDA_VISIBLE_DEVICES=0,1,2,3 \
python DistillExtractor/train_extractor_sft.py \
  --model_name "$BASE" \
  --train_path ./data/extractor_sft.jsonl \
  --output_dir ./runs/extractor_sft_<MODEL_NAME> \
  --epochs 10 \
  --batch_size 1 \
  --grad_accum 32 \
  --lr 1e-4 \
  --bf16
```

Replace `<BASE_MODEL>` and `<MODEL_NAME>` with your actual base model path and a short name for the run.

---

## Evaluation

We evaluate on a held-out split of the same GPT-4.1-annotated dataset using two families of metrics:

1. **Strict triple match**
   - Normalize to lowercase and strip whitespace.
   - Treat each triple as a set element `(subject, relation, object)`.
   - Compute precision/recall/F1 on exact triple matches.
   - Also report the exact-match rate (all triples in an example match exactly).
2. **Soft triple match**
   - Tokenize subject, relation, and object.
   - Compute token-level F1 for each field between predicted and gold triples.
   - Aggregate into a per-triple similarity score.
   - Run greedy matching between predicted and gold triples by similarity.
   - Compute soft precision/recall/F1 from matched pairs.

Example metrics on a random subsample of `N = 200` examples for a Meditron3-8B-based extractor:

| Metric           | Value  |
|------------------|--------|
| strict_precision | 0.0890 |
| strict_recall    | 0.0930 |
| strict_f1        | 0.0900 |
| exact_match      | 0.0500 |
| soft_precision   | 0.2052 |
| soft_recall      | 0.2598 |
| soft_f1          | 0.2148 |

These numbers illustrate that:

- the model is far from perfect at exact triple reconstruction;
- soft matching shows it still captures many approximate facts, which is often sufficient for downstream diagnostics in MedRAGChecker.

You can reproduce these metrics (and compute new ones for other checkpoints) with the evaluation script:

```bash
python DistillExtractor/run_extractor_eval_soft.py \
  --base_model <BASE_MODEL> \
  --adapter_path <ADAPTER_PATH> \
  --data_path ./data/extractor_sft.jsonl \
  --output_path ./results/extractor_soft_<MODEL_NAME>.json \
  --num_examples 200
```

---

## Limitations and risks

- The adapter inherits all limitations and biases of the base model and the GPT-4.1 teacher.
- Extracted triples may still be incomplete, redundant, or slightly rephrased.
- The model is optimized for **English biomedical text**; performance on other domains or languages is likely poor.
- Do **not** use this model (or its extracted triples) directly for patient-facing decisions or clinical care without expert validation.

---

## Citation

If you use this adapter or MedRAGChecker in your work, please consider citing our paper (details to be updated):

```bibtex
@inproceedings{ji2025medragchecker,
  title     = {MedRAGChecker: Claim-level Verification for Biomedical Retrieval-Augmented Generation},
  author    = {Ji, Yuelyu and collaborators},
  booktitle = {Proceedings of a future venue},
  year      = {2025}
}
```

---

## License

- This adapter is released under the same license terms as the corresponding base model `<BASE_MODEL>`.
- You must accept and comply with the license of the base model before using this LoRA.