# MedRAGChecker Claim Extractor · LoRA Adapter
Biomedical claim-triple extractor fine-tuned from a medical LLM using GPT-4.1 teacher labels.
This adapter is part of the MedRAGChecker pipeline for claim-level verification in biomedical RAG.
Task: given a medical question and its answer, extract factual triples of the form
`[subject, relation, object]` as a pure JSON array.
## Model summary
- Base model: `<BASE_MODEL_ID>` (for example: `med42-llama3-8b`, `Meditron3-8B`, `PMC_LLaMA_13B`, or `qwen2-med-7b`)
- Adapter type: LoRA (rank = 16, alpha = 32, dropout = 0.0) via PEFT
- Architecture: same as base causal LM (LLaMA-style or Qwen-style)
- Task: biomedical claim triple extraction
- Input: question text + model answer (plain text)
- Output: JSON array of triples, e.g.

  ```json
  [
    ["Psoriasis", "is", "chronic inflammatory skin disease"],
    ["Psoriasis", "is associated with", "systemic comorbidities"]
  ]
  ```
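For reference, the LoRA hyperparameters listed above correspond roughly to a PEFT configuration like the sketch below. The `target_modules` entry is our assumption (typical attention projections for LLaMA/Qwen-style models), not necessarily the exact training configuration.

```python
from peft import LoraConfig

# Sketch of the adapter configuration implied by the summary above.
# target_modules is an assumption and may differ from the actual training setup.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```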
You can either:
- keep one Hugging Face repo per adapter (recommended), or
- store several adapters in one repo and refer to specific subfolders.
Replace `<BASE_MODEL_ID>` and any placeholder names below with your actual base model and repo id (for example: `JoyDaJun/MedRAGChecker-Extractor-Meditron3-8B`).
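If you go the multi-adapter route, PEFT can load an adapter from a subfolder of a repo. A minimal sketch (the repo id and subfolder name here are made-up examples):

```python
from peft import PeftModel

# Hypothetical layout: one repo with one subfolder per base model.
model = PeftModel.from_pretrained(
    model,                                # already-loaded base model
    "JoyDaJun/MedRAGChecker-Extractors",  # example multi-adapter repo id
    subfolder="meditron3-8b",             # example subfolder name
)
```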
## Intended use
- Post-hoc analysis of biomedical QA systems at claim level.
- Use inside a RAG or QA evaluation pipeline to:
  - extract atomic factual statements from a generated answer;
  - feed those triples to a checker model (e.g. MedRAGChecker NLI+KG).
This adapter is not a general-purpose chat model and must not be used as a standalone medical assistant.
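At a high level, the extractor slots into a claim-level checking loop like the following hypothetical sketch. `extract_triples` and `checker.verify` are placeholders for your own extraction call and checker interface; neither is defined in this card.

```python
# Hypothetical glue code: names below are placeholders, not MedRAGChecker APIs.
def check_answer(question: str, answer: str, checker) -> list[dict]:
    triples = extract_triples(question, answer)   # LoRA extractor (see "How to use")
    report = []
    for subj, rel, obj in triples:
        verdict = checker.verify(subj, rel, obj)  # e.g. NLI + KG verification
        report.append({"triple": [subj, rel, obj], "verdict": verdict})
    return report
```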
## How to use
### 1. LLaMA-style base models (Meditron, Med42, PMC-LLaMA, etc.)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch, json

base_model_id = "<BASE_MODEL_ID>"  # e.g. "med42-llama3-8b"
adapter_id = "<ADAPTER_REPO_ID>"   # e.g. "JoyDaJun/MedRAGChecker-Extractor-Med42-8B"

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter_id)

def build_prompt(question: str, answer: str) -> str:
    system_part = (
        "You are an information extraction assistant. "
        "Given a medical question and its answer, extract all factual triples "
        "as [subject, relation, object]. "
        "Return a pure JSON array of triples, with no explanations, no extra text, "
        "no comments. If there are no clear factual triples, return an empty JSON array []."
    )
    qa_part = f"Question: {question}\nAnswer: {answer}"
    return (
        system_part
        + "\n\n"
        + qa_part
        + '\n\nTriples (JSON only, e.g. [["subj", "rel", "obj"], ...]):\n'
    )

question = "Does hypercholesterolemia increase leukotriene B4 in neutrophils?"
answer = "Hypercholesterolemia increases 5-LO activity in neutrophils..."

prompt = build_prompt(question, answer)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    gen_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
    )

# Decode only the newly generated tokens so the prompt's own example
# brackets are not mistaken for the output JSON.
new_tokens = gen_ids[0][inputs["input_ids"].shape[1]:]
text = tokenizer.decode(new_tokens, skip_special_tokens=True)

# Optional: keep only the JSON array
start = text.find("[")
end = text.rfind("]")
json_str = text[start:end + 1] if start != -1 and end != -1 else "[]"
triples = json.loads(json_str)
print(triples)
```
### 2. Chat-style base models (Qwen2-med, etc.)
For chat-style models, wrap the same prompt inside the chat template.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch, json

base_model_id = "<QWEN_BASE_MODEL_ID>"  # e.g. "qwen2-med-7b"
adapter_id = "<ADAPTER_REPO_ID_QWEN>"   # e.g. "JoyDaJun/MedRAGChecker-Extractor-Qwen2-med-7B"

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter_id)

def build_prompt(question: str, answer: str) -> str:
    system_part = (
        "Given a medical question and its answer, extract all factual triples "
        "as [subject, relation, object]. "
        "Return only a JSON array of triples."
    )
    qa_part = f"Question: {question}\nAnswer: {answer}"
    return system_part + "\n\n" + qa_part + '\n\nTriples (JSON only, e.g. [["subj", "rel", "obj"], ...]):\n'

question = "Does hypercholesterolemia increase leukotriene B4 in neutrophils?"
answer = "Hypercholesterolemia increases 5-LO activity in neutrophils..."

messages = [
    {"role": "system", "content": "You are an information extraction assistant."},
    {"role": "user", "content": build_prompt(question, answer)},
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    gen_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
    )

# Decode only the newly generated tokens, then keep only the JSON array.
new_tokens = gen_ids[0][inputs["input_ids"].shape[1]:]
text = tokenizer.decode(new_tokens, skip_special_tokens=True)

start = text.find("[")
end = text.rfind("]")
json_str = text[start:end + 1] if start != -1 and end != -1 else "[]"
triples = json.loads(json_str)
print(triples)
```
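If you call the extractor in bulk, it helps to make the JSON post-processing defensive. A minimal sketch of a parsing helper (the name and behavior are ours, not part of the MedRAGChecker code); in both setups above you can replace the manual slicing with `triples = parse_triples(text)`:

```python
import json

def parse_triples(text: str) -> list:
    """Extract a list of [subject, relation, object] triples from raw model output."""
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end == -1:
        return []
    try:
        parsed = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return []
    # Keep only well-formed 3-element triples.
    return [t for t in parsed if isinstance(t, list) and len(t) == 3]
```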
## Training details
This adapter was trained with the `DistillExtractor/train_extractor_sft.py` script in the MedRAGChecker codebase.
- Teacher model: GPT-4.1 as claim-triple annotator.
- Training data:
  - JSONL file `extractor_sft.jsonl` with fields (see the example record after this list):
    - `instruction`: system prompt + `Question:` + `Answer:` (from biomedical QA datasets and RAG outputs).
    - `output`: pure JSON array of `[subject, relation, object]` triples labeled by GPT-4.1.
  - Sources include consumer and research-style biomedical QA (e.g., MedQuAD, PubMedQA, LiveQA Medical, CSIRO MedRedQA, and AskDocs-style Reddit threads).
- Preprocessing:
  - Parse `Question:` and `Answer:` from the `instruction` field using regex.
  - Rebuild a canonical prompt with an explicit `Triples (JSON only, e.g. [["subj", "rel", "obj"], ...]):` header.
- Fine-tuning setup (example):
  - Epochs: 10
  - Batch size: 1 with gradient accumulation 32 (effective batch size 32).
  - Max input length: 2048.
  - Optimizer: AdamW, learning rate 1e-4.
  - LoRA config: `r = 16`, `alpha = 32`, `dropout = 0.0`.
  - Precision: bfloat16 on GPUs with `device_map="auto"`.
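For concreteness, a record in `extractor_sft.jsonl` looks roughly like the following. This is an illustrative example we wrote for this card, not an actual row from the training file, and the `output` field is shown as a JSON-encoded string:

```json
{"instruction": "You are an information extraction assistant. ... Question: Does hypercholesterolemia increase leukotriene B4 in neutrophils? Answer: Hypercholesterolemia increases 5-LO activity in neutrophils...", "output": "[[\"hypercholesterolemia\", \"increases\", \"5-LO activity in neutrophils\"]]"}
```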
Example training command:
```bash
export WANDB_PROJECT=MedRAGChecker
export WANDB_NAME=extractor_<BASE_NAME>

BASE=/path/to/<BASE_MODEL_ID>

CUDA_VISIBLE_DEVICES=0,1,2,3 \
python DistillExtractor/train_extractor_sft.py \
  --model_name "$BASE" \
  --train_path ./data/extractor_sft.jsonl \
  --output_dir ./runs/extractor_sft_<BASE_NAME> \
  --epochs 10 \
  --batch_size 1 \
  --grad_accum 32 \
  --lr 1e-4 \
  --bf16
```
Replace `<BASE_MODEL_ID>` and `<BASE_NAME>` with your actual base model and a short name for it.
## Evaluation
We evaluate on a held-out split of the same GPT-4.1-annotated dataset using two families of metrics:
### Strict triple match
- Normalize to lowercase and strip whitespace.
- Treat each triple as a set element `(subject, relation, object)`.
- Compute precision/recall/F1 on exact triple matches.
- Also report exact match rate (all triples in an example match exactly).
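A minimal sketch of the strict metrics (our own re-implementation of the description above, not the project's evaluation code):

```python
def normalize(triple):
    # Lowercase and strip whitespace from each field.
    return tuple(x.lower().strip() for x in triple)

def strict_prf(pred, gold):
    """Exact-match precision/recall/F1 over normalized (subject, relation, object) sets."""
    pred_set = {normalize(t) for t in pred}
    gold_set = {normalize(t) for t in gold}
    tp = len(pred_set & gold_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```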
### Soft triple match
- Tokenize subject, relation, and object.
- Compute token-level F1 for each field between predicted and gold triples.
- Aggregate into a per-triple similarity score.
- Run greedy matching between predicted and gold triples by similarity.
- Compute soft precision/recall/F1 from matched pairs.
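And a sketch of the soft-matching idea (again our own illustration; the exact scoring in `run_extractor_eval_soft.py` may differ):

```python
def token_f1(a: str, b: str) -> float:
    """Token-overlap F1 between two strings."""
    ta, tb = a.lower().split(), b.lower().split()
    common = sum(min(ta.count(w), tb.count(w)) for w in set(ta))
    if not common:
        return 0.0
    p, r = common / len(ta), common / len(tb)
    return 2 * p * r / (p + r)

def triple_similarity(pred, gold) -> float:
    """Average field-level token F1 over subject, relation, and object."""
    return sum(token_f1(p, g) for p, g in zip(pred, gold)) / 3

def soft_prf(pred, gold):
    """Greedy one-to-one matching of predicted and gold triples by similarity."""
    pairs = sorted(
        ((triple_similarity(p, g), i, j) for i, p in enumerate(pred) for j, g in enumerate(gold)),
        reverse=True,
    )
    used_p, used_g, score = set(), set(), 0.0
    for sim, i, j in pairs:
        if i not in used_p and j not in used_g:
            used_p.add(i)
            used_g.add(j)
            score += sim
    precision = score / len(pred) if pred else 0.0
    recall = score / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```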
Example metrics on a random subsample of N = 200 examples for a Meditron3-8B-based extractor:
| Metric | Value |
|---|---|
| strict_precision | 0.0890 |
| strict_recall | 0.0930 |
| strict_f1 | 0.0900 |
| exact_match | 0.0500 |
| soft_precision | 0.2052 |
| soft_recall | 0.2598 |
| soft_f1 | 0.2148 |
These numbers illustrate that:
- the model is far from perfect at exact triple reconstruction;
- soft matching shows it still captures many approximate facts, which is often sufficient for downstream diagnostics in MedRAGChecker.
You can reproduce these metrics (and compute new ones for other checkpoints) with the evaluation script:
```bash
python DistillExtractor/run_extractor_eval_soft.py \
  --base_model <BASE_MODEL_ID> \
  --adapter_path <ADAPTER_REPO_OR_LOCAL_PATH> \
  --data_path ./data/extractor_sft.jsonl \
  --output_path ./results/extractor_soft_<BASE_NAME>.json \
  --num_examples 200
```
## Limitations and risks
- The adapter inherits all limitations and biases of the base model and GPT-4.1 teacher.
- Extracted triples may still be incomplete, redundant, or slightly rephrased.
- The model is optimized for English biomedical text; performance on other domains or languages is likely poor.
- Do not use this model (or its extracted triples) directly for patient-facing decisions or clinical care without expert validation.
## Citation
If you use this adapter or MedRAGChecker in your work, please consider citing our paper (details to be updated):
```bibtex
@inproceedings{ji2025medragchecker,
  title     = {MedRAGChecker: Claim-level Verification for Biomedical Retrieval-Augmented Generation},
  author    = {Ji, Yuelyu and collaborators},
  booktitle = {Proceedings of a future venue},
  year      = {2025}
}
```
## License
- This adapter is released under the same license terms as the corresponding base model `<BASE_MODEL_ID>`.
- You must accept and comply with the license of the base model before using this LoRA.