Qwen3-4B Medical QA

This is a medical question-answering model based on Qwen/Qwen3-4B-Instruct-2507. It was fine-tuned with LoRA (using DoRA) on medical QA datasets, and the adapter was then merged into the base weights to produce a single standalone model.

Model Details

Base Model

  • Base Model: Qwen/Qwen3-4B-Instruct-2507
  • Model Type: Causal Language Model (fully merged)
  • Parameters: 4.02B
  • Architecture: Qwen3ForCausalLM
  • Precision: BFloat16
  • Context Length: 262,144 tokens
  • License: Same as base model

Fine-tuning Details

  • Method: LoRA with DoRA (Weight-Decomposed Low-Rank Adaptation)
  • LoRA Rank: 64
  • LoRA Alpha: 64
  • LoRA Dropout: 0.1
  • Target Modules: q_proj, k_proj, v_proj, o_proj
  • Training Framework: LLaMA-Factory
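
For reference, the adapter setup above corresponds roughly to the following PEFT LoraConfig. This is an illustrative sketch; the actual run was configured through LLaMA-Factory, so the field names here are PEFT's, not the original config's.

from peft import LoraConfig

# Sketch of the adapter configuration listed above (illustrative only).
lora_config = LoraConfig(
    r=64,                    # LoRA rank
    lora_alpha=64,           # scaling factor
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_dora=True,           # Weight-Decomposed Low-Rank Adaptation
    task_type="CAUSAL_LM",
)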

Performance

  • Validation Accuracy: 82.18%
  • Validation Loss: 0.7984
  • Training Dataset: combined_selected_train (medical QA)

Training Details

Training Hyperparameters

  • Learning Rate: 3e-4
  • LR Scheduler: constant_with_warmup
  • Optimizer: AdamW (fused)
  • Number of Epochs: 3.0
  • Total Batch Size: 48 (distributed across 48 GPUs)
  • Per-device Train Batch Size: 1
  • Per-device Eval Batch Size: 8
  • Seed: 42
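
The run itself was driven by LLaMA-Factory, but the hyperparameters above translate roughly into transformers' TrainingArguments as follows (a sketch, not the original config; the output_dir is hypothetical):

from transformers import TrainingArguments

# Approximate TrainingArguments equivalent of the settings above.
training_args = TrainingArguments(
    output_dir="qwen3-4b-medical-qa",          # hypothetical output path
    learning_rate=3e-4,
    lr_scheduler_type="constant_with_warmup",
    optim="adamw_torch_fused",                 # fused AdamW
    num_train_epochs=3.0,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=8,
    bf16=True,                                 # matches the BFloat16 precision above
    seed=42,
)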

Training Results

Training Loss | Epoch | Step | Validation Loss | Accuracy
------------- | ----- | ---- | --------------- | --------
1.0377        | 1.18  | 20   | 1.1270          | 0.7382
0.6478        | 2.35  | 40   | 0.8764          | 0.7388
-             | 3.00  | 51   | 0.7984          | 0.8218

Framework Versions

  • PEFT: 0.15.2
  • Transformers: 4.55.0
  • PyTorch: 2.8.0+cu128
  • Datasets: 3.6.0
  • Tokenizers: 0.21.1

Usage

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "Acryl-Jonathan-01/qwen3-4b-medical-qa-merged",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(
    "Acryl-Jonathan-01/qwen3-4b-medical-qa-merged",
    trust_remote_code=True
)

# Prepare messages
messages = [
    {"role": "system", "content": "You are a helpful medical assistant."},
    {"role": "user", "content": "What are the common symptoms of pneumonia?"}
]

# Apply the chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate a response
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.8,
    top_k=20
)

# Decode only the newly generated tokens (exclude the prompt)
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True
)
print(response)

Using with vLLM (Faster Inference)

from vllm import LLM, SamplingParams

# Initialize vLLM
llm = LLM(
    model="Acryl-Jonathan-01/qwen3-4b-medical-qa-merged",
    trust_remote_code=True,
    dtype="bfloat16"
)

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    max_tokens=512
)

# Generate
prompts = ["What are the symptoms of diabetes?"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)

Using with Ollama

A Modelfile is included for easy deployment with Ollama:

# Create Ollama model
ollama create qwen3-medical -f Modelfile

# Run the model
ollama run qwen3-medical "What are the symptoms of hypertension?"

Quantization (Optional)

For resource-constrained environments, you can quantize the model:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype="bfloat16",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    "Acryl-Jonathan-01/qwen3-4b-medical-qa-merged",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

Intended Use

This model is designed for:

  • Medical question-answering in educational contexts
  • Medical knowledge exploration and learning
  • Research in medical AI and NLP applications
  • Prototype development for medical chatbots and assistants

Limitations & Warnings

IMPORTANT DISCLAIMERS:

  • โš ๏ธ NOT for clinical use: This model should NEVER be used for actual clinical decision-making, diagnosis, or treatment without qualified medical professional oversight
  • โš ๏ธ Educational purposes only: Intended for education, research, and development purposes
  • โš ๏ธ May contain errors: The model can generate incorrect or outdated medical information
  • โš ๏ธ Bias: May inherit biases from training data
  • โš ๏ธ Hallucination: Like all LLMs, may generate plausible-sounding but incorrect information
  • โš ๏ธ Not a replacement: Always consult qualified healthcare professionals for medical advice

Performance Limitations

  • Performance may vary on out-of-distribution medical questions
  • Better suited for common medical topics in the training distribution
  • May struggle with very recent medical developments (knowledge cutoff)
  • Accuracy is dataset-dependent and not guaranteed

Model Size & Requirements

  • Model Size: ~7.6GB (BFloat16)
  • Recommended VRAM: 16GB+ for inference
  • Quantized (4-bit): ~2-3GB VRAM
  • CPU Inference: Possible but slow
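
The ~7.6GB figure follows directly from the parameter count: BFloat16 stores each parameter in 2 bytes, so a quick back-of-the-envelope check looks like this:

# Rough BF16 memory estimate from the parameter count above.
params = 4.02e9              # 4.02B parameters
bytes_per_param = 2          # BFloat16 = 16 bits
print(f"{params * bytes_per_param / 1024**3:.1f} GiB")  # ~7.5 GiB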

Files Structure

qwen3-4b-medical-qa-merged/
├── model-00001-of-00005.safetensors  # Model weights (shard 1)
├── model-00002-of-00005.safetensors  # Model weights (shard 2)
├── model-00003-of-00005.safetensors  # Model weights (shard 3)
├── model-00004-of-00005.safetensors  # Model weights (shard 4)
├── model-00005-of-00005.safetensors  # Model weights (shard 5)
├── model.safetensors.index.json      # Shard index
├── config.json                       # Model configuration
├── generation_config.json            # Generation settings
├── tokenizer.json                    # Tokenizer
├── tokenizer_config.json             # Tokenizer config
├── vocab.json                        # Vocabulary
├── merges.txt                        # BPE merges
├── chat_template.jinja               # Chat template
├── special_tokens_map.json           # Special tokens
├── added_tokens.json                 # Added tokens
├── Modelfile                         # Ollama modelfile
└── README.md                         # This file

Comparison: LoRA Adapter vs Merged Model

Advantages of merged model:

  • ✅ Easier to use (no need to load the adapter separately)
  • ✅ Faster inference (no adapter overhead)
  • ✅ Compatible with more inference engines (vLLM, Ollama, etc.)
  • ✅ Can be quantized directly

Disadvantages:

  • โŒ Larger file size (~7.6GB vs ~182MB for adapter)
  • โŒ Less flexible (can't swap adapters easily)
  • โŒ Takes more storage space

Citation

If you use this model in your research or applications, please cite:

@misc{qwen3-4b-medical-qa-merged,
  author = {Your Name},
  title = {Qwen3-4B Medical QA},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Acryl-Jonathan-01/qwen3-4b-medical-qa-merged}}
}

Also consider citing the base model:

@misc{qwen3-2025,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  year={2025},
  publisher={Alibaba Cloud},
  url={https://huggingface.co/Qwen}
}

Acknowledgments

This model builds on Qwen/Qwen3-4B-Instruct-2507 from the Qwen team and was trained with the LLaMA-Factory framework.

License

This model inherits the license from the base model Qwen/Qwen3-4B-Instruct-2507. Please refer to the base model's license for usage terms and conditions.

Contact & Support

For issues, questions, or contributions, please open a discussion on the model's Hugging Face page.

