Qwen3-4B Medical QA

This is a medical question-answering model based on Qwen/Qwen3-4B-Instruct-2507. It was fine-tuned with LoRA (using DoRA) on medical QA datasets, and the adapter was then merged into the base weights to produce a single standalone model.

Model Details

Base Model

  • Base Model: Qwen/Qwen3-4B-Instruct-2507
  • Model Type: Causal Language Model (fully merged)
  • Parameters: 4.02B
  • Architecture: Qwen3ForCausalLM
  • Precision: BFloat16
  • Context Length: 262,144 tokens
  • License: Same as base model

Fine-tuning Details

  • Method: LoRA with DoRA (Weight-Decomposed Low-Rank Adaptation)
  • LoRA Rank: 64
  • LoRA Alpha: 64
  • LoRA Dropout: 0.1
  • Target Modules: q_proj, k_proj, v_proj, o_proj
  • Training Framework: LLaMA-Factory
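
For reference, the adapter setup above corresponds roughly to the following PEFT LoraConfig. This is an illustrative sketch; the actual run was configured through LLaMA-Factory, so the field names here are PEFT's, not the original config's.

from peft import LoraConfig

# Sketch of the adapter configuration listed above (illustrative only).
lora_config = LoraConfig(
    r=64,                    # LoRA rank
    lora_alpha=64,           # scaling factor
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_dora=True,           # Weight-Decomposed Low-Rank Adaptation
    task_type="CAUSAL_LM",
)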

Performance

  • Validation Accuracy: 82.18%
  • Validation Loss: 0.7984
  • Training Dataset: combined_selected_train (medical QA)

Training Details

Training Hyperparameters

  • Learning Rate: 3e-4
  • LR Scheduler: constant_with_warmup
  • Optimizer: AdamW (fused)
  • Number of Epochs: 3.0
  • Total Batch Size: 48 (distributed across 48 GPUs)
  • Per-device Train Batch Size: 1
  • Per-device Eval Batch Size: 8
  • Seed: 42
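
The run itself was driven by LLaMA-Factory, but the hyperparameters above translate roughly into transformers' TrainingArguments as follows (a sketch, not the original config; the output_dir is hypothetical):

from transformers import TrainingArguments

# Approximate TrainingArguments equivalent of the settings above.
training_args = TrainingArguments(
    output_dir="qwen3-4b-medical-qa",          # hypothetical output path
    learning_rate=3e-4,
    lr_scheduler_type="constant_with_warmup",
    optim="adamw_torch_fused",                 # fused AdamW
    num_train_epochs=3.0,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=8,
    bf16=True,                                 # matches the BFloat16 precision above
    seed=42,
)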

Training Results

Training Loss | Epoch | Step | Validation Loss | Accuracy
------------- | ----- | ---- | --------------- | --------
1.0377        | 1.18  | 20   | 1.1270          | 0.7382
0.6478        | 2.35  | 40   | 0.8764          | 0.7388
-             | 3.00  | 51   | 0.7984          | 0.8218

Framework Versions

  • PEFT: 0.15.2
  • Transformers: 4.55.0
  • PyTorch: 2.8.0+cu128
  • Datasets: 3.6.0
  • Tokenizers: 0.21.1

Usage

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "Acryl-Jonathan-01/qwen3-4b-medical-qa-merged",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(
    "Acryl-Jonathan-01/qwen3-4b-medical-qa-merged",
    trust_remote_code=True
)

# Prepare messages
messages = [
    {"role": "system", "content": "You are a helpful medical assistant."},
    {"role": "user", "content": "What are the common symptoms of pneumonia?"}
]

# Apply the chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate a response
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.8,
    top_k=20
)

# Decode only the newly generated tokens (exclude the prompt)
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True
)
print(response)

Using with vLLM (Faster Inference)

from vllm import LLM, SamplingParams

# Initialize vLLM
llm = LLM(
    model="Acryl-Jonathan-01/qwen3-4b-medical-qa-merged",
    trust_remote_code=True,
    dtype="bfloat16"
)

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    max_tokens=512
)

# Generate
prompts = ["What are the symptoms of diabetes?"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)

Using with Ollama

A Modelfile is included for easy deployment with Ollama:

# Create Ollama model
ollama create qwen3-medical -f Modelfile

# Run the model
ollama run qwen3-medical "What are the symptoms of hypertension?"

Quantization (Optional)

For resource-constrained environments, you can quantize the model:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype="bfloat16",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    "Acryl-Jonathan-01/qwen3-4b-medical-qa-merged",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

Intended Use

This model is designed for:

  • Medical question-answering in educational contexts
  • Medical knowledge exploration and learning
  • Research in medical AI and NLP applications
  • Prototype development for medical chatbots and assistants

Limitations & Warnings

IMPORTANT DISCLAIMERS:

  • โš ๏ธ NOT for clinical use: This model should NEVER be used for actual clinical decision-making, diagnosis, or treatment without qualified medical professional oversight
  • โš ๏ธ Educational purposes only: Intended for education, research, and development purposes
  • โš ๏ธ May contain errors: The model can generate incorrect or outdated medical information
  • โš ๏ธ Bias: May inherit biases from training data
  • โš ๏ธ Hallucination: Like all LLMs, may generate plausible-sounding but incorrect information
  • โš ๏ธ Not a replacement: Always consult qualified healthcare professionals for medical advice

Performance Limitations

  • Performance may vary on out-of-distribution medical questions
  • Better suited for common medical topics in the training distribution
  • May struggle with very recent medical developments (knowledge cutoff)
  • Accuracy is dataset-dependent and not guaranteed

Model Size & Requirements

  • Model Size: ~7.6GB (BFloat16)
  • Recommended VRAM: 16GB+ for inference
  • Quantized (4-bit): ~2-3GB VRAM
  • CPU Inference: Possible but slow
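
The ~7.6GB figure follows directly from the parameter count: BFloat16 stores each parameter in 2 bytes, so a quick back-of-the-envelope check looks like this:

# Rough BF16 memory estimate from the parameter count above.
params = 4.02e9              # 4.02B parameters
bytes_per_param = 2          # BFloat16 = 16 bits
print(f"{params * bytes_per_param / 1024**3:.1f} GiB")  # ~7.5 GiB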

Files Structure

qwen3-4b-medical-qa-merged/
├── model-00001-of-00005.safetensors  # Model weights (shard 1)
├── model-00002-of-00005.safetensors  # Model weights (shard 2)
├── model-00003-of-00005.safetensors  # Model weights (shard 3)
├── model-00004-of-00005.safetensors  # Model weights (shard 4)
├── model-00005-of-00005.safetensors  # Model weights (shard 5)
├── model.safetensors.index.json      # Shard index
├── config.json                       # Model configuration
├── generation_config.json            # Generation settings
├── tokenizer.json                    # Tokenizer
├── tokenizer_config.json             # Tokenizer config
├── vocab.json                        # Vocabulary
├── merges.txt                        # BPE merges
├── chat_template.jinja               # Chat template
├── special_tokens_map.json           # Special tokens
├── added_tokens.json                 # Added tokens
├── Modelfile                         # Ollama modelfile
└── README.md                         # This file

Comparison: LoRA Adapter vs Merged Model

Advantages of merged model:

  • ✅ Easier to use (no need to load the adapter separately)
  • ✅ Faster inference (no adapter overhead)
  • ✅ Compatible with more inference engines (vLLM, Ollama, etc.)
  • ✅ Can be quantized directly

Disadvantages:

  • โŒ Larger file size (~7.6GB vs ~182MB for adapter)
  • โŒ Less flexible (can't swap adapters easily)
  • โŒ Takes more storage space

Citation

If you use this model in your research or applications, please cite:

@misc{qwen3-4b-medical-qa-merged,
  author = {Your Name},
  title = {Qwen3-4B Medical QA},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Acryl-Jonathan-01/qwen3-4b-medical-qa-merged}}
}

Also consider citing the base model:

@misc{qwen3-2025,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  year={2025},
  publisher={Alibaba Cloud},
  url={https://huggingface.co/Qwen}
}

Acknowledgments

This model builds on Qwen/Qwen3-4B-Instruct-2507 from the Qwen team and was trained with the LLaMA-Factory framework.

License

This model inherits the license from the base model Qwen/Qwen3-4B-Instruct-2507. Please refer to the base model's license for usage terms and conditions.

Contact & Support

For issues, questions, or contributions, please open a discussion on the model's Hugging Face page.

