Fine-tuning with HuggingFace: LoRA, QLoRA, PEFT - Comprehensive Guide (Tutorial #3)
Author: AYI-NEDJIMI | AI & Cybersecurity Consultant
This tutorial covers LLM fine-tuning in depth with HuggingFace: full fine-tuning, LoRA, QLoRA, instruction dataset preparation, BitsAndBytes configuration, SFTTrainer, monitoring, saving, adapter merging, GGUF quantization, and AutoTrain.
For our complete fine-tuning guide, check: Fine-tuning LLM with LoRA and QLoRA
For production deployment, check: Deploy LLM in Production with GPU
For quantization details, check: Quantization GPTQ, GGUF, AWQ
1. What is Fine-tuning?
Fine-tuning is the process of adapting a pre-trained model to a specific task by continuing training on your data.
1.1 Types of Fine-tuning
| Method | Parameters Trained | VRAM Required | Quality |
|---|---|---|---|
| Full Fine-tuning | All (100%) | Very high (80+ GB) | Maximum |
| LoRA | 0.1-1% | Moderate (16-24 GB) | Very good |
| QLoRA | 0.1-1% (quantized model) | Low (8-16 GB) | Good |
| Prefix Tuning | Prefixes only | Low | Fair |
| Prompt Tuning | Soft prompts | Very low | Variable |
1.2 LoRA in Detail
LoRA (Low-Rank Adaptation) decomposes weight updates into two small matrices:
W' = W + BA
where:
- W is the original weight matrix (frozen)
- B is of size (d x r) with r << d
- A is of size (r x k) with r << k
- r is the rank (typically 8-64); in practice the update BA is additionally scaled by alpha/r (the lora_alpha parameter configured later)
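To make the savings concrete, here is a back-of-the-envelope calculation for a hypothetical 4096x4096 projection matrix with r=16 (illustrative dimensions, not tied to any specific model):

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one d x k weight."""
    full = d * k          # fine-tuning W directly
    lora = r * (d + k)    # B is (d x r), A is (r x k)
    return full, lora

full, lora = lora_param_counts(4096, 4096, 16)
print(full, lora, f"{100 * lora / full:.2f}%")
# 16,777,216 full params vs 131,072 LoRA params -> ~0.78% of the original
```

Summed over every targeted matrix in every layer, this is where the ~99% reduction comes from.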
Advantages:
- 99% reduction in trainable parameters
- Speed: 2-5x faster than full fine-tuning
- Memory: works on consumer GPUs
- Modular: adapters are interchangeable
1.3 QLoRA in Detail
QLoRA combines LoRA with 4-bit quantization:
- Base model is loaded in 4-bit NormalFloat (NF4)
- LoRA adapters are trained in bfloat16
- Result: fine-tune a 65B-parameter model on a single 48 GB GPU (the headline result of the QLoRA paper)
2. When to Fine-tune vs RAG vs Prompting?
| Approach | When to Use | Advantages | Disadvantages |
|---|---|---|---|
| Prompting | General tasks, prototyping | Fast, no training needed | Limited by context |
| RAG | Need specific/recent knowledge | No training, up-to-date data | Latency, complexity |
| Fine-tuning | Specific style/format, specialized task | Maximum quality, fast inference | Costly, data needed |
Recommendation: Start with prompting, then RAG, then fine-tuning if necessary.
3. Preparing Instruction Datasets
3.1 Instruction Format
from datasets import Dataset
# Standard instruction-input-output format
data = {
    "instruction": [
        "Classify this security alert by type and severity.",
        "Generate an incident report from these logs.",
        "Explain this CVE vulnerability in simple terms.",
        "Propose remediation measures for this threat.",
        "Analyze this network traffic and identify anomalies.",
    ],
    "input": [
        "50 failed SSH login attempts in 2 minutes from IP 185.220.101.45",
        "2026-01-15 03:22:15 ALERT: Outbound connection to known C2 server 45.33.32.156:443",
        "CVE-2026-1234: Buffer overflow in OpenSSL 3.2.0 allowing remote code execution",
        "LockBit ransomware detected on workstation DESKTOP-HR042, .locked files",
        "DNS traffic spike: 10000 queries/min to randomly generated DGA domains",
    ],
    "output": [
        "Type: SSH Brute Force | Severity: HIGH | Action: Block source IP, verify access",
        "INCIDENT REPORT\nDate: 01/15/2026 03:22\nType: C2 Communication\nSuspicious IP: 45.33.32.156\nAction: Isolate machine, analyze malware",
        "This vulnerability allows a remote attacker to execute code on your server by exploiting a buffer overflow in OpenSSL. Update immediately.",
        "1. Isolate infected workstation from network\n2. Identify infection vector\n3. Restore from backups\n4. Analyze logs to identify propagation\n5. Update antivirus signatures",
        "Anomaly detected: Massive DNS queries to DGA (Domain Generation Algorithm) domains. Probable malware/botnet indicator. Recommendation: block domains, isolate source machines.",
    ],
}
dataset = Dataset.from_dict(data)
print(f"Dataset: {len(dataset)} examples")
3.2 Chat/Conversational Format
chat_data = {
    "messages": [
        [
            {"role": "system", "content": "You are a cybersecurity expert. Respond precisely and technically."},
            {"role": "user", "content": "What is a SQL injection attack?"},
            {"role": "assistant", "content": "SQL injection is an attack technique that exploits input validation flaws in web applications. The attacker inserts malicious SQL code into input fields to manipulate the database. Example: ' OR '1'='1 in a login field can bypass authentication."},
        ],
    ]
}
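Each model family defines its own chat template, and in practice `tokenizer.apply_chat_template(messages, tokenize=False)` renders a messages list into the model's exact format. As a model-agnostic sketch of what such a template does (the `<|role|>` tags below are illustrative, not any real model's markers):

```python
def render_chat(messages: list[dict]) -> str:
    """Flatten a messages list into one training string.
    The <|role|> tags are illustrative; real models define their own template."""
    parts = [f"<|{m['role']}|>\n{m['content']}" for m in messages]
    return "\n".join(parts) + "\n<|end|>"

example = [
    {"role": "user", "content": "What is a SQL injection attack?"},
    {"role": "assistant", "content": "An attack that injects SQL code through unvalidated inputs."},
]
print(render_chat(example))
```

Always use the tokenizer's own template for real training: a model fine-tuned with the wrong role markers will not respond correctly at inference.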
3.3 Format for SFTTrainer
def format_instruction(example):
    """Format an example for instruction fine-tuning."""
    if example.get("input", ""):
        text = f"""### Instruction:
{example['instruction']}
### Input:
{example['input']}
### Response:
{example['output']}"""
    else:
        text = f"""### Instruction:
{example['instruction']}
### Response:
{example['output']}"""
    return {"text": text}
formatted_dataset = dataset.map(format_instruction)
print(formatted_dataset[0]['text'])
4. BitsAndBytesConfig (4-bit, 8-bit)
from transformers import BitsAndBytesConfig
import torch
# 4-bit configuration (QLoRA)
bnb_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,                      # Load in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat 4-bit (better than FP4)
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute in bfloat16
    bnb_4bit_use_double_quant=True,         # Double quantization (saves ~0.4 bits/param)
)
# 8-bit configuration
bnb_config_8bit = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
)
# Load quantized model
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "meta-llama/Meta-Llama-3.1-8B"
# model = AutoModelForCausalLM.from_pretrained(
# model_name,
# quantization_config=bnb_config_4bit,
# device_map="auto",
# trust_remote_code=True,
# )
# VRAM Comparison:
# Full precision (fp32): 8B * 4 bytes = ~32 GB
# Half precision (fp16): 8B * 2 bytes = ~16 GB
# 8-bit: 8B * 1 byte = ~8 GB
# 4-bit: 8B * 0.5 byte = ~4 GB
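The rule of thumb above (parameter count x bytes per weight) can be wrapped in a small helper. Treat it as a lower bound: activations, the KV cache, and (during training) optimizer state all add on top of the weights.

```python
def weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Lower-bound estimate of model weight memory in GB.
    Weights only: activations, KV cache, and optimizer state are extra."""
    return n_params * bits_per_weight / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: ~{weights_gb(8e9, bits):.0f} GB")
# 32-bit: ~32 GB, 16-bit: ~16 GB, 8-bit: ~8 GB, 4-bit: ~4 GB
```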
5. LoRA Configuration
from peft import LoraConfig, TaskType
# Optimal LoRA configuration
lora_config = LoraConfig(
    r=16,           # Decomposition rank (8-64)
    lora_alpha=32,  # Scale factor (generally 2*r)
    target_modules=[
        "q_proj",     # Attention query
        "k_proj",     # Attention key
        "v_proj",     # Attention value
        "o_proj",     # Attention output
        "gate_proj",  # MLP gate
        "up_proj",    # MLP up
        "down_proj",  # MLP down
    ],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
# Apply LoRA to model
from peft import get_peft_model
# peft_model = get_peft_model(model, lora_config)
# peft_model.print_trainable_parameters()
# Typical output (exact counts vary with rank and target modules; with r=16
# on all seven projection modules of an 8B model, expect on the order of
# ~42M trainable params out of ~8B total, i.e. ~0.5% trainable)
Rank Selection Guide
| Rank | Parameters | Quality | Use Case |
|---|---|---|---|
| 4 | Very few | Fair | Simple tasks |
| 8 | Few | Good | General use |
| 16 | Moderate | Very good | Recommended |
| 32 | Many | Excellent | Complex tasks |
| 64 | Very many | Maximum | Near full FT |
6. SFTTrainer from TRL
from trl import SFTTrainer, SFTConfig
training_args = SFTConfig(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # Effective batch = 4 * 4 = 16
    learning_rate=2e-4,             # LR for LoRA (higher than full FT)
    weight_decay=0.01,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_steps=100,
    save_total_limit=3,
    bf16=True,
    max_grad_norm=0.3,
    max_seq_length=2048,
    packing=True,                   # Packing to optimize throughput
    gradient_checkpointing=True,    # Save VRAM
    optim="paged_adamw_32bit",
    report_to="tensorboard",
    seed=42,
)
# Create trainer
# trainer = SFTTrainer(
# model=peft_model,
# train_dataset=formatted_dataset,
# args=training_args,
# tokenizer=tokenizer,
# peft_config=lora_config,
# )
# Launch training
# trainer.train()
# Save
# trainer.save_model("./cybersec-lora-adapter")
7. Training Monitoring (Loss, Eval)
7.1 TensorBoard
# Launch TensorBoard
# tensorboard --logdir ./results/runs
# In Jupyter notebook
# %load_ext tensorboard
# %tensorboard --logdir ./results/runs
7.2 Weights & Biases
import wandb
# wandb.init(project="cybersec-finetuning", name="llama-3.1-8b-lora")
# In SFTConfig: report_to="wandb"
7.3 Key Metrics
- Training Loss: should decrease steadily
- Eval Loss: should follow training loss (otherwise overfitting)
- Learning Rate: verify schedule (warmup + decay)
- Gradient Norm: should stay stable; repeated spikes that hit the max_grad_norm clip signal training instability
- GPU Memory: monitor VRAM usage
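The train/eval divergence mentioned above can be checked mechanically from the trainer's log history. A small sketch, assuming you have extracted two step-aligned lists of losses (the patience of 3 consecutive rising eval steps is an arbitrary illustration, tune it to your eval frequency):

```python
def looks_overfit(train_losses, eval_losses, patience=3):
    """Flag overfitting: eval loss rose `patience` times in a row
    while train loss kept falling over the same steps."""
    rising = 0
    for i in range(1, len(eval_losses)):
        if eval_losses[i] > eval_losses[i - 1] and train_losses[i] < train_losses[i - 1]:
            rising += 1
            if rising >= patience:
                return True
        else:
            rising = 0
    return False

print(looks_overfit([1.0, 0.8, 0.6, 0.5], [1.1, 1.0, 0.9, 0.85]))  # False: eval tracks train
print(looks_overfit([1.0, 0.8, 0.6, 0.5], [1.1, 1.2, 1.3, 1.4]))   # True: eval diverging
```

When the check fires, the usual remedies are fewer epochs, more data, or a higher lora_dropout.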
8. Save and Upload Adapters
# Save LoRA adapter (only a few MB)
# trainer.model.save_pretrained("./cybersec-lora-adapter")
# tokenizer.save_pretrained("./cybersec-lora-adapter")
# Upload to Hub
# trainer.model.push_to_hub("AYI-NEDJIMI/cybersec-llama-lora")
# Directory structure:
# cybersec-lora-adapter/
# |-- adapter_config.json (LoRA config)
# |-- adapter_model.safetensors (LoRA weights, ~50MB)
# |-- tokenizer.json
# |-- tokenizer_config.json
Loading an Adapter
from peft import PeftModel
from transformers import AutoModelForCausalLM
# Load base model
# base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
# Load LoRA adapter
# model = PeftModel.from_pretrained(base_model, "AYI-NEDJIMI/cybersec-llama-lora")
9. Merge Adapters into Base Model
from peft import PeftModel
from transformers import AutoModelForCausalLM
import torch
# 1. Load base model in full precision
# base_model = AutoModelForCausalLM.from_pretrained(
# "meta-llama/Meta-Llama-3.1-8B",
# torch_dtype=torch.float16,
# device_map="auto"
# )
# 2. Load adapter
# model = PeftModel.from_pretrained(base_model, "./cybersec-lora-adapter")
# 3. Merge
# merged_model = model.merge_and_unload()
# 4. Save merged model
# merged_model.save_pretrained("./cybersec-llama-merged")
# 5. Upload merged model
# merged_model.push_to_hub("AYI-NEDJIMI/cybersec-llama-merged")
10. Quantize to GGUF for Ollama
# Method 1: With llama.cpp
# git clone https://github.com/ggerganov/llama.cpp
# cd llama.cpp && cmake -B build && cmake --build build --config Release
# Convert to GGUF
# python convert_hf_to_gguf.py ./cybersec-llama-merged --outfile cybersec-llama.gguf
# Quantize
# ./build/bin/llama-quantize cybersec-llama.gguf cybersec-llama-Q4_K_M.gguf Q4_K_M
# Use with Ollama:
# ollama create cybersec-llama -f Modelfile
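The `ollama create` step expects a Modelfile next to the GGUF. A minimal example (the FROM path matches the filename produced above; the SYSTEM prompt is an assumption you should adapt to your use case):

```
FROM ./cybersec-llama-Q4_K_M.gguf
PARAMETER temperature 0.7
SYSTEM "You are a cybersecurity expert. Respond precisely and technically."
```

For instruction-tuned models, you may also need a TEMPLATE directive matching the prompt format used during fine-tuning.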
Quantization Types
| Format | Size | Quality | Usage |
|---|---|---|---|
| Q2_K | ~2.5 bits | Low | Testing only |
| Q4_K_M | ~4.5 bits | Good | Recommended (balanced) |
| Q5_K_M | ~5.5 bits | Very good | Production |
| Q6_K | ~6.5 bits | Excellent | When VRAM allows |
| Q8_0 | ~8 bits | Near FP16 | Maximum quantized quality |
For more on quantization: Quantization GPTQ, GGUF, AWQ
11. AutoTrain (No-Code Fine-tuning)
AutoTrain is HuggingFace's no-code solution:
- Go to huggingface.co/autotrain
- Select your task (LLM Fine-tuning)
- Upload your dataset
- Choose the base model
- Configure hyperparameters
- Launch training
# Or via command line
# autotrain llm --train \
# --model meta-llama/Meta-Llama-3.1-8B \
# --data-path ./dataset \
# --text-column text \
# --lr 2e-4 \
# --batch-size 4 \
# --epochs 3 \
# --peft \
# --quantization int4 \
# --trainer sft
12. Real Example: Our 3 CyberSec Models
We fine-tuned 3 specialized cybersecurity models available in our collection:
12.1 CyberSec Threat Classifier
- Base: BERT multilingual
- Task: Threat classification (phishing, malware, intrusion, DDoS)
- Dataset: 50K annotated security alerts
- Method: Full fine-tuning
- Performance: F1 = 0.94
12.2 CyberSec Report Generator
- Base: Llama 3.1 8B
- Task: Incident report generation
- Dataset: 10K structured incident reports
- Method: QLoRA (r=16, alpha=32)
- VRAM: 12 GB (RTX 4080)
12.3 CyberSec CVE Analyzer
- Base: Mistral 7B
- Task: CVE analysis and explanation
- Dataset: 30K CVE descriptions + analyses
- Method: LoRA (r=32, alpha=64)
- VRAM: 16 GB (T4)
Discover these models in our collection: CyberSec AI Portfolio
Complete Pipeline (Summary)
1. Prepare instruction dataset
2. Choose base model
3. Configure BitsAndBytes (4-bit for QLoRA)
4. Configure LoRA (r=16, alpha=32)
5. Configure SFTTrainer
6. Train and monitor
7. Save adapter
8. Merge into base model
9. Quantize to GGUF
10. Deploy (HF Inference / Ollama / vLLM)
Conclusion
Fine-tuning with LoRA/QLoRA has revolutionized LLM adaptation. With HuggingFace PEFT and TRL, you can fine-tune state-of-the-art models on consumer hardware. The key is to prepare your data well and choose the right hyperparameters.
Tutorial written by AYI-NEDJIMI - AI & Cybersecurity Consultant