Fine-tuning with HuggingFace: LoRA, QLoRA, PEFT - Comprehensive Guide (Tutorial #3)


Author: AYI-NEDJIMI | AI & Cybersecurity Consultant

This tutorial covers LLM fine-tuning in depth with HuggingFace: full fine-tuning, LoRA, QLoRA, instruction dataset preparation, BitsAndBytes configuration, SFTTrainer, monitoring, saving, adapter merging, GGUF quantization, and AutoTrain.

For our complete fine-tuning guide, check: Fine-tuning LLM with LoRA and QLoRA

For production deployment, check: Deploy LLM in Production with GPU

For quantization details, check: Quantization GPTQ, GGUF, AWQ


1. What is Fine-tuning?

Fine-tuning is the process of adapting a pre-trained model to a specific task by continuing training on your data.

1.1 Types of Fine-tuning

| Method | Parameters Trained | VRAM Required | Quality |
|---|---|---|---|
| Full Fine-tuning | All (100%) | Very high (80+ GB) | Maximum |
| LoRA | 0.1-1% | Moderate (16-24 GB) | Very good |
| QLoRA | 0.1-1% (quantized model) | Low (8-16 GB) | Good |
| Prefix Tuning | Prefixes only | Low | Fair |
| Prompt Tuning | Soft prompts | Very low | Variable |

1.2 LoRA in Detail

LoRA (Low-Rank Adaptation) decomposes weight updates into two small matrices:

W' = W + BA
where:
- W is the original weight matrix (frozen)
- B is of size (d x r) with r << d
- A is of size (r x k) with r << k
- r is the rank (typically 8-64)
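
The decomposition above can be checked with a minimal NumPy sketch (the dimensions are illustrative, not tied to any specific model):

```python
import numpy as np

d, k, r = 4096, 4096, 16            # illustrative dimensions, r << d, k
W = np.random.randn(d, k)            # original weight matrix, frozen
B = np.zeros((d, r))                 # trainable, initialized to zero
A = np.random.randn(r, k) * 0.01     # trainable

W_eff = W + B @ A                    # effective weight W' = W + BA

full_params = d * k                  # 16,777,216
lora_params = d * r + r * k          # 131,072 (~0.8% of full)
print(full_params, lora_params)
```

Because B starts at zero, W' equals W before any training step, so the adapted model initially reproduces the pretrained behavior exactly.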

Advantages:

  • Parameters: ~99% fewer trainable parameters than full fine-tuning
  • Speed: 2-5x faster than full fine-tuning
  • Memory: works on consumer GPUs
  • Modular: adapters are small and interchangeable

1.3 QLoRA in Detail

QLoRA combines LoRA with 4-bit quantization:

  • Base model is loaded in 4-bit NormalFloat (NF4)
  • LoRA adapters are trained in bfloat16
  • Result: fine-tune a 70B model on a single 48GB GPU
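
The 70B-on-48GB figure can be sanity-checked with back-of-the-envelope arithmetic. This rough helper counts weight memory only; activations, the LoRA optimizer state, and quantization constants add a few more GB on top:

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed for model weights alone, in GB."""
    return n_params * bits_per_param / 8 / 1e9

print(weight_memory_gb(70e9, 4))    # 35.0 GB in 4-bit NF4 -- fits on 48 GB
print(weight_memory_gb(70e9, 16))   # 140.0 GB in bf16 -- would not fit
```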

2. When to Fine-tune vs RAG vs Prompting?

| Approach | When to Use | Advantages | Disadvantages |
|---|---|---|---|
| Prompting | General tasks, prototyping | Fast, no training needed | Limited by context |
| RAG | Need specific/recent knowledge | No training, up-to-date data | Latency, complexity |
| Fine-tuning | Specific style/format, specialized task | Maximum quality, fast inference | Costly, data needed |

Recommendation: Start with prompting, then RAG, then fine-tuning if necessary.


3. Preparing Instruction Datasets

3.1 Instruction Format

from datasets import Dataset

# Standard instruction-input-output format
data = {
    "instruction": [
        "Classify this security alert by type and severity.",
        "Generate an incident report from these logs.",
        "Explain this CVE vulnerability in simple terms.",
        "Propose remediation measures for this threat.",
        "Analyze this network traffic and identify anomalies.",
    ],
    "input": [
        "50 failed SSH login attempts in 2 minutes from IP 185.220.101.45",
        "2026-01-15 03:22:15 ALERT: Outbound connection to known C2 server 45.33.32.156:443",
        "CVE-2026-1234: Buffer overflow in OpenSSL 3.2.0 allowing remote code execution",
        "LockBit ransomware detected on workstation DESKTOP-HR042, .locked files",
        "DNS traffic spike: 10000 queries/min to randomly generated DGA domains",
    ],
    "output": [
        "Type: SSH Brute Force | Severity: HIGH | Action: Block source IP, verify access",
        "INCIDENT REPORT\nDate: 01/15/2026 03:22\nType: C2 Communication\nSuspicious IP: 45.33.32.156\nAction: Isolate machine, analyze malware",
        "This vulnerability allows a remote attacker to execute code on your server by exploiting a buffer overflow in OpenSSL. Update immediately.",
        "1. Isolate infected workstation from network\n2. Identify infection vector\n3. Restore from backups\n4. Analyze logs to identify propagation\n5. Update antivirus signatures",
        "Anomaly detected: Massive DNS queries to DGA (Domain Generation Algorithm) domains. Probable malware/botnet indicator. Recommendation: block domains, isolate source machines.",
    ],
}

dataset = Dataset.from_dict(data)
print(f"Dataset: {len(dataset)} examples")

3.2 Chat/Conversational Format

chat_data = {
    "messages": [
        [
            {"role": "system", "content": "You are a cybersecurity expert. Respond precisely and technically."},
            {"role": "user", "content": "What is a SQL injection attack?"},
            {"role": "assistant", "content": "SQL injection is an attack technique that exploits input validation flaws in web applications. The attacker inserts malicious SQL code into input fields to manipulate the database. Example: ' OR '1'='1 in a login field can bypass authentication."},
        ],
    ]
}

3.3 Format for SFTTrainer

def format_instruction(example):
    """Format an example for instruction fine-tuning."""
    if example.get("input", ""):
        text = f"""### Instruction:
{example['instruction']}

### Input:
{example['input']}

### Response:
{example['output']}"""
    else:
        text = f"""### Instruction:
{example['instruction']}

### Response:
{example['output']}"""
    return {"text": text}

formatted_dataset = dataset.map(format_instruction)
print(formatted_dataset[0]['text'])
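
Many fine-tuning workflows persist the formatted examples as JSONL on disk before training. A minimal stdlib sketch (the file name train.jsonl and the example texts are hypothetical):

```python
import json

# Hypothetical formatted examples, one JSON object per line
examples = [
    {"text": "### Instruction:\nClassify this alert.\n\n### Response:\nType: SSH Brute Force | Severity: HIGH"},
    {"text": "### Instruction:\nExplain CVE-2026-1234 in simple terms.\n\n### Response:\nA buffer overflow in OpenSSL allowing remote code execution."},
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```

`load_dataset("json", data_files="train.jsonl")` can read this back as a `Dataset` with a `text` column, ready for SFTTrainer.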

4. BitsAndBytesConfig (4-bit, 8-bit)

from transformers import BitsAndBytesConfig
import torch

# 4-bit configuration (QLoRA)
bnb_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,                     # Load in 4-bit
    bnb_4bit_quant_type="nf4",             # NormalFloat 4-bit (better than FP4)
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute in bfloat16
    bnb_4bit_use_double_quant=True,         # Double quantization (saves ~0.4 bits/param)
)

# 8-bit configuration
bnb_config_8bit = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
)

# Load quantized model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3.1-8B"

# model = AutoModelForCausalLM.from_pretrained(
#     model_name,
#     quantization_config=bnb_config_4bit,
#     device_map="auto",
#     trust_remote_code=True,
# )

# VRAM Comparison:
# Full precision (fp32): 8B * 4 bytes = ~32 GB
# Half precision (fp16): 8B * 2 bytes = ~16 GB
# 8-bit: 8B * 1 byte = ~8 GB
# 4-bit: 8B * 0.5 byte = ~4 GB

5. LoRA Configuration

from peft import LoraConfig, TaskType

# Optimal LoRA configuration
lora_config = LoraConfig(
    r=16,                          # Decomposition rank (8-64)
    lora_alpha=32,                 # Scale factor (generally 2*r)
    target_modules=[
        "q_proj",                  # Attention query
        "k_proj",                  # Attention key
        "v_proj",                  # Attention value
        "o_proj",                  # Attention output
        "gate_proj",               # MLP gate
        "up_proj",                 # MLP up
        "down_proj",               # MLP down
    ],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# Apply LoRA to model
from peft import get_peft_model

# peft_model = get_peft_model(model, lora_config)
# peft_model.print_trainable_parameters()
# Typical output:
# "trainable params: 13,631,488 || all params: 8,030,261,248 || trainable%: 0.1698%"

Rank Selection Guide

| Rank | Parameters | Quality | Use Case |
|---|---|---|---|
| 4 | Very few | Fair | Simple tasks |
| 8 | Few | Good | General use |
| 16 | Moderate | Very good | Recommended |
| 32 | Many | Excellent | Complex tasks |
| 64 | Very many | Maximum | Near full FT |

6. SFTTrainer from TRL

from trl import SFTTrainer, SFTConfig

training_args = SFTConfig(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,     # Effective batch = 4 * 4 = 16
    learning_rate=2e-4,                # LR for LoRA (higher than full FT)
    weight_decay=0.01,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_steps=100,
    save_total_limit=3,
    bf16=True,
    max_grad_norm=0.3,
    max_seq_length=2048,
    packing=True,                      # Packing to optimize throughput
    gradient_checkpointing=True,       # Save VRAM
    optim="paged_adamw_32bit",
    report_to="tensorboard",
    seed=42,
)

# Create trainer
# trainer = SFTTrainer(
#     model=peft_model,
#     train_dataset=formatted_dataset,
#     args=training_args,
#     tokenizer=tokenizer,
#     peft_config=lora_config,
# )

# Launch training
# trainer.train()

# Save
# trainer.save_model("./cybersec-lora-adapter")
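
The batch-size arithmetic in the config above determines the training schedule. A quick sketch with a hypothetical dataset size (note that packing=True concatenates examples, which reduces the real step count):

```python
import math

n_examples = 10_000                       # hypothetical dataset size
effective_batch = 4 * 4                   # per_device_train_batch_size * gradient_accumulation_steps
steps_per_epoch = math.ceil(n_examples / effective_batch)
total_steps = 3 * steps_per_epoch         # num_train_epochs = 3
print(effective_batch, steps_per_epoch, total_steps)  # 16 625 1875
```

With warmup_ratio=0.03, about 3% of those total steps are spent warming up the learning rate before the cosine decay begins.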

7. Training Monitoring (Loss, Eval)

7.1 TensorBoard

# Launch TensorBoard
# tensorboard --logdir ./results/runs

# In Jupyter notebook
# %load_ext tensorboard
# %tensorboard --logdir ./results/runs

7.2 Weights & Biases

import wandb
# wandb.init(project="cybersec-finetuning", name="llama-3.1-8b-lora")
# In SFTConfig: report_to="wandb"

7.3 Key Metrics

  • Training Loss: should decrease steadily
  • Eval Loss: should follow training loss (otherwise overfitting)
  • Learning Rate: verify schedule (warmup + decay)
  • Gradient Norm: should stay stable; sustained spikes indicate divergence (max_grad_norm=0.3 in the config above clips them)
  • GPU Memory: monitor VRAM usage
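
The train/eval divergence check can be automated with a small helper (hypothetical, not part of TRL; a real run would hook this into the trainer's evaluation logs):

```python
def eval_loss_rising(eval_losses, patience=3):
    """True if eval loss has increased for `patience` consecutive evaluations,
    a simple overfitting signal."""
    rises = 0
    for prev, cur in zip(eval_losses, eval_losses[1:]):
        rises = rises + 1 if cur > prev else 0
        if rises >= patience:
            return True
    return False

print(eval_loss_rising([2.1, 1.8, 1.6, 1.65, 1.7, 1.78]))  # True: overfitting
print(eval_loss_rising([2.0, 1.5, 1.2, 1.1]))              # False: still improving
```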

8. Save and Upload Adapters

# Save LoRA adapter (only a few MB)
# trainer.model.save_pretrained("./cybersec-lora-adapter")
# tokenizer.save_pretrained("./cybersec-lora-adapter")

# Upload to Hub
# trainer.model.push_to_hub("AYI-NEDJIMI/cybersec-llama-lora")

# Directory structure:
# cybersec-lora-adapter/
# |-- adapter_config.json       (LoRA config)
# |-- adapter_model.safetensors (LoRA weights, ~50MB)
# |-- tokenizer.json
# |-- tokenizer_config.json

Loading an Adapter

from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load base model
# base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

# Load LoRA adapter
# model = PeftModel.from_pretrained(base_model, "AYI-NEDJIMI/cybersec-llama-lora")

9. Merge Adapters into Base Model

from peft import PeftModel
from transformers import AutoModelForCausalLM
import torch

# 1. Load base model in full precision
# base_model = AutoModelForCausalLM.from_pretrained(
#     "meta-llama/Meta-Llama-3.1-8B",
#     torch_dtype=torch.float16,
#     device_map="auto"
# )

# 2. Load adapter
# model = PeftModel.from_pretrained(base_model, "./cybersec-lora-adapter")

# 3. Merge
# merged_model = model.merge_and_unload()

# 4. Save merged model
# merged_model.save_pretrained("./cybersec-llama-merged")

# 5. Upload merged model
# merged_model.push_to_hub("AYI-NEDJIMI/cybersec-llama-merged")

10. Quantize to GGUF for Ollama

# Method 1: With llama.cpp
# git clone https://github.com/ggerganov/llama.cpp
# cd llama.cpp && make

# Convert to GGUF
# python convert_hf_to_gguf.py ./cybersec-llama-merged --outfile cybersec-llama.gguf

# Quantize
# ./llama-quantize cybersec-llama.gguf cybersec-llama-Q4_K_M.gguf Q4_K_M

# Use with Ollama:
# ollama create cybersec-llama -f Modelfile

Quantization Types

| Format | Size | Quality | Usage |
|---|---|---|---|
| Q2_K | ~2.5 bits | Low | Testing only |
| Q4_K_M | ~4.5 bits | Good | Recommended (balanced) |
| Q5_K_M | ~5.5 bits | Very good | Production |
| Q6_K | ~6.5 bits | Excellent | When VRAM allows |
| Q8_0 | ~8 bits | Near FP16 | Maximum quantized quality |

For more on quantization: Quantization GPTQ, GGUF, AWQ


11. AutoTrain (No-Code Fine-tuning)

AutoTrain is HuggingFace's no-code solution:

  1. Go to huggingface.co/autotrain
  2. Select your task (LLM Fine-tuning)
  3. Upload your dataset
  4. Choose the base model
  5. Configure hyperparameters
  6. Launch training

# Or via command line
# autotrain llm --train \
#   --model meta-llama/Meta-Llama-3.1-8B \
#   --data-path ./dataset \
#   --text-column text \
#   --lr 2e-4 \
#   --batch-size 4 \
#   --epochs 3 \
#   --peft \
#   --quantization int4 \
#   --trainer sft

12. Real Example: Our 3 CyberSec Models

We fine-tuned 3 specialized cybersecurity models available in our collection:

12.1 CyberSec Threat Classifier

  • Base: BERT multilingual
  • Task: Threat classification (phishing, malware, intrusion, DDoS)
  • Dataset: 50K annotated security alerts
  • Method: Full fine-tuning
  • Performance: F1 = 0.94

12.2 CyberSec Report Generator

  • Base: Llama 3.1 8B
  • Task: Incident report generation
  • Dataset: 10K structured incident reports
  • Method: QLoRA (r=16, alpha=32)
  • VRAM: 12 GB (RTX 4080)

12.3 CyberSec CVE Analyzer

  • Base: Mistral 7B
  • Task: CVE analysis and explanation
  • Dataset: 30K CVE descriptions + analyses
  • Method: LoRA (r=32, alpha=64)
  • VRAM: 16 GB (T4)

Discover these models in our collection: CyberSec AI Portfolio


Complete Pipeline (Summary)

1. Prepare instruction dataset
2. Choose base model
3. Configure BitsAndBytes (4-bit for QLoRA)
4. Configure LoRA (r=16, alpha=32)
5. Configure SFTTrainer
6. Train and monitor
7. Save adapter
8. Merge into base model
9. Quantize to GGUF
10. Deploy (HF Inference / Ollama / vLLM)

Conclusion

Fine-tuning with LoRA/QLoRA has revolutionized LLM adaptation. With HuggingFace PEFT and TRL, you can fine-tune state-of-the-art models on consumer hardware. The key is to prepare your data well and choose the right hyperparameters.


Tutorial written by AYI-NEDJIMI - AI & Cybersecurity Consultant
