Fine-tuning with HuggingFace: LoRA, QLoRA, PEFT - Comprehensive Guide (Tutorial #3)


Author: AYI-NEDJIMI | AI & Cybersecurity Consultant

This tutorial covers LLM fine-tuning in depth with HuggingFace: full fine-tuning, LoRA, QLoRA, instruction dataset preparation, BitsAndBytes configuration, SFTTrainer, monitoring, saving, adapter merging, GGUF quantization, and AutoTrain.

For our complete fine-tuning guide, check: Fine-tuning LLM with LoRA and QLoRA

For production deployment, check: Deploy LLM in Production with GPU

For quantization details, check: Quantization GPTQ, GGUF, AWQ


1. What is Fine-tuning?

Fine-tuning is the process of adapting a pre-trained model to a specific task by continuing training on your data.

1.1 Types of Fine-tuning

| Method | Parameters Trained | VRAM Required | Quality |
|---|---|---|---|
| Full Fine-tuning | All (100%) | Very high (80+ GB) | Maximum |
| LoRA | 0.1-1% | Moderate (16-24 GB) | Very good |
| QLoRA | 0.1-1% (quantized model) | Low (8-16 GB) | Good |
| Prefix Tuning | Prefixes only | Low | Fair |
| Prompt Tuning | Soft prompts | Very low | Variable |

1.2 LoRA in Detail

LoRA (Low-Rank Adaptation) decomposes weight updates into two small matrices:

W' = W + BA
where:
- W is the original weight matrix (frozen)
- B is of size (d x r) with r << d
- A is of size (r x k) with r << k
- r is the rank (typically 8-64)
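
The decomposition above can be checked with a minimal NumPy sketch (the dimensions are illustrative, not tied to any specific model):

```python
import numpy as np

d, k, r = 4096, 4096, 16            # illustrative dimensions, r << d, k
W = np.random.randn(d, k)            # original weight matrix, frozen
B = np.zeros((d, r))                 # trainable, initialized to zero
A = np.random.randn(r, k) * 0.01     # trainable

W_eff = W + B @ A                    # effective weight W' = W + BA

full_params = d * k                  # 16,777,216
lora_params = d * r + r * k          # 131,072 (~0.8% of full)
print(full_params, lora_params)
```

Because B starts at zero, W' equals W before any training step, so the adapted model initially reproduces the pretrained behavior exactly.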

Advantages:

  • Parameters: ~99% fewer trainable parameters than full fine-tuning
  • Speed: 2-5x faster than full fine-tuning
  • Memory: works on consumer GPUs
  • Modular: adapters are small and interchangeable

1.3 QLoRA in Detail

QLoRA combines LoRA with 4-bit quantization:

  • Base model is loaded in 4-bit NormalFloat (NF4)
  • LoRA adapters are trained in bfloat16
  • Result: fine-tune a 70B model on a single 48GB GPU
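
The 70B-on-48GB figure can be sanity-checked with back-of-the-envelope arithmetic. This rough helper counts weight memory only; activations, the LoRA optimizer state, and quantization constants add a few more GB on top:

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed for model weights alone, in GB."""
    return n_params * bits_per_param / 8 / 1e9

print(weight_memory_gb(70e9, 4))    # 35.0 GB in 4-bit NF4 -- fits on 48 GB
print(weight_memory_gb(70e9, 16))   # 140.0 GB in bf16 -- would not fit
```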

2. When to Fine-tune vs RAG vs Prompting?

| Approach | When to Use | Advantages | Disadvantages |
|---|---|---|---|
| Prompting | General tasks, prototyping | Fast, no training needed | Limited by context |
| RAG | Need specific/recent knowledge | No training, up-to-date data | Latency, complexity |
| Fine-tuning | Specific style/format, specialized task | Maximum quality, fast inference | Costly, data needed |

Recommendation: Start with prompting, then RAG, then fine-tuning if necessary.


3. Preparing Instruction Datasets

3.1 Instruction Format

from datasets import Dataset

# Standard instruction-input-output format
data = {
    "instruction": [
        "Classify this security alert by type and severity.",
        "Generate an incident report from these logs.",
        "Explain this CVE vulnerability in simple terms.",
        "Propose remediation measures for this threat.",
        "Analyze this network traffic and identify anomalies.",
    ],
    "input": [
        "50 failed SSH login attempts in 2 minutes from IP 185.220.101.45",
        "2026-01-15 03:22:15 ALERT: Outbound connection to known C2 server 45.33.32.156:443",
        "CVE-2026-1234: Buffer overflow in OpenSSL 3.2.0 allowing remote code execution",
        "LockBit ransomware detected on workstation DESKTOP-HR042, .locked files",
        "DNS traffic spike: 10000 queries/min to randomly generated DGA domains",
    ],
    "output": [
        "Type: SSH Brute Force | Severity: HIGH | Action: Block source IP, verify access",
        "INCIDENT REPORT\nDate: 01/15/2026 03:22\nType: C2 Communication\nSuspicious IP: 45.33.32.156\nAction: Isolate machine, analyze malware",
        "This vulnerability allows a remote attacker to execute code on your server by exploiting a buffer overflow in OpenSSL. Update immediately.",
        "1. Isolate infected workstation from network\n2. Identify infection vector\n3. Restore from backups\n4. Analyze logs to identify propagation\n5. Update antivirus signatures",
        "Anomaly detected: Massive DNS queries to DGA (Domain Generation Algorithm) domains. Probable malware/botnet indicator. Recommendation: block domains, isolate source machines.",
    ],
}

dataset = Dataset.from_dict(data)
print(f"Dataset: {len(dataset)} examples")

3.2 Chat/Conversational Format

chat_data = {
    "messages": [
        [
            {"role": "system", "content": "You are a cybersecurity expert. Respond precisely and technically."},
            {"role": "user", "content": "What is a SQL injection attack?"},
            {"role": "assistant", "content": "SQL injection is an attack technique that exploits input validation flaws in web applications. The attacker inserts malicious SQL code into input fields to manipulate the database. Example: ' OR '1'='1 in a login field can bypass authentication."},
        ],
    ]
}

3.3 Format for SFTTrainer

def format_instruction(example):
    """Format an example for instruction fine-tuning."""
    if example.get("input", ""):
        text = f"""### Instruction:
{example['instruction']}

### Input:
{example['input']}

### Response:
{example['output']}"""
    else:
        text = f"""### Instruction:
{example['instruction']}

### Response:
{example['output']}"""
    return {"text": text}

formatted_dataset = dataset.map(format_instruction)
print(formatted_dataset[0]['text'])
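
Many fine-tuning workflows persist the formatted examples as JSONL on disk before training. A minimal stdlib sketch (the file name train.jsonl and the example texts are hypothetical):

```python
import json

# Hypothetical formatted examples, one JSON object per line
examples = [
    {"text": "### Instruction:\nClassify this alert.\n\n### Response:\nType: SSH Brute Force | Severity: HIGH"},
    {"text": "### Instruction:\nExplain CVE-2026-1234 in simple terms.\n\n### Response:\nA buffer overflow in OpenSSL allowing remote code execution."},
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```

`load_dataset("json", data_files="train.jsonl")` can read this back as a `Dataset` with a `text` column, ready for SFTTrainer.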

4. BitsAndBytesConfig (4-bit, 8-bit)

from transformers import BitsAndBytesConfig
import torch

# 4-bit configuration (QLoRA)
bnb_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,                     # Load in 4-bit
    bnb_4bit_quant_type="nf4",             # NormalFloat 4-bit (better than FP4)
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute in bfloat16
    bnb_4bit_use_double_quant=True,         # Double quantization (saves ~0.4 bits/param)
)

# 8-bit configuration
bnb_config_8bit = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
)

# Load quantized model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3.1-8B"

# model = AutoModelForCausalLM.from_pretrained(
#     model_name,
#     quantization_config=bnb_config_4bit,
#     device_map="auto",
#     trust_remote_code=True,
# )

# VRAM Comparison:
# Full precision (fp32): 8B * 4 bytes = ~32 GB
# Half precision (fp16): 8B * 2 bytes = ~16 GB
# 8-bit: 8B * 1 byte = ~8 GB
# 4-bit: 8B * 0.5 byte = ~4 GB

5. LoRA Configuration

from peft import LoraConfig, TaskType

# Optimal LoRA configuration
lora_config = LoraConfig(
    r=16,                          # Decomposition rank (8-64)
    lora_alpha=32,                 # Scale factor (generally 2*r)
    target_modules=[
        "q_proj",                  # Attention query
        "k_proj",                  # Attention key
        "v_proj",                  # Attention value
        "o_proj",                  # Attention output
        "gate_proj",               # MLP gate
        "up_proj",                 # MLP up
        "down_proj",               # MLP down
    ],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# Apply LoRA to model
from peft import get_peft_model

# peft_model = get_peft_model(model, lora_config)
# peft_model.print_trainable_parameters()
# Typical output:
# "trainable params: 13,631,488 || all params: 8,030,261,248 || trainable%: 0.1698%"

Rank Selection Guide

| Rank | Parameters | Quality | Use Case |
|---|---|---|---|
| 4 | Very few | Fair | Simple tasks |
| 8 | Few | Good | General use |
| 16 | Moderate | Very good | Recommended |
| 32 | Many | Excellent | Complex tasks |
| 64 | Very many | Maximum | Near full FT |

6. SFTTrainer from TRL

from trl import SFTTrainer, SFTConfig

training_args = SFTConfig(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,     # Effective batch = 4 * 4 = 16
    learning_rate=2e-4,                # LR for LoRA (higher than full FT)
    weight_decay=0.01,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_steps=100,
    save_total_limit=3,
    bf16=True,
    max_grad_norm=0.3,
    max_seq_length=2048,
    packing=True,                      # Packing to optimize throughput
    gradient_checkpointing=True,       # Save VRAM
    optim="paged_adamw_32bit",
    report_to="tensorboard",
    seed=42,
)

# Create trainer
# trainer = SFTTrainer(
#     model=peft_model,
#     train_dataset=formatted_dataset,
#     args=training_args,
#     tokenizer=tokenizer,
#     peft_config=lora_config,
# )

# Launch training
# trainer.train()

# Save
# trainer.save_model("./cybersec-lora-adapter")
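
The batch-size arithmetic in the config above determines the training schedule. A quick sketch with a hypothetical dataset size (note that packing=True concatenates examples, which reduces the real step count):

```python
import math

n_examples = 10_000                       # hypothetical dataset size
effective_batch = 4 * 4                   # per_device_train_batch_size * gradient_accumulation_steps
steps_per_epoch = math.ceil(n_examples / effective_batch)
total_steps = 3 * steps_per_epoch         # num_train_epochs = 3
print(effective_batch, steps_per_epoch, total_steps)  # 16 625 1875
```

With warmup_ratio=0.03, about 3% of those total steps are spent warming up the learning rate before the cosine decay begins.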

7. Training Monitoring (Loss, Eval)

7.1 TensorBoard

# Launch TensorBoard
# tensorboard --logdir ./results/runs

# In Jupyter notebook
# %load_ext tensorboard
# %tensorboard --logdir ./results/runs

7.2 Weights & Biases

import wandb
# wandb.init(project="cybersec-finetuning", name="llama-3.1-8b-lora")
# In SFTConfig: report_to="wandb"

7.3 Key Metrics

  • Training Loss: should decrease steadily
  • Eval Loss: should follow training loss (otherwise overfitting)
  • Learning Rate: verify schedule (warmup + decay)
  • Gradient Norm: should stay stable; sustained spikes indicate divergence (max_grad_norm=0.3 in the config above clips them)
  • GPU Memory: monitor VRAM usage
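
The train/eval divergence check can be automated with a small helper (hypothetical, not part of TRL; a real run would hook this into the trainer's evaluation logs):

```python
def eval_loss_rising(eval_losses, patience=3):
    """True if eval loss has increased for `patience` consecutive evaluations,
    a simple overfitting signal."""
    rises = 0
    for prev, cur in zip(eval_losses, eval_losses[1:]):
        rises = rises + 1 if cur > prev else 0
        if rises >= patience:
            return True
    return False

print(eval_loss_rising([2.1, 1.8, 1.6, 1.65, 1.7, 1.78]))  # True: overfitting
print(eval_loss_rising([2.0, 1.5, 1.2, 1.1]))              # False: still improving
```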

8. Save and Upload Adapters

# Save LoRA adapter (only a few MB)
# trainer.model.save_pretrained("./cybersec-lora-adapter")
# tokenizer.save_pretrained("./cybersec-lora-adapter")

# Upload to Hub
# trainer.model.push_to_hub("AYI-NEDJIMI/cybersec-llama-lora")

# Directory structure:
# cybersec-lora-adapter/
# |-- adapter_config.json       (LoRA config)
# |-- adapter_model.safetensors (LoRA weights, ~50MB)
# |-- tokenizer.json
# |-- tokenizer_config.json

Loading an Adapter

from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load base model
# base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

# Load LoRA adapter
# model = PeftModel.from_pretrained(base_model, "AYI-NEDJIMI/cybersec-llama-lora")

9. Merge Adapters into Base Model

from peft import PeftModel
from transformers import AutoModelForCausalLM
import torch

# 1. Load base model in full precision
# base_model = AutoModelForCausalLM.from_pretrained(
#     "meta-llama/Meta-Llama-3.1-8B",
#     torch_dtype=torch.float16,
#     device_map="auto"
# )

# 2. Load adapter
# model = PeftModel.from_pretrained(base_model, "./cybersec-lora-adapter")

# 3. Merge
# merged_model = model.merge_and_unload()

# 4. Save merged model
# merged_model.save_pretrained("./cybersec-llama-merged")

# 5. Upload merged model
# merged_model.push_to_hub("AYI-NEDJIMI/cybersec-llama-merged")

10. Quantize to GGUF for Ollama

# Method 1: With llama.cpp
# git clone https://github.com/ggerganov/llama.cpp
# cd llama.cpp && make

# Convert to GGUF
# python convert_hf_to_gguf.py ./cybersec-llama-merged --outfile cybersec-llama.gguf

# Quantize
# ./llama-quantize cybersec-llama.gguf cybersec-llama-Q4_K_M.gguf Q4_K_M

# Use with Ollama:
# ollama create cybersec-llama -f Modelfile

Quantization Types

| Format | Size | Quality | Usage |
|---|---|---|---|
| Q2_K | ~2.5 bits | Low | Testing only |
| Q4_K_M | ~4.5 bits | Good | Recommended (balanced) |
| Q5_K_M | ~5.5 bits | Very good | Production |
| Q6_K | ~6.5 bits | Excellent | When VRAM allows |
| Q8_0 | ~8 bits | Near FP16 | Maximum quantized quality |

For more on quantization: Quantization GPTQ, GGUF, AWQ


11. AutoTrain (No-Code Fine-tuning)

AutoTrain is HuggingFace's no-code solution:

  1. Go to huggingface.co/autotrain
  2. Select your task (LLM Fine-tuning)
  3. Upload your dataset
  4. Choose the base model
  5. Configure hyperparameters
  6. Launch training

# Or via command line
# autotrain llm --train \
#   --model meta-llama/Meta-Llama-3.1-8B \
#   --data-path ./dataset \
#   --text-column text \
#   --lr 2e-4 \
#   --batch-size 4 \
#   --epochs 3 \
#   --peft \
#   --quantization int4 \
#   --trainer sft

12. Real Example: Our 3 CyberSec Models

We fine-tuned 3 specialized cybersecurity models available in our collection:

12.1 CyberSec Threat Classifier

  • Base: BERT multilingual
  • Task: Threat classification (phishing, malware, intrusion, DDoS)
  • Dataset: 50K annotated security alerts
  • Method: Full fine-tuning
  • Performance: F1 = 0.94

12.2 CyberSec Report Generator

  • Base: Llama 3.1 8B
  • Task: Incident report generation
  • Dataset: 10K structured incident reports
  • Method: QLoRA (r=16, alpha=32)
  • VRAM: 12 GB (RTX 4080)

12.3 CyberSec CVE Analyzer

  • Base: Mistral 7B
  • Task: CVE analysis and explanation
  • Dataset: 30K CVE descriptions + analyses
  • Method: LoRA (r=32, alpha=64)
  • VRAM: 16 GB (T4)

Discover these models in our collection: CyberSec AI Portfolio


Complete Pipeline (Summary)

1. Prepare instruction dataset
2. Choose base model
3. Configure BitsAndBytes (4-bit for QLoRA)
4. Configure LoRA (r=16, alpha=32)
5. Configure SFTTrainer
6. Train and monitor
7. Save adapter
8. Merge into base model
9. Quantize to GGUF
10. Deploy (HF Inference / Ollama / vLLM)

Conclusion

Fine-tuning with LoRA/QLoRA has revolutionized LLM adaptation. With HuggingFace PEFT and TRL, you can fine-tune state-of-the-art models on consumer hardware. The key is to prepare your data well and choose the right hyperparameters.


Tutorial written by AYI-NEDJIMI - AI & Cybersecurity Consultant
