llama3-8b-pku-PPO-Instruct-SFT-Instruct

Fine-tuned Llama-3.1-8B using PPO (Proximal Policy Optimization), a reinforcement-learning method that optimizes the policy against a reward model, on the PKU-SafeRLHF dataset for improved safety alignment.

Model Details

  • Base Model: meta-llama/Llama-3.1-8B
  • Fine-tuning Method: PPO
  • Dataset: PKU-SafeRLHF (10,813 samples)
  • Training Date: 2025-12-21
  • Precision: BF16 (bfloat16)
  • Adapter Type: LoRA (r=16, alpha=16, ~168MB); a configuration sketch follows below
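
The adapter settings above correspond to a standard PEFT LoRA configuration. Below is a minimal sketch of how such a configuration might be declared; target_modules is an assumption (the card does not list which projections were adapted), shown with a common choice for Llama-style models.

from peft import LoraConfig

# Sketch of a LoRA configuration matching the stated adapter settings.
# target_modules is assumed, not taken from the card.
lora_config = LoraConfig(
    r=16,                                  # rank, as listed above
    lora_alpha=16,                         # scaling factor, as listed above
    target_modules=["q_proj", "v_proj"],   # assumption: common Llama choice
    task_type="CAUSAL_LM",
)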

Training Hyperparameters

  • Learning Rate: 1e-05
  • Batch Size: 16
  • Mini-Batch Size: 4
  • PPO Epochs: 4
  • Training Steps: N/A
  • Init KL Coefficient: 0.1
  • Target KL: 0.1
  • Clip Range: 0.2
  • Max Sequence Length: 2048
  • Max New Tokens: 256
  • Reward Model: OpenAssistant/reward-model-deberta-v3-large-v2 (an illustrative PPOConfig mapping of these settings follows below)
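
As a rough illustration, the hyperparameters above map onto a TRL PPOConfig as sketched below. Field names follow the classic TRL PPO API and may differ across releases (the card lists TRL 0.12.1), so treat this as illustrative rather than the actual training script.

from trl import PPOConfig

# Illustrative mapping of the listed hyperparameters onto classic
# TRL PPOConfig fields; names vary across TRL versions.
ppo_config = PPOConfig(
    learning_rate=1e-5,
    batch_size=16,
    mini_batch_size=4,
    ppo_epochs=4,        # PPO optimization epochs per batch
    init_kl_coef=0.1,    # initial KL penalty coefficient
    target=0.1,          # target KL for adaptive KL control
    cliprange=0.2,       # PPO policy clip range
)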

Evaluation Results

  • Final Objective Score: -2.8750 (a reward-model scoring sketch follows below)
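
For context, the reward model listed above rates a prompt/response pair as a single scalar (higher means more preferred). A minimal sketch of querying it, assuming the (question, answer) input format documented on the OpenAssistant model card:

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Sketch: score a prompt/response pair with the training reward model.
rm_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
rm_tokenizer = AutoTokenizer.from_pretrained(rm_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(rm_name)

question = "Explain quantum computing"           # example prompt
answer = "Quantum computers use qubits that..."  # example response
inputs = rm_tokenizer(question, answer, return_tensors="pt")

with torch.no_grad():
    score = reward_model(**inputs).logits[0].item()  # scalar reward
print(score)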

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load the base model in bfloat16 (matches the training precision)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Attach the LoRA adapter
model = PeftModel.from_pretrained(base_model, "kapilw25/llama3-8b-pku-PPO-Instruct-SFT-Instruct")

# Load the tokenizer shipped with the adapter
tokenizer = AutoTokenizer.from_pretrained("kapilw25/llama3-8b-pku-PPO-Instruct-SFT-Instruct")

# Build the prompt and generate (do_sample=True so temperature takes effect)
messages = [{"role": "user", "content": "Explain quantum computing"}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        input_ids,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
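
If a standalone checkpoint is preferred over loading the adapter at runtime, the LoRA weights can be merged into the base model with PEFT's merge_and_unload. A brief sketch; "./merged-model" is a hypothetical output path:

# Optional: merge the LoRA weights into the base model for standalone use
merged = model.merge_and_unload()
merged.save_pretrained("./merged-model")   # hypothetical path
tokenizer.save_pretrained("./merged-model")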

Intended Use

  • Primary Use: Safety-aligned conversational AI
  • Recommended: Instruction following with harm-refusal capabilities
  • Not Recommended: Medical/legal advice, factual knowledge (use base Llama-3.1 for general tasks)

Limitations

  • Fine-tuned on English-only safety dataset (PKU-SafeRLHF)
  • May refuse benign requests if phrased similarly to harmful prompts
  • LoRA adapter only; the base Llama-3.1-8B weights are required for inference

License

Llama 3.1 Community License (same as the base model)

Citation

@misc{llama3_8b_pku_PPO_Instruct_SFT_Instruct_2024,
  author = {User},
  title = {llama3-8b-pku-PPO-Instruct-SFT-Instruct},
  year = {2024},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/kapilw25/llama3-8b-pku-PPO-Instruct-SFT-Instruct}}
}

Framework Versions

  • Transformers: 4.46.3
  • PyTorch: 2.5.1
  • TRL: 0.12.1
  • PEFT: 0.13.2