llama3-8b-pku-PPO-Instruct-SFT-Instruct

Fine-tuned Llama-3.1-8B using PPO (Proximal Policy Optimization), a reinforcement-learning method that optimizes the policy against a reward model, on the PKU-SafeRLHF dataset for improved safety alignment.

Model Details

  • Base Model: meta-llama/Llama-3.1-8B
  • Fine-tuning Method: PPO
  • Dataset: PKU-SafeRLHF (10,813 samples)
  • Training Date: 2025-12-21
  • Precision: BF16 (bfloat16)
  • Adapter Type: LoRA (r=16, alpha=16, ~168MB); a configuration sketch follows below
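
The adapter settings above correspond to a standard PEFT LoRA configuration. Below is a minimal sketch of how such a configuration might be declared; target_modules is an assumption (the card does not list which projections were adapted), shown with a common choice for Llama-style models.

from peft import LoraConfig

# Sketch of a LoRA configuration matching the stated adapter settings.
# target_modules is assumed, not taken from the card.
lora_config = LoraConfig(
    r=16,                                  # rank, as listed above
    lora_alpha=16,                         # scaling factor, as listed above
    target_modules=["q_proj", "v_proj"],   # assumption: common Llama choice
    task_type="CAUSAL_LM",
)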

Training Hyperparameters

  • Learning Rate: 1e-05
  • Batch Size: 16
  • Mini-Batch Size: 4
  • PPO Epochs: 4
  • Training Steps: N/A
  • Init KL Coefficient: 0.1
  • Target KL: 0.1
  • Clip Range: 0.2
  • Max Sequence Length: 2048
  • Max New Tokens: 256
  • Reward Model: OpenAssistant/reward-model-deberta-v3-large-v2 (an illustrative PPOConfig mapping of these settings follows below)
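
As a rough illustration, the hyperparameters above map onto a TRL PPOConfig as sketched below. Field names follow the classic TRL PPO API and may differ across releases (the card lists TRL 0.12.1), so treat this as illustrative rather than the actual training script.

from trl import PPOConfig

# Illustrative mapping of the listed hyperparameters onto classic
# TRL PPOConfig fields; names vary across TRL versions.
ppo_config = PPOConfig(
    learning_rate=1e-5,
    batch_size=16,
    mini_batch_size=4,
    ppo_epochs=4,        # PPO optimization epochs per batch
    init_kl_coef=0.1,    # initial KL penalty coefficient
    target=0.1,          # target KL for adaptive KL control
    cliprange=0.2,       # PPO policy clip range
)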

Evaluation Results

  • Final Objective Score: -2.8750 (a reward-model scoring sketch follows below)
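
For context, the reward model listed above rates a prompt/response pair as a single scalar (higher means more preferred). A minimal sketch of querying it, assuming the (question, answer) input format documented on the OpenAssistant model card:

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Sketch: score a prompt/response pair with the training reward model.
rm_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
rm_tokenizer = AutoTokenizer.from_pretrained(rm_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(rm_name)

question = "Explain quantum computing"           # example prompt
answer = "Quantum computers use qubits that..."  # example response
inputs = rm_tokenizer(question, answer, return_tensors="pt")

with torch.no_grad():
    score = reward_model(**inputs).logits[0].item()  # scalar reward
print(score)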

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load the base model in bfloat16 (matches the training precision)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Attach the LoRA adapter
model = PeftModel.from_pretrained(base_model, "kapilw25/llama3-8b-pku-PPO-Instruct-SFT-Instruct")

# Load the tokenizer shipped with the adapter
tokenizer = AutoTokenizer.from_pretrained("kapilw25/llama3-8b-pku-PPO-Instruct-SFT-Instruct")

# Build the prompt and generate (do_sample=True so temperature takes effect)
messages = [{"role": "user", "content": "Explain quantum computing"}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        input_ids,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
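
If a standalone checkpoint is preferred over loading the adapter at runtime, the LoRA weights can be merged into the base model with PEFT's merge_and_unload. A brief sketch; "./merged-model" is a hypothetical output path:

# Optional: merge the LoRA weights into the base model for standalone use
merged = model.merge_and_unload()
merged.save_pretrained("./merged-model")   # hypothetical path
tokenizer.save_pretrained("./merged-model")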

Intended Use

  • Primary Use: Safety-aligned conversational AI
  • Recommended: Instruction following with harm-refusal capabilities
  • Not Recommended: Medical/legal advice, factual knowledge (use base Llama-3.1 for general tasks)

Limitations

  • Fine-tuned on English-only safety dataset (PKU-SafeRLHF)
  • May refuse benign requests if phrased similarly to harmful prompts
  • LoRA adapter only; the base Llama-3.1-8B weights are required for inference

License

Llama 3.1 Community License (same as the base model)

Citation

@misc{llama3_8b_pku_PPO_Instruct_SFT_Instruct_2024,
  author = {User},
  title = {llama3-8b-pku-PPO-Instruct-SFT-Instruct},
  year = {2024},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/kapilw25/llama3-8b-pku-PPO-Instruct-SFT-Instruct}}
}

Framework Versions

  • Transformers: 4.46.3
  • PyTorch: 2.5.1
  • TRL: 0.12.1
  • PEFT: 0.13.2