๐Ÿ›ก๏ธ Guard Safety Classifier

A multi-task safety classifier built on DeBERTa-v3-small and trained on 3.9M+ samples for content moderation and safety detection.

🎯 Model Tasks

This model performs three simultaneous predictions (a sketch of the output shapes follows this list):

  1. Binary Safety Classification (is_safe)
     • ✅ Safe content
     • ⚠️ Unsafe content
  2. Single-Label Category Classification (category)
     • Identifies the primary safety concern category
  3. Multi-Label Categories (categories)
     • Detects multiple safety issues simultaneously
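
The model is distributed as raw weights rather than a ready-made pipeline, so each head returns logits. A rough sketch of the output contract (head widths taken from the category lists further down in this card; random tensors stand in for real model outputs):

import torch

# Hypothetical logits for a batch of 2 texts, standing in for real model outputs.
# Head widths follow this card: 2 safety classes, 26 categories, 28 multi-label classes.
outputs = {
    "is_safe": torch.randn(2, 2),      # binary logits; the Quick Start reads index 1 as "safe"
    "category": torch.randn(2, 26),    # one primary category per input
    "categories": torch.randn(2, 28),  # independent per-class logits
}

safe_prob = torch.softmax(outputs["is_safe"], dim=1)[:, 1]  # P(safe) per input
primary_idx = outputs["category"].argmax(dim=1)             # index into the category list
multi_mask = torch.sigmoid(outputs["categories"]) > 0.5     # boolean mask over multi-label classes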

📊 Performance Metrics

Metric             Score
is_safe Accuracy   92.76%
category F1        0.5037
categories F1      0.9068
Test Loss          1.0233
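
The card does not state which averaging the F1 scores use. Assuming standard scikit-learn metrics, they would be computed along these lines (the averaging modes and the tiny arrays here are placeholders, not the real test data):

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Placeholder labels standing in for the 397,856-sample test split.
y_true_safe = np.array([1, 0, 1, 1]); y_pred_safe = np.array([1, 0, 0, 1])
print(accuracy_score(y_true_safe, y_pred_safe))           # -> is_safe accuracy

y_true_cat = np.array([3, 7, 3, 12]); y_pred_cat = np.array([3, 7, 5, 12])
print(f1_score(y_true_cat, y_pred_cat, average="macro"))  # -> category F1 (averaging assumed)

Y_true = np.array([[1, 0, 1], [0, 1, 0]]); Y_pred = np.array([[1, 0, 0], [0, 1, 0]])
print(f1_score(Y_true, Y_pred, average="micro"))          # -> categories F1 (averaging assumed)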

🚀 Quick Start

import torch
from transformers import AutoTokenizer
import pickle

# Load model and tokenizer
model_name = "YOUR_USERNAME/guard-safety-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the model architecture (the class is not bundled with this card;
# a sketch is given under Model Architecture below)
from your_model_file import MultiTaskSafetyClassifier

NUM_CATEGORIES = 26    # length of the Categories list below
NUM_MULTI_LABELS = 28  # length of the Multi-Label Classes list below
model = MultiTaskSafetyClassifier(
    model_name="microsoft/deberta-v3-small",
    num_categories=NUM_CATEGORIES,
    num_multi_labels=NUM_MULTI_LABELS
)

# Load the trained weights (onto CPU; move to GPU afterwards if available)
model.load_state_dict(torch.load("model_weights.pt", map_location="cpu"))
model.eval()

# Load label encoders
with open("label_encoders.pkl", "rb") as f:
    encoders = pickle.load(f)
    le_category = encoders['le_category']
    mlb = encoders['mlb']

# Inference
text = "Your text here"
inputs = tokenizer(text, return_tensors="pt", max_length=128, 
                   truncation=True, padding=True)

with torch.no_grad():
    outputs = model(**inputs)
    
# Probability of class index 1 is read here as P(safe)
is_safe = torch.softmax(outputs['is_safe'], dim=1)[0][1].item() > 0.5
# Map the argmax index back to a category name via the fitted LabelEncoder
category = le_category.inverse_transform([outputs['category'].argmax(1).item()])[0]
# Threshold each sigmoid output at 0.5 (note: these classes are single
# characters; see Multi-Label Classes below)
categories = mlb.inverse_transform((torch.sigmoid(outputs['categories']) > 0.5).cpu().numpy())[0]

print(f"Is Safe: {is_safe}")
print(f"Category: {category}")
print(f"Categories: {list(categories)}")

๐Ÿ—๏ธ Model Architecture

  • Base Model: microsoft/deberta-v3-small (141M parameters)
  • Hidden Size: 768
  • Max Sequence Length: 128 tokens
  • Training Framework: PyTorch + Transformers
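
The MultiTaskSafetyClassifier class referenced in the Quick Start is not bundled with this card. A minimal sketch consistent with the architecture above (shared DeBERTa-v3-small encoder, 768-d representation feeding three linear heads) might look like the following; the pooling strategy, dropout, and head layout are assumptions, not the author's published code:

import torch
import torch.nn as nn
from transformers import AutoModel

class MultiTaskSafetyClassifier(nn.Module):
    """Sketch: shared encoder with three classification heads (assumed layout)."""

    def __init__(self, model_name="microsoft/deberta-v3-small",
                 num_categories=26, num_multi_labels=28, dropout=0.1):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size  # 768 for deberta-v3-small
        self.dropout = nn.Dropout(dropout)
        self.is_safe_head = nn.Linear(hidden, 2)
        self.category_head = nn.Linear(hidden, num_categories)
        self.categories_head = nn.Linear(hidden, num_multi_labels)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None, **kwargs):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask,
                           token_type_ids=token_type_ids)
        pooled = self.dropout(out.last_hidden_state[:, 0])  # [CLS]-token pooling (assumed)
        return {
            "is_safe": self.is_safe_head(pooled),
            "category": self.category_head(pooled),
            "categories": self.categories_head(pooled),
        }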

📚 Training Details

  • Dataset: budecosystem/guardrail-training-data
  • Training Samples: 3,182,844
  • Validation Samples: 397,855
  • Test Samples: 397,856
  • Batch Size: 64
  • Learning Rate: 2e-5
  • Epochs: 1
  • Optimizer: AdamW with linear warmup
  • Hardware: NVIDIA Tesla T4 (16GB)
  • Training Time: ~8 hours
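
The training script is not published; a plausible joint objective for the three heads, assuming equal weighting, is cross-entropy for the two single-label tasks plus binary cross-entropy for the multi-label task:

import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss()
bce = nn.BCEWithLogitsLoss()

def multitask_loss(outputs, batch):
    """Assumed joint objective; the relative weighting of the terms is not stated."""
    loss_safe = ce(outputs["is_safe"], batch["is_safe"])          # (N, 2) logits vs (N,) int labels
    loss_cat = ce(outputs["category"], batch["category"])         # (N, 26) logits vs (N,) int labels
    loss_multi = bce(outputs["categories"], batch["categories"])  # (N, 28) logits vs (N, 28) 0/1 floats
    return loss_safe + loss_cat + loss_multi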

๐Ÿท๏ธ Categories

The model can identify the following 26 safety categories; the snippet after the list shows how to rebuild the label encoder from them:

[
  "animal_abuse",
  "benign",
  "child_abuse",
  "code_vulnerabilities",
  "controversial_topics_politics",
  "cwe_compliance",
  "dangerous_expert_advice",
  "discrimination_stereotype_injustice",
  "drug_abuse_weapons_banned_substance",
  "financial_crime_property_crime_theft",
  "fraud_deception_misinformation",
  "gender_bias",
  "hate_speech_offensive_language",
  "jailbreak_prompt_injection",
  "malware_hacking_cyberattack",
  "misinformation_regarding_ethics_laws_and_safety",
  "mitre_compliance",
  "non_violent_unethical_behavior",
  "orientation_bias",
  "privacy_violation",
  "race_bias",
  "religious_bias",
  "self_harm",
  "sexually_explicit_adult_content",
  "terrorism_organized_crime",
  "violence_aiding_and_abetting_incitement"
]
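
If label_encoders.pkl is unavailable, the index-to-name mapping can plausibly be rebuilt from this list: scikit-learn's LabelEncoder sorts classes alphabetically, which matches the order shown above (this assumes the original encoder was fit on exactly these names):

from sklearn.preprocessing import LabelEncoder

# Refitting on the 26 names above should reproduce the bundled encoder,
# since LabelEncoder stores classes in sorted order.
le_category = LabelEncoder().fit([
    "animal_abuse", "benign", "child_abuse",
    # ... the remaining 23 names from the list above
])
print(le_category.inverse_transform([1]))  # -> ['benign']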

🔢 Multi-Label Classes

The bundled MultiLabelBinarizer exposes the following 28 classes (see the note after the list):

[
  " ",
  ",",
  "_",
  "a",
  "b",
  "c",
  "d",
  "e",
  "f",
  "g",
  "h",
  "i",
  "j",
  "k",
  "l",
  "m",
  "n",
  "o",
  "p",
  "r",
  "s",
  "t",
  "u",
  "v",
  "w",
  "x",
  "y",
  "z"
]
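
Note that these 28 classes are single characters rather than category names, which suggests the MultiLabelBinarizer was fit on raw label strings (which it iterates character by character) rather than on lists of labels. As a consequence, mlb.inverse_transform in the Quick Start returns tuples of characters. A small reproduction of the effect:

from sklearn.preprocessing import MultiLabelBinarizer

# Fitting on plain strings binarizes their *characters*, reproducing a
# class list like the one above.
mlb = MultiLabelBinarizer().fit(["self_harm", "benign"])
print(mlb.classes_)                # single characters, as in the list above
row = mlb.transform(["benign"])    # one row of 0/1 character indicators
print(mlb.inverse_transform(row))  # -> [('b', 'e', 'g', 'i', 'n')]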

โš™๏ธ Configuration

The full model configuration is available in config.json.

📄 License

Apache 2.0

๐Ÿ™ Acknowledgments

📮 Contact

For questions or issues, please open an issue on the model repository.
