YAML Metadata Warning: empty or missing yaml metadata in repo card (https://huggingface.co/docs/hub/model-cards#model-card-metadata)

Zen Guard Stream

4B Real-Time Streaming Safety Moderation

🌐 Website β€’ πŸ€— Hugging Face β€’ πŸ“„ Paper β€’ πŸ“– Documentation


Introduction

Zen Guard Stream is a 4B parameter token-level streaming classifier that evaluates each generated token in real time. It's optimized for real-time content moderation during LLM response generation.

Features

  • ⚑ 5ms/token Latency: Ultra-fast real-time classification
  • πŸ”„ Token-Level Analysis: Continuous stream monitoring
  • 🌍 119 Languages: Multilingual support
  • 🚦 Three-Tier Classification: Safe, Controversial, Unsafe
  • πŸ›‘οΈ Early Detection: Stop unsafe content mid-generation

How It Works

User Prompt β†’ Zen Guard Stream (safety check)
                     ↓
              If Safe, continue
                     ↓
LLM Response β†’ Token by Token β†’ Real-time Classification
                     ↓
              Immediate intervention if unsafe

Model Specifications

Specification Value
Parameters 4B
Type Streaming Classifier
Base Model Qwen3-4B
Context Length 32,768 tokens
Languages 119
Latency 5ms/token
VRAM (FP16) 8GB
VRAM (INT8) 4GB

Quick Start

import torch
from transformers import AutoModel, AutoTokenizer

model_path = "zenlm/zen-guard-stream"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval()

# Simulate streaming moderation
user_message = "Hello, how are you?"
assistant_message = "I'm doing well, thank you for asking!"
messages = [
    {"role": "user", "content": user_message},
    {"role": "assistant", "content": assistant_message}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
inputs = tokenizer(text, return_tensors="pt")
token_ids = inputs.input_ids[0]

# Initialize stream state
stream_state = None

# Process tokens one by one (simulating streaming)
for i, token_id in enumerate(token_ids):
    result, stream_state = model.stream_moderate_from_ids(
        token_id, 
        role="assistant" if i > len(user_message) else "user",
        stream_state=stream_state
    )
    
    token_str = tokenizer.decode([token_id])
    risk = result['risk_level'][-1]
    print(f"Token: {repr(token_str)} β†’ Risk: {risk}")

model.close_stream(stream_state)

Integration Pattern

With Streaming LLM

from transformers import AutoModel, AutoTokenizer, TextIteratorStreamer
import threading

# Load both models
llm = AutoModelForCausalLM.from_pretrained("zenlm/zen-next")
guard = AutoModel.from_pretrained("zenlm/zen-guard-stream", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("zenlm/zen-next")

def safe_generate(prompt):
    # Initial prompt check
    result, state = guard.stream_moderate_from_ids(
        tokenizer.encode(prompt, return_tensors="pt")[0],
        role="user",
        stream_state=None
    )
    
    if result['risk_level'][-1] != "Safe":
        return "I cannot process this request."
    
    # Stream generation with real-time moderation
    streamer = TextIteratorStreamer(tokenizer)
    inputs = tokenizer(prompt, return_tensors="pt")
    
    thread = threading.Thread(target=llm.generate, kwargs={**inputs, "streamer": streamer})
    thread.start()
    
    output = []
    for token in streamer:
        token_id = tokenizer.encode(token, add_special_tokens=False)
        if token_id:
            result, state = guard.stream_moderate_from_ids(
                token_id[0], role="assistant", stream_state=state
            )
            if result['risk_level'][-1] == "Unsafe":
                thread.join()
                return "".join(output) + " [Content moderated]"
        output.append(token)
    
    guard.close_stream(state)
    return "".join(output)

Performance

Metric Zen Guard Stream
Accuracy 95.2%
F1 Score 93.1%
False Positive 2.8%
Latency 5ms/token

Related Models

License

Apache 2.0

Citation

@misc{zenguardstream2025,
    title={Zen Guard Stream: Real-Time Token-Level Safety Moderation},
    author={Hanzo AI and Zoo Labs Foundation},
    year={2025},
    publisher={HuggingFace},
    howpublished={\url{https://huggingface.co/zenlm/zen-guard-stream}}
}

Based On

Built upon Qwen3Guard-Stream-4B.


Zen AI - Clarity Through Intelligence
zenlm.org

Downloads last month
16
Safetensors
Model size
3B params
Tensor type
BF16
Β·
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for zenlm/zen-guard-stream

Quantizations
1 model