# Zen VL 4B Instruct
Zen VL is a family of vision-language models with integrated function calling capabilities from Hanzo AI (Techstars '17).
This model (zen-vl-4b-instruct) is the identity fine-tuned variant, establishing the "Zen VL" persona across both text and vision modalities while preserving strong general-purpose vision-language understanding.
## Model Details
- Model Size: 4B parameters (3.5B non-embedding)
- Base Model: Qwen/Qwen3-VL-4B-Instruct
- Architecture: Qwen3-VL with DeepStack vision encoder, Interleaved-MRoPE, Text-Timestamp Alignment
- Context Length: 32K tokens (expandable to 256K; see the configuration sketch after this list)
- Developed by: Hanzo AI
- Model Type: Vision-Language Model (VLM)
- License: Apache 2.0 (inherited from Qwen3-VL)
- Language(s): Multilingual (32 languages for OCR)
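
The 256K extension noted above is typically enabled through rope scaling. Below is a minimal sketch assuming Qwen3-VL follows the YaRN recipe documented for other Qwen models; the field names, placement, and scaling factor are assumptions, so verify them against the base model's documentation:

```python
from transformers import AutoConfig, Qwen3VLForConditionalGeneration

# Sketch: extend the 32K default context toward 256K via YaRN rope scaling.
# ASSUMPTION: Qwen3-VL accepts the same rope_scaling dict as other Qwen
# models; check field names and placement against the Qwen3-VL docs.
config = AutoConfig.from_pretrained("zenlm/zen-vl-4b-instruct", trust_remote_code=True)
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 8.0,  # 32K * 8 = 256K (assumed factor)
    "original_max_position_embeddings": 32768,
}
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "zenlm/zen-vl-4b-instruct",
    config=config,
    torch_dtype="bfloat16",
    device_map="auto",
    trust_remote_code=True,
)
```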
## Training Data
This model was trained using:
### Primary Dataset
Custom Identity Dataset (150 examples):
- 100 text-only identity prompts
- 40 visual capability demonstrations
- 10 multimodal reasoning examples
- Focus: Establishing "Zen VL" identity from Hanzo AI
### Advanced Training Datasets (In Progress)
We have downloaded and are actively training with:
Agent Data Protocol (ADP) - 8.4 GB downloaded locally ✅
- Paper: Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-Tuning of LLM Agents
- Contributors: Carnegie Mellon, Ohio State, University of Hong Kong, Duke, All Hands AI
- Covers: Web browsing, coding, software engineering, tool use
- Downloaded: 15 configs including synatra (99k), code_feedback (66k), go-browse-wa (27k), nebius_SWE-agent (13k)
- Total: ~220,000 trajectories
- Expected gain: +20% on agent benchmarks
xLAM Function Calling 60k - 101 MB downloaded locally ✅
- From: Salesforce Research
- Paper: xLAM: A Family of Large Action Models
- Focus: High-quality function calling and API use
- Downloaded: 60,000 function calling trajectories
- Expected additional gain: +5% on function calling tasks
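
For reference, the xLAM corpus can be inspected directly from the Hugging Face Hub. A minimal sketch follows; note that the dataset is gated, so you may need to authenticate and accept its license first:

```python
from datasets import load_dataset

# Sketch: inspect the xLAM function-calling corpus from the Hub
# (gated dataset; authenticate and accept the license first).
ds = load_dataset("Salesforce/xlam-function-calling-60k", split="train")
print(len(ds))          # expected: 60,000 trajectories
print(ds.column_names)  # inspect the record schema
print(ds[0])            # one function-calling trajectory
```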
Training Status: Agent training is 24% complete. Combined ADP+xLAM retraining is queued, targeting the combined +25% gain on agent and function-calling benchmarks noted above.
## Capabilities
- ✅ Visual Understanding: Image analysis, OCR (32 languages), scene understanding
- ✅ Multimodal Reasoning: Chart analysis, diagram understanding, visual QA
- ✅ Identity Consistency: Maintains "Zen VL from Hanzo AI" persona
- 🚧 Function Calling: Coming in the zen-vl-4b-agent variant
- 🚧 GUI Interaction: Coming in ADP-trained versions
## Usage
```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

# Load model and processor
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "zenlm/zen-vl-4b-instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "zenlm/zen-vl-4b-instruct",
    trust_remote_code=True,
)

# Prepare a text-only prompt
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Who are you?"},
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], return_tensors="pt").to(model.device)

# Generate, then decode only the newly generated tokens
# (decoding the full output would repeat the prompt)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=150)
new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
print(response)
# Output: "I'm Zen VL, a vision-language model from the Zen family, created by Hanzo AI..."
```
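
On CUDA GPUs with limited memory, a 4-bit quantized load is a common alternative to the bfloat16 load above. This is a sketch using the standard bitsandbytes integration, not part of this card's tested setup (bitsandbytes requires CUDA, so it does not apply to the Apple Silicon rig used for training):

```python
from transformers import Qwen3VLForConditionalGeneration, BitsAndBytesConfig
import torch

# Sketch: 4-bit quantized load via bitsandbytes (CUDA only; untested here)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "zenlm/zen-vl-4b-instruct",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```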
### With Images
```python
# Load image
image = Image.open("path/to/image.jpg")

# The user turn must include an image placeholder so the chat template
# inserts the vision tokens alongside the text
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What do you see in this image?"},
        ],
    },
]

# Process text and image together
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(
    text=[text],
    images=[image],
    return_tensors="pt",
).to(model.device)

# Generate and decode only the new tokens
outputs = model.generate(**inputs, max_new_tokens=200)
new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
print(response)
```
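
Multiple images can be passed the same way, with one placeholder per image. The sketch below assumes the processor pairs placeholders with the `images=` list in order, as other Qwen-VL processors do; the file names are hypothetical:

```python
# Sketch: two-image comparison prompt; one {"type": "image"} placeholder per
# image, matched in order with images= (assumed Qwen-VL processor behavior)
img_a = Image.open("chart_2023.png")  # hypothetical file
img_b = Image.open("chart_2024.png")  # hypothetical file

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "What changed between these two charts?"},
        ],
    },
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[img_a, img_b], return_tensors="pt").to(model.device)
```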
## Model Variants
The Zen VL family includes:
| Model | Size | Type | Description | Link |
|---|---|---|---|---|
| zen-vl-4b-instruct | 4B | Base VL | Identity fine-tuning only | 🤗 HF |
| zen-vl-4b-agent | 4B | VL + Functions | With function calling | 🤗 HF |
| zen-vl-8b-instruct | 9B | Base VL | Identity fine-tuning only | 🤗 HF |
| zen-vl-8b-agent | 9B | VL + Functions | With function calling | 🤗 HF |
| zen-vl-30b-instruct | 31B | Base VL (MoE) | Identity fine-tuning only | 🤗 HF |
| zen-vl-30b-agent | 31B | VL + Functions (MoE) | With function calling | 🤗 HF |
## Training Details
### Training Hyperparameters
- Epochs: 3
- Batch Size: 1 (per device)
- Gradient Accumulation: 4 (effective batch size: 4)
- Learning Rate: 2e-5
- LR Schedule: Cosine with 3% warmup
- Optimizer: AdamW
- Weight Decay: 0.0
- Max Gradient Norm: 1.0
- Precision: bfloat16
- Device: MPS (Apple Silicon)
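
The hyperparameters above map directly onto the standard Hugging Face Trainer API. A minimal sketch follows; the argument names are standard transformers `TrainingArguments`, but this is not the exact training script used, and the output path is hypothetical:

```python
from transformers import TrainingArguments

# Sketch reproducing the listed hyperparameters; not the actual script
args = TrainingArguments(
    output_dir="zen-vl-4b-instruct-identity",  # hypothetical path
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,  # effective batch size: 1 * 4 = 4
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,              # 3% warmup
    optim="adamw_torch",
    weight_decay=0.0,
    max_grad_norm=1.0,
    bf16=True,                      # bfloat16 precision, as listed
)
```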
### Training Infrastructure
- Hardware: Apple M3 Max, 128GB RAM
- Framework: PyTorch 2.3.0, Transformers 4.57.1
- Training Time: ~3.5 hours
- Dataset Size: 150 examples
## Evaluation
Identity Tests (Perfect Score: 4/4):
- β "Who are you?" β Correctly mentions "Zen VL" and "Hanzo AI"
- β "What is your name?" β Identifies as "Zen VL"
- β "Tell me about yourself" β Describes vision-language capabilities
- β "Who created you?" β Attributes to "Hanzo AI"
General Knowledge: Preserved from base Qwen3-VL model
Visual Capabilities: Maintained from base model
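
The identity checks above amount to substring assertions over model responses. The sketch below shows the kind of harness that could run them, reusing `model` and `processor` from the Usage section; it is illustrative, not the actual evaluation script:

```python
# Sketch: generate a reply per identity prompt and assert the expected
# strings appear in it. Not the actual evaluation script.
identity_tests = {
    "Who are you?": ["Zen VL", "Hanzo AI"],
    "What is your name?": ["Zen VL"],
    "Who created you?": ["Hanzo AI"],
}

def ask(prompt: str) -> str:
    messages = [{"role": "user", "content": prompt}]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=100)
    new_tokens = out[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]

for prompt, expected in identity_tests.items():
    reply = ask(prompt)
    ok = all(s in reply for s in expected)
    print(f"{'PASS' if ok else 'FAIL'}: {prompt}")
```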
## Limitations
- Function Calling: Not available in this variant (use zen-vl-4b-agent)
- Dataset Size: Small identity dataset (150 examples)
- Evaluation: Limited benchmarking (comprehensive eval coming)
- Video: Basic video support (full temporal reasoning in development)
## Bias, Risks, and Ethical Considerations
- Inherits biases from Qwen3-VL base model
- Identity training may reinforce certain response patterns
- Should not be used for malicious purposes (surveillance, deepfakes, etc.)
- OCR capabilities could extract sensitive information - use responsibly
- See Qwen3-VL model card for additional considerations
## Citation
If you use Zen VL in your research, please cite:
```bibtex
@software{zen_vl_2025,
  title  = {Zen VL: Vision-Language Models with Integrated Function Calling},
  author = {Hanzo AI Research Team},
  year   = {2025},
  url    = {https://github.com/zenlm/zen-vl},
  note   = {Built on Qwen3-VL architecture}
}

@article{adp_2025,
  title   = {Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-Tuning of LLM Agents},
  author  = {Song, Yueqi and others},
  journal = {arXiv preprint arXiv:2510.24702},
  year    = {2025}
}
```
## Acknowledgments
- Qwen Team at Alibaba Cloud for the excellent Qwen3-VL base model
- neulab (CMU, OSU, HKU, Duke, All Hands AI) for the Agent Data Protocol
- Salesforce Research for xLAM function calling dataset
## Resources
- GitHub: https://github.com/zenlm/zen-vl
- HuggingFace: https://huggingface.co/zenlm
- Website: https://zenlm.org
- Paper: Coming soon
## Model Card Contact
For questions or feedback:
- GitHub Issues: https://github.com/zenlm/zen-vl/issues
- Organization: Hanzo AI