# Zen VL 4B Instruct
Zen VL is a family of vision-language models with integrated function calling capabilities from Hanzo AI (Techstars '17).
This model (zen-vl-4b-instruct) is the identity fine-tuned variant, establishing the "Zen VL" persona across both text and vision modalities while preserving strong general-purpose vision-language understanding.
## Model Details
- Model Size: 4B parameters (3.5B non-embedding)
- Base Model: Qwen/Qwen3-VL-4B-Instruct
- Architecture: Qwen3-VL with DeepStack vision encoder, Interleaved-MRoPE, Text-Timestamp Alignment
- Context Length: 32K tokens (expandable to 256K; see the configuration sketch after this list)
- Developed by: Hanzo AI
- Model Type: Vision-Language Model (VLM)
- License: Apache 2.0 (inherited from Qwen3-VL)
- Language(s): Multilingual (32 languages for OCR)
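
The 256K extension noted above is typically enabled through rope scaling. Below is a minimal sketch assuming Qwen3-VL follows the YaRN recipe documented for other Qwen models; the field names, placement, and scaling factor are assumptions, so verify them against the base model's documentation:

```python
from transformers import AutoConfig, Qwen3VLForConditionalGeneration

# Sketch: extend the 32K default context toward 256K via YaRN rope scaling.
# ASSUMPTION: Qwen3-VL accepts the same rope_scaling dict as other Qwen
# models; check field names and placement against the Qwen3-VL docs.
config = AutoConfig.from_pretrained("zenlm/zen-vl-4b-instruct", trust_remote_code=True)
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 8.0,  # 32K * 8 = 256K (assumed factor)
    "original_max_position_embeddings": 32768,
}
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "zenlm/zen-vl-4b-instruct",
    config=config,
    torch_dtype="bfloat16",
    device_map="auto",
    trust_remote_code=True,
)
```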
## Training Data
This model was trained using:
### Primary Dataset
Custom Identity Dataset (150 examples):
- 100 text-only identity prompts
- 40 visual capability demonstrations
- 10 multimodal reasoning examples
- Focus: Establishing "Zen VL" identity from Hanzo AI
### Advanced Training Datasets (In Progress)
We have downloaded and are actively training with:
Agent Data Protocol (ADP) - 8.4 GB downloaded locally ✅
- Paper: Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-Tuning of LLM Agents
- Contributors: Carnegie Mellon, Ohio State, University of Hong Kong, Duke, All Hands AI
- Covers: Web browsing, coding, software engineering, tool use
- Downloaded: 15 configs including synatra (99k), code_feedback (66k), go-browse-wa (27k), nebius_SWE-agent (13k)
- Total: ~220,000 trajectories
- Expected gain: +20% on agent benchmarks
xLAM Function Calling 60k - 101 MB downloaded locally ✅
- From: Salesforce Research
- Paper: xLAM: A Family of Large Action Models
- Focus: High-quality function calling and API use
- Downloaded: 60,000 function calling trajectories
- Expected additional gain: +5% on function calling tasks
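
For reference, the xLAM corpus can be inspected directly from the Hugging Face Hub. A minimal sketch follows; note that the dataset is gated, so you may need to authenticate and accept its license first:

```python
from datasets import load_dataset

# Sketch: inspect the xLAM function-calling corpus from the Hub
# (gated dataset; authenticate and accept the license first).
ds = load_dataset("Salesforce/xlam-function-calling-60k", split="train")
print(len(ds))          # expected: 60,000 trajectories
print(ds.column_names)  # inspect the record schema
print(ds[0])            # one function-calling trajectory
```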
Training Status: Agent training is 24% complete. Combined ADP+xLAM retraining is queued, targeting the combined +25% gain on agent and function-calling benchmarks noted above.
## Capabilities
- ✅ Visual Understanding: Image analysis, OCR (32 languages), scene understanding
- ✅ Multimodal Reasoning: Chart analysis, diagram understanding, visual QA
- ✅ Identity Consistency: Maintains "Zen VL from Hanzo AI" persona
- 🚧 Function Calling: Coming in the zen-vl-4b-agent variant
- 🚧 GUI Interaction: Coming in ADP-trained versions
## Usage
```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

# Load model and processor
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "zenlm/zen-vl-4b-instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "zenlm/zen-vl-4b-instruct",
    trust_remote_code=True,
)

# Prepare a text-only prompt
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Who are you?"},
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], return_tensors="pt").to(model.device)

# Generate, then decode only the newly generated tokens
# (decoding the full output would repeat the prompt)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=150)
new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
print(response)
# Output: "I'm Zen VL, a vision-language model from the Zen family, created by Hanzo AI..."
```
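
On CUDA GPUs with limited memory, a 4-bit quantized load is a common alternative to the bfloat16 load above. This is a sketch using the standard bitsandbytes integration, not part of this card's tested setup (bitsandbytes requires CUDA, so it does not apply to the Apple Silicon rig used for training):

```python
from transformers import Qwen3VLForConditionalGeneration, BitsAndBytesConfig
import torch

# Sketch: 4-bit quantized load via bitsandbytes (CUDA only; untested here)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "zenlm/zen-vl-4b-instruct",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```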
### With Images
```python
# Load image
image = Image.open("path/to/image.jpg")

# The user turn must include an image placeholder so the chat template
# inserts the vision tokens alongside the text
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What do you see in this image?"},
        ],
    },
]

# Process text and image together
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(
    text=[text],
    images=[image],
    return_tensors="pt",
).to(model.device)

# Generate and decode only the new tokens
outputs = model.generate(**inputs, max_new_tokens=200)
new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
print(response)
```
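
Multiple images can be passed the same way, with one placeholder per image. The sketch below assumes the processor pairs placeholders with the `images=` list in order, as other Qwen-VL processors do; the file names are hypothetical:

```python
# Sketch: two-image comparison prompt; one {"type": "image"} placeholder per
# image, matched in order with images= (assumed Qwen-VL processor behavior)
img_a = Image.open("chart_2023.png")  # hypothetical file
img_b = Image.open("chart_2024.png")  # hypothetical file

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "What changed between these two charts?"},
        ],
    },
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[img_a, img_b], return_tensors="pt").to(model.device)
```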
## Model Variants
The Zen VL family includes:
| Model | Size | Type | Description | Link |
|---|---|---|---|---|
| zen-vl-4b-instruct | 4B | Base VL | Identity fine-tuning only | 🤗 HF |
| zen-vl-4b-agent | 4B | VL + Functions | With function calling | 🤗 HF |
| zen-vl-8b-instruct | 9B | Base VL | Identity fine-tuning only | 🤗 HF |
| zen-vl-8b-agent | 9B | VL + Functions | With function calling | 🤗 HF |
| zen-vl-30b-instruct | 31B | Base VL (MoE) | Identity fine-tuning only | 🤗 HF |
| zen-vl-30b-agent | 31B | VL + Functions (MoE) | With function calling | 🤗 HF |
## Training Details
### Training Hyperparameters
- Epochs: 3
- Batch Size: 1 (per device)
- Gradient Accumulation: 4 (effective batch size: 4)
- Learning Rate: 2e-5
- LR Schedule: Cosine with 3% warmup
- Optimizer: AdamW
- Weight Decay: 0.0
- Max Gradient Norm: 1.0
- Precision: bfloat16
- Device: MPS (Apple Silicon)
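
The hyperparameters above map directly onto the standard Hugging Face Trainer API. A minimal sketch follows; the argument names are standard transformers `TrainingArguments`, but this is not the exact training script used, and the output path is hypothetical:

```python
from transformers import TrainingArguments

# Sketch reproducing the listed hyperparameters; not the actual script
args = TrainingArguments(
    output_dir="zen-vl-4b-instruct-identity",  # hypothetical path
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,  # effective batch size: 1 * 4 = 4
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,              # 3% warmup
    optim="adamw_torch",
    weight_decay=0.0,
    max_grad_norm=1.0,
    bf16=True,                      # bfloat16 precision, as listed
)
```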
### Training Infrastructure
- Hardware: Apple M3 Max, 128GB RAM
- Framework: PyTorch 2.3.0, Transformers 4.57.1
- Training Time: ~3.5 hours
- Dataset Size: 150 examples
## Evaluation
Identity Tests (Perfect Score: 4/4):
- β "Who are you?" β Correctly mentions "Zen VL" and "Hanzo AI"
- β "What is your name?" β Identifies as "Zen VL"
- β "Tell me about yourself" β Describes vision-language capabilities
- β "Who created you?" β Attributes to "Hanzo AI"
General Knowledge: Preserved from base Qwen3-VL model
Visual Capabilities: Maintained from base model
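
The identity checks above amount to substring assertions over model responses. The sketch below shows the kind of harness that could run them, reusing `model` and `processor` from the Usage section; it is illustrative, not the actual evaluation script:

```python
# Sketch: generate a reply per identity prompt and assert the expected
# strings appear in it. Not the actual evaluation script.
identity_tests = {
    "Who are you?": ["Zen VL", "Hanzo AI"],
    "What is your name?": ["Zen VL"],
    "Who created you?": ["Hanzo AI"],
}

def ask(prompt: str) -> str:
    messages = [{"role": "user", "content": prompt}]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=100)
    new_tokens = out[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]

for prompt, expected in identity_tests.items():
    reply = ask(prompt)
    ok = all(s in reply for s in expected)
    print(f"{'PASS' if ok else 'FAIL'}: {prompt}")
```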
## Limitations
- Function Calling: Not available in this variant (use zen-vl-4b-agent)
- Dataset Size: Small identity dataset (150 examples)
- Evaluation: Limited benchmarking (comprehensive eval coming)
- Video: Basic video support (full temporal reasoning in development)
## Bias, Risks, and Ethical Considerations
- Inherits biases from Qwen3-VL base model
- Identity training may reinforce certain response patterns
- Should not be used for malicious purposes (surveillance, deepfakes, etc.)
- OCR capabilities could extract sensitive information - use responsibly
- See Qwen3-VL model card for additional considerations
## Citation
If you use Zen VL in your research, please cite:
```bibtex
@software{zen_vl_2025,
  title  = {Zen VL: Vision-Language Models with Integrated Function Calling},
  author = {Hanzo AI Research Team},
  year   = {2025},
  url    = {https://github.com/zenlm/zen-vl},
  note   = {Built on Qwen3-VL architecture}
}

@article{adp_2025,
  title   = {Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-Tuning of LLM Agents},
  author  = {Song, Yueqi and others},
  journal = {arXiv preprint arXiv:2510.24702},
  year    = {2025}
}
```
## Acknowledgments
- Qwen Team at Alibaba Cloud for the excellent Qwen3-VL base model
- neulab (CMU, OSU, HKU, Duke, All Hands AI) for the Agent Data Protocol
- Salesforce Research for xLAM function calling dataset
## Resources
- GitHub: https://github.com/zenlm/zen-vl
- HuggingFace: https://huggingface.co/zenlm
- Website: https://zenlm.org
- Paper: Coming soon
## Model Card Contact
For questions or feedback:
- GitHub Issues: https://github.com/zenlm/zen-vl/issues
- Organization: Hanzo AI