
Zen VL 4B Instruct

Zen VL is a family of vision-language models with integrated function calling, developed by Hanzo AI (Techstars '17).

This model (zen-vl-4b-instruct) is the identity fine-tuned variant, establishing the "Zen VL" persona across both text and vision modalities while preserving strong general-purpose vision-language understanding.

Model Details

  • Model Size: 4B parameters (3.5B non-embedding)
  • Base Model: Qwen/Qwen3-VL-4B-Instruct
  • Architecture: Qwen3-VL with DeepStack vision encoder, Interleaved-MRoPE, Text-Timestamp Alignment
  • Context Length: 32K tokens (expandable to 256K); see the config check below the list
  • Developed by: Hanzo AI
  • Model Type: Vision-Language Model (VLM)
  • License: Apache 2.0 (inherited from Qwen3-VL)
  • Language(s): Multilingual (32 languages for OCR)
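
The figures above can be sanity-checked directly from the published configuration. The snippet below is a minimal sketch: it assumes a standard Transformers install and the zenlm/zen-vl-4b-instruct repo id, and the exact attribute layout (top-level vs. a nested text_config) may vary between Transformers versions.

from transformers import AutoConfig

# Load only the configuration (no weights) to inspect the basics listed above
config = AutoConfig.from_pretrained("zenlm/zen-vl-4b-instruct", trust_remote_code=True)

# Qwen3-VL-style configs typically nest the language-model settings under `text_config`;
# fall back to the top-level config if that attribute is absent
text_config = getattr(config, "text_config", config)

print("Model type:    ", config.model_type)
print("Context length:", getattr(text_config, "max_position_embeddings", "n/a"))
print("Hidden size:   ", getattr(text_config, "hidden_size", "n/a"))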

Training Data

This model was trained using:

Primary Dataset

Custom Identity Dataset (150 examples):

  • 100 text-only identity prompts
  • 40 visual capability demonstrations
  • 10 multimodal reasoning examples
  • Focus: Establishing "Zen VL" identity from Hanzo AI (illustrative record format below)
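
For illustration, a record in this kind of identity dataset is a short chat transcript. The example below is hypothetical: the field names and exact wording are assumptions for the sketch, not an excerpt from the actual training file.

# Hypothetical identity-tuning record in chat format; field names are illustrative only
identity_example = {
    "messages": [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Who are you?"},
        {"role": "assistant", "content": "I'm Zen VL, a vision-language model from the Zen family, created by Hanzo AI."},
    ]
}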

Advanced Training Datasets (In Progress)

We have downloaded and are actively training with:

  1. Agent Data Protocol (ADP) - 8.4GB downloaded locally ✅

  2. xLAM Function Calling 60k - 101MB downloaded locally ✅

    • From: Salesforce Research
    • Paper: xLAM: A Family of Large Action Models
    • Focus: High-quality function calling and API use (see the loading sketch after this list)
    • Downloaded: 60,000 function calling trajectories
    • Expected additional gain: +5% on function calling tasks
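
The xLAM data can be pulled with the datasets library. The sketch below assumes the public Salesforce/xlam-function-calling-60k dataset id (the repo may require accepting its license terms on the Hub first); check the actual column names before building a converter.

from datasets import load_dataset

# Assumed dataset id; accepting the dataset's terms on the Hub may be required
xlam = load_dataset("Salesforce/xlam-function-calling-60k", split="train")

print(len(xlam))        # expected: 60,000 function calling trajectories
print(xlam[0].keys())   # inspect the record schema before converting to chat format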

Training Status: Agent training is 24% complete. A combined ADP + xLAM retraining run is queued, with an expected total performance gain of roughly +25%.

Capabilities

  • ✅ Visual Understanding: Image analysis, OCR (32 languages), scene understanding
  • ✅ Multimodal Reasoning: Chart analysis, diagram understanding, visual QA
  • ✅ Identity Consistency: Maintains "Zen VL from Hanzo AI" persona
  • 🔄 Function Calling: Coming in zen-vl-4b-agent variant
  • 🔄 GUI Interaction: Coming in ADP-trained versions

Usage

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

# Load model
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "zenlm/zen-vl-4b-instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

processor = AutoProcessor.from_pretrained(
    "zenlm/zen-vl-4b-instruct",
    trust_remote_code=True
)

# Prepare input
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Who are you?"}
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], return_tensors="pt").to(model.device)

# Generate, then strip the prompt tokens so only the new reply is decoded
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=150)

generated = outputs[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(response)
# Output: "I'm Zen VL, a vision-language model from the Zen family, created by Hanzo AI..."

With Images

# Load image
image = Image.open("path/to/image.jpg")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What do you see in this image?"},
        ],
    },
]

# Process with image: the {"type": "image"} entry makes the chat template insert the
# image placeholder tokens that the processor pairs with the PIL image passed below
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(
    text=[text],
    images=[image],
    return_tensors="pt"
).to(model.device)

# Generate and decode only the newly generated tokens
outputs = model.generate(**inputs, max_new_tokens=200)
generated = outputs[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(response)
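
The same pipeline handles OCR-style extraction; only the prompt (and, optionally, the decoding settings) changes. A minimal variation, assuming model, processor, image, and the messages list above are already in scope:

# Swap the question for a transcription request and decode greedily
messages[1]["content"][1]["text"] = "Transcribe all text visible in this image."

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=300, do_sample=False)
generated = outputs[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generated, skip_special_tokens=True)[0])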

Model Variants

The Zen VL family includes:

| Model | Size | Type | Description | Link |
|-------|------|------|-------------|------|
| zen-vl-4b-instruct | 4B | Base VL | Identity fine-tuning only | 🤗 HF |
| zen-vl-4b-agent | 4B | VL + Functions | With function calling | 🤗 HF |
| zen-vl-8b-instruct | 9B | Base VL | Identity fine-tuning only | 🤗 HF |
| zen-vl-8b-agent | 9B | VL + Functions | With function calling | 🤗 HF |
| zen-vl-30b-instruct | 31B | Base VL (MoE) | Identity fine-tuning only | 🤗 HF |
| zen-vl-30b-agent | 31B | VL + Functions (MoE) | With function calling | 🤗 HF |

Training Details

Training Hyperparameters

  • Epochs: 3
  • Batch Size: 1 (per device)
  • Gradient Accumulation: 4 (effective batch size: 4)
  • Learning Rate: 2e-5
  • LR Schedule: Cosine with 3% warmup
  • Optimizer: AdamW
  • Weight Decay: 0.0
  • Max Gradient Norm: 1.0
  • Precision: bfloat16
  • Device: MPS (Apple Silicon)
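
For reference, these settings map onto a standard transformers.TrainingArguments configuration roughly as sketched below; this is an illustrative mapping, not the exact training script, and the output_dir is a placeholder.

from transformers import TrainingArguments

# Illustrative mapping of the hyperparameters listed above
training_args = TrainingArguments(
    output_dir="zen-vl-4b-instruct-identity",   # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,              # effective batch size: 4
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    optim="adamw_torch",
    weight_decay=0.0,
    max_grad_norm=1.0,
    bf16=True,                                  # bf16 on MPS depends on the torch/Transformers versions
)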

Training Infrastructure

  • Hardware: Apple M3 Max, 128GB RAM
  • Framework: PyTorch 2.3.0, Transformers 4.57.1
  • Training Time: ~3.5 hours
  • Dataset Size: 150 examples
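
To reproduce a similar setup, device selection is the main difference from a CUDA workflow; a minimal check using standard PyTorch APIs:

import torch

# Prefer Apple Silicon's MPS backend when available, as used for this training run
if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"
print(f"Using device: {device}")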

Evaluation

Identity Tests (Perfect Score: 4/4):

  • βœ… "Who are you?" β†’ Correctly mentions "Zen VL" and "Hanzo AI"
  • βœ… "What is your name?" β†’ Identifies as "Zen VL"
  • βœ… "Tell me about yourself" β†’ Describes vision-language capabilities
  • βœ… "Who created you?" β†’ Attributes to "Hanzo AI"

General Knowledge: Preserved from base Qwen3-VL model

Visual Capabilities: Maintained from base model

Limitations

  • Function Calling: Not available in this variant (use zen-vl-4b-agent)
  • Dataset Size: Small identity dataset (150 examples)
  • Evaluation: Limited benchmarking (comprehensive eval coming)
  • Video: Basic video support (full temporal reasoning in development)

Bias, Risks, and Ethical Considerations

  • Inherits biases from Qwen3-VL base model
  • Identity training may reinforce certain response patterns
  • Should not be used for malicious purposes (surveillance, deepfakes, etc.)
  • OCR capabilities could extract sensitive information; use responsibly
  • See Qwen3-VL model card for additional considerations

Citation

If you use Zen VL in your research, please cite:

@software{zen_vl_2025,
  title = {Zen VL: Vision-Language Models with Integrated Function Calling},
  author = {Hanzo AI Research Team},
  year = {2025},
  url = {https://github.com/zenlm/zen-vl},
  note = {Built on Qwen3-VL architecture}
}

@article{adp_2025,
  title={Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-Tuning of LLM Agents},
  author={Song, Yueqi and others},
  journal={arXiv preprint arXiv:2510.24702},
  year={2025}
}

Acknowledgments

  • Qwen Team at Alibaba Cloud for the excellent Qwen3-VL base model
  • neulab (CMU, OSU, HKU, Duke, All Hands AI) for the Agent Data Protocol
  • Salesforce Research for xLAM function calling dataset

Resources

Model Card Contact

For questions or feedback:
