---
license: apache-2.0
tags:
- vision-language
- multimodal
- function-calling
- visual-agents
- qwen3-vl
- zen
language:
- en
- multilingual
base_model:
- Qwen/Qwen3-VL-8B-Instruct
library_name: transformers
pipeline_tag: image-text-to-text
---

# Zen VL 8B Agent

Zen VL 8B Agent is a vision-language model with function calling (8B parameters), fine-tuned from Qwen3-VL-8B-Instruct.

## Model Details

- **Architecture**: Qwen3-VL
- **Parameters**: 8B
- **Context Window**: 256K tokens (expandable to 1M)
- **License**: Apache 2.0
- **Training**: Fine-tuned with Zen identity and function calling

## Capabilities

- 🎨 **Visual Understanding**: Image analysis, video comprehension, spatial reasoning
- πŸ“ **OCR**: Text extraction in 32 languages
- 🧠 **Multimodal Reasoning**: STEM, math, code generation
- πŸ› οΈ **Function Calling**: Tool use with visual context
- πŸ€– **Visual Agents**: GUI interaction, parameter extraction

## Usage

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from PIL import Image

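# Load the model and its processor
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "zenlm/zen-vl-8b-agent",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("zenlm/zen-vl-8b-agent")

# Pair the image with the text prompt; the {"type": "image"} entry makes the
# chat template emit the image placeholder tokens the processor expects
image = Image.open("example.jpg")
prompt = "What's in this image?"

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": prompt},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# Generate, then decode only the newly generated tokens (not the echoed prompt)
outputs = model.generate(**inputs, max_new_tokens=256)
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
response = processor.decode(new_tokens, skip_special_tokens=True)
print(response)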
```
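
For function calling, the generated text has to be turned into structured tool calls before anything can be executed. The following is a minimal sketch, assuming the model emits tool calls in the Hermes-style `<tool_call>{...}</tool_call>` JSON format common to Qwen chat templates; this model card does not confirm that format. The helper `extract_tool_calls`, the `click` tool, and the sample response are hypothetical and for illustration only.

```python
import json
import re

# Assumed (not confirmed by this model card) tool-call format:
# <tool_call>{"name": "...", "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def extract_tool_calls(response: str) -> list[dict]:
    """Return every well-formed tool call found in a model response."""
    calls = []
    for payload in TOOL_CALL_RE.findall(response):
        try:
            call = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip payloads that are not valid JSON
        if isinstance(call, dict) and "name" in call:
            calls.append({
                "name": call["name"],
                "arguments": call.get("arguments", {}),
            })
    return calls


# Hypothetical response from a GUI-agent prompt
sample = (
    "I will click the submit button.\n"
    "<tool_call>\n"
    '{"name": "click", "arguments": {"x": 120, "y": 48}}\n'
    "</tool_call>"
)
calls = extract_tool_calls(sample)
print(calls)
```

Each returned dictionary can then be dispatched to the matching function in your own application. Malformed payloads are skipped rather than raised, so a partially valid response still yields its usable calls.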

## Links

- 🌐 **Website**: [zenlm.org](https://zenlm.org)
- 📚 **GitHub**: [zenlm/zen-vl](https://github.com/zenlm/zen-vl)
- 📄 **Paper**: Coming soon
- 🤗 **Model Family**: [zenlm](https://huggingface.co/zenlm)

## Citation

```bibtex
@misc{zenvl2025,
  title={Zen VL: Vision-Language Models with Integrated Function Calling},
  author={Hanzo AI Team},
  year={2025},
  publisher={Zen Language Models},
  url={https://github.com/zenlm/zen-vl}
}
```

## License

Apache 2.0

---

Created by [Hanzo AI](https://hanzo.ai) for the Zen model family.