---
license: apache-2.0
tags:
- vision-language
- multimodal
- function-calling
- visual-agents
- qwen3-vl
- zen
language:
- en
- multilingual
base_model:
- Qwen/Qwen3-VL-8B-Instruct
library_name: transformers
pipeline_tag: image-text-to-text
---
# Zen VL 8B Agent

Zen VL 8B Agent is an 8B-parameter vision-language model with integrated function calling, fine-tuned from Qwen3-VL-8B-Instruct.
## Model Details

- **Architecture**: Qwen3-VL
- **Parameters**: 8B
- **Context Window**: 256K tokens, expandable to 1M (see the configuration sketch below)
- **License**: Apache 2.0
- **Training**: Fine-tuned with Zen identity and function calling

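The advertised limits can be sanity-checked against the published configuration without downloading weights. A minimal sketch, assuming the language-model settings are nested under `text_config` (common for vision-language configs in `transformers`); exact field names may differ between releases:

```python
from transformers import AutoConfig

# Load only the configuration (no weights) to inspect the advertised limits.
config = AutoConfig.from_pretrained("zenlm/zen-vl-8b-agent")

# Assumption: VL configs usually nest the language-model settings under `text_config`;
# fall back to the top-level config if that attribute is absent.
text_config = getattr(config, "text_config", config)
print(getattr(text_config, "max_position_embeddings", "not listed"))
```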
## Capabilities

- **Visual Understanding**: Image analysis, video comprehension, spatial reasoning
- **OCR**: Text extraction in 32 languages
- **Multimodal Reasoning**: STEM, math, code generation
- **Function Calling**: Tool use with visual context (see the sketch below)
- **Visual Agents**: GUI interaction, parameter extraction

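A minimal function-calling sketch to illustrate the tool-use capability above. The `get_weather` tool, its schema, and the image path are illustrative assumptions, not part of this model card; recent `transformers` chat templates accept a `tools=` argument, and Qwen-family templates typically wrap tool invocations in `<tool_call>...</tool_call>` tags, but check this model's chat template for the exact markup and forwarding behavior.

```python
import json
import re

from PIL import Image
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "zenlm/zen-vl-8b-agent",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("zenlm/zen-vl-8b-agent")

# Hypothetical tool schema (JSON-schema style), for illustration only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

image = Image.open("billboard.jpg")  # placeholder image, e.g. a photo showing a city name
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Which city is shown here? Check its current weather."},
    ],
}]

# Assumption: this transformers version forwards `tools` to the chat template.
# If it does not, describe the tools in a system message instead.
text = processor.apply_chat_template(messages, tools=tools, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
# Keep special tokens so any <tool_call> markup remains visible in the decoded text.
response = processor.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=False)

# Qwen-style templates usually emit calls as JSON wrapped in <tool_call> tags.
for match in re.findall(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", response, re.DOTALL):
    call = json.loads(match)
    print(call["name"], call["arguments"])
```

If the model emits a call, execute the tool yourself, append the result as a tool-role message, and run another generation round to get the final grounded answer.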
## Usage

```python
from PIL import Image
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

# Load model and processor
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "zenlm/zen-vl-8b-agent",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("zenlm/zen-vl-8b-agent")

# Prepare the image and prompt; the {"type": "image"} entry tells the chat
# template where to insert the vision tokens for the image passed below.
image = Image.open("example.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What's in this image?"},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# Generate, then decode only the newly generated tokens
outputs = model.generate(**inputs, max_new_tokens=256)
response = processor.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```
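The same pipeline extends to the video-comprehension capability listed earlier. A minimal sketch that reuses `model` and `processor` from the example above; it assumes a recent `transformers` release whose processor `apply_chat_template` can load and tokenize video frames directly from the message content (`clip.mp4` is a placeholder path, and the exact content keys accepted for video vary by release; older setups may need to preprocess frames themselves, e.g. with the `qwen-vl-utils` package):

```python
# Reuses `model` and `processor` loaded in the example above.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "clip.mp4"},  # placeholder local path
        {"type": "text", "text": "Summarize what happens in this clip."},
    ],
}]

# Assumption: recent processors can render the template, load the video,
# and tokenize everything in a single call.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```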
## Links

- **Website**: [zenlm.org](https://zenlm.org)
- **GitHub**: [zenlm/zen-vl](https://github.com/zenlm/zen-vl)
- **Paper**: Coming soon
- **Model Family**: [zenlm](https://huggingface.co/zenlm)

## Citation

```bibtex
@misc{zenvl2025,
  title={Zen VL: Vision-Language Models with Integrated Function Calling},
  author={Hanzo AI Team},
  year={2025},
  publisher={Zen Language Models},
  url={https://github.com/zenlm/zen-vl}
}
```

## License

Apache 2.0

---

Created by [Hanzo AI](https://hanzo.ai) for the Zen model family.