Qwen3-VL

This version of SmolVLM2-500M-Video-Instructhas been converted to run on the Axera NPU using w8a16 quantization.

Compatible with Pulsar2 version: 5.0

Convert tools links:

For those who are interested in model conversion, you can try to export axmodel through the original repo :

https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct

Pulsar2 Link, How to Convert LLM from Huggingface to axmodel

Support Platform

AX650
- AX650N DEMO Board
- M4N-Dock(爱芯派Pro)
- M.2 Accelerator card

Image Process

Chips	input size	image num	image encoder	ttft(344 tokens)	w8a16	CMM	Flash
AX650	512*512	1	537 ms	510 ms	35.23 tokens/sec	773 MB	813MB

Video Process

Chips	input size	image num	image encoder	ttft(656 tokens)	w8a16	CMM	Flash
AX650	512*512	8	832 ms	1523 ms	35.32 tokens/sec	773 MB	813MB

The DDR capacity refers to the CMM memory that needs to be consumed. Ensure that the CMM memory allocation on the development board is greater than this value.

How to use

Download all files from this repository to the device

If you using AX650 Board

Demo Run

Image understand demo

Set the video parameter in run_ax650.sh to 0 .

input text

describe this image

input image

root@ax650 ~/SmolVLM2-500M-Video-Instruct_Ax650 # run_ax650.sh
prompt >> describe this image
image >> video/frame_0000.jpg
read image
[I][                     EncodeImage][ 409]: pixel_values size 5
[I][                     EncodeImage][ 437]: image encode time : 516.138977 ms, size : 5
[I][                          Encode][ 488]: img_embed.size :5, is video:0, num_media_tokens:64, real num of image:
[I][                          Encode][ 498]: input_ids size:344
[I][                          Encode][ 508]: offset 5
[I][                          Encode][ 508]: offset 71
[I][                          Encode][ 508]: offset 138
[I][                          Encode][ 508]: offset 204
[I][                          Encode][ 508]: offset 271
[I][                          Encode][ 530]: img_embed.size:5, 36864
[I][                          Encode][ 546]: out_embed size:198144
[I][                          Encode][ 547]: input_ids size 344
[I][                          Encode][ 549]: position_ids size:344
[I][                             Run][ 568]: input token num : 344, prefill_split_num : 3
[I][                             Run][ 602]: input_num_token:128
[I][                             Run][ 602]: input_num_token:128
[I][                             Run][ 602]: input_num_token:88
[I][                             Run][ 791]: ttft: 271.32 ms
 In the image, there are two animals, one on the left and the other on the right, both of which are bears. The bear on the left is standing on all fours, its body oriented towards the right side of the image. It has a black and white coat with a blue patch on its chest. The bear on the right is standing on all fours, its body oriented towards the left side of the image. It has a brown and white coat with a blue patch on its chest. Both bears are standing on a rocky terrain, with a mountainous background in the background. The sky in the background is a gradient of orange and yellow, suggesting a sunny day.

[N][                             Run][ 918]: hit eos,avg 76.61 token/s

Video understand demo

Set the video parameter in run_ax650.sh to 1 .

input text

描述这个视频

input video

./video

root@ax650 ~/SmolVLM2-500M-Video-Instruct_Ax650 # run_ax650.sh
prompt >> describe this video
video >> video
video/frame_0000.jpg
video/frame_0008.jpg
video/frame_0016.jpg
video/frame_0024.jpg
video/frame_0032.jpg
video/frame_0040.jpg
video/frame_0048.jpg
video/frame_0056.jpg
[I][                     EncodeImage][ 409]: pixel_values size 8
[I][                     EncodeImage][ 437]: image encode time : 834.026978 ms, size : 8
[I][                          Encode][ 488]: img_embed.size :8, is video:1, num_media_tokens:64, real num of image:
[I][                          Encode][ 498]: input_ids size:656
[I][                          Encode][ 508]: offset 43
[I][                          Encode][ 508]: offset 120
[I][                          Encode][ 508]: offset 197
[I][                          Encode][ 508]: offset 274
[I][                          Encode][ 508]: offset 351
[I][                          Encode][ 508]: offset 428
[I][                          Encode][ 508]: offset 505
[I][                          Encode][ 508]: offset 582
[I][                          Encode][ 530]: img_embed.size:8, 36864
[I][                          Encode][ 546]: out_embed size:377856
[I][                          Encode][ 547]: input_ids size 656
[I][                          Encode][ 549]: position_ids size:656
[I][                             Run][ 568]: input token num : 656, prefill_split_num : 6
[I][                             Run][ 602]: input_num_token:128
[I][                             Run][ 602]: input_num_token:128
[I][                             Run][ 602]: input_num_token:128
[I][                             Run][ 602]: input_num_token:128
[I][                             Run][ 602]: input_num_token:128
[I][                             Run][ 602]: input_num_token:16
[I][                             Run][ 791]: ttft: 827.08 ms
 The video depicts two Siberian foxes in a rocky terrain, engaged in a playful interaction. The fox on the left is standing on its hind legs, while the one on the right is lying down. They are both looking at each other, possibly in a playful or affectionate manner. The background is a natural landscape with a mountainous terrain, suggesting a location where these foxes might be found. The video does not provide any specific actions or movements of the foxes, but the interaction between them is captured in a way that suggests a playful or affectionate moment.

[N][                             Run][ 918]: hit eos,avg 75.46 token/s

Gradio demo

start openai style api server

./run_api_ax650.sh

start gradio demo

if the api server is not run in the same machine,please modify the api url in gradio web ui.

python gradio_demo.py

HTTP demo

start openai style api server

./run_api_ax650.sh

run http demo

python3 openai_cli.py

Downloads last month: 11

Inference Providers NEW

Image-Text-to-Text

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support