Qwen3-VL
This version of SmolVLM2-500M-Video-Instructhas been converted to run on the Axera NPU using w8a16 quantization.
Compatible with Pulsar2 version: 5.0
Convert tools links:
For those who are interested in model conversion, you can try to export axmodel through the original repo :
Pulsar2 Link, How to Convert LLM from Huggingface to axmodel
Support Platform
- AX650
- AX650N DEMO Board
- M4N-Dock(爱芯派Pro)
- M.2 Accelerator card
Image Process
| Chips | input size | image num | image encoder | ttft(344 tokens) | w8a16 | CMM | Flash |
|---|---|---|---|---|---|---|---|
| AX650 | 512*512 | 1 | 537 ms | 510 ms | 35.23 tokens/sec | 773 MB | 813MB |
Video Process
| Chips | input size | image num | image encoder | ttft(656 tokens) | w8a16 | CMM | Flash |
|---|---|---|---|---|---|---|---|
| AX650 | 512*512 | 8 | 832 ms | 1523 ms | 35.32 tokens/sec | 773 MB | 813MB |
The DDR capacity refers to the CMM memory that needs to be consumed. Ensure that the CMM memory allocation on the development board is greater than this value.
How to use
Download all files from this repository to the device
If you using AX650 Board
Demo Run
Image understand demo
Set the video parameter in run_ax650.sh to 0 .
- input text
describe this image
- input image
root@ax650 ~/SmolVLM2-500M-Video-Instruct_Ax650 # run_ax650.sh
prompt >> describe this image
image >> video/frame_0000.jpg
read image
[I][ EncodeImage][ 409]: pixel_values size 5
[I][ EncodeImage][ 437]: image encode time : 516.138977 ms, size : 5
[I][ Encode][ 488]: img_embed.size :5, is video:0, num_media_tokens:64, real num of image:
[I][ Encode][ 498]: input_ids size:344
[I][ Encode][ 508]: offset 5
[I][ Encode][ 508]: offset 71
[I][ Encode][ 508]: offset 138
[I][ Encode][ 508]: offset 204
[I][ Encode][ 508]: offset 271
[I][ Encode][ 530]: img_embed.size:5, 36864
[I][ Encode][ 546]: out_embed size:198144
[I][ Encode][ 547]: input_ids size 344
[I][ Encode][ 549]: position_ids size:344
[I][ Run][ 568]: input token num : 344, prefill_split_num : 3
[I][ Run][ 602]: input_num_token:128
[I][ Run][ 602]: input_num_token:128
[I][ Run][ 602]: input_num_token:88
[I][ Run][ 791]: ttft: 271.32 ms
In the image, there are two animals, one on the left and the other on the right, both of which are bears. The bear on the left is standing on all fours, its body oriented towards the right side of the image. It has a black and white coat with a blue patch on its chest. The bear on the right is standing on all fours, its body oriented towards the left side of the image. It has a brown and white coat with a blue patch on its chest. Both bears are standing on a rocky terrain, with a mountainous background in the background. The sky in the background is a gradient of orange and yellow, suggesting a sunny day.
[N][ Run][ 918]: hit eos,avg 76.61 token/s
Video understand demo
Set the video parameter in run_ax650.sh to 1 .
- input text
描述这个视频
- input video
./video
root@ax650 ~/SmolVLM2-500M-Video-Instruct_Ax650 # run_ax650.sh
prompt >> describe this video
video >> video
video/frame_0000.jpg
video/frame_0008.jpg
video/frame_0016.jpg
video/frame_0024.jpg
video/frame_0032.jpg
video/frame_0040.jpg
video/frame_0048.jpg
video/frame_0056.jpg
[I][ EncodeImage][ 409]: pixel_values size 8
[I][ EncodeImage][ 437]: image encode time : 834.026978 ms, size : 8
[I][ Encode][ 488]: img_embed.size :8, is video:1, num_media_tokens:64, real num of image:
[I][ Encode][ 498]: input_ids size:656
[I][ Encode][ 508]: offset 43
[I][ Encode][ 508]: offset 120
[I][ Encode][ 508]: offset 197
[I][ Encode][ 508]: offset 274
[I][ Encode][ 508]: offset 351
[I][ Encode][ 508]: offset 428
[I][ Encode][ 508]: offset 505
[I][ Encode][ 508]: offset 582
[I][ Encode][ 530]: img_embed.size:8, 36864
[I][ Encode][ 546]: out_embed size:377856
[I][ Encode][ 547]: input_ids size 656
[I][ Encode][ 549]: position_ids size:656
[I][ Run][ 568]: input token num : 656, prefill_split_num : 6
[I][ Run][ 602]: input_num_token:128
[I][ Run][ 602]: input_num_token:128
[I][ Run][ 602]: input_num_token:128
[I][ Run][ 602]: input_num_token:128
[I][ Run][ 602]: input_num_token:128
[I][ Run][ 602]: input_num_token:16
[I][ Run][ 791]: ttft: 827.08 ms
The video depicts two Siberian foxes in a rocky terrain, engaged in a playful interaction. The fox on the left is standing on its hind legs, while the one on the right is lying down. They are both looking at each other, possibly in a playful or affectionate manner. The background is a natural landscape with a mountainous terrain, suggesting a location where these foxes might be found. The video does not provide any specific actions or movements of the foxes, but the interaction between them is captured in a way that suggests a playful or affectionate moment.
[N][ Run][ 918]: hit eos,avg 75.46 token/s
Gradio demo
start openai style api server
./run_api_ax650.sh
start gradio demo
if the api server is not run in the same machine,please modify the api url in gradio web ui.
python gradio_demo.py
HTTP demo
start openai style api server
./run_api_ax650.sh
run http demo
python3 openai_cli.py
- Downloads last month
- 11

