---
license: mit
language:
- en
- zh
base_model:
- SmolVLM2-256M-Video-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- Int8
- VLM
---

# Qwen3-VL

This version of SmolVLM2-256M-Video-Instructhas been converted to run on the Axera NPU using **w8a16** quantization. 

Compatible with Pulsar2 version: 5.0

## Convert tools links:

For those who are interested in model conversion, you can try to export axmodel through the original repo : 

- https://huggingface.co/HuggingFaceTB/SmolVLM2-256M-Video-Instruct

[Pulsar2 Link, How to Convert LLM from Huggingface to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/appendix/build_llm.html) 


## Support Platform

- AX650
  - AX650N DEMO Board
  - [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
  - [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)

**Image Process**
|Chips| input size | image num | image encoder | ttft(344 tokens) | w8a16 | CMM | Flash |
|--|--|--|--|--|--|--|--|
|AX650| 512*512 | 1 | 516 ms | 271 ms | 76.7 tokens/sec| 455 MB | 415MB |

**Video Process**
|Chips| input size | image num | image encoder |ttft(656 tokens) | w8a16 | CMM | Flash |
|--|--|--|--|--|--|--|--|
|AX650| 512*512 | 8  | 834 ms | 827 ms | 75.46 tokens/sec| 455 MB | 415MB |


The DDR capacity refers to the CMM memory that needs to be consumed. Ensure that the CMM memory allocation on the development board is greater than this value.

## How to use

Download all files from this repository to the device

**If you using AX650 Board**

### Demo Run

#### Image understand demo
Set the `video` parameter in run_ax650.sh to 0 .

- input text

```
describe this image
```

- input image

![](./video/frame_0000.jpg)

```
root@ax650 ~/SmolVLM2-256M-Video-Instruct_Ax650 # run_ax650.sh
prompt >> describe this image
image >> video/frame_0000.jpg
read image
[I][                     EncodeImage][ 409]: pixel_values size 5
[I][                     EncodeImage][ 437]: image encode time : 516.138977 ms, size : 5
[I][                          Encode][ 488]: img_embed.size :5, is video:0, num_media_tokens:64, real num of image:
[I][                          Encode][ 498]: input_ids size:344
[I][                          Encode][ 508]: offset 5
[I][                          Encode][ 508]: offset 71
[I][                          Encode][ 508]: offset 138
[I][                          Encode][ 508]: offset 204
[I][                          Encode][ 508]: offset 271
[I][                          Encode][ 530]: img_embed.size:5, 36864
[I][                          Encode][ 546]: out_embed size:198144
[I][                          Encode][ 547]: input_ids size 344
[I][                          Encode][ 549]: position_ids size:344
[I][                             Run][ 568]: input token num : 344, prefill_split_num : 3
[I][                             Run][ 602]: input_num_token:128
[I][                             Run][ 602]: input_num_token:128
[I][                             Run][ 602]: input_num_token:88
[I][                             Run][ 791]: ttft: 271.32 ms
 In the image, there are two animals, one on the left and the other on the right, both of which are bears. The bear on the left is standing on all fours, its body oriented towards the right side of the image. It has a black and white coat with a blue patch on its chest. The bear on the right is standing on all fours, its body oriented towards the left side of the image. It has a brown and white coat with a blue patch on its chest. Both bears are standing on a rocky terrain, with a mountainous background in the background. The sky in the background is a gradient of orange and yellow, suggesting a sunny day.

[N][                             Run][ 918]: hit eos,avg 76.61 token/s
```

#### Video understand demo
Set the `video` parameter in run_ax650.sh to 1 .

- input text  

```
描述这个视频
```

- input video  

./video  

```
root@ax650 ~/SmolVLM2-256M-Video-Instruct_Ax650 # run_ax650.sh
prompt >> describe this video
video >> video
video/frame_0000.jpg
video/frame_0008.jpg
video/frame_0016.jpg
video/frame_0024.jpg
video/frame_0032.jpg
video/frame_0040.jpg
video/frame_0048.jpg
video/frame_0056.jpg
[I][                     EncodeImage][ 409]: pixel_values size 8
[I][                     EncodeImage][ 437]: image encode time : 834.026978 ms, size : 8
[I][                          Encode][ 488]: img_embed.size :8, is video:1, num_media_tokens:64, real num of image:
[I][                          Encode][ 498]: input_ids size:656
[I][                          Encode][ 508]: offset 43
[I][                          Encode][ 508]: offset 120
[I][                          Encode][ 508]: offset 197
[I][                          Encode][ 508]: offset 274
[I][                          Encode][ 508]: offset 351
[I][                          Encode][ 508]: offset 428
[I][                          Encode][ 508]: offset 505
[I][                          Encode][ 508]: offset 582
[I][                          Encode][ 530]: img_embed.size:8, 36864
[I][                          Encode][ 546]: out_embed size:377856
[I][                          Encode][ 547]: input_ids size 656
[I][                          Encode][ 549]: position_ids size:656
[I][                             Run][ 568]: input token num : 656, prefill_split_num : 6
[I][                             Run][ 602]: input_num_token:128
[I][                             Run][ 602]: input_num_token:128
[I][                             Run][ 602]: input_num_token:128
[I][                             Run][ 602]: input_num_token:128
[I][                             Run][ 602]: input_num_token:128
[I][                             Run][ 602]: input_num_token:16
[I][                             Run][ 791]: ttft: 827.08 ms
 The video depicts two Siberian foxes in a rocky terrain, engaged in a playful interaction. The fox on the left is standing on its hind legs, while the one on the right is lying down. They are both looking at each other, possibly in a playful or affectionate manner. The background is a natural landscape with a mountainous terrain, suggesting a location where these foxes might be found. The video does not provide any specific actions or movements of the foxes, but the interaction between them is captured in a way that suggests a playful or affectionate moment.

[N][                             Run][ 918]: hit eos,avg 75.46 token/s

```

### Gradio demo


#### start openai style api server
```shell
./run_api_ax650.sh
```

#### start gradio demo
if the api server is not run in the same machine,please modify the api url in gradio web ui.
```shell
python gradio_demo.py
```

![image](https://cdn-uploads.huggingface.co/production/uploads/64b7837c17570fdff9b906b9/Og9fPNi0chg768gicse7M.png)


### HTTP demo 

#### start openai style api server
```shell
./run_api_ax650.sh
```

#### run http demo
```
python3 openai_cli.py
```