--- license: mit language: - en - zh base_model: - SmolVLM2-256M-Video-Instruct pipeline_tag: image-text-to-text library_name: transformers tags: - Int8 - VLM --- # Qwen3-VL This version of SmolVLM2-256M-Video-Instructhas been converted to run on the Axera NPU using **w8a16** quantization. Compatible with Pulsar2 version: 5.0 ## Convert tools links: For those who are interested in model conversion, you can try to export axmodel through the original repo : - https://huggingface.co/HuggingFaceTB/SmolVLM2-256M-Video-Instruct [Pulsar2 Link, How to Convert LLM from Huggingface to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/appendix/build_llm.html) ## Support Platform - AX650 - AX650N DEMO Board - [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html) - [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html) **Image Process** |Chips| input size | image num | image encoder | ttft(344 tokens) | w8a16 | CMM | Flash | |--|--|--|--|--|--|--|--| |AX650| 512*512 | 1 | 516 ms | 271 ms | 76.7 tokens/sec| 455 MB | 415MB | **Video Process** |Chips| input size | image num | image encoder |ttft(656 tokens) | w8a16 | CMM | Flash | |--|--|--|--|--|--|--|--| |AX650| 512*512 | 8 | 834 ms | 827 ms | 75.46 tokens/sec| 455 MB | 415MB | The DDR capacity refers to the CMM memory that needs to be consumed. Ensure that the CMM memory allocation on the development board is greater than this value. ## How to use Download all files from this repository to the device **If you using AX650 Board** ### Demo Run #### Image understand demo Set the `video` parameter in run_ax650.sh to 0 . - input text ``` describe this image ``` - input image ![](./video/frame_0000.jpg) ``` root@ax650 ~/SmolVLM2-256M-Video-Instruct_Ax650 # run_ax650.sh prompt >> describe this image image >> video/frame_0000.jpg read image [I][ EncodeImage][ 409]: pixel_values size 5 [I][ EncodeImage][ 437]: image encode time : 516.138977 ms, size : 5 [I][ Encode][ 488]: img_embed.size :5, is video:0, num_media_tokens:64, real num of image: [I][ Encode][ 498]: input_ids size:344 [I][ Encode][ 508]: offset 5 [I][ Encode][ 508]: offset 71 [I][ Encode][ 508]: offset 138 [I][ Encode][ 508]: offset 204 [I][ Encode][ 508]: offset 271 [I][ Encode][ 530]: img_embed.size:5, 36864 [I][ Encode][ 546]: out_embed size:198144 [I][ Encode][ 547]: input_ids size 344 [I][ Encode][ 549]: position_ids size:344 [I][ Run][ 568]: input token num : 344, prefill_split_num : 3 [I][ Run][ 602]: input_num_token:128 [I][ Run][ 602]: input_num_token:128 [I][ Run][ 602]: input_num_token:88 [I][ Run][ 791]: ttft: 271.32 ms In the image, there are two animals, one on the left and the other on the right, both of which are bears. The bear on the left is standing on all fours, its body oriented towards the right side of the image. It has a black and white coat with a blue patch on its chest. The bear on the right is standing on all fours, its body oriented towards the left side of the image. It has a brown and white coat with a blue patch on its chest. Both bears are standing on a rocky terrain, with a mountainous background in the background. The sky in the background is a gradient of orange and yellow, suggesting a sunny day. [N][ Run][ 918]: hit eos,avg 76.61 token/s ``` #### Video understand demo Set the `video` parameter in run_ax650.sh to 1 . - input text ``` 描述这个视频 ``` - input video ./video ``` root@ax650 ~/SmolVLM2-256M-Video-Instruct_Ax650 # run_ax650.sh prompt >> describe this video video >> video video/frame_0000.jpg video/frame_0008.jpg video/frame_0016.jpg video/frame_0024.jpg video/frame_0032.jpg video/frame_0040.jpg video/frame_0048.jpg video/frame_0056.jpg [I][ EncodeImage][ 409]: pixel_values size 8 [I][ EncodeImage][ 437]: image encode time : 834.026978 ms, size : 8 [I][ Encode][ 488]: img_embed.size :8, is video:1, num_media_tokens:64, real num of image: [I][ Encode][ 498]: input_ids size:656 [I][ Encode][ 508]: offset 43 [I][ Encode][ 508]: offset 120 [I][ Encode][ 508]: offset 197 [I][ Encode][ 508]: offset 274 [I][ Encode][ 508]: offset 351 [I][ Encode][ 508]: offset 428 [I][ Encode][ 508]: offset 505 [I][ Encode][ 508]: offset 582 [I][ Encode][ 530]: img_embed.size:8, 36864 [I][ Encode][ 546]: out_embed size:377856 [I][ Encode][ 547]: input_ids size 656 [I][ Encode][ 549]: position_ids size:656 [I][ Run][ 568]: input token num : 656, prefill_split_num : 6 [I][ Run][ 602]: input_num_token:128 [I][ Run][ 602]: input_num_token:128 [I][ Run][ 602]: input_num_token:128 [I][ Run][ 602]: input_num_token:128 [I][ Run][ 602]: input_num_token:128 [I][ Run][ 602]: input_num_token:16 [I][ Run][ 791]: ttft: 827.08 ms The video depicts two Siberian foxes in a rocky terrain, engaged in a playful interaction. The fox on the left is standing on its hind legs, while the one on the right is lying down. They are both looking at each other, possibly in a playful or affectionate manner. The background is a natural landscape with a mountainous terrain, suggesting a location where these foxes might be found. The video does not provide any specific actions or movements of the foxes, but the interaction between them is captured in a way that suggests a playful or affectionate moment. [N][ Run][ 918]: hit eos,avg 75.46 token/s ``` ### Gradio demo #### start openai style api server ```shell ./run_api_ax650.sh ``` #### start gradio demo if the api server is not run in the same machine,please modify the api url in gradio web ui. ```shell python gradio_demo.py ``` ![image](https://cdn-uploads.huggingface.co/production/uploads/64b7837c17570fdff9b906b9/Og9fPNi0chg768gicse7M.png) ### HTTP demo #### start openai style api server ```shell ./run_api_ax650.sh ``` #### run http demo ``` python3 openai_cli.py ```