vLLM
Could this variant of the quantized model be deployed with vLLM?
For model deployment with vLLM, see their instructions for in-flight quantization. I have found that the current quantized model tensors trigger an engine initialization failure due to a shape mismatch during weight loading. However, I have verified that the original model (THUDM/GLM-4.1V-9B-Thinking) can be loaded successfully using the in-flight quantization method; during loading, VRAM usage is around 9.3 GB. You may give it a try!
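For reference, here is a minimal sketch of what the in-flight quantization path can look like in Python. The model name comes from this thread; the `quantization="bitsandbytes"` argument follows vLLM's documented in-flight (bitsandbytes) quantization option, so treat the exact arguments as assumptions about your vLLM version rather than a guaranteed recipe:

```python
# Minimal sketch: load the original (non-quantized) checkpoint and let vLLM
# quantize it in-flight with bitsandbytes. Assumes a recent vLLM build with
# bitsandbytes installed (pip install vllm bitsandbytes).
from vllm import LLM, SamplingParams

llm = LLM(
    model="THUDM/GLM-4.1V-9B-Thinking",  # original checkpoint, quantized at load time
    quantization="bitsandbytes",         # in-flight quantization; older vLLM versions
                                         # may also require load_format="bitsandbytes"
    max_model_len=8192,                  # keep context modest to limit VRAM usage
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Give a short description of the GLM-4.1V model."], params)
print(outputs[0].outputs[0].text)
```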
Can you give the versions of transformers you used, or any Docker Compose file you used? I am facing errors.