vLLM
Could this variant of the quantized model be deployed with vLLM?
For model deployment with vLLM, see their instructions for in-flight quantization. I have found that the current quantized model tensors trigger an engine initialization failure due to a shape mismatch during weight loading. However, I have verified that the original model (THUDM/GLM-4.1V-9B-Thinking) can be loaded successfully using the in-flight quantization method; during loading, VRAM usage is around 9.3 GB. You may give it a try!
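For reference, here is a minimal sketch of what the in-flight quantization path can look like in Python. The model name comes from this thread; the `quantization="bitsandbytes"` argument follows vLLM's documented in-flight (bitsandbytes) quantization option, so treat the exact arguments as assumptions about your vLLM version rather than a guaranteed recipe:

```python
# Minimal sketch: load the original (non-quantized) checkpoint and let vLLM
# quantize it in-flight with bitsandbytes. Assumes a recent vLLM build with
# bitsandbytes installed (pip install vllm bitsandbytes).
from vllm import LLM, SamplingParams

llm = LLM(
    model="THUDM/GLM-4.1V-9B-Thinking",  # original checkpoint, quantized at load time
    quantization="bitsandbytes",         # in-flight quantization; older vLLM versions
                                         # may also require load_format="bitsandbytes"
    max_model_len=8192,                  # keep context modest to limit VRAM usage
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Give a short description of the GLM-4.1V model."], params)
print(outputs[0].outputs[0].text)
```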
Can you give the versions of transformers you used, or any Docker Compose file you used? I am facing errors.