vllm

#2
by ivys - opened

Could this variant of the quantized model be deployed with vLLM?

For deployment with vLLM, you may want to look at their instructions for in-flight quantization. I found that the tensors in the current quantized model trigger an engine initialization failure due to a shape mismatch during weight loading. However, I have verified that the original model (THUDM/GLM-4.1V-9B-Thinking) can be loaded successfully using the in-flight quantization method. During model loading, VRAM usage is around 9.3 GB. You may give it a try!

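For reference, here is a minimal sketch of what loading the original checkpoint with vLLM's in-flight bitsandbytes quantization could look like. The exact arguments (`load_format`, `max_model_len`) are illustrative assumptions and may differ between vLLM releases, so check the quantization section of the vLLM docs for your version.

```python
# Sketch: in-flight quantization of the original (non-quantized) checkpoint.
# Argument names and defaults can vary across vLLM versions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="THUDM/GLM-4.1V-9B-Thinking",
    quantization="bitsandbytes",   # quantize weights on the fly at load time
    load_format="bitsandbytes",    # needed on some older vLLM releases
    max_model_len=8192,            # illustrative value; tune for your VRAM budget
)

outputs = llm.generate(
    ["Explain what in-flight quantization does."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```

The equivalent server invocation would be something like `vllm serve THUDM/GLM-4.1V-9B-Thinking --quantization bitsandbytes`.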

Can you share the versions of transformers you used, or any Docker Compose file? I am facing errors.
