Model throughput is so low on H100??? Anyone else facing the same issue? Using vLLM to deploy:
INFO 10-09 10:55:56 [loggers.py:127] Engine 000: Avg prompt throughput: 395.1 tokens/s, Avg generation throughput: 2.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.9%, Prefix cache hit rate: 20.5%
Hi, on H200 I get ~3000 avg prompt throughput and about 170 generation throughput. Can you share your vLLM config?
vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 \
  --gpu-memory-utilization 0.95 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 256 \
  --enable-chunked-prefill \
  --limit-mm-per-prompt '{"image":3, "video":5}' \
  --max-model-len 16384 \
  --logprobs-mode processed_logprobs \
  --host 0.0.0.0 \
  --port 8002
Max batched tokens seems a little high, maybe try 2048 for a test? And maybe lower --max-num-seqs to 64, which should be more than enough on an H100.
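Something like this, just swapping those two flags and keeping the rest of your command as-is (untested sketch, adjust the values to taste):

vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 \
  --gpu-memory-utilization 0.95 \
  --max-num-batched-tokens 2048 \
  --max-num-seqs 64 \
  --enable-chunked-prefill \
  --limit-mm-per-prompt '{"image":3, "video":5}' \
  --max-model-len 16384 \
  --logprobs-mode processed_logprobs \
  --host 0.0.0.0 \
  --port 8002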
The decoder and encoder normally run on the CPU, so check whether that's a bottleneck for you.
Test with 1-2 images and see if performance increases. If it does, the bottleneck is on the CPU.
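For a quick check you can hit the OpenAI-compatible endpoint directly, once text-only and once with a single image, and compare the generation throughput vLLM logs. The image URL below is just a placeholder, swap in one of yours:

curl http://localhost:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-VL-30B-A3B-Instruct-FP8",
    "messages": [
      {"role": "user", "content": [
        {"type": "text", "text": "Describe this image in one sentence."},
        {"type": "image_url", "image_url": {"url": "https://example.com/test.jpg"}}
      ]}
    ],
    "max_tokens": 128
  }'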
Resize images to something like 1024x1024. Big images tend to slow things down a lot.
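If you have ImageMagick installed, something like this downscales anything larger than 1024x1024 while keeping the aspect ratio (it edits files in place, so work on copies):

# shrink only images bigger than 1024x1024, keep aspect ratio
mogrify -resize '1024x1024>' *.jpg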
If none of these boost your performance, check the logs from when vLLM starts to see if FlashInfer is detected/installed.
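Rough way to check, assuming your server log is written to vllm.log (the log filename is just an example); the last line forces the backend via vLLM's VLLM_ATTENTION_BACKEND env var, which only helps if FlashInfer is actually installed in the same environment:

# is flashinfer importable in the env vLLM runs in?
python -c "import flashinfer" && echo "FlashInfer is installed"

# did vLLM mention it when picking the attention backend at startup?
grep -i flashinfer vllm.log

# force the FlashInfer backend explicitly (keep the rest of your serve flags the same)
VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 ...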