Unable to use on 4x3090

#4
by marutichintan - opened

CLI: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True OMP_NUM_THREADS=12 uv run vllm serve cyankiwi/Devstral-2-123B-Instruct-2512-AWQ-4bit --tool-call-parser mistral --enable-auto-tool-choice --tensor-parallel-size 4 --gpu_memory_utilization 0.97 --max-num-seqs 1 --max_model_len 30000 --dtype "half" --served-model-name "devstral"

(Worker_TP0 pid=1690264) INFO 12-14 12:59:10 [monitor.py:34] torch.compile takes 40.45 s in total
(Worker_TP0 pid=1690264) INFO 12-14 12:59:12 [gpu_worker.py:370] Available KV cache memory: 5.02 GiB
(EngineCore_DP0 pid=1690185) INFO 12-14 12:59:13 [shm_broadcast.py:542] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=1690185) INFO 12-14 12:59:13 [kv_cache_utils.py:1291] GPU KV cache size: 59,792 tokens
(EngineCore_DP0 pid=1690185) INFO 12-14 12:59:13 [kv_cache_utils.py:1296] Maximum concurrency for 30,000 tokens per request: 1.99x

(APIServer pid=1690065) INFO: Started server process [1690065]
(APIServer pid=1690065) INFO: Waiting for application startup.
(APIServer pid=1690065) INFO: Application startup complete.
(APIServer pid=1690065) INFO: 127.0.0.1:50002 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=1690065) INFO 12-14 13:00:51 [chat_utils.py:574] Detected the chat template content format to be 'string'. You can set --chat-template-content-format to override this.
(APIServer pid=1690065) INFO: 127.0.0.1:50014 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] WorkerProc hit an exception.
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] Traceback (most recent call last):
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 817, in worker_busy_loop
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] output = func(*args, **kwargs)
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/vllm/v1/worker/worker_base.py", line 369, in execute_model
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] return self.worker.execute_model(scheduler_output, *args, **kwargs)
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/torch/utils/contextlib.py", line 120, in decorate_context
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] return func(*args, **kwargs)
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 618, in execute_model
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] output = self.model_runner.execute_model(
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/torch/utils/contextlib.py", line 120, in decorate_context
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] return func(*args, **kwargs)
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3086, in execute_model
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] model_output = self.model_forward(
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] ^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2749, in model_forward
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] return self.model(
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] ^^^^^^^^^^^
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/vllm/compilation/cuda_graph.py", line 220, in call
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] return self.runnable(*args, **kwargs)
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in wrapped_call_impl
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] return self.call_impl(*args, **kwargs)
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in call_impl
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] return forward_call(*args, **kwargs)
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 624, in forward
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] model_output = self.model(
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] ^^^^^^^^^^^
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 435, in call
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] return TorchCompileWithNoGuardsWrapper.call(self, *args, **kwargs)
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/vllm/compilation/wrapper.py", line 221, in call
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] return self.call_with_optional_nvtx_range(
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/vllm/compilation/wrapper.py", line 109, in call_with_optional_nvtx_range
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] return callable_fn(*args, **kwargs)
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 413, in forward
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] def forward(
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/torch/dynamo/eval_frame.py", line 1044, in fn
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] return fn(*args, **kwargs)
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] ^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/vllm/compilation/caching.py", line 54, in call
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] return self.optimized_call(*args, **kwargs)
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/torch/fx/graph_module.py", line 837, in call_wrapped
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] return self.wrapped_call(self, *args, **kwargs)
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/torch/fx/graph_module.py", line 413, in call
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] raise e
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/torch/fx/graph_module.py", line 400, in call
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] return super(self.cls, obj).call(*args, **kwargs) # type: ignore[misc]
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in wrapped_call_impl
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] return self.call_impl(*args, **kwargs)
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in call_impl
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] return forward_call(*args, **kwargs)
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] File ".178", line 2305, in forward
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] submod_2 = self.submod_2(getitem_3, s72, l_self_modules_layers_modules_0_modules_self_attn_modules_o_proj_parameters_weight_packed_, ...)  [remainder of the generated FX graph call truncated]
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/vllm/compilation/cuda_graph.py", line 220, in call
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] return self.runnable(*args, **kwargs)
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/vllm/compilation/piecewise_backend.py", line 183, in call
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] return range_entry.runnable(*args)
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/torch/_inductor/standalone_compile.py", line 63, in call
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] return self._compiled_fn(*args)
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] return fn(*args, **kwargs)
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] ^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1130, in forward
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] return compiled_fn(full_args)
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] ^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 353, in runtime_wrapper
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] all_outs = call_func_at_runtime_with_args(
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 129, in call_func_at_runtime_with_args
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] out = normalize_as_list(f(args))
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] ^^^^^^^
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 526, in wrapper
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] return compiled_fn(runtime_args)
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 613, in call
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] return self.current_callable(inputs)
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 2962, in run
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] out = model(new_inputs)
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] ^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] File "/tmp/torchinductor_doubleslashai/xa/cxan4wfktuj3lg7ttceyp74b7chk53qd3o27xlm7eqnrpxpvlzcy.py", line 935, in call
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] buf6 = torch.ops._C.gptq_marlin_gemm.default(buf5, None, arg10_1, None, arg11_1, None, None, arg12_1, arg13_1, arg14_1, arg15_1, 1125899907892224, s72, 14336, 12288, True, False, True, False)
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/torch/_ops.py", line 841, in call
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] return self._op(*args, **kwargs)
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=1690265) ERROR 12-14 13:00:52 [multiproc_executor.py:822] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 1 has a total capacity of 23.56 GiB of which 20.88 MiB is free. Including non-PyTorch memory, this process has 23.44 GiB memory in use. Of the allocated memory 22.77 GiB is allocated by PyTorch, with 22.00 MiB allocated in private pools (e.g., CUDA Graphs), and 117.65 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
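A rough reading of the numbers above (a back-of-the-envelope sketch, not an exact accounting): with --gpu_memory_utilization 0.97 nearly the whole 23.56 GiB of each 3090 is handed to vLLM, and after the weight shards and the 5.02 GiB KV cache there is essentially no slack left, so the ~56 MiB workspace that gptq_marlin_gemm asks for at request time fails with only ~21 MiB free.

# Back-of-the-envelope headroom check; all figures are taken from the
# vLLM/PyTorch messages in the log above, so treat this as an estimate.
total_gib = 23.56        # reported capacity of one RTX 3090
gpu_mem_util = 0.97      # --gpu_memory_utilization in the failing command
kv_cache_gib = 5.02      # "Available KV cache memory" per worker

budget_gib = total_gib * gpu_mem_util          # ~22.85 GiB that vLLM plans to use
headroom_gib = total_gib - budget_gib          # ~0.71 GiB left outside the budget
weights_etc_gib = budget_gib - kv_cache_gib    # ~17.8 GiB for weights, activations, graphs

print(f"vLLM budget per GPU:      {budget_gib:.2f} GiB")
print(f"Weights/activations/etc.: {weights_etc_gib:.2f} GiB")
print(f"Headroom outside budget:  {headroom_gib:.2f} GiB")
# The failing allocation is 56 MiB while only ~21 MiB is free, i.e. the 0.97
# target leaves no room for per-request Marlin workspaces or CUDA-graph pools.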

For me, this worked on 4x3090:

VLLM_ATTENTION_BACKEND=FLASHINFER \
vllm serve models/Devstral-2-123B-Instruct-2512-AWQ-4bit \
  --enable-auto-tool-choice \
  --gpu-memory-utilization 0.92 \
  --host 0.0.0.0 \
  --max-model-len 44000 \
  --max-num-seqs 5 \
  --port 8001 \
  --quantization compressed-tensors \
  --served-model-name devstral \
  --tool-call-parser mistral \
  --tensor-parallel-size 4 \
  --no-enforce-eager
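If it helps, here is a minimal smoke test for that config (a sketch, assuming the server started with the command above is reachable on localhost:8001 and the requests package is installed; the model name matches --served-model-name):

import requests

resp = requests.post(
    "http://localhost:8001/v1/chat/completions",   # port 8001 from the command above
    json={
        "model": "devstral",                        # matches --served-model-name
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "max_tokens": 32,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])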

marutichintan changed discussion status to closed

@zipperlein what t/s are you getting?
And why --no-enforce-eager? Doesn't that slow it down?
