@cpatonn
Can you help me?
"""
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] EngineCore failed to start.
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] Traceback (most recent call last):
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 858, in run_engine_core
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 634, in __init__
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] super().__init__(
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] self._init_executor()
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 47, in _init_executor
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] self.driver_worker.init_device()
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 326, in init_device
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] self.worker.init_device() # type: ignore
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 273, in init_device
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] self.model_runner = GPUModelRunnerV1(self.vllm_config, self.device)
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 564, in __init__
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] MultiModalBudget(
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/utils.py", line 42, in __init__
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] max_tokens_by_modality = mm_registry.get_max_tokens_per_item_by_modality(
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/registry.py", line 167, in get_max_tokens_per_item_by_modality
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] return profiler.get_mm_max_contiguous_tokens(
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/profiling.py", line 369, in get_mm_max_contiguous_tokens
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] return self._get_mm_max_tokens(seq_len, mm_counts, mm_embeddings_only=False)
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/profiling.py", line 351, in _get_mm_max_tokens
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] mm_inputs = self._get_dummy_mm_inputs(seq_len, mm_counts)
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/profiling.py", line 263, in _get_dummy_mm_inputs
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] processor_inputs = factory.get_dummy_processor_inputs(
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/profiling.py", line 120, in get_dummy_processor_inputs
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] dummy_text = self.get_dummy_text(mm_counts)
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/glm4_1v.py", line 1176, in get_dummy_text
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] hf_processor = self.info.get_hf_processor()
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/processing.py", line 1186, in get_hf_processor
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] return self.ctx.get_hf_processor(**kwargs)
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/processing.py", line 1049, in get_hf_processor
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] return cached_processor_from_config(
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/processor.py", line 251, in cached_processor_from_config
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] return cached_get_processor_without_dynamic_kwargs(
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/processor.py", line 210, in cached_get_processor_without_dynamic_kwargs
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] processor = cached_get_processor(
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/processor.py", line 155, in get_processor
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] raise TypeError(
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] TypeError: Invalid type of HuggingFace processor. Expected type: <class 'transformers.processing_utils.ProcessorMixin'>, but found type: <class 'transformers.tokenization_utils_fast.PreTrainedTokenizerFast'>
(EngineCore_DP0 pid=62) Process EngineCore_DP0:
(EngineCore_DP0 pid=62) Traceback (most recent call last):
(EngineCore_DP0 pid=62) File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=62) self.run()
(EngineCore_DP0 pid=62) File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=62) self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 871, in run_engine_core
(EngineCore_DP0 pid=62) raise e
(EngineCore_DP0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 858, in run_engine_core
(EngineCore_DP0 pid=62) engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=62) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 634, in __init__
(EngineCore_DP0 pid=62) super().__init__(
(EngineCore_DP0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=62) self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=62) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=62) self._init_executor()
(EngineCore_DP0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 47, in _init_executor
(EngineCore_DP0 pid=62) self.driver_worker.init_device()
(EngineCore_DP0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 326, in init_device
(EngineCore_DP0 pid=62) self.worker.init_device() # type: ignore
(EngineCore_DP0 pid=62) ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 273, in init_device
(EngineCore_DP0 pid=62) self.model_runner = GPUModelRunnerV1(self.vllm_config, self.device)
(EngineCore_DP0 pid=62) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 564, in __init__
(EngineCore_DP0 pid=62) MultiModalBudget(
(EngineCore_DP0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/utils.py", line 42, in __init__
(EngineCore_DP0 pid=62) max_tokens_by_modality = mm_registry.get_max_tokens_per_item_by_modality(
(EngineCore_DP0 pid=62) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/registry.py", line 167, in get_max_tokens_per_item_by_modality
(EngineCore_DP0 pid=62) return profiler.get_mm_max_contiguous_tokens(
(EngineCore_DP0 pid=62) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/profiling.py", line 369, in get_mm_max_contiguous_tokens
(EngineCore_DP0 pid=62) return self._get_mm_max_tokens(seq_len, mm_counts, mm_embeddings_only=False)
(EngineCore_DP0 pid=62) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/profiling.py", line 351, in _get_mm_max_tokens
(EngineCore_DP0 pid=62) mm_inputs = self._get_dummy_mm_inputs(seq_len, mm_counts)
(EngineCore_DP0 pid=62) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/profiling.py", line 263, in _get_dummy_mm_inputs
(EngineCore_DP0 pid=62) processor_inputs = factory.get_dummy_processor_inputs(
(EngineCore_DP0 pid=62) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/profiling.py", line 120, in get_dummy_processor_inputs
(EngineCore_DP0 pid=62) dummy_text = self.get_dummy_text(mm_counts)
(EngineCore_DP0 pid=62) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/glm4_1v.py", line 1176, in get_dummy_text
(EngineCore_DP0 pid=62) hf_processor = self.info.get_hf_processor()
(EngineCore_DP0 pid=62) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/processing.py", line 1186, in get_hf_processor
(EngineCore_DP0 pid=62) return self.ctx.get_hf_processor(**kwargs)
(EngineCore_DP0 pid=62) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/processing.py", line 1049, in get_hf_processor
(EngineCore_DP0 pid=62) return cached_processor_from_config(
(EngineCore_DP0 pid=62) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/processor.py", line 251, in cached_processor_from_config
(EngineCore_DP0 pid=62) return cached_get_processor_without_dynamic_kwargs(
(EngineCore_DP0 pid=62) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/processor.py", line 210, in cached_get_processor_without_dynamic_kwargs
(EngineCore_DP0 pid=62) processor = cached_get_processor(
(EngineCore_DP0 pid=62) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/processor.py", line 155, in get_processor
(EngineCore_DP0 pid=62) raise TypeError(
(EngineCore_DP0 pid=62) TypeError: Invalid type of HuggingFace processor. Expected type: <class 'transformers.processing_utils.ProcessorMixin'>, but found type: <class 'transformers.tokenization_utils_fast.PreTrainedTokenizerFast'>
[rank0]:[W1210 08:22:08.174477121 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
"""
Got this far:
#!/bin/bash
# nightly vllm wheels, letting uv pick the torch backend for the local CUDA
uv pip install -U vllm \
--torch-backend=auto \
--extra-index-url https://wheels.vllm.ai/nightly # add variant subdirectory here if needed
# transformers from source, since the released version doesn't know this model's processor class
uv pip install --upgrade git+https://github.com/huggingface/transformers.git
# transformers main pulls in a numpy that breaks things, so pin it back
uv pip install "numpy<2.3"
But I end up with a Triton symbol-lookup error:
(EngineCore_DP0 pid=247846) ImportError: /home/<user>/.triton/cache/KJGBFCDPODDSTMI2D3ESJLJ6JDTKTCDGYNBY5NRKEHPZNNN6VV2Q/cuda_utils.cpython-312-x86_64-linux-gnu.so: undefined symbol: cuModuleGetFunction
I'm getting the runaround from both GPT-5.1 and Gemini-3-pro on this one. Seems like there's some instability at the cutting edge of this stuff (as always seems to be the case with vllm...).
It fails for me on CUDA 12.8, 12.9, and 13.0, even after blowing away ~/.triton/cache, or the entire ~/.triton, on each attempt.
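For what it's worth, an undefined cuModuleGetFunction symbol in a cached Triton .so usually means the cached artifact resolves against a different CUDA driver library than the one loaded at runtime. A rough sanity check (a sketch; the glob assumes the cache layout from the error above):
# confirm which CUDA build torch itself was installed with
python -c "import torch; print(torch.__version__, torch.version.cuda)"
# after a fresh run repopulates ~/.triton/cache, see which libcuda the new .so links against
ldd ~/.triton/cache/*/cuda_utils.*.so | grep -i libcuda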
@jmckenzie-dev I think there is a mismatch between your CUDA, PyTorch, and vllm versions. Could you try installing the PyTorch build that matches your CUDA version, following pytorch.org, and then building vllm from source against your existing PyTorch?
git clone https://github.com/vllm-project/vllm.git
cd vllm
python use_existing_torch.py # strip the pinned torch requirements so the build reuses your installed PyTorch
uv pip install -r requirements/build.txt
uv pip install --no-build-isolation -e . # editable install built against the existing torch
and then install transformers from source:
uv pip install --upgrade git+https://github.com/huggingface/transformers.git
It might take some time to build vllm from source, and you might need to install some additional libraries in the process. Or maybe try Docker?
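If Docker sounds easier, something along these lines might work (a sketch only: I'm assuming the official vllm/vllm-openai image, and reusing the AWQ repo id from later in this thread — adjust both for your setup):
docker run --gpus all -p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:latest \
--model cyankiwi/GLM-4.6V-AWQ-4bit --max-model-len 20000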
Oh, it's very possible that's the case. I just wanted to avoid going back into that hell-hole that is building vllm from source again. :)
The recommended instructions for updating transformers end up with a numpy version break too, so that has to be downgraded.
I'll see about doing the custom full-stack build of vllm from HEAD locally later today, once I'm done with an exl3 quant of Devstral-2 and seeing how that goes.
Thanks for the ping.
I've also faced countless errors with various combinations of component versions. It ended up serving with the setup below:
uv venv venv-glm4.6v --python 3.13 --seed
source venv-glm4.6v/bin/activate
uv pip install "vllm>=0.12.0" # quoted so the shell doesn't treat >= as a redirect
uv pip install --upgrade git+https://github.com/huggingface/transformers.git
uv pip install numpy==2.2.6
I also faced an issue where vLLM would get stuck and not show any logs. It turned out to be a silent download of the model: https://github.com/vllm-project/vllm/issues/17676
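One way to sidestep that silent hang is to pull the weights explicitly before serving, so vllm starts against a warm cache (a sketch; huggingface-cli comes from the huggingface_hub package, not from vllm itself):
# download (or resume) the checkpoint into the local HF cache first
huggingface-cli download cyankiwi/GLM-4.6V-AWQ-4bit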
I use the below command for serving on 4x3090:
# PP=4 splits layers across the four 3090s; TP=1 disables tensor parallelism
vllm serve cyankiwi/GLM-4.6V-AWQ-4bit \
--max-model-len 20000 \
--tensor-parallel-size 1 \
--pipeline-parallel-size 4 \
--enable-expert-parallel \
--host 0.0.0.0 --port 8001
I did finally get things working with a similar approach (upgraded transformers, pinned a downgraded numpy). Generation seemed to work well enough, though local inference intermittently dropped the opening block and failed to finish its generation. Not sure if it's a chat-template thing on my end or what, but chat-template errors seem to be incredibly common for new models across inference environments. Thanks for the assist @cpatonn.
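For anyone who wants to rule the chat template in or out, one low-effort check is to render it directly with transformers and eyeball what surrounds the assistant turn (a sketch; repo id as above):
python - <<'EOF'
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("cyankiwi/GLM-4.6V-AWQ-4bit")
# render a trivial conversation without tokenizing, so the template's literal output is visible
print(tok.apply_chat_template(
    [{"role": "user", "content": "hello"}],
    tokenize=False,
    add_generation_prompt=True,
))
EOF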