GGUF model with architecture deepseek2 is not supported yet.

#11
by T1-Faker1 - opened

(vadmin) vadmin@vadmin:~$ vllm serve ./GLM-4.7-Flash-Q4_K_M.gguf --tokenizer glm4_tokenizer/ZhipuAI/glm-4-9b-chat
(APIServer pid=12531) INFO 01-22 01:25:04 [api_server.py:872] vLLM API server version 0.14.0rc2.dev199+gc80f92c14
(APIServer pid=12531) INFO 01-22 01:25:04 [utils.py:267] non-default args: {'model_tag': './GLM-4.7-Flash-Q4_K_M.gguf', 'model': './GLM-4.7-Flash-Q4_K_M.gguf', 'tokenizer': 'glm4_tokenizer/ZhipuAI/glm-4-9b-chat'}
(APIServer pid=12531) Traceback (most recent call last):
(APIServer pid=12531) File "/home/vadmin/.venv/bin/vllm", line 10, in
(APIServer pid=12531) sys.exit(main())
(APIServer pid=12531) ^^^^^^
(APIServer pid=12531) File "/home/vadmin/vllm/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=12531) args.dispatch_function(args)
(APIServer pid=12531) File "/home/vadmin/vllm/vllm/entrypoints/cli/serve.py", line 60, in cmd
(APIServer pid=12531) uvloop.run(run_server(args))
(APIServer pid=12531) File "/home/vadmin/.venv/lib/python3.12/site-packages/uvloop/init.py", line 96, in run
(APIServer pid=12531) return __asyncio.run(
(APIServer pid=12531) ^^^^^^^^^^^^^^
(APIServer pid=12531) File "/home/vadmin/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=12531) return runner.run(main)
(APIServer pid=12531) ^^^^^^^^^^^^^^^^
(APIServer pid=12531) File "/home/vadmin/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=12531) return self._loop.run_until_complete(task)
(APIServer pid=12531) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=12531) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=12531) File "/home/vadmin/.venv/lib/python3.12/site-packages/uvloop/init.py", line 48, in wrapper
(APIServer pid=12531) return await main
(APIServer pid=12531) ^^^^^^^^^^
(APIServer pid=12531) File "/home/vadmin/vllm/vllm/entrypoints/openai/api_server.py", line 919, in run_server
(APIServer pid=12531) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=12531) File "/home/vadmin/vllm/vllm/entrypoints/openai/api_server.py", line 938, in run_server_worker
(APIServer pid=12531) async with build_async_engine_client(
(APIServer pid=12531) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=12531) File "/home/vadmin/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in aenter
(APIServer pid=12531) return await anext(self.gen)
(APIServer pid=12531) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=12531) File "/home/vadmin/vllm/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
(APIServer pid=12531) async with build_async_engine_client_from_engine_args(
(APIServer pid=12531) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=12531) File "/home/vadmin/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in aenter
(APIServer pid=12531) return await anext(self.gen)
(APIServer pid=12531) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=12531) File "/home/vadmin/vllm/vllm/entrypoints/openai/api_server.py", line 172, in build_async_engine_client_from_engine_args
(APIServer pid=12531) vllm_config = engine_args.create_engine_config(usage_context=usage_context)
(APIServer pid=12531) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=12531) File "/home/vadmin/vllm/vllm/engine/arg_utils.py", line 1360, in create_engine_config
(APIServer pid=12531) maybe_override_with_speculators(
(APIServer pid=12531) File "/home/vadmin/vllm/vllm/transformers_utils/config.py", line 528, in maybe_override_with_speculators
(APIServer pid=12531) config_dict, _ = PretrainedConfig.get_config_dict(
(APIServer pid=12531) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=12531) File "/home/vadmin/.venv/lib/python3.12/site-packages/transformers/configuration_utils.py", line 662, in get_config_dict
(APIServer pid=12531) config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
(APIServer pid=12531) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=12531) File "/home/vadmin/.venv/lib/python3.12/site-packages/transformers/configuration_utils.py", line 753, in _get_config_dict
(APIServer pid=12531) config_dict = load_gguf_checkpoint(resolved_config_file, return_tensors=False)["config"]
(APIServer pid=12531) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=12531) File "/home/vadmin/.venv/lib/python3.12/site-packages/transformers/modeling_gguf_pytorch_utils.py", line 431, in load_gguf_checkpoint
(APIServer pid=12531) raise ValueError(f"GGUF model with architecture {architecture} is not supported yet.")
(APIServer pid=12531) ValueError: GGUF model with architecture deepseek2 is not supported yet.

T1-Faker1 changed discussion title from "vllm seems to have a problem running" to "GGUF model with architecture deepseek2 is not supported yet."
Unsloth AI org

I'm not sure the GGUF works in vLLM yet. Only llama.cpp-based backends are supported.
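As a sketch of the llama.cpp route (flag names per llama.cpp's `llama-server`; adjust path, port, and context size to your setup):

```shell
# Serve the GGUF with llama.cpp's OpenAI-compatible server instead of vLLM
llama-server -m ./GLM-4.7-Flash-Q4_K_M.gguf --port 8080 -c 8192
```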

I'm running into this issue when trying to use it with transformers directly. Is there a workaround until the library gets support for it?

MODEL_PATH = "unsloth/GLM-4.7-Flash-GGUF"
GGUF_FILE = "GLM-4.7-Flash-UD-Q2_K_XL.gguf"

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_PATH,
    gguf_file=GGUF_FILE,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    gguf_file=GGUF_FILE,
    device_map="auto",
)
