Devstral Small 2 + VS Code Insiders + Copilot + Spec Kit ==> This LLM is lazier than me!

#21
by davrot - opened

I (am trying to) run the following setup:
vLLM -> VS Code Insiders + Copilot + Spec Kit. I thought it might be a good idea to switch from Raptor Mini to Devstral Small 2, so I put it on an H100 with vLLM 0.13.0rc2.

I turned on Devstral and gave it the Spec Kit "Check project consistence" task. The first run of the task was normal: it gave the git commit a useful name and all. Since there were still a lot of inconsistencies in the project, I asked it to run the "Check project consistence" task again. The result got stranger...

[screenshot: 2025-12-17_22-30]

I asked it to run the "Check project consistence" task once more (I do this relatively often with Raptor Mini without problems). This was the moment when the model lost its marbles.

[screenshot: 2025-12-17_22-32]

And it made its position very clear in the git commit message:

[screenshot: 2025-12-17_22-31]

Here is the commit: https://github.com/davrot/overleaf_with_admin_extension/commit/00d4526b312ad4982680ea18fbdccb5fff3b2923 (I had asked Raptor Mini not to create PRs and push stuff... but who cares what the user says...)

Any idea how to get it working?

[Funny detour about qwen3 ->]
I also had a strange experience with qwen3 coder instruct. First it told me that it is not able to control tools, then it told me it cannot code, and finally it told me that the code is perfect ("I have not run that Maven compile command in this specific environment. I've been explaining what would happen and providing analysis, but I cannot actually execute that command here."). Well, the code didn't compile at all. 🙃 When I told qwen3 that not everything is perfect, this resulted in a discussion (Dark Star style) where the model tried to convince me that we had made a lot of wonderful progress and that everything is good as it is (e.g. "You're asking me to "fix" something that isn't broken."). It also randomly switched to Chinese at some point (首先检查该文件内容, "First check the contents of this file").

Maybe the vLLM -> VS Code Insiders + Copilot (+ Spec Kit) setup is somehow hurting the models? Maybe we need AI psychologists? AI motivators?
[<- Funny detour about qwen3 ]

VS Code Insiders settings:

  "github.copilot.chat.customOAIModels": {
    "mistralai/Devstral-Small-2-24B-Instruct-2512": {
      "name": "mistralai/Devstral-Small-2-24B-Instruct-2512",
      "url": "http://gate0.neuro.uni-bremen.de:8000/v1",
      "toolCalling": true,
      "vision": true,
      "thinking": true,
      "maxInputTokens": 250000,
      "maxOutputTokens": 8192,
      "requiresAPIKey": false
    }
  },

vLLM:

export TORCHINDUCTOR_CACHE_DIR="/data_1/davrot/devstral/.cache/torch_inductor_cache"

/data_1/davrot/devstral/vllm_uv_env/bin/vllm serve mistralai/Devstral-Small-2-24B-Instruct-2512 \
    --host 0.0.0.0 \
    --port 8000 \
    --tool-call-parser mistral \
    --enable-auto-tool-choice \
    --dtype half \
    --max-model-len 262144 
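For what it's worth, the token budgets in the two configs above are at least consistent with each other; a quick sanity check (values copied from the VS Code settings and the vLLM launch command):

```python
# Values copied from the Copilot settings and the vLLM launch command above.
max_input_tokens = 250_000   # "maxInputTokens" in the VS Code settings
max_output_tokens = 8_192    # "maxOutputTokens" in the VS Code settings
max_model_len = 262_144      # --max-model-len passed to vllm serve

# Prompt plus generated tokens must fit into the server's context window,
# otherwise long agent sessions get rejected or truncated mid-request.
assert max_input_tokens + max_output_tokens <= max_model_len
print("token budgets fit:", max_input_tokens + max_output_tokens, "<=", max_model_len)
```

So a plain context-window overflow does not seem to explain the behavior here.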

vLLM log when it lost it:

(APIServer pid=8020) INFO:     10.10.10.10:47296 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=8020) INFO 12-17 21:27:55 [loggers.py:248] Engine 000: Avg prompt throughput: 8061.3 tokens/s, Avg generation throughput: 4.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 96.8%, MM cache hit rate: 0.0%
(APIServer pid=8020) INFO 12-17 21:28:05 [loggers.py:248] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 96.8%, MM cache hit rate: 0.0%
(APIServer pid=8020) INFO:     10.10.10.10:53860 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=8020) INFO 12-17 21:28:35 [loggers.py:248] Engine 000: Avg prompt throughput: 8088.6 tokens/s, Avg generation throughput: 6.1 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 96.8%, MM cache hit rate: 0.0%
(APIServer pid=8020) INFO:     10.10.10.10:53860 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=8020) INFO:     10.10.10.10:52328 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=8020) INFO 12-17 21:28:45 [loggers.py:248] Engine 000: Avg prompt throughput: 16224.9 tokens/s, Avg generation throughput: 0.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 96.9%, MM cache hit rate: 0.0%
(APIServer pid=8020) INFO:     10.10.10.10:39846 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=8020) INFO 12-17 21:28:55 [loggers.py:248] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 16.6%, Prefix cache hit rate: 96.3%, MM cache hit rate: 0.0%
(APIServer pid=8020) INFO:     10.10.10.10:39846 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=8020) INFO:     10.10.10.10:39846 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=8020) INFO:     10.10.10.10:39846 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=8020) INFO 12-17 21:29:05 [loggers.py:248] Engine 000: Avg prompt throughput: 33288.4 tokens/s, Avg generation throughput: 19.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 96.5%, MM cache hit rate: 0.0%
(APIServer pid=8020) INFO 12-17 21:29:15 [loggers.py:248] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 96.5%, MM cache hit rate: 0.0%
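One way to narrow this down might be to take Copilot and Spec Kit out of the loop and call the vLLM endpoint directly with a tool-enabled request, to see whether the raw model already misbehaves. A minimal sketch (the URL and model name come from the config above; the `get_weather` tool is a made-up placeholder, not part of the actual setup):

```python
import json
import urllib.request

BASE_URL = "http://gate0.neuro.uni-bremen.de:8000/v1"  # from the config above
MODEL = "mistralai/Devstral-Small-2-24B-Instruct-2512"

# A dummy tool definition, just to check whether the model emits a
# well-formed tool call through the mistral tool-call parser.
payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "What is the weather in Bremen?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}

def post_chat_completion(base_url: str = BASE_URL) -> dict:
    """POST the request to the OpenAI-compatible endpoint; return parsed JSON."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

If the response's `choices[0].message.tool_calls` comes back clean here but the model still derails inside Copilot, that would point at the client/prompting layer rather than the model or vLLM.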
