havenvanheusen/Devstral-Vision-Small-2507-Q4_K_M-GGUF

This model was converted to GGUF format from cognitivecomputations/Devstral-Vision-Small-2507 using llama.cpp via ggml.ai's GGUF-my-repo space. Refer to the original model card for more details on the model.

Devstral-Vision-Small-2507

Created by Eric Hartford at Cognitive Computations

Model Description

Devstral-Vision-Small-2507 is a multimodal language model that combines the exceptional coding capabilities of Devstral-Small-2507 with the vision understanding of Mistral-Small-3.2-24B-Instruct-2506.

This model enables vision-augmented software engineering tasks, allowing developers to:

  • Analyze screenshots and UI mockups to generate code
  • Debug visual rendering issues with actual screenshots
  • Convert designs and wireframes directly into implementation
  • Understand and modify codebases with visual context

Model Details

  • Base Architecture: Mistral Small 3.2 with vision encoder
  • Parameters: 24B (language model) + vision components
  • Context Window: 128k tokens
  • License: Apache 2.0
  • Language Model: Fine-tuned Devstral weights for superior coding performance
  • Vision Model: Mistral-Small vision encoder and multimodal projector

How It Was Created

This model was created by surgically transplanting the language model weights from Devstral-Small-2507 into the Mistral-Small-3.2-24B-Instruct-2506 architecture while preserving all vision components:

  1. Started with Mistral-Small-3.2-24B-Instruct-2506 (complete multimodal model)
  2. Replaced only the core language model weights with Devstral-Small-2507's fine-tuned weights
  3. Preserved Mistral's vision encoder, multimodal projector, vision-language adapter, and token embeddings
  4. Kept Mistral's tokenizer to maintain proper image token handling

The result is a model that combines Devstral's state-of-the-art coding capabilities with Mistral's vision understanding.
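A minimal sketch of how such a weight transplant can be done is shown below. This is illustrative only, not the author's actual script: it assumes both checkpoints load as flat PyTorch state dicts, and the tensor-name prefixes ("language_model.", "vision_tower.", "multi_modal_projector.") are assumptions, not verified key names.

# Illustrative sketch only -- not the author's actual conversion script.
# Assumes flat PyTorch state dicts and hypothetical tensor-name prefixes.
import torch

donor = torch.load("devstral-small-2507.pt", map_location="cpu")               # text-only coder
recipient = torch.load("mistral-small-3.2-24b-vision.pt", map_location="cpu")  # multimodal base

PRESERVE_PREFIXES = ("vision_tower.", "multi_modal_projector.")   # assumed vision component names
PRESERVE_KEYS = {"language_model.model.embed_tokens.weight"}      # keep image-token embeddings

for key in list(recipient.keys()):
    if key.startswith(PRESERVE_PREFIXES) or key in PRESERVE_KEYS:
        continue  # preserve Mistral's vision encoder, projector, and embeddings
    donor_key = key.removeprefix("language_model.")  # map multimodal key -> text-only key
    if donor_key in donor:
        recipient[key] = donor[donor_key]  # transplant Devstral's language weights

torch.save(recipient, "devstral-vision-small-2507.pt")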


Intended Use

Primary Use Cases

  • Visual Software Engineering: Analyze UI screenshots, mockups, and designs to generate implementation code
  • Code Review with Visual Context: Review code changes alongside their visual output
  • Debugging Visual Issues: Debug rendering problems by analyzing screenshots
  • Design-to-Code: Convert visual designs directly into code
  • Documentation with Visual Examples: Generate documentation that references visual elements

Example Applications

  • Building UI components from screenshots
  • Debugging CSS/styling issues with visual feedback
  • Converting Figma/design mockups to code
  • Analyzing and reproducing visual bugs
  • Creating visual test cases

Usage

With OpenHands

The model is optimized for use with OpenHands for agentic coding tasks:

# Using vLLM
vllm serve cognitivecomputations/Devstral-Vision-Small-2507 \
    --tokenizer_mode mistral \
    --config_format mistral \
    --load_format mistral \
    --tensor-parallel-size 2
# Configure OpenHands to use the model
# Set Custom Model: openai/cognitivecomputations/Devstral-Vision-Small-2507
# Set Base URL: http://localhost:8000/v1
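Outside of OpenHands, the served model can also be queried through vLLM's OpenAI-compatible API. A minimal sketch, assuming the openai Python package is installed and a local screenshot.png exists:

# Send a text + image request to the vLLM server started above.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # local server; key unused

with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="cognitivecomputations/Devstral-Vision-Small-2507",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Analyze this UI screenshot and generate React code to reproduce it."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    max_tokens=2000,
)
print(resp.choices[0].message.content)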

With Transformers

import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
model_id = "cognitivecomputations/Devstral-Vision-Small-2507"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)
# Load an image
image = Image.open("screenshot.png")
# Create a prompt
prompt = "Analyze this UI screenshot and generate React code to reproduce it."
# Process inputs
inputs = processor(
    text=prompt,
    images=image,
    return_tensors="pt"
).to(model.device)
# Generate (do_sample=True so the temperature setting takes effect;
# without it, transformers ignores temperature and decodes greedily)
outputs = model.generate(
    **inputs,
    max_new_tokens=2000,
    do_sample=True,
    temperature=0.7
)
response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)
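Depending on the installed transformers version, Mistral-family multimodal checkpoints may expect the prompt to be built through the processor's chat template rather than a raw string. A hedged variant, extending the example above (the exact content schema is an assumption and can differ across versions):

# Alternative prompt construction via the processor's chat template.
# The {"type": "image"} placeholder schema is an assumption; some
# processor versions expect the image passed inline in the message.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Analyze this UI screenshot and generate React code to reproduce it."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)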


Performance Expectations

Coding Performance

Inherits Devstral's exceptional performance on coding tasks:

  • 53.6% on SWE-Bench Verified (when used with OpenHands)
  • Superior performance on multi-file editing and codebase exploration
  • Excellent tool use and agentic behavior

Vision Performance

Maintains Mistral-Small's vision capabilities:

  • Strong understanding of UI elements and layouts
  • Accurate interpretation of charts, diagrams, and visual documentation
  • Reliable screenshot analysis for debugging

Hardware Requirements

  • GPU Memory: ~48GB unquantized (bf16/fp16), ~24GB with 4-bit quantization
  • Recommended: 2x RTX 4090 or better for optimal performance
  • Minimum: Single GPU with 24GB VRAM using quantization
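As a back-of-envelope check on these figures (weights only; the KV cache and runtime overhead add several GB on top, which is presumably why the 4-bit figure above is ~24GB rather than ~12GB):

# Weights-only VRAM estimate for a 24B-parameter model.
params = 24e9
print(f"bf16 : {params * 2 / 1e9:.0f} GB")    # 2 bytes/param   -> ~48 GB
print(f"4-bit: {params * 0.5 / 1e9:.0f} GB")  # 0.5 bytes/param -> ~12 GB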

Limitations

  • Vision capabilities are limited to what Mistral-Small-3.2 supports
  • Not specifically fine-tuned on vision-to-code tasks (uses Devstral's text-only fine-tuning)
  • Large model size may be prohibitive for some deployment scenarios
  • Best performance achieved when used with appropriate scaffolding (OpenHands, Cline, etc.)

Ethical Considerations

This model inherits both the capabilities and limitations of its parent models. Users should:

  • Review generated code for security vulnerabilities
  • Verify visual interpretations are accurate
  • Be aware of potential biases in code generation
  • Use appropriate safety measures in production deployments

Citation

If you use this model, please cite:

@misc{devstral-vision-2507,
  author = {Hartford, Eric},
  title = {Devstral-Vision-Small-2507},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/cognitivecomputations/Devstral-Vision-Small-2507}
}

Acknowledgments

This model builds upon the excellent work by:

  • Mistral AI for both Mistral-Small and Devstral
  • All Hands AI for their collaboration on Devstral
  • The open-source community for testing and feedback

License

Apache 2.0 - Same as the base models


Created with dolphin passion 🐬 by Cognitive Computations


Use with llama.cpp

Install llama.cpp through brew (works on macOS and Linux):

brew install llama.cpp

Invoke the llama.cpp server or the CLI.

CLI:

llama-cli --hf-repo havenvanheusen/Devstral-Vision-Small-2507-Q4_K_M-GGUF --hf-file devstral-vision-small-2507-q4_k_m.gguf -p "The meaning to life and the universe is"

Server:

llama-server --hf-repo havenvanheusen/Devstral-Vision-Small-2507-Q4_K_M-GGUF --hf-file devstral-vision-small-2507-q4_k_m.gguf -c 2048
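Recent llama-server builds expose an OpenAI-compatible API on port 8080 by default. A minimal query, assuming the requests package is installed:

# Minimal chat request against llama-server's OpenAI-compatible endpoint.
# Assumes the server started above is listening on the default port 8080.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Write a Python hello world."}],
        "max_tokens": 256,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])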

Note: You can also use this checkpoint directly by following the usage steps from the llama.cpp repo, reproduced below.

Step 1: Clone llama.cpp from GitHub.

git clone https://github.com/ggerganov/llama.cpp

Step 2: Move into the llama.cpp folder and build it with the LLAMA_CURL=1 flag along with other hardware-specific flags (e.g., LLAMA_CUDA=1 for Nvidia GPUs on Linux).

cd llama.cpp && LLAMA_CURL=1 make

Step 3: Run inference through the main binary.

./llama-cli --hf-repo havenvanheusen/Devstral-Vision-Small-2507-Q4_K_M-GGUF --hf-file devstral-vision-small-2507-q4_k_m.gguf -p "The meaning to life and the universe is"

or

./llama-server --hf-repo havenvanheusen/Devstral-Vision-Small-2507-Q4_K_M-GGUF --hf-file devstral-vision-small-2507-q4_k_m.gguf -c 2048