havenvanheusen/Devstral-Vision-Small-2507-Q4_K_M-GGUF
This model was converted to GGUF format from cognitivecomputations/Devstral-Vision-Small-2507 using llama.cpp via the ggml.ai's GGUF-my-repo space.
Refer to the original model card for more details on the model.
Devstral-Vision-Small-2507
Created by Eric Hartford at Cognitive Computations
Model Description
Devstral-Vision-Small-2507 is a multimodal language model that combines the exceptional coding capabilities of Devstral-Small-2507 with the vision understanding of Mistral-Small-3.2-24B-Instruct-2506.
This model enables vision-augmented software engineering tasks, allowing developers to:
- Analyze screenshots and UI mockups to generate code
- Debug visual rendering issues with actual screenshots
- Convert designs and wireframes directly into implementation
- Understand and modify codebases with visual context
Model Details
- Base Architecture: Mistral Small 3.2 with vision encoder
- Parameters: 24B (language model) + vision components
- Context Window: 128k tokens
- License: Apache 2.0
- Language Model: Fine-tuned Devstral weights for superior coding performance
- Vision Model: Mistral-Small vision encoder and multimodal projector
How It Was Created
This model was created by surgically transplanting the language model weights from Devstral-Small-2507 into the Mistral-Small-3.2-24B-Instruct-2506 architecture while preserving all vision components:
- Started with Mistral-Small-3.2-24B-Instruct-2506 (complete multimodal model)
- Replaced only the core language model weights with Devstral-Small-2507's fine-tuned weights
- Preserved Mistral's vision encoder, multimodal projector, vision-language adapter, and token embeddings
- Kept Mistral's tokenizer to maintain proper image token handling
The result is a model that combines Devstral's state-of-the-art coding capabilities with Mistral's vision understanding.
Here is the script
Intended Use
Primary Use Cases
- Visual Software Engineering: Analyze UI screenshots, mockups, and designs to generate implementation code
- Code Review with Visual Context: Review code changes alongside their visual output
- Debugging Visual Issues: Debug rendering problems by analyzing screenshots
- Design-to-Code: Convert visual designs directly into code
- Documentation with Visual Examples: Generate documentation that references visual elements
Example Applications
- Building UI components from screenshots
- Debugging CSS/styling issues with visual feedback
- Converting Figma/design mockups to code
- Analyzing and reproducing visual bugs
- Creating visual test cases
Usage
With OpenHands
The model is optimized for use with OpenHands for agentic coding tasks:
# Using vLLM
vllm serve cognitivecomputations/Devstral-Vision-Small-2507 \
--tokenizer_mode mistral \
--config_format mistral \
--load_format mistral \
--tensor-parallel-size 2
# Configure OpenHands to use the model
# Set Custom Model: openai/cognitivecomputations/Devstral-Vision-Small-2507
# Set Base URL: http://localhost:8000/v1
With Transformers
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
model_id = "cognitivecomputations/Devstral-Vision-Small-2507"
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)
# Load an image
image = Image.open("screenshot.png")
# Create a prompt
prompt = "Analyze this UI screenshot and generate React code to reproduce it."
# Process inputs
inputs = processor(
text=prompt,
images=image,
return_tensors="pt"
).to(model.device)
# Generate
outputs = model.generate(
**inputs,
max_new_tokens=2000,
temperature=0.7
)
response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)
Performance Expectations
Coding Performance
Inherits Devstral's exceptional performance on coding tasks:
- 53.6% on SWE-Bench Verified (when used with OpenHands)
- Superior performance on multi-file editing and codebase exploration
- Excellent tool use and agentic behavior
Vision Performance
Maintains Mistral-Small's vision capabilities:
- Strong understanding of UI elements and layouts
- Accurate interpretation of charts, diagrams, and visual documentation
- Reliable screenshot analysis for debugging
Hardware Requirements
- GPU Memory: ~48GB for full precision, ~24GB with 4-bit quantization
- Recommended: 2x RTX 4090 or better for optimal performance
- Minimum: Single GPU with 24GB VRAM using quantization
Limitations
- Vision capabilities are limited to what Mistral-Small-3.2 supports
- Not specifically fine-tuned on vision-to-code tasks (uses Devstral's text-only fine-tuning)
- Large model size may be prohibitive for some deployment scenarios
- Best performance achieved when used with appropriate scaffolding (OpenHands, Cline, etc.)
Ethical Considerations
This model inherits both the capabilities and limitations of its parent models. Users should:
- Review generated code for security vulnerabilities
- Verify visual interpretations are accurate
- Be aware of potential biases in code generation
- Use appropriate safety measures in production deployments
Citation
If you use this model, please cite:
@misc{devstral-vision-2507,
author = {Hartford, Eric},
title = {Devstral-Vision-Small-2507},
year = {2025},
publisher = {HuggingFace},
url = {https://huggingface.co/cognitivecomputations/Devstral-Vision-Small-2507}
}
Acknowledgments
This model builds upon the excellent work by:
- Mistral AI for both Mistral-Small and Devstral
- All Hands AI for their collaboration on Devstral
- The open-source community for testing and feedback
License
Apache 2.0 - Same as the base models
Created with dolphin passion ๐ฌ by Cognitive Computations
Use with llama.cpp
Install llama.cpp through brew (works on Mac and Linux)
brew install llama.cpp
Invoke the llama.cpp server or the CLI.
CLI:
llama-cli --hf-repo havenvanheusen/Devstral-Vision-Small-2507-Q4_K_M-GGUF --hf-file devstral-vision-small-2507-q4_k_m.gguf -p "The meaning to life and the universe is"
Server:
llama-server --hf-repo havenvanheusen/Devstral-Vision-Small-2507-Q4_K_M-GGUF --hf-file devstral-vision-small-2507-q4_k_m.gguf -c 2048
Note: You can also use this checkpoint directly through the usage steps listed in the Llama.cpp repo as well.
Step 1: Clone llama.cpp from GitHub.
git clone https://github.com/ggerganov/llama.cpp
Step 2: Move into the llama.cpp folder and build it with LLAMA_CURL=1 flag along with other hardware-specific flags (for ex: LLAMA_CUDA=1 for Nvidia GPUs on Linux).
cd llama.cpp && LLAMA_CURL=1 make
Step 3: Run inference through the main binary.
./llama-cli --hf-repo havenvanheusen/Devstral-Vision-Small-2507-Q4_K_M-GGUF --hf-file devstral-vision-small-2507-q4_k_m.gguf -p "The meaning to life and the universe is"
or
./llama-server --hf-repo havenvanheusen/Devstral-Vision-Small-2507-Q4_K_M-GGUF --hf-file devstral-vision-small-2507-q4_k_m.gguf -c 2048
- Downloads last month
- 20
4-bit
Model tree for havenvanheusen/Devstral-Vision-Small-2507-Q4_K_M-GGUF
Base model
mistralai/Mistral-Small-3.1-24B-Base-2503