These are GGUF quantizations of the model aquif-Grounding-7B.
The quantized versions were created with an importance matrix (imatrix) calculated from text_all_large.
Usage Notes:
- Download the latest llama.cpp to use these quantizations.
- Use the best quality quantization that your hardware can run.
- For the mmproj file, the F32 version is recommended for best results (F32 > BF16 > F16); see the example command below.
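As an example, recent llama.cpp builds include the multimodal CLI llama-mtmd-cli, which takes the main GGUF file together with the mmproj file. The file names below are placeholders; substitute the quantization and mmproj files you actually downloaded from this repository.

```bash
# Run a quantization together with its mmproj file (file names are placeholders)
./llama-mtmd-cli \
  -m aquif-Grounding-7B-Q5_K_M.gguf \
  --mmproj mmproj-aquif-Grounding-7B-F32.gguf \
  --image screenshot.png \
  -p "Locate the search button in this screenshot."
```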
Model description from the creator:
aquif-Grounding-7B
aquif-Grounding-7B is a state-of-the-art vision-language model specialized in grounding tasks, achieving parity with Claude Sonnet 4.5 on real-world desktop automation benchmarks. Released in November 2025, this model represents a breakthrough in desktop agent capabilities, enabling precise visual understanding and computer interaction at an unprecedented level.
With only 7B parameters and a 128K token context window, aquif-Grounding-7B delivers frontier-level performance on computer-use tasks while remaining deployable on consumer-grade hardware.
Model Overview
| Attribute | Value |
|---|---|
| Total Parameters | 8.3B |
| Context Window | 128K |
| Vision Encoder | Advanced transformer-based |
| Model Type | Vision-Language Model (VLM) |
| Specialized For | Grounding & Computer Use |
| Multilingual | ✅ 10 languages |
| License | MIT |
Key Features
Frontier-Level Grounding Performance
aquif-Grounding-7B achieves remarkable performance on real-world benchmarks, delivering competitive results with significantly fewer parameters than comparable frontier models. The model demonstrates exceptional efficiency in desktop automation and grounding-specific tasks.
Advanced Computer Vision Capabilities
- Precise UI Element Localization: Accurately identifies and grounds visual elements for mouse/keyboard interaction
- Spatial Reasoning: Understands layout, positioning, and spatial relationships essential for desktop automation
- OCR Integration: Native support for text recognition and reading on screen
- Context Preservation: 128K token window enables understanding of complex multi-step tasks and extensive context
Specialized Grounding Architecture
Built on proven foundations with architectural innovations optimized for:
- Desktop Environment Understanding: Trained specifically on real-world computer interaction scenarios
- Action Grounding: Maps natural language instructions to precise visual coordinates and interactions
- Error Recovery: Intelligent handling of desktop state changes and unexpected UI configurations
- Multi-Step Reasoning: Maintains context across sequential actions and screenshots
Evaluation
Desktop Automation Performance
| Benchmark | aquif-Grounding-7B + GPT-5 | Qwen3-VL-8B-Instruct | Qwen2.5-VL-7B-Instruct | Claude Sonnet 4.5 | Human Baseline |
|---|---|---|---|---|---|
| OSWorld | 59.8 | 33.9 | 19.6 | 61.4 | 72.4 |
Grounding-Specific Evaluation
| Benchmark | aquif-Grounding-7B | Qwen3-VL-8B-Instruct | Qwen2.5-VL-7B-Instruct | Claude Sonnet 4.5 |
|---|---|---|---|---|
| OSWorld-G | 68.9 | 58.2 | 42.7 | 71.2 |
Performance Analysis
aquif-Grounding-7B demonstrates exceptional capability-to-parameter efficiency. Despite being trained on a 7B base model, it achieves:
- OSWorld: 59.8% accuracy when paired with GPT-5, approaching Claude Sonnet 4.5's 61.4% and closing in on the human baseline of 72.4%
- OSWorld-G: 68.9% accuracy on grounding tasks, approaching Claude Sonnet 4.5's 71.2% and well ahead of Qwen3-VL-8B-Instruct's 58.2%
- Scalability: the 59.8% OSWorld result obtained in combination with GPT-5 shows that the model's grounding combines well with a stronger planner, leaving clear headroom beyond standalone use
Installation
Requirements
```bash
pip install transformers torch pillow
```
For Computer Use Tasks
```bash
pip install transformers torch qwen-vl-utils pillow
```
Usage
Quick Start with helper.py
The model repository includes helper.py with utility functions for seamless inference:
```python
from helper import prepare_image, create_messages, DEFAULT_TEMPERATURE, DEFAULT_MAX_NEW_TOKENS
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image

# Load model
processor = AutoProcessor.from_pretrained("aquif-ai/aquif-Grounding-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "aquif-ai/aquif-Grounding-7B",
    device_map="auto",
    trust_remote_code=True
)

# Prepare image using helper function
image = Image.open("screenshot.png")
prepared_image, (width, height) = prepare_image(image)

# Create messages with system prompt
instruction = "Click on the search button and type 'Python tutorials'"
messages = create_messages(instruction, prepared_image, width, height)

# Generate response
text = processor.apply_chat_template(messages, add_generation_prompt=True)
image_inputs, video_inputs = processor.process_text(text, images=[prepared_image], videos=[])
inputs = processor(
    text=[text],
    images=[image_inputs],
    videos=[],
    padding=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(
    **inputs,
    temperature=DEFAULT_TEMPERATURE,
    max_new_tokens=DEFAULT_MAX_NEW_TOKENS,
    do_sample=False
)
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
🤗 Transformers Integration
```python
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image

processor = AutoProcessor.from_pretrained("aquif-ai/aquif-Grounding-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "aquif-ai/aquif-Grounding-7B",
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True
)

# Load screenshot
image = Image.open("screenshot.png")

# Prepare messages
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "What elements are visible on this screen?"}
    ]}
]

# Process and generate
text = processor.apply_chat_template(messages, add_generation_prompt=True)
image_inputs, _ = processor.process_text(text, images=[image], videos=[])
inputs = processor(
    text=[text],
    images=[image_inputs],
    padding=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```
Computer Use Agent Pattern
```python
from helper import prepare_image, create_messages
from transformers import AutoProcessor, AutoModelForCausalLM
import pyautogui

processor = AutoProcessor.from_pretrained("aquif-ai/aquif-Grounding-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "aquif-ai/aquif-Grounding-7B",
    device_map="auto",
    trust_remote_code=True
)

def execute_instruction(instruction: str):
    # Take screenshot
    screenshot = pyautogui.screenshot()
    prepared, (w, h) = prepare_image(screenshot)

    # Get model guidance
    messages = create_messages(instruction, prepared, w, h)
    text = processor.apply_chat_template(messages, add_generation_prompt=True)
    image_inputs, _ = processor.process_text(text, images=[prepared], videos=[])
    inputs = processor(
        text=[text],
        images=[image_inputs],
        padding=True,
        return_tensors="pt"
    ).to(model.device)

    outputs = model.generate(**inputs, max_new_tokens=128)
    action = processor.batch_decode(outputs, skip_special_tokens=True)[0]
    return action

# Example usage
result = execute_instruction("Find and click the download button")
print(result)
```
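The agent pattern above returns the model's raw action text. The sketch below illustrates one way to turn that text into a desktop action; the JSON action schema and the coordinate handling shown here are assumptions for illustration only, since the model's actual output format is defined by the system prompt and tool specification in helper.py.

```python
import json

import pyautogui

def dispatch_action(action_text: str) -> None:
    """Illustrative sketch only: assumes the model emits JSON such as
    {"action": "click", "x": 512, "y": 384} or {"action": "type", "text": "hello"}.
    The real schema comes from helper.py's system prompt and may differ."""
    try:
        action = json.loads(action_text)
    except json.JSONDecodeError:
        print(f"Could not parse action: {action_text!r}")
        return

    kind = action.get("action")
    if kind == "click":
        # Coordinates are assumed to be pixels in the prepared screenshot's space;
        # rescale them if your physical screen resolution differs.
        pyautogui.click(action["x"], action["y"])
    elif kind == "type":
        pyautogui.typewrite(action["text"])
    else:
        print(f"Unhandled action type: {kind}")

# Hypothetical usage with execute_instruction() defined above:
# dispatch_action(execute_instruction("Find and click the download button"))
```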
Helper Functions Reference
prepare_image(image, min_pixels, max_pixels)
Resizes images intelligently for optimal model processing while maintaining aspect ratio.
Parameters:
- image: PIL Image object
- min_pixels: Minimum resolution in pixels (default: 78,400)
- max_pixels: Maximum resolution in pixels (default: 6,000,000)
Returns: Tuple of (resized_image, (width, height))
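The real implementation ships in the repository's helper.py; the sketch below is only a minimal illustration of the aspect-ratio-preserving rescale described above, using the stated defaults, and is not the repository's code.

```python
import math

from PIL import Image

def prepare_image_sketch(image, min_pixels=78_400, max_pixels=6_000_000):
    """Minimal illustration: rescale so the pixel count lands inside
    [min_pixels, max_pixels] while keeping the aspect ratio."""
    width, height = image.size
    pixels = width * height

    if pixels > max_pixels:
        scale = math.sqrt(max_pixels / pixels)
    elif pixels < min_pixels:
        scale = math.sqrt(min_pixels / pixels)
    else:
        scale = 1.0

    new_size = (max(1, round(width * scale)), max(1, round(height * scale)))
    return image.resize(new_size, Image.LANCZOS), new_size
```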
create_messages(instruction, image, width, height)
Formats messages with system prompt and tools specification for computer use.
Parameters:
- instruction: User task description
- image: PIL Image object
- width: Image width
- height: Image height
Returns: Formatted message list with system prompt and computer_use tool definition
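For orientation only, the returned list has roughly the shape sketched below; the actual system prompt wording and computer_use tool schema live in helper.py and are not reproduced here.

```python
# Hypothetical shape of the list returned by create_messages(); the real system
# prompt and tool definition in helper.py will differ in detail.
example_messages = [
    {
        "role": "system",
        "content": "<system prompt describing the computer_use tool and coordinate conventions>",
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "<prepared PIL image>"},
            {"type": "text", "text": "Click on the search button (screen: 1920x1080)"},
        ],
    },
]
```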
Technical Specifications
- Architecture: Vision-Language Transformer
- Vision Encoder: Advanced multimodal fusion
- Attention: Optimized attention mechanisms for long context
- Precision: BF16, FP16, FP32 support
- Position Encoding: RoPE (Rotary Position Embeddings)
- Training Data: Desktop automation tasks, UI screenshots, multilingual instructions
Inference Framework Support
- Transformers (Native): ✅ Full support
- vLLM: ✅ Compatible
- SGLang: ✅ Compatible
- llama.cpp: ❌ Not supported
- TensorRT: ⚠️ Experimental
Usage Recommendations
aquif-Grounding-7B excels at:
- Desktop Automation: Browser control, application navigation, form filling
- UI Understanding: Button identification, menu navigation, element interaction
- Document Processing: Reading screens, extracting information from complex layouts
- Task Planning: Multi-step instruction following with visual grounding
- Accessibility Applications: Screen readers and UI navigation assistance
- Computer-Use Agents: Autonomous desktop task completion
Limitations and Considerations
- Desktop-Focused: Optimized for computer UI, may underperform on non-screen images
- Action Specification: Coordinates and actions require integration with external tools (pyautogui, etc.)
- Context Awareness: While supporting 128K context, efficiency may vary with very long interaction histories
- Real-Time Performance: Suitable for offline tasks; real-time applications may require optimization
- Hardware Requirements: 16GB VRAM recommended for smooth inference; quantization available for smaller GPUs
Performance Optimization
- Quantization: Use INT8/FP8 quantization to reduce memory from ~16 GB to 8-10 GB (see the sketch after this list)
- KV Caching: Leverage efficient caching for multi-turn conversations
- Batch Processing: Process multiple screenshots sequentially for efficiency
- Image Preprocessing: Use helper functions for optimal image scaling
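As a rough illustration of the quantization point above, an 8-bit load via bitsandbytes and 🤗 Transformers could look like the sketch below; this is a generic pattern, not a configuration validated for this model, and it requires the bitsandbytes package in addition to the dependencies listed under Installation.

```python
from transformers import AutoModelForCausalLM, AutoProcessor, BitsAndBytesConfig

# Generic 8-bit loading sketch (needs `pip install bitsandbytes`); not a
# configuration validated for aquif-Grounding-7B specifically.
quant_config = BitsAndBytesConfig(load_in_8bit=True)

processor = AutoProcessor.from_pretrained("aquif-ai/aquif-Grounding-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "aquif-ai/aquif-Grounding-7B",
    device_map="auto",
    quantization_config=quant_config,
    trust_remote_code=True,
)
```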
Acknowledgements
- Qwen Team: Base architecture and vision encoder foundation
- HuggingFace: Model infrastructure and community support
- aquif AI Research Team: Grounding optimization and desktop automation specialization
License
This project is released under the MIT License.
Note: aquif-Grounding-7B is optimized for desktop and UI-based tasks. For production deployment in computer-use applications, test thoroughly on your specific use cases and UI frameworks.
Made in 🇧🇷
Β© 2025 aquif AI. All rights reserved.