
aquif-Grounding-7B

aquif-Grounding-7B is a state-of-the-art vision-language model specialized in grounding tasks, approaching parity with Claude Sonnet 4.5 on real-world desktop automation benchmarks. Released in November 2025, it marks a significant step for compact desktop agents, enabling precise visual understanding and computer interaction.

With a 7B-class footprint (8.3B parameters in total) and a 128K token context window, aquif-Grounding-7B delivers frontier-level performance on computer-use tasks while remaining deployable on consumer-grade hardware.

Model Overview

| Attribute | Value |
|---|---|
| Total Parameters | 8.3B |
| Context Window | 128K |
| Vision Encoder | Advanced transformer-based |
| Model Type | Vision-Language Model (VLM) |
| Specialized For | Grounding & Computer Use |
| Multilingual | ✅ 10 languages |
| License | MIT |
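
The advertised specs can be cross-checked against the published checkpoint by loading only its configuration, without downloading the full weights. This is a minimal sketch; the exact attribute names depend on the model's config class, so the getattr fallbacks below are assumptions rather than documented fields.

from transformers import AutoConfig

# Load only the configuration (no weights) to inspect the advertised specs
cfg = AutoConfig.from_pretrained("aquif-ai/aquif-Grounding-7B", trust_remote_code=True)

# Attribute names vary between VLM config classes; these are common guesses
print(getattr(cfg, "model_type", "n/a"))
print(getattr(cfg, "max_position_embeddings", "n/a"))  # expected: 131072 (128K)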

Key Features

Frontier-Level Grounding Performance

aquif-Grounding-7B achieves remarkable performance on real-world benchmarks, delivering competitive results with significantly fewer parameters than comparable frontier models. The model demonstrates exceptional efficiency in desktop automation and grounding-specific tasks.

Advanced Computer Vision Capabilities

  • Precise UI Element Localization: Accurately identifies and grounds visual elements for mouse/keyboard interaction
  • Spatial Reasoning: Understands layout, positioning, and spatial relationships essential for desktop automation
  • OCR Integration: Native support for recognizing and reading on-screen text
  • Context Preservation: 128K token window enables understanding of complex multi-step tasks and extensive context

Specialized Grounding Architecture

Built on proven foundations with architectural innovations optimized for:

  • Desktop Environment Understanding: Trained specifically on real-world computer interaction scenarios
  • Action Grounding: Maps natural language instructions to precise visual coordinates and interactions (see the parsing sketch after this list)
  • Error Recovery: Intelligent handling of desktop state changes and unexpected UI configurations
  • Multi-Step Reasoning: Maintains context across sequential actions and screenshots
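
Action grounding only pays off once the model's textual output is turned into an OS-level event. The model's exact output schema is not documented here, so the JSON shape below ({"action": ..., "coordinate": [x, y]}) is an assumed example format; adapt the parser to whatever the model actually emits.

import json
import pyautogui

def execute_action(raw: str, img_w: int, img_h: int) -> None:
    """Parse a model action string and replay it on the live screen.

    Assumes the model emits JSON like {"action": "click", "coordinate": [x, y]}
    with coordinates in the prepared-image space returned by prepare_image().
    """
    action = json.loads(raw)
    screen_w, screen_h = pyautogui.size()

    if action.get("action") == "click":
        x, y = action["coordinate"]
        # Rescale from prepared-image coordinates to physical screen pixels
        pyautogui.click(int(x * screen_w / img_w), int(y * screen_h / img_h))
    elif action.get("action") == "type":
        pyautogui.typewrite(action["text"])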

Evaluation

Desktop Automation Performance

| Metric | aquif-Grounding-7B + GPT-5 | Qwen3-VL-8B-Instruct | Qwen2.5-VL-7B-Instruct | Claude Sonnet 4.5 | Human Baseline |
|---|---|---|---|---|---|
| OSWorld | 59.8 | 33.9 | 19.6 | 61.4 | 72.4 |

Grounding-Specific Evaluation

| Metric | aquif-Grounding-7B | Qwen3-VL-8B-Instruct | Qwen2.5-VL-7B-Instruct | Claude Sonnet 4.5 |
|---|---|---|---|---|
| OSWorld-G | 68.9 | 58.2 | 42.7 | 71.2 |

Performance Analysis

aquif-Grounding-7B demonstrates exceptional capability-to-parameter efficiency. Despite being trained on a 7B base model, it achieves:

  • OSWorld: 59.8% accuracy when paired with GPT-5 as a planner, approaching Claude Sonnet 4.5's 61.4% and closing in on the 72.4% human baseline
  • OSWorld-G: 68.9% accuracy on grounding tasks, within 2.3 points of Claude Sonnet 4.5's 71.2% and well ahead of both Qwen baselines
  • Composability: the combined aquif-Grounding-7B + GPT-5 result on OSWorld shows the model slots cleanly into planner-executor pipelines, where a stronger planner amplifies its grounding

Installation

Requirements

pip install transformers torch pillow

For Computer Use Tasks

pip install transformers torch qwen-vl-utils pillow

Usage

Quick Start with helper.py

The model repository includes helper.py with utility functions for seamless inference:

from helper import prepare_image, create_messages, DEFAULT_TEMPERATURE, DEFAULT_MAX_NEW_TOKENS
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image

# Load model
processor = AutoProcessor.from_pretrained("aquif-ai/aquif-Grounding-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "aquif-ai/aquif-Grounding-7B",
    device_map="auto",
    trust_remote_code=True
)

# Prepare image using helper function
image = Image.open("screenshot.png")
prepared_image, (width, height) = prepare_image(image)

# Create messages with system prompt
instruction = "Click on the search button and type 'Python tutorials'"
messages = create_messages(instruction, prepared_image, width, height)

# Generate response
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(
    text=[text],
    images=[prepared_image],
    padding=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(
    **inputs,
    temperature=DEFAULT_TEMPERATURE,
    max_new_tokens=DEFAULT_MAX_NEW_TOKENS,
    do_sample=True,
)

# Decode only the newly generated tokens, not the echoed prompt
new_tokens = generated_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
print(response)

🤗 Transformers Integration

from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image

processor = AutoProcessor.from_pretrained("aquif-ai/aquif-Grounding-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "aquif-ai/aquif-Grounding-7B",
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True
)

# Load screenshot
image = Image.open("screenshot.png")

# Prepare messages
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "What elements are visible on this screen?"}
    ]}
]

# Process and generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(
    text=[text],
    images=[image],
    padding=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
# Strip the echoed prompt before decoding
new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])

Computer Use Agent Pattern

from helper import prepare_image, create_messages
from transformers import AutoProcessor, AutoModelForCausalLM
import pyautogui

processor = AutoProcessor.from_pretrained("aquif-ai/aquif-Grounding-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "aquif-ai/aquif-Grounding-7B",
    device_map="auto",
    trust_remote_code=True
)

def execute_instruction(instruction: str):
    # Take screenshot
    screenshot = pyautogui.screenshot()
    prepared, (w, h) = prepare_image(screenshot)
    
    # Get model guidance
    messages = create_messages(instruction, prepared, w, h)
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    inputs = processor(
        text=[text],
        images=[prepared],
        padding=True,
        return_tensors="pt"
    ).to(model.device)

    outputs = model.generate(**inputs, max_new_tokens=128)
    # Strip the echoed prompt before decoding
    new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
    action = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
    
    return action

# Example usage
result = execute_instruction("Find and click the download button")
print(result)
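
A full agent typically wraps execute_instruction in an observe-act loop, re-screenshotting between steps. The sketch below assumes the calling code supplies the step list; the delay value and the transcript shape are illustrative, not part of the model's API.

import time

def run_task(steps: list[str], delay: float = 1.0) -> list[str]:
    """Run a sequence of instructions, re-screenshotting between steps."""
    transcript = []
    for step in steps:
        action = execute_instruction(step)  # defined above
        transcript.append(action)
        # Give the UI time to settle before the next screenshot
        time.sleep(delay)
    return transcript

log = run_task([
    "Open the browser address bar",
    "Type 'python tutorials' and press Enter",
])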

Helper Functions Reference

prepare_image(image, min_pixels, max_pixels)

Resizes images to fit the model's supported pixel range while preserving aspect ratio.

Parameters:

  • image: PIL Image object
  • min_pixels: Minimum total pixel count (default: 78,400)
  • max_pixels: Maximum total pixel count (default: 6,000,000)

Returns: Tuple of (resized_image, (width, height))
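
The helper's source is the authority here, but the documented contract, an aspect-preserving resize into the [min_pixels, max_pixels] band, can be sketched as follows. This is an illustration of the behavior described above, not the shipped implementation.

import math
from PIL import Image

def prepare_image_sketch(image, min_pixels=78_400, max_pixels=6_000_000):
    """Aspect-preserving resize so width*height lands inside [min_pixels, max_pixels]."""
    w, h = image.size
    pixels = w * h
    if pixels > max_pixels:
        scale = math.sqrt(max_pixels / pixels)
    elif pixels < min_pixels:
        scale = math.sqrt(min_pixels / pixels)
    else:
        scale = 1.0
    new_w, new_h = int(w * scale), int(h * scale)
    resized = image.resize((new_w, new_h))
    return resized, (new_w, new_h)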

create_messages(instruction, image, width, height)

Formats messages with system prompt and tools specification for computer use.

Parameters:

  • instruction: User task description
  • image: PIL Image object
  • width: Image width
  • height: Image height

Returns: Formatted message list with system prompt and computer_use tool definition
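
For orientation, a message list in the shape this helper describes might look like the sketch below. The real system prompt and computer_use tool schema ship with helper.py, so everything here is placeholder content.

def create_messages_sketch(instruction, image, width, height):
    """Placeholder structure only; the real prompt and tool schema live in helper.py."""
    system_prompt = f"You control a computer with a {width}x{height} screen. ..."  # abbreviated
    return [
        {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
        {"role": "user", "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": instruction},
        ]},
    ]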

Technical Specifications

  • Architecture: Vision-Language Transformer
  • Vision Encoder: Advanced multimodal fusion
  • Attention: Optimized attention mechanisms for long context
  • Precision: BF16, FP16, FP32 support
  • Position Encoding: RoPE (Rotary Position Embeddings)
  • Training Data: Desktop automation tasks, UI screenshots, multilingual instructions
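
Since the card lists BF16, FP16, and FP32 support, precision can be pinned explicitly at load time instead of relying on torch_dtype="auto". A minimal sketch:

import torch
from transformers import AutoModelForCausalLM

# Pin BF16 explicitly, falling back to FP16 on GPUs without BF16 support
dtype = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else torch.float16
model = AutoModelForCausalLM.from_pretrained(
    "aquif-ai/aquif-Grounding-7B",
    torch_dtype=dtype,
    device_map="auto",
    trust_remote_code=True,
)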

Inference Framework Support

  • Transformers (Native): ✅ Full support
  • vLLM: ✅ Compatible
  • SGLang: ✅ Compatible
  • llama.cpp: ❌ Not supported
  • TensorRT: ⚠️ Experimental
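
Given the vLLM compatibility noted above, offline serving might look like the sketch below. This assumes the model's multimodal integration follows the same pattern as other Qwen-derived VLMs in vLLM, and the prompt template is a placeholder; verify both against the vLLM documentation for your version.

from vllm import LLM, SamplingParams
from PIL import Image

llm = LLM(model="aquif-ai/aquif-Grounding-7B", trust_remote_code=True)

image = Image.open("screenshot.png")
prompt = "USER: <image>\nWhere is the download button? ASSISTANT:"  # placeholder template

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=128, temperature=0.0),
)
print(outputs[0].outputs[0].text)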

Usage Recommendations

aquif-Grounding-7B excels at:

  • Desktop Automation: Browser control, application navigation, form filling
  • UI Understanding: Button identification, menu navigation, element interaction
  • Document Processing: Reading screens, extracting information from complex layouts
  • Task Planning: Multi-step instruction following with visual grounding
  • Accessibility Applications: Screen readers and UI navigation assistance
  • Computer-Use Agents: Autonomous desktop task completion

Limitations and Considerations

  • Desktop-Focused: Optimized for computer UI, may underperform on non-screen images
  • Action Specification: Coordinates and actions require integration with external tools (pyautogui, etc.)
  • Context Awareness: While supporting 128K context, efficiency may vary with very long interaction histories
  • Real-Time Performance: Suitable for offline tasks; real-time applications may require optimization
  • Hardware Requirements: 16GB VRAM recommended for smooth inference; quantization available for smaller GPUs

Performance Optimization

  • Quantization: Use INT8/FP8 quantization to reduce memory from ~16GB to 8-10GB (see the sketch after this list)
  • KV Caching: Leverage efficient caching for multi-turn conversations
  • Batch Processing: Batch multiple screenshots into a single forward pass when throughput matters more than latency
  • Image Preprocessing: Use the helper functions for optimal image scaling
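
One way to realize the INT8 option above is bitsandbytes through transformers. A minimal sketch, assuming the model's custom code is compatible with bitsandbytes quantization:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load weights in 8-bit to roughly halve VRAM relative to BF16
quant_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "aquif-ai/aquif-Grounding-7B",
    quantization_config=quant_config,
    device_map="auto",
    trust_remote_code=True,
)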

Acknowledgements

  • Qwen Team: Base architecture and vision encoder foundation
  • HuggingFace: Model infrastructure and community support
  • aquif AI Research Team: Grounding optimization and desktop automation specialization

License

This project is released under the MIT License.


Note: aquif-Grounding-7B is optimized for desktop and UI-based tasks. For production deployment in computer-use applications, test thoroughly on your specific use cases and UI frameworks.

Made in 🇧🇷

© 2025 aquif AI. All rights reserved.
