
aquif-Grounding-7B

aquif-Grounding-7B is a state-of-the-art vision-language model specialized in grounding tasks, approaching parity with Claude Sonnet 4.5 on real-world desktop automation benchmarks. Released in November 2025, it marks a significant step for compact desktop agents, enabling precise visual understanding and computer interaction.

With a 7B-class footprint (8.3B parameters in total) and a 128K token context window, aquif-Grounding-7B delivers frontier-level performance on computer-use tasks while remaining deployable on consumer-grade hardware.

Model Overview

| Attribute | Value |
|---|---|
| Total Parameters | 8.3B |
| Context Window | 128K |
| Vision Encoder | Advanced transformer-based |
| Model Type | Vision-Language Model (VLM) |
| Specialized For | Grounding & Computer Use |
| Multilingual | ✅ 10 languages |
| License | MIT |
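
The advertised specs can be cross-checked against the published checkpoint by loading only its configuration, without downloading the full weights. This is a minimal sketch; the exact attribute names depend on the model's config class, so the getattr fallbacks below are assumptions rather than documented fields.

from transformers import AutoConfig

# Load only the configuration (no weights) to inspect the advertised specs
cfg = AutoConfig.from_pretrained("aquif-ai/aquif-Grounding-7B", trust_remote_code=True)

# Attribute names vary between VLM config classes; these are common guesses
print(getattr(cfg, "model_type", "n/a"))
print(getattr(cfg, "max_position_embeddings", "n/a"))  # expected: 131072 (128K)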

Key Features

Frontier-Level Grounding Performance

aquif-Grounding-7B achieves remarkable performance on real-world benchmarks, delivering competitive results with significantly fewer parameters than comparable frontier models. The model demonstrates exceptional efficiency in desktop automation and grounding-specific tasks.

Advanced Computer Vision Capabilities

  • Precise UI Element Localization: Accurately identifies and grounds visual elements for mouse/keyboard interaction
  • Spatial Reasoning: Understands layout, positioning, and spatial relationships essential for desktop automation
  • OCR Integration: Native support for recognizing and reading on-screen text
  • Context Preservation: 128K token window enables understanding of complex multi-step tasks and extensive context

Specialized Grounding Architecture

Built on proven foundations with architectural innovations optimized for:

  • Desktop Environment Understanding: Trained specifically on real-world computer interaction scenarios
  • Action Grounding: Maps natural language instructions to precise visual coordinates and interactions (see the parsing sketch after this list)
  • Error Recovery: Intelligent handling of desktop state changes and unexpected UI configurations
  • Multi-Step Reasoning: Maintains context across sequential actions and screenshots
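
Action grounding only pays off once the model's textual output is turned into an OS-level event. The model's exact output schema is not documented here, so the JSON shape below ({"action": ..., "coordinate": [x, y]}) is an assumed example format; adapt the parser to whatever the model actually emits.

import json
import pyautogui

def execute_action(raw: str, img_w: int, img_h: int) -> None:
    """Parse a model action string and replay it on the live screen.

    Assumes the model emits JSON like {"action": "click", "coordinate": [x, y]}
    with coordinates in the prepared-image space returned by prepare_image().
    """
    action = json.loads(raw)
    screen_w, screen_h = pyautogui.size()

    if action.get("action") == "click":
        x, y = action["coordinate"]
        # Rescale from prepared-image coordinates to physical screen pixels
        pyautogui.click(int(x * screen_w / img_w), int(y * screen_h / img_h))
    elif action.get("action") == "type":
        pyautogui.typewrite(action["text"])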

Evaluation

Desktop Automation Performance

| Metric | aquif-Grounding-7B + GPT-5 | Qwen3-VL-8B-Instruct | Qwen2.5-VL-7B-Instruct | Claude Sonnet 4.5 | Human Baseline |
|---|---|---|---|---|---|
| OSWorld | 59.8 | 33.9 | 19.6 | 61.4 | 72.4 |

Grounding-Specific Evaluation

| Metric | aquif-Grounding-7B | Qwen3-VL-8B-Instruct | Qwen2.5-VL-7B-Instruct | Claude Sonnet 4.5 |
|---|---|---|---|---|
| OSWorld-G | 68.9 | 58.2 | 42.7 | 71.2 |

Performance Analysis

aquif-Grounding-7B demonstrates exceptional capability-to-parameter efficiency. Despite being trained on a 7B base model, it achieves:

  • OSWorld: 59.8% accuracy when paired with GPT-5 as a planner, approaching Claude Sonnet 4.5's 61.4% and closing in on the 72.4% human baseline
  • OSWorld-G: 68.9% accuracy on grounding tasks, within 2.3 points of Claude Sonnet 4.5's 71.2% and well ahead of both Qwen baselines
  • Composability: the combined aquif-Grounding-7B + GPT-5 result on OSWorld shows the model slots cleanly into planner-executor pipelines, where a stronger planner amplifies its grounding

Installation

Requirements

pip install transformers torch pillow

For Computer Use Tasks

pip install transformers torch qwen-vl-utils pillow

Usage

Quick Start with helper.py

The model repository includes helper.py with utility functions for seamless inference:

from helper import prepare_image, create_messages, DEFAULT_TEMPERATURE, DEFAULT_MAX_NEW_TOKENS
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image

# Load model
processor = AutoProcessor.from_pretrained("aquif-ai/aquif-Grounding-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "aquif-ai/aquif-Grounding-7B",
    device_map="auto",
    trust_remote_code=True
)

# Prepare image using helper function
image = Image.open("screenshot.png")
prepared_image, (width, height) = prepare_image(image)

# Create messages with system prompt
instruction = "Click on the search button and type 'Python tutorials'"
messages = create_messages(instruction, prepared_image, width, height)

# Generate response
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(
    text=[text],
    images=[prepared_image],
    padding=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(
    **inputs,
    temperature=DEFAULT_TEMPERATURE,
    max_new_tokens=DEFAULT_MAX_NEW_TOKENS,
    do_sample=True,
)

# Decode only the newly generated tokens, not the echoed prompt
new_tokens = generated_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
print(response)

🤗 Transformers Integration

from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image

processor = AutoProcessor.from_pretrained("aquif-ai/aquif-Grounding-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "aquif-ai/aquif-Grounding-7B",
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True
)

# Load screenshot
image = Image.open("screenshot.png")

# Prepare messages
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "What elements are visible on this screen?"}
    ]}
]

# Process and generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(
    text=[text],
    images=[image],
    padding=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
# Strip the echoed prompt before decoding
new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])

Computer Use Agent Pattern

from helper import prepare_image, create_messages
from transformers import AutoProcessor, AutoModelForCausalLM
import pyautogui

processor = AutoProcessor.from_pretrained("aquif-ai/aquif-Grounding-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "aquif-ai/aquif-Grounding-7B",
    device_map="auto",
    trust_remote_code=True
)

def execute_instruction(instruction: str):
    # Take screenshot
    screenshot = pyautogui.screenshot()
    prepared, (w, h) = prepare_image(screenshot)
    
    # Get model guidance
    messages = create_messages(instruction, prepared, w, h)
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    inputs = processor(
        text=[text],
        images=[prepared],
        padding=True,
        return_tensors="pt"
    ).to(model.device)

    outputs = model.generate(**inputs, max_new_tokens=128)
    # Strip the echoed prompt before decoding
    new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
    action = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
    
    return action

# Example usage
result = execute_instruction("Find and click the download button")
print(result)
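
A full agent typically wraps execute_instruction in an observe-act loop, re-screenshotting between steps. The sketch below assumes the calling code supplies the step list; the delay value and the transcript shape are illustrative, not part of the model's API.

import time

def run_task(steps: list[str], delay: float = 1.0) -> list[str]:
    """Run a sequence of instructions, re-screenshotting between steps."""
    transcript = []
    for step in steps:
        action = execute_instruction(step)  # defined above
        transcript.append(action)
        # Give the UI time to settle before the next screenshot
        time.sleep(delay)
    return transcript

log = run_task([
    "Open the browser address bar",
    "Type 'python tutorials' and press Enter",
])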

Helper Functions Reference

prepare_image(image, min_pixels, max_pixels)

Resizes images to fit the model's supported pixel range while preserving aspect ratio.

Parameters:

  • image: PIL Image object
  • min_pixels: Minimum total pixel count (default: 78,400)
  • max_pixels: Maximum total pixel count (default: 6,000,000)

Returns: Tuple of (resized_image, (width, height))
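
The helper's source is the authority here, but the documented contract, an aspect-preserving resize into the [min_pixels, max_pixels] band, can be sketched as follows. This is an illustration of the behavior described above, not the shipped implementation.

import math
from PIL import Image

def prepare_image_sketch(image, min_pixels=78_400, max_pixels=6_000_000):
    """Aspect-preserving resize so width*height lands inside [min_pixels, max_pixels]."""
    w, h = image.size
    pixels = w * h
    if pixels > max_pixels:
        scale = math.sqrt(max_pixels / pixels)
    elif pixels < min_pixels:
        scale = math.sqrt(min_pixels / pixels)
    else:
        scale = 1.0
    new_w, new_h = int(w * scale), int(h * scale)
    resized = image.resize((new_w, new_h))
    return resized, (new_w, new_h)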

create_messages(instruction, image, width, height)

Formats messages with system prompt and tools specification for computer use.

Parameters:

  • instruction: User task description
  • image: PIL Image object
  • width: Image width
  • height: Image height

Returns: Formatted message list with system prompt and computer_use tool definition
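
For orientation, a message list in the shape this helper describes might look like the sketch below. The real system prompt and computer_use tool schema ship with helper.py, so everything here is placeholder content.

def create_messages_sketch(instruction, image, width, height):
    """Placeholder structure only; the real prompt and tool schema live in helper.py."""
    system_prompt = f"You control a computer with a {width}x{height} screen. ..."  # abbreviated
    return [
        {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
        {"role": "user", "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": instruction},
        ]},
    ]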

Technical Specifications

  • Architecture: Vision-Language Transformer
  • Vision Encoder: Advanced multimodal fusion
  • Attention: Optimized attention mechanisms for long context
  • Precision: BF16, FP16, FP32 support
  • Position Encoding: RoPE (Rotary Position Embeddings)
  • Training Data: Desktop automation tasks, UI screenshots, multilingual instructions
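
Since the card lists BF16, FP16, and FP32 support, precision can be pinned explicitly at load time instead of relying on torch_dtype="auto". A minimal sketch:

import torch
from transformers import AutoModelForCausalLM

# Pin BF16 explicitly, falling back to FP16 on GPUs without BF16 support
dtype = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else torch.float16
model = AutoModelForCausalLM.from_pretrained(
    "aquif-ai/aquif-Grounding-7B",
    torch_dtype=dtype,
    device_map="auto",
    trust_remote_code=True,
)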

Inference Framework Support

  • Transformers (Native): ✅ Full support
  • vLLM: ✅ Compatible
  • SGLang: ✅ Compatible
  • llama.cpp: ❌ Not supported
  • TensorRT: ⚠️ Experimental
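
Given the vLLM compatibility noted above, offline serving might look like the sketch below. This assumes the model's multimodal integration follows the same pattern as other Qwen-derived VLMs in vLLM, and the prompt template is a placeholder; verify both against the vLLM documentation for your version.

from vllm import LLM, SamplingParams
from PIL import Image

llm = LLM(model="aquif-ai/aquif-Grounding-7B", trust_remote_code=True)

image = Image.open("screenshot.png")
prompt = "USER: <image>\nWhere is the download button? ASSISTANT:"  # placeholder template

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=128, temperature=0.0),
)
print(outputs[0].outputs[0].text)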

Usage Recommendations

aquif-Grounding-7B excels at:

  • Desktop Automation: Browser control, application navigation, form filling
  • UI Understanding: Button identification, menu navigation, element interaction
  • Document Processing: Reading screens, extracting information from complex layouts
  • Task Planning: Multi-step instruction following with visual grounding
  • Accessibility Applications: Screen readers and UI navigation assistance
  • Computer-Use Agents: Autonomous desktop task completion

Limitations and Considerations

  • Desktop-Focused: Optimized for computer UI, may underperform on non-screen images
  • Action Specification: Coordinates and actions require integration with external tools (pyautogui, etc.)
  • Context Awareness: While supporting 128K context, efficiency may vary with very long interaction histories
  • Real-Time Performance: Suitable for offline tasks; real-time applications may require optimization
  • Hardware Requirements: 16GB VRAM recommended for smooth inference; quantization available for smaller GPUs

Performance Optimization

  • Quantization: Use INT8/FP8 quantization to reduce memory from ~16GB to 8-10GB (see the sketch after this list)
  • KV Caching: Leverage efficient caching for multi-turn conversations
  • Batch Processing: Batch multiple screenshots into a single forward pass when throughput matters more than latency
  • Image Preprocessing: Use the helper functions for optimal image scaling
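
One way to realize the INT8 option above is bitsandbytes through transformers. A minimal sketch, assuming the model's custom code is compatible with bitsandbytes quantization:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load weights in 8-bit to roughly halve VRAM relative to BF16
quant_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "aquif-ai/aquif-Grounding-7B",
    quantization_config=quant_config,
    device_map="auto",
    trust_remote_code=True,
)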

Acknowledgements

  • Qwen Team: Base architecture and vision encoder foundation
  • HuggingFace: Model infrastructure and community support
  • aquif AI Research Team: Grounding optimization and desktop automation specialization

License

This project is released under the MIT License.


Note: aquif-Grounding-7B is optimized for desktop and UI-based tasks. For production deployment in computer-use applications, test thoroughly on your specific use cases and UI frameworks.

Made in 🇧🇷

© 2025 aquif AI. All rights reserved.
