These are quantizations of the model aquif-Grounding-7B.

The quantized versions were created with an imatrix calculated from the text_all_large calibration dataset.

Usage Notes:

  • Download the latest llama.cpp to use these quantizations (see the example after this list).
  • Try to use the best quality you can run.
  • For the mmproj file, the F32 version is recommended for best results (F32 > BF16 > F16).
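
As a minimal example of running one of these quantizations with llama.cpp's multimodal CLI (the file names below are placeholders for whichever quantization and mmproj variant you downloaded):

llama-mtmd-cli -m aquif-Grounding-7B-Q4_K_M.gguf --mmproj mmproj-aquif-Grounding-7B-F32.gguf --image screenshot.png -p "Describe the UI elements on this screen."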

Model description from the creator:

aquif-Grounding-7B

aquif-Grounding-7B is a state-of-the-art vision-language model specialized in grounding tasks, achieving near-parity with Claude Sonnet 4.5 on real-world desktop automation benchmarks. Released in November 2025, this model represents a breakthrough in desktop agent capabilities, enabling precise visual understanding and computer interaction at an unprecedented level.

With only 7B parameters and a 128K token context window, aquif-Grounding-7B delivers frontier-level performance on computer-use tasks while remaining deployable on consumer-grade hardware.

Model Overview

| Attribute | Value |
|---|---|
| Total Parameters | 8.3B |
| Context Window | 128K |
| Vision Encoder | Advanced transformer-based |
| Model Type | Vision-Language Model (VLM) |
| Specialized For | Grounding & Computer Use |
| Multilingual | βœ… 10 languages |
| License | MIT |

Key Features

Frontier-Level Grounding Performance

aquif-Grounding-7B achieves remarkable performance on real-world benchmarks, delivering competitive results with significantly fewer parameters than comparable frontier models. The model demonstrates exceptional efficiency in desktop automation and grounding-specific tasks.

Advanced Computer Vision Capabilities

  • Precise UI Element Localization: Accurately identifies and grounds visual elements for mouse/keyboard interaction
  • Spatial Reasoning: Understands layout, positioning, and spatial relationships essential for desktop automation
  • OCR Integration: Native support for text recognition and reading on screen
  • Context Preservation: 128K token window enables understanding of complex multi-step tasks and extensive context

Specialized Grounding Architecture

Built on proven foundations with architectural innovations optimized for:

  • Desktop Environment Understanding: Trained specifically on real-world computer interaction scenarios
  • Action Grounding: Maps natural language instructions to precise visual coordinates and interactions
  • Error Recovery: Intelligent handling of desktop state changes and unexpected UI configurations
  • Multi-Step Reasoning: Maintains context across sequential actions and screenshots

Evaluation

Desktop Automation Performance

| Metric | aquif-Grounding-7B + GPT-5 | Qwen3-VL-8B-Instruct | Qwen2.5-VL-7B-Instruct | Claude Sonnet 4.5 | Human Baseline |
|---|---|---|---|---|---|
| OSWorld | 59.8 | 33.9 | 19.6 | 61.4 | 72.4 |

Grounding-Specific Evaluation

| Metric | aquif-Grounding-7B | Qwen3-VL-8B-Instruct | Qwen2.5-VL-7B-Instruct | Claude Sonnet 4.5 |
|---|---|---|---|---|
| OSWorld-G | 68.9 | 58.2 | 42.7 | 71.2 |

Performance Analysis

aquif-Grounding-7B demonstrates exceptional capability-to-parameter efficiency. Despite being trained on a 7B base model, it achieves:

  • OSWorld: 59.8% accuracy (when paired with GPT-5), approaching Claude Sonnet 4.5's 61.4% and nearing the human baseline of 72.4%
  • OSWorld-G: 68.9% accuracy on grounding tasks, within 2.3 points of Claude Sonnet 4.5's 71.2%
  • Scalability: the further gains from ensemble approaches (59.8% on OSWorld when combined with GPT-5) point to clear architectural headroom

Installation

Requirements

pip install transformers torch pillow

For Computer Use Tasks

pip install transformers torch qwen-vl-utils pillow

Usage

Quick Start with helper.py

The model repository includes helper.py with utility functions for seamless inference:

from helper import prepare_image, create_messages, DEFAULT_MAX_NEW_TOKENS
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image

# Load model
processor = AutoProcessor.from_pretrained("aquif-ai/aquif-Grounding-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "aquif-ai/aquif-Grounding-7B",
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True
)

# Prepare image using helper function
image = Image.open("screenshot.png")
prepared_image, (width, height) = prepare_image(image)

# Create messages with system prompt
instruction = "Click on the search button and type 'Python tutorials'"
messages = create_messages(instruction, prepared_image, width, height)

# Generate response
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, _ = processor.process_text(text, images=[prepared_image], videos=[])
inputs = processor(
    text=[text],
    images=[image_inputs],
    videos=[],
    padding=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(
    **inputs,
    max_new_tokens=DEFAULT_MAX_NEW_TOKENS,
    do_sample=False  # greedy decoding; a temperature setting only applies when do_sample=True
)

# Decode only the newly generated tokens, not the echoed prompt
trimmed_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
response = processor.batch_decode(trimmed_ids, skip_special_tokens=True)[0]
print(response)

πŸ€— Transformers Integration

from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image

processor = AutoProcessor.from_pretrained("aquif-ai/aquif-Grounding-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "aquif-ai/aquif-Grounding-7B",
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True
)

# Load screenshot
image = Image.open("screenshot.png")

# Prepare messages
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "What elements are visible on this screen?"}
    ]}
]

# Process and generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, _ = processor.process_text(text, images=[image], videos=[])
inputs = processor(
    text=[text],
    images=[image_inputs],
    padding=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, not the echoed prompt
trimmed_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, outputs)]
print(processor.batch_decode(trimmed_ids, skip_special_tokens=True)[0])
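
The creator credits the Qwen team for the base architecture, and the computer-use install step pulls in qwen-vl-utils, so the standard Qwen2-VL preprocessing path may also work. This is a sketch under that assumption, not a documented path for this model:

from PIL import Image
from qwen_vl_utils import process_vision_info
from transformers import AutoModelForCausalLM, AutoProcessor

processor = AutoProcessor.from_pretrained("aquif-ai/aquif-Grounding-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "aquif-ai/aquif-Grounding-7B", device_map="auto", torch_dtype="auto", trust_remote_code=True
)

messages = [
    {"role": "user", "content": [
        {"type": "image", "image": Image.open("screenshot.png")},
        {"type": "text", "text": "What elements are visible on this screen?"}
    ]}
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)  # extracts the PIL image(s) from the messages
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
trimmed_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, outputs)]
print(processor.batch_decode(trimmed_ids, skip_special_tokens=True)[0])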

Computer Use Agent Pattern

from helper import prepare_image, create_messages
from transformers import AutoProcessor, AutoModelForCausalLM
import pyautogui

processor = AutoProcessor.from_pretrained("aquif-ai/aquif-Grounding-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "aquif-ai/aquif-Grounding-7B",
    device_map="auto",
    trust_remote_code=True
)

def execute_instruction(instruction: str):
    # Take screenshot
    screenshot = pyautogui.screenshot()
    prepared, (w, h) = prepare_image(screenshot)
    
    # Get model guidance
    messages = create_messages(instruction, prepared, w, h)
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, _ = processor.process_text(text, images=[prepared], videos=[])
    
    inputs = processor(
        text=[text],
        images=[image_inputs],
        padding=True,
        return_tensors="pt"
    ).to(model.device)
    
    outputs = model.generate(**inputs, max_new_tokens=128)

    # Decode only the newly generated tokens, not the echoed prompt
    trimmed_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, outputs)]
    action = processor.batch_decode(trimmed_ids, skip_special_tokens=True)[0]

    return action

# Example usage
result = execute_instruction("Find and click the download button")
print(result)
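
The returned action is plain text whose exact format depends on the model's computer_use tool schema, which this card does not spell out. Purely as an illustration, assuming the model emitted a JSON payload such as {"action": "click", "x": 412, "y": 87}, a dispatcher might look like:

import json
import pyautogui

def dispatch(action_text: str) -> None:
    # The JSON schema here is an assumption; adapt it to the model's real output format.
    action = json.loads(action_text)
    if action["action"] == "click":
        pyautogui.click(action["x"], action["y"])
    elif action["action"] == "type":
        pyautogui.typewrite(action["text"])
    elif action["action"] == "scroll":
        pyautogui.scroll(action["amount"])
    else:
        raise ValueError(f"Unhandled action: {action['action']}")

dispatch(result)  # `result` comes from execute_instruction above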

Helper Functions Reference

prepare_image(image, min_pixels, max_pixels)

Resizes images to fit within the model's supported pixel range (min_pixels to max_pixels) while maintaining aspect ratio.

Parameters:

  • image: PIL Image object
  • min_pixels: Minimum total pixel count (default: 78,400)
  • max_pixels: Maximum total pixel count (default: 6,000,000)

Returns: Tuple of (resized_image, (width, height))
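
Because prepare_image may resize the screenshot, coordinates the model returns in prepared-image space have to be mapped back to the physical screen before clicking. A minimal sketch, assuming the model's coordinates refer to the resized image (the function name is illustrative):

def to_screen_coords(x: int, y: int,
                     prepared_size: tuple[int, int],
                     screen_size: tuple[int, int]) -> tuple[int, int]:
    """Rescale a point from prepared-image space back to screen space."""
    pw, ph = prepared_size  # (width, height) returned by prepare_image
    sw, sh = screen_size    # actual screen resolution, e.g. pyautogui.size()
    return round(x * sw / pw), round(y * sh / ph)

# Example: the model says click (412, 87) on a 1280x720 prepared image of a 2560x1440 screen
print(to_screen_coords(412, 87, (1280, 720), (2560, 1440)))  # -> (824, 174)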

create_messages(instruction, image, width, height)

Formats messages with system prompt and tools specification for computer use.

Parameters:

  • instruction: User task description
  • image: PIL Image object
  • width: Image width
  • height: Image height

Returns: Formatted message list with system prompt and computer_use tool definition

Technical Specifications

  • Architecture: Vision-Language Transformer
  • Vision Encoder: Advanced multimodal fusion
  • Attention: Optimized attention mechanisms for long context
  • Precision: BF16, FP16, FP32 support
  • Position Encoding: RoPE (Rotary Position Embeddings)
  • Training Data: Desktop automation tasks, UI screenshots, multilingual instructions

Inference Framework Support

  • Transformers (Native): βœ… Full support
  • vLLM: βœ… Compatible
  • SGLang: βœ… Compatible
  • llama.cpp: ❌ Not supported by the original release (the GGUF quantizations in this repository target recent llama.cpp builds; see the usage notes above)
  • TensorRT: ⚠️ Experimental

Usage Recommendations

aquif-Grounding-7B excels at:

  • Desktop Automation: Browser control, application navigation, form filling
  • UI Understanding: Button identification, menu navigation, element interaction
  • Document Processing: Reading screens, extracting information from complex layouts
  • Task Planning: Multi-step instruction following with visual grounding
  • Accessibility Applications: Screen readers and UI navigation assistance
  • Computer-Use Agents: Autonomous desktop task completion

Limitations and Considerations

  • Desktop-Focused: Optimized for computer UI, may underperform on non-screen images
  • Action Specification: Coordinates and actions require integration with external tools (pyautogui, etc.)
  • Context Awareness: While supporting 128K context, efficiency may vary with very long interaction histories
  • Real-Time Performance: Suitable for offline tasks; real-time applications may require optimization
  • Hardware Requirements: 16GB VRAM recommended for smooth inference; quantization available for smaller GPUs

Performance Optimization

  • Quantization: Use INT8/FP8 quantization to reduce memory from ~16GB to 8-10GB (see the sketch after this list)
  • KV Caching: Leverage efficient caching for multi-turn conversations
  • Batch Processing: Process multiple screenshots sequentially for efficiency
  • Image Preprocessing: Use helper functions for optimal image scaling
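
One way to exercise the INT8 path above is 8-bit weight loading through transformers and bitsandbytes; this is a sketch under that assumption, since the card does not name a specific quantization backend:

from transformers import AutoModelForCausalLM, AutoProcessor, BitsAndBytesConfig

# 8-bit weight quantization; requires the bitsandbytes package and a CUDA GPU
quant_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "aquif-ai/aquif-Grounding-7B",
    device_map="auto",
    quantization_config=quant_config,
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained("aquif-ai/aquif-Grounding-7B", trust_remote_code=True)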

Acknowledgements

  • Qwen Team: Base architecture and vision encoder foundation
  • HuggingFace: Model infrastructure and community support
  • aquif AI Research Team: Grounding optimization and desktop automation specialization

License

This project is released under the MIT License.


Note: aquif-Grounding-7B is optimized for desktop and UI-based tasks. For production deployment in computer-use applications, test thoroughly on your specific use cases and UI frameworks.

Made in πŸ‡§πŸ‡·

Β© 2025 aquif AI. All rights reserved.
