TESS-500M

TESS is a Vision-Language-Action (VLA) model for computer use, inspired by robotic VLAs. Given a screenshot and a natural-language instruction, it predicts either a mouse action (click coordinates) or a keyboard action (typing or shortcuts).

Model Description

  • Base Model: SmolVLM2-500M-Instruct
  • Architecture: SmolVLM + Router + Mouse/Keyboard heads
  • Parameters: 508M total, 48M trainable
  • Training Data: tess-agentnet (~312K samples)

Usage

import torch
from PIL import Image

# Clone the TESS repo
# git clone https://github.com/husseinlezzaik/TESS.git
# cd TESS/model

from test_checkpoint import load_model, predict

# Load model (fall back to CPU if no GPU is available)
device = "cuda" if torch.cuda.is_available() else "cpu"
model, processor = load_model("path/to/checkpoint.pt", device=device)

# Run inference
image = Image.open("screenshot.png")
result = predict(model, processor, image, "Click the search button")

print(result)
# Mouse action: {'action_type': 'mouse', 'xy': array([0.45, 0.32]), 'click_type': 'LEFT_CLICK'}
# Keyboard action: {'action_type': 'keyboard', 'action': 'type', 'value': 'hello world'}

Output Format

Mouse actions:

{
    'action_type': 'mouse',
    'xy': [x, y],  # Normalized coordinates (0-1)
    'click_type': 'LEFT_CLICK' | 'RIGHT_CLICK' | 'DOUBLE_CLICK' | ...
}

Keyboard actions:

{
    'action_type': 'keyboard',
    'action': 'type' | 'press' | 'hotkey',
    'value': 'text to type' | '<ENTER>' | '<SUPER+C>'
}
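A predicted action dict can be turned into a concrete input event by denormalizing the coordinates against the live screen resolution. A minimal sketch, assuming a known screen size; the `to_pixels` and `dispatch` helpers below are hypothetical illustrations, not part of the TESS repo:

```python
# Hypothetical helpers for consuming TESS action dicts (not part of the repo).

def to_pixels(xy, screen_w, screen_h):
    """Convert normalized (0-1) coordinates to integer pixel coordinates."""
    return round(xy[0] * screen_w), round(xy[1] * screen_h)

def dispatch(result, screen_w=1920, screen_h=1080):
    """Turn a predicted action dict into a concrete event description."""
    if result["action_type"] == "mouse":
        x, y = to_pixels(result["xy"], screen_w, screen_h)
        return {"event": result["click_type"], "x": x, "y": y}
    if result["action_type"] == "keyboard":
        return {"event": result["action"], "value": result["value"]}
    raise ValueError(f"unknown action_type: {result['action_type']}")

# Example: a mouse prediction at normalized (0.45, 0.32) on a 1920x1080 screen
print(dispatch({"action_type": "mouse", "xy": [0.45, 0.32],
                "click_type": "LEFT_CLICK"}))
# {'event': 'LEFT_CLICK', 'x': 864, 'y': 346}
```

In a real agent loop, the returned event would then be executed by an input-automation library of your choice.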

Architecture

Screenshot + Instruction β†’ SmolVLM2 β†’ Shared MLP β†’ Router
                                                    ↓
                                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                    ↓                               ↓
                              Mouse Branch                   Keyboard Branch
                              (XY + Click heads)            (VLM text generation)
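The router-plus-heads part of the diagram can be sketched in PyTorch. This is an illustrative sketch only: layer sizes, names, and the number of click types are assumptions, not the actual TESS implementation, and the keyboard branch is omitted because it reuses the VLM's text generation rather than a separate head.

```python
# Illustrative sketch of the shared MLP + router + mouse heads.
# Dimensions and layer names are assumptions, not the TESS source.
import torch
import torch.nn as nn

class ActionHeads(nn.Module):
    def __init__(self, hidden_dim=64, num_click_types=4):
        super().__init__()
        # Shared MLP over the VLM's pooled hidden state
        self.shared = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.GELU())
        # Router: binary choice between the mouse and keyboard branches
        self.router = nn.Linear(hidden_dim, 2)
        # Mouse branch: normalized (x, y) regression plus click-type classifier
        self.xy_head = nn.Linear(hidden_dim, 2)
        self.click_head = nn.Linear(hidden_dim, num_click_types)

    def forward(self, h):
        h = self.shared(h)
        return {
            "router_logits": self.router(h),       # (B, 2)
            "xy": torch.sigmoid(self.xy_head(h)),  # (B, 2), in [0, 1]
            "click_logits": self.click_head(h),    # (B, num_click_types)
        }

heads = ActionHeads()
out = heads(torch.randn(3, 64))
print(out["xy"].shape)  # torch.Size([3, 2])
```

The sigmoid on the XY head keeps predictions inside the normalized (0-1) coordinate range used in the output format above.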

Training

  • Epochs: 3
  • Batch Size: 48
  • Optimizer: AdamW (LR 2e-4 heads, 5e-4 embeddings)
  • Hardware: NVIDIA H100 80GB
  • Training Time: ~8 hours
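The two learning rates above correspond to separate AdamW parameter groups. A minimal sketch of that setup, using stand-in modules; the exact parameter grouping in the TESS repo may differ:

```python
# Sketch of the two-learning-rate AdamW setup (2e-4 heads, 5e-4 embeddings).
# The modules here are stand-ins, not the actual TESS parameter split.
import torch
import torch.nn as nn

model = nn.ModuleDict({
    "heads": nn.Linear(16, 4),           # stand-in for router/mouse heads
    "embeddings": nn.Embedding(10, 16),  # stand-in for trainable embeddings
})

optimizer = torch.optim.AdamW([
    {"params": model["heads"].parameters(), "lr": 2e-4},
    {"params": model["embeddings"].parameters(), "lr": 5e-4},
])

print([g["lr"] for g in optimizer.param_groups])  # [0.0002, 0.0005]
```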

Limitations

  • Trained primarily on desktop/web screenshots
  • English instructions only
  • May struggle with unusual UI layouts not seen in training

License

Apache 2.0

Citation

@misc{tess2025,
  title={TESS: A Vision-Language-Action Model for Computer Use},
  author={Hussein Lezzaik},
  year={2025},
  url={https://github.com/husseinlezzaik/TESS}
}
