TESS-500M
TESS is a Vision-Language-Action (VLA) model for computer use, inspired by robotic VLAs. Given a screenshot and a natural-language instruction, it predicts either a mouse action (click coordinates and click type) or a keyboard action (typed text or a key shortcut).
Model Description
- Base Model: SmolVLM2-500M-Instruct
- Architecture: SmolVLM + Router + Mouse/Keyboard heads
- Parameters: 508M total, 48M trainable
- Training Data: tess-agentnet (~312K samples)
Usage
import torch
from PIL import Image
# Clone the TESS repo and run this from TESS/model so that test_checkpoint can be imported
# git clone https://github.com/husseinlezzaik/TESS.git
# cd TESS/model
from test_checkpoint import load_model, predict
# Load model
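device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is available
model, processor = load_model("path/to/checkpoint.pt", device=device)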
# Run inference
image = Image.open("screenshot.png")
result = predict(model, processor, image, "Click the search button")
print(result)
# Mouse action: {'action_type': 'mouse', 'xy': array([0.45, 0.32]), 'click_type': 'LEFT_CLICK'}
# Keyboard action: {'action_type': 'keyboard', 'action': 'type', 'value': 'hello world'}
Output Format
Mouse actions:
{
'action_type': 'mouse',
'xy': [x, y], # Normalized coordinates (0-1)
'click_type': 'LEFT_CLICK' | 'RIGHT_CLICK' | 'DOUBLE_CLICK' | ...
}
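Because `xy` is normalized, it has to be scaled to the screen or screenshot size before a click can be issued. The helper below is a minimal sketch of that conversion; `to_pixels` is a hypothetical name, not part of the TESS repo, and it assumes the origin is the top-left corner and that coordinates are relative to the full screenshot.

```python
def to_pixels(xy, width, height):
    """Map normalized (x, y) in [0, 1] to integer pixel coordinates.

    Assumes a top-left origin and coordinates relative to the full image.
    """
    x, y = float(xy[0]), float(xy[1])
    # Clamp first so slightly out-of-range predictions stay on screen.
    x = min(max(x, 0.0), 1.0)
    y = min(max(y, 0.0), 1.0)
    # Cap at width - 1 / height - 1 so a value of exactly 1.0 is still a valid pixel.
    px = min(int(round(x * width)), width - 1)
    py = min(int(round(y * height)), height - 1)
    return px, py


# Example: the mouse prediction from the Usage section on a 1920x1080 screenshot.
print(to_pixels([0.45, 0.32], 1920, 1080))
```

In practice `width` and `height` would usually come from `image.size` of the screenshot passed to `predict`.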
Keyboard actions:
{
'action_type': 'keyboard',
'action': 'type' | 'press' | 'hotkey',
'value': 'text to type' | '<ENTER>' | '<SUPER+C>'
}
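To act on a keyboard prediction, the `value` field usually needs to be split into individual keys. The sketch below assumes, based only on the examples above, that `press` and `hotkey` values are wrapped in angle brackets with `+` between keys; `parse_keys` and `describe` are hypothetical helpers, not functions from the TESS repo.

```python
def parse_keys(value):
    """Split a bracketed key token like '<SUPER+C>' into lowercase key names."""
    if not (value.startswith("<") and value.endswith(">") and len(value) > 2):
        raise ValueError(f"not a bracketed key token: {value!r}")
    return [key.strip().lower() for key in value[1:-1].split("+")]


def describe(action):
    """Return a short, human-readable summary of a keyboard action dict."""
    kind = action["action"]
    if kind == "type":
        # Typed text is passed through unchanged.
        return f"type text {action['value']!r}"
    if kind in ("press", "hotkey"):
        return f"{kind} " + "+".join(parse_keys(action["value"]))
    raise ValueError(f"unknown keyboard action: {kind!r}")


print(describe({"action_type": "keyboard", "action": "hotkey", "value": "<SUPER+C>"}))
```

The resulting key list could then be handed to whatever automation library drives the keyboard; mapping names such as `super` to that library's key names is left to the caller.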
Architecture
Screenshot + Instruction → SmolVLM2 → Shared MLP → Router
                                                     ↓
                                     ┌───────────────┴───────────────┐
                                     │                               │
                               Mouse Branch                   Keyboard Branch
                            (XY + Click heads)            (VLM text generation)
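The diagram can be read as a small set of heads on top of a pooled feature from the VLM. The PyTorch sketch below illustrates that reading only; it is not the TESS implementation. The class name `ActionHeads`, the layer sizes, the number of click types, and the use of a sigmoid for the coordinates are all assumptions, and the keyboard branch (text generation by the VLM itself) is not modeled.

```python
import torch
import torch.nn as nn


class ActionHeads(nn.Module):
    """Illustrative router and mouse heads on top of a pooled VLM feature."""

    def __init__(self, hidden_dim=960, mlp_dim=512, num_click_types=4):
        super().__init__()
        # Shared MLP feeding the router and both mouse heads.
        self.shared = nn.Sequential(nn.Linear(hidden_dim, mlp_dim), nn.GELU())
        self.router = nn.Linear(mlp_dim, 2)    # logits: [mouse, keyboard]
        self.xy_head = nn.Linear(mlp_dim, 2)   # normalized (x, y)
        self.click_head = nn.Linear(mlp_dim, num_click_types)

    def forward(self, pooled):
        h = self.shared(pooled)
        return {
            "route_logits": self.router(h),
            "xy": torch.sigmoid(self.xy_head(h)),  # squash into (0, 1)
            "click_logits": self.click_head(h),
        }


heads = ActionHeads()
out = heads(torch.randn(3, 960))  # batch of 3 stand-in feature vectors
# Index 0 is treated as the mouse branch here; otherwise the VLM would generate text.
use_mouse = out["route_logits"].argmax(dim=-1) == 0
```

When the router selects the keyboard branch, the mouse outputs would simply be ignored and the keyboard action decoded from the VLM's generated text.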
Training
- Epochs: 3
- Batch Size: 48
- Optimizer: AdamW (LR 2e-4 heads, 5e-4 embeddings)
- Hardware: NVIDIA H100 80GB
- Training Time: ~8 hours
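The two learning rates listed for the optimizer map naturally onto AdamW parameter groups. The snippet below is a minimal sketch of that setup with stand-in modules; the names `embeddings` and `heads` and the layer sizes are placeholders, since the real model's attribute names are not given here.

```python
import torch
import torch.nn as nn

# Stand-in modules; the real model's attribute names are not given in this card.
model = nn.ModuleDict({
    "embeddings": nn.Embedding(16, 8),
    "heads": nn.Linear(8, 2),
})

# One parameter group per learning rate, matching the values listed above.
optimizer = torch.optim.AdamW([
    {"params": model["heads"].parameters(), "lr": 2e-4},
    {"params": model["embeddings"].parameters(), "lr": 5e-4},
])

for group in optimizer.param_groups:
    print(group["lr"], sum(p.numel() for p in group["params"]))
```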
Limitations
- Trained primarily on desktop/web screenshots
- English instructions only
- May struggle with unusual UI layouts not seen in training
License
Apache 2.0
Citation
@misc{tess2025,
title={TESS: A Vision-Language-Action Model for Computer Use},
author={Hussein Lezzaik},
year={2025},
url={https://github.com/husseinlezzaik/TESS}
}