# X-VLA 0.9B (WidowX Edition)
Repository: 2toINF/X-VLA-0.9B-WidowX
Authors: 2toINF | License: Apache 2.0
Paper: Zheng et al., 2025, "X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model" (arXiv:2510.10274)
## Overview
Successful generalist Vision-Language-Action (VLA) models rely on effective training across diverse robotic platforms with large-scale, cross-embodiment, heterogeneous datasets. To exploit the heterogeneity of these rich robotic data sources, X-VLA introduces a Soft Prompt approach with minimal added parameters: prompt-learning ideas are brought into cross-embodiment robot learning by giving each distinct embodiment its own set of learnable embeddings.
These embodiment-specific prompts let the model exploit cross-embodiment features effectively. The architecture, a clean flow-matching-based VLA design built exclusively on soft-prompted standard Transformers, achieves strong scalability while remaining simple.
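The core idea can be sketched in a few lines of PyTorch. This is a minimal illustration under our own naming assumptions (`SoftPromptBank`, `prompt_len`, and the shapes are hypothetical), not the actual X-VLA implementation:

```python
import torch
import torch.nn as nn

class SoftPromptBank(nn.Module):
    """One learnable prompt table per embodiment, prepended to the token sequence."""

    def __init__(self, num_embodiments: int, prompt_len: int, hidden_dim: int):
        super().__init__()
        # (num_embodiments, prompt_len, hidden_dim) learnable embeddings
        self.prompts = nn.Parameter(
            torch.randn(num_embodiments, prompt_len, hidden_dim) * 0.02
        )

    def forward(self, tokens: torch.Tensor, embodiment_id: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T, D) vision-language/action tokens
        # embodiment_id: (B,) integer id selecting each sample's embodiment
        prompt = self.prompts[embodiment_id]        # (B, prompt_len, D)
        return torch.cat([prompt, tokens], dim=1)   # (B, prompt_len + T, D)
```

Because only the prompt table differs between embodiments, adding a new robot platform amounts to adding one small set of embeddings rather than new branches in the backbone.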
Trained on Bridge Data and evaluated across six simulation benchmarks and three real-world robot platforms, the 0.9B-parameter X-VLA simultaneously achieves state-of-the-art performance across these diverse benchmarks, demonstrating flexible dexterity and fast adaptation across embodiments, environments, and tasks.
Project Website: https://thu-air-dream.github.io/X-VLA/
## Usage
### Load the model
```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "2toINF/X-VLA-WidowX",
    trust_remote_code=True,
)
```
### Start the FastAPI server
```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("2toINF/X-VLA-WidowX", trust_remote_code=True)
model.run(processor, host="0.0.0.0", port=8000)
```
### Client-server evaluation
You can run the provided evaluation client from our GitHub repository, 2toINF/X-VLA (client & server code):

```bash
python client_widowx.py --server_ip <SERVER_IP> --server_port 8000 --output_dir logs/
```

Each evaluation produces task-level videos and logs under `logs/`.
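If you want to query the server from your own code instead of the provided client, a request might look like the hypothetical sketch below. The endpoint path and payload fields here are assumptions for illustration only; check `client_widowx.py` in 2toINF/X-VLA for the actual request schema:

```python
import base64
import requests

def query_policy(image_path: str, instruction: str, server_ip: str, port: int = 8000):
    """Send one observation to the policy server and return its JSON response."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    payload = {"image": image_b64, "instruction": instruction}   # assumed field names
    resp = requests.post(
        f"http://{server_ip}:{port}/act",                        # assumed route
        json=payload,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # e.g. a predicted action chunk
```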
## Architecture
| Component | Role |
|---|---|
| Florence-2 Encoder | Vision-language representation backbone (encoder-only). |
| SoftPromptedTransformer | Flow-matching action denoiser using learnable soft prompts per embodiment (see the generic sampling sketch below). |
| Action Hub | Defines action spaces, masking rules, pre/post-processing, and losses. |
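As a rough picture of how a flow-matching action denoiser turns noise into an action chunk at inference time, here is a generic Euler-integration sketch. The function name, the placeholder `velocity_net`, and the step count are assumptions for illustration, not X-VLA's actual API:

```python
import torch

@torch.no_grad()
def sample_actions(velocity_net, context, action_dim: int, horizon: int, steps: int = 10):
    """Generic flow-matching sampler: integrate the learned velocity field from
    noise (t=0) to an action chunk (t=1) with simple Euler steps.
    `velocity_net(x, t, context)` stands in for the soft-prompted Transformer."""
    x = torch.randn(1, horizon, action_dim)   # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt)          # current integration time
        v = velocity_net(x, t, context)       # predicted velocity at time t
        x = x + dt * v                        # Euler update toward the data
    return x                                  # denoised action chunk
```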
## Performance on Simpler-Env (WidowX)
| Task (Simpler-WidowX) | Spoon | Carrot | Blocks | Eggplant | Average |
|---|---|---|---|---|---|
| Visual Matching (WidowX robot), success rate (%) | 100 | 91.7 | 95.8 | 95.8 | 95.8 |

*Evaluated on four WidowX tasks in Simpler-Env.*
## Performance on Real-World (WidowX)

Real-world WidowX results are reported in the paper (arXiv:2510.10274) and on the project website.
## Training Summary
| Setting | Value |
|---|---|
| Training Data | Bridge Data V2 |
| Parameters | ≈ 0.9 B |
| Action Mode | ee6d |
| Precision | BF16 |
| Framework | PyTorch + Transformers |
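For context on the flow-matching objective used to train the action denoiser, the sketch below shows a generic rectified-flow style regression loss. The helper name and the placeholder `velocity_net` are assumptions; X-VLA's actual losses, masking rules, and pre/post-processing are defined in its Action Hub:

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(velocity_net, actions, context):
    """Generic flow-matching objective: interpolate between noise and the
    ground-truth action chunk and regress the constant velocity (a1 - a0)."""
    a1 = actions                                      # (B, horizon, action_dim) ground truth
    a0 = torch.randn_like(a1)                         # noise sample
    t = torch.rand(a1.shape[0], 1, 1)                 # per-sample time in [0, 1]
    x_t = (1 - t) * a0 + t * a1                       # linear interpolation
    target_v = a1 - a0                                # target velocity field
    pred_v = velocity_net(x_t, t.squeeze(), context)  # placeholder for the soft-prompted Transformer
    return F.mse_loss(pred_v, target_v)
```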
## License
Copyright 2025 2toINF (https://github.com/2toINF)

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.
## Citation
```bibtex
@article{zheng2025x,
  title   = {X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model},
  author  = {Zheng, Jinliang and Li, Jianxiong and Wang, Zhihao and Liu, Dongxiu and Kang, Xirui and Feng, Yuchun and Zheng, Yinan and Zou, Jiayin and Chen, Yilun and Zeng, Jia and others},
  journal = {arXiv preprint arXiv:2510.10274},
  year    = {2025}
}
```
## Links
- Paper: arXiv:2510.10274
- Code & client/server: GitHub, 2toINF/X-VLA
- Model Hub: Hugging Face, 2toINF/X-VLA-0.9B-WidowX
- Base model: microsoft/Florence-2-large