# X-VLA 0.9B (WidowX Edition)
Repository: 2toINF/X-VLA-0.9B-WidowX
Authors: 2toINF | License: Apache 2.0
Paper: Zheng et al., 2025, "X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model" (arXiv:2510.10274)
## Overview
Successful generalist Vision-Language-Action (VLA) models rely on effective training across diverse robotic platforms with large-scale, cross-embodiment, heterogeneous datasets. To exploit the heterogeneity of these rich robotic data sources, X-VLA introduces a Soft Prompt approach with minimal added parameters: prompt-learning ideas are brought into cross-embodiment robot learning by giving each distinct embodiment its own set of learnable embeddings.
These embodiment-specific prompts let the model exploit cross-embodiment features effectively. The architecture, a clean flow-matching-based VLA design built exclusively on soft-prompted standard Transformers, achieves strong scalability while remaining simple.
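The core idea can be sketched in a few lines of PyTorch. This is a minimal illustration under our own naming assumptions (`SoftPromptBank`, `prompt_len`, and the shapes are hypothetical), not the actual X-VLA implementation:

```python
import torch
import torch.nn as nn

class SoftPromptBank(nn.Module):
    """One learnable prompt table per embodiment, prepended to the token sequence."""

    def __init__(self, num_embodiments: int, prompt_len: int, hidden_dim: int):
        super().__init__()
        # (num_embodiments, prompt_len, hidden_dim) learnable embeddings
        self.prompts = nn.Parameter(
            torch.randn(num_embodiments, prompt_len, hidden_dim) * 0.02
        )

    def forward(self, tokens: torch.Tensor, embodiment_id: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T, D) vision-language/action tokens
        # embodiment_id: (B,) integer id selecting each sample's embodiment
        prompt = self.prompts[embodiment_id]        # (B, prompt_len, D)
        return torch.cat([prompt, tokens], dim=1)   # (B, prompt_len + T, D)
```

Because only the prompt table differs between embodiments, adding a new robot platform amounts to adding one small set of embeddings rather than new branches in the backbone.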
Trained on Bridge Data and evaluated across six simulation benchmarks and three real-world robot platforms, the 0.9B-parameter X-VLA simultaneously achieves state-of-the-art performance across these diverse benchmarks, demonstrating flexible dexterity and fast adaptation across embodiments, environments, and tasks.
Project Website: https://thu-air-dream.github.io/X-VLA/
## Usage
### Load the model
```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "2toINF/X-VLA-WidowX",
    trust_remote_code=True,
)
```
### Start the FastAPI server
```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("2toINF/X-VLA-WidowX", trust_remote_code=True)
model.run(processor, host="0.0.0.0", port=8000)
```
### Client-server evaluation
You can run the provided evaluation client from our GitHub repository, 2toINF/X-VLA (client & server code):

```bash
python client_widowx.py --server_ip <SERVER_IP> --server_port 8000 --output_dir logs/
```

Each evaluation produces task-level videos and logs under `logs/`.
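If you want to query the server from your own code instead of the provided client, a request might look like the hypothetical sketch below. The endpoint path and payload fields here are assumptions for illustration only; check `client_widowx.py` in 2toINF/X-VLA for the actual request schema:

```python
import base64
import requests

def query_policy(image_path: str, instruction: str, server_ip: str, port: int = 8000):
    """Send one observation to the policy server and return its JSON response."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    payload = {"image": image_b64, "instruction": instruction}   # assumed field names
    resp = requests.post(
        f"http://{server_ip}:{port}/act",                        # assumed route
        json=payload,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # e.g. a predicted action chunk
```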
## Architecture
| Component | Role |
|---|---|
| Florence-2 Encoder | Vision-language representation backbone (encoder-only). |
| SoftPromptedTransformer | Flow-matching action denoiser using learnable soft prompts per embodiment (see the generic sampling sketch below). |
| Action Hub | Defines action spaces, masking rules, pre/post-processing, and losses. |
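As a rough picture of how a flow-matching action denoiser turns noise into an action chunk at inference time, here is a generic Euler-integration sketch. The function name, the placeholder `velocity_net`, and the step count are assumptions for illustration, not X-VLA's actual API:

```python
import torch

@torch.no_grad()
def sample_actions(velocity_net, context, action_dim: int, horizon: int, steps: int = 10):
    """Generic flow-matching sampler: integrate the learned velocity field from
    noise (t=0) to an action chunk (t=1) with simple Euler steps.
    `velocity_net(x, t, context)` stands in for the soft-prompted Transformer."""
    x = torch.randn(1, horizon, action_dim)   # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt)          # current integration time
        v = velocity_net(x, t, context)       # predicted velocity at time t
        x = x + dt * v                        # Euler update toward the data
    return x                                  # denoised action chunk
```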
## Performance on Simpler-Env (WidowX)
| Task (Simpler-WidowX) | Spoon | Carrot | Blocks | Eggplant | Average |
|---|---|---|---|---|---|
| Visual Matching (WidowX robot), success rate (%) | 100 | 91.7 | 95.8 | 95.8 | 95.8 |

*Evaluated on four WidowX tasks in Simpler-Env.*
## Performance on Real-World (WidowX)

Real-world WidowX results are reported in the paper (arXiv:2510.10274) and on the project website.
## Training Summary
| Setting | Value |
|---|---|
| Training Data | Bridge Data V2 |
| Parameters | ≈ 0.9 B |
| Action Mode | ee6d |
| Precision | BF16 |
| Framework | PyTorch + Transformers |
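For context on the flow-matching objective used to train the action denoiser, the sketch below shows a generic rectified-flow style regression loss. The helper name and the placeholder `velocity_net` are assumptions; X-VLA's actual losses, masking rules, and pre/post-processing are defined in its Action Hub:

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(velocity_net, actions, context):
    """Generic flow-matching objective: interpolate between noise and the
    ground-truth action chunk and regress the constant velocity (a1 - a0)."""
    a1 = actions                                      # (B, horizon, action_dim) ground truth
    a0 = torch.randn_like(a1)                         # noise sample
    t = torch.rand(a1.shape[0], 1, 1)                 # per-sample time in [0, 1]
    x_t = (1 - t) * a0 + t * a1                       # linear interpolation
    target_v = a1 - a0                                # target velocity field
    pred_v = velocity_net(x_t, t.squeeze(), context)  # placeholder for the soft-prompted Transformer
    return F.mse_loss(pred_v, target_v)
```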
## License
Copyright 2025 2toINF (https://github.com/2toINF)

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.
## Citation
```bibtex
@article{zheng2025x,
  title   = {X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model},
  author  = {Zheng, Jinliang and Li, Jianxiong and Wang, Zhihao and Liu, Dongxiu and Kang, Xirui and Feng, Yuchun and Zheng, Yinan and Zou, Jiayin and Chen, Yilun and Zeng, Jia and others},
  journal = {arXiv preprint arXiv:2510.10274},
  year    = {2025}
}
```
## Links
- Paper: arXiv:2510.10274
- Code & client/server: GitHub, 2toINF/X-VLA
- Model Hub: Hugging Face, 2toINF/X-VLA-0.9B-WidowX
- Base model: microsoft/Florence-2-large