X-VLA 0.9B (WidowX Edition)

Repository: 2toINF/X-VLA-0.9B-WidowX

Authors: 2toINF | License: Apache 2.0

Paper: Zheng et al., 2025, β€œX-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model” (arXiv:2510.10274)

πŸš€ Overview

Successful generalist Vision-Language-Action (VLA) models rely on effective training across diverse robotic platforms with large-scale, cross-embodiment, heterogeneous datasets. To exploit the heterogeneity of these rich robotic data sources, X-VLA introduces a soft-prompt approach with minimal added parameters: it infuses prompt-learning concepts into cross-embodiment robot learning by adding a separate set of learnable embeddings for each distinct embodiment.

These embodiment-specific prompts empower VLA models to exploit cross-embodiment features effectively. Our architectureβ€”a clean, flow-matching-based VLA design relying exclusively on soft-prompted standard Transformersβ€”achieves superior scalability and simplicity.
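A minimal sketch of the soft-prompt idea, with hypothetical names and dimensions (the actual implementation ships as remote code in this repository):

```python
import torch
import torch.nn as nn

class SoftPromptedBackbone(nn.Module):
    """Toy illustration of per-embodiment soft prompts; not X-VLA's real code."""

    def __init__(self, num_embodiments: int, prompt_len: int, d_model: int):
        super().__init__()
        # One set of learnable prompt tokens per embodiment.
        self.prompts = nn.Parameter(
            0.02 * torch.randn(num_embodiments, prompt_len, d_model)
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens: torch.Tensor, embodiment_id: int) -> torch.Tensor:
        # tokens: (batch, seq, d_model) fused vision-language features.
        prompt = self.prompts[embodiment_id].expand(tokens.shape[0], -1, -1)
        # Prepend the embodiment's prompt so attention can condition on it.
        return self.transformer(torch.cat([prompt, tokens], dim=1))
```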

Trained on Bridge Data and evaluated across six simulations and three real-world robots, the 0.9B-parameter X-VLA simultaneously achieves state-of-the-art performance across diverse benchmarks, demonstrating flexible dexterity and fast adaptation across embodiments, environments, and tasks.

🌐 Project Website: https://thu-air-dream.github.io/X-VLA/

βš™οΈ Usage

πŸ”Ή Load the model

```python
from transformers import AutoModel

# trust_remote_code=True pulls in the custom X-VLA modeling code.
model = AutoModel.from_pretrained(
    "2toINF/X-VLA-WidowX",
    trust_remote_code=True,
)
```

πŸ”Ή Start FastAPI server

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "2toINF/X-VLA-WidowX",
    trust_remote_code=True,
)

# Serves the policy over HTTP via the FastAPI app bundled in the remote code.
model.run(processor, host="0.0.0.0", port=8000)
```
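Once the server is up, FastAPI's auto-generated interactive API docs should be reachable at `http://<SERVER_IP>:8000/docs` (assuming `model.run` keeps FastAPI's defaults).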

πŸ”Ή Client-server evaluation

You can run the provided evaluation client from our GitHub repository: 👉 [2toINF/X-VLA](https://github.com/2toINF/X-VLA) (client & server code).

```bash
python client_widowx.py \
    --server_ip <SERVER_IP> \
    --server_port 8000 \
    --output_dir logs/
```

Each evaluation produces task-level videos and logs under `logs/`.
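If you need a custom client, a minimal sketch along the following lines should work; the route name and payload fields below are placeholders, so check the server code in 2toINF/X-VLA for the actual interface.

```python
import base64
import requests

SERVER = "http://<SERVER_IP>:8000"

# Hypothetical request schema -- adapt the field names and the route
# to whatever the X-VLA server actually exposes.
with open("obs.jpg", "rb") as f:
    payload = {
        "instruction": "put the spoon on the towel",
        "image": base64.b64encode(f.read()).decode("utf-8"),
    }

resp = requests.post(f"{SERVER}/act", json=payload, timeout=30)  # placeholder route
resp.raise_for_status()
print(resp.json())  # e.g. a predicted action chunk
```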

🧩 Architecture

| Component | Role |
|---|---|
| Florence-2 Encoder | Vision-language representation backbone (encoder-only). |
| SoftPromptedTransformer | Flow-matching action denoiser using learnable soft prompts per embodiment. |
| Action Hub | Defines action spaces, masking rules, pre/post-processing, and losses. |
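To make "flow-matching action denoiser" concrete, here is a generic Euler sampler for a learned velocity field. The `velocity_net` callable, the step count, and the 7-dim action (6-DoF end-effector delta plus gripper, matching the `ee6d` action mode below) are illustrative assumptions, not X-VLA's exact inference code.

```python
import torch

@torch.no_grad()
def sample_actions(velocity_net, context, horizon=16, action_dim=7, steps=10):
    """Generic flow-matching sampler: integrate a learned velocity field
    from Gaussian noise (t=0) to an action chunk (t=1)."""
    a = torch.randn(1, horizon, action_dim)  # start from noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt)
        a = a + dt * velocity_net(a, t, context)  # Euler step along the flow
    return a
```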

πŸ§ͺ Performance on Simpler-Env (WidowX)

| Task (Simpler-WidowX) | Spoon | Carrot | Blocks | Eggplant | Average |
|---|---|---|---|---|---|
| Visual Matching (WidowX robot) | 100 | 91.7 | 95.8 | 95.8 | 95.8 |

Success rates (%) on the four WidowX tasks in Simpler-Env.

πŸ§ͺ Performance on Real-World (WidowX)

*(Figure: real-world WidowX evaluation results; see the project website.)*

🧠 Training Summary

| Setting | Value |
|---|---|
| Training data | Bridge Data V2 |
| Parameters | ≈ 0.9 B |
| Action mode | ee6d |
| Precision | BF16 |
| Framework | PyTorch + Transformers |
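For intuition about the training objective, below is a generic (rectified-)flow-matching loss for an action denoiser. It sketches the objective family with a hypothetical `velocity_net` and `(batch, horizon, dim)` action chunks; it is not the repository's training code.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(velocity_net, actions, context):
    """Regress the constant velocity of the straight path noise -> actions."""
    noise = torch.randn_like(actions)            # a_0 ~ N(0, I)
    t = torch.rand(actions.shape[0], device=actions.device)
    t_ = t.view(-1, 1, 1)                        # broadcast over (horizon, dim)
    a_t = (1 - t_) * noise + t_ * actions        # point on the straight path
    target = actions - noise                     # its time derivative
    return F.mse_loss(velocity_net(a_t, t, context), target)
```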

πŸͺͺ License

Copyright 2025 2toINF (https://github.com/2toINF)
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
http://www.apache.org/licenses/LICENSE-2.0

πŸ“š Citation

@article{zheng2025x,
  title   = {X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model},
  author  = {Zheng, Jinliang and Li, Jianxiong and Wang, Zhihao and Liu, Dongxiu and Kang, Xirui
             and Feng, Yuchun and Zheng, Yinan and Zou, Jiayin and Chen, Yilun and Zeng, Jia and others},
  journal = {arXiv preprint arXiv:2510.10274},
  year    = {2025}
}

🌐 Links

- Project website: https://thu-air-dream.github.io/X-VLA/
- Code (client & server): https://github.com/2toINF/X-VLA
- Paper: arXiv:2510.10274 (https://arxiv.org/abs/2510.10274)