---
library_name: transformers
license: apache-2.0
pipeline_tag: robotics
tags:
- robotics
- vla
- humanoid
- policy
- adapter
- gr00t
- pytorch
- accelerate
base_model:
- google/siglip-base-patch16-224
- Qwen/Qwen2.5-0.5B-Instruct
datasets:
- nvidia/PhysicalAI-Robotics-GR00T-Teleop-G1
---

# VLA-Adapter-Lite — GR00T G1 (BridgeAttention Head)

**VLA-Adapter-Lite** is a small, trainable **BridgeAttention** policy head that maps **vision + language + state → action** for the NVIDIA **GR00T Teleop G1** humanoid dataset (43-D state/actions). The **vision** (SigLIP) and **language** (Qwen) towers are **frozen**; only this adapter is trained.

> This repo contains **only the policy head** weights and code. At inference, you load the frozen backbones from their own model hubs.

---

## ✨ What’s inside

- `adapter.pt` / `adapter.safetensors` — PyTorch state dict for the policy head
- `policy_definition.py` — the `BridgeAttentionPolicy` class
- `config.json` — dimensions & training config (IDs for base models, dims, etc.)

**Backbones (frozen at inference & training):**
- Vision: `google/siglip-base-patch16-224`
- Language: `Qwen/Qwen2.5-0.5B-Instruct`

**Target (GR00T G1):**
- State: 43-D
- Action: 43-D
- Includes brief language prompts and videos per episode

---

## 🚀 Quickstart

```python
import json, torch
from transformers import SiglipVisionModel, SiglipImageProcessor, AutoTokenizer, AutoModelForCausalLM

from policy_definition import BridgeAttentionPolicy

# Load config & backbones
cfg = json.load(open("config.json"))
vision_model_id = cfg["vision_model_id"]
text_model_id = cfg["text_model_id"]

image_processor = SiglipImageProcessor.from_pretrained(vision_model_id)
vision = SiglipVisionModel.from_pretrained(vision_model_id, output_hidden_states=True).eval()

tokenizer = AutoTokenizer.from_pretrained(text_model_id, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
text = AutoModelForCausalLM.from_pretrained(text_model_id, output_hidden_states=True).eval()

# Build policy head and load weights
v_hidden = vision.config.hidden_size
t_hidden = text.config.hidden_size
policy = BridgeAttentionPolicy(
    v_hidden=v_hidden,
    t_hidden=t_hidden,
    state_dim=cfg["state_dim"],
    policy_dim=cfg["policy_dim"],
    n_heads=cfg["n_heads"],
    n_layers=cfg["policy_layers"],
    n_queries=cfg["num_action_queries"],
    action_dim=cfg["action_dim"],
    dropout=cfg["dropout"],
).eval()

sd = torch.load("adapter.pt", map_location="cpu")
policy.load_state_dict(sd, strict=True)
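
# The same weights also ship as `adapter.safetensors`. Assuming the standard
# `safetensors` API, an equivalent load would be (sketch, not run here):
#   from safetensors.torch import load_file
#   policy.load_state_dict(load_file("adapter.safetensors"), strict=True)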

# ---- Example forward (single sample) ----
from PIL import Image

instruction = "Pick the apple from the table and place it into the basket."
state = torch.zeros(1, cfg["state_dim"])  # shape [1,43]; replace with real proprio

# Vision: last 4 hidden states (drop CLS token), as a list of tensors
img = Image.new("RGB", (256, 256), color=(200, 230, 255))  # replace with a real frame
v_inputs = image_processor(images=[img], return_tensors="pt")
with torch.no_grad():
    v_out = vision(**v_inputs, output_hidden_states=True)
v_feats_layers = [
    t[:, 1:, :].contiguous() if t.shape[1] >= 2 else t.contiguous()
    for t in v_out.hidden_states[-4:]
]

# Language: last 4 hidden states
t_inputs = tokenizer([instruction], return_tensors="pt", padding=True, truncation=True, max_length=64)
with torch.no_grad():
    t_out = text(**t_inputs, output_hidden_states=True)
t_feats_layers = [t.contiguous() for t in t_out.hidden_states[-4:]]

with torch.no_grad():
    action = policy(v_feats_layers, t_feats_layers, state)  # [1,43]
print("Pred action:", action.shape)
```

---

## Evals

- **Eval split**: 3 episodes × 64 frames from each task folder of `nvidia/PhysicalAI-Robotics-GR00T-Teleop-G1` (total 768 frames)
- **Protocol**: offline action reconstruction. For each frame we feed the ego-view image + instruction + 43-D state into the adapter and compare the predicted 43-D action against the teleop ground truth (MSE / MAE). A minimal sketch of this metric computation is given after the tables below.

## Aggregate Metrics

- Overall MSE: 0.0622
- Overall MAE: 0.118
- Frames evaluated: 768

**Overall per-joint-group error**

| Segment | MSE | MAE |
|---|---:|---:|
| left_leg | 0.0040 | 0.049 |
| right_leg | 0.0055 | 0.047 |
| waist | 0.0002 | 0.013 |
| left_arm | 0.0455 | 0.157 |
| left_hand | 0.1253 | 0.156 |
| right_arm | 0.0878 | 0.184 |
| right_hand | 0.1154 | 0.143 |

## Per-Task Breakdown

| Dataset | Samples | MSE | MAE | Arms MSE | Hands MSE |
|---|---:|---:|---:|---:|---:|
| g1-pick-apple | 192 | 0.0399 | 0.087 | 0.0362 | 0.0850 |
| g1-pick-pear | 192 | 0.0817 | 0.146 | 0.0645 | 0.1808 |
| g1-pick-grapes | 192 | 0.0801 | 0.136 | 0.1249 | 0.1175 |
| g1-pick-starfruit | 192 | 0.0473 | 0.105 | 0.0411 | 0.0981 |

**g1-pick-apple segment error**

| Segment | MSE | MAE |
|---|---:|---:|
| left_leg | 0.0011 | 0.027 |
| right_leg | 0.0016 | 0.028 |
| waist | 0.0002 | 0.012 |
| left_arm | 0.0610 | 0.177 |
| left_hand | 0.1664 | 0.202 |
| right_arm | 0.0113 | 0.083 |
| right_hand | 0.0037 | 0.020 |

**g1-pick-pear segment error**

| Segment | MSE | MAE |
|---|---:|---:|
| left_leg | 0.0069 | 0.071 |
| right_leg | 0.0061 | 0.057 |
| waist | 0.0001 | 0.010 |
| left_arm | 0.0374 | 0.153 |
| left_hand | 0.1331 | 0.165 |
| right_arm | 0.0915 | 0.203 |
| right_hand | 0.2285 | 0.262 |

**g1-pick-grapes segment error**

| Segment | MSE | MAE |
|---|---:|---:|
| left_leg | 0.0030 | 0.045 |
| right_leg | 0.0052 | 0.045 |
| waist | 0.0002 | 0.012 |
| left_arm | 0.0251 | 0.123 |
| left_hand | 0.0058 | 0.022 |
| right_arm | 0.2246 | 0.335 |
| right_hand | 0.2292 | 0.273 |

**g1-pick-starfruit segment error**

| Segment | MSE | MAE |
|---|---:|---:|
| left_leg | 0.0051 | 0.053 |
| right_leg | 0.0092 | 0.058 |
| waist | 0.0004 | 0.019 |
| left_arm | 0.0584 | 0.177 |
| left_hand | 0.1959 | 0.235 |
| right_arm | 0.0238 | 0.114 |
| right_hand | 0.0003 | 0.014 |

---
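For reference, the per-segment numbers above can be reproduced with a few lines of NumPy once predicted and teleop ground-truth actions have been stacked into `[num_frames, 43]` arrays. The sketch below is illustrative only: the segment index ranges are assumptions, not the official GR00T G1 joint ordering, so adjust them to the dataset's actual 43-D layout.

```python
import numpy as np

# Illustrative 43-D layout; these index ranges are assumptions and should be
# replaced with the GR00T G1 dataset's actual joint ordering.
SEGMENTS = {
    "left_leg":   slice(0, 6),
    "right_leg":  slice(6, 12),
    "waist":      slice(12, 15),
    "left_arm":   slice(15, 22),
    "left_hand":  slice(22, 29),
    "right_arm":  slice(29, 36),
    "right_hand": slice(36, 43),
}

def segment_errors(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Overall and per-segment MSE/MAE for [num_frames, 43] predicted vs. teleop actions."""
    err = pred - gt
    metrics = {
        name: {
            "mse": float(np.mean(err[:, idx] ** 2)),
            "mae": float(np.mean(np.abs(err[:, idx]))),
        }
        for name, idx in SEGMENTS.items()
    }
    metrics["overall"] = {
        "mse": float(np.mean(err ** 2)),
        "mae": float(np.mean(np.abs(err))),
    }
    return metrics
```

Here `pred` would be the stacked outputs of the Quickstart forward pass over the eval frames and `gt` the corresponding teleop actions from the dataset; the per-task tables come from running the same function on each task's frames separately.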
More evals coming soon.
## 📚 References

**Core**

- Wang, Y. *et al.* (2025). **VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model.** arXiv:2509.09372. [[paper]](https://arxiv.org/abs/2509.09372) · [[project]](https://vla-adapter.github.io/)
- Kim, M. J., Finn, C., Liang, P. (2025). **Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success (OpenVLA-OFT).** arXiv:2502.19645. [[paper]](https://arxiv.org/abs/2502.19645) · [[site]](https://openvla-oft.github.io/)
- Kim, M. J. *et al.* (2024). **OpenVLA: An Open-Source Vision-Language-Action Model.** arXiv:2406.09246. [[paper]](https://arxiv.org/abs/2406.09246)

**Backbones & Dataset**

- Zhai, X. *et al.* (2023). **Sigmoid Loss for Language-Image Pre-Training (SigLIP).** arXiv:2303.15343. [[paper]](https://arxiv.org/abs/2303.15343)
- Yang, A. *et al.* (2024/2025). **Qwen2.5 Technical Report.** arXiv:2412.15115. [[paper]](https://arxiv.org/abs/2412.15115)
- NVIDIA Physical AI (2025). **PhysicalAI-Robotics-GR00T-Teleop-G1** (humanoid teleop dataset). [[dataset card]](https://huggingface.co/datasets/nvidia/PhysicalAI-Robotics-GR00T-Teleop-G1)

**Related Benchmarks / Corpora**

- Liu, B. *et al.* (2023). **LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning.** arXiv:2306.03310. [[paper]](https://arxiv.org/abs/2306.03310)
- Walke, H. *et al.* (2023). **BridgeData V2: A Dataset for Robot Learning at Scale.** arXiv:2308.12952. [[paper]](https://arxiv.org/abs/2308.12952)

---

### BibTeX

```bibtex
@article{wang2025vlaadapter,
  title={VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model},
  author={Wang, Yihao and Ding, Pengxiang and Li, Lingxiao and Cui, Can and Ge, Zirui and Tong, Xinyang and Song, Wenxuan and Zhao, Han and Zhao, Wei and Hou, Pengxu and Huang, Siteng and Tang, Yifan and Wang, Wenhui and Zhang, Ru and Liu, Jianyi and Wang, Donglin},
  journal={arXiv preprint arXiv:2509.09372},
  year={2025}
}

@article{kim2025oft,
  title={Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success},
  author={Kim, Moo Jin and Finn, Chelsea and Liang, Percy},
  journal={arXiv preprint arXiv:2502.19645},
  year={2025}
}

@article{kim2024openvla,
  title={OpenVLA: An Open-Source Vision-Language-Action Model},
  author={Kim, Moo Jin and others},
  journal={arXiv preprint arXiv:2406.09246},
  year={2024}
}

@article{zhai2023siglip,
  title={Sigmoid Loss for Language-Image Pre-Training},
  author={Zhai, Xiaohua and others},
  journal={arXiv preprint arXiv:2303.15343},
  year={2023}
}

@article{yang2024qwen25,
  title={Qwen2.5 Technical Report},
  author={Yang, An and others},
  journal={arXiv preprint arXiv:2412.15115},
  year={2024}
}

@dataset{nvidia2025gr00t,
  title={PhysicalAI-Robotics-GR00T-Teleop-G1},
  author={NVIDIA Physical AI},
  year={2025},
  howpublished={Hugging Face dataset card},
  url={https://huggingface.co/datasets/nvidia/PhysicalAI-Robotics-GR00T-Teleop-G1}
}

@article{liu2023libero,
  title={LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning},
  author={Liu, Bo and others},
  journal={arXiv preprint arXiv:2306.03310},
  year={2023}
}

@article{walke2023bridgedatav2,
  title={BridgeData V2: A Dataset for Robot Learning at Scale},
  author={Walke, Homer and others},
  journal={arXiv preprint arXiv:2308.12952},
  year={2023}
}
```