---
library_name: transformers
license: apache-2.0
pipeline_tag: robotics
tags:
- robotics
- vla
- humanoid
- policy
- adapter
- gr00t
- pytorch
- accelerate
base_model:
- google/siglip-base-patch16-224
- Qwen/Qwen2.5-0.5B-Instruct
datasets:
- nvidia/PhysicalAI-Robotics-GR00T-Teleop-G1
---

# VLA-Adapter-Lite — GR00T G1 (BridgeAttention Head)

**VLA-Adapter-Lite** is a small, trainable **BridgeAttention** policy head that maps **vision + language + state → action** for the NVIDIA **GR00T Teleop G1** humanoid dataset (43-D state/actions). The **vision** (SigLIP) and **language** (Qwen) towers are **frozen**; only this adapter is trained.

> This repo contains **only the policy head** weights and code. At inference, you load the frozen backbones from their own model hubs.

---

## ✨ What’s inside

- `adapter.pt` / `adapter.safetensors` — PyTorch state dict for the policy head
- `policy_definition.py` — the `BridgeAttentionPolicy` class
- `config.json` — base-model IDs, dimensions, and training config

**Backbones (frozen at inference & training):**
- Vision: `google/siglip-base-patch16-224`
- Language: `Qwen/Qwen2.5-0.5B-Instruct`

**Target (GR00T G1):**
- State: 43-D
- Action: 43-D
- Includes brief language prompts and videos per episode

---

## 🚀 Quickstart

```python
import json
import torch
from transformers import SiglipVisionModel, SiglipImageProcessor, AutoTokenizer, AutoModelForCausalLM
from policy_definition import BridgeAttentionPolicy

# Load config & backbones
with open("config.json") as f:
    cfg = json.load(f)
vision_model_id = cfg["vision_model_id"]
text_model_id = cfg["text_model_id"]

image_processor = SiglipImageProcessor.from_pretrained(vision_model_id)
vision = SiglipVisionModel.from_pretrained(vision_model_id, output_hidden_states=True).eval()

tokenizer = AutoTokenizer.from_pretrained(text_model_id, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
text = AutoModelForCausalLM.from_pretrained(text_model_id, output_hidden_states=True).eval()

# Build policy head and load weights
v_hidden = vision.config.hidden_size
t_hidden = text.config.hidden_size
policy = BridgeAttentionPolicy(
    v_hidden=v_hidden,
    t_hidden=t_hidden,
    state_dim=cfg["state_dim"],
    policy_dim=cfg["policy_dim"],
    n_heads=cfg["n_heads"],
    n_layers=cfg["policy_layers"],
    n_queries=cfg["num_action_queries"],
    action_dim=cfg["action_dim"],
    dropout=cfg["dropout"],
).eval()

sd = torch.load("adapter.pt", map_location="cpu")
policy.load_state_dict(sd, strict=True)
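
# Alternatively, the same weights are provided as `adapter.safetensors`.
# A minimal sketch, assuming the `safetensors` package is installed:
#   from safetensors.torch import load_file
#   policy.load_state_dict(load_file("adapter.safetensors"), strict=True)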

# ---- Example forward (single sample) ----
from PIL import Image

instruction = "Pick the apple from the table and place it into the basket."
state = torch.zeros(1, cfg["state_dim"])  # shape [1, 43]; replace with real proprio

# Vision: last 4 hidden states (drop CLS token), as a list of tensors
img = Image.new("RGB", (256, 256), color=(200, 230, 255))  # replace with a real frame
v_inputs = image_processor(images=[img], return_tensors="pt")
with torch.no_grad():
    v_out = vision(**v_inputs, output_hidden_states=True)
v_feats_layers = [
    t[:, 1:, :].contiguous() if t.shape[1] >= 2 else t.contiguous()
    for t in v_out.hidden_states[-4:]
]

# Language: last 4 hidden states
t_inputs = tokenizer([instruction], return_tensors="pt", padding=True, truncation=True, max_length=64)
with torch.no_grad():
    t_out = text(**t_inputs, output_hidden_states=True)
t_feats_layers = [t.contiguous() for t in t_out.hidden_states[-4:]]

with torch.no_grad():
    action = policy(v_feats_layers, t_feats_layers, state)  # [1, 43]
print("Pred action:", action.shape)
```

---

## Evals

- **Eval split**: 3 episodes × 64 frames from each task folder of `nvidia/PhysicalAI-Robotics-GR00T-Teleop-G1` (total 768 frames)
- **Protocol**: offline action reconstruction. For each frame we feed the ego-view image + instruction + 43-D state into the adapter and compare the predicted 43-D action against teleop ground truth (MSE / MAE).

## Aggregate Metrics

- Overall MSE: 0.0622
- Overall MAE: 0.118
- Frames evaluated: 768

**Overall per-joint-group error**

| Segment | MSE | MAE |
|---|---:|---:|
| left_leg | 0.0040 | 0.049 |
| right_leg | 0.0055 | 0.047 |
| waist | 0.0002 | 0.013 |
| left_arm | 0.0455 | 0.157 |
| left_hand | 0.1253 | 0.156 |
| right_arm | 0.0878 | 0.184 |
| right_hand | 0.1154 | 0.143 |

## Per-Task Breakdown

| Dataset | Samples | MSE | MAE | Arms MSE | Hands MSE |
|---|---:|---:|---:|---:|---:|
| g1-pick-apple | 192 | 0.0399 | 0.087 | 0.0362 | 0.0850 |
| g1-pick-pear | 192 | 0.0817 | 0.146 | 0.0645 | 0.1808 |
| g1-pick-grapes | 192 | 0.0801 | 0.136 | 0.1249 | 0.1175 |
| g1-pick-starfruit | 192 | 0.0473 | 0.105 | 0.0411 | 0.0981 |

**g1-pick-apple segment error**

| Segment | MSE | MAE |
|---|---:|---:|
| left_leg | 0.0011 | 0.027 |
| right_leg | 0.0016 | 0.028 |
| waist | 0.0002 | 0.012 |
| left_arm | 0.0610 | 0.177 |
| left_hand | 0.1664 | 0.202 |
| right_arm | 0.0113 | 0.083 |
| right_hand | 0.0037 | 0.020 |

**g1-pick-pear segment error**

| Segment | MSE | MAE |
|---|---:|---:|
| left_leg | 0.0069 | 0.071 |
| right_leg | 0.0061 | 0.057 |
| waist | 0.0001 | 0.010 |
| left_arm | 0.0374 | 0.153 |
| left_hand | 0.1331 | 0.165 |
| right_arm | 0.0915 | 0.203 |
| right_hand | 0.2285 | 0.262 |

**g1-pick-grapes segment error**

| Segment | MSE | MAE |
|---|---:|---:|
| left_leg | 0.0030 | 0.045 |
| right_leg | 0.0052 | 0.045 |
| waist | 0.0002 | 0.012 |
| left_arm | 0.0251 | 0.123 |
| left_hand | 0.0058 | 0.022 |
| right_arm | 0.2246 | 0.335 |
| right_hand | 0.2292 | 0.273 |

**g1-pick-starfruit segment error**

| Segment | MSE | MAE |
|---|---:|---:|
| left_leg | 0.0051 | 0.053 |
| right_leg | 0.0092 | 0.058 |
| waist | 0.0004 | 0.019 |
| left_arm | 0.0584 | 0.177 |
| left_hand | 0.1959 | 0.235 |
| right_arm | 0.0238 | 0.114 |
| right_hand | 0.0003 | 0.014 |

---
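
For reference, the sketch below shows how the per-frame MSE / MAE and per-segment numbers above could be computed. It is a minimal illustration, not the exact evaluation script: `pred` and `gt` are assumed to be `[N, 43]` tensors of predicted and teleop ground-truth actions collected over the eval frames, and the index ranges in `SEGMENTS` are illustrative placeholders; check the dataset's documentation for the actual 43-D joint layout.

```python
import torch

def action_metrics(pred: torch.Tensor, gt: torch.Tensor) -> dict:
    """Overall MSE / MAE between predicted and ground-truth action tensors."""
    err = pred - gt
    return {"mse": err.pow(2).mean().item(), "mae": err.abs().mean().item()}

# Hypothetical (start, end) slices into the 43-D action vector; placeholders only.
SEGMENTS = {
    "left_leg": (0, 6), "right_leg": (6, 12), "waist": (12, 15),
    "left_arm": (15, 22), "left_hand": (22, 29),
    "right_arm": (29, 36), "right_hand": (36, 43),
}

def per_segment_metrics(pred: torch.Tensor, gt: torch.Tensor) -> dict:
    """Same metrics, restricted to each joint group."""
    return {
        name: action_metrics(pred[:, lo:hi], gt[:, lo:hi])
        for name, (lo, hi) in SEGMENTS.items()
    }
```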