---
library_name: transformers
license: apache-2.0
pipeline_tag: robotics
tags:
- robotics
- vla
- humanoid
- policy
- adapter
- gr00t
- pytorch
- accelerate
base_model:
- google/siglip-base-patch16-224
- Qwen/Qwen2.5-0.5B-Instruct
datasets:
- nvidia/PhysicalAI-Robotics-GR00T-Teleop-G1
---

# VLA-Adapter-Lite — GR00T G1 (BridgeAttention Head)

**VLA-Adapter-Lite** is a small, trainable **BridgeAttention** policy head that maps **vision + language + state → action** for the NVIDIA **GR00T Teleop G1** humanoid dataset (43-D state/actions). The **vision** (SigLIP) and **language** (Qwen) towers are **frozen**; only this adapter is trained.

> This repo contains **only the policy head** weights and code. At inference, you load the frozen backbones from their own model hubs.

---

## ✨ What’s inside

- `adapter.pt` / `adapter.safetensors` — PyTorch state dict for the policy head
- `policy_definition.py` — the `BridgeAttentionPolicy` class
- `config.json` — base-model IDs, dimensions, and training config

**Backbones (frozen at inference & training):**
- Vision: `google/siglip-base-patch16-224`
- Language: `Qwen/Qwen2.5-0.5B-Instruct`

**Target (GR00T G1):**
- State: 43-D
- Action: 43-D
- Includes brief language prompts and videos per episode

---

## 🚀 Quickstart

```python
import json
import torch
from transformers import SiglipVisionModel, SiglipImageProcessor, AutoTokenizer, AutoModelForCausalLM
from policy_definition import BridgeAttentionPolicy

# Load config & backbones
with open("config.json") as f:
    cfg = json.load(f)
vision_model_id = cfg["vision_model_id"]
text_model_id = cfg["text_model_id"]

image_processor = SiglipImageProcessor.from_pretrained(vision_model_id)
vision = SiglipVisionModel.from_pretrained(vision_model_id, output_hidden_states=True).eval()

tokenizer = AutoTokenizer.from_pretrained(text_model_id, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
text = AutoModelForCausalLM.from_pretrained(text_model_id, output_hidden_states=True).eval()

# Build policy head and load weights
v_hidden = vision.config.hidden_size
t_hidden = text.config.hidden_size
policy = BridgeAttentionPolicy(
    v_hidden=v_hidden,
    t_hidden=t_hidden,
    state_dim=cfg["state_dim"],
    policy_dim=cfg["policy_dim"],
    n_heads=cfg["n_heads"],
    n_layers=cfg["policy_layers"],
    n_queries=cfg["num_action_queries"],
    action_dim=cfg["action_dim"],
    dropout=cfg["dropout"],
).eval()

sd = torch.load("adapter.pt", map_location="cpu")
policy.load_state_dict(sd, strict=True)
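
# Alternatively, the same weights are provided as `adapter.safetensors`.
# A minimal sketch, assuming the `safetensors` package is installed:
#   from safetensors.torch import load_file
#   policy.load_state_dict(load_file("adapter.safetensors"), strict=True)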

# ---- Example forward (single sample) ----
from PIL import Image

instruction = "Pick the apple from the table and place it into the basket."
state = torch.zeros(1, cfg["state_dim"])  # shape [1, 43]; replace with real proprio

# Vision: last 4 hidden states (drop CLS token), as a list of tensors
img = Image.new("RGB", (256, 256), color=(200, 230, 255))  # replace with a real frame
v_inputs = image_processor(images=[img], return_tensors="pt")
with torch.no_grad():
    v_out = vision(**v_inputs, output_hidden_states=True)
v_feats_layers = [
    t[:, 1:, :].contiguous() if t.shape[1] >= 2 else t.contiguous()
    for t in v_out.hidden_states[-4:]
]

# Language: last 4 hidden states
t_inputs = tokenizer([instruction], return_tensors="pt", padding=True, truncation=True, max_length=64)
with torch.no_grad():
    t_out = text(**t_inputs, output_hidden_states=True)
t_feats_layers = [t.contiguous() for t in t_out.hidden_states[-4:]]

with torch.no_grad():
    action = policy(v_feats_layers, t_feats_layers, state)  # [1, 43]
print("Pred action:", action.shape)
```

---

## Evals

- **Eval split**: 3 episodes × 64 frames from each task folder of `nvidia/PhysicalAI-Robotics-GR00T-Teleop-G1` (total 768 frames)
- **Protocol**: offline action reconstruction. For each frame we feed the ego-view image + instruction + 43-D state into the adapter and compare the predicted 43-D action against teleop ground truth (MSE / MAE).

## Aggregate Metrics

- Overall MSE: 0.0622
- Overall MAE: 0.118
- Frames evaluated: 768

**Overall per-joint-group error**

| Segment | MSE | MAE |
|---|---:|---:|
| left_leg | 0.0040 | 0.049 |
| right_leg | 0.0055 | 0.047 |
| waist | 0.0002 | 0.013 |
| left_arm | 0.0455 | 0.157 |
| left_hand | 0.1253 | 0.156 |
| right_arm | 0.0878 | 0.184 |
| right_hand | 0.1154 | 0.143 |

## Per-Task Breakdown

| Dataset | Samples | MSE | MAE | Arms MSE | Hands MSE |
|---|---:|---:|---:|---:|---:|
| g1-pick-apple | 192 | 0.0399 | 0.087 | 0.0362 | 0.0850 |
| g1-pick-pear | 192 | 0.0817 | 0.146 | 0.0645 | 0.1808 |
| g1-pick-grapes | 192 | 0.0801 | 0.136 | 0.1249 | 0.1175 |
| g1-pick-starfruit | 192 | 0.0473 | 0.105 | 0.0411 | 0.0981 |

**g1-pick-apple segment error**

| Segment | MSE | MAE |
|---|---:|---:|
| left_leg | 0.0011 | 0.027 |
| right_leg | 0.0016 | 0.028 |
| waist | 0.0002 | 0.012 |
| left_arm | 0.0610 | 0.177 |
| left_hand | 0.1664 | 0.202 |
| right_arm | 0.0113 | 0.083 |
| right_hand | 0.0037 | 0.020 |

**g1-pick-pear segment error**

| Segment | MSE | MAE |
|---|---:|---:|
| left_leg | 0.0069 | 0.071 |
| right_leg | 0.0061 | 0.057 |
| waist | 0.0001 | 0.010 |
| left_arm | 0.0374 | 0.153 |
| left_hand | 0.1331 | 0.165 |
| right_arm | 0.0915 | 0.203 |
| right_hand | 0.2285 | 0.262 |

**g1-pick-grapes segment error**

| Segment | MSE | MAE |
|---|---:|---:|
| left_leg | 0.0030 | 0.045 |
| right_leg | 0.0052 | 0.045 |
| waist | 0.0002 | 0.012 |
| left_arm | 0.0251 | 0.123 |
| left_hand | 0.0058 | 0.022 |
| right_arm | 0.2246 | 0.335 |
| right_hand | 0.2292 | 0.273 |

**g1-pick-starfruit segment error**

| Segment | MSE | MAE |
|---|---:|---:|
| left_leg | 0.0051 | 0.053 |
| right_leg | 0.0092 | 0.058 |
| waist | 0.0004 | 0.019 |
| left_arm | 0.0584 | 0.177 |
| left_hand | 0.1959 | 0.235 |
| right_arm | 0.0238 | 0.114 |
| right_hand | 0.0003 | 0.014 |

---
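
For reference, the sketch below shows how the per-frame MSE / MAE and per-segment numbers above could be computed. It is a minimal illustration, not the exact evaluation script: `pred` and `gt` are assumed to be `[N, 43]` tensors of predicted and teleop ground-truth actions collected over the eval frames, and the index ranges in `SEGMENTS` are illustrative placeholders; check the dataset's documentation for the actual 43-D joint layout.

```python
import torch

def action_metrics(pred: torch.Tensor, gt: torch.Tensor) -> dict:
    """Overall MSE / MAE between predicted and ground-truth action tensors."""
    err = pred - gt
    return {"mse": err.pow(2).mean().item(), "mae": err.abs().mean().item()}

# Hypothetical (start, end) slices into the 43-D action vector; placeholders only.
SEGMENTS = {
    "left_leg": (0, 6), "right_leg": (6, 12), "waist": (12, 15),
    "left_arm": (15, 22), "left_hand": (22, 29),
    "right_arm": (29, 36), "right_hand": (36, 43),
}

def per_segment_metrics(pred: torch.Tensor, gt: torch.Tensor) -> dict:
    """Same metrics, restricted to each joint group."""
    return {
        name: action_metrics(pred[:, lo:hi], gt[:, lo:hi])
        for name, (lo, hi) in SEGMENTS.items()
    }
```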