Cosmos-Policy-ALOHA-Predict2-2B
Cosmos Policy | Code | White Paper | Website
Model Overview
Description:
Cosmos-Policy-ALOHA-Predict2-2B is a 2B-parameter bimanual robot manipulation policy model fine-tuned from the NVIDIA Cosmos-Predict2-2B-Video2World video foundation model. This model achieves a 93.6% average completion rate across four challenging real-world bimanual manipulation tasks on the ALOHA 2 robot platform.
Key features:
- Single-stage fine-tuning: Adapted from pretrained video model with no architectural modifications
- Multimodal outputs: Jointly predicts actions, future states, and values through unified video diffusion
- Real-world performance: 93.6% average score on challenging bimanual manipulation tasks
Use cases:
- Bimanual robotic manipulation and control in real-world environments
- Imitation learning from human teleoperation demonstrations
- Vision-based robot learning with multiple camera viewpoints
- Contact-rich and high-precision manipulation tasks
- Long-horizon task planning and execution
This model is for research and development only.
Model Developer: NVIDIA
Model Versions
Cosmos Policy models include the following:
- Cosmos-Policy-LIBERO-Predict2-2B: Given current state observations and a task description, generate action sequences, future state predictions, and value estimates for robot manipulation in simulated LIBERO environments.
- Cosmos-Policy-RoboCasa-Predict2-2B: Given current state observations and a task description, generate action sequences, future state predictions, and value estimates for robot manipulation in simulated RoboCasa environments.
- Cosmos-Policy-ALOHA-Predict2-2B: Given current state observations and a task description, generate action sequences, future state predictions, and value estimates for robot manipulation in real-world ALOHA robot environments.
- Cosmos-Policy-ALOHA-Planning-Model-Predict2-2B: Given current state observations, a task description, and action sequences, generate future state predictions and value estimates for robot manipulation in real-world ALOHA robot environments. (This checkpoint is meant to be deployed alongside Cosmos-Policy-ALOHA-Predict2-2B, not independently.)
License:
This model is released under the NVIDIA One-Way Noncommercial License (NSCLv1). For a custom license, please contact cosmos-license@nvidia.com.
Under the NVIDIA One-Way Noncommercial License (NSCLv1), NVIDIA confirms:
- Models are not for commercial use.
- NVIDIA does not claim ownership to any outputs generated using the Models or Derivative Models.
Deployment Geography:
Global
Use Case:
Physical AI: Bimanual robot manipulation and control in real-world environments, encompassing contact-rich manipulation and imitation learning from human demonstrations.
Release Date:
GitHub [01/22/2026] via https://github.com/nvlabs/cosmos-policy
Hugging Face [01/22/2026] via https://huggingface.co/collections/nvidia/cosmos-policy
Model Architecture:
Architecture Type: A diffusion transformer with latent video diffusion, fine-tuned from Cosmos-Predict2-2B-Video2World.
Network Architecture: Identical to the base Cosmos-Predict2-2B-Video2World diffusion transformer; no architectural modifications were made during fine-tuning.
Key adaptation: Actions, proprioceptive states, and values are encoded as latent frames and injected directly into the video model's latent diffusion sequence, enabling the model to generate these modalities alongside predicted future images.
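To make this concrete, the sketch below illustrates the idea of encoding non-visual modalities as latent frames. This is not the repository's implementation; the latent shape, the linear encoders, and the frame counts are illustrative assumptions.

```python
import torch

# Illustrative sketch of the latent-frame injection described above. The
# latent shape, encoders, and frame counts are assumptions for exposition.
C, H, W = 16, 28, 28                  # assumed latent frame shape
action_chunk = torch.randn(50, 14)    # 50 timesteps x 14-dim actions
proprio = torch.randn(14)             # current 14-dim joint state
value = torch.randn(1)                # scalar value estimate

# Hypothetical per-modality encoders mapping each modality to one latent frame.
act_enc = torch.nn.Linear(50 * 14, C * H * W)
prop_enc = torch.nn.Linear(14, C * H * W)
val_enc = torch.nn.Linear(1, C * H * W)

act_frame = act_enc(action_chunk.flatten()).view(1, C, H, W)
prop_frame = prop_enc(proprio).view(1, C, H, W)
val_frame = val_enc(value).view(1, C, H, W)

# Video latents for the current and predicted future observations (2 frames here).
video_latents = torch.randn(2, C, H, W)

# Appending the extra modalities as additional "frames" lets a single diffusion
# pass denoise actions, future states, and values jointly with the video.
latent_sequence = torch.cat([video_latents, act_frame, prop_frame, val_frame], dim=0)
print(latent_sequence.shape)  # torch.Size([5, 16, 28, 28])
```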
Number of model parameters:
2B (inherited from base model)
Input
Input Type(s): Text + Multi-view Images + Proprioceptive State
Input Format(s):
- Text: String (natural language task description)
- Images: RGB images from multiple camera views
- Proprioception: Numerical array
Input Parameters:
- Text: One-dimensional (1D) - Task description (e.g., "put candy in ziploc bag")
- Images: Two-dimensional (2D) - Top-down third-person camera: 224×224 RGB; Left wrist-mounted camera: 224×224 RGB; Right wrist-mounted camera: 224×224 RGB
- Proprioception: One-dimensional (1D) - 14-dimensional state (7 values per arm: 6 joint angles + 1 gripper position)
Other Properties Related to Input:
- Requires specific camera configuration (top-down + two wrist views)
- Images resized to 224×224 pixels from original resolution
- Trained exclusively for ALOHA 2 robot platform with two ViperX 300 S robot arms
- Control frequency: 25 Hz (reduced from original 50 Hz)
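For illustration, the input specification above might be assembled as follows. The dictionary keys are hypothetical; the actual field names are defined by the Cosmos Policy codebase.

```python
import numpy as np

# Hypothetical observation payload matching the input spec above. The key
# names are illustrative, not the codebase's actual field names.
observation = {
    "task_description": "put candy in ziploc bag",
    "image_top": np.zeros((224, 224, 3), dtype=np.uint8),          # top-down view
    "image_left_wrist": np.zeros((224, 224, 3), dtype=np.uint8),   # left wrist view
    "image_right_wrist": np.zeros((224, 224, 3), dtype=np.uint8),  # right wrist view
    "proprio": np.zeros(14, dtype=np.float32),  # 6 joints + 1 gripper per arm
}
```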
Output
Output Type(s): Action Sequence + Future State Predictions + Value Estimate
Output Format:
- Actions: Numerical array
- Future states: Images + Proprioception
- Value: Scalar
Output Parameters:
- Action chunk: 50-timestep sequence of 14-dimensional actions (7 per arm: joint positions for 6 joints + 1 gripper)
- Future robot proprioception: 14-dimensional state at timestep t+50
- Future state images: Top-down third-person camera prediction (224×224 RGB), left wrist camera prediction (224×224 RGB), and right wrist camera prediction (224×224 RGB) at timestep t+50
- Future state value: Expected cumulative reward from future state (scalar)
Other Properties Related to Output:
- Action chunk size: 50 timesteps (spanning 2 seconds given 25 Hz control frequency)
- Execution horizon: 50 timesteps (executing the full chunk is recommended, though the horizon can be varied)
- Denoising steps: 10 (configurable without retraining)
- Noise level range: σ_min = 4.0, σ_max = 80.0
- Generation mode: Either parallel (action, future state, and value generated simultaneously) or autoregressive (using this checkpoint as the policy and the separate planning model checkpoint as the world model and value function)
Note on future predictions: The future state images and value predictions generated by this base policy checkpoint are primarily for visualization and interpretability purposes. For model-based planning with these predictions, please additionally use the separate Cosmos-Policy-ALOHA-Planning-Model-Predict2-2B checkpoint as the world model and value function.
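The denoising step count and noise range above can be pictured with a Karras/EDM-style sampler sketch. Only the σ range and the 10-step count come from the specification above; rho=7.0 and the first-order Euler update are illustrative assumptions, and the exact schedule and sampler used by Cosmos Policy may differ.

```python
import torch

# Karras/EDM-style schedule sketch using the noise range stated above.
def karras_sigmas(n_steps=10, sigma_min=4.0, sigma_max=80.0, rho=7.0):
    ramp = torch.linspace(0, 1, n_steps)
    inv = 1.0 / rho
    sigmas = (sigma_max**inv + ramp * (sigma_min**inv - sigma_max**inv)) ** rho
    return torch.cat([sigmas, torch.zeros(1)])  # terminal sigma = 0

def sample(denoiser, latent_shape):
    """`denoiser(x, sigma)` is any callable returning the model's clean-sample
    prediction; here it stands in for the Cosmos Policy network."""
    sigmas = karras_sigmas()
    x = torch.randn(latent_shape) * sigmas[0]
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        denoised = denoiser(x, sigma)
        d = (x - denoised) / sigma          # Euler direction toward the data
        x = x + d * (sigma_next - sigma)
    return x
```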
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Software Integration
Runtime Engine(s):
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Hopper (e.g., H100)
Note: We have only tested inference with BF16 precision.
Operating System(s):
- Linux
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
Hardware Compatibility Warning: This model was trained on a specific ALOHA 2 robot setup with particular hardware characteristics. Differences between our robot setup and downstream users' hardware setups (including calibration, joint limits, camera positioning, gripper mechanics, etc.) may significantly impact performance. Users must exercise caution during deployment.
Control Frequency: This policy must be used with a 25 Hz controller for satisfactory performance (not the original 50 Hz ALOHA control frequency). The reduced frequency was used during data collection and training.
Real-World Deployment: This model operates real robotic hardware. Always ensure that proper safety measures are in place. On the first deployment of this checkpoint, we highly recommend measuring the difference in the current robot state and the next commanded robot state (e.g., difference between current joint angles and predicted actions, which represent target joint angles) and aborting policy execution if the difference is large.
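This recommendation can be implemented as a simple guard, sketched below. The 0.3 rad threshold is an illustrative assumption; tune it conservatively for your hardware.

```python
import numpy as np

# Minimal guard implementing the recommendation above. The threshold is an
# illustrative assumption, not a value from the Cosmos Policy codebase.
MAX_JOINT_DELTA_RAD = 0.3

def safe_to_execute(current_joints: np.ndarray, target_joints: np.ndarray) -> bool:
    """Return False (abort) if any commanded joint jump exceeds the threshold."""
    return bool(np.all(np.abs(target_joints - current_joints) < MAX_JOINT_DELTA_RAD))

# Example: vet the first predicted action of a chunk before executing it.
current = np.zeros(14)
commanded = np.full(14, 0.05)
assert safe_to_execute(current, commanded)
```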
Usage
See Cosmos Policy GitHub for details.
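The GitHub repository is the authoritative reference for the API. As a hedged sketch of the control loop implied by the specifications above, with `policy` and `robot` standing in for whatever interfaces the repository provides (every method name below is an assumption, not the real API):

```python
import time

# Hedged control-loop sketch; method names are assumptions, not the real API.
def run_episode(policy, robot, task: str, control_hz: float = 25.0,
                execution_horizon: int = 50, max_steps: int = 1000):
    steps = 0
    while steps < max_steps:
        obs = robot.get_observation()         # 3 RGB views + 14-dim proprio
        result = policy.predict(obs, task)    # actions, future state, value
        for action in result.actions[:execution_horizon]:
            robot.send_joint_targets(action)  # 14-dim target joint angles
            time.sleep(1.0 / control_hz)      # hold the 25 Hz control rate
            steps += 1
```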
Training and Evaluation:
Training Datasets:
Data Collection Method:
- ALOHA-Cosmos-Policy: Human - Human-teleoperated demonstrations recorded in real-world environment
Labeling Method:
- ALOHA-Cosmos-Policy: Human - Success/failure labels and completion scores manually determined; task descriptions provided
Properties:
Training Data: ALOHA-Cosmos-Policy dataset
- 4 bimanual manipulation tasks
- 185 total real-world human teleoperation demonstrations
- put X on plate: 80 demos
- fold shirt: 15 demos
- put candies in bowl: 45 demos
- put candy in ziploc bag: 45 demos
- Successful demonstrations used for policy training
- All demonstrations (including failures) used for world model and value function training
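A minimal sketch of this split, assuming hypothetical demonstration records with a success flag (real records also carry observations, actions, and completion scores):

```python
# Sketch of the dataset split described above, with hypothetical records.
demos = [
    {"task": "fold shirt", "success": True},
    {"task": "fold shirt", "success": False},
]

policy_demos = [d for d in demos if d["success"]]  # policy: successes only
planning_demos = demos  # world model + value function: all demos, incl. failures
```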
Training Configuration:
- Base model: NVIDIA Cosmos-Predict2-2B-Video2World (model-480p-16fps.pt)
- Training steps: 50,000 gradient steps
- Batch size: 200 (global)
- GPUs: 8 H100 GPUs
- Training time: ~48 hours
- Optimization: Full model fine-tuning (all weights updated)
- Action chunk size: 50 timesteps
- Image resolution: 224×224 pixels
Training Objective: The model is trained with a hybrid log-normal-uniform noise distribution (modified from the base model's log-normal distribution; see paper for details) to improve action prediction accuracy. Training batches are split 50/25/25 for policy, world model, and value function objectives, respectively.
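A sketch of this training-time sampling is shown below. The mixture weight, the log-normal parameters, and the reuse of the inference σ range are placeholders; the paper specifies the actual values.

```python
import torch

# Sketch of hybrid log-normal-uniform noise sampling and the objective split.
# Mixture weight and log-normal parameters are placeholders (see the paper).
def sample_sigma(batch_size, p_uniform=0.5, loc=0.0, scale=1.0,
                 sigma_min=4.0, sigma_max=80.0):
    lognormal = torch.exp(loc + scale * torch.randn(batch_size))
    uniform = sigma_min + (sigma_max - sigma_min) * torch.rand(batch_size)
    return torch.where(torch.rand(batch_size) < p_uniform, uniform, lognormal)

def assign_objectives(batch_size):
    # 0 = policy (50%), 1 = world model (25%), 2 = value function (25%)
    r = torch.rand(batch_size)
    return (r >= 0.5).long() + (r >= 0.75).long()
```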
Evaluation Datasets:
Data Collection Method: Not Applicable
Labeling Method: Not Applicable
Properties: Not Applicable - We use the real-world ALOHA 2 robot platform for direct evaluations.
Inference:
Test Hardware: H100, A100
See Cosmos Policy GitHub for details.
System Requirements and Performance
Inference with base Cosmos Policy only (i.e., no model-based planning):
- 1 GPU with 6.8 GB VRAM for LIBERO sim benchmark tasks
- 1 GPU with 8.9 GB VRAM for RoboCasa sim benchmark tasks
- 1 GPU with 6.0 GB VRAM for ALOHA robot tasks
Quality Benchmarks
ALOHA Real-World Benchmark Results
| Task | Score |
|---|---|
| put X on plate | 100.0 |
| fold shirt | 99.5 |
| put candies in bowl | 89.6 |
| put candy in ziploc bag | 85.4 |
| Average | 93.6 |
Scores represent average percent completion across 101 trials total (including both in-distribution and out-of-distribution test conditions).
Comparison with baselines:
- Diffusion Policy: 33.6
- OpenVLA-OFT+: 62.0
- π0: 77.9
- π0.5: 88.6
- Cosmos Policy (ours): 93.6
Task Characteristics
- put X on plate: Language-conditioned object placement (tests language following)
- fold shirt: Multi-step contact-rich manipulation (tests long-horizon planning)
- put candies in bowl: Handling scattered objects (tests multimodal grasp sequences)
- put candy in ziploc bag: High-precision millimeter-tolerance manipulation
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment.
Please report security vulnerabilities or NVIDIA AI Concerns here.
Related Resources
- Base Model: Cosmos-Predict2-2B-Video2World
- Training Dataset: ALOHA-Cosmos-Policy
- Planning Model Checkpoint: Cosmos-Policy-ALOHA-Planning-Model-Predict2-2B
- Paper: https://arxiv.org/abs/2601.16163
- Original ALOHA: Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
Citation
If you use this model, please cite the Cosmos Policy paper:
(Cosmos Policy BibTeX citation coming soon!)