---
tags:
- robotics
---

<div align="center">
<a href="https://github.com/NVIDIA/Isaac-GR00T">
<img src="https://cdn-uploads.huggingface.co/production/uploads/67b8da81d01134f89899b4a7/8bFQa2ZIGCsOQQ2ho2N_U.png">
</a>
<div align="center">
<a href="https://github.com/NVIDIA/Isaac-GR00T">
<img src="https://img.shields.io/badge/GitHub-grey?logo=GitHub" alt="GitHub Badge">
</a>
<a href="https://developer.nvidia.com/isaac/gr00t">
<img src="https://img.shields.io/badge/Website-green" alt="Website Badge">
</a>
<!-- <a href="">
<img src="https://img.shields.io/badge/Project%20Page-blue?style=plastic" alt="Project Page Badge">
</a>
<a href="">
<img src="https://img.shields.io/badge/Research_Blog-black?style=flat" alt="Research Blog Badge">
</a>
<a href="">
<img src="https://img.shields.io/badge/Dataset-Overview-brightgreen?logo=googleforms" alt="Research Blog Badge">
</a>
-->
</div>
</div>

# Model Overview

<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/67b8da81d01134f89899b4a7/ZCLLXZk2LQBG0YH_BmiIN.gif"
style="width:100%; max-width:1000px; height:auto;">
</p>

## Description:
NVIDIA Isaac GR00T N1.7 is an open foundation model for generalized humanoid robot reasoning and skills. This cross-embodiment model takes multimodal input, including language and images, to perform manipulation tasks in diverse environments. Developers and researchers can post-train GR00T N1.7 with real or synthetic data for their specific humanoid robot or task.

Isaac GR00T N1.7 is the medium-sized version of our model. It is built on pre-trained vision and language encoders and uses a flow matching action transformer to model a chunk of actions conditioned on vision, language, and proprioception.

A detailed description of the Isaac GR00T N1.X architecture is provided in the GR00T N1 White Paper (https://arxiv.org/abs/2503.14734).

This model is ready for commercial/non-commercial use.

**Model Developer**: NVIDIA

## Model Versions
The Isaac GR00T N1.7 model family includes the following 4 models:

### GR00T N1.7 – SimplerEnv Bridge

**Description**
N1.7 post-trained model using the **Bridge Dataset** in SimplerEnv.

**Post-Training Data**
https://huggingface.co/datasets/IPEC-COMMUNITY/bridge_orig_lerobot

**Dataset Summary**
A LeRobot-format conversion of **BridgeData V2**, originally containing **60,096 trajectories** of robot manipulation across **24 environments**.

### GR00T N1.7 – SimplerEnv Fractal

**Description**
N1.7 post-trained model using the **Fractal Dataset** in SimplerEnv.

**Post-Training Data**
https://huggingface.co/datasets/IPEC-COMMUNITY/fractal20220817_data_lerobot

**Dataset Summary**
A LeRobot-format conversion of the **Fractal** dataset, the Google Robot manipulation data originally released with RT-1.

### GR00T N1.7 – DROID
**Description**
N1.7 post-trained model using the **DROID Dataset**.

**Post-Training Data**
https://droid-dataset.github.io/

**Dataset Summary**
A large-scale **“in-the-wild” robot manipulation dataset** with approximately **76,000 demonstration trajectories (~350 hours)** of interaction data, collected across **564 distinct scenes in 52 buildings**, covering **86 manipulation tasks** paired with natural-language instructions.

### GR00T N1.7 – LIBERO
**Description**
N1.7 post-trained model using the **LIBERO Dataset**.

**Post-Training Data**
https://github.com/Lifelong-Robot-Learning/LIBERO

**Dataset Summary**
A benchmark for **lifelong robot learning**, providing **130 language-conditioned manipulation tasks** grouped into multiple task suites.
It includes **human-teleoperated demonstrations** designed to evaluate **knowledge transfer and continual learning** in robotic agents.

## License
This model is released under the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).

### Deployment Geography:
Global

### Use Case:
* Researchers, Academics, Open-Source Community: AI-driven robotics research and algorithm development.
* Developers: Integrate and customize AI for various robotic applications.
* Startups & Companies: Accelerate robotics development and reduce training costs.

### Release Date:
* GitHub via https://github.com/NVIDIA/Isaac-GR00T
* Hugging Face via https://huggingface.co/collections/nvidia/gr00t-n17

## Computational Load
Estimated Energy and Emissions for Model Training:
* Total kWh: 64 GB200 nodes × 4 GPUs per node × 1200 W × 0.001 × 0.8 × 120 hours × 1.4 = 41,288 kWh
* Total Emission: 410.5 × 41,288 × 0.000001 = 16.949 tCO2e
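The arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch: the variable names are illustrative, and reading 0.8 as an average utilization factor, 1.4 as a data-center overhead multiplier, and 410.5 as grams of CO2e per kWh is an assumption, since the card does not label those factors.

```python
# Reproduces the energy and emissions estimate from the figures in this card.
# Constant names and the meaning of the unlabeled factors are assumptions.
NODES = 64
GPUS_PER_NODE = 4
WATTS_PER_GPU = 1200
KW_PER_WATT = 0.001
UTILIZATION = 0.8             # assumed meaning of the 0.8 factor
TRAINING_HOURS = 120
OVERHEAD = 1.4                # assumed meaning of the 1.4 factor
GRAMS_CO2E_PER_KWH = 410.5    # assumed unit of the 410.5 factor
TONNES_PER_GRAM = 0.000001


def training_energy_kwh() -> float:
    """Total training energy in kWh."""
    gpu_power_kw = NODES * GPUS_PER_NODE * WATTS_PER_GPU * KW_PER_WATT
    return gpu_power_kw * UTILIZATION * TRAINING_HOURS * OVERHEAD


def emissions_tco2e(energy_kwh: float) -> float:
    """Emissions in tonnes of CO2-equivalent for a given energy in kWh."""
    return GRAMS_CO2E_PER_KWH * energy_kwh * TONNES_PER_GRAM


energy = round(training_energy_kwh())
print(f"{energy} kWh")
print(f"{emissions_tco2e(energy):.3f} tCO2e")
```

Rounding the energy to whole kWh before computing emissions mirrors the order of operations used in the card.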

## Model Architecture:

**GR00T-N1.7 VLM backbone is now [Cosmos-Reason2-2B](https://huggingface.co/nvidia/Cosmos-Reason2-2B)**

**Network Architecture:**

The schematic diagram is shown in the illustration below.

* Red, Green, Blue (RGB) camera frames are processed through a pre-trained vision transformer (SigLip2).
* Text is encoded by a pre-trained transformer (T5).
* Robot proprioception is encoded using a multi-layer perceptron (MLP) indexed by the embodiment ID. To handle variable-dimension proprioception, inputs are padded to a configurable max length before feeding into the MLP.
* Actions are encoded and velocity predictions decoded by an MLP, one per unique embodiment.
* The flow matching transformer is implemented as a diffusion transformer (DiT), in which the diffusion step conditioning is implemented using adaptive layernorm (AdaLN).

![Model Architecture](model-architecture.png)

**Number of Model Parameters:** 3,000,000,000
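The proprioception path described in the architecture (zero-padding to a configurable max length, then an MLP selected by embodiment ID) can be illustrated with a small pure-Python sketch. It is not the Isaac-GR00T implementation: the class name, dimensions, and random weights are hypothetical stand-ins for trained parameters.

```python
import random

MAX_STATE_DIM = 8  # hypothetical "configurable max length" for padding
HIDDEN_DIM = 16    # hypothetical hidden width
EMBED_DIM = 4      # hypothetical output embedding width


def pad_state(state, max_dim=MAX_STATE_DIM):
    """Right-pad a variable-length proprioception vector with zeros."""
    if len(state) > max_dim:
        raise ValueError(f"state has {len(state)} dims, max is {max_dim}")
    return list(state) + [0.0] * (max_dim - len(state))


def random_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


class EmbodimentStateEncoder:
    """Two-layer MLP with a separate weight set per embodiment ID."""

    def __init__(self, num_embodiments, seed=0):
        rng = random.Random(seed)
        self.params = [
            (random_matrix(HIDDEN_DIM, MAX_STATE_DIM, rng),
             random_matrix(EMBED_DIM, HIDDEN_DIM, rng))
            for _ in range(num_embodiments)
        ]

    def encode(self, state, embodiment_id):
        w1, w2 = self.params[embodiment_id]  # select weights by embodiment
        hidden = [max(0.0, h) for h in matvec(w1, pad_state(state))]  # ReLU
        return matvec(w2, hidden)


encoder = EmbodimentStateEncoder(num_embodiments=2)
arm_embedding = encoder.encode([0.1, -0.2, 0.3], embodiment_id=0)    # 3-dim state
hand_embedding = encoder.encode([0.0] * 7 + [1.0], embodiment_id=1)  # 8-dim state
print(len(arm_embedding), len(hand_embedding))
```

Both states map to embeddings of the same width even though their raw dimensions differ, which is the point of padding before the per-embodiment MLP.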

## Input:
**Input Type(s):**
* Vision: Image Frames
* State: Robot Proprioception
* Language Instruction: Text
* Embodiment ID: Integer

**Input Format:**
* Vision: Variable number of uint8 image frames, coming from robot cameras
* State: Floating Point
* Language Instruction: String
* Embodiment ID: Integer indicating which of the training embodiments is observed

**Input Parameters:**
* Vision: Two-Dimensional (2D) - Red, Green, Blue (RGB)
* State: One-Dimensional (1D) - Floating number vector
* Language Instruction: One-Dimensional (1D) - String
* Embodiment ID: One-Dimensional (1D) - Integer
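To make the four inputs concrete, the sketch below bundles them into a small container with basic validation. The `CameraFrame` and `Observation` classes are hypothetical and only mirror the types listed above; they are not part of the Isaac-GR00T API.

```python
from dataclasses import dataclass


@dataclass
class CameraFrame:
    """One RGB frame stored as row-major uint8 bytes (hypothetical container)."""
    height: int
    width: int
    pixels: bytes

    def __post_init__(self):
        expected = self.height * self.width * 3  # three channels: R, G, B
        if len(self.pixels) != expected:
            raise ValueError(f"expected {expected} bytes, got {len(self.pixels)}")


@dataclass
class Observation:
    """The four model inputs listed in this card (hypothetical container)."""
    frames: list[CameraFrame]  # variable number of camera frames
    state: list[float]         # 1D proprioception vector
    instruction: str           # language instruction
    embodiment_id: int         # which training embodiment is observed

    def __post_init__(self):
        if not self.frames:
            raise ValueError("at least one camera frame is required")
        if self.embodiment_id < 0:
            raise ValueError("embodiment_id must be non-negative")


frame = CameraFrame(height=2, width=2, pixels=bytes(2 * 2 * 3))
obs = Observation(
    frames=[frame],
    state=[0.0, 0.1, -0.2],
    instruction="pick up the apple",
    embodiment_id=0,
)
print(len(obs.frames), len(obs.state), obs.instruction)
```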

## Output:
**Output Type(s):** Actions

**Output Format:** Continuous-value vectors

**Output Parameters:** Two-Dimensional (2D) <br>

**Other Properties Related to Output:** The continuous-value vectors correspond to motor controls on the robot; their dimension depends on the degrees of freedom of the robot embodiment.
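As a rough illustration of how a 2D action chunk (time steps by action dimensions) can be produced by flow matching, the sketch below integrates a velocity field with a few Euler steps, starting from Gaussian noise. The velocity function is a toy stand-in for the DiT, and the noise-at-t=0 convention, function names, and sizes are assumptions made for the example.

```python
import random


def toy_velocity(actions, t, target):
    """Toy stand-in for the DiT. On the straight path
    x_t = (1 - t) * noise + t * target, the velocity is (target - x_t) / (1 - t)."""
    return [
        [(goal - value) / (1.0 - t) for value, goal in zip(row, goal_row)]
        for row, goal_row in zip(actions, target)
    ]


def sample_action_chunk(velocity_fn, horizon, action_dim, num_steps=4, seed=0):
    """Euler-integrate a velocity field from noise (t=0) toward actions (t=1)."""
    rng = random.Random(seed)
    actions = [[rng.gauss(0.0, 1.0) for _ in range(action_dim)]
               for _ in range(horizon)]
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        velocity = velocity_fn(actions, t)
        actions = [[a + dt * v for a, v in zip(row, v_row)]
                   for row, v_row in zip(actions, velocity)]
    return actions


HORIZON, ACTION_DIM = 3, 2  # hypothetical chunk length and robot DoF
target_chunk = [[0.1 * i + j for j in range(ACTION_DIM)] for i in range(HORIZON)]
chunk = sample_action_chunk(
    lambda x, t: toy_velocity(x, t, target_chunk), HORIZON, ACTION_DIM
)
print(len(chunk), len(chunk[0]))  # rows = time steps, columns = action dims
```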

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br>

## Software Integration:

**Runtime Engine(s):** PyTorch

**Supported Hardware Microarchitecture Compatibility:**
All of the below:
* NVIDIA Ampere
* NVIDIA Blackwell
* NVIDIA Jetson
* NVIDIA Hopper
* NVIDIA Lovelace

**Supported Operating System(s):**
* Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

# Model Version
GR00T N1.7 EA

# Training and Evaluation Datasets:
The total size (in number of data points): 21.6 million <br>
Total number of datasets: 13 <br>

## Training Dataset:
GR00T Pretraining Data

**Data Collection Method by dataset:** Hybrid: Human, Robot, Simulated.

**Labeling Method by dataset:** Hybrid: Human, Automated.

**Properties:**
* Cross-embodiment: Data collected on various robot embodiments
* Sensor types: RGB camera, robot proprioception, robot actuator data

## Evaluation:
We evaluate on both simulation and real-robot benchmarks, as defined in the White Paper (https://arxiv.org/abs/2503.14734).

**Data Collection Method by dataset:** Hybrid: Human, Robot, Simulated.

**Labeling Method by dataset:** Hybrid: Human, Automated.

* Sim evaluation benchmarks for upper body control
  * 9 DexMG Whitepaper tasks
  * 24 RoboCasa simulated mobile manipulator tasks
  * 24 Digital Cousin simulated GR-1 humanoid manipulation tasks
  * For sim, we automatically measure the success rate in each manipulation behavior.
* For real robot
  * Grocery packing task
  * Novel objects (unseen from training data)
  * Industrial multi-robot coordination with handoffs
  * Evaluated by human observers in the lab

## System Requirements and Performance

This section discusses the various configurations and inference runtimes for GR00T N1.7 tasks. We discuss both latency and speedup.

GR00T N1.7 Inference Timing (4 denoising steps, 1 camera):

| Device | Mode | Data Processing | Backbone | Action Head | E2E | Frequency | E2E Speedup |
|--------|------|-----------------|----------|-------------|-----|-----------|-------------|
| **dGPU** | | | | | | | |
| H100 80GB HBM3 | PyTorch Eager | 6.2 ms | 31.3 ms | 48.2 ms | 85.8 ms | 11.7 Hz | 1.00x |
| | torch.compile | 6.2 ms | 30.4 ms | 12.0 ms | 48.6 ms | 20.6 Hz | 1.77x |
| | **TensorRT (Full Pipeline)** | **6.2 ms** | **8.8 ms** | **12.3 ms** | **27.9 ms** | **35.9 Hz** | **3.08x** |
| H20 96GB HBM3 | PyTorch Eager | 5.33 ms | 30.8 ms | 47.3 ms | 83.4 ms | 12.0 Hz | 1.00x |
| | torch.compile | 5.33 ms | 31.1 ms | 13.3 ms | 49.7 ms | 20.1 Hz | 1.68x |
| | **TensorRT (Full Pipeline)** | **5.33 ms** | **14.2 ms** | **14.5 ms** | **34.0 ms** | **29.4 Hz** | **2.45x** |
| RTX Pro 6000 Blackwell | PyTorch Eager | 4.8 ms | 29.3 ms | 44.0 ms | 78.4 ms | 12.8 Hz | 1.00x |
| | torch.compile | 4.8 ms | 29.4 ms | 16.5 ms | 50.7 ms | 19.7 Hz | 1.55x |
| | **TensorRT (Full Pipeline)** | **4.8 ms** | **9.9 ms** | **13.2 ms** | **27.9 ms** | **35.9 Hz** | **2.81x** |
| RTX Pro 5000 72GB | PyTorch Eager | 8.85 ms | 54.01 ms | 63.19 ms | 126.4 ms | 7.9 Hz | 1.00x |
| | torch.compile | 8.85 ms | 55.74 ms | 20.38 ms | 84.9 ms | 11.8 Hz | 1.49x |
| | **TensorRT (Full Pipeline)** | **8.85 ms** | **14.37 ms** | **17.33 ms** | **40.5 ms** | **24.7 Hz** | **3.13x** |
| L40 | PyTorch Eager | 6.6 ms | 42.8 ms | 78.9 ms | 128.3 ms | 7.8 Hz | 1.00x |
| | torch.compile | 6.6 ms | 42.7 ms | 19.8 ms | 69.0 ms | 14.5 Hz | 1.86x |
| | **TensorRT (Full Pipeline)** | **6.6 ms** | **13.1 ms** | **18.8 ms** | **38.4 ms** | **26.0 Hz** | **3.34x** |
| L20 | PyTorch Eager | 5.7 ms | 47.58 ms | 86.92 ms | 140.3 ms | 7.1 Hz | 1.00x |
| | torch.compile | 5.7 ms | 47.2 ms | 20.18 ms | 73.1 ms | 13.7 Hz | 1.92x |
| | **TensorRT (Full Pipeline)** | **5.7 ms** | **17.27 ms** | **19.79 ms** | **42.8 ms** | **23.3 Hz** | **3.28x** |
| **Jetson / Spark** | | | | | | | |
| DGX Spark | PyTorch Eager | 13.14 ms | 38.22 ms | 74.94 ms | 126.4 ms | 7.9 Hz | 1.00x |
| | torch.compile | 13.14 ms | 39.23 ms | 56.49 ms | 108.8 ms | 9.2 Hz | 1.16x |
| | **TensorRT (Full Pipeline)** | **13.14 ms** | **33.43 ms** | **52.37 ms** | **98.6 ms** | **10.1 Hz** | **1.28x** |
| AGX Thor | PyTorch Eager | 8.21 ms | 55.26 ms | 81.65 ms | 144.9 ms | 6.9 Hz | 1.00x |
| | torch.compile | 8.21 ms | 55.59 ms | 64.66 ms | 128.4 ms | 7.8 Hz | 1.13x |
| | **TensorRT (Full Pipeline)** | **8.21 ms** | **28.89 ms** | **56.64 ms** | **93.8 ms** | **10.7 Hz** | **1.54x** |
| Orin | PyTorch Eager | 9.45 ms | 127.6 ms | 205.39 ms | 342.8 ms | 2.9 Hz | 1.00x |
| | torch.compile | 9.45 ms | 128.59 ms | 78.94 ms | 217.0 ms | 4.6 Hz | 1.58x |
| | **TensorRT (DiT-only)** | **9.45 ms** | **128.38 ms** | **78.6 ms** | **216.5 ms** | **4.6 Hz** | **1.58x** |

> **Note:** Orin uses DiT-only TensorRT (`--inference-mode tensorrt`) because TRT 10.3 does not support the backbone engine. All other platforms use the full pipeline (`--inference-mode trt_full_pipeline`).
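The Frequency and E2E Speedup columns follow from the E2E latency: frequency is the reciprocal of the latency, and speedup is measured against the PyTorch Eager row of the same device. A minimal sketch using the H100 rows (the helper names are illustrative):

```python
def frequency_hz(e2e_ms: float) -> float:
    """Control frequency implied by an end-to-end latency in milliseconds."""
    return 1000.0 / e2e_ms


def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """End-to-end speedup of a mode relative to the baseline latency."""
    return baseline_ms / optimized_ms


# E2E latencies for the H100 rows of the table above.
h100_e2e_ms = {
    "PyTorch Eager": 85.8,
    "torch.compile": 48.6,
    "TensorRT (Full Pipeline)": 27.9,
}

baseline = h100_e2e_ms["PyTorch Eager"]
for mode, latency in h100_e2e_ms.items():
    print(f"{mode}: {frequency_hz(latency):.1f} Hz, {speedup(baseline, latency):.2f}x")
```

Values recomputed from the rounded latencies can differ from the table in the last digit, presumably because the table was derived from unrounded measurements.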

## Inference:
**Engine:** PyTorch <br>
**Test Hardware:** A6000

## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment.

For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).