
---
tags:
- robotics
---

<div align="center">
<a href="https://github.com/NVIDIA/Isaac-GR00T">
<img src="https://cdn-uploads.huggingface.co/production/uploads/67b8da81d01134f89899b4a7/8bFQa2ZIGCsOQQ2ho2N_U.png">
</a>
<div align="center">
<a href="https://github.com/NVIDIA/Isaac-GR00T">
<img src="https://img.shields.io/badge/GitHub-grey?logo=GitHub" alt="GitHub Badge">
</a>
<a href="https://developer.nvidia.com/isaac/gr00t">
<img src="https://img.shields.io/badge/Website-green" alt="Website Badge">
</a>
<!-- <a href=""">
<img src="https://img.shields.io/badge/Project%20Page-blue?style=plastic" alt="Project Page Badge">
</a>
<a href="">
<img src="https://img.shields.io/badge/Research_Blog-black?style=flat" alt="Research Blog Badge">
</a>
<a href="">
<img src="https://img.shields.io/badge/Dataset-Overview-brightgreen?logo=googleforms" alt="Research Blog Badge">
</a>
-->
</div>
</div>

# Model Overview

<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/67b8da81d01134f89899b4a7/ZCLLXZk2LQBG0YH_BmiIN.gif"
style="width:100%; max-width:1000px; height:auto;">
</p>

## Description:
NVIDIA Isaac GR00T N1.7 is an open foundation model for generalized humanoid robot reasoning and skills. This cross-embodiment model takes multimodal input, including language and images, to perform manipulation tasks in diverse environments. Developers and researchers can post-train GR00T N1.7 with real or synthetic data for their specific humanoid robot or task.

Isaac GR00T N1.7 is the medium-sized version of our model. It is built on pre-trained vision and language encoders and uses a flow-matching action transformer to model a chunk of actions conditioned on vision, language, and proprioception.
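
As a rough illustration of how a flow-matching policy can turn noise into an action chunk at inference time, the sketch below integrates a velocity field from t = 0 to t = 1 with a few Euler steps. It is not the GR00T action transformer: `toy_velocity` is a hypothetical stand-in that simply points toward a fixed target chunk, and the function names, shapes, and seed are assumptions made for this example. The default of four steps mirrors the "4 denoising steps" setting quoted in the inference timing section.

```python
import random


def toy_velocity(actions, t, target):
    """Hypothetical stand-in for a learned velocity network.

    For a straight-line path from noise to `target`, the velocity that carries
    the current sample to the target at t = 1 is (target - x) / (1 - t).
    """
    return [[(g - a) / (1.0 - t) for a, g in zip(row_a, row_g)]
            for row_a, row_g in zip(actions, target)]


def sample_action_chunk(target, num_steps=4, seed=0):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 with Euler steps."""
    rng = random.Random(seed)
    # Start from Gaussian noise with the same (horizon, action_dim) shape.
    x = [[rng.gauss(0.0, 1.0) for _ in row] for row in target]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t never reaches 1.0, so the division above is safe
        v = toy_velocity(x, t, target)
        x = [[a + dt * dv for a, dv in zip(row_x, row_v)]
             for row_x, row_v in zip(x, v)]
    return x


# A tiny chunk: horizon of 2 time steps, 3 action dimensions.
target_chunk = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.5]]
chunk = sample_action_chunk(target_chunk)
```

In a real policy the velocity would come from a network conditioned on vision, language, and proprioception; only the integration loop carries over from this toy version.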

A detailed description of the Isaac GR00T N1.X architecture is provided in the GR00T N1 White Paper (https://arxiv.org/abs/2503.14734).

This model is ready for commercial/non-commercial use.

**Model Developer**: NVIDIA

## Model Versions
The Isaac GR00T N1.7 model family includes the following 4 models:

### GR00T N1.7 – SimplerEnv Bridge

**Description**
N1.7 post-trained model using the **Bridge Dataset** in SimplerEnv.

**Post-Training Data**
https://huggingface.co/datasets/IPEC-COMMUNITY/bridge_orig_lerobot

**Dataset Summary**
A LeRobot-format conversion of **BridgeData V2**, originally containing **60,096 trajectories** of robot manipulation across **24 environments**.

### GR00T N1.7 – SimplerEnv Fractal

**Description**
N1.7 post-trained model using the **Fractal Dataset** in SimplerEnv.

**Post-Training Data**
https://huggingface.co/datasets/IPEC-COMMUNITY/bridge_orig_lerobot

**Dataset Summary**
A LeRobot-format conversion of **BridgeData V2**, originally containing **60,096 trajectories** of robot manipulation across **24 environments**.

### GR00T N1.7 – DROID
**Description**
N1.7 post-trained model using the **DROID Dataset**.

**Post-Training Data**
https://droid-dataset.github.io/

**Dataset Summary**
A large-scale **“in-the-wild” robot manipulation dataset** with approximately **76,000 demonstration trajectories (~350 hours)** of interaction data, collected across **564 distinct scenes in 52 buildings** and covering **86 manipulation tasks** specified by natural-language instructions.

### GR00T N1.7 – LIBERO
**Description**
N1.7 post-trained model using the **LIBERO Dataset**.

**Post-Training Data**
https://github.com/Lifelong-Robot-Learning/LIBERO

**Dataset Summary**
A benchmark for **lifelong robot learning**, providing **130 language-conditioned manipulation tasks** grouped into multiple task suites.
It includes **human-teleoperated demonstrations** designed to evaluate **knowledge transfer and continual learning** in robotic agents.

## License
This model is released under the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).

### Deployment Geography:
Global

### Use Case:
* Researchers, Academics, Open-Source Community: AI-driven robotics research and algorithm development.
* Developers: Integrate and customize AI for various robotic applications.
* Startups & Companies: Accelerate robotics development and reduce training costs.

### Release Date:
* GitHub via https://github.com/NVIDIA/Isaac-GR00T
* Hugging Face via https://huggingface.co/collections/nvidia/gr00t-n17

## Computational Load (Internal Only: For NVIDIA Models Only)
Cumulative Compute: Follow Instructions
Estimated Energy and Emissions for Model Training: Follow Instructions

Total kWh:
64 GB200 nodes × 4 GPUs per node × 1200 W × 0.001 × 0.8 × 120 hours × 1.4 = 41,288 kWh

Total Emission:
410.5 × 41,288 × 0.000001 = 16.949 tCO2e
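
The arithmetic above can be reproduced with a short script. This is a sketch under stated assumptions: the source does not label the 0.8 and 1.4 factors, so the parameter names `utilization` and `overhead` are guesses, and the 410.5 figure is treated as grams of CO2e per kWh because the result is converted to tonnes with a factor of 0.000001.

```python
def training_energy_kwh(nodes, gpus_per_node, watts_per_gpu,
                        utilization, hours, overhead):
    """Energy in kWh: GPU count x power (kW) x utilization x hours x overhead.

    `utilization` and `overhead` are assumed names for the unlabeled
    0.8 and 1.4 factors in the calculation above.
    """
    kilowatts = nodes * gpus_per_node * watts_per_gpu * 0.001  # W -> kW
    return kilowatts * utilization * hours * overhead


def emissions_tco2e(energy_kwh, grams_co2e_per_kwh):
    """Convert energy to tonnes of CO2-equivalent (1 tonne = 1e6 grams)."""
    return energy_kwh * grams_co2e_per_kwh * 1e-6


energy = training_energy_kwh(
    nodes=64, gpus_per_node=4, watts_per_gpu=1200,
    utilization=0.8, hours=120, overhead=1.4,
)
emissions = emissions_tco2e(energy, grams_co2e_per_kwh=410.5)
print(round(energy), round(emissions, 3))
```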

## Model Architecture:

**GR00T-N1.7 VLM backbone is now [Cosmos-Reason2-2B](https://huggingface.co/nvidia/Cosmos-Reason2-2B)**

**Network Architecture:**

The schematic diagram is shown in the illustration below.
Red, Green, Blue (RGB) camera frames are processed through a pre-trained vision transformer (SigLIP2).
Text is encoded by a pre-trained transformer (T5).
Robot proprioception is encoded using a multi-layer perceptron (MLP) indexed by the embodiment ID. To handle variable-dimension proprioception, inputs are padded to a configurable max length before feeding into the MLP.
Actions are encoded and velocity predictions decoded by an MLP, one per unique embodiment.
The flow matching transformer is implemented as a diffusion transformer (DiT), in which the diffusion step conditioning is implemented using adaptive layer normalization (AdaLN).
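
The proprioception path described above (zero-pad to a maximum length, then apply an MLP selected by embodiment ID) might look roughly like the following. This is a minimal pure-Python sketch, not the model's implementation: `MAX_STATE_DIM`, `HIDDEN_DIM`, the two-layer MLP shape, and the random untrained weights are all assumptions chosen for illustration.

```python
import random

MAX_STATE_DIM = 8  # hypothetical "configurable max length"
HIDDEN_DIM = 4     # hypothetical embedding size


def pad_state(state, max_dim=MAX_STATE_DIM):
    """Right-pad a variable-length proprioception vector with zeros."""
    if len(state) > max_dim:
        raise ValueError("state is longer than max_dim")
    return list(state) + [0.0] * (max_dim - len(state))


def make_linear(rng, in_dim, out_dim):
    """Random weights and zero bias for one linear layer (untrained)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def linear(x, layer):
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


rng = random.Random(0)
# One small two-layer MLP per embodiment ID.
encoders = {
    emb_id: (make_linear(rng, MAX_STATE_DIM, HIDDEN_DIM),
             make_linear(rng, HIDDEN_DIM, HIDDEN_DIM))
    for emb_id in (0, 1)
}


def encode_state(state, embodiment_id):
    """Pad the state, then run the MLP that belongs to this embodiment."""
    first, second = encoders[embodiment_id]
    return linear(relu(linear(pad_state(state), first)), second)


# Two embodiments with different state sizes map to same-size embeddings.
emb_a = encode_state([0.1, 0.2, 0.3], embodiment_id=0)
emb_b = encode_state([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7], embodiment_id=1)
```

Padding gives every embodiment the same input width, while indexing the weights by embodiment ID lets each robot keep its own mapping from joint readings to the shared embedding space.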

![Model Architecture](model-architecture.png)

**Number of Model Parameters:** 3,000,000,000

## Input:
**Input Type(s):**
- Vision: Image Frames
- State: Robot Proprioception
- Language Instruction: Text
- Embodiment ID: Integer

**Input Format:**
- Vision: Variable number of uint8 image frames, coming from robot cameras
- State: Floating Point
- Language Instruction: String
- Embodiment ID: Integer indicating which of the training embodiments is observed

**Input Parameters:**
- Vision: Two-Dimensional (2D) - Red, Green, Blue (RGB)
- State: One-Dimensional (1D) - Floating number vector
- Language Instruction: One-Dimensional (1D) - String
- Embodiment ID: One-Dimensional (1D) - Integer
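
To make the four inputs concrete, the sketch below bundles them into one container with basic type checks. The class and field names (`Frame`, `Observation`, `frames`, `state`, `instruction`, `embodiment_id`) are hypothetical and are not the names used by the GR00T codebase; the raw-bytes image layout is likewise an assumption chosen to keep the example dependency-free.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    """One RGB camera frame stored as raw uint8 bytes (row-major, 3 channels)."""
    height: int
    width: int
    data: bytes

    def __post_init__(self):
        expected = self.height * self.width * 3
        if len(self.data) != expected:
            raise ValueError(f"expected {expected} bytes, got {len(self.data)}")


@dataclass
class Observation:
    """Hypothetical container mirroring the four input types listed above."""
    frames: List[Frame]   # variable number of camera frames
    state: List[float]    # 1D proprioception vector
    instruction: str      # language instruction
    embodiment_id: int    # which training embodiment is observed

    def __post_init__(self):
        if not all(isinstance(v, float) for v in self.state):
            raise TypeError("state must contain floating-point values")
        if not isinstance(self.instruction, str):
            raise TypeError("instruction must be a string")
        if not isinstance(self.embodiment_id, int):
            raise TypeError("embodiment_id must be an integer")


obs = Observation(
    frames=[Frame(height=2, width=2, data=bytes(2 * 2 * 3))],  # one black 2x2 image
    state=[0.0, 0.1, -0.2],
    instruction="pick up the cup",
    embodiment_id=0,
)
```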

## Output:
**Output Type(s):** Actions

**Output Format:** Continuous-value vectors

**Output Parameters:** Two-Dimensional (2D) <br>

**Other Properties Related to Output:** Continuous-value vectors correspond to different motor controls on a robot, which depend on the degrees of freedom of the robot embodiment.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br>

## Software Integration:

**Runtime Engine(s):** PyTorch

**Supported Hardware Microarchitecture Compatibility:**
All of the below:
* NVIDIA Ampere
* NVIDIA Blackwell
* NVIDIA Jetson
* NVIDIA Hopper
* NVIDIA Lovelace

**Supported Operating System(s):**
* Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

# Model Version
GR00T N1.7 EA

# Training and Evaluation Datasets:
The total size (in number of data points): 21.6 million <br>
Total number of datasets: 13 <br>

## Training Dataset:
GR00T Pretraining Data

**Data Collection Method by dataset:** Hybrid: Human, Robot, Simulated.

**Labeling Method by dataset:** Hybrid: Human, Automated.

**Properties:**
* Cross-embodiment: Data collected on various robot embodiments
* Sensor types: RGB camera, robot proprioception, robot actuator data

## Evaluation:
We evaluate in both simulation and real robot benchmarks, as defined in the White Paper (https://arxiv.org/abs/2503.14734).

**Data Collection Method by dataset:** Hybrid: Human, Robot, Simulated.

**Labeling Method by dataset:** Hybrid: Human, Automated.

* Sim evaluation benchmarks for upper body control
  * 9 DexMG Whitepaper tasks
  * 24 RoboCasa simulated mobile manipulator tasks
  * 24 Digital Cousin simulated GR-1 humanoid manipulation tasks
  * For sim, we automatically measure the success rate in each manipulation behavior.
* For real robot
  * Grocery packing task
  * Novel objects (unseen from training data)
  * Industrial multi-robot coordination with handoffs
  * Evaluated by human observers in the lab
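
The per-behavior success rate mentioned for the simulation benchmarks is a simple ratio of successful episodes to total episodes. The sketch below shows one way to tally it; the episode log format and the task names `task_a` and `task_b` are hypothetical and are not taken from the benchmarks listed above.

```python
from collections import defaultdict


def success_rates(episodes):
    """Map each task name to the fraction of its episodes that succeeded."""
    counts = defaultdict(lambda: [0, 0])  # task -> [successes, trials]
    for task, succeeded in episodes:
        counts[task][0] += int(succeeded)
        counts[task][1] += 1
    return {task: wins / trials for task, (wins, trials) in counts.items()}


# Hypothetical rollout log: (task name, success flag).
episodes = [
    ("task_a", True), ("task_a", False),
    ("task_b", True), ("task_b", True),
]
rates = success_rates(episodes)
```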

## System Requirements and Performance

This section reports inference latency and speedup for GR00T N1.7 across hardware platforms and runtime configurations.

GR00T N1.7 Inference Timing (4 denoising steps, 1 camera):

| Device | Mode | Data Processing | Backbone | Action Head | E2E | Frequency | E2E Speedup |
|--------|------|-----------------|----------|-------------|-----|-----------|-------------|
| **dGPU** | | | | | | | |
| H100 80GB HBM3 | PyTorch Eager | 6.2 ms | 31.3 ms | 48.2 ms | 85.8 ms | 11.7 Hz | 1.00x |
| | torch.compile | 6.2 ms | 30.4 ms | 12.0 ms | 48.6 ms | 20.6 Hz | 1.77x |
| | **TensorRT (Full Pipeline)** | **6.2 ms** | **8.8 ms** | **12.3 ms** | **27.9 ms** | **35.9 Hz** | **3.08x** |
| H20 96GB HBM3 | PyTorch Eager | 5.33 ms | 30.8 ms | 47.3 ms | 83.4 ms | 12.0 Hz | 1.00x |
| | torch.compile | 5.33 ms | 31.1 ms | 13.3 ms | 49.7 ms | 20.1 Hz | 1.68x |
| | **TensorRT (Full Pipeline)** | **5.33 ms** | **14.2 ms** | **14.5 ms** | **34.0 ms** | **29.4 Hz** | **2.45x** |
| RTX Pro 6000 Blackwell | PyTorch Eager | 4.8 ms | 29.3 ms | 44.0 ms | 78.4 ms | 12.8 Hz | 1.00x |
| | torch.compile | 4.8 ms | 29.4 ms | 16.5 ms | 50.7 ms | 19.7 Hz | 1.55x |
| | **TensorRT (Full Pipeline)** | **4.8 ms** | **9.9 ms** | **13.2 ms** | **27.9 ms** | **35.9 Hz** | **2.81x** |
| RTX Pro 5000 72GB | PyTorch Eager | 8.85 ms | 54.01 ms | 63.19 ms | 126.4 ms | 7.9 Hz | 1.00x |
| | torch.compile | 8.85 ms | 55.74 ms | 20.38 ms | 84.9 ms | 11.8 Hz | 1.49x |
| | **TensorRT (Full Pipeline)** | **8.85 ms** | **14.37 ms** | **17.33 ms** | **40.5 ms** | **24.7 Hz** | **3.13x** |
| L40 | PyTorch Eager | 6.6 ms | 42.8 ms | 78.9 ms | 128.3 ms | 7.8 Hz | 1.00x |
| | torch.compile | 6.6 ms | 42.7 ms | 19.8 ms | 69.0 ms | 14.5 Hz | 1.86x |
| | **TensorRT (Full Pipeline)** | **6.6 ms** | **13.1 ms** | **18.8 ms** | **38.4 ms** | **26.0 Hz** | **3.34x** |
| L20 | PyTorch Eager | 5.7 ms | 47.58 ms | 86.92 ms | 140.3 ms | 7.1 Hz | 1.00x |
| | torch.compile | 5.7 ms | 47.2 ms | 20.18 ms | 73.1 ms | 13.7 Hz | 1.92x |
| | **TensorRT (Full Pipeline)** | **5.7 ms** | **17.27 ms** | **19.79 ms** | **42.8 ms** | **23.3 Hz** | **3.28x** |
| **Jetson / Spark** | | | | | | | |
| DGX Spark | PyTorch Eager | 13.14 ms | 38.22 ms | 74.94 ms | 126.4 ms | 7.9 Hz | 1.00x |
| | torch.compile | 13.14 ms | 39.23 ms | 56.49 ms | 108.8 ms | 9.2 Hz | 1.16x |
| | **TensorRT (Full Pipeline)** | **13.14 ms** | **33.43 ms** | **52.37 ms** | **98.6 ms** | **10.1 Hz** | **1.28x** |
| AGX Thor | PyTorch Eager | 8.21 ms | 55.26 ms | 81.65 ms | 144.9 ms | 6.9 Hz | 1.00x |
| | torch.compile | 8.21 ms | 55.59 ms | 64.66 ms | 128.4 ms | 7.8 Hz | 1.13x |
| | **TensorRT (Full Pipeline)** | **8.21 ms** | **28.89 ms** | **56.64 ms** | **93.8 ms** | **10.7 Hz** | **1.54x** |
| Orin | PyTorch Eager | 9.45 ms | 127.6 ms | 205.39 ms | 342.8 ms | 2.9 Hz | 1.00x |
| | torch.compile | 9.45 ms | 128.59 ms | 78.94 ms | 217.0 ms | 4.6 Hz | 1.58x |
| | **TensorRT (DiT-only)** | **9.45 ms** | **128.38 ms** | **78.6 ms** | **216.5 ms** | **4.6 Hz** | **1.58x** |

> **Note:** Orin uses DiT-only TensorRT (`--inference-mode tensorrt`) because TRT 10.3 does not support the backbone engine. All other platforms use the full pipeline (`--inference-mode trt_full_pipeline`).
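
The last two columns of the timing table appear to follow directly from the end-to-end latency: the frequency is the reciprocal of the latency, and the speedup is the PyTorch Eager latency divided by the latency of the mode in question. The sketch below checks this against the H100 rows; the helper names are illustrative, and small differences from the table are expected because the published latencies are already rounded.

```python
def frequency_hz(e2e_ms):
    """Control frequency implied by an end-to-end latency in milliseconds."""
    return 1000.0 / e2e_ms


def speedup(baseline_ms, e2e_ms):
    """Speedup of a mode relative to the baseline (PyTorch Eager) latency."""
    return baseline_ms / e2e_ms


# End-to-end latencies for the H100 rows of the table, in milliseconds.
h100_e2e_ms = {"eager": 85.8, "compile": 48.6, "tensorrt_full": 27.9}

for mode, latency in h100_e2e_ms.items():
    hz = frequency_hz(latency)
    gain = speedup(h100_e2e_ms["eager"], latency)
    print(f"{mode}: {hz:.1f} Hz, {gain:.2f}x")
```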

## Inference:
**Engine:** PyTorch <br>
**Test Hardware:** A6000

## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment.

For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).