---
datasets:
- nvidia/PhysicalAI-Autonomous-Vehicles
- nvidia/PhysicalAI-Autonomous-Vehicles-NuRec
pipeline_tag: robotics
library_name: transformers
license: other
language:
- en
new_version: nvidia/Alpamayo-1.5-10B
---

# Alpamayo 1

[**Code**](https://github.com/NVlabs/alpamayo) | [**Paper**](https://arxiv.org/abs/2511.00088)

_Note: Following the release of [NVIDIA Alpamayo](https://nvidianews.nvidia.com/news/alpamayo-autonomous-vehicle-development) at CES 2026, Alpamayo-R1 has been renamed to Alpamayo 1._

## Model Overview

### Description:

Alpamayo 1 (v1.0) is a vision-language-action (VLA) model developed by NVIDIA for autonomous driving. It integrates Chain-of-Causation reasoning with trajectory planning, bridging interpretable reasoning and precise vehicle control to improve decision-making in complex driving scenarios.

This model is ready for non-commercial use. Commercial licensing is available upon request.

### License:

The model weights are released under a [non-commercial license](./LICENSE).

The inference code is released under the [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) license.

### Deployment Geography:

Global

### Use Case:

Researchers and autonomous-driving practitioners who are developing and evaluating VLA models for autonomous-driving scenarios, particularly for handling rare, long-tail events.

### Release Date:

Hugging Face 12/03/2025 via this repository.

### Inference Code:

GitHub: https://github.com/NVlabs/alpamayo

## Reference:

[Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail](https://arxiv.org/abs/2511.00088)

## Model Architecture:

**Architecture Type:** Transformer

**Network Architecture:** A VLA model based on Cosmos-Reason and featuring a diffusion-based trajectory decoder.

**This model was developed based on:** Cosmos-Reason (VLM backbone) with a diffusion-based action decoder

**Number of model parameters:**

- Backbone: 8.2B parameters
- Action Expert: 2.3B parameters

## Input(s):

**Input Type(s):** Image/Video, Text, Egomotion History

**Input Format(s):**

- Image: Red, Green, Blue (RGB)
- Text: String
- Egomotion History: Floating-point values `(x, y, z), R_rot`

**Input Parameters:**

- Image: Two-dimensional (2D), multi-camera, multi-timestep
- Text: One-dimensional (1D)
- Egomotion History: Three-dimensional (3D) translation and nine-dimensional (9D, 3x3) rotation, multi-timestep

**Other Properties Related to Input:**
Multi-camera images from 4 cameras (front-wide, front-tele, cross-left, cross-right) with a 0.4-second history window at 10 Hz (4 frames per camera); the input resolution is 1080x1920 pixels, which the processor downsamples to 320x576 pixels. Text inputs include user commands. Images and egomotion history (16 waypoints at 10 Hz) also require associated timestamps.
Note that the model is trained primarily under this setting and has been tested only under this setting.

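To make the input specification above concrete, here is a minimal sketch of how one sample could be laid out with the shapes and rates described in this section. The field names and dict layout are illustrative assumptions, not the model's actual API; the real preprocessing is implemented in the [code repository](https://github.com/NVlabs/alpamayo).

```python
import torch

# Illustrative shapes only; the field names below are hypothetical and do not
# reflect the actual preprocessing code in the inference repository.
NUM_CAMERAS = 4         # front-wide, front-tele, cross-left, cross-right
NUM_FRAMES = 4          # 0.4-second history window at 10 Hz
HISTORY_WAYPOINTS = 16  # egomotion history at 10 Hz

sample = {
    # RGB frames after the processor downsamples 1080x1920 to 320x576
    "images": torch.zeros(NUM_CAMERAS, NUM_FRAMES, 3, 320, 576),
    "image_timestamps": torch.arange(NUM_FRAMES, dtype=torch.float32) / 10.0,
    # Egomotion history: per-waypoint translation (x, y, z) and 3x3 rotation R_rot
    "ego_translation": torch.zeros(HISTORY_WAYPOINTS, 3),
    "ego_rotation": torch.eye(3).expand(HISTORY_WAYPOINTS, 3, 3).clone(),
    "ego_timestamps": torch.arange(HISTORY_WAYPOINTS, dtype=torch.float32) / 10.0,
    # Free-form user command
    "command": "Drive straight and yield to crossing pedestrians.",
}
```
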
## Output(s):

**Output Type(s):** Text, Trajectory

**Output Format(s):**

- Text: String (Chain-of-Causation reasoning traces)
- Trajectory: Floating-point values `(x, y, z), R_rot`

**Output Parameters:**

- Text: One-dimensional (1D)
- Trajectory: Three-dimensional (3D) translation and nine-dimensional (9D, 3x3) rotation, multi-timestep

**Other Properties Related to Output:**
The model outputs a 6.4-second future trajectory (64 waypoints at 10 Hz) with position `(x, y, z)` and rotation matrix `R_rot` in the ego-vehicle coordinate frame.
Internally, the trajectory is represented as a sequence of dynamic actions (acceleration and curvature) following a unicycle model in bird's-eye-view (BEV) space.
Text reasoning traces are variable in length, describing driving decisions and causal factors.

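As a worked illustration of the internal action representation mentioned above, the sketch below integrates a sequence of (acceleration, curvature) actions with a generic unicycle model in BEV space to produce waypoints at 10 Hz. The notation and initial speed are assumptions for illustration; this is not the model's actual trajectory decoder.

```python
import numpy as np

def unicycle_rollout(accel, curvature, v0=5.0, dt=0.1):
    """Integrate (acceleration, curvature) actions into BEV waypoints.

    Generic unicycle kinematics: the state is (x, y, heading, speed), and the
    heading rate is curvature * speed. Assumed notation, for illustration only.
    """
    x = y = yaw = 0.0
    v = v0
    waypoints = []
    for a, kappa in zip(accel, curvature):
        v = max(v + a * dt, 0.0)   # speed update (clamped at zero)
        yaw += kappa * v * dt      # heading rate = curvature * speed
        x += v * np.cos(yaw) * dt
        y += v * np.sin(yaw) * dt
        waypoints.append((x, y, yaw))
    return np.array(waypoints)     # shape (T, 3): x, y, heading

# 64 actions at 10 Hz -> a 6.4-second trajectory, matching the output horizon.
traj = unicycle_rollout(accel=np.zeros(64), curvature=np.full(64, 0.01))
print(traj.shape)  # (64, 3)
```
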
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

## Software Integration:

**Runtime Engine(s):**

- PyTorch (minimum version: 2.8)
- Hugging Face Transformers (minimum version: 4.57.1)
- DeepSpeed (minimum version: 0.17.4)

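As a quick sanity check that an environment meets these minimum versions, a small script such as the following can be used. The pinning simply mirrors the list above, and the use of the `packaging` library is an assumption about the environment rather than a requirement of the inference code.

```python
from importlib.metadata import version

from packaging.version import Version

# Minimum versions taken from the list above.
MINIMUMS = {"torch": "2.8", "transformers": "4.57.1", "deepspeed": "0.17.4"}

for pkg, minimum in MINIMUMS.items():
    try:
        installed = version(pkg)
    except Exception:
        print(f"{pkg}: not installed")
        continue
    status = "OK" if Version(installed) >= Version(minimum) else f"needs >= {minimum}"
    print(f"{pkg}: {installed} ({status})")
```
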
**Supported Hardware Microarchitecture Compatibility:**

- NVIDIA GPUs with sufficient memory to load a 10B parameter model (minimum 1 GPU with at least 24GB of VRAM)

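As a rough back-of-the-envelope check on the 24GB figure: the roughly 10.5B parameters (8.2B backbone plus 2.3B action expert) occupy about 20 GiB when stored in a 16-bit format. The 16-bit assumption is ours, and activations and any KV cache add further overhead on top of the weights.

```python
# Rough VRAM estimate for the model weights, assuming 16-bit (2-byte) parameters.
params = 8.2e9 + 2.3e9           # backbone + action expert
weight_gib = params * 2 / 1024**3
print(f"~{weight_gib:.1f} GiB of weights")  # ~19.6 GiB, hence the 24GB+ minimum
```
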
**Preferred/Supported Operating System(s):**

- Linux (we have not tested on other operating systems)

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

## Model Version(s):

Alpamayo 1 10B v1.0 (trained)

The model can be integrated into autonomous driving software in the cloud for advanced end-to-end perception, reasoning, and motion planning.

## Training, Testing, and Evaluation Datasets:

### Training Dataset:

Alpamayo 1's training data comprises a mix of Chain-of-Causation (CoC) reasoning traces, Cosmos-Reason Physical AI datasets, and NVIDIA's internal proprietary autonomous driving data.

**Data Modality:**

- Image (multi-camera)
- Text (reasoning traces)
- Other: Trajectory data (egomotion, future waypoints)

**Image Training Data Size:** More than 1 Billion Images (from 80,000 hours of multi-camera driving data)

**Text Training Data Size:** Less than a Billion Tokens (700K CoC reasoning traces plus Cosmos-Reason training data)

**Video Training Data Size:** 10,000 to 1 Million Hours (80,000 hours)

**Non-Audio, Image, Text Training Data Size:** Trajectory data: 80,000 hours at 10Hz sampling rate

**Data Collection Method by dataset:** Hybrid: Automatic/Sensors (camera and vehicle sensors), Synthetic (VLM-generated reasoning)

**Labeling Method by dataset:** Hybrid: Human (structured CoC annotations), Automated (VLM-based auto-labeling), Automatic/Sensors (trajectory and egomotion)

**Properties:**
The dataset comprises 80,000 hours of multi-camera driving videos with corresponding egomotion and trajectory annotations.
It includes 700,000 Chain-of-Causation (CoC) reasoning traces that provide decision-grounded, causally linked explanations of driving behaviors.
Content includes machine-generated data from vehicle sensors (cameras, IMUs, and GPS) and synthetic reasoning traces.
CoC annotations are in English and use a structured format that links driving decisions to causal factors.
Sensors include RGB cameras (2-6 per vehicle), inertial measurement units, and GPS.

### Testing Dataset:

**Link:** Proprietary autonomous driving test datasets, closed-loop simulation, on-vehicle road tests.

**Data Collection Method by dataset:** Hybrid: Automatic/Sensors (real-world driving data), Synthetic (simulation scenarios)

**Labeling Method by dataset:** Hybrid: Automatic/Sensors, Human (ground truth verification)

**Properties:**
This dataset covers multi-camera driving scenarios with a particular focus on rare, long-tail events. It includes challenging cases such as complex intersections, cut-ins, pedestrian interactions, and adverse weather conditions. Data are collected from RGB cameras and vehicle sensors.

### Evaluation Dataset:

**Link:** Same as Testing Dataset.

**Data Collection Method by dataset:** Hybrid: Automatic/Sensors (real-world driving data), Synthetic (simulation scenarios)

**Labeling Method by dataset:** Hybrid: Automatic/Sensors, Human (ground truth verification)

**Properties:**
Evaluation focuses on rare, long-tail scenarios, including complex intersections, pedestrian crossings, vehicle cut-ins, and challenging weather and lighting conditions. Multi-camera sensor data are collected from RGB cameras.

**Quantitative Evaluation Benchmarks:**

- Closed-Loop Evaluation using [AlpaSim](https://github.com/NVlabs/alpasim) on 910 scenarios from the [PhysicalAI-AV-NuRec Dataset](https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles-NuRec): AlpaSim Score of 0.73 ± 0.01.
- Open-Loop Evaluation on 937 challenging samples from the [PhysicalAI-AV Dataset](https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles): minADE_6 at 6.4s of 1.22m.

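For reference, the open-loop metric above, minADE_6, is the average displacement error of the best of 6 predicted trajectories against the ground-truth trajectory over the 6.4-second horizon. Below is a minimal sketch under assumed array shapes (2D BEV waypoints); it is a generic definition of the metric, not the exact evaluation script.

```python
import numpy as np

def min_ade(pred, gt):
    """minADE_k: average displacement error of the best of k predictions.

    pred: (k, T, 2) predicted BEV waypoints; gt: (T, 2) ground-truth waypoints.
    Shapes and the 2D (x, y) convention are assumptions for illustration.
    """
    errors = np.linalg.norm(pred - gt[None], axis=-1)  # (k, T) per-waypoint distance
    ade_per_mode = errors.mean(axis=-1)                # (k,) averaged over the horizon
    return ade_per_mode.min()                          # best of the k modes

# Example: 6 sampled 6.4-second trajectories (64 waypoints at 10 Hz)
pred = np.random.randn(6, 64, 2)
gt = np.zeros((64, 2))
print(min_ade(pred, gt))
```
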
## Inference:

**Acceleration Engine:** PyTorch, Hugging Face Transformers

**Test Hardware:**

- Minimum: 1 GPU with 24GB+ VRAM (e.g., NVIDIA RTX 3090, RTX 3090 Ti, RTX 4090, A5000, or equivalent)
- Tested on: NVIDIA H100

For scripts related to model inference, please check out our [code repository](https://github.com/NVlabs/alpamayo).

## Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://app.intigriti.com/programs/nvidia/nvidiavdp/detail).