---
datasets:
- nvidia/PhysicalAI-Autonomous-Vehicles
- nvidia/PhysicalAI-Autonomous-Vehicles-NuRec
pipeline_tag: robotics
license: other
language:
- en
base_model:
- nvidia/Cosmos-Reason2-8B
---

# Alpamayo 1.5

[**Alpamayo**](https://www.nvidia.com/en-us/solutions/autonomous-vehicles/alpamayo/) | [**Code**](https://github.com/NVlabs/alpamayo1.5)

## Model Overview

### Description:

Alpamayo 1.5 is a significant update to NVIDIA’s open 10B-parameter chain-of-thought reasoning VLA model, designed to be an interactive and steerable reasoning engine for the AV community. Alpamayo 1.5 is built on the [Cosmos-Reason2](https://huggingface.co/nvidia/Cosmos-Reason2-8B) VLM backbone, is post-trained with reinforcement learning (RL), and introduces support for navigation guidance, flexible camera counts, and user question answering.

This model is ready for non-commercial use. Commercial licensing is available upon request.

### License:

The model weights are released under a [non-commercial license](./LICENSE).

The inference code is released under the [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) license.

### Deployment Geography:

Global

### Use Case:

Researchers and autonomous-driving practitioners developing and evaluating VLA models for driving scenarios, particularly for handling rare, long-tail events.

### Release Date:

Hugging Face 03/19/2026 via https://huggingface.co/nvidia/Alpamayo-1.5-10B

### Inference Code:

GitHub: https://github.com/NVlabs/alpamayo1.5

## Model Architecture:

**Architecture Type:** Transformer

**Network Architecture:** A VLA model based on Cosmos-Reason2 and featuring a diffusion-based trajectory decoder.

**This model was developed based on:** Cosmos-Reason2 (VLM backbone) with a diffusion-based action decoder

**Number of model parameters:**

- Backbone: 8.2B parameters
- Action Expert: 2.3B parameters

## Input(s):

**Input Type(s):** Image/Video, Text, Egomotion History

**Input Format(s):**

- Image: Red, Green, Blue (RGB)
- Text: String
- Egomotion History: Floating-point values `(x, y, z), R_rot`

**Input Parameters:**

- Image: Two-dimensional (2D), multi-camera, multi-timestep
- Text: One-dimensional (1D)
- Egomotion History: Three-dimensional (3D) translation and nine-dimensional (9D, 3x3) rotation, multi-timestep

**Other Properties Related to Input:**
Multi-camera images (4 cameras by default: front-wide, front-tele, cross-left, cross-right) with a 0.4-second history window at 10Hz (4 frames per camera), at a resolution of 1080x1920 pixels (the processor downsamples them to 320x576 pixels). Text inputs include user commands and navigation guidance. Images and egomotion history (16 waypoints at 10Hz) also require associated timestamps.
Note that the model is trained primarily, and has only been tested, under this setting.
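
To make the default configuration concrete, here is a minimal sketch of one input sample; the field names and array layout below are illustrative assumptions, not the actual processor interface (see the code repository for the real one):

```python
import numpy as np

NUM_CAMERAS = 4         # front-wide, front-tele, cross-left, cross-right
FRAMES_PER_CAMERA = 4   # 0.4-second history window at 10Hz
HISTORY_WAYPOINTS = 16  # egomotion history at 10Hz

# Hypothetical input bundle matching the shapes described above.
example_input = {
    # Raw RGB frames; the processor downsamples 1080x1920 to 320x576.
    "images": np.zeros((NUM_CAMERAS, FRAMES_PER_CAMERA, 1080, 1920, 3), dtype=np.uint8),
    # Capture timestamps (seconds) for every camera frame.
    "image_timestamps": np.zeros((NUM_CAMERAS, FRAMES_PER_CAMERA), dtype=np.float64),
    # Egomotion history: 3D translation (x, y, z) and 3x3 rotation R_rot per waypoint.
    "ego_translation": np.zeros((HISTORY_WAYPOINTS, 3), dtype=np.float32),
    "ego_rotation": np.zeros((HISTORY_WAYPOINTS, 3, 3), dtype=np.float32),
    "ego_timestamps": np.zeros((HISTORY_WAYPOINTS,), dtype=np.float64),
    # Free-form text: user command and/or navigation guidance.
    "text": "Follow the navigation route and prepare to turn right at the intersection.",
}
```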

## Output(s):

**Output Type(s):** Text, Trajectory

**Output Format(s):**

- Text: String (Chain-of-Causation reasoning traces or question answers)
- Trajectory: Floating-point values `(x, y, z), R_rot`

**Output Parameters:**

- Text: One-dimensional (1D)
- Trajectory: Three-dimensional (3D) translation and nine-dimensional (9D, 3x3) rotation, multi-timestep

**Other Properties Related to Output:**
Outputs a 6.4-second future trajectory (64 waypoints at 10Hz) with position `(x, y, z)` and rotation matrix `R_rot` in the ego vehicle coordinate frame.
Internally, the trajectory is represented as a sequence of dynamic actions (acceleration and curvature) following a unicycle model in bird's-eye-view (BEV) space.
Text reasoning traces and question answers are variable in length, describing driving decisions and causal factors.
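
To illustrate the action parameterization, the sketch below integrates per-step acceleration and curvature actions into BEV waypoints with a simple unicycle model; the step size, initial speed handling, and integration scheme here are assumptions for illustration, not the model's actual trajectory decoder:

```python
import numpy as np

def unicycle_rollout(accel, curvature, v0=0.0, dt=0.1):
    """Integrate acceleration (m/s^2) and curvature (1/m) actions into
    BEV waypoints (x, y, heading) using a simple unicycle model."""
    x, y, heading, v = 0.0, 0.0, 0.0, v0
    waypoints = []
    for a, k in zip(accel, curvature):
        v = max(v + a * dt, 0.0)       # update speed (no reversing in this sketch)
        heading += k * v * dt          # yaw rate = curvature * speed
        x += v * np.cos(heading) * dt
        y += v * np.sin(heading) * dt
        waypoints.append((x, y, heading))
    return np.array(waypoints)

# 64 actions at 10Hz give a 6.4-second trajectory, matching the output spec.
trajectory = unicycle_rollout(accel=np.full(64, 0.2), curvature=np.full(64, 0.01), v0=5.0)
print(trajectory.shape)  # (64, 3)
```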

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

## Software Integration:

**Runtime Engine(s):**

- PyTorch (minimum version: 2.8)
- Hugging Face Transformers (minimum version: 4.57.1)
- DeepSpeed (minimum version: 0.17.4)
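
A quick way to check a local environment against these minimum versions (a small sketch; it assumes the standard PyPI distribution names):

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions listed above; defer to the code repository if it pins others.
requirements = {"torch": "2.8", "transformers": "4.57.1", "deepspeed": "0.17.4"}

for package, minimum in requirements.items():
    try:
        print(f"{package}: installed {version(package)}, need >= {minimum}")
    except PackageNotFoundError:
        print(f"{package}: not installed, need >= {minimum}")
```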

**Supported Hardware Microarchitecture Compatibility:**

- NVIDIA GPUs with sufficient memory to load a 10B parameter model (minimum 1 GPU with at least 24GB of VRAM)

**Preferred/Supported Operating System(s):**

- Linux (we have not tested on other operating systems)

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

## Model Version(s):

Alpamayo 1.5 10B trained

The model can be integrated into autonomous driving software in the cloud for advanced end-to-end perception, reasoning, and motion planning.

## Training, Testing, and Evaluation Datasets:

### Training Dataset:

Alpamayo 1.5's training data comprises a mix of Chain of Causation (CoC) reasoning traces, Cosmos-Reason Physical AI datasets, NVIDIA's internal proprietary autonomous driving data, and public driving data.

**Data Modality:**

- Image (multi-camera)
- Text (reasoning traces, navigation guidance)
- Other: Trajectory data (egomotion, future waypoints)

**Image Training Data Size:** More than 1 Billion Images (from 80,000 hours of multi-camera driving data)

**Text Training Data Size:** Less than a Billion Tokens (3M CoC reasoning traces, Cosmos-Reason training data, and public datasets)

**Video Training Data Size:** 10,000 to 1 Million Hours (80,000 hours)

**Non-Audio, Image, Text Training Data Size:** Trajectory data: 80,000 hours at 10Hz sampling rate

**Data Collection Method by dataset:** Hybrid: Automatic/Sensors (camera and vehicle sensors), Synthetic (VLM-generated reasoning)

**Labeling Method by dataset:** Hybrid: Human (structured CoC annotations), Automated (VLM-based auto-labeling and heuristic rules), Automatic/Sensors (trajectory and egomotion)

**Properties:**
The dataset comprises 80,000 hours of multi-camera driving videos with corresponding egomotion and trajectory annotations.
It includes 3,000,000 Chain-of-Causation (CoC) reasoning traces that provide decision-grounded, causally linked explanations of driving behaviors.
Content includes machine-generated data from vehicle sensors (cameras, IMUs, and GPS) and synthetic reasoning traces.
CoC annotations are in English and use a structured format that links driving decisions to causal factors.
Sensors include RGB cameras (2-6 per vehicle), inertial measurement units, and GPS.

The training dataset also contains data from the following public datasets:

- CODA-LM
- Drive-Action
- DriveGPT4
- DriveLM
- LingoQA
- MapLM
- MM-AU
- NAVSIM-ReCogDrive
- NAVSIM-Traj
- nuInstruct
- nuScenesQA
- Omnidrive
- Roadwork
- Senna
- SUTD
- Talk2Car
- W3DA

### Testing Dataset:

**Link:** Proprietary autonomous driving test datasets, closed-loop simulation, on-vehicle road tests.

**Data Collection Method by dataset:** Hybrid: Automatic/Sensors (real-world driving data), Synthetic (simulation scenarios)

**Labeling Method by dataset:** Hybrid: Automatic/Sensors, Human (ground truth verification)

**Properties:**
This dataset covers multi-camera driving scenarios with a particular focus on rare, long-tail events. It includes challenging cases such as complex intersections, cut-ins, pedestrian interactions, and adverse weather conditions. Data are collected from RGB cameras and vehicle sensors.

### Evaluation Dataset:

**Link:** Same as Testing Dataset.

**Data Collection Method by dataset:** Hybrid: Automatic/Sensors (real-world driving data), Synthetic (simulation scenarios)

**Labeling Method by dataset:** Hybrid: Automatic/Sensors, Human (ground truth verification)

**Properties:**
Evaluation focuses on rare, long-tail scenarios, including complex intersections, pedestrian crossings, vehicle cut-ins, and challenging weather and lighting conditions. Multi-camera sensor data are collected from RGB cameras.

**Quantitative Evaluation Benchmarks:**

- Reasoning Evaluation using [LingoQA](https://github.com/wayveai/LingoQA): Lingo-Judge Score of 74.2.
- Closed-Loop Evaluation using [AlpaSim](https://github.com/NVlabs/alpasim) on 910 scenarios from the [PhysicalAI-AV-NuRec Dataset](https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles-NuRec): AlpaSim Score of 0.81 ± 0.01.
- Open-Loop Evaluation on 937 challenging samples from the [PhysicalAI-AV Dataset](https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles): minADE_6 at 6.4s of 1.11m (see the metric sketch below).
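
For reference, minADE_6 is typically the minimum, over 6 sampled trajectories, of the average displacement error against the ground-truth trajectory over the evaluation horizon (here 6.4 seconds). A small sketch of that computation in BEV (x, y); the array shapes are assumptions:

```python
import numpy as np

def min_ade(pred, gt):
    """minADE over K sampled trajectories.
    pred: (K, T, 2) predicted BEV waypoints; gt: (T, 2) ground truth."""
    # Average displacement error per sample, then keep the best sample.
    ade_per_sample = np.linalg.norm(pred - gt[None], axis=-1).mean(axis=-1)
    return ade_per_sample.min()

# Example: K=6 sampled trajectories of 64 waypoints (6.4 seconds at 10Hz).
pred = np.random.randn(6, 64, 2)
gt = np.random.randn(64, 2)
print(min_ade(pred, gt))
```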

## Inference:

**Acceleration Engine:** PyTorch, Hugging Face Transformers

**Test Hardware:**

- Minimum: 1 GPU with 24GB+ VRAM (e.g., NVIDIA RTX 3090, RTX 3090 Ti, RTX 4090, A5000, or equivalent)
- Tested on: NVIDIA H100

For scripts related to model inference, please check out our [code repository](https://github.com/NVlabs/alpamayo1.5).
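
As a starting point, the following is a heavily hedged sketch of loading the checkpoint with Hugging Face Transformers; whether the released weights load through `AutoModel` and what the exact inference call looks like are defined by the repository's scripts, so treat everything below as an assumption:

```python
import torch
from transformers import AutoModel

# Assumption: the checkpoint ships custom modeling code loadable via AutoModel
# with trust_remote_code; the repository's inference scripts are authoritative.
model = AutoModel.from_pretrained(
    "nvidia/Alpamayo-1.5-10B",
    torch_dtype=torch.bfloat16,  # half precision to fit a single 24GB+ GPU
    trust_remote_code=True,
).to("cuda")
model.eval()
```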

## Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://app.intigriti.com/programs/nvidia/nvidiavdp/detail).