---
datasets:
- nvidia/PhysicalAI-Autonomous-Vehicles
- nvidia/PhysicalAI-Autonomous-Vehicles-NuRec
pipeline_tag: robotics
library_name: transformers
license: other
language:
- en
new_version: nvidia/Alpamayo-1.5-10B
---

# Alpamayo 1

[**Code**](https://github.com/NVlabs/alpamayo) | [**Paper**](https://arxiv.org/abs/2511.00088)

_Note: Following the release of [NVIDIA Alpamayo](https://nvidianews.nvidia.com/news/alpamayo-autonomous-vehicle-development) at CES 2026, Alpamayo-R1 has been renamed to Alpamayo 1._

## Model Overview

### Description:

Alpamayo 1 (v1.0) is a vision-language-action (VLA) model developed by NVIDIA for autonomous driving. It integrates Chain-of-Causation reasoning with trajectory planning, bridging interpretable reasoning and precise vehicle control to improve decision-making in complex driving scenarios.

This model is ready for non-commercial use. Commercial licensing is available upon request.

### License:

The model weights are released under a [non-commercial license](./LICENSE).

The inference code is released under the [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) license.

### Deployment Geography:

Global

### Use Case:

Researchers and autonomous-driving practitioners who are developing and evaluating VLA models for autonomous-driving scenarios, particularly for handling rare, long-tail events.

### Release Date:

Hugging Face 12/03/2025 via this repository.

### Inference Code:

GitHub: https://github.com/NVlabs/alpamayo

## Reference:

[Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail](https://arxiv.org/abs/2511.00088)

## Model Architecture:

**Architecture Type:** Transformer

**Network Architecture:** A VLA model based on Cosmos-Reason and featuring a diffusion-based trajectory decoder.

**This model was developed based on:** Cosmos-Reason (VLM backbone) with a diffusion-based action decoder

**Number of model parameters:**

- Backbone: 8.2B parameters
- Action Expert: 2.3B parameters

## Input(s):

**Input Type(s):** Image/Video, Text, Egomotion History

**Input Format(s):**

- Image: Red, Green, Blue (RGB)
- Text: String
- Egomotion History: Floating-point values `(x, y, z), R_rot`

**Input Parameters:**

- Image: Two-dimensional (2D), multi-camera, multi-timestep
- Text: One-dimensional (1D)
- Egomotion History: Three-dimensional (3D) translation and nine-dimensional (9D, 3x3) rotation, multi-timestep

**Other Properties Related to Input:**
Multi-camera images from 4 cameras (front-wide, front-tele, cross-left, cross-right) with a 0.4-second history window at 10 Hz (4 frames per camera); the input resolution is 1080x1920 pixels, which the processor downsamples to 320x576 pixels. Text inputs include user commands. Images and egomotion history (16 waypoints at 10 Hz) also require associated timestamps.
Note that the model is trained primarily under this setting and has been tested only under this setting.

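To make the input specification above concrete, here is a minimal sketch of how one sample could be laid out with the shapes and rates described in this section. The field names and dict layout are illustrative assumptions, not the model's actual API; the real preprocessing is implemented in the [code repository](https://github.com/NVlabs/alpamayo).

```python
import torch

# Illustrative shapes only; the field names below are hypothetical and do not
# reflect the actual preprocessing code in the inference repository.
NUM_CAMERAS = 4         # front-wide, front-tele, cross-left, cross-right
NUM_FRAMES = 4          # 0.4-second history window at 10 Hz
HISTORY_WAYPOINTS = 16  # egomotion history at 10 Hz

sample = {
    # RGB frames after the processor downsamples 1080x1920 to 320x576
    "images": torch.zeros(NUM_CAMERAS, NUM_FRAMES, 3, 320, 576),
    "image_timestamps": torch.arange(NUM_FRAMES, dtype=torch.float32) / 10.0,
    # Egomotion history: per-waypoint translation (x, y, z) and 3x3 rotation R_rot
    "ego_translation": torch.zeros(HISTORY_WAYPOINTS, 3),
    "ego_rotation": torch.eye(3).expand(HISTORY_WAYPOINTS, 3, 3).clone(),
    "ego_timestamps": torch.arange(HISTORY_WAYPOINTS, dtype=torch.float32) / 10.0,
    # Free-form user command
    "command": "Drive straight and yield to crossing pedestrians.",
}
```
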
## Output(s):

**Output Type(s):** Text, Trajectory

**Output Format(s):**

- Text: String (Chain-of-Causation reasoning traces)
- Trajectory: Floating-point values `(x, y, z), R_rot`

**Output Parameters:**

- Text: One-dimensional (1D)
- Trajectory: Three-dimensional (3D) translation and nine-dimensional (9D, 3x3) rotation, multi-timestep

**Other Properties Related to Output:**
The model outputs a 6.4-second future trajectory (64 waypoints at 10 Hz) with position `(x, y, z)` and rotation matrix `R_rot` in the ego-vehicle coordinate frame.
Internally, the trajectory is represented as a sequence of dynamic actions (acceleration and curvature) following a unicycle model in bird's-eye-view (BEV) space.
Text reasoning traces are variable in length, describing driving decisions and causal factors.

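As a worked illustration of the internal action representation mentioned above, the sketch below integrates a sequence of (acceleration, curvature) actions with a generic unicycle model in BEV space to produce waypoints at 10 Hz. The notation and initial speed are assumptions for illustration; this is not the model's actual trajectory decoder.

```python
import numpy as np

def unicycle_rollout(accel, curvature, v0=5.0, dt=0.1):
    """Integrate (acceleration, curvature) actions into BEV waypoints.

    Generic unicycle kinematics: the state is (x, y, heading, speed), and the
    heading rate is curvature * speed. Assumed notation, for illustration only.
    """
    x = y = yaw = 0.0
    v = v0
    waypoints = []
    for a, kappa in zip(accel, curvature):
        v = max(v + a * dt, 0.0)   # speed update (clamped at zero)
        yaw += kappa * v * dt      # heading rate = curvature * speed
        x += v * np.cos(yaw) * dt
        y += v * np.sin(yaw) * dt
        waypoints.append((x, y, yaw))
    return np.array(waypoints)     # shape (T, 3): x, y, heading

# 64 actions at 10 Hz -> a 6.4-second trajectory, matching the output horizon.
traj = unicycle_rollout(accel=np.zeros(64), curvature=np.full(64, 0.01))
print(traj.shape)  # (64, 3)
```
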
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

## Software Integration:

**Runtime Engine(s):**

- PyTorch (minimum version: 2.8)
- Hugging Face Transformers (minimum version: 4.57.1)
- DeepSpeed (minimum version: 0.17.4)

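As a quick sanity check that an environment meets these minimum versions, a small script such as the following can be used. The pinning simply mirrors the list above, and the use of the `packaging` library is an assumption about the environment rather than a requirement of the inference code.

```python
from importlib.metadata import version

from packaging.version import Version

# Minimum versions taken from the list above.
MINIMUMS = {"torch": "2.8", "transformers": "4.57.1", "deepspeed": "0.17.4"}

for pkg, minimum in MINIMUMS.items():
    try:
        installed = version(pkg)
    except Exception:
        print(f"{pkg}: not installed")
        continue
    status = "OK" if Version(installed) >= Version(minimum) else f"needs >= {minimum}"
    print(f"{pkg}: {installed} ({status})")
```
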
**Supported Hardware Microarchitecture Compatibility:**

- NVIDIA GPUs with sufficient memory to load a 10B parameter model (minimum 1 GPU with at least 24GB of VRAM)

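As a rough back-of-the-envelope check on the 24GB figure: the roughly 10.5B parameters (8.2B backbone plus 2.3B action expert) occupy about 20 GiB when stored in a 16-bit format. The 16-bit assumption is ours, and activations and any KV cache add further overhead on top of the weights.

```python
# Rough VRAM estimate for the model weights, assuming 16-bit (2-byte) parameters.
params = 8.2e9 + 2.3e9           # backbone + action expert
weight_gib = params * 2 / 1024**3
print(f"~{weight_gib:.1f} GiB of weights")  # ~19.6 GiB, hence the 24GB+ minimum
```
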
**Preferred/Supported Operating System(s):**

- Linux (we have not tested on other operating systems)

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

## Model Version(s):

Alpamayo 1 10B v1.0 (trained)

The model can be integrated into autonomous driving software in the cloud for advanced end-to-end perception, reasoning, and motion planning.

## Training, Testing, and Evaluation Datasets:

### Training Dataset:

Alpamayo 1's training data comprises a mix of Chain-of-Causation (CoC) reasoning traces, Cosmos-Reason Physical AI datasets, and NVIDIA's internal proprietary autonomous driving data.

**Data Modality:**

- Image (multi-camera)
- Text (reasoning traces)
- Other: Trajectory data (egomotion, future waypoints)

**Image Training Data Size:** More than 1 Billion Images (from 80,000 hours of multi-camera driving data)

**Text Training Data Size:** Less than a Billion Tokens (700K CoC reasoning traces plus Cosmos-Reason training data)

**Video Training Data Size:** 10,000 to 1 Million Hours (80,000 hours)

**Non-Audio, Image, Text Training Data Size:** Trajectory data: 80,000 hours at 10Hz sampling rate

**Data Collection Method by dataset:** Hybrid: Automatic/Sensors (camera and vehicle sensors), Synthetic (VLM-generated reasoning)

**Labeling Method by dataset:** Hybrid: Human (structured CoC annotations), Automated (VLM-based auto-labeling), Automatic/Sensors (trajectory and egomotion)

**Properties:**
The dataset comprises 80,000 hours of multi-camera driving videos with corresponding egomotion and trajectory annotations.
It includes 700,000 Chain-of-Causation (CoC) reasoning traces that provide decision-grounded, causally linked explanations of driving behaviors.
Content includes machine-generated data from vehicle sensors (cameras, IMUs, and GPS) and synthetic reasoning traces.
CoC annotations are in English and use a structured format that links driving decisions to causal factors.
Sensors include RGB cameras (2-6 per vehicle), inertial measurement units, and GPS.

### Testing Dataset:

**Link:** Proprietary autonomous driving test datasets, closed-loop simulation, on-vehicle road tests.

**Data Collection Method by dataset:** Hybrid: Automatic/Sensors (real-world driving data), Synthetic (simulation scenarios)

**Labeling Method by dataset:** Hybrid: Automatic/Sensors, Human (ground truth verification)

**Properties:**
This dataset covers multi-camera driving scenarios with a particular focus on rare, long-tail events. It includes challenging cases such as complex intersections, cut-ins, pedestrian interactions, and adverse weather conditions. Data are collected from RGB cameras and vehicle sensors.

### Evaluation Dataset:

**Link:** Same as Testing Dataset.

**Data Collection Method by dataset:** Hybrid: Automatic/Sensors (real-world driving data), Synthetic (simulation scenarios)

**Labeling Method by dataset:** Hybrid: Automatic/Sensors, Human (ground truth verification)

**Properties:**
Evaluation focuses on rare, long-tail scenarios, including complex intersections, pedestrian crossings, vehicle cut-ins, and challenging weather and lighting conditions. Multi-camera sensor data are collected from RGB cameras.

**Quantitative Evaluation Benchmarks:**

- Closed-Loop Evaluation using [AlpaSim](https://github.com/NVlabs/alpasim) on 910 scenarios from the [PhysicalAI-AV-NuRec Dataset](https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles-NuRec): AlpaSim Score of 0.73 ± 0.01.
- Open-Loop Evaluation on 937 challenging samples from the [PhysicalAI-AV Dataset](https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles): minADE_6 at 6.4s of 1.22m.

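For reference, the open-loop metric above, minADE_6, is the average displacement error of the best of 6 predicted trajectories against the ground-truth trajectory over the 6.4-second horizon. Below is a minimal sketch under assumed array shapes (2D BEV waypoints); it is a generic definition of the metric, not the exact evaluation script.

```python
import numpy as np

def min_ade(pred, gt):
    """minADE_k: average displacement error of the best of k predictions.

    pred: (k, T, 2) predicted BEV waypoints; gt: (T, 2) ground-truth waypoints.
    Shapes and the 2D (x, y) convention are assumptions for illustration.
    """
    errors = np.linalg.norm(pred - gt[None], axis=-1)  # (k, T) per-waypoint distance
    ade_per_mode = errors.mean(axis=-1)                # (k,) averaged over the horizon
    return ade_per_mode.min()                          # best of the k modes

# Example: 6 sampled 6.4-second trajectories (64 waypoints at 10 Hz)
pred = np.random.randn(6, 64, 2)
gt = np.zeros((64, 2))
print(min_ade(pred, gt))
```
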
## Inference:

**Acceleration Engine:** PyTorch, Hugging Face Transformers

**Test Hardware:**

- Minimum: 1 GPU with 24GB+ VRAM (e.g., NVIDIA RTX 3090, RTX 3090 Ti, RTX 4090, A5000, or equivalent)
- Tested on: NVIDIA H100

For scripts related to model inference, please check out our [code repository](https://github.com/NVlabs/alpamayo).

## Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://app.intigriti.com/programs/nvidia/nvidiavdp/detail).