---
license: cc-by-nc-4.0
tags:
- depth-estimation
- computer-vision
- monocular-depth
- multi-view-geometry
- pose-estimation
library_name: depth-anything-3
pipeline_tag: depth-estimation
---

# Depth Anything 3: DA3NESTED-GIANT-LARGE

<div align="center">

[![Project Page](https://img.shields.io/badge/Project_Page-Depth_Anything_3-green)](https://depth-anything-3.github.io)
[![Paper](https://img.shields.io/badge/arXiv-Depth_Anything_3-red)](https://arxiv.org/abs/)
[![Demo](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Demo-blue)](https://huggingface.co/spaces/depth-anything/Depth-Anything-3)

</div>

## Model Description

The DA3 Nested model combines the any-view Giant model with the metric Large model for metric-scale visual geometry reconstruction. It is our recommended variant, covering all capabilities listed below.

| Property | Value |
|----------|-------|
| **Model Series** | Nested |
| **Parameters** | 1.40B |
| **License** | CC BY-NC 4.0 |

⚠️ **Non-commercial use only** due to the CC BY-NC 4.0 license.

## Capabilities

- ✅ Relative Depth
- ✅ Pose Estimation
- ✅ Pose Conditioning
- ✅ 3D Gaussians
- ✅ Metric Depth
- ✅ Sky Segmentation

## Quick Start

### Installation

```bash
git clone https://github.com/ByteDance-Seed/depth-anything-3
cd depth-anything-3
pip install -e .
```

### Basic Example

```python
import torch
from depth_anything_3.api import DepthAnything3

# Load model from Hugging Face Hub
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DepthAnything3.from_pretrained("depth-anything/da3nested-giant-large")
model = model.to(device=device)

# Run inference on images
images = ["image1.jpg", "image2.jpg"]  # List of image paths, PIL Images, or numpy arrays
prediction = model.inference(
    images,
    export_dir="output",
    export_format="glb",  # Options: glb, npz, ply, mini_npz, gs_ply, gs_video
)

# Access results
print(prediction.depth.shape)       # Depth maps: [N, H, W] float32
print(prediction.conf.shape)        # Confidence maps: [N, H, W] float32
print(prediction.extrinsics.shape)  # Camera poses (w2c): [N, 3, 4] float32
print(prediction.intrinsics.shape)  # Camera intrinsics: [N, 3, 3] float32
```
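
The returned depth, intrinsics, and world-to-camera extrinsics are enough to lift a view into a world-space point cloud. The snippet below is a minimal NumPy sketch, not part of the official API: it assumes the prediction fields convert cleanly to NumPy arrays with the shapes shown above (for torch tensors, call `.cpu().numpy()` first).

```python
import numpy as np

# Hedged sketch: back-project view 0 into world space, continuing from the
# example above. Documented shapes: depth [N, H, W], intrinsics [N, 3, 3],
# extrinsics [N, 3, 4] world-to-camera.
depth = np.asarray(prediction.depth[0])    # [H, W]
K = np.asarray(prediction.intrinsics[0])   # [3, 3]
E = np.asarray(prediction.extrinsics[0])   # [3, 4] = [R | t], world-to-camera
R, t = E[:, :3], E[:, 3]

H, W = depth.shape
v, u = np.mgrid[0:H, 0:W]                  # pixel rows (v) and columns (u)
pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)  # [H*W, 3]

# Camera-space points: X_cam = depth * K^-1 [u, v, 1]^T
pts_cam = (pix @ np.linalg.inv(K).T) * depth.reshape(-1, 1)

# Invert the w2c pose: X_world = R^T (X_cam - t)
pts_world = (pts_cam - t) @ R
print(pts_world.shape)  # (H*W, 3)
```

Looping this over all `N` views accumulates the per-view geometry into one cloud; for export, the built-in `glb`/`ply` formats above are likely the more convenient route.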

### Command Line Interface

```bash
# Process images with auto mode
da3 auto path/to/images \
  --export-format glb \
  --export-dir output \
  --model-dir depth-anything/da3nested-giant-large

# Use backend for faster repeated inference
da3 backend --model-dir depth-anything/da3nested-giant-large
da3 auto path/to/images --export-format glb --use-backend
```
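
When driving the CLI from scripts, the backend mode pays off once many folders share one loaded model. Below is a hypothetical batch loop using only the flags shown above; it assumes `da3 backend` is already running in a separate shell, and the `scenes/` layout is purely illustrative.

```python
import subprocess
from pathlib import Path

# Hypothetical batch driver: reuse the running backend across scene folders.
for scene in sorted(p for p in Path("scenes").iterdir() if p.is_dir()):
    subprocess.run(
        [
            "da3", "auto", str(scene),
            "--export-format", "glb",
            "--export-dir", f"output/{scene.name}",
            "--use-backend",
        ],
        check=True,  # raise if any scene fails
    )
```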

## Model Details

- **Developed by:** ByteDance Seed Team
- **Model Type:** Vision Transformer for Visual Geometry
- **Architecture:** Plain transformer with unified depth-ray representation
- **Training Data:** Public academic datasets only

### Key Insights

💎 A **single plain transformer** (e.g., vanilla DINO encoder) is sufficient as a backbone without architectural specialization.

✨ A singular **depth-ray representation** obviates the need for complex multi-task learning.

## Performance

🏆 Depth Anything 3 significantly outperforms:
- **Depth Anything 2** for monocular depth estimation
- **VGGT** for multi-view depth estimation and pose estimation

For detailed benchmarks, please refer to our [paper](https://depth-anything-3.github.io).

## Limitations

- The model is trained on academic datasets and may underperform on certain domain-specific images.
- Performance may vary with image quality, lighting conditions, and scene complexity.
- ⚠️ **Non-commercial use only** due to the CC BY-NC 4.0 license.

## Citation

If you find Depth Anything 3 useful in your research or projects, please cite:

```bibtex
@article{depthanything3,
  title={Depth Anything 3: Recovering the visual space from any views},
  author={Haotong Lin and Sili Chen and Jun Hao Liew and Donny Y. Chen and Zhenyu Li and Guang Shi and Jiashi Feng and Bingyi Kang},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2025}
}
```

## Links

- 🏠 [Project Page](https://depth-anything-3.github.io)
- 📄 [Paper](https://arxiv.org/abs/)
- 💻 [GitHub Repository](https://github.com/ByteDance-Seed/depth-anything-3)
- 🤗 [Hugging Face Demo](https://huggingface.co/spaces/depth-anything/Depth-Anything-3)
- 📚 [Documentation](https://github.com/ByteDance-Seed/depth-anything-3#-useful-documentation)

## Authors

[Haotong Lin](https://haotongl.github.io/) · [Sili Chen](https://github.com/SiliChen321) · [Junhao Liew](https://liewjunhao.github.io/) · [Donny Y. Chen](https://donydchen.github.io) · [Zhenyu Li](https://zhyever.github.io/) · [Guang Shi](https://scholar.google.com/citations?user=MjXxWbUAAAAJ&hl=en) · [Jiashi Feng](https://scholar.google.com.sg/citations?user=Q8iay0gAAAAJ&hl=en) · [Bingyi Kang](https://bingykang.github.io/)