---
license: cc-by-nc-4.0
tags:
- depth-estimation
- computer-vision
- monocular-depth
- multi-view-geometry
- pose-estimation
library_name: depth-anything-3
pipeline_tag: depth-estimation
---

# Depth Anything 3: DA3-GIANT

<div align="center">

[![Project Page](https://img.shields.io/badge/Project_Page-Depth_Anything_3-green)](https://depth-anything-3.github.io)
[![Paper](https://img.shields.io/badge/arXiv-Depth_Anything_3-red)](https://arxiv.org/abs/)
[![Demo](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Demo-blue)](https://huggingface.co/spaces/depth-anything/Depth-Anything-3)

</div>

## Model Description

DA3-Giant is the flagship foundation model of the Depth Anything 3 series. Built on a unified depth-ray representation, it performs multi-view depth estimation, camera pose estimation, and 3D Gaussian estimation.

| Property | Value |
|----------|-------|
| **Model Series** | Any-view Model |
| **Parameters** | 1.15B |
| **License** | CC BY-NC 4.0 |

⚠️ **Non-commercial use only** due to CC BY-NC 4.0 license.

## Capabilities

- ✅ Relative Depth
- ✅ Pose Estimation
- ✅ Pose Conditioning
- ✅ 3D Gaussians

## Quick Start

### Installation

```bash
git clone https://github.com/ByteDance-Seed/depth-anything-3
cd depth-anything-3
pip install -e .
```
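
To quickly confirm the editable install worked, a minimal import check (a sketch only; the import path is the one used in the Basic Example below):

```python
# Sanity check: the package should be importable after `pip install -e .`.
# The import path matches the Basic Example in this README.
from depth_anything_3.api import DepthAnything3

print(DepthAnything3.__name__)  # -> DepthAnything3
```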

### Basic Example

```python
import torch
from depth_anything_3.api import DepthAnything3

# Load model from Hugging Face Hub
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DepthAnything3.from_pretrained("depth-anything/da3-giant")
model = model.to(device=device)

# Run inference on images
images = ["image1.jpg", "image2.jpg"]  # List of image paths, PIL Images, or numpy arrays
prediction = model.inference(
    images,
    export_dir="output",
    export_format="glb",  # Options: glb, npz, ply, mini_npz, gs_ply, gs_video
)

# Access results
print(prediction.depth.shape)       # Depth maps: [N, H, W] float32
print(prediction.conf.shape)        # Confidence maps: [N, H, W] float32
print(prediction.extrinsics.shape)  # Camera poses (w2c): [N, 3, 4] float32
print(prediction.intrinsics.shape)  # Camera intrinsics: [N, 3, 3] float32
```
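
Since the prediction bundles per-view depth, intrinsics, and world-to-camera extrinsics, each view can be lifted to a world-space point cloud with standard pinhole geometry. Below is a minimal sketch, not part of the official API, assuming the fields are numpy arrays (convert with `.cpu().numpy()` first if your build returns torch tensors):

```python
import numpy as np

# Unproject the first view's depth map into world space using the shapes
# documented in the Basic Example above (standard pinhole camera model).
depth = prediction.depth[0]        # [H, W]
K = prediction.intrinsics[0]       # [3, 3]
w2c = prediction.extrinsics[0]     # [3, 4] world-to-camera [R | t]

H, W = depth.shape
u, v = np.meshgrid(np.arange(W), np.arange(H))
pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)

# Back-project pixels: X_cam = depth * K^-1 @ [u, v, 1]^T
rays = pixels @ np.linalg.inv(K).T
points_cam = rays * depth.reshape(-1, 1)

# w2c maps X_world -> R @ X_world + t, so X_world = R^T @ (X_cam - t)
R, t = w2c[:, :3], w2c[:, 3]
points_world = (points_cam - t) @ R
print(points_world.shape)          # (H * W, 3)
```

Low-confidence pixels can be filtered the same way by thresholding `prediction.conf` before unprojecting.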

### Command Line Interface

```bash
# Process images with auto mode
da3 auto path/to/images \
    --export-format glb \
    --export-dir output \
    --model-dir depth-anything/da3-giant

# Use backend for faster repeated inference
da3 backend --model-dir depth-anything/da3-giant
da3 auto path/to/images --export-format glb --use-backend
```

## Model Details

- **Developed by:** ByteDance Seed Team
- **Model Type:** Vision Transformer for Visual Geometry
- **Architecture:** Plain transformer with unified depth-ray representation
- **Training Data:** Public academic datasets only

### Key Insights

💎 A **single plain transformer** (e.g., vanilla DINO encoder) is sufficient as a backbone without architectural specialization.

✨ A singular **depth-ray representation** obviates the need for complex multi-task learning.
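
To make the second point concrete: pairing a per-pixel depth with a per-pixel camera ray means a 3D point is simply `origin + depth * direction`, so scene geometry and camera information fall out of one prediction target. The sketch below is illustrative only; all names and shapes are assumptions, not the model's actual output format:

```python
import numpy as np

# Illustration of a depth-ray pairing (hypothetical shapes, not DA3's API):
# each pixel carries a ray (origin, direction) and a depth along that ray.
H, W = 480, 640
depth = np.ones((H, W), dtype=np.float32)             # per-pixel depth
origins = np.zeros((H, W, 3), dtype=np.float32)       # per-pixel ray origins
dirs = np.random.randn(H, W, 3).astype(np.float32)    # per-pixel ray directions
dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)  # unit-normalize

points = origins + depth[..., None] * dirs            # [H, W, 3] scene points
```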

## Performance

🏆 Depth Anything 3 significantly outperforms:
- **Depth Anything 2** for monocular depth estimation
- **VGGT** for multi-view depth estimation and pose estimation

For detailed benchmarks, please refer to our [paper](https://depth-anything-3.github.io).

## Limitations

- The model is trained on public academic datasets and may underperform on domain-specific imagery outside that distribution
- Performance may vary with image quality, lighting conditions, and scene complexity
- ⚠️ **Non-commercial use only** due to CC BY-NC 4.0 license

## Citation

If you find Depth Anything 3 useful in your research or projects, please cite:

```bibtex
@article{depthanything3,
  title={Depth Anything 3: Recovering the visual space from any views},
  author={Haotong Lin and Sili Chen and Jun Hao Liew and Donny Y. Chen and Zhenyu Li and Guang Shi and Jiashi Feng and Bingyi Kang},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2025}
}
```

## Links

- 🏠 [Project Page](https://depth-anything-3.github.io)
- 📄 [Paper](https://arxiv.org/abs/)
- 💻 [GitHub Repository](https://github.com/ByteDance-Seed/depth-anything-3)
- 🤗 [Hugging Face Demo](https://huggingface.co/spaces/depth-anything/Depth-Anything-3)
- 📚 [Documentation](https://github.com/ByteDance-Seed/depth-anything-3#-useful-documentation)

## Authors

[Haotong Lin](https://haotongl.github.io/) · [Sili Chen](https://github.com/SiliChen321) · [Jun Hao Liew](https://liewjunhao.github.io/) · [Donny Y. Chen](https://donydchen.github.io) · [Zhenyu Li](https://zhyever.github.io/) · [Guang Shi](https://scholar.google.com/citations?user=MjXxWbUAAAAJ&hl=en) · [Jiashi Feng](https://scholar.google.com.sg/citations?user=Q8iay0gAAAAJ&hl=en) · [Bingyi Kang](https://bingykang.github.io/)