---
license: cc-by-nc-4.0
tags:
- depth-estimation
- computer-vision
- monocular-depth
- multi-view-geometry
- pose-estimation
library_name: depth-anything-3
pipeline_tag: depth-estimation
---

# Depth Anything 3: DA3NESTED-GIANT-LARGE

<div align="center">

[Project Page](https://depth-anything-3.github.io)
[Paper](https://arxiv.org/abs/)
[Hugging Face Demo](https://huggingface.co/spaces/depth-anything/Depth-Anything-3)

</div>

## Model Description

DA3NESTED-GIANT-LARGE is a nested model that combines the any-view Giant model with the metric Large model for metric-scale visual geometry reconstruction. It is our recommended model and combines all capabilities.

| Property | Value |
|----------|-------|
| **Model Series** | Nested |
| **Parameters** | 1.40B |
| **License** | CC BY-NC 4.0 |

⚠️ **Non-commercial use only** due to the CC BY-NC 4.0 license.

## Capabilities

- ✅ Relative Depth
- ✅ Pose Estimation
- ✅ Pose Conditioning
- ✅ 3D Gaussians
- ✅ Metric Depth
- ✅ Sky Segmentation

## Quick Start

### Installation

```bash
git clone https://github.com/ByteDance-Seed/depth-anything-3
cd depth-anything-3
pip install -e .
```

### Basic Example

```python
import torch
from depth_anything_3.api import DepthAnything3

# Load the model from the Hugging Face Hub
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DepthAnything3.from_pretrained("depth-anything/da3nested-giant-large")
model = model.to(device=device)

# Run inference on images
images = ["image1.jpg", "image2.jpg"]  # List of image paths, PIL Images, or numpy arrays
prediction = model.inference(
    images,
    export_dir="output",
    export_format="glb",  # Options: glb, npz, ply, mini_npz, gs_ply, gs_video
)

# Access results
print(prediction.depth.shape)       # Depth maps: [N, H, W] float32
print(prediction.conf.shape)        # Confidence maps: [N, H, W] float32
print(prediction.extrinsics.shape)  # Camera poses (w2c): [N, 3, 4] float32
print(prediction.intrinsics.shape)  # Camera intrinsics: [N, 3, 3] float32
```

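The returned depth maps, intrinsics, and world-to-camera extrinsics carry enough information to lift each view into a world-space point cloud. A minimal NumPy sketch of that unprojection (illustrative only; `unproject_depth` is a hypothetical helper, not part of the DA3 API):

```python
import numpy as np

def unproject_depth(depth, intrinsics, extrinsics):
    """Lift one depth map to world-space 3D points.

    depth:      [H, W] depth map
    intrinsics: [3, 3] camera intrinsic matrix K
    extrinsics: [3, 4] world-to-camera (w2c) pose [R | t]
    returns:    [H*W, 3] world-space points
    """
    H, W = depth.shape
    # Pixel grid in homogeneous coordinates (u, v, 1).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    # Back-project to camera space: X_cam = depth * K^-1 @ (u, v, 1).
    cam = (np.linalg.inv(intrinsics) @ pix.T).T * depth.reshape(-1, 1)
    # Invert the w2c pose: X_world = R^T @ (X_cam - t).
    R, t = extrinsics[:, :3], extrinsics[:, 3]
    return (cam - t) @ R  # row-vector form of R^T @ (X_cam - t)
```

For view `i`, `unproject_depth(prediction.depth[i], prediction.intrinsics[i], prediction.extrinsics[i])` would then give that view's points in the shared world frame; stacking all views yields the full reconstruction.
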
### Command Line Interface

```bash
# Process images with auto mode
da3 auto path/to/images \
    --export-format glb \
    --export-dir output \
    --model-dir depth-anything/da3nested-giant-large

# Use the backend for faster repeated inference
da3 backend --model-dir depth-anything/da3nested-giant-large
da3 auto path/to/images --export-format glb --use-backend
```

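Beyond the built-in export formats, the depth maps returned by the Python API can be saved as images for quick visual inspection. A minimal sketch using NumPy and Pillow (assumptions: Pillow is installed, and `depth_to_png` is a hypothetical helper, not part of the DA3 API or CLI):

```python
import numpy as np
from PIL import Image

def depth_to_png(depth, path):
    """Save a float depth map as a normalized 16-bit grayscale PNG."""
    d = depth.astype(np.float64)
    lo, hi = d.min(), d.max()
    # Normalize to [0, 1]; a constant map becomes all zeros.
    norm = (d - lo) / (hi - lo) if hi > lo else np.zeros_like(d)
    Image.fromarray((norm * 65535).astype(np.uint16)).save(path)
```

Note that this normalization is per-image, so pixel values are only comparable within a single PNG, not across views.
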
## Model Details

- **Developed by:** ByteDance Seed Team
- **Model Type:** Vision Transformer for Visual Geometry
- **Architecture:** Plain transformer with unified depth-ray representation
- **Training Data:** Public academic datasets only

### Key Insights

💎 A **single plain transformer** (e.g., a vanilla DINO encoder) is sufficient as a backbone; no architectural specialization is needed.

✨ A single **depth-ray representation** obviates the need for complex multi-task learning.

## Performance

🏆 Depth Anything 3 significantly outperforms:
- **Depth Anything 2** for monocular depth estimation
- **VGGT** for multi-view depth estimation and pose estimation

For detailed benchmarks, please refer to our [paper](https://depth-anything-3.github.io).

## Limitations

- The model is trained only on public academic datasets and may underperform on certain domain-specific images
- Performance may vary with image quality, lighting conditions, and scene complexity
- ⚠️ **Non-commercial use only** due to the CC BY-NC 4.0 license

## Citation

If you find Depth Anything 3 useful in your research or projects, please cite:

```bibtex
@article{depthanything3,
  title={Depth Anything 3: Recovering the visual space from any views},
  author={Haotong Lin and Sili Chen and Jun Hao Liew and Donny Y. Chen and Zhenyu Li and Guang Shi and Jiashi Feng and Bingyi Kang},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2025}
}
```

## Links

- 🏠 [Project Page](https://depth-anything-3.github.io)
- 📄 [Paper](https://arxiv.org/abs/)
- 💻 [GitHub Repository](https://github.com/ByteDance-Seed/depth-anything-3)
- 🤗 [Hugging Face Demo](https://huggingface.co/spaces/depth-anything/Depth-Anything-3)
- 📚 [Documentation](https://github.com/ByteDance-Seed/depth-anything-3#-useful-documentation)

## Authors

[Haotong Lin](https://haotongl.github.io/) · [Sili Chen](https://github.com/SiliChen321) · [Junhao Liew](https://liewjunhao.github.io/) · [Donny Y. Chen](https://donydchen.github.io) · [Zhenyu Li](https://zhyever.github.io/) · [Guang Shi](https://scholar.google.com/citations?user=MjXxWbUAAAAJ&hl=en) · [Jiashi Feng](https://scholar.google.com.sg/citations?user=Q8iay0gAAAAJ&hl=en) · [Bingyi Kang](https://bingykang.github.io/)
| 146 | |