---
license: cc-by-nc-4.0
tags:
- depth-estimation
- computer-vision
- monocular-depth
- multi-view-geometry
- pose-estimation
library_name: depth-anything-3
pipeline_tag: depth-estimation
---

# Depth Anything 3: DA3NESTED-GIANT-LARGE

<div align="center">

[Project Page](https://depth-anything-3.github.io)
[Paper](https://arxiv.org/abs/)
[Hugging Face Demo](https://huggingface.co/spaces/depth-anything/Depth-Anything-3)

</div>

## Model Description

DA3NESTED-GIANT-LARGE is a nested model that combines the any-view Giant model with the metric Large model for metric-scale visual geometry reconstruction. It is our recommended model and combines all capabilities.

| Property | Value |
|----------|-------|
| **Model Series** | Nested |
| **Parameters** | 1.40B |
| **License** | CC BY-NC 4.0 |

⚠️ **Non-commercial use only** due to the CC BY-NC 4.0 license.

## Capabilities

- ✅ Relative Depth
- ✅ Pose Estimation
- ✅ Pose Conditioning
- ✅ 3D Gaussians
- ✅ Metric Depth
- ✅ Sky Segmentation

## Quick Start

### Installation

```bash
git clone https://github.com/ByteDance-Seed/depth-anything-3
cd depth-anything-3
pip install -e .
```

### Basic Example

```python
import torch
from depth_anything_3.api import DepthAnything3

# Load the model from the Hugging Face Hub
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DepthAnything3.from_pretrained("depth-anything/da3nested-giant-large")
model = model.to(device=device)

# Run inference on images
images = ["image1.jpg", "image2.jpg"]  # List of image paths, PIL Images, or numpy arrays
prediction = model.inference(
    images,
    export_dir="output",
    export_format="glb",  # Options: glb, npz, ply, mini_npz, gs_ply, gs_video
)

# Access results
print(prediction.depth.shape)       # Depth maps: [N, H, W] float32
print(prediction.conf.shape)        # Confidence maps: [N, H, W] float32
print(prediction.extrinsics.shape)  # Camera poses (w2c): [N, 3, 4] float32
print(prediction.intrinsics.shape)  # Camera intrinsics: [N, 3, 3] float32
```

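The returned depth maps, intrinsics, and world-to-camera extrinsics carry enough information to lift each view into a world-space point cloud. A minimal NumPy sketch of that unprojection (illustrative only; `unproject_depth` is a hypothetical helper, not part of the DA3 API):

```python
import numpy as np

def unproject_depth(depth, intrinsics, extrinsics):
    """Lift one depth map to world-space 3D points.

    depth:      [H, W] depth map
    intrinsics: [3, 3] camera intrinsic matrix K
    extrinsics: [3, 4] world-to-camera (w2c) pose [R | t]
    returns:    [H*W, 3] world-space points
    """
    H, W = depth.shape
    # Pixel grid in homogeneous coordinates (u, v, 1).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    # Back-project to camera space: X_cam = depth * K^-1 @ (u, v, 1).
    cam = (np.linalg.inv(intrinsics) @ pix.T).T * depth.reshape(-1, 1)
    # Invert the w2c pose: X_world = R^T @ (X_cam - t).
    R, t = extrinsics[:, :3], extrinsics[:, 3]
    return (cam - t) @ R  # row-vector form of R^T @ (X_cam - t)
```

For view `i`, `unproject_depth(prediction.depth[i], prediction.intrinsics[i], prediction.extrinsics[i])` would then give that view's points in the shared world frame; stacking all views yields the full reconstruction.
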
### Command Line Interface

```bash
# Process images with auto mode
da3 auto path/to/images \
    --export-format glb \
    --export-dir output \
    --model-dir depth-anything/da3nested-giant-large

# Use the backend for faster repeated inference
da3 backend --model-dir depth-anything/da3nested-giant-large
da3 auto path/to/images --export-format glb --use-backend
```

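Beyond the built-in export formats, the depth maps returned by the Python API can be saved as images for quick visual inspection. A minimal sketch using NumPy and Pillow (assumptions: Pillow is installed, and `depth_to_png` is a hypothetical helper, not part of the DA3 API or CLI):

```python
import numpy as np
from PIL import Image

def depth_to_png(depth, path):
    """Save a float depth map as a normalized 16-bit grayscale PNG."""
    d = depth.astype(np.float64)
    lo, hi = d.min(), d.max()
    # Normalize to [0, 1]; a constant map becomes all zeros.
    norm = (d - lo) / (hi - lo) if hi > lo else np.zeros_like(d)
    Image.fromarray((norm * 65535).astype(np.uint16)).save(path)
```

Note that this normalization is per-image, so pixel values are only comparable within a single PNG, not across views.
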
## Model Details

- **Developed by:** ByteDance Seed Team
- **Model Type:** Vision Transformer for Visual Geometry
- **Architecture:** Plain transformer with unified depth-ray representation
- **Training Data:** Public academic datasets only

### Key Insights

💎 A **single plain transformer** (e.g., a vanilla DINO encoder) is sufficient as a backbone; no architectural specialization is needed.

✨ A single **depth-ray representation** obviates the need for complex multi-task learning.

## Performance

🏆 Depth Anything 3 significantly outperforms:
- **Depth Anything 2** for monocular depth estimation
- **VGGT** for multi-view depth estimation and pose estimation

For detailed benchmarks, please refer to our [paper](https://depth-anything-3.github.io).

## Limitations

- The model is trained only on public academic datasets and may underperform on certain domain-specific images
- Performance may vary with image quality, lighting conditions, and scene complexity
- ⚠️ **Non-commercial use only** due to the CC BY-NC 4.0 license

## Citation

If you find Depth Anything 3 useful in your research or projects, please cite:

```bibtex
@article{depthanything3,
  title={Depth Anything 3: Recovering the visual space from any views},
  author={Haotong Lin and Sili Chen and Jun Hao Liew and Donny Y. Chen and Zhenyu Li and Guang Shi and Jiashi Feng and Bingyi Kang},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2025}
}
```

## Links

- 🏠 [Project Page](https://depth-anything-3.github.io)
- 📄 [Paper](https://arxiv.org/abs/)
- 💻 [GitHub Repository](https://github.com/ByteDance-Seed/depth-anything-3)
- 🤗 [Hugging Face Demo](https://huggingface.co/spaces/depth-anything/Depth-Anything-3)
- 📚 [Documentation](https://github.com/ByteDance-Seed/depth-anything-3#-useful-documentation)

## Authors

[Haotong Lin](https://haotongl.github.io/) · [Sili Chen](https://github.com/SiliChen321) · [Junhao Liew](https://liewjunhao.github.io/) · [Donny Y. Chen](https://donydchen.github.io) · [Zhenyu Li](https://zhyever.github.io/) · [Guang Shi](https://scholar.google.com/citations?user=MjXxWbUAAAAJ&hl=en) · [Jiashi Feng](https://scholar.google.com.sg/citations?user=Q8iay0gAAAAJ&hl=en) · [Bingyi Kang](https://bingykang.github.io/)
| 146 | |