---
license: apache-2.0
tags:
- depth-estimation
- computer-vision
- monocular-depth
- multi-view-geometry
- pose-estimation
library_name: depth-anything-3
pipeline_tag: depth-estimation
---

# Depth Anything 3: DA3METRIC-LARGE

<div align="center">

[Project Page](https://depth-anything-3.github.io)
[Paper](https://arxiv.org/abs/)
[Hugging Face Demo](https://huggingface.co/spaces/depth-anything/Depth-Anything-3)

</div>

## Model Description

DA3 Metric Large is a monocular metric depth estimation model, intended for applications that require real-world scale. It predicts canonical metric depth; multiplying the canonical depth by the focal length recovers metric depth (see the sketch below the table).

| Property | Value |
|----------|-------|
| **Model Series** | Monocular Metric Depth |
| **Parameters** | 0.35B |
| **License** | Apache 2.0 |
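
As a concrete reading of the canonical-depth note above, the snippet below is a minimal sketch in plain NumPy (with hypothetical array names; it is not part of the `depth_anything_3` API): it recovers metric depth by multiplying a canonical depth map by the focal length taken from a pinhole intrinsics matrix.

```python
import numpy as np

# Hypothetical inputs: a canonical depth map and a pinhole intrinsics matrix K.
canonical_depth = np.ones((480, 640), dtype=np.float32)  # [H, W] canonical depth
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]], dtype=np.float32)

# "Multiplying by focal length gives metric depth" (see Model Description above).
fx = K[0, 0]
metric_depth = canonical_depth * fx  # [H, W] depth with real-world scale
```
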
## Capabilities

- ✅ Relative Depth
- ✅ Metric Depth
- ✅ Sky Segmentation

## Quick Start

### Installation

```bash
git clone https://github.com/ByteDance-Seed/depth-anything-3
cd depth-anything-3
pip install -e .
```

### Basic Example

```python
import torch
from depth_anything_3.api import DepthAnything3

# Load model from Hugging Face Hub
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DepthAnything3.from_pretrained("depth-anything/da3metric-large")
model = model.to(device=device)

# Run inference on images
images = ["image1.jpg", "image2.jpg"]  # List of image paths, PIL Images, or numpy arrays
prediction = model.inference(
    images,
    export_dir="output",
    export_format="glb"  # Options: glb, npz, ply, mini_npz, gs_ply, gs_video
)

# Access results
print(prediction.depth.shape)       # Depth maps: [N, H, W] float32
print(prediction.conf.shape)        # Confidence maps: [N, H, W] float32
print(prediction.extrinsics.shape)  # Camera poses (w2c): [N, 3, 4] float32
print(prediction.intrinsics.shape)  # Camera intrinsics: [N, 3, 3] float32
```
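
Because the prediction bundles per-view depth maps, intrinsics, and world-to-camera extrinsics, each view can be lifted into a world-space point cloud with standard pinhole geometry. The helper below is a minimal sketch in plain NumPy (`unproject_to_world` is a hypothetical name, not part of the `depth_anything_3` API) and assumes the depth values are already at the scale you want:

```python
import numpy as np

def unproject_to_world(depth, intrinsics, extrinsics):
    """Lift one H x W depth map into world-space points using the predicted camera.

    depth:      (H, W) depth map
    intrinsics: (3, 3) pinhole camera matrix K
    extrinsics: (3, 4) world-to-camera matrix [R | t]
    returns:    (H, W, 3) world-space points
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W, dtype=np.float32),
                       np.arange(H, dtype=np.float32))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)  # homogeneous pixel coords, (H, W, 3)
    rays = pix @ np.linalg.inv(intrinsics).T          # K^-1 [u, v, 1]^T per pixel
    cam_pts = rays * depth[..., None]                 # points in the camera frame
    R, t = extrinsics[:, :3], extrinsics[:, 3]
    return (cam_pts - t) @ R                          # R^T (X_cam - t): invert the w2c pose

# e.g. for the first view of the prediction above:
# points = unproject_to_world(prediction.depth[0], prediction.intrinsics[0], prediction.extrinsics[0])
```
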

### Command Line Interface

```bash
# Process images with auto mode
da3 auto path/to/images \
    --export-format glb \
    --export-dir output \
    --model-dir depth-anything/da3metric-large

# Use backend for faster repeated inference
da3 backend --model-dir depth-anything/da3metric-large
da3 auto path/to/images --export-format glb --use-backend
```

## Model Details

- **Developed by:** ByteDance Seed Team
- **Model Type:** Vision Transformer for Visual Geometry
- **Architecture:** Plain transformer with unified depth-ray representation
- **Training Data:** Public academic datasets only

### Key Insights

💎 A **single plain transformer** (e.g., a vanilla DINO encoder) is sufficient as a backbone, with no architectural specialization.

✨ A single **depth-ray representation** obviates the need for complex multi-task learning.

## Performance

🏆 Depth Anything 3 significantly outperforms:
- **Depth Anything 2** for monocular depth estimation
- **VGGT** for multi-view depth estimation and pose estimation

For detailed benchmarks, please refer to our [paper](https://depth-anything-3.github.io).

## Limitations

- The model is trained on academic datasets and may have limitations on certain domain-specific images
- Performance may vary depending on image quality, lighting conditions, and scene complexity

## Citation

If you find Depth Anything 3 useful in your research or projects, please cite:

```bibtex
@article{depthanything3,
  title={Depth Anything 3: Recovering the visual space from any views},
  author={Haotong Lin and Sili Chen and Jun Hao Liew and Donny Y. Chen and Zhenyu Li and Guang Shi and Jiashi Feng and Bingyi Kang},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2025}
}
```

## Links

- 🏠 [Project Page](https://depth-anything-3.github.io)
- 📄 [Paper](https://arxiv.org/abs/)
- 💻 [GitHub Repository](https://github.com/ByteDance-Seed/depth-anything-3)
- 🤗 [Hugging Face Demo](https://huggingface.co/spaces/depth-anything/Depth-Anything-3)
- 📚 [Documentation](https://github.com/ByteDance-Seed/depth-anything-3#-useful-documentation)

## Authors

[Haotong Lin](https://haotongl.github.io/) · [Sili Chen](https://github.com/SiliChen321) · [Junhao Liew](https://liewjunhao.github.io/) · [Donny Y. Chen](https://donydchen.github.io) · [Zhenyu Li](https://zhyever.github.io/) · [Guang Shi](https://scholar.google.com/citations?user=MjXxWbUAAAAJ&hl=en) · [Jiashi Feng](https://scholar.google.com.sg/citations?user=Q8iay0gAAAAJ&hl=en) · [Bingyi Kang](https://bingykang.github.io/)