- The Innovation: DA3 predicts “depth rays” instead of depth values — encoding both direction and distance in one output. This unifies depth estimation and camera pose recovery without separate prediction heads.
- Simplicity Beats Complexity: A plain DINOv2 transformer backbone outperforms specialized geometric architectures: 44% better camera pose accuracy and 25% better geometry than prior methods.
- Multi-View Made Practical: Process any number of unstructured photos together — no calibration needed. Four model sizes (26M to 1.2B) serve everything from mobile apps to 3D reconstruction.
Research Overview
Depth Anything 2 became one of the most widely-used models for monocular depth estimation, helping applications predict how far away objects are from a single image. Depth Anything 3 takes a major step forward: it handles multiple images at once, recovering full 3D geometry and camera positions without requiring pre-calibrated cameras.
DA2 processes one image at a time, predicting depth independently. DA3 processes any number of unstructured images together—photos taken from arbitrary angles, at different times, with unknown camera settings. No careful calibration or organized capture required. This means DA3 can figure out where each camera was positioned and build a consistent 3D model of the scene from casual smartphone photos or random video frames.
The surprising finding: architectural simplicity wins. While previous multi-view geometry methods used complex specialized architectures with multiple prediction heads and custom attention mechanisms, DA3 achieves better results with a plain transformer and a single unified prediction target called “depth rays.”
Key Results at a Glance
| Metric | DA3 vs Prior SOTA | What It Means |
|---|---|---|
| Camera Pose Accuracy | +44.3% | Much better at figuring out where cameras were |
| Geometric Accuracy | +25.1% | More precise 3D reconstruction |
| Metric Depth | +6.2% (91.7% δ₁) | Real-world distances in meters, not just relative |
| Monocular Depth | Matches DA2 | No sacrifice for single-image use |
| Training Data | 100% public | Fully reproducible |
Most depth models predict relative depth: “object A is twice as far as object B.” DA3 also offers metric depth: “object A is 4.2 meters away.” This is critical for robotics, autonomous vehicles, and AR applications where real-world measurements matter. DA3’s metric variant achieves 91.7% accuracy, a 6.2% improvement over UniDepthv2.
Figure: relative performance improvements of DA3 over the prior state of the art (VGGT) on the visual geometry benchmark.
Why Depth Estimation Matters
Every day, you effortlessly judge distances. You know that car is about 20 feet away. That coffee cup is within arm’s reach. Computers struggle with this because cameras capture flat 2D images, losing all depth information.
A photograph is a projection of 3D reality onto a 2D plane. An object 1 meter away and an object 10 meters away can appear the same size in a photo if one is proportionally larger. Without depth, computers cannot understand the true layout of a scene.
Depth estimation solves this by predicting a “depth map,” where each pixel has not just a color but also a distance value. This transforms flat images into 3D understanding.
Real-World Impact
Autonomous Vehicles rely on understanding distances to navigate safely. While many use expensive LiDAR sensors ($1,000-$75,000), camera-based depth estimation offers a cheaper alternative. Tesla’s approach, for instance, uses cameras extensively. Better depth estimation means safer and cheaper self-driving systems.
Robotics requires spatial awareness for any physical interaction. A robot arm picking up objects needs to know exactly where things are in 3D space. Depth estimation enables robots to operate from camera input alone.
AR/VR Applications must place virtual objects correctly in real environments. Without accurate depth, a virtual character might appear to float through a real table or sink into a real floor. Depth estimation makes mixed reality feel real.
Photography applications use depth for portrait mode (blurring backgrounds), 3D photo effects, and computational photography features that simulate expensive camera lenses.
Monocular vs Multi-View: A Critical Distinction
Monocular (single image): Uses learned cues like object sizes, perspective lines, and scene context. A car is typically 4-5 meters long, so its apparent size tells us roughly how far away it is. This works surprisingly well but has inherent ambiguity, because multiple 3D scenes could produce the same 2D image.
Multi-view (several images): Uses geometric relationships between views, similar to how human stereo vision works. If you photograph a scene from two positions, objects at different depths shift differently between the images (parallax). This geometric constraint enables much more precise depth measurement.
DA3 handles both cases in a unified framework. Give it one image, and it predicts monocular depth. Give it many images, and it leverages geometric relationships for more accurate results while also figuring out where each camera was positioned.
Understanding Depth Rays
The key innovation in DA3 is representing depth as “rays” rather than simple distance values. This seemingly small change has profound implications for how the model learns and performs.
Figure: each pixel predicts a ray with a direction (where it points) and a length (how far).
What is a Depth Ray?
Traditional depth estimation predicts a single number per pixel: the distance from the camera to the surface. A depth ray instead predicts two things:
- Direction: Which way in 3D space does this pixel point?
- Length: How far along that direction until we hit a surface?
Direction → Camera Pose: The direction each ray points depends entirely on where the camera is positioned and how it’s oriented. When DA3 predicts ray directions, it’s implicitly learning camera pose.
Length → Depth/Geometry: The ray length is the distance to the surface—the depth value. This is the geometry information we want.
By predicting rays, DA3 solves pose and depth with a single output, eliminating the need for separate prediction heads that compete for model capacity.
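To make the representation concrete, here is a minimal sketch (Python with NumPy, not DA3's actual API) of how per-pixel rays turn back into 3D points: start at the camera center and walk along each unit direction for the predicted length.

```python
# Minimal sketch (not DA3's actual API): turning per-pixel depth rays into
# 3D points. Assumes `directions` are unit vectors in a shared frame and
# `lengths` are distances to the surface along those rays.
import numpy as np

def rays_to_points(origin, directions, lengths):
    """origin: (3,) camera center; directions: (H, W, 3) unit rays;
    lengths: (H, W) distance to the surface along each ray."""
    return origin[None, None, :] + directions * lengths[..., None]

# Toy example: a 2x2 "image" whose rays all point down the +Z axis.
origin = np.zeros(3)
directions = np.tile(np.array([0.0, 0.0, 1.0]), (2, 2, 1))
lengths = np.array([[1.0, 2.0], [3.0, 4.0]])
points = rays_to_points(origin, directions, lengths)  # (2, 2, 3) point map
print(points[..., 2])  # z-coordinates equal the ray lengths here
```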
The Unification Benefit
Previous methods predicted multiple separate outputs:
- A depth map (distances)
- Camera poses (position and orientation)
- Sometimes correspondences (which pixels match across views)
Each output required its own prediction head, loss function, and often competed for model capacity. DA3’s ray representation encodes all this information naturally:
- Ray directions encode camera orientation
- Ray lengths encode depth
- Rays from different views intersecting in 3D space reveal correspondences
This unification means the model learns one thing well instead of juggling multiple competing objectives.
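As an illustration of why ray directions pin down orientation: the pixel rays in the camera frame are fixed by the intrinsics, so if the model predicts the same rays in a shared world frame, the rotation relating the two sets is the camera orientation. It can be recovered with an orthogonal Procrustes (Kabsch) fit. The sketch below is a generic solver, not DA3's camera head.

```python
# Generic rotation fit between matched unit vectors (Kabsch / orthogonal
# Procrustes); illustrates how ray directions imply camera orientation.
import numpy as np

def fit_rotation(cam_rays, world_rays):
    """cam_rays, world_rays: (N, 3) matched unit vectors.
    Returns R such that world_rays ~ cam_rays @ R.T (least squares)."""
    H = cam_rays.T @ world_rays
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    return Vt.T @ D @ U.T

# Sanity check: rays rotated by 90 degrees about the z-axis.
Rz = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
cam = np.array([[1.0, 0, 0], [0, 1.0, 0], [0, 0, 1.0], [0.6, 0.8, 0.0]])
world = cam @ Rz.T
print(np.allclose(fit_rotation(cam, world), Rz))  # True
```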
The DA3 Architecture
Radical Simplicity
DA3 uses DINOv2, a self-supervised vision transformer known for learning robust visual features. Unlike prior multi-view methods that heavily modified their backbones with geometric inductive biases, DA3 uses DINOv2 essentially unchanged—proving that a general-purpose vision encoder can excel at geometry tasks.
VGGT (the previous state-of-the-art) used specialized cross-view attention layers, separate pose estimation branches, and multi-task decoders. The assumption was that geometry tasks need geometric inductive biases baked into the architecture. DA3 proves this wrong.
The architecture has three simple components:
| Component | Parameters | Purpose |
|---|---|---|
| DINO Backbone | 86M - 1.13B | Extract visual features |
| DPT Head | 3M - 50M | Predict depth rays |
| Camera Head | 1M - 18M | Optional pose refinement |
The backbone processes all input images, the DPT head predicts depth rays for each pixel, and an optional camera head refines pose estimates when needed.
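A hypothetical sketch of that three-part layout, with illustrative module names and shapes rather than the released implementation:

```python
# Hypothetical sketch of the backbone + DPT head + optional camera head
# layout described above; names and shapes are illustrative, not DA3's code.
import torch
import torch.nn as nn

class DepthRayModel(nn.Module):
    def __init__(self, backbone, ray_head, camera_head=None):
        super().__init__()
        self.backbone = backbone        # e.g. a DINOv2 vision transformer
        self.ray_head = ray_head        # DPT-style dense head -> 4 channels per pixel
        self.camera_head = camera_head  # optional pose refinement

    def forward(self, images: torch.Tensor):
        """images: (batch, views, 3, H, W)."""
        b, v, c, h, w = images.shape
        feats = self.backbone(images.flatten(0, 1))         # (b*v, tokens, dim)
        rays = self.ray_head(feats)                          # (b*v, 4, H, W)
        directions = nn.functional.normalize(rays[:, :3], dim=1)  # unit ray directions
        lengths = rays[:, 3:].exp()                          # positive distances
        poses = self.camera_head(feats) if self.camera_head else None
        return (directions.reshape(b, v, 3, h, w),
                lengths.reshape(b, v, 1, h, w),
                poses)
```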
Attention: Within and Across Views
The transformer processes images in two alternating phases:
Within-Image Attention (first layers): Each image attends only to itself, building up local visual features like edges, textures, and object parts.
Cross-View Attention (later layers): Images attend to each other, learning correspondences and geometric relationships. A pixel showing a table corner in one image can attend to the same corner in another image, understanding how they relate spatially.
Starting with within-image attention builds strong local features before trying to match across views. Jumping straight to cross-view attention would be like trying to match puzzle pieces before looking at what’s on them. The 2:1 ratio (two within-image layers per cross-view layer) was found optimal.
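The two attention scopes differ only in how tokens are grouped into sequences. A minimal sketch, assuming tokens shaped batch × views × tokens-per-view × dim and sharing one attention layer purely for brevity (the real model would use separate layers):

```python
# Sketch of the two attention scopes. Within-image attention keeps each
# view as its own sequence; cross-view attention lets tokens from all
# views of a scene attend to each other.
import torch
import torch.nn as nn

def within_image_attention(x, attn):
    b, v, n, d = x.shape
    t = x.reshape(b * v, n, d)           # each view is its own sequence
    out, _ = attn(t, t, t)
    return out.reshape(b, v, n, d)

def cross_view_attention(x, attn):
    b, v, n, d = x.shape
    t = x.reshape(b, v * n, d)           # all views share one long sequence
    out, _ = attn(t, t, t)
    return out.reshape(b, v, n, d)

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = torch.randn(2, 3, 16, 64)            # 2 scenes, 3 views, 16 tokens each
y = cross_view_attention(within_image_attention(x, attn), attn)
print(y.shape)  # torch.Size([2, 3, 16, 64])
```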
Teacher-Student Training
DA3 uses a clever two-phase training approach that solves a fundamental challenge in geometry learning:
Ground-truth 3D geometry is expensive to collect. LiDAR sensors cost thousands of dollars and only work in specific conditions. Synthetic data has perfect depth but looks artificial. The internet has billions of photos but almost none with depth labels. How do you train a geometry model at scale?
Phase 1 (Steps 0-120K): Train on scenes with ground-truth depth labels. These come from synthetic datasets (where depth is known exactly) and real datasets with depth sensors. This gives the model a solid geometric foundation.
Phase 2 (Steps 120K-200K): The Phase 1 model becomes a “teacher” that predicts depth on millions of unlabeled real photos. These predictions become pseudo-labels for training a “student” model. The student learns from both the original labeled data and these pseudo-labels.
Unlike classification (where mislabeling “cat” as “dog” is clearly wrong), depth errors are often “close enough.” If the teacher predicts a wall is 3.1 meters away when it’s actually 3.0 meters, the student still learns useful geometry. The teacher’s relative accuracy matters more than absolute precision, and averaging over millions of examples smooths out individual errors.
This enables training on vastly more data than has ground-truth labels, improving generalization to diverse real-world scenes that don’t exist in synthetic datasets.
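A hedged sketch of the Phase 2 loop, with hypothetical dataloaders and loss function rather than the paper's training code:

```python
# Pseudo-labeling sketch: a frozen Phase-1 teacher labels unlabeled photos,
# and the student trains on a mix of real labels and those pseudo-labels.
# Dataloaders and loss are hypothetical and assumed to cycle indefinitely.
import torch

def train_student(student, teacher, labeled_loader, unlabeled_loader,
                  loss_fn, optimizer, steps=80_000):   # roughly Phase 2: steps 120K-200K
    teacher.eval()
    labeled_iter, unlabeled_iter = iter(labeled_loader), iter(unlabeled_loader)
    for step in range(steps):
        images, depth_gt = next(labeled_iter)          # sensor / synthetic labels
        raw = next(unlabeled_iter)                     # internet photos, no labels
        with torch.no_grad():
            pseudo_depth = teacher(raw)                # teacher prediction as target
        loss = loss_fn(student(images), depth_gt) + loss_fn(student(raw), pseudo_depth)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```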
Figure: training data composition, 680K+ scenes from public sources. Synthetic data dominates, enabling training without proprietary datasets.
Training at Scale
The Data Diet
DA3 trains on 680,000+ scenes from public academic datasets. The composition reveals interesting priorities:
Synthetic Data Dominates (82%): Generated scenes from Objaverse and Trellis provide perfect ground-truth depth. The diversity of 3D models creates varied training scenarios.
Real Sensor Data (7%): LiDAR scans and structured-light captures provide real-world accuracy for validation.
3D Reconstructions (4.5%): Multi-view stereo reconstructions add geometric diversity.
Synthetic data has perfect labels but a “domain gap” from reality (it doesn’t look quite real). However, with enough diversity and the teacher-student approach to handle real unlabeled images, this gap largely closes. The benefit of perfect labels outweighs the cost of synthetic artifacts.
Compute Requirements
Training the full DA3-Giant model requires:
- 128 H100 GPUs for about 10 days
- 200,000 training steps total
- 504×504 pixel base resolution
- 2-18 views sampled per training example
This is substantial but not unusual for foundation models. Importantly, the trained model is released publicly, so most users don’t need to retrain.
Benchmark Results
DA3 establishes a new benchmark covering diverse scene types, from synthetic rooms to outdoor environments to video sequences.
Figure: camera pose accuracy (AUC, higher is better) across five diverse scene types.
Understanding the Benchmarks
HiRoom (Synthetic): Computer-generated indoor scenes with perfect ground truth. Tests pure algorithmic performance.
ETH3D (Outdoor): Real outdoor scenes captured with high-precision laser scanners. Tests generalization to real environments.
DTU (Objects): Controlled captures of objects on turntables. Tests fine geometric detail.
7Scenes (Video): Video sequences from indoor environments. Tests temporal consistency.
ScanNet++ (Indoor): High-quality indoor scans. Tests practical indoor performance.
Why DA3 Wins
The improvements are consistent across all benchmarks, suggesting fundamental advances rather than benchmark-specific tuning:
- Unified representation means the model doesn’t waste capacity on conflicting objectives
- Simpler architecture trains more efficiently and generalizes better
- More training data from the teacher-student approach improves robustness
Choosing the Right Model
DA3 comes in four sizes, each trading accuracy for speed:
Figure: DA3 model variants and the size vs. speed trade-off. Larger models are more accurate but slower; choose based on your deployment needs.
Model Selection Guide
DA3-Small (26M parameters)
- Best for: Edge devices, mobile apps, real-time requirements
- Speed: 160 FPS on A100
- Capacity: 4,000+ images simultaneously
- Trade-off: Lowest accuracy, but still competitive
DA3-Base (105M parameters)
- Best for: Balanced applications, most robotics use cases
- Speed: 127 FPS on A100
- Capacity: 2,100+ images simultaneously
- Trade-off: Good balance of speed and accuracy
DA3-Large (355M parameters)
- Best for: Quality-focused applications, offline processing
- Speed: 78 FPS on A100
- Capacity: 1,500+ images simultaneously
- Trade-off: Higher accuracy, moderate speed
DA3-Giant (1.2B parameters)
- Best for: Maximum accuracy, research, high-end applications
- Speed: 38 FPS on A100
- Capacity: 950+ images simultaneously
- Trade-off: Highest accuracy, significant compute needs
The capacity figures indicate how many views you can process together on an 80GB GPU. For an autonomous vehicle with 8 cameras, even DA3-Giant is fine. For large-scale 3D scanning with many hundreds or thousands of images, you might need DA3-Base or smaller to fit in memory, as the sketch below illustrates.
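As a rough illustration, a helper like the one below could encode the figures above; the thresholds are the quoted capacities and throughputs, but the function itself is an assumption, not part of the DA3 release.

```python
# Illustrative variant picker based on the capacity/FPS figures quoted
# above (80 GB GPU); an assumption for illustration, not a DA3 utility.
VARIANTS = [                     # (name, params, fps_on_a100, max_views_80gb)
    ("DA3-Giant", "1.2B", 38, 950),
    ("DA3-Large", "355M", 78, 1500),
    ("DA3-Base", "105M", 127, 2100),
    ("DA3-Small", "26M", 160, 4000),
]

def pick_variant(num_views: int, min_fps: float = 0.0) -> str:
    """Largest variant that fits the view count and meets the FPS floor."""
    for name, _, fps, max_views in VARIANTS:   # ordered largest to smallest
        if num_views <= max_views and fps >= min_fps:
            return name
    raise ValueError("No variant satisfies these constraints")

print(pick_variant(num_views=8))                  # DA3-Giant (e.g. an 8-camera AV rig)
print(pick_variant(num_views=3000, min_fps=100))  # DA3-Small
```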
Practical Applications
Autonomous Driving
DA3 enables camera-based 3D perception that could reduce reliance on expensive sensors:
Multi-Camera Fusion: Most autonomous vehicles have 6-12 cameras. DA3 can process all views together, building a coherent 3D model of the surroundings.
Unknown Calibration: If a camera gets bumped or replaced, DA3 can still work without precise recalibration, improving system robustness.
Cost Reduction: High-quality depth from cameras could replace or supplement LiDAR in cost-sensitive applications.
Robotics and Manipulation
For robots interacting with the physical world:
Flexible Sensing: Works with whatever cameras are available, from stereo pairs to single cameras to arrays.
Real-Time Operation: DA3-Base at 127 FPS exceeds typical robot control loop requirements.
Manipulation Planning: Accurate depth enables precise grasping and placement of objects.
AR/VR and Spatial Computing
For mixed reality applications:
Instant Mapping: Process video frames to build 3D environment maps for AR content placement.
Casual 3D Capture: Walk around an object with a phone camera, and DA3 can reconstruct it in 3D.
Room-Scale Understanding: Process a few photos of a room to understand its layout for virtual furniture placement.
Drone Mapping and Surveying
For aerial applications:
GPS-Denied Operation: When GPS is unavailable (indoors, urban canyons), visual geometry provides positioning.
Flexible Flight Paths: No need for precisely planned survey patterns; DA3 handles arbitrary image collections.
Real-Time Feedback: Process images during flight to identify areas needing more coverage.
Video and Film Production
A critical advantage for video applications:
Temporal Consistency: When processing video frames, DA3 maintains consistent depth across time. Traditional per-frame depth estimation produces “flickering”—depth values that jump around frame-to-frame even on static surfaces. By processing multiple frames together, DA3 produces smooth, stable depth maps essential for VFX compositing and 3D video conversion.
Per-frame depth estimation treats each frame independently, so random model uncertainties cause visible flickering in the output. This makes effects like depth-based blurring, fog, or 3D conversions look jittery and unprofessional. DA3’s multi-view processing enforces geometric consistency across frames, eliminating this artifact.
VFX Integration: Stable depth maps enable realistic depth-of-field effects, atmospheric haze, and object insertion in post-production.
3D Video Conversion: Converting 2D content to 3D stereo requires consistent depth; flickering ruins the 3D effect and causes viewer discomfort.
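One simple way to quantify flicker, for the restricted case of a static camera and static scene, is the per-pixel standard deviation of depth across frames. This is a generic diagnostic, not something DA3 defines:

```python
# Illustrative flicker measurement (not part of DA3): for a static camera
# and static scene, depth at a given pixel should not change over time,
# so the per-pixel standard deviation across frames exposes flicker.
import numpy as np

def temporal_flicker(depth_frames: np.ndarray) -> float:
    """depth_frames: (T, H, W) depth maps from consecutive frames.
    Returns mean per-pixel std over time; lower means more stable."""
    return float(depth_frames.std(axis=0).mean())

stable = np.full((10, 4, 4), 3.0)                    # constant 3 m wall
noisy = stable + np.random.default_rng(0).normal(0, 0.05, stable.shape)
print(temporal_flicker(stable), temporal_flicker(noisy))  # 0.0 vs ~0.05
```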
Business Implications
This paper has significant ramifications across industries that depend on 3D understanding. Here’s what different stakeholders can expect:
For Autonomous Vehicle Companies
LiDAR Cost Reduction: High-quality depth from cameras could reduce or eliminate reliance on expensive LiDAR sensors ($1,000-$75,000 per unit). For companies producing millions of vehicles, this represents massive cost savings.
Robustness Without Recalibration: DA3’s ability to work without precise camera calibration means vehicles can continue operating safely even if cameras are bumped or replaced. This reduces maintenance costs and improves system reliability.
Multi-Camera Fusion: Processing 6-12 camera feeds together builds a coherent 3D model, enabling more accurate perception than single-camera approaches.
For Robotics Companies
Flexible Sensing Architecture: DA3 works with any camera configuration, from stereo pairs to arbitrary arrays. This flexibility reduces hardware constraints and enables robots to adapt to different environments with different sensor setups.
Real-Time Operation: DA3-Base at 127 FPS exceeds typical robot control loop requirements, enabling precise manipulation and navigation without custom depth hardware.
Cost-Effective Deployment: Camera-based depth is significantly cheaper than time-of-flight or structured-light sensors, making robots more economically viable for smaller applications.
For AR/VR Developers
Instant Environment Mapping: Process video frames in real-time to build 3D maps for AR content placement without specialized depth cameras.
Casual 3D Capture: Walk around an object with a smartphone, and DA3 reconstructs it in 3D, enabling user-generated 3D content without expensive scanners.
Room-Scale Understanding: Process a few photos of a room to understand its layout for virtual furniture placement or spatial computing applications.
For Film and VFX Studios
Temporal Consistency: Unlike per-frame depth estimation that causes flickering, DA3’s multi-view processing produces stable depth maps across video sequences, essential for VFX compositing.
Depth-Based Effects: Reliable depth enables realistic depth-of-field, atmospheric haze, and object insertion in post-production without manual rotoscoping.
3D Video Conversion: Converting 2D content to 3D stereo requires consistent depth; DA3’s stability makes this commercially viable for catalog conversion.
For Drone and Mapping Companies
GPS-Denied Navigation: When GPS is unavailable (indoors, urban canyons, underground), visual geometry provides positioning, expanding drone applications.
Flexible Flight Paths: No need for precisely planned survey patterns. DA3 handles arbitrary image collections, reducing operator training and mission planning overhead.
Real-Time Coverage Feedback: Process images during flight to identify areas needing more coverage, improving survey efficiency.
For Hardware Vendors
Potential LiDAR Disruption: If camera-based depth approaches LiDAR quality at a fraction of the cost, demand for expensive depth sensors may decline. LiDAR vendors may need to emphasize applications where camera limitations persist (extreme lighting, glass surfaces).
Limitations
Computational Cost
While simpler than prior architectures, DA3 still requires significant compute:
- DA3-Giant needs high-end GPUs for real-time operation
- Memory scales with number of input views
- Not yet suitable for microcontrollers or low-power devices
Scene Assumptions
DA3 assumes relatively cooperative conditions:
Static Scenes: Moving objects (people walking, cars driving) can confuse multi-view geometry. Single-frame mode still works, but multi-view benefits diminish.
Sufficient Texture: Featureless surfaces (white walls, clear glass) are challenging because there’s nothing to match across views.
Reasonable Image Quality: Severe motion blur, extreme exposure, or heavy compression degrade results.
Scale Ambiguity
Without absolute reference points, predicted depths are relative rather than metric: the same relative proportions could describe a dollhouse or a full-size room. Applications requiring metric depth need additional calibration or known reference objects.
To address this, DA3 provides a dedicated metric-depth variant for use cases where real-world measurements are essential:
DA3’s metric depth variant estimates absolute distances in meters. On the ETH3D benchmark, it achieves 91.7% accuracy (δ₁ metric), a 6.2% improvement over the previous best method UniDepthv2. This variant incorporates scale references during training, trading some generality for the ability to output real-world measurements directly.
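For reference, δ₁ is a standard depth-accuracy metric: the fraction of pixels whose predicted and true depths agree within a factor of 1.25. A minimal sketch:

```python
# delta_1 accuracy: fraction of pixels where max(pred/gt, gt/pred) < 1.25.
import numpy as np

def delta_1(pred: np.ndarray, gt: np.ndarray, thresh: float = 1.25) -> float:
    """pred, gt: positive depth maps of the same shape (e.g. meters)."""
    ratio = np.maximum(pred / gt, gt / pred)
    return float((ratio < thresh).mean())

gt = np.array([2.0, 4.0, 10.0])
pred = np.array([2.1, 5.5, 9.0])   # 5.5 / 4.0 = 1.375 fails the 1.25 test
print(delta_1(pred, gt))           # 0.666...
```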
Conclusion
Depth Anything 3 demonstrates that architectural simplicity can outperform complex specialized designs. By unifying depth and pose estimation into a single depth-ray representation and using a plain transformer backbone, DA3 achieves state-of-the-art results while being easier to understand, implement, and deploy.
Key Takeaways:
- Simplicity scales: A vanilla transformer outperforms specialized geometric architectures
- Unified representations work: Depth rays encode depth and pose together more effectively than separate predictions
- Public data suffices: 680K scenes from academic datasets enable state-of-the-art results without proprietary data
- Multiple sizes available: From 26M to 1.2B parameters, choose the right trade-off for your application
- Multi-view is practical: Processing many images together is now fast enough for real applications
For practitioners building depth-based applications, DA3 offers a versatile, well-documented solution with public code, weights, and training data. The simplicity of the approach makes it easier to adapt, debug, and deploy than more complex alternatives.
Original paper: Haotong Lin, Sili Chen, Junhao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang (2025). Depth Anything 3: Recovering the Visual Space from Any Views. arXiv.