- The Innovation: DA3 predicts “depth rays” instead of depth values — encoding both direction and distance in one output. This unifies depth estimation and camera pose recovery without separate prediction heads.
- Simplicity Beats Complexity: A plain DINOv2 transformer backbone outperforms specialized geometric architectures: 44% better camera pose accuracy and 25% better geometry than prior methods.
- Multi-View Made Practical: Process any number of unstructured photos together — no calibration needed. Four model sizes (26M to 1.2B) serve everything from mobile apps to 3D reconstruction.
Research Overview
Depth Anything 2 became one of the most widely-used models for monocular depth estimation, helping applications predict how far away objects are from a single image. Depth Anything 3 takes a major step forward: it handles multiple images at once, recovering full 3D geometry and camera positions without requiring pre-calibrated cameras.
DA2 processes one image at a time, predicting depth independently. DA3 processes any number of unstructured images together—photos taken from arbitrary angles, at different times, with unknown camera settings. No careful calibration or organized capture required. This means DA3 can figure out where each camera was positioned and build a consistent 3D model of the scene from casual smartphone photos or random video frames.
The surprising finding: architectural simplicity wins. While previous multi-view geometry methods used complex specialized architectures with multiple prediction heads and custom attention mechanisms, DA3 achieves better results with a plain transformer and a single unified prediction target called “depth rays.”
Key Results at a Glance
| Metric | DA3 vs Prior SOTA | What It Means |
|---|---|---|
| Camera Pose Accuracy | +44.3% | Much better at figuring out where cameras were |
| Geometric Accuracy | +25.1% | More precise 3D reconstruction |
| Metric Depth | +6.2% (91.7% δ₁) | Real-world distances in meters, not just relative |
| Monocular Depth | Matches DA2 | No sacrifice for single-image use |
| Training Data | 100% public | Fully reproducible |
Most depth models predict relative depth: “object A is twice as far as object B.” DA3 also offers metric depth: “object A is 4.2 meters away.” This is critical for robotics, autonomous vehicles, and AR applications where real-world measurements matter. DA3’s metric variant achieves 91.7% accuracy, a 6.2% improvement over UniDepthv2.
Figure: relative performance improvements of DA3 over the prior state of the art (VGGT) on the visual geometry benchmark.
Why Depth Estimation Matters
Every day, you effortlessly judge distances. You know that car is about 20 feet away. That coffee cup is within arm’s reach. Computers struggle with this because cameras capture flat 2D images, losing all depth information.
A photograph is a projection of 3D reality onto a 2D plane. An object 1 meter away and an object 10 meters away can appear the same size in a photo if one is proportionally larger. Without depth, computers cannot understand the true layout of a scene.
Depth estimation solves this by predicting a “depth map,” where each pixel has not just a color but also a distance value. This transforms flat images into 3D understanding.
Real-World Impact
Autonomous Vehicles rely on understanding distances to navigate safely. While many use expensive LiDAR sensors ($1,000-$75,000), camera-based depth estimation offers a cheaper alternative. Tesla’s approach, for instance, uses cameras extensively. Better depth estimation means safer and cheaper self-driving systems.
Robotics requires spatial awareness for any physical interaction. A robot arm picking up objects needs to know exactly where things are in 3D space. Depth estimation enables robots to operate from camera input alone.
AR/VR Applications must place virtual objects correctly in real environments. Without accurate depth, a virtual character might appear to float through a real table or sink into a real floor. Depth estimation makes mixed reality feel real.
Photography applications use depth for portrait mode (blurring backgrounds), 3D photo effects, and computational photography features that simulate expensive camera lenses.
Monocular vs Multi-View: A Critical Distinction
Monocular (single image): Uses learned cues like object sizes, perspective lines, and scene context. A car is typically 4-5 meters long, so its apparent size tells us roughly how far away it is. This works surprisingly well but has inherent ambiguity, because multiple 3D scenes could produce the same 2D image.
Multi-view (several images): Uses geometric relationships between views, similar to how human stereo vision works. If you photograph a scene from two positions, objects at different depths shift differently between the images (parallax). This geometric constraint enables much more precise depth measurement.
DA3 handles both cases in a unified framework. Give it one image, and it predicts monocular depth. Give it many images, and it leverages geometric relationships for more accurate results while also figuring out where each camera was positioned.
Understanding Depth Rays
The key innovation in DA3 is representing depth as “rays” rather than simple distance values. This seemingly small change has profound implications for how the model learns and performs.
Figure: each pixel predicts a ray with a direction (where it points) and a length (how far).
What is a Depth Ray?
Traditional depth estimation predicts a single number per pixel: the distance from the camera to the surface. A depth ray instead predicts two things:
- Direction: Which way in 3D space does this pixel point?
- Length: How far along that direction until we hit a surface?
Direction → Camera Pose: The direction each ray points depends entirely on where the camera is positioned and how it’s oriented. When DA3 predicts ray directions, it’s implicitly learning camera pose.
Length → Depth/Geometry: The ray length is the distance to the surface—the depth value. This is the geometry information we want.
By predicting rays, DA3 solves pose and depth with a single output, eliminating the need for separate prediction heads that compete for model capacity.
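To make the representation concrete, here is a minimal sketch (Python with NumPy, not DA3's actual API) of how per-pixel rays turn back into 3D points: start at the camera center and walk along each unit direction for the predicted length.

```python
# Minimal sketch (not DA3's actual API): turning per-pixel depth rays into
# 3D points. Assumes `directions` are unit vectors in a shared frame and
# `lengths` are distances to the surface along those rays.
import numpy as np

def rays_to_points(origin, directions, lengths):
    """origin: (3,) camera center; directions: (H, W, 3) unit rays;
    lengths: (H, W) distance to the surface along each ray."""
    return origin[None, None, :] + directions * lengths[..., None]

# Toy example: a 2x2 "image" whose rays all point down the +Z axis.
origin = np.zeros(3)
directions = np.tile(np.array([0.0, 0.0, 1.0]), (2, 2, 1))
lengths = np.array([[1.0, 2.0], [3.0, 4.0]])
points = rays_to_points(origin, directions, lengths)  # (2, 2, 3) point map
print(points[..., 2])  # z-coordinates equal the ray lengths here
```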
The Unification Benefit
Previous methods predicted multiple separate outputs:
- A depth map (distances)
- Camera poses (position and orientation)
- Sometimes correspondences (which pixels match across views)
Each output required its own prediction head, loss function, and often competed for model capacity. DA3’s ray representation encodes all this information naturally:
- Ray directions encode camera orientation
- Ray lengths encode depth
- Rays from different views intersecting in 3D space reveal correspondences
This unification means the model learns one thing well instead of juggling multiple competing objectives.
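As an illustration of why ray directions pin down orientation: the pixel rays in the camera frame are fixed by the intrinsics, so if the model predicts the same rays in a shared world frame, the rotation relating the two sets is the camera orientation. It can be recovered with an orthogonal Procrustes (Kabsch) fit. The sketch below is a generic solver, not DA3's camera head.

```python
# Generic rotation fit between matched unit vectors (Kabsch / orthogonal
# Procrustes); illustrates how ray directions imply camera orientation.
import numpy as np

def fit_rotation(cam_rays, world_rays):
    """cam_rays, world_rays: (N, 3) matched unit vectors.
    Returns R such that world_rays ~ cam_rays @ R.T (least squares)."""
    H = cam_rays.T @ world_rays
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    return Vt.T @ D @ U.T

# Sanity check: rays rotated by 90 degrees about the z-axis.
Rz = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
cam = np.array([[1.0, 0, 0], [0, 1.0, 0], [0, 0, 1.0], [0.6, 0.8, 0.0]])
world = cam @ Rz.T
print(np.allclose(fit_rotation(cam, world), Rz))  # True
```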
The DA3 Architecture
Radical Simplicity
DA3 uses DINOv2, a self-supervised vision transformer known for learning robust visual features. Unlike prior multi-view methods that heavily modified their backbones with geometric inductive biases, DA3 uses DINOv2 essentially unchanged—proving that a general-purpose vision encoder can excel at geometry tasks.
VGGT (the previous state-of-the-art) used specialized cross-view attention layers, separate pose estimation branches, and multi-task decoders. The assumption was that geometry tasks need geometric inductive biases baked into the architecture. DA3 proves this wrong.
The architecture has three simple components:
| Component | Parameters | Purpose |
|---|---|---|
| DINO Backbone | 86M - 1.13B | Extract visual features |
| DPT Head | 3M - 50M | Predict depth rays |
| Camera Head | 1M - 18M | Optional pose refinement |
The backbone processes all input images, the DPT head predicts depth rays for each pixel, and an optional camera head refines pose estimates when needed.
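A hypothetical sketch of that three-part layout, with illustrative module names and shapes rather than the released implementation:

```python
# Hypothetical sketch of the backbone + DPT head + optional camera head
# layout described above; names and shapes are illustrative, not DA3's code.
import torch
import torch.nn as nn

class DepthRayModel(nn.Module):
    def __init__(self, backbone, ray_head, camera_head=None):
        super().__init__()
        self.backbone = backbone        # e.g. a DINOv2 vision transformer
        self.ray_head = ray_head        # DPT-style dense head -> 4 channels per pixel
        self.camera_head = camera_head  # optional pose refinement

    def forward(self, images: torch.Tensor):
        """images: (batch, views, 3, H, W)."""
        b, v, c, h, w = images.shape
        feats = self.backbone(images.flatten(0, 1))         # (b*v, tokens, dim)
        rays = self.ray_head(feats)                          # (b*v, 4, H, W)
        directions = nn.functional.normalize(rays[:, :3], dim=1)  # unit ray directions
        lengths = rays[:, 3:].exp()                          # positive distances
        poses = self.camera_head(feats) if self.camera_head else None
        return (directions.reshape(b, v, 3, h, w),
                lengths.reshape(b, v, 1, h, w),
                poses)
```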
Attention: Within and Across Views
The transformer processes images in two alternating phases:
Within-Image Attention (first layers): Each image attends only to itself, building up local visual features like edges, textures, and object parts.
Cross-View Attention (later layers): Images attend to each other, learning correspondences and geometric relationships. A pixel showing a table corner in one image can attend to the same corner in another image, understanding how they relate spatially.
Starting with within-image attention builds strong local features before trying to match across views. Jumping straight to cross-view attention would be like trying to match puzzle pieces before looking at what’s on them. The 2:1 ratio (two within-image layers per cross-view layer) was found optimal.
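The two attention scopes differ only in how tokens are grouped into sequences. A minimal sketch, assuming tokens shaped batch × views × tokens-per-view × dim and sharing one attention layer purely for brevity (the real model would use separate layers):

```python
# Sketch of the two attention scopes. Within-image attention keeps each
# view as its own sequence; cross-view attention lets tokens from all
# views of a scene attend to each other.
import torch
import torch.nn as nn

def within_image_attention(x, attn):
    b, v, n, d = x.shape
    t = x.reshape(b * v, n, d)           # each view is its own sequence
    out, _ = attn(t, t, t)
    return out.reshape(b, v, n, d)

def cross_view_attention(x, attn):
    b, v, n, d = x.shape
    t = x.reshape(b, v * n, d)           # all views share one long sequence
    out, _ = attn(t, t, t)
    return out.reshape(b, v, n, d)

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = torch.randn(2, 3, 16, 64)            # 2 scenes, 3 views, 16 tokens each
y = cross_view_attention(within_image_attention(x, attn), attn)
print(y.shape)  # torch.Size([2, 3, 16, 64])
```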
Teacher-Student Training
DA3 uses a clever two-phase training approach that solves a fundamental challenge in geometry learning:
Ground-truth 3D geometry is expensive to collect. LiDAR sensors cost thousands of dollars and only work in specific conditions. Synthetic data has perfect depth but looks artificial. The internet has billions of photos but almost none with depth labels. How do you train a geometry model at scale?
Phase 1 (Steps 0-120K): Train on scenes with ground-truth depth labels. These come from synthetic datasets (where depth is known exactly) and real datasets with depth sensors. This gives the model a solid geometric foundation.
Phase 2 (Steps 120K-200K): The Phase 1 model becomes a “teacher” that predicts depth on millions of unlabeled real photos. These predictions become pseudo-labels for training a “student” model. The student learns from both the original labeled data and these pseudo-labels.
Unlike classification (where mislabeling “cat” as “dog” is clearly wrong), depth errors are often “close enough.” If the teacher predicts a wall is 3.1 meters away when it’s actually 3.0 meters, the student still learns useful geometry. The teacher’s relative accuracy matters more than absolute precision, and averaging over millions of examples smooths out individual errors.
This enables training on vastly more data than has ground-truth labels, improving generalization to diverse real-world scenes that don’t exist in synthetic datasets.
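A hedged sketch of the Phase 2 loop, with hypothetical dataloaders and loss function rather than the paper's training code:

```python
# Pseudo-labeling sketch: a frozen Phase-1 teacher labels unlabeled photos,
# and the student trains on a mix of real labels and those pseudo-labels.
# Dataloaders and loss are hypothetical and assumed to cycle indefinitely.
import torch

def train_student(student, teacher, labeled_loader, unlabeled_loader,
                  loss_fn, optimizer, steps=80_000):   # roughly Phase 2: steps 120K-200K
    teacher.eval()
    labeled_iter, unlabeled_iter = iter(labeled_loader), iter(unlabeled_loader)
    for step in range(steps):
        images, depth_gt = next(labeled_iter)          # sensor / synthetic labels
        raw = next(unlabeled_iter)                     # internet photos, no labels
        with torch.no_grad():
            pseudo_depth = teacher(raw)                # teacher prediction as target
        loss = loss_fn(student(images), depth_gt) + loss_fn(student(raw), pseudo_depth)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```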
Figure: training data composition, 680K+ scenes from public sources. Synthetic data dominates, enabling training without proprietary datasets.
Training at Scale
The Data Diet
DA3 trains on 680,000+ scenes from public academic datasets. The composition reveals interesting priorities:
Synthetic Data Dominates (82%): Generated scenes from Objaverse and Trellis provide perfect ground-truth depth. The diversity of 3D models creates varied training scenarios.
Real Sensor Data (7%): LiDAR scans and structured-light captures provide real-world accuracy for validation.
3D Reconstructions (4.5%): Multi-view stereo reconstructions add geometric diversity.
Synthetic data has perfect labels but a “domain gap” from reality (it doesn’t look quite real). However, with enough diversity and the teacher-student approach to handle real unlabeled images, this gap largely closes. The benefit of perfect labels outweighs the cost of synthetic artifacts.
Compute Requirements
Training the full DA3-Giant model requires:
- 128 H100 GPUs for about 10 days
- 200,000 training steps total
- 504×504 pixel base resolution
- 2-18 views sampled per training example
This is substantial but not unusual for foundation models. Importantly, the trained model is released publicly, so most users don’t need to retrain.
Benchmark Results
DA3 establishes a new benchmark covering diverse scene types, from synthetic rooms to outdoor environments to video sequences.
Figure: camera pose accuracy (AUC, higher is better) across five diverse scene types.
Understanding the Benchmarks
HiRoom (Synthetic): Computer-generated indoor scenes with perfect ground truth. Tests pure algorithmic performance.
ETH3D (Outdoor): Real outdoor scenes captured with high-precision laser scanners. Tests generalization to real environments.
DTU (Objects): Controlled captures of objects on turntables. Tests fine geometric detail.
7Scenes (Video): Video sequences from indoor environments. Tests temporal consistency.
ScanNet++ (Indoor): High-quality indoor scans. Tests practical indoor performance.
Why DA3 Wins
The improvements are consistent across all benchmarks, suggesting fundamental advances rather than benchmark-specific tuning:
- Unified representation means the model doesn’t waste capacity on conflicting objectives
- Simpler architecture trains more efficiently and generalizes better
- More training data from the teacher-student approach improves robustness
Choosing the Right Model
DA3 comes in four sizes, each trading accuracy for speed:
Figure: DA3 model variants and the size vs. speed trade-off. Larger models are more accurate but slower; choose based on your deployment needs.
Model Selection Guide
DA3-Small (26M parameters)
- Best for: Edge devices, mobile apps, real-time requirements
- Speed: 160 FPS on A100
- Capacity: 4,000+ images simultaneously
- Trade-off: Lowest accuracy, but still competitive
DA3-Base (105M parameters)
- Best for: Balanced applications, most robotics use cases
- Speed: 127 FPS on A100
- Capacity: 2,100+ images simultaneously
- Trade-off: Good balance of speed and accuracy
DA3-Large (355M parameters)
- Best for: Quality-focused applications, offline processing
- Speed: 78 FPS on A100
- Capacity: 1,500+ images simultaneously
- Trade-off: Higher accuracy, moderate speed
DA3-Giant (1.2B parameters)
- Best for: Maximum accuracy, research, high-end applications
- Speed: 38 FPS on A100
- Capacity: 950+ images simultaneously
- Trade-off: Highest accuracy, significant compute needs
The capacity figures indicate how many views you can process together on an 80GB GPU. For an autonomous vehicle with 8 cameras, even DA3-Giant is fine. For large-scale 3D scanning with many hundreds or thousands of images, you might need DA3-Base or smaller to fit in memory, as the sketch below illustrates.
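As a rough illustration, a helper like the one below could encode the figures above; the thresholds are the quoted capacities and throughputs, but the function itself is an assumption, not part of the DA3 release.

```python
# Illustrative variant picker based on the capacity/FPS figures quoted
# above (80 GB GPU); an assumption for illustration, not a DA3 utility.
VARIANTS = [                     # (name, params, fps_on_a100, max_views_80gb)
    ("DA3-Giant", "1.2B", 38, 950),
    ("DA3-Large", "355M", 78, 1500),
    ("DA3-Base", "105M", 127, 2100),
    ("DA3-Small", "26M", 160, 4000),
]

def pick_variant(num_views: int, min_fps: float = 0.0) -> str:
    """Largest variant that fits the view count and meets the FPS floor."""
    for name, _, fps, max_views in VARIANTS:   # ordered largest to smallest
        if num_views <= max_views and fps >= min_fps:
            return name
    raise ValueError("No variant satisfies these constraints")

print(pick_variant(num_views=8))                  # DA3-Giant (e.g. an 8-camera AV rig)
print(pick_variant(num_views=3000, min_fps=100))  # DA3-Small
```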
Practical Applications
Autonomous Driving
DA3 enables camera-based 3D perception that could reduce reliance on expensive sensors:
Multi-Camera Fusion: Most autonomous vehicles have 6-12 cameras. DA3 can process all views together, building a coherent 3D model of the surroundings.
Unknown Calibration: If a camera gets bumped or replaced, DA3 can still work without precise recalibration, improving system robustness.
Cost Reduction: High-quality depth from cameras could replace or supplement LiDAR in cost-sensitive applications.
Robotics and Manipulation
For robots interacting with the physical world:
Flexible Sensing: Works with whatever cameras are available, from stereo pairs to single cameras to arrays.
Real-Time Operation: DA3-Base at 127 FPS exceeds typical robot control loop requirements.
Manipulation Planning: Accurate depth enables precise grasping and placement of objects.
AR/VR and Spatial Computing
For mixed reality applications:
Instant Mapping: Process video frames to build 3D environment maps for AR content placement.
Casual 3D Capture: Walk around an object with a phone camera, and DA3 can reconstruct it in 3D.
Room-Scale Understanding: Process a few photos of a room to understand its layout for virtual furniture placement.
Drone Mapping and Surveying
For aerial applications:
GPS-Denied Operation: When GPS is unavailable (indoors, urban canyons), visual geometry provides positioning.
Flexible Flight Paths: No need for precisely planned survey patterns; DA3 handles arbitrary image collections.
Real-Time Feedback: Process images during flight to identify areas needing more coverage.
Video and Film Production
A critical advantage for video applications:
Temporal Consistency: When processing video frames, DA3 maintains consistent depth across time. Traditional per-frame depth estimation produces “flickering”—depth values that jump around frame-to-frame even on static surfaces. By processing multiple frames together, DA3 produces smooth, stable depth maps essential for VFX compositing and 3D video conversion.
Per-frame depth estimation treats each frame independently, so random model uncertainties cause visible flickering in the output. This makes effects like depth-based blurring, fog, or 3D conversions look jittery and unprofessional. DA3’s multi-view processing enforces geometric consistency across frames, eliminating this artifact.
VFX Integration: Stable depth maps enable realistic depth-of-field effects, atmospheric haze, and object insertion in post-production.
3D Video Conversion: Converting 2D content to 3D stereo requires consistent depth; flickering ruins the 3D effect and causes viewer discomfort.
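One simple way to quantify flicker, for the restricted case of a static camera and static scene, is the per-pixel standard deviation of depth across frames. This is a generic diagnostic, not something DA3 defines:

```python
# Illustrative flicker measurement (not part of DA3): for a static camera
# and static scene, depth at a given pixel should not change over time,
# so the per-pixel standard deviation across frames exposes flicker.
import numpy as np

def temporal_flicker(depth_frames: np.ndarray) -> float:
    """depth_frames: (T, H, W) depth maps from consecutive frames.
    Returns mean per-pixel std over time; lower means more stable."""
    return float(depth_frames.std(axis=0).mean())

stable = np.full((10, 4, 4), 3.0)                    # constant 3 m wall
noisy = stable + np.random.default_rng(0).normal(0, 0.05, stable.shape)
print(temporal_flicker(stable), temporal_flicker(noisy))  # 0.0 vs ~0.05
```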
Business Implications
This paper has significant ramifications across industries that depend on 3D understanding. Here’s what different stakeholders can expect:
For Autonomous Vehicle Companies
LiDAR Cost Reduction: High-quality depth from cameras could reduce or eliminate reliance on expensive LiDAR sensors ($1,000-$75,000 per unit). For companies producing millions of vehicles, this represents massive cost savings.
Robustness Without Recalibration: DA3’s ability to work without precise camera calibration means vehicles can continue operating safely even if cameras are bumped or replaced. This reduces maintenance costs and improves system reliability.
Multi-Camera Fusion: Processing 6-12 camera feeds together builds a coherent 3D model, enabling more accurate perception than single-camera approaches.
For Robotics Companies
Flexible Sensing Architecture: DA3 works with any camera configuration, from stereo pairs to arbitrary arrays. This flexibility reduces hardware constraints and enables robots to adapt to different environments with different sensor setups.
Real-Time Operation: DA3-Base at 127 FPS exceeds typical robot control loop requirements, enabling precise manipulation and navigation without custom depth hardware.
Cost-Effective Deployment: Camera-based depth is significantly cheaper than time-of-flight or structured-light sensors, making robots more economically viable for smaller applications.
For AR/VR Developers
Instant Environment Mapping: Process video frames in real-time to build 3D maps for AR content placement without specialized depth cameras.
Casual 3D Capture: Walk around an object with a smartphone, and DA3 reconstructs it in 3D, enabling user-generated 3D content without expensive scanners.
Room-Scale Understanding: Process a few photos of a room to understand its layout for virtual furniture placement or spatial computing applications.
For Film and VFX Studios
Temporal Consistency: Unlike per-frame depth estimation that causes flickering, DA3’s multi-view processing produces stable depth maps across video sequences, essential for VFX compositing.
Depth-Based Effects: Reliable depth enables realistic depth-of-field, atmospheric haze, and object insertion in post-production without manual rotoscoping.
3D Video Conversion: Converting 2D content to 3D stereo requires consistent depth; DA3’s stability makes this commercially viable for catalog conversion.
For Drone and Mapping Companies
GPS-Denied Navigation: When GPS is unavailable (indoors, urban canyons, underground), visual geometry provides positioning, expanding drone applications.
Flexible Flight Paths: No need for precisely planned survey patterns. DA3 handles arbitrary image collections, reducing operator training and mission planning overhead.
Real-Time Coverage Feedback: Process images during flight to identify areas needing more coverage, improving survey efficiency.
For Hardware Vendors
Potential LiDAR Disruption: If camera-based depth approaches LiDAR quality at a fraction of the cost, demand for expensive depth sensors may decline. LiDAR vendors may need to emphasize applications where camera limitations persist (extreme lighting, glass surfaces).
Limitations
Computational Cost
While simpler than prior architectures, DA3 still requires significant compute:
- DA3-Giant needs high-end GPUs for real-time operation
- Memory scales with number of input views
- Not yet suitable for microcontrollers or low-power devices
Scene Assumptions
DA3 assumes relatively cooperative conditions:
Static Scenes: Moving objects (people walking, cars driving) can confuse multi-view geometry. Single-frame mode still works, but multi-view benefits diminish.
Sufficient Texture: Featureless surfaces (white walls, clear glass) are challenging because there’s nothing to match across views.
Reasonable Image Quality: Severe motion blur, extreme exposure, or heavy compression degrade results.
Scale Ambiguity
Without absolute reference points, predicted depths are relative rather than metric: the same relative proportions could describe a dollhouse or a full-size room. Applications requiring metric depth need additional calibration or known reference objects.
To address this, DA3 provides a dedicated metric-depth variant for use cases where real-world measurements are essential:
DA3’s metric depth variant estimates absolute distances in meters. On the ETH3D benchmark, it achieves 91.7% accuracy (δ₁ metric), a 6.2% improvement over the previous best method UniDepthv2. This variant incorporates scale references during training, trading some generality for the ability to output real-world measurements directly.
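For reference, δ₁ is a standard depth-accuracy metric: the fraction of pixels whose predicted and true depths agree within a factor of 1.25. A minimal sketch:

```python
# delta_1 accuracy: fraction of pixels where max(pred/gt, gt/pred) < 1.25.
import numpy as np

def delta_1(pred: np.ndarray, gt: np.ndarray, thresh: float = 1.25) -> float:
    """pred, gt: positive depth maps of the same shape (e.g. meters)."""
    ratio = np.maximum(pred / gt, gt / pred)
    return float((ratio < thresh).mean())

gt = np.array([2.0, 4.0, 10.0])
pred = np.array([2.1, 5.5, 9.0])   # 5.5 / 4.0 = 1.375 fails the 1.25 test
print(delta_1(pred, gt))           # 0.666...
```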
Conclusion
Depth Anything 3 demonstrates that architectural simplicity can outperform complex specialized designs. By unifying depth and pose estimation into a single depth-ray representation and using a plain transformer backbone, DA3 achieves state-of-the-art results while being easier to understand, implement, and deploy.
Key Takeaways:
- Simplicity scales: A vanilla transformer outperforms specialized geometric architectures
- Unified representations work: Depth rays encode depth and pose together more effectively than separate predictions
- Public data suffices: 680K scenes from academic datasets enable state-of-the-art results without proprietary data
- Multiple sizes available: From 26M to 1.2B parameters, choose the right trade-off for your application
- Multi-view is practical: Processing many images together is now fast enough for real applications
For practitioners building depth-based applications, DA3 offers a versatile, well-documented solution with public code, weights, and training data. The simplicity of the approach makes it easier to adapt, debug, and deploy than more complex alternatives.
Original paper: Haotong Lin, Sili Chen, Junhao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang (2025). Depth Anything 3: Recovering the Visual Space from Any Views. arXiv.