- Two-Stage Architecture: Generate coarse geometry first, then refine with voxel-based methods. This decouples global structure from local detail, letting each stage specialize.
- Public Data Only: Trained entirely on public datasets (Objaverse, ShapeNet) with a novel watertight processing pipeline that fixes gaps and reinforces thin structures. Matches proprietary-data quality.
- Practical 3D Generation: Enables concept iteration, base mesh generation, and background asset creation. Textures and materials still require additional processing.
Research Overview
Generating high-quality 3D shapes from text or images has improved dramatically, but most methods struggle with a fundamental trade-off: they either capture global structure well or produce fine details, but rarely both. UltraShape 1.0 addresses this by splitting the problem into two stages and introducing a novel refinement method that anchors geometric details to specific spatial locations.
3D content creation is a bottleneck across industries. Games, films, VR experiences, and product design all require 3D assets, but creating them manually is slow and expensive. Automated generation could reduce asset creation time from hours or days to minutes.
The key insight: by first generating a coarse shape, then refining it with explicit spatial anchors, the model can allocate its capacity more efficiently. The coarse stage handles global structure while the refinement stage focuses entirely on local detail.
Key Innovation
Unlike methods that try to generate everything at once, UltraShape decouples spatial positioning from detail synthesis. The refinement stage uses voxel queries derived from the coarse geometry, providing explicit positional anchors that guide where details should appear.
| Aspect | Single-Stage Methods | UltraShape Two-Stage |
|---|---|---|
| Global structure | Often inconsistent | Handled by Stage 1 |
| Fine details | Limited by capacity | Dedicated Stage 2 |
| Scalability | Compute intensive | More efficient |
| Training data | Often proprietary | Public datasets only |
The Challenge of 3D Generation
3D shape generation faces unique challenges compared to 2D image generation:
Images are 2D grids of pixels with clear spatial relationships. 3D shapes exist in continuous space with complex topology, surfaces that curve and fold, and structures that must be physically plausible from every viewing angle. A small error in 2D might cause a blurry patch; in 3D it can create impossible geometry.
The Resolution Problem
3D representations scale poorly. A 256x256 image has 65,536 pixels. A 256x256x256 voxel grid has over 16 million voxels. This cubic scaling makes high-resolution 3D generation computationally expensive.
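To make the scaling concrete, here is a quick back-of-the-envelope comparison in plain Python (purely illustrative):

```python
# Dense grids grow quadratically in 2D but cubically in 3D,
# so each doubling of resolution costs 8x in 3D instead of 4x.
for res in (64, 128, 256, 512):
    pixels, voxels = res ** 2, res ** 3
    print(f"{res:>4}: {pixels:>12,} pixels vs {voxels:>16,} voxels "
          f"({voxels // pixels:,}x ratio)")
```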
Previous approaches have tried several strategies:
Point clouds represent shapes as collections of points, but struggle to capture surfaces accurately.
Meshes define explicit surfaces but are difficult to generate with neural networks due to their irregular structure.
Neural implicit functions (like NeRF) learn continuous representations but are slow to evaluate and hard to convert to usable formats.
Diffusion models have shown promise but typically operate at limited resolutions.
The Data Problem
Training 3D generative models requires large, high-quality datasets. But most available 3D data has issues:
- Missing geometry (holes in surfaces)
- Self-intersecting faces
- Inconsistent normals
- Thin structures that disappear at lower resolutions
UltraShape addresses this with a novel data processing pipeline.
How UltraShape Works
The system uses a two-stage pipeline where each stage has a specific responsibility:
Stage 1: Coarse Structure Generation
The first stage generates a low-resolution representation of the overall shape. This captures:
- Global proportions and silhouette
- Major structural elements
- Rough surface topology
Generating coarse geometry first is similar to how artists work. A sculptor starts with a rough form before adding details. Starting coarse lets the model establish correct global structure without wasting capacity on details that might be in the wrong location.
This stage uses a 3D diffusion model that operates on a voxel grid, gradually denoising random noise into structured geometry.
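As a rough illustration of what that denoising loop looks like, here is a minimal DDPM-style sampling sketch over a low-resolution grid. The `denoiser` network, noise schedule, grid size, and conditioning input are assumptions for illustration, not UltraShape's actual implementation:

```python
import torch

@torch.no_grad()
def sample_coarse_voxels(denoiser, cond, steps=50, res=32, device="cpu"):
    """DDPM-style reverse process: start from Gaussian noise and
    iteratively denoise a low-resolution 3D grid into coarse geometry.
    `denoiser(x, t, cond)` is assumed to predict the noise at step t."""
    x = torch.randn(1, 1, res, res, res, device=device)  # pure noise
    betas = torch.linspace(1e-4, 0.02, steps, device=device)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    for t in reversed(range(steps)):
        eps = denoiser(x, t, cond)  # network's noise prediction
        mean = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise
    return x  # e.g. occupancy logits; threshold to extract the coarse shape
```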
Stage 2: Voxel-Based Refinement
The refinement stage takes the coarse output and adds geometric detail. The key innovation is how it handles spatial positioning:
Voxel queries from coarse geometry: Rather than having the refinement network figure out where it is in space, the coarse output provides explicit positional anchors. Each region to be refined knows exactly where it sits relative to the global structure.
Fixed spatial locations: The refinement operates at fixed positions derived from the coarse mesh. This means the network only needs to learn what details to add, not where to add them.
Local detail synthesis: With position handled, the refinement network focuses entirely on generating high-frequency geometric details like sharp edges, surface textures, and fine protrusions.
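A minimal sketch of the anchoring idea, assuming an occupancy-grid coarse output and a generic `refiner` network (both illustrative interfaces, not the paper's exact ones):

```python
import torch

def voxel_queries_from_coarse(coarse_occ, threshold=0.5):
    """Explicit positional anchors: the normalized centers of occupied
    voxels in the coarse grid, shape (N, 3) in [-1, 1]^3."""
    grid = coarse_occ.squeeze()                     # (R, R, R)
    idx = (grid > threshold).nonzero().float()      # occupied voxel indices
    return (idx + 0.5) / grid.shape[-1] * 2.0 - 1.0

def refine(refiner, coarse_occ, cond):
    """Positions are fixed by Stage 1, so the refiner only has to learn
    *what* detail to synthesize at each anchor, not *where* it goes."""
    queries = voxel_queries_from_coarse(coarse_occ)  # (N, 3) anchors
    return queries, refiner(queries, cond)           # per-anchor detail
```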
[Figure: UltraShape two-stage pipeline, showing coarse structure generation followed by voxel-based refinement]
Training Strategy
UltraShape trains exclusively on publicly available 3D datasets, including:
- Objaverse (800K+ 3D models)
- ShapeNet
- Other curated public sources
This is notable because many competing methods rely on proprietary datasets that are not publicly available, making comparison and reproduction difficult.
[Figure: training progression, showing two-stage training with coarse-to-fine refinement]
Data Processing Pipeline
A major contribution is the data processing pipeline that improves the quality of public 3D datasets:
Watertight Processing
Many 3D models have holes or gaps in their surfaces. The pipeline includes a novel method to make meshes "watertight" (fully enclosed surfaces with no gaps).
Watertight meshes are essential for many applications. 3D printing requires closed surfaces. Physics simulation needs to distinguish inside from outside. Rendering algorithms work better with consistent geometry. A mesh with holes can cause artifacts or failures in downstream applications.
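For intuition, here is what a basic repair pass looks like with the open-source `trimesh` library; the paper's watertight pipeline is more sophisticated than this sketch:

```python
import trimesh

def make_watertight(path):
    """Load a mesh, fix face winding/normals, fill small holes, and
    report whether the result is fully enclosed. Badly damaged meshes
    need stronger remeshing than this simple pass."""
    mesh = trimesh.load(path, force="mesh")
    trimesh.repair.fix_normals(mesh)   # consistent outward-facing normals
    trimesh.repair.fill_holes(mesh)    # close small boundary loops
    return mesh, mesh.is_watertight
```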
Quality Filtering
Not all 3D models are suitable for training. The pipeline filters out:
- Models with excessive self-intersections
- Geometry that cannot be repaired
- Assets that are too simple (single primitives)
- Models with inconsistent scale or orientation
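A toy version of such a filter, again using `trimesh` mesh properties; the thresholds are invented for illustration, and the self-intersection check, which is more involved, is omitted:

```python
def keep_for_training(mesh, min_faces=100, max_aspect=100.0):
    """Reject meshes that fail basic quality checks. Thresholds are
    illustrative assumptions, not the paper's values."""
    if not mesh.is_watertight:            # geometry that couldn't be repaired
        return False
    if len(mesh.faces) < min_faces:       # too simple (e.g. a single box)
        return False
    if not mesh.is_winding_consistent:    # inconsistent normals
        return False
    extents = mesh.extents                # bounding-box side lengths
    if extents.max() / max(extents.min(), 1e-9) > max_aspect:
        return False                      # degenerate scale
    return True
```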
Thin Structure Reinforcement
Thin features like wires, poles, or decorative elements often disappear when converting to voxel representations. The pipeline reinforces these structures to preserve them during training.
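One simple way to approximate this (an assumption on my part, not the paper's method) is morphological: detect voxels that vanish under erosion, i.e. features about one voxel thick, and dilate only those:

```python
import numpy as np
from scipy import ndimage

def reinforce_thin_structures(occ: np.ndarray) -> np.ndarray:
    """Thicken features that are roughly one voxel wide so they survive
    voxelization; bulk regions are left untouched."""
    core = ndimage.binary_erosion(occ)            # survives one erosion
    opened = ndimage.binary_dilation(core)        # bulk regions only
    thin = occ & ~opened                          # wires, poles, fine edges
    return occ | ndimage.binary_dilation(thin)    # pad out the thin parts
```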
[Figure: data processing funnel, showing quality filtering and watertight processing]
Practical Implications
For Content Creators
UltraShape's approach could accelerate 3D asset workflows:
Concept iteration: Quickly generate multiple shape variations for review before detailed modeling.
Base mesh generation: Create starting points that artists can refine rather than modeling from scratch.
Background assets: Generate less critical scene elements automatically.
For Researchers
The public data-only training approach is significant:
Reproducibility: Other researchers can train similar models without access to proprietary datasets.
Fair comparison: Benchmarks on public data allow meaningful comparison between methods.
Democratization: Labs without resources to collect proprietary data can still participate.
For Developers
The two-stage architecture has implementation benefits:
Modular improvement: Each stage can be upgraded independently.
Resource flexibility: Coarse generation can run on modest hardware; refinement can scale to available compute.
Quality control: Inspect coarse output before committing to expensive refinement.
Business Implications
This paper has ramifications for industries that rely on 3D content creation.
For Game Development Studios
Asset Pipeline Acceleration: Generate base meshes for props, environment objects, and background assets, reducing the workload on 3D artists. Even if results need manual refinement, starting from generated geometry is faster than modeling from scratch.
Concept Iteration Speed: Quickly visualize multiple variations of a design concept in 3D before committing artist time to detailed modeling. Fail fast, iterate faster.
Indie Game Viability: Smaller studios with limited 3D art budgets can produce larger game worlds. The gap between indie and AAA asset quality narrows.
For VR/AR Companies
Content Supply Challenge: VR experiences are content-hungry but 3D asset creation is slow. Automated generation could unlock experiences that weren't economically viable before.
User-Generated 3D Content: As generation improves, end users could describe objects they want and get usable 3D assets. This democratizes spatial computing content creation.
Rapid Prototyping: Build and test VR experiences faster by generating placeholder or even final-quality 3D assets on demand.
For Product Design and Manufacturing
Design Exploration: Generate multiple form factor variations from textual descriptions. Designers can explore solution spaces faster than traditional CAD workflows allow.
Visualization Before Commitment: Show stakeholders multiple 3D concepts before investing in detailed engineering. Reduce wasted effort on rejected directions.
Bridge to CAD: Generated shapes could serve as starting points for engineering refinement, though current methods produce meshes rather than parametric CAD models.
For Architecture and Real Estate
Quick Visualization: Generate furniture, fixtures, and decorative elements to populate architectural renderings. Interior staging becomes faster and cheaper.
Client Communication: Show clients 3D representations of described concepts quickly, even before detailed design work begins.
For Film and Animation Studios
Background Asset Generation: Populate scenes with generated 3D props, reducing asset creation bottlenecks for environment teams.
Concept Art to 3D: Accelerate the pipeline from concept art to rough 3D models, enabling faster iteration between art directors and 3D modelers.
For the 3D Marketplace Ecosystem
Content Creation Democratization: Lower barriers to creating 3D assets could increase supply on marketplaces like TurboSquid, Sketchfab, or Unity Asset Store.
Quality Bar Shifts: As generated content improves, expectations for purchased assets may increase. Vendors may need to offer more sophisticated or specialized assets to differentiate.
Limitations
Not Yet Real-Time
While faster than manual creation, the generation process still takes significant compute time. Real-time generation for interactive applications remains future work.
Texture and Material Gaps
UltraShape focuses on geometry. Generated shapes lack textures, materials, and other attributes needed for final use. Additional processing or manual work is required.
Style Consistency
Like other generative models, consistency across multiple generated assets is not guaranteed. Generating a coherent set of objects (e.g., matching furniture pieces) requires additional control mechanisms.
Resolution Ceiling
While the two-stage approach improves scalability, there are still practical limits on achievable detail. Extremely fine features may require additional refinement stages or post-processing.
Conclusion
UltraShape 1.0 demonstrates that high-quality 3D generation is achievable using only public data and a well-designed two-stage architecture. The key innovations are:
- Decoupling global structure from local detail through staged generation
- Using voxel queries as explicit positional anchors for refinement
- A data processing pipeline that improves public dataset quality
Key takeaways:
- Two-stage generation allocates model capacity more efficiently than single-stage approaches
- Explicit positional anchors improve detail placement during refinement
- Public datasets, when properly processed, can match proprietary data quality
- The modular architecture enables independent improvement of each stage
For teams building 3D generation pipelines, UltraShape offers a reproducible baseline trained entirely on accessible data, with architecture decisions that can inform custom implementations.
Cite this paper
Tanghui Jia, Dongyu Yan, Dehao Hao, Yang Li, Kaiyi Zhang, Xianyi He, Lanjiong Li, Yuhan Wang, Jinnan Chen, Lutao Jiang, Qishen Yin, Long Quan, Ying-Cong Chen, Li Yuan (2025). UltraShape 1.0: High-Fidelity 3D Shape Generation via Scalable Geometric Refinement. arXiv 2025.