arXiv 2025 · December 24, 2025

UltraShape 1.0: High-Fidelity 3D Shape Generation via Scalable Geometric Refinement

Tanghui Jia et al.

Generating detailed 3D shapes remains challenging because models must handle both global structure and fine geometric detail. UltraShape 1.0 addresses this with a two-stage approach: first generate coarse geometry, then refine it using voxel-based methods that anchor details to specific spatial locations. A key contribution is a data processing pipeline that cleans public 3D datasets by fixing gaps, filtering out poor samples, and reinforcing thin structures.

Categories: Computer Vision, Generative Models

Key Findings

  1. Two-stage pipeline separates coarse structure generation from fine detail synthesis

  2. Novel watertight processing fixes gaps and reinforces thin structures in training data

  3. Voxel-based refinement provides explicit positional anchors for detail generation

  4. Trained exclusively on public datasets, matching proprietary data quality

  5. Decouples spatial positioning from detail synthesis for better scalability

  6. Code and models to be publicly released for research use

TL;DR
  1. Two-Stage Architecture: Generate coarse geometry first, then refine with voxel-based methods. This decouples global structure from local detail, letting each stage specialize

  2. Public Data Only: Trained entirely on public datasets (Objaverse, ShapeNet) with a novel watertight processing pipeline that fixes gaps and reinforces thin structures. Matches proprietary-data quality

  3. Practical 3D Generation: Enables concept iteration, base mesh generation, and background asset creation. Textures and materials still require additional processing

Research Overview

Generating high-quality 3D shapes from text or images has improved dramatically, but most methods struggle with a fundamental trade-off: they either capture global structure well or produce fine details, but rarely both. UltraShape 1.0 addresses this by splitting the problem into two stages and introducing a novel refinement method that anchors geometric details to specific spatial locations.

Why 3D Generation Matters

3D content creation is a bottleneck across industries. Games, films, VR experiences, and product design all require 3D assets, but creating them manually is slow and expensive. Automated generation could reduce asset creation time from hours or days to minutes.

The key insight: by first generating a coarse shape, then refining it with explicit spatial anchors, the model can allocate its capacity more efficiently. The coarse stage handles global structure while the refinement stage focuses entirely on local detail.

Key Innovation

Unlike methods that try to generate everything at once, UltraShape decouples spatial positioning from detail synthesis. The refinement stage uses voxel queries derived from the coarse geometry, providing explicit positional anchors that guide where details should appear.

Aspect           | Single-Stage Methods | UltraShape Two-Stage
Global structure | Often inconsistent   | Handled by Stage 1
Fine details     | Limited by capacity  | Dedicated Stage 2
Scalability      | Compute intensive    | More efficient
Training data    | Often proprietary    | Public datasets only

The Challenge of 3D Generation

3D shape generation faces unique challenges compared to 2D image generation:

3D vs 2D Generation

Images are 2D grids of pixels with clear spatial relationships. 3D shapes exist in continuous space with complex topology, surfaces that curve and fold, and structures that must be physically plausible from every viewing angle. A small error in 2D might cause a blurry patch; in 3D it can create impossible geometry.

The Resolution Problem

3D representations scale poorly. A 256x256 image has 65,536 pixels. A 256x256x256 voxel grid has over 16 million voxels. This cubic scaling makes high-resolution 3D generation computationally expensive.
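The cubic scaling can be made concrete with a quick calculation (purely illustrative):

```python
# Element counts for a 2D image vs. a 3D voxel grid at the same side length.
def pixel_count(n):
    return n ** 2  # 2D: quadratic growth

def voxel_count(n):
    return n ** 3  # 3D: cubic growth

print(pixel_count(256))  # 65,536 pixels
print(voxel_count(256))  # 16,777,216 voxels -- 256x as many elements
```

Doubling the resolution of an image quadruples its size; doubling a voxel grid's resolution multiplies its size by eight.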

Previous approaches have tried several strategies:

Point clouds represent shapes as collections of points, but struggle to capture surfaces accurately.

Meshes define explicit surfaces but are difficult to generate with neural networks due to their irregular structure.

Neural implicit functions (like NeRF) learn continuous representations but are slow to evaluate and hard to convert to usable formats.

Diffusion models have shown promise but typically operate at limited resolutions.

The Data Problem

Training 3D generative models requires large, high-quality datasets. But most available 3D data has issues:

  • Missing geometry (holes in surfaces)
  • Self-intersecting faces
  • Inconsistent normals
  • Thin structures that disappear at lower resolutions

UltraShape addresses this with a novel data processing pipeline.

How UltraShape Works

The system uses a two-stage pipeline where each stage has a specific responsibility:

Stage 1: Coarse Structure Generation

The first stage generates a low-resolution representation of the overall shape. This captures:

  • Global proportions and silhouette
  • Major structural elements
  • Rough surface topology

Why Start Coarse?

Generating coarse geometry first is similar to how artists work. A sculptor starts with a rough form before adding details. Starting coarse lets the model establish correct global structure without wasting capacity on details that might be in the wrong location.

This stage uses a 3D diffusion model that operates on a voxel grid, gradually denoising random noise into structured geometry.
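As a rough sketch of what "denoising a voxel grid" means, here is a minimal DDPM-style reverse loop with a stand-in denoiser. The schedule, grid size, and `denoiser` are illustrative assumptions, not the paper's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10                                  # diffusion steps (toy value)
betas = np.linspace(1e-4, 0.02, T)      # noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def denoiser(x, t):
    """Stand-in for a learned 3D network that predicts the noise in x at step t."""
    return 0.1 * x                      # placeholder prediction

x = rng.standard_normal((16, 16, 16))   # start from pure noise on a 16^3 coarse grid
for t in reversed(range(T)):
    eps = denoiser(x, t)
    # Simplified DDPM reverse-step mean (no extra noise injected)
    x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])

print(x.shape)  # (16, 16, 16): a denoised coarse voxel grid
```

A trained model would replace `denoiser` with a 3D network conditioned on the text or image prompt.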

Stage 2: Voxel-Based Refinement

The refinement stage takes the coarse output and adds geometric detail. The key innovation is how it handles spatial positioning:

Voxel queries from coarse geometry: Rather than having the refinement network figure out where it is in space, the coarse output provides explicit positional anchors. Each region to be refined knows exactly where it sits relative to the global structure.

Fixed spatial locations: The refinement operates at fixed positions derived from the coarse mesh. This means the network only needs to learn what details to add, not where to add them.

Local detail synthesis: With position handled, the refinement network focuses entirely on generating high-frequency geometric details like sharp edges, surface textures, and fine protrusions.
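The idea of voxel queries as positional anchors can be sketched as follows. The names, shapes, and the `refine` function are illustrative assumptions, not the paper's API:

```python
import numpy as np

rng = np.random.default_rng(1)
coarse = rng.random((8, 8, 8)) > 0.7                 # toy coarse occupancy grid

# Each occupied coarse voxel becomes a query with a known 3D position:
# the refinement network is told where it is, rather than inferring it.
query_positions = np.argwhere(coarse).astype(float)  # (N, 3) anchor coordinates

def refine(positions):
    """Stand-in refinement: predict a small displacement per anchored query.
    A real network would also consume local geometric features at each position."""
    return 0.05 * np.sin(positions)                  # bounded, position-dependent detail

detail = refine(query_positions)
refined_points = query_positions + detail            # detail added at fixed anchors
print(refined_points.shape)                          # (N, 3)
```

Because positions are fixed inputs rather than outputs, the refinement stage only learns *what* to add at each anchor, never *where* the anchor belongs.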

Figure: UltraShape two-stage pipeline (coarse structure, then voxel-based refinement)

Training Strategy

UltraShape trains exclusively on publicly available 3D datasets, including:

  • Objaverse (800K+ 3D models)
  • ShapeNet
  • Other curated public sources

This is notable because many competing methods rely on proprietary datasets that are not publicly available, making comparison and reproduction difficult.

Figure: Training progression (two-stage training with coarse-to-fine refinement)

Data Processing Pipeline

A major contribution is the data processing pipeline that improves the quality of public 3D datasets:

Watertight Processing

Many 3D models have holes or gaps in their surfaces. The pipeline includes a novel method to make meshes "watertight" (fully enclosed surfaces with no gaps).

Why Watertight Matters

Watertight meshes are essential for many applications. 3D printing requires closed surfaces. Physics simulation needs to distinguish inside from outside. Rendering algorithms work better with consistent geometry. A mesh with holes can cause artifacts or failures in downstream applications.
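A common operational test for watertightness is that every undirected edge of a triangle mesh is shared by exactly two faces. A minimal, self-contained check (not the paper's processing method, just the standard criterion):

```python
from collections import Counter

def is_watertight(faces):
    """True if every undirected edge appears in exactly two faces.
    `faces` is a list of (i, j, k) vertex-index triangles."""
    edges = Counter()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            edges[tuple(sorted((u, v)))] += 1
    return all(count == 2 for count in edges.values())

# A closed tetrahedron is watertight.
tet = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (2, 3, 0)]
print(is_watertight(tet))        # True
# Removing one face leaves three boundary edges used only once.
print(is_watertight(tet[:3]))    # False
```

Libraries such as trimesh expose the same test (e.g. `Trimesh.is_watertight`) along with repair utilities.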

Quality Filtering

Not all 3D models are suitable for training. The pipeline filters out:

  • Models with excessive self-intersections
  • Geometry that cannot be repaired
  • Assets that are too simple (single primitives)
  • Models with inconsistent scale or orientation
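The filter can be imagined as a set of simple per-asset rules. The thresholds and statistics below are illustrative assumptions; the paper's exact criteria are not given here:

```python
def passes_quality_filter(mesh_stats):
    """mesh_stats: dict of simple per-asset statistics (hypothetical schema)."""
    if mesh_stats["self_intersection_ratio"] > 0.05:
        return False                       # excessive self-intersections
    if not mesh_stats["repairable"]:
        return False                       # geometry that cannot be repaired
    if mesh_stats["face_count"] < 50:
        return False                       # too simple (e.g. a single primitive)
    if not (0.5 <= mesh_stats["bbox_diagonal"] <= 2.0):
        return False                       # inconsistent scale after normalization
    return True

good = {"self_intersection_ratio": 0.0, "repairable": True,
        "face_count": 5000, "bbox_diagonal": 1.0}
print(passes_quality_filter(good))   # True
```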

Thin Structure Reinforcement

Thin features like wires, poles, or decorative elements often disappear when converting to voxel representations. The pipeline reinforces these structures to preserve them during training.
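One simple way to reinforce thin features is a one-voxel binary dilation of the occupancy grid, so a one-voxel-thick "wire" gains a protective shell before any downsampling. This is a stand-in illustration, not the paper's actual reinforcement method (note that `np.roll` wraps at the grid boundary, which is harmless for this interior example):

```python
import numpy as np

def dilate(occ):
    """6-connected one-step dilation of a boolean 3D occupancy grid."""
    out = occ.copy()
    for axis in range(3):
        for shift in (1, -1):
            out |= np.roll(occ, shift, axis=axis)
    return out

grid = np.zeros((8, 8, 8), dtype=bool)
grid[4, 4, :] = True              # a one-voxel-thick wire along the z axis
thick = dilate(grid)
print(grid.sum(), thick.sum())    # the wire thickens into a cross-section of 5 columns
```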

Figure: Data processing funnel (quality filtering and watertight processing pipeline)

Practical Implications

For Content Creators

UltraShape's approach could accelerate 3D asset workflows:

Concept iteration: Quickly generate multiple shape variations for review before detailed modeling.

Base mesh generation: Create starting points that artists can refine rather than modeling from scratch.

Background assets: Generate less critical scene elements automatically.

For Researchers

The public data-only training approach is significant:

Reproducibility: Other researchers can train similar models without access to proprietary datasets.

Fair comparison: Benchmarks on public data allow meaningful comparison between methods.

Democratization: Labs without resources to collect proprietary data can still participate.

For Developers

The two-stage architecture has implementation benefits:

Modular improvement: Each stage can be upgraded independently.

Resource flexibility: Coarse generation can run on modest hardware; refinement can scale to available compute.

Quality control: Inspect coarse output before committing to expensive refinement.

Business Implications

This paper has ramifications for industries that rely on 3D content creation.

For Game Development Studios

Asset Pipeline Acceleration: Generate base meshes for props, environment objects, and background assets, reducing the workload on 3D artists. Even if results need manual refinement, starting from generated geometry is faster than modeling from scratch.

Concept Iteration Speed: Quickly visualize multiple variations of a design concept in 3D before committing artist time to detailed modeling. Fail fast, iterate faster.

Indie Game Viability: Smaller studios with limited 3D art budgets can produce larger game worlds. The gap between indie and AAA asset quality narrows.

For VR/AR Companies

Content Supply Challenge: VR experiences are content-hungry but 3D asset creation is slow. Automated generation could unlock experiences that weren't economically viable before.

User-Generated 3D Content: As generation improves, end users could describe objects they want and get usable 3D assets. This democratizes spatial computing content creation.

Rapid Prototyping: Build and test VR experiences faster by generating placeholder or even final-quality 3D assets on demand.

For Product Design and Manufacturing

Design Exploration: Generate multiple form factor variations from textual descriptions. Designers can explore solution spaces faster than traditional CAD workflows allow.

Visualization Before Commitment: Show stakeholders multiple 3D concepts before investing in detailed engineering. Reduce wasted effort on rejected directions.

Bridge to CAD: Generated shapes could serve as starting points for engineering refinement, though current methods produce meshes rather than parametric CAD models.

For Architecture and Real Estate

Quick Visualization: Generate furniture, fixtures, and decorative elements to populate architectural renderings. Interior staging becomes faster and cheaper.

Client Communication: Show clients 3D representations of described concepts quickly, even before detailed design work begins.

For Film and Animation Studios

Background Asset Generation: Populate scenes with generated 3D props, reducing asset creation bottlenecks for environment teams.

Concept Art to 3D: Accelerate the pipeline from concept art to rough 3D models, enabling faster iteration between art directors and 3D modelers.

For the 3D Marketplace Ecosystem

Content Creation Democratization: Lower barriers to creating 3D assets could increase supply on marketplaces like TurboSquid, Sketchfab, or Unity Asset Store.

Quality Bar Shifts: As generated content improves, expectations for purchased assets may increase. Vendors may need to offer more sophisticated or specialized assets to differentiate.

Limitations

Not Yet Real-Time

While faster than manual creation, the generation process still takes significant compute time. Real-time generation for interactive applications remains future work.

Texture and Material Gaps

UltraShape focuses on geometry. Generated shapes lack textures, materials, and other attributes needed for final use. Additional processing or manual work is required.

Style Consistency

Like other generative models, consistency across multiple generated assets is not guaranteed. Generating a coherent set of objects (e.g., matching furniture pieces) requires additional control mechanisms.

Resolution Ceiling

While the two-stage approach improves scalability, there are still practical limits on achievable detail. Extremely fine features may require additional refinement stages or post-processing.

Conclusion

UltraShape 1.0 demonstrates that high-quality 3D generation is achievable using only public data and a well-designed two-stage architecture. The key innovations are:

  1. Decoupling global structure from local detail through staged generation
  2. Using voxel queries as explicit positional anchors for refinement
  3. A data processing pipeline that improves public dataset quality

Key takeaways:

  1. Two-stage generation allocates model capacity more efficiently than single-stage approaches
  2. Explicit positional anchors improve detail placement during refinement
  3. Public datasets, when properly processed, can match proprietary data quality
  4. The modular architecture enables independent improvement of each stage

For teams building 3D generation pipelines, UltraShape offers a reproducible baseline trained entirely on accessible data, with architecture decisions that can inform custom implementations.


Original paper: arXiv · PDF · HTML

Authors

Tanghui Jia, Dongyu Yan, Dehao Hao, Yang Li, Kaiyi Zhang, Xianyi He, Lanjiong Li, Yuhan Wang, Jinnan Chen, Lutao Jiang, Qishen Yin, Long Quan, Ying-Cong Chen, Li Yuan

Cite this paper

Tanghui Jia, Dongyu Yan, Dehao Hao, Yang Li, Kaiyi Zhang, Xianyi He, Lanjiong Li, Yuhan Wang, Jinnan Chen, Lutao Jiang, Qishen Yin, Long Quan, Ying-Cong Chen, Li Yuan (2025). UltraShape 1.0: High-Fidelity 3D Shape Generation via Scalable Geometric Refinement. arXiv 2025.
