CaPa: Carve-n-Paint Synthesis
for Efficient 4K Textured Mesh Generation

The paper is released! Stay tuned for the online demo!

Hwan Heo, Jangyeong Kim, Seongyeong Lee, Jeong A Wi, Junyoung Choi, Sangjun Ahn*
Graphics AI Lab, NCSOFT Research

Paper | GitHub
Online Demo (coming soon)
TL;DR: We propose CaPa, a novel method for generating a high-quality 4K textured mesh in under 30 seconds,
providing 3D assets ready for commercial applications such as games, movies, and VR/AR.

Abstract

The synthesis of high-quality 3D assets from textual or visual inputs has become a central objective in modern generative modeling. Despite the proliferation of 3D generation algorithms, they frequently grapple with challenges such as multi-view inconsistency, slow generation times, low fidelity, and surface reconstruction problems. While some studies have addressed a subset of these issues, a comprehensive solution remains elusive. In this paper, we introduce CaPa, a carve-and-paint framework that generates high-fidelity 3D assets efficiently. CaPa employs a two-stage process, decoupling geometry generation from texture synthesis. Initially, a 3D latent diffusion model generates geometry guided by multi-view inputs, ensuring structural consistency across perspectives. Subsequently, leveraging a novel, model-agnostic Spatially Decoupled Attention, the framework synthesizes high-resolution textures (up to 4K) for a given geometry. Furthermore, we propose a 3D-aware occlusion inpainting algorithm that fills untextured regions, yielding cohesive results across the entire model. This pipeline generates high-quality 3D assets in less than 30 seconds, providing ready-to-use outputs for commercial applications. Experimental results demonstrate that CaPa excels in both texture fidelity and geometric stability, establishing a new standard for practical, scalable 3D asset generation.


Methodology


Overview


Pipeline of CaPa

  1. Geometry Generation: First, we generate geometry (a polygonal mesh) using a 3D latent diffusion model. Using the 3D latent space learned by ShapeVAE, we train a 3D latent diffusion model that generates geometry guided by multi-view images from a multi-view diffusion model, ensuring alignment between the generated shape and texture (a minimal sketch of this stage follows the list).
  2. Texture Generation: Second, we render four orthogonal views of the mesh, which serve as inputs for texture generation. To produce a high-quality texture while preventing the Janus problem, we design a novel, model-agnostic spatially decoupled attention (sketched after this list):
    • This mechanism ensures that each spatial region independently attends to its corresponding view, preserving view-specific details and enhancing multi-view consistency.
    • Its model-agnostic nature allows integration with any diffusion model, enabling extraordinary texture quality powered by SDXL and thus outperforming other 3D generation and texturing methods, which are typically limited to SD 1.5.
  3. Final Output: A high-quality textured mesh is obtained through back projection and a 3D-aware occlusion inpainting algorithm (see the final sketch below). The entire 3D asset generation process completes in less than 30 seconds using a fully feed-forward approach.
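
For readers who want a concrete picture of the carve stage, here is a minimal, hypothetical sketch of sampling a ShapeVAE latent with classifier-free guidance from multi-view image embeddings. The names `shape_vae`, `denoiser`, and `mv_encoder` are placeholders, not CaPa's actual API.

```python
# Hypothetical sketch of the carve stage: denoising a ShapeVAE latent,
# guided by multi-view image embeddings via classifier-free guidance.
# All model objects below are placeholders, not CaPa's released API.
import torch

@torch.no_grad()
def carve(mv_images, mv_encoder, denoiser, shape_vae, steps=50, cfg_scale=7.5):
    cond = mv_encoder(mv_images)                 # multi-view guidance signal
    z = torch.randn(1, shape_vae.latent_dim)     # start from Gaussian noise
    for t in torch.linspace(1.0, 1.0 / steps, steps):
        eps_cond = denoiser(z, t, cond)          # conditional noise estimate
        eps_uncond = denoiser(z, t, None)        # unconditional estimate
        eps = eps_uncond + cfg_scale * (eps_cond - eps_uncond)
        z = z - (1.0 / steps) * eps              # crude Euler denoising step
    return shape_vae.decode(z)                   # latent -> polygonal mesh
```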
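
The spatially decoupled attention can be pictured as a block-diagonal attention mask over view-grouped tokens. The sketch below is a simplified reading of the mechanism, assuming the tokens of the four views are concatenated along the sequence dimension; it is not the paper's implementation.

```python
# Simplified sketch: each view's queries attend only to keys of the same
# view, so view-specific details cannot leak across views (Janus problem).
import torch

def spatially_decoupled_attention(q, k, v, num_views=4):
    """q, k, v: (batch, tokens, dim), tokens concatenated view-by-view."""
    b, n, d = q.shape
    view_id = torch.arange(n, device=q.device) // (n // num_views)
    mask = view_id[:, None] == view_id[None, :]    # block-diagonal (n, n)
    attn = (q @ k.transpose(-2, -1)) / d ** 0.5    # scaled dot-product scores
    attn = attn.masked_fill(~mask, float("-inf"))  # cut cross-view attention
    return attn.softmax(dim=-1) @ v
```

Because the mask only touches the attention weights, the same idea can plug into the attention layers of any diffusion backbone, which is what makes the mechanism model-agnostic.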
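
Finally, a hedged sketch of the back-projection step: for each texel, pick the orthogonal view that sees its surface point most frontally, and flag texels no view sees for the 3D-aware occlusion inpainting. The tensor layout, camera directions, and depth-test visibility input are assumptions for illustration.

```python
# Assumed inputs: per-texel unit normals, per-view visibility from a
# rasterizer's depth test, and colors sampled from the four view images.
import numpy as np

# Camera forward directions of the four orthogonal views (an assumption).
VIEW_DIRS = np.array([[0, 0, -1], [0, 0, 1], [-1, 0, 0], [1, 0, 0]], float)

def back_project(normals, visible, view_colors):
    """normals: (T, 3); visible: (4, T) bool; view_colors: (4, T, 3).
    Returns (texture (T, 3), mask of texels left for inpainting (T,))."""
    frontality = -(VIEW_DIRS @ normals.T)        # (4, T) cosine to each camera
    frontality[~visible] = -np.inf               # occluded views: never chosen
    best = frontality.argmax(axis=0)             # most frontal view per texel
    texture = view_colors[best, np.arange(normals.shape[0])]
    needs_inpaint = ~visible.any(axis=0)         # no view sees these texels
    return texture, needs_inpaint
```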


Comparison: Image to 3D Asset Generation


Input | Ours (~30 sec) | Unique3D (~2 min) | SF3D (~10 sec)


We compare CaPa with state-of-the-art image-to-3D methods. All assets are converted to polygonal meshes using each method's official code. CaPa significantly outperforms the baselines in both geometric stability and visual fidelity, especially in the back and side views.


Scalability & Adaptability


PBR-Aware 3D Asset Generation


Texture Editing

Original | Edited with text prompt ("orange sofa, orange pulp")


Citation


BibTeX

@article{heo2025capa,
  title = {CaPa: Carve-n-Paint Synthesis for Efficient 4K Textured Mesh Generation},
  author = {Hwan Heo and Jangyeong Kim and Seongyeong Lee and Jeong A Wi and Junyoung Choi and Sangjun Ahn},
  journal = {arXiv preprint arXiv:2501.09433},
  year = {2025},
}


Related Project


Texture Copilot