Prox-E: Fine-Grained 3D Shape Editing via Primitive-Based Abstractions
Abstract
A training-free framework for fine-grained 3D editing that uses geometric primitives and vision-language models to preserve identity while enabling localized structural changes.
Text-based 2D image editing models have recently reached an impressive level of maturity, motivating a growing body of work that heavily depends on these models to drive 3D edits. While effective for appearance-based modifications, such 2D-centric 3D editing pipelines often struggle with fine-grained 3D editing, where localized structural changes must be applied while strictly preserving an object's overall identity. To address this limitation, we propose Prox-E, a training-free framework that enables fine-grained 3D control through an explicit, primitive-based geometric abstraction. Our framework first abstracts an input 3D shape into a compact set of geometric primitives. A pretrained vision-language model (VLM) then edits this abstraction to specify primitive-level changes. These structural edits are subsequently used to guide a 3D generative model, enabling fine-grained, localized modifications while preserving unchanged regions of the original shape. Through extensive experiments, we demonstrate that our method consistently balances identity preservation, shape quality, and instruction fidelity more effectively than various existing approaches, including 2D-based 3D editors and training-based methods.
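The abstract's pipeline — abstract the shape into primitives, have a VLM emit primitive-level edits, and preserve untouched regions — can be sketched in a few lines. This is a minimal illustrative sketch only: the `Primitive` record, the `edit_abstraction` helper, and the chair example are assumptions for exposition, not the paper's actual representation, VLM interface, or generative-model guidance.

```python
from dataclasses import dataclass, replace

# Hypothetical primitive record. The paper abstracts a shape into a compact
# set of geometric primitives; here each one is reduced to a named center
# and per-axis scale purely for illustration.
@dataclass(frozen=True)
class Primitive:
    name: str
    center: tuple  # (x, y, z)
    scale: tuple   # (sx, sy, sz)

def edit_abstraction(primitives, edits):
    """Apply primitive-level edits (name -> field overrides), leaving every
    untouched primitive bit-identical -- the identity-preservation property
    the pipeline relies on. `edits` stands in for the VLM's structured output.
    """
    return [replace(p, **edits[p.name]) if p.name in edits else p
            for p in primitives]

# Toy example: lengthen only one chair leg; the seat must stay unchanged.
chair = [
    Primitive("seat", (0.0, 0.0, 0.5), (1.0, 1.0, 0.1)),
    Primitive("leg_fl", (0.4, 0.4, 0.25), (0.05, 0.05, 0.5)),
]
edited = edit_abstraction(chair, {"leg_fl": {"scale": (0.05, 0.05, 0.8)}})
assert edited[0] == chair[0]              # unchanged region preserved exactly
assert edited[1].scale == (0.05, 0.05, 0.8)  # localized edit applied
```

In the actual method, the edited abstraction would then condition a 3D generative model; the sketch stops at the abstraction-editing step, which is the part the abstract specifies concretely.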
Community
Even today, when image editing models are more powerful than ever, fine-grained structural 3D editing remains difficult. In this work we use primitive-based abstractions to leverage the reasoning power of VLMs to solve this challenging task.
nice
the use of superquadric primitives as a compact proxy is clever, but the real test is how the edited proxy translates into the 3d diffusion steps without losing identity. the proxy-induced denoising path is where alignment and sampling quirks will show up, especially when edits are localized and other regions dominate the shape signal. an ablation varying the number of primitives or substituting a more expressive primitive family could reveal where identity preservation actually hinges. the arxivlens breakdown helped me parse the method details, btw, here's the link: https://arxivlens.com/PaperView/Details/prox-e-fine-grained-3d-shape-editing-via-primitive-based-abstractions-5048-d8c944ab. overall, i like the clean geometry-prior split, and it would be interesting to see how this scales to multi-object scenes with occlusion.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Beyond Voxel 3D Editing: Learning from 3D Masks and Self-Constructed Data (2026)
- FineEdit: Fine-Grained Image Edit with Bounding Box Guidance (2026)
- FluSplat: Sparse-View 3D Editing without Test-Time Optimization (2026)
- SketchFaceGS: Real-Time Sketch-Driven Face Editing and Generation with Gaussian Splatting (2026)
- PhyEdit: Towards Real-World Object Manipulation via Physically-Grounded Image Editing (2026)
- VeloEdit: Training-Free Consistent and Continuous Instruction-Based Image Editing via Velocity Field Decomposition (2026)
- RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details (2026)