Prox-E: Fine-Grained 3D Shape Editing via Primitive-Based Abstractions
Abstract
A training-free framework for fine-grained 3D editing that uses geometric primitives and vision-language models to preserve identity while enabling localized structural changes.
Text-based 2D image editing models have recently reached an impressive level of maturity, motivating a growing body of work that heavily depends on these models to drive 3D edits. While effective for appearance-based modifications, such 2D-centric 3D editing pipelines often struggle with fine-grained 3D editing, where localized structural changes must be applied while strictly preserving an object's overall identity. To address this limitation, we propose Prox-E, a training-free framework that enables fine-grained 3D control through an explicit, primitive-based geometric abstraction. Our framework first abstracts an input 3D shape into a compact set of geometric primitives. A pretrained vision-language model (VLM) then edits this abstraction to specify primitive-level changes. These structural edits are subsequently used to guide a 3D generative model, enabling fine-grained, localized modifications while preserving unchanged regions of the original shape. Through extensive experiments, we demonstrate that our method consistently balances identity preservation, shape quality, and instruction fidelity more effectively than various existing approaches, including 2D-based 3D editors and training-based methods.
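The abstract's pipeline — abstract the shape into primitives, have a VLM emit primitive-level edits, and preserve untouched regions — can be sketched in a few lines. This is a minimal illustrative sketch only: the `Primitive` record, the `edit_abstraction` helper, and the chair example are assumptions for exposition, not the paper's actual representation, VLM interface, or generative-model guidance.

```python
from dataclasses import dataclass, replace

# Hypothetical primitive record. The paper abstracts a shape into a compact
# set of geometric primitives; here each one is reduced to a named center
# and per-axis scale purely for illustration.
@dataclass(frozen=True)
class Primitive:
    name: str
    center: tuple  # (x, y, z)
    scale: tuple   # (sx, sy, sz)

def edit_abstraction(primitives, edits):
    """Apply primitive-level edits (name -> field overrides), leaving every
    untouched primitive bit-identical -- the identity-preservation property
    the pipeline relies on. `edits` stands in for the VLM's structured output.
    """
    return [replace(p, **edits[p.name]) if p.name in edits else p
            for p in primitives]

# Toy example: lengthen only one chair leg; the seat must stay unchanged.
chair = [
    Primitive("seat", (0.0, 0.0, 0.5), (1.0, 1.0, 0.1)),
    Primitive("leg_fl", (0.4, 0.4, 0.25), (0.05, 0.05, 0.5)),
]
edited = edit_abstraction(chair, {"leg_fl": {"scale": (0.05, 0.05, 0.8)}})
assert edited[0] == chair[0]              # unchanged region preserved exactly
assert edited[1].scale == (0.05, 0.05, 0.8)  # localized edit applied
```

In the actual method, the edited abstraction would then condition a 3D generative model; the sketch stops at the abstraction-editing step, which is the part the abstract specifies concretely.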
Community
Even today, when image editing models are more powerful than ever, fine-grained structural 3D editing remains difficult. In this work we use primitive-based abstractions to leverage the reasoning power of VLMs to solve this challenging task.
nice
the use of superquadric primitives as a compact proxy is clever, but the real test is how the edited proxy translates into the 3d diffusion steps without losing identity. the proxy-induced denoising path is where alignment and sampling quirks will show up, especially when edits are localized and other regions dominate the shape signal. an ablation varying the number of primitives or substituting a more expressive primitive family could reveal where identity preservation actually hinges. the arxivlens breakdown helped me parse the method details, btw, here's the link: https://arxivlens.com/PaperView/Details/prox-e-fine-grained-3d-shape-editing-via-primitive-based-abstractions-5048-d8c944ab. overall, i like the clean geometry-prior split, and it would be interesting to see how this scales to multi-object scenes with occlusion.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Beyond Voxel 3D Editing: Learning from 3D Masks and Self-Constructed Data (2026)
- FineEdit: Fine-Grained Image Edit with Bounding Box Guidance (2026)
- FluSplat: Sparse-View 3D Editing without Test-Time Optimization (2026)
- SketchFaceGS: Real-Time Sketch-Driven Face Editing and Generation with Gaussian Splatting (2026)
- PhyEdit: Towards Real-World Object Manipulation via Physically-Grounded Image Editing (2026)
- VeloEdit: Training-Free Consistent and Continuous Instruction-Based Image Editing via Velocity Field Decomposition (2026)
- RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details (2026)