I-Scene: 3D Instance Models are Implicit Generalizable Spatial Learners
Abstract
A pre-trained 3D instance generator is reprogrammed to generalize spatial understanding in new layouts by learning directly from geometric cues, demonstrating its potential as a foundation model for interactive 3D scene generation.
Generalization remains the central challenge for interactive 3D scene generation. Existing learning-based approaches ground spatial understanding in limited scene datasets, restricting generalization to new layouts. We instead reprogram a pre-trained 3D instance generator to act as a scene-level learner, replacing dataset-bounded supervision with model-centric spatial supervision. This reprogramming unlocks the generator's transferable spatial knowledge, enabling generalization to unseen layouts and novel object compositions. Remarkably, spatial reasoning emerges even when the training scenes are random compositions of objects, demonstrating that the generator's transferable scene prior provides a rich learning signal for inferring proximity, support, and symmetry from purely geometric cues. Replacing the widely used canonical space, we instantiate this insight with a view-centric formulation of the scene space, yielding a fully feed-forward, generalizable scene generator that learns spatial relations directly from the instance model. Quantitative and qualitative results show that a 3D instance generator is an implicit spatial learner and reasoner, pointing toward foundation models for interactive 3D scene understanding and generation. Project page: https://luling06.github.io/I-Scene-project/
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- CC-FMO: Camera-Conditioned Zero-Shot Single Image to 3D Scene Generation with Foundation Model Orchestration (2025)
- WorldGrow: Generating Infinite 3D World (2025)
- Self-Evolving 3D Scene Generation from a Single Image (2025)
- IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction (2025)
- Inferring Compositional 4D Scenes without Ever Seeing One (2025)
- 4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer (2025)
- ConsistCompose: Unified Multimodal Layout Control for Image Composition (2025)