Yukai Shi1,3, Weiyu Li2,4, Zihao Wang4, Hongyang Li3, Xingyu Chen3, Ping Tan2,4, Lei Zhang3
1 Tsinghua University 2 HKUST 3 IDEA Research 4 LightIllusions
We propose SceneMaker, a decoupled 3D scene generation framework. Due to the lack of sufficient open-set de-occlusion and pose estimation priors, existing methods struggle to simultaneously produce high-quality geometry and accurate poses under severe occlusion in open-set settings. To address these issues, we first decouple the de-occlusion model from 3D object generation and enhance it by leveraging image datasets and collected de-occlusion datasets covering much more diverse open-set occlusion patterns. We then propose a unified pose estimation model that integrates global and local mechanisms into both self-attention and cross-attention to improve accuracy. In addition, we construct an open-set 3D scene dataset to further extend the generalization of the pose estimation model. Comprehensive experiments demonstrate the superiority of our decoupled framework on both indoor and open-set scenes. Our code and datasets will be released.
Our framework consists of three main components:
- Scene Perception: Understanding the input scene structure
- 3D Object Generation under Occlusion: Decoupled de-occlusion model for robust object generation
- Pose Estimation: Unified pose estimation model with global and local attention mechanisms
- ✅ Dataset: Available
- ✅ Inference Code: Released
- ✅ Training Code: Released
Note: The open-source release uses FLUX Kontext as the de-occlusion model and Step1X-3D as the 3D generation model, which differs slightly from the exact implementation described in the paper.
- Install Python dependencies for Python 3.10:

```shell
pip install -r requirements.txt
```

- Install MoGe for depth estimation:
- MoGe repo: https://github.com/microsoft/MoGe
- Please follow the official MoGe repository instructions for installation.
- Install Step1X-3D for 3D object generation:

```shell
git clone --depth 1 --branch main https://github.com/stepfun-ai/Step1X-3D.git
```

- Download checkpoints from Hugging Face, and place them in the corresponding folders (`ckpts/`):
- SceneMaker checkpoints: https://huggingface.co/horizon171852/SceneMaker
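Before launching the demo, it can help to sanity-check that the checkpoints are in place. A minimal sketch follows; the subfolder names `indoor` and `open-set` are assumptions based on the checkpoint options mentioned below, so adjust them to match the actual layout of the downloaded files:

```shell
# Hedged sketch: verify that the expected checkpoint folders exist under ckpts/.
# The subfolder names "indoor" and "open-set" are assumptions; adjust them to
# match the layout of the downloaded SceneMaker checkpoints.
CKPT_DIR="${CKPT_DIR:-ckpts}"
missing=0
for sub in indoor open-set; do
  if [ -d "$CKPT_DIR/$sub" ]; then
    echo "found: $CKPT_DIR/$sub"
  else
    echo "missing: $CKPT_DIR/$sub"
    missing=$((missing + 1))
  fi
done
echo "missing checkpoint folders: $missing"
```

If any folder is reported missing, re-download the corresponding checkpoint before running the scripts below.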
Launch the Gradio demo:

```shell
bash run_gradio.sh
```

Select the corresponding checkpoints (indoor / open-set), and run:

```shell
bash run_generation.sh
```

Download the required datasets:
- InstPIFu: https://github.com/GAP-LAB-CUHK-SZ/InstPIFu
- MIDI-3D: https://github.com/VAST-AI-Research/MIDI-3D
- SceneMaker OpenSet Dataset: https://huggingface.co/datasets/LightillusionsLab/
Select a config in `configs/image-to-scene-diffusion` and run:

```shell
bash run_train.sh
```

If you find our work useful in your research, please consider citing:
```bibtex
@article{shi2025scenemaker,
  title={SceneMaker: Open-set 3D Scene Generation with Decoupled De-occlusion and Pose Estimation Model},
  author={Shi, Yukai and Li, Weiyu and Wang, Zihao and Li, Hongyang and Chen, Xingyu and Tan, Ping and Zhang, Lei},
  journal={arXiv preprint arXiv:2512.10957},
  year={2025}
}
```

We would like to thank the authors of the following projects for their excellent work and open-source contributions:
- MoGe - Monocular depth estimation
- SAM - Segment Anything Model for image segmentation
- DINO-X - Grounded detection and segmentation
- CraftsMan - 3D object generation
- Step1x-3D - 3D object generation
- Hunyuan3D - 3D object generation
- MIDI-3D - Multi-instance 3D scene generation
- InstPIFu - Indoor 3D scene generation
Their contributions have been invaluable to the development of SceneMaker.
See LICENSE file for details.