In summary, the contributions of our work are as follows:
- We construct new datasets for scene modification tasks, Nr3D-SA and Sr3D-SA, by designing a GPT-aided data pipeline that paraphrases the descriptive texts in the ReferIt3D dataset into generative instructions (a sketch of this step follows the list).
- We propose an end-to-end multi-modal diffusion-based neural network that generates indoor 3D objects in given scenes according to input instructions.
- We propose quantized position prediction, a simple but effective technique that predicts the Top-K candidate positions, mitigating the false-negative problem arising from language ambiguity and providing reasonable placement options (see the sketch after this list).
- We introduce the visual grounding task as an evaluation strategy for assessing the quality of generated scenes and integrate several metrics to evaluate the generated objects.
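The following is a minimal sketch of the GPT-aided paraphrasing step, assuming the OpenAI Python client; the model name, system prompt, and `to_generative_instruction` helper are illustrative assumptions, not the paper's exact pipeline.

```python
# Hypothetical sketch: paraphrase a ReferIt3D descriptive utterance into a
# generative instruction. Model name and prompt are illustrative only.
from openai import OpenAI

client = OpenAI()

def to_generative_instruction(descriptive_text: str) -> str:
    """Rewrite a referring description as an object-placement instruction."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Rewrite the referring description as an instruction "
                        "to add the described object to the scene."},
            {"role": "user", "content": descriptive_text},
        ],
    )
    return response.choices[0].message.content

# e.g. "the chair next to the window" -> "Place a chair next to the window."
```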
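Below is a minimal sketch of quantized position prediction, assuming the scene floor is discretized into a G x G grid of candidate cells and that a fused multi-modal feature is already available; the module name, grid size, and feature dimension are assumptions for illustration, not the paper's exact architecture.

```python
# Sketch of a quantized position head: classify over G*G floor cells, then
# return the Top-K cell centers as candidate placement positions.
import torch
import torch.nn as nn


class QuantizedPositionHead(nn.Module):
    """Predicts a categorical distribution over G*G quantized floor cells."""

    def __init__(self, feat_dim: int = 256, grid_size: int = 32):
        super().__init__()
        self.grid_size = grid_size
        self.classifier = nn.Linear(feat_dim, grid_size * grid_size)

    def forward(self, fused_feat: torch.Tensor) -> torch.Tensor:
        # fused_feat: (B, feat_dim) fused scene/instruction feature.
        return self.classifier(fused_feat)  # (B, G*G) cell logits

    @torch.no_grad()
    def top_k_positions(self, fused_feat: torch.Tensor, k: int = 5):
        """Return Top-K candidate cell centers in normalized [0, 1] coords."""
        probs = self.forward(fused_feat).softmax(dim=-1)
        scores, idx = probs.topk(k, dim=-1)             # (B, K)
        rows = idx // self.grid_size                     # cell row indices
        cols = idx % self.grid_size                      # cell column indices
        # Cell centers: offset by half a cell, normalize by grid size.
        xy = torch.stack([cols + 0.5, rows + 0.5], dim=-1) / self.grid_size
        return xy, scores                                # (B, K, 2), (B, K)


if __name__ == "__main__":
    head = QuantizedPositionHead(feat_dim=256, grid_size=32)
    feat = torch.randn(2, 256)                           # dummy fused features
    positions, scores = head.top_k_positions(feat, k=5)
    print(positions.shape, scores.shape)                 # (2, 5, 2), (2, 5)
```

Returning K scored candidates rather than a single regressed coordinate is what lets the evaluation count a prediction as correct when any plausible position matches, which is how the false-negative problem from ambiguous instructions is mitigated.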