Context-Aware Indoor Point Cloud Object Generation through User Instructions

1Nanyang Technological University, 2Tsinghua University, 3University of Science and Technology of China
ACM MM '24
*Equal Contribution

Abstract

Indoor scene modification has emerged as a prominent area within computer vision, particularly for its applications in Augmented Reality (AR) and Virtual Reality (VR). Traditional methods often rely on pre-existing object databases and predetermined object positions, limiting their flexibility and adaptability to new scenarios. In response to this challenge, we present a novel end-to-end multi-modal deep neural network capable of generating point cloud objects that integrate seamlessly with their surroundings, driven by textual instructions. Our work proposes a novel approach to scene modification by enabling the creation of new environments with previously unseen object layouts, eliminating the need for pre-stored CAD models. Leveraging Point-E as our generative model, we introduce techniques such as quantized position prediction and Top-K estimation to address false negatives caused by ambiguous language descriptions. Furthermore, we conduct comprehensive evaluations of the diversity of generated objects, the efficacy of textual instructions, and quantitative metrics, affirming the realism and versatility of our model in generating indoor objects. To provide a holistic assessment, we incorporate visual grounding as an additional metric, ensuring the quality and coherence of the scenes produced by our model. Through these advancements, our approach not only advances the state of the art in indoor scene modification but also lays the foundation for future innovations in immersive computing and digital environment creation.

Summary

In response to the query, our model generates a couch positioned close to the television and keeps it consistent with the rest of the scene, i.e., in orientation, size, and, in certain cases, overlap with other objects.

In summary, the contributions of our work are as follows:

  • We generate a new dataset for scene modification tasks by designing a GPT-aided data pipeline that paraphrases the descriptive texts in the ReferIt3D dataset into generative instructions, yielding the Nr3D-SA and Sr3D-SA datasets.
  • We propose an end-to-end multi-modal diffusion-based deep neural network model for generating indoor 3D objects within specific scenes according to input instructions.
  • We propose quantized position prediction, a simple but effective technique for predicting Top-K candidate positions, which mitigates false negatives arising from the ambiguity of language and provides reasonable options (a minimal sketch follows this list).
  • We introduce the visual grounding task as an evaluation strategy to assess the quality of a generated scene and integrate several metrics to evaluate the generated objects.
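
To make quantized position prediction concrete, below is a minimal PyTorch sketch under our own assumptions: the floor plan is discretized into a grid, a linear head classifies over the cells, and torch.topk returns the K candidate positions. The class name PositionHead, the feature dimension, and the grid resolution are illustrative and not taken from the paper.

import torch
import torch.nn as nn

class PositionHead(nn.Module):
    """Hypothetical head: classify over a quantized floor-plan grid."""
    def __init__(self, feat_dim: int = 512, grid_size: int = 32):
        super().__init__()
        self.grid_size = grid_size
        self.classifier = nn.Linear(feat_dim, grid_size * grid_size)

    def forward(self, fused_feat: torch.Tensor, k: int = 5):
        # fused_feat: (B, feat_dim) fused object + language features.
        logits = self.classifier(fused_feat)                  # (B, G*G)
        probs = logits.softmax(dim=-1)
        topk_probs, topk_idx = probs.topk(k, dim=-1)          # K candidate cells
        # Convert flat cell indices back to normalized (x, y) cell centers.
        ys = torch.div(topk_idx, self.grid_size, rounding_mode="floor").float() + 0.5
        xs = (topk_idx % self.grid_size).float() + 0.5
        centers = torch.stack([xs, ys], dim=-1) / self.grid_size  # (B, K, 2)
        return centers, topk_probs

head = PositionHead()
centers, scores = head(torch.randn(2, 512), k=5)   # 2 scenes, 5 candidates each

Treating position prediction as classification over quantized cells, rather than regressing a single coordinate, is what lets several plausible positions survive when the instruction is ambiguous.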

Methodology & Pipeline

Overview of the CaIPCG pipeline.

  1. Data Pipeline: A large language model (LLM) is used to paraphrase the descriptive text, combined with rule-based and manual corrections (see the sketch after this list).
  2. Model Pipeline: Given a generative instruction as the query and a point cloud as input, our model fuses object and language features to predict the final position. In addition, the language features are aligned across the model. The fused features are then processed by the Point-E model to generate a realistic object.
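
As an illustration of step 1, the following hedged sketch paraphrases a descriptive ReferIt3D sentence into a generative instruction with the OpenAI chat API and then applies a simple rule-based correction. The prompt wording, the model name, and the regular-expression rule are assumptions for illustration; the paper's actual prompts and correction rules are not reproduced here, and the manual correction step is omitted.

import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Rewrite the following referring description of an existing object as an "
    "instruction to generate that object at the described location.\n"
    "Description: {desc}\nInstruction:"
)

def paraphrase(desc: str) -> str:
    # LLM paraphrasing: descriptive text -> generative instruction.
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT.format(desc=desc)}],
    )
    return resp.choices[0].message.content.strip()

def rule_correct(text: str) -> str:
    # Example rule-based fix: ensure the sample starts with a generative verb.
    return re.sub(r"^(find|select|pick|choose)\b", "generate", text, flags=re.I)

instruction = rule_correct(paraphrase("the couch that is close to the television"))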

Experiments

Scene before and after modification.
Consistency with Surroundings

Most generated point clouds are located close to the reference object, and their shapes are consistent with the instructions.


Diversity.
Diversity of Generations

While maintaining consistency with the surrounding environment and instructions, our method creates meaningful variances in both shape and color.


Generated objects under instructions with slight variations.
Effectiveness of Instructions

Generated objects can exhibit variations in color, shape, and location while remaining aligned with the provided instructions and the context of surrounding objects.


Visual grounding analysis.
Quality through Visual Grounding Analysis

Our model is capable of generating scenes that are not only consistent but also easily recognizable by visual grounding models trained on the original dataset.
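
As a sketch of how this evaluation can be scored: a grounding model trained on the original dataset receives the modified scene together with the original referring description, and a scene counts as successful when the model selects the generated object. The predict callable below is a hypothetical placeholder; only the accuracy computation is implied by our description.

from typing import Callable, List
import numpy as np

def grounding_accuracy(
    scenes: List[np.ndarray],       # point clouds containing the generated object
    descriptions: List[str],        # original referring descriptions
    target_ids: List[int],          # index of the generated object in each scene
    predict: Callable[[np.ndarray, str], int],  # grounding model: (scene, text) -> object id
) -> float:
    # A scene counts as a hit when the grounding model picks the generated object.
    hits = sum(int(predict(s, d) == t) for s, d, t in zip(scenes, descriptions, target_ids))
    return hits / max(len(scenes), 1)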

BibTeX


@article{luo2024CaIPCG,
  title={Context-Aware Indoor Point Cloud Object Generation through User Instructions},
  author={Luo, Yiyang and Lin, Ke and Gu, Chao},
  journal={arXiv preprint arXiv:2311.16501},
  year={2023}
}