Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting

1National Key Laboratory of General Artificial Intelligence, Beijing Institute for General Artificial Intelligence (BIGAI) 2Department of Computer Science and Technology, Tsinghua University

TL;DR: We propose Semantic Gaussians, a versatile framework for open-vocabulary scene understanding on off-the-shelf 3D Gaussian Splatting scenes.



Figure 1. Overview of our Semantic Gaussians. We inject semantic features into off-the-shelf 3D Gaussian Splatting by either projecting semantic features from pre-trained 2D encoders or directly predicting pointwise embeddings by a 3D semantic network (or fusing these two). The newly added semantic components of 3D Gaussians open up diverse applications centered around open-vocabulary scene understanding.
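The projection branch in Figure 1 lifts per-pixel features from a pre-trained 2D encoder onto Gaussian centers by projecting each center into the image plane. A minimal NumPy sketch of this idea is below; the function name, argument layout, and the nearest-pixel lookup are illustrative assumptions, not the paper's exact implementation (which rasterizes Gaussians rather than treating them as points).

```python
import numpy as np

def project_features_to_gaussians(means3d, feat_map, K, w2c, img_hw):
    """Lift per-pixel 2D features onto 3D Gaussian centers (illustrative sketch).

    means3d:  (N, 3) Gaussian centers in world coordinates.
    feat_map: (C, H, W) feature map from a pre-trained 2D encoder.
    K:        (3, 3) camera intrinsics; w2c: (4, 4) world-to-camera transform.
    Returns per-Gaussian features (N, C) and a visibility mask (N,).
    """
    H, W = img_hw
    N = means3d.shape[0]
    # Transform Gaussian centers into camera coordinates.
    pts_h = np.concatenate([means3d, np.ones((N, 1))], axis=1)
    cam = (w2c @ pts_h.T).T[:, :3]
    in_front = cam[:, 2] > 0
    # Perspective projection with intrinsics K.
    uv = (K @ cam.T).T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    visible = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    # Nearest-pixel feature lookup for visible Gaussians; invisible ones get zeros.
    C = feat_map.shape[0]
    feats = np.zeros((N, C), dtype=feat_map.dtype)
    feats[visible] = feat_map[:, v[visible], u[visible]].T
    return feats, visible
```

In a multi-view setting, features gathered from all views would then be averaged (or otherwise fused) per Gaussian to form the semantic component s2D.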

Abstract

Open-vocabulary 3D scene understanding presents a significant challenge in computer vision, with wide-ranging applications in embodied agents and augmented reality systems. Previous approaches have adopted Neural Radiance Fields (NeRFs) to analyze 3D scenes. In this paper, we introduce Semantic Gaussians, a novel open-vocabulary scene understanding approach based on 3D Gaussian Splatting. Our key idea is distilling pre-trained 2D semantics into 3D Gaussians. We design a versatile projection approach that maps various 2D semantic features from pre-trained image encoders into a novel semantic component of 3D Gaussians, without the additional training required by NeRFs. We further build a 3D semantic network that directly predicts the semantic component from raw 3D Gaussians for fast inference. We explore several applications of Semantic Gaussians: on ScanNet-20 semantic segmentation, our approach attains a 4.2% mIoU and 4.0% mAcc improvement over prior open-vocabulary scene understanding counterparts; on object part segmentation, scene editing, and spatiotemporal segmentation, it delivers better qualitative results than 2D and 3D baselines, highlighting its versatility and effectiveness in supporting diverse downstream tasks.


Figure 2. An illustration of the Semantic Gaussians pipeline. Upper left: our projection framework maps various pre-trained 2D features to the semantic component s2D of 3D Gaussians. Bottom left: we additionally introduce a 3D semantic network that directly predicts the semantic component s3D from raw 3D Gaussians, supervised by the projected s2D. Right: given an open-vocabulary text query, we compare its embedding against the semantic components (s2D, s3D, or their fusion) of the 3D Gaussians. The matched Gaussians are then splatted to render the 2D mask corresponding to the query.
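The query step on the right of Figure 2 amounts to a per-Gaussian similarity test between the text embedding and the semantic components. A minimal sketch, assuming cosine similarity and a fixed threshold (both illustrative choices; the text embedding is assumed to come from the same pre-trained encoder, e.g. CLIP, as the 2D features):

```python
import numpy as np

def query_gaussians(sem_feats, text_emb, threshold=0.5):
    """Match an open-vocabulary text embedding against per-Gaussian semantics.

    sem_feats: (N, C) semantic components of the Gaussians (s2D, s3D, or fusion).
    text_emb:  (C,) embedding of the text query from the matching encoder.
    Returns a boolean mask over Gaussians; the matched ones would then be
    splatted to render the query's 2D mask.
    """
    # L2-normalize both sides so the dot product is cosine similarity.
    f = sem_feats / np.linalg.norm(sem_feats, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    sim = f @ t  # (N,) cosine similarity per Gaussian
    return sim > threshold
```

For closed-set evaluation (e.g. ScanNet-20), the same similarities would instead be computed against every class prompt and each Gaussian assigned its argmax class.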

Semantic Segmentation Results

Qualitative results of semantic segmentation on the ScanNet-20 dataset.

Part Segmentation Results

Qualitative results of part segmentation on the MVImgNet dataset.

Spatiotemporal Tracking Results

A demo of spatiotemporal tracking of human parts and a basketball on the CMU Panoptic dataset.

Language-Guided Editing Results

Language-guided editing results on the room scene of the Mip-NeRF 360 dataset.

BibTeX

@misc{guo2024semantic,
  title={Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting},
  author={Jun Guo and Xiaojian Ma and Yue Fan and Huaping Liu and Qing Li},
  year={2024},
  eprint={2403.15624},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}