Open-vocabulary 3D scene understanding is a significant challenge in computer vision, with wide-ranging applications in embodied agents and augmented reality systems. Previous approaches have adopted Neural Radiance Fields (NeRFs) to analyze 3D scenes. In this paper, we introduce Semantic Gaussians, a novel open-vocabulary scene understanding approach based on 3D Gaussian Splatting. Our key idea is to distill pre-trained 2D semantics into 3D Gaussians. We design a versatile projection approach that maps various 2D semantic features from pre-trained image encoders into a novel semantic component of 3D Gaussians, without the additional training required by NeRF-based methods. We further build a 3D semantic network that predicts the semantic component directly from raw 3D Gaussians for fast inference. We explore several applications of Semantic Gaussians. On ScanNet-20 semantic segmentation, our approach improves mIoU by 4.2% and mAcc by 4.0% over prior open-vocabulary scene understanding counterparts. On object part segmentation, scene editing, and spatio-temporal segmentation, it yields better qualitative results than 2D and 3D baselines, highlighting its versatility and effectiveness in supporting diverse downstream tasks.
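To illustrate the core 2D-to-3D projection idea, the following is a minimal single-view sketch: each Gaussian center is projected through a pinhole camera model, and the per-pixel feature from a pre-trained 2D encoder at that location becomes the Gaussian's semantic component. The function name, the nearest-pixel lookup, and the single-view simplification are illustrative assumptions, not the paper's exact multi-view fusion procedure.

```python
import numpy as np

def project_features_to_gaussians(means3d, feat_map, K, w2c):
    """Assign each 3D Gaussian the 2D feature at its projected pixel.

    means3d  : (N, 3) Gaussian centers in world coordinates
    feat_map : (H, W, C) per-pixel features from a pre-trained 2D encoder
    K        : (3, 3) camera intrinsics
    w2c      : (4, 4) world-to-camera extrinsics
    Returns (N, C) per-Gaussian semantic features (zeros when off-screen).
    """
    N = means3d.shape[0]
    H, W, C = feat_map.shape

    # Homogeneous world -> camera coordinates
    pts_h = np.concatenate([means3d, np.ones((N, 1))], axis=1)
    cam = (w2c @ pts_h.T).T[:, :3]
    in_front = cam[:, 2] > 1e-6

    # Perspective projection to pixel coordinates
    uvw = (K @ cam.T).T
    uv = uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-6, None)
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)

    # Keep only Gaussians that land inside the image and in front of the camera
    visible = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    out = np.zeros((N, C), dtype=feat_map.dtype)
    out[visible] = feat_map[v[visible], u[visible]]
    return out
```

In a multi-view setting, features gathered from several frames would be fused per Gaussian (e.g., averaged over the views in which it is visible); occlusion handling via rendered depth is also omitted here for brevity.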
@misc{guo2024semantic,
title={Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting},
author={Jun Guo and Xiaojian Ma and Yue Fan and Huaping Liu and Qing Li},
year={2024},
eprint={2403.15624},
archivePrefix={arXiv},
primaryClass={cs.CV}
}