Introduction
The ability to perceive, understand, and interact with arbitrary 3D environments is a long-standing research goal with applications in AR/VR, healthcare, and robotics.
Current 3D scene understanding models are largely limited to low-level recognition tasks such as object
detection or semantic segmentation,
and they do not generalize well beyond a pre-defined set of training labels.
More recently, large vision-language models (VLMs), such as CLIP, have demonstrated impressive capabilities
when trained solely on internet-scale image-language pairs.
Initial works have shown that these models have the potential not only to extend 3D scene understanding
to open-set recognition, but also to enable additional applications such as reasoning about affordances, materials, activities,
and properties of unseen environments.
The goal of this workshop is to bring these efforts together and to discuss and establish clear task
definitions, evaluation metrics, and benchmark datasets.