The ability to perceive, understand, and interact with arbitrary 3D environments is a long-standing goal in both academia and industry, with applications in AR/VR as well as robotics.
Current 3D scene understanding models are largely limited to recognizing a closed set of pre-defined object categories.
Recently, large vision-language models such as CLIP, trained solely on internet-scale image-text pairs, have demonstrated impressive capabilities.
Initial works have shown that these models have the potential to extend 3D scene understanding not only to open-set recognition, but also to additional applications such as estimating affordances, materials, activities, and properties of unseen environments.
The goal of this workshop is to consolidate these initial, siloed efforts and to discuss and establish clear task definitions, evaluation metrics, and benchmark datasets.