

4th Workshop on Open-World 3D Scene Understanding with Foundation Models
13:45 - 14:00 | Welcome & Introduction
14:00 - 14:30 | Keynote 1: Jeannette Bohg (Stanford), "Challenges and Opportunities of Mobile Manipulation"
14:30 - 15:00 | Keynote 2: Laura Leal-Taixé (NVIDIA), "Towards a Foundation Model for 4D Lidar"
15:00 - 15:45 | Coffee Break & Poster Session in Hall D (Level 3, Boards 452-470)
15:45 - 16:15 | Keynote 3: Afshin Dehghan (Apple), "3D Scene Intelligence"
16:15 - 16:45 | Keynote 4: Lukas Schmid (MIT), "Hierarchical Methods for Task-driven and Dynamic Scene Understanding"
16:45 - 17:15 | Keynote 5: Björn Ommer (LMU), "Efficient Repurposing of T2I Representations Across Modalities"
17:15 - 17:45 | Challenge Winner: Jaime Corsetti (FBK), "Functionality Understanding and Segmentation in 3D Scenes"
17:45 - 18:00 | Concluding Remarks
Dr. Laura Leal-Taixé is a Senior Research Manager at NVIDIA and an Adjunct Professor at the Technical University of Munich (TUM), where she leads the Dynamic Vision and Learning group. From 2018 to 2022, she was a tenure-track professor at TUM. Before that, she spent two years as a postdoctoral researcher at ETH Zurich, Switzerland, and a year as a senior postdoctoral researcher in the Computer Vision Group at TUM. She obtained her PhD from Leibniz University Hannover in Germany, spending a year as a visiting scholar at the University of Michigan, Ann Arbor, USA. She received her B.Sc. and M.Sc. in Telecommunications Engineering from the Technical University of Catalonia (UPC) in her native city of Barcelona, and completed her Master's thesis at Northeastern University in Boston, USA, with a fellowship from the Vodafone Foundation. She is a recipient of the Sofja Kovalevskaja Award of 1.65 million euros in 2017, a Google Faculty Award in 2019, and an ERC Starting Grant in 2021.
Lukas Schmid is a Research Scientist in the MIT SPARK Lab, led by Prof. Luca Carlone at the Massachusetts Institute of Technology (MIT). Before that, he was a Postdoctoral Fellow at MIT SPARK and, briefly, a Postdoctoral Researcher at the Autonomous Systems Lab (ASL) led by Prof. Roland Siegwart at ETH Zürich (ETHZ). He earned his PhD in 2022 from ASL at ETHZ, where he also obtained his M.Sc. in Robotics, Systems, and Control (RSC) in 2019, and was a visiting researcher at the Microsoft Spatial AI Lab led by Prof. Marc Pollefeys in 2022. His work has been recognized by several honors, including RSS Pioneers 2025, the RSS 2024 Outstanding Systems Paper Award, two ETH Medals for outstanding PhD and M.Sc. theses, the Willi Studer Prize for the best graduate of the year at ETHZ, first place in the 2024 Hilti SLAM Challenge, and a Swiss National Science Foundation (SNSF) Postdoc Fellowship. His research focuses on active perception and understanding of complex, dynamic, human-centric environments for robot autonomy and augmented reality. This includes dense geometric and semantic scene representations and abstraction; detection, prediction, and understanding of moving and changing entities; and lifelong learning for continuous improvement and adaptation to the robot's environment, embodiment, and human preferences.
Jeannette Bohg is an Assistant Professor of Computer Science at Stanford University. She was a group leader at the Autonomous Motion Department (AMD) of the Max Planck Institute for Intelligent Systems until September 2017. Before joining AMD in January 2012, she was a PhD student in the Division of Robotics, Perception and Learning (RPL) at KTH in Stockholm. In her thesis, she proposed novel methods for multi-modal scene understanding for robotic grasping. She also studied at Chalmers in Gothenburg and at the Technical University of Dresden, where she received her Master's in Art and Technology and her Diploma in Computer Science, respectively. Her research focuses on perception and learning for autonomous robotic manipulation and grasping. She is specifically interested in developing methods that are goal-directed, real-time, and multi-modal, so that they can provide meaningful feedback for execution and learning. She has received several Early Career and Best Paper awards, most notably the 2019 IEEE Robotics and Automation Society Early Career Award and the 2020 Robotics: Science and Systems Early Career Award.
Björn Ommer is a full professor of computer science at LMU Munich, where he leads the Computer Vision & Learning Group. Before joining LMU, he was a full professor at Heidelberg University and a director at both the Interdisciplinary Center for Scientific Computing (IWR) and the Heidelberg Collaboratory for Image Processing (HCI). He holds a Ph.D. from ETH Zurich and a diploma from the University of Bonn, and he was a postdoctoral researcher at UC Berkeley. His research focuses on generative AI, visual understanding, and explainable neural networks. His group developed several influential approaches in generative modeling, such as Stable Diffusion, which has seen broad adoption across academia, industry, and beyond. Björn is a director of the Bavarian AI Council, an ELLIS Fellow, and has served in senior roles at major conferences such as CVPR, ICCV, ECCV, and NeurIPS. His most recent recognitions include the German AI Prize 2024, the Eduard Rhein Technology Award, and a nomination for the German Future Prize by the President of Germany.
Afshin Dehghan leads the Multimodal Intelligence group in Hardware Technology at Apple, where he drives research and development in multimodal technologies that bridge vision, language, and spatial understanding. His team developed RoomPlan, Apple's 3D parametric room mapping solution, which set a new benchmark in spatial computing by leveraging LiDAR for high-fidelity scene understanding. His group has shipped core 2D and 3D perception technologies across iOS and Apple Vision Pro, and is now advancing Apple Intelligence through work on visual foundation models.
Challenge Winners
Our workshop challenge is proudly supported by: