3rd Workshop on
Open-Vocabulary 3D Scene Understanding

in conjunction with ECCV 2024 in Milan, Italy.

September 29, Sunday Afternoon


The ability to perceive, understand and interact with arbitrary 3D environments is a long-standing research goal, with applications in AR/VR, health, robotics and beyond. Current 3D scene understanding models are largely limited to low-level recognition tasks such as object detection or semantic segmentation, and do not generalize well beyond a pre-defined set of training labels. More recently, large vision-language models (VLMs) such as CLIP have demonstrated impressive capabilities despite being trained solely on internet-scale image-text pairs. Initial works have shown that these models have the potential to extend 3D scene understanding not only to open-set recognition, but also to additional applications such as reasoning about affordances, materials, activities, and properties of unseen environments. The goal of this workshop is to bring these efforts together and to discuss and establish clear task definitions, evaluation metrics, and benchmark datasets.


13:30 - 13:45 Welcome & Introduction
13:45 - 14:15 Keynote 1
14:15 - 14:45 Keynote 2
14:45 - 15:00 Winner Presentations
15:00 - 15:45 Poster Session & Coffee Break
15:45 - 16:15 Keynote 3
16:15 - 16:45 Keynote 4
16:45 - 17:30 Concluding Remarks

Keynote Speakers

Dr. Laura Leal-Taixé is a Senior Research Manager at NVIDIA and an Adjunct Professor at the Technical University of Munich (TUM), where she leads the Dynamic Vision and Learning group. From 2018 until 2022, she was a tenure-track professor at TUM. Before that, she spent two years as a postdoctoral researcher at ETH Zurich, Switzerland, and a year as a senior postdoctoral researcher in the Computer Vision Group at TUM. She obtained her PhD from Leibniz University Hannover in Germany, spending a year as a visiting scholar at the University of Michigan, Ann Arbor, USA. She earned her B.Sc. and M.Sc. in Telecommunications Engineering at the Technical University of Catalonia (UPC) in her native city of Barcelona, and completed her Master's thesis at Northeastern University in Boston, USA, with a fellowship from the Vodafone Foundation. She is a recipient of the Sofja Kovalevskaja Award of 1.65 million euros in 2017, the Google Faculty Award in 2021, and the ERC Starting Grant in 2022.

Krishna Murthy Jatavallabhula is a postdoc at MIT CSAIL with Antonio Torralba and Josh Tenenbaum. He received his PhD at Mila, advised by Liam Paull. His research focuses on designing structured world models for robots: rich, multisensory models of the physical world that enable robots and embodied AI systems to perceive, reason, and act as humans do. His work draws upon ideas from robotics, computer vision, graphics, and computational cognitive science, intertwining our understanding of the world with probabilistic inference and deep learning. His work has been recognized with PhD fellowship awards from NVIDIA and Google, and a best-paper award from IEEE RA-L.

Since April 2023, Georgia Chalvatzaki has been a Full Professor for Interactive Robot Perception & Learning at the Computer Science Department of the Technical University of Darmstadt and Hessian.AI. Before that, she was an Assistant Professor from February 2022 and an Independent Research Group Leader from March 2021, after receiving a grant from the renowned Emmy Noether Programme (ENP) of the German Research Foundation (DFG), awarded within the DFG's ENP Artificial Intelligence call. In her research group, PEARL (previously iROSA), Dr. Chalvatzaki and her team develop new methods at the intersection of machine learning and classical robotics, taking research on embodied AI robotic assistants one step further. PEARL's research proposes novel methods for combined planning and learning that enable mobile manipulator robots to solve complex tasks in house-like environments, with the human in the loop of the interaction process.

Alex Bewley is a Researcher at Google Zurich, Switzerland, where he investigates novel approaches to machine learning and perception. Previously, he was a postdoc at the Applied Artificial Intelligence Lab at the University of Oxford (formerly part of the Mobile Robotics Group), working with Ingmar Posner and Paul Newman. There, his research spanned domains including multi-task learning, unsupervised domain adaptation, visual attention, model introspection and interpretability. He completed his PhD research at the Queensland University of Technology (Australia) alongside the ARC Centre of Excellence for Robotic Vision. His PhD focused on the automatic detection and tracking of moving objects in video data, with applications in field robotics.


Paper Track: We accept novel full 14-page papers for publication in the proceedings, as well as either shorter 4-page extended abstracts or 14-page papers of novel or previously published work that will not be included in the proceedings. Full papers should use the official ECCV 2024 template. Extended abstracts are not subject to ECCV rules and may use any template; however, so that they are not considered publications under double-submission policies, they should be limited to 4 pages in the CVPR template format.


This website is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
It borrows the source code of this website. We would like to thank Utkarsh Sinha and Keunhong Park.