☀️ OpenSUN 3D 🌍

2nd Workshop on Open-Vocabulary 3D Scene Understanding

in conjunction with CVPR 2024, Seattle, USA.

Tuesday, June 18 (afternoon)

Motivation 💡

The ability to perceive, understand, and interact with arbitrary 3D environments is a long-standing goal in both academia and industry, with applications in AR/VR as well as robotics. Current 3D scene understanding models are largely limited to recognizing a closed set of pre-defined object classes. Recently, large vision-language models, such as CLIP, have demonstrated impressive capabilities after training solely on internet-scale image-language pairs. Some initial works have shown that these models have the potential to extend 3D scene understanding not only to open-set recognition but also to additional applications such as affordances, materials, activities, and properties of unseen environments. The goal of this workshop is to bundle these initial siloed efforts and to discuss and establish clear task definitions, evaluation metrics, and benchmark datasets.

Schedule ⏰ (tentative)

13:20 - 13:30 Welcome & Introduction
13:30 - 14:00 Keynote 1
14:00 - 14:30 Keynote 2
14:30 - 14:45 Oral Sessions / Challenge Winners
14:45 - 15:15 Keynote 3
15:15 - 16:00 Poster Session & Coffee Break
16:00 - 16:30 Keynote 4
16:30 - 17:00 Keynote 5
17:00 - 17:30 Panel Discussion

Invited Speakers 🧑‍🏫

Kristen Grauman

University of Texas at Austin

Kristen Grauman is a Professor in the Department of Computer Science at the University of Texas at Austin and a Research Director in Facebook AI Research (FAIR). Her research in computer vision and machine learning focuses on video, visual recognition, and action for perception or embodied AI. Before joining UT-Austin in 2007, she received her Ph.D. at MIT. She is an IEEE Fellow, AAAI Fellow, Sloan Fellow, and Microsoft Research New Faculty Fellow, and a recipient of NSF CAREER and ONR Young Investigator awards, the PAMI Young Researcher Award in 2013, the 2013 Computers and Thought Award from the International Joint Conference on Artificial Intelligence (IJCAI), and the Presidential Early Career Award for Scientists and Engineers (PECASE) in 2013. She was inducted into the UT Academy of Distinguished Teachers in 2017. She and her collaborators have been recognized with several Best Paper awards in computer vision, including a 2011 Marr Prize and a 2017 Helmholtz Prize (test of time award).

Jiajun Wu

Stanford University

Jiajun Wu is an Assistant Professor of Computer Science at Stanford University, working on computer vision, machine learning, and computational cognitive science. Before joining Stanford, he was a Visiting Faculty Researcher at Google Research. He received his PhD in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology. Wu's research has been recognized through the Young Investigator Programs (YIP) by ONR and by AFOSR, paper awards and finalists at ICCV, CVPR, SIGGRAPH Asia, CoRL, and IROS, dissertation awards from ACM, AAAI, and MIT, the 2020 Samsung AI Researcher of the Year, and faculty research awards from J.P. Morgan, Samsung, Amazon, and Meta.

Chung Min Kim

University of California, Berkeley

Chung Min Kim is a PhD student at UC Berkeley, where she is advised by Ken Goldberg and Angjoo Kanazawa. She received her dual B.S. degree in EECS (Electrical Engineering and Computer Science) and Mechanical Engineering from UC Berkeley in 2021. She is currently funded by the NSF GRFP. Her research interests include 3D scene understanding for computer vision and robotics. In particular, she is interested in modeling multi-scale semantics in 3D using large vision-language models. Her goal is to apply these models to robots in the real world, which is challenging due to the lack of structure and the large variability of real-world environments.

Justin Kerr

University of California, Berkeley

Justin Kerr is a PhD student at UC Berkeley, co-advised by Ken Goldberg and Angjoo Kanazawa, working primarily on NeRF for robot manipulation, 3D scene understanding, and visuo-tactile representation learning. Recently, Justin has been interested in leveraging NeRF for language grounding and in how it could change the way we interact with 3D. His work is supported by the NSF GRFP. Previously, he finished his bachelor's at CMU, where he worked with Howie Choset on multi-agent path planning, and spent time at Berkshire Grey and NASA's JPL.

Related Works 🧑‍🤝

Below is a collection of concurrent and related works in the field of open-set 3D scene understanding. Please feel free to get in touch to add other works as well.

Important Dates 🗓️

Paper Track: We accept novel full 8-page papers for publication in the proceedings, as well as shorter 4-page extended abstracts or 8-page papers of novel or previously published work that will not be included in the proceedings. All submissions must follow the CVPR 2024 author guidelines.
  • Submission Portal: CMT
  • Paper Submission Deadline: April 1, 2024 (23:59 Pacific Time)
  • Notification to Authors: April 9, 2024
  • Camera-ready submission: April 14, 2024

This year, our challenge will consist of two tracks: open-vocabulary 3D object instance search and open-vocabulary 3D functionality grounding.
  • Challenge Track 1: Open-vocabulary 3D object instance search
    • Submission Portal: EvalAI
    • Data Instructions & Helper Scripts: April 17, 2024
    • Dev Phase Start: April 17, 2024
    • Submission Portal Start: April 19, 2024
    • Test Phase Start: May 1, 2024
    • Test Phase End: June 8, 2024 (23:59 Pacific Time)
  • Challenge Track 2: Open-vocabulary 3D functionality grounding
    • Submission Portal: EvalAI
    • Data Instructions & Helper Scripts: April 17, 2024
    • Dev Phase Start: April 17, 2024
    • Submission Portal Start: April 19, 2024
    • Test Phase Start: May 1, 2024
    • Test Phase End: June 8, 2024 (23:59 Pacific Time)
Please see this page for an overview of last year's challenge results. We have also published a technical report providing an overview of our ICCV 2023 workshop challenge.