HUSH: Holistic Panoramic 3D Scene Understanding using Spherical Harmonics

Jongsung Lee¹, Harin Park¹, Byeong-Uk Lee², Kyungdon Joo¹✝

¹3D Vision & Robotics Lab, UNIST ²KRAFTON
✝Corresponding Author

CVPR 2025

TL;DR: HUSH performs a range of panoramic image-based 3D perception tasks by using, for each task, spherical harmonics basis functions that are task-relevant and geometrically aligned with the scene.

🌟 Key insight: SH basis functions seem geometrically aligned with the signals (e.g., depth/normal) on the unit sphere!
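To make the key insight concrete, here is a minimal NumPy sketch that evaluates the real SH basis up to degree 2 on an equirectangular grid, checks that the basis is approximately orthonormal under solid-angle weighting, and projects a toy signal onto it. The grid size, the function names (`equirect_directions`, `real_sh_basis`), and the toy signal are assumptions made for illustration; this is not the HUSH code, and HUSH predicts its coefficients with a network rather than by direct projection.

```python
import numpy as np

def equirect_directions(height, width):
    """Unit direction and solid angle for each pixel centre of an equirectangular grid."""
    theta = (np.arange(height) + 0.5) / height * np.pi    # polar angle in (0, pi)
    phi = (np.arange(width) + 0.5) / width * 2.0 * np.pi  # azimuth in (0, 2*pi)
    theta, phi = np.meshgrid(theta, phi, indexing="ij")
    xyz = np.stack([np.sin(theta) * np.cos(phi),
                    np.sin(theta) * np.sin(phi),
                    np.cos(theta)], axis=-1)
    # Pixels near the poles cover less of the sphere, hence the sin(theta) factor.
    solid_angle = np.sin(theta) * (np.pi / height) * (2.0 * np.pi / width)
    return xyz, solid_angle

def real_sh_basis(xyz):
    """Real spherical harmonics of degree 0..2, returned with shape (9, H, W)."""
    x, y, z = xyz[..., 0], xyz[..., 1], xyz[..., 2]
    c0 = 0.5 * np.sqrt(1.0 / np.pi)
    c1 = np.sqrt(3.0 / (4.0 * np.pi))
    c2 = 0.5 * np.sqrt(15.0 / np.pi)
    c3 = 0.25 * np.sqrt(5.0 / np.pi)
    c4 = 0.25 * np.sqrt(15.0 / np.pi)
    return np.stack([
        np.full_like(x, c0),                       # degree 0
        c1 * y, c1 * z, c1 * x,                    # degree 1
        c2 * x * y, c2 * y * z,                    # degree 2
        c3 * (3.0 * z ** 2 - 1.0),
        c2 * x * z, c4 * (x ** 2 - y ** 2),
    ])

# Hypothetical 64x128 panorama grid.
xyz, d_omega = equirect_directions(64, 128)
basis = real_sh_basis(xyz)
flat = basis.reshape(9, -1)
weights = d_omega.reshape(-1)

# Gram matrix under solid-angle weighting; close to the identity for an orthonormal basis.
gram = (flat * weights) @ flat.T

# Toy smooth signal on the sphere (a stand-in for a depth-like map, not real data).
signal = 2.0 + 0.5 * xyz[..., 2] + 0.3 * xyz[..., 0] * xyz[..., 1]
coeffs = (flat * weights) @ signal.reshape(-1)   # projection onto each basis function
recon = np.tensordot(coeffs, basis, axes=1)      # weighted sum of basis functions
```

Because the toy signal lies in the span of the degree 0 to 2 basis, nine coefficients reconstruct it up to quadrature error; real depth or normal maps would need higher degrees to capture finer structure.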

Abstract

Motivated by the efficiency of spherical harmonics (SH) in representing various physical phenomena, we propose a Holistic panoramic 3D scene Understanding framework using Spherical Harmonics, dubbed HUSH. Our approach focuses on a unified framework adaptable to various 3D scene understanding tasks via SH bases. To achieve this, we first estimate SH coefficients, allowing for the adaptive configuration of the SH bases specific to each scene. HUSH then employs a hierarchical attention module that uses SH bases as queries to generate comprehensive scene features by integrating these scene-adaptive SH bases with image features. Additionally, we introduce an SH basis index module that adaptively emphasizes relevant SH bases to produce task-relevant features, enhancing the versatility of HUSH across different scene understanding tasks. Finally, by combining the scene features with task-relevant features in the task-specific heads, we perform various scene understanding tasks, including depth, surface normal, and room layout estimation. Experiments demonstrate that HUSH achieves state-of-the-art performance on depth estimation benchmarks, highlighting the robustness and scalability of using SH in panoramic 3D scene understanding.

Methods

HUSH first extracts multi-scale image features $f_i$ and scene-wise SH bases using a feature extractor and an SH coefficient network, respectively. The image features and the scene-wise SH bases are then fed into the SH-based hierarchical attention module and the SH basis index module, which estimate a comprehensive scene feature $f_S$ and the task-relevant features ($f_D$, $f_N$, $f_L$) for depth, surface normal, and layout estimation. Finally, these features pass through task-specific heads to perform each scene understanding task.
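The data flow described above can be sketched in a few lines of NumPy. Everything in this block is a hypothetical simplification for illustration: the single attention level, the way queries are formed by pooling features under each basis, the use of coefficients as per-basis scales, the softmax weighting over bases, and all names and shapes are assumptions, not the authors' architecture. Random arrays stand in for the SH bases, the predicted coefficients, and the image features.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sh_query_attention(bases, coeffs, feats):
    """Single-level cross-attention with basis-derived queries.

    bases:  (K, N) basis functions sampled at N pixel locations
    coeffs: (K,)   per-scene coefficients (assumed here to rescale each basis)
    feats:  (N, C) image features, one C-dim vector per pixel
    Returns one token per basis, shape (K, C), and attention weights (K, N).
    """
    scene_bases = coeffs[:, None] * bases                  # assumption: scene-wise bases
    queries = scene_bases @ feats / feats.shape[0]         # (K, C) features pooled per basis
    scores = queries @ feats.T / np.sqrt(feats.shape[1])   # (K, N) scaled dot products
    attn = softmax(scores, axis=-1)                        # each query attends over pixels
    return attn @ feats, attn

def basis_index(tokens, task_logits):
    """Weight the per-basis tokens with a task-specific distribution over bases."""
    weights = softmax(task_logits)    # (K,) emphasis on each basis for this task
    return weights @ tokens, weights  # (C,) task-relevant feature

# Hypothetical sizes: 9 bases, a 32x64 feature map, 16 channels.
rng = np.random.default_rng(0)
K, N, C = 9, 32 * 64, 16
bases = rng.standard_normal((K, N))   # stand-in for sampled SH bases
coeffs = rng.standard_normal(K)       # stand-in for network-predicted coefficients
feats = rng.standard_normal((N, C))   # stand-in for image features f_i

scene_tokens, attn = sh_query_attention(bases, coeffs, feats)
f_D, w_D = basis_index(scene_tokens, rng.standard_normal(K))  # e.g. a depth-task feature
```

In this sketch the per-basis tokens play the role of the scene feature $f_S$, and calling `basis_index` with different logits would give separate features for depth, normals, and layout; the task-specific heads that consume them are omitted.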

Task-relevant SH bases

To validate the effectiveness of using SH basis functions as queries, we compare conventional learnable (LR) queries with our SH queries. We visualize the most frequently referenced queries in both the 2D and 3D domains with respect to the target task (depth or normal estimation). As shown below, SH basis function queries preserve the geometric consistency of the scene better than LR queries.

Visualization on 2D

Visualization on 3D

Results

Depth estimation

Surface normal estimation

Layout estimation

Comparison on 3D