HUSH: Holistic Panoramic 3D Scene Understanding using Spherical Harmonics

Jongsung Lee¹,    Harin Park¹,    Byeong-Uk Lee²,    Kyungdon Joo¹✝  
¹3D Vision & Robotics Lab, UNIST      ²KRAFTON
✝Corresponding Author
CVPR 2025

TL;DR: HUSH performs multiple panoramic 3D perception tasks by using, for each task, the spherical harmonics (SH) basis functions that are task-relevant and geometrically aligned.

🌟 Key insight: SH basis functions seem geometrically aligned with the signals (e.g., depth/normal) on the unit sphere!




Abstract


Motivated by the efficiency of spherical harmonics (SH) in representing various physical phenomena, we propose a Holistic panoramic 3D scene Understanding framework using Spherical Harmonics, dubbed HUSH. Our approach focuses on a unified framework adaptable to various 3D scene understanding tasks via SH bases. To achieve this, we first estimate SH coefficients, allowing for the adaptive configuration of the SH bases specific to each scene. HUSH then employs a hierarchical attention module that uses SH bases as queries to generate comprehensive scene features by integrating these scene-adaptive SH bases with image features. Additionally, we introduce an SH basis index module that adaptively emphasizes relevant SH bases to produce task-relevant features, enhancing the versatility of HUSH across different scene understanding tasks. Finally, by combining the scene features with task-relevant features in the task-specific heads, we perform various scene understanding tasks, including depth, surface normal, and room layout estimation. Experiments demonstrate that HUSH achieves state-of-the-art performance on depth estimation benchmarks, highlighting the robustness and scalability of using SH in panoramic 3D scene understanding.


Methods


HUSH first extracts multi-scale image features $f_i$ and scene-wise SH bases using a feature extractor and an SH coefficient network. These image features and scene-wise SH bases are then fed into the SH-based hierarchical attention module and the SH basis index module, which estimate a comprehensive scene feature $f_S$ and task-relevant features ($f_D$, $f_N$, $f_L$). Finally, task-specific heads take these features and perform the individual scene understanding tasks.
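As a rough illustration of what an SH representation of a panorama looks like, the sketch below evaluates the real SH basis functions up to degree 2 at the pixel centres of an equirectangular grid and recovers the SH coefficients of a toy signal by numerical integration. It is not the HUSH SH coefficient network, which predicts coefficients from the image. The function names (`real_sh_basis`, `equirect_samples`, `project_onto_sh`), the grid size, and the degree cutoff are assumptions made for illustration only.

```python
import math


def real_sh_basis(x, y, z):
    """Real SH basis values up to degree 2 for a unit direction (9 values)."""
    c0 = 0.5 * math.sqrt(1.0 / math.pi)
    c1 = math.sqrt(3.0 / (4.0 * math.pi))
    c2 = 0.5 * math.sqrt(15.0 / math.pi)
    c3 = 0.25 * math.sqrt(5.0 / math.pi)
    return [
        c0,                                    # degree 0
        c1 * y, c1 * z, c1 * x,                # degree 1
        c2 * x * y, c2 * y * z,                # degree 2
        c3 * (3.0 * z * z - 1.0),
        c2 * x * z,
        0.5 * c2 * (x * x - y * y),
    ]


def equirect_samples(height, width):
    """Yield (unit direction, solid angle) for each equirectangular pixel centre."""
    d_theta = math.pi / height        # polar angle step (rows)
    d_phi = 2.0 * math.pi / width     # azimuth step (columns)
    for i in range(height):
        theta = (i + 0.5) * d_theta
        for j in range(width):
            phi = (j + 0.5) * d_phi
            direction = (
                math.sin(theta) * math.cos(phi),
                math.sin(theta) * math.sin(phi),
                math.cos(theta),
            )
            # Pixels near the poles cover less of the sphere, hence sin(theta).
            yield direction, math.sin(theta) * d_theta * d_phi


def project_onto_sh(signal, height=32, width=64):
    """Approximate SH coefficients of `signal` with a midpoint quadrature rule."""
    coeffs = [0.0] * 9
    for (x, y, z), solid_angle in equirect_samples(height, width):
        value = signal(x, y, z)
        for k, basis_value in enumerate(real_sh_basis(x, y, z)):
            coeffs[k] += value * basis_value * solid_angle
    return coeffs


def toy_signal(x, y, z):
    """A synthetic spherical signal built from three known SH components."""
    basis = real_sh_basis(x, y, z)
    return 2.0 * basis[0] + 0.5 * basis[2] - 0.3 * basis[8]


coeffs = project_onto_sh(toy_signal)
```

Because the basis is orthonormal on the sphere, `coeffs` should come out close to the three weights used in `toy_signal`, with the remaining entries near zero; the small residual comes from the coarse quadrature grid.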





Task-relevant SH bases



To validate the effectiveness of using SH basis functions as queries, we compare the results of conventional learnable (LR) queries with those of our SH queries. For each target task (depth and normal estimation), we visualize the most frequently referenced queries in both the 2D and 3D domains. As the visualizations show, using SH basis functions as queries preserves the geometric consistency of the scene better than LR queries.
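To make the idea of an SH query concrete, here is a minimal single-head cross-attention sketch in which each query is tied to one SH basis function rather than to a free learnable vector. It is a toy under stated assumptions, not the hierarchical attention module of HUSH: forming a query as the SH-weighted mean of the features is a hypothetical choice made here for illustration, and `sh_query_attention`, the degree-1 cutoff, and the six-direction toy input are likewise not taken from the paper.

```python
import math


def real_sh_basis_deg1(x, y, z):
    """Real SH basis values of degree 0 and 1 for a unit direction (4 values)."""
    c0 = 0.5 * math.sqrt(1.0 / math.pi)
    c1 = math.sqrt(3.0 / (4.0 * math.pi))
    return [c0, c1 * y, c1 * z, c1 * x]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sh_query_attention(directions, features):
    """Cross-attention with one query per SH basis function.

    directions: P unit vectors; features: P feature vectors of size C.
    Assumption (hypothetical): query k is the mean of the features weighted
    by SH basis k, so each query inherits the spatial pattern of its basis.
    Returns (outputs, weights) of sizes K x C and K x P.
    """
    num_points, channels = len(features), len(features[0])
    basis = [real_sh_basis_deg1(*d) for d in directions]  # P x K
    outputs, weights = [], []
    for k in range(len(basis[0])):
        query = [
            sum(basis[p][k] * features[p][c] for p in range(num_points)) / num_points
            for c in range(channels)
        ]
        # Scaled dot-product scores between the query and every feature (key).
        scores = [
            sum(q * f for q, f in zip(query, features[p])) / math.sqrt(channels)
            for p in range(num_points)
        ]
        attn = softmax(scores)
        outputs.append([
            sum(attn[p] * features[p][c] for p in range(num_points))
            for c in range(channels)
        ])
        weights.append(attn)
    return outputs, weights


# Toy input: six axis-aligned directions; channel 0 stores the height z.
directions = [
    (1.0, 0.0, 0.0), (-1.0, 0.0, 0.0),
    (0.0, 1.0, 0.0), (0.0, -1.0, 0.0),
    (0.0, 0.0, 1.0), (0.0, 0.0, -1.0),
]
features = [[d[2], 1.0] for d in directions]
outputs, weights = sh_query_attention(directions, features)
```

In this toy setup the query tied to the z-aligned basis (index 2) puts its largest attention weight on the upward direction, which is the kind of geometric alignment the comparison above is probing. A learnable query has no such built-in spatial structure.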


Visualization on 2D



Visualization on 3D




Results



Depth estimation



Surface normal estimation



Layout estimation



Comparison on 3D




    
      @InProceedings{Lee_2025_CVPR,
        author    = {Lee, Jongsung and Park, Harin and Lee, Byeong-Uk and Joo, Kyungdon},
        title     = {HUSH: Holistic Panoramic 3D Scene Understanding using Spherical Harmonics},
        booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
        month     = {June},
        year      = {2025},
        pages     = {16599-16608}
       }