VPOcc: Exploiting Vanishing Point for
3D Semantic Occupancy Prediction

UNIST, Carnegie Mellon University


Overview


Problem (left): (a) 3D objects of similar size (e.g., the green and red boxes) appear at very different scales in (b) the 2D image depending on their distance from the camera, due to perspective projection. Solution (right): we propose a framework that leverages the vanishing point (VP) to encode this perspective effect in the network and thereby utilize 2D image features in a 3D-aware manner. (c) Given an input image and its VP, we perform (d) VP-based image synthesis with zoom-in and (e) VP-guided point sampling for perspective-aware feature aggregation. These two strategies counteract the imbalance caused by perspective projection at both the pixel and feature levels, enabling accurate estimation of (f) the complete 3D semantic voxels of the scene.
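As a back-of-the-envelope illustration of this imbalance (the focal length and object size below are assumed numbers, not values from the paper), pinhole projection makes the apparent pixel size of an object inversely proportional to its depth:

```python
# Minimal sketch with an assumed pinhole camera: two objects of identical
# 3D size project to very different pixel sizes depending on their depth,
# which is the perspective imbalance VPOcc targets.
f = 720.0           # focal length in pixels (assumption)
height_m = 1.5      # both objects are 1.5 m tall in 3D (assumption)

for depth_m in (5.0, 40.0):
    pixel_height = f * height_m / depth_m  # perspective projection
    print(f"depth {depth_m:5.1f} m -> {pixel_height:6.1f} px tall")
# depth   5.0 m ->  216.0 px tall
# depth  40.0 m ->   27.0 px tall
```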

Overall Architecture


In the feature extraction step, a zoomed-in image is generated by VPZoomer, and multi-scale feature maps \( \mathcal{F}^{2D}_o \) and \( \mathcal{F}^{2D}_z \) are extracted from \( I_o \) and \( I_z \). During feature lifting, the depth-proposed voxel query \( \mathcal{Q}_p \) is used with VP-guided cross-attention (VPCA) on \( \mathcal{F}^{2D}_o \) and with deformable cross-attention on \( \mathcal{F}^{2D}_z \) to construct the voxel feature volumes \( \mathcal{F}^{3D}_o \) and \( \mathcal{F}^{3D}_z \), respectively. In the feature volume fusion stage, \( \mathcal{F}^{3D}_o \) and \( \mathcal{F}^{3D}_z \) are fused by the spatial volume fusion (SVF) module and refined by a 3D UNet-based decoder. Finally, the prediction head estimates the 3D semantic voxel map of the entire scene.
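To make the data flow concrete, here is a minimal structural sketch of the forward pass described above; every module name, call signature, and the `m` dict of callables are illustrative assumptions, not the authors' implementation:

```python
# Structural sketch of the VPOcc forward pass (module names are assumptions).
def vpocc_forward(I_o, vp, m):
    """m is a dict of callables standing in for the network's submodules."""
    I_z = m["vpzoomer"](I_o, vp)           # VP-guided zoomed-in image I_z
    F2d_o = m["backbone"](I_o)             # multi-scale features F^2D_o
    F2d_z = m["backbone"](I_z)             # multi-scale features F^2D_z
    Q_p = m["depth_query"](F2d_o)          # depth-proposed voxel queries Q_p
    F3d_o = m["vpca"](Q_p, F2d_o, vp)      # VP-guided cross-attention (VPCA)
    F3d_z = m["deform_attn"](Q_p, F2d_z)   # deformable cross-attention
    F3d = m["svf"](F3d_o, F3d_z)           # spatial volume fusion (SVF)
    F3d = m["unet3d"](F3d)                 # 3D UNet-based refinement
    return m["head"](F3d)                  # per-voxel semantic logits
```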

VPZoomer: VP-guided Image Zoom-in


Left: The original image \( I_o \) with source areas (\( \mathcal{S}_L \), \( \mathcal{S}_R \)) outlined in blue trapezoids. Right: The zoomed-in image \( I_z \) with target areas (\( \mathcal{T}_L \), \( \mathcal{T}_R \)) outlined in red rectangles.
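Mechanically, each source trapezoid can be warped onto its target rectangle with a standard four-point homography. Below is a hedged OpenCV sketch; the corner coordinates are made-up placeholders, whereas in the paper the source trapezoids \( \mathcal{S}_L \), \( \mathcal{S}_R \) are derived from the vanishing point:

```python
import cv2
import numpy as np

def warp_region(image, src_trapezoid, dst_size):
    """Warp a quadrilateral region onto an upright w x h rectangle."""
    w, h = dst_size
    dst = np.float32([[0, 0], [w, 0], [w, h], [0, h]])  # TL, TR, BR, BL
    H = cv2.getPerspectiveTransform(np.float32(src_trapezoid), dst)
    return cv2.warpPerspective(image, H, (w, h))

img = np.zeros((370, 1220, 3), np.uint8)                       # placeholder image
S_L = np.float32([[0, 60], [500, 140], [500, 260], [0, 330]])  # assumed S_L corners
T_L = warp_region(img, S_L, (610, 370))                        # left half of I_z
```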

VP-guided Cross-Attention (VPCA)


Following the sampling order, we first generate the initial grid \( \mathcal{O} \) centered on the reference point \( \mathbf{r} \) with offset \( d \), and rotate it by an angle \( \theta \) to obtain \( \tilde{\mathcal{O}} \). Next, we identify the intersection grid \( \hat{\mathcal{O}} \) at the cross-points of the lines connecting \( \tilde{\mathcal{O}} \) and the VP \( \mathbf{v} \). The resulting set of sampling points \( \mathcal{P} \) is composed of \( \{ \tilde{\mathcal{O}}, \hat{\mathcal{O}}, \mathbf{r} \} \).
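A minimal NumPy sketch of this construction follows. The grid size, the offset \( d \), and in particular which lines cross to define \( \hat{\mathcal{O}} \) are assumptions (here each VP-directed line is intersected with the line through \( \mathbf{r} \) perpendicular to \( \mathbf{r} \to \mathbf{v} \)); the paper specifies the exact geometry:

```python
import numpy as np

def rotate(points, theta, center):
    """Rotate row-vector points by theta around center."""
    c, s = np.cos(theta), np.sin(theta)
    return (points - center) @ np.array([[c, -s], [s, c]]).T + center

def cross2(a, b):
    """z-component of the 2D cross product."""
    return a[0] * b[1] - a[1] * b[0]

def line_intersection(p1, p2, p3, p4):
    """Intersection of line p1-p2 with line p3-p4 (assumed non-parallel)."""
    d1, d2 = p2 - p1, p4 - p3
    t = cross2(p3 - p1, d2) / cross2(d1, d2)
    return p1 + t * d1

r = np.array([320.0, 180.0])                  # reference point (assumed)
v = np.array([610.0, 170.0])                  # vanishing point (assumed)
d = 8.0                                       # grid offset (assumed)
theta = np.arctan2(v[1] - r[1], v[0] - r[0])  # rotation toward the VP

# Initial 3x3 grid O around r (r itself kept separate), rotated to obtain Õ.
offsets = np.array([(i, j) for i in (-d, 0.0, d)
                           for j in (-d, 0.0, d) if (i, j) != (0.0, 0.0)])
O_tilde = rotate(r + offsets, theta, r)

# Intersection grid Ô: cross-point of each line (Õ_k -> v) with the line
# through r perpendicular to r -> v (one plausible reading; an assumption).
perp = r + np.array([-(v - r)[1], (v - r)[0]])
O_hat = np.array([line_intersection(p, v, r, perp) for p in O_tilde])

P = np.vstack([O_tilde, O_hat, r])            # sampling set {Õ, Ô, r}
```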

Qualitative Results on SemanticKITTI Dataset
