Problem (left): (a) 3D objects of similar size (e.g., the green and red boxes) appear at different sizes in (b) the 2D image depending on their distance from the camera, due to perspective projection. Solution (right): We propose a framework that leverages Vanishing Points (VP) to incorporate this perspective effect into the network and thereby utilize 2D image features in a 3D-aware manner. (c) Given an input image and its VP, we perform (d) VP-based image synthesis with zoom-in and (e) VP-guided point sampling for perspective-aware feature aggregation. These strategies address the imbalance caused by perspective projection, operating at both the pixel and feature levels to accurately estimate (f) the complete 3D semantic voxels of the scene.
In the feature extraction step, a zoomed-in image is generated using VPZoomer, and multi-scale feature maps \( \mathcal{F}^{2D}_o \) and \( \mathcal{F}^{2D}_z \) are extracted from \( I_o \) and \( I_z \), respectively. In the feature lifting step, the depth-proposed voxel query \( \mathcal{Q}_p \) is employed with VP-guided cross-attention (VPCA) on \( \mathcal{F}^{2D}_o \) and deformable cross-attention on \( \mathcal{F}^{2D}_z \) to construct the voxel feature volumes \( \mathcal{F}^{3D}_o \) and \( \mathcal{F}^{3D}_z \), respectively. In the feature volume fusion stage, \( \mathcal{F}^{3D}_o \) and \( \mathcal{F}^{3D}_z \) are fused by the spatial volume fusion (SVF) module and refined via the 3D UNet-based decoder. Finally, the prediction head estimates the 3D semantic voxel map of the entire scene.
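To make the data flow concrete, the following is a minimal PyTorch sketch of the two-branch pipeline. Generic attention and convolution modules stand in for VPCA, deformable cross-attention, SVF, and the 3D UNet-based decoder; all module names, channel widths, the class count, and the voxel resolution are illustrative assumptions rather than the exact configuration used in the paper.

```python
# Minimal sketch of the two-branch pipeline: extract 2D features from the
# original and zoomed-in images, lift them to voxel volumes with shared
# queries, fuse the volumes, and predict per-voxel semantics.
# VPCA / deformable cross-attention / SVF are replaced by generic stand-ins.
import torch
import torch.nn as nn


class LiftingBranch(nn.Module):
    """Lift 2D features into a voxel feature volume via cross-attention (stand-in)."""

    def __init__(self, dim=128, vox=(16, 16, 4)):
        super().__init__()
        self.vox = vox
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, voxel_queries, feats_2d):
        # voxel_queries: (B, X*Y*Z, C); feats_2d: (B, H*W, C)
        lifted, _ = self.attn(voxel_queries, feats_2d, feats_2d)
        B, _, C = lifted.shape
        X, Y, Z = self.vox
        return lifted.transpose(1, 2).reshape(B, C, X, Y, Z)


class VPOccSketch(nn.Module):
    def __init__(self, dim=128, vox=(16, 16, 4), num_classes=20):
        super().__init__()
        self.backbone = nn.Sequential(  # stand-in 2D feature extractor
            nn.Conv2d(3, dim, 7, stride=8, padding=3), nn.ReLU())
        self.query = nn.Parameter(torch.randn(vox[0] * vox[1] * vox[2], dim))
        self.lift_o = LiftingBranch(dim, vox)  # stand-in for VPCA on F2D_o
        self.lift_z = LiftingBranch(dim, vox)  # stand-in for deformable CA on F2D_z
        self.svf = nn.Conv3d(2 * dim, dim, 1)  # stand-in for spatial volume fusion
        self.decoder = nn.Sequential(          # stand-in for the 3D UNet decoder
            nn.Conv3d(dim, dim, 3, padding=1), nn.ReLU())
        self.head = nn.Conv3d(dim, num_classes, 1)  # per-voxel semantic logits

    def forward(self, img_o, img_z):
        B = img_o.shape[0]
        f_o = self.backbone(img_o).flatten(2).transpose(1, 2)  # (B, HW, C)
        f_z = self.backbone(img_z).flatten(2).transpose(1, 2)
        q = self.query.unsqueeze(0).expand(B, -1, -1)          # shared voxel queries
        vol_o = self.lift_o(q, f_o)                            # F3D_o
        vol_z = self.lift_z(q, f_z)                            # F3D_z
        fused = self.svf(torch.cat([vol_o, vol_z], dim=1))     # fuse volumes
        return self.head(self.decoder(fused))                  # (B, classes, X, Y, Z)


if __name__ == "__main__":
    model = VPOccSketch()
    out = model(torch.rand(1, 3, 128, 128), torch.rand(1, 3, 128, 128))
    print(out.shape)  # torch.Size([1, 20, 16, 16, 4])
```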
Left: The original image \( I_o \) with source areas (\( \mathcal{S}_L \), \( \mathcal{S}_R \)) outlined in blue trapezoids. Right: The zoomed-in image \( I_z \) with target areas (\( \mathcal{T}_L \), \( \mathcal{T}_R \)) outlined in red rectangles.
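The warp from each source trapezoid to its target rectangle can be realized with a per-area homography, as in the sketch below. Only the trapezoid-to-rectangle mapping is taken from the figure; the corner coordinates, image size, and function name are hypothetical placeholders.

```python
# Sketch of the trapezoid-to-rectangle warp implied by the figure: a source
# area S (a trapezoid whose slanted sides follow lines toward the VP) is
# mapped onto a target rectangle T of the zoomed-in image via a homography.
import cv2
import numpy as np


def warp_source_to_target(image, src_trapezoid, dst_rect_size):
    """Warp a trapezoidal source area onto an axis-aligned target rectangle."""
    w, h = dst_rect_size
    dst_rect = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    H = cv2.getPerspectiveTransform(np.float32(src_trapezoid), dst_rect)
    return cv2.warpPerspective(image, H, (w, h))


if __name__ == "__main__":
    img = np.zeros((370, 1220, 3), np.uint8)              # dummy image
    # Hypothetical left source trapezoid S_L (TL, TR, BR, BL corner order).
    s_left = [(0, 60), (500, 140), (500, 260), (0, 330)]
    t_left = warp_source_to_target(img, s_left, (500, 370))  # target area T_L
    print(t_left.shape)  # (370, 500, 3)
```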
Following the sampling order, we first generate the initial grid \( \mathcal{O} \) centered on the reference point \( \mathbf{r} \) with offset \( d \), and rotate it by an angle \( \theta \) to obtain \( \tilde{\mathcal{O}} \). Next, we identify the intersection grid \( \hat{\mathcal{O}} \) at the cross-points of the lines connecting \( \tilde{\mathcal{O}} \) and the VP \( \mathbf{v} \). The resulting set of sampling points \( \mathcal{P} \) is composed of \( \{ \tilde{\mathcal{O}}, \hat{\mathcal{O}}, \mathbf{r} \} \).
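A NumPy sketch of this construction is given below. The 3x3 grid layout, the choice of \( \theta \) as the angle of the \( \mathbf{r} \rightarrow \mathbf{v} \) direction, and the specific line pairs intersected to form \( \hat{\mathcal{O}} \) are illustrative assumptions; the figure only specifies that \( \hat{\mathcal{O}} \) lies at cross-points of lines involving \( \tilde{\mathcal{O}} \) and \( \mathbf{v} \).

```python
# Sketch of the VP-guided sampling-point construction: build a grid around r,
# rotate it toward the VP, and add intersection points along lines to the VP.
import numpy as np


def line_intersection(p1, p2, p3, p4):
    """Intersection of the infinite lines through (p1, p2) and (p3, p4)."""
    d1, d2 = p2 - p1, p4 - p3
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(denom) < 1e-9:                      # parallel lines: no cross-point
        return None
    t = ((p3[0] - p1[0]) * d2[1] - (p3[1] - p1[1]) * d2[0]) / denom
    return p1 + t * d1


def vp_guided_points(r, v, d=8.0):
    r, v = np.asarray(r, float), np.asarray(v, float)
    # Initial grid O: eight offsets of spacing d around the reference point r.
    offs = np.array([[i, j] for i in (-d, 0, d) for j in (-d, 0, d)
                     if not (i == 0 and j == 0)])
    grid_o = r + offs
    # Rotate O about r by theta (assumed: the angle of the r -> v direction).
    theta = np.arctan2(v[1] - r[1], v[0] - r[0])
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    grid_rot = (grid_o - r) @ R.T + r          # O_tilde
    # Intersection grid O_hat (assumed pairing): cross each line from a rotated
    # grid point to the VP with the line through r perpendicular to r -> v.
    perp = r + np.array([-(v - r)[1], (v - r)[0]])
    grid_int = [p for p in (line_intersection(p, v, r, perp) for p in grid_rot)
                if p is not None]
    # Final sampling set P = {O_tilde, O_hat, r}.
    return np.vstack([grid_rot, np.array(grid_int), r[None]])


if __name__ == "__main__":
    pts = vp_guided_points(r=(200.0, 150.0), v=(640.0, 120.0))
    print(pts.shape)  # e.g. (17, 2)
```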