In the feature extraction step, a zoomed-in image is generated using VPZoomer, and multi-scale feature maps \( \mathcal{F}^{2D}_o \) and \( \mathcal{F}^{2D}_z \) are extracted from \( I_o \) and \( I_z \). During feature lifting, the depth-proposed voxel query \( \mathcal{Q}_p \) is used with VP-guided cross-attention (VPCA) on \( \mathcal{F}^{2D}_o \) and with deformable cross-attention on \( \mathcal{F}^{2D}_z \) to construct the voxel feature volumes \( \mathcal{F}^{3D}_o \) and \( \mathcal{F}^{3D}_z \), respectively. In the feature volume fusion stage, \( \mathcal{F}^{3D}_o \) and \( \mathcal{F}^{3D}_z \) are fused by the spatial volume fusion (SVF) module and refined by the 3D UNet-based decoder. Finally, the prediction head estimates the 3D semantic voxel map of the entire scene.
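For concreteness, a minimal sketch of this pipeline is given below in PyTorch module form. The class and argument names (VPOccSketch, vp_zoomer, vpca, deform_ca, svf, decoder3d, head) are placeholders chosen to mirror the caption; the internals of each submodule are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the described forward pass, assuming each stage is a
# separately constructed nn.Module; module internals are not shown.
import torch.nn as nn

class VPOccSketch(nn.Module):
    def __init__(self, backbone, vp_zoomer, vpca, deform_ca, svf, decoder3d, head):
        super().__init__()
        self.backbone = backbone      # shared 2D feature extractor
        self.vp_zoomer = vp_zoomer    # warps I_o into the zoomed-in image I_z
        self.vpca = vpca              # VP-guided cross-attention (lifting from F2D_o)
        self.deform_ca = deform_ca    # deformable cross-attention (lifting from F2D_z)
        self.svf = svf                # spatial volume fusion of the two volumes
        self.decoder3d = decoder3d    # 3D UNet-based decoder
        self.head = head              # semantic occupancy prediction head

    def forward(self, img_o, vp, voxel_query):
        img_z = self.vp_zoomer(img_o, vp)           # zoomed-in image I_z
        f2d_o = self.backbone(img_o)                # multi-scale features of I_o
        f2d_z = self.backbone(img_z)                # multi-scale features of I_z
        f3d_o = self.vpca(voxel_query, f2d_o, vp)   # voxel volume from original view
        f3d_z = self.deform_ca(voxel_query, f2d_z)  # voxel volume from zoomed view
        f3d = self.svf(f3d_o, f3d_z)                # fuse the two feature volumes
        return self.head(self.decoder3d(f3d))       # 3D semantic voxel map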
https://vision3d-lab.github.io/vpocc/
Left: The original image \( I_o \) with source areas (\( \mathcal{S}_L \), \( \mathcal{S}_R \)) outlined in blue trapezoids. Right: The zoomed-in image \( I_z \) with target areas (\( \mathcal{T}_L \), \( \mathcal{T}_R \)) outlined in red rectangles.
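To illustrate the kind of mapping depicted here, the snippet below warps one trapezoidal source area into a rectangular target area with a perspective transform. The use of OpenCV homographies, the function name, and the idea of assembling \( I_z \) from two such warps are assumptions for illustration, not the actual VPZoomer implementation.

```python
# Illustrative only: map a trapezoid S in I_o to a rectangle T, as a stand-in
# for how regions near the vanishing point could be magnified.
import cv2
import numpy as np

def warp_trapezoid_to_rect(img, src_quad, dst_w, dst_h):
    """src_quad: 4x2 trapezoid corners (TL, TR, BR, BL) in the original image."""
    src = np.asarray(src_quad, dtype=np.float32)
    dst = np.array([[0, 0], [dst_w, 0], [dst_w, dst_h], [0, dst_h]],
                   dtype=np.float32)
    H = cv2.getPerspectiveTransform(src, dst)           # 3x3 homography
    return cv2.warpPerspective(img, H, (dst_w, dst_h))  # rectangular patch

# A zoomed-in image could then be assembled by concatenating the warped
# left and right patches side by side (an assumption of this sketch).
```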
Following the sampling order, we first generate the initial grid \( \mathcal{O} \) centered on the reference point \( \mathbf{r} \) with offset \( d \), and rotate it by an angle \( \theta \) to obtain \( \tilde{\mathcal{O}} \). Next, we identify the intersection grid \( \hat{\mathcal{O}} \) at the cross-points of the lines connecting \( \tilde{\mathcal{O}} \) and the VP \( \mathbf{v} \). As a result, the set of sampling points \( \mathcal{P} \) is composed of \( \{ \tilde{\mathcal{O}}, \hat{\mathcal{O}}, \mathbf{r} \} \).
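The snippet below sketches the grid generation and rotation steps in NumPy. The 3x3 grid size and the generic line-intersection helper used to form \( \hat{\mathcal{O}} \) are assumptions: the caption does not fix the grid size or which line pairs are intersected, so only the reusable geometric pieces are shown.

```python
# Sketch of the VP-guided sampling steps above; grid size and the exact
# construction of the intersection grid are assumptions of this sketch.
import numpy as np

def initial_grid(r, d):
    """3x3 grid O centered on the reference point r with offset d (assumed size)."""
    offsets = np.array([[i, j] for i in (-d, 0, d) for j in (-d, 0, d)], dtype=float)
    return r + offsets

def rotate_grid(grid, r, theta):
    """Rotate O about r by angle theta to obtain O_tilde."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    return (grid - r) @ R.T + r

def line_intersection(p1, p2, q1, q2):
    """Intersection of the line through p1, p2 with the line through q1, q2."""
    d1, d2 = p2 - p1, q2 - q1
    denom = d1[0] * d2[1] - d1[1] * d2[0]        # 2D cross product, non-zero if not parallel
    t = ((q1[0] - p1[0]) * d2[1] - (q1[1] - p1[1]) * d2[0]) / denom
    return p1 + t * d1

# O_hat would collect intersections involving lines through points of O_tilde
# and the VP v; the final sampling set is P = {O_tilde, O_hat, r}.
```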