Beyond the Highlights: Video Retrieval with
Salient and Surrounding Contexts

WACV 2026
UNIST, POSTECH, KAIST
Teaser Comparison

Existing datasets focus mainly on Salient events.
Our SS Datasets capture fine-grained Salient and Surrounding contexts
over semantically meaningful temporal segments.

Abstract

When searching for videos, users often rely on surrounding context, such as background elements or temporal details, beyond salient content. However, existing video models struggle with fine-grained spatio-temporal understanding, particularly of surrounding contexts, and no existing dataset effectively evaluates this capability.

We introduce the SS Datasets, three video retrieval datasets with detailed salient and surrounding captions. To capture rich, temporally localized contexts aligned with meaningful scene changes, we segment videos at scene transitions and generate captions with a vision-language model. Our analysis of current models reveals difficulties in handling surrounding queries and temporally complex videos. To address this, we propose simple yet effective baselines that improve retrieval across diverse query types, enabling more robust generalization to real-world scenarios.
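The segment-then-caption pipeline above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `frame_diffs`, the change-score threshold, and `caption_segment` (which stands in for the vision-language model call) are all hypothetical.

```python
THRESHOLD = 0.5  # hypothetical scene-change cutoff, not from the paper

def split_into_scenes(frame_diffs, threshold=THRESHOLD):
    """Cut a video into scenes wherever the per-frame change score spikes."""
    boundaries = [0]
    for i, diff in enumerate(frame_diffs):
        if diff > threshold:
            boundaries.append(i + 1)  # a new scene starts after frame i
    end = len(frame_diffs)
    if boundaries[-1] != end:
        boundaries.append(end)
    # Each scene is a (start_frame, end_frame) half-open interval.
    return [(boundaries[j], boundaries[j + 1]) for j in range(len(boundaries) - 1)]

def caption_segment(segment):
    """Placeholder for a VLM call returning salient + surrounding captions."""
    start, end = segment
    return {"salient": f"main action in frames {start}-{end}",
            "surrounding": f"background context in frames {start}-{end}"}

# Usage: change scores for a 6-frame clip with one cut after frame 2.
scenes = split_into_scenes([0.1, 0.1, 0.9, 0.1, 0.1, 0.1])
captions = [caption_segment(s) for s in scenes]
```

In practice a scene-detection library (e.g., content-based shot detection) would supply the boundaries, and each segment would be captioned twice: once for the salient event and once for the surrounding context.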


The Challenge: Overcoming Salient Bias

Current video retrieval models exhibit a strong bias towards prominent foreground objects. While they excel at identifying main actions, they often overlook surrounding contexts such as background details, weather, or subtle temporal cues.

For instance, when a user queries for a background detail like a cushion, standard zero-shot models focus entirely on the main subject, such as the dog, and fail to retrieve the correct clip.

Motivation Failure Case

This observation is supported by quantitative evidence. In zero-shot settings, performance drops drastically on surrounding queries compared to salient or original queries.

Zero-shot Performance Gap
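Gaps like the one above are conventionally reported with Recall@K. A minimal sketch, assuming a precomputed text-to-video similarity matrix in which query i's ground-truth video is video i:

```python
import numpy as np

def recall_at_k(sim, k):
    """Recall@K: fraction of queries whose ground-truth video appears in the
    top-k results. Assumes query i's true video is video i (diagonal layout)."""
    topk = np.argsort(-sim, axis=1)[:, :k]  # indices of the k most similar videos
    hits = (topk == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return float(hits.mean())

# Usage with a toy 3x3 text-to-video similarity matrix:
sim = np.array([[0.9, 0.1, 0.0],
                [0.2, 0.8, 0.1],
                [0.7, 0.1, 0.6]])
r1 = recall_at_k(sim, 1)  # query 2's true video is only ranked second
r2 = recall_at_k(sim, 2)
```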

SS Datasets: Scale & Diversity

To address this limitation, we extend MSRVTT, LSMDC, and DiDeMo with dense, fine-grained captions that explicitly distinguish salient and surrounding contexts.

Dataset Statistics

Compared with original benchmarks, the SS Datasets include substantially more captions aligned with semantically meaningful temporal segments, enabling richer contextual coverage.


Caption Diversity

Our captions exhibit higher variance and greater distance from the mean embedding than the original captions, indicating richer linguistic diversity.
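A distance-from-the-mean diversity measure of this kind can be sketched as below. The sentence encoder producing the embeddings is left unspecified; the toy vectors are illustrative only.

```python
import numpy as np

def embedding_diversity(emb):
    """Average L2 distance of caption embeddings from their mean — one simple
    way to quantify linguistic diversity. `emb` is an (n_captions x dim)
    array from any sentence encoder."""
    mean = emb.mean(axis=0)
    return float(np.linalg.norm(emb - mean, axis=1).mean())

# Usage: a tight cluster vs. a spread-out set of toy embeddings.
tight = np.array([[1.0, 0.0], [1.0, 0.1], [1.0, -0.1]])
spread = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
```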

Semantic Distance

Semantic analysis shows that surrounding captions are clearly distinct from salient ones, indicating that our dataset captures complementary information instead of redundant descriptions.
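Distinctness between the two caption groups can be checked with mean cross-group cosine similarity, sketched below on toy embeddings (the encoder and vectors are assumptions, not the paper's data):

```python
import numpy as np

def mean_cross_cosine(a, b):
    """Mean pairwise cosine similarity between two caption-embedding sets;
    low values indicate the sets carry distinct, complementary information."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float((a @ b.T).mean())

# Toy embeddings: salient captions point one way, surrounding another.
salient = np.array([[1.0, 0.0], [0.9, 0.1]])
surrounding = np.array([[0.0, 1.0], [0.1, 0.9]])
```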

In-depth Analysis

Beyond standard retrieval evaluation, we analyze how video properties, such as temporal complexity, affect model performance.

Performance Analysis

We find that videos with more clips (higher temporal complexity) generally yield lower performance, while longer clip durations help models understand the context better.

Correlation Matrix

A detailed correlation matrix further reveals the relationship between video properties (like clip count and duration) and retrieval recall.
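This style of analysis reduces to a Pearson correlation matrix over per-video statistics. A sketch with toy (assumed) numbers whose trends mirror the finding above:

```python
import numpy as np

def property_recall_correlation(clip_count, duration, recall):
    """Pearson correlation matrix relating per-video properties to retrieval
    recall. Rows/cols: clip count, clip duration, recall."""
    return np.corrcoef(np.stack([clip_count, duration, recall]))

# Toy per-video statistics (illustrative only, not the paper's data).
clip_count = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
duration = np.array([12.0, 9.0, 8.0, 6.0, 5.0])
recall = np.array([0.9, 0.8, 0.6, 0.5, 0.3])
corr = property_recall_correlation(clip_count, duration, recall)
# corr[0, 2]: clip count vs. recall (negative in this toy data);
# corr[1, 2]: duration vs. recall (positive here).
```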


Qualitative Results

Our simple baseline captures information that is often underrepresented in existing approaches.


While the zero-shot model does not capture the specific background detail, a white object with red, the baseline retrieves the correct clip containing the cushion.

BibTeX

@inproceedings{bang2026beyond,
  title={Beyond the Highlights: Video Retrieval with Salient and Surrounding Contexts},
  author={Bang, Jaehun and Moon, Ye-Bin and Oh, Tae-Hyun and Joo, Kyungdon},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year={2026}
}