When searching for videos, users often rely on surrounding context, such as background elements or temporal details, beyond the salient content. However, existing video models struggle with fine-grained spatio-temporal understanding, particularly of surrounding contexts, and no datasets exist that effectively evaluate this capability.
We introduce SS Datasets, three video retrieval datasets with detailed salient and surrounding captions. To capture rich, temporally localized contexts aligned with meaningful scene changes, we segment videos by scene transitions and generate captions with a vision-language model. Analyzing current models reveals difficulties in handling surrounding queries and temporally complex videos. To address this, we propose simple yet effective baselines that improve retrieval across diverse query types, enabling more robust generalization to real-world scenarios.
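As a rough illustration of the segmentation step, here is a minimal sketch of threshold-based scene-transition detection over per-frame difference scores. The frame "features" are toy histograms and the threshold is arbitrary; the paper's actual pipeline uses visual scene-transition detection plus a vision-language model for captioning, neither of which is reproduced here.

```python
# Minimal sketch: split a video into segments at scene transitions.
# Toy per-frame feature vectors stand in for real visual features;
# a VLM would then caption each resulting segment (not shown).

def segment_by_transitions(frame_feats, threshold=0.5):
    """Return (start, end) index pairs, cutting wherever consecutive
    frames differ by more than `threshold` (L1 distance)."""
    boundaries = [0]
    for i in range(1, len(frame_feats)):
        diff = sum(abs(a - b) for a, b in zip(frame_feats[i - 1], frame_feats[i]))
        if diff > threshold:
            boundaries.append(i)
    boundaries.append(len(frame_feats))
    return list(zip(boundaries[:-1], boundaries[1:]))

# Toy frames: two visually distinct scenes.
frames = [[0.9, 0.1]] * 4 + [[0.1, 0.9]] * 3
segments = segment_by_transitions(frames)
print(segments)  # [(0, 4), (4, 7)]
```

Each returned segment would then be captioned independently, yielding temporally localized descriptions rather than one caption for the whole video.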
Current video retrieval models exhibit a strong bias towards prominent foreground objects. While they excel at identifying main actions, they often overlook surrounding contexts such as background details, weather, or subtle temporal cues.
For instance, when a user queries for a background detail like a cushion, standard zero-shot models focus entirely on the main subject, such as the dog, and fail to retrieve the correct clip.
This observation is supported by quantitative evidence. In zero-shot settings, performance drops drastically on surrounding queries compared to salient or original queries.
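The metric behind such comparisons is standard Recall@K: the fraction of queries whose ground-truth video appears in the top K retrieved results. A minimal sketch on invented ranked lists (not the paper's evaluation code or numbers):

```python
def recall_at_k(ranks, k):
    """Fraction of queries whose ground-truth video is ranked within
    the top k. `ranks` holds the 1-based rank of the correct video
    for each query."""
    return sum(r <= k for r in ranks) / len(ranks)

# Invented ranks: surrounding queries place the correct clip lower.
salient_ranks = [1, 1, 2, 3]
surrounding_ranks = [2, 5, 9, 12]
print(recall_at_k(salient_ranks, 5))      # 1.0
print(recall_at_k(surrounding_ranks, 5))  # 0.5
```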
To address this limitation, we extend MSRVTT, LSMDC, and DiDeMo with dense, fine-grained captions that explicitly distinguish salient and surrounding contexts.
Compared with original benchmarks, the SS Datasets include substantially more captions aligned with semantically meaningful temporal segments, enabling richer contextual coverage.
Our captions exhibit higher variance and greater distance from the mean embedding than the original captions, indicating richer linguistic diversity.
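One simple way to quantify this kind of diversity is the mean distance of caption embeddings from their centroid. A minimal sketch on toy 2-D vectors (the embeddings and the Euclidean-distance choice are illustrative assumptions, not the paper's exact analysis):

```python
import math

def mean_dist_from_centroid(embs):
    """Average Euclidean distance of embedding vectors from their mean."""
    dim = len(embs[0])
    centroid = [sum(e[d] for e in embs) / len(embs) for d in range(dim)]
    dists = [math.dist(e, centroid) for e in embs]
    return sum(dists) / len(dists)

# Toy caption embeddings: a diverse set vs. a near-duplicate set.
diverse = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
similar = [[0.5, 0.5], [0.5, 0.6], [0.6, 0.5], [0.6, 0.6]]
print(mean_dist_from_centroid(diverse) > mean_dist_from_centroid(similar))  # True
```

A higher score means captions spread farther from their shared centroid, i.e., they describe more varied content.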
Semantic analysis shows that surrounding captions are clearly distinct from salient ones, indicating that our dataset captures complementary information instead of redundant descriptions.
Beyond standard retrieval evaluation, we analyze how video properties, such as temporal complexity, affect model performance.
We find that more clips (higher temporal complexity) generally lead to lower performance, whereas longer clip durations help models understand context better.
A detailed correlation matrix further reveals the relationship between video properties (like clip count and duration) and retrieval recall.
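Such a correlation analysis reduces to pairwise Pearson coefficients between video properties and retrieval recall. A minimal sketch on invented per-video statistics (the numbers are illustrative, not the paper's data):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Invented per-video stats: clip count, mean clip duration (s), Recall@5.
clip_count = [2, 4, 6, 8, 10]
duration = [12.0, 9.5, 8.0, 6.5, 5.0]
recall = [0.80, 0.72, 0.65, 0.55, 0.50]
print(pearson(clip_count, recall))  # negative: more clips, lower recall
print(pearson(duration, recall))    # positive: longer clips, higher recall
```

Computing this coefficient for every pair of properties yields the full correlation matrix.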
Our simple baseline captures information that is often underrepresented in existing approaches.
While the zero-shot model does not capture the specific background detail (a white object with red), the baseline retrieves the correct clip containing the cushion.
@inproceedings{bang2026beyond,
  title     = {Beyond the Highlights: Video Retrieval with Salient and Surrounding Contexts},
  author    = {Bang, Jaehun and Moon, Ye-Bin and Oh, Tae-Hyun and Joo, Kyungdon},
  booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year      = {2026}
}