A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search
in the Gaming Vertical Domain
Figure 1. SVFSearch benchmark construction and evaluation pipeline. Starting from raw short-video clips, we construct a frozen offline retrieval environment with text, image, and multimodal indices. Three evaluation protocols — Direct QA, Workflow RAG, and Plan-Act-Replan (PAR) agents — are compared against an Oracle Knowledge upper bound.
Multimodal large language models are increasingly used as agent backbones that understand multimodal inputs, plan retrieval actions, invoke external tools, and reason over retrieved information. Yet existing benchmarks rarely evaluate this ability in short-video applications, where a paused frame is often visually ambiguous and answering requires vertical, long-tail, and fast-evolving domain knowledge.
We introduce SVFSearch, the first open benchmark for short-video frame search in the Chinese gaming domain. SVFSearch contains 5,000 four-choice test examples and 4,198 auxiliary training examples, each centered on a paused game scene from a real short-video clip.
To support fair and reproducible evaluation, SVFSearch provides a frozen offline retrieval environment with a game-domain text corpus, a topic-linked image gallery, and text, image, and multimodal retrieval interfaces, avoiding reliance on uncontrolled web search APIs. We evaluate representative paradigms ranging from Direct QA and Workflow RAG to Plan-Act-Replan agents and learned search models.
Results reveal a large gap between model-only answering, practical agentic search, and oracle knowledge: the best open-source direct-QA model reaches 66.4%, the best practical agent achieves 79.1%, and oracle knowledge reaches 95.4%. Further analysis exposes bottlenecks in visual grounding, retrieval quality, evidence-grounded reasoning, and tool-use behavior, including over-search, answer-only shortcuts, and retrieval-induced misleading.
Each example centers on a paused game frame from a real short-video clip paired with a four-choice question. The system must locate domain-specific evidence from the retrieval environment before answering.
Figure 2. Representative QA examples drawn from different games and question categories, showing the paused frame, question text, answer options, and ground-truth rationale.
Curate 22,800 game-specific elements across 221 games, building a 45K-entry text corpus and 262K retrieval chunks.
Process 200K+ videos, extract 1M+ frames, and obtain 43K reliable image-element pairs for the topic-linked image gallery.
Generate 80K candidate QA pairs and filter to 9,198 high-quality instances with difficulty and category labels.
Text embedding index: Qwen3-Embedding-0.6B · 512-dim · 45K entries
Lexical index: lexical retrieval over the chunked knowledge base
Image embedding index: DINOv3-Base fine-tuned · 256-dim · 34K images
Multimodal embedding index: Qwen3-VL-Embedding-2B · 512-dim
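Lexical retrieval over the chunked knowledge base can be illustrated with a small BM25 scorer. This is a minimal sketch rather than the benchmark's implementation: the page only states that retrieval is lexical, and BM25 is assumed here because of the `bm25_server.py` script name. The `BM25Index` class, the whitespace tokenizer, and the `k1`/`b` values (conventional defaults) are illustrative assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace tokenization. Chinese text would need a word segmenter.
    return text.lower().split()


class BM25Index:
    """Hypothetical in-memory BM25 index over knowledge-base chunks."""

    def __init__(self, chunks, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [Counter(tokenize(c)) for c in chunks]
        self.lengths = [sum(d.values()) for d in self.docs]
        self.avg_len = sum(self.lengths) / len(self.docs)
        # Document frequency: number of chunks containing each term.
        df = Counter()
        for d in self.docs:
            df.update(d.keys())
        n = len(self.docs)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        doc, length = self.docs[i], self.lengths[i]
        total = 0.0
        for term in tokenize(query):
            tf = doc.get(term, 0)
            if tf == 0:
                continue
            # Term-frequency saturation with length normalization.
            norm = tf + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf[term] * tf * (self.k1 + 1) / norm
        return total

    def search(self, query, top_k=3):
        # Return (chunk_index, score) pairs for chunks sharing a query term.
        scored = [(self.score(query, i), i) for i in range(len(self.docs))]
        scored = [(s, i) for s, i in scored if s > 0]
        scored.sort(key=lambda x: (-x[0], x[1]))
        return [(i, s) for s, i in scored[:top_k]]
```

Calling `BM25Index(chunks).search("burn damage")` ranks chunks that contain both query terms above chunks that contain only one, and returns an empty list when no chunk shares a term with the query.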
Figure 3. Distribution of SVFSearch examples across question theme, question type, and difficulty levels.
SVFSearch benchmarks four evaluation settings in a unified frozen environment, enabling controlled comparison from model-only to fully agentic approaches.
Direct QA: the MLLM answers from the paused frame, question, and options, with no retrieval tools. This establishes the model-only baseline.
Workflow RAG: a fixed image→text pipeline. Image ANN retrieves candidate frames, core elements are identified, text ANN fetches relevant knowledge, and then the model answers.
Plan-Act-Replan (PAR) agent: a LangGraph-based agent dynamically selects among 4 retrieval tools over up to 6 rounds, accumulating evidence and reasoning over it adaptively.
MMSearch-R1: a learned search model that internalizes tool use via RL training, generating tool calls as part of the decoding process.
Accuracy and search rate (SR) on the 5,000-example test split. Oracle Knowledge provides the upper bound of evidence availability.
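The two headline metrics can be computed from per-example prediction records, as sketched below. The record fields (`pred`, `answer`, `num_tool_calls`, `category`) and the `evaluate` function are hypothetical, and SR is assumed to mean the percentage of examples on which the model issued at least one tool call; the page does not give the exact definition.

```python
from collections import defaultdict


def evaluate(records):
    """Compute accuracy, search rate, and per-category accuracy (all in percent)."""
    n = len(records)
    correct = sum(r["pred"] == r["answer"] for r in records)
    # Assumption: an example counts as "searched" if any tool was called.
    searched = sum(r["num_tool_calls"] > 0 for r in records)

    by_cat = defaultdict(lambda: [0, 0])  # category -> [hits, total]
    for r in records:
        by_cat[r["category"]][0] += int(r["pred"] == r["answer"])
        by_cat[r["category"]][1] += 1

    return {
        "acc": 100.0 * correct / n,
        "sr": 100.0 * searched / n,
        "per_category": {c: 100.0 * hit / total for c, (hit, total) in by_cat.items()},
    }
```

Under this assumed definition, Direct QA and Oracle rows have no SR value because no retrieval tool is available, and Workflow RAG rows are 100.0 because the pipeline always retrieves.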
| Setting | Model | Acc. | SR | Char. | Equip. | Map | Story | Mech. | Easy | Med. | Hard |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Proprietary Models (Direct QA) | |||||||||||
| Direct | Claude-Opus-4.7 | 69.0 | — | 70.0 | 69.0 | 66.8 | 72.7 | 66.2 | 67.9 | 68.6 | 72.9 |
| Direct | GPT-5.4 | 67.9 | — | 69.8 | 67.8 | 59.7 | 69.7 | 63.7 | 65.1 | 67.7 | 70.4 |
| Direct | Gemini-3.1-Pro | 77.5 | — | 79.1 | 79.6 | 69.9 | 87.9 | 72.3 | 82.5 | 78.3 | 68.2 |
| Open-source Direct QA | |||||||||||
| Direct | Qwen2.5-VL-7B | 49.8 | — | 50.8 | 47.9 | 47.4 | 51.5 | 48.1 | 57.1 | 49.4 | 50.2 |
| Direct | Qwen3-VL-8B | 54.8 | — | 56.5 | 52.4 | 55.1 | 66.7 | 50.9 | 58.0 | 53.8 | 61.9 |
| Direct | Qwen3-VL-32B | 57.1 | — | 58.7 | 54.3 | 53.6 | 63.6 | 54.4 | 71.2 | 56.2 | 58.9 |
| Direct | Qwen3.5-9B | 59.9 | — | 61.1 | 57.1 | 57.1 | 60.6 | 58.6 | 59.9 | 59.2 | 65.4 |
| Direct | Qwen3.5-27B | 66.4 | — | 68.0 | 63.1 | 63.8 | 75.8 | 63.9 | 73.6 | 65.5 | 71.3 |
| Open-source Workflow RAG | |||||||||||
| Workflow | Qwen2.5-VL-7B | 57.3 | 100.0 | 57.7 | 58.6 | 58.2 | 69.7 | 54.3 | 59.4 | 57.9 | 51.0 |
| Workflow | Qwen3-VL-8B | 63.5 | 100.0 | 64.1 | 66.5 | 63.3 | 60.6 | 59.6 | 66.5 | 64.3 | 55.3 |
| Workflow | Qwen3.5-9B | 66.5 | 100.0 | 67.5 | 67.2 | 64.3 | 63.6 | 63.3 | 70.8 | 66.5 | 64.4 |
| Workflow | Qwen3.5-27B | 69.4 | 100.0 | 70.8 | 69.4 | 65.8 | 78.8 | 65.3 | 70.8 | 69.1 | 71.0 |
| Open-source Plan-Act-Replan (PAR) Agent | |||||||||||
| PAR | Qwen2.5-VL-7B | 59.3 | 85.6 | 59.3 | 61.7 | 63.3 | 54.5 | 56.9 | 55.2 | 59.8 | 56.5 |
| PAR | Qwen3-VL-8B | 63.7 | 99.7 | 62.8 | 66.1 | 70.9 | 63.6 | 63.2 | 66.5 | 64.2 | 58.3 |
| PAR | Qwen3-VL-32B | 71.6 | 98.4 | 71.1 | 75.9 | 74.5 | 69.7 | 69.4 | 79.7 | 72.4 | 61.5 |
| PAR | Qwen3.5-9B | 79.1 | 100.0 | 79.3 | 82.7 | 79.1 | 87.9 | 76.0 | 75.0 | 79.7 | 75.9 |
| PAR | Qwen3.5-27B | 78.6 | 96.8 | 78.3 | 83.1 | 81.1 | 81.8 | 75.5 | 79.7 | 79.0 | 74.5 |
| Open-source MMSearch-R1 | |||||||||||
| MS-R1 | Qwen2.5-VL-7B | 49.4 | 72.8 | 49.1 | 53.8 | 54.1 | 60.6 | 46.2 | 49.1 | 35.5 | 29.8 |
| MS-R1 | Qwen3-VL-8B | 63.2 | 0.02 | 64.5 | 60.8 | 63.3 | 63.6 | 61.1 | 70.3 | 62.6 | 65.6 |
| MS-R1-Game | Qwen3-VL-8B | 64.5 | 68.2 | 65.2 | 66.8 | 65.3 | 63.6 | 60.0 | 73.6 | 64.3 | 61.9 |
| Oracle Knowledge (Upper Bound) | |||||||||||
| Oracle | Qwen3-VL-8B | 86.5 | — | 88.4 | 87.5 | 87.8 | 78.8 | 80.3 | 73.1 | 87.5 | 84.0 |
| Oracle | Qwen3-VL-32B | 90.8 | — | 92.0 | 94.1 | 88.3 | 93.9 | 85.4 | 87.7 | 91.4 | 87.0 |
| Oracle | Qwen3.5-9B | 90.3 | — | 90.7 | 91.9 | 86.2 | 97.0 | 88.4 | 91.5 | 90.7 | 85.8 |
| Oracle | Qwen3.5-27B | 95.4 | — | 96.0 | 96.7 | 95.9 | 87.9 | 93.0 | 94.3 | 95.7 | 93.5 |
Figure 4. Retrieval gains and search behavior. Left: accuracy decomposition from Direct QA to Oracle Knowledge, showing gains from Workflow RAG, PAR, and the remaining gap to Oracle. Right: correctness and search-usage breakdown for MS-R1 models.
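One way to read the accuracy decomposition is as three differences per backbone, computed here from the overall accuracy column of the results table. This is an illustrative sketch; the `decompose` helper and the exact breakdown are assumptions and may not match how the figure was produced.

```python
# Overall accuracy (percent) copied from the results table.
RESULTS = {
    "Qwen3-VL-8B": {"direct": 54.8, "workflow": 63.5, "par": 63.7, "oracle": 86.5},
    "Qwen3.5-9B": {"direct": 59.9, "workflow": 66.5, "par": 79.1, "oracle": 90.3},
    "Qwen3.5-27B": {"direct": 66.4, "workflow": 69.4, "par": 78.6, "oracle": 95.4},
}


def decompose(row):
    """Split the Direct-to-Oracle gap into successive gains (percentage points)."""
    return {
        "workflow_gain": round(row["workflow"] - row["direct"], 1),
        "par_gain_over_workflow": round(row["par"] - row["workflow"], 1),
        "gap_to_oracle": round(row["oracle"] - row["par"], 1),
    }


for model, row in RESULTS.items():
    print(model, decompose(row))
```

For Qwen3.5-27B this gives +3.0 points from Workflow RAG, a further +9.2 from PAR, and 16.8 points remaining to Oracle Knowledge.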
Figure 5. Tool-use diagnostics. Left: PAR tool-call counts and accuracy across model scales. Right: item-level search rates and accuracy of MS-R1-style models.
All evaluation code is open-source. The three main entry points correspond to the Direct QA, Workflow RAG, and PAR agent protocols.
python run_direct_qa.py \
    --model Qwen3.5-27B \
    --data data/test.jsonl
Direct QA baseline — no retrieval tools, pure MLLM answering with optional knowledge injection.
python run_workflow.py \
    --model Qwen3.5-27B \
    --data data/test.jsonl
Fixed image→text retrieval pipeline. Image ANN retrieves frames, core elements are selected by voting, and text ANN fetches evidence.
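The four stages of this pipeline can be sketched in plain Python. This is a toy illustration under stated assumptions, not the code in `run_workflow.py`: brute-force cosine search stands in for the ANN indices, and `workflow_answer`, `embed_text`, `answer_fn`, and the `vec`/`element`/`text` fields are hypothetical names.

```python
import math
from collections import Counter


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query_vec, items, k):
    # Brute-force nearest neighbours; a stand-in for a real ANN index.
    return sorted(items, key=lambda it: cosine(query_vec, it["vec"]), reverse=True)[:k]


def workflow_answer(frame_vec, question, options, gallery, corpus,
                    embed_text, answer_fn, k_img=3, k_txt=2):
    # 1) Image retrieval: find gallery frames similar to the paused frame.
    neighbours = top_k(frame_vec, gallery, k_img)
    # 2) Voting: pick the element label most common among the neighbours.
    votes = Counter(n["element"] for n in neighbours)
    element = votes.most_common(1)[0][0]
    # 3) Text retrieval: query the knowledge base with the voted element.
    query_vec = embed_text(f"{element} {question}")
    evidence = [c["text"] for c in top_k(query_vec, corpus, k_txt)]
    # 4) Answering: the model sees the question, options, and evidence.
    return answer_fn(question, options, evidence), element, evidence
```

Because every stage is fixed, the pipeline always retrieves, which is consistent with the 100.0 search rate reported for the Workflow rows.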
python run_agent.py \
    --model Qwen3.5-9B \
    --data data/test.jsonl \
    --max_rounds 6
LangGraph-based Plan-Act-Replan agent. Dynamically selects among 4 retrieval tools over multiple planning rounds.
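The control flow of a Plan-Act-Replan loop with a round budget can be sketched without any framework. The actual agent is LangGraph-based; the plain-Python loop below is a simplified stand-in, and `run_par_agent`, the `planner` callable, its `force_answer` flag, and the tool names are all hypothetical.

```python
def run_par_agent(question, options, planner, tools, max_rounds=6):
    """Plan, act, and replan until the planner answers or the budget runs out."""
    evidence = []
    for round_idx in range(max_rounds):
        # Plan (or replan) given everything retrieved so far.
        step = planner(question, options, evidence)
        if step["action"] == "answer":
            return {"choice": step["choice"], "rounds": round_idx, "evidence": evidence}
        # Act: call the selected retrieval tool and record its result.
        tool = tools[step["tool"]]
        evidence.append({
            "tool": step["tool"],
            "query": step["query"],
            "result": tool(step["query"]),
        })
    # Budget exhausted: ask the planner for a final answer from the evidence.
    final = planner(question, options, evidence, force_answer=True)
    return {"choice": final["choice"], "rounds": max_rounds, "evidence": evidence}
```

In this sketch `max_rounds=6` mirrors the `--max_rounds 6` flag above, and a planner that answers in round 0 corresponds to an example with no search at all.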
Retrieval servers: img_emb_server.py, kn_lookup_server.py, multimodal_emb_server.py, text_emb_server.py, bm25_server.py
SVFSearch is hosted on HuggingFace. The dataset includes QA pairs, game frame images, the text knowledge base, and retrieval indices: all the resources needed to reproduce the benchmark results.
If you use SVFSearch in your research, please cite our paper:
@misc{mao2026svfsearchmultimodalknowledgeintensivebenchmark,
title={SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain},
author={Lingtao Mao and Huangyu Dai and Xinyu Sun and Zihan Liang and Ben Chen and Chenyi Lei and Wenwu Ou},
year={2026},
eprint={2605.17946},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2605.17946},
}