SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain

Figure 1. SVFSearch benchmark construction and evaluation pipeline. Starting from raw short-video clips, we construct a frozen offline retrieval environment with text, image, and multimodal indices. Three evaluation protocols — Direct QA, Workflow RAG, and Plan-Act-Replan (PAR) agents — are compared against an Oracle Knowledge upper bound.

Abstract

Multimodal large language models are increasingly used as agent backbones that understand multimodal inputs, plan retrieval actions, invoke external tools, and reason over retrieved information. Yet existing benchmarks rarely evaluate this ability in short-video applications, where a paused frame is often visually ambiguous and answering requires vertical, long-tail, and fast-evolving domain knowledge.

We introduce SVFSearch, the first open benchmark for short-video frame search in the Chinese gaming domain. SVFSearch contains 5,000 four-choice test examples and 4,198 auxiliary training examples, each centered on a paused game scene from a real short-video clip.

To support fair and reproducible evaluation, SVFSearch provides a frozen offline retrieval environment with a game-domain text corpus, a topic-linked image gallery, and text, image, and multimodal retrieval interfaces, avoiding reliance on uncontrolled web search APIs. We evaluate representative paradigms ranging from Direct QA and Workflow RAG to Plan-Act-Replan agents and learned search models.

Results reveal a large gap between model-only answering, practical agentic search, and oracle knowledge: the best open-source direct-QA model reaches 66.4%, the best practical agent achieves 79.1%, and oracle knowledge reaches 95.4%. Further analysis exposes bottlenecks in visual grounding, retrieval quality, evidence-grounded reasoning, and tool-use behavior, including over-search, answer-only shortcuts, and retrieval-induced misleading.

Benchmark at a Glance

9,198 QA Examples: 5,000 test · 4,198 train

221 Games Covered: Popular titles in the Chinese gaming vertical

22,800 Core Elements: Game-specific knowledge units

45K Text Corpus Entries: 262K retrieval chunks

34K Gallery Images: Topic-linked image index

6 × 3 Taxonomy: 6 categories · 3 difficulty levels

Representative Examples

Each example centers on a paused game frame from a real short-video clip paired with a four-choice question. The system must locate domain-specific evidence from the retrieval environment before answering.

Figure 2. Representative QA examples drawn from different games and question categories, showing the paused frame, question text, answer options, and ground-truth rationale.

Case Study Frames

Case 1: Frame A · Frame B

Case 2: Frame A · Frame B

Dataset & Retrieval Environment

Three-Stage Construction Pipeline

01. Core Element & Knowledge Construction
Curate 22,800 game-specific elements across 221 games, building a 45K-entry text corpus and 262K retrieval chunks.
02. Visual Grounding via Short-Video Retrieval
Process 200K+ videos, extract 1M+ frames, and obtain 43K reliable image-element pairs for the topic-linked image gallery.
03. QA Generation & Quality Filtering
Generate 80K candidate QA pairs and filter to 9,198 high-quality instances with difficulty and category labels.

Frozen Retrieval Indices

Dense Text: Qwen3-Embedding-0.6B · 512-dim · 45K entries

Sparse BM25: Lexical retrieval over chunked knowledge base

Image ANN: DINOv3-Base fine-tuned · 256-dim · 34K images

Multimodal ANN: Qwen3-VL-Embedding-2B · 512-dim
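
To make the dense indices concrete, here is a minimal sketch of top-k lookup by cosine similarity. It is an assumption-laden illustration: it does exhaustive search over a toy 3-dimensional index rather than calling a real ANN library, and the entry IDs, vectors, and function names are invented for the example. The benchmark's actual embeddings come from the models listed above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, index, k=3):
    """Return the k (entry_id, score) pairs most similar to query_vec.

    `index` is a hypothetical dict mapping entry_id -> embedding vector.
    This scans every entry; a real index would use approximate search.
    """
    scored = [(cosine(query_vec, vec), entry_id) for entry_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(entry_id, score) for score, entry_id in scored[:k]]


# Toy index with made-up entries (illustration only).
toy_index = {
    "hero_a": [1.0, 0.0, 0.0],
    "map_b": [0.0, 1.0, 0.0],
    "item_c": [0.7, 0.7, 0.0],
}
hits = top_k([0.9, 0.1, 0.0], toy_index, k=2)
print([entry_id for entry_id, _ in hits])
```

Swapping `cosine` for an image or multimodal embedding distance would give the same interface for the other dense indices; only the encoder changes.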

Question Categories & Difficulty

Topic: Character · Equipment · Map · Story · Mechanics · Other

Difficulty: Easy · Medium · Hard

Figure 3. Distribution of SVFSearch examples across question theme, question type, and difficulty levels.

Evaluation Protocols

SVFSearch benchmarks four evaluation settings in a unified frozen environment, enabling controlled comparison from model-only to fully agentic approaches.

Direct QA

The MLLM answers from the paused frame, question, and options alone, with no retrieval tools. This establishes the model-only baseline.
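
As a rough sketch of the text side of a Direct QA call, the snippet below formats a four-choice question and extracts the chosen letter from a free-form reply. The prompt wording, the function names, and the parsing rule are assumptions for illustration, not the prompts used in the paper; the paused frame would be passed to the model separately.

```python
import re

LETTERS = "ABCD"


def build_prompt(question, options):
    """Format a four-choice question as plain text (hypothetical prompt template)."""
    if len(options) != 4:
        raise ValueError("expected exactly four options")
    lines = [question]
    lines += [f"{letter}. {option}" for letter, option in zip(LETTERS, options)]
    lines.append("Answer with a single letter (A, B, C, or D).")
    return "\n".join(lines)


def parse_choice(reply):
    """Return the first standalone uppercase A-D in the reply, or None if absent.

    Deliberately simple: a reply that starts with the article "A" would be
    misread, so a stricter output format is advisable in practice.
    """
    match = re.search(r"\b([ABCD])\b", reply)
    return match.group(1) if match else None


prompt = build_prompt("Which item is shown?", ["Sword", "Shield", "Bow", "Staff"])
print(prompt)
print(parse_choice("The answer is B."))
```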

Workflow RAG

Fixed image→text pipeline: image ANN retrieves candidate frames, core elements are identified, text ANN fetches relevant knowledge, then the model answers.
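
The fixed pipeline can be sketched as three calls in a row. Everything here is hypothetical scaffolding: `image_search`, `text_search`, and `answer_fn` stand in for the retrieval services and the model, and majority voting over neighbor labels is one plausible reading of how core elements are identified, not a confirmed detail.

```python
from collections import Counter


def vote_core_elements(neighbors, top_n=1):
    """Pick the most frequent element labels among retrieved frames.

    `neighbors` is a list of (element_label, similarity) pairs, as an
    image index might return them (assumed shape).
    """
    counts = Counter(label for label, _ in neighbors)
    return [label for label, _ in counts.most_common(top_n)]


def workflow_answer(frame, question, options, image_search, text_search, answer_fn):
    """Image retrieval -> element voting -> text retrieval -> answer."""
    neighbors = image_search(frame)
    elements = vote_core_elements(neighbors)
    evidence = [chunk for element in elements for chunk in text_search(element)]
    return answer_fn(frame, question, options, evidence)


# Stub components so the sketch runs end to end (illustration only).
def fake_image_search(frame):
    return [("hero_a", 0.91), ("item_c", 0.88), ("hero_a", 0.80)]


def fake_text_search(element):
    return [f"knowledge about {element}"]


def fake_answer(frame, question, options, evidence):
    return "A" if evidence else "D"


print(workflow_answer("frame.jpg", "Who is this?", ["w", "x", "y", "z"],
                      fake_image_search, fake_text_search, fake_answer))
```

Because every step always runs, this design has a search rate of 100% by construction, which matches the SR column reported for Workflow RAG.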

Plan-Act-Replan (PAR)

A LangGraph-based agent dynamically selects among 4 retrieval tools over up to 6 rounds, accumulating evidence and reasoning over it adaptively.
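
The control flow of a plan-act-replan loop can be illustrated without any agent framework. The sketch below is plain Python rather than LangGraph, and the action format, tool names, and callback signatures are assumptions made for the example; only the 6-round cap comes from the description above.

```python
def run_par(question, plan_fn, tools, answer_fn, max_rounds=6):
    """Alternate planning and tool calls until the planner chooses to answer.

    plan_fn(question, evidence) returns either {"tool": "answer"} or
    {"tool": <name>, "query": <str>} (hypothetical action format).
    Returns (final_answer, number_of_tool_calls).
    """
    evidence = []
    for _ in range(max_rounds):
        action = plan_fn(question, evidence)  # plan, or replan given new evidence
        if action["tool"] == "answer":
            break
        result = tools[action["tool"]](action["query"])  # act
        evidence.append((action["tool"], result))
    return answer_fn(question, evidence), len(evidence)


# Toy planner: search twice with different tools, then answer.
def toy_plan(question, evidence):
    if len(evidence) == 0:
        return {"tool": "image_ann", "query": question}
    if len(evidence) == 1:
        return {"tool": "bm25", "query": question}
    return {"tool": "answer"}


toy_tools = {
    "image_ann": lambda q: "matched frame of hero_a",
    "bm25": lambda q: "text chunk about hero_a",
}

answer, n_calls = run_par("Who is this?", toy_plan, toy_tools,
                          lambda q, ev: "A" if ev else "D")
print(answer, n_calls)
```

The round cap guarantees termination even if the planner never decides to answer, at the cost of possibly answering with incomplete evidence.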

MMSearch-R1

Learned search model that internalizes tool use via RL training, generating tool calls as part of the decoding process.

Main Results

Accuracy and search rate (SR) on the 5,000-example test split. Oracle Knowledge provides the upper bound of evidence availability.

Best Open-source Direct QA: 66.4% (Qwen3.5-27B)

→ +12.7 points

Best PAR Agent: 79.1% (Qwen3.5-9B)

→ +16.3 points

Oracle Knowledge: 95.4% (Qwen3.5-27B)

Gemini-3.1-Pro Direct QA: 77.5%

Qwen3.5-27B Direct QA: 66.4%

Qwen3.5-27B Workflow RAG: 69.4%

Qwen3.5-9B PAR Agent: 79.1%

Qwen3.5-27B Oracle Knowledge: 95.4%
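
The two headline gaps are differences in accuracy, measured in percentage points. A small check using only the three headline numbers reported above (the helper name is ours):

```python
# Headline accuracies from this page, in percent.
headline = {
    "Best open-source Direct QA": 66.4,
    "Best PAR agent": 79.1,
    "Oracle Knowledge": 95.4,
}


def stepwise_gaps(scores):
    """Return (from, to, gap) for each consecutive pair, rounded to 0.1 points."""
    names = list(scores)
    return [
        (names[i], names[i + 1], round(scores[names[i + 1]] - scores[names[i]], 1))
        for i in range(len(names) - 1)
    ]


for src, dst, gap in stepwise_gaps(headline):
    print(f"{src} -> {dst}: +{gap} points")
```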

Setting	Model	Acc.	SR	Char.	Equip.	Map	Story	Mech.	Easy	Med.	Hard
Proprietary Models (Direct QA)
Direct	Claude-Opus-4.7	69.0	—	70.0	69.0	66.8	72.7	66.2	67.9	68.6	72.9
Direct	GPT-5.4	67.9	—	69.8	67.8	59.7	69.7	63.7	65.1	67.7	70.4
Direct	Gemini-3.1-Pro	77.5	—	79.1	79.6	69.9	87.9	72.3	82.5	78.3	68.2
Open-source Direct QA
Direct	Qwen2.5-VL-7B	49.8	—	50.8	47.9	47.4	51.5	48.1	57.1	49.4	50.2
Direct	Qwen3-VL-8B	54.8	—	56.5	52.4	55.1	66.7	50.9	58.0	53.8	61.9
Direct	Qwen3-VL-32B	57.1	—	58.7	54.3	53.6	63.6	54.4	71.2	56.2	58.9
Direct	Qwen3.5-9B	59.9	—	61.1	57.1	57.1	60.6	58.6	59.9	59.2	65.4
Direct	Qwen3.5-27B	66.4	—	68.0	63.1	63.8	75.8	63.9	73.6	65.5	71.3
Open-source Workflow RAG
Workflow	Qwen2.5-VL-7B	57.3	100.0	57.7	58.6	58.2	69.7	54.3	59.4	57.9	51.0
Workflow	Qwen3-VL-8B	63.5	100.0	64.1	66.5	63.3	60.6	59.6	66.5	64.3	55.3
Workflow	Qwen3.5-9B	66.5	100.0	67.5	67.2	64.3	63.6	63.3	70.8	66.5	64.4
Workflow	Qwen3.5-27B	69.4	100.0	70.8	69.4	65.8	78.8	65.3	70.8	69.1	71.0
Open-source Plan-Act-Replan (PAR) Agent
PAR	Qwen2.5-VL-7B	59.3	85.6	59.3	61.7	63.3	54.5	56.9	55.2	59.8	56.5
PAR	Qwen3-VL-8B	63.7	99.7	62.8	66.1	70.9	63.6	63.2	66.5	64.2	58.3
PAR	Qwen3-VL-32B	71.6	98.4	71.1	75.9	74.5	69.7	69.4	79.7	72.4	61.5
PAR	Qwen3.5-9B	79.1	100.0	79.3	82.7	79.1	87.9	76.0	75.0	79.7	75.9
PAR	Qwen3.5-27B	78.6	96.8	78.3	83.1	81.1	81.8	75.5	79.7	79.0	74.5
Open-source MMSearch-R1
MS-R1	Qwen2.5-VL-7B	49.4	72.8	49.1	53.8	54.1	60.6	46.2	49.1	35.5	29.8
MS-R1	Qwen3-VL-8B	63.2	0.02	64.5	60.8	63.3	63.6	61.1	70.3	62.6	65.6
MS-R1-Game	Qwen3-VL-8B	64.5	68.2	65.2	66.8	65.3	63.6	60.0	73.6	64.3	61.9
Oracle Knowledge (Upper Bound)
Oracle	Qwen3-VL-8B	86.5	—	88.4	87.5	87.8	78.8	80.3	73.1	87.5	84.0
Oracle	Qwen3-VL-32B	90.8	—	92.0	94.1	88.3	93.9	85.4	87.7	91.4	87.0
Oracle	Qwen3.5-9B	90.3	—	90.7	91.9	86.2	97.0	88.4	91.5	90.7	85.8
Oracle	Qwen3.5-27B	95.4	—	96.0	96.7	95.9	87.9	93.0	94.3	95.7	93.5
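
For readers reproducing the two headline columns, the sketch below computes accuracy and search rate from per-example records. The record fields (`pred`, `gold`, `num_searches`) are hypothetical, and SR is assumed here to be the share of examples on which at least one retrieval call was issued; check the released code for the exact definition.

```python
def summarize(records):
    """Return (accuracy, search_rate) in percent, rounded to one decimal.

    Each record is a dict with hypothetical keys:
      pred          predicted option letter
      gold          ground-truth option letter
      num_searches  number of retrieval calls made for this example
    """
    if not records:
        raise ValueError("no records to summarize")
    n = len(records)
    correct = sum(r["pred"] == r["gold"] for r in records)
    searched = sum(r["num_searches"] > 0 for r in records)
    return round(100.0 * correct / n, 1), round(100.0 * searched / n, 1)


# Made-up records, for illustration only.
toy_records = [
    {"pred": "A", "gold": "A", "num_searches": 2},
    {"pred": "B", "gold": "C", "num_searches": 0},
    {"pred": "D", "gold": "D", "num_searches": 1},
    {"pred": "C", "gold": "C", "num_searches": 0},
]
print(summarize(toy_records))
```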

Analysis

Figure 4. Retrieval gains and search behavior. Left: accuracy decomposition from Direct QA to Oracle Knowledge, showing gains from Workflow RAG, PAR, and the remaining gap to Oracle. Right: correctness and search-usage breakdown for MS-R1 models.

Figure 5. Tool-use diagnostics. Left: PAR tool-call counts and accuracy across model scales. Right: item-level search rates and accuracy of MS-R1-style models.

Key Findings

Large knowledge gap: The best Direct QA result (77.5%, Gemini-3.1-Pro) trails Oracle Knowledge (95.4%) by 17.9 points, showing that game-specific evidence is critical: many questions are answerable with evidence but not without it.
PAR outperforms fixed workflows: Qwen3.5-9B with PAR reaches 79.1% by adaptively coordinating multiple retrieval channels, ahead of the best Workflow RAG result (69.4%) and all Direct QA baselines, including Gemini-3.1-Pro (77.5%).
Scale vs. search volume: Very small models (0.8B) under-search; medium models (2B) over-search without effective evidence use; large models (27B) balance retrieval frequency and accuracy.
RL and answer-only shortcuts: Outcome-only RL can improve accuracy while suppressing retrieval behavior in multiple-choice settings; avoiding this failure mode requires task-specific reward design.

Code & Reproduction

All evaluation code is open-source. The three main entry points correspond to the Direct QA, Workflow RAG, and PAR protocols.

run_direct_qa.py Baseline

python run_direct_qa.py \
  --model Qwen3.5-27B \
  --data data/test.jsonl

Direct QA baseline: pure MLLM answering with no retrieval tools and optional knowledge injection.

run_workflow.py Workflow RAG

python run_workflow.py \
  --model Qwen3.5-27B \
  --data data/test.jsonl

Fixed image→text retrieval pipeline. Image ANN retrieves frames, core elements are selected by voting, and text ANN fetches evidence.

run_agent.py PAR Agent

python run_agent.py \
  --model Qwen3.5-9B \
  --data data/test.jsonl \
  --max_rounds 6

LangGraph-based Plan-Act-Replan agent. Dynamically selects among 4 retrieval tools over multiple planning rounds.

Retrieval Services (start before evaluation)

:8001 Image ANN img_emb_server.py

:8002 Knowledge Lookup kn_lookup_server.py

:8003 Multimodal ANN multimodal_emb_server.py

:8004 Dense Text ANN text_emb_server.py

:8005 BM25 bm25_server.py
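
Since the services must be running before evaluation, a quick pre-flight check can save a failed run. This sketch only tests whether something accepts TCP connections on each port listed above; the host `127.0.0.1` and the helper names are assumptions, and it does not verify that the right service is answering.

```python
import socket

# Ports and service names as listed above.
SERVICES = {
    8001: "Image ANN",
    8002: "Knowledge Lookup",
    8003: "Multimodal ANN",
    8004: "Dense Text ANN",
    8005: "BM25",
}


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def missing_services(host="127.0.0.1", services=SERVICES):
    """List the services that are not accepting connections on the given host."""
    return [
        f"{name} (:{port})"
        for port, name in services.items()
        if not is_listening(host, port)
    ]
```

Calling `missing_services()` before launching a run returns an empty list when all five ports are reachable, and otherwise names the services that still need to be started.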

View Full Code on GitHub

Get the Dataset

SVFSearch is hosted on HuggingFace. The dataset includes QA pairs, game frame images, the text knowledge base, and retrieval indices: everything needed to reproduce the benchmark results.

9,198 QA pairs (JSONL)

6,415 Game frame images (JPG)

262K Knowledge chunks (JSONL)
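
The QA pairs and knowledge chunks ship as JSONL, one JSON object per line. A minimal loader is sketched below; it makes no assumption about field names, and the path `data/test.jsonl` is taken from the commands above rather than from the dataset card.

```python
import json
from pathlib import Path


def parse_jsonl(lines):
    """Parse an iterable of JSONL lines into a list of objects, skipping blank lines."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as err:
            raise ValueError(f"line {line_no}: invalid JSON ({err})") from err
    return records


def load_jsonl(path):
    """Read a JSONL file from disk as UTF-8."""
    with Path(path).open(encoding="utf-8") as handle:
        return parse_jsonl(handle)


# Example usage once the dataset is downloaded:
# examples = load_jsonl("data/test.jsonl")
# print(len(examples))
```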


Citation

If you use SVFSearch in your research, please cite our paper:

@misc{mao2026svfsearchmultimodalknowledgeintensivebenchmark,
      title={SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain},
      author={Lingtao Mao and Huangyu Dai and Xinyu Sun and Zihan Liang and Ben Chen and Chenyi Lei and Wenwu Ou},
      year={2026},
      eprint={2605.17946},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2605.17946},
}