SVFSearch

A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search
in the Gaming Vertical Domain

Anonymous Authors

Figure 1. SVFSearch benchmark construction and evaluation pipeline. Starting from raw short-video clips, we construct a frozen offline retrieval environment with text, image, and multimodal indices. Three evaluation protocols — Direct QA, Workflow RAG, and Plan-Act-Replan (PAR) agents — are compared against an Oracle Knowledge upper bound.

Abstract

Multimodal large language models are increasingly used as agent backbones that understand multimodal inputs, plan retrieval actions, invoke external tools, and reason over retrieved information. Yet existing benchmarks rarely evaluate this ability in short-video applications, where a paused frame is often visually ambiguous and answering requires vertical, long-tail, and fast-evolving domain knowledge.

We introduce SVFSearch, the first open benchmark for short-video frame search in the Chinese gaming domain. SVFSearch contains 5,000 four-choice test examples and 4,198 auxiliary training examples, each centered on a paused game scene from a real short-video clip.

To support fair and reproducible evaluation, SVFSearch provides a frozen offline retrieval environment with a game-domain text corpus, a topic-linked image gallery, and text, image, and multimodal retrieval interfaces, avoiding reliance on uncontrolled web search APIs. We evaluate representative paradigms ranging from direct QA and workflow RAG to Plan-Act-Replan agents and learned search models.

Results reveal a large gap between model-only answering, practical agentic search, and oracle knowledge: the best open-source direct-QA model reaches 66.4%, the best practical agent achieves 79.1%, and oracle knowledge reaches 95.4%. Further analysis exposes bottlenecks in visual grounding, retrieval quality, evidence-grounded reasoning, and tool-use behavior, including over-search, answer-only shortcuts, and retrieval-induced misleading.

Benchmark at a Glance

  • 9,198 QA examples: 5,000 test · 4,198 train
  • 221 games covered: popular titles in the Chinese gaming vertical
  • 22,800 core elements: game-specific knowledge units
  • 45K text corpus entries: 262K retrieval chunks
  • 34K gallery images: topic-linked image index
  • 6 × 3 taxonomy: 6 categories · 3 difficulty levels

Representative Examples

Each example centers on a paused game frame from a real short-video clip paired with a four-choice question. The system must locate domain-specific evidence from the retrieval environment before answering.
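
As a rough illustration of the item format described above, the sketch below models one four-choice example and a simple correctness check. The class name, field names, and prompt wording are hypothetical and are not taken from the released dataset schema.

```python
from dataclasses import dataclass

LETTERS = "ABCD"


@dataclass
class FrameQuestion:
    """One four-choice item. All field names here are hypothetical."""
    image_path: str      # paused frame from the short-video clip
    question: str
    options: list[str]   # exactly four answer options
    answer: str          # ground-truth letter, "A" to "D"


def format_prompt(item: FrameQuestion) -> str:
    """Render the question and its lettered options as plain text."""
    if len(item.options) != 4:
        raise ValueError("expected exactly four options")
    lines = [item.question]
    lines += [f"{letter}. {opt}" for letter, opt in zip(LETTERS, item.options)]
    lines.append("Answer with a single letter (A-D).")
    return "\n".join(lines)


def is_correct(item: FrameQuestion, prediction: str) -> bool:
    """Assumes the model output starts with the chosen letter."""
    return prediction.strip().upper()[:1] == item.answer
```

The check deliberately reads only the first character of the output; a real harness would likely need a more forgiving answer parser.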


Figure 2. Representative QA examples drawn from different games and question categories, showing the paused frame, question text, answer options, and ground-truth rationale.

Case Study Frames

  • Case 1 · Frame A
  • Case 1 · Frame B
  • Case 2 · Frame A
  • Case 2 · Frame B

Dataset & Retrieval Environment

Three-Stage Construction Pipeline

  1. Core Element & Knowledge Construction

    Curate 22,800 game-specific elements across 221 games, building a 45K-entry text corpus and 262K retrieval chunks.

  2. Visual Grounding via Short-Video Retrieval

    Process 200K+ videos, extract 1M+ frames, and obtain 43K reliable image-element pairs for the topic-linked image gallery.

  3. QA Generation & Quality Filtering

    Generate 80K candidate QA pairs and filter to 9,198 high-quality instances with difficulty and category labels.

Frozen Retrieval Indices

  • Dense Text: Qwen3-Embedding-0.6B · 512-dim · 45K entries
  • Sparse BM25: lexical retrieval over the chunked knowledge base
  • Image ANN: DINOv3-Base (fine-tuned) · 256-dim · 34K images
  • Multimodal ANN: Qwen3-VL-Embedding-2B · 512-dim
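
To make the difference between the sparse and dense indices concrete, here is a minimal pure-Python sketch of both scoring styles. It is not the benchmark's implementation: the BM25 variant uses the common Lucene-style IDF with assumed parameters k1 = 1.5 and b = 0.75, and the dense side is exact brute-force cosine search over toy vectors standing in for an ANN index over real embeddings.

```python
import math
from collections import Counter


def bm25_rank(query: list[str], docs: list[list[str]],
              k1: float = 1.5, b: float = 0.75) -> list[int]:
    """Return document indices sorted by BM25 score (best first)."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)


def cosine_top_k(query_vec: list[float], doc_vecs: list[list[float]],
                 k: int = 3) -> list[int]:
    """Exact cosine-similarity search; assumes all vectors are non-zero."""
    def norm(v: list[float]) -> float:
        return math.sqrt(sum(x * x for x in v))

    qn = norm(query_vec)
    sims = [sum(q * d for q, d in zip(query_vec, v)) / (qn * norm(v))
            for v in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: sims[i], reverse=True)[:k]
```

BM25 rewards exact term overlap, which suits rare game-specific names, while the embedding search can match paraphrases; that contrast is one plausible reason to expose both channels.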

Question Categories & Difficulty

Topic: Character · Equipment · Map · Story · Mechanics · Other
Difficulty: Easy · Medium · Hard

Figure 3. Distribution of SVFSearch examples across question theme, question type, and difficulty levels.

Evaluation Protocols

SVFSearch benchmarks four evaluation settings in a unified frozen environment, enabling controlled comparison from model-only to fully agentic approaches.

Direct QA

MLLM answers with the paused frame, question, and options — no retrieval tools. Establishes the model-only baseline.


Workflow RAG

Fixed image→text pipeline: image ANN retrieves candidate frames, core elements are identified, text ANN fetches relevant knowledge, then the model answers.
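
The fixed pipeline can be sketched as a single function with the retrievers and the model passed in as callables. This is a simplified illustration, not the code in run_workflow.py: the function signature, the majority-vote rule, and the way the text query is built are all assumptions.

```python
from collections import Counter
from typing import Callable


def workflow_rag(
    frame: str,
    question: str,
    image_search: Callable[[str, int], list[str]],
    text_search: Callable[[str, int], list[str]],
    answer: Callable[[str, list[str]], str],
    top_k: int = 5,
) -> str:
    """Fixed image -> element -> text -> answer pipeline (hypothetical API)."""
    # Step 1: nearest gallery images, each reported as its linked element name.
    neighbor_elements = image_search(frame, top_k)
    if neighbor_elements:
        # Step 2: majority vote over the neighbors picks the core element.
        element, _votes = Counter(neighbor_elements).most_common(1)[0]
        # Step 3: text retrieval conditioned on the element and the question.
        evidence = text_search(f"{element} {question}", top_k)
    else:
        evidence = []
    # Step 4: the model answers with whatever evidence was found.
    return answer(question, evidence)
```

Because every step always runs, a pipeline like this searches on every example, which matches the 100.0 search rate reported for the Workflow rows.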


Plan-Act-Replan (PAR)

LangGraph-based agent dynamically selects among 4 retrieval tools over up to 6 rounds, accumulating and reasoning over evidence adaptively.
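
The control flow of such an agent can be sketched in plain Python without LangGraph. The planner, tool registry, and return convention below are hypothetical; only the cap of 6 rounds comes from the description above.

```python
from typing import Callable, Optional

Tool = Callable[[str], list[str]]
Planner = Callable[[str, list[str]], Optional[tuple[str, str]]]


def plan_act_replan(
    question: str,
    tools: dict[str, Tool],
    plan: Planner,
    answer: Callable[[str, list[str]], str],
    max_rounds: int = 6,
) -> tuple[str, int]:
    """Loop: plan a tool call, execute it, then replan on the new evidence."""
    evidence: list[str] = []
    rounds_used = 0
    for _ in range(max_rounds):
        # The planner sees all evidence so far and returns
        # (tool_name, query), or None when it decides to stop searching.
        step = plan(question, evidence)
        if step is None:
            break
        tool_name, query = step
        evidence.extend(tools[tool_name](query))
        rounds_used += 1
    return answer(question, evidence), rounds_used
```

Returning the number of rounds alongside the answer makes it easy to derive tool-call counts and search rates per example.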


MMSearch-R1

Learned search model that internalizes tool use via RL training, generating tool calls as part of the decoding process.

Main Results

Accuracy and search rate (SR) on the 5,000-example test split. Oracle Knowledge provides the upper bound of evidence availability.

  • Best open-source Direct QA: 66.4% (Qwen3.5-27B)
  • Best PAR agent: 79.1% (Qwen3.5-9B), +12.7 points over the best open-source Direct QA
  • Oracle Knowledge: 95.4% (Qwen3.5-27B), +16.3 points over the best PAR agent
  • Gemini-3.1-Pro, Direct QA: 77.5%
  • Qwen3.5-27B, Workflow RAG: 69.4%

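| Setting | Model | Acc. | SR | Char. | Equip. | Map | Story | Mech. | Easy | Med. | Hard |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Proprietary Models (Direct QA) | | | | | | | | | | | |
| Direct | Claude-Opus-4.7 | 69.0 | - | 70.0 | 69.0 | 66.8 | 72.7 | 66.2 | 67.9 | 68.6 | 72.9 |
| Direct | GPT-5.4 | 67.9 | - | 69.8 | 67.8 | 59.7 | 69.7 | 63.7 | 65.1 | 67.7 | 70.4 |
| Direct | Gemini-3.1-Pro | 77.5 | - | 79.1 | 79.6 | 69.9 | 87.9 | 72.3 | 82.5 | 78.3 | 68.2 |
| Open-source Direct QA | | | | | | | | | | | |
| Direct | Qwen2.5-VL-7B | 49.8 | - | 50.8 | 47.9 | 47.4 | 51.5 | 48.1 | 57.1 | 49.4 | 50.2 |
| Direct | Qwen3-VL-8B | 54.8 | - | 56.5 | 52.4 | 55.1 | 66.7 | 50.9 | 58.0 | 53.8 | 61.9 |
| Direct | Qwen3-VL-32B | 57.1 | - | 58.7 | 54.3 | 53.6 | 63.6 | 54.4 | 71.2 | 56.2 | 58.9 |
| Direct | Qwen3.5-9B | 59.9 | - | 61.1 | 57.1 | 57.1 | 60.6 | 58.6 | 59.9 | 59.2 | 65.4 |
| Direct | Qwen3.5-27B | 66.4 | - | 68.0 | 63.1 | 63.8 | 75.8 | 63.9 | 73.6 | 65.5 | 71.3 |
| Open-source Workflow RAG | | | | | | | | | | | |
| Workflow | Qwen2.5-VL-7B | 57.3 | 100.0 | 57.7 | 58.6 | 58.2 | 69.7 | 54.3 | 59.4 | 57.9 | 51.0 |
| Workflow | Qwen3-VL-8B | 63.5 | 100.0 | 64.1 | 66.5 | 63.3 | 60.6 | 59.6 | 66.5 | 64.3 | 55.3 |
| Workflow | Qwen3.5-9B | 66.5 | 100.0 | 67.5 | 67.2 | 64.3 | 63.6 | 63.3 | 70.8 | 66.5 | 64.4 |
| Workflow | Qwen3.5-27B | 69.4 | 100.0 | 70.8 | 69.4 | 65.8 | 78.8 | 65.3 | 70.8 | 69.1 | 71.0 |
| Open-source Plan-Act-Replan (PAR) Agent | | | | | | | | | | | |
| PAR | Qwen2.5-VL-7B | 59.3 | 85.6 | 59.3 | 61.7 | 63.3 | 54.5 | 56.9 | 55.2 | 59.8 | 56.5 |
| PAR | Qwen3-VL-8B | 63.7 | 99.7 | 62.8 | 66.1 | 70.9 | 63.6 | 63.2 | 66.5 | 64.2 | 58.3 |
| PAR | Qwen3-VL-32B | 71.6 | 98.4 | 71.1 | 75.9 | 74.5 | 69.7 | 69.4 | 79.7 | 72.4 | 61.5 |
| PAR | Qwen3.5-9B | 79.1 | 100.0 | 79.3 | 82.7 | 79.1 | 87.9 | 76.0 | 75.0 | 79.7 | 75.9 |
| PAR | Qwen3.5-27B | 78.6 | 96.8 | 78.3 | 83.1 | 81.1 | 81.8 | 75.5 | 79.7 | 79.0 | 74.5 |
| Open-source MMSearch-R1 | | | | | | | | | | | |
| MS-R1 | Qwen2.5-VL-7B | 49.4 | 72.8 | 49.1 | 53.8 | 54.1 | 60.6 | 46.2 | 49.1 | 35.5 | 29.8 |
| MS-R1 | Qwen3-VL-8B | 63.2 | 0.02 | 64.5 | 60.8 | 63.3 | 63.6 | 61.1 | 70.3 | 62.6 | 65.6 |
| MS-R1-Game | Qwen3-VL-8B | 64.5 | 68.2 | 65.2 | 66.8 | 65.3 | 63.6 | 60.0 | 73.6 | 64.3 | 61.9 |
| Oracle Knowledge (Upper Bound) | | | | | | | | | | | |
| Oracle | Qwen3-VL-8B | 86.5 | - | 88.4 | 87.5 | 87.8 | 78.8 | 80.3 | 73.1 | 87.5 | 84.0 |
| Oracle | Qwen3-VL-32B | 90.8 | - | 92.0 | 94.1 | 88.3 | 93.9 | 85.4 | 87.7 | 91.4 | 87.0 |
| Oracle | Qwen3.5-9B | 90.3 | - | 90.7 | 91.9 | 86.2 | 97.0 | 88.4 | 91.5 | 90.7 | 85.8 |
| Oracle | Qwen3.5-27B | 95.4 | - | 96.0 | 96.7 | 95.9 | 87.9 | 93.0 | 94.3 | 95.7 | 93.5 |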
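
For readers who want to recompute the two headline columns from their own runs, the sketch below derives accuracy and search rate from per-example records. The record keys are hypothetical, and SR is assumed here to mean the percentage of examples with at least one retrieval call; the page does not spell out the exact definition.

```python
def summarize(records: list[dict]) -> tuple[float, float]:
    """Return (accuracy %, search rate %) rounded to one decimal place.

    Each record is assumed to carry hypothetical keys:
    "pred" (predicted letter), "gold" (correct letter),
    and "num_searches" (retrieval calls made for that example).
    """
    if not records:
        raise ValueError("no records to summarize")
    n = len(records)
    accuracy = 100.0 * sum(r["pred"] == r["gold"] for r in records) / n
    search_rate = 100.0 * sum(r["num_searches"] > 0 for r in records) / n
    return round(accuracy, 1), round(search_rate, 1)
```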

Analysis

Retrieval gains and search behavior

Figure 4. Retrieval gains and search behavior. Left: accuracy decomposition from Direct QA to Oracle Knowledge, showing gains from Workflow RAG, PAR, and the remaining gap to Oracle. Right: correctness and search-usage breakdown for MS-R1 models.
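
One way to read such a decomposition is as a chain of stepwise gains. The sketch below uses the Qwen3.5-27B accuracies from the results table; treating this model as the one plotted in Figure 4 is an assumption.

```python
# Qwen3.5-27B accuracies (%) taken from the main results table.
STAGES = [
    ("Direct QA", 66.4),
    ("Workflow RAG", 69.4),
    ("PAR", 78.6),
    ("Oracle Knowledge", 95.4),
]


def stepwise_gains(stages: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Gain in accuracy points between each consecutive pair of settings."""
    return [
        (f"{prev_name} -> {next_name}", round(next_acc - prev_acc, 1))
        for (prev_name, prev_acc), (next_name, next_acc) in zip(stages, stages[1:])
    ]


for label, gain in stepwise_gains(STAGES):
    print(f"{label}: +{gain} points")
```

Under this reading, the fixed workflow adds 3.0 points, PAR adds a further 9.2, and 16.8 points remain between PAR and the oracle setting.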

Tool-use diagnostics

Figure 5. Tool-use diagnostics. Left: PAR tool-call counts and accuracy across model scales. Right: item-level search rates and accuracy of MS-R1-style models.

Key Findings

  • Large knowledge gap: the best Direct QA model (Gemini-3.1-Pro, 77.5%) trails Oracle Knowledge (95.4%) by 17.9 points, showing that game-specific evidence is critical: many questions are answerable with evidence but not without it.
  • PAR outperforms fixed workflows: Qwen3.5-9B with PAR reaches 79.1%, surpassing all Direct QA baselines including Gemini-3.1-Pro, by adaptively coordinating multiple retrieval channels.
  • Scale vs. search volume: Very small models (0.8B) under-search; medium models (2B) over-search without effective evidence use; large models (27B) balance retrieval frequency and accuracy.
  • RL and answer-only shortcuts: Outcome-only RL can improve accuracy while suppressing retrieval behavior in multiple-choice settings, requiring task-specific reward design to avoid this failure mode.

Code & Reproduction

All evaluation code is open source. The three main entry points correspond to the Direct QA, Workflow RAG, and PAR protocols evaluated in the paper.

run_direct_qa.py Baseline
python run_direct_qa.py \
  --model Qwen3.5-27B \
  --data data/test.jsonl

Direct QA baseline — no retrieval tools, pure MLLM answering with optional knowledge injection.

run_workflow.py Workflow RAG
python run_workflow.py \
  --model Qwen3.5-27B \
  --data data/test.jsonl

Fixed image→text retrieval pipeline: image ANN retrieves frames, core elements are selected by voting, and text ANN fetches evidence.

run_agent.py PAR Agent
python run_agent.py \
  --model Qwen3.5-9B \
  --data data/test.jsonl \
  --max_rounds 6

LangGraph-based Plan-Act-Replan agent. Dynamically selects among 4 retrieval tools over multiple planning rounds.

Retrieval Services (start before evaluation)

:8001 Image ANN img_emb_server.py
:8002 Knowledge Lookup kn_lookup_server.py
:8003 Multimodal ANN multimodal_emb_server.py
:8004 Dense Text ANN text_emb_server.py
:8005 BM25 bm25_server.py
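
Since the services must be running before evaluation, a small pre-flight check can save a failed run. This is a sketch that only tests whether each port accepts TCP connections; the host 127.0.0.1 is an assumption, and the check says nothing about whether the service behind the port is healthy.

```python
import socket

# Ports and service names as listed above; the host is an assumption.
SERVICES = {
    8001: "Image ANN",
    8002: "Knowledge Lookup",
    8003: "Multimodal ANN",
    8004: "Dense Text ANN",
    8005: "BM25",
}


def is_listening(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0


def missing_services(host: str = "127.0.0.1") -> list[str]:
    """Names of services whose ports are not accepting connections."""
    return [name for port, name in SERVICES.items()
            if not is_listening(port, host)]
```

Calling missing_services() before launching an evaluation script and aborting on a non-empty result is one simple way to use it.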
The full code is available on GitHub.

Get the Dataset

SVFSearch is hosted on Hugging Face. The dataset includes QA pairs, game frame images, the text knowledge base, and retrieval indices: all the resources needed to reproduce the benchmark results.

9,198 QA pairs (JSONL)
6,415 Game frame images (JPG)
262K Knowledge chunks (JSONL)
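
The JSONL files can be read with the standard library alone. The helper below is a schema-agnostic sketch: it makes no assumption about field names, and the path data/test.jsonl is borrowed from the commands in the code section rather than from the dataset card.

```python
import json
from pathlib import Path


def read_jsonl(path: str) -> list[dict]:
    """Load one JSON object per non-empty line."""
    rows = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                rows.append(json.loads(line))
    return rows


if __name__ == "__main__":
    # Path as used in the evaluation commands; adjust to your local copy.
    examples = read_jsonl("data/test.jsonl")
    print(len(examples), "examples; keys:", sorted(examples[0]))
```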

Citation

If you use SVFSearch in your research, please cite our paper:

@misc{mao2026svfsearchmultimodalknowledgeintensivebenchmark,
      title={SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain},
      author={Lingtao Mao and Huangyu Dai and Xinyu Sun and Zihan Liang and Ben Chen and Chenyi Lei and Wenwu Ou},
      year={2026},
      eprint={2605.17946},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2605.17946},
}