Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent

Yangning Li*, Yinghui Li*, Xinyu Wang, Yong Jiang, Zhen Zhang, Xinran Zheng,
Hui Wang, Hai-Tao Zheng, Pengjun Xie, Philip S. Yu, Fei Huang, Jingren Zhou
liyangning.lyn@alibaba-inc.com
Tongyi Lab, Alibaba Group
*Work done during internship at Tongyi Lab, Alibaba Group.
Paper | Code | Dataset | Leaderboard | Demo
Demo of OmniSearch

A demo of OmniSearch. You can explore it here.

Abstract

Multimodal Retrieval Augmented Generation (mRAG) plays an important role in mitigating the “hallucination” issue inherent in multimodal large language models (MLLMs). Although promising, existing heuristic mRAG approaches typically rely on predefined, fixed retrieval processes, which causes two issues: (1) non-adaptive retrieval queries and (2) overloaded retrieval queries. However, these flaws cannot be adequately reflected by current knowledge-seeking visual question answering (VQA) datasets, since most of the knowledge they require can be readily obtained with a standard two-step retrieval. To bridge the dataset gap, we first construct the Dyn-VQA dataset, consisting of three types of "dynamic" questions that require complex knowledge retrieval strategies variable in query, tool, and time: (1) questions with rapidly changing answers, (2) questions requiring multi-modal knowledge, and (3) multi-hop questions. Experiments on Dyn-VQA reveal that existing heuristic mRAG methods struggle to provide sufficient and precisely relevant knowledge for dynamic questions due to their rigid retrieval processes. Hence, we further propose OmniSearch, the first self-adaptive planning agent for multimodal retrieval. The underlying idea is to emulate human question-solving behavior by dynamically decomposing complex multimodal questions into chains of sub-questions with retrieval actions. Extensive experiments prove the effectiveness of OmniSearch and also provide direction for advancing mRAG. Code and dataset will be open-sourced.

🌟Overview

📚 We reveal that existing VQA-based mRAG benchmarks fail to reflect that real-world questions require dynamic knowledge retrieval, and propose the novel Dyn-VQA dataset, which contains three types of dynamic questions.

💡 We benchmark various mRAG methods with leading MLLMs on Dyn-VQA, demonstrating their flaws in providing sufficient and relevant knowledge for dynamic questions.

🤖 We propose OmniSearch, a self-adaptive retrieval agent that plans each retrieval action in real time according to the question-solving stage and the content retrieved so far.

📊 Extensive experiments prove the effectiveness of our OmniSearch. Detailed analyses are conducted to provide direction for advancing mRAG.

📚Dyn-VQA

Construction of Dyn-VQA Dataset. The Dyn-VQA dataset is constructed to evaluate mRAG systems with dynamic questions requiring complex retrieval strategies. It consists of 1,452 questions built through three steps:

  • Step 1: Textual Question Writing: Annotators create questions categorized by answer update frequency, need for external visual knowledge, and reasoning steps.
  • Step 2: Multimodal Rewriting: Questions are converted to multimodal format by replacing visual references and pairing with relevant images.
  • Step 3: Translation: Chinese and English versions are translated and verified for accuracy.

Data Statistics. The Dyn-VQA dataset comprises 1,452 questions across 9 domains, with a near-even split between Chinese (50.8%) and English (49.2%). Questions are categorized by their answer update frequency: 26.5% have rapidly changing answers, 34.0% have slowly changing answers, and 39.5% have static answers. In terms of reasoning complexity, 73.3% of the questions require ≤ 2 reasoning steps, while 26.7% need more than 2 steps. Additionally, 59.6% of the questions require external visual knowledge, highlighting the multimodal nature of the dataset. The average question length is 12.5 tokens, and the average answer length is 4.3 tokens. Dyn-VQA is designed to be highly challenging, emphasizing dynamic and complex retrieval needs.
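As a rough illustration of how such annotations might be consumed, the sketch below loads a JSON-lines dump and recomputes the kind of statistics reported above. The field names (`answer_update_frequency`, `reasoning_steps`, `needs_visual_knowledge`) are hypothetical placeholders, not the official schema of the released dataset.

```python
# Minimal sketch for exploring Dyn-VQA-style annotations.
# NOTE: the field names below are illustrative assumptions, not the official schema.
import json
from collections import Counter

def load_dyn_vqa(path: str) -> list[dict]:
    """Load a JSON-lines file where each line is one annotated question."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def summarize(records: list[dict]) -> None:
    """Recompute update-frequency, reasoning-step, and modality statistics."""
    freq = Counter(r.get("answer_update_frequency", "unknown") for r in records)  # fast / slow / never
    hops = Counter("<=2 steps" if r.get("reasoning_steps", 1) <= 2 else ">2 steps" for r in records)
    visual = sum(1 for r in records if r.get("needs_visual_knowledge", False))
    total = len(records)
    print(f"{total} questions")
    print("answer update frequency:", dict(freq))
    print("reasoning steps:", dict(hops))
    print(f"requiring external visual knowledge: {visual / total:.1%}")

if __name__ == "__main__":
    summarize(load_dyn_vqa("dyn_vqa.jsonl"))
```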

Data Domain. The Dyn-VQA dataset spans 9 domains. Sports and Recreation, along with Companies and Products, constitute approximately 50% of the data. The distribution of questions with fast, slow, and never-changing answers is relatively balanced among the categories and does not exhibit a long tail, reflecting a distribution that closely aligns with real-world scenarios.

🤖OmniSearch

The OmniSearch framework operates through the collaboration of three key modules to effectively tackle complex multimodal questions:

Planning Agent. This module understands the input question and real-world feedback, formulates sub-questions, and plans the next retrieval action by selecting the appropriate API and query based on the required knowledge type. It generates various potential actions, such as clarifying ambiguities, refining queries, and proposing next steps.
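To make the planning step concrete, here is a minimal sketch of the kind of action space such a planner might emit. The action names and the `PlannedAction` structure are illustrative assumptions for this write-up, not the exact interface used by OmniSearch.

```python
# Illustrative action schema for a planning agent (assumed, not the official OmniSearch API).
from dataclasses import dataclass
from enum import Enum

class RetrievalAction(Enum):
    TEXT_SEARCH = "text_search"            # web search with a textual query
    IMAGE_SEARCH = "image_search"          # search against the input image
    TEXT_SEARCH_WITH_CAPTION = "text_search_with_caption"  # textual query grounded in an image caption
    ANSWER = "answer"                      # enough knowledge gathered; produce the final answer

@dataclass
class PlannedAction:
    sub_question: str            # the next sub-question the planner wants resolved
    action: RetrievalAction      # which retrieval tool (if any) to invoke
    query: str | None = None     # the concrete query passed to the retriever
```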

Retriever. This module executes the actual retrieval operations using the chosen API and query. It fetches relevant content from external sources, including web searches and image searches, providing the necessary information to address the sub-questions.

Sub-question Solver. This module summarizes the retrieved content and attempts to answer the sub-question. It processes the information from the retriever and generates feedback for the planning agent, helping it assess the adequacy of the retrieved content and decide on the next steps.

OmniSearch iterates through these steps, dynamically adjusting its retrieval strategy based on feedback until it gathers sufficient knowledge to provide a comprehensive final answer to the original question.
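Putting the three modules together, the control flow can be sketched as a simple plan–retrieve–solve loop, reusing the hypothetical `RetrievalAction`/`PlannedAction` types above. The method names (`plan_next_step`, `retrieve`, `solve_sub_question`, `answer`), the step budget, and the stopping condition are assumptions made for illustration, not the exact implementation.

```python
# A hedged sketch of an OmniSearch-style plan-retrieve-solve loop (illustrative only).
def omnisearch_loop(question: str, image, planner, retriever, solver, max_steps: int = 5) -> str:
    feedback = ""   # solver feedback fed back to the planner
    history = []    # (sub_question, retrieved content, sub-answer) triples

    for _ in range(max_steps):
        # 1) Planning agent: decide the next sub-question and retrieval action.
        step = planner.plan_next_step(question, image, history, feedback)

        if step.action == RetrievalAction.ANSWER:
            # The planner judges the gathered knowledge sufficient for a final answer.
            return planner.answer(question, image, history)

        # 2) Retriever: execute the chosen API (web search, image search, ...) with the planned query.
        retrieved = retriever.retrieve(step.action, step.query, image)

        # 3) Sub-question solver: summarize the retrieved content and report back to the planner.
        sub_answer, feedback = solver.solve_sub_question(step.sub_question, retrieved)
        history.append((step.sub_question, retrieved, sub_answer))

    # Fall back to the best available answer if the step budget is exhausted.
    return planner.answer(question, image, history)
```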

📊Experiments

Main Results. OmniSearch (GPT-4V) outperforms other models by breaking down complex questions into sub-questions and reassessing retrieved content to ensure accuracy, which reduces error propagation. While it matches human-level performance in some aspects, it still struggles with challenging questions that require fast-changing knowledge, multiple retrieval steps, or external visual knowledge. Two-step heuristic mRAG helps by providing detailed image descriptions, although its advantage is limited to questions needing visual input. Commercial search engines like Gemini lack the grounding capabilities essential for effective multimodal integration. Finally, mRAG helps bridge the performance gap between text-only and multimodal models.

How do different sub-question solver models affect token costs and expenses? We analyze how different sub-question solvers impact token costs and expenses. Despite higher costs, OmniSearch significantly outperforms heuristic mRAG, with a proportional but non-linear relationship between OmniSearch's performance and its costs. Replacing GPT-4V with Qwen-VL-Chat reduces performance by less than 4 points (around 7.9%) but nearly halves expenses, demonstrating OmniSearch's scalability. Sub-question reasoning isn't the main bottleneck; rather, improving retrieval strategies for complex questions is more critical. This is evident from the significantly larger benefit of changing the planning model compared to altering the sub-question solver in OmniSearch (Q) with GPT-4V. Thus, with limited computational resources, prioritizing a larger model for retrieval planning is advisable.
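As a back-of-the-envelope check of the trade-off quoted above: a drop of roughly 4 points corresponding to about 7.9% implies a baseline score of around 4 / 0.079 ≈ 50 points. The numbers in the snippet below are illustrative assumptions derived from that reading, not restated results.

```python
# Back-of-the-envelope check of the cost/performance trade-off (numbers are illustrative assumptions).
gpt4v_score = 50.0                  # implied baseline: a ~4-point drop being ~7.9% => ~4 / 0.079 ≈ 50
qwen_score = gpt4v_score - 4.0      # "reduces performance by less than 4 points"
relative_drop = (gpt4v_score - qwen_score) / gpt4v_score
cost_ratio = 0.5                    # "nearly halves expenses"
print(f"relative performance drop: {relative_drop:.1%}, relative cost: {cost_ratio:.0%}")
# => relative performance drop: 8.0%, relative cost: 50%
```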