Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent

Yangning Li*, Yinghui Li*, Xinyu Wang, Yong Jiang, Zhen Zhang, Xinran Zheng,
Hui Wang, Hai-Tao Zheng, Pengjun Xie, Philip S. Yu, Fei Huang, Jingren Zhou
liyangning.lyn@alibaba-inc.com
Tongyi Lab, Alibaba Group
*Work done during internship at Tongyi Lab, Alibaba Group.
Paper | Code | Dataset | Leaderboard | Demo
Demo of OmniSearch

A demo of OmniSearch. You can explore it here.

Abstract

Multimodal Retrieval Augmented Generation (mRAG) plays an important role in mitigating the “hallucination” issue inherent in multimodal large language models (MLLMs). Although promising, existing heuristic mRAG approaches typically rely on predefined, fixed retrieval processes, which causes two issues: (1) non-adaptive retrieval queries and (2) overloaded retrieval queries. However, these flaws cannot be adequately reflected by current knowledge-seeking visual question answering (VQA) datasets, since most of the knowledge they require can be readily obtained with a standard two-step retrieval. To bridge the dataset gap, we first construct the Dyn-VQA dataset, consisting of three types of "dynamic" questions that require complex knowledge retrieval strategies variable in query, tool, and time: (1) questions with rapidly changing answers, (2) questions requiring multi-modal knowledge, and (3) multi-hop questions. Experiments on Dyn-VQA reveal that existing heuristic mRAG methods struggle to provide sufficient and precisely relevant knowledge for dynamic questions due to their rigid retrieval processes. Hence, we further propose OmniSearch, the first self-adaptive planning agent for multimodal retrieval. The underlying idea is to emulate human question-solving behavior by dynamically decomposing complex multimodal questions into chains of sub-questions with retrieval actions. Extensive experiments prove the effectiveness of OmniSearch and also provide direction for advancing mRAG. Code and dataset will be open-sourced.

🌟Overview

📚 We reveal that existing VQA-based mRAG benchmarks fail to reflect that real-world questions require dynamic knowledge retrieval, and propose the novel Dyn-VQA dataset, which contains three types of dynamic questions.

💡 We benchmark various mRAG methods with leading MLLMs on Dyn-VQA, demonstrating their flaws in providing sufficient and relevant knowledge for dynamic questions.

🤖 We propose OmniSearch, a self-adaptive retrieval agent that plans each retrieval action in real time according to the question-solving stage and the content retrieved so far.

📊 Extensive experiments prove the effectiveness of our OmniSearch. Detailed analyses are conducted to provide direction for advancing mRAG.

📚Dyn-VQA

Construction of Dyn-VQA Dataset. The Dyn-VQA dataset is constructed to evaluate mRAG systems with dynamic questions requiring complex retrieval strategies. It consists of 1,452 questions built through three steps:

  • Step 1: Textual Question Writing: Annotators create questions categorized by answer update frequency, need for external visual knowledge, and reasoning steps.
  • Step 2: Multimodal Rewriting: Questions are converted to multimodal format by replacing visual references and pairing with relevant images.
  • Step 3: Translation: Chinese and English versions are translated and verified for accuracy.

Data Statistics. The Dyn-VQA dataset comprises 1,452 questions across 9 domains, with a near-even split between Chinese (50.8%) and English (49.2%). Questions are categorized by their answer update frequency: 26.5% have rapidly changing answers, 34.0% have slowly changing answers, and 39.5% have static answers. In terms of reasoning complexity, 73.3% of the questions require ≤ 2 reasoning steps, while 26.7% need more than 2 steps. Additionally, 59.6% of the questions require external visual knowledge, highlighting the multimodal nature of the dataset. The average question length is 12.5 tokens, and the average answer length is 4.3 tokens. Dyn-VQA is designed to be highly challenging, emphasizing dynamic and complex retrieval needs.
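As a rough illustration of how such annotations might be consumed, the sketch below loads a JSON-lines dump and recomputes the kind of statistics reported above. The field names (`answer_update_frequency`, `reasoning_steps`, `needs_visual_knowledge`) are hypothetical placeholders, not the official schema of the released dataset.

```python
# Minimal sketch for exploring Dyn-VQA-style annotations.
# NOTE: the field names below are illustrative assumptions, not the official schema.
import json
from collections import Counter

def load_dyn_vqa(path: str) -> list[dict]:
    """Load a JSON-lines file where each line is one annotated question."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def summarize(records: list[dict]) -> None:
    """Recompute update-frequency, reasoning-step, and modality statistics."""
    freq = Counter(r.get("answer_update_frequency", "unknown") for r in records)  # fast / slow / never
    hops = Counter("<=2 steps" if r.get("reasoning_steps", 1) <= 2 else ">2 steps" for r in records)
    visual = sum(1 for r in records if r.get("needs_visual_knowledge", False))
    total = len(records)
    print(f"{total} questions")
    print("answer update frequency:", dict(freq))
    print("reasoning steps:", dict(hops))
    print(f"requiring external visual knowledge: {visual / total:.1%}")

if __name__ == "__main__":
    summarize(load_dyn_vqa("dyn_vqa.jsonl"))
```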

Data Domain. The Dyn-VQA dataset spans 9 domains. Sports and Recreation, along with Companies and Products, constitute approximately 50% of the data. The distribution of questions with fast, slow, and never-changing answers is relatively balanced among the categories and does not exhibit a long tail, reflecting a distribution that closely aligns with real-world scenarios.

🤖OmniSearch

The OmniSearch framework operates through the collaboration of three key modules to effectively tackle complex multimodal questions:

Planning Agent. This module understands the input question and real-world feedback, formulates sub-questions, and plans the next retrieval action by selecting the appropriate API and query based on the required knowledge type. It generates various potential actions, such as clarifying ambiguities, refining queries, and proposing next steps.
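To make the planning step concrete, here is a minimal sketch of the kind of action space such a planner might emit. The action names and the `PlannedAction` structure are illustrative assumptions for this write-up, not the exact interface used by OmniSearch.

```python
# Illustrative action schema for a planning agent (assumed, not the official OmniSearch API).
from dataclasses import dataclass
from enum import Enum

class RetrievalAction(Enum):
    TEXT_SEARCH = "text_search"            # web search with a textual query
    IMAGE_SEARCH = "image_search"          # search against the input image
    TEXT_SEARCH_WITH_CAPTION = "text_search_with_caption"  # textual query grounded in an image caption
    ANSWER = "answer"                      # enough knowledge gathered; produce the final answer

@dataclass
class PlannedAction:
    sub_question: str            # the next sub-question the planner wants resolved
    action: RetrievalAction      # which retrieval tool (if any) to invoke
    query: str | None = None     # the concrete query passed to the retriever
```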

Retriever. This module executes the actual retrieval operations using the chosen API and query. It fetches relevant content from external sources, including web searches and image searches, providing the necessary information to address the sub-questions.

Sub-question Solver. This module summarizes the retrieved content and attempts to answer the sub-question. It processes the information from the retriever and generates feedback for the planning agent, helping it assess the adequacy of the retrieved content and decide on the next steps.

OmniSearch iterates through these steps, dynamically adjusting its retrieval strategy based on feedback until it gathers sufficient knowledge to provide a comprehensive final answer to the original question.
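Putting the three modules together, the control flow can be sketched as a simple plan–retrieve–solve loop, reusing the hypothetical `RetrievalAction`/`PlannedAction` types above. The method names (`plan_next_step`, `retrieve`, `solve_sub_question`, `answer`), the step budget, and the stopping condition are assumptions made for illustration, not the exact implementation.

```python
# A hedged sketch of an OmniSearch-style plan-retrieve-solve loop (illustrative only).
def omnisearch_loop(question: str, image, planner, retriever, solver, max_steps: int = 5) -> str:
    feedback = ""   # solver feedback fed back to the planner
    history = []    # (sub_question, retrieved content, sub-answer) triples

    for _ in range(max_steps):
        # 1) Planning agent: decide the next sub-question and retrieval action.
        step = planner.plan_next_step(question, image, history, feedback)

        if step.action == RetrievalAction.ANSWER:
            # The planner judges the gathered knowledge sufficient for a final answer.
            return planner.answer(question, image, history)

        # 2) Retriever: execute the chosen API (web search, image search, ...) with the planned query.
        retrieved = retriever.retrieve(step.action, step.query, image)

        # 3) Sub-question solver: summarize the retrieved content and report back to the planner.
        sub_answer, feedback = solver.solve_sub_question(step.sub_question, retrieved)
        history.append((step.sub_question, retrieved, sub_answer))

    # Fall back to the best available answer if the step budget is exhausted.
    return planner.answer(question, image, history)
```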

📊Experiments

Main Results. OmniSearch (GPT-4V) outperforms other models by breaking down complex questions into sub-questions and reassessing retrieved content to ensure accuracy, which reduces error propagation. While it matches human-level performance in some aspects, it still struggles with challenging questions that require fast-changing knowledge, multiple retrieval steps, or external visual knowledge. Two-step heuristic mRAG helps by providing detailed image descriptions, although its advantage is limited to questions needing visual input. Commercial search engines like Gemini lack the grounding capabilities essential for effective multimodal integration. Finally, mRAG helps bridge the performance gap between text-only and multimodal models.

How do different sub-question solver models affect token costs and expenses? We analyze how different sub-question solvers impact token costs and expenses. Despite higher costs, OmniSearch significantly outperforms heuristic mRAG, with a proportional but non-linear relationship between OmniSearch's performance and its costs. Replacing GPT-4V with Qwen-VL-Chat reduces performance by less than 4 points (around 7.9%) but nearly halves expenses, demonstrating OmniSearch's scalability. Sub-question reasoning isn't the main bottleneck; rather, improving retrieval strategies for complex questions is more critical. This is evident from the significantly larger benefit of changing the planning model compared to altering the sub-question solver in OmniSearch (Q) with GPT-4V. Thus, with limited computational resources, prioritizing a larger model for retrieval planning is advisable.
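As a back-of-the-envelope check of the trade-off quoted above: a drop of roughly 4 points corresponding to about 7.9% implies a baseline score of around 4 / 0.079 ≈ 50 points. The numbers in the snippet below are illustrative assumptions derived from that reading, not restated results.

```python
# Back-of-the-envelope check of the cost/performance trade-off (numbers are illustrative assumptions).
gpt4v_score = 50.0                  # implied baseline: a ~4-point drop being ~7.9% => ~4 / 0.079 ≈ 50
qwen_score = gpt4v_score - 4.0      # "reduces performance by less than 4 points"
relative_drop = (gpt4v_score - qwen_score) / gpt4v_score
cost_ratio = 0.5                    # "nearly halves expenses"
print(f"relative performance drop: {relative_drop:.1%}, relative cost: {cost_ratio:.0%}")
# => relative performance drop: 8.0%, relative cost: 50%
```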