WebWalker: Benchmarking LLMs in Web Traversal

Jialong Wu*, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang,
Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, Fei Huang
jialongwu@{alibaba-inc.com, seu.edu.cn}
Tongyi Lab, Alibaba Group
*Work done during internship at Tongyi Lab, Alibaba Group.
Paper · Code · Dataset · Leaderboard · Demo (ModelScope) · Demo (Hugging Face)

A demo of WebWalker. You can explore it here.

Abstract

Retrieval-augmented generation (RAG) demonstrates remarkable performance across open-domain question-answering tasks. However, traditional search engines may retrieve only shallow content, limiting the ability of LLMs to handle complex, multi-layered information. To address this, we introduce WebWalkerQA, a benchmark designed to assess the ability of LLMs to perform web traversal. It evaluates the capacity of LLMs to systematically traverse a website's subpages to extract high-quality data. We propose WebWalker, a multi-agent framework that mimics human-like web navigation through an explore-critic paradigm. Extensive experimental results show that WebWalkerQA is challenging and demonstrate the effectiveness of combining RAG with WebWalker through horizontal and vertical integration in real-world scenarios.

🌟Overview

📚 We construct a challenging benchmark, WebWalkerQA, comprising 680 queries from four real-world scenarios spanning over 1,373 webpages.

🤖 To tackle the long-context challenge of web-navigation tasks, we propose WebWalker, a multi-agent framework for effective memory management.

📊 Extensive experiments show that WebWalkerQA is challenging and that, for information-seeking tasks, vertical exploration within a page proves beneficial.

📚WebWalkerQA

Data Collection. To make the annotation process cost-efficient and accurate, we employ a two-stage funnel annotation strategy that combines LLM-based and human annotation. In the first stage, GPT-4o performs initial annotations; in the second, crowd-sourced human annotators conduct quality control and filtering to refine the final results.
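
The funnel reduces to two filtering passes; the sketch below illustrates the idea only, with `llm_annotate` and `human_verify` as hypothetical placeholders for the GPT-4o annotation call and the crowd-sourced quality-control step (not the released pipeline).

```python
from typing import Optional

def llm_annotate(page: str) -> Optional[dict]:
    """Stage 1 (placeholder): GPT-4o drafts a QA pair grounded in the page."""
    ...

def human_verify(qa: dict) -> bool:
    """Stage 2 (placeholder): crowd-sourced quality control and filtering."""
    ...

def build_webwalkerqa(pages: list[str]) -> list[dict]:
    drafts = [qa for p in pages if (qa := llm_annotate(p)) is not None]
    return [qa for qa in drafts if human_verify(qa)]  # only verified pairs survive
```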

Data Statistics. Through this construction method combining LLM and human participation, we obtain 680 question-answer pairs for WebWalkerQA. The dataset contains two types of data: multi-source and single-source QAs. We categorize the questions into three difficulty levels (easy, medium, and hard) based on the depth i of the corresponding subpage. WebWalkerQA spans four real-world domains: conference, organization, education, and game. It is a bilingual dataset that includes both Chinese and English.

🤖WebWalker

Think then Explore. The explorer agent navigates through a web environment step by step. It interacts with the webpage, focusing on HTML buttons and clickable links to decide where to go next. At each step, the agent observes the current page, which includes the page content and a list of clickable links. Based on this observation, it chooses a link to explore further. The decision-making process considers the sequence of all previous actions and observations, forming a "history" that helps guide the exploration. This process continues until the critic agent decides that enough information has been gathered or a pre-set limit on exploration steps is reached.
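
The loop below is a minimal sketch of this explore step, not the released implementation: `browser.observe` and `llm.decide` are hypothetical stand-ins for the page renderer and the prompted thought-and-action call.

```python
def explore(query: str, start_url: str, llm, browser, max_steps: int = 10):
    """Yield one (observation, action) pair per step; the critic consumes
    the walk lazily and may stop it early once evidence suffices."""
    history = []  # running (thought, action, observation) trace guiding exploration
    url = start_url
    for _ in range(max_steps):
        page_text, links = browser.observe(url)  # page content + clickable links
        if not links:
            break                                # dead end: nothing left to click
        thought, next_url = llm.decide(
            query=query, observation=page_text, candidates=links, history=history
        )
        history.append((thought, next_url, page_text))
        yield page_text, next_url
        url = next_url
```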

Think then Critique. The critic agent steps in to assess the progress made by the explorer agent. After each exploration step, the critic evaluates the current state, including the query, the latest page observation, and the chosen action. It maintains a memory that is updated incrementally with relevant information gathered so far. The critic decides whether the information collected is sufficient to answer the query. If it is, the critic formulates and provides an answer. If not, the exploration continues. This iterative process ensures that the exploration is purposeful and focused on gathering the necessary information to resolve the query.
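
Continuing the sketch above (with the same hypothetical `llm` wrapper, whose `extract_relevant`, `is_sufficient`, and `answer` methods are assumed prompted calls), the critic consumes the walk, updates its memory incrementally, and terminates exploration once the evidence suffices:

```python
def critique(query: str, walk, llm):
    memory = []  # incrementally updated store of query-relevant evidence
    for observation, action in walk:
        snippet = llm.extract_relevant(query=query, observation=observation, action=action)
        if snippet:                  # keep only information relevant to the query
            memory.append(snippet)
        if llm.is_sufficient(query=query, memory=memory):
            return llm.answer(query=query, memory=memory)
    return None                      # step budget exhausted without an answer
```

Because `explore` is a generator, a call such as `critique(query, explore(query, start_url, llm, browser), llm)` advances the walk one step at a time, so an early `return` from the critic also halts exploration.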

📊Experiments

Results on Agents. Closed-source models outperform open-source models in both performance and efficiency. For open-source models, performance and efficiency improve as model size increases. Our proposed WebWalker framework outperforms Reflexion, which in turn outperforms ReAct. We count the action count (A.C.) only over correct executions; as model size increases, the A.C. grows, indicating that larger LLMs have stronger long-range information-seeking ability. Even the best-performing WebWalker, using GPT-4o as its backbone, does not surpass 40%, highlighting the challenge posed by WebWalkerQA.

Further Analysis. The further a point lies toward the top-right corner, the more effective and prolonged the web traversal. We observe that increasing model size, or introducing reflection on each action, can solve certain problems requiring multi-step solutions, enabling long-distance task-solving in web traversal. A model with relatively few parameters using the ReAct framework lacks the capacity to explore information in depth: it makes judgments within just a few action iterations regardless of whether relevant information has been found, tending to "give up" and exhibiting impatience. Introducing memory to manage the long context, together with increasing model parameters, provides evidence that this phenomenon stems both from the interference of noisy information in long contexts and from the inherent capabilities of the model itself. As the depth increases or more sources are required, acquiring the information needed to resolve the query becomes harder, and accuracy declines accordingly.

Results on RAG Systems. We first evaluate performance under the closed-book setting, using the state-of-the-art models OpenAI o1 and Gemini-1.5-Pro without retrieval. We then assess several commercial and open-source RAG systems. Both commercial and open-source RAG systems exhibit relatively poor performance on WebWalkerQA, with the best result, from Tongyi, reaching only 40%. Furthermore, as difficulty increases and the required information lies deeper, performance tends to deteriorate.

Findings (i): RAG systems struggle with queries that require effective web traversal.

🤔️Interesting Findings

The standard RAG system can be viewed as a horizontal search for documents relevant to a query, while WebWalker can be considered a vertical exploration approach. WebWalker integrates seamlessly into standard RAG systems to acquire deep information and enhance problem-solving capabilities, as sketched below. After integration, performance improves across all difficulty levels, especially in the multi-source category.
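
The sketch below illustrates that integration under assumed interfaces: `search`, `webwalker`, and `llm.answer` are hypothetical callables, not the repository's API.

```python
def agentic_rag(query: str, search, webwalker, llm, top_k: int = 5):
    # Horizontal stage: a standard RAG pass retrieves surface-level snippets.
    hits = search(query, top_k=top_k)
    evidence = [hit.snippet for hit in hits]
    # Vertical stage: WebWalker digs into each result's subpages for the
    # deep, multi-layered information the surface snippets miss.
    for hit in hits:
        deep_info = webwalker(query=query, start_url=hit.url)
        if deep_info:
            evidence.append(deep_info)
    return llm.answer(query=query, context=evidence)
```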

Findings (ii): WebWalker can serve as a module in an agentic RAG system, enabling vertical exploration.

We scale K ∈ {5, 10, 15, 20, 25} to study the impact of inference-time scaling when tracing source information. Figure 9 shows the results: larger values of K lead to better performance, validating the feasibility of vertical scaling within a certain range.
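
Such a sweep is straightforward to script; in this hypothetical sketch, `run_webwalker` is a placeholder for an evaluation run at traversal budget K.

```python
def run_webwalker(benchmark: str, budget_k: int) -> float:
    """Placeholder: evaluate WebWalker on the benchmark with budget K."""
    return 0.0  # stub value

for k in (5, 10, 15, 20, 25):
    accuracy = run_webwalker(benchmark="WebWalkerQA", budget_k=k)
    print(f"K={k}: accuracy={accuracy:.1%}")
```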

Findings (iii): Scaling the process of digging through links could represent a potential direction for vertical exploration in RAG systems.

🚩Citation

@misc{wu2025webwalker,
      title={WebWalker: Benchmarking LLMs in Web Traversal},
      author={Jialong Wu and Wenbiao Yin and Yong Jiang and Zhenglin Wang and Zekun Xi and Runnan Fang and Deyu Zhou and Pengjun Xie and Fei Huang},
      year={2025},
      eprint={2501.07572},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.07572},
}