ZeroSearch: Incentivize the Search Capability of LLMs without Searching

Hao Sun, Zile Qiao, Jiayan Guo, Xuanbo Fan, Yingyan Hou
Yong Jiang, Pengjun Xie, Fei Huang, Yan Zhang
Tongyi Lab, Alibaba Group
Paper | Code | Dataset | Model

Abstract

Effective information searching is essential for enhancing the reasoning and generation capabilities of large language models (LLMs). Recent research has explored using reinforcement learning (RL) to improve LLMs' search capabilities by interacting with live search engines in real-world environments. While these approaches show promising results, they face two major challenges: (1) Uncontrolled Document Quality: The quality of documents returned by search engines is often unpredictable, introducing noise and instability into the training process. (2) Prohibitively High API Costs: RL training requires frequent rollouts, potentially involving hundreds of thousands of search requests, which incur substantial API expenses and severely constrain scalability. To address these challenges, we introduce ZeroSearch, a reinforcement learning framework that enhances the search capabilities of LLMs without interacting with real search engines. Our approach begins with lightweight supervised fine-tuning to transform the LLM into a retrieval module capable of generating both relevant and noisy documents in response to a query. During RL training, we employ a curriculum-based rollout strategy that incrementally degrades the quality of generated documents, progressively eliciting the model’s reasoning ability by exposing it to increasingly challenging retrieval scenarios. Extensive experiments demonstrate that ZeroSearch effectively incentivizes the search capabilities of LLMs using a 3B LLM as the retrieval module. Remarkably, a 7B retrieval module achieves comparable performance to the real search engine, while a 14B retrieval module even surpasses it. Furthermore, it generalizes well across both base and instruction-tuned models of varying sizes and is compatible with a wide range of RL algorithms.

🌟Overview

🔍 We propose ZeroSearch, a novel reinforcement learning framework that incentivizes the search capability of LLMs without interacting with real search engines.

🤖 Through supervised fine-tuning, we transform the LLM into a retrieval module capable of generating both relevant and noisy documents in response to a query. We further introduce a curriculum rollout mechanism to progressively elicit the model’s reasoning ability by exposing it to increasingly challenging retrieval scenarios.

📊 We conduct extensive experiments on both in-domain and out-of-domain datasets. Results show that ZeroSearch outperforms real search engine-based models while incurring zero API cost. Moreover, it generalizes well across both base and instruction-tuned LLMs of various sizes and supports different reinforcement learning algorithms.

🔍 ZeroSearch

Reinforcement Learning without a Search Engine We propose a reinforcement learning framework that eliminates the need for a real search engine by leveraging an LLM to simulate the search engine. The optimization objective is formulated as:

\[ \max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D},\,y \sim \pi_{\theta}(\cdot \mid x; \pi_{\psi})} \bigl[\,r_{\phi}(x, y)\bigr] \;-\;\beta\,\mathrm{D}_{\mathrm{KL}}\bigl[\pi_{\theta}(y \mid x; \pi_{\psi}) \,\big\|\, \pi_{\mathrm{ref}}(y \mid x; \pi_{\psi})\bigr], \]

where \(\pi_{\theta}\) is the policy model to be optimized, \(\pi_{\mathrm{ref}}\) is the reference model, and \(r_{\phi}\) denotes the reward function. \(\pi_{\psi}\) represents the simulation LLM, whose parameters remain fixed throughout training.
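To make the objective concrete, the sketch below computes the quantity being maximized for a batch of rollouts, given per-token log-probabilities under the policy and reference models. This is a minimal illustration rather than the training implementation: the function and tensor names are ours, and the sequence-level KL term is approximated by summing per-token log-ratios.

```python
import torch

def kl_regularized_objective(reward: torch.Tensor,
                             logprobs_policy: torch.Tensor,
                             logprobs_ref: torch.Tensor,
                             beta: float = 1e-2) -> torch.Tensor:
    """Per-rollout training signal: r_phi(x, y) minus a beta-weighted KL penalty
    to the reference policy. Names and the KL approximation are illustrative.

    reward:          (batch,)          scalar reward per rollout
    logprobs_policy: (batch, seq_len)  token log-probs under pi_theta
    logprobs_ref:    (batch, seq_len)  token log-probs under pi_ref
    """
    # Approximate the sequence-level KL by summing per-token log-ratios.
    kl_per_seq = (logprobs_policy - logprobs_ref).sum(dim=-1)
    # Quantity to maximize for each rollout.
    return reward - beta * kl_per_seq
```

In practice this signal is optimized with a policy-gradient algorithm such as PPO or GRPO; the sketch only shows the objective itself.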

Search Simulation Tuning We propose a lightweight supervised fine-tuning (SFT) procedure. Specifically, we first collect interaction trajectories by prompting the LLM to engage with a real search engine in a multi-turn manner until a final answer is reached. Trajectories that yield correct answers are labeled as positive, indicating that the retrieved documents were useful, while those leading to incorrect answers are labeled as negative, indicating noisy retrievals. Then, we extract query-document pairs from these trajectories and perform lightweight SFT to improve the LLM’s ability to simulate real search engines.
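As an illustration of how such SFT data could be assembled, the sketch below turns collected trajectories into prompt-completion pairs that teach the simulation LLM to produce useful or noisy documents on demand. The field names, prompt wording, and number of documents per query are assumptions made for illustration, not the exact recipe used in the paper.

```python
# Hypothetical trajectory format: each trajectory records its query-document
# pairs and whether the final answer was correct (field names are illustrative).
USEFUL_TEMPLATE = "You are a search engine. Generate five useful documents for the query: {query}"
NOISY_TEMPLATE = "You are a search engine. Generate five noisy documents for the query: {query}"

def build_sft_examples(trajectories):
    examples = []
    for traj in trajectories:
        # Documents from trajectories that reached a correct answer are treated
        # as useful; those from incorrect trajectories as noisy.
        template = USEFUL_TEMPLATE if traj["answer_correct"] else NOISY_TEMPLATE
        for query, documents in traj["query_document_pairs"]:
            examples.append({
                "prompt": template.format(query=query),
                "completion": "\n\n".join(documents),
            })
    return examples
```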

Rollout with Curriculum Search Simulation During rollout, the policy model performs interactive reasoning and generates search queries, which are fed into the simulation LLM to produce corresponding documents. To gradually increase the difficulty of training, we introduce a curriculum learning-based rollout mechanism, where the quality of the retrieved documents is progressively degraded over time.
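One simple way to realize this degradation is to increase the probability of requesting noisy documents from the simulation LLM as training proceeds. The sketch below uses an exponential ramp from an easy starting probability to a harder final one; the curve shape, default values, and function name are illustrative assumptions rather than the paper's exact schedule.

```python
def noisy_doc_probability(step: int, total_steps: int,
                          p_start: float = 0.0, p_end: float = 0.5,
                          base: float = 4.0) -> float:
    """Probability of asking the simulation LLM for noisy (rather than useful)
    documents at the current training step. Exponential ramp from p_start to
    p_end; the curve shape and defaults here are illustrative assumptions."""
    frac = step / max(total_steps, 1)
    return p_start + (base ** frac - 1.0) / (base - 1.0) * (p_end - p_start)

# Early steps request mostly useful documents; later steps see harder retrievals.
if __name__ == "__main__":
    for step in (0, 25, 50, 75, 100):
        print(step, round(noisy_doc_probability(step, total_steps=100), 3))
```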

Reward Design The reward signal serves as the primary supervision in the reinforcement learning process. In this work, we adopt an F1 score-based reward that focuses solely on answer accuracy.

\[ r_{\phi}(x, y) = \frac{2 \times \mathrm{IN}}{\mathrm{PN} + \mathrm{RN}}, \]

where IN denotes the number of overlapping words between the prediction and the ground truth, PN is the number of words in the prediction, and RN is the number of words in the ground truth. We do not incorporate an additional reward for output format, as we observed that the model consistently produces well-formed responses without explicit supervision.
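A minimal sketch of this reward, assuming whitespace tokenization and lowercase normalization (the exact answer normalization may differ):

```python
from collections import Counter

def f1_reward(prediction: str, ground_truth: str) -> float:
    """Word-level F1: 2 * IN / (PN + RN), where IN counts overlapping words,
    and PN / RN count words in the prediction and ground truth respectively."""
    pred_tokens = prediction.lower().split()
    gt_tokens = ground_truth.lower().split()
    if not pred_tokens or not gt_tokens:
        return 0.0
    overlap = sum((Counter(pred_tokens) & Counter(gt_tokens)).values())  # IN
    return 2.0 * overlap / (len(pred_tokens) + len(gt_tokens))           # PN + RN
```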

📊 Experiments

Main Results The main results table compares ZeroSearch with several baseline methods across seven datasets. Based on the results, several key observations can be drawn:

ZeroSearch consistently outperforms all baseline methods. This performance advantage holds for both in-domain datasets (i.e., NQ and HotpotQA) and out-of-domain datasets (i.e., TriviaQA, PopQA, 2WikiMultiHopQA, Musique, and Bamboogle), demonstrating the robustness of our method.

ZeroSearch surpasses methods that rely on real search engines. Compared to Search-R1, which utilizes a real search engine, ZeroSearch achieves better performance, highlighting its potential as an effective alternative to real search engines in large-scale reinforcement learning.

ZeroSearch demonstrates strong generalizability. Across different model families, sizes, and types (i.e., base or instruction-tuned), ZeroSearch consistently outperforms baselines. Moreover, its performance further improves with larger models, highlighting its scalability.

Comparing ZeroSearch with a Real Search Engine We compare the reward curves of ZeroSearch and Search-R1 (which uses a real search engine) on LLaMA-3.2-3B.

The overall reward trends are similar across both methods. As training progresses, the reward scores of both ZeroSearch and Search-R1 steadily increase, indicating that the policy models in both settings effectively learn to interact with search engines and produce correct answers.

ZeroSearch achieves a more stable and smoother learning curve. ZeroSearch initially lags behind Search-R1 but eventually surpasses it with much less fluctuation, thanks to the curriculum mechanism that helps the model gradually master search tool usage.

ZeroSearch generalizes well across both base and instruction-tuned models. Under both model types, ZeroSearch steadily improves reward performance, underscoring its generalizability.

Choice of Simulation LLMs We evaluate how different simulation engine configurations affect performance, including prompt-based and fine-tuned LLMs ranging from 3B to 14B parameters.

The fine-tuned 7B simulation engine (SFT-7B) achieves performance comparable to that of Google Search, while the 14B variant (SFT-14B) even surpasses it. This demonstrates the feasibility of using a well-trained LLM as a substitute for real search engines in reinforcement learning setups.

Fine-tuned (SFT-based) simulation engines significantly outperform prompt-based ones. Although prompt-based methods are explicitly guided to mimic the response style of a real search engine, a substantial distribution gap remains, leading to inferior performance.

Performance improves consistently with increasing model size. Larger simulation LLMs not only exhibit stronger simulation capabilities, but also more accurately distinguish between relevant and irrelevant documents, thereby enabling more effective curriculum learning during training.

Case Study We show several interaction trajectories. From these examples, we observe:

The policy model consistently adheres to the expected output format, even though the format is only specified in the input template and not reinforced by the reward design.

The model demonstrates the capability for multi-turn search behavior to arrive at the final answer. This confirms that our method effectively incentivizes and leverages the model’s search capabilities.