LocalSearchBench: Benchmarking Agentic Search in Real-World Local Life Services
Abstract
Recent advances in large reasoning models (LRMs) have enabled agentic search systems to perform complex multi-step reasoning across multiple sources. However, most studies focus on general information retrieval and rarely explore vertical domains with unique challenges. In this work, we focus on local life services and introduce LocalSearchBench, which encompasses diverse and complex business scenarios. Real-world queries in this domain are often ambiguous and require multi-hop reasoning across merchants and products, and they remain challenging and not fully addressed. As the first comprehensive benchmark for agentic search in local life services, LocalSearchBench comprises a database of over 1.3M merchant entries across 6 service categories and 9 major cities, along with 900 multi-hop QA tasks derived from real user queries that require multi-step reasoning. We also developed LocalPlayground, a unified environment that integrates multiple tools for LRM interaction. Experiments show that even state-of-the-art LRMs struggle on LocalSearchBench: the best model (DeepSeek-V3.2) achieves only 32.93% correctness, and most models fall short on completeness (average 60.32%) and faithfulness (average 30.72%). These results highlight the need for specialized benchmarks and domain-specific agent training in local life services.
🛠️ Playground
Try our interactive playground to test LocalRAG Search on LocalSearchBench. Enter queries for different cities and see how our enhanced RAG system performs.
⚠️ Interactive Feature: This playground requires a backend server.
📍 Supported Cities: Currently supports Shanghai, Beijing, Guangzhou, Shenzhen, Wuhan, Chongqing, Chengdu, Suzhou, and Hangzhou. Other cities are under development.
Example Queries
🍲 Restaurant Search
上海 + 外滩 + 附近有哪些餐厅 (Shanghai + The Bund + what restaurants are nearby)
🔥 Hotpot Restaurant
北京 + 五道口 + 附近有哪些火锅店 (Beijing + Wudaokou + what hotpot restaurants are nearby)
🎬 Entertainment
深圳 + 南山区 + 附近有哪些电影院 (Shenzhen + Nanshan District + what cinemas are nearby)
🛍️ Shopping
广州 + 天河区 + 附近有哪些生日蛋糕 (Guangzhou + Tianhe District + where to find birthday cakes nearby)
☕ Coffee Shop
成都 + 春熙路 + 附近有哪些咖啡店 (Chengdu + Chunxi Road + what coffee shops are nearby)
🏨 Hotel Booking
武汉 + 武昌站 + 附近有哪些酒店 (Wuhan + Wuchang Railway Station + what hotels are nearby)
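The example queries above all follow the same "city + landmark + intent" pattern. A minimal sketch of composing such a query string; the helper name and the ` + ` separator are assumptions drawn from the examples, not part of an official LocalSearchBench API:

```python
# Hypothetical helper illustrating the "city + landmark + intent" query
# format used by the playground examples. The separator is an assumption
# based on how the examples are written.
def compose_query(city: str, landmark: str, intent: str) -> str:
    """Join the three query parts with the ' + ' separator."""
    return " + ".join([city, landmark, intent])

query = compose_query("上海", "外滩", "附近有哪些餐厅")
```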
📊 Overview
LocalSearchBench builds a comprehensive merchant database covering 6 scenarios across 9 major cities in China, totaling 1,354,185 merchants. The database is constructed with multi-agent techniques, combined with data augmentation and anonymization; details are given in our paper.
🎯 Supported Scenarios
The 6 scenarios comprehensively cover the core business verticals of local life services. The merchant distribution across scenarios mirrors the platform's core business verticals, ensuring a realistic representation of real-world search demand. Each city's data largely follows this baseline distribution, with city-specific variations detailed in our paper.
📍 Supported Cities
We carefully selected 9 major cities across China based on key economic indicators, ensuring a balanced urban-rural distribution and geographical coverage. The heatmap distribution of merchants across these cities is detailed in our paper.
⭐ Industry Search Standards
We are the first to propose a difficulty grading system (L1-L5) for agentic search in local life services, systematically defining query difficulty from easy to hard. The grading is based on the complexity of requirement understanding and of the planning-search-reflection loop, as detailed in our paper.
Based on the above merchant database, we construct 900 multi-hop QA tasks with difficulty levels ranging from L3 to L4, comprising 100 QA tasks per city. These tasks involve 3-5 hops and support both web search and LocalRAG search. They are designed to challenge agents' ability to perform complex reasoning and multi-step retrieval across diverse local service scenarios.
The following examples illustrate Multi-hop QA tasks from our benchmark, demonstrating how the system performs iterative retrieval across multiple rounds to fulfill complex planning requirements.
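To make the task format concrete, here is a minimal sketch of what a single multi-hop QA task record might look like. The field names (`city`, `difficulty`, `hops`, `question`, `tools`) are illustrative assumptions, not the benchmark's actual schema; see the paper for the real format:

```python
# Illustrative multi-hop QA task record; all field names are assumptions.
task = {
    "city": "Beijing",
    "difficulty": "L3",   # benchmark tasks range from L3 to L4
    "hops": 3,            # tasks involve 3-5 hops
    "question": "...",    # an anonymized real user query
    "tools": ["web_search", "local_rag_search"],  # both modes are supported
}
```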
🔧 System Architecture
The Search Agent follows the ReAct structure, performing iterative multi-hop reasoning with LocalRAG and Web Search; the Validation Agent assesses answer and trajectory quality.
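The Search Agent's ReAct-style loop can be sketched as alternating reasoning ("thought") with tool calls until a final answer is produced. This is a minimal illustration under assumed interfaces; the function and tool names are hypothetical, not the benchmark's actual implementation:

```python
# Minimal sketch of a ReAct-style search loop. `think` and the tool
# callables are placeholders for the agent's reasoning step and the
# LocalRAG / Web Search tools; all names here are assumptions.
from typing import Callable

def react_search(question: str,
                 think: Callable[[str, list], dict],
                 tools: dict,
                 max_rounds: int = 5) -> str:
    history: list = []
    for _ in range(max_rounds):
        step = think(question, history)      # reason about the next action
        if step["action"] == "finish":
            return step["answer"]            # terminate with a final answer
        observation = tools[step["action"]](step["query"])
        history.append((step, observation))  # feed the observation back
    return "no answer within budget"
```

The loop mirrors the reported behavior in the leaderboard: stronger models tend to take more tool calls and rounds before answering.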
🏆 Leaderboard
Performance of various models on LocalSearchBench. Evaluated on answer quality and trajectory effectiveness.
📊 Answer Quality Metrics
| Model | Avg. tool calls | Avg. rounds | Correctness (%) | Completeness (%) | Fluency (%) | Faithfulness (%) | Safety (%) |
|---|---|---|---|---|---|---|---|
| Qwen3-235B-A22B (w/ thinking) | 2.31 | 3.17 | 30.24 | 71.20 | 71.58 | 26.90 | 81.76 |
| Qwen3-235B-A22B (w/o thinking) | 2.00 | 2.93 | 21.18 | 50.94 | 69.16 | 25.28 | 79.72 |
| Qwen3-32B (w/ thinking) | 2.80 | 3.12 | 25.63 | 40.66 | 68.44 | 22.40 | 79.54 |
| Qwen3-32B (w/o thinking) | 2.78 | 3.11 | 19.76 | 40.96 | 68.50 | 21.38 | 80.76 |
| Qwen3-14B (w/ thinking) | 2.57 | 2.12 | 25.17 | 40.98 | 69.32 | 28.40 | 80.44 |
| Qwen3-14B (w/o thinking) | 2.53 | 2.07 | 24.21 | 40.60 | 69.62 | 27.44 | 80.78 |
| | 1.73 | 2.42 | 18.56 | 45.44 | 66.02 | 28.85 | 77.47 |
| | 2.91 | 3.38 | 31.60 | 69.71 | 70.80 | 33.98 | 81.96 |
| Gemini-2.5-Flash | 1.84 | 2.51 | 21.04 | 58.51 | 68.13 | 35.81 | 79.79 |
| Gemini-2.5-Pro | 2.75 | 3.10 | 32.41 | 71.10 | 71.21 | 35.02 | 82.41 |
| LongCat-Flash-Chat | 2.34 | 3.07 | 25.28 | 52.98 | 69.45 | 27.49 | 83.61 |
| LongCat-Flash-Thinking | 3.04 | 3.20 | 30.68 | 68.83 | 69.07 | 31.47 | 80.10 |
| GLM-4.6 (w/ thinking) | 3.08 | 4.06 | 32.83 | 76.83 | 70.27 | 37.48 | 81.30 |
| GLM-4.6 (w/o thinking) | 2.86 | 3.86 | 28.97 | 76.45 | 70.37 | 35.40 | 81.40 |
| DeepSeek-V3.2 (w/ thinking) | 3.21 | 4.20 | 32.93 | 77.63 | 71.01 | 39.87 | 81.22 |
| DeepSeek-V3.2 (w/o thinking) | 3.12 | 4.11 | 32.81 | 77.15 | 70.21 | 36.05 | 81.61 |
🎯 Trajectory Effectiveness Metrics
| Model | Action Relevance (%) | Evidence Sufficiency (%) | Causal Coherence (%) | Search Efficiency (%) |
|---|---|---|---|---|
| Qwen3-235B-A22B (w/ thinking) | 80.40 | 45.75 | 52.04 | 48.63 |
| Qwen3-235B-A22B (w/o thinking) | 75.20 | 43.61 | 50.68 | 45.99 |
| Qwen3-32B (w/ thinking) | 75.00 | 46.99 | 48.87 | 49.13 |
| Qwen3-32B (w/o thinking) | 74.60 | 46.52 | 48.82 | 49.84 |
| Qwen3-14B (w/ thinking) | 81.70 | 47.24 | 52.52 | 51.65 |
| Qwen3-14B (w/o thinking) | 80.40 | 46.44 | 50.96 | 50.02 |
| | 68.40 | 38.62 | 45.83 | 42.29 |
| | 75.90 | 44.71 | 51.78 | 42.96 |
| Gemini-2.5-Flash | 70.50 | 41.61 | 48.94 | 46.91 |
| Gemini-2.5-Pro | 77.30 | 45.73 | 52.87 | 41.78 |
| LongCat-Flash-Chat | 77.80 | 47.33 | 50.86 | 52.29 |
| LongCat-Flash-Thinking | 78.50 | 47.37 | 53.18 | 53.27 |
| GLM-4.6 (w/ thinking) | 77.70 | 48.90 | 52.67 | 54.43 |
| GLM-4.6 (w/o thinking) | 74.30 | 48.44 | 50.79 | 52.76 |
| DeepSeek-V3.2 (w/ thinking) | 75.60 | 48.86 | 52.62 | 54.83 |
| DeepSeek-V3.2 (w/o thinking) | 75.50 | 48.49 | 52.23 | 54.33 |
Note: Results are based on our evaluation framework. See paper for detailed methodology.