LocalSearchBench: Benchmarking Agentic Search in Real-World Local Life Services
Abstract
Recent advances in large reasoning models (LRMs) have enabled agentic search systems to perform complex multi-step reasoning across multiple sources. However, most studies focus on general information retrieval and rarely explore vertical domains with unique challenges. In this work, we focus on local life services and introduce LocalSearchBench, which encompasses diverse and complex business scenarios. Real-world queries in this domain are often ambiguous and require multi-hop reasoning across merchants and products, and they remain challenging and not fully addressed. As the first comprehensive benchmark for agentic search in local life services, LocalSearchBench comprises a database of over 1.3M merchant entries across 6 service categories and 9 major cities, along with 900 multi-hop QA tasks derived from real user queries that require multi-step reasoning. We also developed LocalPlayground, a unified environment that integrates multiple tools for LRM interaction. Experiments show that even state-of-the-art LRMs struggle on LocalSearchBench: the best model (DeepSeek-V3.2) achieves only 32.93% correctness, and most models fall short on completeness (average 60.32%) and faithfulness (average 30.72%). These results highlight the need for specialized benchmarks and domain-specific agent training in local life services.
🛠️ Playground
Try our interactive playground to test LocalRAG Search on LocalSearchBench. Enter queries for different cities and see how our enhanced RAG system performs.
⚠️ Interactive Feature: This playground requires a backend server.
📍 Supported Cities: Currently supports Shanghai, Beijing, Guangzhou, Shenzhen, Wuhan, Chongqing, Chengdu, Suzhou, and Hangzhou. Other cities are under development.
Example Queries
🍲 Restaurant Search
上海 + 外滩 + 附近有哪些餐厅 (Shanghai + The Bund + what restaurants are nearby)
🔥 Hotpot Restaurant
北京 + 五道口 + 附近有哪些火锅店 (Beijing + Wudaokou + what hotpot restaurants are nearby)
🎬 Entertainment
深圳 + 南山区 + 附近有哪些电影院 (Shenzhen + Nanshan District + what cinemas are nearby)
🛍️ Shopping
广州 + 天河区 + 附近有哪些生日蛋糕 (Guangzhou + Tianhe District + where to find birthday cakes nearby)
☕ Coffee Shop
成都 + 春熙路 + 附近有哪些咖啡店 (Chengdu + Chunxi Road + what coffee shops are nearby)
🏨 Hotel Booking
武汉 + 武昌站 + 附近有哪些酒店 (Wuhan + Wuchang Railway Station + what hotels are nearby)
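The example queries above all follow the same "city + landmark + intent" pattern. A minimal sketch of composing such a query string; the helper name and the ` + ` separator are assumptions drawn from the examples, not part of an official LocalSearchBench API:

```python
# Hypothetical helper illustrating the "city + landmark + intent" query
# format used by the playground examples. The separator is an assumption
# based on how the examples are written.
def compose_query(city: str, landmark: str, intent: str) -> str:
    """Join the three query parts with the ' + ' separator."""
    return " + ".join([city, landmark, intent])

query = compose_query("上海", "外滩", "附近有哪些餐厅")
```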
📊 Overview
LocalSearchBench builds a comprehensive merchant database covering 6 scenarios across 9 major cities in China, totaling 1,354,185 merchants. The database is constructed with multi-agent techniques, combined with data augmentation and anonymization; details are given in our paper.
🎯 Supported Scenarios
The 6 scenarios comprehensively cover the core business verticals of local life services. The merchant distribution across scenarios mirrors the platform's core business verticals, ensuring a realistic representation of real-world search demand. Each city's data largely follows this baseline distribution, with city-specific variations detailed in our paper.
📍 Supported Cities
We carefully selected 9 major cities across China based on key economic indicators, ensuring a balanced urban-rural distribution and geographical coverage. The heatmap distribution of merchants across these cities is detailed in our paper.
⭐ Industry Search Standards
We are the first to propose a difficulty grading system (L1-L5) for agentic search in local life services, systematically defining query difficulty from easy to hard. The grading is based on the complexity of requirement understanding and of the planning-search-reflection loop, as detailed in our paper.
Based on the above merchant database, we construct 900 multi-hop QA tasks with difficulty levels ranging from L3 to L4, comprising 100 QA tasks per city. These tasks involve 3-5 hops and support both web search and LocalRAG search. They are designed to challenge agents' ability to perform complex reasoning and multi-step retrieval across diverse local service scenarios.
The following examples illustrate Multi-hop QA tasks from our benchmark, demonstrating how the system performs iterative retrieval across multiple rounds to fulfill complex planning requirements.
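To make the task format concrete, here is a minimal sketch of what a single multi-hop QA task record might look like. The field names (`city`, `difficulty`, `hops`, `question`, `tools`) are illustrative assumptions, not the benchmark's actual schema; see the paper for the real format:

```python
# Illustrative multi-hop QA task record; all field names are assumptions.
task = {
    "city": "Beijing",
    "difficulty": "L3",   # benchmark tasks range from L3 to L4
    "hops": 3,            # tasks involve 3-5 hops
    "question": "...",    # an anonymized real user query
    "tools": ["web_search", "local_rag_search"],  # both modes are supported
}
```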
🔧 System Architecture
The Search Agent follows the ReAct structure, performing iterative multi-hop reasoning with LocalRAG and Web Search; the Validation Agent assesses answer and trajectory quality.
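The Search Agent's ReAct-style loop can be sketched as alternating reasoning ("thought") with tool calls until a final answer is produced. This is a minimal illustration under assumed interfaces; the function and tool names are hypothetical, not the benchmark's actual implementation:

```python
# Minimal sketch of a ReAct-style search loop. `think` and the tool
# callables are placeholders for the agent's reasoning step and the
# LocalRAG / Web Search tools; all names here are assumptions.
from typing import Callable

def react_search(question: str,
                 think: Callable[[str, list], dict],
                 tools: dict,
                 max_rounds: int = 5) -> str:
    history: list = []
    for _ in range(max_rounds):
        step = think(question, history)      # reason about the next action
        if step["action"] == "finish":
            return step["answer"]            # terminate with a final answer
        observation = tools[step["action"]](step["query"])
        history.append((step, observation))  # feed the observation back
    return "no answer within budget"
```

The loop mirrors the reported behavior in the leaderboard: stronger models tend to take more tool calls and rounds before answering.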
🏆 Leaderboard
Performance of various models on LocalSearchBench. Evaluated on answer quality and trajectory effectiveness.
📊 Answer Quality Metrics
| Model | Avg. tool calls | Avg. rounds | Correctness (%) | Completeness (%) | Fluency (%) | Faithfulness (%) | Safety (%) |
|---|---|---|---|---|---|---|---|
| Qwen3-235B-A22B (w/ thinking) | 2.31 | 3.17 | 30.24 | 71.20 | 71.58 | 26.90 | 81.76 |
| Qwen3-235B-A22B (w/o thinking) | 2.00 | 2.93 | 21.18 | 50.94 | 69.16 | 25.28 | 79.72 |
| Qwen3-32B (w/ thinking) | 2.80 | 3.12 | 25.63 | 40.66 | 68.44 | 22.40 | 79.54 |
| Qwen3-32B (w/o thinking) | 2.78 | 3.11 | 19.76 | 40.96 | 68.50 | 21.38 | 80.76 |
| Qwen3-14B (w/ thinking) | 2.57 | 2.12 | 25.17 | 40.98 | 69.32 | 28.40 | 80.44 |
| Qwen3-14B (w/o thinking) | 2.53 | 2.07 | 24.21 | 40.60 | 69.62 | 27.44 | 80.78 |
| | 1.73 | 2.42 | 18.56 | 45.44 | 66.02 | 28.85 | 77.47 |
| | 2.91 | 3.38 | 31.60 | 69.71 | 70.80 | 33.98 | 81.96 |
| Gemini-2.5-Flash | 1.84 | 2.51 | 21.04 | 58.51 | 68.13 | 35.81 | 79.79 |
| Gemini-2.5-Pro | 2.75 | 3.10 | 32.41 | 71.10 | 71.21 | 35.02 | 82.41 |
| LongCat-Flash-Chat | 2.34 | 3.07 | 25.28 | 52.98 | 69.45 | 27.49 | 83.61 |
| LongCat-Flash-Thinking | 3.04 | 3.20 | 30.68 | 68.83 | 69.07 | 31.47 | 80.10 |
| GLM-4.6 (w/ thinking) | 3.08 | 4.06 | 32.83 | 76.83 | 70.27 | 37.48 | 81.30 |
| GLM-4.6 (w/o thinking) | 2.86 | 3.86 | 28.97 | 76.45 | 70.37 | 35.40 | 81.40 |
| DeepSeek-V3.2 (w/ thinking) | 3.21 | 4.20 | 32.93 | 77.63 | 71.01 | 39.87 | 81.22 |
| DeepSeek-V3.2 (w/o thinking) | 3.12 | 4.11 | 32.81 | 77.15 | 70.21 | 36.05 | 81.61 |
🎯 Trajectory Effectiveness Metrics
| Model | Action Relevance (%) | Evidence Sufficiency (%) | Causal Coherence (%) | Search Efficiency (%) |
|---|---|---|---|---|
| Qwen3-235B-A22B (w/ thinking) | 80.40 | 45.75 | 52.04 | 48.63 |
| Qwen3-235B-A22B (w/o thinking) | 75.20 | 43.61 | 50.68 | 45.99 |
| Qwen3-32B (w/ thinking) | 75.00 | 46.99 | 48.87 | 49.13 |
| Qwen3-32B (w/o thinking) | 74.60 | 46.52 | 48.82 | 49.84 |
| Qwen3-14B (w/ thinking) | 81.70 | 47.24 | 52.52 | 51.65 |
| Qwen3-14B (w/o thinking) | 80.40 | 46.44 | 50.96 | 50.02 |
| | 68.40 | 38.62 | 45.83 | 42.29 |
| | 75.90 | 44.71 | 51.78 | 42.96 |
| Gemini-2.5-Flash | 70.50 | 41.61 | 48.94 | 46.91 |
| Gemini-2.5-Pro | 77.30 | 45.73 | 52.87 | 41.78 |
| LongCat-Flash-Chat | 77.80 | 47.33 | 50.86 | 52.29 |
| LongCat-Flash-Thinking | 78.50 | 47.37 | 53.18 | 53.27 |
| GLM-4.6 (w/ thinking) | 77.70 | 48.90 | 52.67 | 54.43 |
| GLM-4.6 (w/o thinking) | 74.30 | 48.44 | 50.79 | 52.76 |
| DeepSeek-V3.2 (w/ thinking) | 75.60 | 48.86 | 52.62 | 54.83 |
| DeepSeek-V3.2 (w/o thinking) | 75.50 | 48.49 | 52.23 | 54.33 |
Note: Results are based on our evaluation framework. See paper for detailed methodology.