Promoting Efficient Reasoning with Verifiable Stepwise Reward

Proposes a rule-based VSRM that assigns rewards via intermediate reasoning states to precisely address LRMs' overthinking.

AAAI 2026.

MTIR-SQL: Multi-turn Tool-Integrated Reasoning Reinforcement Learning for Text-to-SQL

MTIR-SQL proposes a multi-turn tool-integrated reinforcement learning framework for Text-to-SQL, extending GRPO with trajectory filtering and removing KL constraints to enhance training stability and query accuracy, achieving state-of-the-art performance on benchmark datasets.

Under review.

SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning

SRFT unifies SFT and RL through entropy-aware weighting mechanisms in a single-stage method, achieving 59.1% average accuracy and outperforming zero-RL methods by 9.0% on mathematical reasoning benchmarks.

Under review.

UIOrchestra: Generating High-Fidelity Code from UI Designs with a Multi-agent System

UIOrchestra integrates three specialized agents for layout description, code generation, and difference analysis to reconstruct static single-page applications from design mockups, outperforming existing methods in complex app page reconstruction.

EMNLP 2025.

LocalSearchBench: Benchmarking Agentic Search in Real-World Local Life Services

1 Meituan, 2 East China Normal University, 3 Beijing Institute of Technology, 4 University of Science and Technology of China, 5 Shanghai Jiao Tong University, 6 Shanghai Innovation Institute
* Project leader. † Corresponding author.
Correspondence to: hang.he@stu.ecnu.edu.cn, {chaijiajun,yinguojun02}@meituan.com, {ccwan,tsu}@sei.ecnu.edu.cn

Abstract

Recent advances in large reasoning models (LRMs) have enabled agentic search systems to perform complex multi-step reasoning across multiple sources. However, most research focuses on general information retrieval and rarely explores vertical domains with unique challenges. In this work, we focus on local life services and introduce LocalSearchBench, which encompasses diverse and complex business scenarios. Real-world queries in this domain are often ambiguous and require multi-hop reasoning across merchants and products, so they remain challenging and not fully addressed. As the first comprehensive benchmark for agentic search in local life services, LocalSearchBench includes over 1,300,000 high-quality entries from various cities and business types. We construct 900 multi-hop QA tasks based on real user queries, challenging agents to understand questions and retrieve information in multiple steps. We also develop LocalPlayground, a unified environment integrating multiple tools for agent interaction. Experiments show that even state-of-the-art LRMs struggle on LocalSearchBench: the best model (DeepSeek-V3.1) achieves only 34.34% correctness, and most models have issues with completeness (average 77.33%) and faithfulness (average 61.99%). This highlights the need for specialized benchmarks and domain-specific agent training in local life services.

🛠️ Playground

Try our interactive playground to test LocalRAG Search on LocalSearchBench. Enter queries for different cities and see how our enhanced RAG system performs.

⚠️ Interactive Feature: This playground requires a backend server.

📍 Supported Cities: Currently supports Shanghai, Beijing, Guangzhou, Shenzhen, Wuhan, Chongqing, Chengdu, Suzhou, and Hangzhou. Other cities are under development.


Example Queries

🍲 Restaurant Search

Shanghai + The Bund + What restaurants are nearby?

🔥 Hotpot Restaurant

Beijing + Wudaokou + What hotpot restaurants are nearby?

🎬 Entertainment

Shenzhen + Nanshan District + What cinemas are nearby?

🛍️ Shopping

Guangzhou + Tianhe District + What birthday cakes are available nearby?

☕ Coffee Shop

Chengdu + Chunxi Road + What coffee shops are nearby?

🏨 Hotel Booking

Wuhan + Wuchang Railway Station + What hotels are nearby?
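The example queries above share a three-part pattern: a city, an area or landmark, and a question, joined by ` + `. A minimal sketch of composing such a query is shown here. `LocalQuery`, `build_query`, and the validation against the supported-city list are hypothetical helpers for illustration only, not the playground's actual interface, and the English strings stand in for whatever language the backend expects.

```python
from dataclasses import dataclass

# Cities the playground currently lists as supported.
SUPPORTED_CITIES = frozenset({
    "Shanghai", "Beijing", "Guangzhou", "Shenzhen", "Wuhan",
    "Chongqing", "Chengdu", "Suzhou", "Hangzhou",
})


@dataclass(frozen=True)
class LocalQuery:
    """Hypothetical container for one playground query."""
    city: str
    area: str
    question: str

    def to_text(self) -> str:
        # Join the three parts in the "city + area + question" pattern.
        return " + ".join((self.city, self.area, self.question))


def build_query(city: str, area: str, question: str) -> LocalQuery:
    """Reject unsupported cities and empty parts before building the query."""
    if city not in SUPPORTED_CITIES:
        raise ValueError(f"unsupported city: {city!r}")
    if not area.strip() or not question.strip():
        raise ValueError("area and question must be non-empty")
    return LocalQuery(city, area.strip(), question.strip())


query = build_query("Shanghai", "The Bund", "What restaurants are nearby?")
print(query.to_text())
```

Validating the city up front mirrors the note that other cities are still under development, so an unsupported request fails before it reaches any backend.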

📊 Overview

LocalSearchBench builds a comprehensive merchant database covering 6 scenarios across 9 major cities in China, with 1,354,185 merchants. The database is constructed through multi-agent techniques, along with data augmentation and anonymization, detailed in our paper.

🎯 Supported Scenarios

The 6 scenarios cover the core business verticals of local life services. The share of merchants in each scenario follows the platform's own mix of business verticals, so the database reflects real-world search demand. Each city's data broadly follows this baseline distribution, with city-specific variations detailed in our paper.

Figure: LocalSearchBench overview.

📍 Supported Cities

We carefully selected 9 major cities across China based on key economic indicators, ensuring a balanced urban-rural distribution and geographical coverage. The heatmap distribution of merchants across these cities is detailed in our paper.

⭐ Industry Search Standards

We are the first to propose a difficulty grading system (L1-L5) for agentic search in local life services, systematically defining query difficulty from easy to hard. The grading is based on the complexity of requirement understanding and of the planning-search-reflection loop, as detailed in our paper.

LocalSearchBench Statistics

Based on the merchant database described above, we construct 900 multi-hop QA tasks at difficulty levels L3 and L4, with 100 tasks per city. Each task involves 3-5 hops and supports both web search and LocalRAG search. The tasks are designed to challenge agents' ability to perform complex reasoning and multi-step retrieval across diverse local service scenarios.
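The stated constraints on the task set (100 tasks for each of 9 cities, difficulty L3 or L4, 3-5 hops) can be written down as a small consistency check. The sketch below is illustrative only: the `QATask` record and its field names are assumptions rather than the benchmark's released schema, the city list assumes the benchmark's nine cities are the ones listed for the playground, and the generated tasks are synthetic placeholders, not benchmark data.

```python
from collections import Counter
from dataclasses import dataclass

# Assumption: the benchmark's nine cities match the playground's list.
CITIES = (
    "Shanghai", "Beijing", "Guangzhou", "Shenzhen", "Wuhan",
    "Chongqing", "Chengdu", "Suzhou", "Hangzhou",
)
TASKS_PER_CITY = 100
ALLOWED_LEVELS = {"L3", "L4"}


@dataclass(frozen=True)
class QATask:
    """Hypothetical task record; field names are assumptions."""
    city: str
    difficulty: str  # expected to be "L3" or "L4"
    hops: int        # expected to be between 3 and 5


def check_task_set(tasks: list[QATask]) -> list[str]:
    """Return human-readable violations of the stated constraints."""
    problems = []
    for index, task in enumerate(tasks):
        if task.city not in CITIES:
            problems.append(f"task {index}: unknown city {task.city!r}")
        if task.difficulty not in ALLOWED_LEVELS:
            problems.append(f"task {index}: difficulty {task.difficulty!r}")
        if not 3 <= task.hops <= 5:
            problems.append(f"task {index}: {task.hops} hops")
    per_city = Counter(task.city for task in tasks)
    for city in CITIES:
        if per_city[city] != TASKS_PER_CITY:
            problems.append(f"{city}: {per_city[city]} tasks")
    return problems


# Synthetic placeholder tasks that satisfy every constraint.
tasks = [
    QATask(city, "L3" if i % 2 == 0 else "L4", 3 + i % 3)
    for city in CITIES
    for i in range(TASKS_PER_CITY)
]
print(len(tasks), check_task_set(tasks))
```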

The following examples illustrate multi-hop QA tasks from our benchmark, showing how the system retrieves information iteratively over multiple rounds to fulfill complex planning requirements.

Figure: Two multi-hop QA examples from LocalSearchBench.

🏆 Leaderboard

Performance of various models on LocalSearchBench, evaluated on correctness, completeness, fluency, faithfulness, and safety.

| Model | Avg. tool calls | Avg. rounds | Correctness (%) | Completeness (%) | Fluency (%) | Faithfulness (%) | Safety (%) |
|---|---|---|---|---|---|---|---|
| GPT-4.1 | 1.72 | 2.70 | 26.76 | 75.42 | 71.54 | 72.63 | 81.00 |
| Gemini-2.5-Pro | 1.89 | 2.86 | 26.09 | 77.93 | 73.50 | 78.26 | 83.81 |
| Qwen-Plus-Latest | 2.59 | 3.12 | 32.79 | 80.94 | 73.65 | 68.68 | 83.14 |
| LongCat-Large-32K | 2.73 | 3.22 | 33.19 | 80.51 | 73.40 | 60.80 | 81.14 |
| Hunyuan-T1 | 2.30 | 3.15 | 32.94 | 80.59 | 73.00 | 63.77 | 84.71 |
| Qwen3-14B | 2.04 | 2.85 | 26.42 | 76.25 | 70.97 | 48.54 | 81.54 |
| Qwen3-32B | 2.48 | 2.89 | 24.75 | 71.37 | 70.17 | 40.84 | 81.34 |
| Qwen3-235B-A22B | 2.35 | 2.81 | 28.43 | 73.48 | 70.54 | 52.43 | 81.67 |
| GLM-4.5 | 2.73 | 3.66 | 33.78 | 76.76 | 71.81 | 73.12 | 76.66 |
| DeepSeek-V3.1 | 3.43 | 4.02 | 34.34 | 80.00 | 72.39 | 60.80 | 81.48 |
| Average | 2.43 | 3.13 | 29.95 | 77.33 | 72.10 | 61.99 | 81.65 |

Note: Results are based on our evaluation framework. See paper for detailed methodology.
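The Average row of the leaderboard can be reproduced from the ten per-model rows. The sketch below copies the reported numbers and recomputes each column mean; the names `METRICS`, `RESULTS`, and `average_row` are illustrative, and rounding half up to two decimals is an assumption about how the averages were rounded (it matches every reported value).

```python
from decimal import Decimal, ROUND_HALF_UP

METRICS = ("tool_calls", "rounds", "correctness", "completeness",
           "fluency", "faithfulness", "safety")

# Per-model scores copied from the leaderboard, kept as strings so the
# decimal arithmetic is exact.
RESULTS = {
    "GPT-4.1": ("1.72", "2.70", "26.76", "75.42", "71.54", "72.63", "81.00"),
    "Gemini-2.5-Pro": ("1.89", "2.86", "26.09", "77.93", "73.50", "78.26", "83.81"),
    "Qwen-Plus-Latest": ("2.59", "3.12", "32.79", "80.94", "73.65", "68.68", "83.14"),
    "LongCat-Large-32K": ("2.73", "3.22", "33.19", "80.51", "73.40", "60.80", "81.14"),
    "Hunyuan-T1": ("2.30", "3.15", "32.94", "80.59", "73.00", "63.77", "84.71"),
    "Qwen3-14B": ("2.04", "2.85", "26.42", "76.25", "70.97", "48.54", "81.54"),
    "Qwen3-32B": ("2.48", "2.89", "24.75", "71.37", "70.17", "40.84", "81.34"),
    "Qwen3-235B-A22B": ("2.35", "2.81", "28.43", "73.48", "70.54", "52.43", "81.67"),
    "GLM-4.5": ("2.73", "3.66", "33.78", "76.76", "71.81", "73.12", "76.66"),
    "DeepSeek-V3.1": ("3.43", "4.02", "34.34", "80.00", "72.39", "60.80", "81.48"),
}


def average_row(results: dict) -> dict:
    """Column means, rounded half up to two decimals (an assumption)."""
    count = len(results)
    averages = {}
    for index, metric in enumerate(METRICS):
        total = sum(Decimal(row[index]) for row in results.values())
        averages[metric] = (total / count).quantize(
            Decimal("0.01"), rounding=ROUND_HALF_UP
        )
    return averages


averages = average_row(RESULTS)
best_model = max(RESULTS, key=lambda name: Decimal(RESULTS[name][2]))
print(averages["correctness"], averages["faithfulness"], best_model)
```

Exact decimal arithmetic is used instead of floats because the completeness mean is exactly 77.325, a tie that binary floating point could round in either direction.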

BibTeX

@misc{he2025localsearchbenchbenchmarkingagenticsearch,
  title={LocalSearchBench: Benchmarking Agentic Search in Real-World Local Life Services},
  author={Hang He and Chuhuai Yue and Chengqi Dong and Mingxue Tian and Zhenfeng Liu and Jiajun Chai and Xiaohan Wang and Yufei Zhang and Qun Liao and Guojun Yin and Wei Lin and Chengcheng Wan and Haiying Sun and Ting Su},
  year={2025},
  eprint={2512.07436},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2512.07436}
}
