A Summary of the Practical Guide to GUI Grounding



Marlowee · Published 2026-03-24 21:57:27


Original article: Why your AI Agent keeps misclicking? A Practical Guide to GUI Grounding for Frontier Models

Authors: Liangyu Chen, Hanzhang Zhou, Quyu Kong, Xu Zhang, Wenxuan Wang, Qin Jin, Yue Wang

1. The Core Problem

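Demo videos of GUI agents from the major AI companies show near-perfect interface control, yet results reproduced in practice fall far short. The authors trace the gap not to model capability but to implementation details that nobody publishes: the grounding paradigm, the coordinate system, the prompt template, the image resolution, and the Zoom-In strategy.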

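The authors systematically evaluate Gemini-3-Pro, Claude-Sonnet-4.5, Seed1.8, Kimi-K2.5, and their in-house MAI-UI on two benchmarks: OSWorld-G (Refined), a desktop grounding benchmark with clear, unambiguous instructions, and ScreenSpot-Pro, a high-resolution, dense-layout benchmark that stresses spatial precision. The study has three goals: compare grounding paradigms under a standardized evaluation, reverse-engineer the setups that reproduce each model's reported numbers, and probe the limits of each model's capability.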
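Grounding benchmarks of this kind are commonly scored by checking whether the predicted click point lands inside the target element's bounding box. The summary does not spell out the metric, so the sketch below assumes that convention; `Sample`, `is_hit`, and `grounding_accuracy` are illustrative names, not part of either benchmark's tooling.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Sample:
    # Ground-truth box in pixels: (left, top, right, bottom).
    bbox: Tuple[float, float, float, float]
    # Predicted click point in pixels, or None when the output could not be parsed.
    pred: Optional[Tuple[float, float]]


def is_hit(sample: Sample) -> bool:
    """True when the predicted point lies inside (or on the edge of) the box."""
    if sample.pred is None:
        return False
    x, y = sample.pred
    left, top, right, bottom = sample.bbox
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(samples: List[Sample]) -> float:
    """Percentage of samples whose prediction hits the target box."""
    if not samples:
        return 0.0
    return 100.0 * sum(is_hit(s) for s in samples) / len(samples)
```

Counting unparseable outputs as misses is a design choice made here so that formatting failures lower the score instead of silently disappearing.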

2. The Evolution of Grounding Paradigms

2.1 Set-of-Mark (SoM): Let a Parser Do the Visual Recognition

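Early MLLMs lacked fine-grained grounding ability, so a specialized small model such as OmniParser V2 detected the UI elements and overlaid numbered marks on the interface, leaving the MLLM to choose among the marks. To reduce visual noise, the evaluation supplies both the original unannotated screenshot and the SoM-annotated one.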
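A minimal sketch of the SoM selection step, under stated assumptions: a parser has already produced numbered boxes, the MLLM replies with a mark id, and the click lands at the center of the chosen box. `Mark`, `click_point_for_mark`, and `overlapping_pairs` are hypothetical helpers written for illustration; nothing here calls OmniParser itself.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Mark:
    mark_id: int
    left: float
    top: float
    right: float
    bottom: float

    def center(self) -> Tuple[float, float]:
        return ((self.left + self.right) / 2, (self.top + self.bottom) / 2)


def click_point_for_mark(marks: Dict[int, Mark], chosen_id: int) -> Tuple[float, float]:
    """Turn the model's chosen mark id into a click point."""
    if chosen_id not in marks:
        raise KeyError(f"model chose mark {chosen_id}, which the parser never drew")
    return marks[chosen_id].center()


def overlapping_pairs(marks: Dict[int, Mark]) -> int:
    """Count pairs of mark boxes that overlap, a rough proxy for visual clutter."""
    ids = sorted(marks)
    count = 0
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            ma, mb = marks[a], marks[b]
            if (ma.left < mb.right and mb.left < ma.right
                    and ma.top < mb.bottom and mb.top < ma.bottom):
                count += 1
    return count
```

The sketch makes the dependency explicit: the final click can only be as good as the parser's boxes, and a high `overlapping_pairs` count signals the cluttered layouts where numbered marks become hard to read.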

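Result: failure across the board. Every model scores below 30 under the SoM setting, mainly because dense UI elements produce heavy visual noise and heavily overlapping mark boxes.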

Table 1: Grounding accuracy under the SoM setting (OSWorld-G)

2.2 End-to-End (E2E): Direct Pixel-to-Coordinate Mapping

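As MLLMs gained stronger built-in spatial perception, the industry moved to the E2E paradigm: the model processes the raw screenshot and predicts (x, y) coordinates directly, with no external UI parser.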

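E2E far outperforms SoM, but it exposes two problems. (1) Results vary widely across benchmarks: Claude Sonnet 4.5 drops from 69.0 on OSWorld-G to 35.0 on ScreenSpot-Pro, which shows that high-resolution dense layouts are the harder problem. (2) Reproduced scores differ markedly from reported ones: Gemini-3-Pro reproduces at 39.0 against a reported 72.7 (a gap of 33.7 points), and Seed1.8 falls 6 points short. These gaps are far too large to be rounding error.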

Table 2: Overall performance of each model under the E2E setting

2.3 Inference-phase Optimization (Zoom-In)

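Adding a reflection step or a Zoom-In tool turns grounding from a single-step action into a multi-step interaction. The authors implement a single-stage Zoom-In tool to simulate a controllable tool-calling mechanism.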

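The effect is dramatic: Gemini-3-Pro jumps from 39.0 to 69.7 (close to the reported 72.7), Seed1.8 rises from 67.0 to 73.3 (slightly above its reported value), Claude Sonnet 4.5 climbs from 35.0 to 54.0 (a relative gain of about 54%), and MAI-UI-32B improves from 67.9 to 73.5 (a relative gain of about 8%). The "missing" capability was there all along; it only needed the right evaluation protocol to surface.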

Practical implication: deploying GUI automation on high-resolution screens without a Zoom-In step wastes model capability.

Table 3: Performance after Zoom-In optimization versus reported values

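Summary of the per-model configurations (detailed in the next section):

Model	Coordinate system	Image processing	Zoom-In strategy	Key details
Gemini-3-Pro	Relative, 0–1000	Ultra High resolution	Two stages: 1/4 crop → resize to 1920×1080	temperature=0.01
Claude Sonnet 4.5	Relative, 0–1	Resize to 1280×720	50% crop → upscale to 1280×720	Aggressive downsampling is the main cause of the poor performance
Seed1.8	Relative, 0–1000	—	50% crop	<point>x y</point> format; most robust under E2E
Kimi-K2.5	0–1000	—	Similar to Seed1.8	Long prompt with sub-agent/search/browser tools
MAI-UI-32B	0–1000	Smart resize (max 6.3M pixels)	Same as Seed1.8	Outperforms the closed-source large models at only 32B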

3. Exact Recipes for Reproducing Each Model

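This is the most practically useful part of the article: for each model it gives the full configuration needed to reach the reported performance, covering the coordinate system, resolution, Zoom-In strategy, and prompt.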

3.1 Gemini-3-Pro

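Coordinate system: relative, 0–1000
Temperature: 0.01
Media resolution: Ultra High
Zoom-In: two stages. Make an initial prediction, then crop a region of 1/4 the original width and height centered on the predicted point, resize it to 1920×1080, and ground a second time.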
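The coordinate bookkeeping behind a two-stage Zoom-In can be sketched as follows. This is an illustration under assumptions, not the authors' code: the crop ratio is taken to apply to width and height separately, the crop window is clamped to stay inside the image, and the helper names `rel_to_pixel`, `crop_box`, and `crop_to_original` are hypothetical. Because the second-stage answer is also a relative coordinate, mapping it back needs only the crop box, not the resized resolution.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels


def rel_to_pixel(x_rel: float, y_rel: float, width: float, height: float,
                 scale: float = 1000.0) -> Tuple[float, float]:
    """Convert a relative coordinate in [0, scale] to pixels."""
    return x_rel / scale * width, y_rel / scale * height


def crop_box(cx: float, cy: float, img_w: float, img_h: float, ratio: float) -> Box:
    """Crop window of size (ratio*img_w, ratio*img_h) centered on (cx, cy),
    shifted where necessary so it stays inside the image."""
    crop_w, crop_h = img_w * ratio, img_h * ratio
    left = min(max(cx - crop_w / 2, 0.0), img_w - crop_w)
    top = min(max(cy - crop_h / 2, 0.0), img_h - crop_h)
    return left, top, left + crop_w, top + crop_h


def crop_to_original(x_rel: float, y_rel: float, box: Box,
                     scale: float = 1000.0) -> Tuple[float, float]:
    """Map a relative coordinate predicted on the crop back to original pixels."""
    left, top, right, bottom = box
    return (left + x_rel / scale * (right - left),
            top + y_rel / scale * (bottom - top))


# Example: a 3840x2160 screenshot (an assumed size) and a 1/4 crop.
cx, cy = rel_to_pixel(500, 500, 3840, 2160)    # stage-1 prediction in pixels
box = crop_box(cx, cy, 3840, 2160, ratio=0.25)  # region sent to stage 2
final = crop_to_original(500, 500, box)         # stage-2 prediction in pixels
```

The actual image cropping and resizing, and the two model calls, are left out; only the coordinate mapping is shown because that is where off-by-a-scale errors tend to hide.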

Prompt:

You are an expert UI element locator. Given a GUI image and a user's element description, provide your reasoning process first, finally provide the coordinates of the specified element as a single point. For elements with area, return the center point.

Give your reasoning process first, then output the coordinate pair ranging from 0 to 1000 exactly in format:
(x,y)

3.2 Claude Sonnet 4.5

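Coordinate system: relative, 0–1
Image resize: 1280×720 (taken from the OSWorld repo)
Zoom-In: 50% crop ratio, upscaled to 1280×720
Key finding: the performance drop comes mainly from overly aggressive downsampling (to 1280×720); small elements are simply not visible to the model in the shrunken image. Zoom-In recovers most of the lost information. With this configuration the model reaches 70%+ on standard 1080p benchmarks.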
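A short worked calculation shows why downsampling hurts and why a 50% crop helps. The 3840×2160 screenshot and the 24-pixel icon are assumed values chosen for illustration; `size_after_resize` and `size_after_zoom` are hypothetical helpers.

```python
from typing import Tuple

Size = Tuple[float, float]  # (width, height) in pixels


def size_after_resize(elem: Size, src: Size, dst: Size) -> Size:
    """On-screen size of an element after the whole image is resized src -> dst."""
    return elem[0] * dst[0] / src[0], elem[1] * dst[1] / src[1]


def size_after_zoom(elem: Size, src: Size, crop_ratio: float, dst: Size) -> Size:
    """Element size when a crop of crop_ratio * src is resized to dst instead."""
    crop = (src[0] * crop_ratio, src[1] * crop_ratio)
    return size_after_resize(elem, crop, dst)


icon = (24.0, 24.0)
screen = (3840.0, 2160.0)
target = (1280.0, 720.0)

direct = size_after_resize(icon, screen, target)     # (8.0, 8.0)
zoomed = size_after_zoom(icon, screen, 0.5, target)  # (16.0, 16.0)
```

Under these assumptions the icon shrinks to 8×8 pixels in the directly resized image and is rendered at 16×16 pixels after the 50% crop, so the zoomed view gives the model four times as many pixels on the target.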

Prompt:

You are an expert UI element locator. Given a GUI image and a user's element description, provide your reasoning process first, finally provide the coordinates of the specified element as a single (x,y) point. For elements with area, return the center point.

Output the coordinate pair exactly in format:
(x,y)

3.3 Seed1.8

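Coordinate system: relative, 0–1000
Output format: <point>x y</point>
Zoom-In: 50% crop ratio
Notes: the most robust model in the pure E2E setting. The structured output format closely matches its training, and simply adding Zoom-In is enough to reach SOTA.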
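The `<point>x y</point>` output format listed above can be parsed with a small regular expression. The sketch below is one possible parser, not code from the article: it takes the last match because the reasoning text precedes the answer, accepts either a space or a comma between the numbers, and tolerates a closing tag written without the slash. `parse_point` is a hypothetical name.

```python
import re
from typing import Optional, Tuple

# Two numbers inside <point> tags; the closing tag may lack its slash.
_POINT_RE = re.compile(
    r"<point>\s*(-?\d+(?:\.\d+)?)[\s,]+(-?\d+(?:\.\d+)?)\s*</?point>"
)


def parse_point(text: str, width: float, height: float,
                scale: float = 1000.0) -> Optional[Tuple[float, float]]:
    """Return the predicted point in pixels, or None if no point was found."""
    matches = _POINT_RE.findall(text)
    if not matches:
        return None
    x_rel, y_rel = (float(v) for v in matches[-1])
    return x_rel / scale * width, y_rel / scale * height
```

For example, `parse_point("The save icon is top-left. <point>250 500</point>", 1920, 1080)` yields a pixel coordinate on a 1920×1080 screenshot, while free text without a tagged point yields `None`.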

Prompt:

You are an expert UI element locator. Given a GUI image and a user's element description, provide your reasoning process first, finally provide the coordinates of the specified element as a single <point>x y</point> point. For elements with area, return the center point.

Give your reasoning process first, then output the coordinate pair ranging from 0 to 1000 exactly in format:
<point>x y</point>

3.4 Kimi-K2.5

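Coordinate system: 0–1000
Zoom-In: similar to Seed1.8
Observation: the reasoning trace shows the model reflecting and visualizing, but it never performs an actual crop-and-zoom operation.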

Prompt (including the full tool descriptions):

You are Kimi, a professional and meticulous expert in information collection and organization.
You fully understand user needs, skillfully use various tools, and complete tasks with the highest efficiency.
# Task Description
After receiving users' questions, you need to fully understand their needs and think about and plan how to complete the tasks efficiently and quickly.
# Available Tools
To help you complete tasks better and faster, I have provided you with the following tools:
1. Search tool: You can use the search engine to retrieve information, supporting multiple queries in parallel.
2. Browser tools: You can visit web links (web pages, PDFs, etc.), get page content, and perform interactions such as clicking, inputting, finding, and scrolling.
3. Sub Agent tools:
- 'create_subagent': Create a new sub-agent with a unique name and clear, specific system prompt.
- 'assign_task': Delegate tasks to created sub-agents. Sub-agents can also use search and browser tools.
4. Other tools: Including code execution (IPython, Shell).
You should locate the UI element in screenshot by user's instruction.
Finally you should give the coordinate normalized in 0-1 of the UI element in format: (x,y)

3.5 MAI-UI-32B (in-house)

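Coordinate system: 0–1000
Image processing: smart resize, at most 6,335,600 pixels (about 4800 image tokens)
Zoom-In: same as Seed1.8
Highlight: with only 32B parameters it outperforms every closed-source large model in the evaluation.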
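A sketch of what a "smart resize" under a pixel budget might look like. The summary gives only the budget of 6,335,600 pixels; keeping the aspect ratio and rounding each side down to a multiple of 28 are assumptions borrowed from common vision-language preprocessing, not confirmed details of MAI-UI. `smart_resize` here is a hypothetical function.

```python
import math
from typing import Tuple


def smart_resize(width: int, height: int, max_pixels: int = 6_335_600,
                 factor: int = 28) -> Tuple[int, int]:
    """Shrink (width, height) to fit max_pixels, keeping the aspect ratio
    approximately and rounding each side down to a multiple of `factor`.
    Images already within the budget are only rounded, never enlarged."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = min(1.0, math.sqrt(max_pixels / (width * height)))
    new_w = max(factor, math.floor(width * scale / factor) * factor)
    new_h = max(factor, math.floor(height * scale / factor) * factor)
    return new_w, new_h
```

Rounding down keeps the result within the budget for any image whose sides are at least `factor` pixels. Since the model outputs relative coordinates in 0–1000, predictions can be mapped back to the original screenshot without knowing the resized dimensions.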

Prompt:

You are a GUI grounding agent.
## Task
Given a screenshot and the user's grounding instruction. Your task is to accurately locate a UI element based on the user's instructions.
First, you should carefully examine the screenshot and analyze the user's instructions, translate the user's instruction into a effective reasoning process, and then provide the final coordinate.
## Output Format
Return a json object with a reasoning process in <grounding_think></grounding_think> tags, a [x,y] format coordinate within <answer></answer> XML tags:
<grounding_think>...</grounding_think>
<answer>
{"coordinate": [x,y]}
</answer>
## Input instruction
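The output format requested by this prompt can be parsed as sketched below. This is an illustrative parser, not the authors' code: `parse_answer` is a hypothetical name, the 0–1000 scale follows the configuration listed for this model, and malformed outputs are mapped to `None` so the caller can count them as misses.

```python
import json
import re
from typing import Optional, Tuple

# Capture the JSON object between <answer> and </answer>.
_ANSWER_RE = re.compile(r"<answer>\s*(\{.*?\})\s*</answer>", re.DOTALL)


def parse_answer(text: str, width: float, height: float,
                 scale: float = 1000.0) -> Optional[Tuple[float, float]]:
    """Return the predicted point in pixels, or None if the output is malformed."""
    match = _ANSWER_RE.search(text)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(1))
        x_rel, y_rel = payload["coordinate"]
        x_rel, y_rel = float(x_rel), float(y_rel)
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None
    return x_rel / scale * width, y_rel / scale * height
```

The reasoning inside `<grounding_think>` is ignored on purpose; only the `<answer>` block determines the click.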

4. Deeper Insights

4.1 Free-form Thinking Barely Helps Grounding

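Intuitively, letting the model reason before it locates should improve accuracy. The measurements say otherwise: free-form reasoning brings almost no gain for GUI grounding and can even hurt. This agrees with the UI-Ins study, whose "Instruction-as-Reasoning" approach uses low-level, observation-centered instructions as the reasoning content and consistently outperforms free-form reasoning.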

When designing an agent prompt, give the reasoning a direction (structured and observation-based) instead of allowing unconstrained free thinking.

Table 5: Effect of free-form reasoning

4.2 The Zoom-In Tool Is Not a Free Lunch

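The crop ratio and the target resolution of Zoom-In interact strongly. For Gemini-3-Pro on ScreenSpot-Pro, the best configuration (1/4 crop + 1920×1080) scores 11.4 points higher than the worst one.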

Table 6: Gemini-3-Pro on ScreenSpot-Pro under different Zoom-In configurations

The variation stems from the extremely diverse resolutions in ScreenSpot-Pro, which include screenshots stitched from multiple monitors:

Table 7: Resolution distribution of ScreenSpot-Pro

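Specifically, a 5120×1440 screenshot is usually two 2560×1440 monitors side by side. Cropping from such a panorama and resizing the crop severely distorts the aspect ratio: buttons are stretched, text is deformed, and the spatial priors the model learned on standard aspect ratios break down. An overly large crop ratio (1/2) covers too wide a region to magnify anything effectively. On low-resolution screenshots, Zoom-In magnifies too much, amplifying visual noise and distorting the apparent size of UI elements. In addition, two-stage grounding takes roughly twice the inference time of a single pass.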
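The aspect-ratio distortion can be quantified with a few lines of arithmetic. The sketch assumes the crop ratio applies to width and height separately and that the crop is resized to the target without padding; `stretch_factors` is a hypothetical helper.

```python
from typing import Tuple


def stretch_factors(src_w: float, src_h: float, crop_ratio: float,
                    dst_w: float, dst_h: float) -> Tuple[float, float, float]:
    """Return (horizontal scale, vertical scale, vertical/horizontal ratio)
    when a crop of crop_ratio * (src_w, src_h) is resized to (dst_w, dst_h).
    A ratio of 1.0 means uniform scaling; anything else means distortion."""
    crop_w, crop_h = src_w * crop_ratio, src_h * crop_ratio
    sx, sy = dst_w / crop_w, dst_h / crop_h
    return sx, sy, sy / sx


dual = stretch_factors(5120, 1440, 0.25, 1920, 1080)    # (1.5, 3.0, 2.0)
single = stretch_factors(2560, 1440, 0.25, 1920, 1080)  # (3.0, 3.0, 1.0)
```

Under these assumptions, a 1/4 crop of the dual-monitor screenshot is 1280×360, and resizing it to 1920×1080 stretches content twice as much vertically as horizontally, whereas the same crop of a single 2560×1440 screen scales uniformly.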

Kimi-K2.5's IPython tool mode, which uses code execution for visual reflection, actually lowers performance in testing:

Table 4: Kimi-K2.5 E2E vs. IPython Tool

4.3 The Models Have Untapped Potential

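The authors test two test-time scaling strategies: Pass@k (take the best of k independent predictions) and GUI-RCPO (build a 50×50-pixel box around each of k predictions and use the center of their intersection as the final answer).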
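Both strategies can be sketched in a few lines. This is a simplified reading of the one-sentence description above, not the authors' implementation: each prediction votes for a 50×50 box, and the answer is the centroid of the region covered by the most boxes, which equals the center of the intersection when all boxes overlap. Averaging tied regions is a choice made here, and `rcpo_point` and `pass_at_k` are hypothetical names.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def rcpo_point(preds: List[Point], box_size: float = 50.0) -> Point:
    """Centroid of the region covered by the largest number of prediction boxes."""
    if not preds or box_size <= 0:
        raise ValueError("need at least one prediction and a positive box size")
    half = box_size / 2
    boxes = [(x - half, y - half, x + half, y + half) for x, y in preds]
    # Box edges split the plane into cells with a constant vote count.
    xs = sorted({b[0] for b in boxes} | {b[2] for b in boxes})
    ys = sorted({b[1] for b in boxes} | {b[3] for b in boxes})
    best_votes, best_area, sum_x, sum_y = 0, 0.0, 0.0, 0.0
    for x0, x1 in zip(xs, xs[1:]):
        for y0, y1 in zip(ys, ys[1:]):
            cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
            votes = sum(l <= cx <= r and t <= cy <= b for l, t, r, b in boxes)
            area = (x1 - x0) * (y1 - y0)
            if votes > best_votes:
                best_votes, best_area = votes, area
                sum_x, sum_y = cx * area, cy * area
            elif votes == best_votes:
                best_area += area
                sum_x += cx * area
                sum_y += cy * area
    return sum_x / best_area, sum_y / best_area


def pass_at_k(preds: List[Point], target: Box) -> bool:
    """Oracle metric: True if any of the k predictions lands in the target box."""
    left, top, right, bottom = target
    return any(left <= x <= right and top <= y <= bottom for x, y in preds)
```

With predictions at (100, 100), (110, 100), and an outlier at (500, 500), `rcpo_point` returns the center of the overlap between the first two boxes and ignores the outlier. Pass@k needs the ground-truth box and is therefore an evaluation-only oracle, while the voting scheme needs no labels and can run at inference time.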

Table 8: Test-time scaling results

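The results are encouraging: with RCPO, Kimi-K2.5 rises from 60.8 to 69.9 (a relative gain of about 15%) and Seed1.8 reaches 76.6 (nearly 10 points above its baseline). On the harder benchmark, RCPO consistently outperforms the Pass@k oracle.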

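Core conclusion: the models know the correct location in a distributional sense. A single prediction is noisy, but the noise is roughly centered on the right answer. Aggregating multiple spatial hypotheses filters out random variance and converges on the true location. The capability is latent; the challenge is extracting it efficiently.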

5. Key Takeaways

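The black box is the biggest bottleneck: the largest performance gap lies not between models but between published and unpublished evaluation protocols. Getting a single detail wrong, such as the coordinate system or Zoom-In, can cost more than 33 points. For developers building on these models, understanding the implementation details is a prerequisite, not an option.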

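Paradigm matters more than model size: the progression SoM → E2E → inference-phase optimization is the most important axis of improvement. Native pixel-to-coordinate mapping has decisively overtaken SoM, and a well-chosen Zoom-In configuration can nearly double accuracy on high-resolution screens. MAI-UI outperforms the closed-source large models with 32B parameters, which shows that paradigm and training strategy matter more than scale.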

Reasoning needs direction, not freedom: free-form reasoning does not help GUI grounding, whereas structured, observation-based reasoning does. This bears directly on agent prompt design.

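The accuracy-efficiency tradeoff is the next frontier: Zoom-In and test-time scaling show that the models have far more grounding capability than a single-pass evaluation reveals, but every extra inference step adds latency and cost. The key question for 2026 is not how to raise the ceiling but how to get there in a single step.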