Microsoft Research proposes Deep Integration of Computer Use Agents in AgentOS for Windows

AgentOS is a Step Towards Non-Intrusive Unified Desktop Automation Across Application Environments

UFO2: The Desktop AgentOS

Chaoyun Zhang, He Huang, Chiming Ni, Jian Mu, Si Qin, Shilin He, Lu Wang, Fangkai Yang, Pu Zhao, Chao Du, Liqun Li, Yu Kang, Zhao Jiang, Suzhen Zheng, Rujia Wang, Jiaxu Qian, Minghua Ma, Jian-Guang Lou, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang


Automating desktop applications has long been a goal for improving productivity, traditionally pursued through rigid and fragile script-based Robotic Process Automation (RPA) systems.

The emergence of Computer-Using Agents (CUAs), powered by multimodal large language models (LLMs), offered a more flexible alternative by interpreting natural language instructions and adapting to dynamic graphical user interfaces (GUIs).

However, these early CUAs often suffered from limited integration with operating systems (OSs), reliance on noisy screenshot-based interactions, and disruptive execution on the user's primary desktop.

This research from Microsoft introduces UFO2, a novel multiagent AgentOS for Windows desktops designed to overcome these limitations by achieving deep OS integration and providing a practical, system-level solution for robust and non-disruptive desktop automation.

Extensive documentation and open source code are available from Microsoft.

Documentation https://microsoft.github.io/UFO/

Github Repository https://github.com/microsoft/UFO/

Key Takeaways
UFO2 presents a multiagent AgentOS architecture featuring a centralized HostAgent for task coordination and specialized AppAgents equipped with native APIs for specific applications, leading to more robust and accurate task execution.

A hybrid control detection pipeline that combines Windows UI Automation (UIA) with vision-based parsing enhances the agent's ability to interact with diverse interface styles.

A unified GUI–API action layer allows agents to seamlessly switch between traditional GUI actions and application-native API calls, improving efficiency and reducing brittleness.

Speculative multi-action planning reduces the overhead of LLM inference by predicting and validating sequences of actions in a single step.

A Picture-in-Picture (PiP) interface provides an isolated virtual desktop environment for automation, enabling concurrent user and agent activity without interference.

Evaluations on more than 20 real-world Windows applications demonstrate that UFO2 significantly improves robustness and execution accuracy compared to prior CUAs, highlighting the benefits of deep OS integration.

UFO2's architecture supports continuous knowledge integration from documentation and past execution logs, allowing agents to improve autonomously over time without retraining.

Overview

The paper addresses the increasing need for robust and scalable desktop automation in the face of complex and evolving software environments.

Traditional RPA systems, which depend on predefined scripts based on GUI cues, are inherently fragile and require significant manual maintenance as user interfaces change. Recent advancements in LLMs have paved the way for CUAs that can interpret natural language and perform adaptive actions on GUIs without fixed scripting.

However, existing CUAs lack deep integration with the underlying OS, relying on superficial interactions such as screenshots and simulated mouse/keyboard inputs. This approach leads to inefficiencies, increased cognitive load for LLMs, and a disruptive user experience as the agent takes control of the main desktop.

UFO2 is introduced as a solution that reimagines desktop automation as a fundamental OS abstraction.

Instead of operating as a layer on top of the GUI, UFO2 is designed as a deeply integrated, multiagent execution environment that embeds OS capabilities, application-specific introspection, and domain-aware planning into its core automation loop.

As illustrated in Figure 1, UFO2 contrasts with existing CUAs by operating as a system-level AgentOS rather than an application running on top of the standard desktop environment.

The architecture of UFO2 revolves around a centralized HostAgent that interprets user instructions, decomposes them into subtasks, and coordinates specialized AppAgents, each tailored for a specific Windows application.

These AppAgents possess native APIs, domain-specific knowledge, execution history and a unified GUI–API action layer. This modular design promotes robustness and extensibility, allowing for automation across multiple concurrent applications.
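The coordination pattern described above can be sketched in a few lines. This is only an illustration of the HostAgent/AppAgent split: the class shapes, method names, and the `toy_planner` stand-in for the LLM decomposition step are all assumptions, not the UFO2 API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration only; not the UFO2 API.
Plan = List[Tuple[str, str]]  # (application name, subtask description)


@dataclass
class AppAgent:
    """Stand-in for an agent specialised for one application."""
    app_name: str
    history: List[str] = field(default_factory=list)

    def execute(self, subtask: str) -> str:
        # A real agent would act on the application; here we only record the subtask.
        self.history.append(subtask)
        return f"[{self.app_name}] done: {subtask}"


class HostAgent:
    """Decomposes a request and routes each subtask to the matching AppAgent."""

    def __init__(self, planner: Callable[[str], Plan]) -> None:
        self._planner = planner
        self._agents: Dict[str, AppAgent] = {}

    def register(self, agent: AppAgent) -> None:
        self._agents[agent.app_name] = agent

    def run(self, request: str) -> List[str]:
        results = []
        for app_name, subtask in self._planner(request):
            agent = self._agents.get(app_name)
            if agent is None:
                raise KeyError(f"no AppAgent registered for {app_name!r}")
            results.append(agent.execute(subtask))
        return results


def toy_planner(request: str) -> Plan:
    # Fixed plan standing in for the LLM-driven decomposition.
    return [
        ("Excel", "export the sales table"),
        ("Outlook", "email the exported file"),
    ]


host = HostAgent(toy_planner)
host.register(AppAgent("Excel"))
host.register(AppAgent("Outlook"))
results = host.run("Send the sales table by email")
```

Keeping the registry keyed by application name is one simple way to let new agents be added without touching the host's routing logic.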

To ensure reliable interaction with diverse UIs, UFO2 employs a hybrid control detection pipeline that fuses Windows UIA APIs with vision-based parsing.
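One plausible way to fuse the two detection sources is to keep every UIA control and add only those vision detections that do not substantially overlap one. The sketch below uses intersection-over-union (IoU) for the overlap test. The `Box` type, the function names, and the 0.1 threshold are assumptions made for illustration, not values taken from the paper.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box in screen pixels (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float
    source: str  # "uia" or "vision"


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes; 0.0 when they do not overlap."""
    inter_w = max(0.0, min(a.right, b.right) - max(a.left, b.left))
    inter_h = max(0.0, min(a.bottom, b.bottom) - max(a.top, b.top))
    inter = inter_w * inter_h
    area_a = (a.right - a.left) * (a.bottom - a.top)
    area_b = (b.right - b.left) * (b.bottom - b.top)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fuse_controls(uia: List[Box], vision: List[Box],
                  iou_threshold: float = 0.1) -> List[Box]:
    """Keep all UIA controls; add vision boxes that overlap no UIA control.

    The threshold is an illustrative assumption.
    """
    fused = list(uia)
    for v in vision:
        if all(iou(v, u) < iou_threshold for u in uia):
            fused.append(v)
    return fused


uia_controls = [Box(0, 0, 100, 40, "uia")]
vision_controls = [
    Box(5, 5, 95, 38, "vision"),       # duplicates the UIA button, dropped
    Box(200, 200, 240, 240, "vision"),  # custom-drawn control UIA missed, kept
]
fused = fuse_controls(uia_controls, vision_controls)
```

Preferring the UIA entry when both sources see the same control keeps the richer accessibility metadata, while the vision detections fill in controls that expose no accessibility information.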

Speculative multi-action planning optimizes runtime efficiency by reducing the overhead of per-step LLM inference.
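The idea can be illustrated as follows: a single model call proposes a batch of actions, and each action is checked against the live UI state before it runs, with no further model call. Execution stops at the first action that fails the check so the planner can re-plan from there. The function name, the `(control, operation)` action shape, and the dictionary-based UI snapshot are hypothetical simplifications.

```python
from typing import Callable, Dict, List, Tuple

Action = Tuple[str, str]  # (control name, operation); hypothetical shape


def run_speculative(
    batch: List[Action],
    snapshot_ui: Callable[[], Dict[str, bool]],
    perform: Callable[[str, str], None],
) -> List[Action]:
    """Execute a predicted batch, validating each action against live UI state.

    Returns the prefix of the batch that was actually executed.
    """
    executed: List[Action] = []
    for control, operation in batch:
        ui = snapshot_ui()  # cheap state check, no model inference
        if not ui.get(control, False):
            break  # prediction no longer holds; hand back to the planner
        perform(control, operation)
        executed.append((control, operation))
    return executed


# Toy UI: "Confirm" is disabled, so the speculative batch is cut short there.
ui_state = {"File": True, "Save As": True, "Confirm": False}
log: List[str] = []
batch = [("File", "click"), ("Save As", "click"),
         ("Confirm", "click"), ("Close", "click")]

executed = run_speculative(
    batch,
    snapshot_ui=lambda: dict(ui_state),
    perform=lambda control, op: log.append(f"{op} {control}"),
)
remaining = batch[len(executed):]
```

In this toy run two actions are completed from one planning step, and the remaining two would be handed back for re-planning.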

Finally, the Picture-in-Picture (PiP) interface (Figure 12) creates an isolated virtual desktop, enabling agents and users to work concurrently without disrupting each other.

Why it’s Important

The development of UFO2 addresses critical limitations hindering the practical deployment of CUAs for real-world desktop automation. It has implications for enhancing workforce productivity, streamlining complex tasks, and potentially transforming how users interact with their desktop environments.

By implementing deep OS integration, UFO2 unlocks a more scalable and reliable path towards user-aligned automation.

UFO2's innovations, such as the hybrid control detection, overcome the challenges posed by diverse and non-standard UI elements, potentially allowing for broader application support and easier maintenance.

The unified GUI–API action layer provides a more efficient and robust means of interacting with applications, reducing dependence on brittle GUI simulations.
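A minimal sketch of such a layer, assuming a simple "prefer the API, fall back to the GUI" policy: each intent may have an API handler, a GUI handler, or both. The `ActionLayer` class, the intent names, and the handlers are hypothetical and only illustrate the dispatch pattern.

```python
from typing import Any, Callable, Dict, Optional

Handler = Callable[..., Any]


class ActionLayer:
    """Dispatches an intent to a native API handler, falling back to a GUI handler."""

    def __init__(self) -> None:
        self._api: Dict[str, Handler] = {}
        self._gui: Dict[str, Handler] = {}

    def register(self, intent: str, *, api: Optional[Handler] = None,
                 gui: Optional[Handler] = None) -> None:
        if api is not None:
            self._api[intent] = api
        if gui is not None:
            self._gui[intent] = gui

    def perform(self, intent: str, **kwargs: Any) -> Any:
        api = self._api.get(intent)
        if api is not None:
            try:
                return api(**kwargs)
            except RuntimeError:
                pass  # API call failed; fall through to the GUI path
        gui = self._gui.get(intent)
        if gui is None:
            raise LookupError(f"no handler available for intent {intent!r}")
        return gui(**kwargs)


def failing_save_api(path: str) -> str:
    # Simulates a native call that is unavailable at runtime.
    raise RuntimeError("native call failed")


layer = ActionLayer()
layer.register("select_range", api=lambda cells: f"api:select:{cells}")
layer.register("save", api=failing_save_api, gui=lambda path: f"gui:save:{path}")
layer.register("bold", gui=lambda: "gui:bold")

selected = layer.perform("select_range", cells="A1:B2")
saved = layer.perform("save", path="report.xlsx")
bolded = layer.perform("bold")
```

The caller names an intent and never needs to know which path was taken, which is what lets a planner mix both kinds of action in one plan.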

The speculative multi-action execution directly tackles the latency issues associated with LLM-driven agents, making them more responsive.

The concept of "Everything-as-an-AppAgent" highlights the potential for interoperability and extensibility by allowing new automation tools to be seamlessly integrated into the UFO2 framework.

Perhaps most importantly for user experience, the PiP interface resolves the disruptive nature of existing CUAs, enabling seamless multitasking and potentially increasing user trust and adoption.

From a broader perspective, UFO2 contributes to the vision of elevating automation to a system primitive. By treating automation as a deeply integrated OS abstraction, UFO2 paves the way for more programmable, composable, and robust desktop workflows.

Additionally, the modular AgentOS architecture of UFO2 could serve as a blueprint for future intelligent systems that require deep integration with specific platforms and applications.

Summary of Results

The researchers evaluated UFO2 on more than 20 real-world Windows applications to assess its performance, efficiency, and robustness. They compared UFO2 with several state-of-the-art CUAs, including UFO, NAVI, OmniAgent, Agent S, and Operator, using two established Windows-centric automation benchmarks: Windows Agent Arena (WAA) and OSWorld-W.

The primary evaluation metrics were Success Rate (SR), defined as the percentage of successfully completed tasks, and Average Completion Steps (ACS), measuring the average number of LLM-involved action inference steps per task.
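Both metrics are simple aggregates over per-task records. The sketch below assumes a hypothetical `TaskRun` record and averages ACS over whatever runs are passed in; whether the paper averages over all tasks or only over successful ones is not stated here, so callers should filter as needed. The sample data is invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskRun:
    """One benchmark task attempt (hypothetical record type)."""
    succeeded: bool
    llm_steps: int  # number of LLM-involved action inference steps


def success_rate(runs: List[TaskRun]) -> float:
    """SR: percentage of tasks completed successfully."""
    if not runs:
        raise ValueError("no runs to evaluate")
    return 100.0 * sum(r.succeeded for r in runs) / len(runs)


def average_completion_steps(runs: List[TaskRun]) -> float:
    """ACS: mean number of LLM inference steps over the given runs."""
    if not runs:
        raise ValueError("no runs to evaluate")
    return sum(r.llm_steps for r in runs) / len(runs)


# Invented sample data, not results from the paper.
runs = [TaskRun(True, 4), TaskRun(False, 10), TaskRun(True, 7), TaskRun(False, 3)]
sr = success_rate(runs)
acs = average_completion_steps(runs)
acs_successful = average_completion_steps([r for r in runs if r.succeeded])
```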

Table 1 summarizes the success rates of the evaluated agents on the two benchmarks.

Table 1. Comparison of success rates (SR) across agents on WAA and OSWorld-W benchmarks.

Agent	Model	WAA	OSWorld-W
UFO	GPT-4o	19.5%	12.2%
NAVI	GPT-4o	13.3%	10.2%
OmniAgent	GPT-4o	19.5%	8.2%
Agent S	GPT-4o	18.2%	12.2%
Operator	computer-use	20.8%	14.3%
UFO2-base	GPT-4o	23.4%	16.3%
UFO2-base	o1	25.3%	16.3%
UFO2	GPT-4o	27.9%	28.6%
UFO2	o1	30.5%	32.7%

The results demonstrate that UFO2 outperforms existing CUAs in terms of success rate. Even the base version of UFO2 (UFO2-base), which only uses UIA detection and GUI-based actions, shows improvements over the baselines.

The full version of UFO2, incorporating its key design innovations, achieves even higher success rates: with GPT-4o it exceeds the best baseline (Operator) by 7.1 percentage points on WAA (27.9% vs. 20.8%) and by 14.3 percentage points on OSWorld-W (28.6% vs. 14.3%). Using the stronger o1 model further improves UFO2's performance.
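The margins over Operator are differences between rows of Table 1, so they are absolute gains in success rate rather than relative improvements. The short calculation below reproduces them from the GPT-4o rows and, for comparison, also computes the relative gains. The variable names are illustrative.

```python
# Success rates (%) copied from Table 1.
operator = {"WAA": 20.8, "OSWorld-W": 14.3}
ufo2_gpt4o = {"WAA": 27.9, "OSWorld-W": 28.6}

# Absolute gain: difference of the two success rates, in percentage points.
absolute_gain = {
    bench: round(ufo2_gpt4o[bench] - operator[bench], 1) for bench in operator
}

# Relative gain: the same difference expressed as a share of the baseline rate.
relative_gain = {
    bench: round(100.0 * (ufo2_gpt4o[bench] - operator[bench]) / operator[bench], 1)
    for bench in operator
}
```

On OSWorld-W the absolute gain equals Operator's own success rate, so UFO2 with GPT-4o doubles the baseline there.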

Table 2 provides a breakdown of success rates by application type, revealing that UFO2 shows particularly strong performance in web browsers and coding environments.

Figure 19 shows the error analysis of UFO2-base, highlighting that control detection failures were a significant issue on WAA, while plan errors were more prevalent on OSWorld-W. This analysis motivated the development and evaluation of the hybrid control detection and continuous knowledge integration components.

Table 3 demonstrates the effectiveness of the hybrid control detection mechanism, showing that it consistently outperforms UIA-only or OmniParser-only approaches by increasing success rates and recovering previously failed cases.

Table 5 and Figure 21 illustrate the benefits of the unified GUI + API action layer, showing improved success rates and reduced completion steps by leveraging application-native APIs.

Table 5. Performance comparison of GUI-only vs. GUI + API actions.

Action	Model	SR	PRR	ERR	CRR	ACS
GUI-only	GPT-4o	16.3%	-	-	-	13.8
GUI+API	GPT-4o	22.4%	5.9%	14.3%	25.0%	12.9
GUI-only	o1	16.3%	-	-	-	16.0
GUI+API	o1	24.5%	17.7%	0.0%	12.5%	6.6

Table 6 demonstrates the impact of continuous knowledge integration, where the retrieval of help documents and past execution logs leads to noticeable improvements in success rates by reducing planning errors.
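The retrieval step can be pictured as ranking stored snippets (help documents and past execution logs) against the current task and placing the best matches in the prompt. The sketch below uses plain word overlap as a deliberately simple stand-in for whatever retrieval method UFO2 actually uses. The function names and the sample knowledge entries are invented for illustration.

```python
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercased whitespace tokens; a crude stand-in for a real text encoder."""
    return set(text.lower().split())


def retrieve(query: str, knowledge: List[str], top_k: int = 1) -> List[str]:
    """Return the top_k snippets sharing the most words with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        knowledge,
        key=lambda snippet: len(query_tokens & tokenize(snippet)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query: str, snippets: List[str]) -> str:
    """Prepend retrieved knowledge to the task so the planner can use it."""
    context = "\n".join(f"- {s}" for s in snippets)
    return f"Relevant knowledge:\n{context}\n\nTask: {query}"


# Invented sample knowledge base mixing help text and a past execution log.
knowledge = [
    "To freeze panes in Excel open the View tab and choose Freeze Panes",
    "Past run: saved the document as PDF via File then Export",
    "To insert a chart select the data and open the Insert tab",
]

query = "freeze the top row panes in excel"
retrieved = retrieve(query, knowledge)
prompt = build_prompt(query, retrieved)
```

Because new snippets are simply appended to the knowledge list, this kind of setup can improve over time without retraining the underlying model.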

Table 7 and Figure 22 show that speculative multi-action execution maintains a comparable success rate while significantly reducing the average completion steps, leading to greater efficiency.

Figure 23 demonstrates the "Everything-as-an-AppAgent" capability, showing that orchestrating Operator as a single AppAgent within the UFO2 framework leads to improved performance compared to running Operator independently.

Table 8 provides a step count analysis, revealing that the fully integrated UFO2 configuration consistently reduces the average number of steps required for successful task completion.

Figure 24 breaks down the average latency per execution step, highlighting that LLM inference is the dominant factor, but the overhead of UFO2's integrated components is relatively small.

Figure 25 compares the performance of UFO2 and UFO2-base across different LLMs, suggesting that models with built-in reasoning tend to achieve higher success rates.

Conclusion

This research presents UFO2, a deeply integrated, multiagent AgentOS for Windows that marks a significant advance in desktop automation.

By addressing the limitations of prior CUAs through innovations like hybrid control detection, a unified GUI–API action layer, speculative multi-action execution, continuous knowledge integration, and the non-disruptive PiP interface, UFO2 demonstrates substantial improvements in robustness, accuracy, and scalability across a wide range of real-world applications.

The evaluation results convincingly show that deep OS-level integration is crucial for achieving reliable and efficient desktop automation, with UFO2 outperforming state-of-the-art CUAs even when using general-purpose LLMs.

The modular architecture and the "Everything-as-an-AppAgent" concept further highlight the extensibility and potential of the UFO2 framework.

Future work will focus on further reducing latency, bridging the gap with human-level performance, and exploring generalization across other operating systems.

Overall, UFO2 represents a paradigm shift from GUI scripting to structured, programmable application control, paving the way for more practical and user-centric desktop automation solutions.


Publication Title: UFO2: The Desktop AgentOS

DOI: 10.48550/arXiv.2504.14603

Authors:

Organizations:

Microsoft, Peking University, ZJU-UIUC Institute, Nanjing University

Research Categories:

Computer Science Artificial Intelligence

Preprint Date: 2025-04-20

Number of Pages: 24


Joshua Berkowitz April 25, 2025
