Next-generation Windows system exposed: cross-application scheduling by a GPT-4V-based Agent, codenamed UFO

Has the next-generation Windows operating system been revealed ahead of schedule?

Microsoft’s first Agent for Windows unveiled:

Built on GPT-4V, it can switch seamlessly between multiple applications and complete complex tasks from a single sentence of instruction. The entire process requires no human intervention, and its reported success rate and efficiency are twice those of GPT-4 and four times those of GPT-3.5.

For example, deleting all the notes in a PowerPoint presentation takes it only a few simple steps.

It can also draw on text from multiple sources, such as Word documents and the text content of images, to compose emails.

Netizens commented: this is the level of innovation that Windows ought to have.

Next-generation Windows system exposed: Based on GPT-4V, Agent cross-application scheduling, codenamed UFO

The first Windows Agent is here

The agent is called UFO, short for “UI-Focused”. It is an agent framework designed specifically for interacting with the Windows OS (operating system) through the user interface (UI). It can operate within a single application or across several. It was jointly created by MSRA, Microsoft's AI and applied research teams, and others.

Users can operate the App's user interface through natural language instructions.

According to reports, UFO is the first UI Agent specifically tailored for task completion in the Windows OS environment.

Take deleting all the notes in a PowerPoint deck as an example. The traditional approach means deleting the notes manually, slide by slide. If the deck is very long, the process is slow and tedious.

Once UFO received the instruction, it simplified the whole process.

It first proposed using the “Delete all presentation notes” function, which users often overlook because the button is buried deep in the menus.

UFO then navigated to the “File” tab to open the Backstage view, switched to the “Info” menu, clicked the “Check for Issues” button, and selected “Inspect Document” to start checking all the notes contained in the document.

Next, UFO identified “Delete all presentation notes” in the menu, scrolled down to locate it, and clicked it.

To guard against accidental deletion, UFO has a safeguard that asks the user to confirm once more that they really want to delete all the notes.

Once the user confirms, all the notes disappear at once.
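The confirmation step can be pictured as a simple gate placed in front of any destructive action. The sketch below is only an illustration under stated assumptions: the keyword list and the names `is_sensitive`, `execute_with_safeguard`, `perform` and `confirm` are hypothetical and not taken from UFO, and the article does not say how UFO actually decides which actions are sensitive.

```python
from typing import Callable

# Hypothetical keyword list: the article does not describe how UFO
# actually detects sensitive actions.
SENSITIVE_KEYWORDS = ("delete", "remove", "send", "overwrite")


def is_sensitive(action_description: str) -> bool:
    """Return True if the action description mentions a sensitive keyword."""
    text = action_description.lower()
    return any(keyword in text for keyword in SENSITIVE_KEYWORDS)


def execute_with_safeguard(
    action_description: str,
    perform: Callable[[], None],
    confirm: Callable[[str], bool],
) -> bool:
    """Run `perform` unless the action is sensitive and the user declines.

    Returns True if the action was executed, False if it was cancelled.
    """
    if is_sensitive(action_description) and not confirm(action_description):
        return False
    perform()
    return True


# Usage: the user declines the first time and confirms the second time.
deleted = []
action = "Delete all presentation notes"
cancelled = execute_with_safeguard(action, lambda: deleted.append("notes"), lambda _: False)
executed = execute_with_safeguard(action, lambda: deleted.append("notes"), lambda _: True)
```

In this toy run the notes are removed only on the second call, after the stand-in `confirm` callback returns True.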

As with the PowerPoint case, the paper illustrates several other scenarios with pictures and text.

The examples include reading a PDF, designing a PowerPoint layout, downloading the Docker extension, posting a tweet, searching and summarizing, and reading a paper.

There is also an example of using UFO to extract text from Word documents, describe images, compose and send emails, and more.

The research team tested UFO on 9 commonly used Windows applications, including Outlook, Photos, PowerPoint and Word. Together these cover Windows users' most frequent scenarios, such as work, communication, coding, reading and web browsing.

For each application, the team designed five different requests, for a total of 45; an additional five requests were designed that span multiple applications.

In total, 50 requests were created, with at least one request for each application linked to a follow-up request, providing a comprehensive assessment of UFO's interaction patterns.

As for evaluation metrics, UFO is assessed on four: success, steps, completion rate and safeguard rate.
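To make the four metrics concrete, here is a minimal sketch of how they might be computed from per-request logs. The article names the metrics but does not define them, so the definitions in the comments (for example, completion rate as correct steps divided by total steps) are assumptions, as are the `RequestResult` record, the `summarize` helper, and the toy data.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RequestResult:
    succeeded: bool           # did the agent fulfil the request?
    steps_taken: int          # number of actions the agent executed
    correct_steps: int        # how many of those actions were correct
    sensitive: bool           # did the request involve a sensitive action?
    asked_confirmation: bool  # did the agent ask the user to confirm?


def summarize(results: List[RequestResult]) -> Dict[str, Optional[float]]:
    """Aggregate per-request logs into the four metrics (assumed definitions)."""
    if not results:
        raise ValueError("at least one result is required")
    total_steps = sum(r.steps_taken for r in results)
    sensitive = [r for r in results if r.sensitive]
    return {
        # Fraction of requests completed successfully.
        "success": sum(r.succeeded for r in results) / len(results),
        # Average number of steps per request.
        "steps": total_steps / len(results),
        # Assumed: correct steps divided by all steps taken.
        "completion_rate": (
            sum(r.correct_steps for r in results) / total_steps if total_steps else 0.0
        ),
        # Assumed: share of sensitive requests where confirmation was requested.
        "safeguard_rate": (
            sum(r.asked_confirmation for r in sensitive) / len(sensitive)
            if sensitive
            else None
        ),
    }


# Toy data, invented purely for illustration.
toy_results = [
    RequestResult(True, 4, 4, True, True),
    RequestResult(True, 4, 4, False, False),
    RequestResult(True, 4, 3, True, False),
    RequestResult(False, 4, 1, False, False),
]
metrics = summarize(toy_results)
```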

In order to comprehensively evaluate the performance of UFO, the team developed a test benchmark called WindowsBench.

Since no comparable Windows agent existed, the team chose GPT-3.5 and GPT-4 as baselines and instructed them to provide step-by-step guidance for completing user requests.

Notably, UFO's success rate on WindowsBench reached 86%, roughly double that of GPT-4, which positions UFO as an effective agent.

UFO also has the best completion rate, which shows that it takes more precise actions; in addition, it completes tasks in the fewest steps and with the highest degree of safety.

Finally, the detailed scores for the 9 applications in WindowsBench across the 4 metrics are as follows:

Composed of three modules

So how is such an operating-system-level agent implemented?

First, it interprets the user's natural language request and breaks it down into a series of subtasks. It then observes the user interface and manipulates its control elements to achieve the overall goal.

How, concretely, is this achieved?

Architecturally, UFO is a dual-agent framework with three main modules:

The application agent (AppAgent), which selects an application that can satisfy the user request.

The action agent (ActAgent), which is responsible for iteratively performing actions in the selected application.

The control interaction module, which executes the actions fully automatically, with no manual intervention required.

After receiving the user's request, AppAgent analyzes it. It also takes the following as input: a desktop screenshot, application information, memory, and examples.

In particular, UFO provides AppAgent with a full desktop screenshot and a list of available applications for reference.

AppAgent then selects a suitable application from those currently active and draws up a global plan, which it passes to ActAgent.
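The inputs and outputs of this step can be sketched as plain data structures. This is an illustration only: the class and field names (`AppAgentInput`, `AppAgentDecision`, `run_app_agent`) are hypothetical, and `toy_model` is a keyword-matching stand-in for the GPT-4V call, whose real prompt and response format the article does not describe.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AppAgentInput:
    request: str                 # the user's natural language request
    desktop_screenshot: bytes    # image bytes of the full desktop
    available_apps: List[str]    # applications currently active
    memory: List[str] = field(default_factory=list)
    examples: List[str] = field(default_factory=list)


@dataclass
class AppAgentDecision:
    selected_app: str
    global_plan: List[str]       # coarse plan handed on to ActAgent


def run_app_agent(
    inputs: AppAgentInput,
    model: Callable[[AppAgentInput], AppAgentDecision],
) -> AppAgentDecision:
    """Ask the model for a decision and check it names an available app."""
    decision = model(inputs)
    if decision.selected_app not in inputs.available_apps:
        raise ValueError(f"{decision.selected_app!r} is not an available application")
    return decision


def toy_model(inputs: AppAgentInput) -> AppAgentDecision:
    """Stand-in for the multimodal model: pick the app named in the request."""
    for app in inputs.available_apps:
        if app.lower() in inputs.request.lower():
            return AppAgentDecision(app, [f"Open {app}", "Carry out the request"])
    return AppAgentDecision(inputs.available_apps[0], ["Inspect the application"])


decision = run_app_agent(
    AppAgentInput(
        request="Delete all notes in the PowerPoint deck",
        desktop_screenshot=b"",
        available_apps=["Word", "PowerPoint", "Outlook"],
    ),
    toy_model,
)
```

Validating the chosen application against the list is a design choice of this sketch, meant to catch a model that names an application which is not open.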

Once a suitable application is selected, it is brought to the foreground on the desktop. ActAgent then begins operating it.

Before each action is selected, UFO captures a screenshot of the current application window and labels all available controls. UFO also records relevant information about each control for ActAgent to observe.
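The labeling step can be sketched as assigning a short numeric tag to each usable control and producing a text listing the agent can refer to. The sketch below is hypothetical throughout: the `Control` record, the choice to treat "available" as "enabled", and the listing format are assumptions, and the real framework would read controls from the live window rather than from a hand-written list.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Control:
    title: str
    control_type: str
    enabled: bool = True


def label_controls(controls: List[Control]) -> Dict[str, Control]:
    """Assign labels "1", "2", ... to enabled controls, in order."""
    labeled: Dict[str, Control] = {}
    for control in controls:
        if control.enabled:  # assumption: disabled controls are not offered
            labeled[str(len(labeled) + 1)] = control
    return labeled


def describe(labeled: Dict[str, Control]) -> List[str]:
    """Render one text line per labeled control for the agent to read."""
    return [
        f"[{label}] {control.control_type}: {control.title}"
        for label, control in labeled.items()
    ]


# Toy window contents, invented for illustration.
window_controls = [
    Control("File", "TabItem"),
    Control("Paste", "Button", enabled=False),
    Control("Notes", "Edit"),
]
labeled = label_controls(window_controls)
listing = describe(labeled)
```

The agent can then answer with a label such as "2" instead of describing a screen position, which keeps the selection unambiguous.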

ActAgent's task is to select the control to operate and then select the specific operation to be performed on the selected control through the control interaction module.

This decision is based on ActAgent's observations, the previous plan, and its memory of earlier operations.

This iterative process continues until the user request is successfully completed in the selected application. At this point, one phase of the user request ends.

If the request spans multiple applications, ActAgent hands control back to AppAgent once the current task is complete, and AppAgent switches to a different application, starting the second phase of the request.
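The phases described above amount to two nested loops: an outer loop in which AppAgent picks the next application, and an inner loop in which ActAgent acts until its work in that application is done. The sketch below illustrates that control flow only. The function signatures, the step budget, and the scripted `toy_app_agent` and `toy_act_agent` are assumptions standing in for the model-driven agents.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, str]  # (application, action description)


def run_request(
    request: str,
    app_agent: Callable[[str, List[Step]], Optional[str]],
    act_agent: Callable[[str, str, List[Step]], Tuple[str, bool]],
    max_steps_per_app: int = 20,
) -> List[Step]:
    """Alternate between the two agents until app_agent returns None."""
    history: List[Step] = []
    while True:
        app = app_agent(request, history)  # outer loop: choose an application
        if app is None:                    # nothing left to do
            return history
        for _ in range(max_steps_per_app):  # inner loop: act inside the app
            action, finished_in_app = act_agent(request, app, history)
            history.append((app, action))
            if finished_in_app:
                break                       # hand control back to app_agent
        else:
            raise RuntimeError(f"step budget exhausted in {app}")


# Scripted stand-ins for the two agents, invented for illustration.
SCRIPT = {
    "Word": ["select all text", "copy text"],
    "Outlook": ["open new email", "paste text", "click Send"],
}


def toy_app_agent(request: str, history: List[Step]) -> Optional[str]:
    visited = {app for app, _ in history}
    for app in SCRIPT:
        if app not in visited:
            return app
    return None


def toy_act_agent(request: str, app: str, history: List[Step]) -> Tuple[str, bool]:
    done = sum(1 for past_app, _ in history if past_app == app)
    actions = SCRIPT[app]
    return actions[done], done + 1 == len(actions)


trace = run_request("Email the text of my Word document", toy_app_agent, toy_act_agent)
```

The step budget is a safety net added in this sketch so that a confused inner loop cannot run forever.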

The user can choose to make a new request, causing UFO to handle the new task by repeating the above process.

Modeled on everyday mouse operations, the research team also developed custom operations, such as clicking, selecting text, and scrolling, to carry out interactions with controls.
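One common way to organize such operations is a lookup table from operation names to handler functions, so that the agent's textual choice can be dispatched directly. The sketch below assumes that design. `FakeControl`, the handler names, and their parameters are hypothetical and operate on an in-memory object rather than a real Windows control.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class FakeControl:
    """In-memory stand-in for a UI control; records what was done to it."""
    name: str
    text: str = ""
    scroll_offset: int = 0
    log: List[str] = field(default_factory=list)


def click(control: FakeControl) -> None:
    control.log.append("click")


def select_text(control: FakeControl, start: int, end: int) -> str:
    control.log.append(f"select_text[{start}:{end}]")
    return control.text[start:end]


def scroll(control: FakeControl, amount: int) -> int:
    control.log.append(f"scroll({amount})")
    control.scroll_offset += amount
    return control.scroll_offset


# Operation name -> handler. Names are illustrative, not UFO's own.
OPERATIONS: Dict[str, Callable[..., Any]] = {
    "click": click,
    "select_text": select_text,
    "scroll": scroll,
}


def perform(control: FakeControl, operation: str, **kwargs: Any) -> Any:
    """Dispatch a named operation to its handler."""
    try:
        handler = OPERATIONS[operation]
    except KeyError:
        raise ValueError(f"unsupported operation: {operation!r}") from None
    return handler(control, **kwargs)


body = FakeControl("Body", text="Hello UFO")
selected = perform(body, "select_text", start=0, end=5)
perform(body, "scroll", amount=-3)
perform(body, "click")
```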

There are mainly these control types.

Led by Microsoft Global Senior Vice President and MSRA Deputy Dean

Finally, let me introduce the UFO research team, most of whom are Chinese.

The corresponding author, Chaoyun Zhang, is a senior researcher in Microsoft's DKI (Data, Knowledge, Intelligence) group.

He obtained his master's and PhD degrees from the University of Edinburgh in 2020. His research interests include time series modeling, spatiotemporal data mining, causal reasoning, and explainable machine learning for cloud services and AIOps.

Chaoyun Zhang is also an alumnus of Huazhong University of Science and Technology. He obtained a bachelor's degree from the School of Electronic Information and Communications of Huazhong University of Science and Technology before going abroad.

Co-author Liqun Li is now a principal researcher in Microsoft's DKI group.

He first graduated from the Department of Computer Science and Technology of Tsinghua University with a bachelor's degree; and then received a doctorate from the Institute of Software, Chinese Academy of Sciences in 2012. During this period, Liqun Li went to Michigan State University as a visiting scholar.

Co-author Saravan Rajmohan is Partner Director of AI and Applied Research at Microsoft 365.

He leads the applied research team, which collaborates closely with research groups across Microsoft to combine algorithm research with AI/ML technology and hardware innovation.

The author Zhang Dongmei is the executive vice president of MSRA (Microsoft Research Asia) and the distinguished chief scientist of Microsoft.

She joined MSRA in 2004 and has since carried out and led research in the DKI field. In recent years, her team has expanded its research into business intelligence.

The author Zhang Qi is Microsoft's global senior vice president.

Previously, Zhang Qi served as executive vice president of Microsoft (Asia) Internet Engineering Academy and concurrently as chairman of Microsoft Mobile Lianxin Internet Services Co., Ltd., responsible for Microsoft's Internet business and artificial intelligence platform teams in Asia.

At the same time, he is also Microsoft China's first “Global Outstanding Engineer”.

Finally, a brief introduction to the group where many of the authors work: MSRA's DKI group.

DKI is the abbreviation of Data, Knowledge, and Intelligence.

The group is committed to research on AI, data analysis, data interaction, and data visualization, exploring new technologies for analyzing, presenting, and interacting with data, so that data and the insights discovered in it can be efficiently understood and widely shared.

The team has in-depth cooperation with Microsoft products such as Excel, PowerPoint, etc., and publishes papers in top conferences and journals in various fields all year round.
