AutoDev: Microsoft's AI programmer debuts on a "996" work schedule, coding independently and outperforming plain GPT-4 by 30%

New Wisdom Report

Editor: Taozi Run


[Introduction from New Wisdom] After the birth of Devin, the world's first AI programmer, coders panicked. Unexpectedly, Microsoft has also created an AI programmer, AutoDev, which can independently generate and execute code, among other tasks. Netizens exclaimed that AI coding is developing too fast.

The emergence of Devin, the world's first AI programmer, may become an important milestone in the history of software and AI development. It has mastered full-stack skills: not only can it write code, debug, and train models, it can also take on jobs on Upwork, the largest freelancing marketplace in the United States.

For a time, netizens exclaimed, "Are programmers finished?" Even people who have just started a computer science degree are worried about what "10x AI engineers" will mean for their future jobs.


Beyond star startups like Cognition AI, major US tech companies have long been looking for ways to use AI agents to cut costs and boost efficiency. On the same day, March 14, a Microsoft team released its own "Microsoft AI programmer": AutoDev.

Paper address: https://arxiv.org/pdf/2403.08299.pdf

Unlike Devin, with its single-minded pursuit of efficiency and results, AutoDev is designed to autonomously plan and execute complex software engineering tasks while preserving privacy and security inside Docker environments.

Prior to this, Microsoft had its flagship product GitHub Copilot to help developers complete software development.

However, some AI tools, including GitHub Copilot, do not take full advantage of all the potential functions in the IDE, such as building, testing, executing code, git operations, etc.

Driven by a chat interface, they focus mainly on suggesting code snippets and performing file operations. AutoDev was created to fill this gap.

Users can define complex software engineering goals, and AutoDev assigns these goals to autonomous AI agents to achieve.

These AI agents can then perform various operations on the code base, including file editing, retrieval, build processes, execution, testing, and git operations.

What's more, they have access to files, compiler output, build and test logs, static analysis tools, and more.

In the HumanEval test, AutoDev achieved excellent results of 91.5% and 87.8% Pass@1 in code generation and test generation respectively.

Netizens remarked that AI coding is developing too fast: in 2021, GitHub Copilot could solve 28.8% of HumanEval problems; by 2024, AutoDev solves 91.5% of them.

AutoDev completes tasks autonomously without human intervention

The AutoDev workflow is shown in the figure below. The user defines a goal, such as “testing a specific method”.

The AI agent writes the test to a new file and initiates the test execution command, all within a secure evaluation environment.

The output of the test execution, including failure logs, is then merged into the conversation.

The AI agent analyzes these outputs, triggers retrieval commands, merges the retrieved information by editing files, and restarts test execution.

Finally, the Eval environment provides feedback on whether the test execution was successful and how well the user's goals were accomplished.

The entire process is coordinated autonomously by AutoDev, requiring no developer intervention other than setting initial goals.

In contrast, with an existing AI coding assistant integrated into an IDE, a developer must manually run the tests (for example with pytest), paste the failure logs into the AI chat interface, possibly identify additional context to include, and repeat these verification steps after the AI generates modified code to make sure the tests pass.

It is worth mentioning that AutoDev draws inspiration from earlier research on AI agents, such as AutoGen, which orchestrates language-model workflows and conversations between multiple agents.

AutoDev extends AutoGen with capabilities that go beyond conversation management and enable agents to interact directly with code repositories to automate commands and operations.

Similarly, AutoDev draws on Auto-GPT, an open-source AI agent for autonomous task execution, and supports complex software engineering tasks by providing code- and IDE-specific functionality on top of it.

AutoDev architecture

The above figure is a simple diagram of the AutoDev architecture.

AutoDev mainly consists of 4 functional modules:

- Conversation Manager: tracks and manages the conversation between the user and the agents;

- Tools Library: provides agents with various code- and IDE-related tools;

- Agents Scheduler: schedules the various agents;

- Evaluation Environment: executes the proposed operations.

Each functional module is described in detail below.

Rules, actions and goal configuration

Users configure rules and actions through yaml files to initiate processes.

These files define the available commands (actions) that the AI agents can perform.

Users can tailor AutoDev to their specific needs by enabling or disabling particular commands, relying either on the default settings or on fine-grained permissions.

The purpose of the configuration step is to achieve precise control over the capabilities of the AI ​​agent.

At this stage, users can define the number and behavior of AI agents, assigning specific responsibilities, permissions, and available actions.

For example, users can define a “developer” agent and a “reviewer” agent and have them work together to achieve a goal.

Based on rules and action configurations, users can specify software engineering tasks or processes to be completed by AutoDev.

For example, users can request that test cases be generated and that they be syntactically correct and error-free (this involves editing files, running test suites, and invoking syntax checkers and error-finding tools); a sketch of such a configuration follows.
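As a rough illustration of this configuration step, the sketch below shows what such a rules/actions file might look like and how it could be loaded. The field names (goal, agents, allowed_actions, git permissions) are hypothetical; the paper does not publish its exact schema.

```python
# Hypothetical sketch of an AutoDev-style rules/actions configuration.
# The field names and values are illustrative, not the paper's actual schema.
import yaml  # requires PyYAML

CONFIG_TEXT = """
goal: "Generate pytest tests for the focal method and make sure they pass"
max_iterations: 20
agents:
  - name: developer
    allowed_actions: [write, edit, retrieve, build, test, syntax, stop]
  - name: reviewer
    allowed_actions: [retrieve, test, talk, stop]
git:
  allow_commit: true   # local commits only
  allow_push: false    # pushing to a remote is disabled
"""

config = yaml.safe_load(CONFIG_TEXT)
print([agent["name"] for agent in config["agents"]])  # ['developer', 'reviewer']
```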

Conversation Manager

The conversation manager is responsible for initializing the conversation history and plays a key role in the high-level management of the ongoing conversation. It decides when to interrupt the process and ensures seamless communication between the user, the AI agents, and the rest of the system.

The conversation objects it maintains mainly contain messages from the agents and operation results from the evaluation environment (eval environment).

Parser

The parser interprets the response generated by the agent, extracting instructions and parameters in a predetermined format. It ensures that the command is well-formed and verifies the number and accuracy of parameters (for example, a file editing command requires a file path parameter).

If parsing fails, an error message is injected into the conversation, preventing further operations on the repository.

Successfully parsed commands are further analyzed by enforcing specific agent permissions and performing additional semantic checks.

It ensures that recommended actions comply with user-specified fine-grained permissions.

If the command passes these checks, the conversation manager invokes the corresponding action in the tools library.
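A minimal sketch of the parsing and permission checks described above might look as follows. The single-line command format, the permission table, and the argument rules are assumptions for illustration, not AutoDev's actual grammar.

```python
# Minimal sketch of command parsing plus a permission check.
# Command syntax and the permission table are illustrative assumptions.
import re

PERMISSIONS = {"developer": {"write", "test", "retrieve", "stop"}}

def parse_command(agent: str, response: str):
    """Extract a command name and arguments; reject malformed or forbidden ones."""
    match = re.match(r"^(\w+)\s*(.*)$", response.strip())
    if not match:
        raise ValueError("malformed response: no command name found")
    name, args = match.group(1), match.group(2).split()
    if name not in PERMISSIONS.get(agent, set()):
        raise PermissionError(f"agent '{agent}' is not allowed to run '{name}'")
    if name == "write" and not args:
        # e.g. a file-editing command must carry a file path argument
        raise ValueError("write requires a file path argument")
    return name, args

print(parse_command("developer", "write src/test_focal.py"))
```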

Output organizer

The output organizer module is primarily responsible for processing the output received from the evaluation environment.

It selects key information (such as status or errors), selectively summarizes relevant content, and adds well-structured information to the conversation history.

This ensures that users have a clear and organized record of AutoDev operations and results.
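A toy sketch of that summarization step, assuming the environment reports a return code plus raw stdout/stderr (the real module's interface is not described at this level of detail):

```python
# Toy sketch of the output organizer: keep the status and the most relevant
# part of the logs, and drop the rest before adding it to the conversation.
def summarize_output(returncode: int, stdout: str, stderr: str, limit: int = 2000) -> str:
    status = "SUCCESS" if returncode == 0 else f"FAILURE (exit code {returncode})"
    body = (stderr or stdout)[-limit:]  # keep the tail, where errors usually appear
    return f"[{status}]\n{body}"

print(summarize_output(1, "", "E   AssertionError: expected 6, got 5"))
```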

Conversation Terminator

The conversation manager decides when to end the conversation. This may happen when the agent signals task completion (a stop command), when the conversation reaches the user-defined maximum number of iterations/tokens, or when a problem is detected in the process or in the evaluation environment.

AutoDev’s comprehensive design ensures systematic and controllable AI-driven development.

Agent Scheduler (Multi-Agents)

The agent scheduler is responsible for coordinating the AI agents to achieve user-defined goals.

Agents configured with specific roles and available command sets work together to perform various tasks. The scheduler employs various collaborative algorithms, such as round-robin, token-based, or priority-based algorithms, to decide the order and manner in which agents participate in conversations.

Specifically, the scheduling algorithms include, but are not limited to, the following (a round-robin sketch is given after the list):

(i) Round-robin collaboration, which calls each agent in turn and lets each agent perform a predetermined number of operations;

(ii) token-based collaboration, where an agent performs multiple operations until it issues a token indicating completion of the assigned task;

(iii) Priority-based collaboration, where agents are invoked in order of their priority. In each case, the agent scheduler calls the selected agent with the current conversation.
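A round-robin variant of this scheduling idea could look roughly like the sketch below; the Agent interface, the per-turn budget, and the stop convention are assumptions made for illustration.

```python
# Sketch of round-robin scheduling: each agent takes a fixed number of turns
# per cycle until some agent signals completion by proposing "stop".
def round_robin(agents, conversation, turns_per_agent=2, max_cycles=10):
    for _ in range(max_cycles):
        for agent in agents:
            for _ in range(turns_per_agent):
                action = agent.step(conversation)        # agent proposes next command
                conversation.append((agent.name, action))
                if action == "stop":
                    return conversation
    return conversation

class EchoAgent:
    """Toy agent: proposes 'test' once, then 'stop'."""
    def __init__(self, name):
        self.name = name
        self._done = False

    def step(self, conversation):
        if self._done:
            return "stop"
        self._done = True
        return "test"

print(round_robin([EchoAgent("developer"), EchoAgent("reviewer")], []))
```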

Agents

The agents are language models, either large language models (LLMs) such as OpenAI's GPT-4 or smaller language models (SLMs) optimized for code generation, that communicate via natural-language text.

These agents receive goals and conversation history from the Agent Scheduler and respond with actions specified by rules and action configurations. Each agent has its own unique configuration that contributes to overall progress toward the user's goals.

Tools Library

The tool library in AutoDev provides a series of commands that enable agents to perform various operations on the repository.

These commands are designed to encapsulate complex operations, tools, and utilities into a simple and intuitive command structure.

For example, simple commands like build and test abstract away the complex issues related to build and test execution.

- File Editing: this category contains commands for editing files, including code, configuration, and documentation. Its utilities, such as write, edit, insert, and delete, offer varying degrees of granularity: the agent can do anything from writing an entire file to modifying specific lines within it. For example, the write command lets the agent rewrite a range of lines with new content.

- Retrieval: this category includes basic CLI tools such as grep, find, and ls, as well as more sophisticated embedding-based techniques that let agents find similar code snippets and retrieve relevant information from the codebase. For example, the retrieve command performs embedding-based retrieval of snippets similar to a provided piece of content.

- Build & Execute: this class of commands allows agents to effortlessly compile, build, and execute the codebase using simple and intuitive commands. The complexity of the underlying build commands is abstracted away by the evaluation environment infrastructure. Examples of such commands include build and run.

- Test & Verify: these commands let the agent test the codebase by executing individual test cases, specific test files, or the entire test suite, without relying on the low-level commands of a particular testing framework. This category also includes validation tools such as linters and error-finding tools. Examples include syntax, which checks syntactic correctness, and test, which runs the entire test suite.

- Git: users can configure fine-grained permissions for Git operations, including commit, push, and merge. For example, an agent can be granted permission to perform only local commits or, when necessary, to push changes to the remote repository.

- Communication: agents can invoke a set of commands designed to facilitate communication with other agents and/or the user. Notably, the talk command sends natural-language messages (not interpreted as repository operations), the ask command requests user feedback, and the stop command halts the process to indicate either that the goal has been achieved or that the agent is unable to continue.

The tool library in AutoDev therefore provides AI agents with a versatile and easy-to-use set of tools to interact with the code base and communicate effectively in a collaborative development environment.
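One plausible way to organize such a library is a registry that maps command names to handlers, as in the sketch below; the handler names and signatures echo the categories above but are not AutoDev's actual implementation.

```python
# Sketch of a command registry mapping tool names to handlers.
# Names and signatures are illustrative, not AutoDev's real API.
from typing import Callable, Dict, List

REGISTRY: Dict[str, Callable[..., str]] = {}

def command(name: str):
    """Decorator that registers a handler under a command name."""
    def register(fn: Callable[..., str]) -> Callable[..., str]:
        REGISTRY[name] = fn
        return fn
    return register

@command("write")
def write_file(path: str, content: str) -> str:
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    return f"wrote {len(content)} characters to {path}"

@command("test")
def run_tests(args: List[str]) -> str:
    # In AutoDev the actual execution would be delegated to the sandboxed
    # evaluation environment rather than run on the host.
    return "test run delegated to the evaluation environment"

print(sorted(REGISTRY))  # ['test', 'write']
```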

Eval Environment

The evaluation environment runs in a Docker container and can safely perform file editing, retrieval, build, execution, and test commands.

It abstracts the complexity of the underlying commands and provides a simplified interface to the agent. The evaluation environment returns standard output/error to the output organizer module.
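A minimal sketch of that idea, using the standard docker exec CLI to run a command inside a running container; the container name and the example test command are assumptions.

```python
# Sketch: dispatch a command into a Docker container via `docker exec`.
# The container name "autodev-sandbox" is a placeholder.
import subprocess

def run_in_container(container: str, argv: list[str], timeout: int = 300):
    """Run a command inside the sandbox container, capturing stdout/stderr."""
    result = subprocess.run(
        ["docker", "exec", container, *argv],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.returncode, result.stdout, result.stderr

# Example (assumes a container named "autodev-sandbox" is already running):
# code, out, err = run_in_container("autodev-sandbox", ["python", "-m", "pytest", "-q"])
```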

Integration

Users initiate a conversation by specifying a goal and related settings.

The conversation manager initializes a conversation object that integrates information from the AI agents and the evaluation environment, then dispatches the conversation to the agent scheduler, which is responsible for coordinating the actions of the AI agents.

Acting as AI agents, language models (large or small) suggest commands through textual interaction.

The command interface covers a variety of functions, including file editing, retrieval, build and execution, testing, and Git operations. The conversation manager parses these suggested commands and forwards them to the evaluation environment for execution against the codebase.

These commands are executed within the secure confines of the evaluation environment, which is encapsulated in a Docker container.

Once executed, the resulting actions are seamlessly integrated into the conversation history, contributing to subsequent iterations.

This iterative process continues until the agent considers the task complete, user intervention occurs, or the maximum iteration limit is reached.
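Condensed into pseudo-Python, the loop described above might look like the sketch below; all of the component interfaces (step, parse, execute) are assumed for illustration.

```python
# Condensed sketch of the AutoDev-style loop: the agent proposes a command,
# the parser validates it, the eval environment executes it, and the result
# is fed back into the conversation. Interfaces are illustrative assumptions.
def run_autodev(goal, agent, parser, evaluator, max_iterations=20):
    conversation = [("user", goal)]
    for _ in range(max_iterations):
        proposal = agent.step(conversation)            # language model suggests a command
        name, args = parser.parse(proposal)            # format and permission checks
        if name == "stop":                             # agent declares the task complete
            break
        output = evaluator.execute(name, args)         # run inside the Docker sandbox
        conversation.append(("environment", output))   # feed results back for next turn
    return conversation
```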

AutoDev's design ensures systematic and secure coordination of artificial intelligence agents to complete complex software engineering tasks in an autonomous and user-controlled manner.

Empirical evaluation design

In their empirical evaluation, the researchers assessed AutoDev's capabilities and effectiveness on software engineering tasks, investigating whether it improves the performance of AI models beyond simple inference.

In addition, the researchers also evaluated the cost of AutoDev in terms of number of steps, inference calls, and tokens.

Three main research questions were identified:

- RQ1: How effective is AutoDev in code generation tasks?

- RQ2: How effective is AutoDev in test generation tasks?

- RQ3: How efficiently does AutoDev complete its tasks?

RQ1: How effective is AutoDev in code generation tasks?

The researchers used the Pass@k metric to measure the effectiveness of AutoDev, where k represents the number of attempts.

A successfully solved problem means that the method body code generated by AutoDev satisfies all manually written tests. One attempt is equivalent to a complete AutoDev session, involving multiple inference calls and steps.

This is in contrast to other approaches, such as direct calls to GPT-4, which typically involve only one inference call. Details on multiple inference calls and steps are further explored in RQ3. In this evaluation, the researchers set k = 1 to calculate Pass@1, considering only the success rate of the first attempt.
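For reference, the standard unbiased Pass@k estimator introduced with HumanEval (not restated in this article) is:

```latex
\mathrm{Pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right]
```

where n is the number of sampled attempts per problem and c is the number of attempts that pass all tests; with k = 1 and a single AutoDev session per problem, this reduces to the fraction of problems solved on the first attempt.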

RQ2: How effective is AutoDev in test generation tasks?

For this research question, the researchers modified the HumanEval dataset to evaluate AutoDev's ability to generate tests.

The researchers keep the human-written solutions and discard the human-written tests provided with the dataset.

They instruct AutoDev to generate test cases for the focal methods and evaluate them based on test success rate, invocation of the focal method, and test coverage.

The researchers report Pass@1; a test is considered successful if it passes and invokes the focal method.

In addition, the researchers compared the coverage of AutoDev tests with the coverage of manually written tests.
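The paper does not spell out the exact tooling, but one plausible way to measure such coverage for Python tests is pytest with the pytest-cov plugin, roughly as sketched below; the module and test-file names are placeholders.

```python
# Sketch: run generated tests under coverage measurement with pytest-cov.
# "tests/test_focal.py" and "focal_module" are placeholder names.
import subprocess

result = subprocess.run(
    ["python", "-m", "pytest", "tests/test_focal.py",
     "--cov=focal_module", "--cov-report=term-missing", "-q"],
    capture_output=True, text=True,
)
print(result.stdout)  # includes a line-by-line coverage summary
```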

RQ3: How efficiently does AutoDev complete its tasks?

For this research question, the researchers investigate how efficiently AutoDev completes software engineering tasks.

The researchers analyzed the number of steps or inference calls required, the distribution of commands used (e.g. write, test), and the total number of tokens used in the conversation.

AutoDev settings

In this evaluation, AutoDev maintains a consistent setup with an agent based on the GPT-4 model (gpt-4-1106-preview).

Enabled operations include file editing, retrieval, and testing.

The only communication command available is the stop command to indicate task completion.

Other commands, such as ask, are disabled, so AutoDev must run autonomously without human feedback or intervention beyond the initial goal setting.

Experimental results

- How effective is AutoDev in code generation tasks?

Table 1 compares AutoDev with two alternative approaches and a zero-shot baseline.

The researchers compared AutoDev with Language Agent Tree Search (LATS) and Reflexion, the two leading methods on the HumanEval leaderboard as of March 2024.

The results of the zero-shot baseline (GPT-4) are taken from the OpenAI GPT-4 technical report.

The AutoDev Pass@1 rate is 91.5%, firmly ranking second on the HumanEval rankings.

It is worth noting that this result was obtained without additional training data, which distinguishes AutoDev from LATS, which achieved 94.4%.

In addition, the AutoDev framework improves GPT-4 performance from 67% to 91.5%, a relative improvement of 30%.

These results demonstrate AutoDev's ability to significantly improve the performance of large models in completing software engineering tasks.

– How effective is AutoDev in test generation tasks?

AutoDev achieved a Pass@1 score of 87.8% on the HumanEval dataset modified for the test generation task, a relative improvement of 17% compared to the baseline using the same GPT-4 model.

Correct tests generated by AutoDev (included in Pass@1) achieve a robust coverage of 99.3%, which is comparable to the 99.4% coverage of human-written tests.

– How efficiently does AutoDev complete its tasks?

Figure 3 shows the cumulative number of commands used by AutoDev in the code generation and test generation tasks, based on the average number of commands executed per HumanEval problem in RQ1 and RQ2.

For code generation, AutoDev executed an average of 5.5 commands per run, including 1.8 write operations, 1.7 test operations, 0.92 stop operations (indicating task completion), 0.25 erroneous commands, and minimal use of retrieval (grep, find, cat), syntax-check, and communication commands.

In the “Test Generation” task, the average number of commands is consistent with the “Code Generation” task.

However, the “Test Generation” task involves more retrieval operations and has a higher incidence of error operations, so the average total number of commands per run is 6.5.

Across these first two research questions, the average AutoDev conversation for solving a HumanEval problem was 1,656 and 1,863 tokens long, respectively.

This includes the user's goal, messages from the AI agent, and responses from the evaluation environment.

In comparison, the zero-shot GPT-4 baseline uses an estimated average of 200 tokens per task to generate code and 373 tokens to generate tests.

Although AutoDev uses more tokens, a large number of tokens are used to test, verify and interpret the code it generates, which is beyond the scope of the baseline approach.

Finally, AutoDev incurs execution costs associated with coordinating AI agents, managing conversations, and executing commands in a Docker environment.

The developer agent is given a task: generate pytest tests that follow a specific format.

The AutoDev agent starts the write-new command, providing the file path and contents of the test file.

Subsequently, the AutoDev agent triggers the test operation; AutoDev runs the test in its secure Docker environment and returns a JSON test-execution report.

Then, AutoDev starts executing autonomously:

The AutoDev agent found an error in the pytest output and recognized that a fix needed to be made to make the test consistent with the expected behavior of the function.

Continuing as shown in Figure 5, the AutoDev agent issues a write command specifying the file path and line number range (5-5) to rewrite the erroneous assertion statement.

The AutoDev agent then proceeds to execute the test, and the test succeeds.
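To make the walkthrough concrete, a generated test file of the kind described might look like the sketch below; the focal function, file name, and expected values are invented for illustration and are not taken from the paper's figures.

```python
# Hypothetical generated test file (e.g. tests/test_focal.py).
# The focal function `focal_method` and the expected values are invented.
import pytest
from focal_module import focal_method

def test_focal_method_basic():
    # The agent's first draft asserted a wrong expected value on this line;
    # after reading the pytest failure log it rewrote only this line.
    assert focal_method([1, 2, 3]) == 6

def test_focal_method_empty_input():
    assert focal_method([]) == 0

def test_focal_method_rejects_none():
    with pytest.raises(TypeError):
        focal_method(None)
```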

As you can see from the example above, AutoDev is able to self-evaluate the generated code and resolve errors in its own output.

In addition, AutoDev can help users gain insights into the operation of agents and allow agents to communicate during tasks.

With the birth of AI engineers such as Devin and AutoDev, a large part of programmers' work may be automated.

References:

  • https://www.reddit.com/r/singularity/comments/1bfolbj/autodev_automated_aidriven_development_microsoft/
