GPT-4 reaches a 74.4% success rate as a Chinese team from UIUC and Apple introduces CodeAct, unifying LLM agent actions with Python code

Researchers from UIUC and Apple recently proposed CodeAct, a general agent framework that unifies the actions of LLM agents through Python code. LLM agents have long been championed by AI industry leaders and are even expected to become a tool that frees humans from tedious tasks.

But how do they best interact with the world?


To answer this, Chinese researchers from UIUC and Apple proposed a new agent framework, CodeAct, which unifies the actions of LLM agents by using executable Python code.

Paper address: pdf/ 2402.01030.pdf

Unlike most existing LLM agents, CodeAct stands out because it makes full use of the code data that existing LLMs saw during pre-training, which allows low-cost and efficient adoption.

Code inherently supports complex operations through control and data flow, and a wide range of software packages can be used to extend the action space and provide automated feedback.


The authors also built CodeActAgent, an agent based on the Mistral 7B model that can complete coding tasks through dialogue.

For example, “Can you create 100 random data points (each with dimension 2) and create a scatter plot? Run k-means to cluster and visualize them.”

Let LLM become the optimal agent

When augmented with action modules that allow access to APIs, an LLM's action space expands beyond traditional text processing.

This gives LLMs capabilities such as tool calling and memory management, and lets them venture into real-world tasks such as controlling robots and conducting scientific experiments.

So how can the action space of LLM agents be expanded effectively to solve complex real-world problems?

As shown in the upper left of Figure 1, many existing studies have agents generate actions as text or JSON.

However, both approaches are often constrained by a narrow action space (actions are usually tailored to specific tasks) and limited flexibility (e.g., the inability to combine multiple tools in a single action).

Other studies have demonstrated the potential of using LLMs to generate code that controls robots or game characters.

However, they often rely on pre-specified control primitives and hand-engineered prompts and, more importantly, struggle to dynamically adjust or issue actions based on new environmental observations and feedback.

To address this, the study proposes CodeAct, a general framework that allows LLMs to generate executable Python code as actions (Fig. 1, upper right).

CodeAct is designed to handle a variety of applications and offers unique advantages:

(1) CodeAct integrates with a Python interpreter to execute code actions, and can dynamically revise previous actions or issue new ones based on observations received over multiple rounds of interaction (code execution).

(2) Code actions allow LLMs to leverage existing software packages. CodeAct can use off-the-shelf Python packages to expand the action space instead of hand-crafted, task-specific tools. It also lets large models improve task solving by self-debugging their generated code, using the automatic feedback built into most software (e.g., error messages).

(3) Code data is widely used in pre-training of today’s large models. These models are already familiar with structured programming languages, so they can adopt CodeAct cost-effectively.

(4) In contrast to JSON and pre-formatted text, code inherently supports control and data flow. Intermediate results can be stored as variables for reuse, and a single piece of code can combine multiple tools to perform complex logical operations (e.g., if-statements, for loops), unlocking the programming knowledge that large models acquired during pre-training to handle complex tasks.

In Figure 1, an LLM using CodeAct (top right) can apply the same sequence of tools to all inputs with a single action via a for loop, whereas text or JSON requires a separate action for each input.
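The contrast can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tools `lookup_price` and `convert_currency`, the price table, and the JSON dispatcher `run_json_action` are hypothetical stand-ins invented here, not part of the paper or of any real agent library.

```python
import json

# Hypothetical tools and data, invented for this illustration only.
PRICES_USD = {"apple": 1.0, "banana": 0.5, "cherry": 4.0}

def lookup_price(item: str) -> float:
    """Return the USD price of an item."""
    return PRICES_USD[item]

def convert_currency(amount: float, rate: float) -> float:
    """Convert an amount using a fixed exchange rate."""
    return round(amount * rate, 2)

TOOLS = {"lookup_price": lookup_price, "convert_currency": convert_currency}

def run_json_action(action: str):
    """Execute one JSON action: exactly one tool call per action."""
    call = json.loads(action)
    return TOOLS[call["tool"]](**call["args"])

items = ["apple", "banana", "cherry"]

# JSON-style agent: two actions per item, with results carried over by hand.
json_actions_used = 0
json_results = {}
for item in items:
    price = run_json_action(
        json.dumps({"tool": "lookup_price", "args": {"item": item}}))
    json_actions_used += 1
    json_results[item] = run_json_action(
        json.dumps({"tool": "convert_currency",
                    "args": {"amount": price, "rate": 0.9}}))
    json_actions_used += 1

# Code-style agent: one action whose loop chains both tools for every input.
code_action = """
results = {}
for item in items:
    results[item] = convert_currency(lookup_price(item), rate=0.9)
"""
namespace = {"items": items, **TOOLS}
exec(code_action, namespace)
code_actions_used = 1
code_results = namespace["results"]

print(json_actions_used, code_actions_used)
print(json_results == code_results)
```

In this toy setup the JSON route needs six separate actions while the code route needs one, and both end with the same results.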

CodeAct Framework

Figure 2 first introduces a general multi-turn interaction framework for LLM agents in the real world, which involves three roles: agent, user, and environment.

The researchers define interaction as the exchange of information between the agent and an external entity (user or environment).

In each round of interaction, the agent receives an observation (input) from the user (such as a natural language instruction) or from the environment (such as a code execution result), optionally plans its action through chain-of-thought (CoT) reasoning, and then issues an action (output) either to the user in natural language or to the environment.

CodeAct uses Python code to consolidate all of the agent's actions toward the environment.

In CodeAct, each action issued to the environment is a piece of Python code, and the agent receives the output of the code execution (such as results or errors) as its observation.
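A minimal sketch of this action-to-observation step, using only the Python standard library, might look as follows. The function name `execute_code_action` and the choice to report only the last traceback line are assumptions made for illustration; the paper's actual execution environment is not described here, and a real deployment would need sandboxing, since `exec` runs arbitrary code.

```python
import contextlib
import io
import traceback

def execute_code_action(code: str, namespace: dict) -> str:
    """Run one code action and return the observation shown to the agent."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)  # unsandboxed: for illustration only
    except Exception:
        # Keep only the final traceback line, e.g. "NameError: ...".
        error = traceback.format_exc().strip().splitlines()[-1]
        return buffer.getvalue() + error
    return buffer.getvalue()

# The namespace persists across turns, like an interactive interpreter session.
session = {}
obs1 = execute_code_action("total = sum(range(5))\nprint(total)", session)
obs2 = execute_code_action("print(total * 2)", session)
obs3 = execute_code_action("print(undefined_name)", session)
print(repr(obs1), repr(obs2), repr(obs3))
```

Because the namespace is shared, the second action can reuse `total` from the first, and the third action yields an error message as its observation instead of crashing the loop.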

The promise of CodeAct as a powerful tool usage framework

The authors conducted a controlled experiment to determine which format (text, JSON, or CodeAct) is more likely to lead an LLM to generate correct atomic tool calls.

Performance in this experiment reflects an LLM's familiarity with the corresponding format.

The researchers hypothesized that calling tools using CodeAct would be a more natural way to use tools for models, which typically have extensive exposure to code data during training.

For most LLMs, CodeAct achieves comparable or better performance even on atomic actions (a simplistic tool-use scenario), where its control and data flow strengths are of little help.

CodeAct's improvements are more prominent for open-source models than for closed-source LLMs.

Additionally, code data is often more accessible than specialized JSON or text tool-calling formats for fine-tuning open-source LLMs. Although JSON is consistently weaker than the other approaches for open-source models, it achieves decent performance with closed-source LLMs, suggesting that these closed-source models may have been deliberately fine-tuned for their JSON capabilities.

These results suggest that, for open-source large models, optimizing for CodeAct is a better route than the alternatives to improving their tool-use capabilities, since they already show good initial CodeAct ability thanks to extensive exposure to code data during pre-training.

CodeAct does more with less interaction

In addition, the authors investigate whether LLM agents can benefit from code control and data flow on problems requiring complex tool usage patterns.

To this end, the researchers curated a benchmark, M3ToolEval, to evaluate an LLM's ability to solve complex tasks that typically require multiple calls to multiple tools.

The authors list the full results in Table 3 and visualize a subset of the results in Figure 1.

CodeAct generally has a higher task success rate (12 out of 17 evaluated LLMs). Additionally, the average number of interaction rounds required to perform tasks using CodeAct is also lower.

For example, compared with the next-best action format (text), the best model, gpt-4-1106-preview, achieves an absolute improvement of 20.7% in success rate while needing 2.1 fewer interaction rounds on average.

However, in terms of absolute CodeAct performance, there is still a significant gap between open-source and closed-source LLMs: the best open-source model reaches 13.4%, while the best closed-source model, gpt-4-1106-preview, reaches 74.4%.

This may be due to the weaker task-solving capabilities of open-source models and their inability to follow complex instructions without demonstrations, which points to an urgent need to improve open-source LLMs for real-world tasks in zero-shot settings.

CodeAct benefits from multiple rounds of interaction and existing software packages

The researchers also demonstrated how LLM agents can be integrated with Python and use existing software to perform complex tasks over multiple rounds of interactions.

Thanks to the rich Python knowledge learned during pre-training, the LLM agent can automatically import the correct Python library to solve the task without requiring user-provided tools or demonstrations.

As shown in Figure 3, CodeActAgent can use Pandas to download and process tabular data, use Scikit-Learn for train-test splitting and regression model training, and use Matplotlib for data visualization.

In addition, executing code in an interactive Python interpreter surfaces error messages automatically, helping the LLM agent “self-debug” its actions over multiple rounds of interaction and ultimately complete the human user's request correctly.
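The self-debugging loop can be illustrated with a toy example. Here `scripted_agent` is a hand-written stand-in for the LLM (an assumption for illustration, not the paper's agent): its first action forgets an import, and it repairs the action only after seeing the error message returned as an observation.

```python
import contextlib
import io
import traceback

def run(code: str, namespace: dict) -> tuple:
    """Execute code and return (succeeded, observation)."""
    out = io.StringIO()
    try:
        with contextlib.redirect_stdout(out):
            exec(code, namespace)  # unsandboxed: for illustration only
        return True, out.getvalue()
    except Exception:
        return False, traceback.format_exc().strip().splitlines()[-1]

def scripted_agent(observation):
    """Stand-in for an LLM: picks the next action from the last observation."""
    if observation is None:
        # First attempt forgets to import the module it uses.
        return "print(math.sqrt(16))"
    if "NameError" in observation and "math" in observation:
        # The error message reveals the fix: add the missing import.
        return "import math\nprint(math.sqrt(16))"
    return "print('giving up')"

namespace = {}
observation = None
history = []
for turn in range(3):  # cap the number of interaction rounds
    action = scripted_agent(observation)
    ok, observation = run(action, namespace)
    history.append((ok, observation))
    if ok:
        break

print(history)
```

The first round fails with a `NameError`, and the second round succeeds once the agent has reacted to that message, which mirrors the multi-round correction described above.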

Building open source LLM agents

The potential demonstrated by CodeAct inspired the researchers to build an open-source LLM agent that can interact with the environment through CodeAct and communicate with humans in natural language.

To improve the CodeAct capability of open-source LLMs, the authors introduce CodeActInstruct, an instruction fine-tuning dataset containing interaction trajectories between the agent and the environment.

Table 4 shows the data composition of CodeActInstruct and a comparison with previous work.

Next, the researchers fine-tuned Llama-2 7B and Mistral 7B on a mixture of CodeActInstruct and general conversation data to obtain CodeActAgent.

CodeActAgent performs well in CodeAct tasks.

As shown in Table 5, CodeActAgent (both variants) performs better than all evaluated open source LLMs on both the in-domain and out-of-domain subsets of MINT.

On M3ToolEval, the authors found that CodeActAgent (Mistral) outperformed open-source LLMs of similar scale (7B and 13B), and even achieved performance similar to that of 70B models.

Surprisingly, no improvement was observed with the Llama-2 variant.

CodeActAgent generalizes to text actions.

When evaluated on out-of-domain text actions, CodeActAgent (LLaMA2, 7B), which was never optimized for text actions, achieves performance comparable to AgentLM-7B, which was explicitly fine-tuned on text actions.

CodeActAgent maintains or improves performance on general LLM tasks.

Table 5 also shows that CodeActAgent (both variants) performs better on the general LLM tasks tested, except for a slight drop on MMLU for CodeActAgent (Mistral-7B).