Tsinghua’s groundbreaking AI hospital town debuts, AI doctors outperform humans in diagnosing 10,000 patients in just days

[Introduction to New Wisdom] A Tsinghua team has moved the hospital into the AI world! Agent Hospital, the first AI hospital town, simulates the entire process of doctors treating patients. What's more, its AI doctors can evolve autonomously and treat approximately 10,000 patients in just a few days.

Stanford's AI town once went viral across the Internet. With 25 intelligent agents living their lives and making friends, it was called a real-life version of “Westworld”.

And now, the AI “hospital town” is here too!

Recently, researchers from the Tsinghua team developed a simulated hospital called “Agent Hospital”.

Paper address: https://arxiv.org/pdf/2405.02957

In this virtual world, all doctors, nurses, and patients are intelligent agents driven by LLMs and can interact autonomously.

They simulate the entire process of diagnosis and treatment, including triage, registration, consultation, examination, diagnosis, treatment, follow-up and other steps.

In this study, the authors' core goal is to enable AI doctors to learn to treat diseases in a simulated environment and achieve autonomous evolution.

To this end, they developed a method called MedAgent-Zero that allows doctor agents to continuously accumulate experience from successful and failed cases.

It is worth mentioning that AI doctors can complete the treatment of 10,000 patients in a few days.

A human doctor would need about 2 years to treat that many patients.

In addition, the evolved doctor agent achieved a state-of-the-art accuracy of 93.06% on a subset of the MedQA data set covering major respiratory diseases.

It has to be said that AI is quietly evolving in the virtual world and has the potential to outcompete humans.

Some netizens said, “AI simulation will explore roads that humans simply do not have the time or ability to explore.”

Imagine thousands of fully automated hospitals, which will save millions of lives. This will come soon.

The first AI hospital town debuts

In fact, intelligent agents have long been a promising field in the industry.

Whether they simulate a virtual world or solve real tasks (such as Devin), agents will bring great changes to our world.

However, these multi-agent systems are usually used either for “social simulation” or for “problem solving.”

So, are there any agents that combine these two abilities?

In other words, can the social simulation process improve the performance of LLM agents on specific tasks?

Inspired by this, the researchers developed a simulation that covers almost the entire process of treating a disease.

A world like the stand-alone game “Theme Hospital”

The simulated environment in Agent Hospital mainly has two types of agents: patients and medical professionals.

Their role information is generated by GPT-3.5 and can be infinitely expanded.

For example, in the picture below, 35-year-old patient Kenneth Morgan has acute rhinitis and a history of hypertension. His current symptoms are persistent vomiting, some diarrhea, recurrent fever, abdominal pain, headache, and swollen cervical lymph nodes.

Let’s look at 32-year-old physician Elise Martin, who has excellent communication skills and empathetic nursing skills.

Her primary responsibility is to provide diagnostic, therapeutic and preventive care to adult patients with a variety of acute and chronic conditions.

There are also ZhaoLei, a radiologist who specializes in interpreting medical images, and Fatoumata Diawara, the front desk receptionist.
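One way to picture these role cards is as simple records. The following is a minimal Python sketch, not the paper's actual schema: the class names, the field names, and the `summary` helper are assumptions made for illustration, and only the values come from the examples above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PatientProfile:
    """Hypothetical patient role card; field names are illustrative."""
    name: str
    age: int
    disease: str
    medical_history: List[str]
    symptoms: List[str]

    def summary(self) -> str:
        # Render the card as one line of text, e.g. for inclusion in a prompt.
        return (f"{self.name}, {self.age}, has {self.disease}; "
                f"history: {', '.join(self.medical_history)}; "
                f"symptoms: {', '.join(self.symptoms)}")


@dataclass
class StaffProfile:
    """Hypothetical role card for a medical professional."""
    name: str
    role: str
    age: Optional[int] = None  # the article gives an age for only some staff
    skills: List[str] = field(default_factory=list)


kenneth = PatientProfile(
    name="Kenneth Morgan", age=35, disease="acute rhinitis",
    medical_history=["hypertension"],
    symptoms=["persistent vomiting", "some diarrhea", "recurrent fever",
              "abdominal pain", "headache", "swollen cervical lymph nodes"],
)
elise = StaffProfile(name="Elise Martin", role="physician", age=32,
                     skills=["communication", "empathetic nursing"])
zhao = StaffProfile(name="ZhaoLei", role="radiologist",
                    skills=["interpreting medical images"])

print(kenneth.summary())
```

Because each card is plain data, generating more patients or staff only means producing more records of the same shape, which fits the article's point that the roles can be expanded indefinitely.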

As shown in the figure below, there are various consulting rooms and examination rooms in the Agent Hospital, so a series of medical professional agents are required to work.

The researchers designed 14 doctors and 4 nurses.

Doctor agents are designed to diagnose diseases and develop detailed treatment plans, while nursing agents focus on triage and support daily therapeutic interventions.

How are AI patients treated?

Just like the process of seeing a doctor in the real world, when a patient gets sick, he or she will go to the hospital to register for treatment.

During this period, they will also go through a series of stages, including examination, triage, consultation, diagnosis, and treatment.

After the patient receives the treatment plan, an LLM predicts how the patient's health status will change. Once recovered, the patient proactively reports to the hospital for follow-up.

The following is a schematic diagram of Kenneth Morgan going to the hospital for treatment.

First, triage nurse Katherine Li conducted a preliminary assessment of Morgan and triaged him to the dermatology department.

Morgan then checked in at the hospital counter and was scheduled for a consultation with dermatologist Robert Thompson.

After completing the required physical examination, the AI doctor prescribed medication for Morgan and urged him to go home and rest while monitoring the improvement of his condition.
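The visit described above can be read as a fixed sequence of stages. The sketch below is only an illustration of that idea: the stage list follows the order given earlier in the article (triage, registration, consultation, examination, diagnosis, treatment, follow-up), while the `Stage` enum and the `next_stage` and `visit` functions are hypothetical names, not part of Agent Hospital.

```python
from enum import Enum
from typing import List, Optional


class Stage(Enum):
    """Stages of a simulated visit, in the order the article lists them."""
    TRIAGE = "triage"
    REGISTRATION = "registration"
    CONSULTATION = "consultation"
    EXAMINATION = "examination"
    DIAGNOSIS = "diagnosis"
    TREATMENT = "treatment"
    FOLLOW_UP = "follow-up"


def next_stage(current: Stage) -> Optional[Stage]:
    """Return the stage after `current`, or None once follow-up is done."""
    stages = list(Stage)  # Enum iteration preserves definition order
    i = stages.index(current)
    return stages[i + 1] if i + 1 < len(stages) else None


def visit(patient: str) -> List[str]:
    """Walk one patient through every stage and return a log of the visit."""
    log = []
    stage: Optional[Stage] = Stage.TRIAGE
    while stage is not None:
        log.append(f"{patient}: {stage.value}")
        stage = next_stage(stage)
    return log


for line in visit("Kenneth Morgan"):
    print(line)
```

In the real system each stage is an interaction between LLM-driven agents rather than a log entry; the sketch only fixes the order in which those interactions happen.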

AI doctors super-evolve themselves without manually labeling data

In a simulated environment, researchers hope to train a skilled doctor agent to handle medical tasks such as diagnosis and treatment.

The traditional method is to feed huge amounts of medical data to an LLM or agent, and then go through pre-training, fine-tuning, and RAG to build a powerful medical model.

In the latest research, the authors proposed a new strategy: simulating doctor-patient interaction in a virtual environment to train doctor agents.

In this process, the researchers did not use manually labeled data, which is why the system is named MedAgent-Zero.

This strategy includes two important modules, namely the “medical record database” and the “experience database”.

Cases of successful diagnosis and treatment are compiled and stored in the medical record database as a reference for future medical intervention.

When a treatment fails, the AI doctor must reflect on and analyze the reasons for the incorrect diagnosis, and summarize guiding principles that serve as warnings in subsequent treatment.

In short, MedAgent-Zero lets doctor agents interact with patient agents and evolve into better “doctors” by accumulating records of successful cases and learning from failed ones.

The entire self-evolution process is as follows:

1) Accumulate examples and summarize experience;

2) Add the correct response directly to the sample library;

3) Summarize the wrong experience and retest;

4) Further abstract successful experiences and incorporate them into the experience database;

5) During the reasoning process, use the two libraries to retrieve the most similar content for reasoning.
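The five steps above can be sketched as a small loop. This is a toy illustration under stated assumptions, not the authors' implementation: `Library`, `similarity`, `evolve`, `toy_answer`, and `toy_reflect` are hypothetical names, word-overlap similarity stands in for whatever retrieval the real system uses, and the two toy functions replace what would be LLM calls. It also assumes a rule is kept only if it fixes the answer on retest.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets (stand-in for real retrieval)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


@dataclass
class Library:
    """A list of (question, text) pairs searchable by question similarity."""
    entries: List[Tuple[str, str]] = field(default_factory=list)

    def add(self, question: str, text: str) -> None:
        self.entries.append((question, text))

    def top_k(self, question: str, k: int = 2) -> List[str]:
        ranked = sorted(self.entries,
                        key=lambda e: similarity(question, e[0]), reverse=True)
        return [text for _, text in ranked[:k]]


def evolve(cases, answer_fn, reflect_fn, records: Library, experience: Library):
    """Run the doctor over (question, correct_answer) cases and update both libraries."""
    for question, truth in cases:
        recs, exps = records.top_k(question), experience.top_k(question)
        answer = answer_fn(question, recs, exps)
        if answer == truth:
            # Step 2: a correct response goes straight into the record library.
            records.add(question, f"{question} -> {answer}")
            continue
        # Step 3: summarize the mistake as a rule, then retest with that rule.
        rule = reflect_fn(question, answer, truth)
        if answer_fn(question, recs, exps + [rule]) == truth:
            # Step 4: only rules that fix the answer enter the experience library.
            experience.add(question, rule)


def toy_answer(question, recs, exps):
    """Toy doctor: follows a retrieved 'symptoms => disease' rule, else guesses."""
    for rule in exps:
        symptoms, _, disease = rule.partition(" => ")
        if symptoms in question:
            return disease
    return "common cold"


def toy_reflect(question, wrong, truth):
    """Toy reflection: turn the failed case into a 'symptoms => disease' rule."""
    return f"{question} => {truth}"


records, experience = Library(), Library()
cases = [("wheezing and shortness of breath", "asthma"),
         ("runny nose and sneezing", "common cold")]
evolve(cases, toy_answer, toy_reflect, records, experience)

# Step 5: at inference time, retrieve from both libraries before answering.
q = "wheezing and shortness of breath"
print(toy_answer(q, records.top_k(q), experience.top_k(q)))
```

Note that nothing in the loop updates model weights: the doctor improves only because the two libraries grow, which is consistent with the article's point that no manually labeled data is needed.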

Notably, because training is cheap and efficient, the doctor agent can easily work through a very large number of cases.

For example, the agent can handle tens of thousands of cases in just a few days, which would take real-world doctors years.

Diagnose respiratory diseases with an accuracy of 93.06%

Next, the researchers conducted two types of experiments to verify the effectiveness of the doctor agent improved by the MedAgent-Zero strategy in the hospital.

On the one hand, in the virtual hospital, the authors ran interactive experiments with 100 to 10,000 patient agents (a human doctor may treat about 100 patients a week), covering 8 different respiratory diseases, more than a dozen medical examinations, and three different treatment options for each disease.
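The “2 years” comparison follows directly from these figures. As a quick check, assuming a 52-week working year (an assumption not stated in the article):

```python
PATIENTS_TREATED = 10_000   # simulated patients, from the article
PATIENTS_PER_WEEK = 100     # approximate human throughput, from the article
WEEKS_PER_YEAR = 52         # assumption: no weeks off

weeks_needed = PATIENTS_TREATED / PATIENTS_PER_WEEK
years_needed = weeks_needed / WEEKS_PER_YEAR
print(f"{weeks_needed:.0f} weeks, about {years_needed:.1f} years")
# prints "100 weeks, about 1.9 years"
```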

The doctor agent trained through the MedAgent-Zero strategy continuously evolved itself in the process of handling simulated patients, and finally achieved accuracy rates of 88%, 95.6% and 77.6% in examination, diagnosis and treatment tasks respectively.

As samples continue to expand, the training performance of MedAgent-Zero tends to stabilize when it reaches a certain amount.

The performance of MedAgent-Zero on the three tasks of examination, diagnosis, and treatment also fluctuates as the number of samples increases, but the overall accuracy shows an upward trend.

The following three figures show the examination accuracy, diagnosis accuracy, and treatment accuracy for different diseases. These also rise steadily as the number of samples increases.

On the other hand, the researchers asked the evolved doctor agent to participate in the evaluation of a subset of the MedQA data set.

Surprisingly, even without any manually annotated data, the doctor agent achieved state-of-the-art performance after evolving in Agent Hospital.

In terms of experience accumulation, Figure 11, Figure 12 and Figure 13 respectively show the accumulation of verified experience and wrong answers in examination, diagnosis and treatment tasks.

As the training samples increase, both the number of experiences and the number of wrong answers slowly increase.

As shown in the figure, the experience curve is lower than the wrong-answer curve because the agent cannot turn every failure into a verified experience through reflection. Furthermore, experience accumulates more easily in the diagnosis task than in the other tasks.

Let’s look at a case study.

The following table illustrates the performance of the experience database alone, the medical record database alone, and the full MedAgent-Zero on the three tasks in patient diagnosis and treatment.

After learning the patient's symptoms, the AI doctor needs not only the medical record database but also the experience database, because the two complement each other.

If one of them is missing, the diagnostic accuracy will decrease.

As shown below, by adding experience and records, MedAgent-Zero gives correct answers for all 3 tasks.

The above results show that the simulation environment can effectively help LLM agents evolve when dealing with specific tasks.

First, MedAgent-Zero scores 2.78% higher than the SOTA method Medprompt when using GPT-3.5, and 1.39% higher than the SOTA method MedAgents when using GPT-4.

This result confirms that the new method can evolve the agent using only simulated documents and medical documents, without any MedQA training samples, and thereby effectively improves the medical capabilities of the doctor agent.

Second, the best performance of MedAgent-Zero based on GPT-4 is 93.06%, which is better than human experts (about 87%) on the MedQA dataset.

Third, the doctor agent based on GPT-4 performs better than any other method based on GPT-3.5, indicating that GPT-4 is more powerful in the medical field.

Additionally, in ablation studies on MedAgent-Zero, the version that uses both the “Medical Record Database” and the “Experience Database” achieved the best performance, which indicates that both modules help with diagnosis.

With the accumulation of cases and the expansion of the experience base, the accuracy of doctor agents is generally getting higher and higher.

With either GPT-3.5 or GPT-4, the experience base accumulated from 8,000 cases performs better than those accumulated from 2,000, 4,000, or 6,000 cases.

However, a larger experience base is not always better: the researchers also found a significant drop in performance between 2,000 and 4,000 cases.

Limitations

Finally, the researchers also mentioned the limitations of the study.

– Only GPT-3.5 was used as the simulator for Agent Hospital and for evaluation.

– Because the interaction between agents and their evolution involve API calls, the operational efficiency of the AI hospital is limited by LLM generation.

– Each patient's health record and examination results are generated without domain knowledge to simulate real electronic health records, so they still differ somewhat from real-world records.

In the future, the researchers' plans for Agent Hospital include:

First, expand the range of diseases covered to more medical departments, so that the simulation reflects the comprehensive services of a real hospital and supports further research.

Second, strengthen the social simulation of the agents, for example by incorporating a comprehensive promotion system for medical professionals, changing the distribution of diseases over time, and incorporating patients' historical medical records.

Third, optimize the selection and implementation of the underlying LLM, aiming to perform the entire simulation process more efficiently by leveraging powerful open source models.

References:

  • https://x.com/emollick/status/1787896361276571660
