Stanford Team Unveils Revolutionary Robot Control System: Success Rate Soars With Just a Shout, Netizens Demand Faster Production for Tesla.

Stanford's ALOHA housework robot team has released its latest research.

The project is called Yell At Your Robot (YAY for short). With it, a robot's blunders can be corrected with nothing more than a shouted phrase!

The robot can also dynamically improve its actions in response to human speech, adjust its strategy on the fly, and keep improving itself based on feedback.

For example, in this scene, the robot failed to complete the task of “putting the sponge into the bag” set by the system.

The researcher then called out to it, “Use the sponge to open the bag wider,” and it succeeded.

Moreover, these corrective instructions will be recorded by the system and become training data to further improve the subsequent performance of the robot.

Some netizens who saw it asked: if you can shout at a robot, shouldn't cars get the same feature soon? They also tagged Tesla and its autonomous driving software director, Ashok Elluswamy.

After the results were released, former Google Robotics senior researcher Eric Jang, former DeepMind researcher and Stanford visiting professor Karol Hausman, and other big names also praised the work.

So, what can a robot that is adjusted by shouting actually do?

You can give orders by shouting

After being trained with YAY, the robot took on three complex tasks (bagging items, mixing fruits, and washing dishes) with a higher success rate.

What these three tasks have in common is that they all require the two hands to perform different actions: one hand has to hold the container steady and adjust its pose as needed, while the other has to locate the target position accurately and carry out the instruction. The tasks also involve soft objects such as sponges, so applying the right grip force is a skill in itself.

Take the bag-packing task as an example. The robot runs into various difficulties when executing it fully autonomously, but a shouted instruction can get it past them.

In this clip, the robot accidentally dropped the sponge while filling the bag and could not pick it up again.

The developer then shouted directly to it, and the command was simply “Move to me, then left.”

The robot failed on its first attempt after following the instruction, but it remembered the “go left” command and picked up the sponge successfully after moving left again.

But then a new difficulty appeared – the mouth of the bag was stuck.

At this time, just tell it to open the bag a little more, and the robot will “understand”, adjust a series of follow-up actions, and finally complete the task successfully.

It is not just about correcting mistakes, either. The details of a task can also be adjusted in real time by shouting. For example, in the candy-filling task, the developer felt that the robot had taken a little too much candy. He only had to shout “less,” and the robot poured some of the candy back into the box.

Furthermore, these human-generated instructions are also recorded by the system and used for fine-tuning to improve the robot's subsequent performance.

For example, in the dish-washing task, the fine-tuned robot scrubs more forcefully and covers a wider area.

Statistics show that after this kind of fine-tuning, the robot's average task success rate increased by 20%, and it can keep improving as long as spoken corrections continue.

Moreover, such an instruction-fine-tuning process can be carried out iteratively, and the robot's performance can be improved with each iteration.

So, how does YAY work?

Human teachings “engraved in the heart”

Architecturally, the YAY system consists mainly of two parts: a high-level strategy (policy) and a low-level strategy (policy).

The high-level strategy is responsible for generating language instructions to guide the low-level strategy, and the low-level strategy is used to perform specific actions.

Specifically, the high-level strategy encodes the visual information captured by the camera and combines it with relevant knowledge. A Transformer then generates instructions that include a description of the current action, a prediction of the next action, and so on.

After receiving the language instructions, the low-level strategy will parse the keywords in these instructions and map them to the target position or motion trajectory of the robot joints.

At the same time, the YAY system introduces a real-time language correction mechanism: human verbal commands have the highest priority and, once recognized, are passed directly to the low-level strategy for execution.
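The priority rule described above can be sketched in a few lines of Python. This is only an illustration, not the team's code: `Instruction`, `choose_instruction`, and `toy_high_level` are hypothetical names, observations are plain strings, and the real high-level strategy is a learned model rather than a fixed function.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Instruction:
    text: str
    source: str  # "human" or "high_level"


def choose_instruction(
    observation: str,
    high_level: Callable[[str], str],
    spoken_correction: Optional[str] = None,
) -> Instruction:
    """Return the instruction the low-level strategy should execute next."""
    # A recognized human correction always wins over the generated instruction.
    if spoken_correction and spoken_correction.strip():
        return Instruction(spoken_correction.strip(), "human")
    return Instruction(high_level(observation), "high_level")


def toy_high_level(observation: str) -> str:
    # Hypothetical stand-in: the real high-level strategy is a learned model.
    return "put the sponge into the bag"


autonomous = choose_instruction("sponge on table", toy_high_level)
corrected = choose_instruction(
    "sponge on table", toy_high_level, "open the bag wider"
)
print(autonomous)
print(corrected)
```

When no correction is spoken, the generated instruction is used; as soon as one is recognized, it replaces the generated instruction for that step.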

During this process, the commands are recorded by the system and used to fine-tune the high-level strategy. By learning from the corrective feedback that humans provide, the system gradually reduces its reliance on immediate verbal corrections, which improves the autonomous success rate on long-horizon tasks.
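As a rough sketch of how such recording could feed fine-tuning, the snippet below logs each spoken correction together with the observation at that moment and appends the log to the existing training pairs. `CorrectionLog` and `build_finetune_set` are hypothetical names, observations are simplified to strings, and the actual fine-tuning step is left out.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Example = Tuple[str, str]  # (observation, language instruction)


@dataclass
class CorrectionLog:
    examples: List[Example] = field(default_factory=list)

    def record(self, observation: str, spoken_instruction: str) -> None:
        # Store what the robot saw and what the human told it to do instead.
        self.examples.append((observation, spoken_instruction))


def build_finetune_set(base: List[Example], log: CorrectionLog) -> List[Example]:
    # Assumption: corrections are simply appended to the original data.
    return list(base) + list(log.examples)


base_data = [("sponge on table", "pick up the sponge")]
log = CorrectionLog()
log.record("bag mouth stuck", "open the bag wider")
finetune_data = build_finetune_set(base_data, log)
print(len(finetune_data))  # 2
```

Repeating this collect-then-fine-tune cycle corresponds to the iterative improvement the article describes.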

After completing basic training and being deployed in a real environment, the system can still continue to collect instruction information, continuously learn from feedback and improve itself.

About the Author

The first author of this project is Lucy X. Shi, a student researcher at Stanford University. She graduated from the High School Affiliated to Renmin University of China in 2019 and went on to the University of Southern California to major in computer science.

During this period, Lucy interned at NVIDIA, where she researched multi-modal large models and collaborated with the well-known AI scholar Dr. Jim Fan.

Her papers have been accepted at CoRL, the top robotics conference, for two consecutive years, and were also selected for NeurIPS. DeepMind also invited her to give a talk.

Lucy's mentor, Chelsea Finn, is an assistant professor of Computer Science and Electrical Engineering at Stanford. Her papers have more than 47,000 citations on Google Scholar. She also worked at Google Brain for a period of time.

Including this project, Finn always appears as the corresponding author in a series of papers published by the ALOHA team.

In addition, researchers such as Tony Z. Zhao and Sergey Levine from the ALOHA team are also co-authors of this paper.

Paper address:

  • https://arxiv.org/abs/2403.12910