Is the demo video of Devin, the world’s first AI programmer, fake? A blogger analyzed it frame by frame and found Devin’s coding skills to be subpar.

[Introduction to New Wisdom] Was Devin, the world's first AI programmer, caught faking its demo video? A YouTube blogger recently challenged the claims made by the star startup Cognition. Through frame-by-frame analysis, he found that Devin was not able to complete Upwork work independently.

Has the demo video of Devin, the world's first AI programmer that became popular all over the Internet, now been exposed as fake?


Recently, a YouTube blogger with 35 years of experience as a software engineer went through the promotional video of Devin completing an Upwork task frame by frame and tried to reproduce it.

What he found was that the AI cannot complete tasks the way a human engineer does, and that its performance was very poor.

After he published the 25-minute “Exposing Devin's Lies” video, it spread quickly across the Internet and caused an uproar on platforms such as HN and Reddit.


What's even more interesting is that the blogger himself replicated the task Devin was trying to do, which took him about 36 minutes.

Devin, by contrast, took at least 6 hours, and possibly more than a day.

Some netizens said, “As the blogger explained in detail, despite what the demonstration tries to imply, Devin cannot complete Upwork work independently. It creates confusing, overly complex code.”

Others believe that from the beginning, Devin has been more about marketing and hype than reality.

So, is Devin’s video really fake?

Frame-by-frame analysis to expose Devin’s lies

When Devin was released, the startup behind it, Cognition AI, published an official blog post introducing the “magic” of this AI through seven videos.

One of these videos shows Devin independently completing a task on Upwork, the world's largest comprehensive freelancing platform.

At the time, netizens were surprised and said they had not expected AI agents to be able to take on side jobs.

A Cognition developer, You, chose a job titled “using computer vision models for inference”. The specific requirements are:

– I want to make inferences using models from this repository. (roadDamageDetection2020)

– Your deliverable will be detailed instructions on how to operate within an EC2 instance in AWS.

– Please provide an evaluation report of your performance of this work. I will not respond to reports without evaluation.

The following is Devin's performance in completing the task in the official video.

The blogger's first observation is that Devin was not shown taking on just any job on Upwork; the researcher carefully selected this “road damage” task.

Of course, this does not mean that Devin is deceptive, but it does mean that its performance on other tasks must be worse than this.

Then, in the actual conversation, developer You's request to Devin was: “I want to use the model in this repository for inference, please clarify.”

It is worth noting that the customer's requirement, “You need to submit detailed instructions for this operation in the Amazon EC2 instance”, is clearly different from the developer's request.

Judging from the end of Devin's video, it did not actually do the job the customer asked for.

In the blogger's opinion, before completing this task, you need to know how to start the work.

This requires asking customers:

– Size, type of instance

– Would you prefer a faster but more expensive instance, or a more economical but slower instance?

– Does this system need to be constantly online?

– How should the data and images that need inference be handled? How will they be uploaded to the server? For example, you could build a web interface, upload via SSH, or put them in an S3 bucket.

– How will the output results be accessed?

These are questions you must have answers to.

All in all, the blogger said this is what he described in an earlier video as the most difficult, most critical, and most time-consuming part of a software developer's work:

Communicating with customers, managers, and other stakeholders.

These are tasks that AI cannot currently do, and they are a very important part of what developers do.

What did Devin actually do?

The following is a screenshot from the video, which shows a repo.

This is a file called requirements.txt, which specifies the versions of the libraries the code depends on.

However, some of the libraries this codebase relied on were pinned to versions from four years ago, and some of those versions are no longer available for download, so they had to be changed.

Devin updated the code, as mentioned in the video. The blogger said, “It's really amazing that Devin can do this.”

Basically, what they asked Devin for was to get inference running themselves, which differs from the client's requirements.

Devin was told to just use sample data, so that's exactly what the blogger did when reproducing Devin's operation.

Devin encountered an error early on, a command line error:

At the top, there are errors about opening an image: File Not Found, No Such File or Directory.

This error occurs in a file called visualize_detections.py. The blogger said he did not encounter this problem, because there is no file named visualize_detections.py in the code base.

Back at the command line, if you zoom in on other parts of the window, you'll see that Devin writes some content into a file called inspect_results.py and then runs Python to execute it, resulting in a syntax error.

A literal \n is not allowed in a Python file, and the echo command should not be used this way. This whole process is wrong and pointless.
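The failure can be sketched in a few lines, assuming the problem is the \n escape sequence and that echo left the backslash and the letter n in the file as two literal characters instead of a line break. The file contents below are hypothetical; the article does not show the exact text Devin wrote.

```python
# Hypothetical file contents; the exact text Devin wrote is not shown in the article.
# Assumption: echo left the two characters "\" and "n" in the file instead of a newline.
literal_backslash_n = "import json\\nprint('done')"
real_newline = "import json\nprint('done')"


def compiles(source: str) -> bool:
    """Return True if the text is syntactically valid Python."""
    try:
        # compile() only parses the text; it does not run it.
        compile(source, "inspect_results.py", "exec")
        return True
    except SyntaxError:
        return False


print(compiles(literal_backslash_n))  # False
print(compiles(real_newline))         # True
```

Python treats a backslash outside a string as a line-continuation character, so any character other than a line break after it is a syntax error.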

Devin created these files with errors and then fixed them. As mentioned in the video, Devin is actually debugging with print statements. This is a very common practice that many people use.

The video's commentary said, “Devin is adding code and tracing the data flow until he fully understands it.”

The blogger doubted this and said, “I don't believe Devin can really understand anything.”

Zooming in on this part, you can see a strange loop. It is reading a file and loading the data into a buffer. This is the file update_image_ids.py.

Again, this file does not exist in the code repository requested by the customer.

In fact, the blogger searched all of GitHub, and there were only 2 places where a file with this name existed.

Three results appear on the screen because one of them is a fork of another, and these files are completely different from the one Devin is using.

The problem is that Devin is debugging a file it created itself, and this file is not in the project's code repository at all, which is very inappropriate.

In other words, Devin is not correcting code it found online, nor is it dealing with the code specified by the customer; it is correcting faulty code that it generated itself.

What's worse is that none of this is necessary. This is the readme file in that code base.

The repository has a file called infer.py, which also appears in Devin's video.

The readme file explains its functions and usage. On the right, there's even a little button that lets you copy the entire command, paste it into the command line window, and hit enter.

The blogger believes that the person who developed this “detection of road damage” code repository has simplified the instructions as much as possible, but Devin still doesn't seem to understand.

So Devin had to create a chaotic project on his own.

As Devin is discovering, code that is complex, unwieldy, and prone to small bugs is difficult to debug.

It took half an hour to reproduce, but the AI took 6 hours.

Next, the blogger reproduces the task Devin attempted.

He said it took him about 36 minutes to complete.

The next slide shows a bug that actually needs to be fixed, on line 33 of the file dataset.py.

The problem is that the torch module has no attribute called _six.

The blogger searched the issue on Google and found a related comment on GitHub.

He modified the line of code as suggested in that comment, and it did fix the problem.

“It took me about a minute and seven seconds to fix the problem, and in that short time I had the error fixed. It was just a quick Google search.”
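The article does not show the changed line, so the following is only a sketch of the kind of fix this error usually calls for. It assumes, hypothetically, that line 33 of dataset.py used torch._six.string_classes; the SimpleNamespace objects merely stand in for a torch module with and without _six, so the sketch runs without torch installed.

```python
from types import SimpleNamespace


def string_types_for(torch_module):
    """Return the string types to use in isinstance checks.

    Falls back to the built-in types when the private _six module is absent.
    """
    six = getattr(torch_module, "_six", None)
    if six is not None and hasattr(six, "string_classes"):
        return six.string_classes
    return (str, bytes)


# Stand-ins for the real package (hypothetical, for illustration only).
old_torch = SimpleNamespace(_six=SimpleNamespace(string_classes=(str, bytes)))
new_torch = SimpleNamespace()  # no _six attribute, as in the error described

print(isinstance("image.jpg", string_types_for(old_torch)))
print(isinstance("image.jpg", string_types_for(new_torch)))
```

Either way, the change is confined to a single expression, which is consistent with the blogger fixing it in about a minute.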

Below is the specific content of the modifications made by the blogger, which is the difference between the initial state and the final state.

One modification was to requirements.txt: the file originally specified torch 1.4.0, and the blogger used the latest version, torch 2.2.2.

On the right is the last screen of Devin's video; on the left is the final output from the blogger's video.
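As a rough illustration, the version bump amounts to a one-line diff. The torch== pin syntax and the file labels are assumptions; the article only gives the two version numbers.

```python
import difflib

# Assumed pin syntax; only the version numbers come from the article.
before = ["torch==1.4.0\n"]
after = ["torch==2.2.2\n"]

diff = list(difflib.unified_diff(
    before, after,
    fromfile="requirements.txt (original)",
    tofile="requirements.txt (modified)",
))
print("".join(diff))
```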

They are both very similar. The blogger's box is yellow, Devin's is red.

According to the timestamps in Devin's official video, the work started at 3:25pm on March 9, 2024 and finished at 9:41pm, a span of just over 6 hours.
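The elapsed time can be checked with a little arithmetic, assuming both timestamps fall on the same day (the blogger notes the real duration may have been longer).

```python
from datetime import datetime, timedelta

start = datetime(2024, 3, 9, 15, 25)   # 3:25pm, from the video timestamps
end = datetime(2024, 3, 9, 21, 41)     # 9:41pm, assumed to be the same day

devin_elapsed = end - start
blogger_elapsed = timedelta(minutes=36)  # the blogger's reported time

print(devin_elapsed)                              # 6:16:00
print(round(devin_elapsed / blogger_elapsed, 1))  # 10.4
```

Under that assumption, Devin took roughly ten times as long as the blogger.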

Finally, let’s look at the effect of Devin’s work and his evaluation.

To replicate Devin's results, the blogger simply needs to set up an environment with appropriate hardware on a cloud instance and actually run two commands with the correct paths.

All of this makes it look as though Devin did a lot of work and accomplished a lot.

However, once you set up your environment, you actually only need to run 2 commands. None of these code fixes are relevant, because they fix code that Devin generated itself.

At the end of the video, researcher You said Devin did a good job. In fact, the tasks Devin completed are really cool for AI.

AI programmers, the whole Internet is buzzing

When Devin was released in early March, the entire Internet went crazy over this AI.

On the SWE-bench benchmark, its performance far exceeded that of Claude 2, Llama, GPT-4 and other models, achieving an astonishing result of 13.86%!

Not only can it learn unfamiliar technologies independently, build and deploy applications end-to-end, and correct bugs by itself, it can even train and fine-tune its own AI model!

Netizens panicked: will Devin steal our jobs? Will programmers really cease to exist?!

Even the inspirational story of the ten-person founding team behind it has been dug out.

Core founder and CEO Scott Wu, his younger brother Neal Wu and others received a total of ten IOI gold medals.

In less than a month, various AI programmers were born one after another.

Examples include SWE-agent, proposed by a Princeton team, which can fix bugs in real GitHub repositories, as well as the open source projects OpenDevin and Devika.

However, one should still have reservations about the ability of AI programmers to solve real problems.

Because, even with the help of GPT-4 Turbo's capabilities, AI is not omnipotent.

References:

  • https://x.com/0interestrates/status/1779268441226256500

  • https://www.reddit.com/r/programming/comments/1c1g0fn/debunking_devin_first_ai_software_engineer_upwork/
