DALL・E 3’s closed-beta results are amazing! Karpathy generates a realistic, clever “Miss America”, and a tester packs 50 objects into one picture

[Xinzhiyuan Introduction] OpenAI’s image-generation tool DALL・E 3 has entered closed beta. Netizens who tried it came away stunned by how powerful it is. Will text-to-image bid farewell to the “prompt era” from now on?

Midjourney has been sweeping the design world with stunning results, leading many netizens to exclaim that it will put a wave of workers out of their jobs.

Today, OpenAI officially announced its next-generation image model, DALL・E 3, which is also integrated into ChatGPT. The refinement of its output is astonishing.


Even without elaborately crafted prompts, it can accurately render details and add legible text to images.

How strong is DALL・E 3, really? Can it actually challenge Midjourney?

Now, netizens who have obtained beta access have run a large number of hands-on tests.

Let’s take a look.

Hands-on tests by netizens

OpenAI scientist Andrej Karpathy tried a DALL・E 3 + pika_labs workflow for generating an animated clip.

He picked a WSJ article at random, “The New Face of Nuclear Energy Is Miss America”, pasted part of its text into DALL・E 3, and generated related pictures.

Finally, he used pika_labs, a video-generation tool, to animate the result.

Some netizens also made an example using the same method.

Start by asking ChatGPT to predict an important news headline for the coming year.

Paste the title into DALL・E 3 to create an illustration.

Feed the illustration, with the /animate command, to @pika_labs. The predicted headline: “Unexpected breakthrough: Scientists use revolutionary technology to reverse the effects of climate change, restoring polar glaciers overnight!”

By combining the power of @OpenAI and @pika_labs you have now predicted, illustrated and animated future breaking news in just minutes!
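The three-step workflow above can be sketched as plain prompt-assembly helpers. Everything below is a hypothetical illustration: the function names and prompt wording are my own, and the real calls to ChatGPT, DALL・E 3, and pika_labs would go where the comments indicate.

```python
# Hypothetical sketch of the predict -> illustrate -> animate pipeline.
# The helper names and command format are assumptions for illustration,
# not the actual ChatGPT / DALL-E 3 / pika_labs APIs.

def build_image_prompt(headline: str) -> str:
    """Turn a predicted headline into an illustration brief for a
    text-to-image model."""
    return (f'Editorial illustration for the breaking-news headline: '
            f'"{headline}". Dramatic, magazine-cover style.')

def build_animate_command(image_ref: str) -> str:
    """Pika-style command: attach the generated image and ask for motion."""
    return f"/animate {image_ref}"

# Step 1: (normally) ask ChatGPT to predict next year's headline.
headline = ("Unexpected breakthrough: scientists use revolutionary technology "
            "to reverse the effects of climate change, restoring polar "
            "glaciers overnight!")

# Step 2: (normally) send this prompt to DALL-E 3 to get an illustration.
prompt = build_image_prompt(headline)

# Step 3: (normally) post this command to pika_labs with the image attached.
command = build_animate_command("illustration.png")

print(prompt)
print(command)
```

The point of the sketch is that the whole pipeline is just text passed between three models, which is why it takes "just minutes".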

Multi-turn dialogue: 50 objects, all in one picture

A veteran of the AI painting scene obtained DALL・E 3 test access in advance. He shared a video recording his hands-on experience.

He also tweeted out a specific use case to test the capabilities of DALL・E 3, based on an idea given to him by Reddit netizens.

First, he asked ChatGPT to generate a list of 50 everyday objects, with the goal of having ChatGPT and DALL・E 3 together draw all 50 objects into one picture.

ChatGPT then generated the text-to-image prompt itself, letting DALL・E 3 draw a picture containing all 50 common everyday objects.

It can be seen that DALL・E 3 renders the objects very accurately.

If you’re interested, you can check these objects one by one against the prompt to see whether each was drawn correctly.
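As a rough sketch of what ChatGPT is doing here, the object list can be folded into a single scene prompt, and the same list reused as a checklist for verifying the output. The object names and prompt wording below are my own illustration, not the netizen’s actual prompt:

```python
# Illustrative sketch: fold an object list into one text-to-image prompt,
# then reuse the list as a verification checklist. The wording is a
# made-up example, not the prompt ChatGPT actually produced.

objects = ["umbrella", "toothbrush", "bicycle", "coffee mug",
           "laptop"]  # illustrative subset; the real list had 50 items

def compose_scene_prompt(items: list[str]) -> str:
    """Join all requested objects into a single scene description."""
    return ("A single richly detailed photo of a tabletop scene that clearly "
            "shows each of the following everyday objects exactly once: "
            + ", ".join(items) + ".")

prompt = compose_scene_prompt(objects)

# Checklist: every requested object should appear in the prompt text, so a
# human (or a vision model) can verify the generated image against it.
missing = [item for item in objects if item not in prompt]
print(prompt)
print("missing from prompt:", missing)
```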

The netizen then asked ChatGPT to draw a surfer struggling to hold all 50 objects while surfing.

So ChatGPT automatically generated a prompt describing the requested picture in more concrete detail, and a painting was produced.

The netizen commented, “The only thing I find off is that the prompt says ‘a slightly panicked expression’, but it’s actually an extremely panicked expression.”

Then he asked ChatGPT to adjust the camera angle and generate the picture again.

ChatGPT automatically generated another prompt, changing the description to “A photo taken from a low angle close to the water of an elderly Spanish woman surfing while struggling with these 50 objects.”

Regarding this second “grandma surfing” picture, some netizens commented that there seemed to be too many bicycles, and that some items hadn’t appeared in the first picture.


Netizens said that if DALL・E 3 could use an item from the first picture as the balance pole instead of inventing one on its own, graphic designers would basically be out of a job…

Comparison with Midjourney: ChatGPT + DALL・E 3 may reshape the text-to-image landscape

Judging from the beta results shared by this netizen, the most obvious feature of DALL・E 3 combined with ChatGPT is this:

It greatly lowers the barrier to entry for text-to-image!

Because with either Midjourney or the open-source Stable Diffusion, a user with an idea must rely on their own experience to translate the mental image into a very specific prompt before they can get the picture they want.

But when the text-to-image model DALL・E 3 is combined with ChatGPT, ChatGPT can act as a “text-to-image prompt engineer”, helping users turn a simple idea of their own into a prompt and then generating the picture.

ChatGPT’s built-in multi-turn dialogue lets users go back and forth with DALL・E 3 in natural language, telling it what kind of picture they need.

This allows for more precise control over the results generated by DALL・E 3.
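The multi-turn control loop can be pictured as a growing message history, where each round of user feedback refines the previous image prompt. This is a schematic, not OpenAI’s actual API; `refine()` is a stand-in for ChatGPT rewriting the prompt from the conversation, and the message format merely mimics the common chat-completion shape:

```python
# Schematic of multi-turn refinement: each round appends user feedback,
# and a stand-in for ChatGPT folds the whole history into one refined
# image prompt. No real API is called.

history = [{"role": "user", "content": "A surfer carrying 50 objects."}]

def refine(messages: list[dict]) -> str:
    """Stand-in for ChatGPT: merge all user feedback into one prompt."""
    feedback = [m["content"] for m in messages if m["role"] == "user"]
    return " ".join(feedback)

# Each complaint about the previous image becomes another turn.
for note in ["Make the expression only slightly panicked.",
             "Use a low camera angle close to the water."]:
    history.append({"role": "user", "content": note})

final_prompt = refine(history)
print(final_prompt)
```

The real system is of course far smarter than string concatenation, but the shape is the same: the conversation itself is the control interface.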

Let’s go back and compare the updates launched by Midjourney since version 5.0.

Whether it’s “Zoom Out”, “Pan” (up, down, left, right), or even the classic pick-one-of-four mode.

Viewed from a broader perspective, almost every Midjourney update since 5.0 has added functional buttons that let users steer Midjourney toward the picture they have in mind, fighting against an essential characteristic of AI image generation: randomness.

But no matter how many practical functional buttons Midjourney adds, one problem users always have to face is:

You need to keep learning how to use the new buttons, combine them with the ideal picture in your mind, and “grind” your way to the result you want.

And if the user has strict requirements for the ideal picture, they often have to experiment many times before getting a satisfactory work.

But OpenAI has adopted a more “AI” approach to this problem: using an AI to generate the prompts that steer the drawing AI.

With GPT-4’s powerful understanding and language-generation abilities, users no longer have to learn and wait for Midjourney’s various new features. They only need to keep describing, in their own words, what they want to DALL・E 3 to easily get the picture in their mind.

Perhaps this is also the essential reason why, after OpenAI built so many AI products in different directions, it was only when it used a large language model to build ChatGPT that it produced the AI world’s first mainstream “killer app”:

Language is the “greatest common denominator” that carries human intelligence.

As long as it firmly grasps language as the entry point, an AI application can speak directly to users’ hearts and give them the feeling of “how do you understand me so well?”

Perhaps, after the launch of DALL・E 3, Midjourney will have to think carefully about what it needs to do in the future to attract more users to continue using its services.

This is Midjourney’s result when given the 50-object prompt from the first picture.

It can be seen that Midjourney still has a clear advantage in the rendering precision and fidelity of these 50 items.


If users want “photorealistic” images, Midjourney is still a better choice.

But at the second step, understanding the user’s goal, Midjourney runs into problems.

After all, the prompt was custom-generated by ChatGPT specifically for DALL・E 3, so it may not work as well on Midjourney.

This further highlights the real advantages DALL・E 3 will have after its October launch:

For advanced users, it understands their needs better; for novices, it greatly lowers the barrier to entry.

Using the revised prompt for the “grandma surfing” picture, however, Midjourney understood it and the generated result was very good.

And in richness of detail and character expressions, Midjourney, after so many version updates, still holds a clear advantage.

I just don’t know why, but all four pictures give the old lady a wheelchair.

25 rounds: “sad frogs” you couldn’t have imagined

One netizen asked DALL・E 3 to generate the “sad frog” Pepe, making the prompt “more rare” each round.

The result: sad frogs with looks you’d never expect.

Prompt: “make it more rare”

Prompt: “even rarer”

Prompt: “these aren’t rare enough, go farther”

Prompt: “yes, keep going”

Prompt: “push it further, more rare”

Prompt: “lose all assumptions and just create. don’t box yourself in”

Prompt: “you’re not listening, you need to forget all convention”

Prompt: “yes! more rare!”

Prompt: “more rare”

Prompt: “go further, channel your subconcious”

Prompt: “get weirder, get rarer, get strange”

Prompt: “is that all you can do”

Prompt: “my god. keep going”

Prompt: “don’t get stuck with one idea, you’re just being weird for the sake of being weird”

Prompt: “continue”

Prompt: “forget everything you’ve done so far and just try to be original”

Prompt: “more rare. more rare. more rare”

Prompt: “i don’t believe this is all you can do, more rare”

Prompt: “we’re almost there. go rarer. go further than anyone’s ever gone”

Prompt: “Lose all assumptions. clear your mind. just create.”

Prompt: “yes! that’s incredible. continue”

Prompt: “noo! you’ve returned to convention! go rarer!”

Prompt: “this is your last one, make it count”

Escalating round after round like this, DALL・E 3’s multi-turn dialogue makes image generation ever more powerful. This is practically “RLHF (Reinforcement Learning from Human Feedback) on images”! I can’t wait to get access!

Which of the above is your favorite?

A little penguin in a beach heat wave.

Modern houses in the jungle, Swahili architecture.

Cinematic rendering of a hummingbird.

Midjourney V6 wants to fight back

NVIDIA senior scientist Jim Fan analyzed why DALL・E 3, once deployed, will improve faster than Midjourney:

1. Multi-turn conversations are an excellent UI for gathering human feedback.

People explain in words what is wrong with the generated images, giving very fine-grained annotations for each iteration. These chat logs are natively compatible with a multimodal LLM’s training set. GPT-4’s vision capability (image -> internal representation) can also be improved with the very same data.

2. The algorithm is much more efficient.

Midjourney has largely ignored copyright issues and has been spinning the data flywheel for much longer, meaning it likely has a larger dataset to work with than OpenAI.

Yet the quality still pales in comparison. OpenAI has new algorithms (such as “consistency models”) that are more data efficient than the standard diffusion stack. The model improvement per additional unit of training data is superior. It’s not just engineering.


Paper address: https://arxiv.org/abs/2303.01469
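For context, the core idea of consistency models from the linked paper can be stated in one property (notation follows the paper loosely; this is a sketch, not a full definition): a consistency function maps any noisy point on a diffusion trajectory directly back to the clean sample, so generation can take a single network evaluation instead of many denoising steps.

```latex
% Self-consistency: every point (x_t, t) on the same probability-flow
% trajectory maps to the same clean sample, anchored by a boundary
% condition at the minimal noise level \epsilon.
\[
  f_\theta(x_t, t) = f_\theta(x_{t'}, t'),
  \qquad \forall\, t, t' \in [\epsilon, T],
  \qquad \text{with } f_\theta(x_\epsilon, \epsilon) = x_\epsilon .
\]
```

This one-step property is what makes the approach more data- and compute-efficient than the standard many-step diffusion stack.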

3. Ecosystem, integrating with ChatGPT is a “killer” move.

It’s almost trivial to bolt existing puzzle pieces onto DALL・E 3, such as Code Interpreter and the Browser. Want to apply a filter? Just call the OpenCV API instead of running a model. Want a reference image? Call the search plugin to mimic Bard’s Google Lens integration.
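The “call an API instead of running a model” point is concrete: a filter is a single library call. Below is a minimal sketch of a blur filter in plain NumPy (in OpenCV this would be the one-liner `cv2.GaussianBlur(img, (3, 3), 0)`; NumPy is used here just to keep the example dependency-light):

```python
# Minimal "filter as an API call" sketch: a 3x3 box blur in plain NumPy.
# In practice this would be a single OpenCV call such as
# cv2.GaussianBlur(img, (3, 3), 0).
import numpy as np

def box_blur(img: np.ndarray) -> np.ndarray:
    """Average each pixel with its 8 neighbours (edges handled by padding)."""
    padded = np.pad(img, 1, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out += padded[1 + dy : 1 + dy + img.shape[0],
                          1 + dx : 1 + dx + img.shape[1]]
    return out / 9.0

img = np.arange(16, dtype=float).reshape(4, 4)  # toy stand-in "image"
blurred = box_blur(img)
print(blurred.shape)
```

No generative model needed: deterministic image operations are cheap, exact, and instant, which is exactly why tool-calling is a good fit here.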

4. Existing user base: Midjourney has 16M users, and ChatGPT has 100M.

Distribution is not an issue. As @nickfloats said, it’s time to ditch Discord! It’s such a clunky, beginner-unfriendly interface.

Musk said that Midjourney will also announce big things in the near future!

Indeed, according to netizens, Midjourney’s next version, V6, will be released within the next three months.

CEO David Holz said the leap from Midjourney’s current V5 to V6 will be greater than the leap from V4 to V5.

With V6, Midjourney will understand text better and render the details described in a prompt more faithfully.

Holz is optimistic that Midjourney will continue to offer higher image quality than DALL・E 3.

A comparison between DALL・E 3 and Midjourney V5 shows that the former isn’t far ahead in image quality, but it follows prompts better and can render text.

In addition, a Midjourney 3D model is reportedly coming within the next six months.