Xie Saining, assistant professor at New York University, and others analyze the technology behind Sora: its parameter count may be only about 3 billion

Just how popular is Sora? Post one of its generated videos online and it goes viral.

A result newly uploaded by one of the authors quickly drew a crowd of onlookers.


Failure cases are addictive to watch.

Nearly 10,000 people liked it.


Academia is even more excited, and prominent researchers have been weighing in one after another.

Xie Saining, assistant professor at New York University (and first author of ResNeXt), put it bluntly: Sora will rewrite the entire field of video generation.

Jim Fan, senior research scientist at NVIDIA, exclaimed that this is video generation's GPT-3 moment! The discussion became even more interesting after the technical report was released: because many details remain unclear, the experts can only guess.

Their guesses include “Sora is a data-driven physics engine” and “Sora is built on the DiT model and may have only 3 billion parameters.” So why is Sora so impressive, and what does it mean for the field of video generation? Some possible answers have already emerged.

Video generation's GPT-3 moment

Overall, Sora is a diffusion model trained on videos and images of varying durations, resolutions, and aspect ratios, and it adopts a Transformer architecture; in other words, it is a “diffusion Transformer”.

As for technical details, the official report briefly mentions the following six points:

The first is the “innovative transformation” of visual data.

Whereas large language models use tokens, Sora uses “patches” to unify different representations of visual data.

Concretely, as the report illustrates, the model first compresses videos into a low-dimensional latent space and then decomposes that representation into spatiotemporal patches, thereby turning the video into patches. (Admittedly, that explanation does not say very much.)

The second is training a video compression network.

This network reduces the dimensionality of visual data: it takes raw video as input and outputs a latent representation that is compressed in both time and space. Sora is trained on this compressed latent space. Correspondingly, OpenAI also trained a dedicated decoder.

The third is spacetime latent patches.

Given a compressed input video, the model extracts a series of spatiotemporal patches, which serve as tokens for the Transformer. It is this patch-based representation that allows Sora to train on videos and images of different resolutions, durations, and aspect ratios.
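
To make the idea concrete, here is a minimal sketch of cutting a compressed latent video into spacetime patches. The report does not disclose Sora's patch sizes or latent shapes, so every number and name below (`spacetime_patch_origins`, the 2×2×2 patch, the latent dimensions) is an assumption chosen purely for illustration. The only point is that inputs of different durations and aspect ratios yield token sequences of different lengths, which a Transformer can consume.

```python
from itertools import product


def spacetime_patch_origins(frames, height, width, pt=2, ph=2, pw=2):
    """Return the (t, y, x) origin of every spacetime patch in a latent video.

    frames/height/width describe the *compressed* latent, not raw pixels.
    pt/ph/pw are hypothetical patch sizes; Sora's real values are not public.
    """
    if frames % pt or height % ph or width % pw:
        raise ValueError("latent dimensions must be divisible by the patch size")
    return list(product(range(0, frames, pt),
                        range(0, height, ph),
                        range(0, width, pw)))


# Two hypothetical latents with different durations and aspect ratios.
widescreen = spacetime_patch_origins(frames=16, height=18, width=32)
vertical = spacetime_patch_origins(frames=8, height=32, width=18)

# Each patch becomes one Transformer token, so the sequence length
# simply follows the shape of the input.
print(len(widescreen))  # 8 * 9 * 16 = 1152 tokens
print(len(vertical))    # 4 * 16 * 9 = 576 tokens
```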

At inference time, the model controls the size of the generated video by arranging randomly initialized patches in a grid of the appropriate size.
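
A rough sketch of that inference-time step, with a caveat: the report only says that randomly initialized patches are arranged in a grid of suitable size. The function name `init_noise_grid`, the patch size, the per-patch token dimension, and the latent shapes below are all hypothetical.

```python
import random


def init_noise_grid(frames, height, width, patch=(2, 2, 2), token_dim=4, seed=0):
    """Build a grid of Gaussian-noise patches sized for the requested output.

    Returns (grid_shape, tokens): grid_shape is the number of patches along
    time, height and width; tokens holds one noise vector per patch.
    All sizes are illustrative assumptions, not Sora's real configuration.
    """
    pt, ph, pw = patch
    grid_shape = (frames // pt, height // ph, width // pw)
    rng = random.Random(seed)
    n_tokens = grid_shape[0] * grid_shape[1] * grid_shape[2]
    tokens = [[rng.gauss(0.0, 1.0) for _ in range(token_dim)]
              for _ in range(n_tokens)]
    return grid_shape, tokens


# Asking for a different output size only changes the grid,
# not the model that would later denoise these tokens.
square_grid, square_tokens = init_noise_grid(frames=8, height=16, width=16)
wide_grid, wide_tokens = init_noise_grid(frames=8, height=18, width=32)

print(square_grid, len(square_tokens))  # (4, 8, 8) 256
print(wide_grid, len(wide_tokens))      # (4, 9, 16) 576
```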

The fourth is the finding that scaling Transformers also works for video generation.

OpenAI found that diffusion Transformers also scale effectively as video models. In the report's comparison, which uses fixed seeds and inputs, sample quality improves markedly as training compute increases.

The fifth concerns the variety of video formats.

Compared with other models, Sora can handle videos of various sizes, including different resolutions, durations, and aspect ratios.

It also does better at framing and composition. Many comparable models in the industry crop videos to a square, which leaves the subject only partially in view, whereas Sora captures the complete scene.

The report attributes this to OpenAI training directly on video data at its original size.

The last is the work on language understanding. Here, OpenAI applies the re-captioning technique introduced in DALL・E 3 to videos.

In addition to training on highly descriptive video captions, OpenAI also uses GPT to turn users' short prompts into longer, detailed descriptions before sending them to Sora. Together, these steps give Sora strong text understanding.

That is all the technical report says about the method; the rest of it is devoted to demonstrations of Sora's results, including text-to-video, video-to-video, and image generation.

As can be seen, the report does not explain core issues such as how the “patches” are designed. Some netizens joked that OpenAI is still as “Closed” as ever (doge). As a result, experts and netizens alike have offered all sorts of speculation.

Xie Saining's analysis:

1. Sora is most likely built on the diffusion Transformer, DiT.

In short, DiT is a diffusion model with a Transformer backbone: DiT = VAE encoder + ViT + DDPM + VAE decoder.
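
Of the four components in that formula, the DDPM part is the easiest to show in a few lines. The sketch below implements only the standard DDPM forward (noising) process on a toy latent vector, using the linear beta schedule from the original DDPM paper. It says nothing about how Sora itself is configured, and the helper names `alpha_bar_schedule` and `add_noise` are made up for illustration. In a DiT-style model, the Transformer would be trained to predict the noise `eps` from the noisy latent `x_t` and the timestep.

```python
import math
import random


def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, eps ~ N(0, I)."""
    abar = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(abar) * x + math.sqrt(1.0 - abar) * e
           for x, e in zip(x0, eps)]
    return x_t, eps


alpha_bars = alpha_bar_schedule()
rng = random.Random(0)
latent = [0.5, -1.0, 2.0]  # stand-in for a VAE-encoded patch (hypothetical values)

slightly_noisy, _ = add_noise(latent, t=0, alpha_bars=alpha_bars, rng=rng)
mostly_noise, _ = add_noise(latent, t=999, alpha_bars=alpha_bars, rng=rng)
```

At `t=0` the latent is almost untouched, while at `t=999` almost none of the original signal remains, which is the pure-noise regime a sampler starts from at inference time.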

Xie Saining guesses that Sora does not add many fancy extras on top of DiT.

2. Regarding the video compression network, Sora may use a VAE architecture. The difference is that it is trained on raw video data.

And since VAE is a ConvNet, DiT is technically a hybrid model.

3. Sora probably has about 3 billion parameters.

Xie Saining believes that this speculation is not unreasonable, because Sora may not really need as many GPUs for training as people think. If this is the case, Sora's later iterations will also be very fast.

NVIDIA AI scientist Jim Fan believes that:

Sora should be seen as a data-driven physics engine.

Sora is a simulation of worlds, real or fantastical, that uses some denoising and gradient descent to learn intricate rendering, “intuitive” physics, long-horizon reasoning, and semantic grounding.

For example, the prompt for one clip is a photorealistic close-up video of two pirate ships battling each other as they sail inside a cup of coffee.

In Jim Fan's analysis, Sora first has to come up with two 3D assets, pirate ships with different decorations, which means solving text-to-3D implicitly in its latent space. The two ships must also avoid each other as they sail, the fluid dynamics of the coffee must be accounted for, and the scene must stay photorealistic, with an effect resembling ray tracing.

Some argue that Sora merely manipulates pixels in 2D. Jim Fan explicitly disagrees. He feels this is like saying that GPT-4 doesn't understand coding and just samples strings.

However, he also said that Sora cannot yet replace game engine developers, because its understanding of physics is still far from sufficient and it still suffers from very serious “hallucinations”.

So he suggests that Sora is video generation's GPT-3 moment.

Back in 2020, GPT-3 was not a perfect model, but it strongly demonstrated the importance of in-context learning. So rather than dwelling on GPT-3's shortcomings, think ahead to GPT-4.

In addition, one bold netizen even suspects that Sora used Unreal Engine 5 to create some of its training data.

He even analyzed the effects in several videos one by one to support his conjecture.

However, many people refute him. Their reasons include “the shots of people walking are obviously odd, so they cannot be engine output” and “there are billions of hours of all kinds of video on YouTube, so UE5 would not add much”.

Let's set these debates aside for now.

Finally, some netizens said that although they do not expect more details from OpenAI, they still want to know whether Sora has innovations in video encoding, decoding, or additional modules for temporal interpolation.

OpenAI valued at $80 billion

While Sora attracted global attention, OpenAI's valuation rose again, making it the third most highly valued technology startup in the world.

With the completion of its latest tender offer, OpenAI's valuation officially reached $80 billion, second only to ByteDance and SpaceX.

The deal, led by venture capital firm Thrive Capital, allows outside investors to buy shares from some employees. OpenAI completed a similar deal early last year, valuing it at $29 billion at the time.

After Sora was released, GPT-4 Turbo's rate limiting was also relaxed substantially: the TPM (tokens per minute) limit was raised to 2x its previous level. President Brockman also personally promoted the change.

But at the same time, OpenAI's application to register the “GPT” trademark failed. The reason is that “GPT” is too generic.

One More Thing

It is worth mentioning that some sharp-eyed netizens noticed that Stability AI also released SVD 1.1 yesterday.

But it seems the post was deleted shortly after Sora was released.

Some commenters quipped: isn't this Wang Feng all over again? It shouldn't have been deleted; it should have stayed up to ride the wave of attention.

That, of course, is said in jest.

Others lamented that as soon as Sora arrived, they immediately understood why Zhang Nan wanted to focus on Jianying (CapCut), the video editing app.

The army of online course sellers also jumped on the news and seized the business opportunity.
