The Secret of the Domestic Sora Is Hidden in This Tsinghua-Affiliated Large-Model Team | Kuai Technology (Technology Changes the Future)

In the field of video AIGC, a powerful domestic player has emerged.

Throughout 2024, Sora has lived in the spotlight.


Musk was unstinting in his praise, declaring that "humans are ready to concede defeat." In the eyes of Zhou Hongyi, the red-clad founder of 360, Sora shortens the timeline for humanity to reach AGI to just one or two years. Even micro-merchants selling paid courses have re-sharpened their sickles around "Sora."

The craze spread from the United States to China, from the primary market to the secondary and beyond, rippling outward across the world.

The reason: under ideal conditions, the underlying logic of long-video generation approximates a world model. A video of ten-plus seconds, or tens of seconds, encodes mappings of real-world laws and knowledge: basic image processing, spatial relationships, physical laws, causal logic. Viewed narrowly, it could upend traditional film and game production. Viewed broadly, it is a key step toward artificial general intelligence.

At the same time, among long-video generation algorithms, Sora's technical breakthrough is revolutionary. Compared with traditional Stable Diffusion, the Diffusion-plus-Transformer architecture Sora adopts not only overcomes Stable Diffusion's lack of scalability but also delivers a qualitative leap in the accuracy and flexibility of the generated content.


The only catch is that Sora is not open source.

Without open source there is no reproducing it; and without reproduction, even if a managing partner with a science background swaps his bedtime reading for "Scalable Diffusion Models with Transformers," even if investment managers spend a week combing the tech parks of Beijing and Shenzhen, everyone still has to admit a hard truth: however many video large-model companies there are, the elimination round may be over before a domestic Sora is ever found.

The industry is buzzing with excitement, yet the primary market is gripped by unprecedented anxiety. Can Chinese AI companies do nothing but watch the gap with Sora widen?

01 The "Domestic Sora" Is Here?

Just as the VCs on the field were nearing despair, no one expected that the first to crack the secret of a domestic Sora would be Shengshu Technology, a large-model company founded barely more than a year ago.

Recently, Shengshu Technology and Tsinghua University jointly announced "Vidu," China's first long-video large model built on a fully self-developed U-ViT architecture. It supports one-click generation of high-definition video up to 16 seconds long at resolutions up to 1080p. Judging from the official demo reel, Vidu is nearly on par with Sora in multi-shot generation, spatiotemporal consistency, simulation of the real physical world, and imagination.

Compared with other domestic "Sora-likes," one of Vidu's most obvious distinctions is that its clips are long enough.

Ten seconds has always been a life-or-death line for any "domestic Sora." Reaching or exceeding it demands deep work on two fronts: accumulating training data, and solving the problem of the model's memory decaying over time.

Here is another official Vidu clip. In it, as a white vintage SUV drives along a hillside dirt road, its rolling tires kick up dust along a naturally coherent trajectory, and the surrounding woods, under sunlight, cast the dappled light and shadow that real-world projection would produce.

By contrast, at comparable video lengths, most domestic "Sora-likes" struggle to keep characters and scenes continuous, and struggle to truly obey the laws of the physical world: a bitten hamburger should carry bite marks, and a passing car should leave exhaust and dust in its wake.

According to industry insiders, some of the "Sora-like" models currently on the market reach longer durations by frame interpolation, inserting one or more extra frames between every two frames of the video to pad out its length.

This method processes the video frame by frame, inserting additional frames to extend its length, and the resulting footage looks stiff and slow.
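The insertion scheme described above can be sketched in a few lines. This is a hypothetical, deliberately naive version that blends neighboring frames linearly (real interpolators estimate motion between frames instead), which is exactly why the padded result tends to look stiff and smeared:

```python
import numpy as np

def interpolate_frames(frames, inserts_per_gap=1):
    """Naive frame interpolation: lengthen a clip by inserting
    linearly blended frames between each consecutive pair.
    `frames` is a list of H x W x C arrays."""
    out = []
    for a, b in zip(frames[:-1], frames[1:]):
        out.append(a)
        for k in range(1, inserts_per_gap + 1):
            t = k / (inserts_per_gap + 1)
            # Linear blend; real interpolators estimate motion,
            # which is why plain blending looks stiff and ghosted.
            out.append(((1 - t) * a + t * b).astype(a.dtype))
    out.append(frames[-1])
    return out

# Three 4x4 "frames" with constant brightness 0, 1, 2.
clip = [np.full((4, 4, 3), i, dtype=np.float32) for i in range(3)]
longer = interpolate_frames(clip, inserts_per_gap=1)
print(len(clip), "->", len(longer))  # 3 -> 5
```

With one insert per gap, a 3-frame clip becomes 5 frames: the new frame between brightness 0 and 1 is simply their average, 0.5, not a new moment of motion.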

Shengshu Technology's approach is plainly different. Built on a single model, its underlying algorithm generates the video fully end to end. Intuitively, you can see the smooth "one continuous shot" feel: the video is generated continuously from start to finish, with no trace of frame insertion.

Other long-video tools take a "new broth, same old medicine" approach: under the hood they chain together several other models, first generating a still image with Stable Diffusion or Midjourney, then turning it into a 4-second clip, then stitching clips together. In other words, a ten-plus-second video is just several 4-second clips spliced end to end. Not only does the overall smoothness of the footage suffer badly, but the underlying model achieves no real breakthrough in long-video generation.

Beyond the qualitative leap in duration, the official demo also shows Vidu producing footage that is continuous and smooth, with coherent detail and logic. Even in fast-moving shots there is almost no model distortion, ghosting, or motion that violates the laws of reality.

For a simple comparison, below is a frame from the output of one popular video-model team. Although the clip is only four seconds long, a single command to prepare to jump is enough to give the on-screen kitten six legs, or turn it into a three-tailed "phantom."


The contrast is stark enough to prompt a question: after ChatGPT's release, a wave of large-model products claiming to "reach GPT-3.5 and approach GPT-4" appeared on the market almost immediately. It is the same game of catch-up, so why are Sora-like products so much harder to produce?

The answer: not long after ChatGPT's release, Meta open-sourced LLaMA 2, and open source solved the urgent need to reproduce a domestic ChatGPT. Sora is not open source, and its technical details have not been disclosed, so the only road to a "domestic Sora" is independent research.

According to the technical report disclosed by OpenAI, the core architecture behind Sora originates from a paper titled "Scalable Diffusion Models with Transformers." The paper proposed DiT, an architecture fusing Diffusion (the diffusion model) with the Transformer, which Sora later adopted.

Coincidentally, more than two months before DiT, the Tsinghua team had proposed U-ViT, which likewise replaces the CNN-based U-Net with a Transformer. Architecturally, the two ideas are essentially the same. There was even an episode along the way: because it was submitted earlier, the top computer-vision conference CVPR 2023 accepted Tsinghua's U-ViT paper, yet rejected the DiT paper, the one underlying Sora, for "lack of novelty."

Shengshu Technology's founding team grew out of that Tsinghua paper team; the company's CTO, Bao Fan, is the paper's first author. The Vidu model released this time is built on the U-ViT architecture. In other words, Shengshu is not one of the pack chasing Sora; it has stood on the same starting line from early on, arguably even earlier.

From this it is clear that, for all its short corporate history, Shengshu Technology has deep roots.

Shenpa found that, in talent, the team's core members come from Tsinghua University's Institute for Artificial Intelligence and were the first team in China to pursue deep generative research. In technology, many of the team's results have been applied by OpenAI, Apple, Stability AI, and others in models such as DALL·E 2 and Stable Diffusion; it is currently the domestic team with the most published papers in the generative field. In backing, Shengshu has won the recognition of well-known institutions including Ant Group, Qiming Venture Partners, BV Baidu Ventures, and ByteDance's Jinqiu Fund, completing financing rounds totaling hundreds of millions of yuan.

So why is Shengshu the one that could pull all this off?

02 Why Shengshu?

Perhaps the most important answer is that Shengshu Technology chose the right technical route early.

Unlike most video-generation algorithms on the market, which use traditional diffusion models built on the U-Net convolutional architecture, both Shengshu's newly released Vidu and Sora use a fusion architecture (the U-ViT and DiT mentioned above).

The so-called fusion architecture can be understood as the fusion of Diffusion (diffusion model) and Transformer.

The Transformer architecture is best known from large language models. Its strength lies in its scaling behavior: the more parameters, the better the results. Diffusion, meanwhile, is the workhorse of traditional visual tasks (image and video generation).

The fusion architecture replaces the U-Net convolutional network commonly used in diffusion models with a Transformer, combining the Transformer's scalability with the diffusion model's natural advantage in processing visual data, and can exhibit emergent ability on visual tasks.
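Purely as an illustration (toy dimensions, random weights, nothing taken from Vidu or Sora), the core idea of letting a Transformer stand in for U-Net can be sketched like this: cut the image into patch tokens, then let global self-attention relate every patch to every other patch.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, p):
    """Split an H x W image into non-overlapping p x p patches,
    flattened into a (num_patches, p*p) token matrix - the step that
    lets a Transformer replace U-Net's convolutions."""
    h, w = img.shape
    patches = img.reshape(h // p, p, w // p, p).swapaxes(1, 2)
    return patches.reshape(-1, p * p)

def self_attention(x, wq, wk, wv):
    """Single-head attention over patch tokens: every patch attends
    to every other patch, regardless of spatial distance."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy "denoising" step: noisy image -> tokens -> attention -> noise estimate.
p, d = 4, 16                                   # patch size; token dim = p*p
img = rng.standard_normal((16, 16))            # stand-in noisy image
tokens = patchify(img, p)                      # 16 patches of 16 values each
wq, wk, wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
noise_pred = self_attention(tokens, wq, wk, wv)
print(tokens.shape, noise_pred.shape)          # (16, 16) (16, 16)
```

The payoff of the token view is that the denoiser becomes a stack of identical attention blocks, which is precisely the shape of model that scales smoothly with parameter count.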

In September 2022, the team submitted the U-ViT paper, proposing for the first time anywhere an architecture fusing the diffusion model with the Transformer. The DiT architecture, released more than two months later, adopted the same idea and was later used by Sora.

Where DiT experimented only on ImageNet, U-ViT was also tested on small datasets (CIFAR-10, CelebA), on ImageNet, and on the image-text dataset MS-COCO. Moreover, compared with a plain Transformer, U-ViT introduces "long connections," which greatly accelerate training convergence.
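A minimal sketch of the long-connection idea, with stand-in linear blocks in place of real Transformer blocks (every name and dimension here is hypothetical): features from each shallow block are concatenated back into the mirrored deep block, U-Net style, giving gradients a short path through the network.

```python
import numpy as np

rng = np.random.default_rng(1)

def block(x, w):
    """Stand-in for a Transformer block (here just a linear map + ReLU)."""
    return np.maximum(x @ w, 0.0)

def long_skip_forward(tokens, ws_down, ws_up, w_fuse):
    """U-ViT-style long connections: each shallow block's output is
    saved, then concatenated into the mirrored deep block and fused
    back to the token dimension before that block runs."""
    skips, x = [], tokens
    for w in ws_down:
        x = block(x, w)
        skips.append(x)
    for w in ws_up:
        # Long connection: bring the matching shallow features back in.
        x = np.concatenate([x, skips.pop()], axis=-1) @ w_fuse
        x = block(x, w)
    return x

n, d = 4, 8                                    # 4 tokens, 8 dims each
tokens = rng.standard_normal((n, d))
ws_down = [rng.standard_normal((d, d)) * 0.1 for _ in range(2)]
ws_up = [rng.standard_normal((d, d)) * 0.1 for _ in range(2)]
w_fuse = rng.standard_normal((2 * d, d)) * 0.1  # maps concat back to d dims
out = long_skip_forward(tokens, ws_down, ws_up, w_fuse)
print(out.shape)  # (4, 8)
```

The concatenate-then-project fusion keeps the token dimension constant while still carrying low-level detail directly to late layers, which is the property credited with faster convergence.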

The team then went deeper. In March 2023, based on the U-ViT architecture, it trained UniDiffuser, a model of nearly one billion parameters, on the large-scale image-text dataset LAION-5B, and open-sourced it. UniDiffuser supports arbitrary generation and conversion between image and text modalities.

The implementation of UniDiffuser carried an important value: it verified for the first time the fusion architecture's scalability (scaling law) on large-scale training tasks, which amounts to running the full pipeline of the fusion architecture at large scale.

It is worth noting that UniDiffuser, itself an image-text model, is a full year ahead of Stable Diffusion 3, which only recently switched to the DiT architecture.

Still, even with the same fusion architecture, the two teams charted different product paths for resource reasons: the Sora team, "barely sleeping, working flat out for a year," went all in on long video, while Shengshu chose to start from 2D images and expand step by step to 3D and then to video.

Neither route is right or wrong. Basic common sense says that a domestic startup's technical route can match OpenAI's, which shows sufficiently long-term vision; but copying OpenAI on commercialization is courting a dead end, because behind Sora stand OpenAI's technical strength and Microsoft's almost unlimited computing power, and ordinary companies lack the capital to follow suit.

Hence, through all of 2023, Shengshu concentrated its resources on images and 3D. In January this year, it officially launched 4-second short-video generation. After Sora's release in February, the company committed to the problem in earnest, breaking through to 8-second generation in March and to 16 seconds in April, with gains in both generation quality and duration.

Completing the jump from 4-second to 16-second generation in just two months is astonishingly fast.

Behind it lies not only foresight at the level of technical architecture, but also the efficient engineering experience the team accumulated by moving step by step from images to 3D to video.

Video is, in essence, images amplified along the time axis; it can be viewed as a continuous sequence of frames. Starting with images therefore lets the infrastructure work (data collection, cleaning, annotation, and efficient model training) be reused. Sora does exactly the same: it uses DALL·E 3's re-captioning technique to generate detailed descriptions for its visual training data, so that the model can follow a user's text instructions more faithfully when generating video.

Reportedly, Vidu likewise reuses much of Shengshu's experience from image-text tasks. Building on its earlier image and related work, Shengshu applies video-data compression to reduce the sequence dimension of the model's input; at the same time, its self-developed distributed training framework preserves numerical precision while doubling communication efficiency, cutting memory overhead by 80%, and speeding up training by a cumulative 40x.
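To see why reducing the sequence dimension matters, here is some back-of-the-envelope arithmetic with made-up compression factors (an 8x spatial and 4x temporal latent compressor with 2x2 patches; these are illustrative numbers, not Vidu's actual configuration):

```python
def token_count(frames, height, width, t_stride, s_stride, patch):
    """Sequence length after compressing a clip in time and space,
    then splitting each latent frame into patch tokens."""
    lt = frames // t_stride                       # latent frame count
    lh, lw = height // s_stride, width // s_stride  # latent frame size
    return lt * (lh // patch) * (lw // patch)

# Hypothetical clip: 16 s at 24 fps, 1080p.
raw = token_count(16 * 24, 1080, 1920, t_stride=1, s_stride=1, patch=2)
compressed = token_count(16 * 24, 1080, 1920, t_stride=4, s_stride=8, patch=2)
print(f"{raw:,} -> {compressed:,} tokens ({raw // compressed}x shorter)")
```

Under these assumed factors, the input shrinks from roughly 199 million patch tokens to under 800 thousand, a reduction of more than two orders of magnitude, which is what makes end-to-end training on 16-second clips tractable at all.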

The road is walked one step at a time, and the meal eaten one bite at a time. In this business contest for the "domestic Sora," picking the right technology and the right direction is the first step; developing "domestic" characteristics is equally a condition of survival. Neither is dispensable.
