Sora in its sights: Picsart AI Research and partner teams unveil StreamingT2V, a model capable of producing 1,200-frame, 2-minute videos.

Recently, Picsart AI Research and other teams jointly released StreamingT2V, which can generate videos up to 1,200 frames and 2 minutes long, surpassing Sora's duration in one fell swoop.

At the same time, as a powerful component of the open-source world, StreamingT2V is seamlessly compatible with models such as SVD and AnimateDiff.


The 120-second AI video model is here! Not only are its videos longer than Sora's, it is also free and open source!

Picsart AI Research and other teams jointly released StreamingT2V, which can generate videos of up to 1,200 frames and 2 minutes in length, and the quality is also very good.

  • Paper address: pdf/2403.14773.pdf

  • Demo trial: spaces/PAIR/StreamingT2V

  • Open-source code: Picsart-AI-Research/StreamingT2V

Moreover, the authors state that two minutes is not the model's limit. Just as Runway videos could previously be extended, StreamingT2V videos can in theory be extended indefinitely.

Before Sora, video generation models such as Pika, Runway, and Stable Video Diffusion (SVD) could generally only produce clips of a few seconds, extendable to a little over ten seconds at most.


As soon as Sora came out, its 60-second videos immediately left the other models behind. Cristóbal Valenzuela, CEO of Runway, tweeted that day: the competition has begun.

And now, sure enough, the 120-second AI video is here.

Although Sora's dominance cannot be shaken right away, the open-source camp has at least reclaimed the win on duration.

More importantly, StreamingT2V, as a powerful component of the open-source world, is compatible with projects such as SVD and AnimateDiff, further promoting the development of the open-source ecosystem:

Judging from the released examples, the results of this compatibility are still a bit rough, but improvement is only a matter of time; what matters is that the race is on.

One day we will all be able to use an "open-source Sora". Don't you think so, OpenAI?

Free to try

Currently, StreamingT2V has been open-sourced on GitHub, and a free trial is also provided on Hugging Face. The editor couldn't wait and jumped straight into testing:

However, the server load seemed too high. It is unclear whether the number shown above is a wait time; either way, the editor's attempt failed.

The current trial interface accepts both text and image prompts; the latter has to be enabled in the advanced options below.

Of the two generation buttons, Faster Preview produces a lower-resolution, shorter video.

The editor then moved to another testing platform (camenduru/streaming-t2v) and finally got a chance to run it. The text prompt was:

A beautiful girl with short hair wearing a school uniform is walking on the spring campus

However, perhaps because the editor's prompt was too demanding, the result was somewhat frightening. You can judge for yourself by trying it.

Here are some successful examples from Hugging Face:

StreamingT2V

“world masterpiece”

The emergence of Sora caused a huge sensation, instantly turning Pika, Runway, SVD, and other models that had been in the spotlight a moment earlier into works of the "pre-Sora era".

But as the authors of StreamingT2V put it, the models of the pre-Sora era have their own unique charm.

Model architecture

StreamingT2V is an advanced autoregressive method that can create long videos with rich motion dynamics and no stuttering.

It ensures temporal consistency throughout the video, stays closely aligned with the descriptive text, and maintains high frame-level image quality.

Existing text-to-video diffusion models mainly focus on high-quality short clips (usually 16 or 24 frames). When they are directly extended to long videos, problems such as quality degradation, rigid motion, or stagnation appear.

AI generated video

With StreamingT2V, videos can be extended to 80, 240, 600, 1,200 frames or even longer, with smooth transitions, and the method outperforms other models in consistency and motion.
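As a rough back-of-the-envelope check on how chunked autoregression reaches these lengths, the sketch below assumes 16-frame blocks in which 8 frames overlap with (are conditioned on) the previous block, so each new block contributes 8 new frames; these numbers are assumptions for illustration only.

```python
# Back-of-the-envelope frame counting for chunked autoregressive generation.
# Assumption (illustrative, not taken from the article): 16-frame blocks,
# 8 of which overlap with the previous block, so each new block adds 8 frames.
BLOCK_FRAMES = 16
NEW_FRAMES_PER_BLOCK = 8

def total_frames(num_blocks: int) -> int:
    """Video length after the first block plus (num_blocks - 1) extensions."""
    return BLOCK_FRAMES + (num_blocks - 1) * NEW_FRAMES_PER_BLOCK

for n in (9, 29, 74, 149):
    print(f"{n} blocks -> {total_frames(n)} frames")
# 9 blocks -> 80, 29 -> 240, 74 -> 600, 149 -> 1200 frames
```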

Key components of StreamingT2V include:

(i) a short-term memory block called the Conditional Attention Module (CAM), which modulates the current generation based on features extracted from the previous block via an attention mechanism, enabling consistent block transitions;

(ii) a long-term memory block called the Appearance Preserving Module (APM), which extracts high-level scene and object features from the first video block to prevent the model from forgetting the initial scene;

(iii) a randomized blending method that makes it possible to apply a video enhancer autoregressively to infinitely long videos without inconsistencies between blocks.

The above is the overall pipeline diagram of StreamingT2V. In the initialization stage, a text-to-video model synthesizes the first 16-frame block. In the streaming T2V stage, new content for additional frames is generated autoregressively.

Finally, in the streaming refinement stage, the generated long videos (600, 1,200 frames or more) are enhanced autoregressively by applying a high-resolution text-to-short-video model equipped with the randomized blending method mentioned above.
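To make the three stages easier to follow, here is a minimal Python-style sketch of the pipeline; the function names (`base_t2v`, `extend_block`, `enhance_blocks`) are hypothetical placeholders rather than the released StreamingT2V API.

```python
# Minimal sketch of the StreamingT2V pipeline; all helpers are hypothetical.
def streaming_t2v(prompt: str, num_blocks: int):
    # Stage 1 (initialization): a base text-to-video model synthesizes the
    # first 16-frame block.
    blocks = [base_t2v(prompt, num_frames=16)]
    anchor = blocks[0]  # the first block also provides the long-term anchor

    # Stage 2 (streaming T2V): further blocks are generated autoregressively,
    # conditioned on the previous block (CAM, short-term memory) and on the
    # anchor frame (APM, long-term memory).
    for _ in range(num_blocks - 1):
        blocks.append(extend_block(prompt, prev_block=blocks[-1], anchor=anchor))

    # Stage 3 (streaming refinement): a high-resolution text-to-short-video
    # model enhances the result block by block, with randomized blending
    # between overlapping blocks to hide the seams.
    return enhance_blocks(blocks, prompt)
```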

The figure above shows the overall structure of the StreamingT2V method: the Conditional Attention Module (CAM) serves as short-term memory, while the Appearance Preservation Module (APM) acts as long-term memory. CAM uses a frame encoder to condition the video diffusion model (VDM) on the previous block.

CAM's attention mechanism ensures smooth transitions between blocks while preserving a high amount of motion in the video.

APM extracts high-level image features from anchor frames and injects them into the text cross-attention of VDM, which helps preserve object/scene features during video generation.

Conditional attention module

The researchers first pre-train a text-to-(short-)video model (Video-LDM) and then use CAM, which supplies short-term information from the previous block, to condition Video-LDM autoregressively.

CAM consists of a feature extractor and a feature injector integrated into Video-LDM's UNet. The feature extractor uses a frame-wise image encoder E.

For feature injection, the authors have each long-range skip connection in the UNet attend to the corresponding features generated by CAM via cross-attention.

CAM takes the last F_cond frames of the previous block as input, and cross-attention allows the F frames of the base model to be conditioned on CAM.

In contrast, sparse encoders use convolution for feature injection and therefore require F − F_cond additional zero-valued frames (and a mask) as input in order to add their output to the F frames of the base model. This mismatched input to SparseCtrl leads to severely inconsistent videos.
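As a rough illustration of this kind of feature injection, the PyTorch sketch below adds CAM-style cross-attention onto a UNet skip connection; the tensor shapes and module layout are assumptions for exposition, not the released implementation.

```python
import torch
import torch.nn as nn

class CAMInjection(nn.Module):
    """Illustrative sketch: inject CAM features into one UNet skip connection
    via cross-attention (shapes and layout are assumptions, not the paper's code)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, skip_feats: torch.Tensor, cam_feats: torch.Tensor) -> torch.Tensor:
        # skip_feats: (batch, tokens, dim) features of the long-range skip connection
        # cam_feats:  (batch, cond_tokens, dim) features of the F_cond conditioning
        #             frames of the previous block, produced by CAM's frame encoder
        attended, _ = self.attn(query=self.norm(skip_feats),
                                key=cam_feats, value=cam_feats)
        # Residual addition keeps the base model's own features intact.
        return skip_feats + attended
```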

Appearance preservation module

Autoregressive video generators often forget initial object and scene features, leading to severe appearance changes.

To solve this problem, the Appearance Preservation Module (APM) integrates long-term memory by using the information contained in a fixed anchor frame from the first block. This helps maintain scene and object features across video block generations.

To let APM balance guidance from the anchor frame against guidance from the text instruction, the authors propose to:

(i) blend the CLIP image token of the anchor frame with the CLIP text tokens of the text instruction, by using a linear layer to expand the CLIP image token to k = 8 tokens, concatenating the text and image encodings along the token dimension, and applying a projection block;

(ii) introduce a weight α ∈ R (initialized to 0) for each cross-attention layer, so that cross-attention uses keys and values computed from the weighted sum x of the two encodings (a sketch of this blending follows below).
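A minimal PyTorch sketch of this token blending, assuming standard CLIP embedding dimensions; the module layout and the padding used to make the shapes line up are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class APMBlend(nn.Module):
    """Sketch of the APM token blending described above (illustrative only)."""

    def __init__(self, dim: int = 1024, k: int = 8):
        super().__init__()
        self.k = k
        self.expand = nn.Linear(dim, k * dim)      # expand 1 image token -> k tokens
        self.proj = nn.Linear(dim, dim)            # projection block (assumed shape)
        self.alpha = nn.Parameter(torch.zeros(1))  # per-layer weight, initialized to 0

    def forward(self, text_tokens: torch.Tensor, anchor_token: torch.Tensor) -> torch.Tensor:
        # text_tokens:  (batch, seq, dim)  CLIP text encoding of the prompt
        # anchor_token: (batch, dim)       CLIP image encoding of the anchor frame
        b, _, d = text_tokens.shape
        image_tokens = self.expand(anchor_token).view(b, self.k, d)
        mixed = self.proj(torch.cat([text_tokens, image_tokens], dim=1))
        # Pad the text tokens so both terms have matching length, then take the
        # weighted sum; with alpha = 0 at initialization the cross-attention keys
        # and values reduce to the unchanged text conditioning.
        text_padded = torch.cat([text_tokens, text_tokens.new_zeros(b, self.k, d)], dim=1)
        return text_padded + self.alpha * mixed
```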

Autoregressive video enhancement

To further improve the quality and resolution of the text-to-video results, a high-resolution (1280×720) text-to-(short-)video model (Refiner Video-LDM) is used to autoregressively enhance 24-frame blocks of the generated video.

Using the text-to-video model as a refiner/enhancer of 24-frame blocks works by adding a large amount of noise to an input video block and then denoising it with the text-to-video diffusion model.

However, the simple approach of enhancing each block independently results in inconsistent transitions:

The authors solve this problem by sharing noise between consecutive blocks and applying the randomized blending method.
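The sketch below illustrates the general idea in PyTorch-like pseudocode: each 24-frame block is noised and re-denoised by a high-resolution refiner, consecutive blocks share noise on their overlapping frames, and the overlap is merged at a random split point so no fixed seam appears. The `refiner.denoise` call, the overlap size, and the noise level are stand-ins, not the paper's exact settings.

```python
import random
import torch

def enhance_blocks(blocks, refiner, prompt, overlap: int = 8, noise_level: float = 0.6):
    """Illustrative sketch of autoregressive enhancement with randomized blending.
    `blocks` are assumed to be consecutive 24-frame tensors overlapping by
    `overlap` frames; `refiner` stands in for a high-resolution text-to-video
    diffusion model. Values and names are assumptions, not the released code."""
    enhanced = []
    shared_noise = None
    for block in blocks:  # block: (frames, channels, height, width)
        noise = torch.randn_like(block)
        if shared_noise is not None:
            # Consecutive blocks share the noise on their overlapping frames.
            noise[:overlap] = shared_noise
        shared_noise = noise[-overlap:]

        # SDEdit-style enhancement: add noise, then denoise with the refiner.
        out = refiner.denoise(block + noise_level * noise, prompt)

        if enhanced:
            # Randomized blending: cut the overlap at a random position, taking
            # earlier frames from the previous block and later frames from the
            # current one, so no fixed seam shows up between blocks.
            split = random.randint(1, overlap - 1)
            prev_tail = enhanced[-1][-overlap:]
            enhanced[-1] = enhanced[-1][:-overlap]
            out = torch.cat([prev_tail[:split], out[split:]], dim=0)
        enhanced.append(out)
    return torch.cat(enhanced, dim=0)
```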

Comparison test

The image above is a visual comparison of DynamiCrafter-XL and StreamingT2V, using the same prompts.

The X-T slice visualization shows that DynamiCrafter-XL suffers from severe block inconsistencies and repetitive motion, while StreamingT2V transitions seamlessly and keeps evolving.

Existing methods are not only prone to temporal inconsistencies and video stagnation, but they are also affected by changes in object appearance/features and video quality degradation over time (such as SVD in the figure below).

The reason is that, by conditioning only on the last frame of the previous block, they ignore the long-term dependencies of the autoregressive process.

In the visual comparison above (80-frame, autoregressively generated videos), StreamingT2V produces long videos without motion stagnation.

What can AI long videos do?

Everyone is working on video generation, and the most intuitive application scenario may be movies or games.

Movie clips generated with AI (Pika, Midjourney, Magnific):

Runway even held an AI film festival:

But what is the other answer?

World model

The virtual worlds created by long videos are an ideal training environment for agents and humanoid robots, provided they are long enough and realistic enough (i.e., consistent with the logic of the physical world).

Maybe one day in the future, there will also be a living space for us humans.

References:

  • Picsart-AI-Research/StreamingT2V
