2023: The Rise of AI Video Generation – Gen-2/Pika Set to Dominate Market, A Complete Evaluation

2023 was also the breakout year for AI video. What popular applications were born over the past year, and what problems does the video generation field still face? Over the past year, we have witnessed the birth of popular products such as Gen-2 and Pika in the AI video space.

Justine Moore from a16z has taken a detailed inventory of the current state of AI video generation, compared the different models, and laid out the technical challenges that remain unsolved.


AI video generation explosion

2023 was a breakthrough year for AI video. At the start of the year, there was not a single public text-to-video model.

In just 12 months, dozens of video generation products appeared and won over tens of thousands of users around the world.

However, these AI video generation tools are still fairly limited: most can only generate 3–4 seconds of video, quality is often uneven, and problems such as character consistency remain unsolved.

In other words, we are still far from being able to create a Pixar-level short film with a single text prompt, or even a series of prompts.


However, the progress we’ve made in video generation over the past year suggests that the world is in the early stages of a massive transformation — similar to what we’ve seen in image generation.

We see that text-to-video models are constantly improving, and branches such as image-to-video and video-to-video are also booming.

To help make sense of this explosion of innovation, a16z tracked the companies most worth watching so far, as well as the open questions that remain in the space.

Where can you generate AI video today?

21 video generation products

So far this year, a16z has tracked 21 public products.

While you may have heard of Runway, Pika, Genmo, and Stable Video Diffusion, there are many others to explore.

These products mostly come from startups, and many of them began as Discord bots, which has several advantages:

– No need to build your own consumer-facing interface, just focus on model quality

– Can leverage Discord’s base of 150 million monthly active users for distribution

– Public channels provide an easy way for new users to get creative inspiration (by viewing other people's creations)

However, as the technology matures, we are beginning to see more and more AI video products building their own websites and even apps.

While Discord provides a great launch platform, it limits the workflows that can be built on top of pure generation, and teams have very little control over the consumer experience.

It’s worth noting that a large number of people don’t use Discord because they find the interface cluttered and confusing.

Research and technology

Where are Google, Meta and the rest?

They're conspicuously absent from public product listings — although you may have seen their posts about models like Emu Video, VideoPoet, and Lumiere.

So far, major technology companies have basically chosen not to make their AI video products public.

Instead, they have published papers on video generation and released demo videos, rather than products that anyone can try.

For example, Google’s text-to-video model Lumiere

These companies have huge distribution advantages, with billions of users across their existing products.

So why don't they release video models and capture a huge share of this emerging category?

The main reason is that legal, safety, and copyright concerns often make it difficult for these large companies to turn research into products, delaying launches. This gives newcomers a chance to gain a first-mover advantage.

What’s next for AI video?

If you've ever used any of these products, you know that there's still a lot of room for improvement before AI video becomes mainstream.

Occasionally, an AI video tool produces a "magic moment" where the generated clip matches the prompt exactly, but this is relatively rare. More often, you'll need to regenerate several times and then crop or edit the output to get a professional-looking clip.

Most companies in this space focus on solving a few core problems:

– Controllability: Can you control both how things move in the scene (for example, camera movement) and whether the output matches the prompt (if the prompt is "someone walking forward", does the action look as described)? On the first point, many products have added controls that let you zoom or pan the camera, or even add special effects.

– The second point, whether the action works as described, has proven harder to solve: it comes down to the quality of the underlying model (whether it understands the meaning of the prompt and can generate accordingly), although some companies are working to give users more control before generation.

A good example of this is Runway's motion brush, which lets users highlight a specific area of an image and determine how it moves.

– Temporal consistency: How do you keep characters, objects, and backgrounds consistent from frame to frame, without them morphing into something else or distorting?

This is a very common problem among all publicly available models.

If you see a coherent video today that's longer than a few seconds, it's most likely video-to-video: someone shot a video and then used a tool like AnimateDiff prompt travel to change its style (a minimal sketch of that prompt-scheduling idea follows this list).

– Length: Producing longer films is closely tied to temporal consistency.

Many companies limit the length of generated videos because they cannot ensure that the video will remain consistent after a few minutes.

If you see an extremely long AI video, know that it is stitched together from many short clips.
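On the video-to-video point above: "prompt travel" boils down to assigning different prompts to different keyframes and blending between them as the video plays. Below is a minimal sketch of that scheduling logic in Python; the prompt_map format and the linear blending weights are illustrative assumptions, not the actual AnimateDiff configuration.

```python
# Illustrative sketch of "prompt travel": different prompts take over at
# different keyframes, and frames in between blend the neighbouring prompts.
# The prompt_map format and blend weights are assumptions for clarity,
# not the real AnimateDiff prompt-travel config.
from bisect import bisect_right

prompt_map = {
    0:  "a person walking down a city street, anime style",
    48: "a person walking down a city street, watercolor painting",
    96: "a person walking down a city street, pixel art",
}

def prompts_for_frame(frame: int, prompt_map: dict[int, str]) -> list[tuple[str, float]]:
    """Return (prompt, weight) pairs for one frame, blending between keyframes."""
    keys = sorted(prompt_map)
    i = bisect_right(keys, frame) - 1
    if i < 0:                      # before the first keyframe
        return [(prompt_map[keys[0]], 1.0)]
    if i == len(keys) - 1:         # at or after the last keyframe
        return [(prompt_map[keys[i]], 1.0)]
    lo, hi = keys[i], keys[i + 1]
    t = (frame - lo) / (hi - lo)   # 0.0 at keyframe lo, 1.0 at keyframe hi
    return [(prompt_map[lo], 1.0 - t), (prompt_map[hi], t)]

for f in (0, 24, 48, 72, 120):
    print(f, prompts_for_frame(f, prompt_map))
```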

Unresolved issues

When will the ChatGPT moment for video come?

In fact, we still have a long way to go and need to answer the following questions:

1 Are current diffusion architectures suitable for video?

Today's video models are built on diffusion models: they basically generate frames and try to create time-consistent animations between them (there are multiple strategies to do this).

They have no intrinsic understanding of 3D space and how objects should interact, which explains warping/morphing.
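To make the temporal-consistency problem concrete, here is a toy numerical sketch (not a real diffusion model): sampling each frame independently produces heavy flicker, while tying each frame to the previous one smooths the sequence. The random "sampler" and the blending coefficient are stand-ins for the cross-frame attention and conditioning strategies real models use.

```python
# Toy illustration of why per-frame generation flickers and how cross-frame
# conditioning reduces it. The "sample" step is just random noise standing in
# for a real diffusion sampler; only the conditioning structure is the point.
import numpy as np

rng = np.random.default_rng(0)
H = W = 8          # tiny "image"
num_frames = 16

def sample_frame() -> np.ndarray:
    """Stand-in for an independent diffusion sample of one frame."""
    return rng.normal(size=(H, W))

# 1) Independent frames: no relationship between t and t+1 -> heavy flicker.
independent = [sample_frame() for _ in range(num_frames)]

# 2) Conditioned frames: each new frame is pulled toward the previous one,
#    a crude stand-in for temporal attention / latent conditioning.
alpha = 0.8        # how strongly frame t+1 is tied to frame t (assumed value)
conditioned = [sample_frame()]
for _ in range(num_frames - 1):
    conditioned.append(alpha * conditioned[-1] + (1 - alpha) * sample_frame())

def mean_frame_diff(frames):
    """Average absolute change between consecutive frames (a crude flicker metric)."""
    return float(np.mean([np.abs(a - b).mean() for a, b in zip(frames, frames[1:])]))

print("frame-to-frame change, independent :", round(mean_frame_diff(independent), 3))
print("frame-to-frame change, conditioned :", round(mean_frame_diff(conditioned), 3))
```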

2 Where does high-quality training data come from?

Compared with models for other modalities, video models are harder to train, mainly because there is far less high-quality training data for them to learn from. Language models are typically trained on public datasets such as Common Crawl, while image models are trained on labeled text-image datasets such as LAION and ImageNet.

Video data is harder to come by. While there is no shortage of publicly accessible videos on platforms like YouTube and TikTok, these videos are unlabeled and not diverse enough.
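One common workaround this hints at is to turn unlabeled video into (caption, clip) training pairs by sampling frames and running an image captioner over them. The sketch below only shows the shape of such a pipeline; caption_image is a stub standing in for any off-the-shelf captioning model, and the clip length and frame choice are arbitrary assumptions.

```python
# Sketch of turning unlabeled video into (caption, clip) training pairs.
# caption_image() is a stub for an off-the-shelf image captioner; clip length
# and frame sampling are arbitrary choices made for illustration.
from dataclasses import dataclass

@dataclass
class VideoTextPair:
    video_path: str
    start_sec: float
    end_sec: float
    caption: str

def caption_image(frame) -> str:
    # Placeholder: in practice this would call an image/video captioning model.
    return "a person walking through falling snow on a city street"

def make_pairs(video_path: str, duration_sec: float, clip_len: float = 4.0):
    """Split a video into fixed-length clips and caption one frame per clip."""
    pairs = []
    t = 0.0
    while t + clip_len <= duration_sec:
        middle_frame = None  # placeholder: decode the frame at t + clip_len / 2
        pairs.append(VideoTextPair(video_path, t, t + clip_len,
                                   caption_image(middle_frame)))
        t += clip_len
    return pairs

for p in make_pairs("street_scene.mp4", duration_sec=20.0):
    print(p)
```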

3 How will these use cases be broken down between platforms/models?

What we see in almost every content modality is that no single model "wins" for every use case. For example, Midjourney, Ideogram, and DALL-E all have different styles and excel at generating different types of images.

If you test out today's text-to-video and image-to-video models, you'll find that they excel at different styles, types of motion, and scene compositions.

Prompt: "Snow falling on a city street, photorealistic" — sample outputs from Genmo, Runway, Stable Video Diffusion, and Pika Labs.

4 Who will lead the video production workflow?

And constantly hopping back and forth between many different products is a pain for creators.

Beyond pure generation, making a good clip or film usually requires editing, especially in the current paradigm where many creators use video models to animate images created on another platform.

It's not uncommon to see videos that start with an image from Midjourney, get animated in Runway or Pika, and are then upscaled in Topaz.

The creator then takes the video to an editing platform like CapCut or Kapwing and adds a soundtrack and voiceover, often generated on other products like Suno and ElevenLabs.
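Taken together, that multi-tool workflow is essentially a linear pipeline. The skeleton below simply names the stages; every function is a hypothetical placeholder rather than a real API of any of the products mentioned, so treat it as a map of the workflow, not working integration code.

```python
# Hypothetical outline of the multi-tool workflow described above. None of
# these functions correspond to real product APIs; they only name the stages.

def generate_still(prompt: str) -> str:
    """Stage 1: create a starting image in an image model. Returns a file path."""
    return "still.png"

def animate(image_path: str) -> str:
    """Stage 2: image-to-video animation of the still."""
    return "clip_raw.mp4"

def upscale(video_path: str) -> str:
    """Stage 3: upscale / enhance the generated clip."""
    return "clip_upscaled.mp4"

def add_audio(video_path: str, voiceover_text: str) -> str:
    """Stage 4: edit and add a generated soundtrack and voiceover."""
    return "final.mp4"

def run_workflow(prompt: str, voiceover_text: str) -> str:
    still = generate_still(prompt)
    clip = animate(still)
    clip = upscale(clip)
    return add_audio(clip, voiceover_text)

print(run_workflow("snow falling on a city street, photorealistic",
                   "A quiet evening in the city."))
```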

References:

  • https://a16z.com/why-2023-was-ai-videos-breakout-year-and-what-to-expect-in-2024/
