Microsoft advances realism with VASA-1 model

The speed at which generative AI is developing is astonishing, as the latest generative AI model unveiled by Microsoft demonstrates. In a scientific paper, supported by numerous examples of the model's capabilities, nine researchers from the company present VASA, a framework for generating realistic talking faces of virtual characters from a single static image and a speech audio clip.

The result is a first model, VASA-1, capable not only of producing lip movements precisely synchronized with the audio, but also of capturing a wide spectrum of facial nuances and natural head movements that contribute to the perception of authenticity and liveliness.

Ability to process images and audio outside the training data

The researchers rely on a method of their own design, combining a holistic model of facial dynamics with diffusion-based head movement generation operating in a latent face space, together with the construction of that expressive, disentangled face latent space learned from videos.
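
To give a rough idea of what "diffusion-based motion generation in a latent face space" can look like, here is a minimal, purely illustrative Python sketch. It is not the researchers' code (which has not been released); the module names, dimensions and the heavily simplified denoising loop are all assumptions meant only to convey the general idea of producing a sequence of motion latents conditioned on audio, which a separate face decoder would then render into frames.

```python
# Illustrative sketch only: not Microsoft's VASA-1 code (none has been released).
# All names, dimensions and the simplified denoising loop are assumptions.
import torch
import torch.nn as nn

LATENT_DIM = 256   # assumed size of the disentangled motion latent
AUDIO_DIM = 128    # assumed size of per-frame audio features
NUM_FRAMES = 40    # one second of video at the reported 40 FPS

class MotionDenoiser(nn.Module):
    """Toy stand-in for a diffusion model that predicts noise on motion latents."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + AUDIO_DIM + 1, 512),
            nn.SiLU(),
            nn.Linear(512, LATENT_DIM),
        )

    def forward(self, noisy_motion, audio_feat, t):
        # The diffusion timestep t is broadcast as one extra conditioning scalar per frame.
        t_embed = t.expand(noisy_motion.shape[0], noisy_motion.shape[1], 1)
        return self.net(torch.cat([noisy_motion, audio_feat, t_embed], dim=-1))

@torch.no_grad()
def generate_motion(denoiser, audio_feat, steps=50):
    """Very simplified reverse-diffusion loop over a sequence of motion latents."""
    motion = torch.randn(audio_feat.shape[0], NUM_FRAMES, LATENT_DIM)
    for step in reversed(range(steps)):
        t = torch.full((1,), step / steps)
        pred_noise = denoiser(motion, audio_feat, t)
        # Crude update for illustration; a real DDPM/DDIM sampler uses a noise schedule.
        motion = motion - pred_noise / steps
    return motion

# Usage: random "audio features" drive one second of motion latents; a face decoder
# (not shown) would combine them with the appearance latent of the source photo
# to render the final 512x512 frames.
audio = torch.randn(1, NUM_FRAMES, AUDIO_DIM)
motion_latents = generate_motion(MotionDenoiser(), audio)
print(motion_latents.shape)  # torch.Size([1, 40, 256])
```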

“Our method is capable of processing photo and audio inputs that are not part of the training set,” they specify. For example, the tool can handle artistic photos, singing audio and non-English speech, data types that were not present in the model's training set.

Realistic avatars very (too) close to human faces

In a series of examples on its research page, the team of researchers shows portraits of virtual, non-existent identities generated by StyleGAN2 or DALL-E 3 (with the exception of the Mona Lisa). What impresses most here is not only the lip-audio synchronization, but also the wide spectrum of emotions, expressive facial nuances and natural head movements that contribute to the perception of realism and liveliness of these faces.

Shared on social media: a ranking of the ten best VASA-1 video sequences.

“Our method not only provides high video quality with realistic facial and head dynamics, but also supports online generation of 512×512 videos at up to 40 FPS with negligible starting latency (170 ms, measured on a desktop PC with a single NVIDIA RTX 4090 GPU, editor's note),” they specify. For them, this “paves the way for real-time engagements with realistic avatars that emulate human conversational behaviors.”

Ubiquitous risks

The researchers indicate that they are exploring in parallel “visual-affective skills for virtual, interactive characters that do not impersonate any person in the real world.” While they are keen to clarify that “this is a research demo only and there is no product release plan or API,” the researchers themselves discuss the risks arising from the VASA framework and the VASA-1 model.

“(Our research) is not intended to create content that is used to deceive or mislead. However, like other content generation techniques, it could be misused to impersonate human beings.” Opposed to the creation of misleading or harmful content depicting real people, the researchers say they are “interested in applying our technique to advance forgery detection. Currently, videos generated by this method still contain identifiable artifacts, and numerical analysis shows that there is still a gap to achieve the authenticity of real videos.”
