Microsoft's VASA-1 AI Model Can Generate Deepfakes Using a Single Photo and Speech Audio Track

You’ve seen Flawless TrueSync; now check out Microsoft’s VASA-1 AI model, which can generate deepfakes using a single photo and a short speech audio track. The generated clips feature lip movements that are precisely synchronized with the audio, while also capturing a wide range of facial nuances and natural head motions.

To achieve this level of liveliness in a deepfake, Microsoft developed a holistic facial dynamics and head movement generation model that works in a face latent space, which is itself learned from videos. The current model supports online generation of 512×512 videos at up to 40 FPS with negligible starting latency, paving the way for real-time engagement with lifelike avatars that emulate human conversational behaviors. There’s no word yet on whether VASA-1 will be released to the public.

“Our method exhibits the capability to handle photo and audio inputs that are out of the training distribution. For example, it can handle artistic photos, singing audios, and non-English speech. These types of data were not present in the training set,” said the researchers.