Exploring AI-Generated Video and Audio

Digital Rabbit
Jul 24, 2024

Updated: Dec 23, 2024

Wouldn't it be great if you could write a script and have a clone of yourself record a video reading that script? You could then upload that video to YouTube. Why would anyone want to do this? People who make instructional videos (how-to, educational, and so on) know that making a video takes a long time, whereas writing the script takes much less. In essence, making a "deepfake" of yourself would save you time.

VASA-1 is a system created by Microsoft Research and described as "Lifelike Audio-Driven Talking Faces Generated in Real Time." Its page starts with:

TL;DR: single portrait photo + speech audio = hyper-realistic talking face video with precise lip-audio sync, lifelike facial behavior, and naturalistic head movements, generated in real time.

No reading is necessary. Just take a look at the generated videos on their page. The videos are quite impressive and motivated me to search for a publicly available product. Nothing I found is as good as VASA-1, but Runway shows promise. It can create a clone of your voice, make a movie from a still image or a text description, lip-sync text to a human face (but not to animals), and much more. After trying the free version, I signed up for a Standard membership to unlock more capabilities. Unfortunately, voice cloning is available only in the pricier plans.

The Free Version

I uploaded the Digital Rabbit logo I created several years ago. The 4-second video (the length of every video in the free version) gives just a hint of motion. I like it.

My eventual goal with this tech is to make a video of myself giving a review. So I uploaded a few images of myself and typed a few words to try out the lip sync feature. Runway provides many different voices to use for the text I provide. So it's my face, my words, and someone else's voice.

This version puffs out my lips unnaturally (for me). The voice I chose also dictated a personality that is a bit more space cadet than I'd like. Note the three circles on the bottom right. That watermark appears in the free version.

I used a different voice for this one. You'll be able to see how the personality changes.

I chose a different image for this one. The head position is more static, but the big lips and expression are not me.

These videos convinced me I would have to sign up for a paid plan to get access to the most current generative model. But first, I tried the text-to-movie feature. The prompt: A bearded man in 18th century clothing walking in an art gallery. The video does well until the man sprouts a third arm in the last second.

I used the same prompt but started with a different seed value. This video is better because it is more zoomed in on the man, so there is no opportunity to show extra limbs, a typical problem with AI-generated images.

The Standard Plan

Upgrading provided access to better generative models, faster generation, and more features. I uploaded a photo of myself and a short audio recording that I made. The limit for audio in the Standard plan is 40 seconds. I recorded myself reading the start of a story authored by Chatty (aka ChatGPT) about a lost dog. This video is the best attempt of my afternoon session with Runway.

Runway has many tutorials. You can choose lighting, camera angle, and many other options that could liven up the video. I would like to vary the expression, move the head, and get the eyes to move around a bit so they look more natural.

Runway had some spectacular failures when I used it, but most had to do with the prompts I wrote. Runway interprets phrases literally, so I had to change how I wrote prompts. Runway does have lots of examples among the tutorials. It's clear I need to work through them all to get the best results.