Chinese tech giant ByteDance has developed a new AI system to turn a single photo into a video of a person talking, singing, and moving naturally.
The new technology, OmniHuman-1, was announced on Wednesday and marks the next step in AI-generated media, going beyond earlier models that could animate only faces or upper bodies.
How OmniHuman-1 works
OmniHuman-1 uses deepfake technology to generate AI avatars whose speech and expressions are realistically synchronised. Unlike traditional deepfake models, it produces facial movements and full-body gestures that align seamlessly with the provided audio.
ByteDance researchers trained the model on diverse text, voice, and motion datasets to improve video realism. The model can handle various body proportions and aspect ratios, making it adaptable to applications ranging from close-up facial animations to full-body portrayals.
According to ByteDance researchers, OmniHuman-1 surpasses current audio-conditioned human video-generation techniques.
Experts say this tech could revolutionise filmmaking, online learning, and virtual communication. But there are also concerns about misuse in creating deepfakes or misleading content.
ByteDance researchers plan to share more details at an upcoming computer vision conference, though they haven’t said when. As AI-generated content advances, tools like OmniHuman-1 show how quickly technology is reshaping the way we create and consume media.
ByteDance said its new model, trained on roughly 18,700 hours’ worth of human motion data, could create video clips of any length within memory limits and adapt to different input signals.
Realism and reactions
OmniHuman-1 demonstration videos have already gained significant attention. One notable example features a digitally recreated Albert Einstein delivering a speech with remarkable realism.
In the sharp black-and-white video, Einstein speaks at a blackboard, gesturing with his hands and showing subtle facial expressions: “What would art be like without emotions? It would be empty,” he says. “What would our lives be like without emotion? They would be empty of values.”
It is like watching the famous theoretical physicist deliver a university lecture from the past, yet the footage looks as though it were shot today.
Freddy Tran Nager, a clinical associate professor of communications at USC’s Annenberg School for Communication and Journalism, viewed the sample videos and said, “If you were thinking of reviving Humphrey Bogart and casting him in a film, I’m not sure how they would look. But on a small screen, especially on a phone, these are impressive”.
Matt Groh, an assistant professor specialising in computational social science, noted the impact of this technology in a post on X, stating, “The realism of deepfakes just reached a whole new level with Bytedance’s release of OmniHuman-1”.
OmniHuman-1 has not yet been released for public use.