Vidu. With Panda Optimization. Seriously.
Edition 21 - Chinese startup Shengshu Technology in collaboration with Tsinghua University unveiled Vidu, a text-to-video generator. Brush up on your Chinese text prompts.
Listen to the podcast here or on Spotify.
Vidu press release:
Since the release of Sora, the battle for "domestic Sora" has begun. But when the industry focuses on the "long" feature, they all ignore that behind Sora is actually the improvement of comprehensive effects, such as consistency, realism, aesthetics, etc. in long time series.
From the perspective of comprehensive effects, "Vidu" is the first and only video model to fully benchmark against Sora at the effect level, not only domestically, but also globally. It is also the first video model to achieve a breakthrough after Sora.
Chinese startup Shengshu Technology and Tsinghua University have officially unveiled China’s answer to OpenAI’s Sora, Vidu. The AI-powered text-to-video app can generate 16-second clips at 1080p resolution with a single click.
While its clips are considerably shorter than Sora’s 60-second videos, Vidu is the best China currently offers. The new text-to-video software was unveiled over the weekend at the Zhongguancun Forum in Beijing.
Vidu to rival OpenAI’s Sora
Vidu is said to be built on a novel architecture dubbed the Universal Vision Transformer (U-ViT). According to The Global Times, the developers claim this architecture combines two approaches to generative AI: diffusion models and Transformers.
A report on Medium says that U-ViT allows Vidu to generate strikingly realistic videos featuring dynamic camera movements, intricate facial expressions, and natural lighting and shadows. This architecture seemingly enables a high level of realism and control over the video generation process when working from text inputs.
As reported by Marktechpost (4/27/24), Vidu offers built-in support for distinctly Chinese cultural icons:
Vidu has been thoughtfully designed with a deep understanding of Chinese cultural elements. It is capable of generating visuals that incorporate iconic Chinese symbols such as pandas and the mythical loong (dragon), resulting in greater resonance with local content creators and audiences.
However, unlike OpenAI’s ChatGPT, which launched in November 2022 and was quickly followed by a swathe of Chinese copies, Sora has not been matched by Chinese rivals until now. Industry experts have cited inadequate computing power as a significant obstacle to Chinese companies’ progress.
Li Yangwei, a technical consultant in the intelligent computing sector based in Beijing, told the South China Morning Post (SCMP) that Sora requires eight NVIDIA A100 graphics processing units (GPUs) running for over three hours to generate a one-minute video clip.
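To put Li's estimate in perspective, here is a back-of-the-envelope sketch in Python. It uses only the figures quoted above and assumes that "over three hours" means at least exactly three hours, so the results are lower bounds on an estimate, not measured numbers.

```python
# Rough GPU-hours arithmetic based on the figures Li Yangwei quoted to SCMP.
# Assumption: "over three hours" is treated as a lower bound of 3.0 hours.
GPUS = 8            # NVIDIA A100 GPUs, as quoted
HOURS = 3.0         # lower bound on wall-clock time
CLIP_SECONDS = 60   # one-minute video clip

# Total compute spent on one clip, in GPU-hours.
gpu_hours = GPUS * HOURS

# Compute spent per second of finished video, in GPU-minutes.
gpu_minutes_per_video_second = gpu_hours * 60 / CLIP_SECONDS

print(f"At least {gpu_hours:.0f} GPU-hours per one-minute clip")
print(f"At least {gpu_minutes_per_video_second:.0f} GPU-minutes per second of video")
```

By this estimate, every second of finished video costs at least 24 GPU-minutes, which helps explain why computing power is cited as the bottleneck.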
Vidu (fun) vs VIDU (sales)
The name choice, however, causes some confusion if you attempt to find it on the internet. There is an existing product of the same name (but capitalized) that is, in fact, a tool for sales teams.
“Unfortunately for us, they chose the name ‘Vidu’ for their AI model. We’ve been using the name VIDU for our product since 2021,” VIDU explains.
Who’s on First?
According to media reports, the core U-ViT technology that powers Vidu was first proposed by the system's research team back in September 2022. This predates DiT (Diffusion Transformer), the architecture behind Sora, which is touted as the world's first model to combine the strengths of both diffusion and Transformer models. U-ViT's earlier unveiling suggests Vidu's developers were ahead of the curve in combining these two architectures for generating video from text inputs.
Interested in pursuing?
I find that Vidu has a good amount of work to do. Using Midjourney with Runway Gen-3 might produce a better-looking final product, but Runway's clips are shorter (as of today), and there's nothing better than a turnkey system. However, you might need to learn Chinese.