Google Research Unveils Lumiere, a Novel Video Generation AI
Lumiere, the latest video generation model from Google Research, has been introduced by a team of researchers. Leveraging a probabilistic diffusion model built on a spatio-temporal U-Net architecture, Lumiere can generate realistic and coherent 5-second videos from text prompts or still images. It also lets users stylize the generated videos to their preferences or create cinemagraphs by animating only a selected part of an image.
While image generation models like Adobe Firefly, DALL-E, Midjourney, Imagen, or Stable Diffusion have garnered enthusiasm and swift adoption, the logical next step was video generation. Meta AI moved in this direction in October 2022 with Make-A-Video, while NVIDIA's Toronto AI Lab unveiled a high-resolution text-to-video synthesis model built on Stability AI's open-source Stable Diffusion. Stability AI itself presented Stable Video Diffusion in November 2023, showcasing a highly performant model.
Video generation is a far more complex task than image generation: in addition to the spatial dimensions, it involves a temporal one. The model must not only generate each pixel correctly but also predict how it evolves over time to produce a coherent, smooth video.
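To make that extra dimension concrete, here is a minimal illustration; the tensor shapes are generic conventions (frames, height, width, channels), not values taken from the Lumiere paper:

```python
import numpy as np

# Illustration only: shapes follow common deep-learning conventions,
# not the Lumiere implementation.

# A single RGB image: (height, width, channels)
image = np.zeros((256, 256, 3), dtype=np.float32)

# A 5-second clip at 16 fps adds a temporal axis: (frames, height, width, channels)
video = np.zeros((80, 256, 256, 3), dtype=np.float32)

# A video model must keep every spatial detail of each frame consistent with
# how that detail moves across all 80 frames, not just within one frame.
print(image.shape, video.shape)  # (256, 256, 3) (80, 256, 256, 3)
```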
For Lumiere, Google Research, which recently contributed to the development of the W.A.L.T video generation model, opted for an innovative approach to overcome specific challenges related to training text-to-video models.
The Lumiere model consists of a base model and a spatial super-resolution model. The base model generates low-resolution video clips by processing the video signal at multiple spatio-temporal scales, building on a pre-trained text-to-image model. The spatial super-resolution model then increases the spatial resolution of the clips, using a MultiDiffusion technique to keep the overall result continuous.
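The sketch below illustrates this two-stage cascade under stated assumptions: the functions `generate_base_clip` and `upscale_window`, as well as the window and overlap sizes, are hypothetical placeholders standing in for the base model and the super-resolution model, which are not public APIs. Only the overlapping-window blending, in the spirit of MultiDiffusion, is shown concretely.

```python
import numpy as np

def generate_base_clip(prompt: str, frames=80, size=128) -> np.ndarray:
    """Stage 1 (placeholder): the base model returns a low-resolution clip (T, H, W, 3)."""
    rng = np.random.default_rng(0)
    return rng.random((frames, size, size, 3), dtype=np.float32)

def upscale_window(window: np.ndarray, scale=4) -> np.ndarray:
    """Stage 2 (placeholder): spatial super-resolution of one short temporal window."""
    return window.repeat(scale, axis=1).repeat(scale, axis=2)

def super_resolve(clip: np.ndarray, window=16, overlap=4, scale=4) -> np.ndarray:
    """Upscale overlapping temporal windows and average the overlaps
    (MultiDiffusion-style blending) so the full clip stays continuous."""
    T, H, W, C = clip.shape
    out = np.zeros((T, H * scale, W * scale, C), dtype=np.float32)
    weight = np.zeros((T, 1, 1, 1), dtype=np.float32)
    start = 0
    while start < T:
        end = min(start + window, T)
        out[start:end] += upscale_window(clip[start:end], scale)
        weight[start:end] += 1.0
        if end == T:
            break
        start = end - overlap  # overlap windows so seams can be averaged away
    return out / weight

low_res = generate_base_clip("a cat surfing a wave")
high_res = super_resolve(low_res)
print(low_res.shape, high_res.shape)  # (80, 128, 128, 3) (80, 512, 512, 3)
```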
The researchers explain:
"We introduce a spatio-temporal U-Net architecture that generates the entire temporal duration of the video in one go, in a single pass through the model. This contrasts with existing video models that synthesize distant keyframes followed by temporal super-resolution, an approach that intrinsically complicates achieving global temporal coherence."
Applications
The model can easily be adapted to a variety of video content creation and editing tasks, such as stylized video generation, image-to-video generation, video inpainting and outpainting, and cinemagraph creation, as demonstrated in the video below.
Inpainting allows missing or damaged parts of a video to be filled in or restored realistically. It can be used to replace unwanted objects, repair artifacts (unwanted anomalies or alterations) or corrupted areas, or even create special effects.
Video outpainting, on the other hand, refers to extending the video beyond its existing boundaries by adding new content. It can be used to enlarge the scene, create smooth transitions between clips, or add decorative or contextual elements.
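Inpainting, outpainting, and cinemagraphs share a common masking idea, sketched below under stated assumptions: the user marks which pixels may change, and the rest of the video is copied from the source. The `generate_masked_region` function is a hypothetical stand-in for the generative model; only the compositing step is shown concretely, and this is not presented as Lumiere's exact conditioning scheme.

```python
import numpy as np

def generate_masked_region(video: np.ndarray) -> np.ndarray:
    """Placeholder for the model's output over the editable region."""
    rng = np.random.default_rng(0)
    return rng.random(video.shape, dtype=np.float32)

def edit_video(video: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """mask == 1 where content may be (re)generated, 0 where it is kept.
    Shapes: video (T, H, W, 3); mask (T, H, W, 1), broadcast over channels."""
    generated = generate_masked_region(video)
    return mask * generated + (1.0 - mask) * video

T, H, W = 80, 128, 128
source = np.zeros((T, H, W, 3), dtype=np.float32)

# Cinemagraph: animate only a small square region, keep everything else still.
mask = np.zeros((T, H, W, 1), dtype=np.float32)
mask[:, 40:80, 40:80, :] = 1.0
result = edit_video(source, mask)
print(result.shape)  # (80, 128, 128, 3)
```

For inpainting the mask covers the region to repair or replace; for outpainting it covers the new border area added around the original frame.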
The Lumiere model was evaluated on 113 textual descriptions and the UCF101 dataset. It achieved competitive results in terms of Fréchet Video Distance (FVD) and Inception Score, and users preferred it over competing methods for its visual quality and motion coherence.
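For context, FVD compares real and generated videos through the Fréchet distance between Gaussian fits of their feature statistics. In the real metric the features come from a pretrained video classifier (typically I3D); the sketch below uses placeholder feature arrays, so only the distance formula itself is shown.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^(1/2)) over (N, D) feature matrices."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # numerical noise can introduce tiny imaginary parts
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2.0 * covmean))

rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 64))          # placeholder "real video" features
gen = rng.normal(loc=0.1, size=(1000, 64))  # placeholder "generated video" features
print(round(frechet_distance(real, gen), 3))
```

Lower FVD indicates that the generated videos' feature statistics are closer to those of real videos.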
While the model demonstrated strong performance, the researchers emphasize:
"Our primary goal in this work is to enable novice users to creatively generate visual content flexibly. However, there is a risk of misuse for creating false or harmful content with our technology, and we believe it is crucial to develop and apply tools to detect biases and malicious use cases to ensure safe and fair use."