After Meta showed off Make-A-Video, Google has responded. The company has unveiled Imagen Video, its system for generating video from a written description. The announcement comes just a few months after Google introduced Imagen, its text-to-image system, which shows how quickly these new text-to-video AI models are being developed.
1280 x 768 videos
Google claims it can generate videos from text at a resolution of 1280 x 768 pixels and 24 frames per second. The company says it has confirmed the findings of earlier work on image generation with diffusion models and transferred them to video generation. The project site shows videos generated from prompts such as “A teddy bear running in New York”, “A drone flying over a snow-covered rainforest” and “A teddy bear washing dishes”.
To achieve this result, Google builds on Imagen. For this first system, which turns text into images, the company explains that it relies on large language models to understand the text and on diffusion models to generate high-resolution images. Google reports that large generic language models (such as T5), pre-trained on text-only corpora, are effective at encoding text for image generation.
Increasing the size of the language model in Imagen improves both sample fidelity and how closely the image matches the text, more than increasing the size of the image diffusion model does. As a result, the company promises “an unprecedented degree of photorealism”.
Models trained on multiple databases
For Imagen Video, Google trains its model on the open-source image-text dataset LAION-400M as well as 14 million video-text pairs and 60 million image-text pairs. A first video is generated from the text at three frames per second and a resolution of 24 x 48. This video is then upscaled, and the model generates additional frames for the final rendering.
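To get a feel for how much detail the model has to add between the first low-resolution video and the final rendering, the figures quoted in this article can be compared directly. The sketch below is only illustrative arithmetic. It assumes that "24 x 48" means 24 pixels high by 48 pixels wide, and the constant and function names are made up for this example rather than taken from Google.

```python
from fractions import Fraction

# Figures quoted in the article. Assumption: "24 x 48" is height x width.
BASE_WIDTH, BASE_HEIGHT, BASE_FPS = 48, 24, 3
FINAL_WIDTH, FINAL_HEIGHT, FINAL_FPS = 1280, 768, 24


def upscaling_factors():
    """Return the overall factors between the base video and the final video.

    Hypothetical helper for illustration; it does not model how the
    upscaling is actually split into stages.
    """
    return {
        # How many times more frames per second the final video has.
        "temporal": Fraction(FINAL_FPS, BASE_FPS),
        # How many times wider and taller each frame becomes.
        "width": Fraction(FINAL_WIDTH, BASE_WIDTH),
        "height": Fraction(FINAL_HEIGHT, BASE_HEIGHT),
        # How many times more pixels each frame contains.
        "pixels_per_frame": Fraction(
            FINAL_WIDTH * FINAL_HEIGHT, BASE_WIDTH * BASE_HEIGHT
        ),
    }


factors = upscaling_factors()
for name, value in factors.items():
    print(f"{name}: {float(value):.2f}x")
```

Under these assumptions, the final video has 8 times as many frames per second as the first one, and each frame holds several hundred times as many pixels, which is why most of what the viewer sees is produced in the upscaling steps.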
Google also claims that Imagen Video can create videos inspired by the work of some famous artists, generate rotating 3D objects while preserving their structure, and render text in different animation styles.
However, Google acknowledges that “these generative models can be misused, for example to create false, hateful, explicit or harmful content.” Filters exist to limit such uses, but “there are always prejudices and social stereotypes that are hard to detect and filter out.” Google has therefore decided not to release the Imagen Video model or its source code until these concerns are addressed. This is a key point at a time when fake news and deepfakes are spreading widely on the Internet.