Meta's MovieGen: Raising the Bar for AI-Generated Video Content
Hey AI Explorers! One of the most exciting AI innovations of this week is Meta's MovieGen. This powerful tool is part of the growing trend of multimodal AI, meaning it can create synchronized video and audio from simple text prompts. Imagine generating a professional-quality video for your next marketing campaign or creative project without spending hours on editing; it's all done with AI. Here's the landing page with example videos from Meta: https://ai.meta.com/research/movie-gen/

I don't believe MovieGen is available to the public yet, but of course you can use other tools to make realistic AI videos. It's just interesting because, in addition to OpenAI's Sora, Creatify, and the other companies that have put their time and attention into AI video, Meta has poured substantial resources into gaining market share as well. The two main things I've seen set it apart are video length and the integration of AI videos into social media. Meta is working to make its videos long form (at least a minute).

Here's a quote from the whitepaper discussing the competition, whose generated audio is capped at roughly 15 seconds:

"There are a few products offering video-to-audio capabilities, including PikaLabs and ElevenLabs, but neither can really generate motion-aligned sound effects or cinematic soundtracks with both music and sound effects. PikaLabs supports sound effect generation with video and optionally text prompts; however, it will generate audio longer than the video, where a user needs to select an audio segment to use. This implies under the hood it may be an audio generation model conditioned on a fixed number of key image frames. The maximum audio length is capped at 15 seconds without joint music generation and audio extension capabilities, preventing its application to soundtrack creation for long-form videos. ElevenLabs leverages GPT-4o to create a sound prompt given four image frames extracted from the video (one second apart), and then generates audio using a TTA model with that prompt. Lastly, Google released a research blog describing their video-to-audio generation models that also provide text control. Based on the video samples, the model is capable of sound effects, speech, and music generation. However, the details (model size, training data characterization) about the model and the number of samples (13 samples with 11 distinct videos) are very limited, and no API is provided. It is difficult to conclude further details other than the model is diffusion-based and that the maximum audio length may be limited as the longest sample showcased is less than 15 seconds."