AI Generative Video: The Future of Video Production

New high-definition text-to-video AI model, “Sora”, advances the state of generative AI technology.

If you are keeping up with the revolutionary changes that artificial intelligence is bringing to the table, then you probably already know about generative AI. Generative AI models create content by “inferring” what the next word or output should be. This is how image generation models like DALL·E 3 and Stable Diffusion work: you input text and get an image in response. But did you know that soon you might be able to do the same for video? That is the promise of models like OpenAI’s new “Sora”, which has already achieved an impressive level of video quality. Keep reading to learn more about generative AI for video and what it might bring in the future.


What is Generative AI Video?


Generative AI video refers to artificial intelligence models that take text instructions and convert them into video output. Sora is one such video generation model. According to OpenAI, “Sora is an AI model that can create realistic and imaginative scenes from text instructions.” It can generate videos of up to 1 minute in length. Sora has made waves in the AI field since its showcase on February 15, 2024, when OpenAI first released high-definition sample videos demonstrating what AI video generation is capable of, opening the doors for future models to be trained in similar ways.


How Does Generative AI Video Work?


The Sora model is trained jointly on videos and images of variable resolution, duration, and aspect ratio. It uses “spacetime” representations of these videos (data that captures both the image content and its position in time) to learn to generate videos of different sizes with surprisingly accurate continuity. Before this, building an AI model that could generate video was difficult: models would rapidly lose track of their context, and objects would “disappear” or “appear” out of thin air. This is still an issue that needs to be addressed, but the current improvements are promising and already produce video content that is very believable. See below for a diagram that sums up the AI video training and generation process.
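To make the “spacetime” idea concrete, here is a minimal Python sketch (our own simplification, not OpenAI’s code) of how a video clip can be cut into patches that span both a region of the frame and a short span of time, playing a role for video roughly analogous to the one text tokens play for a language model:

```python
import numpy as np

# A minimal illustration (not OpenAI's code) of cutting a video into
# "spacetime" patches: small blocks spanning a region of the frame AND a
# short span of time, roughly analogous to text tokens in an LLM.

def extract_spacetime_patches(video, patch_t=4, patch_h=16, patch_w=16):
    """video: array of shape (frames, height, width, channels)."""
    t, h, w, c = video.shape
    # Trim so every dimension divides evenly into whole patches.
    t, h, w = t - t % patch_t, h - h % patch_h, w - w % patch_w
    video = video[:t, :h, :w]
    # Split into a grid of (patch_t x patch_h x patch_w) blocks...
    blocks = video.reshape(t // patch_t, patch_t,
                           h // patch_h, patch_h,
                           w // patch_w, patch_w, c)
    # ...group the block indices together, then flatten each block into
    # one vector, yielding a sequence of patch "tokens".
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5, 6)
    return blocks.reshape(-1, patch_t * patch_h * patch_w * c)

# Example: a 16-frame 64x64 RGB clip becomes 64 patch tokens of 3,072 values.
clip = np.random.rand(16, 64, 64, 3)
print(extract_spacetime_patches(clip).shape)  # (64, 3072)
```

Because each patch carries its own position in both space and time, the model can be trained on clips of different resolutions and durations, which is part of what makes this approach flexible.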


Diagram showing an overview of Sora training and use


Applications for Generative AI Video


The first use case that comes to mind for an AI video generation model is generating stock video. The time, money, and effort required to create simple videos will be dramatically reduced, and their quality improved. There are other really interesting uses as well. OpenAI showcases prompt-based video editing, which lets you “request” changes to an original video, such as changing its style or even completely changing the geographic location in which an event occurs. They also demonstrate combining and transitioning between videos. Generative AI video capabilities will take video editing and effects to a new level, with far less effort required.

However, OpenAI mentions other use cases as well. For example, they state that such models could show a “promising path towards building general purpose simulators of the physical world”. Such an application could benefit many areas, from physics simulations used for research to improved graphics in video games and movies with lower computing requirements.


Limitations


OpenAI’s Sora is an incredible model that has achieved groundbreaking results in AI generative video; however, it is important to remember that there are limitations:

  • Expressiveness: There could be a specific need that might be difficult to convey through text.
  • Duration: The output is currently limited to 1-minute-long videos.
  • Physics: The model struggles to simulate some physical interactions, such as glass breaking or objects interacting with each other (like a person taking a bite of food), and generated videos sometimes show objects moving in bizarre ways (e.g. boats on water, bicycles, people walking).
  • Closed Model: Sora has only been showcased and is not available for public use. Even after it is released as a service, based on OpenAI’s recent track record, it is most likely never going to be open source, meaning we probably will not be able to download the model to run on our own servers. Though it is unlikely to happen soon, given the massive amounts of computation required, we are crossing our fingers for an open-source breakthrough.


FAQs

  • What is Sora’s underlying technology?
    • Sora uses a transformer-based AI model, trained on spacetime video and image data, to create a text-conditional diffusion model. Whereas LLMs operate on text tokens, Sora operates on visual “spacetime” patches (see the sketch after this FAQ).
  • How long are the videos generated by Sora?
    • According to OpenAI’s latest statements, Sora can generate videos of up to 1 minute in length (with no audio).
  • Can Sora handle complex prompts?
    • Sora was trained using labeled videos with highly descriptive captions as input. This enables Sora to generate high-quality videos that accurately follow user prompts.
  • Is Sora available for public use?  
    • Currently, the Sora model is closed and not available for public use. There is still no known date for its release.
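
To illustrate the “text-conditional diffusion” answer above, here is a heavily simplified Python sketch of the general technique: start from pure noise and iteratively denoise it, conditioned on a text embedding. The `embed_text` and `denoiser` functions below are hypothetical stand-ins (OpenAI has not published Sora’s actual components); only the overall loop is the point:

```python
import numpy as np

# A heavily simplified sketch of text-conditional diffusion sampling.
# `embed_text` and `denoiser` are hypothetical stand-ins for a learned
# text encoder and a diffusion transformer; this only shows the loop.

def embed_text(prompt: str) -> np.ndarray:
    # Stand-in: a real system would use a trained text encoder here.
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal(512)

def denoiser(x: np.ndarray, cond: np.ndarray, step: int) -> np.ndarray:
    # Stand-in for the model that predicts the noise to remove, given
    # the current noisy patches and the text conditioning.
    return 0.05 * x

def generate_video_patches(prompt, num_patches=64, patch_dim=3072, steps=50):
    cond = embed_text(prompt)
    # Start from pure noise in patch space...
    x = np.random.standard_normal((num_patches, patch_dim))
    # ...and iteratively subtract the predicted noise, conditioned on text.
    for step in reversed(range(steps)):
        x = x - denoiser(x, cond, step)
    return x  # a real system decodes these patches back into video frames

patches = generate_video_patches("a boat sailing through a cup of coffee")
print(patches.shape)  # (64, 3072)
```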

External Links

OpenAI's Official Sora Page
