Presented by OpenAI: Sora, an advanced video generation model, part 2

OpenAI says a commercial launch will come later. Alongside safety testers, the company is sharing the model with a small group of filmmakers and artists to gather feedback on how to make Sora more useful to creative professionals. “The other goal is to show everyone what is on the horizon, to give a preview of what these models will be capable of,” Ramesh explains.

The team behind Sora adapted the technology from OpenAI's latest text-to-image model, DALL-E 3, to make it work with video. Like most text-to-image models, DALL-E 3 uses what's known as a diffusion model, which learns to turn a jumble of random pixels into an image.
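
To make that idea concrete, here is a minimal sketch of a diffusion-style generation loop in Python. The `toy_denoiser` is a hypothetical stand-in for a trained neural network; this illustrates the general technique, not OpenAI's actual code.

```python
import numpy as np

def toy_denoiser(x, t):
    # A trained network would predict and remove the noise at step t;
    # this placeholder just pulls pixel values toward mid-gray.
    return x + 0.1 * (0.5 - x)

def generate(shape=(64, 64, 3), steps=50, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(loc=0.5, scale=1.0, size=shape)  # start from pure noise
    for t in reversed(range(steps)):
        x = toy_denoiser(x, t)  # each step removes a little more noise
    return np.clip(x, 0.0, 1.0)  # the finished "image"

image = generate()
print(image.shape)  # (64, 64, 3)
```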

Sora takes this approach and applies it to moving pictures rather than still ones. But the researchers also folded in another technique: unlike DALL-E and most other generative video models, Sora combines its diffusion model with a transformer neural network.

Transformers excel at processing long sequences of data, such as words. They are the secret sauce behind large language models like OpenAI's GPT-4 and Google DeepMind's Gemini. But videos are not made of words, so the researchers had to find a way to chop videos into chunks that could be treated as if they were. They diced videos up across both space and time. Brooks compares it to cutting little cubes from a stack of video frames.
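
Here is a rough sketch of that dicing step, assuming a simple numpy array as the video; the cube sizes are made-up illustrations, since the patch dimensions Sora actually uses are not public.

```python
import numpy as np

def video_to_cubes(video, t=4, p=16):
    """Cut a video array of shape (frames, height, width, channels) into
    spacetime cubes of t frames by p x p pixels, returning one flat
    vector per cube, ready to be treated as a sequence of tokens."""
    f, h, w, c = video.shape
    video = video[: f - f % t, : h - h % p, : w - w % p]  # trim ragged edges
    f, h, w, c = video.shape
    cubes = video.reshape(f // t, t, h // p, p, w // p, p, c)
    cubes = cubes.transpose(0, 2, 4, 1, 3, 5, 6)  # gather each cube's pixels
    return cubes.reshape(-1, t * p * p * c)       # (num_cubes, cube_size)

clip = np.random.rand(16, 64, 64, 3)  # stand-in for a 16-frame RGB clip
tokens = video_to_cubes(clip)
print(tokens.shape)  # (64, 3072): 64 cubes, each a 4x16x16x3 block
```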

The transformer inside Sora can then process these video chunks much as the transformer inside a large language model processes words in a block of text. The researchers say this let them train Sora on far more types of video than other text-to-video models, varying in resolution, length, aspect ratio, and orientation. “It really helps the model,” adds Brooks. “We don't know of any existing work on that.”
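
Continuing the sketch above, those cube vectors can be fed to a standard transformer encoder the way an LLM consumes word embeddings. The layer sizes here are arbitrary illustrations, not Sora's architecture.

```python
import torch
import torch.nn as nn

cube_size, d_model = 3072, 256             # 4x16x16x3 cubes, small embed dim
embed = nn.Linear(cube_size, d_model)      # one embedding vector per cube
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=4)

cubes = torch.randn(1, 64, cube_size)      # a batch of 1 clip, 64 cube "tokens"
out = encoder(embed(cubes))                # attention runs across the cubes
print(out.shape)                           # torch.Size([1, 64, 256])
```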

“From a technical perspective it seems like a very significant leap forward,” says Sam Gregory, executive director of Witness, a human rights organization that focuses on the use and misuse of video technology. “But there are two sides to the coin,” he argues. “The expressive capabilities allow many more people to tell stories through video. But there are also real opportunities for abuse.”

Gregory points out that technology like this could be used to misinform people about protest locations or conflict zones. The range of styles on offer is also intriguing, he says. If you could generate shaky footage that looked like something shot with a phone, the video would come across as more authentic.

OpenAI is building on the safety testing it did last year for DALL-E 3. Sora already includes a filter that runs on all prompts sent to the model, blocking requests for violent, sexual, or hateful images, as well as images of known people. Another filter inspects the frames of generated videos and blocks material that violates OpenAI's safety policies.
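
As a loose illustration of the first of those filters, here is a toy prompt check in Python. OpenAI's real filters are trained classifiers, not keyword lists, and the blocked terms below are placeholders.

```python
# Toy illustration only: real moderation pipelines use trained classifiers
# over prompts and generated frames, not a keyword list like this one.
BLOCKED_TERMS = {"violent", "sexual", "hateful"}  # placeholder terms

def allow_prompt(prompt: str) -> bool:
    words = set(prompt.lower().split())
    return not (words & BLOCKED_TERMS)

print(allow_prompt("a golden retriever surfing a wave"))  # True
print(allow_prompt("a violent street scene"))             # False
```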

OpenAI is also adapting the fake-image detector it developed for DALL-E 3 for use with Sora, and it will embed industry-standard C2PA tags, metadata that describes how a piece of media was generated, in all of Sora's output. But these steps are far from foolproof. Fake-image detectors are hit or miss, and metadata is easy to strip: most social media platforms delete it from uploaded images by default.
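
For a sense of what such provenance tagging records, here is a loose Python sketch of a C2PA-style manifest. The actual standard cryptographically signs this data and embeds it in the media file itself (see c2pa.org); the field values here are illustrative assumptions, not the spec.

```python
# Loose sketch of the kind of provenance record C2PA carries; field
# values are illustrative, not taken from the C2PA specification.
import json

manifest = {
    "claim_generator": "example-video-tool/1.0",  # hypothetical tool name
    "actions": [{"action": "c2pa.created",
                 "digitalSourceType": "trainedAlgorithmicMedia"}],
}
print(json.dumps(manifest, indent=2))
# The catch described above: if a platform strips metadata on upload,
# this provenance record disappears along with it.
```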

“Before it would make sense for us to release this, we will definitely need to get more feedback and learn more about the different types of risks that need to be addressed with video,” says Ramesh.

Brooks agrees. “Part of the reason we are talking about this research now is so that we can start getting the input we need to do the work necessary to figure out how it could be deployed safely,” he says.
