How artificial intelligence models generate videos

A clear primer on how artificial intelligence turns text prompts into short videos, covering diffusion, latent compression, transformers, and the first models that sync audio and image. The article explains why the results can be impressive, inconsistent, and energy intensive.

This year saw rapid advances in artificial intelligence video generation, with public releases such as OpenAI's Sora, Google DeepMind's Veo 3, and Runway's Gen-4, and the first mainstream use of the technology in a Netflix visual effect. Demo reels showcase the best outputs, and services such as Sora and Veo 3 are now available to paying subscribers inside apps like ChatGPT and Gemini. Wider availability has let casual creators produce remarkable clips, but it has also brought more low-quality output, greater misinformation risk, and far higher energy use than image or text generation.

At the heart of most modern systems are diffusion models. During training, a diffusion model learns to reverse a process that progressively adds random noise to images. Shown many images at varying noise levels, the model learns how to turn a noisy mess back into a coherent image. For text-guided generation, a second model trained to pair text with images (often a text encoder or large language model) steers each denoising step so the output matches the user's prompt. The same basic technique can be applied to sequences of frames to create video clips.
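
The noising-and-denoising idea can be sketched in a few lines of Python. This is a minimal illustration rather than any production system's code: it assumes a common linear noise schedule, a toy four-pixel "image", and hypothetical helper names (`make_alpha_bars`, `add_noise`, `recover_clean`). In place of a trained network it feeds in the true noise, only to show that a good noise estimate is enough to recover the clean signal.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction left after each step of a linear noise schedule."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, alpha_bar, rng):
    """Forward process: blend the clean signal with Gaussian noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    xt = [signal_scale * x + noise_scale * n for x, n in zip(x0, noise)]
    return xt, noise

def recover_clean(xt, predicted_noise, alpha_bar):
    """Invert the forward formula using an estimate of the noise."""
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [(x - noise_scale * n) / signal_scale
            for x, n in zip(xt, predicted_noise)]

rng = random.Random(0)
alpha_bars = make_alpha_bars(1000)
image = [0.1, 0.5, 0.9, 0.3]  # toy stand-in for pixel values

# Corrupt the image to the noise level of step 500.
noisy, true_noise = add_noise(image, alpha_bars[500], rng)

# Assumption: a trained model would predict the noise from `noisy`;
# here the true noise stands in for that prediction.
restored = recover_clean(noisy, true_noise, alpha_bars[500])
```

In a real system the noise estimate comes from a neural network and is imperfect, so generation proceeds over many small denoising steps rather than one exact inversion.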

To reduce the huge compute costs of operating on raw pixels, many systems use latent diffusion. Frames and prompts are encoded into a compressed latent space that keeps essential features while discarding extraneous data. The diffusion process then works on these smaller representations and the compressed results are decoded back into watchable video. Latent diffusion is far more efficient than operating on full-resolution pixels, but video generation still requires an eye-popping amount of computation.
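
A quick back-of-the-envelope calculation shows why working in latent space helps. The compression factors below (8x along each spatial axis, 4x in time, 16 latent channels) are illustrative assumptions, not the published settings of any particular model, and `count_values` is a hypothetical helper.

```python
def count_values(frames, height, width, channels):
    """Number of scalar values needed to represent a clip."""
    return frames * height * width * channels

# A 5-second, 24 fps, 720p RGB clip in raw pixels.
frames, height, width = 120, 720, 1280
raw_values = count_values(frames, height, width, channels=3)

# Assumed compression: 4x in time, 8x per spatial axis, 16 latent channels.
time_factor, space_factor, latent_channels = 4, 8, 16
latent_values = count_values(
    frames // time_factor,
    height // space_factor,
    width // space_factor,
    latent_channels,
)

ratio = raw_values / latent_values
print(f"raw: {raw_values:,}  latent: {latent_values:,}  ratio: {ratio:.0f}x")
```

Under these assumptions the diffusion model handles 48 times fewer values per clip. The remaining cost is still large because every denoising step must process the whole latent clip.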

To keep objects consistent from frame to frame, recent systems combine diffusion with transformers. Transformers excel at processing long sequences, so models slice video across space and time into chunks that act like sequence elements. This approach helps prevent objects from popping in and out of existence and allows training on diverse formats, from vertical phone clips to cinematic widescreen. OpenAI's Sora helped establish this latent diffusion transformer architecture, which has become a standard in recent generative video work.
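
Slicing a clip across space and time can be sketched as follows. This is a toy illustration on nested lists, not the patching code of any real model; the function names and the 2x2x2 patch size are assumptions. It shows that clips with different aspect ratios become sequences of identically sized chunks, differing only in sequence length.

```python
def make_video(frames, height, width):
    """Toy single-channel video whose values encode their own (t, y, x) position."""
    return [[[t * 10000 + y * 100 + x for x in range(width)]
             for y in range(height)]
            for t in range(frames)]

def to_spacetime_patches(video, pt, ph, pw):
    """Cut a [frames][height][width] video into flattened pt x ph x pw chunks."""
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    if frames % pt or height % ph or width % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    tokens = []
    for t0 in range(0, frames, pt):
        for y0 in range(0, height, ph):
            for x0 in range(0, width, pw):
                token = [video[t][y][x]
                         for t in range(t0, t0 + pt)
                         for y in range(y0, y0 + ph)
                         for x in range(x0, x0 + pw)]
                tokens.append(token)
    return tokens

# A tall clip and a wide clip, both cut into 2x2x2 chunks.
vertical = to_spacetime_patches(make_video(4, 8, 4), pt=2, ph=2, pw=2)
widescreen = to_spacetime_patches(make_video(4, 4, 12), pt=2, ph=2, pw=2)

print(len(vertical), len(widescreen))          # sequence lengths differ
print(len(vertical[0]), len(widescreen[0]))    # chunk sizes match
```

Because each chunk spans more than one frame, a transformer attending over the sequence sees how a region changes over time, not just how it looks in a single frame.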

Multimodal advances continue. A key leap in Veo 3 is joint audio and video generation: both are compressed into a single representation so that the diffusion process produces sound and images in lockstep, enabling lip sync and synced sound effects. Large language models are still generally built on transformers and generate text one token at a time, but research is blurring the lines: Google DeepMind is experimenting with diffusion for text generation, which can be more efficient than that token-by-token approach. Expect diffusion techniques to play a growing role across generative media.
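
One simple way to picture a shared audio and video representation is to attach to each frame the slice of audio that plays during it. The sketch below is purely illustrative and is not a description of how Veo 3 works internally; `pack_joint`, `unpack_joint`, and the tiny feature sizes are hypothetical.

```python
def pack_joint(video_latents, audio_samples, samples_per_frame):
    """Pair each frame's features with the audio chunk that plays during it."""
    if len(audio_samples) != len(video_latents) * samples_per_frame:
        raise ValueError("audio length must equal frames * samples_per_frame")
    joint = []
    for i, frame in enumerate(video_latents):
        start = i * samples_per_frame
        chunk = audio_samples[start:start + samples_per_frame]
        joint.append(list(frame) + list(chunk))
    return joint

def unpack_joint(joint, video_dim):
    """Split the joint sequence back into video features and an audio track."""
    video = [step[:video_dim] for step in joint]
    audio = [s for step in joint for s in step[video_dim:]]
    return video, audio

video_latents = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 frames, 2 features each
audio_samples = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]        # 2 audio values per frame

joint = pack_joint(video_latents, audio_samples, samples_per_frame=2)
video_back, audio_back = unpack_joint(joint, video_dim=2)
```

Because every element of `joint` holds sound and picture for the same moment, a single denoising pass over that sequence would update both together, which is the intuition behind generating them in lockstep.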
