Can world models unlock general-purpose robotics?

World models aim to help robots learn physics from large-scale video instead of relying mainly on hand-built simulators and scarce robot-specific data. Early results are promising, but major questions remain around consistency, tactile sensing, speed, and economics.

The main constraint on general-purpose robotics is data. Unlike language models, which trained on vast amounts of already digitized internet text, robotics lacks a comparable corpus of robot experience. Teleoperation data is expensive to gather because it depends on physical hardware, human operators, and real-world environments. Aggregate robotics data costs are projected to exceed $3 billion within the next two years, spanning every modality: on-embodiment and off-embodiment, video and teleoperation, tactile, and force. World models offer a different path: by learning physical dynamics from video, they give robots a way to build physical intuition and simulate future outcomes before acting.

Traditional simulators remain useful, but they struggle with the complexity of real-world manipulation. Rigid-body locomotion can be modeled well, while grasping deformable or fragile objects involves contact, friction, and material behavior that current simulators do not capture reliably. World knowledge and action knowledge are distinct: internet video can teach a robot how the world behaves, while robot-specific data is still needed to connect that understanding to a particular embodiment. Meta’s V-JEPA 2 was pre-trained on over one million hours of internet video. Researchers then added action conditioning from just 62 hours of unlabeled robot video. The result: 80% zero-shot pick-and-place success on real robot arms, across different labs, with no task-specific training. DeepMind’s Dreamer 4 learned to collect diamonds in Minecraft from purely offline data, suggesting that imagination-based training could extend to physical tasks.
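The split between world knowledge and action knowledge can be sketched as two stages: a representation learned from video alone, plus a small action-conditioned dynamics head. The sketch below is a toy illustration of that idea, not V-JEPA 2's actual architecture; all shapes, weights, and function names are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for illustration.
FRAME_DIM, LATENT_DIM, ACTION_DIM = 16, 8, 2

# Stand-in for a "pretrained on internet video" encoder (frozen random weights here).
W_enc = rng.normal(size=(FRAME_DIM, LATENT_DIM))
# Stand-in for the small action-conditioned dynamics model added afterward.
W_dyn = rng.normal(size=(LATENT_DIM + ACTION_DIM, LATENT_DIM)) * 0.1

def encode(frame):
    """World knowledge: map an observation to a latent state."""
    return np.tanh(frame @ W_enc)

def predict_next(latent, action):
    """Action knowledge: predict the next latent given current latent + action."""
    return np.tanh(np.concatenate([latent, action]) @ W_dyn)

frame = rng.normal(size=FRAME_DIM)
z = encode(frame)
z_next = predict_next(z, np.array([0.5, -0.1]))
print(z_next.shape)  # (8,)
```

The point of the split is data efficiency: the encoder can be trained on millions of hours of passive video, while only the small dynamics head needs robot data.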

Scale is improving results, but it is costly. Models like NVIDIA’s Cosmos, Wayve’s GAIA-2, and DeepMind’s Genie 3 show that bigger models and more video improve learned physics. Training runs are starting to rival large LLM runs: Cosmos used 10,000 H100 GPUs over three months. Frontier runs cost tens to hundreds of millions of dollars. The architecture question is still open, with researchers exploring pixel prediction, abstract representation learning, and diffusion approaches. Imitation learning also appears insufficient on its own, while reinforcement learning inside a world model may help robots handle failures and edge cases more robustly.
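"Reinforcement learning inside a world model" means the policy is improved against imagined rollouts rather than real experience. The toy sketch below captures that loop under invented assumptions: a hand-written one-dimensional dynamics and reward function stand in for Dreamer's learned networks, and a crude parameter search stands in for gradient-based policy learning.

```python
import numpy as np

# Hypothetical learned model components (hand-written here for illustration).
def model_step(state, action):
    return 0.9 * state + 0.1 * action        # toy latent dynamics

def model_reward(state):
    return -abs(state - 1.0)                 # reward peaks when state == 1.0

def policy(state, gain):
    return gain * (1.0 - state)              # simple proportional controller

def imagined_return(gain, horizon=15, gamma=0.99):
    """Roll the policy out entirely inside the model -- no real environment."""
    state, total, discount = 0.0, 0.0, 1.0
    for _ in range(horizon):
        action = policy(state, gain)
        state = model_step(state, action)
        total += discount * model_reward(state)
        discount *= gamma
    return total

# "Training" here is a crude search over policy parameters using only imagined rollouts.
best_gain = max(np.linspace(0.0, 5.0, 51), key=imagined_return)
print(best_gain)
```

Because every rollout is imagined, the policy can rehearse failures and edge cases cheaply, which is the property that makes this attractive for physical robots.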

Important gaps remain before world models can support production robotics. Video-generation systems often lose coherence over time, causing object permanence failures, spatial drift, and unrealistic causal behavior. DeepMind’s Genie 3, arguably the most capable interactive world model today, maintains coherent generation for only a few minutes. Tactile sensing is another major limitation because video cannot capture the force, pressure, or contact dynamics needed for dexterous control. At the planning level, world models remain slow: V-JEPA 2 takes roughly 16 seconds per action, while real-time control needs to be about 100x faster. Serving costs may be an even bigger commercial obstacle than training. Genie 3 costs roughly $100 per hour to run, according to one industry source. Odyssey requires a full H200 chip per user for its standard model, and several H200 chips for its more advanced model, costing several dollars per hour. Decart, an Israeli startup, claims to have reduced video generation costs by 400x, but real-time per-user streaming remains structurally expensive.
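The planning-speed problem comes from how world models are typically used for control: each action requires scoring many imagined action sequences through the model. The random-shooting planner below is a minimal sketch of that pattern, with a toy two-dimensional dynamics standing in for a real learned model; the goal, dimensions, and candidate counts are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

GOAL = np.array([1.0, -0.5])     # hypothetical target latent state

def model_step(state, action):
    """Toy stand-in for one forward pass of a learned world model."""
    return state + 0.1 * action

def plan_action(state, n_candidates=256, horizon=10):
    """Random-shooting planner: score many imagined action sequences and
    return the first action of the best one. Each call runs
    n_candidates * horizon model steps -- with a large world model,
    this inner loop is what makes per-action planning slow."""
    candidates = rng.normal(size=(n_candidates, horizon, 2))
    best_cost, best_first = np.inf, None
    for seq in candidates:
        s = state.copy()
        for a in seq:
            s = model_step(s, a)
        cost = np.linalg.norm(s - GOAL)     # distance to goal after rollout
        if cost < best_cost:
            best_cost, best_first = cost, seq[0]
    return best_first

action = plan_action(np.zeros(2))
print(action.shape)  # (2,)
```

Speeding this up means either a much faster model forward pass or distilling the planner into a reactive policy, which is one reason serving economics matter as much as training cost.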

The broader case for world models is that learned representations have repeatedly displaced hand-engineered systems across computing. Early signs in robotics are directionally strong, including zero-shot manipulation, training through imagined environments, and emergent physical understanding at larger scales. At the same time, the gap between an 80% lab result and dependable production performance remains significant. Whether world models alone can deliver general-purpose robotics is unresolved, but the field is advancing along a clear scaling trajectory and is increasingly focused on the infrastructure and model design needed to make that progress usable in the real world.

Impact Score: 58

Nvidia chief projects chip sales growth

Nvidia’s chief executive is reported to have projected massive future Artificial Intelligence chip revenue, but the available source provides no details beyond the headline and a brief author description.

HHS weighs clinical Artificial Intelligence adoption around trust and burden

HHS is using public feedback to shape how Artificial Intelligence should be adopted in clinical care, with a focus on provider burden, patient trust, interoperability, and responsible use. The department is signaling that future changes in regulation, reimbursement, and research will reflect the themes that emerge.

Designing carbon materials with Artificial Intelligence at exascale

Argonne researchers are using supercomputers and Artificial Intelligence to predict how carbon changes under extreme heat and pressure. The work could help design nanocarbon materials for medicine, energy, and national security before they are built in the lab.

NVIDIA unveils RTX PRO 4500 Blackwell server edition GPU

NVIDIA has introduced a passively cooled, single-slot RTX PRO 4500 Blackwell Server Edition aimed at compute-dense server deployments. The card closely matches the standard RTX PRO 4500 Blackwell while lowering power and memory speed to fit hyper-dense configurations.

Snap speeds Snapchat A/B testing with NVIDIA data libraries

Snap has moved key Snapchat experimentation workloads to NVIDIA-accelerated Apache Spark on Google Cloud, aiming to process large daily data volumes faster and at lower cost. The shift supports broader feature testing across engagement, performance and monetization metrics.
