Artificial intelligence models often lack the common sense humans acquire through real-world experience. NVIDIA is addressing that gap by designing evaluation tests and curated datasets that teach models the limits and dynamics of the physical world, such as spatial relations, motion and cause and effect. The company frames this work as essential for deploying models in unpredictable environments like industrial warehouses, laboratories and roads, where mistakes can cause harm.
Central to the effort is Cosmos Reason, an open reasoning vision language model optimized for physical artificial intelligence applications that recently topped the physical reasoning leaderboard on Hugging Face. NVIDIA emphasizes temporally grounded responses, enabling the model to analyze video, infer relative motion and predict likely outcomes. The model is trained with reinforcement learning to internalize spatial and temporal constraints, so it can, for example, determine when two cars in a single lane are likely to collide, or which hand a person uses to cut spaghetti in a clip.
The data curation process is run by an NVIDIA data factory team that blends skills from bioengineering, business and linguistics. Annotators create multiple-choice question-and-answer pairs from real-world video footage. Each item carries four answer choices and is then reviewed by analysts and project leads for alignment with project objectives and quality standards. Michelle Li and other analysts perform quality checks, while researchers such as Yin Cui and Tsung-Yi Lin guide the research goals. After vetting, hundreds of thousands of these data units are fed to the Cosmos Reason team for model training. NVIDIA positions this pipeline as a foundation for safer autonomous agents and physical artificial intelligence systems, and provides model access through preview pages and downloads on Hugging Face and GitHub.
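One such data unit can be pictured as a small structured record: a question about a clip, exactly four choices, and a flag flipped after review. The field names below are assumptions for illustration, not NVIDIA's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class VideoQAUnit:
    """One annotated multiple-choice item tied to a video clip (illustrative schema)."""
    clip_id: str
    question: str
    choices: list[str]       # exactly four answer options, per the pipeline described
    answer_index: int        # index of the correct choice
    reviewed: bool = False   # set True after analyst / project-lead review

    def __post_init__(self) -> None:
        # Enforce the four-choice convention before the unit enters training.
        if len(self.choices) != 4:
            raise ValueError("each item must carry exactly four choices")
        if not 0 <= self.answer_index < 4:
            raise ValueError("answer_index must point at one of the four choices")

unit = VideoQAUnit(
    clip_id="clip_0001",
    question="Which hand does the person use to cut the spaghetti?",
    choices=["Left hand", "Right hand", "Both hands", "Neither"],
    answer_index=1,
)
print(unit.choices[unit.answer_index])  # Right hand
```

Validating structure at creation time mirrors the article's review step: malformed items are rejected before hundreds of thousands of units reach model training.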