MindJourney is a research framework that enables artificial intelligence (AI) agents to explore simulated three-dimensional spaces they cannot directly observe. The approach targets a limitation in vision-language models (VLMs), which are effective at identifying objects in static images but often fail to infer the interactive 3D layout behind a 2D view. By allowing an agent to mentally simulate motion through a scene, MindJourney helps answer spatial questions that require understanding position and movement through space.
The system relies on a world model built from a large dataset of videos captured from a single moving viewpoint. This world model, a video generation system, learns to predict how a scene would appear from different perspectives, and at inference time it generates photo-realistic candidate views based on hypothetical agent movements. A vision-language model evaluates those generated observations and guides the search, keeping promising perspectives and discarding less informative ones. To make the search efficient, MindJourney uses a spatial beam search that balances breadth and depth within a fixed number of movement steps, focusing compute on the most informative paths rather than enumerating thousands of possibilities. On the Spatial Aptitude Training benchmark, the method improved VLM accuracy by 8 percent over baseline performance.
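The search loop described above can be sketched as a generic beam search over imagined movements. This is a minimal illustration, not MindJourney's implementation: the world model and VLM scorer are stand-in stubs (`imagine_view` simply records the action path, and `vlm_score` applies an arbitrary heuristic), and the action set is a hypothetical simplification.

```python
ACTIONS = ["move_forward", "turn_left", "turn_right"]  # hypothetical action set

def imagine_view(state, action):
    """Stub world model: 'renders' a new viewpoint, represented here
    simply as the sequence of actions that produced it."""
    return state + (action,)

def vlm_score(view):
    """Stub VLM scorer: rates how informative a generated view is.
    This placeholder heuristic favors paths that alternate moving
    and turning; the real system would query a vision-language model."""
    score = 0.0
    for a, b in zip(view, view[1:]):
        if (a == "move_forward") != (b == "move_forward"):
            score += 1.0
    return score

def spatial_beam_search(beam_width=2, max_steps=3):
    """Keep only the top-`beam_width` viewpoints at each step, so the
    search expands at most beam_width * len(ACTIONS) candidates per step
    instead of enumerating all len(ACTIONS)**max_steps paths."""
    beam = [((), 0.0)]  # list of (action path, score)
    for _ in range(max_steps):
        candidates = []
        for state, _ in beam:
            for action in ACTIONS:
                view = imagine_view(state, action)
                candidates.append((view, vlm_score(view)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_width]  # prune to the most informative paths
    return beam

best_path, best_score = spatial_beam_search()[0]
print(best_path, best_score)
```

The key design point is the pruning step: because only `beam_width` paths survive each round, the cost of calling the (expensive) world model and scorer grows linearly with the number of movement steps rather than exponentially.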
MindJourney demonstrates that pretrained VLMs and trainable world models can cooperate in 3D without retraining either component, suggesting a path toward general-purpose agents that can interpret and act in real environments. Potential applications include autonomous robotics, smart home systems, and accessibility tools for people with visual impairments. Because exploration occurs in the model’s latent space, agents could evaluate multiple viewpoints before moving, which may reduce wear, energy use, and collision risk. Future work aims to extend the framework to world models that also forecast how scenes evolve over time so agents can use those predictions for more accurate planning and interpretation.