Researchers from Google DeepMind and collaborators have introduced an intrinsic self-critique method that allows large language models to evaluate and refine their own plans, leading to substantial gains on standard planning benchmarks. The work targets long-standing limitations in planning and reasoning, and demonstrates that self-generated feedback can improve performance on Blocksworld, Logistics, and Mini-grid datasets without relying on external verification tools. The approach is positioned as a step toward more robust and self-improving artificial intelligence systems that can better handle complex planning tasks expressed in natural language.
The core of the method is an iterative loop: the large language model first proposes a plan, then critiques that plan by assessing its correctness and justifying the assessment, and finally feeds this critique back as context for the next planning attempt. The researchers began with a few-shot setup and progressively extended it to a many-shot regime, showing that substantial improvement is possible through iterative correction and refinement. Experiments used model checkpoints from October 2024 as the basis for evaluation, establishing new state-of-the-art results on multiple planning benchmarks and demonstrating that the technique transfers across model versions.
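The loop described above can be sketched in a few lines. This is an illustrative mock-up, not the paper's implementation: `call_model` and `critique` are deterministic stubs standing in for LLM calls, and the prompt formats are invented for the example.

```python
# Sketch of the propose -> critique -> refine loop (stubs replace real LLM calls).

def call_model(prompt: str) -> str:
    """Stub LLM: returns a refined plan once a critique appears in the prompt."""
    if "Critique:" in prompt:
        return "pickup A; stack A on B"   # refined plan
    return "stack A on B"                 # initial (flawed) plan

def critique(plan: str) -> tuple[bool, str]:
    """Stub self-critique: flags plans that stack before picking up."""
    ok = plan.startswith("pickup")
    reason = "valid" if ok else "block must be picked up before stacking"
    return ok, reason

def plan_with_self_critique(task: str, max_iters: int = 10) -> str:
    """Iteratively plan, critique, and refine, keeping an in-context history."""
    history = ""  # growing record of past plans and their critiques
    plan = call_model(f"Task: {task}\nPlan:")
    for _ in range(max_iters):
        ok, reason = critique(plan)
        if ok:
            break
        history += f"\nPlan: {plan}\nCritique: {reason}"
        plan = call_model(f"Task: {task}{history}\nRevised plan:")
    return plan
```

The `history` string mirrors the paper's idea of accumulating past plans and critiques in context rather than updating model parameters, and the `max_iters` cap mirrors the ten-step limit the authors report.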
The team tested the method on planning problems of varying difficulty, including Blocksworld scenarios with 3-5 and 3-7 blocks as well as the standard Logistics and Mini-grid datasets, and reported consistently higher accuracies than strong existing baselines. The self-critique mechanism reduced false positives and improved error detection by aggregating past plans and critiques into a growing in-context history that the model could learn from without any parameter updates. Among the key results, the method achieved a new state-of-the-art 89.3% success rate on Blocksworld 3-5 when self-critique was combined with self-consistency, and it marks the first demonstration of LLMs solving Mystery Blocksworld problems, at 22% accuracy, rising to 37.8% with the self-improvement techniques applied.

The authors note a limitation arising from context length, which forced them to cap iterative critique at ten steps. They suggest that combining the self-critique process with methods such as Chain-of-Thought or Monte-Carlo Tree Search on more capable models could further close the gap between language-model planners and traditional algorithmic planners, especially in real-world, natural-language planning scenarios such as holiday planning or meeting scheduling, where classical systems often struggle.
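The pairing of self-critique with self-consistency can be understood as: sample several candidate plans, discard those the model's own critique rejects, and return the most frequent survivor. The sketch below illustrates that idea under invented inputs; the sampling and critique functions are placeholders, not the paper's prompts.

```python
# Illustrative combination of self-critique filtering with self-consistency voting.
from collections import Counter

def self_consistent_plan(samples, accepts):
    """Pick the most common plan among critique-approved candidates.

    samples: list of candidate plans (e.g. from repeated LLM sampling)
    accepts: callable plan -> bool, standing in for the self-critique step
    """
    survivors = [p for p in samples if accepts(p)]  # drop critiqued-out plans
    if not survivors:
        return None  # no candidate passed self-critique
    return Counter(survivors).most_common(1)[0][0]  # majority vote

# Example: three of four samples survive critique; the repeated one wins.
best = self_consistent_plan(
    ["plan-A", "plan-B", "plan-A", "plan-C"],
    accepts=lambda p: p != "plan-C",
)
```

Filtering before voting is what distinguishes this from plain self-consistency: critique removes plans the model itself can identify as invalid, so the vote is taken only over plausible candidates.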