A practical test inside a personal CRM raised doubts about the value of premium reasoning models for everyday work. The system analyzes emails, meeting notes, public information, and relationship context across a large contact base, but the most capable models were reserved for the most important cases because of cost. With over 800 active contacts and frequent cross-contact analysis, the app generates a substantial volume of LLM queries; during one session, 47 premium contacts were updating at once, each using extended reasoning. After several weeks of use, the expensive model's outputs were effectively indistinguishable in quality from the cheaper model's, while the premium option was slower and more costly.
That experience aligns with recent research questioning whether visible and hidden reasoning steps are actually doing useful work. A paper by Basu and Chakraborty tested 10 frontier models, including GPT-5.4, Claude Opus, and DeepSeek, across four task types. Their method removed one reasoning step at a time and checked whether the final answer changed. For most models on most tasks, removing any single step changed the answer less than 17% of the time. The implication was that no single step was individually necessary, even when the reasoning looked coherent and persuasive.
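The ablation method described above can be sketched in a few lines. This is a toy illustration, not the paper's actual harness: `answer_with_steps` is a hypothetical stub standing in for a model call that re-prompts with a modified chain of thought, and its behavior is contrived to mimic the finding that no single step is load-bearing.

```python
def answer_with_steps(question, steps):
    """Hypothetical stub for a model call: returns a final answer given a
    list of reasoning steps. A real test would re-prompt the model with
    the ablated chain of thought and read off its answer."""
    # Toy behavior: the answer survives unless nearly all steps are gone,
    # mimicking the finding that single steps are rarely individually
    # necessary.
    if len(steps) < 2:
        return "unsure"
    return f"answer-to:{question}"

def ablation_sensitivity(question, steps):
    """Remove one reasoning step at a time and report the fraction of
    ablations that flip the final answer."""
    baseline = answer_with_steps(question, steps)
    flips = 0
    for i in range(len(steps)):
        ablated = steps[:i] + steps[i + 1:]
        if answer_with_steps(question, ablated) != baseline:
            flips += 1
    return flips / len(steps)

steps = ["restate the problem", "recall the formula",
         "plug in values", "check the result"]
rate = ablation_sensitivity("What is 12 * 7?", steps)
print(f"{rate:.0%} of single-step ablations changed the answer")
```

Against a real API, the same loop would simply issue one extra query per reasoning step, so the cost of running the check scales linearly with chain length.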
Separate research from Goodfire AI and Harvard examined when models had effectively already decided on an answer before finishing their reasoning. On straightforward questions, internal confidence converged on the correct answer very early, yet the models continued generating additional reasoning tokens. When the researchers forced the model to stop once it had already made up its mind, token use dropped by up to 80%, while accuracy remained comparable. That finding suggests a large share of reasoning output on routine tasks may be decorative rather than functional.
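An early-exit policy of the kind that result points toward can be sketched as follows. The confidence trace here is a stand-in for whatever internal probe signals that the model has settled on an answer; the threshold and patience values are illustrative assumptions, not figures from the research.

```python
def reason_with_early_exit(confidence_trace, threshold=0.9, patience=2):
    """Stop spending reasoning steps once confidence in the current
    answer has stayed above `threshold` for `patience` consecutive
    steps. Returns the number of steps actually used."""
    stable = 0
    for step, conf in enumerate(confidence_trace, start=1):
        stable = stable + 1 if conf >= threshold else 0
        if stable >= patience:
            return step
    return len(confidence_trace)

# A routine question: confidence converges early, but the full trace
# (the reasoning the model would otherwise emit) runs 20 steps.
trace = [0.4, 0.7, 0.93, 0.95] + [0.96] * 16
used = reason_with_early_exit(trace)
saved = 1 - used / len(trace)
print(f"stopped after {used}/{len(trace)} steps ({saved:.0%} saved)")
```

On this contrived trace the policy stops after 4 of 20 steps, an 80% reduction, which is the shape of saving the researchers reported on straightforward questions.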
The financial consequences could be significant because reasoning tokens are billed as output, the most expensive token category, and models can generate thousands of them before producing a short visible response. If 80% of those tokens on routine tasks are performative, as the Goodfire research suggests, then much of the cost of everyday AI use may come from unnecessary computation. The recommended response is to test common tasks side by side on smaller or non-reasoning models and identify which queries truly need extended reasoning. Teams that cannot answer that question may be overspending on model behavior that looks impressive but does not improve results.
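The arithmetic behind that claim is easy to make concrete. The query volume, token counts, and per-million-token price below are illustrative assumptions, not real pricing:

```python
def monthly_output_cost(queries, visible_tokens, reasoning_tokens,
                        price_per_mtok):
    """Output-token cost for a month of queries, reasoning tokens
    included, at a given price per million output tokens."""
    total_tokens = queries * (visible_tokens + reasoning_tokens)
    return total_tokens / 1_000_000 * price_per_mtok

# Illustrative numbers: 10,000 routine queries per month, each a
# 200-token visible reply preceded by 3,000 reasoning tokens, at a
# hypothetical $15 per million output tokens.
full = monthly_output_cost(10_000, 200, 3_000, 15.0)
# If 80% of the reasoning tokens on routine tasks are dispensable:
trimmed = monthly_output_cost(10_000, 200, 3_000 * 0.2, 15.0)
print(f"${full:.2f} -> ${trimmed:.2f} ({1 - trimmed / full:.0%} saved)")
```

Under these assumptions the bill drops from $480 to $120 a month, a 75% saving, even though the visible replies are unchanged, which is why the side-by-side test is worth running before defaulting to extended reasoning.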
