Hidden reasoning tokens can erase Artificial Intelligence model savings

A production LLM gateway showed how cheaper per-token pricing can still produce higher bills when a model spends output budget on internal reasoning. The engineering lessons mirror familiar backend work: reliability, observability, cost control, retries, and failover.

A test against Claude Haiku and Gemini 2.5 Flash exposed a hidden cost problem in model routing. Flash has the lower per-token price, but before it answered “Paris,” it spent a few dozen tokens reasoning, and reasoning is billed as output. Haiku answered in 4 tokens. Flash spent about 28, which meant a lower unit price, far more units, and ~8.6× the bill per request. The discrepancy became visible only because every call was instrumented for tokens, cost, and latency, then written to Postgres.

The experience reframed Artificial Intelligence infrastructure as a continuation of backend systems engineering rather than a wholly new discipline. A model API behaves like any expensive and unreliable downstream dependency: it can be slow, unavailable, rate-limited, and billed per call. The same practices used for payment processors, KYC providers, and partner banks apply to model providers, including circuit breakers, rate limiters, retries, fail-fast behavior, and audit logs.

The gateway used a circuit breaker per provider, modeled on payment infrastructure where partner integrations had their own SLA and their own definition of “up.” When a partner started failing, the wrong response was to keep sending traffic while worker pools filled with stuck calls. The gateway applied the same CLOSED, OPEN, HALF_OPEN state pattern around Anthropic and Gemini, changing the downstream label from partner bank to model provider.

Cost metering followed the same payment-system discipline. Across 20+ Go services, idempotency keys prevented retried requests from double-counting; the gateway uses request_id for the same reason. Cost is stored as NUMERIC rather than a float because money and spending should not be approximated. The familiar distinction between transient and sustained failure also carried over: one timeout can justify an idempotent retry, while sustained failure should stop traffic.

Some problems remain specific to model systems. Token economics is a real discipline, and the ~8.6× surprise does not resemble ordinary database billing. Models are also non-deterministic, so testing cannot rely on exact string matches and shifts toward evals and distributions. The broader lesson is that distributed-systems experience remains central to production Artificial Intelligence work, especially when the goal is to make model calls reliable, observable, and cost-aware.

52

Impact Score

Nvidia and Abridge build clinical Artificial Intelligence model

Nvidia is partnering with Abridge on a healthcare-focused Artificial Intelligence model designed for real doctor-patient conversations. The system will run inside Abridge’s clinical scribe platform, which is already used by major health systems.

Artificial Intelligence shifts into execution and scrutiny

London Tech Week focused on responsible Artificial Intelligence deployment as companies, regulators and investors moved from ambition to implementation. OpenAI’s IPO plans, Meta’s WhatsApp dispute, drone defence partnerships and changing junior roles defined the week’s technology agenda.

Britain secures major Artificial Intelligence investment

Global and UK technology companies announced new Artificial Intelligence investments, infrastructure projects and hiring plans during London Tech Week. The commitments span compute, cloud, quantum, coding tools, legal technology and sovereign model development.

Contact Us

Got questions? Use the form to contact us.

Contact Form

Clicking next sends a verification code to your email. After verifying, you can enter your message.