Christopher S. Penn's Almost Timely newsletter lays out a practical, repeatable approach to testing generative artificial intelligence models and explains why testing matters now that models such as OpenAI's GPT-5 and competitors like Gemini 2.5 circulate in production. He begins by reminding readers that generative models are probabilistic and therefore never perfectly deterministic; the same prompt rarely yields identical output twice. Standard benchmarks matter for apples-to-apples comparisons, but they do not replace bespoke tests designed around the tasks you actually need to perform.
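The newsletter contains no code, but the non-determinism point is easy to see with a minimal sketch like the one below, which assumes the OpenAI Python SDK, an API key in the environment, and a placeholder model name rather than anything Penn specifies: it sends the same prompt twice and checks whether the outputs match.

```python
# Minimal sketch: send the same prompt twice and compare the outputs.
# Assumes the OpenAI Python SDK and OPENAI_API_KEY set in the environment;
# the model name is a placeholder, not one named in the newsletter.
from openai import OpenAI

client = OpenAI()
PROMPT = "Summarize the case for building private AI benchmarks in one sentence."

def run_once(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

first = run_once(PROMPT)
second = run_once(PROMPT)
print("Identical outputs:", first == second)  # usually False for generative models
```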
Penn surveys the landscape of public benchmarks, naming MMLU-Pro, GPQA Diamond, Humanity's Last Exam, SciCode, IFBench, AIME 2025, AA-LCR and others, and notes a recurring problem: model makers sometimes tune systems to pass public tests rather than optimize for real-world use. The remedy is to build private, use-case-specific evaluations. He argues that a model that performs poorly on a public benchmark can still be the best tool for a narrow domain, and conversely, broad benchmark success does not guarantee suitability for every task.
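As a rough illustration of what a private, use-case-specific evaluation might look like as data, the sketch below defines two hypothetical test cases with stored prompts and human-grading rubrics; the task names, prompts, and rubrics are assumptions for illustration, not Penn's actual battery.

```python
# Hypothetical sketch of a private, use-case-specific evaluation set.
# Task names, prompts, and rubrics are illustrative, not Penn's actual tests.
from dataclasses import dataclass

@dataclass
class EvalCase:
    name: str    # short label for the business task being tested
    prompt: str  # the exact prompt, stored so every model sees identical input
    rubric: str  # what a human grader looks for when scoring the output

PRIVATE_EVALS = [
    EvalCase(
        name="house_style_rewrite",
        prompt="Rewrite the paragraph below to match our house style profile: ...",
        rubric="Matches the tone profile; no factual drift from the source paragraph.",
    ),
    EvalCase(
        name="quarterly_growth_math",
        prompt="Given revenue of $1.2M in Q1 and $1.5M in Q2, compute the growth rate.",
        rubric="States 25% and shows the intermediate arithmetic.",
    ),
]
```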
The newsletter describes a two-tier test suite Penn uses. A short public competency test checks current knowledge, topical awareness, multi-step mathematics, reasoning, coding with external dependencies, and writing that adheres to a provided style profile. For Trust Insights clients, Penn offers a longer battery that covers seven major generative use cases and bias testing. He also emphasizes two testing modes: the consumer web interface, which includes system prompts and guardrails, and the raw API, which exposes the base model without interface augmentations. Test both to understand the shipped experience and the underlying engine.
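The consumer web interface itself cannot be scripted this way, but the distinction Penn draws can be approximated in code. The sketch below (again assuming the OpenAI Python SDK and a placeholder model name) calls the raw API once with no system prompt and once with a stand-in system prompt, since the real interface's system prompt and guardrails are not public.

```python
# Sketch approximating the two testing modes described in the newsletter.
# Mode 1 (raw API): no system prompt, exposing the base model's behavior.
# Mode 2 (shipped experience): the actual consumer web UI cannot be reproduced here;
# the stand-in system prompt below only roughly imitates interface guardrails.
from openai import OpenAI

client = OpenAI()
TEST_PROMPT = "Explain the difference between MMLU-Pro and GPQA Diamond in two sentences."

def raw_api(prompt: str) -> str:
    """Base model only: no system prompt or interface augmentations."""
    r = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return r.choices[0].message.content

def with_guardrails(prompt: str) -> str:
    """Rough stand-in for a consumer interface's system prompt and guardrails."""
    r = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a cautious assistant. Flag uncertainty and refuse unsafe requests."},
            {"role": "user", "content": prompt},
        ],
    )
    return r.choices[0].message.content

print(raw_api(TEST_PROMPT))
print(with_guardrails(TEST_PROMPT))
```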
On running experiments, Penn recommends identical stored prompts, starting a new chat for each trial, keeping results in notebooks or spreadsheets, and scoring either pass/fail or on a one-to-five scale. Use the results to choose models for particular jobs, to detect capability shifts between model versions, and to inform governance and compliance work such as bias audits. He shares that his short competency test put GPT-5 roughly on par with Gemini 2.5, and he points readers to video, courses, and a Slack community at Trust Insights for follow-up. The core advice is straightforward: test for what you actually need, run the tests fairly, and treat the output as a tool-selection guide rather than an oracle.
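A harness in the spirit of that advice might look like the sketch below: stored prompts reused verbatim, a stateless API call per trial to mimic starting a new chat, and rows appended to a spreadsheet-friendly CSV with an empty column for a human grader's pass/fail or one-to-five score. The model names, prompts, and file path are illustrative assumptions, not part of the newsletter.

```python
# Sketch of a simple test harness: identical stored prompts, a fresh
# (stateless) request per trial, and results logged for manual scoring.
# Model names, prompts, and the CSV path are illustrative placeholders.
import csv
from datetime import date
from openai import OpenAI

client = OpenAI()

STORED_PROMPTS = {
    "current_knowledge": "Who is the current CEO of OpenAI?",
    "multi_step_math": "A price rises 20% and then falls 20%. What is the net change?",
}
MODELS = ["gpt-4o-mini", "gpt-4o"]  # placeholders for the models under test

with open("model_tests.csv", "a", newline="") as f:
    writer = csv.writer(f)
    for model in MODELS:
        for test_name, prompt in STORED_PROMPTS.items():
            # Each create() call carries no prior conversation state,
            # which mimics starting a brand-new chat for every trial.
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            output = response.choices[0].message.content
            # Last column left blank for a human grader: pass/fail or 1-5.
            writer.writerow([date.today().isoformat(), model, test_name, output, ""])
```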