Frontier AI models continue to advance on high‑stakes benchmarks while real‑world enterprise returns remain uneven. A controlled evaluation of GPT‑5 on multimodal medical reasoning found large gains over GPT‑4o on the MedXpertQA benchmark (+29.26% in reasoning and +26.18% in understanding) and reported performance above pre‑licensed human experts (+24.23% and +29.40%, respectively). The paper also noted a nuance: on the smaller VQA‑RAD dataset, GPT‑5‑mini slightly outperformed the full GPT‑5 model, suggesting that right‑sizing can sometimes beat brute‑force scaling for niche tasks.
A separate large field experiment examined voice agents in hiring, randomizing more than 70,000 applicants for customer service roles in the Philippines to a human interviewer, an AI voice agent, or a choice between the two. The AI‑led interviews produced materially better hiring outcomes: 12% more job offers, 18% more job starts, and 17% higher 30‑day retention. When given a choice, 78% of applicants chose the AI interviewer. Transcript analysis pointed to greater consistency across interviews as the likely mechanism, and applicant reports of gender discrimination by interviewers were nearly halved under the AI condition.
These capability wins sit against an enterprise adoption backdrop described by MIT’s NANDA initiative in its report The GenAI Divide. The report finds that only about 5% of corporate AI pilots drive rapid revenue acceleration, with most pilots stalling because of a learning and integration gap rather than model quality. Purchasing specialized tools and partnering succeed roughly two‑thirds of the time, while internal builds succeed only about one‑third as often. The newsletter draws five practical lessons: treat adoption as a process problem; separate interaction from adjudication so that humans make final decisions; redesign workflows for consistent, auditable signal capture; right‑size models to the task; and favor buy‑then‑integrate when speed and reliability are critical. The piece also flags related industry moves, such as NVIDIA’s Granary dataset and model updates from Anthropic and Mistral, underscoring a fast‑moving technical landscape alongside persistent organizational challenges.