Generative Artificial Intelligence continues to reshape the landscape of innovation, demanding greater scrutiny and governance to ensure responsible and beneficial outcomes. In the concluding episode of the Microsoft Research podcast series on Artificial Intelligence testing and evaluation, Amanda Craig Deckard, senior director of public policy in Microsoft’s Office of Responsible AI, joins host Kathleen Sullivan to summarize key takeaways from cross-disciplinary explorations into testing as a governance tool. Drawing on case studies ranging from genome editing to cybersecurity, Deckard emphasizes that testing is crucial for building trust yet inherently challenging, given the interplay between risk management, innovation, and the diversity of stakeholders involved.
The discussion identifies three central themes for advancing Artificial Intelligence governance: the purpose of testing, the balance between pre- and post-deployment evaluation, and the need to calibrate between rigid and adaptive testing frameworks. In sectors like pharmaceuticals and medical devices, stringent pre-deployment requirements dominate, while cybersecurity tends to rely more heavily on post-deployment strategies such as coordinated vulnerability disclosure and bug bounties. For Artificial Intelligence, both approaches are necessary but must be tailored to risks that shift as technologies move from model development into complex real-world deployments. Because the context and application environment can dramatically alter the risk profile, both model-level and system-level evaluation are needed to ensure trustworthy outcomes.
To elevate the testing ecosystem, Deckard highlights the importance of increasing rigor, standardizing methodologies, and enhancing the interpretability of test results. Standardization, in particular, can help smaller organizations participate in responsible Artificial Intelligence deployment and integrate into broader supply chains without the burden of independently developing exhaustive evaluation processes. The conversation underscores the critical role of public-private partnerships, such as the International Network of AI Safety Institutes and Singapore’s Global AI Assurance Pilot, in building capacity for meaningful, actionable evaluation across the industry. Ultimately, transparency, adaptive information sharing, and cross-domain learning remain top priorities as Microsoft and the broader Artificial Intelligence ecosystem continue to refine deployment practices that are safe, effective, and aligned with societal values.
Looking ahead, Microsoft intends to deepen its exploration of transparency across the Artificial Intelligence supply chain, drawing further lessons from established domains and focusing research on sociotechnical impacts. Engagement across disciplines, from quantum safety to gaming, and continued public-private collaboration are seen as essential to developing standards and to measuring broader societal effects such as misinformation and workforce shifts. As new frameworks such as the Hiroshima AI Process Reporting Framework emerge, Deckard calls for inclusive approaches, ensuring that metrics account for both technical and social risks and that progress remains rooted in dialogue and shared responsibility across innovators, deployers, governments, and end users alike.