Observability in generative artificial intelligence with Microsoft Foundry

Microsoft Foundry introduces an observability stack for generative artificial intelligence applications that unifies evaluation, monitoring, and tracing across the full lifecycle. Teams can benchmark models, harden agents before deployment, and continuously monitor production traffic for quality, safety, and performance issues.

The aim is to make generative artificial intelligence systems measurable, understandable, and debuggable. Teams collect evaluation metrics, logs, traces, and model outputs to gain visibility into performance, safety, and operational health, so that inaccurate, poorly grounded, or harmful responses are caught before they reach users. Availability varies by component: the Microsoft Foundry SDK for evaluation and the Foundry portal are in public preview, the underlying evaluation APIs for models and datasets are generally available, and agent evaluation remains in public preview.

Microsoft Foundry’s observability offering is organized around three core capabilities: evaluation, monitoring, and tracing. Evaluators measure the quality, safety, and reliability of artificial intelligence responses. They cover general metrics such as coherence and fluency; retrieval augmented generation metrics such as groundedness and relevance; safety and security checks for hate or unfairness, violence, and protected materials; and agent-specific metrics such as tool call accuracy and task completion. Teams can also build custom evaluators. Production monitoring, integrated with Azure Monitor Application Insights, provides real-time dashboards for operational metrics, token consumption, latency, error rates, and quality scores, and raises alerts when outputs fall below quality thresholds or contain harmful content. Distributed tracing, built on OpenTelemetry and integrated with Application Insights, captures the flow of large language model calls, tool invocations, agent decisions, and inter-service dependencies, and supports frameworks including LangChain, Semantic Kernel, and the OpenAI Agents SDK.
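To make the idea of a custom evaluator and a quality threshold concrete, here is a minimal sketch in plain Python. It does not use the Foundry SDK: the class `KeywordCoverageEvaluator`, the helper `breaches_threshold`, the metric name `keyword_coverage`, and the 0.75 threshold are all hypothetical, and the sketch only assumes that an evaluator can be modeled as a callable returning a dictionary of named scores.

```python
from dataclasses import dataclass


@dataclass
class KeywordCoverageEvaluator:
    """Hypothetical custom evaluator: fraction of required keywords found in a response."""

    required_keywords: tuple[str, ...]

    def __call__(self, *, response: str) -> dict[str, float]:
        text = response.lower()
        hits = sum(1 for kw in self.required_keywords if kw.lower() in text)
        # Avoid dividing by zero when no keywords are configured.
        score = hits / len(self.required_keywords) if self.required_keywords else 1.0
        return {"keyword_coverage": score}


def breaches_threshold(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return the names of metrics whose score falls below its minimum threshold."""
    return [name for name, minimum in thresholds.items() if scores.get(name, 0.0) < minimum]


# Illustrative data only: one response scored against two required keywords.
evaluator = KeywordCoverageEvaluator(required_keywords=("refund", "30 days"))
scores = evaluator(response="You can request a refund at any time.")
failed = breaches_threshold(scores, {"keyword_coverage": 0.75})
```

In this example only one of the two keywords appears, so the score is 0.5 and `failed` lists `keyword_coverage`. In a real deployment the alerting side would be handled by the monitoring service rather than by a helper like this.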

Evaluation is framed across three stages of the artificial intelligence application lifecycle: base model selection, pre-production evaluation, and post-production monitoring. During model selection, teams compare quality, task performance, ethics, and safety across models using the Microsoft Foundry benchmark and the Azure AI Evaluation SDK. In pre-production, agents and applications are tested against evaluation datasets and edge cases, with metrics such as task adherence, groundedness, relevance, and safety, using bring-your-own-data evaluations, the Foundry evaluation wizard or SDK, and an artificial intelligence red teaming agent based on Microsoft’s PyRIT framework for adversarial testing with human-in-the-loop review. After deployment, continuous monitoring covers operational metrics, sampled production traffic evaluation, scheduled dataset-based evaluation to detect drift, and scheduled red teaming, with Azure Monitor alerts and a Foundry observability dashboard that consolidates performance, safety, and quality insights.
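The drift check behind scheduled dataset-based evaluation can be illustrated with a short sketch. This is not Foundry's implementation: the function `detect_drift`, the tolerance value, the 1-to-5 score scale, and the sample scores are assumptions chosen for illustration. The idea is simply to compare the mean score per metric between a baseline run and the latest run, and flag metrics that dropped by more than a tolerance.

```python
from statistics import mean


def detect_drift(
    baseline: dict[str, list[float]],
    current: dict[str, list[float]],
    tolerance: float = 0.1,
) -> dict[str, float]:
    """Return metrics whose mean score dropped by more than `tolerance` versus the baseline."""
    drifted: dict[str, float] = {}
    for metric, baseline_scores in baseline.items():
        current_scores = current.get(metric)
        # Skip metrics that are missing or empty in either run.
        if not baseline_scores or not current_scores:
            continue
        drop = mean(baseline_scores) - mean(current_scores)
        if drop > tolerance:
            drifted[metric] = round(drop, 3)
    return drifted


# Hypothetical scores on an assumed 1-to-5 scale for the same evaluation dataset.
baseline_run = {"groundedness": [4.0, 5.0, 4.0], "relevance": [4.0, 4.0, 5.0]}
latest_run = {"groundedness": [3.0, 4.0, 3.0], "relevance": [4.0, 5.0, 4.0]}
drifted = detect_drift(baseline_run, latest_run, tolerance=0.5)
```

Here groundedness falls by a full point on average and is flagged, while relevance keeps the same mean and is not. A comparison of means is the simplest possible signal; a production setup would likely also account for sample size and variance.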

A structured evaluation “cheat sheet” guides teams through configuring distributed tracing, selecting or building relevant evaluators, uploading or generating datasets, running evaluations locally or remotely, and analyzing results. Analysis capabilities include cluster analysis of evaluation failures, monitoring dashboard analysis, and an agent optimization playbook that recommends updating agent instructions, improving tool success rates, applying targeted mitigations, upgrading underlying models, saving the result as a new version, and re-evaluating. Region support, rate limits, and virtual network support determine where artificial intelligence assisted evaluators can run and how network isolation can be achieved. Observability features such as risk and safety evaluations, continuous evaluations, and evaluations in the agent playground are billed by consumption at the rates listed on the Azure pricing page. Evaluations in the agent playground are enabled by default for all Foundry projects unless users explicitly turn off all evaluators in the playground metrics settings.
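The run-then-analyze part of that workflow can be sketched as a small local loop. Everything here is hypothetical: `run_evaluation`, the two toy evaluators, the `query` and `response` field names, and the pass mark are illustrative, and grouping failures by metric is a much simpler stand-in for the cluster analysis described above.

```python
from collections import defaultdict
from typing import Callable

Row = dict[str, str]
Evaluator = Callable[[Row], float]


def non_empty(row: Row) -> float:
    """Score 1.0 when the response contains any non-whitespace text."""
    return 1.0 if row["response"].strip() else 0.0


def mentions_query_term(row: Row) -> float:
    """Score 1.0 when the response repeats at least one word from the query."""
    response = row["response"].lower()
    return 1.0 if any(term in response for term in row["query"].lower().split()) else 0.0


def run_evaluation(
    rows: list[Row], evaluators: dict[str, Evaluator], pass_mark: float = 0.5
) -> dict[str, list[int]]:
    """Score every row with every evaluator and group failing row indices by metric."""
    failures: dict[str, list[int]] = defaultdict(list)
    for index, row in enumerate(rows):
        for name, evaluate in evaluators.items():
            if evaluate(row) < pass_mark:
                failures[name].append(index)
    return dict(failures)


# Hypothetical evaluation dataset with one good, one empty, and one off-topic response.
dataset = [
    {"query": "reset password", "response": "Open Settings and choose Reset password."},
    {"query": "billing cycle", "response": ""},
    {"query": "export data", "response": "Please contact support."},
]
failures = run_evaluation(
    dataset, {"non_empty": non_empty, "mentions_query_term": mentions_query_term}
)
```

Grouping failing rows by metric gives a first view of where an agent is weak, which is the kind of signal the optimization playbook then acts on before re-evaluating.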


