Opt-in NVIDIA software enables Artificial Intelligence data center fleet management

NVIDIA is launching an opt-in, customer-installed service with an open-source agent to stream GPU telemetry and provide a dashboard for monitoring and optimizing Artificial Intelligence data center fleets.

NVIDIA is developing an opt-in, customer-installed software service that visualizes and monitors fleets of NVIDIA GPUs to help cloud partners and enterprises boost GPU uptime across computing infrastructures. The offering includes a client software agent that streams node-level GPU telemetry to a portal hosted on NVIDIA NGC. Customers can visualize fleet utilization globally or by compute zones, generate fleet reports, and use dashboard insights to address system bottlenecks and optimize productivity.

The service is built around an open-source client agent intended to provide transparency and auditability. The agent streams read-only telemetry and cannot modify GPU configurations or underlying operations; telemetry remains customer managed and customizable. According to the announcement, NVIDIA GPUs do not contain hardware tracking technology, kill switches, or backdoors. Real-time monitoring works by having each GPU system share its metrics with the external cloud service, enabling operators to spot errors and anomalies and identify failing parts early.
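As a rough illustration of the read-only model described above, a telemetry agent might periodically sample per-GPU metrics and emit them as JSON records for streaming. The field names, record shape, and `collect_sample` helper below are hypothetical sketches, not NVIDIA's actual schema or agent code; a real agent would query the GPU driver rather than accept a simulated reading.

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class GpuSample:
    """One read-only telemetry snapshot for a single GPU (hypothetical schema)."""
    node: str
    gpu_index: int
    timestamp: float
    utilization_pct: float  # SM utilization
    power_watts: float
    temperature_c: float

def collect_sample(node: str, gpu_index: int, reading: dict) -> GpuSample:
    """Wrap a raw metrics reading in a timestamped record. The agent only
    reads values; it never writes GPU configuration."""
    return GpuSample(
        node=node,
        gpu_index=gpu_index,
        timestamp=time.time(),
        utilization_pct=reading["util"],
        power_watts=reading["power"],
        temperature_c=reading["temp"],
    )

def to_wire(sample: GpuSample) -> str:
    """Serialize a sample as JSON for streaming to the monitoring portal."""
    return json.dumps(asdict(sample))

# Example with a simulated reading:
sample = collect_sample("node-01", 0, {"util": 87.5, "power": 412.0, "temp": 71.0})
payload = to_wire(sample)
```

Keeping the agent a thin, one-way serializer is what makes the "read-only, auditable" claim easy to verify in open-source code.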

Operational capabilities highlighted include tracking spikes in power usage to stay within energy budgets while maximizing performance per watt, monitoring utilization, memory bandwidth and interconnect health, detecting hotspots and airflow issues to avoid thermal throttling, and confirming consistent software configurations to ensure reproducible results. These features are presented as ways to visualize GPU inventory, address system bottlenecks, and increase return on investment for enterprises and cloud providers running heavy compute workloads.
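The power and thermal capabilities above amount to threshold checks over a telemetry stream. A minimal sketch, with made-up power-budget and thermal-throttle limits (the real service's thresholds and logic are not public), might flag offending samples like this:

```python
def flag_anomalies(samples, power_budget_w=450.0, throttle_temp_c=85.0):
    """Return (index, reason) pairs for samples breaching hypothetical limits.
    Each sample is a dict with 'power' (watts) and 'temp' (deg C) keys."""
    flags = []
    for i, s in enumerate(samples):
        if s["power"] > power_budget_w:
            flags.append((i, "power budget exceeded"))
        if s["temp"] >= throttle_temp_c:
            flags.append((i, "thermal throttle risk"))
    return flags

stream = [
    {"power": 300.0, "temp": 65.0},
    {"power": 470.0, "temp": 70.0},  # power spike
    {"power": 310.0, "temp": 88.0},  # hotspot
]
print(flag_anomalies(stream))
# → [(1, 'power budget exceeded'), (2, 'thermal throttle risk')]
```

Surfacing these flags on a fleet dashboard is what lets operators stay within energy budgets and catch airflow problems before thermal throttling cuts performance.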

The announcement positions the service as part of broader efforts to evolve management of modern Artificial Intelligence infrastructure as applications grow in number and complexity. Interested customers are invited to learn more at NVIDIA GTC, taking place March 16-19 in San Jose, California, where additional details and demonstrations are expected.
