NVIDIA is developing an opt-in, customer-installed software service that visualizes and monitors fleets of NVIDIA GPUs to help cloud partners and enterprises boost GPU uptime across computing infrastructures. The offering includes a client software agent that streams node-level GPU telemetry to a portal hosted on NVIDIA NGC. Customers can visualize fleet utilization globally or by compute zones, generate fleet reports, and use dashboard insights to address system bottlenecks and optimize productivity.
The service is built around an open-source client agent intended to provide transparency and auditability. The agent streams read-only telemetry data and cannot modify GPU configurations or underlying operations; telemetry remains customer-managed and customizable. According to the announcement, NVIDIA GPUs do not have hardware tracking technology, kill switches, or backdoors. Real-time monitoring works by having each GPU system share its GPU metrics with the external cloud service, enabling operators to spot errors and anomalies and identify failing parts early.
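To make the read-only, customer-managed telemetry model concrete, here is a minimal sketch in Python. It is not the actual agent: the `GpuSample` record, its field names, and the `build_payload` helper are hypothetical, and the sample values are made up for illustration. The sketch only shows the general idea of an immutable metrics record that a customer can filter down to an approved set of fields before anything leaves the node.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)  # frozen: a sample can be read but never mutated
class GpuSample:
    # Hypothetical field names, not the schema of the real agent.
    node: str
    gpu_index: int
    utilization_pct: float
    power_watts: float
    temperature_c: float


def build_payload(samples, fields):
    """Serialize samples to JSON, keeping only the customer-approved fields."""
    allowed = set(fields)
    rows = [
        {key: value for key, value in asdict(sample).items() if key in allowed}
        for sample in samples
    ]
    return json.dumps({"samples": rows}, sort_keys=True)


if __name__ == "__main__":
    sample = GpuSample("node-a", 0, 87.5, 310.0, 71.0)
    # The customer chooses to share power data but not temperature or utilization.
    print(build_payload([sample], fields=("node", "gpu_index", "power_watts")))
```

The payload is built from a copy of each record, so the code path that reports metrics has no handle through which it could change GPU state, which mirrors the one-way design the announcement describes.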
Operational capabilities highlighted include tracking spikes in power usage to stay within energy budgets while maximizing performance per watt; monitoring utilization, memory bandwidth, and interconnect health; detecting hotspots and airflow issues to avoid thermal throttling; and confirming consistent software configurations to ensure reproducible results. These features are presented as ways to visualize GPU inventory, address system bottlenecks, and increase return on investment for enterprises and cloud providers running heavy compute workloads.
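The checks described above can be sketched as simple functions over per-node reports. This is an illustrative example, not the service's implementation: the `NodeReport` record, the function names, the thresholds, and the version strings are all assumptions chosen for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeReport:
    # Hypothetical per-node summary; the real telemetry schema is not published here.
    node: str
    power_watts: float
    temperature_c: float
    driver_version: str


def over_power_budget(reports, budget_watts):
    """Return True when total fleet power draw exceeds the energy budget."""
    return sum(r.power_watts for r in reports) > budget_watts


def hotspots(reports, limit_c):
    """Return nodes at or above a temperature limit (possible throttling risk)."""
    return [r.node for r in reports if r.temperature_c >= limit_c]


def config_drift(reports):
    """Return nodes whose driver version differs from the most common one."""
    counts = Counter(r.driver_version for r in reports)
    if not counts:
        return []
    majority, _ = counts.most_common(1)[0]
    return [r.node for r in reports if r.driver_version != majority]


if __name__ == "__main__":
    fleet = [
        NodeReport("a", 300.0, 65.0, "550.54"),  # illustrative values only
        NodeReport("b", 420.0, 88.0, "550.54"),
        NodeReport("c", 310.0, 70.0, "535.10"),
    ]
    print(over_power_budget(fleet, budget_watts=1000.0))
    print(hotspots(fleet, limit_c=85.0))
    print(config_drift(fleet))
```

Treating the most common driver version as the reference is one simple design choice; an operator could equally compare each node against a pinned, known-good version.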
The announcement positions the service as part of broader efforts to evolve the management of modern AI infrastructure as applications grow in number and complexity. Interested customers are invited to learn more at NVIDIA GTC, taking place March 16-19 in San Jose, California, where additional details and demonstrations are expected.
