Inference is the runtime stage where trained models process inputs and produce outputs in real time. As modern artificial intelligence reasoning models grow in size and generate many more tokens per interaction, inference infrastructure must handle diverse workloads from single-shot queries to multistep reasoning involving millions of tokens. The article presents the Think SMART framework for evaluating inference: scale and complexity, multidimensional performance, architecture and software, return on investment driven by performance, and technology ecosystem and install base.
Multidimensional performance requires balancing throughput, latency, scalability, and cost efficiency. Some workloads demand ultra-low latency and a high tokens-per-second rate for each user, while others prioritize raw aggregate throughput. The piece recommends assessing throughput in tokens per second, per-request latency (time to first token and time per output token), the ability to scale from one GPU to thousands without wasted capacity, and sustained performance per dollar. NVIDIA positions its inference platform to reconcile these demands and cites benchmarks on models such as gpt-oss, DeepSeek-R1, and Llama 3.1.
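The metrics above are mechanically related, which is why they must be balanced rather than optimized in isolation. The sketch below (with hypothetical numbers; the function names and the $4/hour GPU price are illustrative assumptions, not figures from the article) shows how system throughput, per-user rate, and cost per million tokens are computed from one another:

```python
# Illustrative sketch (hypothetical numbers): relating the metrics the
# framework recommends tracking -- throughput, per-user rate, and cost
# per token -- for a single GPU serving batched requests.

def tokens_per_second(batch_size: int, tokens_per_request: int,
                      batch_latency_s: float) -> float:
    """System throughput: total tokens generated per second."""
    return batch_size * tokens_per_request / batch_latency_s

def cost_per_million_tokens(gpu_hour_cost: float, throughput_tps: float) -> float:
    """Sustained cost efficiency: dollars per one million generated tokens."""
    tokens_per_hour = throughput_tps * 3600
    return gpu_hour_cost * 1_000_000 / tokens_per_hour

# Hypothetical workload: 32 concurrent requests, 500 output tokens each,
# completed in 10 seconds on a $4/hour GPU.
tps = tokens_per_second(32, 500, 10.0)       # 1600 tokens/s system-wide
per_user = tps / 32                          # 50 tokens/s per user
cost = cost_per_million_tokens(4.0, tps)     # ~$0.69 per 1M tokens

print(f"{tps:.0f} tok/s, {per_user:.0f} tok/s/user, ${cost:.2f}/Mtok")
```

The tension is visible in the arithmetic: growing the batch raises system throughput and lowers cost per token, but it also tends to raise batch latency and lower each user's tokens per second, which is why different workloads land on different operating points.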
Architecture and software must be engineered together. The NVIDIA Blackwell platform and the GB200 NVL72 rack-scale system are presented as examples: the GB200 NVL72 connects 36 NVIDIA Grace CPUs and 72 Blackwell GPUs via NVLink, with claimed large gains in revenue potential, throughput, energy efficiency, and water efficiency. NVFP4 is described as a low-precision format that reduces energy, memory, and bandwidth demands. On the software side, the Dynamo orchestration platform enables dynamic autoscaling and routing for distributed inference and is said to deliver up to 4x more performance without added cost. TensorRT-LLM, PyTorch-centric workflows, and model packaging via NVIDIA NIM are highlighted for maximizing inference per GPU and simplifying deployment, with partners such as Baseten cited as achieving state-of-the-art performance.
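To make the NVFP4 claim concrete, the sketch below illustrates the general technique behind block-scaled 4-bit formats: store each weight in 4 bits and share one scale factor per small block. This is an illustration of the idea only, not NVIDIA's actual encoding (real NVFP4 uses an FP4 value grid with hardware-defined block scaling; the integer grid and block size here are simplifying assumptions):

```python
# Minimal sketch of block-scaled low-precision quantization, the general
# idea behind 4-bit formats such as NVFP4. Illustrative only: the actual
# NVFP4 format uses an FP4 value grid with hardware block scaling.

def quantize_block(values, qmax=7):
    """Map a block of floats to signed 4-bit ints in [-8, 7] plus one scale."""
    scale = max(abs(v) for v in values) / qmax or 1.0  # 1.0 for all-zero blocks
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize_block(q, scale):
    """Recover approximate floats from 4-bit codes and the shared scale."""
    return [x * scale for x in q]

# Hypothetical block of 8 weights.
block = [0.12, -0.40, 0.33, 0.05, -0.21, 0.40, -0.07, 0.18]
q, scale = quantize_block(block)
restored = dequantize_block(q, scale)

# Each value now occupies 4 bits plus a shared per-block scale, cutting
# memory and bandwidth roughly 4x versus 16-bit weights; the price is a
# quantization error bounded by scale / 2 per element.
max_err = max(abs(a - b) for a, b in zip(block, restored))
```

The bandwidth saving is where the energy claim comes from: moving a quarter of the bytes for each weight fetch reduces both memory traffic and the power spent on it.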
Return on investment is driven by performance improvements that translate into more tokens per watt and higher revenue per rack. The article cites a 4x performance increase from moving to Blackwell yielding up to 10x profit growth within a similar power budget, and notes industry-wide cost reductions such as a reported 80% drop in cost per million tokens from stack-wide optimizations. Finally, the technology ecosystem matters: open models and open-source projects accelerate adoption, with claims that open models drive over 70% of inference workloads and that NVIDIA contributes many projects, models, and datasets to community platforms to support diverse frameworks and deployment scenarios.
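The claim that a 4x performance gain can yield a larger-than-4x profit gain follows from operating leverage: revenue scales with tokens served while much of the rack's cost is fixed by the power and facility budget. A toy sketch (all numbers hypothetical, not from the article; the exact multiplier depends entirely on the assumed cost structure):

```python
# Illustrative sketch (hypothetical numbers) of operating leverage in
# inference economics: revenue scales with tokens served, while the
# rack's cost is roughly fixed by its power and facility budget.

def rack_profit(tokens_per_day: float, price_per_mtok: float,
                daily_cost: float) -> float:
    """Daily profit for one rack: token revenue minus fixed daily cost."""
    revenue = tokens_per_day / 1_000_000 * price_per_mtok
    return revenue - daily_cost

# Hypothetical baseline rack: 500M tokens/day sold at $2 per 1M tokens,
# against $800/day in power, depreciation, and operations.
base = rack_profit(500e6, 2.0, 800.0)        # $200/day profit

# A 4x throughput gain in the same power envelope quadruples revenue
# while the fixed cost stays roughly unchanged.
faster = rack_profit(4 * 500e6, 2.0, 800.0)  # $3200/day profit

print(faster / base)  # 16x profit on a 4x throughput gain in this toy case
```

In this toy cost structure the profit multiple exceeds even the 10x the article cites; the point is qualitative, not the specific number: with fixed costs in the denominator, profit grows faster than throughput.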