The Hao Artificial Intelligence Lab at the University of California San Diego has received an NVIDIA DGX B200 system to advance its work on large language model inference. The system, hosted at the San Diego Supercomputer Center within the School of Computing, Information and Data Sciences, is fully accessible to the lab and the broader UC San Diego community. Assistant Professor Hao Zhang describes the DGX B200 as one of NVIDIA's most powerful Artificial Intelligence systems and says it enables the team to prototype and experiment much faster than previous-generation hardware allowed.
The DGX B200 is already accelerating two flagship projects: FastVideo and Lmgame-Bench. FastVideo is training a family of video generation models designed to produce a five-second video from a given text prompt in just five seconds; its research phase uses NVIDIA H200 GPUs alongside the DGX B200 system. Lmgame-Bench is a benchmarking suite that evaluates large language models through popular games such as Tetris and Super Mario Bros, letting users test one model at a time or pit two models against each other. Other ongoing projects at the lab target low-latency large language model serving for real-time responsiveness, with doctoral candidate Junda Chen noting that the team is using the DGX B200's advanced hardware to explore the next frontier of low-latency serving.
The lab’s earlier DistServe work shaped a disaggregated inference approach that now influences platforms such as NVIDIA Dynamo, an open-source framework for scaling generative Artificial Intelligence models efficiently. DistServe promotes the metric of “goodput,” which counts throughput only for requests that meet user-specified latency service-level objectives, as a better indicator of system health than raw throughput alone. In typical large language model serving, the prefill phase (processing the input prompt) and the decode phase (generating output tokens) run on the same GPU, but the DistServe researchers showed that splitting them across different GPUs maximizes goodput. By separating compute-intensive prefill from memory-intensive decode onto two distinct sets of GPUs, a technique called prefill/decode disaggregation, they eliminate resource contention and let both phases run faster. This disaggregated inference method increases goodput, supports continuous workload scaling, and helps maintain low latency and high-quality responses. In parallel, cross-departmental collaborations in areas such as healthcare and biology are using the DGX B200 to accelerate diverse research projects as UC San Diego teams further explore how Artificial Intelligence platforms can drive innovation.
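The distinction between raw throughput and goodput can be made concrete with a minimal sketch. This is an illustration of the idea only, not DistServe's actual implementation or API; the function name, the request latencies, and the 250 ms SLO are assumptions chosen for the example.

```python
def goodput(latencies_s, slo_s, window_s):
    """Requests completing within the latency SLO, per second of wall time.

    latencies_s: observed per-request latencies in seconds
    slo_s: user-specified latency service-level objective in seconds
    window_s: length of the observation window in seconds
    """
    met = sum(1 for t in latencies_s if t <= slo_s)
    return met / window_s

# Hypothetical example: 10 requests observed over a 2-second window
# with a 250 ms SLO. Raw throughput counts all of them; goodput
# counts only the 7 that met the SLO.
latencies = [0.12, 0.30, 0.18, 0.22, 0.45, 0.09, 0.25, 0.60, 0.14, 0.21]
raw_throughput = len(latencies) / 2.0              # 5.0 requests/s
slo_goodput = goodput(latencies, slo_s=0.25, window_s=2.0)  # 3.5 requests/s
print(raw_throughput, slo_goodput)
```

A system can post a high raw throughput while many of its responses arrive too late to be useful; goodput penalizes exactly those late responses, which is why DistServe treats it as the better health indicator.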
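The prefill/decode disaggregation described above can be sketched as routing each phase of a request to its own dedicated GPU pool. This is a toy illustration under stated assumptions, not DistServe's scheduler: the pool names, the `Request` shape, and the simple modulo placement are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int   # drives the compute-intensive prefill phase
    output_tokens: int   # drives the memory-intensive decode phase

def route(request, prefill_pool, decode_pool):
    """Assign a request's two phases to GPUs drawn from disjoint pools.

    In a colocated design both phases would share one GPU and contend
    for its resources; here prefill and decode never touch the same
    device. Modulo placement is an arbitrary stand-in for a real policy.
    """
    prefill_gpu = prefill_pool[request.prompt_tokens % len(prefill_pool)]
    decode_gpu = decode_pool[request.output_tokens % len(decode_pool)]
    return prefill_gpu, decode_gpu

prefill_pool = ["gpu0", "gpu1"]   # hypothetical compute-oriented pool
decode_pool = ["gpu2", "gpu3"]    # hypothetical memory-oriented pool
p, d = route(Request(prompt_tokens=512, output_tokens=128),
             prefill_pool, decode_pool)
print(p, d)
```

Because the two pools are disjoint, a burst of long prompts saturating the prefill GPUs cannot slow token generation on the decode GPUs, which is the contention the disaggregated design removes.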
