Snap has adopted open data processing libraries from NVIDIA on Google Cloud services to accelerate development for Snapchat. Every new feature rolled out to Snapchat’s more than 940 million monthly active users goes through controlled experiments before launch, with teams studying different variables across a subset of users and measuring nearly 6,000 metrics tied to engagement, app performance and monetization.
Snap runs thousands of these experiments each month, processing over 10 petabytes of data within a three-hour window each morning using the Apache Spark distributed framework. By adopting Apache Spark accelerated by NVIDIA cuDF, the company is boosting these data processing workloads on NVIDIA GPUs to achieve 4x speedups in runtime with the same number of machines, providing a cost-effective path to scale. The stack combines NVIDIA’s GPU-optimized software, including NVIDIA CUDA-X libraries, with Google infrastructure services such as Google Kubernetes Engine.
Snap’s A/B testing system now runs on Apache Spark accelerated by cuDF, which lets developers run existing Spark applications on NVIDIA GPUs with no code changes. Snap said that, based on internal data collected between January 1 and February 28, the migration delivered 76% daily cost savings using NVIDIA GPUs on Google Kubernetes Engine compared with CPU-only workflows. The company said the shift from CPUs to GPUs helps it scale experimentation across more features, more metrics and more users over time. That covers both visible product changes and behind-the-scenes updates such as performance optimizations and compatibility updates for new operating system versions.
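To illustrate what "no code changes" can look like in practice, the sketch below builds the kind of configuration that turns on GPU acceleration for an existing Spark job. It assumes the acceleration is delivered through the RAPIDS Accelerator for Apache Spark plugin, which the article does not name; the helper functions and the default values are hypothetical and are not Snap’s settings.

```python
# A minimal sketch, assuming GPU acceleration is enabled through the
# RAPIDS Accelerator for Apache Spark plugin. The helper names and the
# default values are illustrative, not taken from Snap's deployment.

def gpu_spark_conf(gpus_per_executor=1, tasks_per_gpu=4):
    """Return Spark settings that enable GPU execution for an unchanged job."""
    return {
        # Loads the accelerator plugin so supported SQL/DataFrame
        # operations are planned for the GPU.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        # Switches GPU SQL acceleration on.
        "spark.rapids.sql.enabled": "true",
        # Standard Spark resource scheduling: GPUs requested per executor.
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        # Fraction of a GPU each task claims, so several tasks share one GPU.
        "spark.task.resource.gpu.amount": str(1 / tasks_per_gpu),
    }


def to_submit_args(conf):
    """Render the settings as spark-submit command-line arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(gpu_spark_conf())))
```

The point of the sketch is that the change sits in the job's launch configuration rather than in the application code itself.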
To support workload migration, Snap also used cuDF microservices that automatically qualify, test, configure and optimize Spark workloads for GPU acceleration at scale. Working with NVIDIA experts, the team optimized its pipelines on Google Cloud’s G2 virtual machines, powered by NVIDIA L4 GPUs, so that they required just 2,100 GPUs running concurrently rather than the roughly 5,500 initially projected, according to data Snap collected between January 1 and March 13. After migrating its two biggest pipelines, Snap plans to expand the Spark accelerator beyond the A/B testing team to a broader set of production workloads.
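The reported figures can be put on a common footing with a little arithmetic. The sketch below uses only the numbers quoted in the article (2,100 versus 5,500 concurrent GPUs, and 76% daily cost savings); the variable names are illustrative, and the derived ratios are simple consequences of those figures rather than numbers Snap reported.

```python
# A back-of-the-envelope sketch using only figures quoted in the article.

projected_gpus = 5_500   # initial projection of concurrent GPUs
actual_gpus = 2_100      # concurrent GPUs needed after optimization

# Share of the projected GPU fleet that turned out not to be needed.
gpu_reduction = 1 - actual_gpus / projected_gpus

daily_cost_savings = 0.76                      # versus CPU-only workflows
gpu_cost_fraction = 1 - daily_cost_savings     # GPU cost as a share of CPU cost
cpu_to_gpu_cost_ratio = 1 / gpu_cost_fraction  # how many times costlier CPU-only is

print(f"GPU count reduction vs. projection: {gpu_reduction:.1%}")
print(f"CPU-only daily cost relative to GPU: {cpu_to_gpu_cost_ratio:.2f}x")
```

On these numbers, the optimized pipelines need about 62% fewer GPUs than first projected, and the CPU-only workflow costs roughly 4.2 times as much per day as the GPU one.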
