Charles Srisuwananukorn, Founding Vice President of Engineering at Together AI, shared insights during a Chat8VC fireside chat on scaling physical infrastructure for advanced AI applications. Tracing his career from Snorkel AI and Apple to building core infrastructure at Together AI, he emphasized the distinct challenges of managing large physical GPU clusters compared with virtualized environments. This hands-on approach has become integral to Together AI’s growth and its mission to provide robust compute resources and infrastructure for foundation model development.
Srisuwananukorn discussed major gaps in the open-source ecosystem, particularly the scarcity of clean, high-quality datasets, which led Together AI to launch the RedPajama initiative. He also highlighted the need for better reinforcement learning tools as models become more sophisticated. Together AI’s clusters, equipped with recent GPUs such as H100s and H200s, serve both internal research and external client workloads, with customized orchestration and system performance optimized through proprietary software such as the Together Kernel Collection. This focus on deep technical optimization, spanning networking, kernel design, and systems reliability, lets clients train models faster and more efficiently, often with a notable performance boost out of the box.
As the company scales to tens of thousands of GPUs, Srisuwananukorn described tackling unexpected low-level operational challenges such as hardware reliability, overheating, and maintaining consistent performance. Automation is key, yet physical interventions, such as resolving hardware failures, remain necessary. On infrastructure flexibility, he addressed the evolving demand for both very large models and smaller, faster ones, noting Together AI’s investments in edge computing to reduce latency for real-world AI applications. Despite the operational pressures, Srisuwananukorn expressed optimism about recent breakthroughs in model accessibility, which allow increasingly sophisticated models to run on consumer hardware, and forecast a wave of innovation in the AI ecosystem.