Charles Srisuwananukorn Discusses Scaling Artificial Intelligence Infrastructure at Together AI

Charles Srisuwananukorn, VP of Engineering at Together AI, reveals the complexity and demands of building and operating physical infrastructure for cutting-edge Artificial Intelligence development.

Charles Srisuwananukorn, Founding Vice President of Engineering at Together AI, shared insights during a Chat8VC fireside chat about navigating the demands of scaling physical infrastructure for advanced Artificial Intelligence applications. Detailing his career journey—from impactful work at Snorkel AI and Apple to building core infrastructure at Together AI—he emphasized the unique challenges of managing large, physical GPU clusters, in contrast to virtualized environments. This hands-on approach has become integral to Together AI’s growth and mission to provide robust compute resources and infrastructure for foundational model development.

Srisuwananukorn discussed the major gaps in the open-source ecosystem, particularly the scarcity of clean, high-quality datasets, which led Together AI to launch the RedPajama initiative. He also highlighted the need for improved reinforcement learning tools as models become more sophisticated. Together AI’s clusters, equipped with the latest GPUs like H100s and H200s, are used for both internal research and external client workloads, offering customized orchestration and optimized system performance via proprietary software like the Together Kernel Collection. This focus on deep technical optimization—spanning networking, kernel design, and systems reliability—enables clients to achieve faster, more efficient model training, often delivering a notable performance boost out of the box.

As the company scales to tens of thousands of GPUs, Srisuwananukorn described tackling unexpected low-level operational challenges such as hardware reliability, overheating, and maintaining consistent performance. Automation is key, yet physical interventions, like resolving hardware failures, remain necessary. On infrastructure flexibility, he addressed the evolving demand for both giant and smaller, faster models, noting Together AI’s investments in edge computing to reduce latency for real-world Artificial Intelligence applications. Despite the operational pressures, Srisuwananukorn expressed optimism about recent breakthroughs in model accessibility, which allow increasingly sophisticated models to run on consumer hardware, and he forecast a wave of innovation in the Artificial Intelligence ecosystem.


