Computing has shifted through clear phases: first the CPU, then the GPU, then whole systems optimized for parallel workloads. The article frames the next phase as the "token factory", a systems-level idea born from the need to move vastly more data through large language models. In this view, tokens are the measurable output that matters; every architectural choice serves the goal of maximizing tokens per second. That shift changes how engineers define efficiency, and it elevates throughput above many traditional metrics.
The scale is extreme. The article cites xAI's Colossus 1 at 100,000 NVIDIA H100 GPUs and notes that Colossus 2 will use more than 550,000 NVIDIA GB200 and GB300 GPUs. These numbers are presented to show that modern deployments exist to produce tokens at an industrial rate. Historically, inference migrated from CPUs to GPUs and then to integrated systems like the NVIDIA NVL72. Today, entire facilities are treated as a single compute unit tuned to feed models with the largest possible stream of tokens.
Design and procurement decisions follow. When the primary metric is tokens per second, network topologies, cooling, power distribution, rack layouts, and software stacks are chosen to maximize sustained throughput for both training runs and later inference. The article stresses that the "token factory" is not a single component but an orchestrated combination of compute, interconnect, and infrastructure focused on token generation. That focus has downstream effects on how performance is reported, how capacity is forecast, and how future accelerators are evaluated.
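To make the tokens-per-second framing concrete, here is a minimal back-of-envelope sketch of how an operator might forecast fleet-level token output. The per-GPU rate and utilization figures are hypothetical assumptions chosen for illustration; the article gives GPU counts but no per-GPU throughput numbers.

```python
# Back-of-envelope "token factory" capacity estimate.
# All per-GPU throughput and utilization figures below are illustrative
# assumptions, not numbers taken from the article.

def fleet_tokens_per_second(num_gpus: int,
                            tokens_per_gpu_per_s: float,
                            utilization: float) -> float:
    """Sustained fleet-wide token throughput under a simple linear scaling model."""
    return num_gpus * tokens_per_gpu_per_s * utilization


def tokens_per_day(tokens_per_s: float) -> float:
    """Convert a sustained rate into daily output."""
    return tokens_per_s * 60 * 60 * 24


if __name__ == "__main__":
    # Hypothetical figures: 100,000 GPUs, 1,500 tokens/s per GPU, 60% sustained
    # utilization (allowing for batching gaps, networking, and maintenance).
    rate = fleet_tokens_per_second(num_gpus=100_000,
                                   tokens_per_gpu_per_s=1_500.0,
                                   utilization=0.6)
    print(f"Sustained throughput: {rate:,.0f} tokens/s")
    print(f"Daily output:         {tokens_per_day(rate):,.0f} tokens")
```

The linear model is deliberately crude: it ignores interconnect bottlenecks and workload mix, but it shows why GPU count, per-device throughput, and sustained utilization become the levers that design and procurement decisions are meant to maximize.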
There are broader implications for buyers and builders. Benchmarks will trend toward token-centric measures, vendors will optimize across system boundaries, and operators will trade versatility for specialized throughput. The "token factory" concept reframes data centers as production lines, where the unit of value is the token and the system is engineered to churn out as many as possible.
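As one illustration of what a token-centric measure could look like, the sketch below computes tokens per joule for a run. The article names no specific benchmark; the workload numbers here are hypothetical.

```python
# Example of a token-centric efficiency metric: tokens per joule.
# The run parameters are hypothetical and purely illustrative.

def tokens_per_joule(total_tokens: float,
                     avg_power_watts: float,
                     duration_s: float) -> float:
    """Tokens produced per joule of energy consumed over a measured run."""
    return total_tokens / (avg_power_watts * duration_s)


if __name__ == "__main__":
    # Hypothetical run: 2 billion tokens over one hour at 30 MW average draw.
    print(f"{tokens_per_joule(2e9, 30e6, 3600):.4f} tokens/J")
```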