Dion brings scalable orthonormal updates for large model training

Dion orthonormalizes only a top-rank subset of singular vectors to cut communication and compute, enabling faster training of large models such as LLaMA-3 with far less overhead.

Microsoft Research introduces Dion, an optimizer designed to bring the benefits of orthonormal updates to very large models while avoiding the heavy communication and compute costs that limited previous approaches. The method selectively enforces orthonormality on only the top r singular vectors of the momentum or update matrix, creating a new axis for scalability: rank. By focusing on a compact subspace, Dion reduces the matrix work required per step and integrates with common distributed strategies such as FSDP and tensor parallelism.

Under the hood Dion uses an amortized power iteration together with a QR decomposition to recover an approximate orthonormal basis spanning the leading singular directions. Amortization means the power iterations are distributed across optimization steps and operate on the slowly evolving momentum matrix, so each step needs only two matrix multiplications. A residual error feedback mechanism stores the low-rank approximation residual in the momentum, allowing systematic gradient structure that is not captured immediately to accumulate and be applied later. The result is an orthonormal update that is practical to compute in sharded, tensor-parallel setups, in some cases without ever materializing a full row or column of the parameter matrix.
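The two-matmul pattern described above can be sketched in a few lines. This is a minimal illustration of one amortized power-iteration step with QR orthonormalization and residual error feedback, assuming a momentum matrix M and a warm-started right basis Q carried between steps; the function name, the column-normalization detail, and the exact error-feedback scaling are assumptions for illustration, not the released Dion implementation.

```python
import torch

def orthonormal_update(M, Q):
    """One amortized power-iteration step (hypothetical sketch).

    M: momentum matrix (m x n), modified in place by error feedback.
    Q: right basis from the previous step (n x r), warm start.
    Returns a rank-r approximately orthonormal update and the new Q.
    """
    P = M @ Q                        # matmul 1: estimate the left subspace
    P, _ = torch.linalg.qr(P)        # orthonormalize the columns of P
    R = M.T @ P                      # matmul 2: corresponding right factor
    # Error feedback: keep the residual of the rank-r approximation in the
    # momentum so uncaptured gradient structure accumulates for later steps.
    M -= P @ R.T
    # Column-normalize R; the next step warm-starts from this basis.
    Q_new = R / R.norm(dim=0, keepdim=True).clamp_min(1e-8)
    update = P @ Q_new.T             # low-rank orthonormal-style update
    return update, Q_new
```

Because Q changes slowly between steps, a single power iteration per step is enough to track the leading singular directions, which is what makes the amortization cheap.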

Empirical results show an unusual scaling behavior. At small scales Dion can be slower than Muon, the recent orthonormal-style optimizer, but as models grow the precision of Dion's approximation compounds into a performance advantage. Larger batch sizes widen that gap in Dion's favor, and experiments suggest that the required rank grows much more slowly than parameter count. Projections for LLaMA-3 405B indicate effective operation with rank fractions as low as 1/16 or 1/64, which translates into orders-of-magnitude lower wall-clock cost per step compared with Muon in some settings.

The team has open-sourced a PyTorch FSDP2 plus tensor parallel implementation and ships a pip-installable package. The repository also includes Muon for comparison. Documentation and code aim to make it straightforward for researchers and engineers to test Dion on large pretraining runs or fine-tuning workloads, with the promise of more efficient training at scale.


How Artificial Intelligence is reshaping financial services oversight

Financial services regulators are largely treating Artificial Intelligence as another technology governed by existing rules rather than building new securities-specific frameworks. History suggests that clearer expectations will emerge through examinations, enforcement, and supervisory guidance.

Nvidia faces gamer backlash over Artificial Intelligence shift

Nvidia is facing growing frustration from gamers as memory supply is steered toward data center chips and DLSS 5 becomes more central to game performance. The dispute highlights how far the company’s priorities have shifted toward enterprise Artificial Intelligence.

Executives see limited Artificial Intelligence productivity gains so far

Corporate enthusiasm around Artificial Intelligence has yet to translate into broad gains in employment or productivity, reviving comparisons to the long lag between early computing breakthroughs and measurable economic impact. Recent surveys and studies show mixed results, with strong expectations for future benefits but little consensus on present gains.

Nvidia skips a new GeForce generation as Artificial Intelligence chips dominate

Nvidia is set to go a year without a new GeForce GPU generation for the first time since the 1990s as memory shortages and higher margins in Artificial Intelligence hardware reshape the market. AMD and Intel are also struggling to capitalize because the same supply constraints are hitting gaming products across the industry.

Where GPU debt starts to break

Stress in GPU-backed infrastructure financing is emerging around deals that lack the structural protections seen in the strongest transactions. Oracle, the Abilene Stargate project, and older CoreWeave debt illustrate different ways residual risk can surface when contracts, collateral, and counterparties fall short.
