Microsoft Research introduces `Dion`, an optimizer designed to bring the benefits of orthonormal updates to very large models while avoiding the heavy communication and compute costs that limited previous approaches. The method enforces orthonormality only on the top-r singular directions of the momentum or update matrix, creating a new axis for scalability: the rank. By focusing on a compact subspace, Dion reduces the matrix work required per step and integrates with common distributed strategies such as FSDP and tensor parallelism.
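In its idealized, exact-SVD form, such an update keeps only the leading singular directions and flattens their singular values to one. The short sketch below illustrates that target quantity only; it is not Dion's algorithm, which avoids computing a full SVD.

```python
# Idealized target: an orthonormal update restricted to the top-r singular
# directions of the momentum matrix (Dion approximates this without a full SVD).
import torch

M = torch.randn(1024, 512)                 # momentum / update matrix
r = 32                                     # chosen rank
U, S, Vh = torch.linalg.svd(M, full_matrices=False)
update = U[:, :r] @ Vh[:r, :]              # keep the top-r directions, drop their magnitudes
# Every nonzero singular value of `update` equals 1, i.e. it is orthonormal
# on the leading rank-r subspace.
```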
Under the hood, Dion uses an amortized power iteration together with a QR decomposition to recover an approximate orthonormal basis spanning the leading singular directions. Amortization means the power iterations are spread across optimization steps and operate on the slowly evolving momentum matrix, so each step needs only two matrix multiplications. An error-feedback mechanism keeps the low-rank approximation residual in the momentum buffer, so systematic gradient structure that is not captured in one step accumulates and is applied in later steps. The result is an orthonormal update that is practical to compute in sharded, tensor-parallel setups, in some cases without ever materializing a full row or column of the parameter matrix on a single device.
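A minimal sketch of that pattern is below: one power-iteration step against the momentum (two matrix multiplications), a QR factorization to orthonormalize the basis, and error feedback that removes only the captured rank-r component. The function and variable names are illustrative assumptions, and the exact scaling and momentum details of the published Dion update rule differ.

```python
import torch

def low_rank_orthonormal_step(M, G, Q, mu=0.95, lr=0.01):
    """One illustrative step for a single (m, n) parameter matrix.

    M: (m, n) momentum buffer carried across steps
    G: (m, n) current gradient
    Q: (n, r) right basis reused across steps (amortized power iteration)
    Returns (delta, M, Q): the weight update and the new optimizer state.
    """
    M = M + G                                  # fold the gradient into the momentum

    # One amortized power-iteration step: two matmuls, no SVD.
    P = M @ Q                                  # (m, r) left factor
    P, _ = torch.linalg.qr(P)                  # orthonormalize its columns
    R = M.T @ P                                # (n, r) refreshed right factor

    # Error feedback: subtract only the rank-r component that was captured,
    # so the residual stays in the momentum and is applied in later steps.
    M = mu * (M - P @ R.T)

    # Orthonormalize the right factor; P @ Q_new.T then has all nonzero
    # singular values equal to 1, i.e. it is an orthonormal rank-r update.
    Q_new, _ = torch.linalg.qr(R)              # (n, r)
    delta = -lr * (P @ Q_new.T)
    return delta, M, Q_new

# Usage: the state is a zero momentum buffer plus a random orthonormal basis.
m, n, r = 512, 256, 16
M = torch.zeros(m, n)
Q = torch.linalg.qr(torch.randn(n, r)).Q
delta, M, Q = low_rank_orthonormal_step(M, torch.randn(m, n), Q)
```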
Empirical results show an unusual scaling behavior. At small scales Dion can be slower than Muon, a recent optimizer that also applies orthonormalized updates, but as models grow, the precision of Dion's approximation compounds into a performance advantage. Larger batch sizes widen that gap in Dion's favor, and experiments suggest that the required rank grows much more slowly than the parameter count. Projections for LLaMA-3 405B indicate effective operation at rank fractions as low as 1/16 or 1/64, which translates into orders-of-magnitude lower wall-clock cost per step than Muon in some settings.
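The rank dependence of the per-step compute is easy to see with a back-of-the-envelope count. The numbers below use an assumed matrix size and a matmul-only cost model, not the paper's accounting, and they ignore the communication savings that contribute much of the wall-clock gap.

```python
# Illustrative arithmetic: the two power-iteration matmuls cost ~O(m*n*r),
# so matrix work per step falls linearly with the rank fraction.
m, n = 16384, 16384                    # assumed square weight matrix, not an actual LLaMA-3 shape

def step_matmul_flops(r):
    return 2 * (2 * m * n * r)         # two (m,n)-by-(n,r)-sized products per step

full_rank = step_matmul_flops(min(m, n))
for frac in (16, 64):
    r = min(m, n) // frac
    print(f"rank fraction 1/{frac}: {full_rank / step_matmul_flops(r):.0f}x less matmul work per step")
```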
The team has open-sourced a PyTorch FSDP2 plus tensor-parallel implementation and ships a pip-installable package. The repository also includes Muon for comparison. Documentation and code aim to make it straightforward for researchers and engineers to test `Dion` on large pretraining runs or fine-tuning workloads, with the promise of more efficient training at scale.