Researchers at MIT and the MIT-IBM Watson AI Lab have developed a new attention and positional encoding technique for large language models called PaTH Attention, which addresses key weaknesses in how transformers track word order, state changes, and long-range dependencies. While standard attention lets a model look back over an input sequence to determine which tokens matter most, it has no inherent sense of order, so transformers rely on positional encodings such as rotary position embeddings, known as RoPE. RoPE applies fixed mathematical rotations determined solely by the relative distance between tokens, independent of their content, which limits its ability to handle complex, evolving structures in language, code, or conditional instructions.
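To see what "determined solely by relative distance" means in practice, here is a minimal NumPy sketch of the rotary idea. It is a simplified illustration, not the code used in any particular model (the pairing of dimensions and frequency schedule follow one common convention): each query and key vector is rotated by angles that depend only on its position, so the score between two rotated vectors depends on their offset and not on what the tokens say.

```python
import numpy as np

def rope_rotate(x, position, base=10000.0):
    """Rotate vector x by angles set only by its position: pair up the
    dimensions and spin each pair at its own frequency. Token content
    never enters the choice of angle."""
    d = x.shape[-1]
    assert d % 2 == 0, "RoPE pairs dimensions, so d must be even"
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

# The score between a rotated query and key depends only on their offset.
rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
near = rope_rotate(q, 5) @ rope_rotate(k, 2)      # positions 5 and 2
far = rope_rotate(q, 105) @ rope_rotate(k, 102)   # same offset of 3
print(np.isclose(near, far))  # True: only relative distance matters
```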
PaTH Attention makes positional information adaptive and context-aware by treating the sequence between two words as a path composed of many small, data-dependent transformations. Each transformation is based on a Householder reflection, described as a tiny mirror that adjusts according to the content of each token it passes, so that every step in the sequence can influence how later information is interpreted. The cumulative effect allows the model to track how entities and relationships evolve along the path between words, giving transformers a form of positional memory rather than just a notion of distance. To make this practical at scale, the team also designed a hardware-efficient algorithm that compresses the cumulative PaTH transformation and breaks it into smaller computations compatible with fast processing on GPUs, preserving efficiency while increasing expressivity.
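The following NumPy sketch illustrates the cumulative, content-dependent idea under simplifying assumptions; it is not the authors' implementation. The per-token vectors here are random stand-ins for whatever the model derives from each token, the reflection strength is fixed at the classic value of 2 rather than learned, and the real method replaces the explicit matrix products below with the compressed, GPU-friendly computation described above.

```python
import numpy as np

def householder(v, beta=2.0):
    """I - beta * v v^T with v normalized: for beta = 2 this is an exact
    Householder reflection, the 'tiny mirror' whose orientation comes
    from the token itself."""
    v = v / np.linalg.norm(v)
    return np.eye(v.size) - beta * np.outer(v, v)

def path_transform(token_vectors, start, end):
    """Multiply together the reflections contributed by each token from
    start + 1 through end. The product encodes the 'path' the sequence
    takes between the two positions."""
    d = token_vectors.shape[1]
    P = np.eye(d)
    for t in range(start + 1, end + 1):
        P = householder(token_vectors[t]) @ P
    return P

# Two sequences with the same distance between positions 0 and 4 but
# different intervening content get different relative transforms.
rng = np.random.default_rng(0)
d = 6
seq_a = rng.normal(size=(5, d))
seq_b = seq_a.copy()
seq_b[2] = rng.normal(size=d)        # change one token along the path
P_a = path_transform(seq_a, 0, 4)
P_b = path_transform(seq_b, 0, 4)
print(np.allclose(P_a, P_b))         # False: same offset, different path
```

Because the product of reflections depends on every token along the way, two token pairs at the same distance can receive different relative transforms, which is exactly the content-awareness that fixed rotations lack.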
The MIT-IBM team evaluated PaTH Attention on synthetic and real-world benchmarks, including reasoning tasks, long-context evaluations, and full large language model training, to test whether it improves tracking of information over time. They examined how well the method handled tasks such as following the most recent write command amid many distracting steps and multi-step recall problems that are challenging for fixed schemes like RoPE, and they trained mid-size large language models to compare against alternative encodings. PaTH Attention achieved lower perplexity than the alternatives and outperformed them on reasoning benchmarks it was not explicitly trained on, and it showed strong content-awareness on retrieval, reasoning, and stability tests with inputs containing tens of thousands of tokens. The researchers then combined PaTH Attention with the Forgetting Transformer, or FoX, to create PaTH-FoX, which selectively down-weights less relevant information in a data-dependent way, as sketched below, yielding strong performance across reasoning, long-context understanding, and language modeling tasks while maintaining transformer scalability. Senior author Yoon Kim situates this work within a broader push for new general-purpose building blocks in AI architectures that enhance accuracy, expressivity, flexibility, and hardware scalability, and suggests that data-dependent positional encodings like PaTH could be especially impactful in structured domains such as biology.
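To picture how that data-dependent down-weighting can work, here is an illustrative NumPy sketch of a forget-gate bias added to attention scores. It follows the general forget-gate idea rather than the published FoX implementation, and the gate values are invented for the demonstration.

```python
import numpy as np

def forgetting_attention(q, k, forget_gates):
    """Causal attention weights with a data-dependent decay: each token
    emits a gate in (0, 1), and the product of gates between a key and a
    later query shrinks that key's score before the softmax."""
    n, d = q.shape
    logits = q @ k.T / np.sqrt(d)
    cum_log_f = np.cumsum(np.log(forget_gates))      # prefix sums of log-gates
    # Bias for query i over key j (j <= i): sum of log-gates for j < t <= i.
    bias = cum_log_f[:, None] - cum_log_f[None, :]
    causal = np.tril(np.ones((n, n), dtype=bool))
    logits = np.where(causal, logits + bias, -np.inf)
    logits -= logits.max(axis=1, keepdims=True)      # stable softmax
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

# A token with a small gate (here position 2, value 0.2) makes everything
# before it fade sharply for all later queries.
rng = np.random.default_rng(0)
n, d = 6, 4
q, k = rng.normal(size=(n, d)), rng.normal(size=(n, d))
gates = np.array([0.9, 0.9, 0.2, 0.9, 0.9, 0.9])
print(forgetting_attention(q, k, gates).round(3))
```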
