Researchers at MIT and the MIT-IBM Watson AI Lab have developed a new attention and positional encoding technique for large language models called PaTH Attention, which addresses key weaknesses in how transformers track word order, state changes, and long-range dependencies. While standard attention lets a model look back over an input sequence to determine which tokens matter most, it has no inherent sense of order, so transformers rely on positional encodings such as rotary position embeddings, known as RoPE. RoPE applies fixed mathematical rotations determined solely by the relative distance between tokens, independent of their content, which limits its ability to handle complex, evolving structures in language, code, or conditional instructions.
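To see what "determined solely by relative distance" means in practice, here is a minimal NumPy sketch of the rotary idea. It is a simplified illustration, not the code used in any particular model (the pairing of dimensions and frequency schedule follow one common convention): each query and key vector is rotated by angles that depend only on its position, so the score between two rotated vectors depends on their offset and not on what the tokens say.

```python
import numpy as np

def rope_rotate(x, position, base=10000.0):
    """Rotate vector x by angles set only by its position: pair up the
    dimensions and spin each pair at its own frequency. Token content
    never enters the choice of angle."""
    d = x.shape[-1]
    assert d % 2 == 0, "RoPE pairs dimensions, so d must be even"
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

# The score between a rotated query and key depends only on their offset.
rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
near = rope_rotate(q, 5) @ rope_rotate(k, 2)      # positions 5 and 2
far = rope_rotate(q, 105) @ rope_rotate(k, 102)   # same offset of 3
print(np.isclose(near, far))  # True: only relative distance matters
```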
PaTH Attention makes positional information adaptive and context-aware by treating the sequence between two words as a path composed of many small, data-dependent transformations. Each transformation is based on a Householder reflection, described as a tiny mirror that adjusts according to the content of each token it passes, so that every step in the sequence can influence how later information is interpreted. The cumulative effect allows the model to track how entities and relationships evolve along the path between words, giving transformers a form of positional memory rather than just a notion of distance. To make this practical at scale, the team also designed a hardware-efficient algorithm that compresses the cumulative PaTH transformation and breaks it into smaller computations compatible with fast processing on GPUs, preserving efficiency while increasing expressivity.
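The following NumPy sketch illustrates the cumulative, content-dependent idea under simplifying assumptions; it is not the authors' implementation. The per-token vectors here are random stand-ins for whatever the model derives from each token, the reflection strength is fixed at the classic value of 2 rather than learned, and the real method replaces the explicit matrix products below with the compressed, GPU-friendly computation described above.

```python
import numpy as np

def householder(v, beta=2.0):
    """I - beta * v v^T with v normalized: for beta = 2 this is an exact
    Householder reflection, the 'tiny mirror' whose orientation comes
    from the token itself."""
    v = v / np.linalg.norm(v)
    return np.eye(v.size) - beta * np.outer(v, v)

def path_transform(token_vectors, start, end):
    """Multiply together the reflections contributed by each token from
    start + 1 through end. The product encodes the 'path' the sequence
    takes between the two positions."""
    d = token_vectors.shape[1]
    P = np.eye(d)
    for t in range(start + 1, end + 1):
        P = householder(token_vectors[t]) @ P
    return P

# Two sequences with the same distance between positions 0 and 4 but
# different intervening content get different relative transforms.
rng = np.random.default_rng(0)
d = 6
seq_a = rng.normal(size=(5, d))
seq_b = seq_a.copy()
seq_b[2] = rng.normal(size=d)        # change one token along the path
P_a = path_transform(seq_a, 0, 4)
P_b = path_transform(seq_b, 0, 4)
print(np.allclose(P_a, P_b))         # False: same offset, different path
```

Because the product of reflections depends on every token along the way, two token pairs at the same distance can receive different relative transforms, which is exactly the content-awareness that fixed rotations lack.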
The MIT-IBM team evaluated PaTH Attention on synthetic and real-world benchmarks, including reasoning tasks, long-context evaluations, and full large language model training, to test whether it improves tracking of information over time. They examined how well the method handled tasks such as following the most recent write command amid many distracting steps and multi-step recall problems that are challenging for fixed schemes like RoPE, and they trained mid-size large language models to compare against alternative encodings. PaTH Attention achieved lower perplexity than the alternatives and outperformed them on reasoning benchmarks it was not explicitly trained on, and it showed strong content-awareness on retrieval, reasoning, and stability tests with inputs containing tens of thousands of tokens. The researchers then combined PaTH Attention with the Forgetting Transformer, or FoX, to create PaTH-FoX, which selectively down-weights less relevant information in a data-dependent way, as sketched below, yielding strong performance across reasoning, long-context understanding, and language modeling tasks while maintaining transformer scalability. Senior author Yoon Kim situates this work within a broader push for new general-purpose building blocks in AI architectures that enhance accuracy, expressivity, flexibility, and hardware scalability, and suggests that data-dependent positional encodings like PaTH could be especially impactful in structured domains such as biology.
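To picture how that data-dependent down-weighting can work, here is an illustrative NumPy sketch of a forget-gate bias added to attention scores. It follows the general forget-gate idea rather than the published FoX implementation, and the gate values are invented for the demonstration.

```python
import numpy as np

def forgetting_attention(q, k, forget_gates):
    """Causal attention weights with a data-dependent decay: each token
    emits a gate in (0, 1), and the product of gates between a key and a
    later query shrinks that key's score before the softmax."""
    n, d = q.shape
    logits = q @ k.T / np.sqrt(d)
    cum_log_f = np.cumsum(np.log(forget_gates))      # prefix sums of log-gates
    # Bias for query i over key j (j <= i): sum of log-gates for j < t <= i.
    bias = cum_log_f[:, None] - cum_log_f[None, :]
    causal = np.tril(np.ones((n, n), dtype=bool))
    logits = np.where(causal, logits + bias, -np.inf)
    logits -= logits.max(axis=1, keepdims=True)      # stable softmax
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

# A token with a small gate (here position 2, value 0.2) makes everything
# before it fade sharply for all later queries.
rng = np.random.default_rng(0)
n, d = 6, 4
q, k = rng.normal(size=(n, d)), rng.normal(size=(n, d))
gates = np.array([0.9, 0.9, 0.2, 0.9, 0.9, 0.9])
print(forgetting_attention(q, k, gates).round(3))
```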
