PaTH Attention: The Next Leap in Context-Aware Language Models

Large language models have transformed artificial intelligence, powering everything from chatbots to automated code generation. Yet, even the most advanced models often struggle to follow evolving states or relationships in long passages, such as tracing a character’s transformation in a novel or monitoring changing variables in a program.
This challenge stems from how traditional models encode position and context, relying heavily on mechanisms like rotary position embedding (RoPE). While RoPE helps models keep track of word order, it applies the same mathematical transformation to every pair of tokens a given distance apart, regardless of their content, missing the subtle shifts in meaning that occur across lengthy texts.
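To see what “the same transformation at a given distance” means in practice, here is a minimal NumPy sketch of the rotary idea on a single pair of features. The frequency and the toy query/key vectors are made up for illustration; the point is that two query-key pairs sitting the same number of positions apart get exactly the same positional treatment, no matter what the tokens say.

```python
# Minimal sketch of the rotary idea on a single 2-D feature pair. The
# frequency `theta` and the query/key vectors are toy values for illustration.
import numpy as np

def rotate(vec, pos, theta=0.1):
    """Rotate a 2-D vector by an angle proportional to its position."""
    angle = pos * theta
    rot = np.array([[np.cos(angle), -np.sin(angle)],
                    [np.sin(angle),  np.cos(angle)]])
    return rot @ vec

q = np.array([1.0, 0.5])   # toy query features
k = np.array([0.3, 0.8])   # toy key features

# The score between a query and a key depends only on how far apart they are,
# not on which tokens sit at or between those positions.
score_a = rotate(q, 10) @ rotate(k, 7)      # positions 10 and 7: gap of 3
score_b = rotate(q, 110) @ rotate(k, 107)   # positions 110 and 107: gap of 3
print(np.isclose(score_a, score_b))         # True: identical positional effect
```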
The Breakthrough: Introducing PaTH Attention
Researchers at the MIT-IBM Watson AI Lab have unveiled PaTH Attention, a novel encoding approach that makes positional information adaptive and context-aware. Instead of using a fixed rotation for every token pair, PaTH Attention uses a sequence of small, data-driven transformations known as Householder reflections.
These act like adjustable mirrors, dynamically reflecting the unique context between words as the model reads. The result is a “positional memory” that evolves with the narrative, allowing the language model to track entities and their relationships more effectively.
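The sketch below shows the core of this idea, with random vectors standing in for what a trained model would actually compute from each token: every token contributes its own reflection, so the transformation linking two positions is the product of the reflections in between and depends on the intervening content, not just on distance. The helper function and all values here are illustrative, not the paper’s implementation.

```python
# Sketch of content-dependent position information via Householder reflections.
# The per-token vectors are random stand-ins for what a trained model would
# produce; only the structure of the computation is the point.
import numpy as np

def householder(w):
    """Return the reflection I - 2 * w w^T / (w^T w) for a nonzero vector w."""
    w = w / np.linalg.norm(w)
    return np.eye(len(w)) - 2.0 * np.outer(w, w)

rng = np.random.default_rng(0)
d, seq_len = 4, 6
token_vectors = rng.normal(size=(seq_len, d))      # one vector per token

# One reflection per token; the transform relating two positions is the
# product of the reflections contributed by the tokens in between.
transforms = [householder(w) for w in token_vectors]

span_a = transforms[3] @ transforms[2] @ transforms[1]   # spans tokens 1-3
span_b = transforms[5] @ transforms[4] @ transforms[3]   # spans tokens 3-5
# Same distance, different content, hence (almost surely) different transforms.
print(np.allclose(span_a, span_b))                       # False
```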
This innovative method is also designed for efficiency. The team developed an algorithm that compresses and distributes the necessary computations, so even as input sequences grow longer, models can process them without sacrificing accuracy or speed.
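As a rough illustration of why chaining many reflections does not have to be expensive, the snippet below checks a standard linear-algebra identity (the “compact WY” form): a product of several Householder reflections collapses into a single expression built from one small triangular matrix, the kind of compression that makes blockwise computation feasible. This is a generic sketch for intuition, not the team’s actual algorithm.

```python
# Generic check of the "compact WY" identity: a chain of Householder
# reflections H_1 H_2 ... H_k equals I - V T V^T for a small triangular T,
# so the chain can be applied with a few compact matrix multiplies.
import numpy as np

rng = np.random.default_rng(0)
d, k = 6, 4
vs = rng.normal(size=(k, d))
vs /= np.linalg.norm(vs, axis=1, keepdims=True)    # unit vectors, so tau = 2

# Naive route: multiply one dense d x d reflection at a time.
naive = np.eye(d)
for v in vs:
    naive = naive @ (np.eye(d) - 2.0 * np.outer(v, v))

# Compact route: build a k x k upper-triangular T once.
V = vs.T                                           # d x k matrix of vectors
T = np.zeros((k, k))
for j in range(k):
    v = V[:, j]
    T[:j, j] = -2.0 * T[:j, :j] @ (V[:, :j].T @ v)
    T[j, j] = 2.0
compact = np.eye(d) - V @ T @ V.T

print(np.allclose(naive, compact))                 # True
```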
Proving the Power of PaTH Attention
To test their innovation, researchers subjected PaTH Attention to a mix of synthetic reasoning tasks and real-world language modeling benchmarks. The results were clear: PaTH Attention consistently outperformed RoPE and other popular techniques, particularly on tasks that demanded instruction following, recent event recall, and handling of extremely long texts.
Key improvements included lower perplexity scores, meaning the models predicted upcoming words more accurately, and a stronger ability to keep track of what is happening across complex sequences. Even on benchmarks they were not tuned for, models using PaTH Attention maintained strong performance, showcasing the method’s flexibility and generalizability.
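For readers who want the number behind “perplexity”: it is the exponential of the average negative log-probability the model assigns to the true next tokens, so lower is better. A tiny sketch with made-up probabilities:

```python
# Perplexity is the exponential of the average negative log-probability a
# model assigns to the true next tokens; lower means the text surprised the
# model less. The probabilities below are made-up stand-ins.
import numpy as np

token_probs = np.array([0.40, 0.25, 0.60, 0.10, 0.35])  # P(correct next token)
perplexity = np.exp(-np.mean(np.log(token_probs)))
print(round(float(perplexity), 2))   # about 3.4; a better model drives this down
```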
Going Further: Selective Forgetting with PaTH-FoX
Building on this foundation, the team combined PaTH Attention with the Forgetting Transformer (FoX). This hybrid, called PaTH-FoX, introduces a “selective forgetting” capability, helping models focus on the most relevant context while discarding distractions.
Mimicking a key trait of human cognition, PaTH-FoX further boosts performance in reasoning and long-context understanding, making transformer models even more expressive, adaptable, and efficient.
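As a rough, illustrative sketch of what a forget gate can look like (not the trained model’s actual gates), the snippet below damps attention to earlier tokens by the accumulated gate values of everything in between, so distant or distracting content gradually fades out of view.

```python
# Illustrative sketch of a forget gate: every token emits a value in (0, 1),
# and attention back to an earlier token is damped by the accumulated gates of
# the tokens in between. All numbers here are random stand-ins.
import numpy as np

rng = np.random.default_rng(1)
seq_len = 5
scores = rng.normal(size=(seq_len, seq_len))   # raw attention logits
forget = rng.uniform(0.7, 1.0, size=seq_len)   # per-token forget gates

# Looking back from position i to position j picks up the product of the
# gates between them, i.e. a sum of log-gates, which shrinks distant scores.
cum = np.cumsum(np.log(forget))
decay = cum[:, None] - cum[None, :]
causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
gated = np.where(causal, scores + decay, -np.inf)

weights = np.exp(gated - gated.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 3))   # rows sum to 1; distant columns are down-weighted
```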
The Road Ahead: Smarter, Scalable AI
As language models become crucial tools across industries, from biomedical research to financial analysis, the need for scalable, expressive, and contextually aware AI is greater than ever. Yoon Kim, senior author and MIT associate professor, notes that advances like PaTH Attention form the next generation of AI “building blocks.” This work aligns with broader goals in AI research: balancing accuracy, expressivity, flexibility, and hardware efficiency to unlock new possibilities for intelligent systems.
PaTH Attention and its offshoots could mark a turning point in how AI understands and reasons about language and structured data, bringing machine learning closer to the nuanced way humans process information.
