A new way to extend the power of large language models

Most languages use word order and sentence structure to convey meaning. For example, “The cat was sitting on the box” is not the same as “The box was sitting on the cat.” In a long text such as a financial document or a novel, the syntactic relationships among these words are likely to evolve.

Similarly, a person can keep track of variables in a piece of code or follow instructions that involve conditional actions. These are examples of state tracking and sequential reasoning, capabilities we expect state-of-the-art AI systems to have. However, the attention mechanism in transformers, the main architecture underlying large language models (LLMs), has theoretical and empirical limitations when it comes to such capabilities.

The attention mechanism allows an LLM to look back at earlier parts of a query or document and, based on its training, determine which words and details matter most; however, the mechanism itself has no notion of word order. It “sees” all the input words, or tokens, at once and treats them as an unordered set, so researchers have developed techniques for encoding position information, which is crucial for a highly structured domain like language. The dominant position encoding method, called rotary position embedding (RoPE), considers only the relative distance between tokens in a sequence and is independent of the input data. This means that, for example, words four positions apart, such as “cat” and “box” in the example above, always receive the same fixed mathematical rotation for that relative distance, regardless of what lies between them.
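As a rough illustration of that limitation, the sketch below applies a RoPE-style rotation to a pair of query/key features in NumPy; the dimensions, angle scale, and token values are illustrative assumptions, not the method's actual implementation. Two token pairs with the same four-position gap contribute the same attention score, no matter what the tokens or the words between them are.

```python
# Minimal sketch of a RoPE-style rotation on a 2-D feature pair.
# All values here are illustrative assumptions, not real model weights.
import numpy as np

def rope_rotate(vec2d, position, theta=0.1):
    """Rotate a 2-D feature pair by an angle that depends only on position."""
    angle = position * theta
    rot = np.array([[np.cos(angle), -np.sin(angle)],
                    [np.sin(angle),  np.cos(angle)]])
    return rot @ vec2d

q = np.array([1.0, 0.0])  # a query feature pair, say for "cat"
k = np.array([1.0, 0.0])  # a key feature pair, say for "box"

# The score depends only on the relative distance (here, 4 positions),
# never on what the intervening tokens actually say.
score_a = rope_rotate(q, position=2) @ rope_rotate(k, position=6)
score_b = rope_rotate(q, position=10) @ rope_rotate(k, position=14)
print(np.isclose(score_a, score_b))  # True: same gap, same effect
```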

Now, research by MIT and the MIT-IBM Watson AI Lab has produced an encoding technique known as “PaTH Attention,” which makes position information adaptive and contextual, rather than static as with RoPE.

“Transformers enable accurate and scalable modeling of many domains, but they have some limitations with respect to state tracking, a class of phenomena that is believed to underlie important capabilities we expect of our artificial intelligence systems. So the important question is: How can we maintain the scalability and performance of transformers while still enabling state tracking?” says the paper's senior author Yoon Kim, an associate professor in the Department of Electrical Engineering and Computer Science (EECS), a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL), and a researcher at the MIT-IBM Watson AI Lab.

A new paper on this work was presented earlier this month at the Conference on Neural Information Processing Systems (NeurIPS). Kim's co-authors include lead author Songlin Yang, an EECS graduate student and former MIT-IBM Watson AI Lab Summer Program intern; Kaiyue Wen of Stanford University; Liliang Ren of Microsoft; and Yikang Shen, Shawn Tan, Mayank Mishra and Rameswar Panda of IBM Research and MIT-IBM Watson AI Lab.

The path to understanding

Instead of assigning each word a fixed rotation based on the relative distance between tokens, as RoPE does, PaTH Attention is flexible and treats the intervening words as a path composed of small, data-dependent transformations. Each transformation, based on a mathematical operation called a Householder reflection, acts like a tiny mirror that adjusts depending on the content of each token it passes. Each step in the sequence can influence how the model later interprets the information, and the cumulative effect allows the system to model how meaning changes along the path between words, not just the distance between them. This lets transformers track how entities and relationships change over time, providing a sense of “positional memory.” Think of it as following a path and experiencing your surroundings and how they affect you.

The team also developed a hardware-efficient algorithm for computing the attention scores between each pair of tokens: the cumulative mathematical transformation in PaTH Attention is compressed and broken into smaller calculations, making it compatible with fast processing on GPUs.
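The toy NumPy sketch below is one way to picture the core idea: each token contributes a Householder reflection built from its own representation, and the transformation applied between two positions is the accumulated product of the reflections along the path between them. The random token vectors, the dimensionality, and the way the reflections are constructed here are illustrative assumptions, not the paper's actual parameterization or its hardware-efficient algorithm.

```python
# Sketch of data-dependent, path-like position encoding via Householder
# reflections. Token vectors and dimensions are random stand-ins.
import numpy as np

def householder(v):
    """Return the Householder reflection I - 2 v v^T / (v^T v)."""
    v = v / np.linalg.norm(v)
    return np.eye(len(v)) - 2.0 * np.outer(v, v)

rng = np.random.default_rng(0)
d = 4
tokens = rng.normal(size=(6, d))                 # stand-in token representations
reflections = [householder(t) for t in tokens]   # one tiny "mirror" per token

def path_transform(i, j):
    """Cumulative transformation along the path from position i to position j."""
    transform = np.eye(d)
    for r in reflections[i + 1 : j + 1]:
        transform = r @ transform
    return transform

q, k = rng.normal(size=d), rng.normal(size=d)
# Unlike RoPE, the score between positions 1 and 5 depends on the tokens in between.
score = q @ path_transform(1, 5) @ k
print(score)
```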

The MIT-IBM researchers then examined PaTH Attention's performance on synthetic and real-world tasks, including reasoning, long-context benchmarks, and full LLM training, to see whether it improved a model's ability to track information over time. The team tested its ability to follow the latest “write” instruction despite multiple distracting steps, as well as multi-step recall, tasks that are difficult for standard positional encoding methods such as RoPE. The researchers also trained medium-sized LLMs and compared them against other methods: PaTH Attention improved perplexity and outperformed the alternatives on reasoning benchmarks it was not trained on. They also evaluated retrieval, reasoning, and stability over contexts of tens of thousands of tokens. Across these tests, PaTH Attention consistently proved capable of content-aware state tracking.
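To make the “latest write” probe concrete, the snippet below generates a toy version of such a task: a sequence of writes to a handful of variables, followed by a query whose answer is the most recent value written to the queried variable. The task shape, variable names, and formatting are assumptions for illustration; they are not the benchmark data used in the paper.

```python
# Toy generator for a "latest write" state-tracking probe (hypothetical format).
import random

def make_latest_write_example(num_steps=8, num_keys=3, seed=0):
    rng = random.Random(seed)
    keys = [f"x{i}" for i in range(num_keys)]
    state = {}
    prompt = []
    for _ in range(num_steps):
        key, value = rng.choice(keys), rng.randint(0, 99)
        state[key] = value                      # only the latest write per key matters
        prompt.append(f"write {key} = {value}")
    query = rng.choice(list(state))
    prompt.append(f"read {query} ?")
    return "\n".join(prompt), state[query]

prompt, answer = make_latest_write_example()
print(prompt)
print("expected answer:", answer)
```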

“We found that for both diagnostic tasks designed to test the limits of transformers and real-world language modeling tasks, our new approach outperformed existing attention mechanisms while maintaining their efficiency,” says Kim. Moreover, “I would be excited to test whether this type of data-dependent position encoding, such as PaTH, improves the performance of transformers in structured domains such as biology, in (analyzing) proteins or DNA.”

Thinking more broadly and effectively

The researchers then tested how the PaTH attention mechanism would behave if it more closely mimicked human cognition, in which we discount old or less relevant information when making decisions. To do this, they combined PaTH Attention with another position encoding scheme known as the Forgetting Transformer (FoX), which allows models to selectively “forget.” The resulting PaTH-FoX system can down-weight information in a data-dependent manner, and it achieved strong results on reasoning, long-context understanding, and language modeling. In this way, PaTH Attention expands the expressive power of transformer architectures.
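One way to picture that selective forgetting, shown in the rough sketch below, is as a data-dependent decay added to the attention logits: each token carries a forget gate, and the weight placed on an earlier token shrinks with the product of the gates in between. The random gate values and matrix sizes are stand-ins for illustration; the actual FoX and PaTH-FoX parameterizations are learned from data.

```python
# Sketch of forget-gate-style decay applied to attention logits.
# Gate values and scores are random stand-ins, not learned parameters.
import numpy as np

rng = np.random.default_rng(1)
n = 6
logits = rng.normal(size=(n, n))           # raw attention scores
forget = rng.uniform(0.7, 1.0, size=n)     # per-token forget gates in (0, 1]

for i in range(n):
    for j in range(i + 1):
        # Sum of log-gates between j and i: more intervening "forgetting"
        # means a smaller weight on the earlier token j.
        logits[i, j] += np.sum(np.log(forget[j + 1 : i + 1]))

# Causal softmax over the decayed logits.
mask = np.tril(np.ones((n, n), dtype=bool))
logits[~mask] = -np.inf
weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
print(weights.round(3))
```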

Kim says this type of research is part of a broader effort to develop the “next big thing” in artificial intelligence. He explains that a major factor driving the revolution in both deep learning and generative AI has been the creation of “general-purpose building blocks that can be applied across broad domains,” such as “convolutional layers, RNN (recurrent neural network) layers,” and, more recently, transformers. Looking to the future, Kim notes that considerations such as accuracy, expressiveness, flexibility, and hardware scalability have been and will continue to be essential. As he puts it, “the fundamental enterprise of modern architectural research is the attempt to develop new primitives that maintain or improve expressiveness while at the same time being scalable.”

This work was supported in part by the MIT-IBM Watson AI Lab and the AI2050 program at Schmidt Sciences.
