Language models use mathematical shortcuts to predict dynamic scenarios

Let's say you're reading a story, or playing a game of chess. You may not have noticed, but at each step, your mind kept track of how the situation (or "state of the world") was changing. You can imagine this as a kind of running list of events that we use to update our prediction of what will happen next.

Language models, such as ChatGPT, also track changes inside their own "mind" when finishing off a block of code or anticipating what you'll write next. They typically make these predictions using transformers — internal architectures that help the models make sense of sequential data — but the systems are sometimes wrong because of flawed reasoning patterns. Identifying and tweaking these underlying mechanisms could help language models become more reliable forecasters, particularly for more dynamic tasks such as predicting weather and financial markets.

But do these AI systems process situations the way we do? A new paper from researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) and Department of Electrical Engineering and Computer Science shows that the models instead use clever mathematical shortcuts between each progressive step in a sequence, eventually arriving at reasonable predictions. The team made this observation by going under the hood of language models, evaluating how closely they could track objects that change position rapidly. Their findings show that engineers can control when language models use particular workarounds, as a way to improve the systems' predictive capabilities.

Shell games

The researchers analyzed the inner workings of these models using a clever experiment reminiscent of a classic concentration game. Have you ever had to guess the final location of an object after it was placed under a cup and shuffled among identical containers? The team used a similar test, in which the model had to guess the final arrangement of a set of digits (also called a permutation). The models were given a starting sequence, such as "42135," along with instructions about when and where to move each digit — for example, moving the "4" to the third position — without being told the final result.
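To make the setup concrete, here is a minimal sketch of a task of this shape. The exact instruction format the researchers used is an assumption on our part; the point is simply that the ground-truth answer comes from applying each move to the current arrangement, one step at a time.

```python
# Minimal sketch of a digit-shuffling state-tracking task
# (the move format is illustrative, not the paper's exact specification).

def apply_moves(start: str, moves: list[tuple[int, int]]) -> str:
    """Apply a list of (from_position, to_position) moves to a digit string.

    Positions are 0-indexed. Each move removes the digit at `from_position`
    and reinserts it at `to_position`.
    """
    digits = list(start)
    for src, dst in moves:
        digit = digits.pop(src)
        digits.insert(dst, digit)
    return "".join(digits)

# Example: start from "42135", move the digit in position 0 ("4") to position 2,
# then move the digit now in position 3 to position 1.
print(apply_moves("42135", [(0, 2), (3, 1)]))  # the final arrangement the model must predict
```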

In these experiments, transformer-based models gradually learned to predict the correct final arrangements. Rather than shuffling the digits according to the instructions they were given, though, the systems aggregated information across successive states (the individual steps in the sequence) and computed the final permutation from those aggregates.

One go-to pattern the team observed, called the "Associative Algorithm," essentially organizes nearby steps into groups and then calculates a final guess. You can think of this process as being structured like a tree, where the initial numerical arrangement is the "root." As you move up the tree, adjacent steps are grouped into different branches and multiplied together. At the top of the tree sits the final combination of digits, computed by multiplying each resulting sequence on the branches together.
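A rough illustration of why that grouping works: composing permutations is associative, so adjacent steps can be combined pairwise, level by level, instead of strictly one at a time. The sketch below is our own plain-Python illustration of that tree-style combination; representing each step as a permutation tuple is an assumption, not how the models literally encode it.

```python
# Sketch: tree-style (associative) combination of permutation steps.
# Each step is a tuple p where p[i] gives the new position of the item at position i.

def compose(p, q):
    """Return the permutation 'apply p, then apply q'."""
    return tuple(q[p[i]] for i in range(len(p)))

def tree_combine(steps):
    """Combine steps pairwise, level by level, like climbing the branches of a tree."""
    level = list(steps)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(compose(level[i], level[i + 1]))  # group two adjacent steps
        if len(level) % 2 == 1:
            nxt.append(level[-1])                         # carry an unpaired step upward
        level = nxt
    return level[0]

identity = (0, 1, 2, 3, 4)
swap_first_two = (1, 0, 2, 3, 4)
steps = [swap_first_two, identity, swap_first_two, identity]
print(tree_combine(steps))  # matches composing the steps left to right, thanks to associativity
```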

The other way language models guessed the final permutation was through a crafty mechanism called the "Parity-Associative Algorithm," which essentially whittles down the options before grouping them. It first determines whether the final arrangement results from an even or odd number of rearrangements of individual digits. Then the mechanism groups adjacent sequences from different steps before multiplying them, just like the Associative Algorithm.
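The parity idea itself is simple enough to show in a few lines: a permutation's parity only records whether it takes an even or odd number of swaps to produce, which is much coarser information than the full arrangement. The snippet below is an illustration of that notion (by counting inversions), not a claim about the models' actual internal computation.

```python
# Sketch: parity of a permutation, computed by counting inversions.
# An even inversion count means the arrangement is reachable by an even number of swaps.

def parity(perm):
    """Return 'even' or 'odd' for a permutation given as a sequence of positions."""
    inversions = sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )
    return "even" if inversions % 2 == 0 else "odd"

print(parity((1, 0, 2, 3, 4)))  # 'odd'  -- one swap away from the identity
print(parity((0, 1, 2, 3, 4)))  # 'even' -- the identity itself
```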

"These behaviors tell us that transformers perform simulation with an associative scan. Instead of following state changes step by step, the models organize them into hierarchies," says MIT PhD student and CSAIL affiliate Belinda Li SM '23, lead author of the paper. "How do we encourage transformers to learn better state tracking? Instead of imposing that these systems form inferences about data in a human, sequential way, perhaps we should cater to the approaches they naturally use when tracking state changes."

"One promising avenue of research is to expand test-time computing along the depth dimension rather than the token dimension — by increasing the number of transformer layers rather than the number of chain-of-thought tokens during test-time reasoning," adds Li. "Our work suggests that this approach would allow transformers to build deeper reasoning trees."

Through the looking glass

Li and her co-authors observed how the Associative and Parity-Associative algorithms worked using tools that let them peer inside the "mind" of language models.

They first used a method called "probing," which shows what information flows through an AI system. Imagine you could look into a model's brain to see its thoughts at a specific moment — in a similar way, the technique maps out the system's mid-experiment predictions about the final arrangement of digits.
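In practice, probing typically means training a small classifier to read a property — here, some aspect of the current arrangement — out of the model's hidden activations at a chosen layer and position. The sketch below is a generic linear-probe setup using scikit-learn with randomly generated stand-in activations, since the actual model internals aren't reproduced here.

```python
# Sketch of a linear probe: can the current state be decoded from hidden activations?
# The activations here are random stand-ins; in a real setup they would be read out of
# the transformer at a chosen layer and token position.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
hidden_size, num_examples = 64, 500

activations = rng.normal(size=(num_examples, hidden_size))  # placeholder hidden states
state_labels = rng.integers(0, 5, size=num_examples)        # e.g., which digit sits in slot 0

probe = LogisticRegression(max_iter=1000)
probe.fit(activations[:400], state_labels[:400])             # train on one split
print("probe accuracy:", probe.score(activations[400:], state_labels[400:]))
# High held-out accuracy would suggest the model encodes the state linearly;
# with random stand-ins like these, accuracy should hover near chance.
```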

A tool called "activation patching" was then used to show where the language model processes changes to a situation. It involves meddling with some of the system's "ideas" — injecting incorrect information into certain parts of the network while keeping other parts constant — and seeing how the system adjusts its predictions.
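A bare-bones version of activation patching looks something like the sketch below: run the model once on a "corrupted" input and cache an intermediate activation, then rerun it on the clean input while a forward hook overwrites that same activation, and compare the predictions. The tiny PyTorch model here is a stand-in, not the architecture from the paper.

```python
# Sketch of activation patching on a toy PyTorch model (a stand-in, not the paper's setup).
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 5))

clean_input = torch.randn(1, 8)
corrupted_input = torch.randn(1, 8)

# 1) Cache the hidden activation from the corrupted run.
cached = {}
def cache_hook(module, inputs, output):
    cached["h"] = output.detach()

handle = model[1].register_forward_hook(cache_hook)
model(corrupted_input)
handle.remove()

# 2) Rerun on the clean input, but overwrite the same activation with the cached one.
def patch_hook(module, inputs, output):
    return cached["h"]

handle = model[1].register_forward_hook(patch_hook)
patched_logits = model(clean_input)
handle.remove()

clean_logits = model(clean_input)
print("prediction shift from patching:", (patched_logits - clean_logits).abs().max().item())
# A large shift means the patched activation carries information the model actually uses.
```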

These tools revealed when the algorithms made errors and when the systems "figured out" how to correctly guess the final permutations. The team observed that the Associative Algorithm was learned faster than the Parity-Associative Algorithm, while also performing better on longer sequences. Li attributes the latter's difficulties with more elaborate instructions to an over-reliance on heuristics (rules of thumb that let a system reach a reasonable answer quickly) to predict permutations.

"We've found that when language models use heuristics early on in training, they'll start to build these tricks into their mechanisms," says Li. "However, those models tend to generalize worse than ones that don't rely on heuristics. We found that certain pre-training objectives can deter or encourage these patterns, so in the future, we may look for techniques that discourage models from picking up bad habits."

The researchers note that their experiments were done on small language models fine-tuned on synthetic data, but they found that model size had little effect on the results. This suggests that fine-tuning larger language models, such as GPT-4.1, would likely yield similar results. The team plans to examine their hypotheses more closely by testing language models of different sizes that haven't been fine-tuned, evaluating their performance on dynamic real-world tasks such as tracking code and following how stories evolve.

Harvard University postdoc Keyon Vafa, who was not involved in the paper, says the researchers' findings could create opportunities to advance language models. "Many uses of large language models rely on tracking state: anything from providing recipes to writing code to keeping track of details in a conversation," he says. "This paper makes significant progress in understanding how language models perform these tasks. This progress provides us with interesting insights into what language models are doing and offers promising new strategies for improving them."

Li wrote the paper with MIT student Zifan "Carl" Guo and senior author Jacob Andreas, an associate professor of electrical engineering and computer science and a CSAIL principal investigator. Their research was supported, in part, by Open Philanthropy, the MIT Quest for Intelligence, the National Science Foundation, the Clare Boothe Luce Program for Women in STEM, and a Sloan Research Fellowship.

The researchers presented their work at the International Conference on Machine Learning (ICML) this week.
