Artificial intelligence has made extraordinary progress, with large language models (LLMs) and their more advanced counterparts, large reasoning models (LRMs), transforming how machines process and generate human-like text. These models can write essays, answer questions, and even solve mathematical problems. Yet despite their impressive abilities, they exhibit a curious behavior: they often overcomplicate simple problems while struggling with complex ones. A recent study by Apple researchers provides valuable insight into this phenomenon. This article examines why LLMs and LRMs behave this way and what it means for the future of AI.
Understanding LLMs and LRMs
To understand why LLMs and LRMs behave this way, we first need to clarify what these models are. LLMs, such as GPT-3 or BERT, are trained on vast text datasets to predict the next word in a sequence. This makes them excellent at tasks such as text generation, translation, and summarization. However, they are not inherently designed for reasoning, which involves logical deduction or step-by-step problem solving.
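To make the prediction objective concrete, here is a minimal toy sketch of next-word prediction. The hand-written probability table and the context string are purely hypothetical stand-ins; a real LLM estimates these probabilities with a neural network over a huge vocabulary.

```python
# Toy illustration of next-word prediction (not a real language model):
# given a context, pick the continuation with the highest estimated probability.
toy_model = {
    "the cat sat on the": {"mat": 0.62, "sofa": 0.21, "roof": 0.17},  # hypothetical probabilities
}

def predict_next(context: str) -> str:
    # Greedy choice: return the most probable next word under the toy model.
    probs = toy_model[context]
    return max(probs, key=probs.get)

print(predict_next("the cat sat on the"))  # -> "mat"
```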
LRMs are a newer class of models designed to address this gap. They incorporate techniques such as Chain-of-Thought (CoT) prompting, in which the model generates intermediate reasoning steps before giving a final answer. For example, when solving a math problem, an LRM may break it down into steps, much as a person would. This approach improves performance on complex tasks, but it runs into trouble on problems of varying complexity, as the Apple study reveals.
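The sketch below shows the structural difference between a direct prompt and a Chain-of-Thought prompt. The `ask_model` helper and the prompt wording are hypothetical placeholders for whatever LLM interface you use; only the prompt structure is the point.

```python
# Hypothetical helper: stands in for any LLM API call (not a real library function).
def ask_model(prompt: str) -> str:
    raise NotImplementedError("Plug in your own LLM client here.")

question = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"

# Direct prompt: asks only for the final answer.
direct_prompt = f"{question}\nAnswer with a single number."

# Chain-of-Thought prompt: asks the model to write out intermediate steps first.
cot_prompt = (
    f"{question}\n"
    "Let's think step by step. Show your reasoning, "
    "then state the final answer on the last line."
)

# An LRM-style workflow produces a reasoning trace before the answer, which tends
# to help on multi-step problems but adds unnecessary tokens on trivial ones.
# answer = ask_model(cot_prompt)
```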
The Study
The Apple research team took a different approach to evaluating the reasoning capabilities of LLMs and LRMs. Instead of relying on traditional benchmarks such as math or coding tests, which can be affected by data contamination (where models have memorized the answers), they created controlled puzzle environments. These included well-known puzzles such as the Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World. For example, the Tower of Hanoi involves moving disks between pegs according to specific rules, with complexity increasing as more disks are added. By systematically adjusting the complexity of these puzzles while keeping their logical structure consistent, the researchers could observe how the models perform across a spectrum of difficulty. This method allowed them to analyze not only the final answers but also the reasoning processes, giving a deeper view of how these models "think".
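To see why the Tower of Hanoi makes a convenient complexity dial, consider the standard textbook recursive solver below (this is an illustration, not code from the Apple study): the minimum number of moves grows exponentially, as 2^n − 1 for n disks.

```python
def hanoi(n: int, source: str = "A", target: str = "C", spare: str = "B") -> list[tuple[str, str]]:
    """Return the optimal move sequence for n disks (classic recursive solution)."""
    if n == 0:
        return []
    moves = hanoi(n - 1, source, spare, target)   # move the top n-1 disks out of the way
    moves.append((source, target))                # move the largest disk to the target peg
    moves += hanoi(n - 1, spare, target, source)  # move the n-1 disks back on top of it
    return moves

# Each added disk roughly doubles the required work: 2**n - 1 moves.
for n in (2, 5, 10):
    print(n, "disks ->", len(hanoi(n)), "moves")  # 3, 31, 1023
```

Adding a single disk therefore makes the required solution roughly twice as long, which lets researchers raise the difficulty smoothly without changing the underlying rules.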
Findings on Overthinking and Giving Up
The study identified three distinct performance regimes based on problem complexity:
- At low complexity, standard LLMs often perform better than LRMs, because LRMs tend to overthink, generating extra steps that are not necessary, while standard LLMs are more efficient.
- For problems of medium complexity, LRMs show superior performance thanks to their ability to generate detailed reasoning traces that help them work through these challenges effectively.
- For highly complex problems, both LLMs and LRMs fail completely; notably, LRMs experience a total collapse in accuracy and actually shorten their reasoning as the difficulty increases.
For simple puzzles, such as the Tower of Hanoi with one or two disks, standard LLMs were more efficient at producing correct answers. LRMs, however, often overprocessed these problems, generating long reasoning traces even when the solution was straightforward. This suggests that LRMs may be imitating the verbose explanations in their training data, which can lead to inefficiency.
In moderately complex scenarios, LRMs performed better. Their ability to produce detailed reasoning allowed them to tackle problems that required multiple logical steps, letting them outperform standard LLMs, which struggled to stay coherent.
For very complex puzzles, however, such as the Tower of Hanoi with many disks, both types of model failed. Surprisingly, LRMs reduced their reasoning effort as complexity increased beyond a certain point, even though they had ample computational resources available. This "giving up" behavior points to a fundamental limitation in their ability to scale reasoning.
Why This Happens
The overthinking of simple puzzles likely stems from how LLMs and LRMs are trained. These models learn from huge datasets that contain both concise and elaborate explanations. For easy problems, they may default to generating verbose reasoning traces, imitating the lengthy examples in their training data, even when a direct answer would suffice. This behavior is not necessarily a flaw but a reflection of training that prioritizes reasoning over efficiency.
The failure on complex puzzles reflects the inability of LLMs and LRMs to learn generalizable logical rules. As complexity increases, their reliance on pattern matching breaks down, leading to inconsistent reasoning and a collapse in performance. The study found that LRMs do not use explicit algorithms and reason inconsistently across different puzzles. It highlights that while these models can simulate reasoning, they do not understand the underlying logic in the way humans do.
Differing Perspectives
The study has sparked discussion in the AI community. Some experts argue that these findings can be misinterpreted. They suggest that although LLMs and LRMs may not reason like humans, they still demonstrate effective problem solving within certain complexity limits. They stress that "reasoning" in AI does not have to mirror human cognition to be valuable. Similarly, discussions on platforms such as Hacker News praise the study's rigorous approach but emphasize the need for further research into improving AI reasoning. These perspectives highlight the ongoing debate about what constitutes reasoning in AI and how we should evaluate it.
Implications and Future Directions
The study's findings have significant implications for AI development. While LRMs represent progress toward imitating human reasoning, their limitations in handling complex problems and scaling reasoning effort suggest that current models are far from achieving generalizable reasoning. This underscores the need for new evaluation methods that focus on the quality and adaptability of reasoning processes, not just the accuracy of final answers.
Future research should aim to improve models' ability to execute logical steps accurately and to adapt their reasoning effort to the complexity of the problem. Developing benchmarks that reflect real-world reasoning tasks, such as medical diagnosis or legal argumentation, could provide more meaningful insight into AI capabilities. In addition, addressing the models' over-reliance on pattern recognition and improving their ability to generalize logical rules will be crucial for advancing AI reasoning.
The Bottom Line
The study provides a critical analysis of the reasoning capabilities of LLMs and LRMs. It shows that while these models overanalyze simple puzzles, they struggle with more complex ones, exposing both their strengths and their limits. Although they perform well in certain situations, their inability to solve highly complex problems highlights the gap between simulated reasoning and true understanding. The study underscores the need to develop AI systems that can adaptively reason across different levels of complexity, tackling problems of varying difficulty much as humans do.