Large language models (LLMs) like ChatGPT can write an essay or plan a menu almost instantly. But until recently, they were also easy to stump. Models that rely on linguistic patterns to respond to user queries often struggled with math problems and complex reasoning. Then, quite suddenly, they got much better at these things.
A new generation of LLMs, called reasoning models, is trained to solve complex problems. Like humans, they need some time to think through these problems – and, intriguingly, researchers at MIT's McGovern Institute for Brain Research have found that the kinds of problems that demand the most processing from reasoning models are the very problems that humans need the most time to solve. In other words, as they report today in the journal PNAS, the “thinking cost” of a reasoning model is similar to the thinking cost of a human.
The research, led by Evelina Fedorenko, an associate professor of brain and cognitive sciences and an investigator at the McGovern Institute, concludes that in at least one important respect, reasoning models approach thinking in a human-like way. The team notes that this is not by design. “The people who build these models don't care whether they do it like humans. They just want a system that will work reliably in all conditions and provide correct responses,” says Fedorenko. “The fact that there is some convergence is really striking.”
Reasoning models
Like many forms of artificial intelligence, the new reasoning models are artificial neural networks: computational tools that learn how to process information when they are given data and a problem to solve. Artificial neural networks have proven very effective at many of the tasks that the brain's own neural networks excel at – and in some cases, neuroscientists have found that the best-performing models actually share certain aspects of information processing with the brain. Still, some scientists argued that artificial intelligence was not yet ready to take on the more sophisticated aspects of human intelligence.
“Until recently, I was one of those people who said, 'These models are really good at things like perception and language, but there's a long way to go before we have neural network models that can reason,'” says Fedorenko. “Then large reasoning models emerged that seem to be much better at handling many thinking tasks, such as solving math problems and writing pieces of computer code.”
Andrea Gregor de Varda, a K. Lisa Yang ICoN Center Fellow and a postdoc in Fedorenko's lab, explains that reasoning models work through problems step by step. "At some point, people realized that models needed more space to perform the actual computations that are needed to solve complex problems," he says. "Performance gets much, much better if you let models break problems down into pieces."
To encourage models to solve complex problems in steps that lead to correct solutions, engineers can use reinforcement learning. During training, the models are rewarded for correct answers and penalized for incorrect ones. "The models explore the problem space themselves," says de Varda. "Actions that lead to positive rewards are reinforced, so the model becomes more likely to produce correct solutions."
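To make that reward idea concrete, here is a minimal toy sketch in Python. It is purely illustrative and is not the training code of any real reasoning model: a tiny "policy" chooses between answering a small arithmetic problem in one shot or working through it step by step, only the final answer is scored, and rewarded choices get a simple preference bump, loosely in the spirit of policy-gradient methods. All names and numbers in it are invented for the example.

```python
# Toy, purely illustrative sketch of reward-based training for step-by-step
# problem solving. NOT the training code of any real reasoning model:
# a tiny "policy" picks a solution strategy, only the final answer is scored,
# and rewarded strategies have their preference reinforced.

import math
import random

def make_problem():
    """Toy task: report the sum of five small numbers."""
    numbers = [random.randint(1, 9) for _ in range(5)]
    return numbers, sum(numbers)

def attempt(strategy, numbers):
    """Return (final answer, tokens used) for a hypothetical strategy."""
    if strategy == "direct":
        # One-shot answer: cheap, but slips half the time in this toy.
        guess = sum(numbers) + random.choice([0, 0, 1, -1])
        return guess, 1
    # Step-by-step: each intermediate sum "costs" one extra token.
    running, tokens = 0, 0
    for n in numbers:
        running += n
        tokens += 1
    return running, tokens

# Preference values for the two strategies; updated by the reward signal.
prefs = {"direct": 0.0, "step_by_step": 0.0}
LEARNING_RATE = 0.1

def choose_strategy():
    """Sample a strategy with softmax probabilities over the preferences."""
    weights = {k: math.exp(v) for k, v in prefs.items()}
    r = random.random() * sum(weights.values())
    for strategy, w in weights.items():
        if r <= w:
            return strategy
        r -= w
    return "step_by_step"  # floating-point fallback

for _ in range(2000):
    numbers, truth = make_problem()
    strategy = choose_strategy()
    guess, _ = attempt(strategy, numbers)
    reward = 1.0 if guess == truth else -1.0   # only the final answer is judged
    prefs[strategy] += LEARNING_RATE * reward  # reinforce rewarded choices

print(prefs)  # the step-by-step strategy ends up with the higher preference
```

Because the one-shot strategy earns no reward on average while the step-by-step strategy is always rewarded in this toy, the policy drifts toward breaking the problem into pieces, mirroring the intuition de Varda describes.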
Models trained in this way are much more likely than their predecessors to arrive at the same answers a human would give when presented with a reasoning task. Their step-by-step approach also means that reasoning models can take a little longer to find an answer than the LLMs that came before them – but because they get right answers where previous models would have failed, their answers are worth the wait.
The time it takes the models to work through complex problems already hints at a parallel with human thinking: if you demand that a person solve a difficult problem on the spot, they will probably fail too. De Varda wanted to explore this relationship more systematically. So he gave reasoning models and human volunteers the same sets of problems and tracked not only whether they got the answers right, but also how much time or effort it took them to get there.
Time versus tokens
This meant measuring how long it took people to answer each question, down to the millisecond. For the models, de Varda used a different metric. Measuring processing time would have been uninformative, because it depends more on the computer hardware than on the effort the model puts into solving a problem. Instead, he tracked the tokens that make up the model's internal chain of thought. "The models generate tokens that are not meant for the user to see and work with, but simply to keep track of the internal computation they are doing," de Varda explains. "It's as if they were talking to themselves."
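As a rough illustration of this metric (not code from the study), the snippet below counts the tokens in a model's hidden reasoning trace. The trace string is a made-up placeholder, and a real system would count subword tokens with the model's own tokenizer rather than splitting on whitespace.

```python
# Rough illustration of the token-based effort metric. The reasoning trace is
# a made-up placeholder string, not output from any real model; real reasoning
# models count subword tokens from their own tokenizer.

def count_reasoning_tokens(reasoning_trace: str) -> int:
    """Approximate 'thinking effort' as the number of tokens in the trace."""
    return len(reasoning_trace.split())

trace = "First add 4 and 7 to get 11. Then multiply 11 by 3 to get 33."
print(f"Reasoning effort: {count_reasoning_tokens(trace)} tokens")
```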
Both the humans and the reasoning models were asked to solve problems from seven different classes, such as numerical arithmetic and intuitive reasoning, with multiple problems in each class. The more difficult a problem was, the longer it took people to solve it – and the longer people took to solve a problem, the more tokens a reasoning model generated as it arrived at its own solution.
Likewise, the classes of problems that took humans the longest to solve were the same classes that required the most tokens from the models: arithmetic problems were the least demanding, while a set of problems called the “ARC challenge,” in which pairs of colored grids represent a transformation that must be inferred and then applied to a new case, was the most costly for both humans and models.
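One simple way to quantify that correspondence, sketched below with invented placeholder numbers rather than the study's data, is to correlate the average human solution time for each problem class with the average number of reasoning tokens the model spends on that class.

```python
# Sketch of comparing the two effort measures across problem classes with a
# Pearson correlation. The per-class averages below are invented placeholders,
# NOT data from the PNAS study.

from math import sqrt

human_seconds = [3.1, 5.4, 8.2, 12.5, 20.3, 33.0, 61.7]   # mean human solution time
model_tokens  = [120, 180, 260, 400, 650, 900, 1500]      # mean reasoning tokens

def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

print(f"Time-token correlation: {pearson(human_seconds, model_tokens):.2f}")
```

A value near 1 would indicate that problem classes costly for humans are also costly, in tokens, for the model.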
De Varda and Fedorenko argue that this striking match in thinking costs reveals one way in which reasoning models think like humans. That does not mean, however, that the models recreate human intelligence. The researchers still want to know whether the models use representations of information similar to those in the human brain, and how those representations are transformed into solutions to problems. They are also curious whether the models will be able to handle problems that require knowledge about the world that is not contained in the texts used to train them.
The researchers point out that even though reasoning models generate internal monologues as they solve problems, they are not necessarily using language to think. "If you look at the output these models produce while reasoning, it often contains errors or some nonsense, even if the model ultimately gives the correct answer. So the actual internal computations probably take place in an abstract, non-linguistic representation space, much as humans don't use language to think," he says.