A smarter way for large language models to think through difficult problems

To improve the accuracy of large language models (LLMs) on difficult questions, researchers can let a model spend more time thinking about potential solutions.

However, common approaches that give an LLM this extra thinking time assign a fixed computational budget to every problem, no matter how complex it is. As a result, the LLM may waste computation on simpler questions or fail to solve complex problems that require more reasoning.

To address this, MIT researchers developed a smarter way to allocate computational effort as an LLM works through a problem. Their method lets the model's computational budget adjust dynamically, based on the difficulty of the question and the likelihood that each partial solution will lead to a correct answer.

The researchers found that their new approach enabled an LLM to use just half the computation of existing methods while achieving comparable accuracy across questions of varying difficulty. In addition, the method allows smaller, less resource-intensive LLMs to perform as well as, or even better than, larger models on complex problems.

By improving the reliability and efficiency of LLMs on complex reasoning tasks, the technique could reduce the energy consumption of generative AI systems and enable LLMs to be used in more demanding, time-sensitive applications.

“The computational cost of inference has quickly become a major bottleneck for frontier model providers, who are actively seeking ways to improve computational efficiency for user queries. For example, the recent release of GPT-5.1 highlights the effectiveness of the kind of ‘adaptive inference’ approach proposed in our paper. By equipping models with the ability to know what they don't know, we can enable them to devote more computation to the most difficult problems and most promising solution paths, while spending far fewer tokens on straightforward ones. This makes reasoning both more reliable and significantly more efficient,” says Navid Azizan, the Alfred H. and Jean M. Hayes Career Development Assistant Professor in the Department of Mechanical Engineering and the Institute for Data, Systems, and Society (IDSS), a principal investigator in the Laboratory for Information and Decision Systems (LIDS), and senior author of a paper on this technique.

Azizan is joined on the paper by lead author Young-Jin Park, a graduate student in LIDS and MechE; Kristjan Greenewald, a research associate at the MIT-IBM Watson AI Lab; Kaveh Alim, a graduate student in IDSS; and Hao Wang, a research associate at the MIT-IBM Watson AI Lab and the Red Hat AI Innovation Team. The research will be presented this week at the Conference on Neural Information Processing Systems (NeurIPS).

Computation for contemplation

A recent approach called inference-time scaling lets a large language model spend more time reasoning about difficult problems.

With inference-time scaling, an LLM can generate multiple solution attempts simultaneously, or explore different reasoning paths, and then select the best among these candidates.

A separate model, known as a process reward model (PRM), evaluates each candidate solution or reasoning path. The LLM uses these scores to identify the most promising ones.
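
In code, this selection step resembles a best-of-N search. The sketch below is a minimal illustration; `generate_candidates` and `prm_score` are hypothetical placeholders standing in for an LLM sampler and a process reward model, not functions from the researchers' implementation.

```python
# Minimal best-of-N sketch of inference-time scaling with a PRM.
# `generate_candidates` and `prm_score` are hypothetical placeholders
# for an LLM sampler and a process reward model.

from typing import Callable, List

def best_of_n(
    question: str,
    n: int,
    generate_candidates: Callable[[str, int], List[str]],
    prm_score: Callable[[str, str], float],
) -> str:
    """Sample n candidate solutions and return the one the PRM rates highest."""
    candidates = generate_candidates(question, n)           # n independent attempts
    scores = [prm_score(question, c) for c in candidates]   # PRM rates each attempt
    return candidates[scores.index(max(scores))]            # keep the top-scoring one
```

Note that the number of attempts, n, is fixed before the model ever sees the problem.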

Typical inference-time scaling approaches assign the LLM a fixed computation budget for breaking down the problem and reasoning through individual steps.

Instead, the researchers' method, known as instance-adaptive scaling, dynamically adjusts the number of candidate solutions, or reasoning steps, based on their likelihood of success as the model grapples with the problem.

“This is how people solve problems. We come up with some partial solutions and then decide, should I move forward with one of them, or should I stop and revise, or even go back to the previous step and continue solving the problem from there?” Wang explains.

To do this, the framework uses the PRM to estimate the difficulty of the question, helping the LLM decide how much of its computational budget to spend generating and reasoning about candidate solutions.

At each step of the model's reasoning process, the PRM looks at the question and the partial solutions and assesses how promising each is for reaching a correct answer. When the LLM is more certain, it can prune the candidate solutions or reasoning trajectories it pursues, saving computation.
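
A rough sketch of that step-level adaptation appears below. It assumes pruning is driven by a simple PRM-score threshold; the paper's actual decision rule may differ, and `extend_one_step` and `prm_score` are again hypothetical placeholders.

```python
# Illustrative step-level pruning for instance-adaptive scaling: beams
# whose PRM score drops below a threshold are abandoned, so easy
# questions finish with few beams while hard ones keep more alive.

from typing import Callable, List

def adaptive_search(
    question: str,
    beams: List[str],                                   # initial partial solutions
    extend_one_step: Callable[[str, str], str],
    prm_score: Callable[[str, str], float],
    keep_threshold: float = 0.5,
    max_steps: int = 10,
) -> str:
    for _ in range(max_steps):
        beams = [extend_one_step(question, b) for b in beams]  # grow each partial solution
        scored = sorted((prm_score(question, b), b) for b in beams)
        kept = [b for s, b in scored if s >= keep_threshold]
        beams = kept if kept else [scored[-1][1]]       # always keep the best beam
    return max((prm_score(question, b), b) for b in beams)[1]
```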

However, the researchers found that existing PRMs often overestimate the model's likelihood of success.

Overcoming overconfidence

“If we simply trusted current PRMs, which often overestimate the chance of success, our system would reduce the computational budget too aggressively. So we first had to find a way to better calibrate the PRMs to make inference-time scaling more efficient and reliable,” Park says.

The researchers introduced a calibration method that enables PRMs to generate a range of probability scores rather than a single value. This gives the PRM more reliable uncertainty estimates that better reflect the true probability of success.
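
The article does not detail the calibration procedure, but a standard recalibration baseline gives the flavor. The sketch below uses histogram binning on held-out data and also returns a conservative lower confidence bound per bin, which a scheduler could use instead of the raw score to counteract the overconfidence Park describes. It is an illustrative stand-in, not the paper's method.

```python
# Toy recalibration sketch: histogram binning maps raw PRM scores to
# empirical success rates on held-out problems, plus a conservative
# lower confidence bound. Generic illustration, not the paper's method.

import math
from typing import List, Tuple

def fit_binned_calibrator(
    raw_scores: List[float],    # raw PRM scores for held-out partial solutions
    successes: List[bool],      # whether each one led to a correct final answer
    n_bins: int = 10,
    z: float = 1.0,             # controls how conservative the lower bound is
) -> List[Tuple[float, float]]:
    """Return (empirical success rate, lower bound) for each score bin."""
    totals, hits = [0] * n_bins, [0] * n_bins
    for s, ok in zip(raw_scores, successes):
        b = min(int(s * n_bins), n_bins - 1)
        totals[b] += 1
        hits[b] += int(ok)
    rates = []
    for b in range(n_bins):
        if totals[b] == 0:                  # empty bin: fall back to its midpoint
            p = (b + 0.5) / n_bins
            rates.append((p, p))
        else:
            p = hits[b] / totals[b]
            lo = max(0.0, p - z * math.sqrt(p * (1 - p) / totals[b]))
            rates.append((p, lo))
    return rates

def calibrated_score(raw: float, rates: List[Tuple[float, float]]) -> float:
    """Map a raw PRM score to its bin's conservative lower bound."""
    b = min(int(raw * len(rates)), len(rates) - 1)
    return rates[b][1]
```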

With a well-calibrated PRM, their instance-adaptive scaling framework can use these probability scores to cut computation while maintaining the accuracy of the model's results.

When they compared their method with standard inference-time scaling approaches on a range of mathematical reasoning tasks, they found it required less computation to solve each problem while achieving similar accuracy.

“The beauty of our approach is that adaptation happens on the fly as the problem is solved, rather than right at the beginning of the process,” says Greenewald.

In the future, the researchers want to apply this technique to other applications, such as code generation and AI agents. They also plan to explore additional uses of their PRM calibration method, such as in reinforcement learning and fine-tuning.

“Employees learn on the job – some CEOs even started as interns – but today's agents remain largely static pieces of probabilistic software. Work like this paper is an important step toward changing that: helping agents understand what they don't know and building mechanisms for continuous self-improvement. These capabilities are essential if we want agents to operate safely, adapt to new situations, and deliver consistent results at scale,” says Akash Srivastava, principal and lead Core AI architect at IBM Software, who was not involved in this work.

This work was funded in part by the MIT-IBM Watson AI Lab, the MIT-Amazon Science Hub, the MIT-Google Program for Computing Innovation, and MathWorks.
