As language models (LMs) improve at tasks like image generation, trivia questions, and simple math calculations, you might think that human-level reasoning is just around the corner. In fact, when it comes to complex tasks, we still hold a significant advantage over them. For example, try playing Sudoku against one, writing the numbers one through nine so that each appears only once in every column, row, and section of a nine-by-nine grid. Your AI opponent will either fail to fill in the cells on its own or will do so inefficiently, although it can check that you have filled them in correctly.
Whether an LM is trying to solve advanced puzzles, design molecules, or write mathematical proofs, it has difficulty responding to open-ended requests whose constraints must be followed strictly. The model is better at advising users on how to approach these challenges than at taking them on itself. Moreover, practical problem solving requires the LM to consider a wide range of options while adhering to constraints. Small LMs cannot do this on their own; large language models (LLMs) sometimes can, especially if they are optimized for reasoning tasks, but their responses take time and use a lot of processing power.
This situation led researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) to develop a collaborative approach in which an LLM does the planning and then divides the work for that strategy among smaller models. Their method helps small LMs provide more accurate answers than leading LLMs such as OpenAI's GPT-4o and approach the precision of the best reasoning systems such as o1, while being more efficient than both. Their framework, called “Distributing Constrained Inference Programs with Language models” (or “DisCIPL”), uses a large model to guide smaller “follower” models toward precise answers when writing things like constrained text passages, shopping lists within a budget, and travel plans.
The inner workings of DisCIPL resemble outsourcing a project to a firm. You send the “boss” model a request, and it carefully considers how to proceed with the project. The LLM then communicates those instructions and guidelines clearly to the smaller models, and it corrects the followers' LM outputs as needed – for example, replacing a phrase from one model that doesn't fit the constraints with a better option from another.
The LLM communicates with its followers in a language that all of them understand: a programming language for controlling LMs called “LLaMPPL.” Developed by the MIT Probabilistic Computing Project in 2023, this language allows users to encode specific rules that guide a model toward a desired outcome. For example, LLaMPPL can be used to produce error-free code by incorporating the rules of a particular programming language into its instructions. Directions such as “write eight lines of poetry, each with exactly eight words” are encoded in LLaMPPL, queuing up smaller models to contribute to different parts of the answer.
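To make the idea concrete, here is a minimal sketch, in plain Python, of the kind of constraint program such a direction could be compiled into. It is not the actual LLaMPPL API; the helper follower_propose is a hypothetical stand-in for a call to a small follower model.

```python
# A minimal sketch of a word-count constraint program, assuming a hypothetical
# helper `follower_propose` that stands in for a small follower LM.
# This illustrates the idea only; it is not the actual LLaMPPL API.
import random

VOCAB = ["moonlight", "drifts", "over", "quiet", "rivers", "and", "golden",
         "fields", "where", "wind", "sings", "softly", "tonight"]

def follower_propose(context: str) -> str:
    """Stand-in for a follower LM proposing the next word given the context."""
    return random.choice(VOCAB)

def generate_line(words_per_line: int = 8) -> str:
    """Build one line word by word; the hard constraint (exact word count)
    is enforced by construction as the followers fill in each word."""
    words = []
    while len(words) < words_per_line:
        proposal = follower_propose(" ".join(words))
        # A real constraint program could also reject or reweight proposals
        # that violate softer rules (keywords, rhyme, tone, and so on).
        words.append(proposal)
    return " ".join(words)

# "Write eight lines of poetry, each with exactly eight words."
poem = [generate_line(8) for _ in range(8)]
print("\n".join(poem))
```

Because the rules live in code rather than in the prompt, the same program can be rerun with different follower models without restating the constraints.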
MIT PhD student Gabriel Grand, lead author of a paper presenting this work, says that DisCIPL allows LMs to guide one another to the best answers, which improves their overall performance. “We are working to improve the efficiency of LM inference, especially for the many modern applications of these models that involve generating results subject to constraints,” adds Grand, who is also a CSAIL researcher. “Language models consume more power as people use them more often, which means we need models that can provide accurate answers while using minimal processing power.”
“It's really exciting to see new alternatives to standard language model inference,” says UC Berkeley assistant professor Alane Suhr, who was not involved in the research. “This work encourages new approaches to language modeling and LLMs that significantly reduce inference latency through parallelism, require significantly fewer parameters than current LLMs, and even improve task performance compared to standard serialized inference. The work also opens opportunities to explore the transparency, interpretability, and controllability of model outputs, which remains a huge open problem in the deployment of these technologies.”
An underdog story
You might think that larger LMs are “better” than smaller ones at complex prompts in terms of accuracy and performance. DisCIPL suggests a surprising counterpoint on these tasks: if you instead combine the strengths of smaller models, you may get a performance boost with comparable results.
The researchers note that, in theory, dozens of LMs of any size could be connected to cooperate within DisCIPL. In the writing and reasoning experiments, they used GPT-4o, one of the models that help ChatGPT generate responses, as the planner LM. It brainstormed a plan for several “Llama-3.2-1B” follower models (smaller systems developed by Meta), which then filled in each word (or token) of the response.
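As a rough illustration of that division of labor, the sketch below runs several small followers in parallel on the same partial response and lets the planner's program prune and resample the candidates. It is our own simplified picture, not the DisCIPL implementation; follower_step and satisfies_plan are hypothetical stand-ins for a follower model call and a planner-written constraint.

```python
# Hedged sketch of a planner/follower loop: several followers extend candidate
# responses in parallel, and candidates that violate the planner's constraint
# are pruned and replaced by resampling the survivors. `follower_step` and
# `satisfies_plan` are hypothetical stand-ins, not DisCIPL code.
import random

WORDS = ["pack", "light", "book", "the", "train", "early", "and", "visit",
         "museums", "before", "lunch", "downtown"]

def follower_step(partial: str) -> str:
    """Stand-in for a Llama-3.2-1B follower proposing the next word."""
    return random.choice(WORDS)

def satisfies_plan(partial: str) -> bool:
    """Stand-in for a planner-written constraint, here: no word repeated twice in a row."""
    words = partial.split()
    return all(a != b for a, b in zip(words, words[1:]))

def run_followers(num_followers: int = 8, num_words: int = 12) -> str:
    candidates = [""] * num_followers
    for _ in range(num_words):
        # Each follower extends its own candidate independently (parallelizable).
        candidates = [(c + " " + follower_step(c)).strip() for c in candidates]
        # The planner's program drops candidates that break the constraint and
        # resamples from the survivors, replacing weak continuations.
        survivors = [c for c in candidates if satisfies_plan(c)] or candidates
        candidates = [random.choice(survivors) for _ in range(num_followers)]
    return candidates[0]

print(run_followers())
```

The key design choice this mimics is that the expensive model writes the plan once, while the cheap followers do all of the token-by-token work.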
This collective approach was compared against three baselines: a follower-only system powered by Llama-3.2-1B, standalone GPT-4o, and the industry-leading o1 reasoning system, which helps ChatGPT solve more complex questions such as coding requests and math problems.
DisCIPL first demonstrated its ability to write sentences and paragraphs according to precise rules. The models were given very specific prompts – for example, to write a sentence of exactly 18 words in which the fourth word had to be "Glasgow", the eighth word had to be "w", and the eleventh word had to be "and". The system handled this request exceptionally well, producing coherent results while matching o1's accuracy and consistency.
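Constraints like these are easy to verify mechanically, which is part of what makes them a good testbed. The small Python check below is our own illustration of such a verifier, not code from the paper; the function name and interface are made up.

```python
def check_word_constraints(sentence: str, total_words: int, fixed: dict) -> bool:
    """Return True if the sentence has exactly `total_words` words and every
    1-indexed position in `fixed` holds its required word."""
    words = sentence.split()
    return (len(words) == total_words
            and all(0 < pos <= len(words) and words[pos - 1] == word
                    for pos, word in fixed.items()))

# The prompt described above: 18 words, with word 4 = "Glasgow",
# word 8 = "w", and word 11 = "and".
constraints = {4: "Glasgow", 8: "w", 11: "and"}
print(check_word_constraints("a short draft sentence", 18, constraints))  # False
```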
Faster, cheaper, better
These experiments also revealed that DisCIPL's key steps were much cheaper than those of state-of-the-art systems. For example, while existing reasoning models such as OpenAI's o1 reason in natural-language text, DisCIPL “reasons” by writing Python code, which is more compact. In practice, the researchers found that DisCIPL allowed for 40.1% faster reasoning and 80.2% cost savings compared to o1.
DisCIPL's performance gains are due in part to its use of small Llama models as followers, which are 1,000 to 10,000 times cheaper per token than comparable reasoning models. This makes DisCIPL more “scalable”: the researchers were able to run dozens of Llama models in parallel at a fraction of the cost.
According to the CSAIL researchers, these were not the only surprising findings. Their system also performed comparably to o1 in real-world tasks such as making ingredient lists, planning a travel itinerary, and writing grant proposals with a word limit. Meanwhile, GPT-4o struggled with these requests and often couldn't place keywords in the right parts of sentences when writing text. The follower-only baseline essentially finished last in the standings because it had difficulty following instructions.
“Over the past few years, we have seen impressive results from approaches that use language models to 'automatically formalize' problems in mathematics and robotics by representing them in code,” says senior author Jacob Andreas, associate professor of electrical engineering and computer science at MIT and a CSAIL principal investigator. “What's most exciting about this paper is that we can now use LMs to automatically formalize text generation itself, enabling the same kinds of performance gains and guarantees we've seen in other fields.”
In the future, the researchers plan to extend this framework to a fully recursive approach, in which the same model can serve as both leader and follower. Grand adds that DisCIPL could be extended to mathematical reasoning tasks, where answers are more difficult to verify. The team also intends to test the system's ability to satisfy vague user preferences, as opposed to hard constraints, which can be described more clearly in code. Thinking even more broadly, the team hopes to use the largest models available, though it notes that such experiments are computationally expensive.
Grand and Andreas wrote the paper with CSAIL principal investigator and MIT professor Joshua Tenenbaum, MIT Department of Brain and Cognitive Sciences principal research scientist Vikash Mansinghka, and Yale University assistant professor Alex Lew SM '20, PhD '25. The CSAIL researchers presented their work at the Conference on Language Modeling in October and at the IVADO workshop “Deploying Autonomous Agents: Lessons, Risks, and Real-World Impacts” in November.
Their work was supported in part by the MIT Quest for Intelligence, the Siegel Family Foundation, the MIT-IBM Watson AI Lab, the Sloan Research Fellowship, Intel, the Air Force Office of Scientific Research, the Defense Advanced Research Projects Agency, the Office of Naval Research, and the National Science Foundation.