Guided learning allows “untrained” neural networks to realize their potential | MIT News

Even networks long considered “untrainable” can learn effectively with a little helping hand. Researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) have shown that briefly encouraging a neural network to align with another network – a method they call guidance – can dramatically improve the performance of architectures previously considered unsuited to modern tasks.

Their findings suggest that many so-called “inefficient” networks may simply be starting from less-than-ideal points in parameter space, and that a short period of guidance can move them to a place from which learning comes more easily.

The team's guidance method encourages the target network to match the internal representations of a guide network during training. Unlike traditional techniques such as knowledge distillation, which focus on imitating a teacher network's outputs, guidance transfers structural knowledge directly from one network to another: the target learns how the guide organizes information at each layer, rather than simply copying its behavior. Intriguingly, even untrained guides carry architectural biases that can be transferred, while trained guides additionally pass along learned patterns.
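The paper's exact objective is not reproduced here, but the core idea of matching internal representations rather than outputs can be sketched in a few lines of NumPy. The layer shapes, the normalization, and the `guidance_loss` and `hidden_activations` names are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def hidden_activations(x, w):
    """One ReLU hidden layer; a stand-in for any network's internal layer."""
    return np.maximum(x @ w, 0.0)

def guidance_loss(target_acts, guide_acts):
    """Penalize mismatch between internal representations, not outputs.
    Each sample's activations are normalized so the loss compares
    representational structure rather than raw scale."""
    t = target_acts / (np.linalg.norm(target_acts, axis=1, keepdims=True) + 1e-8)
    g = guide_acts / (np.linalg.norm(guide_acts, axis=1, keepdims=True) + 1e-8)
    return float(np.mean((t - g) ** 2))

x = rng.standard_normal((32, 16))         # a batch of inputs
w_guide = rng.standard_normal((16, 64))   # untrained guide: random weights
w_target = rng.standard_normal((16, 64))  # target network being guided

loss = guidance_loss(hidden_activations(x, w_target),
                     hidden_activations(x, w_guide))
```

Minimizing a loss of this kind during training pulls the target's layer-wise organization of information toward the guide's, which is the sense in which guidance differs from matching final predictions.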

“We found these results quite surprising,” says Vighnesh Subramaniam ’23, MEng ’24, a doctoral student in MIT's Department of Electrical Engineering and Computer Science (EECS) and a CSAIL researcher who is lead author of the paper presenting the findings. “It's impressive that we could leverage representational similarity to make these traditionally ‘untrainable’ networks actually work.”

Guardian angel

A central question was whether guidance needed to continue throughout training, or whether its main effect was to provide a better initialization. To find out, the researchers ran an experiment with fully connected networks (FCNs). Before training on a real task, a network was guided for just a few steps by another network using random-noise inputs – like stretching before exercise. The results were striking: networks that typically overfit immediately remained stable, achieved lower training losses, and avoided the classic performance collapse seen in standard FCNs. The brief warm-up showed that even a short session of guidance can confer lasting benefits, with no need for constant supervision.
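As a rough illustration of such a warm-up (not the authors' procedure), the sketch below nudges a target network's hidden representations toward an untrained guide's for a few gradient steps on random noise, then stops. The linear layers, learning rate, and step count are assumptions chosen to keep the gradient simple:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical warm-up: before seeing any real data, align the target's
# hidden representations with an (untrained) guide's, using only random
# noise as input -- "stretching before exercising".
x = rng.standard_normal((64, 16))          # random-noise inputs
w_guide = rng.standard_normal((16, 32))    # guide: random, never updated
w_target = rng.standard_normal((16, 32))   # target: updated during warm-up

def rep_loss(w):
    diff = x @ w - x @ w_guide             # linear "layers" keep the gradient simple
    return float(np.mean(diff ** 2))

losses = [rep_loss(w_target)]
lr = 0.01
for _ in range(20):                        # a short warm-up, then guidance stops
    grad = 2.0 * x.T @ (x @ w_target - x @ w_guide) / x.shape[0]
    w_target -= lr * grad
    losses.append(rep_loss(w_target))

print(losses[0] > losses[-1])              # True: the target moved toward the guide
```

After the loop, `w_target` carries the guide's representational structure into subsequent ordinary training, which is the sense in which a brief warm-up could act as a better initialization.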

The study also compared guidance with knowledge distillation, a popular approach in which a student network tries to match a teacher network's outputs. When the teacher was untrained, distillation failed completely, because the teacher's outputs carried no meaningful signal. Guidance, by contrast, still delivered significant improvements, because it relies on internal representations rather than final predictions. The result highlights a key observation: untrained networks already encode valuable architectural biases that can steer other networks toward effective learning.

Beyond the experimental results, the findings have broad implications for understanding neural network architectures. The researchers suggest that success or failure often depends less on task-specific data than on where a network sits in parameter space. By aligning a network to a guide, it becomes possible to separate the contribution of architectural biases from that of learned knowledge, letting researchers determine which design features promote effective learning and which failures are simply the result of poor initialization.

Guidance also opens up new ways of probing relationships between architectures. By measuring how easily one network can guide another, researchers can study functional distances between designs and revisit theories of neural network optimization. Because the method relies on representational similarity, it can reveal structure previously hidden in network design, helping to determine which components contribute most to learning and which do not.

Saving the hopeless

Ultimately, the work shows that so-called “untrainable” networks are not inherently doomed. With guidance, failure modes can be eliminated, overfitting avoided, and previously inefficient architectures brought up to modern performance standards. The CSAIL team plans to investigate which architectural elements are most responsible for these improvements, and how the insights might shape future network design. By revealing the hidden potential of even the most stubborn networks, guidance offers a powerful new tool for understanding – and hopefully shaping – the fundamentals of machine learning.

“It is commonly assumed that different neural network architectures have particular strengths and weaknesses,” says Leyla Isik, an assistant professor of cognitive science at Johns Hopkins University, who was not involved in the research. “This exciting study shows that one type of network can inherit the advantages of another architecture without losing its original capabilities. Notably, the authors show that this can be done even with small, untrained guide networks. The paper presents a novel and targeted way of adding various inductive biases to neural networks, which is crucial for developing more efficient and human-aligned artificial intelligence.”

Subramaniam wrote the paper with CSAIL colleagues: scientist Brian Cheung; PhD student David Mayo ’18, MEng ’19; research fellow Colin Conwell; principal investigators Boris Katz, a CSAIL principal research scientist, and Tomaso Poggio, an MIT professor of brain and cognitive sciences; and former CSAIL scientist Andrei Barbu. Their work was supported in part by the Center for Brains, Minds and Machines, the National Science Foundation, the MIT CSAIL Applications of Machine Learning Initiative, the MIT-IBM Watson AI Lab, the U.S. Defense Advanced Research Projects Agency (DARPA), the U.S. Air Force Artificial Intelligence Accelerator, and the U.S. Air Force Office of Scientific Research.

Their work was recently presented at the Neural Information Processing Systems (NeurIPS) Conference and Workshop.
