For years, the inner workings of large language models (LLMs) like Llama and Claude have been likened to a “black box” – huge, complex, and notoriously difficult to control. But a team of researchers from UC San Diego and MIT has published a study in the journal Science suggesting that this box is not as mysterious as we thought.
The team found that complex AI concepts – from specific languages such as Hindi to abstract ideas such as conspiracy theories – are actually stored in the model's mathematical space as simple straight lines or vectors.
Using a new tool called the Recursive Feature Machine (RFM) – a feature-extraction technique that identifies linear patterns representing concepts ranging from moods and fears to complex reasoning – the researchers were able to precisely trace these pathways. Once a concept’s direction is known, it can be “nudged”: by adding or subtracting these vectors mathematically, the team could instantly change the model’s behavior without costly retraining or complex prompts.
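To make that idea concrete, here is a minimal sketch of this kind of vector “nudging,” often called activation steering. It is not the authors’ RFM pipeline: the model name, layer index, steering strength, and the randomly generated concept vector are all illustrative placeholders, and in practice the direction would come from a feature-extraction step like the one described above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative choice of open model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

# A unit-length direction in hidden-state space standing in for a learned
# concept vector; here it is random, purely for illustration.
hidden_size = model.config.hidden_size
concept_vector = torch.randn(hidden_size)
concept_vector = concept_vector / concept_vector.norm()

LAYER = 15    # which transformer block to steer (illustrative)
ALPHA = 4.0   # steering strength; a negative value would suppress the concept

def steer_hook(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states of
    # shape (batch, sequence, hidden); add the scaled concept direction to it.
    if isinstance(output, tuple):
        hidden = output[0] + ALPHA * concept_vector.to(output[0].device, output[0].dtype)
        return (hidden,) + output[1:]
    return output + ALPHA * concept_vector.to(output.device, output.dtype)

handle = model.model.layers[LAYER].register_forward_hook(steer_hook)

prompt = "Tell me what really happened during the moon landing."
inputs = tokenizer(prompt, return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(generated[0], skip_special_tokens=True))

handle.remove()  # remove the hook to restore the model's normal behavior
```

Flipping the sign of the steering strength subtracts the direction instead of adding it, which suppresses the concept rather than amplifying it – the same basic mechanic behind both the safety uses and the risks described later in the article.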
The efficiency of this method is what is creating buzz in the industry. Using a single standard GPU (an NVIDIA A100), the team was able to identify and steer a concept in under a minute, using fewer than 500 training samples.
The practical applications of this “surgical” approach to AI are immediate. In one experiment, the researchers steered a model to improve its ability to translate Python code to C++. By isolating the “logic” of the code from the “syntax” of the language, the steered model outperformed standard versions that were simply prompted to translate via text.
The researchers also found that internally “probing” these vectors is a more effective way of catching AI hallucinations or toxic content than asking the AI to evaluate its own work. Essentially, the model often “knows” that it is hallucinating or producing something toxic, even if its final output suggests otherwise. By analyzing the internal mathematics, researchers can detect these problems before a single word is generated.
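As a rough picture of what “probing” means here, the sketch below fits a plain linear classifier on stored hidden-state vectors to flag problematic generations before they are decoded. The features and labels are synthetic stand-ins, and the paper’s RFM-based detector is more sophisticated than ordinary logistic regression; this only illustrates the general technique.

```python
# Minimal sketch of a linear probe over hidden states, not the paper's exact method.
# X would normally hold one activation vector per example (e.g. from a middle layer),
# and y would label whether the corresponding output was hallucinated or toxic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hidden_size = 4096                 # typical hidden width of a 7B-8B model
n_examples = 500                   # roughly the training-set scale cited above

X = rng.normal(size=(n_examples, hidden_size)).astype(np.float32)  # synthetic activations
y = rng.integers(0, 2, size=n_examples)                            # synthetic labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", probe.score(X_test, y_test))

# The probe's weight vector is itself a direction in activation space, which is
# why the "concepts are straight lines" picture makes this kind of check cheap.
concept_direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
```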
However, the same technology that makes AI safer can also make it more dangerous. The study found that by “de-emphasizing” the concept of refusal, researchers could effectively “jailbreak” the models. In tests, the steered models bypassed their own guardrails to provide instructions for illegal activities or promote debunked conspiracy theories.
Perhaps the most surprising discovery was the universality of these concepts. The “conspiracy theorist” vector extracted from English data performed equally well when the model spoke Chinese or Hindi. This supports the “linear representation hypothesis” – the idea that AI models organize human knowledge in a structured, linear way that transcends individual languages.
While the study focused on open-source models such as Meta’s Llama and DeepSeek, as well as OpenAI’s GPT-4o, the scientists believe the findings apply broadly. As models become larger and more sophisticated, they actually become more controllable, not less.
The team’s next goal is to refine these control methods to adapt to specific user inputs in real time, potentially leading to a future where AI is not just a chatbot we talk to, but a system we can mathematically “tune” for greater accuracy and safety.