Revealing Biases, Moods, Personalities, and Other Abstract Concepts Hidden in Large Language Models

By now, ChatGPT, Claude, and other large language models have absorbed so much human knowledge that they are far more than simple response generators; they can also express abstract concepts such as particular tones, personalities, biases, and moods. It is not obvious, however, how these models represent such abstract concepts within the knowledge they contain.

Now a team from MIT and the University of California, San Diego has developed a way to check whether a large language model (LLM) contains hidden biases, personalities, moods, or other abstract concepts. Their method can home in on the connections in the model that encode a concept of interest. The method can then manipulate, or “steer,” those connections to strengthen or weaken the concept in any answer the model provides.

The team demonstrated that their method could quickly root out and control more than 500 general concepts in some of the largest LLMs in use today. For example, the researchers could pinpoint a model's representation of personalities such as “social influencer” and “conspiracy theorist,” and of attitudes such as a “fear of marriage” or being a “Boston fan.” They could then adjust these representations to amplify or suppress the concepts in the responses the model generates.

In the case of the concept “conspiracy theorist,” the team identified a representation of this concept in one of the largest vision-language models currently available. When they amplified the representation and then asked the model to explain the origins of the famous “blue marble” image of Earth taken from Apollo 17, the model generated a response with the tone and perspective of a conspiracy theorist.

The team acknowledges that there are risks in isolating certain concepts, which they also illustrate (and warn against). Overall, however, they see the new approach as a way to expose hidden concepts and potential vulnerabilities in LLMs, which can then be mitigated or harnessed to improve a model's safety or performance.

“What this really says about LLMs is that they encode these concepts, but not all of them are actively expressed,” says Adityanarayanan “Adit” Radhakrishnan, an assistant professor of mathematics at MIT. “With our method, you can isolate these different concepts and activate them in ways that prompting alone can't.”

The team published its findings today in a study appearing in the journal Science. The study's co-authors are Radhakrishnan; Daniel Beaglehole and Mikhail Belkin of the University of California, San Diego; and Enric Boix-Adserà of the University of Pennsylvania.

Fishing in a black box

As use of OpenAI's ChatGPT, Google's Gemini, Anthropic's Claude, and other artificial intelligence assistants skyrockets, scientists are racing to understand how the models represent certain abstract concepts, such as “hallucination” and “deception.” In the context of LLMs, a hallucination is a false or misleading response that the model has “hallucinated,” or wrongly constructed as fact.

To find out whether a concept such as “hallucination” is encoded in an LLM, researchers often turn to “unsupervised learning,” a type of machine learning in which algorithms broadly search unlabeled representations for patterns that may correspond to a concept such as “hallucination.” For Radhakrishnan, however, this approach can be too broad and computationally expensive.

“It's like fishing with a big net and trying to catch one species of fish. You'll find a lot of fish that you have to sort through to find the right one,” he says. “Instead, we use bait suited to the species we want.”

He and his colleagues had previously developed the beginnings of a more targeted approach with a type of predictive modeling algorithm known as a recursive feature machine (RFM). An RFM aims to directly identify features, or patterns, in data by using a mathematical mechanism that neural networks, the broad category of artificial intelligence models that includes LLMs, use indirectly to learn features.

Since the algorithm was, overall, an effective and efficient approach to capturing features, the team wondered whether it could be used to root out concept representations in LLMs, which are by far the most widely used and perhaps least understood type of neural network.

“We wanted to apply our feature-learning algorithms to LLMs to discover concept representations in these large and complex models in a targeted way,” says Radhakrishnan.

Convergence of concepts

The team's new approach identifies a given concept of interest within an LLM and “steers,” or directs, the model's responses with respect to that concept. The researchers looked for 512 concepts across five classes: fears (such as fears of marriage, insects, and even buttons); experts (social influencer, medievalist); moods (boastful, impartially amused); location preferences (Boston, Kuala Lumpur); and personalities (Ada Lovelace, Neil deGrasse Tyson).

The researchers then looked for representations of each concept in several modern large language and vision-language models. They did this by training RFMs to recognize numerical patterns in a model's internal representations that could correspond to a particular concept of interest.

A standard large language model is, broadly speaking, a neural network that accepts natural-language prompts such as “Why is the sky blue?” and breaks the prompt into individual words, each of which is encoded mathematically as a list, or vector, of numbers. The model passes these vectors through a series of computational layers, producing large arrays of numbers that, at each layer, are used to identify the words most likely to follow in response to the original prompt. Ultimately, the layers converge on a set of numbers that is decoded back into text as a natural-language response.
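For readers who want a concrete picture of those intermediate vectors, the short Python sketch below pulls out the per-layer representations a model produces for a prompt. It assumes the open-source Hugging Face transformers library and the small GPT-2 model purely for illustration; the study's own models and tooling may differ.

    # Illustrative sketch: inspect the per-layer vectors an LLM builds for a prompt.
    # The model ("gpt2") and the Hugging Face `transformers` API are assumptions
    # made for illustration, not the models or code used in the study.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
    model.eval()

    inputs = tokenizer("Why is the sky blue?", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # hidden_states holds one tensor per layer (plus the input embeddings),
    # each of shape (batch, sequence_length, hidden_size). These intermediate
    # vectors are what concept-probing methods examine.
    for i, layer_state in enumerate(outputs.hidden_states):
        print(f"layer {i}: {tuple(layer_state.shape)}")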

The team's approach trains an RFM to recognize numerical patterns in an LLM that are associated with a specific concept. For example, to test whether an LLM contains a “conspiracy theorist” representation, the researchers first train the algorithm to identify patterns among the LLM's representations of 100 prompts that are clearly related to conspiracies and 100 other prompts that are not. In this way, the algorithm learns the patterns tied to the concept of conspiracy theories. The researchers can then mathematically dial the concept up or down by perturbing the LLM's representations along the identified patterns.
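The sketch below conveys the flavor of that procedure, with two important caveats: it uses a plain difference-of-means direction plus activation steering as a simplified stand-in for the team's recursive feature machine, and the model, layer index, prompts, and steering strength are all illustrative assumptions rather than details from the study.

    # Simplified stand-in for concept finding and steering: compute a crude
    # "concept direction" from labeled prompts, then nudge the model's hidden
    # states along it during generation. Everything named here is illustrative.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    LAYER = 6  # which transformer block to probe and steer (an assumption)

    def last_token_state(prompt):
        # Hidden state of the prompt's final token at the chosen layer.
        # hidden_states[0] is the embedding layer, so block LAYER's output
        # sits at index LAYER + 1.
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        return out.hidden_states[LAYER + 1][0, -1, :]

    concept_prompts = ["The moon landing was staged by the government.",
                       "They are hiding the truth about the contrails."]
    neutral_prompts = ["The recipe calls for two cups of flour.",
                       "The train departs from platform four at noon."]

    # Difference of mean representations between concept-related and unrelated
    # prompts (the study's RFM is a more refined way to find such patterns).
    direction = (torch.stack([last_token_state(p) for p in concept_prompts]).mean(0)
                 - torch.stack([last_token_state(p) for p in neutral_prompts]).mean(0))
    direction = direction / direction.norm()

    ALPHA = 8.0  # steering strength (an assumption; a negative value weakens the concept)

    def steering_hook(module, inputs, output):
        # GPT-2 blocks return a tuple whose first element is the hidden states;
        # nudge them along the concept direction and pass the rest through.
        return (output[0] + ALPHA * direction,) + output[1:]

    handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
    try:
        prompt = "Explain the origin of the 'blue marble' photo of Earth."
        inputs = tokenizer(prompt, return_tensors="pt")
        generated = model.generate(**inputs, max_new_tokens=60, do_sample=False)
        print(tokenizer.decode(generated[0], skip_special_tokens=True))
    finally:
        handle.remove()  # restore the unmodified model

In spirit, a positive steering strength pushes responses toward the concept and a negative one pushes away from it, mirroring the strengthening and weakening the researchers describe.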

This method can be used to find and manipulate essentially any general concept in an LLM. Among many examples, the researchers identified representations of a “conspiracy theorist” and steered an LLM to respond in that tone and perspective. They also identified and amplified an “anti-refusal” concept and showed that, although the model would normally be trained to refuse certain prompts, it would instead comply, for example by giving instructions on how to rob a bank.

Radhakrishnan says the approach can be used to quickly find and mitigate vulnerabilities in LLMs. It can also be used to accentuate certain characteristics, personalities, moods, or preferences, for example by emphasizing the concept of “brevity” or “reasoning” in any response an LLM generates. The team has publicly released the code behind the method.

“LLMs clearly have a lot of abstract concepts in them, in some kind of representation,” says Radhakrishnan. “There are ways in which, if we understand these representations well enough, we can build highly specialized LLMs that are still safe to use but really effective for certain tasks.”

This work was supported in part by the National Science Foundation, the Simons Foundation, the TILOS Institute, and the U.S. Office of Naval Research.
