Scientists discover flaw that makes LLMs less reliable | MIT News

Large language models (LLMs) sometimes draw incorrect conclusions, according to an MIT study.

Instead of answering a query using domain knowledge, an LLM can respond by relying on grammatical patterns it learned during training. This can cause the model to fail unexpectedly when deployed on new tasks.

Researchers have found that models can incorrectly associate certain sentence patterns with certain topics, so the LLM may give a convincing answer by recognizing familiar phrases rather than understanding the question.

Their experiments showed that even the most powerful LLMs can make this mistake.

This shortcoming can reduce the reliability of LLMs that perform tasks such as handling customer inquiries, summarizing clinical notes, and generating financial reports.

This may also pose a security risk. A malicious actor could exploit this behavior to trick LLMs into producing harmful content, even when the models have safeguards in place to prevent such responses.

After identifying this phenomenon and examining its consequences, the researchers developed a benchmarking procedure to assess a model's reliance on these spurious correlations. The procedure can help developers mitigate the problem before deploying an LLM.

“This is a byproduct of the way we train models, but models are now being used in practice in safety-critical domains well beyond the tasks that gave rise to these syntactic failure modes. If you as an end user are unfamiliar with model training, this will likely be unexpected,” says Marzyeh Ghassemi, an associate professor in MIT's Department of Electrical Engineering and Computer Science (EECS), a member of the MIT Institute for Medical Engineering and Sciences and the Laboratory for Information and Decision Systems, and senior author of the study.

Ghassemi is joined on the paper by co-authors Chantal Shaib, a graduate student at Northeastern University and visiting student at MIT; Vinith Suriyakumar, an MIT graduate student; Levent Sagun, a research scientist at Meta; and Byron Wallace, the Sy and Laurie Sternberg Interdisciplinary Associate Professor and associate dean for research at the Khoury College of Computer Sciences at Northeastern University. A paper describing the work will be presented at the Conference on Neural Information Processing Systems.

Stuck on syntax

LLMs are trained on huge amounts of text from the Internet. During this training process, the model learns to understand the relationships between words and phrases – knowledge it then uses to answer queries.

In previous work, the researchers found that LLMs capture patterns of parts of speech that frequently appear together in training data. They call these part-of-speech patterns “syntactic templates.”
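To make the idea concrete, a syntactic template can be thought of as the sequence of part-of-speech tags underlying a sentence. The sketch below illustrates this with a tiny hand-written lexicon; it is not the paper's code, and a real analysis would use a proper POS tagger (e.g., spaCy or NLTK) rather than this toy lookup table.

```python
# Illustrative sketch: a "syntactic template" is the part-of-speech
# sequence of a sentence. The toy lexicon below is an assumption for
# demonstration only; real work would use an actual POS tagger.

TOY_LEXICON = {
    "where": "ADV", "quickly": "ADV",
    "is": "VERB", "sit": "VERB", "located": "VERB", "overcast": "VERB",
    "paris": "PROPN",
}

def syntactic_template(sentence: str) -> tuple:
    """Map each word to its part of speech, ignoring trailing punctuation."""
    words = sentence.lower().rstrip("?.!").split()
    return tuple(TOY_LEXICON.get(w, "UNK") for w in words)

# Two very different sentences can share one template — exactly the
# surface-level association the study probes.
a = syntactic_template("Where is Paris located?")
b = syntactic_template("Quickly sit Paris overcast?")
print(a)       # ('ADV', 'VERB', 'PROPN', 'VERB')
print(a == b)  # True
```

A model that has latched onto this template may treat any sentence matching it as a “geography question,” regardless of the actual words.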

LLMs need this understanding of syntax along with semantic knowledge to answer questions in a specific field.

“For example, in the field of news, there is a particular style of writing. So the model not only learns the semantics, but also learns the basic structure of how sentences should be put together to maintain a style specific to that field,” explains Shaib.

However, in this study they found that LLMs learn to associate these syntactic templates with specific domains. The model may incorrectly rely solely on this learned association when answering questions, rather than understanding the query and its subject matter.

For example, an LLM might learn that a question like “Where is Paris located?” is structured as adverb/verb/proper noun/verb. If the training data contains many examples of this sentence structure, the model may associate the syntactic template with questions about countries.

So, if the model receives a new question with the same grammatical structure but nonsense words, such as “Quickly sit Paris overcast?”, it may answer “France” even though that answer makes no sense.

“This is an overlooked type of association that a model learns to answer questions correctly. We should pay more attention not only to the semantics but also to the syntax of the data we use to train our models,” says Shaib.

Missing meaning

The researchers tested this phenomenon by designing synthetic experiments in which only one syntactic template appeared in the training data for each domain. They then tested the models by replacing words with synonyms, antonyms, or random words while keeping the syntax the same.
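The perturbation described above can be sketched as follows: swap each word for a random word with the same part of speech, so the syntactic template survives while the meaning is destroyed. The word pools and tags here are illustrative assumptions, not the paper's actual materials.

```python
import random

# Illustrative sketch of a syntax-preserving perturbation: replace each
# word with a random word of the same part of speech. The POOLS below are
# made-up stand-ins for demonstration, not the study's word lists.

POOLS = {
    "ADV":   ["quickly", "softly", "barely"],
    "VERB":  ["sit", "glow", "mumble"],
    "PROPN": ["Paris", "Oslo", "Kyoto"],
}

def perturb(tagged_sentence, rng=random):
    """tagged_sentence: list of (word, pos) pairs.
    Keeps the POS order intact; randomizes the words themselves."""
    return " ".join(
        rng.choice(POOLS.get(pos, [word])) for word, pos in tagged_sentence
    )

original = [("Where", "ADV"), ("is", "VERB"), ("Paris", "PROPN"), ("located", "VERB")]
nonsense = perturb(original, random.Random(0))
print(nonsense)  # e.g. a nonsense string with the same ADV/VERB/PROPN/VERB shape
```

Feeding such perturbed questions to a model reveals whether it is answering from meaning or from surface structure.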

In each case, they found that the LLMs often still gave the correct answer, even when the question was complete nonsense.

When they restructured the same question using a new part-of-speech pattern, LLMs often failed to answer correctly, even though the basic meaning of the question remained the same.

They used this approach to test pretrained LLMs such as GPT-4 and Llama, and found that this learned behavior significantly reduced their performance.

Curious about the broader implications of these findings, researchers examined whether someone could use this phenomenon to trigger harmful responses from an LLM that had been deliberately trained to reject such requests.

They found that by formulating a question using a syntactic template that the model associates with a “safe” dataset (one that does not contain malicious information), the model can be tricked into overriding its refusal policy and generating harmful content.

“It is clear from this work that we need more robust safeguards to address vulnerabilities in LLMs. In this paper, we have identified a new vulnerability that arises from the way LLMs learn the language. So we need to develop new safeguards based on the way LLMs learn the language, not just ad hoc solutions to various vulnerabilities,” says Suriyakumar.

Although the researchers did not examine mitigation strategies in this work, they developed an automatic benchmarking technique that can be used to assess an LLM's reliance on these spurious correlations between syntax and domain. The new test can help developers proactively address this flaw in their models, reducing security risks and improving performance.
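The core idea behind such a benchmark can be sketched simply: measure how often a model still returns the domain answer when a question's words are nonsense but its syntactic template is intact. The function, stub model, and scoring rule below are illustrative assumptions, not the paper's actual benchmark.

```python
# Hedged sketch of a template-reliance score: the fraction of
# syntax-preserving nonsense questions that still elicit the domain
# answer. Higher means more reliance on syntax over meaning.
# All names and the stub model here are illustrative, not from the paper.

def template_reliance(model, nonsense_questions, domain_answer):
    hits = sum(1 for q in nonsense_questions if model(q) == domain_answer)
    return hits / len(nonsense_questions)

# Stub "model" that keys only on surface structure (word count plus a
# question mark) — a stand-in for an LLM that has latched onto syntax.
def stub_model(question):
    if question.endswith("?") and len(question.split()) == 4:
        return "France"
    return "unknown"

score = template_reliance(
    stub_model,
    ["Quickly sit Paris overcast?", "Softly glow Oslo mumble?"],
    "France",
)
print(score)  # 1.0 — the stub answers "France" to pure nonsense
```

A score near 1.0 flags a model that is pattern-matching on sentence shape; a robust model would refuse or hedge on such inputs, scoring near 0.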

In the future, the researchers want to explore potential mitigation strategies, which could include augmenting training data to provide a wider range of syntactic templates. They are also interested in studying this phenomenon in reasoning models, specialized LLMs designed to solve multi-step tasks.

“I think this is a really creative approach to studying LLM failure modes. This work highlights the importance of linguistic knowledge and analysis in LLM security research, an aspect that has not been in focus but certainly should be,” says Jessy Li, an associate professor at the University of Texas at Austin, who was not involved in this work.

This work is supported in part by a Bridgewater AIA Labs Fellowship, the National Science Foundation, the Gordon and Betty Moore Foundation, a Google Research Award, and Schmidt Sciences.
