Large language models (LLMs) are promoted as tools that can democratize access to information around the world, offering knowledge through a user-friendly interface regardless of a person's background or location. But new research from MIT's Center for Constructive Communication (CCC) suggests that these AI systems may actually perform worse for the very users who stand to benefit from them the most.
A study by CCC researchers at the MIT Media Lab found that state-of-the-art AI chatbots, including OpenAI's GPT-4, Anthropic's Claude 3 Opus, and Meta's Llama 3, sometimes provide less accurate and less truthful responses to users who have lower English proficiency, less formal education, or come from outside the United States. The models also refuse to answer these users' questions more often, and in some cases respond with condescending language.
“We were motivated by the prospect that these models could help solve the problem of unequal access to information around the world,” says lead author Elinor Poole-Dayan SM ’25, a technical fellow at the MIT Sloan School of Management who led the research as a CCC affiliate and a master's student in media arts and sciences. “But this vision cannot become a reality without ensuring that model biases and harmful tendencies are safely mitigated for all users, regardless of language, nationality, or other demographics.”
A paper describing the work, titled “LLM Targeted Underperformance Disproportionately Impacts Vulnerable Users,” was presented at the AAAI Conference on Artificial Intelligence in January.
Systematic underperformance across multiple dimensions
For this study, the team tested how the three LLMs answered questions from two datasets: TruthfulQA and SciQ. TruthfulQA measures a model's truthfulness using questions that play on common misconceptions and literal truths about the world, while SciQ contains science exam questions that test factual accuracy. For each question, the researchers attached a short user biography that varied along three characteristics: education level, English proficiency, and country of origin.
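To make the setup concrete, the sketch below shows one way such an experiment could be wired up: a user biography is prepended to each benchmark question and accuracy is compared against a no-biography control. The biography texts, prompt template, and `ask_model` callable are illustrative assumptions, not the study's actual code or prompts.

```python
# Illustrative sketch (not the authors' code): probing whether attaching a user
# biography to a question changes a model's accuracy on a benchmark.

def build_prompt(bio: str | None, question: str) -> str:
    """Attach an optional user biography before the question."""
    if bio is None:  # control condition: no biography
        return question
    return f"{bio}\n\n{question}"

def accuracy(ask_model, qa_pairs, bio=None) -> float:
    """Fraction of questions answered correctly for a given user biography.
    `ask_model` is any callable mapping a prompt string to a model response."""
    correct = 0
    for question, answer in qa_pairs:
        response = ask_model(build_prompt(bio, question))
        correct += answer.lower() in response.lower()  # crude containment check
    return correct / len(qa_pairs)

# Hypothetical user profiles varying education, English proficiency, and country:
BIOS = {
    "control": None,
    "less_educated_non_native": (
        "I am from Iran. I left school at 15 and English is not my first language."
    ),
    "highly_educated_native": (
        "I am from the United States. I have a PhD and English is my first language."
    ),
}

# Usage (plugging in a real model client as ask_model):
# qa_pairs = [("What is the boiling point of water at sea level in Celsius?", "100")]
# for name, bio in BIOS.items():
#     print(name, accuracy(ask_model, qa_pairs, bio))
```

Comparing the per-profile accuracies against the control is what surfaces the kind of gaps the study reports.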
Across all three models and both datasets, the researchers found a significant drop in accuracy on questions asked by users described as having less formal education or being non-native English speakers. The effects were most pronounced at the intersection of these categories: users with less formal education who were also non-native English speakers saw the greatest decline in response quality.
The study also examined how country of origin affects model performance. By testing users from the United States, Iran, and China with similar educational backgrounds, the researchers found that Claude 3 Opus in particular performed significantly worse for Iranian users in both datasets.
“We see the greatest drop in accuracy for a user who is not a native English speaker and is less educated,” says Jad Kabbara, a research associate at CCC and co-author of the paper. “These results show that the negative effects of such model behavior compound across user characteristics in troubling ways, suggesting that when these models are deployed at scale, they risk spreading harmful behavior or misinformation to those least able to identify it.”
Refusals and condescending language
Perhaps most striking were the differences in the frequency with which models refused to answer questions at all. For example, Claude 3 Opus refused to answer almost 11 percent of questions asked by less-educated, non-native English speakers—compared to just 3.6 percent in a control condition with no user biography.
When the researchers manually analyzed these refusals, they found that Claude responded with condescending or mocking language 43.7 percent of the time for less-educated users, compared to less than 1 percent of the time for highly educated users. In some cases, the model imitated broken English or adopted an exaggerated dialect.
The model also refused to provide information on certain topics specifically for less-educated users from Iran or Russia, including questions about nuclear energy, anatomy, and historical events, even though it answered the same questions correctly for other users.
“This is another indicator that the alignment process may encourage models to withhold information from some users to avoid potentially misleading them, even though the model clearly knows the correct answer and provides it to other users,” Kabbara says.
Echoes of human prejudices
The findings reflect documented patterns of human sociocognitive bias. Social science research has shown that native English speakers often perceive non-native speakers as less educated, less intelligent, and less competent, regardless of their actual expertise. Similar biased perceptions have been documented among teachers assessing students who do not speak English as their first language.
“The value of large language models is evident in their extraordinary adoption by individuals and the enormous investment in technology,” says Deb Roy, professor of media arts and sciences, director of the CCC and co-author of the paper. “This study is a reminder of how important it is to continually assess the systematic biases that can quietly seep into these systems, causing unfair harm to some groups without any of us being fully aware of it.”
The implications are particularly concerning given that personalization features, such as ChatGPT's memory, which retains user information across conversations, are becoming more common. Such features risk treating already-marginalized groups differently.
“LLMs are being touted as tools that will foster more equitable access to information and revolutionize personalized learning,” says Poole-Dayan. “However, our findings suggest that they may actually exacerbate existing inequalities by systematically providing misinformation or refusing to respond to queries for some users. The people who rely most on these tools may receive poor, false, and even harmful information.”