A company that wants to use a large language model (LLM) to summarize sales reports or triage customer inquiries can choose from hundreds of LLMs, many with dozens of variants, each offering slightly different performance.
To narrow down the selection, companies often turn to LLM ranking platforms, which aggregate user feedback on model responses to rank the newest LLMs by how well they perform on specific tasks.
However, MIT researchers found that just a handful of user votes can skew the results, leading someone to mistakenly believe that one LLM is the ideal choice for a particular use case. Their study shows that removing a small portion of the crowdsourced data can change which models rank highest.
They have developed a quick method for testing ranking platforms to determine whether they are susceptible to this issue. The scoring technique identifies the individual votes most responsible for distorting the results, so users can inspect those influential votes.
The researchers say this work highlights the need for more rigorous strategies for evaluating model rankings. While they did not focus on mitigation in this study, they offer suggestions that could improve the reliability of these platforms, such as collecting more detailed feedback from users when building the rankings.
The study also offers a warning to users who may rely on these rankings when deciding which LLM to deploy, a choice that could have far-reaching and costly effects on a company or organization.
“We were surprised that these ranking platforms were so sensitive to this problem. If it turns out that the top-ranked LLM depends on only two or three user votes out of tens of thousands, then you cannot assume that the top-ranked LLM will consistently outperform all other LLMs once it is deployed,” says Tamara Broderick, an associate professor in MIT's Department of Electrical Engineering and Computer Science (EECS), a member of the Laboratory for Information and Decision Systems (LIDS) and the Institute for Data, Systems, and Society, an affiliate of the Computer Science and Artificial Intelligence Laboratory (CSAIL), and senior author of the study.
She is joined on the paper by co-lead authors Jenny Huang and Yunyi Shen, both EECS graduate students, and Dennis Wei, a senior research scientist at IBM Research. The findings will be presented at the International Conference on Learning Representations.
Dropping data
While there are many types of LLM ranking platforms, the most popular ones ask users to submit a query to two models and choose which one provides the better response.
The platforms aggregate the results of these matchups to create rankings showing which LLMs perform best at specific tasks, such as coding or visual understanding.
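To make this concrete, here is a minimal sketch of how pairwise votes like these can be turned into a leaderboard. The article does not say which aggregation rule the platforms actually use; the sketch assumes a Bradley-Terry model, a common choice for crowdsourced comparisons, fit with the classic iterative (MM) update, and the model names are hypothetical.

```python
# Minimal sketch (not the platforms' actual code): aggregating pairwise
# "which response was better?" votes into a ranking with a Bradley-Terry
# model, fit by the classic MM (Zermelo) iteration. Model names are made up.
from collections import defaultdict

def bradley_terry_ranking(votes, n_iters=200):
    """votes: list of (winner, loser) pairs, one per user matchup."""
    wins = defaultdict(float)          # total wins per model
    pair_counts = defaultdict(float)   # matchups per unordered pair of models
    models = set()
    for winner, loser in votes:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    scores = {m: 1.0 for m in models}
    for _ in range(n_iters):
        new_scores = {}
        for i in models:
            denom = sum(
                pair_counts[frozenset((i, j))] / (scores[i] + scores[j])
                for j in models if j != i
            )
            new_scores[i] = wins[i] / denom if denom > 0 else scores[i]
        total = sum(new_scores.values())
        scores = {m: s / total for m, s in new_scores.items()}  # normalize

    return sorted(models, key=lambda m: scores[m], reverse=True)  # best first

# Toy matchups: model_a beats model_b twice and model_c once, and so on.
votes = [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a"),
         ("model_a", "model_c"), ("model_b", "model_c"), ("model_c", "model_b")]
print(bradley_terry_ranking(votes))  # e.g. ['model_a', 'model_b', 'model_c']
```

Because every vote feeds directly into the fitted scores, each individual matchup exerts some pull on the final ordering, which is what makes the robustness question below meaningful.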
When selecting the top-performing LLM, a user likely expects that model's top ranking to generalize, meaning it should outperform other models on the user's similar, but not identical, application with new data.
The MIT researchers have previously explored this kind of generalization in fields such as statistics and economics. That work revealed cases where removing a small percentage of data could change a model's results, indicating that the conclusions of those studies might not extend beyond their narrow scope.
The researchers wanted to see if the same analysis could be applied to LLM ranking platforms.
“Ultimately, the user wants to know whether they are choosing the best LLM. If only a few votes are driving the ranking, that suggests the ranking may not be reliable,” says Broderick.
However, testing the effect of dropped data by brute force would be computationally infeasible. For example, one of the rankings they assessed was based on more than 57,000 votes. Testing a 0.1 percent drop in the data would mean removing every possible subset of 57 votes from the 57,000 (there are more than 10^194 such subsets) and recalculating the ranking each time.
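As a quick back-of-the-envelope check on that count, the number of distinct 57-vote subsets of 57,000 votes is the binomial coefficient C(57,000, 57), which can be evaluated in log space without overflow:

```python
# Back-of-the-envelope check of the figure above: the number of ways to drop
# 57 votes (0.1 percent) out of 57,000 is the binomial coefficient C(57000, 57).
import math

def log10_binomial(n, k):
    """log10 of C(n, k), computed via log-gamma to avoid overflow."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)) / math.log(10)

print(log10_binomial(57_000, 57))  # ~194.5, i.e. more than 10^194 subsets
```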
Instead, the researchers developed an efficient approximation method, building on their previous work and adapting it to LLM ranking platforms.
“Even though we have a theory proving that the approximation works under certain assumptions, the user doesn't have to believe it. Our method reports the problematic data points at the end, so they can simply remove those data points, re-run the analysis, and see whether the rankings change,” Broderick says.
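The sketch below illustrates only the "remove the flagged votes and re-run" check described in that quote, not the researchers' approximation itself: it brute-forces leave-one-out re-ranking on a toy leaderboard, scored here by simple win rate for brevity, which is exactly the kind of exhaustive computation that becomes infeasible at the scale of real platforms.

```python
# Illustration only, not the researchers' method: a brute-force leave-one-out
# check on a toy leaderboard, scored by simple win rate for brevity. It shows
# the "drop the flagged votes, re-rank, and see if the leader changes" idea;
# exhaustive checks like this are what become infeasible at the scale of real
# platforms, which is why an efficient approximation is needed.
from collections import Counter

def top_model(votes):
    """Rank models by win rate over their matchups and return the leader."""
    wins, games = Counter(), Counter()
    for winner, loser in votes:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return max(games, key=lambda m: wins[m] / games[m])

def votes_that_flip_leader(votes):
    """Return the baseline leader and the indices of single votes whose
    removal changes which model ranks first."""
    baseline = top_model(votes)
    flips = [i for i in range(len(votes))
             if top_model(votes[:i] + votes[i + 1:]) != baseline]
    return baseline, flips

# Toy matchups with hypothetical model names.
votes = [("model_a", "model_b"), ("model_b", "model_a"), ("model_b", "model_a"),
         ("model_a", "model_c"), ("model_a", "model_c"), ("model_b", "model_c")]
print(votes_that_flip_leader(votes))  # model_b leads; dropping vote 1 or 2 flips it
```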
Surprisingly sensitive
When the researchers applied their technique to popular ranking platforms, they were surprised by how few data points they had to remove to cause significant changes in the top LLMs. In one case, removing just two votes out of more than 57,000, or about 0.0035 percent, changed which model took first place.
Another ranking platform, which relies on professional annotators and higher-quality feedback, was more robust: there, removing 83 of the 2,575 votes (about 3 percent) changed which models came out on top.
Their study found that many influential votes may have been the result of user error. In some cases, there seemed to be a clear answer as to which LLM performed better, but the user chose the other model, Broderick says.
“We never know what the user had in mind at the time; maybe they clicked the wrong button, weren't paying attention, or really didn't know which was better. The most important takeaway is that you don't want noise, user error, or a few outliers to determine which LLM is ranked highest,” she adds.
The researchers suggest that collecting additional user feedback, such as how confident a user is in each vote, would provide richer information that could help alleviate this problem. Ranking platforms could also use human moderators to vet crowdsourced responses.
For their part, the researchers want to continue exploring generalization in other contexts, while developing better approximation methods that can capture more types of non-robustness.
“Broderick and her students' work shows how it is possible to obtain reliable estimates of the impact of specific data on downstream processes, despite the impossibility of performing exhaustive calculations, given the size of modern machine learning models and datasets,” says Jessica Hullman, the Ginni Rometty Professor of Computer Science at Northwestern University, who was not involved in the work. “The recent work provides insight into the strong data dependencies in routinely used – but also very delicate – methods of aggregating human preferences and using them to update a model. Seeing how few preferences can really change the behavior of a finely tuned model could inspire more thoughtful methods of collecting this data.”
This research is funded in part by the Office of Naval Research, the MIT-IBM Watson AI Lab, the National Science Foundation, Amazon, and a CSAIL seed award.