A large language model (LLM) deployed to make treatment recommendations can be tripped up by nonclinical information in patient messages, such as typos, extra white space, missing gender markers, or the use of uncertain, dramatic, and informal language, according to a study by MIT researchers.
They found that making stylistic or grammatical changes to messages increases the likelihood that an LLM will recommend a patient self-manage their reported health condition rather than come in for an appointment, even when that patient should seek medical care.
Their analysis also revealed that these nonclinical variations in text, which mimic how people really communicate, are more likely to change a model's treatment recommendations for female patients, resulting in a higher percentage of women who were erroneously advised not to seek medical care, according to human doctors.
This work "is strong evidence that models must be audited before use in health care, which is a setting where they are already in use," says Marzyeh Ghassemi, an associate professor in the Department of Electrical Engineering and Computer Science (EECS), a member of the Institute for Medical Engineering and Science and the Laboratory for Information and Decision Systems, and senior author of the study.
These findings indicate that LLMs take nonclinical information into account in clinical decision-making in previously unrecognized ways. It highlights the need for more rigorous evaluation of LLMs before they are deployed in high-stakes applications, such as making treatment recommendations, the researchers say.
"These models are often trained and tested on medical exam questions, but then used in tasks that are quite far from that, such as evaluating the severity of a clinical case. There is still so much about LLMs that we don't know," adds Abinitha Gourabathina, an EECS graduate student and lead author of the study.
They are joined on the paper, which will be presented at the ACM Conference on Fairness, Accountability, and Transparency, by graduate student Eileen Pan and postdoc Walter Gerych.
Mixed messages
Large language models such as OpenAI's GPT-4 are being used to draft clinical notes and triage patient messages in health care facilities around the world, in an effort to streamline some tasks and help overburdened clinicians.
A growing body of work has explored the clinical reasoning capabilities of LLMs, especially from a fairness perspective, but few studies have evaluated how nonclinical information affects a model's judgment.
Interested in how gender affects LLM reasoning, Gourabathina ran experiments in which she swapped the gender cues in patient notes. She was surprised that formatting errors in the prompts, such as extra white space, caused meaningful changes in the LLM responses.
To explore this problem, the researchers designed a study in which they altered the model's input data by swapping or removing gender markers, adding colorful or uncertain language, or inserting extra white space and typos into patient messages.
Each perturbation was designed to mimic text that might be written by someone in a vulnerable patient population, based on psychosocial research into how people communicate with clinicians.
For example, extra spaces and typos simulate the writing of patients with limited English proficiency or less technological aptitude, while the addition of uncertain language represents patients with health anxiety.
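To make these kinds of perturbations concrete, the sketch below shows one hypothetical way to inject extra white space, typos, and uncertain language into a patient message in Python. It is an illustrative assumption, not the perturbation pipeline used in the study; the function names and hedging phrase are made up.

import random

# Illustrative sketch only: a hypothetical way to inject nonclinical
# perturbations (extra white space, typos, uncertain language) into a
# patient message. This is not the study's actual perturbation pipeline.

def add_extra_whitespace(text: str, rate: float = 0.1) -> str:
    """Randomly double the space after some words."""
    words = text.split(" ")
    return " ".join(w + " " if random.random() < rate else w for w in words)

def add_typos(text: str, rate: float = 0.05) -> str:
    """Swap adjacent characters in a small fraction of longer words."""
    def garble(word: str) -> str:
        if len(word) > 3 and random.random() < rate:
            i = random.randrange(len(word) - 1)
            return word[:i] + word[i + 1] + word[i] + word[i + 2:]
        return word
    return " ".join(garble(w) for w in text.split(" "))

def add_uncertain_language(text: str) -> str:
    """Prepend a hedging phrase, mimicking an anxious or unsure patient."""
    return "I'm not sure if this is anything, but " + text[0].lower() + text[1:]

message = "I have had chest pain and shortness of breath since yesterday."
print(add_uncertain_language(add_typos(add_extra_whitespace(message))))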
"The medical datasets these models are trained on are usually cleaned and structured, and not a very realistic reflection of the patient population. We wanted to see how these very realistic changes in text could impact downstream use cases," Gourabathina says.
They used an LLM to create perturbed copies of thousands of patient notes, while keeping the text changes minimal and preserving all clinical data, such as medications and previous diagnoses. They then evaluated four LLMs, including the large, commercial model GPT-4 and a smaller LLM built specifically for medical settings.
They prompted each LLM with three questions based on the patient note: whether the patient should manage the condition at home, whether the patient should come in for a clinic visit, and whether a medical resource, such as a lab test, should be allocated to the patient.
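As a hedged illustration of this triage-style prompting, the sketch below poses three yes/no questions about a single patient note. The exact question wording and the ask_model() helper are assumptions for illustration, not the prompts or model interface used in the study.

# Hypothetical sketch of triage-style prompting; the question wording and
# the ask_model() callable are illustrative assumptions, not the study's.

TRIAGE_QUESTIONS = [
    "Should the patient manage this condition at home?",
    "Should the patient come in for a clinic visit?",
    "Should a medical resource, such as a lab test, be allocated to the patient?",
]

def triage(patient_note: str, ask_model) -> dict:
    """Ask a model each yes/no triage question about one patient note."""
    answers = {}
    for question in TRIAGE_QUESTIONS:
        prompt = (
            "Patient message:\n"
            f"{patient_note}\n\n"
            f"{question} Answer 'yes' or 'no'."
        )
        answers[question] = ask_model(prompt).strip().lower().startswith("yes")
    return answers

# The same note can then be triaged twice, once in its original form and once
# with a nonclinical perturbation applied, and the two sets of answers compared.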
The researchers compared the LLM recommendations to real clinical responses.
Inconsistent recommendations
They saw inconsistencies in treatment recommendations and significant disagreement among the LLMs when the models were fed perturbed data. Across the board, the LLMs exhibited a 7 to 9 percent increase in self-management suggestions across all nine types of altered patient messages.
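For intuition about how such a shift could be measured, here is a minimal sketch that compares the rate of self-management recommendations on original versus perturbed messages. The toy data and helper function are assumptions for illustration, not the study's analysis code or results.

# Minimal sketch (with made-up toy data) of measuring a shift in
# self-management recommendations between original and perturbed messages.

def self_management_rate(recommended_self_management: list[bool]) -> float:
    """Fraction of cases where the model recommended managing at home."""
    return sum(recommended_self_management) / len(recommended_self_management)

# One boolean per patient message: True if the model said "manage at home".
original = [False, False, True, False, True, False, False, False, True, False]
perturbed = [False, True, True, False, True, False, True, False, True, False]

shift = self_management_rate(perturbed) - self_management_rate(original)
print(f"Self-management suggestions rose by {shift:.0%} on this toy data")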
This means the LLMs were more likely to recommend that patients not seek medical care when messages contained typos or gender-neutral pronouns, for instance. The use of colorful language, such as slang or dramatic expressions, had the biggest impact.
They also found that the models made about 7 percent more errors for female patients and were more likely to recommend that female patients self-manage at home, even when the researchers removed all gender cues from the clinical context.
Many of the worst failures, such as patients advised to self-manage when they have a serious medical condition, would likely not be captured by tests that focus on the models' overall clinical accuracy.
"In research, we tend to look at aggregated statistics, but there is a lot that gets lost in translation. We need to look at the direction in which these errors occur: not recommending a visit when a patient should come in is far more harmful than doing the opposite," Gourabathina says.
The inconsistencies caused by nonclinical language become even more pronounced in conversational settings where an LLM interacts with a patient, which is a common use case for patient-facing chatbots.
But in follow-up work, the researchers found that the same changes in patient messages do not affect the accuracy of human clinicians.
"In our follow-up work, currently under review, we further find that large language models are fragile to changes that human clinicians are not," says Ghassemi. "This is perhaps unsurprising; LLMs were not designed to prioritize patient medical care. LLMs are flexible and performant enough on average that we might think this is a good use case. But we don't want to optimize a health care system that only works well for patients in specific groups."
The researchers want to expand on this work by designing natural language perturbations that capture other vulnerable populations and better mimic real messages. They also want to explore how LLMs infer gender from clinical text.