What is patient privacy for? The Hippocratic Oath, considered one of the earliest and most famous texts of medical ethics in the world, states: “Whatever I see or hear in the lives of my patients, whether in connection with my professional practice or not, which should not be discussed outside, I will keep secret, keeping all such things private.”
As privacy becomes increasingly limited in an age of data-intensive algorithms and cyberattacks, medicine is one of the few fields where confidentiality remains a core practice, allowing patients to trust their doctors with sensitive information.
But a new paper co-authored by MIT researchers examines how artificial intelligence models trained on de-identified electronic health records (EHRs) can memorize patient-specific information. The work, recently presented at the 2025 Conference on Neural Information Processing Systems (NeurIPS), recommends a rigorous testing setup to ensure that targeted prompts cannot reveal patient information, emphasizing that any leakage should be assessed in its healthcare context to determine whether it meaningfully compromises patient privacy.
Foundation models trained on EHRs are meant to generalize, learning from many patients' records to make better predictions. In cases of memorization, however, a model instead relies on a single patient's record to produce its output, potentially violating that patient's privacy. Foundation models are already known to be prone to such data leaks.
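To make that distinction concrete, here is a minimal sketch (not the authors' evaluation code) of how one might probe for patient-level memorization: ask a model to reconstruct values from records it was trained on and from comparable records it never saw, and look for a gap. The `predict_next_value` interface and the record format (a list of numeric values) are hypothetical stand-ins.

```python
import statistics


def reproduction_error(model, record):
    """Mean absolute error when the model is asked to fill in each value
    of a patient's record from the values that precede it.
    `model.predict_next_value(history)` is a hypothetical interface."""
    errors = []
    for i in range(1, len(record)):
        history, true_value = record[:i], record[i]
        predicted = model.predict_next_value(history)
        errors.append(abs(predicted - true_value))
    return statistics.mean(errors)


def memorization_gap(model, train_records, heldout_records):
    """Compare reconstruction quality on records the model was trained on
    versus comparable records it never saw. A model that only generalizes
    should do roughly equally well on both; a large positive gap suggests
    it is leaning on individual training records (memorization)."""
    train_err = statistics.mean(reproduction_error(model, r) for r in train_records)
    heldout_err = statistics.mean(reproduction_error(model, r) for r in heldout_records)
    return heldout_err - train_err
```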
“The knowledge in these high-capacity models can be a resource for many communities, but adversarial actors can trick a model into revealing information about its training data,” says Sana Tonekaboni, a postdoc at the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard and first author of the paper. She notes that given the risk that foundation models may also memorize private data, “this work is a step toward providing practical evaluation steps that our community can take before releasing models.”
To study the potential risks that EHR foundation models might pose in medicine, Tonekaboni turned to MIT Associate Professor Marzyeh Ghassemi, a principal investigator at the Abdul Latif Jameel Clinic for Machine Learning in Health (Jameel Clinic) and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL). Ghassemi, a faculty member in MIT's Department of Electrical Engineering and Computer Science and the Institute for Medical Engineering and Science, leads the Healthy ML group, which focuses on robust machine learning in health.
How much information would a malicious actor need to expose sensitive data, and how serious are the risks of such leakage? To answer this, the research team developed a series of tests that they hope will lay the foundation for future privacy assessments. The tests aim to measure different types of leakage and gauge their practical risk to patients across different levels of attacker knowledge.
“We really tried to emphasize practicality here. If an attacker needs to know the dates and values of a dozen lab tests from your record to extract information, the risk of harm is very low. If I already have access to that level of protected source data, why would I attack a large foundation model to get more?” says Ghassemi.
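That practicality argument can be phrased as an experiment: give a simulated attacker progressively more fields from a patient's record, prompt the model to fill in a held-back sensitive field, and record how often the completion matches. The sketch below assumes a hypothetical `model.complete(known_fields, target_field)` interface and toy record dictionaries; it illustrates the shape of such a test, not the paper's exact protocol.

```python
import random


def extraction_rate(model, records, target_field, known_count, trials=100):
    """Fraction of sampled patients whose true `target_field` value the model
    reproduces when the attacker's prompt already contains `known_count`
    other fields from that patient's record.
    `model.complete(known_fields, target_field)` is a hypothetical interface."""
    hits = 0
    sampled = random.sample(records, min(trials, len(records)))
    for record in sampled:
        other_fields = [k for k in record if k != target_field]
        known = {k: record[k] for k in random.sample(other_fields, known_count)}
        guess = model.complete(known, target_field)
        hits += int(guess == record[target_field])
    return hits / len(sampled)


def attacker_knowledge_sweep(model, records, target_field, max_known):
    """Leakage as a function of how much the attacker already knows.
    A curve that only rises once `known_count` is large suggests low
    practical risk: the attacker would already hold most of the record."""
    return {k: extraction_rate(model, records, target_field, k)
            for k in range(1, max_known + 1)}
```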
With the inevitable digitization of medical records, data breaches have become increasingly common. Over the past 24 months, the U.S. Department of Health and Human Services has recorded 747 breaches of health information, each affecting 500 or more people, most of them hacking or IT incidents.
Patients with unique medical conditions are particularly vulnerable, given how easily they can be singled out. “Even with de-identified data, it depends on the type of information about the individual that is being leaked,” Tonekaboni says. “When you identify them, you learn so much more.”
In their structured tests, the researchers found that the more information an attacker had about a particular patient, the greater the likelihood of information being leaked from the model. They showed how to distinguish between cases of model generalization and patient-level memorization to properly assess privacy risk.
The paper also highlights that some leaks are more harmful than others. For example, a model that reveals a patient's age or demographics represents a milder leak than one that reveals more sensitive information, such as an HIV diagnosis or a history of alcohol abuse.
The researchers note that patients with unique conditions, who can be singled out most easily, may require a higher level of protection. They plan to expand the work and make it more interdisciplinary, bringing in clinicians and privacy experts as well as legal experts.
“There's a reason why our health data is private,” says Tonekaboni. “There is no reason for others to know about it.”
This work is supported by the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard, the Wallenberg AI, Autonomous Systems and Software Program funded by the Knut and Alice Wallenberg Foundation, the U.S. National Science Foundation (NSF), a Gordon and Betty Moore Foundation Award, a Google Research Scholar Award, and the AI2050 program at Schmidt Sciences. Resources used to prepare this study were provided in part by the Province of Ontario, the Government of Canada through CIFAR, and the companies sponsoring the Vector Institute.