Data privacy comes with a cost. There are security techniques that protect sensitive user data, such as customer addresses, from attackers who may try to extract it from AI models, but these techniques often make those models less accurate.
MIT researchers recently developed a framework, based on a new privacy metric called PAC Privacy, that could maintain the performance of an AI model while ensuring sensitive data, such as medical images or financial records, remain safe from attackers. Now they have taken this work a step further by making their technique more computationally efficient, improving the tradeoff between accuracy and privacy, and creating a formal template that can be used to privatize virtually any algorithm without needing access to that algorithm's inner workings.
The team used their new version of PAC Privacy to privatize several classic algorithms for data analysis and machine-learning tasks.
They also showed that more "stable" algorithms are easier to privatize with their method. A stable algorithm's predictions remain consistent even when its training data are slightly modified. Greater stability helps an algorithm make more accurate predictions on previously unseen data.
The researchers say the increased efficiency of the new PAC Privacy framework, and the four-step template one can follow to implement it, would make the technique easier to deploy in real-world situations.
“We tend to consider robustness and privacy as unrelated to, or perhaps even in conflict with, constructing a high-performance algorithm. First, we make a working algorithm, then we make it robust, and then private,” says Mayuri Sridhar, lead author of a paper on this privacy framework.
She is joined on the paper by Hanshen Xiao PhD '24, who will begin as an assistant professor at Purdue University in the fall; and senior author Srini Devadas, the Edwin Sibley Webster Professor of Electrical Engineering at MIT. The research will be presented at the IEEE Symposium on Security and Privacy.
Noise estimation
To protect sensitive data that were used to train an AI model, engineers often add noise, or generic randomness, so it becomes harder for an adversary to guess the original training data. This noise reduces the model's accuracy, so the less noise one can add, the better.
PAC Privacy automatically estimates the smallest amount of noise that needs to be added to an algorithm to achieve a desired level of privacy.
The original PAC Privacy algorithm runs a user's AI model many times on different samples of a dataset. It measures the variance as well as the correlations among these many outputs and uses this information to estimate how much noise needs to be added to protect the data.
The new PAC Privacy variant works in the same way, but it does not need to represent the entire covariance matrix of the outputs; it only needs the output variances.
“Because the thing you are estimating is much, much smaller than the entire covariance matrix, you can do it much, much faster,” Sridhar explains. This means one can scale up to much larger datasets.
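To make that difference concrete, the sketch below is a hypothetical illustration, not the researchers' implementation: it runs an algorithm on many random subsamples of a dataset and contrasts the two estimation strategies, the full output covariance matrix used by the original approach versus the per-coordinate variances the new variant relies on.

```python
import numpy as np

def estimate_output_spread(train_algorithm, dataset, n_trials=100, seed=0):
    """Hypothetical sketch: run `train_algorithm` on many random subsamples
    and measure how much its (vector-valued) output varies across runs."""
    rng = np.random.default_rng(seed)
    outputs = []
    for _ in range(n_trials):
        # Draw a random half of the data without replacement.
        idx = rng.choice(len(dataset), size=len(dataset) // 2, replace=False)
        outputs.append(np.asarray(train_algorithm(dataset[idx])))
    outputs = np.stack(outputs)

    # Original-style estimate: the full covariance matrix of the outputs,
    # which is expensive to estimate and store in high dimensions.
    full_covariance = np.cov(outputs, rowvar=False)

    # Variance-only estimate in the spirit of the new variant: just the
    # per-coordinate variances, a single vector instead of a full matrix.
    per_coordinate_variance = outputs.var(axis=0)

    return full_covariance, per_coordinate_variance
```

In the actual framework, these measurements are converted into a calibrated amount of noise based on the desired privacy level; that calibration step is part of PAC Privacy itself and is omitted from this sketch.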
Adding noise can hurt the utility of the results, so it is important to minimize the loss of utility. Due to its computational cost, the original PAC Privacy algorithm was limited to adding isotropic noise, which is applied uniformly in all directions. Because the new variant estimates anisotropic noise, which is tailored to specific characteristics of the training data, a user can add less overall noise to achieve the same level of privacy, boosting the accuracy of the privatized algorithm.
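The following minimal sketch, with hypothetical helper names, shows the distinction: isotropic noise uses a single scale in every direction, while anisotropic noise is scaled coordinate by coordinate (for example, from per-coordinate variance estimates like those above), so directions where the output barely changes receive very little noise.

```python
import numpy as np

def add_isotropic_noise(output, scale, seed=None):
    """Add Gaussian noise with the same scale in every direction."""
    rng = np.random.default_rng(seed)
    return output + rng.normal(0.0, scale, size=output.shape)

def add_anisotropic_noise(output, per_coordinate_std, seed=None):
    """Add Gaussian noise whose scale differs per coordinate, so less noise
    lands in directions where the algorithm's output is already stable."""
    rng = np.random.default_rng(seed)
    return output + rng.normal(0.0, 1.0, size=output.shape) * per_coordinate_std
```

How large those per-coordinate scales must be for a given privacy level is exactly what the PAC Privacy analysis determines; the sketch only illustrates where the tailored noise is applied.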
Privacy and stability
As she studied PAC Privacy, Sridhar hypothesized that more stable algorithms would be easier to privatize with this technique. She used the more efficient variant of PAC Privacy to test this theory on several classic algorithms.
Algorithms that are more stable have less variance in their outputs when their training data change slightly. PAC Privacy breaks a dataset into chunks, runs the algorithm on each chunk of data, and measures the variance among the outputs. The greater the variance, the more noise must be added to privatize the algorithm.
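A small, hypothetical example of that chunk-and-measure idea: a stable statistic such as the mean barely varies from chunk to chunk, while an unstable one such as the maximum varies far more and would therefore need more noise.

```python
import numpy as np

def output_variance_across_chunks(algorithm, dataset, n_chunks=10):
    """Split the data into chunks, run the algorithm on each chunk,
    and measure the variance among the outputs (hypothetical sketch)."""
    chunks = np.array_split(dataset, n_chunks)
    outputs = np.array([algorithm(chunk) for chunk in chunks])
    return outputs.var(axis=0)

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=10_000)

# A stable statistic (the mean) changes very little from chunk to chunk...
print(output_variance_across_chunks(np.mean, data))
# ...while an unstable one (the maximum) swings far more, so privatizing it
# would require adding considerably more noise.
print(output_variance_across_chunks(np.max, data))
```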
Employing stability techniques to decrease the variance in an algorithm's outputs would also reduce the amount of noise that needs to be added to privatize it, she explains.
“In the best cases, we can get these win-win scenarios,” she says.
The team showed that these privacy guarantees remained strong regardless of the algorithm they tested, and that the new variant of PAC Privacy required an order of magnitude fewer trials to estimate the noise. They also tested the method in attack simulations, demonstrating that its privacy guarantees could withstand state-of-the-art attacks.
“We want to explore how algorithms could be co-designed with PAC Privacy, so the algorithm is more stable, secure, and robust from the start,” Devadas says. The researchers also want to test their method with more complex algorithms and further explore the privacy-utility tradeoff.
“The question now is: When do these win-win situations happen, and how can we make them happen more often?” Sridhar says.
“I think the key advantage PAC Privacy has in this setting over other privacy definitions is that it is a black box — you do not need to manually analyze each individual query to privatize the results. It can be done completely automatically. We are actively building a PAC-enabled database by extending existing SQL engines to support practical, automated, and efficient private data analytics,” says a researcher at the University of Wisconsin at Madison who was not involved with this study.
This research is supported, in part, by Cisco Systems, Capital One, the U.S. Department of Defense, and a MathWorks Fellowship.