The approach presented here uses synthetic data to improve the accuracy of AI models that recognize images.
Before a machine learning model can diagnose diseases in medical images, it must be trained. Training an image classification model usually requires a huge dataset, often millions of similar example images. And this is where problems arise.
Using real medical images is not always ethical: it can invade patients' privacy, violate copyright, or rely on a dataset that is biased against a specific racial or ethnic group. To minimize these risks, you can forgo real image data and instead use image generation programs to create a synthetic dataset for training the classification model. However, such methods are limited, because expert knowledge is often required to manually design generation programs that produce effective training data.
Scientists from the Massachusetts Institute of Technology, the MIT-IBM Watson AI Lab, and elsewhere analyzed the problems encountered in generating image datasets and proposed a different solution. Instead of developing a custom image generation program, they assembled a large collection of basic image generation programs from code publicly available on the Internet.
Their collection consisted of 21,000 distinct programs, each able to produce images of simple textures and colors. The programs were small, typically only a few lines of code. The scientists did not modify these programs and used them as-is to generate a set of images.
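The article does not reproduce any of these programs, but a minimal sketch gives a feel for how little code such a generator needs. The following function is purely hypothetical, not one of the 21,000 collected programs: it draws random sinusoidal color bands, the kind of simple texture the article describes.

```python
import numpy as np
from PIL import Image

def generate_texture(seed: int, size: int = 256) -> Image.Image:
    """Hypothetical few-line procedural generator: random sinusoidal
    color bands (illustrative only, not the researchers' actual code)."""
    rng = np.random.default_rng(seed)
    x, y = np.meshgrid(np.linspace(0, 1, size), np.linspace(0, 1, size))
    channels = []
    for _ in range(3):  # one random wave pattern per RGB channel
        fx, fy, phase = rng.uniform(1, 20, size=3)
        channels.append(np.sin(2 * np.pi * (fx * x + fy * y) + phase))
    img = np.stack(channels, axis=-1)              # values in [-1, 1]
    img = ((img + 1) / 2 * 255).astype(np.uint8)   # rescale to 0..255
    return Image.fromarray(img)

generate_texture(seed=0).save("texture_0.png")
```

Each seed yields a different abstract texture, so one tiny program can already supply an unbounded stream of training images.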
They used this dataset to train a computer vision model. Test results showed that models trained on these images classified images more accurately than other models trained on synthetic data. Even so, they still fell short of models trained on real data. The researchers also found that increasing the number of image generation programs in the dataset improved the model's performance, enabling higher accuracy.
It turned out that using many programs that require no additional work is actually better than using a small set of programs that need extra processing. Data certainly matters, but this experiment showed that good results can be achieved without real data.
This research invites a rethink of the pretraining process. Machine learning models are usually pre-trained: first they are trained on one dataset, establishing their parameters, and then they can be adapted to solve other problems.
For example, a model designed to classify X-ray images can first be pretrained on a huge synthetically generated dataset, and only then fine-tuned on a much smaller dataset of real X-rays to perform its actual task. The problem with this method is that the synthetic images must match certain properties of the real images, which in turn requires extra work on the programs that generate them. This complicates the training process.
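The two-stage workflow just described is a standard pattern in machine learning. Here is a minimal, runnable sketch of it in PyTorch; the random tensors stand in for the synthetic images and real X-rays, and the class counts and hyperparameters are illustrative assumptions, not values from the study.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models import resnet18

# Dummy stand-ins for real loaders; in practice these would stream
# procedurally generated images (stage 1) and real X-rays (stage 2).
synthetic_loader = DataLoader(
    TensorDataset(torch.randn(64, 3, 224, 224), torch.randint(0, 1000, (64,))),
    batch_size=16)
xray_loader = DataLoader(
    TensorDataset(torch.randn(32, 3, 224, 224), torch.randint(0, 2, (32,))),
    batch_size=16)

model = resnet18(num_classes=1000)  # head sized for the synthetic "classes"
criterion = nn.CrossEntropyLoss()

# Stage 1: pretrain on the large synthetic dataset.
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
for images, labels in synthetic_loader:
    opt.zero_grad()
    criterion(model(images), labels).backward()
    opt.step()

# Stage 2: swap the head and fine-tune on the small real X-ray dataset.
model.fc = nn.Linear(model.fc.in_features, 2)  # e.g. normal vs. abnormal
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
for images, labels in xray_loader:
    opt.zero_grad()
    criterion(model(images), labels).backward()
    opt.step()
```

The key point is that the expensive first stage never needs real, potentially sensitive data; only the short second stage does.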
Instead, the scientists from the Watson AI Lab used simple image generation programs, gathered in large numbers from the Internet. The programs had to generate images quickly, so the scientists chose ones written in a simple programming language and containing only a few lines of code. The requirements for the generated images were also modest: they only had to look like abstract art.
These programs ran so quickly that there was no need to prepare a dataset of images for model training in advance. The programs generated images and the model was trained on them immediately, which significantly simplifies the process.
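One way to express this on-the-fly setup in code is with a streaming dataset that calls a generator at load time instead of reading files from disk. The sketch below reuses the hypothetical `generate_texture` function from earlier and hypothetically uses the program index as a placeholder label; it illustrates the idea, not the researchers' actual pipeline.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, IterableDataset

# Reuses the hypothetical `generate_texture` sketched earlier.

class OnTheFlyImages(IterableDataset):
    """Generates training samples at load time instead of reading a
    pre-built dataset from disk."""

    def __init__(self, num_programs: int = 21_000):
        self.num_programs = num_programs

    def __iter__(self):
        while True:  # endless stream; the training loop decides when to stop
            pid = int(torch.randint(self.num_programs, (1,)))
            img = np.array(generate_texture(seed=pid))        # H x W x 3, uint8
            tensor = torch.from_numpy(img).permute(2, 0, 1).float() / 255.0
            yield tensor, pid

loader = DataLoader(OnTheFlyImages(), batch_size=32)
images, labels = next(iter(loader))  # produced on demand, nothing stored on disk
```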
The researchers used their wide range of image generation programs to pretrain computer vision models for both supervised and unsupervised image classification. In supervised training the image data is labeled, while in unsupervised training the model learns to classify images without labels.
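The two regimes differ only in where the training signal comes from. The following sketch is illustrative rather than the study's actual objectives: the supervised variant assumes some label exists for each synthetic image (here, hypothetically, the index of the program that generated it), while the unsupervised variant uses a standard SimCLR-style contrastive loss over two augmented views of the same image, which needs no labels at all.

```python
import torch
import torch.nn.functional as F

# Supervised: an external label is required; here we hypothetically use
# the index of the generating program as the class label.
def supervised_loss(model, images, program_ids):
    return F.cross_entropy(model(images), program_ids)

# Unsupervised (SimCLR-style contrastive loss): no labels; two augmented
# views of the same image should get similar embeddings.
def contrastive_loss(encoder, view1, view2, temperature=0.5):
    z = F.normalize(torch.cat([encoder(view1), encoder(view2)]), dim=1)
    sim = z @ z.t() / temperature           # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))       # a view is not its own positive
    n = view1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, targets)    # the positive is the other view
```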
When they compared their pretrained models with state-of-the-art computer vision models that had been pretrained on synthetic data, their models were more accurate, placing images in the correct categories more often. Although accuracy was still lower than for models trained on real data, the method narrowed the performance gap between models trained on real data and models trained on synthetic data by 38 percent.
The study also shows that performance scales logarithmically with the number of generative programs: the more programs are collected, the better the model performs. The scientists emphasize that this points to a way of scaling their approach further.
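Logarithmic scaling has a simple reading: each time the number of programs $N$ is multiplied by a fixed factor, accuracy grows by a roughly constant amount. As an illustration (the constants $a$ and $b$ are hypothetical fit parameters, not values reported in the article), the trend can be written as

$$\text{accuracy}(N) \approx a + b \log N,$$

so going from 1,000 to 2,000 programs buys about the same gain, $b \log 2$, as going from 10,000 to 20,000.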
To determine the factors affecting model accuracy, the scientists pretrained with each image generation program separately. They discovered that the more diverse the set of images a program generated, the better the resulting model. They also observed that colorful images filling the entire canvas did the most to improve the model's performance.
This approach to pretraining proved quite effective. The scientists plan to apply their methods to other types of data, such as multimodal data combining text and images. They also want to further explore ways to improve image classification performance.
More details about the study are available in the original article.