Stochastic Cluster Embedding: a new method for visualizing large data sets

A remarkable feature of the human brain is its ability to spot differences even in a huge amount of visual information. This ability is very useful when studying large amounts of data, because the content of the data must be compressed into a form that people can understand. For visual analysis, dimensionality reduction remains the central problem.

Scientists from Aalto University and the University of Helsinki, working at the Finnish Center for Artificial Intelligence (FCAI), tested the best-known visual analysis methods and found that none of them works well when the volume of data grows significantly. For example, the t-SNE, LargeVis and UMAP methods could no longer distinguish extremely strong signals of groups of observations in the data once the number of observations reached the hundreds of thousands.

The scientists developed a new nonlinear dimensionality reduction method called Stochastic Cluster Embedding (SCE) to visualize clusters better. It is designed to show data clusters and other macroscopic features as clearly as possible, so that they are easy for people to observe and understand. SCE uses graphics (GPU) acceleration, much as modern artificial intelligence methods do for neural network computations.

The discovery of the Higgs boson motivated the invention of this algorithm. The data set from the related experiments contains over 11 million feature vectors, and these data needed a convenient, clear visualization. This inspired the scientists to develop a new method.

The researchers generalized SNE using a family of divergences, parameterized by a scale factor S, between the non-normalized similarities in the input and output spaces. SNE is a special case in this family, in which S is chosen as the normalizing factor of the output similarities. During testing, however, the best value of S for cluster visualization often turned out to differ from the value chosen by SNE. To overcome this shortcoming of t-SNE, the new SCE method therefore uses a different choice that mixes in the input similarities when calculating S. The scientists also developed an efficient optimization algorithm based on asynchronous stochastic block coordinate descent. The new algorithm can use parallel computing devices and is suitable for large-scale tasks with large amounts of data.
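To make the role of the scale factor concrete, here is a minimal sketch in plain Python. It assumes details the article does not spell out: that the divergence is the generalized Kullback-Leibler (I-) divergence, that the input similarities are normalized to sum to 1, and that the output similarities use the Student-t kernel familiar from t-SNE. All function names and the toy data are hypothetical. Under those assumptions, setting S (written `s` in the code) to the normalizing factor of the output similarities reproduces the usual SNE objective, and any other S gives a larger divergence for the same embedding.

```python
import math
import random


def student_t_similarities(points):
    """Non-normalized output similarities q_ij = 1 / (1 + ||y_i - y_j||^2), i != j.

    Assumption: the Student-t kernel known from t-SNE.
    """
    q = {}
    for i, y_i in enumerate(points):
        for j, y_j in enumerate(points):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(y_i, y_j))
                q[(i, j)] = 1.0 / (1.0 + d2)
    return q


def i_divergence(p, q, s):
    """Generalized KL (I-) divergence between p and the scaled similarities s * q.

    Assumption: this is the divergence family meant in the text.
    """
    total = 0.0
    for key, p_ij in p.items():
        sq = s * q[key]
        if p_ij > 0.0:
            total += p_ij * math.log(p_ij / sq)
        total += sq - p_ij
    return total


def kl_divergence(p, q):
    """Ordinary SNE objective: KL divergence between p and q normalized to sum to 1."""
    z = sum(q.values())
    return sum(
        p_ij * math.log(p_ij / (q[key] / z))
        for key, p_ij in p.items()
        if p_ij > 0.0
    )


# Toy data (hypothetical): 6 points in 2-D and random input similarities.
rng = random.Random(0)
n = 6
points = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]
raw = {(i, j): rng.random() for i in range(n) for j in range(n) if i != j}
raw_total = sum(raw.values())
p = {key: value / raw_total for key, value in raw.items()}  # sums to 1

q = student_t_similarities(points)
s_sne = 1.0 / sum(q.values())  # SNE's choice: normalizing factor of q

print("I-divergence at the SNE scale:", i_divergence(p, q, s_sne))
print("KL divergence (SNE objective):", kl_divergence(p, q))
print("I-divergence at half the scale:", i_divergence(p, q, 0.5 * s_sne))
```

The first two printed values agree up to rounding, while the third is larger. For a fixed embedding, the SNE choice is simply the S that minimizes this divergence, so a method that picks S differently is optimizing a different objective. How SCE itself computes S is not reproduced here.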

During the project, the scientists tested the method on various real-world data sets and compared it with other modern NLDR (nonlinear dimensionality reduction) methods. Users taking part in the tests chose, across a range of S values, the visualizations they found most suitable for viewing clusters. The scientists then compared the S values chosen by SCE and t-SNE to see which was closer to the human choice. The four smallest data sets, IJCNN, TOMORADAR, SHUTTLE and MNIST, were used for testing. For each data set, participants were shown a series of visualizations: they used a slider to set the S value and examined the corresponding precomputed visualization. Each user then selected the preferred value of S for cluster visualization.

The test results clearly show that the S chosen by SNE lies to the right of the human median (solid green line) for all data sets. This suggests that, for people, GSNE with a smaller S is often better than t-SNE for visualizing clusters. The choice made by SCE (dashed red lines), in contrast, is closer to the human median for all four data sets.

Applying Stochastic Cluster Embedding to the Higgs boson data clearly identified their most important physical features. The new nonlinear dimensionality reduction method works several orders of magnitude faster than previous methods and is also much more reliable in complex applications. It modifies t-SNE using an adaptive and efficient trade-off between attraction and repulsion. Experimental results have shown that the method can consistently identify the intrinsic clusters. In addition, the scientists have provided a simple and fast optimization algorithm that can be easily implemented on modern parallel computing platforms. Efficient software has been developed that uses asynchronous stochastic block coordinate descent to optimize the new family of objective functions. Overall, the experiments show that the method consistently and substantially improves the visualization of data clusters compared with state-of-the-art stochastic neighbor embedding approaches.
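The trade-off between attraction and repulsion can be illustrated with a small sketch in plain Python. It is not the authors' implementation and does not include their asynchronous optimizer. It assumes the same setting as a generalized KL objective with a Student-t output kernel, holds the scale factor fixed at a hypothetical value, and uses made-up names and toy data. Under those assumptions, the gradient for one point splits into an attraction part weighted by the input similarities and a repulsion part that is scaled directly by the scale factor.

```python
import math
import random


def objective(points, p, s):
    """Embedding-dependent part of the assumed objective at a fixed scale s:
    sum over i != j of  p_ij * log(1 + d_ij^2)  +  s / (1 + d_ij^2).
    """
    total = 0.0
    n = len(points)
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                total += p.get((i, j), 0.0) * math.log(1.0 + d2) + s / (1.0 + d2)
    return total


def gradient_terms(points, p, s, k):
    """Return (attraction, repulsion), the two parts of d objective / d y_k."""
    dim = len(points[k])
    attraction = [0.0] * dim
    repulsion = [0.0] * dim
    for j in range(len(points)):
        if j == k:
            continue
        diff = [a - b for a, b in zip(points[k], points[j])]
        q = 1.0 / (1.0 + sum(d * d for d in diff))
        weight = p.get((k, j), 0.0) + p.get((j, k), 0.0)
        for c in range(dim):
            # Descending this term pulls y_k toward y_j.
            attraction[c] += 2.0 * weight * q * diff[c]
            # Descending this term pushes y_k away from y_j; it scales with s.
            repulsion[c] -= 4.0 * s * q * q * diff[c]
    return attraction, repulsion


# Toy data (hypothetical): 5 points in 2-D, random input similarities, fixed s.
rng = random.Random(1)
n = 5
points = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(n)]
raw = {(i, j): rng.random() for i in range(n) for j in range(n) if i != j}
raw_total = sum(raw.values())
p = {key: value / raw_total for key, value in raw.items()}
s = 0.05  # hypothetical fixed scale factor

attraction, repulsion = gradient_terms(points, p, s, 0)
print("attraction part for point 0:", attraction)
print("repulsion part for point 0: ", repulsion)
```

Doubling `s` in this sketch doubles the repulsion part and leaves the attraction part unchanged, which is one way to see why the choice of scale factor controls how strongly clusters are pushed apart.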

The implementation of the method is publicly available on GitHub.
