AI learns how vision and sound are connected, without human intervention | MIT News

Humans naturally learn to make connections between sight and sound. For example, we can watch someone playing a cello and recognize that the cellist's movements are generating the music we hear.

A new approach developed by researchers from MIT and elsewhere improves an AI model's ability to learn in this same fashion. This could be useful in applications such as journalism and film production, where the model could help curate multimodal content through automatic video and audio retrieval.

In the longer term, this work could be used to improve robots' ability to understand real-world environments, where auditory and visual information are often closely connected.

Building on prior work from their group, the researchers created a method that helps machine-learning models align corresponding audio and visual data from video clips without the need for human labels.

They adjusted how their original model is trained so that it learns a finer-grained correspondence between a particular video frame and the audio that occurs in that moment. The researchers also made some architectural tweaks that help the system balance two distinct learning objectives, which improves performance.

Taken together, these relatively simple improvements boost the accuracy of their approach in video-retrieval tasks and in classifying the action in audiovisual scenes. For instance, the new method can automatically and precisely match the sound of a door slamming with the visual of it closing in a video clip.

“We are building AI systems that can process the world like humans do, in terms of having audio and visual information coming in at once and being able to seamlessly process both modalities. Looking forward, if we can integrate this audio-visual technology into some of the tools we use on a daily basis, like large language models, it could open up a lot of new applications,” says Andrew Rouditchenko, an MIT graduate student and co-author of a paper on this research.

He is joined on the paper by lead author Edson Araujo, a graduate student at Goethe University in Germany; Yuan Gong, a former MIT postdoc; Saurabhchand Bhati, a current MIT postdoc; Samuel Thomas, Brian Kingsbury, and Leonid Karlinsky of IBM Research; Rogerio Feris, principal scientist and manager at the MIT-IBM Watson AI Lab; James Glass, senior research scientist and head of the Spoken Language Systems Group in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL); and senior author Hilde Kuehne, professor of computer science at Goethe University and an affiliated professor at the MIT-IBM Watson AI Lab. The work will be presented at the Conference on Computer Vision and Pattern Recognition.

Syncing up

This work builds on a machine-learning method the researchers developed a few years ago, which provided an efficient way to train a multimodal model to simultaneously process audio and visual data without the need for human labels.

The researchers feed this model, called CAV-MAE, unlabeled video clips, and it encodes the visual and audio data separately into representations called tokens. Using the natural audio from the recording, the model automatically learns to map corresponding pairs of audio and visual tokens close together within its internal representation space.
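As a rough illustration of this idea (a minimal sketch under assumed details, not the authors' released code), a contrastive loss can pull each clip's audio and visual embeddings together while pushing apart embeddings from different clips in the same batch:

```python
# Illustrative sketch: contrastive pairing of separately encoded audio and
# visual clips, using in-batch negatives. Names and shapes are assumptions.
import torch
import torch.nn.functional as F

def contrastive_pairing_loss(audio_emb, video_emb, temperature=0.07):
    """audio_emb, video_emb: (batch, dim) pooled token representations from
    separate audio and visual encoders; matching indices come from the same
    video clip (the natural pairing)."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature          # similarity of every audio-video pair
    targets = torch.arange(a.size(0), device=a.device)
    # Pull each clip's own audio and visuals together, push the rest apart.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```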

They found that using two learning objectives balances the model's learning process, which enables CAV-MAE to understand the corresponding audio and visual data while improving its ability to retrieve video clips that match user queries.

But CAV-MAE treats audio and visual samples as one unit, so a 10-second video clip and the sound of a door slamming are mapped together, even if that audio event occurs in just one second of the video.

In their improved model, called CAV-MAE Sync, the researchers split the audio into smaller windows before the model computes its representations of the data, so it generates separate representations that correspond to each smaller window of audio.

During training, the model learns to associate one video frame with the audio that occurs during just that frame.

“By doing that, the model learns a finer-grained correspondence, which helps with performance later when we aggregate this information,” Araujo says.
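A simple way to picture this step (a hypothetical sketch, not the paper's exact preprocessing) is to chop a clip's audio features into temporal windows and pair each window with the video frame sampled at the same point in time, so the contrastive loss can operate on (frame, window) pairs instead of whole clips:

```python
# Hypothetical alignment helper; the real pipeline's windowing and frame
# sampling details are assumptions here.
import torch

def make_fine_grained_pairs(spectrogram, frames, num_windows):
    """spectrogram: (time_bins, freq_bins) audio features for one clip.
    frames: (num_frames, C, H, W) video frames sampled uniformly in time.
    Returns a list of (audio_window, frame) pairs aligned in time."""
    windows = torch.chunk(spectrogram, num_windows, dim=0)  # split along time
    n = len(windows)
    pairs = []
    for i, window in enumerate(windows):
        # Map window i to the frame whose timestamp falls inside that window.
        frame_idx = min(int(i * len(frames) / n), len(frames) - 1)
        pairs.append((window, frames[frame_idx]))
    return pairs
```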

They also incorporated architectural improvements that help the model balance its two learning objectives.

Adding “Wiggle Room”

The model incorporates a contrastive objective, where it learns to associate similar audio and visual data, and a reconstruction objective, which aims to recover specific audio and visual data based on user queries.
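A minimal sketch of how such objectives might be combined, assuming a masked-autoencoder-style reconstruction term and a simple weighted sum (the hypothetical names and weighting are not from the paper):

```python
import torch.nn.functional as F

def joint_objective(pred_patches, true_patches, mask, contrastive_loss,
                    recon_weight=1.0):
    """pred_patches, true_patches: (B, N, patch_dim) predicted and original
    audio/visual patches; mask: boolean (B, N) selecting masked positions;
    contrastive_loss: scalar from a pairing loss like the one sketched above."""
    # Reconstruction: recover the masked audio and visual patches.
    recon = F.mse_loss(pred_patches[mask], true_patches[mask])
    # Balance pulling matched pairs together against reconstructing the inputs.
    return contrastive_loss + recon_weight * recon
```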

In CAV-MAE Sync, the researchers introduced two new types of data representations, or tokens, to improve the model's learning ability.

These include dedicated “global tokens” that help with the contrastive learning objective and dedicated “register tokens” that help the model focus on important details for the reconstruction objective.

“Essentially, we add a bit more wiggle room to the model so it can perform each of these two tasks, contrastive and reconstructive, a bit more independently. That benefited overall performance,” Araujo adds.
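One way to sketch the general idea (assumed details, not the paper's exact design) is to prepend learnable global and register tokens to the patch-token sequence before it enters a transformer encoder, then route the extra tokens to the two objectives:

```python
# Hypothetical encoder wrapper showing learnable "global" and "register"
# tokens added to the input sequence; dimensions and depths are placeholders.
import torch
import torch.nn as nn

class TokenAugmentedEncoder(nn.Module):
    def __init__(self, dim=768, num_global=1, num_register=4, depth=2, heads=8):
        super().__init__()
        self.global_tokens = nn.Parameter(torch.randn(1, num_global, dim) * 0.02)
        self.register_tokens = nn.Parameter(torch.randn(1, num_register, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, patch_tokens):              # patch_tokens: (B, N, dim)
        b = patch_tokens.size(0)
        g = self.global_tokens.expand(b, -1, -1)
        r = self.register_tokens.expand(b, -1, -1)
        x = self.encoder(torch.cat([g, r, patch_tokens], dim=1))
        ng, nr = g.size(1), r.size(1)
        # Global tokens feed the contrastive objective, register tokens give
        # the reconstruction objective extra capacity, and the patch tokens
        # are passed on for decoding.
        return x[:, :ng], x[:, ng:ng + nr], x[:, ng + nr:]
```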

While the researchers had some intuition that these enhancements would improve the performance of CAV-MAE Sync, it took a careful combination of strategies to shift the model in the direction they wanted it to go.

“Because we have multiple modalities, we need a good model for both modalities on their own, but we also need to get them to connect and collaborate,” Rouditchenko says.

In the end, their enhancements improved the model's ability to retrieve videos based on an audio query and to predict the class of an audiovisual scene, like a dog barking or an instrument playing.

Its results were more accurate than their prior work, and it also performed better than more complex, state-of-the-art methods that require larger amounts of training data.

“Sometimes, very simple ideas or small patterns you see in the data have big value when applied on top of a model you are working on,” Araujo says.

In the future, the researchers want to incorporate new models that generate better data representations into CAV-MAE Sync, which could improve performance. They also want to enable their system to handle text data, which would be an important step toward generating an audiovisual large language model.

This work is funded, in part, by the German Federal Ministry of Education and Research and the MIT-IBM Watson AI Lab.
