Imagine watching a video in which someone slams a door, and the artificial intelligence behind the scenes immediately connects the exact moment of that sound with the visual of the door closing, without ever being told what a door is. That is the future being built by researchers from MIT and international collaborators, thanks to a breakthrough in machine learning that mimics how people intuitively connect vision and sound.
A team of scientists has introduced CAV-MAE Sync, an improved AI model that learns fine-grained connections between audio and visual data, all without human-provided labels. Potential applications range from video editing and content curation to smarter robots that better understand real-world environments.
According to Andrew Rouditchenko, an MIT PhD student and co-author of the study, humans naturally process the world using both sight and sound, so the team wants AI to do the same. By integrating this kind of audiovisual understanding with tools such as large language models, they could unlock entirely new types of AI applications.
The work builds on the team's earlier CAV-MAE model, which can process and align visual and audio data from videos. That system learned by encoding unlabeled video clips into representations called tokens and automatically matching the corresponding audio and video signals.
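To make the tokenization idea concrete, here is a minimal sketch (not the authors' code) of how a spectrogram or a video frame can be split into patches and projected into token embeddings. The module name, patch size, and dimensions are assumptions chosen purely for illustration.

```python
# Minimal sketch of patch-based tokenization for audio (spectrograms) and video
# frames; all names and sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    """Splits a 2D input (a spectrogram or an image frame) into patches and
    projects each patch to an embedding vector, i.e. a 'token'."""
    def __init__(self, in_channels: int, patch_size: int, embed_dim: int):
        super().__init__()
        # A strided convolution is the standard trick for patch embedding.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width) -> tokens: (batch, n_patches, embed_dim)
        patches = self.proj(x)                      # (B, D, H/P, W/P)
        return patches.flatten(2).transpose(1, 2)   # (B, N, D)

# Illustrative usage: a log-mel spectrogram and a single RGB frame become token
# sequences that a shared transformer could later align.
audio_tok = PatchTokenizer(in_channels=1, patch_size=16, embed_dim=768)
video_tok = PatchTokenizer(in_channels=3, patch_size=16, embed_dim=768)
spectrogram = torch.randn(2, 1, 128, 1024)   # (batch, 1, mel bins, time steps)
frame = torch.randn(2, 3, 224, 224)          # (batch, 3, height, width)
print(audio_tok(spectrogram).shape, video_tok(frame).shape)
```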
However, the original model lacked precision: it treated long audio and video segments as a single unit, even if a specific sound, like a dog's bark or a door slamming, occurred only briefly.
The new model, CAV-MAE Sync, fixes this by splitting the audio into smaller windows and mapping each one to a specific video frame. This fine-grained alignment lets the model associate a single frame with the sound occurring at that exact moment, significantly improving accuracy.
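As a rough illustration of this fine-grained pairing, the snippet below cuts an audio track into short windows and assigns each window to the video frame nearest its center. The window length, frame rate, and function name are assumptions for illustration, not the paper's exact scheme.

```python
# Sketch: pair short audio windows with the nearest video frame in time.
def pair_audio_windows_with_frames(n_audio_samples: int, sample_rate: int,
                                   n_frames: int, frame_rate: float,
                                   window_sec: float = 0.5):
    """Return a list of (window_start_sec, window_end_sec, frame_index) pairs."""
    window_len = int(window_sec * sample_rate)
    pairs = []
    for start in range(0, n_audio_samples - window_len + 1, window_len):
        centre_sec = (start + window_len / 2) / sample_rate
        frame_idx = min(int(round(centre_sec * frame_rate)), n_frames - 1)
        pairs.append((start / sample_rate,
                      (start + window_len) / sample_rate,
                      frame_idx))
    return pairs

# 10 seconds of 16 kHz audio paired against a 10-second video at 4 frames/sec.
for win_start, win_end, frame in pair_audio_windows_with_frames(
        n_audio_samples=10 * 16000, sample_rate=16000,
        n_frames=40, frame_rate=4.0):
    print(f"audio {win_start:4.1f}-{win_end:4.1f}s  ->  frame {frame}")
```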
In effect, the researchers give the model a more detailed view of time. That makes a big difference in real-world tasks, such as searching for the right video clip based on a sound.
CAV-MAE Sync uses a dual learning strategy to balance two objectives (a rough sketch of how the two losses can be combined follows this list):
- A contrastive learning task that helps the model distinguish matching audiovisual pairs from mismatched ones.
- A reconstruction task, in which the AI learns to retrieve specific content, such as finding a video based on an audio query.
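The sketch below shows one common way such a dual objective can be expressed: an InfoNCE-style contrastive loss plus a masked-reconstruction loss. The temperature, masking ratio, and loss weighting here are assumptions, not the settings used by CAV-MAE Sync.

```python
# Hedged sketch of a combined contrastive + reconstruction objective.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """InfoNCE-style loss: matching audio/video pairs (same row index) should
    score higher than every mismatched pair in the batch."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    logits = audio_emb @ video_emb.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(audio_emb.size(0))             # diagonal = true pairs
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def reconstruction_loss(predicted_patches, original_patches, mask):
    """Masked-autoencoder-style loss: predict the content of masked patches."""
    return F.mse_loss(predicted_patches[mask], original_patches[mask])

# Toy tensors standing in for model outputs on a batch of 4 clips.
a, v = torch.randn(4, 256), torch.randn(4, 256)
pred, orig = torch.randn(4, 196, 768), torch.randn(4, 196, 768)
mask = torch.rand(4, 196) < 0.75                          # 75% of patches masked
total = contrastive_loss(a, v) + 1.0 * reconstruction_loss(pred, orig, mask)
print(total.item())
```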
To support these goals, the researchers introduced special "global tokens" to improve contrastive learning and "register tokens" that help the model focus on the fine details needed for reconstruction. This "wiggle room" lets the model perform both tasks more effectively.
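The following is a minimal, hypothetical sketch of how learnable global and register tokens could be prepended to the patch tokens before a transformer encoder. The token counts and encoder configuration are assumptions made for illustration, not the model's actual architecture.

```python
# Sketch: learnable global/register tokens concatenated with patch tokens.
import torch
import torch.nn as nn

class TokenAugmentedEncoder(nn.Module):
    def __init__(self, embed_dim=768, n_global=1, n_register=4, depth=2, n_heads=8):
        super().__init__()
        # Learnable extra tokens: global tokens serve the contrastive objective,
        # register tokens give the reconstruction pathway extra capacity.
        self.global_tokens = nn.Parameter(torch.zeros(1, n_global, embed_dim))
        self.register_tokens = nn.Parameter(torch.zeros(1, n_register, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patch_tokens: torch.Tensor):
        b = patch_tokens.size(0)
        extra = torch.cat([self.global_tokens.expand(b, -1, -1),
                           self.register_tokens.expand(b, -1, -1)], dim=1)
        x = self.encoder(torch.cat([extra, patch_tokens], dim=1))
        n_extra = extra.size(1)
        # Global token output feeds a contrastive head; patch outputs feed
        # the reconstruction decoder.
        return x[:, :1], x[:, n_extra:]

model = TokenAugmentedEncoder()
global_out, patch_out = model(torch.randn(2, 196, 768))
print(global_out.shape, patch_out.shape)   # (2, 1, 768) (2, 196, 768)
```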
The results speak for themselves: CAV-MAE Sync outperforms previous models, including more complex, data-hungry systems, at video retrieval and audiovisual classification. It can identify actions such as an instrument being played or a pet making a sound with remarkable precision.
Looking ahead, the team hopes to improve the model further by integrating even more advanced data-representation techniques. They are also exploring the addition of text inputs, which could pave the way toward a truly multimodal AI system, one that sees, hears, and reads.
Ultimately, this kind of technology could play a key role in developing intelligent assistants, enhancing accessibility tools, and even powering robots that interact with people and their environments in more natural ways.