Say someone takes their French bulldog, Bowser, to the dog park. Identifying Bowser as he plays among the other dogs is easy for his owner to do on the spot.
But if that person wants to use a generative AI model like GPT-5 to monitor their pet while they are at work, the model could fail at this basic task. Vision-language models like GPT-5 often excel at recognizing generic objects, such as a dog, but they are poor at locating personalized objects, like Bowser the French bulldog.
To address this shortcoming, researchers at MIT and the MIT-IBM Watson AI Lab have introduced a new training method that teaches vision-language models to locate personalized objects in a scene.
Their method uses carefully prepared video tracking data in which the same object is tracked across multiple frames. They designed the dataset so the model must focus on contextual clues to identify a personalized object, rather than relying on knowledge it memorized during pretraining.
Given a few example images of a personalized object, such as someone's pet, the retrained model is better able to identify the location of that same pet in a new image.
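At a high level, that interaction might look something like the sketch below, in which a handful of labeled reference photos and a new image are packed into one multi-image prompt. The `vlm` object, its `generate` method, and the prompt format are hypothetical placeholders for illustration, not an API from the paper or from any particular model.

```python
# Hypothetical sketch of few-shot personalized localization at inference time.
# `vlm` and its `generate` method are placeholders, not a real model API.

def locate_pet(vlm, reference_images, new_image, name="Bowser"):
    """Ask a vision-language model to find a specific pet in a new image,
    using a few labeled reference photos as in-context examples."""
    prompt = {
        "images": reference_images + [new_image],
        "text": (
            f"The first {len(reference_images)} images show {name}. "
            f"Give the bounding box of {name} in the last image."
        ),
    }
    # The model is expected to answer with box coordinates as text.
    return vlm.generate(prompt)
```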
Models trained with this method outperformed state-of-the-art systems on this task. Importantly, the technique leaves the rest of the model's general abilities intact.
This new approach could help future AI systems track specific objects over time, such as a child's backpack, or locate objects of interest, such as a species of animal in ecological monitoring. It could also aid the development of AI-based assistive technologies that help visually impaired users find specific items in a room.
“Ultimately, we want these models to be able to learn from context, just like humans do. If a model can do this well, rather than retraining it for each new task, we could just provide a few examples and it would infer how to perform the task from that context. This is a very powerful ability,” says Jehanzeb Mirza, an MIT postdoc and senior author of a paper on this technique.
Mirza is joined on the paper by co-authors Sivan Doveh, a graduate student at the Weizmann Institute of Science; Nimrod Shabtay, a researcher at IBM Research; James Glass, senior research scientist and head of the Spoken Language Systems Group at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL); and others. The work will be presented at the International Conference on Computer Vision.
An unexpected shortcoming
Researchers have found that large language models (LLMs) can excel at learning from context. If an LLM is given a few examples of a task, such as addition problems, it can learn to answer new addition problems based on the context provided.
A vision-language model (VLM) is essentially an LLM with a visual component attached, so the MIT researchers expected it to inherit the LLM's in-context learning capabilities. But that is not the case.
“The scientific community has not yet been able to find a black-and-white answer to this particular problem. The bottleneck may be due to the fact that some visual information is lost in the process of combining the two components, but we simply don't know,” Mirza says.
The researchers set out to improve VLMs' ability to perform in-context localization, which involves finding a specific object in a new image. They focused on the data used to retrain existing VLMs for a new task – a process called fine-tuning.
Typical fine-tuning data are gathered from random sources and depict collections of everyday objects. One image might show cars parked on a street, while another might show a bouquet of flowers.
“This data isn't really consistent, so the model will never learn to recognize the same object across multiple images,” he says.
To address this issue, the researchers developed a new dataset by curating samples from existing video tracking data. These data are video clips showing the same object moving through a scene, like a tiger walking across a meadow.
They cut frames from these videos and structured the dataset so that each input consisted of multiple images showing the same object in different contexts, along with example questions and answers about its location.
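As a rough illustration, a script along the lines of the sketch below could turn tracked clips into such multi-image training samples. The clip format, field names, and question wording are assumptions made for this sketch, not the authors' released code.

```python
# Illustrative sketch: building multi-image localization samples from video
# tracking clips. The data format here is an assumption, not the paper's code.
import random

def build_samples(clips, n_context=3, frame_gap=30):
    """Each clip is assumed to be a dict with:
      - "category": the tracked object's class name, e.g. "tiger"
      - "frames":   a list of frame image paths
      - "boxes":    one (x1, y1, x2, y2) box per frame for the tracked object
    """
    samples = []
    for clip in clips:
        frames, boxes = clip["frames"], clip["boxes"]
        # Sample frames far apart so the background and pose change between them.
        idxs = list(range(0, len(frames), frame_gap))[: n_context + 1]
        if len(idxs) < n_context + 1:
            continue
        *context_idxs, query_idx = idxs
        samples.append({
            "category": clip["category"],
            # In-context examples: images of the object plus its location.
            "context": [(frames[i], boxes[i]) for i in context_idxs],
            # Query: a new frame of the same object; the answer is its box.
            "query_image": frames[query_idx],
            "question": f"Where is the {clip['category']} in the last image?",
            "answer": boxes[query_idx],
        })
    random.shuffle(samples)
    return samples
```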
“By using multiple images of the same object in different contexts, we encourage the model to consistently locate that object of interest by focusing on the context,” Mirza explains.
Forcing focus
However, the researchers found that VLMs tend to cheat. Instead of answering based on contextual clues, they identify the object using knowledge acquired during pretraining.
For example, since the model has already learned to associate images of tigers with the label “tiger,” it could identify the tiger walking through the meadow based on that pretrained knowledge, rather than inferring its identity from context.
To get around this problem, the researchers used pseudonyms in the dataset rather than the actual names of object categories. In this case, they changed the tiger's name to “Charlie.”
“It took us a while to figure out how to prevent the model from cheating. But we changed the game for the model. The model doesn't know that 'Charlie' might be a tiger, so it's forced to look at the context,” he says.
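Conceptually, this renaming step can be as simple as the sketch below, which swaps the real category name in each sample's question for an arbitrary name. The sample fields and the name list are assumptions carried over from the earlier sketch, not details from the paper.

```python
import random

# Arbitrary stand-in names; the paper's example renames a tiger to "Charlie".
PSEUDONYMS = ["Charlie", "Maya", "Pixel", "Nori"]

def pseudonymize(sample):
    """Replace the real category name in a sample's question with a pseudonym,
    so the model must rely on the in-context images rather than label priors."""
    alias = random.choice(PSEUDONYMS)
    out = dict(sample)
    out["question"] = sample["question"].replace(sample["category"], alias)
    out["alias"] = alias
    return out
```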
The researchers also faced challenges in finding the best way to prepare the data. If the frames are too close together, the background does not change enough to provide data diversity.
In the end, fine-tuning a VLM with this new dataset improved personalized localization accuracy by about 12 percent on average. When the dataset with pseudonyms was included, the gains reached 21 percent.
As model size increases, the technique yields larger performance gains.
In the future, the researchers want to investigate why VLMs do not inherit in-context learning capabilities from their base LLMs. They also plan to explore additional mechanisms for improving a VLM's performance without retraining it on new data.
“This work reframes few-shot personalized object localization – adapting on the fly to the same object in new scenes – as an instruction-tuning problem and uses video tracking sequences to teach VLMs to localize based on visual context rather than prior knowledge. It also introduces a first benchmark for this setting, with solid gains for open and proprietary VLMs. Given the critical importance of fast, instance-specific grounding – often without fine-tuning – for users of real-world workflows (such as robotics, augmented-reality assistants, creative tools, and so on), the practical, data-centric recipe proposed in this work could help spread the adoption of vision-language foundation models,” says Saurav Jha, a postdoc at the Mila-Quebec Artificial Intelligence Institute, who was not involved in this work.
Additional co-authors include Wei Lin, a research fellow at Johannes Kepler University; Eli Schwartz, a research associate at IBM Research; Hilde Kuehne, professor of computer science at the Tuebingen AI Center and affiliated professor at the MIT-IBM Watson AI Lab; Raja Giryes, an associate professor at Tel Aviv University; Rogerio Feris, a principal scientist and manager at the MIT-IBM Watson AI Lab; Leonid Karlinsky, a principal scientist at IBM Research; Assaf Arbelle, a senior researcher at IBM Research; and Shimon Ullman, the Samy and Ruth Cohn Professor of Computer Science at the Weizmann Institute of Science.
This research was funded in part by the MIT-IBM Watson AI Lab.