New research out of Russia proposes an unconventional method for detecting unrealistic AI-generated images: not by improving the accuracy of large vision-language models (LVLMs), but by deliberately exploiting their tendency to hallucinate.
The new approach uses LVLMs to extract multiple "atomic facts" about an image, and then applies Natural Language Inference (NLI) to systematically measure contradictions among these statements, effectively turning the model's flaws into a diagnostic tool for detecting images that defy common sense.
Two images from the WHOOPS! dataset alongside statements automatically generated by an LVLM. The left image is realistic, yielding consistent descriptions, while the unusual right image causes the model to hallucinate, producing contradictory or false statements. Source: https://arxiv.org/pdf/2503.15948
Asked to assess the realism of the second image, the LVLM can see that something is wrong, because the depicted camel has three humps, which is unknown in nature.
However, the LVLM initially conflates >2 humps with >2 animals, since this is the only way one could ever see three humps in a single "camel picture". It then proceeds to hallucinate something even more unlikely than three humps (i.e., "two heads"), and never directly describes the detail that seems to have triggered its suspicion: the impossible extra hump.
The researchers of the new work found that LVLM models can perform this kind of assessment natively, and on a par with (or better than) models that have been fine-tuned for the task. Since fine-tuning is complicated, expensive and rather brittle in terms of downstream applicability, the discovery of a native use for one of the biggest roadblocks in the current AI revolution is a refreshing twist on the general trends in the literature.
Open Assessment
The significance of this approach, the authors assert, is that it can be implemented with open source frameworks. While an advanced, high-investment model such as ChatGPT may (the paper concedes) potentially offer better results in this task, the real value of the literature for most of us (and especially for the hobbyist and VFX communities) lies in the possibility of incorporating and developing new breakthroughs in local implementations; conversely, anything destined for a proprietary commercial API system is subject to withdrawal, arbitrary price increases, and censorship policies that more often reflect the company's corporate concerns than the user's needs and responsibilities.
The new paper is titled Don't Fight Hallucinations, Use Them: Estimating Image Realism with NLI over Atomic Facts, and comes from five researchers across the Skolkovo Institute of Science and Technology (Skoltech), the Moscow Institute of Physics and Technology, and the Russian companies MTS AI and AIRI. The work has an accompanying GitHub page.
Method
The authors use the Israeli/US WHOOPS! dataset for the project:

Examples of impossible images from the WHOOPS! dataset. It is notable how these images combine plausible elements, and that their improbability must be computed from the combination of these incompatible facets. Source: https://whoops-benchmark.github.io/
The dataset comprises 500 synthetic images and over 10,874 annotations, specifically designed to test AI models' commonsense reasoning and compositional understanding. It was created in collaboration with designers tasked with generating challenging images via text-to-image systems such as Midjourney and the DALL-E series, depicting scenarios difficult or impossible to capture naturally:

Further examples from the WHOOPS! dataset. Source: https://huggingface.co/datasets/nlphuji/whoops
The new approach works in three stages: first, the LVLM (specifically LLaVA-v1.6-Mistral-7B) is prompted to generate multiple simple statements, called "atomic facts", describing the image. These statements are generated with Diverse Beam Search, ensuring variability in the results.

Diverse Beam Search produces a better variety of caption options by optimizing for a diversity-augmented objective. Source: https://arxiv.org/pdf/1610.02424
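To illustrate the mechanism (this is a toy sketch, not the authors' implementation or the real LLaVA decoder), here is a minimal group-wise diverse beam search over a hand-written next-token table; the table, penalty value, and function names are all hypothetical:

```python
import math

# Toy next-token distribution standing in for the LVLM decoder
# (hand-written, hypothetical values -- not from the paper).
NEXT = {
    (): {"a": 0.5, "the": 0.4, "one": 0.1},
    ("a",): {"camel": 0.6, "desert": 0.4},
    ("the",): {"camel": 0.7, "sand": 0.3},
    ("one",): {"animal": 1.0},
}

def diverse_beam_search(num_groups=3, group_width=1, steps=2, penalty=2.0):
    """Group-wise beam search: each group penalizes tokens already chosen
    by earlier groups at the same decoding step, pushing the groups apart."""
    groups = [[((), 0.0)] for _ in range(num_groups)]  # (prefix, log-score)
    for _ in range(steps):
        used_at_step = []  # tokens picked by earlier groups at this step
        new_groups = []
        for g in range(num_groups):
            candidates = []
            for prefix, score in groups[g]:
                for tok, p in NEXT.get(prefix, {}).items():
                    diversity = penalty * used_at_step.count(tok)
                    candidates.append((prefix + (tok,), score + math.log(p) - diversity))
            candidates.sort(key=lambda c: c[1], reverse=True)
            new_groups.append(candidates[:group_width])
            used_at_step.extend(pref[-1] for pref, _ in new_groups[g])
        groups = new_groups
    return [" ".join(pref) for grp in groups for pref, _ in grp]
```

With `penalty=0.0` all three groups collapse to the single most likely caption; with the penalty enabled, the groups return distinct statements, which is what gives the later NLI stage something to compare.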
Next, each generated statement is systematically compared to every other statement using a Natural Language Inference model, which assigns scores reflecting whether pairs of statements entail, contradict, or are neutral towards each other.
Contradictions indicate hallucinations or unrealistic elements in the image:

Detection pipeline diagram.
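As a rough sketch of this pairwise stage, assuming a hard-coded stub in place of a real NLI model and hypothetical atomic facts (in the actual pipeline both come from the models described above):

```python
from itertools import combinations

# Hypothetical atomic facts for an impossible image; in the actual
# pipeline these are generated by the LVLM.
FACTS = [
    "The camel has two humps.",
    "The camel has three humps.",
    "The camel is standing in a desert.",
]

# Stub contradiction probabilities for each unordered pair of facts.
# A real system would obtain these by running each pair through an
# NLI model and reading off its contradiction probability.
STUB_CONTRADICTION = {
    frozenset({0, 1}): 0.95,  # two humps vs. three humps: direct clash
    frozenset({0, 2}): 0.05,
    frozenset({1, 2}): 0.04,
}

def pairwise_contradictions(facts, scorer):
    """Score every unordered pair of facts; high values flag hallucination."""
    return {
        (i, j): scorer(i, j)
        for i, j in combinations(range(len(facts)), 2)
    }

scores = pairwise_contradictions(
    FACTS, lambda i, j: STUB_CONTRADICTION[frozenset({i, j})]
)
```

Any pair scoring near 1.0 marks a contradiction; a realistic image should produce a set of facts whose pairwise scores all stay low.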
Finally, the method aggregates these pairwise NLI scores into a single "reality score" that quantifies the overall coherence of the generated statements.
The researchers explored various aggregation methods, with a clustering-based approach performing best. The authors applied the k-means clustering algorithm to separate the individual NLI scores into two clusters, and the centroid of the lower-valued cluster was then chosen as the final metric.
Using two clusters directly corresponds to the binary nature of the classification task, i.e., distinguishing realistic from unrealistic images. The logic is similar to simply picking the lowest score; however, clustering allows the metric to represent the average contradiction across many facts, rather than relying on a single outlying value.
Data and tests
The researchers tested their system against the WHOOPS! baseline benchmark, using rotating test splits (i.e., cross-validation). The models tested were BLIP2 FlanT5-XL and BLIP2 FlanT5-XXL in the fine-tuned setting, and BLIP2 FlanT5-XXL in a zero-shot format (i.e., without additional training).
For the instruction-based baseline, the authors prompted the LVLMs with the phrase "Is this unusual? Explain in a short sentence", which prior research has found effective for spotting unrealistic images.
The models evaluated were LLaVA 1.6 Mistral 7B, LLaVA 1.6 Vicuna 13B, and InstructBLIP at two sizes (7 and 13 billion parameters).
The test procedure centered on 102 pairs of realistic and unrealistic ("weird") images. Each pair consisted of one normal image and one commonsense-defying counterpart.
Three human annotators labeled the images, reaching a consensus of 92%, indicating strong human agreement on what constitutes "weirdness". The accuracy of the assessment methods was measured by their ability to correctly distinguish between realistic and unrealistic images.
The system was evaluated using three-fold cross-validation, randomly shuffling the data with a fixed seed. The authors adjusted the weights for entailment scores (statements that logically agree) and contradiction scores (statements that logically conflict) during training, while "neutral" scores were set to zero. The final accuracy was computed as the average across all test splits.
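The evaluation loop can be sketched as follows; `train_and_eval` is a hypothetical callback standing in for fitting the score weights on the training folds and measuring accuracy on the held-out fold:

```python
import random

def three_fold_cv(n_items, train_and_eval, seed=42):
    """Three-fold cross-validation: shuffle indices with a fixed seed,
    hold out each fold in turn, and average the resulting accuracies."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::3] for k in range(3)]
    accuracies = []
    for k in range(3):
        test_idx = folds[k]
        train_idx = [i for j in range(3) if j != k for i in folds[j]]
        accuracies.append(train_and_eval(train_idx, test_idx))
    return sum(accuracies) / 3
```

Because the shuffle is seeded, repeated runs produce identical splits, making the reported average reproducible.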

Comparison of different NLI models and aggregation methods on a subset of five generated facts, measured by accuracy.
Regarding the initial results shown above, the paper states:
"The clustering-based method ('clust') stands out as one of the best-performing. This implies that aggregating all contradiction scores is crucial, rather than focusing only on extreme values. In addition, the largest NLI model (nli-deberta-v3-large) outperforms the others across all aggregation methods, suggesting that it captures the essence of the problem more effectively."
The authors found that the optimal weights consistently favored contradiction over entailment, indicating that contradictions were more informative for distinguishing unrealistic images. Their method outperformed all other zero-shot methods tested, coming close to the performance of the fine-tuned BLIP2 model:

The performance of various approaches on the WHOOPS! benchmark. Fine-tuned (ft) methods appear at the top, while zero-shot (zs) methods are listed below. Model size indicates the number of parameters, and accuracy is used as the evaluation metric.
They also noted, somewhat unexpectedly, that InstructBLIP performed better than comparable LLaVA models given the same prompt. While acknowledging GPT-4o's superior accuracy, the paper emphasizes the authors' preference for demonstrating practical open source solutions, and, it seems, the novelty may reasonably be claimed to lie in the explicit use of hallucinations as a diagnostic tool.
Conclusion
However, the authors acknowledge the new work's debt to the 2024 FaithScore project, a collaboration between the University of Texas at Dallas and Johns Hopkins University.

Illustration of FaithScore evaluation. First, descriptive statements are identified in an LVLM-generated answer. Next, these statements are broken down into individual atomic facts. Finally, the atomic facts are compared against the input image to verify their accuracy. Highlighted text denotes objective descriptive content, while blue text indicates hallucinated statements, allowing FaithScore to deliver an interpretable measure of factual correctness. Source: https://arxiv.org/pdf/2311.01477
FaithScore measures the faithfulness of LVLM-generated descriptions by verifying their consistency against the image content, whereas the new paper's method explicitly exploits LVLM hallucinations, detecting unrealistic images through contradictions among the generated facts, as measured by Natural Language Inference.
The new work is, naturally, dependent on the quirks of current language models and their propensity to hallucinate. Should model development ever produce an entirely non-hallucinating model, even the general principles of the new work would no longer be applicable; for now, however, that remains a remote prospect.