A new way of editing or generating images | MIT News

AI image generation, which relies on neural networks to create new images from a variety of inputs, including text prompts, is projected to become a billion-dollar industry by the end of this decade. Even with today's technology, if you wanted a fanciful picture of, say, a friend planting a flag on Mars or heedlessly flying into a black hole, it could take less than a second. However, before they can perform such tasks, image generators are typically trained on massive datasets containing millions of images that are often paired with associated text. Training these generative models can be an arduous undertaking that takes weeks or months, consuming enormous computational resources in the process.

But what if it were possible to generate images through AI methods without using a generator at all? That real possibility, along with other intriguing ideas, was described in a research paper presented at the International Conference on Machine Learning (ICML 2025), held in Vancouver, British Columbia, this summer. The paper, which describes new techniques for manipulating and generating images, was written by Lukas Lao Beyer, a graduate student in MIT's Laboratory for Information and Decision Systems (LIDS); Tianhong Li, a postdoc at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL); Xinlei Chen of Facebook AI Research; Sertac Karaman, an MIT professor of aeronautics and astronautics and director of LIDS; and Kaiming He, an MIT associate professor of electrical engineering and computer science.

The group's efforts had their beginnings in a class project for a graduate seminar on deep generative models that Lao Beyer took last fall. In conversations during the semester, it became apparent to both Lao Beyer and He, who taught the seminar, that this research had real potential going far beyond the confines of a typical homework assignment. Other collaborators were soon brought into the effort.

The starting point for Lao Beyer's inquiry was a June 2024 paper, written by researchers from the Technical University of Munich and the Chinese company ByteDance, which introduced a new way of representing visual information called a one-dimensional tokenizer. With this device, which is also a kind of neural network, a 256×256-pixel image can be translated into a sequence of just 32 numbers, called tokens. "I wanted to understand how such a high level of compression could be achieved, and what the tokens themselves actually represented," says Lao Beyer.

The previous generation of tokenizers would typically break up the same image into a 16×16 array of tokens, with each token encapsulating information, in highly condensed form, corresponding to a specific portion of the original image. The new 1D tokenizers can encode an image more efficiently, using far fewer tokens overall, and these tokens are able to capture information about the entire image, not just a single quadrant. What's more, each of these tokens is a 12-digit number consisting of 1s and 0s, allowing for 2¹² (or roughly 4,000) possibilities. "It's like a vocabulary of 4,000 words that makes up an abstract, hidden language spoken by the computer," He explains. "It's not like a human language, but we can still try to find out what it means."
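To make those numbers concrete, here is a minimal, hypothetical sketch of the interface such a 1D tokenizer presents: a 256×256 image goes in, 32 integer tokens drawn from a 4,096-entry codebook come out, and a decoder maps the tokens back to an image. The untrained PyTorch module below only mimics the shapes involved (the pooling step, layer choices, and the embedding size EMBED_DIM are assumptions made for illustration); the actual tokenizer from the June 2024 paper is a trained neural network.

```python
# Toy stand-in for a 1D tokenizer/detokenizer pair: 256x256 RGB image <-> 32
# discrete tokens, each an index into a codebook of 2**12 = 4,096 entries.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_TOKENS = 32          # tokens per image, as described in the article
CODEBOOK_SIZE = 2 ** 12  # 4,096 possible values per token (a 12-bit "word")
EMBED_DIM = 16           # size of each codebook vector (arbitrary here)

class Toy1DTokenizer(nn.Module):
    def __init__(self):
        super().__init__()
        # Cheap untrained stand-ins for the real encoder/decoder networks.
        self.pool = nn.AdaptiveAvgPool2d(16)                       # 256x256 -> 16x16
        self.to_latent = nn.Linear(3 * 16 * 16, NUM_TOKENS * EMBED_DIM)
        self.codebook = nn.Embedding(CODEBOOK_SIZE, EMBED_DIM)
        self.from_latent = nn.Linear(NUM_TOKENS * EMBED_DIM, 3 * 16 * 16)

    def encode(self, image):                          # image: (B, 3, 256, 256)
        z = self.to_latent(self.pool(image).flatten(1))
        z = z.view(-1, NUM_TOKENS, EMBED_DIM)         # (B, 32, EMBED_DIM)
        # Quantize: each latent vector becomes the index of its nearest
        # codebook entry -- 32 integers, each in [0, 4096).
        dists = torch.cdist(z, self.codebook.weight.unsqueeze(0))
        return dists.argmin(dim=-1)                   # (B, 32)

    def decode(self, tokens):                         # tokens: (B, 32) integers
        z = self.codebook(tokens).flatten(1)          # look up codebook vectors
        small = self.from_latent(z).view(-1, 3, 16, 16)
        return F.interpolate(small, size=(256, 256))  # back to a 256x256 image

tokenizer = Toy1DTokenizer()
image = torch.rand(1, 3, 256, 256)
tokens = tokenizer.encode(image)     # a sequence of just 32 numbers
recon = tokenizer.decode(tokens)     # image rebuilt from those 32 tokens
print(tokens.shape, recon.shape)     # torch.Size([1, 32]) torch.Size([1, 3, 256, 256])
```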

This is what Lao Beyer initially set out to explore, work that provided the seeds of the ICML 2025 paper. The approach he took was fairly simple. If you want to find out what a particular token does, Lao Beyer says, "you can just take it out, swap in a random value, and see whether there is a recognizable change in the output." He found that replacing one token changes the image quality, turning a low-resolution image into a high-resolution one, or vice versa. Another token affected the blurriness of the background, while another still influenced the clarity. He also found a token related to "pose," meaning that, in an image of a robin, for instance, the bird's head might shift from right to left.
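In code, the probe Lao Beyer describes amounts to a one-line edit of the token sequence followed by a second decode. The snippet below, which reuses the Toy1DTokenizer sketch from above, shows only the mechanics; with the real trained tokenizer, the difference between the two reconstructions is where changes in resolution, blur, or pose would show up.

```python
# Probe what a single token controls: replace it with a random value,
# decode again, and compare the two reconstructions.
# (Reuses Toy1DTokenizer, NUM_TOKENS, CODEBOOK_SIZE from the sketch above.)
import torch

tok = Toy1DTokenizer()
image = torch.rand(1, 3, 256, 256)

tokens = tok.encode(image)
original = tok.decode(tokens)

edited = tokens.clone()
edited[0, 7] = torch.randint(0, CODEBOOK_SIZE, (1,)).item()  # perturb one token
perturbed = tok.decode(edited)

# With a trained tokenizer, this difference is where the visible change
# (quality, background blur, pose, ...) would appear.
print((perturbed - original).abs().mean())
```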

"This had never been seen before, since no one had observed visually identifiable changes from manipulating tokens," Lao Beyer says. The discovery raised the possibility of a new approach to image editing. In fact, the MIT group has shown how this process can be streamlined and automated, so that the tokens don't have to be modified by hand, one at a time.

He and his colleagues achieved an even more consequential result involving image generation. A system capable of generating images normally requires a tokenizer, which compresses and encodes visual data, along with a generator that can combine and arrange these compact representations in order to create new images. The MIT researchers found a way to create images without using a generator at all. Their new approach makes use of the 1D tokenizer and a so-called detokenizer (also known as a decoder), which can reconstruct an image from the tokens. However, with guidance provided by an off-the-shelf neural network called CLIP, which cannot generate images on its own but can measure how well a given image matches a given text prompt, the team was able to convert an image of a red panda, for example, into a tiger. In addition, they could create images of a tiger, or any other desired form, starting completely from scratch, from a situation in which all the tokens are initially assigned random values (and then iteratively tweaked so that the reconstructed image increasingly matches the desired text prompt).
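One way to picture this generator-free loop is as a search over the 32 tokens: start from random values, decode them, score the result against the text prompt, and keep any single-token change that improves the score. The sketch below follows that recipe using the toy tokenizer from earlier and a fixed random projection (fake_clip_score) standing in for CLIP's image-text similarity; it illustrates the idea of the loop, not the optimizer or the scoring the authors actually use.

```python
# Sketch of generator-free image synthesis as a search over tokens
# (illustrative only). Reuses Toy1DTokenizer, NUM_TOKENS and CODEBOOK_SIZE
# from the earlier sketch.
import torch

detok = Toy1DTokenizer()   # only its decode() (the "detokenizer") is needed

def optimize_tokens(score_fn, steps=200):
    """Start from random tokens and keep any single-token change that makes
    the decoded image score higher under score_fn."""
    tokens = torch.randint(0, CODEBOOK_SIZE, (1, NUM_TOKENS))
    best = score_fn(detok.decode(tokens))
    for _ in range(steps):
        i = torch.randint(0, NUM_TOKENS, (1,)).item()
        cand = tokens.clone()
        cand[0, i] = torch.randint(0, CODEBOOK_SIZE, (1,)).item()
        score = score_fn(detok.decode(cand))
        if score > best:
            tokens, best = cand, score
    return tokens

# Stand-in for CLIP: a fixed random projection. In the real pipeline this
# would be CLIP's similarity between the decoded image and a prompt such as
# "a photo of a tiger".
probe = torch.randn(3 * 256 * 256)
def fake_clip_score(image):
    return float(image.flatten() @ probe)

tokens = optimize_tokens(fake_clip_score)
generated = detok.decode(tokens)   # an image produced with no generator network
```

With the real CLIP model and a trained detokenizer, this same kind of token-level search is what lets a text prompt reshape an existing image or build one from scratch.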

The group showed that with this same setup, relying on a tokenizer and detokenizer but no generator, they could also do "inpainting," meaning filling in parts of images that had somehow been erased. Avoiding the use of a generator for certain tasks could lead to a significant reduction in computational costs because generators, as mentioned, normally require extensive training.
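Inpainting fits the same template with a different objective: instead of a text-matching score, the search rewards agreement with the pixels that are still visible, and the detokenizer supplies the rest. A compact, purely illustrative sketch, reusing detok and optimize_tokens from above (the mask location is arbitrary):

```python
# Inpainting without a generator: find tokens whose decoded image agrees
# with the surviving pixels, and let the detokenizer fill in the hole.
import torch

target = torch.rand(1, 3, 256, 256)       # stand-in for the damaged image
mask = torch.ones_like(target)
mask[:, :, 96:160, 96:160] = 0            # a 64x64 region that was "erased"

def visible_match(image):
    # Higher is better: negative reconstruction error on visible pixels only.
    return -float(((image - target) * mask).pow(2).mean())

tokens = optimize_tokens(visible_match)
filled = detok.decode(tokens)             # decoder output supplies the missing patch
```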

What may seem odd about this team's contribution, He explains, is that "we didn't invent anything new. We didn't invent the 1D tokenizer, and we didn't invent the CLIP model, either. But we did discover that new capabilities can arise when you put all these pieces together."

"This work redefines the role of tokenizers," comments Saining Xie, a computer scientist at New York University. "It shows that image tokenizers, tools usually used just to compress images, can actually do a lot more. The fact that a simple (but highly compressed) 1D tokenizer can handle tasks like inpainting or text-guided editing, without the need to train a full-blown generative model, is pretty surprising."

Zhuang Liu of Princeton University agrees, saying that the work of the MIT group "shows that we can generate and manipulate images in a much easier way than before."

Karaman suggests that there could be many applications outside the field of computer vision. "For example, we could consider tokenizing the actions of robots or self-driving cars in the same way, which could quickly broaden the impact of this work."

Lao Beyer is thinking along similar lines, noting that the extreme amount of compression afforded by 1D tokenizers allows you to do "amazing things" that could be applied to other fields. For example, in the area of self-driving cars, which is one of his research interests, the tokens could represent, instead of images, the different routes a vehicle might take.

Xie is also intrigued by the applications that could come out of these innovative ideas. "There are some really cool use cases this could unlock," he says.
