AI tool generates high-quality images faster than state-of-the-art approaches | MIT News

The ability to generate high-quality images quickly is crucial for producing realistic simulated environments that can be used to train self-driving cars to avoid unpredictable hazards, making them safer on real streets.

Generative artificial intelligence techniques are increasingly being used to create such images. One popular type of model, called a diffusion model, can produce stunningly realistic images but is too slow and computationally intensive for many applications. On the other hand, the autoregressive models that power LLMs like ChatGPT are much faster, but they produce lower-quality images that are often riddled with errors.

Researchers from MIT and NVIDIA have developed a new approach that brings together the best of both methods. Their hybrid image-generation tool uses an autoregressive model to quickly capture the big picture and then a small diffusion model to refine the details of the image.

Their tool, known as HART (short for hybrid autoregressive transformer), can generate images that match or exceed the quality of state-of-the-art diffusion models, but do so about nine times faster.

The generation process consumes fewer computational resources than typical diffusion models, enabling HART to run locally on a commercial laptop or smartphone. A user only needs to enter one natural-language prompt into the HART interface to generate an image.

HART could have a wide range of applications, such as helping researchers train robots to complete complex real-world tasks and aiding designers in producing striking scenes for video games.

“If you are painting a landscape, and you just paint the entire canvas once, it might not look very good. But if you paint the big picture and then refine the image with smaller brush strokes, your painting could look a lot better. That is the basic idea with HART,” says Haotian Tang SM '22, PhD '25, co-lead author of a new paper on HART.

He is joined by co-lead author Yecheng Wu, an undergraduate student at Tsinghua University; senior author Song Han, an associate professor in the Department of Electrical Engineering and Computer Science (EECS), a member of the MIT-IBM Watson AI Lab, and a distinguished scientist at NVIDIA; as well as others at MIT, Tsinghua University, and NVIDIA. The research will be presented at the International Conference on Learning Representations.

The best of both worlds

Popular diffusion models, such as Stable Diffusion and DALL-E, are known for producing highly detailed images. These models generate images through an iterative process in which they predict some amount of random noise on each pixel, subtract the noise, then repeat the process of predicting and “denoising” multiple times until they generate a new image that is completely free of noise.

Because the diffusion model denoises all pixels in an image at each step, and there can be 30 or more steps, the process is slow and computationally expensive. But since the model gets many chances to correct details it got wrong, the images are high-quality.
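The iterative loop described above can be sketched in a few lines. This is a toy illustration only, not the actual sampler used by Stable Diffusion or HART; the `toy_denoiser` is a hypothetical stand-in for the learned noise-prediction network.

```python
import numpy as np

def toy_denoiser(noisy_image, step):
    """Hypothetical stand-in for a trained noise predictor: here we
    simply estimate noise as each pixel's deviation from the mean."""
    return noisy_image - noisy_image.mean()

def diffusion_sample(shape, num_steps=30, seed=0):
    """Sketch of the iterative denoising loop: start from pure noise
    and repeatedly subtract a fraction of the predicted noise.
    Every pixel is updated at every one of the ~30 steps, which is
    what makes diffusion sampling slow."""
    rng = np.random.default_rng(seed)
    image = rng.standard_normal(shape)  # start from random noise
    for step in range(num_steps):
        predicted_noise = toy_denoiser(image, step)
        image = image - predicted_noise / num_steps  # remove a little noise
    return image

img = diffusion_sample((8, 8))
print(img.shape)  # (8, 8)
```

The key cost is the loop itself: one full pass over all pixels per step, repeated dozens of times.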

Autoregressive models, commonly used for predicting text, can generate images by predicting patches of an image sequentially, a few pixels at a time. They can't go back and correct their mistakes, but the sequential prediction process is much faster than diffusion.

These models use representations known as tokens to make predictions. An autoregressive model uses an autoencoder to compress raw image pixels into discrete tokens, as well as to reconstruct the image from predicted tokens. While this boosts the model's speed, the information loss that occurs during compression causes errors when the model generates a new image.
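The compression step and the information loss it causes can be illustrated with a toy vector-quantization sketch. This is not HART's actual autoencoder; the tiny `codebook` and the helper names are invented for illustration.

```python
import numpy as np

# Hypothetical tiny codebook: 4 entries, each a 2-D latent vector.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def tokenize(latents):
    """Map each continuous latent vector to the index of its
    nearest codebook entry -- a discrete token."""
    dists = np.linalg.norm(latents[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)

def detokenize(tokens):
    """Reconstruct latents from tokens; the snap-to-codebook
    rounding is where information is lost."""
    return codebook[tokens]

latents = np.array([[0.1, 0.2], [0.9, 0.8]])
tokens = tokenize(latents)      # nearest entries: indices 0 and 3
recon = detokenize(tokens)
residual = latents - recon      # the fine detail the tokens dropped
print(tokens, residual)
```

The `residual` left over after reconstruction is exactly the kind of lost detail that HART's second-stage model is designed to recover.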

With HART, the researchers developed a hybrid approach that uses an autoregressive model to predict compressed, discrete image tokens, then a small diffusion model to predict residual tokens. The residual tokens compensate for the model's information loss by capturing details left out by the discrete tokens.

“We can achieve a huge boost in terms of reconstruction quality. Our residual tokens learn high-frequency details, like the edges of an object, or a person's hair, eyes, or mouth. These are places where discrete tokens can make mistakes,” says Tang.

Because the diffusion model only predicts the remaining details after the autoregressive model has done its job, it can accomplish the task in eight steps, instead of the usual 30 or more that a standard diffusion model requires to generate an entire image. This minimal overhead of the extra diffusion model allows HART to retain the speed advantage of the autoregressive model while significantly enhancing its ability to generate intricate image details.
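Putting the two stages together, the hybrid pipeline can be sketched as below. Both model functions are hypothetical stand-ins (random placeholders, not trained networks); what the sketch shows is the division of labor: a fast sequential pass for the coarse image, then only eight refinement steps for the residual detail.

```python
import numpy as np

def autoregressive_tokens(num_tokens, rng):
    """Stand-in for the large AR transformer: emits discrete tokens
    one at a time -- fast, but coarse."""
    return np.array([rng.integers(0, 4) for _ in range(num_tokens)])

def diffusion_residual(coarse, rng, num_steps=8):
    """Stand-in for the small diffusion model: just a few refinement
    steps (8, not 30+), predicting only the residual detail."""
    residual = np.zeros_like(coarse)
    for _ in range(num_steps):
        residual = residual + 0.01 * rng.standard_normal(coarse.shape)
    return residual

rng = np.random.default_rng(0)
codebook = np.array([0.0, 0.3, 0.6, 0.9])      # toy pixel values per token
tokens = autoregressive_tokens(16, rng)        # stage 1: coarse image, fast
coarse = codebook[tokens]                      # decode tokens to pixels
final = coarse + diffusion_residual(coarse, rng)  # stage 2: add fine detail
print(final.shape)  # (16,)
```

Because stage 2 touches the image for only eight cheap steps, the overall cost stays close to that of the autoregressive model alone.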

“The diffusion model has an easier job to do, which leads to more efficiency,” he adds.

Outperforming larger models

During the development of HART, the researchers encountered challenges in effectively integrating the diffusion model to enhance the autoregressive model. They found that incorporating the diffusion model in the early stages of the autoregressive process led to an accumulation of errors. Instead, their final design, which applies the diffusion model to predict only residual tokens as the final step, significantly improved generation quality.

Their method, which uses a combination of an autoregressive transformer model with 700 million parameters and a lightweight diffusion model with 37 million parameters, can generate images of the same quality as those created by a diffusion model with 2 billion parameters, but it does so about nine times faster. It uses about 31 percent less computation than state-of-the-art models.

Moreover, because HART uses an autoregressive model to do the bulk of the work, the same type of model that powers LLMs, it is more compatible for integration with the new class of unified vision-language generative models. In the future, one could interact with a unified vision-language generative model, perhaps by asking it to show the intermediate steps required to assemble a piece of furniture.

“LLMs are a good interface for all sorts of models, like multimodal models and models that can reason. This is a way to push the intelligence to a new frontier. An efficient image-generation model would unlock a lot of possibilities,” he says.

In the future, the researchers want to go down this path and build vision-language models on top of the HART architecture. Since HART is scalable and generalizable to multiple modalities, they would also like to apply it to video generation and audio prediction tasks.

This research was funded, in part, by the MIT-IBM Watson AI Lab, the MIT and Amazon Science Hub, the MIT AI Hardware Program, and the U.S. National Science Foundation. The GPU infrastructure for training this model was donated by NVIDIA.
