Over the past three years, the use of chatbots like ChatGPT and Claude has skyrocketed because they can help with a wide range of tasks. Whether you're writing Shakespearean sonnets, debugging code, or looking for an answer to an obscure question, AI systems seem to have you covered. The source of this versatility? Billions, or even trillions, of pieces of text data from across the internet.
That data, however, isn't enough to teach a robot to be a helpful assistant in the home or the factory. To understand how to handle, arrange, and place objects across varied environments, robots need demonstrations. You can think of a robot's training data as a collection of instructional videos that walk the system through each movement of a task. Collecting these demonstrations on real robots is time-consuming and not perfectly repeatable, so engineers have instead created training data by generating simulations with AI (which often fail to reflect real-world physics) or by painstakingly hand-crafting each digital environment from scratch.
Scientists at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) and the Toyota Research Institute may have found a way to create the diverse, realistic training grounds that robots need. Their "steerable scene generation" approach creates digital scenes of settings such as kitchens, living rooms, and restaurants that engineers can use to simulate a wide range of real-world interactions and scenarios. Trained on more than 44 million 3D rooms filled with models of objects such as tables and plates, the tool places existing assets in new scenes, then refines each one into a physically accurate, lifelike environment.
Steerable scene generation creates these 3D worlds by "steering" a diffusion model (an AI system that generates a visual from random noise) toward a scene you'd find in everyday life. The researchers used this generative system to "in-paint" an environment, filling in particular elements throughout the scene. You can imagine a blank canvas suddenly turning into a kitchen dotted with 3D objects, which gradually come together into a scene that obeys real-world physics. For example, the system ensures that a fork doesn't pass through a bowl on the table, a common glitch in 3D graphics known as "clipping," in which models overlap or intersect.
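To make that denoising idea concrete, here is a minimal, self-contained sketch (not the authors' code) of the general principle: object positions start as pure random noise and are repeatedly nudged so that no two objects clip. The sphere-based overlap check, the assumed object sizes, and the simple update rule are illustrative stand-ins, not the paper's actual diffusion model.

```python
import numpy as np

def overlap_gradient(poses, radii):
    """Gradient that pushes apart any pair of objects whose bounding
    spheres intersect (a crude stand-in for a real collision check)."""
    grad = np.zeros_like(poses)
    for i in range(len(poses)):
        for j in range(i + 1, len(poses)):
            diff = poses[i] - poses[j]
            dist = np.linalg.norm(diff) + 1e-8
            overlap = radii[i] + radii[j] - dist
            if overlap > 0:  # objects clip: repel them along their offset
                push = overlap * diff / dist
                grad[i] += push
                grad[j] -= push
    return grad

def generate_scene(num_objects, steps=200, step_size=0.5):
    """'Denoise' random object positions into a clipping-free arrangement."""
    rng = np.random.default_rng(0)
    poses = rng.normal(size=(num_objects, 3))   # blank canvas: pure noise
    radii = np.full(num_objects, 0.15)          # assumed object sizes
    for t in range(steps):
        jitter = rng.normal(size=poses.shape) * 0.05 * (1 - t / steps)
        poses = poses + jitter                   # shrinking noise, as in diffusion
        poses = poses + step_size * overlap_gradient(poses, radii)
    return poses

print(generate_scene(num_objects=5).round(2))
```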
How precisely steerable scene generation guides its creations toward realism depends on the strategy you choose. Its main strategy is "Monte Carlo Tree Search" (MCTS), in which the model builds out a series of alternative scenes, filling them in different ways toward a particular objective (for example, making the scene more physically realistic, or including as many edible items as possible). It's the same technique the AI program AlphaGo used to defeat human opponents at Go (a strategy board game like chess): the system considers potential sequences of moves before selecting the most advantageous one.
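As a toy illustration of treating scene building as sequential decision-making, the sketch below uses a heavily simplified, flat Monte Carlo search (random rollouts rather than a full search tree) to keep adding objects to a one-dimensional "table" without clipping, scoring each candidate placement by how many objects random continuations manage to fit afterward. The scene representation, action set, and objective are assumptions for illustration only.

```python
import random

TABLE = (0.0, 1.0)        # usable span of the "table"
OBJECT_WIDTH = 0.08       # assumed footprint of every object

def feasible(scene, x):
    """An object fits if it stays on the table and doesn't clip a neighbor."""
    if not (TABLE[0] <= x <= TABLE[1] - OBJECT_WIDTH):
        return False
    return all(abs(x - y) >= OBJECT_WIDTH for y in scene)

def rollout(scene, tries=20):
    """Randomly keep adding objects to estimate how promising a partial scene is."""
    scene = list(scene)
    for _ in range(tries):
        x = random.uniform(*TABLE)
        if feasible(scene, x):
            scene.append(x)
    return len(scene)     # objective: pack in as many objects as possible

def monte_carlo_place(scene, candidates=30, rollouts=10):
    """Pick the next placement whose random continuations score best on average."""
    best_x, best_score = None, -1.0
    for _ in range(candidates):
        x = random.uniform(*TABLE)
        if not feasible(scene, x):
            continue
        score = sum(rollout(scene + [x]) for _ in range(rollouts)) / rollouts
        if score > best_score:
            best_x, best_score = x, score
    return best_x

scene = []
while (x := monte_carlo_place(scene)) is not None:
    scene.append(x)
print(f"placed {len(scene)} objects:", [round(v, 2) for v in sorted(scene)])
```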
"We were the first to apply MCTS to scene generation by framing the scene generation task as a sequential decision-making process," says MIT Electrical Engineering and Computer Science graduate student Nicholas Pfaff, who is a CSAIL researcher and lead author of a paper presenting the work. "We keep building on top of partial scenes to produce better or more desirable scenes over time. As a result, MCTS creates scenes that are more complex than what the diffusion model was trained on."
In one particularly telling experiment, MCTS was tasked with adding the maximum number of objects to a simple restaurant scene. It packed as many as 34 items onto the table, including towering stacks of dim sum dishes, after training on scenes that averaged only 17 objects.
Steerable scene generation also lets you produce diverse training scenarios through reinforcement learning – essentially, teaching the diffusion model to meet a goal by trial and error. After training on the initial data, the system goes through a second training stage in which you outline a reward (basically, a desired outcome with a score indicating how close a scene is to that goal). The model automatically learns to create scenes with higher scores, often producing scenarios quite different from those it was trained on.
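Below is a minimal sketch of that reward-driven loop under toy assumptions: the "model" is just a Gaussian over the number of objects in a scene, and "post-training" is reward-weighted refitting toward samples that score well. The real system fine-tunes a diffusion model with reinforcement learning; this only conveys the sample, score, and update cycle.

```python
import numpy as np

def reward(num_objects, target=30):
    """Assumed reward: scenes closer to a desired object count score higher."""
    return -abs(num_objects - target)

rng = np.random.default_rng(0)
mean, std = 17.0, 4.0                    # start near the training data's average
for step in range(50):
    samples = rng.normal(mean, std, size=64).clip(1, None)    # "generate" scenes
    rewards = np.array([reward(round(s)) for s in samples])   # score each one
    weights = np.exp(rewards - rewards.max())                 # favor high reward
    weights /= weights.sum()
    mean = float(np.sum(weights * samples))                   # refit the "model"
print(f"after reward-weighted updates, scenes average ~{mean:.1f} objects")
```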
Users can also prompt the system directly with specific visual descriptions (for example, "a kitchen with four apples and a bowl on the table"), and steerable scene generation can carry out those requests with precision. For instance, the tool accurately followed users' prompts 98 percent of the time when creating scenes of pantry shelves, and 86 percent of the time for messy breakfast tables. Both marks are at least a 10 percent improvement over comparable methods such as "MiDiffusion" and "DiffuScene."
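As a rough illustration of how such prompt adherence might be tallied, the snippet below checks whether generated scenes contain the requested objects and computes a success rate. The scene format and the success criterion are assumptions, not the paper's actual evaluation protocol.

```python
from collections import Counter

def satisfies(scene_objects, required):
    """True if the scene contains at least the requested count of each item."""
    counts = Counter(scene_objects)
    return all(counts[name] >= n for name, n in required.items())

required = {"apple": 4, "bowl": 1}            # parsed from the user's prompt
generated_scenes = [
    ["table", "apple", "apple", "apple", "apple", "bowl"],   # satisfies the prompt
    ["table", "apple", "apple", "bowl"],                     # too few apples
]
rate = sum(satisfies(s, required) for s in generated_scenes) / len(generated_scenes)
print(f"prompt-adherence rate: {rate:.0%}")
```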
The system can also complete particular scenes via prompting or light directions (such as "come up with a different arrangement of the scene using the same objects"). You could ask it, for example, to place apples on several plates on a kitchen table, or to put board games and books on a shelf. It essentially "fills in the blanks," slotting objects into empty spaces while keeping the rest of a scene intact.
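Here is a minimal sketch of that fill-in-the-blanks behavior under toy assumptions: a few "fixed" objects stay exactly where they are, and only the empty space around them is re-populated, with simple rejection sampling standing in for the system's actual inpainting mechanism.

```python
import random

WIDTH = 0.1                 # assumed footprint of every object
fixed = [0.15, 0.55]        # objects the user wants left untouched

def clips(x, placed):
    """True if a new object at x would overlap something already placed."""
    return any(abs(x - y) < WIDTH for y in placed)

def inpaint(fixed, num_new, span=(0.0, 1.0), max_tries=1000):
    """Re-populate only the empty space around the fixed objects."""
    placed = list(fixed)
    for _ in range(max_tries):
        if len(placed) - len(fixed) == num_new:
            break
        x = random.uniform(*span)
        if not clips(x, placed):
            placed.append(x)
    return sorted(round(v, 2) for v in placed)

print(inpaint(fixed, num_new=4))
```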
According to the researchers, the strength of their project lies in its ability to create many scenes that roboticists can actually use. "A key insight from our findings is that it's OK if the scenes we pre-trained on don't exactly resemble the scenes we actually want," says Pfaff. "Using our steering methods, we can move beyond that broad distribution and sample from a 'better' one; in other words, we can generate the diverse, realistic, task-specific scenes we actually want to train our robots in."
Such vast scenes became testing grounds where the researchers could record a virtual robot interacting with various objects. For example, the machine carefully placed forks and knives into a cutlery holder and, in other 3D settings, arranged bread onto plates. Each simulation looked fluid and realistic, resembling the real-world environments that steerable scene generation could one day help robots train for.
While the system could be an encouraging avenue for generating a wide variety of training data for robots, the researchers say their work is more of a proof-of-concept. In the future, they would like to use generative AI to create completely new objects and scenes, rather than using a fixed library of assets. They also plan to include articulated objects that the robot can open or twist (such as cabinets or jars filled with food) to make the scenes even more interactive.
To make their virtual environments even more realistic, Pfaff and his colleagues may incorporate real-world objects, using a library of objects and scenes pulled from images on the internet and building on their previous work on "Scalable Real2Sim." By expanding the diversity and realism of AI-constructed robot testing grounds, the team hopes to build a community of users that creates a wealth of data, which could then serve as a massive dataset to teach dexterous robots various skills.
"Today, creating realistic scenes for simulation can be quite a challenging endeavor; procedural generation can readily produce a large number of scenes, but they likely won't be representative of the environments the robot would encounter in the real world. Manually creating bespoke scenes is both time-consuming and expensive," says Jeremy Binagia, an applied scientist at Amazon Robotics who wasn't involved in the paper. "Steerable scene generation offers a better approach: train a generative model on a large collection of pre-existing scenes and adapt it (using a strategy such as reinforcement learning) to specific downstream applications. Compared to previous works that leverage an off-the-shelf vision-language model or focus just on arranging objects in a 2D grid, this approach guarantees physical feasibility and considers full 3D translation and rotation, enabling the generation of much more interesting scenes."
"Steerable scene generation with post-training and inference-time search provides a novel and efficient framework for automating large-scale scene generation," says Rick Cory SM '08, PhD '10, a roboticist at the Toyota Research Institute who was also not involved in the paper. "Furthermore, it can generate 'never-before-seen' scenes that are deemed important for downstream tasks. In the future, combining this framework with vast internet data could unlock an important milestone toward effectively training robots for deployment in the real world."
Pfaff wrote the paper with senior author Russ Tedrake, the Toyota Professor of Electrical Engineering and Computer Science, Aeronautics and Astronautics, and Mechanical Engineering at MIT; senior vice president of Large Behavior Models at the Toyota Research Institute; and CSAIL principal investigator. Other authors included Toyota Research Institute robotics researcher Hongkai Dai SM '12, PhD '16; team lead and senior research scientist Sergey Zakharov; and Carnegie Mellon University PhD student Shun Iwase. Their work was supported, in part, by Amazon and the Toyota Research Institute. The researchers presented their work at the Conference on Robot Learning (CoRL) in September.