A robot searching for workers trapped in a partially collapsed mine shaft must rapidly generate a map of the scene and pinpoint its own location within it while navigating treacherous terrain.
Scientists have recently begun building powerful machine learning models to perform this complex task using only images from the robot's onboard cameras, but even the best models can only process a few images at a time. In the event of a real disaster where every second counts, a search and rescue robot would have to quickly traverse large areas and process thousands of images to complete its mission.
To overcome this problem, MIT researchers used ideas from both the latest artificial intelligence vision models and classical computer vision to develop a new system that can process any number of images. Their system accurately generates 3D maps of complex scenes, such as a crowded office corridor, in just a few seconds.
The AI-based system gradually creates and aligns smaller submaps of the scene, which it then combines to reconstruct a full 3D map while estimating the robot's position in real time.
Unlike many other approaches, their technique does not require calibrated cameras or an expert to tune a complex system implementation. The simpler nature of their approach, combined with the speed and quality of 3D reconstruction, would make it easier to scale to real-world applications.
In addition to helping search and rescue robots navigate, the method could be used to create augmented reality applications for wearable devices such as headsets, or to enable industrial robots to quickly find and move goods inside a warehouse.
“For robots to perform increasingly complex tasks, they need much more complex representations of maps of the world around them. At the same time, we do not want to make it difficult to implement these maps in practice. We have shown that it is possible to generate an accurate 3D reconstruction in a matter of seconds using a tool that works right out of the box,” says Dominic Maggio, an MIT graduate student and lead author of a paper on this method.
Maggio was joined on the paper by postdoc Hyungtae Lim and senior author Luca Carlone, associate professor in MIT's Department of Aeronautics and Astronautics (AeroAstro), principal investigator in the Laboratory for Information and Decision Systems (LIDS), and director of the MIT SPARK Laboratory. The research results will be presented at the Conference on Neural Information Processing Systems.
Mapping out a solution
For years, researchers have been studying an important element of robotic navigation called simultaneous localization and mapping (SLAM). In SLAM, a robot builds a map of its surroundings while simultaneously estimating its own position within that map.
Traditional optimization-based methods for this task tend to fail in challenging scenes, or they require the robot's onboard cameras to be calibrated in advance. To avoid these pitfalls, researchers have been training machine learning models to learn the task from data.
Although they are simpler to implement, even the best models can only process about 60 camera images at a time, making them impractical for applications where the robot must move quickly through a diverse environment while processing thousands of images.
To solve this problem, MIT researchers designed a system that generates smaller submaps of the scene instead of the entire map. Their method “glues” these submaps into one overall 3D reconstruction. The model still only processes a few images at a time, but the system can recreate larger scenes much faster by stitching together smaller submaps.
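In code, a naive version of this stitching idea might look something like the sketch below. This is an illustration rather than the authors' implementation: `reconstruct_submap` and `align_submaps` are hypothetical placeholders standing in for the learned reconstruction model and the alignment step discussed later in this article.

```python
import numpy as np

CHUNK = 30    # images per submap, safely under the model's roughly 60-frame limit
OVERLAP = 5   # frames shared with the previous chunk, used for alignment

def build_global_map(images, reconstruct_submap, align_submaps):
    """Stitch per-chunk submaps into one global 3D point map."""
    submaps, poses = [], [np.eye(4)]   # the first submap defines the world frame
    for start in range(0, max(1, len(images) - OVERLAP), CHUNK - OVERLAP):
        submaps.append(reconstruct_submap(images[start:start + CHUNK]))
        if len(submaps) > 1:
            # Transform taking the newest submap into its predecessor's frame,
            # estimated from the frames the two chunks share.
            T_rel = align_submaps(submaps[-2], submaps[-1], OVERLAP)
            poses.append(poses[-1] @ T_rel)   # chain into the global frame
    # Express every submap's points (N x 3 arrays) in the world frame and merge.
    def to_world(T, pts):
        return pts @ T[:3, :3].T + T[:3, 3]
    return np.vstack([to_world(T, s) for T, s in zip(poses, submaps)])
```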
“It seemed like a very simple solution, but when I tried it for the first time, I was surprised that it didn't work that well,” says Maggio.
Looking for an explanation, he dug into computer vision research from the 1980s and 1990s. Through this analysis, Maggio realized that errors introduced by the way machine learning models process images make aligning submaps a harder problem.
Traditional methods align submaps by applying rotations and translations until the maps line up. But these new models can introduce ambiguity into the submaps, which makes them harder to align. For example, a 3D submap of one side of a room might have walls that are slightly bent or stretched. Simply rotating and translating the deformed submaps to align them doesn't work.
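The failure mode is easy to reproduce. The short, runnable sketch below (an illustration, not the paper's code) uses the classical Kabsch/Procrustes method to find the best rigid fit between two copies of the same synthetic point cloud: when the second copy is merely rotated and shifted, the residual is essentially zero, but when one axis is stretched by 10 percent, as a learned submap's wall might be, no rigid motion can close the gap.

```python
import numpy as np

def rigid_align(A, B):
    """Best-fit rotation R and translation t with R @ a + t ~ b (Kabsch)."""
    cA, cB = A.mean(axis=0), B.mean(axis=0)
    U, _, Vt = np.linalg.svd((A - cA).T @ (B - cB))
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cB - R @ cA

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))                     # points shared by two submaps
th = np.pi / 6
Rz = np.array([[np.cos(th), -np.sin(th), 0],
               [np.sin(th),  np.cos(th), 0],
               [0,           0,          1]])
t = np.array([0.5, -0.2, 1.0])

B_rigid  = A @ Rz.T + t                           # same shape, just moved
B_warped = (A * [1.10, 1.0, 1.0]) @ Rz.T + t      # one axis stretched by 10%

for name, B in [("rigid", B_rigid), ("warped", B_warped)]:
    R, tt = rigid_align(A, B)
    resid = np.linalg.norm(A @ R.T + tt - B, axis=1).mean()
    print(f"{name}: mean residual after best rigid fit = {resid:.4f}")
```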
“We need to make sure that all the submaps are deformed in a consistent way so that we can fit them together well,” Carlone explains.
A more flexible approach
Borrowing ideas from classical computer vision, the researchers developed a more flexible mathematical technique that can represent all the ways these submaps can be deformed. By applying mathematical transformations to each submap, the method aligns them in a way that resolves the ambiguity.
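The article doesn't spell out the exact transformation family the researchers use, but the flavor of the fix can be shown by widening the fit beyond rigid motions. Continuing the synthetic example above, the sketch below fits a general affine map (rotation, translation, scale, and shear at once) by least squares in homogeneous coordinates; the 10 percent stretch that defeated the rigid fit is now absorbed exactly, so the warped clouds align.

```python
import numpy as np

def affine_align(A, B):
    """Least-squares 4x3 map M with [a, 1] @ M ~ b for point rows a, b."""
    Ah = np.hstack([A, np.ones((len(A), 1))])     # homogeneous coordinates
    M, *_ = np.linalg.lstsq(Ah, B, rcond=None)
    return M

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))
th = np.pi / 6
Rz = np.array([[np.cos(th), -np.sin(th), 0],
               [np.sin(th),  np.cos(th), 0],
               [0,           0,          1]])
B_warped = (A * [1.10, 1.0, 1.0]) @ Rz.T + np.array([0.5, -0.2, 1.0])

M = affine_align(A, B_warped)
Ah = np.hstack([A, np.ones((len(A), 1))])
print(f"mean residual after affine fit = "
      f"{np.linalg.norm(Ah @ M - B_warped, axis=1).mean():.2e}")   # ~0
```

An affine map is only one illustration; richer families, such as projective transforms, can absorb other distortions as well, at the cost of a harder estimation problem.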
From the input images, the system generates a 3D reconstruction of the scene together with estimates of the camera locations, which the robot would use to localize itself in space.
“Once Dominic had the intuition to combine these two worlds – learning-based approaches and traditional optimization methods – the implementation was quite simple,” says Carlone. “Coming up with something so effective and simple has the potential for many applications.”
Their system ran faster and produced fewer reconstruction errors than other methods, without requiring special cameras or extra tools to process the data. The researchers generated near-real-time 3D reconstructions of complex scenes, such as the interior of the MIT Chapel, using only short videos captured on a cell phone.
The average error in these 3D reconstructions was less than 5 centimeters.
In the future, the researchers want to make their method more reliable for especially complex scenes and work toward deploying it on real robots in challenging conditions.
“Knowing traditional geometry pays off. If you thoroughly understand what's going on in the model, you can get much better results and make things much more scalable,” Carlone says.
This work is supported in part by the U.S. National Science Foundation, the U.S. Office of Naval Research, and the National Research Foundation of Korea. Carlone, currently on sabbatical as an Amazon Fellow, completed this work before joining Amazon.