SLAM Outdoor Data Capture Problems
SLAM (Simultaneous Localization and Mapping), the core algorithm we use to reconstruct 3D scenes, has been around for decades. However, to make SLAM computationally tractable, certain simplifying assumptions are made about the real world. Among them are:
- objects that appear in several photos look the same no matter the difference in viewpoint of the cameras taking the photos
- objects in the scene do not move
Obviously, these assumptions are often not true of the outdoor world. Cars, buildings, pedestrians, etc. do not look the same from the front, side, and back, and certain objects, such as cars and pedestrians, are usually moving when seen out on the street. But the assumptions above are mathematically necessary to make SLAM possible, and unfortunately, when they are violated, the result can be unrecognizable or seemingly impossible 3D reconstructions.
Let’s look at some examples of these issues and the techniques we’re using to avoid them.
The Distant Points Problem
Consider the following picture of a 3D reconstruction in progress.
Can you tell what is being represented by this reconstruction? Probably not.
Take a look at this next sequence of images. At what point can you make sense of the scene?
Each picture in this sequence depicts the same underlying reconstruction. In the first image, the cluster of points at the center is the road and trees that you see in the last image. The difference is that the first image also contains points that are very far away from the area of interest: the spray of points extending from the central cluster to the top left corner of the image. Each subsequent image has fewer and fewer of these irrelevant points.
This example illustrates an underlying problem with capturing outdoor scenes: it’s practically impossible not to incidentally capture objects that are very far away, like clouds, skylines, distant buildings, terrain, or trees. Including that data means the reconstruction will contain objects that are potentially miles away, while the scene we’re actually interested in is only 100 or so feet in front of us.
To make matters worse, the video contains very little data about those far-away objects, so the reconstruction ends up capturing a small amount of inconvenient and mostly useless data, throwing off the scale of the whole scene in the process!
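One common way to prune these incidental far-away points is a robust distance filter: measure each point’s distance from the center of the cloud and drop points far outside the typical radius. This is a minimal sketch of that idea, not our actual pipeline; the cutoff factor `k` is a hypothetical parameter.

```python
import numpy as np

def filter_distant_points(points, k=5.0):
    """Drop points whose distance from the cloud's median center exceeds
    k times the median point distance (a robust estimate of the cloud's
    radius, insensitive to a small spray of distant outliers)."""
    points = np.asarray(points, dtype=float)
    center = np.median(points, axis=0)          # robust cloud center
    dists = np.linalg.norm(points - center, axis=1)
    radius = k * np.median(dists)               # robust cutoff radius
    return points[dists <= radius]
```

Using the median rather than the mean matters here: a handful of points miles away would drag a mean-based center and radius toward them, but barely move the medians.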
The Moving Objects Problem
Similar to the distant points problem, there’s also a problem with the incidental capture of moving objects.
Consider the following two sets of pictures. The first is two frames extracted from a video recorded by a dashboard-mounted smartphone. You can see the movement between the frames in the change in position of the camera, which has moved farther down the road, and in the oncoming car, which is a bit closer in the second frame.
This set of pictures is the same as the first set, but with overlaid feature matches detected in the SLAM process. The red circles are potential features (points in the scene that might be easy to identify in multiple pictures), and the ones that have a connecting green line drawn from one picture to the other are where SLAM thinks there is a match: the same thing in one picture to the next.
As a side note, there are no features or matches in the top half of the pictures because we excluded that part of the images from the data set before running SLAM feature matching.
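The matching step itself can be illustrated in miniature. Given one descriptor vector per feature in each frame, a standard approach is nearest-neighbor matching with Lowe’s ratio test: accept a match only when the best candidate is clearly better than the runner-up. This NumPy sketch is a simplification for illustration; real SLAM pipelines use purpose-built detectors and descriptors (e.g. ORB) and much faster matchers.

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.75):
    """Match each descriptor in desc_a to its nearest neighbor in desc_b,
    keeping only matches that pass the ratio test: the best match must be
    much closer than the second-best, or the match is too ambiguous."""
    desc_a = np.asarray(desc_a, dtype=float)
    desc_b = np.asarray(desc_b, dtype=float)
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)  # distance to every candidate
        order = np.argsort(dists)
        best, second = order[0], order[1]
        if dists[best] < ratio * dists[second]:     # unambiguous winner only
            matches.append((i, int(best)))
    return matches
```

Each accepted pair `(i, j)` corresponds to one of the green lines in the pictures above: feature `i` in the first frame matched to feature `j` in the second.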
At first, these may appear to be good results. In particular, if you look closely, you might notice parts of the oncoming car being marked as features in both images and matched. That makes sense – it’s the same object detected in both pictures, a good demonstration of the power of feature matching. Unfortunately, these particular matches will confuse a SLAM reconstruction, because SLAM’s mapping step assumes that the objects it matches are not moving.
As we mentioned earlier, SLAM makes this assumption to keep triangulating feature positions mathematically solvable. For the car in this example, the process will try to figure out what camera positions could explain the apparently different positions of the vehicle. The correct explanation is of course a combination of the camera moving and the car moving, but SLAM attributes all apparent motion to the camera, so it triangulates incorrectly. [For a refresher on how the localization part of SLAM works, see our last post.]
SLAM has a way to automatically exclude moving objects (as well as other anomalies that emerge from violations of its core assumptions) using a sub-algorithm called Random Sample Consensus (RANSAC). However, as the name suggests, this method uses random sampling, so it does not guarantee that SLAM will always choose the correct set of matches for a reconstruction.
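The core RANSAC loop is easy to sketch on a toy problem. Instead of camera geometry, this example fits a 2D line: repeatedly sample a minimal set of points, fit a candidate model, and keep the candidate agreed with by the largest consensus set. The iteration count and inlier tolerance here are hypothetical parameters chosen for the toy data.

```python
import random
import numpy as np

def ransac_line(points, iters=200, tol=0.1, seed=0):
    """Fit y = m*x + b by RANSAC: sample a minimal set (two points),
    fit a candidate line, count inliers within tol of the line, and
    keep the candidate with the largest consensus set."""
    rng = random.Random(seed)
    pts = np.asarray(points, dtype=float)
    best_model, best_inliers = None, []
    for _ in range(iters):
        (x1, y1), (x2, y2) = pts[rng.sample(range(len(pts)), 2)]
        if x1 == x2:
            continue  # vertical sample; can't fit y = m*x + b
        m = (y2 - y1) / (x2 - x1)
        b = y1 - m * x1
        residuals = np.abs(pts[:, 1] - (m * pts[:, 0] + b))
        inliers = np.where(residuals < tol)[0]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (m, b), inliers
    return best_model, best_inliers
```

The analogy to SLAM: the matches contributed by the moving car play the role of the outlier points, and a good consensus set excludes them. But because the candidates are drawn at random, a run can get unlucky, which is exactly why RANSAC alone doesn’t guarantee a correct reconstruction.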
So how do we deal with the two problems of distant points and moving objects? Can we remove things like moving vehicles or distant buildings from the images before performing a 3D reconstruction?
Indeed, a version of that solution is what RoadBotics is using. This approach is easy to describe but difficult to do; it requires using deep neural networks for scene understanding, also known as scene segmentation. Fortunately for us, this is an area of expertise we already have from developing neural networks for our current pavement analysis product.
Broadly speaking, a segmentation divides an image into various parts. The kind of segmentation we’re interested in is a pixel-wise annotation which assigns human-useful labels to each region (e.g. “road”, “person”, “vehicle”). This is called semantic segmentation. The following image is a typical example of a segmented image:
Similar to how RoadBotics uses deep neural networks for road assessments, here we are training a neural network to produce semantic segmentations of the images we plan to use for a reconstruction. We then use those segmentations to remove unwanted objects from the original images, like cars, clouds, sky, and even distant buildings. In the picture above, there are many categories for objects we are not concerned about for a 3D reconstruction, like pedestrians, vehicles, and buildings. For us, all of these categories can be rolled up into one category of “unwanted objects.” We are interested in reconstructing roads and sidewalks, so our neural network simply needs to identify those areas of the image and then we automatically remove the rest.
Here’s an example of how this approach works. The first image is an unmodified image taken by a dashboard-mounted smartphone during routine data collection. The second image shows the result of applying our semantic segmentation model: we assigned plain white to anything the model doesn’t label as road, thus “masking” out everything the model is not highly confident is road.
Almost all the potentially problematic areas (including the moving car) have been masked out, leaving only the road surface. The reconstruction process can now proceed with much less extraneous data, leading to a faster and more robust reconstruction. And as our models improve, the reconstructions also improve!
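The masking step itself amounts to a few lines once the model has produced a per-pixel label map. In this sketch, `ROAD` is a hypothetical class id standing in for whatever label set the real model uses:

```python
import numpy as np

ROAD = 1  # hypothetical class id; the real model's label set differs

def mask_non_road(image, labels, fill=255):
    """Paint every pixel the model did not label as road plain white,
    so the reconstruction step only sees road surface.

    image:  H x W x 3 uint8 array
    labels: H x W integer array of per-pixel class ids
    """
    out = image.copy()
    out[labels != ROAD] = fill  # boolean mask broadcasts across channels
    return out
```

Rolling several unwanted classes into one mask is just as simple: replace the `labels != ROAD` test with membership in a set of “unwanted” class ids.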
Although we have addressed two major problems in this post, there are still a few more that lie ahead. In our next post, we’ll discuss those and what we’re doing to solve them. Stay tuned!