RoadBotics’ Research and Development: 3D Reconstruction from 360 Videos

Earlier in our blog series, we showed you some initial results from GoPro Max fisheye video imagery. We initially planned to show you how to do automated asset extraction using 3D reconstructions, but our team ran into an interesting problem on the way: how to use the marquee feature of the GoPro Max (360 videos) to make 3D reconstructions!

We’ll be examining part of this 360 video, again featuring Todd:

To understand the problem, you need to first understand a little bit about how 360 video capture works – or to be specific, how the GoPro Max captures 360 videos. (There are other approaches, but they tend to be variations on the same theme – namely, simultaneous capture.) 

The GoPro Max is a combination of two fisheye cameras mounted together. Fisheye lenses capture wide fields of view (FOV). In this case, the GoPro combines two 180 degree FOV cameras and uses on-camera firmware to synchronize their captures, which together yield 360 degree video! There are a few extra technical details, but for our purposes we will ignore these complications.

The raw 180 degree field-of-view frames look roughly like this:

The question is – what kind of data do we want to extract out of this to do the best possible 3D reconstruction of the area?

One way to approach the problem is to re-project the 360 video data into a form that our existing reconstruction methods can work with. For instance, one can transform the 360 video above into this version (often called a cube map):

Why is this called a cube map? 

Imagine you (the camera) are standing in a particular location. Now imagine you were surrounded by a cube. There would be a face of the cube above you, below you, one in front, one behind, one to your left, and one to your right. 

What we’re doing by transforming the 360 video into a cube map is conceptually the same: at each point in the video, we capture what you would see in those 6 directions around you. In the above version of the data, the two leftmost faces are the up and down views. The rest of the top row holds the forward and right views, and the rest of the bottom row holds the backward and left views; because the two views in each of those pairs are adjacent in the scene, they meld nicely.
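To make the "six directions" idea concrete, here is a minimal sketch of how a view direction can be mapped to a pixel in a 360 frame. It assumes the frame is stored as an equirectangular image and uses an x = right, y = up, z = forward axis convention; both are assumptions for illustration, not details of our pipeline, and the names `CUBE_FACES` and `direction_to_equirect` are hypothetical.

```python
import math

# Center view direction of each cube face.
# Assumed axis convention: x = right, y = up, z = forward.
CUBE_FACES = {
    "forward": (0.0, 0.0, 1.0),
    "back": (0.0, 0.0, -1.0),
    "right": (1.0, 0.0, 0.0),
    "left": (-1.0, 0.0, 0.0),
    "up": (0.0, 1.0, 0.0),
    "down": (0.0, -1.0, 0.0),
}


def direction_to_equirect(direction, width, height):
    """Map a 3D view direction to (column, row) in an equirectangular frame.

    Assumes the frame center is the forward direction, columns grow to the
    right, and rows grow downward.
    """
    x, y, z = direction
    norm = math.sqrt(x * x + y * y + z * z)
    lon = math.atan2(x, z)       # longitude in [-pi, pi], 0 = forward
    lat = math.asin(y / norm)    # latitude in [-pi/2, pi/2], 0 = horizon
    col = (lon / (2 * math.pi) + 0.5) * (width - 1)
    row = (0.5 - lat / math.pi) * (height - 1)
    return col, row
```

Sampling a full cube face would repeat this lookup for every pixel ray of the face rather than only its center direction.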

We can go from this cube map to 6 separate video streams, as if we were using 6 normal cameras pointed in each of those directions, and then simply use all of those images together in a 3D reconstruction! (Though in practice, we end up dropping some of this data; for this kind of footage, the views looking straight up and straight down are mostly useless.)
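Splitting a cube-map frame into per-face images can be sketched as below. The 3x2 layout follows the description above, but which of up/down sits on top is an assumption, and `LAYOUT`, `SKIPPED_FACES`, `face_boxes`, and `crop` are hypothetical names used only for this illustration.

```python
# Assumed 3x2 cube-map layout: the leftmost column holds up/down, the rest of
# the top row forward/right, the rest of the bottom row back/left.
# Which of up/down is on the top row is an assumption.
LAYOUT = (
    ("up", "forward", "right"),
    ("down", "back", "left"),
)

# Faces dropped before reconstruction (mostly sky and ground).
SKIPPED_FACES = {"up", "down"}


def face_boxes(frame_width, frame_height, layout=LAYOUT, skipped=SKIPPED_FACES):
    """Return {face_name: (left, top, right, bottom)} boxes for the kept faces."""
    face_w = frame_width // len(layout[0])
    face_h = frame_height // len(layout)
    boxes = {}
    for row_index, row_names in enumerate(layout):
        for col_index, name in enumerate(row_names):
            if name in skipped:
                continue
            left, top = col_index * face_w, row_index * face_h
            boxes[name] = (left, top, left + face_w, top + face_h)
    return boxes


def crop(frame, box):
    """Crop a frame stored as a list of pixel rows to the given box."""
    left, top, right, bottom = box
    return [row[left:right] for row in frame[top:bottom]]
```

Applying `crop` with each box to every frame of the video yields four image streams that can be fed to a reconstruction tool as if they came from four ordinary cameras.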

And you might think to yourself, okay, that’s simple enough, why not just do that? But the truth is that this particular cube map is only one version of the 360 data that we can produce. Looking at the 360 video, you might notice that you are not restricted to looking straight left or right, up or down; you can rotate the view continuously along those axes. In the same way, we can produce cube maps that are rotated by any amount along these axes (pitch and yaw).
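A rotated cube map just means that each face looks along a rotated direction. The sketch below rotates a view direction by a yaw and a pitch; the axis convention (x = right, y = up, z = forward), the sign conventions, and the pitch-then-yaw order are assumptions made for illustration, and `rotate_direction` is a hypothetical helper.

```python
import math


def rotate_direction(direction, yaw_deg, pitch_deg):
    """Rotate a view direction by pitch, then yaw (angles in degrees).

    Assumed conventions: x = right, y = up, z = forward. Positive pitch tilts
    the forward direction toward up; positive yaw turns it toward the right.
    """
    x, y, z = direction
    yaw = math.radians(yaw_deg)
    pitch = math.radians(pitch_deg)

    # Pitch: rotation about the x (right) axis.
    y, z = (y * math.cos(pitch) + z * math.sin(pitch),
            -y * math.sin(pitch) + z * math.cos(pitch))

    # Yaw: rotation about the y (up) axis.
    x, z = (x * math.cos(yaw) + z * math.sin(yaw),
            -x * math.sin(yaw) + z * math.cos(yaw))

    return (x, y, z)
```

For example, `rotate_direction((0.0, 0.0, 1.0), 45, 0)` points the forward face halfway between forward and right, which is the kind of offset a yawed cube map uses.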

So our question then becomes – what is the best view, or set of views, or combination of views that produces the best 3D reconstruction? [As an interesting side note – we ended up also having to apply a mask to clean up the area of the video that shows Todd himself collecting the video! In that respect, it’s kind of like editing out a cameraman you accidentally captured in a big-budget movie.]

How to get the best reconstruction out of the footage turns out to be an interesting but tractable question. First, we establish a ground-truth version of the reconstruction, and then we compare the point cloud generated by each method we want to test against it.

So this is the ground truth version of a small section of the 360 video above:

Now, let’s compare a few of the most interesting versions we tested. Namely:

Method 1 - include most of the images, no rotations
Method 2 - yaw every cube by 45 degrees
Method 3 - yaw 45, pitch 45
Method 4 - a mixture of 30 degree offsets and 60 degree offsets of yaw/pitch

After reconstruction, we compared the point clouds generated by each method and judged them on two major metrics: completeness and accuracy.

The first, completeness, gauges how many of the relevant 3D points in the ground-truth reconstruction each method captured. Obviously, by crunching more data we can improve the completeness metric, but we want to find a reasonably efficient method for using the 3D data.

The second, accuracy, gauges how close each point generated by a method is to the ground-truth version of that point. An ideal reconstruction would score 100% on both metrics, but it turns out that the actual reconstruction methods show some obvious and interesting trends.
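One common way to turn these two ideas into numbers is with nearest-neighbor distances between the two point clouds. The sketch below is an assumption about how such metrics can be computed, not the exact formulas we used: the distance `threshold` is a hypothetical parameter, the function names are illustrative, and the brute-force search would be replaced by a spatial index for real point clouds.

```python
import math


def nearest_distances(source_points, target_points):
    """For each source point, the distance to its nearest target point.

    Brute force, O(len(source) * len(target)); fine only for small examples.
    """
    return [min(math.dist(p, q) for q in target_points) for p in source_points]


def completeness(ground_truth, reconstruction, threshold):
    """Fraction of ground-truth points with a reconstructed point within threshold."""
    distances = nearest_distances(ground_truth, reconstruction)
    return sum(d <= threshold for d in distances) / len(distances)


def accuracy(ground_truth, reconstruction, threshold):
    """Fraction of reconstructed points within threshold of a ground-truth point."""
    distances = nearest_distances(reconstruction, ground_truth)
    return sum(d <= threshold for d in distances) / len(distances)
```

The two metrics differ only in the direction of the search: completeness starts from the ground truth and asks what was missed, while accuracy starts from the reconstruction and asks what was misplaced.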

First, we see the results for completeness. These are histograms in which 0 error is farthest to the left, so the ideal graph would be a single spike on the left.
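A histogram of this kind can be built by counting per-point errors into fixed-width bins. This is a minimal sketch; `bin_width`, `num_bins`, and the choice to fold every large error into the last bin are assumptions for illustration.

```python
def error_histogram(errors, bin_width, num_bins):
    """Count errors into fixed-width bins starting at 0.

    Errors beyond the last bin edge are folded into the last bin, so a
    high-error tail shows up as weight on the far right.
    """
    counts = [0] * num_bins
    for error in errors:
        index = min(int(error / bin_width), num_bins - 1)
        counts[index] += 1
    return counts
```

With this binning, a good method concentrates its counts in the first bin, and a poor one spreads them toward the right.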

[Figures: completeness histograms for Method 1, Method 2, Method 3, and Method 4]

Here, method 3 is a clear winner! High completeness and very little variation.

Next, we examine the accuracy graphs.

[Figures: accuracy histograms for Method 1, Method 2, Method 3, and Method 4]

And here again, method 3 is a clear winner – superb accuracy, and very little high-error tail! And so currently, this method is what we are using to map 360 video into 3D reconstructions.

As always, stay tuned for the next post in our Research and Development blog series!
