A 3D reconstruction method without 3D convolutions: reconstruction takes only 70ms per frame on an A100

Author: Data School THU    Time: 2022.09.27

Source: Heart of the Machine

This article is about 1,500 words; a 5-minute read is recommended.

In this work, researchers from Niantic, UCL, and other institutions use a carefully designed and trained 2D network to achieve high-quality depth estimation and 3D reconstruction.

3D indoor scene reconstruction from posed images is usually split into two stages: per-image depth estimation, followed by depth fusion and surface reconstruction. Recently, a number of studies have proposed methods that perform reconstruction directly in a final 3D volumetric feature space. Although these methods achieve impressive reconstruction results, they depend on expensive 3D convolutional layers, which limits their use on resource-constrained platforms.

Now, researchers from Niantic, UCL, and other institutions have revisited the traditional pipeline and focused on high-quality multi-view depth prediction, finally achieving high-precision 3D reconstruction with a simple, off-the-shelf depth fusion method.

Paper address: https://nianticlabs.github.io/simplerecon/resources/simplerecon.pdf

GitHub address: https://github.com/nianticlabs/simplerecon

Paper homepage: https://nianticlabs.github.io/simplerecon/


The study carefully designs a 2D CNN equipped with a strong image prior, a plane-sweep feature volume, and geometric losses. SimpleRecon achieves clearly leading results in depth estimation and enables online, low-memory reconstruction.

As shown in the figure below, SimpleRecon reconstructs very quickly, taking only about 70ms per frame.


The comparison between SimpleRecon and other methods is shown below:


Method

The depth estimation model sits at the intersection of monocular depth estimation and plane-sweep MVS. The researchers use a cost volume to augment a depth-prediction encoder-decoder architecture, as shown in Figure 2. An image encoder extracts matching features from the reference and source images, which feed the cost volume; the output of the cost volume is then processed by a 2D convolutional encoder-decoder network. In addition, the researchers augment this decoder with image-level features extracted by a separate pre-trained image encoder.
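To make that data flow concrete, here is a minimal structural sketch in PyTorch (the paper's stated implementation framework). The module names and interfaces below are hypothetical placeholders that only illustrate the pipeline described above; they are not the authors' code.

```python
import torch.nn as nn

class DepthModelSketch(nn.Module):
    """Illustrative skeleton: matching encoder -> cost volume -> 2D decoder."""
    def __init__(self, matching_encoder, image_encoder, cost_volume, depth_decoder):
        super().__init__()
        self.matching_encoder = matching_encoder  # shared 2D CNN producing matching features
        self.image_encoder = image_encoder        # separate pre-trained encoder for image-level features
        self.cost_volume = cost_volume            # plane-sweep matching over depth hypotheses
        self.depth_decoder = depth_decoder        # 2D conv encoder-decoder predicting depth

    def forward(self, ref_img, src_imgs, poses, intrinsics):
        ref_feat = self.matching_encoder(ref_img)                  # (B, C, H, W)
        src_feats = [self.matching_encoder(s) for s in src_imgs]   # one feature map per source view
        cv = self.cost_volume(ref_feat, src_feats, poses, intrinsics)  # (B, D, H, W)
        img_feats = self.image_encoder(ref_img)                    # multi-scale image features
        return self.depth_decoder(cv, img_feats)                   # dense depth map
```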

The key idea of this work is to inject readily available metadata, together with typical deep image features, into the cost volume, giving the network online access to useful information such as geometry and relative camera poses. Figure 3 details the feature volume construction. By integrating this previously unexploited information, the model significantly outperforms previous methods in depth prediction, without expensive 4D cost volumes, complex temporal fusion, or Gaussian processes.
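The following is a hedged sketch of what such metadata injection can look like: for every pixel and depth hypothesis, the matching signal (a dot product of reference and warped source features) is concatenated with cheap metadata (for example the hypothesis depth, ray directions, or relative-pose descriptors), and a small shared MLP reduces the vector to a single score, keeping the volume 2D-decodable. The exact metadata set, shapes, and layer sizes here are illustrative assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

class MetadataCostVolumeSketch(nn.Module):
    def __init__(self, meta_dim, hidden=64):
        super().__init__()
        # Shared MLP applied independently at every (pixel, depth-plane) location.
        self.mlp = nn.Sequential(
            nn.Linear(1 + meta_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),
        )

    def forward(self, ref_feat, warped_src_feat, metadata):
        # ref_feat:        (B, C, H, W)     reference-view matching features
        # warped_src_feat: (B, D, C, H, W)  source features warped onto D depth planes
        # metadata:        (B, D, M, H, W)  e.g. plane depth, ray directions, pose descriptors
        dot = (ref_feat.unsqueeze(1) * warped_src_feat).sum(dim=2, keepdim=True)  # (B, D, 1, H, W)
        x = torch.cat([dot, metadata], dim=2)        # (B, D, 1+M, H, W)
        x = x.permute(0, 1, 3, 4, 2)                 # channels last for the per-location MLP
        return self.mlp(x).squeeze(-1)               # (B, D, H, W) matching scores
```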

The method is implemented in PyTorch, with EfficientNetV2-S as the backbone and a UNet++-like decoder. The first two blocks of ResNet18 are used for matching feature extraction, and the optimizer is AdamW. Training took 36 hours to complete.
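For orientation only, here is a minimal sketch of how backbones like those reported above could be instantiated with off-the-shelf libraries (timm and torchvision). It mirrors the stated choices (EfficientNetV2-S image encoder, early ResNet18 blocks for matching features, AdamW) but is not the authors' training code; the model names, layer cut-off, and hyperparameters are illustrative assumptions.

```python
import timm
import torch
import torchvision

# Image encoder: EfficientNetV2-S returning multi-scale feature maps.
image_encoder = timm.create_model("tf_efficientnetv2_s", pretrained=True, features_only=True)

# Matching-feature encoder: only the early layers of ResNet18
# (an illustrative reading of "the first two blocks").
resnet18 = torchvision.models.resnet18(weights="IMAGENET1K_V1")
matching_encoder = torch.nn.Sequential(
    resnet18.conv1, resnet18.bn1, resnet18.relu, resnet18.maxpool,
    resnet18.layer1, resnet18.layer2,
)

params = list(image_encoder.parameters()) + list(matching_encoder.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-4, weight_decay=1e-2)  # illustrative values
```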

Network architecture design

The network is based on a 2D convolutional encoder-decoder architecture. While building it, the researchers found several design choices that significantly improve depth prediction accuracy, mainly including:

Cost volume fusion: although RNN-based temporal fusion methods are commonly used, they significantly increase system complexity. Instead, this work keeps cost volume fusion as simple as possible, and finds that simply summing the dot-product matching costs between the reference view and each source view yields depth results competitive with SOTA (see the sketch after this list).

Image encoder and feature matching encoder: previous research shows that the image encoder is very important for depth estimation, in both monocular and multi-view settings. For example, DeepVideoMVS uses MnasNet as its image encoder, which has relatively low latency. This work instead recommends the small but more powerful EfficientNetV2-S encoder. Although this increases the parameter count and reduces execution speed by about 10%, it greatly improves depth estimation accuracy.

Fusing multi-scale image features into the cost volume encoder: in 2D-CNN-based depth stereo and multi-view stereo, image features are usually combined with the cost volume output at a single scale. Recently, DeepVideoMVS proposed splicing deep image features at multiple scales, adding skip connections between the image encoder and the cost volume encoder at every resolution. This is very helpful for its LSTM-based fusion network, and this study finds it is equally important for their architecture.
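As a hedged sketch of the "keep fusion simple" idea from the first design point above: the per-pixel matching cost between the reference view and each source view is computed independently and then simply reduced over the source views, instead of being fused with an RNN. Shapes and the choice of mean versus sum below are illustrative assumptions.

```python
import torch

def fuse_cost_volumes(ref_feat, warped_src_feats):
    # ref_feat:         (B, C, H, W)              reference-view matching features
    # warped_src_feats: list of (B, D, C, H, W),  one per source view, warped onto D planes
    ref = ref_feat.unsqueeze(1)                                     # broadcast over depth planes
    costs = [(ref * src).sum(dim=2) for src in warped_src_feats]    # each (B, D, H, W)
    return torch.stack(costs, dim=0).mean(dim=0)                    # simple reduction over views
```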

Experiments

The method was trained and evaluated on the 3D scene reconstruction dataset ScanNetV2. Table 1 below uses the metrics proposed by Eigen et al. (2014) to evaluate the depth prediction performance of several network models.
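For reference, here is a short, generic sketch of some standard depth metrics introduced by Eigen et al. (2014) and commonly reported in such tables (absolute relative error, RMSE, and threshold accuracy). This is an illustrative implementation, not the paper's evaluation script.

```python
import torch

def eigen_depth_metrics(pred, gt):
    # pred, gt: predicted / ground-truth depth maps; only pixels with valid depth are scored
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]
    abs_rel = ((pred - gt).abs() / gt).mean()
    rmse = torch.sqrt(((pred - gt) ** 2).mean())
    ratio = torch.max(pred / gt, gt / pred)
    delta1 = (ratio < 1.25).float().mean()
    return {"abs_rel": abs_rel.item(), "rmse": rmse.item(), "delta<1.25": delta1.item()}
```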


Surprisingly, although the model uses no 3D convolutions, it outperforms all baseline models on the depth prediction metrics. Moreover, even the variant without metadata encoding beats previous methods, indicating that a carefully designed and trained 2D network is enough for high-quality depth estimation. Figures 4 and 5 show qualitative results for depth and normals.


For 3D reconstruction evaluation, the study uses the standard protocol established by TransformerFusion. The results are shown in Table 2 below.

Reducing sensor latency is critical for online and interactive 3D reconstruction applications. Table 3 below shows, given a new RGB frame, each model's per-frame integration time.

To verify the effectiveness of each component of the method, the researchers conducted ablation experiments. The results are shown in Table 4 below.

Editor: Wang Jing

- END -
