Thesis Information

Thesis Title (Chinese):

 Research on Multi-view Stereo Reconstruction Algorithm Based on Deep Learning

Name:

 Kong Haoran

Student ID:

 21207223089    

Confidentiality Level:

 Public

Thesis Language:

 Chinese (chi)

Discipline Code:

 085400    

Discipline Name:

 Engineering - Electronic Information

Student Type:

 Master's

Degree Level:

 Master of Engineering

Degree Year:

 2024    

Degree-granting Institution:

 Xi'an University of Science and Technology

School/Department:

 School of Communication and Information Engineering

Major:

 Electronic Information

Research Direction:

 Computer Vision

First Supervisor:

 Zhu Daixian

First Supervisor's Institution:

 Xi'an University of Science and Technology

Thesis Submission Date:

 2024-06-13    

Thesis Defense Date:

 2024-06-04    

Thesis Title (English):

 Research on Multi-view Stereo Reconstruction Algorithm Based on Deep Learning    

Thesis Keywords (Chinese):

 Multi-view stereo reconstruction ; Attention mechanism ; Dilated convolution ; Neural rendering ; Recurrent neural network

Thesis Keywords (English):

 Multi-view stereo reconstruction ; Attention mechanism ; Dilated convolution ; Neural rendering ; Recurrent neural network

Thesis Abstract (Chinese):

With the rapid development of computer vision, multi-view stereo (MVS) reconstruction has become a field of wide interest. MVS reconstruction aims to recover the 3D structure of a scene from multiple images with known camera parameters, and plays an important role in virtual reality, augmented reality, and film special effects. Deep learning-based MVS algorithms already reconstruct richly textured regions well, but real scenes contain challenging regions such as weak textures, non-Lambertian surfaces, and occlusions, where existing algorithms suffer from poor reconstruction quality and high memory consumption. To address these problems, this thesis improves the multi-view feature extraction network, suppresses noise in the cost volume, and lightens the 3D CNN network. The main contributions are as follows:

(1) Improved multi-view feature extraction network. To address the low completeness and overall accuracy of deep learning-based MVS reconstruction in weakly textured and non-Lambertian regions, this thesis builds on the coarse-to-fine depth estimation of CasMVSNet and proposes a multi-scale feature extraction algorithm based on an attention mechanism and dilated convolution. The algorithm uses three parallel dilated convolutions together with attention modules to enlarge the receptive field while capturing dependencies between features, obtaining global contextual information and strengthening feature representation in challenging regions. Experiments on the public DTU dataset show that, compared with the original CasMVSNet, the completeness error and overall error of the reconstructed point cloud are reduced by 15.8% and 4.2%, respectively.

(2) Optimization of the noisy cost volume. Because the features of the same 3D point in occluded regions differ markedly between images, the MVS network is prone to feature mismatches that fill the cost volume with noise; regularizing this noisy cost volume with a 3D CNN then produces rough depth maps, degrading the completeness of scene reconstruction. This thesis therefore builds a neural rendering network based on multi-view semantic features and a neural encoding volume, and applies a reference-view rendering loss to continuously optimize the neural radiance field so that the geometric and appearance information of the scene is resolved accurately. A depth-consistency loss is also introduced to keep the MVS network and the neural rendering network geometrically consistent, mitigating the noisy cost volume caused by mutual occlusion between views and improving the quality of the reconstructed point cloud. Under the same conditions on the DTU dataset, the completeness error is reduced by a further 8.6% compared with the point cloud reconstructed by the improved multi-scale feature extraction network. On the Tanks and Temples intermediate subset, the reconstructed point clouds reach an average F-score of 60.31, showing strong generalization.

(3) Lightweight 3D CNN regularization. To address the high memory consumption of 3D CNNs, this thesis proposes a regularization algorithm that combines a 2D U-Net with recurrent neural networks. The algorithm applies a Hybrid LSTM-Unet module group in the initial stage, where the number of depth planes is large, and a Hybrid GRU-Unet module group in the refinement stage, where the number of depth planes is small. Experiments on the DTU dataset show that, compared with the MVS algorithm with the neural rendering network, this algorithm reduces memory usage by 22.2% and the number of network parameters by 63.6%, while fully exploiting spatial context to improve the overall accuracy of the reconstructed point clouds by 2.7%.

Thesis Abstract (English):

With the rapid development of computer vision technology, multi-view stereo (MVS) reconstruction has become a field of great interest. MVS reconstruction aims to reconstruct a scene from multiple view images with known camera parameters to obtain 3D information, and plays an important role in virtual reality, augmented reality, and movie special effects. Deep learning-based MVS reconstruction algorithms have achieved good results in texture-rich regions, but real scenes contain complex regions such as weak textures, non-Lambertian surfaces, and occlusions, where existing algorithms suffer from poor reconstruction quality and high memory occupation. To address these problems, this thesis improves the multi-view feature extraction network, optimizes the noisy cost volume, and lightens the 3D CNN network, as follows:

(1) Improvement of the multi-view feature extraction network. To address the low completeness and overall accuracy of deep learning-based MVS reconstruction in weakly textured and non-Lambertian surface regions, this thesis builds on the coarse-to-fine depth estimation of the CasMVSNet model and proposes a multi-scale feature extraction algorithm based on an attention mechanism and dilated convolution. The algorithm utilizes three parallel dilated convolutions and attention modules to expand the receptive field while capturing dependencies between features, obtaining global contextual information and enhancing feature representation in challenging regions. Experiments conducted on the public DTU dataset show that the completeness error and overall error of the reconstructed point cloud are reduced by 15.8% and 4.2%, respectively, compared with the original CasMVSNet model.
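The three-branch dilated-convolution idea can be illustrated with a minimal NumPy sketch. This is not the thesis's network: the single-channel input, averaging kernel, sum fusion, and omission of the attention modules are all simplifying assumptions, chosen only to show how different dilation rates sample the same kernel over progressively larger neighborhoods.

```python
import numpy as np

def dilated_conv2d(x, kernel, dilation):
    """Single-channel 2D convolution with a dilation factor (zero padding,
    'same' output size). A 3x3 kernel with dilation d covers a
    (2d+1) x (2d+1) neighborhood without adding parameters."""
    k = kernel.shape[0]
    pad = dilation * (k // 2)
    xp = np.pad(x, pad)
    h, w = x.shape
    out = np.zeros_like(x, dtype=float)
    for i in range(h):
        for j in range(w):
            # sample the padded input at dilated offsets around (i, j)
            patch = xp[i : i + dilation * k : dilation,
                       j : j + dilation * k : dilation]
            out[i, j] = np.sum(patch * kernel)
    return out

# three parallel branches with dilation rates 1, 2, 3, fused by summation
x = np.random.rand(16, 16)
k = np.ones((3, 3)) / 9.0          # averaging kernel, purely illustrative
fused = sum(dilated_conv2d(x, k, d) for d in (1, 2, 3))
```

In a real network each branch would have learned kernels over many channels, and the branch outputs would be concatenated and passed through the attention modules rather than summed.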

(2) Optimization of the noisy cost volume. Since the features of the same 3D location in occluded regions differ obviously between images, the MVS reconstruction network is prone to feature mismatches that fill the cost volume with noise; the noisy cost volume, regularized by the 3D CNN structure, then yields rough depth maps, which affects the completeness of scene reconstruction. This thesis therefore establishes a neural rendering network using multi-view semantic features and a neural encoding volume, and applies a reference-view rendering loss to continuously optimize the neural radiance field scene, accurately resolving the geometric and appearance information it expresses. Meanwhile, a depth-consistency loss maintains geometric consistency between the MVS reconstruction network and the neural rendering network, mitigating the noisy cost volume caused by mutual occlusion between views and thus improving the quality of the reconstructed point cloud. Under the same conditions on the DTU dataset, the completeness error is further reduced by 8.6% compared with the point cloud reconstructed by the improved multi-scale feature extraction network. In addition, on the Tanks and Temples intermediate subset, the reconstructed point clouds reach an average F-score of 60.31, reflecting strong generalization ability.
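The depth-consistency idea can be sketched as follows. The abstract does not specify the loss norm or the rendering formulation, so the L1 form and the standard alpha-compositing depth below are assumptions, not the thesis's exact implementation.

```python
import numpy as np

def rendered_depth(sigmas, ts):
    """Expected depth along one ray via standard volume rendering:
    convert sampled densities to per-sample opacities, accumulate
    transmittance, and take the weight-averaged sample depth."""
    deltas = np.append(np.diff(ts), 1e10)                   # sample spacing
    alphas = 1.0 - np.exp(-sigmas * deltas)                 # opacity per sample
    trans = np.cumprod(np.append(1.0, 1.0 - alphas[:-1]))   # transmittance
    weights = trans * alphas
    return np.sum(weights * ts)

def depth_consistency_loss(mvs_depth, nerf_depth, mask):
    """Assumed L1 consistency between the MVS depth map and the depth
    rendered from the radiance field, evaluated on valid pixels only."""
    return np.abs(mvs_depth - nerf_depth)[mask].mean()
```

Minimizing this term pulls the two depth estimates toward each other, which is how the rendering branch can suppress noise that occlusion injects into the cost volume.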

(3) Lightweighting of the 3D CNN network. To address the high memory occupation of 3D CNNs, this thesis proposes a regularization algorithm that combines a 2D U-Net model with recurrent neural networks. The algorithm applies Hybrid LSTM-Unet module groups in the initial stage, which has many depth planes, and Hybrid GRU-Unet module groups in the refinement stage, which has few. Experimental results on the DTU dataset show that, compared with the MVS reconstruction algorithm that introduces the neural rendering network, this algorithm reduces memory usage by 22.2% and the number of network parameters by 63.6%, while making full use of spatial context to improve the overall accuracy of the reconstructed point clouds by 2.7%.
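The memory saving comes from sweeping a recurrent cell over the cost volume one depth plane at a time, so only a single plane (plus the hidden state) is resident instead of the full 3D volume. The pixel-wise (1x1) GRU below is a stand-in for the thesis's Hybrid GRU-Unet modules, whose internal structure the abstract does not give; shapes and initialization are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class PixelGRU:
    """Pixel-wise GRU cell: every gate is a 1x1 'convolution' (a matmul
    over the channel axis), applied independently at each pixel."""
    def __init__(self, c, rng):
        self.Wz = rng.standard_normal((c, 2 * c)) * 0.1   # update gate
        self.Wr = rng.standard_normal((c, 2 * c)) * 0.1   # reset gate
        self.Wh = rng.standard_normal((c, 2 * c)) * 0.1   # candidate state

    def step(self, x, h):
        xh = np.concatenate([x, h], axis=-1)              # (..., 2c)
        z = sigmoid(xh @ self.Wz.T)
        r = sigmoid(xh @ self.Wr.T)
        cand = np.tanh(np.concatenate([x, r * h], axis=-1) @ self.Wh.T)
        return (1 - z) * h + z * cand                     # gated update

rng = np.random.default_rng(0)
cell = PixelGRU(8, rng)
h = np.zeros((32, 32, 8))                  # hidden state, one plane in memory
for plane in np.random.rand(48, 32, 32, 8):   # sweep 48 depth planes in order
    h = cell.step(plane, h)
```

Replacing the 1x1 gates with U-Net-shaped convolutions (and an LSTM cell in the many-plane initial stage) recovers the spatial context that this pixel-wise sketch deliberately leaves out.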

CLC Number:

 TP391.4

Open Access Date:

 2024-06-14    
