Thesis Information

Title (Chinese): 基于深度学习的多视图三维重建算法研究

Name: 巩若琳

Student ID: 22207223096

Confidentiality Level: Public

Thesis Language: Chinese (chi)

Discipline Code: 085400

Discipline: Engineering - Electronic Information

Student Type: Master's student

Degree Level: Master of Engineering

Degree Year: 2025

Degree-Granting Institution: Xi'an University of Science and Technology

School: College of Communication and Information Engineering

Major: Electronic Information

Research Area: Computer Vision

Primary Supervisor: 朱代先

Primary Supervisor's Institution: Xi'an University of Science and Technology

Submission Date: 2025-06-16

Defense Date: 2025-06-06

Title (English): Research on Multi-View 3D Reconstruction Algorithm Based on Deep Learning

Keywords (Chinese): 多视图三维重建; 深度学习; 深度可分离卷积; 注意力机制; 循环神经网络

Keywords (English): Multi-view 3D reconstruction; Deep learning; Depthwise separable convolution; Attention mechanism; Recurrent neural network

Abstract (Chinese):

With computer vision advancing at an unprecedented pace, multi-view 3D reconstruction, one of the field's research hotspots, is widely applied in virtual reality, autonomous driving, cultural heritage preservation, and other areas. Multi-view 3D reconstruction aims to recover the 3D structure and geometric information of real scenes from multi-view images. Compared with traditional methods, deep learning-based approaches have made notable progress in depth estimation. However, in complex scenes involving weak textures, non-Lambertian surfaces, and occlusion, existing methods still face incomplete reconstruction results, poor generalization to different scenes, and sharply increased model parameter counts.

To meet these challenges, this thesis takes multi-view 3D reconstruction as its research object, aiming to improve reconstruction completeness and overall efficiency, and concentrates on two aspects: optimizing multi-view feature extraction and lightweighting the 3D convolutional neural network (3D CNN) structure. The specific work is as follows:

(1) Using CasMVSNet as the baseline, a multi-scale feature extraction network is proposed that combines a parallel convolution-attention block with a feature aggregation module. In the top-level skip connection of the feature pyramid network, a parallel convolution-attention block built on depthwise separable convolution and self-attention captures local detail and global context with different kinds of convolution, while self-attention dynamically reweights features, achieving efficient extraction and fusion of multi-scale features. A channel-attention feature aggregation module is further appended to the end of the feature pyramid network to highlight key features, so that effective feature information is captured and exploited more accurately, supplying higher-quality input to the regularization stage. Experiments show that on the DTU dataset the completeness and overall score of the reconstructed point clouds improve over the baseline by 26.23% and 6.48%, respectively, easing incomplete reconstruction and poor visual quality in complex scenes. On the Tanks and Temples dataset the mean F-score reaches 62.83, a 10.54% gain over the baseline, indicating strong generalization.
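
To make the parallel structure concrete, the minimal PyTorch sketch below pairs a depthwise separable convolution branch (local detail) with a self-attention branch (global context) and fuses the two. It is an illustration under stated assumptions, not the thesis's implementation: the name ParallelConvAttnBlock, the channel sizes, the use of nn.MultiheadAttention, and the 1x1 fusion are all hypothetical choices.

    # Minimal sketch (assumptions, not the thesis's code) of a parallel
    # convolution-attention block: depthwise separable convolution for local
    # detail runs in parallel with self-attention for global context.
    import torch
    import torch.nn as nn

    class ParallelConvAttnBlock(nn.Module):  # hypothetical name
        def __init__(self, channels: int, num_heads: int = 4):
            super().__init__()  # channels must be divisible by num_heads
            # Depthwise separable convolution: per-channel 3x3 + 1x1 pointwise.
            self.dwconv = nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
                nn.Conv2d(channels, channels, 1),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            # Self-attention over flattened spatial positions.
            self.norm = nn.LayerNorm(channels)
            self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
            # 1x1 fusion of the two parallel branches, plus a residual path.
            self.fuse = nn.Conv2d(2 * channels, channels, 1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b, c, h, w = x.shape
            local_feat = self.dwconv(x)                      # local details
            tokens = self.norm(x.flatten(2).transpose(1, 2)) # (B, H*W, C)
            global_feat, _ = self.attn(tokens, tokens, tokens)
            global_feat = global_feat.transpose(1, 2).reshape(b, c, h, w)
            return self.fuse(torch.cat([local_feat, global_feat], dim=1)) + x

    # e.g. y = ParallelConvAttnBlock(32)(torch.randn(1, 32, 64, 80))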

(2) To address the high memory usage and computational cost of 3D CNNs on high-resolution data, a hybrid recurrent regularization network is proposed. The method merges the advantages of the 2D U-Net architecture and recurrent neural networks, choosing different module groups for regularization according to the stage of the reconstruction task. In the initial stage, where image resolution is low and many depth hypothesis planes must be estimated, a Hybrid Unet-ConvLSTMCell module regularizes along the depth dimension; in the later refinement stages, where resolution is high and fewer depth planes remain, a Hybrid Unet-ConvGRU module performs the regularization. Experiments show that on the DTU dataset the completeness of the reconstructed point clouds improves by 26.49% over the baseline; the strategy makes full use of spatial context, maintaining reconstruction quality while markedly reducing parameter count and memory consumption, thereby alleviating the GPU memory pressure of traditional 3D CNNs.
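
As a rough illustration of recurrent regularization along the depth dimension, the sketch below processes the cost volume one 2D depth-hypothesis slice at a time with a convolutional GRU, so only 2D state is held in memory instead of a full 3D volume. The cell structure, channel counts, and the names ConvGRUCell and regularize_along_depth are assumptions for illustration, not the thesis's Hybrid Unet-ConvGRU module.

    # Sketch (assumed structure): a convolutional GRU regularizing a cost
    # volume slice-by-slice along depth, keeping only 2D activations alive.
    import torch
    import torch.nn as nn

    class ConvGRUCell(nn.Module):
        def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
            super().__init__()
            p = k // 2
            self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=p)  # update/reset
            self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=p)       # candidate state

        def forward(self, x, h):
            z, r = torch.sigmoid(self.gates(torch.cat([x, h], 1))).chunk(2, 1)
            h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], 1)))
            return (1 - z) * h + z * h_tilde

    def regularize_along_depth(cost_volume, cell, head):
        # cost_volume: (B, C, D, H, W); iterate over the D depth hypotheses.
        b, c, d, h, w = cost_volume.shape
        state = cost_volume.new_zeros(b, cell.cand.out_channels, h, w)
        logits = []
        for i in range(d):                 # one 2D slice per depth plane
            state = cell(cost_volume[:, :, i], state)
            logits.append(head(state))     # 1-channel score for this plane
        return torch.stack(logits, dim=2)  # (B, 1, D, H, W); softmax over D

    # Usage sketch: cell = ConvGRUCell(8, 8); head = nn.Conv2d(8, 1, 3, padding=1)
    # prob = torch.softmax(regularize_along_depth(
    #     torch.randn(1, 8, 48, 64, 80), cell, head), dim=2)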

Overall, the proposed multi-view 3D reconstruction methods based on multi-scale feature extraction and hybrid recurrent regularization extract richer depth information, raise the quality of feature representations, and effectively suppress edge and background noise, while lowering computational cost and strengthening adaptability to high-resolution data, thereby improving reconstruction quality.

Abstract (English):

As computer vision evolves at an unprecedented speed, multi-view 3D reconstruction technology, one of the research hotspots in this field, has a wide range of applications in virtual reality, autonomous driving, and cultural heritage protection. Multi-view 3D reconstruction aims to reconstruct the 3D structure and geometric information of a real scene using multi-view images. Compared with traditional methods, deep learning-based methods have made significant progress in depth estimation. However, in complex scenes with weak textures, non-Lambertian surfaces, and occlusions, existing methods still face challenges such as incomplete reconstruction results, poor generalization to different scenes, and a significant increase in the number of model parameters.

Aiming at the above challenges, this thesis takes multi-view 3D reconstruction technology as the research object, with the goal of improving reconstruction completeness and overall efficiency, and focuses on two directions: optimizing multi-view feature extraction and lightweighting the 3D convolutional neural network (3D CNN) structure. The main contributions are as follows:

(1) Taking CasMVSNet as the baseline, this thesis proposes a multi-scale feature extraction network that combines a parallel convolution-attention block with a feature aggregation module. In the top-level skip connection of the feature pyramid network, a parallel convolution-attention block based on depthwise separable convolution and a self-attention mechanism is designed: different types of convolution capture local details and global context, while self-attention dynamically adjusts feature weights, realizing efficient extraction and fusion of multi-scale features. In addition, a feature aggregation module based on channel attention is added at the end of the feature pyramid network to highlight key features, so that effective feature information is captured and utilized more accurately and higher-quality input is provided for the regularization stage. Experimental results show that on the DTU dataset the completeness and overall score of the reconstructed point clouds improve by 26.23% and 6.48%, respectively, over the baseline network, alleviating incomplete reconstruction and poor visual quality in complex scenes. On the Tanks and Temples dataset the mean F-score reaches 62.83, 10.54% higher than the baseline, demonstrating strong generalization performance.
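
The channel-attention feature aggregation can be pictured with a squeeze-and-excitation-style sketch: global average pooling summarizes each channel, a small bottleneck produces per-channel weights, and the features are rescaled so informative channels are emphasized. The name ChannelAttentionAggregation, the reduction ratio, and the exact layer layout are assumed for illustration and need not match the thesis's module.

    # Illustrative SE-style channel attention (assumed structure): channels
    # are reweighted by a gate computed from globally pooled statistics.
    import torch
    import torch.nn as nn

    class ChannelAttentionAggregation(nn.Module):  # hypothetical name
        def __init__(self, channels: int, reduction: int = 4):
            super().__init__()
            self.gate = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),                      # squeeze: (B, C, 1, 1)
                nn.Conv2d(channels, channels // reduction, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1),
                nn.Sigmoid(),                                 # per-channel weights in (0, 1)
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return x * self.gate(x)                           # emphasize key channels

    # e.g. feat = ChannelAttentionAggregation(32)(torch.randn(2, 32, 128, 160))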

(2) A hybrid recurrent regularization network is proposed to address the high memory occupation and computational cost of 3D CNNs when processing high-resolution data. The method combines the advantages of the 2D U-Net architecture and recurrent neural networks, and adopts differentiated module groups for regularization based on the stage characteristics of the reconstruction task. In the initial stage, where image resolution is low and many depth hypothesis planes must be estimated, the Hybrid Unet-ConvLSTMCell module performs regularization along the depth dimension; in the subsequent refinement stages, where resolution is high and fewer depth planes are estimated, the Hybrid Unet-ConvGRU module takes over. Experimental results show that on the DTU dataset the completeness of the reconstructed point clouds improves by 26.49% over the baseline network; by fully exploiting spatial context, the strategy preserves reconstruction quality while significantly reducing the number of network parameters and memory consumption, effectively mitigating the GPU memory bottleneck of traditional 3D CNNs.
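
The memory argument can be made concrete with a back-of-the-envelope comparison: a 3D CNN materializes activations over all D depth planes at once, so activation memory grows linearly with D, while the recurrent scheme only ever holds one 2D slice of state. The short Python script below uses assumed, purely illustrative tensor sizes (C, D, H, W), not measurements from the thesis.

    # Back-of-the-envelope activation-memory comparison (illustrative only):
    # a 3D regularizer holds a full (C, D, H, W) slab, the recurrent scheme
    # holds a single (C, H, W) hidden state per step.
    C, D, H, W = 8, 192, 512, 640   # assumed cost-volume size (not from the thesis)

    vol_3d = C * D * H * W          # elements alive at once in a 3D conv layer
    vol_rnn = C * H * W             # elements alive at once in the GRU state

    print(f"3D CNN slab : {vol_3d * 4 / 2**20:8.1f} MiB (float32)")
    print(f"ConvGRU step: {vol_rnn * 4 / 2**20:8.1f} MiB (float32)")
    print(f"ratio       : {vol_3d / vol_rnn:.0f}x (= D, the number of depth planes)")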

The multi-view 3D reconstruction methods proposed in this thesis, based on multi-scale feature extraction and hybrid recurrent regularization, extract richer depth information, improve the quality of feature representation, and effectively suppress edge and background noise, while reducing computational cost and enhancing adaptability to high-resolution data, thereby improving the reconstruction results.

References:

[1] Campbell N D F, Vogiatzis G, Hernández C, et al. Using multiple hypotheses to improve depth-maps for multi-view stereo[C]//Computer Vision–ECCV 2008: 10th European Conference on Computer Vision, Marseille, France, October 12-18, 2008, Proceedings, Part I. Springer Berlin Heidelberg, 2008: 766-779.

[2] Galliani S, Lasinger K, Schindler K. Massively parallel multiview stereopsis by surface normal diffusion[C]//Proceedings of the IEEE International Conference on Computer Vision. 2015: 873-881.

[3] Schönberger J L, Zheng E, Frahm J M, et al. Pixelwise view selection for unstructured multi-view stereo[C]//Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III. Springer International Publishing, 2016: 501-518.

[4] Yao Y, Luo Z, Li S, et al. MVSNet: depth inference for unstructured multi-view stereo[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 767-783.

[5] Yao Y, Luo Z, Li S, et al. Recurrent MVSNet for high-resolution multi-view stereo depth inference[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 5525-5534.

[6] Chen R, Han S, Xu J, et al. Point-based multi-view stereo network[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 1538-1547.

[7] Gu X, Fan Z, Zhu S, et al. Cascade cost volume for high-resolution multi-view stereo and stereo matching[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 2495-2504.

[8] Wang J A, Huang L, Pang D W, et al. Dense point cloud reconstruction network based on adaptive aggregation recurrent recursion[J]. Journal of Graphics, 2024, 45(1): 230-239. (in Chinese)

[9] Tong W, Zhang M M, Li D F, et al. Multi-view scene reconstruction based on edge-assisted epipolar Transformer[J]. Journal of Electronics & Information Technology, 2023, 45(10): 3483-3491. (in Chinese)

[10] Wang M, Zhao M F, Song T, et al. Multi-view stereo reconstruction method based on feature aggregation Transformer[J]. Laser & Optoelectronics Progress, 2024, 61(14): 181-190. (in Chinese)

[11] Ji T J, Zheng L M, Cao K R, et al. 3D reconstruction of stacked workpieces based on multi-view stereo deep learning[J]. Computer Systems & Applications, 2024, 33(12): 153-160. DOI: 10.15888/j.cnki.csa.009710. (in Chinese)

[12] Li J, Bai Z, Cheng W, et al. Feature pyramid multi-view stereo network based on self-attention mechanism[C]//Proceedings of the 2022 5th International Conference on Image and Graphics Processing. 2022: 226-233.

[13] Yu A, Guo W, Liu B, et al. Attention aware cost volume pyramid based multi-view stereo network for 3D reconstruction[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2021, 175: 448-460.

[14] Liu W, Wang J, Qu H, et al. Hierarchical MVSNet with cost volume separation and fusion based on U-shape feature extraction[J]. Multimedia Systems, 2023, 29(1): 377-387.

[15] Wei Z, Zhu Q, Min C, et al. AA-RMVSNet: adaptive aggregation recurrent multi-view stereo network[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 6187-6196.

[16] Liao J, Ding Y, Shavit Y, et al. WT-MVSNet: window-based transformers for multi-view stereo[J]. Advances in Neural Information Processing Systems, 2022, 35: 8564-8576.

[17] Zhang X, Hu Y, Wang H, et al. Long-range attention network for multi-view stereo[C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2021: 3782-3791.

[18] Ding Y, Yuan W, Zhu Q, et al. TransMVSNet: global context-aware multi-view stereo network with transformers[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 8585-8594.

[19] Fan M R, Shen B K, Niu W L, et al. Survey on multi-view stereo based on deep learning[J]. Journal of Software, 2025, 36(4): 1692-1714. (in Chinese)

[20] Lhuillier M, Quan L. A quasi-dense approach to surface reconstruction from uncalibrated images[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005, 27(3): 418-433.

[21] Kazhdan M, Hoppe H. Screened Poisson surface reconstruction[J]. ACM Transactions on Graphics (TOG), 2013, 32(3): 1-13.

[22] Seitz S M, Dyer C R. Photorealistic scene reconstruction by voxel coloring[J]. International Journal of Computer Vision, 1999, 35: 151-173.

[23] Kutulakos K N, Seitz S M. A theory of shape by space carving[J]. International Journal of Computer Vision, 2000, 38: 199-218.

[24] Goesele M, Snavely N, Curless B, et al. Multi-view stereo for community photo collections[C]//2007 IEEE 11th International Conference on Computer Vision. IEEE, 2007: 1-8.

[25] Xu Q, Tao W. Multi-scale geometric consistency guided multi-view stereo[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 5483-5492.

[26] Yan J, Wei Z, Yi H, et al. Dense hybrid recurrent multi-view stereo net with dynamic consistency checking[C]//European Conference on Computer Vision. Cham: Springer International Publishing, 2020: 674-689.

[27] Saeed S, Lee S, Cho Y, et al. ASPPMVSNet: a high-receptive-field multiview stereo network for dense three-dimensional reconstruction[J]. ETRI Journal, 2022, 44(6): 1034-1046.

[28] Zang X D, Yang F Z, Chang M, et al. MG-MVSNet: multiple granularities feature fusion network for multi-view stereo[J]. Neurocomputing, 2023, 528: 35-47.

[29] Ramachandran P, Parmar N, Vaswani A, et al. Stand-alone self-attention in vision models[J]. Advances in Neural Information Processing Systems, 2019, 32: 68-80.

[30] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[J]. Advances in Neural Information Processing Systems, 2017, 30: 5998-6008.

[31] Zhu D, Kong H, Qiu Q, et al. Multi-view stereo network based on attention mechanism and neural volume rendering[J]. Electronics, 2023, 12(22): 4603.

[32] Zhang S, Wei Z, Xu W, et al. DSC-MVSNet: attention aware cost volume regularization based on depthwise separable convolution for multi-view stereo[J]. Complex & Intelligent Systems, 2023, 9(6): 6953-6969.

[33] Mildenhall B, Srinivasan P P, Tancik M, et al. NeRF: representing scenes as neural radiance fields for view synthesis[J]. Communications of the ACM, 2021, 65(1): 99-106.

[34] Liu T, Ye X, Shi M, et al. Geometry-aware reconstruction and fusion-refined rendering for generalizable neural radiance fields[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 7654-7663.

[35] Wang G, Wang P, Chen Z, et al. PERF: panoramic neural radiance field from a single panorama[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46(10): 6905-6918.

[36] Kerbl B, Kopanas G, Leimkühler T, et al. 3D Gaussian splatting for real-time radiance field rendering[J]. ACM Transactions on Graphics (TOG), 2023, 42(4): 139:1-139:14.

[37] Liu T, Wang G, Hu S, et al. MVSGaussian: fast generalizable Gaussian splatting reconstruction from multi-view stereo[C]//European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024: 37-53.

[38] Gallup D, Frahm J M, Mordohai P, et al. Real-time plane-sweeping stereo with multiple sweeping directions[C]//2007 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2007: 1-8.

[39] Yan H B, Xu F Q, Huang L E, et al. Survey of multi-view stereo reconstruction methods based on deep learning[J]. Optics and Precision Engineering, 2023, 31(16): 2444-2464. (in Chinese)

[40] Wang F, Zhu Q, Chang D, et al. Learning-based multi-view stereo: a survey[J]. arXiv preprint arXiv:2408.15235, 2024.

[41] Chen X, Wu J Y. Research on vehicle feature recognition algorithm based on an optimized convolutional neural network[J]. Telecommunications Science, 2023, 39(10): 101-111. (in Chinese)

[42] Touvron H, Cord M, Sablayrolles A, et al. Going deeper with image transformers[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 32-42.

[43] Zhu G Z, Wei B, Yang A F, et al. Multi-view 3D reconstruction method based on self-attention mechanism[J]. Laser & Optoelectronics Progress, 2023, 60(16): 323-330. (in Chinese)

[44] Shaw P, Uszkoreit J, Vaswani A. Self-attention with relative position representations[J]. arXiv preprint arXiv:1803.02155, 2018.

[45] Narayanan M. SENetV2: aggregated dense layer for channelwise and global representations[J]. arXiv preprint arXiv:2311.10807, 2023.

[46] Aanæs H, Jensen R R, Vogiatzis G, et al. Large-scale data for multiple-view stereopsis[J]. International Journal of Computer Vision, 2016, 120: 153-168.

[47] Knapitsch A, Park J, Zhou Q Y, et al. Tanks and Temples: benchmarking large-scale scene reconstruction[J]. ACM Transactions on Graphics (TOG), 2017, 36(4): 1-13.

[48] Yao Y, Luo Z, Li S, et al. BlendedMVS: a large-scale dataset for generalized multi-view stereo networks[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 1790-1799.

[49] Luo K, Guan T, Ju L, et al. P-MVSNet: learning patch-wise matching confidence aggregation for multi-view stereo[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 10452-10461.

[50] Yi H, Wei Z, Ding M, et al. Pyramid multi-view stereo net with self-adaptive view aggregation[C]//Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part IX. Springer International Publishing, 2020: 766-782.

[51] Yang J, Mao W, Alvarez J M, et al. Cost volume pyramid based depth inference for multi-view stereo[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 4877-4886.

[52] Cheng S, Xu Z, Zhu S, et al. Deep stereo using adaptive thin volume representation with uncertainty awareness[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 2524-2534.

[53] Ma X, Gong Y, Wang Q, et al. EPP-MVSNet: epipolar-assembling based depth prediction for multi-view stereo[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 5732-5740.

[54] Pan K, Li K, Zhang G, et al. OD-MVSNet: omni-dimensional dynamic multi-view stereo network[J]. PLOS ONE, 2024, 19(8): e0309029.

[55] Sun K, Zhang C, Zhan T, et al. Multi-view stereo reconstruction method fusing attention mechanism and multi-layer dynamic deformable convolution[J]. Acta Armamentarii, 2024, 45(10): 3631-3641. (in Chinese)

[56] Zhang J, Li S, Luo Z, et al. Vis-MVSNet: visibility-aware multi-view stereo network[J]. International Journal of Computer Vision, 2023, 131(1): 199-214.

[57] Wei Z, Zhu Q, Min C, et al. Bidirectional hybrid LSTM based recurrent neural network for multi-view stereo[J]. IEEE Transactions on Visualization and Computer Graphics, 2022, 30(7): 3062-3073.

[58] Lai H W, Ye C L, Li Z, et al. MFE-MVSNet: multi-scale feature enhancement multi-view stereo with bi-directional connections[J]. IET Image Processing, 2024, 18(11): 2962-2973.

[59] Zhu Q, Wei Z, Wang Z, et al. Hybrid cost volume regularization for memory-efficient multi-view stereo networks[C]//Proceedings of the British Machine Vision Conference (BMVC). 2022: 73.

CLC Number: TP391.4

Release Date: 2025-06-16
