Thesis Title (Chinese): | Research on a Monocular Camera Pose Estimation Method Based on Multi-Information Fusion |
Name: | |
Student ID: | 21207223070 |
Confidentiality Level: | Public |
Thesis Language: | chi |
Discipline Code: | 085400 |
Discipline Name: | Engineering - Electronic Information |
Student Type: | Master's |
Degree Level: | Master of Engineering |
Degree Year: | 2024 |
Institution: | Xi'an University of Science and Technology |
Department: | |
Major: | |
Research Direction: | Computer Vision |
First Supervisor: | |
First Supervisor's Institution: | |
Submission Date: | 2024-06-13 |
Defense Date: | 2024-06-05 |
Thesis Title (English): | Research on Monocular Camera Pose Estimation Method Based on Multi-Modal Information Fusion |
Keywords (Chinese): | |
Keywords (English): | Camera Pose Estimation; Visual Odometry; Semantic Segmentation; Adaptive Fusion |
Abstract (Chinese): |
Camera pose estimation is a key component of technologies such as navigation, localization, and 3D reconstruction, is the first step in carrying out core computer vision tasks, and is widely applied in fields such as autonomous driving and robotics. When current deep learning methods estimate camera pose, the network exploits little scene information, so training is easily disturbed by adverse environmental factors and the estimated poses are not accurate. With the rapid development of deep learning technology, camera pose estimation with multi-information-fusion deep learning methods has become one of the current research hotspots.

(1) To address the susceptibility of existing deep-learning-based camera pose estimation networks to motion blur, a camera pose estimation method that fuses visual odometry information is proposed on the basis of PoseNet2. First, to obtain high-precision visual odometry information, the Siamese visual odometry network is improved: the two Res5 residual blocks in the ResNet50 encoder are merged to reduce the number of computed parameters, the ReLU activation function is replaced with the ELU function to alleviate the bias shift of neural units, and LSTM units are added to correct the trajectory drift caused by accumulated error. Second, to raise the accuracy of the poses estimated by the PoseNet2 network, the network backbone is improved: ResNet50 serves as the encoder and three FC layers as the decoder to strengthen feature extraction, and the activation function is likewise replaced with ELU. To handle motion blur, the visual odometry information is dimensionally transformed and then fused at the Res5 residual block of the pose estimation network, improving the accuracy of the estimated poses. Finally, evaluation on public datasets shows that the pose estimation network with the improved fusion produces localization trajectories with fewer scattered points and clearly lower errors: translation error drops by 50.8% and rotation error by 45.2%.

(2) To further address viewpoint changes in the network, a camera pose estimation method that additionally fuses semantic segmentation maps is proposed on top of the visual odometry fusion. First, 20,000 training-set images from a public dataset are manually annotated and their semantic segmentation maps obtained with the DeepLabV3 model, yielding a self-built semantic segmentation dataset. Second, because raw segmentation maps express their information incompletely inside the pose estimation network, a multi-scale feature extraction network is designed to mine the segmentation maps for semantic features at different scales, and an adaptive fusion method merges the features so that each scale's information contribution is balanced. Finally, the multi-scale semantic information is fused into the pose estimation network to counteract the influence of viewpoint changes and raise pose accuracy. Experimental results show that with multi-scale semantic information fused in, scattered points in the localization trajectories decrease further and the trajectories become clearer, while errors fall further still: translation error decreases by 31.6% and rotation error by 27%. |
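The pose network modifications described in part (1) can be illustrated with a short sketch. The following PyTorch code is a minimal reconstruction from the abstract alone, not the author's implementation: the class and parameter names, the 256-dimensional visual odometry feature, the additive fusion at the Res5 input, and the FC layer widths are all assumptions. It shows the three stated design points: ReLU swapped for ELU throughout the ResNet50 encoder, a three-FC-layer decoder regressing a 7-value pose (translation plus quaternion), and dimension-transformed visual odometry information fused at the Res5 stage.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

def relu_to_elu(module: nn.Module) -> None:
    # Recursively replace every ReLU with ELU to counter the neuron bias
    # shift the abstract describes.
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, nn.ELU(inplace=True))
        else:
            relu_to_elu(child)

class PoseNetWithVO(nn.Module):
    def __init__(self, vo_dim: int = 256):  # vo_dim is an assumed size
        super().__init__()
        backbone = resnet50(weights=None)
        relu_to_elu(backbone)  # backbone.relu is now an ELU as well
        # Encoder stages up to Res4; Res5 (layer4) is kept separate so the
        # VO feature can be fused at its input.
        self.stem = nn.Sequential(
            backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
            backbone.layer1, backbone.layer2, backbone.layer3,
        )
        self.res5 = backbone.layer4  # outputs 2048 channels
        # "Dimension transform" (assumed form): project the VO vector to
        # Res5's 1024-channel input so it can be added as a per-channel bias.
        self.vo_proj = nn.Linear(vo_dim, 1024)
        # Three FC layers decode the pooled feature into a 7-value pose:
        # 3-D translation plus a 4-D quaternion.
        self.decoder = nn.Sequential(
            nn.Linear(2048, 1024), nn.ELU(inplace=True),
            nn.Linear(1024, 256), nn.ELU(inplace=True),
            nn.Linear(256, 7),
        )

    def forward(self, image: torch.Tensor, vo_feat: torch.Tensor) -> torch.Tensor:
        x = self.stem(image)                         # (B, 1024, H/16, W/16)
        vo = self.vo_proj(vo_feat)[..., None, None]  # (B, 1024, 1, 1), broadcast
        x = self.res5(x + vo)                        # fuse VO info at Res5
        x = torch.flatten(F.adaptive_avg_pool2d(x, 1), 1)
        return self.decoder(x)                       # [t_x, t_y, t_z, q_w, q_x, q_y, q_z]
```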
Abstract (English): |
Camera pose estimation is a crucial component of technologies such as navigation, localization, and 3D reconstruction, and serves as the first step in core computer vision tasks; it is widely used in fields such as autonomous driving and robotics. Current deep learning methods for camera pose estimation exploit limited scene information, which makes the network susceptible to adverse environmental factors during training and lowers estimation accuracy. With the rapid development of deep learning technology, multi-information-fusion deep learning methods for camera pose estimation have become a current research hotspot.

(1) To address the susceptibility of existing deep learning-based camera pose estimation networks to motion blur, a method is proposed that integrates visual odometry information on top of PoseNet2. First, the Siamese visual odometry network was improved to obtain high-precision visual odometry information: the two Res5 residual blocks in the ResNet50 encoder were merged to reduce computational parameters, the ReLU activation function was replaced with ELU to address neuron bias shift, and LSTM units were added to mitigate trajectory drift caused by accumulated error. Second, to enhance the accuracy of the PoseNet2 network in estimating camera pose, the network architecture was modified to use ResNet50 as the encoder and three FC layers as the decoder, strengthening feature extraction, with the activation function likewise replaced by ELU. To counter motion blur, the visual odometry information was dimensionally transformed and fused into the Res5 residual block of the camera pose estimation network, improving pose accuracy. Finally, experiments on public datasets demonstrate that the improved fusion network yields camera localization trajectories with fewer scattered points and significantly reduced errors, with a 50.8% reduction in translation error and a 45.2% reduction in rotation error.

(2) To further address viewpoint changes, a method is proposed that additionally integrates semantic segmentation maps into the camera pose estimation network built on visual odometry fusion. First, 20,000 training images from a public dataset were manually annotated, and semantic segmentation maps were obtained with the DeepLabV3 model, producing a self-built semantic segmentation dataset. Second, because raw segmentation maps express their information incompletely within the pose estimation network, a multi-scale feature extraction network was designed to mine semantic features from the segmentation maps at different scales, and an adaptive fusion method merges these features to balance the information contribution of each scale. Lastly, the multi-scale semantic information was integrated into the camera pose estimation network to counteract viewpoint changes and improve pose estimation accuracy. Experimental results demonstrate that the network fused with multi-scale semantic information further reduces scattered points in the localization trajectories, yielding clearer trajectories and further error reductions: a 31.6% decrease in translation error and a 27% decrease in rotation error. |
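The adaptive multi-scale fusion described in part (2) can likewise be sketched. The code below illustrates the general technique of merging per-scale semantic features with learned, softmax-normalized weights so that training balances each scale's contribution; the three scales, 64 channels, and single-channel segmentation input are assumptions made for illustration, and the thesis's actual extraction network and fusion scheme may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveScaleFusion(nn.Module):
    def __init__(self, channels: int = 64, num_scales: int = 3):
        super().__init__()
        # One convolution per scale extracts features from the segmentation map.
        self.branches = nn.ModuleList(
            nn.Conv2d(1, channels, kernel_size=3, padding=1) for _ in range(num_scales)
        )
        # Learnable per-scale logits; softmax turns them into fusion weights.
        self.scale_logits = nn.Parameter(torch.zeros(num_scales))

    def forward(self, seg_map: torch.Tensor) -> torch.Tensor:
        # seg_map: (B, 1, H, W) single-channel semantic segmentation map (float).
        b, _, h, w = seg_map.shape
        feats = []
        for i, branch in enumerate(self.branches):
            # Downsample the input to 1, 1/2, 1/4 ... resolution per branch.
            x = seg_map if i == 0 else F.interpolate(
                seg_map, scale_factor=1 / 2**i, mode="nearest"
            )
            x = F.elu(branch(x))
            # Bring every scale back to full resolution before merging.
            feats.append(F.interpolate(x, size=(h, w), mode="bilinear",
                                       align_corners=False))
        weights = torch.softmax(self.scale_logits, dim=0)  # balances contributions
        return sum(w * f for w, f in zip(weights, feats))
```

In a full pipeline the fused feature map would then be injected into the pose estimation network alongside the image features; initializing the scale logits at zero makes the initial fusion an even average, which training re-weights wherever one scale proves more informative.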
CLC Number: | TP391.41 |
Release Date: | 2024-06-13 |