Thesis Information

Thesis title (Chinese):

 基于3D网格的单目视觉人员定位技术研究 (Research on Monocular Vision Personnel Localization Technology Based on 3D Mesh)

Name:

 陈盼霖 (Chen Panlin)

Student ID:

 21207223043    

Confidentiality level:

 Public

Thesis language:

 Chinese (chi)

Discipline code:

 085400    

Discipline name:

 Engineering - Electronic Information

Student type:

 Master's

Degree level:

 Master of Engineering

Degree year:

 2021    

Degree-granting institution:

 Xi'an University of Science and Technology (西安科技大学)

School/Department:

 College of Communication and Information Engineering

Major:

 Electronic Information

Research direction:

 Computer Vision

First supervisor:

 廖晓群 (Liao Xiaoqun)

First supervisor's institution:

 Xi'an University of Science and Technology

Submission date:

 2024-06-14    

Defense date:

 2024-06-04    

Thesis title (English):

 Research on Monocular Vision Personnel Localization Technology Based on 3D Mesh    

Keywords (Chinese):

 Deep learning; Pedestrian detection; Monocular visual localization; Target depth estimation; Three-dimensional spatial localization

Keywords (English):

 Deep learning; Person detection; Monocular vision localization; Target depth estimation; Three-dimensional space positioning

Abstract (Chinese):

With the rapid development of deep learning, target recognition, localization, and tracking based on moving vision platforms have become active research topics in intelligent surveillance, artificial intelligence, and related fields. Research on target localization and tracking in 2D images is relatively mature, but 3D localization and tracking of targets on low-cost moving vision platforms still faces many difficulties, such as low detection accuracy in complex environments, low visual localization accuracy in complex environments, and inaccurate monocular estimation of target depth. This thesis combines a deep-learning-based object detection algorithm, a deep-learning-based visual localization algorithm, and a geometry-constrained monocular depth estimation algorithm to localize personnel in 3D space with a single monocular camera. The main research content is as follows:
To address missed detections, false detections, and inaccurate bounding boxes in pedestrian detection, the existing detector is improved on the basis of YOLOv8n. First, deformable convolution is introduced into the C2f module to strengthen multi-scale feature extraction, yielding richer pedestrian features and higher detection accuracy. In addition, the WIoU loss replaces the original CIoU and, together with the Distribution Focal Loss (DFL), serves as the bounding-box regression loss, improving performance without increasing model complexity. Experiments show that the method improves AP50 and AP50:95 on public datasets by 1.6% and 1.4%, respectively.
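As an illustration of the deformable-convolution change described above (not the thesis implementation; the C2f internals follow the Ultralytics YOLOv8 code, and the block name below is hypothetical), a minimal PyTorch sketch using torchvision's DeformConv2d:

```python
# Illustrative only: a 3x3 deformable convolution that could stand in for the
# plain 3x3 convolution inside a C2f bottleneck. Requires torch + torchvision.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class DeformConvBlock(nn.Module):
    """3x3 deformable convolution with a learned offset field (hypothetical name)."""

    def __init__(self, c_in, c_out):
        super().__init__()
        # 2 offsets (dx, dy) per kernel position: 2 * 3 * 3 = 18 channels.
        self.offset = nn.Conv2d(c_in, 18, kernel_size=3, padding=1)
        self.dconv = DeformConv2d(c_in, c_out, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        # The sampling grid shifts by the predicted offsets, letting the kernel
        # adapt to pedestrians of different scales and poses.
        return self.act(self.bn(self.dconv(x, self.offset(x))))


x = torch.randn(1, 64, 80, 80)
print(DeformConvBlock(64, 64)(x).shape)  # torch.Size([1, 64, 80, 80])
```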
To address the low camera self-localization accuracy and the loss of feature information in complex indoor environments, the 3D-mesh-based visual localization algorithm VS-Net is improved and optimized. First, the backbone is replaced with the more lightweight MobileNetV2, reducing computation and speeding up training. Second, a densely connected atrous spatial pyramid pooling (ASPP) structure is adopted so that more pixel information is retained while a larger receptive field is obtained. Then, to avoid losing key details, a CBAM attention module is applied to low-level features and an ECA channel attention module to high-level features to strengthen feature extraction. Finally, the Mish activation function replaces the original LeakyReLU to improve training. Experiments on the 7-Scenes dataset show that the method reduces rotation error by 12.5% and translation error by 4.17% on average and improves localization accuracy by 1.72% on average, a clear gain in complex environments.
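A minimal PyTorch sketch of the densely connected ASPP idea, in which each atrous branch consumes the input together with the outputs of all earlier branches; the dilation rates and channel widths here are illustrative assumptions, not the values used in the thesis:

```python
import torch
import torch.nn as nn


class DenseASPP(nn.Module):
    """Densely connected atrous spatial pyramid pooling: every atrous branch
    sees the input plus the outputs of all previous branches, so the receptive
    field grows while finer pixel information is kept."""

    def __init__(self, c_in, c_branch=64, rates=(3, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList()
        c = c_in
        for r in rates:
            self.branches.append(nn.Sequential(
                nn.Conv2d(c, c_branch, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(c_branch),
                nn.ReLU(inplace=True),
            ))
            c += c_branch  # the next branch also consumes this branch's output
        self.project = nn.Conv2d(c, c_in, kernel_size=1)

    def forward(self, x):
        feats = [x]
        for branch in self.branches:
            feats.append(branch(torch.cat(feats, dim=1)))
        return self.project(torch.cat(feats, dim=1))


print(DenseASPP(128)(torch.randn(1, 128, 32, 32)).shape)  # (1, 128, 32, 32)
```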
To address the slow estimation of pedestrian depth and the difficulty of 3D localization, a geometry-constrained depth estimation method is proposed. Starting from the 2D detection produced by the detector, the pedestrian is abstracted as a parallelogram, geometric constraint equations are built from the spatial relations among the pedestrian's key feature points, and a system of depth equations is constructed from the camera projection model and these constraints. The system is then solved with a high-order Newton iteration, giving efficient real-time estimates of pedestrian depth. Finally, combined with the camera pose solved by the visual localization algorithm, pixel coordinates are converted to world coordinates through the camera model, completing 3D personnel localization. Experiments show that the root-mean-square localization error is below 13.42 cm, which meets the requirements of practical engineering projects.
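The thesis's constraint equations are not reproduced here; the sketch below only illustrates the Newton-style solver such a system of depth equations calls for, using a finite-difference Jacobian and a toy two-equation example:

```python
import numpy as np


def newton_solve(f, x0, tol=1e-10, max_iter=50, eps=1e-6):
    """Newton iteration for a small nonlinear system f(x) = 0, with a
    finite-difference Jacobian. Stands in for solving the system of depth
    equations built from the camera projection model and the geometric
    constraints; the actual equations are defined in the thesis."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        fx = np.asarray(f(x), dtype=float)
        if np.linalg.norm(fx) < tol:
            break
        # Numerical Jacobian, one column per unknown.
        J = np.empty((fx.size, x.size))
        for j in range(x.size):
            dx = np.zeros_like(x)
            dx[j] = eps
            J[:, j] = (np.asarray(f(x + dx)) - fx) / eps
        x = x - np.linalg.solve(J, fx)
    return x


# Toy example: intersect a circle of radius 2 with the line x0 = x1.
print(newton_solve(lambda x: [x[0] ** 2 + x[1] ** 2 - 4.0, x[0] - x[1]],
                   x0=[1.0, 0.5]))  # ~ [1.414, 1.414]
```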

Abstract (English):

With the rapid development of deep learning technology, target recognition, localization, and tracking based on motion vision platforms have become hot research topics in fields such as intelligent surveillance and artificial intelligence. Theoretical research on target localization and tracking in 2D images is relatively mature, but three-dimensional spatial localization and tracking of targets on low-cost motion vision platforms still faces many challenges, including low detection accuracy in complex environments, low visual localization accuracy in complex environments, and inaccurate monocular depth estimation of targets. In this thesis, we combine a deep-learning-based target detection algorithm, a visual localization algorithm, and a geometry-constrained monocular depth estimation algorithm to achieve three-dimensional spatial localization of personnel with a monocular camera. The specific research content of this thesis is as follows:
To address missed detections, false detections, and inaccurate localization in pedestrian detection, we improve the existing pedestrian detection algorithm based on YOLOv8n. First, we introduce deformable convolution into the C2f module to enhance multi-scale feature extraction, yielding richer pedestrian features and improving detection accuracy. We also replace the original CIoU loss with the WIoU loss, which, together with the Distribution Focal Loss (DFL), serves as the bounding-box regression loss and improves performance without increasing model complexity. Experimental results show that the method increases AP50 and AP50:95 by 1.6% and 1.4%, respectively, on public datasets.
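For reference, a hedged sketch of a Wise-IoU style box-regression loss following the published WIoU v1 formulation; the thesis may use a later WIoU variant that adds a dynamic focusing term on top of this:

```python
import torch


def wiou_v1_loss(pred, target, eps=1e-7):
    """Sketch of a Wise-IoU v1 style bounding-box loss: the IoU loss is scaled
    by exp(center distance / enclosing-box diagonal), with the normaliser
    detached so it acts as a focusing weight rather than an extra penalty.
    Boxes are (x1, y1, x2, y2)."""
    lt = torch.max(pred[..., :2], target[..., :2])
    rb = torch.min(pred[..., 2:], target[..., 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Smallest enclosing box and squared distance between box centres.
    ewh = torch.max(pred[..., 2:], target[..., 2:]) - torch.min(pred[..., :2], target[..., :2])
    dist2 = (((pred[..., :2] + pred[..., 2:]) - (target[..., :2] + target[..., 2:])) ** 2).sum(-1) / 4
    r_wiou = torch.exp(dist2 / (ewh[..., 0] ** 2 + ewh[..., 1] ** 2 + eps).detach())
    return (r_wiou * (1.0 - iou)).mean()


p = torch.tensor([[10.0, 10.0, 50.0, 90.0]], requires_grad=True)
t = torch.tensor([[12.0, 8.0, 48.0, 88.0]])
wiou_v1_loss(p, t).backward()  # gradients flow back to the predicted box
```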
To address the low localization accuracy and loss of feature information in complex indoor environments, we improve and optimize the 3D-mesh-based visual localization algorithm VS-Net. First, we replace the backbone with the more lightweight MobileNetV2, reducing the computational load and accelerating training. Second, we adopt a densely connected Atrous Spatial Pyramid Pooling (ASPP) structure, which preserves more pixel information while enlarging the receptive field. Then, to prevent the loss of key details, we apply the CBAM attention mechanism to low-level features and the ECA channel attention mechanism to high-level features to strengthen feature extraction. Finally, we replace the original LeakyReLU activation with the Mish activation function to improve training. Experimental results on the 7-Scenes dataset show that the method reduces the average rotation error by 12.5% and the average translation error by 4.17%, and improves the average localization accuracy by 1.72%, a significant gain in visual localization accuracy in complex environments.
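As an illustration of the ECA channel attention mentioned above, a minimal PyTorch sketch following the standard ECA-Net formulation; the kernel-size heuristic and the placement in the network are assumptions, not the thesis configuration:

```python
import math

import torch
import torch.nn as nn


class ECA(nn.Module):
    """Efficient Channel Attention (ECA-Net style): global average pooling
    followed by a 1-D convolution across channels; the kernel size is chosen
    adaptively from the channel count."""

    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        k = int(abs((math.log2(channels) + b) / gamma))
        k = k if k % 2 else k + 1  # kernel size must be odd
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        # (N, C, H, W) -> per-channel descriptor (N, 1, C)
        y = x.mean(dim=(2, 3)).unsqueeze(1)
        # Local cross-channel interaction, then per-channel gating weights.
        w = torch.sigmoid(self.conv(y)).squeeze(1)[..., None, None]
        return x * w


print(ECA(256)(torch.randn(2, 256, 16, 16)).shape)  # (2, 256, 16, 16)
```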
To address the slow estimation of pedestrian depth and the difficulty of three-dimensional localization, we propose a depth estimation method based on geometric constraints. Starting from the two-dimensional detection returned by the detector, we abstract the pedestrian as a parallelogram, establish geometric constraint equations from the spatial relations among the pedestrian's key feature points, and construct a system of depth equations from the camera projection model and these constraints. We then solve the system with a high-order Newton iteration, obtaining efficient real-time estimates of pedestrian depth. Finally, combined with the camera pose solved by the visual localization algorithm, we convert pixel coordinates to world coordinates through the camera model, completing three-dimensional personnel localization. Experimental results show that the root-mean-square localization error is less than 13.42 cm, meeting the requirements of practical engineering projects.
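A minimal sketch of the final pixel-to-world conversion under the pinhole camera model, assuming intrinsics K, a world-to-camera pose (R, t) from the localization step, and a depth value from the estimator; the function name and example numbers are illustrative:

```python
import numpy as np


def pixel_to_world(u, v, depth, K, R, t):
    """Back-project a pixel to world coordinates under the pinhole model.
    Assumes K is the 3x3 intrinsic matrix, (R, t) the world-to-camera pose
    estimated by the visual localization step, and depth the Z value from the
    geometric-constraint estimator. Illustrative helper, not the thesis code."""
    # Pixel -> camera-frame point at the estimated depth.
    p_cam = depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))
    # Camera frame -> world frame: X_w = R^T (X_c - t).
    return R.T @ (p_cam - t)


# Hypothetical 640x480 camera sitting at the world origin, looking along +Z.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
print(pixel_to_world(320.0, 240.0, 2.0, K, R, t))  # -> [0. 0. 2.]
```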
 

CLC number:

 TP391.41    

Open access date:

 2024-06-14    
