Thesis Title (Chinese): | 基于无监督学习的视频异常事件检测方法研究 |
Author Name: | |
Student ID: | 21208088025 |
Confidentiality Level: | Public |
Thesis Language: | chi (Chinese) |
Discipline Code: | 083500 |
Discipline Name: | Engineering - Software Engineering |
Student Type: | Master's |
Degree Level: | Master of Engineering |
Degree Year: | 2024 |
Degree-Granting Institution: | Xi'an University of Science and Technology |
School/Department: | |
Major: | |
Research Area: | Computer Vision |
Primary Supervisor: | |
Primary Supervisor's Institution: | |
Submission Date: | 2024-06-19 |
Defense Date: | 2024-05-30 |
Thesis Title (English): | Research on Video Anomaly Event Detection Method Based on Unsupervised Learning |
Keywords (Chinese): | |
Keywords (English): | Video anomaly detection; Deep learning; Autoencoder; Unsupervised learning; Contrastive learning |
Abstract (Chinese): |
Video anomaly event detection plays a vital role in application domains such as video surveillance and intelligent security. With the rapid development of deep unsupervised learning, using it for intelligent analysis of video content has become a research focus. At present, autoencoder-based methods attend excessively to low-level details of video features, so the reconstruction errors of normal and anomalous events are similar and anomalies are hard to detect effectively. Prediction tasks that rely on a single frame cannot fully exploit the spatiotemporal context of video content in complex and changing scenes, which limits detection performance. To address these problems, this thesis proposes two unsupervised video anomaly event detection methods. The main research content is as follows:

(1) To address the problem that autoencoders attend excessively to low-level details of video features and therefore struggle to distinguish normal from anomalous events, this thesis proposes a video anomaly event detection method based on a memory-augmented spatiotemporal masked autoencoder. First, video events are represented as spatiotemporal cubes so that the video's spatiotemporal relationships can be analyzed in depth. Second, a spatiotemporal masked autoencoder is used to extract high-level semantic features of the video. In addition, multiple memory modules with skip connections are introduced to strengthen the model's ability to memorize normal features and to compensate for the loss of key information, ensuring complete reconstruction. Finally, an anomaly score is computed from the difference between the reconstructed and input data to perform detection. The proposed method achieves AUCs of 99.9%, 94.8%, and 78.9% on the UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets, respectively, outperforming mainstream methods such as the memory-guided autoencoder (MNAD) and the residual autoencoder (AR-AE).

(2) To address the cost and time of data annotation, and the mismatch between predicted and actual results that arises because prediction methods rely on a single task and cannot fully exploit the spatiotemporal context of video content, this thesis proposes a video anomaly event detection method based on deep unsupervised contrastive learning. First, within a dual-branch contrastive learning architecture, a C3D (Convolutional 3D) network fused with channel and spatial attention mechanisms serves as the encoder to extract spatiotemporal video features. Second, a projection network with a two-layer MLP (multilayer perceptron) structure reduces the dimensionality of those features. Finally, contrastive losses over multiple unsupervised tasks are computed and combined with the LOF (Local Outlier Factor) algorithm to identify anomalous events in video. On the UCSD Ped2 and ShanghaiTech datasets the proposed method improves performance by 2.6% and 3.4% over the dual-discriminator generative adversarial method (CT-D2GAN), and on the Avenue dataset it improves performance by 6.6% over the multi-path frame prediction method (ROADMAP). |
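To make the first method's scoring step concrete, here is a minimal PyTorch sketch of a cosine-similarity memory read and a reconstruction-error anomaly score, in the spirit of the memory-augmented autoencoder summarized above. The module names, feature sizes, and the min-max normalization are illustrative assumptions, not the thesis's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryModule(nn.Module):
    """Learned bank of 'normal' feature prototypes with soft addressing."""
    def __init__(self, num_slots: int = 10, feat_dim: int = 128):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(num_slots, feat_dim))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, feat_dim) features from the encoder bottleneck.
        # Cosine-similarity addressing -> softmax attention over slots.
        attn = F.softmax(
            F.normalize(z, dim=1) @ F.normalize(self.slots, dim=1).T, dim=1)
        read = attn @ self.slots          # (batch, feat_dim) memory readout
        # Concatenate query and readout so the decoder sees both; anomalous
        # inputs are pulled toward normal prototypes, which inflates their
        # reconstruction error.
        return torch.cat([z, read], dim=1)

def anomaly_score(x: torch.Tensor, x_hat: torch.Tensor) -> torch.Tensor:
    """Per-sample MSE between input and reconstruction, min-max scaled."""
    err = ((x - x_hat) ** 2).flatten(1).mean(dim=1)
    return (err - err.min()) / (err.max() - err.min() + 1e-8)

# Toy usage: 4 spatiotemporal cubes with 128-dim bottleneck features.
mem = MemoryModule()
decoder_input = mem(torch.randn(4, 128))          # (4, 256), fed to a decoder
scores = anomaly_score(torch.rand(4, 3, 16, 32, 32),
                       torch.rand(4, 3, 16, 32, 32))
print(decoder_input.shape, scores.shape)
```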
Abstract (English): |
Video anomaly event detection plays a critical role in application domains such as video surveillance and intelligent security. With the rapid development of deep unsupervised learning techniques, using them for intelligent video content analysis has become a research focus. Currently, autoencoder-based methods tend to focus excessively on low-level details of video features, yielding similar reconstruction errors for normal and anomalous events and hindering effective detection. Moreover, relying on single-frame prediction fails to fully exploit the spatiotemporal context of video content, limiting performance in complex and dynamic scenes. To address these challenges, this thesis proposes two unsupervised learning-based video anomaly event detection methods. The main contributions are as follows:

(1) To overcome the problem of autoencoders overemphasizing low-level details of video features and struggling to differentiate normal from abnormal events, we introduce a memory-enhanced spatiotemporal masked autoencoder for video anomaly detection. First, we represent video events as spatiotemporal cubes to analyze their temporal and spatial relationships comprehensively. Then, we use the spatiotemporal masked autoencoder to extract high-level semantic features from the videos. Furthermore, we integrate multiple memory modules and skip connections to strengthen the model's capacity to memorize normal features and compensate for the loss of crucial information, ensuring the integrity of the reconstruction. Finally, we compute anomaly scores by contrasting the reconstructed data with the input data. The proposed method outperforms mainstream approaches such as the memory-guided autoencoder (MNAD) and the residual autoencoder (AR-AE), achieving AUC scores of 99.9%, 94.8%, and 78.9% on the UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets, respectively.

(2) To tackle the cost and time of data annotation, as well as the disparity between predicted and actual outcomes caused by single-task prediction methods that overlook the spatiotemporal context of video content, we introduce a video anomaly detection approach grounded in deep unsupervised contrastive learning. First, we employ a dual-branch contrastive learning architecture whose encoder is a C3D (Convolutional 3D) network incorporating channel and spatial attention mechanisms, extracting spatiotemporal features from the video. Second, we use a projection network with a two-layer MLP (multilayer perceptron) structure to reduce the dimensionality of those features. Finally, we compute contrastive losses over multiple unsupervised tasks and combine them with the LOF (Local Outlier Factor) algorithm to identify abnormal events. The proposed method improves over the dual-discriminator generative adversarial method (CT-D2GAN) by 2.6% and 3.4% on the UCSD Ped2 and ShanghaiTech datasets, respectively, and over the multi-path frame prediction method (ROADMAP) by 6.6% on the Avenue dataset. |
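Likewise, for the second method, the sketch below shows a two-layer MLP projection head, a standard NT-Xent contrastive loss (the common two-view formulation; the thesis computes losses over multiple unsupervised tasks), and scoring with scikit-learn's LocalOutlierFactor. All dimensions and hyperparameters are illustrative assumptions, and the attention-augmented C3D encoder is left abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.neighbors import LocalOutlierFactor

class ProjectionHead(nn.Module):
    """Two-layer MLP mapping encoder features to the contrastive space."""
    def __init__(self, in_dim: int = 4096, hidden: int = 512, out_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, out_dim))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(h), dim=1)   # unit-norm embeddings

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """NT-Xent loss: the two augmented views of each clip are positives;
    every other view in the batch is a negative."""
    z = torch.cat([z1, z2], dim=0)               # (2N, d)
    sim = z @ z.T / tau                          # cosine sims (z is unit-norm)
    sim.fill_diagonal_(float('-inf'))            # mask self-similarity
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Toy usage: projected embeddings of two augmented views of 8 clips.
head = ProjectionHead()
h1, h2 = torch.randn(8, 4096), torch.randn(8, 4096)
loss = nt_xent(head(h1), head(h2))

# Scoring: fit LOF on embeddings of (assumed normal) training clips, then
# treat low local density of test-clip embeddings as anomalous.
lof = LocalOutlierFactor(n_neighbors=5, novelty=True)
lof.fit(head(h1).detach().numpy())
scores = -lof.score_samples(head(h2).detach().numpy())  # higher = more anomalous
```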
References: |
[4] Wu Guangli, Guo Zhenzhou, Li Leiting, et al. Video anomaly event detection fusing FCN and LSTM [J]. Journal of Shanghai Jiao Tong University, 2021, 55(5): 607-614. (in Chinese)
[24] Yu Xiaosheng, Xu Ming, Wang Ying, et al. Anomaly event detection method based on a convolutional variational autoencoder [J]. Chinese Journal of Scientific Instrument, 2023(5): 151-158. (in Chinese)
[42] LeCun Y, Bengio Y, Hinton G. Deep learning [J]. Nature, 2015, 521(7553): 436-444.
[54] Tax D M J, Duin R P W. Support vector data description [J]. Machine Learning, 2004, 54: 45-66.
[55] Cortes C, Vapnik V. Support-vector networks [J]. Machine Learning, 1995, 20: 273-297.
[66] Li Shijing, Qing Linbo, He Xiaohai, et al. Road scene segmentation based on NVIDIA Jetson TX2 [J]. Computer Systems & Applications, 2019(1): 239-244. (in Chinese) |
CLC Number: | TP391.41 |
Open Access Date: | 2024-06-19 |