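Thesis title (Chinese): | 面向类不平衡数据的软件缺陷预测方法研究 |
Name: | |
Student ID: | 20208088020 |
Confidentiality level: | Confidential (open after 1 year) |
Thesis language: | chi |
Discipline code: | 083500 |
Discipline name: | Engineering - Software Engineering |
Student type: | Master's |
Degree level: | Master of Engineering |
Degree year: | 2023 |
Institution: | Xi'an University of Science and Technology |
School/Department: | |
Major: | |
Research direction: | Artificial intelligence and information processing |
First supervisor name: | |
First supervisor affiliation: | |
Thesis submission date: | 2023-06-20 |
Thesis defense date: | 2023-06-06 |
Thesis title (foreign language): | Research on software defect prediction for class-imbalanced data |
Keywords (Chinese): | |
Keywords (foreign language): | Software defect prediction ; Class imbalance ; Semantic features ; Generative Adversarial Network ; Oversampling |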
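Abstract (translated from Chinese): |
At present, software systems are widely used in safety-critical fields such as national defense, finance, electric power, and communications, and software security has become an important part of national security. Software defects are unavoidable during development and can cause system crashes, seriously threatening software security. To reduce this threat, potential defects must be found and removed promptly during development so that software quality is assured. Software defect prediction uses machine learning algorithms to build prediction models that can effectively uncover latent defects in software systems, thereby safeguarding software security. However, existing methods still have problems: (1) software defect prediction faces a severe class imbalance problem; when handling it, existing oversampling methods synthesize many defect samples that lack diversity, and building defect prediction models from such samples degrades their predictive ability; (2) software metrics and code semantic features are both widely used in software defect prediction, but most existing studies either focus on a single kind of feature or simply concatenate different features, which leads to incomplete code feature extraction and ineffective feature combination. To address these problems, this thesis carries out the following work: (1) To address the loss of prediction performance caused by defect samples synthesized with existing oversampling methods, an oversampling method based on a conditional sequence generative adversarial network is proposed. Cross layers are added to the generator and the discriminator to make the generated samples more diverse; an independent classifier is introduced to guide the generator toward synthetic samples that match their class labels; and a new loss function combining the Wasserstein distance with a gradient penalty is designed to avoid mode collapse and achieve stable, efficient network training. Simulation experiments with multiple classification models were carried out on 10 public software defect datasets. The results show that the method not only effectively addresses the class imbalance problem in software defect prediction but also outperforms existing oversampling methods. (2) To address incomplete code feature extraction and ineffective feature combination in current software defect prediction, a software defect prediction method based on a convolutional neural network and a gated recurrent unit is proposed. The convolutional neural network and the gated recurrent unit extract software metric features and semantic features from code, respectively; an attention mechanism is introduced so that important features receive more attention; and an adaptive weight layer is designed for feature fusion, assigning different weights to the extracted software metric features and semantic features to further improve the predictive ability of the defect prediction model. Experimental results on 6 open-source projects show that the method effectively improves software defect prediction performance. |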
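Point (1) of the abstract combines the Wasserstein distance with a gradient penalty. For orientation only, the standard WGAN-GP critic and generator objectives are written out below. This is the textbook form, not the thesis's own loss, which the abstract does not give and which also involves class conditioning and an independent classifier. The symbols $D$ (critic), $P_r$ (real data), $P_g$ (generated data), and $\lambda$ (penalty weight) are notation assumed here.

```latex
\begin{aligned}
L_D &= \mathbb{E}_{\tilde{x} \sim P_g}\big[D(\tilde{x})\big]
     - \mathbb{E}_{x \sim P_r}\big[D(x)\big]
     + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}
       \Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big], \\
L_G &= -\,\mathbb{E}_{\tilde{x} \sim P_g}\big[D(\tilde{x})\big], \\
\hat{x} &= \epsilon\, x + (1 - \epsilon)\, \tilde{x},
\qquad \epsilon \sim U[0, 1].
\end{aligned}
```

The penalty term pushes the gradient norm of the critic toward 1 at points interpolated between real and generated samples, which is what stabilizes training relative to weight clipping.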
Abstract (foreign language): |
Currently, software systems are widely employed in security-critical domains such as national defense, finance, electric power, and communication, and software security has become an essential component of national security. Software defects are inevitable during software development and may lead to system crashes, posing a grave threat to software security. To reduce this threat, potential defects must be eliminated promptly during development to ensure the quality of software systems. Software defect prediction employs machine learning algorithms to construct predictive models that can effectively mine potential defects in software systems and thereby help ensure software security. However, existing methods still have some issues: (1) There is a severe class imbalance problem in software defect prediction. Existing oversampling methods tend to synthesize numerous defect samples that lack diversity, and using such samples to construct defect prediction models may reduce their predictive ability. (2) Software metrics and code semantic features are widely used in software defect prediction. However, most existing studies focus on a single kind of feature or simply concatenate different features, leading to incomplete code feature extraction and ineffective feature combination. To address these problems, the following studies are carried out in this thesis: (1) To solve the problem that defect samples synthesized by existing oversampling methods reduce the prediction performance of the model, an oversampling method based on a conditional sequence generative adversarial network is proposed. To make the generated samples more diverse, cross layers are added to the generator and the discriminator. An independent classifier is introduced to guide the generator in generating synthetic samples that match the class labels. A new loss function is designed by combining the Wasserstein distance and a gradient penalty to avoid mode collapse and achieve stable and efficient network training. Simulation experiments are carried out with several classification models on ten imbalanced defect datasets. The experimental results show that the proposed method effectively addresses the class imbalance problem in software defect prediction and outperforms existing oversampling methods. (2) To address incomplete code feature extraction and ineffective feature combination in current software defect prediction, a software defect prediction method based on a convolutional neural network and a gated recurrent unit is proposed. The convolutional neural network and the gated recurrent unit are used to extract software metric features and semantic features from code, respectively. An attention mechanism is introduced to focus on important features. An adaptive weight layer is designed for feature fusion; it assigns different weights to the extracted software metric and semantic features, further enhancing the predictive ability of the defect prediction model. Experimental results on the PROMISE dataset demonstrate that the proposed method effectively improves defect prediction performance. |
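Point (2) of the abstract describes an adaptive weight layer that assigns different weights to the software metric features and the semantic features. The abstract does not say how those weights are computed, so the following is only a minimal sketch under one common assumption: two scalar logits are normalized with a softmax and used for a weighted sum of two equal-length feature vectors. The names `softmax`, `fuse_features`, and `weight_logits` are hypothetical and do not come from the thesis.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a short list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_features(metric_vec, semantic_vec, weight_logits):
    """Weighted sum of two feature vectors (assumed fusion rule).

    metric_vec    -- features from the metric branch (e.g. a CNN output)
    semantic_vec  -- features from the semantic branch (e.g. a GRU output)
    weight_logits -- two scalars; in a trained model these would be
                     learned parameters, here they are passed in directly
    """
    if len(metric_vec) != len(semantic_vec):
        raise ValueError("both feature vectors must have the same length")
    w_metric, w_semantic = softmax(weight_logits)
    fused = [w_metric * m + w_semantic * s
             for m, s in zip(metric_vec, semantic_vec)]
    return fused, (w_metric, w_semantic)


if __name__ == "__main__":
    fused, weights = fuse_features([1.0, 2.0], [3.0, 4.0], [0.0, 0.0])
    print(fused, weights)
```

With equal logits the two branches contribute equally; raising one logit shifts weight toward that branch while the two weights still sum to 1. A real implementation would learn the logits by backpropagation together with the rest of the network.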
Chinese Library Classification number: | TP311.5 |
Open-access date: | 2024-06-20 |