论文中文题名: | 基于深度学习的语音对抗样本生成方法研究 |
姓名: | |
学号: | 21208223057 |
保密级别: | 保密(1年后开放) |
论文语种: | chi |
学科代码: | 085400 |
学科名称: | 工学 - 电子信息 |
学生类型: | 硕士 |
学位级别: | 工学硕士 |
学位年度: | 2024 |
培养单位: | 西安科技大学 |
院系: | |
专业: | |
研究方向: | 人工智能安全 |
第一导师姓名: | |
第一导师单位: | |
论文提交日期: | 2024-06-19 |
论文答辩日期: | 2024-05-31 |
论文外文题名: | Research on speech adversarial sample generation method based on deep learning |
论文中文关键词: | 语音识别 ; 对抗样本 ; 生成对抗网络 ; 目标标签 ; 自注意力神经预测器 |
论文外文关键词: | Speech Recognition ; Adversarial Samples ; Generative Adversarial Networks ; Target Labels ; Self-attention Neural Predictor |
论文中文摘要: |
随着自动语音识别技术在自动驾驶、智能家居以及语音交互等领域的快速发展与广泛应用,其安全性与鲁棒性日益受到关注。通过精心设计的微小扰动构造的对抗样本攻击,能够迫使自动语音识别系统产生误识别,从而引发重大安全事故。为降低自动语音识别系统潜在的安全风险,研究人员利用对抗样本检测系统的安全漏洞,以此提升其安全性与稳定性。因此,研究语音对抗攻击方法具有较大的理论意义和实用价值。

在对抗样本生成方法中,根据攻击者对目标模型内部信息的了解程度,可分为白盒攻击和黑盒攻击两类。目前,白盒攻击方法面临的主要问题是难以捕获不同语音尺度间的相关性,这一局限性显著降低了攻击成功率。黑盒攻击方法虽能规避对目标模型的直接访问,但其搜索过程复杂度高,且生成的扰动过大,这既增加了攻击难度,也降低了对抗样本的隐蔽性。针对以上问题,本文进行如下研究:

(1) 针对现有白盒攻击方法在捕捉不同语音尺度之间相关性不足的问题,提出了一种基于类别条件生成对抗网络的语音对抗攻击方法。通过目标标签映射模块将攻击目标标签转化为向量,作为条件输入到类别条件生成对抗网络中,控制样本类别的生成;设计NResidual U-block网络结构,并将其与U-Net网络相结合,能够更有效地学习不同时间尺度的语音特征,从而提高对抗样本的质量和攻击效果。在谷歌命令数据集和音乐流派数据集上的实验结果表明,与主流方法相比,本文所提语音对抗样本生成方法的攻击成功率分别提高了3.47%和5.1%,平均信噪比分别提升了3.2 dB和1.49 dB,具有良好的攻击效果和语音质量。

(2) 针对黑盒攻击过程中搜索复杂、生成扰动过大的问题,提出一种基于增强型神经预测器的黑盒语音对抗攻击方法。该方法在扰动空间中搜索最小扰动,通过自注意力神经预测器指导的优化过程找到最佳扰动方向,将该方向应用于原始样本以生成对抗样本;为提高搜索效率,设计了剪枝策略,在搜索早期阶段丢弃低于阈值的样本,减少搜索次数;最后根据查询自动语音识别系统的反馈结果引入动态因子,自适应地调整搜索步长,进一步加速搜索过程。为验证所提方法的性能,在LibriSpeech数据集上进行实验。与主流方法相比,本文方法信噪比提升了0.8 dB,样本相似度提升0.43%,查询次数平均降低7%,具有更好的攻击效果和隐蔽性。

(3) 基于上述方法,设计并开发了一个智能语音对抗攻击系统。该系统集成了本文所提攻击方法,用户可选择语音样本并自定义攻击策略与约束条件,从而实现针对特定场景的对抗攻击。该系统验证了所提对抗攻击方法的有效性,为挖掘自动语音识别系统的安全漏洞提供了有效支持。 |
论文外文摘要: |
With the rapid development and widespread application of automatic speech recognition (ASR) technology in fields such as autonomous driving, smart homes, and voice interaction, its security and robustness have become increasingly prominent concerns. Adversarial sample attacks, crafted from carefully designed minor perturbations, can force ASR systems to produce misrecognitions, potentially leading to serious safety incidents. To mitigate the potential security issues of ASR systems, researchers use adversarial samples to expose system vulnerabilities and thereby enhance their security and stability. Therefore, studying adversarial attack methods on speech recognition systems holds substantial theoretical significance and practical value.

Adversarial sample generation methods can be categorized into white-box attacks and black-box attacks according to the attacker's knowledge of the target model's internal information. Currently, the primary challenge faced by white-box attack methods is their difficulty in capturing correlations across different speech scales, which significantly reduces the attack success rate. Although black-box attack methods avoid direct access to the target model, their search process is highly complex and they tend to generate excessively large perturbations, which both increases the difficulty of the attack and reduces the concealment of the adversarial samples. To address these issues, this thesis conducts the following research:

(1) To address the insufficient capture of correlations across different speech scales in existing white-box attack methods, a speech adversarial attack method based on category-conditional generative adversarial networks (GANs) is proposed. This method uses a target label mapping module to convert the attack target labels into vectors, which are then fed as conditions into the category-conditional GAN to control the category of the generated samples. The designed NResidual U-block network structure, combined with the U-Net network, learns speech features across different time scales more effectively, thereby improving the quality and effectiveness of the adversarial samples. Experimental results on the Google Commands dataset and the Music Genre dataset show that, compared to mainstream methods, the proposed speech adversarial sample generation method increases the attack success rates by 3.47% and 5.1%, respectively, and improves the average signal-to-noise ratio (SNR) by 3.2 dB and 1.49 dB, demonstrating excellent attack effectiveness and speech quality.

(2) To tackle the challenges of high search complexity and excessive perturbation generation in black-box attacks, an enhanced neural-predictor-based black-box speech adversarial attack method is proposed. This method searches for the minimal perturbation within the perturbation space, using a self-attention neural predictor to guide the optimization process and identify the optimal perturbation direction, which is then applied to the original samples to generate adversarial samples. To improve search efficiency, a pruning strategy is designed that discards samples below a threshold in the early search stages, reducing the number of searches. Finally, a dynamic factor is introduced based on feedback from querying the ASR system to adaptively adjust the search step size, further accelerating the search process. To validate the performance of the proposed method, experiments were conducted on the LibriSpeech dataset. Compared to mainstream methods, the proposed method improves the SNR by 0.8 dB, improves sample similarity by 0.43%, and reduces the average number of queries by 7%, indicating better attack effectiveness and concealment.

(3) Based on the aforementioned methods, an intelligent speech adversarial attack system was designed and developed. This system integrates the proposed attack methods, allowing users to select speech samples and customize attack strategies and constraints to carry out adversarial attacks for specific scenarios. The system validates the practicality of the proposed methods and provides effective support for exposing security vulnerabilities in ASR systems. |
参考文献: |
[11] 刘宇宸, 宗成庆. 跨模态信息融合的端到端语音翻译[J]. 软件学报, 2022, 34(4): 1837-1849.
[20] 徐东伟, 房若尘, 蒋斌, 等. 语音对抗攻击与防御方法综述[J]. 信息安全学报, 2022, 7(1): 126-144.
[21] 张思思, 左信, 刘建伟. 深度学习中的对抗样本问题[J]. 计算机学报, 2019, 42(8): 1886-1904.
[22] 陈晋音, 沈诗婧, 苏蒙蒙, 等. 车牌识别系统的黑盒对抗攻击[J]. 自动化学报, 2021, 47(1): 121-135. |
中图分类号: | TP391.9 |
开放日期: | 2025-06-19 |