Thesis Information

Chinese title:

 基于生成对抗网络的矿井下语音增强算法研究    

Name:

 刘众奇    

Student ID:

 20208223038    

Confidentiality level:

 Confidential (open after 1 year)

Thesis language:

 Chinese (chi)

Discipline code:

 085400    

Discipline name:

 Engineering - Electronic Information

Student type:

 Master's student

Degree level:

 Master of Engineering

Degree year:

 2023    

Degree-granting institution:

 西安科技大学    

Department:

 College of Computer Science and Technology

Major:

 Software Engineering

Research direction:

 Speech Signal Processing

First supervisor:

 张昭昭    

First supervisor's institution:

 西安科技大学    

Second supervisor:

 陈小林    

Thesis submission date:

 2023-06-14    

Thesis defense date:

 2023-06-05    

English title:

 Research on Underground Speech Enhancement Algorithm Based on Generative Adversarial Network

Chinese keywords:

 语音增强 ; 生成对抗网络 ; 矿井噪声 ; 通道注意力机制    

English keywords:

 Speech enhancement; Generative adversarial network; Mine noise; Channel attention mechanism

Chinese abstract:

Achieving speech communication in environments with strong noise and low-frequency noise is a highly challenging task. As modern communication technology becomes ever more advanced, people's requirements for voice assistants and communication systems keep rising, and speech enhancement techniques for complex environments have gradually drawn the attention of researchers. The main function of speech enhancement is to improve the quality and intelligibility of speech signals and thereby raise the efficiency of information exchange between people. Today, with the rapid development of artificial intelligence, speech enhancement algorithms have achieved a qualitative leap and have become an important driver of progress in modern communication technology. However, in underground mines with multiple, non-stationary, and unknown noise sources, traditional speech enhancement algorithms often cannot remove the noise effectively and the enhancement results are poor, so information exchange and voice input underground are very difficult, and workers' lives may even be threatened. Therefore, this thesis studies speech enhancement algorithms for the underground mine environment; the work of the thesis is as follows:

(1) The training of generative adversarial network (GAN) algorithms is unstable and prone to gradient explosion, and, as in traditional methods, the network is mostly trained on frequency-domain speech features, so the phase information of the speech signal is ignored. To address this problem, this thesis proposes a time-domain GAN speech enhancement algorithm that combines a relativistic loss with a gradient penalty term. First, the idea of the relativistic GAN is introduced to optimize the training process and restructure the network training flow. Second, the Huber loss function is introduced to improve the stability of gradient propagation during training. Experiments on different data sets show that, compared with the unimproved model, the proposed algorithm improves five speech evaluation metrics by an average of 0.048, 0.109, 0.124, 0.084, and 0.004, respectively.

(2) To address the insufficient feature extraction capability caused by the structural limitations of the GAN itself, which lowers the quality of the generated speech, this thesis proposes a speech enhancement algorithm optimized with a channel attention mechanism. A one-dimensional convolution module replaces the two fully connected layers of the original model to avoid the loss of channel information, and the one-dimensional convolution inside the channel attention mechanism is replaced with dilated convolution blocks to enlarge the receptive field of the convolution module and capture more feature information. Experimental results show that, compared with the unimproved model, the GAN with the channel attention mechanism improves the five selected metrics by an average of 0.132, 0.091, 0.171, 0.208, and 0.026, respectively.

Finally, this thesis designs a speech enhancement system for underground mines to demonstrate the practicality and application value of the proposed algorithms; the implemented algorithms are integrated into the speech enhancement system, and their advantages are presented through visualized results.

English abstract:

Achieving speech communication under strong and low-frequency noise is a challenging task. As modern communication technology advances, people's demands on voice assistants and communication systems grow ever higher, so speech enhancement technology for complex environments has gradually received the attention of researchers. The main function of speech enhancement technology is to improve the quality and intelligibility of speech signals, so as to enhance the efficiency of information transmission between people. With the rapid development of artificial intelligence, speech enhancement algorithms have achieved a qualitative leap and have become an important booster of the continuous improvement of modern communication technology. However, in mines with multiple, non-stationary, and unknown noise sources, traditional speech enhancement algorithms are often unable to remove noise effectively and their enhancement results are poor, making information interaction and voice input in the mine extremely inconvenient and even threatening workers' lives. Therefore, this thesis focuses on speech enhancement algorithms for the underground mine environment; the main work is as follows:

(1) The training of the generative adversarial network (GAN) algorithm is unstable and prone to gradient explosion, and, like traditional methods, it mostly trains the network on frequency-domain speech information, thus ignoring the phase information of the speech signal. To solve this problem, this thesis proposes a time-domain GAN speech enhancement algorithm that combines a relativistic loss with a gradient penalty term. First, the idea of the relativistic GAN is introduced to optimize and restructure the network training process. Second, the Huber loss function is introduced to improve the stability of gradient propagation during training. Experiments on different data sets show that, compared with the original model, the proposed algorithm improves five speech evaluation metrics by an average of 0.048, 0.109, 0.124, 0.084, and 0.004; a minimal sketch of this loss formulation is given after the abstract.

(2) In view of the structural limitations of the generative adversarial network itself, which lead to insufficient feature extraction capability and degrade the quality of the generated speech, this thesis proposes a speech enhancement algorithm optimized with a channel attention mechanism. A one-dimensional convolution module replaces the two fully connected layers in the original model to avoid channel information loss, and the one-dimensional convolution inside the channel attention mechanism is replaced with dilated convolution blocks to increase the receptive field of the convolutional module and obtain more feature information. Experimental results show that, compared with the original model, the GAN with the channel attention mechanism improves the five selected metrics by an average of 0.132, 0.091, 0.171, 0.208, and 0.026; a sketch of such an attention module also follows the abstract.

Finally, this thesis designs a speech enhancement system for underground mines to demonstrate the practicality and application value of the proposed algorithms. The implemented algorithms are integrated into the speech enhancement system, and their advantages are presented through visualized results.
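
A note on the first contribution above: it combines a relativistic adversarial loss, a gradient penalty, and a Huber loss for a time-domain enhancement network. The following is a minimal sketch of how such a loss formulation could be written, assuming a PyTorch implementation; the conditional discriminator interface D(signal, noisy), the tensor shapes, and the weights gp_weight and huber_weight are illustrative assumptions rather than the thesis's exact configuration.

# Minimal sketch (PyTorch assumed): relativistic average least-squares GAN losses
# with a gradient penalty and a Huber (smooth L1) term for time-domain enhancement.
# D(signal, noisy) is a hypothetical conditional discriminator; waveforms are (B, 1, T).
import torch
import torch.nn.functional as F

def discriminator_loss(D, clean, enhanced, noisy, gp_weight=10.0):
    """Relativistic discriminator loss: clean speech should score higher, on
    average, than enhanced speech, plus a gradient penalty on interpolates."""
    d_real = D(clean, noisy)
    d_fake = D(enhanced.detach(), noisy)
    loss = 0.5 * ((d_real - d_fake.mean() - 1.0) ** 2).mean() \
         + 0.5 * ((d_fake - d_real.mean() + 1.0) ** 2).mean()
    # Gradient penalty on random interpolations between clean and enhanced waveforms.
    alpha = torch.rand(clean.size(0), 1, 1, device=clean.device)
    interp = (alpha * clean + (1 - alpha) * enhanced.detach()).requires_grad_(True)
    grads = torch.autograd.grad(D(interp, noisy).sum(), interp, create_graph=True)[0]
    gp = ((grads.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()
    return loss + gp_weight * gp

def generator_loss(D, clean, enhanced, noisy, huber_weight=100.0):
    """Relativistic generator loss plus a Huber term between the enhanced
    waveform and the clean reference, for more stable gradient propagation."""
    d_real = D(clean, noisy)
    d_fake = D(enhanced, noisy)
    adv = 0.5 * ((d_fake - d_real.mean() - 1.0) ** 2).mean() \
        + 0.5 * ((d_real - d_fake.mean() + 1.0) ** 2).mean()
    return adv + huber_weight * F.smooth_l1_loss(enhanced, clean)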
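
For the second contribution, the sketch below illustrates one way a channel attention module could use a one-dimensional dilated convolution in place of two fully connected layers, again assuming PyTorch; the class name DilatedChannelAttention1d, the kernel size, and the dilation factor are hypothetical choices, not the thesis's reported settings.

# Minimal sketch (PyTorch assumed): ECA-style channel attention for 1-D feature
# maps, with the usual fully connected layers replaced by a dilated 1-D convolution
# over the channel dimension to enlarge the receptive field across channels.
import torch
import torch.nn as nn

class DilatedChannelAttention1d(nn.Module):
    """Channel attention for feature maps of shape (batch, channels, time)."""
    def __init__(self, kernel_size=3, dilation=2):
        super().__init__()
        padding = dilation * (kernel_size - 1) // 2   # keeps the channel length unchanged
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=padding,
                              dilation=dilation, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                  # x: (B, C, T)
        y = x.mean(dim=-1)                 # global average pooling over time -> (B, C)
        y = self.conv(y.unsqueeze(1))      # dilated 1-D conv over channels   -> (B, 1, C)
        w = self.sigmoid(y).squeeze(1)     # per-channel weights in (0, 1)    -> (B, C)
        return x * w.unsqueeze(-1)         # rescale each channel

# Usage: reweight the output of a convolutional block in the generator.
feat = torch.randn(4, 64, 16384)
out = DilatedChannelAttention1d()(feat)    # same shape, channel-reweighted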


CLC number:

 TN912    

Open access date:

 2024-06-21    

