Thesis Information

Title (Chinese):

 强噪声下的陕北方言语音识别系统研究

Name:

 翟蒙恩

Student ID:

 20208223071

Confidentiality:

 Public

Language:

 Chinese

Discipline code:

 085400

Discipline:

 Engineering - Electronic Information

Student type:

 Master's candidate

Degree:

 Master of Engineering

Degree year:

 2023

Institution:

 Xi'an University of Science and Technology

School:

 College of Computer Science and Technology

Major:

 Computer Technology

Research area:

 Speech recognition

Primary supervisor:

 董立红

Supervisor's institution:

 Xi'an University of Science and Technology

Submission date:

 2023-06-26

Defense date:

 2023-06-26

Title (English):

 Research on a Speech Recognition System for the Northern Shaanxi Dialect under Strong Noise

Keywords (Chinese):

 方言数据集; 方言语音识别; 强噪声; 去噪自编码器

Keywords (English):

 Dialect dataset; Dialect speech recognition; Strong noise; Denoising autoencoder

Abstract (Chinese, translated):

With the rapid development of speech recognition technology and its ever-wider range of application, good results have been achieved for major languages. In the meetings, dispatching, command, and other day-to-day communication of actual coal-mine production in northern Shaanxi, however, the Northern Shaanxi dialect is still used more frequently than Mandarin, so research on Northern Shaanxi dialect speech recognition has practical significance. To address three problems encountered in this research (strong coal-mine noise interfering with the speech signal, a shortage of dialect datasets, and a low dialect recognition rate), the following work was carried out.

To counter the impact of strong coal-mine noise on recognition accuracy, an improved stacked denoising autoencoder (SDAE) speech-denoising algorithm is proposed, which effectively removes strong-noise interference from the speech signal. Spectral subtraction is first applied to the noisy speech for an initial pass of noise removal, and the stacked denoising autoencoder then performs a second pass. Stacking the autoencoders speeds up training and mitigates vanishing gradients during decoding, achieving second-stage removal of the strong coal-mine noise; reconstructing the waveform yields relatively clean speech. The SDAE also resolves the boundary-definition, musical-noise, and parameter-adjustment problems of spectral subtraction. Evaluating the denoised speech with the speech-intelligibility metric NCM at signal-to-noise ratios of -15 dB, -10 dB, and -5 dB under different coal-mine noise conditions shows that the proposed DAE denoising algorithm with integrated spectral subtraction improves on several current mainstream denoising algorithms.
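The first-stage denoiser can be illustrated with a minimal spectral-subtraction sketch (NumPy only; the frame length, over-subtraction factor `alpha`, and noise floor `beta` are illustrative assumptions, not the thesis's actual parameters):

```python
import numpy as np

def spectral_subtraction(noisy, frame_len=256, noise_frames=10, alpha=1.0, beta=0.01):
    """Basic magnitude spectral subtraction (first-stage denoising sketch).

    Assumes the first `noise_frames` frames are noise-only, which is the
    usual way the noise spectrum is estimated in this scheme.
    """
    # Split the signal into non-overlapping frames (no windowing, for brevity).
    n = len(noisy) // frame_len
    frames = noisy[: n * frame_len].reshape(n, frame_len)
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)

    # Estimate the noise magnitude spectrum from the leading frames.
    noise_mag = mag[:noise_frames].mean(axis=0)

    # Subtract the noise estimate; floor the result to avoid negative
    # magnitudes (the floor `beta` trades off residual "musical noise").
    clean_mag = np.maximum(mag - alpha * noise_mag, beta * noise_mag)

    # Rebuild the waveform using the noisy phase (phase is left untouched).
    clean = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    return clean.reshape(-1)
```

The residual musical noise and the hard boundary assumptions of this sketch are exactly what the second-stage autoencoder is meant to clean up.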

To address the problem that dialect speech recognition rates are far below those for Mandarin, a speech recognition model with a CNN+TDNN-F neural network as its acoustic model is proposed. Fusing a convolutional neural network with a factorized time-delay neural network captures the spatial and temporal characteristics of the speech signal simultaneously and more accurately, improving recognition. The language model is built with the SRILM toolkit. With Kaldi as the speech recognition toolkit, the original dataset is augmented by a speed-perturbation algorithm with factors set to 0.9 and 1.1, tripling the amount of speech data; i-vector features are also used to increase the model's robustness. Sequence-discriminative training is performed with the Chain model, and the word error rate is obtained after decoding. Experiments show that the proposed CNN+TDNN-F acoustic model reduces the word error rate to 11.96%, a clear improvement in dialect recognition accuracy over earlier algorithms. In addition, the denoised speech was waveform-reconstructed and its error rate verified on this model: 12.11%, essentially on par with clean speech.
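Speed perturbation with factors 0.9 and 1.1 can be sketched as simple resampling (a NumPy stand-in for the sox-based resampler used in the Kaldi recipe; the function name and the use of linear interpolation are illustrative choices):

```python
import numpy as np

def speed_perturb(samples, factor):
    """Return `samples` resampled to play back `factor` times faster.

    A factor of 0.9 slows the utterance down (more samples), 1.1 speeds
    it up (fewer samples); keeping the original as well triples the data.
    """
    n_out = int(round(len(samples) / factor))
    src = np.arange(len(samples), dtype=float)
    dst = np.linspace(0.0, len(samples) - 1.0, n_out)
    # Linear interpolation onto the stretched/compressed time axis.
    return np.interp(dst, src, samples)
```

Applying both factors to every utterance and keeping the originals yields the threefold dataset described above.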

Finally, the thesis presents the requirements analysis, functional analysis, design, and implementation of the Northern Shaanxi dialect speech recognition system for strongly noisy coal-mine environments, and describes its deployment at the Xiaobaodang coal mine of Shaanbei Mining.

Abstract (English):

With the progress of science and technology, speech recognition has developed rapidly and its range of use keeps expanding; good results have been achieved for major languages. However, in the meetings, dispatching, command, and other communication of actual coal-mine production in northern Shaanxi, the Northern Shaanxi dialect is used more frequently than Mandarin, so research on its speech recognition has practical significance. To address the insufficiency of dialect datasets and the interference of strong noise in coal mines encountered during the research, the following work was carried out.

To counter the strong impact of coal-mine noise on recognition accuracy, the proposed algorithm first removes strong noise with spectral subtraction and then applies a stacked denoising autoencoder (DAE) for a second denoising pass. The spectral-subtraction stage shortens the DAE's training time, reduces its parameter count, and smooths signal fluctuations, making it easier for the DAE to learn the mapping between clean and noisy speech; the DAE in turn resolves the boundary-definition, musical-noise, and parameter-adjustment problems of spectral subtraction. Evaluating the NCM value of the denoised speech under different coal-mine environmental noises at signal-to-noise ratios of -15 dB, -10 dB, and -5 dB shows that the proposed DAE denoising algorithm with integrated spectral subtraction improves on several current mainstream denoising algorithms.
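The second-stage denoiser can be illustrated with a minimal single-layer denoising autoencoder in NumPy (the actual system stacks several such layers and trains on speech spectra; the dimensions, learning rate, and tanh activation here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def train_dae(clean, noise_std=0.1, hidden=16, lr=0.02, epochs=500):
    """Train a single-layer denoising autoencoder on (noisy -> clean) pairs.

    `clean` is an (n_samples, n_features) array; fresh noise is injected
    every epoch and the network learns to reconstruct the clean input.
    """
    n, d = clean.shape
    W1 = 0.1 * rng.standard_normal((d, hidden))
    b1 = np.zeros(hidden)
    W2 = 0.1 * rng.standard_normal((hidden, d))
    b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        noisy = clean + noise_std * rng.standard_normal(clean.shape)
        # Forward pass: corrupted input -> hidden code -> reconstruction.
        h = np.tanh(noisy @ W1 + b1)
        out = h @ W2 + b2
        err = out - clean
        losses.append(float(np.mean(err ** 2)))
        # Backward pass: gradient descent on the squared reconstruction
        # error (summed over features, averaged over samples).
        g_out = 2.0 * err / n
        gW2, gb2 = h.T @ g_out, g_out.sum(axis=0)
        g_h = (g_out @ W2.T) * (1.0 - h ** 2)
        gW1, gb1 = noisy.T @ g_h, g_h.sum(axis=0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2
    return losses
```

Because the corruption is resampled every epoch, the network cannot simply memorize inputs and is pushed toward a noise-robust representation, which is the property the stacked version exploits.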

Because the recognition rate for dialect speech is far lower than that for Mandarin, a new acoustic model (CNN+TDNN-F) is proposed. By combining a convolutional neural network with a factorized time-delay neural network, the spatial and temporal characteristics of the speech signal are captured simultaneously and more accurately, improving recognition. The language model is built with the SRILM toolkit. Using Kaldi as the speech recognition toolkit, the original dataset was expanded by a speed-perturbation algorithm with factors of 0.9 and 1.1, tripling the amount of speech data. i-vector features were also used to increase the robustness of the model. Finally, the Chain model was used for sequence-discriminative training. Experiments show that the proposed CNN+TDNN-F acoustic model reduces the word error rate to 11.96%, a clear improvement in dialect recognition accuracy over previous algorithms. In addition, the denoised speech was waveform-reconstructed and evaluated on this model, reaching a word error rate of 12.11%, essentially the same as for clean speech.
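The "F" in TDNN-F stands for factorizing each layer's weight matrix into a product of two thinner matrices, one of which is kept semi-orthogonal during training. A rough NumPy illustration of the parameter saving (using a truncated SVD rather than the trained constraint):

```python
import numpy as np

def low_rank_factorize(M, rank):
    """Split M (out_dim x in_dim) into A @ B with inner dimension `rank`.

    B has orthonormal rows here (i.e. it is semi-orthogonal), mirroring
    the constraint TDNN-F imposes on one factor during training.
    """
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # shape (out_dim, rank)
    B = Vt[:rank]                # shape (rank, in_dim), B @ B.T = I
    return A, B
```

For example, a 1536x1536 layer factored through a rank-160 bottleneck (a common Kaldi configuration) stores about 0.49M weights instead of 2.36M, which is what makes the factorized network cheaper to train on a modest dialect corpus.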

At the end of the thesis, the requirements analysis, functional analysis, design, and implementation of the Northern Shaanxi dialect speech recognition system for strong-noise coal-mine environments are presented, and the system was deployed at the Xiaobaodang coal mine of Shaanbei Mining.


CLC number:

 TP391

Open-access date:

 2023-06-26
