How to generate Muong speech directly from Vietnamese text: Cross-lingual speech synthesis for close language pair

Văn Đồng Phạm; Thi Ngoc Diep Do; Dang Khoa Mac; Viet Son Nguyen; Van Thinh Nguyen; Tien Thanh Nguyen; Do Dat Tran

doi:10.54939/1859-1043.j.mst.81.2022.138-147

Authors

Pham Van Dong (Corresponding author), University of Mining and Geology
Do Thi Ngoc Diep, Hanoi University of Science and Technology
Mac Dang Khoa, Big Data Research Institute - VinGroup
Nguyen Viet Son, Hanoi University of Science and Technology
Nguyen Van Thinh, Big Data Research Institute - VinGroup
Nguyen Tien Thanh, Viettel CyberSpace Center
Tran Do Dat, Ministry of Science and Technology

Keywords:

Machine translation; Speech synthesis; Minority languages; Vietnamese; Muong dialects; Unwritten languages; Cross-lingual speech synthesis.

Abstract

This paper presents a method for automatically translating Vietnamese text into Muong speech in two dialects, Muong Bi (Hòa Bình) and Muong Tân Sơn (Phú Thọ), neither of which has an official writing system. Because Vietnamese and Muong are very closely related, the translation system is built as a cross-lingual speech synthesis system: the input is text in one language (here, Vietnamese) and the output is speech in another (here, the two Muong dialects). The system uses two modern neural TTS models, the sequence-to-sequence model Tacotron2 and the vocoder WaveGlow. The evaluation yielded the following scores: fluency 4.61/5, adequacy 4.79/5, naturalness 4.63/5 on the MOS scale, and intelligibility 94.60%. These results suggest that the proposed system is promising for other minority languages, especially those without a writing system.
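The abstract reports a mean opinion score (MOS) and an intelligibility percentage without describing the scoring protocol. The sketch below shows how such figures are commonly computed, assuming listener ratings on a 1-5 scale and a simple count of correctly understood test items. The function names and the sample ratings are hypothetical, chosen only for illustration; they are not the paper's data or its evaluation code.

```python
from math import sqrt
from statistics import mean, stdev


def mos(ratings):
    """Return (mean score, 95% CI half-width) for 1-5 listener ratings."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    m = mean(ratings)
    # Normal-approximation interval; undefined for a single rating, so use 0.
    half = 1.96 * stdev(ratings) / sqrt(len(ratings)) if len(ratings) > 1 else 0.0
    return m, half


def intelligibility(correct, total):
    """Percentage of test items that listeners understood correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


# Hypothetical ratings, for illustration only (not data from the paper).
example_ratings = [5, 4, 5, 5, 4]
score, ci = mos(example_ratings)
print(f"MOS = {score:.2f} +/- {ci:.2f}")
print(f"Intelligibility = {intelligibility(47, 50):.2f}%")
```

Reporting the confidence interval alongside the mean is a common convention for listening tests; whether the authors did so is not stated here.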



ISSN: 1859-1043
