Speech Translation for Unwritten Language Using Intermediate Representation: Experiment for Viet-Muong Language Pair



  • Pham Van Dong (Corresponding Author), Hanoi University of Mining and Geology
  • Do Thi Ngoc Diep, Hanoi University of Science and Technology
  • Mac Dang Khoa, VinBigdata – VinGroup
  • Vu Thi Hai Ha, Vietnam Institute of Linguistics




Machine translation; Text to speech; Ethnic minority language; Vietnamese; Muong dialects; Unwritten languages; Intermediate representation; Phoneme representation.


This paper studies a method for automatically translating the text of one language (L1) into the speech of an unwritten language (L2). Normally, written text serves as the bridge between a translation module, which converts L1 text into L2 text, and a synthesis module, which generates L2 speech from that text. When L2 is unwritten, an intermediate representation must take the place of its written form. This paper proposes a phoneme representation for that role because of the close relationship between phonemes and speech within a language. The proposed method was applied to the Viet-Muong language pair: Vietnamese text is translated into two Muong dialects, Muong Bi (Hoa Binh) and Muong Tan Son (Phu Tho), both unwritten. The paper also proposes a phoneme set for each Muong dialect and applies it to the task. The evaluation showed that translation quality was relatively high in both dialects (for Muong Bi, the fluency score was 4.63/5.0 and the adequacy score was 4.56/5.0). The quality of the synthesized speech was acceptable in both dialects (for Muong Bi, the MOS was 4.47/5.0 and the comprehension score was 93.55%). The results also suggest that applying the proposed system to other unwritten languages is promising.
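
The two-stage design described in the abstract (L1 text to L2 phonemes, then L2 phonemes to L2 speech) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the class name, the toy lexicon, the placeholder symbols "P1" to "P3" (which are not real Muong phonemes), and the stub synthesizer are hypothetical and do not reproduce the authors' translation or synthesis models.

```python
from dataclasses import dataclass
from typing import Callable, List

# Intermediate representation: a sequence of L2 phoneme symbols.
PhonemeSeq = List[str]


@dataclass
class TextToSpeechTranslator:
    """Chains a translation stage and a synthesis stage (hypothetical names)."""
    translate: Callable[[str], PhonemeSeq]           # L1 text -> L2 phonemes
    synthesize: Callable[[PhonemeSeq], List[float]]  # L2 phonemes -> samples

    def __call__(self, l1_text: str) -> List[float]:
        # The phoneme sequence replaces the written L2 text as the bridge.
        phonemes = self.translate(l1_text)
        return self.synthesize(phonemes)


# Toy stand-in for a trained translation model.
# The symbols are placeholders, NOT real Muong phonemes.
TOY_LEXICON = {"hello": ["P1", "P2"], "world": ["P3"]}


def toy_translate(text: str) -> PhonemeSeq:
    """Word-by-word lookup; unknown words map to a single <unk> symbol."""
    phonemes: PhonemeSeq = []
    for word in text.lower().split():
        phonemes.extend(TOY_LEXICON.get(word, ["<unk>"]))
    return phonemes


SAMPLES_PER_PHONEME = 4  # arbitrary value for the stub synthesizer


def toy_synthesize(phonemes: PhonemeSeq) -> List[float]:
    """Stub synthesizer: emits silence, a fixed number of samples per phoneme."""
    return [0.0] * (SAMPLES_PER_PHONEME * len(phonemes))


pipeline = TextToSpeechTranslator(toy_translate, toy_synthesize)
waveform = pipeline("hello world")
```

Because the two stages only share the phoneme sequence, either one could in principle be replaced (for example, by a trained translation model or a neural synthesizer) without changing the other.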






How to Cite

Pham Van Dong, Do Thi Ngoc Diep, Mac Dang Khoa, and Vu Thi Hai Ha. “Speech Translation for Unwritten Language Using Intermediate Representation: Experiment for Viet-Muong Language Pair”. Journal of Military Science and Technology, no. CSCE6, Dec. 2022, pp. 65-76, doi:10.54939/1859-1043.j.mst.CSCE6.2022.65-76.


