DATA AUGMENTATION FOR VIETNAMESE-ENGLISH STATISTICAL MACHINE TRANSLATION USING BACK-TRANSLATION AND ADAPTIVE SELECTION TECHNIQUE

Authors

  • Dang Thanh Quyen (Corresponding Author) Military Information Technology Institute, Academy of Military Science and Technology

Keywords:

Back-translation; Statistical machine translation; Data augmentation.

Abstract

Back-translation (BT) has become an effective data augmentation technique in neural machine translation, especially for low-resource languages. Most BT research focuses on neural machine translation of European languages. In this article, we study applying BT to improve the quality of training data for Vietnamese-English statistical machine translation. We propose two adaptive measures to evaluate the generated English sentences and select the “good” ones to add to the training data. Experimental results with the MOSES statistical machine translation system on the Vietnamese-English language pair show that the proposed method yields an improvement of approximately 0.8 BLEU.
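To make the selection step concrete, the following is a minimal sketch of filtering generated English sentences before adding them to the training data. The abstract does not define the two adaptive measures, so this sketch substitutes a different, well-known criterion: a cross-entropy difference between an in-domain and a general-domain language model, here reduced to add-one-smoothed unigram models. The function names, the threshold, and the toy data are all hypothetical and are not taken from the article.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count tokens in whitespace-tokenized, lowercased sentences."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    return counts, sum(counts.values())


def cross_entropy(sentence, model, vocab_size):
    """Per-token cross-entropy (bits) under an add-one-smoothed unigram model."""
    counts, total = model
    tokens = sentence.lower().split()
    if not tokens:
        return float("inf")
    log_prob = sum(
        math.log2((counts[tok] + 1) / (total + vocab_size)) for tok in tokens
    )
    return -log_prob / len(tokens)


def select_synthetic(pairs, in_domain, general, threshold=0.0):
    """Keep (source, generated_english) pairs whose English side looks in-domain.

    Assumed score (not the article's measure): H_in_domain - H_general.
    Lower values mean the sentence is closer to the in-domain data.
    """
    in_model = train_unigram(in_domain)
    gen_model = train_unigram(general)
    # Shared vocabulary size, plus one slot for unseen tokens.
    vocab_size = len(set(in_model[0]) | set(gen_model[0])) + 1
    kept = []
    for src, eng in pairs:
        score = cross_entropy(eng, in_model, vocab_size) - cross_entropy(
            eng, gen_model, vocab_size
        )
        if score < threshold:
            kept.append((src, eng))
    return kept


# Toy data for illustration only.
in_domain = ["the army trains soldiers", "soldiers defend the border"]
general = ["the cat sat on the mat", "a dog chased the cat"]
pairs = [
    ("vi_sentence_1", "soldiers defend the army"),
    ("vi_sentence_2", "the cat chased a dog"),
]
selected = select_synthetic(pairs, in_domain, general)
```

In a real pipeline the unigram models would likely be replaced by stronger n-gram language models, and the threshold would be tuned on held-out data by checking the BLEU score of the retrained system.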

Published

15-12-2020

How to Cite

Đặng Thanh Quyền. “DATA AUGMENTATION FOR VIETNAMESE-ENGLISH STATISTICAL MACHINE TRANSLATION USING BACK-TRANSLATION AND ADAPTIVE SELECTION TECHNIQUE”. Journal of Military Science and Technology, no. csce4, Dec. 2020, pp. 23-32, https://online.jmst.info/index.php/jmst/article/view/330.