Vietnamese text recognition in scene images using deep learning



  • Huynh Van Huy, Lac Hong University
  • Nguyen Thi Thanh Tan (Corresponding Author), Electric Power University
  • Ngo Quoc Tao, Institute of Information Technology, Vietnam Academy of Science and Technology



Keywords: Detection; Recognition; Feature; Probability; Accuracy.


This article proposes an effective method for recognizing Vietnamese text in scene images. The method combines three processing tasks in a single recognition stage: (i) recognizing (predicting) character sequences from images; (ii) context processing; and (iii) fusion and iterative correction. The method was evaluated on two real-world Vietnamese scene image datasets, VinText and VnSceneText. Experimental results show that it is capable of detecting text of any shape and size with high and stable accuracy. Specifically, it achieves word-level and character-level accuracies of 81.87% and 93.02% on VinText, and 82.56% and 94.33% on VnSceneText.
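The word-level and character-level accuracies reported above can be illustrated with a minimal Python sketch. The abstract does not give the exact metric definitions, so the ones used here are assumptions: word-level accuracy is taken as the fraction of exact matches, and character-level accuracy as one minus the total edit distance divided by the total number of ground-truth characters. The function names are hypothetical. Strings are normalized to Unicode NFC first, since Vietnamese diacritics can be encoded in composed or decomposed form.

```python
import unicodedata


def normalize(text: str) -> str:
    """Normalize to NFC so composed and decomposed diacritics compare equal."""
    return unicodedata.normalize("NFC", text)


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_accuracy(preds, targets):
    """Assumed definition: fraction of predictions that match exactly."""
    pairs = [(normalize(p), normalize(t)) for p, t in zip(preds, targets)]
    return sum(p == t for p, t in pairs) / len(pairs)


def char_accuracy(preds, targets):
    """Assumed definition: 1 - (total edit distance / total target characters)."""
    pairs = [(normalize(p), normalize(t)) for p, t in zip(preds, targets)]
    total_chars = sum(len(t) for _, t in pairs)
    total_errors = sum(levenshtein(p, t) for p, t in pairs)
    return max(0.0, 1.0 - total_errors / total_chars)


if __name__ == "__main__":
    # Toy data for illustration only; not taken from VinText or VnSceneText.
    predictions = ["Hà", "Nội", "phở"]
    ground_truth = ["Hà", "Nội", "phố"]
    print("word-level accuracy:", word_accuracy(predictions, ground_truth))
    print("character-level accuracy:", char_accuracy(predictions, ground_truth))
```

Under these definitions a single wrong diacritic makes the whole word count as an error at the word level but costs only one character at the character level, which is consistent with character-level figures being higher than word-level ones.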





How to Cite

Huỳnh Văn Huy, N. Thi Thanh Tan, and Ngô Quốc Tạo. “Vietnamese Text Recognition in Scene Images Using Deep Learning”. Journal of Military Science and Technology, vol. 90, no. 90, Oct. 2023, pp. 140-9, doi:10.54939/1859-1043.j.mst.90.2023.140-149.


