A simulation-to-real image translation approach via diffusion model for military vehicle recognition

Authors

  • Le Thi Thu Hong, Institute of Information Technology and Electronics, Academy of Military Science and Technology
  • Nguyen Chi Thanh, Institute of Information Technology and Electronics, Academy of Military Science and Technology
  • Dang Hoang Minh, Institute of Information Technology and Electronics, Academy of Military Science and Technology
  • Pham Van Tung, School of Information and Communication Technology, Hanoi University of Industry
  • Pham Thu Huong, Institute of Information Technology and Electronics, Academy of Military Science and Technology

DOI:

https://doi.org/10.54939/1859-1043.j.mst.CSCE9.2025.61-71

Keywords:

Diffusion model; Image synthesis; Sim-to-real transfer; Military vehicle detection.

Abstract

The research and development of automated systems for military vehicle recognition are crucial for enhancing commanders' situational awareness, thereby improving combat effectiveness and mission accomplishment. Applying computer vision in such systems requires large-scale, diverse training datasets encompassing images of target objects captured under varying temporal and spatial conditions. In practice, however, military vehicle imagery, particularly of adversary assets, is difficult to acquire, limited in quantity, and costly in time and resources. Synthetic simulation-based data provide a cost-effective alternative, but the visual gap between simulated and real-world images degrades model performance in real deployments. In this paper, we propose a diffusion-based data generation method for simulation-to-real domain adaptation, enabling the synthesis of realistic, labeled images from simulation data for training recognition models. Specifically, we develop an image generation framework based on the Flex.2 diffusion model quantized to FP4 precision, guided by edge maps extracted with the Canny filter. Experimental results on a military vehicle dataset demonstrate a substantial improvement in image quality, with the Fréchet Inception Distance (FID) against real-world imagery reduced from 270 to 162. These findings highlight the scalability and flexibility of our approach as a practical solution for simulation-to-real image translation, ultimately improving the generalization and reliability of military vehicle recognition models.
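
The abstract describes conditioning the diffusion model on Canny edge maps extracted from simulated renders. The snippet below is a minimal sketch of that preprocessing step only, not the authors' implementation: the file names, thresholds, and the helper name canny_condition are illustrative assumptions, and the actual Flex.2 conditioning interface is not reproduced here.

```python
# Minimal sketch (assumptions noted above): extract a Canny edge map from a
# simulated render so it can serve as the structural condition for an
# edge-guided diffusion pipeline.
import cv2
import numpy as np

def canny_condition(sim_image_path: str, low: int = 100, high: int = 200) -> np.ndarray:
    """Load a simulated image and return a 3-channel Canny edge map."""
    image = cv2.imread(sim_image_path)                # BGR uint8 image
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)    # Canny expects a single channel
    edges = cv2.Canny(gray, low, high)                # binary edge map
    # Replicate to 3 channels, the layout most image-conditioned pipelines expect.
    return np.stack([edges] * 3, axis=-1)

if __name__ == "__main__":
    # Hypothetical file names for illustration only.
    condition = canny_condition("sim_vehicle_render.png")
    cv2.imwrite("sim_vehicle_canny.png", condition)
```

The reported FID improvement (270 to 162 against real-world imagery) can be reproduced with standard FID implementations such as pytorch-fid or torchmetrics' FrechetInceptionDistance applied to the real and generated image sets.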

Published

2025-12-31

How to Cite

[1]
Le Thi Thu Hong, Nguyen Chi Thanh, Dang Hoang Minh, Pham Van Tung, and Pham Thu Huong, “A simulation-to-real image translation approach via diffusion model for military vehicle recognition”, JMST’s CSCE, no. CSCE9, pp. 61–71, Dec. 2025.

Issue

Section

Articles