Two-stage fine-tuning of Whisper with metric-driven dataset selection for domain-specific Vietnamese ASR
DOI: https://doi.org/10.54939/1859-1043.j.mst.CSCE9.2025.51-60

Keywords: Two-stage transfer learning; Automatic speech recognition; Whisper; Domain-specific dataset; Fréchet DeepSpeech distance.

Abstract
Automatic speech recognition (ASR) in low-resource, domain-specific settings remains challenging due to limited labeled data and domain mismatch. This paper proposes a framework that combines metric-based donor dataset selection with a two-stage fine-tuning strategy to adapt the Whisper model for Vietnamese military-specific ASR. The Fréchet DeepSpeech Distance (FDSD) metric is used to identify the general-domain dataset that is acoustically and phonetically closest to the target Military Information Retrieval (MIR) corpus. The VN-SLU dataset was selected for Stage 1 fine-tuning, bridging the domain gap before Stage 2 fine-tuning on MIR for domain specialization. Experimental evaluation on the MIR test set shows that the proposed method achieves a Word Error Rate (WER) of 3.49% and a Character Error Rate (CER) of 2.41%, outperforming direct fine-tuning and blended-data approaches. Loss-curve analysis confirms that Stage 1 adaptation accelerates convergence and mitigates overfitting in Stage 2. These results demonstrate that integrating metric-driven general-domain dataset selection with sequential fine-tuning is an effective and reproducible approach to improving ASR performance in low-resource, domain-specific scenarios.
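The abstract does not spell out the FDSD computation, so the following is a minimal sketch under stated assumptions. Like the Fréchet Inception Distance, FDSD fits a Gaussian to acoustic-model embeddings of each corpus and measures FDSD(A, B) = ‖μ_A − μ_B‖² + Tr(Σ_A + Σ_B − 2(Σ_A Σ_B)^{1/2}). The DeepSpeech feature-extraction step is abstracted away here; the array names and the `select_donor` helper are illustrative, not taken from the paper.

```python
# Minimal sketch of Fréchet DeepSpeech Distance (FDSD) for donor-dataset
# selection. Assumes per-clip DeepSpeech-style embeddings have already been
# extracted into (n_clips, dim) NumPy arrays; all names are illustrative.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two embedding sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):   # numerical noise can introduce tiny
        covmean = covmean.real     # imaginary parts; discard them
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

def select_donor(candidates: dict[str, np.ndarray],
                 target_feats: np.ndarray) -> str:
    """Pick the general-domain corpus with the lowest FDSD to the target."""
    scores = {name: frechet_distance(feats, target_feats)
              for name, feats in candidates.items()}
    return min(scores, key=scores.get)
```

Under this scheme, each candidate general-domain corpus is scored against the MIR embeddings, and the corpus with the lowest distance (VN-SLU in the paper) becomes the Stage 1 donor.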
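The two-stage schedule itself can likewise be sketched with the Hugging Face transformers API. This is an illustrative sketch, not the paper's recipe: the checkpoint, batch size, learning rates, and epoch counts are assumptions, and `vn_slu_train`, `mir_train`, and `collator` stand for datasets already preprocessed into Whisper input features and label token IDs plus a padding data collator.

```python
# Hedged sketch of sequential (two-stage) Whisper fine-tuning; all
# hyperparameters and data objects here are assumptions for illustration.
from transformers import (Seq2SeqTrainer, Seq2SeqTrainingArguments,
                          WhisperForConditionalGeneration)

def run_stage(model, train_set, collator, out_dir, lr, epochs):
    """One fine-tuning stage; returns the updated model for the next stage."""
    args = Seq2SeqTrainingArguments(
        output_dir=out_dir,
        per_device_train_batch_size=16,   # assumed; not reported here
        learning_rate=lr,
        num_train_epochs=epochs,
        fp16=True,                        # mixed precision on GPU (assumed)
    )
    trainer = Seq2SeqTrainer(model=model, args=args,
                             train_dataset=train_set, data_collator=collator)
    trainer.train()
    return trainer.model

def two_stage_finetune(vn_slu_train, mir_train, collator):
    model = WhisperForConditionalGeneration.from_pretrained(
        "openai/whisper-small")  # checkpoint size is an assumption
    # Stage 1: adapt on the FDSD-selected donor corpus (VN-SLU) to close
    # the gap between Whisper's pretraining distribution and Vietnamese speech.
    model = run_stage(model, vn_slu_train, collator, "stage1_vnslu",
                      lr=1e-5, epochs=3)
    # Stage 2: specialize on the target MIR corpus from Stage 1 weights.
    model = run_stage(model, mir_train, collator, "stage2_mir",
                      lr=5e-6, epochs=5)
    return model
```

The key design choice, per the abstract, is sequencing rather than blending: Stage 2 starts from the Stage 1 weights, which is what accelerates convergence and curbs overfitting on the small MIR corpus.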