Two-stage fine-tuning of Whisper with metric-driven dataset selection for domain-specific Vietnamese ASR
DOI: https://doi.org/10.54939/1859-1043.j.mst.CSCE9.2025.51-60

Keywords: Two-stage transfer learning; Automatic speech recognition; Whisper; Domain-specific dataset; Fréchet DeepSpeech distance.

Abstract
Automatic speech recognition (ASR) in low-resource, domain-specific settings remains challenging due to limited labeled data and domain mismatch. This paper proposes a framework that combines metric-based donor dataset selection with a two-stage fine-tuning strategy to adapt the Whisper model for Vietnamese military-specific ASR. The Fréchet DeepSpeech Distance (FDSD) metric is used to identify the general-domain dataset that is acoustically and phonetically closest to the target Military Information Retrieval (MIR) corpus. The VN-SLU dataset was selected for Stage 1 fine-tuning, bridging the domain gap before Stage 2 fine-tuning on MIR for domain specialization. Experimental evaluation on the MIR test set shows that the proposed method achieves a Word Error Rate (WER) of 3.49% and a Character Error Rate (CER) of 2.41%, outperforming direct fine-tuning and blended-data approaches. Loss-curve analysis confirms that Stage 1 adaptation accelerates convergence and mitigates overfitting in Stage 2. These results demonstrate that integrating metric-driven general-domain dataset selection with sequential fine-tuning is an effective and reproducible approach to improving ASR performance in low-resource, domain-specific scenarios.
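The abstract does not spell out the FDSD computation, so the following is a minimal sketch under stated assumptions. Like the Fréchet Inception Distance, FDSD fits a Gaussian to acoustic-model embeddings of each corpus and measures FDSD(A, B) = ‖μ_A − μ_B‖² + Tr(Σ_A + Σ_B − 2(Σ_A Σ_B)^{1/2}). The DeepSpeech feature-extraction step is abstracted away here; the array names and the `select_donor` helper are illustrative, not taken from the paper.

```python
# Minimal sketch of Fréchet DeepSpeech Distance (FDSD) for donor-dataset
# selection. Assumes per-clip DeepSpeech-style embeddings have already been
# extracted into (n_clips, dim) NumPy arrays; all names are illustrative.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two embedding sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):   # numerical noise can introduce tiny
        covmean = covmean.real     # imaginary parts; discard them
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

def select_donor(candidates: dict[str, np.ndarray],
                 target_feats: np.ndarray) -> str:
    """Pick the general-domain corpus with the lowest FDSD to the target."""
    scores = {name: frechet_distance(feats, target_feats)
              for name, feats in candidates.items()}
    return min(scores, key=scores.get)
```

Under this scheme, each candidate general-domain corpus is scored against the MIR embeddings, and the corpus with the lowest distance (VN-SLU in the paper) becomes the Stage 1 donor.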
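The two-stage schedule itself can likewise be sketched with the Hugging Face transformers API. This is an illustrative sketch, not the paper's recipe: the checkpoint, batch size, learning rates, and epoch counts are assumptions, and `vn_slu_train`, `mir_train`, and `collator` stand for datasets already preprocessed into Whisper input features and label token IDs plus a padding data collator.

```python
# Hedged sketch of sequential (two-stage) Whisper fine-tuning; all
# hyperparameters and data objects here are assumptions for illustration.
from transformers import (Seq2SeqTrainer, Seq2SeqTrainingArguments,
                          WhisperForConditionalGeneration)

def run_stage(model, train_set, collator, out_dir, lr, epochs):
    """One fine-tuning stage; returns the updated model for the next stage."""
    args = Seq2SeqTrainingArguments(
        output_dir=out_dir,
        per_device_train_batch_size=16,   # assumed; not reported here
        learning_rate=lr,
        num_train_epochs=epochs,
        fp16=True,                        # mixed precision on GPU (assumed)
    )
    trainer = Seq2SeqTrainer(model=model, args=args,
                             train_dataset=train_set, data_collator=collator)
    trainer.train()
    return trainer.model

def two_stage_finetune(vn_slu_train, mir_train, collator):
    model = WhisperForConditionalGeneration.from_pretrained(
        "openai/whisper-small")  # checkpoint size is an assumption
    # Stage 1: adapt on the FDSD-selected donor corpus (VN-SLU) to close
    # the gap between Whisper's pretraining distribution and Vietnamese speech.
    model = run_stage(model, vn_slu_train, collator, "stage1_vnslu",
                      lr=1e-5, epochs=3)
    # Stage 2: specialize on the target MIR corpus from Stage 1 weights.
    model = run_stage(model, mir_train, collator, "stage2_mir",
                      lr=5e-6, epochs=5)
    return model
```

The key design choice, per the abstract, is sequencing rather than blending: Stage 2 starts from the Stage 1 weights, which is what accelerates convergence and curbs overfitting on the small MIR corpus.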