Enhancing Vietnamese Image Captioning Performance through Attention Fusion

Anh Cong Hoang1, Dinh Cong Nguyen, The Anh Pham2
1 University of Culture, Sports and Tourism
2 Hong Duc University

Abstract

This paper proposes a Vietnamese image captioning method based on attention-guided feature fusion, in which visual representations are integrated with semantic text embeddings extracted from a pre-trained model through an attention fusion module. This approach improves semantic alignment between the two modalities and enables the model to generate more informative and contextually appropriate descriptions than traditional baseline methods. Experiments on the UIT-ViIC and KTVIC datasets show consistent improvements across the BLEU, METEOR, and CIDEr evaluation metrics, demonstrating the effectiveness and practical feasibility of the proposed approach.
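The attention fusion described above can be illustrated with a minimal sketch: visual features act as queries attending over pre-trained text embeddings via scaled dot-product attention, and the attended semantic context is concatenated with the original visual features. This is only an assumed, simplified interpretation of the abstract (the function name `attention_fusion`, the feature dimensions, and the concatenation step are illustrative choices, not the authors' exact architecture):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fusion(visual, text):
    """Fuse visual features (queries) with text embeddings (keys/values)
    using scaled dot-product attention, then concatenate the attended
    semantic context with the original visual features.

    visual: (n_regions, d) image region/grid features
    text:   (n_tokens, d)  semantic embeddings from a pre-trained model
    returns: (n_regions, 2 * d) fused representation
    """
    d = visual.shape[-1]
    scores = visual @ text.T / np.sqrt(d)      # (n_regions, n_tokens)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    attended = weights @ text                  # (n_regions, d)
    return np.concatenate([visual, attended], axis=-1)

# Illustrative shapes: 49 grid features from a CNN, 10 text-token embeddings.
rng = np.random.default_rng(0)
visual = rng.standard_normal((49, 512))
text = rng.standard_normal((10, 512))
fused = attention_fusion(visual, text)         # shape: (49, 1024)
```

In a full captioning pipeline, the fused features would then condition a caption decoder (e.g. a Transformer) in place of the purely visual features used by baseline methods.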

Article Details

References

[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
[2] Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., & Zhang, L. (2018). Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6077-6086).
[3] Chen, X., Fang, H., Lin, T. Y., Vedantam, R., Gupta, S., Dollár, P., & Zitnick, C. L. (2015). Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.
[4] Li, L., Li, H., & Ren, P. (2025). Underwater image captioning via attention mechanism based fusion of visual and textual information. Information Fusion, 103269.
[5] Cheng, K., Liu, J., Mao, R., Wu, Z., & Cambria, E. (2025). Echo: Generating cross-modal features for unseen classes in zero-shot remote sensing image captioning. Information Fusion, 103952.
[6] Hoang Lam, Q., Duy Le, Q., Van Nguyen, K., & Luu-Thuy Nguyen, N. (2020). UIT-ViIC: A dataset for the first evaluation on Vietnamese image captioning. arXiv-2002.
[7] Pham, A. C., Nguyen, V. Q., Vuong, T. H., & Ha, Q. T. (2024). KTVIC: A Vietnamese image captioning dataset on the life domain. arXiv preprint arXiv:2401.08100.
[8] Doanh, B. C., Truc, T. T. T., Thuan, N. T., Vu, N. D., & Vo, N. D. (2022). vieCap4H challenge 2021: A transformer-based method for healthcare image captioning in Vietnamese. VNU Journal of Science: Computer Science and Communication Engineering, 38(2).
[9] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748-8763). PMLR.
[10] Vedantam, R., Lawrence Zitnick, C., & Parikh, D. (2015). CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4566-4575).
[11] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).