Enhancing Vietnamese Image Captioning Performance through Attention Fusion

Anh Cong Hoang1, Dinh Cong Nguyen, The Anh Pham2
1 University of Culture, Sports and Tourism
2 Hong Duc University

Abstract

This paper proposes a Vietnamese image captioning method based on attention-guided feature fusion, in which visual representations are integrated with semantic text embeddings extracted from a pre-trained model through an attention fusion module. This approach improves semantic alignment between the two modalities and enables the model to generate more informative and contextually appropriate descriptions than traditional baseline methods. Experiments on the UIT-ViIC and KTVIC datasets show consistent improvements across the BLEU, METEOR, and CIDEr evaluation metrics, demonstrating the effectiveness and practical feasibility of the proposed approach.
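The attention fusion described above can be illustrated with a minimal sketch: visual features act as queries attending over pre-trained text embeddings via scaled dot-product attention, and the attended semantic context is concatenated with the original visual features. This is only an assumed, simplified interpretation of the abstract (the function name `attention_fusion`, the feature dimensions, and the concatenation step are illustrative choices, not the authors' exact architecture):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fusion(visual, text):
    """Fuse visual features (queries) with text embeddings (keys/values)
    using scaled dot-product attention, then concatenate the attended
    semantic context with the original visual features.

    visual: (n_regions, d) image region/grid features
    text:   (n_tokens, d)  semantic embeddings from a pre-trained model
    returns: (n_regions, 2 * d) fused representation
    """
    d = visual.shape[-1]
    scores = visual @ text.T / np.sqrt(d)      # (n_regions, n_tokens)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    attended = weights @ text                  # (n_regions, d)
    return np.concatenate([visual, attended], axis=-1)

# Illustrative shapes: 49 grid features from a CNN, 10 text-token embeddings.
rng = np.random.default_rng(0)
visual = rng.standard_normal((49, 512))
text = rng.standard_normal((10, 512))
fused = attention_fusion(visual, text)         # shape: (49, 1024)
```

In a full captioning pipeline, the fused features would then condition a caption decoder (e.g. a Transformer) in place of the purely visual features used by baseline methods.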

Article Details

References

[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
[2] Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., & Zhang, L. (2018). Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6077-6086).
[3] Chen, X., Fang, H., Lin, T. Y., Vedantam, R., Gupta, S., Dollár, P., & Zitnick, C. L. (2015). Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.
[4] Li, L., Li, H., & Ren, P. (2025). Underwater image captioning via attention mechanism based fusion of visual and textual information. Information Fusion, 103269.
[5] Cheng, K., Liu, J., Mao, R., Wu, Z., & Cambria, E. (2025). Echo: Generating cross-modal features for unseen classes in zero-shot remote sensing image captioning. Information Fusion, 103952.
[6] Hoang Lam, Q., Duy Le, Q., Van Nguyen, K., & Luu-Thuy Nguyen, N. (2020). UIT-ViIC: A dataset for the first evaluation on Vietnamese image captioning. arXiv-2002.
[7] Pham, A. C., Nguyen, V. Q., Vuong, T. H., & Ha, Q. T. (2024). KTVIC: A Vietnamese image captioning dataset on the life domain. arXiv preprint arXiv:2401.08100.
[8] Doanh, B. C., Truc, T. T. T., Thuan, N. T., Vu, N. D., & Vo, N. D. (2022). vieCap4H challenge 2021: A transformer-based method for healthcare image captioning in Vietnamese. VNU Journal of Science: Computer Science and Communication Engineering, 38(2).
[9] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748-8763). PMLR.
[10] Vedantam, R., Lawrence Zitnick, C., & Parikh, D. (2015). CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4566-4575).
[11] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).