Automated AI-Based Image Captioning: A Transformer-Based Approach for Natural Language Generation from Visual Data

Rahul Cherekar


Volume 2 Issue 2 | 2025
Paper Id: IJMSM-V2I2P105
doi: 10.64137/30485037/V2I2P105

Citation:
Rahul Cherekar, "Automated AI-Based Image Captioning: A Transformer-Based Approach for Natural Language Generation from Visual Data," International Journal of Multidisciplinary on Science and Management, Vol. 2, No. 2, pp. 57-65, 2025.

Abstract:
Image captioning, the task of generating relevant textual descriptions from visual information, sits at the intersection of computer vision and Natural Language Processing (NLP). This paper presents a transformer-based deep-learning architecture for image captioning that uses attention mechanisms to produce improved captions. Unlike conventional CNN-RNN-based models, transformers process their input in parallel, which allows faster training and better contextual understanding. The research explores Vision Transformer (ViT) and Contrastive Language-Image Pretraining (CLIP) as transformer-based models that are combined with language models to improve captioning results. On the benchmark datasets MS COCO and Flickr8k, the proposed methodology outperforms conventional models. Experimental evaluations show improved BLEU, METEOR, and CIDEr scores, supporting its effectiveness. The paper also discusses present obstacles and future prospects in automated image captioning.

Keywords: Image Captioning, Transformers, Vision Transformer, Natural Language Processing, Deep Learning, Attention Mechanism, CLIP, MS COCO.
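
The attention mechanism named in the abstract can be illustrated with a small sketch. The paper's own implementation is not shown on this page, so the code below is only a generic scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, written in plain Python. The function names and the toy "patch" and "token" vectors are hypothetical and chosen for illustration; they are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Generic attention: each query takes a weighted average of the values.

    queries: list of vectors (e.g. caption-token states)
    keys, values: lists of vectors (e.g. image-patch embeddings)
    Returns (outputs, weights), one row per query.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        w = softmax(scores)
        # Weighted sum of the value vectors, one output component at a time.
        out = [
            sum(w_j * v[c] for w_j, v in zip(w, values))
            for c in range(len(values[0]))
        ]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


# Toy data (assumed, not from the paper): one caption-token query
# attending over three image-patch embeddings.
patch_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
patch_values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
token_query = [[1.0, 0.0]]

outputs, weights = scaled_dot_product_attention(
    token_query, patch_keys, patch_values
)
```

In this toy case the query aligns equally with the first and third patches, so they receive equal weight and the second patch receives less. Because every query is handled independently of the others, the loop over queries can be run in parallel, which is the property the abstract contrasts with sequential CNN-RNN decoding.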

