A transformer-based Urdu image caption generation

Hadi, Muhammad, Safder, Iqra, Waheed, Hajra, Zaman, Farooq, Aljohani, Naif Radi, Nawaz, Raheel, Hassan, Saeed Ul and Sarwar, Raheem ORCID: https://orcid.org/0000-0002-0640-807X (2024) A transformer-based Urdu image caption generation. Journal of Ambient Intelligence and Humanized Computing, 15 (9). pp. 3441-3457. ISSN 1868-5137

Published Version
Available under License Creative Commons Attribution.

Official URL: http://dx.doi.org/10.1007/s12652-024-04824-9

Abstract

Image caption generation has emerged as a remarkable development that bridges the gap between Natural Language Processing (NLP) and Computer Vision (CV). It lies at the intersection of these fields and presents unique challenges, particularly for low-resource languages such as Urdu. Research on basic Urdu language understanding is limited, which calls for further exploration in this domain. In this study, we propose three Seq2Seq-based architectures specifically tailored for Urdu image caption generation. Our approach leverages transformer models to generate captions in Urdu, a significantly more challenging task than captioning in English. To train and evaluate our models, we created an Urdu-translated subset of the Flickr8k dataset, which contains images of dogs in action paired with corresponding Urdu captions. Our models follow a deep learning-based approach and use three architectures: a Convolutional Neural Network (CNN) + Long Short-Term Memory (LSTM) model with soft attention and Word2Vec embeddings, a CNN+Transformer model, and a ViT+RoBERTa model. Experimental results demonstrate that our proposed model outperforms existing state-of-the-art approaches, achieving 86 BLEU-1 and 90 BERT-F1 scores. The generated Urdu image captions exhibit syntactic, contextual, and semantic correctness. Our study highlights the inherent challenges of retraining models on low-resource languages, and our findings point to the potential of pre-trained models for facilitating the development of NLP and CV applications in low-resource language settings.
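
The BLEU-1 score reported in the abstract is, in its standard form, a clipped unigram precision multiplied by a brevity penalty. The sketch below illustrates that standard definition at the sentence level. It is not the authors' evaluation code: the abstract does not say which BLEU implementation or tokenization was used, so the function name `bleu1`, the whitespace tokenization, and the example captions are all assumptions made for illustration. Scores come out in the range 0 to 1; papers often report them multiplied by 100.

```python
import math
from collections import Counter


def bleu1(candidate, references):
    """Sentence-level BLEU-1 (illustrative sketch, not the paper's code).

    candidate:  generated caption as a string
    references: list of reference caption strings
    Tokenization is plain whitespace splitting (an assumption).
    """
    cand = candidate.split()
    if not cand:
        return 0.0
    refs = [r.split() for r in references]

    # For each token, the largest count seen in any single reference.
    max_ref = Counter()
    for ref in refs:
        for tok, n in Counter(ref).items():
            max_ref[tok] = max(max_ref[tok], n)

    # Clipped unigram precision: a candidate token is credited at most
    # as many times as it appears in the most generous reference.
    clipped = sum(min(n, max_ref[tok]) for tok, n in Counter(cand).items())
    precision = clipped / len(cand)

    # Brevity penalty against the reference length closest to the
    # candidate length (ties resolved toward the shorter reference).
    ref_len = min((len(r) for r in refs),
                  key=lambda n: (abs(n - len(cand)), n))
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * precision


# Hypothetical captions, for illustration only.
score = bleu1("a dog runs on grass", ["a dog runs on the grass"])
print(round(score, 4))
```

The same function works unchanged on whitespace-separated Urdu text. A corpus-level score, which is what papers usually report, aggregates clipped counts and lengths over all captions before dividing, so it is not simply the mean of sentence-level scores.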

Item Type:	Article (Article)
Peer-reviewed:	Yes
Date Deposited:	29 Jul 2024 12:55
Publisher:	Springer
Additional Information:	The version of record of this article, first published in Journal of Ambient Intelligence and Humanized Computing, is available online at Publisher’s website: http://dx.doi.org/10.1007/s12652-024-04824-9
Divisions:	Organisation > Business and Law
Subject terms:	0801 Artificial Intelligence and Image Processing, 0805 Distributed Computing
URI:	https://e-space.mmu.ac.uk/id/eprint/635192
DOI:	https://doi.org/10.1007/s12652-024-04824-9
ISSN:	1868-5137
e-ISSN:	1868-5145
