e-space
Manchester Metropolitan University's Research Repository

    A transformer-based Urdu image caption generation

    Hadi, Muhammad, Safder, Iqra, Waheed, Hajra, Zaman, Farooq, Aljohani, Naif Radi, Nawaz, Raheel, Hassan, Saeed Ul and Sarwar, Raheem ORCID logoORCID: https://orcid.org/0000-0002-0640-807X (2024) A transformer-based Urdu image caption generation. Journal of Ambient Intelligence and Humanized Computing. ISSN 1868-5137

    [img]
    Preview
    Published Version
    Available under License Creative Commons Attribution.

    Download (3MB) | Preview

    Abstract

    Image caption generation has emerged as a remarkable development that bridges the gap between Natural Language Processing (NLP) and Computer Vision (CV). It lies at the intersection of these fields and presents unique challenges, particularly when dealing with low-resource languages such as Urdu. Limited research on basic Urdu language understanding necessitates further exploration in this domain. In this study, we propose three Seq2Seq-based architectures specifically tailored for Urdu image caption generation. Our approach involves leveraging transformer models to generate captions in Urdu, a significantly more challenging task than English. To facilitate the training and evaluation of our models, we created an Urdu-translated subset of the flickr8k dataset, which contains images featuring dogs in action accompanied by corresponding Urdu captions. Our designed models encompassed a deep learning-based approach, utilizing three different architectures: Convolutional Neural Network (CNN) + Long Short-term Memory (LSTM) with Soft attention employing word2Vec embeddings, CNN+Transformer, and Vit+Roberta models. Experimental results demonstrate that our proposed model outperforms existing state-of-the-art approaches, achieving 86 BLEU-1 and 90 BERT-F1 scores. The generated Urdu image captions exhibit syntactic, contextual, and semantic correctness. Our study highlights the inherent challenges associated with retraining models on low-resource languages. Our findings highlight the potential of pre-trained models for facilitating the development of NLP and CV applications in low-resource language settings.

    Impact and Reach

    Statistics

    Activity Overview
    6 month trend
    6Downloads
    6 month trend
    13Hits

    Additional statistics for this dataset are available via IRStats2.

    Altmetric

    Repository staff only

    Edit record Edit record