e-space
Manchester Metropolitan University's Research Repository

    UrduAI: writeprints for Urdu authorship identification

    Sarwar, Raheem (ORCID: https://orcid.org/0000-0002-0640-807X) and Hassan, Saeed-Ul (2022) UrduAI: writeprints for Urdu authorship identification. ACM Transactions on Asian and Low-Resource Language Information Processing, 21 (2). 34. ISSN 2375-4699

    Accepted Version
    Available under License In Copyright.


    Abstract

    The authorship identification task aims to identify the original author of an anonymous text sample from a set of candidate authors. It has several application domains, such as digital text forensics and information retrieval, and these domains are not limited to a specific language. However, most authorship identification studies focus on English, and limited attention has been paid to Urdu. Moreover, existing Urdu authorship identification solutions lose accuracy as the number of training samples per candidate author decreases and as the number of candidate authors increases. Consequently, these solutions are inapplicable to real-world cases. In addition, due to the unavailability of reliable POS taggers and sentence segmenters, all existing authorship identification studies on Urdu text are limited to word n-gram features. To overcome these limitations, we formulate a stylometric feature space that is not limited to word n-gram features. Based on this feature space, we use an authorship identification solution that transforms each text sample into a point set, retrieves candidate text samples, and relies on a nearest neighbors classifier to predict the original author of the anonymous text sample. To evaluate our solution, we create a significantly larger corpus than those used in existing studies and conduct several experiments, which show that our solution overcomes the limitations of existing studies and achieves an accuracy of 94.03%, which is higher than that of all previous authorship identification works.
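
The pipeline outlined in the abstract (text sample to point set, then nearest-neighbor prediction) can be illustrated with a minimal sketch. This is not the authors' implementation: the two features (average word length and type-token ratio), the fixed word window, the set-to-set distance, and the names `to_point_set`, `set_distance`, and `predict_author` are all assumptions chosen for illustration. The candidate-retrieval step is omitted, so every labelled sample is compared directly.

```python
import math


def to_point_set(text, window=5):
    """Split a text into word windows and map each window to a 2-D point.

    Assumed features: (average word length, type-token ratio).
    Whitespace tokenization is an assumption of this sketch.
    """
    words = text.split()
    if not words:
        raise ValueError("text must contain at least one word")
    points = []
    for i in range(0, len(words), window):
        chunk = words[i:i + window]
        avg_len = sum(len(w) for w in chunk) / len(chunk)
        type_token = len(set(chunk)) / len(chunk)
        points.append((avg_len, type_token))
    return points


def set_distance(a, b):
    """Mean Euclidean distance from each point in a to its nearest point in b."""
    return sum(min(math.dist(p, q) for q in b) for p in a) / len(a)


def predict_author(anonymous_text, labelled_samples, window=5):
    """Return the author of the labelled sample nearest to the anonymous text.

    labelled_samples is a list of (author, text) pairs (hypothetical format).
    """
    query = to_point_set(anonymous_text, window)
    author, _ = min(
        labelled_samples,
        key=lambda s: set_distance(query, to_point_set(s[1], window)),
    )
    return author
```

Calling `predict_author(anonymous_text, [(author, text), ...])` returns the label of the single nearest sample. A realistic system would use a much richer stylometric feature space and would first narrow the candidate samples, as the abstract describes.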
