Alnajran, Noufa, Crockett, Keeley ORCID: https://orcid.org/0000-0003-1941-6201, McLean, David and Latham, Annabel ORCID: https://orcid.org/0000-0002-8410-7950 (2018) A Heuristic Based Pre-processing Methodology for Short Text Similarity Measures in Microblogs. In: IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), 23 June 2018 - 30 June 2018, Exeter, UK.
Accepted Version
Available under License In Copyright.
Abstract
Short text similarity measures have many applications in online social networks (OSN), as they are being integrated into machine learning algorithms. However, data quality is a major challenge in most OSNs, particularly Twitter. The sparse, ambiguous, informal, and unstructured nature of the medium makes it difficult to capture the underlying semantics of the text. Therefore, text pre-processing is a crucial phase in similarity identification applications, such as clustering and classification, because selecting appropriate data processing methods contributes to increasing the correlations of the similarity measure. This research proposes a novel heuristic-driven pre-processing methodology for enhancing the performance of similarity measures in the context of Twitter tweets. The components of the proposed pre-processing methodology are discussed and evaluated on an annotated dataset that was published as part of a SemEval-2014 shared task. An experimental analysis was conducted using the cosine angle as a similarity measure to assess the effect of our method against a baseline (C-Method). Experimental results indicate that our approach outperforms the baseline in terms of correlations and error rates.
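As a rough illustration of the pipeline the abstract describes (pre-process two tweets, then compare them with the cosine measure), the following minimal sketch may help. It is not the heuristic methodology proposed in the paper, nor the C-Method baseline: the clean-up rules (lowercasing, dropping URLs and user mentions, keeping hashtag words) and the function names `preprocess` and `cosine_similarity` are assumptions chosen only for illustration, and the vectors are plain term counts.

```python
import math
import re
from collections import Counter


def preprocess(tweet: str) -> list[str]:
    """Illustrative clean-up only; NOT the heuristics proposed in the paper."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"@\w+", " ", text)          # drop user mentions
    text = text.replace("#", " ")              # keep hashtag words, drop the symbol
    return re.findall(r"[a-z0-9']+", text)     # split into simple word tokens


def cosine_similarity(tokens_a: list[str], tokens_b: list[str]) -> float:
    """Cosine of the angle between two term-count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty tweet is treated as sharing nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    t1 = "Check this out http://t.co/x @bob #NLP rocks"
    t2 = "check this out nlp rocks"
    print(cosine_similarity(preprocess(t1), preprocess(t2)))
```

In this toy setting the two example tweets differ only in noise (a URL, a mention, a hashtag symbol, and letter case), so after clean-up they yield the same token counts. This is the kind of effect a pre-processing stage is meant to have on a similarity score.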