e-space
Manchester Metropolitan University's Research Repository

An integrated semantic-based framework for intelligent similarity measurement and clustering of microblogging posts

Alnajran, Noufa Abdulaziz (2019) An integrated semantic-based framework for intelligent similarity measurement and clustering of microblogging posts. Doctoral thesis (PhD), Manchester Metropolitan University.

Available under License Creative Commons Attribution Non-commercial No Derivatives.

Abstract

Twitter, the most popular microblogging platform, is rapidly gaining prominence as a source of information sharing and social awareness owing to its popularity and massive volume of user-generated content. Applications of this content include tailoring advertisement campaigns, event detection, trend analysis, and prediction of micro-populations. These applications are generally conducted through cluster analysis of tweets, which generates a more concise and organized representation of the massive raw data. Current approaches, however, perform traditional cluster analysis using conventional proximity measures such as Euclidean distance, and the sheer volume, noise, and dynamism of Twitter impose challenges that hinder the efficacy of traditional clustering algorithms in detecting meaningful clusters within microblogging posts. The research presented in this thesis sets out to design and develop a novel short text semantic similarity (STSS) measure, named TREASURE, which captures the semantic and structural features of microblogging posts in order to predict their similarity. TREASURE is utilised in the development of an innovative semantic-based cluster analysis (SBCA) algorithm that contributes to generating more accurate and meaningful granularities within microblogging posts. The integrated semantic-based framework incorporating TREASURE and the SBCA algorithm both tackles the problem of microblogging cluster analysis and contributes to the success of a variety of natural language processing (NLP) and computational intelligence research. TREASURE utilises word embedding neural network (NN) models to capture the semantic relationships between words based on their co-occurrences in a corpus. Moreover, TREASURE analyses the morphological and lexical structure of tweets to predict their syntactic similarities.
An intrinsic evaluation of TREASURE was performed with reference to a reliable similarity benchmark generated through an experiment to gather human ratings on a Twitter political dataset. A further evaluation was performed with reference to the SemEval-2014 similarity benchmark in order to validate the generalizability of TREASURE. The intrinsic evaluation and statistical analysis demonstrated a strong positive linear correlation between TREASURE and human ratings for both benchmarks. Furthermore, TREASURE achieved a significantly higher correlation coefficient than existing state-of-the-art STSS measures. The SBCA algorithm incorporates TREASURE as the proximity measure. Unlike conventional partition-based clustering algorithms, the SBCA algorithm is fully unsupervised and determines the number of clusters dynamically, rather than requiring it to be specified beforehand. Subjective evaluation criteria were employed to evaluate the SBCA algorithm with reference to the SemEval-2014 similarity benchmark. Furthermore, an experiment was conducted to produce a reliable multi-class benchmark on the European Referendum political domain, which was also utilised to evaluate the SBCA algorithm. The evaluation results provide evidence that the SBCA algorithm makes highly accurate combining and separation decisions and can generate pure clusters from microblogging posts. The main contributions of this thesis to knowledge are: 1) Development of a novel STSS measure for microblogging posts (TREASURE). 2) Development of a new SBCA algorithm that incorporates TREASURE to detect semantic themes in microblogs. 3) Generation of a pre-trained word embedding model learned from a large corpus of political tweets. 4) Production of a reliable similarity-annotated benchmark and a reliable multi-class benchmark in the domain of politics.
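
The general idea of clustering short posts with an embedding-based proximity measure, without fixing the number of clusters in advance, can be illustrated with a minimal sketch. This is not TREASURE or the SBCA algorithm, whose details are given in the thesis itself. It assumes a plain cosine similarity over averaged word vectors and a greedy single-pass threshold clustering. The toy vectors, the function names, and the 0.8 threshold are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for this illustration.
# A real system would load vectors from a trained word embedding model.
TOY_VECTORS = {
    "vote": (0.9, 0.1, 0.0),
    "ballot": (0.8, 0.2, 0.0),
    "election": (0.85, 0.15, 0.05),
    "football": (0.0, 0.1, 0.9),
    "match": (0.1, 0.0, 0.8),
    "goal": (0.05, 0.05, 0.95),
}


def embed(text):
    """Average the vectors of the known words; return None if none are known."""
    vecs = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vecs:
        return None
    return tuple(sum(component) / len(vecs) for component in zip(*vecs))


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity(text_a, text_b):
    """Stand-in proximity measure (assumption): cosine of averaged word vectors."""
    vec_a, vec_b = embed(text_a), embed(text_b)
    if vec_a is None or vec_b is None:
        return 0.0
    return cosine(vec_a, vec_b)


def threshold_cluster(posts, threshold=0.8):
    """Greedy single-pass clustering.

    Each post joins the first cluster whose first member is at least
    `threshold` similar to it; otherwise it starts a new cluster. The
    number of clusters therefore emerges from the data and the threshold.
    """
    clusters = []
    for post in posts:
        for cluster in clusters:
            if similarity(post, cluster[0]) >= threshold:
                cluster.append(post)
                break
        else:
            clusters.append([post])
    return clusters


posts = ["vote ballot", "football match", "election vote", "goal match"]
clusters = threshold_cluster(posts)
for members in clusters:
    print(members)
```

With these toy vectors the four posts fall into two groups, one political and one sports-related, although no cluster count was supplied. The result depends on the order of the posts and on the threshold, which is one reason a purpose-built measure and algorithm are of interest.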

