Manchester Metropolitan University's Research Repository

Semantic sentence similarity incorporating linguistic concepts

Pearce, David Matthew (2015) Semantic sentence similarity incorporating linguistic concepts. Doctoral thesis (PhD), Manchester Metropolitan University.


Available under License Creative Commons Attribution Non-commercial No Derivatives.



A natural language allows simpler ideas to be combined to communicate much more complex ones. This ability gives language the potential for use as a highly intuitive method of human interaction. However, this freedom of expression makes automated interpretation of language extremely challenging. Semantic sentence similarity is an approach that uses knowledge of how to compare simpler units, such as words, to obtain a measure of similarity between two sentences. This similarity can allow existing knowledge to be applied to new situations. The objective of this research is to show that a sentence similarity model can be made more accurate through the inclusion of linguistic concepts. This presents two challenges: adapting the human-focused rules of linguistics for sentence similarity, and evaluating the effect of each individual component in isolation. This research overcame these barriers through the development of an extensible modular framework and the construction of a new mathematical model for this framework, called SARUMAN. The core contribution of the research resulted from gradually incorporating fundamental linguistic components into SARUMAN, including disambiguation by part of speech, treating the sentence as clauses, and advanced word interaction to handle cases where meanings merge. The most advanced of these models is called SCAWIT. In experiments on a small data set, each of these introduced concepts showed a statistically significant improvement in Pearson's correlation (0.05 or more) over the previous version. The resulting models were capable of processing several hundred sentence pairs a second on a single processor. A further significant advance to the field of sentence similarity was the introduction of opposites to sentence similarity. This was conceptually beyond the pre-existing models and showed strong results for an extension of SCAWIT, called SANO.
Other novel contributions were made through automated word sense disambiguation from WordNet definitions and the use of a properties-of-words model. Some of these changes have potential but did not yield significant improvement with the current knowledge base.
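
The general idea described in the abstract, building a sentence-level score from word-level comparisons and then checking it against human ratings with Pearson's correlation, can be illustrated with a minimal sketch. This is not SARUMAN, SCAWIT or SANO: the word-similarity table, the best-match averaging rule and all function names below are assumptions made purely for illustration, and a real system would draw word similarities from a knowledge base such as WordNet.

```python
from math import sqrt

# Hypothetical toy word-similarity table (illustrative values only).
WORD_SIM = {
    frozenset(("car", "automobile")): 1.0,
    frozenset(("car", "bus")): 0.6,
    frozenset(("fast", "quick")): 0.9,
}


def word_similarity(a, b):
    """Return a similarity in [0, 1] for two words (toy lookup)."""
    if a == b:
        return 1.0
    return WORD_SIM.get(frozenset((a, b)), 0.0)


def sentence_similarity(s1, s2):
    """Average best-match word similarity, symmetrised over both sentences.

    This aggregation rule is an assumption for the sketch, not the
    thesis's model.
    """
    w1, w2 = s1.lower().split(), s2.lower().split()
    if not w1 or not w2:
        return 0.0

    def directed(src, dst):
        # For each source word, take its best match in the other sentence.
        return sum(max(word_similarity(a, b) for b in dst) for a in src) / len(src)

    return (directed(w1, w2) + directed(w2, w1)) / 2


def pearson(xs, ys):
    """Pearson's correlation between two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)
```

In an evaluation of the kind the abstract describes, `pearson` would be applied to the model's scores and the human similarity ratings for the same sentence pairs; a higher correlation indicates closer agreement with human judgement.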
