Predicting lexical complexity in English texts: the Complex 2.0 dataset

Shardlow, Matthew ORCID: https://orcid.org/0000-0003-1129-2750, Evans, Richard and Zampieri, Marcos (2022) Predicting lexical complexity in English texts: the Complex 2.0 dataset. Language Resources and Evaluation, 56 (4). pp. 1153-1194. ISSN 0010-4817

Published Version
Available under License Creative Commons Attribution.

Official URL: https://link.springer.com/article/10.1007/s10579-0...

Abstract

Identifying words which may cause difficulty for a reader is an essential step in most lexical text simplification systems prior to lexical substitution and can also be used for assessing the readability of a text. This task is commonly referred to as complex word identification (CWI) and is often modelled as a supervised classification problem. Training such systems requires annotated datasets in which words, and sometimes multi-word expressions, are labelled for complexity. In this paper we analyze previous work on this task and investigate the properties of CWI datasets for English. We develop a protocol for the annotation of lexical complexity and use it to annotate a new dataset, CompLex 2.0. We present experiments using both new and old datasets to investigate the nature of lexical complexity. We find that a Likert-scale annotation protocol provides an objective setting that is superior to a binary annotation protocol for identifying the complexity of words. We release a new dataset using our new protocol to promote the task of Lexical Complexity Prediction.

Item Type:	Article
Peer-reviewed:	Yes
Date Deposited:	30 Jun 2022 09:58
Publisher:	Springer Verlag
Additional Information:	This is an Open Access article which appeared in Language Resources and Evaluation, published by Springer
Divisions:	Organisation > Science and Engineering
Subject terms:	0801 Artificial Intelligence and Image Processing, 0804 Data Format, 1702 Cognitive Sciences, Artificial Intelligence & Image Processing
URI:	https://e-space.mmu.ac.uk/id/eprint/629988
DOI:	https://doi.org/10.1007/s10579-022-09588-2
ISSN:	0010-4817
e-ISSN:	1572-8412
