Deep Learning-based Extraction of Algorithmic Metadata in Full-Text Scholarly Documents

Safder, Iqra, Hassan, Saeed-Ul, Visvizi, Anna, Noraset, Thanapon, Nawaz, Raheel ORCID: https://orcid.org/0000-0001-9588-0052 and Tuarob, Suppawong (2020) Deep Learning-based Extraction of Algorithmic Metadata in Full-Text Scholarly Documents. Information Processing & Management, 57 (6). p. 102269. ISSN 0306-4573

Accepted Version
Available under License In Copyright.

Official URL: https://www.sciencedirect.com/science/article/pii/...

Abstract

Advances in search engines for traditional text documents have enabled effective, resource-efficient retrieval of massive amounts of textual information. However, such conventional search methodologies often suffer from poor retrieval accuracy, especially when documents exhibit unique properties that call for specialized and deeper semantic extraction. Recently, AlgorithmSeer, a search engine for algorithms, was proposed; it extracts pseudo-codes and shallow textual metadata from scientific publications and treats them as traditional documents so that conventional search engine methodology can be applied. However, such a system fails to support user queries that seek algorithm-specific information, such as the datasets on which algorithms operate, the performance of algorithms, and their runtime complexity. In this paper, a set of enhancements to the previously proposed algorithm search engine is presented. Specifically, we propose a set of methods to automatically identify and extract algorithmic pseudo-codes and the sentences that convey related algorithmic metadata using a set of machine-learning techniques. In an experiment with over 93,000 text lines, we introduce 60 novel features, comprising content-based, font-style-based, and structure-based feature groups, to extract algorithmic pseudo-codes. Our proposed pseudo-code extraction method achieves a 93.32% F1-score, outperforming the state-of-the-art techniques by 28%. Additionally, we propose a method to extract algorithm-related sentences using deep neural networks and achieve an accuracy of 78.5%, outperforming a rule-based model and a support vector machine model by 28% and 16%, respectively.
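
To make the idea of line-level features more concrete, the following is a minimal Python sketch of how a few content-based and structure-based features might be computed for a single text line. It is an illustration only: the abstract does not list the paper's 60 features, so the function name `line_features`, the keyword list, and the four features shown are all assumptions rather than the authors' feature set. Font-style-based features are omitted because they would require PDF layout information that plain text does not carry.

```python
import re

# Hypothetical keyword list (an assumption, not taken from the paper).
PSEUDOCODE_KEYWORDS = {"for", "while", "if", "else", "return", "end", "do", "then"}


def line_features(line: str) -> dict:
    """Compute a few illustrative features for one text line."""
    stripped = line.strip()
    tokens = re.findall(r"[A-Za-z_]+", stripped.lower())
    n_chars = len(stripped)
    # Count characters that are neither alphanumeric nor whitespace.
    symbols = sum(1 for ch in stripped if not ch.isalnum() and not ch.isspace())
    keyword_hits = sum(1 for t in tokens if t in PSEUDOCODE_KEYWORDS)
    return {
        # Content-based: share of word tokens that are control-flow keywords.
        "keyword_ratio": keyword_hits / len(tokens) if tokens else 0.0,
        # Content-based: share of characters that are symbols.
        "symbol_ratio": symbols / n_chars if n_chars else 0.0,
        # Structure-based: number of leading spaces.
        "indent": len(line) - len(line.lstrip(" ")),
        # Structure-based: line begins with a step number such as "3:" or "3.".
        "starts_with_line_number": bool(re.match(r"\d+\s*[:.)]", stripped)),
    }


if __name__ == "__main__":
    print(line_features("    3: for i <- 1 to n do"))
    print(line_features("The algorithm runs quickly."))
```

Feature dictionaries of this kind could then be fed to any standard classifier that labels each line as pseudo-code or not; which classifier the authors used is not stated in the abstract.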

Item Type:	Article
Peer-reviewed:	Yes
Date Deposited:	05 Jun 2020 08:33
Publisher:	Elsevier BV
Additional Information:	This is an Author Accepted Manuscript of a paper accepted for publication in Information Processing and Management, published by and copyright Elsevier.
Divisions:	Faculties > Business and Law
Subject terms:	knowledge-based systems, algorithmic metadata, algorithm search, deep learning, bi-directional LSTM, information retrieval, full-text articles, 0804 Data Format, 0806 Information Systems, 0807 Library and Information Studies, Information & Library Sciences
Report number:	102269
URI:	https://e-space.mmu.ac.uk/id/eprint/625933
DOI:	https://doi.org/10.1016/j.ipm.2020.102269
ISSN:	0306-4573
