Manchester Metropolitan University's Research Repository

Deep Learning-based Extraction of Algorithmic Metadata in Full-Text Scholarly Documents

Safder, Iqra and Hassan, Saeed-Ul and Visvizi, Anna and Noraset, Thanapon and Nawaz, Raheel and Tuarob, Suppawong (2020) Deep Learning-based Extraction of Algorithmic Metadata in Full-Text Scholarly Documents. Information Processing & Management, 57 (6). p. 102269. ISSN 0306-4573

Restricted to Repository staff only until 22 May 2022.

Download (1MB)


The advancements of search engines for traditional text documents have enabled the effective retrieval of massive textual information in a resource-efficient manner. However, such conventional search methodologies often suffer from poor retrieval accuracy especially when documents exhibit unique properties that behoove specialized and deeper semantic extraction. Recently, AlgorithmSeer, a search engine for algorithms has been proposed, that extracts pseudo-codes and shallow textual metadata from scientific publications and treats them as traditional documents so that the conventional search engine methodology could be applied. However, such a system fails to facilitate user search queries that seek to identify algorithm-specific information, such as the datasets on which algorithms operate, the performance of algorithms, and runtime complexity, etc. In this paper, a set of enhancements to the previously proposed algorithm search engine are presented. Specifically, we propose a set of methods to automatically identify and extract algorithmic pseudo-codes and the sentences that convey related algorithmic metadata using a set of machine-learning techniques. In an experiment with over 93,000 text lines, we introduce 60 novel features, comprising content-based, font style based and structure-based feature groups, to extract algorithmic pseudo-codes. Our proposed pseudo-code extraction method achieves 93.32% F1-score, outperforming the state-of-the-art techniques by 28%. Additionally, we propose a method to extract algorithmic-related sentences using deep neural networks and achieve an accuracy of 78.5%, outperforming a Rule-based model and a support vector machine model by 28% and 16%, respectively.

Impact and Reach


Activity Overview

Additional statistics for this dataset are available via IRStats2.


Actions (login required)

View Item View Item