Sparse principal component analysis for natural language processing

Drikvandi, Reza ORCID: https://orcid.org/0000-0002-7245-9713 and Lawal, Olamide (2023) Sparse principal component analysis for natural language processing. Annals of Data Science, 10 (1). pp. 25-41. ISSN 2198-5804

Published Version
Available under License Creative Commons Attribution.

Official URL: https://link.springer.com/article/10.1007/s40745-0...

Abstract

High-dimensional data are growing rapidly in many disciplines, particularly in natural language processing. Natural language processing requires working with high-dimensional matrices of word embeddings obtained from text data. These matrices are often sparse in the sense that they contain many zero elements. Sparse principal component analysis is an advanced mathematical tool for the analysis of high-dimensional data. In this paper, we study and apply sparse principal component analysis to natural language processing, where it can effectively handle large sparse matrices. We study several formulations of sparse principal component analysis, together with algorithms for implementing those formulations. Our work is motivated and illustrated by a real text dataset. We find that sparse principal component analysis performs as well as ordinary principal component analysis in terms of accuracy and precision, while offering two major advantages: faster computation and easier interpretation of the principal components. These advantages are especially helpful in big data settings.
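
To make the interpretability point concrete, the following is a minimal sketch of one simple sparse PCA heuristic, truncated power iteration, which keeps only the k largest loadings at each step. It is not taken from the paper and is not claimed to be one of the formulations the authors study. The toy term counts, the term labels, and the function names `covariance` and `truncated_power_iteration` are all hypothetical and chosen only for illustration.

```python
import math


def covariance(rows):
    """Sample covariance matrix of the columns of a documents-by-terms matrix."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def truncated_power_iteration(S, k, iters=100):
    """Approximate the leading eigenvector of S with at most k non-zero loadings."""
    p = len(S)
    # Start from the coordinate (term) with the largest variance.
    start = max(range(p), key=lambda j: S[j][j])
    v = [1.0 if j == start else 0.0 for j in range(p)]
    for _ in range(iters):
        # Ordinary power step: multiply the current vector by S.
        w = [sum(S[i][j] * v[j] for j in range(p)) for i in range(p)]
        # Sparsity step: keep the k largest-magnitude entries, zero the rest.
        keep = set(sorted(range(p), key=lambda j: abs(w[j]), reverse=True)[:k])
        w = [w[j] if j in keep else 0.0 for j in range(p)]
        # Rescale to unit length.
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Hypothetical toy data: 6 documents (rows) by 5 terms (columns) of word counts.
terms = ["model", "data", "goal", "match", "team"]
counts = [
    [3, 2, 0, 0, 0],
    [4, 3, 0, 0, 1],
    [2, 3, 0, 1, 0],
    [0, 0, 3, 2, 0],
    [0, 1, 4, 3, 0],
    [0, 0, 2, 3, 1],
]

S = covariance(counts)
loadings = truncated_power_iteration(S, k=2)
p = len(S)
# Variance captured along the sparse direction: v^T S v.
explained = sum(loadings[i] * S[i][j] * loadings[j]
                for i in range(p) for j in range(p))

# Only the terms with non-zero loadings need to be interpreted.
for term, weight in zip(terms, loadings):
    if weight != 0.0:
        print(f"{term}: {weight:.3f}")
print(f"variance along component: {explained:.3f}")
```

Because at most k loadings are non-zero, the component can be read off as a combination of a handful of terms rather than the whole vocabulary. For realistically sized sparse matrices, a tested library implementation would be the more sensible choice than this pure-Python sketch.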

Item Type:	Article
Peer-reviewed:	Yes
Date Deposited:	06 May 2020 10:40
Publisher:	Springer Nature
Additional Information:	This is an Open Access article published by Springer and copyright The Authors.
Divisions:	Faculties > Science and Engineering
Subject terms:	Classification, Dimensionality reduction, Ensemble learning, high dimensional data, Natural language processing, Sparse principal component analysis
URI:	https://e-space.mmu.ac.uk/id/eprint/625649
DOI:	https://doi.org/10.1007/s40745-020-00277-x
ISSN:	2198-5804
