Towards optimization of privacy-utility trade-off using similarity and diversity based clustering

Majeed, Abdul ORCID: https://orcid.org/0000-0002-3030-5054, Khan, Safiullah ORCID: https://orcid.org/0000-0001-8342-6928 and Hwang, Seong Oun ORCID: https://orcid.org/0000-0003-4240-6255 (2024) Towards optimization of privacy-utility trade-off using similarity and diversity based clustering. IEEE Transactions on Emerging Topics in Computing, 12 (1). pp. 368-385.

Published Version
File not available for download.
Available under License In Copyright.

Official URL: http://dx.doi.org/10.1109/tetc.2023.3258528

Abstract

Most data owners publish personal data for information consumers, who use it for hidden knowledge discovery. Publishing data in its original form, however, can expose subjects' identities and their associated sensitive information, so data is usually anonymized before publication. Many anonymization techniques have been proposed, but most sacrifice utility for privacy, or vice versa, and explicitly disclose sensitive information when the original data have skewed distributions. To address these technical problems, we propose a novel anonymization method using similarity- and diversity-based clustering that effectively preserves both the subjects' privacy and the utility of the anonymized data. We identify influential attributes in the original data using a machine learning algorithm; this step helps preserve a subject's privacy in imbalanced clusters and has remained unexplored in previous research. The objective function of the clustering process considers both similarity and diversity in the attributes when assigning records to clusters, whereas most existing clustering-based anonymization techniques consider either similarity or diversity, thereby sacrificing either privacy or utility. Attribute values in each cluster are minimally generalized to achieve both competing goals. Extensive experiments were conducted on four real-world benchmark datasets to demonstrate the feasibility of the proposed method. Compared with existing methods, the common and AI-based privacy risks were reduced by 13.01% and 24.3%, respectively. Data utility improved by 11.25% and 20.21% on two distinct metrics. The overhead of the clustering process (e.g., the number of iterations) was 2.25× lower than that of state-of-the-art methods.
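
The abstract describes a clustering objective that weighs similarity and diversity together, followed by minimal generalization within each cluster. The sketch below only illustrates that general idea on toy records and is not the authors' algorithm. The scoring functions, the weight `alpha`, the greedy assignment, and all names are assumptions made for illustration. The influential-attribute step is omitted.

```python
def similarity(record, cluster, qi_keys):
    """Fraction of quasi-identifier values the record shares with cluster members."""
    matches = sum(record[k] == m[k] for m in cluster for k in qi_keys)
    return matches / (len(cluster) * len(qi_keys))


def diversity_gain(record, cluster, sensitive_key):
    """1.0 if the record adds a sensitive value not yet in the cluster, else 0.0."""
    seen = {m[sensitive_key] for m in cluster}
    return 0.0 if record[sensitive_key] in seen else 1.0


def score(record, cluster, qi_keys, sensitive_key, alpha):
    # Hypothetical objective: weighted sum of similarity and diversity.
    return (alpha * similarity(record, cluster, qi_keys)
            + (1 - alpha) * diversity_gain(record, cluster, sensitive_key))


def cluster_records(records, k, qi_keys, sensitive_key, alpha=0.5):
    """Greedily build clusters of size k; leftovers join their best-scoring cluster."""
    remaining = list(records)
    clusters = []
    while len(remaining) >= k:
        cluster = [remaining.pop(0)]  # seed with the next unassigned record
        while len(cluster) < k:
            best = max(
                range(len(remaining)),
                key=lambda i: score(remaining[i], cluster, qi_keys, sensitive_key, alpha),
            )
            cluster.append(remaining.pop(best))
        clusters.append(cluster)
    if remaining:
        if clusters:
            for r in remaining:
                target = max(
                    clusters,
                    key=lambda c: score(r, c, qi_keys, sensitive_key, alpha),
                )
                target.append(r)
        else:
            clusters.append(remaining)  # fewer than k records in total
    return clusters


def generalize(cluster, qi_keys):
    """Generalize a quasi-identifier only where values differ within the cluster."""
    generalized = {}
    for key in qi_keys:
        values = [m[key] for m in cluster]
        if len(set(values)) == 1:
            generalized[key] = values[0]  # already identical, keep as-is
        elif all(isinstance(v, (int, float)) for v in values):
            generalized[key] = f"{min(values)}-{max(values)}"  # numeric range
        else:
            generalized[key] = "*"  # suppress differing categorical values
    return [{**m, **generalized} for m in cluster]


records = [
    {"age": 30, "zip": "A", "disease": "flu"},
    {"age": 30, "zip": "A", "disease": "flu"},
    {"age": 30, "zip": "A", "disease": "cold"},
    {"age": 50, "zip": "B", "disease": "flu"},
]
clusters = cluster_records(records, k=2, qi_keys=["age", "zip"], sensitive_key="disease")
anonymized = [generalize(c, ["age", "zip"]) for c in clusters]
```

With `alpha=0.5`, the first toy cluster pairs two records with identical quasi-identifiers but different sensitive values, so it needs no generalization. A similarity-only objective (`alpha=1.0`) would rate the duplicate `flu` record equally well and could produce a cluster that discloses its sensitive value.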

Item Type:	Article (Article)
Peer-reviewed:	Yes
Date Deposited:	18 Sep 2024 09:33
Publisher:	Institute of Electrical and Electronics Engineers (IEEE)
Additional Information:	For copyright reasons, full-text access is not available through this repository
Divisions:	Organisation > Science and Engineering; Organisation > Science and Engineering > Department of Computing and Maths
Subject terms:	0805 Distributed Computing, 0806 Information Systems, 0906 Electrical and Electronic Engineering, 46 Information and computing sciences
URI:	https://e-space.mmu.ac.uk/id/eprint/635649
DOI:	https://doi.org/10.1109/TETC.2023.3258528
e-ISSN:	2168-6750

