A scalable implementation of information theoretic feature selection for high dimensional data

Kleerekoper, AJ, Pappas, M, Pocock, A, Brown, G and Lujan, M (2015) A scalable implementation of information theoretic feature selection for high dimensional data. In: Big Data, 2015 IEEE International Conference.

Available under License In Copyright.

Abstract

With the growth of high-dimensional data, feature selection is a vital component of machine learning as well as an important standalone data analytics tool. Without it, the computational cost of big data analytics can become unmanageable, and spurious correlations and noise can reduce the accuracy of any results. Feature selection removes irrelevant and redundant information, leading to faster, more reliable data analysis. Feature selection techniques based on information theory are among the fastest known, and the Manchester AnalyticS Toolkit (MAST) provides an efficient, parallel and scalable implementation of these methods. This paper considers a number of data structures for storing the frequency counters that underpin MAST. We show that preprocessing the data to reduce the number of zero-valued counters in an array structure results in an order of magnitude reduction in both memory usage and execution time compared to state-of-the-art structures that use explicit mappings to avoid zero-valued counters. We also describe a number of parallel processing techniques that enable MAST to scale linearly with the number of processors, even on NUMA architectures. MAST targets scale-up servers rather than scale-out clusters, and we show that it performs orders of magnitude faster than existing tools. Moreover, we show that MAST is 3.5 times faster than a scale-out solution built for Spark running on the same server. As an example of the performance of MAST, we were able to process a dataset of 100 million examples and 100,000 features in under 10 minutes on a four-socket server, with each socket containing an 8-core Intel Xeon E5-4620 processor.
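
The idea of preprocessing data so that an array of frequency counters contains fewer zero-valued entries can be illustrated with a small sketch. This is not MAST's code or API, and the abstract does not describe the actual preprocessing step. The function names (`densify`, `joint_counts`, `mutual_information`) are hypothetical, and the sketch assumes the preprocessing amounts to relabelling each feature's observed values as consecutive integers before counting. The counts then feed a standard plug-in estimate of mutual information, the quantity that information theoretic feature selection criteria are typically built on.

```python
import math
from typing import Dict, List, Sequence


def densify(values: Sequence[int]) -> List[int]:
    """Relabel arbitrary integer values as 0..k-1 in order of first appearance.

    Assumed preprocessing step: after relabelling, a count array indexed by
    these labels has no rows or columns reserved for values that never occur.
    """
    mapping: Dict[int, int] = {}
    return [mapping.setdefault(v, len(mapping)) for v in values]


def joint_counts(x: Sequence[int], y: Sequence[int]) -> List[List[int]]:
    """Count co-occurrences of dense labels in a plain 2-D array."""
    counts = [[0] * (max(y) + 1) for _ in range(max(x) + 1)]
    for xi, yi in zip(x, y):
        counts[xi][yi] += 1
    return counts


def mutual_information(x: Sequence[int], y: Sequence[int]) -> float:
    """Plug-in estimate of I(X;Y) in bits from the joint frequency counters."""
    xd, yd = densify(x), densify(y)
    counts = joint_counts(xd, yd)
    n = len(xd)
    row_totals = [sum(r) for r in counts]
    col_totals = [sum(c) for c in zip(*counts)]
    mi = 0.0
    for i, row in enumerate(counts):
        for j, c in enumerate(row):
            if c:  # zero counters contribute nothing to the sum
                mi += (c / n) * math.log2(c * n / (row_totals[i] * col_totals[j]))
    return mi
```

With a feature that only takes the raw values 10 and 5000, indexing an array by raw value would need 5001 rows, almost all of them zero. After `densify` the same feature needs two rows, which is the kind of saving the abstract attributes to preprocessing, shown here at toy scale.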

Item Type:	Conference or Workshop Item
Peer-reviewed:	No
Date Deposited:	05 Jul 2016 10:41
Publisher:	IEEE
Additional Information:	This is an author final copy of a paper published in and presented at the Big Data (Big Data), 2015 IEEE International Conference, published by and copyright IEEE.
Divisions:	Organisation > Science and Engineering
URI:	https://e-space.mmu.ac.uk/id/eprint/615528
DOI:	https://doi.org/10.1109/BigData.2015.7363774
