Original Research ARTICLE
Outlier mining methods based on graph structure analysis
- 1Universitat Politecnica de Catalunya, Spain
- 2Universidad Nacional de Córdoba, Argentina
- 3Instituto de Física Enrigue Gaviola (IFEG), Argentina
Outlier detection in high-dimensional datasets is a fundamental and challenging problem across disciplines that has also practical implications, as removing outliers from the training set improves the performance of machine learning algorithms. While many outlier mining algorithms have been proposed in the literature, they tend to be valid or efficient for specific types of datasets (time series, images, videos, etc.). Here we propose two methods that can be applied to generic datasets, as long as there is a meaningful measure of distance between pairs of elements of the dataset. Both methods start by defining a graph, where the nodes are the elements of the dataset, and the links have associated weights that are the distances between the nodes. Then, the first method assigns an outlier score based on the percolation (i.e., the fragmentation) of the graph. The second method uses the popular IsoMap nonlinear dimensionality reduction algorithm, and assigns an outlier score by comparing the geodesic distances with the distances in the reduced space. We test these algorithms on real and synthetic datasets and show that they either outperform, or perform on par with other popular outlier detection methods. A main advantage of the percolation method is that is parameter free and therefore, it does not require any training; on the other hand, the IsoMap method has two integer number parameters, and when they are appropriately selected, the method performs similar to or better than all the other methods tested.
Keywords: Outlier mining, anomaly detection, complex networks, machine learning, unsupervised learning, Percolation
Received: 15 Jul 2019;
Accepted: 06 Nov 2019.
Copyright: © 2019 Amil, Almeira and Masoller. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Mr. Pablo Amil, Universitat Politecnica de Catalunya, Barcelona, Spain, firstname.lastname@example.org