Outlier detection using iterative adaptive mini-minimum spanning tree generation with applications on medical data

As an important technique for data pre-processing, outlier detection plays a crucial role in various real applications and has gained substantial attention, especially in medical fields. Despite the importance of outlier detection, many existing methods are vulnerable to the distribution of outliers and require prior knowledge, such as the outlier proportion. To address this problem to some extent, this article proposes an adaptive mini-minimum spanning tree-based outlier detection (MMOD) method, which utilizes a novel distance measure by scaling the Euclidean distance. For datasets containing different densities and taking on different shapes, our method can identify outliers without prior knowledge of outlier percentages. The results on both real-world medical data corpora and intuitive synthetic datasets demonstrate the effectiveness of the proposed method compared to state-of-the-art methods.


Parameter tuning and result visualization on five synthetic two-dimensional datasets
Five synthetic two-dimensional datasets with different densities, distributions, and outlier numbers, as detailed in Table A, were employed to further validate the effectiveness of the proposed method and to plot more intuitive outlier detection results. The five datasets differ in the number, density, and distribution of clusters, in the density, distribution, and proportion of outliers, and in the distance between outliers and clusters; Table A summarizes these morphologies. Unlike for the real-world data, three parameters were tuned for each synthetic dataset to achieve reasonable MMOD results: the first added value M_EW1 of M_EW, the termination threshold T_t, and the exit condition aec. This procedure accommodates the synthetic data's deliberate realization of particular distributions or densities. Tuning the three parameters is neither complicated nor exhaustive, because each parameter is an either-or choice: either the default value or equation given in Section 3.2, or one reasonable, uniformly applied variant. For M_EW1, the alternative value is 1, used in place of the length of the first edge added to the mini-MST when d_1 takes a significantly high value. For T_t and aec, the alternatives omit the standard deviation parts of the original equations and keep the mean terms, which is the most straightforward way to tune Equations 2 and 5.
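The either-or tuning described above can be illustrated with a minimal sketch. The function names, the `use_default` flag, and in particular the "mean plus one standard deviation" form are assumptions made purely for illustration; the actual Equations 2 and 5 are those given in the main text and are not reproduced here.

```python
from statistics import mean, pstdev


def first_added_value(first_edge_length, use_default=True):
    """Choose the first added value (M_EW1 in the text).

    Default: the length of the first edge added to the mini-MST.
    Alternative: the constant 1, as stated in the text.
    """
    return first_edge_length if use_default else 1.0


def mean_based_threshold(values, use_default=True):
    """Hypothetical stand-in for a threshold such as T_t or aec.

    ASSUMPTION: the default is modeled as mean + one population standard
    deviation; the real equations may weight or combine terms differently.
    The alternative drops the standard deviation part and keeps the mean.
    """
    m = mean(values)
    return m + pstdev(values) if use_default else m


if __name__ == "__main__":
    edge_weights = [1.0, 2.0, 3.0, 4.0]  # made-up values, not from the paper
    print(first_added_value(edge_weights[0], use_default=False))
    print(mean_based_threshold(edge_weights, use_default=True))
    print(mean_based_threshold(edge_weights, use_default=False))
```

Because each parameter has only two settings, the whole tuning space in this sketch is three binary flags, which is consistent with the statement that the tuning is not exhaustive.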
Table B documents the parameter-tuning results on the five synthetic datasets. The parameter values for these morphologically varied datasets provide a useful reference for future research applying MMOD to other datasets. As illustrated in Figure A, the experimental results indicate that, given appropriate parameter tuning, MMOD is capable of detecting outliers in manually constructed data with varying cluster numbers, cluster densities, cluster distributions, outlier densities, outlier distributions, outlier proportions, and distances between outliers and clusters. The eight state-of-the-art peer methods (see Section 5.2) were also applied to the synthetic datasets for performance comparison. The performance of all nine methods, including MMOD, on the five synthetic datasets is visualized in Figures B-F for reference.
We reiterate that the parameter tuning described in this Appendix, although straightforward and efficient, serves only as a reference for testing MMOD when ground truth is available or on datasets with artificial morphologies and distributions. The experimental results detailed in the main text have verified on various datasets that MMOD produces good results directly, without tuning. We do not suggest or encourage tuning the parameters of MMOD for real-world datasets, especially those without ground-truth outlier labels or in black-box cases.

Figure A. Summary of MMOD's outlier detection results on the five synthetic two-dimensional datasets. Red: outliers detected by the proposed adaptive mini-minimum spanning tree-based outlier detection (MMOD) method.

Figure B. Nine algorithms' outlier detection results on the synthetic "Two densities" dataset. Detected outliers are in red.

Figure C. Nine algorithms' outlier detection results on the synthetic "Three clusters" dataset. Detected outliers are in red.

Figure D. Nine algorithms' outlier detection results on the synthetic "Four densities" dataset. Detected outliers are in red.

Figure E. Nine algorithms' outlier detection results on the synthetic "Uniform outliers" dataset. Detected outliers are in red.

Figure F. Nine algorithms' outlier detection results on the synthetic "Arbitrary outliers" dataset. Detected outliers are in red.

Table A. Description of the five synthetic two-dimensional datasets for further validation and better result visualization. #: Number of.

Table B. Parameter tuning of the five synthetic two-dimensional datasets.