ORIGINAL RESEARCH article

Front. Artif. Intell.

Sec. Medicine and Public Health

Volume 8 - 2025 | doi: 10.3389/frai.2025.1607348

Outliers and anomalies in training and testing datasets for AI-powered morphometry - Evidence from CT scans of the spleen

Provisionally accepted
Yuriy Vasilev1,2, Anastasia Pamova1*, Tatyana Bobrovskaya1*, Anton Vladzimirskyy1,3, Olga Omelyanskaya1, Elena Astapenko1, Artem Kruchinkin1, Kirill Arzamasov1,4
  • 1Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Health Care Department, Moscow, Russia
  • 2National Medical and Surgical Center named after N.I. Pirogov of the Ministry of Health of the Russian Federation, Moscow, Russia
  • 3I.M. Sechenov First Moscow State Medical University of the Ministry of Health of the Russian Federation (Sechenov University), Moscow, Russia
  • 4Moscow Technical University - MIREA, Moscow, Russia

The final, formatted version of the article will be published soon.

Introduction: Creating training and testing datasets for machine learning algorithms that measure the linear dimensions of organs is a tedious task. There are no universally accepted methods for evaluating outliers or anomalies in such datasets, which can cause errors in machine learning and compromise the quality of end products. The goal of this study is to identify optimal methods for detecting organ anomalies and outliers in medical datasets designed to train and test neural networks in morphometrics.

Methods: A dataset was created containing linear measurements of the spleen obtained from CT scans. Labelling was performed by three radiologists. The sample comprised studies from N = 197 patients. Outliers and anomalies were sought using visual methods (1.5 interquartile range; heat map; boxplot; histogram; scatter plot), machine learning algorithms (Isolation Forest; Density-Based Spatial Clustering of Applications with Noise, DBSCAN; k-nearest neighbors, KNN; local outlier factor; one-class support vector machines, OSVM; EllipticEnvelope; autoencoders), and mathematical statistics (z-score; Grubbs' test; Rosner's test).

Results: We identified measurement errors, input errors, abnormal size values and non-standard shapes of the organ (sickle-shaped, round, triangular, additional lobules). The most effective methods included visual techniques (including boxplots and histograms) and machine learning algorithms such as OSVM, KNN and autoencoders. A total of 32 outlier anomalies were found.

Discussion: Curation of complex morphometric datasets must involve thorough mathematical and clinical analyses. Relying solely on mathematical statistics or machine learning methods appears inadequate.
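
Two of the simpler screening rules named in the Methods, the 1.5 interquartile range rule and the z-score, can be sketched as follows. This is an illustration only and not the authors' pipeline: the function names, the z-score threshold of 3, the quartile convention and the spleen lengths are all assumptions made for the example, and none of the values come from the study.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    # "inclusive" is one of several quartile conventions; chosen here as an assumption.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mean) / sd > threshold]


# Hypothetical spleen lengths in mm (NOT data from the study).
# 1050 mimics an input error; 160 mimics an abnormally large organ.
lengths_mm = [92, 95, 98, 100, 101, 103, 104, 105, 106, 107,
              108, 109, 110, 111, 112, 114, 116, 118, 160, 1050]

print("1.5 IQR rule:", iqr_outliers(lengths_mm))
print("z-score rule:", zscore_outliers(lengths_mm))
```

In this toy sample the 1.5 IQR rule flags both 160 and 1050, while the z-score flags only 1050, because the single extreme value inflates the mean and standard deviation enough to hide the 160 mm measurement. Whether a flagged value is an input error or a real anatomical finding still has to be decided by reviewing the scan.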

Keywords: outliers, anomalies, dataset, machine learning, statistics, spleen, computed tomography

Received: 11 Apr 2025; Accepted: 25 Jun 2025.

Copyright: © 2025 Vasilev, Pamova, Bobrovskaya, Vladzimirskyy, Omelyanskaya, Astapenko, Kruchinkin and Arzamasov. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence:
Anastasia Pamova, Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Health Care Department, Moscow, Russia
Tatyana Bobrovskaya, Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Health Care Department, Moscow, Russia

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.