AUTHOR=Gottlich Harrison C., Korfiatis Panagiotis, Gregory Adriana V., Kline Timothy L.
TITLE=AI in the Loop: functionalizing fold performance disagreement to monitor automated medical image segmentation workflows
JOURNAL=Frontiers in Radiology
VOLUME=3
YEAR=2023
URL=https://www.frontiersin.org/journals/radiology/articles/10.3389/fradi.2023.1223294
DOI=10.3389/fradi.2023.1223294
ISSN=2673-8740
ABSTRACT=Methods that automatically flag poorly performing predictions are urgently needed to safely implement machine learning workflows in clinical practice and to identify difficult cases during model training. We demonstrate a readily adoptable method using sub-models trained on different dataset folds. Disagreement between the sub-models was used as a surrogate for model confidence and was evaluated against thresholds informed by human interobserver values to determine whether final ensemble model predictions should be manually reviewed. In two different datasets (abdominal CT and MR images for kidney tumor segmentation), the framework efficiently identified low-performing automated segmentations. Flagging images whose minimum interfold test Dice score fell below the interobserver variability maximized the number of flagged images while preserving the maximum test Dice. When our internal model was applied to an external publicly available dataset (KiTS21) to evaluate performance on out-of-distribution data, we observed that the flagged images contained smaller tumors than those in our internal training dataset. This finding demonstrates that our method is robust in detecting poor performance caused by out-of-distribution data. Comparing interfold sub-model disagreement against human interobserver values is an effective and efficient way to approximate a model's lack of knowledge due to insufficient relevant training data, also termed epistemic uncertainty, a key capability for adopting these applications in clinical practice.
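
The flagging rule summarized in the abstract (compare the minimum pairwise Dice agreement between fold sub-model predictions against a threshold informed by human interobserver variability) can be sketched briefly. The following Python snippet is a minimal illustration, not the authors' released code; the helper names (dice, interfold_min_dice, flag_for_review), the number of folds, and the 0.85 threshold value are assumptions made for the example only.

import itertools
import numpy as np

def dice(a: np.ndarray, b: np.ndarray, eps: float = 1e-8) -> float:
    # Dice similarity coefficient between two binary segmentation masks.
    a, b = a.astype(bool), b.astype(bool)
    return (2.0 * np.logical_and(a, b).sum()) / (a.sum() + b.sum() + eps)

def interfold_min_dice(fold_masks: list) -> float:
    # Minimum pairwise Dice across the per-fold sub-model predictions.
    # Low values indicate sub-model disagreement, used here as a surrogate
    # for low confidence (epistemic uncertainty).
    return min(dice(a, b) for a, b in itertools.combinations(fold_masks, 2))

def flag_for_review(fold_masks: list, interobserver_dice: float = 0.85) -> bool:
    # Flag a case for manual review when interfold agreement falls below
    # the human interobserver threshold (0.85 is a placeholder value).
    return interfold_min_dice(fold_masks) < interobserver_dice

# Toy usage: five fold sub-models predicting the same case, each prediction
# generated by randomly perturbing a shared base mask.
rng = np.random.default_rng(0)
base = rng.random((64, 64)) > 0.5
preds = [np.logical_xor(base, rng.random((64, 64)) > 0.97) for _ in range(5)]
print(interfold_min_dice(preds), flag_for_review(preds))

In a deployed workflow of this kind, cases returning True would be routed to a radiologist for review rather than accepted automatically; the threshold would be set from measured interobserver Dice values rather than the placeholder used above.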