AUTHOR=Güzel Hamza Eren , Aşcı Göktuğ , Demirbilek Oytun , Özdemir Tuğçe Doğa , Erekli Pelin Berfin TITLE=Diagnostic precision of a deep learning algorithm for the classification of non-contrast brain CT reports JOURNAL=Frontiers in Radiology VOLUME=Volume 5 - 2025 YEAR=2025 URL=https://www.frontiersin.org/journals/radiology/articles/10.3389/fradi.2025.1509377 DOI=10.3389/fradi.2025.1509377 ISSN=2673-8740 ABSTRACT=ObjectiveThis study aimed to determine the diagnostic precision of a deep learning algorithm for the classificaiton of non-contrast brain CT reports.MethodsA total of 1,861 non-contrast brain CT reports were randomly selected, anonymized, and annotated for urgency level by two radiologists, with review by a senior radiologist. The data, encrypted and stored in Excel format, were securely maintained on a university cloud system. Using Python 3.8.16, the reports were classified into four urgency categories: emergency, not emergency but needs timely attention, clinically non-significant and normal. The dataset was split, with 800 reports used for training and 200 for validation. The DistilBERT model, featuring six transformer layers and 66 million trainable parameters, was employed for text classification. Training utilized the Adam optimizer with a learning rate of 2e-5, a batch size of 32, and a dropout rate of 0.1 to prevent overfitting. The model achieved a mean F1 score of 0.85 through 5-fold cross-validation, demonstrating strong performance in categorizing radiology reports.ResultsOf the 1,861 scans, 861 cases were identified as fit for study through the senior radiologist and self-hosted Label Studio interpretations. It was observed that the algorithm achieved a sensitivity of 91% and a specificity of 90% in the measurements made on the test data. The F1 score was measured as 0.89 for the best fold. The algorithm most successfully distinguished emergency results with positive predictive values that were unexpectedly lower than in previously reported studies. Beam hardening artifacts and excessive noise, compromising the quality of CT scan images, were significantly associated with decreased model performance.ConclusionThis study revealed decreased diagnostic accuracy of an AI decision support system (DSS) at our institution. Despite extensive evaluation, we were unable to identify the source of this discrepancy, raising concerns about the generalizability of these tools with indeterminate failure modes. These results further highlight the need for standardized study design to allow for rigorous and reproducible site-to-site comparison of emerging deep learning technologies.