AUTHOR=Casey Arlene, Davidson Emma, Grover Claire, Tobin Richard, Grivas Andreas, Zhang Huayu, Schrempf Patrick, O’Neil Alison Q., Lee Liam, Walsh Michael, Pellie Freya, Ferguson Karen, Cvoro Vera, Wu Honghan, Whalley Heather, Mair Grant, Whiteley William, Alex Beatrice TITLE=Understanding the performance and reliability of NLP tools: a comparison of four NLP tools predicting stroke phenotypes in radiology reports JOURNAL=Frontiers in Digital Health VOLUME=Volume 5 - 2023 YEAR=2023 URL=https://www.frontiersin.org/journals/digital-health/articles/10.3389/fdgth.2023.1184919 DOI=10.3389/fdgth.2023.1184919 ISSN=2673-253X ABSTRACT=Background: Natural language processing (NLP) can automate the reading of radiology reports, but NLP methods must be shown to be transferable and reliable before they can transition into real-world clinical applications. We used F1 score, precision and recall to compare NLP tools on a cohort from a study on delirium using images and radiology reports from NHS Fife, and on a population-based cohort (Generation Scotland) that spans multiple NHS health boards. We compared four off-the-shelf rule-based and neural NLP tools (EdIE-R, ALARM+, ESPRESSO, and Sem-EHR) and report on their performance for three cerebrovascular phenotypes: ischaemic stroke, small vessel disease (SVD), and atrophy. Phenotypes are defined by clinical experts from the EdIE-R team using labelling developed during the development of EdIE-R, and with reading of underlying images by an expert researcher. Results: EdIE-R scored the highest F1 in both cohorts for ischaemic stroke, >=93%, followed by ALARM+, >=87%. ESPRESSO's F1 score was >=74% and Sem-EHR's >=66%, although ESPRESSO had the highest precision in both cohorts, 90% and 98%. For SVD, F1 scores were >=98% for EdIE-R and >=90% for ALARM+. ESPRESSO scored lowest at >=77%, with Sem-EHR at >=81%.
F1 scores for atrophy by EdIE-R and ALARM+ in NHS Fife were 99%, dropping in Generation Scotland to 96% for EdIE-R and 91% for ALARM+. Sem-EHR performed lowest for atrophy at 89% in NHS Fife and 73% in Generation Scotland. When comparing NLP tool output with brain image reads using F1 scores, ALARM+ scored 80%, outperforming EdIE-R at 66% for ischaemic stroke. For SVD, EdIE-R performed best, scoring 84%, with Sem-EHR at 82%. For atrophy, EdIE-R and both ALARM+ versions were comparable at 80%. The four NLP tools differ in F1 (and precision/recall) scores across all three phenotypes, most markedly for ischaemic stroke. NLP tools cannot be applied in clinical settings 'out of the box': it is essential to understand the context of their development to assess whether they are suitable for the task at hand, or whether further training, re-training or modification is required to adapt them to the target task.