AUTHOR=Ali Stephen R., Strafford Huw, Dobbs Thomas D., Fonferko-Shadrach Beata, Lacey Arron S., Pickrell William Owen, Hutchings Hayley A., Whitaker Iain S.

TITLE=Development and validation of an automated basal cell carcinoma histopathology information extraction system using natural language processing

JOURNAL=Frontiers in Surgery

VOLUME=Volume 9 - 2022

YEAR=2022

URL=https://www.frontiersin.org/journals/surgery/articles/10.3389/fsurg.2022.870494

DOI=10.3389/fsurg.2022.870494

ISSN=2296-875X

ABSTRACT=Introduction

Routinely collected healthcare data are a powerful research resource, but often lack detailed disease-specific information that is recorded in clinical free text such as histopathology reports. We aim to use natural language processing (NLP) techniques to extract detailed clinical and pathological information from histopathology reports to enrich routinely collected data.

Methods

We used the General Architecture for Text Engineering (GATE) framework to build an NLP information extraction system using rule-based techniques. During validation, we deployed our rule-based NLP pipeline on 200 previously unseen, de-identified and pseudonymised basal cell carcinoma (BCC) histopathology reports from Swansea Bay University Health Board, Wales, UK. The output of our pipeline was compared with gold standard human abstraction by two independent and blinded expert clinicians involved in skin cancer care.
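
As an illustration of what rule-based extraction means in practice, the following is a minimal Python sketch using regular expressions. It is not the GATE pipeline described above: the rule names (`diagnosis`, `margin_mm`), the patterns and the sample sentence are all hypothetical, since the abstract does not list the actual rules or entity types.

```python
import re

# Hypothetical rules for illustration only; the study's real rules were
# written in GATE and are not given in the abstract.
RULES = {
    "diagnosis": re.compile(r"\bbasal cell carcinoma\b|\bBCC\b", re.IGNORECASE),
    "margin_mm": re.compile(r"\b\d+(?:\.\d+)?\s*mm\b", re.IGNORECASE),
}


def extract(report: str) -> list[tuple[str, str, int, int]]:
    """Return (label, matched text, start, end) for every rule match,
    ordered by position in the report."""
    found = []
    for label, pattern in RULES.items():
        for match in pattern.finditer(report):
            found.append((label, match.group(0), match.start(), match.end()))
    return sorted(found, key=lambda item: item[2])


# Invented sample sentence, not taken from any real report.
sample = "Nodular basal cell carcinoma, peripheral margin 2.5 mm."
annotations = extract(sample)
for label, text, start, end in annotations:
    print(f"{label}: {text!r} [{start}:{end}]")
```

Each match keeps its character offsets so that system output can later be aligned with human annotations of the same text.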

Results

We identified 11,224 items of information with a mean precision, recall and F1 score of 86.0% (95% CI 75.1-96.9), 84.2% (95% CI 72.8-96.1) and 84.5% (95% CI 73.0-95.1), respectively. The F1 scores of the two clinician annotators differed by 7.9%, compared with a difference of 15.5% between the NLP pipeline and the gold standard corpus. Cohen's kappa on annotated tokens was 0.85.
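
For readers unfamiliar with these metrics, the sketch below shows the standard definitions of precision, recall, F1 and Cohen's kappa in Python. The function names and the example counts and labels are assumptions chosen for illustration; they are not the study's data, and the abstract does not state how the reported means were aggregated across item types.

```python
from collections import Counter


def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard definitions from true positive, false positive and
    false negative counts; returns 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labelling the same tokens."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: share of tokens given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented example values, not the study's counts.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
kappa = cohens_kappa(["T", "T", "O", "O"], ["T", "O", "O", "O"])
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is reported alongside F1 when comparing annotators.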

Conclusion

Using a rule-based NLP approach for named entity recognition (NER) in BCC histopathology reports, we have developed and validated a pipeline with potential applications in improving cancer registry data, supporting service planning and enhancing the quality of routinely collected data for research.