AUTHOR=Batch Karen E., Yue Jianwei, Darcovich Alex, Lupton Kaelan, Liu Corinne C., Woodlock David P., El Amine Mohammad Ali K., Causa-Andrieu Pamela I., Gazit Lior, Nguyen Gary H., Zulkernine Farhana, Do Richard K. G., Simpson Amber L.

TITLE=Developing a Cancer Digital Twin: Supervised Metastases Detection From Consecutive Structured Radiology Reports

JOURNAL=Frontiers in Artificial Intelligence

VOLUME=5

YEAR=2022

URL=https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2022.826402

DOI=10.3389/frai.2022.826402

ISSN=2624-8212

ABSTRACT=The development of digital cancer twins relies on the capture of high-resolution representations of individual cancer patients throughout the course of their treatment. Our research aims to improve the detection of metastatic disease over time from structured radiology reports by exposing prediction models to historical information. We demonstrate that Natural Language Processing (NLP) can generate better weak labels for semi-supervised classification of computed tomography (CT) reports when it is exposed to consecutive reports through a patient’s treatment history. A total of 714,454 structured radiology reports from Memorial Sloan Kettering Cancer Center, all adhering to a standardized departmental structured template, were used for model development, with a subset of reports included for validation. To develop the models, a subset of reports was curated for ground truth: 7,732 total reports in the lung metastases dataset from 867 individual patients; 2,777 reports in the liver metastases dataset from 315 patients; and 4,107 reports in the adrenal metastases dataset from 404 patients. We use NLP to extract and encode important features from the structured text reports, which are then used to develop, train, and validate models. Three models – a simple convolutional neural network, a convolutional neural network augmented with an attention layer, and a recurrent neural network – were developed to classify the type of metastatic disease and validated against the ground truth labels. The models use features from a patient's consecutive structured text radiology reports to predict the presence of metastatic disease in the reports. A previously developed single-report model, which analyzes one report rather than multiple past reports, is included as a baseline, and the results from all four models are compared on accuracy, precision, recall, and F1-score. The best model is used to label all 714,454 reports to generate metastases maps.
Our results suggest that NLP models can extract cancer progression patterns from multiple consecutive reports and predict the presence of metastatic disease in multiple organs with higher performance than single-report-based prediction. This work demonstrates a promising automated approach to labeling large numbers of radiology reports in a time- and cost-effective manner without involving human experts, and it enables tracking of cancer progression over time.