Multi-centre benchmarking of deep learning models for COVID-19 detection in chest x-rays

Introduction This study is a retrospective evaluation of deep learning models developed for the detection of COVID-19 from chest x-rays, undertaken to assess the suitability of such systems as clinical decision support tools. Methods Models were trained on the National COVID-19 Chest Imaging Database (NCCID), a UK-wide multi-centre dataset drawn from 26 NHS hospitals, and evaluated on independent multi-national clinical datasets. The evaluation considers clinical and technical contributors to model error and potential model bias. Model predictions are examined for spurious feature correlations using explainable-prediction techniques. Results Models performed adequately on NHS populations, with performance comparable to radiologists, but generalised poorly to international populations. Models performed better in males than females, and performance varied across age groups. Alarmingly, models routinely failed when applied to complex clinical cases with confounding pathologies and to radiologist-defined "mild" cases. Discussion This comprehensive benchmarking study examines the pitfalls in current practices that have led to impractical model development. Key findings highlight the need for clinician involvement at all stages of model development, from data curation and label definition to model evaluation, to ensure that all clinical factors and disease features are appropriately considered during model design. This is imperative to ensure that automated approaches developed for disease detection are fit for purpose in a clinical setting.

Table S1. Inclusion and exclusion criteria of CXR exams from NCCID, LTHT and COVIDGR data. For all datasets, CXRs are eliminated if not frontal view. NCCID and LTHT CXR exams conducted after 2019 are eliminated if COVID-19 swab or image acquisition data is incomplete. For NCCID data, CXRs are also eliminated if submission centre data is incomplete or if the CXR exam date falls between two non-overlapping windows of COVID-19 infection. Abbrvs: National COVID-19 Chest Imaging Database (NCCID); Leeds Teaching Hospital Trust (LTHT); Anteroposterior (AP); Postero-anterior (PA); Chest X-ray (CXR).

Label generation
For NCCID and LTHT data, CXR labels were generated according to a pre-defined diagnostic window. Under clinical guidance, we defined the COVID-19 diagnostic window as 14 days before and 28 days after the acquisition date of a positive RT-PCR test swab. Each CXR exam date was evaluated relative to the nearest positive RT-PCR COVID-19 swab date; CXRs that fell inside this window (-14/+28 days) were labelled COVID-19 positive. In some cases, evaluation of serial patient swab dates created multiple non-overlapping diagnostic windows; we treated these as separate instances of COVID-19 infection, and CXRs that fell between these windows were removed from the dataset. For COVIDGR, CXRs are provided with COVID-19 labels: COVID-19 CXRs are defined by a positive RT-PCR swab within 24 hours of CXR acquisition. We illustrate the labelling schema through case-by-case examples (Supplementary Fig. S1).
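The diagnostic-window rule can be sketched as follows. This is our own illustrative implementation, not the authors' code; the function and variable names are hypothetical:

```python
from datetime import date, timedelta

# Window around each positive swab: 14 days before, 28 days after.
PRE, POST = timedelta(days=14), timedelta(days=28)

def label_cxr(cxr_date, positive_swab_dates):
    """Return 'positive', 'negative', or 'exclude' for one CXR exam date.

    A CXR inside any (-14/+28 day) window around a positive RT-PCR swab is
    COVID-19 positive; one falling strictly between two non-overlapping
    windows is excluded; all others are negative.
    """
    windows = sorted((s - PRE, s + POST) for s in positive_swab_dates)
    for start, end in windows:
        if start <= cxr_date <= end:
            return "positive"
    # Check whether the CXR falls in the gap between consecutive windows.
    for (_, end_a), (start_b, _) in zip(windows, windows[1:]):
        if end_a < cxr_date < start_b:
            return "exclude"
    return "negative"
```

For example, a CXR acquired ten days before a single positive swab would still be labelled positive, whereas a CXR between two separate infection windows would be excluded, matching Cases 2 and 3 of Supplementary Fig. S1.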

Counterfactual condition generation
Figure S2 shows how CXR labels are combined to create the negative and positive cohorts from which the various counterfactual datasets are constructed.

Chest X-ray observable comorbidities
Comorbidities were categorised with clinical guidance. We grouped the LTHT population with recorded comorbidities into two categories: cases with comorbidities that could be observed in a CXR, i.e., features of the disease are known to exist in the thoracic area, and cases without any CXR-observable comorbidities. Typically, the CXR-observable class of comorbidities comprises respiratory and cardiac diseases/disorders, whilst the non-observable class comprises neurological diseases, inflammatory disorders (excluding respiratory), and blood diseases. Supplementary Table S2 presents an exhaustive list of the considered comorbidities and the category to which each belongs.

Image processing
All images were extracted from high-resolution DICOM files and resized to 480x480 using area interpolation. To minimise the risk of model overfitting, a pipeline of image augmentations was applied; augmentations included rotation, flipping, shifting, scaling, and random brightness/contrast adjustment. Image augmentations were applied uniformly during model training, with the exception of training under self-supervised conditions, for which distortion, in-painting and perspective transformations were applied.
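A minimal sketch of the two core preprocessing steps is given below. This is our own illustration, not the study's pipeline (which would typically use a library such as OpenCV or albumentations); it assumes a square grayscale image whose side is an integer multiple of the target size, and shows only two of the listed augmentations:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility in this sketch

def resize_area(img, size=480):
    """Area-interpolation-style downsampling: average non-overlapping
    f x f blocks, where f is the integer downscale factor."""
    f = img.shape[0] // size
    return img[: size * f, : size * f].reshape(size, f, size, f).mean(axis=(1, 3))

def augment(img):
    """Apply a random horizontal flip and a random brightness jitter,
    two of the augmentations listed above (parameters are illustrative)."""
    if rng.random() < 0.5:
        img = img[:, ::-1]
    return np.clip(img * rng.uniform(0.9, 1.1), 0.0, 1.0)
```

Averaging blocks (rather than subsampling pixels) is what makes area interpolation well suited to downscaling: every source pixel contributes to the output, which preserves fine radiographic texture better than nearest-neighbour sampling.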

Model selection
We carried out an informal search for models of interest using a set of core keywords: "chest Xrays", "COVID-19" and ("deep learning" OR "artificial intelligence"). To ensure a wide variety of approaches was identified, we added deep learning-specific keywords: "anomaly detection", "out-of-distribution", "semi-supervised", "weakly supervised", "self-supervised", "unsupervised", "generative", "autoencoder", and "uncertainty". We searched an array of research databases, including Google Scholar, PubMed, Scopus and IEEE Xplore. Considering the rapid development of this field and the importance of efficient dissemination of associated findings, pre-print manuscripts were intentionally included in the search.

Training protocol
We selected binary cross-entropy (BCE) as the objective function for all models, with the exception of MAG-SD, for which we used categorical cross-entropy (due to model architecture restrictions). We tuned model hyperparameters via Optuna, an open-source hyperparameter optimisation framework. The learning rates were tuned for the pre-training and training stages. We searched learning rates from 0.01 to 0.0001, with intervals of 0.01. The learning rate that produced the lowest validation loss over 10 epochs was selected. All models were trained under 5-fold cross-validation. During model training, if the validation loss plateaued for more than 5 epochs, the learning rate was reduced by 10%. If the validation loss did not improve for 15 epochs, model training was stopped.
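The plateau schedule and stopping rule above can be sketched framework-free as follows; this is our own simplified reading of the protocol (in PyTorch the equivalents would be `ReduceLROnPlateau` and an early-stopping callback), and the class name and counter logic are ours:

```python
class PlateauSchedule:
    """Reduce LR by 10% when validation loss plateaus for more than
    `patience` epochs; stop after `stop_patience` epochs without improvement."""

    def __init__(self, lr, patience=5, stop_patience=15):
        self.lr = lr
        self.best = float("inf")
        self.patience, self.stop_patience = patience, stop_patience
        self.since_best = 0    # epochs since validation loss last improved
        self.since_reduce = 0  # epochs of plateau since the last LR reduction

    def step(self, val_loss):
        """Record one epoch's validation loss; return (lr, stop_flag)."""
        if val_loss < self.best:
            self.best = val_loss
            self.since_best = self.since_reduce = 0
        else:
            self.since_best += 1
            self.since_reduce += 1
            if self.since_reduce > self.patience:
                self.lr *= 0.9  # reduce learning rate by 10%
                self.since_reduce = 0
        return self.lr, self.since_best >= self.stop_patience
```

Keeping two separate counters lets the learning rate be reduced repeatedly during a long plateau while the stopping criterion tracks the full run of epochs without improvement.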

Lung segmentation
The lung segmentation model was trained with a VGG backbone and UNet++ architecture. The VGG UNet++ was trained under deep supervision, where nested layer predictions inform model training. Training minimised a soft Dice-BCE loss using the Adam optimiser. Learning rates were reduced by a factor of 0.8 on sustained plateaus in validation loss (10 epochs), and training was stopped if the validation loss plateaued for more than 20 epochs. Qualitative evaluation of post-processed ROIs showed that this approach was largely successful in preserving all clinically relevant anatomical structures in the CXR while eliminating 'noise', with consistent performance on both COVID-19 positive and COVID-19 negative CXRs (Supplementary Fig. S4).

SUPPLEMENTARY TABLES AND FIGURES

Figure S1: Labelling schema for (A) NCCID and LTHT datasets, which share an identical labelling protocol, and (B) COVIDGR data. For each protocol we present examples of different label outcomes. (A) CASE 1: An illustration of CXR acquisition preceding the RT-PCR swab date diagnostic window (-14/+28 days); this case is therefore considered COVID-19 negative. (A) CASE 2: An example of CXR acquisition prior to the RT-PCR swab date but within the diagnostic window; as a result, this case is labelled COVID-19 positive. (A) CASE 3: A scenario involving CXR elimination, where multiple swab tests are documented for a single case. If a CXR is acquired within the time frame between the windows around the swab dates, it is excluded from the dataset. (B) CASE 1: A case in which the CXR was acquired within the diagnostic window, specifically within 24 hours of the RT-PCR swab date (-1/+1 days). As a result, this case is designated as COVID-19 positive. (B) CASE 2: An example of CXR acquisition occurring after the diagnostic window, leading to the categorisation of this case as COVID-19 negative. Abbrvs: Chest X-ray (CXR); National COVID-19 Chest Imaging Database (NCCID); Leeds Teaching Hospital Trust (LTHT); Reverse Transcription Polymerase Chain Reaction (RT-PCR).

Figure S2: Generation of positive and negative cohorts of the test dataset LTHT and the counterfactual test datasets LTHT NO PNEUMONIA (NP) and LTHT PNEUMONIA (P). The incomplete arrow indicates that non-COVID-19 pneumonia CXRs are not included in LTHT (NP). Abbrvs: Leeds Teaching Hospital Trust (LTHT).

Figure S3: Visual representation of evaluated models. The applied CNN backbones are specified and colour-coding is used to indicate DL type. Abbrvs: Deep Learning (DL); Neural Architecture Search (NAS).

Figure S5: Post-processing of COVID-19 CXR semantic lung segmentation to generate reliable ROIs. (A) An example of severe COVID-19 in a CXR resulting in an uncertain prediction, with a total prediction uncertainty of 620. The right lung field is under-segmented; the additional structure is highlighted in the semantic segmentation and corresponding uncertainty map. This creates an overly large ROI around the lungs. Pixel-wise prediction uncertainty is used to isolate the extra structure, which is eliminated during post-processing of the semantic segmentation. (B) A COVID-19 CXR with under-segmentation of the lung fields (highlighted in the semantic segmentation and uncertainty map). Pixel-wise uncertainty is used to remove this structure from the semantic segmentation before creating an ROI. The total prediction uncertainty of this example is 230. Abbrvs: Region Of Interest (ROI); Chest X-ray (CXR).

Figure S6: (A) XCEPTION NET, (B) SSL-AM and (C) XVITCOS AUROCs for the LTHT subgroups and LTHT (P) subgroups. Only subgroups that exist in the LTHT (P) population are included. n is the subgroup population size. Error bars correspond to the standard deviation across cross-validation folds. Abbrvs: Area Under the Receiver Operating Characteristic (AUROC); Region of Interest (ROI).

Table S3: Comparison of average full-CXR-trained and ROI-trained model performance metrics. The standard deviation of metrics across cross-validation folds is included. N is the total number of cases in each test population. Abbrvs: Chest X-ray (CXR); National COVID-19 Chest Imaging Database (NCCID); Leeds Teaching Hospital Trust (LTHT); Accuracy (Acc); Precision (Prec); Area Under the Receiver Operating Characteristic (AUROC).

Table S5: Average model performance metrics for each subgroup in the LTHT population.