Real world validation of an AI-based CT hemorrhage detection tool

Introduction Intracranial hemorrhage (ICH) is a potentially life-threatening medical event that requires expedited diagnosis with computed tomography (CT). Automated medical imaging triaging tools can rapidly bring scans containing critical abnormalities, such as ICH, to the attention of radiologists and clinicians. Here, we retrospectively investigated the real-world performance of VeriScout™, an artificial intelligence-based CT hemorrhage detection and triage tool. Methods Ground truth for the presence or absence of ICH was iteratively determined by expert consensus in an unselected dataset of 527 consecutively acquired non-contrast head CT scans, which were sub-grouped according to the presence of artefact, post-operative features and referral source. The performance of VeriScout™ was compared with the ground truths for all groups. Results VeriScout™ detected hemorrhage with a sensitivity of 0.92 (CI 0.84–0.96) and a specificity of 0.96 (CI 0.94–0.98) in the global dataset, exceeding the sensitivity of general radiologists (0.88) with only a minor relative decrement in specificity (0.98). Crucially, the AI tool detected 13/14 cases of subarachnoid hemorrhage, a potentially fatal condition that is often missed in emergency department settings. There was no decrement in the performance of VeriScout™ in scans containing artefact or postoperative change. Using an integrated informatics platform, VeriScout™ was deployed into the existing radiology workflow. Detected hemorrhage cases were flagged in the hospital radiology information system (RIS) and relevant, annotated, preview images made available in the picture archiving and communications system (PACS) within 10 min. Conclusion AI-based radiology worklist prioritization for critical abnormalities, such as ICH, may enhance patient care without adding to radiologist or clinician burden.


Introduction
Intracranial hemorrhage (ICH) is a potentially life-threatening medical event that may occur spontaneously; or in the setting of head trauma or surgical intervention. The etiology of non-traumatic ICH, which has an incidence of >25 per 100,000 person years (1), is diverse and includes hypertension, stroke, ruptured aneurysm, vasculopathy, dural venous sinus thrombosis, arteriovenous fistula, malignancy, anticoagulant use and, rarely, inflammatory disease. Acute

OPEN ACCESS EDITED BY
ICH is associated with an early mortality of 40%-50% and significant neurological disability in surviving patients (1,2). ICH subtypes, defined by extravasation of blood into the subdural, extradural, subarachnoid or parenchymal compartments, vary in terms of clinical presentation, imaging features and prognosis. Head computed tomography (CT), the primary para-clinical tool used to investigate patients presenting to emergency departments with headache or focal neurological deficits, facilitates expedited diagnosis and early intervention that may critically determine clinical outcome in ICH (3,4).
The technical feasibility of artificial intelligence (AI) tools for detecting ICH on head CT has been demonstrated in several recent studies (5)(6)(7)(8)(9). However, limited assessment in real-world clinical settings, where patients or scans with characteristics (for example, severe artefacts, postoperative changes, foreign bodies, other brain pathologies) that were not present in the training dataset occur, has shown suboptimal generalizability (10,11); or performance has been primarily determined by comparison with general, rather than subspecialty expert, radiologist reports (10). Ultimately, patient benefit is also dependent on integration of AI tools with existing radiology workflows, as the use of standalone software interfaces or other additional burdens on clinical staff is a disincentive to their adoption.
We retrospectively investigated the real-world performance of VeriScout™, an artificial intelligence (AI) based CT hemorrhage detection and triage tool, in a dataset of unselected non-contrast CT head scans that were consecutively acquired at a large Australian teaching hospital. VeriScout™ is designed to triage head CT scans acquired in an outpatient/emergency department setting with a high likelihood of hemorrhage, whether acute or chronic, and flag these scans for expedited reporting through background integration with the Radiology Information System (RIS) and notification in existing clinical systems. VeriScout™ is based on a deep learning algorithm and was trained on a dataset of 7,000 expertly labelled head CT scans acquired in an outpatient/emergency department setting. Curation of the training and a separate validation dataset involved the removal of scans with severe metallic or motion artefact; and any post-operative studies. The tool is silently integrated with the existing radiology workflow using an informatics platform (Torana™, Sydney Neuroimaging Analysis Centre, Sydney) that provides a binary result (hemorrhage likely/hemorrhage not likely) to the RIS and returns a low resolution (non-diagnostic) bounding box image to the picture archiving and communications system (PACS), providing the reporting radiologist with an indication of the region of the scan that most informed the algorithm's prediction.

Methods
The study was approved by the Human Research and Ethics Committee, St. Vincent's Hospital.

Dataset
VeriScout™ was evaluated retrospectively in an independent dataset of 527 consecutive, unselected non-contrast head CT scans acquired on one of two CT scanners (Philips Ingenuity 2013; Canon Aquilion ONE Prism 2020) with pre-configured parameters (both with tube voltage of 120kVp, tube current 218 mA, slice thickness of 1 mm, reconstructed with soft tissue windows; with convolution kernel as FCXX or UB, and filter type as MEDIUM or UB), at St. Vincent's Hospital, Sydney. These scanners service the emergency department, outpatients and hospital inpatients. None of the scans acquired for this study were used in the training of the VeriScout™ algorithm; and neither scanner was used for the acquisition of any training data. Ground truth for the presence or absence of hemorrhage was iteratively determined by retrospective review of existing radiology reports by a radiology trainee (RJ) with 2 years of specialty experience, followed by secondary, blind review of every scan by a sub-specialty neuroradiologist (YB), who also documented the presence and degree of metal or motion artefact according to the following scale: 0 = no intracranial artefact, 1 = mild artefact with no impact on interpretation, 2 = moderate artefact with potential minor impact on interpretation, 3 = severe artefact with definite impact on interpretation. All intracranial hemorrhage subtypes, including intra-parenchymal, extra-axial (both acute and chronic), subarachnoid and petechial were labelled as positive. Where there was a discrepancy between the report and the neuroradiologist findings, the scan was reviewed by a third radiologist (TD) and a consensus reached by discussion.

Algorithm and model training
The core algorithm of VeriScout™ was developed based on convolutional neural network technology and comprised two independently trained networks based on ResNeXt (12) that provided inference based on the input case as a 3D volume (referred to as the case-level model) and, separately, a group of 2D slices (referred to as the slice-level model). The purpose of the design was to balance the features extracted from the whole case and those from slices. The final inference result was positive (for likely hemorrhage) if either of the predictions is positive.
Following the network structure and data flow shown in Figure 1, input data was firstly pre-processed with multiple pre-defined greylevel maps (windows) and normalization; output values were then rounded to probabilities and binarized based upon a pre-set threshold that was determined during validation of the algorithm (13).
For the purpose of training neural networks, an extensive brain CT dataset was assembled, consisting of over 20,000 scans sourced from diverse locations throughout Australia. The dataset encompassed more than 50 scanners manufactured by various vendors. Expert clinician reviews of radiology reports were utilized to annotate the dataset, ensuring accurate labeling. To improve the robustness of the network models, data augmentation techniques were employed during the training phase. This involved the random selection of input images from the entire dataset, followed by random cropping, rotation, and contrast adjustments. By employing these techniques, the intention was to encourage the networks to focus on the invariant features present in brain CT images, rather than becoming excessively reliant on features specific to particular scanners or studies.
The networks were optimized to convergence using crossentropy loss and Adam optimizer based on slice-wise gold labels Frontiers in Neurology 03 frontiersin.org indicating hemorrhage subtype. Annotations for hemorrhage subtypes were used for training purposes only and the final output was binarized as "hemorrhage likely" or "hemorrhage unlikely."

Deployment
Head CT images were auto-routed from the scanner to a local instance of Torana™, a software-based informatics platform that appears as a DICOM node on the local network ( Figure 2). On receipt of an appropriate image series (non-contrast head CT), Torana™ automatically de-identified the images by replacing all requisite DICOM headers, before securely (HTTPS protocol) routing the scans over the internet to an Amazon Web Services analysis server hosting the containerized VeriScout™ analysis algorithm. VeriScout™ generated a binary response (hemorrhage/ no hemorrhage) and, for positive cases, non-diagnostic labelled DICOM image(s) of the brain slices that most influenced the algorithm's prediction. This data was automatically routed back to Torana™, which reidentified the image(s), transferred these to the appropriate scan session on the hospital PACS and made the result available in the RIS by flagging scans likely to contain hemorrhage with a specific symbol displayed adjacent to the scan ID in the radiologist's reporting worklist. Additionally, email Network structure and data flow. The dataflow and the network structure of VeriScout™, in which different layers (including convolutional layers, pooling layers, linear layers and activation layers) are defined and trained based on ResNeXt50 (12). Specifically for the inference step of the device, the results of several case-level models and one slice-level model are fused to generate the final output. See text for more details in training and final output. notifications were sent to the nominated radiologist. All postacquisition steps described were fully automated, with no input required from end-users or study staff.

Data analysis
The performance of VeriScout™ was assessed by comparison with the final expert consensus ground truth and expressed as sensitivity and specificity (with 95% confidence intervals) and F1 scores. Discrepant results were classed according to the presence or absence of artefact. Additionally, all discrepant DICOM images were visually re-reviewed to determine other potential reasons for failure of the model. The performance of the model was analyzed by referral source; and postoperative scans were also analyzed independently. Any technical issues (such as processing errors) were analyzed for the root cause.
Results 527 unselected, consecutive CT head scans were acquired, 365 (69.3%) of which were referred from the hospital emergency department (Table 1). Expert consensus found evidence of hemorrhage in 79/527 scans (15.0%), including chronic subdural hematomata and petechial hemorrhage associated with stroke or other pathology such as tumor. The average patient age at the time of scanning was 65.6 years (16.9-97.2). For patients with hemorrhage, the average age was 68.3 years (range 37.7-94.0). There were 45 scans (8.54%) that showed evidence of previous (recent or old) neurosurgery, of which 20 showed evidence of acute or chronic intracranial hemorrhage, predominantly extra-axial in location. When the original report was compared to the multi-radiologist expert consensus, the original report result had a sensitivity of 0.88 and a specificity of 0.98. The sensitivity of the original report was impacted by classification of some subdural collections as hygroma rather than chronic SDH, subacute stroke or tumor associated with petechial hemorrhage as non-hemorrhage and the presence of small amounts of subdural blood associated with recent neurosurgery as post-operative change.
For all scans, VeriScout™ detected hemorrhage with a sensitivity of 0.92 (CI 0.84-0.96) and a specificity of 0.96 (CI 0.94-0.98) using the expert consensus as ground truth (Table 2), in keeping with the expected performance (namely, sensitivity and specificity of the device should exceed 0.90) of the underlying algorithm (13). Precision and F1 scores are also shown in Table 1. The inclusion of postoperative scans and those with artefacts in our scan cohort did not diminish the overall performance of the algorithm despite their exclusion from the VeriScout™ training dataset; and analysis of these scan subsets showed comparable sensitivity and specificity, albeit in limited scan numbers (Table 2). Metallic artefacts primarily affected the posterior fossa structures and temporal lobes, related to dental fillings and metallic earrings/aural implants respectively; and were severe in 18 cases. Severe artefacts related to motion or acquisition issues were present in an additional 4 cases. The algorithm failed to identify hemorrhage (false negative cases) in a small number of cases (n = 6: IPH 4, SDH 1, SAH 1, Table 3); examples of these cases are shown in Figure 3, top row. Similarly, there were a number of false positive cases (n = 16), accounted for primarily by the presence of intracranial calcification, tumor, severe atrophy (misidentified as chronic subdural hematoma) or, in several cases, no visually identifiable pathology. Examples of false positive cases are shown in Figure 3, bottom row.
The detection rate was excellent across all hemorrhage subtypes and, importantly, was preserved in scans containing SAH (13/14 cases), potentially the most critical hemorrhage subtype (Table 3 and Figure 4). Intraparenchymal hemorrhage, which included petechial hemorrhage, was detected in 31/34 cases (Table 3).
VeriScout™ returned a result to the RIS within 10 min in 100% of cases analyzed; and appropriately flagged all positive cases as determined by the algorithm. Upload speed from the hospital network to the cloud analysis server was the primary determinant, with inference completed in less than 1 min in all cases. Integration with the PACS was confirmed by the presence of an appropriate VeriScout™ image(s) in the relevant scan session; a technical misconfiguration prevented initial processing in 11 cases, but this was detected in real-time by the informatics platform and all cases were subsequently re-triggered successfully.

Performance analysis
We demonstrate the performance of an automated AI-based CT head triage system in a hospital radiology department, and its successful integration with our existing PACS-RIS workflow. The sensitivity of VeriScout™ for all intracranial hemorrhage subtypes (0.92), including chronic subdural hematomata and petechial hemorrhage, exceeded that of general radiologists (0.88), with only a minor relative decrement in specificity (0.96 vs. 0.98 respectively). Overall discordance between the AI decision and the consensus expert decision was low (4.17%) and comparable with the rate of disagreement between the general radiologist report and the expert decision in both our cohort (3.03%) and previous reports (14)(15)(16). Reassuringly, the performance of the AI system was not reduced by the inclusion of postoperative scans and those with severe artefact, despite the exclusion of such scans from the algorithm's training dataset.
The AI system's detection rate for SAH (13/14 cases, 0.93), which may require immediate neurosurgical intervention, compared well with the general radiologist report (14/14, 1.00); and significantly exceeds the reported performance of Emergency physicians, who are the first line of defense in many centers with limited radiologist cover after hours (17,18), when some 80% of SAH cases present (19). In a recent retrospective multicenter study from Kundisch and colleagues, the performance of a commercially available AI algorithm was analyzed on 4,946 scans, showing an "estimated miss rate" for ICH of 12.4%, of which almost 40% were SAH (10). However, only results that were discrepant with the primary radiology report (162/4946) were reviewed by an expert neuroradiologist, a process that resulted in re-classification of >3% of cases in our cohort.

Features of the tool
There is a clear need for automated medical imaging triaging tools in settings where the volume of imaging studies exceeds the local radiology reporting capacity. A 2015 UK survey indicated that 330,000 patients were waiting more than 1 month for the results of unreported imaging study studies (20), primarily reflecting lack of adequate radiology workforce to meet increasing demand, especially for CT and MRI (21). Separately, advances in imaging technology, which generate increasingly vast amounts of data per acquisition, in the order of thousands of images per CT or magnetic resonance imaging (MRI) scan, contribute to excessive radiologist workload and high rates of burnout in the profession (22). While the emergence of tele-radiology or centralized reporting can ameliorate local radiologist shortages, geographic dissociation from referring clinicians, especially in emergency department settings, increases the risks associated with delayed reporting. Although triaging tools do not address underlying workforce issues, they can critically inform timely management for patients who are potentially most in need of urgent intervention. Due to the retrospective nature of our study, we did not assess the utility of the real-time clinician notification functionality of VeriScout™/ Torana™, which provides an opportunity to further enhance clinical care by expediting appropriate discussion with onsite or offsite radiologists, even prior to scan reporting. Furthermore, our study design precluded a direct assessment of the impact of VeriScout™ on patient outcomes; or the perceived benefits to radiologists and referring clinicians.
Explainability, an evolving concept in AI, makes a model's predictions interpretable and traceable for the end-user; and is particularly important in medical imaging applications (9). By returning non-diagnostic image(s) to the PACS in cases of suspected intracranial hemorrhage, VeriScout™ is designed to alert the reporting radiologist to the slices and regions (indicated by a bounding box) that most informed the algorithm's prediction, a feature that potentially further reduces the time spent reviewing false positive predictions.
Implementation of AI-based triage systems have the capacity to expedite clinical care to patients with critical brain abnormalities by facilitating early reporting and appropriate discussion with referring clinicians. Triage systems, which are not intended to be diagnostic, carry minimal if any risk in the context of algorithm failure, in which instance the workflow simply defaults to standard clinical care. In this study, we retrospectively assessed the performance of VeriScout™; however, we also demonstrated near real-time integration with our existing RIS-PACS workflow through an associated informatics platform, Torana™; and therefore the feasibility of full deployment in a clinical setting.
Data privacy concerns are a roadblock to the implementation of imaging AI in some jurisdictions and centers (23). Our head CT scans were processed by a secure cloud (Amazon Web Services) analysis server following on-site, automated deidentification of DICOM headers; additionally, image transfer to the server used a secure protocol (HTTPS); and off-site data was automatically deleted when processing completed. While on-site solutions obviate privacy and other governance concerns, these are not scalable and are more difficult to maintain and update. Stringent compliance with HIPAA, GDPR or other relevant regulations; and use of vendor systems that are compliant with international (such as  ISO 13485) and local medical device regulations substantially reduces the risk of exposure of protected health information. Additional precautions, such as 'defacing' software that automatically removes pixels that could be feasibly reconstructed and visually identified from 3D datasets (23), may become compulsory for applications that require off-site data processing.

Future work
VeriScout™ is designed to detect a high likelihood of hemorrhage of any type or extent, and triage these cases for expedited clinical review. We recognize that, in some cases, very small amounts of subarachnoid or subdural blood, particularly in the setting of trauma,  Cases of subarachnoid hemorrhage correctly identified by VeriScout™. VeriScout™ correctly identified 13/14 cases of subarachnoid hemorrhage (SAH). Examples included cases of focal SAH (A,B) and diffuse SAH (C).
Frontiers in Neurology 07 frontiersin.org may have little or no clinical significance in terms of patient management. Conversely, the detection of small amounts of non-traumatic subarachnoid blood may critically impact patient management. Referral information is extremely important in determining the clinical significance of CT findings, including small amounts of ICH. Unfortunately, no referral information was available in the current study; however, the AI tool missed only 6 hemorrhages, only one of which (focal SAH, see Figure 3C) was deemed likely to be of clinical significance by an expert neuroradiologist (YB). As VeriScout™ is not designed as a diagnostic tool, and even small hemorrhages may influence patient management (for example, the administration of antiplatelet or anticoagulant therapy), we believe that our algorithm appropriately triages all hemorrhages for expedited human interpretation in the appropriate clinical context. Future, more general AI algorithms, which are able to access and integrate clinical and imaging information, will likely further refine worklist prioritization and, potentially, diagnostic use cases. AI is likely to transform medical imaging over the coming decades. Worklist prioritization for critical abnormalities, as described here, is one application of medical imaging AI that enhances patient care without adding to radiologist/clinician burden. Increasing demand for the quantitative monitoring of chronic disease, particularly in neurology (24), is a further application that value-adds to qualitative reporting and has the potential to facilitate precision treatment strategy in individual patients. Informatics tools that seamlessly integrate these tools into existing clinical workflows will critically determine their adoption. Finally, the rapidity with which AI-based imaging tools are being developed mandates the implementation of national and local governance frameworks to facilitate their safe adoption in clinical practice.

Data availability statement
The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author.