Validation of a Deep Learning Tool in the Detection of Intracranial Hemorrhage and Large Vessel Occlusion

Purpose: Recently developed machine-learning algorithms have demonstrated strong performance in the detection of intracranial hemorrhage (ICH) and large vessel occlusion (LVO). However, their generalizability is often limited by the geographic bias of the studies that evaluate them. The aim of this study was to validate a commercially available deep learning-based tool in the detection of both ICH and LVO across multiple hospital sites and CT vendors throughout the U.S.

Materials and Methods: This was a retrospective, multicenter study using anonymized data from two institutions. Eight hundred fourteen non-contrast CT cases and 378 CT angiography cases were analyzed to evaluate ICH and LVO, respectively. The tool's ability to detect and quantify ICH, LVO, and their subtypes was assessed across multiple CT vendors and hospitals throughout the United States. Ground truth was based on imaging interpretations from two board-certified neuroradiologists.

Results: There were 255 positive and 559 negative ICH cases. The ICH tool achieved an accuracy of 95.6%, sensitivity of 91.4%, and specificity of 97.5%. ICH was further stratified into the following subtypes: intraparenchymal, intraventricular, epidural/subdural, and subarachnoid, with true positive rates of 92.9, 100, 94.3, and 89.9%, respectively. ICH true positive rates by volume [small (<5 mL), medium (5–25 mL), and large (>25 mL)] were 71.8, 100, and 100%, respectively. There were 156 positive and 222 negative LVO cases. The LVO tool demonstrated an accuracy of 98.1%, sensitivity of 98.1%, and specificity of 98.2%. A subset of 55 randomly selected cases was also assessed for LVO detection at various sites, including the distal internal carotid artery, middle cerebral artery M1 segment, proximal M2 segment, and distal M2 segment, with an accuracy of 97.0%, sensitivity of 94.3%, and specificity of 97.4%.
Conclusion: Deep learning tools can be effective in the detection of both ICH and LVO across a wide variety of hospital systems. While some limitations were identified, specifically in the detection of small ICH and distal M2 occlusion, this study highlights a deep learning tool that can assist radiologists in the detection of emergent findings in a variety of practice settings.


INTRODUCTION
Timely diagnosis of acute cerebrovascular disease is critical to reducing patient mortality and morbidity. Two forms of stroke, intracranial hemorrhage (ICH) and ischemic stroke due to large vessel occlusion (LVO), are especially devastating: 28-day mortality for ICH has been reported at 50.6%, and 6-month mortality due to LVO at 26.2% (1, 2).
Prompt intervention for these entities is critical to achieving improved outcomes. For example, ICH hematoma expansion was significantly reduced with early blood pressure control (3). For LVO, functional independence decreased with every hour of delay to endovascular thrombectomy (4).
Deep learning, a subset of artificial intelligence, has recently emerged as a means to aid clinicians in the timely diagnosis of both ICH and LVO. Newly developed algorithms have demonstrated strong performance in the detection of each (5–12). However, most of these studies were performed at a single institution, and their algorithms have not been validated in different settings.
Given the potential for deep learning tools to aid physicians in the timely and accurate diagnosis of these emergencies, it is important to validate their use across a variety of facilities. Prior studies examining deep learning-based algorithms for imaging assessment have been limited by the geographic bias introduced by their cohorts, with the majority of U.S. states lacking representation (13). The specific aim of this study was to validate a commercially available deep learning-based tool, the CINA® v1.0 device (Avicenna.ai, La Ciotat, France), in the detection of both ICH and LVO across multiple hospital sites and vendors, through a collaboration between the University of California, Irvine (UCI) and vRAD (Minneapolis, USA). In doing so, the goal was to evaluate the generalizability of this tool and mitigate the geographic bias present in similar studies.

MATERIALS AND METHODS
This was a retrospective study using anonymized data from UCI and vRAD. A waiver of consent was obtained from the local Institutional Review Board (IRB) at UCI for the UCI cases and from the Western IRB for the vRAD cases. The CINA® v1.0 device (Avicenna.ai, La Ciotat, France) was used for standalone performance assessment in both the ICH and LVO validation studies. The statistics provided in this manuscript are derived from an external test set (the validation cohort) that is completely independent of the cohort previously used to train the CINA® v1.0 device. Specifically, the training cohort consisted of 8,994 ICH cases acquired between November 2014 and May 2018 and 566 LVO cases acquired between May 2018 and November 2018, all from vRAD; no UCI data were used for training. Additionally, all vRAD cases used for the validation cohort were acquired in 2019 only.

Patient Selection: Intracranial Hemorrhage
A cohort of patients with suspected acute ICH on clinical grounds who had undergone non-contrast CT (NCCT) of the head at UCI or through an American teleradiology service (vRAD) was assessed. For both UCI and vRAD cases, suspected acute ICH cases were identified with keywords such as "hemorrhage," "NCCT," and "head" in the clinical indication or Digital Imaging and Communications in Medicine (DICOM) header information of the NCCT studies. Only the initial scan obtained for ICH evaluation was assessed for each patient in this validation cohort. vRAD cases were acquired in 2019 only, and UCI cases from 2017 to 2019. Inclusion criteria for NCCT scans required a strict axial acquisition, a 512 × 512 matrix, slice thickness of <5 mm, a soft tissue reconstruction kernel, and kVp between 100 and 160.

Patient Selection: Large Vessel Occlusion
A cohort of patients with suspected LVO on clinical grounds who had undergone CT angiography (CTA) of the head at UCI or through vRAD was assessed. For both UCI and vRAD cases, suspected LVO cases were identified with analogous keywords in the clinical indication or DICOM header information of the CTA studies.

Statistical Analysis
Output from the CINA® v1.0 device (Avicenna.ai, La Ciotat, France) was compared with the ground truth determined by the board-certified neuroradiologists via a confusion matrix to obtain sensitivity, specificity, and accuracy. Positive predictive values (PPV) and negative predictive values (NPV) were computed for varying prevalence values (from 10 to 50%, in increments of 5%). All statistics were computed using Excel and MedCalc version 19.7.2. These metrics were calculated for the total cases in both the ICH and LVO groups, as well as for stratifications based on scanner model, NDR, slice thickness, radiation dose parameters, age, and sex, and on ICH subtype and volume and LVO location. The CINA® v1.0 device is not intended to discern ICH subtype or volume; it only detects whether hemorrhage is present. Therefore, ICH subtype and volume information were assessed only in positive cases by the two board-certified neuroradiologists, only true positive and false negative counts could be determined, and only the true positive rate is reported for these classifications.
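The confusion-matrix metrics and the prevalence-adjusted PPV/NPV sweep described above can be sketched as follows. This is an illustrative reconstruction, not the study's actual MedCalc/Excel workflow; the counts are back-calculated from the reported ICH results (255 positive and 559 negative cases) and the formulas are the standard Bayes'-rule identities.

```python
# Illustrative sketch of the confusion-matrix analysis described above.
# TP/FN/TN/FP counts are back-calculated from the reported ICH results
# (255 positive, 559 negative cases); they are not the study data.

def basic_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, and accuracy from a 2x2 confusion matrix."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    acc = (tp + tn) / (tp + fn + tn + fp)
    return sens, spec, acc

def predictive_values(sens, spec, prevalence):
    """Prevalence-adjusted PPV and NPV (Bayes' rule), as in the
    10-50% prevalence sweep described in the Methods."""
    ppv = sens * prevalence / (sens * prevalence + (1 - spec) * (1 - prevalence))
    npv = spec * (1 - prevalence) / (spec * (1 - prevalence) + (1 - sens) * prevalence)
    return ppv, npv

sens, spec, acc = basic_metrics(tp=233, fn=22, tn=545, fp=14)  # ~91.4% / 97.5% / 95.6%
for prev in [p / 100 for p in range(10, 55, 5)]:
    ppv, npv = predictive_values(sens, spec, prev)
    print(f"prevalence {prev:.0%}: PPV {ppv:.1%}, NPV {npv:.1%}")
```

Note how PPV rises and NPV falls as the assumed prevalence increases, which is why the study reports both across a range rather than at a single operating point.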
PASS sample size software was used to calculate the minimum number of cases needed to achieve a 95% CI lower bound of at least 80%, assuming a point estimate of 90% (for sensitivity and specificity separately). Using a binomial dichotomous endpoint for a one-sample study, at least 137 positive and 137 negative anonymized cases were required (for both ICH and LVO).
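The figure of 137 can be reproduced, under assumptions the paper does not state explicitly, with the standard normal-approximation sample-size formula for one proportion: 90% power to show that the 95% CI lower bound exceeds 80% when the true proportion is 90% (one-sided α = 0.025). PASS's exact binomial method may differ slightly for other inputs, but this approximation also yields 137. A stdlib-only sketch:

```python
import math
from statistics import NormalDist

def n_one_proportion(p0, p1, conf=0.95, power=0.90):
    """Normal-approximation sample size for a one-sample proportion test:
    smallest n giving the stated power to conclude the true proportion
    exceeds p0 when it is actually p1. (Assumed reconstruction of the
    PASS calculation; the paper does not state the power used.)"""
    z_a = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # ~1.96 for a 95% CI
    z_b = NormalDist().inv_cdf(power)               # ~1.28 for 90% power
    n = ((z_a * math.sqrt(p0 * (1 - p0)) + z_b * math.sqrt(p1 * (1 - p1)))
         / (p1 - p0)) ** 2
    return math.ceil(n)

print(n_one_proportion(p0=0.80, p1=0.90))  # 137
```

Lowering the assumed power to 80% would reduce the requirement to 108 cases per arm, so the 90%-power assumption is what matches the reported 137.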

Intracranial Hemorrhage
Patient Selection
A total of 824 cases were selected for analysis from a pool of 400 retrospective anonymized cases from vRAD and 424 from UCI. Ten cases were excluded for the following reasons: 1 because slice thickness was not uniform across the volume, 3 because the matrix was not 512 × 512, 1 because it contained a post-contrast series, 2 because they lacked a full field of view, 2 because they were uninterpretable due to significant motion artifact, and 1 because it was uninterpretable due to significant metal artifact. After exclusion, case distribution was 395 from vRAD and 419 from UCI, for a total of 814 cases.

Overall Cases
ICH ground truths were as follows: 204 positive cases from vRAD and 51 from UCI, and 191 negative cases from vRAD and 368 from UCI. There was initial disagreement between the two neuroradiologists on 21 cases; consensus was ultimately reached for each.

Demographics
Performance metrics by age and sex can be found in Table 2.

Scanner Models
Case distribution among the various scanner models is found in

Large Vessel Occlusion
Patient Selection
A total of 406 anonymized CT angiography (CTA) cases were assessed: 93 from UCI and 313 from vRAD. Twenty-eight of these were excluded for the following reasons: 11 were not CTAs, 2 had no contrast, 7 had insufficient contrast, 2 were uninterpretable due to significant motion artifact, 2 were uninterpretable due to significant metal artifact, 3 did not have a full field of view, and 1 had an acquisition issue (z-spacing variability). After exclusion, 378 cases remained.

Overall Cases
LVO ground truths (determined by two board-certified neuroradiologists) were as follows: 156 positive LVO cases and 222 negative LVO cases. There was initial disagreement between the two neuroradiologists on 19 cases; consensus was ultimately reached for each. The CINA® v1.0 algorithm identified 153 true positive LVO cases (Figure 2); full performance metrics are presented in Table 6.

LVO Subtypes
A subset of 55 patients was randomly selected to evaluate performance metrics of the tool in detecting LVO subtypes (4

Scanner Models
Case distribution among the various scanner models can be found in Table 9. Sensitivity for Siemens, Canon (formerly Toshiba), GE Medical Systems and Philips was 96.7, 94.1, 91.8, and 84.4%, respectively. Specificity for GE Healthcare, Philips, Siemens, Canon, and NMS was 90.0, 100, 98.1, 97.7, and 100%, respectively.

DISCUSSION
This retrospective, multicenter study aimed to demonstrate the generalizability of a commercially available deep learning-based tool, CINA® v1.0, in the detection of ICH and LVO across multiple hospital settings. The algorithm performed well in the ICH cohort, with an overall accuracy of 95.6%, sensitivity of 91.4%, and specificity of 97.5%. Among the ICH subtypes, it achieved the highest sensitivity for intraventricular hemorrhage, with a true positive rate of 100%, followed by the epidural/subdural, intraparenchymal, and subarachnoid subtypes, with true positive rates of 94.3, 92.9, and 89.9%, respectively. When stratified by ICH size, it performed best for medium and large volumes, with sensitivities of 100%, but demonstrated lower sensitivity (71.8%) for small volumes. The tool also performed well in the LVO cohort, with an accuracy of 98.1%, sensitivity of 98.1%, and specificity of 98.2%. The algorithm showed robust performance in detecting LVO location in a smaller subset of cases, with an accuracy of 97.0%, sensitivity of 94.3%, and specificity of 97.4%.
These results corroborate previous studies analyzing the ability of deep-learning tools to detect intracranial emergencies. For example, Chilamkurthy et al. used a deep-learning algorithm to detect and classify ICH in large and diverse cohorts in India (6). Another study obtained an AUC of 0.99 in the detection of ICH via deep-learning algorithms; however, this was based on a single institution using relatively uniform scanning parameters on two scanner models (9). Similar studies have been performed with respect to AI detection of LVO. For example, a commercially available deep learning software for LVO detection achieved an AUC of 0.86 using a cohort derived from three tertiary stroke centers (11). Our work expands on these previous studies by showing similarly robust performance of a deep learning tool across a diverse population, regardless of scanner parameters and geographic distribution.
Ultimately, given the robust nature of deep learning tools such as CINA R v1.0, the goal of these tools is to streamline the radiologists' workflow by triaging studies to alert physicians to the most time-sensitive findings, and to act as a second set of eyes when studies are more ambiguous. Studies evaluating the effectiveness of such systems have already begun. For example, when a deep-learning tool was prospectively integrated to prioritize studies in a radiologist's workflow based on the presence of ICH, one study found that time to diagnosis was significantly reduced (14). Future studies with CINA R v1.0 could mirror this type of work and evaluate patient outcomes as influenced by the integration of deep-learning tools into a radiologist's workflow. For example, both inpatient and outpatient settings could be evaluated with regards to these neurologic emergencies and how these tools impact efficiency and ultimately clinical outcomes.
The software had some limitations that warrant further investigation. Perhaps the greatest was in the detection of small bleeds, with false negatives occurring predominantly in very small ICH (<1.5 mL). The false negatives that occurred in larger bleeds (1.5–5 mL) were often located within chronic pathology, such as old hematomas or areas of gliosis, or along extraparenchymal structures such as the falx cerebri. ICH false positives, on the other hand, predominantly occurred in the setting of significant streak or motion artifact. Similar limitations were identified in the LVO cohort with respect to imaging artifact and small size. For example, the LVO false negatives all occurred in the setting of small occlusions <1.3 mm in length (LVO lengths were retrospectively measured only for the three false negative cases, in order to understand why the application failed to detect them, and were not measured for the remaining LVO cases). One false positive LVO case occurred in the setting of significant streak artifact, and another misdiagnosed an area of stenosis as a complete occlusion. Two false positive cases misidentified the sphenoparietal sinus, a venous structure, as an area of occlusion, likely owing to its close proximity to the MCA. As a result, caution should be used when relying solely on the software in these settings. However, the limitations discussed above often occurred in settings that would likely pose similar challenges to radiologists and result in a similar distribution of false negatives and false positives. CINA® v1.0 was trained to identify acute blood based only on hyperdense components; thus, chronic hemorrhages cannot be identified by the algorithm unless they contain more acute hyperdense components.
However, given that low-density hemorrhages (e.g., chronic SDH) are often not emergencies, we believe this distinction is clinically useful rather than a limitation, as it prevents the tool from flooding the radiologist with alerts for non-emergent cases. The tool did not differentiate between acute and non-acute LVO etiologies, such as chronic ICA occlusion, and these may have been included as positive LVO cases. Lastly, the tool was not trained to evaluate occlusions in the anterior cerebral arteries or the posterior circulation. While further work is needed for future tools to better identify more distal occlusions and subtle hemorrhages, the primary goal of CINA® in its current state is to identify obvious findings requiring urgent assessment, for emergent triage and prioritization of the worklist. This is reflected in a separate standalone effectiveness assessment demonstrating a mean "time-to-notification" of 21.6 and 34.7 s for ICH and LVO detection, respectively.
Despite these limitations, CINA® v1.0 demonstrates robust generalizability in the detection of ICH and LVO. For example, in a study examining the geographic distribution of cohorts evaluated by deep learning-based algorithms across medical specialties, 34 states were not represented among 56 studies (13). Our ICH data span 44 states and 204 U.S. cities, while our LVO data reflect scans from 40 states and 158 U.S. cities. To our knowledge, this is the most heterogeneous population cohort studied in the U.S. to date using a deep learning tool for ICH and LVO detection. This study demonstrates the potential for broader application of deep-learning tools across a wide variety of clinical settings.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by University of California, Irvine Institutional Review Board, and Western Institutional Review Board. Written informed consent for participation was not required for this study in accordance with the national legislation and the institutional requirements.