Contribution of CT-Scan Analysis by Artificial Intelligence to the Clinical Care of TBI Patients

The gold standard to diagnose intracerebral lesions after traumatic brain injury (TBI) is computed tomography (CT) scan, and due to its accessibility and improved quality of images, the global burden of CT scan for TBI patients is increasing. The recent developments of automated determination of traumatic brain lesions and medical-decision process using artificial intelligence (AI) represent opportunities to help clinicians in screening more patients, identifying the nature and volume of lesions and estimating the patient outcome. This short review will summarize what is ongoing with the use of AI and CT scan for patients with TBI.


INTRODUCTION
Traumatic brain injury (TBI) is a leading cause of death and disability in young people. It affects more than one European in 400 (1) and therefore has a strong human and financial impact on society (2). While the initial insult induces primary lesions that are instantaneous and unavoidable, it can also cause delayed lesions, termed secondary injuries, which strongly influence the neurological outcome. The critical care management of TBI patients therefore aims to limit secondary brain injuries with therapies adapted to each patient's severity (3).
Assessment of TBI severity relies on clinical examination and initial brain imaging. Clinical examination is of limited value at the early phase of TBI and is based on pupillary reactivity and the Glasgow Coma Score (GCS), which classifies TBI into three grades: mild, moderate, and severe. As a result, GCS categories cover a wide range of TBI presentations and are unable to differentiate specific patterns such as diffuse injuries or focal lesions (4). Brain imaging, mainly computed tomography (CT) scan (5), is an additional tool to better classify TBI patients. Based on radiological findings, researchers have built imaging scores that describe and quantify brain lesions. Several scores have been developed, namely the Marshall classification (6), the Rotterdam score (7), the Stockholm score (8), and the Helsinki score (9). All of these scores are correlated with the neurological outcome as measured by the Glasgow Outcome Scale (GOS) (9)(10)(11)(12).
However, these CT scan scores have some drawbacks. First, their items are dichotomized (yes or no), which may underestimate the global effect of small intracerebral lesions. Second, despite their relative simplicity, inter-observer variability should be considered when using these scores. Chun et al. (13) measured this inter-observer agreement and found an average Cohen's Kappa coefficient of 0.57 for the detection of CT abnormalities, far from the value of 1 that corresponds to full inter-observer agreement. Finally, assigning a CT scan score to each patient is demanding in human resources, which may explain why these scores are not used in clinical practice. In this context, the development of automated approaches to determine the nature and volume of traumatic brain injuries from CT scans has gained interest among clinicians.
Due to the philosophical implications of the definition of intelligence, the general definition of artificial intelligence (AI) remains unclear. Nevertheless, one can use the expression "AI" as defined in (14): a field of computer science dedicated to the creation of systems performing tasks that usually require human intelligence, and the term "machine learning" (ML) as: a subfield of AI that includes all those approaches that allow computers to learn from data without being explicitly programmed. The rise in computing power at the beginning of the 2000s and the creation of large databases have revealed the effectiveness of AI algorithms applied to medical images to improve clinical care (14). Among the three ways of learning for ML algorithms (supervised, unsupervised, and reinforcement learning), the approach mainly used on biomedical imaging to predict outcome is supervised learning, where the input data and the variable to predict are known and the algorithms learn the relation between them (15). This expanding field of research may help the quantitative analysis of CT images and offer new perspectives in the clinical care of TBI patients by following the process presented in Figure 1A. This review presents studies focusing on the classification and the segmentation of lesions based on the manual or automated analysis of CT scans. The main studies are summarized in Table 1. We will also cover the current research gaps and potential perspectives of the use of AI combined with CT scans in the TBI area.

Definition and Evaluation
Classification is the way for an algorithm to attribute a class to each input object, which is defined by one or more data points. There are different metrics to evaluate the performance of a classification. Let us assume a classification algorithm aimed at predicting a pathology and let us consider its evaluation on a test database. One defines as TP the number of true positive predictions, TN the number of true negative predictions, FP the number of false positive predictions, and FN the number of false negative predictions. The algorithm's Accuracy (Acc), which represents the proportion of correct predictions, can be computed as:
$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}$$
To separate the cases of false prediction on healthy or pathological subjects, one can define the Sensitivity (Se) and Specificity (Sp):
$$\mathrm{Se} = \frac{TP}{TP + FN}, \qquad \mathrm{Sp} = \frac{TN}{TN + FP}$$
Modifying the discrimination threshold of a classification algorithm makes it possible to weigh the cost of predicting the pathology. It is indeed more costly to predict normality for a patient than pathology for a healthy subject. Adjusting the threshold can therefore decrease FN, at the price of an increase in FP, and it modifies TP and TN accordingly. Each threshold leads to new values of TP, TN, FP, and FN, and one can summarize these values by representing Se as a function of (1 − Sp), which defines the receiver operating characteristic (ROC) curve. The area under the ROC curve (AUC) represents the aggregate performance of the classification algorithm, whatever its discrimination threshold, and is a metric widely used to evaluate classification algorithms. An AUC of 1 describes a perfect classifier, and a value of 0.5 means that the classifier performs no better than chance.
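As an illustration, the metrics defined above can be sketched in a few lines of Python. This is a minimal sketch, not the evaluation code of any cited study: the function names, the toy labels, and the toy scores are hypothetical, a subject is predicted pathological when its score is at or above the threshold, and both classes are assumed to be present in the test set.

```python
def confusion_counts(y_true, scores, threshold):
    """Return (TP, TN, FP, FN), predicting 'pathological' when score >= threshold."""
    tp = tn = fp = fn = 0
    for truth, score in zip(y_true, scores):
        predicted = score >= threshold
        if predicted and truth:
            tp += 1
        elif predicted:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def sensitivity(tp, fn):
    return tp / (tp + fn)


def specificity(tn, fp):
    return tn / (tn + fp)


def roc_auc(y_true, scores):
    """Area under the ROC curve, using the trapezoidal rule over all thresholds."""
    points = [(0.0, 0.0)]  # (1 - Sp, Se) for a threshold above every score
    for threshold in sorted(set(scores), reverse=True):
        tp, tn, fp, fn = confusion_counts(y_true, scores, threshold)
        points.append((1 - specificity(tn, fp), sensitivity(tp, fn)))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical test set: 1 = pathological, 0 = healthy.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]

for threshold in (0.5, 0.35):
    tp, tn, fp, fn = confusion_counts(y_true, scores, threshold)
    print(threshold, accuracy(tp, tn, fp, fn), sensitivity(tp, fn), specificity(tn, fp))
print("AUC:", roc_auc(y_true, scores))
```

On this toy set, moving the threshold from 0.5 to 0.35 removes the single false negative without adding a false positive; lowering it further would start to trade specificity for sensitivity, which is the compromise the ROC curve summarizes.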

Manual Quantification
The first studies involving AI tools aimed to predict the outcome after TBI. Patient outcome at 6 and 12 months was dichotomized as favorable vs. unfavorable (i.e., death and severe disability) according to the GOS score. From a methodological point of view, these studies were based on the analysis of heterogeneous metrics, representing demography (age, sex, etc.), physiology (pupil response to light, GCS score, blood glucose concentration, etc.), and brain CT scan metrics manually quantified by radiologists using imaging scores. These metrics were aggregated to build prognostic models using multivariate logistic regression approaches, as reviewed by Perel et al. (31). Two large datasets of TBI patients produced predictive models: CRASH (18) and IMPACT (19,32,33). These models were subsequently externally validated, with good performance in predicting mortality and unfavorable outcome (AUC between 0.65 and 0.87) (34). The relative contribution of the metrics was estimated: although the main predictors are age and GCS (35), manually quantified CT characteristics also carry crucial information for predicting TBI outcome (18,36,37). CT characteristics were also evaluated individually and showed important prognostic ability in TBI outcome, as reviewed by Zhu et al. (38), but imaging scores alone were only able to predict outcome with an AUC between 0.56 and 0.82 (9,11,12).
More recently, more complex algorithms were evaluated instead of multivariate logistic regression (MLR). In 2018, nine ML algorithms were tested on 232 TBI patients in order to predict outcome and mortality from 14 predictors, including radiological findings at admission and on the following day (20). The best models were obtained using random forest (RF), providing an AUC of 0.895 for outcome prediction, and using ridge regression, providing an AUC of 0.875 for mortality prediction, both on a dedicated test set.
Figure 1. (B) Example of the use of artificial intelligence (AI) algorithms in clinical routine. CT scans of two patients (P1 and P2) at D0 were quantified with state-of-the-art algorithms. On the right, CT scans of the same two patients acquired at D1 are shown. P1 and P2 received different clinical care: P1 underwent a decompressive craniectomy and P2 did not. The largest extra axial hemorrhage (EAH) lesion was segmented with the Brain Lesion Analysis and Segmentation Tool for Computed Tomography (BLAST-CT) (16), and radiomic metrics on this region of interest (ROI) were extracted as in (17). At first sight, the two lesions have the same profile, with equivalent volumes and means, but the variance for P1 is more than twice that for P2. This could, for instance, be a biomarker evaluated in further studies to predict the need for craniectomy. ICH, intracranial hemorrhage; GCS, Glasgow Coma Score; MLR, multivariate logistic regression; RF, random forest; ANN, artificial neural network; GOS, Glasgow Outcome Scale; CT scan, computed tomography image; Ref, References; HU, Hounsfield Units; ROI, region of interest; EAH, extra axial hemorrhage; D, day.
Some other benchmarks of ML algorithms included artificial neural network (ANN) experiments (39). ANN-based models showed contrasting performances in classifying the outcome from manually quantified metrics: some studies outperformed the state-of-the-art models, with an AUC = 0.946 for predicting the dichotomized GOS (21) or an Acc = 0.92 for predicting 6-month mortality (22), while others showed lower performances for predicting in-hospital mortality (AUC = 0.706) (23) or dichotomized outcome (AUC = 0.78) (24). These differences can be explained by the gap between the numbers of subjects in the studies (all datasets included: 565, 2,059, 11,026, and 12,576, respectively, in order of citation), the proportion of pathological and healthy subjects (6, 10, 26, and 47%), or the validation method (internal, internal, external, external). These differences indicate that the methodology is a crucial step to truly measure ML model performances and to trust their predictions (40).
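To see why the proportion of pathological subjects matters when comparing reported accuracies, consider a trivial classifier that always predicts the healthy class. The sketch below is a worked arithmetic example only; the cohort size of 1,000 is hypothetical and the function name is ours, while the prevalences of 6 and 47% are the extreme values quoted above.

```python
def always_negative_metrics(n_subjects, prevalence):
    """Accuracy and sensitivity of a classifier that never predicts the pathology."""
    positives = round(n_subjects * prevalence)
    negatives = n_subjects - positives
    tp, fn = 0, positives  # every pathological subject is missed
    tn, fp = negatives, 0  # every healthy subject is correctly classified
    acc = (tp + tn) / (tp + tn + fp + fn)
    se = tp / (tp + fn)
    return acc, se


for prevalence in (0.06, 0.47):
    acc, se = always_negative_metrics(1000, prevalence)
    print(f"prevalence={prevalence:.2f}  Acc={acc:.2f}  Se={se:.2f}")
```

With 6% of pathological subjects, this uninformative classifier already reaches an accuracy of 0.94 with a sensitivity of 0, whereas at 47% its accuracy falls to 0.53. Accuracies obtained on cohorts with very different class proportions are therefore not directly comparable.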

Automated Quantification
Besides clinical metrics, CT scan provides high spatial resolution images of the brain that contain much more information than that summarized in imaging scores. Several researchers recently started to use automatic algorithms to exploit this large amount of CT scans with the aim to find biomarkers able to predict outcome in TBI.

Histograms
To our knowledge, the first article to study TBI on CT scan was Kim et al. (25). Thanks to an automatic preprocessing pipeline of brain extraction of head injured pediatric patients, the authors showed that the proportion of brain voxel values ranging between 17 and 24 Hounsfield Units (HU) was a good predictor of edema severity (AUC = 0.85). This work demonstrated that (i) CT scans contain valuable quantitative information linked to the evolution of the status of the brain and (ii) automatic tools may be used to extract these quantitative features.
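A histogram-based feature of this kind can be sketched as follows. This is not the pipeline of Kim et al. (25): the brain mask is assumed to be already available from a brain extraction step, the volume is represented as nested Python lists for simplicity, the bounds are treated as inclusive (an assumption), and the function name and toy values are hypothetical.

```python
def hu_fraction(volume, brain_mask, low=17, high=24):
    """Fraction of brain voxels whose Hounsfield value lies in [low, high].

    `volume` and `brain_mask` are nested lists indexed as [slice][row][column];
    a truthy mask value marks a voxel that belongs to the brain.
    """
    in_brain = 0
    in_range = 0
    for slice_hu, slice_mask in zip(volume, brain_mask):
        for row_hu, row_mask in zip(slice_hu, slice_mask):
            for hu, is_brain in zip(row_hu, row_mask):
                if is_brain:
                    in_brain += 1
                    if low <= hu <= high:
                        in_range += 1
    if in_brain == 0:
        raise ValueError("brain mask is empty")
    return in_range / in_brain


# Hypothetical single-slice volume: air (-1000 HU) and bone (1000 HU) are masked out.
volume = [[[-1000, 20, 30, 18],
           [22, 35, 1000, 17]]]
brain_mask = [[[0, 1, 1, 1],
               [1, 1, 0, 1]]]

print(hu_fraction(volume, brain_mask))  # 4 of the 6 brain voxels fall within 17-24 HU
```

Restricting the count to masked voxels is what makes the brain extraction step necessary: without it, air, bone, and scalp would dominate the histogram.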

Radiomics
The distribution of voxel values, represented by a histogram, is a simple metric to quantitatively characterize an image. One can go further and extract more complex metrics to represent the shape, texture, or contrast of an image (41). This research field is called radiomics (42) and has proven its relevance in tumor diagnosis (43), although its biological meaning is still under discussion (44). Each radiomic metric can be calculated on a 2D or 3D region of interest (ROI), so that a large number of metrics can be extracted to characterize the ROI. This approach has been employed in TBI patients (17), suggesting that first-order statistics (FOS), texture, and shape metrics extracted from the ROI of segmented lesions could discriminate the nature of the injury among epidural hematoma (EDH), acute subdural hematoma (ASH), and contusion, with an Acc = 89.7%. Another way to use radiomics is to extract each metric from a sliding window moving across a CT scan, which generates as many parametric maps as metrics used. This method has been used by Muschelli et al. (45) to segment hemorrhage. However, to our knowledge, it has not yet been specifically used for TBI.
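The simplest family of radiomic metrics, first-order statistics, can be sketched with the Python standard library alone. This is an illustrative sketch rather than the feature set of (17): the function name, the voxel volume parameter, and the two toy lesions are hypothetical, and the population variance is used as one possible convention.

```python
import statistics


def first_order_stats(volume, roi_mask, voxel_volume_mm3=1.0):
    """First-order statistics of the Hounsfield values inside a region of interest.

    `volume` and `roi_mask` are nested lists indexed as [slice][row][column].
    """
    values = [
        hu
        for slice_hu, slice_mask in zip(volume, roi_mask)
        for row_hu, row_mask in zip(slice_hu, slice_mask)
        for hu, inside in zip(row_hu, row_mask)
        if inside
    ]
    if not values:
        raise ValueError("ROI is empty")
    return {
        "volume_mm3": len(values) * voxel_volume_mm3,
        "mean_hu": statistics.fmean(values),
        "variance_hu": statistics.pvariance(values),
        "min_hu": min(values),
        "max_hu": max(values),
    }


# Two hypothetical lesions with the same volume and mean but different heterogeneity.
roi = [[[1, 1], [1, 1]]]
lesion_p1 = [[[50, 70], [60, 60]]]
lesion_p2 = [[[55, 65], [60, 60]]]

stats_p1 = first_order_stats(lesion_p1, roi)
stats_p2 = first_order_stats(lesion_p2, roi)
print(stats_p1["mean_hu"], stats_p1["variance_hu"])
print(stats_p2["mean_hu"], stats_p2["variance_hu"])
```

The toy lesions mirror the situation described in the caption of Figure 1B: volume and mean density are identical, and only the variance separates the two ROIs. Texture and shape metrics require more elaborate computations and are not covered by this sketch.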

Convolutional Neural Networks
Deep learning (DL) is a subset of ML methods based on ANN. There are several types of networks, including recurrent neural networks (RNN) and convolutional neural networks (CNN). In computer vision, due to their ability to analyze pixels and their neighborhood, CNN are the most widely used networks in classification and segmentation tasks, including in the biomedical imaging field (46,47). In TBI, to classify injuries from CT scans, Chilamkurthy et al. (26) built several DL (and RF) algorithms, each able to detect the presence or absence of one type of lesion in a CT scan. The lesion types included different types of intracranial hemorrhage, fracture, midline shift, and mass effect. These algorithms were externally validated with an AUC from 0.92 to 0.97.

Definition and Evaluation
Segmentation is, in image processing, an operation that aims to split an image into two or more ROIs. Let us assume a segmentation algorithm aimed at segmenting a brain lesion, and its evaluation on a test database. To compare the output segmentation X to a ground truth segmentation Y, one can compute the Dice similarity coefficient (DSC), defined as, for |X| the number of pixels in the segmentation X:
$$\mathrm{DSC}(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|}$$
This coefficient measures the number of pixels common to the two segmentations in relation to the total number of pixels of the two segmentations. Some algorithms are trained on datasets that include images without lesions, and therefore without ground truth segmentation. In these cases, if the algorithm predicts a lesion, even a very small one, the DSC is equal to 0 and strongly affects the mean DSC, although the segmentation error is quite small. That is why, in some studies, the DSC is only computed on large lesions, leading to better DSC values than in studies measuring the DSC on all lesions.
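The sensitivity of the DSC to scans without lesions can be illustrated with a short sketch. The masks below are hypothetical flattened binary masks, the function name is ours, and returning 1.0 when both masks are empty is a convention chosen here, not one taken from the cited studies.

```python
def dice(pred, truth):
    """Dice similarity coefficient between two binary masks given as flat sequences."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    size_pred = sum(1 for p in pred if p)
    size_truth = sum(1 for t in truth if t)
    overlap = sum(1 for p, t in zip(pred, truth) if p and t)
    if size_pred + size_truth == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2 * overlap / (size_pred + size_truth)


n_voxels = 1000

# Scan 1: a 100-voxel lesion, predicted with a 10-voxel shift (90 voxels overlap).
truth_lesion = [1 if i < 100 else 0 for i in range(n_voxels)]
pred_lesion = [1 if 10 <= i < 110 else 0 for i in range(n_voxels)]

# Scan 2: no lesion in the ground truth, one spurious voxel in the prediction.
truth_healthy = [0] * n_voxels
pred_healthy = [1] + [0] * (n_voxels - 1)

dsc_lesion = dice(pred_lesion, truth_lesion)
dsc_healthy = dice(pred_healthy, truth_healthy)
print(dsc_lesion, dsc_healthy, (dsc_lesion + dsc_healthy) / 2)
```

A single misclassified voxel on the lesion-free scan yields a DSC of 0 and halves the mean DSC of this two-scan set, even though 20 voxels are wrong on the first scan and only one on the second.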

State of the Art
In computer vision segmentation, as well as in classification (see section 2.2.2.3), CNN is the most used ANN subtype.

Segmentation
In TBI data, Jadon et al. [...] (48), and trained it on 839 TBI scans where the four former lesion types had been delineated by trained personnel. The resulting model's segmentation was validated on an internal cohort of 655 scans, providing a DSC of 0.36 (as mentioned before, the DSC is very sensitive to small volumes and might not be the best metric to evaluate the segmentation of small lesions), while the classification was validated on an external cohort of 490 scans, providing an AUC between 0.83 and 0.95 for the different lesion types to classify.

Quantification From Segmentation
Besides this type of segmentation, some researchers went further in the quantification of CT images. Jain et al. (28) explored the measurement of cistern volume and midline shift by means of a pipeline composed of CNNs and morphological operations, and obtained an Acc of 0.89 for midline shift estimation. Yao et al. (30) went, to our knowledge, the furthest. They automatically measured hematoma volumes, their localization, and shape features representing the hematomas, and trained an RF algorithm with these metrics and biological information. They were thereby able to predict 6-month mortality, achieving an AUC of 0.85 on internal validation on 828 patients.

DISCUSSIONS AND CONCLUSION
The development of ML algorithms offers new possibilities in predicting TBI outcome. While these methods provide little improvement in predicting outcome from clinical variables and imaging scores compared to classic multivariate logistic regression, they might change the paradigm of image quantification. In addition to automating and speeding up the quantification, image processing methods can help to detect patterns that are undetectable by humans and, by representing early cerebral rearrangement, provide new biomarkers of TBI evolution, especially in the imaging modality of choice for TBI: CT.

Current Research Gaps
In the coming years, human segmentation and quantification of CT scans can probably be replaced by automated processing with ML algorithms, but their incorporation in clinical routine will depend on the explainability and interpretability of the models and their predictions. Progress must be made in mastering the theory of CNNs to evaluate their capacities and limitations. A preliminary step could be the development of saliency maps (49), representing the voxels that have a strong impact on the prediction. Each prediction from a CT scan could be presented along with its saliency map, in order to explain the algorithm's decision and potentially check its relevance, as already proposed by Kuo et al. (29).
Machine learning algorithms depend on the nature of their training dataset. In order to deploy tools usable on multiple sites, efforts should be made to reduce inter-machine and inter-protocol variability. Indeed, the different scanner models and acquisition parameters, such as reconstruction algorithms, output resolution, voltage, or number of detectors, influence the properties of the output images (50,51). While researchers have studied the harmonization of metrics extracted from images (52), few studies have attempted to reduce variability at the image level (53,54). Since deep learning algorithms might be able to learn inter-site variabilities and take them into account if trained on heterogeneous cohorts, efforts must then be made to establish large, multicentric cohorts covering the spectrum of all secondary injuries induced by TBI, as initiated by Maas et al. (55) or Flanders et al. (56).
As ML models on medical imaging usually follow the same pattern (metric extraction, feature selection, training, validation) and the same evaluation metrics (AUC, Acc, DSC, etc.), it seems easy to compare them. However, these metrics strongly depend on the methodology of model training and evaluation. Although each study has its specifics, one must pay attention to the quality of the dataset (number of patients, heterogeneity/homogeneity), the training and validation modalities (internal, external), and the evaluation metrics (DSC computed on all lesions or only on large lesions). An improvement in the robustness of the methodology, as well as in the availability of the mathematical models, will surely benefit the whole scientific community and help build stronger studies that can lead to potential breakthroughs in the clinical care of TBI patients.
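The idea of a saliency map can be illustrated with occlusion sensitivity, a simple model-agnostic technique that differs from the gradient-based maps usually computed for CNNs. In this sketch, each patch of the image is replaced by a neutral value and the resulting drop in the prediction is recorded. The `toy_predict` function is a hypothetical stand-in for a trained model, and the 4 × 4 image, patch size, and fill value are arbitrary.

```python
def occlusion_saliency(image, predict, patch=2, fill=0.0):
    """Map of the prediction drop observed when each patch is replaced by `fill`."""
    baseline = predict(image)
    rows, cols = len(image), len(image[0])
    saliency = [[0.0] * cols for _ in range(rows)]
    for r0 in range(0, rows, patch):
        for c0 in range(0, cols, patch):
            occluded = [row[:] for row in image]  # copy, so the input is untouched
            block = [
                (r, c)
                for r in range(r0, min(r0 + patch, rows))
                for c in range(c0, min(c0 + patch, cols))
            ]
            for r, c in block:
                occluded[r][c] = fill
            drop = baseline - predict(occluded)
            for r, c in block:
                saliency[r][c] = drop
    return saliency


def toy_predict(image):
    """Hypothetical stand-in for a trained model: fraction of pixels above 50 HU."""
    flat = [value for row in image for value in row]
    return sum(1 for value in flat if value > 50) / len(flat)


# Hypothetical 4 x 4 slice with a hyperdense 2 x 2 region in the lower right corner.
image = [
    [30, 30, 30, 30],
    [30, 30, 30, 30],
    [30, 30, 70, 70],
    [30, 30, 70, 70],
]

saliency = occlusion_saliency(image, toy_predict, patch=2, fill=30)
for row in saliency:
    print(row)
```

Only the patch covering the hyperdense region changes the toy prediction, so it is the only one with a non-zero saliency. Displayed next to the CT slice, such a map lets a reader check whether the voxels driving a prediction are clinically plausible.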

Potential Future Developments
Brain CT represents the brain's density at the acquisition time. CT scans are generally acquired at different times to follow cerebral rearrangement after injuries or surgery. To improve our understanding of cerebral rearrangement following heterogeneous secondary injuries, the establishment of longitudinal cohorts and their quantification is crucial to define different profiles of evolution (5). Such profiles could then be isolated, pointing to specific mechanisms and perhaps specific treatments.
From a clinician's point of view, the idea of combining clinical data and CT metrics is highly relevant. One can imagine combining many variables of different natures, like most of the models presented in section 2.2.1, but replacing manual quantification by the automated and deep quantification of CT images provided by modern ML algorithms. As evaluated by Yao et al. (30), we also believe that in the field of TBI, the localization of lesions has a large impact on the patient outcome. The development of methods to register structural atlases on distorted TBI CT scans and the incorporation of this spatial information into prediction models might strongly improve their performances.
For research purposes, one might soon be able to predict clinical data such as the neurological outcome or the level of medical care needed by a patient, as illustrated in Figure 1B. However, since no study has evaluated the relevance of AI algorithms in the quantification of complex images, such as mixed-type injuries, images altered by artifacts, or the coexistence of different lesion types such as tumors or white matter diseases, their use in complex cases remains challenging. One can then imagine that these algorithms will first be used in clinical routine for screening, triage, or indicative prediction in support of human readers.

Conclusion
AI algorithms have shown promising results in the biomedical field, especially in medical imaging. Their use on TBI CT scans, for classification and/or segmentation, is expanding, and they might become the reference methods in the coming years if some problematic but surmountable issues are addressed. Their integration in clinical routine depends on the confidence in their predictions, which might be increased by rigorous methodology.