The Clinical Value of Explainable Deep Learning for Diagnosing Fungal Keratitis Using in vivo Confocal Microscopy Images

Background: Artificial intelligence (AI) has great potential to detect fungal keratitis using in vivo confocal microscopy (IVCM) images, but its clinical value remains unclear. A major limitation of its clinical utility is the lack of explainability and interpretability. Methods: An explainable AI (XAI) system based on Gradient-weighted Class Activation Mapping (Grad-CAM) and Guided Grad-CAM was established. In this randomized controlled trial, nine ophthalmologists (three expert, three competent, and three novice ophthalmologists) read images under each of three conditions: unassisted, AI-assisted, and XAI-assisted. In the unassisted condition, only the original IVCM images were shown to the readers. AI assistance comprised a histogram of model prediction probability. For XAI assistance, explanatory maps were additionally shown. Accuracy, sensitivity, and specificity were calculated against an adjudicated reference standard, and the time spent was measured. Results: Both forms of algorithmic assistance significantly increased the accuracy and sensitivity of competent and novice ophthalmologists without reducing specificity. The improvement was more pronounced in the XAI-assisted condition than in the AI-assisted condition. Time spent with XAI assistance was not significantly different from that without assistance. Conclusion: AI has shown great promise in improving the accuracy of ophthalmologists. Inexperienced readers are more likely to benefit from the XAI system. With better interpretability and explainability, XAI assistance can boost ophthalmologist performance beyond what is achievable by the reader alone or with black-box AI assistance.


INTRODUCTION
Fungal keratitis (FK) is one of the most common causes of corneal blindness (1), but the diagnosis and treatment of this disease remain difficult (2,3). Corneal smears and cultures are the gold standard for diagnosing FK (4). However, culture results routinely take several days to become available. In vivo confocal microscopy (IVCM) is a useful method for the diagnosis of FK, allowing non-invasive, in vivo detection of even subtle changes in the living cornea (5,6). IVCM shows a variety of cellular changes in corneas affected by FK (7); foremost among these is the presence of hyphae, which is considered the specific manifestation of filamentous fungal infection (8,9). Correct and prompt identification of fungal hyphae in IVCM images helps to diagnose FK as early as possible and to optimize patient management (10). Manual analysis of IVCM images, however, is extremely labor-intensive and time-consuming, and depends heavily on observer experience (11).
Recent advances in deep learning (DL) promise to improve diagnostic accuracy, thereby improving the quality of patient care. In our previous studies, DL-based models were successfully developed to detect FK in IVCM images with high accuracy (12,13). However, the impact of these methods in clinical settings remains unclear. A major shortcoming in applying DL technology to artificial intelligence (AI)-assisted medical care is the inability to interpret the model decision. From a clinical perspective, interpretability and explainability are essential for gaining clinicians' trust, for establishing a robust decision-making system, and for overcoming regulatory issues. However, DL models conceal the rationale for their conclusions and therefore lack an understandable medical explanation to support their decision-making process, which seriously restricts their clinical application.
Explainable AI (XAI) is an important topic and an active research direction in medical AI (14). A straightforward and effective strategy is to generate meaningful heatmaps that visualize which pixel regions of an input image are important for the decision made by the DL model. Toward this objective, many approaches have been proposed for the explainable analysis of medical images, including dimension reduction, feature importance, attention mechanisms, knowledge distillation, and surrogate representations (15-22). Among these methods, Class Activation Mapping (CAM) offers a valid approach by performing global average pooling on the convolutional feature maps and mapping the weights of the classification output back to the convolutional layer (23). However, CAM requires altering the network architecture and re-training the network, which limits its application to different kinds of networks. Gradient-weighted Class Activation Mapping (Grad-CAM) is a generalization of CAM (24). Grad-CAM computes the neuron importance weights by performing global average pooling on gradients obtained via backpropagation, enabling the creation of class-discriminative visual explanations for much more complex networks. Building on this, XAI modules have been proposed to determine the most predictive lesion areas in computed tomography images (25). However, XAI approaches have not been validated in the analysis of IVCM images.
In this study, we developed an XAI-based system to diagnose FK using IVCM images and provided visual explanations based on the Grad-CAM and Guided Grad-CAM methods to highlight the pixel regions of the input image that are most relevant to the model decision. We compared the performance of ophthalmologists assisted by the black-box AI model and by the explainable system, and investigated the potential of the XAI-assisted strategy to help ophthalmologists identify the causative agent of corneal infection.

Study Design and Datasets
A total of 1,089 IVCM images collected from Guangxi Zhuang Autonomous Region People's Hospital were finally included in the testing set in this randomized controlled trial. The images were obtained from eyes diagnosed with fungal keratitis or bacterial keratitis in the Department of Ophthalmology between June 2020 and July 2021. All the infections were confirmed by culture or biopsy. Of the 1,089 images, 522 were collected from 17 eyes with fungal keratitis and were identified as hyphae-positive, and 567 were collected from 18 eyes with bacterial keratitis and were identified as hyphae-negative. The images were acquired following a standard operating procedure with IVCM (HRT III/RCM Heidelberg Engineering, Germany). All images were screened and the poor-quality images were excluded. This study was conducted in compliance with the Declaration of Helsinki and approved by the ethics committee of The People's Hospital of Guangxi Zhuang Autonomous Region. Informed consent was waived because of the retrospective nature of the study and anonymized usage of images.
All images were independently adjudicated by three corneal specialists with over 15 years of experience. Each image was classified as hyphae-positive or hyphae-negative. A reference standard for each image was generated when consistent diagnostic outcomes were achieved by the three specialists. None of the adjudicators were included as readers in this study.

Classification Model and Visual Explanation
The development of the DL-based diagnostic model used in this study is described in detail in our previous study (12). Briefly, the model was trained using the 101-layer residual network (ResNet-101) convolutional neural network architecture. The training set consisted of 2,088 IVCM images with reference standard labels agreed upon by a panel of corneal experts. Images were input at dimensions of 384 × 384. The classification model was trained to output the prediction probabilities of the negative and positive classes.
We used Grad-CAM and Guided Grad-CAM to generate explanation maps (Figure 1). Grad-CAM (23) produced a heatmap that highlighted the regions of the input image that were important for predicting hyphae. In this study, the last convolutional layer, which offered the best trade-off between high-level semantics and spatial information of the input images, was used to compute the weights. Let $y^p$ be the score for the class "hyphae-positive", and $A^k$ the feature map $k$ of the last convolutional layer. The gradient of $y^p$ with respect to $A^k$ was computed and averaged by global average pooling over all elements (indexed by width $i$ and height $j$) to produce a weight $\alpha_k^p$, as shown in Formula (1):
$$\alpha_k^p = \frac{1}{Z}\sum_{i}\sum_{j}\frac{\partial y^p}{\partial A_{ij}^k} \tag{1}$$
where $Z$ is the total number of elements in the feature map. The weight $\alpha_k^p$ represents the importance of the feature map $k$ for the positive class.
Next, the feature maps were combined in a weighted sum using these weights. The Grad-CAM heatmap was then generated by applying the ReLU (rectified linear unit) function so that only pixel regions contributing positively to the positive decision were highlighted, as shown in Formula (2):
$$L^p_{\text{Grad-CAM}} = \mathrm{ReLU}\left(\sum_{k}\alpha_k^p A^k\right) \tag{2}$$
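The two Grad-CAM steps described above (gradient pooling into per-map weights, then a ReLU-rectified weighted sum) can be illustrated with a minimal pure-Python sketch. It assumes that the feature maps and the gradients of the class score with respect to them have already been extracted from the network; the function name `grad_cam` and the toy values are hypothetical and are not taken from the authors' implementation.

```python
def grad_cam(feature_maps, gradients):
    """Compute Grad-CAM weights and heatmap from K feature maps.

    feature_maps[k][i][j] holds the activation of map k at position (i, j);
    gradients[k][i][j] holds the gradient of the class score with respect
    to that activation. Both are nested lists of shape K x H x W.
    """
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    z = height * width

    # Formula (1): global average pooling of the gradients, one weight per map.
    weights = [sum(sum(row) for row in grad) / z for grad in gradients]

    # Formula (2): weighted sum of the feature maps ...
    heatmap = [[0.0] * width for _ in range(height)]
    for alpha, fmap in zip(weights, feature_maps):
        for i in range(height):
            for j in range(width):
                heatmap[i][j] += alpha * fmap[i][j]
    # ... followed by ReLU, which keeps only positive evidence for the class.
    heatmap = [[max(0.0, v) for v in row] for row in heatmap]
    return weights, heatmap


# Toy example (hypothetical values): two 2 x 2 feature maps.
A = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 3.0], [1.0, 0.0]],
]
dY_dA = [
    [[0.5, 0.5], [0.5, 0.5]],      # pooled weight 0.5: supports the class
    [[-1.0, -1.0], [-1.0, -1.0]],  # pooled weight -1.0: argues against it
]
weights, heatmap = grad_cam(A, dY_dA)
```

In this toy case the weights are 0.5 and -1.0, and the regions driven by the second feature map are zeroed by the ReLU, leaving the heatmap `[[0.5, 0.0], [0.0, 1.0]]`.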
Although Grad-CAM localized class-discriminative image regions, the heatmaps contained no fine-grained pixel-space details. Therefore, we used Guided Grad-CAM to further highlight the stripes on the hyphae, which provided pixel-level explanations and helped readers to quickly identify the pathogen. Guided Grad-CAM is a combination of the Grad-CAM and Guided Backpropagation techniques (20). Guided Backpropagation visualizes the positive gradients by suppressing the negative gradients at ReLU layers. $L^p_{\text{Grad-CAM}}$ is upsampled to the input image resolution, and element-wise multiplication is performed to fuse Grad-CAM and Guided Backpropagation, thus generating high-resolution Guided Grad-CAM maps. The network architecture is shown in Figure 2.
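The fusion step can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions: nearest-neighbour upsampling is used for simplicity (the interpolation method used in the study is not stated), and the helper names `upsample_nearest` and `guided_grad_cam`, as well as the toy maps, are hypothetical.

```python
def upsample_nearest(heatmap, out_h, out_w):
    """Upsample a 2-D map to out_h x out_w by nearest-neighbour lookup."""
    in_h, in_w = len(heatmap), len(heatmap[0])
    return [
        [heatmap[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
        for i in range(out_h)
    ]


def guided_grad_cam(heatmap, guided_backprop):
    """Fuse a coarse Grad-CAM heatmap with a pixel-level guided-backprop map."""
    out_h, out_w = len(guided_backprop), len(guided_backprop[0])
    coarse = upsample_nearest(heatmap, out_h, out_w)
    # Element-wise product keeps fine detail only where Grad-CAM is active.
    return [
        [coarse[i][j] * guided_backprop[i][j] for j in range(out_w)]
        for i in range(out_h)
    ]


cam = [[0.5, 0.0], [0.0, 1.0]]                  # coarse 2 x 2 Grad-CAM heatmap
gbp = [[1.0, 2.0, 3.0, 4.0] for _ in range(4)]  # toy 4 x 4 guided-backprop map
fused = guided_grad_cam(cam, gbp)
```

Pixels that fall in regions where the coarse heatmap is zero are suppressed in `fused`, while pixel-level detail is preserved (scaled by the heatmap value) elsewhere.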

Ophthalmologist Evaluation
All IVCM images were assessed by nine ophthalmologist readers of varying expertise, as follows. The expert ophthalmologist group consisted of three professors with over 10 years of experience in diagnosing corneal diseases. The competent ophthalmologist group comprised three senior ophthalmologists who had over five years of experience in an ophthalmology department. The novice ophthalmologist group was composed of three junior ophthalmologists who were in the third year of standardized residency training in ophthalmology and had been formally trained in IVCM analysis.
Each image was assessed by each reader exactly once, in one of three conditions: unassisted, AI-assisted, or XAI-assisted. For each reader, the images were assigned equally to the conditions so that the same number of images was reviewed in each condition. The assignment of images to reading conditions was counterbalanced within each group so that each image was read in each condition by exactly one randomly assigned reader of that group; readings were thus distributed evenly across reader groups and reading conditions. The images were displayed in a random order. The AI classification results were displayed in both the AI-assisted and XAI-assisted conditions, in the form of histograms showing the model prediction probabilities of the positive and negative classes. The XAI-assisted condition included explanatory Grad-CAM and Guided Grad-CAM maps side by side, in addition to the classification histogram. A screenshot of each condition is shown in Figure 3. The participants were first presented with the original images, and then had to click on the "AI diagnosis" region to access the classification histograms and explanatory maps.
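One way to obtain a counterbalanced assignment of this kind is a Latin-square rotation, sketched below in Python. The paper does not describe the actual randomization procedure, so this scheme, the function name `assign_conditions`, and the seed are assumptions made purely for illustration.

```python
import random

CONDITIONS = ["unassisted", "AI-assisted", "XAI-assisted"]


def assign_conditions(image_ids, n_readers=3, seed=0):
    """Return {reader_index: {image_id: condition}} for one reader group.

    Images are shuffled and split into three equal blocks; reader r reads
    block b under condition (b + r) mod 3, a Latin-square rotation.
    """
    rng = random.Random(seed)
    shuffled = list(image_ids)
    rng.shuffle(shuffled)
    n_blocks = len(CONDITIONS)
    block_size = len(shuffled) // n_blocks
    assignment = {r: {} for r in range(n_readers)}
    for pos, image_id in enumerate(shuffled):
        # Any remainder images fall into the last block.
        block = min(pos // block_size, n_blocks - 1)
        for r in range(n_readers):
            assignment[r][image_id] = CONDITIONS[(block + r) % n_blocks]
    return assignment


# 1,089 test images, as in this study, for one group of three readers.
group = assign_conditions(range(1089))
```

With 1,089 images, each reader sees 363 images per condition, and every image is read under all three conditions within the group, each time by a different reader.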
Readers were masked to the etiology confirmation and reference standard before the reading process. Detailed instructions and guidelines were given to the readers prior to the reading trial. Readers were asked to make a judgment for each image ("Are there fungal hyphae?"). They were told that the classification histograms represented the probability of the AI prediction, and that the Grad-CAM and Guided Grad-CAM maps highlighted class-discriminative regions in the images.
Confusion matrices were recorded and the accuracy, sensitivity, and specificity were calculated accordingly. The time required for diagnosis was measured for each reader, but the readers were not informed that the reading time was being recorded.
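For reference, the three metrics follow directly from the confusion-matrix counts. The Python sketch below shows the standard definitions; the function name `reader_metrics` and the example labels are hypothetical and do not reproduce data from this study.

```python
def reader_metrics(labels, predictions):
    """Compute accuracy, sensitivity, and specificity from binary labels.

    labels and predictions use 1 for hyphae-positive and 0 for hyphae-negative.
    """
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # 1 - false positive rate
    }


# Hypothetical reads of ten images against the reference standard.
reference = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
reads = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = reader_metrics(reference, reads)
```

In this example one positive image is missed and one negative image is over-called, giving a sensitivity of 3/4, a specificity of 5/6, and an accuracy of 8/10.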

Statistical Analysis
Data were analyzed using SPSS (SPSS Version 11.0, IBM-SPSS Inc., Chicago, IL, USA). The area under the receiver-operating characteristic curve (AUC) of the DL-based model was calculated and compared with the chance level (AUC = 0.5). P < 0.05 was considered statistically significant. Comparisons were made using repeated-measures analyses of variance (ANOVAs). The Bonferroni post hoc test was used to correct for multiple comparisons, with the significance level set at 0.05/N, where N is the number of tests performed.
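The Bonferroni threshold of 0.05/N can be illustrated with a short Python sketch. The analysis in this study was run in SPSS; the function name `bonferroni` and the p-values below are hypothetical and serve only to show how the corrected threshold is applied.

```python
def bonferroni(p_values, alpha=0.05):
    """Return the corrected threshold alpha/N and a significance flag per test."""
    threshold = alpha / len(p_values)
    return threshold, [p < threshold for p in p_values]


# Hypothetical p-values from three pairwise comparisons of reading conditions.
p_values = [0.001, 0.020, 0.040]
threshold, significant = bonferroni(p_values)
```

With three tests the threshold is 0.05/3 (about 0.0167), so only the first comparison remains significant after correction, although all three p-values are below the uncorrected 0.05.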

Model Performance
The receiver-operating characteristic (ROC) curve of the DL-based model is shown in
An overview of sensitivity (true positive rate) and specificity (1 - false positive rate) is shown in Figure 4. In general, performance in the XAI-assisted condition was better than that in the AI-assisted condition, and both were better than that without assistance. The influence of reading conditions was more prominent on sensitivity than on specificity (Figure 5). For the expert ophthalmologist group, there was no statistically significant difference in sensitivity among the reading conditions. For the competent and novice groups, the sensitivity for both forms of assistance exceeded that of unassisted reads (competent ophthalmologist 0.887, 95% CI 0.852-0.922; novice ophthalmologist 0.764, 95% CI 0.736-0.793), with the XAI-assisted sensitivity (competent ophthalmologist 0.927, 95% CI 0.891-0.964, p < 0.001; novice ophthalmologist 0.891,

Efficiency Evaluation
On average, novice ophthalmologists spent more time per image with AI assistance than without assistance (P = 0.040). For novice ophthalmologists, the time spent with XAI assistance was significantly less than that with AI assistance (P = 0.045). Although the time spent with XAI assistance tended to be higher than that without assistance, the difference was not statistically significant (P = 0.092). The same trends were observed for competent and expert ophthalmologists, but the differences were not statistically significant (Figure 6).

DISCUSSION
This study evaluated the impact of visually explainable AI on IVCM image analysis. The results showed that AI and XAI helped improve reading accuracy, and the effect was more pronounced for inexperienced ophthalmologists than for experienced ones. AI assistance increased sensitivity, but not at the expense of specificity. The addition of explanatory maps further amplified the positive effect. Although AI assistance prolonged the average time per image, the addition of explanatory maps reduced this prolongation. We noted that reading without assistance was generally high in specificity but low in sensitivity, which implied that readers tended to judge ambiguous images as "negative" rather than "positive". Missed diagnoses could easily occur in such situations. AI assistance reduced the number of false negatives by helping readers correctly recognize the true positives, without a corresponding increase in false-positive errors. A possible reason is that a positive model prediction might draw readers' attention to occult lesions that would otherwise be easily overlooked.
Explanatory maps contributed to improving both sensitivity and specificity. For the true positive samples that were correctly predicted by the model (TP), the explanatory maps highlighted the morphology and location of fungal hyphae, providing an interpretable and explainable basis of AI decision-making, thus improving the user's trust in the model's prediction. For the true negative samples that were incorrectly predicted by the model (FP), the explanatory maps displayed meaningless spatial information, which helped the user to identify the model  In the present study, Grad-CAM and Guided Grad-CAM help ophthalmologists determine whether the model results are credible by highlighting the important regions that lead to model decision. Recent studies have proposed uncertainty measurement as an efficient method for the evaluation of model confidence (26). Uncertainty measurement provides an estimation of pixelwise uncertainties for image segmentation results, which enables an easy decision to accept or reject the model outcome based on a certain uncertainty level. This provides new ideas for the research of XAI and we are going to explore this approach in IVCM image analysis in our future research.
The impact of AI and XAI on accuracy varied with the degree of clinical experience. Inexperienced readers may be more likely to benefit from the XAI system. With XAI assistance, the accuracy of novice ophthalmologists increased to approximately the competent level, and the accuracy of competent ophthalmologists came close to that of experts. Although IVCM is of great help in diagnosing corneal diseases, it is far from universally available. One reason is the scarcity of ophthalmologists trained to read the images. The explainable system is well suited to teaching IVCM analysis skills. The validated model can provide timely monitoring and feedback to inexperienced readers, and the explanatory maps can help them quickly identify the important features as a basis for judgment, thus helping to reduce study time and the corresponding costs.
While AI assistance prolonged the mean reading time per image, the increase was within an acceptable range considering the contribution of AI to the improvement in accuracy. The increase is understandable because the classification histograms and explanatory maps were shown after the original images, and readers needed more time to reflect on divergent results. The explanatory maps visualized the basis of the AI diagnosis and thereby reduced hesitation time. It should be mentioned that the readers were not instructed to complete the reading task as fast as possible and were not informed that the reading time would be used as an auxiliary evaluation index, which might have influenced the results. Although mean time per image increased, AI still has high potential for improving efficiency in the clinical setting, given that it may help to rapidly screen out hyphae-positive images from the hundreds of images consecutively collected from each eye. The effect of AI on reading efficiency deserves further study.
The proposed algorithmic framework is general and can be extended to other pathogens such as Acanthamoeba and Candida. IVCM images show Acanthamoeba cysts and trophozoites in Acanthamoeba keratitis (27), and spores and pseudohyphae in Candida keratitis (28). A deep learning model can learn either these features or novel features to predict Acanthamoeba and Candida keratitis.
This study had several limitations. First, this was a single-center study with a limited number of readers, so selection bias was inevitable. A multicenter study with more participants is needed in the future to generate more robust results. Second, we included a balanced number of positives and negatives at a ratio of approximately 1:1, so the percentage of positive images in this study differed from that in a real-world setting. Differences in prevalence might significantly affect sensitivity and specificity between datasets, thereby reducing the generalizability of our findings. Third, IVCM images were presented in random order in this study, whereas in actual clinical settings they are viewed in scanning order. Randomization of the order excluded the contextual information of adjacent images, so the performance of ophthalmologists could have been underestimated. Therefore, additional studies are required to validate the role of AI in the analysis of image sequences. Finally, this study did not compare the sensitivity of microbiological tests and IVCM. In this study, the microbiological tests were used as the gold standard. All positive images were collected from eyes with a positive smear/culture for filamentous fungi, and corneal specialists adjudicated that hyphal structures were present in the observation field. All negative images were collected from eyes with a negative smear/culture for filamentous fungi and were confirmed to contain no hyphae. Therefore, cases with a negative fungal smear/culture and positive hyphae findings in IVCM were not included in the study, and the question of whether patients with negative microbiological tests could benefit from IVCM with DL was not addressed. Despite this, the study provides an important framework for future research.
Further studies will incorporate hyphae-positive images with negative microbiological results to evaluate AI-assisted IVCM as a complement to microbiological tests.

CONCLUSION
AI has shown great promise in improving the accuracy of ophthalmologists in terms of FK detection using IVCM images. Inexperienced readers are more likely to benefit from the XAI system. With better interpretability and explainability, XAI assistance can boost ophthalmologist performance beyond what is achievable by the reader alone or with black-box AI assistance. The present study extends our understanding of the role of AI in medical image analysis.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author/s.

ETHICS STATEMENT
This study was approved by the Ethics Committee of The People's Hospital of Guangxi Zhuang Autonomous Region, China. The approval number is KY-SY-2020-1. Informed consent was waived because of the anonymized usage of images. No potentially identifiable human images or data is presented in this study.

AUTHOR CONTRIBUTIONS
FX conceived the research and wrote the manuscript. LJ, WH, and GH performed the analyses. YQ, RL, and XP contributed to algorithm optimization. YH, FT, JL, and YL contributed to data collection and measurements. SZ and ML were involved in quality management. QC and NT provided overall supervision, edited the manuscript, and undertook the responsibility of submitting the manuscript for publication. All authors read and approved the final manuscript.