Computational Analysis of Pathological Images Enables Interpretable Prediction of Microsatellite Instability

Background Microsatellite instability (MSI) is associated with several tumor types and has become increasingly vital in guiding patient treatment decisions; however, reliably distinguishing MSI from its counterpart is challenging in clinical practice. Methods In this study, interpretable pathological image analysis strategies are established to help medical experts identify MSI. The strategies require only ubiquitous hematoxylin and eosin–stained whole-slide images and perform well in three cohorts collected from The Cancer Genome Atlas. Equipped with machine learning and image processing techniques, intelligent models are established to diagnose MSI from pathological images, providing the rationale for each decision at both the image level and the pathological-feature level. Findings The strategies achieve two levels of interpretability. First, image-level interpretability is achieved by generating localization heat maps of important regions based on deep learning. Second, feature-level interpretability is attained through feature importance and pathological feature interaction analysis. Interestingly, at both levels of interpretability, color and texture characteristics, as well as their interaction, are shown to contribute most to the MSI prediction. Interpretation The developed transparent machine learning pipeline is able to detect MSI efficiently and provide comprehensive clinical insights to pathologists. The comprehensible heat maps and features in the intelligent pipeline reflect the extracellular and intracellular acid–base balance shift in MSI tumors.


Introduction
Microsatellite instability (MSI) is a condition of genetic hypermutability that results from impaired DNA mismatch repair. Cells with abnormally functioning mismatch repair are unable to correct errors that occur during DNA replication and consequently accumulate errors. MSI has been frequently observed in several types of cancer, most commonly in colorectal, endometrial, and gastric adenocarcinomas. [1] The clinical significance of MSI has been well described in colorectal cancer (CC), as patients with MSI-high colorectal tumors have been shown to have improved prognosis compared with those with microsatellite stable (MSS) tumors. [2] In 2017, the U.S. Food and Drug Administration approved anti-programmed cell death-1 immunotherapy for mismatch repair deficiency/MSI-high refractory or metastatic solid tumors, making the evaluation of DNA mismatch repair deficiency an important clinical task. However, in clinical practice, not every patient is tested for MSI, because testing requires additional polymerase chain reaction or immunohistochemical assays. [1] Thus, there is high demand for a cheap, effective, and convenient classifier to assist experts in distinguishing MSI from MSS.
Unfortunately, it is challenging to distinguish MSS from MSI by a medical expert's visual inspection of pathological images, since the morphology of MSS is similar to that of MSI. [3] The recent development of high-throughput whole-slide scanners has enabled effective and fast digitization of histological slides to generate whole-slide images (WSI).
More importantly, the thriving of machine learning (ML) methods in image processing makes this task accessible. In recent years, ML has been broadly deployed as a diagnostic tool in pathology. [4,5] For example, Osamu Iizuka et al. built convolutional neural networks (CNNs) and recurrent neural networks to classify WSI into adenocarcinoma, adenoma, and non-neoplastic tissue. [6] The study by Yaniv Bar demonstrated the efficacy of a computational pathology framework built on non-medical image databases by training a model for chest pathology identification. [7] Notably, J.N. Kather et al. showed that a deep residual network can predict MSI directly from H&E histology and reported that the network achieved desirable performance in both stomach adenocarcinoma (STAD) and CC. [8] These studies exhibit the great capacity of ML methods in medical research.
There is no doubt that the ML revolution has begun, but the deficiency of 'interpretability' in ML deserves particular concern. Here, 'interpretability' means that clinical experts and researchers can understand the logic of a decision or prediction output by an ML method. [9] Essentially, it urges ML systems to follow a fundamental tenet of medical ethics, namely the disclosure of basic yet meaningful details about medical treatment to patients. [10] Besides, interpretability helps clinicians stay aware that a model's decision may carry fairness issues arising from sampling bias in the training data. [11] In addition, for both scientific robustness and medical safety, interpretability allows researchers to know to what extent predictions can be altered by small systematic perturbations of the input data, such as those generated by measurement biases. Finally, interpretable ML models give clinical experts access to potentially crucial domain knowledge hidden within them. [9]
Unfortunately, to the best of our knowledge, most existing MSI diagnosis systems, especially deep-learning based ones, are noninterpretable. [14][15][16] In this study, we used hematoxylin and eosin (H&E)-stained WSI from The Cancer Genome Atlas (TCGA): 360 formalin-fixed paraffin-embedded (FFPE) samples of CC (TCGA-CC-DX), [17] 285 FFPE samples of STAD (TCGA-STAD), [18] and 385 snap-frozen samples of CC (TCGA-CC-KR). The H&E stained images in these databases have already been tessellated into 108020 (TCGA-STAD), 139147 (TCGA-CC-KR), and 182403 (TCGA-CC-DX) color-normalized tiles, [8] all of which contain only tumor tissue regions. The aims of this study are: (i) to build an image-based ML method for MSI classification and post-process the input image into a heat map so as to interpret the diagnosis of MSI at the image level; (ii) to design a fully transparent feature extraction pipeline and understand the pathological features' importance and interactions for predicting MSI by training a feature-based ML model.

A Deep Learning Classifier and Image-Level Visual Interpretability
The deep learning (DL) model recruited to classify MSI is a well-known end-to-end CNN, ResNet18. [19] To fit this model to different cancer subtypes, we trained three ResNet18 networks, each on 70% of the tiles randomly sampled from one of the three datasets. The remaining 30% of tiles in each dataset were used for testing. At test time, a patient's slide was predicted to be MSI if at least half of its tiles were predicted to be MSI. The patient-level accuracy and area under the curve (AUC) were 0.81 in the DX cohort, 0.80 in the STAD cohort, and 0.84 in the KR cohort on the test sets.
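The patient-level decision rule described above can be sketched as follows; the function and variable names are illustrative, not taken from the study's code:

```python
# Sketch of the patient-level decision rule: a slide is called MSI when
# at least half of its tiles are predicted MSI (1 = MSI, 0 = MSS).

def patient_level_call(tile_labels):
    """Return 'MSI' if >= 50% of tiles are predicted MSI, else 'MSS'."""
    msi_fraction = sum(tile_labels) / len(tile_labels)
    return "MSI" if msi_fraction >= 0.5 else "MSS"

# Hypothetical per-tile predictions for two patients.
tile_predictions = {
    "patient_A": [1, 1, 0, 1],   # 75% of tiles predicted MSI
    "patient_B": [0, 0, 1, 0],   # 25% of tiles predicted MSI
}
calls = {p: patient_level_call(t) for p, t in tile_predictions.items()}
```

Because tile counts vary greatly between patients, aggregating by fraction rather than by raw count keeps the rule comparable across slides.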
Based on the trained DL model, Gradient-weighted Class Activation Mapping (Grad-CAM) can make a convolution-based model more transparent by generating localization maps of the important regions. [20] To unveil the hidden logic behind the DL model and provide visual interpretability, we deployed Grad-CAM on our model to find out which parts of the H&E image support the DL classification. Two typical images for interpreting the DL prediction logic are shown (Figure 1.a). According to the pathologist's judgement, the highlighted regions in Figure 1.a tend to be where tumor tissue and immune cells concentrate, and the two regions differ visibly in texture and color features, which inspired us to collect detailed features along these directions to verify the validity of the important regions.
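The core Grad-CAM combination step can be sketched with NumPy alone: feature maps from the last convolutional layer are weighted by their channel-averaged gradients, summed, and passed through a ReLU. The shapes and inputs below are illustrative stand-ins, not the actual ResNet18 activations:

```python
import numpy as np

# Minimal sketch of the Grad-CAM combination step (not the full pipeline).

def grad_cam(feature_maps, gradients):
    """feature_maps, gradients: arrays of shape (H, W, C) from the last conv layer."""
    weights = gradients.mean(axis=(0, 1))           # one pooled weight per channel
    cam = np.tensordot(feature_maps, weights, axes=([2], [0]))  # weighted sum -> (H, W)
    cam = np.maximum(cam, 0)                        # ReLU keeps positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()                       # normalize to [0, 1] for display
    return cam

rng = np.random.default_rng(0)
fmaps = rng.standard_normal((7, 7, 4))              # stand-in feature maps
grads = rng.standard_normal((7, 7, 4))              # stand-in gradients
heatmap = grad_cam(fmaps, grads)
```

In practice the resulting low-resolution map is upsampled to the tile size and overlaid on the H&E image, as in Figure 1.a.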

Classification Model
The interesting information returned by Grad-CAM implies that important regions of the tumor tissue might be encoded by certain features of the H&E stained images, which propelled us to study the classification capability of feature-based ML. To this end, we built an automatic and transparent workflow (Figure 2.a) with five main steps. We describe the first three steps in this part.
Firstly, we performed several image pretreatments for the subsequent analysis. White balance and brightness adjustments were performed to normalize the tone of the H&E stain before extracting features from all images. Meanwhile, color deconvolution [21] was applied to separate the nuclei from the image for the next step. After the image pre-processing, we proceeded to visible pathological feature extraction. Motivated by the feedback from Grad-CAM, we focused on five H&E feature groups: global color features in the RGB and HSV channels, local color features from a Gaussian mixture model (GMM) [22] with three components, the numbers of infiltrating immune cells and tumor cells, the grading of differentiation, and Haralick texture features of tumor cells. Specifically, quantiles (25%, 50%, 70%) and moments (mean, variance, kurtosis, and skewness) in both RGB and HSV channels were recorded as global image features. In line with these, the moments of each cluster segmented by the GMM were considered local image features. To evaluate the grade of tumor differentiation, we used the circle Hough transform [23] in our workflow. In addition, we located the tumor cells and extracted their texture features via the QuPath software [24]. Typical feature extraction results are displayed in Figure 2.b. A total of 182 features were extracted from each image tile. The Wilcoxon rank sum test was applied to each feature, and most features are significantly different between MSI and MSS (Table S3).
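The global color features above (quantiles plus moments per channel) can be sketched as follows; the function name and the synthetic tile are illustrative assumptions, not the study's code:

```python
import numpy as np

# Sketch of global color features: quantiles (25/50/70%) and moments
# (mean, variance, skewness, kurtosis) of each RGB channel of a tile.

def channel_features(channel):
    """Quantiles and moments of a single color channel."""
    flat = channel.astype(float).ravel()
    mean, var = flat.mean(), flat.var()
    std = flat.std() if flat.std() > 0 else 1.0
    z = (flat - mean) / std                       # standardized values
    return {
        "q25": np.percentile(flat, 25),
        "q50": np.percentile(flat, 50),
        "q70": np.percentile(flat, 70),
        "mean": mean,
        "var": var,
        "skew": (z ** 3).mean(),                  # third standardized moment
        "kurtosis": (z ** 4).mean() - 3.0,        # excess kurtosis
    }

rng = np.random.default_rng(1)
tile = rng.integers(0, 256, size=(64, 64, 3))     # stand-in for an H&E tile
features = {f"{c}_{k}": v
            for i, c in enumerate("RGB")
            for k, v in channel_features(tile[:, :, i]).items()}
```

The same computation is repeated on the HSV channels, and on each GMM cluster for the local variants, which is how a tile accumulates its share of the 182 features.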
We then applied Random Forest (RF), [25] one of the most popular ML algorithms, to all three databases to classify MSI versus MSS on H&E stained histology slides. During training, 70% of the patients in every dataset were randomly selected and all of their tiles were used for training, while the remaining tiles were held out as test sets. In the test set of each dataset, the median MSS score (the proportion of trees in the forest voting MSS) of true MSS tiles is significantly different from that of MSI tiles (two-tailed P-values of 0.02, 0.0024, and 0.002 in the three datasets), which indicates that our models are meaningful in distinguishing MSI from MSS. Since one patient may have many different tiles, we obtained patient-level MSI scores by averaging the predictions over all of a patient's tiles. AUCs for MSI detection are 0.7 (95% CI 0.65-0.74) in the DX cohort, 0.74 (95% CI 0.65-0.79) in the STAD cohort, and 0.78 (95% CI 0.7-0.82) in the KR cohort (Figure 2.c). The decent performance of RF shows that visible pathological features contribute to MSI prediction.

Feature-Level Visual Interpretability: Feature Importance and Interactions
One of the attractive advantages of the RF method is that it can evaluate the importance of individual features. Therefore, we verified and quantified these features' power in distinguishing MSI from MSS by extracting information from the trained model. A common pattern can be discovered from the visualization of permutation-based feature importance [25,26] in each dataset (Figure 3). From the figure, we can deduce that the Haralick texture features play a dominant role. Moreover, the texture features, which reflect the average smoothness of the tumor cell surface in one tile, reveal that the tumor surface differs significantly between MSI and MSS.
Color features also make significant contributions. Among the global color features, the higher-order statistics (skewness and kurtosis) contribute more than the first-order statistics (mean and percentiles), indicating that some information useful for classification is hidden in high-order features. Local color features generated from the GMM also deserve attention. Whereas global color features reflect the global distribution, GMM features are regarded as local ones, since the GMM serves as an image segmentation that divides slices into different clusters, letting us obtain information within each cluster. As Figure 2.b shows, the clusters generated by the GMM correspond closely to tumor or non-tumor tissue, which may make sense from the clinical perspective. Besides, the number of infiltrating immune cells also matters, as expected, while the differentiation grade gives the smallest contribution in each dataset. Beyond the permutation method, we explored the contribution of these features with another feature importance criterion, mean minimal depth. [27] The importance ordering of features under this measure is highly consistent with the result from the permutation method (Figure S1), strengthening the common feature importance pattern we discovered.
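Permutation-based importance can be illustrated with a minimal sketch: shuffle one feature column at a time and measure the drop in accuracy. The threshold "model" below is a deliberately trivial stand-in for the trained random forest, chosen only to make the procedure visible:

```python
import numpy as np

# Minimal sketch of permutation-based feature importance.

def accuracy(model, X, y):
    return (model(X) == y).mean()

def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean accuracy drop when each feature column is shuffled."""
    rng = np.random.default_rng(seed)
    base = accuracy(model, X, y)
    importances = []
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])            # break the feature-label link
            drops.append(base - accuracy(model, Xp, y))
        importances.append(np.mean(drops))
    return np.array(importances)

def model(X):
    # Toy stand-in classifier: a perfect threshold rule on feature 0.
    return (X[:, 0] > 0).astype(int)

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 3))
y = (X[:, 0] > 0).astype(int)                # only feature 0 is informative
imps = permutation_importance(model, X, y)
```

Shuffling the informative feature destroys the accuracy, while shuffling the ignored features changes nothing, which is exactly the signal read off Figure 3.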
It is widely accepted that feature interactions (i.e., the joint effects of features) are nonnegligible in complex diseases. [28] Our feature-based RF models also allow us to exploit the pairwise feature interactions in MSI classification, and thus we can attain a clearer understanding of the characteristics of MSI tiles as well as the mechanism of the RF. Here, we use conditional minimal depth [27] to quantitatively assess feature interaction and then demonstrate the foremost 30 pairwise interactions (Figure 4). The features with the most significant interaction effects with other features in each dataset are: the skewness of the S channel in DX, the variance of the H channel in the second GMM component in KR, and the mean optical density of the tumor-cell nucleus in hematoxylin stain in STAD. These three features enhance the importance of the features they interact with, even when those features were not significant on their own. To concretely comprehend how paired features jointly help the prediction of our RF model, we plot the prediction values of typical feature interactions on a grid diagram (Figure S2). In DX, a higher value of the max caliper of tumor cells and a lower tumor cell count lead to a higher probability of an MSS prediction. In the KR dataset, a higher immune cell number and a lower value of the 75th percentile of the R channel lead to a higher probability of an MSS prediction. In STAD, a lower optical density range of the tumor-cell nuclei in hematoxylin stain and a higher Haralick correlation in eosin stain lead to a higher probability of an MSS prediction. In conclusion, the interaction effects are vital for MSI diagnosis and tend to occur more often between color features and texture features.

Conclusion and Discussions
To our knowledge, this is the first study to not only build a classification model for distinguishing MSI from MSS but also provide a detailed interpretability analysis and unveil the hidden logic behind the model. In this study, we followed a two-step interpretability framework: first, we built a high-performance DL network with visual explanation capacity as model-based interpretability; second, we further analyzed and confirmed the features' power using a feature-based interpretable model. The datasets in our study come from TCGA and comprise three cohorts, DX, KR, and STAD. The images we used had already been tessellated into small tiles and color-normalized.
To build a DL network with visual explanations, we trained residual learning CNNs and applied Grad-CAM to the final convolutional layer of each network to produce heatmaps that reflect the highly contributing regions. The networks achieve state-of-the-art performance. Notably, through the coarse localization maps of the important regions in the image, they provide pathologists with preliminary insight into highly contributing pathological features.
To further verify and quantify the related features' power in classification, we designed a pipeline that extracts the features likely to matter in these regions from the images, trains and evaluates a random forest classifier to distinguish MSS from MSI, and applies different measures to evaluate the contribution of each feature, so that we can supply interpretations for our model.
We extracted 182 features in total from the images; these features capture color locally and globally, the counts of immune cells and tumor cells, the differentiation degree of tumor tissue, and the Haralick texture features of tumor cells. Two different techniques, permutation and mean minimal depth, were adopted to study the contribution of features, and a common importance pattern was discovered. Besides, the interactions between features, based on the tree structures in the random forest, were also explored. The analysis of the features' contributions to classification and the interactions between them helps unveil the hidden logic in distinguishing MSS from MSI. To the best of our knowledge, this is the first study showing that the texture and color features of the H&E image and their interactions are crucial for the diagnosis of MSI. Previous studies investigating the pathologic predictors of microsatellite instability through feature extraction and logistic regression suffered from the limited learning capability of the model and small sample sizes, and thus could not achieve great performance. [29] Other works on CC indicated the significance of histological structures and inspired our feature extraction. [30,31] Another work on MSI classification focused on improving prediction accuracy by establishing a DL network but did not give a detailed description of the mechanism behind the model. [8] Moreover, we conjectured that the features playing a dominant role in the RF models also have significant importance in the DL model. To verify this speculation, we eliminated the RGB mean differences between the MSI and MSS groups and re-evaluated the AUCs of our DL models. More precisely, we calculated the RGB mean value over all tiles in both groups and shifted the RGB mean value of every tile to that population mean. We found that the AUCs decreased by 0.11, 0.12, and 0.14 in the DX, KR, and STAD datasets, respectively, which verifies that RGB features also contribute to the DL method.
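The perturbation experiment above can be sketched as follows: shift every tile's per-channel mean to the pooled population mean, removing the group-wise RGB mean difference. The function name and the two stand-in tiles are illustrative assumptions:

```python
import numpy as np

# Sketch of the RGB-mean equalization used in the perturbation experiment.

def equalize_rgb_means(tiles):
    """Shift each tile so its per-channel mean equals the population mean."""
    tiles = [t.astype(float) for t in tiles]
    pop_mean = np.mean([t.mean(axis=(0, 1)) for t in tiles], axis=0)  # shape (3,)
    adjusted = []
    for t in tiles:
        shift = pop_mean - t.mean(axis=(0, 1))     # per-channel offset
        adjusted.append(np.clip(t + shift, 0, 255))
    return adjusted, pop_mean

rng = np.random.default_rng(3)
msi_tile = rng.integers(60, 120, size=(32, 32, 3))   # darker stand-in tile
mss_tile = rng.integers(140, 200, size=(32, 32, 3))  # brighter stand-in tile
adjusted, pop_mean = equalize_rgb_means([msi_tile, mss_tile])
```

After this shift the DL model can no longer exploit mean-color differences, so any remaining AUC reflects other cues; the observed AUC drop is what ties the RF color features back to the DL model.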
One limitation of this study is that the cases in the TCGA datasets may not be an unbiased sample of the real-world situation, since pathologists may only upload representative ones. Although our models performed well on these histopathology images, we should admit that their performance in actual clinical settings requires further research. Another limitation is that our study focused only on H&E stained images, so we cannot confirm whether the patterns found here, especially the color features' contributions, hold in other types of histopathology slides. Classifier models for other slide types, e.g., immunohistochemically stained images, remain to be explored and established.
In summary, we developed ML models with decent power for the prediction of MSI. Moreover, our models produce visual heatmaps demonstrating high-contribution regions for MSI prediction in the H&E image, and we certified that certain pathological features have nontrivial importance in MSI classification, which was not explicitly studied in previous research. Therefore, our study not only facilitates MSI diagnosis based on H&E images but also sheds light on the understanding of MSI at both the image level and the feature level. As a byproduct of our study, a user-friendly and continually upgraded web application (http://14.215.135.56:3838/DL/) was developed for clinical researchers worldwide. All the images used in our models had already gone through tumor tissue detection and been tessellated into small tiles in J.N. Kather's work (https://zenodo.org/record/2530835 and https://doi.org/10.5281/zenodo.2532612). There are 108020 tiles in the TCGA-STAD cohort, 139147 in TCGA-CC-KR, and 182403 in TCGA-CC-DX. Color normalization had already been performed on every tile using the Macenko method, which converts all images to a reference color space.

Histopathology
The Details of Grad-CAM: Grad-CAM utilizes the gradient information flowing into the last convolutional layer of a CNN and generates a coarse localization map of the important regions in the image. We apply a rectified linear unit to the linear combination of feature maps so as to generate localization maps of the desired class. Grad-CAM visualization was implemented in Python 3.7 with TensorFlow-GPU 1.14.0 and Keras 2.3.0.
The Details of Image Pretreatment: We applied pretreatments to the tiles before feature extraction. First, to avoid the influence of color cast (the alteration of an object's natural tone when imaged under lighting of a different color temperature), white balance was performed on our cohorts. Since every tile has areas without cell organization, i.e., without H&E stain, we treated that part as the neutral reference for the adjustment. Besides color cast, overexposure and underexposure may also distort our features. Again taking the unstained area as the reference, we regulated all tiles to the same level of brightness. In addition, to obtain the locations of immune cells' nuclei, we performed color deconvolution, an algorithm used to separate the color space of immunohistochemical staining, on every tile in our datasets; the ImageJ software was used with its Color Deconvolution plugin. Finally, to extract the Haralick texture of tumor cells, we used the Positive Cell Detection plugin in QuPath to locate every tumor cell in each tile and used its batch processing to obtain the needed features.
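The white-balance step using the unstained region as a neutral reference can be sketched as follows; the brightness threshold and all names are illustrative assumptions rather than the study's actual parameters:

```python
import numpy as np

# Sketch of reference-based white balance: treat near-white (unstained)
# pixels as neutral and scale each channel so that region maps to white.

def white_balance(tile, white_thresh=200):
    """Scale RGB channels so the unstained region averages to pure white."""
    tile = tile.astype(float)
    mask = tile.min(axis=2) > white_thresh       # pixels likely unstained
    if not mask.any():                           # no neutral reference found
        return tile
    ref = tile[mask].mean(axis=0)                # average color of the reference
    gains = 255.0 / ref                          # per-channel correction gains
    return np.clip(tile * gains, 0, 255)

rng = np.random.default_rng(4)
tile = rng.integers(80, 160, size=(16, 16, 3)).astype(float)  # stained stand-in
tile[:4, :4] = [230.0, 220.0, 210.0]             # slightly warm unstained patch
balanced = white_balance(tile)
```

The brightness normalization described above works the same way, scaling intensity so the unstained reference reaches a fixed target level.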
The Details of Feature Extraction: In global color feature extraction, the region of interest (ROI) is the stained area. We recorded the mean value, quantiles (25%, 50%, 70%), and higher-order moments (variance, kurtosis, and skewness) within the ROI of each channel in RGB and HSV as our global features. Besides, with the GMM, we performed image segmentation on each tile to divide the ROI into three clusters and recorded the corresponding features in every cluster as our local features. We located immune cells' nuclei after color deconvolution according to their size and grayscale and took their count as a feature. As for the differentiation degree of tissue in the tiles, we performed dilation, erosion, and the Hough transform, a method to identify circle-like outlines in images, to decide the differentiation degree: the more regular circular shapes present, the more highly differentiated the tissue. Since we had recorded the tumor cells' locations, we extracted the Haralick features of each tumor cell in a tile and adopted the mean over all cells as the feature of that tile. Besides, we also recorded the tumor cell count as a feature.
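The GMM segmentation and per-cluster local features can be sketched with scikit-learn's `GaussianMixture`; the synthetic three-population "tile" below is an illustrative stand-in, and cluster indices are arbitrary:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Sketch of local color features: a 3-component GMM clusters the tile's
# pixels, then moments are computed per cluster.

rng = np.random.default_rng(5)
# Synthetic stand-in tile: three pixel populations with distinct colors.
pixels = np.vstack([
    rng.normal(60, 5, size=(100, 3)),     # dark cluster (e.g., nuclei)
    rng.normal(140, 5, size=(100, 3)),    # mid-tone cluster
    rng.normal(220, 5, size=(100, 3)),    # light cluster (background)
])

gmm = GaussianMixture(n_components=3, random_state=0)
labels = gmm.fit_predict(pixels)

# Per-cluster local features: mean and variance of each color channel.
local_features = {
    k: {"mean": pixels[labels == k].mean(axis=0),
        "var": pixels[labels == k].var(axis=0)}
    for k in range(3)
}
```

As in Figure 2.b, the lowest-intensity cluster tends to align with tumor tissue, which is why per-cluster moments carry localized information that the global moments blur together.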
The Random Forest Model for MSS and MSI Classification: Our RF method was built and tested using Python 3.7.1 with RandomForestClassifier in the sklearn.ensemble library. [32] During training, 70% of the patients in every dataset were randomly selected and all of their tiles were used for training, while the remaining tiles were held out as test sets. There were some anomalous tiles in each dataset, i.e., blurred or color-disordered, whose information was lost; we therefore discarded all of them in every dataset. In addition, we also deleted tiles with an extreme immune cell number (values significant at the 1% level), since an extremely small number may represent a non-tumor area while an extremely large number represents a lymphatic concentration area. In each forest, we set 500 trees in total and took Gini impurity as the criterion. The minimum number of samples to split an internal node was 2, and the other parameters followed the default settings. In all cases, training and test sets were split at the patient level, and no image tiles from test patients were present in any training set. The visualization of our model's results was realized with the plotnine library in Python, and AUCs were generated with the sklearn.metrics library. To further analyze feature importance and the interactions between features, we also used R 3.5.1 with the randomForest package [33] to rebuild the random forest, and analyzed and visualized the relations between different features with the randomForestExplainer package [34].
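The stated forest configuration (500 trees, Gini impurity, `min_samples_split=2`, other parameters at defaults) can be sketched as follows; the synthetic data stands in for the per-tile pathological feature vectors:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Sketch of the random forest configuration described in the text.
# Synthetic features stand in for the 182 per-tile pathological features.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=500,        # 500 trees per forest
    criterion="gini",        # Gini impurity as the split criterion
    min_samples_split=2,     # minimum samples to split an internal node
    random_state=0,
)
rf.fit(X, y)

# Tile-level MSI scores are the predicted class probabilities; the
# patient-level score then averages them over a patient's tiles.
tile_scores = rf.predict_proba(X)[:, 1]
```

Note that, as in the text, the real train/test split must be done at the patient level before fitting, so that no tiles from a test patient leak into training.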
Mean minimal depth is a measure that indicates the average depth of a feature's first occurrence in every tree. The importance ordering of features under it is highly consistent with the result from the permutation method. A pairwise interaction occurs when a split on one feature appears in the maximal subtrees with respect to another feature.

Figure S2. The visualization of typical pairwise features' interaction in each dataset.
The forest predicts on a grid of values for the components of each interaction. The two axes represent the two corresponding features. The prediction value ranges from 0 to 1, with color from blue to red: bluer indicates a larger probability of MSI, while redder tends toward MSS.
Table S3. The p-value of each feature in the three datasets under the Wilcoxon rank sum test. 1. The nomination rules follow those in Figure 3 of the main text. 2. The p-values are obtained from the Wilcoxon rank sum test, a nonparametric test of whether samples X and Y come from the same population. 3. When a p-value is so small that it exceeds the range of double-precision floating point, it is reported as 0.
The texture and color features of the H&E image and their interactions are crucial for the diagnosis of MSI. We believe our work can be deployed at medical centers to facilitate MSI diagnosis at low cost based on H&E stained histopathology slides. The localization map output by our DL models can help experts narrow their focus to specific regions of the whole H&E slide, thereby contributing to a more accurate diagnosis alongside the prediction results of our model. Besides, the feature distributions under our interpretable model can provide experts with more insight into analyzing MSI and MSS slices from a clinical perspective. Further, considering the similar feature distribution pattern across the three datasets we used, it is possible that, after running the same pipeline on MSI H&E slides of different cancer types, we can discover a generalizable pattern behind them. After training on a larger dataset, the accuracy of the identification and the interpretability could improve further, thereby contributing to accurate sample curation and treatment development for this aggressive cancer subtype.
Image Sources: The whole-slide H&E stained histopathology images were obtained from TCGA and comprise three cancer cohorts. Dataset DX consisted of 295 MSS patients and 65 MSI patients from FFPE samples of CC. Dataset KR contained 316 MSS patients and 72 MSI patients from snap-frozen samples of CC. Dataset STAD collected 225 MSS patients and 60 MSI patients from FFPE samples of stomach adenocarcinoma.

Figure 1 .
Figure 1. (a) The original tiles and the corresponding heatmaps output by Grad-CAM. (a1) and (a2) display tiles from the TCGA-CC-DX dataset labeled MSI and MSS, respectively. In the heatmaps, brighter regions contribute more to the classification; for instance, the red areas are the most highlighted, while the blue regions contribute little.

Figure 2 .
Figure 2. (a) The workflow for studying pathological features that discriminate MSI from MSS. Five main steps, pretreatments, feature extraction, model training, patient-level prediction, and feature contribution analysis, were executed sequentially to improve image quality, generate pathological features, build the statistical model, evaluate model performance, and measure the features' contributions, respectively. (b) Typical feature extraction results. (b1) Tumor cell detection before Haralick texture identification: the left figure is an original tile, while the right one shows the tumor identification; each red circle indicates the boundary of one tumor cell. (b2) GMM-based image segmentation: the left figure is a tile from the TCGA-CC-DX dataset, and its segmentation by the GMM method is shown on the right; the green part, whose grayscale is the lowest of the three, tends to be tumor tissue, while the blue and red parts represent non-tumor tissue. (b3) Infiltrating immune cell detection: immune cells are detected and counted via their connectivity domains. (b4) The grading of differentiation: circle-like arrangements in a slice are detected and the degree of differentiation is graded based on their count. (c) ROC curves and AUC values of the RF models for classifying MSI versus MSS in each dataset.

Figure 3 .
Figure 3. (a) The permutation-based variable importance for each dataset. The horizontal axis in each panel represents the degree of importance, and the vertical axis displays the corresponding variable names. Variable names follow these rules: 1. Abbreviations of statistics: kur = kurtosis, skew = skewness, var = variance. 2. Combined variable names: G-L2 var = the variance of the G channel in the second component of the GMM. 3. In texture features, numbers prefixed with 'F' (e.g., 'F10') index individual Haralick texture features. (b) The distribution of minimal depth for the top ten variables in each dataset. Minimal depth indicates the depth at which a feature first occurs in a tree. The color bars from 0 to 10 indicate the depth at which the feature first appears in each tree, and their lengths indicate the count of such trees. NA means that the feature does not appear in that tree. In the mean depth calculation, we fill such missing values by setting the minimal depth of a variable in a tree that does not use it for splitting equal to the mean depth of the trees.

Figure 4 .
Figure 4. The conditional minimal depth for each dataset. The color shade represents the number of occurrences of each interaction. The height of a bar indicates the conditional mean depth of that feature, and the height of the line dot indicates the unconditional mean depth. The gap between them reflects the degree of feature interaction.

Figure S1 .
Figure S1. The distribution of minimal depth for the top ten variables in each dataset. Minimal depth indicates the depth at which a feature first occurs in a tree. The color bars from 0 to 10 indicate the depth at which the feature first appears in each tree, and their lengths indicate the count of such trees. NA means that the feature does not appear in that tree. In the mean depth calculation, we fill such missing values by setting the minimal depth of a variable in a tree that does not use it for splitting equal to the mean depth of the trees.