Transferability of radiomic signatures from experimental to human interstitial lung disease

Background: Interstitial lung disease (ILD) defines a group of parenchymal lung disorders, characterized by fibrosis as their common final pathophysiological stage. To improve diagnosis and treatment of ILD, there is a need for repetitive non-invasive characterization of lung tissue by quantitative parameters. In this study, we investigated whether CT image patterns found in mice with bleomycin-induced lung fibrosis can be translated as prognostic factors to human patients diagnosed with ILD.
Methods: Bleomycin was used to induce lung fibrosis in mice (n_control = 36, n_experimental = 55). The patient cohort consisted of 98 systemic sclerosis (SSc) patients (n_ILD = 65). Radiomic features (n_histogram = 17, n_texture = 137) were extracted from microCT (mice) and HRCT (patients) images. Predictive performance of the models was evaluated with the area under the receiver-operating characteristic curve (AUC). First, predictive performance of individual features was examined and compared between murine and patient data sets. Second, multivariate models predicting ILD were trained on murine data and tested on patient data. Additionally, the models were reoptimized on patient data to reduce the influence of the domain shift on the performance scores.
Results: Predictive power of individual features in terms of AUC was highly correlated between mice and patients (r = 0.86). A model based only on mean image intensity in the lung scored AUC = 0.921 ± 0.048 in mice and AUC = 0.774 (CI95% 0.677-0.859) in patients. The best radiomic model based on three radiomic features scored AUC = 0.994 ± 0.013 in mice and validated with AUC = 0.832 (CI95% 0.745-0.907) in patients. However, reoptimization of the model weights in the patient cohort increased the model’s performance to AUC = 0.912 ± 0.058.
Conclusion: Radiomic signatures of experimental ILD derived from microCT scans translated to HRCT of humans with SSc-ILD. We showed that the experimental model of BLM-induced ILD is a promising system to test radiomic models for later application and validation in human cohorts.


Introduction
Interstitial lung disease (ILD) defines a group of chronic, etiologically different parenchymal lung disorders, characterized by fibrosis as their common final pathophysiological stage. The prognosis of the most prevalent and severe subtypes, idiopathic pulmonary fibrosis (IPF) and ILD associated with the autoimmune disease systemic sclerosis (SSc), is as poor as that of untreated oncologic diseases (1,2). Globally, non-malignant lung diseases including ILD rank third on the mortality scale (3).
Experimental models of fibrosing ILD are paramount for the identification of cellular and molecular key drivers of disease and as preclinical test systems for novel targeted drugs (4). The preferred and best characterized preclinical model of ILD is the murine model of bleomycin-induced lung fibrosis, which reflects important features of human ILD such as apoptosis of epithelial cells, influx of inflammatory cells into the interstitium, followed by activation of fibroblasts with increased deposition of extracellular matrix (ECM) proteins (5,6).
Conventional endpoint measures of lung fibrosis involve histological and biochemical analyses, which, however, have certain disadvantages. To recapitulate the dynamic process of fibrosing ILD at multiple time points and to account for the high interindividual variability, large numbers of animals are required to reach significant statistical power (7). Additionally, lung biopsies are only rarely performed in human ILD (8,9) and a biopsy may not be representative of the whole lung pathology. Upcoming alternative outcome measures for translational ILD research include imaging methodologies. An integral part of the routine clinical management is medical imaging, particularly high-resolution computed tomography (HRCT), which allows non-invasive, highly sensitive, time- and spatially resolved visualization of changes in the entire lung (10) and a correlative estimation of lung function (11). Similarly, in preclinical models of ILD, small animal microCT is increasingly recognized as a valuable assessment tool (4,7). In the model of bleomycin-induced experimental ILD, the relative comparability of both imaging and molecular changes with human ILD (5,12-15) supports its suitability for translational ILD research.
The need for innovative, directly transferable, and readily applicable readouts in ILD has prompted the herein presented translational study on the potential value of the model of bleomycin-induced lung fibrosis as an experimental "radiomic toolbox" for human ILD. Radiomics is a powerful strategy for in-depth analysis of pathologic tissue phenotypes by computational extraction of quantitative imaging features from medical images (16,17). Radiomic features provide objective information on tissue shape, intensity, and texture on a molecular scale as demonstrated by studies on tumor biology showing correlation with tissue-based genomics and proteomics data (18-21). As image-derived tissue surrogates, their potential use as virtual biopsies could make radiomics analyses an ideal tool for clinical decision support in ILD, especially since radiomic features have also been shown to predict disease outcome and response to therapy (18,19,22-25). However, compared with oncology (18,20-22), research into the potential of radiomics in nonmalignant lung diseases is limited (26-30).
Nevertheless, the available literature on human lung pathologies, including chronic obstructive pulmonary disease, radiation-induced pneumonitis and connective tissue disease-related ILD, showed that texture-based analysis of CT images can be superior to visual or histogram-based measures for diagnosis (28,31,32). Few studies investigated the use of radiomics in experimental settings. Eresen et al. used MRI radiomics for prediction of response to vaccine therapy in a mouse model of pancreatic ductal adenocarcinoma (33,34). Nunez et al. analyzed suitability of MRI radiomics for diagnosis of preclinical GL261 glioblastoma (35). Other researchers focused on radiomic-based prediction of liver metastases or liver fibrosis in mice (36,37).
To date no study has shown the value of animal models in radiomics research. We are not aware of any studies reporting transferability of radiomic patterns from experimental model to clinical setting. Establishing a link between preclinical and clinical radiomic patterns could enormously facilitate testing a vast range of hypotheses in an experimental setting. Such a link is currently missing. In this analysis, we evaluate if radiomic features and models can be translated from experimental to human ILD.

Study design and data sets
Details of the study design and data sets are shown in Figure 1. In short, we investigated whether radiomic patterns indicative of ILD in mice were also present in human disease.
The preclinical model of bleomycin (BLM)-induced lung fibrosis was used to mimic human ILD. The experimental cohort consisted of 91 8-week-old female mice (C57BL/6J-rj, Janvier Labs). ILD was induced in 55 mice via intratracheal instillation of bleomycin (2 U/kg; Baxter 15,000 I.U.) as described in (6,14,38). The 36 control animals received equivalent volumes of 0.9% NaCl solution. Mice were randomized into the different experimental groups and instillation was performed blinded. Pulmonary microCT scans were performed at different days (days 3, 7, 14, 21, 28, and 35) after bleomycin instillation to reflect different disease stages. Different mice were scanned at every time point as the animals were euthanized after image acquisition. Scanning mice at different time points after fibrosis induction did not serve a particular purpose in this work. This design was chosen because the same experimental data were also used in other studies that examined temporal aspects of fibrotic development.
A cohort of 98 SSc patients being followed at the Department of Rheumatology, University Hospital Zurich represented the validation data set. All included patients met the following criteria: diagnosis of SSc according to the Very Early Diagnosis of Systemic Sclerosis (VEDOSS) (39) or the 2013 American College of Rheumatology/European League against Rheumatism (ACR/EULAR) classification criteria (40), and availability of an HRCT scan. Patient characteristics are provided in Table 1.
The extent of lung fibrosis was defined as presence of reticular changes or honeycombing within whole lung volume (Figure 2). All visual analyses were performed by a senior radiologist (TF) using a standard picture archiving and communication system workstation (Impax, Version 6.5.5.1033; Agfa-Gevaert) and a high-definition liquid crystal display monitor (BARCO; Medical Imaging Systems).
FIGURE 1 Study design. The mice data set (n = 91) was used to discover radiomic patterns predictive of ILD. The discovered patterns were tested in the human validation data set (n = 98). 55 mice were given bleomycin to induce ILD, whereas 36 mice were given NaCl and served as the control group. The mice were euthanized at day 3, 7, 14, 21, 28, and 35 and scanned with a microCT scanner. Afterward, classification models were trained to predict occurrence of ILD based on images acquired from the scanner. The 98 patients from the validation data set were retrospectively collected. All patients were scanned with HRCT and graded according to the Goh scale of pulmonary fibrosis. The radiomic models built using mice data were tested in patients.

Imaging and extraction of radiomic features
Pulmonary microCT scans were acquired in free-breathing mice with prospective respiratory gating using a Bruker SkyScan 1176. The following scan parameters were used: tube voltage 50 kV, tube current 500 µA, filter Al 0.5 mm, averaging (frames) 3, rotation step 0.7 degrees, sync with event 50 ms, X-ray tube rotation 360 degrees, resolution 35 µm, and slice thickness 35 µm. Images were reconstructed with NRecon reconstruction software (v.1.7.4.6; Bruker) using the built-in filtered back projection Feldkamp algorithm and applying misalignment compensation, ring artifact reduction, and a beam hardening correction of 10% to the images.
The contouring of whole lungs was performed manually in mice and semi-automatically in patients (region growing algorithm followed by manual correction) by two experienced examiners (JS and MB). Left and right lungs were contoured independently and then both contours were merged to generate a single contour including both lungs.
Feature extraction from CT images was performed with Z-Rad, an IBSI-compliant (41), in-house developed Python software. CT scans of mice and patients were interpolated to an isotropic resolution of 0.15 mm and 2.75 mm, respectively. The interpolation resolutions were chosen to achieve a similar ratio of voxel size to average lung volume in mice and patients. The region of interest (ROI) for feature extraction was defined as the right and the left lung considered as a single organ.
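As a rough check on the two interpolation grids, the linear and volumetric scale factors between them can be computed directly. The sketch below uses only the two voxel sizes stated above; the function and variable names are illustrative, and no lung volumes are assumed.

```python
# Isotropic voxel edge lengths after interpolation (values taken from the text).
MOUSE_VOXEL_MM = 0.15
PATIENT_VOXEL_MM = 2.75


def scale_factors(small_mm, large_mm):
    """Return the (linear, volumetric) ratio between two isotropic voxel sizes."""
    linear = large_mm / small_mm
    return linear, linear ** 3


linear, volumetric = scale_factors(MOUSE_VOXEL_MM, PATIENT_VOXEL_MM)
print(f"linear ratio: {linear:.2f}, volume ratio: {volumetric:.0f}")
```

A patient voxel is therefore about 18 times longer per edge and about 6,000 times larger in volume than a mouse voxel, so the voxel-to-lung ratio matches only if the average lung volumes differ by a comparable factor.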
Only intensity values within the range from −1,000 HU to 200 HU were considered. We used a fixed bin size of 50 HU. The radiomic features describing image intensity (histogram, n = 17) and texture (n = 137) were extracted for each mouse and patient. The texture features were based on gray level co-occurrence matrix (GLCM, n = 26), gray level run length matrix (GLRLM, n = 16), gray level distance zone matrix (GLDZM, n = 16), gray level size zone matrix (GLSZM, n = 16), neighboring gray level dependence matrix (NGLDM, n = 16), and neighborhood gray tone difference matrix (NGTDM, n = 5) to capture a wide variety of intensity patterns. Additionally, GLCM and GLRLM features were extracted with two different feature aggregation methods, with and without merging, which gives 2 × (26 + 16) + 16 + 16 + 16 + 5 = 137 texture features. In total, 154 features were extracted. The list of radiomic features is provided in the supplement.
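The intensity restriction and fixed-bin-size discretization described above can be sketched as follows. This is a minimal illustration, not the Z-Rad implementation: it assumes the common fixed-bin-size rule (bin = floor((HU − minimum) / width) + 1) with the minimum anchored at −1,000 HU, and the function name `discretize` is hypothetical.

```python
import math

# Range and bin width as stated in the text.
HU_MIN, HU_MAX, BIN_WIDTH = -1000.0, 200.0, 50.0


def discretize(hu_values, hu_min=HU_MIN, hu_max=HU_MAX, bin_width=BIN_WIDTH):
    """Drop voxels outside [hu_min, hu_max] and map the rest to 1-based bins."""
    bins = []
    for hu in hu_values:
        if hu_min <= hu <= hu_max:
            # Assumed fixed-bin-size rule; how Z-Rad treats the upper edge is not stated.
            bins.append(math.floor((hu - hu_min) / bin_width) + 1)
    return bins


# Toy voxel values: the first and last fall outside the range and are ignored.
print(discretize([-1200, -1000, -951, -950, 0, 200, 300]))
```

Under this rule, texture matrices are built on the bin indices rather than on raw HU values, which keeps gray-level counts comparable between microCT and HRCT.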

Statistical analysis
For every radiomic feature, robustness against intra- and interobserver variability was examined. This was done by estimating the corresponding intraclass correlation coefficients (ICC). Specifically, we used consistency of ICC (1, 3) according to the Shrout and Fleiss naming convention (42). Features with ICC ≥0.75 for intra- and interobserver settings in both mice and humans were considered stable and were retained. The rest of the features were excluded from further analysis.
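To make the stability criterion concrete, the sketch below computes a consistency-type ICC from a small table of feature values (one row per case, one column per delineation) and applies the 0.75 threshold. It assumes the standard two-way, single-measure consistency formula, (MS_rows − MS_error) / (MS_rows + (k − 1) MS_error); the exact ICC variant and software used in the study may differ, and the function names are hypothetical.

```python
def icc_consistency(ratings):
    """Two-way, single-measure consistency ICC (assumed variant).

    ratings: list of rows, one row per case, one column per delineation.
    """
    n = len(ratings)     # number of cases
    k = len(ratings[0])  # number of delineations per case
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)


def is_stable(icc_intra, icc_inter, threshold=0.75):
    """A feature is retained only if both ICC values reach the threshold."""
    return icc_intra >= threshold and icc_inter >= threshold


# Toy feature values for three cases delineated twice (not study data).
toy = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]]
print(icc_consistency(toy), is_stable(0.9, 0.7))
```

Because the consistency form ignores a constant offset between delineations, a feature that one observer systematically rates higher can still count as stable.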
Univariate predictive power of the radiomic features was evaluated by estimation of the area under the receiver operating characteristic curve (AUC). To facilitate comparison of the AUC values between mice and patient data sets, we adopted a convention that AUC is equal to the probability that a radiomic feature value of a randomly chosen subject from the positive group is greater than the value of a randomly chosen subject from the negative group. This allowed us to distinguish between features that were characterized by comparable predictive power but a different direction of the effect, for example, AUC = 0.3 in mice and AUC = 0.7 in patients. The linear association of the AUC scores between mice and patient groups was evaluated with the Pearson correlation coefficient.
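The AUC convention and the subsequent correlation step can be illustrated with a short sketch. It computes the AUC as a pairwise probability, so values below 0.5 retain the direction of the effect, and then a Pearson coefficient across per-feature AUC values. All numbers are toy values, the function names are hypothetical, and counting ties as 0.5 is an assumption not stated in the text.

```python
import math


def auc_probability(positive, negative):
    """P(value from positive group > value from negative group); ties count 0.5."""
    wins = 0.0
    for p in positive:
        for q in negative:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positive) * len(negative))


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# One toy feature: higher in the positive group in one cohort, lower in the other.
print(auc_probability([2, 4], [1, 3]), auc_probability([1, 3], [2, 4]))

# Toy per-feature AUC values in two cohorts, compared with Pearson's r.
print(pearson_r([0.95, 0.80, 0.30], [0.85, 0.70, 0.40]))
```

With this convention, a feature scoring 0.25 in one cohort and 0.75 in the other is equally discriminative in both but points in opposite directions, which lowers the correlation instead of hiding the disagreement.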
Three model architectures were considered for evaluation of model transferability from mice to patients: (1) a model based on mean image intensity (MEAN), (2) a model based on first four moments of intensity distribution (mean, standard deviation, skewness, and kurtosis; MSSK), and (3) a machine learning model based on logistic regression (ML). While the first two models are based on predefined radiomic features, the machine learning model employed embedded feature selection methods. All models were built on the mice data and were validated in the patient data. Feature selection and model tuning were performed within 4-times repeated 5-fold cross-validation. The first step of the feature selection procedure was dimensionality reduction by removing features that were highly linearly correlated (Pearson's r). The correlation threshold was one of the tunable hyperparameters. The second step of feature selection was fitting a model and selecting the most important features from this model, which were then fed to the final classifier. In the case of a logistic regression model, the feature selection was realized with another logistic regression. In the case of the extra-trees model, the most important features were extracted from a gradient tree-boosting model. The number of extracted features in both cases was one of the tunable hyperparameters. For model tuning, we used 500 randomized hyperparameter samples. The optimized models were validated in patients. Additionally, the models were re-optimized in patients to evaluate transferability and predictive power of the discovered radiomic signatures rather than the models themselves. Furthermore, this reduced the influence of covariate shift between the data sets.
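The first feature selection step, removal of highly correlated features, can be sketched as below. The text does not say which member of a correlated pair was dropped, so the greedy rule used here (keep a feature only if its absolute Pearson correlation with every already kept feature is at or below the threshold) is an assumption, as are the function names and the toy feature values.

```python
import math


def pearson_r(x, y):
    """Pearson correlation; raises ZeroDivisionError for constant features."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def drop_correlated(features, threshold):
    """Greedy filter over a dict of feature name -> values (one per subject)."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson_r(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy data for four subjects (not study data); "rms" nearly duplicates "mean".
toy_features = {
    "mean": [1.0, 2.0, 3.0, 4.0],
    "rms": [2.0, 4.0, 6.0, 8.1],
    "kurtosis": [3.0, 1.0, 4.0, 2.0],
}
print(drop_correlated(toy_features, threshold=0.9))
```

In the study this filter sat inside the cross-validation loop with the threshold treated as a hyperparameter, so the retained feature set could differ from fold to fold.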

Influence of intra- and interobserver delineation variability on radiomic features
Intra- and interobserver delineation variability were evaluated separately in mice and patient data sets using 15 randomly selected cases per data set. Intraobserver variability was assessed based on delineations done by JS. Interobserver variability was assessed based on delineations provided by JS, CB, and MBr. Figure 3 shows the proportion of the unstable features per feature class. In mice, 7 features from the initial set of 154 were considered unstable (ICC < 0.75) and were excluded from the further analysis. In patients, all features were stable (ICC ≥ 0.75) so no further features were excluded.

Discriminative power of radiomic features is highly correlated between mice and patient data
The next steps in our analysis were the investigation of univariate discriminative power of radiomic features and the correlation of AUC scores between mice and patients. ICC analysis was performed to compare two feature aggregation methods of GLCM and GLRLM features. As both feature aggregation methods rendered highly correlated results (ICC GLCM = 0.99, ICC GLRLM = 0.83), only one feature aggregation method per feature class was kept for further analysis to reduce feature redundancy.
Univariate predictive power of radiomic features in terms of AUC is presented in Figure 4A. On average, features describing image intensity tended to perform better than texture-based features. Radiomic features were on average more predictive in mice than in patients. The most predictive features achieved AUC = 0.988 in mice and AUC = 0.896 in patients. The complete list of feature predictive performance is provided in the supplement. Univariate predictive power of the features was highly correlated between murine and patient groups (Figure 4B) with Pearson's r = 0.86. Very high correlation was observed for histogram-, GLCM-, GLRLM-, and NGLDM-based features (Figures 4C-E,H). GLSZM- and GLDZM-based features exhibited more variability (Pearson's r < 0.6; Figures 4F,G).
FIGURE 3 Influence of intra- and interobserver delineation variability on radiomic feature stability. Proportion of unstable features stratified by feature type.
Frontiers in Medicine 06 frontiersin.org
TABLE 2 Predictive performance in model tuning, testing, and re-optimization.
Columns: Model; AUC tuning (SD); AUC testing (95% CI); AUC re-opt. (SD); TPR (95% CI); TNR (95% CI); PPV (95% CI); NPV (95% CI); LR+ (95% CI); LR- (95% CI)

Radiomic patterns predictive of interstitial lung disease translate from experimental interstitial lung disease to patients
To analyze transferability of radiomic patterns and models from mice to patients we built and validated three classes of models: (1) a model based on mean image intensity (MEAN), (2) a model based on first four moments of intensity distribution (mean, standard deviation, skewness, and kurtosis; MSSK), and (3) a machine learning model based on logistic regression (ML). The models were trained on mice data and tested in patients. Additionally, the models were reoptimized in patients, that is, retrained using the features from the mouse models. The results and comparison of model performance are shown in Table 2.
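Besides AUC, Table 2 reports threshold-dependent measures (TPR, TNR, PPV, NPV, LR+, and LR-). For readers less familiar with them, the sketch below derives all six from a 2 × 2 confusion matrix using their standard definitions. The counts are hypothetical and the function name is illustrative; how the classification threshold was chosen in the study is not covered here.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Standard diagnostic measures from confusion-matrix counts.

    Raises ZeroDivisionError if a denominator is zero (e.g., LR+ when TNR = 1).
    """
    tpr = tp / (tp + fn)  # sensitivity
    tnr = tn / (tn + fp)  # specificity
    return {
        "TPR": tpr,
        "TNR": tnr,
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "LR+": tpr / (1 - tnr),
        "LR-": (1 - tpr) / tnr,
    }


# Hypothetical counts, not study results.
metrics = diagnostic_metrics(tp=8, fn=2, tn=9, fp=1)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Unlike PPV and NPV, the likelihood ratios do not depend on disease prevalence, which makes them easier to compare between cohorts with different proportions of ILD cases.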
All models achieved high diagnostic performance in mice. The baseline MEAN model scored AUC = 0.921, which left little room for improvement. Nevertheless, the MSSK and the ML models exceeded AUC = 0.990, resulting in almost perfect classification performance. Testing model performance in patients resulted in AUC scores varying from 0.754 (MEAN) to 0.832 (ML). Model re-optimization in patients improved the predictive performance of all models. ROC curves associated with model tuning, testing, and re-optimization together with the underlying features are presented in Figure 5. ROC curves show that re-optimization gave little improvement for the MEAN and the MSSK models as testing and re-optimization curves followed similar characteristics. On the other hand, machine learning models improved significantly in this process. The corresponding re-optimization ROC curves detached from the testing curves and lay between the tuning and testing curves.
Substantial differences in distribution of radiomic features included in the models in terms of location and dispersion are presented in Figure 6. Most of the features exhibit patterns of the same direction in both mice and patient data sets, that is, either rising or falling trend from healthy to ILD.

Discussion
Representative animal models could represent valuable systems for defined hypothesis testing in radiomics research, particularly for evaluating links with pathophysiology or studying responses to targeted therapies in rare diseases with low number of patients and limited access to tissue samples. Radiomic features proved to be highly indicative of experimental and SSc-ILD. Furthermore, we observed strong linear correlation in terms of discriminative power between features extracted from mouse microCT scans and patient HRCT. We also showed that multivariate models of ILD translated well from mice to patient data sets. Nevertheless, we observed differences between the data sets in terms of feature classes that were predictive. In mice, most of the feature groups contained features that reached similar maximum AUC scores. On the other hand, in patients we observed that even though histogram-based features achieved high discriminative power, some texture features were more predictive. This difference could be caused by inferior quality of microCT compared to HRCT. For this reason, the assessment of microCT done by our radiologist might have also been mainly led by first order characteristics rather than texture. Furthermore, the ILD manifestations can differ depending on the etiology. As a result, the observed differences may be caused by the limitation of the bleomycin-induced ILD being an imperfect model of SSc-ILD. In any case, our results are in line with the available literature on human lung pathologies including chronic obstructive pulmonary disease, radiation-induced pneumonitis or connective tissue disease-related ILD, which showed that texture-based analysis of CT data can be superior to visual or histogram-based measures for diagnosis (28,31,32).
Analysis of feature weights in the MEAN and the MSSK models showed that higher values of the mean and standard deviation of the image intensity and lower values of skewness and kurtosis correspond to larger risk of ILD. Effectively, this means that presence of ILD shifts the intensity distribution from a typical "healthy" positively skewed intensity distribution toward higher intensity values with a more symmetric distribution and thin tails. The best performing model (ML) relied on three radiomic features: the root mean square (histogram), gray level non-uniformity normalized (GLSZM), and dependence count non-uniformity (NGLDM). Significant improvement of machine learning models by re-optimization may suggest the existence of similar predictive radiomic patterns in training (mice) and test (patients) data sets in the presence of domain shift between both groups.
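The four inputs of the MSSK model can be computed from a list of voxel intensities as sketched below. This is an illustration under stated assumptions: population (divide-by-n) moments and excess kurtosis (normal distribution = 0) are used, which may differ from the exact definitions in Z-Rad, and the two intensity samples are toy values chosen only to contrast a right-skewed with a symmetric distribution.

```python
import math


def first_four_moments(values):
    """Return (mean, standard deviation, skewness, excess kurtosis)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # population variance
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    sd = math.sqrt(m2)
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2 - 3.0  # excess kurtosis (assumed convention)
    return mean, sd, skewness, kurtosis


# Toy HU samples: mostly aerated voxels with a dense tail vs. a symmetric spread.
right_skewed = [-900, -900, -900, -900, -400]
symmetric = [-800, -700, -600, -500, -400]
print(first_four_moments(right_skewed))
print(first_four_moments(symmetric))
```

In this toy contrast the symmetric sample has the higher mean and lower skewness and kurtosis, the same direction of change the model weights associate with ILD.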
The presented study has a few limitations. First, the differences in scanning parameters between microCT and HRCT cause a significant domain shift between experimental and patient data sets. Although we were able to recover the predictive power of the analyzed multivariate models by re-optimization in the patient cohort, and thereby confirm transferability of the underlying radiomic signatures, better calibration of the microCT scanner and selection of scanning parameters could potentially improve the transferability. Second, our study focused on CT-derived radiomics approaches, since HRCT scans are part of the routine work-up of ILD patients. Other imaging modalities such as nuclear imaging or MRI, although currently rarely performed in ILD (10), could be evaluated for radiomic analyses to assess whether they might provide additional or complementary information.

Conclusion
Radiomic signatures of experimental ILD derived from microCT scans translated as prognostic factors to HRCT of SSc-ILD. By this we showed that the well-established experimental model of BLM-induced ILD is a valuable system to test defined hypotheses in radiomics research for later validation in human cohorts.

Data availability statement
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Ethics statement
The studies involving human participants were reviewed and approved by the local ethics committees (approval numbers: pre-BASEC-EK-839 (KEK-no.-2016-01515), KEK-ZH-no. 2010-158/5, BASEC-no. 2018-02165, and BASEC-no. 2018-01873). The patients/participants provided their written informed consent to participate in this study. This animal study was reviewed and approved by the cantonal authorities and performed in compliance with the Swiss law of animal protection (ZH235-2018).

Author contributions
HG, JG-S, MG, BM, and ST-L contributed to the conception and design of the study. HG, JG-S, MBr, CB, and TF contributed to the acquisition and analysis of the data. HG, JG-S, BM, and ST-L contributed to the interpretation of data. HG and MBo contributed to the creation of software used in the study. All authors have drafted the work or substantively revised it.

Funding
This work was supported by the Forschungskredit PostDoc from University of Zurich (FK-19-046 to JG-S) and Swiss National Fund (SNF 310030_170159 to HG).