Idiopathic Pulmonary Fibrosis Mortality Risk Prediction Based on Artificial Intelligence: The CTPF Model

Background: Idiopathic pulmonary fibrosis (IPF) needs a precise method for predicting prognosis. This study used artificial intelligence (AI) deep learning to develop a new mortality risk prediction model for IPF patients. Methods: We established an AI honeycomb segmentation system, built from the CT images of 102 IPF patients manually labeled by radiologists, that segments the honeycomb tissue area automatically. The percentage of honeycomb in the lung was calculated as the CT fibrosis score (CTS). Patient severity was evaluated by pulmonary function and physiological feature (PF) parameters (FVC%pred, DLco%pred, SpO2%, age, and gender). Another 206 IPF cases were randomly divided into a training set (n = 165) and a validation set (n = 41), and the fibrosis percentage of each case was calculated by the AI system. Then, using a competing risk (Fine–Gray) proportional hazards model, a risk score model was built on the training set and validated on the validation set. Results: The final risk prediction model (CTPF) included the CT stages and the PF (pulmonary function and physiological features) grades. Three CT stages were defined: stage I (CTS ≤ 5), stage II (5 < CTS < 25), and stage III (CTS ≥ 25). The PF grades were classified as mild (a, 0–3 points), moderate (b, 4–6 points), and severe (c, 7–10 points). The AUC and Brier scores at 1, 2, and 3 years in the training set were, respectively, 74.3 [63.2, 85.4] and 8.6 [2.4, 14.8]; 78 [70.2, 85.9] and 16.0 [10.1, 22.0]; and 72.8 [58.3, 87.3] and 18.2 [11.9, 24.6]. The results in the validation set were similar and suggested that high-risk patients had significantly higher mortality rates. Conclusion: The CTPF model, built with AI technology, can predict mortality risk in IPF precisely.


INTRODUCTION
The survival of IPF patients varies considerably. Some are stable for a long time, some progress slowly, and some exacerbate acutely, leading to short-term death (Ley et al., 2011;Raghu et al., 2018). A widely accepted method of assessing disease severity and estimating prognosis remains absent (Gonnella et al., 1984).
Currently, severity assessment models of IPF mainly include the following: 1) The CRP (clinical-radiographic-physiologic) scoring model, proposed by Watters et al. in 1986 (Watters et al., 1986;Watters et al., 1987), consists of seven variables: the degree of dyspnea, chest X-ray quantitative score, forced vital capacity (FVC), forced expiratory volume in one second (FEV1), intrathoracic gas volume (Vtg), the ratio of the diffusing capacity of the lung for carbon monoxide (DLco) to lung volume (VA) (DLco/VA), and the resting alveolar-arterial oxygen partial pressure difference (AaPO2). The CRP model includes numerous parameters, its calculation is complex, and fibrotic lesions are difficult to identify on chest X-ray images. King et al. (2001) revised the CRP scoring system by adding parameters such as gender, age, smoking status, and clubbing, which further increased the complexity of the evaluation. 2) In 2002, Wells et al. (2003) proposed the composite physiologic index (CPI), which used only lung function parameters to assess the severity of interstitial lung disease (ILD); however, its formula is complicated, and its clinical application is limited. 3) Ley et al. (2012) proposed the GAP (gender, age, and physiologic variables) model based on gender, age, FVC, and DLco; however, it still does not include the essential CT data. 4) Okuda et al. (2013) proposed using the arterial partial pressure of oxygen (PaO2) and oxyhemoglobin saturation (SaO2%), two leading arterial blood gas indicators, to assess severity; however, CT, lung function, and other essential parameters are missing from this approach. Hence, a precise and easy-to-use model is needed to evaluate and predict the prognosis of IPF.
In recent years, artificial intelligence (AI), especially deep learning, has been evolving rapidly and has achieved remarkable results in computer vision (CV). Traditional computer-based CT analysis programs, such as CALIPER, provided objective quantitation of IPF disease. Jacob et al. (2017) used CALIPER to measure disease severity with feature engineering, which usually relies on subjective experience and may lead to suboptimal results. Compared with the traditional CV method, the deep learning-based method learns the features by itself with an end-to-end architecture that avoids subjective human feature selection, and it usually achieves state-of-the-art results (O'Mahony et al., 2019). One of the essential tasks of CV is semantic segmentation, which can be thought of as pixel-wise classification. Deep learning-based semantic segmentation has been widely applied in biomedical image processing (Ronneberger et al., 2015), in areas such as the lung (Hofmanninger et al., 2020;Handa et al., 2021), kidney (Bazgir et al., 2020), brain tumor (Myronenko and Hatamizadeh, 2020), sublingual vein (Xiong et al., 2020), and prostate (Yoo et al., 2019), and has achieved state-of-the-art results. We extended semantic segmentation to pulmonary fibrosis image analysis by training a deep learning model that automatically segments fibrosis tissue regions in chest CT images and calculates the fibrosis tissue percentage of the entire lung (patent application no. 202010985175.8).
Combining the aforementioned pulmonary function and physiological feature (PF) parameters, which have been proved to have good prognostic value and are easy to obtain clinically, with this CT analysis, we set up a new comprehensive framework for evaluating the severity of pulmonary fibrosis (patent no: ZL 2019 1 0514972.5). We conducted clinical verification (ChiCTR-RRC-17010683), which achieved accurate pulmonary fibrosis severity assessment and prognosis evaluation (software registration no. 6406807).

Study Cohorts
To test the AI system that we established (patent application no. 202010985175.8), we retrospectively analyzed 232 patients diagnosed with IPF from 1 January 2011 to 31 January 2020 in the Department of Respiratory Medicine, Shanghai Pulmonary Hospital. The IPF diagnosis of these cases was confirmed according to the criteria of the 2018 IPF International Guidelines (Raghu et al., 2018). Data on gender, age, lung function, fingertip SpO2% (or SaO2% measured by arterial blood gas analysis), chest CT, occupation, and smoking history were recorded. All patients were followed up in outpatient clinics or by phone; the follow-up recorded survival status, time of death (year and month), cause of death, other complications, whether the patient underwent lung transplantation, and the time of lung transplantation. Follow-up ended on 1 August 2020. After follow-up, 206 qualified cases were included in the study. The patient screening process and follow-up are shown in Figure 1. This study was approved by the Institutional Ethics Committee of Shanghai Pulmonary Hospital (No. K17-016).

Development of the Mortality Risk Prediction Model for IPF
According to the 2018 IPF diagnosis guidelines (Raghu et al., 2018), the extent of honeycomb lung and the scope of the disease shown on the CT images of IPF patients are essential indicators for predicting IPF mortality (Flaherty et al., 2003;Best et al., 2008;Raghu et al., 2011;Rosas et al., 2011). HRCT sections were 1-2 mm thick, with a section spacing of 2 cm. Patients were scanned in the supine position. The minimum exposure was 200 mA per second. First, we established a deep learning AI model based on a neural network (Ronneberger et al., 2015) to quantify the extent of honeycomb lung lesions for another 206 patients. The CT images were also reviewed independently by two radiologists, who were blinded to the clinical information and the deep learning model results. Both were board-certified diagnostic radiologists specializing in chest radiology. They evaluated the extent of honeycombing and gave the manual-CT scores. The patient's PF parameters are indispensable for prognosis estimation (Wells et al., 2003;Ley et al., 2012). After analyzing the pros and cons of the existing scoring systems (CRP, GAP, CPI, and JRS) shown in Table 1, we chose five parameters, namely, FVC%pred, DLco%pred, SpO2%, age, and gender, to evaluate the severity of the patient's disease (patent no: ZL 2019 1 0514972.5). These parameters have significant predictive value and are easy to obtain in clinical practice. Based on previous studies (Watters et al., 1986;Watters et al., 1987;King et al., 2001;Wells et al., 2003;Ley et al., 2012;Okuda et al., 2013), we formed a multi-parameter severity evaluation metric (PF grading) from these PF parameters.
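To illustrate how a CT fibrosis score can be derived once segmentation masks are available, the following minimal sketch computes the honeycomb percentage of the whole lung from binary masks. The function name `ct_fibrosis_score` and the nested-list mask layout (slices of rows of 0/1 values) are assumptions made for illustration; this is not the authors' implementation.

```python
from typing import List

# Assumed layout: slices -> rows -> voxels, each voxel 0 or 1.
Mask = List[List[List[int]]]


def ct_fibrosis_score(lung_mask: Mask, honeycomb_mask: Mask) -> float:
    """Return honeycomb voxels as a percentage of lung voxels."""
    lung_voxels = 0
    honeycomb_voxels = 0
    for lung_slice, hc_slice in zip(lung_mask, honeycomb_mask):
        for lung_row, hc_row in zip(lung_slice, hc_slice):
            for in_lung, in_hc in zip(lung_row, hc_row):
                if in_lung:
                    lung_voxels += 1
                    # Count honeycomb only inside the lung region.
                    if in_hc:
                        honeycomb_voxels += 1
    if lung_voxels == 0:
        raise ValueError("lung mask is empty")
    return 100.0 * honeycomb_voxels / lung_voxels


# Toy example (hypothetical data): one 2x4 slice, 8 lung voxels, 2 honeycomb.
lung = [[[1, 1, 1, 1], [1, 1, 1, 1]]]
honeycomb = [[[0, 0, 1, 1], [0, 0, 0, 0]]]
print(ct_fibrosis_score(lung, honeycomb))  # 25.0
```

Counting honeycomb voxels only where the lung mask is set keeps the score within 0 to 100 even if the two masks disagree at the lung border.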

Statistical Method
The Spearman correlation coefficient was used to analyze the correlation between CTS and lung function parameters, namely, FVC%pred, DLco%pred, SpO2%, and CPI. The patient's survival time, measured in months, was calculated from the evaluation time to the endpoint event, which was death due to lung disease or lung transplantation. According to survival time, X-tile software (internal cross-validation method) was used to find the optimal CT score thresholds to classify patients into three CT stages. Lung transplantation is considered the most effective treatment for IPF (Thabut et al., 2009), so it was treated as a competing risk event, and the competing risk (Fine-Gray) model was used to establish and evaluate the disease prognosis prediction model as follows: 1) Based on the total number of our cases, the modeling parameters were selected by referring to the existing literature (Roecker, 1991;Zhang et al., 2018); all 206 cases were randomly divided into the training set (165 cases) and the verification set (41 cases). 2) CT staging, PF staging, and CTPF comprehensive staging were used as predictors. We compared the accuracy of the model with the GAP staging proposed by Ley et al. (2012) and established four mortality risk prediction models based on Fine-Gray regression analysis of the training set data, namely, the CT staging model, PF grading model, CTPF staging model, and GAP staging model. The predictive accuracy of each risk model was assessed by calculating the area under the ROC curve (AUC) and the Brier score. 3) The validation set was used to validate the four models. 4) A nomogram was drawn, following previous reports (Zhang et al., 2017;Zhang et al., 2018), to visualize the 1-, 2-, and 3-year survival rates predicted by the CTPF model for patients with different CT stages and PF grades.

FIGURE 1 | Case screening process. In total, 232 cases were diagnosed as IPF according to the 2018 IPF diagnosis and treatment guidelines. A total of 26 patients were excluded: two patients were diagnosed with interstitial pneumonia with autoimmune features (IPAF) during follow-up, and 24 patients had incomplete CT and lung function data. Finally, 206 cases were included in the retrospective analysis (including 16 cases of lung transplantation): 93 surviving cases, including 11 lung transplants; 81 deaths, of which 10 were from lung cancer, 67 from acute exacerbation of IPF, and 4 after lung transplantation; and 32 patients lost to follow-up, including one lost after lung transplantation.

Frontiers in Pharmacology | www.frontiersin.org April 2022 | Volume 13 | Article 878764
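The Brier score named in the statistical method is the mean squared difference between predicted event probabilities and observed outcomes at a fixed time horizon. The sketch below computes the plain version for fully observed binary outcomes. It deliberately ignores censoring and competing risks, which the Fine-Gray analysis in this study has to account for, so it illustrates the metric only and not the authors' computation. The function name `brier_score` and the toy numbers are hypothetical.

```python
from typing import Sequence


def brier_score(predicted_risk: Sequence[float], died: Sequence[int]) -> float:
    """Mean squared error between predicted risks and 0/1 outcomes.

    Simplification (assumption): every patient's status at the horizon is
    known, so no censoring or competing-risk weighting is applied.
    """
    if len(predicted_risk) != len(died) or len(died) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - y) ** 2 for p, y in zip(predicted_risk, died)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical 1-year predicted mortality risks and observed outcomes.
risks = [0.1, 0.8, 0.5, 0.2]
outcomes = [0, 1, 1, 0]
print(round(brier_score(risks, outcomes), 4))  # 0.085
```

Lower values indicate better agreement between predicted and observed outcomes; a model that always predicts 0.5 scores 0.25 on this scale.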

Patient Baseline Clinical Characteristics
Following the process shown in Figure 1, we screened 232 IPF patients. Among them, 206 met the scoring requirements and were included in the CTPF staging verification: 93 survived, 81 died, and 32 were lost to follow-up; 16 of the 206 received lung transplantation. Table 2 shows the primary characteristics of the patients. The average age was 64.1 ± 7.9 years, and the average survival time was 28.7 ± 19.3 months. Most patients were male (196/206, 95.1%), and most had a history of smoking (156/206, 75.7%). The average CT score was 14.1 ± 11.30 (range, 0.04 to 52.3).

Test the CT Score Calculated by AI
The fibrosis segmentation network (FSN) was the essential deep learning component: it performed the semantic segmentation of fibrosis regions in the CT images and is the basis of further calculations such as the CT score (CTS). The FSN's performance is shown in Supplemental Material 2. Figures 2A-D show that the CTS was negatively correlated with FVC%pred (rs = -0.40, p < 0.01), DLco%pred (rs = -0.66, p < 0.01), and SpO2% (rs = -0.44, p < 0.01) and positively correlated with the existing CPI (rs = 0.65, p < 0.01), which reflects the severity of the patient's disease. In addition, the CTS was closely related to the manual-CT scores given by radiologists (Spearman correlation coefficient rs = 0.80, p < 0.01; Figure 2E). These results indicate that the CT scoring system designed in this study properly reflects the severity of pulmonary fibrosis.
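For readers unfamiliar with the statistic, the Spearman coefficient rs reported here is the Pearson correlation of the ranks of the two variables, with tied values sharing their average rank. The following is a minimal stdlib-only sketch under that definition; the function names and the toy values are hypothetical and are not taken from the study data.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rs(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical values: CT scores rising while FVC%pred falls.
cts_values = [2.0, 8.5, 15.9, 30.1, 41.0]
fvc_pred = [95.0, 88.0, 80.0, 70.0, 60.0]
print(spearman_rs(cts_values, fvc_pred))  # -1.0
```

Because only ranks enter the calculation, the coefficient captures any monotonic relationship and is insensitive to the skewed distribution of CT scores.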

Establishment of the CTPF Model
According to the survival time of all patients, we used X-tile software to find the cut-off points of the CT score, which were calculated as 5.6 and 25.4. These divide the patients into three groups whose survival rates are statistically different (χ2 = 27.985, p < 0.05). To facilitate clinical application, we also tried integer cut-off points, i.e., 5 and 25. We established the prediction model with both pairs of cut-off points, (5.6, 25.4) and (5, 25), and found that they had the same prediction efficiency. For ease of clinical use, we chose the latter. The three groups were therefore as follows: stage I (CTS ≤ 5), stage II (5 < CTS < 25), and stage III (CTS ≥ 25).
After analyzing the pros and cons of the existing scoring systems (CRP, GAP, CPI, and JRS) shown in Table 1, we chose five parameters, namely, FVC%pred, DLco%pred, SpO2%, age, and gender, to evaluate the severity of the patient's disease, and calculated scores using PF grading to assess severity and prognosis (Table 3). Figure 3A shows the relationship between CT staging and mortality risk in Fine-Gray univariate regression analysis, in which the effect of PF grading might be involved. Figure 3B shows the result of the multi-factor analysis after eliminating the effect of PF grading, that is, the relationship between CT staging and mortality risk. In both the adjusted and unadjusted cases, CT staging was positively correlated with mortality risk. Similarly, Figures 3C, D illustrate the relationship between PF grading and mortality risk in Fine-Gray regression with unadjusted and adjusted CT staging. In both cases, PF grades were positively correlated with mortality risk. We then combined the two factors to create a new mortality prediction model, the CTPF model. The score was calculated from the five pulmonary function and physiological feature prognostic predictors and the CT score calculated by the AI model, as shown in Table 3. Table 4 shows the patients' clinical characteristics in the training set and validation set. There was no significant difference between the two sets. The training set was then classified according to CT staging, PF grading, GAP staging, and CT + PF staging. The analysis results in

Notes: PF-based grade: the grade was determined by using the pulmonary function and physiological parameters (age, gender, FVC%pred, DLco%pred, and SpO2%) and following the description in Table 3. The grade was defined as (a) mild, (b) moderate, and (c) severe. GAP (gender, age, and physiologic variables) stage followed the recommendation by Brett Ley, and a higher stage represented a greater death risk. CPI: composite physiologic index. In 2002, Athol U. Wells and others proposed to use CPI, which combined chest CT and pulmonary functional parameters, to assess the severity of interstitial lung diseases (ILDs). A higher CPI represents a more severe ILD.

Prognostic Significance of the CTPF Model
A nomogram of death risk prediction for a CTPF prediction model and calibration curve are shown in Figure 4.
The 1-year, 2-year, and 3-year cumulative survival rates of different CTPF stages based on the nomogram are shown in Table 6. For patients with the same CT stage, the higher the PF grade, the lower the cumulative survival rate, and vice versa. As the flow in Supplementary Figure S1 shows, all the patients' chest CT lung images were read into the deep learning model, which segmented the fibrotic lesion region and calculated its area as a percentage of the whole lung. Age, gender, FVC%pred, DLco%pred, and SpO2% data were then included in the metric to calculate each patient's CTPF staging result (Figure 5), and an evaluation report (Supplementary Figure S7) was generated.
Two representative cases are shown in Figure 5, which displays the output of the fibrosis segmentation network. Figure 5A shows a 74-year-old male patient whose CT fibrosis score is 15.9. According to Table 3, his pulmonary function and physiological (PF) indicators correspond to seven points of severity, so his comprehensive stage is IIc; the patient died of exacerbation 23 months later. Figure 5B shows another patient, a 62-year-old male with an AI fibrosis score of 4.0. According to Table 3, his final stage is Ia. The patient was still alive at follow-up after 39 months.
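The staging of these two cases can be reproduced with a short sketch that maps a CT fibrosis score and a PF point total to a combined CTPF label. The cut-offs (5 and 25) and the point ranges (0-3, 4-6, 7-10) come from the article; the handling of scores exactly equal to 5 or 25 follows the abstract's definition (stage I for CTS ≤ 5, stage III for CTS ≥ 25) and is an assumption where the text is less explicit. The per-parameter point assignment of Table 3 is not reproduced here, so the PF total is taken as an input, and all function names are hypothetical.

```python
def ct_stage(cts: float) -> str:
    """Map a CT fibrosis score (percent of lung) to CT stage I, II, or III."""
    if not 0 <= cts <= 100:
        raise ValueError("CTS is a percentage and must lie in [0, 100]")
    if cts <= 5:  # boundary handling assumed from the abstract
        return "I"
    if cts < 25:
        return "II"
    return "III"


def pf_grade(points: int) -> str:
    """Map a total PF score (0-10 points) to grade a, b, or c."""
    if not 0 <= points <= 10:
        raise ValueError("PF points must lie in [0, 10]")
    if points <= 3:
        return "a"  # mild
    if points <= 6:
        return "b"  # moderate
    return "c"      # severe


def ctpf_stage(cts: float, pf_points: int) -> str:
    """Combine CT stage and PF grade into a label such as 'IIc'."""
    return ct_stage(cts) + pf_grade(pf_points)


# Case in Figure 5A: CTS 15.9 and seven PF points.
print(ctpf_stage(15.9, 7))  # IIc
# Case in Figure 5B: CTS 4.0; the PF total of 2 is a hypothetical value
# within the 0-3 range implied by the reported grade a.
print(ctpf_stage(4.0, 2))   # Ia
```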
As some IPF patients had also developed emphysema (Rosas et al., 2011), we trained another semantic segmentation model for pulmonary bullae and calculated their percentage of the entire lung based on the same framework with different parameters. See Supplementary Material S1 for details.

The existing scoring systems shown in Table 1 are either too simple, such as GAP and JRS, which use too few data to accurately reflect the severity of the disease and estimate prognosis, or too complex, such as CRP and CPI, which are difficult to apply in clinical practice (Jacob et al., 2017). Chest CT is one of the standard clinical examinations in the diagnosis of IPF, and honeycombing in the lung is the most representative lesion of pulmonary fibrosis and is directly related to prognosis (Flaherty et al., 2003;Lynch et al., 2005;Best et al., 2008;Raghu et al., 2011;Rosas et al., 2011). Semi-quantitative evaluation is the most common method in practice; it requires physicians' expertise, is labor-intensive and time-consuming, and its results may vary considerably between practitioners. The Cohen weighted κ values of semi-quantitative evaluation are only 0.40-0.58 (Watadani et al., 2013;Hansell et al., 2015), and both repeatability and accuracy are low. The deep learning-based pulmonary fibrosis segmentation model in this study segmented honeycomb fibrosis accurately and automatically and calculated its percentage of the whole lung, which quantifies the essential factor of fibrosis staging. The scores evaluated by AI were lower than the manual-CT scores evaluated by radiologists, because the AI evaluated the whole lung in the chest CT, whereas the manual evaluation usually selected sections with dominant lesions, such as the aortic arch section, tracheal bifurcation section, and lung diaphragm section.
This method has the advantages of fast incremental learning, objective and accurate quantitative calculation, efficient whole-lung scanning, and high repeatability. The DSC of the model reached 77.26%, which is 8.39% higher than that of the benchmark (U-Net with the spatial pyramid pooling module), 68.78% (Ronneberger et al., 2015). Compared with CALIPER (Jacob et al., 2017), which is based on traditional CV technology, deep learning methods learn features automatically and achieve better performance. Although Handa et al. (2021) also adopted a deep learning U-Net architecture, we enhanced it with an attention mechanism and a Squeeze-Excitation Network to achieve better outcomes. The CT evaluation took only 11 s per patient, a significant improvement in efficiency.
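The DSC (Dice similarity coefficient) quoted for the segmentation model measures the overlap between a predicted mask A and a reference mask B as 2|A ∩ B| / (|A| + |B|). A minimal sketch of that formula on flattened binary masks follows; the function name, the flattened-list input, and the convention of returning 1.0 when both masks are empty are assumptions for illustration and do not describe the authors' evaluation code.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Dice similarity coefficient of two flattened 0/1 masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        # Both masks empty: treated as perfect agreement (assumed convention).
        return 1.0
    return 2.0 * intersection / total


# Hypothetical masks: 3 predicted voxels, 3 reference voxels, 2 shared.
predicted = [1, 1, 1, 0, 0, 0]
reference = [0, 1, 1, 1, 0, 0]
print(round(dice_coefficient(predicted, reference), 4))  # 0.6667
```

A value of 1 means the two masks coincide exactly and 0 means they do not overlap at all, so the reported percentages correspond to this coefficient multiplied by 100.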
We selected FVC%pred, DLco%pred, SpO2%, age, and gender as five essential indicators of IPF severity because they have been shown to have good prognostic value and are easy to obtain clinically. Both univariate and multivariate regression analyses suggested that PF classification was an independent risk factor for predicting IPF patients' mortality risk. The severity grade of each patient's disease (a, b, or c) was calculated from these five parameters. The CTPF evaluation system combines CT pulmonary fibrosis staging (I, II, and III) with severity grading (a, b, and c) to form a complementary staging/severity grading model, CTPF (Table 3). The assessment report shows the result (Figure 5).
In the task of mortality risk prediction, the CTPF model has a better AUC, Brier score, and stability than the other models (PF, CT, and GAP staging). Lung transplantation is an effective way to improve the prognosis of IPF patients (Thabut et al., 2009). However, lung transplantation itself also carries a mortality risk. Yusen et al. (2015) reported a global mortality risk of lung transplantation of 20% at 1 year and 35% at 3 years. We suggest that patients whose mortality risk (Table 6) is higher than the lung transplantation risk consider transplantation. In this regard, our

Notes: Measurement data are presented as mean ± standard deviation (SD). Count data are presented as percentage or proportion. SpO2%: oxygen saturation of peripheral blood. SpO2 is the resting arterial oxygen saturation measured at fingertips. FVC: forced vital capacity. FVC%pred: the percentage of the actual FVC over the predicted FVC. FEV1: forced expiratory volume in one second. FEV1%pred: the percentage of the actual FEV1 over the predicted FEV1. DLco: diffusing capacity of the lung for carbon monoxide. DLco%pred: the percentage of the actual DLco over the predicted DLco. CT-based stage: the stage was determined by using the average score of the two radiologists and following the criteria described in Table 3. PF-based grade: the grade was determined by using the pulmonary function and physiological parameters (age, gender, FVC%pred, DLco%pred, and SpO2%) and following the description in Table 3. The grade was defined as (a) mild, (b) moderate, and (c) severe. GAP (gender, age, and physiologic variables) stage followed the recommendation by Brett Ley, and a higher stage represented a greater death risk.

CT I: honeycomb lesion area was <5% of the entire lung. CT II: honeycomb lesion area was 5-25% of the entire lung. CT III: honeycomb lesion area was >25%. The PF-based grade was determined by assessing the scores of age, gender, FVC%pred, DLco%pred, and SpO2% according to the criteria in Table 3 and adding the scores. PF(a): score 0-3. PF(b): score 4-6. PF(c): score 7-10.

The prognostic evaluation of pulmonary fibrosis with emphysema needs to be further analyzed in additional cases.
The major limitation of this study is the lack of an external validation cohort to further evaluate the CTPF model. We are planning a multicenter clinical study to verify its clinical significance.

CONCLUSION
The deep learning-based model quantified the percentage of fibrotic lesions in the whole lung by automatically segmenting the fibrosis region from chest CT images. Combined with the IPF severity determined by five important physiological and pulmonary function indicators, the resulting CTPF model predicted the mortality risk of IPF patients more precisely.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusion of this article will be made available by the authors, without undue reservation.

ETHICS STATEMENT
The study was approved by the Institutional Ethics Committee of Shanghai Pulmonary Hospital (No. K17-016).

AUTHOR CONTRIBUTIONS
HL, XW, CY, XC, YZ, and YS participated in the conception, hypothesis, and design of the study. XW, CY, and WZ performed the artificial intelligence experiments. CY and AZ carried out the statistical analyses. All authors contributed to interpretation of the data. JS participated in CT imaging evaluation. XW, CY, and HL wrote the manuscript, and all authors made critical revisions. All authors read and approved the final manuscript.