Identification of Hub Genes and Pathways Associated With Idiopathic Pulmonary Fibrosis via Bioinformatics Analysis

Idiopathic pulmonary fibrosis (IPF) is a progressive disease whose etiology remains unknown. The purpose of this study was to explore hub genes and pathways related to IPF development and prognosis. Multiple gene expression datasets were downloaded from the Gene Expression Omnibus database. Weighted correlation network analysis (WGCNA) was performed and differentially expressed genes (DEGs) identified to investigate Hub modules and genes correlated with IPF. Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis, and protein-protein interaction (PPI) network analysis were performed on selected key genes. In the PPI network and cytoHubba plugin, 11 hub genes were identified, including ASPN, CDH2, COL1A1, COL1A2, COL3A1, COL14A1, CTSK, MMP1, MMP7, POSTN, and SPP1. Correlation between hub genes was displayed and validated. Expression levels of hub genes were verified using quantitative real-time PCR (qRT-PCR). Dysregulated expression of these genes and their crosstalk might impact the development of IPF through modulating IPF-related biological processes and signaling pathways. Among these genes, expression levels of COL1A1, COL3A1, CTSK, MMP1, MMP7, POSTN, and SPP1 were positively correlated with IPF prognosis. The present study provides further insights into individualized treatment and prognosis for IPF.


INTRODUCTION
Idiopathic pulmonary fibrosis (IPF), a chronic and progressive lung disease of unknown etiology (King et al., 2014;Martinez et al., 2017;Raghu et al., 2018), is characterized by diffuse alveolar inflammation and alveolar structural damage (Xu et al., 2016;Claude and Harold, 2018;David et al., 2019). The global annual incidence of IPF is 0.2-93.7 per 100,000 (Hutchinson et al., 2015) and the median survival time is only 2-3 years after diagnosis is confirmed (Tran and Suissa, 2018;Schäfer et al., 2020). Clinical biomarkers reliably reflecting the progression of IPF are small in number. In terms of treatment, medications recommended by current clinical guidelines -nintedanib and pirfenidone have a good therapeutic effect on IPF in general (Maher et al., 2019;Saito et al., 2019;Behr et al., 2021), but not that effective for late stage IPF. Therefore, it is important to illuminate the pathogenesis of IPF and to explore potential biomarkers for early stage IPF in order to improve the therapeutic effect on IPF.
In recent years, the amount of biological data for research has dramatically increased with the development of transcriptomic analysis, providing new viewpoints for exploring the etiology, pathogenesis, and novel targets for clinical treatment (DePianto et al., 2015;Wang et al., 2017). Previous studies primarily tested individual genes in diseased conditions. However, genes with similar expression patterns are likely to be tightly co-regulated in vivo with closely related functions, expressed in the same signaling pathways or processes (Pei et al., 2017). Identifying these genes and their interactions will provide further understanding of biological pathways related to IPF.
In the present study, multiple bioinformatic methods were used to search for hub genes significantly correlated with the occurrence and prognosis of IPF. Their expression has also been validated using quantitative real-time PCR (qRT-PCR). This work will provide further insights into the underlying molecular mechanisms of IPF and potential molecular targets for developing novel interventional strategies.

Acquisition of Microarray Data
In this study, seven gene array expression Series Matrix Files of IPF containing GSE110147, GSE53845, GSE24206, GSE32537, GSE68239, GSE10667, and GSE70866 were obtained from the Gene Expression Omnibus (GEO) repository (https://www.ncbi. nlm.nih.gov/geo/). The main features of these 7 datasets were shown in Table 1. These profiles consisted of gene expression matrices, probe sets, and clinical characteristics. Subsequent analyses were conducted on these datasets.

Removal of the Batch Effect
All of the raw data of GSE110147, GSE53845, GSE24206, GSE32537, and GSE68239 were merged and normalized into the training group using the RMA algorithm provided by the Limma package 3.40.6 of R, followed by removing the batch effect using R package SVA. The principal component analysis (PCA) was used to estimate whether the batch effect was removed. The validation group consisted of data from GSE10667 and GSE70866.

Identification of Significant Modules Using the Weighted Correlation Network Analysis
The WGCNA package was used to determine key genes significantly associated with IPF in the training and validation groups. The best soft threshold power was set to identify the module-trait relationship, module membership (MM), and gene significance (GS). In brief, a weighted adjacency matrix was first constructed based on the selected soft threshold power. Subsequently, the connectivity per gene was deduced through calculating connection strengths with other genes. After validating the module structure preservation using the module preservation R function, the gene expression profile of each module was summarized by the module eigengene on whom IPF traits were regressed in the Limma R package . The IPF-related module was selected with the highest coefficient square (r 2 ) and the p value <0.001.  GSE110147  GPL6244  22  11  2018  London  GSE53845  GPL6480  40  8  2014  South San Francisco  GSE24206  GPL570  11  6  2011  Durham  GSE32537  GPL6244  119  50  2013  Aurora  GSE68239  GPL1708  10  10  2017  Giessen  GSE10667  GPL153  23  15  2003  Bern  GSE70866  GPL17077  112  20 2018 Hannover  GAPDH  F  AGGTCGGTGTGAACGGATTTG  GAPDH  R  TGTAGACCATGTAGTTGAGGTCA  COL1A1  F  GCTCCTCTTAGGGGCCACT  COL1A1  R  CCACGTCTCACCATTGGGG  COL1A2  F  GTAACTTCGTGCCTAGCAACA  COL1A2  R  CCTTTGTCAGAATACTGAGCAGC  COL3A1  F  CCTGGCTCAAATGGCTCAC  COL3A1  R  CAGGACTGCCGTTATTCCCG  COL14A1  F  TTTGGCGGCTGCTTGTTTC  COL14A1  R  CGCTTTTGTTGCAGTGTTCTG  COL15A1  F  GCGGAGTCGGGTTTCAGAG  COL15A1  R  TACTTCGCCCGCAGAACAAA  MMP1  F  GGACAAGCAGTTCCAAAGGC  MMP1  R  GATGCTTAGGGTTGGGGTCT  MMP7  F  CTGCCACTGTCCCAGGAAG  MMP7  R  GGGAGAGTTTTCCAGTCATGG  CTSK  F  GAAGAAGACTCACCAGAAGCAG  CTSK  R  TCCAGGTTATGGGCAGAGATT  CDH2  F  AGCGCAGTCTTACCGAAGG  CDH2  R  TCGCTGCTTTCATACTGAACTTT  ASPN  F  AAGGAGTATGTGATGCTACTGCT  ASPN  R  ACATTGGCACCCAAATGGACA  POSTN  F  CCTGCCCTTATATGCTCTGCT  POSTN  R  AAACATGGTCAATAGGCATCACT  SPP1  F  AGCAAGAAACTCTTCCAAGCAA  SPP1 R GTGAGATTCGTCAGATTCATCCG

Identification of Differentially Expressed Genes
The Limma package 3.40.6 was used to compare DEGs between the normal group and the IPF group in the training group. The Benjamini-Hochberg's method was used to control the false discovery rate (FDR), and DEGs were selected through adopting the commonly used threshold of FDR p value <0.01 and |Log2 fold-change| > 1. The volcano plot of DEGs was generated by the ffplot2 package, and a heatmap of DEGs was produced using the pheatmap package in the R software.

Identification of Key Genes and Gene Set Enrichment Analysis
An online tool (http://bioinformatics.psb.ugent.be/webtools/ Venn/) was employed to construct a Venn diagram to overlap the key genes in the significant modules and DEGs in the training group. The analyses of key genes of Gene Ontology (GO) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways were conducted using the GOplot, clusterProfiler, DOSE, colorspace, and the stringi packages in the R software.

Protein-Protein Interaction Network Analysis
The Search Tool for the Retrieval of Interacting Genes (STRING, https://www. string-db. org/) was used to identify functional interactions between the products of the key genes in the training group. The PPI network was constructed using STRING by adopting the default threshold of a combined score >0.6. Then, the number of nodes of all the related proteins in the PPI network was counted using the R software and visualized using Cytoscape 3.8.2.

The Correlation Between Hub Genes
The ggcorrpolt and the ggthemes of the R package were used to determine the correlation between hub genes in the training and the validation groups, respectively. p value <0.01 was considered statistically significant.

Ethics Statement and Animal Treatment
In this study, 12 8 week-old male C57BL/6 mice weighing 20-25 g were obtained from Shanghai SLAC Laboratory Animal Ltd., China. Mice were bred at 22-24°C under a 12 h/12 h light/dark cycle and with free access to food and water. All procedures were implemented in conformity to the guidelines for animal care published by the United States' National Institutes of Health (NIH) for animal care (Guide for the Care and Use of Laboratory Animals, Department of Health and Human Services, NIH publication No. 86-23, revised 1985). All procedures were approved by Renji hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China (approval number: RJ-20170930). 5 mg/kg LPS was intraperitoneally injected for 5 consecutive days to 6 mice to induce pulmonary fibrosis as reported in our previous article (Wan et al., 2019). The other 6 mice served as control. The lung tissue of all mice was collected 30 days after LPS injection.

Hematoxylin and Eosin, Masson's Trichrome Staining and Immunohistochemistry
Lung tissue collected from both control and LPS injected mice was fixed in 4% paraformaldehyde solution for overnight, followed by dehydration in 70% ethanol and embedding in the paraffin wax. 5 μm thick sections were cut and subject to H&E and Masson's trichrome staining, respectively, as previously described (Wan et al., 2019). Expression of α-SMA in the lung tissue was evaluated using immunohistochemical staining. Sample sections were deparaffinized and incubated with 5% goat serum containing 0.1% Triton X-100 at room temperature for 2 h, followed by sequential incubation in the primary (rabbit antiα-SMA, 1:2000, CST, Cat No 19245) and secondary antibody solutions (goat anti-rabbit IgG, 1:1000, Jackson Immunoresearch Laboratory, Cat No 111-035-003) antibodies, each for 2 h. After 3 rinses with 1x PBS, the sections were incubated in a 3,3,-diaminobenzidine (DAB) reaction complex (Vector lab, Burlingame, CA, United States) until an optimal colour developed. At the end of the procedure, the sections were mounted and dehydrated before being coverslipped.

Quantitative Real-Time PCR and Statistical Analysis
Total RNA of the mouse lung tissue was isolated using Trizol (Thermofisher) according to the manufacturer's instructions. Primer Script RT reagent kit (Takara, Japan) was used to synthesize complementary DNA, and real-time PCR was run on a Quant Studio 1 real-time PCR system (Thermofisher, United States) using the TB Green Premix Ex Taq TM kit (Takara, Japan). Glyceraldehyde-3-phosphate dehydrogenase (GAPDH) is stably expressed in the lung tissue during the process of pulmonary fibrosis. Therefore, it serves as a reliable endogenous control in the reverse transcription-polymerase chain reaction (Loitsch et al., 1999;Cruz-Bermúdez et al., 2019). The 2 −△△Ct method was employed to assess relative expression levels of genes of interest. All of the data were presented as mean ± standard error of mean (SEM) and analyzed using the GraphPad PRISM8 software (United States). Two-tailed Student's t test was used to compare between the two groups. p < 0.001 was considered statistically significant. Sequences of the primers could be found in Table 2.

Survival Analysis of Hub Genes
The Kaplan-Meier method (K-M method; product-limit method) is suitable for analysis with a small sample size. Survival analysis was performed using survfit R package in the R software. The difference in survival curves for IPF patients with different expression levels of hub genes was analyzed using the log-rank method provided by the survival package.

Removal of Batch Effects by Cross-Platform Normalization
To eliminate batch effects from different platforms and batches of datasets, the ComBat function of SVA package was used . Before the removal of batch effects, samples were clustered by batches based on the first two principal components (PC) of unnormalized expression values ( Figures 1A,C). In contrast, the scatter plot of principal component analysis (PCA) based on normalized expression showed that batch processing effects from different platforms were eliminated ( Figures 1B,D). These results confirmed that cross-platform standardization successfully eliminated batch effects.

Identification of IPF-Associated Weighted Correlation 26 Network Analysis Modules
A total of 287 samples and 6,263 genes retrieved from the training group were used for the co-expression network analysis ( Figure 2A). An eigengene correlation coefficient square (r 2 ) of 0.8 and a soft threshold power of 3 were set to identify the module-trait relationship ( Figure 2B). A hierarchical clustering tree was constructed following a dynamic hybrid cut ( Figure 2C). The eigengene dendrogram and heatmap were used to quantify module similarity by eigengene correlation (Figures 2D,E). Among the 15 gene modules identified ( Figure 2F), the green module ( Figure 2G) showed a relatively higher positive correlation with the IPF group (cor 0.71, p < 0.001), and the brown module ( Figure 2H) exhibited significantly positive correlation with the IPF group (cor −0.69, p < 0.001).

Idiopathic Pulmonary Fibrosis-Associated Key Genes and Their Functions
A total of 177 DEGs, including 121 up-regulated and 56 downregulated genes, were identified in the training group ( Figures  3A,B, Table 3). Among them, 72 genes were determined as key genes because they were shared by the green and brown modules ( Figure 4A and Table 4). KEGG analysis revealed that the most notably enriched pathways were protein digestion and absorption and focal adhesion ( Figure 4B, Supplementary Table S1). GO analysis showed that these genes were mainly involved in nine cellular components ( Figure 4D
To further confirm the role of these 12 hub genes in IPF, another WGCNA for 6,252 genes obtained from GSE10667 of the training group was performed. It was found that the black  and greenyellow modules, containing 12 hub genes, out of the 14 gene modules identified ( Figure S1a) showed strong association with IPF (black module: cor -0.63, p < 0.001; greenyellow module: cor 0.81, p < 0.001) (Supplementary Figures S1B-D). These hub genes were involved in many important IPF-related biological processes, such as extracellular matrix organization, response to TGF beta, collagen metabolic process, fibrillar collagen trimer, etc (Supplementary Table S2). These results suggested that the abovementioned hub genes play important roles in the occurrence of IPF.

Expression Validation of Hub Genes by Quantitative Real-Time -PCR
The clustering heatmap showed expression levels of hub genes in the training ( Figure 6A) and the validation groups ( Figure 6B). Expression levels of 12 hub genes showed consistent upregulation in both the training ( Figure 6C) and the validation groups ( Figure 6D). To further validate the results, qRT-PCR was conducted in the LPS-induced pulmonary fibrosis model. The modeling process was described in detail in our previous study (Wan et al., 2019). Firstly, mice were intraperitoneally injected with saline or LPS (5 mg/kg) for 5 consecutive days, and samples were collected on day 30 after LPS injection. H&E E/MASSON staining and α-SMA immunohistochemistry ( Figure 7A) showed that the model of pulmonary fibrosis was successfully constructed. The lung tissue of 6 control and 6 LPS injected mice was collected. As shown in Figure 7B, expression levels of COL1A1, COL1A2, COL3A1, COL14A1, MMP1, MMP7, ASPN, CDH2, SPP1, POSTN, and CTSK were significantly increased in the lung tissue of LPS injected mice compared with that of control mice (p < 0.001), whereas the expression level of COL15A1 was undetectable in the lung tissue of either the control or LPS injected mice. These results demonstrated important roles COL1A1, COL1A2, COL3A1, COL14A1, MMP1, MMP7, ASPN, CDH2, SPP1, POSTN and CTSK play in IPF occurrence.

Survival Analysis of Hub Genes in the Validation Group
To explore the prognostic value of these 12 hub genes for IPF patients, overall survival analysis was performed using the GSE70866 dataset. It was found that IPF patients with high expression levels of COL1A1, CTSK, MMP1, MMP7, and SPP1 had poorer overall survival compared to those with low expression levels (p < 0.05). No significant correlation was found between expression levels of other hub genes and survival of IPF patients (p > 0.05, Figure 8).

DISCUSSION
Idiopathic pulmonary fibrosis (IPF) is an end-stage lung disease with a short survival time after diagnosis is confirmed (Liu et al., 2018). Therefore, it is vital to illuminate the underlying mechanism of IPF pathogenesis and to explore potential biomarkers for early diagnosis in order to improve the prognosis of IPF patients. The present study used bioinformatic methods to screen IPF-related genes and tentatively tested their roles in the occurrence and prognosis of IPF based on the gene expression data of IPF patients in the GEO databases. Compared to other bioinformatics analyses, WGCNA is more valuable due to the comprehensive examination of links between co-expression modules and clinical traits with high reliability and biological significance (Wei et al., 2014;Kong et al., 2020;Lund et al., 2020). In the present study, 11 hub genes (COL1A2, COL1A1, COL3A1, SPP1, MMP1, POSTN, ASPN, CDH2, COL14A1, CTSK, MMP7) were found to be associated with the occurrence of IPF using WGCNA, differential gene expression analysis in the training datasets, PPI network analysis, and cytoHubba plugin analyses. Expression of these genes was validated in the GSE10667 dataset and in pulmonary fibrosis model mice using qRT-PCR. These results are consistent with those of previous studies in that COL1A1, COL1A2, COL3A1, COL14A1, POSTN, SPP1, MMP1, ASPN, MMP7, CDH2, and CTSK were up-regulated in a variety of samples from IPF patients and IPF-related mouse models, including blood, bronchoalveolar lavage, airway fibroblasts, and the lung tissue of human and mice, as well as involved in the occurrence of IPF (Supplementary Table S3) (Tzortzaki et al., 2003;Brule et al., 2005;Obermajer et al., 2006;Ivan et al., 2008;Naik et al., 2012;Vittal et al., 2013;Zhang et al., 2013;Hamai et al., 2016;Ghavami et al., 2018;Liu et al., 2018;Morimoto et al., 2018;Gao et al., 2019;Yu et al., 2020).
The collagen binding protein ASPN, a member of the family of leucine-rich repeat proteins, is produced by   fibroblasts (Nakajima et al., 2007). A previous study showed that the expression of ASPN was up-regulated in the IPF(+) soluble fractions (Åhrman et al., 2018). COL1A1, a functional gene that encodes the alpha 1 chain of type I collagen, is the main constituent of ECM component  and participates in cell proliferation, infiltration, metastasis, and angiogenesis. Its expression is related to many types of tumors (Bi et al., 2017;Ma et al., 2019;Dong et al., 2020). It is known that type III collagen interacts with type I and type II collagen in fibril formation and is an essential regulator of fibril diameter. The increase in type I, III, XIV, XV collagen content will lead to the formation of fibrils. Like many fibrotic disorders, IPF is characterized by enhanced deposition and remodeling of ECM (Tjin et al., 2017). Matrix metalloproteinases (MMPs) are a family of calcium-and zinc-dependent endopeptidases (Visse and Nagase, 2003) whose function is to regulate abnormal epithelial response to injury, fibroblast proliferation, extracellular matrix accumulation, and aberrant tissue remodeling (Ivan et al., 2008). MMPs play a role in different pathological processes such as atherosclerosis (Wågsäter et al., 2011), arthritis (Huang et al., 2017), tumor invasion, and pulmonary fibrosis (Ivan et al., 2008). POSTN plays a notable role in ECM structure and organization and principally in collagen assembly. Secreted phosphoprotein 1 (SPP1), also known as osteopontin-like protein, is a multifunctional secretory acidic glycoprotein . Earlier research displayed that POSTN can promote myofibroblast differentiation and type I collagen production, contributing to aberrant lung fibrosis. Serum levels of POSTN may predict the clinical progression of IPF (O'Dwyer and Moore, 2017; Morse et al., 2019;Alimperti and Andreadis, 2015). SPP1 can be secreted by various cells, such as osteoclasts, macrophages, epithelial cells, and endothelial cells. Prior studies reported that low-level local proliferation of macrophages that highly express SPP1 in the normal lung tissue was strikingly increased in IPF lungs, and macrophages that highly express SPP1 importantly contribute to lung fibrosis in IPF34. CDH2, a calcium-dependent cell adhesion protein and also a mesenchymal cell marker, preferentially mediates homotypic cell-cell adhesion by dimerization with a CDH2 chain from another cell (Alimperti and Andreadis, 2015). Previous study showed that CDH2 was involved in epithelial-mesenchymal transition (EMT) which contributes to pulmonary fibrosis (Gao et al., 2019). CTSK displays potent endoprotease activity against fibrinogen (Obermajer et al., 2006), and the expression level of CTSK was up-regulated in the silica induced pulmonary fibrosis model (Brule et al., 2005). These suggested that 11 hub genes identified in the present study play important roles in IPF occurrence.
To explore the prognostic value of these hub genes, survival analysis was performed. It was found that COL1A1, CTSK, MMP1, MMP7, and SPP1 could be potential biomarkers for poor prognosis of IPF patients because the survival rate of IPF patients with high expression levels of COL1A1, CTSK, MMP1, MMP7, and SPP1 was lower than that of those with low expression levels of the abovementioned genes. A number of studies have reported that type I (COL1A1) collagen and matrix metalloproteinases (MMP1 and MMP7) are fibrotic genes associated with the occurrence and prognosis of IPF (Ivan et al., 2008;Giménez et al., 2017;Adegunsoye et al., 2020;Liu et al., 2020;Xu et al., 2020;Mishra et al., 2021). However, further clinical investigation on these genes is warranted.
GO and KEGG pathway enrichment analyses revealed that COL1A1, COL1A2, COL3A1, COL14A1, POSTN, SPP1, MMP1, ASPN, MMP7, CDH2, and CTSK were involved in multiple biological processes related to extracellular matrix organization and collagen metabolism by protein digestion and absorption signaling pathways. In addition, correlation analysis, and PPI as well as cytoHubba plugin analyses showed a highly consistent positive correlation between hub genes in both the training group and the validation group. These results suggest that altered expression of these genes and their crosstalk might impact the development of IPF by modulating the above biological processes and signaling pathways.
The present study analyzed results of seven GEO datasets including 337 IPF samples and 120 control samples to identify possible biomarkers that facilitate elucidation of the underlying molecular mechanism of IPF using a variety of bioinformatics methods (WGCNA, different gene expression analysis, PPI, correlation analysis, etc.). These results were confirmed in 7 GEO datasets, indicating a high reliability and validity. They will shed light on the potential pathogenic mechanism of IPF and guide the development of more potent drugs for IPF.

CONCLUSION
In summary, results of the present study suggest that ASPN, CDH2, COL1A1, COL1A2, COL3A1, COL14A1, MMP1, POSTN, SPP1, MMP7, and CTSK are potential biomarkers of IPF. Altered expression of these genes and their cross-talk might impact the development of IPF by modulating IPF-related biological processes and signaling pathways. COL1A1, CTSK, MMP1, MMP7, and SPP1 are positively correlated with IPF prognosis. This study provides further insights into individualized treatment and prognosis for IPF.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/ Supplementary Material.

ETHICS STATEMENT
Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. Written informed consent for participation was not required for this study in accordance with the national legislation and the institutional requirements. The animal study was reviewed and approved by the Animal Research Ethics Committee of Renji hospital, Shanghai Jiao Tong University School of Medicine.

AUTHOR CONTRIBUTIONS
HW and XH designed the experiments, prepared the manuscript and performed data analysis. PC, MH, AC, TW, DD, WL, LT and XG did the data curation. LX and HL oversaw the project and revised the manuscript.

FUNDING
The present study was supported by a grant from Shanghai Fourth People's Hospital (No. 2019001).