Identification of a Three-Gene Signature Based on Epithelial-Mesenchymal Transition of Lung Adenocarcinoma Through Construction and Validation of a Risk-Prediction Model

Epithelial-mesenchymal transition (EMT) process, which is regulated by genes of inducible factors and transcription factor family of signaling pathways, transforms epithelial cells into mesenchymal cells and is involved in tumor invasion and progression and increases tumor tolerance to clinical interventions. This study constructed a multigene marker for lung predicting the prognosis of lung adenocarcinoma (LUAD) patients by bioinformatic analysis based on EMT-related genes. Gene sets associated with EMT were downloaded from the EMT-gene database, and RNA-seq of LUAD and clinical information of patients were downloaded from the TCGA database. Differentially expressed genes were screened by difference analysis. Survival analysis was performed to identify genes associated with LUAD prognosis, and overlapping genes were taken for all the three. Prognosis-related genes were further determined by combining LASSO regression analysis for establishing a prediction signature, and the risk score equation for the prognostic model was established using multifactorial COX regression analysis to construct a survival prognostic model. The model accuracy was evaluated using subject working characteristic curves. According to the median value of risk score, samples were divided into a high-risk group and low-risk group to observe the correlation with the clinicopathological characteristics of patients. Combined with the results of one-way COX regression analysis, HGF, PTX3, and S100P were considered as independent predictors of LUAD prognosis. In lung cancer tissues, HGF and PTX3 expression was downregulated and S100P expression was upregulated. Kaplan-Meier, COX regression analysis showed that HGF, PTX3, and S100P were prognostic independent predictors of LUAD, and high expressions of all the three were all significantly associated with immune cell infiltration. The present study provided potential prognostic predictive biological markers for LUAD patients, and confirmed EMT as a key mechanism in LUAD progression.


INTRODUCTION
Lung adenocarcinoma (LUAD) is the most common subtype of non-small-cell lung cancer (NSCLC), with an increasing incidence and poor patient prognosis (1)(2)(3). LUAD is highly heterogeneous and aggressive, which is often associated with genetic mutations (4,5). In recent years, advances in chemotherapy, radiotherapy, and targeted therapies have reduced LUAD mortality (6)(7)(8), but its long-term survival is still dismal when compared with other cancers (9). To the best of our knowledge, the molecular pathogenesis of LUAD remains largely unknown. Therefore, there is an urgent need to investigate the mechanisms of LUAD development in-depth and to search potential key prognostic markers.
In terms of primary tumor, spread of cancer cells resulted from metastasis is a main cause of lung cancer-related death (10). During tumor metastasis, local invasion of primary tumor affects only surrounding tissues, especially stromal cells, and shed tumor cells enter the circulatory system through endocytosis and survive in a fluid circulatory system environment. Tumor cells in circulatory system will in turn penetrate the circulatory system and migrate into distant tissues, forming tiny metastatic clones, which will then form visible metastatic foci through proliferation (11,12). In such a tumor metastasis process, epithelial-mesenchymal transition (EMT) is a highly critical mechanism, through which epithelial cells lose cell-cell junctions and polarity, resulting in loss of epithelial properties and acquisition of mesenchymal properties with invasive and migratory abilities (13). EMT process is often activated during the development of tumorigenesis and growth, invasion, migration, colonization, and therapeutic resistance (14)(15)(16). In recent years, many studies have reported EMT as a marker of poor prognosis in LUAD patients (17,18). Some studies showed the prognostic significance and biological functions of some EMT-related genes in LUAD (19), but such studies are more at an early stage and lack in-depth analysis.
In this study, we first collected EMT-related genes. Candidate genes were obtained through co-analysis with differentially expressed genes and total survival-related genes in LUAD. Then a gene signature prognostic risk prediction model was constructed by LASSO and COX regression analysis and validated. LUAD patients were divided into two groups of high and low risk according to risk scores, and prognostic significance of these gene signatures in LUAD was assessed through analyzing the correlation between different risk groupings and clinicopathological characteristics of patients.

Data Collection
The Tumor Genetic Atlas (TCGA) database and Genotype-Tissue Expression (GTEx) database were downloaded to obtain human cancer gene expression data and clinical data. The information of LUAD samples and other 32 cancer types in TCGA, including clinically relevant pathological parameters and RNA-seq of tumor tissue samples, was assessed according to data completeness of clinical sample and degree of matching with sequenced samples remove duplicate and censored samples as well as cases without clinical outcomes. EMT-related genes were obtained from the Epithelial-Mesenchymal Transition (EMT) gene database (http://www.dbemt.bioinfo-minzhao.org/) download. The independent validation cohorts GSE68465 and GSE68571 were downloaded from the Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo) using the GEO query R package. Tumor name abbreviations and corresponding meanings are given in Supplementary Table 1.

Differentially Expressed Genes
Differentially expressed genes in LUAD were screened using the edgeR package in R software with |logFC|≧1.5 and ADJUST P value<0.05. Volcano plots of the differentially expressed genes were drawn using ggplot2.

Prognostic Risk Modeling and Analysis
LASSO regression was applied to identify genes and develop a gene signature. The effect of gene signature on prognosis was validated by one-way COX regression. Multifactor COX regression was employed to detect prognostically independent predictors and construct a prognostic risk model. The risk score was calculated according to the model, and the median risk score was the threshold to divide patients into high-risk and low-risk groups. Kaplan-Meier survival curves were drawn to compare the survival of high-risk and low-risk patients. Also, the working curves of subjects were drawn, and the area under the curve was calculated to assess the predictive efficacy of the model.

Survival Analysis
Univariate cox regression analysis was done, and forest plots through the "forestplot" R package were used to display the P value, HR, and 95% CI of each variable. Use R software v4.0.3 for statistical analysis. If not otherwise stated, the rank sum test detects two sets of data, and a P value of <0.05 is considered statistically significant.

TIMER Analysis
The TIMER database (http://timer.comp-genomics.org/) was used to analyze the expression of genes in LUAD correlated with immune cell infiltration.

Protein Expression Validation
Immunohistochemical staining maps of protein expression in both liver cancer tissues and normal tissues were downloaded from the HPA database (The Human Protein Atlas) for validating the gene signature.

Gene-Set Enrichment Analysis
RNA-seq profiles were uploaded to GSEA to investigate key generelated signaling pathways in the high-risk group and the low-risk group. The enriched sets were screened based on a FDR < 0.25 and P < 0.05 after 1,000 permutations.

Cell Culture
The human bronchial epithelial cell line BEAS-2B and human LUAD cell lines NCI-H2009 were all purchased from Beyotime Biotechnology. All the cell lines were cultured in RPMI Medium 1640 (Gibco, USA) containing 10% fetal bovine serum (FBS) (Gibco, USA) and were incubated in a constant temperature incubator at 37°C and 5% CO 2 for future use.

Western Blot
The ice-cold lysate buffer was added to extract cell lines proteins. After centrifugation, the supernatant was extracted (12,000 rpm, 4°C, 10 min), and the protein concentration was measured using a BCA Protein Assay Kit (Beyotime, China). A total of 20 µg protein was separated using 10% SDS/PAGE, and then transferred to the membrane by wet transfer. The membrane was blocked with 5% skimmed milk powder. The primary antibodies of HGF (Abcam, 1:200), PTX3 (Abcam, 1:1,000), S100P (Abcam, 1:1,000), GAPDH (Abcam, 1:1,000) were added into membrane and then incubated overnight in shaking bed at 4℃. Tris-buffered saline with Tween 20 was used to wash the membrane for three times, 10 min each time. The secondary antibody (Goat Anti-Rabbit IgG, Abcam, 1:2,000) was subsequently added and incubated for 2 h at room temperature. Finally, protein bands were detected using the ECL Western Blot Detection Kit (Beyotime, China), and bactin was used as the internal reference protein.

Differentially Expressed Gene Screening
Under the conditions of |logFC|>1 and P<0.05 as the screening conditions, a total of 1,811 differentially expressed genes, including 1,200 downregulated genes and 611 upregulated genes, were obtained from TCGA-LUAD data. The visualization results are shown in the volcano plot ( Figure 1A). The prognostic characteristics of the top 20 genes with the most significant single-factor COX analysis in LUAD are shown in Figure 1B. The Epithelial-Mesenchymal transition (EMT) gene database (http://www.dbemt.bioinfo-minzhao.org/) was used to obtain 1,263 EMT-related genes, and the three categories of genes with overlaps were screened to obtain a total of 47 key genes ( Figure 1C).

Prognostic Prediction Modeling
LASSO regression analysis was conducted to further narrow the range of prognosis-related genes to ensure the result stability (Figures 2A, B). The risk score for each sample was calculated using the risk score formula: The median risk score served as the threshold value to divide the patients into high-risk and low-risk groups. The survival curves were plotted ( Figure 2C), and the results showed that the survival prognosis of high-risk patients was significantly worse than that of low-risk patients. The ROC curves of this risk score model in predicting the 1-, 3-, 5-year survival of LUAD patients were plotted ( Figure 2D), and the AUC areas were 0.712, 0.718, and 0.712, respectively, indicating that this risk model had a high prediction accuracy.

Clinicopathological Characteristics of Different Risk Score Subgroups
Details of the distribution of clinicopathological characteristics of different risk score subgroups (high-risk and low-risk) are shown in a heatmap ( Figure 3A), and the details are shown in Table 1.
The results of univariate and multifactorial COX regression analyses displayed that risk score, TNM stage, T stage, S100P, HGF, and PTX3 could be used as independent predictors of prognosis of patients with LUAD ( Figures 3B-D).

Validating the Prediction of the Risk Model in Independent Cohort
We constructed a prognosis predicting risk model based on HGF, PTX3, and S100P base on TCGA datasets ( Figure 4A). The risk score for each sample was calculated with the risk score formula: Riskscore = (-0:2793) * HGF + (0:0885) * S100P + (0:3172) * PTX3 The optimal cutoff risk score served as the threshold to divide the patients into high-risk and low-risk groups. The survival curves were plotted, and the results showed that the survival prognosis of high-risk patients was significantly worse than low-risk patients ( Figure 4B). To further assess the prediction of the risk model, GSE68465 and GSE68571 derived from GEO database were employed as validation cohorts. Cox regression analysis and Kaplan-Meier curve demonstrated that the prognosis of high-risk patients was worse than low-risk patients, which was consistent with the results found in the TCGA-LUAD cohort (Figures 4C, D).
Expression of HGF, PTX3, and S100P in Pan-Cancer and LUAD We analyzed the expression of HGF, PTX3, and S100P in tumors by integrating data from TCGA and GTEx database samples, and it was found that HGF and PTX3 expression was downregulated and S100P expression was upregulated in most tumors as compared with normal tissues (Figures 5A, C, E). TCGA database samples and integration of GTEx database normal samples displayed that in LUAD, HGF and PTX3 expressions were both downregulated and S100P expression was upregulated ( Figures 5B, D, F).
Survival Analysis of HGF, PTX3, and S100P The results of Kaplan-Meier analysis showed ( Figures 6A, C, E) that high and low expressions of HGF, PTX3, and S100P were significantly correlated with patient prognosis. Specifically, high expression of HGF, PTX3, and S100P were all considered as poor prognostic factors in LUAD patients; however, ROC curves showed that all the three were less accurate when predicting the prognosis ( Figures 6B, D, F). We further investigated the association of HGF, PTX3, and S100P expression with overall survival in 33 tumors via univariate survival analysis. As shown in Figure 7, HGF could significantly affect the overall survival of BLCA, ESCA, KIRC, LGG, LIHC, STAD, THCA ( Figure 7A); PTX3 could noticeably affect the overall survival of ACC, BLCA, CESC, GBM, HNSC, KIRC, LGG, LIHC ( Figure 7B); S100P could significantly affect the overall survival of CESC, LUAD, THCA, THYM ( Figure 7C).

Correlation of HGF, PTX3, and S100P With Immune Cell Infiltration
The expression levels of HGF in LUAD were significantly positively correlated with tumor purity, B cells, CD8+ T cells, CD4+ T cells, neutrophils, and dendritic cells ( Figure 8A). However, the expression levels of PTX3 in LUAD were significantly negatively correlated with tumor purity and positively correlated with B cells, CD8+ T cells, CD4+ T cells, neutrophils, and dendritic cells ( Figure 8B). The expression level of S100P was negatively correlated with B cells, CD8+ T cells, CD4+ T cells, neutrophils, and dendritic cells but positively correlated with tumor purity in LUAD ( Figure 8C).

Validation of HGF, PTX3, and S100P Protein Expression
The protein expression of HGF, PTX3, and S100P in lung cancer tissues and normal lung tissues was validated using the HPA online database. The results demonstrated that HGF was expressed in normal lung tissues but it was not detected in lung cancer tissues ( Figure 9A). PTX3 was mildly expressed in lung cancer tissues but was not detected in normal alveolar cells ( Figure 9B). S100P was moderately expressed in lung cancer tissues but was not detected in normal lung tissues ( Figure 9C).

Gene Set Enrichment Analysis
To examine the effect of gene expression on tumors, we divided the human pan-cancer samples into two groups with high and low expression, according to the expression of HGF, PTX3, S100P, and the enrichment of signaling pathways in KEGG and HALLMARK in high-and low-expression groups was analyzed by GSEA. The top three signaling pathways most significantly enriched in both databases have been listed. GSEA verified that HGF was mainly enriched in hematopoietic cell lineage and inflammatory response ( Figure 10A), and that PTX3 was mainly enriched in hematopoietic cell lineage and epithelial mesenchymal transition ( Figure 10B), and that S100P was mainly enriched in ribosome and MYC target ( Figure 10C).

In Vitro Experiments
We further verified the protein expression level of HGF, PTX3, S100P in vitro. Western blotting results showed that the protein expression level of HG F a nd PTX3 we re downregulated in NCI-H2009 versus BEAS-2B, while S100P was upregulated ( Figure 11).

DISCUSSION
Epithelial-mesenchymal transition (EMT) is a fundamental and critical cell biological process during embryonic development (20). EMT is a reversible process, with the reversal being referred to as mesenchymal-epithelial transition (MET). EMT and MET are essential for embryonic development, tissue regeneration, and wound healing, and its role in contributing to organ fibrosis, cancer progression, and metastasis requires a full understanding (21,22). EMT confers tumor cells the ability to detach from the primary tumor mass and invade lymphatic vessels and blood vessels through extracellular matrix, allowing them to reach secondary tissues or distal organs and reactivate epithelial cell properties to form metastatic foci at secondary sites via the MET pathway (23). Assessing EMT state in tumors is challenging as the process is transient and reversible (24). Previous studies have focused on exploring the biological functions and molecular mechanisms of EMT-related genes, such as the expression of E-cadherin, claudins, ocludins, and cytokeratins, and considered these genes as common markers of epithelial state, and Vimentin (VIM), fibronectin, and a-SMA as the markers of mesenchymal state (25)(26)(27). However, for some circulating tumor cells (CTCs) underwent EMT, the expression of these (25-27) epithelial cell adhesion-based molecular markers is difficult to be detected (28). Immunicon, an FDAapproved company, has demonstrated that the number of CTCs is indicative of patient prognosis, and Tang et al. (19) first developed a new EMT-related gene signature and constructed a nomogram to predict prognosis of LUAD patients. In addition, the construction of predictive models based on EMT-related genes in LUAD has not been reported. Therefore, searching prognostic predictive biomarkers for LUAD patients based on EMT-related genes may be a promising approach. In this study, we identified three prognosis-related gene signatures, namely, HGF, PTX3, and S100P. We first constructed a prognostic prediction risk model based on 20 EMT-related gene signatures and verified that the model had strong prediction accuracy through KM curve and ROC curve. Patients were grouped according to their risk score differences. HGF, PTX3, S100P, and risk score were all identified as independent predictors of prognosis of LUAD patients through performing univariate and multifactorial COX regression analyses on these 20 genes and risk scores. The results showed that hGF and PTX3 expression was downregulated and S100P expression was upregulated in LUAD, and that the expression of all the three was correlated with immune cell infiltration, suggesting that all the three gene could promote tumor progression. HGF, which is a cytokine produced by mesenchymal fibroblasts (29), could stimulate the mi gration, proliferat ion, migration, cell s urvival, morphogenesis, and angiogenesis of epithelial cells (30). It has been found that the HGF/c-Met signaling pathway may influence multiple aspects of tumor development by activating specific pathways that induce interactions between cancer cells and the tumor microenvironment in which they reside (29). For example, in EGFR-mutant lung cancer, HGF may be involved in endogenous and acquired resistance to acid kinase inhibitors (31). GSEA results also demonstrated that HGF was mainly enriched in inflammatory response. PTX3 in the pentraxin family is an immunomodulatory factor involved in angiogenesis, proliferation, and immune escape in cancer (32). It has been reported that PTX3 tends to be expressed in the stroma rather than in tumor cell components, suggesting  that the PTX3 gene is epigenetically modified and silenced in cancer cells (33). This is similar to our findings, as we found that PTX3 was low-expressed in tumor cells by protein expression. S100P is a member of the S100 calcium-binding protein. S100P in the S100 calcium-binding protein family is originally isolated from human placenta (34), and S100P expression is elevated in a variety of tumor cell lines and tumor tissues, including in lung cancer (35), pancreatic cancer (36), and breast cancer (37). Moreover, it is also a tumor microenvironment-associated gene (38). Our western blotting results also demonstrated that the protein expression level of HGF and PTX3 were upregulated in NCI-H2009 cell line compared to BEAS-2B cell line, while the protein expression level of S100P was downregulated.
There are some limitations to this study. Firstly, LUAD is a highly heterogeneous tumor, but certain key clinical variables were not available in public databases, which limited the comprehensiveness when developing the prognostic model. Furthermore, the predictive ability of the EMT-related gene signature using these three genes was only based on bioinformatics analysis, and more basic experiments and clinical evidence are needed to validate the signature.
In summary, this study constructed a prognostic model based on 20 EMT-related gene signatures and validated the performance the model. The final three prognostic predictors of LUAD patients were analyzed by COX regression analysis to obtain three independent factors. These findings may provide a new direction for prognosis prediction and individualized treatment for patients with LUAD.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding authors.