Identification of Ferroptosis-Related Genes Signature Predicting the Efficiency of Invasion and Metastasis Ability in Colon Adenocarcinoma

Background: Colon adenocarcinoma (COAD) is one of the most prevalent cancers worldwide and has become a leading cause of cancer death. Although many potential biomarkers of COAD have been screened with the bioinformatics method, it is necessary to explore novel markers for the diagnosis and appropriate individual treatments for COAD patients due to the high heterogeneity of this disease. Epithelial-to-mesenchymal transition (EMT)-mediated tumor metastasis suggests poor prognosis of cancers. Ferroptosis is involved in tumor development. EMT signaling can increase the cellular sensitivity to ferroptosis in tumors. The aim of our study is finding novel prognostic biomarkers to determine COAD patients for predicting efficiency of metastasis status and targeting precise ferroptosis-related therapy. Methods: A novel gene signature related to metastasis and ferroptosis was identified combing with risk model and WGCNA analysis with R software. The biological functions and predictive ability of the signature in COAD were explored through bioinformatics analysis. Results: We established a four-gene prognostic signature (MMP7, YAP1, PCOLCE, and HOXC11) based on EMT and ferroptosis related genes and validated the reliability and effectiveness of this model in COAD. This four-gene prognostic signature was closely connected with metastasis and ferroptosis sensitivity of COAD. Moreover, WGCNA analysis further confirmed the correlation between PCOLCE, HOXC11, and liver and lymphatic invasion of COAD. Conclusion: The four genes may become potential prognostic biomarkers to identify COAD patients with metastasis. Moreover, this four-gene signature may be able to determine the COAD suitable with ferroptosis induction therapy. Finally, PCOLCE2 and HOXC11 were selected individually because of their novelties and precise prediction ability. Overall, this signature provided novel possibilities for better prognostic evaluation of COAD patients and may be of great guiding significance for individualized treatment and clinical decision.


INTRODUCTION
Colon adenocarcinoma (COAD), one of the most prevalent cancers worldwide, has become a leading cause of cancer death (Bray et al., 2018;Miller et al., 2019). Over 1 million new COAD cases and 551,269 deaths occurred in 2018, accounting for about 1 in 20 cancer cases and deaths (Bray et al., 2018). It is estimated that more than 2.2 million new cases and 1.1 million colorectal cancer deaths occur by 2030 (Arnold et al., 2017). Despite the improvements in surgical and medical therapies, the 5-years survival rate of COAD patients is still unsatisfied in the past several decades (Kuipers et al., 2015;Siegel et al., 2020). Moreover, the 5-years survival rates for patients with early stage cancer are approximately 90%, whereas for patients diagnosed with advanced-stage metastatic disease is only 10% (Kuipers et al., 2015), indicating that metastasis is a critical cause of death for cancer patients. Therefore, there is an urgent need to identify novel prognostic biomarkers related to the tumor metastasis, which may facilitate the timely diagnosis and application of appropriate individual treatments for COAD patients.
At this time, numerous studies have focused on screening new genes and established multi-gene prognostic models. Zuo et al. identified a 6-gene signature (EPHA6, TIMP1, IRX6, ART5, HIST3H2BB, and FOXD1) significantly associated with the survival of colorectal cancer patients (Zuo et al., 2019). Liu et al. constructed an invasion-related 6-gene signature (ITLN1, HOXD9, TSPAN11, GPRC5B, TIMP1, and CXCL13) that could evaluate the prognostic risk of patients with colon adenocarcinoma (Liu et al., 2021). Moreover, Wang et al. established a 12-gene based prognostic signature based on DNA damage and repair related gene sets that is beneficial for the evaluation of clinical prognosis on colorectal cancer patients (Wang X.-Q. et al., 2021). Further data mining using the public datasets via different angles may display other prognosticrelated genes and construct a reliable prognostic model, which may provide a tool for prediction of prognosis and selection of individualized treatment.
Invasion and migration of cancer cells into peritumoral tissues and organs could result in cancer metastasis. Epithelial-tomesenchymal transition (EMT), a process characterized by loss of intercellular adhesion properties and acquire of mesenchymallike phenotype in epithelial cells, is associated with tumor invasion, migration, and metastasis (Pastushenko and Blanpain 2019). Ferroptosis, an iron-dependent form of cell death triggered by excessive lipid peroxidation, has been confirmed in playing an important role in the development and therapy of tumors (Hassannia et al., 2019). Ferroptosis reagents such as erastin and RSL3 could induce ferroptosis and suppress tumor development (Chen et al., 2021). Intriguingly, it is reported that EMT signaling can increase the cellular sensitivity to ferroptosis in tumors (Chen et al., 2021). For example, transcription factor ZEB1 can stimulate EMT-mediated tumor metastasis and high transcript levels of ZEB1 can sensitize cancer cells to ferroptosis through promoting the increase of PPARc, a major regulator of lipid metabolism in liver . Cancer cells with resistance to traditional treatment or high metastasis tendency may be particularly prone to ferroptosis, thus opening up a new field of targeted therapy research. EMT signals can also promote ferroptosis in human cancer cell lines and organs, the high degree of interstitial cell status is associated with the susceptibility to ferroptosis (Hanahan and Weinberg 2011;Chen et al., 2021). Therefore, we speculated that the application of ferroptosis inducer in tumors with EMT may be a potential therapeutic strategy to suppress tumor progress.
Our goal is to provide insight and summary on the mechanism and function of ferroptosis in tumor development and its potential therapeutic targets. The aim of our study is finding novel prognostic biomarkers for identification of in COAD patients for ferroptosis-related therapy. In this study, the mRNA expression data and clinical information of COAD patients from The Cancer Genome Atlas (TCGA) datasets were used to screen some genes related to metastasis and ferroptosis. Membrane fatty acid composition has attracted more and more attention in promoting cell survival, lipotoxicity limitation and ferroptosis. In addition, it is now clear that tumor cells show plasticity in fatty acid metabolism and respond to extratumoral and systemic metabolic signals (such as obesity and tumor treatment) to promote the development of aggressive, treatment resistant and related diseases (Hoy et al., 2021). A patients' BMI level in COAD datasets was a critical strategy for screening. Then, we established a four-gene prognostic signature based on these screened genes. The effectiveness of the prognostic model was verified using another mRNA data from Gene Expression Omnibus (GEO) datasets. Moreover, a nomogram was constructed to predict overall survival and confirmed the prognostic value of our prediction model. Additionally, weighted gene co-expression network analysis (WGCNA) further demonstrated that two of the four genes are closely correlative with liver and lymphatic invasion of COAD. Our four-gene signature may provide new insights for the survival predictions and treatment options of COAD. were downloaded from the TCGA database (Astolfi et al., 2020), then were collected and preprocessed. Of these, 521 samples contained prognostic information. The gene expression data were log2-transformed (normalized RSEM count +1). Genes w ith low or no expression were then picked out from the analysis. The standardized mRNA data of GSE75500 (Evert van den Broek et al., 2016 Nov 8) were obtained from GEO database (https:// www.ncbi.nlm.nih.gov/geo/). Specifically, these datasets included 114 COAD samples processed on the GPL16694 Agilent-022522 SurePrint G3 CGH Array 4x180K (Probe Name Version) of GSE75500, which were applied as validation sets. Quality controls included relative expression (RLE) and standardized scale-free standard error (NUSE) implemented in the affyPLM package provided by Bioconductor (Wang J et al., 2021) (www. bioconductor.org). Raw gene expression data were background corrected using the Robust Multi-Array Average (RMA) method, standardized using the Quantiles method, and summarized by the median polish method. The above methods were all well applied in the flexible package provided from Bioconductor. According to the annotation file in the platform provided by the chip manufacturing (GPL16694 for GSE75500), the probe label is converted into a genetic symbol. The average expression values were set as relative expression values of genes which were matched with multiple probes. Additionally, 28 proliferation and metastasis regulatory genes were identified from the human molecular characterization database V7.2 (MSIgDB). 521 patients with primary COAD in TCGA were selected as the experimental group. The characteristic genes were screened and the TNM staging model was established. To identify differentially expressed genes in samples from advanced patients (stage Ⅲ and stage IV), a Bayes test was performed (FDR<0.05). The above differentially expressed genes were intersected with 28 proliferative and metastatic genes, and the proliferation and metastasis related genes correlated with the prognosis of COAD were preliminarily obtained.

Differentially Expressed Genes Screening
The differentially expressed genes (DEGs) of the TCGA clinical staging dataset were screened by limma (Ritchie et al., 2015) package, and the results were visualized by volcano plot. The volcano plot was generated by ggplot2 (Ito and Murphy 2013). | Log2FC| > 1, p < 0.05 were set as the cutoffs for the DEGs.

Functional Analysis
Gene ontology (GO) (The Gene Ontology 2017) and Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al., 2017)pathway enrichment analysis were performed to identify the functions of DEGs using the cluster Profile package. GO enrichment analysis included cellular component (CC), molecular function (MF), and biological process (BP), and the parameters were set to p < 0.01, minimum count >3, enrichment factor >1.5. p < 0.05 was considered statistically significant. The Gene Set Enrichment Analysis (GSEA) (Aravind Subramanian et al., 2005) was used to perform pathway enrichment for DEGs, and 2000 alignments per analysis. The KEGG pathways data set in the curated gene sets was selected. The threshold for the statistically significant GSEA analysis was set to the corrected p < 0.05 and false discovery rate (FDR) < 0.25. Selecting "c2.cp.kegg.v7.0.symbols.gmt" as a reference gene set. p < 0.05 and FDR<0.25 was considered to be significantly enriched. The result of enrichment analysis would be characterized by corrected p values and NES. GSEA enrichment analysis and visualization were performed using GSEA local software.

Determination of Final Prognostic Signature Using Lasso Cox Regression Analysis
Lasso Cox regression analysis was used to determine the final prognostic signature using glmnet (Jerome Friedman and Tibshirani, 2010). Genes identified in the univariate analysis as covariates were included in the multivariate Cox regression analysis to determine their effect on the OS. The prognostic model was established by combining the expression values of the selected genes with multivariate Cox regression coefficients, that is, the risk score of each patient was equal to the sum of the gene expression values and multivariate Cox regression coefficients. The median risk score was selected as the threshold to stratify COAD patients for the evaluation of the model prediction effect. Patients with the risk score greater than the threshold were divided into highrisk group, and the remaining patients to low-risk group. The difference in survival rate between the two groups was evaluated by a log-rank test. The author plotted the survival curve using fit plot.

Nomogram and Risk Assessment Analysis
Risk score was accessed using time-dependent receiver operating characteristic (ROC) curves to predict specificity and sensitivity of COAD patients' survival at 1, 3, and 5 years of follow-up, while risk scores and other clinical indicators (age, sex, TNM stage, and histological grade) were observed. The ability of the model to predict OS was verified in the validation datasets (GSE75500). In addition, based on the increased value of risk, a dot plot was used to show the distribution of patient mortality events. Heat map was used to observe the expression distribution of each characteristic gene in two different risk groups. To test the independence and reliability of the prognostic model, we selected clinically common prognostic indicators for COAD (age, sex, TNM stage, and histological grade). Among these possible prognostic factors, risk score and age were used as continuous variables, and gender (male/female), TNM stage (III-IV/I-II), and histological grade (4/3/2/1) were converted into categorical variables. Univariate and multivariate Cox regression analyses were performed for clinical characteristics and risk scores of COAD patients in the TCGA cohort to identify clinical factors associated with survival. Log-rank test was used to verify whether the risk score was related to other survival related clinical information.
Variables that could be used as independent prognostic factors were then used to establish a nomogram.

PPI Network Construction
String Database (http://string-db.org, version 11) (Szklarczyk et al., 2021) was used to construct protein-protein interaction networks. All the related differential molecules after the intersection of different mRNAs were put into the database for analysis, and the interaction threshold was set at 0.4. Cytoscape software was used to visualize molecular interaction networks. A plug-in of Cytoscape software: Cytohubba was used to analyze hub genes in the network.

Receiver Operating Characteristic Diagnostic Prediction of Genes and Risk Assessment
ROC prediction analysis was performed to acquire the specificity and sensitivity of each gene on the screening significant key gene using time ROC package (Blanche et al., 2013). Survival package was used to calculate the clinical parameters and grouping parameters for Survival prognosis and cumulative risk factor.
Weighted Gene Co-expression Network Analysis (WGCNA) WGCNA was conducted using the WGCNA package (Guy et al., 2019). It is a systematic way for effectively acquiring the Ferroptosis-driver and -marker genes were shown with venn diagram from FerrDb database (G); Ferroptosis-driver genes were 108; Ferroptosis-marker genes were 111; the intersected genes co-expressed between ferroptosis-driver and -marker genes were 15; The correlation between DEGs and ferroptosis-related genes was shown with representative correlation plot (H); Blue bar indicated: R (sperman's coefficient) > 0; red bar indicated: R (sperman's coefficient) < 0; The DEGs correlated positively with ferroptosis-driver and -marker genes were presented with UpsetR diagram (I); green bar indicated the final ferroptosis-related genes.
Frontiers in Cell and Developmental Biology | www.frontiersin.org January 2022 | Volume 9 | Article 815104 4 expression patterns of multiple genes in different samples, which can obtain a gene cluster with the same clinical phenotype.

Statistical Analysis
All statistical analysis considered the calculation of mean and standard deviation, and was performed using R software 3.4.0.3. p < 0.05 was considered statistically significant. Univariate and multivariate Cox regressions were analyzed by COXPH function. Lasso-Cox regression analysis was conducted with R software package glmnet. In addition, the Log-rank test was performed using the survdiff (Tan et al., 2020) function in the survival package. The heatmap of characteristic gene expression was drawn by ggplot2 and heatmap function. The R software package rms  was used to achieve the establishment and application of the nomogram. The calibration analysis (Guy et al., 2019) was performed aligned with nomogram. Decision curve analysis (DCA) (Van Calster et al., 2018) was conducted to verify the risk model.

Screening of Ferroptosis-Related Differentially Expressed Genes
The COAD patients in TCGA datasets were divided into two groups by metastasis status (with metastasis and without metastasis) and BMI relative value (BMI_H, BMI_L). Besides, The COAD patients in GEO datasets were also divided into two groups according to the patient's survival status (good survival time and poor survival time) ( Figure 1A). Then, we identified the DEGs by comparing the RNAexpression of the three groups of patients in the TCGA and GEO datasets. The DEGs were displayed in the heat map ( Figure 1B). The volcano plot of the up-regulated DEGs and down-regulated DEGs (p < 0.05) was shown in Figures 1C-E. According to the clinical stage, 875 up-regulated genes were obtained in metastasis group. In addition, there were 1527 up-regulated genes in BMI_H group in the TCGA dataset, and 401 up-regulated genes by patient survival status in the GEO dataset. A Venn diagram displayed 107 DEGs obtained through the intersection of the above up-regulated DEGs ( Figure 1F). And then we downloaded ferroptosis Driver and Marker genes from the FerrDb database, the numbers of ferroptosis Driver and Marker genes were shown with venn diagram ( Figure 1G). Correlation analysis was conducted between 107 DEGs and ferroptosis Driver and Marker genes, and the representative results were displayed in a correlation plot ( Figure 1H). Finally, the 74 intersected genes tightly correlated with ferroptosis were identified ( Figure 1I).

Functional Enrichment Analysis of Ferroptosis Related Differentially Expressed Genes and Screening of Epithelial-to-Mesenchymal Transition Related Genes by Gene Set Enrichment Analysis
Furthermore, we conducted GO/KEGG pathway enrichment analysis on these intersected DEGs. For BP, DEGs were significantly enriched in typeⅠinterferon receptor binding, receptor ligand activity, hormone activity and other pathways related to interferon, hormone synthesis pathway and receptor coupling pathway. The CCs analysis indicated that DEGs were mainly enriched in intermediate filament cytoskeleton, intermediate filament and other cellular components associated with intracellular fibrous skeleton protein remodeling and connection. In MF, DEGs mainly enriched in the regulation of trans-synaptic signaling, modulation of chemical synaptic transmission and other key functions of intercellular signaling transmission correlated to synaptic signal transmission, remodeling and regulation ( Figures  2A,B). Key information was shown in Table 1. The KEGG pathway analysis revealed that DEGs were mostly enriched in pathways in focal Adhesion, hippo signaling pathway, MAPK signaling pathway and other pivotal biological signaling pathways such as invasion, migration, and intercellular junctions, as well as PI3K-Akt signaling pathway, calcium signaling pathway and other signaling pathways related to carcinogenic and calcium channels regulation ( Figures 2C,D). The key enrichment information was sorted into Table2. And then The TCGA COAD samples were divided into the "Metastasis" group and the "No Metastasis" group by metastasis status. Positive correlation enrichment pathways were acquired by GSEA, mainly including Focal adherens, Iron transportation, MAPK signaling pathway, oxidative phosphorylation and other pathways related to invasion, migration, carcinogenesis in BP( Figure 2E). Furthermore, we extract a gene set of some critical metastasis-related signaling pathways, including Focal adherens, iron transportation, MAPK signaling pathway, oxidative phosphorylation. Correlation analysis was conducted between 74 DEGs screened above and the genes tightly correlated with metastasis, and the results were displayed in a venn diagram ( Figure 2F). Finally, we identified 35 genes that are highly associated with the core signaling pathways for further model screening.

Lasso-Cox Analysis and Screening of Key Target Genes
The Lasso regression analysis was performed and four genes were identified (PCOLCE2, MMP7, YAP1, and HOXC11) ( Figures  3A,B). To determine the role of these four genes in COAD prognosis, Kaplan-Meier survival analysis was used to compare patient survival prognosis between the high expression and low expression groups of each prognostic gene. The results showed that the survival rate of patients in the high expression group was much lower than that in the low expression group (p < 0.05, HR > 1.5). Compared with low expression group of PCOLCE2, the median survival time in high expression group was significantly smaller (HR 1.83, p 0.003) ( Figure 3C). Likewise, the median survival time in high expression group of HOXC11 was also significantly smaller than in lower expression group (H 1.66, p 0.013) ( Figure 3D The gene set enrichment analysis (GSEA) was conducted between "metastasis" group and "without metastasis" group, the core pathways were selected, p value < 0.05, FDR<0.25 (E); the correlation analysis was performed between the 74 DEGS and gene sets from four core pathways, the results were shown with venn diagram (F).  Figure 4C), PCOLCE2 ( Figure 4D) and ferroptosis genes were shown with co-expressed heatmap for validating efficiency of screening. Besides, the correlation analysis between HOXC11 ( Figure 4E), PCOLCE2 ( Figure 4F), MMP7 ( Figure 4G), YAP1 ( Figure 4H) and EMT-related genes were also performed for further exploring targeted genes; the results indicated that these four genes were tightly correlated with both ferroptosis and EMT process.

Establishment of a Risk Prediction Model Based on Invasion and Transfer Genes
Based on the above four prognostic genes (PCOLCE2, MMP7, YAP1 and HOXC11), we constructed a metastasis-related prognostic model. In our model, the higher the risk score, the earlier the patient event occurred, and the higher the expression of the four genes ( Figure 5A). Multivariate regression analysis was performed for each prognostic gene to obtain the regression coefficient of each gene. 521 COAD samples were evaluated using TCGA cohort, and each sample was given a risk score and assigned to a risk group. Firstly, to evaluate the effectiveness of the model, we compared the survival differences between the high risk and low risk groups. Results showed that when the four screened predicted genes were highly expressed, the OS of the patients was significantly decreased, and the prognosis of high risk group was significantly worse than that of the low risk group (p < 0.0001) ( Figure 5B). To further evaluate the predictive performance of this risk model, different clinical parameters between high and low risk groups were selected for model validation. Results showed that the median OS of high risk group is significantly shorter than that of low risk group, regardless of in M stage (M0&M1), tumor recurrence (R0&R2), histomathological grade (Ⅰ&Ⅳ) and clinical stages (Ⅰ&Ⅳ) (Figures 5C-F). The above results showed that we have established an effective prognostic model for COAD associated with proliferation and metastasis.
To exclude the possibility of model overfitting in TCGA, we verified the model in GEO dataset. We analyzed the prognosis and the expression level of the four genes between high risk and low risk group in the GEO dataset. Results showed that the four prognostic genes expression is higher in high risk group, and the prognosis is worse in high risk group than that in low risk group (Supplementary Figure S1A). The median OS of patients in high risk group was significantly (p < 0.0001) shorter than that in lowrisk group. In addition, we also verified the effectiveness of the model by combining with clinical characteristics and risk score with nomogram model and calibration analysis in 1, 5, 10 survival years (Supplementary Figures S1B,C). The median OS of high risk group was significantly shorter than that of low risk group, regardless of in tumor recurrence (R0&R2), T stage (T1&T3) and lymphatic invasion (transfer & no transfer) (Supplementary Figures S1D-F). The results indicated that this model related to proliferation and metastasis was robust on different platforms.

Identification of Key Prognostic Genes Particularly Containing PCOLCE2/HOXC11
WGCNA is a systematic way for effectively acquiring the expression patterns of multiple genes in different samples, which can obtain a gene cluster with the same expression pattern. The association between modules and phenotype of samples such as clinical traits can be studied. A total of 234 samples with clinical characteristics were included in WGCNA ( Figure 6A). In this study, the power of β 3 (scale-free R2 0.850) was selected as the soft threshold to establish the scale-free network ( Figures 6B,C). Therefore, six co-expressed modules were identified after removing the gray modules by combined dynamic tree cutting ( Figure 6D). The TOM was mapped to 1000 genes selected in the analysis, indicating that each module was independently verified. We found that the green module containing PCOLCE2/HOXC11 was correlated with COAD liver invasion and lymphatic invasion (R2 >0.3, p < 0.05). A scatter plot was mapped between GS and MM (green module and liver, lymphatic invasion) in COAD patients. Therefore, PCOLCE2/HOXC11 was considered to be one of the critical genes in the green module. PCOLCE2/HOXC11 was identified as the common target gene between the risk model and weighted coexpressed network through Venn diagram ( Figures 6E-G). The percentage and types of SNV of COAD in TCGA samples were presented ( Figure 7A), and top ranking genes with high percentage of SNV in high risk score group of COAD samples were shown ( Figure 7B). The results indicated that high risk score group accompanied higher percentage of SNV which predicted poor prognosis. Besides, the percentage of copy number variation (PCOLCE2, 7%; HOXC11, 6%) was calculated in TCGA samples ( Figure 7C). Furthermore, ROC analysis was used to determine the prognostic values of these two genes. The corresponding area under curve (AUC) value for HOXC11 was 0.734 (Supplementary Figure S2A), and the AUC value for PCOLCE2 was 0.950 (Supplementary Figure S2E). Additionally, the expression of the two prognostic genes in COAD was analyzed according to various clinicopathological features, including NM tumor stage, lymph node metastasis and peritoneal metastasis. The results revealed that the expression  Figures  S2B-D,F-H). Overall, the results further confirmed that the two genes screened in this study were unfavorable factors for the survival status of COAD patients. In addition, we further analyzed the clinical pathological factors and risk scores (p < 0.05, HR > 1) by univariate and multivariate Cox regression analysis, which further showed the prognostic value of the risk model in these four genes. The correlation between the clinical characteristics, the risk score and the expression level of the Frontiers in Cell and Developmental Biology | www.frontiersin.org January 2022 | Volume 9 | Article 815104 8 two genes of COAD patients in the TCGA cohort and the prognosis was verified. Univariate Cox regression analysis showed that risk score and TNM stage were significantly correlated with OS (p < 0.0001). Meanwhile, we established a nomogram model using clinicopathological factors, risk scores and these two key genes in combination with 1, 5, and 10 years survival times, confirming that the expression levels of these two genes could more accurately predict 1-, 5and 10-years survival time (Supplementary Figure S3A), and Time ROC analysis of these two genes (HOCX11 and Frontiers in Cell and Developmental Biology | www.frontiersin.org January 2022 | Volume 9 | Article 815104 10 FIGURE 6 | Screening of key phenotypic module and key genes by WGCNA. Hierarchical clustering dendrogram of samples from TCGA database (A). Analysis of scale-free fit index and the mean connectivity for various soft-thresholding powers. Testing the scale-free topology when β 3 (B). Hierarchical clustering dendrogram of genes with dissimilarity on the basis of topological overlap. Modules are the branches of the clustering tree (C). Correlation between module genes and clinical traits. Each row corresponds to module genes, and columns represent clinical traits. Each cell contains the correlation and p value, and the green module containing PCOLCE2/HOXC11 are selected. The red bar represents positive correlation, while the blue bar represents negative correlation (D). The heat map describes the TOM among the 1000 genes selected in WGCNA. Gene modules are represented by horizontal and vertical coordinates, and highly overlapping modules are marked by highbright parts (E). The green module containing PCOLCE2/HOXC11 was correlated with COAD liver and lymphatic invasion. A scatter plot was mapped between GS and MM (green module and liver invasion) in COAD patients (F). The intersected gene, PCOLCE2/HOXC11, was identified between risk factors in the risk model, and genes contained in green modules with Venn diagram (G).
Frontiers in Cell and Developmental Biology | www.frontiersin.org January 2022 | Volume 9 | Article 815104 11 PCOLCE2) had prediction ability in 1, 3, 5 survival time (AUC>0.5) (Supplementary Figures S3B,C). Additionally, combined with the clinical information of patients, we selected 433 patients in the high risk and low risk groups for univariate and multivariate forest prediction, and found that HOCX11 and PCOLCE2 both had high risk (HR > 1), which has a negative effect on the prognosis of patients (Supplementary Figures 3D,E). And the table was listed  for reference (Table3), which further demonstrated the prognostic value of these two genes.

Validation for the Expression of Predicted Hub Genes
To further explore the related functions of the prognostic genes (PCOLCE2 and HOXC11), we performed a correlation analysis. The positively correlated genes of PCOLCE2 and HOXC11 were respectively identified and showed by a heat map (Supplementary Figures S4A,B). All the acquired genes were performed GO/KEGG pathway enrichment analysis. Results showed that the positively correlative genes were mainly enriched in the cargo receptor activity, scavenger receptor activity and amyloid-beta binding pathway in BP. For CC, the positively correlative genes are significantly enriched in neuron to neuron synapse, axon part, collagene-containing extracellular matrix regulation. The MF analysis revealed that these genes mostly enriched in the regulation of cation channel activity, extracellular structure organization, axonogenesis and other pathways related to key intercellular signal transduction, including extracellular matrix remodeling and signal pathway regulation (Supplementary Figure S4C); However, KEGG enrichment pathways were mostly enriched in hippo signaling pathway, cell adhesion molecules and ECM-receptor interaction and other key biological signaling pathways related to migration, invasion and tight intercellular junctions (Supplementary Figure  S4D). Finally, the detailed information of the key enrichment information was sorted into Table4. Subsequently, we performed protein-protein interaction analysis on these related genes to explore their interactions and connections. Finally, the top 10 key target genes in all related genes were obtained using cytohubba, including ITGAM, FCER1G, CD33, SNAP25, ITGAX, LAIR1, TYROBP, FCGR1A and ODLR1 (Supplementary Figures  S4E,F). And, the tight correlation between CD33 ( Figure 8A Figure 8J) and EMT-related genes were also performed for further exploring targeted genes; the results indicated that these hub genes were tightly correlated with both ferroptosis and EMT process. Immunohistochemistry of the nine genes (ITGAM, FCER1G, CD33, SNAP25, ITGAX, LAIR1, TYROBP, FCGR1A, ODLR1) in tumor and normal colon tissues were shown in Supplementary  Figures S5A-H. The protein expression levels of these hub genes were significantly up-regulated in tumor tissues compared with normal tissues. The results may explain the low survival rate in the high risk group at the molecular level, indicating that the risk model we established is reliable.

DISCUSSION
COAD is a highly malignant cancer with poor prognosis. Although some clinical predictors such as age, gender and TNM stage could predict the prognosis of COAD, the predictive capacity and suitable individualized treatment still needs further improvement due to the high heterogeneity of this cancer. Thus, it is necessary to find novel prognostic biomarkers for COAD. In the present study, we established and validated a four-gene signature which can effectively predict the survival rate of patients with COAD. Moreover, this prognostic risk model may provide potential biomarkers for identification of COAD patients suitable for ferroptosisinduced therapy.
In our study, a total 107 overlapping DEGs were obtained by integrative analysis of multiple datasets. Then, GO/KEGG pathway enrichment analysis were conducted and revealed that the screened DEGs are associated with pathways of tumor invasion and migration, including focal adhesion and MAPK signaling pathway. 35 DEGs were generated through the intersection of the above 107 DEGs related to iron transportation, MAPK signaling pathway, oxidative phosphorylation and focal adherence. Furthermore, four genes were screened by Lasso regression analysis and used to construct a four-gene signature as a prognostic risk model. Moreover, the robustness and reliability of this model were validated in GEO dataset. Subsequently, we employed WGCNA analysis and found that PCOLCE2 and HOXC11 are involved in liver and lymphatic invasion of COAD. Ten positively correlated genes of PCOLCE2 and HOXC11 were identified through correlation analysis and protein-protein interaction analysis. Additionally, we demonstrated that the protein expression levels of these hub genes were significantly up-regulated in tumor tissues compared with normal samples, which further insuring the reliability of our prognostic model. MMP7, a member of the matrix metalloproteinase family, has an ability to degrade the extracellular matrix (Winer et al., 2018). MMP7 participates in the development of multiple malignant tumors and is associated with the invasion and metastasis of cancers, including CRC , lung cancer (Stenvold et al., 2012), pancreatic cancer (Fukuda et al., 2011;Chen et al., 2013), tongue squamous cell carcinoma (Yuan et al., 2020) and prostate cancer (Lynch et al., 2005). A previous study showed that Yes-associated protein 1 (YAP1) contributed to the invasion and migration of non-small cell lung cancer by promoting Slug transcription via the transcription co-factor TEAD (Yu et al., 2018). High YAP1 levels suggested low overall survival rates in CRC and are one of the potential prognostic biomarkers for CRC (Zihui Xu et al., 2019). Interestingly, the activation of transcription factor YAP1 could promote the susceptibility of tumor cells to ferroptosis by controlling the expression of ferroptosis modulatory factors such as TFRC, EMP1, ACSL4, and ANGPTL4 Wu et al., 2019), which suggests that ferroptosis-inducing drugs may be a perfect treatment for YAP1-activated tumors. PCOLCE2 is a gene encoding procollagen C-endopeptidase enhancer protein 2 (PCPE2) (Xu et al., 2000). PCPE2, a 52-kDa glycoprotein, is located in the extracellular matrix and can stimulate bone morphogenetic protein 1 (BMP1) activity on fibrillar procollagens (Moali et al., 2005). Interestingly, it has been demonstrated that BMP1 is involved in the metastasis of gastric cancer and non-small cell lung cancer (Wu et al., 2014;Hsieh et al., 2018). Additionally, it is indicated that PCOLCE (procollagen C-proteinase enhancer protein) acts a pivotal part in promoting the lung metastasis of osteosarcoma . A previous study demonstrated that PCOLCE performed well in predicting overall survivals of colorectal cancer patients (Chen et al., 2019). Thus, combining with our results, we believe that PCOLCE2 has a close relationship with EMT-related metastasis of COAD. Homeobox C11 gene (HOXC11), one member of the HOX family, was shown to be closely related to the clinical outcome of patients with renal cancer, cervical cancer or breast cancer (McIlroy et al., 2010;Liu et al., 2015;Eoh et al., 2017). Moreover, overexpressing HOXC11 was found to promote cancer cell proliferation and migration (Liu et al., 2015;Peng et al., 2021). In summary, the four prognostic genes we screened were associated with cancer development and metastasis, thereby could predict the prognosis of COAD.
At this time, ferroptosis has attracted increased attention in the mechanism of tumor development. There is accumulating evidence indicating that induction of ferroptosis might be an effective therapeutic strategy to prevent tumor resistance and suppress tumor growth (Roh et al., 2016;Hangauer et al., 2017;Viswanathan et al., 2017;Tsoi et al., 2018). Additionally, some cancer cells with mesenchymal-like phenotype are sensitive to ferroptosis (Chen et al., 2021). These findings suggest that using ferroptosis-inducing drugs may become a special and significant therapeutic method for tumors expressing certain factors involved in EMT.
In this study, we screened four genes related to EMT and ferroptosis and established a reliable, stable and effective fourgene risk prediction model in COAD patients. High metastasis tendency may be particularly prone to ferroptosis, thus opening up a new field of targeted therapy research. Our goal is to provide insight and summary on ferroptosisrelated genes signature in tumor invasion and metastasis and its potential therapeutic targets. This signature may perform well as a sign to identify some COAD patients suitable with ferroptosis induction therapy. Nevertheless, there are some limitations in our study. Selection bias may generate due to the limited sample size. For better clinical application, further studies using larger sample quantity are necessary to ensure the reliability of our finding. Moreover, clinical samples and animal experiments for comprehensive verification are needed to understand the potential molecular mechanism of this type of COAD.

CONCLUSION
In this study, we established a four-gene prognostic signature (MMP7, YAP1, PCOLCE and HOXC11) based on EMT and ferroptosis related genes and validated the reliability and effectiveness of this model in COAD. Moreover, through WGCNA analysis, we found that PCOLCE and HOXC11 were correlative with liver and lymphatic invasion in patients with COAD. These four genes may become potential prognostic biomarkers to identify COAD with metastasis. Moreover, this four-gene signature may be able to determine the COAD suitable with ferroptosis induction therapy. Overall, this signature provided novel possibilities for better prognostic evaluation of COAD patients and would be of great guiding significance for individualized treatment and clinical decision.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are publicly available. This data can be found here: https://github.com/ Xieyongjie666/Code-availability/commit/8de1946ea8b980ced428 8d0114135386cb0dbcc3.

AUTHOR CONTRIBUTIONS
CS provided experiments design of this project. YX performed the bioinformatics data analysis; XL, GL are responsible for image processing, result sorting, and statistical analysis; WL, WP helped to collect information and data processsing. TL, XY, JL given the main research idea and were responsible for adjustment of manuscript structure and determination of central idea. All the authors participated in the manuscript composition. Thanks for their works in this research.

ACKNOWLEDGMENTS
We would like to thank the patients and investigators who participated in the TCGA and the GEO for providing data.