ECM–Receptor Regulatory Network and Its Prognostic Role in Colorectal Cancer

Interactions of the extracellular matrix (ECM) and cellular receptors constitute one of the crucial pathways involved in colorectal cancer progression and metastasis. With the use of bioinformatics analysis, we comprehensively evaluated the prognostic information concentrated in the genes from this pathway. First, we constructed a ECM–receptor regulatory network by integrating the transcription factor (TF) and 5’-isomiR interaction databases with mRNA/miRNA-seq data from The Cancer Genome Atlas Colon Adenocarcinoma (TCGA-COAD). Notably, one-third of interactions mediated by 5’-isomiRs was represented by noncanonical isomiRs (isomiRs, whose 5’-end sequence did not match with the canonical miRBase version). Then, exhaustive search-based feature selection was used to fit prognostic signatures composed of nodes from the network for overall survival prediction. Two reliable prognostic signatures were identified and validated on the independent The Cancer Genome Atlas Rectum Adenocarcinoma (TCGA-READ) cohort. The first signature was made up by six genes, directly involved in ECM–receptor interaction: AGRN, DAG1, FN1, ITGA5, THBS3, and TNC (concordance index 0.61, logrank test p = 0.0164, 3-years ROC AUC = 0.68). The second hybrid signature was composed of three regulators: hsa-miR-32-5p, NR1H2, and SNAI1 (concordance index 0.64, logrank test p = 0.0229, 3-years ROC AUC = 0.71). While hsa-miR-32-5p exclusively regulated ECM-related genes (COL1A2 and ITGA5), NR1H2 and SNAI1 also targeted other pathways (adhesion, cell cycle, and cell division). Concordant distributions of the respective risk scores across four stages of colorectal cancer and adjacent normal mucosa additionally confirmed reliability of the models.


INTRODUCTION
The extracellular matrix (ECM) is a noncellular component of tissue, which biochemically and structurally supports cells. The ECM is composed of different glycoproteins such as collagens, laminins, and fibronectins (Theocharis et al., 2016), and there are dozens of cellular receptors which directly interact with the components of the ECM, for example, integrins or cadherins (Barczyk et al., 2010). Interactions between the ECM and receptors on the cellular surface regulate cell behavior and play an important role in communications between cells, cell proliferation, adhesion, and migration (Nguyen-Ngoc et al., 2012;Plotnikov et al., 2012;Schlie-Wolter et al., 2013;Lange et al., 2014).
A number of studies revealed a crucial role of the ECM-receptor interaction in colorectal cancer development and metastasis formation (Stankevicius et al., 2016;Crotti et al., 2017;Maltseva and Rodin, 2018). We recently showed the contribution of α5 laminin in differentiation of colorectal cancer cells and chemotherapy resistance (Maltseva et al., 2020). Several works describe biomarkers and signatures for assessment of colorectal cancer prognosis based on the expression of particular genes involved in ECM-receptor interaction, including genes encoding integrins (Boudjadi et al., 2013;Gong et al., 2019), E-and P-cadherin (Sun et al., 2011;Christou et al., 2017), and different laminins (Galatenko et al., 2018). ECM-based prognostic gene signatures were constructed for gastric (Yang et al., 2020), breast (Bergamaschi et al., 2008), prostate (Pang et al., 2019), and bladder (Qing et al., 2020) cancers. However, to the best of our knowledge, no comprehensive prognostic analysis of ECM-receptor interaction-based colorectal cancer gene signatures has been done so far.
Another dimension useful for the construction of prognostic signatures is the analysis of regulatory networks (Ahmad et al., 2012;Guo et al., 2020;Nersisyan et al., 2021b). Specifically, gene expression levels could be dynamically regulated by other molecules, such as transcription factors (TFs), microRNAs (miRNAs), and others. Recently, it was shown that miRNAs are present in a cell in different variants, called miRNA isoforms (isomiRs) (Morin et al., 2008;Loher et al., 2014). As a result of imprecise enzymatic cleavage, miRNA hairpins give rise to mature forms, which differ from each other in 1-3 nucleotides at the ends of the molecule (Zhiyanov et al., 2021). Importantly, the targetome of isomiRs with differences at 5'-ends (5'-isomiRs) significantly differ from the canonical form (Tan et al., 2014;van der Kwast et al., 2020). Thus, 5'-isomiRs could be considered separate miRNAs with their own sets of targets.
In this work, we analyzed expression patterns of genes involved in the ECM-receptor interaction pathway using RNA sequencing data of colorectal cancer samples taken from The Cancer Genome Atlas Colon Adenocarcinoma (TCGA-COAD) and Rectum Adenocarcinoma (TCGA-READ) projects (Network, 2012). First, we constructed and analyzed a regulatory network to infer 5'-isomiRs and TFs, which are direct regulators of genes from the ECM-receptor interaction pathway. The network was built with miRGTF-net-the recently developed tool which integrates both expression and databaselevel data for the network construction (Nersisyan et al., 2021b). Next, the obtained network was used to construct hybrid isomiRgene signatures for predicting overall survival in colorectal cancer. For this analysis, we employed a novel technique of exhaustive search-based Cox model fitting. Namely, ExhauFS software (Nersisyan et al., 2021c) was used to construct prognostic models for all gene/5'-isomiR pairs, triples, etc; then, the best performing model was picked.

TCGA mRNA and miRNA Sequencing Data
RNA and miRNA sequencing read count tables were downloaded from the GDC Data Portal for n 426 TCGA-COAD and n 161 TCGA-READ colorectal cancer samples (tumors with unmatched miRNA/mRNA profiles or without clinical information were not considered). For the comparison of primary tumors and adjacent normal mucosa, n 7 TCGA-COAD normal samples were also included. With the use of the trimmed mean of M-values (TMM) algorithm implemented in the edgeR v3.30.3 package (Robinson et al., 2010), the obtained mRNA-seq and miRNA-seq count matrices were processed into the TMM-FPKM and TMM-RPM tables, respectively. Low expressed genes and miRNAs were filtered out using the default procedure available in edgeR.

Network Analysis
A recently developed miRGTF-net tool (Nersisyan et al., 2021b) was applied to the TCGA-COAD dataset for the construction of a colorectal cancer miRNA-gene-TF regulatory network. The main feature of this approach consists in the integration of expression data (TCGA-COAD) with the biological interaction databases: • TFLink database (https://tflink.net) was used to extract TFgene interactions; • TF-miRNA interactions were obtained from TransmiR v2.0 (Tong et al., 2019); • miRDB v6.0 (Chen and Wang, 2020) custom prediction mode was employed to predict targets of 5′-isomiRs (as recommended by the tool authors, interactions with target scores ≥ 80 were considered).
First, the initial network was constructed based on the interactions listed in these databases. Second, all uncorrelated and wrong-directional edges (like positively correlated miRNAs and their targets) were discarded. Then, interaction scores were assigned to each edge and node of the network. The interaction scores are based on the strength of the linear dependence between expressions levels of the connected nodes. After filtering out nodes and edges with low interaction scores, the resulting network consisted of nodes with significant influence on some other nodes and/or significantly regulated by some other node. Configuration for miRGTF-net execution is listed in Supplementary Material.
ECM-receptor interaction-related genes were taken from the KEGG hsa04512 pathway (Kanehisa et al., 2021) (we refer to these genes as ECM set). Next, the output of miRGTF-net was used to construct a subnetwork composed of molecules, which Frontiers in Genetics | www.frontiersin.org December 2021 | Volume 12 | Article 782699 either regulates a gene from the ECM-receptor interaction pathway or is regulated by a gene from the pathway (ECM+ set).

Construction of Prognostic Signatures
The TCGA-COAD cohort was split into training (75% of samples) and filtration (25%) sets with a stratification by outcome indicator (death or censoring) and overall survival time (date of death or date of the last follow-up). Namely, we first sorted samples by outcome indicator and then sorted samples by overall survival time within each outcome group. Finally, every fourth sample was added to the filtration set, and the resting samples were labeled as the training ones. The independent TCGA-READ dataset was used for the model validation (test set). The ExhauFS tool (Nersisyan et al., 2021c) was used to fit Cox survival regression models. For each length of prognostic signature (k 1, 2, . . . , 10), we first selected n most individually predictive features (see the next paragraph for the details) and then fit Cox models for all possible n k feature subsets. The values of n were chosen for reasons of limiting the computational time by the default procedure available in ExhauFS (Supplementary Table S1). The pipeline was executed in two modes: in the first one, genes from the ECM set were pre-selected, and in the second run, all 537 genes and isomiRs from the ECM+ set were considered for the model construction.
The concordance index was used as the main model accuracy metric, including the feature selection step. That is, features (genes and isomiRs) were selected according to the concordance index of the respective univariate model. In addition, patients were separated into high-and low-risk groups (the median risk score calculated on the training set was used as a cut-off value). This allowed us to construct Kaplan-Meier curves and compare low-and high-risk groups with the hazard ratio metric and the logrank test. Finally, time-dependent ROC AUC was calculated to measure discriminative power of models for predicting 3-year patient survival. Configuration for ExhauFS execution is listed in Supplementary Material.
The set of reliable models was defined by the following thresholds, set on both training and filtration sets: concordance index > 0.6, hazard ratio > 2, logrank test pvalue < 0.01, and 3-year ROC AUC > 0.6. The best performing model was chosen by taking the signature with the maximal concordance index on the training set.

Enrichment Analysis
Enrichment analysis of gene sets was conducted using DAVID v6.8 (Huang et al., 2009). Significantly enriched terms were identified by setting a 0.05 threshold on false discovery rates (FDRs).

Statistical Analysis
A hypergeometric test was applied to • identify regulators (TFs and isomiRs) with an overrepresented number of target genes in the ECM set; • identify genes and isomiRs, which were overrepresented in the reliable prognostic signatures.
"Over"-regulated genes from the ECM set were determined by the binomial test. In all the cases, the Benjamini-Hochberg procedure was employed to adjust for multiple testing correction. SciPy implementation of statistical tests was used (Virtanen et al., 2020).

Regulatory Network of ECM-Receptor Interaction Pathway
The first step of our analysis was the inference of regulatory interactions affecting genes from the ECM-receptor interaction pathway (from here onward, we refer to these genes as ECM set). The MiRGTF-net tool allows one to construct miRNA-gene-TF interaction networks combining both database-level and integrative miRNA/gene expression data. At the beginning, the database-level network was constructed; it contained interactions of the three types: • TFs regulating target genes; • TFs regulating target miRNAs; • 5'-isomiRs downregulating target genes.
Then, TCGA-COAD gene and isomiR expression data were analyzed to filter only those interactions which are supported by a significant correlation in considered samples.

Prognostic Power of the ECM and ECM+ Sets
Next, we assessed whether it is possible to find an accurate overall survival prediction model constructed of molecules from the ECM and ECM+ sets. We used our recently developed ExhauFS tool to go over all possible prognostic signatures composed of ECM/ECM+ genes and 5'-isomiRs, where the signature length varied from 1 to 10. While it was possible to search over all possible gene/isomiR pairs composed of 537 molecules (cardinality of the ECM+ set), the exhaustive search was computationally infeasible already for the triples. To tackle this problem, ExhauFS selects the relevant number of the most individually informative features and then performs exhaustive search among them (see Methods for the details). For the pipeline evaluation, 75% of the TCGA-COAD cohort was used for the Cox model training, and the remaining 25% was FIGURE 1 | ECM-receptor regulatory network. Blue nodes represent TFs, red nodes represent genes, and green nodes represent 5′-isomiRs. The node size indicates the value of the node degree; (A) out-degree; (B) in-degree. used for the filtration. The TCGA-READ dataset was used as an independent validation set. We set up several accuracy thresholds to discard models which demonstrated unreliable quality either on the training or the filtration sets (see Methods). The distribution of model accuracies (concordance indices) for each signature length (k) is shown in Figure 2. As it can be seen, both ECM+ and ECM models started to overfit from some point: quality of the models monotonically increased on the training set and started to drop on the filtration set after a certain signature length. The filtration set accuracy peak for the ECM+ set fell on gene/isomiR triples and quadruples ( Figure 2); for the downstream analysis, we selected the shorter signatures, since there was no statistically significant difference between concordance indices for k 3 and k 4 (Mann-Whitney U-test p 0.11). As for the ECM set, the highest filtration concordance indices were detected for the 6-gene signatures.
Among the ECM+ triples, the best model (according to the concordance index on the training set) was constructed with one canonical miRNA and two TFs: hsa-miR-32-5p|0, NR1H2, and SNAI1. The risk score (RS) for the model was calculated as follows: RS 0.25 p hsa-miR-32-5p|0 + 0.34 p NR1H2 + 0.25 p SNAI1.
The signature demonstrated reliable performance on the TCGA-READ validation set: the concordance index was equal to 0.64, difference in survival between groups of low-and highrisk was statistically significant (hazard ratio 2.25, logrank test p 0.0229, Figure 3A), and the model accurately classified 3-year patient survival (3-year ROC AUC 0.71, Figure 3B).
Similarly to the ECM+ case, the most reliable signature in the ECM set was identified as follows: The quality of this signature, composed of six genes directly involved in ECM-receptor interaction, was comparable to the quality of hybrid ECM+ prognostic triple: concordance index 0.61, hazard ratio 2.14, logrank test p 0.0164 ( Figure 4A), 3year ROC AUC 0.68 ( Figure 4B). The complete list of accuracy metrics (including training and filtration sets) is presented in Supplementary Table S4 and Supplementary Figure S1.
To assess the relationship between expression levels of nine identified prognostic molecules, we performed hierarchical clustering using both sample-wise expression values ( Figure 5A) and correlation matrix ( Figure 5B). Notably, three genes (FN1, ITGA5, and TNC) showed a strong co-expression pattern, while the other molecules did not form clear cluster structures. For both signatures, we compared the distributions of the underlying risk scores between four stages of colorectal cancer and adjacent normal tissues. In all cases, the risk scores monotonically increased from the normal mucosa to stage IV cancer ( Figure 6). This observation is an additional piece of evidence of reliability of two constructed models.

Regulatory Neighborhood of the Prognostic 5'-isomiR/Gene Triple
Since the best ECM+ model was composed of three regulators (one miRNA and two TFs), the next step of our analysis was to Frontiers in Genetics | www.frontiersin.org December 2021 | Volume 12 | Article 782699 5 explore the landscape of regulatory interactions mediated by these molecules (Figure 7). Out of three regulators, only miR-32 specifically regulated the ECM-interaction pathway: two (COL1A2 and ITGA5) out of nine predicted targets were from the ECM set (adjusted p 8.92 × 10 -3 ). Moreover, DAVID enrichment analysis of this set of nine genes revealed only two pathways tightly related to the ECM: ECM-receptor interaction (KEGG hsa04512, adjusted p 0.0168) and focal adhesion (KEGG hsa04510, p 0.0463).
Unlike miR-32, targetomes of NR1H2 and SNAI1 were not focused on the ECM set; only 4/705 targets of NR1H2 (HMMR, ITGA3, ITGB4, and LAMB3, p 0.86) and 1/6 targets of SNAI1 (FN1, p 0.16) were associated with ECM-receptor interaction. To uncover the regulatory role of NR1H2 and SNAI1 in colorectal cancer, we also performed DAVID functional enrichment analysis of the sets of their target genes inferred by miRGTFnet. Multiple pathways were enriched in the set of NR1H2 targets, including cell cycle, cell division, DNA repair, and RNA splicing (SupplementaryTable S5). In case of SNAI1, the cell adhesion pathway was enriched when no multiple testing correction was applied ( Supplementary Table S5). Thus, the inclusion of

DISCUSSION
In this work, we used expression data of genes constituting the ECM-receptor interaction pathway and its direct 5'-isomiR and TF regulators to compose prognostic signatures for colorectal cancer. The novel feature of the network construction step consisted in accounting for 5'-isomiR targeting. Importantly, one-third of all isomiR-gene interactions were mediated by noncanonical 5'-isomiRs. These numbers are in agreement with previous experimental findings, which demonstrated biological activity of noncanonical 5'-isomiRs, for example, miR-411|-1 (van der Kwast et al., 2020) or miR-9|+1 (Tan et al., 2014).
With the use of the ECM-receptor regulatory network, we constructed two reliable signatures for overall survival prediction. The first hybrid signature was composed of one canonical miRNA and two TFs: hsa-miR-32-5p|0, NR1H2, and SNAI1. The second signature was composed of six genes, directly involved in ECM-receptor interaction: AGRN, DAG1, FN1, ITGA5, THBS3, and TNC. A number of studies already highlighted the role of several markers from the constructed signatures in colorectal cancer. In two recent reports, miR-32 was shown to promote tumorigenesis, radioresistance, migration, and invasion of colorectal cancer by targeting BMP5 and TOB1 (Chen et al., 2018;Liang et al., 2019). With the use of sequence-based target prediction coupled with co-expression analysis, here, we first showed that miR-32 targets are overrepresented in the ECM-receptor interaction pathway (adjusted p-value 8.92 × 10 -3 ). Thus, the new possible regulatory role of miR-32 was uncovered. Another member of the ECM+ prognostic triple, the SNAI1 transcription factor, was also linked to the poor prognosis of colorectal cancer. Namely, SNAI1 regulates epithelial-mesenchymal transition (EMT) by suppressing E-cadherin and promotes chemoresistance in colorectal cancer (Hoshino et al., 2009;Wang et al., 2018). Nevertheless, we did not find specific roles of NR1H2 in the colorectal cancer development,  progression, or metastasis. Functional enrichment analysis of NR1H2 downstream targets suggested the contribution of this TF to the regulation of core cellular pathways, such as cell cycle and cell division.
The second signature was also partially composed of wellstudied genes. We previously showed the strong upregulation of ITGA5 in Caco-2 human colorectal cancer cell lines exposed to hypoxia, as a consequence of hypoxia-induced decrease in expression of its direct regulator-miR-148a (Nersisyan et al., 2021a). Notably, this regulatory interaction (miR-148a suppressing ITGA5) was also supported by the negative correlation in our miRGTF-net analysis. In the same work, the negative association between ITGA5 expression levels and patients' overall survival was observed. Other studies showed that reduced DAG1 protein expression is associated with poor outcome of colorectal cancer (Coco et al., 2012); downregulation of FN1 decreases proliferation, migration, and invasion of colorectal cancer cells (Cai et al., 2018), while TNC induces EMT and proliferation . As for AGRN and THBS3, we have not found evidence on their role in colorectal cancer pathogenesis. The comprehensive reference list summarizing the role of the selected genes in colorectal cancer prognosis is presented in Supplementary Table S6.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author.

FUNDING
The research was performed within the framework of the Basic Research Program at HSE University (SNe, YU, and AT; conceptualization, methodology, data curation, and investigation) and the Russian Science Foundation (Project No. 19-15-00397; SNi; bioinformatics analysis and vizualization).