Predict Colon Cancer by Pairing Plasma miRNAs: Establishment of a Normalizer-Free, Cross-Platform Model

Background Plasma miRNAs are emerging biomarkers for colon cancer (CC) diagnosis. However, the lack of robust internal references largely limits their clinical application. Here we propose a ratio-based, normalizer-free algorithm to quantitate plasma miRNA for CC diagnosis.
Methods A miRNA-pair matrix was established by pairing differentially expressed miRNAs in the training group from GSE106817. LASSO regression was performed to select variables. To maximize performance, four algorithms (LASSO regression, random forest, logistic regression, and SVM) were tested for each biomarker combination. Data from GSE106817 and GSE112264 were used for internal and external verification. RT-qPCR data acquired from another cohort were also used for external validation.
Results After validation through four algorithms, we obtained a 4-miRNA-pair model (miR-1246 miR-451a; miR-1246 miR-4514; miR-654-5p miR-575; miR-4299 miR-575) that showed good performance in differentiating CC from normal controls, with a maximum AUC of 1.00 in internal verification and 0.93 in external verification. Tissue validation showed a maximum AUC of 0.81. Further external validation using RT-qPCR data showed good classification ability, with an AUC of 0.88.
Conclusion We established a cross-platform prediction model robust against sample-specific disturbance, which not only performs well in predicting CC but is also promising for the diagnosis of other diseases.


INTRODUCTION
With 1.9 million new cases, colon cancer (CC) ranks third among all cancers in incidence worldwide (1). The number of CC patients has grown rapidly over the past decade, making CC a major global health problem (2). Early-stage CC patients can be cured by minimally invasive surgery and show a 5-year survival rate of 74% (3). Thus, detection of CC at a resectable stage is a long-term pursuit of gastroenterologists (4).
Plasma microRNAs (miRNAs), representing an emerging direction of liquid biopsy, are bringing new insights to the early detection of CC. Many efforts have been made to evaluate plasma miRNAs for predicting CC (5,6). Among the main existing miRNA detection methods, omics tools such as miRNA sequencing and miRNA chips are costly and not suitable for large-scale population cancer screening. Real-time quantitative PCR (RT-qPCR) is a cost-effective alternative to the omics tools in miRNA detection. However, the absence of a proper standardization method for RT-qPCR data of plasma miRNA largely limits its reliability and repeatability in quantification assays (7). Thus, although dozens of miRNA biomarkers have been proposed by different labs, few have been reproduced by other researchers or eventually applied to large-scale population screening of cancers.
In this study, we proposed a miRNA paired ratio-based standardization method for plasma miRNA quantification. We calculated the ratio of two given miRNAs in the same plasma sample, used this ratio as a new kind of variable, and constructed a prediction system based on these ratios to predict the risk of CC. This ratio-based system performed well on both miRNA chip data and RT-qPCR data by eliminating possible sample-specific disturbance. We believe that this system not only performs well in predicting CC but is also promising for the diagnosis of other diseases.

Patient Cohorts
Plasma miRNA chip data from GSE106817 (8) and GSE112264 (9) in the GEO database were obtained for analysis. GSE106817 was randomly divided into a training group and a testing group. GSE112264 was used for external verification. Furthermore, miRNA chip data of tissue samples from GSE115513 (10) were used for validation in tissue samples. A total of 104 plasma samples from CC patients and normal controls were collected at Beijing Friendship Hospital for RT-qPCR validation. Clinical details of these subjects are shown in Supplementary File 1. All participants signed informed consent, and this study was approved by the ethics committee of Beijing Friendship Hospital. A flowchart of this study is shown in Figure 1.

Differentially Expressed miRNA Analysis and miRNA-Pair Matrix Construction
Differentially expressed (DE) miRNA analysis was performed on the training group derived from GSE106817 using the "limma" package (11) in R 3.5.2. Then, miRNA paired ratios were calculated from the expression values of any two DE miRNAs in the same sample. Subsequently, univariate logistic regression was used to exclude pairs not associated with cancer occurrence, yielding the final miRNA-pair matrix for the training group.
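The pairing step can be illustrated with a minimal Python sketch (the original analysis was done in R; this is not the authors' code). The function name `pair_ratio_matrix`, the `log_scale` flag, and the toy values are hypothetical. The text does not state whether ratios were taken on raw or log-transformed values, so the sketch supports both: on a log2 scale the ratio of two miRNAs becomes a difference.

```python
from itertools import combinations

def pair_ratio_matrix(expr, log_scale=True):
    """Build all pairwise miRNA features.

    expr: dict mapping miRNA name -> list of per-sample values (same sample order).
    Returns a dict mapping "A/B" -> list of per-sample paired values.
    """
    names = sorted(expr)
    pairs = {}
    for a, b in combinations(names, 2):
        if log_scale:
            # On a log2 scale, log2(A/B) = log2(A) - log2(B)
            values = [x - y for x, y in zip(expr[a], expr[b])]
        else:
            values = [x / y for x, y in zip(expr[a], expr[b])]
        pairs[f"{a}/{b}"] = values
    return pairs

# Hypothetical toy data (log2 intensities for three samples), not from the study
toy = {
    "miR-1246": [10.0, 11.5, 9.0],
    "miR-451a": [8.0, 8.5, 9.5],
    "miR-575": [6.0, 6.5, 6.0],
}
ratios = pair_ratio_matrix(toy)

# A sample-wide additive shift on the log scale (for example, a different
# normalizer) leaves every paired value unchanged in this toy example.
shifted = {name: [v + 2.0 for v in vals] for name, vals in toy.items()}
unchanged = pair_ratio_matrix(shifted) == ratios
```

With n DE miRNAs this produces n(n-1)/2 features, which is why a screening step such as the univariate logistic regression described above is needed before multivariate modeling.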

Prediction Model Construction
LASSO regression was performed to select variables in the miRNA-pair matrix using the "glmnet" package (12) in R 3.5.2. Models comprising 1, 2, 3, and 4 miRNA pairs were established successively through a step-wise variable selection process, by controlling the lambda value in the LASSO regression. Then, a logistic prediction model for each of these variable sets was constructed with the "glm" function in R. Random forest models were established with the "randomForest" package. Support vector machine (SVM) models were built using the "e1071" package. Furthermore, the area under the curve (AUC) of each prediction model was calculated to evaluate its performance.
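The idea of obtaining models of increasing size by controlling lambda can be sketched as follows. This is a pure-Python illustration using proximal gradient descent on an L1-penalized logistic loss; it is not the coordinate-descent algorithm implemented in "glmnet", and the function names, step size, lambda grid, and toy data are all assumptions made for the example. Features are assumed to be already standardized.

```python
import math

def soft_threshold(x, t):
    """Proximal operator of the L1 penalty: shrink x toward zero by t."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def fit_l1_logistic(X, y, lam, step=0.1, n_iter=2000):
    """Minimize mean logistic loss + lam * ||w||_1 (intercept unpenalized)."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        resid = [sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) - yi
                 for row, yi in zip(X, y)]
        grad_b = sum(resid) / n
        grad_w = [sum(r * row[j] for r, row in zip(resid, X)) / n
                  for j in range(p)]
        b -= step * grad_b
        w = [soft_threshold(wj - step * gj, step * lam)
             for wj, gj in zip(w, grad_w)]
    return w, b

def select_k_pairs(X, y, k, lambdas):
    """Scan lambdas from large to small; return the first fit that keeps
    exactly k nonzero coefficients, or None if no lambda in the grid does."""
    for lam in sorted(lambdas, reverse=True):
        w, b = fit_l1_logistic(X, y, lam)
        selected = [j for j, wj in enumerate(w) if wj != 0.0]
        if len(selected) == k:
            return lam, selected, w, b
    return None

# Hypothetical toy data: column 0 separates the classes, columns 1-2 are weak/noise
X = [[-1.0, -0.5, 0.3], [-0.8, 0.5, -0.3], [-1.2, -0.2, 0.1], [-0.6, 0.3, -0.1],
     [1.0, 0.4, -0.2], [0.8, -0.3, 0.2], [1.2, 0.6, 0.1], [0.6, 0.1, -0.1]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
result = select_k_pairs(X, y, k=1, lambdas=[0.1, 0.3, 0.5])
```

A larger lambda shrinks more coefficients to exactly zero, so scanning from large to small lambda yields the 1-, 2-, 3-, and 4-pair models in turn; the selected columns can then be passed to an ordinary logistic regression, random forest, or SVM.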

Internal and External Verification by miRNA Chip Data
The testing group of GSE106817 was used for internal verification of the models. Receiver operating characteristic (ROC) curves of these models were drawn using the "plotROC" package (13) and the "ggplot2" package. GSE112264 was used for external verification of these models. Furthermore, we evaluated the diagnostic validity of the models on tissue miRNA chip data from GSE115513. AUC values of internal and external verification were calculated for each model.
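For reference, the AUC used throughout equals the probability that a randomly chosen CC sample receives a higher model score than a randomly chosen control, with ties counted as one half. A minimal Python sketch of this rank-based definition follows; the function name `auc` and the toy scores are hypothetical and are not taken from the study data.

```python
def auc(scores, labels):
    """Rank-based AUC: fraction of (positive, negative) pairs in which the
    positive sample scores higher; ties contribute 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical model scores: 1 = CC, 0 = normal control
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auc = auc(toy_scores, toy_labels)  # 8 of 9 pairs are ordered correctly
```

The pairwise loop is quadratic in the number of samples, which is acceptable for cohorts of the size analyzed here; sorting-based implementations give the same value for larger datasets.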

RT-qPCR Validation
A total of 104 plasma samples from Beijing Friendship Hospital were used for external validation (Supplementary File 1). Total RNA was isolated from plasma using TRIzol following the manufacturer's protocol (Takara, Shiga, Japan). Total RNA was reverse-transcribed and cDNA was synthesized using miR-specific stem-loop RT primers and components of the High-Capacity cDNA Reverse Transcription kit (Takara, Shiga, Japan, RR036a). The amplification of cDNA was performed in a 10-µl reaction system following the SYBRGREEN life assays manufacturer's instructions. Primers used in RT-qPCR validation are shown in Supplementary File 2.

GSEA Analysis, Gene Ontology Analysis and KEGG Pathway Analysis
Target genes of selected miRNAs were predicted by TargetScan (14). Gene ontology analysis was then performed to identify ontologies (MF, molecular functions; BP, biological processes; CC, cellular component) enriched in the target genes of those miRNAs using the clusterProfiler package (15) in R. KEGG pathway and GSEA enrichment analyses (16) were also performed to identify the pathways that may be affected by those miRNAs.

DE miRNA Identification and miRNA-Pair Matrix Construction
A total of 387 DE miRNAs were identified by DE analysis in the training group (91 CC and 157 NC) of GSE106817 (Figure 1). Any two of these 387 DE miRNAs were paired, which produced a miRNA-pair matrix containing 74,691 miRNA-pairs. Univariate logistic regression was subsequently performed to screen for cancer-related miRNA-pairs. Finally, a miRNA-pair matrix with 61,939 miRNA-pairs was obtained.
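The size of the initial matrix follows directly from counting unordered pairs of the 387 DE miRNAs, and the retained fraction after univariate screening can be read off from the two reported counts:

```latex
\binom{387}{2} = \frac{387 \times 386}{2} = 74{,}691,
\qquad
\frac{61{,}939}{74{,}691} \approx 0.829
```

That is, roughly 83% of all candidate pairs passed the univariate screen, so the LASSO step described next carries most of the burden of variable selection.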

Prediction Model Construction and Internal Verification by miRNA Chip Data
LASSO regression was performed to select the most effective variables from all miRNA-pairs to construct multivariate prediction models. Models comprising 1, 2, 3, and 4 miRNA-pairs were established successively by a step-wise variable selection process in the LASSO regression, as shown in Table 1 (Figure 2A).
Plasma miRNA chip data from 24 CC and 73 NC in the testing group were used for internal validation. To maximize the performance of these models, four methods (LASSO regression, random forest, logistic regression and SVM) were tested in parallel. AUC values of at least 0.9 were obtained for each model, which suggested good efficacy in CC detection (Figures 2B-E).

External Verification of Prediction Models by miRNA Chip Data
Another dataset (GSE112264), with plasma miRNA chip data from 50 CC and 41 NC, was used for external verification. Four algorithms (LASSO regression, random forest, logistic regression and SVM) were applied to validate each of these models, and we finally obtained high AUC levels of over 0.7 in [...] Figure 3D). These results suggested good reproducibility of our models across cohorts.

FIGURE 1 | Flowchart of this study. In short, we calculated the ratio of two given miRNAs in the same plasma sample, used this ratio as a new kind of variable, constructed a prediction system based on these ratios to predict the risk of CC, and validated these models at the internal, external, tissue, and RT-qPCR levels.

Tissue Verification of Prediction Models by miRNA Chip Data
Generally, 3-miRNA-pair models and 4-miRNA-pair models with an AUC >0.8 could be considered potentially applicable in clinical scenarios. To better understand their rationale, we further validated the constructed models in a tissue sample-derived miRNA chip dataset (GSE115513). For 3-miRNA-pair models, the SVM method reached an AUC value of 0.71, while the AUC value of the logistic regression was 0.80 (Figure 4A). For 4-miRNA-pair models, the logistic regression method reached an AUC value of 0.81 (Figure 4B). These results indicated reliable diagnostic performance of our prediction models in tissue samples.

Validation for Potential Clinical Application by RT-qPCR
Next, we conducted RT-qPCR assays to validate those models. For the 3-miRNA-pair model, the SVM method reached an AUC value of 0.71, which rose to 0.82 when clinical information such as age and gender was taken into consideration. However, logistic regression models gave a lower AUC level as compared to [...] Figure 5A). For the 4-miRNA-pair model, the AUC of SVM was 0.78 and increased to 0.88 when age and gender were added to the model (Figure 5B). We then performed subgroup analyses focusing on early CC patients and advanced CC patients, respectively. An AUC of 0.83 in the 3-miRNA-pair SVM model and an AUC of 0.92 in the 4-miRNA-pair SVM model were obtained when applying those models to distinguish 16 early CC patients from 48 normal controls (Figures 5C, D). Similarly, a sub-cohort consisting of 40 advanced CC patients and 48 normal controls was also analyzed. The AUC reached 0.81 in the 3-miRNA-pair SVM model and 0.85 in the 4-miRNA-pair SVM model (Figures 5E, F). The comparable efficiencies in detecting early CC and advanced CC are encouraging, but the sample size is too small to draw a definite conclusion.

GSEA Analysis, GO Analysis and KEGG Pathway Analysis for the Involved miRNAs
Target genes of selected miRNAs were predicted by TargetScan based on their seed positions, which are shown in Table 2. GSEA analysis was then applied and showed that the target genes were mostly related to colorectal adenoma and colorectal cancer (Figures 6A, B). KEGG analysis and GO analysis were also performed to identify gene functions and pathways affected by these miRNAs. KEGG analysis showed that target genes were mainly involved in HSV-1 infection and the MAPK signaling pathway (Figure 6C). GO analysis showed that they were involved in regulation of cell morphogenesis (biological processes, Figure 6D), neuronal cell body (cellular component, Figure 6E), and small GTPase binding (molecular functions, Figure 6F).

DISCUSSION
Early detection of CC is clinically crucial but technically difficult, since patients rarely exhibit any symptoms until the disease reaches an advanced stage (17). Traditional biomarkers (e.g., CEA and CA19-9) show low sensitivity and specificity in cancer detection, which may cause missed and delayed diagnoses. Recently, an increasing number of studies have focused on new biomarker types, such as plasma miRNAs (18). Plasma miRNAs are stable under various storage conditions and resistant to degradation, which makes them good candidates for cancer biomarkers (6). [...] adenomas) and 73 healthy controls, and identified 6 plasma miRNAs for CRC detection, which also showed good classification ability for advanced adenomas (20). This study had a large sample size and selected a relatively stable miRNA (miR-16) as the internal reference; however, a proper modeling algorithm could improve its diagnostic value.
Among all detection methods, RT-qPCR is well suited to clinical application because of its low cost compared with miRNA-seq and miRNA chips (21). However, its accuracy has often been questioned because of the low concentration of miRNAs in plasma and the lack of a reliable normalizer for quantification. There are two common methods for plasma miRNA standardization. The most popular is to use traditional reference genes, such as U6 (22) and miR-16 (23), which have been considered stably expressed in human cells. However, even those reference RNAs can be dysregulated and show an altered abundance in the circulation in a disease-specific manner (24,25). Consequently, those reference RNAs are not suitable for data normalization in circulating miRNA quantification. Spike-in exogenous miRNAs such as cel-miR-39, cel-miR-54, and SV40 have also been used as external references to partially eliminate deviations arising from experimental processes. However, these external controls cannot correct sample-specific deviations (26).
Here we proposed a miRNA-pair ratio-based algorithm to quantitate plasma miRNA. We calculated the ratio of any two candidate miRNAs in the same sample and took these ratios as new variables to predict cancer occurrence. Because the expression levels of the two miRNAs are measured simultaneously under the same conditions, their ratio can reflect the true fold change by canceling the disturbance introduced by different reference genes, which has been mathematically proved (27).
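One simple way to see this cancellation for RT-qPCR data is sketched below; it is an illustration under stated assumptions and may differ from the proof given in (27). Assume that the relative quantity of a miRNA is computed as $2^{-\Delta Ct}$ against some reference with cycle threshold $Ct_{\mathrm{ref}}$ measured in the same sample, and that amplification efficiency is 100% for both miRNAs. The symbols $Q_A$, $Q_B$, $Ct_A$, $Ct_B$, and $Ct_{\mathrm{ref}}$ are notation introduced here for the example.

```latex
\frac{Q_A}{Q_B}
  = \frac{2^{-(Ct_A - Ct_{\mathrm{ref}})}}{2^{-(Ct_B - Ct_{\mathrm{ref}})}}
  = 2^{-(Ct_A - Ct_{\mathrm{ref}}) + (Ct_B - Ct_{\mathrm{ref}})}
  = 2^{-(Ct_A - Ct_B)}
```

The reference term drops out, so under these assumptions the paired ratio depends only on the two measured cycle thresholds and is the same whichever internal or external reference is chosen.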
The idea of using miRNA ratios to establish a prediction system for disease evaluation dates back to 2011. Mattia Boeri et al. performed a miRNA ratio signature study and found that signatures of miRNA ratios could reflect the prognosis of lung cancer patients with an AUC value >0.85 (28). However, the small sample size and the lack of a theoretical justification largely weakened confidence in their results. Then, Yu et al. mathematically verified that miRNA-pair ratios are independent of internal or external controls, which means that miRNA ratios remain stable no matter which reference genes are chosen (27). Thus, using miRNA-pair ratios instead of individual miRNA quantification could eliminate both deviations derived from experimental processes and sample-specific deviations. Here, for the first time, we constructed a full pipeline that uses miRNA-pair ratios to build prediction models for cancer diagnosis. By enrolling three independent cohorts with both miRNA profiling and RT-qPCR data, we demonstrated the robustness and cross-platform stability of this approach. Our results suggested that a 4-miRNA-pair model could differentiate CC patients from healthy controls with an AUC of 0.78 in RT-qPCR validation. Stability is the major advantage of this algorithm. No matter which detection method is adopted and which reference gene is chosen, miRNA ratios remain stable within a given sample. Here we found that our prediction models stayed effective in both miRNA chip verification and RT-qPCR validation. This cross-platform property makes the approach easier to apply clinically and creates an opportunity for its further application in new detection methods such as microfluidics and thermophoresis.
Although we have assessed this new method in the field of cancer diagnosis and achieved good classification ability, many other applications remain worth exploring, such as identifying chemotherapy- or radiotherapy-sensitive CC patients, diagnosing other cancers or noncancerous diseases, and monitoring therapeutic responses. Moreover, although we have analyzed the potential pathways and biological processes involving these miRNAs, the inner connections and biological interactions between the paired miRNAs still need in-depth investigation.
In conclusion, we have established a cross-platform prediction model robust against sample-specific disturbance, which not only performs well in predicting CC but is also promising for the diagnosis of other diseases.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding authors.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by Beijing Friendship Hospital Ethics Committee.
The patients/participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
DQ, LM, and SZha conceived and designed the study. DQ, RW, and QG performed all experiments. RW, SL, and SZhu helped to collect, reformat, and analyze the primary data. DQ and LM drafted the manuscript. LM and SZha proofread and revised the manuscript. All authors contributed to the article and approved the submitted version.