In search of the ratio of miRNA expression as robust biomarkers for constructing stable diagnostic models among multi-center data

MicroRNAs (miRNAs) are promising biomarkers for the early detection of disease, and many miRNA-based diagnostic models have been constructed to distinguish patients from healthy individuals. To fully utilize miRNA-profiling data across different sequencing platforms or multiple centers, models that account for batch effects are needed so that they generalize to medical applications. We conducted transcription factor (TF)-mediated miRNA–miRNA interaction network analysis and adopted the within-sample expression ratios of miRNA pairs as predictive markers. The ratio of the expression values between each miRNA pair turned out to be stable across multiple data sources. A genetic algorithm-based classifier was constructed to quantify risk scores of the probability of disease and discriminate disease states from normal states in discovery and validation datasets for COVID-19, renal cell carcinoma, and lung adenocarcinoma. The predictive models based on the expression ratio of interacting miRNA pairs demonstrated good performance in the discovery and validation datasets, and the classifier may be used for the accurate early detection of disease.


Introduction
MicroRNAs (miRNAs) have emerged as valuable biomarkers for the early detection of diseases due to their tissue-specific expression profiles (Chen et al., 2008; Song et al., 2023; Lv and Sun, 2024). However, the measured miRNA expression levels may vary across different platforms or protocols, which limits the application of diagnostic models. This problem is known as batch variance and is prevalent in cross-sequencing platforms (Lazar et al., 2013) and multi-center data (Leek et al., 2010; Heinicke et al., 2020a; Heinicke et al., 2020b; Ibing et al., 2021). The difference in data distribution may be an obstacle to obtaining reliable conclusions in the joint analysis of multiple center datasets, and it prevents the cross-validation of models on external datasets (Leek et al., 2010; Zhang et al., 2021; Whalen et al., 2022; Peng et al., 2024). Thus, the effective handling of batch effects is the first problem that needs to be solved in the integrative analysis of large-scale biological datasets (Goh et al., 2017).
Several batch effect correction methods have been developed to facilitate the joint analysis of multi-center data. The "ComBat-seq" tool based on the negative binomial regression model was developed specifically for RNA-seq count data (Zhang et al., 2020a). The "removeBatchEffect" function in the "limma" package can be used to correct data variation based on the linear regression model (Ritchie et al., 2015). However, these correction methods require the artificial transformation of data shapes, which may introduce false discoveries (Nygaard et al., 2016). In contrast, the intrinsic regulatory networks are less likely to be affected by different sequencing protocols, and the pathway-derived genes show potential as normalizer-free and batch-insensitive markers. Under this consideration, we propose a novel tool to integrate datasets from multiple sources, termed the ratio of the expression values between related miRNAs (ERRmiR), which is obtained by calculating the ratio of expression values of two related miRNAs in the intrinsic regulatory networks.
The miRNA interaction network was constructed based on prior knowledge to discover ERRmiR features with biological significance. It is widely known that miRNAs not only regulate the expression of protein-coding mRNAs but also target non-coding RNAs, including long non-coding RNAs and miRNAs (Hill and Tran, 2018; Vishnubalaji et al., 2022; Shang et al., 2023). The miRNAs can directly bind to the 3′UTR of transcription factors (TFs), which can in turn activate or repress miRNA expression (Vishnubalaji et al., 2022; Khotib et al., 2023). For example, miR-181b affects the expression of miR-21 through the TF FOS, a critical signaling protein for glioma progression (Tao et al., 2013); miR-660-5p controls the expression of miR-486-5p via mouse double minute 2 (MDM2) and p53 (also known as TP53) in a study of lung cancer (Borzi et al., 2017). A recent review summarizes numerous examples of miRNA->TF (TF regulated by miRNAs) and TF->miRNA (miRNAs regulated by TF) interactions in various cancers, demonstrating the importance of the interaction between miRNA and pluripotent TFs in determining the occurrence of human cancers (Vishnubalaji et al., 2022). All these examples provide important clues for understanding the role of the TF-mediated miRNA functional network in tumor regulation.
In this study, we constructed a TF-mediated miRNA interaction network using public databases and demonstrated that the ERRmiR features were relatively insensitive to batch effects in multi-center studies. We then adopted a genetic algorithm in the feature screening process to avoid the curse of dimensionality; the algorithm had a great capacity for selecting markers with stable performance in developing diagnostic models. Lastly, we used three independent examples involving plasma and tissue samples to investigate the predictive performance.

Construction of the miRNA interaction network
The TF-mediated miRNA-miRNA interaction network was constructed by combining the data of miRNA->TF and TF->miRNA relationships. If miRNA_a regulated a TF that in turn regulated miRNA_b, miRNA_a was assumed to be able to influence miRNA_b, and they were connected in the miRNA interaction network.
The regulatory network datasets were collected from several public databases. The experimentally validated miRNA-target pairs were collected from miRTarBase (Huang et al., 2020), among which 8,014 targets were recognized as TFs based on the hTFtarget (Zhang et al., 2020b) and AnimalTFDB (Hu et al., 2019) databases. The 1,266 records of TF-regulating precursor miRNAs were obtained from the TransmiR v2.0 database (Tong et al., 2019). Combining the miRNAs->TFs and TFs->miRNA datasets (here, -> denotes a regulatory relationship), a total of 51,770 pre-miRNA indirect interactions were obtained, and then, pre-miRNAs were mapped to mature miRNAs according to the miRBase genomic coordinates. Finally, the miRNA-miRNA interaction network was constructed based on the 75,507 unique records of the indirect interaction relationships.
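The composition rule can be illustrated with a minimal sketch. It assumes the two relationship tables are available as simple (source, target) pairs; the function name, the toy identifiers (miR-X, miR-Y, TF1, TF2), and the exclusion of self-loops are illustrative assumptions rather than details stated in the study.

```python
from collections import defaultdict


def build_mirna_network(mirna_to_tf, tf_to_mirna):
    """Compose miRNA->TF and TF->miRNA edges into indirect miRNA->miRNA edges.

    Both arguments are iterables of (source, target) pairs. The result maps
    each (regulating miRNA, regulated miRNA) pair to the set of mediating TFs.
    """
    # Index the TF->miRNA table by TF for fast lookup.
    regulated_by_tf = defaultdict(set)
    for tf, mirna in tf_to_mirna:
        regulated_by_tf[tf].add(mirna)

    network = defaultdict(set)
    for mirna_a, tf in mirna_to_tf:
        for mirna_b in regulated_by_tf.get(tf, ()):
            if mirna_a != mirna_b:  # assumption: self-loops are not used as pairs
                network[(mirna_a, mirna_b)].add(tf)
    return dict(network)


# Toy input; the first chain mirrors the miR-181b -> FOS -> miR-21 example.
mirna_to_tf = [("miR-181b", "FOS"), ("miR-X", "TF1"), ("miR-X", "TF2")]
tf_to_mirna = [("FOS", "miR-21"), ("TF1", "miR-Y"), ("TF2", "miR-Y")]
network = build_mirna_network(mirna_to_tf, tf_to_mirna)
```

Keeping the set of mediating TFs on each edge makes it possible to report which TFs link a selected miRNA pair.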

Predictive feature generation
The features were generated by calculating the expression ratio for each miRNA pair in the reconstructed miRNA interaction network. miRNAs with an expression value smaller than 100 were filtered out to ensure stable detection. The feature constructed with the connected pair of miRNA_a and miRNA_b was denoted by ERRmiR(a,b) and calculated as follows:

$$\mathrm{ERRmiR}(a,b) = \frac{E_a}{E_b + 1}$$

where E_a and E_b denote the expression values of miRNA_a and miRNA_b in the same sample, and 1 was added to the denominator to avoid division by 0.
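A minimal sketch of the feature computation for a single sample follows. It assumes the ratio takes the form E_a / (E_b + 1), as the description of adding 1 to the denominator suggests, and it applies the expression filter within each sample. The function name and the per-sample filtering are assumptions; the study does not state how the threshold was applied across samples.

```python
def errmir_features(expression, edges, min_expression=100.0):
    """Compute ratio features for one sample.

    expression: dict mapping miRNA name -> expression value in this sample.
    edges: iterable of (miRNA_a, miRNA_b) pairs from the interaction network.
    Returns a dict mapping (a, b) -> E_a / (E_b + 1).
    """
    # Assumption: low-abundance miRNAs are dropped per sample.
    kept = {m: v for m, v in expression.items() if v >= min_expression}
    return {
        (a, b): kept[a] / (kept[b] + 1.0)
        for a, b in edges
        if a in kept and b in kept
    }


sample = {"miR-a": 200.0, "miR-b": 399.0, "miR-c": 50.0}
features = errmir_features(sample, [("miR-a", "miR-b"), ("miR-a", "miR-c")])
```

In this toy sample, the pair involving miR-c is dropped because miR-c falls below the threshold.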

Data collection and pre-processing
The robustness of ERRmiR features was investigated on the datasets using different library preparation kits [GSE133719 and GSE141658 datasets on the Gene Expression Omnibus (GEO) (Clough and Barrett, 2016) database], and then, datasets for three different disease categories were collected to construct the predictive models, namely, COVID-19, renal cell carcinoma (RCC), and lung adenocarcinoma (LUAD) projects [the NCI's Genomic Data Commons (GDC) (Jensen et al., 2017) database and GEO; Table 1]. We searched the GEO database to identify datasets meeting the following criteria: both disease and control groups had sample sizes of 10 or more, and each disease had at least 2 datasets with consistent sample types and sequencing platforms. From the filtered datasets, representative validation sets were chosen, which included viral respiratory infections and cancers caused by non-viral mechanisms. These sets were selected to ensure diversity in etiology, validation centers, and sample types, thereby supporting the generalizability of our feature selection model across different populations. The miRNA expression matrices in the CPTAC (Edwards et al., 2015)/TCGA (Hutter and Zenklusen, 2018) database were downloaded using the GDC tool, and the annotation and quantification were performed using exceRpt (Rozowsky et al., 2019) to obtain the expression matrices of miRNAs. To compare the results among different datasets, read counts were uniformly converted to reads per million mapped reads (RPM). In the COVID-19 project, the plasma of persons with non-severe symptoms (mild patients and healthy individuals) was categorized as the controls, and the plasma of those with serious symptoms was used as the disease samples. In the RCC and LUAD projects, normal tissues were categorized as the controls, and primary tumor tissues were used as the disease samples.
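The RPM conversion can be sketched as follows. The function name is illustrative, and using the sum of the supplied counts as the number of mapped reads when no total is given is an assumption made only for this example.

```python
def to_rpm(counts, total_mapped_reads=None):
    """Convert raw read counts of one sample to reads per million (RPM).

    counts: dict mapping miRNA name -> raw read count.
    total_mapped_reads: library size; if omitted, the sum of the supplied
    counts is used (an assumption for this sketch).
    """
    if total_mapped_reads is None:
        total_mapped_reads = sum(counts.values())
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    return {m: c * 1_000_000 / total_mapped_reads for m, c in counts.items()}


rpm = to_rpm({"miR-a": 250, "miR-b": 750})
```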

Feature screening and classification modeling
In each project, the dataset with the most samples was used as the discovery dataset and divided into a training set and a test set at a ratio of 0.75:0.25. In the training dataset, univariate analysis of the ERRmiR features was performed to obtain the expression fold change in disease samples against the controls and the FDR-adjusted p-value. The "sklearn-genetic" package was adopted to screen the optimal subsets of features. The features with higher appearance frequencies in the optimal subsets were selected as targets for the disease.
The "scikit-learn" package was used to build models for disease classifications. The learning curves were used to detect whether the estimator was overfitting during model training. The trained model was validated on a test set and the external validation datasets for each project.
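The two univariate quantities can be sketched in plain Python. The study does not state which statistical test or which FDR procedure was used, so the Benjamini-Hochberg adjustment below is an assumption, and the function names are illustrative.

```python
def fold_change(disease_values, control_values):
    """Ratio of the mean feature value in disease samples to that in controls."""
    mean_disease = sum(disease_values) / len(disease_values)
    mean_control = sum(control_values) / len(control_values)
    return mean_disease / mean_control


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


fc = fold_change([2.0, 4.0], [1.0, 2.0])
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```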

Statistical analysis and visualization
The quartile plots of miRNA expression/ERRmiR feature values were drawn using the Matplotlib tool. The p-values and FDR-corrected q-values were calculated using SciPy. The miRNA network was visualized using Pyvis and seaborn tools. In miRNA pathway enrichment analyses, target genes of miRNAs were first identified through the TarBase database using the multiMiR package in R language, and then, pathway enrichments were performed using clusterProfiler.

The schematic of ERRmiR signature generation and screening
We developed a screening method for the generation of the ERRmiR signature based on machine learning (Figure 1). We first constructed the miRNA interaction network by integrating several databases, including miRTarBase, hTFtarget, AnimalTFDB, and TransmiR v2.0. We then calculated the expression ratios of related miRNA pairs as ERRmiR features. The discovery dataset was randomly divided into training and test sets, and the features were filtered in the training set using univariate analyses according to the fold change of the mean expressions between two groups. We used a genetic algorithm to screen the features, and those with higher frequencies in the screening processes were selected as candidate markers. The trained model was validated on the test set within the same screening dataset and evaluated on external validation datasets. This approach was suitable for discovering biomarkers for various samples.
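The frequency-based selection step can be sketched as a simple tally over repeated screening runs. The function name and the toy feature labels are illustrative assumptions.

```python
from collections import Counter


def top_features_by_frequency(selected_subsets, k=3):
    """Count how often each feature appears across runs and return the top k.

    selected_subsets: one iterable of selected feature names per run of the
    screening algorithm. Returns a list of (feature, frequency) pairs.
    """
    counts = Counter(f for subset in selected_subsets for f in set(subset))
    return counts.most_common(k)


# Three hypothetical runs of the screening step.
runs = [["r1", "r2"], ["r1", "r3"], ["r1", "r2", "r4"]]
top_two = top_features_by_frequency(runs, k=2)
```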

Construction of the miRNA interaction network
We constructed the miRNA interaction network based on indirect interactions mediated by TFs. The interactions indicated that the expression of one miRNA induced the activation or inhibition of other miRNAs. miR-183-5p is taken as an example to show how miRNAs regulate other miRNAs through TFs (Figure 2A). Here, the pentagram-labeled miR-183-5p is a regulatory miRNA, which regulates the square-labeled TF and further affects the round-labeled target miRNAs. The blue linkages represent the interaction of miR-183-5p acting on the TF, and the pink linkages represent the effects of TFs on other miRNAs. The complete miRNA-miRNA interaction network contained 75,507 unique records of indirect interaction relationships among 2,196 miRNAs (Supplementary Table S1). The degree distribution and topological parameters indicated that the miRNA interaction network has canonical scale-free and small-world characteristics (Figures 2B, C).

Characterization of ERRmiR signatures
To verify the hypothesis that the expression ratios between the interacting miRNAs would be stable across multi-center data, the distribution of ERRmiR values was compared with the distribution of the miRNA expression levels of the same samples (Figure 3). The sequencing datasets of the peripheral blood CD8+ T cells in triplicate from rheumatoid arthritis (RA) patients and healthy controls were generated by different library construction methods. The quartile plots showed that the original miRNA expression data generated by different library preparation kits had significant variance on the scale and distributions (Figure 3A), while the variation of ERRmiR features decreased (Figure 3B), which demonstrated the potential of ERRmiR features as batch-insensitive markers. We presented three application examples from various sample types and diseases.
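One way to see why a within-sample ratio can be less sensitive to scale differences is a short worked calculation. It is an illustration under two assumptions that the study does not state: the ratio has the form E_a / (E_b + 1), and the batch effect acts as a single multiplicative factor c > 0 on all expression values of a sample. Real batch effects are generally more complex.

```latex
\[
\frac{c\,E_a}{c\,E_b + 1}
  \;=\; \frac{E_a}{E_b}\cdot\frac{1}{1 + \dfrac{1}{c\,E_b}},
\qquad
c\,E_b \ge 100
\;\Longrightarrow\;
\frac{100}{101} \;\le\; \frac{1}{1 + \dfrac{1}{c\,E_b}} \;<\; 1 .
\]
```

Under these assumptions, whenever the observed value c E_b passes the expression filter of 100, the feature stays within about 1% of E_a / E_b, which does not depend on c.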

Prediction of COVID-19 patients with severe symptoms using plasma ERRmiR signatures
The advantage of ERRmiR features was first investigated on the dataset of COVID-19 plasma samples. GSE178246 was randomly divided into a training and test set, and GSE176498 was used as the external validation set. The 42 ERRmiR features were obtained by conducting the genetic algorithm 100 times on the screening dataset. As shown in Figure 4A, the frequency distribution of the ERRmiR appearance was very steep: 3 ERRmiRs had frequencies greater than 10, and the highest frequency was up to 60. We selected the top three high-frequency features as markers and tested them on the validation set. As expected, they were significantly different between the serious and non-serious groups (p < 0.05) and showed consistent trends across multiple datasets (Figure 4B; Supplementary Table S2). The TFs intervening between the ERRmiR pairs for the top three high-frequency features were mothers against decapentaplegic homolog 4 (SMAD4), PR domain zinc finger protein 1 (PRDM1), and forkhead box O3 (FOXO3). To confirm the batch-insensitive nature of the ERRmiR features, the biomarker selection was also directly applied on the expression matrix of miRNAs. As shown in Figure 4C, the targets screened from the expression matrix of miRNAs lost effectiveness across the batches of data, with miR-1224-5p even showing opposite regulation trends (Supplementary Table S3). Based on the three high-frequency ERRmiR markers, the C-support vector classification (SVC) model that was established on the training set showed stable high performance on both the test set and validation dataset (Figure 4D). In contrast, the model built on the three high-frequency miRNAs had a high area under the curve (AUC) of 0.906 on the test set but performed worse on the independent validation set, with an AUC of 0.783 (Figure 4E). The performance verification data of the model are shown in Table 2. In addition, the five miRNAs of three ERRmiR markers were used for pathway enrichment (Figure 4F). Infection pathways of bacteria and viruses, including Salmonella infection and human papillomavirus infection, were significantly enriched.

Prediction of the renal cell carcinoma using tissue ERRmiR signatures
The method of marker discovery was also validated on the dataset of the RCC tissue samples. The CPTAC-RCC dataset was used for screening targets and building the model, and TCGA-KIRP and GSE109368 datasets were used for external validations. After conducting the genetic algorithm, we obtained 115 miRNA pairs (Figure 5A). We took the three most frequent ERRmiR features as biomarkers, which showed significant differences between the cancer and control groups (p < 0.05), with consistent regulation trends across multiple datasets (Figure 5B; Supplementary Table S4). The TFs intervening between the ERRmiR pairs for the top three high-frequency features were Jun proto-oncogene (JUN, AP-1 transcription factor subunit), hypoxia-inducible factor 1-alpha (HIF1A), nuclear factor erythroid 2-related factor 2 (NFE2L2), and zinc finger E-box-binding homeobox 2 (ZEB2). Of the miRNAs composing the ERRmiR markers, miR-221-3p and miR-221-5p were not significantly differentially expressed between the two sample groups in all the datasets (Figure 5C; Supplementary Table S5). A prediction model using the SVC algorithm was established on the training dataset and achieved high AUC values on both independent validation datasets (Figure 5D). The performance verification data of the model are shown in Table 3. The five miRNAs comprising the three ERRmiR markers were significantly enriched in several pathways associated with cancers (Figure 5E). In particular, the p53 signaling pathway and Hippo signaling pathway have been widely reported to be associated with RCC (Gurova et al., 2004; Guan et al., 2018).

FIGURE 1 Overview of the ERRmiR marker discovery process. The miRNA network was constructed based on the transcription factor (TF)-mediated interactions, and the ERRmiR features were calculated between the connected genes in the network. Target screening and model construction were performed based on the ERRmiR features of the screening dataset and verified on the validation dataset.

Prediction of lung adenocarcinoma using tissue ERRmiR signatures
In the LUAD project, the CPTAC-LUAD dataset was used to screen the ERRmiR features and build the model, and the GSE110907 and GSE196633 datasets were used for external validations. Thirty-one ERRmiRs were obtained by conducting the genetic algorithm, with a relatively flat frequency distribution, as shown in Figure 6A. We selected the three most frequent ERRmiR features as markers, which presented consistent trends of significant differences between the cancer and control groups (p < 0.05) across multiple datasets (Figure 6B; Supplementary Table S6). The TFs intervening between the ERRmiR pairs for the top three high-frequency features were nuclear factor kappa B (NFKB1), myocyte enhancer factor 2A (MEF2A), and Yin Yang 1 (YY1). The model constructed in the training set had AUC values of 0.995 and 0.91 in the GSE110907 and GSE196633 validation sets, respectively (Figure 6C). The performance verification data of the model are shown in Table 4. The five miRNAs of the three ERRmiR markers were significantly enriched in the p53 signaling, cell cycle, and PI3K-Akt pathways, which are widely reported to be associated with LUAD (Huang et al., 2022; Tang et al., 2022; Zhang et al., 2022) (Figure 6D).

Discussion and conclusion
The miRNA biomarkers have shown initial success in disease diagnosis and prognosis monitoring (Inoue and Inazawa, 2021); however, different sequencing library preparations can cause variances across different batches, making it difficult to use the normalization of expression matrices alone for multi-center applications. In this study, we included three types of miRNA-miRNA interactions (direct interactions, indirect interactions, and global interactions) summarized in a previous review (Hill and Tran, 2021) and considered the indirect miRNA interactions mediated by TFs. Coordinated with an integrated screening method utilizing the genetic algorithm, we demonstrated the effectiveness of this strategy at tissue and plasma levels in three datasets and demonstrated its capacity for universal usage in developing diagnostic and classification models. The generalizability of our findings across diverse datasets was demonstrated through validation in multiple datasets from various sources, encompassing different diseases such as COVID-19, RCC, and LUAD. Specifically, our approach has been validated in two distinct datasets for COVID-19, three separate datasets for RCC, and three different datasets for LUAD. This multi-dataset validation strategy enhances the robustness and reliability of our findings by confirming the consistency and effectiveness of our method across different disease types and data sources.

FIGURE 3 Quartile plots of miRNA expression (A) and log2 ratios of every two miRNAs (B) for each sample. Each plot was represented with the median (a solid point), the 0.25 quartile, and the 0.75 quartile of the distribution.
Biomedical big data are often described as "small sample size in high-dimensional space," indicating a scenario where the sample size is small but each individual possesses a high-dimensional feature set (Zeng et al., 2016). High-dimensional data often lead to the curse of dimensionality, resulting in less reliable predictive analysis. Feature selection is a crucial step in classification tasks as retaining important features through feature selection can expedite the training process and address the curse of dimensionality. Genetic algorithms can be utilized to select the optimal subset of candidate features through crossover and mutation operations. As an embedded feature selection method, its feature selection process requires integration with machine learning models, such as the SVC model selected in this study. Tournament selection, the most popular selection strategy in genetic algorithms, was chosen in this study due to its advantages including lower complexity, less susceptibility to local optima, and ease of parallelization compared to other selection methods like roulette wheel selection (Shukla et al., 2015).
The miRNA possesses various mechanisms to regulate its generation and expression. In addition to directly targeting primary or precursor miRNA to modulate miRNA generation, it can also regulate the expression of TF-mRNA by directly binding to its 3′UTR. Simultaneously, TFs can induce or inhibit miRNA expression (Zhang et al., 2020a; Vishnubalaji et al., 2022). Moreover, many miRNAs may synergistically drive molecular changes, resulting in greater effects than individual miRNAs alone (Bertero et al., 2014). Among these interactions, the miRNA->TF->miRNA interaction represents one significant mechanism, as two TF-mediated miRNAs may involve multiple related TFs, and parts of miRNAs are influenced by more than one miRNA. The feature selection method employed in this study can identify robust differential miRNA features relevant to dataset diseases from numerous interactions, aiding in the more accurate identification of potential biological mechanisms and disease-related molecular biomarkers.
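The interplay of tournament selection, crossover, and mutation can be sketched with a small self-contained genetic algorithm over binary feature masks. This is an illustration only and is not the "sklearn-genetic" implementation used in the study. All function names, parameter defaults, and the toy fitness function are assumptions; in practice, the fitness would be a cross-validated score of a classifier such as SVC trained on the masked features.

```python
import random


def tournament(population, fitness, size, rng):
    """Pick the fittest of `size` randomly drawn individuals."""
    return max(rng.sample(population, size), key=fitness)


def crossover(parent_a, parent_b, rng):
    """Single-point crossover of two feature masks."""
    point = rng.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]


def mutate(mask, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return tuple(bit ^ (rng.random() < rate) for bit in mask)


def genetic_select(n_features, fitness, n_population=30, n_generations=40,
                   tournament_size=3, mutation_rate=0.05, seed=0):
    """Search for a binary feature mask (n_features >= 2) maximizing `fitness`."""
    rng = random.Random(seed)
    population = [tuple(rng.randint(0, 1) for _ in range(n_features))
                  for _ in range(n_population)]
    best = max(population, key=fitness)
    for _ in range(n_generations):
        children = []
        for _ in range(n_population):
            p1 = tournament(population, fitness, tournament_size, rng)
            p2 = tournament(population, fitness, tournament_size, rng)
            children.append(mutate(crossover(p1, p2, rng), mutation_rate, rng))
        population = children
        # Keep the best mask seen so far across generations.
        best = max(population + [best], key=fitness)
    return best


# Toy fitness: number of positions agreeing with a hypothetical ideal mask.
ideal = (1, 0, 1, 0, 0, 1, 0, 0)
toy_fitness = lambda mask: sum(1 for x, y in zip(mask, ideal) if x == y)
best_mask = genetic_select(len(ideal), toy_fitness, seed=1)
```

Repeating such a search with different seeds and tallying how often each feature is selected corresponds to the frequency-based ranking described in this study.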
Previous studies have also explored the use of target ratios as features. Research has investigated the creation of novel features based on the relative expression order of genes within samples, achieving insensitivity to batch effects and facilitating the development of diagnostic models across different platforms, including sequencing and microarray data (Yang et al., 2020). Furthermore, studies have employed microarray data from two cohorts to generate new features based on the size relationship of miRNA pairs, converting quantitative data into qualitative data via a binarization process, albeit leading to notable information loss (Liu et al., 2021). Researchers identified 93 miRNAs showing significant differential expression among healthy controls and adenoma and colorectal cancer groups. They computed ratios between these 93 miRNAs in all possible pairwise combinations, resulting in 2,529 ratios. Among these, 36 miRNA ratios were found to exhibit significant differences in colorectal cancer samples compared to healthy controls and adenoma samples (Zhang et al., 2018a). The combination of two negatively correlated miRNAs may offer substantial potential for distinguishing experimental groups. Differential expression miRNA pairs were derived through the subtraction of the original Ct value of one miRNA from the Ct value of another miRNA, providing a novel avenue for biomarker discovery through self-normalization (Matthaei et al., 2012). Previous studies on miRNA ratio features have not systematically explored all miRNAs, whether by generating ratio features first and then screening for differences or by initially calculating differences in miRNAs and subsequently deriving ratios. Some studies validate only within different batches from the same center, without validating samples from different centers. Furthermore, these studies have often overlooked the potential biological significance inherent in these ratios, particularly miRNA-miRNA interactions. Our feature selection method is robust, effectively mitigates batch effects, and is applicable across datasets sourced from diverse centers. Moreover, it has the ability to identify biologically pertinent miRNA pairs, thereby establishing the groundwork for acquiring robust and high-performing biomarkers.
Using this protocol, we discovered some miRNAs with biological significance in all three examples. let-7b-5p, which is a selected marker for predicting severe COVID-19, plays a role in regulating ACE2 and DPP4 receptors and is significantly downregulated in nasopharyngeal swabs of patients (Latini et al., 2022); miR-21-3p, which is regulated by let-7b-5p, shows an upregulation trend in this project and is consistent with the previous experiments of mice infected with COVID-19 (Nersisyan et al., 2020). miRNA-mediated transcription factors are closely associated with COVID-19. Modulating the SMAD signaling pathway to enhance Robo4 expression holds promise in alleviating vascular permeability and mortality in COVID-19 (Cheema et al., 2021). FOXO transcription factors play crucial roles in maintaining normal cellular physiology by regulating survival, apoptosis, oxidative stress, and the development and maturation of T and B lymphocytes. The activation of FOXO can be utilized as a strategy to mitigate inflammatory outbreaks following SARS-CoV-2 infection (Morita et al., 2023). The ERRmiR marker selected in the RCC project is found to be a critical oncogene in previous studies. The high expression of miR-106b-3p may be an important factor in predicting poor prognosis in RCC patients (Li et al., 2016; Liu et al., 2019), and the overexpression of miR-214-5p attenuates cell proliferation and metastasis (Guo et al., 2021). Upregulation of miR-200c-3p inhibits proliferation (Li et al., 2019). miRNA-mediated transcription factors are closely associated with RCC. HIF1A is upregulated in RCC tissues and closely correlated with tumor size and differentiation (Chen et al., 2021). The positive expression of ZEB1 is associated with poor prognosis in RCC patients (Harb et al., 2018).
In the LUAD project, the pairs containing miR-30a-3p or miR-30c-2-3p have been screened out. The role of the miR-30 family as tumor suppressors has been validated in previous reports (Saleh et al., 2019); in particular, miR-30c-2-3p is reported to inhibit tumor progression in esophageal squamous cell carcinoma, breast cancer, and hepatocellular carcinoma (Zhang et al., 2018b; Ma et al., 2018; Zhang et al., 2019). miR-9-5p and miR-503-5p, which are related to miR-30 in the ERRmiR markers, have also been reported to be associated with cell proliferation, migration, and invasion in non-small cell lung cancer (Sun et al., 2017; Zhu et al., 2021). miRNA-mediated transcription factors are closely associated with lung cancer. The expression of NFKB is related to the tumor stage, lymph node metastasis, and 5-year survival rate in lung cancer (Zhang et al., 2023). YY1 is upregulated in lung cancer tissues, and its higher expression correlates with larger tumor size, poor differentiation, higher TNM stage, and lymph node metastasis. The ectopic expression of YY1 in lung cancer cells promotes cell proliferation and invasion, while YY1 silencing suppresses cell proliferation and induces apoptosis (Huang et al., 2017; Zhu et al., 2023). These miRNAs and TFs are disease-related and have been validated in previous studies. The results further suggest that the proposed approach is helpful in exploring the pathogenic mechanisms of diseases.
We propose an algorithm based on the expression ratio of interacting miRNAs for feature selection. The features selected by this method are stable, capable of removing batch effects, and contribute to data standardization and consistency, providing a basis for obtaining high-performing stable biomarkers. Moreover, our method can identify biologically relevant miRNA pairs, further deepening the understanding of disease pathogenesis. The algorithm relies on prior knowledge about miRNA interactions, effectively reducing the dimensionality of features, alleviating the pressure of feature selection, and facilitating the discovery of true relationship markers. Although we validated the feasibility of our feature selection method across datasets from three different diseases, our study has certain limitations. Validation for each disease was confined to a single detection platform. We did not investigate the impact of platform transition on differential targets, such as the stability of differential targets between NGS and qPCR platforms. Therefore, further consideration of platform migration is necessary to ensure the robustness and applicability of our method across diverse settings. We will continue our efforts to address these limitations through rigorous validation and method refinement to fully realize its potential in clinical practice.
In conclusion, this study introduces an innovative feature selection algorithm based on the expression ratio of interacting miRNAs. Leveraging prior knowledge about miRNA interactions, the algorithm effectively reduces feature dimensionality, easing the burden of feature selection and aiding in the discovery of genuine relationship markers. This approach not only ensures stable feature selection, eliminating batch effects and facilitating data standardization and consistency, but also identifies biologically relevant miRNA pairs, thereby enhancing our understanding of disease pathogenesis. The value of this method lies in providing a robust tool for the discovery of stable and biologically relevant biomarkers, offering new avenues and methodologies for early disease diagnosis and treatment.

FIGURE 2 Illustration of the miRNA interaction network. (A) TF-mediated miRNA-miRNA indirect interactions. Pentagrams denote the regulating miRNAs, squares denote the TFs, and circles denote the regulated miRNAs. (B) Degree distribution of the miRNA interaction network followed a power-law tail. (C) Topological characteristics of the interaction network. TF, transcription factors.

FIGURE 4 Analysis of ERRmiR features in the COVID-19 project. (A) Occurrence frequencies of the ERRmiR features in the 100-time genetic algorithm. (B, C) The top three high-frequency ERRmiRs (B), unlike the individual miRNAs (C), showed a relatively stable regulatory trend in both datasets. ROC curves of the models based on ERRmiR markers (D) and miRNA markers (E). (F) Pathway enrichment analysis of miRNAs involved in ERRmiR markers showed the top 20 enriched pathways.

FIGURE 5 ERRmiR markers discovered in the renal cell carcinoma (RCC) project. (A) The frequency distribution of the ERRmiR features. Violin plots of the top three high-frequency ERRmiR features (B) and the composite miRNAs (C) among the three independent datasets. (D) ROC curves of the model based on the ERRmiR markers. (E) Pathway enrichment analysis of miRNAs in the ERRmiR markers.

FIGURE 6 Discovery results on the lung adenocarcinoma (LUAD) project. (A) Statistics of the frequencies of ERRmiR features. (B) Violin plots of ERRmiR features ranked top three by frequency. (C) ROC curves of the models based on ERRmiR markers. (D) Pathway enrichment analysis of miRNAs involved in the ERRmiR markers.

TABLE 1
Sample information in three projects. The datasets with the most samples were selected as discovery (lines colored by gray background), and the other datasets were used as validation (lines colored by white background) for each project. For GSE178246, each sample has four pieces of sequencing data, which were treated as four cases.

TABLE 2
Comparison of miRNAs and ratio feature model performance on the COVID-19 data.

TABLE 3
Model performance on the renal cell carcinoma (RCC) data.

TABLE 4
Model performance on the lung adenocarcinoma (LUAD) data.