Text mining-based identification of promising miRNA biomarkers for diabetes mellitus

Introduction MicroRNAs (miRNAs) are small, non-coding RNAs that play a critical role in diabetes development. While individual studies investigating the mechanisms of miRNA in diabetes provide valuable insights, their narrow focus limits their ability to provide a comprehensive understanding of miRNAs’ role in diabetes pathogenesis and complications. Methods To reduce potential bias from individual studies, we employed a text mining-based approach to identify the role of miRNAs in diabetes and their potential as biomarker candidates. Abstracts of publications were tokenized, and biomedical terms were extracted for topic modeling. Four machine learning algorithms, namely Naïve Bayes, Decision Tree, Random Forest, and Support Vector Machines (SVM), were employed for diabetes classification. Feature importance was assessed to construct miRNA-diabetes networks. Results Our analysis identified 13 distinct topics of miRNA studies in the context of diabetes, and miRNAs exhibited a topic-specific pattern. SVM achieved a promising prediction for diabetes with an accuracy score greater than 60%. Notably, miR-146 emerged as one of the critical biomarkers for diabetes prediction, targeting multiple genes and signaling pathways implicated in diabetic inflammation and neuropathy. Conclusion This comprehensive approach yields generalizable insights into the miRNA-diabetes network and supports miRNAs’ potential as biomarkers for diabetes.


Introduction
Diabetes is a prevalent endocrine disease characterized by elevated blood glucose levels, which has rapidly grown in incidence and has become a global health concern (1). The long-term consequences of diabetes include both microvascular and macrovascular complications, which pose a severe threat to the health and well-being of individuals. Among diabetic patients, microvascular complications such as diabetic kidney disease, diabetic retinopathy, diabetic neuropathy, and diabetic foot are prevalent, while macrovascular complications such as cardiovascular disease can lead to fatal outcomes (2)(3)(4). The substantial impact of diabetes on morbidity, mortality, and quality of life places a significant burden on healthcare systems worldwide. MicroRNAs (miRNAs) are a class of small, non-coding RNAs that mediate post-transcriptional gene silencing (5). These regulatory molecules have emerged as crucial players in orchestrating cellular responses to physiological perturbations and disease conditions (6,7). Recent research has highlighted the pivotal role of miRNAs in the pathogenesis of diabetes and its associated complications (8). For instance, miRNAs have demonstrated indispensable roles in pancreatic beta cells, regulating their response to metabolic, genetic, and inflammatory stressors (9)(10)(11)(12). This underscores the importance of miRNAs in diabetes management and emphasizes their potential as therapeutic targets for the treatment of diabetes.
PubMed is a widely accessible research interface housing an impressive repository of approximately 35 million medical publications as of 2021, including a substantial number of studies investigating various aspects of diabetes. While individual studies investigating the mechanisms of miRNA in diabetes provide valuable insights, their narrow focus limits their ability to provide a comprehensive understanding of miRNAs' role in diabetes pathogenesis and complications. To achieve a more complete understanding of the complex interactions between miRNAs and diabetes, it is essential to adopt a comprehensive approach that considers the diverse range of factors involved. By utilizing text mining techniques to identify miRNA and diabetes-related literature, we can collect information from multiple studies and establish a more holistic perspective. This approach enables researchers to identify patterns and gaps in the literature and generate new hypotheses to guide future investigations. Ultimately, applying this comprehensive approach contributes to a more profound and thorough understanding of the role of miRNAs in diabetes and its associated complications.

Method
The overall strategy is shown in Figure 1.

PubMed corpus
PubMed was utilized to retrieve literature information for studies investigating miRNAs in the context of diabetes, spanning the period from January 1, 1993, to March 5, 2023. To assess the specificity and robustness of our machine learning model in distinguishing between diabetes-related and non-diabetes-related studies, two datasets were prepared: miRNA studies in diabetes, and miRNA studies outside diabetes (referred to as the negative control). The search criterion employed for the diabetes dataset was "miRNA" [Title/Abstract] AND "Diabetes" [Title/Abstract], while for the negative control dataset, it was "miRNA" [Title/Abstract] NOT "Diabetes" [Title/Abstract]. The following data were collected for this study: PMID, Publication Date, Publication Type, First Author, Journal Name, Literature Title, and Literature Abstract. The study included the publication types "Case Reports", "Clinical Study", "Clinical Trial", "Comparative Study", "English Abstract", "Evaluation Study", "Journal Article", "Letter", and "Preprint", while excluding "Retraction of Publication", "Published Erratum", "Editorial", "News", "Dataset", "Clinical trial protocol", "Review", and "Systematic review".
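As an illustration of the two searches described above, the following minimal sketch builds the corresponding query URLs for the NCBI E-utilities ESearch endpoint. The source does not state which interface was used to query PubMed, so the use of E-utilities, the `retmax` page size, and the helper name `build_esearch_url` are assumptions made for the example. No request is sent; the code only constructs the URLs.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# NCBI E-utilities ESearch endpoint (assumed interface; not named in the text).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Search terms as stated in the text.
DIABETES_TERM = '"miRNA"[Title/Abstract] AND "Diabetes"[Title/Abstract]'
CONTROL_TERM = '"miRNA"[Title/Abstract] NOT "Diabetes"[Title/Abstract]'


def build_esearch_url(term, mindate="1993/01/01", maxdate="2023/03/05", retmax=10000):
    """Return an ESearch URL restricted to a publication-date window."""
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",  # filter on publication date
        "mindate": mindate,
        "maxdate": maxdate,
        "retmax": retmax,    # illustrative page size, not from the source
        "retmode": "json",
    }
    return ESEARCH_URL + "?" + urlencode(params)


diabetes_url = build_esearch_url(DIABETES_TERM)
control_url = build_esearch_url(CONTROL_TERM)
```

Filtering by publication type is not part of this sketch; it would be applied to the retrieved records afterwards.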

Topic modeling
FIGURE 1 Study design and workflow.

Topic modeling is a statistical technique used in natural language processing to uncover hidden patterns and structures within a large corpus. The algorithm employed in the current study for topic modeling is Latent Dirichlet Allocation (LDA), which assumes that each document in the corpus is a mixture of multiple topics, and each topic is characterized by a distribution of words from the abstracts.
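The mixture assumption described above can be written as the standard LDA generative process. In this sketch, $D$ is the number of abstracts, $K$ the number of topics, $\theta_d$ the topic mixture of abstract $d$, $\phi_k$ the word distribution of topic $k$, and $w_{d,n}$ the $n$-th token of abstract $d$. The Dirichlet hyperparameters $\alpha$ and $\beta$ are part of the standard formulation; their values are not reported in the text.

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D \\
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1,\dots,K \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n} \mid z_{d,n}, \phi &\sim \mathrm{Categorical}\bigl(\phi_{z_{d,n}}\bigr)
\end{align*}
```

Fitting the model inverts this process: given only the observed tokens $w_{d,n}$, it estimates the per-document mixtures $\theta_d$ and the per-topic word distributions $\phi_k$.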

Biomedical term tagging
In this text mining study, the focus was placed on literature titles and abstracts due to the crucial information they contain and their widespread availability. To tag biomedical terms (including miRNAs) within the literature, two methods were employed: PubTator and Regex.
PubTator: PubTator is a web-based biomedical named entity recognition (NER) system (13) specifically designed for PubMed. This powerful tool is capable of tagging various entities within PubMed titles and abstracts, including genes (which include miRNAs), diseases, species, chemicals, cell lines, and mutations. The results generated by PubTator are saved in PubTator format, which can be accessed and read using Python for further analysis and exploration.
Tagging miRNA: The miRNA tagging process involved tokenizing the titles and abstracts while preserving hyphenated words, and then vectorizing them. Regex was utilized to detect miRNAs from the tokens, taking advantage of the well-defined and closely followed nomenclature of miRNA. This nomenclature typically consists of a prefix, such as "miR" (or "miRNA", "micro-RNA", "microRNA"), followed by a unique identifying number, which is assigned based on sequence order (e.g., "miR-1"), with the exceptions let-7 and lin-4 retaining their names for historical reasons. Additionally, the name might include a suffix such as "-a", "-1", "-3p", or "-5p". It may also include a prefix that denotes the species. A regular expression was employed to identify miRNAs in different formats from the tokens of the retrieved publication titles and abstracts, with both lower and upper case considered (14), and the matches were further validated by PubTator gene term labeling. miRNAs extracted from different publications may represent the same miRNA under different naming conventions, e.g., miR-155, microRNA-155, miR155, miR-155-5p, and others. To ensure accuracy in downstream analysis, miRNAs in different formats were converted into a standardized format, disregarding any extensions such as species prefix, -3p/-5p, or genomic suffixes. This approach allows the focus to be on the core miRNA name. Furthermore, to avoid introducing bias, each miRNA was counted only once within any given literature title/abstract.
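The tagging and standardization steps above can be sketched as follows. The pattern `MIRNA_RE` is a hypothetical expression written from the nomenclature rules in the text and is not the authors' actual regular expression. Dropping the letter suffix (so that "miR-146a" becomes "miR-146") is an assumption, chosen to be consistent with the core names reported in the results.

```python
import re

# Hypothetical pattern built from the nomenclature described in the text.
MIRNA_RE = re.compile(
    r"""
    (?<![A-Za-z0-9])                 # do not start inside a longer token
    (?:[a-z]{3,4}-)?                 # optional species prefix, e.g. "hsa-"
    (?:
        (?:micro-?RNA|miRNA|miR)     # family prefix variants
        -?(?P<num>\d+)               # identifying number, hyphen optional
      |
        (?P<legacy>let-7|lin-4)      # historical names kept as they are
    )
    (?P<letter>[a-z])?               # optional letter suffix, e.g. "a"
    (?:-\d+)?                        # optional genomic-copy suffix, e.g. "-1"
    (?:-[35]p)?                      # optional arm suffix, "-3p" or "-5p"
    (?![A-Za-z0-9])                  # the token must end here
    """,
    re.IGNORECASE | re.VERBOSE,
)


def normalize(match):
    """Reduce a match to its core name, dropping prefix and suffixes."""
    if match.group("legacy"):
        return match.group("legacy").lower()
    return "miR-" + match.group("num")


def tag_mirnas(text):
    """Return the set of standardized miRNA names mentioned in `text`.

    Returning a set means each miRNA is counted once per title/abstract.
    """
    return {normalize(m) for m in MIRNA_RE.finditer(text)}
```

For example, `tag_mirnas("hsa-miR-155-5p, microRNA-155 and miR155")` yields a single entry for miR-155, which mirrors the once-per-abstract counting rule.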

Train-test data set splitting and machine learning analysis
To address the issue of imbalanced data, the diabetes studies were paired with an equal number of non-diabetes studies for model prediction. Four machine learning models were employed, namely Naïve Bayes (NB), Decision Tree (DT), Random Forest (RF), and Support Vector Machines (SVM). The performance of these models was evaluated using the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC). Holdout validation was used to evaluate the models against the ground-truth labels.
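A minimal sketch of the class balancing, holdout split, and AUC evaluation described above, using only the Python standard library. The split fraction, the random seed, and the function names are illustrative assumptions; the text does not report them. The AUC is computed from its rank interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one.

```python
import random


def balance_and_split(pos, neg, test_fraction=0.2, seed=0):
    """Undersample to equal class sizes, then make one holdout split.

    `pos` and `neg` are lists of feature records for diabetes and
    non-diabetes studies. `test_fraction` and `seed` are illustrative
    values, not taken from the source.
    """
    rng = random.Random(seed)
    n = min(len(pos), len(neg))
    # Equal number of studies per class, each tagged with its label.
    data = [(x, 1) for x in rng.sample(pos, n)]
    data += [(x, 0) for x in rng.sample(neg, n)]
    rng.shuffle(data)
    n_test = int(round(len(data) * test_fraction))
    return data[n_test:], data[:n_test]  # (train, test)


def roc_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute AUC")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a reference scale for the values reported in the results.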

Pathway signaling
To explore a comprehensive set of functional annotations of the hub genes, Kyoto Encyclopedia of Genes and Genomes (KEGG) signaling pathway analysis was performed with the gene list using NetworkAnalyst 3.0 (https://www.networkanalyst.ca/). A P-value of < 0.05 was considered significant.
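Pathway over-representation of this kind is commonly scored with a one-sided hypergeometric test. The sketch below shows that calculation. Whether NetworkAnalyst computes its P-values in exactly this way is an assumption here, and the function name and arguments are hypothetical.

```python
from math import comb


def enrichment_p_value(n_background, n_pathway, n_query, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_background: number of genes in the background set
    n_pathway:    background genes annotated to the pathway
    n_query:      size of the submitted gene list
    n_overlap:    query genes that fall in the pathway
    """
    upper = min(n_pathway, n_query)
    total = comb(n_background, n_query)
    tail = sum(
        comb(n_pathway, i) * comb(n_background - n_pathway, n_query - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total
```

A pathway would then be called significant when this probability falls below the 0.05 threshold stated above, before any multiple-testing correction.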

Results

MiRNA discovery history in diabetes
A total of 1,818 miRNA studies in the context of diabetes were retrieved from PubMed, spanning the period from January 1, 1993, to March 5, 2023 (see Method). The investigation of miRNAs in diabetes began in 2006, with an initial focus in 2009 on specific miRNAs, namely miR-342, miR-191, and miR-510 (15). As shown in Figure 2A, research into miRNAs in diabetes has increased markedly since 2018 and continues to grow each year.

Topic modeling in miRNA studies of diabetes
To investigate the frequently discussed themes in miRNA studies in diabetes research, we performed topic modeling analysis. After applying the inclusion criteria (see Method), 1,798 studies were selected for topic modeling. A total of 13 distinct topics were identified from the miRNA studies in diabetes research (Figure 2B). Topic 11 accounted for the most publications among the 13 identified topics, followed by Topic 10, while Topic 7 had the fewest publications (Figure 2C). Furthermore, the heatmap indicates that all topics showed a marked increase after 2018 (Figure 2D).
The top 5 keywords with the highest weights in each of the 13 topics are shown in Figure 2E. Topic 1 was associated with inflammation and stroke, Topic 2 with metabolism, Topic 3 with pancreatic islet beta cells, Topic 4 with gene mutation (such as SNP), Topic 5 with detecting miRNA in serum, Topic 6 with diabetic kidney disease, Topic 7 with wound healing, Topic 8 with exosomes, Topic 9 with identifying miRNA as a biomarker, Topic 10 with relationships with other noncoding RNAs (such as circRNAs), Topic 11 with development, Topic 12 with insulin resistance, and Topic 13 with diabetic heart disease. Furthermore, miRNAs were identified for each topic (Figure 2F), and miRNAs tended to cluster by topic. For example, miR-146 was studied frequently in Topic 1 and Topic 4, while miR-126 was frequently associated with Topic 5.
To gain insight into the overall topics discussed in miRNA studies in both diabetes and non-diabetes research, word clouds were generated using the tokens from 1,798 abstracts. As expected, the most frequently occurring words in the diabetes abstracts included "miRNA", "expression", "diabetes", "patient", and "type". In contrast, in the non-diabetes literature, the most frequently used words were "miRNA", "expression", and "patient", which represent general concepts related to miRNAs but are not associated with any specific disease (Supplementary Figures 1C, D).

Comparison of different machine learning models predicting diabetes with miRNAs
To evaluate the predictive potential of miRNAs for diabetes, we utilized four machine learning models (NB, DT, RF, and SVM). These models were employed to predict diabetes based on the complete set of identified miRNAs. The dataset was split into training and testing datasets using the holdout method. The performance of each model was evaluated, and the results are presented in Table 1 and Figure 3A. Notably, the SVM model, with hyperparameters C = 10, gamma = "scale", and kernel = "sigmoid", outperformed the other models, achieving an accuracy score of 0.606 and an AUC of 0.64. Furthermore, we conducted feature importance analysis using the SVM model to identify the most important miRNAs in predicting diabetes. Figure 3B shows the key miRNAs that exert the greatest impact on the accuracy of diabetes prediction. Notably, miR-146, along with miR-29, miR-223, miR-375, miR-126, miR-320, miR-133, miR-17, miR-20, and miR-24, contributed most to the prediction. Among these miRNAs, miR-146 stands out as the most prominent, with an importance score of 0.0181, followed by miR-29 with an importance score of 0.01471 (Figure 3B). This observation indicates that miR-146 influences the overall performance and accuracy of the SVM model in predicting diabetes and motivates a closer examination of miR-146.
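The text does not state how the feature importance scores were derived from the SVM. One model-agnostic option that also works with non-linear kernels such as sigmoid is permutation importance: shuffle one feature column and record how much the accuracy drops. The sketch below illustrates that technique under this assumption. It is not necessarily the authors' procedure, and `predict` is a hypothetical stand-in for any fitted classifier.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy when a single feature column is shuffled.

    `rows` is a list of equal-length feature lists (for example, 0/1 flags
    for whether each miRNA is mentioned); `predict` maps one row to a label.
    """
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the label
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances
```

Under this scheme a feature the model ignores scores exactly zero, and small positive scores such as those reported above would correspond to small average losses in accuracy.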

Investigating the role of miR-146 in diabetes
The high importance score of miR-146 suggests that it plays a pivotal role in distinguishing between individuals with diabetes and those without the condition. A total of 34 diabetes studies focusing on miR-146 were identified and analyzed. A word cloud generated from these studies revealed that miR-146 is potentially involved in various diabetic complications, particularly inflammation and neuropathy (Figure 3C), which is consistent with the result observed in topic modeling (Figures 2E, F).
The analysis of individual studies on miR-146 in the context of diabetes yielded a heatmap that showcased the prevalence of several genes, namely IRAK1, TRAF6, IL-6, TNF-α, NUMB, EGFR, and TGFβ-1, among others. To further investigate the direct target genes of miR-146, an integrative analysis with the TargetScanHuman data set (16) was performed. The results revealed that IRAK1, TRAF6, NUMB, and STX3 are direct targets of miR-146 in the context of diabetes (Figures 3D, E). In addition, KEGG signaling pathways were constructed using the gene list obtained from the relevant diabetes literature. The resulting analysis highlighted several significant pathways, including the ErbB signaling pathway, Toll-like receptor signaling pathway, NF-κB signaling pathway, Insulin signaling pathway, and TNF signaling pathway (Figure 3F). These findings provide valuable insights into the potential mechanisms underlying the involvement of miR-146 in diabetes. The prevalence of specific genes and their direct targeting by miR-146, as well as the identification of relevant signaling pathways, contribute to a more comprehensive understanding of the role of miR-146 in the pathogenesis and progression of diabetes.

Discussion
In recent years, miRNAs have gained attention as potential biomarkers and therapeutic targets for various diseases, including diabetes (17)(18)(19). miRNAs play a crucial role in the regulation of multiple pathways implicated in diabetes pathogenesis, such as insulin secretion, insulin signaling, beta cell function, and glucose homeostasis. Dysregulation of miRNAs has been associated with the development of both type 1 and type 2 diabetes. The current study aimed to summarize our current understanding of miRNAs and their involvement in the development and progression of diabetes and its complications.
We identified 13 topic areas that highlight how extensively miRNAs have been studied across domains of diabetes research. Notably, miRNA studies encompassed important aspects such as diabetes biomarker research and the exploration of diabetes-related complications, including diabetic inflammation, diabetic cardiovascular disease, and diabetic kidney disease. One intriguing finding was the identification of miRNA-specific patterns within different domains of diabetes research. For instance, we observed that miR-34 garnered substantial attention in Topic 3, specifically relating to its impact on pancreatic islets/beta cells. This observation aligns with recent review studies, such as the one by Pasquale Mone et al. (20), which emphasized the significant role of miR-34 in regulating pancreatic islet/beta cell function in the context of diabetes. These findings shed light on the multifaceted involvement of miRNAs in diabetes research and emphasize their potential as valuable tools for understanding the pathogenesis and complications of diabetes. By uncovering miRNA-specific patterns in different domains, we provide researchers with valuable insights for further investigations and potential therapeutic interventions.
To unravel potential biomarkers, state-of-the-art machine learning classifiers were employed. Among the classifiers utilized, the SVM model performed best, achieving an accuracy score of 0.606. Notably, miR-146 emerged as the most important feature contributing to the accuracy of the prediction model. These findings underscore the potential significance of miR-146 as a key biomarker in the intricate landscape of diabetes. MiR-146 has been well documented in human disease (21), and it has emerged as a critical miRNA whose deregulation has been implicated in the pathogenesis of diabetes. Recent reviews have summarized the implication of miR-146 in type 1 diabetes and type 2 diabetes (22-24). Its significance is underscored by numerous studies that have examined the expression levels of miR-146 across various sample types, including whole blood, serum, PBMC, and plasma (25)(26)(27)(28)(29)(30)(31). These studies collectively demonstrate the involvement of miR-146 in diabetes and highlight its potential as a biomarker for disease detection and monitoring. In our analysis, we observed that miR-146 is involved in ErbB signaling, EGFR tyrosine kinase inhibitor resistance, the MAPK signaling pathway, endocrine resistance, Toll-like receptor signaling, TNF signaling, and the NF-κB signaling pathway, among others, consistent with published studies (22, 32-35). In addition to miR-146, our study identified several other miRNAs that contributed to varying degrees to the SVM-based diabetes prediction, including miR-29, miR-223, miR-375, miR-126, miR-320, and miR-133, among others. This suggests that these miRNAs may play a critical role in diabetes as well. For example, miR-29 is associated with Topics 2, 5, 8, and 9, miR-223 is studied in relation to Topic 9, and miR-375 demonstrates relevance to Topics 3 and 5 (Figure 2). These findings are consistent with and supported by previous review studies in the field (36)(37)(38)(39)(40)(41)(42).
The utilization of miRNAs as biomarkers and therapeutic targets has the potential to improve the management of diabetes and its associated complications. However, it is worth noting that while the SVM model demonstrated a notable accuracy score of 0.60, there is room for improvement by incorporating additional features such as genes and SNPs. These additional factors can provide a more comprehensive and nuanced understanding of the intricate mechanisms underlying miRNA's role in diabetes. Considering the multifaceted nature of diabetes, a comprehensive understanding of miRNA's role, coupled with the integration of additional features, can pave the way for personalized interventions and targeted therapies.
In conclusion, our study revealed a comprehensive understanding of the diverse areas of focus within miRNA research in the context of diabetes. Utilizing SVM with only miRNAs as inputs, we achieved promising results in diabetes prediction, particularly in identifying key miRNAs such as miR-146 as significant players in the context of diabetes. However, it is important to note that further confirmation through additional clinical investigations is necessary to validate and reinforce the findings of this study.

Data availability statement
The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author.