Machine learning and weighted gene co-expression network analysis identify a three-gene signature to diagnose rheumatoid arthritis

Background Rheumatoid arthritis (RA) is a systemic immune-related disease characterized by synovial inflammation and destruction of joint cartilage. The pathogenesis of RA remains unclear, and diagnostic markers with high sensitivity and specificity are needed urgently. This study aims to identify potential biomarkers in the synovium for diagnosing RA and to investigate their association with immune infiltration. Methods We downloaded four datasets containing 51 RA and 36 healthy synovium samples from the Gene Expression Omnibus database. Differentially expressed genes were identified using R. Then, various enrichment analyses were conducted. Subsequently, weighted gene co-expression network analysis (WGCNA), random forest (RF), support vector machine–recursive feature elimination (SVM-RFE), and least absolute shrinkage and selection operator (LASSO) were used to identify the hub genes for RA diagnosis. Receiver operating characteristic curves and nomogram models were used to validate the specificity and sensitivity of hub genes. Additionally, we analyzed the infiltration levels of 28 immune cells in the expression profile and their relationship with the hub genes using single-sample gene set enrichment analysis. Results Three hub genes, namely, ribonucleotide reductase regulatory subunit M2 (RRM2), DLG-associated protein 5 (DLGAP5), and kinesin family member 11 (KIF11), were identified through WGCNA, LASSO, SVM-RFE, and RF algorithms. These hub genes correlated strongly with T cells, natural killer cells, and macrophage cells as indicated by immune cell infiltration analysis. Conclusion RRM2, DLGAP5, and KIF11 could serve as potential diagnostic indicators and treatment targets for RA. The infiltration of immune cells offers additional insights into the underlying mechanisms involved in the progression of RA.


Introduction
Rheumatoid arthritis (RA) is a systemic autoimmune disease characterized by chronic inflammation, proliferation of synovial membranes, and cartilage destruction, which has a serious impact on the physical and mental health of patients (1).Although RA does not directly lead to the mortality of patients, its systemic inflammatory damage can affect the function of organs such as the heart, lungs, and kidneys, reducing the quality of the patient's life (2,3).The pathogenesis of RA is complex and involves multiple factors such as genetics, environment, and metabolism.Moreover, the exact mechanisms associated with these factors and RA have not yet been systematically determined (4,5).According to recent research, different types of immune cells, such as B cells, T cells, and macrophages, are closely associated with the development of RA (6).Other immune cells, including natural killer (NK) cells, mast cells, and dendritic cells (DCs) also play an important role in the development or advancement of RA (7)(8)(9).
Currently, studies on the treatment and pathogenesis of RA are increasing, but there is still a lack of highly specific and sensitive biomarkers for its early diagnosis.Bioinformatics is a discipline that combines biology, mathematics, and information technology and plays a prominent role in disease detection, biomarker identification, highrisk patient identification, and so on (10).Weighted gene co-expression network analysis (WCGNA) is a common method of identifying disease biomarkers and treatment targets.Machine learning algorithms, a subset of artificial intelligence that allows computers to learn from data and predict genes associated with disease, are also widely used in research (11).In our study, bioinformatics and three machine learning algorithms were comprehensively applied to integrate and analyze multiple expression datasets.This approach allowed the identification of highly sensitive and specific biomarkers and treatment targets and would provide new directions for subsequent experimental research.In this study, the expression matrix of four synovium samples was downloaded, the intersection genes were obtained by difference analysis and WGCNA, and then hub genes were identified by least absolute shrinkage and selection operator (LASSO), support vector machine-recursive feature elimination (SVM-RFE), and random forest (RF) machine learning algorithms, and their diagnostic efficiency was validated.Additionally, we analyzed the infiltration levels of 28 immune cells in the expression profile and their relationship with hub genes using single-sample gene set enrichment analysis (ssGSEA).

Data collection and preprocessing
The steps in the analysis of the entire research are shown in Figure 1.First, we obtained gene expression datasets of RA synovial samples (GSE77298, GSE55235, GSE12021, and GSE55457) from the Gene Expression Omnibus (GEO) database (https:// www.ncbi.nlm.nih.gov/geo/)(12).These datasets included 87 synovial samples (36 normal control samples and 51 RA samples) (Table 1).GSE55457 was used as an external validation dataset, The flowchart depicting the investigation procedure.GSEA, gene set enrichment analysis; GSVA, gene set variation analysis; WGCNA, weighted gene coexpression network construction analysis; DEGs, differentially expressed genes; ssGSEA, single-sample gene set enrichment analysis; GO, Gene Ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes; LASSO, least absolute shrinkage and selection operator; RF, random forest; SVM-RFE, support vector machine-recursive feature elimination; ROC, receiver operating characteristic curve; TFs, transcription factors; miRNAs, microRNAs; DCA, decision curve analysis.
whereas the other datasets were merged and normalized for data analysis as a training set using the sva package (13).Common genes across each dataset were identified for further analysis.

Identification of differentially expressed genes and enrichment analyses
Differentially expressed genes (DEGs) were identified using the limma package with |log Fold Change (FC)| ≥ 1 and P-value < 0.05 used as the cutoff for filtering the DEGs (14).DEGs were visualized using a heatmap and volcano map obtained by using pheatmap and ggplot2 packages.Gene Ontology (GO) enrichment analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis were conducted with a cutoff of P < 0.05 (15).A gene set variation analysis (GSVA) was performed using the GSVA R package to calculate a normalized enrichment score under the background of the hallmark gene set (c2.cp.kegg.v7.2) with the thresholds of the P-value and false discovery rate (FDR) set as 0.05 and 0.25, respectively (16).We also used GSEA to identify the biological attributes and functions of all genes in the training set by using clusterProfiler in the R package with significant thresholds selected as P-value < 0.05 and FDR < 0.25 (17).

Construction of the coexpression network
The WGCNA package was used to construct a weighted gene co-expression network (18).The samples were organized into clusters to identify outliers.Then, pairwise correlations were calculated between genes and a weighted adjacency matrix was constructed using a soft thresholding power b.The hierarchical clustering method was used to construct the clustering tree structure of the TOM(Topological overlap matrix).Different branches of the cluster tree represented different gene modules whcic were screened by different colours.To establish a link between modules and clinical characteristics, estimations of module membership (MM) and gene significance (GS) were computed.The modules with the highest Pearson coefficient and P < 0.05 were used to select the candidate hub genes under the criterion of MM > 0.8 and GS > 0.5.

Identifying hub genes
The Venn package was used to obtain intersecting DEGs and WGCNA candidate hub genes.LASSO logistic regression analysis was conducted using the R package glmnet with the optimal minimal lambda identified.Our study validated the selection of optimization parameters through 10-fold cross-validation, ensuring that the partial likelihood deviation satisfied the minimum criteria.The e1071 package was used to conduct the SVM-RFE with fivefold cross-validation, and the RF algorithm of the RF package was used to analyze the intersection genes.Ultimately, hub genes were obtained by identifying the overlapping genes derived from the three machine learning methods using a Venn diagram.

Constructing nomogram model and validation of hub genes
A nomogram for predicting RA was constructed using the rms package (19).The predictive power of the nomogram model was assessed using a calibration curve.A decision curve was used to assess the clinical utility of the nomogram model.A receiver operating characteristic (ROC) curve was created using the R package pROC function to determine the diagnostic value of the hub genes and the nomogram model for RA in the training and validation sets.

Correlation between immune cell infiltration and hub genes
The relative infiltration levels of 28 immune cells in the training set were quantified using the ssGSEA algorithm (20).Barplots were used to show the differential expression levels of 28 immune-infiltrating cells.Spearman correlations of 28 immune-infiltrating cells with hub genes were calculated and then visualized using the ggplot2 package.

Co-expression network of identified hub genes
GeneMANIA (https://genemania.org)was used to create a hub gene co-expression network (21).

Transcription factors and microRNAs associated with the three hub genes
The JASPAR (https://jaspar.elixir.no/)database was used to find the transcription factors (TFs) that frequently bind to the three hub genes.MicroRNA (miRNAs) that interact with the hub genes were obtained from an online platform MirTarbase (https://mirtarbase.cuhk.edu.cn/).

Suppressive potential of small molecules in RA
We accessed the DSigDB (version 1.0) database through the Enrichr platform and obtained the top 10 small molecules that could suppress the expression of hub genes.

Statistical analysis
The statistical software R version 4.3.4 was used to perform statistical analysis, with P-values of < 0.05 indicating statistical significance.

DEG identification
A total of 575 DEGs (including 383 upregulated genes and 192 downregulated genes) were identified between RA and normal samples.The top 10 upregulated and downregulated DEGs are presented in a volcano plot (Figure 2A).In addition, the expression levels of the 25 most upregulated and 25 most downregulated genes are shown in a heatmap (Figure 2B).

Functional enrichment analysis
DEGs in the BP category of GO were enriched in mononuclear cell differentiation, leukocyte cell-cell, and immune responseregulating cell surface receptor signaling pathways.In the MF category, DEGs were mostly related to antigen binding, immune receptor activity, and chemokine activity.In the CC category, DEGs were mostly assigned to the external side of the plasma membrane and clathrin-coated vesicle membrane (Figure 3A).KEGG pathway analysis showed that DEGs were enriched in cytokine-cytokine receptor interaction, chemokine signaling pathway, and RA; GSEA analysis produced similar results (Figure 3B).GSEA was used to depict the signal pathways involved in RA.The top five pathways enriched by DEGs were the chemokine signaling, cytokine-cytokine receptor interaction, intestinal immune network for Immunoglobulin A (IgA) production, RA, viral protein interactions with cytokines, and cytokine (Figure 3C).In the RA group, the GSVA results of enriched DEGs also indicated that immunity and inflammation pathways, such as chemokine signaling, NK cell-mediated immunity, B-cell receptor signaling, primary immunodeficiency, and intestinal immune network for IgA production were evident (Figure 3D).

WGCNA construction and hub module identification
Samples in the training set were clustered using the WGCNA package.Subsequently, the unscaled connectivity index was determined, and an average connectivity analysis was conducted.

A B
Identifications of RA hub genes.When the soft threshold b = 8, the network reached an unscaled topological threshold of 0.9 (Figure 4A).By dynamic tree cutting and calculation, 11 gene modules were obtained (Figure 4B).Correlation analysis was performed between the 11 modules and the normal and RA groups, resulting in a correlation heatmap (Figure 4C).The salmon module had the strongest correlation with the RA (r = 0.73, P < 0.001) and was identified as the key module for RA.Based on filtering criteria, we identified 17 candidate hub genes in the salmon module (Figure 4D).

Screening of hub genes
By intersecting the DEGs and candidate hub genes, 13 intersection genes were obtained (Figure 5A).The 13 intersection genes were then submitted into three machine learning algorithms including LASSO, SVM-RFE, and RF.LASSO resulted in four hub genes (Figures 5B, C), SVM identified five hub genes (Figure 5D), and RF identified seven hub genes (Figures 5E, F).Finally, we obtained three hub genes ribonucleotide reductase regulatory subunit M2 (RRM2), DLG-associated protein 5 (DLGAP5), and kinesin family member 11 (KIF11) by intersecting the three machine learning results (Figure 5G).

Constructing the nomogram model and validation
A nomogram model was then constructed using the three hub genes in the training set to predict the risk of RA (Figure 6A).The nomogram model was found to have the best predictive and clinical efficiency for RA by calibration curves (Figure 6B) and decision Frontiers in Immunology frontiersin.orgcurve analysis (Figure 6C), respectively.The area under the ROC curve (AUC) of the nomogram model and three hub genes were also calculated (Figures 6D, E).Next, we constructed a validation set using all the procedures, which showed a perfect match with the resultsin the training set.To further validate this result, we obtained RNA-seq data on the synovium of patients with RA and osteoarthritis (OA) uploaded to GitHub by Shanghai Guanghua Hospital (23) and analyzed the expression levels of key genes in the data.Consistent with this study, we found that the expression levels of three hub genes in RA synovium were significantly higher than in OA synovium (P < 0.05) (Figure 7).

Correlation between the immune cell infiltration and hub genes
The distribution of 28 immune cells in the training set is demonstrated in Figure 8A.In our results, a significantly higher infiltration of activated CD4 T cells, activated B cells, and activated DC infiltration was found in RA, indicating the important role that they play in the disease (Figure 8B).Correlation analysis of the 28 immune cells with hub genes demonstrated that various T cells, B cells, NK cells, and macrophages were positively correlated with the three hub genes (Figure 8C).

Function analysis of hub genes
To analyze the biological functions of the identified hub genes, we constructed a comprehensive gene interaction network using data from the gene MANIA database (Figure 9A).This network comprised physical interactions, co-expression relationships, predicted interactions, co-localization patterns, genetic interactions, pathway interactions, and shared protein domains.Our findings indicate that the hub genes are primarily associated with mitotic nuclear division, spindle, and microtubule cytoskeleton organization involved in mitosis and spindle organization.Furthermore, to discern the specific biological roles of the three hub genes, we conducted an enrichment analysis.In Figures 9B-D, we illustrate the most enriched terms in the CC, BP, and MF analyses of GO terms.Additionally, Figures 9E-G depict the most significant pathways based on data from the Reactome, Wiki Pathway, and KEGG databases, respectively.

Identification of regulatory signatures
The interplay between the three hub genes and TF regulators is depicted in Figure 10A, whereas the relationships between the hub genes and miRNA regulators are illustrated in Figure 10B.In total, we identified 18 TFs and 17 miRNAs as regulatory signatures by analyzing TF-gene and miRNA-gene interaction networks.(F) Expression level of hub genes according to datasets from GitHub by Shanghai Guanghua Hospital.DCA, decision curve analysis; ROC, receiver operating characteristic; AUC, area under the ROC curve (based on GSE55457).

Discovery of potential small molecules
We generated potential small-molecule findings based on odds ratios.Table 2 presents the top 10 small molecules that could potentially target the hub genes sourced from the DSigDB database.

Discussion
RA is a chronic inflammatory disease that currently lacks early diagnostic indicators (6).Recent studies have highlighted the close association of various immune cells, such as B cells, T cells, and macrophages, with the pathogenesis of RA (24).Therefore, the exploration of new diagnostic biomarkers and their relationship with immune cell infiltration patterns holds significant implications for advancing our understanding of RA pathophysiology.To address this, we gathered four RA synovial microarray datasets from the GEO database and identified 575 DEGs between RA and healthy controls.Enrichment analyses, GO, KEGG, GSEA, and GSVA revealed a robust correlation between RA and the immune response.
Fibroblast-like synoviocytes (FLSs) are the most abundant cells of the stroma and a key population in RA.Recent research indicates that the interaction between RA FLSs and infiltrating immune cells

A B
The cohesive regulatory interaction network of three hub genes and TFs and miRNAs obtained from the Network Analyst.Our findings revealed that macrophages play an important role in the infiltration of immune cells in the synovium.In contrast to tissueresident macrophages, infiltrating macrophages may originate from various monocyte subpopulations in the blood and possess a high level of adaptability.For instance, in mice, they can arise from classical Ly6C+ or patrolling Ly6C − monocytes (26, 27).In a recent comprehensive analysis of immune cell status in patients with RA, single-cell RNA-seq, bulk RNA-seq, and mass spectrometry flow cytometry were used to identify 18 distinct synoviocyte populations, including four monocyte/macrophage populations denoted as SC-M1 to SC-M4 (28).This analysis demonstrated that the activation of different cytokines promoted the expansion of diverse macrophage subpopulations in the RA synovium.Furthermore, as the primary orchestrators of the immune response, DCs can secrete chemokines that facilitate the activation of inflammatory T cells, thereby attracting proinflammatory immune cells such as macrophages and neutrophils (29)(30)(31).In vitro, RA synovial DCs have the potential to induce regulatory T (Treg) cells through the prolonged engagement of programmed cell death 1 receptors (32, 33).Although Treg cells in the peripheral blood of patients with RA retain inhibitory capacity, this function is compromised in local Treg cells, suggesting that the inflammatory cytokine environment may contribute to Treg cell dysfunction (34).
Through the use of WGCNA and machine learning algorithms, we identified DLGAP5, RRM2, and KIF11 as potential diagnostic markers for RA.RRM2 plays a critical role in controlling the production of deoxyribonucleotides, which is essential for DNA repair and synthesis (35).Blocking RRM2 has a substantial impact on reducing cellular growth and triggering cell death (36,37).Recently, other studies have demonstrated that RRM2 could increase the levels of apoptosis and inhibit the proliferation of RA-FLSs by regulating transforming growth factor-b (TGF-b) and IL-6 (38).
Although several bioinformatics methodologies have been used to investigate potential biomarkers for RA, there is limited literature regarding the involvement of DLGAP5 in the pathophysiology of this condition (39,40).Previous investigations have examined the structure and function of DLGAP5 across various species, considering both physiological and clinicopathological perspectives.These studies have revealed that DLGAP5 plays a crucial role in facilitating cell growth, proliferation, and migration (41,42).Therefore, this presents an opportunity to investigate in further detail the potential of DLGAP5 in diagnosing and differentially diagnosing RA, as well as its role in the pathophysiology of the disease.
KIF11 encodes a motor protein belonging to the kinesin-like protein family, which is recognized for its involvement in diverse spindle dynamics.The role of the gene product encompasses chromosome positioning, centrosome separation, and the establishment of a bipolar spindle during cell mitosis (43).However, there is limited literature on the role of KIF11 in the RA joint microenvironment.Therefore, in this study, KIF11 along with the two other hub genes performed a diagnosis of RA with optimal sensitivity and specificity.
This study has several limitations.First, the dataset obtained from the GEO database lacks comprehensive patient information, including serological and imaging indicators.As a result, we were unable to evaluate the correlation of biomarkers or immune cells with clinical characteristics such as hematological indicators, degree of joint destruction, and treatment status in patients with RA.More detailed data are necessary for the further exploration of the clinical significance of the biomarkers.Second, the biomarker discovery was based on the GEO database.Despite the satisfactory performance of our biomarkers in both test and validation datasets, additional in

Conclusion
Using LASSO, SVM-RFE, and RF algorithms in conjunction with bioinformatic analyses, we identified a three-gene signature (RRM2, DLGAP5, and KIF11) implicated in the progression of RA.Immune infiltration analyses revealed that the identified hub genes exhibited the strongest correlation with various T cells, B cells, NK cells, and macrophages.To confirm our identification of diagnostic markers with high sensitivity and specificity for RA, prospective large-sample investigations with experimental validation should be conducted.

5
FIGURE 5 Screening of hub genes.(A) Venn diagram for overlapped genes between DEGs and WGCNA.(B) The LASSO regression partial likelihood deviance with changing log (l) plotted in 10-fold cross-validations.Dotted vertical lines were drawn at the optimal values using the minimum criteria (lambda.min)and 1 standard error of the minimum criteria (1-SE criteria).(C) The LASSO coefficient profiles for four hub genes in the 10-fold cross-validation.The intersection of (D) five gene signatures was identified by SVM-RFE analysis.(E) Prediction accuracy of the RandomForest model.(F) Seven gene signatures were identified by RandomForest analysis.(G) Overlapped genes obtained from the LASSO, SVM-RFE, and random forest algorithms.DEGs, differentially expressed genes; WGCNA, weighted gene co-expression network analysis; LASSO, least absolute shrinkage and selection operator; SVM-REF, support vector machinerecursive feature elimination.

6
FIGURE 6 Nomogram model construction for RA diagnosis.(A) Nomogram to predict RA risk.(B) Calibration curve evaluation for the diagnostic potential of the nomogram model.(C) DCA curve to assess the nomogram practical efficacy.(D) ROC analysis of the model.(E) ROC analysis of three hub genes.DCA, decision curve analysis; ROC, receiver operating characteristic; AUC, area under the ROC curve (based on GSE77298, GSE55235, and GSE12021).

7
FIGURE 7 Nomogram model construction for RA diagnosis.(A) Nomogram to predict RA risk.(B) Calibration curve evaluation for the diagnostic potential of the nomogram model.(C) DCA curve to assess the nomogram practical efficacy.(D) ROC analysis of the model.(E) ROC analysis of three hub genes.(F) Expression level of hub genes according to datasets from GitHub by Shanghai Guanghua Hospital.DCA, decision curve analysis; ROC, receiver operating characteristic; AUC, area under the ROC curve (based on GSE55457).

8
FIGURE 8 Analyses of the OA-related immunological environment.The distribution of 28 different types of immune cells in healthy control and OA synovial tissues is shown in a heatmap (A) and a violin plot (B).(C) The association between immune cells infiltration and four hub genes.****, ***, **, * represented P<0.0001, P<0.001, P<0.01, P<0.05.
(A) Genes and TFs.(B) Genes and miRNAs.Herein, the diamond nodes are TFs; the square node indicates miRNAs; gene symbols as circle nodes.TF, transcription factors; miRNAs, microRNAs.

TABLE 1
Information of datasets obtained from GEO.

TABLE 2
Top 10 small-molecule drugs for RA. and in vivo experiments are required to validate our findings and determine the mechanisms underlying significant immunological changes during RA. vitro