Identification and Analysis of Blood Gene Expression Signature for Osteoarthritis With Advanced Feature Selection Methods

Osteoarthritis (OA) is a complex disease that affects articular joints and may cause disability. The incidence of OA is extremely high. Most elderly people have the symptoms of osteoarthritis. The physiotherapy of OA is time consuming, and the chances of full recovery from OA are very minimal. The most effective way of fighting OA is early diagnosis and early intervention. Liquid biopsy has become a popular noninvasive test. To find the blood gene expression signature for OA, we reanalyzed the publicly available blood gene expression profiles of 106 patients with OA and 33 control samples using an automatic computational pipeline based on advanced feature selection methods. Finally, a compact 23-gene set was identified. On the basis of these 23 genes, we constructed a Support Vector Machine (SVM) classifier and evaluated it with leave-one-out cross-validation. Its sensitivity (Sn), specificity (Sp), accuracy (ACC), and Mathew's correlation coefficient (MCC) were 0.991, 0.909, 0.971, and 0.920, respectively. Obviously, the performance needed to be validated in an independent large dataset, but the in-depth biological analysis of the 23 biomarkers showed great promise and suggested that mRNA surveillance pathway and multicellular organism growth played important roles in OA. Our results shed light on OA diagnosis through liquid biopsy.


INTRODUCTION
Osteoarthritis (OA) is a complex disease that affects articular joints and may cause disability (Appleton, 2017). In the USA, 14 million people have symptomatic knee osteoarthritis (KOA) (Vina and Kwoh, 2017). Approximately 10-20% adult have OA (Bay-Jensen et al., 2018). Although OA is considered a disease primarily for the elderly, nowadays, more than half of patients with OA are under 65 years old. More and more young people show the symptoms of OA. The physiotherapy of OA is time consuming, and the chances of full recovery from OA are very minimal (Nelson, 2017). The most effective way of fighting OA is early diagnosis and early intervention. However, usually at early stage when OA is treatable, the patients often ignore the symptoms and are reluctant to go to the doctor for consultation (Nelson, 2017). When OA becomes serious, it is too difficult to treat this illness.
Blood is a vehicle for mRNAs from different tissues (Budd et al., 2017). It has been widely used for the early detection of various cancers  and predictions of drug responses (Huang et al., 2008;Zhang et al., 2012). As a complex disease, the occurrence and development of OA involves changes to the mRNA (Steinberg et al., 2017). The blood flow under the subchondral bone (Aaron et al., 2018) may carry the signal of OA (Fotouhi et al., 2018). It can be detected when the mRNA level changes in blood (Budd et al., 2017). If so, then the detection of OA will be much easier and more accurate. In fact, there have been several studies of blood biomarkers for OA (Ramos et al., 2014;Feng et al., 2015;Ahmed et al., 2016;Bay-Jensen et al., 2018;Costa-Cavalcanti et al., 2018). For example, Ramos et al. demonstrated that the mRNA expression of apoptotic pathways was significantly different in the blood of patients with OA (Ramos et al., 2014). Bay-Jensen et al. reported the use of biochemical markers for OA, which measured the turnover of joint tissue or the inflammatory status (Bay-Jensen et al., 2018).
To quantify the cartilage turnover, several discovered biomarkers were used, such as PIIANP, CTX-II, ARGS, COMP, and C2C. In serum, PIIANP and CTX-II were found to be associated with OA progression by Osteoarthritis Initiative (OAI) Study of FNIH (Foundation for the National Institutes of Health; Kraus et al., 2017). ARGS was found to be associated with pain in anterior cruciate ligament injury patients (Wasilko et al., 2016). COMP was highly expressed in synovial fluid of patients with OA (Lorenzo et al., 2017). C2C was significantly different among patients with OA with no sign of cartilage damage, early signs of OA, and radiographic OA, and it was highly expressed in the patients with radiographic OA (Schaefer et al., 2017). In addition, there were biomarkers for synovial inflammation and fibrosis, such as C1M, C3M, and CRPM. They were positively correlated with elderly symptomatic OA (Martel-Pelletier et al., 2016).
Unfortunately, many of these biomarkers were for synovial fluid and most of them were only differentially expressed. Such qualitative biomarkers cannot be used in clinical settings directly, and for this reason, a blood biomarker-based quantitative classifier was the ideal model.
To build such a useful model, we reanalyzed a publicly available dataset from Ramos et al. (2014), which included the blood gene expression profiles of 106 patients with OA and 33 control samples with advanced feature selection methods, such as minimal redundancy maximal relevance (mRMR) and incremental feature selection (IFS), instead of a conventional statistical test. We identified 23 blood gene expression biomarkers. On the basis of these 23 genes, we constructed a Support Vector Machine (SVM) classifier and evaluated its performance with Leave-One-Out Cross Validation (LOOCV). The sensitivity (Sn), specificity (Sp), accuracy (ACC), and Mathew's correlation coefficient (MCC) were 0.991, 0.909, 0.971, and 0.920, respectively. In addition, we performed in-depth biological analysis of the 23 biomarkers. They were involved in the mRNA surveillance pathway and multicellular organism growth. Not only was a quantitative classifier constructed, but also the underlying mechanisms of OA occurrence and progression were revealed.

The Blood Gene Expression Profiles of Osteoarthritis and Control Samples
We downloaded the blood gene expression profiles of 106 OA and 33 control samples from the Gene Expression Omnibus (GEO) database under the accession number of GSE48556 (Ramos et al., 2014). The gene expression levels were measured using Illumina HumanHT-12 V3.0 expression beadchip. There were 48,802 probes corresponding to 25,159 genes. The probes representing the same gene were averaged, and the gene expression profiles of OA and control samples were quantilenormalized.
Unlike Ramos's study (Ramos et al., 2014), which identified 694 genes with adjusted p-value smaller than 0.05 using linear regression analysis and then narrowed down the genes to a short list using functional annotation, we aimed to develop an automatic analysis pipeline that minimized human intervention and avoided the hand-picking during biomarker selection. Despite the great performance achieved by Ramos et al. (2014), we believe that there are other actionable biomarkers which may function in a different way and we are trying to find them with advanced feature selection methods.

Mutual Information-Based Feature Ranking
Identifying the phenotype-associated features is one of the basic problems in bioinformatics, and for different problems, there are different solutions (Huang et al., 2008;Cai et al., 2010;Zhang et al., 2012Zhang et al., , 2015Zhang et al., , 2016Zhang et al., , 2017Li et al., 2014;Chen et al., 2018a;Wang et al., 2018). For identifying differentially expressed genes (DEG), the most widely used methods are the t-test, significance analysis of microarrays (SAM; Tusher et al., 2001), and linear regression as performed by Ramos et al. (2014). However, usually such statistics-based methods will identify too many DEG than we require. The redundancy between DEG is extremely high. Many genes have very similar expression patterns.
Unlike DEG, we needed a smaller number of signature genes that can be applied in clinical settings. Therefore, we adopted a mutual information-based method, i.e., mRMR (Peng et al., 2005), which has been widely used in feature ranking (Niu et al., 2013;Zhao et al., 2013;Zhou et al., 2015;Zhang et al., 2016;Li and Huang, 2017;Liu et al., 2017). It considers both the relevance between features and sample labels and the redundancy among features and has been proven to be an effective feature selection method, especially for gene expression analysis (Qin et al., 2012;Zhang et al., 2014bZhang et al., , 2017Zhang et al., , 2018Zhang Y. et al., 2014;Li et al., 2015;Zhou et al., 2015;Wang et al., 2016;Song et al., 2017;Chen et al., 2018b). The method works like this: let us use to denote all the 25,159 genes, s to denote the selected gene set that includes m genes, and g to denote the n genes that will be evaluated, and one of them will be selected. First, the relevance of gene g from g with sample labels l was measured using mutual information (I) (Sun et al., 2012;Huang and Cai, 2013): As the mutual information can only be calculated between categorical variables, the expression levels of each gene were discretized with the thresholds of mean minus standard deviation and mean plus standard deviation. Then, the redundancy of gene g with selected gene set s was quantified: 1 m g i ∈ s I(g, g i ) As we wanted to maximize the relevance and minimize the redundancy, the optimization goal can be characterized as follows and the best gene form g will be selected: (3) After n rounds of optimization, a ranked gene list S = g 1 ′ , g 2 ′ , . . . , g r ′ , . . . , g N ′ was obtained. The top ranked genes had strong relevance to OA but little redundancy among each other. In the next step, we further optimized the top 300 mRMR genes and got the final OA biomarker.

Osteoarthritis Biomarker Optimization
Although the mRMR method can rank genes effectively, it is still unknown how many genes should be finally selected as the OA biomarker. Therefore, we applied a greedy method called incremental feature selection (IFS) Li et al., 2014;Shu et al., 2014;Zhang N. et al., 2014a;Huang et al., 2015;Zhang et al., 2015;Chen et al., 2018a) to optimize the number of signature genes. In this method, too few genes may miss the important information and too many genes may introduce noise. During the IFS procedure, different numbers of genes were tried and their performances were evaluated. As there were too many combinations and the mRMR have already ranked the genes meaningfully, the mRMR genes were tested sequentially, i.e., in the r rounds, g 1 ′ , g 2 ′ , . . . , g r ′ were tested. For each round, an SVM classifier was constructed based on the selected genes and its performance was evaluated through LOOCV. We used the R function SVM from package e1017 with default parameters and kernel of radial to build the SVM classifier.
To have a complete measurement of the prediction performance, four statistics, which were the sensitivity (Sn), specificity (Sp), accuracy (ACC), and Matthew's correlation coefficient (MCC), were calculated: In Equations (4-7), TP, TN, FP, and FN were the number of true OA, true control, false OA, and false control samples, respectively.
On the basis of IFS results, we can determine how many genes should be chosen finally as the OA biomarker to achieve the best performance. As the numbers of OA samples and control samples were not balanced, the MCC was used as the main measurement for classification performance.

The Osteoarthritis-Associated Genes Selected and Ranked Based on the mRMR Method
To identify the OA-associated genes, we used the mRMR method that can select and rank genes based on their relevance with FIGURE 3 | The Venn diagram of our 23 genes and the 27 genes from Ramos et al. (2014). There were four overlapped genes, ADRB2, H3F3B, PELO, and ZNF20, between the 23 osteoarthritis biomarker genes we identified and the 27 genes from Ramos et al. (2014). To evaluate the significance of overlap, we calculated the hypergeometric test p-value and odds ratio, which were 9.18e-09 and 229.87, respectively. The overlap was very significant. OA and their redundancy with other genes. The top 300 most discriminative genes for OA were selected and ranked using the mRMR method. These 300 mRMR genes will be further optimized using the IFS method.

The Osteoarthritis Biomarker Optimization Based on the IFS Method
As a ranked gene list, the top 300 mRMR genes included the candidate OA biomarker genes. However, we still did not know how many genes should be finally selected. To optimize OA biomarker selection, we tried different number of top genes and calculated their prediction performance. On the basis of these performances, we plotted an IFS curve, as shown in Figure 1, in which the x-axis was the number of genes and the y-axis was the LOOCV MCC of the SVM classifier. It can be seen that when the top 23 mRMR genes were used, the MCC was the highest, i.e., 0.920. Meanwhile, the sensitivity, specificity, and accuracy of the 23-gene classifier were 0.991, 0.909, and 0.971, respectively. The 23 genes are listed in Table 1. The confusion matrix of the predicted and actual sample classes is given in Table 2.
We compared our 23 genes with the 27 genes from Ramos et al. (2014) and plotted the Venn diagram, as shown in Figure 3. There were four overlapped genes: ADRB2, H3F3B, PELO, and ZNF20. We evaluated the significance of overlapping using the hypergeometric test. The p-value was 9.18e-09 and the odds ratio was 229.87. The overlap between our 23 genes and the 27 genes from Ramos et al. (2014) was very significant.

The Functional Analysis of the Optimal Osteoarthritis Biomarker
We did functional enrichment analysis of 23 OA biomarker genes using Metascape (Tripathi et al., 2015). The Gene Ontology (GO) results are shown in Figure 4. The enriched GO terms were GO:0032200: telomere organization, GO:1903829: positive regulation of cellular protein localization, GO:0010389: regulation of G2/M transition of mitotic cell cycle, and GO:0010951: negative regulation of endopeptidase activity.
There have been many studies about the relationship between telomere length and OA (Kuszel et al., 2015;Wiwanitkit, 2017). OA is a typical geriatric disease and the telomere length becomes shorter and shorter during aging. In patients with OA, the shortening of telomeres was accelerated (Kuszel et al., 2015). H3F3B, UPF1, and GNL3L were involved in GO:0032200: telomere organization.
The dysfunctional regulation of cellular protein localization in OA was reasonable. Osteoarthritis is a joint disease and the gap junctional communication is regulated by the extracellular FIGURE 5 | The PPI network of the 23 osteoarthritis biomarker genes. The 23 osteoarthritis biomarker genes formed two PPI clusters: the APP cluster that included APP, RNF34, TNFSF14, CEP250, and MLLT6, and the GNL3L cluster that included GNL3L, UPF1, TAOK1, ADRB2, and H3F3B. Frontiers in Genetics | www.frontiersin.org signal pathway (Niger et al., 2009). APP, TNFSF14, CEP250, and GNL3L were involved in GO:1903829: positive regulation of cellular protein localization.
There have been many theories about cell cycle and OA. Franke et al. found that during the pathogenesis of OA, advanced glycation end products (AGEs) influence osteoarthritic fibroblast-like synovial cells through inducing cell cycle arrest (Niger et al., 2009). de Andrés et al. discovered that the demethylation of an NF-κB enhancer can induce OA by regulating the cell cycle (de Andrés et al., 2016). APP, CEP250, and TAOK1 were involved in GO:0010389: regulation of the G2/M transition of the mitotic cell cycle.
It is known that several endogenous peptides have strong inflammatory effects in the joint and they are regulated by endopeptidase (Solan et al., 1998). Therefore, the genes from GO:0010951: negative regulation of endopeptidase activity, such as APP, TNFSF14, and RNF34, may play regulatory roles in OA.

The Protein Interactions Between the Optimal Osteoarthritis Biomarkers
The protein-protein interaction (PPI) between the optimal OA biomarker was derived from the STRING database (https://string-db.org/) and is shown in Figure 5. STRING is a comprehensive database that integrates protein functional associations from multiple sources, such as experiment and literature (Szklarczyk et al., 2015). From Figure 5, we can see that APP, RNF34, TNFSF14, CEP250, and MLLT6 formed a cluster and GNL3L, UPF1, TAOK1, ADRB2, and H3F3B formed another cluster.
Basically, the functions of the APP cluster that included APP, RNF34, TNFSF14, CEP250, and MLLT6 were regulation of endopeptidase activity, cell cycle, and cellular protein localization, whereas the functions the GNL3L cluster that included GNL3L, UPF1, TAOK1, ADRB2, and H3F3B were involved in telomere organization and cellular protein localization. Common function that linked the two clusters was cellular protein localization, which indicated that the secretion of protein into extracellular synovia was the key processes of OA.

DISCUSSION
As a common geriatric disease, OA has extremely high incidence, especially in elder people. As the chances of full recovery from late-stage OA are minimal, the most effective way of fighting OA is early diagnosis and early intervention. As a popular noninvasive test, liquid biopsy showed great potential in cancer detection. To identify the blood gene expression signature for OA, we studied the blood gene expression profiles of 106 patients with OA and 33 control samples. With mRMR and IFS methods, we identified 23 genes whose sensitivity, specificity, accuracy, and Mathew's correlation coefficient were 0.991, 0.909, 0.971, and 0.920, respectively. The prediction performance was excellent. The biological function analysis of these 23 genes suggested that there were two pathways or PPI modules associated with OA through aging, cellular protein localization, and inflammation. These findings may be helpful for understanding OA.
There were still some disadvantages of this work. Here, we investigated only the gene expression levels. However, recent studies have suggested that the genome-wide association study (GWAS) and epigenetics approaches were also effective in OA mechanisms (Kerkhof et al., 2010;Panoutsopoulou et al., 2011;Rushton et al., 2014;Ramos and Meulenbelt, 2017;Simon and Jeffries, 2017). Integrating the genetic and epigenetic data with gene expression may provide a more comprehensive view of OA. We surveyed the identified genes based on one expression and found that the variant rs3815148 of COG5 was found to be associated with OA by GWAS reports (Kerkhof et al., 2010;Panoutsopoulou et al., 2011). Rushton et al. reported that the methylation status of MLLT6, TNFSF14, TAOK1, and MTSS1 was different between OA hip subtypes and LRRC33 was hypermethylated in OA hip than OA knee (Rushton et al., 2014). These results encourage us and others to do integrative studies of multiomics data in OA in future.

DATA AVAILABILITY STATEMENT
The datasets for this study can be found in the Gene Expression Omnibus [https://www.ncbi.nlm.nih.gov/geo/query/ acc.cgi?acc=GSE48556].