Identify Inflammatory Bowel Disease-Related Genes Based on Machine Learning

The number of patients with inflammatory bowel disease (IBD) is increasing worldwide. IBD is recurrent and difficult to cure, and it is also a high-risk factor for colorectal cancer (CRC). The occurrence of IBD is closely related to genetic factors, which prompted us to identify IBD-related genes. Based on the hypothesis that similar diseases are related to similar genes, we proposed an SVM-based method to identify IBD-related genes from disease similarities and gene interactions. We obtained 135 diseases that are similar to IBD, together with their related genes, which were treated as candidate IBD-related genes. We extracted features for each gene and applied the SVM to estimate the probability that it is related to IBD. Ten-fold cross-validation was used to verify the effectiveness of our method. The AUC is 0.93 and the AUPR is 0.97, the best among the four methods compared. We prioritized the candidate genes and conducted case studies on the top five genes.


INTRODUCTION
Inflammatory bowel disease (IBD) (Graham and Xavier, 2020) is an intestinal inflammatory disease with a high incidence worldwide, and is divided into Crohn's disease (CD) (Roda et al., 2020) and ulcerative colitis (UC) (Danese et al., 2020). It can lead to diarrhea, rectal bleeding, abdominal pain, and malnutrition. In addition, IBD is a major risk factor for colorectal cancer (CRC) (Cheng et al., 2020; Olén et al., 2020). The incidence rate of colon cancer in IBD patients is 18 times that of the general population. Given the increasing incidence of IBD worldwide and the huge medical burden it imposes, IBD has become a public health problem that urgently needs to be solved. Research on the pathogenesis of IBD (Mayer, 2010) is therefore of great significance, because it enables the development of drug targets for the prevention and treatment of IBD.
At present, the main causes of IBD have been found to be the disruption of intestinal microbial homeostasis, the loss of intestinal epithelial barrier function, and the dysregulation of the innate/adaptive immune system, which are triggered by external environmental factors such as smoking, obesity, and eating habits (Hovde et al., 2014). Studies have also reported that genetic factors, such as MUC2 (Van der Sluis et al., 2006) and IL10 (Franke et al., 2008), play an important role in the development of IBD. Given the importance of gene mutations in the development of IBD, genome-wide association studies (GWAS) have been applied to the risk prediction and mechanism of IBD (Franke et al., 2008; Zhang et al., 2021). In addition to the IBD-related genes already discovered in European and American populations, such as NOD2, IL-23R, ATG16L1, IL-10R, IL-10, and XIAP, recent studies have reported new genes involved in cell autophagy, immune regulation, the intestinal mucosal barrier, etc. The total number of functional IBD-related sites has reached 200. However, GWAS can only identify the relationship between a single mutation and IBD. The biological function of a mutation and the significance of each gene cannot be explored by GWAS. Although many studies have combined expression quantitative trait loci (eQTL) with GWAS to explore biological functions, this approach cannot perform large-scale prediction of disease-related genes (Zhao et al., 2019, 2020c).
With the development of computational methods, algorithms have widely changed the way biological research is done (Tianyi et al., 2020; Zhao et al., 2020a). Computational methods such as machine learning and deep learning have been highly successful in discovering biological mechanisms (Zhao et al., 2020b). Algorithmic predictions have proven powerful in identifying multiple aspects of diseases, such as genes, RNAs, proteins, metabolites, drugs, and drug targets (Zhao et al., 2021).
The most common hypothesis behind these computational methods is that similar diseases are related to similar genes, or that similar genes cause similar diseases. Studies have reported that many disease susceptibility genes do not act independently but are associated with a variety of diseases, including IBD. Recent research found that the N2081D allele of the LRRK2 gene, which is closely related to the risk of CD, is located in the same kinase domain as G2019S (Poulopoulos et al., 2012), one of the major mutations involved in familial or sporadic Parkinson's disease. It has also been found that three variants (S123N, R233Q, and P257A) of XIAP, the gene mutated in the primary immunodeficiency X-linked lymphoproliferative disease type 2, are related to the activation of the signaling pathway of the CD susceptibility gene NOD2, which is closely related to CD pathogenesis. Therefore, identifying IBD-related genes from diseases similar to IBD is theoretically supported.
In this paper, we proposed a Support Vector Machine (SVM)-based method to identify IBD-related genes. Using all known IBD-related genes and the known genes related to diseases similar to IBD, we constructed a gene network from gene interactions. The features of each gene were extracted from this network and input into the SVM to learn the pattern by which genes are related to IBD.

Diseases Similar to IBD
Semantic-based methods use the correlation between ontology and the number of disease-related genes to calculate disease similarity. Obviously, not all associations between diseases are represented by ontology, and some of them are reflected by functional associations between disease-related genes. Semfunsim (Cheng et al., 2014) uses disease-related gene sets in a weighted network of human gene functions to calculate disease similarity.
In this paper, we used the results of Semfunsim, according to which 135 diseases are similar to IBD. The distribution of the similarities is shown in Figure 1.
The disease most similar to IBD is lower respiratory tract disease, with a similarity of 0.08. The second is sensory system disease.

Candidate Genes
We obtained disease-related genes from DisGeNET (Piñero et al., 2016), which collects experimentally verified disease-related genes. Many of the 135 diseases are disease subtypes, so their related genes cannot be found in DisGeNET.
In total, we obtained 5,928 genes for 88 diseases from DisGeNET. Figure 2 shows the distribution of these genes.
These 88 diseases have 15,271 entries, meaning that 15,271 gene-disease associations are known for them. However, most of the genes are related to more than one disease, so only 5,928 unique genes were obtained. This also shows that similar diseases share similar genes.
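The gap between the number of entries and the number of unique genes can be illustrated with a small sketch. The function name `summarize_associations` and the toy disease and gene identifiers are hypothetical and are not taken from DisGeNET; the sketch only assumes that the associations are available as (disease, gene) pairs.

```python
from collections import defaultdict


def summarize_associations(associations):
    """Count association entries, unique genes, and genes shared by several diseases.

    `associations` is a list of (disease, gene) pairs (hypothetical input format).
    """
    diseases_by_gene = defaultdict(set)
    for disease, gene in associations:
        diseases_by_gene[gene].add(disease)
    # A gene linked to more than one disease is counted once among the unique genes.
    shared = sorted(g for g, ds in diseases_by_gene.items() if len(ds) > 1)
    return len(associations), len(diseases_by_gene), shared


# Invented toy data: five entries, but only three distinct genes.
toy_associations = [
    ("disease_A", "GENE_1"),
    ("disease_A", "GENE_2"),
    ("disease_B", "GENE_2"),
    ("disease_B", "GENE_3"),
    ("disease_C", "GENE_2"),
]
n_entries, n_unique, shared_genes = summarize_associations(toy_associations)
```

In the toy data, GENE_2 appears in three entries but contributes a single candidate gene, which mirrors how 15,271 entries reduce to 5,928 unique genes.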

Workflow of SVM-Based Method
We obtained gene interactions from HumanNet v2.0 (Hwang et al., 2019) and constructed a gene interaction network from the interactions among the candidate genes.
The features of a gene are expressed as its shortest paths to the genes related to each disease. Figure 3 shows an example of extracting gene features.
As shown in Figure 3, the blue nodes represent the genes related to disease 1, the yellow nodes those related to disease 2, and the orange nodes those related to disease 3. The features of the first blue gene (gene 1) can be represented by its shortest paths to the genes of diseases 2 and 3. Each edge of the network is weighted by the interaction strength. The first dimension of gene 1 is therefore 1, since this gene is directly related to disease 1, and the second dimension is the interaction strength between gene 1 and the yellow node.
For a large network, gene features cannot be extracted manually. We therefore applied the Dijkstra graph search algorithm (Noto and Sato, 2000) to find the shortest paths. The algorithm expands outward from the starting point, which acts as the center, until the shortest paths are obtained.
Assume that each node $t$ in the network carries a label $(d_t, p_t)$, where $d_t$ is the length of the shortest path from the starting point $s$ to the point $t$, and $p_t$ is the point immediately preceding $t$ on the shortest path from $s$ to $t$. To find the shortest paths, we first initialize the starting point:

$$d_s = 0, \quad p_s = \text{NULL}, \quad d_i = \infty \ (i \neq s), \quad k = s$$

Here $k$ is the starting point and $p_k$ is NULL. Then, check the distance from every marked point $k$ to each directly connected unmarked point $j$, and set:

$$d_j = \min\left[d_j,\; d_k + w(k, j)\right]$$

where $w(k, j)$ is the length of the edge from $k$ to $j$. Then, pick the next point: among all unmarked points, the point $i$ with the smallest label $d_i$ is selected as a point on the shortest path. Find the point in the set of marked points that is directly connected to $i$ and through which $d_i$ was obtained, and record it as $p_i$. Mark the point $i$. If all points are marked, the algorithm ends; otherwise, set $k = i$ and repeat the update step. The flow of this algorithm is summarized in Table 1.
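The search and the feature extraction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the network is stored as a nested dictionary `graph[u][v] = weight`, that the interaction strength is used directly as the edge length, and that a gene directly related to a disease gets the value 1 for that disease, as in the Figure 3 example. The names `dijkstra`, `gene_features`, and the toy genes `G1` to `G4` are hypothetical.

```python
import heapq
from math import inf


def dijkstra(graph, source):
    """Return the shortest-path length from `source` to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    marked = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in marked:
            continue  # stale heap entry; this node is already finalized
        marked.add(node)
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight  # d_k + w(k, j)
            if candidate < dist.get(neighbor, inf):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


def gene_features(graph, gene, disease_genes):
    """One feature per disease: 1 if the gene belongs to the disease directly,
    otherwise the shortest-path length to the nearest gene of that disease."""
    dist = dijkstra(graph, gene)
    features = []
    for genes in disease_genes.values():
        if gene in genes:
            features.append(1.0)  # convention taken from the Figure 3 description
        else:
            features.append(min((dist.get(g, inf) for g in genes), default=inf))
    return features


# Invented toy network; weights stand in for interaction strengths.
graph = {
    "G1": {"G2": 0.5, "G3": 1.0},
    "G2": {"G1": 0.5, "G4": 0.25},
    "G3": {"G1": 1.0, "G4": 0.5},
    "G4": {"G2": 0.25, "G3": 0.5},
}
disease_genes = {"disease1": {"G1"}, "disease2": {"G2"}, "disease3": {"G3", "G4"}}
features = gene_features(graph, "G1", disease_genes)
```

A binary heap replaces the linear scan over unmarked points in the description above; the resulting distances are the same, and the heap only changes how the next closest point is found. Genes that cannot reach any gene of a disease receive an infinite distance here, which would need to be capped or imputed before training a classifier.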
The gene features extracted in the previous step are then input into the SVM (Friedrichs and Igel, 2004) to predict IBD-related genes. The workflow of the SVM is shown in Figure 4.
Because the relationship between the features of a gene and its relevance to IBD is nonlinear, a mapping $\varphi(\cdot)$, induced by a kernel function, is needed to map the gene features into a high-dimensional feature space.
The Gaussian Radial Basis Function (RBF) is radially symmetric and smooth. Used as a kernel function, it implicitly transforms the sample space into an infinite-dimensional feature space, so that nonlinear relationships can be handled better. A hyperplane in the feature space can divide the region of the input samples arbitrarily, which avoids the problems caused by an excessive concentration of training samples. The kernel is:

$$K(x_i, x_j) = \exp\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}\right)$$

where $\sigma$ is the kernel width parameter. A linear regression is then performed in the feature space, with the regression function:

$$f(x) = w^T \varphi(x) + b$$

where $w$ is the weight vector and $b$ is the bias. This function describes the relationship between the gene features and relevance to IBD. Any given gene feature vector can be substituted into the regression function to obtain the corresponding probability that the gene is IBD-related. The next step is therefore to obtain $w$ and $b$, which are determined by the principle of structural risk minimization. The structural risk is:

$$R = \frac{1}{2}\lVert w \rVert^2 + \gamma R_{emp}$$

where $\gamma$ is the regularization parameter and $R_{emp}$ is the empirical risk function. In our method, the sum of the squared training errors was used as the empirical risk function.
To minimize $R$, the Lagrange multiplier method was introduced. Solving the resulting optimality conditions gives $w$ and $b$, and hence the final SVM model.
FIGURE 4 | Workflow of Support Vector Machine (SVM)-based prediction method.
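The minimization can be written out explicitly under one common set of assumptions. The following is a sketch and not necessarily the exact formulation used in this work: it assumes the least-squares form of the SVM, which matches the use of squared training errors as the empirical risk, and it introduces symbols that do not appear in the text, namely the training features $x_i$ and labels $y_i$ ($i = 1, \dots, N$), the training errors $e_i$, and the Lagrange multipliers $\alpha_i$.

```latex
\begin{aligned}
&\min_{w,\,b,\,e}\; R = \tfrac{1}{2}\lVert w \rVert^{2} + \gamma \sum_{i=1}^{N} e_i^{2}
\quad \text{s.t.} \quad y_i = w^{T}\varphi(x_i) + b + e_i, \quad i = 1, \dots, N \\[4pt]
&L(w, b, e, \alpha) = \tfrac{1}{2}\lVert w \rVert^{2} + \gamma \sum_{i=1}^{N} e_i^{2}
 - \sum_{i=1}^{N} \alpha_i \left[ w^{T}\varphi(x_i) + b + e_i - y_i \right] \\[4pt]
&\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{N} \alpha_i \varphi(x_i), \qquad
 \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i = 0 \\[4pt]
&\frac{\partial L}{\partial e_i} = 0 \;\Rightarrow\; \alpha_i = 2\gamma e_i, \qquad
 \frac{\partial L}{\partial \alpha_i} = 0 \;\Rightarrow\; w^{T}\varphi(x_i) + b + e_i - y_i = 0 \\[4pt]
&f(x) = \sum_{i=1}^{N} \alpha_i K(x_i, x) + b, \qquad K(x_i, x) = \varphi(x_i)^{T}\varphi(x)
\end{aligned}
```

Substituting $w$ into the regression function replaces every inner product $\varphi(x_i)^{T}\varphi(x)$ with the kernel value $K(x_i, x)$, so the mapping $\varphi$ never has to be computed explicitly. Under these assumptions, $\alpha_i$ and $b$ follow from a linear system in the kernel matrix rather than from a quadratic program.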
Frontiers in Cell and Developmental Biology | www.frontiersin.org

Performance of SVM-Based Method in Predicting IBD-Related Genes
To verify the effectiveness of our method, 10-fold cross-validation was used. We randomly divided the 5,928 genes into 10 groups. Nine groups were used to build the SVM model and the remaining group was used as the testing set. The process was repeated 10 times, so that each group was used nine times for training and once for testing. We compared our method with several traditional methods: Artificial Neural Network (ANN) (Plumb et al., 2005), Random Forest (RF) (Archer and Kimes, 2008), and Naïve Bayes (NB) (Archer and Kimes, 2008). The ROC curves are shown in Figure 5.
As Figure 5 shows, SVM performed best among these methods. The AUC of SVM is 0.93, which indicates that IBD-related genes can be predicted accurately.
In addition, we evaluated the AUPR of SVM. The AUPR is 0.97, which indicates that few of the genes predicted as IBD-related are false positives (Figure 6).

Identify IBD-Related Genes
Having demonstrated the effectiveness of SVM in the section "Performance of SVM-Based Method in Predicting IBD-Related Genes," we used this method to identify IBD-related genes.
In total, we obtained 1,577 genes related to IBD from DisGeNET. All 1,577 genes were used as positive samples to build the SVM model. To balance the numbers of positive and negative samples, we randomly selected 1,577 genes from the 5,928 candidate genes as negative samples and built the final SVM model. We then used this model to test whether the remaining 4,351 genes are related to IBD and found 231 novel IBD-related genes.
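The sampling and evaluation protocol described in the last two sections can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the authors' code: the function names, the fixed random seed, and the toy gene identifiers are hypothetical, the negatives are drawn only from candidates that are not known positives, and the AUC is computed from its rank interpretation (the fraction of positive-negative pairs in which the positive receives the higher score) instead of from an explicit ROC curve.

```python
import random


def balanced_negatives(candidates, positives, seed=0):
    """Randomly draw as many negatives as there are positives from the candidates."""
    rng = random.Random(seed)
    pool = sorted(set(candidates) - set(positives))  # sorted for reproducibility
    return rng.sample(pool, min(len(positives), len(pool)))


def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) so that each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Invented toy identifiers: 5 known positives and 20 candidate genes.
positives = [f"P{i}" for i in range(5)]
candidates = [f"C{i}" for i in range(20)]
negatives = balanced_negatives(candidates, positives)
splits = list(k_fold_splits(len(positives) + len(negatives), k=10))
toy_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

In each fold, a classifier would be trained on the `train` indices and its scores on the `test` indices collected. Pooling the scores over the 10 folds and passing them to `roc_auc` gives a single cross-validated AUC.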

Case Study
In IBD, intestinal inflammation has been reported to be accelerated by dysfunction of the epithelial paracellular barrier formed by tight junctions (TJs). Some of the intestinal claudin-family proteins, which form the paracellular barrier, show aberrant expression levels and localizations in IBD. According to the study of Tanaka et al. (2015), intestine-specific Cldn7 deficiency caused colonic inflammation, even though TJ structures were still present due to other claudins. The paracellular flux (pFlux), determined by measuring the paracellular permeability across the colon epithelium, was enhanced by the Cldn7 deficiency for the small organic solute Lucifer Yellow (457 Da), but not for the larger organic solute FITC-Dextran (4,400 Da).
LPA promotes platelet aggregation and induces cellular tension and cell surface fibronectin assembly (Olorundare et al., 2001), which are also important events in wound repair, suggesting an important role of LPA in inflammatory disorders. This was confirmed by Sturm et al. (2002), who demonstrated that LPA not only promotes epithelial wound healing in vitro by a TGF-β-independent pathway, but also ameliorates colitis in an experimental rat model.
LSAMP has been reported to be associated with UC, a subtype of IBD (Brant et al., 2017).
A recent study also found that a single-nucleotide polymorphism at rs309 in the MDM2 gene was associated with UC: people with the GG genotype of the MDM2 gene were more prone to UC than those with the TT genotype (Doulabi et al., 2020).
Although IBD is not considered an autoimmune disease, it may trigger autoimmunity through the increased antigenic load and mucosal immune activation. Several genetic alterations are shared between IBD and autoimmune diseases. IBD-related inflammation involves a variety of abnormalities in humoral and cell-mediated immunity, and a generalized enhanced reactivity against intestinal bacteria. The IFN signature in autoimmune diseases is a useful biomarker of disease progression and treatment efficiency. A recent study found that OAS1 is a significant IFN response gene that could serve as an early predictor of disease activity and progression, as well as a supplementary therapeutic target (Andreou et al., 2020).

CONCLUSION
Increasing evidence has shown the relationship between genetic factors and IBD. Identifying IBD-related genes can further reveal the pathogenesis of IBD and provide important support for clinical diagnosis and treatment. A large number of studies have sought IBD-related genes through biological experiments. However, few genes have been identified.
Since computational methods have shown strong reliability in identifying disease-related molecules, in this paper we proposed an SVM-based method for identifying IBD-related genes. A gene network was constructed from gene interactions and candidate genes. Candidate genes were selected from the known genes related to diseases similar to IBD. We applied the Dijkstra algorithm to extract the network topology characteristics as the features of the genes. Finally, an SVM-based model was built to identify IBD-related genes.
To verify the effectiveness of our method, we compared SVM with RF, NB, and ANN. SVM showed the highest AUC and AUPR in 10-fold cross-validation. After confirming the accuracy of SVM, we built a model using all the data and obtained 231 new genes related to IBD. To verify the accuracy of the results, we conducted a case study. Previous studies have shown that five of our newly discovered genes are related to IBD.
Overall, we proposed a novel method for identifying IBD-related genes on a large scale. This method could also be extended to other diseases to help researchers find more disease-related genes.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/supplementary material.

ETHICS STATEMENT
Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. Written informed consent for participation was not required for this study in accordance with the national legislation and the institutional requirements.

AUTHOR CONTRIBUTIONS
CC provided the idea for the study. LY, YL, and X-DF interpreted the results of the data analyses. All authors contributed to designing the algorithm, constructing the network, and interpreting the results, and read and approved the final version of the manuscript.

FUNDING
This work was supported by the National Natural Science Foundation of China (Grant No. 8197032867).