Alzheimer-Compound Identification Based on Data Fusion and forgeNet_SVM

Rapid screening and identification of potential candidate compounds are important for understanding the mechanisms of drugs for treating Alzheimer's disease (AD) and can greatly promote the development of new drugs. To improve the success rate of screening and reduce the cost and workload of research and development, this study proposes a novel Alzheimer-related compound identification algorithm, namely forgeNet_SVM. First, Alzheimer-related and unrelated compounds are collected from the literature databases using a data mining method. Three molecular descriptors (ECFP6, MACCS, and RDKit) are used to obtain the feature sets of the compounds, which are fused into the all_feature set. The all_feature set is input to forgeNet_SVM, in which forgeNet provides the importance of each feature and selects the important features for feature extraction. The selected features are input to a support vector machine (SVM) to identify new compounds in Traditional Chinese Medicine (TCM) prescriptions. The experimental results show that the selected feature set performs better than the all_feature set and the three single feature sets (ECFP6, MACCS, and RDKit). The TPR, FPR, Precision, Specificity, F1, and AUC results show that forgeNet_SVM identifies Alzheimer-related compounds more accurately than other classical classifiers.


INTRODUCTION
Alzheimer's disease (AD) is the most common type of senile dementia and occurs frequently among the elderly (Romanelli et al., 1990; Morán et al., 1992; Wang et al., 2014). Its main clinical manifestations are declining cognitive function, mental symptoms and behavioral disorders, and declining ability to carry out daily living activities (Almeida and Crocco, 2000; Daulatzai, 2014; Zhao et al., 2016; Gong et al., 2017). It poses a great threat to the health and quality of life of the elderly and places a heavy economic burden on society (Rice et al., 1993; Rothstein et al., 1996; Hu, 2006; Wang, 2014). AD arises mainly from disease of the central nervous system in the brain, which causes a series of impairments in learning, memory, and speech (Ogomori et al., 1989; Hao et al., 2013). Family inheritance, physical diseases, and head trauma can contribute to the onset of this disease (Heyman, 1994; Mehta et al., 1999). However, research on AD still faces problems such as an unclear pathogenesis, difficult early diagnosis, and the lack of preventive or curative drugs. Therefore, the diagnosis and treatment of AD have remained a difficult problem for medical researchers in recent decades.
AD is a complex disease with multiple contributing factors. At present, the main drugs used clinically to treat AD are acetylcholinesterase inhibitors, glutamate receptor inhibitors, etc. (Liston et al., 2004; Dong et al., 2005; Sugimoto, 2006). These drugs can alleviate the symptoms caused by the decline of cognitive function but cannot fundamentally eliminate the cause of the disease. Network pharmacology draws on multi-disciplinary knowledge such as systems biology, polypharmacology, bioinformatics, computer technology, and network analysis (Berger and Iyengar, 2009; Chen et al., 2012; Yuan et al., 2019; Li et al., 2020). It systematically studies the drug-target-pathway-disease interaction network and examines the multi-component, multi-target, and multi-channel pharmacological mechanism of traditional Chinese medicine (TCM) (Li et al., 2014; Xiong et al., 2018; Jiang et al., 2020; Gao et al., 2021). It plays an important role in exploring treatment approaches and clarifying drug efficacy, especially in finding the effective components of drugs, which is highly consistent with the holistic view emphasized by TCM theory. In recent years, network pharmacology has been used to study a variety of TCM prescriptions for improving AD from the multi-component, multi-target, and multi-channel point of view (Sun et al., 2017; An et al., 2020; Wang et al., 2020; Huang et al., 2021). Pang et al. analyzed 25 targets and 13 TCM prescriptions for the treatment of AD and selected 7 representative Chinese medicines (Pang et al., 2016). Naive Bayesian and recursive partitioning models were used to predict the targets of the chemical components of TCM in order to construct a compound-target-disease network and explain the synergistic mechanism of the multiple effective components of TCM prescriptions. Tao et al. analyzed the compounds of Paeoniae Rubra Radix and Phellodendri Cortex together with the Alzheimer-related targets to reveal the mechanism by which these two medicinal materials intervene in AD (Tao et al., 2015). Wang et al. analyzed the main active components of Liuwei Dihuang Decoction and their main targets and carried out GO and pathway analyses to describe the multi-component, multi-channel, and multi-target mechanism of Liuwei Dihuang Decoction in the treatment of AD (Wang et al., 2021). Jiang and Wang used network pharmacology to analyze the mechanism of Bajitian for treating AD and found that this drug could exert pharmacological effects in many aspects, such as neurotransmitter regulation and ion channel regulation (Jiang and Wang, 2021).
In network pharmacology, screening the main active compounds of prescriptions is an essential step. In past studies, this step has been carried out mainly by manually searching public databases. In this study, a novel machine learning method, namely forgeNet_SVM, is proposed to identify Alzheimer-related active compounds. A data mining method is used to collect Alzheimer-related compounds from the literature. Three molecular descriptors (ECFP6, MACCS, and RDKit) are used to obtain the feature sets of the compounds, which are fused into an all_feature set. The all_feature set is input to forgeNet_SVM, in which forgeNet provides the importance of each feature and selects the important features for feature extraction. The selected features are input to a support vector machine (SVM) to identify new AD-related compounds in TCM prescriptions.

METHODS

forgeNet
The forest graph-embedded deep feedforward network (forgeNet) combines ensemble methods with deep learning and has been used for gene regulatory network inference and biological data classification (Kong and Yu, 2020; Yang, 2021). Figure 1 shows the framework of forgeNet, which comprises the construction of the feature graph and classification by a deep learning model. Compared to classical deep learning models, forgeNet can handle the dimension imbalance of biomedical data and is more robust (Kong and Yu, 2020).

Development of Feature Graph
With dimension-imbalanced data, the important features need to be selected for feature extraction. To this end, forgeNet uses a forest $\xi$ that contains $p$ decision trees (DTs). The forest is fitted on the training dataset with its classification labels, which creates the $p$ DTs, $\xi(\theta) = \{T_1(\theta_1), T_2(\theta_2), \ldots, T_p(\theta_p)\}$, where $\theta_i$ denotes the parameters of the $i$-th tree. If a binary tree is regarded as a special case of a directed graph, the following graph set is obtained.

$$\mathcal{G} = \{G_i(V_i, E_i)\}, \quad i = 1, 2, \ldots, p \tag{1}$$

where $V_i$ and $E_i$ denote the vertex and edge sets of $G_i$, respectively.
Combining the directed graphs in $\mathcal{G}$ gives the final aggregated graph.

$$G = \bigcup_{i=1}^{p} G_i = \left( \bigcup_{i=1}^{p} V_i,\ \bigcup_{i=1}^{p} E_i \right) \tag{2}$$
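The aggregation step can be illustrated with a minimal Python sketch. The tuple-based tree representation and the function names `tree_to_graph` and `aggregate_forest` are hypothetical and chosen only for illustration; they are not the data structures of any forgeNet implementation. Each tree contributes the features used at its split nodes as vertices and the parent-to-child relations between split features as directed edges, and the forest graph is the union over all trees.

```python
# Hypothetical tree encoding (an assumption for this sketch):
# a split node is (feature_index, left_subtree, right_subtree); a leaf is None.

def tree_to_graph(tree):
    """Return (V_i, E_i) for one tree: split features and parent->child edges."""
    vertices, edges = set(), set()
    stack = [tree]
    while stack:
        node = stack.pop()
        if node is None:          # leaves carry no feature
            continue
        feature, left, right = node
        vertices.add(feature)
        for child in (left, right):
            if child is not None:
                edges.add((feature, child[0]))  # directed edge parent -> child
                stack.append(child)
    return vertices, edges

def aggregate_forest(forest):
    """Union of the vertex sets and edge sets of all trees in the forest."""
    all_vertices, all_edges = set(), set()
    for tree in forest:
        v, e = tree_to_graph(tree)
        all_vertices |= v
        all_edges |= e
    return all_vertices, all_edges

# Two toy trees over features 0..3.
tree_1 = (0, (1, None, None), (2, None, None))
tree_2 = (1, (3, None, None), None)
V, E = aggregate_forest([tree_1, tree_2])
```

Features that never appear in any split are absent from the vertex set, which is how the forest acts as a first filter on high-dimensional inputs.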

Classification of Deep Learning Model
Using the feature graph obtained in the previous step, a graph-embedded deep feedforward network (GEDFN) is trained to obtain the optimal model, which then provides the classification results for unknown data (Yang, 2021). The layers of GEDFN are given as follows.

$$Z_1 = \sigma\left(X (W_{in} \odot A) + b_{in}\right), \qquad Z_{k+1} = \sigma\left(Z_k W_k + b_k\right) \tag{3}$$

where $X$ represents the input vector, $Z_k$ denotes the $k$-th hidden layer, $\odot$ is the Hadamard product, $A$ is the adjacency matrix of the feature graph, $\sigma$ is the activation function, $W_{in}$ and $b_{in}$ are the weight and bias between the input layer and the first hidden layer, and $W_k$ and $b_k$ are the weight and bias of the $k$-th hidden layer, respectively. forgeNet also provides a feature importance evaluation mechanism, which is based on the Graph Connection Weights (GCW) method (Kong and Yu, 2018). The score of the $i$-th feature is defined as follows.
where $n$ is the number of features in the dataset, $W^{(in)}$ represents the weights between the input layer and the first hidden layer, and $W^{(1)}$ represents the weights between the first hidden layer and the second hidden layer. After forgeNet is trained, the importance scores of all features can be computed from the trained weights.
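To make the role of the graph mask and of the two weight matrices concrete, the following Python sketch computes a graph-masked first layer and a connection-weight importance score for a toy network. It is a sketch under stated assumptions: the ReLU activation, the unweighted sum of absolute incoming and outgoing weights, and the function names `masked_first_layer` and `importance_scores` are illustrative choices, and the exact GCW weighting used in the study may differ.

```python
def masked_first_layer(x, W_in, A, b_in):
    """First hidden layer with W_in masked element-wise by the adjacency matrix A.

    ReLU is an assumed activation for this sketch.
    """
    n = len(x)
    z = []
    for j in range(n):
        total = b_in[j]
        for i in range(n):
            total += x[i] * W_in[i][j] * A[i][j]  # masked-out weights contribute 0
        z.append(max(0.0, total))
    return z

def importance_scores(W_in, A, W_1):
    """Assumed score: graph-connected incoming |W_in| plus outgoing |W_1| per feature."""
    n = len(W_in)
    scores = []
    for i in range(n):
        incoming = sum(abs(W_in[j][i]) for j in range(n) if A[j][i] == 1)
        outgoing = sum(abs(w) for w in W_1[i])
        scores.append(incoming + outgoing)
    return scores

# Toy example with n = 3 features and a second hidden layer of width 2.
A = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
W_in = [[0.5, -1.0, 2.0], [0.2, 0.3, -0.7], [1.0, 1.0, 0.1]]
b_in = [0.0, 0.0, 0.0]
W_1 = [[1.0, -1.0], [0.5, 0.5], [0.0, 2.0]]

z1 = masked_first_layer([1.0, 2.0, 3.0], W_in, A, b_in)
scores = importance_scores(W_in, A, W_1)
```

In this toy case feature 0 receives the highest score, so it would be kept first when features are ranked.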

Support Vector Machine
The support vector machine (SVM) is one of the most classical machine learning algorithms and was proposed in 1995 (Cortes and Vapnik, 1995). SVM is suitable for classification problems with small to medium sample sizes, nonlinearity, and high-dimensional patterns. The basic principle of SVM, derived from the optimal classification surface in the linearly separable case, is to find an optimal classification surface (hyperplane) that not only separates the samples without errors but also maximizes the margin (Suykens and Vandewalle, 1999; Saunders et al., 2002). Therefore, the learning process of SVM is an optimization problem.
where $b^*$ is the classification threshold.
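For reference, the textbook hard-margin formulation that matches this description is reproduced below. It is given as a standard form rather than as the exact variant used in this study; the symbols $w$, $b$, $N$, $y_i$, and $\alpha_i^*$ (the optimal Lagrange multipliers) are the usual ones and are assumptions here.

$$\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^2 \quad \text{s.t.} \quad y_i\,(w \cdot x_i + b) \ge 1, \quad i = 1, 2, \ldots, N$$

$$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{N} \alpha_i^{*}\, y_i\, (x_i \cdot x) + b^{*} \right)$$

Only the training samples with $\alpha_i^* > 0$ (the support vectors) contribute to the decision function.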
For a linearly separable dataset, linear SVM is suitable. For a nonlinear dataset, however, a kernel function can be used to map the data points from a relatively low-dimensional space to a relatively high-dimensional space, in which they become linearly separable, and to compute the relationships between them. The search for the optimal classification hyperplane in the high-dimensional feature space is similar to that of linearly separable SVM, except that the kernel function replaces the dot product in the high-dimensional feature space. Common kernel functions include the linear kernel, polynomial kernel, radial basis function (rbf) kernel, and sigmoid kernel, which are defined as follows.
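In their common textbook forms, these four kernels can be written as shown below. The constants (the offset of 1 in the polynomial kernel and the factor $2\sigma^2$ in the rbf kernel) follow one widespread convention and are assumptions; other sources parametrize them slightly differently.

$$K_{\text{linear}}(x, x') = x \cdot x'$$

$$K_{\text{poly}}(x, x') = (x \cdot x' + 1)^{d}$$

$$K_{\text{rbf}}(x, x') = \exp\left( -\frac{\lVert x - x' \rVert^2}{2\sigma^2} \right)$$

$$K_{\text{sigmoid}}(x, x') = \tanh\left( k\,(x \cdot x') + \theta \right)$$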
where $d$ is the order of the polynomial, $\sigma$ is the radius of the radial basis function, $k$ is a scalar, and $\theta$ is a shift value.

forgeNet_SVM
To improve the classification accuracy of SVM, especially for high-dimensional datasets, a new classifier based on forgeNet and SVM (forgeNet_SVM) is proposed in this paper. forgeNet can be used not only for classification but also to score the features in the dataset and thereby indicate their importance. Therefore, in the forgeNet_SVM algorithm, forgeNet is used to select the important features of a high-dimensional dataset for feature extraction. In the next step, the important features are input to SVM for learning in order to solve the classification problem.
In the fused all_feature set, $n_e$, $n_m$, and $n_r$ are the numbers of ECFP6, MACCS, and RDKit features, respectively. forgeNet_SVM is used to identify Alzheimer-related compounds from the collected dataset. To improve the classification performance, all features are input to forgeNet, which provides the importance of each feature. According to the score of each feature, the features that are important for classification are selected, which achieves the purpose of feature extraction. The selected feature set is given as $[d_1, d_2, \ldots, d_n]$. Next, the selected features are input to the SVM algorithm for learning. The features of new compounds in TCM prescriptions are extracted with the same method and input to the trained SVM for identification.
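The fusion and selection steps described above can be sketched in a few lines of Python. This is a minimal illustration, assuming that the three descriptor vectors and the per-feature importance scores are already available as plain lists; the function names `fuse_features`, `top_k_indices`, and `select_features` are hypothetical, and the toy values are not data from the study.

```python
def fuse_features(ecfp6, maccs, rdkit):
    """Concatenate the three descriptor vectors into one all_feature vector."""
    return list(ecfp6) + list(maccs) + list(rdkit)

def top_k_indices(scores, k):
    """Indices of the k highest-scoring features, returned in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def select_features(all_feature, indices):
    """Keep only the selected positions of a fused feature vector."""
    return [all_feature[i] for i in indices]

# Toy compound with n_e = 3, n_m = 2, n_r = 1 features.
all_feature = fuse_features([1, 0, 1], [1, 1], [0.5])

# Hypothetical importance scores, one per fused feature.
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7]
selected_idx = top_k_indices(scores, k=3)
selected = select_features(all_feature, selected_idx)
```

The same index list must be applied to new compounds so that the classifier sees the features in the same positions at training and prediction time.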

EXPERIMENTS AND DISCUSSIONS
To test the effectiveness of the proposed method, prescriptions and drugs for treating AD are searched. In total, 94 Alzheimer-related active compounds and 282 unrelated compounds are collected. ECFP6, MACCS, and RDKit descriptors are computed for each compound to obtain three feature sets. These three feature sets are combined, which gives a total of 2,423 features per compound as the all_feature set. To evaluate the performance of the method, TPR, FPR, Precision, Specificity, F1, ROC, and AUC are applied. Seven classical classifiers, namely AdaBoost (Cao et al., 2013), Gradient Boosting Decision Tree (GBDT) (Hu and Min, 2018), K-Nearest Neighbor (KNN) (Denoeux, 1995), logistic regression (LR) (Maalouf, 2011), naive Bayes (NB) (Rish, 2001), random forest (RF) (Breiman, 2001), and decision tree (DT) (Breiman et al., 1984), are also used to identify Alzheimer-related compounds. In forgeNet_SVM, random forest is used as the forest model with the number of trees set to 1,000, the network contains three hidden layers, the learning rate is set to 0.0001, the number of training epochs is set to 50, and the linear kernel is selected as the kernel function. In GBDT, the maximum number of weak learners is set to 200. In LR, the L2 norm is used to constrain the parameters. In RF, the number of decision trees is set to 100, the bootstrap method is used, and the number of features is set to n_features (n_features is the number of features) when searching for the best split.
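The threshold-based metrics listed above all follow from the four confusion-matrix counts. The Python sketch below uses the standard definitions; the function names and the toy labels are illustrative and not taken from the study.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = active, 0 = inactive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def _ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def classification_metrics(tp, fp, tn, fn):
    """TPR, FPR, Precision, Specificity, and F1 from confusion counts."""
    tpr = _ratio(tp, tp + fn)            # recall / sensitivity
    fpr = _ratio(fp, fp + tn)
    precision = _ratio(tp, tp + fp)
    specificity = _ratio(tn, tn + fp)    # equals 1 - FPR
    f1 = _ratio(2 * precision * tpr, precision + tpr)
    return {"TPR": tpr, "FPR": fpr, "Precision": precision,
            "Specificity": specificity, "F1": f1}

# Toy example: 4 active and 6 inactive compounds.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
results = classification_metrics(*confusion_counts(y_true, y_pred))
```

Because the dataset has three inactive compounds for every active one, TPR and FPR should be read together: a classifier that labels everything as active reaches a TPR of 1 but also an FPR of 1.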
In forgeNet_SVM, forgeNet selects the important features from the large fused feature set. First, we test how the number of selected features affects the performance of our method. The numbers of important features selected by forgeNet are 50, 100, 200, 500, 600, 700, 800, 900, 1,000, and 1,200. For each number of features, 10-fold cross-validation is used to divide the data into training and testing sets and evaluate the model, and the resulting TPR, FPR, Precision, Specificity, F1, ROC, and AUC are shown in Figure 3. Figure 3 shows that our method performs best in terms of TPR when 50, 500, 600, 800, 900, or 1,000 features are selected. In terms of FPR, Precision, Specificity, and F1, our method performs best when 800 or 900 features are selected. Overall, our method performs best with 800 or 900 selected features. In the following experiments, the 900 most important features selected by forgeNet are used as the feature set.
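The 10-fold evaluation can be sketched as follows. This is a minimal illustration: stratifying the folds by class and the fixed random seed are assumptions made for this sketch (the text does not state how the folds were drawn), and the function names are hypothetical. The class sizes match the 94 active and 282 inactive compounds reported above.

```python
import random

def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds, keeping the class ratio roughly equal."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for position, i in enumerate(idx):
            folds[position % k].append(i)   # deal indices round-robin
    return folds

def train_test_splits(folds):
    """Yield (train_indices, test_indices), using each fold once as the test set."""
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx

labels = [1] * 94 + [0] * 282        # 94 active, 282 inactive compounds
folds = stratified_folds(labels, k=10, seed=0)
splits = list(train_test_splits(folds))
```

To avoid optimistic estimates, feature selection would need to be repeated inside each training split rather than performed once on the full dataset.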
We compare the effects of different feature sets on the performance of the algorithm. The feature sets are ECFP6, MACCS, RDKit, all features, and the features selected by forgeNet. Two datasets are used. The first dataset contains all the compounds (Dat1), and the second is obtained by random division (Dat2), in which 70% of the compounds are used as the training set and the remaining compounds as the testing set. With Dat1, using 10-fold cross-validation, the performances of our method with different feature sets for Alzheimer-related compound identification are shown in Figure 4 and Table 1. Figure 4 shows that the selected feature set has a better ROC curve than the three single feature sets (ECFP6, MACCS, and RDKit) and the all_feature set. Furthermore, in terms of AUC, the selected feature set is 4% higher than ECFP6, 6% higher than MACCS, 4.1% higher than RDKit, and 0.4% higher than the all_feature set. Table 1 shows that in terms of TPR, FPR, Precision, Specificity, and F1, the selected feature set performs better than the ECFP6, MACCS, RDKit, and all_feature sets. With Dat2 and the different feature sets, the identification results of active compounds are shown in Figure 5 and Table 2. In Figure 5, the selected features yield a better ROC curve than the other four feature sets. In terms of AUC, the selected feature set is 4, 6, 4.1, and 0.37% higher than the ECFP6, MACCS, RDKit, and all_feature sets, respectively. Table 2 shows that the selected features allow SVM to obtain the best TPR, FPR, Precision, Specificity, and F1. Across all the results, the merged feature set (all features) performs better than the three single feature sets (ECFP6, MACCS, and RDKit). Because forgeNet selects the important features, the selected feature set performs better than the merged feature set in terms of TPR, FPR, Precision, Specificity, and F1. Thus, the feature extraction method can improve the accuracy of active compound recognition.
AdaBoost, GBDT, KNN, LR, NB, RF, and DT are also applied directly to predict Alzheimer-related compounds with Dat1 and Dat2. In addition, the SVM in forgeNet_SVM is replaced with each of these seven classifiers to constitute forgeNet_AdaBoost, forgeNet_GBDT, forgeNet_KNN, forgeNet_LR, forgeNet_NB, forgeNet_RF, and forgeNet_DT, which are also used to identify compounds. With Dat1 and Dat2, the performances of the 15 methods for Alzheimer-related compound identification are listed in Tables 3 and 4, respectively. In Table 3, KNN and LR obtain the best TPR, which shows that KNN and LR identify the most active compounds. However, these two methods have the worst FPR values, which are 0.77305 and 0.56383, respectively. The results reveal that LR identifies most of the compounds as active compounds. In terms of FPR, Precision, and Specificity, NB performs best. However, NB has the worst TPR, which shows that NB identifies most of the compounds as inactive compounds. In terms of F1 and AUC, forgeNet_SVM obtains the best performance among the 15 methods. In Table 4, KNN and LR obtain the best TPR, which reveals that these two methods identify all true active compounds. forgeNet_SVM, forgeNet_NB, and forgeNet_DT obtain the second-best TPR. forgeNet_SVM obtains the best FPR, which shows that our proposed method can identify all true inactive compounds. In terms of Precision, Specificity, F1, and AUC, forgeNet_SVM also performs best. On the whole, our proposed method identifies more true active and inactive compounds than the other methods.
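AUC, which is used throughout this comparison, can be computed directly from the classifier scores without plotting the ROC curve. The Python sketch below uses the standard rank-based (Mann-Whitney) definition; the function name `auc_score` and the toy scores are illustrative assumptions and not results from the study.

```python
def auc_score(y_true, scores):
    """AUC as the fraction of (active, inactive) pairs ranked correctly.

    Ties between an active and an inactive score count as half a correct pair.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one sample of each class")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

# Toy example: 2 active and 3 inactive compounds with decision scores.
toy_labels = [1, 1, 0, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.3, 0.1]
toy_auc = auc_score(toy_labels, toy_scores)
```

Unlike TPR and FPR, this quantity does not depend on a decision threshold, which is why a method can rank first on AUC without having the best TPR.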

CONCLUSION
In this study, a novel Alzheimer-related compound identification algorithm based on data fusion and forgeNet_SVM is proposed. Three feature description methods (ECFP6, MACCS, and RDKit) are used to obtain the feature sets of Alzheimer-related and unrelated compounds, which are fused into the all_feature set. In forgeNet_SVM, the all_feature set is input to forgeNet, which evaluates the importance of each feature and extracts the important features according to the resulting scores. The selected features are input to the SVM algorithm to identify new compounds in a TCM prescription. On the collected Alzheimer-related dataset, the experimental results show that forgeNet_SVM identifies more true-positive compounds and fewer false-positive compounds than other classical classifiers, such as AdaBoost, GBDT, KNN, LR, NB, RF, and DT. Comparison experiments are carried out to determine the optimal number of selected features for forgeNet_SVM. In terms of TPR, FPR, Precision, Specificity, F1, and AUC, the selected feature set performs better than the all_feature set and the three single feature sets (ECFP6, MACCS, and RDKit).
In the future, we will apply forgeNet_SVM to identify compounds related to other diseases, such as cancer, COVID-19, and cardiovascular diseases.

FIGURE 2 | Flowchart of Alzheimer-related active compound identification by forgeNet_SVM.

Figure 2 shows the flowchart of Alzheimer-related active compound identification by forgeNet_SVM. The detailed algorithm is given as follows.
1. Studies on TCM in the treatment of AD are searched in the literature databases. The retrieved literature is analyzed and mined for important drugs and prescriptions for the treatment of AD, which include Epimedii Folium, Anemarrhena asphodeloides, the Radix Ginseng-Poria drug pair, Bajitian, and Polygoni Multiflori Caulis. Next, m active compounds closely related to Alzheimer's disease, such as naringin, quercetin, kaempferol, β-sitosterol, isorhamnetin, stigmasterol, and icariin, are retrieved. These important compounds have been verified by biological experiments or by the molecular docking method. The m active compounds are used as positive samples for further data analysis. To determine the negative samples, the m active compounds are input to the DUD-E website to generate the corresponding decoys (Mysinger et al., 2012). To set up the inactive compound set (negative samples), 3m decoys are randomly selected from the obtained decoy sets without replacement. Thus, the inactive compound set contains 3m compounds. The sets of active and inactive compounds constitute the compound sample dataset.
2. The molecular structures of the compounds in the collected dataset are represented as SMILES (simplified molecular input line entry system) strings. From the SMILES structures, three molecular descriptors (ECFP6, MACCS, and RDKit) are used to obtain the feature sets of the compounds. The ECFP6 $(e_1, e_2, \ldots, e_{n_e})$, MACCS $(m_1, m_2, \ldots, m_{n_m})$, and RDKit $(r_1, r_2, \ldots, r_{n_r})$ feature sets of each compound are fused into an all_feature set $(e_1, \ldots, e_{n_e}, m_1, \ldots, m_{n_m}, r_1, \ldots, r_{n_r})$.
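The construction of the sample dataset in step 1 (m active compounds plus 3m decoys drawn without replacement) can be sketched in Python as follows. The function name `build_sample_dataset`, the fixed seed, and the placeholder compound identifiers are assumptions for illustration; obtaining the decoys themselves is outside the scope of this sketch.

```python
import random

def build_sample_dataset(actives, decoy_pool, ratio=3, seed=0):
    """Label actives as 1 and draw ratio * m distinct decoys labeled 0."""
    needed = ratio * len(actives)
    if len(decoy_pool) < needed:
        raise ValueError("decoy pool is smaller than the requested sample")
    rng = random.Random(seed)
    inactives = rng.sample(decoy_pool, needed)   # sampling without replacement
    return [(c, 1) for c in actives] + [(c, 0) for c in inactives]

# Placeholder identifiers; real entries would be SMILES strings.
actives = ["active_%d" % i for i in range(4)]
decoy_pool = ["decoy_%d" % i for i in range(50)]
dataset = build_sample_dataset(actives, decoy_pool)
```

With the 94 active compounds reported in the experiments, the same call would draw the 282 inactive compounds.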

FIGURE 3 | Performances of forgeNet_SVM with the different numbers of features.

FIGURE 4 | ROC curves and AUC performances of our method with different feature sets for Alzheimer-related compound identification with Dat1.

TABLE 1 | Performances of our method with different feature sets for Alzheimer-related compound identification with Dat1.

FIGURE 5 | ROC curves and AUC performances of our method with different feature sets for Alzheimer-related compound identification with Dat2.

Frontiers in Aging Neuroscience | www.frontiersin.org | July 2022 | Volume 14 | Article 931729

TABLE 2 | Performances of our method with different feature sets for Alzheimer-related compound identification with Dat2.

TABLE 3 | Performances of 15 methods for Alzheimer-related compound identification with Dat1.

TABLE 4 | Performances of 15 methods for Alzheimer-related compound identification with Dat2.