Identifying Acetylation Protein by Fusing Its PseAAC and Functional Domain Annotation

Acetylation is one of post-translational modification (PTM), which often reacts with acetic acid and brings an acetyl radical to an organic compound. It is helpful to identify acetylation protein correctly for understanding the mechanism of acetylation in biological systems. Although many acetylation sites have been identified by high throughput experimental studies via mass spectrometry, there still are lots of acetylation sites need to be discovered. Computational methods have showed their power for identifying acetylation sites with informatics techniques which usually reduce experiment cost and improve the effectiveness and efficiency. In fact, if there is an approach can distinguish the acetylated proteins from the non-acetylated ones, it is no doubt a very meaningful and effective method for this issue. Here, we proposed a novel computational method for identifying acetylation proteins by extracting features from the conservation information of sequence via gray system model and KNN scores based on the information of functional domain annotation and subcellular localization. The authors have performed the 5-fold cross-validation on three datasets along with much analysis of features and the Relief feature selection algorithm. The obtained accuracies are all satisfactory, as the mean performance, the accuracy is 77.10%, the Matthew's correlation coefficient is 0.5457, and the AUC value is 0.8389. These works might provide useful insights for the related experimental validation, and further studies of other PTM process. For the convenience of related researchers, the web-server named “iACetyP” was established and is accessible at http://www.jci-bioinfo.cn/iAcetyP.


INTRODUCTION
To date, more than 450 unique protein modifications have been identified (Han et al., 2018), including phosphorylation, acetylation, ubiquitination, and sumoylation, which are regulatory mechanisms of cellular proteins with a number of biological functions, and also are very important for regulating the function of many prokaryotic and eukaryotic proteins (Yang et al., 2017). Among these post-translational modification (PTM), acetylation is a dynamic and highly conserved PTM (Figure 1) that plays a vital role in the regulating processes of diverse cellular. The role of acetylation in histones were first discovered in histones (Allfrey et al., 1964), and the first deacetylase activity was identified back in 1969 (Inoue and Fujimoto, 1969). Owing to its important involvement in some relevant biological processes, acetylation becomes one of the most important reversible protein posttranslational modifications, hence, more and more acetylated proteins are discovered with the help of highthroughput technologies. Thus, it is a piece of very interesting work to identify the potential acetylation sites for finding the underlying molecular mechanisms, and is helpful for basic bioresearch and drug development.
However, due to the importance and complexity of acetylation, identifying acetylation sites is a great challenge to fully understand the regulatory roles and the molecular mechanism of acetylation regulation. Actually, it is a timeconsuming, expensive and labor-intensive process for purifying acetylation sites due to that the acetylation process is dynamic, rapid and reversible (Li et al., 2017;Yang et al., 2017). Fortunately, some studies had showed that experimental methods and computational models can be used to identify underlying PTMs sites (Hershko and Ciechanover, 1998;Haglund and Dikic, 2005;Tung and Ho, 2008;Radivojac et al., 2010), such as ubiquitination model of Radivojac et al. (2010), Zhao et al. (2011), andCai et al. (2012), phosphorylation model of Ingrell et al. (2007), Yao et al. (2012Yao et al. ( , 2015, Chen et al. (2015), Li et al. (2015), Trost et al. (2015), and Xu et al. (2015), sumoylation model of Beauclair et al. (2015), Xu et al. (2016), andHan et al. (2018), acetylation model of Zhao et al. (2010), Wang et al. (2012), Hou et al. (2014), and Wuyun et al. (2016), and so on. Although these researchers did make much contribution to this issue, there is still a lot of room for improving the prediction quality. However, most of these efforts are on identifying some determinate PTMs sites for a given protein sequence, and few of computational method was proposed for distinguishing the acetylated proteins from the non-acetylated ones. This study was an attempt for the issue.
For a given protein represented with amino acid sequence, how to identify whether it may be one of some certain PTM proteins or may not? This may be the first step for identifying PTM sites and then is helpful and meaningful for basic research and drug development. In fact, we have made some preliminary exploration and attempt on identifying phosphorylated proteins. In Qiu et al. (2017a,b), we presented a method for identifying human phosphorylated proteins and a multi-label classifying model for different type of phosphorylated proteins with the help of the General PseAAC concept and gray system theory. Although the results are not so perfect, we still argue that the formulations and models can be applied to this issue, and it may be more powerful when some structure, function or localization information of proteins were added into the model. This site may be a fruitful opportunity for bioinformatics. For example, Gene Ontology (GO) (Ashburner et al., 2000) was proposed by Ashburner to reposit the concepts denoted as GO Terms that are associated to other gene products, and it has been widely used in describing the attributes for gene products (Agapito et al., 2016;Peng et al., 2016).
The dataset we used here was fully extracted from Uniprot (The UniProt, 2017). The present study tried to construct a classifying model for potential acetylation proteins by fusing the digital features which are come from its evolution information, Subcellular localization (noted as SL) (Nakai and Horton, 1999) information and functional domain annotation (noted as FDA) databases including GO (Harris et al., 2004), Pfam (Bateman et al., 1999), Smart (Letunic et al., 2004), InterPro (Hunter et al., 2009), PRINTS (Attwood et al., 2012), PROSITE (Sigrist et al., 2010), SUPFAM (Pandit et al., 2004). As for subcellular localization , it was retrieved from the original UniProt database, which was reorganized by UniProt build-in hierarchical subcellular localization table. This paper proposed a new computational model for identifying potential acetylation proteins only on the basis of a query amino acid sequence by using its evolution information obtained with gray system model (Gray-PSSM) (Kaur and Raghava, 2004;Jones, 2007) and KNN scores calculated with the fuzzy distance by using its FDA and subcellular localization information. There are 80 amino acid sequence features extracted by incorporating the sequence evolution information were fused into PseAAC feature set and KNN scores, all of these features are combined according to different coefficients on the basis of its importance. To highlight the advantages of the proposed model, the model was trained and tested with three sub-datasets and cross-validations methods. In addition to some discussion of protein abovementioned features, some hypotheses for distinguishing acetylation proteins from non-acetylation ones were also depicted with the aid of training dataset.

Benchmark Dataset
It is fundamental and important that a stringent benchmark dataset be stablished for testing the proposed statistical predictor. Luckily, the UniProtKB/Swiss-Prot database is accepted by most of bioinformatics researchers, and has been using more and more widely. The data used in the current study to support this work are established on the basis of web http://www.uniprot.org.
In this study, we assume that our work is to identify whether an uncharacterized amino acid sequence is acetylation protein.
As we known, the input sequence is comprised by amino acids and can be expressed as P = P 1 P 2 P 3 · · · P i · · · P L (1) where P i represents the i-th residue of amino acids sequence P, L is the length of P.
Here, we separate a benchmark dataset into a training dataset noted as S. Thus, the datasets can be formulated as: where S posi is composed of the acetylation proteins, S nega is composed of the non-acetylation proteins, . ∪ and ∩ represent the symbol for "set union" and "set intersection, " respectively.
The version of protein data used in the current study was released in May 2017. The positive dataset was generated according to the following criteria: (1) The potential acetylated proteins should be noted by anyone keyword of the set, i.e. {N_acetylcysteine, N_acetylserine, N_acetylglutamate, N_acetylglycine, N_acetylproline, N_acetylthreonine, N_acetylvaline, N_acetylmethionine, N_acetyltyrosine, N2_acetylarginine, N6_acetyllysine, O_acetylserine, O_acetylthreonine}.
(2) The collected proteins are labeled by "Evidence" for the item of "Any assertion method." (3) Only the proteins which consisting of 30 and more amino acid residues can be included, and the redundant proteins were removed with the threshold of 50% by using CD-HIT software. The negative dataset was generated similar to the positive one except that those proteins should not be labeled none member of the mentioned above keyword-set. Since there are mass number of candidates here, we randomly collected negative datasets which have the balance samples size with positives.
Under the aforementioned standards, we obtained 2,925 protein samples, of which, the numbers of positive and negative samples are 725 and 2,175, respectively. In terms of Equation (2), we have 725 positive samples in S posi and 2,175 negative samples in S nega . Here, we test the models with cross-validation on the three datasets with 1,450 samples, i.e., S posi ∪ S − 1 , S posi ∪ S − 2 and S posi ∪ S − 3 , of which, the positive and negative ones are equal, i.e., 725 samples.

General Pseudo Amino Acid Composition (PseAAC)
Most of traditional machine-learning algorithms, such as Random Forest, SVM, and K nearest Neighbor, are not so powerful, the input should be vectors instead of sequence samples for biological issue. To overcome this problem, the researchers trying their best to improve the discrete or vector model by formulating the amino acids sequence into all kinds of pseudo amino acid composition (PseAAC), encoding method (Zhang et al., 2006;Chen et al., 2011;Shi et al., 2012;Jiao and Du, 2017) or other approaches.
Here, the proposed model followed the idea of PseAAC (Chou, 2011), and formulated an amino acids sequence P as: Here, the symbol T means the transpose operator for a matrix, N is an integer representing the number of features which depend on the method(s) used for extracting information from protein P (cf. Equation 1). P is a vector for representing amino acids sequence P and p i (i = 1, 2, · · · , N) is the ith element of the vector. Below, we will describe how to extract functional domain annotation and subcellular localization information as well as pseudo amino acid composition, which are used in this study, from a query sequence to define the components for amino acids sequence P.

Protein Sample Formulation With KNN Score Based on FDA and Subcellular Localization (SL)
In addition to GO database, "Pfam, " "Smart, " "PROSITE, " "SUPFAM, " "InterPro, " and "PRINTS" are established according to cellular component, molecular function, biological process or some other characteristics. For example, the Pfam database is a large collected protein families generated by using hidden Markov models. SMART is abbreviation of Simple Modular Architecture Research Tool which can be used for research on the protein domains and architectures. PROSITE consists of entries describing the protein families, domains and functional sites as well as amino acid patterns and profiles. InterPro provides a functional analysis of protein sequences, and PRINTS also is a resource of detailed annotation for protein families in addition to a diagnostic tool for newly determined sequences. Subcellular localization feature is a key functional characteristic of potential gene products such as proteins, especially for plant.
Actually, in the GO database, proteins are clustered in a way in which their subcellular locations can be reflected fully. To incorporate more information, most of the approaches need to formulate a long list of the GO numbers, and a great part of the GO numbers make meaningless as a whole. In literatures (Gao et al., 2010;Yao et al., 2012), the authors show us that local sequence clusters often appear in the neighborhood of PTM sites for the reason that the same PTM family generally have some similarities in local sequences. As a better choice for depicting the character, K nearest neighbor score was proposed. To take advantage of such cluster information of GO and other FDA databases as well as subcellular localization for predicting acetylation proteins, for a given potential acetylation protein, we took the characteristics around the query neighborhood and extracted the KNN scores features from the training dataset containing both positive and negative samples. The algorithm is listed as follows.
Step 1. For a query protein sequence find its k nearest neighbors, which can be positive or negative samples, in the whole set according to local sequence similarity. For a given protein p, FDA j (p) = {N p,j 1 , N p,j 2 , · · · , N p,j n p } represents the keywords set of the jth FDA. The j (=1, 2, . . . , 7, 8) represents "GO, " "Pfam, " "Smart, " "PROSITE, " "SUPFAM, " "InterPro, " "PRINTS, " or "subcellular localization, " respectively), FDA j (q) = {N q,j 1 , N q,j 2 , · · · , N q,j n q } is the similar mean for protein q. The similarity distance Dist j p, q between p and q can be defined as follows: Where , and | | are the operators "union, " "intersection, " and "norm" of the set theory, respectively. Here, | | is defined as the number of its elements. dist(p, q) is the Euclidean distance on the basis of PseAAC. w 1 and w 2 are the weights of the two distances.
Step 2. A corresponding KNN feature is then extracted by calculating the KNN score, noted it as the percentage of acetylation proteins in its k nearest neighbors.
Step 3. To obtain diverse and enough properties of neighbors with KNN scores, the above two steps were repeated for different k values. For the jth member of FDA, the protein P can be formulated as: In this paper, the number of features is 50 and k was defined to be 0.1, 0.4, 0.7, . . . , 14.5 and 14.8 percent of the size of the involved dataset. In this way, 50 KNN scores were extracted as features for identifying acetylation proteins. To be more precisely, ϕ 1 (j) is the ratio of positive neighbors to whole concerned samples, i.e., 0.1 percent of the size of the training data set, ϕ 2 (j) is the ratio of positive neighbors to whole concerned samples whose value is the product of 0.004 and the size of the training data set, and so forth, when K = 50, ϕ 50 (j) is the ratio of positive neighbors to 14.8 percent of the size of the training data set. In a word, a query protein sequence can be formulated with seven 50-Dimension vectors, i.e., P FDA = [P FDA 1 , P FDA 2 , . . . , P FDA 7 ], by using FDA database. Since Chou's pseudo amino acid composition (PseAAC) (Chou, 2001;Mondal and Pai, 2014) have showing so great powerful for identifying structure and function of protein, the proposed method took it into account according to the style of reference (Shen and Chou, 2008) (we select type 1 and let λ = 5). Thus, a given protein sequence can be expressed as 375-dimension vector, and these digital representations served as the input of the query protein for the prediction model.

Algorithms
Here we choose Random Forest as the operation engine as the predictor, and named the final predictor as "iAcet-PseFDA." This name is an acronym created from its description, and Figure 2 would show how iAcet-PseFDA working.
As shown in Figure 2, the first step is to input the query amino acid sequence P. And then, the PSI-BLAST software was used to find the most similar protein to P, which is used to determine the most likely GO or other information of FDA set and generate the KNN scores with it. With the descriptor of P, the desired result can be obtained with the framework of Random Forest classifier trained on the benchmark.

Metrics and Test Method
The predictor iAcet-PseFDA was evaluated with crossvalidation tests in the terms of following seven widely-accepted measurements: accuracy (or Acc, for short), Mathew's correlation coefficient abbreviated as Mcc, sensitivity (abbreviated as Sn, i.e., the fraction of the relevant documents that are successfully retrieved), specificity (i.e., Sep), Precision (i.e., Pre, a description of random errors), F-measure (or F-m, the harmonic mean of precision and recall), and G-mean. Since the area under the receiver operating characteristic curve (auROC, for short) is another important measurement of the performance of a given model, it was also calculated and plotted in this study. In view of the traits of validation method trait, cross-validation method was applied on three datasets for evaluating the proposed predictor.

RESULTS AND DISCUSSION
Investigating the Performances of KNN Score of FDA Represent  Figure 3A showed the comparison of PAAC represents between acetylation proteins and nonacetylation proteins, Figure 3B showed those of KNNScore-GO, and so forth, Figure 3I showed those of Subcellular localization.
Overall, acetylation proteins gained obvious larger KNN scores than non-acetylation proteins on GO and Subcellular localization, and a little larger gap between the KNNScores of positive and negative datasets, all of the average KNN scores are nearly merged in 0.5 with the growth of features.
Specifically, for acetylation proteins with the view of GO evaluated on different sizes of nearest neighbors, the average values shown in Figure 3B are within 0.6-0.8, however, the average digits are within 0.2-0.4 for non-acetylation proteins. From the view of Subcellular localization as showed in Figure 3I, most of the average KNN scores of acetylation proteins are waved within 0.5-0.7 while those of non-acetylation proteins fluctuating around 0.4. From the view of Smart, Supfam, InterPro Pfam, Prosite and PRINTS as showed in Figures 3C-H, there are clearly gaps between the acetylation proteins and nonacetylation proteins, and the gaps are narrowing with the growth of KNNScores number.
We tested the eight kinds of features on the three datasets with RF, and the mean performances are depicted in the first 11th lines of Table 1, while the compared measurements obtained from the proposed model, in which the features were selected with Relief, are attached in the last line. As showed in the table, the features of Subcellular localization reached the best results with Acc is 73.95%, Mcc is 0.4843, Sn is 81.24%, Recall is 81.24%, F-measure is 75.72%, and G-mean is 73.59%. As regards for Sp and Precison, GO gained the best result which are 68.37 and

Performance of Proposed Model
Based on the above discussions, we argue that the local amino acids surrounding acetylation sites, which have been verified, would share in similar pattern(s) with positive set on average as expected. These findings confirm that there are some acetylationrelated clusters in acetylated proteins and hence may be used to distinguish them from the non-acetylation protein. Accordingly, the KNN scores were used to encode query sequence for predicting acetylation proteins in this study. As we known, the Relief algorithm as a feature weighting algorithm was first proposed by Kira and Rendell (1992). In the algorithm, the features were allocated different weights in light of the relevance of characteristics and categories. The feature will be removed when its weight less than a threshold by this method. Since the combined features generated a highdimensional vector, and the Relief method can rank the values of features, this work thus used Relief to reduce feature redundancy. With the help of Relief, we tested the predictor on different features sets and listed the mean performances in the last line of Table 1. The Acc is 77.55% which is better than 74.64%, the result obtained by using all of the eight features, and better than that of subcellular localization. The Relief model gained the better results according to the other seven measurements. Figure 4   in the selected feature set. The predictor obtains the best result at 156, which means there are 156 features were selected here, with Acc is 77.55%, Mcc is 0.5883, Sn is 96.41%, Sp is 71.26%, Precision is 52.79% which isn't the best performance unfortunately, Recall is 96.41%, F-measure is 68.23% and G-mean is 82.89%. These obtained results are better than anyone of Table 1.
The performance of iAcet-PseFDA was also depicted with ROC curves shown in Figure 5 in which the graphic lines are represent for GO, Subcellular localization and other Domain notations' KNNScores along with PseAAC's. As shown in first subfigure of Figure

CONCLUSION
In order to detect acetylation proteins, this study developed a method on the basis of Random Forest algorithm and Relief. Our approach considered information of sequence conservation extracted by PSI-BLAST besides with PseACC. The involved features are extracted from the sequence conservation information and "GO, " "Pfam, " "Smart, " "PROSITE, " "SUPFAM, " "InterPro, " "PRINTS" and Subcellular localization information of the given query amino acid sequence. This work may cope with the expensive and time-consuming process of identifying acetylation proteins because that the features only incorporated the sequence conservation via gray system model and Knn scores based on FDA databases. All of these processes only need computational model instead of any physical chemistry experiment.
Also, our result manifested that it appears that using FDAs is essential for the prediction of acetylation functional class, which had been reported in previous research (Qiu et al., 2016a(Qiu et al., ,b, 2017b, and the information related to subcellular is also important for identifying the PTM proteins. As the growing demand of verification of acetylation sites, we argue that more effort should be input in developing organismspecific predictors for this issue. The reason for presenting the model here then is for the improving the predictor used in similar research, and it may be helpful for those researchers who would like to deal with bioinformatics problems with computational models. In addition, the involved features may provide important clues of the acetylation mechanism and guide the related experimental validations.
Additionally, a web-server has been established at http://www. jci-bioinfo.cn/iAcetyP which is user-friendly and convenient for the researchers who are working in distinguishing acetylated proteins from non-acetylated proteins.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. This data can be found here: http://www.uniprot.org/.

AUTHOR CONTRIBUTIONS
AX and Z-CX carried out the extraction of annotation features, model construction, model training, and evaluation, also drafted the related initial manuscript version. C-HZ carried out the preparation of the data, GO enrichment, subcellular localization analysis, comparison with deep learning tool, and drafted the subsequent manuscript. XX led this project and guided this work and shared his idea to implement and discussion. All authors involved in discussion and conclusion section.