A Method for Prediction of Thermophilic Protein Based on Reduced Amino Acids and Mixed Features

The thermostability of proteins is a key factor considered during enzyme engineering, and finding a method that can identify thermophilic and non-thermophilic proteins will be helpful for enzyme design. In this study, we established a novel method combining mixed features and machine learning to achieve this recognition task. In this method, an amino acid reduction scheme was adopted to recode the amino acid sequence. Then, the physicochemical characteristics, auto-cross covariance (ACC), and reduced dipeptides were calculated and integrated to form a mixed feature set, which was processed using correlation analysis, feature selection, and principal component analysis (PCA) to remove redundant information. Finally, four machine learning methods and a dataset containing 500 random observations out of 915 thermophilic proteins and 500 random samples out of 793 non-thermophilic proteins were used to train and predict the data. The experimental results showed that 98.2% of thermophilic and non-thermophilic proteins were correctly identified using 10-fold cross-validation. Moreover, our analysis of the final reserved features and removed features yielded information about the crucial, unimportant and insensitive elements, it also provided essential information for enzyme design.


INTRODUCTION
Proteins denature when the environmental temperature increases dramatically . However, thermophiles can survive in temperatures ranging from 41 • C to 122 • C (Takai et al., 2008;Fan et al., 2016) and produce enzymes that react well at higher environmental temperatures, such as 120 • C (Fan et al., 2016). In enzyme engineering, identifying the functional mechanisms of these proteins will provide insights into the design and optimization of enzymes .
Protein thermostability has been shown to be related to hydrophobicity (Gromiha et al., 2013), hydrogen bonding (Bleicher et al., 2011), hydrophobic free energy (Gromiha et al., 1999;Saraboji et al., 2005), and residue (Meruelo et al., 2012) and inter-residue contacts (Gromiha, 2001). Moreover, Das and Gerstein (2000) found that salt bridges are essential for maintaining protein thermostability in thermophilic bacteria. The distribution of amino acids in proteins (Fukuchi and Nishikawa, 2001;Zhou et al., 2008) and the presence of dipeptide (Ding et al., 2004;Zhang and Fang, 2006a,b) also affect protein thermostability. In a study by Vieille, the composition of Arg is greater in thermophiles than in mesophiles (Vieille and Zeikus, 2001). Guo also showed that expurgation of water-accessible thermo-labile residues, such as Gln and Met, affects the thermostability of enzymes expressed by thermophiles . Besides, Chen et al. (2016) found the pseudo amino acid composition had a big effect on the protein identification task, and constructed a web server to give a free way to use their algorithm 1 .
Sequence-based protein identification provides an alternative method for studies of protein thermostability (Zhang and Fang, 2007;Wu et al., 2009;Li and Fang, 2010;Liu et al., 2011Zuo et al., 2013;Fu et al., 2018;Wang et al., 2018;Zhang et al., 2018;Cheng et al., 2019;Yu et al., 2019b). Wang et al. (2011) introduced a feature selection method to identify vital features from the pseudo amino acid composition, amino acid composition, physicochemical features, composition transition, and distribution features using a support vector machine (SVM) to detect thermophilic proteins. Additionally, Tang proposed a two-step discrimination method with 94.44% accuracy using 5fold cross-validation. Lin et al. constructed a dataset containing 915 thermophilic proteins and 793 non-thermophilic proteins, and predicted 93.8% thermophilic proteins and 92.7% nonthermophilic proteins using SVM. The same conclusion was also reached by Nakariyakul et al. (2012), who obtained 93.3% identification accuracy in the same database used by Lin. In another study, Fan et al. (2016) integrated information on the amino acid composition, evolution information, and acid dissociation constant to identify thermophiles by SVM, yielding an overall accuracy of 93.53%. Modarres et al. (2018) proposed a new thermophilic protein database, which contained 14 million protein sequences. In this database, all sequences were categorized according to the thermal stability and protein family property. Not only the sequences but also structures of thermophilic proteins were contained in the database. This online database gave the developers a powerful tool in the thermophilic protein prediction task.
In this study, we integrated 188 physicochemical characteristic features, auto-cross covariance (ACC) information, and dipeptide compositions of reduced amino acids to obtain a mixed feature set. Redundant features were then removed using correlation analysis, and dimensions were reduced using the max-relevance-max-distance (MRMD) method and principal component analysis (PCA). Finally, the SVM and other three machine learning methods were used to identify thermostability.

MATERIALS AND METHODS
The main framework of the method used in this study could be divided into the following four parts: (a) transforming thermophilic protein sequences to a reduced amino acid form; (b) extracting useful features; (c) using the SVM to train the extracted 1 http://lin-group.cn/server/Lypred/ features; (d) predicting the test data by machine learning (Yu et al., 2017a,b;Zou et al., 2017a,b;Zhang et al., 2019a). The framework is shown in Figure 1.

Datasets
We used the dataset constructed by Lin et al. (Lin and Chen, 2011), whose data were chosen from the Universal Protein Resource (UniProt). The temperature of thermophilic proteins in this dataset was set to above 60 • C and the temperature of non-thermophilic proteins was set to be less than 30 • C. After removing redundancy and homology bias, there were 915 thermophilic and 793 non-thermophilic proteins. These data can be downloaded from http://www.labio.info/index-1therm.html.

Reduced Amino Acid Composition (RAAC)
In order to improve phylogenetic estimates, it is possible to recode the amino acids in the protein sequence (Susko and Roger, 2007). Furthermore, some reduced amino acid schemes, including the "Dayhoff classes" (AGPST, DENQ, HKR, ILMV, FWY, and C), have attracted attention (Susko and Roger, 2007).
In order to maximize the ratio of the expected number of substitutions within bins under the JTT model, Susko et al. proposed their reduced amino acid alphabet, which contains 30 schemes. In this study, we chose the final scheme as follows: A, C, D, E, F, G, H, IV, K, L, M, N, P, Q, R, S, T, W, Y. Thus, the 20 amino acids were classified into 19 types in the above scheme (Susko and Roger, 2007), in which Ile (I) and Val (V) were viewed as a single type, while every one of other categories had only one amino acid. Under this reduced scheme, we use the webserver of Zuo (Zheng et al., 2019) to calculate the RAAC of the thermophilic and non-thermophilic proteins.
Furthermore, dipeptides of proteins, like AA, A * A (λ gap = 1), and A * * A (λ gap = 2), AK, A * K, A * * K, etc., were also obtained using this webserver (Chen et al., 2016;Yang et al., 2019). The following formula was used to calculate the values of those features: where y λ 361 (j) denotes the number of λ-gap dipeptides of type j in a protein sequence.

Physicochemical Characteristics
To quantitatively identify proteins, the physicochemical characteristics were obtained using a method (temporarily called 188d), which could extract sequence information and amino acid properties (Song et al., 2014;Xu et al., 2014Xu et al., , 2018Fu et al., 2019;Liu, 2019;Zhu et al., 2019). The first 20 elements in the results of this method denoted the frequency of the 20 original amino acids ; the next 24 features reflected the group proportion corresponding to three groups (Qu et al., 2019); the following 120 dimensions were the distributions of three groups in five local positions (Cai et al., 2003); the last 24 features were the numbers of three types of dipeptides.

ACC
Auto covariance (AC) and cross-covariance (CC) calledACC, can reflect the relationship between amino acids with certain length features and contains AC and CC (Dong et al., 2009;Liu et al., 2015). The formula of CC transforms a protein sequence to a vector form Liu et al. (2016): where N denotes the number of properties. ϕ i can be calculated as: (Guo et al., 2008) where i is a residue, Ldenotes the length of the whole protein sequence, S i,j represents the i-th property of the j-th amino acid, and S i reflects the mean value of the i-th property (Qu et al., 2019). In our experiment, the value of lg was set to 2.

Correlation Analysis
Some pairs in our feature set were found to be highly correlative, indicating that the effects of these two features were similar. Furthermore, this phenomenon denotes redundant and repeated information were present in the feature set. However, without the preprocess of discarding redundant information, machine learning models are associated with a risk of overfitting (Hua et al., 2009;Mwangi et al., 2014;Zeng et al., 2019b).
Thus, a correlation analysis-based redundant information expurgate method was proposed to discard one feature from each of the highly relevant feature pairs. As a prepare step, all feature values need to be normalized to [0,1] using the following equation: where x i (i = 1, 2, 3, · · · ) denotes the i-th value in the feature set, xrepresents the mean value of the current feature vector, andx max , x min correspondingly reflect the maximum and minimum values of the feature vector. Then, Pearson's correlation was used to evaluate the correlations between any two features. Its value was written as follows: (Thibeault and Srinivasa, 2013;Jin et al., 2019) where X and Y are two given feature vectors,XandȲ represent the mean value of X and Y, respectively, and σ(X) and σ(Y) denote the standard deviations of X and Y, respectively. In our experiment, for any feature pair X and Y, if the value of ρ (X, Y) was larger than the threshold T, then X and Y were considered a highly correlated feature pair. In the next step, we decided whether to remove one of the features from the feature set while retaining the other in the feature set. Thus, for the first feature pair, a removed feature set D and a reserved feature set R were created and set as an empty set. Then, one feature was set to belong to D, while the other was set to belong to R randomly. In the following computation, the rule for assigning features could be expressed as follows: assuming that X-Y is a highly correlative feature pair, After all M features in D were removed from the feature set, the correlation between feature pairs was decreased dramatically. The threshold T used in our experiment was set as 0.85.

MRMD Feature Selection
Dimensionality reduction is a key process in machine learning research and application (Bhola and Singh, 2018). The MRMD method, as presented by Zou et al. (2016), was used to rank features in descending order and reduce the feature number. There were two object functions; the first reflected the relationship between the current feature and the target class, which could be written as follows : where f i,k and C i,k represent the k-th element in the feature vector F i and C i , respectively. The other object function was expressed in the following form Zou et al. (2016): Integrating the above two functions, we obtained the final objective function, which was written as follows: Solving this function, when the function reached the maximum ACC value, the iteration was stopped automatically, giving a feature dimension reduced set.

PCA
Principal component analysis (Price et al., 2006) is a widely used tool that can transform the features of observation into an uncorrelated feature set (Zeng et al., 2017(Zeng et al., , 2019aXiao et al., 2018;Zhang et al., 2019b). The main steps of PCA are as follows: (1) normalize the feature vector value; (2) calculate the covariance matrix by = 1 m X · X T ; (3) use the singular value decomposition method (U, S, V T ); = SVD( ); (4) extract the first k singular vectors from U and (5) calculate the i-th eigenvalue λ i , i = 1, 2, 3, · · · We used ρ to evaluate the cumulative contribution value of the singular vectors; this value was defined as ρ = where m denotes the dimension of the transformed features. The above function denotes there is enough information to serve as the optimal feature set for the identification 0task when the cumulative contribution value of singular vectors from the first one to the λ-th one reaches a value, namely, the threshold T . Thus, through the threshold T , only a part of features were selected and then formed an optimal feature set, which made the model simple and fast to run.

Machine Learning Methods
In order to distinguish between thermophilic and nonthermophilic proteins, SVM (Ding et al., 2016a,b;He et al., 2018;Qiao et al., 2018;Wei et al., 2018;Fu et al., 2019;Wang et al., 2019b), random forest [RF, (Ding et al., 2017;Wang et al., 2019a)], decision tree (Mohasseb et al., 2018;, and naïve Bayes [NB, (Rajaraman and Chokkalingam, 2014)] methods were used in our experiment. The first two methods were implemented and optimized in the python 3.7 environment with our edited code. All four methods were also tested in the Weka environment, yielding similar results.

Evaluation of Performance
In order to evaluate the model performance, we used a 10-fold cross-validation scheme in our experiment and adopted three commonly used accuracy indicators for quantification (Jiang et al., 2013(Jiang et al., , 2018Zeng et al., 2016;Wei et al., 2017a,b;Lu et al., 2018Lu et al., , 2019Xiong et al., 2018;Chen et al., 2019;Ding et al., 2019;Lin et al., 2019;Shan et al., 2019;Shen et al., 2019;Xu et al., 2019;Yu and Gao, 2019;Yu et al., 2019a). The first indicator was sensitivity (Sn), which represents the ratio of the correctly identified thermophilic proteins and could be calculated as follows: where TP, TN, FP, and FN represent the number of the correctly identified thermophilic proteins, the number of the correctly indemnified non-thermophilic proteins, the number of nonthermophilic proteins predicted as thermophilic proteins, and the number of the thermophilic proteins predicted as nonthermophilic proteins, respectively (Lin and Chen, 2011). The second indicator was specificity (Sp), which denotes the percentage of the correctly identified non-thermophilic proteins among all non-thermophilic observations. Sp was defined as follows: The last indicator was accuracy (ACC), which reflected the percentage of correctly recognized thermophilic and nonthermophilic proteins among all observations, written as follows:

RESULTS
Our experiments were performed on the basis of qualitative evaluation, quantitative analysis, and comparison with other counterparts, as shown in Figure 2. The data were calculated using 500 randomly selected thermophilic proteins and 500 randomly selected non-thermophilic proteins, and experiments were evaluated in 10-fold cross-validation format. First, we evaluated the proposed method using qualitative analysis. In this analysis, all feature data were reduced to 12 dimensions through the PCA method. Furthermore, the t-SNE method (van der Maaten and Hinton, 2012; van der Maaten, 2014) is one of the powerful visualization tools for showing the structure of high-dimension data. Thus, we used the t-SNE method (van der Maaten and Hinton, 2012; van der Maaten, 2014) to differentiate thermophilic and nonthermophilic proteins in the figure. Additionally, the t-SNE method used here was not a part of the proposed model, but was a display tool of the experiment data. The first two features of the results using the t-SNE method are plotted in Figure 2A; from these data, a distinct boundary was observed for separating thermophilic and non-thermophilic observations. Moreover, it was easy to distinguish thermophilic proteins from non-thermophilic proteins.
In order to verify these findings, SVM was used to train and test the 12-dimensional data, and the results are shown in Figure 2B. Both types of proteins were separated successfully using this method. This phenomenon directly demonstrated that our proposed data had good separation quality and the SVM method had strong recognition ability for thermophilic proteins and non-thermophilic data.
Second, the processed data were tested using the other three machine learning methods, as detailed in Figure 2C. For every method, we also calculated three accuracy indicators: Sn, Sp, and ACC. The results showed that the SVM yielded the highest values for all three indicators, and all values reached at least 98.2%. NB also showed higher accuracy, with values of 96.25%, 97.56%, and 96.89%, respectively. The accuracy of the random forest model was higher than that of J48, for which the average value was only 91.48%.
Our method was also compared with the results of Lin (Lin and Chen, 2011) and the method of using the same dataset (Fan et al., 2016). The results are shown in Figure 2D. Notably, our method got the highest accuracy values based on the results of the MRMD methods, which denotes our proposed method outperformed the method described by Lin (Lin and Chen, 2011). Additionally, the performance of the proposed method was better than the effects described by Fan et al. (2016) too, suggesting that the proposed method could be a state-of-the-art model in current research.
Features using the original dipeptides were also tested in our study. All reduced features in our feature set were replaced with the original dipeptides. From the accuracy data shown in Figure 2D, the ability to distinguish thermophilic proteins from non-thermophilic ones was lower than that using the reduced amino acid dipeptides. Additionally, the receiver operating characteristic (ROC) curve was also plotted, which could be seen in Figure 3A. It is easy to found that the results of the ROC curve verified the identification efficiency of the proposed method too.
Finally, the newly released thermophilic protein database (Mohasseb et al., 2018) is also tested through the proposed method in this manuscript. In the experiment, we selected 106 thermophilic proteins and 101 psychrophilic proteins from the database. All those data can be downloaded on the website: http://www.labio.info/index-1therm.html. In the experiment, we did three experiments using three different thresholds in the correlation analysis step. The experiments are given in Figure 3B, from which it was easy to find that the identification accuracy was bigger than 0.97 in most cases when using the threshold of 0.95 and 0.90. It also showed that the classification efficiency was not ideal when using the threshold 0.85. The reason for this phenomenon may be the calculated features of the current data have a stronger correlation between each other than the previous thermophilic protein database. Thus, in this condition, a big value than 0.85 is needed to identify the thermophilic proteins accurately. It is worth noting that the results in this figure verified the perfect identification ability of the proposed method.

DISCUSSION
Many features are removed from the original feature set during correlation analysis and MRMD feature selection. Moreover, these removed features are typically not crucial or redundant for performing thermophilic protein recognition. However, the selection of features to remove and retain is essential, and further studies are needed to evaluate such approaches. Thus, in this study, we evaluated the removed features, as depicted in Figure 4.
The 10 most critical original features are shown in Figure 4A, and under our proposed model framework, the feature values of K * H, KR, TF, P * M, F * * N, I * * Y/V * * Y, MW, and WQ (where * represents a gap in the residues) showed significant contributions to the recognition of thermophilic proteins. Additionally, residue K also plays a vital role in enhancing thermostability. Interestingly, our conclusions regarding residue K were consistent with the results of Lin (Lin and Chen, 2011).  Thus, we concluded that the physicochemical characteristics were essential features, but not the most essential features, The symbol "*" means any one of the 20 amino acids, it may be "A", "C", "P", or others. Besides, "**" has the same meaning; it represents a two-letter combination of 20 amino acids, "AA", "DC", "VP", for example.
for this recognition task. Accordingly, we did not analyze the details of the removed physicochemical characteristics. We also showed that only three ACC features were excluded from the final feature set, and the remaining 15 ACC features were retained, reflecting the crucial roles of the ACC features in this recognition task.
The amino acid frequency, which was one of the first 20 features in the 188D feature set, included only four residues removed from the feature set. These four residues were V (Ile and Val), A, E, and K, which had little contribution to recognizing thermophilic protein and non-thermophilic proteins. Interestingly, the reduced amino acid V, which included both Ile and Val, was also deleted. It is worth noting that the amino acid V appeared later in this manuscript denotes the reduced V, namely, both Ile and Val. This finding indicated that both Ile and Val were redundant and did not contribute to the identification task. If we used the original amino acid dipeptide features, additional useless features, including IA, I * A, and I * * A, etc., would also be observed in the feature set. The number of additional redundant features in the original dipeptides could be as high as 39 if compared with the reduced amino acid dipeptides. As shown in Figure 2D, the smallest prediction accuracy was obtained, and represented those many additional useless features caused the classification model fail in the overfitting state when using the original dipeptides. Additionally, this observation could explain why the accuracy increased significantly when using the reduced amino acid dipeptides.
There were three types of dipeptides, expressed as AA (λ = 0), A * A (λ = 1), and A * * A (λ = 2). The numbers of these types of removed dipeptides were 60, 61, and 71, respectively. To conveniently visualize these data, we counted the numbers of the same dipeptide (omitting the symbols * and * * ). If a dipeptide appeared more than twice, it was drawn in the figure. Thus, if the dipeptide NV was shown in the figure, there were at least two types of dipeptides, i.e., NV, N * V, or N * * V, in the removed feature set.
All discovered dipeptides were classified into two parts, as shown in Figures 4C,D. The reduced dipeptides in Figure 4C were dipeptides having relationships with the reduced residue V, verifying the reduced power of the recognition task in the above analysis. Moreover, residue V enabled the discovery of seven related dipeptides in the removed features. This phenomenon demonstrated that residue V and some dipeptides containing V were insensitive to the recognition task under our proposed model framework. Figure 4D also shows another seven removed dipeptides, including VV, AV, AE, AL, LE, LL, and LR.
These results provide insights into the design of stable mutants to increase protein thermostability.

CONCLUSION
In this study, we aimed to develop an approach to distinguish thermophilic proteins from non-thermophilic proteins; to this end, a recognition method that combined mixed features of proteins and a machine learning method was established. First, an amino acid reduction method was introduced to reduce the categories of amino acids. Nest, we calculated the physicochemical characteristics, ACC, and reduced dipeptides of thermophilic and non-thermophilic proteins. After performing a dimension reduction step using correlation analysis, the MRMD method, and PCA, an optimal feature set was obtained. Finally, machine learning methods were used to train and predict feature data, and the results revealed that the proposed model could identify 98.2% of thermophilic proteins and non-thermophilic proteins if the data were operated in a 10-fold cross-validation mode. Furthermore, the feature values of K * H, KR, TF, P * M, F * * N, V * * Y, MW, and WQ were found to play vital roles in thermostability, and some residues and dipeptides, including V (Ile and Val), A, E, K, NV, VG, VA, AE, AL, and LE, were not important for identifying thermostability. As discussed in previous studies (Liu and Li, 2019;Liu and Zhu, 2019), the web-server is very important. In our future work, our research will focus on developing a free webserver that could provide a platform to test the currently proposed method using an easily accessible approach.

DATA AVAILABILITY STATEMENT
The datasets generated for this study can be found in the http: //www.labio.info/index-1therm.html.