A Method for Prediction of Thermophilic Protein Based on Reduced Amino Acids and Mixed Features

Feng, Changli; Ma, Zhaogui; Yang, Deyun; Li, Xin; Zhang, Jun; Li, Yanjuan

doi:10.3389/fbioe.2020.00285

ORIGINAL RESEARCH article

Front. Bioeng. Biotechnol., 05 May 2020

Sec. Computational Genomics

Volume 8 - 2020 | https://doi.org/10.3389/fbioe.2020.00285

This article is part of the Research TopicExplainable Intelligent Processing of Biological Resources Integrating Data, Information, Knowledge, and WisdomView all 21 articles

A Method for Prediction of Thermophilic Protein Based on Reduced Amino Acids and Mixed Features

Changli Feng¹

Zhaogui Ma¹

Deyun Yang¹

Xin Li¹

Jun Zhang^2*

Yanjuan Li^3*

¹College of Information Science and Technology, Taishan University, Tai’an, China
²Department of Rehabilitation, General Hospital of Heilongjiang Province Land Reclamation Bureau, Harbin, China
³Information and Computer Engineering College, Northeast Forestry University, Harbin, China

The thermostability of proteins is a key factor considered during enzyme engineering, and finding a method that can identify thermophilic and non-thermophilic proteins will be helpful for enzyme design. In this study, we established a novel method combining mixed features and machine learning to achieve this recognition task. In this method, an amino acid reduction scheme was adopted to recode the amino acid sequence. Then, the physicochemical characteristics, auto-cross covariance (ACC), and reduced dipeptides were calculated and integrated to form a mixed feature set, which was processed using correlation analysis, feature selection, and principal component analysis (PCA) to remove redundant information. Finally, four machine learning methods and a dataset containing 500 random observations out of 915 thermophilic proteins and 500 random samples out of 793 non-thermophilic proteins were used to train and predict the data. The experimental results showed that 98.2% of thermophilic and non-thermophilic proteins were correctly identified using 10-fold cross-validation. Moreover, our analysis of the final reserved features and removed features yielded information about the crucial, unimportant and insensitive elements, it also provided essential information for enzyme design.

Introduction

Proteins denature when the environmental temperature increases dramatically (Tang et al., 2017). However, thermophiles can survive in temperatures ranging from 41°C to 122°C (Takai et al., 2008; Fan et al., 2016) and produce enzymes that react well at higher environmental temperatures, such as 120°C (Fan et al., 2016). In enzyme engineering, identifying the functional mechanisms of these proteins will provide insights into the design and optimization of enzymes (Tang et al., 2017).

Protein thermostability has been shown to be related to hydrophobicity (Gromiha et al., 2013), hydrogen bonding (Bleicher et al., 2011), hydrophobic free energy (Gromiha et al., 1999; Saraboji et al., 2005), and residue (Meruelo et al., 2012) and inter-residue contacts (Gromiha, 2001). Moreover, Das and Gerstein (2000) found that salt bridges are essential for maintaining protein thermostability in thermophilic bacteria. The distribution of amino acids in proteins (Fukuchi and Nishikawa, 2001; Zhou et al., 2008) and the presence of dipeptide (Ding et al., 2004; Zhang and Fang, 2006a, b) also affect protein thermostability. In a study by Vieille, the composition of Arg is greater in thermophiles than in mesophiles (Vieille and Zeikus, 2001). Guo also showed that expurgation of water-accessible thermo-labile residues, such as Gln and Met, affects the thermostability of enzymes expressed by thermophiles (Guo et al., 2014). Besides, Chen et al. (2016) found the pseudo amino acid composition had a big effect on the protein identification task, and constructed a web server to give a free way to use their algorithm¹.

Sequence-based protein identification provides an alternative method for studies of protein thermostability (Zhang and Fang, 2007; Wu et al., 2009; Li and Fang, 2010; Liu et al., 2011, 2019; Zuo et al., 2013; Fu et al., 2018; Wang et al., 2018; Zhang et al., 2018; Cheng et al., 2019; Yu et al., 2019b). Wang et al. (2011) introduced a feature selection method to identify vital features from the pseudo amino acid composition, amino acid composition, physicochemical features, composition transition, and distribution features using a support vector machine (SVM) to detect thermophilic proteins. Additionally, Tang proposed a two-step discrimination method with 94.44% accuracy using 5-fold cross-validation. Lin et al. constructed a dataset containing 915 thermophilic proteins and 793 non-thermophilic proteins, and predicted 93.8% thermophilic proteins and 92.7% non-thermophilic proteins using SVM. The same conclusion was also reached by Nakariyakul et al. (2012), who obtained 93.3% identification accuracy in the same database used by Lin. In another study, Fan et al. (2016) integrated information on the amino acid composition, evolution information, and acid dissociation constant to identify thermophiles by SVM, yielding an overall accuracy of 93.53%. Modarres et al. (2018) proposed a new thermophilic protein database, which contained 14 million protein sequences. In this database, all sequences were categorized according to the thermal stability and protein family property. Not only the sequences but also structures of thermophilic proteins were contained in the database. This online database gave the developers a powerful tool in the thermophilic protein prediction task.

In this study, we integrated 188 physicochemical characteristic features, auto-cross covariance (ACC) information, and dipeptide compositions of reduced amino acids to obtain a mixed feature set. Redundant features were then removed using correlation analysis, and dimensions were reduced using the max-relevance-max-distance (MRMD) method and principal component analysis (PCA). Finally, the SVM and other three machine learning methods were used to identify thermostability.

Materials and Methods

The main framework of the method used in this study could be divided into the following four parts: (a) transforming thermophilic protein sequences to a reduced amino acid form; (b) extracting useful features; (c) using the SVM to train the extracted features; (d) predicting the test data by machine learning (Yu et al., 2017a, b; Zou et al., 2017a, b; Zhang et al., 2019a). The framework is shown in Figure 1.

FIGURE 1

Figure 1. The whole framework of the proposed method in this manuscript.

Datasets

We used the dataset constructed by Lin et al. (Lin and Chen, 2011), whose data were chosen from the Universal Protein Resource (UniProt). The temperature of thermophilic proteins in this dataset was set to above 60°C and the temperature of non-thermophilic proteins was set to be less than 30°C. After removing redundancy and homology bias, there were 915 thermophilic and 793 non-thermophilic proteins. These data can be downloaded from http://www.labio.info/index-1therm.html.

Reduced Amino Acid Composition (RAAC)

In order to improve phylogenetic estimates, it is possible to recode the amino acids in the protein sequence (Susko and Roger, 2007). Furthermore, some reduced amino acid schemes, including the “Dayhoff classes” (AGPST, DENQ, HKR, ILMV, FWY, and C), have attracted attention (Susko and Roger, 2007).

In order to maximize the ratio of the expected number of substitutions within bins under the JTT model, Susko et al. proposed their reduced amino acid alphabet, which contains 30 schemes. In this study, we chose the final scheme as follows: A, C, D, E, F, G, H, IV, K, L, M, N, P, Q, R, S, T, W, Y. Thus, the 20 amino acids were classified into 19 types in the above scheme (Susko and Roger, 2007), in which Ile (I) and Val (V) were viewed as a single type, while every one of other categories had only one amino acid. Under this reduced scheme, we use the webserver of Zuo (Zheng et al., 2019) to calculate the RAAC of the thermophilic and non-thermophilic proteins.

Furthermore, dipeptides of proteins, like AA, A^∗A (λ_gap = 1), and A^∗∗A (λ_gap = 2), AK, A^∗K, A^∗∗K, etc., were also obtained using this webserver (Chen et al., 2016; Yang et al., 2019). The following formula was used to calculate the values of those features:

f_{361}^{λ} (j) = \frac{y_{361}^{λ} (j)}{\sum_{j} y_{361}^{λ} (j)} λ = 0, 1, 2, \dots, 361,

where $y_{361}^{λ} (j)$ denotes the number of λ-gap dipeptides of type j in a protein sequence.

Feature Extraction

Physicochemical Characteristics

To quantitatively identify proteins, the physicochemical characteristics were obtained using a method (temporarily called 188d), which could extract sequence information and amino acid properties (Song et al., 2014; Xu et al., 2014, 2018; Fu et al., 2019; Liu, 2019; Zhu et al., 2019). The first 20 elements in the results of this method denoted the frequency of the 20 original amino acids (Zhu et al., 2019); the next 24 features reflected the group proportion corresponding to three groups (Qu et al., 2019); the following 120 dimensions were the distributions of three groups in five local positions (Cai et al., 2003); the last 24 features were the numbers of three types of dipeptides.

ACC

Auto covariance (AC) and cross-covariance (CC) calledACC, can reflect the relationship between amino acids with certain length features and contains AC and CC (Dong et al., 2009; Liu et al., 2015). The formula of CC transforms a protein sequence to a vector form Liu et al. (2016):

P^{'} = [φ_{1}, φ_{2}, φ_{3}, \dots, φ N * (N - 1) * l g)]^{T},

where N denotes the number of properties. φ_i can be calculated as:(Guo et al., 2008)

φ_{n} = A C (i, l g) = \frac{1}{N - l g} \sum_{j = 1}^{L - l g} (S_{i, j} - \bar{S_{i}}) (S_{i, j + l g} - \bar{S_{i}}),

where i is a residue, Ldenotes the length of the whole protein sequence, S_i,j represents the i-th property of the j-th amino acid, and S_i reflects the mean value of the i-th property (Qu et al., 2019). In our experiment, the value of lg was set to 2.

Correlation Analysis

Some pairs in our feature set were found to be highly correlative, indicating that the effects of these two features were similar. Furthermore, this phenomenon denotes redundant and repeated information were present in the feature set. However, without the preprocess of discarding redundant information, machine learning models are associated with a risk of overfitting (Hua et al., 2009; Mwangi et al., 2014; Zeng et al., 2019b).

Thus, a correlation analysis-based redundant information expurgate method was proposed to discard one feature from each of the highly relevant feature pairs. As a prepare step, all feature values need to be normalized to [0,1] using the following equation:

x_{i}^{n} = \frac{x_{i} - \bar{x}}{x_{m a x} - x_{m i n}},

where x_i(i = 1,2,3,⋯) denotes the i-th value in the feature set, $\bar{x}$ represents the mean value of the current feature vector, andx_max, x_min correspondingly reflect the maximum and minimum values of the feature vector.

Then, Pearson’s correlation was used to evaluate the correlations between any two features. Its value was written as follows:(Thibeault and Srinivasa, 2013; Jin et al., 2019)

ρ (X, Y) = \frac{1}{n - 1} \sum_{i = 1}^{n} (\frac{X_{i} - \bar{X}}{σ (X)}) (\frac{Y_{i} - \bar{Y}}{σ (Y)}),

where X and Y are two given feature vectors, $\bar{X}$ and $\bar{Y}$ represent the mean value of X and Y, respectively, and σ(X) and σ(Y) denote the standard deviations of X and Y, respectively.

In our experiment, for any feature pair X and Y, if the value of ρ(X,Y) was larger than the threshold T, then X and Y were considered a highly correlated feature pair. In the next step, we decided whether to remove one of the features from the feature set while retaining the other in the feature set. Thus, for the first feature pair, a removed feature set D and a reserved feature set R were created and set as an empty set. Then, one feature was set to belong to D, while the other was set to belong to R randomly. In the following computation, the rule for assigning features could be expressed as follows: assuming that X-Y is a highly correlative feature pair,

• If X∉D and X∉R: Y∉D and Y∉R→Y ∈ D, X ∈ R

• If X∉D and X∉R: Y ∈ D→X ∈ R

• If X∉D and X∉R: Y ∈ R→X ∈ D

• Elseif X ∈ R : Y∉D and Y∉R→Y ∈ D

• Elseif X ∈ D : Y∉D and Y∉R→Y ∈ D

Let $D = {f_{1}^{'}, f_{2}^{'}, f_{3}^{'}, \dots, f_{M}^{'}}$ denote the final removed feature set. After all M features in D were removed from the feature set, the correlation between feature pairs was decreased dramatically. The threshold T used in our experiment was set as 0.85.

MRMD Feature Selection

Dimensionality reduction is a key process in machine learning research and application (Bhola and Singh, 2018). The MRMD method, as presented by Zou et al. (2016), was used to rank features in descending order and reduce the feature number. There were two object functions; the first reflected the relationship between the current feature and the target class, which could be written as follows (Zou et al., 2016):

P P C (\vec{F_{i}}, \vec{C_{i}})

= \frac{\frac{1}{N - 1} \sum_{k = 1}^{N} (f_{i, k} - \bar{f_{i}}) (C_{i, k} - \bar{C_{i}})}{\sqrt{\frac{1}{N - 1} \sum_{k = 1}^{N} {(C_{i, k} - \bar{C_{i}})}^{2}} \sqrt{\frac{1}{N - 1} \sum_{k = 1}^{N} {(f_{i, k} - \bar{f_{i}})}^{2}}},

m a x M R_{i} = | P P C (\vec{F_{i}}, \vec{C_{i}}) |,

where f_i,k and C_i,k represent the k-th element in the feature vector F_i and C_i, respectively. The other object function was expressed in the following form Zou et al. (2016):

E D (\vec{X}, \vec{Y}) = \sqrt{\sum_{k = 1}^{N} {(x_{k} - y_{k})}^{2}},

\max M D_{i} = E D_{i} = \frac{1}{M - 1} \sum E D (\vec{F_{i}}, \vec{F_{k}}) .

Integrating the above two functions, we obtained the final objective function, which was written as follows:

m a x (M R_{i} + M D_{i})

Solving this function, when the function reached the maximum ACC value, the iteration was stopped automatically, giving a feature dimension reduced set.

PCA

Principal component analysis (Price et al., 2006) is a widely used tool that can transform the features of observation into an uncorrelated feature set (Zeng et al., 2017, 2019a; Xiao et al., 2018; Zhang et al., 2019b). The main steps of PCA are as follows: (1) normalize the feature vector value; (2) calculate the covariance matrix by $\sum = \frac{1}{m} X \cdot X^{T}$ ; (3) use the singular value decomposition method (U,S,V^T); = SVD(Σ); (4) extract the first k singular vectors from U and (5) calculate the i-th eigenvalue λ_i,i = 1,2,3,⋯

We used ρ to evaluate the cumulative contribution value of the singular vectors; this value was defined as $ρ = \frac{\sum_{i = 1}^{p} λ_{i}}{\sum_{i = 1}^{m} λ_{i}} \geq T^{'}$ , where m denotes the dimension of the transformed features. The above function denotes there is enough information to serve as the optimal feature set for the identification 0task when the cumulative contribution value of singular vectors from the first one to the λ-th one reaches a value, namely, the threshold T′. Thus, through the threshold T′, only a part of features were selected and then formed an optimal feature set, which made the model simple and fast to run.

Machine Learning Methods

In order to distinguish between thermophilic and non-thermophilic proteins, SVM (Ding et al., 2016a, b; He et al., 2018; Qiao et al., 2018; Wei et al., 2018; Fu et al., 2019; Wang et al., 2019b), random forest [RF, (Ding et al., 2017; Wang et al., 2019a)], decision tree (Mohasseb et al., 2018; Li et al., 2019), and naïve Bayes [NB, (Rajaraman and Chokkalingam, 2014)] methods were used in our experiment. The first two methods were implemented and optimized in the python 3.7 environment with our edited code. All four methods were also tested in the Weka environment, yielding similar results.

Evaluation of Performance

In order to evaluate the model performance, we used a 10-fold cross-validation scheme in our experiment and adopted three commonly used accuracy indicators for quantification (Jiang et al., 2013, 2018; Zeng et al., 2016; Wei et al., 2017a, b; Lu et al., 2018, 2019; Xiong et al., 2018; Chen et al., 2019; Ding et al., 2019; Lin et al., 2019; Shan et al., 2019; Shen et al., 2019; Xu et al., 2019; Yu and Gao, 2019; Yu et al., 2019a). The first indicator was sensitivity (Sn), which represents the ratio of the correctly identified thermophilic proteins and could be calculated as follows:

S n = \frac{T P}{T P + F N} \times 100 %,

where TP, TN, FP, and FN represent the number of the correctly identified thermophilic proteins, the number of the correctly indemnified non-thermophilic proteins, the number of non-thermophilic proteins predicted as thermophilic proteins, and the number of the thermophilic proteins predicted as non-thermophilic proteins, respectively (Lin and Chen, 2011).

The second indicator was specificity (Sp), which denotes the percentage of the correctly identified non-thermophilic proteins among all non-thermophilic observations. Sp was defined as follows:

S p = \frac{T N}{T N + F P} \times 100 % .

The last indicator was accuracy (ACC), which reflected the percentage of correctly recognized thermophilic and non-thermophilic proteins among all observations, written as follows:

A C C = \frac{T N + T P}{T N + F P + F N + F P} \times 100 % .

Results

Our experiments were performed on the basis of qualitative evaluation, quantitative analysis, and comparison with other counterparts, as shown in Figure 2. The data were calculated using 500 randomly selected thermophilic proteins and 500 randomly selected non-thermophilic proteins, and experiments were evaluated in 10-fold cross-validation format.

FIGURE 2

Figure 2. The figure of model performance. (A) The first two dimensions of the result of compression characteristics of the TSNE method; (B) the figure of the ultra-classification surface of SVM method; (C) The accuracy values of four different models; (D) the comparison results with other methods.

First, we evaluated the proposed method using qualitative analysis. In this analysis, all feature data were reduced to 12 dimensions through the PCA method. Furthermore, the t-SNE method (van der Maaten and Hinton, 2012; van der Maaten, 2014) is one of the powerful visualization tools for showing the structure of high-dimension data. Thus, we used the t-SNE method (van der Maaten and Hinton, 2012; van der Maaten, 2014) to differentiate thermophilic and non-thermophilic proteins in the figure. Additionally, the t-SNE method used here was not a part of the proposed model, but was a display tool of the experiment data. The first two features of the results using the t-SNE method are plotted in Figure 2A; from these data, a distinct boundary was observed for separating thermophilic and non-thermophilic observations. Moreover, it was easy to distinguish thermophilic proteins from non-thermophilic proteins.

In order to verify these findings, SVM was used to train and test the 12-dimensional data, and the results are shown in Figure 2B. Both types of proteins were separated successfully using this method. This phenomenon directly demonstrated that our proposed data had good separation quality and the SVM method had strong recognition ability for thermophilic proteins and non-thermophilic data.

Second, the processed data were tested using the other three machine learning methods, as detailed in Figure 2C. For every method, we also calculated three accuracy indicators: Sn, Sp, and ACC. The results showed that the SVM yielded the highest values for all three indicators, and all values reached at least 98.2%. NB also showed higher accuracy, with values of 96.25%, 97.56%, and 96.89%, respectively. The accuracy of the random forest model was higher than that of J48, for which the average value was only 91.48%.

Our method was also compared with the results of Lin (Lin and Chen, 2011) and the method of using the same dataset (Fan et al., 2016). The results are shown in Figure 2D. Notably, our method got the highest accuracy values based on the results of the MRMD methods, which denotes our proposed method outperformed the method described by Lin (Lin and Chen, 2011). Additionally, the performance of the proposed method was better than the effects described by Fan et al. (2016) too, suggesting that the proposed method could be a state-of-the-art model in current research.

Features using the original dipeptides were also tested in our study. All reduced features in our feature set were replaced with the original dipeptides. From the accuracy data shown in Figure 2D, the ability to distinguish thermophilic proteins from non-thermophilic ones was lower than that using the reduced amino acid dipeptides. Additionally, the receiver operating characteristic (ROC) curve was also plotted, which could be seen in Figure 3A. It is easy to found that the results of the ROC curve verified the identification efficiency of the proposed method too.

FIGURE 3

Figure 3. The comparison results of experiments. (A) The receiver operation characteristic (ROC) curve of three methods; (B) the results of experiments over the database (Fan et al., 2016).

Finally, the newly released thermophilic protein database (Mohasseb et al., 2018) is also tested through the proposed method in this manuscript. In the experiment, we selected 106 thermophilic proteins and 101 psychrophilic proteins from the database. All those data can be downloaded on the website: http://www.labio.info/index-1therm.html. In the experiment, we did three experiments using three different thresholds in the correlation analysis step. The experiments are given in Figure 3B, from which it was easy to find that the identification accuracy was bigger than 0.97 in most cases when using the threshold of 0.95 and 0.90. It also showed that the classification efficiency was not ideal when using the threshold 0.85. The reason for this phenomenon may be the calculated features of the current data have a stronger correlation between each other than the previous thermophilic protein database. Thus, in this condition, a big value than 0.85 is needed to identify the thermophilic proteins accurately. It is worth noting that the results in this figure verified the perfect identification ability of the proposed method.

Discussion

Many features are removed from the original feature set during correlation analysis and MRMD feature selection. Moreover, these removed features are typically not crucial or redundant for performing thermophilic protein recognition. However, the selection of features to remove and retain is essential, and further studies are needed to evaluate such approaches. Thus, in this study, we evaluated the removed features, as depicted in Figure 4.

FIGURE 4

Figure 4. The critical and removed features in the proposed method: (A) the most important features; (B) the deleted amino acid frequency features; (C) the deleted reduced-depiptides (I); (D) the deleted reduced-depiptides (II). The symbol “*” means any one of the 20 amino acids, it may be “A”, “C”, “P”, or others. Besides, “**” has the same meaning; it represents a two-letter combination of 20 amino acids, “AA”, “DC”, “VP”, for example.

The 10 most critical original features are shown in Figure 4A, and under our proposed model framework, the feature values of K^∗H, KR, TF, P^∗M, F^∗∗N, I^∗∗Y/V^∗∗Y, MW, and WQ (where ^∗ represents a gap in the residues) showed significant contributions to the recognition of thermophilic proteins. Additionally, residue K also plays a vital role in enhancing thermostability. Interestingly, our conclusions regarding residue K were consistent with the results of Lin (Lin and Chen, 2011).

For the removed features, the results are shown in Figures 4B–D. There were four types of components in the final feature set: ACC features, physicochemical characteristics, amino acid frequencies (the first 20 features in the 188D feature), and reduced amino acid dipeptides. Approximately half of the physicochemical characteristics were deleted from the original feature set, and there were only a few reserved physicochemical characteristics in the first 50 crucial features. Thus, we concluded that the physicochemical characteristics were essential features, but not the most essential features, for this recognition task. Accordingly, we did not analyze the details of the removed physicochemical characteristics. We also showed that only three ACC features were excluded from the final feature set, and the remaining 15 ACC features were retained, reflecting the crucial roles of the ACC features in this recognition task.

The amino acid frequency, which was one of the first 20 features in the 188D feature set, included only four residues removed from the feature set. These four residues were V (Ile and Val), A, E, and K, which had little contribution to recognizing thermophilic protein and non-thermophilic proteins. Interestingly, the reduced amino acid V, which included both Ile and Val, was also deleted. It is worth noting that the amino acid V appeared later in this manuscript denotes the reduced V, namely, both Ile and Val. This finding indicated that both Ile and Val were redundant and did not contribute to the identification task. If we used the original amino acid dipeptide features, additional useless features, including IA, I^∗A, and I^∗∗A, etc., would also be observed in the feature set. The number of additional redundant features in the original dipeptides could be as high as 39 if compared with the reduced amino acid dipeptides. As shown in Figure 2D, the smallest prediction accuracy was obtained, and represented those many additional useless features caused the classification model fail in the overfitting state when using the original dipeptides. Additionally, this observation could explain why the accuracy increased significantly when using the reduced amino acid dipeptides.

There were three types of dipeptides, expressed as AA (λ = 0), A^∗A (λ = 1), and A^∗∗A (λ = 2). The numbers of these types of removed dipeptides were 60, 61, and 71, respectively. To conveniently visualize these data, we counted the numbers of the same dipeptide (omitting the symbols ^∗ and ^∗∗). If a dipeptide appeared more than twice, it was drawn in the figure. Thus, if the dipeptide NV was shown in the figure, there were at least two types of dipeptides, i.e., NV, N^∗V, or N^∗∗V, in the removed feature set.

All discovered dipeptides were classified into two parts, as shown in Figures 4C,D. The reduced dipeptides in Figure 4C were dipeptides having relationships with the reduced residue V, verifying the reduced power of the recognition task in the above analysis. Moreover, residue V enabled the discovery of seven related dipeptides in the removed features. This phenomenon demonstrated that residue V and some dipeptides containing V were insensitive to the recognition task under our proposed model framework. Figure 4D also shows another seven removed dipeptides, including VV, AV, AE, AL, LE, LL, and LR.

These results provide insights into the design of stable mutants to increase protein thermostability.

Conclusion

In this study, we aimed to develop an approach to distinguish thermophilic proteins from non-thermophilic proteins; to this end, a recognition method that combined mixed features of proteins and a machine learning method was established. First, an amino acid reduction method was introduced to reduce the categories of amino acids. Nest, we calculated the physicochemical characteristics, ACC, and reduced dipeptides of thermophilic and non-thermophilic proteins. After performing a dimension reduction step using correlation analysis, the MRMD method, and PCA, an optimal feature set was obtained. Finally, machine learning methods were used to train and predict feature data, and the results revealed that the proposed model could identify 98.2% of thermophilic proteins and non-thermophilic proteins if the data were operated in a 10-fold cross-validation mode. Furthermore, the feature values of K^∗H, KR, TF, P^∗M, F^∗∗N, V^∗∗Y, MW, and WQ were found to play vital roles in thermostability, and some residues and dipeptides, including V (Ile and Val), A, E, K, NV, VG, VA, AE, AL, and LE, were not important for identifying thermostability. As discussed in previous studies (Liu and Li, 2019; Liu and Zhu, 2019), the web-server is very important. In our future work, our research will focus on developing a free webserver that could provide a platform to test the currently proposed method using an easily accessible approach.

Data Availability Statement

The datasets generated for this study can be found in the http://www.labio.info/index-1therm.html.

Author Contributions

CF and YL: conceptualization. CF: methodology, software, and writing – original draft preparation. YL, JZ, and XL: validation. DY: formal analysis. ZM and XL: investigation. ZM: resources. XL: data curation. DY, YL, and JZ: writing – review and editing, supervision. CF and XL: visualization. DY and ZM: project administration. YL and JZ: funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant nos. 91935302 and 61971119.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Footnotes

^ http://lin-group.cn/server/Lypred/

References

Bhola, A., and Singh, S. (2018). Gene selection using high dimensional gene expression data: an appraisal. Curr. Bioinf. 13, 225–233.