DeepCSO: A Deep-Learning Network Approach to Predicting Cysteine S-Sulphenylation Sites

Cysteine S-sulphenylation (CSO), as a novel post-translational modification (PTM), has emerged as a potential mechanism to regulate protein functions and affect signal networks. Because of its functional significance, several prediction approaches have been developed. Nevertheless, they are based on a limited dataset from Homo sapiens and there is a lack of prediction tools for the CSO sites of other species. Recently, this modification has been investigated at the proteomics scale for a few species and the number of identified CSO sites has significantly increased. Thus, it is essential to explore the characteristics of this modification across different species and construct prediction models with better performances based on the enlarged dataset. In this study, we constructed several classifiers and found that the long short-term memory model with the word-embedding encoding approach, dubbed LSTMWE, performs favorably to the traditional machine-learning models and other deep-learning models across different species, in terms of cross-validation and independent test. The area under the receiver operating characteristic (ROC) curve for LSTMWE ranged from 0.82 to 0.85 for different organisms, which was superior to the reported CSO predictors. Moreover, we developed the general model based on the integrated data from different species and it showed great universality and effectiveness. We provided the on-line prediction service called DeepCSO that included both species-specific and general models, which is accessible through http://www.bioinfogo.org/DeepCSO.


INTRODUCTION
Protein Cysteine S-sulphenylation (CSO) is the reversible oxidation of protein cysteinyl thiols to suphenic acids. S-sulphenylation functions as an intermediate on the path toward other redox modifications, such as disulfide formation, S-glutathionylation, and overoxidation to sulfinic and sulfonic acids (Paulsen and Carroll, 2013;Huang J.J et al., 2018). This modification has been reported to influence protein functions, regulate signal transduction and affect cell cycle (Van Breusegem and Dat, 2006;Men and Wang, 2007;Paulsen and Carroll, 2013;Hourihan et al., 2016;Choudhury et al., 2017;Mhamdi and Van Breusegem, 2018). So far, thousands of CSO sites have been identified from different species including the mammal Homo sapiens and the plant organism Arabidopsis thaliana using the chemoproteomics approach (Yang et al., 2014;Li et al., 2016;Gupta et al., 2017;Akter et al., 2018;Huang et al., 2019;summarized in Supplementary Table 1). Nevertheless, the CSO site detection remains a major methodological issue due to low abundance and dynamic level of CSO-containing proteins in vivo. In contrast to the time-consuming and expensive experimental approaches, computational methods for predicting CSO sites have attracted considerable attention because of their convenience and efficiency.
Several computational methods have been developed for the prediction of CSO sites, mainly based on a single human dataset containing 1105 identified CSO sites (Yang et al., 2014). They include MDD-SOH (Bui et al., 2016a), iSulf-Cys (Xu et al., 2016), SOHSite (Bui et al., 2016b), PRESS (Sakka et al., 2016), Sulf_FSVM (Ju and Wang, 2018), S-SulfPred (Jia and Zuo, 2017), Fu-SulfPred , SulCysSite (Hasan et al., 2017), SOHPRED (Wang et al., 2016), and PredCSO (Deng et al., 2018). Out of them, two are based on protein three-dimensional structures, in which PRESS relies on four different protein structural properties (Sakka et al., 2016) whereas PredCSO is an ensemble model that combines bootstrap resampling, gradient tree boosting and majority voting with the 21 features refined out using a two-step feature selection procedure (Deng et al., 2018). The advantage of both classifiers is the inclusion of accurate structural features but their drawback is the limitation of the available structures. The rest classifiers are based on protein sequences. They can be classified into two clusters in terms of model complexity. The first cluster contains four relatively simple models. ISulf-Cys is an SVM (Support Vector Machine)-based classifier with the integration of three features including binary, PSAAP, and AAindex (Xu et al., 2016). SOHSite is an SVM-based classifier with the combined features of position-specific scoring matrix (PSSM) and AAindex (Bui et al., 2016b). SulCysSite is an RF (Random Forest)-based classifier with the integration of multiple features (Hasan et al., 2017) and Sulf_FSVM is an fuzzy SVM classifier using mRMR feature selection from three kinds of features (Ju and Wang, 2018). The second cluster includes four relatively complex models. MDD-SOH contains two-layered SVMs trained with MDDLogo-identified substrate motifs (Bui et al., 2016a). S-SulfPred is an SVMbased classifier with the balanced training dataset established using one-sided selection undersampling for negative samples and synthetic minority oversampling for positive samples (Jia and Zuo, 2017). Fu-SulfPred contains two layers of forest-based structure with the reconstruction of training datasets for data balance . SOHPRED was built by integrating four complementary predictors (i.e., a naive Bayesian predictor, an RF predictor, and two SVM predictors), each of which was associated with different training features (Wang et al., 2016). In summary, the characteristics of these sequence-based models are the combination of distinct types of features, or/and the balancing of training data, or/and the integration of different classifiers. Although the developed classifiers have made contribution to the prediction of CSO sites, most of them are currently inaccessible. Moreover, there is a lack of prediction tools for the CSO sites of multiple species. With the growing number of CSO sites verified, it is essential to develop species-specific prediction models with high accuracy or even a general model.
Compared to traditional machine-learning (ML) algorithms (e.g., SVM and RF) used in the prediction approaches described above, the deep-learning (DL) architecture is a promising ML algorithm. In the DL algorithm, a suitable representation of the input data can be transformed into highly abstract features through propagating the whole model. Superposition of hidden layers in neural networks can increase the ability of feature extraction, resulting in a more accurate interpretation of latent data patterns. Indeed, several frequently utilized DL models have been recently applied in the field of Bioinformatics, especially the prediction of post-translational modification (PTM) sites. For instance, deep neural networks were utilized for the prediction of protein nitration and nitrosylation sites (Xie et al., 2018), recurrent neural networks (RNNs) were employed for the prediction of lysine Malonylation sites (Chen et al., 2018b) and convolutional neural networks (CNNs) were used for the prediction of phosphorylation sites and crotonylation sites (Wang et al., 2017;Zhao et al., 2020). Deep learning algorithms have demonstrated their advantages in the application of large data sets, compared to the traditional ML methodology (Chen et al., 2018b). Because of this, the introduction of DL algorithms into the prediction of CSO sites would be a promising move to provide reliable candidates for further experimental consideration.
In this study, we constructed a number in silico approaches for the prediction of the CSO sites for H. Sapiens and A. thaliana. These approaches included the RF and SVM algorithms, onedimensional CNN (1D-CNN), two-dimensional CNN (2D-CNN) and long short-term memory (LSTM) that is an RNN type. The LSTM model with the word-embedding encoding approach, called LSTM WE , compared favorably to the rest approaches with AUC as 0.82 and 0.85 in human and Arabidopsis in terms of cross-validation. Moreover, LSTM WE trained using the data from one organism achieved outstanding performance in predicting CSO sites of other organisms (e.g., AUC = 0.80 for the prediction of Arabidopsis CSO sites using the human model), suggesting that CSO is highly conserved. Therefore, we constructed a general CSO prediction model. These models will facilitate the discovery of new CSO sites and thus will contribute to the understanding of roles and functions of CSO in diverse cellular processes.

Data Collection and Preprocessing
The experimentally identified CSO sites were derived from two different organisms including H. Sapiens and A. thaliana (Yang et al., 2014;Li et al., 2016;Gupta et al., 2017;Akter et al., 2018;Huang et al., 2019). The data of the species were pre-processed and the related procedure was exemplified using the A. thaliana data, as listed below ( Figure 1A). We mapped 1537 Arabidopsis CSO sites (Huang et al., 2019) to the UniprotKB database (UniProt Comstortium, 2011) and 1535 sites from 1130 proteins were retained as positive sites. The rest 8819 Cysteine residues in the same proteins were defined as negative sites. Moreover, we truncated these protein sequences into 35-residue segments with the Cysteine located at the center and the positive/negative sites correspond to positive/negative segments, respectively. It should be noted that if the central Cysteine was located around the N-terminus or C-terminus of a protein sequence, the gap symbol "-" was added to the corresponding positions to ensure that the segment had the same length. The segment length was optimized as a hyper-parameter in the Bayesian optimization method (see details in Section of "Optimization Methods for Hyper-Parameters") and finally determined as 33. Furthermore, to reduce the potential influence of the segments with high similarity on the performance of the models to be constructed, we set the identity of any two sequences with less than 40%, referring to previous studies (Bui et al., 2016a;Wang et al., 2016;Xu et al., 2016). When the identity was >40% between two positive segments or two negative segments, one was randomly removed. When the identity was >40% between a positive segment and a negative segment, the positive was retained and the negative was discarded. As a result, 1380 positives and 7421 negatives were retained. Finally, we randomly separated the positive and negative segments into 11 groups of which 10 were used for 10-fold cross-validation (1254 positives and 6746 negatives) and the rest for an independent test (126 positives and 675 negatives) ( Figure 1A). Similarly, the cross-validation dataset for H. sapiens contained 16,249 samples (2507 positives and 13,742 negatives) and the independent test set comprised 1625 samples (251 positives and 1374 negatives) ( Figure 1B). These datasets are available at http://www.bioinfogo.org/DeepCSO/download.php.

Feature Encoding Schemes
Numerical Representation for Amino Acids (NUM) The NUM encoding approach maps each type of amino acid residue to an integer . Specifically, in the alphabet "AVLIFWMPGSTCYNQHKRDE-", each letter from "A" to "-" is converted to the integers from 0 to 20 in turn. For example, the sequence "VAMR" is encoded as "1,0,6,17." This encoding was used as the input of the first layer for both LSTM and 1D-CNN.

Enhanced Amino Acid Composition
The enhanced amino acid composition (EAAC) encoding (Chen et al., 2018b(Chen et al., ,c, 2020Huang Y. et al., 2018) introduces a fixedlength sliding window based on the encoding of amino acid composition (AAC), which calculates the frequency of each type of amino acid in a protein or peptide sequence (Bhasin and Raghava, 2004). EAAC is calculated by continuously sliding a fixed-length sequence window (using the default value 5) from the N-terminus to the C-terminus of each peptide. The related formula is listed below: win ∈ window1, window2, . . . , window35 (1) where N(t, win) is the number of amino acid t in the sliding window win, and N(win) is the size of the sliding window win.

Binary Encoding
In the binary encoding (Chen et al., 2018c), each amino acid is represented by a 21-dimensional binary vector that represents 20 amino acids and a complement "-." The corresponding position is set as 1 and the rest position is set as 0. For example, the amino acid "A" is represented by "100000000000000000000, " "V" is represented by "010000000000000000000, " and the symbol "-" is represented by "000000000000000000001, " according to the alphabet "AVLIFWMPGSTCYNQHKRDE-."

AAindex Encoding
AAindex is a database of various indices representing distinct physicochemical and biochemical properties of amino acids and pairs of amino acids. 1 In the 544 physicochemical properties, we retained 531 properties after the removal of properties with "NA." We calculated the performance for each property using the RF classifier 1 http://www.genome.jp/aaindex/ based on the 10-fold cross-validation dataset of arabidopsis. We selected the top 36 properties with AUC > 0.7 (Supplementary Table 3).

The Composition of k-Spaced Amino Acid Pairs
The composition of k-spaced amino acid pairs (CKSAAP) encoding contains the frequency of the amino acid pair of which both are separated by k-residues (k = 0, 1, 2, 3, 4, 5. We used the default value 5) (Chen et al., 2018c). This scheme represents the short-or long-range interactions amongst the residues along the sequence. The CKSAAP encoding with k = 0 is identical to the di-peptide composition.

The Position-Specific Scoring Matrix
The PSSM encoding was derived from the previous publication (Xie et al., 2018). In brief, we calculated the statistical significance of the differences in the frequencies of symbol occurrence between the positive and negative samples using a two-sample t-test (Vacic et al., 2006). Accordingly, the PSSM of significant P-values were constructed. By integrating the PSSM of P-values with the frequency PSSM for positive and negative samples, we generated the final encoding PSSM that represented the conservation tendency of the positive or negative samples.

Architecture of the Machine-Learning Models
The LSTM Model With the Word Embedding Encoding (LSTM WE ) LSTM WE contained five layers, listed as follows (Figure 2).
1. Input layer. Each peptide segment is converted into an integer vector with the NUM encoding. Frontiers in Cell and Developmental Biology | www.frontiersin.org 2. Word Embedding (WE) layer. Each integer of the vector from the input layer is encoded into a four-dimension word vector for humans and a five-dimensional word vector for arabidopsis, respectively. 3. LSTM layer. Each of the word vectors is input sequentially into the LSTM cell that contained 32 hidden neuron units. 4. Dense layer. It contains a single dense sublayer that has 16 neurons with the ReLU activation function for humans and 32 neurons for arabidopsis, separately. 5. Output layer. This layer has only one neuron activated by sigmoid function, outputting the probability of the CSO modification.

The 1D-CNN Model With the Word Embedding Encoding
The 1D-CNN model with the word embedding encoding (1D-CNN WE ) contains five layers ( Supplementary  Figure 1), where the first two layers and last one layer were as same as LSTM WE . The third layer was a 1D convolution layer with 22/20 filters for humans/arabidopsis and kernel size as nine. The fourth layer had a single dense sublayer with 16 neurons. The optimal hyperparameter values were obtained using the Bayesian optimization algorithm.

The 2D-CNN Model With the PSSM Feature
We took advantage of the 2D structure of an input image of CNN architecture and conveniently made similar 2D inputs of PSSM matrixes with the sizes of 20 × 20 s. The purpose of using the 2D-CNN model is to catch the hidden figures inside PSSM profiles. Next, PSSM profiles were connected to the 2D CNN design from the input layer through several hidden layers to the output layer. Supplementary  Figure 2 demonstrated the procedure of inputting a PSSM profile into the CNN model, then passing through a series of convolutional, non-linearity, pooling and fully connected layers and finally outputting the result. This model contained four hidden layers including one 2D convolutional layer, one pooling layer, one flattening layer, and one fully connected layer. Specifically, the first layer contained a PSSM profile on which we applied 2D convolutional operations with some existing parameters including 5 × 5 kernel size, 15 filters and 1 × 1 stride.

The RF Algorithms With Different Features
The RF algorithm integrates multiple decision trees and chooses the classification with the most votes from the trees. Each tree depends on the values of a random vector sampled independently with the same distribution for all trees in the forest. In this study, we constructed the RF models with six different features, including binary encoding, EAAC encoding, AAindex encoding, CKSAAP encoding, PSSM encoding, and WE. The number of decision trees was selected as 580 via the grid search method. These classifiers were developed based on the Python module "sklearn."

The SVM Algorithms With Different Features
We applied the Python-based machine learning package "scikitlearn" to implement the SVM algorithm and adopted the "RBF" kernel function to build the SVM models. The above encoding schemes for RF were applied to the SVM model. In particular, we normalized the feature values that do not range between 0 and 1 (such as PSSM) before inputting the SVM model.

Optimization Methods for Hyper-Parameters
The hyper-parameters of an ML classifier affect prediction performance. Although a lot of combinations of hyperparameters need to be tested, there are no formal rules to find optimal hyper-parameters. Here we applied two search approaches [grid search and Bayesian optimization (BO)] to automatical adjustment and evaluation of hyper-parameters (Figure 3). Grid search is a brute-force method to find the optimal hyper-parameters by training models using each possible combination of hyper-parameters and retaining the hyper-parameters corresponding to the model with the best performance. This method applies to a limited number of hyper-parameters due to the exponential increase in time spent with the number of hyper-parameters. In this study, it was used for the RF-based and SVM-based models. The related grid search spaces (Supplementary Table 3) were searched using the GridSearchCV function of the sklearn library in Python. On the contrary, BO provides a principled technique based on Bayes theorem to direct a search of a global optimization problem, which is effective to tune the hyper-parameters of DL models. The BO strategy FIGURE 3 | Hyper-parameter optimization procedure for machine-learning classifiers.
Frontiers in Cell and Developmental Biology | www.frontiersin.org was executed using the fmin function of the hyperopt library in Python. The BO related hyper-parameter space contained 10 parameters, including window size, kernel size, and dropout rate (Supplementary Table 3). The optimal hyperparameter combination results for the DL models were listed in Supplementary Table 4.

Strategy of Avoiding Overfitting
1The parameters in the DL models were trained and optimized based on binary cross-entropy loss function using the Adam algorithm. The maximum of the training cycles was set through the optimized number of epochs to ensure that the loss function value converged. In each epoch, the training dataset was separated with the batch size as 512 and iterated. To avoid overfitting, the early-stopping strategy was applied, where the training process was stopped early when the training loss did not go down within 50 consecutive iterations. The model with the smallest training loss was saved as the best model. Moreover, the dropout rate of the neuron units was set, which was obtained through the hyper-parameter optimization. Supplementary Figures 3, 4 showed the training and validation accuracy and loss curves of the LSTM WE models for different species.

Performance Assessment of the Predictors
Several measures were used to evaluate the prediction performance, including accuracy (ACC), specificity (SP), sensitivity (SN), Matthew's correlation coefficient (MCC). They are defined as follows: where TP, TN, FP, and FN represent true positives, true negatives, false positives, and false negatives, respectively. Additionally, because the number of positive and negative samples was unbalanced and the above measures were calculated based on the threshold value, a measure that was independent of the threshold value and unaffected by the sample ratio was needed. Therefore, the receiver operating characteristic (ROC) curve and AUC were employed to comprehensively evaluate classification performance. Specifically, due to the low falsepositive rate of a predictor is significant in practical application, the area under the ROC curve with <10% false-positive rate (AUC01) was considered.

Statistical Methods
The paired student's t-test was used to test the significant difference between the mean values of the Frontiers in Cell and Developmental Biology | www.frontiersin.org two paired populations. The adjusted P-value with the Benjamini-Hochberg (BH) method was adopted for multiple comparisons.

The Flowchart of the Prediction Model Construction
The flowchart of the prediction model construction contained three steps (Figure 4). This first step was data collection and preprocessing, in which the sample data were separated into the cross-validation dataset and the independent test dataset for model construction and evaluation. The second step was classifier construction, which involved data decoding, model training, and hyper-parameter adjustment for resulting in a robust predictive model. The third step was the development of the final model as an online prediction tool.  and AUC01 (B), respectively, for 10-fold cross-validation. AUC (C) and AUC01 (D) curves were generated using the independent test.

RESULTS AND DISCUSSION
We first took the Arabidopsis data to construct and compare different models (Huang et al., 2019). The Arabidopsis crossvalidation dataset contained 8000 samples (1254 positives and 6746 negatives) and the independent test set covered 801 samples (126 positives and 675 negatives) (Figure 1). We compared the performances of these algorithms in terms of several measures (e.g., ACC, MCC, AUC, and AUC01) for both the 10-fold cross-validation (Table 1) and the independent test  (Yang et al., 2014). Table 5). In the traditional ML models, RF EAAC showed superior performance than other RF-based and SVMbased models. The previous studies of CSO site prediction showed that the models with the combination of different encoding methods compared favorably to their counterparts with a single encoding approach (Bui et al., 2016b;Xu et al., 2016). Accordingly, we constructed such models and the RF model with the combination of EAAC, CKSAAP, and AAindex, dubbed RF E+C+A , had the best performance. To our surprise, RF E+C+A had inferior performance compared to RF EAAC (Table 1 and  Supplementary Table 5).

(Supplementary
All the models constructed above were based on the imbalanced dataset. To evaluate the effect of the imbalanced dataset on potential overfitting of the classifiers, we reconstructed RF EAAC based on the balanced positive and negative samples. Specifically, because the number of negative samples was around five times larger than that of the positive samples, we randomly separated the negative samples into five parts and created five subsets of training data with a 1:1 positive-to-negative ratio. Subsequently, five RF EAAC models (sub-classifiers) were trained and the average output score from the five sub-classifiers was taken as the final prediction score. Supplementary Figure 5 showed the performances of the two RF EAAC models based on the balanced and imbalanced dataset, respectively, in terms of the 10fold cross-validation and the independent test dataset. Because of FIGURE 6 | Sequence pattern surrounding the CSO sites, including the significantly enriched and depleted residues based on CSO-containing peptides and non-modification peptides for H. sapiens (A) and A. thaliana (B) (P < 0.05, t-test with Bonferroni correction). The pattern was generated using the two-sample-logo method (Vacic et al., 2006). the slightly better performance of the RF EAAC model constructed using an imbalanced training dataset, we selected the imbalanced dataset for the construction of the models.
In our previous studies, DL models showed superior performance than traditional ML models (Chen et al., 2018b;Zhao et al., 2020). It is still true for the CSO site prediction. LSTM WE had the best performance among these constructed models in terms of ACC, Sn, MCC, and AUC values for both 10-fold cross-validation and independent test. For instance, its AUC value is 0.852 for the cross-validation and its values of ACC, Sn, Sp, and MCC were 0.786, 0.717, 0.799, and 0.417, respectively (Table 1 and Figures 5A,C). As prediction performance at a low false-positive rate is highly useful in practice, we estimated these predictors using AUC01, where the specificity was determined to be >90%. LSTM WE again showed the largest AUC01 values for both 10-fold cross-validation and the independent test (Figures 5B,D). As the encoding approach has a great impact on the traditional ML models (Chen et al., 2018b;Huang Y. et al., 2018;Zhao et al., 2020) and the WE approach integrated with LSTM had the best performance in this study, we attempted to investigate whether the integration of WE and RF had a good performance. Accordingly, we extracted WE layer vector as feature encoding from LSTM WE and trained the RF model, dubbed RF WE . Interestingly, RF WE did not show good performance compared to RF EAAC , 1D-CNN WE , or LSTM WE . It suggests that the WE encoding approach may be improper for the construction of traditional ML algorithms.
We further constructed the models for the human organism. The Humans cross-validation dataset contained 16,249 samples (2507 positives and 13,742 negatives) and the independent test set covered 1625 samples (251 positives and 1374 negatives) ( Figure 1B). Similarly, LSTM WE had the best performance ( Table 1, Supplementary Table 5, and Supplementary Figure 6).
For instance, its values of AUC, ACC, Sn, Sp, MCC, and AUC01 for the 10-fold cross-validation were 0.822, 0.777, 0.651, 0.800, 0.367, and 0.024, respectively. We evaluated the robustness of LSTM WE by comparing their performances between the cross-validation and independent tests for individual organisms. As their performances were not statistically different for each organism (P = 0.18/0.085 for the arabidopsis/humans, respectively), we concluded that the constructed models were robust and neither over-fitting nor under-fitting.

LSTM WE Performed Better Than Reported Classifiers
Six approaches for the prediction of human CSO sites were based on 1105 identified human CSO sites (Yang et al., 2014), including MDD-SOH, SOHSite, SOHPRED, iSulf-Cys, SulCysSite, and Sulf_FSVM. We compare these models and our models (i.e., RF EAAC , RF E+C+A , and LSTM WE ) to evaluate their prediction performances. Accordingly, we constructed our models using the same dataset derived from the original study (Yang et al., 2014). SulCysSite, LSTM WE , and RF E+C+A had the best and similar performances ( Table 2). The observation that the model with the combined features (i.e., RF E+C+A ) had better accuracy than the counterpart with a single feature (i.e.,  RF EAAC ) is consistent with the previous studies (Bui et al., 2016b;Xu et al., 2016) but conflicted with our observation above that RF EAAC compared favorably to RF E+C+A . This contradiction derived from the different amounts of the training datasets, where the dataset here was smaller than the datasets described above, indicating that the amount of training data affected the performance of the models. Indeed, based on the small human dataset (1105 positives), RF E+C+A had a better performance than RF EAAC , whereas the performance of RF EAAC was better than that of RF E+C+A with a large amount of the training set (arabidopsis: 1380 positives; human 2758 positives) (Supplementary Figure 7). In all comparisons, LSTM WE showed the best performance (Supplementary Figure). Additionally, as iSulf-Cys (Xu et al., 2016) is the only accessible model to date, we compared it and LSTM WE using the human independent dataset of this study. The AUC value (0.839) of LSTM WE is significantly larger than that (0.666) of iSulf-Cys (Supplementary Figure 8). In summary, LSTM WE performed better than reported classifiers.

Conservation of the CSO Modification and the Development of General LSTM WE Models
Cysteine S-sulphenylation has been identified across various organisms, ranging from yeasts to worms and from plants to humans (Men and Wang, 2007;Hourihan et al., 2016). To understand its conservation, we compared the characteristics of CSO-containing peptides in human and arabidopsis species, respectively, using the two-sample-logo approach (Vacic et al., 2006). Figure 6 showed that both species shared the enriched basic amino acids R and K and the depleted polar neutral amino acid C. Nevertheless, the amino acid H was enriched for A. thaliana whereas the hydrophobic amino acid L was depleted for H. sapiens. As the characteristics of CSO-containing peptides were similar between both species, we hypothesized the generalization ability of our developed models. To test this hypothesis, we used the human LSTM WE model to predict the arabidopsis independent test dataset and employed the Arabidopsis LSTM WE model to predict the human independent test dataset. The AUC values were 0.799 and 0.766, respectively, significantly larger than the random prediction (i.e., AUC = 0.5; Table 3). Nevertheless, the cross-species prediction had relatively low performance compared to the selfspecies prediction (AUC = 0.876/0.839 for arabidopsis/human, respectively). As the CSO sites were systematically analyzed in a few species, we developed a general CSO prediction model according to its conservation to boost the investigation for other species. Accordingly, we mixed the training datasets of H. Sapiens and A. thaliana and constructed the general LSTM WE model and validated it using the independent datasets from both organisms. The performance of the general LSTM WE model was slightly lower than that of the self-species prediction, which may be caused by the interference of the CSO characteristics of other species (Table 3). Overall, the conservation of the CSO modification leads to the effective prediction of the general LSTM WE classifier.
To further understand the performance of the general LSTM WE classifier, we visualized the sample distribution, based on the human independent dataset, from the outputs of the input layer, WE layer, LSTM layer, and dense layer of the general model using the t-SNE algorithm (van der Maaten and Hinton, 2008; Figure 7). After the input layer (Figure 7A), the positive and negative samples were mixed, as the training goes on (Figures 7B,C), positive and negative samples were gradually separated. After the LSTM layer, they were separated ( Figure 7D). This comparison indicates that the LSTM layer is a powerful method to detect the distinctive features of the positives and negatives. A similar observation is made for the arabidopsis independent test dataset (Supplementary Figure 9).

Construction of the Online CSO Predictor
We developed an easy-to-use online tool for the prediction of the CSO sites, dubbed DeepCSO. DeepCSO contains three LSTM WE models: the general model and two species-specific models (i.e., H. sapiens and A. thaliana). The users could select the general model or species-specific model at the input interface and input the query protein sequences directly or upload the sequence file. After the job submission, the prediction will start and the prediction process may take several minutes. Finally, the prediction results are output in tabular form with five columns: sequence header, position, sequence, prediction score, and prediction results at the specificity levels of 80, 85, and 90%, respectively.
Several Cysteine modification types have been reported in the human organism, such as carbonylation (Wang et al., 2014;Chen et al., 2017Chen et al., , 2018aZhang S. et al., 2019), oxidation (Gupta et al., 2017;Akter et al., 2018), succination (Adam et al., 2017), and sulfenylation. Some Cysteine sites can be modified with multiple modification types, which cause PTM cross-regulation. To examine potential PTM cross-regulation at the proteome scale, we downloaded the latest human protein sequences from the Swiss-Prot database (version: 2020_05) and applied the human DeepCSO predictor to predict the potential CSO sites with the annotation of the reported Cysteine modifications (Supplementary Table 6). This resource will assist in the investigation of the Cystine co-regulation in the community.

CONCLUSION
The current prediction tools for CSO sites are based on traditional ML methodology that requires experts to pre-define informative features, and no prediction tool has been developed for other than the human organism. In this study, three LSTMbased prediction models were constructed, where two were organism-specific and one was general, and they compared favorably to the reported models. Despite lacking pre-defined features, the deep learning classifier demonstrated superior performance compared to the traditional machine learning methods. This may be due to the self-learning ability of deep learning. The outstanding performance of the general model suggests that the CSO is well conserved and the LSTM-based model has an advantage in long-term memory to capture the key features of the entire sequences.

DATA AVAILABILITY STATEMENT
The 10-fold cross-validation and independent data sets can be found in http://www.bioinfogo.org/DeepCSO/.