Predicting and Analyzing Protein Phosphorylation Sites in Plants Using Musite

Although protein phosphorylation sites can be reliably identified with high-resolution mass spectrometry, the experimental approach is time-consuming and resource-dependent. Furthermore, it is unlikely that an experimental approach could catalog an entire phosphoproteome. Computational prediction of phosphorylation sites provides an efficient and flexible way to reveal potential phosphorylation sites and provide hypotheses in experimental design. Musite is a tool that we previously developed to predict phosphorylation sites based solely on protein sequence. However, it was not comprehensively applied to plants. In this study, the phosphorylation data from Arabidopsis thaliana, B. napus, G. max, M. truncatula, O. sativa, and Z. mays were collected for cross-species testing and the overall plant-specific prediction as well. The results show that the model for A. thaliana can be extended to other organisms, and the overall plant model from Musite outperforms the current plant-specific prediction tools, Plantphos, and PhosphAt, in prediction accuracy. Furthermore, a comparative study of predicted phosphorylation sites across orthologs among different plants was conducted to reveal potential evolutionary features. A bipolar distribution of isolated, non-conserved phosphorylation sites, and highly conserved ones in terms of the amino acid type was observed. It also shows that predicted phosphorylation sites conserved within orthologs do not necessarily share more sequence similarity in the flanking regions than the background, but they often inherit protein disorder, a property that does not necessitate high sequence conservation. Our analysis also suggests that the phosphorylation frequencies among serine, threonine, and tyrosine correlate with their relative proportion in disordered regions. Musite can be used as a web server (http://musite.net) or downloaded as an open-source standalone tool (http://musite.sourceforge.net/).


INTRODUCTION
Protein phosphorylation plays important roles in numerous cellular processes in plants. Although mass spectrometry based studies have provided high-throughput phosphorylation data, it is still time-consuming and expensive to identify phosphorylation sites experimentally. Computational prediction of phosphorylation sites directly from protein sequences provides an alternative approach. A number of software tools have been developed under this provenance, such as NetPhos (Blom et al., 1999), Scan-x (Schwartz et al., 2009), and DISPHOS (Iakoucheva et al., 2004). We recently developed Musite (Gao et al., 2010), which incorporates feature selection and machine-learning processes as well as other useful tools into one open-source frame work. It is computationally efficient, offers a statistical assessment of data quality, and can handle proteome-wide prediction.
However, when Musite was initially developed, phosphorylation site data from plants were sparse and insufficient to train and test beyond model plants. The only large dataset was Arabidopsis (Arabidopsis thaliana) data from PhosPhAt with 3,159 phospho-serine sites and 504 phospho-threonine sites (Heazlewood et al., 2008). Hence, Musite did not have a general plant phosphorylation predictor except for Arabidopsis. This was also the case for other tools that support plant phosphorylation site prediction such as DISPHOS (Iakoucheva et al., 2004). Since Musite was first published (Gao et al., 2010), more experimental plant phosphorylation sites have become available. The Plant Protein Phosphorylation Database (P 3 DB; Gao et al., 2009) now contains 32,963 non-redundant sites collated from 23 experimental studies from six plant species (A. thaliana, B. napus, G. max, M. truncatula, O. sativa, and Z. mays), providing a good opportunity to train and test for a general plant phosphorylation site predictor.
As the coverage of experimental phosphorylation sites in plants increases, there have been a few studies on conservation patterns of phosphorylation sites among different plants. It was observed that some phosphorylation sites and peptides are conserved within the gene families across different species (Maathuis, 2008). A few comparative phosphoproteomics studies explored and revealed evolutionary patterns in terms of conserved, functional phosphorylated sites (Boekhorst et al., 2008;Nakagami et al., 2010;Meyer et al., 2012). To complement these studies, it is interesting to explore the evolutionary patterns in the predicted phosphoproteome as well. Although predicted phosphorylation sites are putative, the statistical patterns may be similar between the predicted and bona fide phosphorylation sites. An advantage of using predicted phosphorylation sites is that coverage is higher and less biased (in terms of protein abundance, etc.) than experimentally identified phosphorylation sites.
In this paper, a Musite prediction model based on Arabidopsis data only was trained and cross-organism testing was then performed on other plants. A phosphorylation prediction model for plants was also trained from combined plant phosphorylation data. Predicted phosphorylation sites in orthologous groups off our green plants were compared to reveal potential evolutionary trends. In addition, analysis on distribution of protein disorder in predicted phosphorylation peptides was performed and provides some hypotheses of the phosphorylation tendency and evolutionary tends.
The sequence-wide redundancy was removed in order to avoid potential bias in the machine-learning training process. CD-HIT (Li and Godzik, 2006) was used in this process and proteins with more than 50% of sequence identity were removed. For A. thaliana, the non-phosphorylated proteins dominate the whole dataset.
When training, we sampled the negative data to create a balanced dataset at the protein level. While the balance at the protein level does not mean the same numbers of phosphorylated sites and nonphosphorylated sites, the balance at the phosphorylated site level was handled by the bootstrapping procedure described below. The distribution of the datasets is shown in Table 1.

MACHINE-LEARNING FRAMEWORK
The phosphorylation prediction is formulated as a binary classification problem, which can be modeled and solved by a machinelearning framework (Gao et al., 2010). K Nearest Neighbor (K NN) scores, disorder scores, and amino acid frequencies were used as the features for the training. The serine-, threonine-, or tyrosine-centered flanking sequences were used as peptide samples to extract the features. Practically, the length of the flanking sequences is not fixed. Multiple sizes of the peptides represent different scales of the local properties.
K nearest neighbor score is the ratio between the numbers of positive and negative sites among the pre-defined number of neighbors of a given peptide. The neighborhood of a peptide is defined as a certain percentage of the training dataset ranked by the pair-wise similarity between the target peptide and the peptides in the dataset. In this study, the sequence similarity calculation was based on the BLOSUM62 matrix; the neighborhood was set to 0.25, 0.5, 1, 2, and 4% of the training dataset, respectively; and the flanking sequence size was set to 13 amino acids (±6 at each side).
Disorder score is a feature to measure the stability of the local structure. It was calculated by VSL2B (Obradovic et al., 2005), a widely used predictor of protein disorder from sequence only. The disorder score of a given peptide was calculated as the average disorder score over all its amino acids. In this study, the sequence length of the disorder calculation was set as 1, 5, and 13 amino acids.
Amino acid frequency reflects the amino acid preference in phosphorylated peptides (Iakoucheva et al., 2004). It was represented as a vector of length of 20, which contained the normalized counts for every type of amino acid. The length of the peptide for calculating amino acid frequency was set to 13.
Bootstrapping was applied to solve the unbalanced problem of the positive and negative sites. When training, the same number of negative sites as positive sites was randomly sampled and formed a balanced set for training. Then, a support vector machine (SVM; Joachims, 2002) was used to train multiple prediction models on the sampled datasets. When predicting, the final prediction score was aggregated by averaging the outputs of all the SVM classifiers.

CROSS-ORGANISMS TESTING
The SVM model was trained on A. thaliana. The testing was first performed on a new dataset of A. thaliana to evaluate if the model performance was consistent or not. The main testing cases were on other organisms, as described above. The testing results were presented as receiver operating characteristic (ROC) curves, with "1-specificity" (i.e., false positive rate) vs. "sensitivity" (i.e., true positive rate) on horizontal and vertical axes, respectively. Specificity and sensitivity are defined as follows: where TN represents true negative, FP false positive, TP true positive, and FN false negative.

ORTHOLOGOUS GROUP ANALYSIS
The orthologous groups among four organisms (A. thaliana, O. sativa, R. communis, and P. patens) were collected from OrthoMCL orthologous dataset (Li et al., 2003;Chen et al., 2007). The data for other plants were unavailable in OrthoMCL so that they were not included in this study. The protein sequences needed for the orthologous groups were downloaded from Phytozome (Goodstein et al., 2012). In each group, the four sequences of an orthologous group were aligned globally by MUSCLE (Edgar, 2004) and their phosphorylation sites were predicted by Musite. Then the peptides centered by serine, threonine, or tyrosine were extracted and classified into the S/T class and the Y class according to the center amino acid. By analyzing the aligned sequences within each orthologous group and checking their predicted phosphorylation states, we studied potential phosphorylation conservation patterns. If at least one site in the center position of an orthologous group was predicted as a phosphorylation site, we included the orthologous group to study the distribution of the phosphorylated centers in terms of their phosphorylation states, which could be 1, 2, 3, or 4 phosphorylation sites. This can be modeled by a binomial distribution and therefore the chi-square test was used to test if the expected polynomial distribution is the same as the observed distribution, i.e., with three degrees of freedom.

RESULTS
For the trained model from A. thaliana, the ROC curves for the prediction results were plotted in Figures 1-3. Figure 1 indicates that the performance between training and testing is similar, especially for the S/T class, i.e., over-training is not a concern. Z. mays does not have an ROC curve for Y sites since no experimental data for tyrosine phosphorylation are currently available. The same model was applied to the other five organisms (B. napus, G. max, M. truncatula, O. sativa, and Z. mays) for S/T and Y phosphorylation site prediction. Figure 2 shows ROC curves of the results for these five plant organisms. For the S/T site prediction, the model performs roughly equally well on A. thaliana, G. max, M. truncatula, and O. sativa. Z. mays has the best performance and B. napus has the worst performance. For the Y site prediction, the model performs similarly between O. sativa and A. thaliana. Performance of M. truncatula is much better than those of G. max and B. napus.
An overall model of all the above plants species combined was also trained. For the sake of comparison, a randomly selected 3/4 training and 1/4 testing strategy was used, and the performance is shown in Figure 3. Figure 3A shows the comparison with Plantphos (Lee et al., 2011) on the 1/4 testing data. The Musite model outperformed Plantphos on both S/T sites and Y sites. To compare with PhosphAt (Heazlewood et al., 2008), an Arabidopsis-specific predictor, we used the Arabidopsis data from our 1/4 testing set that have the predictions from PhosphAt. Because the tyrosine phosphorylation predictions from PhosphAt were very sparse, we could not develop an ROC curve for tyrosine phosphorylation prediction. Figure 3B shows that Musite prediction outperformed PhosphAt. It is worth mentioning that in both testing cases, Musite did not include any testing data in its training set, while Plantphos and PhosphAt may have had some test data in their training, which could give them an advantage. Nevertheless, Musite's improvement over these tools is significant. Figure 4 shows the disorder score distributions. Figure 4A gives the disorder distribution on all the sites that we collected (phospho-and non-phospho-) showing that tyrosine tends to be in the ordered region while serine prefers disorder more than the other two amino acid types. Figure 4B provides the distributions of known phosphosites and predicted sites. The relative trends of www.frontiersin.org serine, threonine, and tyrosine were similar between known and predicted sites, although the predicted sites are more enriched in high disorder regions. Analysis of orthologous groups can help reveal conservation patterns of S/T and Y phosphorylation sites. The analysis was conducted among the phosphorylated groups (at least one center of the four in an orthologous group is phosphorylated). The motivation of this analysis was to determine the transfer probability of phosphorylation from one organism (one center) to other organisms (the other three centers in the orthologous group). The total number of extracted orthologous groups containing S/T or Y at the centers was 225,265. Two classes with pure S/T center or Y center were of interest. In the S/T centered class, there were 631 groups having at least one center predicted to be phosphorylated, and there were 200 in the corresponding Y centered class (see Table 2).
Given one phosphorylated center, if the other three centers are phosphorylated randomly, the distribution should yield a binomial distribution. As an example, under this assumption, for the S/T class, the expected probability of having a second phosphorylated center among the three remaining centers is P ( Y) having a second phosphorylated center is the total number of sites multiplied by the corresponding probability. Using a similar method, the expected numbers of orthologous groups containing four predicted phosphorylation sites were 23 and 3, respectively for S/T and Y classes vs. 102 and 15 observed orthologous groups of such cases, respectively. This indicates that phosphorylation is not a random process in evolution. The chi-square test was then performed between the observation and expectation based on Eq. 2. The statistic of chi-square test was 0.641 and 0.337 for S/T and Y classes, and the p-value for each was 0.113 and 0.047, respectively. This result reveals 89 and 95% of confidence for S/T and Y sites, respectively, indicating that phosphorylation is not a random factor in evolution. Table 2 also lists the conditional probability of having an additional phosphorylation site in an orthologous group. Since the study is on the orthologous phosphorylation groups, we always consider that one of the four centers is already phosphorylated. If we assume an additional phosphorylation site is a random and independent event, the probability for any additional phosphorylation site is just the ground probability, e.g., for the S/T class, Hence, the expected probability to have an additional phosphorylation site is 0.331 for the S/T class and 0.242 for the Y class. Since the first phosphorylation site is always given, the observed conditional probability for the second site is the same as the expected one. However, the observed conditional probabilities for the third and fourth site in the S/T class are P st (O 4 | O 3 O 2 O 1 ) = 0.554 and P st (O 3 | O 2 O 1 ) = 0.421, both of which are significantly higher than 0.331. This means that if the second site is phosphorylated in the same orthologous group, the third and fourth sites are more likely to be phosphorylated than random. The same is true for the Y class. The observed conditional probabilities are: P y (O 4 | O 3 O 2 O 1 ) = 0.375 and P y (O 3 | O 2 O 1 ) = 0.688, both of which are significantly higher than 0.242. A difference is that for the Y class, the peak observation is the third site when two phosphorylated sites are observed while the peak observation is the fourth site for the S/T class.
The distribution of 4T, 1S3T (one serine and three threonine aligned in the center and so on), 2S2T, 3S1T, and 4S is shown in  Table 3. In the non-phosphorylated class (none of the four center aligned amino acids were phosphorylated), the distribution was 32. 33, 7.19, 4.88, 10.22, and 45.38%, while the distribution in the fully conserved phosphorylated class (all of the centers were phosphorylated) is 8.82, 3.92, 2.94, 6.86, and 77.45%, respectively. In particular, only nine groups have four threonine (4T), while 79 groups have 4S in the 102 fully conserved phosphorylation S/T groups. This is consistent with the known trend that phosphorylated S is much more frequent than phosphorylated T. The table shows number of orthologous groups having at least one center predicted to be phosphorylated, the percentage of each state (1-4 predicted phosphorylation sites at the center), the www.frontiersin.org expected number of groups assuming a random distribution of phosphorylation sites at the center, and the conditional probability of having an additional phosphorylation site.
An interesting pattern to explore is the sequence profile of fully conserved phosphorylated classes. To reduce the impact of alignment gaps, some of which could be due to alignment errors by MUSCLE, we only selected those without gaps in the flanking regions to build sequence logs. The overall WebLogo (Crooks et al., 2004) plot ( Figure 5A) and the WebLogo plot for each species (Figures 5B-E) are shown. In this study, a flanking sequence has a length of 21 or 10 amino acids at each side. Some over-represented amino acid types at particular locations are revealed as interesting sequence motifs. It is shown that around the phosphorylation site, more amino acids are negatively charged or polar. Amino acid Aspartate (D), Glutamate (E), and Proline (P) appear in high frequencies toward the C-terminus of the phosphorylation sites.
For the S/T and Y classes, 102 and 15 groups have fully conserved centers (with all the sites predicted to be phosphorylated), respectively, and 20,535 and 6,805 groups have exclusively non-phosphorylated centers, respectively. To evaluate the sequence conservation for each class, a similarity score was calculated based on the pair-wise comparison, i.e., the pair with the same amino acid at a location of the alignment was counted as one and different amino acids as zero. The frequency of the same amino acid occurred at a position was normalized by the total number of pairs and the total number of the orthologous groups in the S/T or Y class. The mean and variance were calculated and plotted in Figures 6A,B. Surprisingly, there is no significant difference in sequence conservation between the exclusively non-phosphorylated sites and the fully conserved phosphorylation sites. The average similarity scores for the non-phosphorylated groups are almost flat at all positions for both S/T and Y class, while the scores for the fully phosphorylated group are fluctuating among different positions. The fluctuation may be due to smaller sample size for the fully phosphorylated group. In contrast, the distribution of the disorder score on the non-phosphorylated sites and the fully conserved phosphorylated sites is shown to have a major difference (Figure 6C). It is interesting to note that the distribution of the disorder score for S/T non-phosphorylated sites in the orthologous groups is similar to a right-side skewed normal distribution, while the distribution for fully conserved phosphorylation sites tends to bean exponential distribution.

DISCUSSION
In this work, we extended Musite for general plant phosphorylation site predictions. We tested our models extensively using six plant species. The cross-species test shows that Musite has the capability of extending the model from Arabidopsis to other green plants. In terms of the overall plant model, through comparison of ROC curves, the Musite predictions on both S/T and Y sites outperformed other plant-specific tools like Plantphos and PhosphAt. This comparison was conservative in that none of the testing data was included in the training set for Musite, while other tools may have included such data.
Analysis on the disorder score distributions in this study revealed different properties of serine, threonine, and tyrosine. The analysis suggests that the phosphorylation capability among serine, threonine, and tyrosine was mainly due to their relative populations in disordered regions ( Figure 4A). In particular, the fact that tyrosine phosphorylation appeared less frequently than serine and threonine probably results from tyrosine's large population in the ordered region, where the phosphorylation is not likely to occur. The observation that serine phosphorylation is much more frequent than threonine phosphorylation (Table 3) is also likely due to serine's higher probability in the disordered regions ( Figure 4A). It is interesting to note that the disorder distribution in the observed or predicted phosphosites tends to be converged, especially between serine phosphorylation and threonine phosphorylation ( Figure 4B). This suggests that the same enzyme for both serine and threonine phosphorylation may require similar protein disorder in the flanking regions of the phosphorylation sites. Although the distribution gaps between the known sites and the predicted sites ( Figure 4B) reveal our model's preference for highly disordered regions, the relative trend of each amino acid is still consistent between known and predicted sites. Analysis of predicted phosphorylation sites reveals some potential conservation patterns in phosphorylation. For example, an analysis of 1-4 phosphorylated centers using conditional probability in Table 1 indicates that the phosphorylation in the same orthologous group is not a random process. There are a large number of single phosphorylated centers. However, the probability of a third or fourth additional phosphorylation site increases dramatically. Such a bipolar distribution of isolated phosphorylation sites and highly conserved ones may indicate the evolutionary features of phosphorylation sites. The isolated phosphorylation sites may be non-functional (Landry et al., 2009) or function through finetuning protein's bulk electrostatic charge in a non-positionally conserved manner (Tan et al., 2010). The prediction of Musite provides a way to study the evolutionary effect beyond the limitation of the experimental data. Although it might introduce some artifacts, the statistical analysis of the predicted phosphorylation sites may provide some valuable hypotheses, since the sample size is large and the stringent threshold with 95% specificity was used for the prediction.
Sequence similarity patterns for the fully conserved phosphorylation groups and non-phosphorylation groups show that the two groups do not have significant differences in terms of sequence conservation. In contrast, the two groups have dramatically different disorder score profiles. This suggests that phosphorylation may be driven by protein disorder much more than sequence similarity in the flanking regions. As long as the disorder property is maintained, the phosphorylation is likely to maintain during www.frontiersin.org evolution. The sequence conservation pattern in the disordered region is generally weak. That may be a major reason why the sequence similarity is insignificant in fully conserved phosphorylation groups. Although our computational models introduce some bias in the disorder distribution from the known sites, it is reasonable to assume that such a bias does not invalidate the sequence conservation pattern, since the relative characteristics between the predicted and observed phosphorylation sites are probably preserved as described earlier.
Finally, we have to point out that although the computational approach provides a possibility to study the global phosphorylation pattern and trend from a large population, it does have some limitations. If a feature is weighted more than the others in the machine-learning process, then in the analysis part, this feature may dominate the pattern as a bias. For example, the dominant disorder score pattern could be due to the high weight of the disorder score feature in the predictor. Although we maintain a relatively low false positive rate, the incorrectly classified sites may introduce significant noises, or naively simplify the feature space. These may limit the capability of model's generalization and data coverage. Separation of computational artifacts and bona fide biological effects requires more investigations after the machine-learning approach. Hence, while our study provides some interesting ideas and hypotheses, they call for experimental validations to further study these issues.