Predicting Endoplasmic Reticulum Resident Proteins Using Auto-Cross Covariance Transformation With a U-Shaped Residue Weight-Transfer Function

Background: The endoplasmic reticulum (ER) is an important organelle in eukaryotic cells. It is involved in many important biological processes, such as cell metabolism, protein synthesis, and post-translational modification. The proteins that reside within the ER are called ER-resident proteins. These proteins are closely related to the biological functions of the ER. The difference between the ER-resident proteins and other non-resident proteins should be carefully studied. Methods: We developed a support vector machine (SVM)-based method. We developed a U-shaped weight-transfer function and used it, along with the positional-specific physiochemical properties (PSPCP), to integrate together sequence order information, signaling peptides information, and evolutionary information. Result: Our method achieved over 86% accuracy in a jackknife test. We also achieved roughly 86% sensitivity and 67% specificity in an independent dataset test. Our method is capable of identifying ER-resident proteins.


INTRODUCTION
The endoplasmic reticulum (ER) is an important subcellular organelle in eukaryotic cells. Two major functions are usually recognized for ER. One is that it selectively transports secreted proteins and membrane proteins. The other is that it retains some proteins to maintain its own structure and function (Lavoie and Paiement, 2008). The ER proteins are sorted precisely with quality controls (Ellgaard and Helenius, 2003;Araki and Nagata, 2011). An understanding of these processes contributes to the elucidation of endoplasmic reticulum function and the pathogenesis of many diseases (Paschen and Frandsen, 2001;Verkhratsky, 2002).
ER-resident proteins are an important topic in ER-related studies. Some of the ER-resident proteins possess sorting signals, such as KDEL or KXXX, while some others do not (Stornaiuolo et al., 2003). Over the last two decades, several efforts have been made to determine the ER sorting signals experimentally. For example, Teasdale and Jackson (1996) found that UGT2 localizes to the endoplasmic reticulum when they studied the UDP-galactosyl transporter (UGT). They also reported that the C-terminal sequence "LLTKVKGS" of the UGT2 is useful in the sorting process. Kabuss et al. (2005) proved that mutating this part of the sequence will result in re-localization of UGT2 to the Golgi apparatus. Although wet experiments for detecting protein localization signals can provide clear evidence and distinguish between maintenance and return signals, performing these experiments is always costly and time-consuming. Therefore, computational predictions are recognized as an alternative approach that provides useful and informative guidance to the experimental methods.
The general-purpose protein subcellular location predictors take ER as only one of many subcellular locations. The dataset used for training and testing these methods does not distinguish between ER-resident proteins and non-ER-resident proteins. Since both of these types of proteins may be annotated with subcellular localization ER, constructing a high-quality dataset that is capable of separating them is important. Kumar et al. (2017) proposed the ERPred method, using a carefully curated dataset to distinguish the ER-resident proteins from the non-ERresident proteins. By using split amino acid compositions (SAAC), they achieved a very promising result. Their results confirmed that the peptide sequences at the terminals of proteins are very informative in guiding the protein sorting process in the ER. Moreover, their results revealed that even if no known sorting signals were found on the sequence, the terminal peptides were still very useful in identifying ER-resident proteins (Kumar et al., 2017).
Pseudo-amino acid composition, which was proposed by Chou (2001), has been widely applied in representing protein sequences for predicting various attributes of proteins. By coupling this with many different machine-learning algorithms, a series of consecutive successes have been achieved. These successful efforts provide consolidated evidence that the pseudo-amino acid compositions are capable of representing protein sequences of various lengths using a fixed-length numerical vector without losing much of the sequential information (Chou, 2011;Chou, 2013;Chou, 2015).
In this study, we introduced a U-shaped weight-adjustment function to improve the pseudo-amino acid compositions. The Ushaped weight-adjustment function transfers weights from the middle-positioned residues to those at the terminals. Besides the weight-adjustment function, we have made two more augmentations to the original pseudo-amino acid compositions. One is to introduce the auto-cross covariance pseudo-factor form, which has been applied in finding protein folding patterns (Dong et al., 2009). The other is to incorporate positional-specific physicochemical properties, which have been applied in predicting protein submitochondrial locations and sub-Golgi locations (Du and Yu, 2013;Jiao and Du, 2017;Zhao et al., 2019).
Our method actually emphasizes the terminal signaling peptide information in pseudo-amino acid compositions. We expect that our approach can be applied not only in predicting ER-resident proteins but also in other topics associated with analyzing protein sorting and localization processes.

Benchmarking Datasets
In this study, we took the ERPred dataset as our benchmarking dataset. Kumar et al. (2017) released this dataset along with their ERPred study. The ERPred dataset contains two parts: the training set and the independent testing set. Table 1 gives a breakdown of the entire ERPred dataset. The training set contains 124 ER-resident proteins and 1200 non-ER-resident proteins. The independent testing set contains 65 ER-resident proteins and 2900 non-ER-resident proteins. It is obvious that this dataset is highly imbalanced. The number of non-ERresident proteins is about 10 times that of the ER-resident proteins in the training set and over 40 times that in the independent testing set. The identifiers of the proteins in the benchmarking dataset are listed in the supplementary materials (Tables S1-S3).

Sequence Representations
The ERPred study applied SAAC sequence representations. The result of ERPred implied that the terminal peptides contain more information for sorting proteins to ER (Kumar et al., 2017). Therefore, we introduced a U-shaped weight-adjustment function to transfer weights from those residues in the middle part of the sequence to those at the terminals of the sequence. Besides this improvement, we incorporated the sequential evolution information using the positional-specific physicochemical properties (PSPCP) (Du and Yu, 2013;Jiao and Du, 2017), as well as the auto-cross covariance form pseudofactors (Dong et al., 2009). In order to explain our method properly, we developed a new set of matrix-based notations to describe the Type-II classic pseudo-amino acid compositions, also known as the amphiphilic pseudo-amino acid compositions (Chou, 2005). These new formulations, in mathematics, equal the original ones but with a much simpler appearance. We first give the definitions of the all-ones vector and the shifting matrix.
An n-D all-ones vector is defined as follows: where d i = 1 (i = 1, 2, …, n). An n-sized shifting matrix is defined as: where m i,j = 1 , when i − j = 1; 0 , otherwise (i = 1, 2, :::, n, j = 1, 2, ::, n) ( A given protein sequence p with length l can be represented as a string: where r j (j = 1, 2, …, l) is the j-th residue on the protein sequence. Every residue represents one of twenty different kinds of amino acids. We use a 20-D binary vector A j to represent r j (j = 1, 2, …, l): where a i,j = 1 , when r j is the i−th type amino acid; The whole sequence can be represented using a matrix, as follows: where A(p) is a matrix-based sequence representation, and A j (j = 1, 2, …, l) as in Eq. (5).
When the PSSM can be created using the PSI-BLAST program for protein p, we can obtain a normalized PSSM scoring matrix for p, as elaborated in (Du and Yu, 2013). The normalized PSSM scoring matrix is denoted as follows: where the following normalization condition is satisfied: We define the following matrix to combine matrix B(p) and matrix A(p): where matrix E is a weight-adjustment matrix. It can be defined as a diagonal matrix, as follows: where e j (j = 1, 2,…, l) is a weight-adjustment factor for the j-th residue on the sequence. It is computed by a U-shaped function, as follows: where k is a weight distribution parameter, exp(.) is the exponential function, l is the length of the sequence, and j is the j-th residue. Given a type of physicochemical property H, the values for 20 different types of amino acids can be represented using a 20-D vector.
where h i (i = 1, 2,…, 20) is the physicochemical property value of the i-th type amino acid. We use the following method to standardize the physicochemical property vector: where J 20 is a 20-D all-ones vector, and In this study, we took two different kinds of physicochemical properties into consideration: the hydrophobicity and hydrophilicity of amino acids. We denote them as H 1 and H 2 , respectively. We define the sequence auto-cross covariance matrix of physicochemical properties as: where u, v ∈{1,2}.
The k-th order covariance factor can be defined as: where tr(.) computes the trace of a matrix, M l the l-sized shifting matrix, and u, v as in Eq. (17). For every given value of k, a 4-D covariance vector can be generated as: By setting the maximum value of k, which is denoted as l, we can use a 4l-D vector to contain all covariance factors as: Considering the weight-adjustment factors, the 20-D conventional amino acid composition vector can be constructed as follows: We can combine the V l (p) and the C(p) to create a (20 + 4l)-D vector to represent the protein sequence p, as follows: where w is a balancing parameter between 0 and 1. We use F(p) to represent protein p in this study.

Prediction Algorithm
We employed a support vector machine (SVM) as the prediction algorithm. The SVM searches for an optimal separating hyperplane in the high-dimensional feature space, which is widely used in bioinformatics problems (Liao et al., 2018;Meng et al., 2019a;Meng et al., 2019b). The hyper-plane can maximize the margin in the feature space. We applied the radial basis function (RBF) as the kernel function in SVM, because the RBF kernel function is the most flexible and the most widely used of such functions. It can be defined as follows: where p and q are two proteins, and |.| is the operator that computes the Euclidean length of a vector. Due to the dataset imbalance, we developed a voting scheme to use all samples in the dataset. We partitioned the negative samples into m subsets. The first m -1 subsets have an equal number of negative samples as that of all the positive samples. The remaining subset contains all the remaining negative samples. For each of these m subsets, all the positive samples were replicated to compose a training subset. We trained the SVM classifier on each of these training subsets. The final prediction result is the majority result of these m classifiers. Figure 1 is a flowchart of the entire algorithm.

Evaluation Method
Three validation methods are commonly applied in evaluating a bioinformatics predictor. They are known as the self-consistency test, jackknife test, and independent dataset test (Jiao and Du, 2016). Of them, the jackknife test is usually considered as the most objective and rigorous (Chou and Zhang, 1995). However, some recent studies have shown that the independent dataset test can provide even better estimation to the true performance if a sufficiently large testing dataset can be given (Jiao and Du, 2016). Due to the limited size of the training dataset and the fact that our training dataset is highly imbalanced, we applied the jackknife test to estimate the prediction performance of our method. We also evaluated our method using the independent testing dataset, which allowed us to compare our method to the state-of-the-art methods in a fair manner.
Four statistics were applied to measure the prediction performances of our method quantitatively. They are the FIGURE 1 | Flowchart of the algorithm. The input sequence will be first converted to matrix-based notations. These notations will be converted into fixed-length numerical vectors, which can represent the sequence order information, the evolutionary information, and the importance of the terminal signaling peptides. sensitivity, specificity, overall accuracy, and the Matthew's Correlation Coefficient (MCC). They are defined as follows: where Sen is the sensitivity, Spe the specificity, Acc the overall accuracy, MCC the Matthew's Correlation Coefficient, and TP, TN, FP, and FN are the number of true positives, true negatives, false positives, and false negatives, respectively.

Parameter Calibrations
Several parameters can be adjusted in our method. The values of these parameters affect the prediction performances. We applied a grid-search strategy to optimize the jackknife test performance by scanning different combinations of the values of k, l, and w.
The parameter k was scanned in the set {0, 0.01, 0.1, 1, 1.5}, the parameters l from 2 to 20 with a step of 1, and the parameter w from 0.05 to 0.95 with a step of 0.05. For each parameter combination, we use another grid-search to find the best values of c, g, and w, where c is the cost parameter of SVM, g is the parameter in the RBF kernel, and w is the class weight ratio between two classes. In this study, we applied the SVM functions in the scikit-learn python package. The grid search of SVM parameters was conducted automatically with a python script.

Performance Analysis and Comparison
We obtained the optimized combination of parameters when k = 0.1, l = 16, w = 0.55, c = 1000, g = 0.01, and w = 1.2. The PSSM matrix was created using the PSI-BLAST program with three iterations and 0.001 as the threshold of e-values. In the jackknife test, our method can correctly identify 111 out of all 124 ER-resident proteins. The prediction performance values are recorded in Table 2, with comparison to the ERPred method.
According to these performance values, our method performed better than the ERPred method. Our method achieved a sensitivity of 83.06% and a specificity of 86.38%, which are both higher than the values for ERPred on the same dataset.

Independent Dataset Test
The training dataset of our work is identical to that used for ERPred. This dataset is highly imbalanced. To further eliminate the concern of over-estimated performances, we performed testing with an independent dataset. We took the same independent testing dataset as used in the ERPred method. The independent testing dataset was processed by the predictor that was trained with the training dataset. The prediction performances of our method are recorded in Table 3. Although the specificity is lower than that from the jackknife test, the sensitivity value remains almost unchanged. Therefore, we think the prediction performance is not over-estimated.
We also entered the same testing dataset into several other predictors for comparison. The compared predictors include ERPred (Kumar et al., 2017), Cello v2.5 (Yu et al., 2006), iLoc-Euk  and Euk-mPLoc 2.0 Chou and Shen, 2010), which all provide the option to identify ER proteins. According to the prediction performance values, our method has the best sensitivity. However, the specificity of our method is lower. The results indicate that Cello and iLoc-Euk tend to assign non-ER locations to an ERresident protein. They increase the specificity by severely sacrificing the sensitivity. As the nature of the ER-resident proteins is that the number of non-ER resident proteins is much larger than the resident ones, we think it is acceptable to sacrifice some specificity for the balance to the sensitivity. The ERPred method, Euk-mPLoc 2.0, and our method have a better balance between sensitivity and specificity. Particularly, it seems that the Euk-mPLoc 2.0 method has the best performance, as it achieves over 66% sensitivity while maintaining over 99% specificity. However, it should be noted that Euk-mPLoc 2.0 is not specifically designed to identify ER-resident proteins. Some of the proteins in the testing dataset may have already been used as training samples when Euk-mPLoc 2.0 was developed. This may result in an over-estimated performance value in the comparison. Another factor that should be noticed for Euk-mPLoc 2.0 is that it relies on GO annotations, which makes it not an ab initio predictor. Although using GO annotations is common in developing this kind of predictor (Du and Xu, 2013), comparing an ab initio predictor with a homology search-based method is not a fair comparison. Therefore, we believe that our method has, at least, comparable prediction performance to other existing methods. Especially in identifying ER-resident proteins, our method should be considered with a h i g h e r p r i o r i t y t h a n g e n e r a l -p u r p o s e s u b c e l l u l a r location predictors.

Effects of the Residue Weight-Transfer Function
The ER-resident proteins can be roughly divided into two different types. One type is proteins with a specific C-terminal tetra-peptide signal, which usually has a form like KDEL or HDEL. The other type is proteins without this kind of signaling peptide on either its C-terminal or N-terminal. The latter types of proteins usually have an N-glycan modification or similar modifications like cereal prolamin storage proteins (Stornaiuolo et al., 2003). In our training dataset, we searched for the tetrapeptide signals by using ProSite. We found only 41 signaling peptides in all of the 124 ER-resident proteins. In our testing dataset, we performed the same search. We found only 11 singling peptides in all of the 65 non-ER-resident proteins. Therefore, it is not practical to identify ER-resident proteins using only the signaling peptide information. This observation is consistent with the motivation of the ERPred study. ERPred is a very powerful and useful computational method. It introduces SAAC sequence representations, which successfully emphasize the terminal signaling sequence information. However, the sequence order information is lost in the amino acid composition representations. Although the pseudo-amino acid composition representation can preserve the sequence order information, it cannot emphasize the terminal signaling peptides in the protein sequence. Therefore, we introduced a U-shaped weighttransfer function into the pseudo-amino acid composition in this study. The purpose of this weight-transfer function is to emphasize the terminal signaling information and also to incorporate the sequence order information. However, it is difficult to decide how many weights should be transferred to the terminals from the middle part of a sequence. We formulate this factor as a parameter k in Eq. (12). Figure 2 illustrates the shape of the function with different k values. Figure 2 enables an intuitive understanding of this U-shaped weighting function. The larger the value of k, the more weights are transferred to the terminals of a sequence. Please also note that Figure 2 is only an intuitive illustration of the Ushaped function when the length of a protein is 100. The crossing point under this condition cannot be extended to other cases.
To find an optimized k value, we trained and tested predictors with different k values. Figure 3 plots the performance values with different k. The sensitivity increases slightly with an increase in k. The specificity peaks when k = 0.1. Therefore, at least for predicting ER-resident proteins, k = 0.1 creates a good weighttransfer function.
The choice of using a U-shaped function rather than another shape is not easy. Since we do not know how much weight should be transferred, this must be an adjustable parameter in the function. Besides, we need to make the function satisfy the following conditions at the same time: (1) all weights are positive; (2) the sum of all weights equals the sequence length; (3) the portion of the weight-increased part and weight-decreased part remains almost unchanged when we adjust the amount of weight that is transferred. This will make the function only transfer weights among residues, not create or remove total weight. The Ushaped function not only satisfies all these conditions but also provides us with a simple way to implement it.

Sequence Representation Augments
Besides the U-shaped weight-transfer function, we augmented the classic amphiphilic pseudo-amino acid compositions in two FIGURE 2 | Illustration of the U-shaped weight-transfer function with various k values. The U-shaped function transfers weights from the middle part of a sequence to its terminals. The total weight of a sequence does not change after applying the U-shaped weight-transfer function. When the parameter k is 0, every residue on the sequence has equal weights, which will produce identical results as where there is no weight-transfer function. When the value of k increases, more and more weights are transferred from the residues in the middle part of a sequence to the residues on its terminals. ways. One is to use auto-cross correlation to replace the autocorrelation in the classic amphiphilic pseudo-amino acid compositions, while the other is to use matrix-based notations to represent the sequence itself.
The advantage of using auto-cross correlation over autocorrelation has been proved in predicting protein folding patterns (Dong et al., 2009). The matrix-based sequence notations see each residue on the sequence as a 20-D composition vector. The original sequence can then be represented using the one-hot encoding scheme, which can be unified with the normalized PSSM. Since PSI-BLAST cannot generate PSSM for every protein sequence, the matrix-based notation actually provides a mathematically compatible way to compensate for the missing PSSM using the one-hot encodings. As elaborated in Du and Yu (2013), when the PSSM is available for a protein sequence, this matrix-based notation also adjusts the weights of residues according to the evolutionary information.
Therefore, our sequence representation actually encoded the sequence order information and the evolutionary information with emphasis on the terminal signaling peptides in a (20 + 4l)-D numerical vector. Compared to other studies, our sequence representation has a much lower number of dimensions. On a dataset with limited samples, the risk of over-estimated performance increases with the number of dimensions of the representation. Our method should be a better choice when the number of samples is limited.

CONCLUSIONS
Many existing methods can predict protein subcellular locations. However, only the ERPred method can specifically identify ERresident proteins. The ER may be the most important type of subcellular organelle, linking all the major subcellular structures, including the nucleus, cytoplasm, and cell membrane. In this study, we present a new method for predicting ER-resident proteins. Although establishing a web server for a predictive method is good practice, it is not easy for us to do so due to the limitations of our resources and the complexity of this new method. We will establish a web server for this method in the future. The most important part of this work is to introduce a Ushaped weight-transfer function into the pseudo-amino acid compositions. Since the signaling peptide information is useful in analyzing many different subcellular processes and this is the first time that the signaling peptide information has been e m p h a s i z e d i n p s e u d o -a m i n o a c i d c o m p o s i t i o n representations, we believe that our method has great potential for application in predicting various attributes of proteins.

DATA AVAILABILITY STATEMENT
In this study, we took the ERPred dataset as our benchmarking dataset. Kumar et al released this dataset, along with their ERPred study (Kumar et al., 2017).

AUTHOR CONTRIBUTIONS
Y-YM curated the dataset, designed the algorithm, implemented the algorithm, and partially calibrated the parameters. WZ performed the experiments, partially calibrated the parameters, and collected the results. G-PL partially collected the results and analyzed the results. YG and P-FD investigated the question, designed the whole study, FIGURE 3 | Performance analysis with different weight-transfer functions. Prediction performance varies with the value of parameter k in the weight-transfer function. When k = 0.1, the performance value peaks. This means that the residues on the terminals are slightly more important than those in the middle part in predicting ER-resident proteins.
conceptualized the algorithm, analyzed the results, and wrote the manuscript.