PredT4SE-Stack: Prediction of Bacterial Type IV Secreted Effectors From Protein Sequences Using a Stacked Ensemble Method

Gram-negative bacteria use various secretion systems to deliver their secreted effectors. Among them, type IV secretion system exists widely in a variety of bacterial species, and secretes type IV secreted effectors (T4SEs), which play vital roles in host-pathogen interactions. However, experimental approaches to identify T4SEs are time- and resource-consuming. In the present study, we aim to develop an in silico stacked ensemble method to predict whether a protein is an effector of type IV secretion system or not based on its sequence information. The protein sequences were encoded by the feature of position specific scoring matrix (PSSM)-composition by summing rows that correspond to the same amino acid residues in PSSM profiles. Based on the PSSM-composition features, we develop a stacked ensemble model PredT4SE-Stack to predict T4SEs, which utilized an ensemble of base-classifiers implemented by various machine learning algorithms, such as support vector machine, gradient boosting machine, and extremely randomized trees, to generate outputs for the meta-classifier in the classification system. Our results demonstrated that the framework of PredT4SE-Stack was a feasible and effective way to accurately identify T4SEs based on protein sequence information. The datasets and source code of PredT4SE-Stack are freely available at http://xbioinfo.sjtu.edu.cn/PredT4SE_Stack/index.php.


INTRODUCTION
Gram-negative bacteria use various secretion systems to deliver their secreted substrates (also called as effectors) from the bacterial cytosol into host cells, which can promote virulence and cause diseases. Until now, eight different secretion systems (type I to type VIII) have been found in Gramnegative bacteria, which differ from each other in their outer membrane secretion mechanisms. There are a number of well-organized databases or web resource on collecting experimentally validated effectors of Type III, IV, and VI secretion systems (Bi et al., 2013;Li et al., 2015;Eichinger et al., 2016;An et al., 2017). Among them, type IV secretion system (T4SS) exists widely in a variety of bacterial species, such as Bordetella pertussis, Helicobacter pylori, Coxiella burnetii, and Legionella pneumophila (Chandran et al., 2009;Fronzes et al., 2009;Lifshitz et al., 2013). T4SS specifically secretes type IV secreted effectors (T4SEs), which vary widely across bacterial species. T4SEs mimic the function of host proteins, exert vital functions in cytoplasm of infected eukaryotic cells and play crucial roles in host-pathogen interactions. Accurate and reliable identification of T4SEs is a crucial step toward the understanding of the pathogenic mechanism of T4SS. Due to the biological significance of T4SEs, a number of experimental approaches have been developed to identify novel T4SEs such as fusion protein report assays and secretion apparatus. However, these experimental approaches are timeand resource-consuming. It is highly desirable to develop in silico classification models to accurately predict type IV secreted effectors of T4SS based on protein sequence information.
In the last decade, several computational approaches using machine learning (ML) algorithms were developed to predict T4SEs based on protein sequence information. A pioneering method proposed by Burstein et al. (2009) formulated the task of identifying T4SEs on Legionella pneumophila genome as a classification problem using various ML algorithms, including naïve Bayes, Bayesian networks, support vector machine (SVM), Neural networks, and a voting algorithm that is based on these four algorithms. The input features of these algorithms include taxonomical dispersion, regulatory data, genomic organization, and similarity to eukaryotic proteomes (Burstein et al., 2009). Later, the same group developed a hidden semi-Markov model (HSMM) to characterize the amino acid composition of the secretion signal for identification of T4SEs across species (Lifshitz et al., 2013). Chen et al. (2010) used the similar ML-based model as the previous study (Burstein et al., 2009) to predict putative T4SEs in Coxiella burnetii genome, which helped narrow the number of potential targets for subsequent experimental validation. T4EffPred is a SVM-based prediction tool for identifying T4SEs based on four types of sequence-derived features, which were calculated from amino acid composition (AAC) and position specific scoring matrix (PSSM) profiles . T4SEpre (Wang et al., 2014) is another SVMbased tool for predicting T4SEs from C-terminal 100 amino acids of protein sequences by using AAC, position-specific AAC profiles, and predicted structural features such as secondary structure and solvent accessibility. An et al. (2016) constructed an ensemble model by random forest to integrate the output of the individual predictors (i.e., T4EffPred and T4SEpre) to improve predictive performance. Recently, Wang Y. et al. (2017) presented an effective method to predict T4SEs prediction by integrating information from both 50 N-terminal and 100 C-terminal residues of protein sequences. The model was built by SVM based on three types of features, namely AAC, PSSM, and composition, transition and distribution.
Overall, the currently available computational approaches for prediction of T4SEs vary from one another in terms of the utilized features and ML algorithms. Since the numbers of effectors and non-effectors in genomes are heavily unbalanced (the effectors comprise only a small fraction of a genome), it is highly desirable to develop a prediction method with high precision and high specificity. Otherwise, the number of true positives would easily be overwhelmed by the number of false positives, so that such a predictor is impractical to generate reliable candidates for experimental validation. In the present study, we aim to propose a stacked ensemble model, PredT4SE-Stack, to further improve the prediction performance (i.e., higher precision and specificity) for identifying T4SEs from protein sequence information. The stacked generalization approach (Wolpert, 1992) consists of an ensemble of base classifiers whose outputs are further learned by a meta-classifier to model the relationship between the ensemble outputs and the actual classes/labels. To construct the model, the protein sequences are firstly encoded by the feature of PSSM-composition by summing rows that correspond to the same amino acid residues in PSSM profiles. Based on the PSSM-composition features, a total of eight types of MLbased algorithms (including advanced ML techniques) are used to build base-classifiers in the first stage. Then, the optimal combination of base-classifiers is searched, and the output of these selected base-classifiers are utilized as input for a metaclassifier at the second stage. Our experimental results on both cross validation and independent tests demonstrated that the framework of PredT4SE-Stack is a feasible and effective way to accurately identify T4SEs based on protein sequence information. It also has achieved better performance than previously published methods.

Dataset
In this study, the same benchmark dataset curated by Wang Y. et al. (2017) was used to evaluate the performance of our proposed method. The dataset consists of 1,765 protein sequences across multiple bacterial species, categorized into two classes (380 T4SEs as the positive class and 1,385 non-T4SEs as the negative class). These proteins in this dataset have mutual sequence identity no more than 30%. The 1,765 protein sequences were divided into two subsets for cross validation in the training and the independent testing, respectively. The training dataset (Train-915) are composed of 915 sequences, among which 305 T4SE sequences were randomly selected from positive class, and 610 non-T4SE sequences were randomly selected from negative class. The dataset of Train-915 was further randomly divided into five subsets (or folds) with an equal number of protein sequences for cross validation to attain the optimized model. In each of the five validations, 4 of the 5-folds were used for training and the remaining one for testing, which was repeated for five times. The testing dataset (Test-850) included the remaining 75 T4SE sequences as positive samples and 775 non-T4SE sequences as negative samples for independent testing.

Feature Representation of Protein Sequence Samples
One of the key problems in designing a predictor based on machine learning is how to encode a protein sequence as an informative feature vector enriched with highly discriminative information. In the present section, we describe how to formulate an effective mathematical expression that describes protein sequences in the training and testing data sets.
The protein sequence profile (i.e., PSSM) is a powerful representation of residue or sequence information of proteins. It has achieved good performance on a number of bioinformatics applications such as functional residues prediction and protein function prediction (Xiong et al., 2011a(Xiong et al., ,b, 2012Zhu et al., 2013;Wei et al., 2017a). In this study, PSSMs were generated by three iterations of PSI-BLAST searches against Uniref50 with the BLOSUM62 substitution matrix. The parameter of e-value was set to 0.001. Because ML-based models can only handle vectors with equal lengths for all protein sequence samples, the PSSM of a protein sequence (amino acid length is L) has a dimension of L * 20, which could not be directly used as the input feature vector for machine learning algorithms. Instead, the original PSSM profile was further used to calculate the feature of PSSMcomposition by summing rows that correspond to the same amino acid residues in a PSSM profile, in much the same way as the previous studies Wang J. et al's., 2017). The sum value was divided by the length of the protein sequence for each type of amino acid (there is a total of 20 types). Thus, a vector of size at 400 (=20 × 20) is finally used for representing a protein sequence sample. Figure 1 presents the details about how to generate a feature vector of PSSM-composition for a given protein sequence.

Classification System
The ensemble learning techniques can be categorized into three main types, which include bagging, boosting, and stacked ensemble. It is demonstrated that the ensemble learning techniques can help improve the prediction performance in various bioinformatics applications Lin et al., 2013Lin et al., , 2014Zou et al., 2015;Li et al., 2016;Yuan et al., 2016;Wan et al., 2017;Iqbal and Hoque, 2018;Mishra et al., 2018;You et al., 2018). In this section, we introduce the components of the twostage stacked ensemble scheme, including various classification algorithms used as base-classifiers in the first stage, and the input of the meta-classifier in the second stage.

Base-Classifier
In order to find the optimal combination of base-classifiers in the first stage and the meta-classifier in the second stage, the following eight different machine learning algorithms were exploited: (i) SVM (Cortes and Vapnik, 1995) (Breiman, 2001), (vi) Extremely Randomized Trees (ERT) (Geurts et al., 2006), (vii) Gradient Boosting Machine (GBM) (Friedman, 2001), and (viii) eXtreme Gradient Boosting (XGB). The algorithms such as NB, LR, and GBM were implemented by using h2o package in R software. The algorithms of SVM, KNN, RF, ERT, and XGB are implemented by using e1071, caret, randomForest, extraTrees and xgboost packages in R, respectively. The optimal parameters in these algorithms are determined by a grid search strategy.

Meta-Classifier
The meta-classifier in the second level generalization (or stacked generalization) is used to combine the outputs of base-classifiers in an ensemble. In our classification system, we applied a stacked generalization approach proposed by Wolpert (1992), in which an ensemble of base-classifiers are first constructed, whose outputs are used as inputs to a second level of meta-classifier to learn the relationship between the ensemble outputs and the actual classes/labels. The stacked generalization scheme can be viewed as an extension version of cross validation. In the first stage, the base-classifiers were trained with the feature of PSSMcomposition of sequences. In the second stage, the prediction class probabilities of the base-classifiers were taken as inputs to the meta-classifier (shown in Figure 2).

Model Validation Method
To evaluate performances of classification models, the validation methods are mainly consisting of k-fold cross validation, leaveone-out cross validation (or called as jackknife test), and independent tests. In k-fold cross validation, the sample set is randomly divided into k subsets with equal sizes. Of the k subsets, only one subset is selected as the validation data for testing the model, and the remaining k-1 subsets are used as training data. The cross validation process is then repeated k times (the folds), with each of the k subsets used exactly once as the validation data. The results from k folds are finally averaged. The k-fold cross validation method has been widely used as the model validation approach in various bioinformatics applications (Zhu and Mitchell, 2011;Xu et al., 2017;Zeng et al., 2017;Chen X. et al., 2018;He et al., 2018a,d). In the present study, the 5-fold cross validation was used for validation in the training set, and the independent test was used for testing the generalization ability of the proposed method, and comparison with other methods.

Model Evaluation Metric
In order to assess prediction performances of single-label classification systems, a set of six threshold-dependent metrics are widely used in the bioinformatics studies (Xia et al., 2010;Li et al., 2011;Zhang et al., 2017Zhang et al., , 2018aHe et al., 2018c;Jia et al., 2018;Zhao et al., 2018). They are accuracy (ACC), sensitivity (SE, also called recall), specificity (SP), precision (PR), Matthew's correlation coefficient (MCC) and F-measure (F 1 ). The definitions of these metrics are shown as below.
where TP (true positives) is the number of correctly predicted T4SEs, TN (true negatives) is the number of correctly predicted   non-T4SEs, FP (false positives) is the number of non-T4SEs wrongly predicted as T4SEs, and FN (false negatives) is the number of T4SEs wrongly predicted as non-T4SEs. The receiver operating characteristic (ROC) curve is a plot of the sensitivity versus (1-specificity) for a binary classifier at varying thresholds from 0 to 1 (the threshold is assigned as the probability of the target sequence to be a T4SE in our study). The area under the curve (AUC) can be used as a powerful metric for evaluation performances of classifiers. It is worth mentioning that AUC of ROC (and ACC, MCC) can present overly optimistic assessment of performance of an algorithm on a heavily unbalanced dataset. Therefore, we only used AUC of ROC for evaluation in 5-fold cross validation, but not used it for evaluation in the independent dataset (only 75 proteins are true positives among 850 samples). Instead, the metric of F 1 , which is a harmonic mean of recall (or sensitivity) and precision, is a main metric for evaluating performances of classifiers in the present study.

Predictive Power of Various Base-Classifiers on Train-915 Dataset
The aim of this section is to test the predictive power of baseclassifiers based on PSSM-composition profiles for eight different machine learning algorithms on Train-915 dataset using 5fold cross validation. Experimental results shown in Table 1 indicate that the algorithm of naïve Bayes performed worst on this task. The algorithms of KNN, logistic regression, random forest, and extremely randomized trees performed moderately.
The algorithms of support vector machine, extreme gradient boosting, and gradient boosting machine performed best. The results of ROC shown in Figure 3 are mainly in agreement with the findings in Table 1. However, the fact that the AUC-ROC of SVM is higher than that of XGB and GBM indicates that SVM can achieve more stable performance than XGB and GBM using PSSM-composition feature as input in the present task, in regardless of the change of the thresholds. It should be noted that we tried a large number of other types of PSSMderived features generated by POSSUM tookit , and a variety of structural and physiochemical descriptors extracted from protein sequences generated by iFeature tookit  when we designed the input features of the base-classifiers. Our experimental results demonstrated that the PSSM-composition feature utilized in this study yielded satisfactory performance, which performed better than other types of sequence-based features. Moreover, we attempted to directly combine the PSSM-composition feature with other types of features as the input of the base-classifiers. It was found that the combined features could not significantly produce higher performance than the single type of PSSM-composition feature (data not shown).

Predictive Power of Meta-Classifiers on Train-915 Dataset
Since combining all of the above mentioned base-classifiers in a meta-classifier could not yield optimal prediction performance, it is desirable to search for the optimal combination of baseclassifiers. Since RF and ERT are tree-based classifiers, we chose   one of them at a time. Because GBM and XGB are boosting-based methods, and XGB is an efficient and scalable implementation of GBM, we chose one of them too. It was found that the combination of SVM, GBM, and ERT achieved the optimal performance, which is in agreement with the finding of study by Pan et al. (2018) on the prediction task of hot spots in protein-RNA interfaces. Furthermore, we tested the same set of eight ML methods as the classification algorithms of meta-classifiers to compare their prediction performances. The results in Table 2 showed that all meta-classifiers except the one based on ERT achieved very similar performances, for example, the values of F 1 are falling in a narrow range from 0.847 to 0.858, whereas the base-classifiers using the same set of ML algorithms are ranging from 0.669 to 0.847 in the first stage. These results can be explained by the fact that the pattern learned from the first stage is effective enough, leading to the similar level of performances at the second stage on the same dataset of Train-915, irrespective of ML algorithms, except ERT (also demonstrated in Figure 4) .

Predictive Power of Meta-Classifiers on Test-850 Dataset
In the section, the prediction performances of meta-classifiers are evaluated on the independent dataset, which is mimicking a true prediction task, since the model trained on one dataset is really tested on an unseen dataset for examining its generalization ability on a new dataset. Table 3 indicated that LR and SVM have top performances on Test-850 dataset. Therefore, both of them can be utilized as the classification algorithms of the metaclassifier in PredT4SE-Stack. Considering the fact that LR is more interpretable than SVM, we could use LR to construct the metaclassifier in our model PredT4SE-Stack. In real application, we will re-train PredT4SE-Stack on a whole dataset consisting of Train-915 and Test-850.

Comparison With Previous Studies
The main purpose of this section is to compare our proposed approach PredT4SE-Stack to previously published methods. Performance comparisons among different T4SE prediction approaches are scientifically meaningful only if they train and test their methods on the same dataset. Accordingly, our approach PredT4SE-Stack was only compared with the recently published method proposed by Wang Y. et al. (2017). The first reason is that both two studies used the same benchmark dataset  Table 4 in their published study, we firstly calculated the TP, TN, FP, and FN using the sensitivity and specificity of their method, and then calculated F 1 and precision of Wang Y. et al.'s (2017) method. The meta-classifier of our PredT4SE-Stack classification system was implemented by SVM or LR. For SVM or LR, the performance (F 1 = 0.734 or 0.733) of our method is much higher than that (F 1 = 0.521) of Wang Y. et al.'s (2017) method. If our SVM-based metaclassifier is tuned on the same recall or sensitivity of 90.7%, our method achieved better performance at specificity, precision, and F 1 , which are 2.4, 4.1, and 4.1% respectively, higher than that of Wang Y. et al.'s (2017) method. If our LR-based metaclassifier is tuned on the same recall or sensitivity of 90.7%, our method achieved better performance at specificity, precision, and F 1 , which are 3.7, 6.7, and 6.5% respectively, higher than that of Wang Y. et al.'s (2017) method.

CONCLUSION
The main goal of the current study is to develop a stacked ensemble model PredT4SE-Stack to predict T4SEs from protein sequence information. The proposed model utilized an ensemble of base-classifiers implemented by SVM, GBM, and ERT to generate outputs for the meta-classifier in the classification system. It was demonstrated that the framework of PredT4SE-Stack was a feasible and effective way to accurately identify T4SEs based on protein sequence information. However, the performance of PredT4SE-Stack can be further improved in several respects. Firstly, the diversity of base-classifiers was implemented by various classification algorithms in the present work. It can be further improved by different features in different base-classifiers. Secondly, inspired by the successful application of feature selection strategies in various bioinformatics tasks (Zou et al., 2016;Wei et al., 2017bWei et al., , 2018He et al., 2018b;Manavalan et al., 2018;Qiao et al., 2018;Su et al., 2018;Tang et al., 2018), the predictive power of base-classifiers can be boosted by incorporating an effective feature selection technology on a large pool of sequence-derived features. Moreover, an effective model selection on a large number of candidate base-classifiers will be explored to improve the prediction performance of the metaclassifier. These improvements will be explored in the further study.

AUTHOR CONTRIBUTIONS
XZ and D-QW conceived the study. YX and XZ designed the experiments. YX performed the experiments. YX, QW, JY, and XZ analyzed the data. YX and XZ wrote paper. All authors reviewed the manuscript and agreed to this information prior to submission.