Impact Factor 3.517 | CiteScore 3.60
More on impact ›

Technology and Code ARTICLE

Front. Genet., 18 September 2019 | https://doi.org/10.3389/fgene.2019.00842

iRO-PsekGCC: Identify DNA Replication Origins Based on Pseudo k-Tuple GC Composition

Bin Liu1,2*†, Shengyu Chen3†, Ke Yan4 and Fan Weng4
  • 1School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
  • 2Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
  • 3School of Informatics, Computing, and Engineering, Indiana University Bloomington, Bloomington, IN, United States
  • 4School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China

Summary: Identification of replication origins is playing a key role in understanding the mechanism of DNA replication. This task is of great significance in DNA sequence analysis. Because of its importance, some computational approaches have been introduced. Among these predictors, the iRO-3wPseKNC predictor is the first discriminative method that is able to correctly identify the entire replication origins. For further improving its predictive performance, we proposed the Pseudo k-tuple GC Composition (PsekGCC) approach to capture the “GC asymmetry bias” of yeast species by considering both the GC skew and the sequence order effects of k-tuple GC Composition (k-GCC) in this study. Based on PseKGCC, we proposed a new predictor called iRO-PsekGCC to identify the DNA replication origins. Rigorous jackknife test on two yeast species benchmark datasets (Saccharomyces cerevisiae, Pichia pastoris) indicated that iRO-PsekGCC outperformed iRO-3wPseKNC. It can be anticipated that iRO-PsekGCC will be a useful tool for DNA replication origin identification.

Availability and implementation: The web-server for the iRO-PsekGCC predictor was established, and it can be accessed at http://bliulab.net/iRO-PsekGCC/.

Introduction

In the process of the cell cycle, DNA replication is one of the most important steps (Shirahige et al., 1998). Since the DNA replication is initiated from a specific region, which is called replication origin, identifying the DNA replication origin is especially important for studying drug developments, cell life activities, genetic engineering, etc. (Méchali, 2010). Experimental methods detect the replication origins by using Chromatin immunoprecipitation (Chip) with high cost (Lubelsky et al., 2012). Therefore, researchers are seeking computational methods to efficiently predict the replication origins only based on the sequence information. Compared with non-replication origins, replication origins show uneven distribution of G (guanine) and C (cytosine) (Lobry, 1996), and the concept of “GC Skew” (Grigoriev, 1998) was proposed. Later, some computational methods incorporated these characteristics into the predictors based on the replication origins (Zhang and Zhang, 1991; Zhang and Zhang, 1994; Grigoriev, 1998; Roten et al., 2002; Thomas et al., 2007; Gao and Zhang, 2008; Luo et al., 2014; Bu et al., 2018). In order to further improve the predictive performance, the discriminative methods were proposed by using both the information of the positive and negative samples (Chen et al., 2012; Li et al., 2015; Zhang et al., 2016), and all of these methods mentioned above achieved the-state-of-the-art performance. A recent method iRO-3wPseKNC incorporated the “GC asymmetry bias” (Lobry, 1996; Grigoriev, 1998; Lubelsky et al., 2012; Li et al., 2014) into the prediction by representing the entire replication origins based on three-window-based PseKNC (3wPseKNC) (Liu et al., 2018b). Feature extraction methods are the keys for the performance improvement. In this regard, many features have been proposed, which can be easily generated by some software tools.

These existing computational methods have significantly enhanced the development of this hot area, but they all suffer from certain disadvantages or limitations, for example, as discussed above the GC Skew is an important feature of replication origins, but all the existing discriminative methods failed to directly use GC Skew to construct the predictors. Furthermore, the existing feature extraction methods cannot reflect the uneven distribution of G and C. To solve these problems, we followed the framework of iRO-3wPseKNC (Liu et al., 2018b), and proposed an improved predictor called iRO-PsekGCC for replication origin identification. iRO-PsekGCC cannot only capture the CG asymmetry bias by using k-tuple GC composition (or k-GCC), but can also incorporate the GC Skew into the concept of PseKNC (Chen et al., 2014a; Chen et al., 2014b).

Manuscript Formatting

Benchmark Datasets

In order to evaluate the performance of the proposed method, two recently established benchmark datasets of the Saccharomyces cerevisiae and Pichia pastoris (Liu et al., 2018b) were employed in this study, because they showed clear CG asymmetry distributions, which can be represented as:

Sτ=Sτ+Sτ, τ={1 for Saccharomyces cerevisiae2 for Pichia pastoris(1)

where the symbol ∪ represents the union, and S+ represents the positive dataset containing 340 replication origins, and S1 represents the negative dataset containing 342 non-replication origins; 305 replication origins are in positive dataset S2+, and 302 non-replication origins are in the negative dataset S2. For both of the two benchmark datasets, the redundant samples have been removed by using CD-HIT software tool (Li and Godzik, 2006) with the most stringent cut-off threshold (80%).

Pseudo k-Tuple GC Composition (PsekGCC)

One of the key steps for constructing machine-learning predictors for analyzing biological sequences is feature extraction. Following the framework of three-window-based PseKNC (3wPseKNC) (Liu et al., 2018b), we proposed a feature extraction method called “Pseudo k-tuple GC composition (PseKGCC)” to directly incorporate the CG asymmetry bias (Lobry, 1996; Grigoriev, 1998; Lubelsky et al., 2012; Li et al., 2014) and GC skew (Grigoriev, 1998) into the predictor. In the following sections, we will introduce how to represent DNA samples by using PseKGCC.

A DNA sequence D can be formulated as follow:

D=N1N2N3NiNL(i=1,2,3,L)(2)

where L denotes the length of D, and

Ni{A(adenine),C(cytosine),G(guanine),T(thymine)},(i=1,2,3,,L)(3)

which represents the i-th nucleobase in the sequence, and fi ∈ denotes the “member of’” in the set theory. Following the study (Liu et al., 2018b), D is divided into three windows by two parameters ε and δ, including front window, middle window, and rear window respectively. ε and 1 − δ denote the percentage of total nucleobases of D in the front window and rear window, respectively. The front window, middle window and rear window can be represented as D[1,η], D[η + 1,ξ], and D[ξ + 1, L], respectively, where η and ξ are defined as (Liu et al., 2018b),

{η=IntC[L×ε]ξ=IntC[L×δ] ,(0<ε<δ<1.0)(4)

where the symbol IntC represents the ceiling operator, which means to return the smallest integer value greater than or equal to the float number.

According to (Liu et al., 2018b), if D is formulated by the k-tuple nucleotide (or k-mer) (Liu et al., 2019b; Liu, 2017) based on the three windows strategy, it can be represented as follow:

D=[f1(1)fv(1)f4k(1)f4k+1(2)f4k+v(2)f4k+v(2)  f2×4k(2)f2×4k+1(3)f2×4k+v(3)f3×4k(3)]T(5)

where in vector operations, symbol ‘T’ denotes the transformation symbol, and in the sample D, the normalized frequency values of the corresponding k-tuple nucleotides appearing in the front window, middle window and rear window are represented as f(1), f(2), f(3), respectively. The feature vector’s dimension is 3 × 4k.

This strategy was proposed to capture the patterns of “GC asymmetry bias” in yeast species genomes, and it is able to improve the predictive performance for identifying replication origins among multiple yeast species genomes. However, this approach has the following disadvantages: 1) the three windows strategy can only capture the local GC asymmetry bias of replication origins, but it cannot incorporate the GC asymmetry bias in a global fashion; 2) for large k values of k-tuple nucleotide, the dimension of the resulting feature vectors is high, which will cause high dimension disaster.

In order to overcome these disadvantages, we proposed a new composition of DNA sequence called “k-tuple GC composition (or k-GCC)” to capture the GC preference in the replication origins and their global interactions. k-GCC treats A (adenine) and T (thymine) as one nucleotide type represented as *. Therefore, the alphabet of k-GCC is

Ni{G(guanine), C(cytosine),*},(i=1,2,3,,L)(6)

Therefore, by replacing the k-tuple by k-GCC, a DNA sequence D can be represented as:

D=[f1(1)fv(1)f3k(1)f3k+1(2)f3k+v(2)f2×3k(2)f2×3k+1(3)  f2×3k+1(3)f2×3k+v(3)f3×3k(3)]T(7)

Compared with Equation 5, the k-GCC can efficiently reduce the dimension of the feature vector from 3 × 4k to 3 × 3k by focusing on the GC composition.

The proposed Pse-KGCC incorporates both the k-GCC and GC skew into the framework of PseKNC (Chen et al., 2014a), which can be represented as:

D=[ϕ1ϕ3kϕ3k+λ ϕ3k+λ+1ϕ(3k+λ)+3kϕ2×(3k+λ) ϕ2×(3k+λ)+1ϕ2×(3k+λ)+3kϕ3×(3k+λ)]T(8)

where

ϕu={fu(1)i=13kfi(1)+wj=1λθj(1)1u3kwθu3k(1)i=13kfi(1)+wj=1λθj(1)3k+1u3k+λfu(2)i=3k+12×3kfi(2)+wj=1λθj(2)3k+λ+1u2×3k+λwθu(3k+λ)3k(2)i=3k+12×3kfi(2)+wj=1λθj(2)2×3k+λ+1u2×3k+2λfu(3)i=2×3k+13×3kfi(3)+wj=1λθj(3)2×3k+2λ+1u3×3k+2λwθu2×(3k+λ)3k(3)i=2×3k+13×3kfi(3)+wj=1λθj(3)3×3k+2λ+1u3×3k+3λ(9)

where λ denotes the highest tier correlation of the k-GCC nucleotides in each local window of D, whose the value is an integer. w is a float number that represents the weight factor, and the value of w is between 0 and 1. In the front window, the middle window and the rear window, the correlation factor of the j-th tier is represented as θj(1), θj(2), and θj(3), respectively. The GC skew value of the k-GCC nucleotides separated by j nucleotides is used to represent the correlation factor of the j-th tier in each local window. (Figure 1). θj(1), θj(2), and θj(3) can be calculated by

128{θj(1)=1IntC[ηkj]+1i=0IntC[ηkj]Θ(Ni×j+1Ni×j+2Ni×j+k)θj(2)=1IntC[ξηkj]+1i=0IntC[ξηkj]Θ(Nη+i×j+1Nη+i×j+2Nη+i×j+k)j=1, 2,,λ; λmin(η,ξη, Lξ)θj(3)=1IntC[Lξkj]+1i=0IntC[Lξkj]Θ(Nξ+i×j+1Nξ+i×j+2Nξ+i×j+k)(10)

where IntC[ηkj]+1 denotes the number of the k-GCC in the corresponding local window, and Θ(Ni × j + 1Ni × j + 2 ⋯ Ni × j + k) is the GC Skew (Lobry, 1996; Grigoriev, 1998; Li et al., 2014) of the i-th k-GCC in the local window, which can be calculated by

Θ(Ni×j+1Ni×j+2Ni×j+k)= fG(Ni×j+1Ni×j+2Ni×j+k)fC(Ni×j+1Ni×j+2Ni×j+k)fG(Ni×j+1Ni×j+2Ni×j+k)+fC(Ni×j+1Ni×j+2Ni×j+k)(11)

where fG(Ni × j + 1 Ni × j + 2 ⋯ Ni × j + k) denotes the frequency of G in the subsequence Ni × j + 1 Ni × j + 2 ⋯ Ni × j + kfC(Ni × j + 1 Ni × j + k ⋯ Ni × j + k) denotes the frequency of C in the subsequence Ni × j + 1 Ni × j + 2 ⋯ Ni × j + k, reflecting the CG asymmetry bias directly. Please note that for the terminal subsequence, if its length is less than k, then the GC skew will be calculated by all the available nucleotide residues.

FIGURE 1
www.frontiersin.org

Figure 1 A schematic diagram to illustrate how to calculate the GC Skew in the front, middle, and rear windows along a DNA sequence. (A) The coupling between all the contiguous k-GCC (k = 3); (B) The coupling between the second most contiguous k-GCC (k = 3); (C) The coupling between all the contiguous k-GCC (k = 4); (D) The coupling between the second most contiguous k-GCC (k = 4).

Random Forest

Being widely used in bioinformatics (Zhao et al., 2014; Su et al., 2019), Random Forest (RF) (Ho, 1995; Barandiaran, 1998) is a machine learning classifier. Its training process can prevent overfitting (Hastie et al., 2008). The Random Forest model was implemented by calling the command line RandomForestClassifier (“max_features=’sqrt’, min_samples_leaf=1, min_samples_split=2, criterion = ‘gini’, =optimize-d value”) with the help of the Scikit-learn package (Pedregosa et al., 2011), where the values of represents the number of the trees in the forest, and it was set as 600 for both the two benchmark datasets (cf. Equation 1).

Ensemble Learning

Previous studies (Zou et al., 2015; Liu et al., 2016a; Chen et al., 2016b; Chen et al., 2017a; Chen et al., 2017b; Liu et al., 2018a) have demonstrated that fusing a series of individual predictors by a voting strategy can improve the predictive performance. In this regard, in this study an ensemble predictors was constructed by fusing 10 top performing individual predictors constructed by different parameter combinations of PseKGCC (see Supplementary Information S1), which can be represented as (Liu et al., 2016a):

FE=RF(1)  RF(2)   RF(i RF(10)=i=110RF(i)(12)

where FE represents the ensemble classifier, ∀ represents the fusing operator, and RF(i) represents the basic Random Forest predictor.

The ensemble predictor is constructed based on the fusion score ß of the probabilities predicted by the 10 basic predictors, which can be calculated by

ß= i=110qiPi(13)

where qi is the weight of the i-th basic RF predictor, which was optimized by the genetic algorithm (Mitchell, 1998), and their values were listed in Supplementary Information S1. If the value of ß is higher than 0.5, it is a replication origin, otherwise, it is a non-replication origin. The flowchart of the iRO-PseKGCC is illuminated in Figure 2.

FIGURE 2
www.frontiersin.org

Figure 2 A flowchart illustration to show how the iRO-PseKGCC predictor works.

Cross Validation

Three widely used cross-validation strategies include: i) independent test, ii) K-fold cross validation, and iii) jackknife test. Among these methods, only the jackknife test can achieve the unique results for the same benchmark dataset. Therefore, in this study, the jackknife test was employed to give the final predictive results. However, considering its high computational cost, during the parameter optimization process, the 5-fold cross-validation was used to reduce the computational cost (see Optimize Parameters section).

Evaluation Method of Performance

To evaluate the quality of the classifier for prediction of the replication origins, the four metrics are used (Feng et al., 2013; Chen et al., 2016c; Chen et al., 2019): i) the sensitivity, Sn, ii) the specificity, Sp, iii) the overall accuracy of the predictive results, Acc, iv) the Mathew’s correlation coefficient, MCC, and v) Arear under ROC Curve, AUC (Chen et al., 2016a), defined as:

{Sn=1N+N+0Sn1Sp=1N+N0Sp1Acc=1N++N+N++N0ACC1MCC=1(N++N+N++N)(1+N++N+N+)(1+N++N+N)1MCC1AUCArear under ROC Curve(14)

where N+ denotes the number of all the positive samples (replication origins), N denotes the number of all the negative samples (non-replication origins), N+ denotes the number of the positive samples (replication origins) incorrectly predicted as the negative samples (non-replication origins), N+ denotes the number of the negative samples (non-replication origins) incorrectly predicted as the positive samples (replication origins). More information of these performance measures can refer to Liu et al. (2016b).

Results and Discussion

Optimize Parameters

There are five parameters in PseKGCC according to Equations 4–9. These parameters were optimized by the following equations:

{0.15ε0.5,with stepε=0.050.5<δ0.85, with stepδ=0.053k7, with stepk=11λ10,with stepλ=30.1w1, with stepw=0.1(15)

The fivefold cross-validation was employed to search the optimal parameters by gridding method so as to reduce the time consumption, and the predictive results of the top 10 performing predictors, and their optimized parameters were listed in Supplementary Information S1.

Comparison With Other Methods

To the best knowledge of ours, iRO-3wPseKNC (Liu et al., 2018b) is the only existing predictor that is able to predict the entire replication origins. All the other predictors can only predict the fragments of replication origins. Therefore, the performance of the proposed iRO-PseKGCC was compared with iRO-3wPseKNC on the two benchmark datasets, and the results were listed in Table 1, from which we can see that iRO-PseKGCC obviously outperformed iRO-3wPseKNC in terms of the five performance measures (cf. Equation 14), indicating that the proposed PseKGCC feature is able to capture the GC asymmetry bias, and incorporate the GC skew into the predictor. Therefore, iRO-PseKGCC is an efficient approach for improving the predictive performance.

TABLE 1
www.frontiersin.org

Table 1 The results of the iRO-PseKGCC Predictor and comparison with iRO-PseKGCC on the two benchmark datasets (cf. Equation 1) obtained by using jackknife test.

Feature Analysis

Random forest is a combination classifier model composed of decision tree classifiers. During the process of constructing each tree by the “Bootstrap” method (Efron, 1992), samples not extracted for training the corresponding tree can be used to make “Out Of Bag” (OOB) error estimate (Breiman, 1996) to evaluate the generalization performance of a predictor. Based on the OOB error, the Mean Decrease Accuracy (MDA) (Jiang et al., 2007) can be used to estimate the importance of the features. The details of the process are (Jiang et al., 2007): 1) When training a Random Forest model, using the OOB samples to test the accuracy of each tree in the model; 2) Randomly disturb the value of the feature variable v in the OOB samples, and retest the accuracy of each tree; 3) Calculate the mean value of the decreasing accuracy between the two tests in all decision trees in the Random Forest model. The MDA value can reflect the importance of the corresponding feature.

As shown in previous studies (Liu and Zhu, 2019; Liu et al. 2019a), feature analysis is critical for exploring the characteristics of the predictors. To explore the reason why the proposed predictor iRO-PseKGCC works so well, we analyzed the features of the two top performing iRO-PseKGCC predictors (see Supplementary Information S1) on the two benchmark datasets (cf. Equation 1) by MDA approach, and the results are listed in the Table 2, from which we can see that: 1) for both the two RF-based predictors, their most important features are the “***” and “*****,” indicating the importance of the k-GCC; 2) The global sequence order effects measured by different λ values and GC skew values contribute to the performance improvement; 3) Features in certain local window show more discriminative powers than those in other windows, for examples, for Pichia pastoris, all the top 10 most important features are in the middle window, which is consistent with the previous observations that the nucleobase composition distribution is uneven along the replication origins (Lobry, 1996; Grigoriev, 1998; Frank and Lobry, 1999; Tillier and Collins, 2000; Liu et al., 2018b).

TABLE 2
www.frontiersin.org

Table 2 The top 10 most important features of the top two performing RF-based predictors on the two benchmark datasets (cf. Equation 1).

Web Server and User Guide

Web-servers are important for the researchers to implement the corresponding computational predictors. In this regard, for the user’s convenience, we established a web-server named “iRO-PseKGCC.” For users’ convenience, a detailed user guide explaining how to use the web-server is given.

Step 1. Click on the web sites address http://bliulab.net/iRO-PsekGCC/ to open the web-server, then the main pages on the website as shown in Figure 3 will appear in front of you. To see a brief introduction about the server, please click on the “Read Me” button.

Step 2. Choose the one specie from Saccharomyces cerevisiae or Pichia pastoris.

Step 3. The input sequences should be in the FASTA format. The sequence data can be uploaded via the “Browse” button or copy and paste or type into the input box directly.

Step 4. To see the predicted results, please click on the “Submit” button. For example, if the four query DNA sequences in the Example window are used as the queried data, the predictive results are the 1st and 2nd query sequences are replication origins, and the 3rd and 4th are non-replication origins.

FIGURE 3
www.frontiersin.org

Figure 3 A semi-screen shot to show the homepage of the web-server iRO-PseKGCC, which can be accessed at http://bliulab.net/iRO-PsekGCC/.

Data Availability

Publicly available datasets were analyzed in this study. These data can be found here: https://academic.oup.com/bioinformatics/article-abstract/34/18/3086/4978052?redirectedFrom=fulltext

Author Contributions

BL provided the main idea of the manuscript and wrote the manuscript. SC did the experiments and revised the manuscript. KY revised the manuscript and did the typesetting. FW did the experiments.

Funding

This work was supported by the National Natural Science Foundation of China (No. 61672184, 61732012, 61822306), Fok Ying-Tung Education Foundation for Young Teachers in the Higher Education Institutions of China (161063), Shenzhen Overseas High Level Talents Innovation Foundation (Grant No. KQJSCX20170327161949608), Guangdong Natural Science Funds for Distinguished Young Scholars (2016A030306008), and Scientific Research Foundation in Shenzhen (JCYJ20180306172207178).

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00842/full#supplementary-material

References

Barandiaran, I. (1998). The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell. 20 (8), 832–844. doi: 10.1109/34.709601

CrossRef Full Text | Google Scholar

Breiman, L. (1996). “Out-of-bag estimation”. Citeseer).

Google Scholar

Bu, H. D., Hao, J. Q., Guan, J. H., Zhou, S. G. (2018). Predicting enhancers from multiple cell lines and tissues across different developmental stages based on svm method. Curr. Bioinform. 13 (6), 655–660. doi: 10.2174/1574893613666180726163429

CrossRef Full Text | Google Scholar

Chen, H., Peng, S., Dai, L., Zou, Q., Yi, B., Yang, X., et al. (2017a). Oral microbial community assembly under the influence of periodontitis. PloS One 12 (8), e0182259. doi: 10.1371/journal.pone.0182259

PubMed Abstract | CrossRef Full Text | Google Scholar

Chen, J., Guo, M., Li, S., Liu, B. (2017b). Protdec-ltr2. Bioinformatics 33 (21), 3473–3476. doi: 10.1093/bioinformatics/btx429

PubMed Abstract | CrossRef Full Text | Google Scholar

Chen, J., Guo, M., Wang, X., Liu, B. (2016a). A comprehensive review and comparison of different computational methods for protein remote homology detection. Brief. Bioinform. 19 (2), 231–244. doi: 10.1093/bib/bbw108

CrossRef Full Text | Google Scholar

Chen, J., Long, R., Wang, X.-l., Liu, B., Chou, K.-C. (2016b). dRHP-PseRA: detecting remote homology proteins using profile-based pseudo protein sequence and rank aggregation. Sci. Rep. 6, 32333. doi: 10.1038/srep32333

PubMed Abstract | CrossRef Full Text | Google Scholar

Chen, J., Wang, X., Liu, B. (2016c). IMiRNA-SSF: improving the identification of MicroRNA precursors by combining negative sets with different distributions. Sci. Rep. 6, 19062. doi: 10.1038/srep19062

PubMed Abstract | CrossRef Full Text | Google Scholar

Chen, W., Feng, P., Lin, H. (2012). Prediction of replication origins by calculating DNA structural properties. Febs Letters 586 (6), 934–938. doi: 10.1016/j.febslet.2012.02.034

PubMed Abstract | CrossRef Full Text | Google Scholar

Chen, W., Lei, T.-Y., Jin, D.-C., Lin, H., Chou, K.-C. (2014a). PseKNC: a flexible web server for generating pseudo K-tuple nucleotide composition. Anal. Biochem. 456, 53–60. doi: 10.1016/j.ab.2014.04.001

PubMed Abstract | CrossRef Full Text | Google Scholar

Chen, W., Lv, H., Nie, F., Lin, H. (2019). i6mA-Pred: Identifying DNA N6-methyladenine sites in the rice genome. Bioinformatics. doi: 10.1093/bioinformatics/btz015

CrossRef Full Text | Google Scholar

Chen, W., Zhang, X., Brooker, J., Lin, H., Zhang, L., Chou, K.-C. (2014b). PseKNC-General: a cross-platform package for generating various modes of pseudo nucleotide compositions. Bioinformatics 31 (1), 119–120. doi: 10.1093/bioinformatics/btu602

PubMed Abstract | CrossRef Full Text | Google Scholar

Efron, B. (1992). “Bootstrap methods: another look at the jackknife,” in Breakthroughs in statistics (New York: Springer), 569–593. doi: 10.1007/978-1-4612-4380-9_41

CrossRef Full Text | Google Scholar

Feng, P.-M., Chen, W., Lin, H., Chou, K.-C. (2013). iHSP-PseRAAAC: Identifying the heat shock protein families using pseudo reduced amino acid alphabet composition. Anal. Biochem. 442 (1), 118–125. doi: 10.1016/j.ab.2013.05.024

PubMed Abstract | CrossRef Full Text | Google Scholar

Frank, A., Lobry, J. (1999). Asymmetric substitution patterns: a review of possible underlying mutational or selective mechanisms. Gene 238 (1), 65–77. doi: 10.1016/S0378-1119(99)00297-8

PubMed Abstract | CrossRef Full Text | Google Scholar

Gao, F., Zhang, C.-T. (2008). Ori-Finder: a web-based system for finding oriC s in unannotated bacterial genomes. BMC Bioinform. 9 (1), 79. doi: 10.1186/1471-2105-9-79

CrossRef Full Text | Google Scholar

Grigoriev, A. (1998). Analyzing genomes with cumulative skew diagrams. Nucleic Acids Res. 26 (10), 2286–2290. doi: 10.1093/nar/26.10.2286

PubMed Abstract | CrossRef Full Text | Google Scholar

Hastie, T., Tibshirani, R., Friedman, J., (2008). The elements of statistical learning (2nd ed.). New York: Springer series in statistics New York.

Google Scholar

Ho, T. K. (1995). “Random decision forests,” in Proceedings of 3rd international conference on document analysis and recognition (Washington: IEEE), 278–282.

Google Scholar

Jiang, P., Wu, H., Wang, W., Ma, W., Sun, X., Lu, Z. (2007). MiPred: classification of real and pseudo microRNA precursors using random forest prediction model with combined features. Nucleic Acids Res. 35 (suppl_2), W339–W344. doi: 10.1093/nar/gkm368

PubMed Abstract | CrossRef Full Text | Google Scholar

Li, W.-C., Deng, E.-Z., Ding, H., Chen, W., Lin, H. (2015). iORI-PseKNC: a predictor for identifying origin of replication with pseudo k-tuple nucleotide composition. Chemom. Intell. Lab. Syst. 141, 100–106. doi: 10.1016/j.chemolab.2014.12.011

CrossRef Full Text | Google Scholar

Li, W.-C., Zhong, Z.-J., Zhu, P.-P., Deng, E.-Z., Ding, H., Chen, W., et al. (2014). Sequence analysis of origins of replication in the Saccharomyces cerevisiae genomes. Front. Microbiol. 5, 574. doi: 10.3389/fmicb.2014.00574

PubMed Abstract | CrossRef Full Text | Google Scholar

Li, W., Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22 (13), 1658–1659. doi: 10.1093/bioinformatics/btl158

PubMed Abstract | CrossRef Full Text | Google Scholar

Liu, B. (2017). BioSeq-Analysis: a platform for DNA, RNA and protein sequence analysis based on machine learning approaches. Brief. Bioinform. doi: 10.1093/bib/bbx165

CrossRef Full Text | Google Scholar

Liu, B., Gao, X., Zhang, H. (2019b). BioSeq-Analysis2.0: an updated platform for analyzing DNA, RNA, and protein sequences at sequence level and residue level based on machine learning approaches. Nucleic Acids Res. doi: 10.1093/nar/gkz740

CrossRef Full Text | Google Scholar

Liu, B., Li, C., Yan, K. (2019a). DeepSVM-fold: Protein fold recognition by combining Support Vector Machines and pairwise sequence similarity scores generated by deep learning networks. Brief. Bioinform. doi: 10.1093/bib/bbz098

CrossRef Full Text | Google Scholar

Liu, B., Li, K., Huang, D.-S., Chou, K.-C. (2018a). iEnhancer-EL: identifying enhancers and their strength with ensemble learning approach. Bioinformatics 34 (22), 3835–3842. doi: 10.1093/bioinformatics/bty458

PubMed Abstract | CrossRef Full Text | Google Scholar

Liu, B., Long, R., Chou, K.-C. (2016a). iDHS-EL: identifying DNase I hypersensitive sites by fusing three different modes of pseudo nucleotide composition into an ensemble learning framework. Bioinformatics 32 (16), 2411–2418. doi: 10.1093/bioinformatics/btw186

PubMed Abstract | CrossRef Full Text | Google Scholar

Liu, B., Wang, S., Long, R., Chou, K.-C. (2016b). iRSpot-EL: identify recombination spots with an ensemble learning approach. Bioinformatics 33 (1), 35–41. doi: 10.1093/bioinformatics/btw539

PubMed Abstract | CrossRef Full Text | Google Scholar

Liu, B., Weng, F., Huang, D.-S., Chou, K.-C. (2018b). iRO-3wPseKNC: Identify DNA replication origins by three-window-based PseKNC. Bioinformatics 34 (18), 3086–3093. doi: 10.1093/bioinformatics/bty312

PubMed Abstract | CrossRef Full Text | Google Scholar

Liu, B., Zhu, Y. (2019). ProtDec-LTR3.0: protein remote homology detection by incorporating profile-based features into Learning to Rank. IEEE Access 7, 102499–102507. doi: 10.1109/ACCESS.2019.292963

CrossRef Full Text | Google Scholar

Lobry, J. R. (1996). Asymmetric substitution patterns in the two DNA strands of bacteria. Mol. Biol. Evol. 13 (5), 660–665. doi: 10.1093/oxfordjournals.molbev.a025626

PubMed Abstract | CrossRef Full Text | Google Scholar

Lubelsky, Y., MacAlpine, H. K., MacAlpine, D. M. (2012). Genome-wide localization of replication factors. Methods 57 (2), 187–195. doi: 10.1016/j.ymeth.2012.03.022

PubMed Abstract | CrossRef Full Text | Google Scholar

Luo, H., Zhang, C.-T., Gao, F. (2014). Ori-Finder 2, an integrated tool to predict replication origins in the archaeal genomes. Front. Microbiol. 5, 482. doi: 10.3389/fmicb.2014.00482

PubMed Abstract | CrossRef Full Text | Google Scholar

Méchali, M. (2010). Eukaryotic DNA replication origins: many choices for appropriate answers. Nat. Rev. Mol. Cell Biol. 11 (10), 728. doi: 10.1038/nrm2976

PubMed Abstract | CrossRef Full Text | Google Scholar

Mitchell, M. (1998). An introduction to genetic algorithms. Boston: MIT press.

Google Scholar

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12 (Oct), 2825–2830. doi: 10.1524/auto.2011.0951

CrossRef Full Text | Google Scholar

Roten, C.-A. H., Gamba, P., Barblan, J.-L., Karamata, D. (2002). Comparative Genometrics (CG): a database dedicated to biometric comparisons of whole genomes. Nucleic Acids Res. 30 (1), 142–144. doi: 10.1093/nar/30.1.142

PubMed Abstract | CrossRef Full Text | Google Scholar

Shirahige, K., Hori, Y., Shiraishi, K., Yamashita, M., Takahashi, K., Obuse, C., et al. (1998). Regulation of DNA-replication origins during cell-cycle progression. Nature 395 (6702), 618. doi: 10.1038/27007

PubMed Abstract | CrossRef Full Text | Google Scholar

Su, R., Liu, X., Wei, L., Zou, Q. (2019). Deep-Resp-Forest: A deep forest model to predict anti-cancer drug response. Methods (San Diego, Calif.) 166 (2019), 91–102. doi: 10.1016/j.ymeth.2019.02.009

PubMed Abstract | CrossRef Full Text | Google Scholar

Thomas, J. M., Horspool, D., Brown, G., Tcherepanov, V., Upton, C. (2007). GraphDNA: a Java program for graphical display of DNA composition analyses. BMC Bioinform. 8 (1), 21. doi: 10.1186/1471-2105-8-21

CrossRef Full Text | Google Scholar

Tillier, E. R., Collins, R. A. (2000). The contributions of replication orientation, gene direction, and signal sequences to base-composition asymmetries in bacterial genomes. J. Mol. Evol. 50 (3), 249–257. doi: 10.1007/s002399910029

PubMed Abstract | CrossRef Full Text | Google Scholar

Zhang, C.-J., Tang, H., Li, W.-C., Lin, H., Chen, W., Chou, K.-C. (2016). iOri-Human: identify human origin of replication by incorporating dinucleotide physicochemical properties into pseudo nucleotide composition. Oncotarget 7 (43), 69783–69793. doi: 10.18632/oncotarget.11975

PubMed Abstract | CrossRef Full Text | Google Scholar

Zhang, C.-T., Zhang, R. (1991). Analysis of distribution of bases in the coding sequences by a digrammatic technique. Nucleic Acids Res. 19 (22), 6313–6317. doi: 10.1093/nar/19.22.6313

PubMed Abstract | CrossRef Full Text | Google Scholar

Zhang, R., Zhang, C.-T. (1994). Z curves, an intutive tool for visualizing and analyzing the DNA sequences. J. Biomol. Struct. Dyn. 11 (4), 767–782. doi: 10.1080/07391102.1994.10508031

PubMed Abstract | CrossRef Full Text | Google Scholar

Zhao, X., Zou, Q., Liu, B., Liu, X. (2014). Exploratory predicting protein folding model with random forest and hybrid features. Curr. Proteomics 11 (4), 289–299. doi: 10.2174/157016461104150121115154

CrossRef Full Text | Google Scholar

Zou, Q., Guo, J., Ju, Y., Wu, M., Zeng, X., Hong, Z. (2015). Improving tRNAscan-SE annotation results via ensemble classifiers. Mol. Inf. 34 (11–12), 761–770. doi: 10.1002/minf.201500031

CrossRef Full Text | Google Scholar

Keywords: replication origin identification, pseudo k-tuple GC composition, random forest, web-server, DNA sequence analysis

Citation: Liu B, Chen S, Yan K and Weng F (2019) iRO-PsekGCC: Identify DNA Replication Origins Based on Pseudo k-Tuple GC Composition. Front. Genet. 10:842. doi: 10.3389/fgene.2019.00842

Received: 05 July 2019; Accepted: 13 August 2019;
Published: 18 September 2019.

Edited by:

Liang Cheng, Harbin Medical University, China

Reviewed by:

Xun Lan, Tsinghua University, China
Qiwen Dong, East China Normal University, China

Copyright © 2019 Liu, Chen, Yan and Weng. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Bin Liu, bliu@bliulab.net

These authors have contributed equally to this work and share first authorship