Identification of discriminant features from stationary pattern of nucleotide bases and their application to essential gene classification

Introduction: Essential genes are essential for the survival of various species. These genes are a family linked to critical cellular activities for species survival. These genes are coded for proteins that regulate central metabolism, gene translation, deoxyribonucleic acid replication, and fundamental cellular structure and facilitate intracellular and extracellular transport. Essential genes preserve crucial genomics information that may hold the key to a detailed knowledge of life and evolution. Essential gene studies have long been regarded as a vital topic in computational biology due to their relevance. An essential gene is composed of adenine, guanine, cytosine, and thymine and its various combinations. Methods: This paper presents a novel method of extracting information on the stationary patterns of nucleotides such as adenine, guanine, cytosine, and thymine in each gene. For this purpose, some co-occurrence matrices are derived that provide the statistical distribution of stationary patterns of nucleotides in the genes, which is helpful in establishing the relationship between the nucleotides. For extracting discriminant features from each co-occurrence matrix, energy, entropy, homogeneity, contrast, and dissimilarity features are computed, which are extracted from all co-occurrence matrices and then concatenated to form a feature vector representing each essential gene. Finally, supervised machine learning algorithms are applied for essential gene classification based on the extracted fixed-dimensional feature vectors. Results: For comparison, some existing state-of-the-art feature representation techniques such as Shannon entropy (SE), Hurst exponent (HE), fractal dimension (FD), and their combinations have been utilized. Discussion: An extensive experiment has been performed for classifying the essential genes of five species that show the robustness and effectiveness of the proposed methodology.


Introduction
Essential genes are necessary for the survival of a living being and are considered the basis of life. They carry vital genomic information and, hence, could be the key to a broad understanding of life and evolution (Juhas et al., 2011). They determine significant attributes involving cellular structure, chemistry, and reproduction, among others. Genomes encode the functions commonly found in all life forms, although the instructions can be species-specific. Some genes appear essential for survival, whereas others seem to be optional. The notion of essential genes is used to segregate genes and to determine the fundamental components that sustain cellular life. Deletion of an essential gene results in cell death. As a result, essential gene prediction aids in identifying the bare minimum of genes necessary for the survival of specific cell types. The discovery and analysis of essential genes also aids our understanding of the origin of life (Koonin, 2000). Furthermore, essential genes play a crucial role in synthetic molecular biology and are vital to genome development. An extensive comprehension of essential genes can empower researchers to clarify the biological essence of microorganisms (Juhas et al., 2014), generate the smallest genome subset (Itaya, 1995), identify promising medication targets, and create probable drugs to fight infectious diseases (Dickerson et al., 2011). Due to this significance, the identification of essential genes is viewed as an important problem in bioinformatics and genomics.
Essential genes are the set of genes an organism needs to thrive in a given environment. Most of these are necessary only under particular circumstances. For instance, if a cell is supplied with the amino acid lysine, the gene responsible for lysine production is nonessential. However, if the amino acid supply is unavailable, the gene encoding the enzyme responsible for lysine biosynthesis becomes essential, as protein synthesis is not possible without it. Essential genes regulate fundamental cellular activities in almost every species (Qin, 2019; Guo et al., 2021). Genes are considered essential if they cannot be knocked out individually under conditions in which most of the needed nutrients are present in the growth medium and the organism grows at its optimal temperature. One of the major issues is determining which of the identified genes are essential. There are various experimental techniques to identify essential genes in microorganisms, such as gene knockouts (Roemer et al., 2003), RNA interference (Cullen and Arndt, 2005), transposon mutagenesis (Veeranagouda et al., 2014), and single-gene knockout procedures (Giaever et al., 2002). Although these experimental techniques have various benefits and are generally reliable, they are expensive and laborious. So, there is a need for computational methods to identify essential genes.
Because essential genes have biological significance, several computational methods, particularly machine learning methods, have been employed to identify them. For this objective, many feature extraction and model building approaches have been developed (Gil et al., 2004; McCutcheon and Moran, 2010; Juhas et al., 2012; Mobegi et al., 2017). Chen and Xu (2005) effectively used high-throughput data and machine learning techniques in Saccharomyces cerevisiae to evaluate protein dispensability. Seringhaus et al. (2006) constructed a machine learning model to predict essential genes in S. cerevisiae using several intrinsic genomic factors. Additionally, Yuan et al. (2012) designed three machine learning techniques based on informative genomic characteristics to detect knockdown lethality in mice. Deng (2015) proposed an essential gene classification algorithm using hybrid characteristics such as intrinsic and context-dependent genome aspects. This model achieved area under the receiver operating characteristic curve (AUC) scores of 0.86-0.93 when tested on the same organism and scores of 0.69-0.89 when predicting across organisms using ten-fold cross-validation. Zhang et al. (2020) combined sequence- and network-based features to identify essential genes, using a deep learning-based model to learn the characteristics generated from sequencing data and protein-protein interaction networks. Liu et al. (2017) published the findings of comprehensive research on 31 bacterial species, including cross-validation, paired, self-test, and leave-one-species-out experiments. Rout et al. (2020) proposed a method to identify essential genes of four species based on various quantitative methods, including purine and pyrimidine distribution. Le et al. (2020) proposed a model for identifying essential genes using an ensemble deep neural network. Xu et al. (2020) developed a method to predict essential genes in prokaryotes based on sequence-based features using an artificial neural network. A web server, Human Essential Genes Interactive Analysis Platform (HEGIAP), was developed by Chen et al. (2020) for detailed analysis of human essential genes.
An expression-based predictor was developed by Kuang et al. (2021) to recognize the essential genes in humans. The predictor utilized gene expression profiles to predict lncRNAs in cancer cells. Senthamizhan et al. (2021) created a database, NetGenes, for essential genes, which contains predictions for 2,711 bacterial species using network-based features. The protein-protein interaction network was used to extract features from the STRING database. Marques de Castro et al. (2022) predicted the essential genes in Tribolium castaneum and Drosophila melanogaster based on physicochemical and statistical data along with subcellular locations. They extracted extrinsic and intrinsic attributes from the essential and nonessential data. This paper analyzes the DNA sequences of five species, i.e., Homo sapiens, Danio rerio, D. melanogaster, Mus musculus, and Arabidopsis thaliana, to identify essential genes. The proposed model extracts co-occurrence matrices from the essential gene sequences to find informative patterns that distinguish the species. This paper also examines the impact of different co-occurrence matrices and of existing features, such as Hurst exponent (HE), fractal dimension (FD), Shannon entropy (SE), and modified Shannon entropy (MSE).
The rest of the paper is structured in the following manner. The definitions of various fundamental parameters are given in Section 2, with relevant descriptions. The proposed methodology with detailed dataset description is discussed in Section 3. The efficiency of our strategy is proven by experimental findings and comments in Section 4, which summarizes the paper by highlighting the most important aspects of the whole investigation. Finally, the paper is concluded in Section 5.

Basic terminology
Essential genes are a family linked to critical cellular activities for survival of species. Identifying essential genes is a multidisciplinary process that necessitates both computational and wet-lab validation experiments. Several machine learning methods have been developed to improve classification accuracy, making it a time-consuming and resource-intensive process. Hence, with lower validation costs, most of these methods use supervised methods, which necessitate massive labeled training data sets, typically impractical for less-sequenced species. On the other hand, the rise of high-throughput wet-lab experimental approaches like next-generation sequencing has resulted in an oversupply of unlabeled essential gene sequence data.
Frontiers in Genetics frontiersin.org
In the initial study, it has been observed that a fixed-dimensional feature vector represents every DNA sequence by using various quantitative measures, such as SE, MSE, FD, and HE. To estimate these quantitative measures, we convert gene sequences into binary sequences based on pyrimidine and purine distribution. The two main forms of nucleotide bases in DNA are made up of nitrogenous bases. Adenine (A) and guanine (G) are purines, whereas cytosine (C) and thymine (T) are pyrimidines. Here, purine and pyrimidine bases are expressed as 1 and 0, respectively.
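The purine/pyrimidine encoding described above can be illustrated with a minimal sketch. The function name `to_binary` is a hypothetical helper introduced here for illustration, and the handling of unexpected symbols (raising an error) is an assumption, since the text does not say how ambiguous bases are treated.

```python
PURINES = {"A", "G"}       # encoded as 1
PYRIMIDINES = {"C", "T"}   # encoded as 0


def to_binary(seq):
    """Map each base of a DNA string to 1 (purine) or 0 (pyrimidine)."""
    bits = []
    for base in seq.upper():
        if base in PURINES:
            bits.append(1)
        elif base in PYRIMIDINES:
            bits.append(0)
        else:
            # Assumption: symbols other than A, C, G, T are rejected.
            raise ValueError(f"unexpected base: {base!r}")
    return bits


print(to_binary("ATGCCGTA"))  # [1, 0, 1, 0, 0, 1, 0, 1]
```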

Shannon entropy and modified Shannon entropy
SE may be used to determine how much uncertainty or information a sequence contains (Zurek, 1989; Khandelwal et al., 2022b). The uncertainty affects the distribution of each word. A sequence's uncertainty concerning a base pair ranges from 0 to 2n, where n is the length of a word. The SE uses the probability p of the two possibilities (0/1) to calculate information entropy. The following equation gives the SE of a binary sequence:

$$SE = -\sum_{i=1}^{2} p_i \log_2 p_i,$$

where $p_i$ indicates the probability of each of the two values in the binary sequence, and SE is used to compute the uncertainty in a binary string (Khandelwal et al., 2022a). When the probability p = 0, the event is assured never to happen, resulting in no uncertainty and an entropy of 0. Similarly, if p = 1, the result is definite; hence, the entropy must be 0. When p = 1/2, the uncertainty is highest, and the SE is 1. The MSE for different word sizes is defined in terms of $w_j$, the frequency of the $j$th word in the gene sequence. For instance, for a word of length 1, $w_j$ is determined using the frequencies of purine or pyrimidine 0, 1, and for a word of length 2, $w_j$ is determined using the two-time repeat of purine or pyrimidine 00, 10, 01, and 11. The number of words determined by taking the maximum length of both purines and pyrimidines is represented by k (Rout et al., 2020).
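As a worked illustration of the SE of a binary sequence, the following minimal sketch estimates the two probabilities from symbol counts and evaluates the standard base-2 entropy. The function name `shannon_entropy` is hypothetical, and the MSE is deliberately not implemented because its exact expression is not given in the text.

```python
import math
from collections import Counter


def shannon_entropy(bits):
    """Base-2 Shannon entropy of a sequence of 0/1 symbols."""
    n = len(bits)
    if n == 0:
        raise ValueError("empty sequence")
    counts = Counter(bits)
    # Symbols that never occur contribute nothing to the sum.
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


print(shannon_entropy([0, 1, 0, 1]))  # 1.0 (p = 1/2, maximum uncertainty)
print(shannon_entropy([1, 1, 1, 1]))  # 0.0 (outcome is certain)
```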

Hurst exponent
The HE evaluates a data set's smoothness and degree of similarities. The HE is often used to analyze auto-correlation in time-series analysis. It is calculated using rescaled range analysis (R/S analysis) and has a value of 0-1 (Hurst, 1951; Khandelwal et al., 2022c). A negative auto-correlation of a time series is indicated by a HE value between 0 and 0.5, while a HE value between 0.5 and 1 indicates a positive auto-correlation. If the HE value is 0.5, the series is random, meaning that there is no relation between the variable and its previous values (Hassan et al., 2021; Rout et al., 2022). The HE of a binary sequence $D_n$ is computed by the following equation:

$$\left(\frac{n}{2}\right)^{HE} = \frac{R(n)}{S(n)},$$

where

$$S(n) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(D_i - m\right)^2}$$

and

$$R(n) = \max_{1 \le i \le n}\sum_{j=1}^{i}\left(D_j - m\right) - \min_{1 \le i \le n}\sum_{j=1}^{i}\left(D_j - m\right), \qquad m = \frac{1}{n}\sum_{i=1}^{n} D_i.$$
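A rough sketch of a rescaled range estimate for a binary sequence is given below. It assumes the single-window relation $(n/2)^{HE} = R(n)/S(n)$, with $S(n)$ the standard deviation of the sequence and $R(n)$ the range of the cumulative deviations from the mean; this is one common form of R/S analysis and may differ in detail from the estimator used in this work. The function name `hurst_exponent` is hypothetical.

```python
import math


def hurst_exponent(bits):
    """Single-window R/S estimate, assuming (n/2)**HE = R(n)/S(n)."""
    n = len(bits)
    if n < 3:
        raise ValueError("need at least 3 samples")
    mean = sum(bits) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in bits) / n)
    if s == 0:
        raise ValueError("constant sequence: R/S is undefined")
    # Cumulative deviations from the mean.
    running, partial = 0.0, []
    for x in bits:
        running += x - mean
        partial.append(running)
    r = max(partial) - min(partial)
    return math.log(r / s) / math.log(n / 2)


print(hurst_exponent([1, 0, 1, 0, 1, 0, 1, 0]))  # 0.0 (strictly alternating)
print(hurst_exponent([1, 1, 1, 1, 0, 0, 0, 0]))  # 1.0 (long runs)
```

The two toy inputs sit at the extremes of the 0-1 range: an alternating sequence behaves as strongly anti-correlated, while a sequence made of two long runs behaves as strongly positively correlated.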

Fractal dimension
Every DNA sequence is converted into indicator matrices (Rout et al., 2018; Umer et al., 2021). Let X = {A, T, C, G} denote the finite alphabet of nucleotides, and D(N) denote a DNA sequence of length N with four symbols from X. The indicator function for every DNA sequence is described by the following equation:

$$f(x_i, x_j) = \begin{cases} 1, & \text{if } x_i = x_j \\ 0, & \text{if } x_i \neq x_j \end{cases} \qquad x_i, x_j \in D(N),$$

such that the indicator matrix will be

$$I(N, N) = f(x_i, x_j), \qquad i, j = 1, \ldots, N.$$

Here, I(N, N) is a matrix with values 0 and 1, and it produces a binary image of the DNA sequence as a 2D dot-plot. Within the same sequence, the binary image can represent the distribution of 0s and 1s. It is possible to assign a white dot to 0 and a black dot to 1. The FD is computed from the indicator matrix using $\sigma(n)$, the average number of 1s in randomly selected $n \times n$ sub-matrices of the $N \times N$ indicator matrix (Cattani, 2010; Rout et al., 2014; Upadhayay et al., 2019).

Proposed scheme
In this paper, we used the Database of Essential Genes (http://www.essentialgene.org/) for experimental findings and discussion. This dataset consists of essential genes of five species. Table 1 lists some of the terminologies employed in the proposed technique for reference.

Proposed feature representation technique
The DNA (deoxyribonucleic acid) sequence of an essential gene S is composed of four bases: adenine (A), guanine (G), cytosine (C), and thymine (T). So, several occurrences of combinations of A, C, T, G may exist within the sequence S. The co-occurrences of A, C, T, G in the DNA sequence establish the relationship between the nucleotides. This is the first time that a method has been proposed for finding the co-occurrences of nucleotides A, C, T, G within S. The objective of finding these co-occurrences is to analyze the patterns of A, C, T, G within the DNA sequence S and to derive useful features that discriminate the species through the feature representation of their essential genes. Assuming x = (A, C, T, G) is a vector of the nucleotides, the possible arrangements of these characters in the DNA gene sequences are represented through co-occurrence matrices formed by the vector combinations, which are shown in Table 2.
Here, the computed co-occurrence matrices of different combinations of nucleobases represent the distribution of nucleobases throughout the essential gene S. This distribution captures the texture pattern and the spatial relationship of nucleobases in the essential gene S. Experimentally, it has been observed that the occurrences of the spatial relationship of nucleobases cannot separately provide information on the stationary and non-stationary patterns of A, C, T, and G. However, the obtained spatial relationship contains the information of both these patterns at the same time. Hence, statistically it is easier to compute information considering both stationary and non-stationary patterns together rather than differentiating them in S. Essential genes are critical for the survival of any organism and are beneficial for cell growth. Each gene sequence is variable in length, and the arrangement of the A, C, T, G nucleobases is zigzag. Hence, finding the stationary and non-stationary patterns of A, C, T, G and the co-occurrences of the different combinations of these nucleobases will help find their natural pattern in the gene. Deriving these patterns through co-occurrence matrix descriptors should therefore improve the retrieval performance and allow the statistical and structural information in those patterns to be analyzed effectively. Inspired by the co-occurrence matrix used in texture analysis (Umer et al., 2016) in image processing and pattern recognition, we have employed the idea of the gray-level co-occurrence matrix. Here, we have computed several co-occurrence matrices from each essential gene. The matrices I, J, K, L, M, N, O, and P are shown in Table 3, Supplementary Table S1, Supplementary Table S2, Supplementary Table S3, Supplementary Table S4, Supplementary Table S5, Supplementary Table S6, and Supplementary Table S7, respectively.
Here, from the given DNA sequence S, the aforementioned co-occurrence matrices are obtained. Each co-occurrence matrix G contains the number of occurrences of A, C, T, G nucleobases with a specific combination and offset in S. A sequence S with q different combinations of A, C, T, G nucleobases will produce a co-occurrence matrix of size q × 4 for the given offset, so the (r,s)th value of a co-occurrence matrix (Table 3 and Supplementary Tables S1-S7) gives the number of times that the rth and sth nucleobases are present together in S. Hence, mathematically, each co-occurrence matrix (Table 3 and Supplementary Tables S1-S7) is given by

$$G_{\Delta i, \Delta j}(r, s) = \sum_{i}\sum_{j} \begin{cases} 1, & \text{if } S(i, j) = r \text{ and } S(i + \Delta i, j + \Delta j) = s \\ 0, & \text{otherwise.} \end{cases}$$

The offset $(\Delta i, \Delta j)$ defines the spatial relation for which the matrix G is calculated. The number of co-occurrences of the combinations of A, C, T, G present in S is obtained from the co-occurrence matrices. So, to extract distinctive and discriminant features, each matrix G is normalized to

$$G' = \frac{G}{\sum_{r=0}^{q}\sum_{s=0}^{q} G(r, s)}.$$

Then, the normalized co-occurrence matrix G′ is used to compute features such as entropy, dissimilarity, energy, homogeneity, and contrast. The mathematical definitions of these features are shown in Table 4.
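To make the counting step concrete, the following minimal sketch builds a single 4 × 4 co-occurrence matrix over individual nucleotides for a one-dimensional offset and then normalizes it. This is a simplification: the combinations in Table 2 and the eight matrices I to P are not reproduced here, and the names `cooccurrence_matrix`, `normalize`, and `ALPHABET` are hypothetical.

```python
ALPHABET = "ACTG"  # assumed row/column order for this illustration


def cooccurrence_matrix(seq, offset=1):
    """Count how often base r is followed, `offset` positions later, by base s."""
    if offset < 1:
        raise ValueError("offset must be a positive integer")
    index = {base: i for i, base in enumerate(ALPHABET)}
    g = [[0] * len(ALPHABET) for _ in ALPHABET]
    seq = seq.upper()
    for i in range(len(seq) - offset):
        r, s = seq[i], seq[i + offset]
        if r in index and s in index:  # assumption: other symbols are skipped
            g[index[r]][index[s]] += 1
    return g


def normalize(g):
    """Divide every entry by the total count so that the entries sum to 1."""
    total = sum(sum(row) for row in g)
    if total == 0:
        raise ValueError("empty co-occurrence matrix")
    return [[v / total for v in row] for row in g]


g = cooccurrence_matrix("AACGT")  # pairs: AA, AC, CG, GT
print(g)  # [[1, 1, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0], [0, 0, 1, 0]]
```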
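Now, the features defined in Table 4 are extracted from each co-occurrence matrix (Table 3 and Supplementary Tables S1-S7), and the list of feature vectors extracted from these matrices is obtained as follows:

$f_I = (f_1, f_2, f_3, f_4, f_5)$ from I (Table 3)
$f_J = (f_6, f_7, f_8, f_9, f_{10})$ from J (Supplementary Table S1)
$f_K = (f_{11}, f_{12}, f_{13}, f_{14}, f_{15})$ from K (Supplementary Table S2)
$f_L = (f_{16}, f_{17}, f_{18}, f_{19}, f_{20})$ from L (Supplementary Table S3)
$f_M = (f_{21}, f_{22}, f_{23}, f_{24}, f_{25})$ from M (Supplementary Table S4)
$f_N = (f_{26}, f_{27}, f_{28}, f_{29}, f_{30})$ from N (Supplementary Table S5)
$f_O = (f_{31}, f_{32}, f_{33}, f_{34}, f_{35})$ from O (Supplementary Table S6)
$f_P = (f_{36}, f_{37}, f_{38}, f_{39}, f_{40})$ from P (Supplementary Table S7)

Hence, the final feature representation of a DNA sequence or essential gene S is given by the forty-dimensional feature vector $f = (f_I, f_J, f_K, f_L, f_M, f_N, f_O, f_P)$.

FIGURE 1. Framework of the proposed model for the classification of essential genes. Here, CoM indicates the co-occurrence matrices.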
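The five descriptors computed from each normalized co-occurrence matrix can be sketched as follows. Table 4 is not reproduced in this text, so the formulas below are the definitions commonly used for gray-level co-occurrence matrices and are an assumption about the exact forms used here (for instance, some libraries define energy as the square root of the sum of squares). The function name `texture_features` is hypothetical.

```python
import math


def texture_features(p):
    """Return (energy, entropy, homogeneity, contrast, dissimilarity)
    for a normalized co-occurrence matrix p given as a list of rows."""
    energy = entropy = homogeneity = contrast = dissimilarity = 0.0
    for r, row in enumerate(p):
        for s, v in enumerate(row):
            energy += v * v                       # sum of squared entries
            if v > 0:
                entropy -= v * math.log2(v)       # zero entries contribute 0
            homogeneity += v / (1 + (r - s) ** 2)
            contrast += v * (r - s) ** 2
            dissimilarity += v * abs(r - s)
    return (energy, entropy, homogeneity, contrast, dissimilarity)


# All mass on the diagonal: maximal homogeneity, no contrast.
print(texture_features([[0.5, 0.0], [0.0, 0.5]]))  # (0.5, 1.0, 1.0, 0.0, 0.0)
# All mass off the diagonal: lower homogeneity, higher contrast.
print(texture_features([[0.0, 0.5], [0.5, 0.0]]))  # (0.5, 1.0, 0.5, 1.0, 1.0)
```

Applying such a function to each of the eight matrices and concatenating the resulting five-value tuples would give a forty-value vector of the kind described above.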

Classification
In this study, for the classification of the essential genes in the employed species, the decision tree (DT), k-nearest neighbor (KNN), and support vector machine (SVM) classifiers are used. During experimentation, the datasets of each species Arabidopsis thaliana (AT), Drosophila melanogaster (DOM), Danio rerio (DR), Homo sapiens (HS), and Mus musculus (MM) are divided into two, with 50% of its data input into the training set and the remaining 50% into the testing set. Then, a five-fold cross-validation technique is employed. Finally, the average performance for the testing data is reported for the proposed system.
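The evaluation protocol can be sketched with index bookkeeping alone. The text does not state whether the five folds are drawn from the training half or from the whole dataset, so the sketch below assumes the former; the shuffling seed and the names `split_half` and `k_folds` are likewise hypothetical.

```python
import random


def split_half(n, seed=0):
    """Shuffle the sample indices 0..n-1 and split them 50/50."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    half = n // 2
    return idx[:half], idx[half:]


def k_folds(indices, k=5):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, validation


train_idx, test_idx = split_half(20)
for fold_train, fold_val in k_folds(train_idx, k=5):
    # Assumption: a classifier would be fitted on fold_train and checked on fold_val.
    print(len(fold_train), len(fold_val))  # 8 2 on every fold
```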
DT is a supervised algorithm, and it is generated by using the Iterative Dichotomiser 3 (ID3) algorithm or the Classification and Regression Tree (CART) algorithm (Quinlan, 1986). The DT uses decision nodes to split the dataset into smaller subsets based on information gain (IG) or the Gini index. ID3 uses IG to evaluate how well an attribute splits the training dataset based on its classification objective. IG is the difference between the dataset's entropy before and after splitting on the specified attribute values. Let $X = \{x_1, x_2, x_3, \ldots, x_n\}$ represent the set of instances, A represent the attribute, and $X_v$ the subset of X having A = v. Then, IG is given by

$$IG(X, A) = ENT(X) - \sum_{v \in V(A)} \frac{|X_v|}{|X|}\, ENT(X_v),$$

where ENT(X) is the entropy of X and V(A) is the collection of all possible A values. The entropy of X is given by

$$ENT(X) = -\sum_{i} p_i \log_2 p_i,$$

where $p_i$ denotes the probability of the $i$th class in the current state X.

KNN is a supervised, non-parametric machine learning technique, which signifies that it makes no assumptions about the underlying data. The KNN method measures the similarity between the unseen data and the available data using a distance measure, i.e., Euclidean distance, and classifies the unseen data based on that similarity (Peterson, 2009). The steps of the KNN algorithm are as follows.
1. First, select the value of K, i.e., the number of closest data points to consider. Any positive integer may be used as K.
2. Do the following for each data point in the test data set:
   (i) find the distance between the data point and all samples in the training dataset using one of the following methods: Manhattan, Euclidean, or Hamming distance. In this paper, the Euclidean distance measure is used for calculating the distance;
   (ii) sort the samples in ascending order of the distance value;
   (iii) select the top K samples as the nearest neighbors of the test data point;
   (iv) assign the test data point to the most common class among these K samples.
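The steps above can be sketched in a few lines of standard-library Python. This is an illustrative implementation with Euclidean distance and majority voting, not the scikit-learn classifier used in the experiments; the function name `knn_predict` and the toy feature vectors are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples."""
    # Step (i): Euclidean distance from the query to every training sample.
    scored = [(math.dist(x, query), y) for x, y in zip(train_x, train_y)]
    # Step (ii): sort in ascending order of distance.
    scored.sort(key=lambda pair: pair[0])
    # Step (iii): keep the labels of the k nearest samples.
    nearest = [label for _, label in scored[:k]]
    # Step (iv): return the most common label among them.
    return Counter(nearest).most_common(1)[0][0]


train_x = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
train_y = ["AT", "AT", "HS", "HS"]
print(knn_predict(train_x, train_y, (0.2, 0.1), k=3))  # AT
print(knn_predict(train_x, train_y, (5.5, 5.0), k=3))  # HS
```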
The SVM is a supervised machine learning approach for classifying data. The SVM is a well-known technique used in various bioinformatics and computational biology problems, and it needs fewer model parameters to describe the non-linear transition from primary sequence to protein structure region. To minimize the error, the SVM adjusts the hyperplane iteratively. The SVM is noted for its quick training, which is necessary for high-throughput database testing (Suthaharan, 2016). Let the dataset be represented by $(X_1, y_1), (X_2, y_2), (X_3, y_3), \ldots, (X_n, y_n)$. The SVM solves the following optimization problem:

$$\min_{w, b}\ \frac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i\,(w \cdot X_i + b) \ge 1, \quad i = 1, \ldots, n,$$

where w and b are the weight and bias of the hyperplane equation $w \cdot X + b = 0$, respectively.
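A toy sketch of how a separating hyperplane $w \cdot X + b = 0$ can be fitted iteratively is shown below. It is a binary, linear, soft-margin variant trained by full-batch subgradient descent on the hinge loss, which is an assumption made for illustration: it is not the solver behind the scikit-learn SVM used in the experiments, and the five-species problem would additionally need a multi-class scheme. The names `train_linear_svm` and `predict` and all hyperparameter values are hypothetical.

```python
def train_linear_svm(xs, ys, c=1.0, lr=0.01, epochs=200):
    """Minimize 0.5*||w||^2 + c * sum(max(0, 1 - y*(w.x + b))) by subgradient descent.

    Labels in `ys` must be +1 or -1.
    """
    dim = len(xs[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)  # gradient of the 0.5*||w||^2 term
        grad_b = 0.0
        for x, y in zip(xs, ys):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # this sample violates the margin
                for d in range(dim):
                    grad_w[d] -= c * y * x[d]
                grad_b -= c * y
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


xs = [(2.0, 2.0), (3.0, 3.0), (-2.0, -2.0), (-3.0, -3.0)]
ys = [1, 1, -1, -1]
w, b = train_linear_svm(xs, ys)
print([predict(w, b, x) for x in xs])  # [1, 1, -1, -1]
```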

Evaluation metrics
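In this paper, the essential gene classification problem is a multi-class classification problem, as we have classified essential genes of five species, i.e., AT, DOM, DR, HS, and MM. For every class in the target, the evaluation metrics (accuracy, precision, recall, and F1-score) were computed. Then, the weighted averaging technique was used to give the final value of each evaluation metric $M \in \{\text{Accuracy}, \text{Precision}, \text{Recall}, \text{F1}\}$:

$$M = \frac{\sum_{i=1}^{C} n_i\, M_i}{\sum_{i=1}^{C} n_i},$$

where

$$\text{Accuracy}_i = \frac{TP_i + TN_i}{TP_i + TN_i + FP_i + FN_i}, \qquad \text{Precision}_i = \frac{TP_i}{TP_i + FP_i}, \qquad \text{Recall}_i = \frac{TP_i}{TP_i + FN_i},$$

and

$$\text{F1}_i = \frac{2 \cdot \text{Precision}_i \cdot \text{Recall}_i}{\text{Precision}_i + \text{Recall}_i},$$

where $TP_i$, $TN_i$, $FP_i$, and $FN_i$ are the counts of true positives, true negatives, false positives, and false negatives, respectively, for the $i$th class. Here, C represents the number of classes in the problem, and $n_i$ indicates the number of samples in the $i$th class.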
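A small worked example of support-weighted averaging is sketched below, using the usual per-class definitions of precision, recall, and F1-score. The function name `weighted_scores` and the toy labels are hypothetical, and returning 0 for a class with no predicted or no true samples is an assumption.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Return support-weighted (precision, recall, f1) over all true classes."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = f1 = 0.0
    for cls, n_i in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        p_i = tp / (tp + fp) if tp + fp else 0.0
        r_i = tp / (tp + fn) if tp + fn else 0.0
        f_i = 2 * p_i * r_i / (p_i + r_i) if p_i + r_i else 0.0
        weight = n_i / total  # n_i / sum of n_i
        precision += weight * p_i
        recall += weight * r_i
        f1 += weight * f_i
    return precision, recall, f1


y_true = ["AT", "AT", "HS", "HS"]
y_pred = ["AT", "HS", "HS", "HS"]
print(weighted_scores(y_true, y_pred))  # approximately (0.833, 0.75, 0.733)
```

In this toy case the weighted recall (0.75) coincides with the overall fraction of correctly classified samples, which is a general property of support-weighted recall.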

Model framework
The proposed model classifies essential genes of five species based on co-occurrence matrices. The proposed model computes eight different co-occurrence matrices from the DNA sequences. From each co-occurrence matrix, five features, i.e., energy, entropy, homogeneity, contrast, and dissimilarity, are extracted. The existing features, such as HE, FD, SE, and MSE, are also computed and then combined with the proposed features for the classification of essential genes. A supervised machine learning algorithm, SVM, is used to evaluate the model. Figure 1 shows the framework of the proposed model.

Result and discussion
The proposed essential gene classification model can identify novel essential genes with high recall and precision while only requiring a small number of previously identified essential genes in some species. Such a method could be highly beneficial when investigating essential genes in newly sequenced genomes of other species with few known examples of essential genes. The proposed work has been implemented in the Python environment, and Python machine learning libraries have been employed for the data classification tasks. Python is an open-source, high-level, object-oriented programming language with extensive support for mathematical and statistical functions. The implementation of the proposed methodology was executed on the Kaggle platform, which serves data scientists and machine learning engineers.
Here, for Python tools, we have employed NumPy, Pandas, Matplotlib, Sklearn.Preprocessing, Sklearn.Classifiers, Sklearn.Metrics, and some other packages for data analysis and prediction models. The feature vectors extracted from each DNA gene sequence S are passed to the KNN, DT, and SVM classifiers. The datasets from AT, DOM, DR, HS, and MM species are given in Table 5. The experimentation of the proposed methodology has been divided into the following sub-sections.

Experiment for the proposed features
In this section, experiments with individual features have been performed. Here, from each DNA sequence S, each individual feature from $f_I$, $f_J$, $f_K$, $f_L$, $f_M$, $f_N$, $f_O$, and $f_P$ has been considered, and then classification has been performed. Figure 2 demonstrates the distribution of F1-score performance obtained by the DT, KNN, and SVM classifiers with respect to each of the 40 features computed from the co-occurrence matrices of DNA sequence S. From this figure, it has been observed that both the KNN and SVM classifiers perform better than the DT classifier for most of the features. Moreover, it has also been observed that the classifiers obtain more or less similar performance for most features but better performance for the 19th, 26th, 27th, 30th, 32nd, and 35th features of the forty-dimensional feature vector f. To measure the impact of the individual feature types, i.e., entropy, homogeneity, energy, contrast, and dissimilarity, on the classification of essential genes, the performance of the KNN, DT, and SVM classifiers is reported in Table 6. Here, experiments are carried out under the same training-testing protocols, and from each DNA sequence S, the corresponding features are extracted from all co-occurrence matrices. So, an eight-dimensional feature vector is extracted for each of the entropy, homogeneity, energy, contrast, and dissimilarity features.
As shown in Table 6, the performance is more or less the same for every feature, but the KNN classifier performs better than DT and SVM. Here, the F1-score has been used as the measure of classification performance because the datasets of the employed species AT, DOM, DR, HS, and MM are class-imbalanced. Furthermore, the effect of the features computed from each co-occurrence matrix has been considered in the subsequent experiments. Here, a 5-dimensional feature vector is extracted from each co-occurrence matrix. The performance due to these feature vectors is reported in Table 7 under the same training-testing protocol. Table 7 shows that the features of the different co-occurrence matrices have a more or less similar effect on essential gene classification. Hence, the features computed from the co-occurrence matrices are helpful and effective. Here, the KNN classifier has better performance.

Experiment for the existing features
In a further experiment, the performance has been compared with some existing state-of-the-art feature extraction techniques such as SE, MSE, HE, and FD (discussed in Section 2), where these features are extracted accordingly. The performance is obtained with the KNN, DT, and SVM classifiers. The performance due to these features is reported in Table 8, implying that the SE, HE, MSE, and FD features have more or less similar performance. Still, among the classifiers, SVM has obtained better performance. The comparison of these performances with the proposed system is shown in Figure 3, which shows that the proposed approach better classifies the essential genes of AT, DOM, DR, HS, and MM species under the same training-testing protocol. The difference is that the proposed system uses a forty-dimensional feature vector, while each existing feature extraction technique yields a one-dimensional feature vector. Hence, this work investigates the discriminatory power of co-occurrence matrix features, which perform better than the existing state-of-the-art features.

Experiment for the combined features
The co-occurrence of nucleotides A, C, T, G in the essential gene captures the distribution of these nucleotides and also their relative position information within the gene S. The existing state-of-the-art feature extraction techniques (discussed in this work) are key measures in information theory. For example, SE and its modified technique compute the amount of uncertainty and randomness of nucleotides in the gene S. HE measures the relative tendency and characteristic parameters for analyzing its distribution in the essential gene. The FD computes the fractal-like distribution of nucleotides from the indicator matrix calculated from the essential gene S. So, the similarity of patterns of nucleotides computed by the co-occurrence matrices and the information on uncertainty, randomness, relative tendency, and fractal-like distribution in S are combined here to obtain more discriminant features for the classification of essential genes of AT, DOM, DR, HS, and MM species. Principal component analysis for dimensionality reduction, with the variation ratio, has been adopted to find the most suitable combination of these features. The performance due to the combination of these features is demonstrated in Table 9. Table 10 reports the discriminatory power of the combined features with respect to various dimensionally reduced features for the KNN, DT, and SVM classifiers and shows that the highest F1-score is 71.42, obtained by the SVM classifier (the bold value in the table indicates the highest F1-score). As this is a class imbalance problem, the F1-score performance has been reported. For better understanding and visibility, the final performance for the combination of features for the classification of essential genes of AT, DOM, DR, HS, and MM species is shown in Figure 4.

Conclusion
A novel method of feature extraction and analysis for the classification of essential genes of Arabidopsis thaliana (AT), Drosophila melanogaster (DOM), Danio rerio (DR), Homo sapiens (HS), and Mus musculus (MM) species has been considered in this work. The implementation of the proposed scheme is divided into three segments. In the first segment, novel co-occurrence matrix-based features are extracted from genes that derive the distribution of nucleotides and their relative position from the respective gene. The features from these measures belong to the statistical analysis of the distribution of stationary patterns of nucleotides in the essential genes. In the second segment, some existing state-of-the-art feature computation techniques such as SE, HE, and FD are used as information theory measures that compute uncertainty, randomness, relative tendency, and fractal-like structures in the gene. In the third segment of this work, the features from the proposed methodology and the existing techniques are individually carried out for classification tasks where their F1-score performance has been considered for comparison. These comparisons show the robustness and effectiveness of the proposed methodology. Finally, the features from the proposed scheme and the existing techniques are combined to compute more discriminatory features for classifying essential genes of AT, DOM, DR, HS, and MM species.

Data availability statement
Data used for this study is publicly available at http://www.essentialgene.org/.