METHODS article

Front. Genet., 11 February 2021

Sec. Computational Genomics

Volume 12 - 2021 | https://doi.org/10.3389/fgene.2021.569120

COVID-DeepPredictor: Recurrent Neural Network to Predict SARS-CoV-2 and Other Pathogenic Viruses

  • 1. Department of Computer Science and Engineering, National Institute of Technical Teachers' Training and Research, Kolkata, India

  • 2. Department of Computer Science and Information Technology, Institute of Technical Education and Research, Siksha ‘O’ Anusandhan (Deemed to Be University), Bhubaneswar, India

  • 3. Department of Electronics and Communication Engineering, MCKV Institute of Engineering, Howrah, India

  • 4. Cognizant Technology Solutions Pvt. Ltd., Kolkata, India

  • 5. Laboratory of Bioinformatics and Computational Genomics, Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland

  • 6. Laboratory of Functional and Structural Genomics, Centre of New Technologies, University of Warsaw, Warsaw, Poland

Abstract

COVID-19, the disease caused by the novel coronavirus SARS-CoV-2, has become a global pandemic. The high transmission rate of this pathogenic virus demands early prediction and proper identification for subsequent treatment. However, the polymorphic nature of the virus allows it to adapt and persist in different environments, which makes it difficult to predict. Moreover, other pathogens such as SARS-CoV-1, MERS-CoV, Ebola, Dengue, and Influenza also circulate, so a predictor that distinguishes them using their genomic information is highly desirable. To address this problem, this work proposes COVID-DeepPredictor, a deep learning framework for identifying an unknown sequence of these pathogens. COVID-DeepPredictor uses Long Short Term Memory (LSTM), a Recurrent Neural Network, for the underlying prediction with an alignment-free technique. The k-mer technique is applied to create Bag-of-Descriptors (BoDs), from which Bag-of-Unique-Descriptors (BoUDs) are generated as vocabulary, and an embedded representation is then prepared for the given virus sequences. The predictor is validated not only with 10-fold cross-validation but also on unseen test datasets of SARS-CoV-2 sequences and sequences from other viruses. To verify its efficacy, COVID-DeepPredictor is compared with other state-of-the-art prediction techniques based on Linear Discriminant Analysis, Random Forests, and Gradient Boosting Method. COVID-DeepPredictor achieves 100% prediction accuracy on the validation dataset, while on the test datasets the accuracy ranges from 99.51 to 99.94%, which is superior to the other prediction techniques. In addition, the accuracy and runtime of COVID-DeepPredictor are considered simultaneously to determine the value of k in k-mer, a comparative study of BoDs and BoUDs across k values is reported, and COVID-DeepPredictor is compared with Nucleotide BLAST. The code, training, and test datasets used for COVID-DeepPredictor are available at http://www.nitttrkol.ac.in/indrajit/projects/COVID-DeepPredictor/.

1. Introduction

The first case of COVID-19 surfaced in Wuhan, China in December 2019 (Huang et al., 2020; Meng et al., 2020; Yan L. et al., 2020). It quickly spread to 212 countries and territories worldwide (Worldometer, 2021), creating a pandemic in its wake. SARS-CoV-2 belongs to the same family as SARS-CoV and MERS-CoV (the coronavirus family) and mainly targets the respiratory system (Zhou et al., 2020). As of 8th January 2021, over 88.5 million cases of COVID-19 have been reported worldwide, with more than 1.906 million deaths and 63.6 million recoveries (Worldometer, 2021).

SARS-CoV-2 is an enveloped, positive-sense, single-stranded RNA virus with a genome of around 30 kilobases in length (Weiss and Navas-Martin, 2005; Su et al., 2016; Cui et al., 2019). RNA viruses generally have very high mutation rates (Jenkins et al., 2002; Woo et al., 2009). Genetic mutation can occur infrequently between viruses of the same species but of divergent lineages. The resulting mutated viruses may sometimes cause an outbreak of infection in humans, as in the case of SARS-CoV-2. Coronavirus infection results from zoonotic transmission to humans and presents with symptoms of pneumonia, fever, and breathing difficulties (Guan et al., 2003; Alagaili et al., 2014). Human-to-human transmission has also been confirmed for SARS-CoV-2 (Chan et al., 2020; Huang et al., 2020). Next-generation sequencing using metagenomic analysis has recently been used to identify the genetic features of SARS-CoV-2 (Zhou et al., 2020).

There have been several analyses of SARS-CoV-2. These include whole-genome analyses and viral protein-based comparisons, which have led to the conclusion that SARS-CoV-2 is most closely related to two bat SARS-like coronaviruses (Chan et al., 2020; Lu et al., 2020). Phylogenetic analysis of the full genome alignment and a similarity plot show that SARS-CoV-2 has high similarity with bat coronavirus RaTG13 (Paraskevis et al., 2020). Furthermore, another study (Wan et al., 2020) has shown that the spike protein receptor-binding domain (RBD) of SARS-CoV-2 binds to the host receptor angiotensin-converting enzyme 2 (ACE2), just like other Sarbecovirus strains, which makes the claim that SARS-CoV-2 originated from bats very likely (Letko et al., 2020; Liu and Wang, 2020).

Because the genomic structure of SARS-CoV-2 is similar to that of other viruses of the same family, and it produces similar symptoms, early prediction of SARS-CoV-2 is a very challenging task. Ozturk et al. (2020) used deep neural networks with X-ray images for automated detection of SARS-CoV-2 cases. Their method has a prediction accuracy of 98.08% for binary classes (COVID vs. No-Findings) and 87.02% for multiple classes (COVID vs. No-Findings vs. Pneumonia). In another work (Yan Q. et al., 2020), deep learning was used to predict age-related macular degeneration (AMD), a leading cause of blindness among the elderly population, with an average area under the curve (AUC) of 0.85. Koohi-Moghadam et al. (2019) used a deep learning approach to predict disease-associated mutations of metal-binding sites in proteins, reporting an AUC of 0.90 and an accuracy of 0.82. These encouraging results show that deep learning has the potential for highly accurate prediction, which led us to devise a deep learning predictor that uses genomic sequences of pathogenic viruses. In this work, a deep learning technique, viz. COVID-DeepPredictor, based on Long-Short Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997; Tang et al., 2019) is developed. Although LSTM has been widely used for text classification (Jin et al., 2019; Liu et al., 2019; Zhang et al., 2019), to the best of the authors' knowledge this is the first attempt to use LSTM for the prediction of SARS-CoV-2 from viral genomic sequences with an alignment-free approach. For this purpose, the k-mer technique is used to generate Bag-of-Descriptors (BoDs) and, from them, Bag-of-Unique-Descriptors (BoUDs) as vocabulary. An embedded representation is then prepared for the given virus sequences using BoDs and BoUDs.
It is worth mentioning that, although SARS-CoV-2 is a single-stranded RNA virus, the genomic information of a virus is captured in the form of a DNA sequence. These DNA sequences are used in this work to predict SARS-CoV-2 and other pathogenic viruses, viz. SARS-CoV-1, MERS-CoV, Ebola, Dengue, and Influenza. COVID-DeepPredictor achieves 100% prediction accuracy on the validation dataset, while on the test datasets the accuracy ranges from 99.51 to 99.94%. COVID-DeepPredictor also shows superior results over the existing prediction techniques based on Linear Discriminant Analysis, Random Forests, and Gradient Boosting Method. Moreover, apart from prediction accuracy, further analyses are performed: the choice of k in k-mer based on the accuracy and runtime of COVID-DeepPredictor considered simultaneously, a comparative study of BoDs and BoUDs for different values of k, and a comparison between an alignment-based technique, viz. Nucleotide Basic Local Alignment Search Tool (BLASTN), and COVID-DeepPredictor as an alignment-free technique.

2. Materials and Methods

This section describes the preparation of the datasets used in this work, gives a brief description of Long-Short Term Memory (LSTM), and discusses the proposed COVID-DeepPredictor in detail.

2.1. Data Preparation

The datasets of SARS-CoV-1, MERS-CoV, Ebola, Dengue, and Influenza were downloaded from NCBI (National Center for Biotechnology Information). The dataset for SARS-CoV-2 was downloaded from NCBI and GISAID (Global Initiative on Sharing All Influenza Data). The complete or near-complete genomic sequences of all the pathogenic viruses total 4,643 and are referred to as the Initial dataset. Additionally, 3,030 recent complete or near-complete SARS-CoV-2 sequences (January 2020 to August 2020) are taken from NCBI, whereas 2,410 (February 2020 to July 2020) and 4,000 (June 2020 to December 2020) sequences are taken from GISAID. For training, 1,500 samples are drawn randomly from the 4,643 sequences. To ensure that all six pathogenic viruses are represented and to avoid the class imbalance problem, 250 samples of each virus are included in the training dataset. For testing, five test datasets are created, named Testdata-1, Testdata-2, Testdata-3, Testdata-4, and Testdata-5. Testdata-1 consists of the remaining 3,143 of the 4,643 sequences, while Testdata-2 contains 200 sequences each for MERS-CoV, SARS-CoV-2, Ebola, Dengue, and Influenza and 90 sequences of SARS-CoV-1 from different sources. Testdata-3, Testdata-4, and Testdata-5 comprise recent SARS-CoV-2 sequences from NCBI (Testdata-3) and GISAID (Testdata-4 and Testdata-5), along with other pathogenic viruses. The statistics of the Initial dataset and of the training and testing datasets are given in Table 1. It is worth mentioning that more than 10,000 SARS-CoV-2 genomic sequences, collected from different sources between January 2020 and December 2020, have been used in this work to develop COVID-DeepPredictor.

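Table 1

| Dataset | Virus name | Number of sequences | Max. length of sequence | Avg. length of sequence | Source of sequence |
|---|---|---|---|---|---|
| Initial dataset | SARS-CoV-1 | 340 | 30,311 | 29,515 | NCBI-SARS-CoV-1 |
| Initial dataset | MERS-CoV | 291 | 30,150 | 29,983 | NCBI-MERS-CoV |
| Initial dataset | SARS-CoV-2 | 2,402 | 29,986 | 29,507 | GISAID-SARS-CoV-2 |
| Initial dataset | Ebola | 300 | 19,897 | 18,976 | NCBI-Ebola |
| Initial dataset | Dengue | 300 | 11,195 | 10,746 | NCBI-Dengue |
| Initial dataset | Influenza | 1,010 | 2,347 | 2,322 | NCBI-Influenza |
| Training dataset | SARS-CoV-1 | 250 | 29,765 | 29,520 | NCBI-SARS-CoV-1 |
| Training dataset | MERS-CoV | 250 | 30,123 | 29,999 | NCBI-MERS-CoV |
| Training dataset | SARS-CoV-2 | 250 | 29,927 | 29,334 | GISAID-SARS-CoV-2 |
| Training dataset | Ebola | 250 | 19,897 | 18,979 | NCBI-Ebola |
| Training dataset | Dengue | 250 | 11,195 | 10,748 | NCBI-Dengue |
| Training dataset | Influenza | 250 | 2,347 | 2,333 | NCBI-Influenza |
| Testdata-1 | SARS-CoV-1 | 90 | 30,311 | 29,494 | NCBI-SARS-CoV-1 |
| Testdata-1 | MERS-CoV | 41 | 30,150 | 29,887 | NCBI-MERS-CoV |
| Testdata-1 | SARS-CoV-2 | 2,152 | 29,986 | 29,527 | GISAID-SARS-CoV-2 |
| Testdata-1 | Ebola | 50 | 19,034 | 18,964 | NCBI-Ebola |
| Testdata-1 | Dengue | 50 | 10,764 | 10,737 | NCBI-Dengue |
| Testdata-1 | Influenza | 760 | 2,341 | 2,318 | NCBI-Influenza |
| Testdata-2 | SARS-CoV-1 | 90 | 30,311 | 29,494 | NCBI-SARS-CoV-1 |
| Testdata-2 | MERS-CoV | 200 | 30,423 | 29,066 | NCBI-MERS-CoV |
| Testdata-2 | SARS-CoV-2 | 200 | 29,855 | 29,850 | GISAID-SARS-CoV-2 |
| Testdata-2 | Ebola | 200 | 18,798 | 18,762 | NCBI-Ebola |
| Testdata-2 | Dengue | 200 | 10,731 | 10,692 | NCBI-Dengue |
| Testdata-2 | Influenza | 200 | 2,341 | 2,323 | NCBI-Influenza |
| Testdata-3 | SARS-CoV-1 | 90 | 30,311 | 29,494 | NCBI-SARS-CoV-1 |
| Testdata-3 | MERS-CoV | 220 | 30,423 | 29,162 | NCBI-MERS-CoV |
| Testdata-3 | SARS-CoV-2 | 3,030 | 29,903 | 29,780 | NCBI-SARS-CoV-2 |
| Testdata-3 | Ebola | 220 | 18,871 | 18,850 | NCBI-Ebola |
| Testdata-3 | Dengue | 220 | 10,690 | 10,677 | NCBI-Dengue |
| Testdata-3 | Influenza | 220 | 2,341 | 2,323 | NCBI-Influenza |
| Testdata-4 | SARS-CoV-1 | 90 | 30,311 | 29,494 | NCBI-SARS-CoV-1 |
| Testdata-4 | MERS-CoV | 250 | 30,423 | 29,277 | NCBI-MERS-CoV |
| Testdata-4 | SARS-CoV-2 | 2,410 | 30,423 | 29,726 | GISAID-SARS-CoV-2 |
| Testdata-4 | Ebola | 250 | 18,871 | 18,852 | NCBI-Ebola |
| Testdata-4 | Dengue | 250 | 10,757 | 10,538 | NCBI-Dengue |
| Testdata-4 | Influenza | 250 | 2,316 | 2,316 | NCBI-Influenza |
| Testdata-5 | SARS-CoV-1 | 90 | 30,311 | 29,494 | NCBI-SARS-CoV-1 |
| Testdata-5 | MERS-CoV | 250 | 30,423 | 29,277 | NCBI-MERS-CoV |
| Testdata-5 | SARS-CoV-2 | 4,000 | 29,903 | 29,798 | GISAID-SARS-CoV-2 |
| Testdata-5 | Ebola | 200 | 18,798 | 18,762 | NCBI-Ebola |
| Testdata-5 | Dengue | 220 | 10,690 | 10,677 | NCBI-Dengue |
| Testdata-5 | Influenza | 250 | 2,316 | 2,316 | NCBI-Influenza |

Description of initial, training, and test datasets.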

All the experiments are performed with the training and testing datasets listed in Table 1. To visualize all the virus sequences (SARS-CoV-1, MERS-CoV, SARS-CoV-2, Ebola, Dengue, and Influenza), t-distributed Stochastic Neighbor Embedding (tSNE) (Hinton and Roweis, 2003) is applied to the 4,643 sequences after generating the count vector (Khattak et al., 2019) using the k-mer technique (Manekar and Sathe, 2018; Solis-Reyes et al., 2018). The number of clusters, known a priori, is six. The resulting embedded representation of the virus sequences is shown in Figure 1A, and the distribution of the initial SARS-CoV-2 sequences across 56 countries is shown in Figure 1B. COVID-DeepPredictor is developed in MATLAB R2020a.
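The count vector that feeds the tSNE visualization can be illustrated with a minimal Python sketch. It is not the authors' MATLAB code: the function name `kmer_count_vector`, the sliding window with stride 1, and the skipping of windows that contain ambiguous bases are assumptions made for illustration, and the tSNE step itself is omitted.

```python
from itertools import product


def kmer_count_vector(sequence, k=3, alphabet="ACGT"):
    """Count overlapping k-mers of `sequence` in a fixed lexicographic order."""
    # All 4**k possible k-mers define the positions of the vector.
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0] * len(index)
    seq = sequence.upper()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        # Assumption: windows containing ambiguous bases (e.g., N) are skipped.
        if kmer in index:
            counts[index[kmer]] += 1
    return counts


vector = kmer_count_vector("ATGCGATGN", k=3)
print(len(vector), sum(vector))
```

Each sequence, whatever its length, maps to a vector of 4^k entries, which is what allows sequences of very different lengths (Influenza vs. coronaviruses) to be embedded in the same space.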

Figure 1

2.2. Long-Short Term Memory

Long-Short Term Memory (LSTM) is a type of recurrent neural network (a sub-branch of deep learning) that is capable of learning order dependence in sequence prediction problems. The main components of an LSTM network are a sequence input layer and an LSTM layer. The sequence input layer feeds text into the network, and the LSTM layer learns long-term associations between the steps of the sequence data. More precisely, an LSTM network receives a context vector from the previous time step and an input vector from the given data, and uses them to calculate the next context vector and the gate vectors that control the memory cell state vector (Kim et al., 2018). Given input data at time t and a context vector h, one hidden layer creates a raw cell vector and an input vector for each gate. At the input gate, the cell vector is multiplied by the input vector. This cell input is added to the previous cell vector weighted by the forget vector, and the resultant vector is then controlled by the output vector. The update of the cell is controlled by the control gate. LSTM is mainly trained using Back-propagation Through Time and mitigates the vanishing gradient problem that is common in RNNs. In LSTM, the memory cells and gates can retain information over time and discard old observations, thereby overcoming the vanishing gradient problem.

To sum up, LSTM consists of four gates, input gate (it), forget gate (ft), control gate (Ct), and output gate (ot). Considering a sentence S = x1, x2, ..., xK, where K is the length of a sentence, the equations for LSTM can be depicted as:
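A standard LSTM formulation (Hochreiter and Schmidhuber, 1997) consistent with this notation is sketched below. The bias terms b, the candidate cell state \tilde{C}_t, the concatenation [h_{t-1}, x_t], and the element-wise product \odot are notational assumptions and are not symbols defined in the text.

```latex
\begin{aligned}
i_t &= \mathrm{sigm}\left(W_i\,[h_{t-1}, x_t] + b_i\right)\\
f_t &= \mathrm{sigm}\left(W_f\,[h_{t-1}, x_t] + b_f\right)\\
\tilde{C}_t &= \tanh\left(W_C\,[h_{t-1}, x_t] + b_C\right)\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\\
o_t &= \mathrm{sigm}\left(W_o\,[h_{t-1}, x_t] + b_o\right)\\
h_t &= o_t \odot \tanh\left(C_t\right)
\end{aligned}
```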

Here, W are the weight matrices, ht−1 is the hidden state from the previous time step, which is updated through the output gate and is in turn used to compute the next output, and tanh and sigm represent the tanh and sigmoid activation functions, respectively.

2.3. COVID-DeepPredictor

The main objective of COVID-DeepPredictor is to correctly predict the virus classes from the given genomic sequences of the different pathogenic viruses using an alignment-free technique. To achieve this, each genomic sequence is first divided into descriptors, collectively called Bag-of-Descriptors (BoDs), using the popular k-mer technique. Here, descriptors are patterns of length k taken from the genomic sequences. Bag-of-Unique-Descriptors (BoUDs) are then created from the BoDs and serve as the vocabulary. Using BoDs and BoUDs, an embedded representation of size N × M is created, where N is the number of genomic sequences and M indexes the descriptors in the vocabulary. This embedded representation is then used to train COVID-DeepPredictor. Because the genomic sequences are divided into descriptors and represented as tokens, they behave like text, and the task reduces to a text classification problem. The pipeline of the proposed COVID-DeepPredictor is shown in Figure 2.
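The BoD, BoUD, and embedding steps can be sketched in Python as follows. This is a minimal illustration and not the authors' MATLAB implementation: the function names, the overlapping sliding window, the sorted vocabulary order, and the use of index 0 for padding and for descriptors absent from the vocabulary are all assumptions.

```python
def build_bods(sequences, k=3):
    """Split each sequence into overlapping k-mers (one bag of descriptors per sequence)."""
    return [[s[i:i + k] for i in range(len(s) - k + 1)] for s in sequences]


def build_boud(bods):
    """Collect the unique descriptors into a vocabulary; index 0 is reserved for padding."""
    vocabulary = sorted({d for bod in bods for d in bod})
    return {descriptor: i + 1 for i, descriptor in enumerate(vocabulary)}


def embed(bods, boud, length=None):
    """Map descriptors to vocabulary indices and pad every row to a common length M."""
    length = length or max(len(bod) for bod in bods)
    rows = []
    for bod in bods:
        # Assumption: descriptors missing from the vocabulary map to 0.
        row = [boud.get(d, 0) for d in bod][:length]
        rows.append(row + [0] * (length - len(row)))
    return rows


sequences = ["ATGCGA", "ATGC"]
bods = build_bods(sequences, k=3)
boud = build_boud(bods)
matrix = embed(bods, boud)
print(matrix)
```

The resulting N × M integer matrix is the kind of token-index input that a sequence input layer followed by an LSTM layer can consume, in the same way as tokenized text.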

Figure 2

3. Results

To validate COVID-DeepPredictor, experiments are conducted on genomic sequences of different pathogenic viruses. MATLAB R2020a is used on an Intel Core i5-8250U CPU @ 1.80 GHz machine with 8 GB RAM and the Windows 10 operating system. The parameters of the underlying predictor, the LSTM of COVID-DeepPredictor, have been set experimentally. The number of hidden units for the LSTM layer is set to 80. To use the LSTM layer for a sequence-to-label prediction problem, the output mode is set to “last.” Finally, a fully connected layer with the same size as the number of classes, a softmax layer, and a prediction layer are added. Mini-batch gradient descent is used to train the LSTM, with a mini-batch size of 16 and a gradient threshold of 2. COVID-DeepPredictor is compared with other predictors based on Linear Discriminant Analysis (LDA), Random Forests (RFs), and Gradient Boosting Method (GBM). For LDA, the discriminant type is pseudo-linear; for Random Forests, the number of trees is 50; and for GBM, the maximum tree depth is 10 and the maximum number of iterations is 100. All these parameters are set experimentally.

Each predictor has been evaluated using 10-fold cross-validation followed by further validation on unseen test datasets. The cross-validation partition uses random non-stratified sampling to prepare the training and validation datasets from a total of 1,500 samples. The training and validation datasets contain all the pathogenic virus classes: SARS-CoV-1, MERS-CoV, SARS-CoV-2, Ebola, Dengue, and Influenza. For each predictor, the descriptors of the virus sequences are created using the k-mer method. Thereafter, to train COVID-DeepPredictor and the other compared predictors, an embedded matrix of size N × M is created using BoDs and BoUDs.
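A random, non-stratified 10-fold partition of the 1,500 training samples can be sketched in Python as below. The helper name `kfold_indices` and the fixed seed are hypothetical; the original work performs this step in MATLAB, and the exact partitioning routine is not stated in the text.

```python
import random


def kfold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train, validation) index lists from a random, non-stratified split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every n_folds-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(kfold_indices(1500, n_folds=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Because the sampling ignores class labels, the class proportions in each fold may deviate slightly from the balanced 250-per-class training set.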

To determine the performance of COVID-DeepPredictor and the other predictors, the Confusion Matrix (Luque et al., 2019) is considered. In a confusion matrix, True Positives (TP) are the data that are correctly identified, and they are represented by the diagonal elements. The remaining predictions lead to an error ϵ. False Positives (FP) for a particular class are the sum of the values in the corresponding column, excluding the TP, and False Negatives (FN) for a class are the sum of the values in the corresponding row, excluding the TP. Lastly, True Negatives (TN) for a class are the sum of all entries outside that class's own row and column. To evaluate the results of COVID-DeepPredictor, the metrics Accuracy, Precision, Recall, and G-Mean, all of which can be derived from a confusion matrix, have been considered. They can be calculated as:

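Accuracy:

\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

Precision:

\[ \mathrm{Precision} = \frac{TP}{TP + FP} \]

Recall:

\[ \mathrm{Recall} = \frac{TP}{TP + FN} \]

G-mean:

\[ \text{G-mean} = \sqrt{\mathrm{Precision} \times \mathrm{Recall}} \]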
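The per-class counts described above, and the four metrics derived from them, can be computed from a confusion matrix as in the following Python sketch. Rows are taken as true classes and columns as predicted classes, matching the FP and FN definitions in the text. The macro-averaging over classes, the G-mean taken as the geometric mean of precision and recall, and the small example matrix are assumptions for illustration, not values or code from the paper.

```python
import math


def per_class_counts(cm, c):
    """Return TP, FP, FN, TN for class c; rows are true classes, columns are predictions."""
    total = sum(sum(row) for row in cm)
    tp = cm[c][c]
    fp = sum(cm[r][c] for r in range(len(cm))) - tp  # column sum minus TP
    fn = sum(cm[c]) - tp                             # row sum minus TP
    tn = total - tp - fp - fn
    return tp, fp, fn, tn


def macro_metrics(cm):
    """Average accuracy, precision, recall, and G-mean over all classes."""
    n = len(cm)
    acc = prec = rec = gmean = 0.0
    for c in range(n):
        tp, fp, fn, tn = per_class_counts(cm, c)
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        acc += (tp + tn) / (tp + tn + fp + fn)
        prec += p
        rec += r
        gmean += math.sqrt(p * r)
    return acc / n, prec / n, rec / n, gmean / n


# Hypothetical 3-class confusion matrix with a single misprediction.
example_cm = [[5, 1, 0], [0, 4, 0], [0, 0, 10]]
print(macro_metrics(example_cm))
```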

Existing state-of-the-art predictors based on Linear Discriminant Analysis (LDA), Random Forests (RFs), and Gradient Boosting Method (GBM) are used in this work for comparison. LDA is a very popular machine learning tool for prediction in which each dependent variable is expressed as a linear combination of other features. RFs are ensemble learning methods that build numerous decision trees during training and output the class that is the mode of the classes predicted by the individual trees. GBM is also an ensemble learning method; it produces a prediction model in the form of an ensemble of weak prediction models, usually decision trees.

Before conducting the experiments, the value of k in k-mer must be determined. To do this, experiments have been conducted on the five test datasets described in section 2. The results are shown in Figures 3A–E, where k is varied from 3 to 15 and both the accuracy and the running time of COVID-DeepPredictor are reported. The figures show that the highest accuracy is reached at k = 3 for all five test datasets. The same accuracy is found at some other k values as well (e.g., in Figure 3A, k = 9, 11, and 13 show the same accuracy), but the run time increases as k increases. This trend of increasing time with increasing k has also been shown in Solis-Reyes et al. (2018). For this reason, k = 3 is chosen, as the run time is lowest at this value. For the compared predictors based on LDA, RF, and GBM, the k values are similarly determined as 13, 4, and 4, respectively. In this work, 10-fold cross-validation is used. The average accuracy on the test datasets is shown in Figure 4A. Apart from accuracy, precision, recall, and G-mean have also been computed for the test datasets and are reported in Table 2. As can be seen from Figure 4A, the accuracy of COVID-DeepPredictor ranges from 99.51 to 99.94%. Thus, the experiments establish that COVID-DeepPredictor can detect SARS-CoV-2 with very high accuracy. The confusion matrices for Testdata-1 and Testdata-2 (k = 3) are shown as circos plots in Figures 4B,C, which show only one misprediction, where SARS-CoV-1 has been wrongly predicted as SARS-CoV-2. The confusion matrices for Testdata-3, Testdata-4, and Testdata-5 (k = 3) are shown in Supplementary Figure 2.

Figure 3

Figure 4

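Table 2

| Method | Dataset | k-mer | Average accuracy (%) | Average precision (%) | Average recall (%) | Average G-mean |
|---|---|---|---|---|---|---|
| COVID-DeepPredictor | Testdata-1 | 3 | 99.867 | 99.914 | 99.336 | 0.996 |
| LDA | Testdata-1 | 13 | 98.981 | 91.845 | 98.015 | 0.948 |
| RF | Testdata-1 | 4 | 98.409 | 97.577 | 90.024 | 0.937 |
| GBM | Testdata-1 | 4 | 98.524 | 97.611 | 90.121 | 0.937 |
| COVID-DeepPredictor | Testdata-2 | 3 | 99.513 | 99.527 | 99.423 | 0.994 |
| LDA | Testdata-2 | 13 | 98.807 | 98.814 | 98.925 | 0.988 |
| RF | Testdata-2 | 4 | 96.788 | 96.981 | 97.264 | 0.971 |
| GBM | Testdata-2 | 4 | 97.844 | 97.542 | 97.991 | 0.977 |
| COVID-DeepPredictor | Testdata-3 | 3 | 99.877 | 99.595 | 99.686 | 0.996 |
| LDA | Testdata-3 | 13 | 99.650 | 98.981 | 99.162 | 0.989 |
| RF | Testdata-3 | 4 | 99.250 | 97.727 | 98.440 | 0.981 |
| GBM | Testdata-3 | 4 | 99.265 | 97.728 | 98.891 | 0.983 |
| COVID-DeepPredictor | Testdata-4 | 3 | 99.860 | 99.637 | 99.682 | 0.996 |
| LDA | Testdata-4 | 13 | 98.885 | 97.281 | 97.648 | 0.974 |
| RF | Testdata-4 | 4 | 99.371 | 98.414 | 99.325 | 0.988 |
| GBM | Testdata-4 | 4 | 99.441 | 98.922 | 99.444 | 0.991 |
| COVID-DeepPredictor | Testdata-5 | 3 | 99.940 | 99.766 | 99.808 | 0.997 |
| LDA | Testdata-5 | 13 | 99.380 | 97.467 | 97.927 | 0.976 |
| RF | Testdata-5 | 4 | 99.580 | 98.519 | 99.371 | 0.989 |
| GBM | Testdata-5 | 4 | 99.590 | 98.956 | 99.763 | 0.993 |

Prediction performance of COVID-DeepPredictor and other compared methods on test datasets.

The COVID-DeepPredictor results (highlighted in bold in the original table) show superior performance compared with the other predictors on every test dataset.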

COVID-DeepPredictor is evaluated on a validation dataset as well. The accuracy, precision, recall, and G-mean values of the prediction for the validation dataset are 100%, 100%, 100%, and 1, respectively (k = 3). As 10-fold cross-validation is used, ten convergence plots of COVID-DeepPredictor are generated. One of them is given in Figure 4D, where the blue line indicates the training accuracy and the black line the validation accuracy. All the convergence plots are shown in Supplementary Figure 1. The Bag-of-Unique-Descriptors of the six virus classes, SARS-CoV-1, MERS-CoV, SARS-CoV-2, Ebola, Dengue, and Influenza, are shown in Figures 4E–J for k = 3.

4. Discussion

SARS-CoV-2 has caused a global pandemic, and since human-to-human transmission (Chan et al., 2020; Huang et al., 2020) is confirmed for SARS-CoV-2, its early prediction has become imperative. Viral outbreaks of this kind call for timely analysis of the genomic sequences to help predict the virus in its early stages. COVID-DeepPredictor can be used by pathogen laboratories to predict SARS-CoV-2 very quickly and, as the results indicate, with very high accuracy. It is worth mentioning that, for COVID-DeepPredictor to be effective, at least two virus classes must be present in the training input sequences.

COVID-DeepPredictor provides two functions: (a) COVIDdeepPredictor(), for training, testing, and saving an LSTM model, and (b) COVIDdeepPredictorLoad(), for loading a pre-trained LSTM model and testing it on an unseen test dataset. The main script COVIDmain.m loads both COVIDdeepPredictor() and COVIDdeepPredictorLoad(). Users who want to train their own model and also obtain results for a test dataset need to use only COVIDdeepPredictor() and disable COVIDdeepPredictorLoad(). Users who want to use a pre-trained model can disable COVIDdeepPredictor() and run only COVIDdeepPredictorLoad() to obtain the results for test datasets.

For the convenience of users, training and testing files are provided to acquaint them with the functionality of COVIDdeepPredictor(). Trainingdata.csv is the input file for training, and any one of the test files Testdata-1.csv, Testdata-2.csv, Testdata-3.csv, Testdata-4.csv, and Testdata-5.csv can be used for testing. The prediction results, consisting of the sequence ID, the predicted virus name, and the sequence, are stored in Results.csv.

In the case of COVIDdeepPredictorLoad(), only one of the test files needs to be provided to obtain the results in Results.csv. Users can also prepare new training and test datasets by following the structure of the provided training and testing files. This allows new COVID-DeepPredictor models to be trained for a different set of viruses or for similar tasks. The pre-trained model provided in the Supplementary Material uses k = 3 for k-mer. This value of k was chosen experimentally, as it requires less computation time and provides higher accuracy. Sample files for training and testing, pre-trained models for COVID-DeepPredictor, and the code of the software are available in the Supplementary Material for re-usability.

Setting an appropriate value of k in k-mer is very important for achieving the desired results in a text classification problem. As this work is based on the underlying concept of text classification, k-mer plays a very important role, and extensive experiments have been performed to determine the value of k. Figures 3A–E show that the run time of COVID-DeepPredictor rises with increasing k. Therefore, to choose the appropriate value of k, the run time of COVID-DeepPredictor needs to be taken into account in addition to the accuracy. For Testdata-1, the accuracy at k = 9, 11, and 13 is the same as at k = 3. Similarly, for Testdata-2, Testdata-3, Testdata-4, and Testdata-5, the same accuracies can be observed at k = 3, 11, 13; k = 3, 4, 5, 13; k = 3, 4; and k = 3, 13, respectively. Although the accuracies are the same at these k values, the run time increases, as can be seen from Figures 3A–E. Thus, the smallest k value has been chosen without compromising on accuracy. From Table 2 and Figure 4A, it is evident that with k = 3, COVID-DeepPredictor shows the best results among all the compared predictors.

Table 3 is reported to show the relation among k, the size of the BoDs, and the size of the BoUDs. The table shows that the sizes of both BoDs and BoUDs increase with k for each virus class. In the table, “All” represents all six virus classes taken together. For example, at k = 15 for the training dataset of all virus classes, the sizes of BoDs and BoUDs are 30,193,594 and 518,372, respectively, for 1,500 sequences, while for the same k on Testdata-1 the sizes of BoDs and BoUDs are 70,595,908 and 581,774, respectively, for 3,143 sequences. For k = 3, on the other hand, far fewer BoDs and BoUDs are generated. As expected, the BoD values for “All” are the sum of the BoDs of the individual virus classes. In contrast, the BoUD for “All” is less than the sum of the BoUDs of the six virus classes. This can be attributed to the relatedness between different virus classes. For example, SARS-CoV-1, MERS-CoV, and SARS-CoV-2 are more closely related and may therefore share unique descriptors, so their BoUDs overlap when all the virus classes are considered together. Apart from this, the BoDs and BoUDs for varying k also affect the accuracy and run time of COVID-DeepPredictor, which can be observed by combining Figure 3 and Table 3.
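The reason the combined vocabulary is smaller than the sum of the per-class vocabularies can be illustrated with sets: the combined BoUD is a union, so descriptors shared between classes are counted once. The Python sketch below uses two hypothetical toy sequences, not data from the paper, and assumes overlapping k-mers. It also checks the upper bound of 4^k distinct descriptors for a four-letter DNA alphabet.

```python
def unique_descriptors(sequences, k):
    """Return the set of distinct overlapping k-mers found in `sequences`."""
    return {s[i:i + k] for s in sequences for i in range(len(s) - k + 1)}


# Hypothetical toy sequences standing in for two related virus classes.
class_a = ["ATGCGATT"]
class_b = ["ATGCGTTA"]

k = 3
boud_a = unique_descriptors(class_a, k)
boud_b = unique_descriptors(class_b, k)
boud_all = unique_descriptors(class_a + class_b, k)

shared = boud_a & boud_b
# The combined vocabulary is the union, so shared descriptors are counted once.
assert len(boud_all) == len(boud_a) + len(boud_b) - len(shared)
# With the four-letter DNA alphabet a vocabulary can never exceed 4**k entries.
assert len(boud_all) <= 4 ** k
print(len(boud_a), len(boud_b), len(shared), len(boud_all))
```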

Table 3

k-merVirus NameTraining datasetTestdata-1Testdata-2Testdata-3Testdata-4Testdata-5
BoDBoUDBoDBoUDBoDBoUDBoDBoUDBoDBoUDBoDBoUD
3SARS-CoV-11600064576064576064576064576064576064
MERS16000642642811283190140836716003671600367
SARS-CoV-216000641383361811280064193920641542406425600064
Ebola160006432006414248125147411251666112514248125
Dengue160006432127512827821408064164961381408064
Influenza160006448688901280367140806416000641600064
All960006420183818171269125256664125225160141322091125
5SARS-CoV-12557231024920531024920531024920531024920531024920531024
MERS25600010244201210542046741081225101102925582110292558211029
SARS-CoV-22555781024220205514462045921023309952810242465318102440917521024
Ebola25596610245119510242087662461227294210425798521042087662461
Dengue25321010245065910442026161054222741102425392314932227411024
Influenza200176102260829310931594071020175272101520151310072015131007
All147665310243046267154810721082555404198921063526613229350726462452
7SARS-CoV-1280457815151100895515813100895515813100895515813100895515813100895515813
MERS29289521289747975212526229358615184252811315111287985215113287985215113
SARS-CoV-22649879123302289921615728213749211100323498631407325724590129714268598814211
Ebola24439311340749007714109194749018116214313517557243566817562194749018116
Dengue16814741576433798313206133295115733145447814773165057616470145447814773
Influenza51362710642154526096274074348175447771825351011868245101186824
All13022441163652676124317235912790820509399323151881534209759195215048688120334
9SARS-CoV-1662810374045238409899891238409899891238409899891238409899891238409899891
MERS678971536574110920632462526619668377581133568421662899768503662899768503
SARS-CoV-264773533978256109728876335264550296007960353162655633276984711110511105765922
Ebola44411213863288807642449351014952268387387169072440389469127351014952268
Dengue25526078543751092539245203203884400223026559231250361783849223026559231
Influenza5763532578117360592090845859315572504138159215710451166257104511662
All2746525217045662738092176102189156241902309440723819112779819349194988120435611188263
11SARS-CoV-1730762710776426286691646542628669164654262866916465426286691646542628669164654
MERS743333843970121450737565576163293236635864693410725433093587725433093587
SARS-CoV-27255552505346287069214621859057353466489280255940017103633464429117924347100857
Ebola47081964708494099650512371423764927410161491849466309891945371423764927
Dengue267000713638653407451172212669413557650841119304261923713225950841119304
Influenza58025633741175255626635462340187592334852854075760531364857605313648
All29954976385098699415184259102059930746547510521244749166288777721504060132606047448483
13SARS-CoV-1736866712200826506371914502650637191450265061419143826506141914382650614191438
MERS7491153471141223927393305806329101342640804410157273106821017887310682101788
SARS-CoV-273203395463463432818171269595921536016900829381097217167712072455118991161117831
Ebola4733413511179460005346537324767073641229221002384687630100355373247670736
Dengue26797461631425359895791821347551625712344694996952629120157280234469499695
Influenza5801083948917525283093546225121020508334217131510157597115101575971
All30173426466701705418995235792074566356984610611754660770388970267622408135044728578236
15SARS-CoV-1737467813300526527622113942652762211394265273921138326527392113832652739211383
MERS7495710498141224682407645809898106916641200510718973151841074417315184107441
SARS-CoV-273262675789063484669189982596414337021901560051231537173523279346119088222132184
Ebola4737182545899467525575537356167567841264891066354691704106770373561675678
Dengue2680123185342536022638002135271184450234544411185026294441775922345444111850
Influenza5796344469517510213495846185123061507894239045754801647157548016471
All30193594518372705959085817742075954163032710620057667374189599783689040135712685642812

Bag-of-Descriptors and Bag-of-Unique-Descriptors for each virus class.
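As an illustration of what the two counts in the table measure, the following minimal Python sketch builds a Bag-of-Descriptors (every k-mer occurrence) and a Bag-of-Unique-Descriptors (the distinct k-mers) from toy sequences. The function names, the stride-1 overlapping window, the upper-casing step, and the toy sequences are assumptions made for illustration; they are not taken from the authors' code or datasets.

```python
def kmers(sequence, k):
    """Return all overlapping k-mers of a sequence (window stride of 1 is an assumption)."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def bag_sizes(sequences, k):
    """Return (|BoD|, |BoUD|) for a collection of sequences and a given k."""
    bod = [d for s in sequences for d in kmers(s, k)]  # every descriptor occurrence
    boud = set(bod)                                    # distinct descriptors only
    return len(bod), len(boud)


# Toy sequences (hypothetical, not from GISAID or NCBI).
toy_sequences = ["ATGCGATG", "ATGCGTTG"]
n_bod, n_boud = bag_sizes(toy_sequences, k=3)
print(f"BoDs: {n_bod}, BoUDs: {n_boud}")
```

With longer k, the number of descriptor occurrences per sequence shrinks slightly while the number of possible distinct descriptors grows, which is consistent with the larger BoUD counts reported for larger k.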

The main advantage of COVID-DeepPredictor is that it uses the k-mer technique, which is alignment-free. Most analysis-based works attempted so far have used alignment-based techniques. Although these are highly successful in detecting similarities between virus sequences, they take a lot of computational time. Alignment-based techniques also carry the underlying constraint that the sequences are homologous, which may not always be the case. To mitigate these problems, alignment-free techniques (Kari et al., 2015) can be used; they are meant to be fast and can work with a large number of sequences. To demonstrate the advantage of COVID-DeepPredictor over BLASTN, which is an alignment-based technique, Table 4 reports runtimes for input sets of 50, 100, 200, 300, and 400 SARS-CoV-2 sequences. For 50 sequences, BLASTN takes 1 h 15 min to align the sequences and produce its results. These results must then be analyzed by a machine intelligence technique to predict the virus class, which takes additional time. In contrast, COVID-DeepPredictor completes training and testing, which involves prediction, in just 1.26 min. Similar results are seen for the other input sizes. Thus, in these experiments, the alignment-free technique is significantly faster than the alignment-based technique.
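Because no alignment is required, each sequence can be converted directly into a fixed-length vector of integer indices over the k-mer vocabulary, which is the usual input form for an embedding layer followed by an LSTM. The sketch below shows one way to do this. The reserved indices for padding and unknown k-mers, the stride-1 window, the target length, and all function names are assumptions made for illustration and are not confirmed details of COVID-DeepPredictor.

```python
PAD, UNK = 0, 1  # reserved indices (an assumption, not from the paper)


def kmers(sequence, k):
    """Return all overlapping k-mers of a sequence (stride 1 assumed)."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_vocabulary(sequences, k):
    """Map each unique k-mer in the training sequences to an integer index >= 2."""
    unique = sorted({d for s in sequences for d in kmers(s, k)})
    return {d: i for i, d in enumerate(unique, start=2)}


def encode(sequence, vocab, k, length):
    """Encode a sequence as exactly `length` indices, truncating or padding as needed."""
    ids = [vocab.get(d, UNK) for d in kmers(sequence, k)][:length]
    return ids + [PAD] * (length - len(ids))


# Toy example (hypothetical sequences).
vocab = build_vocabulary(["ATGCGATG"], k=3)
known = encode("ATGCGATG", vocab, k=3, length=8)
unseen = encode("ATGTT", vocab, k=3, length=8)
print(known)
print(unseen)
```

Every step here is a single pass over the sequence, so the cost grows linearly with sequence length, in contrast to pairwise alignment against a database.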

Table 4

| Number of SARS-CoV-2 sequences | Alignment-free technique: COVID-DeepPredictor (k=3), training and testing (min) | Alignment-based technique: BLASTN |
| --- | --- | --- |
| 50 | 1.26 | 1 h 15 min |
| 100 | 1.27 | 2 h 40 min |
| 200 | 1.28 | 4 h 30 min |
| 300 | 1.29 | 6 h 35 min |
| 400 | 1.31 | 9 h 10 min |

Runtime comparison of COVID-DeepPredictor and BLASTN.
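To make the size of the gap in Table 4 concrete, the following short Python sketch converts the BLASTN times to minutes and divides them by the COVID-DeepPredictor times. The values are copied from Table 4; the helper name `to_minutes` and the dictionary layout are illustrative choices. Note that the BLASTN figures cover alignment only, whereas the COVID-DeepPredictor figures cover training and testing, so the ratios are indicative rather than a like-for-like benchmark.

```python
def to_minutes(duration):
    """Convert a string of the form 'H h M min' to minutes."""
    hours, _, minutes, _ = duration.split()
    return 60 * int(hours) + int(minutes)


# number of sequences -> (COVID-DeepPredictor minutes, BLASTN time), from Table 4
table4 = {
    50: (1.26, "1 h 15 min"),
    100: (1.27, "2 h 40 min"),
    200: (1.28, "4 h 30 min"),
    300: (1.29, "6 h 35 min"),
    400: (1.31, "9 h 10 min"),
}

# Ratio of BLASTN runtime to COVID-DeepPredictor runtime for each input size.
speedup = {n: to_minutes(blast) / deep for n, (deep, blast) in table4.items()}

for n, ratio in speedup.items():
    print(f"{n} sequences: BLASTN takes about {ratio:.0f}x longer")
```

The ratio grows from roughly 60 for 50 sequences to roughly 420 for 400 sequences, because the BLASTN time rises steeply with the number of sequences while the COVID-DeepPredictor time stays almost constant.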

5. Conclusion

In the current global pandemic, it has become very important to predict SARS-CoV-2 as early as possible, as both the number of affected people and the number of deaths are increasing exponentially every day. However, the polymorphic nature of SARS-CoV-2 allows it to adapt and survive in different kinds of environment, which makes it very hard to predict. In such a scenario, the proposed COVID-DeepPredictor can be very useful for quickly predicting SARS-CoV-2 and other pathogenic viruses from their genomic information, as it uses an alignment-free technique. The results for COVID-DeepPredictor are highly encouraging, with prediction accuracy in the range of 99.51 to 99.94% on the test datasets. Human health being the main concern of this work, the code for COVID-DeepPredictor and the pre-trained model are also provided so that the scientific community can benefit from them as much as possible. Apart from SARS-CoV-2, COVID-DeepPredictor can also be used by pathogen laboratories to recognize the other five pathogenic viruses (SARS-CoV-1, MERS-CoV, Ebola, Dengue, and Influenza) easily and accurately from a given genomic sequence. To achieve good performance, data preprocessing and the experiments are carried out on real-life datasets. Moreover, comparisons with popular existing prediction methods based on Linear Discriminant Analysis, Random Forests, and Gradient Boosting Method are performed to show the superiority of COVID-DeepPredictor. Additionally, the accuracy and runtime of COVID-DeepPredictor are considered together to determine the value of k in k-mer, the k values are compared in terms of Bag-of-Descriptors (BoDs) and Bag-of-Unique-Descriptors (BoUDs), and a comparative study between COVID-DeepPredictor and Nucleotide BLAST is presented.

Statements

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author/s.

Author contributions

IS designed the research. IS, NG, DM, AS, and DP analyzed data and wrote the manuscript. NG performed the experiments and collected results. All authors reviewed and approved the final version of the manuscript.

Acknowledgments

We thank all those who have contributed sequences to GISAID and NCBI databases. We are also thankful to the reviewers for providing valuable comments to improve the paper.

Conflict of interest

AS was employed by the company Cognizant Technology Solutions. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2021.569120/full#supplementary-material

References

  • 1

    Alagaili, A. N., Briese, T., Mishra, N., Kapoor, V., Sameroff, S. C., de Wit, E., et al. (2014). Middle East respiratory syndrome coronavirus infection in dromedary camels in Saudi Arabia. MBio 5:e0088414. 10.1128/mBio.01002-14

  • 2

    Chan, J. F.-W., Yuan, S., Kok, K.-H., Kai-Wang, K., Chu, H., Yang, J., et al. (2020). A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster. Lancet 395, 514–523. 10.1016/S0140-6736(20)30154-9

  • 3

    Cui, J., Li, F., and Shi, Z.-L. (2019). Origin and evolution of pathogenic coronaviruses. Nat. Rev. Microbiol. 17, 181–192. 10.1038/s41579-018-0118-9

  • 4

    Guan, Y., Zheng, B., He, Y., Liu, X., Zhuang, Z., Cheung, C., et al. (2003). Isolation and characterization of viruses related to the SARS coronavirus from animals in southern China. Science 302, 276–278. 10.1126/science.1087139

  • 5

    Hinton, G. E., and Roweis, S. T. (2003). “Stochastic neighbor embedding,” in Advances in Neural Information Processing Systems (Vancouver, BC), 857–864.

  • 6

    Hochreiter, S., and Schmidhuber, J. (1997). Long short-term memory. Neural Comput. 9, 1735–1780. 10.1162/neco.1997.9.8.1735

  • 7

    Huang, C., Wang, Y., Li, X., Ren, L., Zhao, J., Hu, Y., et al. (2020). Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. Lancet 395, 497–506. 10.1016/S0140-6736(20)30183-5

  • 8

    Jenkins, G. M., Rambaut, A., Pybus, O. G., and Holmes, E. C. (2002). Rates of molecular evolution in RNA viruses: a quantitative phylogenetic analysis. J. Mol. Evol. 54, 156–165. 10.1007/s00239-001-0064-3

  • 9

    Jin, Y., Luo, C., Guo, W., Xie, J., Wu, D., and Wang, R. (2019). Text classification based on conditional reflection. IEEE Access 7, 76712–76719. 10.1109/ACCESS.2019.2921976

  • 10

    Kari, L., Hill, K. A., Sayem, A. S., Karamichalis, R., Bryans, N., Davis, K., et al. (2015). Mapping the space of genomic signatures. PLoS ONE 10:e119815. 10.1371/journal.pone.0119815

  • 11

    Khattak, F. K., Jeblee, S., Pou-Prom, C., Abdalla, M., Meaney, C., and Rudzicz, F. (2019). A survey of word embeddings for clinical text. J. Biomed. Informatics 4:100057. 10.1016/j.yjbinx.2019.100057

  • 12

    Kim, K., Kim, D., Noh, J., and Kim, M. (2018). Stable forecasting of environmental time series via long short term memory recurrent neural network. IEEE Access 6, 75216–75228. 10.1109/ACCESS.2018.2884827

  • 13

    Koohi-Moghadam, M., Wang, H., Wang, Y., Yang, X., Li, H., Wang, J., et al. (2019). Predicting disease-associated mutation of metal-binding sites in proteins using a deep learning approach. Nat. Mach. Intell. 1, 561–567. 10.1038/s42256-019-0119-z

  • 14

    Letko, M., Marzi, A., and Munster, V. (2020). Functional assessment of cell entry and receptor usage for SARS-CoV-2 and other lineage B betacoronaviruses. Nat. Microbiol. 5, 562–569. 10.1038/s41564-020-0688-y

  • 15

    Liu, J., Xia, C., Yan, H., Xie, Z., and Sun, J. (2019). Hierarchical comprehensive context modeling for Chinese text classification. IEEE Access 7, 154546–154559. 10.1109/ACCESS.2019.2949175

  • 16

    Liu, X., and Wang, X.-J. (2020). Potential inhibitors against 2019-nCoV coronavirus M protease from clinically approved medicines. J. Genet. Genomics 47, 119–121. 10.1016/j.jgg.2020.02.001

  • 17

    Lu, R., Zhao, X., Li, J., Niu, P., Yang, B., Wu, H., et al. (2020). Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding. Lancet 395, 565–574. 10.1016/S0140-6736(20)30251-8

  • 18

    Luque, A., Carrasco, A., Martín, A., and de las Heras, A. (2019). The impact of class imbalance in classification performance metrics based on the binary confusion matrix. Pattern Recogn. 91, 216–231. 10.1016/j.patcog.2019.02.023

  • 19

    Manekar, S., and Sathe, S. (2018). A benchmark study of k-mer counting methods for high-throughput sequencing. Gigascience 7:giy125. 10.1093/gigascience/giy125

  • 20

    Meng, Y., Wu, P., Lu, W., Liu, K., Ma, K., Huang, L., et al. (2020). Sex-specific clinical characteristics and prognosis of coronavirus disease-19 infection in Wuhan, China: a retrospective study of 168 severe patients. PLoS Pathog. 16:e1008520. 10.1371/journal.ppat.1008520

  • 21

    Ozturk, T., Talo, M., Yildirim, E. A., Baloglu, U. B., Yildirim, O., and Rajendra Acharya, U. (2020). Automated detection of COVID-19 cases using deep neural networks with X-ray images. Comput. Biol. Med. 121:103792. 10.1016/j.compbiomed.2020.103792

  • 22

    Paraskevis, D., Kostaki, E., Magiorkinis, G., Panayiotakopoulos, G., Sourvinos, G., and Tsiodras, S. (2020). Full-genome evolutionary analysis of the novel corona virus (2019-nCoV) rejects the hypothesis of emergence as a result of a recent recombination event. Infect. Genet. Evol. 79:104212. 10.1016/j.meegid.2020.104212

  • 23

    Solis-Reyes, S., Avino, M., Poon, A., and Kari, L. (2018). An open-source k-mer based machine learning tool for fast and accurate subtyping of HIV-1 genomes. PLoS ONE 13:e206409. 10.1371/journal.pone.0206409

  • 24

    Su, S., Wong, G., Shi, W., Liu, J., Lai, A. C., Zhou, J., et al. (2016). Epidemiology, genetic recombination, and pathogenesis of coronaviruses. Trends Microbiol. 24, 490–502. 10.1016/j.tim.2016.03.003

  • 25

    Tang, B., Pan, Z., Yin, K., and Khateeb, A. (2019). Recent advances of deep learning in bioinformatics and computational biology. Front. Genet. 10:214. 10.3389/fgene.2019.00214

  • 26

    Wan, Y., Shang, J., Graham, R., Baric, R. S., and Li, F. (2020). Receptor recognition by the novel coronavirus from Wuhan: an analysis based on decade-long structural studies of SARS coronavirus. J. Virol. 94:e00127-20. 10.1128/JVI.00127-20

  • 27

    Weiss, S., and Navas-Martin, S. (2005). Coronavirus pathogenesis and the emerging pathogen severe acute respiratory syndrome coronavirus. Microbiol. Mol. Biol. Rev. 69, 635–664. 10.1128/MMBR.69.4.635-664.2005

  • 28

    Woo, P. C., Lau, S. K., Huang, Y., and Yuen, K.-Y. (2009). Coronavirus diversity, phylogeny and interspecies jumping. Exp. Biol. Med. 234, 1117–1127. 10.3181/0903-MR-94

  • 29

    Worldometer (2021). Coronavirus Disease 2019 (COVID-19) Cases. Available online at: https://www.worldometers.info/coronavirus (accessed January 8, 2021).

  • 30

    Yan, L., Zhang, H.-T., Goncalves, J., Xiao, Y., Wang, M., Guo, Y., et al. (2020). An interpretable mortality prediction model for COVID-19 patients. Nat. Mach. Intell. 2, 283–288. 10.1038/s42256-020-0180-7

  • 31

    Yan, Q., Weeks, D. E., Xin, H., Swaroop, A., Chew, E. Y., Huang, H., et al. (2020). Deep-learning-based prediction of late age-related macular degeneration progression. Nat. Mach. Intell. 2, 141–150. 10.1038/s42256-020-0154-9

  • 32

    Zhang, Y., Zheng, J., Jiang, Y., Huang, G., and Chen, R. (2019). A text sentiment classification modeling method based on coordinated CNN-LSTM-attention model. Chinese J. Electron. 28, 120–126. 10.1049/cje.2018.11.004

  • 33

    Zhou, P., Yang, X. L., Wang, X. G., Hu, B., Zhang, L., Zhang, W., et al. (2020). A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature 579, 270–273. 10.1038/s41586-020-2012-7

Summary

Keywords

long-short term memory, SARS-CoV-2, sequence analysis, virus prediction, genomic information

Citation

Saha I, Ghosh N, Maity D, Seal A and Plewczynski D (2021) COVID-DeepPredictor: Recurrent Neural Network to Predict SARS-CoV-2 and Other Pathogenic Viruses. Front. Genet. 12:569120. doi: 10.3389/fgene.2021.569120

Received

03 June 2020

Accepted

13 January 2021

Published

11 February 2021

Volume

12 - 2021

Edited by

Xian-Tao Zeng, Wuhan University, China

Reviewed by

Sarath Chandra Janga, Indiana University, Purdue University Indianapolis, United States; Xue-Qun Ren, Henan University, China

Copyright

*Correspondence: Indrajit Saha

This article was submitted to Computational Genomics, a section of the journal Frontiers in Genetics

†These authors have contributed equally to this work

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
