DETIRE: a hybrid deep learning model for identifying viral sequences from metagenomes

A metagenome contains all DNA sequences from an environmental sample, including viruses, bacteria, archaea, and eukaryotes. Since viruses are of huge abundance and have caused vast mortality and morbidity to human society in history as a type of major pathogens, detecting viruses from metagenomes plays a crucial role in analyzing the viral component of samples and is the very first step for clinical diagnosis. However, detecting viral fragments directly from the metagenomes is still a tough issue because of the existence of a huge number of short sequences. In this study a hybrid Deep lEarning model for idenTifying vIral sequences fRom mEtagenomes (DETIRE) is proposed to solve the problem. First, the graph-based nucleotide sequence embedding strategy is utilized to enrich the expression of DNA sequences by training an embedding matrix. Then, the spatial and sequential features are extracted by trained CNN and BiLSTM networks, respectively, to enrich the features of short sequences. Finally, the two sets of features are weighted combined for the final decision. Trained by 220,000 sequences of 500 bp subsampled from the Virus and Host RefSeq genomes, DETIRE identifies more short viral sequences (<1,000 bp) than the three latest methods, such as DeepVirFinder, PPR-Meta, and CHEER. DETIRE is freely available at Github (https://github.com/crazyinter/DETIRE).


. Introduction
High-throughput sequencing or next-generation sequencing (NGS) technology, which makes it possible to obtain all nucleotide sequences directly from environmental samples, has played important roles in many fields, such as pathogen detection (Wu et al., 2015;Miao et al., 2022) and human disease analysis (Georg et al., 2014;Santiago-Rodriguez and Hollister, 2019;Zhu et al., 2019). In these applications, detecting viruses from metagenomic sequences becomes more and more essential because it is the first step in the analysis of viruses (Bonhoeffer and Sniegowski, 2002). However, it is still a quite difficult task because of their relatively low biomass compared with those of bacteria and high mutation rates.
To overcome this challenge, several methods have been proposed to identify viruses from metagenomes in the past few years and can be categorized into similarity-based, machine learning-based, and deep learning-based methods. Similarity-based methods generally map a query sequence to a reference dataset and recognize it as the one with the highest similarity .
/fmicb. . score (Roux et al., 2011(Roux et al., , 2015Wommack et al., 2012;Wood and Salzberg, 2014;Buchfink et al., 2015;Kim et al., 2016;Rampelli et al., 2016;Truong et al., 2016;Paez-Espino et al., 2017;Vilsker et al., 2019). These methods, however, suffer from long execution time during the mapping process, and hardly detect short viral sequences because of the limited features they have. Different from the similarity-based methods, machine learning-based methods could extract human-designed features from DNA sequences and classify them by a well-trained classifier, such as VirFinder (Ren et al., 2017), MARVEL (Amgarten et al., 2018), and VirSorter2 . Although these methods have the ability to produce more accurate results, the features have to be designed artificially, and the performance in identifying short viral sequences is relatively poor.
With the great success of deep learning methods in the past few years, several deep learning-based methods have been proposed to identify viruses from metagenomes. Long-short term memory (LSTM) network and convolutional neural network (CNN) are the most commonly used models. For example, ViraMiner (Tampuu et al., 2019), VirNet (Abdelkareem et al., 2018), and RNN-VirSeeker ) utilize a single LSTM network to learn the interconnections between each part in a one-hot encoded sequence; DeepVirFinder (Jie et al., 2020), PPR-Meta (Fang et al., 2019), and CHEER (Shang and Sun, 2021) establish a single CNN to extract high-level features from one-hot encoded sequences automatically before a set of dense layers and a softmax layer for classification. However, the following two issues hamper the performance of deep learning models for the recognition of short viral sequences as follows: (1) the single deep learning architecture suffers from failing to extract enough features to represent sequences; (2) the one-hot encoding strategy omits the relationship between two parts of a sequence because of its orthogonal property (Mikolov et al., 2013).
To solve the above-mentioned issues, a novel hybrid deep learning-based virus identifier, namely, DETIRE, is proposed in this study to identify viral fragments directly from metagenomes. DETIRE is a two-stage architecture, containing a graph convolutional network (GCN)-based sequence embedder and a two-path deep learning model. First, every sequence is cut into several 3-mer fragments, and, then successively input to the GCN-based sequence embedder to train the representations of all 3-mer fragments. After that, these embedded fragments are then fed into the CNN-path and BiLSTM-path to learn their features, respectively. Finally, by two dense layers and a softmax layer, pair scores are generated, and the higher score determines which type of the input sequence is.
The main contributions of this study are 2-fold: (1) the GCNbased embedder is utilized to enrich the representations of short sequences; (2) BiLSTM and CNN are combined to learn not only the spatial characteristics but also sequential characteristics simultaneously to generate abundant features of short sequences. To the best of our knowledge, DETIRE is the first hybrid model that combines CNN and BiLSTM networks to identify viruses from metagenomes.
The last of this study is organized as follows: Section 2 introduces the datasets used to train and test the DETIRE, and details the architecture of DETIRE and its training strategy. Section 3 shows the performance of the DETIRE on several datasets and discusses the selection of some key parameters which can affect the overall performance. Section 4 gives a brief conclusion of this study.
. Materials and methods . . Virus and host RefSeq genome datasets for training and testing Virus RefSeq genome (up to 11 October 2022) was downloaded from NCBI Virus (https://www.ncbi.nlm.nih.gov/labs/virus/ v\ssi/#/virus?SeqType_s=Genome&SourceDB_s=RefSeq). It has been proven that the less the difference between the lengths of sequences from the training and testing datasets, the better classification result will be achieved (Amgarten et al., 2018). Thus, all 34,492 viral sequences were split into a set of non-overlapped fragments with a length of 500 bp, resulting in 2,043,539 fragments total. The whole set of 500 bp viral fragments combined with 2,040,000 sequences of 500 bp subsampled from 4,410 prokaryotic host RefSeq genomes, supplied in VirFinder (Amgarten et al., 2018), called the GCN training dataset, was jointly used to train the GCN-based embedding model in order to generate a meaningful embedding matrix for sequences embedding. To show the ability of DETIRE for identifying new viruses, 400,000 viral fragments with a length of 500 bp from 22,066 viral sequences (before 16 August 2021) were subsampled randomly to be established as a training dataset to train a hybrid deep learning model. In total, 100,000 fragments from 6,188 viral sequences (between 16 August 2021 and 3 September 2022) were chosen randomly as a validation set. Moreover, 100,000 fragments from 6,238 viral sequences (between 3 September 2022 and 11 October 2022) were chosen randomly as a testing set. Since there is no overlap between the three datasets (training, validation, and testing datasets), a fully trained DETIRE may have the ability to identify new viral sequences (Jie et al., 2020;Shang and Sun, 2021). The NCBI accession numbers of the viral RefSeqs can be found on GitHub (https://github.com/ crazyinter/DETIRE/blob/main/supplementary%20files/Virus_ RefSeq_accession_numbers.csv).

. . Composition of DETIRE
DETIRE utilizes a two-stage strategy for virus prediction, including GCN-based sequence embedding and deep learningbased sequence classification ( Figure 1). Before embedding, every 3-mer fragment is generated by a three-base sliding window moving from the head to the tail of the sequence with a stride of one. For example, the original nucleotide sequence "ATTGCCTGACAT" will be cut into "ATT, TTG, TGC, GCC, CCT, CTG, TGA, GAC, ACA, and CAT." In the process of sequence embedding, TextGCN (Yao et al., 2019) is utilized to learn the meaningful high-level representations of all 3-mer fragments from every nucleotide sequence. First, a heterogeneous graph containing 3-mer fragment nodes is built in order to model global co-occurrence between these 3-mer .
/fmicb. . fragments explicitly. Then, the built graph is fed into a simple twolayer GCN (Kipf and Welling, 2016). The first layer constructs the nodes and edges. Every nucleotide sequence in the GCN training dataset and all unique 3-mer fragments from it are constructed to their single nodes. There are no edges between each nucleotide sequence. Edges are built between 3-mer fragments and their original sequences. All 64 3-mer fragments have an edge

FIGURE
The workflow of DETIRE. DETIRE contains a GCN-based sequence embedding model and a deep learning-based method to learn features of viral sequences automatically and identify them directly from metagenomes. First, the graph neural network learns the high-level representations of -mer fragments in each sequence through supervised back propagation. Then, DETIRE extracts the features of spatial characteristics and sequential characteristics by designing the CNN model and LSTM model, respectively. Finally, the learned features are combined together to make the final decision by several dense layers and a softmax layer.
between each of them ( Figure 2). The weight of the edge between a sequence node and a 3-mer fragment node is determined by the term frequency-inverse document frequency (TF-IDF; Liang and Chengsheng, 2008) of the fragment in the sequence, where term frequency is the frequency of the 3-mer fragment appears in the sequence and inverse document frequency is the logarithmically scaled inverse fraction of the number of sequences that contain the 3-mer fragment. Point-wise mutual information (PMI; Church and Hanks, 1989), a popular measure for word associations, is employed to calculate weights between two fragment nodes. The second layer learns the fragment and sequence embeddings in each node. Finally, these nodes are fed into a softmax classifier, after which the cross-entropy error over all labeled sequences is defined as the cost function (Li et al., 2018). After 1,000 epochs of backpropagation by the Adam optimization algorithm (Kingma and Ba, 2015) with a learning rate of 0.015 and a dropout rate of 0.5, the 30-dimension representations of all 3-mer fragments in the second layer of the GCN are embedded into the sequences in the training and testing datasets.
In the process of sequence classification, two parallel deep learning models (a hybrid deep learning model), CNN and BiLSTM, respectively, used to learn spatial and sequential features of sequences. In the CNN path, each embedded sequence is considered as an image to extract a spatial feature through three sets of layers. Each set of the layer contains a convolutional layer (16, 32, and 64 filters with sizes of 7*7, 5*5, and 3*3, respectively), a ReLU activation function, a max pooling layer (with a pooling size of 4*4 and a stride of 4), and a batch normalization (BN) layer, respectively. In the BiLSTM path, the embedded 3-mer fragments in a sequence are input into the BiLSTM cells (498 tokens total) one by one, generating a sequential feature. Then, the first dense layer with 100 hidden neurons receives the weighted merged two sets of features from the CNN path and BiLSTM path. The second hidden layer after that contains 30 hidden neurons. Finally, a softmax layer generates two scores that reflect the likelihood of the input sequence

FIGURE
The structure of the GCN-based embedding model. Every nucleotide sequence in the GCN-training dataset and all unique -mer fragments from it are constructed to their single nodes. There are no edges between each nucleotide sequence. Edges are built between -mer fragments and their original sequences. All -mer fragments have an edge between each of them. After the training strategy, the vectors from the codon nodes in the second GCN layer are combined into an embedding matrix.

. . A marine metagenome dataset from CAMI
To test the performance of DETIRE on identifying viral sequences, a marine metagenome was downloaded from the 2nd CAMI Challenge Marine Dataset (https://data.cami-challenge. org/participate), short and long-read shotgun metagenome data from samples at different seafloor locations of a marine environment. All sequences from the fasta document named CAMI2_short_read_pooled_gold_standard_assembly were selected. These sequences were, then, mapped to the Virus RefSeq genome using BLAST (Basic Local Alignment Search Tool) Altschul et al. (1990), with default parameters (word size is 11, threshold is 0, and score matrix is BLOSUM62). As a result, 8,431 sequences shorter than 500 bp, 7,087 sequences in length of 500-1,000 bp, 7,915 sequences in length of 1,000-3,000 bp, and 12,518 sequences longer than 3,000 bp were recognized as viral sequences. The same amount of non-viral short sequences was also randomly subsampled from the rest of sequences for each length, respectively. Together, these 71,902 mixed sequences were built into a metagenome dataset to test DETIRE.

. . A real human gut metagenome dataset
A real human gut metagenome dataset was downloaded from the NCBI short-read archive (accession ID: SRA052203; Sharon et al., 2013). The same BLAST progress with default parameters was conducted as it was in the 2nd CAMI Challenge Marine Dataset, resulting in 385 viral sequences shorter than 500 bp, 409 sequences in length of 500-1,000 bp, 728 sequences in length of 1,000-3,000 bp, and 915 sequences longer than 3,000 bp. Similarly, the same amount of non-viral short sequences for each length was subsampled randomly from the rest of non-viral sequences. Totally, 4,874 mixed sequences were used to test DETIRE.
. . The influence of k-mer fragments and their embedding sizes on the viral identification results Every nucleotide sequence was converted into a set of k-mer fragments before being expressed as high-level vectors by the graph neural network to enrich the representation of the sequence. The chosen k-value and the embedding size have a significant impact on the identification performance. To find a relatively optimal set of the two parameters, a lot of models have been trained by the established training dataset and evaluated by the testing dataset. Each model corresponds to a specific set of parameters, where k is chosen from [1, 2, 3, 4, 5, 6, 7, 8, 9, and 10] and embedding size is chosen from [4,8,30,50,100,200,500,1,000], respectively. The evaluation results are shown in Figure 3. When the embedding size is a little lower than 4 k (all combinations of the k-mer fragments), the AUROC value is close to the maximum. Thus, 3-mer fragments with an embedding size of 30 were chosen for our final model. The final representations of all 3-mer fragments can be found at https://github.com/crazyinter/DETIRE/blob/main/ supplementary%20files/embedding_matrix.csv.

. . Performance on the testing dataset
To prove the outstanding performance of DETIRE on identifying short viral sequences (<500 bp), a testing experiment on the testing dataset was performed to make a comparison between DETIRE and three latest viral identification methods, DeepVirFinder, PPR-Meta, and CHEER. The accuracies, recalls, precisions, and F1 scores of the four methods are calculated in Table 1. DETIRE outperforms the three methods at all of the four criteria.

. . Performance on the CAMI Marine metagenome
To deal with different lengths of sequences from the metagenome and avoid vanishing gradient problem in the process of training BiLSTM, a sequence longer than 500 bp will be divided into several non-overlapped sub-sequences of 500 bp before input into the BiLSTM path. If the length of the last part in the sequence .

FIGURE
The performance of the models with various k-values and embedding sizes on the testing dataset. Sequences with specific k-mer fragments and embedding sizes were used to train several models, which were, then, evaluated by the testing dataset. The AUROC values of these models were calculated. The k is chosen from [ , , , , , , , , , and ] and the embedding size is chosen from [ , , , , , , , and , ]. is shorter than 500 bp, the last bases of the sequence will be zeropadded and regarded as a single subsequence. Then, all of the subsequences are input to the hybrid deep learning model one after another, to get their own scores, the average of which will be the final score and is contributed to identifying whether the query long sequence is viral or not. The accuracies, recalls, precisions, and F1 scores of DETIRE, CHEER, PPR-Meta, and DeepVirFinder on classifying viral and non-viral sequences from the CAMI Marine metagenome are calculated and made a comparison in Table 2. DETIRE exceeds DeepVirFinder and PPR-Meta at all of the four criteria for identifying all lengths of viral sequences. DETIRE is better than CHEER when the length is shorter than 3,000 bp. For identifying long sequences (>3,000 bp), the accuracy, recall, and F1 score of DETIRE are slightly lower than CHEER because DETIRE identified a little less viral sequences. For all lengths, DETIRE achieves the best performance than the other three methods.

. . Performance on the real human gut metagenome
The accuracies, recalls, precisions, and F1 scores of DETIRE, CHEER, PPR-Meta, and DeepVirFinder are calculated according to the number of correctly and incorrectly identified viral and host sequences from the real human gut metagenome dataset (Table 3). DETIRE also achieves the best overall performance than the other three methods of identifying viral sequences (all lengths). In spite of a 0.0087 and 0.0019 lower precision than CHEER for 500-1,000 bp and 1,000-3000 bp, DETIRE gets higher accuracy, recall, and F1 score for these lengths. For sequences longer than 3,000 bp, CHEER is the outperforming method. However, a comparable performance of DETIRE is achieved as CHEER for long sequences.

. . Comparison of the testing time consuming
The testing time of the four methods on the testing dataset, the CAMI Marine metagenome and the real human gut metagenome are made a comparison, as shown in Table 4. The equipment used for the analysis is two Intel Xeon Gold 6226R (CPU) with a memory of 256 Gb. For all the three datasets, DETIRE has the minimum time consumption for the testing strategies.

. . Availability of DETIRE on identifying phages
DETIRE is trained to identify viral sequences directly from metagenomes, including prokaryotic virus and eukaryotic virus. In some specific virus analysis tasks, finding bacteriophage-originated sequences is a meaningful strategy. To test the ability of DETIRE to detect phages, a phage dataset was established according to DeePhage Wu et al. (2021). All 225 phage genomes were collected, and then, 10 fragments of 500 bp were randomly subsampled from each phage genome, resulting in 2,250 sequences.These sequences were embedded by the trained embedding matrix and input to DETIRE. DeepVirFinder, PPR-Meta, and CHEER were also tested by these sequences. The accuracies of identifying these . /fmicb. . phages are shown in Table 5. DETIRE identified 90.27% phages and 4.36 and 4.58% more than CHEER and DeepVirFinder, respectively. The accuracy of DETIRE is 2.49% lower than PPR-Meta. This may be caused by PPR-Meta's training dataset, containing only phages because it is a 3-class classifier that allows simultaneous identification of both phage and plasmid fragments. Eventhough, DETIRE achieves a better performance than CHEER and DeepVirFinder, with relatively comparable results as PPR-Meta. This means DETIRE has the potential in identifying phages.
. . The ability of DETIRE on making a whole viral genome by identifying viral sequences after assembling DETIRE is tested additionally by metagenomic data with known viral infection, to see if the identified viral sequences can be assembled to make whole genome of the virus. A wastewater metagenome (Schoch et al., 2020;NCBI:txid527639) is selected from the IMG/VR database David et al. (2021), containing 359 viral nucleotide sequences. All of the sequences are separated into 100 bp sequences, resulting in 3,459 reads. All of the four methods were separately used to identify these reads as virus or not. The number of identified viral sequences and accuracy is shown in Table 6. Then, these identified viral sequences are assembled by SOAPdenovo2 Ruibang et al. (2012), with default parameters. The number of assembled contigs is shown in Table 6. After that, these contigs are input to CheckV Nayfach et al. (2021), to assess the quality and completeness of the viral genome. The number of hits for each method is shown in Table 6. All the hits are of low quality by the AAI-based method. The completeness of contigs from DETIRE can be seen in https://github.com/ crazyinter/DETIRE/blob/main/supplementary%20files/DETIRE_ completeness.csv. DETIRE identifies most viral reads from the wastewater metagenome. A total of 289 assembled viral contigs are detected by CheckV, with low-quality, 24, 88, and 77 more than DeepVirFinder, PPR-Meta, and CHEER, respectively. The results show that DETIRE performs better than the other three methods of making whole genome of the virus by assembling identified viral sequences.

. Discussions
Currently, identifying short viral sequences directly from metagenomes is still of low accuracy because of too little information to extract enough features from short sequences. There are three reasons why DETIRE works. First, the GCN-based embedding model creates a new representation for a sequence. Each codon is transformed into a high-dimensional embedded vector by a trained embedding matrix. The new representation .
/fmicb. .   of the sequence enriches the features significantly. Second, the hybrid model contains BiLSTM and CNN simultaneously. It could learn not only the spatial characteristics but also sequential characteristics to generate abundant features of short sequences, which is intuitively consistent with the expression of the DNA sequence in nature. Thirdly, the combination of the two features extracted from the CNN-path and BiLSTM-path is dynamic. The longer the input sequence, the more contribution of the CNN-path. Similarly, the shorter the input sequence, the more contribution of the BiLSTM-path. Furthermore, the attention mechanism in both paths decides which part of the information from the sequence should be remembered or forgotten, which is really important to determining which part of the sequence contributes more to distinguishing.
To verify the effectiveness of the GCN-based embedding model and hybrid classification model in DETIRE, the sequences from the same training and testing datasets (500 bp) were one-hot  Figure 4. The accuracy of DETIRE in identifying viral sequences exceeds that of BOHEM and FOHEM by 3.62 and 4.39%, respectively, representing the effectiveness of the GCN-based sequence embedding method in DETIRE. DETIRE also has a superiority over single CNN-based and BiLSTM-based model. The gap is 2.26 and 1.74% on identifying viral sequences, respectively, showing the availability of the hybrid deep learning model in DETIRE. The advantage of DETIRE becomes smaller with the increase in the length of input sequences. This may be caused by the length of sequences in the GCN-training dataset being 500 bp. The trained embedding matrix is more suitable for identifying short viral sequences. Furthermore, the labels of the sequences in the marine metagenome dataset and the real human gut metagenome dataset are made by the BLAST search in the whole virus RefSeq. This may lead to false positives or negatives for viruses because the training set for the hybrid deep learning model is subsampled from the virus RefSeq. However, the size of the training set is rather smaller than the whole virus RefSeq, and the influence of the duplicate data would be rather small.
DETIRE could accurately identify short viral sequences (<1,000 bp) from metagenomes, with comparable performance at longer length (>1,000 bp). It supplements viral analysis since a high proportion of the existing tools tend to deal with long sequences generated from assembling and binning, which misses many viral species. Moreover, identifying viral sequences is the very first step in the viral analysis, the effectiveness of which could have an impact on downstream work, such as virus taxonomy, virushost interactions, and virus-derived diseases detection. We hope that DETIRE would play an important role in the realm of virus analysis.

. Conclusion
In this study, a deep learning-based hybrid model, DETIRE, is proposed to identify viral sequences directly from the metagenome. Encoded by a graph-based embedding method, nucleotide sequences are fed into a CNN-path and a BiLSTM-path, respectively, for feature extraction before being classified by a softmax layer. In comparison to the three latest viral identification methods, DeepVirFinder, PPR-Meta, and CHEER, on the test dataset, the CAMI Marine dataset and a real human gut metagenome, DETIRE outperforms in identifying short sequences (<1,000 bp). DETIRE will do something to promote the development of the field on the natural viral community analysis because of the huge number of short sequences generated by the NGS technique. DETIRE is anticipated to play a positive role in the downstream viral analysis such as viral taxonomy and pathogens identification.

Data availability statement
The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author.
. /fmicb. . of false positive assembly. All authors reviewed the manuscript.

Funding
This research was funded by the Fundamental Research Funds for the Central Universities.