NeoRdRp2 with improved seed data, annotations, and scoring

RNA-dependent RNA polymerase (RdRp) is a marker gene for RNA viruses; thus, it is widely used to identify RNA viruses from metatranscriptome data. However, because of the high diversity of RdRp domains, it remains dif ﬁ cult to identify RNA viruses using RdRp sequences. To overcome this problem, we created a NeoRdRp database containing 1,182 hidden Markov model (HMM) pro ﬁ les utilizing 12,502 RdRp domain sequences. Since the development of this database, more RNA viruses have been discovered, mainly through metatranscriptome sequencing analyses. To identify RNA viruses comprehensively and speci ﬁ cally, we updated the NeoRdRp by incorporating recently reported RNA viruses. To this end, 557,197 RdRp-containing sequences were used as seed RdRp datasets. These sequences were processed through deduplication, clustering, alignment, and splitting, thereby generating 19,394 HMM pro ﬁ les. We validated the updated NeoRdRp database, using the UniProtKB dataset and found that the recall and speci ﬁ city rates were improved to 99.4% and 81.6%, from 97.2% and 76.8% in the previous version, respectively. Comparisons of eight different RdRp search tools showed that NeoRdRp2 exhibited balanced RdRp and nonspeci ﬁ c detection power. Expansion of the annotated RdRp datasets is expected to further accelerate the discovery of novel RNA viruses from various transcriptome datasets. The HMM pro ﬁ les of NeoRdRp2 and their annotations are available at https://github.com/ shoichisakaguchi/NeoRdRp.


Introduction
RNA viruses play pivotal roles in various environments and influence several ecosystems and organisms including humans, animals, and plants (1).Studies on RNA virus detection using metatranscriptome data have emerged as powerful tools for obtaining insights into both known and novel RNA viruses in samples (2).RNA-dependent RNA polymerase (RdRp) is a universal gene found in almost all RNA viruses except retroviruses and deltaviruses.RdRp sequences are commonly used to search for various RNA viruses from metatranscriptome data; however, because of the great diversity of RdRps, finding various RNA viruses based on homology-based searches using RdRp sequences is known to be difficult (2).To overcome this problem, we developed a bioinformatics pipeline to generate hidden Markov model (HMM) profiles from 12,502 RdRp domain sequences.We shared 1,182 of these RdRp HMM profiles as NeoRdRp 1.1, along with the associated RdRp domain sequences and RdRp domain sequences, and the bioinformatics pipeline at NeoRdRp (3): https:// github.com/shoichisakaguchi/NeoRdRp.NeoRdRp was used in various RNA virus identification studies (4)(5)(6).
This study aimed to update the NeoRdRp by including the RdRp sequences of recently reported RNA viruses to enhance the ability of the model to identify RNA viruses from metatranscriptome data.To this end, we modified our bioinformatics pipeline and processed and annotated 557,197 amino acid sequences containing RdRp domains.Consequently, 19,394 HMM profiles of the RdRp domains were generated and named as NeoRdRp2.We evaluated the RNA detection performance of NeoRdRp2, a previous version of NeoRdRp (version 1.1), and other RNA virome bioinformatics tools.All datasets and annotations for NeoRdRp2 are available at https://github.com/shoichisakaguchi/NeoRdRp.

Datasets
The following four RdRp sequence datasets were used: Wolf2018 (1), Zayed2022 (10), Edgar2022 (8), and Neri2022 (9).Wolf2018 consists of 4,620 RdRp sequences extracted from RNA viruses and unclassified viral sequences registered in GenBank as of April 2017 using PSI-BLAST (18) and iterative clustering and alignment.Zayed2022 is an RdRp sequence dataset from marine RNA metatranscriptomic data, consisting of 209,588 RdRp sequences, 6,238 of which are near-full-length RdRps.From Edgar2022, we employed two datasets: the rdrp1 dataset consisting of 14,680 annotated and non-annotated RdRp sequences from GenBank, and the Serratus dataset consisting of 250,799 recently detected RdRps stored in the Serratus database (https://serratus.io/).Neri2022 contains 77,510 RdRp sequences obtained from metagenome datasets registered in IMG/M [https://img.jgi.doe.gov/,(9)].Additionally, we conducted an HMM search against the NCBI RNA Virus database and obtained 7,896 RdRp domains.A total of 565,093 amino acid sequences containing the RdRp domain were used in this study.
The UniProtKB database, known for its curated protein sequences and high reliability, was used to evaluate RdRp searches (19).This database contains 565,254 protein sequences and was downloaded on October 11, 2021, from https:// www.uniprot.org/uniprotkb.The sequences in this dataset were obtained from multiple organisms and included 836 RdRps (3).In addition to the UniProtKB database, we utilized a metatranscriptomic assembly dataset obtained using a fragmented and primer-ligated double-stranded RNA (dsRNA) sequencing (FLDS) method, named "FLDS-data" (20).This dataset was derived from marine samples and comprised 228 RdRp domains and 20 capsids in 1,143 assembled transcripts.This dataset was used to assess the efficacy of various tools for accurately detecting RdRp sequences, inadvertently identifying capsid proteins, and probing potentially unidentified RdRps.This dataset contains appropriately annotated data and is a suitable evaluation platform.

Construction of RdRp HMM profiles
The bioinformatics programs for HMM construction are summarized in Table 1, and the procedure is illustrated in Figure 1.To construct the RdRp HMM profile, amino acid sequences of the five RdRp datasets were merged and clustered using CD-HIT with a 99% threshold and word size of 5 to exclude sequences that exist redundantly within the datasets.This process was also aimed at removing sequences in which part of one sequence was contained entirely within a longer sequence.After clustering, a representative sequence from each cluster was retained as a deduplicated RdRp dataset.Based on the deduplicated RdRp dataset, HMM profiles of the RdRp domains were constructed as shown in Figure 1, which was modified from the procedure used in our previous study (3).The deduplicated RdRp seed datasets were clustered using CD-HIT; parameters for CD-HIT were tested with 40% to 60% similarity (Supplementary Table 1).Clusters containing more than three sequences were aligned using MAFFT L-INS-i.Following the alignment, we employed our script "cutgap.py"(available at https:// github.com/shoichisakaguchi/NeoRdRp/blob/main/script/archive/v2.0/cutgap.py)to delineate and split the sequences based on the gappy regions.Specifically, the script scrutinized each aligned sequence column and identified regions where the gap occurrence rate was greater than 25%, which was decided by testing three different gap occurrence rates: 15%, 25%, and 35% (Supplementary Table 2).Consecutive regions with more than 25% gaps spanning eight or more alignment positions, which were decided by comparing six different spanning lengths (2, 4, 6, 8, 10, and 12), were identified as potential split points (Supplementary Table 1).After identifying these gappy regions, the sequences were split and saved in distinct FASTA files, with file names reflecting gap thresholds and split positions.Finally, HMM profiles were created for each conserved domain using the HMMER hmmbulid program.The constructed HMM profiles were designated NeoRdRp2.

Re-annotation of seed RdRp datasets
We conducted a comprehensive reannotation of the RdRp seed datasets used for NeoRdRp2 to enable NeoRdRp users to evaluate their search results.Specifically, we used InterProScan, a tool that aggregates data from various databases to provide detailed insights into protein domains and functional annotations.The datasets were subjected to searches against the following databases: Conserved Domain Database (CDD) for identifying conserved domains; Coils for predicting coiled-coil regions; FunFam for classifying proteins into functional families; Gene3D for categorizing proteins based on their 3D structures; Pfam for protein family classification; ProSiteProfiles for domain and key site identification; SUPERFAMILY for superfamily-based classification; MobiDBLite for predicting protein mobility regions; PRINTS for protein fingerprint classification; PANTHER for protein family and subfamily classification; SMART for domain identification and annotation; ProSitePatterns for identifying protein sequence patterns and motifs; NCBIfam, PIRSF, Hamap, and AntiFam for further classification and annotation based on protein families, phylogeny, and false predictions filtering.RdRp-scan was conducted with default parameters for sequence similarity searches to identify the characteristics and origin of the RNA viruses.Additionally, we conducted Palmscan and BLASTp searches with an e-value of ≤1E-10 and default parameters using each RdRp seed dataset sequence as a query against the UniProtKB database.

RdRp HMM profiles
For the construction of our RdRp-HMM (Figure 1), the first stringent clustering, indicated by the blue curved rectangle in Figure 1, yielded a refined dataset comprising 328,977 unique sequences from the original pool of 565,093 sequences.Secondly, a relatively loose clustering approach (orange curved rectangle in Figure 1) grouped the unique sequences into 68,118 clusters, revealing a diverse spectrum of RdRp sequences.The large number of RdRp clusters indicates the extensive diversity of the RNA virus families.Multiple sequence alignments were performed for each cluster containing more than three sequences, using MAFFT.Subsequently, consecutive regions with more than 25% gaps spanning eight or more alignment positions were identified and removed, and the sequence was split at these regions.Finally, 17,452 clusters were obtained and an RdRp HMM profile was generated for each cluster using the HMMER suite hmmbulid program.This set of 19,394 HMM profiles was named NeoRdRp2.

Evaluation of NeoRdRp2 using UniProtKB
To evaluate the performance of NeoRdRp2, we used the same database, program, and parameters as previously used to assess NeoRdRp 1.1 (3).To this end, the UniProtKB database with hmmsearch and BLASTp with an e-value of 1E-10 was analyzed using NeoRdRp 1.1 and 2 datasets.The results, including the number of amino acid sequences and HMM profiles, are summarized in   3).As a result of the BLASTp search using the NeoRdRp2 seed sequences, the recall and precision rates were 99.9% and 11.5%, respectively.Three genera, Livivirus, Emvecovirus, and Orthopenumovirus, were not detected by hmmsearch using NeoRdRp2 but were detected by BLASTp.Of the 564,418 non-RdRp sequences, 188 were incorrectly identified as RdRp-containing sequences.The precision rate of NeoRdRp2 was 81.6%, which was also improved compared to that of NeoRdRp 1.1 (76.8%).These results indicate that increasing the number of RdRp sequences improves the accuracy of the RdRp search.

Comparison of RdRp identification
We next evaluated NeoRdRp2, using the FLDS-data containing 228 20 capsids, and 895 unannotated amino acid sequences (20), which were used in our previous analysis (3).For comparison, we used other RdRp and RNA virus detection tools, including RVDBprot, RdRp-scan, ViralRdRp_pHMMs, LucaProt, Serratus, Palmscan, RdRpBin, and VirSorter2.See Materials and Methods for further details.Using these HMM models and RdRp and RNA virus detection programs, the total number of identified RdRp sequences, false positives in capsid protein identification, and detection of unannotated sequences were assessed (Table 4).Among the 228 RdRp-annotated FLDS sequences, NeoRdRp2 showed the highest count of RdRp detection with 216 sequences, closely followed by LucaProt with 212 sequences.RdRp-scan and RVDB-prot also exhibited substantial RdRp detection, with 201 and 199 sequences, respectively (Figure 2A).On the lower end, VirSorter2 identified 67 RdRp sequences, indicating relatively low sensitivity for RdRp detection.Notably, there were cases in which other tools detected RdRp, whereas NeoRdRp2 did not (Supplementary Table 4).In the HMM searches, there were ten sequences that NeoRdRp2 could not detect, but RdRp-scan and/or RVDB-prot detected two sequences.One RdRp sequence was only detected by non-HMM-based searches for Serratus and RdRpBin.Interestingly, the ML-based RdRpBin alone detected eight RdRps.The taxonomy of the RNA viruses that could not be detected by NeoRdRp2 was as follows: three of 24 Narnaviridae, one of 22 Partitiviridae, and one of 3 Endornaviridae.
Among the 20 capsid FLDS sequences, a few possible misidentifications were found using all the tools (Table 4; Figure 2B).RVDB-prot had the highest number of capsid identifications (12 sequences), followed by LucaProt with eight sequences.Note that RVDB-prot contains the HMM profiles of all RNA viral proteins, not just RdRp (16).In contrast, the lowest number (4) was found using RdRpBin, followed by 5 with NeoRdRp2, VirSorter2, RdRp-scan, ViralRdRp_pHMMs, and Serratus Lite.
The detection of unannotated sequences varied significantly among the tools (Table 4).LucaProt identified a notably large number of unannotated sequences (155).The fourth largest number was 26, for both NeoRdRp2 and RdRp-scan.We compared the overlapping numbers of the detected sequences using different tools (Figure 2C).Multiple tools detected 62 sequences: four sequences were detected by all tools used in this study, whereas 168 sequences were detected using only one tool.The machine-learningbased searches Lucaprot and RdRpBin uniquely detected 100 and 2 unannotated datasets, respectively.In contrast, NeoRdRp2 and RdRp-scan did not detect any RdRp sequences that were annotated only by these tools, although they did not entirely overlap (i.e., 22 of 26 overlapped).To provide a fair comparison, it is important to note that both RVDB-prot and VirSorter2 include HMM profiles for various viral proteins, in addition to RdRp.Consequently, the detection numbers reported in these databases may reflect a broader range of viral protein identifications.

Re-annotation of seed RdRp datasets
NeoRdRp is aimed to provide a comprehensive dataset of RdRp sequences.To this end, we collected as many RdRp sequences as possible.We cannot exclude the possibility that some RdRp sequences used in NeoRdRp may not function as RdRp.Therefore, we evaluated the 557,197 RdRp seed sequences used in this study using InterProScan, Palmscan, and BLASTP searches against UniProtKB and hmmsearch with RdRp-scan, as summarized in the Supplementary Data.
However, even if a sequence was not determined to be RdRp by the programs, this may simply be due to the low sequence similarity;   Leviviruses, and 29 Orthopneumoviruses.While Embecovirus was not found in the seed RdRp datasets, 61 beta Coronaviruses were included, and Embecoviruses were their subgenera.

Discussion
A comparison using the UniProtKB dataset showed improved performance of NeoRdRp2 from version 1.1; introducing expanded seed RdRp datasets enhanced the detection of non-mammalian The performance of various RdRp detection tools in detecting the unannotated FLDS data.The performance of various RdRp detection tools was evaluated using FLDS data containing 228 RdRps, 20 capsids, and 895 unannotated amino acid sequences.(A) RdRp, (B) capsid, and (C) non-RdRp detection.The detection ability of each tool was plotted to evaluate its ability to identify RdRp sequences.
virus sequences missed in version 1.1.This improvement stems from the enriched metatranscriptome data of environmental origin in the dataset.In contrast, 188 non-RdRp sequences in UniProtKB were by hmmsearch using NeoRdRp2, which appears to be a cautionary point for NeoRdRp that is not limited to the core motifs of RdRp.However, because the NeoRdRp HMM profiles were generated based on conserved regions (see Materials and Methods), non-RdRp domains could rarely be merged in the HMM profiles.Indeed, NeoRdRp detected a relatively small number of possible false-positive RdRp hits in the capsid or unannotated sequences compared with other estimation programs (Figure 2; Table 4).As we previously reported, a BLASTp search of each RdRp amino acid sequence stored in the NeoRdRp 1.1 dataset could further identify RdRp candidates.However, this method can also detect many false positives (3).Similar trends were observed in the NeoRdRp2 dataset because the amino acid sequences used for NeoRdRp2 also contained non-RdRp regions.NeoRdRp2 showed high performance using hmmsearch alone compared to its version 1.1, as well as other RNA virus detention tools, which promise a highly specific RdRp search.Additionally, the possibility that these 188 non-RdRp sequences contained non-annotated RdRps should be considered.Despite employing the curated and reliable UniProtKB dataset, which we consider a robust benchmark dataset, the possibility of misannotations cannot be ruled out.Therefore, a solid benchmark dataset that is well annotated and comprehensive is required for the variation in detection capabilities across software versions and tools.The methodology for accurate annotation of RdRp and the need for consensus benchmark datasets were discussed at the RdRp summit in 2023 and summarized in a consensus statement (27).Establishing such a benchmark dataset is crucial for evaluating and improving RdRp detection tools, including future NeoRdRp iterations.
As shown in Figure 2; Table 4, the number of detected RdRp sequences varied significantly depending on the program used.NeoRdRp2 showed the highest detection in the search of FLDS data using multiple HMMs for RNA virus detection.This highlights the benefits of developing a comprehensive HMM profile to ensure broader sequence detection.The present results, which show that adding newly found RdRp sequences can increase detection and maintain accuracy, demonstrate the advantages of RNA Virome studies using HMM.In addition, the detection of many RdRps by LucaProt and eight RdRps by RdRpBin alone indicates that machinelearning-based RdRp searches can also detect comprehensive and low-similarity RdRp sequences.However, the possibility that a large fraction of false positives was included cannot be ruled out, and extra care must be exercised when using machine-learning-based methods.
In recent years, the field of viral discovery and the molecular analysis of novel viruses has experienced significant advancements, leading to substantial enrichment of the viral sequence database and enhanced annotation accuracy (28).Therefore, tools that use newer databases can potentially identify undiscovered viral sequences in published data.In this regard, while the FLDS data that we used for the evaluation of RNA virus detection tools is well-annotated metatranscriptome data from marine samples ( 29), there might be potential RdRp sequences.Originally, annotations for this dataset were performed using BLAST, along with manual curation, leaving several sequences without comprehensive annotations.Our findings indicate that with the advent of improved bioinformatics tools, there is a significant opportunity to identify RdRp sequences in previously unannotated sequences.Although this study did not examine the unannotated FLDS data employed in the benchmark dataset, several unannotated sequences were detected using multiple programs that can be undiscovered RdRp sequences (Supplementary Table 5), and this should be assessed in future research.This highlights the need for a standardized approach to annotation in virology and the importance of revisiting existing datasets using contemporary methodologies to discover potential novel RNA viruses (27).
The core domains A, B, and C are crucial motifs in RdRp proteins and act as the principal catalytic sites (30).As a popular approach, RdRp search tools have focused on these domains to generate HMMs that encapsulate all core domains to ensure the detection of complete RdRp sequences (8,14,15).In contrast, our approach broadens the search scope to include conserved regions outside the core domains identified through the alignment of input sequences.While innovative, this strategy introduces the potential risk of inadvertent splitting of the core domain during HMM creation.Acknowledging this, we performed a comprehensive evaluation using the UniProtKB dataset to refine our threshold parameters, aiming to effectively balance sensitivity and specificity.However, our method does not explicitly prevent the division of the core domains.Instead, it leverages the presence of less conserved, yet important, regions, as demonstrated in our previous work and confirmed by the performance of NeoRdRp2 in this study.These regions, although not always encapsulating complete ABC domains, contributed to enhancing the accuracy of the tool.This approach may lead to the inclusion of partial RdRp sequences, potentially increasing the risk of false positives.However, even if the domains are split into different profiles, each may be hit as an RdRp domain in a given input sequence; they may also be properly identified.Indeed, based on our benchmarks, the risk of NeoRdRp2 remained within acceptable limits.Future updates will focus on refining our validation steps and incorporating new seed RdRp datasets to improve the specificity and reduce the likelihood of false positives.
In conclusion, NeoRdRp2, which contains 19,394 HMM profiles of the RdRp domain, improves the detection of RdRp in various RNA viruses.Compared with other RNA search tools, NeoRdRp2 exhibited a good balance between sensitivity and specificity of the RdRp domains.NeoRdRp2 achieved more precise RdRp identification in various RNA virome studies.

FIGURE 1
FIGURE 1Schematic illustration of the pipelines for the construction of RNA-dependent RNA polymerase (RdRp) hidden Markov model (HMM) profiles.In the first round, Wolf2018 was inputted as the seed RdRp into this pipeline.The resulting orange HMM profiles were used to hmmsearch for NCBI RNA Virus dataset.In the second round, Wolf2018, the RdRp domains obtained by hmmsearch, and the recently acquired RdRp domain data were merged with the initial RdRp seed datasets and re-input into the pipeline.The resulting HMM profiles were consolidated into a single HMM profile database, referred to as NeoRdRp2.In the bottom, input datasets used for NeoRdRp 1.1 and 2 were shown for each step.
we did not remove such unmatched sequences in this study.Users can check the degree of certainty of a given HMM based on the annotation results.InterProScan analysis facilitated the identification of a wide range of conserved domains, motifs, and protein families, thereby enriching our understanding of the functional landscape of RdRp.Palmscan and RdRp-scan annotated 453,579 and 246,017 sequences in the seed RdRp datasets, respectively.Although there were viruses that NeoRdRp2 failed to detect, the seed RdRp datasets contained at least 717 Partitiviruses, 122 Narnaviruses, 101 Endornaviruses, 9 Morbilliviruses, 89

TABLE 1
Bioinformatics program for constructing HMM.

Table 3 .
NeoRdRp2 detected 831 out of 836 RdRps in the UniProtKB database, improving the recall rate (99.4% from 97.2%) compared to that of NeoRdRp 1.1.NeoRdRp2 detected 12 genera, which NeoRdRp 1.1 could not detect, and failed to detect three genera, which 1.1 detected (Supplementary Table

TABLE 2
Bioinformatic database program used for RdRp detection.
(17)mmended parameters were employed; -include-groups RNA, -min-length 1500, all.VirSorter2 does not have the option to target RdRp sequences exclusively, so we opted to narrow our search to RNA viruses.(17)

TABLE 4
RdRp search against FLDS data.
*NaN, unannotated sequences.**These numbers are for reference only because RVDB-prot and VirSorter2 contain viral proteins other than RdRp.

TABLE 3
Statistics of NeoRdRp detection power estimated using UniProtKB.