Duo: A Signature Based Method to Batch-Analyze Functional Similarities of Proteins

With the rapid advancement of sequencing technology, handling large sequencing datasets to analyze the protein-coding capacity and the functionality of predicted proteins has become an urgent demand. There is a lack of simple and effective tools to functionally annotate large numbers of unknown proteins in a personalized and customized workflow. To address this, we developed Duo, which batch-analyzes functional similarities of predicted proteins. Duo can screen query proteins for specific characteristics based on highly flexible and customizable reference inputs from the user. In the current study, Duo was applied to screen for virulence associated proteins in the genome sequence of Salmonella Typhimurium. Based on the analysis, we give recommendations for the choice of Seed_database in order to obtain a reasonable number of predicted proteins for further analysis, and for preparing a Reference_proteins set for Duo. Delta-bitscore analysis was shown to be a useful tool to focus the follow-up on predicted proteins. A screen for virulence proteins was further performed successfully in the genome sequences of a selection of 32 pathogenic bacteria, documenting the ability of Duo to work on a broad collection of bacteria. We anticipate that Duo will be a useful auxiliary tool for personalized and customized protein function research in the future.


INTRODUCTION
With the continuous evolution of next-generation sequencing (NGS) technologies, application of NGS methods for routine research is now possible at relatively low cost (Kircher and Kelso, 2010; Goodwin et al., 2016). As a result, customized ways to manage the constantly increasing amount of sequencing data are urgently needed, particularly for functional categorization of proteins deduced from sequence data (Mitchell et al., 2019).
To address the demand for functional annotation of proteins, different methods have been developed for summarizing the functional similarity of proteins (known as "signatures"), such as hidden Markov model (HMM)-based methods (HMMER) (Eddy, 2011) and the position-specific scoring matrix-based method PSI-BLAST (Altschul et al., 1997). In addition, sets of protein annotation databases have been established and are available for global data sharing, including InterPro, which integrates 14 different databases (Mitchell et al., 2019), and eggNOG, which is an HMM-based protein annotation database (Huerta-Cepas et al., 2019).
Advances in protein classification methods, coupled with various types of protein annotation databases, each focused on different types of proteins, have enabled a better understanding of unknown proteins. The most direct approach is to check all annotation records manually, and then empirically select proteins of interest for further research. This approach allows free, customizable selection of proteins for specific studies; however, it relies heavily on the experience and knowledge of the researcher and is unsuitable for high-throughput screening. To facilitate functional annotation, the annotation format has been standardized and classified according to different terminologies, such as Gene Ontology (GO; http://geneontology.org/), KEGG pathways (https://www.kegg.jp/kegg/) (Kanehisa and Goto, 2000) and Clusters of Orthologous Groups (COG) (Tatusov et al., 1997). Moreover, annotation databases designed for specific research areas have been established, such as the Virulence Factor Database (VFDB) (Liu et al., 2019) and the Comprehensive Antibiotic Resistance Database (CARD) (Alcock et al., 2020). All these developments have improved the efficiency of functional annotation. However, unlike the purely manual selection process, which can be highly customized, the protein filtering step depends on selecting predefined biological terms in the databases, and these are not always compatible with the specific research purpose. Therefore, it is necessary to develop a more flexible and customizable method for functional protein screening in large datasets.
In this paper, we present a new workflow named "Duo," built to batch-analyze the functional similarities of proteins. Duo facilitates screening of query proteins with specific characteristics using freely available databases and customizable reference protein sequences defined by the user. As a case study, we applied Duo to screen for virulence associated proteins, first in the genome sequence of S. Typhimurium and then more broadly in a selection of pathogenic bacteria, using different, customizable input data. Duo is expected to become a valuable auxiliary tool for personalized and customized protein function prediction in the future.

Testing of Duo on Genomic and Meta-Data of S. Typhimurium
For Duo to work, one needs three components: a list of query proteins (the unknowns), a user-defined list of reference proteins with the characteristics one is searching for (in broad terms), and one or more reference databases. To evaluate the performance of Duo as a protein-function screening utility and to discuss the influence of reference proteins and protein databases (

Experiment 1: Performance Comparison of Different Seed_Databases
To compare the influence of the choice of Seed_database on the prediction of virulence associated proteins in S. Typhimurium, we used 14 publicly available databases listed in InterPro as Seed_databases (Mitchell et al., 2019). In addition, we prepared two custom Seed_databases specially designed for this experiment, one listing proteins of Escherichia coli and Salmonella in the eggNOG database [15], and one listing proteins of Gamma-proteobacteria in the same database. Details of the Seed_databases are listed in Table 1. In this experiment, the protein sequence database named E. coli-vfdb (Table 2) was used as the input of Reference_proteins to obtain a set of proteins that were not identical to the Query_proteins set (Salmonella). The screening results based on the different Seed_databases were summarized in "Interpro_all.Rtab," then parsed by our custom R script "Compare_Seed_DB.R." To observe the distribution of the delta-bitscore (Wheeler et al., 2016), a protein functional similarity index, between experimentally verified and unverified virulence proteins predicted in the screen, delta-bitscore results were summarized in "cross_result.Rtab" and subsequently parsed with our custom R script "cross_analysis.R." Briefly, the delta-bitscore results observed with the different Seed_databases were recorded. If a predicted protein was assigned several bitscores by a Seed_database, which could happen if the database predicted function based on different signatures, the lowest delta-bitscore was selected to represent the delta-bitscore for the virulence factors.
Step 1: Separately, Query_proteins and Reference_proteins are used as query inputs to query the same Seed_database(s) (subject input) with a suitable program (ps_scan.pl, hmmer3, or InterProScan).
Step 2: The matched protein functional tags (signatures) for Query_proteins or Reference_proteins are recorded together with the related bitscore(s) (if applicable).
Step 3: Based on the outputs from the previous step, the parsed records from Query_proteins and Reference_proteins are linked if they contain the same functional tag and, if applicable, the absolute value of the bitscore difference between the linked Query_proteins and Reference_proteins (delta-bitscore) is calculated.
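The linking and scoring described in Steps 1 to 3 can be illustrated with a minimal Python sketch. This is not the Duo implementation: the function names, the tuple layout (protein ID, signature, bitscore), and the example identifiers are all assumptions made for illustration. The sketch links query and reference records that share a signature, computes the absolute bitscore difference, and keeps the lowest delta-bitscore per query protein, as described for Experiment 1.

```python
from collections import defaultdict


def link_by_signature(query_hits, reference_hits):
    """Link query and reference records that share a signature.

    Each hit is a (protein_id, signature, bitscore) tuple (assumed layout).
    Returns (query_id, reference_id, signature, delta_bitscore) tuples.
    """
    # Index the reference hits by signature for fast lookup.
    ref_by_sig = defaultdict(list)
    for ref_id, sig, score in reference_hits:
        ref_by_sig[sig].append((ref_id, score))

    links = []
    for q_id, sig, q_score in query_hits:
        for ref_id, r_score in ref_by_sig.get(sig, []):
            # Delta-bitscore: absolute difference of the two bitscores.
            links.append((q_id, ref_id, sig, abs(q_score - r_score)))
    return links


def lowest_delta_per_query(links):
    """Keep the lowest delta-bitscore observed for each query protein."""
    best = {}
    for q_id, _ref_id, _sig, delta in links:
        if q_id not in best or delta < best[q_id]:
            best[q_id] = delta
    return best


# Hypothetical example records (IDs, signatures, and scores are made up).
query_hits = [("Q1", "SIG_A", 120.0), ("Q1", "SIG_B", 80.0), ("Q2", "SIG_C", 50.0)]
reference_hits = [("R1", "SIG_A", 100.0), ("R2", "SIG_B", 75.0)]

links = link_by_signature(query_hits, reference_hits)
print(links)
print(lowest_delta_per_query(links))
```

In this toy input, Q2 shares no signature with any reference protein and therefore produces no link, which corresponds to a query protein that is not selected by the screen.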

Experiment 2: Performance Comparison of Different Reference_Proteins Sets
To compare different sets of Reference_proteins on the prediction of virulence associated proteins, we used eight different sets of Reference_proteins (Table 2). In this experiment, 14 available public Seed_databases (Table 1) were applied in combination with each of the eight Reference_proteins. The screening results with different Reference_proteins were summarized in "Interpro_all.Rtab.combine," then parsed by our custom R script "Compare_reference_proteins.R."

Testing of Duo on a Broad Selection of Pathogenic Bacteria
To validate that Duo can be used on a wide variety of bacteria, we further used Duo to screen for virulence associated proteins in a broad selection of pathogenic bacteria (32 common bacterial pathogens). As in the former case study of Salmonella, every screening of a single bacterial species requires three inputs: Query_proteins, Reference_proteins, and Seed_database. Briefly, to prepare the Query_proteins, we selected a representative whole genome sequence of the target species (the same as the representative of that species listed in VFDB) and extracted all the coding sequences. For the Reference_proteins, SetA-vfdb (Table 2), a core dataset including only bacterial genes associated with experimentally verified virulence factors, was first selected as the basis. To avoid query proteins being identical to reference proteins, the final Reference_proteins for the target species was constructed by excluding the subset of SetA-vfdb proteins from the target species itself (e.g., the Reference_proteins for Salmonella consisted of the virulence proteins in the VFDB database, excluding the proteins from Salmonella). The Seed_databases were retrieved from the eggNOG database according to the taxonomic grouping of the target species (e.g., Salmonella belongs to the class Gammaproteobacteria, and hence the profile HMMs of Gammaproteobacteria in the eggNOG database were used as the Seed_database for this species). A detailed description of the inputs per species is provided in Supplementary Table 1.
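The construction of a species-excluded Reference_proteins set can be sketched as follows. This is an illustrative assumption rather than the authors' script: it supposes that each FASTA header ends with the organism name in square brackets, and the function names `parse_fasta` and `exclude_genus` as well as the example records are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, seq = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line)
    if header is not None:
        records.append((header, "".join(seq)))
    return records


def exclude_genus(records, genus):
    """Drop records whose organism (assumed to be in the last [...] of
    the header) belongs to the given genus."""
    kept = []
    for header, seq in records:
        start = header.rfind("[")
        organism = header[start + 1:].rstrip("]").strip() if start != -1 else ""
        if organism.split(" ")[0].lower() != genus.lower():
            kept.append((header, seq))
    return kept


# Hypothetical two-record reference set; headers and sequences are made up.
fasta_text = (
    ">VF1 (invA) invasion protein [Salmonella enterica]\nMKT\nAAA\n"
    ">VF2 (eae) intimin [Escherichia coli]\nMLL\n"
)
reference = exclude_genus(parse_fasta(fasta_text), "Salmonella")
print(reference)
```

Records without a bracketed organism are kept in this sketch; a stricter filter could discard them instead, depending on how the reference file is curated.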

RESULTS AND DISCUSSION
Overview of the Duo Workflow
Figure 1 shows an overview of the Duo workflow and details the steps in the application of Duo. The Duo workflow takes three inputs, which are defined by the user for specific research purposes. We named these three inputs Query_proteins, Reference_proteins, and Seed_database. Both Query_proteins and Reference_proteins are protein sequence files in fasta format. Query_proteins are the candidate proteins of interest for the user. Seed_database comprises database(s) of different types of protein signatures, whose biological entities are used as the correlation point(s) between Query_proteins and Reference_proteins, e.g., hidden Markov model (HMM)-based databases (Pfam (El-Gebali et al., 2019), TIGRFAM (Haft et al., 2013), and SMART (Letunic and Bork, 2018)) or profile-based databases (HAMAP (Pedruzzi et al., 2015), Prosite (Sigrist et al., 2013), and CDD (Marchler-Bauer et al., 2017)). Duo has been designed to work with different formats of Seed_databases and, as shown in Figure 1A, before analysis can begin, the user needs to choose one of three Python scripts, each designed to handle a different Seed_database format. After these three inputs have been assigned, Duo automatically queries the Query_proteins and Reference_proteins against the Seed_databases, and the matched point(s) (biological entity term(s) shared between Query_proteins/Reference_proteins and Seed_database) are recorded for further analysis (Figure 1B Steps 1 and 2). Next, Duo associates the Query_proteins with Reference_proteins according to the same matched point(s) (Figure 1B Step 3).
The Duo workflow creates three output files and two subfolders detailing the correlations among Query_proteins, Reference_proteins, and Seed_database. "Domain_correlation.csv" records the details of all the correlation records (parsed annotation results from Query_proteins and Reference_proteins); "Domain_correlation_inner.csv" details only the correlation records shared between Query_proteins and Reference_proteins; and "cross_result.csv" details the delta-bitscore (Wheeler et al., 2016) (protein functional similarity score) records between Query_proteins and Reference_proteins. Finally, the functionally similar proteins among Query_proteins and Reference_proteins are stored in a file in the "PROTEINS" folder and the original files of Query_proteins and Reference_proteins are backed up in

The Influence of Seed_Databases on Functional Protein Prediction
In the first experiment, Duo was applied to screen for S. Typhimurium virulence associated proteins based on the experimentally verified virulence factors from E. coli. S. Typhimurium was chosen as the study object because its pathogenesis is well described and a high number of virulence factors of different types have been identified and verified experimentally. The screening was performed with different Seed_databases and the results are summarized in Figures 2, 3. The GENE3D, SUPERFAMILY, and CUSTOM_DB_2 databases ranked in the top three according to the number of predicted virulence associated proteins, and the HAMAP, PIRSF, and SFLD databases were at the bottom (Figure 2). Unlike in other Seed_databases, the biological entity in GENE3D and SUPERFAMILY is Homologous Superfamilies (Table 1), which is a more general entity than the Domain and Family entities (Mitchell et al., 2019). Similarly, compared with CUSTOM_DB_1 (profile HMMs for Salmonella and Escherichia proteins retrieved from the eggNOG database), CUSTOM_DB_2 is a more general profile (profile HMMs for Gamma-proteobacteria proteins retrieved from the eggNOG database). These results indicate that using a more general set of criteria for Seed_database in Duo results in a higher number of predicted proteins of the desired type. Both PIRSF and SFLD focus on protein clustering based on apparent evolutionary relationships between proteins (Nikolskaya et al., 2007; Akiva et al., 2014). Even though we attempted to predict proteins with similar characteristics (i.e., virulence associated proteins) based on query proteins and reference proteins from closely related bacterial species (Salmonella and E. coli), using these Seed_databases resulted in a low number of hits, indicating that such databases are less suited for this purpose.
TIGRFAM focuses on the annotation of prokaryotic proteins (Haft et al., 2013), and it thus should be suitable for the study of the organisms used in this experiment (i.e., bacteria). Notably, even though the total number of predicted virulence associated proteins was not high using this database (Figure 2), the number of experimentally verified virulence associated proteins was above the medium level, and with 62% of the predicted ones, it showed the highest rate of experimentally verified virulence associated proteins among the databases tested (Figure 3).
To further illustrate the importance of Seed_database when screening for proteins with specific characteristics, we built two customized Seed_databases (Table 1) containing protein sequences of only the input species (Salmonella) and the closely related bacterium E. coli. The results showed that both custom Seed_databases performed well, with an above-medium number of experimentally verified proteins (Figure 2) and, at the same time, an above-medium verification rate among the total predicted proteins (Figure 3).
In summary, the results of our first experiment clearly showed the influence of the choice of Seed_database on the performance of Duo for functional protein screening. A general recommendation for the choice of Seed_database, in order to get a reasonable number of predicted proteins for further analysis, appears to be to combine a broad classification scale (superfamily, etc.) with a database that is aligned with the type of bacteria under research. However, as shown by the better performance of CUSTOM_DB_2 over CUSTOM_DB_1, the database should not be limited to the narrow group of bacteria investigated. This is because a protein is unlikely to be annotated with unknown function if it is closely related to another protein in the same species. It should be noted that building custom databases may not always be as straightforward as in the current example.

The Influence of Reference_Proteins on Functional Protein Prediction
In experiment 2, Duo was applied to screen S. Typhimurium LT2 for virulence associated proteins using different sets of Reference_proteins, consisting of experimentally verified virulence proteins from different sources (Table 2). The results in Figure 4 showed that using SetA-vfdb, Non-Salmonella-vfdb, or Salmonella-vfdb as Reference_proteins resulted in the prediction of a similar number of experimentally verified virulence associated proteins, and the numbers of proteins identified were higher than with the other sets. It is not surprising that the Salmonella-vfdb and SetA-vfdb groups resulted in relatively high numbers of experimentally verified virulence associated proteins, as they contain virulence proteins of the same species as the Query_proteins (S. Typhimurium LT2). Interestingly, the verified numbers were similar between the SetA-vfdb database and the Non-Salmonella-vfdb database (a subset of SetA-vfdb excluding the proteins from Salmonella). This result showed that the Duo workflow is suitable for identification of functionally similar proteins across species. The composition of Reference_proteins could be a key factor for the precision of the screen. One notable example is the difference in the rate of experimentally verified proteins among predicted virulence proteins when using Salmonella-vfdb compared to Non-Salmonella-vfdb as Reference_proteins. The two screens showed a similar number of experimentally verified proteins (Figure 4), but the rate was much higher using the Salmonella-vfdb proteins (Supplementary Figure 1). Non-Salmonella-vfdb contains 3455 reference virulence proteins sourced from different species. While this multiple-source composition can introduce more biological signatures, thereby improving screening ability across different evolutionary backgrounds, it inevitably also introduces more non-specific biological signatures, which may increase the number of false positive predictions among the screened proteins.
To analyze this, we prepared five Reference_protein databases sourced from different public health relevant bacteria (Table 2). This included Salmonella (the target), and E. coli, Shigella, and Campylobacter jejuni, all Gram-negative bacteria classified as Gammaproteobacteria. For convenience, we named these Reference_protein databases the Gram-negative sets. In addition, we built Reference_protein databases for the Gram-positive bacteria Staphylococcus aureus and Clostridium, which belong to the phylum Firmicutes. We named these two databases the Gram-positive sets. As shown in Figure 4, the experimentally verified number of virulence associated proteins was always higher when using the Gram-negative sets as reference compared to the Gram-positive sets. This result indicated that the evolutionary relationship between query and reference proteins influences the outcome when using Duo for functional protein screening; the closer the evolutionary distance, the more specific the outcome will be. In addition, compared with the Non-Salmonella-vfdb database, the E. coli-vfdb and Shigella-vfdb databases showed better screening credibility (higher verification rate, Supplementary Figure 1). Based on all these observations, the general recommendation for preparing a Reference_proteins set for Duo is to make a custom Reference_proteins set that spans across species, but the evolutionary distance between the "Query-species" and the species included in the Reference_proteins should not be too large, and one should avoid Reference_proteins sets which mix too many attributes, as this may reduce the precision of predictions.
a Verified protein number indicates the number of experimentally verified virulence associated proteins in the genera.
b Total protein number is the number of protein coding sequences in the genome, corresponding to the number of query proteins for Duo.
c Verified rate is the percentage of experimentally verified virulence associated proteins among the total proteins (Verified rate = Verified protein number/Total protein number × 100%).
d The number of experimentally verified virulence associated proteins among the virulence associated proteins predicted by Duo.
e The predicted number of virulence associated proteins by Duo.
f Verified rate change is the change in verified rates before and after screen (Verified rate change = Verified rate (after screen) - Verified rate (before screen)). "+" means the verified rate has increased after screen, and "-" means the verified rate has dropped after screen.
g Verified pass rate is the rate of experimentally verified virulence associated proteins among the predicted virulence proteins from Duo (Verified pass rate = Verified protein number (after screen)/Verified protein number (before screen) × 100%).
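The rate definitions given in the table notes (verified rate, verified rate change, and verified pass rate) can be worked through in a short sketch. The function names and the example counts are hypothetical, and the sketch assumes that the verified rate after the screen is the number of verified proteins among the predictions divided by the number of predicted proteins, which the notes imply but do not state explicitly.

```python
def verified_rate(verified, total):
    """Percentage of experimentally verified proteins among `total` proteins."""
    return verified / total * 100.0


def screen_metrics(verified_before, total_before, verified_after, predicted_after):
    """Compute the screening metrics defined in the table notes.

    verified_before: verified virulence proteins in the whole genome (note a)
    total_before:    all protein coding sequences in the genome (note b)
    verified_after:  verified proteins among the predictions (note d)
    predicted_after: proteins predicted by the screen (note e)
    """
    rate_before = verified_rate(verified_before, total_before)
    # Assumption: the post-screen rate uses the predicted set as denominator.
    rate_after = verified_rate(verified_after, predicted_after)
    return {
        "verified_rate_before": rate_before,
        "verified_rate_after": rate_after,
        "verified_rate_change": rate_after - rate_before,
        "verified_pass_rate": verified_after / verified_before * 100.0,
    }


# Made-up counts for illustration only; they are not taken from Table 3.
metrics = screen_metrics(
    verified_before=90, total_before=4500, verified_after=80, predicted_after=1000
)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these invented counts, a positive verified rate change together with a high verified pass rate would mean that the screen removed mostly non-verified proteins while retaining most verified ones.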
Delta-Bitscore, an Auxiliary Reference Score for Evaluation of Functional Protein Screening
The delta-bitscore was first introduced as part of studies of Salmonella adaptation (Kingsley et al., 2013) and is a credible index to rank the functional similarity of orthologous genes (Clifford et al., 2004; Shihab et al., 2013; Liu et al., 2015; Wheeler et al., 2016). It is the absolute value of the bitscore difference between a Query_protein and a Reference_protein with the same matched biological signature. Duo calculates the delta-bitscore for every matched record between query and reference proteins. This is illustrated in Figure 5 based on data from experiment 2. The databases COILS, MOBIDB_LITE, PROSITE_PATTERNS, and SUPERFAMILY are incompatible with delta-bitscore measures, so no delta-bitscore results could be obtained for these. The results showed that for most of the Seed_databases (9/12), the median delta-bitscore was lower for experimentally verified virulence factors than for the unverified ones. This corresponds well to the fact that the lower the delta-bitscore, the higher the functional similarity between the query and reference protein (Clifford et al., 2004; Shihab et al., 2013; Liu et al., 2015; Wheeler et al., 2016). This result implies that delta-bitscore is a good tool to evaluate the precision of one's screen. According to the results in 6A, the delta-bitscore generally appeared more uniform among the verified proteins. In concordance with this, the standard deviation of delta-bitscores was lower for the experimentally verified virulence associated proteins than for the unverified ones with 11 out of the 12 databases. A low standard deviation indicates that the values tend to be close to the mean (Pearson and Henrici, 1997). This indicated that filtering the predicted proteins based on a sub-range of delta-bitscore may improve the precision of the functional protein prediction.
In summary, the delta-bitscore analysis on data from experiment 2 indicated the usability of this score in functional protein-prediction as a tool to focus the follow-up on predicted proteins with low delta-bitscore, and if the number of predicted target proteins is large, to use delta-bitscore in further filtering to concentrate on a fixed sub-range of delta-bitscore.
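The comparison and filtering described above can be sketched with the Python standard library. This is an illustration under assumptions, not the analysis script used in the study: the function names, the (protein ID, delta-bitscore) record layout, the example values, and the choice of the sample standard deviation are all hypothetical.

```python
from statistics import median, stdev


def summarize(deltas):
    """Median and sample standard deviation of a list of delta-bitscores."""
    return {"median": median(deltas), "sd": stdev(deltas)}


def filter_by_delta(records, low, high):
    """Keep (protein_id, delta_bitscore) records within [low, high]."""
    return [rec for rec in records if low <= rec[1] <= high]


# Made-up delta-bitscores for verified and unverified predictions.
verified = [2.0, 4.0, 6.0]
unverified = [1.0, 20.0, 60.0]
print("verified:", summarize(verified))
print("unverified:", summarize(unverified))

# Focus the follow-up on predictions inside a chosen sub-range.
predictions = [("Q1", 3.0), ("Q2", 45.0), ("Q3", 10.0)]
print(filter_by_delta(predictions, low=0.0, high=10.0))
```

The sub-range bounds would in practice be chosen from the observed distribution for the Seed_database in use; the values here are placeholders.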

Application of Duo on a Broad Spectrum of Bacteria
Duo is designed as an auxiliary tool to facilitate the correlation of biological signatures among proteins. Theoretically, Duo can be used to screen for proteins with specific characteristics in any organism. To validate in practice the feasibility of using Duo on a broad spectrum of bacteria, we selected one strain from each of 32 bacterial genera and screened the genomes for virulence associated proteins. The total number of protein-encoding sequences and the number of experimentally verified virulence associated proteins among them were counted, as summarized in Table 3. The results showed that for 31 out of 32 of the tested strains, the rate of verified virulence proteins increased with the screening. At the same time, for all but six of these 31 tested strains, the verified pass rate was over 70%. These results indicate that Duo mainly eliminates the non-virulence associated proteins (reflected by the increased verified rate after screen) and retains the virulence associated ones (reflected by the high verified pass rate after screen). The results support that Duo works well on a broad spectrum of bacteria. It is worth noting that even though Duo increased the verified rate in most bacteria, the rate was still at a relatively low level (less than 10%). This is because we applied Reference_proteins with multiple-source composition, which may reduce the screening specificity, as discussed in the section named "The influence of Reference_proteins on functional protein prediction." In most cases, users will have specific background knowledge about their research target, and thus they can use more specific and customized Reference_proteins input to achieve better predictions.

CONCLUSION
With the rapid advancement of sequencing technology (Kircher and Kelso, 2010; Goodwin et al., 2016), handling the enormous and constantly increasing amount of protein-encoding sequence data has become one of the most urgent demands in the scientific community (Mitchell et al., 2019). In this study, we present a biological signature-based method to batch-analyze the functional similarities of proteins, which we have named Duo. Duo provides an easy and effective way to batch-score the functional similarity between query and reference proteins. As a key utility, Duo allows users to screen proteins of unknown function for specific characteristics using free and customizable reference protein sequence inputs defined by the user. We anticipate that Duo will be a useful auxiliary tool for personalized and customized protein function research.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. This data can be found here: https://github.com/china-fix/Duo.

AUTHOR CONTRIBUTIONS
XF designed and implemented the workflow and carried out the majority of the analyses with input from JO and QL. XF and JO wrote the manuscript with input from XJ and QL. JO and XJ guided the research with input from QL. All authors contributed to the article and approved the submitted version.