The Landscape of Virus-Host Protein–Protein Interaction Databases

Knowledge of virus-host interactomes has advanced exponentially in the last decade by the use of high-throughput screening technologies to obtain a more comprehensive landscape of virus-host protein–protein interactions. In this article, we present a systematic review of the available virus-host protein–protein interaction database resources. The resources covered in this review are both generic virus-host protein–protein interaction databases and databases of protein–protein interactions for a specific virus or for those viruses that infect a particular host. The databases are reviewed on the basis of the specificity for a particular virus or host, the number of virus-host protein–protein interactions included, and the functionality in terms of browse, search, visualization, and download. Further, we also analyze the overlap of the databases, that is, the number of virus-host protein–protein interactions shared by the various databases, as well as the structure of the virus-host protein–protein interaction network, across viruses and hosts.


INTRODUCTION
Knowledge of virus-host interactomes has advanced exponentially in the last decade by the use of high-throughput screening technologies to obtain a more comprehensive landscape of virus-host protein-protein interactions (de Chassey et al., 2014; Sharma et al., 2015). Beyond physical methods such as affinity chromatography and coimmunoprecipitation (Phizicky and Fields, 1995), the development of methods such as the yeast two-hybrid system (Fields and Sternglanz, 1994) and affinity purification combined with mass spectrometry (Kim et al., 2010) has fostered the high-throughput identification and characterization of protein-protein interactions (Börnke, 2008). Interactions computationally predicted and experimentally validated using these techniques are available for proteins within single bacteria, viruses, and small and large eukaryotes (Zhang, 2009), and also for interactions between viral proteins and proteins of the host they infect (Brito and Pinney, 2017).
The databases are reviewed on the basis of the specificity for a particular virus or host, the number of virus-host protein-protein interactions included, and the functionality in terms of browse, search, visualization, and download. Further, we also analyze the overlap of the databases, that is, the number of virus-host protein-protein interactions shared by the various databases, as well as the structure of the virus-host protein-protein interaction network, across viruses and hosts.

Databases
For all the generic databases, we downloaded the virus-host protein-protein interaction data. [...] (Schoch et al., 2020) in order to determine their classification as virus or host proteins.
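The classification of a protein as a virus or host protein can be illustrated by walking up a taxonomic lineage. The following is a minimal sketch, not the procedure actually used in this review: it assumes a child-to-parent taxonomy table is available locally, and the table shown here is a toy one. Only the taxon identifiers 1 (root) and 10239 (Viruses) are real; all other identifiers, and the names `PARENT`, `lineage`, and `classify`, are hypothetical.

```python
# Toy taxonomy: child taxon ID -> parent taxon ID. Only 1 (root) and
# 10239 (Viruses) are real identifiers; the others are made up.
ROOT_TAXID = 1
VIRUSES_TAXID = 10239

PARENT = {
    10239: 1,        # Viruses -> root
    500001: 10239,   # hypothetical virus species
    500002: 500001,  # hypothetical strain of that species
    600001: 1,       # hypothetical host species (lineage collapsed)
}


def lineage(taxid, parent=PARENT):
    """Return the list of taxon IDs from taxid up to the root."""
    path = [taxid]
    while path[-1] != ROOT_TAXID:
        path.append(parent[path[-1]])
    return path


def classify(taxid, parent=PARENT):
    """Label a taxon 'virus' if Viruses is among its ancestors, else 'host'."""
    return "virus" if VIRUSES_TAXID in lineage(taxid, parent) else "host"


labels = {taxid: classify(taxid) for taxid in (500001, 500002, 600001)}
```

With a real taxonomy table, the same lineage walk could also be used to find the species-level ancestor of a strain.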

Datasets
In order to be able to analyze the overlap of the databases, we mapped all virus and host protein identifiers to UniProtKB-AC unique identifiers, using the programmatic access to the database identifier mapping service at https://www.uniprot.org/mapping/. Host protein identifiers in the Viruses.STRING database were also mapped to UniProtKB-AC unique identifiers using the mapping files available at https://version-10-5.string-db.org/mapping_files/uniprot_mappings/. Apart from discarding any virus-virus and host-host protein-protein interactions, some of the virus-host protein-protein interactions had to be discarded as well, because the corresponding virus or host protein identifiers could not be mapped to UniProtKB-AC in a unique way. We have also raised viral strains to the species (virus) level, in order to facilitate comparison of virus-host protein-protein interactions across the databases. The resulting virus-host protein-protein interaction datasets are summarized in Table 1 and further detailed below.
The 1,009 virus-host protein-protein interactions in the EBI-GOA-nonIntAct database contained 628 UniProtKB AC/ID identifiers, all of which were mapped to UniProtKB-AC in a unique way. This resulted in 534 unique virus-host protein-protein interactions among 173 unique proteins from 77 viruses and 455 unique proteins from 26 hosts.
The 28,473 virus-host protein-protein interactions in the BioGRID database contained 4 UniProtKB AC/ID identifiers, all of which were mapped to UniProtKB-AC in a unique way; 2 BioGRID identifiers, which could not be mapped to UniProtKB-AC; and 6,589 Entrez Gene (GeneID) identifiers, 3,007 of which were mapped to UniProtKB-AC in a unique way. This resulted in 5,157 unique virus-host protein-protein interactions among 50 unique proteins from 13 viruses and 2,101 unique proteins from 6 hosts.
The 10,907 virus-host protein-protein interactions in the VirusMentha database contained 4,347 UniProtKB AC/ID identifiers, 4,332 of which were mapped to 4,313 UniProtKB-AC in a unique way. This resulted in 10,626 unique virus-host protein-protein interactions among 627 unique proteins from 114 viruses and 3,624 unique proteins from 8 hosts.
The 26,443 virus-host protein-protein interactions in the IntAct database contained 10,282 UniProtKB AC/ID identifiers, 10,047 of which were mapped to UniProtKB-AC in a unique way. This resulted in 22,727 unique virus-host protein-protein interactions among 1,062 unique proteins from 197 viruses and 8,102 unique proteins from 68 hosts.
The 35,405 virus-host protein-protein interactions in the VirHostNet database contained 10,049 protein identifiers: 9,868 UniProtKB AC/ID identifiers, 9,717 of which were mapped to UniProtKB-AC in a unique way; 180 RefSeq Protein identifiers, 169 of which were mapped to UniProtKB-AC in a unique way; and one EMBL/GenBank/DDBJ identifier, which could not be mapped to UniProtKB-AC. This resulted in 28,132 unique virus-host protein-protein interactions among 984 unique proteins from 128 viruses and 7,361 unique proteins from 6 hosts.
The 51,216 virus-host protein-protein interactions in the HPIDB database contained 19,784 protein identifiers: 16,465 UniProtKB AC/ID identifiers, 16,295 of which were mapped to UniProtKB-AC in a unique way; 3,106 Entrez Gene (GeneID) identifiers, 1,928 of which were mapped to UniProtKB-AC in a unique way; 110 RefSeq Protein identifiers, 86 of which were mapped to UniProtKB-AC in a unique way; four EMBL/GenBank/DDBJ identifiers, one of which was mapped to UniProtKB-AC in a unique way; two Ensembl Protein identifiers, one of which was mapped to UniProtKB-AC in a unique way; one Ensembl Genomes Protein identifier, which was mapped to UniProtKB-AC in a unique way; and 96 IntAct identifiers, none of which could be mapped to UniProtKB-AC in a unique way. This resulted in 33,906 unique virus-host protein-protein interactions among 1,387 unique proteins from 205 viruses and 7,570 unique proteins from 36 hosts.
The 330,136 virus-host protein-protein interactions in the Viruses.STRING database contained 41,490 protein identifiers: 29,236 Ensembl Protein identifiers, 29,093 of which were mapped to UniProtKB-AC in a unique way; 1,371 Ensembl Genomes Protein identifiers, 1,212 of which were mapped to UniProtKB-AC in a unique way; and 131 UniProtKB AC/ID identifiers, all of which were mapped to UniProtKB-AC in a unique way. None of the remaining 10,752 identifiers could be mapped to UniProtKB-AC in a unique way. However, using the aforementioned mapping files, 37,395 host protein identifiers were mapped to UniProtKB-AC in a unique way. Combining the two approaches, this resulted in 242,784 unique virus-host protein-protein interactions among 1,703 unique proteins from 186 viruses and 52,440 unique proteins from 61 hosts.
The 621 virus-host protein-protein interactions in the virus-specific HCVpro database contained 487 protein identifiers, 145 of which were mapped to UniProtKB-AC in a unique way. This resulted in 140 unique virus-host protein-protein interactions among 7 unique Hepatitis C virus proteins and 138 unique human proteins.
The 1,036 virus-host protein-protein interactions in the host-specific VirusMINT database contained 706 gene identifiers and 706 protein identifiers. Only 993 of the 1,412 gene and protein identifiers were mapped to UniProtKB-AC in a unique way. This resulted in 391 unique virus-host protein-protein interactions among 287 unique proteins from 43 viruses and 287 unique human proteins.
The 52,976 virus-host protein-protein interactions in the host-specific PHISTO database contained 8,212 UniProtKB AC/ID identifiers, 8,167 of which were mapped to UniProtKB-AC in a unique way. This resulted in 39,010 unique virus-host protein-protein interactions among 1,700 unique proteins from 182 viruses and 6,520 unique proteins from one host.
Finally, the 48,643 virus-host protein-protein interactions in the host-specific HVIDB database contained 9,900 protein identifiers, 9,699 of which were mapped to UniProtKB-AC in a unique way. This resulted in 44,590 unique virus-host protein-protein interactions among 1,939 unique proteins from 737 viruses and 7,437 unique human proteins.
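The filtering steps described in this section (keeping only identifiers that map uniquely, discarding virus-virus and host-host pairs, and removing duplicates) can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the code used to build the datasets: "mapped in a unique way" is taken to mean that a source identifier maps to exactly one accession, and the identifiers, accessions, and function names are all hypothetical.

```python
from collections import defaultdict


def unique_mapping(pairs):
    """Keep only source identifiers that map to exactly one accession."""
    targets = defaultdict(set)
    for source_id, accession in pairs:
        targets[source_id].add(accession)
    return {s: next(iter(t)) for s, t in targets.items() if len(t) == 1}


def virus_host_interactions(interactions, mapping, kind):
    """Map, filter, and deduplicate interactions.

    interactions: iterable of (id_a, id_b) in a database's own identifiers
    mapping: dict from database identifier to a unique accession
    kind: dict from accession to 'virus' or 'host'
    """
    result = set()
    for id_a, id_b in interactions:
        if id_a not in mapping or id_b not in mapping:
            continue  # an identifier could not be mapped uniquely
        a, b = mapping[id_a], mapping[id_b]
        if {kind.get(a), kind.get(b)} != {"virus", "host"}:
            continue  # virus-virus, host-host, or unclassified pair
        virus, host = (a, b) if kind[a] == "virus" else (b, a)
        result.add((virus, host))  # stored as (virus, host), so A-B equals B-A
    return result


# Hypothetical identifiers: g3 is ambiguous because it maps to two accessions.
pairs = [("g1", "V0001"), ("g2", "H0001"), ("g3", "H0002"),
         ("g3", "H0003"), ("g4", "H0004")]
kind = {"V0001": "virus", "H0001": "host", "H0002": "host",
        "H0003": "host", "H0004": "host"}
raw = [("g1", "g2"), ("g2", "g1"), ("g1", "g3"), ("g2", "g4")]

mapping = unique_mapping(pairs)
dataset = virus_host_interactions(raw, mapping, kind)
```

In this toy input, the two orientations of the g1-g2 interaction collapse into one, the g1-g3 interaction is dropped because g3 is ambiguous, and the g2-g4 interaction is dropped as host-host.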

Functionality of the Databases
All the databases support, to some extent, browsing, searching, visualization, and download. While EBI-GOA-nonIntAct, BioGRID, VirusMentha, IntAct, and HPIDB only allow for browsing search results, VirHostNet allows for browsing the database by virus lineage (Baltimore class, family, species, and taxon) and by UniProtKB keyword annotation, and Viruses.STRING has no browsing facilities, although it allows for searching by virus or host name.
EBI-GOA-nonIntAct allows for searching over the entire database using a query language based on the PSI-MITAB format (Kerrien et al., 2007), using the PSICQUIC web service (del Toro et al., 2013). BioGRID allows for searching by gene name, publication identifier, and full text search using a simple query language. IntAct allows for searching by gene name, UniProtKB identifier, taxon identifier, publication identifier, and Gene Ontology terms. VirusMentha allows for searching by gene name, UniProtKB identifier, and keyword annotation, over the entire database or for a specific virus family or host. VirHostNet allows for searching by UniProtKB identifier, name, keyword annotation, virus lineage (species or taxon), and PubMed identifier (PMID), and also allows for BLASTP (Altschul et al., 1990) searches in a database of interacting protein sequences. HPIDB allows for regular expression searching by protein accession number or name, species or taxon identifier or name, PubMed identifier (PMID) or author name, and interaction type. Viruses.STRING allows for searching by protein, virus, and host name.
For the virus-specific and the host-specific databases, HCVpro allows for browsing by virus (Hepatitis C) protein name or host (human) protein name or chromosome, virus protein identifier, interaction type, and PMID, as well as for searching by host protein name or gene identifier. VirusMINT has no browse, search, or visualization facilities, as the resource at http://mint.bio.uniroma2.it/virusmint/ is no longer available. PHISTO allows for browsing by virus family and species, and searching by taxon identifier, virus name, virus or host protein name or UniProtKB identifier, experimental method, and PMID. HVIDB allows for browsing by viral family, and searching by UniProtKB identifier, UniProtKB entry name, gene identifier, gene name, protein name, and keyword annotation.
Download facilities differ among the various databases. For the generic databases, EBI-GOA-nonIntAct allows for downloading a single tab-separated (TSV) text file with all the interactions stored in the database, as the result of a query to the PSICQUIC web service. BioGRID allows for downloading a single text file, in PSI-MITAB format, with all the interactions stored in the database. VirusMentha allows for downloading a zip file containing a single semicolon-separated text file for each of the 8 hosts and for each of the 25 families of viruses covered in the database, and these zip files are updated every week. IntAct also allows for downloading a single text file in PSI-MITAB format with all the interactions stored in the database. VirHostNet also allows for downloading a single tab-separated text file with all the interactions stored in the database. HPIDB also allows for downloading a single text file in PSI-MITAB format with all the interactions stored in the database. Viruses.STRING allows for downloading a tar-gzip-compressed folder containing a single space-separated text file with either all the interactions stored in the database, or only those for a particular virus or host. On the other hand, for the virus-specific and the host-specific databases, all of them allow for downloading a single comma-separated (CSV) (for PHISTO) or tab-separated (for HCVpro, VirusMINT, and HVIDB) text file with all the virus-host interactions stored in the corresponding database. The main features of the various databases are summarized in Table 2.
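Since several of the databases distribute their data in PSI-MITAB format, a minimal reader for such files may be useful. The sketch below assumes the PSI-MITAB 2.5 column layout, in which the first two columns hold the interactor identifiers and columns 10 and 11 hold the taxon identifiers in the form `taxid:N(name)`. The sample line is made up (only `taxid:9606` for human and the PSI-MI term `MI:0018` for two hybrid are real), and `read_mitab` is a hypothetical helper, not part of any database's tooling.

```python
import csv
import io
import re

TAXID = re.compile(r"taxid:(-?\d+)")


def read_mitab(handle):
    """Yield (id_a, id_b, taxid_a, taxid_b) from PSI-MITAB 2.5 lines."""
    # QUOTE_NONE because MITAB fields may contain literal double quotes.
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row or row[0].startswith("#") or len(row) < 11:
            continue  # skip blank lines, header comments, and short rows
        match_a, match_b = TAXID.search(row[9]), TAXID.search(row[10])
        if match_a and match_b:
            yield row[0], row[1], int(match_a.group(1)), int(match_b.group(1))


# One made-up interaction with 15 tab-separated columns.
fields = ["uniprotkb:V00001", "uniprotkb:H00001", "-", "-", "-", "-",
          'psi-mi:"MI:0018"(two hybrid)', "-", "pubmed:0",
          "taxid:500001(toy virus)", "taxid:9606(human)",
          "-", "-", "-", "-"]
sample = "#ID(s) interactor A\tID(s) interactor B\n" + "\t".join(fields) + "\n"

records = list(read_mitab(io.StringIO(sample)))
```

The semicolon-, space-, and comma-separated downloads mentioned above would need a different delimiter and column layout.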

Structure of the Virus-Host Protein-Protein Interaction Networks
The structure of biological networks in general, and protein-protein interaction networks in particular, can be analyzed by means of topological measures (Börnke, 2008; Steuer and López, 2008; Zhang and Hwang, 2009; Gaudelet and Pržulj, 2019; Hauschild et al., 2019). We show next that, under several of these topological measures, virus-host protein-protein interaction networks do not differ much from other protein-protein interaction networks.
Protein-protein interaction networks usually consist of a large component that fills most of the network, with the rest of the network divided into a large number of small components (Newman, 2018). Table 3 shows the size (number of nodes and edges), the number of connected components, the distribution of component sizes, and the average path length for the generic, virus-specific, and host-specific virus-host protein-protein interaction networks. These data show that virus-host protein-protein interaction networks also consist of a large component and a large number of small components, all of small average path length.
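The quantities reported for each network (connected components, component sizes, and average path length) can be computed with breadth-first search. The following is a minimal sketch on a toy edge list, assuming an undirected, unweighted network and taking the average path length of a component to be the mean shortest-path length over all pairs of distinct nodes in it; the function names and node labels are hypothetical, and the review does not state which implementation was used.

```python
from collections import deque


def adjacency(edges):
    """Build an undirected adjacency map from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def bfs_distances(adj, source):
    """Shortest-path lengths from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def components(adj):
    """Connected components as node sets, largest first."""
    seen, result = set(), []
    for node in adj:
        if node not in seen:
            comp = set(bfs_distances(adj, node))
            seen |= comp
            result.append(comp)
    return sorted(result, key=len, reverse=True)


def average_path_length(adj, comp):
    """Mean shortest-path length over ordered pairs of distinct nodes."""
    n = len(comp)
    if n < 2:
        return 0.0
    total = sum(sum(bfs_distances(adj, s).values()) for s in comp)
    return total / (n * (n - 1))


# Toy network: one viral protein with three host partners, plus one pair.
edges = [("v1", "h1"), ("v1", "h2"), ("v1", "h3"), ("v2", "h4")]
adj = adjacency(edges)
comps = components(adj)
sizes = [len(c) for c in comps]
lengths = [average_path_length(adj, c) for c in comps]
```

Running one breadth-first search per node takes time proportional to the number of nodes times the number of edges, which is workable for a sketch but slow on the largest networks considered here.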
The degree of a node in a network is the number of edges attached to it, and the degree distribution of a network is the fraction $p_k$ of the nodes that have degree $k$, for every $k$. Thus, $p_k$ is the probability that a randomly chosen node in the network has degree $k$, and the degree distribution measures the frequency with which nodes of different degrees appear in the network (Newman, 2018).
Biological networks tend to have degree distributions that follow a power law of the form $p_k \sim k^{-\gamma}$ for some positive constant $\gamma$, which appears as a straight line with negative slope when plotted on logarithmic scales. Figure 2 shows a scatter plot of the degree distribution, in logarithmic scale, for all but the two smallest virus-host protein-protein interaction networks. As can be seen therein, the degree distribution of virus-host protein-protein interaction networks follows a power law, that is, they are scale-free networks. The same behavior has been observed in other protein-protein interaction networks (Jeong et al., 2001; Barabási and Oltvai, 2004).
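Taking logarithms of the power law gives $\log p_k = -\gamma \log k + c$, so the slope of a straight line fitted to the log-log plot gives a rough estimate of $-\gamma$. The sketch below computes the degree distribution of an edge list and fits such a line by ordinary least squares. It is only an illustration: it assumes each interaction appears once in the edge list, the function names are hypothetical, and a least-squares fit on logarithmic scales is a simple swapped-in estimator, not a method stated in this review.

```python
import math
from collections import Counter


def degree_distribution(edges):
    """Return {k: p_k}, the fraction of nodes having each degree k."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n = len(degree)
    counts = Counter(degree.values())
    return {k: c / n for k, c in sorted(counts.items())}


def loglog_slope(p):
    """Least-squares slope of log p_k against log k (an estimate of -gamma)."""
    xs = [math.log(k) for k in p]
    ys = [math.log(pk) for pk in p.values()]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy network: five nodes of degree 1 and one node of degree 3.
edges = [("v1", "h1"), ("v1", "h2"), ("v1", "h3"), ("v2", "h4")]
p = degree_distribution(edges)

# Sanity check on an exact power law with gamma = 2.
exact = {k: k ** -2.0 for k in range(1, 50)}
gamma_estimate = -loglog_slope(exact)
```

At least two distinct degrees are needed for the fit, and on real data the sparse high-degree tail can bias the estimate, so the result is best read as a rough indication.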
These structural properties of virus-host protein-protein interaction networks also characterize the networks for a specific virus or for the viruses that infect a specific host. Table 4 shows the size (number of nodes and edges), the number of connected components, the distribution of component sizes, and the average path length of the virus-host protein-protein interaction network for the Influenza A virus. This virus-specific network also consists of a large component and a large number of small components, all of small average path length, although the number of small components is smaller and the average path length is larger than in the whole virus-host protein-protein interaction networks.

Overlap of the Datasets
Most of the databases contain interactions derived from literature curation and from the other databases and thus, their overlap in terms of common proteins and [...] The overlap among any three or more generic datasets is even smaller. For example, while 8,505 of the 43,944 interactions in VirusMentha, IntAct, and HPIDB are shared by the three [...] This is all summarized in the set intersection diagram shown in Figure 3, which was obtained using a Python implementation of the UpSet tool (Lex et al., 2014). The overlap across the datasets is also small in the virus-host protein-protein interaction networks for the Influenza A virus, as shown in the set intersection diagram in Figure 4.
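The counts behind a set intersection diagram can be illustrated with plain Python sets. The sketch below is not the UpSet implementation used for the figures: it assumes each dataset is a set of (virus protein, host protein) pairs keyed by dataset name, and the dataset names, pairs, and function names are hypothetical. It computes, for every combination of datasets, the number of interactions found in exactly that combination, along with the number shared by all datasets out of their union.

```python
from collections import Counter


def exclusive_intersections(datasets):
    """Count interactions by the exact set of datasets that contain them."""
    membership = {}
    for name, interactions in datasets.items():
        for interaction in interactions:
            membership.setdefault(interaction, set()).add(name)
    return Counter(frozenset(names) for names in membership.values())


def shared_by_all(datasets):
    """Return (size of the intersection, size of the union)."""
    sets = list(datasets.values())
    return len(set.intersection(*sets)), len(set.union(*sets))


# Toy datasets of (virus protein, host protein) pairs.
datasets = {
    "A": {("v1", "h1"), ("v1", "h2"), ("v1", "h3")},
    "B": {("v1", "h2"), ("v1", "h3"), ("v2", "h4")},
    "C": {("v1", "h3"), ("v2", "h5")},
}

counts = exclusive_intersections(datasets)
shared, total = shared_by_all(datasets)
```

In the toy example, one interaction out of five is shared by all three datasets, and each of the remaining four appears in exactly one combination of datasets.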
The centrality of proteins and interactions in the virus-host protein-protein interaction networks can also be studied by means of topological measures, in order to establish whether the networks overlap on central or on peripheral proteins and interactions. For example, the centrality of a virus-host protein-protein interaction can be measured by means of the betweenness centrality of the corresponding edge in the virus-host protein-protein interaction network, which is the sum of the fraction of all-pairs shortest paths in the network that contain the edge (Brandes, 2008). However, visual inspection of the virus-host protein-protein interaction networks, as shown in Figure 5 for the Viruses.STRING dataset along with all the other datasets, suffices to determine that they overlap on peripheral, as opposed to central, interactions. The overlap on peripheral proteins and interactions is even clearer in the virus-host protein-protein interaction networks for the Influenza A virus in the Viruses.STRING and VirusMentha datasets, shown in Figure 6.
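Edge betweenness as defined above can be computed with a breadth-first search from every node followed by a backward accumulation step, in the style of Brandes' algorithm. The following is a minimal sketch for an undirected, unweighted network without self-loops, where each unordered pair of nodes is counted once and the scores are left unnormalized; the function name is hypothetical, and no such computation was needed for the visual comparison reported in this review.

```python
from collections import deque


def edge_betweenness(edges):
    """Unnormalized edge betweenness of an undirected, unweighted graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    score = {frozenset(e): 0.0 for e in edges}
    for s in adj:
        # Breadth-first search counting shortest paths (sigma) from s.
        dist, sigma, preds, order = {s: 0}, {s: 1}, {s: []}, []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    sigma[v] = 0
                    preds[v] = []
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        # Back-propagate dependencies from the farthest nodes toward s.
        delta = dict.fromkeys(order, 0.0)
        for w in reversed(order):
            for u in preds[w]:
                share = sigma[u] / sigma[w] * (1.0 + delta[w])
                score[frozenset((u, w))] += share
                delta[u] += share
    # Every unordered pair was visited once from each of its endpoints.
    return {edge: value / 2.0 for edge, value in score.items()}


# A path a-b-c and a four-node cycle as small checks.
path_scores = edge_betweenness([("a", "b"), ("b", "c")])
cycle_scores = edge_betweenness([("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")])
```

In the four-node cycle, each edge lies on the single shortest path between its own endpoints and on half of the shortest paths between each of the two opposite pairs, which is why every edge receives the same score.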

DISCUSSION
Central to the comparative review of the available virus-host protein-protein interaction database resources is the mapping of the virus and host protein identifiers used in each of the databases to unique protein identifiers. The reader may be familiar with the good old six-symbol unique identifiers found in the UniProtKB-AC database (The UniProt Consortium, 2017). There are about 30 million 6-symbol and about 200 million 10-symbol identifiers stored therein now, which comes as a surprise since unique identifiers made up of six letters and digits would suffice to store over two billion proteins. Nevertheless, the comparative analysis of virus-host protein-protein interaction databases requires mapping proteins to unique protein identifiers such as those in UniProtKB-AC.
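As a quick check of the figure of over two billion, assume that each of the six positions may hold any of the 26 letters or 10 digits. This is an upper bound, since the actual accession format is more restrictive:

\[
(26 + 10)^6 = 36^6 = 2{,}176{,}782{,}336 > 2 \times 10^9 .
\]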
While some of the databases include such a mapping, it is in general neither complete nor up-to-date. The mapping problem is not trivial, as the virus and host protein identifiers used in the databases do not always map to unique protein identifiers. Moreover, some of the databases even include proteins annotated to multiple organisms, such as HVIDB, which has 552 unique proteins in 10,689 interactions annotated to multiple organisms, often along the same lineage. Thus, the identifier mapping problem can only be partially solved, and about 25% of the proteins in the generic, virus-specific, and host-specific databases had to be discarded because they could not be mapped to unique UniProtKB-AC identifiers.
Overall, the generic, virus-specific, and host-specific databases have very good search and visualization facilities. However, when it comes to downloading protein-protein interaction data for further use, most of the databases have their own protein identifiers and include only partial, if any, unique mappings to UniProtKB-AC. Indeed, once the protein identifiers in the various databases have been mapped to UniProtKB-AC identifiers, the resulting datasets have a rather small overlap. For example, while 14.27% of the interactions in BioGRID, 31.84% of the interactions in EBI-GOA-nonIntAct, 61.90% of the interactions in IntAct, 84.60% of the interactions in VirHostNet, and 84.71% of the interactions in VirusMentha are also found in HPIDB, only 4.55% of the interactions in BioGRID, 5.30% of the interactions in VirHostNet, 6.55% of the interactions in EBI-GOA-nonIntAct, 12.41% of the interactions in HPIDB, 16.76% of the interactions in IntAct, and 41.64% of the interactions in VirusMentha are also found in Viruses.STRING.
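Percentages of the kind quoted above are asymmetric: the share of one dataset found in another generally differs from the share in the opposite direction. A minimal sketch of the computation follows, assuming each dataset is a set of interactions; the toy datasets and the names `percent_found_in` and `pairwise_percentages` are hypothetical.

```python
def percent_found_in(a, b):
    """Percentage of the interactions in a that are also found in b."""
    return 100.0 * len(a & b) / len(a) if a else 0.0


def pairwise_percentages(datasets):
    """Map each ordered pair (x, y) to the percentage of x found in y."""
    return {(x, y): percent_found_in(datasets[x], datasets[y])
            for x in datasets for y in datasets if x != y}


# Toy datasets: every interaction in "small" is in "large", but not conversely.
datasets = {
    "small": {("v1", "h1"), ("v1", "h2")},
    "large": {("v1", "h1"), ("v1", "h2"), ("v2", "h3"), ("v2", "h4")},
}
table = pairwise_percentages(datasets)
```

In this example, all of the small dataset is found in the large one, but only half of the large dataset is found in the small one.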
Further, the structural analysis of the virus-host protein-protein interaction networks showed that the databases overlap mostly on peripheral interactions, and the central interactions in the networks are not shared among the databases. This comes as a surprise, because essential proteins are known to have higher centrality in a protein-protein interaction network than the network average (Jeong et al., 2001; Raman et al., 2014) and thus, central proteins and interactions are more widely studied and more likely to be reflected in virus-host protein-protein interaction databases than peripheral proteins and interactions. The structural analysis of the virus-host protein-protein interaction network for the Influenza A virus, on the other hand, showed that it has a smaller number of small components and a larger average path length than the other virus-host protein-protein interaction networks, which can be explained by Influenza A being a widely studied virus, with a larger fraction of the virus-host protein-protein interactions reflected in the databases.

DATA AVAILABILITY STATEMENT
The datasets generated for this study (virus-host protein-protein interactions) are available in the Supplementary Material.

AUTHOR CONTRIBUTIONS
The author confirms being the sole contributor of this work and has approved it for publication.

FUNDING
This research was partially supported by the Spanish Ministry of Science and Innovation, and the European Regional Development Fund, through project PID2021-126114NB-C44 (FEDER/MICINN/AEI).