Bacterial hypothetical proteins may be of functional interest

Genomic analysis is part of the daily routine for many microbiology researchers. These analyses frequently unveil genes that encode proteins with uncertain functions, and for many bacterial species, these unknown genes constitute a significant proportion of their genomic coding sequences. Because these genes do not have defined functions, they are often overlooked in analyses. Experimentally determining the function of a gene can be challenging; however, ongoing advancements in bioinformatics tools, especially in protein structural analysis, are making it progressively easier to assign functions to hypothetical sequences. Leveraging various complementary tools and automated pipelines for annotating hypothetical sequences could ultimately enhance our comprehension of microbial functions and provide direction for new laboratory experiments.


1 Introduction
High-throughput sequencing has made its mark in the biological sciences, and it is now possible to sequence complete genomic DNA samples at low cost. Genomic analyses are currently routine in bacteriology and can be used, among other things, to identify new organisms (Richter and Rosselló-Móra, 2009) and virulence genes (Li et al., 2018) and to track the spread of antibiotic resistance determinants (Orlek et al., 2023). This explosion in sequences has required the creation of databases to archive, classify, and make genomic data accessible. Some databases are generalists, such as the well-known RefSeq (O'Leary et al., 2016), which includes non-redundant, well-annotated sequences. Others are specialized and contain sequences from model organisms or organisms that have been extensively studied, such as Pseudomonas (Winsor et al., 2016) or Mycobacterium (Kapopoulou et al., 2011). Finally, some databases, such as EggNOG (Huerta-Cepas et al., 2019) or STRING (Szklarczyk et al., 2015), are geared toward the functional categorization of genes. These latter databases are particularly useful for determining the role of proteins and possibly their interactions in a network. Naturally, functional databases require more precise gene sequence information than generalist databases.
Generally, functional characterization of a gene involves modifying the gene by molecular biology techniques and then observing differences between the mutant strain (with the modified gene) and the parental strain. Although increasingly effective strategies, such as CRISPR-Cas9 (Doudna and Charpentier, 2014), are available, the characterization of a gene can still be arduous and time-consuming. For example, the integration of exogenous genetic material into cells, required for genetic manipulation, can be difficult in little-studied organisms for which few or no protocols are available. In some cases, the effect caused by gene alteration may be subtle and difficult to observe. Finally, detecting differences when modifying an essential gene is often impossible because these changes can be fatal for the cell.
Sequence similarity searches using bioinformatics tools such as BLAST (Altschul et al., 1990) or DIAMOND (Buchfink et al., 2015) enable the function of genes to be inferred from their evolutionary proximity to other known genes. This principle of inference underlies all automatic annotation tools, such as Prokka (Seemann, 2014), Bakta (Schwengers et al., 2021), and RAST (Aziz et al., 2008). Homology is also useful for assigning functions to genes in organisms that are difficult to manipulate genetically. However, to assign a function to a gene with these tools, at least one evolutionarily close sequence must already have been characterized and be in the database used.
A homology search may find no homologous sequence at all. The gene is then considered an ORFan (Fischer and Eisenberg, 1999), and may be either a chance open reading frame (ORF) that codes for nothing, or a real gene identified for the first time. However, homology searches more commonly identify sequences with no known function. These gene sequences are generally considered to code for hypothetical proteins.
2 Hypothetical proteins: the case of Escherichia coli
By October 2023, the RefSeq database included approximately 5,000,000 protein sequences from Escherichia coli, one of the most studied organisms. The genome of the reference strain E. coli O157:H7 str. Sakai (RefSeq GCF_000008865.2) contains 5155 protein-coding genes. However, the more genomic sequences are available, the more genes are listed, because each strain, having a life of its own, may have acquired genes horizontally from other bacteria. By the same date, just over 35,000 genomic assemblies were available for E. coli. In addition to the large number of sequences available, E. coli is known to have an open pan-genome, meaning that it readily acquires genes from other bacteria in its environment (Rasko et al., 2008).
Of the 5,000,000 protein sequences, approximately 500,000 (10%) correspond to hypothetical proteins. These hypothetical protein sequences vary in length (Figure 1), and some are far too long to be plausible chance open reading frames. Several sequences even exceed 1000 amino acids (AAs) in length, whereas an average bacterial gene is around 1000 bp (~333 AAs) long. The longest sequence (RefSeq WP_301221190.1) is 7556 AAs. BLASTP analysis against the nr/nt database identified that the sequence of this protein is also present in different species of Staphylococcus, in addition to E. coli.
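A length distribution of this kind can be tabulated with a few lines of code. The following is a minimal sketch, not the procedure used for Figure 1: it assumes a protein FASTA file whose header lines contain the product name "hypothetical protein", and the function names, keyword, and bin size are illustrative choices.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def hypothetical_length_histogram(
    lines: Iterable[str],
    bin_size: int = 100,
    keyword: str = "hypothetical protein",  # assumed header wording
) -> Dict[int, int]:
    """Map the start of each length bin (in AAs) to the number of
    sequences whose header mentions the keyword."""
    bins: Counter = Counter()
    for header, seq in read_fasta(lines):
        if keyword in header.lower():
            bins[(len(seq) // bin_size) * bin_size] += 1
    return dict(sorted(bins.items()))
```

With a real dataset, an open file handle can be passed directly as `lines`; the resulting dictionary can then be plotted as a bar chart.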
Interestingly, a re-annotation of the hypothetical protein sequences with a local installation of the eggNOG-mapper tool (Cantalapiedra et al., 2021) revealed a categorization for 145,225 sequences, i.e., almost 30% of the hypothetical sequences (Figure 2). Unfortunately, the category with the most sequences was "S, function unknown," indicating that the sequences have several homologs in the EggNOG database, but their functions are also unknown. Despite this, many sequences are thought to be involved in cell membrane biogenesis and metabolism. Even if it is impossible to assign a clear function to the sequences in the "S" category, a description is often offered by the tool, which can help to determine a potential role for these proteins. For example, of the 44,441 sequences in the "S" category, 5625 (~12.7%) have the term "phage" in the description, suggesting a viral origin.
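Tallies such as these can be derived from the tab-separated annotation table written by eggNOG-mapper. The sketch below is illustrative rather than the analysis performed here. It assumes that comment lines start with "##", that the header row provides columns named "COG_category" and "Description", and that unannotated fields contain "-". Counting a multi-letter category once per letter is also an assumption of this sketch.

```python
import csv
from collections import Counter
from typing import Iterable, Tuple


def summarize_emapper(
    lines: Iterable[str], keyword: str = "phage"
) -> Tuple[Counter, int]:
    """Return (counts per COG category letter, number of 'S' rows whose
    description mentions the keyword)."""
    # Drop '##' comment lines; the remaining first line is the header row.
    rows = (line for line in lines if not line.startswith("##"))
    reader = csv.DictReader(rows, delimiter="\t", quoting=csv.QUOTE_NONE)
    categories: Counter = Counter()
    keyword_hits = 0
    for row in reader:
        category = row.get("COG_category") or "-"
        if category == "-":
            continue  # no functional category assigned
        for letter in category:  # e.g. "EG" counts toward both E and G
            categories[letter] += 1
        description = (row.get("Description") or "").lower()
        if "S" in category and keyword in description:
            keyword_hits += 1
    return categories, keyword_hits
```

Calling `summarize_emapper(open(path))` on an annotation file returns the category counts, from which the most populated categories can be listed with `categories.most_common(10)`.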
3 Should we be interested in hypothetical proteins?
Many bioinformatics analyses require the investigation of a multitude of bacterial genes (e.g., mutation screening, differential gene expression). Typically, these analyses generate a list of genes of interest. The reflex is to look only at known genes, especially those that might have a link with the reason for the analysis, and to assume that the hypothetical protein-coding genes are non-existent or negligible from a biological point of view. However, if the analysis has identified these genes, they may be of interest.
Some research groups have demonstrated the value of investigating genes coding for hypothetical proteins. For example, Rahman et al. identified genes involved in the adaptation of Bacillus paralicheniformis and other genes of potential biotechnological interest among the genes coding for hypothetical proteins (Rahman et al., 2022). In 2020, Araujo et al. characterized genes coding for hypothetical proteins that could be involved in the pathogenesis of the bacterium Corynebacterium pseudotuberculosis (Araujo et al., 2020). A 2017 study also demonstrated that characterizing the coding sequences for hypothetical proteins in Mycobacterium tuberculosis, one of the deadliest bacteria in humans, was of interest in providing new therapeutic targets (Raj et al., 2017). Similar discoveries have been made in eukaryotic organisms. For example, Silva et al. identified a hypothetical protein in Penicillium rubens as having an important role in glucose/galactose metabolism (Silva et al., 2020). Finally, in a recent study, the Q6S8D9_SARS protein of the virus SARS-CoV was determined to potentially alter the host antiviral inflammatory cytokine and interferon production pathways (Rahman et al., 2023), demonstrating that hypothetical viral proteins may also be of interest to investigate.

4 Discussion and perspectives
When a gene is no longer needed by a bacterium, it tends to accumulate mutations due to reduced conservation pressure and drift into a pseudogene that is quickly eliminated (Kuo and Ochman, 2010). Therefore, if genes coding for hypothetical proteins are maintained, it is reasonable to believe that they have a function necessary for the proper functioning or survival of the cell. Several lines of evidence can support the importance of a gene coding for a hypothetical protein: the gene is too long to be an adventitious reading frame, homologous sequences are found in several organisms, and a transcript is present.
One of the challenges with hypothetical protein-coding genes is assigning functions to them efficiently and with a good degree of certainty. As previously demonstrated by the E. coli example, it may be possible to assign putative functions to several proteins encoded by these genes using various bioinformatics tools. Not all tools use the same databases, algorithms, and criteria to find homologous sequences. Using different, complementary tools can, therefore, lead to better functional annotation. This rationale for using complementary tools was addressed in 2019 by Ijaq et al., who proposed a nine-point classification to help assign function to hypothetical proteins (Ijaq et al., 2019). In addition to sequence homology annotation against different databases, the authors also proposed the use of other tools to infer protein-protein relationships, cellular localization, and protein structures.
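One simple way to combine the output of complementary tools is to keep, for each protein, the informative label supported by the largest number of tools. The sketch below illustrates this idea only; it is not the classification of Ijaq et al. The function name and the set of labels treated as uninformative are assumptions. Labels are compared after basic normalization (lower case, trimmed), which is cruder than a real comparison would need to be, because different tools rarely word the same function identically.

```python
from collections import Counter
from typing import Mapping, Optional, Tuple

# Labels treated as carrying no functional information (illustrative list).
UNINFORMATIVE = {"", "-", "hypothetical protein", "function unknown"}


def consensus_annotation(
    calls: Mapping[str, Optional[str]]
) -> Tuple[Optional[str], int]:
    """Given tool name -> predicted label for one protein, return the most
    frequently supported informative label and the number of tools that
    support it, or (None, 0) if no tool gives an informative label."""
    informative = []
    for label in calls.values():
        normalized = (label or "").strip().lower()
        if normalized not in UNINFORMATIVE:
            informative.append(normalized)
    if not informative:
        return None, 0
    label, support = Counter(informative).most_common(1)[0]
    return label, support
```

A protein for which two of three tools agree on a label would be returned with a support of 2. Proteins with a support of 1 could be flagged for closer inspection rather than accepted outright.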
The increasing availability of genomic and metagenomic sequences, coupled with advancing computing power, allows for the implementation of large-scale strategies to investigate protein distribution, such as clustering proteins into homologous groups. A recent endeavor clustered 415,971,742 genes predicted from 1749 metagenomes and 28,941 bacterial and archaeal genomes into 2,940,257 high-quality clusters (Vanni et al., 2022), with 43% of clusters identified as unknowns. This information can be important because hypothetical protein families that are conserved in multiple genomes are likely to be functional. Inferring the taxonomy of the microorganisms carrying these unknown families, and considering the environment where they were found, can provide valuable insights. Unknown families typically exhibit narrower taxonomic and ecological distributions compared with known families, indicating their potential significance for niche adaptation (Coelho et al., 2022; Vanni et al., 2022). Intriguingly, another recent study revealed that many families of proteins are conserved in archaeal groups, suggesting their importance in the emergence and diversification of these groups (Meheust et al., 2022).
FIGURE 2 The 10 functional categories containing the most E. coli hypothetical protein sequences and annotated by eggNOG-mapper.
Protein functions and structures are closely linked, which is why two proteins with similar structures can also have similar functions, even if no sequence homology is detected (Sousounis et al., 2012). Tools for determining 3D protein structures from primary sequences have taken an incredible leap forward in recent years with the integration of deep learning into their algorithms. For instance, the AlphaFold2 tool (Jumper et al., 2021), developed by DeepMind, has enabled the prediction of the structure of over 200 million sequences in the UniProt database; these results are accessible through a database called AlphaFold DB (Varadi et al., 2022). Many of these structures are for hypothetical proteins and may eventually be used to infer the functions of these proteins. An automated pipeline, 3DFI, exploits these structure prediction tools to infer the functionality of hypothetical proteins (Julian et al., 2021). In 2022, a tool called I-TASSER-MTD was published and can predict, from the primary sequence of a protein, its 3D structure, function, ligand, and more (Zhou et al., 2022). Other bioinformatics resources, such as CATH (Knudsen and Wiuf, 2010), enable searches based on protein structures rather than sequences. Realistically, the growing number of protein structures will enable these tools to be increasingly used and integrated into analysis pipelines.
Although bioinformatics tools to effectively predict protein functions are becoming more available, these analyses can be computationally demanding. Specialized computing and human resources may be required to successfully perform analyses, especially on tens or even hundreds of hypothetical protein sequences. However, computer hardware is also becoming more efficient, including graphics cards whose GPUs offer great computing power; these are widely used in protein structure prediction algorithms.
In conclusion, coding sequences for hypothetical bacterial proteins are common in databases such as RefSeq. However, just because proteins are hypothetical does not mean they are uninteresting or without biological function. The use of several complementary tools can significantly aid the functional annotation of protein sequences. The development of bioinformatics tools and tools related to protein structures, combined with the improvement of computer equipment, makes it likely that new functions will be assigned to proteins currently considered hypothetical. These analyses, by providing functional evidence, will facilitate the experimental confirmation of these proposed functions.

FIGURE 1 Length distribution of E. coli hypothetical protein sequences found on RefSeq.