Are Antisense Proteins in Prokaryotes Functional?

Many prokaryotic RNAs are transcribed from loci outside of annotated protein coding genes. Across bacterial species hundreds of short open reading frames antisense to annotated genes show evidence of both transcription and translation, for instance in ribosome profiling data. Determining the functional fraction of these protein products awaits further research, including insights from studies of molecular interactions and detailed evolutionary analysis. There are multiple lines of evidence, however, that many of these newly discovered proteins are of use to the organism. Condition-specific phenotypes have been characterized for a few. These proteins should be added to genome annotations, and the methods for predicting them standardized. Evolutionary analysis of these typically young sequences also may provide important insights into gene evolution. This research should be prioritized for its exciting potential to uncover large numbers of novel proteins with extremely diverse potential practical uses, including applications in synthetic biology and responding to pathogens.


INTRODUCTION The Many Functions of Antisense RNAs
A wide range of non-coding RNAs have been characterized in bacterial genomes. Among these putatively non-coding sequences are many antisense transcripts. Indeed, up to 75% of all prokaryotic genes are associated with antisense RNAs -though the number differs significantly between species and according to the methods used (Georg and Hess, 2018). Their functions, if any, are poorly understood in most cases. The characteristics of antisense RNAs range widely in terms for instance of length, location in relation to the sense gene, and mechanisms of regulation (Lejars et al., 2019). In studies so far they are usually associated with reducing transcription of the sense gene, but they can also increase it, for instance by changing the structure of the sense transcript -various mechanisms are known in each case (Lasa et al., 2012). They can influence single genes, or have global effects for instance through a target involved in general translation. Other known effects relate to functions including virulence, motility, various mechanisms of gene transfer, and biofilm formation (Lejars et al., 2019). The numerous examples of antisense transcription which have been investigated do not just include short antisense RNAs, though these are well-known; the many longer examples include a 1200 nucleotide antisense RNA in Salmonella enterica, AmgR (Courtney and Chatterjee, 2014). Antisense transcripts have been shown to be coexpressed within a single cell with the use of an antibody against double-stranded RNA in various studies, including in Escherichia coli and Streptomyces coelicolor, as reviewed in Georg and Hess (2018). Relatively little attention, however, has been paid to the possibility that RNA in antisense to protein coding genes may also frequently encode proteins (Georg and Hess, 2018). Rather than short, trivial overlaps, which are well known (Saha et al., 2016), here we focus on cases where an antisense (or "antiparallel") ORF with evidence of translation is fully embedded within a known protein coding gene.
The existence of substantially overlapping gene pairs has been known since the beginning of modern genome sequencing, when the proteins directly detected in the bacteriophage phiX174 were shown to not be able to fit into the sequenced genome without the translation of overlapping open reading frames (ORFs; Barrell et al., 1976). Since then, overlapping genes have typically been assumed to be fairly common only in viruses and extremely rare in other taxa, with the possibility of there being multiple examples in other taxa only sporadically discussed, e.g., Chou et al. (1996). However, their occurrence in bacteriophage in particular should raise the suspicion that they may be common in bacteria as well, given for instance the large amounts of genetic material transferred from temperate phage genomes to bacterial genomes (Harrison and Brockhurst, 2017;Owen et al., 2020). The properties of same-strand overlaps between viral genes have been studied (Pavesi et al., 2018;Willis and Masel, 2018), but even in viruses, relatively little attention has been given to antisense overlaps. There is, however, increasing evidence for functional translated antisense ORFs too, notably the antisense protein Asp in HIV-1 (Cassan et al., 2016;Affram et al., 2019;Nelson et al., 2020). In general it can be said that small ncRNAs are well recognized but their coding potential has been overlooked. Many might be protein-coding (i.e., mRNA), some are indeed ncRNA, and several will be dual-functional (Wadler and Vanderpool, 2007;Gimpel and Brantl, 2017;Neuhaus et al., 2017). The same trichotomy of functional categories applies in the case of antisense RNAs.
In bacteria, a number of individual antisense proteins have been discovered; the lines of evidence for some of these will be discussed below. High throughput analyses of ribosome profiling data, which uncovers the part of the transcriptome associated with ribosomes (Ingolia et al., 2009) thus revealing the "translatome, " have begun to suggest that many more may be present. Friedman et al. (2017) found evidence for approximately 17 antisense ORFs, previously thought to be non-coding sRNAs, translated over above the level expected by chance in E. coli K12. The 10 sRNAs these belong to are shown in Figure 1A. Weaver et al. (2019) found ribosome profiling evidence, including evidence specifically for translation initiation (using retapamulin), for nine antisense overlapping gene candidates in E. coli K12, also shown, combined with the data from Weaver et al., in Figure 1A. As reported in a recent pre-print, Smith et al. (2019) found many overlapping ORFs in Mycobacterium tuberculosis associated with ribosomes using retapamulin. From 355 novel ORFs expressed in two replicates they report 241 overlapping and embedded in annotated genes, including both sense, and antisense overlaps, of which many were very short. Of those encoding at least 20 amino acids, 51 are antisense embedded. These antisense ORFs are shown in Figure 1B. From Figure 1 we see that translated antisense overlapping genes are distributed roughly evenly across the genome, and in both frames, in the best-studied example genomes to-date; E. coli K-12 and M. tuberculosis. We have good reason to expect many similar overlapping genes across prokaryotic genomes.
It has been claimed that as a class the novel ORFs in M. tuberculosis are not under selection, and the association with ribosomes was attributed to non-functional pervasive translation  -this is discussed further in the section on evolution and constraint below. Whatever their selective status, other claims of translation in antisense to known genes continue to accumulate in prokaryotes. Jeong et al. (2016) reported ribosome profiling in S. coelicolor -although this result was not highlighted, examining the supplementary data showed 10 antisense putative sRNAs with ribosome profiling evidence. No doubt many more such discoveries await systematic analysis of published ribosome profiling data. There are also many putative same-strand overlapping genes, as discussed early on by Ellis and Brown (2003) showing that alternate frame translation is likely a general phenomenon -but these have also been claimed to not be under selection (Meydan et al., 2019). This increasing evidence for translation of both sense and antisense alternate frame ORFs, currently only typically acknowledged as ncRNAs, should push the question of "pervasive function" and how to categorize the range of translated ORFs to the forefront of microbiology, but it is yet to receive substantial attention. The evidence of expression in alternate frames is generally ignored, and when acknowledged it is generally presumed to be non-functional -, however, we argue this inference is made too quickly on insufficient grounds. Here we explore how to ascertain function and present a few examples of antisense genes with evidence for functionality.

"Function" and Natural Selection
The question of what counts as "function" in a biological context is not straightforward. An interdisciplinary group of researchers have recently discussed the issue in relation specifically to de novo gene origin (Keeling et al., 2019) and proposed five categories of meanings of function, pertaining to expression, capacities, interactions, physiology, and evolution. As they helpfully note, "Separating these meanings from one another enables communicating with increased precision about what the findings are, thereby helping to [avoid] fallacious logical shortcuts such as 'this protein is expressed therefore it is functional therefore it is under selection."' Interestingly, they had limited success in actually applying their categorisation, with most instances in a test set of article abstracts not uniformly assigned to a category by different team members. This suggests that biologists should write with more precision to clarify the sense of function intended. In this article we will focus on the senses relating to the biochemical "capacity" of the products of genetic elements and their evolutionary history, although the other senses will also come into play. The unifying general concept we use is that an element is functional if it does something useful for the organism in ecologically relevant circumstances.
The important philosophical questions have been reviewed elsewhere (Brandon, 2013). Here we summarize some established methods for determining function in the molecular biosciences and how they have been or could be applied to antisense proteins. It has become popular to adopt an etiological account of function, i.e., that an element's function depends on its selective history, particularly in relation to the dispute over how to assign function to elements in the human genome following the ENCODE project (Graur et al., 2013(Graur et al., , 2015Doolittle et al., 2014;Doolittle, 2018). However, the evolutionary etiology of biological systems is not always fully accessible to us (Ardern, 2018) and sometimes the history of selection in a lineage or for a particular gene may be inaccessible or the accessible parts incomplete in important ways. The genomic influence of different kinds of selection on bacterial genomes, including selective sweeps, background selection, positive selection, and purifying selection, remains a point of contention (Takeuchi et al., 2015;Bendall et al., 2016;Gibson and Eyre-Walker, 2019;Sela et al., 2019). Perhaps the most difficult issue here is how to characterize function in young genes, which may be subject to evolutionary forces lying anywhere along a spectrum between positive selection and purifying selection. Positive selection may be acting to modify a sequence which has only recently evolved or only recently become useful, for instance due to new environmental conditions. At some point, however, modifications are overwhelmingly selected against, i.e., purifying selection dominates. This fascinating transition region to our knowledge has received little study, but it is plausible that most young genes fall within it (Vishnoi et al., 2010). As such, many young genes are likely to be missed by methods seeking clear signatures of either purifying or positive selection. A recent study has shown that embedded overlapping genes in viruses usually evolve faster than the gene they are embedded in (Pavesi, 2019) such cases will tend to be missed by tests of purifying selection.
Additional relevant complexities include recombination, horizontal gene transfer, varying evolutionary rates, and unknown past environmental conditions. Evolutionary analyses certainly can provide strong evidence for function in cases of strong selection, but appropriate lower thresholds for determining that an element is functional while minimizing false negatives are much harder to determine. Arguably of much greater relevance than etiology for molecular biologists is what a genetic element does in the current system, and whether it contributes to the goals or life-conducive activities of that system. That is, as the etiological theorists correctly emphasize, function is not just about "causal role, " it concerns a contribution to a wider system which is in some sense goal-directed. However, given complex histories of multiple evolutionary forces this does not necessarily imply anything directly about a particular canonical signature of natural selection being observable in the existing sequence. A good example of these complexities is the prevalence of translation and likely functions in putative "pseudogenes" (Goodhead and Darby, 2015;Cheetham et al., 2019).

High-Throughput Experimental Evidence
The "gold-standard" proof of the active translation of a gene has traditionally been direct evidence from proteomics experiments, a technology which precedes modern genome sequencing by a few years. However, evidence from current proteomics methods is inherently limited even after decades of improvements. For instance, small proteins are notoriously difficult to detect by mass spectrometry, because upon proteolytic digestion they tend to generate no suitable peptides or just a small number.
Another issue for detecting proteins by mass spectrometry is high hydrophobicity (Lescuyer et al., 2004) for example, proteins that contain trans-membrane domains are often underrepresented in proteomic data sets. Finally, also factors like a low protein abundance, only context-specific expression, a high turnover rate or protein secretion might all hamper a successful detection of proteins (Elguoshy et al., 2016). Nonetheless, despite these barriers there are a few examples of translated overlapping genes with proteomic evidence. Notably, a large-scale study of 46 bacterial genomes found up to 261 cases of annotation "conflict, " i.e., overlaps greater than 40 base pairs with either proteomic evidence for both, or the unevidenced gene being annotated as something other than "hypothetical" (Venter et al., 2011). A more recent study of 11 bacterial transcriptomes (Miravet-Verde et al., 2019) found 185 antisense transcripts previously annotated as non-coding could in fact code for proteins based on a random forest classifier (RanSEPs). A study in Pseudomonas putida found proteomic evidence for 44 ORFs embedded in antisense to annotated ORFs (Yang et al., 2016). An improved proteogenomics pipeline reported in a recent preprint manuscript found numerous gene candidates in S. enterica serovar Typhimurium, including a 199 amino acid long protein antisense to the annotated gene CBW18741 (Willems et al., 2019). It is interesting given the previous comment concerning the rate of phage to bacterial gene transfer that a BLAST search shows that this is likely a bacteriophage protein. The same study also found 18 antisense ORFs in Deinococcus radiodurans supported by at least two peptides. A search in Helicobacter pylori mass spectrometry data from a previously published study designed to find small proteins (Müller et al., 2013) found evidence for a protein encoded by an ORF antisense to a proline/betaine transporter gene (Friedman et al., 2017). A recent discussion paper presented proteomic evidence for many small proteins (sORFs) and overlapping genes ("altORFs"), but did not specifically consider antisense overlaps (Orr et al., 2020).
Aside from proteomics datasets there is extensive publicly available high throughput RNA sequencing data which can be mined for further indicators of specific reproducible regulation of antisense ORFs. There are approximately 1500 relevant RNAseq studies from prokaryotes in the NCBI GEO database (Edgar et al., 2002) each with multiple samples; over 100 ribosome profiling studies, and a number of more bespoke methods which may also provide relevant information. Cappable-seq data, which discovers transcriptional start sites (Ettwiller et al., 2016), helps to delineate the borders of operons and their expression under different conditions. The new method SEndseq, through circularisation of transcripts, is able to detect both transcriptional start and termination sites with single nucleotide resolution (Ju et al., 2019). CHiPseq datasets indicate whether known transcription factors are associated with a particular operon of interest (Wade, 2015) other TF-binding assays also have potential for testing hypotheses concerning TF binding, e.g., DNAse footprinting (Haycocks and Grainger, 2016). Each of these methods is yet to be fully utilized in searching for the transcriptional regulation of overlapping genes. At the level of translation, there are a number of variations on ribosome profiling now available, including accurate prediction of translation initiation sites. The first study of ribosome profiling in bacteria used chloramphenicol in one of the two methods presented (Oh et al., 2011), which has since been shown to stall the ribosome at initiation and, thereby, can assist in inferring translation initiation site positions Glaub et al., 2020). More precise stalling has been achieved with the use of tetracycline (Nakahigashi et al., 2016), retapamulin (Meydan et al., 2019), and the antibacterial peptide Onc112 (Weaver et al., 2019). Properties of ribosomes at different stages of translation, including initiation, have recently been studied in E. coli K12 with TCP-seq; translation complex profiling (Sharma and Anand, 2019). Translation stop sites have also been specifically explored (Baggett et al., 2017). Most of these methods, outside the analysis of ribosome profiling discussed above, have not yet been applied to the detection or investigation of protein coding alternate frame ORFs, and any RNAs at these sites are assumed to be non-coding. Perhaps particularly useful will be ribosome profiling experiments conducted for cells grown in different conditions -many relevant contexts may, however, not be able to be surveyed due to technical limitations.

Phenotypes of Antisense Proteins
An important indicator of functionality is specific regulation in response to defined environmental conditions. Some key canonical work in molecular genetics (Jacob and Monod, 1961;Ames and Martin, 1964) has been concerned with the differential induction of genetic elements under varying environmental conditions. Specific differential induction is widely assumed in this kind of literature to be equivalent to function -how precisely to draw a line between functional and non-functional, given the inherent noisiness of biology, is not, however, entirely clear.
In general, what kind of phenotype is a good indicator of functionality? The most obvious case perhaps is an improvement in growth associated with expression of a genetic element. This could be either through improved growth following overexpression, or decreased growth following a deletion in the genomic sequence. Within an evolutionary context, a growth advantage effectively just is what it is to be "useful" or "functional." An example of this for antisense proteins is citC, discussed below. However, less intuitively, a decrease in growth associated with expression, as seen in the cases of asa, laoB, and ano is also an indicator of functionality in the right context. Most simply, the gene might literally function as a toxin. More generally though, overexpression of many functional genes is deleterious -in fact in E. coli the majority of annotated genes have a deleterious effect on growth in overexpression constructs (Kitagawa et al., 2005). Similarly, a condition-specific positive growth phenotype following knockout of an expressed gene is also indicative of function. Such a phenotype could be simply because the protein is not required in this environment and so losing it decreases the cost of expression. Or it could be because losing a gene with, for instance, a regulatory or inhibitory function is beneficial under certain conditions where regulation of a process is not useful. Indeed, whatever the underlying mechanisms, adaptation in bacteria following loss of function is pervasive (Behe, 2010;Hottes et al., 2013;Albalat and Cañestro, 2016).
Possible reasons for the high tendency toward deleterious over-expression phenotypes in bacteria compared with organisms such as yeast are discussed in Bhattacharyya et al. (2016). This situation should perhaps not be surprising given the extreme optimality of bacterial metabolism (Schuetz et al., 2012) significant disturbance of such a finely tuned system is unlikely to be beneficial under most conditions. This general principle follows from, for instance, Fisher's Geometric Model, in which random changes are less likely to be beneficial when a population is close to a fitness optimum (Tenaillon, 2014). Overexpression of a non-functional "junk" genetic sequence, however, is also likely to be deleterious (Weisman and Eddy, 2017;Knopp and Andersson, 2018) so such a phenotype does not by itself provide evidence for functionality. What is important in the examples discussed above is that the deleterious growth phenotypes are observed as a significant difference between environmental (media) conditions. This implies a specificity of interaction which appears improbable under the "junk" hypothesis, and so constitutes evidence of function.
A number of antisense overlapping genes in E. coli have been analyzed regarding expression and phenotypes across different environmental conditions. The gene nog1 is almost fully embedded in antisense to citC. A strand-specific deletion mutant has a growth advantage over the wildtype in LB, and a stronger advantage in medium supplemented with magnesium chloride (Fellner et al., 2015). The gene asa, embedded in antisense to a transcriptional regulator in E. coli O157:H7 strains, was found to be regulated in response to arginine, sodium, and different growth phases. Overexpression resulted in a negative growth phenotype in both excess sodium chloride and excess arginine, and no phenotype in LB medium (Vanderhaeghen et al., 2018). The gene laoB is embedded in antisense to a CadC-like transcriptional regulator (Figure 2A). A strandspecific genomic knock-out mutant was shown to provide a growth advantage specific to media supplemented with arginine. Further, the differential phenotype was replicated with the addition of inducible plasmid constructs bearing laoB and WT laoB, showing that the phenotype is removed through complementation (Hücker et al., 2018b). How to mechanistically interpret such a growth advantage following gene knockout is unclear, but the condition-specific clear phenotype implies a functional role. The gene ano is nearly fully embedded antisense to an L,D-transpeptidase (Figure 2B). Similarly to laoB, a knock-out mutant showed a condition-specific phenotype. In this case it occurs in anaerobic conditions, and could be partially complemented with a plasmid construct (Hücker et al., 2018a). The putative protein-coding gene aatS was found in the pathogenic E. coli strain ETEC H10407 fully embedded in antisense to the ATP transporter ATB binding protein AatC (Haycocks and Grainger, 2016). It was shown to be transcribed, to have a functional ribosome binding sequence, and to have widespread homologs including a conserved domain of unknown function. Figure 2 illustrates the expression of three examples of antisense genes, with the gene in Staphylococcus aureus (Figure 2C) a special case, as discussed below.
Other than the high-level phenotypes (e.g., expression under particular conditions) determined for some candidates, very little is known about the possible roles or mechanisms of action of antisense proteins. Signaling or interactions between cells will be a significant area to investigate regarding possible functions. This suggestion is based on both the evidence gained so far for small proteins (Neuhaus et al., 2016;Hücker et al., 2017a;Sberro et al., 2019) and the particular importance of signaling or interaction proteins, meaning that this hypothesis derived from findings on small proteins more generally, deserves special attention This area is a crucial field of research as infectious disease continues to be a major health burden and the long natural history of interactions between microbes has been a fruitful source of new antimicrobial strategies.

Simultaneous Transcription?
In response to the evidence for overlapping genes, the question is often raised concerning how two genes could be simultaneously expressed from opposite strands. Indeed, the phenomenon of RNA polymerase collision is a real barrier to antisense transcription in at least some instances and is involved in transcriptional silencing or reduction via various mechanisms (Courtney and Chatterjee, 2014). Bypass of sense and antisense RNA polymerases has been shown for bacteriophage RNA polymerases (Ma and McAllister, 2009) but in vitro experiments have shown no such bypass in bacterial systems (Crampton et al., 2006). The role of accessory helicases in removing barriers to replication due to the presence of RNA polymerases has recently been highlighted (Hawkins et al., 2019) expanding on knowledge of simultaneous transcription and replication (Helmrich et al., 2013). It is conceivable that transcribing alongside the formation of a replication fork could facilitate antisense transcription, but this would restrict antisense transcription to the replication process. However, even in cases of collision of RNA polymerases operating in antisense, transcriptional stalling is not guaranteed. A recent study argues on the basis of simulations and careful assays with reporter constructs that RNA polymerases trailed by an active ribosome are, remarkably, about 13-times more likely to resume transcription following collision than those without the translation apparatus following (Hoffmann et al., 2019). This finding follows on from a range of similar work in recent years showing multiple mechanisms involved in ensuring that RNA polymerases stall and are subsequently released less in protein coding than non-coding RNAs (Proshkin et al., 2010;Brophy and Voigt, 2016;Ju et al., 2019). We suggest that this phenomenon likely applies to antisense embedded protein-coding genes as much as to convergent antisense transcripts and thereby facilitates antisense protein expression.
Recent detailed elucidation showed the working of an operon in S. aureus with a functional gene encoded in antisense to a contiguous set of co-transcribed genes (Sáenz-Lahoya et al., 2019). The authors showed that despite being encoded on opposite strands (although not directly overlapping in this case), these elements comprised a single transcriptional unit ( Figure 2C). This study highlights a mechanism which may be widespread and may apply to genes which are directly antiparallel as well. Results from Weaver et al. (2019) obtained by chromosomal tagging of three antisense proteins show that proteins encoded in antisense can be expressed simultaneously, i.e., under the same growth conditions. All this evidence for simultaneous antiparallel gene expression notwithstanding, it may be that antiparallel overlapping genes are generally translated under different conditions, or separated in time -this is yet to be determined.

Evolution and Constraint in Antisense Proteins
The evolutionary analysis of function at the nucleotide sequence level is a fairly recent development (Robinson-Rechavi, 2019) so we should not be surprised at unexpected results in this rapidly developing field. While the evolutionary analysis of antisense proteins in prokaryotes awaits further investigation of strong overlapping gene candidates, those discovered so far are typically relatively young (Fellner et al., 2014(Fellner et al., , 2015Hücker et al., 2018a). This may be seen as a point against their functionality, particularly for candidates limited to just one species. However, a number of genome elements with undisputed functionality are also evolutionarily young. Various functional putatively ncRNA elements are known to have high evolutionary turnover, see Dutcher and Raghavan (2018). For instance, an sRNA found only in E. coli was shown to be derived from a pseudogenized bacteriophage gene (Kacharia et al., 2017). Also relevant here is the large literature on the functions of "orphan" or taxonomically restricted genes restricted to a single genome or small clade (Satoshi and Nishikawa, 2004;Tautz and Domazet-Lošo, 2011) and orphan genes may play diverse important roles in bacteria (Hu et al., 2009).
It appears likely that antisense proteins are often less constrained in sequence than most protein-coding genes currently known. For one, antisense proteins are typically quite small and hence unlikely to fold into complex structures. Secondly, given initial evidence from viruses that protein domains in overlapping genes may be situated so as to not overlap (Fernandes et al., 2016), it seems that overlapping gene sequences are unlikely to be comprized of a high proportion of constrained sequence domains. While our previous analyses of individual prokaryotic overlapping genes have shown that they are typically young compared to the genes in which they are embedded (Hücker et al., 2018a) many embedded ORFs are quite well conserved beyond the genus. As a conservative example, we take a subset of the Enterobacteriaceae family, the smallest clade including both Citrobacter rodentium and E. coli (Figure 3A). We find that out of the 3391 antisense embedded ORFs predicted as having single homologs in all 13 representative genomes assessed, 29.5% exceed the conservation level of the lower quartile of annotated genes (Figures 3B,C). Here, conservation is judged by median pairwise amino acid similarity between genomes. Given the conservative nature of this analysis and that less than half of even annotated genes met the criterion of having single homologs in all of these genomes, we posit that thousands of embedded antisense ORFs are sufficiently conserved beyond the Escherichia genus to be candidates for functional genes in this particular respect. Factors affecting these conservation statistics, FIGURE 3 | A subset of embedded antisense ORFs are well conserved in the genomes situated taxonomically between E. coli and Citrobacter rodentium. (A) Phylogenetic tree showing 13 representative genomes used, derived from the genome taxonomy database (GTDB). (B) Median pairwise amino acid identity and similarity (gray) among orthologs for 1000 annotated genes with single copy orthologs in all of the 13 genomes. As a comparison, the effect on identity and similarity of adding random mutations to simulated sequences is shown (orange). There is a clear bias toward variants which result in higher "similarity." (C) Conservation of antisense embedded ORFs (blue), as compared to the identity-similarity relationship observed for control randomly mutated sequences (red) simulated as before but translated in antisense. Many antisense embedded ORFs are highly conserved and a subset also shows a bias toward similarity. and additional criteria for gene-likeness which distinguish coding from non-coding antisense sequences deserve further study.
The orange and red lines in Figures 3B,C show the effect of randomly mutating a sequence created based on the codon usage in the annotated genes in E. coli K12. The points plotted represent median identities and similarities in comparison to originally simulated sequences, following successive rounds of random mutation, approximately mimicking the mutational distances observed between the orthologs of annotated genes and embedded ORFs. We suggest that two main results should be taken from Figure 3. Firstly, the blue cluster in the top right of Figure 3C shows that many embedded antisense ORFs are highly conserved across a significant evolutionary distancethey are not all immediately degraded following mutations in the alternate frame as might be naively assumed. Secondly, the bias above the orange and red lines shows that nearly all annotated genes and many embedded antisense ORFs tend toward fixing more "similar" mutations than might be predicted based on amino acid identity statistics alone. This result may be partly due to the structure of the genetic code, i.e., when a "mother gene" in the reference frame is conserved there is some tendency for conservation in the alternative strand (Wichmann and Ardern, 2019), but it is also suggestive of a kind of purifying selection where mutations to biochemically similar amino acids are preferred in a subset of embedded antisense ORFs. It has previously been shown that long antisense ORFs appear more often in natural genomes than expected based on codon composition of annotated coding genes (Mir et al., 2012), another hint of selective processes preserving some antisense ORFs.
A recent, currently unpublished, study in M. tuberculosis  has claimed that novel ORFs identified by ribosome profiling typically do not illustrate the strong codon bias evident in annotated mycobacterial genes and therefore cannot be expected to be functional. Given that many of these ORFs are situated in antisense to annotated genes, where the genetic code limits the possibilities for achieving optimal codon usage, this result is not surprising, and we suggest provides little evidence for the claim that they are nonfunctional. There is a problematic circularity here as well, as annotation of prokaryotic ORFs is based on models which take into account codon usage, based on usage in long ORFs -so short ORFs with "abnormal" codon usage will likely remain unannotated, reinforcing any bias in codon usage statistics in annotated genes. In general, short and weakly expressed genes should not be expected to match "canonical" highly expressed genes in terms of codon usage (Gupta and Ghosh, 2001), although the relationship between expression and codon usage is not straightforward (Dos Reis et al., 2003). Careful evolutionary sequence analyses are required here. A study of some putative same-strand overlapping genes also suggested that they are not under constraint (Meydan et al., 2019). However, more biologically nuanced analyses of sequence constraint, for instance after partitioning the homologs into phylostrata, would be useful. Further, a fundamental assumption of methods for detecting selection (e.g., Firth, 2014;Wei and Zhang, 2015), is the neutrality of synonymous mutations, but this assumption has been shown to be false, with the rate of synonymous mutations varying widely across sites (Wisotsky et al., 2020). The extent to which this affects conclusions regarding dN/dS as calculated with the various available methods remains unexplored. More generally, to our knowledge, there has been no demonstration of any synthesis of non-functional protein in prokaryotes. The high bioenergetic cost of protein production (Lynch and Marinov, 2015) would seem to militate against such a phenomenon being widespread in bacteria, where costs are minimized through gene loss (Koskiniemi et al., 2012). As such, we argue that the default assumption following demonstration of a clear signal of translation should be that the product plays a functional role.

DISCUSSION
The Context: Unexpected Complexity The historical trajectory in bacterial genomics has been toward finding previously unappreciated layers of complexity (Grainger, 2016). In particular, the number of different kinds of functional elements recognized has continued to increase in recent years. Examples of genetic elements previously ignored or written off as background noise which are now known to be functional in some or many cases include antisense transcription, small RNAs, and microRNAs, proteins with alternative start sites, small proteins (Storz et al., 2014), and micropeptides. Antisense transcription has been widely disregarded as noise (Raghavan et al., 2012;Lloréns-Rico et al., 2016). However, despite these generalizations, these elements have recently been found to at least sometimes have physiological roles (Wade and Grainger, 2014;Lejars et al., 2019). Functional RNAs which are not yet well understood include structured noncoding RNAs such as riboswitches (Hücker et al., 2017b;Stav et al., 2019). Proteins with alternative start sites, designated isoforms or "proteoforms, " have also been reported in a few bacterial systems (Berry et al., 2016;Nakahigashi et al., 2016;Meydan et al., 2019). These genetic elements are all yet to be incorporated into genome annotation files and gene prediction algorithms. As such, genome annotation is years behind the leading edge of research in bacterial genetics, and various functional elements remain unannotated.

Recommendations for Further Research
Even recent attempts at comprehensive studies of small proteins have tended to ignore antisense proteins or to use methods unintentionally biased against them -perhaps unsurprising given the reigning paradigm in genome annotation, which excludes substantive overlaps as a matter of principle. As an example, the NCBI prokaryotic genome annotation standards include among the minimum standards that there can be " [no] gene completely contained in another gene on the same or opposite strand" (NCBI 2020). For instance, a recent study investigated small proteins in the human microbiome (Sberro et al., 2019) finding hundreds of previously unknown small proteins with evidence from evolutionary sequence constraint, and many also with evidence of transcription and/or translation. Two key steps were the use of MetaProdigal for gene prediction and RNAcode for inference of conservation. Both of these are implicitly biased against overlapping genes, in that Prodigal explicitly excludes long overlaps, and RNAcode looks for patterns of sequence constraint associated with normal non-overlapping genes, which are unlikely to be found in overlapping genes.
The bacteriological research community ought to relinquish the common assumption that unannotated functional elements are only to be found in intergenic regions. We must also be aware that antisense regions often need to be treated differently from intergenic regions, for instance in analyses of sequence constraint. Developing appropriate corrections to take into account the sequence context of antisense overlapping ORFs is an important area for further work. A major emphasis should be on high-throughput functional studies. For in-depth laboratory studies dissecting the details of an overlapping gene's regulation and function, the focus should be on the strongest candidates as determined with sequence and expression data.
One key criterion here is evidence of reproducible regulated translation from one of the various ribosome profiling methods now available. Sequence properties determined from such sets should help to find strong candidates which are not expressed under already-assayed conditions. It is also clear that further advances in proteomics for small proteins should result in proteomic evidence for the translation of many more antisense proteins in bacteria and other systems. Following on from this, the discovery of any protein structures would be a major step forward toward understanding the molecular mechanisms of function. Finally, studying the evolutionary history of antisense proteins may provide useful insights on function. In this aspect these genes have a significant advantage over others in that their genomic context is relatively fixed by the gene in which they are embedded. This study has focused on eubacteria, but the same principles conceivably apply in archaea. A recent study, for instance, chose to only consider same-frame overlapping ORFs (proteoforms) on account of an absence of proteomics results and reliable BLAST hits for out-of-frame overlapping ORFs (Ten-Caten et al., 2018). Neither of these negative results are surprising, however, given the limitations of proteomics discussed above and the current bias against annotating out-of-frame overlaps; as such, archaeal datasets ought also be re-examined for functional overlapping genes.
In summary, what is required in order to assign the descriptor 'functional' to a putative gene, such as a gene encoded in antisense to a known gene? Regarding evolutionary evidence, a codonlevel pattern of sequence constraint is sufficient to guarantee function, as constraint matching expectations for amino acids is unexpected in coding sequences. Detecting such constraint is a challenge for antisense sequences, however. Regarding evidence from wet-lab experiments, a condition-specific phenotype is also sufficient to establish functionality. The "gold standard" in this area would be a condition-specific negative growth phenotype in a genomic knock-out mutant, which could be complemented in trans (e.g., with a plasmid construct). Regarding highthroughput evidence, significant protein expression is evidence of functionality in highly optimized bacterial genomes, particularly if shown to be consistent across species or highly diverged strains. Appropriate thresholds for significant expression and sufficient evolutionary divergence in order to be able to confidently infer function are yet to be established. While each of these three lines of evidence is arguably sufficient to establish function, none is necessary, as there are functional elements which fail to meet at least one of these criteria.
We have collated evidence from diverse bacteria (including the genera Escherichia, Pseudomonas, and Mycobacterium) for protein coding ORFs embedded in antisense to annotated genes, discussed reasons to believe that they are biologically functional, and responded to common objections, informed by the most recent work in bacterial molecular genetics. We suggest that a pro-function attitude regarding antisense prokaryotic transcripts and the antisense translatome is both more useful for research and justified by multiple lines of evidence. How many of these elements are functional and what they do remain contentious, however, and worthy of significant further investigation.

METHODS
For Figure 1, positions of previously discovered putative antiparallel genes in E. coli K12 and M. tuberculosis were extracted from the supplementary data of previous studies (Friedman et al., 2017;Smith et al., 2019;Weaver et al., 2019); information on those with ribosome profiling reads was provided by Robin Friedman. Positions are shown visualized with Circos (Krzywinski et al., 2009).
For Figure 2, ribosome profiling ("RIBO-seq") data was visualized to show examples of antisense overlapping genes. In each case, adapter sequences were predicted using DNApi.py (Tsuji and Weng, 2016), trimmed with cutadapt (Martin, 2011) using a minimum length of 19 and quality score of 10, and aligned (local alignment) with bowtie2 (Langmead and Salzberg, 2012). Fastq data from SRR5874479 (LB) and SRR5874484 (BHI) for E. coli O157:H7 Sakai was aligned against the genome GCF_000008865.1_ASM886v1. Fastq data from SRR1265839 (without azithromycin) and SRR1265836 (with azithromycin) for S. aureus was aligned against genome GCF_900475245.1_43024_E01. Reads mapping at each site per million total mapped reads (RPM) are calculated from aligned bam files with reads mapping to rRNA and tRNA locations removed, using samtools (Li et al., 2009). Images of RPM in the region around the putative antisense gene are drawn in gnuplot with "smooth csplines." For Figure 3, the relationship between similarity and identity in comparisons of different ORF homologs was compared. Representative genomes from release 89 of the genome taxonomy database (GTDB; Parks et al., 2018) in the smallest clade uniting E. coli and C. rodentium ( Figure 3A) were chosen. Of these 23 strains, 13 had a GenBank genome and feature table with the same accession version available. These were downloaded, and annotated ORFs in each compared to each other using OrthoFinder (Emms and Kelly, 2015). Genes with a single copy ortholog present in all 13 genomes were extracted, and members of each ortholog family were aligned against each other using the EMBOSS (Rice et al., 2000) program needleall to determine median similarity and identity at the amino acid level ( Figure 3B). As a control, 50 sequences of 333 codons length were created based on codon usage in E. coli K12, using EMBOSS programs cusp and makenucseq. These were then mutated through 70 rounds of point mutation (10 mutations per round) using the EMBOSS program msbar, and translated in order to determine the relationship between varying levels of amino acid identity and similarity. In each case, the mutated sequences were compared to the original simulated sequence they were derived from, using EMBOSS needle. For each percent decrease in identity observed, results were collated and the median values of identity and similarity reported. The procedure used initially for annotated genes was repeated using all antisense embedded ORFs (using the bacterial genetic code, NCBI code table 11), found using a Perl script "ORFFinder" (available from Christopher Huptas) and Bedtools (Quinlan and Hall, 2010). A negative control for these sequences was also created similarly, to before, but using an antisense reading frame. As the particular antisense frame used had no significant effect on the sequence similarities obtained in the simulation, for the data shown the initial sequences based on codon usage in E. coli K12 were directly reverse complemented with no further frame-shift, prior to the 70 rounds of random mutation.

DATA AVAILABILITY STATEMENT
The datasets generated for this study are available on request to the corresponding author.

AUTHOR CONTRIBUTIONS
ZA drafted the manuscript and prepared the figures. KN and SS assisted with drafting the manuscript and conceiving of the project. All authors read and approved the final version of the manuscript.