Are There Hidden Genes in DNA/RNA Vaccines?

Owing to the rapid global spread of the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), prevention and treatment options are urgently needed to control infection-related morbidity, mortality, and economic losses. Although the development of drugs and of inactivated and attenuated virus vaccines can require significant amounts of time and resources, DNA and RNA vaccines offer a quick, simple, and cheap alternative, even when produced on a large scale. The spike protein, which has been shown to be the most antigenic SARS-CoV-2 protein, has been widely selected as the target of choice for DNA/RNA vaccines. Vaccination campaigns have reported high vaccination rates and protection, but numerous unintended effects, ranging from muscle pain to death, have led to concerns about the safety of RNA/DNA vaccines. In parallel to these studies, several open reading frames (ORFs) have been found to overlap SARS-CoV-2 accessory genes, two of which, ORF2b and ORF-Sh, overlap the spike protein sequence. Thus, the presence of these, and potentially other, ORFs on SARS-CoV-2 DNA/RNA vaccines could lead to the translation of undesired proteins during vaccination. Herein, we discuss the translation of overlapping genes in connection with DNA/RNA vaccines. Two mRNA vaccine spike protein sequences, which have been made publicly available, were compared to the wild-type sequence in order to uncover possible differences in putative overlapping ORFs. Notably, the Moderna mRNA-1273 vaccine sequence is predicted to contain no frameshifted ORFs on the positive-sense strand, which highlights the utility of codon optimization in DNA/RNA vaccine design to remove undesired overlapping ORFs. Since little information is available on ORF2b or ORF-Sh, we use structural bioinformatics techniques to investigate the structure-function relationship of these proteins.
The presence of putative ORFs on DNA/RNA vaccine candidates implies that overlapping genes may contribute to the translation of smaller peptides, potentially leading to unintended clinical outcomes, and that the protein-coding potential of DNA/RNA vaccines should be rigorously examined prior to administration.


INTRODUCTION
The Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) is a positive-sense single-stranded RNA virus that was first described in late 2019 (1). SARS-CoV-2 is phylogenetically related to the causative agent of the 2002 SARS-CoV epidemic and causes many of the same symptoms, such as fever and myalgia (2). Because of the high transmissibility of SARS-CoV-2 and its rapid spread throughout the world, by March 2020, the World Health Organization declared the global outbreak as the COVID-19 pandemic (3). The health and economic-related losses accruing as a result of the pandemic led to the prioritization of prevention and treatment options with the quickest route to safe clinical application (4). Although small molecule inhibitors and inactivated or live attenuated virus vaccine candidates have been used to successfully treat infection by pathogenic viruses, the pipelines to bring these products into clinical use can require significant time and resources with potentially low success rates (5,6). However, among novel vaccine delivery platforms developed in recent years, DNA and RNA vaccines have become of interest due to their potential to be inexpensively and quickly produced at a large scale (7). Only the nucleotide sequence of the selected antigenic protein is required to begin production, which can be derived from DNA/RNA sequencing of the virus. Thus, DNA/RNA vaccines have been suggested as prime candidates for mitigating COVID-19 transmission.
The SARS-CoV-2 genome codes for at least 30 proteins, three of which are exposed on the virion surface and can be recognized by the immune cell system (8)(9)(10). The spike protein is a large trimeric glycoprotein (1,273 amino acid long protomers) that protrudes from the virion surface to bind to cell surface receptors on host cells, such as angiotensin converting enzyme II (ACE2), in order to initiate viral entry (11). The large surface area of the spike protein and its role in host cell entry make it attractive as a target for the immune system and clinical treatments, such as drugs and therapeutic antibodies (12). Of note, the spike protein is heavily glycosylated, which helps shield the virus from interactions with antibodies (13). The region with the lowest degree of glycosylation is the receptor-binding domain, which binds to host cell surface proteins to initiate viral entry, and, as a result, is the most antigenic region of the spike protein (14). The other two proteins exposed on the virion surface, the envelope and membrane proteins, are also available for use as antigen targets; however, they are smaller in size and less accessible for protein-protein interactions than the spike protein. Because of the evidence indicating that spike is the most suitable antigenic target for SARS-CoV-2, it has been widely used in vaccine trials.
Numerous companies and academic institutions across the globe have developed or are currently developing DNA/RNA vaccines for the SARS-CoV-2 spike protein (15). Generally, in the case of DNA vaccines, the full-length SARS-CoV-2 spike protein DNA sequence is inserted into a plasmid, and additional technologies, such as electroporation, can assist in making transfection more efficient (16,17). The spike protein DNA transfected into the human cell can then be transcribed and translated to create the trimeric spike protein, which, then, moves to the endoplasmic reticulum and Golgi apparatus for post-translational modification (e.g. signal sequence cleavage, glycosylation) and continues through the secretory route to become anchored to the cell membrane for exposure to the immune system (18,19). The RNA-based vaccine formulations comprise lipid nanoparticles assembled around mRNA molecules coding for the full-length SARS-CoV-2 spike sequence (20). The transfected mRNA can be directly translated to make the spike protein. The BioNTech/Pfizer and Moderna mRNA vaccines, which have widely been approved by government agencies and administered in several countries, have reported approximately 50-70 and 70-90% effectiveness after 1 and 2 doses, respectively, against the wild-type and alpha variant (B.1.1.7) and 30-60 and 60-90% effectiveness, respectively, against the beta (B.1.351) and gamma (P.1) variants (21,22). However, additional variants of concern have been noted to provide either further partial or complete immune escape; thus, adapting the sequences may be required over time (23,24). Prior to COVID-19, no DNA/RNA vaccines had been approved for human use, but, in August 2021, BioNTech and Pfizer received FDA approval for use of their mRNA vaccine (25). Further investigation into nucleic acid-based vaccine delivery platforms may improve effectiveness.
Although SARS-CoV-2 DNA/RNA vaccines have been subjected to health and safety testing prior to bulk dissemination, a diverse assortment of both systemic and local (near injection site) side effects, ranging from mild to severe, has been described following vaccination (26,27). Symptoms resembling those of viral infection (e.g. headache and myalgia), life-threatening conditions (e.g. myocardial injury and thrombosis), and mortalities have been reported in relation to vaccination (28)(29)(30)(31). Although some side effects may stem from the delivery modalities, several studies have indicated that the spike protein alone causes adverse effects on host tissues, such as blood brain barrier disruption, neuron fusion, inflammation, and cell senescence (32)(33)(34)(35). Although it is difficult to detect the origin of side effects in vaccinated individuals, more investigation of the cellular effects of mRNA vaccines or the expressed protein antigen is warranted to create safer vaccines.

OVERLAPPING ORFs ON THE SARS-COV-2 SPIKE PROTEIN NUCLEOTIDE SEQUENCE
The understanding that one mRNA codes for one gene in eukaryotes, not considering alternative splicing, has been challenged after several studies have revealed the presence and translation of multiple open reading frames (ORFs) within one expressed mRNA (36,37). Additionally, stretches of RNA that have been annotated as non-coding have also been discovered to code for small peptides from internal ORFs in the larger gene that have regulatory activity in the cell (38,39). Alternative start codons, internal ribosome entry sites, and frameshifting have all been described as mechanisms contributing to the translation of smaller ORFs within a larger expressed gene (40)(41)(42).
Furthermore, overlapping genes are a common feature among viral genomes (43). Overlapping genes in viruses originate from mutations that spontaneously generate a translation start site within an existing gene, often producing a new ORF, usually in a different reading frame, while maintaining the integrity of the original gene (44). Thus, understanding the coding capacity of a viral transcript in a human cell may be more complex than assessing the full-length protein sequence alone.
Several ORFs have been found to overlap previously annotated genes in the SARS-CoV-2 proteome (45). For instance, ORF3d and ORF9c have been recently discovered, using Ribo-Seq and phylogenetic analyses, to overlap the ORF3a and N genes, respectively (8,9). Included among the newly discovered overlapping ORFs in the SARS-CoV-2 genome are ORF2b and ORF-Sh, which have been shown to overlap the spike protein sequence on the +1 frame (one nucleotide towards the 3' end) (9,46,47). The ORF2b and ORF-Sh nucleotide sequences are both 120 nucleotides long and both code for 39 amino acid long proteins. ORF2b has been found to be translated in human cells during infection, whereas ORF-Sh has thus far been found only using in-depth phylogenetic comparisons. ORF2b was found to be absent in almost all bat coronavirus strains in a genomic comparison study, suggesting that it has recently evolved (48). ORF-Sh has been proposed to have evolved recently among the clade of viruses that includes the Bat-CoV-RaTG13 and pangolin coronavirus strains. Mutations that lead to truncation of the protein sequence have been discovered in only a few sequenced SARS-CoV-2 strains (47,49). Little is known about the structure or function of these proteins, although both are predicted to contain a transmembrane domain. The existence of additional open reading frames within the spike protein sequence, however, raises the question of whether these or other overlapping ORFs are being translated from DNA/RNA vaccine sequences. Since the translation of overlapping accessory ORFs has been shown to significantly alter the dynamics of host protein-protein interaction networks, the translation of these two, and possibly other, ORFs within the spike protein sequence of RNA/DNA vaccines may lead to signaling perturbations that resemble SARS-CoV-2 infection (50,51).
Since little has been reported on the functionality of ORF2b or ORF-Sh in infected cells (although translation of ORF-Sh is still to be confirmed) and no homologous domains were found using sequence analysis tools, we used structural bioinformatics methodologies, as previously performed to elucidate the three-dimensional features of the SARS-CoV-2 proteome, to model the structural characteristics of both ORFs (52)(53)(54)(55)(56)(57)(58)(59)(60)(61). As shown in Figure 1A, the majority of the ORF2b protein is predicted to be fixed in the membrane (62). Structural similarity comparisons revealed that the predicted ORF2b transmembrane domain resembles the human metapneumovirus (HMPV) phosphoprotein (P) oligomerization domain (PDB: 5oix; TM-score: 0.67), which may result in tetramerization or other states of oligomerization (Figure 1A) (63). Such oligomerization in the membrane could lead to viroporin activity similar to that of the ORF3a and E SARS-CoV-2 proteins (64,65). Although a transmembrane domain was predicted in the ORF-Sh sequence, structural modelling and secondary structure prediction depict a bend in the middle of the putative transmembrane domain (Figure 1B). Structural comparisons revealed fold similarity between the predicted ORF-Sh model and DNA-binding zinc finger proteins, such as the transcriptional repressor CTCF (PDB: 1x6h; TM-score: 0.38), which is further supported by the presence of four basic amino acids on what is predicted to be the DNA-exposed side shown in Figure 1B (66). Experimental validation is required, however, to verify these results. Nevertheless, the translation of overlapping, small ORFs within larger ORFs can result in harmful effects on host tissues, such as interfering with organelle membrane protein activity or perturbing signaling pathways. Thus, the protein-coding potential of DNA/RNA vaccines within the context of overlapping ORFs should be investigated further.

CODON OPTIMIZATION OF DNA/RNA VACCINE CANDIDATES
Although the presence of overlapping genes on the wild-type nucleotide sequence of the spike protein challenges the effectiveness of DNA/RNA vaccines, precautionary steps can be taken to prevent the translation of these smaller, internal ORFs. For example, vaccine nucleotide sequences can be selectively codon optimized, as is normally performed to enhance translation efficiency in host tissues, to remove alternative start codons and internal ribosome entry sites, thus preventing non-specific recognition by ribosomal complexes (67). Codon optimization without consideration of overlapping ORFs, however, can result in either the disruption of existing overlapping ORFs (ORF2b and ORF-Sh in the case of the spike protein vaccines) or the spontaneous generation of new ORFs. Although most, if not all, DNA/RNA vaccine candidate spike sequences have been reported to be codon optimized for translation in human cells, the spike nucleotide sequences have largely, so far, been kept private by the corresponding company or institution. Interestingly, however, the Moderna mRNA-1273 and Pfizer BNT162b2 vaccine mRNA sequences have been made publicly available (https://github.com/NAalytics; https://berthub.eu/articles/posts/reverse-engineering-source-code-of-the-biontech-pfizer-vaccine/). The posting of these data allows direct comparative analyses between the vaccine-formulated and wild-type spike protein sequences.
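As a minimal sketch of the check described above, the following Python snippet verifies that a recoded sequence still encodes the same protein and lists ATG triplets that fall outside the main reading frame. The function names (`translate`, `out_of_frame_starts`, `is_synonymous`) and the toy sequences are illustrative assumptions, not part of any vaccine design pipeline, and the presence of an out-of-frame ATG does not by itself imply translation.

```python
# Standard genetic code (NCBI translation table 1), built in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def normalize(seq):
    """Upper-case the sequence and treat RNA (U) as DNA (T)."""
    return seq.upper().replace("U", "T")


def translate(cds):
    """Translate complete codons in frame 0; '*' marks a stop codon."""
    cds = normalize(cds)
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def out_of_frame_starts(cds):
    """0-based positions of ATG triplets that are not in frame 0."""
    cds = normalize(cds)
    return [i for i in range(len(cds) - 2)
            if i % 3 != 0 and cds[i:i + 3] == "ATG"]


def is_synonymous(original, recoded):
    """True if both sequences encode the same amino acid sequence."""
    return translate(original) == translate(recoded)


# Toy example (hypothetical sequences): AAT GCA and AAC GCA both encode
# Asn-Ala, but only the first contains an ATG in the +1 frame.
print(out_of_frame_starts("AATGCA"), out_of_frame_starts("AACGCA"))
print(is_synonymous("AATGCA", "AACGCA"))
```

In this toy case a single synonymous substitution (AAT to AAC) removes the +1-frame start codon without changing the encoded protein, which is the kind of selective recoding the paragraph above refers to.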
Comparing the wild-type and vaccine mRNA spike nucleotide sequences may reveal the extent to which the sequences have been changed during codon optimization, thus potentially altering the translation efficiency of the spike protein and overlapping ORFs. Of note, prior to codon optimization, both companies have reported including proline mutations to stabilize and preserve the spike protein structure, thus implying small changes in spike amino acid content as well. Using the EMBOSS Needle pairwise sequence alignment tool, the wild-type spike sequence (NCBI accession: NC_045512) is found to be 68.7% and 45.3% identical to the mRNA-1273 and BNT162b2 vaccine spike sequences (as opposed to the entirety of the mRNA sequence), respectively, and the mRNA-1273 and BNT162b2 spike sequences are 48.6% identical to one another (68). The GC contents of the wild-type, BNT162b2, and mRNA-1273 spike nucleotide sequences, which correlate well with translation efficiency, are 37.3%, 56.9%, and 62.3%, respectively. These alignments reveal that extensive codon optimization was performed during vaccine preparation.
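The two summary statistics used above can be illustrated with a short Python sketch. It assumes that the sequences have already been aligned by an external tool (the helper does not perform the alignment itself) and that percent identity is taken as identical columns divided by total alignment length, gaps included; the function names and toy inputs are hypothetical.

```python
def gc_content(seq):
    """Percentage of G and C bases in a nucleotide sequence (DNA or RNA)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def percent_identity(aligned_a, aligned_b):
    """Identical columns as a percentage of the alignment length.

    Both inputs must be rows of the same pairwise alignment, with '-'
    marking gaps. Gap columns count towards the length but never match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("empty alignment")
    matches = sum(
        a == b and a != "-"
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
    )
    return 100.0 * matches / len(aligned_a)


# Toy inputs (not the spike sequences).
print(gc_content("ATGC"))                      # 2 of 4 bases are G or C
print(percent_identity("ATG-CA", "ATGGCT"))    # 4 of 6 columns identical
```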
To quantify the degree to which the codon usage of the optimized vaccine mRNA sequences matches human codon usage, the codon adaptation index (CAI), which has been noted to be an accurate reflector of gene translation, was calculated for all three spike sequences using the COUSIN and CAIcal web servers (69)(70)(71). As a reference, calculated CAI values for SARS-CoV-2 genes with regard to human codon usage average around 0.7, and a higher score indicates codon usage that is more favorable for translation (72,73). The CAI values for the wild-type, BNT162b2, and mRNA-1273 spike nucleotide sequences are 0.703, 0.715, and 0.981, respectively. While the BNT162b2 vaccine CAI value was slightly increased compared to the wild-type sequence, the mRNA-1273 vaccine CAI value was found to be substantially higher, almost reaching the maximum value. These findings suggest that the codon optimization used on both vaccine sequences has resulted in higher translation potential than the wild-type. Notably, the mRNA-1273 vaccine codon usage seems much more closely aligned with human codon biases, and the sequence contains fewer substituted nucleotides and a higher GC content.
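For readers unfamiliar with the index, the following Python sketch shows one common way to compute it: each codon is given a relative adaptiveness equal to its usage divided by the usage of the most frequent synonymous codon in a reference set, and the CAI is the geometric mean of these weights over the coding sequence. This is an illustrative sketch only; it is not the implementation used by the COUSIN or CAIcal servers, the exclusion of Met, Trp, and stop codons is a convention assumed here, and the reference usage counts in the example are invented toy values rather than human codon usage data.

```python
import math
from collections import defaultdict

# Standard genetic code (NCBI translation table 1), built in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def relative_adaptiveness(usage):
    """Map codon -> usage / usage of the most used synonymous codon."""
    best = defaultdict(float)
    for codon, freq in usage.items():
        aa = CODON_TABLE[codon]
        best[aa] = max(best[aa], freq)
    return {
        codon: freq / best[CODON_TABLE[codon]]
        for codon, freq in usage.items()
        if best[CODON_TABLE[codon]] > 0
    }


def cai(cds, weights):
    """Geometric mean of codon weights over a coding sequence.

    Met, Trp, and stop codons are skipped, as are codons that have no
    positive weight in the reference table (assumed conventions).
    """
    cds = cds.upper().replace("U", "T")
    logs = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3]
        if CODON_TABLE.get(codon) in (None, "M", "W", "*"):
            continue
        w = weights.get(codon)
        if w:
            logs.append(math.log(w))
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


# Hypothetical reference counts for two amino acids (Ala and Lys).
toy_usage = {"GCC": 40, "GCT": 20, "AAG": 30, "AAA": 30}
weights = relative_adaptiveness(toy_usage)
print(cai("ATGGCTGCCTAA", weights))
```

With this definition, a sequence that uses only the most frequent codon for every amino acid scores 1.0, which is why a value such as 0.981 is described above as almost reaching the maximum.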

OVERLAPPING ORFs ON DNA/RNA VACCINE CANDIDATES
Considering the extensive codon optimization performed on the vaccine spike sequences, the comparison of putative ORFs in the wild-type and selected vaccine mRNA sequences may shed light on the protein-coding potential of DNA/RNA sequences used in SARS-CoV-2 vaccines. Thus, in order to examine the differences between the available ORFs on the wild-type, Moderna mRNA-1273, and Pfizer BNT162b2 sequences, putative ORFs of all three nucleotide sequences were detected using the NCBI ORFfinder web server (https://www.ncbi.nlm.nih.gov/orffinder/). Although ORF identification using this tool does not imply translation, an overview of the available reading frames may provide insights into coding potential differences between the wild-type and vaccine candidate sequences. Minimum ORF length was set to the default 75 nucleotides, no alternative initiation codons were allowed, and only "ATG" start sites were considered.
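The kind of scan described above can be approximated with a short Python sketch. It is not the NCBI ORFfinder algorithm; it simply reports, for each stop codon in each frame, the ORF beginning at the first upstream in-frame ATG, using the same settings (ATG-only starts, 75-nucleotide minimum). Whether the minimum length counts the stop codon is an assumption made here, and the function names are hypothetical. Frames are numbered relative to the first nucleotide of the scanned strand, so for a sequence that begins at the spike start codon, frame 0 contains the spike ORF itself and frames 1 and 2 contain candidate overlapping ORFs.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_len=75, both_strands=True):
    """Return (strand, frame, start, end, length) tuples for ATG..stop ORFs.

    start and end are 0-based, end-exclusive coordinates on the scanned
    strand; length is in nucleotides and includes the stop codon.
    """
    seq = seq.upper().replace("U", "T")
    strands = [("+", seq)]
    if both_strands:
        # Relevant for DNA constructs; can be disabled for mRNA.
        strands.append(("-", reverse_complement(seq)))
    orfs = []
    for strand, s in strands:
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG since the previous stop
                elif start is not None and codon in STOPS:
                    length = i + 3 - start
                    if length >= min_len:
                        orfs.append((strand, frame, start, i + 3, length))
                    start = None
    return orfs


# Toy sequence with a single 81-nt ORF in frame 1.
toy = "A" + "ATG" + "AAA" * 25 + "TAG"
print(find_orfs(toy, both_strands=False))
```

Setting `both_strands=False` mirrors the mRNA case, where only the positive-sense strand is relevant, while the default also scans the reverse complement, as would be appropriate for plasmid DNA.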
As shown in Figure 2, ORF2b and ORF-Sh were found in the wild-type sequence; however, both ORFs are absent in both of the mRNA vaccine candidate sequences. The counts, lengths, and sequence identities of predicted ORFs on both mRNA sequences were found to be markedly different from one another and from the wild-type, re-asserting that codon optimization can result in significant changes in the presence of overlapping ORFs. Eleven small overlapping ORFs (27-87 residues long) were discovered using NCBI ORFfinder on the wild-type spike protein sequence, and eight small ORFs (26-52 residues long) were found to overlap the Pfizer BNT162b2 vaccine mRNA sequence. Notably, the Moderna mRNA-1273 vaccine mRNA sequence displayed no overlapping ORFs on the positive-sense strand; ORFs were found only on the negative-sense strand, which can be disregarded when considering mRNA. However, DNA-based vaccines, such as the INO-4800 SARS-CoV-2 spike DNA vaccine from INOVIO Pharmaceuticals, should be assessed for the presence of protein-coding ORFs on the reverse strand (73). Thus, in terms of predicted protein-coding potential, the Moderna mRNA-1273 vaccine appears to be the better optimized of the two sequences to solely code for the SARS-CoV-2 spike protein. These findings also support the notion that [...] Although the shortening of the spike sequence reduces the number of overlapping ORFs, the potential for alternative translation still remains. Multimeric vaccine DNA/RNA sequences that include antigenic regions of different viral proteins could also be used to increase immunogenicity while shortening the length of the construct and, thus, controlling for the presence of overlapping ORFs (75)(76)(77). For example, the hepatitis C E2 protein scaffold has been used to present the antigenic HIV-1 gp120 variable loop region to promote immunogenicity for potential HIV vaccination (78).
Thus, downsizing the sequence to include only the most antigenic regions of the spike receptor-binding domain, such as the receptor-binding motif, or domains from other viral proteins, and placing them on a codon-optimized protein scaffold, may further control for overlapping protein-coding sequences (79). Sequence length and content can further affect the number of overlapping ORFs, but the scrutiny of protein-coding regions nevertheless relies on validating the translation of alternative reading frames.
The use of experimental techniques, such as ribosomal profiling or mass spectrometry, on vaccinated patient or laboratory animal samples or pseudovirus-infected tissue

CONCLUSIONS
DNA/RNA vaccines have proven to be an effective way to develop vaccines quickly for emerging pathogens. However, with a new set of solutions comes a new set of problems (80). Although the wild-type SARS-CoV-2 spike protein nucleotide sequence has been found to code for translated overlapping genes, ORF detection predictions on the sequences of two mRNA vaccines reveal that codon optimization has the potential to disrupt non-specific translation. Additional overlapping ORFs can arise during codon optimization; thus, the final sequences should nevertheless be scrutinized for their protein-coding potential. In the case of DNA vaccines and viral vectors, the negative-sense strand should also be checked for its protein-coding potential. Additionally, as variants of concern become known and vaccines are altered to include them, the spontaneous generation of ORFs should be re-assessed. Many precautionary steps have been taken to ensure the safety and efficacy of the mRNA vaccines, including nucleoside modification to reduce inflammatory responses and 5'-capping and polyadenylation tail length optimization to increase mRNA stability and translation (20). Thus, the inclusion of additional steps to ensure that vaccine sequences code solely for the intended protein may also lead to better health and safety outcomes. Measures to check for other adverse effects on host cells, such as those resulting from potential interactions of vaccine nucleotide sequences with host RNAs or proteins, or with the host microbiome, may increase efficacy and safety as well (81). More in-depth investigation of these delivery methods may reveal aspects that should be further refined to safeguard against unintended side effects.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/supplementary material. Further inquiries can be directed to the corresponding author.

AUTHOR CONTRIBUTIONS
CB, MB, and AV contributed to conception and design of the study. CB, MB, and AV contributed to sequence and structural analyses. All authors contributed to manuscript writing and revision. All authors contributed to the article and approved the submitted version.