Strand-Specific Patterns of Codon Usage Bias Across Cressdnaviricota

The rapidly expanding phylum Cressdnaviricota contains circular, Rep-encoding single-stranded (CRESS) DNA viruses that are organized within seven established families, but many CRESS DNA virus sequences are not taxonomically defined. We hypothesized that genes in CRESS DNA virus ambisense genomes exhibit strand-specific signatures due to a cytosine to thymine transition bias that can help determine the orientation of the genome: which strand is packaged and is in the “virion sense”. To identify broad strand-specific patterns across genera, we performed compositional analyses of codon usage across the two major opposite sense open reading frames of 712 reference viruses. Additionally, we developed a statistical test to identify relative codon overrepresentation between ambisense sequence pairs for each classified virus exemplar and an additional 137 unclassified CRESS DNA viruses. Codons clustered by the identity of their third-position nucleotide, displaying both strand- and genus-specific patterns across Cressdnaviricota. Roughly 70% of virion-sense sequences have a relative overrepresentation of thymine-ending codons while ~80% of anti-sense sequences display a relative overrepresentation of adenine-ending codons (corresponding to a relative overrepresentation of thymine in these genes as packaged). Thirteen of the 137 unclassified viruses show strong evidence of having the rarer circovirus-like genome orientation, and likely represent novel genera or families within Cressdnaviricota. Given the strong strand-specific patterns of relative codon overrepresentation, the results suggest that the relative codon overrepresentation test can serve as a tool to help corroborate the genome organization of unclassified CRESS DNA viruses.


INTRODUCTION
The rise of metagenomics has revolutionized our understanding of the expansive distribution of viruses (1). With innovative high-throughput sequencing methods, many novel groups of uncultivated viruses have been discovered during the past decades. Circular, Rep-encoding single-stranded DNA (CRESS DNA) viruses are among the viral groups with a drastic increase in discovery rate (2). Once thought to be relatively rare within the virosphere, the application of the phi29 polymerase for rolling circle amplification of circular DNA templates in metagenomic studies have revealed the remarkable breadth and ubiquity of CRESS DNA viruses (3)(4)(5). To keep up with the increasing rate of data collection, there has been a paradigm-shift from traditional virus taxonomy involving biological features like host range and virion morphology to sequence-guided classification (6,7).
CRESS DNA viruses are characterized by their small, circular genomes, which can be monopartite (comprised of one segment) or multipartite (comprised of many segments) (2). Seven families (i.e., Bacilladnaviridae, Circoviridae, Geminiviridae, Genomoviridae, Nanoviridae, Smacoviridae and Redondoviridae) and six groups of unclassified CRESS DNA viruses that infect or are associated with eukaryotes were recently classified in phylum Cressdnaviricota based on phylogenetic analysis of the homologous HUH endonuclease that functions as a replication initiation protein (Rep) (8). Cressdnaviricota includes the smallest known capsidencoding, eukaryotic-infecting pathogens in the virosphere, with hosts as diverse as diatoms, fungi, plants, and animals (2). However, many are known from sequence alone and do not have any definitive host, as is the case for viruses in Smacoviridae, Redondoviridae and unclassified CRESS DNA viruses. While viruses in some genera may have up to 10 open reading frames (ORFs), their genomes minimally encode for two proteins: the homologous Rep and a non-orthologous coat protein (CP) (2). CRESS DNA viruses' elevated substitution rates (9)(10)(11)(12) and a mechanistic predisposition for recombination (13)(14)(15)(16) have facilitated their rapid evolution and emergence. Excepting the nanoviruses which encode one gene per genomic segment, viruses in the established families of this diverse group have ambisense genomes, meaning they encode proteins in both directions with respect to a single origin of replication. Specifically, Rep and CP are always transcribed in opposite directions, with CP typically encoded in the template (or virion) sense strand (transcribed from the complement of the packaged ssDNA genome) and Rep encoded in anti-sense in the virion genome (capable of direct transcription from the virion sense strand). Since the orientation of these genes can be an important criterion in assigning a novel species to a genus, as is the case for the members of Circoviridae (17), identifying strand-specific traits for CRESS DNA viruses might facilitate genome characterization.
For this study, we wanted to determine whether there are patterns of unequal usage of synonymous codons, or codon usage bias (CUB), across Cressdnaviricota and whether those patterns could help classify unclassified CRESS DNA viruses with ambisense genomes. CUB is the joint consequence of mutation, selection, and genetic drift (18)(19)(20)(21), and, thus, revealing broad patterns in codon preference can give significant insights into processes shaping viral genomes. Factors implicated in shaping viral CUB include selection to match the host codon usage to optimize the use of the host tRNA pool (i.e., translational selection) (22,23), selection against sequences activating innate host immunity such as CpG dinucleotides (24,25), and mutational biases (26,27). Previous studies in ssDNA bacteriophages show a preference for thymineending codons that is consistent with a cytosine-to-thymine (c!t) transition bias and that differs from the codon preferences of their hosts (28,29). Cardinale etal. (30) further showed in begomoviruses that thymine-ending codons (nnt) are preferred in the virion-sense CP and adenine-ending codons (nna) are preferred in the anti-sense Rep, suggesting that a uniform mutational bias (not translational selection) is the main driver of CUB in viruses of this genus. Given the fact that unpaired DNA is highly vulnerable to spontaneous oxidative cytosine deamination, which can lead to increased c!t transitions during replication (31), we hypothesized that the c!t mutational bias will be present across Cressdnaviricota. If this mutational pressure proves to be significant for many CRESS DNA viruses, we believe that CUB could be exploited to aid taxonomic classification by imprinting ambisense gene pairs with strand-specific biases. We expect an overrepresentation for nnt codons in virion-sense and, in complement, nna codons in anti-sense. Additionally, we expect these biases to reflect a reduction in nnc codons in the virion-sense and a complementary reduction of nng codons in the anti-sense. Alternatively, if this mutational bias is not a dominant feature, then CUB of uncultured CRESS DNA viruses may match host CUB and help determine the ecology of these uncharacterized viruses.
Here we describe the Rep and CP codon usage patterns for circoviruses and cycloviruses based on their differences in genome organization, followed by the patterns exhibited by viruses of other CRESS DNA virus genera. Lastly, we extend the analyses to unclassified CRESS DNA viruses and report relative codon overrepresentation patterns. Results broadly reveal that codon usage is both genus-and strand-specific in Cressdnaviricota, indicating that additional factors besides a c!t mutational bias influence codon preference. We nonetheless observed that there is a general trend of relative codon overrepresentation of nnt codons in the virion-sense and of nna codons in ORFs encoded in anti-sense on the virion single-stranded genome of genera in Cressdnaviricota. When applied to unclassified CRESS DNA viral genomes, relative codon overrepresentation tests corroborate that 13 distinct unclassified CRESS DNA viruses encode for Rep in the virionsense and CP in the anti-sense, a genome organization that is presently unique to circoviruses. This suggests that the circoviruslike genomic orientation is more common than our current taxonomy suggests.

Data Retrieval
Coding sequences of all ambisense CRESS DNA viruses from genera with at least 20 species exemplars in the ICTV MSL36 were downloaded from the NCBI RefSeq database. The CRESS DNA virus genera analyzed here are Begomovirus, Circovirus, Cyclovirus, Gemycircularvirus, Mastrevirus and Porprismacovirus. Only complete genomes with annotated Rep and CP were selected for analysis; only mastreviruses that had annotations of the spliced version of Rep were included (sample sizes in Table 1). For each genus, Rep and CP sequences were further parsed into separate FASTA files and analyzed separately (except in the principal component analysis detailed in 2.5). For comparison, the coding sequences of the ambisense DNA-B segment of bipartite begomoviruses (i.e., the virion-sense nuclear shuttle protein, NSP, and the anti-sense movement protein, MP) were also analyzed. Gene annotations in the opposite sense of the defined genome orientation for each genus (see Figure 1) were verified for potential incorrect classification by BLASTing the ORFs. Thirtyone misannotated sequences of classified members of Cressdnaviricota were found to be in reverse complement in GenBank; accessions and their GenBank orientation are detailed in Supplemental File 1.
The complete genomes of unclassified ambisense CRESS DNA viruses with Rep and CP annotations were downloaded via the NCBI Taxonomy Browser (https://www.ncbi.nlm.nih. gov/Taxonomy/Browser/wwwtax.cgi) in July 2021. Redundant sequences were removed from the data set using a 95% sequence identity cutoff with CD-Hit (32) in the CD HIT Suite (33). The correct orientation of the ORFs for each virus was verified based on the stem loop harboring the putative origin of replication predicted by the StemLoop-Finder tool (34). Viruses in which the software could not predict a stem loop were removed from the data set. If predicted motifs did not match the canonical stem loop sequences commonly found in CRESS DNA viruses (2), genomes were manually examined for sequences that did match and were used to correctly orient the genomes. Annotated Rep and CP sequences were then extracted and split into virion-sense and anti-sense groups for analyses.

Codon Content Analysis
The proportions of nnt, nna, nng and nnc codons (where n=any nucleotide, t=thymine, a= adenine, g=guanine and c=cytosine) were calculated for all Rep and CP (in the case of begomovirus DNA-B, MP and NSP) sequences using the Biostrings package in R version 4.0.3 (35). Mann-Whitney U tests with a Bonferronicorrected significance cutoff of 0.05 were used to identify statistically significant differences between codon frequencies.

Relative Synonymous Codon Usage (RSCU) and Effective Number of Codons (Nc)
Relative synonymous codon usage (RSCU) values and effective number of codons (Nc) were calculated using DAMBE6 (36). The RSCU value for a codon is the ratio between that codon's observed and expected frequencies, assuming equal usage of all synonymous codons for each amino acid (37). Following convention, codons with RSCU values <0.6 were considered underrepresented, whereas codons with RSCU values >1.6 were considered overrepresented. RSCU heatmaps were visualized using the ComplexHeatmap R package (38). Nc is a gene-specific CUB index that reveals the degree to which the entire genetic code is used (39). The theoretical values of Nc are at their minimum when exactly one codon is used in each synonymous codon family and at their maximum when all synonymous codons are used equally. In the DAMBE6 implementation of the Nc calculation, which breaks 6-fold compound codon families into 2-fold and 4-fold codon families (40), values range from 23 (when synonymous codon usage is maximally biased) to 61 (when all codons are used uniformly).

Amino Acid Composition Indices
To assess potential constraints on CUB imposed by amino acid composition, grand averages of hydropathy (GRAVY) and aromaticity (AROMA) scores were calculated with CodonW version 1.4.2 (41). A GRAVY score is calculated as the average of the sum of the hydropathic indices of each amino acid in a sequence (42). Values range from -2 to 2, with negative values indicating hydrophilicity while positive values indicate hydrophobicity. The AROMA scores reflect the aromaticity of a protein, defined as the relative frequency of aromatic amino acids in a sequence (43). Higher AROMA values indicate increased aromaticity.

RSCU Principal Component Analysis (PCA)
Principal component analysis (PCA) was performed on the RSCU values using the stats package in R. A PCA is a dimensionality reduction statistical method that transforms a large set of variables into linear combinations (known as principal components) that account for as much of the variance as possible. In this case, the RSCU values of 59 codons (excluding Met and Trp, which do not have synonymous codons, and STOP codons) were reduced into two principal components and plotted against each other using the factoextra package in R.

Correlation Analyses Between Compositional Features and CUB Indices
A correlation analysis was performed to identify the relationships between frequencies of codons ending in a given nucleotide (nnt, nna, nng, nnc), amino acid composition (i.e., GRAVY and AROMA scores) and CUB indices (i.e., Nc and the RSCU PC1 and PC2). Pearson's correlation coefficients were calculated using the Hmisc package and matrices were visualized using the corrplot package, both in R.

Hypergeometric Tests for Relative Codon Overrepresentation
Hypergeometric tests were performed to assess whether nnt, nna, nng and nnc codons of each sequence were statistically overrepresented relative to the major opposite sense sequence of the same genomic segment. The hypergeometric test uses the hypergeometric distribution (i.e., the probability distribution of k success states in n draws without replacement) to calculate the statistical significance of having drawn a specific k successes (out of n draws) from a given population. We used this test to identify which sub-populations are over-or under-represented in a sample. Using nnt codons as an example, the test determines the probability of randomly drawing the observed number of nnt codons for ORF 1 or ORF 2 in the global population of codons (defined as the sum of all codons in ORF 1 and ORF 2). Our hypergeometric tests were designed to ignore Met and Trp codons in the calculation. The tests were carried out using the phyper function in the R stats package. The R script is available o n G i t h u b a t h t t p s : / / g i t h u b . c o m / a c r e s p o -v i r e v o l / Cressdnaviricota-codon-usage.

Pairwise Nucleotide Identity Analysis
A pairwise nucleotide identity matrix was calculated for complete Rep sequences using SDT v1.2 (44) with default settings to assess evolutionary relationships between samples.

RESULTS
A total of 849 ambisense viruses, which include 6 recognized CRESS DNA virus genera across Cressdnaviricota and a subset of unclassified CRESS DNA virus sequences, were analysed in this study ( Table 1). The coding region organization is mostly conserved across established genera, with all viruses (except those in Circovirus) harboring the putative origin of replication on the CP-encoding strand ( Figure 1). Among unclassified CRESS DNA viruses, genomes have organizations with either CP or Rep in the virion-sense, as detailed below. We first present results for members of the Circoviridae genera, Cyclovirus and Circovirus, given that one of the main distinguishing features between them is the Rep and CP orientation of their ambisense genomes. We follow that with the analyses of several other CRESS DNA virus genera and, lastly, CUB analyses for unclassified CRESS DNA virus sequences.

Circoviridae
Circoviridae is comprised of viruses from two established genera: Cyclovirus and Circovirus. Cycloviruses have been identified only via sequencing and are believed to infect invertebrates based on the presence of endogenous virus elements within arthropod genomes (45). Some circoviruses, on the other hand, are well-characterized and include economically important pathogens of livestock such as porcine circovirus (17,46). These genera only encode the two major proteins, Rep and CP, and the main distinguishing feature between them is that they have opposite gene orientations ( Figure 1). We focused our analyses on these genera as a proof of concept; if the mutational bias affects both ORFs uniformly, they will have similar strandspecific biases regardless of what gene is encoded.

Circoviridae Codon Distribution
We first calculated the distribution of codons in the Rep and CP sequences of circo-and cycloviruses( Figure 2; Supplemental Table 1). For cyclovirus CP sequences (which are in the virionsense), the nnt codon distribution exhibited the highest median value (41.0%) while nnc had the lowest (13.2%). Conversely, nna was highest (32.6%) for cyclovirus Reps while nng was the lowest (15.8%). The distributions in both comparisons were significantly different (Mann-Whitney U; p-value <0.05). The differences in CP nna and nng and the complementary Rep nnt and nnc counts were not significantly different (Supplemental Table 2), indicating they are more evenly distributed. These results suggest an enrichment of thymine accompanied by a depletion of cytosine (which in anti-sense are equivalent to adenine and guanine, respectively), an observation that is consistent with strong c!t mutational bias on the virion sense of the genome.
Interestingly, for circoviruses, it was nng that had the highest median value (29.4%), followed by nnt (28.0%), nnc (25%) and nna (18.5%) for the virion-sense Rep ( Figure 2B). Only nna content was significantly lower than the rest (Supplemental Table 2). The anti-sense CP had complementary results with a median high for nnc (32.0%), followed by nna (26.2%), nng (21.3%) and, lastly, nnt (20.6%). No difference was observed between CP nnt and nng, yet they were both significantly lower than nnc content. These results suggest that circovirus coding sequences do not broadly conform to the expectation that nnt will be the most highly enriched codons, and generally have increased guanine content and are depleted of adenine at synonymous sites. Codons are grouped by nucleotide identity at the third position (i.e., nnt, nna, nng and nnc, where n, any nucleotide; t, thymine; a, adenine; g, guanine and c, cytosine). Asterisks denote significant differences (p < 0.05) between the Rep and CP distributions based on Bonferroni-corrected Mann Whitney U tests.

Crespo-Bellido and Duffy
Codon Usage Bias Across Cressdnaviricota Frontiers in Virology | www.frontiersin.org June 2022 | Volume 2 | Article 899608 respectively. One sequence, corresponding to Duck associated cyclovirus 1 (DuACyV1; accession NC_034977), displayed a noticeably different RSCU profile ( Figure 3A). Examination of the DuACyV1 CP RSCU values shows that only nna and nnc codons are overrepresented while a significant portion of nnt codons (10/16) are underrepresented (Supplemental Table 3). In the DuACyV1 Rep, most of both over-and underrepresented codons were a-ending. The RSCU profiles of circoviruses were less conserved across both ORFs, with no broad patterns in preferred relative codon usage based on codons grouped by the identity of their third nucleotide ( Figure 3B; Supplemental File 2). For circovirus Rep sequences, nna codons seemed generally more underrepresented than in any other group while there was a general tendency of underrepresentation of nnt and nng codons across the CP sequences.
We conducted a PCA to understand the variance in RSCU profiles of viruses in Circoviridae ( Figure 4; Supplemental File 3).
In both genera, we find that PC1 generally separates sequences into Rep and CP clusters ( Figures 4A, B). PC1 largely separates sequences according to nnt/nng and nna/nnc RSCU in both genera, as reflected by the contributions of individual codons to PC1 ( Figures 4C, D). This separation potentially reflects the c!t mutational bias, given that nnt codons have negative PC1 scores while nnc codons have positive PC1 scores, and the inverse relationship is observed for the complementary nna and nng codons. For circoviruses, PC2 seems to be separating sequences by hosts, as most of the bird-associated virus sequences have negative PC2 scores while a majority of the mammalian and fish circovirus sequences have positive PC2 scores (Supplemental File 3). In contrast, no apparent clustering by host source was observed for cycloviruses, which is consistent with phylogenetic analyses that show cyclovirus sequences do not cluster according to host of isolation and their associations might not be indicative of definitive host range (17).

Correlations Between Composition and Codon Usage of Circoviridae Coding Sequences
We performed correlation analyses between nucleotide composition, amino acid composition and codon usage indices of coding sequences for viruses in Circoviridae ( Figure 5). Even though the represented species within each genus may have disparate host ranges and host influence on codon usage might weaken the relationships between compared variables, we were able to identify base composition constraints as strong determinants of codon usage variation. Between cyclovirus CP codon contents, the strongest correlation was observed between nnc and nnt (r= -0.68), which is in accordance with a c!t mutational bias ( Figure 5A). However, similar correlations were observed between nnt-nna (r= -0.63), and nnc-nng (r= -0.60), suggesting other mutational constraints may be influencing codon usage for some species. The strongest relationship between codon proportions in cyclovirus Rep sequences corresponds to nng and nna (r= -0.81), which is a complementary pattern to the correlation in CP. Yet, nnc-nnt also had a strong negative correlation (r= -0.76). Nc was negatively correlated with nnt in CP (r= -0.66) and with nna in Rep (r= -0.67), reinforcing that constraints on codon usage correlate with the proportions of nnt codons in virion sense and nna codons in anti-sense. GRAVY scores had a significant negative correlation with Rep nna (r= -0.67), indicating a selective constraint associated with nna codon usage. The RSCU PC1 had strong and inverse correlations with CP nnc (r= 0.86) and nnt (r= -0.86) as well as Rep nnc (r= 0.67) and nnt (r= -0.85), which shows that PC1 is mostly influenced by nnc and nnt codon content. The strongest relationships with PC2 were observed with Rep nna (r= -0.87) and nng (r= 0.73), implying that separation along PC2 is mainly due to nna and nng content of Rep sequences.
The strand-specific relationships observed for circoviruses ( Figure 5B) resemble those of cycloviruses in terms of codon content. The strongest correlation for the virion-sense Rep sequences is the negative correlation between nnc and nnt (r= -0.74) while the strongest correlation for the anti-sense CP sequences is between nng and nna (r= -0.62). These trends are consistent with the virion-andanti-sensepatternsofcycloviruses,suggestingthatthec!t bias on the virion sense of the genome influences synonymous codon usageinbothORFsforvirusesinCircoviridae.However,thecircovirus Rep sequences also showed significant negative correlations between nnc-nna (r= -0.68), nng-nnt (r= -0.64) and nng-nna (r= -0.63), suggesting other biases may be present. Regarding codon composition, Nc is only significantly correlated with the Rep nna content (r= -0.65). No strong significant correlations are found between amino acid composition indices and any other variables. In contrast to cycloviruses, PC1 is mainly correlated with CP nna (r= 0.85) and nng (r= -0.7), implying PC1 mostly explains the variance in nna and nng content in the CP sequences. PC2 has strong positive correlations with nnt and negative correlations with nnc in both Rep and CP, indicating that variance in PC2 largely corresponds to the nnt and nnc content across both ORFs. In addition, strong correlations were observed between PC2 and Rep nna (r= 0.71) and nng (r= -0.63), which means that PC2 further explains variance in the circovirus Rep sequences based on nna and nng content.

Hypergeometric Tests for Relative Codon Overrepresentation in Circoviridae Coding Sequences
Hypergeometric tests were used to evaluate the relative overrepresentation of codons in the Rep and CP sequences of The expectation is that nnt codons will consistently be overrepresented in virion-sense sequences while nna codons will be overrepresented in the anti-sense sequences due to the assumed influence of c!t mutation bias in shaping codon usage. Most of the cycloviruses showed a relative overrepresentation of nnt codons in their virion-sense CP (83.3% of sequences) and of nna codons in their anti-sense Rep (76.2%) (Figures 6A, C). There were three coding sequences that showed contrary results: two Rep sequences with a relative overrepresentation of nnt codons, corresponding to Spider associated cyclovirus 1 (SpACyV1; accession NC_040324) and DuACyV1 (accession N C _ 0 3 4 9 7 7 ; t h e D u A C y V 1 C P a l s o s h o w e d a n overrepresentation of nna codons). Many viruses also showed relative nng overrepresentation in the virion-sense and nnc overrepresentation in the anti-sense. However, this is likely a by-product of c!t mutational bias acting on the virion-sense strand of the genome, as nng is not enriched in the virion-sense, but nng (which reflects cytosine) is depopulated in the anti-sense (illustrated in Figure 2A). A similar scenario is observed for nnc codons, where nnc codons are not enriched in the anti-sense but are depleted in the virion-sense. Therefore, relative overrepresentation of nng in the virion-sense and nnc in the anti-sense is due to a depletion of their complementary base in the opposite sense.
A majority (56%) of circovirus anti-sense sequences showed nna overrepresentation (Figures 6B, C). In contrast to cycloviruses, nnt codons were not overrepresented in most virion-sense sequences (44%) but nng codons were (64%). These results are consistent with nng being the most frequent codon, on average, in the virion-sense ( Figure 2B). Only the Starling circovirus (StCV; accession NC_008033) anti-sense CP sequence showed nnt codon overrepresentation, and no virionsense sequences have relative nna codon overrepresentation.

Codon Usage Across Diverse Members of Cressdnaviricota
There are varying relative codon overrepresentation trends observed in ambisense genes for begomoviruses, mastreviruses, porprsimacoviruses and gemycircularviruses (Figure 7; Supplemental File 4). All viruses for these families possess genomes with a cyclovirus-like orientation (with Rep in  anti-sense). These trends and relevant sequence composition analyses related to synonymous codon usage were examined.

Begomovirus
Begomoviruses (family Geminiviridae) are whitefly-transmitted CRESS DNA viruses that infect dicotyledonous plants (or dicots) and severely constrain crop production in tropical and subtropical regions around the world (47). Their genomes can be monopartite or bipartite, and in bipartite genomes their segments are denominated DNA-A and DNA-B. As monopartite and bipartite DNA-A begomovirus segments are homologous, we will refer to them jointly as DNA-A throughout this manuscript. On the virion-sense strand, DNA-A has a CP gene, and often a partially overlapping gene that encodes the precoat protein. The anti-sense strand of DNA-A encodes the Rep and additional proteins involved in transcription, replication, and RNAsilencing suppression. The DNA-B segment encodes two proteins involved in systemic movement: a nuclear shuttle protein (NSP) in the virion-sense and a movement protein (MP) in the anti-sense. We extended our analyses to exemplars of both segments to compare their codon usage patterns. Consistent with a c!t mutational bias, the hypergeometric tests show that DNA-A segments usually have a relative nnt overrepresentation in the virion-sense and almost always have nna overrepresentation in the anti-sense ( Figure 7A). Codon composition analyses reveal an enrichment of nnt, low usage of nnc and a depletion of nna codons in the virion-sense while there is relatively even usage of all bases except for a decrease in nng in the anti-sense for DNA-A segments ( Figure S1A). While the c!t bias may explain a significant fraction of the overrepresentation patterns, the higher number of anti-sense sequences displaying nna overrepresentation relative to the number of virion-sense sequences with nnt overrepresentation is largely due to the observed virion-sense adenine depletion. DNA-B segments also typically displayed overrepresentation of nnt in the virion-sense and nna codons in the anti-sense, but it was the virion-sense sequences that nearly always showed the expected (nnt) relative overrepresentation ( Figure 7B). Compared to DNA-A, DNA-B segments had distinct codon composition patterns: increased nnt and decreased nnc usage in the virion-sense NSP (in accordance with a c!t bias), and similar (although statistically different) usage of all codon types aside from an increase in nna for the anti-sense MP ( Figure S1B). Overall, codon usage in the virion-sense does not complement the usage in the anti-sense for either segment. The contrasting results between segments and between ambisense genes suggest that codon usage is evolving in a segment-and strand-specific manner in begomoviruses.
The RSCU analyses are consistent with the codon composition analyses: for DNA-A, a general overrepresentation of nnt codons and underrepresentation of nna codons in the virion-sense, and relatively even RSCU except for broad underrepresentation of nng codons across sequences in antisense ( Figure S1C). The ORFs on the DNA-B segment display high nnt and low nnc RSCU values in the virion-sense and increased relative usage of nna codons in the anti-sense ( Figure S1D). PCA based on the RSCU values show that PC1 separates Rep and CP sequences into distinct clusters ( Figure  S2A), while it took both axes to separate the DNA-B genes into discrete clusters ( Figure S2B). While PC1 and PC2 explain similar proportions of variance in DNA-A and DNA-B gene RSCU, the PCA loadings plots ( Figures S2C, D) show that codons contribute to the principal components in different ways, which supports the notion that codon usage is evolving differently in each segment. The correlation analyses show that there is a strong negative relationship between nnc-nnt content in both virion-sense sequences (CP r= -0.86, NSP r= -0.84, Figures S3A, B). Nc and PC1 did not have strong correlations with any codon type in DNA-A, implying that base constraints are not the main determinants of CUB for either Rep or CP. Conversely, both Nc and PC1 had strong negative correlations with nnt and strong positive correlations with nnc in NSP, suggesting c!t is a main factor influencing codon usage for this gene ( Figure S3C). On the other hand, MP nnt and nnc content showed the same relationships as NSP with PC1 but not with Nc ( Figure S3D).

Mastrevirus
Mastreviruses constitute the second largest genus within Geminiviridae and are monopartite, leafhopper-transmitted viruses that are predominantly associated with disease in monocotyledonous (or monocot) plants in the Old World (47). Mastreviruses have two genes in the virion-sense encoding CP and a movement protein. In the anti-sense, they encode Rep, which is expressed from two genes by transcript splicing, and an additional overlapping gene that supports viral replication (48). We provide results of analyses performed on CP and the spliced version of Rep sequences.
A great majority of mastrevirus CP sequences (92.3%) showed relative overrepresentation of nng codons while most Rep sequences (73.1%) revealed an nna codon overrepresentation ( Figure 7C). This is consistent with codon composition analyses showing nng with the highest median content in the virion-sense ( Figure S4A; Supplemental Table 1). Interestingly, while nnt is not relatively enriched in the CP sequences, there were seven sequences exhibiting very high nnt content (> 37%) ( Figure S4A; Supplemental File 3). They all correspond to dicot-infecting mastreviruses while all the other sequences closer to the median correspond to monocotinfecting viruses (Supplemental File 1). In the anti-sense, instead of an increase in nna content, nnc and nnt show the highest median values. The relative nna overrepresentation in Rep sequences ( Figure 7C) is therefore likely supplemented by the depletion of nna nucleotides in the virion-sense ( Figure S4A).  Figure  S4C). However, a more well-defined cluster would form for CP if we removed sequences corresponding to dicot-infecting mastreviruses, which account for seven of the eight CP sequences with negative PC1 values (Supplemental Files 1, 3).  PC1 seems to separate sequences by nna/nnt and nng/nnc usage while PC2 is separating according to nng/nnt and nna/nnc usage ( Figure S4D). The strongest correlation in the virion-sense sequences in terms of codon composition is between nnc-nnt (r= -0.88) ( Figure S4E). Anti-sense Rep sequences show similar negative correlations between nnc-nnt (r= -0.71) and nng-nna (r= -0.72). PC1 correlates positively with nng and nnc, and negatively with nna and nnt for both Rep and CP, which is consistent with the codon contributions observed in the PCA loadings plot ( Figure S4D). Additionally, PC1 correlates somewhat with AROMA, which indicates a potential selective constraint imposed by amino acid composition on the RSCU of mastrevirus CP sequences.

Porprismacovirus
Porprismacoviruses (family Smacoviridae) are a recently discovered group of CRESS DNA viruses that, although prevalent in vertebrate feces (including human), are yet to be cultured (49,50). Their ambisense genomes contain only two major ORFs, Rep and CP. Although gut-associated methanogenic archaea have been implicated as potential smacovirus hosts based on the presence of CRISPR spacers that match regions of the smacovirus genome (51), no definitive host-range has been established. We performed analyses on porprismacovirus CP and Rep sequences.
Almost all porprismacoviruses exhibit nnt overrepresentation in the virion-sense and nna overrepresentation in the anti-sense ( Figure 7D). The codon composition of virion-sense CP sequences displays very high median levels of nnt (39.8%) and strikingly low median levels of nna (9.7%) ( Figure S5A and Supplemental Table 1). In complement, anti-sense Rep sequences show very low median levels of nnt (10.8%), yet all other codons have comparable distributions. RSCU heatmaps broadly corroborate codon compositions for both ORFs ( Figure  S5B). Rep and CP are separated well by PC1, which explains over a third of the variance in RSCU values ( Figure S5C, D).
Codon composition correlation analysis shows that nnc-nnt is strongly correlated the virion-sense (r= -0.79) and mildly correlated in the anti-sense (r= -0.6) ( Figure S5E). PC1 is most strongly correlated with nnt in both senses, indicating that the separation of Rep and CP sequences is largely due to the high levels of thymine in the virion-sense and low levels of adenine in the anti-sense. These codon usage analyses indicate that there is a uniform and strong bias against adenines at the third position in porprismacovirus genomes.

Gemycircularvirus
Gemycircularviruses belong to the Genomoviridae family, the sister clade to Geminiviridae that includes viruses with fungal and insect host-ranges (52)(53)(54). Genomoviruses have the same genome orientation as geminiviruses but encode only CP (virion-sense) and Rep (anti-sense) in their ssDNA genomes. Here we present results from CUB analyses performed on the two coding sequences of Gemycircularvirus species exemplars.
Virion-sense CP sequences consistently have a relative nnc overrepresentation ( Figure 7C). In contrast to all other CRESS DNA virus genera examined, the anti-sense sequences often show an nnt overrepresentation, observed in 40.7% of viruses ( Figure 7E). Codon composition supports these trends since nnc and nnt have the highest median content in the virion-and anti-sense strands, respectively ( Figure S6A and Supplemental Table 1). Both Rep and CP show a depletion of nna content. RSCU analyses show uniform nnc codon overrepresentation and general patterns of underrepresentation across several nna codons in the virion-sense ( Figure S6B). In the anti-sense Reps, some nnt codons are overrepresented across many sequences, and select nna codons (e.g., gga, aga, cga) are strongly overrepresented ( Figure S6B and Supplemental File 2). The RSCU PCA mostly separates Rep and CP into individual clusters but PC1 and PC2 only account for 23.4% of the variance seen in the RSCU profiles, indicating complex and largely unexplained patterns across genes ( Figure S6C). PCA loadings show separation along PC1 is mostly due to nnc usage ( Figure S6D).

Unclassified CRESS DNA Viruses
The examination of relative codon overrepresentation across Cressdnaviricota showed that nna codons were always the most overrepresented across anti-sense sequences in all genera (excepting Gemycircularvirus, Figure 7E). To a lesser degree, nnt codons are overrepresented in the virion-sense sequences of most CRESS DNA virus genera ( Table 2). Additionally, except for gemycircularviruses, there is a consistent lack of relative nna codon overrepresentation in the virion-sense and a lack or relative nnt codon overrepresentation in the anti-sense strands of these viruses. Given these observations, we decided to conduct hypergeometric tests on Rep and CP pairs of sequences from unclassified CRESS DNA viruses with ambisense genomes to assess their virion-sense strandedness.
First, genome orientation for all unclassified viruses was verified by identifying the direction of putative origin of replication motifs found in stem loops predicted by the StemLoop Finder tool. We found 48 instances where the submitted unclassified CRESS DNA viruses in GenBank were potentially in the reverse complement (Supplemental File 1). After correcting the orientation of these 48 genomes, there were 35 unclassified viruses predicted to code Rep in the virion-sense and CP in the anti-sense, sharing a strand orientation that is unique to circoviruses. Hypergeometric tests of relative codon overrepresentation in Rep and CP pairs revealed that nearly half of virion-and anti-sense sequences display overrepresentation of nnt and nna codons, respectively (Figure 8). These results agree with the typical patterns observed for the established Cressdnaviricota genera. Around 12% of virion-sense and anti-sense sequences display the opposite pattern: nna codon overrepresentation in the virion-sense and nnt codon overrepresentation in the anti-sense.
Out of the 35 CRESS DNA viruses with predicted circovirus-like orientation, 8 virion-sense Rep and 10 anti-sense CP sequences across 13 different viruses display the expected pattern of nnt codon overrepresentation in the virion-sense and nna codon overrepresentation in the anti-sense, supporting their circoviruslike organization. These 13 unclassified viruses were sampled from many sources including fish tissue, bird cloacal swabs, plants, and mammalian feces and meat (Supplemental File 5). We assessed the evolutionary relationships between the homologous Rep sequences of the 13 viruses by performing a pairwise nucleotide identity analysis with SDT v1.2. We found that, on average, the Rep sequences share 57.6% nucleotide identity to each other and the highest pairwise identity between two samples is just 63.6% (Supplemental File 5). Given the low pairwise percent nucleotide identity between the most well-conserved gene shared by these samples, it is likely that these viruses represent members of multiple undetermined families that possess the strand organization shared with only circoviruses among established Cressdnaviricota families.

DISCUSSION
Cressdnaviricota is a widespread and rapidly expanding phylum of small ssDNA viruses with highly diverse host ranges. As the number of sequenced CRESS DNA virus representatives grows exponentially through genomic and metagenomic surveillance, it is necessary to refine sequenced-based approaches to characterize viruses in the absence of our ability to culture them. To this end, we decided to explore trends in codon usage bias of the two major ORFs, Rep and CP, of species representatives across the entire phylum. Based on the extensive literature from experimental and comparative analyses of ssDNA bacteriophages and eukaryotic viruses (9,10,29,(55)(56)(57)(58)(59)(60), we hypothesized that the strong bias towards c!t transitions exhibited by ssDNA viruses would result in an overrepresentation of nnt codons in virion-sense and, in complement, nna codons in anti-sense coding sequences. Results showed that most sequences, across genera, shared the predicted over-and underrepresented codons. While in members of Circoviridae these statistical over-and underrepresentations are complementary to each other (Figures 2,  3), we found most genera have idiosyncratic preferences. This suggests that forces shaping codon usage differ across Cressdnaviricota, but they often act in a strand-specific manner. Virion-sense genes had a high amount of nnt codons, often being the most overrepresented kind of codon in these ORFs by median value (excepting circoviruses and mastreviruses where nnt codons are second most prevalent to nng codons, and gemycircularviruses, where nnt comes in second to nnc). For anti-sense ORFs, while nna codons were usually overrepresented relative to the virion sense ORF, only two groups had median nna content that exceeded that of the other kinds of codons: cycloviruses and begomovirus DNA-B segments. Consequently, codon content indicates that nnt codons are not incredibly enriched in the virion sense and nna codons are not in antisense, contrary to the expectation from a persistent and quantifiable c!t substitution bias. Regardless, we observed that most virion-sense sequences across the classified genera of Cressdnaviricota do have a relative nnt overrepresentation (69.1%) while a majority of anti-sense sequences have a relative nna overrepresentation (80.1%, Table 2), which conforms to our expectations. Moreover, only a very small fraction of sequences displays the opposite pattern (i.e., virionsense enriched for nna-0.42% -and anti-sense enriched for nnt-3.9%), indicating that the hypergeometric test can serve as a tool to reliably corroborate the strandedness of a CRESS DNA virus genome.
When examining unclassified CRESS DNA viruses, establishing gene orientations based on the position of the origin of replication can be difficult since the nonanucleotide varies between families and we do not have a clear sense of the unsampled motif diversity out there. Additionally, multiple candidate stem loop structures are often found throughout CRESS DNA virus genomes, the nonanucleotides where replication starts can be palindromic, and we lack experimental data revealing origin function for many of these viruses. Based on our best origin predictions aided by the StemLoop Finder tool, we applied hypergeometric tests to ambisense coding sequence pairs of unclassified CRESS DNA viruses and observed that only a little less than half of all sequences follow the expected patterns for nnt and nna relative overrepresentation ( Figure 8). However, nnt overrepresentation in the virion-sense and nna overrepresentation in anti-sense are the most common patterns of overrepresentation. Around 12% of unclassified sequences in both the virion-sense and anti-sense data sets have inverse relative codon overrepresentation. It is possible that strand assignments for these annotations are incorrect despite our best efforts to correct them by predicting the direction of putative origin of replication motifs. Alternatively, they can be properly assigned, and their codon usage is a result of distinct factors from those acting at third codon positions across Cressdnaviricota. Nonetheless, results are generally consistent with those observed for established CRESS DNA virus genera. We found 13 unclassified CRESS DNA viruses with circovirus-like genome organization based on both origin predictions, and nnt/nna relative codon overrepresentation profiles. As there was substantial nucleotide divergence among the Rep gene sequences for the 13 viruses, they likely represent candidate members of more than one novel family in Cressdnaviricota. Therefore, it appears that viruses encoding Rep in the virion-sense and CP in the anti-sense are more common than what our current taxonomy suggests, where of more than 30 genera only Circovirus uses this genomic orientation.

Additional Factors Influencing Codon Usage for Cressdnaviricota
Our results suggest that codon preference across Cressdnaviricota cannot be solely attributed to c!t mutation bias. We explore some of the factors that could potentially influence CUB in viruses of Cressdnaviricota.

Additional Mutational Biases
Mutational biases implied by significant negative correlations between codon types are present in all data sets, suggesting base composition constraints play an integral role in shaping codon usage for all genera. A c!t bias is suggested to be present across all data sets based on negative correlations between nnc and nnt in virion-sense and nng and nna in anti-sense for all genera. Yet, there are often other codon types that are negatively correlated with each other, indicating that multiple mutational biases might be acting simultaneously. While c!t transitions represent by far the most common type of mutation in single-stranded DNA (31,61), evolution experiments with geminiviruses have revealed that, after c!t, g!t substitutions occur more frequently than others (56)(57)(58)(59). Interestingly, van der Walt et al. (56) and Monjane etal. (57) showed that these mutations occur in a strand-specific manner in both mastrevirus and begomovirus species, with CP exhibiting a distinct overrepresentation of g!t substitutions, which may also be caused by oxidative damage to nucleotides. While long-term g!t bias will act synergistically with c!t to enrich nnt codons in the virion-sense CP, our results show that mastrevirus CPs are enriched in nng and depleted in nna. Since no mutational bias can explain the observed pattern, we hypothesize that some form of selection accounts for the codon usage of mastrevirus CP sequences (discussed in the next section). In turn, the mastrevirus CUB signatures indicate that measured substitution biases from previous short-term evolution experiments (56,57) do not necessarily reflect long-term evolution of synonymous sites in geminivirus genomes.

Host-Induced Selection Pressures
Viruses rely on the host tRNA machinery for the synthesis of viral proteins and are subjected to different selective pressures imposed by the intracellular environment as well as a variety of antiviral immune systems (62)(63)(64). Selection pressures might act in synergy or antagonistically with mutational biases to shape codon usage and could partially explain the genus and strandspecific patterns observed across the diverse host range of members of Cressdnaviricota. Since host ranges are not welldefined for most CRESS DNA viruses, it is hard to make interpretations from in silico, comparative analyses such as this one. However, we provide some hypotheses based on the existing literature for CRESS DNA viruses.
PCA analysis of RSCU in circovirus genes revealed that some variation in codon usage is host-dependent, since PC2 produced clusters based on host source and separated avian circoviruses from fish and mammalian circoviruses. A previous CUB study also identified distinct patterns of codon usage between avian and mammalian circoviruses, citing a strong deviation from mutational patterns in mammalian viruses as the determinant of the differences between the two groups (65). A more recent study revealed that tRNA genes are drastically reduced in number and complexity in birds when compared to other vertebrates, but they exhibit overall similar tRNA usage patterns (66). This suggests that birds have evolved to use their limited tRNA inventory more efficiently, and this optimization may in turn exert selective pressures on codon usage for avian viruses. Overall, selection acting in different directions between avian and mammalian circoviruses might partially explain the diverged codon usage patterns. A more detailed, host-based examination of codon usage with available circoviruses might provide further insights into factors responsible for this demarcation.
Translational selection can promote optimal viral codon usage and translation through assimilation to the host tRNA pool (67,68), and may partially explain the strand-specific differences observed in the geminiviruses in this study. Begomo-and mastreviruses are both geminiviruses and infect plants, yet they have distinct host ranges: begomoviruses infect dicots, and mastreviruses are largely restricted to monocots. Both are similarly biased against nna codons in the virion-sense CP and against nng in the anti-sense Rep sequences (Figures S1, S4). However, they do show one notable distinction in CP: begomoviruses are biased in favor of nnt while mastreviruses are enriched for nng codons. Previous codon usage analyses revealed that mastreviruses' monocot hosts prefer nng and nnc codons while begomoviruses' dicot hosts prefer nna and nnt codons in highly expressed genes (30,69,70). Cardinale etal. (30) showed that the RSCU values of begomo-and mastrevirus CP sequences are well-correlated to their respective hosts' RSCU while Rep sequences are not. Since likeness to host codon usage is not consistent throughout a virus genome and is generally stronger in structural proteins (23), it is plausible to suggest that there is host-imposed translational selection acting mostly on CP, leading to an increase in nnt content in begomoviruses and nng content in mastreviruses. This is further supported by the fact that we observed a much higher nnt content in dicotinfecting mastreviruses than in monocot-infecting ones ( Figure  S4A; Supplemental File 3).
Gemycircularviruses showed a complicated pattern of codon usage that largely did not conform to results observed for the rest of the examined genera ( Figure 7). Principal components from our RSCU PCA jointly accounted for only 23.4% of the variance in the data, which points to very divergent RSCU profiles between samples. The most well-characterized gemycircularvirus replicates in a fungal host, a mycophagous insect, the cells of a distantly related moth and potentially a variety of other hosts (53). It is possible that gemycircularviruses are generalists which have not adapted to any one specific host codon usage, as viruses with a narrow host spectrum match their hosts' tRNA pools better and exhibit stronger CUB than those with wide host ranges (71). Due to their putative ability to infect such a broad diversity of hosts, gemycircularviruses present a potential system to explore hostspecific factors influencing codon usage in ssDNA viruses.

Secondary Structures
Nucleic acid secondary structures can be functionally important in virus genomes, which means that they can be targeted by selection to preserve structure integrity (72)(73)(74)(75). Mutations disrupting predicted secondary structures have been found to reliably revert and restabilize base pairings in mastreviruses (76). Additionally, median substitutions rates of paired 3 rd codon position nucleotides within predicted highly conserved secondary structures are significantly lower than for unpaired nucleotides in circoviruses, begomoviruses, mastreviruses and capulaviruses (77)(78)(79)(80). Muhire etal. (77) also observed reduced mutation frequencies for paired sites in short term evolution experiments for mastreviruses. These results indicate that selection acts to conserve secondary structures in CRESS DNA viruses, even in extremely short time scales, and that perhaps secondary structures can protect paired nucleotides from oxidative damage. Correlating secondary structure predictions in coding regions with nucleotide frequencies in silent codon sites may inform whether secondary structures play a significant role in shaping global codon usage for CRESS DNA viruses.

Cyclovirus Outliers
Two outliers were detected within Cyclovirus in our hypergeometric test analyses: a Rep sequence from Spider associated cyclovirus 1 (SpACyV-1) and the Rep and CP sequences for Duck associated cyclovirus 1 (DuACyV-1). Rep phylogenetic analyses reveal SpACyV-1 as an intermediate between circo-and cycloviruses and its inclusion into Cyclovirus is not phylogenetics-based but rather informed by its inferred genome orientation (81). The DuACyV-1 CP shows no significant evolutionary relationship to other cyclovirus CPs and clusters with unclassified CRESS DNA viruses (82). The Rep, however, shares a 98% nucleotide sequence identity with a partial Rep sequence (~470 nucleotides in length) isolated from human faeces (83), which groups closely with circoviruses rather than other cycloviruses (84). These results suggest that DuACyV-1 might be a recombinant involving a circovirus-like Rep and a CP diverged from other cycloviruses both phylogenetically and in terms of RSCU. Potentially, both DuACyV-1 and SpACyV-1 warrant investigation into new genera within Circoviridae.

CONCLUSIONS
Ambisense genomes of viruses classified within Cressdnavircota display genus-and strand-specific patterns of codon usage. While influenced by different factors in each genus, hypergeometric tests reveal consistent relative nna codon overrepresentation in anti-sense sequences and, to a lesser extent, nnt codon overrepresentation in the virion-sense, suggesting it could potentially serve as a complementary tool to verify genome organization of novel sequences. Additionally, there is an even more constant absence of relative nna codon overrepresentation in the virion-sense and a lack or relative nnt codon overrepresentation in the anti-sense strands of these viruses (except for gemycircularviruses), which indicates that there is a low probability of incorrect assignment. When applied to unclassified CRESS DNA viruses, hypergeometric test results provide support for the notion that, while CP is more commonly encoded in the virion-sense strand across Cressdnaviricota, more viral genera with circovirus-like genomes (i.e., harboring a virion-sense Rep ORF) must exist. We hope that this comprehensive analysis of codon composition across Cressdnaviricota can provide a framework for more fine-scale, experimental approaches to understand the evolution of codon preference for such a diverse group of viruses.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author.

AUTHOR CONTRIBUTIONS
AC-B and SD conceived the project, AC-B conducted the analyses, AC-B and SD wrote the manuscript. All authors contributed to the article and approved the submitted version.

FUNDING
This work was supported by an HHMI Gilliam Fellowship to AC-B and SD (#GT13634) and NSF OIA 1545553 to SD. AC-B was also supported by a Rutgers SUPERGrad fellowship and the Rutgers Initiative for Maximizing Student Development (NIH R25GM055145).