Chlamydia trachomatis Strain Types Have Diversified Regionally and Globally with Evidence for Recombination across Geographic Divides

Chlamydia trachomatis (Ct) is the leading cause of bacterial sexually transmitted diseases worldwide. The Ct Multi Locus Sequence Typing (MLST) scheme is effective in differentiating strain types (ST), deciphering transmission patterns and treatment failure, and identifying recombinant strains. Here, we analyzed 323 reference and clinical samples, including 58 samples from Russia, an area that has not previously been represented in Ct typing schemes, to expand our knowledge of the global diversification of Ct STs. The 323 samples resolved into 84 unique STs, a 3.23-fold higher typing resolution compared to the gold standard single locus ompA genotyping. Our MLST scheme showed a high discriminatory index, D, of 0.98 (95% CI 0.97–0.99), confirming the validity of this method for typing. Phylogenetic analyses revealed distinct branches for the phenotypic diseases of lymphogranuloma venereum, urethritis and cervicitis, and a sub-branch for ocular trachoma. Consistent with these findings, single nucleotide polymorphisms were identified that significantly correlated with each phenotype. While the overall number of unique STs per region was comparable across geographies, the number of STs was greater for Russia with a significantly higher ST/sample ratio of 0.45 (95% CI: 0.35–0.53) compared to Europe or the Americas (p < 0.009), which may reflect a higher level of sexual mixing with the introduction of STs from other regions and/or reassortment of alleles. Four STs were found to be significantly associated with a particular geographic region. ST23 [p = 0.032 (95% CI: 1–23)], ST34 [p = 0.019 (95% CI: 1.1–25)], and ST19 [p = 0.001 (95% CI: 1.7–34.7)] were significantly associated with the Netherlands compared to Russia or the Americas, while ST30 [p = 0.031 (95% CI: 1.1–17.8)] was significantly associated with the Americas.
ST19 was significantly associated with the Netherlands and Russia compared with the Americas [p = 0.001 (95% CI: 1.7–34.7) and p = 0.006 (95% CI: 1.5–34.6), respectively]. Additionally, recombinant strains were ubiquitous in the data set [106 (32.8%)], although Europe had a significantly higher number than Russia or the Americas (p < 0.04), the majority of which were from Amsterdam [43 (87.8%) of 49]. The higher number of recombinants in Europe indicates selective pressure and/or adaptive diversification that will require additional studies to elucidate.

INTRODUCTION
The modern pathogenic Chlamydiaceae family has a rich evolutionary history, diverging from environmental Chlamydiales approximately 700 million years ago (Horn et al., 2004). The human Chlamydiaceae species Chlamydia trachomatis (Ct) has infected human populations, causing sexually transmitted diseases (STD) and the chronic ocular disease known as trachoma, since the 27th century BC (Perkins and Hill, 2013). Trachoma was initially described in China and in the Ebers Papyrus of Egypt, and subsequently spread to Europe during the Crusades (Perkins and Hill, 2013). While improvements in hygiene and sanitation have eliminated trachoma from many global populations, the disease is still endemic in many developing countries of Africa, Central and South America, the South Pacific and Asia in addition to aboriginal populations in Australia (Perkins and Hill, 2013). Presently, Ct is the leading cause of preventable blindness and bacterial sexually transmitted infections (STIs) worldwide (Rowley et al., 2012) with estimates of over 250 million trachoma cases and 110 million annual STI cases, according to the World Health Organization (WHO Sexually transmitted infections [STIs], 2015).
Ct has evolved to include 19 serological variants (serovars) based on antibody typing of the major outer membrane protein (MOMP) with over 60 ompA genotypes (Dean and Millman, 1997; Batteiger et al., 2014; Isaksson et al., 2016; Peuchant et al., 2016; Schillinger et al., 2016), the gold standard typing technique for all Chlamydiaceae spp. The serovars are designated A through K, Ba, Da, Ia, Ja, L1–3, and L2a, while the ompA genotypes or strains are denoted by the same or by a number or letter after the conventional serovar name (e.g., D1; Ga) for new genotypes (Batteiger et al., 2014). These strains are responsible for ocular, urogenital and rectal infections. Ocular infections include trachoma, a chronic ocular disease, and ophthalmia neonatorum (Darville, 2005), an infection acquired during passage through a Ct infected birth canal (WHO Sexually transmitted infections [STIs], 2015). Urogenital strains cause not only ocular infections, which usually present as unilateral conjunctivitis (Dean et al., 2008), but also can ascend from the endocervix to cause sequelae such as pelvic inflammatory disease, infertility and ectopic pregnancy (Mårdh, 2004; Blas et al., 2007; Baud et al., 2008). Rectal infections can progress to proctitis and inguinal syndrome (Sethi et al., 2009). While the latter is caused primarily by the lymphogranuloma venereum (LGV) strains L1–3, L2a, L2b, and L2c, the former can be caused by most Ct strains, although strains B, Ba, and C are rarely detected in the urethra, endocervix or rectum (Danby et al., 2016; Labiran et al., 2016). Strain A is the only strain that is confined to the ocular mucosa (Dean, 2010).
Classification of Ct strains was conventionally performed by serotyping and more recently by ompA genotyping (Dean et al., 1992) since the organism is rarely cultured, a requirement for serotyping. Although ompA genotyping can be informative, the gene represents a mere 0.1% of the genome and is subject to immune selective pressure and recombination (Joseph et al., 2011, 2012). Finer, more holistic typing schemes are necessary to track recombination events, differentiate new and persistent infections (Götz et al., 2013) from reinfection, follow transmission patterns and elucidate potential biomarkers. Three Multi Locus Sequence Typing (MLST) schemes have been developed for Ct (Klint et al., 2007; Pannekoek et al., 2008; Dean et al., 2009), of which only two meet the MLST criteria of using strictly housekeeping genes (Pannekoek et al., 2008; Dean et al., 2009). Our MLST scheme employs seven highly conserved housekeeping genes and successfully resolves reference and clinical Ct samples into LGV, trachoma, non-prevalent non-invasive urogenital, and prevalent non-invasive urogenital clonal complexes representing the respective diseases in addition to revealing evidence for recombination (Dean et al., 2009; Batteiger et al., 2014).
Partial and whole genome sequencing (WGS) have added considerably to our knowledge of the diversity of Ct and evidence for recombination. We initially bioinformatically identified recombination within ompA (Millman et al., 2001) and then among partial genome sequences of trachoma and sexually transmitted strains involving ompA and polymorphic membrane proteins (pmp) (Gomes et al., 2004, 2006, 2007). In the first publication of comparative WGS of Ct, we identified three major clades that were similar to the MLST disease-related clonal complexes with a subclade encompassing the trachoma strains within the non-prevalent non-invasive urogenital clade (Joseph et al., 2011). Subsequent WGS by our and other groups has substantiated these phylogenetic groupings as well as the recombinogenic nature of Ct (Harris et al., 2012; Joseph et al., 2012; Seth-Smith et al., 2013; Hadfield et al., 2017).
As whole genome sequencing remains cost-prohibitive for large sample sets and beyond the reach of most research laboratories, in this work, 323 Ct reference and clinical samples from 15 countries and 5 continents were analyzed by MLST to provide a more comprehensive analysis of the global diversification of Ct strain types. We included 60 new clinical samples from Amsterdam, Netherlands, eight from Boston, MA, United States, and 58 from St. Petersburg, Russia, a region that had not previously been evaluated by MLST.

Study Populations and Ethics
Information on populations and ethics for samples collected previously is included in the publications by Dean et al. (2009) and Batteiger et al. (2014). Russian women were enrolled in the original studies following verbal informed consent after approval by the Local Institutional Review Board at DO Ott Institute of Obstetrics and Gynecology and the Russian Academy of Medical Sciences (RAMS), St. Petersburg, Russia (Shipitsyna et al., 2007). For the Russian samples, endocervical swabs were obtained consecutively from January 2006 to January 2008 in two university clinics in St. Petersburg, Russia, as described elsewhere (Smelov et al., 2009). To minimize the potential for low response from the enrolled women, samples were collected without obtaining personal information. After removal of any mucopus with a cotton swab, a Dacron swab was inserted into the endocervix, rotated and placed in an empty 5 mL vial. Specimens were kept at 4 °C (39 °F) for up to 4 days before they were shipped to the laboratory where they were stored at 4 °C (39 °F). Within 1-3 days (Shipitsyna et al., 2007) all samples were tested for the presence of Ct by commercial NAAT assays (Shalepo et al., 2006; Smelov et al., 2009). Additionally, either a conventional PCR (Lytech, Moscow, Russia) or a real-time PCR (Central Research Institute of Epidemiology, Moscow, Russia) was used. The results were confirmed in Amsterdam, Netherlands by the commercial real-time PCR (TaqMan, Applied Biosystems, United States) (Morré et al., 1999; Smelov et al., 2009) and CE-IVD certified Presto CT-NG Assays (Goffin Molecular Technologies, Beek, Netherlands) (de Waaij et al., 2015).
ompA Genotyping and MLST Analysis
Genomic DNA was purified from clinical isolates and ompA genotyped as described previously (Dean et al., 2009; Batteiger et al., 2014). Only the samples from St. Petersburg were provided as swabs; DNA was extracted and purified for these samples using our protocol as described in Joseph et al. (2014). MLST analysis examined seven housekeeping genes: glyA, mdhC, pdhA, yhbG, pykF, lysS, and leuS, with primers as described (Dean et al., 2009) (Supplementary Table 1). All seven housekeeping genes were amplified and sequenced as described (Batteiger et al., 2014). A consensus sequence was created from the forward and reverse sequence. The genes for each of the St. Petersburg, Amsterdam and Boston samples were concatenated and queried against the 265 samples in the MLST database in addition to including these samples in the database. Allelic numbers and STs were assigned based on this query as described previously (Batteiger et al., 2014).
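The assignment step can be pictured as a lookup from a seven-locus allelic profile to an ST. The following is a minimal sketch under stated assumptions, not the database's actual implementation: the function name `assign_st`, the in-memory dictionary `st_table` and all allele numbers are hypothetical, and novel profiles simply receive the next consecutive ST number.

```python
from typing import Dict, Tuple

# Locus order follows the seven housekeeping genes named in the text.
LOCI = ("glyA", "mdhC", "pdhA", "yhbG", "pykF", "lysS", "leuS")

Profile = Tuple[int, ...]


def assign_st(profile: Profile, st_table: Dict[Profile, int]) -> int:
    """Return the ST for an allelic profile, registering a new ST if unseen."""
    if len(profile) != len(LOCI):
        raise ValueError("expected one allele number per locus")
    if profile not in st_table:
        # Assumption: novel profiles get the next consecutive ST number.
        st_table[profile] = max(st_table.values(), default=0) + 1
    return st_table[profile]


# Hypothetical allele numbers, for illustration only.
st_table = {(1, 1, 1, 1, 1, 1, 1): 1, (1, 2, 1, 1, 1, 1, 1): 2}
known = assign_st((1, 2, 1, 1, 1, 1, 1), st_table)  # profile already present
novel = assign_st((3, 2, 1, 1, 1, 1, 1), st_table)  # profile not yet present
```

Keying the table on the full tuple of allele numbers means that two samples share an ST only when they agree at all seven loci.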

Phylogenetic Analysis and Strain Clustering
Using the concatenated sequences, dataset strain clustering and Single Nucleotide Polymorphism (SNP) analyses were performed as described (Batteiger et al., 2014). Briefly, this included visualizing clusters of related STs and non-related STs using eBURST (Feil et al., 2004). Founder STs were identified by the highest number of single locus variants (SLV) branching from that particular ST (i.e., the clonal ancestor that diversifies into other STs). Clonal complexes generated by eBURST were defined as groups of STs in which each ST is an SLV of at least one other ST in the group.
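The founder rule described above (the ST with the most SLVs) can be illustrated with a short sketch. This is not the eBURST program itself: the helper names, the toy profiles and the alphabetical tie-break are assumptions made for illustration, and the sketch counts SLVs across the whole input rather than within each clonal complex.

```python
from typing import Dict, Tuple

Profile = Tuple[int, ...]


def is_slv(a: Profile, b: Profile) -> bool:
    """True when two allelic profiles differ at exactly one locus."""
    return sum(x != y for x, y in zip(a, b)) == 1


def slv_counts(profiles: Dict[str, Profile]) -> Dict[str, int]:
    """Count, for every ST, how many other STs are its single locus variants."""
    return {
        st: sum(is_slv(p, q) for other, q in profiles.items() if other != st)
        for st, p in profiles.items()
    }


def predicted_founder(profiles: Dict[str, Profile]) -> str:
    """ST with the most SLVs; ties broken alphabetically (an assumption)."""
    counts = slv_counts(profiles)
    return max(sorted(counts), key=lambda st: counts[st])


# Hypothetical profiles, not taken from the study database.
toy = {
    "ST_a": (1, 1, 1, 1, 1, 1, 1),
    "ST_b": (2, 1, 1, 1, 1, 1, 1),
    "ST_c": (1, 3, 1, 1, 1, 1, 1),
    "ST_d": (2, 3, 4, 1, 1, 1, 1),
}
founder = predicted_founder(toy)
```

In this toy set, ST_a has two SLVs (ST_b and ST_c) and is the predicted founder, while ST_d differs from every other profile at two or more loci and would be a singleton.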
Phylogenetic trees were created by maximum likelihood using the Symmetric+GI model, which provided the best fit for the data, in the R package phangorn (Schliep, 2010) to analyze the nucleotide sequence variation between the seven MLST loci for each ST. Tree nodes were verified with 1,000 bootstrap replicates. Alternative evolutionary pathways, such as horizontal gene transfer, were analyzed with SplitsTree using the splits decomposition method as described (Dean et al., 2009; Tamura et al., 2013). In addition, the sequences of the seven MLST loci for each sample were compared across the dataset and to the ompA genotype of the same sample to determine evidence for putative recombination.

Statistical Tests
Fisher's exact test was performed in R to test for significant region-specific ompA and ST clustering; a p-value of <0.05 was considered significant. Confidence intervals were determined based on the method of Clopper-Pearson (Borkowf, 2006). Simpson's Diversity Index, D, was calculated for the MLST data as described (Simpson, 1949; Hunter and Gaston, 1988). A D-value of ≥0.95 was considered ideal for molecular typing techniques (van Belkum et al., 2007). The ompA genotypes were excluded from the analysis. The Benjamini-Hochberg FDR method (Benjamini and Hochberg, 1995) was used to correct p-values for multiple comparisons.
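For reference, the discriminatory index in the Hunter and Gaston (1988) formulation is D = 1 - [1 / (N(N - 1))] Σ n_j(n_j - 1), where N is the number of samples and n_j the number of samples of the j-th type. The sketch below computes this value; the function name and the six ST labels are hypothetical and are not the study data.

```python
from collections import Counter
from typing import Hashable, Iterable


def discriminatory_index(types: Iterable[Hashable]) -> float:
    """Hunter-Gaston index: 1 - sum(n_j * (n_j - 1)) / (N * (N - 1))."""
    counts = Counter(types)
    n_total = sum(counts.values())
    if n_total < 2:
        raise ValueError("need at least two typed samples")
    same_type_pairs = sum(n * (n - 1) for n in counts.values())
    return 1.0 - same_type_pairs / (n_total * (n_total - 1))


# Hypothetical ST labels for six samples.
toy_sts = ["ST19", "ST19", "ST23", "ST34", "ST39", "ST39"]
d_value = discriminatory_index(toy_sts)  # 1 - 4/30
```

D is the probability that two samples drawn at random without replacement fall into different types, so it is 1 when every sample has its own type and 0 when all samples share one type.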
Samples were classified as putative recombinants when the sequences of the seven genes that comprise the ST were non-concordant with each other or with the ompA genotype of the same sample.
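The concordance rule can be expressed as a small check. This is a simplified sketch, not the procedure used in the study: it assumes that each locus has already been matched to the set of ompA genotypes of database strains carrying an identical sequence, and the function name and the `mixed` example are hypothetical. The D/256b example mirrors the Boston sample described in the text, whose seven MLST sequences matched F strains.

```python
from typing import Dict, Set

LOCI = ("glyA", "mdhC", "pdhA", "yhbG", "pykF", "lysS", "leuS")


def is_putative_recombinant(ompa: str, locus_matches: Dict[str, Set[str]]) -> bool:
    """Flag a sample whose MLST loci disagree with each other or with ompA.

    locus_matches maps each locus to the ompA genotypes of database strains
    whose sequence at that locus is identical to the sample's sequence.
    """
    if not locus_matches:
        raise ValueError("no loci supplied")
    shared = set.intersection(*locus_matches.values())
    if not shared:
        # No single genotype matches at every locus: loci are non-concordant.
        return True
    # Loci agree with each other; compare them with the sample's ompA genotype.
    return ompa not in shared


d256b = {locus: {"F"} for locus in LOCI}         # ompA D, MLST genes match F
concordant_e = {locus: {"E"} for locus in LOCI}  # ompA E, MLST genes match E
mixed = {locus: {"J"} for locus in LOCI}         # hypothetical mixed sample
mixed["glyA"] = {"G", "K"}
```

Here `is_putative_recombinant("D", d256b)` and `is_putative_recombinant("J", mixed)` both return True, while the concordant E sample returns False.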
PROC FREQ in SAS was used to identify SNPs associated with disease phenotype and haplotype as described previously (Dean et al., 2009). Levene's test evaluated the variance across the dataset of the 323 samples. The Classification Index was used to determine the significance of the association of each SNP with a disease phenotype, where a p-value of <0.05 was considered significant.
DnaSP v5.10 (Librado and Rozas, 2009) was used to calculate nucleotide (nt) and haplotype (hd) diversity to determine the genetic diversity and differentiation for regional subgroups on the concatenated sequences of the seven MLST genes. DnaSP considers the frequency of variants (STs) present in a population and also the genetic distances that separate these variants from each other. Genetic population differentiation between regional subgroups was assessed using the pairwise fixation index (Fst) in Arlequin v3.5 (Excoffier and Lischer, 2010) with significance testing by permutation.
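Nucleotide diversity (Pi) is commonly defined as the average number of nucleotide differences per site between two sequences drawn from the sample. The sketch below computes it as the mean pairwise proportion of differing sites for an ungapped alignment; it is an illustration of the quantity, not a reimplementation of DnaSP, which also handles gaps and missing data. The function name and the toy alignment are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def nucleotide_diversity(seqs: Sequence[str]) -> float:
    """Mean proportion of differing sites over all pairs of aligned sequences."""
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if length == 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same non-zero length")
    pairs = list(combinations(seqs, 2))
    # Total number of mismatched sites summed over every pair of sequences.
    diffs = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return diffs / (len(pairs) * length)


# Hypothetical 10-nt alignment of four samples (two are identical).
toy_alignment = ["ACGTACGTAC", "ACGTACGTAC", "ACGTTCGTAC", "ACGAACGTAC"]
pi = nucleotide_diversity(toy_alignment)  # 6 differences / (6 pairs * 10 sites)
```

Because every sample contributes to the pairwise comparisons, common STs pull Pi down, while a few divergent sequences, such as LGV isolates in an otherwise urogenital sample set, can raise it noticeably.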

Characteristics and Geographic Distribution of Alleles
The characteristics of the alleles for each gene locus are shown in Supplementary Table 2. The number of alleles varied by gene locus, ranging from seven to 18, as did the number of polymorphic sites. We determined the allele frequencies by geographic region for the 78 alleles (Table 1). Thirty-two (41%) alleles were observed once. The highest number of unique alleles for a geographic region was in Western Europe at 16 alleles but the highest frequency was 43.3% for Russia, which was not statistically significant.

ST and ompA Distributions
For the 323 samples, 84 unique STs were identified (Supplementary Table 3). STs novel to the dataset were numbered consecutively in order of identification. Of those 84, 57 (67.9%) were singletons, with a relatively even distribution by geographic region (excluding Asia and Africa where the sample sizes were very small) with a higher percentage of singletons in Europe that was not significantly different (Table 2; p = 0.08). Table 2 also shows that the percentage of unique STs per region was highly similar. However, the ST/sample ratio was significantly greater for Russian samples than for European and American samples (p < 0.009). There were also significant differences in the distribution of STs (Table 3). Dutch females (n = 79) were significantly more likely to be infected with ST23 (p = 0.032; 95% CI: 1-23) and ST34 (p = 0.02; 95% CI: 1.1-25) compared to Russian females (n = 58) and with ST19 (p = 0.001; 95% CI: 1.7-34.7) compared to American females (n = 108). Supplementary Table 4 shows the distribution of STs by geographic region.
There were 26 ompA genotypes observed in the dataset, resulting in a 3.23-fold lower resolution than for STs. Excluding the STD samples from South Africa, all samples available from Asia and Africa were from trachoma patients; 1 (7.1%) of 14 was a urogenital Da ompA genotype (ST 37) in Asia while 1 (5.9%) of 17 was a urogenital E genotype (ST 39) in Africa.
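The 3.23-fold figure follows directly from the counts reported here (84 STs versus 26 ompA genotypes). As a worked check, the sketch below also recomputes the Russian ST/sample ratio; that second calculation assumes the ratio is the 26 STs containing Russian samples divided by the 58 Russian samples, which the text does not state explicitly.

```python
# Counts reported in the text.
unique_sts = 84
ompa_genotypes = 26

# Fold difference in typing resolution between MLST and ompA genotyping.
fold_resolution = unique_sts / ompa_genotypes

# Assumption: the ST/sample ratio is (STs containing Russian samples) / (Russian samples).
russian_sts = 26
russian_samples = 58
st_per_sample = russian_sts / russian_samples

print(f"fold resolution: {fold_resolution:.2f}")
print(f"ST/sample ratio (Russia): {st_per_sample:.2f}")
```

Both values round to the figures quoted in the text (3.23 and 0.45).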
The ompA distribution of urogenital strains varied across Europe, the Americas, and Russia. In Europe, ompA genotype D was significantly more prevalent than in other geographic regions (p = 0.046) and more frequent than the globally prevalent genotype E. In Russia, ompA genotype G was significantly more prevalent than in all other regions (p = 0.001). In the Americas, ompA genotype Ia was significantly more prevalent than the other regions (p = 0.001). In comparing female STD cohorts, Russian women were significantly more likely to be infected with E and G (p = 0.001 and 0.026, respectively) while Dutch women were significantly more likely to be infected with D and I (p = 0.026 and 0.002, respectively).
Based on DnaSP, nucleotide diversity was lower in Russia (Pi = 0.00216) compared to the Netherlands (Pi = 0.00321) and North America (Pi = 0.00294) (Table 4). However, the Netherlands and North American datasets contained isolates from men and LGV samples. The phylogenetically disparate LGV samples appeared to contribute substantially to nucleotide diversity; nucleotide diversity dropped from 0.00250 to 0.00196 in Dutch women when the four LGV isolates were removed. When comparing non-LGV isolates from women in these regions, Russian women had the highest nucleotide diversity (Pi = 0.00216), followed by North American women (Pi = 0.00210), and Dutch women (Pi = 0.00196).
Assessing population differentiation between regional subgroups by Fst revealed significant differences between most regional subgroup pairwise comparisons (Table 5). African and Asian subgroups exhibited high Fst values in all comparisons, indicating those regions were distinct from others in the dataset. This is not surprising given the small sample size and the fact that these samples are from two distinct trachoma populations.

Phylogenetic Relationships and Evidence for Recombination
The phylogenetic relationships were initially evaluated by eBURST, which revealed clonal clusters (CC) similar to what we reported previously (Figure 1) (Dean et al., 2009; Batteiger et al., 2014) but with the addition of an LGV cluster. These included CC-A encompassing trachoma STs, CC-B with non-invasive, non-prevalent urogenital STs, CC-C with non-invasive prevalent STs and CC-D that included LGV STs. The predicted founders were ST19, ST23, ST34, and ST39 (Figure 1 and Supplementary Table 5). The 57 singleton STs are denoted by a gray circle alone or within a colored circle, representing a specific geographic region, but were not associated with any specific region. The maximum likelihood tree (Figure 2) revealed ST branches similar to the eBURST clusters; both had branches or clusters for the disease phenotypic groups of LGV, non-invasive prevalent urogenital, and non-invasive non-prevalent urogenital STs. However, the trachoma STs formed a subgroup of the non-invasive non-prevalent urogenital branch. In addition, within disease groups, STs branched from central nodes by geographic region. These nodes contained, in general, large numbers of STs from diverse locations. For example, the founders ST19, ST23, ST34, and ST39 located at nodes in the tree contain STs from Europe, Russia, and the Americas. The amino acid tree showed a similar phylogeny (Supplementary Figure 1).
The SplitsTree decomposition tree revealed evidence for a network structure consistent with homologous recombination (Figure 3). This was demonstrated by the interconnecting networks, specifically among the ST founders ST19, ST23, ST34, and ST39 and other STs on the network, in addition to the canonical evolutionary pathway shown in the tree (Figure 2). Supplementary Figure 2 shows the amino acid tree.
The SplitsTree data are consistent with findings in the MLST and ompA genotyping data. Samples were classified as recombinant when the sequences of the seven genes that determine the ST were non-concordant with each other or with the seven gene sequences of the Ct genotype associated with the ompA genotype (Table 6); these samples are denoted in bold in Supplementary Table 3. There was no evidence of any recombination among the seven ST genes for any sample, although this would conceivably be possible.
A total of 106 (32.8%) samples were considered putative recombinants. Excluding Asia and Africa where the sample sizes were small, Europe had a significantly higher number of recombinants than Russia and the Americas (p < 0.04) (Table 2): of the 109 European samples, 49 were recombinant, 43 (87.8%) of which were from the Netherlands. While ompA genotypes were generally consistent across samples of the same ST, many cases of recombinant strains were observed. For example, ST19 (n = 40) was primarily associated with ompA genotype G (37.5%), but eight different ompA genotypes in total (B, D, E, G, H, I, J, K) were associated with this ST. The majority of these samples were from St. Petersburg and Amsterdam (Supplementary Table 3). ST23 (n = 32) was also associated with eight ompA genotypes (B, D, G, H, I, Ia, J, K) but the most frequent was Ia (47%). In contrast, the most geographically prevalent ST was ST39 (n = 45) where 95% were associated with ompA genotype E. There were also 24 singletons that were recombinants (Table 6). Table 7 shows the ompA genotype, ST and allelic SNPs, if present, associated with each of the eight Boston samples added to the dataset. Under the column denoted ST sequence homology are the Ct ompA genotype sequences to which the sequences of the seven MLST genes are identical. For example, sample J/259b has four SNPs that are different from the sequences of the seven genes for reference and clinical J strains in the database. These SNPs are identical to the sequences of strains G and K, which suggests that this sample is a recombinant between J and G or K strains. Another example is sample D/256b; the ompA genotype is D but the MLST sequences of the seven genes are identical to

DISCUSSION
Strain-typing techniques for bacteria have progressed from identifying variations in gel electrophoresis patterns and melt curve analyses to sequencing single pathogen-specific genes and MLST. While typing based on WGS would be ideal, this remains out of reach given the current expense and lack, in general, of sufficient DNA from clinical samples. However, it should be mentioned that we and others have been developing techniques to enrich DNA recovery directly from urogenital and ocular patient sample types with some success (Seth-Smith et al., 2013; Joseph et al., 2014; Hadfield et al., 2017). ompA genotyping remains widely used among Chlamydia investigators for molecular epidemiologic and comparative studies of strains between STD and trachoma populations. However, ompA encodes the MOMP, which is under immune selection. MLST offers a more robust typing scheme by employing 6-8 housekeeping genes as relatively immutable signatures for strain typing (Maiden et al., 1998; Dean et al., 2009) and has become an important tool for studying both the epidemiology and evolution of human pathogens (Urwin and Maiden, 2003), including Ct.
In this study, we included 58 samples from Russia and eight from Boston, regions that have not previously been represented in Ct typing schemes. The 323 reference and clinical samples resolved into 84 STs, representing a 3.23-fold higher typing resolution over ompA genotyping, consistent with previous studies (Dean et al., 2009; Gravningen et al., 2012; Batteiger et al., 2014; Herrmann et al., 2015). The high discriminatory index D of 0.98 (95% CI 0.97-0.99) and narrow CI for our MLST scheme confirm the validity of this typing method.
We noted an overall high rate of novel STs (67.9%), which may be expected because entire populations were not sampled and the numbers are small for some areas such as Asia and Africa. For example, there were 109 samples from Europe, representing six different countries, and 19 (68%) of the 28 STs were novel. Our findings are similar to those of other studies, where the rates for novel STs were 62% among high school students in Norway (Gravningen et al., 2012), 65% among Tunisian sex workers (Gharsallah et al., 2016) and 62% among young adults in Amsterdam. Additional novel STs are likely to be identified as new regions undergo MLST. Indeed, in Russia, of the 26 STs that contained Russian samples, 18 (69%) were novel.
A significantly higher ST to sample ratio of 0.45 was identified for Russian compared to European and American samples (p < 0.009). This was not explained by the number of STs unique to a region as the percentages were relatively uniform across the geographies (Table 2). However, Russian women had the highest nucleotide diversity (Pi = 0.00216), followed by North American women (Pi = 0.00210), and Dutch women (Pi = 0.00196), excluding LGV strains that are present only in the female Dutch population and would therefore skew the data. There are no males in the Russian dataset and therefore only women were evaluated here for the three locales. However, when assessing population differentiation for women from the Netherlands, Russia, and North America, excluding those with LGV, there were no significant differences (Supplementary Table 8). A larger sample size for each group would likely provide better resolution of the data. The increased ST to sample ratio for Russia may reflect sexual mixing among the Russian STD cohort with the introduction of STs from other regions or reassortment of existing alleles. Of the 30 alleles present in Russia, 23 (76.7%) were found in other geographic regions, supporting the geographic influx of STs. Reassortment of alleles that generate new STs is also possible given the higher frequency of novel alleles in Russia at 43.3%. It has been shown that recombinational replacements, rather than point mutations, are the major contributors to clonal diversity among bacteria (Spratt et al., 2001).
In support of our hypothesis, Hadfield et al. (2017) have shown that Ct evolves within genomic 'ecotypes' but also outside of these niches via recombination, consistent with prior genomic studies (Harris et al., 2012; Joseph et al., 2012). However, without a larger sample size from St. Petersburg, the relative overall diversity of STs will remain unknown, and we can only speculate as to the degree of reassortment based on the alleles comprising the STs that represent only the currently sampled cohort of women in Russia.
Excluding singleton STs, STs 23 and 34 were significantly associated with female STD patients in Amsterdam. We had previously noted that ompA genotyping is valuable as a separate adjunctive typing method along with MLST as it allows comparison with strains typed only by ompA and also can provide evidence for transmission and treatment efficacy as well as putative recombination (Batteiger et al., 2014). But ompA should not be included as one of the genes in the MLST scheme because it is under immune selection (Maiden et al., 1998; Dean et al., 2009). We performed ompA genotyping on all samples in the database and, for ST23, the samples comprised Ia, B, D, G, H, I, J, and K genotypes while for ST34 they included genotypes F, D, D2, E, J, and Ja (Supplementary Table 3). The high number of ompA genotypes for these two STs suggests a high rate of recombination. For example, the sequences of the seven MLST genes matched those of reference strain Ia/UW202 for 47% of the ST23 samples where the ompA genotype was also Ia. However, the remaining samples that also matched the seven MLST Ia/UW202 sequences had B, D, G, H, I, J, and K ompA genotypes, indicating a mismatch between the MLST and ompA sequences, providing evidence for recombination (Supplementary Table 3). Similarly, STs such as 15, 19, and 34 had numerous samples, some of which were putative recombinants. These findings are supported by partial and whole genome sequencing, where ompA has been shown to be involved in frequent exchange and is considered a hotspot for recombination (Gomes et al., 2007; Harris et al., 2012; Joseph et al., 2012). For example, phylogenetic analyses indicate clustering of ompA Ja genotypes, which are uncommon, with highly prevalent ompA D, E, and F genotypes (similar to ST34) where hotspots of recombination were noted in ompA and pmpEFGH genomic regions (Gomes et al., 2007; Joseph et al., 2012).
Clusters of ompA D genotypes with less prevalent strains G, Ia, and J, similar to our strains in ST23, have also been noted (Harris et al., 2012). In a recent study, ompA Ba and C trachoma strains isolated from Australian Aborigines were found to cluster with urogenital D, Da, E, and F strains with hotspots also involving ompA and pmpEFGH (Andersson et al., 2016). Of course, with WGS, many additional genes have been found to be involved in recombination (Harris et al., 2012;Joseph et al., 2012;Hadfield et al., 2017).
To confirm putative recombinants, the sequences of the seven MLST genes were individually aligned to the respective gene for all samples in the database, as described above, and the results were compared to the ompA genotype of the same sample.
For example, Table 7 shows the five putative recombinants among the eight Boston samples. Sample D/256b had an ompA genotype of D but the seven MLST sequences were an exact match to the seven sequences of F. In some cases, one or more of the seven genes matched two different strains, as was the case for samples J/253b and J/259b. Other examples of recombinants are shown in Supplementary Tables 6 and 7 for the Dutch and Russian samples, respectively. In addition, SplitsTree analysis was performed and revealed ancillary evidence for a network structure consistent with homologous recombination (Figure 3). While these data confirm the recombinant nature of the samples, it is likely that there are other genetic regions that have undergone recombination and, therefore, it is not possible to determine the extent of genetic exchange unless WGS is performed (Joseph et al., 2011, 2012; Harris et al., 2012; Hadfield et al., 2017). Overall, there were 106 (32.8%) putative recombinants (Table 2 and Supplementary Table 3), which is similar to our previous studies and those of other investigators (Dean et al., 2009; Gravningen et al., 2012; Batteiger et al., 2014). Each geographic region contained recombinants, although Europe had a significantly higher number than Russia and the Americas (p < 0.04) (Table 2).
This result was skewed by the higher rate of recombinants among the Amsterdam population, which would be expected given that the samples came from individuals at high risk for STDs where sexual mixing and import of strains from tourists could increase the chances for multiple Ct strain infections and opportunities for recombination. Indeed, rates of Ct mixed infections as high as 6 to 16% have been reported among men who have sex with men and heterosexual populations, respectively, in Europe, including the Netherlands (Quint et al., 2011;Rodriguez-Dominguez et al., 2015).
The most geographically prevalent ST was 39 with 45 samples, 95% of which had an ompA genotype of E; there were only two recombinants in this ST: one from Boston with a D ompA genotype and one from St. Petersburg with a G genotype. E genotypes are known to be the most globally prevalent (Lysén et al., 2004; Millman et al., 2004; Spaargaren et al., 2004; Lee et al., 2006; Gharsallah et al., 2016) and the least recombinogenic based on whole genome sequencing (Joseph et al., 2011, 2012). In our dataset, there were 72 E genotypes with an ST to sample ratio of 0.25; only the F genotype, the 4th most prevalent genotype with 30 samples, had a lower ratio at 0.17. The lower ratios indicate greater fitness as these strains are prevalent worldwide and have fewer allelic variants that resolve into fewer STs. This is borne out by the fact that only eight (11%) of the 72 E genotypes and 0 of the 30 F genotypes were recombinants (Supplementary Table 3). A recent genome study that included 149 E strains supports our conclusions (Hadfield et al., 2017). Genotype D was also highly prevalent but had a much higher ratio at 0.48, and 29 (88%) of 33 samples were recombinants.
We previously found that phylogeny based on MLST resolved the STs along disease phenotype demarcations. As samples have been added to the database, the phenotype resolution has increased to include the LGV phenotype, denoted as clonal cluster-D (CC-D; Figure 1). Similarly, the tree shows the three main clusters with the trachoma STs as a Subcluster of Cluster I, which resembles those of other reports (Herrmann et al., 2015).
To determine whether the phenotypic groups could be more finely discriminated, we analyzed the database for SNPs that independently or together would identify a phenotype. As in our previous studies (Dean et al., 2009; Batteiger et al., 2014), specific SNPs correlated with LGV, non-invasive urogenital disease and trachoma. Haplotype 2, which included the non-invasive urogenital disease group, required exclusion of strains that were recombinants, specifically D genotypes.

FUNDING
This work was funded in part by National Institutes of Health grants R01 AI 098843 to DD and R03 TW 007754 to DD.