Epstein-Barr Virus in Burkitt Lymphoma in Africa Reveals a Limited Set of Whole Genome and LMP-1 Sequence Patterns: Analysis of Archival Datasets and Field Samples From Uganda, Tanzania, and Kenya

Epstein-Barr virus (EBV) is associated with endemic Burkitt lymphoma (eBL), but the contribution of EBV variants is ill-defined. Studies of EBV whole genome sequences (WGS) have identified phylogroups that appear to be distinct for Asian versus non-Asian EBV, but samples from BL or Africa, where EBV was first discovered, are under-represented. We conducted a phylogenetic analysis of EBV WGS and LMP-1 sequences obtained primarily from BL patients in Africa and representative non-African EBV from other conditions or regions using data from GenBank, Sequence Read Archive, or Genomic Data Commons for the Burkitt Lymphoma Genome Sequencing Project (BLGSP) to generate data to support the use of a simpler biomarker of geographic or phenotypic associations. We also investigated LMP-1 patterns in 414 eBL cases and 414 geographically matched controls in the Epidemiology of Burkitt Lymphoma in East African children and minors (EMBLEM) study using LMP-1 PCR and Sanger sequencing. Phylogenetic analysis revealed distinct genetic patterns of African versus Asian EBV sequences. We identified 281 single nucleotide variations (SNVs) in the LMP-1 promoter and coding region, which formed 12 unique patterns (A to L). Nine patterns (A, AB, C, D, F, I, J, K and L) predominated in African EBV, of which four were found in 92% of BL samples (A, AB, D, and H). Predominant patterns were B and G in Asia and H in Europe. EBV positivity in peripheral blood was detected in 95.6% of EMBLEM eBL cases versus 79.2% of the healthy controls (odds ratio [OR] = 3.83; 95% confidence interval 2.06-7.14). LMP-1 was successfully sequenced in 66.7% of the EBV DNA positive cases but in 29.6% of the controls (ORs ranging from 5 to 11 for different patterns). Four LMP-1 patterns (A, AB, D, and K) were detected in 63.1% of the cases versus 27.1% of the controls (OR range: 5.58-11.4). Dual strain EBV infections were identified in WGS and PCR-Sanger data. In conclusion, EBV from Africa is phylogenetically separate from EBV in Asia. Genetic diversity in LMP-1 formed 12 patterns, which showed promising geographic and phenotypic associations. The presence of multiple strain infection should be considered in efforts to refine or improve EBV markers of ancestry or phenotype.

Lay Summary
Epstein-Barr virus (EBV) infection, a ubiquitous infection, contributes to the etiology of both Burkitt Lymphoma (BL) and nasopharyngeal carcinoma, yet their global distributions vary geographically with no overlap. Genomic variation in EBV is suspected to play a role in the geographical patterns of these EBV-associated cancers, but relatively few EBV samples from BL have been comprehensively studied. We sought to compare phylogenetic patterns of EBV genomes obtained from BL samples in Africa and from tumor and non-tumor samples from elsewhere. We concluded that EBV obtained from BL in Africa is genetically separate from EBV in Asia. Through comprehensive analysis of nucleotide variations in EBV’s LMP-1 gene, we describe 12 LMP-1 patterns, two of which (B and G) were found mostly in Asia. Four LMP-1 patterns (A, AB, D, and F) accounted for 92% of EBVs sequenced from BL in Africa. Our results identified extensive diversity of EBV, but BL in Africa was associated with a limited number of variants, which were different from those identified in Asia. Further research is needed to optimize the use of PCR and sequencing to study LMP-1 diversity for classification of EBV variants and for use in epidemiologic studies to characterize geographic and/or phenotypic associations of EBV variants with EBV-associated malignancies, including eBL.

EBV infects >95% of adults globally (10,11), but BL and NPC exhibit distinct geographic distributions and age-specific patterns that are unexplained by simple EBV epidemiology. BL is the commonest childhood cancer in equatorial Africa and Papua New Guinea where 5-10 per 100,000 children below 15 years are affected (12) and is rare elsewhere. NPC occurs with high incidence in Eastern and South-Eastern-Asia and in some areas of the Middle East and North Africa (4). These distinct geographical patterns of BL and NPC could theoretically be attributed to genomic variations of EBV circulating in the different world regions. The discovery of variations in EBNA-2 and EBNA-3 genes enabled the classification of EBV into types 1 and 2 (13,14), with apparently different distributions in EBV isolates worldwide. However, a literature review of studies conducted up to 2009 showed that none of the genetic variations in EBV studied up to that point were either associated with EBV-associated malignancies or could explain the geographic patterns of the malignancies (4). Some of the possible limitations of the studies reviewed included focusing on variation in single EBV genes, such as LMP-1 (15,16), EBNA-1 (17), or BZLF-1 (18), because they were linked to suggestive biological properties of transformation (19), but no convincing epidemiological associations with disease patterns emerged (4).
The successful whole genome sequencing (WGS) of EBV samples (13,20,21) and increasing access to high-throughput sequencing (HTS) data of EBV from tumor and non-tumor samples present new opportunities to investigate genomic variations of EBV that may be associated with EBV-associated cancers. HTS studies have been utilized to discover genomic variations in EBV associated with NPC (22,23) and to investigate EBV genomic variations in samples from regions that were previously underrepresented, such as South America (24), and genomic variations in EBV from Africa or from BL (25).
We previously reported (25) 51 novel single nucleotide variants (SNVs) in the sequence spanning a 2.1 kb region of the LMP-1 promoter and coding region Exon 1-3 in 13 of 14 of primary BL biopsies from Ghana, Brazil, and Argentina that were investigated using HTS. The SNVs formed four unique LMP-1 patterns when aligned for the 112 EBV genomic samples available in GenBank, comprising 23, 29, and 3 shared SNVs in the promoter, LMP-2B Exon 1, and LMP-1 Exon 1 regions, respectively. The nucleotide variation patterns in LMP-1 were labeled A, B, and C, and the samples with the wild type (WT) reference sequence were labeled pattern D (25). EBV pattern A was observed in 48% of the 27 EBV samples from BL samples (primary biopsies or BL-derived cell lines) in GenBank but only in 8% of 85 non-BL samples analyzed (25). Pattern A variations were validated in the primary BL biopsies using Sanger sequencing of PCR products using 3 primer sets (Lei-1, 2, and 3) designed to capture the whole 2.1 kb hypervariable region in LMP-1 promoter and coding regions (25). Pattern A was the most frequently detected pattern among 50 additional BL tumors from Ghana, Argentina, and Brazil subsequently tested (25,26), highlighting this pattern as being frequent in BL or samples from Africa.
The discovery of novel LMP-1 patterns (25) builds on findings from previous studies of genetic diversity of LMP-1 (27-32), which is widely accepted as an EBV oncogene (15,16). Some of these studies have suggested ways to classify and study EBV genetic diversity, such as the 30 bp deletion in the C-terminus (28), the loss of restriction site XhoI in the N-terminus of LMP-1 (29), and the classification proposed by Edwards et al. based on nucleotide variants resulting in signature amino acid changes in the C-terminus of LMP-1 relative to the WT (B95-8) (27). The Edwards classification includes seven variants named according to the geographic region from which the initial EBV isolate was originally derived: Alaskan, China 1, China 2, China 3, Mediterranean + (Med +), Med -, and North Carolina. These classification systems have been used to study the biology of EBV, but as reviewed in Chang et al. (4), their utility as biomarkers of geographic or of cancer phenotypic associations has been less clear. Because the geographic patterns of BL or NPC in endemic versus non-endemic areas vary 30-90-fold (33), a useful marker for study of the geographic or phenotypic associations with EBV should be rare in geographical areas where the associated cancers are rare and common in geographical areas where the associated cancers also are common.
To guide our further epidemiological research using the novel LMP-1 patterns reported in Lei et al. (25) versus other established classifications of LMP-1 diversity (27-32), we performed a comparative analysis of EBV genetic variation using the Lei patterns versus seven other systems for 114 EBV sequences that were used in the discovery studies by Lei et al. (25) and Liao et al. (26). Our comparative analysis confirmed that the SNVs used to define patterns in Lei et al. did not overlap with nucleotide positions in the regions used to classify EBV in the seven other systems. Most of those systems utilized amino acid changes coded by nucleotides in the third exon of LMP-1 (26). The system proposed by Edwards et al. yielded a reasonable representation of phylogenetic clusters (23,27), but it did not allow a clear geographic separation of samples of African origin from those of Asian origin, whereas the patterns proposed by Lei et al. do (25). Similarly, the 30 bp deletion in the C-terminus (28), although easy to classify, did not discriminate either phenotypic or geographic associations (26).
Here, we expand our results of studying LMP-1 patterns through phylogenetic analysis of the largest set of EBV whole genome sequences (WGS) focusing primarily on samples from BL or from Africa. We also report a detailed curation of pattern-forming SNVs in LMP-1 and analyze them in the context of EBV samples from all other conditions and from elsewhere. To obtain preliminary data about LMP-1 patterns in eBL cases and age- and geographically matched healthy controls for comparison, we also performed targeted PCR and Sanger sequencing of LMP-1 in peripheral blood samples of 414 childhood eBL cases and 414 geographically matched non-BL controls obtained in the Epidemiology of Burkitt Lymphoma in East African children and minors (EMBLEM) study (34).
Figures 1 and 2 show the sample selection and prioritization flow and the data sources, processing, and analysis flow charts. We accessed 730 EBV genomic sequences obtained from BL, HL, NK/T lymphoma, PTLD, NPC, GC, infectious mononucleosis (IM), lymphoblastoid cell lines (LCL), and healthy donors from diverse geographic regions. The previously published EBV genomic sequences (n=545) were accessed as fasta files from GenBank or fastq files from the Sequence Read Archive (SRA) (35) (n=108; Supplementary Table S1). The sequence metadata, including accession numbers, sample type, geographical area, country, and EBV type, were downloaded (35). The new EBV sequences from the Burkitt Lymphoma Genome Sequencing Project (BLGSP) were accessed as BAM files from Genomic Data Commons using the GDC Data Transfer (GDC, https://portal.gdc.cancer.gov/; Project ID: CGCI-BLGSP, dbGap study accession: phs000527.v13.p4; details in Supplementary Table S2). A total of 162 files were accessed, of which 77 had high-quality EBV content and were analyzed (7).
The fasta files from GenBank were consensus EBV sequences and were filtered to exclude low-quality sequences, defined as having an in-house calculated count of ambiguous bases (N) >2000, assuming a sequencing error rate of 1% and a genome length >170 kb. The genomic sequences from the BLGSP were flagged by default as mapped to the human reference genome (GRCh38) or unmapped. The reads mapped to the human genome were removed by Bowtie2. The unmapped reads, which are considered non-human, were extracted from the BAM files using the Samtools command view -@ 20 -f 12 -F 256 on the NIH HPC Biowulf cluster, imported into CLC Genomics Workbench version 20.0.4 (Qiagen Bioinformatics, USA) as fastq files, and mapped to the EBV wildtype (WT) B95-8 reference genome (NC_007605.1) following the same approach described in Lei et al. (25), using default parameters of the Map Reads to Reference tool. To minimize the inclusion of low-quality results, we filtered out sequences with an average read depth <15 and <98% coverage of the EBV reference genome, resulting in 77 high-quality, full-length EBV genomes. These genomes were subjected to variant calling using the Fixed Ploidy Variant Detection tool of CLC Genomics Workbench. Overall, 431 of the 730 compiled EBV sequences were deemed high-quality for multiple sequence alignment (MSA) for phylogenetic analysis.
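As a brief illustration of the read-extraction step described above: in the SAM flag convention, -f 12 keeps only records in which both the "read unmapped" bit (4) and the "mate unmapped" bit (8) are set, and -F 256 discards secondary alignments. The Python sketch below mimics that bit test on integer flag values. The function name and the example flags are hypothetical and are not part of the pipeline used in the study.

```python
# Illustrative sketch of the bit test behind `samtools view -f 12 -F 256`.
# The names below are hypothetical helpers, not part of Samtools.

READ_UNMAPPED = 0x4    # SAM flag bit: the read itself is unmapped
MATE_UNMAPPED = 0x8    # SAM flag bit: the read's mate is unmapped
SECONDARY = 0x100      # SAM flag bit: secondary alignment


def passes_filter(flag: int,
                  require: int = READ_UNMAPPED | MATE_UNMAPPED,
                  exclude: int = SECONDARY) -> bool:
    """Return True if a SAM flag would be kept by -f `require` -F `exclude`."""
    has_all_required = (flag & require) == require
    has_no_excluded = (flag & exclude) == 0
    return has_all_required and has_no_excluded


if __name__ == "__main__":
    # 77 = paired (1) + read unmapped (4) + mate unmapped (8) + first in pair (64)
    # 73 = paired (1) + mate unmapped (8) + first in pair (64): the read is mapped
    for example_flag in (77, 141, 73, 77 | SECONDARY):
        print(example_flag, passes_filter(example_flag))
```

Under this convention, only read pairs in which neither mate aligned to the human reference are retained for mapping to the EBV genome.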

EBV Samples Compiled to Study Genomic Variations
We primarily focused on 219 EBV genomic sequences (Supplementary Table S3), including all good quality WGS from 130 BL and Africa (74 new EBV genomic sequences from the BLGSP) plus 89 representative sequences sampled by EBV phylogenetic clade among the non-African origin samples. Because the non-African EBV from certain regions, particularly Asia, were many, these sequences were sampled by clade from the aligned sequences in the fasta alignment files, with a cap of 35% placed on sequences belonging to large clades. EBV type was based on metadata, except for those samples where EBV type was recorded as "unknown", for which EBV type was assigned by aligning EBNA2 sequences. EBV type could not be determined for a small number of EBV genomic datasets with poor EBNA2 sequences, which remained undetermined in our analyses. The 219 complete genomes were aligned by MAFFT v7 (36), installed on the NIH high-performance computing (HPC) Biowulf cluster, and the MSA file (https://github.com/smbulaiteye/EBVBL_Africa_focus.git) was used to construct unrooted phylogenetic trees using the Neighbor-Joining (NJ) algorithm and the Jukes-Cantor method to measure the genetic distance of the aligned sequences. Although the NJ method may not be optimal for calculating the phylogenetic distance or accurately characterizing consequences of evolutionary diversification (37), this was not the focus of our paper because EBV evolution has been addressed in several recent excellent reports (24,38-41). We utilized the NJ method because it has reasonable performance and accuracy for studies of genotype clades and geographic and/or phenotype patterns (42). We conducted a phylogenetic analysis of LMP-1 sequences to determine the concordance between WGS and LMP-1 patterns. Additionally, we conducted a limited parallel phylogenetic analysis of EBNA-1 and EBNA-2 as a sensitivity analysis about the specificity of LMP-1 patterns.
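For readers unfamiliar with the Jukes-Cantor method mentioned above, the distance between two aligned sequences is d = -(3/4) ln(1 - 4p/3), where p is the proportion of compared sites that differ. The following Python sketch computes this quantity for a pair of aligned sequences. It is a minimal illustration: the function names are hypothetical, and skipping any column that contains a gap or an ambiguous base is an assumption made here, not a description of the software settings used in the study.

```python
import math

VALID_BASES = set("ACGT")


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites among columns where both bases are A/C/G/T."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differences = 0
    for base_a, base_b in zip(seq_a.upper(), seq_b.upper()):
        # Assumption: columns with gaps or ambiguous bases are ignored.
        if base_a in VALID_BASES and base_b in VALID_BASES:
            compared += 1
            if base_a != base_b:
                differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def jukes_cantor_distance(seq_a: str, seq_b: str) -> float:
    """Jukes-Cantor corrected distance: d = -(3/4) * ln(1 - 4p/3)."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        # The correction is undefined once sequences are saturated.
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    print(jukes_cantor_distance("ACGTACGT", "ACGTACGA"))
```

A matrix of such pairwise distances is the input that a Neighbor-Joining implementation uses to build the tree.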
MSAs for LMP-1, EBNA-1, and EBNA-2 (https://github.com/smbulaiteye/EBVBL_Africa_focus.git) were generated using the ClustalW algorithm in BioEdit (v7.0). The internal repetitive region of the EBNA-1 gene was excluded from the alignment. Phylogenetic analysis used the Neighbor-Joining algorithm and the Jukes-Cantor method. These gene-specific analyses allowed us to include more samples than was possible for the WGS analysis. These analyses were conducted using the largest number of sequences that qualified (with fewer than 10 ambiguous bases (N) in the aligned genes) and repeated for a smaller set of samples to allow better visualization of the phylogenetic patterns and to assess whether the patterns observed in the full set of sequences remained apparent in the smaller set, i.e., that selecting fewer samples does not obviously bias the patterns. Thus, for the full set, we analyzed 668 LMP-1 sequences (listed in Supplementary Table S4), 705 EBNA-1 sequences (listed in Supplementary Table S5), and 595 EBNA-2 sequences (listed in Supplementary Table S6). These analyses were repeated for a smaller set of samples (Supplementary Tables S7-S9 for LMP-1, EBNA-1, and EBNA-2, respectively). These subset samples were selected to include all samples with qualifying read depth and coverage (see above) from BL patients or from Africa and a set of non-BL, non-African samples selected by clade as described above.
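The qualification rule used for the gene-level alignments (fewer than 10 N in the aligned gene) can be sketched as a simple count of ambiguous bases. The Python example below is a hypothetical illustration: the function names are invented here, and the in-house calculation used in the study is not described in enough detail to reproduce it exactly.

```python
def count_ambiguous(sequence: str) -> int:
    """Count ambiguous 'N' bases in a nucleotide sequence (case-insensitive)."""
    return sum(1 for base in sequence.upper() if base == "N")


def passes_n_filter(sequence: str, max_n: int, min_length: int = 0) -> bool:
    """Keep a sequence if it has fewer than `max_n` N bases and is long enough.

    Alignment gap characters ('-') are removed before measuring length;
    this handling of gaps is an assumption made for the sketch.
    """
    ungapped = sequence.replace("-", "")
    return len(ungapped) >= min_length and count_ambiguous(ungapped) < max_n


if __name__ == "__main__":
    gene_like = "ACGT" * 5 + "N" * 9
    print(passes_n_filter(gene_like, max_n=10))        # 9 N, below the cap
    print(passes_n_filter(gene_like + "N", max_n=10))  # 10 N, at the cap
```

The same helper could be applied with different thresholds, for example a larger `max_n` and a `min_length` for whole-genome consensus sequences.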
The 2.1 kb LMP-1 promoter and coding region (Exon 1-3) was carefully curated for SNVs in 597 samples (185 from BL or from Africa, including 77 new sequences from the BLGSP, 40 controls from Africa, and 412 non-BL samples reported from elsewhere) to identify pattern forming LMP-1 variants. Synonymous variants in the LMP-1 coding regions and intronic variants were not used to classify variant patterns (Supplementary Table S10 shows the genotyping results in representative samples).

LMP-1 Patterns in Peripheral Blood of Cases and Controls
We performed targeted PCR and Sanger sequencing of LMP-1 in 414 eBL cases and 414 age- and geographically matched healthy controls enrolled in the EMBLEM study in Uganda, Tanzania, and Kenya during 2010-2016 (34). The cases and controls had comparable mean age (7.25 years in cases versus 7.73 years in healthy controls). PCR was done using the Lei-1 primer pair (Lei-F: GCCTCCGGCAGACCCCGCAAATC; Lei-R: GGGCAAAGGGTGTAATACTTAC), which targets a 435 base pair amplicon of the LMP-1 promoter and exon 1 hypervariable region (43). Approximately 100-300 ng of genomic DNA (gDNA) was used as an input template (25). PCR mixtures were prepared using 10 µL 2× DreamTaq Master Mix (Thermo Scientific, USA), 0.5 µL primers (10 mM each), and gDNA template. Nuclease-free water was added to the mixture to attain a final test volume of 20 µL. Thermocycling was carried out in an Eppendorf Mastercycler Pro S (Eppendorf North America, Hauppauge, NY, USA) using an initial denaturation at 94°C for 5 min, cycling at 94°C for 30 s, 55°C or 60°C for 30 s, and 72°C for 30 s for a total of 45 cycles, followed by a final extension at 72°C for 7 min. The PCR products were separated by electrophoresis using pre-made 2% agarose gel prestained with ethidium bromide in 1× Tris-Acetate EDTA (TAE) buffer. The amplicons were visualized under blue light at wavelength 460-520 nm (Amersham Imager 600, GE Healthcare, Marlborough, MA, USA). The result of the sample was classified as EBV PCR positive or negative. EBV PCR negative samples were not tested further. The amplicons matching the desired length were retrieved from the EBV PCR positive samples by eluting from the agarose gel using the QIAquick Gel Extraction kit (Qiagen, San Jose, CA, USA) and stored in nuclease-free water. The retrieved amplicons were subjected to bi-directional Sanger sequencing reactions implemented by Macrogen Inc. (Macrogen, Rockville, MD, USA).
The chromatograms generated from sequencing were exported into CLC Genomics Workbench Version 20.0.4 and BioEdit v7.0 with the Clustal W algorithm to visualize, assemble, and align the sequence file against the EBV WT reference genome (26).

Ethics Approval and Consent to Participate
The data from BLGSP were accessed with permission from dbGap (Approval #50629-8 and #52320-7 for project #12922) to investigate genomic variation of EBV in the BL. The EMBLEM study was performed with approval from ethics committees at Uganda Virus Research Institute (GC/127), Uganda National Council for Science and Technology (H816), Tanzania National Institute for Medical Research (NIMR/HQ/R.8c/Vol. IX/1023), Moi University/Moi Teaching and Referral Hospital (000536), and National Cancer Institute (10-C-N133). Written informed consent was obtained from the guardians of the participants and assent from participants aged 7 years or older.

Statistical Methods
We used phylogenetic trees to explore and describe EBV genomic variation. The association of EBV positivity, successful sequencing of LMP-1, and identified patterns with eBL case status was assessed using frequency tables and logistic regression to calculate odds ratios and 95% confidence intervals (ORs, 95% CIs). EBV infection in Africa occurs during infancy (44,45) and is lifelong (46). Thus, the reference category for the pattern analysis comprised EBV PCR positive patients regardless of sequencing result. PCR-positive but sequence-negative patients were considered infected, but with low-titer infection, presumably because it was virologically controlled, below sequencing sensitivity, and probably irrelevant for BL risk (44,45). The associations were adjusted for sex, age group, Plasmodium falciparum infection status, anemia [as an indicator of malaria burden (47)], and area of residence.
We selected 219 EBV genomic sequences (Supplementary Table S3) for whole genome-wide phylogenetic analysis (see Methods, above), of which 130 (59.3%) were from BL or from Africa. All the EBV genomic sequences from Africa included in this set were from BL patients (either primary biopsies, BL-derived cell lines, or normal samples from peripheral blood or buccal cells). We filtered out samples from healthy people because they lacked sufficient EBV read depth. EBV genomic samples of non-African origin were sampled manually by clade to provide context for the comparative analysis. Of 730 EBV genomic samples identified, 431 were high-quality with genome size >170 kbp and gap <2000 ambiguous nucleotides. Because we were specifically interested in exploring genomic patterns of samples from Africa and faced limited computational power for aligning whole genome sequences of a large dataset, we selected 219 EBV samples, including all qualifying samples from BL or Africa plus around 35% of sequences selected from all other qualifying samples (see Figure 1).
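As a rough illustration of the odds ratios described under Statistical Methods, the Python sketch below computes a crude (unadjusted) odds ratio with a Wald 95% confidence interval from a 2×2 table. It is not the adjusted logistic regression used in the study. The function name is hypothetical, and the example counts are approximations back-calculated here from the reported percentages (95.6% and 79.2% of 414), so the crude estimate should not be expected to match the reported adjusted OR.

```python
import math


def odds_ratio_with_ci(exposed_cases: int, unexposed_cases: int,
                       exposed_controls: int, unexposed_controls: int,
                       z: float = 1.96):
    """Crude odds ratio and Wald confidence interval from a 2x2 table."""
    cells = (exposed_cases, unexposed_cases, exposed_controls, unexposed_controls)
    if min(cells) <= 0:
        raise ValueError("all four cells must be positive for the Wald interval")
    a, b, c, d = cells
    odds_ratio = (a * d) / (b * c)
    # Standard error of ln(OR) (Woolf's method).
    standard_error = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * standard_error)
    upper = math.exp(math.log(odds_ratio) + z * standard_error)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    # Assumed counts: about 396 of 414 cases and 328 of 414 controls EBV positive.
    crude_or, ci_low, ci_high = odds_ratio_with_ci(396, 18, 328, 86)
    print(f"crude OR = {crude_or:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
```

Adjusting for sex, age group, malaria-related covariates, and area of residence, as done in the study, requires a multivariable logistic regression model and the individual-level data.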
Figure 3A shows the phylogenetic tree for the 219 genomic sequences with the layers of the circle (inside to outside) showing the geographic origin, EBV type, LMP-1 pattern, and phenotype of the sample from which the EBV sequence was obtained. The scale bar of phylogenetic distance (0.006) indicates high similarity of the sequences of EBV genomes. The tree shows, as has been reported in several previous reports (24,38,40,41,48), that the sequences of EBV from Africa are genetically separate from those in Asia. The phylogenetic tree shows four major genetic branches in EBV from Africa and two genetic branches in EBV from Asia, of which one branch splits into two sub-branches. When LMP-1 patterns were considered (see details below), the African EBV samples carried eight LMP-1 patterns. These patterns were found mostly, but not always, on different tree branches. EBV type 1 samples from Africa showed imperfect clusters that corresponded to AB, H, and I, while Patterns A and D were carried by samples belonging to different tree branches. EBV type 2 samples formed two sub-clusters, which carried Pattern A and Pattern J LMP-1 variants, respectively. Interestingly, three EBV type 1 samples carrying Pattern A variants clustered close to the EBV type 2 samples that also carried Pattern A LMP-1 variants. Of the EBV type 1 samples, Asian samples formed two separate tree branches, of which one branch split into two sub-branches. These Asian samples all carried Pattern B LMP-1 variants, regardless of the tree branch. Most EBV from South America intermixed with those from Africa and carried Pattern A LMP-1 variants, while those from Europe and North America also intermixed with samples from Africa, but appeared to carry a distinct LMP-1 H pattern (Figure 3A). However, a smaller set of EBV from South America, Europe, or North America intermixed with samples from Asia, carrying either Pattern A, AB, or H LMP-1 variants (Figure 3A).
These cluster patterns suggest that there are distinct genetic subgroups of EBV in the samples from Africa and Asia, consistent with the idea of EBV phylogroups proposed by Zanella et al. (41). The two sequences from BL samples in Asia intermixed with Asian samples, but on different tree branches of the Asian sequences, and they all carried Pattern B LMP-1 variants, suggesting that Pattern B is a prominent geographic marker of EBV from Asia (Figure 3A and Supplementary Table S3). Figure 3B shows the phylogenetic tree of EBV from nine paired tumor-normal samples from BLGSP patients who had sufficient EBV genome coverage in WGS for phylogenetic analysis in both samples to detect possible co-infection by multiple EBV strains. The EBV WGS sequence was identical in tumor and buccal cells in eight patients, but discordant in one patient (#251) who had type 1 EBV in the tumor and type 2 EBV in the buccal sample. This patient's EBV viral loads in tumor and buccal samples were high, with more than 1700x and 3800x genome coverage-depth of the EBV genome sequence reads in WGS of the BL tumor and buccal samples, respectively (Figure 3B and Table 5).

Characteristics of the Compiled EBV Genome Datasets and Phylogenetic Patterns
The findings based on WGS genomic sequences noted above were similarly observed in phylogenetic trees using only the LMP-1 genomic sequence, including a large set of 668 sequences and a subset of 360 genomic sequences (Figure 4A, Supplementary Table S4 and Figure 4B, Supplementary Table S5). The LMP-1 results confirm the WGS genomic sequence patterns in which EBV from Asia clustered on two main branches, of which one branch forms at least two sub-branches. These Asian samples were mostly homogenous in their LMP-1 pattern, which was Pattern B except for a small set of samples classified as Pattern G. Consistent with WGS results, the Asian EBV sequences from tumors appear to cluster separately from those from non-tumor samples (Figure 4A), suggesting that recent efforts to sample populations without malignancy in Asia are starting to pay dividends in terms of separating tumor versus non-tumor EBV in population data.
Our parallel phylogenetic analysis of EBNA1 (Supplementary Figures 1A, B) and EBNA2 (Supplementary Figures 2A, B) genes confirmed the general impression that EBV from African tumors/populations is separate from EBV in Asia and that the LMP-1 patterns identified are independent of sequence variations at those loci, and, by extension, EBV type.

LMP-1 Variants and Patterns in Representative EBV Samples From GenBank
We identified 281 SNVs (details in Supplementary Tables S10 and S11) in the LMP-1 hypervariable region of 597 sequences curated when compared to the WT B95-8 reference genome. These included 83 (30%) SNVs that formed 12 LMP-1 patterns (A to L, as classified in Supplementary Table S12 with representative examples in Supplementary Table S13) (25). In this study, we identified 28 new SNVs that form eight novel patterns that are consistently found in many samples. One of the new patterns is a hybrid of A and B SNVs. Although this pattern may have resulted from recombination, we did not identify a hard transition from A to AB because the pattern-forming SNVs are scattered over a long stretch of the LMP-1 sequence. The split between A and B SNVs was such that about 50% of pattern A SNVs were retained at the 5' end and 50% of pattern A SNVs at the 3' end were replaced by pattern B SNVs. Representative pattern AB samples are shown in Table S13 with a blue-gray shade (and also in Supplementary Table S10) for additional guidance. This hybrid pattern was observed in samples from Africa but not in those from Asia.
We also observed five new patterns (E, F, G, H, I, and J) based on being observed consistently in many samples (Supplementary Tables S10 and S12). Patterns K (A-70G and C-9T) and L (A+28T in the promoter region) are provisional because they are based on EMBLEM samples that were tested using only Lei-1 PCR primers. These primers target only variants in LMP-1 exon and therefore generate a sequence that is insufficient to categorize SNVs from the LMP-1 core promoter region (2 SNVs, see Table S12) to the LMP-2B exon 1 (6 SNVs, see Table S12), which would be needed to exclude the alternative patterns I and J. We also note that pattern E is defined by one nonsynonymous variation at amino acid position I63L (ATA>CTA), which was observed in many EMBLEM samples and considered valid. Pattern G was characterized by 4 variations in the promoter region and 3 non-synonymous variations found in exons 1, 2, and 3. Pattern H was characterized by variation at amino acid position G82A (GGC>GCC). It is possible that some patterns with SNVs in relatively adjacent positions (e.g., D, E, H, I or J and K) might belong to single clusters, which will become clearer as more samples are studied.
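The two codon changes named above can be checked against the standard genetic code, in which ATA encodes isoleucine (I), CTA leucine (L), GGC glycine (G), and GCC alanine (A). The Python sketch below labels a codon substitution as synonymous or nonsynonymous, which mirrors the rule that synonymous variants were not used to define patterns. The function name and the deliberately partial codon table are illustrative assumptions, and the GGC>GGT case is a made-up example of a synonymous change, not a variant reported in the study.

```python
# Partial codon table (standard genetic code), limited to codons used below.
CODON_TO_AMINO_ACID = {
    "ATA": "I",  # isoleucine
    "CTA": "L",  # leucine
    "GGC": "G",  # glycine
    "GGT": "G",  # glycine (used only for the hypothetical synonymous example)
    "GCC": "A",  # alanine
}


def describe_codon_change(ref_codon: str, alt_codon: str, position: int):
    """Return (amino-acid change label, 'synonymous' or 'nonsynonymous')."""
    ref_aa = CODON_TO_AMINO_ACID[ref_codon.upper()]
    alt_aa = CODON_TO_AMINO_ACID[alt_codon.upper()]
    kind = "synonymous" if ref_aa == alt_aa else "nonsynonymous"
    return f"{ref_aa}{position}{alt_aa}", kind


if __name__ == "__main__":
    print(describe_codon_change("ATA", "CTA", 63))  # change defining pattern E
    print(describe_codon_change("GGC", "GCC", 82))  # change characterizing pattern H
    print(describe_codon_change("GGC", "GGT", 82))  # hypothetical synonymous change
```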
Two LMP-1 patterns (B and G) were observed principally in samples from Asia, whereas the other ten patterns were observed principally in samples of non-Asian origin. Each pattern exhibited sub-patterns that will require further research to identify those that represent lineages versus artifacts (Supplementary Tables S11 and S13). We also noted many variations, some common and others rare, that did not contribute to a pattern or sub-pattern.
Among BL, 92% of samples belonged to one of four LMP-1 patterns; about half were either A or AB (33.3% and 15.7%, respectively) and the remainder were D and H (24.5% and 18.4%, respectively). We observed the four LMP-1 patterns to predominate in BL samples with WGS genomic sequences, i.e., convenience samples (Table 2), and in the nine samples from BLGSP patients, which are well-characterized from two different regions in Uganda (Table 3) (7). Among the BLGSP patients with paired tumor-normal (buccal or blood) samples, EBV loads were evidently higher in the buccal than in peripheral blood cells of these patients (Table 3). EBV sequence reads could be found with more than 100-fold genome coverage in 6 out of 9 buccal specimens from the BL patients in the BLGSP with paired tumor-buccal samples, but in none of the 14 peripheral blood samples from the BLGSP patients (Table 3). Only 2 blood samples had more than 5-fold EBV genome coverage in approximately 80 WGS of the blood-related samples (Supplementary Table S2, in which the average coverage of the 2 blood samples is highlighted in orange). We had paired tumor-peripheral blood results for 14 patients included in the BLGSP and the EMBLEM studies. Two subjects who had sufficient depth of EBV WGS genome coverage (>5-fold genome depth) had concordant EBV LMP-1 patterns between tumor samples in the BLGSP and blood samples in the EMBLEM. Of the 12 patients with insufficient EBV genome coverage in WGS of their blood samples in the EMBLEM, EBV LMP-1 patterns from Sanger sequencing were concordant with tumor in six patients, discordant in three, and not determined in blood in three patients (Table 3).
These results are difficult to interpret because of low Sanger sequencing quality in blood samples with apparently very low EBV titers. Table 4 shows the characteristics of the eBL cases and age- and geographically matched healthy controls in EMBLEM who were studied using PCR and Sanger sequencing. Table 5 shows the associations with EBV positivity, successful sequencing of LMP-1, and with the identified patterns. EBV positivity was associated with being an eBL case (95.6% of eBL cases versus 79.2% of controls, aOR = 3.83; 95% CI 2.06-7.14). Among the EBV positives, successful Sanger sequencing was associated with being an eBL case (66.7% of eBL cases versus 29.6% of healthy controls, aOR = 8.27; 95% CI 5.27-13.0) (Table 5). Among EBV positives, detection of four LMP-1 patterns (A, AB, D, and K) was associated with being an eBL case (

DISCUSSION
We present a detailed phylogenetic analysis of EBV genomic sequences of samples obtained from Africa (primarily from BL patients) analyzed together with non-African EBV clades from elsewhere (26). Our EBV genomic and LMP-1 findings confirm impressions from earlier studies (23,24,38,40,41,48,49) that EBV from Africa is genetically separate from EBV in Asia. Our results also confirm that there is extensive genetic diversity in LMP-1, as previously suspected (25-27). The results also suggest that only a fraction of the identified diversity in LMP-1 is necessary to group sequences into the 12 patterns (i.e., 83 of the 281 SNVs identified, ~30%). The LMP-1 patterns also showed consistent separation of Africa- versus Asia-origin samples. Our gene-specific analysis confirmed that the LMP-1 patterns were unique and not phylogenetically related to EBNA-1 or EBNA-2, or EBV type. LMP-1 analysis identified 9 patterns distributed across four WGS phylogenetic tree branches in the African-type samples. By comparison, we identified two LMP-1 patterns scattered across two WGS phylogenetic tree branches in the Asian EBV samples. The clear geographic patterns of LMP-1 patterns are interesting given the geographic eBL and NPC risk profiles in Africa versus China (33). LMP-1 patterns may have potential utility as biomarkers to study geographical variants of EBV at relatively low cost using LMP-1 PCR and Sanger sequencing in epidemiological studies such as EMBLEM (34,50), where large-scale use of WGS is not feasible.
We identified dual infection in WGS data (EBV type 1 and type 2) and in LMP-1 patterns using PCR and Sanger sequencing (D and A or AB). These findings are likely valid because they were observed in samples with high viral titers and are consistent with earlier reports of dual infection in some individuals (51). The observation of infections with multiple types or variants adds complexity to the interpretation of EBV genomic variation in epidemiological and clinical studies when assays yield conflicting results in patients. The results also raise the question of which body compartment (buccal or blood) should be targeted in epidemiological studies of disease patterns to identify valid or the strongest associations. Our finding that PCR-sequencing was more successful in buccal than in peripheral blood samples (Table 3) suggests that buccal samples may be preferable for the study of non-malignant samples, although further research is needed to clarify performance issues.
Our results suggest that 92% of BL patients carry one of four LMP-1 patterns (A, AB, D, and F), and 50% of them carry either A or AB. Because these patterns are rare or not observed at all in Asia, they fit the hypothesis that these EBV variants may be both geographic and tumor markers. However, although these results are conclusive about the geographic association, they are inconclusive about the phenotypic association because the EBV data from Africa are mostly from BL patients, with little representation of healthy populations from Africa. For example, only 40 of 668 LMP-1 sequences analyzed in this study were from healthy people in Africa versus 130 from BL patients. BL develops in about 0.005% of EBV-infected people, suggesting that current EBV data are not representative of EBV in the general population without BL (12) and that the geographic and phenotypic associations with LMP-1 are confounded. The finding that most BL carried one of four LMP-1 patterns, which are virtually absent in Asia (25,26), suggests that investigating the distribution of these markers in healthy people in Africa is a promising area of research.
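The sampling imbalance described above can be made concrete with a few lines of arithmetic. The sketch below uses only the counts quoted in this paragraph (40 healthy and 130 BL sequences out of 668, and a BL risk of about 0.005% among EBV-infected people). The variable and function names are illustrative and are not part of the study's analysis code, and the comparison assumes that the two quoted groups can be set directly against each other.

```python
def share(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole

# Counts quoted in the text above.
TOTAL_LMP1 = 668      # LMP-1 sequences analyzed in the study
AFRICA_HEALTHY = 40   # sequences from healthy people in Africa
BL_PATIENTS = 130     # sequences from BL patients
BL_RISK_PCT = 0.005   # approximate % of EBV-infected people who develop BL

healthy_pct = share(AFRICA_HEALTHY, TOTAL_LMP1)
bl_pct = share(BL_PATIENTS, TOTAL_LMP1)

# Share of BL among the two groups being compared (assumption: comparable).
bl_among_pair = share(BL_PATIENTS, BL_PATIENTS + AFRICA_HEALTHY)

# How much more common BL is in the sequence data than in the population.
over_representation = bl_among_pair / BL_RISK_PCT

print(f"Healthy (Africa): {healthy_pct:.1f}% of all LMP-1 sequences")
print(f"BL patients:      {bl_pct:.1f}% of all LMP-1 sequences")
print(f"BL share of the two groups: {bl_among_pair:.1f}%")
print(f"Over-representation of BL:  about {over_representation:,.0f}-fold")
```

Under these assumptions, BL accounts for roughly three quarters of the sequences with known disease status, against a population risk several orders of magnitude lower, which is why the geographic and phenotypic signals cannot be separated with the current data.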
Our finding that different EBV variants are found in different geographic regions is similar to the pattern reported for other viral carcinogens, such as the human papillomavirus (HPV) (52,53) and hepatitis B virus (HBV) (54). Multiple carcinogenic variants are known for HPV, and different types are found in different geographic areas (HPV genotypes 35 and 45 predominate in Africa, whereas HPV genotypes 52 and 58 predominate in Asia) (55). Similarly, multiple genotypes exist for HBV, with genotypes A and E predominating in Africa and genotypes B and C predominating in Asia (54). These geographical patterns are important for public health, biology, and diagnosis. They reflect underlying immunological pressure that drives diversification through host-pathogen adaptations in populations living in geographically separate areas, with some of the best characterized examples being HIV (56), HCV (57), multiple bacterial pathogens (58), and plasmodia (59). Because LMP-1 is a target of the host immune response (60,61), it is possible that the diversification of LMP-1 patterns reported here is driven by immunological pressure. We noted that some LMP-1 patterns were found in different branches of the WGS phylogenetic tree, including Patterns A and D, which were found in both type 1 and type 2 EBV, while others were not found on other branches or types. We speculate that this intermixture of LMP-1 patterns in different tree branches may arise because LMP-1 patterns represent an early gene sequence that preceded the modern sequences observed in EBV type 1 or 2. These patterns could have been evolutionarily preserved because of an essential biological function.
EBV is a suitable target for the discovery of biomarkers for diagnosis (62) and for the study of the etiology of BL (25,26). The LMP-1 region has been an attractive locus to characterize EBV's biological, genetic, and epidemiological properties (63,64). LMP-1 has been linked to biological changes that influence transmission, transformation, and the tumor microenvironment (65). Phylogenetic studies have revealed distinct phylogroups of EBV (23,24,38,40,41,48,49), and principal component analysis of WGS data has identified SNVs that are correlated with ancestry (48). However, access to HTS is still limited, especially for large epidemiological studies conducted in Africa, where eBL is a public health problem. This compelled us to investigate whether the LMP-1 patterns described by Lei et al. might be sufficient to characterize geographic or phenotypic patterns of EBV. We developed Lei-1, Lei-2, and Lei-3 primers for a simple and cost-effective PCR and Sanger sequencing assay (25,26). Our results using only Lei-1 primers in EMBLEM confirm that the assay can be used to type LMP-1 patterns, but the results with one primer set are insufficient to resolve some patterns that may have SNVs in regions not covered by Lei-1 primers. We also identified differential completion rates in eBL cases versus healthy controls as a limitation in samples with low EBV viral titers. EBV establishes lifelong and low-grade infection (1-50 infected B cells per million) that is maintained in most healthy individuals (66), so all our subjects were infected, but those with low viral titers cannot be typed, making it difficult to distinguish between clearance, persistence, and poorly controlled infection.
Our study is subject to several limitations, despite its use of large current EBV genome datasets. First, EBV HTS data are skewed to cancer patients (tumor or normal samples), with gross underrepresentation of healthy people. This bias in EBV sampling was observed for BL and healthy people from Africa, but we also noted it to be significant in samples from Europe and North America, and astonishingly extreme for certain regions of Asia, such as India. Because HTS datasets are likely to play an important role in the discovery and fine mapping of carcinogenic EBV variants, this issue requires urgent attention through collaboration between scientists with access to populations and those with access to HTS technology and computational resources. Second, the neighbor-joining (NJ) methods used to infer phylogeny may be less accurate than other methods such as Maximum Likelihood (ML) methods (24). We used the NJ methods because they are reasonable for initial quick exploration of data and hypothesis generation, they yield robust results across a range of small to large datasets, and they suffer only a small decline in accuracy across that range (42). Third, our results are not complemented by mechanistic explanations of the functional implications of the LMP-1 patterns for EBV biology, virus-host interaction, transmission, or cell transformation. The epidemiological scope of our studies precluded mechanistic studies, but we hope that the findings will inspire those studies.
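For readers unfamiliar with the NJ approach mentioned above, the sketch below shows a single agglomeration step of the standard neighbor-joining algorithm: it computes the Q criterion from a pairwise distance matrix, selects the pair with the smallest Q, and derives the two branch lengths to the new internal node. This is a generic textbook illustration and not the software or settings used in this study; the function name `nj_first_join` and the toy distance matrix are hypothetical.

```python
from itertools import combinations

def nj_first_join(dist):
    """Return ((i, j), (len_i, len_j)) for the first neighbor-joining step.

    dist is a symmetric square matrix (list of lists) of pairwise distances
    with zeros on the diagonal and at least three taxa.
    """
    n = len(dist)
    if n < 3:
        raise ValueError("neighbor joining needs at least three taxa")
    row_sums = [sum(row) for row in dist]

    # Q(i, j) = (n - 2) * d(i, j) - sum_k d(i, k) - sum_k d(j, k)
    def q(i, j):
        return (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]

    # The pair with the smallest Q value is joined first.
    i, j = min(combinations(range(n), 2), key=lambda p: q(*p))

    # Branch lengths from each joined taxon to the new internal node.
    len_i = 0.5 * dist[i][j] + (row_sums[i] - row_sums[j]) / (2 * (n - 2))
    len_j = dist[i][j] - len_i
    return (i, j), (len_i, len_j)

# Toy distance matrix for five hypothetical taxa (not study data).
toy = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]

pair, lengths = nj_first_join(toy)
print("joined taxa:", pair, "branch lengths:", lengths)
```

A full tree is obtained by replacing the joined pair with the new node, with distance 0.5 * (d(i, k) + d(j, k) - d(i, j)) to every remaining taxon k, and repeating the step until only two nodes remain. Because each step is a closed-form calculation rather than a search over tree space, NJ is fast on large alignments, which is the trade-off against ML accuracy noted above.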
A strength of our study is that we used a larger set of samples from Africa to study LMP-1 patterns as potential biomarkers of EBV genetic diversity. The results support further optimization of the LMP-1 PCR-Sanger sequencing assay for use as a relatively low-cost assay to investigate the geographic and phenotypic associations with EBV-related disease. Further research is needed to improve the success rate of this assay in normal samples with low viral loads.
To conclude, the phylogenetic analysis of EBV focusing on samples from Africa or BL confirms that EBV from Africa is genetically separate from EBV in Asia. We show that LMP-1 patterns cluster separately for African versus Asian samples, with European, North American, and South American samples clustering mostly, but not exclusively, with EBV from Africa. Four EBV LMP-1 patterns accounted for most EBV genotypes in BL patients, but these results may still reflect geographic patterns of EBV because EBV samples from Africa were mostly from BL patients, with few samples from the general population. Our findings suggest that LMP-1 variants are promising markers for identifying and classifying EBV genetic variants in quantitative and qualitative research to identify EBV variants associated with EBV-related cancer, including eBL.

DATA AVAILABILITY STATEMENT
The datasets for this study can be found in the links provided in Supplementary Tables S1 and S2. The EMBLEM data and code used in the current analysis will be made available upon request from the corresponding author. The MSA for the 219 samples included in the WGS phylogenetic analysis, as well as the samples used for the LMP-1, EBNA-1, and EBNA-2 phylogenetic analyses, can be accessed at the following link: https://github.com/smbulaiteye/EBVBL_Africa_focus.git.

ETHICS STATEMENT
The EMBLEM study was performed with approval from ethics committees at Uganda Virus Research Institute (GC/127), Uganda National Council for Science and Technology (H816), Tanzania National Institute for Medical Research (NIMR/HQ/R.8c/Vol. IX/1023), Moi University/Moi Teaching and Referral Hospital (000536), and National Cancer Institute (10-C-N133). Written informed consent to participate in this study was provided by the participants' legal guardian/next of kin.