Unraveling the genetic evolution of SARS-CoV-2 Recombinants using mutational dynamics across the different lineages

Introduction Recombination serves as a common strategy employed by RNA viruses for their genetic evolution. Extensive genomic surveillance during the COVID-19 pandemic has reported SARS-CoV-2 Recombinant strains indicating recombination events during the viral evolution. This study introspects the phenomenon of genome recombination by tracing the footprint of prominent lineages of SARS-CoV-2 at different time points in the context of on-going evolution and emergence of Recombinants. Method Whole genome sequencing was carried out for 2,516 SARS-CoV-2 (discovery cohort) and 1,126 (validation cohort) using nasopharyngeal samples collected between the time period of March 2020 to August 2022, as part of the genomic surveillance program. The sequences were classified according to the different lineages of SARS-CoV-2 prevailing in India at respective time points. Results Mutational diversity and abundance evaluation across the 12 lineages identified 58 Recombinant sequences as harboring the least number of mutations (n = 111), with 14 low-frequency unique mutations with major chunk of mutations coming from the BA.2. The spontaneously/dynamically increasing and decreasing trends of mutations highlight the loss of mutations in the Recombinants that were associated with the SARS-CoV-2 replication efficiency, infectivity, and disease severity, rendering them functionally with low infectivity and pathogenicity. Linkage disequilibrium (LD) analysis revealed that mutations comprising the LD blocks of BA.1, BA.2, and Recombinants were found as minor alleles or as low-frequency alleles in the LD blocks from the previous SARS-CoV-2 variant samples, especially Pre-VOC. Moreover, a dissipation in the size of LD blocks as well as LD decay along with a high negative regression coefficient (R squared) value was demonstrated in the Omicron and BA.1 and BA.2 lineages, which corroborated with the breakpoint analysis. Conclusion Together, the findings help to understand the evolution and emergence of Recombinants after the Omicron lineages, for sustenance and adaptability, to maintain the epidemic spread of SARS-CoV-2 in the host population already high in immunity levels.


Introduction
SARS-CoV-2, a causative agent of COVID-19 disease, emerged in the late 2019 in Wuhan, China (1,2) and spread worldwide due to its highly contagious form leading to a range of symptoms from asymptomatic to mild-to-severe respiratory illness in the humans (3).As of 14 September 2023, there have been 770,437,327 confirmed SARS-CoV-2 infection cases and 6,956,900 deaths reported globally (4).Extensive genomic surveillance and research on SARS-CoV-2 has provided valuable genomic data, revealing evolutionary events and the emergence of unique variants (5).The interplay between evolution, ecology, and epidemiology was a defining feature of RNA viruses such as coronaviruses (5).Understanding the virus's evolution can be crucial in predicting future paths, public health preparedness, and developing prevention/treatment strategies (6).Viral evolution is a complex phenomenon, involving a balance between successful replication and transmission (5).Virus mutation rate and replication errors contribute to genetic diversity, providing the basis for selection (5).Recombination, observed in SARS-CoV-2 and other related viruses, combined mutations from different strains which facilitates adaptability (6,7).During the first 8 months of the SARS-CoV-2 emergence, visible evolution was limited due to factors such as a small global viral population, non-pharmaceutical interventions, limited dissemination, and under-sampling of the virus (5).Initially, it was expected that SARS-CoV-2 would evolve slowly due to the proofreading ability of its polymerase enzyme (8).Hence, earlier studies relying on linkage disequilibrium identified limited instances of viral recombination and evolution (9)(10)(11)(12)(13)(14).This has facilitated the emergence of divergent SARS-CoV-2 lineages worldwide, including Alpha, Beta, and Gamma variants of concern (VOCs).These lineages exhibited rapid evolutionary rates and acquired numerous additional genomic mutations.Speculative evidence suggested recombination between Alpha and Delta variants in a limited number of SARS-CoV-2 infections in Japan (15).Emergence of the Omicron variant, characterized by dozens of spike gene mutations, marked a new phase of the COVID-19 pandemic (16).Initially, Omicron sub-lineages BA.1, BA.2, and BA.3 emerged independently in October 2021 (16), but later, BA.2 diversified into BA.2.12.1, BA.2.75, and BA.5, which were phylogenetically distinct from the BA.2 sub-lineages and gained high global frequency (17).Subsequently, BA.4 and BA.5 were identified between December 2021 and January 2022, respectively (18).BA.1 and BA.5 emerged with common mutations in the spike region (N501Y, E484K, ΔH69/V70) despite their different origins from South Africa and Botswana respectively, indicating convergent evolution (13,19,20).Thus, the increased genetic diversity of SARS-CoV-2 allowed multiple lineages to co-circulate and aided in the detection of Recombinants (5,21).Several Recombinant variants have since been reported (22).The XA lineage, a Recombinant variant, was first identified in the United Kingdom (23).Preliminary genomic characterization of the emergent SARS-CoV-2 lineage in the United Kingdom was defined by a novel set of spike mutations (24).In North America, lineage XB (B.1.631/B.1.634)was widespread and additional Recombinant lineages (XD, XF, and XE) which combined different variants were discovered (25).Recombination events likely contributed to the emergence of BA.3, BA.4, and BA.5 lineages too (26).
The future evolutionary trajectory of SARS-CoV-2 remains uncertain, with questions about whether diverse lineages will continue to emerge or if the virus will undergo the transition to a slower adaptation process?Understanding the evolution of Recombinant lineages through the genetic variability patterns was observed since pre-VOC times is crucial for interpreting the outcome of such transitions to sustenance and adaptability or enhanced transmissibility or virulence for reconstructing the epidemic spread.
In this study, we analyzed 2,516 in-house SARS-CoV-2 sequences as a discovery cohort, since Pre-VOC times, to delineate the mutation profiles associated with different lineages and understand the evolutionary trajectory of SARS-CoV-2 in India, leading to the emergence of the Recombinants.Linkage disequilibrium analysis was performed among all the SARS-CoV-2 lineages to evaluate their impact on the fitness and adaptability associated with the Recombinants.Moreover, a causal relation between LD decay and breakpoint analysis revealed the propensity of Omicron lineages and Recombinants to undergo evolution for enhanced sustenance in the host population.We parallelly validated this framework of analysis in a separate data set of 1,126 SARS-CoV-2 genomes as the validation cohort from the different Indian states to capture the differential traits of the evolution of the SARS-CoV-2 virus.

SARS-CoV-2 whole genome sequencing
The SARS CoV-2 whole genome sequencing was performed using Oxford Nanopore and Illumina Sequencing platforms.Library preparations have been carried out using standard protocols and guidelines of the Illumina COVIDSeq (Cat.No.20043675 and reference guide: 1000000126053 v04) and Oxford Nanopore Rapid barcoding kit (SQK-RBK110.96).The sequencing methodology and analysis pipeline of both platforms have been previously published (27).

Sample segregation
Sequenced data for COVID-19-positive patients were taken from the SARS-CoV-2 genomic surveillance program between the time period of March 2020 and August 2022.A total of 3,642 genome sequences were considered for analysis which had at least 90% SARS-CoV-2 genome coverage and 100X sequencing depth.The FASTA sequences obtained from the genomic analysis were uploaded in Nextclade to obtain lineage-wise classification of the SARS-CoV-2 genome sequences (28).

Mutational analysis and categorization
VCFs obtained from genomic analysis of lineages were merged using bcftools (29).In order to capture low-frequency mutations, all mutations present in more than one sample in a lineage were considered for mutational analysis.Furthermore, the relative frequencies of mutations in different lineages were estimated by taking into account the sample numbers and presented as a percentage.Seven groups were categorized: Pre-VOC (B.1); Alpha (B1.

Linkage disequilibrium (LD) analysis
Variants identified in seven different lineages with minor allele frequencies (MAF) ≥0.05% were used to estimate the LD.To perform haplotype block analysis and detect the recombination event, we used the solid spine of LD option from Haploview (30), which selects the criteria of the first and last mutation to be in strong LD and allows intermediate mutations within to have optimal r 2 values (31).The selected blocks were intact, where the average r 2 of all mutations in the block was ≥0.5.We found exceptions in block 1 of pre-VOC, Delta and blocks 1 and 2 in Beta.The plink files for haploview were created using plink commands (32) "plink-out" for changing VCF to the .pedformat, and "plink-recode" command was used to get .infoand .pedfiles.

Detection of LD decay
To estimate the potential gradual breakdown of non-random association of the mutations, LD decay within each lineage from pre-VOC to Recombinant was analyzed.LD decay curves are graphical representations of LD, illustrating how it changes as a function of the physical distance between each pair of SNPs (12).LD decay was accomplished by computing the difference between genetic distances as LD2 (Position) minus LD1 (Position) and its corresponding LD (r 2 ).Preceding decay, the data points were smoothened by a locally estimated scatterplot smoothing (LOESS) algorithm.This non-parametric regression technique fitted a smooth curve to the sorted distances and their corresponding LD values, effectively capturing underlying patterns in the data.Furthermore, a horizontal threshold line was introduced to the plot at an LD value of 0.1.This threshold served as a visual marker, aiding in the identification of the point at which LD is deemed to have significantly decayed.The data points displayed in the graph offered a tangible representation of LD values at varying genetic distances.To quantify the connection between genetic distance and LD decay, the coefficient and R-squared values were calculated using the statsmodel (33) library in Python version 3.11.4.The coefficient states whether the distance and LD (r2) were positively or negatively correlated and R-square determined the coefficient regarding how much the independent variable (distance) could be explained by the dependent variable (LD), with a larger R-squared value denoting better regression Figure 1 illustrates the comprehensive methodology utilized for carrying out this study.

Identification of recombination breakpoints using PhiPack
Potential recombination breakpoints (regions where recombination events might have occurred) were detected using the Pairwise Homoplasy Index (PHI).FASTA files were generated for the variant data of all the lineages from the Pre-VOC to Recombinants, and multiple sequence alignment was performed separately for all the lineages.Furthermore, we employed the "Profile" feature of the PhiPack tool to detect the recombination breakpoints and the refined incompatibility matrix (34,35).We configured the profile analysis with default parameters for robust statistical outcomes.The window size was ascertained based on the fraction of the total samples and implemented with 1,000 permutation tests.Within the framework of PhiPack, the null hypothesis centered on the absence of recombination events, whereas the alternate hypothesis acknowledged the plausible occurrence of recombination events (36).The resultant p-values below the established threshold of 0.05 were considered as recombination incidences (37).

Mutational landscape across different lineages of SARS-COV-2 from pre-VOC to Recombinants
The inhouse SARS-CoV-2 genome surveillance program enabled us to gather 2,516 sequences from the time period of March 2020 to August 2022.These sequences were obtained using a coverage exceeding 90% of SARS-CoV-2 genome with a sequencing depth of minimum 100X.They were further categorized according to the lineage, leading to the identification of 12 lineages.).We identified 58 sequences as Recombinants wherein XAP lineage had the largest number of genomes (n = 44), followed by XT (n = 6).Two sequences each belonged to the Recombinants-XM, XAE, XQ, and XU, respectively.Sequencing metadata and sample information are tabulated in Supplementary Table S1.Collating the mutation data, we identified 3,056 unique mutations.Figure 2A represents the mutational landscape of all the 12 lineages across the SARS CoV-2 genomes from the pre-VOC to Recombinants.The mutational region in the SARS-CoV-2 genome has been observed to be in the 3' end region, downstream of ORF1ab (essentially involved in virus replication and reproduction) across all the lineages since the Pre-VOC times, demonstrating the importance of the ORF1ab region toward providing stability and functional edge to the virus toward its sustenance in the human host population.Moreover, an overall mutational surge in the 3′ region was observed from the Delta-plus (AY*) lineage onwards although a burst in mutations occurred in the spike region of B.1.1.529(Omicron), following a similar trend for all the other lineages of Omicron.
Indeed, this mutation crowd started decreasing in the Recombinants.Furthermore, mutation-type analysis across the SARS-CoV-2 genome/gene regions deliberated a heightened presence of missense and synonymous mutations, followed by deletions.The remaining part consisted of start/stop gain and loss (leading to truncated ORFs), insertions as well as mutations in the untranslated regions.Missense dominates the different SARS-CoV-2 gene regions except for the M region, which is taken over by synonymous mutations.The ORF8 gene region had a good proportion of all the types of mutations, which coincides with its proposed functionality toward within-individual fitness only [e.g., viral replication and immune evasion), but neutral or even disadvantageous for transmission in the population (38)].Deletions were higher in the ORF7a, wherein these mutations could alter viral-host interactions and immunomodulatory features of the SARS-CoV-2 virus.Segregation of mutation types across the lineages identified the highest percentage of missense mutations in B.1.1.529,whereas synonymous mutations were more in BA.2 and deletions were higher in B.1.617.1 (Kappa) (Supplementary Figure S1).Furthermore, we looked into gene-wise mutation abundance (overall prevalence of mutations in specific gene regions which is "mutation per sequence") across the different lineages as demonstrated in Figure 2C.For synonymous mutations, relative abundance was only observed for the ORF6 region and that too in the Omicron lineages, whereas the non-synonymous gene regions differentially occurred among the lineages.Mutational abundance dominated ORF8 and nucleocapsid in Alpha, 5' UTR in Delta.Omicron lineages highlighted the Spike, Envelope, and ORF6 gene regions.Interestingly, the Recombinants showed relative mutational abundance of the synonymous mutations similar to BA.1 and B.1.1.529,whereas the non-synonymous mutations were diluted across the genome with moderate abundance in the Spike.This led us to delve into mutations captured across the Recombinants.Interestingly, across all the 12 lineages, Recombinants harbored the least number of mutations (n = 111) across the 58 Recombinant sequences.Looking at the frequency of these mutations across all the 12 lineages, Figure 2D highlighted four SNVs, C241T (5'UTR), C3037T (ORF1ab: F924F), C14408T (ORF1ab: P4715L), and A23403G (S: D614G), with strong allelic associations.This was  S2).Surprisingly, these mutations from the Delta (B.1.617.2) were present in low frequency in the Recombinants.

SARS-CoV-2 Recombinant emergence ascertained through mutational dynamics across the distinct lineages during the course of evolution
As the mutational landscape underlying the viral genome architecture played a significant role in viral evolution, we looked into the mutational pattern across the lineages from the Pre-VOC to the Recombinants.Primarily, all the lineages were grouped according  3A).The increasing vs. decreasing trend was identified by taking their frequencies at the Recombinant time points into account when compared to the rest of the lineages.The mutational abundance in each of the study groups was illustrated through a spider plot (Figure 3B).We observed the highest number of mutations (n = 47) in the spontaneously increasing category, wherein frequencies were seen to be high in the Recombinant but low in the Pre-VOC as well as other variants (Figure 3C).These mutations acquired during the Pre-VOC time became majorly dominant in the Omicron and were further passed on to the  or decrease afterward, resulting in complete de-selection by the virus during the Recombinant emergence (Figure 3D).Notably, the mutations S: N440K (T22882G) and S: S982A (T24506G) in the Alpha variant were reported to be significantly associated with enhanced viral entry and reduced antibody neutralizations, which were absent in the Recombinants (42).Strikingly, we observed only six mutations as spontaneously decreasing (Figure 3E), which majorly arose in the Omicron and were retained in the Recombinants.
Looking at the dynamic mutations, the vanishing category harbored the highest mutations (n = 31), which showed complete absence in the Recombinants with low/medium frequency presence in the Pre-VOC and varying frequency trends in the other lineages (Figure 3F).Similarly, dynamically decreasing mutations demonstrated low frequency in the Recombinants with high/ medium/low presence in the previously arisen variants (Figure 3G).Notably, dynamically vanishing and decreasing mutations were principally present in the Beta and Delta and were eliminated by the virus till it evolved into Omicron with a similar trend in the Recombinant.Interestingly, such high-frequency mutations of Delta such as S: L452R (T22917G), S: P681R (C23604G), and N: R230M (G28881T) are reported to functionally elevate the SARS-CoV-2 infection and replication (43)(44)(45)(46), which were absent in the Omicron and the Recombinant.Other mutations belonging to this category were mutations of the N protein, N: D63G (A28461G) and N: D377Y (G29402T), which were also reported to be significantly associated with ICU admissions (47).Contrarily, the dynamically increasing (Figure 3H) category captured mutations that were essentially present in high frequency in the Alpha/Pre-VOC and subsequently retained in the Omicron/Recombinant with negligible presence in the Delta and Beta.Overall, it is important to note that the mutations associated with the Delta severity demonstrated a reduction in frequency in the Recombinant as well as an overlapping trend with the Omicron highlighting the emergence of the Recombinants through the Omicron lineages.

Omicron lineages paved the way for the evolution of Recombinants by dismantling the LD blocks
Multiple mutational acquisition/depletion trends observed among the different lineages of SARS-CoV-2 furthered our quest to understand the role of linkage disequilibrium (LD).LD analysis has been used to infer evolutionary features and reveal the trends of existing LD blocks in the SARS-CoV-2 genomes along with their dominant mutations since the Pre-VOC times.We analyzed LD blocks for all the lineages stated in the previous result, plus two lineages of Omicron (BA.1 and BA.2), since most of the Recombinant lineages arose from these two lineages of Omicron.Pre-VOC carried five LD blocks, wherein blocks 1, 3, and 4 spanned >1 kb region.Block 1 was constituted by the ORF1ab region, whereas block 3 spanned across the Spike-ORF3a and block 4 across the E-M-ORF6 gene region.The rest of the blocks (blocks 2 and 5) were smaller and spanned the Spike and N gene regions (Figure 4A).Similarly, Alpha demonstrated three small LD blocks in the ORF8 and N (Figure 4B) regions.Interestingly, the Beta variant demonstrated three blocks in LD, wherein block 2 with 9.4 kb length spanned the ORF1ab-S region (Figure 4C), and Delta demonstrated one single LD block, which spanned a larger region of approximately 11.7 kb in the ORF1ab region (Figure 4D).
As we moved toward the Omicron lineages, blocks of strong LD were observed to dissipate leading to the occurrence of smaller blocks, similar to the Pre-VOC in the S, ORF6, and N gene regions.Reducing patterns of LD blocks in the BA.1 and BA.2, leading to the generation of multiple lineages within Omicron suggest a higher propensity for evolution by recombination (Figures 5A,B).Henceforth, the occurrence of Recombinants after the Omicron lineages could be professed/perceived.It is evident that after Omicron lineages, the average r 2 across the mutations in the block was >0.5 (Supplementary Table S3), indicative of smaller blocks of strong LD being retained.Examining the variants in the LD blocks suggests a clear variation in its frequency between the lineages.Notably, prominent haplotype blocks in the Spike were seen only in the Pre-VOC, BA.1, and BA.2 and were found to be retained in the Recombinant.The mutations were present in low and medium frequency in the Pre-VOC (Figure 4A) which subsequently became high-frequency mutations in the BA.1 and BA.2 (Figure 5B) followed by Recombinants with medium and high frequency (Figure 5C), respectively.Importantly, mutations of the SpikeQ498R/N501Y/ Y505H presenting strong LD block in the Pre-VOC (low frequency) were selected in BA.1 (medium and high frequency), along with two other mutations in the Spike region (Q493R/G496S) that reportedly enhanced the immune evasion of BA.1.Subsequently, the same block of Pre-VOC was again selected in the BA.2 (high frequency) but was not retained as the LD block in the Recombinant.Notably, these three reported mutations were associated with increasing viral replication and fitness.However, in the Recombinant, LD block upstream to the positions mentioned above were observed with the mutations, Spike:S477N/T478K/E484A retained from BA.1.Interestingly, S: T478K mutation originated from the Delta (high frequency), whereas S: S477N/E484A was subsequently selected during BA.1 and BA.2 propagations.These mutations were also functionally reported to increase the viral binding with ACE2, increased infectivity, and transmission.It is significant to note that though there was a shift in the LD block to an upstream position, notably all these mutations had a medium and high-frequency representation in the Recombinant.This implies the viral selection toward these positions by either recombination or LD.Another LD block in the ORF6 region, in the Pre-VOC carried D61 low-frequency mutation, which re-emerged in BA.2 with high frequency.D61 protein is instrumental for enhanced viral replication, whereas ORF6: D61V mutation is functionally deleterious with the potential to compromise protein function, viral replication, and indirectly influence the viral immune escape mechanism (48).Subsequently, this LD block was retained in Recombinants but at a very low frequency due to its disruptive function.Thus, we see that many mutations of BA.1, BA.2, and Recombinants were found as minor alleles or as low-frequency alleles in the LD blocks from the previous variant samples, especially the Pre-VOC.
We further looked into the trends of the mutations in the LD blocks (Supplementary Figure S2).Pre-VOC mutations majorly showed an increasing trend as these mutations became a part of the LD blocks retained in the Recombinants.Mutations such as ORF6: D61 in the LD block showed a decreasing trend as they became low-frequency mutations in the Recombinants.Beta and Delta LD blocks consisted of vanishing mutations since none of the LD blocks of the Recombinant coincided with the Delta or Beta.With BA.1 and BA.2, all categories of mutations, increasing, decreasing, and vanishing trends, were noted.The decreasing trend was associated with those LD blocks that are deselected in the Recombinants, such as S: S371/373/375 and S: Q964/N969/L981, and were functionally found to impair or reduce viral infectivity (49).However, another LD block N: R203M/G204R within the Pre-VOC showed opposite trends of N: R203M decreasing while N: G204R increasing in the Recombinants although both mutations were known to increase viral replication (50,51).

LD decay and recombination breakpoints align the course of genetic shuffling from the Omicron sub-lineages
Linkage disequilibrium decay described how certain mutations that were initially found together in a viral genome can become separated in the course of evolution.This separation is influenced by multiple factors such as the frequency of recombination events during the viral replication, mutation, migration, or admixture, genetic drift as well as the selection pressures acting on the virus.The decline of LD pattern is an indirect method for determining the recombination events wherein higher recombination rate leads to the decay in LD and vice versa (52).After analyzing the LD blocks across the lineages, where majorly, lineages harbored small LD blocks except Delta and Beta, we investigated the decay signal of LD.We plotted LD changes as a function of genetic distance and observed that Alpha, BA.1, BA.2, and Recombinant showed a clear pattern of LD Decay as depicted in Figures 6B,E-G.To delve deeper, we conducted a detailed linear regression analysis between LD (r 2 ) and distance (between genetic markers).R-square gave a numerical measure of how well the linear regression model captured the change in the dependent variable (LD values) based on the predictor variable (genetic distance).The 0.001 to 0.096 range demonstrated the model's ability to accurately capture the interplay between the genetic distance and LD values across the different variants.It is interesting to note that the Omicron lineages of BA.1 and BA.2 along with Recombinants stood out with notable negative regression coefficients of −5.834000e-06, −4.410000e-06, −2.174000e-06, respectively, with R-squared value between 0.024 to 0.096 (Table 1), indicative of the factor that increased genetic distance between the mutations which led to decreased LD values (r 2 ).Moreover, except for Pre-VOC, other lineages including Alpha, Beta, and Delta also showed negative regression coefficients with comparatively lower R-squared value falling between 0.001 to 0.004 (Table 1).Pre-VOC demonstrated a positive correlation coefficient between distance and the LD values along with no evidence of decay after 7,500 genetic distance (Figure 6A).
Decay in BA.1, BA.2, and Recombinants with low regression coefficient directly correlated with the reduced LD block patterns (Figures 5A,B) which suggested its propensity to undergo recombination during evolution.To further explore the recombination  8,600-8,700) of ORF1ab (Table 2).For Pre-VOC and Alpha, breakpoints were identified until ORF1ab and the beginning of the Spike regions, whereas for Beta and Delta, breakpoints were toward the 3'UTR region.This suggests that Alpha and BA.2 preserves 3'UTR and Delta retains 5'UTR during the recombination events (54).
Additionally, the variants present in the breakpoint regions were invariably in weak LD with the mutations present in the haploblocks surrounding the breakpoint areas.It was also noted that highfrequency mutations in the breakpoint regions increased by 38% from BA.1.

Evaluation in the validation cohort highlights the diversity of SARS-CoV-2 transmission and evolution across different lineages
Accumulation of mutations within the viral genome that altered virulence, infectivity, and transmissibility was the primary driving force of viral evolution.Although mutations unique to different populations worldwide are reported, yet a heterogeneous distribution of co-existing mutations in distinct geographic regions within a particular population, such as India, is also reported.Comparative analysis of mutational trends of the SARS-CoV-2 genomes from geographic regions will help us in understanding differential selection pressure acting on the virus (55).Therefore, we carried out a comparative analysis of our "discovery" dataset with a "validation" dataset acquired through SARS-CoV- In this cohort, 13 samples were identified as Recombinants, 10 XN with one each of XAP, XAS, and XQ (Figure 7A; Supplementary Table S4).In the validation dataset, we identified a total of 1,663 mutations over the nine lineages, with an increase of mutational abundance from B.1.617.2 to BA.2.75, mainly from the ORF1ab region.BA.1 and BA.5 showed distorted patterns with higher abundance of mutations (Supplementary Figure S3A; Supplementary Table S5).Mutation trends of Delta and BA.2 carried a similar pattern, which was consistent with the discovery cohort findings.Analysis revealed higher presence of missense mutations followed by synonymous and deletion mutations (Supplementary Figure S3B).M gene exhibited a higher number of synonymous mutations, and ORF8 had the highest proportion of all mutation types (Supplementary Figure S3B), which aligned with the discovery dataset.Deletions were more prevalent in the 3'UTR compared to ORF7a of the discovery dataset.Furthermore, gene-wise mutational abundance showed a similar pattern; synonymous mutations were abundant in the ORF6 followed by ORF3a (Supplementary Figure S3C).For non-synonymous mutations, 5'UTR demonstrated more mutations in the Delta, whereas abundance in the Spike, E and ORF6 genes was observed for the Omicron lineages.
Interestingly, Recombinant had a low abundance of synonymous mutations, and in the non-synonymous, Spike, N and ORF6 demonstrated mutations with the trend followed in the Omicron lineages (Supplementary Figure S3C).Delving deeper into the 73 Recombinant mutations, 19 originated from Delta, three each from Pre-VOC and Beta, four from BA.2, five from BA.2.75, and the rest 39 mutations were from the BA.1 (Supplementary Figure S3D).Subsequently, the comparison of mutational dynamics across discovery and validation datasets strikingly demonstrated a distinct mutational percentage distribution among the five mutational trends (spontaneously increasing, spontaneously decreasing, spontaneously vanishing, dynamically increasing, and dynamically decreasing), except in the dynamically vanishing group (Figure 7B; Supplementary Figure S4).Spontaneously increasing mutations (31.5%) which showed the highest presence in discovery were replaced by the spontaneously vanishing mutations (40%) in the validation dataset (Supplementary Figure S4).However, most of the individual mutations were found overlapping between the discovery and validation datasets across the different mutational trends.Except, dynamically decreasing mutations of validation were identified as overlapping with the spontaneously/dynamically increasing in the discovery set, reflecting a completely opposite selection pressure on the mutations over the course of time in the distinct human host populations.Moreover, contrary to our discovery dataset, LD analysis in the validation cohort demonstrated an increase in the number and size of the LD blocks while evolving toward the Omicron (Figure 7C; Supplementary Figure S5).Contrary to the observation that the LD blocks dissipate down the lineages in the discovery data, the validation cohort LD analysis demonstrated an increase in the LD blocks after Beta till BA.2.Unlike the discovery, Pre-VOC in the validation cohort had a completely distinct pattern of haplotype blocks which did not get retained in the lineages after that (Supplementary Figure S5A).
The validation cohort portrayed a predominant presence of large LD blocks in the ORF1ab gene regions in the Pre-VOC and Beta; however, Delta onwards, blocks were reduced in size and were confined to the 5′ region of ORF1ab (~9-11 kb) (Supplementary Table S6).Interestingly, we observed a similar trend in the discovery cohort as well, where the LD blocks were seen from the Pre-VOC to Delta, completely vanished in BA.1 and BA.2, but arose in the Recombinant, confined to the 5′ region of ORF1ab.This highlighted the importance of ORF1ab in the SARS-CoV-2 recombination, particularly the 3'region of the gene.Additionally, strong LD blocks in the spike region were found in all the lineages except BA.1 and Recombinant in the validation cohort, dissimilar from the discovery data.It is important to note that LD blocks were significant and spanned comparatively larger gene regions in the BA.2 of the validation cohort, whereas it was completely contradictory in our discovery data with a small number and size of LD blocks in the BA.2.Furthermore, the N gene LD blocks with the mutations-N: R203M/203R/G204R were seen to be inherited till the Recombinants in discovery as well as the validation cohort, which showed that a higher degree of conservation was maintained in the SARS-CoV-2 N gene.Intriguingly, the validation cohort displayed a higher presence and retention of 3′ region LD blocks of SARS-CoV-2 genes, including ORF3a, M and ORF7a till BA.1, which was completely absent in the discovery cohort.Overall, the LD blocks increased in size and number in the validation cohort compared to the discovery, except in the 3′ region of the ORF1ab gene.The complete LD decay signal was very random from Pre-VOC to Recombinant (Supplementary Figure S6); however, regression coefficient showed a positive correlation for Pre-VOC similar to the discovery data, while a higher negative correlation in Beta (−0.0000197) followed by Recombinant (−0.0000125) (Supplementary Table S7).This was validated in the breakpoint data wherein the longest breakpoint was observed in the Pre-VOC followed by the Beta and Recombinant (Supplementary Table S7).Overall, the decay signal of BA.1 and BA.2 is in alignment with the discovery data with conservation of the 3'UTR region in BA.2.

Discussion
SARS-CoV-2, similar to all RNA viruses, undergoes mutation and recombination as part of its evolutionary process.Recurrent mutations and recombination are crucial forces that let the virus adapt/evolve within its host microenvironment.Similarly, diverse recombination frequencies were seen in other distinct families of viruses.For instance, HIV is noted for the most frequent recombination events (56) followed by the positive-single strand RNA virus families such as coronaviridae, bromoviridae, and potyviridae (57,58).Moreover, negative-single strand RNA viruses such as Influenza A are reported to have less frequent recombination events attributing to their segmented genomes (59,60).It is also important to note that the time of infection, either persistent or acute, contributes to the recombination frequency as a host can be co-infected with different strains of the same virus during the course of time.In case of SARS-CoV-2, the initial emergence of Recombinants such as XA (inter-VOCs) and XE (hybrid of BA.1 and BA.2), WHO reported XBB as variant of interest (VOI) in January 2023, and recently, EG.5 with Initial Risk Evaluation in August 2023.Recombination events reported in the SARS-CoV-2 are both intervariant and within Omicron sub-lineages (61).Although the recombination lineages obtained in our data arose from BA.1 and BA.2, the mutational dynamics of Recombinants were not the same as their parent clades.This could possibly be resultant of the general mechanism of RNA viruses, whereby they undergo selective pressure for elimination of detrimental mutations from parental clades and keeping only those that are advantageous (62) for their sustainability within the host.Additionally, an increase and decrease in the frequency of mutations could also occur due to chance events (63), which has been reported wherein limited number of viruses established a new population during transmission as might have happened during recombination.The dilution of the mutations in Recombinants can plausibly explain their inability to propagate or transmit and infect at faster rates as observed for the previous lineages of Delta and Omicron.Among the retained mutations, ten mutations that originated from the Delta were responsible for enhanced fusogenicity, infectivity, and were majorly involved in increasing pathogenicity (64, 65), yet their presence with differential frequencies in the Recombinants might have ameliorated their desired effects.The mutations such as S (G142D) and S (T95I) played significant driver role in the evolutionary process and affected the viral load during Delta propagation (66,67).Delving deeper into the adaptive and deleterious mutations during SARS-CoV-2 evolution, represented by the six categories of mutations as in Figure 3A, we found that spontaneously increasing mutations were most prominent with a higher percentage (31.5%)whereas dynamically increasing displayed low presence (3.4%).Notably, these progressively increasing mutations were long accumulating mutations since majority of them were present in low frequency in Pre-VOC and substantially resurfaced in the Omicron and Recombinants.Mutations, such as ORF1ab: T3255I, are reported to be responsible for symptomatic disease outcome and changed interactome toward avoidance of antiviral response, improving viral spread and infectivity as seen in the Omicron (68), while ORF3a: T223I, ORF1a: P3395H and ORF1ab: I5967V were retained till BA.2 and BA.3, but the functional significance of such mutations are still unknown (69, 70).
The dynamically increasing mutations (Figure 3H) although low in percentage, yet, important as these mutations have undergone selection and de-selection by the virus across the lineages at different time points and tend to increase during evolution.These diverse trends of mutations played a major role in the observed LD forming haploblocks, where we found longer blocks in the Pre-VOC and blocks that tended to shrink in the Omicron lineages.Supposedly, sudden bursts of mutations might have decreased the LD strength and created regions of weak LD which would have been beneficial for giving rise to dataset (Supplementary Tables S1, S5).Furthermore, the relevance of finding the emergence of Recombinant variants subsequent to the Omicron lineages as revealed through LD, LD decay, and recombination breakpoints provides a valuable insight into the evolutionary dynamics, driven by mutations or recombination events.This contributed directly comprehending the origin and evolution of novel SARS-CoV-2 variants, shedding light on the epidemiological aspects and infectivity.Additionally, the geographic diversity of the validation data [Rohtak (Haryana), Lucknow (Uttar Pradesh) and Jorhat (Assam)] strengthened the LD blocks and reduced the presence of recombination breakpoints, which was also demonstrated by another study where different Association of Southeast Asian Nations (ASEAN) populations displayed little to no recombination (72), confirming different selection pressure for survival and evolution in Omicron lineages.These findings have the potential to play a crucial role in understanding the virus's spread, with implications for epidemic or pandemic scenarios.Further studies are needed to evaluate the linkage disequilibrium and the impact on fitness factors for Omicron lineages.

Conclusion
This study highlights mutational diversity and abundance evaluation across seven broadly categorized lineages of SARS-CoV-2 since the Pre-VOC times, identifying the trends of mutations harbored by the Recombinants and their functional relevance.Mutational trend analysis through different lineages revealed that spontaneously increasing mutations retained in Recombinants were majorly associated with decreased SARS-CoV-2 infectivity, replication, and disease severity.The dynamically vanishing and decreasing mutations were from the Beta and Delta (reportedly elevated SARS-CoV-2 infection), which were eliminated by the virus during evolution into Omicron and Recombinants.Small LD blocks in BA.1 and BA.2 might have led to generation of multiple lineages within Omicron as well as emergence of the Recombinants.LD decay patterns were observed from BA.1 and were corroborated with breakpoint analysis.These findings provided insights into the mechanisms of evolution of SARS-CoV-2 through recombination across several lineages of different time points.

FIGURE 1
FIGURE 1Comprehensive study methodology in three broad categories: Unveiling SARS CoV-2 genome sequencing, exploration of mutational diversity and detecting recombination events.

FIGURE 2
FIGURE 2 Trends of SARS CoV2 mutational dynamics with respect to the Recombinant variants.(A) The plot represents the mutational landscape of the 12 lineages from the Pre-VOC (B.1) to the Recombinant across the SARS-CoV-2 genomes.(B) Types of mutations harboring different SARS-CoV-2 gene regions.(C) Relative abundance of the synonymous and non-synonymous mutations across different SARS-CoV-2 gene regions.(D) Heatmap shows the tendency of mutational evolution from the Pre-VOC to the Recombinant.

FIGURE 3
FIGURE 3 Evolutionary trends of mutations from the Pre-VOC to the Recombinant.(A) Graphical representation of the low, medium, and high-frequency mutations, depicting increasing, decreasing, and vanishing trends in the two categories of spontaneous and dynamic occurrence.(B) Spider plot illustrating the overall presentation of mutations falling in each of the six study category types.(C-E) Matrix plot of spontaneously increasing, vanishing, and decreasing mutations from the Pre-VOC to the Recombinant.(F-H) Matrix plot of dynamically vanishing, decreasing, and increasing mutations.

FIGURE 4 Block
FIGURE 4Block-wise information for the Pre-Omicron lineages analogous with the amino acid mutations in the lollipop.LD blocks of (A) Pre-VOC, (B) Alpha, (C) Beta, and (D) Delta.Color of the lollipop determines frequency of the mutations in their respective lineages.

FIGURE 5 Block
FIGURE 5Block-wise information for the Omicron lineage analogous with the amino acid mutations in the lollipop plot.LD block of (A) BA.1, (B) BA.2, and (C) Recombinant.Color of the lollipop determines the frequency of the mutations in their respective lineages.

FIGURE 6 LD
FIGURE 6 LD decay patterns from the Pre-VOC to the Recombinant.LD decay plot with the fitted original LD.(A) Pre-VOC, (B) Alpha (C) Beta, (D) Delta, (E) BA.1, (F) BA.2, and (G) Recombinant.The y-axis denotes LD (r 2 ) between pairwise combinations of the variant sites, with x-axis denoting distance between these sites.(H) Heatmap showing recombination breakpoints from the Pre-VOC to the Recombinant, red signifies breakpoint position.

TABLE 1
Linear regression for distance (L1 & L2) and LD (r 2 ).Bold values signify substantially higher R 2 values than the rest of the values.

TABLE 2
Breakpoint information from the SARS-CoV-2 Pre-VOC to Recombinant.