Genomic Surveillance of COVID-19 Variants With Language Models and Machine Learning

Global efforts to control COVID-19 are threatened by the rapid emergence of novel SARS-CoV-2 variants that may display undesirable characteristics such as immune escape, increased transmissibility, or increased pathogenicity. Early prediction of the emergence of new strains with these features is critical for pandemic preparedness. We present Strainflow, a supervised and causally predictive model that uses unsupervised latent space features of SARS-CoV-2 genome sequences. Strainflow was trained and validated on 0.9 million sequences collected between December 2019 and June 2021, and the frozen model was prospectively validated from July 2021 to December 2021. Strainflow captured the rise in cases two months ahead of the Delta and Omicron surges in most countries, including predicting a surge in India as early as the beginning of November 2021. Entropy analysis of the Strainflow unsupervised embeddings reveals explore-exploit cycles in genomic feature space, thus adding interpretability to the deep-learning-based model. We also conducted a codon-level analysis of our model to assess the interpretability and biological validity of our unsupervised features. The Strainflow application is openly available as an interactive web application for prospective genomic surveillance of COVID-19 across the globe.

A similar analysis was done with daily values of entropy and new cases. Entropy was calculated in rolling windows, and cross-correlation analysis was performed between entropy and new cases at different lead periods. Although the cross-correlation values were statistically significant, they were low, ranging between -0.1 and 0.1. Therefore, we decided to use the monthly entropy values for the modeling exercise.
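The lead-period cross-correlation described above can be illustrated with a minimal sketch. It assumes the daily entropy series has already been computed and uses synthetic data in which cases are constructed to trail entropy by 14 days. The function name `lead_cross_correlation`, the 30-day maximum lead, and the use of Pearson correlation are illustrative assumptions, not details taken from the Strainflow code.

```python
import numpy as np
import pandas as pd


def lead_cross_correlation(entropy: pd.Series, cases: pd.Series, max_lead: int) -> pd.Series:
    """Pearson correlation between entropy on day t and cases on day t + lead."""
    correlations = {}
    for lead in range(max_lead + 1):
        # shift(-lead) moves cases from day t + lead back to day t, so each
        # entropy value is paired with the case count `lead` days later.
        correlations[lead] = entropy.corr(cases.shift(-lead))
    return pd.Series(correlations, name="correlation")


# Synthetic data only: cases are built to trail the entropy series by 14 days.
rng = np.random.default_rng(0)
days = pd.date_range("2021-01-01", periods=200, freq="D")
entropy = pd.Series(rng.normal(size=200), index=days)
cases = entropy.shift(14) + rng.normal(scale=0.1, size=200)

xcorr = lead_cross_correlation(entropy, cases, max_lead=30)
best_lead = int(xcorr.idxmax())  # lead (in days) with the strongest correlation
print(best_lead, round(xcorr.loc[best_lead], 3))
```

On real daily data the peak would be far weaker than in this synthetic case, which is consistent with the low correlations reported above.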

Prediction of new cases with 'blips'
We also experimented with "blips" as a feature to predict new cases. Blips are sudden changes in the values of the latent dimensions. These changes may be caused by a mutation, which changes the words (codons) in a given genome sequence. This hypothesis was validated in simulation experiments on synthetic datasets. Each dimension of the spike gene embeddings for a country was analyzed for the presence of temporal anomalies. Countries having a minimum of 20 samples in any given month were selected, and the same number of records (the minimum number of samples in any month for that country) was sampled without replacement from each month. These records were used to define control limits of ±1 standard deviation from the mean value for each dimension, and all values in the full dataset outside those limits were categorized as "blip" points. Blip counts in each month were normalized by calculating the number of blips per sample collected in that month for each dimension (normalized blips). The embedding dimensions were then compared in terms of the total normalized blips for each country to identify the significant dimensions and the (dis)similarity in trends among different countries. Cumulative counts of normalized blips were analyzed to understand the temporal accumulation of blips in each dimension. Like entropy, blips were found to have a leading relationship with cases. However, regression modelling with sample entropy gave better results.
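The blip procedure above can be sketched for a single country as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `normalized_blips`, the column names, and the synthetic embeddings are hypothetical, and the 20-sample threshold is read here as applying to every month.

```python
import numpy as np
import pandas as pd


def normalized_blips(df, dim_cols, month_col="month", n_sigma=1.0, min_samples=20, seed=0):
    """Blips per collected sample, for each month (rows) and latent dimension (columns)."""
    samples_per_month = df.groupby(month_col).size()
    n_min = int(samples_per_month.min())
    if n_min < min_samples:  # assumed reading of the 20-sample rule
        raise ValueError("every month needs at least min_samples records")

    # Draw the same number of records from each month, without replacement.
    balanced = df.groupby(month_col).sample(n=n_min, random_state=seed)

    # Control limits: mean +/- n_sigma standard deviations per dimension.
    mean, std = balanced[dim_cols].mean(), balanced[dim_cols].std()
    lower, upper = mean - n_sigma * std, mean + n_sigma * std

    # Flag every value in the full dataset that falls outside the limits.
    is_blip = (df[dim_cols] < lower) | (df[dim_cols] > upper)
    blips_per_month = is_blip.groupby(df[month_col]).sum()
    return blips_per_month.div(samples_per_month, axis=0)


# Synthetic embeddings: three months, three latent dimensions.
rng = np.random.default_rng(1)
dims = ["dim_0", "dim_1", "dim_2"]
data = pd.DataFrame(rng.normal(size=(120, 3)), columns=dims)
data["month"] = ["2021-01"] * 30 + ["2021-02"] * 40 + ["2021-03"] * 50
# Simulate a sudden change in one dimension during the last month.
data.loc[data["month"] == "2021-03", "dim_0"] += 5.0

rates = normalized_blips(data, dims)
cumulative = rates.cumsum()  # temporal accumulation of normalized blips
print(rates.round(2))
```

Because the control limits are estimated from the balanced sample but applied to the full dataset, months with many more sequences do not dominate the limits, while normalization by sample count keeps the monthly blip rates comparable.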

Strainflow Dashboard
Implementation: The Strainflow dashboard web application is built primarily with ReactJS and accompanying libraries for the user interface, and GraphJS for plotting. Python libraries such as numpy, pandas, matplotlib and seaborn were used to pre-process and analyze the dataset. The Random Forest regression model was implemented using the R library randomForest. The web application is available at http://strainflow.tavlab.iiitd.edu.in and works on all modern browsers.
Functionalities: The application has three tabs: Cases Plots, Entropy Plots, and Paper. The Cases Plots tab shows two graphs: the first compares the actual number of cases with our predicted cases at a two-month lead time, and the second shows entropy against the caseload for a given country. The Entropy Plots tab displays the sum of sample entropy across all the latent dimensions for a pair of countries. The toggle above the graph can be used to change which countries are compared. Lastly, the Paper tab presents a graphical abstract of our paper.
Discussion: COVID-19 has had a devastating impact on our health systems. With caseload predictions made two months in advance, we provide a data-driven handle on epidemiological surveillance that can warn of potential upcoming case surges, so that people can prepare and policymakers can take appropriate preemptive steps to limit the spread of COVID-19.