Sub genomic analysis of SARS-CoV-2 using short read amplicon-based sequencing

The novel coronavirus disease 2019 (COVID-19) pandemic poses a serious public health risk. In this report, we present a modified sequencing workflow using short tiling (280bp) amplicons library preparation method paired with Illumina’s iSeq100 desktop sequencer. We demonstrated the utility of our workflow in identifying gapped reads that capture characteristics of subgenomic RNA junctions within our patient cohort. These analytical and library preparation approaches allow a versatile, small footprint and decentralized deployment that can facilitate comprehensive genetics characterizations during outbreaks. Based on the sequencing data, Taqman assays were designed to accurately capture the quantity of subgenomic ORF5 and ORF7a RNA from patient samples and demonstrated utility in tracking subgenomic titres in patient samples when combined with a standard COVID-19 qRT-PCR assay.

1 Introduction SARS-CoV-2 patients present with a range of severity and it is unclear how infectious asymptomatic cases are although they are believed to contribute to the spread of the virus. The proportion of asymptomatic patients have been estimated to range from 40%-45% and tend to be younger and do not have preexisting conditions (Oran and Topol, 2020). Conversely, there are many patients who remain PCR positive for weeks despite clinically recovering from the infection. This has posed an important question from a public health management perspective: When can we discharge or remove from isolation patients who are clinically well but remain PCR positive? Increasingly, the consensus is that PCR tests are picking up viral debris shed after an infection and that a time-based discharge criterion may be a more appropriate measure. WHO currently recommends that patients who are clinically well can be discharged 10 days after their symptoms began as long as they have been symptom-free for three consecutive days. Most countries now release patients from isolation between 7 and 14 days from onset of symptoms (He et al., 2020;MIN COW, 2020). In Singapore, COVID-19 patients who are clinically well are discharged at day 21 of their illness without the need for repeat PCR testing, though immunocompromised patients still require two negative PCR tests before discharge (Ministry of Health, 2022).
Coronaviruses express 3' proximal ORFs through subgenomic RNAs (sgRNAs) that are generated by a discontinuous transcription mechanism that is unique to RNA viruses (Sola et al., 2015). A study from Germany showed that infectious virus was isolated from throat or lung but not from stool samples and this correlated with evidence of active replication in throat samples by measuring the levels of transcribed subgenomic mRNA which are only produced by infected cells (Wölfel et al., 2020). The present study examines the expression of subgenomic RNAs in 53 patients. In addition, we demonstrated the ability of our workflow to capture temporal dynamics of amplicon coverage and subgenomic RNA junctional reads during an infection time course over 15 days.
2 Materials and methods 2.1 Patient samples 53 patients admitted to the National University Health System, from Feb to May 2020, with SARS-CoV-2 confirmed by RT-PCR (MiRXES, Singapore [Feb-Mar]; Cobas SARS-CoV-2 Roche, Switzerland [from Apr onwards]) from nasopharyngeal swabs were included in this study. Ethics approval was obtained from the National Healthcare Group Institutional Review Board.

Pooled PCR and library preparation
1 µl of the cDNA was then used for each pooled PCR in the following 10 µl Q5 polymerase reaction mix (NEB) (1 µl 10x Q5 buffer, 0.8 µl 10 mM dNTPs, 0.5 µl 10 µM pooled primers (Pool 1 or 2), 0.1 µl Q5 polymerase) using cycling protocols of 98°C 15s, 63°C 3 min s, 38 cycles). The two pools were combined after PCR and purified using 18 µl AmpureXP beads (Beckman-Coulter) as per manufacturer's protocol and eluted in 20 µl water. 10ng of the amplified product was prepared for library with ligation and barcoded using the NEXTFLEX Rapid DNA Seq 2.0 kit (Perkins-Elmer) as per manufacturer's instructions and eluted in 25 µl of water. Samples were pooled at 4-6 samples per run and diluted to 50pM before loading into the iSeq100 cartridge (150 nt read, paired end; Illumina) as per manufacturer's instructions.

Read processing, alignment and quality control
The critical step in the analysis is the alignment of partial transcript reads spanning sub genomic junctions to the reference genome. To achieve this, we developed a gapped alignment workflow that combines GSNAP (Wu and Watanabe, 2005) and V-ASAP (Maurier et al., 2019) to identify and quantify reads spanning these sub-genomic junctions. GSNAP was deployed over other available tools such as TOPHAT because it allows for an unbiased approach by not requiring an existing annotation of the subgenomic transcripts. Following alignment with GSNAP, gapped reads were filtered from the resulting SAM files based on CIGAR (Compact Idiosyncratic Gapped Alignment Report) strings using an AWK script. The remaining nongapped reads are processed using V-ASAP, which merges the overlapping paired end reads and quality controls the merged amplicon reads to specifically those containing our intended forward and reverse primer sequences. Merged amplicon reads that passed quality control is realigned to the genome using BWA (Li and Durbin, 2009), this is followed by variant calling using SAMTOOLS and BCFTOOLS .
The filtered gapped read SAM entries from GSNAP alignment, were further narrowed down to those spanning the known canonical subgenomic regions using RSAMTOOLS in R (Bioconductor, 2022). The quantity of these reads is then normalized by sequencing depth for comparison across the sample cohort and time points. Visualization of the downstream characterization are plotted using GGPLOT2 (Wickham, 2009). Reads Counts for subgenomic reads are represented in terms of reads per million of total sequenced reads to allow for easy reference to proportion of reads and also to account for the impact of sequencing depth.

Taqman assay
As there were subgenomic amplicons detected at low viral titres at Day 13 and 15 that reflected non-specific amplification with SyBr Green qPCR using ARTIC primers, we wanted to compare how onestep Taqman qRT-PCR assay fared compared to RT followed by SyBr Green qPCR. Taqman PCR were designed with one primer in the TRS-L and the other primer and the Taqman probe located within the subgenomic fragment about 20,000 nuclceotides away. Taqman probes resulted in better concordance between subgenomic RNA and genomic RNA and can be is empirically more sensitive compared to the two-step qRT-PCR as seen from the ability to detect the subgenomic RNA on P07 Day 12 (Supplementary Figure S3).
Real time probe qPCR measurements were performed Applied Biosystems QuantStudio system. Primers were designed to span 7bmrna (seq_143) and E-mRNA (seq_133) respectively (Table 2)

Real-time PCR analysis
Quantification of the relative abundance of sub-genomic transcripts to genomic transcripts were performed by obtaining delta Ct change defined by the difference between the geometric mean of the Taqman assay targeting sub-genomic transcripts described above and the fortitude assay. Wilcoxon ranked sum test with continuity correction is then applied to calculate the significance of the delta CT changes across time.

Amplicon-based sequencing identifies sub-genomic transcript-derived junction reads
In order to probe genomic changes in SARS-CoV-2, we designed 156 primers pairs using the ARTIC protocol (https://artic.network) to generate genome tiling amplicons approximately 300 nt in length to be compatible with 2 × 150bp sequencing chemistry. This protocol allows generation of consensus COVID-19 genome and dentification of single nucleotide variants ( Figure 1A). We sequenced samples from 53 SARS-CoV-2 positive patients who were hospitalized at the National University Health System between February to May 2020. We observed that a proportion of sequencing reads does not contain the intended paired primer sequences. These reads span large genomic distances, indicative of gapped reads originating from the splice junction of the subgenomic transcripts (Kim et al., 2020). Alignment of the gap-spanning junction reads to canonical subgenomic transcripts ( Figure 1B) identified all nine previously reported canonical subgenomic transcripts (Wölfel et al., 2020), indicating that our amplicon design generated reads composed of leader-body junctions. The genome distribution of junction reads shows a sharp peak at the 5' end corresponding to the leader sequence and high coverage in the 3' end for the nested transcript bodies ( Figure 1B). These canonical gapped reads made up 0.001%-1% of the total amplicon reads. Estimating the transcript abundance using junction reads across all 53 samples showed varying levels of total subgenomic transcripts across all samples and transcription distribution within each sample (Figures 2A,B). Notably, we see in some samples, a high expression of the envelope encoding gene (M)  Frontiers in Genetics frontiersin.org distinct from that seen in transcripts isolated from cultured virus (Kim et al., 2020).

Tracking abundance of subgenomic transcripts across infection course
Next, we investigated the effect of viral titre on genome and subgenomic read recovery. As the ORF1ab region targeted by Fortitude 2.1 is not included in any canonical subgenomic RNA, it is actually a good proxy for genomic RNA. We found that the degree of genome read coverage was lower at higher Ct ( Figure 3A). This observation is consistent with the hypothesis that high Ct samples are dominated by viral fragments rather than intact viruses (Hu et al., 2020). Comparing genome coverage against the day of illness demonstrated that as the viral load decreased over the course of the infection, the genome coverage generated by our protocol correspondingly decreases ( Figure 3B). This suggests an explanation why subgenomic reads were not recovered in samples that were of higher CT values. To explore this more quantitatively, we performed qPCR with Taqman assays (Fortitude 2.1 and custom-designed subgenomic assays targeting ORF5 and ORF7a subgenomic RNA) that specifically target junctional subgenomic transcripts and non-junctional genomic RNA across the 53 patient samples ( Figure 3C). We confirmed a general trend where the CT values of both junctional and non-junctional targets increased over the course of the illness. We further observed a small but significant increase in junctional targets CT compared to non-junctional targets in the latter stages of the infection ( Figure 3D), suggesting that subgenomic transcripts may possibly serve as an early measure of infection resolution although the variability observed in subgenomic across samples grouped by day of illness was high.

Subgenomic junctional reads captures temporal variation
This variability in subgenomic titres across samples may reflect differences in sample quality, processing or patient-to-patient variation in treatment protocol or intra-host variation or differences in disease trajectory. To better evaluate the utility of subgenomic quantitation in tracking disease progression, we Frontiers in Genetics frontiersin.org sampled a single patient (P07) daily from day 6 of illness to day 15 ( Figure 4A). At each time point, the collected samples were sequenced. Total and subgenomic junctional reads were quantified. The patient was administered lopinavir/ritonavir daily from day 10. We observed that coverage across the genome dropped significantly by more than 50% on day 11 ( Figure 4B), the day after antiviral treatment was administered. Notably, subgenomic junction reads across all canonical transcripts also reduced significantly on day 9 and was no longer detectable by day 11 ( Figure 4C). Individual subgenomic transcripts showed a similar trend. To verify these, we used amplicon primers that target genomic RNA (dotted line; targeted to N gene region) and junction spanning qPCR ( Figure 5A) to quantify genomic and subgenomic RNA targets respectively. For convenience of plotting, undetermined Ct values by qPCR were assigned the value of 42. The amount of subgenomic RNA detected by qPCR tracked the viral genome signal, but fell drastically at day 9 or 10 to undetectable levels on day 11 in agreement with the sequencing data. The amplicon at the completion of the qPCR was then analyzed by agarose gel electrophoresis to verify that amplicons were of the expected sizes ( Figure 5B). Importantly, a synthetic SARS-CoV-2 RNA control (Twist) was used as a negative PCR control to demonstrate that the PCR primers were specific for gapspanning templates as the Twist control template is composed only of synthetic SARS-CoV-2 genome provided as 5 kb fragments. We repeated the analysis with another different longitudinal patient sample (P08) and showed that subgenomic RNA levels tracked genomic RNA similarly, with an earlier decline in the progression of disease (Supplementary Figure S2).

Discussion
In this article, we present a method for workflow and Taqman probes for the detection of subgenomic RNA which may have diagnostic value in the COVID-19 pandemic, potentially for determining the infectivity of patients. However, further studies will need to be done to validate this hypothesis. Other than Wilcoxon rank sum test with continuity correction was performed on samples binned by early (day 1 and 2) and latter stages (day 3-11) of infection.

Frontiers in Genetics frontiersin.org
ORF1ab, which spans roughly two-thirds of the coronavirus genome and codes for replicase-transcriptase proteins, the other polypeptides including the spike and envelope proteins are dependent on the production of negative strand subgenomic RNA that has the transcription-regulating sequence (TRS) "spliced" next to the translational start site of these polypeptide (Sawicki et al., 2007). This allows transcription of 50-100-fold excess of viral mRNAs that ultimately produce proteins essential for virion formation. Disrupting the formation of N gene subgenomic RNA and consequently N gene mRNA through mutation of conserved RNA motifs effectively curtailed the ability of another coronavirus-the transmissible gastroenteritis virus to produce successful virions (Mateos-Gomez et al., 2013). Absence of subgenomic RNA in infected cells will likely mean the infected cells are not capable of producing active virions. If active infectious virions are produced and actively transmitting between cells within the host, we hypothesize there will be subgenomic RNA produced. As production of subgenomic RNA is dependent on the intact replicase/transcriptase complex skipping a large portion of the genome and not produced spontaneously from just reverse transcription of the virus, it is distinct from viral fragment residues present in the virus at late stages of infection (Hu et al., 2020) and may be more valuable in detection of infectious SARS-COV-2 patients. Thus, it may be possible to use subgenomic RNA levels to ascertain the presence of 'active' infectious virions in which the virions can infect the naïve host cells at the site of sampling. Indeed, when we correlate our subgenomic RNA levels or proportion with day of symptom, we see a correlation that is consistent with the predicted transmissibility of SARS-CoV-2 (He et al., 2020), suggesting that this hypothesis is worth Frontiers in Genetics frontiersin.org investigating further. There are a few caveats to this hypothesis. Firstly, subgenomic RNAs can self-amplify without the presence of a full-length genomic template, which means the positive strand can act as a template for the negative strand and vice versa (Wu and Brian, 2010). However, this requires the presence of ORF1ab proteins that are required for the generation of the subgenomic RNA in the first place, thus subgenomic RNAs are unlikely to persist long after degradation of the genomic template encoding ORF1ab. Second, a similar study on SARS-CoV-2 has suggested that subgenomic RNA content was poorly correlated to day of illness in two patients and thus not a universally useful indicator of viral replication (Alexandersen et al., 2020). They noted the observation that subgenomics samples were more abundantly detected in poor samples enriched in degraded RNA because of a bias for short reads during PCR amplification. This was based on a limited number of samples (n = 2) collected under different conditions and was not conclusive. We performed a controlled time-course experiment where samples were collected from the same patients over the course of 10 days and both sequencing and Taqman quantitation were performed. In this way, differences in Frontiers in Genetics frontiersin.org subgenomic RNA levels due to sample handling were minimized. We found that subgenomic RNA tracked the genomic RNA over the course of the infection and showed an earlier drop-off that coincided with antiviral treatment. Lopinavir has shown in vitro activity against SARS-CoV-2 replication (Choy et al., 2020), and there was a noticeable reduction in SARS-CoV-2 genome coverage in our patient (P07) following lopinavir/ritonavir initiation, although the subgenomic RNA levels were already in decline a day prior to that. The decline in the viral genome coverage may be due to the activity of the drug, or it may be due to the natural viral kinetics during the disease course. The combination lopinavir/ritonavir is currently not recommended as a treatment regimen for patients with COVID-19 in view of a lack of significant therapeutic effect (WHO, 2023;COVID-19 Treatment Guidelines, 2021). Moving forward, it would be worth using subgenomic RNA levels as a marker for antiviral efficacy in this and other widely used therapeutics in COVID-19 patients such as remdesivir or dexamethasone.
In summary, we demonstrated the ability to detect subgenomic RNA using the ARTIC protocol with low sequence depth and showed that we can track subgenomic distribution and contribution against genomic signal in multiple patients. We showed that there is correlation between subgenomic percentage, subgenomic Ct, genome coverage and genomic Ct with time after symptom first appears. Importantly, we developed a Taqman assay against ORF5 and ORF7a subgenomic RNA that we believe can be used to more closely monitor if subgenomic RNA is reflective of infectivity and can readily be incorporated into other available SARS-COV-2 diagnostics. We believe that this will make a significant contribution to public health efforts to control the virus as well as therapeutic studies evaluating the effects of antiviral therapies.

Data availability statement
The data represented in the study are deposited in the figshare repository, with the accession number: 22060181 and associated doi link: https://doi.org/10.6084/m9.figshare.22060181.v1.

Ethics statement
The studies involving human participants were reviewed and approved by National University Hospital Singapore DSRB 2020/ 00867. The patients/participants provided their written informed consent to participate in this study.