Systematic benchmarking of nanopore Q20+ kit in SARS-CoV-2 whole genome sequencing

Whole genome sequencing provides rapid insight into key information about the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), such as virus typing and key mutation sites. Combined with the epidemiological information of each case, this information is important for the precise prevention, control and tracing of coronavirus disease 2019 (COVID-19) outbreaks. Nanopore sequencing is widely used around the world for its short sample-to-result time, simple experimental operation and long sequencing reads. However, because nanopore sequencing is a relatively new sequencing technology, many researchers still have doubts about its accuracy. The combination of the newly launched nanopore sequencing Q20+ kit (LSK112) and flow cell R10.4 offers a qualitative improvement in accuracy over the previous kits. In this study, we used the LSK112 kit with flow cell R10.4 to sequence the SARS-CoV-2 whole genome for the first time, and summarized the sequencing results of this combination for the 1200bp amplicons of SARS-CoV-2. We found that the proportion of sequences with an accuracy of more than 99% reached 30.1%, and the average sequence accuracy reached 98.34%, while the results of the original combination of LSK109 kit and flow cell R9.4.1 were 0.61% and 96.52%, respectively. The mutation site analysis showed that the results were completely consistent with the final consensus sequence of next generation sequencing (NGS). The results showed that the combination of LSK112 kit and flow cell R10.4 allowed rapid whole-genome sequencing of SARS-CoV-2 without the need for verification by NGS.

Introduction
Coronavirus disease 2019 (COVID-19), which emerged at the end of 2019, is a very serious infectious disease caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and poses a huge public health challenge to the world (Wu et al., 2020). SARS-CoV-2 is an enveloped virus with a positive-sense, single-stranded RNA genome of ∼30 kb. The COVID-19 epidemic has reached almost every country in the world, with over 520 million cases of infection and over 6.25 million deaths as of the end of May 2022. Because SARS-CoV-2 is highly transmissible and single-stranded RNA viruses mutate readily, SARS-CoV-2 is constantly mutating and undergoing immune escape (Garcia-Beltran et al., 2021; Harvey et al., 2021).
Currently, the World Health Organization (WHO) has defined five specific Variants of Concern (VOCs 1), in particular B.1.617.2 (Delta) and B.1.1.529 (Omicron). Delta was the key strain that caused the early COVID-19 epidemic, with the D614G mutation contributing to the rapid spread of SARS-CoV-2 (Korber et al., 2020; Jackson et al., 2021; Plante et al., 2021). Omicron has been responsible for the rapid re-transmission of the COVID-19 epidemic since 2021, and the K417N mutation caused the immune escape of the Omicron strain against SARS-CoV-2 vaccines (Cao et al., 2021, 2022; Li et al., 2021). In fact, more than 90% of the sites of the SARS-CoV-2 genome have been mutated. According to the PANGOLIN SARS-CoV-2 typing system, 2 hundreds of SARS-CoV-2 genotypes have appeared, and only whole genome sequencing can detect all genotypes at once.
Nanopore sequencing is a technology with many advantages such as simplicity, real-time rapid sequencing, and long reads. It has been used to sequence pathogens in several previous outbreaks, such as Ebola, Zika, and Lassa virus (Hoenen, 2016;Quick et al., 2017;Kafetzopoulou et al., 2019). The earliest artic sequencing protocol for sequencing SARS-CoV-2 was also derived from the nanopore sequencing protocol of the Zika virus (Quick et al., 2017). At present, nanopore sequencing is widely used for the whole genome sequencing of SARS-CoV-2. A large number of SARS-CoV-2 sequences in databases such as Global Initiative of Sharing All Influenza Data (GISAID) and National Center for Biotechnology Information (NCBI) are sequenced by nanopore sequencing. In addition, the nanopore-based direct RNA sequencing is also used to study the subgenomic structure and RNA modification of SARS-CoV-2, providing scientists with the complete transcriptome structure of SARS-CoV-2 (Davidson et al., 2020;Kim et al., 2020;Chang et al., 2021;Wang et al., 2021;Ugolini et al., 2022).
1 https://www.who.int/en/activities/tracking-SARS-CoV-2-variants/
2 https://cov-lineages.org/lineage_list.html
Nanopore sequencing performs well in SARS-CoV-2 sequencing, with a sensitivity and specificity of more than 99% at a sequencing depth greater than 60x when compared with the next generation sequencing (NGS) technologies represented by Illumina (Bull et al., 2020). Nevertheless, some scientists remain concerned about the accuracy of nanopore sequencing and still perform NGS to verify the nanopore sequencing results when studying the transmission relationship between cases. Recently, Oxford Nanopore Technologies (ONT) launched the Q20+ kit (LSK112), which is claimed to produce duplex data (∼Q30), achieve simplex accuracies of over 99%, and enhance high-precision consensus sequence generation as well as mutation identification when combined with the latest flow cell R10.4. In this study, we used the Q20+ kit in combination with flow cell R10.4 for whole-genome sequencing of SARS-CoV-2 for the first time, and we compared the sequencing results with those of NGS and of the previous nanopore sequencing kit LSK109 with flow cell R9.4.1 to determine whether the Q20+ kit significantly improved the accuracy of SARS-CoV-2 whole-genome sequencing. We found that the SARS-CoV-2 consensus sequences obtained with the Q20+ kit and flow cell R10.4 were completely consistent with the sequences generated by NGS, with a very significant improvement in single-molecule accuracy, particularly in homopolymer regions, where nanopore sequencing was most error-prone in the past. Compared with the old kit LSK109 with flow cell R9.4.1, the new Q20+ kit (LSK112) with flow cell R10.4 improved the average sequence accuracy in sequencing SARS-CoV-2 from 96.25% to 98.34% and the proportion of sequences with an accuracy of more than 99% from 0.61% to 30.1%, which greatly reduced the background noise that may interfere with variant calling.

Reverse-transcriptase polymerase chain reaction
The Short Fragment (400bp) Target Capture Kit and Long Fragment (1200bp) Target Capture Kit for SARS-CoV-2 Whole Genome (Baiyi Technology Co., Ltd., China, BK-WCoV024TS and BK-WCoV024IITS) were used to reverse transcribe the extracted RNA and amplify the SARS-CoV-2 whole genome. The top three samples in Table 1 were amplified using the short fragment target capture kit and the other samples were amplified using the long fragment target capture kit. RNA was reverse transcribed into cDNA with reverse transcriptase and random primers, and the cDNA was amplified by multiplex polymerase chain reaction (multiplex PCR) using primer pool 1 and primer pool 2 provided in the kit, respectively. The multiplex PCR conditions were: 98 °C for 30 s followed by 25 cycles of 98 °C for 15 s, 65 °C for 5 min, and 72 °C for 2 min. The multiplex PCR products were purified with AMPure XP beads (Beckman Coulter, United States) and then quantified using a Qubit 2.0 Fluorometer and Qubit dsDNA BR Assay kit (Thermo Fisher Scientific, Q32850).

Next generation sequencing
Illumina sequencing was performed using the Nextera XT DNA Library Preparation Kit (Illumina, FC-131-1096) for library construction, with sequencing on a MiSeq or NextSeq 2000 (300 cycles for 150bp paired-end reads). BGI sequencing was performed using the ATOPlex RNA Library Prep Set for library construction with sequencing on the MGISEQ-2000, and using the DNBelab-D4RS Digital Sample Preparation System and accompanying kits for library construction with sequencing on the DNBSEQ-E5. Both sequencers of Applied Biosystems were automatic operating systems, using matching kits and materials for library construction and sequencing.

Data analysis
The fast5 electrical signal files were obtained from the raw nanopore sequencing output, and the fast5 data were then converted to standard fastq files using Guppy (v 6.0.1 3 ) to study the effect of different base-calling strategies on the accuracy of the nanopore sequencing data. We used three configurations, specified through the Guppy configuration file option (-config): dna_r9.4.1_450bps_fast.cfg, dna_r9.4.1_450bps_hac.cfg and dna_r9.4.1_450bps_sup.cfg, corresponding to the fast, hac and sup conversion modes, respectively. The average Q value of each read was calculated using the Seqkit tool (v.2.2.0 4 ) (Shen et al., 2016), and the accuracy density curves were plotted from the obtained Q values using the ggplot2 package in the R language (v 4.1.3 5 ). To analyze the homopolymer accuracy of SARS-CoV-2, we used Seqkit to obtain all homopolymer positions and the corresponding sequences on the reference genome Wuhan-Hu-1, then used Seqkit to count the number of different homopolymers matched in the sample data, and drew line plots with the ggplot2 package.
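To illustrate the per-read Q value statistic described above, the following minimal sketch computes the mean quality of one read by averaging in error-probability space, which is one common convention. It assumes Phred+33 encoded fastq quality strings; the function name `mean_read_q` is hypothetical, and the sketch is not the Seqkit implementation.

```python
import math


def mean_read_q(quality: str, offset: int = 33) -> float:
    """Mean Phred quality of one read, averaged over error probabilities.

    Assumption: `quality` is a Phred+33 encoded FASTQ quality string.
    """
    if not quality:
        raise ValueError("empty quality string")
    # Convert each ASCII character to an error probability: p = 10^(-Q/10).
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality]
    mean_p = sum(probs) / len(probs)
    # Convert the mean error probability back to a Phred score.
    return -10 * math.log10(mean_p)


# Example: '5' is ASCII 53, i.e. Q20 under Phred+33; '+' is ASCII 43, i.e. Q10.
print(round(mean_read_q("5555"), 2))
print(round(mean_read_q("+5"), 2))
```

Averaging error probabilities rather than raw Q values prevents a few high-quality bases from masking low-quality ones, because the Phred scale is logarithmic.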
The data analysis was carried out using the BAIYI MicroGeno Platform (v 4.0 6 , Hangzhou Baiyi Technology Co., Ltd.). The raw data were first quality controlled using NanoPlot (v.1.30.0 7 ) (Coster et al., 2018), and low-quality sequences and sequences shorter than 200bp were then removed using Filtlong (v.0.2.0 8 ) based on the quality control results. The filtered clean data were aligned to the reference genome Wuhan-Hu-1. We used BWA (v 0.7.17 9 ) (Li, 2018) for alignment when processing the NGS data and minimap2 (v 2.22 10 ) (Li, 2018) when processing the nanopore data. Mutation site detection was performed using freebayes (v 1.1.2 11 ), with reference-based assembly of the SARS-CoV-2 whole genome sequence using bcftools (v 1.12 12 ) (Danecek et al., 2021). We calculated the Shannon entropy of variant sites in nanopore sequencing and NGS to analyze the accuracy of sequenced sites (formula of Shannon entropy: $H = -\sum_{i} p_i \log p_i$, where $p_i$ is the frequency of the $i$-th allele observed at the site), using the ggplot2 package for line plotting.
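The Shannon entropy of a variant site can be sketched as follows. This is an illustrative calculation only: the function name `site_entropy`, the representation of a site as a string of observed bases, and the use of a base-2 logarithm are assumptions, since the text does not state which logarithm base was used.

```python
import math
from collections import Counter


def site_entropy(bases: str) -> float:
    """Shannon entropy (in bits) of the alleles observed at one position.

    Assumption: `bases` holds one character per read covering the site,
    e.g. "AAAT" for three reads with A and one read with T.
    """
    counts = Counter(bases.upper())
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no observations at this site")
    entropy = 0.0
    for n in counts.values():
        p = n / total  # allele frequency p_i
        entropy -= p * math.log2(p)
    return entropy


# A site where every read agrees has zero entropy; an even two-allele
# split gives the two-allele maximum.
print(site_entropy("AAAAAAAA"))
print(site_entropy("AAAATTTT"))
```

Under this definition, a homozygous site supported by accurate reads gives an entropy close to zero, while sequencing noise or true heterozygosity raises it.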

Results
The basic sequencing data
Fifteen samples were selected for SARS-CoV-2 whole genome sequencing by NGS and nanopore sequencing, including 9 Delta samples and 6 Omicron samples. The Short Fragment (400bp) Target Capture Kit and Long Fragment (1200bp) Target Capture Kit for SARS-CoV-2 Whole Genome were used to reverse transcribe the extracted RNA and amplify the SARS-CoV-2 whole genome. Among the nine Delta samples, three were amplified with the 400bp capture kit and six with the 1200bp capture kit. All six Omicron samples were amplified with the 1200bp capture kit. The fifteen samples were then sequenced by NGS, method A (using LSK112 kit with flow cell R10.4) and method B (using LSK109 kit with flow cell R9.4.1), respectively (the details of the amplification and sequencing protocols are given in the methods and materials). The time from sample to sequencing result was 21-29 h for NGS and 7-8 h for nanopore sequencing. Statistical analysis of the sequencing results showed that the sequencing depth of each sequencing method was greater than 230, and nearly complete whole genome sequences were obtained for the 15 samples, with most sequences having coverage above 99% (Table 2). The amount of sequencing data is shown in Table 3.
4 https://github.com/shenwei356/seqkit
5 https://www.r-project.org/
6 http://www.baiyi-tech.cn/
7 https://github.com/wdecoster/NanoPlot
8 https://github.com/rrwick/Filtlong
9 https://github.com/lh3/bwa
10 https://github.com/lh3/minimap2
11 https://github.com/freebayes/freebayes
12 https://github.com/samtools/bcftools

Analysis of accuracy
The effect of different base calling strategies on the accuracy of nanopore sequencing
Because nanopore sequencing is based on electrical signals, different base calling strategies can be chosen when converting the fast5 electrical signal data into fastq data. Guppy, which provides three base calling strategies (fast, hac, and sup modes), was used to analyze the effect of the different conversion modes on sequence accuracy. The density distribution of sequence accuracy showed that the sup mode had the highest accuracy for both method A and method B (Figures 1A,B). The fast and hac modes were not suitable for analyzing Q20 data, and this was most evident for the fast mode (Figure 1B). Therefore, we consistently used the sequence accuracy in the sup mode to evaluate both nanopore sequencing methods. The sequence accuracy of method A was significantly better than that of method B for both the 400bp capture kit and the 1200bp capture kit. This showed that, regardless of the length of the sequenced fragments, the Q20+ kit greatly improved sequence accuracy, reaching an accuracy of 99% (Figures 1C,D).
The effect of different amplicon lengths on the accuracy of nanopore sequencing
In method B, the average sequencing fragment lengths obtained with the 400bp capture kit and the 1200bp capture kit were around 376bp and 1058bp, respectively, with no significant difference in accuracy. In contrast, in method A, the average read accuracy of the 1200bp amplicon improved significantly compared to that of the 400bp amplicon, from 96.5 to 97.5%, and the average proportion of data above Q20 rose from 23 to 28.8% (Table 4). This led us to consider whether different amplicon lengths affected nanopore sequencing accuracy. As could be seen in the single-base accuracy analysis, the average single-base quality value (Q value) of the 400bp amplicon was indeed lower than that of the 1200bp amplicon (Supplementary Figure 1). Interestingly, we also found that in method A the single-base Q values for the first 20-30 bp were very low (Figure 2A), possibly due to an unstable electrical signal generated when the DNA fragment had just entered the nanopore. However, the first 20-30 bp were the adapter sequence, not the true amplified fragment sequence. With this in mind, we recalculated the accuracy after trimming the adapter sequence and found that it was 98.27% (Figure 2B).
The effect of duplex data on the accuracy of nanopore sequencing
Currently, for DNA sequencing, ONT supports only the 1D method, but the LSK112 kit additionally supports the 2D method. Unlike in method B, in method A both strands of some molecules pass through the nanopore. For these sequences, in which both the positive and the negative strand passed through the nanopore, we used Guppy (guppy_basecaller_duplex) with duplex tools (v 0.2.9 13 ) to extract and analyze the duplex data of method A. The statistical analysis revealed an average Q value of 26.1 for the duplex data, corresponding to an accuracy of 99.75453%, and duplex data accounted for 3.33% of the sequencing data of method A (Figure 3), which was in line with the 1-10% range given by ONT. The results showed that duplex data was particularly effective in improving the accuracy of nanopore sequencing.
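The correspondence between a Q value and an accuracy quoted above follows from the standard Phred relation Q = -10 log10(P_error). A minimal sketch of the conversion is given below; the function names are hypothetical.

```python
import math


def phred_to_accuracy(q: float) -> float:
    """Convert a Phred quality score to accuracy in percent."""
    error_prob = 10 ** (-q / 10)
    return (1 - error_prob) * 100


def accuracy_to_phred(accuracy_percent: float) -> float:
    """Convert an accuracy in percent (below 100) to a Phred quality score."""
    error_prob = 1 - accuracy_percent / 100
    if error_prob <= 0:
        raise ValueError("accuracy must be below 100%")
    return -10 * math.log10(error_prob)


# The duplex reads had an average Q value of 26.1.
print(f"{phred_to_accuracy(26.1):.5f}")
# Q20 and Q30 correspond to 99% and 99.9% accuracy, respectively.
print(phred_to_accuracy(20), phred_to_accuracy(30))
```

With Q = 26.1 the error probability is 10^(-2.61), or about 0.00245, which gives the 99.75453% accuracy reported for the duplex data.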

Analysis of single nucleotide polymorphism and insertion-deletion
13 https://github.com/nanoporetech/duplex-tools/
Taking the SARS-CoV-2 genome Wuhan-Hu-1 (GenBank accession number: MN908947.3) as the reference genome, we analyzed the mutation sites of each sample. Method B showed a significant increase in mutation sites in the fast mode, 30.97% of which were caused by homopolymer variation, and it additionally generated 2.92% false-positive heterozygous sites (Supplementary Table 1). This confirmed that the fast mode was not suitable for accurate variant calling. However, the fast mode runs faster than the sup mode and has lower hardware requirements, which may explain why some scientists still doubt the accuracy of nanopore sequencing technology despite its rapid improvement in accuracy. In the sup mode, method A and method B were completely consistent with NGS in mutation detection, with consistent site coverage. Intriguingly, in method B, we analyzed a run of eight consecutive T bases (genomic position 11094) in the sup mode and found that 7 out of 15 samples were identified as heterozygous, with a deletion of one T base at a proportion greater than 50%, and the other 8 samples had low heterozygosity

Analysis of homopolymer
We conducted a genome-wide scan of the SARS-CoV-2 genome, which contains multiple homopolymer regions, including a T-base homopolymer region with a length of up to 8, in addition to the 3′ UTR. In method B, the homopolymer identification accuracy gradually decreased as the length of the homopolymer increased (Figure 4A). This limited the application of this sequencing method to whole-genome sequencing of SARS-CoV-2, as it could easily cause frame shift mutations. Method A showed high recognition accuracy for homopolymers, and still had excellent recognition accuracy for the T-base homopolymer region with a length of 8 (Figure 4B). Moreover, the recognition accuracy of homopolymers was significantly negatively correlated with homopolymer length, and had no significant correlation with the four base types.
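A genome-wide homopolymer scan of this kind can be sketched with a regular expression. The study used Seqkit for this step; the Python stand-in below is only an illustration, the function name `find_homopolymers` is hypothetical, and the example sequence is a toy string, not the Wuhan-Hu-1 reference.

```python
import re
from collections import Counter


def find_homopolymers(seq: str, min_len: int = 4):
    """Return (start, base, length) for each homopolymer run of >= min_len.

    `start` is 1-based, matching the usual genomic coordinate convention.
    """
    # One base followed by at least (min_len - 1) repeats of the same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [
        (m.start() + 1, m.group(1), len(m.group(0)))
        for m in pattern.finditer(seq.upper())
    ]


toy = "ACGTTTTTTTTACCCCG"  # toy sequence with a T run of 8 and a C run of 4
runs = find_homopolymers(toy)
print(runs)

# Tally runs by length, as one would before plotting accuracy per length.
print(Counter(length for _, _, length in runs))
```

Applied to a reference genome loaded as a single string, the same function would list every homopolymer position and its length, which can then be compared against the run lengths observed in the reads.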

Analysis of data quantity
We analyzed the data quantity generated by flow cell R10.4 and flow cell R9.4.1 over time and found that flow cell R10.4 generated approximately 230 Mb of data at 120 min, significantly less than the 625.4 Mb of data generated by flow cell R9.4.1 (Figure 5). The data output was significantly positively correlated with the speed at which sequences passed through the nanopore on the two flow cells. The sequencing speed of flow cell R9.4.1 is 400∼450bp per second, while the sequencing speed of flow cell R10.4 is reduced to 200bp per second. As could be seen from the above analysis, method A significantly improved sequencing accuracy at the expense of data output. However, the data output in whole genome sequencing of SARS-CoV-2 was often excessive, so the combination of LSK112 kit and flow cell R10.4 could still meet the needs of the whole genome sequencing of SARS-CoV-2.
Figure 3. The distribution of sequence length and sequence Q value of duplex data in method A.
Figure 4. The counterr density statistic for the detection of different homopolymers on the whole genome of SARS-CoV-2. (A) The counterr density plot of method B in the sup mode for the detection of homopolymer with lengths of 4, 5, 6, and 8; (B) The counterr density plot of method A in the sup mode for the detection of homopolymer with lengths of 4, 5, 6, and 8.
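The trade-off between yield and depth can be made concrete with a short back-of-the-envelope calculation. The sketch below uses the yields and translocation speeds quoted in the text and the 29,903 bp length of the Wuhan-Hu-1 reference; the even split of one flow cell across 15 samples and the assumption that every base maps to the genome are hypothetical simplifications, not details reported in the study.

```python
GENOME_LENGTH_BP = 29_903  # length of Wuhan-Hu-1 (MN908947.3)


def mean_depth(total_bases: float, n_samples: int,
               genome_length: int = GENOME_LENGTH_BP) -> float:
    """Rough mean depth per sample.

    Assumptions (hypothetical): bases are split evenly across samples
    and every base maps to the genome.
    """
    return total_bases / n_samples / genome_length


yield_r10 = 230e6    # bases at 120 min, flow cell R10.4
yield_r9 = 625.4e6   # bases at 120 min, flow cell R9.4.1

# Ratio of yields versus ratio of translocation speeds (200 vs 400-450 bp/s).
print(f"yield ratio R10.4/R9.4.1: {yield_r10 / yield_r9:.2f}")
print(f"speed ratio range: {200 / 450:.2f} to {200 / 400:.2f}")

# Hypothetical scenario: 15 samples multiplexed on one flow cell.
print(f"approx. mean depth per sample: {mean_depth(yield_r10, 15):.0f}x")
```

Even under these simplifying assumptions, the lower yield of flow cell R10.4 leaves a mean depth well above the 60x threshold cited in the introduction, consistent with the observation that data output is often excessive for SARS-CoV-2.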

Discussion
Whole-genome sequencing is the best way to detect SARS-CoV-2 due to its rapidly mutating nature. Because it is rapid, simple and low-cost, nanopore sequencing technology is widely used to obtain the whole genome sequences of viruses, such as Ebola, Zika, and Lassa viruses (Hoenen, 2016; Quick et al., 2017; Kafetzopoulou et al., 2019). However, making the accuracy of nanopore sequencing technology comparable to that of NGS or even Sanger sequencing is still the most important issue for users. Excitingly, the emergence of the nanopore Q20+ kit (LSK112 kit with flow cell R10.4) may allow the SARS-CoV-2 genome to be sequenced without verification by NGS or Sanger sequencing, and its sequencing accuracy has been verified in bacteria, fungi, humans and plants (Sereika et al., 2021; Keraite et al., 2022; Sanderson et al., 2022). This study is the first benchmark test of nanopore Q20+ sequencing in SARS-CoV-2 and in viruses. Excitingly, the LSK114 kit with flow cell R10.4.1, released at London Calling 2022, not only maintains the accuracy of 99%, but also improves the sequencing yield to a level equal to or higher than that of the LSK109 kit with flow cell R9.4.1.
Figure 5. Comparison of method A with method B for data output over time.
For both method A and method B, there were significant differences in accuracy among the three base calling modes, with the sequence accuracy decreasing significantly in the fast mode, especially in homopolymer regions. Both sequencing methods achieved their highest accuracy in the sup mode, with some sample sequences in method A reaching an accuracy of over 99%. Method A was more accurate than method B regardless of the size of the targeted capture fragment, and the longer the fragment, the more accurate it was. Method A had duplex data with an average Q value of 26.1 and an accuracy of 99.75453%, although the percentage of duplex data was small. With the development of nanopore sequencing technology and the increasing proportion of duplex data, nanopore sequencing is expected to achieve even higher accuracy. The LSK112 kit did improve sequencing accuracy compared to the LSK109 kit and was more suitable for sequencing long amplicons. The sequencing quality of the bases that first enter the nanopore is poor, due to the unstable speed of the initial sequence through the nanopore, so the overall sequence accuracy is strongly affected when the amplicon is short. The results showed that the accuracy improved significantly after the adapter sequence was removed. Therefore, adapter sequences and short fragments need to be filtered during data processing to achieve better analysis results.
In mutation detection, it was evident that method B had recognition errors in homopolymer regions, which eventually led to frame shift mutations. This problem is even more noticeable on the ONT Mk1C platform, and this weak point was eliminated on the ONT GridION platform, which supports the sup base calling mode thanks to a large boost in real-time computing power. The Q20+ kit maintained high recognition accuracy in homopolymer regions of lengths 4, 5, 6 and 8, which clearly showed that the Q20+ kit largely solved the homopolymer accuracy problem. We compared the consensus sequences obtained by method A with the consensus sequences from NGS, and the sequences were identical. Homopolymer regions have been prone to accuracy problems with the previous nanopore sequencing kits. However, continuous upgrades of sequencing kits, flow cells and algorithms are addressing this shortcoming; in particular, Q20+ kits such as the LSK112 kit and the LSK114 kit have improved the ability to detect homopolymers up to a length of 10∼12. A recent study reported that sequencing bacterial genomes with the LSK112 kit and flow cell R10.4 achieved high accuracy in homopolymer regions of length up to 9 (Sereika et al., 2021). This means that the LSK112 kit and flow cell R10.4 allow accurate detection of the longest homopolymer in the SARS-CoV-2 genome, which has 8 bases.
In conclusion, the Q20+ kit was found to be more accurate than previous nanopore sequencing kits, especially for sequencing long amplicons. The improvement in accuracy derived from the additional 5 to 10% of duplex data and from the relatively reduced sequencing speed, which increased homopolymer identification accuracy. However, to ensure high accuracy, the sup mode must be selected as the base calling strategy.
At present, nanopore sequencing is increasingly used for the whole genome sequencing of SARS-CoV-2 due to its simplicity, speed and real-time sequencing. The improved accuracy brought by the Q20+ kit can make the prevention and control of epidemics and the traceability analysis of SARS-CoV-2 more accurate.

Data availability statement
The data presented in this study have been submitted to the National Genomics Data Center (https://ngdc.cncb.ac.cn/) with submission number: CRA007743. The generated consensus sequences were submitted with accession numbers: GWHBJYX01000000-GWHBKAG01000000.

Author contributions
JL and XX: methodology establishment, data sorting and analysis, visualization, and writing - original draft. ZM: methodology establishment, data sorting, and writing - review and editing. LW: data sorting and analysis. KZ, XZ, QQ, and YG: resource and writing - review and editing. LM: conceptualization, project administration, supervision, and writing - review and editing. LC: conceptualization, funding acquisition, project administration, validation, supervision, and writing - review and editing. All authors contributed to the article and approved the submitted version.