A systematic review of the barcoding strategy that contributes to COVID-19 diagnostics at a population level

The outbreak of SARS-CoV-2 has made us more alert to the importance of viral diagnostics at a population level to rapidly control the spread of the disease. The critical question is how to scale up testing capacity and perform diagnostic tests in a high-throughput manner with robust results and at affordable cost. Here, 26 recent articles using barcoding technology for COVID-19 diagnostics and biologically relevant studies are reviewed. Barcodes are molecular tags that allow an array of samples to be processed at once. To date, barcoding technology followed by high-throughput sequencing has been applied to molecular diagnostics of SARS-CoV-2 infections because it can analyze up to tens of thousands of clinical samples simultaneously within a short diagnostic time. Importantly, this technology can also be combined with different biotechnologies, allowing investigation at single-molecule resolution. In this Mini-Review, I first explain the general principle of the barcoding strategy and then present recent studies using this technology for COVID-19 diagnostics and basic research. I also offer a viewpoint on how to improve the current COVID-19 diagnostic strategy, with potential solutions. Finally, and importantly, two practical ideas are proposed on how barcodes can be further applied to studying SARS-CoV-2 to accelerate our understanding of this virus.


FIGURE 1
Schematic representation of mechanistic strategies of barcoding. (A-C) Barcodes can be introduced to a template using adaptors through direct ligation (A), using RT or PCR primers at the reverse transcription or PCR amplification step (B), or using hybridizing molecular inversion probes (C). (D) Schematic representation of the difference between "barcodes" and "sample indexes". Barcodes aim to correct sequencing errors. For example, a misread nucleotide, guanosine (G), can be corrected in the final consensus sequences for a pool of Sample 1 (top panel). Sample indexes are used to multiplex different sequencing amplicons generated from different pools of samples (Samples 1, 2, and 3) (bottom panel). Panel (A) is modified from Figure 1 in Schmitt et al. (2012) and panel (C) is modified from Figure 1 in Hiatt et al. (2013).

Table 1 (fragment). Barber et al. (2021): primer-associated approach; multiplex samples; characterization of (polyclonal) antibody-epitope binding; microarray data processing scripts (GitHub); GenePix Pro 7.
Table 1 notes. SISPA, sequence-independent single primer amplification (Reyes and Kim, 1991). ‡ The whole genome sequence was obtained based on a multiple sequence alignment of short reads (110-140 bp) sequenced by Illumina MiSeq.

Frontiers in Molecular Biosciences
frontiersin.org

Barcoding comprises four main steps: 1) tag samples of interest with unique barcodes, 2) multiplex samples, 3) process barcoded samples with sequencers or other high-throughput techniques, and 4) demultiplex readouts and assign each readout to its sample according to the barcode. Barcodes can be introduced in at least three ways. In the first approach, barcodes are embedded into molecular adaptors while constructing sequencing libraries. A classic example was given by Schmitt et al. (2012), who first generated a pair of double-stranded, Y-shaped adaptors embedded with unique barcodes and ligated them to both ends of amplicons. This sequencing library is designed to correct sequencing errors in sequencing reads (Figure 1A). Several commercial kits already provide the option of a PCR-free barcoding procedure based on the same logic (the so-called direct ligation approach in Table 1). In the second approach (hereafter, and in Table 1, the primer-associated approach), barcodes are embedded in target-specific primers and introduced onto a template by reverse transcription (RT) or PCR amplification (Figure 1B). The third approach is to use molecular inversion probes (MIPs) carrying barcodes. A classic example is the study by Hiatt et al. (2013), in which molecular tags (equivalent to the barcodes discussed here) were introduced to the reverse complement strand of the gene of interest using polymerase and ligase, making it possible to distinguish reads derived from different genomic equivalents within individual DNA samples (Figure 1C). Other methods that are less frequently used for barcoding are also briefly discussed in a later section (summarized in Table 1). The following sections focus on barcoding applied to diagnostics of SARS-CoV-2 infections and biologically relevant studies.
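The four steps above (tag, multiplex, process, demultiplex) can be illustrated with a minimal Python sketch. It is a toy model, not a pipeline from any of the reviewed studies: the 4-nt barcodes, the sample names, and the functions `tag` and `demultiplex` are all hypothetical, and the barcode is assumed to sit at the start of each read and to be read without errors.

```python
from collections import defaultdict

# Hypothetical sample-to-barcode assignment (illustrative values only).
SAMPLE_BARCODES = {
    "sample_1": "ACGT",
    "sample_2": "TGCA",
    "sample_3": "GATC",
}


def tag(sample_id, sequence):
    """Step 1: prepend the sample's barcode to a sequence."""
    return SAMPLE_BARCODES[sample_id] + sequence


def demultiplex(reads, barcode_length=4):
    """Step 4: assign pooled reads back to samples by their leading barcode."""
    barcode_to_sample = {bc: s for s, bc in SAMPLE_BARCODES.items()}
    by_sample = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_length], read[barcode_length:]
        # Reads with an unknown barcode are set aside rather than guessed.
        sample = barcode_to_sample.get(barcode, "unassigned")
        by_sample[sample].append(insert)
    return dict(by_sample)


# Step 2: multiplex (pool) the tagged molecules. Step 3 (sequencing) is
# represented here simply by reading the pooled strings back.
pool = [
    tag("sample_1", "AAAC"),
    tag("sample_2", "CCCG"),
    tag("sample_1", "TTTG"),
    "NNNNGGGG",  # a read whose barcode is not in the table
]
result = demultiplex(pool)
print(result)
```

Real demultiplexers additionally tolerate a limited number of mismatches in the barcode, which is one reason why barcode sets are designed with a minimum distance between members.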

Background information about SARS-CoV-2 and COVID-19
The outbreak of novel coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), began in early December 2019, quickly spread worldwide, and turned into a global pandemic. Although the origin of SARS-CoV-2 has been the topic of substantial debate (a natural origin through zoonosis or introduction from a laboratory source), molecular evidence indicates that coronaviruses originated in bats (Drexler et al., 2014) and were then transmitted to civets and several wildlife species as potential intermediate hosts, and then to humans. Coronaviruses, like other RNA viruses that frequently undergo host switching under different selection pressures, are genetically heterogeneous, in part due to the error-prone, low-fidelity RNA-dependent RNA polymerases that replicate their genomes (Vignuzzi et al., 2006; Peck and Lauring, 2018; Jones et al., 2021), which may allow this virus to infect a broad spectrum of hosts.
The genome of SARS-CoV-2 is composed of 29,881 nucleotides (Lu et al., 2020), making this virus one of the largest known enveloped single-stranded RNA viruses. Its genome encodes four structural proteins, namely spike (S), small envelope (E), matrix (M), and nucleocapsid (N) (Chan et al., 2020), as well as accessory and non-structural proteins. In SARS-CoV-2, the S protein is the main structural protein that ensures the attachment of the virion to the target cell and mediates membrane fusion, thereby achieving successful viral entry (Ou et al., 2020); it is a key protein in determining the infectivity of this virus and its transmissibility in the host (Hulswit et al., 2016). Additionally, this protein is the major antigen inducing protective immune responses (He et al., 2004; Du et al., 2009; Li, 2016; Walls et al., 2020). Pathologically, it is suggested that severe COVID-19 results from virus-driven perturbations of the immune system and tissue injury, including neutrophil extracellular traps and thrombosis, even though the mechanisms that lead to manifestations of viral infection are not fully understood.

Literature search strategy
A rigorous literature search was performed in PubMed with the keywords ((SARS-CoV-2) OR (COVID-19)) AND (barcode). Research articles were searched from 2019 until the time of writing (end of November 2022). Only research articles published in English were selected; review articles, preprints, and news features were excluded. The keyword search in PubMed initially returned 345 articles. All articles were then carefully examined, and those that did not match the scope of this Mini-Review were removed. Eventually, 45 articles fit the criteria. Based on the function and the type of barcodes described in these 45 articles, the barcodes were classified into three categories: molecular barcodes (26 articles), genetic barcodes (10 articles), and digital barcodes (9 articles). Molecular barcodes refer to sequence-based barcodes, which are often implemented with different biotechnologies, such as PCR, RT-PCR, flow cytometry, and CRISPR/Cas9. In contrast, in this Mini-Review genetic barcodes refer either to unique viral genomic regions that enable the classification of SARS-CoV-2 variants or to host cellular genetic signatures (Fischer et al., 2021). It is worth noting that, although the concept of viral genetic barcodes is indeed fascinating for tracking and discriminating variants of SARS-CoV-2 and may also be beneficial for COVID-19 diagnostics, the computational methods and algorithms used to retrieve genetic barcodes are presently not optimized, and how frequently the currently known genetic barcodes are still retained in the latest variants of SARS-CoV-2 remains to be evaluated. Here I summarize sequences of known genetic barcodes present in major clades, their corresponding variants, and SARS-CoV-2 genes in Table 2. Genetic barcodes were collected from Guan et al. (2020) and Zhao et al. (2020). Digital barcodes refer to 2D QR barcodes used to store information.
In this Mini-Review, the focus will be placed on molecular barcodes. A comparison of the articles using molecular barcodes is summarized in Table 1.

The barcoding strategy for studying SARS-CoV-2
Barcodes used in these 26 articles are sequence-based, except in the study by Vesper et al. (2021), in which the authors used different concentrations of the cell proliferation tracer, CytoTell blue, as color-based barcodes read by flow cytometry. The barcoding step can be achieved either using commercial kits, such as those from Illumina and Oxford Nanopore Technologies, or a customized design (Table 1). In the latter case, the barcode sequence is often embedded in primers as an overhang at the step of reverse transcription of viral RNA or PCR amplification of the RT product (Figure 1B) (Bhoyar et al., 2021; Bloom et al., 2021; Duan et al., 2021; Gauthier et al., 2021; Ludwig et al., 2021; Stüder et al., 2021; Wu et al., 2021; Cohen-Aharonov et al., 2022; Credle et al., 2022; Gallego-García et al., 2022; Palmieri et al., 2022; Warneford-Thomson et al., 2022; Yermanos et al., 2022). Barcoded primers used in SwabSeq (Bloom et al., 2021) and by Cohen-Aharonov et al. (2022) are compatible with one-step RT-PCR. This barcoding workflow resembles the primer-associated approach described in the first section and yields amplicons carrying barcodes as the final product once RT-PCR is complete. The length of barcodes can vary (generally between 4 and 20 base pairs): the longer the barcode, the lower the probability that multiple reads contain the same barcode. Of note, "barcodes" and "sample indexes" are conceptually two different molecular tags even though both consist of a DNA sequence. There is indeed some functional overlap. Strictly speaking, however, "barcodes" serve to correct sequencing errors, thereby increasing sequencing accuracy, whereas "sample indexes" are used to multiplex sequencing libraries into the same lane of a flow cell (Figure 1D).
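The relation between barcode length and the chance of two molecules sharing a barcode can be made concrete with a small calculation. The sketch below rests on a simplifying assumption that is not stated in the reviewed articles: each molecule receives a barcode drawn uniformly at random from all 4^L possible sequences of length L, so the question becomes a standard birthday problem. The function name `collision_probability` is illustrative.

```python
def collision_probability(barcode_length, n_molecules):
    """Probability that at least two of n_molecules share a barcode,
    assuming barcodes are drawn uniformly from all 4**barcode_length sequences."""
    n_barcodes = 4 ** barcode_length
    if n_molecules > n_barcodes:
        return 1.0  # more molecules than barcodes: a collision is certain
    p_all_distinct = 1.0
    for i in range(n_molecules):
        # The (i+1)-th molecule must avoid the i barcodes already in use.
        p_all_distinct *= (n_barcodes - i) / n_barcodes
    return 1.0 - p_all_distinct


# Compare several barcode lengths for the same number of tagged molecules.
for length in (4, 8, 12, 20):
    print(length, collision_probability(length, n_molecules=100))
```

Under this model the collision probability falls steeply as the barcode gets longer, which is consistent with the qualitative statement above. Real barcode pools are rarely uniform, so the numbers should be read as a lower bound on the collision risk rather than an estimate for a specific protocol.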
It is noteworthy that, while reviewing these articles, I noticed that the terms "barcodes" and "sample indexes" are currently used ambiguously. Using a combination of multiple barcodes or different layers of barcoding appears to be popular to increase sequencing capacity and make the readouts more informative. For example, amplicons from SwabSeq (Bloom et al., 2021) carry barcodes (i5 and i7) used to maximize specificity and avoid false-positive results. Amplicons from LAMP-Seq (Ludwig et al., 2021) and COV-ID (Warneford-Thomson et al., 2022) contain one LAMP barcode (10 bp in LAMP-Seq and 5 bp in COV-ID) and two standard PCR barcodes (Illumina i5 and i7) to scale up the deep sequencing capacity. Gauthier et al. (2021) employed SISPA barcoded primers (Reyes and Kim, 1991) to detect and assemble genomes of SARS-CoV-2 and Oxford Nanopore barcodes to multiplex samples. Stüder et al. (2021) used two sets of barcoded primers to track viral variants and multiplex samples for sequencing. Wu et al. (2021) embedded a left and a right barcode (5 bp each) in the forward and the reverse primer, respectively, to specify patient samples. Gallego-García et al. (2022) used a barcode sequence of 20 random nucleotides inserted in the forward and reverse primers to minimize cross-sample contamination. Palmieri et al. (2022) and Yermanos et al. (2022) applied two-dimensional barcoding primers to specify samples pooled in wells and plates. Similarly, Duan et al. (2021) included both a known barcode sequence (8 bp) and a unique molecular identifier (UMI) of three random nucleotides in RT primers to multiplex samples and correct sequencing reads.

In addition to PCR- or RT-PCR-based methods, barcodes can also be introduced in other ways. For example, Danh et al. (2022) used chemical cross-linkers to install DNA barcodes. Fang et al. (2022), Saini et al. (2021), and Karp et al. (2020) directly ligated a DNA barcode sequence to the protein of interest (peptide-MHC complex multimers or the spike protein), which is a PCR-free approach. Mylka et al. (2022) used barcode-labeled antibodies or lipid anchors to stain a pool of cells individually. More importantly, the spectrum of applications can be broadened when barcoding is adapted to other biotechnologies. For example, Barber et al. (2021, 2022) designed unique sgRNAs to serve as identifiers (unique barcodes), which are co-expressed with the Cas9 protein. Zhu et al. (2022) included an additional sequence-based barcode adjacent to the 3' end of the sgRNA in addition to unique guide sequences. As mentioned previously, Vesper et al. (2021) applied the cell proliferation tracer dye at different dilutions to label samples, allowing samples to be separated using flow cytometry.
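The idea behind two-dimensional barcoding, as mentioned above for samples pooled in wells and plates, is that two short barcode sets combine multiplicatively. The following sketch is only an illustration of that principle: the barcode sequences, the plate and well names, and the function `decode` are hypothetical and do not reproduce the primer designs of Palmieri et al. (2022) or Yermanos et al. (2022).

```python
from itertools import product

# Hypothetical barcode tables: one dimension identifies the plate,
# the other identifies the well within a plate.
PLATE_BARCODES = {"AACC": "plate_1", "GGTT": "plate_2"}
WELL_BARCODES = {"ACAC": "A1", "TGTG": "A2", "CAGT": "B1"}


def decode(plate_barcode, well_barcode):
    """Map a (plate barcode, well barcode) pair to a (plate, well) position.

    Returns None when either barcode is not in its table.
    """
    plate = PLATE_BARCODES.get(plate_barcode)
    well = WELL_BARCODES.get(well_barcode)
    if plate is None or well is None:
        return None
    return (plate, well)


# Two plate barcodes and three well barcodes address 2 * 3 = 6 positions
# while requiring only 2 + 3 = 5 distinct barcode sequences.
n_addressable = len(PLATE_BARCODES) * len(WELL_BARCODES)
all_positions = [decode(p, w) for p, w in product(PLATE_BARCODES, WELL_BARCODES)]
print(n_addressable, all_positions)
```

The same arithmetic explains why combining a method-specific barcode with two standard index reads scales the number of samples per sequencing run so effectively.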

The barcoding strategy for current COVID-19 diagnostics, fundamental research, and future perspectives
One of the main contributions of barcoding is to scale up testing capacity for population diagnostics. Diagnostics at a population level has become one of the essential strategies to control the outbreak of COVID-19 because it allows people with SARS-CoV-2 infections to be detected early and immediately placed in quarantine. Available and mature methods that have been benchmarked for COVID-19 diagnostics at a population level include DRAGEN COVIDSeq (Bhoyar et al., 2021), SwabSeq (Bloom et al., 2021), and LAMP-Seq (Ludwig et al., 2021) (Table 1). Amplicons prepared with these methods are sequenced using Illumina sequencing platforms (iSeq, MiniSeq, MiSeq, NextSeq, NovaSeq). One strong advantage of Illumina sequencing is that Illumina adapter sequences are public, which allows researchers to implement barcodes adapted to their own experimental designs. These methods target a small region of a gene, thereby shortening the diagnostic time. Most importantly, these methods appear to be less labor-intensive and more cost-effective. Other potential methods for COVID-19 population diagnostics are listed in Table 1. Although public health policy in many countries now tends toward coexistence with the virus, COVID-19 diagnostics is still crucial to control the spread of the disease in countries where medical resources are insufficient. Since around 33% of people with SARS-CoV-2 infection are estimated to be asymptomatic (Oran and Topol, 2021), the accurate assessment of COVID-19 diagnostic capacity remains important for strategic planning, public health control measures, and patient management.
In addition to multiplexing samples, several groups apply barcoding to identify new variants of concern (Bhoyar et al., 2021; Gauthier et al., 2021; Stüder et al., 2021; Cohen-Aharonov et al., 2022; Escalera et al., 2022; Gallego-García et al., 2022; Yermanos et al., 2022) (Table 1). Indeed, SARS-CoV-2 is a typical zoonotic RNA virus that is able to infect different species. The adaptation of a virus to a new niche is often reflected in viral sequence changes. Fixation of these changes may require a long time through repeated transmission, eventually resulting in a reduced effective population size harboring dominant alterations in its sequence space. Investigating how the virus evolves genetically to achieve host jumps could therefore be essential to understand the molecular basis of this process, which would benefit the development of better antiviral strategies. One of the methodologies to study virus cross-species transmission is the reverse genetics approach, which allows the consequences of genetic mutations to be elucidated by examining changes in phenotype. Here, I propose that barcodes could be implemented in an in vitro or in vivo system and used as tracers for reconstructing individual evolutionary transmission routes over a large experimental timescale. Practically, unique barcodes could be used to tag the genome of SARS-CoV-2 or be embedded in SARS-CoV-2 pseudotyped virus. Barcoded viruses would then be used to infect an appropriate model system over multiple rounds of infection. Since barcodes distinguish individual viral infections, it becomes feasible to monitor the genetic alterations of individual viruses along different evolutionary lineages.
Barcoding has also been applied to characterize specific antibody-epitope binding (Karp et al., 2020; Barber et al., 2021; Saini et al., 2021; Vesper et al., 2021; Barber et al., 2022; Credle et al., 2022; Danh et al., 2022; Fang et al., 2022) and to identify novel host factors required for viral entry (Zhu et al., 2022) (Table 1). A typical feature of an RNA virus is a high mutation rate due to the error-prone, low-fidelity RNA-dependent RNA polymerase, which allows the virus to escape our immune system and weakens the efficacy of antiviral drugs. For this reason, an effective strategy to develop broad-spectrum SARS-CoV-2 neutralizing antibodies and antiviral drugs that cover variants of SARS-CoV-2 is urgently needed. Another idea proposed here is to select drug-resistant variants of SARS-CoV-2 in vitro in a high-throughput manner. Barcoded SARS-CoV-2 would be used for in vitro infections in the presence of different antiviral drugs. After multiple rounds of infection, barcoded viruses that remain viable are collected and sequenced. Based on unique barcodes, it thus becomes possible to reveal mutations that are essential for resistance to the corresponding antiviral drugs, with resolution of individual viruses at a single-nucleotide level.
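One possible read-out for such an experiment is to group reads by barcode, collapse each group into a consensus sequence so that isolated sequencing errors are outvoted, and then compare each consensus with a reference. The Python sketch below illustrates this under strong simplifying assumptions that are mine, not part of any reviewed protocol: reads are already aligned and of equal length, each barcode marks one viral lineage, and a simple majority vote per position is sufficient. The names `consensus_by_barcode` and `mutations` and all sequences are hypothetical.

```python
from collections import Counter, defaultdict


def consensus_by_barcode(tagged_reads):
    """Collapse reads sharing a barcode into one majority-vote consensus.

    tagged_reads: iterable of (barcode, sequence) pairs; sequences sharing
    a barcode are assumed to be aligned and of equal length.
    """
    families = defaultdict(list)
    for barcode, sequence in tagged_reads:
        families[barcode].append(sequence)
    consensus = {}
    for barcode, sequences in families.items():
        # zip(*sequences) yields one column of bases per position.
        consensus[barcode] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*sequences)
        )
    return consensus


def mutations(reference, sequence):
    """List (1-based position, reference base, observed base) differences."""
    return [
        (i + 1, ref, alt)
        for i, (ref, alt) in enumerate(zip(reference, sequence))
        if ref != alt
    ]


reference = "ACGTAC"
reads = [
    ("BC1", "ACGTAC"), ("BC1", "ACGTAC"), ("BC1", "ACGGAC"),  # one read error
    ("BC2", "ACCTAC"), ("BC2", "ACCTAC"), ("BC2", "ACCTAC"),  # true variant
]
consensus = consensus_by_barcode(reads)
variants = {bc: mutations(reference, seq) for bc, seq in consensus.items()}
print(variants)
```

In this toy input the single discordant read in family BC1 is outvoted, whereas the change shared by every read of BC2 is reported as a mutation, which mirrors the distinction between sequencing errors and genuine variants drawn in Figure 1D.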

Discussion
In the past 10 years, technological progress in barcoding has made it possible to reach single-molecule resolution and detect low-frequency and subclonal variations. Such advantages are now applied to raise diagnostic capacity and to study or track variations of individual viruses in a pool of samples. Collectively, the advantages of the barcoding strategy for COVID-19 diagnosis include 1) increasing the throughput of diagnostic samples, 2) shortening the processing time, 3) diminishing the risk of technical batch effects, 4) lowering library preparation and per-sample costs, and 5) increasing the accuracy of diagnostic results. Furthermore, beyond COVID-19 diagnosis, the barcoding strategy could be extended in SARS-CoV-2 research to track variants over a large timescale and perform SARS-CoV-2 progression surveillance. Nevertheless, critical shortcomings of the barcoding strategy, such as barcode collisions and barcode hopping, still require attention. These problems could be addressed at both the experimental and the analytical level. At the bench, a potential solution is to increase the complexity of unique barcodes in a pooled library, thereby minimizing the probability that multiple molecules initially receive the same barcode (barcode collisions) or that barcodes are incorrectly assigned to other molecules (barcode hopping) at the amplification step. The complexity of barcodes can be raised either by increasing the number of barcodes in the pool (quantity) or by enforcing a minimum Levenshtein distance (Yujian and Bo, 2007) among barcodes (quality). Alternatively, errors could also be corrected using better error-correcting algorithms and quantification methods.
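To make the "quality" criterion concrete, the sketch below computes the Levenshtein distance with the standard dynamic-programming recurrence and then greedily filters a candidate list so that every retained barcode differs from all others by at least a chosen number of edits. The greedy filter, the function names, and the candidate sequences are illustrative assumptions; dedicated barcode design tools use more elaborate search strategies.

```python
def levenshtein(a, b):
    """Edit distance (substitutions, insertions, deletions) between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def select_barcodes(candidates, min_distance):
    """Greedily keep candidates at least min_distance edits from every kept one."""
    kept = []
    for candidate in candidates:
        if all(levenshtein(candidate, k) >= min_distance for k in kept):
            kept.append(candidate)
    return kept


# Hypothetical candidate pool; "AAAT" is one edit away from "AAAA".
candidates = ["AAAA", "AAAT", "TTTT", "CCCC"]
selected = select_barcodes(candidates, min_distance=3)
print(selected)
```

With a minimum distance of 3, a single substitution, insertion, or deletion in a barcode still leaves it closer to its original than to any other member of the set, so such an error can be detected and, in principle, corrected. The trade-off is that a stricter distance shrinks the usable pool, which works against the "quantity" route described above.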
In this Mini-Review, 26 recent research articles using the barcoding strategy, which mainly contribute to COVID-19 diagnostics and biological research on SARS-CoV-2, were systematically reviewed (Table 1). In addition to increased diagnostic capacity, short diagnostic time, and low costs, the accuracy of diagnostic results is another factor that should be carefully considered. Several of the studies reviewed here (Bloom et al., 2021; Gauthier et al., 2021; Ludwig et al., 2021) have already discussed and proposed possible solutions to correct false-positive results caused by barcode swapping. Importantly, it has been documented that up to 58% of COVID-19 patients may receive initial false-negative diagnostic results (Pecoraro et al., 2022). One of the causes is frequent mutation in the genome of SARS-CoV-2, which renders primers used for detecting the virus ineffective. A potential solution could be to perform population-scale long-read sequencing. Although this idea has been put forward (Freed et al., 2020; Gauthier et al., 2021; González-Recio et al., 2021; Stüder et al., 2021; Escalera et al., 2022), the current methods need to be further optimized. Finally, two practical ideas to expand the power of barcoding are proposed in this Mini-Review. In the first idea, barcodes are used as tracers to depict the history of genome alterations in every lineage of SARS-CoV-2 variants over time, which would help to screen for potential mutations required for cross-species transmission. The second idea would help medical doctors to adjust antiviral regimens to meet the needs of individual patients.
Collectively, barcoding is a molecular tool that helps to read a massive array of samples in parallel and opens the way to investigating variations at a population level.

Author contributions
Conceptualization, H-CC; literature search, H-CC; writing-original draft preparation, H-CC; writing-review and editing, H-CC; funding acquisition, H-CC. All authors contributed to the article and approved the submitted version.

Funding
This work is supported by institutional funding; no extramural funding was received.

Conflict of interest
The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.