Group Testing for SARS-CoV-2 Allows for Up to 10-Fold Efficiency Increase Across Realistic Scenarios and Testing Strategies

Background: Due to the ongoing COVID-19 pandemic, demand for diagnostic testing has increased drastically, resulting in shortages of necessary materials to conduct the tests and overwhelming the capacity of testing laboratories. The supply scarcity and capacity limits affect test administration: priority must be given to hospitalized patients and symptomatic individuals, which can prevent the identification of asymptomatic and presymptomatic individuals and hence effective tracking and tracing policies. We describe optimized group testing strategies applicable to SARS-CoV-2 tests in scenarios tailored to the current COVID-19 pandemic and assess significant gains compared to individual testing. Methods: We account for biochemically realistic scenarios in the context of dilution effects on SARS-CoV-2 samples and consider evidence on specificity and sensitivity of PCR-based tests for the novel coronavirus. Because of the current uncertainty and the temporal and spatial changes in the prevalence regime, we provide analysis for several realistic scenarios and propose fast and reliable strategies for massive testing procedures. Key Findings: We find significant efficiency gaps between different group testing strategies in realistic scenarios for SARS-CoV-2 testing, highlighting the need for an informed decision of the pooling protocol depending on estimated prevalence, target specificity, and high- vs. low-risk population. For example, using one of the presented methods, all 1.47 million inhabitants of Munich, Germany, could be tested using only around 141 thousand tests if an infection rate below 0.4% is assumed. Using 1 million tests, the 6.69 million inhabitants of the city of Rio de Janeiro, Brazil, could be tested as long as the infection rate does not exceed 1%.
Moreover, we provide an interactive web application, available at www.group-testing.com, for visualizing the different strategies and designing pooling schemes according to specific prevalence scenarios and test configurations. Interpretation: Altogether, this work may help provide a basis for an efficient upscaling of current testing procedures, which takes the population heterogeneity into account and is fine-grained towards the desired study populations, e.g., mild/asymptomatic individuals vs. symptomatic ones but also mixtures thereof. Funding: German Science Foundation (DFG), German Federal Ministry of Education and Research (BMBF), Chan Zuckerberg Initiative DAF, and Austrian Science Fund (FWF).


INTRODUCTION
The current spreading state of the COVID-19 pandemic urges authorities around the world to take measures in order to contain the disease or, at least, to reduce its propagation speed, as commonly referred to by the term "curve flattening 1 ." At the time of writing, the World Health Organization (WHO) reported 12,552,765 cases and 561,617 deaths with 230,370 new cases in the last 24 hours 2 . In particular, more than 50 countries had been reported to the WHO as experiencing larger outbreaks of local transmission and severe depletion of the workforce, for example among healthcare workers (HCWs). Moreover, given the limited number of tests described by several government agencies, this count likely underrepresents the total number of SARS-CoV-2 infections globally.
Even though a lot of research is currently being performed toward a cure for this infectious disease, to date the most effective available measure against its spread is the tracking and subsequent isolation of positive cases via an intensive testing procedure on a large part of the population, or at least on important risk groups (1). A pilot study conducted by the University of Padua and the Italian Red Cross in Vò, Italy, showed encouraging results in this direction 3 .
At present, the standard tests for the detection of SARS-CoV-2 are nucleic acid amplification tests (NAAT), such as the quantitative reverse transcription-polymerase chain reaction (qRT-PCR). These biochemical tests are based on samples from the lower respiratory or upper respiratory tract of tested individuals 4 . The former is too delicate an operation to be widely applicable and is usually only feasible for hospitalized patients. In routine laboratory diagnosis, however, sampling the upper respiratory tract with nasopharyngeal and oropharyngeal swabs is much less invasive and usually the method of choice.

1 Why outbreaks like coronavirus spread exponentially, and how to "flatten the curve," The Washington Post. Available online at: https://wapo.st/2wLMbzI (accessed July 10, 2020).
2 Coronavirus disease 2019 (COVID-19), Situation Report-174, World Health Organization Webpage. Available online at: https://bit.ly/2ZoN8JJ (accessed July 13, 2020).
3 In one Italian town, we showed mass testing could eradicate the coronavirus, The Guardian. Available online at: https://bit.ly/2VBsmDM (accessed July 10, 2020).
4 Laboratory testing for 2019 novel coronavirus (2019-nCoV) in suspected human cases, World Health Organization Webpage. Available online at: https://bit.ly/38SLDH1 (accessed July 10, 2020).
The demand for this type of SARS-CoV-2 testing, however, is drastically increasing in many healthcare systems, resulting in shortages of necessary materials to conduct the test or capacity limits of the testing laboratories 5 .
The concept of group testing (also called pooled testing or pooling) is a promising way to make better use of the available capacities by mixing the samples of different individuals before testing and first performing the test on these mixtures, the so-called pools, as if they were only one sample. The idea goes back to mathematical work from the 1940s and has since been used for tests based on various biospecimens such as swab, urine, and blood (2)(3)(4). In particular, group tests are employed when testing for sexually transmitted diseases such as HIV, chlamydia, and gonorrhea, and were recently used in viral epidemics such as influenza, see, e.g., (5,6) and references therein.
Very recently, there have also been successful proofs of concept for experimental pooling strategies in SARS-CoV-2 testing. An Israeli research team demonstrated the feasibility of pooling up to 32 samples; they encountered false negative rates of around 10% (7). Subsequently, a German initiative filed a patent for a new approach that allows for so-called minipools combining 5-10 samples with a significantly reduced false negative rate (8). Similarly, a US research group performed a test with 12 pools of 5 specimens each, taken from individuals at risk, and was able to correctly identify the two infected individuals out of the 60 with only 22 tests (9).
The main goal of these works is to demonstrate the feasibility of the experimental design; they propose to use Dorfman's original group testing design of including each specimen in exactly one pool and then, in case of a positive outcome of the group test, testing every specimen of the pool again individually (2). Other recent works have suggested refined approaches, typically based on examples or, from a more theoretical viewpoint, on a simplified model (10)(11)(12)(13)(14)(15)(16).
In this manuscript, we will demonstrate and systematically explore that, even within the limitations of the initial experimental designs for COVID-19 testing, more sophisticated pooling strategies can lead to a significantly reduced number of tests. Connecting the recent SARS-CoV-2 pool tests to the rich literature on group testing developed over the last decades may thus be a key ingredient for effectual national responses to the current pandemic. Such connections have been established by Abdalhamid, Bilder, and McCutchen by incorporating a decision step regarding how to optimize the number of samples within each pool based on the estimated infection rate; this led to the choice of 5 for the pool size (9). The problem of choosing the right pool size had previously been analyzed in many works (17)(18)(19). We further argue that a massive testing program based on pooled tests can have significant positive effects on the physical and mental health of the general population, given that it can allow for partial reopenings or the use of less restrictive social distancing measures, hence allaying social deprivation and isolation with their strong negative effects (20,21).

FIGURE 1 | Hierarchical testing as proposed by Dorfman vs. array testing: In the left figure, 100 specimens are randomly sorted in groups/rows of size 10. As indicated on the right-hand side, the row-wise group test correctly identifies the groups which contain a positive sample (indicated by the red color). Every sample of a positive group will be flagged as possibly positive (indicated by the bold circle) and used for the next stage of tests. In the right figure, we illustrate array testing where, in addition to testing the row groups, column group tests are performed simultaneously. Only specimens which were tested positive in both group tests will be flagged as possibly positive. While this is an example with two simultaneous pool tests, a higher number of simultaneous tests can also be performed.
The theoretical and practical understanding of group testing developed since the first results of Dorfman (2), however, goes far beyond merely optimizing the pool sizes (22,23). For example, it is also possible to study group testing in the case of responses involving three categories or more (24), and to use pooling for the more involved problem of estimating the prevalence of a disease in a population (17).
The main message of this paper is that in realistic prevalence regimes for the current COVID-19 pandemic, concepts like array testing and informative testing, explained in detail in section 2, may help to improve the testing efficiency significantly beyond the gain achieved by the simple pooling strategies implemented in the first approaches. By no means do we claim statistical originality; our goal is rather to explore and numerically compare classical methods for a variety of realistic parameter choices, demonstrating their efficiency for large-scale SARS-CoV-2 testing. This paper is accompanied by a repository of source code that allows for parallel computation and comparative visualizations 6 .

6 Harar P, Berner J, Elbrächter D, et al. Group Testing Simulations (2020). Available online at: https://gitlab.com/hararticles/group-testing-simulations.

GROUP TESTING
As described in section 1, group testing (GT) is the procedure of performing joint tests on mixtures of specimens, so-called pools, as a whole, instead of administering individual tests, thereby requiring significantly fewer tests than the number of specimens to be tested. Ideally, this joint test will produce a positive outcome if any one of the specimens in the pool is infected and a negative outcome otherwise. Because of the limited information contained in a positive outcome, certain specimens need to be tested multiple times: either in parallel for all the specimens, or sequentially with additional testing only for those specimens with positive test results.
Sequential test designs in which the grouping of samples into pools in each stage depends on the results of the former stages are called adaptive. For non-adaptive methods, in contrast, all the sample groupings are specified in advance, which translates into a one-stage procedure in which all pool tests can be performed in parallel.
A special class of adaptive test designs is hierarchical tests, where in the first stage, each specimen is included in exactly one pool, and, in every subsequent stage, groups with positive results are divided into smaller non-overlapping groups and retested, while all specimens contained in groups with negative results are discarded. The original Dorfman test, for example, is a two-stage hierarchical group test.
The left part of Figure 1 illustrates the hierarchical structure of the Dorfman test with a 10 × 10 illustrative microplate. Each circle in the plate represents specimens from separate individuals and the red circles are the infected ones that need to be identified. The specimens are then amalgamated row by row to perform a group test for each row. A positive test result indicates that some individual in the corresponding row is infected. Once the results from the group tests are available, they can be used for the next stage, so only the specimens sharing a pool with an infected specimen will need to be retested.
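The two-stage procedure described above can be sketched in a few lines of code. The following simulation is illustrative only, not taken from the accompanying repository; it assumes a perfectly accurate assay and a fixed pool size.

```python
import random

def dorfman_tests(statuses, k):
    """Number of tests used by Dorfman's two-stage scheme on a list of
    True/False infection statuses (True = infected), assuming a
    perfectly accurate assay."""
    tests = 0
    for i in range(0, len(statuses), k):
        pool = statuses[i:i + k]
        tests += 1                    # stage 1: one test for the whole pool
        if any(pool):                 # positive pool: retest each member
            tests += len(pool)
    return tests

random.seed(0)
n, p, k = 10_000, 0.01, 10            # population size, prevalence, pool size
statuses = [random.random() < p for _ in range(n)]
used = dorfman_tests(statuses, k)
print(f"{used} tests for {n} individuals ({used / n:.3f} per person)")
```

At a prevalence of 1% and pool size 10, one expects roughly 1/10 + 1 − 0.99^10 ≈ 0.2 tests per person, i.e., about a five-fold saving over individual testing.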
Entirely non-adaptive group testing procedures have been designed and analyzed using techniques at the interface of coding theory (25), information theory (23), and compressive sensing (26)(27)(28). The symbiosis among those fields leads to developments such as the establishment of optimal theoretical bounds for the best expected group testing strategies (29). However, some of the developments lead to algorithms that may not be practically efficient to implement and, consequently, are not suited for many medical applications including SARS-CoV-2 testing.
Nevertheless, the idea of including every specimen in multiple pools to be tested in parallel is an integral part of many medical testing procedures, as the implementation of hierarchical tests with many stages can be rather complex and hard to automatize. Often, the test proceeds by arranging the specimens in a two-dimensional array and assembling all the specimens of each column in a pool. Then, the same procedure is done with all the specimens of each row (30). This testing strategy is a special instance of the so-called array testing, already mentioned in section 1. In this way, every specimen is included in exactly two pools. All the specimens in the intersection of two pools with positive test results have to be retested in a second stage, but the number of these individual tests can be considerably smaller than for the Dorfman design. Figure 1 illustrates the array testing procedure for a 10 × 10 microplate with two infected individuals; here only four of the 100 specimens need to be retested.
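The intersection step can be sketched as follows; this is an illustrative helper of ours (perfect assay assumed), reproducing the Figure 1 scenario of two infected specimens on a 10 × 10 plate.

```python
def array_retests(grid):
    """Given a k-by-k boolean grid of infection statuses, return the
    coordinates flagged for individual retesting under array testing:
    row and column pools are tested, and only specimens lying in both a
    positive row and a positive column are retested (perfect assay)."""
    k = len(grid)
    pos_rows = {i for i in range(k) if any(grid[i])}
    pos_cols = {j for j in range(k) if any(grid[i][j] for i in range(k))}
    return [(i, j) for i in pos_rows for j in pos_cols]

# Two infected specimens on a 10x10 plate, as in Figure 1.
grid = [[False] * 10 for _ in range(10)]
grid[2][3] = grid[7][8] = True
flagged = array_retests(grid)
print(len(flagged), "specimens to retest out of 100")   # 4 retests
```

With two positives in distinct rows and columns, two positive rows and two positive columns intersect in four cells, matching the four retests mentioned above; 20 pool tests plus 4 retests replace 100 individual tests.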
Sometimes, for array tests, an initial master pool consisting of all specimens in a certain array is formed and all the k^2 individuals are tested together. This allows for a rejection of a large group in case it exhibits a negative result; otherwise one proceeds with the array strategy described above. It is important to note, however, that master pooling should only be used when there are no clear restrictions on the pool size, e.g., by dilution effects. In case such effects are absent, as recently claimed at least for small pool sizes (8), master pooling strategies could be explored.
Another important methodological advancement in group testing is the design of informative tests, i.e., testing strategies that are not based on the assumption of a uniform infection rate, but rather incorporate different estimates for the infection rate of subgroups of the population. We expect that such strategies will be of particular relevance for SARS-CoV-2 testing; for example, the infection rate among healthcare professionals or elderly care workers is expected to be higher than for citizens working from home due to different levels of exposure and, similarly, a stratification based on the level of symptoms also seems reasonable. A first attempt to make use of such a stratification for SARS-CoV-2 testing has recently been made with two subpopulations (13). This paper, however, only assembles homogeneous pools within the two subpopulations and hence does not make use of the full power of informative testing. Namely, the testing efficiency can be significantly improved by smartly assembling combined pools with members of both subpopulations.
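A simple calculation illustrates the premise of informative testing: the pool size minimizing Dorfman's expected number of tests per person changes markedly with the subgroup prevalence, so a single pool size cannot be optimal for low- and high-risk subpopulations at once. The helper functions below are ours and assume a perfect assay.

```python
def dorfman_cost(p, k):
    """Expected number of tests per person for Dorfman pooling with pool
    size k at prevalence p, assuming a perfect assay: one pool test per k
    individuals plus individual retests whenever the pool is positive."""
    return 1 / k + 1 - (1 - p) ** k

def best_pool_size(p, k_max=32):
    """Pool size in {2, ..., k_max} minimizing the expected tests per person."""
    return min(range(2, k_max + 1), key=lambda k: dorfman_cost(p, k))

for p in (0.001, 0.01, 0.1):
    k = best_pool_size(p)
    print(f"p = {p}: optimal pool size {k}, "
          f"{dorfman_cost(p, k):.3f} tests per person")
```

For example, a subgroup at 10% prevalence is best served with pools of 4, while at 1% prevalence pools of around 11 are optimal; combined pools across subpopulations, as discussed above, can improve on this stratified baseline further.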
Indeed our simulations confirm that this approach, when available, can help improve testing efficiency for realistic choices of parameters. At the same time, we expect that for best performance, one will have to employ a combination of different approaches.
As for many other applications, the design of the GT strategy needs to be driven by the following challenges (31).
i. What practical considerations restrict the pooling strategies available to the laboratory?
ii. How do the pool size and the choice of the assay for NAAT affect the ability of a pooling algorithm to detect infected individuals in a testing population?
iii. Given the assay and maximum pool size, what efficiencies can be expected for different pooling strategies in testing populations with different prevalences of the disease or well-defined subgroups of varying prevalence?
iv. How can pooling strategies be expected to impact the accuracy of the results?
Especially the fourth point has not received much attention in the literature on GT approaches for SARS-CoV-2 testing yet. Like most other testing procedures, qRT-PCR for COVID-19 misclassifies some negative specimens as positive and vice versa, as quantified by the sensitivity and the specificity of the test (the precise definitions are recalled in section 3).
Documented causes of these inaccuracies include low viral load in some patients, difficulty collecting samples from COVID-19 patients, insufficient sample loading during qRT-PCR tests, and RNA degradation during the sample handling process (32). Some of these effects are likely to be amplified in group testing procedures, so it becomes even more important to take errors into account.
At the same time, the accuracy of a test is difficult to assess. Namely, as described above, NAAT is used to quantify the abundance of SARS-CoV-2 genetic material in a sample, similarly to tests for other viral infections (33). In the specific case of qRT-PCR, the abundance measurement is on a continuous scale: the cycle number (Ct) at which the readout, given by a fluorescence trace, surpasses a threshold. A decision boundary for a positive observation, i.e., infected, has to be established based on negative samples, i.e., biological controls. Accordingly, the estimates on false negative and false positive rates of NAAT tests (and group tests in particular) for the SARS-CoV-2 infection depend on the strength of the classifier induced by this decision boundary. The accuracy of this classifier is influenced by several factors such as the following.
1. The ability of the test to selectively amplify virus genetic material depends on primer design. Multiple primers for qRT-PCR testing on COVID-19 samples were recently compared and found to be similarly strong, with a few exceptions of published weaker primers (34).
2. A major worry about group testing is that pooling few positive samples with many negative samples could push the virus concentration in the pooled sample below the detection limit, increasing the false negative rate. This effect has been investigated by studying the test accuracy for dilutions containing virus samples, and false negative rates were found to be below 10% at a wide range of dilutions, suggesting that the qRT-PCR stage of the testing pipeline introduces only small error rates (7). Still, it is of fundamental importance to accurately estimate the errors introduced by dilution effects, since a good understanding of the error is crucial to allow for any reliable inference in a disease study (35).
3. Sample extraction methods may have varying yield in virus material: this yield depends on the tissue or fluid that is sampled and on the processing of the sample, such as the time between sampling and qRT-PCR or the temperature at which the sample is held. One would expect this sample extraction to mostly have a destructive effect and thus to inflate false negative rates rather than false positive rates.
4. The establishment of gold-standard disease labels on samples that were also tested with NAAT is of fundamental importance to assess the overall accuracy of the classifier. There is little such data for COVID-19 testing right now. To this end, a recent study analyzed the positive test result rate of qRT-PCR tests on COVID-19 patients identified based on symptoms, where the symptom-based diagnosis served as ground truth (36). It found false negative rates of individual tests of around 11-25% on sputum samples.
At the same time, false positive rates are hard to estimate in the current situation, in which non-symptomatic infections occur at an unknown frequency and reference gold-standard labels for positive observations that are non-symptomatic are lacking. However, as sample collection likely contributes little to false positive rates, the overall false positive rate of a group test would largely depend on the qRT-PCR stage, for which there is reason to believe that it should be small. Some previous studies on the use of PCR for similar infectious diseases such as SARS-like viruses, as well as for SARS-CoV-2, reported high sensitivity for PCR (34,37). Indeed, in the absence of cell culture methods, qRT-PCR tests are considered to be the gold standard for the identification and diagnosis of most pathogens.
The importance of the estimates described above led to a recent collaborative effort between FIND, a Swiss foundation for diagnostics, and the World Health Organization to evaluate qRT-PCR tests for the COVID-19 pandemic and to assess their accuracy (38). FIND is currently evaluating a list of more than 300 commercially available SARS-CoV-2 tests and establishing accurate estimates for sensitivity and specificity with their respective confidence intervals 7 . Based on the preliminary findings, in this work we will assume that the specificity of a single PCR test is 99%. For the sensitivity, we will mostly assume a value of 99% as well, but we also explore the impact of lower values to account for potential dilution effects in the tables presented in the Appendix.
A common thread in the various aspects discussed in this section is the large variety of relevant parameters, due to differences between testing scenarios and to uncertainty caused by infected individuals without symptoms. In this note, we aim to illustrate that the test design of choice should very much depend on these parameters to make the best use of the testing capacities. We will provide a numerical comparison between different designs for large classes of parameters, such as the sensitivity, specificity, and the expected number of tests per person, so that the design can be continually adapted to the current best estimates of, e.g., the infection rate and the sensitivity.
Before discussing our numerical results, we will precisely introduce the relevant design parameters and testing strategies in the next section.

Terminology
We start by introducing some terminology.
• Prevalence p: This is the assumed infection rate of the population that is going to be tested, that is, the fraction of the population that is infected. Hence, it is also the probability of infection for a randomly selected individual. For simplicity of notation, we will write q = 1 − p for the probability that a randomly selected individual is negative. When the test subjects can be divided into groups with different fractions of infected subjects, we also speak of the prevalences of these subgroups. Without further specification, however, the term refers to the full population to be tested.
• Number of stages: This denotes how many steps the method performs sequentially; these steps are characterized by the fact that each stage requires the results from the previous one. In this paper, we will study adaptive methods with up to three stages, even though more stages, usually up to four in the case of infectious diseases, can be used (30).
• Divisibility: This refers to the maximal number of tests that can be performed on a given specimen. This number provides a limitation on how many group tests that include the corresponding test subject can be performed, in parallel or in different stages.
• Group size k: This is the size of the groups that are used in a pooling scheme. For a testing strategy to be feasible, one needs to ensure that the maximal group size k still allows for reliable detection of a single positive in a pool of size k.
• Sensitivity S_e: This is the probability that an individual test correctly returns a positive result when applied to a positive specimen or pool. A priori, this probability can differ depending on the number of positives included, for example due to dilution effects (35,39,40), but we will neglect this important distinction for the mathematical description below and assume that a PCR test has a fixed sensitivity independent of the pool size. Analogously, for a pooling strategy X, S_e(X) is the probability of the whole method X returning a positive result for a positive specimen.
• Specificity S_p: This is the probability that an individual test correctly returns a negative result when applied to a negative specimen or pool. Again, we assume that a PCR test has a fixed specificity independent of the pool size. In case dilution effects are taken into account and more specific information on how the sensitivity or specificity changes with the pool size k is available, one should write S_e and S_p with a dependency on k. Analogously, for a pooling strategy X, S_p(X) is the probability of the method X returning a negative result when a negative specimen is tested.
• Expected number of tests per person E: We consider the expected number of tests per person as a measure of efficiency. Naturally, the expected number of tests per person of a method depends on the prevalence p as well as on S_e and S_p, but also on the design parameters, such as the group size k and the number of stages. We will write E(X) to denote the expected number of tests per person for a method X, without explicitly indicating its dependence on these parameters for the sake of notational simplicity. There are recent discussions in the literature about alternative objective functions which directly take into account the effects on specificity and sensitivity. The findings, however, show that such an alternative choice most often does not affect the optimal group testing configuration (41).
The optimal choice of design will depend on the aforementioned parameters. In section 4, we will explore these dependencies numerically.
There is also some theory on the optimal design choice and the necessary number of tests. An argument given by Sobel and Groll (42), based on the seminal works by Shannon and Huffman (43,44), provides a theoretical lower bound for the expected number of tests per individual of any given group testing method. More precisely, they showed that E(X) ≥ −p log p − q log q must hold for any method X with S_e(X) = S_p(X) = 1. In addition to its theoretical interest, this bound pragmatically indicates how much further improvement might still be possible. Note that it is only a bound, which may very well not be achievable with practically feasible methods. Figures 5, 6 illustrate how the methods discussed here compare to this bound and how much gain one could expect for any large-scale group testing strategy. Regarding the influence of the infection rate, it has been established by Ungar that for infection rates p ≥ (3 − √5)/2 ≈ 38%, the optimal pool size is 1, so there does not exist a group testing scheme that is better than individual testing (45). On an intuitive level, one may also expect that the higher the prevalence, the higher the expected number of tests should be. In fact, Yao and Hwang proved that the minimum of the expected number of tests over all possible test strategies is non-decreasing in p for p < (3 − √5)/2 (46). Therefore, in the COVID-19 pandemic, where the prevalence in most countries, both among the tested individuals and in the entire population, is clearly believed to be smaller than the threshold provided by Ungar's theorem, one can expect a significant reduction in the average number of tests by employing suitable group testing methods. In the following subsection, we will discuss some of these methods and their mathematical formulation.
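The counting bound and Ungar's threshold are easy to evaluate numerically. The sketch below uses the binary logarithm, matching the binary test outcomes; the function names are ours.

```python
import math

def counting_bound(p):
    """Information-theoretic lower bound (Sobel & Groll) on the expected
    number of tests per person for any error-free group testing method:
    the binary entropy -p*log2(p) - q*log2(q) of an individual's status."""
    if p in (0.0, 1.0):
        return 0.0
    q = 1 - p
    return -p * math.log2(p) - q * math.log2(q)

ungar = (3 - math.sqrt(5)) / 2   # ~0.382: above this, pooling cannot help
print(f"Ungar threshold: {ungar:.3f}")
for p in (0.001, 0.01, 0.1):
    print(f"p = {p}: at least {counting_bound(p):.3f} tests per person")
```

At p = 0.01, for instance, no error-free scheme can use fewer than about 0.08 tests per person, which quantifies the remaining headroom of the methods compared in Figures 5, 6.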

Standard Group Testing Methods
In this subsection, we will recall some standard methods for group testing that we will numerically explore in the following section. An overview of these methods and their mathematical formulation can be found in the book by Kim and colleagues, while their mathematical derivation was published by Johnson et al. (47,48).

2-Stage Hierarchical Testing (D2)
Dorfman's method is an adaptive method which, in a first stage, tests each individual as part of a group of size k (2). Then, in the groups that tested positive, all the individuals are tested again individually in a second stage. Consequently, the method requires divisibility 2. The probability of a pool of size k, drawn at random from the population, to test positive is

P_k = S_e(1 − q^k) + (1 − S_p)q^k,

the expected number of tests per person of the method is given by

E(D2) = 1/k + P_k,

and its sensitivity and specificity are

S_e(D2) = S_e^2,    S_p(D2) = 1 − (1 − S_p)[S_e(1 − q^(k−1)) + (1 − S_p)q^(k−1)].

A slight improvement of Dorfman's method is possible by omitting one of the individual tests per pool in the second stage and only performing it in a third stage when at least one of the other second-stage tests of that pool has a positive result, exploiting that if all test results in the second stage are negative, the last specimen must be infected for the group test to be valid (42). A more significant modification was proposed by Sterrett (3). In his method, the second stage is modified by performing individual tests only until the first positive is found. Then, a pooling procedure similar to the first stage is performed for the remaining, still unlabeled, specimens, and this scheme is repeated until all specimens are labeled. While requiring a smaller number of tests per individual on average, especially for small infection rates (19), the number of stages that need to be performed sequentially is not known a priori and may be very high. As such, Sterrett's method is more involved in practice, while D2 is a simple and straightforward procedure. Thus, the latter is often preferred in applications, which is also why we will perform the simulations for the original form of D2 in this paper.
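The operating characteristics of D2 translate directly into code. The sketch below assumes, as in the rest of this section, that S_e and S_p do not depend on the pool size; the function name is ours.

```python
def d2_characteristics(p, k, Se=0.99, Sp=0.99):
    """Expected tests per person, sensitivity, and specificity of
    Dorfman's two-stage method D2 with pool size k at prevalence p,
    assuming Se and Sp are independent of the pool size."""
    q = 1 - p
    P_k = Se * (1 - q ** k) + (1 - Sp) * q ** k   # pool tests positive
    E = 1 / k + P_k                               # 1 pool test per k people
    sens = Se ** 2                                # both stages must detect
    # a negative individual is misclassified iff its pool is flagged
    # (truly positive with prob 1 - q^(k-1)) and its own test fails
    spec = 1 - (1 - Sp) * (Se * (1 - q ** (k - 1)) + (1 - Sp) * q ** (k - 1))
    return E, sens, spec

E, sens, spec = d2_characteristics(p=0.01, k=10)
print(f"E = {E:.3f}, sensitivity = {sens:.4f}, specificity = {spec:.4f}")
```

Note that with an imperfect assay the sensitivity of the whole procedure drops to S_e^2 ≈ 98% while the specificity improves beyond that of a single test, since a false positive must occur twice in a row.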

3-Stage Hierarchical Testing (D3)
In this method, each individual is tested as part of a pool of size k in the first stage. Every pool that tests positive is then split into subgroups, which are tested in a second stage. Every member of a subgroup with a positive result in the second stage is tested individually in a third stage. Consequently, this method requires divisibility 3. In this paper, we will focus on the case that all subgroups are of size s. The expected number of tests per person, sensitivity, and specificity of this method are given by

E(D3) = 1/k + (1/s)[S_e(1 − q^k) + (1 − S_p)q^k] + S_e^2(1 − q^s) + S_e(1 − S_p)q^s(1 − q^(k−s)) + (1 − S_p)^2 q^k,

S_e(D3) = S_e^3,

S_p(D3) = 1 − (1 − S_p)[S_e^2(1 − q^(s−1)) + S_e(1 − S_p)q^(s−1)(1 − q^(k−s)) + (1 − S_p)^2 q^(k−1)].

A schematic comparison between the hierarchical tests with two and three stages, D2 and D3, is given in Figure 2.
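The expected-cost expression for D3 decomposes stage by stage: 1/k tests per person in the first stage, one subgroup test per s members of a flagged pool in the second, and an individual test whenever both of a person's pools are flagged. The following illustrative helper (our naming) evaluates this decomposition.

```python
def d3_expected_tests(p, k, s, Se=0.99, Sp=0.99):
    """Expected tests per person for three-stage hierarchical testing D3
    with first-stage pools of size k split into subgroups of size s."""
    q = 1 - p
    stage1 = 1 / k
    # subgroup tests: (k/s per pool) whenever the stage-1 pool is flagged
    stage2 = (Se * (1 - q ** k) + (1 - Sp) * q ** k) / s
    # individual test: pool AND subgroup flagged, split by the true status
    # of the subgroup and of the rest of the pool
    stage3 = (Se ** 2 * (1 - q ** s)
              + Se * (1 - Sp) * q ** s * (1 - q ** (k - s))
              + (1 - Sp) ** 2 * q ** k)
    return stage1 + stage2 + stage3

print(f"{d3_expected_tests(p=0.01, k=25, s=5):.3f} tests per person")
```

At 1% prevalence with k = 25 and s = 5, this evaluates to roughly 0.14 tests per person, noticeably below the two-stage figure for the same prevalence, at the price of divisibility 3 and one more sequential stage.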

Array Testing (A2)
This is a 2-stage method, originally proposed by Phatarfod and Sudbury and later explored by Kim et al. (47,49,50), that tests every individual twice in a first stage, as part of two different groups of size k. In a second stage, all the individuals for which both group test results are positive are tested individually. Consequently, this method requires divisibility 3. A schematic overview of array testing for different scenarios is given in Figures 3, 4.
Precisely determining the optimal way to assemble the pools is rather non-trivial, see, e.g., the publication by Kim et al. (47), but the following configuration provides a good trade-off between simplicity and the expected number of tests. At first, k^2 specimens are arranged in a k × k array; then every row and every column is pooled and subjected to a group test. This ensures that each specimen is tested exactly twice as part of a group of size k and constitutes the unique intersection of these two pools. For S_p = S_e = 1, it is sufficient to only test a person individually if both its row and column tests return positive results. In this case, one obtains the following formula for the expected number of tests per person:

E(A2) = 2/k + 1 − 2q^k + q^(2k−1).

If S_e or S_p differ from 100%, the first stage may yield positive rows without any positive columns or vice versa. In this case, it makes sense to test every member of such a row or column individually (47,51). This results in a slight increase in sensitivity at the expense of a slight increase in the expected number of tests per person. As this change makes the formulas much more involved, we omit them here and refer to the corresponding literature (47). A2 can be generalized to procedures with three or more simultaneous pools. In this case, the pools could be assembled, for example, by creating pools along the diagonals and/or the antidiagonals 8 of an array, in addition to rows and columns (51). An advantage of such approaches is that the group tests for all these pools can be performed in parallel, which can lead to faster test results, but one has to take into account that the sensitivity decreases with the number of pool tests per individual.
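Under the perfect-assay assumption, the A2 cost follows from the retest probability: an infected individual is always retested, while a negative one is retested only if both its row and its column contain another positive. The helper below (our naming) evaluates the resulting expression.

```python
def a2_expected_tests(p, k):
    """Expected tests per person for k-by-k array testing A2, assuming a
    perfect assay: 2/k pool tests plus the probability of an individual
    retest, p + q*(1 - q^(k-1))^2 = 1 - 2q^k + q^(2k-1)."""
    q = 1 - p
    return 2 / k + 1 - 2 * q ** k + q ** (2 * k - 1)

print(f"{a2_expected_tests(p=0.01, k=10):.3f} tests per person")
```

The closed form 1 − 2q^k + q^(2k−1) is simply the expansion of p + q(1 − q^(k−1))^2, i.e., "infected" plus "negative but in a positive row and a positive column".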
The method above can be extended to higher-dimensional procedures, i.e., j ≥ 3 simultaneous pooling dimensions, and a connection to optimally efficient two-stage methods can be established. Note that these arrays have size k^j, rendering this approach practically infeasible very quickly as j and k increase. More concretely, Berger et al. (52) showed that if the prevalence is p = 0.01, then an (almost) optimally efficient two-stage method can be achieved by j = 6 and k = 74, i.e., a 6-dimensional array with side length 74. However, the population in this case would need to contain 74^6 ≈ 164 billion individuals to be screened, which is impractical for any real-world problem. Thus, the quest for methods that use the same principles but are effective for realistic population sizes remains open.
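The blow-up described by Berger et al. can be verified directly (a small check of our own; the function name is ours):

```python
def array_requirements(j, k):
    """Population size needed to fill a j-dimensional array of side
    length k, and the first-stage cost of j pool tests per specimen."""
    return k**j, j / k

population, tests_per_person = array_requirements(6, 74)
# population == 74**6, i.e., about 1.64e11 specimens for a single array
```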

Non-adaptive Array Testing (A1)
All the group testing methods discussed so far terminate with an individual test for all specimens with positive test results in all previous stages, to avoid false positives that are based only on the choice of the pools. In a situation with a shortage of test components, there may be scenarios where one is willing to accept a significant number of additional false positives as a means to reduce the expected number of tests and simplify the test design; in particular, it is desirable to perform all the different tests necessary for a testing procedure in parallel.
Toward this goal, one may consider replacing the last stage of individual tests in an adaptive procedure by an additional pooling dimension to be performed in parallel, hence transforming the adaptive into a non-adaptive method.
When this adaptation is applied to the Dorfman method, one obtains a procedure A1_2 that is identical to the first stage of A2. When applied to A2, this yields a method A1_3 of three parallel pool tests per specimen, again without a decisive individual test at the end. By design, the resulting methods have a significantly lower specificity, but lead to a reduction in necessary tests. An additional advantage is that the resulting methods are fully non-adaptive and can be performed in a single testing stage, allowing for faster test results. At the same time, the adaptation from the methods D2 and A2 affects neither the required divisibility nor the sensitivity of the resulting procedure, as adding an additional pooling dimension is accompanied by omitting the last stage; one is really just trading specificity for a lower number of tests and non-adaptivity.
Hence, a suitable decision parameter is the minimal acceptable specificity. By the trade-off just mentioned, this also implicitly determines the group size and hence the expected number of tests per person: A1_j performs j pool tests for every group of k specimens, i.e., E_{A1_j}(k) = j/k tests per person, where j = 2, 3, while the specificity is governed by the probability that all j pools containing a given negative specimen test positive. It is important to note that such tests can only be used when a certain false positive rate can be accepted. If a non-adaptive method with perfect detection of positive individuals, i.e., assuming perfectly accurate RT-PCR, is required, a theoretical result by Aldridge shows that no testing strategy is better than individual testing (53). Also, in contrast to the adaptive tests discussed above, the minimal expected number of tests per person alone is not a viable measure for the optimal choice of the group size k, as it would yield a strong bias toward tests with many false positives. For the remainder of this work, the threshold for the minimal acceptable specificity is set to 95%. Nevertheless, we will give a short comparison with presets of 90 and 97% in section 4.
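The trade-off can be made quantitative under the simplifying assumption that the j pools containing a given specimen test positive independently; in reality the pools overlap in their remaining members, so the following is only a rough sketch (the function names and the independence approximation are ours, not the paper's exact computation):

```python
def a1j_specificity(p, k, j, se=0.99, sp=0.99):
    """Approximate specificity of the non-adaptive method A1_j.

    A negative specimen is falsely flagged iff all j pools containing it
    test positive; the pools are treated as independent (approximation).
    """
    q_rest = (1.0 - p) ** (k - 1)              # rest of the pool negative
    pool_pos = se * (1.0 - q_rest) + (1.0 - sp) * q_rest
    return 1.0 - pool_pos**j

def largest_group_size(p, j, min_sp=0.95, k_max=16):
    """Largest k <= k_max meeting the specificity preset; the method
    then needs j/k tests per person."""
    feasible = [k for k in range(2, k_max + 1)
                if a1j_specificity(p, k, j) >= min_sp]
    return max(feasible) if feasible else None
```

For p = 1%, both A1_2 and A1_3 can use the full group size of 16 under the 95% preset in this approximation, at 2/16 and 3/16 tests per person, respectively.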

Extension to the Informative Case
As described in section 2, it is possible to incorporate prior information such as demographic, clinical, spatial, or temporal knowledge into refined estimates for the prevalence and to stratify the population accordingly, reflecting the heterogeneous distribution of the infected individuals. This heterogeneity, first explored by Nebenzahl and Sobel (54), and Hwang (55), can be exploited for refined GT strategies.
From a mathematical point of view, informative tests are somewhat more challenging to analyze (56)(57)(58)(59). To illustrate the findings of the analysis of informative tests and demonstrate their relevance for SARS-CoV-2 testing, we will work with a scenario where two distinct subpopulations, one with a high prevalence p_high (e.g., HCWs) and another, larger subpopulation of individuals with low prevalence p_low (e.g., representative samples of the general population), are to be tested. As shown, for example, by Bilder and Tebbs (60), informative testing reduces the expected number of tests per individual even further when compared with the corresponding non-informative counterparts. As they argue, it is crucial to exploit this heterogeneity and employ an efficient mixing strategy of individuals from both subpopulations to form the pools. Our work takes a different perspective on how to exploit such strategies, as will be discussed in the next section: it sheds light on testing methodologies where as many individuals as possible should be tested with the available tests, subject to the constraint of constantly testing high-risk individuals such as HCWs.

NUMERICAL RESULTS
In this section, we will numerically explore different design choices in group testing for SARS-CoV-2. A key tool is the R package binGroup for identification and estimation using group testing, which features the computation of optimal parameter choices for standard group testing algorithms 9 (61). We have complemented this package with a repository of source code for parallel computation and comparative visualization that has been used to create all the graphics in this section and is available for the reader to produce visualizations adapted to different prevalence ranges of interest 10 .
As indicated in the previous section, the choices of the correct method and the optimal group size k heavily depend on several constraints, most importantly the underlying prevalence p (or the subpopulation prevalences for a refined model). In this work, instead of attempting to find the optimal method, we evaluate the properties of a group testing design for a single fixed group size. We will investigate different infection scenarios with the different group testing methods described above. We apply the tests D2, D3, A2, and A1_j with overall prevalence varying from 0.25 to 15%. The results for D2, D3, and A2 have been simulated using binGroup2, while A1_j has been implemented separately 11 .
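As a point of reference for the simulated results, the textbook cost of two-stage Dorfman testing under perfect tests can be written down in closed form (our own illustration, independent of the binGroup2 implementation; names are ours):

```python
def expected_tests_per_person_d2(p, k):
    """Two-stage Dorfman (D2), perfect tests: one pool test per k
    specimens, plus k individual tests whenever the pool is positive."""
    return 1.0 / k + 1.0 - (1.0 - p) ** k

def best_group_size_d2(p, k_max=16):
    """Best fixed group size under the practical constraint k <= k_max."""
    return min(range(2, k_max + 1),
               key=lambda k: expected_tests_per_person_d2(p, k))
```

For p = 1%, the optimal constrained group size is 11, with about 0.196 tests per person, i.e., roughly 5 individuals per available test.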
An important aspect to take into account when putting the number of individuals tested per available test into perspective is that methods based on multiple pools or stages will typically have a smaller overall sensitivity than individual tests, cf. section 3.2. It is crucial to integrate the sensitivity considerations into any pooling strategy (40). In Tables A2-A4, we illustrate the (potential) efficiency increase assuming a sensitivity of 99 and 90%, respectively, for the qRT-PCR test. As mentioned before, extensive tests are currently being performed to confirm the high accuracy of qRT-PCR for SARS-CoV-2 testing. Indeed, they indicate that many available PCR procedures for SARS-CoV-2 testing show a sensitivity of or close to 100% (62). Nevertheless, an appropriate quantitative understanding of the effects of pooling and viral load progression on the sensitivity is still under active discussion (63).

FIGURE 5 | (A) The number of individuals that can be tested per test available for the different adaptive methods. Here, the sensitivity and specificity are assumed to be 99%. The theoretical bound given by (42) is also shown for comparison and the maximum group size is assumed to be 16. (B) Zoomed version of (A) that illustrates the low prevalence regime of infection rates up to 2%.
For a PCR sensitivity of 99%, we observe that the reduction caused by the use of a pooling method is very small (97% for D3, A2, and A1_3; 98% for D2 and A1_2). Only a single PCR procedure showed a low sensitivity of 90% when choosing a specific gene target (compared to 100% when choosing another target) (62). In that case, we find a sensitivity of 73% for D3, A2, and A1_3 and 81% for D2 and A1_2. While the specificity of PCR already appears to be close to 100%, the tables indicate that D2, D3, and A2 improve the specificity even further, while A1_2 and A1_3 fulfill the preset threshold S_p(·) ≥ 95%. Due to the specificity constraint, A1_2 cannot be recommended for very high infection rates of 12% or more, as there is no reduction of necessary tests over individual testing. A1_3 is more robust but shows the same behavior at p > 15%.
S_e(·) and S_p(·) depend mostly on the method and the underlying sensitivity S_e of the qRT-PCR test and barely change for increasing p. Therefore, Table A5 shows the change of S_e(·) and S_p(·) for p = 3% and varying S_e. It should be noted that the sensitivity S_e(·) is virtually independent of the specificity S_p of PCR; only a slight change in initial group size can be detected. As explained in section 3, the sensitivity can be computed as S_e(D2) = S_e(A1_2) = S_e^2 and S_e(A2) ≈ S_e(D3) = S_e(A1_3) = S_e^3. To reflect practical considerations such as dilution effects (7), we constrain the group size to at most sixteen 12 . We observe that all the methods yield a significantly reduced expected number of tests per person as compared to individual testing. This improvement decays with growing infection rate, in line with our discussion above. For prevalence values below 4%, and hence including the estimated range of current infection rates for SARS-CoV-2 in different countries 13 , all adaptive methods (D2, D3, A2) allow testing at least 3 times as many individuals with the same number of tests. Around a prevalence of 3%, both non-adaptive methods allow testing around 5 individuals per test if a false positive rate of up to 5% can be accepted.

12 Since writing the article, further publications demonstrate the feasibility of pooling specimens for even larger pool sizes of up to 30 (64).
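The composed sensitivities quoted above (98 and 97% for a 99% accurate PCR; 81 and 73% for the 90% case) follow from the number of tests a true positive specimen must pass; a minimal check (our own snippet):

```python
def overall_sensitivity(se, tests_per_positive):
    """A true positive must be detected in every one of its pool and
    individual tests, so the single-test sensitivities multiply."""
    return se**tests_per_positive

# Two tests per specimen (D2, A1_2) vs. three (D3, A2, A1_3):
assert round(overall_sensitivity(0.99, 2), 2) == 0.98   # ~98%
assert round(overall_sensitivity(0.99, 3), 2) == 0.97   # ~97%
assert round(overall_sensitivity(0.90, 2), 2) == 0.81   # ~81%
assert round(overall_sensitivity(0.90, 3), 3) == 0.729  # ~73%
```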
Compared to individual testing, where only a single individual can be tested per available test, Figures 5, 6 demonstrate the average number of individuals that can be tested per available test when applying the different group testing methods. For infection rates as high as 2%, up to 5 times as many individuals can be tested with a given number of tests using adaptive methods. For a low prevalence below 0.5%, this number varies between a 7- and 15-fold efficiency increase. Figure 6 shows the efficiency improvement of A1_j compared to the corresponding adaptive method. The specificity reduction, the biggest drawback of the proposed non-adaptive methods, is controlled by setting the threshold to 90, 95, and 97%. Naturally, the methods relying on the lowest threshold show the biggest improvement. The suggested threshold of 95% leads to a significant improvement of A1_2 compared to D2 for an infection rate between 0.4 and 5%. A1_3 significantly exceeds A2 and D3 for a prevalence between 2.5 and 5%. This is exemplified by some numerical examples in Table A1; for instance, for an infection rate of 0.4%, the city of Munich with 1.47 million inhabitants could be tested with only 141 thousand tests using D3, and the 6.69 million inhabitants of Rio de Janeiro could be tested using around 1 million tests if the infection rate does not exceed 1% and the adaptive methods D3 or A2 are performed. If a false positive rate of up to 5% is considered acceptable, the non-adaptive method A1_2 would only require 836,000 tests and at the same time allow for higher prevalence values of up to 1.5%.

FIGURE 6 | The number of individuals that can be tested per test available for different non-adaptive and their corresponding adaptive methods. A1_j (9X) denotes the non-adaptive method A1_j with a specificity threshold of 9X%. Here, the sensitivity and specificity of qRT-PCR are assumed to be 99%. The theoretical bound given by (42) is also shown for comparison and the maximum group size is assumed to be 16.
To summarize, below a 1% infection rate, any of the presented group testing procedures constitutes an extreme improvement over individual testing, with D3 showing the best performance. For 1% ≤ p < 6%, A2 and D3 show a comparable performance which is superior to D2. For p ≥ 10%, all adaptive methods show a similar performance. Among the non-adaptive methods, A1_2 requires a significantly reduced expected number of tests for an infection rate between 1 and 4%. For a prevalence between 3 and 8%, A1_3 shows the highest reduction in the number of tests of all methods. However, the trade-off between the lowest number of tests and a false positive rate of up to 5% has to be considered when choosing the testing method.
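To make the Munich figure above concrete, a perfect-test idealization of the three-stage Dorfman cost can be sketched as follows (our own back-of-the-envelope code; Table A1 additionally accounts for the 99% test accuracy, so its numbers differ slightly, and the group sizes k1 = 16, k2 = 4 are our illustrative choice):

```python
def expected_tests_per_person_d3(p, k1, k2):
    """Three-stage Dorfman (D3), perfect tests: pools of size k1,
    positive pools split into subpools of size k2, positive subpools
    resolved individually (k2 must divide k1)."""
    assert k1 % k2 == 0
    q = 1.0 - p
    return 1.0 / k1 + (1.0 - q**k1) / k2 + (1.0 - q**k2)

tests_needed = 1_470_000 * expected_tests_per_person_d3(0.004, 16, 4)
# on the order of 1.4e5 tests for all of Munich at 0.4% prevalence
```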
Next, we numerically explore the average number of tests of different approaches for informative testing, with the goal of finding the best way to incorporate refined knowledge about different prevalences for distinct subpopulations. Each plot of Figure A1 compares the expected number of tests per person of two informative testing methods, namely the approach of choosing pools separately for the subpopulations and the approach of assembling the pools with members of all subpopulations. We study a model with two subpopulations of different prevalence and consider prevalence values between 5 and 25% for the high-risk and between 0.1 and 5% for the low-risk group. As far as we are aware, this assumption of different prevalence values for two groups, in line with the two subpopulations we mention, was first made in the context of SARS-CoV-2 by Deckert et al. (13), who speak of homogeneous pools and use non-informative D2 for their analysis. However, the question of whether and how to adjust the testing procedure based on subpopulation knowledge did not arise in their work.
We find that for A2 and D3, the advantage of assembling combined pools from both subpopulations grows as the prevalence of the low-risk group decreases. The dependence on the prevalence of the high-risk group varies with the method and the constraints imposed on the group size. For D2, however, the same phenomenon was not observed. More experiments of the same type, with different group sizes as well as different sensitivities and specificities, can be visualized in our web application 14 .
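The two informative strategies compared in Figure A1 can be illustrated with a stripped-down, perfect-test Dorfman model (a sketch of our own, not the simulation code; the proportional-mixing rule and the shared fixed group size are simplifying assumptions):

```python
def d2_cost(pool_negative_prob, k):
    """Per-person Dorfman cost, given the probability that a pool of
    size k contains no positive specimen (perfect tests)."""
    return 1.0 / k + 1.0 - pool_negative_prob

def separate_pools(p_high, p_low, w_high, k):
    """Each subpopulation pooled on its own; w_high is the fraction of
    high-risk individuals in the test population."""
    c_high = d2_cost((1.0 - p_high) ** k, k)
    c_low = d2_cost((1.0 - p_low) ** k, k)
    return w_high * c_high + (1.0 - w_high) * c_low

def mixed_pools(p_high, p_low, w_high, k):
    """Every pool mixes both subpopulations proportionally."""
    n_high = w_high * k   # fractional count, for illustration only
    neg = (1.0 - p_high) ** n_high * (1.0 - p_low) ** (k - n_high)
    return d2_cost(neg, k)
```

In this restricted setting (the same fixed k for both strategies), a convexity argument shows separate pools are never worse, which is consistent with the observation that D2 does not benefit from combined pools; the gains reported for A2 and D3 stem from the multi-stage structure and per-strategy group-size optimization that this sketch deliberately omits.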

DISCUSSION
In this manuscript, we provide a comparison of general strategies for group testing in view of their application to medical diagnosis in the current COVID-19 pandemic.
Our numerical study confirms the recent observation that, even under practical constraints for pooled SARS-CoV-2 tests, such as restrictions on the pool size, and for prevalence values in the estimated range of current infection rates in many regions 13 , group testing is typically more efficient than individual testing, allowing for an efficiency increase of up to a factor of 10 across realistic scenarios and testing strategies. We also find significant efficiency gaps between different group testing strategies in realistic scenarios for SARS-CoV-2 testing, highlighting the need for an informed decision on the pooling protocol. The repository for parallel computation and comparative visualization accompanying this manuscript allows the reader to visualize the performance of the different approaches, similarly to the tables and graphics contained in this paper, for different sets of parameters 12 .
For every scenario and method, an optimal pool size can be determined. However, the pool size is constrained biochemically by dilution effects and by sensitivity considerations. For a low prevalence, this can prevent choosing the optimal pool size. We find that, within pooling protocols, sophisticated methods that employ multiple stages or multiple pools per sample, or that exploit prevalence estimates for subpopulations, have the strongest advantages at low prevalences.
Such low prevalence values are realistic assumptions especially for large-scale tests of representative parts of the population, so these methods are particularly suited for full population screens or representative sub-population screens with the goal of reducing transmission and flattening the infection curve. This is of fundamental importance since transmission before the onset of symptoms has been commonly reported and asymptomatic cases seem to be very common (65). For example, 328 of the 634 positive cases on board the formerly quarantined Diamond Princess cruise ship were asymptomatic at the time of testing, which corresponds to 52% of the cases. Another study, conducted in a homeless shelter in Boston, MA, USA, confirmed that standard COVID-19 symptoms like cough, shortness of breath, and fever were uncommon among individuals who tested positive, and strongly argues for universal PCR testing on that basis (66). Also, besides enhancing the testing of mild/asymptomatic cases, some disease control centers, such as the ECDC, recommend that group testing should potentially be applied in prevalence studies 15 .
The pooling schemes suggested here can also include routine tests of cohesive subpopulations with high prevalence, such as healthcare workers, and therefore propose a sensible way to include commonly available information about risk groups into the setup (67). For certain scenarios, our numerical experiments show a reduced expected number of tests when employing combined pools consisting of high-risk and low-risk individuals provided some estimates for the prevalence in these two parts of the test population are available.
One could also envision separate pooled tests with different requirements on specificity and population coverage in subpopulations with different prevalence, again highlighting the importance of proper stratification: High specificity is for example likely desirable among healthcare workers whereas specificity may be partially traded for coverage during contact tracing. At the heart of these trade-offs lie considerations about the societal cost of false positives in comparison to the cost of missed diagnosis because of a lack of available tests.
The improved test efficiency of group testing is, however, only one aspect of test design. Carefully tracing every single specimen throughout the whole process is of utmost importance. As this is already required for individual tests, the additional requirements for tracing pooled samples are rather minor and typically covered by the specimen registration in the laboratory information systems (LIS). Moreover, the FDA has published an amendment for pooling protocols which includes guidelines for the appropriate traceability/registration of the pooled samples 16 . From the IT point of view, sample tracing can be implemented, for example, via a hash file, which has proven successful for large-scale implementations of group testing, see (68).
Nevertheless, practitioners have to take several factors into account when deciding if group testing can provide a feasible solution for massive testing procedures (40). Some important practical considerations are time constraints, specimen conservation for multi-stage testing, and resource availability, as well as the actual execution of the test in the labs, such as variations in pipetting and sample collection. In particular, the decision at which stage the pooling takes place (pre-pre-analytical, pre-analytical, or analytical) is crucial for the expected turnaround time (69). All of these aspects need to be carefully considered before the establishment of massive pooled test policies.

15 Laboratory support for COVID-19 in the EU/EEA, European Center for Disease Prevention and Control. Available online at: https://www.ecdc.europa.eu/en/novel-coronavirus/laboratory-support (accessed April 28, 2020).

16 In Vitro Diagnostics EUAs - Molecular Diagnostic Tests for SARS-CoV-2, US Food and Drug Administration. Available online at: https://bit.ly/2QCxCIQ (accessed April 30, 2021).
qRT-PCR-based tests are currently widely deployed for COVID-19 diagnosis and, more generally, to identify current infections (37,70). As with any nucleic acid amplification test, one can only identify cases where virus particles can still be detected. Thus, for long-term disease monitoring, NAATs will have to be complemented by serological tests, as these can be used to infer the immunity state of a patient and hence identify past asymptomatic infections through detection of disease-specific antibodies. Such tests have already been deployed in a few cases (71,72). In contrast to the PCR testing procedures mainly discussed in this paper, the main intention of serological testing is to obtain accurate estimates of the number of unidentified previous infections as a measure of the progress toward herd immunity. Group testing can also be expected to yield accuracy gains for this problem; indeed, group testing for prevalence estimation is an active area of research with many recent advancements. Also, in settings like hospitals, nursing homes, or similar, the employment of rapid and massive testing may be superior for overall infection control compared to less frequent, highly sensitive tests with prolonged turnaround times. Therefore, pooling strategies for antigen tests or other point-of-care tests should also be considered in this scenario, and we are confident that some of these results can be employed once pooled tests become available (17,73). In any case, there are still many well-established methodological tools in the literature that have not yet been explored for SARS-CoV-2 testing, so we advocate for a continued exchange between theory, simulation and visualization, and practice.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: Source code is available on Gitlab: https://gitlab.com/hararticles/grouptesting-simulations.

AUTHOR CONTRIBUTIONS
CV: conceptualization (manuscript), methodology, validation, formal analysis, investigation, writing-original draft, review and editing, and visualization. TF: conceptualization (manuscript), methodology, software, validation, formal analysis, investigation, writing-original draft, review and editing, and visualization. PH, DE, and JB: methodology, software, validation, formal analysis, investigation, data curation, writing-review and editing, and visualization. DF: validation, investigation, writing-original draft, and review and editing. PG: conceptualization (project), supervision, and project administration. FT: supervision and project administration. FK: conceptualization (project and manuscript), supervision, and project administration. All authors contributed to the article and approved the submitted version.