Gaps in DNA-Based Biomonitoring Across the Globe

DNA-based methodology has proven to be a vital tool for ecosystem assessment and monitoring. Increasingly, high-throughput approaches such as DNA metabarcoding are being used to address more complex questions, including ecological network analyses through machine learning. Despite the technological advances that allow such questions to be posed, there remain inherent limitations in studies utilizing DNA metabarcoding, relating to the type of environmental sample targeted, geographical coverage, and the lack of standardised field and laboratory procedures. Additionally, DNA reference databases lack information for many taxa, resulting in unidentified sequences and the underrepresentation of some taxa. These issues need to be addressed to enable a more representative approach to ecosystem monitoring that allows for the detection and monitoring of global ecosystem change.

Better determining the global effects that the changing climate and anthropogenic damage have on the planet's ecosystems requires a more complete understanding of global biodiversity than currently exists. However, this has been extremely difficult to ascertain and standardize due to the large number of taxa and the diversity of geographic localities. More confounding is the reality that these natural and man-made changes are increasingly reshaping global biodiversity and the associated ecosystem processes and services it provides (Díaz et al., 2015; Bohan et al., 2017). Unfortunately, to date, scientists studying the connections between biodiversity and ecosystem change in specific ecosystems have been poorly equipped to measure these relationships, and have tended to rely on taxonomic identities and biomonitoring indicators collected from other, perhaps distant, areas, which may or may not be appropriate or accurate choices (Bohan et al., 2017).
DNA metabarcoding utilizes bulk samples such as soil, water, and benthos to extract DNA (termed environmental DNA, eDNA) and generate sequence data for standard taxonomic marker genes (e.g., DNA barcodes) via high-throughput sequencing (Porter and Hajibabaei, 2018b). By streamlining and scaling up the generation of biodiversity data, DNA metabarcoding makes it possible to assess the status of biodiversity associated with ecosystem change more extensively across a wide range of global ecosystems (Ruppert et al., 2019). The approach is cost-effective, easy to implement, and provides a robust and comprehensive dataset of taxa from environmental samples, making DNA metabarcoding an important tool of choice for future fundamental research and large-scale biodiversity monitoring programs (Zinger et al., 2019). Moreover, DNA metabarcoding provides an important component to be used with the ecological network analyses and machine learning algorithms that are rapidly advancing to enhance the capacity to detect global ecosystem change through biodiversity assessment (Bohan et al., 2017; Cordier et al., 2019). The complex relationships between changes in nodes and links, and their impact on ecosystem functions, should be understood at the network level if we are to develop more robust biomonitoring (Bohan et al., 2017). That said, there are still various barriers that need to be overcome in order to accurately and effectively detect such global ecosystem change, regardless of how quickly these technologies and analyses advance.
DNA metabarcoding has been used to assess eukaryotic and prokaryotic communities and to answer ecological questions, such as identifying soil microbiome communities associated with nitrogen-fixing tree species in secondary tropical forests (McGee et al., 2019), assessing bioindicators of river health through macroinvertebrate biomonitoring (Hajibabaei et al., 2011; Dowle et al., 2016) and investigating the effects of oil spills on coastal biodiversity (Xie et al., 2018). Robust experimental design is vital to ensure reproducibility and the ability to draw sound ecological conclusions from the data (Fahner et al., 2018; Zinger et al., 2019). Type I and Type II errors (false positives and false negatives, respectively) are common in DNA-based biomonitoring; to minimize them, the sampling design first needs to be effective at capturing the full taxonomic diversity or the ecological processes being investigated (Zinger et al., 2019). Secondly, the laboratory and bioinformatic workflow should be optimized to reduce sampling, extraction, amplification, or sequencing bias (Fahner et al., 2018; Ruppert et al., 2019; Zinger et al., 2019). For detecting biodiversity changes, both the taxonomic reference database (for taxonomic annotation of sequences) and the environmental sample type (as a proxy for biodiversity) need to be efficient and suitable for detection of target taxa (Ruppert et al., 2019). Geographic variability of environmental sample types also needs to be taken into consideration to provide the most inclusive representation of taxa, which is vital for detecting biodiversity change within different ecosystems.
Ecological network analyses are becoming an increasingly popular approach to study how ecosystems respond to change and the functional implications of these responses. Typically, network analyses are able to link together species indicators, gathered via DNA metabarcoding and other methods, and functions/interactions to represent a totality of nodes as an ecosystem model (Bohan et al., 2017; Laroche et al., 2018). Network structures can elucidate environmental shifts from stable ecosystem states (Beisner et al., 2003; Bohan et al., 2017; Derocles et al., 2018) through changes that occur in species composition and manifest in an ecological network. These ecological network analyses can potentially explain, and possibly predict, why stable states in ecology can persist over a period of time (Scheffer et al., 2001; Beisner et al., 2003; Bohan et al., 2017), in order to aid advancements in global biomonitoring. Network analyses, combined with machine learning algorithms, provide a standardized and sensitive method at a high resolution to foster a general understanding of the current state of ecosystem function across the globe (Vacher et al., 2016; Bohan et al., 2017; Derocles et al., 2018).
However, even if we advance the technologies behind these network and machine learning methods, the reference databases for taxonomic identification, sample type, and geographical location remain the most influential limitations to advancing an understanding of detecting global ecosystem change. Next-generation biomonitoring involves the isolation of DNA from samples including freshwater (Valentini et al., 2016; Muha et al., 2017; Harper et al., 2019), salt/brackish water (Lobo et al., 2017; Aylagas et al., 2018; Hansen et al., 2018), benthos (Hajibabaei et al., 2011; Turner et al., 2015; Aylagas et al., 2016; Robinson et al., 2019; Salonen et al., 2019), soil (Andersen et al., 2012; Yoccoz et al., 2012; Fahner et al., 2016; McGee et al., 2019), permafrost (Bellemain et al., 2013; Zielińska et al., 2017; Zimmermann et al., 2017), passive biomass collection efforts such as malaise traps (Morinière et al., 2016; Adamowicz et al., 2019), and more recently air (Kraaijeveld et al., 2015; Ferguson et al., 2019). Within these different types of environmental samples, there are taxa which are either unique to a particular sample type or can be detected across a breadth of environments, which ultimately influences the ecological questions that can be addressed with each type of environmental sample (Ruppert et al., 2019). A brief Web of Knowledge hit search covering the last 5 years (2015-2019), using search terms for geographic region and for sample type (i.e., water, soil, benthos) to show where various sample types are commonly collected, suggested that samples may be substantially lacking in various geographical regions. Overall, tropic* returned the greatest number of hits for environmental DNA/eDNA/metabarcoding studies (n = 319), followed by Arctic/Antarctic/polar (n = 262), and then temperate (n = 188; Table 1). What this brief hit search does not highlight is the lack of geographic coverage within some geographic regions.
For example, despite temperate returning the fewest hits for environmental DNA/eDNA/metabarcoding studies, its range of sample localities is wider than that of both the tropics and the Arctic/Antarctic/polar regions. Studies returned for the temperate region include localities such as Asia, the United Kingdom, Canada and France, whereas for the Antarctic, for example, the studies are concentrated around remote field stations on the Antarctic Peninsula. In terms of sample type, soil environmental DNA/eDNA studies return more hits in temperate locations, whereas permafrost and benthos/sediment return a greater percentage of hits from Arctic/Antarctic/polar regions (Table 2; Figure 1). Water, river/stream/pond/lake and seawater/marine return a relatively even percentage of hits across the three geographic regions (Table 2; Figure 1).
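The hit search described above can be illustrated with a minimal sketch. The exact search strings are not given in the text, so the query layout, the term lists, and the helper names `any_of`, `build_query`, and `hit_shares` are assumptions for illustration only; the three hit counts are the ones reported for Table 1. The sketch combines term groups with the Boolean operators "OR" and "AND" and converts the reported counts into percentage shares.

```python
# Hit counts reported in the text for environmental DNA/eDNA/metabarcoding studies.
REPORTED_HITS = {"tropic*": 319, "Arctic/Antarctic/polar": 262, "temperate": 188}

# Hypothetical term list; the actual search strings are not stated in the text.
METHOD_TERMS = ["environmental DNA", "eDNA", "metabarcoding"]


def any_of(terms):
    """Join terms with OR (records containing any term); quote multi-word terms."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(method_terms, region_terms, sample_terms=None):
    """Combine term groups with AND (records must match every group)."""
    groups = [any_of(method_terms), any_of(region_terms)]
    if sample_terms:
        groups.append(any_of(sample_terms))
    return " AND ".join(groups)


def hit_shares(hits):
    """Express each hit count as a percentage of all hits, to one decimal place."""
    total = sum(hits.values())
    return {term: round(100 * n / total, 1) for term, n in hits.items()}


if __name__ == "__main__":
    print(build_query(METHOD_TERMS, ["tropic*"]))
    print(build_query(METHOD_TERMS, ["Arctic", "Antarctic", "polar"], ["permafrost"]))
    for term, share in hit_shares(REPORTED_HITS).items():
        print(f"{term}: {share}% of hits")
```

Under these assumptions, the tropics account for roughly 41% of the 769 reported hits, polar regions for about 34%, and temperate regions for about 24%. As the text notes, such counts say nothing about how evenly the sampling localities are spread within each region.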
Often, one type of environmental sample is collected in an attempt to answer broad ecological questions regarding an ecosystem, such as a watershed (Dickie et al., 2018). However, this is problematic and can lead to bias in terms of taxa recovered (Baird and Hajibabaei, 2012; Taberlet et al., 2018). In addition to the geographic location of sample collection, sample type is a large bottleneck in terms of taxa recovered (Figure 2). For example, recent studies have found that eDNA samples from freshwater are a poor substitute for bulk-benthos samples for assessing macroinvertebrate community assemblages (Macher et al., 2018; Hajibabaei et al., 2019). Furthermore, the terminology surrounding the types of environmental sample is inconsistent across the literature, with variations of "eDNA" and "bulk-tissue DNA" used interchangeably (Dickie et al., 2018). Often aquatic-based DNA monitoring samples are referred to as "eDNA" (e.g., Valentini et al., 2016; Deiner et al., 2017), whereas sediment/benthos or soil samples are termed "bulk-tissue DNA" (Hatzenbuhler et al., 2017; Hajibabaei et al., 2019; Harper et al., 2019), despite these types of sample all referring to DNA which is isolated from an environmental sample (Dickie et al., 2018). This lack of consistency is particularly challenging when attempting to amalgamate literature and compare studies from different research groups, and for effectively communicating results of DNA-based studies to non-specialists. Going forward, it would be greatly beneficial to have a consistent and shared ontology across the environmental DNA and metabarcoding community in terms of environmental sample type. Although eDNA could provide an all-encompassing term for analysis of DNA from environmental samples, it is important to provide complementary information about sample type (e.g., soil, water, and benthos) and technology used for detection in all scientific/technical communication. To fully investigate the current uses of DNA-based terminology, an in-depth review would be necessary, which is beyond the scope of this paper. Ultimately, different types of environmental samples, with their varying associated terminologies, are likely to reflect specific communities of taxa based on factors such as life histories, season and geographic location (Thomsen and Willerslev, 2015; Dickie et al., 2018), and if global ecological questions are to be addressed using next-generation biomonitoring, sample design will need to incorporate the processing of multiple sample types for accurate assessments of biodiversity.
Table footnote: Searches were conducted using the Boolean operators "OR" (find records containing any of the terms), "AND" (find records containing all terms) and "SAME" (terms that must occur within the same sentence), restricted to the last 5 years (2015-2019).
Table footnote: Searches were conducted using the Boolean operators "AND" (find records containing all terms) and "OR" (find records containing any of the terms), restricted to the last 5 years (2015-2019).
FIGURE 1 | [...] (Table 2) for each sample type search term (water, river/stream/lake/pond, benthos/sediment, soil, seawater/marine, and permafrost) for each geographic region.
FIGURE 2 | Infographic displaying the "bottlenecks" associated with global DNA metabarcoding data generation.
In addition, there is a substantial degree of variation within metabarcoding as to the sequencing technology implemented for data generation (Bleidorn, 2016; Evans et al., 2016; Elbrecht and Steinke, 2019; Singer et al., 2019; Zinger et al., 2019). As of 2015, there were 13 different PCR-based NGS technologies (Pavan-Kumar et al., 2015), with Illumina® MiSeq currently the prominent NGS platform for processing biomonitoring data (Bleidorn, 2017). In terms of sequencing, different environmental sample types require varying degrees of sequencing breadth and depth (Porter and Hajibabaei, 2018b; Singer et al., 2019). Tropical forest soils are considered to be among the most diverse ecosystems on the planet, in comparison to alpine mountain lakes, which have vastly different biological richness (Schluter and Pennell, 2017; Dumbrell, 2019). For example, two separate studies looking at microbial community structure in tropical soils and alpine lakes produced a large difference in sequence reads for the two environments (tropical soil 16S: 1.3 million; alpine lake: 184,273; Filker et al., 2016; Dopheide et al., 2019). In addition, detection of whole communities as opposed to fewer taxa will require a greater sequencing depth (Porter and Hajibabaei, 2018b). As with environmental sample type, the sequencing process of DNA-based biomonitoring is variously referred to as "NGS," "high-throughput sequencing (HTS)," and "second-generation sequencing (2GS)" (Dickie et al., 2018; Divoll et al., 2018; Zinger et al., 2019); this varying use of terminology adds another level of inconsistency to DNA-based biomonitoring.
Referring to this sequencing technique by a consistent term, similar to the ontology discussed for sample terminology, would be beneficial. As many companies that produce sequencing equipment, such as Illumina®, often refer to this technology as "next-generation sequencing," it would be logical to maintain consistency with this term (von Bubnoff, 2008; Quail et al., 2012). As with sample terminology, it is necessary to provide complementary information regarding the technological processes (i.e., high-throughput targeted sequencing). Since January 2016, there have been a few publications referring to the use of Illumina®'s newest high-capacity platform, NovaSeq, in metabarcoding studies; these have highlighted the higher performance of this new technology in comparison to both the HiSeq and MiSeq, with NovaSeq detecting 40% more metazoan families in metabarcoded sea water samples than the MiSeq (Singer et al., 2019). The implementation of new technology brings to light the need to evaluate available technologies against the biomonitoring needs of a given system, with the main limitation being the taxonomic coverage achieved per sample (Divoll et al., 2018). For example, MiSeq may provide an optimal solution for tackling biodiversity in freshwater systems or specific taxonomic assemblages, whereas NovaSeq would be a better platform for more complex situations such as oceanic samples. Suboptimal use of data generation platforms could lead to misrepresentation of taxonomic information, which is problematic given its implications for the ecological conclusions that have already been drawn from metabarcoding-based biomonitoring data (Zinger et al., 2019).
Environmental sample choice and implementation of different sequencing platforms are not the only sources of taxa detection bias (Figure 2). There are numerous bioinformatic pipelines for processing samples, which vary greatly across studies (Alberdi et al., 2018), and inappropriate clustering/filtering thresholds can lead to misclassification and thus bias in the taxa detected (Alberdi et al., 2018; Zinger et al., 2019). In addition, the most prominent bottleneck in terms of recovering the taxa present in an environmental sample is incomplete DNA reference databases (Figure 2; Zaiko et al., 2015; Elbrecht et al., 2017; Stat et al., 2017). The two commonly used databases, BOLD (Barcode of Life Data System) and GenBank, regularly lack reference sequences and/or have conflicting taxonomic assignments for species (Ammon et al., 2018). Incomplete reference databases make it impossible to identify all DNA sequences in a sample and mean that some taxonomic groups are underrepresented (Creer et al., 2010; Ratnasingham and Hebert, 2013; Porter and Hajibabaei, 2018a), which highlights the current substantial gap in global biodiversity knowledge (Zaiko et al., 2015). If DNA-based biomonitoring is to be an effective, reliable tool for assessing biodiversity on a global scale, efforts need to be primarily concentrated toward better curation and updating of DNA reference records, as well as continued barcoding of taxonomically identified specimens to improve the quality and quantity of information in DNA databases (Elbrecht et al., 2017; Stat et al., 2017; Zinger et al., 2019).
In essence, the sequence data that dominate the databases for various biota will reflect what has been collected from temperate areas more so than from the tropics and polar regions. How, then, will soil scientists (and others) be able to effectively identify organisms in their soil samples based on databases built from other regions? More importantly, some of the areas that need to be further sampled are those that are experiencing the most drastic changes in climate patterns. This also underscores the need for more seasonal studies over extended periods of time to assess the variability in climate patterns across the globe. If we are to detect ecosystem change globally, more comprehensive work involving biomonitoring and DNA metabarcoding/eDNA will be needed to generate consensus data, generate the metadata, and start analyzing trends across the globe.
Even with advancing technologies and methodologies, such as machine learning and neural networks applied to ecological status and modeling, as has been described elsewhere (Díaz et al., 2015; Bohan et al., 2017; Derocles et al., 2018), we still need to increase the information held in reference databases so that organisms of interest can be identified from more geographical locations across the globe, and to adopt more robust experimental designs, rather than straight survey-based approaches, to draw sound ecological conclusions (Zinger et al., 2019). Yet, if samples are inherently variable due to geographical location and/or sample type across the globe, how can we ever expect these taxonomic databases to accurately reflect a global perspective of ecosystems, in order to effectively and accurately detect global ecosystem change? Collecting samples from more geographical locations where representation is lacking, collecting a wider array of sample types, and constructing replicated networks of ecological interactions will together provide useful standards of global ecosystem information, dramatically enhancing the ability to assess the taxa within global ecosystems and to understand how these respond to climate change and other forms of ecosystem damage. We propose that combining the use of these technologies would greatly enhance the capacity to predict how various ecosystems respond to environmental change at local, regional and global levels.

AUTHOR CONTRIBUTIONS
MH, KM, and CR conceived the idea and co-wrote the manuscript. CR conducted the literature search.