Aggregated and Validated Datasets for the European Seas: The Contribution of EMODnet Chemistry

1 Istituto Nazionale di Oceanografia e di Geofisica Sperimentale (OGS), Trieste, Italy, 2 International Council for the Exploration of the Sea (ICES), Copenhagen, Denmark, 3 National Institute for Marine Research and Development “Grigore Antipa”, Constanţa, Romania, 4 Hellenic Centre for Marine Research, Anavyssos, Greece, 5 French Research Institute for Exploitation of the Sea (IFREMER), Nantes, France, 6 Aarhus University, Aarhus, Denmark, 7 Swedish Meteorological and Hydrological Institute, Norrköping, Sweden, 8 Institute of Marine Research, Bergen, Norway, 9 Alfred-Wegener-Institute, Bremerhaven,


INTRODUCTION
EMODnet (European Marine Observation and Data Network) is the long-term marine data initiative started by DG MARE in 2009 as part of the Blue Growth strategy (European Commission, 2012) to ensure that European marine data across seven discipline-based themes become easily accessible, interoperable, and free of restrictions on use (Martín Míguez et al., 2019).
EMODnet Chemistry started in 2009 with a pilot phase aimed at testing the project feasibility in limited geographical areas on a restricted number of parameters (Vinci et al., 2013). The second phase, from 2013 to 2016, was aimed at expanding the spatial coverage and the range of chemical parameters (Vinci et al., 2016), while the third one, from 2017 to 2019, extended the data focus, including marine litter data and data products (Vinci et al., 2018).
The EMODnet Chemistry portal is now built on a network of 45 marine research and monitoring institutes and oceanographic data management experts from 27 countries. Many of these are National Oceanographic Data Centres (NODC), actively involved in managing, indexing and providing access to ocean and marine data sets, acquired from research cruises and monitoring activities in European marine waters and global oceans.
The objective of EMODnet Chemistry is to provide easy and open access to marine chemistry data sets and data products related to three main categories: eutrophication (e.g., nutrients, oxygen and chlorophyll), contaminants (e.g., hydrocarbons, pesticides, heavy metals, antifoulants) and marine litter (e.g., beach litter, seafloor litter and floating micro-litter). Data derive from inputs gathered and collated from national monitoring efforts and research activities from all European coastal states.
The large heterogeneity in the data managed by EMODnet Chemistry derives from the different kinds of collated variables: eutrophication data are mainly available for the water column, whereas contaminants and marine litter are collected in water, in sediments and in biota. Furthermore, samples have been collected with a wide range of instruments for in situ data acquisition and analyzed following heterogeneous laboratory protocols, with different method accuracy and precision that need to be described and archived together with the data to allow data reusability and solid scientific analysis. A large amount of metadata is therefore needed to archive data correctly and allow their long-term use. Historical data often lack detailed information on analytical procedures, calibrations and confidence intervals; by contrast, considerable efforts are currently being made within European data management initiatives to properly collect and archive this kind of metadata. Guidelines for data management in the field of physical and chemical oceanography were developed long ago (Intergovernmental Oceanographic Commission, 1965) and are constantly being updated (Intergovernmental Oceanographic Commission, 2019). However, further efforts are needed as new instruments and new parameters are adopted by monitoring agencies and the scientific community. Each data category (from eutrophication to contaminants and litter) requires the development of customized standards and tools for its management and visualization. In the field of marine litter, a shared process with institutions involved in marine litter monitoring, management and assessment has allowed EMODnet Chemistry to develop a unified data model and to define common data formats that capture and standardize the most relevant information (Molina Jack et al., 2019).
Beach and seafloor litter data are modeled on existing and consolidated formats, adapted to accept data from different sources (Galgani et al., 2020a). For micro-litter, the standard data format is extended to gather all necessary metadata (Galgani et al., 2020b).

DATA COLLECTION, AGGREGATION, AND VALIDATION
Data collections are the result of a harmonization, standardization and validation process, applied when merging observations from different sources, sensors and purposes. All data entering EMODnet Chemistry are managed according to standard protocols developed in the framework of the consolidated European SeaDataNet marine data infrastructure, which implements consolidated communication standards and tools, common data and metadata models and common file formats. In particular, the following SeaDataNet standards are adopted to ensure interoperability with other data platforms:
• Metadata services for the standard description of cruises, organizations, projects, observing systems, datasets, ... (CSR, EDMO, EDMED, ...),
• Ocean Data View (ODV) format for data exchange,
• NVS Controlled Vocabularies providing common terms to describe data and metadata,
• Common Data Index (CDI) service to access and download data according to the related data policy,
• Security services for user registrations,
• Products viewing services for discovery, visualization, and downloading of products,
• Dedicated tools (NEMO, MIKADO, ...) to prepare data and metadata,
• Quality control procedures for data validation.
The use of controlled vocabularies (i.e., standardized terms that cover a broad spectrum of disciplines) is an important prerequisite for consistency and interoperability. The SeaDataNet NVS controlled vocabularies are technically managed and hosted by the British Oceanographic Data Centre by means of the NERC Vocabulary Server (NVS2.0). Heterogeneity in marine chemical data is extremely high with regard to the characteristics of the sampled matrix and to the different sampling and analytical protocols. To keep all relevant information linked to the data, a very specific vocabulary (BODC Parameter Usage Vocabulary terms, P01 vocabulary) was implemented by SeaDataNet, which makes it possible to classify the different substances and also to record matrix characteristics and analytical techniques.
The P01 vocabulary is based on a semantic model that uses a defined set of controlled vocabularies (the semantic building blocks); this makes it possible to retain the relevant information and to label parameters with a standard description.
The EMODnet data and metadata format is compliant with the INSPIRE themes and SeaDataCloud data models.
The data files can be imported into the Ocean Data View (ODV, Schlitzer, 2002) visualization and analysis software package, which is freely available for non-commercial, non-military research and for teaching purposes.
To aggregate data per sea region from the heterogeneous datasets originating from multiple institutions, an automatic Robot Harvester, configured with predefined criteria for geographical and temporal coverage and for parameters, was adopted to retrieve specific data sets from distributed data centers. The resulting collection is aggregated and quality controlled using ODV software, following a dedicated methodology. Regional leaders were in charge of both steps, which were performed according to a common protocol shared between all sea regions. Parameter aggregation included unit conversions and the harmonization of parameter coding and meaning (taking into consideration possible differences between the collection of new and historical data). Regional quality control follows procedures compiled in discussion with the wider international community (e.g., IOC/IODE, ICES and JCOMM, SeaDataNet Data Quality Control Procedures) and involves: checks of metadata format correctness and completeness, data format checks, identification of negative and zero values, identification of wrong measurement units, and a "broad range" check, which compares the data with minimum and maximum regional values derived from previous statistics (SeaDataNet, 2010; Barth et al., 2015; Buga et al., 2019). Regional experts are involved in the validation of the aggregated data collection; as a result, quality flags are assigned to all data according to a standard scale (SeaDataNet measurand qualifier flags, L20 vocabulary).
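As an illustration only, the value-level checks listed above (negative and zero values, "broad range" comparison) could be sketched as follows. The regional limits, the parameter names and the choice of which flag to assign in each case are assumptions made for this example; the flag codes follow the SeaDataNet L20 scale (1 = good value, 3 = probably bad value, 4 = bad value), and the actual regional protocol may assign them differently.

```python
# Flag codes from the SeaDataNet L20 scale; which check maps to which
# flag is an assumption of this sketch, not the documented protocol.
GOOD, PROBABLY_BAD, BAD = "1", "3", "4"

# Illustrative regional limits (made-up numbers, not EMODnet statistics).
REGIONAL_RANGES = {
    "nitrate": (0.0, 60.0),    # µmol/l
    "phosphate": (0.0, 5.0),   # µmol/l
}


def flag_value(parameter, value):
    """Return a quality flag for a single concentration value."""
    if value < 0:
        # Negative concentrations are physically impossible.
        return BAD
    if value == 0:
        # Exact zeros are suspicious (possibly a detection-limit placeholder).
        return PROBABLY_BAD
    low, high = REGIONAL_RANGES[parameter]
    if not (low <= value <= high):
        # "Broad range" check against regional minimum and maximum values.
        return PROBABLY_BAD
    return GOOD
```

In practice the flags proposed by such automatic checks would still be reviewed by regional experts, as described above, before being stored with the data.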
To keep all relevant information describing the dataset (e.g., data originator, station characteristics, sampling instruments, etc.), data files and metadata files are merged in ODV to create a metadata-enriched SeaDataNet ODV data collection.
Due to the large variety in chemical data, dedicated and customized approaches and the development of new tools for management and visualization are required to meet the specific needs of the different data categories (e.g., nutrients in the water column, contaminants in sediment, etc.).
FIGURE 1 | The loop implemented within EMODnet Chemistry for data quality control ensures a continuous upgrade of data quality.

Eutrophication and Acidity
In order to produce thematic data sets aggregated at Regional Sea scale, a dedicated vocabulary was implemented (P35 vocabulary: EMODnet Chemistry aggregated parameter names) to combine the various P01 terms associated with the same substance (e.g., nitrate in seawater), but measured with different protocols or expressed in different measurement units, into a unified aggregated term with a uniquely identified standard unit. The ODV software (Schlitzer, 2020) has a built-in aggregation procedure that makes use of the P35 vocabulary and applies a number of business rules, for example for averaging and for possible unit conversions. The information related to the specific parameters (as described by the P01 vocabulary) is available as metadata and, if provided by data originators, details about the acquisition instrument are also made accessible. The resulting ODV data collections, aggregated and validated, are used as input for data interpolation on a regular grid (Data-Interpolating Variational Analysis, Brankart and Brasseur, 1996) and for the preparation of the data products available through the EMODnet Chemistry portal. A brief description of the data formats and the validation process is given below. Two types of data are defined for each collection and treated separately:
• vertical profiles (VP) for data that have been collected at roughly the same time and location at several consecutive depths,
• time series (TS) for data collected at the same location and depth but repeated in time.
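The aggregation of several P01 terms into one P35 term can be pictured with the minimal sketch below. It is not the ODV implementation: the codes and names are placeholders rather than real vocabulary entries, and the only rules assumed are a multiplicative unit conversion followed by averaging of values that map to the same aggregated term.

```python
from collections import defaultdict
from statistics import mean

N_MOLAR_MASS = 14.007  # g/mol, converts mg N/l to µmol/l

# Hypothetical mapping: P01-style code -> (P35-style aggregated name,
# factor converting the source unit to the aggregated standard unit).
P01_TO_P35 = {
    "P01_NITRATE_UMOL_L": ("AGG_NITRATE", 1.0),
    "P01_NITRATE_MG_N_L": ("AGG_NITRATE", 1000.0 / N_MOLAR_MASS),
}


def aggregate(sample):
    """Collapse the P01-coded values of one sample into aggregated terms.

    Values that map to the same aggregated term are converted to the
    standard unit and then averaged (an assumed business rule).
    """
    grouped = defaultdict(list)
    for code, value in sample.items():
        name, factor = P01_TO_P35[code]
        grouped[name].append(value * factor)
    return {name: mean(values) for name, values in grouped.items()}
```

For example, a sample reporting nitrate both as 10.0 µmol/l and as 0.14007 mg N/l would yield a single aggregated nitrate value close to 10 µmol/l, while the original P01 codes remain available as metadata.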
Most data related to eutrophication are provided as VP collections, whereas some TS are available for the Mediterranean Sea, Greater North Sea and North East Atlantic Ocean. The regional data collections include alkalinity, chlorophyll-a, DIC, ammonia, nitrite, nitrate, NO2 + NO3, DIN, oxygen, pH, phaeopigments, phosphate, silicate, total nitrogen and total phosphorus, as described in the user manual EMODnet Chemistry Eutrophication and Acidity aggregated datasets v2018.

Contaminants
In the case of contaminants, heterogeneity of data is particularly high, as contaminants are measured in three matrices (water, sediment, biota) with different characteristics (dissolved/particulate phase in water, different sediment size classes, different marine species and target tissues/organs) and with different sampling, analytical and normalization protocols. This complexity results in a high number of different P01 terms for each single substance, depending on the matrix and on the sampling and analytical procedures, and makes data comparability among different areas very challenging. As an example, 347 unique P01 terms were listed for the Black Sea alone.
In order to focus EMODnet Chemistry efforts at regional sea scale, an in-depth analysis was carried out of the relevant EU legislation concerning contaminants (EU Directive 2013/39/EU; EU Water Framework Directive 2000/60/EC; Marine Strategy Framework Directive, MSFD, 2008/56/EC) and of the procedures defined by the Regional Sea Conventions (RSC) for the assessment of chemical pollution of the marine environment. The most relevant substances used for the assessment of ecosystem status were identified. Among the large list of substances targeted by EMODnet Chemistry data collection, priority was given to the following parameters:
• Pesticides: p,p'-DDTs (including in this group p,p'-DDE, p,p'-DDT and p,p'-DDD) and HCB;
• Antifoulants: TBT and TPT;
• Heavy metals: mercury, cadmium, lead, plus copper and zinc.
Concentrations are expressed in the following units for each matrix:
• Water: µg/l
• Sediment: µg/kg of dry weight sediment
• Biota: µg/kg of fresh weight, following RSC guidelines (except mussels, as dry weight).
In the case of contaminants, to improve data processing and make use of the important information on species, chemical substance, matrix, basis of determination, etc. connected to each single variable (P01+P06 code), the P01 code is decomposed into its separate parts of relevant information (i.e., split into its subcomponents, namely the substance name, the matrix characteristics, the taxonomic level, the measured statistics, etc.). In this way, users can aggregate the data independently and reduce the large heterogeneity. This approach has been agreed within the EMODnet community in consultation with the MSFD group of experts, and new functionalities were added to the ODV tool.
To further improve usability, the ODV formatted data can be transformed into a long/vertical format, with one record line per P01 code and its subcomponents. For traceability, the local variable names are maintained, with P01 as the primary information.
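The transformation into the long/vertical format can be sketched as below. This is a simplified illustration and not the ODV export itself: the codes, the subcomponent names and the lookup table are hypothetical, and a real decomposition would be driven by the semantic model behind the P01 vocabulary.

```python
# Hypothetical decomposition table: P01-style code -> its subcomponents.
P01_COMPONENTS = {
    "P01_CD_SED": {"substance": "cadmium", "matrix": "sediment", "unit": "µg/kg"},
    "P01_CD_BIOTA": {"substance": "cadmium", "matrix": "biota", "unit": "µg/kg"},
}


def to_long(station_row, id_fields=("station", "date")):
    """Turn one wide row (one column per P01 code) into one record per code."""
    ids = {field: station_row[field] for field in id_fields}
    records = []
    for code, value in station_row.items():
        if code in id_fields or value is None:
            continue  # skip identifier columns and missing measurements
        records.append({**ids, "P01": code, **P01_COMPONENTS[code], "value": value})
    return records
```

With one record per code, users can filter or group on any subcomponent (for instance, all cadmium values in sediment) without having to parse the P01 labels themselves.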
EMODnet Chemistry aggregated datasets of contaminants in the marine environment are described in the user manual (Buga et al., 2019).

DATA QUALITY CONTROL LOOP
Data originators are responsible for Quality Assurance (QA) and Quality Control (QC), involving the assessment of the whole sampling/laboratory analysis process and resulting in data flagging according to a standardized approach. All data originators provided information on QA/QC procedures through questionnaires based on ISO/IEC 17025:2017 standards; these indicate that most laboratories follow standard procedures involving inter-calibrations and the use of certified reference material (see https://www.emodnet-chemistry.eu/data/questionnaires).
EMODnet Chemistry involves a network of National Oceanographic Data Centres (NODCs) that supervise the national availability of research and environmental monitoring data, which are provided by research institutes and environmental agencies, respectively. NODCs are also responsible for archiving quality-controlled data, flagged with quality information (Vinci et al., 2017). Across the phases of EMODnet Chemistry (2009-2012, 2013-2016, 2017-2019), a strong effort was dedicated to evaluating the quality of data and to defining a validation protocol. Comparisons of datasets from different sea regions showed many inconsistent data quality flags, which is evidence of the need to coordinate and harmonize practices. Commonly agreed and standardized data aggregation and validation protocols have been defined and shared to guarantee consistency among comprehensive databases, which include data of different origin and with different spatial or temporal scopes.
An internal process that harvests information to check its consistency and quality revealed a number of unexpected issues. These were reported back to data originators through the related data center, which is in charge of correcting them and updating the official copy of the data made available for the next harvest. This continuous loop, implemented in close contact with the data providers, is leading to an effective upgrade of data quality (Figure 1).

ANALYSIS OF THE EMODnet CHEMISTRY AGGREGATED AND VALIDATED DATASETS
With six different sea regions, three main data categories (covering nutrients, oxygen, chlorophyll, hydrocarbons, pesticides, heavy metals, antifoulants, and marine litter) and more than one century of data for some of them, the quality-controlled data collections produced by EMODnet are very large, and it is therefore difficult to describe them in detail with a limited number of figures. Some general information on the geographic area covered by the regional data collections, the temporal extent, the number of profiles and the DOI reference for their open access is provided in Table 1.
Here we focused on the process implemented for managing eutrophication and contaminant data. Data quality control is crucial in oceanographic data management, especially for the creation of multidisciplinary and comprehensive databases that include heterogeneous data of different and/or unknown origin covering long time periods (Buck et al., 2019). The outcomes of the validation process are indicated by quality flags assigned to data or to datasets. Merging data from different laboratories and sea regions, and of different quality, poses a great challenge for data inter-comparability at regional scale. Therefore, the adoption of an appropriate and shared methodology, as well as feedback from the scientific community and from data originators, are necessary ingredients for the validation of large data collections.
While eutrophication data are mainly available in the water column, at repeated stations or along scientific cruises in the open sea, contaminants are mainly collected in sediments and in different biota species, following national implementations of EU directive requirements. Thus, customized protocols and tools are adopted, while comparable efforts are dedicated to marine litter and described in specific papers.
The ODV system was suitably enhanced for automated parameter aggregation of data sets from multiple sources. In the case of contaminants, heterogeneity of data is particularly high, as EMODnet Chemistry manages a large and continuously growing number of different substances, measured in different matrices (water, sediment, biota), in different sediment size classes, in different phases (dissolved/particulate), in different marine species and target tissues/organs, and with different sampling, analytical and normalization protocols. The approach designed and implemented to manage such heterogeneity included splitting P01 vocabulary terms into subcomponents and giving users tools to aggregate information according to their needs. This was made possible because EMODnet Chemistry metadata store all the information needed using controlled vocabularies.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/supplementary material.

AUTHOR CONTRIBUTIONS
AG was responsible for coordination of the data quality control loop, coordination in the definition of the protocols for data integration and validation, shared between the six regional leaders, and writing. MLi contributed to chemical data management, to the definition of the protocols and to the generation of the figure. MM contributed to chemical data management, to the definition of the protocols and to the generation of the table. NH and HJ contributed as experts in chemical data validation and management. LB and GS were responsible for the aggregated and validated dataset in the Black Sea region. AI was responsible for the aggregated and validated dataset in the Mediterranean Sea region. JG was responsible for the aggregated and validated dataset in the North Atlantic region. MLa was responsible for the aggregated and validated dataset in the North Sea region. LF was responsible for the aggregated and validated dataset in the Baltic Sea region. AØ was responsible for the aggregated and validated dataset in the Arctic region. RS contributed to the datasets (management and tool development). All authors contributed to the article and approved the submitted version.