Edited by: Jay S. Pearlman, Institute of Electrical and Electronics Engineers, France
Reviewed by: Antoine De Ramon N’Yeurt, University of the South Pacific, Fiji; Ramasamy Venkatesan, National Institute of Ocean Technology, India
This article was submitted to Ocean Observation, a section of the journal Frontiers in Marine Science
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
A thorough and reliable assessment of changes in sea surface water temperatures (SSWTs) is essential for understanding the effects of global warming on long-term trends in marine ecosystems and their communities. The first long-term temperature measurement programs were established almost a century ago, especially in coastal areas, and some of them are still in operation. However, while in earlier times these measurements were done by hand every day, current environmental long-term observation stations (ELTOS) are often fully automated and integrated into cabled underwater observatories (UWOs). With this new technology, year-round measurements became feasible even in remote or difficult-to-access areas, such as coastal areas of the Arctic Ocean in winter, where measurements were almost impossible just a decade ago. In this context, the question arises to what extent sampling frequency and accuracy influence the results of long-term monitoring approaches. In this paper, we address this question with a combination of laboratory experiments on sensor accuracy and precision and a simulated sampling program with different sampling frequencies, based on a continuous water temperature dataset from Svalbard, Arctic, from 2012 to 2017. Our laboratory experiments showed that 12 different temperature sensor types across different price ranges all provided measurements accurate enough to resolve temperature changes over years at the level discussed in the literature on climate change effects in coastal waters. However, the experiments also revealed that some sensors are more suitable for measuring absolute temperature changes over time, while others are more suitable for determining relative temperature changes.
Our simulated sampling program in Svalbard coastal waters over 5 years revealed that the selection of a proper sampling frequency is most relevant for discriminating significant long-term temperature changes from random daily, seasonal, or interannual fluctuations. While hourly and daily sampling could deliver reliable, stable, and comparable results concerning temperature increases over time, weekly sampling was less able to reliably detect overall significant trends. With even lower sampling frequencies (monthly sampling), no significant temperature trend over time could be detected. Although the results were obtained for a specific site, they are transferable to other aquatic research questions and non-polar regions.
Measuring changes in water temperature over time is important for assessing climate change impacts. In this context, temperature changes have a fundamental impact not only on the kinetic energy in the system but also on the overall cross-taxon structure of marine biodiversity and, therefore, on the global distribution of life in the oceans (
It was only at the end of the last century that watertight temperature sensors became available off the shelf at an affordable price and successively replaced most manual measurement devices. Today, digital temperature sensors are available for most
Remote
In addition to defining specific
According to ISO 5725-1:1994,
Visualization of accuracy and precision according to ISO 5725-1:1994. The bell-shaped distribution describes a set of measurements of a single parameter (e.g., water temperature) over time with a single sensor or the distribution of synoptic measurements with multiple sensors. Higher measurement accuracy means that the maximum of the distribution moves closer to the reference value. The precision of a sensor refers to the width of the bell-shaped distribution. The more precise a measurement is, the narrower the bell shape of the distribution.
In contrast, the term precision relates to the reproducibility of measurements and their variability between repeated measurements due to, for example, electronic or resolution-based variabilities of the measurement itself (
Even though both accuracy and precision are defined in their respective ISO standards, the precise meaning of these definitions is debated (
Compared to defining a sensor’s accuracy, the calculation of precision is much easier. According to ISO 5725-1:1994, the precision of data depends only on the distribution of random errors around an assumed statistical value, which is, however, not necessarily the true value in the sense of the real value referred to in the accuracy section above. Precision is usually expressed in terms of imprecision and computed as the standard deviation around the measured mean value; lower precision is reflected by a larger standard deviation or confidence limit (
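These two notions can be made concrete in a few lines of code. The sketch below (hypothetical sensor readings, not data from this study; illustrated in Python) computes accuracy as the bias of the measurements relative to a reference value and precision as the standard deviation of the repeated measurements:

```python
import numpy as np

def accuracy_and_precision(measurements, reference):
    """Accuracy as the bias (mean deviation from the reference value) and
    precision as the standard deviation of repeated measurements, following
    the ISO 5725-1 notions of trueness and precision."""
    m = np.asarray(measurements, dtype=float)
    bias = m.mean() - reference    # accuracy: offset of the distribution's center
    spread = m.std(ddof=1)         # precision: width of the distribution
    return bias, spread

# Hypothetical sensor: precise but not accurate
# (narrow spread of 0.01 degC, but a constant offset of +0.5 degC)
rng = np.random.default_rng(0)
readings = 4.0 + 0.5 + rng.normal(0.0, 0.01, size=1000)  # true value: 4.0 degC
bias, spread = accuracy_and_precision(readings, reference=4.0)
```

A sensor can thus be precise but inaccurate (small spread, large bias), as in this example, or accurate but imprecise (small bias, large spread); the two properties must be evaluated separately.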
Even though most sensor manufacturers provide lab-derived accuracy values for new sensors, the meaning and consequences of these parameters for
Therefore, the
In the first part of this paper, we address the accuracy, precision, and comparability of different commercially available temperature sensors in price categories between 200 and 15,000 EUR, with respect to their potential to measure one or more environmental variables over a certain range as accurately and precisely as possible. Therefore, we conducted laboratory intercomparison experiments to compare the
In addition to the above-described sensor-specific issues, scientists are often confronted with the decision on how often a sensor should sample per time unit to best assess possible changes and dynamics of a focus parameter. Similar to the above-described issue of accuracy and precision, there are also valid and scientifically proven theoretical concepts to determine an adequate sampling frequency for a certain monitoring task. One of these concepts is the Shannon–Nyquist theorem (
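The practical consequence of the Shannon–Nyquist criterion can be illustrated with a synthetic example (a hypothetical diurnal temperature cycle; not data from this study): a 24-h cycle requires more than two samples per period, i.e., a sampling interval shorter than 12 h. Sampling exactly once per day hits the same phase of the cycle every time, so the cycle disappears from the record:

```python
import numpy as np

# Hypothetical water temperature with a diurnal cycle of amplitude 1 degC
t_hours = np.arange(0, 24 * 30)              # 30 days, hourly time axis
diurnal = np.sin(2 * np.pi * t_hours / 24)   # period = 24 h
temp = 4.0 + diurnal

# Hourly sampling resolves the cycle: the Shannon-Nyquist criterion demands
# more than 2 samples per period (sampling interval < 12 h here).
hourly_amplitude = temp.max() - temp.min()   # ~2 degC peak-to-peak

# Sampling once every 24 h hits the same phase each time, so the diurnal
# signal aliases to an apparent constant and its amplitude vanishes.
daily = temp[::24]
daily_amplitude = daily.max() - daily.min()
```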
In the second part of the manuscript, we therefore address the question of how different sampling schemes with different sampling frequency (hourly, daily, weekly, or monthly) affect the observed long-term temperature trend over a period of 5 years. For this analysis, we used a dataset from an Arctic coastal observatory in Svalbard (
This study is not intended to evaluate or develop standard operating procedures (SOPs) to determine sensor accuracy and precision and not to provide SOPs for the determination of the sampling frequency for a specific monitoring approach as this must be completed specifically for each experimental setup. Rather, the goal of this study is to demonstrate how different sampling schemes with respect to the sampling frequency and use of sensors with different accuracy and precision values can affect the outcome of monitoring programs. The results are thus discussed considering: (1) the requirement to select suitable sensors for long-term oceanographic measurements, including cost-benefit considerations when using either expensive oceanographic sensors, such as CTD or thermo salinometers, compared to multiple relatively cheap temperature sensors that are available off the shelf and (2) the possible effects of different temporal sampling schemes on the results. The latter considerations are essential when deciding how much money and workforce should be invested for a long-term sampling program to detect relevant changes in the target parameter (here water temperature) with high reliability and accuracy without exaggerating the sampling and data handling efforts.
To test the influence of different sensors and sampling strategies to observe a specific environmental parameter over longer periods of time, we used the variable water temperature as it is the most important hydrographic variable across the aquatic disciplines in the context of climate change (
Sensor analysis in the intercomparison experiment.
| Sensor ID | Type | Manufacturer | Parameters measured | Price group (€) | Resolution (°C) | Accuracy (a) / Precision (b) (°C) | Stability (°C) | Max. sampling frequency (s) |
| 1 | Multiparameter probe | AML Oceanographic Ltd. | T, S | >10,000 | 0.001 | 0.005–0.002 (a), 0.003 (b) | n.a. | 1 |
| 2 | Multiparameter probe | Sea-Bird Electronics, Inc. | T, C, P, O2 | >10,000 | 0.0001 | 0.002 (5–35°C), 0.01 (35–45°C) | 0.0002 | 10 |
| 3 | Multiparameter probe | AML Oceanographic Ltd. | T, C, P | >10,000 | 0.1 | 0.05 | n.a. | 0.1 |
| 4 | Multiparameter probe | YSI | T, C, P, O2, fDOM, turbidity | 2,000–10,000 | 0.001 | ±0.01 / ±0.05 | n.a. | 1 |
| 5 | Multiparameter probe | TriOS | T, S, nitrate | 2,000–10,000 | n.a. | n.a. | n.a. | 60 |
| 6 | Multiparameter probe | Satlantic | T, S, nitrate | 2,000–10,000 | n.a. | n.a. | n.a. | 60 |
| 7 | FerryBox flow system | 4H Jena | T, C, Chl-a, O2, turbidity | >10,000 | 0.0001 | <5% / ±0.005 | ±0.0005 | 60 |
| 8, 9, 10 | Logger | Schlumberger | T, C, P | <2,000 | 0.01 | ±0.1 | n.a. | 60 |
| 11, 12 | Logger | HOBO | T | <2,000 | 0.02 at 25°C | ±0.21 | 0.1 | 60 |
Sensors, respectively, datasets available for the
| Sensor ID | Manufacturer | Temporal resolution of source dataset | Manufacturer accuracy and precision |
| 13 | Teledyne WorkHorse | 2.8e–4 Hz (one value per hour) | Accuracy: 0.01°C |
| 14 | Aanderaa Optode | 1.67e–2 Hz (one value per minute) | Accuracy: 0.03°C |
| 15 | SeaBird SBE38 | 1 Hz | Accuracy: 0.001°C |
| 16 | SeaBird SBE45 | 1.67e–2 Hz (one value per minute) | Accuracy: 0.002°C |
| qc | Quality-controlled dataset (see text) | 2.8e–4 Hz (one value per hour) | Accuracy: n.a. |
For the
In the
Water temperature over time in a water basin for all 12 sensors (inset). Deviation from the calculated median (°C).
For the data analysis, all data were averaged over 1 min to reduce the bias toward sensors with a higher sampling frequency. As the experiments targeted the interoperability of the sensors, we used the median temperature of all 14 sensors as reference data for each time step.
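This averaging and referencing step can be sketched as follows (made-up sensor series and device names, assuming a pandas-based workflow for illustration; the study’s analyses were performed in R): raw readings with different native sampling rates are resampled to 1-min means, and the per-timestep median across sensors then serves as the reference:

```python
import numpy as np
import pandas as pd

# Hypothetical raw readings from three sensors with different native
# sampling rates (1 s, 10 s, 60 s) over ten minutes.
rng = np.random.default_rng(42)
start = pd.Timestamp("2020-01-01 12:00:00")

def fake_sensor(step_s, bias):
    idx = pd.date_range(start, periods=600 // step_s, freq=f"{step_s}s")
    return pd.Series(4.0 + bias + rng.normal(0.0, 0.01, len(idx)), index=idx)

sensors = {"DEV_1": fake_sensor(1, 0.00),
           "DEV_2": fake_sensor(10, 0.05),
           "DEV_3": fake_sensor(60, -0.03)}

# 1-min averages remove the bias toward fast-sampling sensors...
minute_means = pd.DataFrame({k: s.resample("1min").mean()
                             for k, s in sensors.items()})

# ...and the per-timestep median across sensors serves as the reference.
reference = minute_means.median(axis=1)
deviation = minute_means.sub(reference, axis=0)  # per-sensor deviation
```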
The sensors included low-cost water temperature data loggers, water level loggers, instrument clusters, such as CTD probes, and flow-through systems, such as FerryBoxes (4H Jena Engineering GmbH). These sensors represent a price range between 200 and 15,000 EUR. The measurement principle of the different sensors varied with different resolutions and accuracies (
Pre-experiment sensor handling followed standardized routines defined by the sensor manufacturers and individual institute routines. All sensor operators prepared their sensors exactly as they normally do for standard scientific missions. Concerted across-institute sensor preparation procedures were explicitly not provided, as we wanted to focus on possible variations of measurements between different standard sensors under the SOPs applied by different scientific operators and institutes. Therefore, we did not provide any guidelines with respect to sensor calibration and routine maintenance prior to the experiment, except that the routines had to be in full agreement with the respective institute guidelines for good sensor handling practice prior to scientific measurement campaigns.
As an
Location of the Bremerhaven experimental site for the lab experiments and the COSYNA observatory in the Arctic Ocean, Svalbard archipelago (78.93045°N, 11.9190°E), base map:
The dataset comprised temperature data from four different sensor types (
The data flow and handling of the raw data from the Svalbard observatory is described in
Data flow and handling from single sensor raw data sets to quality controlled data. Further explanations see text.
In addition to the single sensor datasets, quality-controlled datasets from 2012 to 2017 were used, which have been published as yearly datasets in the Pangaea data repository (
For all further calculations, both the four single sensor data sets and the quality-controlled dataset were averaged (arithmetic mean) per hour so that mean hourly temperature data were available.
Using these five time series, virtual sampling campaigns were conducted from 2012 to 2017, simulating a realistic monitoring program on SSWT in the Arctic. When setting up the sampling frequency and procedure, we drew on our experience with long-term sampling programs with logistic support available at year-round operated polar field stations. Based on these considerations, the five source datasets with a temporal resolution of 1 h were sampled
Overview of the temporal (hourly, day, week, and month) sampling scenarios in the
| Temporal sampling scenario | Sampling sub-scenario | Description |
| Hour | Hour | The full dataset was used |
| Day | d-1 | Random selection of the sampling time between … |
| | d-2 | Sampling every day at exactly 12:00 h |
| Week | w-1 | Random selection of one sampling day (Sunday to …) |
| | w-2 | Random selection of one sampling day (Tuesday to …) |
| | w-3 | Sampling on Wednesday with random selection of … |
| Month | m-1 | Random selection of one sampling day within each … |
| | m-2 | Random selection of one sampling day between … |
| | m-3 | Sampling on 15th of each month with random … |
In the d-1 scenario, sampling was performed every workday year-round with a random selection of the exact sampling time between 10:00 and 15:00 every day. In the daily d-2 sampling scenario, sampling was also performed every workday, but exactly at 12:00.
For the different weekly sampling scenarios (
For the three-monthly sampling scenarios (
Applying this sampling strategy, nine different virtual sampling scenarios (
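The subsampling behind such virtual campaigns can be sketched as follows (synthetic hourly series and simplified scenario definitions for illustration only; scenario labels follow the table above, with the random-hour window 10:00–15:00 taken from the d-1 description):

```python
import numpy as np
import pandas as pd

# Stand-in hourly temperature series over five years (the real source data
# are the averaged observatory series described in the text).
rng = np.random.default_rng(7)
idx = pd.date_range("2013-01-01", "2017-12-31 23:00", freq="h")
hourly = pd.Series(4.0 + rng.normal(0.0, 0.5, len(idx)), index=idx)

# d-1: every day, at a random full hour between 10:00 and 15:00
days = pd.date_range("2013-01-01", "2017-12-31", freq="D")
d1 = hourly.loc[days + pd.to_timedelta(rng.integers(10, 16, len(days)), unit="h")]

# d-2: every day at exactly 12:00
d2 = hourly[hourly.index.hour == 12]

# m-3: the 15th of each month, at a random hour between 10:00 and 15:00
months = pd.date_range("2013-01-01", "2017-12-01", freq="MS") + pd.Timedelta(days=14)
m3 = hourly.loc[months + pd.to_timedelta(rng.integers(10, 16, len(months)), unit="h")]
```

Repeating the random selections 100 times, as in the study, yields the replicate virtual samplings within each scenario.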
All analyses were performed with R-Studio (
In addition to evaluating the comparability of the tested sensors with respect to sensor accuracy and precision, we also evaluated whether specific sensors may be more appropriate for scientific tasks based on measurement characteristics. We specifically checked if there were sensors that were more appropriate for (1) measuring the “true” temperature as accurately as possible or (2) determining minimal small-scale temperature changes over time. While task (1) has less strict requirements regarding the accuracy of the measurements, task (2) requires a high precision of the temperature measurements.
The median of all sensor data for each time step was determined to analyze the accuracy of the different sensors within the intercomparison experiment. Usually, sensor accuracy is calculated using a “true” value. However, measurements of the true value are extremely challenging, and reference techniques can only provide an approximation. In this set-up, the real “true” temperature was unknown. Therefore, the difference between each measured data and the calculated median at each time step was determined.
As described above, one aim was to find reasonable and feasible metrics to select the most appropriate sensors for a specific scientific task. It is essential to be certain that the sensor of interest is as accurate as a reference or as an assumed “true” value. Therefore, it is crucial to measure the agreement between the two sensors.
One approach to evaluate the agreement is using Bland–Altman analysis between two sensors, rather than validating the sensors to a “true” reference. Bland–Altman analysis is based on quantifying the agreement between two quantitative measurements by studying the mean difference and constructing the LOA (
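A minimal Bland–Altman computation might look as follows (hypothetical sensor and reference series; the 95% limits of agreement are taken as bias ± 1.96 × SD of the pairwise differences):

```python
import numpy as np

def bland_altman(a, b):
    """Bland-Altman agreement between two measurement series:
    bias = mean difference, and the 95% limits of agreement (LOA)
    = bias +/- 1.96 * SD of the differences."""
    diff = np.asarray(a, float) - np.asarray(b, float)
    bias = diff.mean()
    sd = diff.std(ddof=1)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Hypothetical sensor vs. the median-of-all-sensors reference:
# the sensor carries a +0.05 degC offset and 0.02 degC random noise.
rng = np.random.default_rng(1)
reference = 4.0 + rng.normal(0.0, 0.2, 500)
sensor = reference + 0.05 + rng.normal(0.0, 0.02, 500)
bias, (loa_lo, loa_hi) = bland_altman(sensor, reference)
```

The bias measures accuracy relative to the reference, while the spread of the LOA around the bias reflects precision, mirroring the upper and lower panels of the figure below.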
Example of Bland–Altman analysis taking the median of the reference sensor and three different sensors (DEV_1, DEV_9, and DEV_12) and the calculated reference as the median of all sensors; the light blue stripe indicates the mean difference of the device to the reference and represents the bias. The two orange bars represent the confidence interval at 95%.
In
These percentage differences between the reference and the sensor show that mean accuracy is the best benchmark to find the sensor that measures the temperature as accurately or “true” as possible. In
Percentage difference between sensor and reference from the Bland–Altman analysis: (upper panel) bias or mean difference as a measure of accuracy; if bias > 0, the median yields higher values than the sensor, and vice versa. A1–A3 are the most suitable sensors for accuracy estimates; (lower panel) spread of the limits of agreement (LOA) around 0 as a measure of precision; P1–P2 are the most suitable sensors for precision estimates.
Water temperatures (temporal resolution 1 h) of the Svalbard AWIPEV underwater observatory from 2012 to 2017 for the four single sensors and the quality controlled data set. In each plot, the available data of each sensor are shown after applying the plausibility control procedure (
This basic dataset (subsequently referred to as an “hourly” dataset) was used for all subsequent virtual sampling campaigns.
ANOVA results on the statistical effects and interactions of the parameters “sampling time,” “sensor-id,” “repetition,” and “sampling scheme” on the increase of water temperature per year (slope).
| Factor | Hourly | d-1 | d-2 | w-1 | w-2 | w-3 | m-1 | m-2 | m-3 |
| Sampling time | | | | | | | | | |
| Repetition | – | – | n.s. | n.s. | n.s. | n.s. | n.s. | n.s. | n.s. |
| Sensor_id | | | | | | | | | |
| Sampling time × repetition | – | – | n.s. | n.s. | n.s. | n.s. | n.s. | n.s. | n.s. |
| Sampling time × sensor_id | n.s. | n.s. | | | | | | | |
| Sampling time × sensor_id × repetition | – | – | n.s. | n.s. | n.s. | n.s. | n.s. | n.s. | n.s. |
The results in
However, when sampling monthly, this difference between the sensors over time could only be resolved with the restricted sampling scheme in which the monthly sampling was performed exactly on the 15th of each month (m-3) over the entire time period and only with
Finally, the interaction term including the factor replicate (time × repetition as well as time × sensor × repetition) was not significant in any sampling scheme, indicating that the random selection of the sampling time or day in the 100 virtual sampling events within each time slot did not confound the analysis with respect to hidden temporal patterns introduced by the sampling scheme.
Predictive capacity of different sampling schemes compared to hourly sampling. Hundred percent predictive capacity means that all 100 virtual repetitive samplings within a sampling scheme detected the significance in temperature increase over the period found in the hourly sampling. Zero percent predictive capacity means that none of the 100 virtual samplings within a sampling scheme detected the significant temperature increase over the period.
Our analyses show that daily sampling (either fully random or restricted; d-1, d-2) revealed results identical to hourly sampling for all sensors and sampling times, and for the interactions between sampling time and sensor ID.
Weekly sampling performed similarly except for the interaction term “sampling time × sensor_id.” For this interaction term, the predictive capacity dropped to 0% in the weekly restricted sampling scheme. This shows that only daily (or more frequent) sampling allows the effect of sensor type on the measured temperature increase over time to be statistically disentangled.
When switching to the monthly sampling scheme, the predictive capacity dropped sharply. This means that with this sampling scheme, it is no longer possible to reliably discriminate the temperature increase over time from undirected signal noise. In the 100 repetitive samplings, a statistically significant relationship between temperature increase and time was found in only 40% of cases, and the effect of the different sensors on the temperature measurements was found in less than 20%.
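The predictive-capacity calculation can be sketched as follows (synthetic five-year series with an imposed warming trend of 0.2°C per year, a seasonal cycle, and measurement noise; a plain OLS slope t-test stands in for the study’s trend test): each of 100 virtual samplings subsamples the hourly series at a random offset and checks whether the trend is significant:

```python
import numpy as np

def trend_is_significant(x, y, tcrit=1.96):
    """OLS slope test: True if |t| exceeds tcrit, i.e. the linear trend
    is significant at roughly p < 0.05 for large n."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc = x - x.mean()
    slope = (xc * (y - y.mean())).sum() / (xc ** 2).sum()
    resid = y - y.mean() - slope * xc
    se = np.sqrt((resid ** 2).sum() / (len(x) - 2) / (xc ** 2).sum())
    return abs(slope / se) > tcrit

def predictive_capacity(t, temp, step, n_rep=100, rng=None):
    """Percentage of n_rep virtual samplings (one value every `step` hours,
    random start offset) that detect a significant trend."""
    rng = rng or np.random.default_rng()
    hits = sum(trend_is_significant(t[s::step], temp[s::step])
               for s in rng.integers(0, step, n_rep))
    return 100.0 * hits / n_rep

# Synthetic five-year hourly series: 0.2 degC/year warming, a +/-2 degC
# seasonal cycle, and 1.0 degC short-term/measurement noise.
rng = np.random.default_rng(3)
t = np.arange(5 * 365 * 24)
temp = (4.0 + 0.2 * t / (365 * 24)
        + 2.0 * np.cos(2 * np.pi * t / (365 * 24))
        + rng.normal(0.0, 1.0, len(t)))

daily = predictive_capacity(t, temp, step=24, rng=rng)         # daily sampling
monthly = predictive_capacity(t, temp, step=24 * 30, rng=rng)  # ~monthly sampling
```

With these illustrative settings, daily subsampling retains enough statistical power to detect the imposed trend, while monthly subsampling leaves far fewer points and a much larger slope standard error.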
In the next step, we analyzed the temperature measurements of each sensor in detail to focus on the effects of using different sensor types to determine long-term temperature changes (referring to the factor “sensor-id” and “sensor-id × sampling time” in the above analysis).
Calculated mean temperature increase per year (slope) from January 2013 to December 2017 measured with the different sensors. Shown are the mean slope values calculated from 100 virtual replicate sampling within the individual sampling schemes. Additionally, the standard deviations of the slope measurements are shown as whiskers.
In the last step, which is identical to the calculation of the predictive capacity of the different sampling schemes, we calculated the predictive capacity of each individual sensor to discriminate a real increase in water temperature over time from random temperature fluctuations (
Best practices and standards for aquatic monitoring have gained increasing attention in recent years (
According to our experience, however, such frameworks are often not written in an operational way that allows easy implementation in a concrete data workflow and are therefore not used to their full effect in the ecological community. Their sometimes high level of abstraction prevents the often well-designed but theoretical procedures from being implemented in the data workflow: they do not make their way into operational science because scientists cannot adapt the suggested procedures to their specific applications. These problems in the translation from theoretical data quality considerations to operational science can be observed at many levels of operational monitoring and, unfortunately, sometimes prevent the comprehensive implementation of described data quality workflows in operational scientific work.
Our results from the
The sensor systems used in our intercomparison experiment varied in measurement technique and price, from high-end FerryBox systems and multiparameter
Another remarkable finding of the intercomparison experiment was that the behavior of each sensor is sensor-specific, and even sensors of identical type and manufacturer sometimes do not show the same behavior or provide different data under defined experimental conditions (e.g., DEV 8, 9, and 10). It is often assumed that periodic calibration ensures accurate and precise data, as a known reference standard with high accuracy is used for the calibration. In our experiments, all participants confirmed that their sensors were in a calibrated state; thus, we assumed that all sensors were well calibrated and ready for accurate use. The results of the experiments indicate, however, that errors can persist even after proper calibration. Calibrated sensors are assumed to be initially true and to have a bias smaller than their precision error. Our results clearly show that calibrated sensors need to be checked against each other frequently, especially for comparative measurements with multiple sensors, for example, on joint cruises with different ships or synoptic measurements at different places.
This also holds true for the accuracy and precision values of new sensors from manufacturer datasheets. These values are often only valid for a brand-new sensor and sometimes do not reflect the specific sensor but only the sensor type. In this case, the question arises whether such reported manufacturer values are trustworthy for field experiments. On the other hand, it has been reported that in some sensors the accuracy and precision actually improve over time as the sensor stabilizes, while in other cases the accuracy and precision were observed to deteriorate over time, e.g., when the battery power decreases below a certain threshold. This shows that intercomparison experiments, as well as proper sensor preparation prior to field campaigns, should be standard routine to assess and document sensor performance before each campaign, and that these operational “metadata” of the sensor should be accessible for later data analysis.
Our experiments show that during the planning and implementation phase of measuring or monitoring programs with multiple (different) sensors, and in programs where different institutions with different sensor handling procedures are involved, it is highly recommended to perform intercomparison experiments. Such experiments are easy to perform, foster information and knowledge exchange and transfer among sensor operators, and help to select suitable sensors with regard to resolution, accuracy, and price. Our analysis showed that even low-cost sensors can be suitable, and their low price allows the implementation of a measurement array at the same cost as a single, more expensive sensor. These considerations, including the respective accuracy and precision information, have to be properly documented in the data’s metadata, especially when data are submitted to global datasets. In this context, it must be considered, however, that data of lower accuracy and precision, even though sufficiently accurate and precise for local scientific questions, may compromise global datasets obtained with more accurate and precise sensors. Therefore, data portal administrators should be especially aware of such considerations when accepting data from various institutions using different sensor types and deployment methods.
In addition, the observed variability between sensors of the same type from the same manufacturer in our experiments supports the need for intercomparison experiments to assess reliability across sensors of the same type. In particular, larger research communities with different departments and cooperation partners need to establish standardized facilities to compare sensors and to carry out standardized calibrations with defined reference values. This information on data intercompatibility is necessary for data blending or common analysis and interpretation, and therefore contributes to the FAIR principles (
In addition to proper sensor selection, the re-analysis of the Svalbard dataset from 2012 to 2017 revealed that the overall measurement strategy, in particular the sampling frequency, is crucial for a statistically reliable discrimination of long-term interannual temperature changes. In particular, for long-term field measurements over several years, setting up the sampling scheme must include not only accuracy and precision considerations of the sensors themselves but also the long-term availability of the workforce on site for sensor maintenance, possible weather constraints preventing sampling for some time, and possible temporal or spatial restrictions with respect to access to the area. While scientists often want a strict sampling plan with fixed sampling days or even hours at the highest possible temporal frequency, the station logistics personnel who have to conduct the sampling in the field prefer the sampling plan to be as flexible as possible to fit their daily, weekly, or monthly routines, as well as their preferred field times. Unfortunately, such discussions are often not based on in-depth knowledge or evaluation of the consequences of the proposed sampling scheme for data reliability and data quality for a certain question, but rather follow the “experience” factor of the scientist or the “feasibility” factor of the station personnel.
A proper long-term reliable sampling plan and the respective preparation including all the above-mentioned technical, human, and legal points will facilitate the long-term success of a monitoring program and will better focus on the scientific question, instead of technical or logistic issues.
Our experiments revealed that the sampling frequency is most critical for the chance of determining long-term changes in a parameter (here temperature) with relevant statistical significance. We detected average increases in temperature over time in the shallow area of the Kongsfjorden ecosystem close to the settlement Ny-Ålesund between 0.1 and 0.4°C per year, depending on the sensor used. Using our best estimates based on the quality-controlled dataset, an average increase of 0.22°C per year was calculated. These values fit quite well with the overall estimate of the effect of global warming in the Arctic realm. Recent studies have shown a significantly faster increase in Arctic temperatures due to global warming than the global average, with Svalbard lying in the global hot-spot area in recent decades (
When looking at the ability to significantly detect the observed increase in water temperature by weekly sampling, the probability that the observed temperature increase over time reaches statistical significance (
When shifting to a monthly sampling scheme, the chance of detecting a significant increase over time was almost 0, independent of the sensor used and if the sampling was done on the same day of the month or on a random day of the month.
Summarizing these results, in our monitoring program, a sampling frequency lower than daily is inappropriate for discriminating random fluctuations in Arctic water temperature from a directional change in temperature over time.
Another interesting issue emerged when examining the results of sensor 13. Independent of the sampling scheme, this sensor revealed the highest temperature increase, with an average of 0.33°C year–1. Evaluating this value in the context of all other sensors and the quality-controlled dataset strongly suggests that it is a sensor-specific overestimation of the real temperature increase over time. This may be due to the larger number of measurement gaps caused by technical failures. It is well known that data gaps can confound underlying “real” trend signals in long-term datasets, especially when the overall time period covered is relatively short and the data have a pronounced seasonality (
This result may be explained by the method of linear regression, as the calculation of the statistical significance of a slope is done by analyzing the increase in the measured value over time using
In contrast, in the monthly scenarios m-1, m-2, and m-3, almost all calculated slopes were insignificant, except for the 0.33°C year–1 increase from sensor 13, which showed a
Summarizing the observed patterns for the hourly, daily, weekly, and monthly sampling schemes, a consistent picture emerges. While hourly and daily sampling provided stable results independent of the sensor and independent of the aggregation procedure (minimum, mean, or maximum values), weekly sampling may show significant results in long-term temperature changes over time; however, these results are highly sensor-dependent and are potentially associated with a high probability of error. In our analysis, monthly sampling schemes did not provide significant results for long-term temperature changes over time, independent of the sensor used and the sampling scenario.
Sampling aquatic environments, especially in remote areas, is time-consuming and expensive. Our results showed that for the Svalbard dataset, only hourly and daily sampling are reliable sampling strategies for monitoring long-term changes in water temperature in climate change monitoring programs. Even daily sampling programs based on discrete water samples, for example, from a small ship or from a pier or any other access point to the water, are not practical, even when considering a year-round operated research base in the Arctic, such as the AWIPEV research base in Ny-Ålesund used for our study. Winter conditions with extreme outside temperatures and Arctic polar nights make such a human-based sampling program infeasible. Furthermore, measuring further environmental parameters such as pH or chlorophyll a by discrete water sampling is also not feasible on a daily basis, even in more friendly environments, as the workload is too high and hiring extra personnel for such programs is often not possible. Cabled observatories are often assumed to be expensive and technically demanding. However, when considering the financial expense and workload required for daily sampling based on human operation, cabled observatories are often more cost-effective, not only in remote areas. Cable-connected, fully automated sampling facilities have become operational standard over the last decade, and data handling procedures for quality control and storage have been developed and established in most scientific institutions. Considering these technological developments and the finding from this study that at least daily sampling is required for sufficient statistical power to discriminate random fluctuations in water temperature from directional changes over time, observatory technology with sensors measuring at least at daily resolution is a cost-efficient and reliable method for environmental monitoring.
Our results are transferable to other aquatic research questions and to non-polar regions. Increases in surface water temperature constitute a global challenge and are monitored in many coastal and inland waters. Hence, it is important to evaluate sensor behavior and to provide well-designed and feasible sampling schemes.
However, these results do not address the problem that sensor-based measurements have a higher potential for bias than discrete water samples. Therefore, we propose a synergistic approach: sensor-based measurements at a frequency of at least daily, combined with a regular discrete sampling scheme several times per year to validate the sensor data and ensure the high accuracy of the continuous record. In our experience, such validation by discrete water samples must be performed at pre-defined intervals, depending on the variable and the environment, as well as on the data quality requirements.
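Such a validation step can be sketched as a simple offset estimate between discrete reference samples and the nearest continuous sensor readings. The following is a minimal, illustrative sketch with entirely hypothetical values (the function name `estimate_offset` and all numbers are assumptions for illustration, not from the study):

```python
import statistics

# Hypothetical continuous sensor record: timestamp in hours -> temperature (°C)
sensor = {0: 3.92, 6: 3.95, 12: 4.01, 18: 3.98, 24: 4.05}

# Hypothetical discrete validation samples (e.g., bottle samples measured
# with a reference thermometer) taken at irregular times
reference = [(5, 3.80), (13, 3.86), (23, 3.90)]

def estimate_offset(sensor, reference):
    """Mean sensor-minus-reference difference at the nearest sensor reading."""
    diffs = []
    for t_ref, temp_ref in reference:
        t_near = min(sensor, key=lambda t: abs(t - t_ref))
        diffs.append(sensor[t_near] - temp_ref)
    return statistics.mean(diffs)

offset = estimate_offset(sensor, reference)  # systematic bias of the sensor
```

In practice, a persistent non-zero offset estimated this way would trigger recalibration or a correction of the continuous record; the pairing rule (nearest reading vs. interpolation) would be chosen per variable.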
Our experiments show that differences in temperature measurements with different sensors are within the order of magnitude of the expected temperature increase in the Arctic Ocean.
The paper shows that the selection of suitable sensors is essential to meet previously defined scientific tasks. Two main scientific tasks drive sensor selection: resolving absolute differences (requiring high accuracy) and resolving trends (requiring high precision). Consequently, a comprehensive evaluation of the accuracy and precision of sensors is required, even after successful calibration. Usually, sensors are assumed to be initially true and to have a bias smaller than their precision error after calibration. However, sensor characteristics also depend on the prevailing environmental conditions, proper handling routines, and sensor age, and vary within a specific range. With rapid changes in environmental conditions, the functionality of sensors must be maintained to provide data of consistently high quality. Furthermore, thorough theoretical knowledge of the possible impacts of a sensor’s accuracy and precision on the usability of a dataset for a specific scientific question is required: a highly precise but not very accurate sensor may yield numerically false values in long-term trend studies, while a highly accurate but not very precise sensor may fail to discriminate the smallest-scale temperature differences, e.g., in studies of water column stratification.
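The distinction between accuracy (low bias) and precision (low scatter) can be illustrated with a minimal simulation of two hypothetical sensors measuring a known reference temperature; all values below are invented for illustration and do not represent any sensor from the intercomparison experiment:

```python
import random
import statistics

random.seed(42)  # fixed seed for reproducibility

TRUE_TEMP = 4.00  # °C, hypothetical reference bath temperature
N = 1000          # number of repeated readings

# Sensor A: highly precise but inaccurate (tiny noise, constant +0.15 °C bias)
sensor_a = [TRUE_TEMP + 0.15 + random.gauss(0, 0.01) for _ in range(N)]
# Sensor B: highly accurate but imprecise (no bias, ten-fold larger noise)
sensor_b = [TRUE_TEMP + random.gauss(0, 0.10) for _ in range(N)]

bias_a = statistics.mean(sensor_a) - TRUE_TEMP   # accuracy: ~ +0.15 °C
bias_b = statistics.mean(sensor_b) - TRUE_TEMP   # accuracy: ~ 0 °C
prec_a = statistics.stdev(sensor_a)              # precision: ~ 0.01 °C
prec_b = statistics.stdev(sensor_b)              # precision: ~ 0.10 °C
```

Sensor A would track relative changes (trends, stratification) very well while reporting false absolute values; Sensor B gives correct absolute values on average but blurs small differences between readings.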
Exact knowledge of the variability and influence of the sensors used is important to ensure reliable data interpretation. It is evident that the transfer of theoretical concepts and corresponding sensor calibration data from the laboratory to operational monitoring is complex. Therefore, intercomparison experiments provide an opportunity to assess the variability of various sensors under changing experimental conditions and thus provide valuable information for deciding which type of sensor is suitable for a specific task.
The intercomparison experiment data discussed in this paper indicate that low-cost sensors do not necessarily have lower measurement quality than expensive sensors in terms of accuracy and precision. Low-cost sensors may allow the deployment of multiple sensors in clusters, which is ideal in some cases. In addition, the authors recommend considering the use of multiple sensors, even from different manufacturers. Whether multiple low-cost sensors are better than one expensive sensor depends in large part on the primary scientific question addressed with the measurements, but also on whether the data will later be integrated into a global database with predefined accuracy and precision requirements.
Our long-term evaluation of the Svalbard data shows that, for the reliability of the statistical analysis, the sampling scheme is more important than the sensor characteristics, particularly accuracy and precision. Hence, the sampling frequency is the most sensitive attribute for detecting long-term, statistically significant changes. In this context, it is important to consider that although the highest possible sampling frequency is desirable to maximize the statistical power of the analysis of the target dataset, in operational practice the sampling frequency can be limited by technical aspects such as the battery lifetime of autonomous sensors. Especially in these cases, it is very important to know the statistical consequences of different sampling frequencies for the later data analysis. Choosing a frequency that is too low due to technical limitations may mean that the scientific question cannot be answered with adequate statistical significance, and thus the entire sampling program may have been in vain. A statistically justified determination of the minimum sampling frequency should therefore always take precedence over technical constraints. Another issue to consider in this context is the continuity of datasets: larger gaps in particular may considerably confound the statistical output of long-term trend analyses. Unfortunately, research on the consequences of data gaps in environmental datasets is limited, and this issue warrants further investigation.
Regarding the definition of the sampling frequency, our statistical analysis of the Svalbard data showed that with hourly and daily sampling rates, long-term temperature trends could be detected reliably and accurately. Only hourly and daily sampling delivered reliable, stable, and comparable results with respect to the temperature increase over time. With weekly sampling, a similar overall trend was not evident and the uncertainty in detecting it was much higher, as random factors introduced by the sampling procedure may confound the results. With even lower sampling frequencies, no significant temperature trend could be detected.
Nevertheless, suitable sensor selection is crucial. A slightly lower temporal sampling resolution of 1 week, whether using discrete data from single sampling events or integrated data with mean values, can yield diverse results, spanning from non-significant to highly significant, depending on the sensor used. With an unsuitable sensor, an increase in water temperature of up to 0.33°C year–1 was derived, a 57% higher estimate of the long-term temperature increase compared with the average increase of 0.21°C year–1 across all sensors. For climate projections, this difference is substantial, and mitigating it is essential for reliable interpretation.
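The 57% figure follows directly from the two slope estimates quoted above:

```python
unsuitable = 0.33  # °C/year, estimate from the unsuitable sensor (from the text)
average = 0.21     # °C/year, mean estimate across all sensors (from the text)

# Relative overestimation of the trend by the unsuitable sensor
overestimate = (unsuitable - average) / average  # 0.12 / 0.21 ≈ 0.571, i.e., ~57%
```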
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.
PF, UK, and PD coordinated the production of the manuscript. All authors collaborated on the manuscript and provided critical feedback on the experiments, the analyses and the manuscript.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
This work was supported by funding from the Helmholtz Association in the framework of Modular Observation Solutions for Earth Systems (MOSES). We acknowledge funding from the Initiative and Networking Fund of the Helmholtz Association through project “Digital Earth” (funding code ZT-0025). This project made use of the facilities that are part of the JERICO-S3 project, which is funded by the European Commission’s H2020 Framework Programme under grant agreement No. 871153. Project coordinator: Ifremer, France.
We thank the Zentrum für Aquakulturforschung (ZAF) for providing their facilities for the sensor intercomparison experiment. We highly appreciate the support of the AWI Computing and Data Centre for the year-round maintenance of the Svalbard observatory. Finally, we thank the two reviewers AN’Y and RV for their extremely constructive and helpful comments on the first version of the manuscript.
The Supplementary Material for this article can be found online at: