Perspectives on Citizen Science Data Quality

Downs, Robert R.; Ramapriyan, Hampapuram K.; Peng, Ge; Wei, Yaxing

doi:10.3389/fclim.2021.615032

PERSPECTIVE article

Front. Clim., 09 April 2021

Sec. Climate Risk Management

Volume 3 - 2021 | https://doi.org/10.3389/fclim.2021.615032

This article is part of the Research Topic: Open Citizen Science Data and Methods

Robert R. Downs¹*

Hampapuram K. Ramapriyan²,³

Ge Peng⁴

Yaxing Wei⁵

¹NASA Socioeconomic Data and Applications Center, Center for International Earth Science Information Network, The Earth Institute, Columbia University, Palisades, NY, United States
²Science Systems and Applications, Inc., Lanham, MD, United States
³Earth Science Data and Information System Project, Goddard Space Flight Center, NASA, Greenbelt, MD, United States
⁴Earth System Science Center/NASA Marshall Space Flight Center (MSFC) Interagency Implementation and Advanced Concepts Team (IMPACT), The University of Alabama in Huntsville, Huntsville, AL, United States
⁵Environmental Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, United States

Information about data quality helps potential data users to determine whether and how data can be used and enables the analysis and interpretation of such data. Providing data quality information improves opportunities for data reuse by increasing the trustworthiness of the data. Recognizing the need for improving the quality of citizen science data, we describe quality assessment and quality control (QA/QC) issues for these data and offer perspectives on aspects of improving or ensuring citizen science data quality and for conducting research on related issues.

Introduction

Citizen science (CS) is recognized as having broad potential benefits to society. Citizen science projects are providing unique and sometimes fundamental scientific insights and offer a wide variety of scientific outcomes (Pettibone et al., 2017; Paul et al., 2018; Wiggins et al., 2018; Bautista-Puig et al., 2019; Miller et al., 2019; van Etten et al., 2019). Citizen science also offers opportunities for efficiently collecting data that otherwise might not be obtainable in a practical manner (Li et al., 2019; Van Eupen et al., 2021). Citizen science data (CSD) provides valuable environmental measurements and observations that can be used independently and in conjunction with other data products and services to improve research and decision making capabilities (Robinson et al., 2018; Poisson et al., 2020). Especially given the increased opportunity to supplement traditional scientific data with CSD, it is essential that the CSD be as trustworthy and of known quality as other scientific data (Swanson et al., 2016; Aceves-Bueno et al., 2017; Budde et al., 2017; Burgess et al., 2017; Kallimanis et al., 2017; Steger et al., 2017; Sandahl and Tøttrup, 2020). Information about the quality of CSD builds trust, provides opportunities for potential users to discover CSD that are appropriate for their purposes, and enables users to determine whether and how the data can be used to meet their objectives (Alabri and Hunter, 2010; Hunter et al., 2013; Freitag et al., 2016; Lukyanenko et al., 2016; Stevenson, 2018; Anhalt-Depies et al., 2019). The quality of CSD also can influence the analysis and interpretation of the data (Kelling et al., 2015; Clare et al., 2019). Quality information is important for scientific data, including CSD (Roman et al., 2017; Gharaibeh et al., 2019). 
Citizen science data contributes to many scientific endeavors that are important for environmental science and for the well-being of society, including sustainable development, humanitarian efforts, and disaster prevention and response (Hicks et al., 2019; Fraisl et al., 2020). Providing data quality information can improve opportunities for CS to contribute to important societal efforts and to the reuse of CSD (Kosmala et al., 2016; Hecker et al., 2019; Shanley et al., 2019).

While CS initiatives offer possibilities for obtaining observations and gathering data that supplement traditional data collection on important environmental issues, there is healthy skepticism about the quality of CSD (Brown and Williams, 2019; Cross, 2019). Fritz et al. (2019) indicate that uncertainty regarding quality of the data is a major barrier to the use of CSD, despite their value for the United Nations Sustainable Development Goals (SDGs). They also provide examples of several activities where steps have been taken to ensure that CSD are of high (and known) quality. Earp and Liconti (2020) describe the disparity between benefits of using marine CSD for research and perceptions of quality. Incompatible design of CS studies and inconsistencies in nomenclature also can affect data quality, resulting in challenges for integrating data from different CS programs (Campbell et al., 2020). User interfaces of digital tools provided to participants also can affect CSD quality (Sharma et al., 2019; Torre et al., 2019). Studying CSD management practices, Bowser et al. concluded: “While significant quality assurance/quality control (QA/QC) checks are taken across the data lifecycle, these are not always documented in a standardized way” (Bowser et al., 2020, p. 12). Recognizing a perceived bias among scientists regarding the use of CSD, Albus et al. (2019) reviewed comparison studies that were conducted on volunteer and professional data collection efforts for large-scale water quality projects, concluding that more comparison studies are needed and that such studies should include accuracy, while controlling for variations among the datasets that are compared.

Considering such concerns about the quality of CSD, as well as other data, and how data quality can affect data and their use, the Earth Science Information Partners (ESIP) Information Quality Cluster (IQC) is attempting to provide recommendations on practices to help ensure or improve CSD quality and build trust for CSD in the scientific community. This manuscript aims to lay out ESIP IQC's perspectives on the existing challenges and important aspects of CSD quality that should be tackled by the community in the near future.

In section ESIP Information Quality Cluster, activities of the ESIP Information Quality Cluster, relevant to CSD, are introduced along with four quality dimensions that occur throughout the data lifecycle. Section Challenges and Approaches for Improving CSD Quality introduces challenges, directions, and approaches for improving the quality of CSD. The first subsection offers a brief overview of opportunities for improving CSD quality during the recruitment, selection, self-selection, and training of CS volunteers. The second subsection describes selected issues that pertain to transparency of information about QA/QC practices during the production of CSD. The third subsection describes the importance of documenting CSD quality. The fourth subsection describes the importance of and need for establishing rubrics for evaluating CSD quality levels. Section Discussion concludes the paper with a discussion of these CSD quality issues and offers recommendations for progressively improving the quality of CSD.

ESIP Information Quality Cluster

The ESIP IQC studies and promotes the awareness of data and information quality (Ramapriyan et al., 2017). Like other ESIP Collaboration Areas (ESIP, 2020), the IQC reflects perspectives of various partner organizations that contribute to the collection, curation, dissemination, and interdisciplinary use of Earth science data. IQC activities include regular meetings, workshops, conference sessions, white papers, and journal publications. IQC activities also leverage the work of the NASA Earth Science Data System Working Group (ESDSWG) on Data Quality, which was active during 2014–2019 and completed its recommendations to the NASA Earth Science Data and Information System Project (NASA, 2020a). The IQC also organized sessions on CS during recent ESIP meetings. Directly related to data quality concerns for CS and other types of studies, the IQC recently began developing guidelines for documenting and enabling the sharing and reuse of data quality information (Peng et al., 2020). The strength of the IQC lies in its membership, which consists of experts in data and information quality from various organizations and disciplines; collaboration among these experts creates synergy for developing recommendations with broad applicability.

Challenges and Approaches for Improving CSD Quality

Applying CSD can be problematic if researchers and other users are not aware of data quality issues that could affect their analyses, contributions, or operational uses. However, there are several challenges for improving CSD quality. Assessing CSD quality can be extremely difficult due to heterogeneous observers and methods and lack of information about such methods. In particular, data bias, errors, uncertainty, and ethical issues pose challenges that should be assessed regularly as part of CS research projects. These and other challenges that occur throughout the data lifecycle are being investigated in an effort to improve the quality of CSD.

Taking a lifecycle approach can help CSD investigators to consider data quality issues and improve the information about data quality that is recorded and provided to users along with the data. The term, data lifecycle, has been defined variously with different levels of detail by different groups. For example, at a very high level, the NOAA Environmental Data Management Framework shows three types of activities—Planning and Production, Data Management, and Usage—in that order, but with feedback from each to the previous type of activity (NOAA, 2013). The US Geological Survey (USGS) defines a science data lifecycle model consisting of the following activities: “Plan, Acquire, Process, Analyze, Preserve and Publish/Share” (Henkel et al., 2015), with cross-cutting activities including “Describe (including metadata and documentation), Manage Quality, and Backup and Secure” (Henkel et al., 2015), thus emphasizing that management of quality cuts across all parts of the lifecycle (Faundeen et al., 2013). Strasser et al. (2012, p. 3) define a data lifecycle with eight components: “Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, and Analyze.” Ramapriyan et al. (2017) consider information quality (i.e., quality of information about data quality) throughout the entire lifecycle to be four-dimensional. These dimensions, also referred to as aspects of information quality, are: 1. Scientific quality, 2. Product quality, 3. Stewardship quality, and 4. Service quality. Activities that focus on these four dimensions can be regarded as constituting four stages in the lifecycle. The specific activities of the four stages and their mappings to the four dimensions are: “1. Define, develop, and validate; 2. Produce, assess, and deliver (to an archive or data distributor); 3. Maintain, preserve, and disseminate; and 4. Enable data use, provide data services and user support” (Ramapriyan et al., 2017). 
Figure 1 depicts data lifecycle stages with each of these activities represented within the four quality dimensions.

Figure 1. Information quality dimensions and data lifecycle stages.

Regardless of the terminology used and the level of detail into which the data lifecycle is subdivided, it is important that data quality be characterized and documented within each stage of the lifecycle. For convenience of discussion, stages 1–4, as defined above in terms of the four quality dimensions, are used in the following four subsections (Recruitment, Selection, Self-Selection, and Training of CSD Contributors; Transparency in Information About QA/QC Practices During the Data Production Process; Documenting Data Quality to Facilitate Discovery and Reuse; and Establishing Rubrics for Evaluating Quality Levels of CSD) to indicate when the recommended actions need to be taken during CSD projects.

Information about the quality of data, including CSD, should be recorded throughout the data lifecycle to improve data for potential use and reuse. Effective planning is critical to the success of a CS project (Freitag et al., 2016) and improved data stewardship (Peng et al., 2018). Considering data quality during the earliest stages of the data project can improve planning and enable the research team to identify issues that could affect data quality later during the project. A framework for data quality issues to be considered while planning and designing CSD research is offered by Wiggins et al. (2011) for applying data quality and validation methods throughout the research process. In particular, when planning the CSD project, the questions and techniques identified by Kosmala et al. (2016) provide a good starting point for investigators and also provide considerations that can be assessed by evaluators and users of CSD. Such planning would be applicable to CS projects that involve a small number of volunteers as well as to large-scale projects, such as those that were the focus of the study conducted by Albus et al. (2019). A white paper has been developed by NASA's Citizen Science Data Working Group, for the benefit of researchers desiring to incorporate CS and crowdsourcing into their projects (NASA, 2020b). While this white paper is targeted for NASA-funded researchers in the Citizen Science for Earth Science Program, the discussion in the paper is relevant to a much broader audience. Many aspects of CSD management are addressed in this white paper, including a significant amount of detail describing how information about data quality should be handled.

The ESIP IQC recognizes some of the challenges in and potential approaches to addressing these data quality issues that are pertinent to CSD. These are discussed in more detail within the following subsections.

Recruitment, Selection, Self-Selection, and Training of CSD Contributors

Bias, errors, uncertainty, and ethical issues can be addressed through well-designed and documented procedures and through proper training that provides volunteers with instructions and written procedures for fieldwork. For studies that involve large numbers of volunteers in additional aspects of the research process besides data collection, training of volunteers contributes to QA (Wilderman and Monismith, 2016). Investigators should consider sources of potential bias when recruiting CS participants, and training should cover recognizing the potential for errors, the proper use of instruments, and techniques for reducing and flagging data uncertainty. Developing a data collection instrument and recruiting volunteers to use the instrument in the field provides opportunities to identify enhancements that can improve the quality of data collected by future volunteers (Compas and Wade, 2018). When engaging volunteers, protecting indigenous people and privacy also must be considered (Bowser et al., 2017; Carroll et al., 2019; Global Indigenous Data Alliance, 2019). Human research subject protections further reduce risks (Resnik, 2019). The NASA Earth Science Data Systems CSD Working Group also offers guidance on these and other relevant issues (NASA, 2020b).

Citizen science data quality efforts for recruitment, selection, self-selection, and training should be initiated during stage 1 (science quality focus) of the data lifecycle, when defining, developing, and validating CSD. These activities also should be pursued during subsequent stages.

Transparency in Information About QA/QC Practices During the Data Production Process

Uncorrected errors, missing data, and undocumented corrections and modifications could influence findings resulting from the analysis of CSD. Such lack of transparency could result in lost time when exploring whether to use the data. Identified usage limitations should be recorded and, when possible, addressed during research design. Similarly, appropriate uses of data should be identified to reduce the potential for misuse. Verification procedures should be planned and conducted to ensure correctness of data values. Completeness should be ensured by reducing the potential for missing values.
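As an illustration of the verification and completeness checks described above, the following minimal Python sketch screens a handful of records. The record structure, the field name water_temp_c, the plausibility bounds, and the issue codes are all hypothetical and are not drawn from any particular CS project; an actual project would define them during research design.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Observation:
    observer_id: str
    # None means the volunteer left the field blank (hypothetical convention).
    water_temp_c: Optional[float]


# Hypothetical plausibility bounds; a real project would set these
# per instrument and per site.
VALID_RANGE_C = (-2.0, 40.0)


def check_observation(obs: Observation) -> List[str]:
    """Return QC issue codes; an empty list means no issue was detected."""
    issues: List[str] = []
    if obs.water_temp_c is None:
        issues.append("missing_value")
    elif not (VALID_RANGE_C[0] <= obs.water_temp_c <= VALID_RANGE_C[1]):
        issues.append("out_of_range")
    return issues


def completeness(observations: List[Observation]) -> float:
    """Fraction of records that carry a recorded (non-missing) value."""
    if not observations:
        return 0.0
    present = sum(1 for o in observations if o.water_temp_c is not None)
    return present / len(observations)


records = [
    Observation("v01", 12.5),
    Observation("v02", None),
    Observation("v03", 95.0),
    Observation("v04", 14.1),
]

# Issue codes are stored next to each record instead of silently
# correcting or dropping values, so every intervention stays documented.
qc_report: Dict[str, List[str]] = {
    o.observer_id: check_observation(o) for o in records
}
```

Keeping the issue codes with the records, rather than overwriting values, is one way to avoid the undocumented corrections and modifications noted above.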

Deploying automated verification and parsing to address data quality issues also could reduce the potential for human errors. However, human oversight is recommended to avoid potential pitfalls of fully-automated systems, such as underestimating extremes. In addition, increasing transparency about pitfalls that have compromised the quality of CSD can avoid a cycle of repeating failures in CS research (Balázs et al., 2021). Enabling volunteers to contribute to transparent validation of observations also contributes to the improvement of CSD quality and to the motivation of contributors (Bonnet et al., 2020).
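One way to combine automated verification with human oversight is to have the automated step flag unusual values for review instead of deleting them, so that a genuine extreme is not discarded by the system. The sketch below uses a robust score based on the median absolute deviation; the choice of this technique, the 0.6745 scaling constant, and the 3.5 threshold follow a common convention and are assumptions made for illustration, not a method prescribed by the sources cited here.

```python
from statistics import median
from typing import List, Sequence


def flag_for_review(values: Sequence[float], threshold: float = 3.5) -> List[int]:
    """Return indices of values whose robust score exceeds the threshold.

    Flagged values are kept in the dataset; a human reviewer decides
    whether each one is an error or a real extreme observation.
    """
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)
    if mad == 0:
        # At least half the values equal the median, so the robust score
        # is undefined; nothing is flagged automatically.
        return []
    return [i for i, d in enumerate(abs_dev) if 0.6745 * d / mad > threshold]


# Hypothetical readings submitted by volunteers at one site.
readings = [10.1, 10.4, 9.8, 10.0, 10.3, 25.0]
review_queue = flag_for_review(readings)  # indices routed to a human reviewer
```

Because the function returns indices and leaves the input untouched, the decision to correct, annotate, or retain a flagged value remains with a person and can be documented.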

Considering that CSD is produced largely from voluntary contributions, it is also critical to be transparent about other aspects of CSD that can facilitate use, especially when designating CSD as open data. Providing simple language that enables users to understand their intellectual property rights for using CSD facilitates their use as open data. Ideally, such language should describe permissive intellectual property rights that eliminate restrictions on the use of the data and the documentation (Anhalt-Depies et al., 2019).

Facilitating transparency of information about QA/QC practices should be completed as part of stage 1 (focus on science quality) and stage 2 (focus on product quality) of the data lifecycle. Such transparency also should be facilitated during subsequent stages.

Documenting Data Quality to Facilitate Discovery and Reuse

Describing the quality of CSD in documentation and metadata improves its potential for use and improves capabilities for assessing whether data are appropriate for reuse by those who did not participate in the original study that collected the data. Furthermore, describing data quality can improve the interoperability and integration of CSD with other data. Documentation of CSD also should describe provenance for collection, validation, curation, dissemination, and use of the data. As data originators, the roles and responsibilities of investigators and volunteer observers for ensuring and documenting the scientific quality of data should be defined (e.g., Peng et al., 2016).

Relevant guidance on practices for managing data also delineate the importance of documenting data quality. These include the FAIR Principles (Wilkinson et al., 2016), the Group on Earth Observations System of Systems (GEOSS) Data Management Principles (Group on Earth Observations, 2016), the TRUST Principles for Digital Repositories (Lin et al., 2020), and data maturity models (Peng et al., 2019).

Data quality documentation should be conducted throughout all four stages of the data lifecycle. The development of data quality documentation should be initiated early during stage 1, delivered to a repository during stage 2, disseminated along with the data during stage 3, and used to support use of the data in stage 4.
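The staged documentation described above could be tracked with a simple record that accumulates quality notes as a dataset moves through the lifecycle. The following Python sketch is illustrative only: the class name, field names, dataset identifier, and example notes are hypothetical, while the stage numbers and dimension names follow the four quality dimensions of Ramapriyan et al. (2017) as summarized earlier.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Stage number -> quality dimension, as summarized in the text.
STAGES: Dict[int, str] = {
    1: "science quality",
    2: "product quality",
    3: "stewardship quality",
    4: "service quality",
}


@dataclass
class QualityRecord:
    """Hypothetical container for quality notes attached to one dataset."""
    dataset_id: str
    entries: List[dict] = field(default_factory=list)

    def add(self, stage: int, note: str) -> None:
        """Append a quality note for a given lifecycle stage."""
        if stage not in STAGES:
            raise ValueError(f"unknown lifecycle stage: {stage}")
        self.entries.append(
            {"stage": stage, "dimension": STAGES[stage], "note": note}
        )

    def undocumented_stages(self) -> List[int]:
        """Stages for which no quality information has been recorded yet."""
        documented = {e["stage"] for e in self.entries}
        return sorted(set(STAGES) - documented)


record = QualityRecord("example-csd-dataset")  # hypothetical identifier
record.add(1, "Sampling protocol validated against reference measurements.")
record.add(2, "Automated range checks applied; flagged values reviewed.")
pending = record.undocumented_stages()
```

A check such as undocumented_stages gives a repository or project team a quick view of where quality documentation is still missing before the data are disseminated.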

Establishing Rubrics for Evaluating Quality Levels of CSD

To enable and maximize the reuse of CSD in environmental research and other areas, it will be important to define easy-to-understand quality levels for CSD that address the specific needs of target user communities (e.g., researchers, decision supporters, and the general public). Establishing rubrics to evaluate CSD quality information against such quality levels will be consequential. For example, Balázs et al. (2021) recommend communicating data quality goals to volunteers and providing accessible training materials, guidance, and understandable instructions for data collection to improve the quality of CSD. Tredick et al. (2017) developed a rubric for evaluating CS programs. This structured rubric acknowledges the importance of CSD management, quality assurance, and information integrity to the success of a CS program. The BiodivERsA Citizen Science Toolkit For Biodiversity Scientists (Goudeseune et al., 2020) also described the evaluation of output, including data quality, as one of the ten key principles for successful CS. Vocabularies for CSD quality levels that link to the needs of diverse user communities, and rubrics to assess CSD against such vocabularies, are important next steps to maximize the scientific and societal benefits of CS programs.

Rubrics for information quality levels of CSD apply to the dimensions across all stages of the data lifecycle. However, it should be noted that the development of rubrics should be initiated very early during stage 1, and that such rubrics will support users during stage 4.
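To make the idea of a rubric concrete, the sketch below maps the quality information supplied with a dataset to a coarse quality level. It is purely illustrative: the four criteria, the level labels, and the thresholds are hypothetical placeholders and do not reproduce the rubric of Tredick et al. (2017) or any established vocabulary, which the community has yet to define.

```python
from typing import Dict, List, Tuple

# Hypothetical criteria; a real rubric would be agreed upon with the
# target user communities.
CRITERIA: List[str] = [
    "qa_qc_procedures_documented",
    "volunteer_training_described",
    "uncertainty_reported",
    "usage_rights_stated",
]


def quality_level(evidence: Dict[str, bool]) -> Tuple[str, List[str]]:
    """Return a hypothetical level label and the criteria not yet met."""
    missing = [c for c in CRITERIA if not evidence.get(c, False)]
    met = len(CRITERIA) - len(missing)
    if met == len(CRITERIA):
        label = "level 3"  # all criteria met
    elif met >= 2:
        label = "level 2"  # at least half of the criteria met
    else:
        label = "level 1"  # little quality information available
    return label, missing


level, gaps = quality_level(
    {
        "qa_qc_procedures_documented": True,
        "volunteer_training_described": True,
        "uncertainty_reported": False,
    }
)
```

Returning the unmet criteria together with the level tells data producers what to document next, and tells users which quality information is absent.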

Discussion

Enabling the use of CSD offers opportunities for new research projects to investigate issues while avoiding costly or redundant data collection. To allow for broad use of CSD, data QA/QC should be performed, and information about QA/QC procedures should be captured and conveyed to users. Since improving CSD quality offers opportunities for additional uses, data quality efforts should begin during project conceptualization and planning, continuing throughout the data lifecycle, to enable data reuse. Efforts to improve the quality of CSD should begin during stage 1, when science quality activities are performed and quality information is prepared when defining, developing, and validating the data. Citizen science data quality efforts should continue with stage 2, so that product quality information is prepared, assessed, and delivered along with the data to a repository for dissemination. Citizen science data quality information should be maintained, preserved, and disseminated with the data to ensure stewardship quality during stage 3. Providing quality information along with the data to provide service quality during stage 4 enables and supports the use of CSD.

Furthermore, documenting CSD quality can improve trust in CS within the scientific community and reflects ethical approaches to conducting CS. When preparing CSD for use, investigators should describe data quality in the metadata and data documentation, as well as in data papers and publications. Documentation should differentiate between various quality issues to avoid confusing potential users.

Consequently, we recommend employing a systematic approach for ensuring CSD quality. Future research should consider implications of data quality throughout the data lifecycle and data quality as it pertains to collecting CSD.

Data Availability Statement

The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.

Author Contributions

RD, HR, GP, and YW contributed to conception and design of the manuscript and wrote the first draft and sections of the manuscript. All the authors reviewed and revised the draft with beneficial edits, and approved the submitted version.

Funding

RD was supported by the National Aeronautics and Space Administration (NASA) under Contract 80GSFC18C0111 for operation of the NASA Socioeconomic Data and Applications Center (SEDAC). HR was supported under NASA Contract 80GSFC20C044 with Science Systems and Applications, Inc. GP was supported in part by NOAA under Cooperative Agreement NA19NES4320002 and by NASA under Cooperative Agreement NNM11AA01A. YW was supported by NASA under Interagency Agreement 80GSFC19T0039.

Conflict of Interest

HR is employed by the company Science Systems and Applications, Inc.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

This article reflects perspectives of the authors, who are members of the ESIP Information Quality Cluster (IQC) leadership team and appreciate the insight received from discussions among IQC members and from invited presentations on the CS programs at the U.S. agency level, including those at NASA and NOAA. The authors also appreciate the thoughtful comments and recommendations provided by the reviewers. The views expressed in the article do not represent the position of ESIP, its sponsors, the authors' employers, or their sponsors.

References

Aceves-Bueno, E., Adeleye, A. S., Feraud, M., Huang, Y., Tao, M., Yang, Y., et al. (2017). The accuracy of citizen science data: a quantitative review. Bull. Ecol. Soc. Amer. 98, 278–290. doi: 10.1002/bes2.1336