<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="review-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Big Data</journal-id>
<journal-title>Frontiers in Big Data</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Big Data</abbrev-journal-title>
<issn pub-type="epub">2624-909X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fdata.2022.871897</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Big Data</subject>
<subj-group>
<subject>Review</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Improving Data Quality in Clinical Research Informatics Tools</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>AbuHalimeh</surname> <given-names>Ahmed</given-names></name>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1670956/overview"/>
</contrib>
</contrib-group>
<aff><institution>Information Science Department, University of Arkansas at Little Rock</institution>, <addr-line>Little Rock, AR</addr-line>, <country>United States</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Lisa Ehrlinger, Software Competence Center Hagenberg GmbH, Austria</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Filipe Portela, University of Minho, Portugal; Omar Ali, American University of the Middle East, Kuwait</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Ahmed AbuHalimeh <email>aaabuhalime&#x00040;ualr.edu</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Data Mining and Management, a section of the journal Frontiers in Big Data</p></fn></author-notes>
<pub-date pub-type="epub">
<day>29</day>
<month>04</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>5</volume>
<elocation-id>871897</elocation-id>
<history>
<date date-type="received">
<day>08</day>
<month>02</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>29</day>
<month>03</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2022 AbuHalimeh.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>AbuHalimeh</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>Maintaining data quality is a fundamental requirement for successful, long-term data management. Providing high-quality, reliable, and statistically sound data is a primary goal of clinical research informatics. In addition, effective data governance and management are essential to ensuring accurate data counts, reports, and validation. As a crucial step of the clinical research process, it is important to establish and maintain organization-wide standards for data quality management to ensure consistency across all systems designed primarily for cohort identification. Such systems allow users to perform an enterprise-wide search on a clinical research data repository to determine whether a set of patients meeting certain inclusion or exclusion criteria exists. Some of these clinical research tools are referred to as de-identified data tools. Assessing and improving the quality of data used by clinical research informatics tools are both important and difficult tasks. For an increasing number of users who rely on information as one of their most important assets, enforcing high data quality levels represents a strategic investment to preserve the value of the data. In clinical research informatics, better data quality translates into better research results and better patient care. However, achieving high-quality data standards is a major task because of the variety of ways that errors might be introduced in a system and the difficulty of correcting them systematically. Problems with data quality tend to fall into two categories. The first category is related to inconsistency among data resources, such as format, syntax, and semantic inconsistencies. The second category is related to poor ETL and data mapping processes. In this paper, we describe a real-life case study on assessing and improving data quality at a healthcare organization.
This paper compares the results obtained from two de-identified data systems, i2b2 and Epic Slicerdicer, and discusses the data quality dimensions specific to the clinical research informatics context and the possible data quality issues between the de-identified systems. It aims to propose steps/rules for maintaining data quality among different systems to help data managers, information systems teams, and informaticists at any healthcare organization monitor and sustain data quality as part of their business intelligence, data governance, and data democratization processes.</p>
</abstract>
<kwd-group>
<kwd>clinical research data</kwd>
<kwd>data quality</kwd>
<kwd>research informatics</kwd>
<kwd>informatics</kwd>
<kwd>management of clinical data</kwd>
</kwd-group>
<counts>
<fig-count count="1"/>
<table-count count="3"/>
<equation-count count="1"/>
<ref-count count="13"/>
<page-count count="6"/>
<word-count count="4172"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>Introduction</title>
<p>Data is the building block in all research, as results are only as good as the data upon which the conclusions were formed. However, researchers may receive minimal training on how to use the de-identified data systems and methods for achieving, assessing, or controlling the quality of research data (Nahm, <xref ref-type="bibr" rid="B8">2012</xref>; Zozus et al., <xref ref-type="bibr" rid="B13">2019</xref>).</p>
<p>De-identified data systems are defined as systems/tools that allow users to drag and drop search terms from a hierarchical ontology into a Venn diagram-like interface. Investigators can perform an initial analysis on the de-identified cohort. However, de-identified data systems have no features that indicate or help assess data quality; these systems only provide counts.</p>
<p>Informatics is the science of how to use data, information, and knowledge to improve human health and the delivery of healthcare services (American Medical Informatics Association, <xref ref-type="bibr" rid="B2">2022</xref>).</p>
<p>Clinical informatics is the application of informatics and information technology to deliver healthcare services. Examples include patient portals, electronic medical records (EMRs), telehealth, healthcare apps, and a variety of data reporting tools (American Medical Informatics Association, <xref ref-type="bibr" rid="B2">2022</xref>).</p>
<p>The case presented in this paper focuses on the quality of data obtained from two de-identified systems (Epic Slicerdicer and i2b2). The purpose of this paper is to discuss the quality of the data (counts) generated from the two systems, understand the potential causes of the data quality issues, and propose steps to improve the quality and increase the trust of the generated counts by comparing the accuracy, consistency, validity, and understandability of the outcomes from the two systems.</p>
<p>The proposed steps for maintaining data quality among different systems aim to help data managers, information systems teams, and informaticists at a healthcare organization monitor and sustain data quality as part of their business intelligence, data governance, and data democratization processes. The proposed quality improvement steps are generic and contribute essential steps toward automating data curation and data governance to tackle various data quality problems.</p>
<p>The remainder of this paper is organized as follows. The following sections introduce the importance of data quality to clinical research informatics, the case study goals, and the study methods and materials (Importance of Data Quality to Clinical Research Informatics, Case Study Goals, and Methodology sections). The findings and the proposed steps to ensure data quality are discussed in the Discussion section. Conclusions are drawn and the contribution of this work is discussed in the Conclusion section.</p>
</sec>
<sec id="s2">
<title>Importance of Data Quality to Clinical Research Informatics</title>
<p>Data quality refers to the degree to which data meets the expectations of data consumers and their intended use of the data (Pipino et al., <xref ref-type="bibr" rid="B9">2002</xref>; Halimeh, <xref ref-type="bibr" rid="B6">2011</xref>; AbuHalimeh and Tudoreanu, <xref ref-type="bibr" rid="B1">2014</xref>). In clinical informatics, this depends on the study conducted (Nahm, <xref ref-type="bibr" rid="B8">2012</xref>; Zozus et al., <xref ref-type="bibr" rid="B13">2019</xref>).</p>
<p>The meaning of data quality lies in how the data is perceived and used by its consumer. Identifying data quality involves two stages: first, highlighting which characteristics (Dimensions) are important (<xref ref-type="fig" rid="F1">Figure 1</xref>) and second, determining how these dimensions affect the population in question (Halimeh, <xref ref-type="bibr" rid="B6">2011</xref>; AbuHalimeh and Tudoreanu, <xref ref-type="bibr" rid="B1">2014</xref>).</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>De-identified data quality dimensions (DDQD).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fdata-05-871897-g0001.tif"/>
</fig>
<p>This paper focuses on a subset of data quality dimensions, which we term de-identified data quality dimensions (DDQD). We believe these dimensions should be the main consideration for maintaining data quality in de-identified systems because the absence of any of them will affect the overall quality of the data in those systems. These dimensions are described in <xref ref-type="table" rid="T1">Table 1</xref> below.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>De-identified data quality dimensions definitions (DDQD).</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left"><bold>Quality dimension</bold></th>
<th valign="top" align="left"><bold>Definition</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Accuracy</td>
<td valign="top" align="left">Refers to the degree to which information accurately reflects an event or object described<break/>How well does a piece of information reflect reality?</td>
</tr>
<tr>
<td valign="top" align="left">Completeness</td>
<td valign="top" align="left">Refers to the extent to which data is not missing and is of sufficient amount for the task at hand<break/>Does it fulfill the data consumer&#x00027;s expectations? Is the needed amount known?</td>
</tr>
<tr>
<td valign="top" align="left">Consistency</td>
<td valign="top" align="left">Refers to the extent to which the same data agrees across the places where it is stored<break/>Does information stored in one place match relevant data stored elsewhere?</td>
</tr>
<tr>
<td valign="top" align="left">Timeliness</td>
<td valign="top" align="left">Refers to the extent to which the data is sufficiently up to date for the task at hand<break/>Is the data available and up to date when you need it?</td>
</tr>
<tr>
<td valign="top" align="left">Validity</td>
<td valign="top" align="left">Refers to the extent to which information conforms to a specific format and follows business rules<break/>Is information in a specific format, does it follow business rules, or is it in an unusable format?</td>
</tr>
<tr>
<td valign="top" align="left">Understandability</td>
<td valign="top" align="left">Refers to the degree to which the data can be comprehended<break/>Can the user understand the data easily?</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>High-quality data and sound data management yield performance and efficiency gains and the ability to extract new insights. Poor clinical informatics data quality can cause problems throughout an organization, affecting the quality of research outcomes, healthcare services, and decision-making.</p>
<p>Quality is not a simple scalar measure but can be defined on multiple dimensions, with each dimension yielding different meanings to different information consumers and processes (Halimeh, <xref ref-type="bibr" rid="B6">2011</xref>; AbuHalimeh and Tudoreanu, <xref ref-type="bibr" rid="B1">2014</xref>). Each dimension can be measured and assessed differently. Data quality assessment implies providing a value for each dimension that indicates how much of the dimension or quality feature is achieved, to enable adequate understanding and management. Data quality and the discipline of informatics are inextricably interconnected. Data quality depends on how data are collected, processed, and presented; this is what makes data quality very important and sometimes complicated, because data collection and processing vary from one study to another. Clinical informatics data can include different data formats and types and could come from different resources.</p>
</sec>
<sec id="s3">
<title>Case Study Goals</title>
<p>The primary goal was to identify and understand discrepancies in patient counts between i2b2 and Epic Slicerdicer (Galaxy, <xref ref-type="bibr" rid="B5">2021</xref>). The secondary goal was to create a data dictionary that clinical researchers would easily understand. For example, if they wanted a count of patients with asthma, they would know (1) what diagnoses were used to identify patients, (2) where these diagnoses were captured, and (3) that this count matched existing clinical knowledge.</p>
<p>The case described below is from a healthcare organization that wanted the ability to ingest other sources of research-specific data, such as genomic information; the existing products did not have a way to do that. After deliberation, i2b2 (The i2b2 tranSMART Foundation, <xref ref-type="bibr" rid="B12">2021</xref>) was chosen as the data model for its clinical data warehouse. Prior to going live with users, however, it was essential to validate that the data in the Clinical Data Warehouse (CDW) was accurate.</p>
</sec>
<sec sec-type="methods" id="s4">
<title>Methodology</title>
<sec>
<title>Participants</title>
<p>The clinical validation process involved a clinical informatician, data analyst, and ETL developer.</p>
</sec>
<sec>
<title>Data</title>
<p>Many healthcare organizations use at least one of the three Epic databases (Chronicles, Clarity, and Caboodle). The data source used to feed the i2b2 and Slicerdicer tools was the Caboodle database.</p>
</sec>
<sec>
<title>Tools</title>
<p>The tools used to perform the study were i2b2 and Epic Slicerdicer.</p>
<p>i2b2: Informatics for Integrating Biology and the Bedside (i2b2) is an open-source clinical data warehousing and analytics research platform; i2b2 enables sharing, integration, standardization, and analysis of heterogeneous data from healthcare and research (The i2b2 tranSMART Foundation, <xref ref-type="bibr" rid="B12">2021</xref>).</p>
<p>Epic Slicerdicer: a self-service reporting tool that gives physicians ready access to clinical data that is customizable by patient populations for data exploration. Slicerdicer allows the user to choose and search a specific patient population to answer questions about diagnoses, demographics, and procedures performed (Galaxy, <xref ref-type="bibr" rid="B5">2021</xref>).</p>
</sec>
<sec>
<title>Method Description</title>
<p>The study was designed to identify and understand discrepancies in patient counts between i2b2 and Epic Slicerdicer (Galaxy, <xref ref-type="bibr" rid="B5">2021</xref>). We achieved this goal by choosing a task based on the nature of the tools.</p>
<p>The first step was to run the same query on patient demographics (race, ethnicity, gender) in both tools. This identified different aggregations of race and ethnicity in i2b2 compared with Slicerdicer, which was more granular. For example, Cuban and Puerto Rican values in Slicerdicer were included in the Other Hispanic or Latino category in i2b2. The discrepancies are shown in <xref ref-type="table" rid="T2">Table 2</xref>.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Patient demographic counts.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left"><bold>Patient counts</bold></th>
<th valign="top" align="center"><bold>i2b2</bold></th>
<th valign="top" align="center"><bold>Epic Slicerdicer</bold></th>
<th valign="top" align="center"><bold>% difference</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left" colspan="4"><bold>Race</bold></td>
</tr>
<tr>
<td valign="top" align="left">American Indian or Alaska Native</td>
<td valign="top" align="center">1,434</td>
<td valign="top" align="center">1,579</td>
<td valign="top" align="center">9%</td>
</tr>
<tr>
<td valign="top" align="left">Asian</td>
<td valign="top" align="center">7,051</td>
<td valign="top" align="center">7,480</td>
<td valign="top" align="center">6%</td>
</tr>
<tr>
<td valign="top" align="left">Asian Indian</td>
<td/>
<td valign="top" align="center">917</td>
<td/>
</tr>
<tr>
<td valign="top" align="left">Chinese</td>
<td/>
<td valign="top" align="center">177</td>
<td/>
</tr>
<tr>
<td valign="top" align="left">Filipino</td>
<td/>
<td valign="top" align="center">148</td>
<td/>
</tr>
<tr>
<td valign="top" align="left">Japanese</td>
<td/>
<td valign="top" align="center">30</td>
<td/>
</tr>
<tr>
<td valign="top" align="left">Korean</td>
<td/>
<td valign="top" align="center">62</td>
<td/>
</tr>
<tr>
<td valign="top" align="left">Other Asian</td>
<td/>
<td valign="top" align="center">6,146</td>
<td/>
</tr>
<tr>
<td valign="top" align="left">Black or African American</td>
<td valign="top" align="center">238,638</td>
<td valign="top" align="center">242,871</td>
<td valign="top" align="center">2%</td>
</tr>
<tr>
<td valign="top" align="left">Native Hawaiian or Other Pacific Islander</td>
<td valign="top" align="center">2,990</td>
<td valign="top" align="center">3,430</td>
<td valign="top" align="center">13%</td>
</tr>
<tr>
<td valign="top" align="left">Native Hawaiian</td>
<td/>
<td valign="top" align="center">170</td>
<td/>
</tr>
<tr>
<td valign="top" align="left">Guamanian or Chamorro</td>
<td/>
<td valign="top" align="center">21</td>
<td/>
</tr>
<tr>
<td valign="top" align="left">Samoan</td>
<td/>
<td valign="top" align="center">14</td>
<td/>
</tr>
<tr>
<td valign="top" align="left">Other Pacific Islander</td>
<td/>
<td valign="top" align="center">2,582</td>
<td/>
</tr>
<tr>
<td valign="top" align="left">Multiple race</td>
<td/>
<td valign="top" align="center"><xref ref-type="table-fn" rid="TN1"><sup>&#x0002A;</sup></xref></td>
<td/>
</tr>
<tr>
<td valign="top" align="left">Other</td>
<td valign="top" align="center">99,081</td>
<td valign="top" align="center">107,759</td>
<td valign="top" align="center">8%</td>
</tr>
<tr>
<td valign="top" align="left">Unknown</td>
<td/>
<td valign="top" align="center">31,733</td>
<td/>
</tr>
<tr>
<td valign="top" align="left">Decline to answer</td>
<td/>
<td valign="top" align="center">176</td>
<td/>
</tr>
<tr>
<td valign="top" align="left">White</td>
<td valign="top" align="center">659,140</td>
<td valign="top" align="center">670,182</td>
<td valign="top" align="center">2%</td>
</tr>
<tr>
<td valign="top" align="left" colspan="4"><bold>Ethnicity</bold></td>
</tr>
<tr>
<td valign="top" align="left">Hispanic or Latino</td>
<td valign="top" align="center">61,237</td>
<td valign="top" align="center">64,354</td>
<td valign="top" align="center">5%</td>
</tr>
<tr>
<td valign="top" align="left">Other Hispanic or Latino</td>
<td/>
<td valign="top" align="center">21,021</td>
<td/>
</tr>
<tr>
<td valign="top" align="left">Mexican, Mexican American, or Chicano/a</td>
<td/>
<td valign="top" align="center">2,263</td>
<td/>
</tr>
<tr>
<td valign="top" align="left">Puerto Rican</td>
<td/>
<td valign="top" align="center">118</td>
<td/>
</tr>
<tr>
<td valign="top" align="left">Cuban</td>
<td/>
<td valign="top" align="center">41</td>
<td/>
</tr>
<tr>
<td valign="top" align="left">Non-Hispanic or Latino</td>
<td valign="top" align="center">263,119</td>
<td valign="top" align="center">281,091</td>
<td valign="top" align="center">6%</td>
</tr>
<tr>
<td valign="top" align="left">Unknown</td>
<td valign="top" align="center">733,886</td>
<td valign="top" align="center">769,097</td>
<td valign="top" align="center">5%</td>
</tr>
<tr>
<td valign="top" align="left">None of the above</td>
<td/>
<td valign="top" align="center">730,480</td>
<td/>
</tr>
<tr>
<td valign="top" align="left">Decline to answer</td>
<td/>
<td valign="top" align="center">300</td>
<td/>
</tr>
<tr>
<td valign="top" align="left" colspan="4"><bold>Gender</bold></td>
</tr>
<tr>
<td valign="top" align="left">Female</td>
<td valign="top" align="center">536,450</td>
<td valign="top" align="center">545,895</td>
<td valign="top" align="center">2%</td>
</tr>
<tr>
<td valign="top" align="left">Male</td>
<td valign="top" align="center">555,851</td>
<td valign="top" align="center">568,026</td>
<td valign="top" align="center">2%</td>
</tr>
<tr>
<td valign="top" align="left">Unknown</td>
<td valign="top" align="center">548</td>
<td valign="top" align="center">615</td>
<td valign="top" align="center">11%</td>
</tr>
<tr>
<td valign="top" align="left">Other</td>
<td/>
<td valign="top" align="center">5</td>
<td/>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="TN1"><label>&#x0002A;</label><p><italic>Note, if patients have more than one entry for race data, Slicerdicer counts them in all of the selected fields</italic>.</p></fn>
</table-wrap-foot>
</table-wrap>
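<p>The roll-up of granular categories into general ones seen in Table 2 can be sketched as follows. This is a minimal illustration, not the mapping used in the study: the dictionary <italic>ROLL_UP</italic> and the function <italic>roll_up</italic> are hypothetical names, and only the Slicerdicer counts for the granular Asian rows are taken from Table 2. The function also returns any category that has no entry in the map, which is one way an incomplete mapping could be detected.</p>
<preformat>
```python
from collections import defaultdict

# Hypothetical mapping from granular categories to a general category.
ROLL_UP = {
    "Asian Indian": "Asian",
    "Chinese": "Asian",
    "Filipino": "Asian",
    "Japanese": "Asian",
    "Korean": "Asian",
    "Other Asian": "Asian",
}

# Slicerdicer counts for the granular Asian rows of Table 2.
granular_counts = {
    "Asian Indian": 917,
    "Chinese": 177,
    "Filipino": 148,
    "Japanese": 30,
    "Korean": 62,
    "Other Asian": 6146,
}


def roll_up(counts, mapping):
    """Sum granular counts per general category and list unmapped categories."""
    totals = defaultdict(int)
    unmapped = []
    for category, count in counts.items():
        general = mapping.get(category)
        if general is None:
            # A category missing from the map would silently drop patients.
            unmapped.append(category)
        else:
            totals[general] += count
    return dict(totals), unmapped


totals, unmapped = roll_up(granular_counts, ROLL_UP)
print(totals, unmapped)
```
</preformat>
<p>For the Asian rows, the rolled-up total equals the Slicerdicer count of 7,480 reported for Asian in Table 2. Because Slicerdicer counts a patient under every race recorded, summed counts of this kind are not necessarily counts of distinct patients.</p>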
<p>The second step was to run the same query in both tools to explore diagnoses, using J45<sup>&#x0002A;</sup> as the ICD-10 code for asthma and E10<sup>&#x0002A;</sup> as the ICD-10 code for Type 1 diabetes, as shown in <xref ref-type="table" rid="T3">Table 3</xref>.</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Patient counts based on diagnosis codes.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left"><bold>Patient counts</bold></th>
<th valign="top" align="center"><bold>i2b2</bold></th>
<th valign="top" align="center"><bold>Epic Slicerdicer</bold></th>
<th valign="top" align="center"><bold>% difference</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left" colspan="4"><bold>Asthma (J45<xref ref-type="table-fn" rid="TN2"><sup>&#x0002A;</sup></xref>)</bold></td>
</tr>
<tr>
<td valign="top" align="left">Diagnosis</td>
<td valign="top" align="center">14,500</td>
<td valign="top" align="center">23,958</td>
<td valign="top" align="center">39.48%</td>
</tr>
<tr>
<td valign="top" align="left">Billing diagnosis</td>
<td valign="top" align="center">20,429</td>
<td valign="top" align="center">22,265</td>
<td valign="top" align="center">8.25%</td>
</tr>
<tr>
<td valign="top" align="left" colspan="4"><bold>Type 1 diabetes (E10<xref ref-type="table-fn" rid="TN2"><sup>&#x0002A;</sup></xref>)</bold></td>
</tr>
<tr>
<td valign="top" align="left">Diagnosis</td>
<td valign="top" align="center">1,900</td>
<td valign="top" align="center">2,202</td>
<td valign="top" align="center">13.71%</td>
</tr>
<tr>
<td valign="top" align="left">Billing diagnosis</td>
<td valign="top" align="center">1,869</td>
<td valign="top" align="center">2,025</td>
<td valign="top" align="center">7.70%</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="TN2"><label>&#x0002A;</label><p><italic>Indicates multiple race; this category exists in Epic Slicerdicer but is not available in i2b2</italic>.</p></fn>
</table-wrap-foot>
</table-wrap>
<p>A percentage difference calculation was used to find the percent difference between i2b2 counts and Epic Slicerdicer counts greater than 0. The percentage difference, defined in the formula below, expresses the difference between two numbers as a percentage; here it was used to estimate the quality of the counts coming from the two tools. The threshold for accepted quality in this study was a difference below 2%.</p>
<p>Here <italic>V</italic><sub>1</sub> = i2b2 counts and <italic>V</italic><sub>2</sub> = Slicerdicer counts, and the counts are plugged into the formula below:</p>
<disp-formula id="E1"><mml:math id="M1"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mtext>Percentage difference</mml:mtext><mml:mtext>&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mo>|</mml:mo><mml:msub><mml:mi>V</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mi>V</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>|</mml:mo><mml:mo>/</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mi>V</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mtext>&#x000A0;</mml:mtext><mml:mo>&#x0002B;</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:msub><mml:mi>V</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>/</mml:mo><mml:mn>2</mml:mn></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mtext>&#x000A0;</mml:mtext><mml:mo>&#x000D7;</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mn>100</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
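<p>The formula above can be sketched in a few lines of Python. This is an illustrative sketch rather than the calculator used in the study; the function names <italic>symmetric_percent_difference</italic> and <italic>percent_of_second</italic> are hypothetical. The second function is an assumption added for comparison: it divides by <italic>V</italic><sub>2</sub> alone instead of by the mean of the two counts.</p>
<preformat>
```python
def symmetric_percent_difference(v1, v2):
    """|V1 - V2| divided by the mean of V1 and V2, times 100 (the formula in the text)."""
    mean = (v1 + v2) / 2
    if mean == 0:
        raise ValueError("both counts are zero; the percentage difference is undefined")
    return abs(v1 - v2) / mean * 100


def percent_of_second(v1, v2):
    """Assumed alternative: |V1 - V2| divided by V2 only, times 100."""
    if v2 == 0:
        raise ValueError("V2 is zero; the percentage is undefined")
    return abs(v1 - v2) / v2 * 100


# Asthma "Diagnosis" row of Table 3: V1 is the i2b2 count, V2 the Slicerdicer count.
i2b2_count = 14500
slicerdicer_count = 23958

print(round(symmetric_percent_difference(i2b2_count, slicerdicer_count), 2))
print(round(percent_of_second(i2b2_count, slicerdicer_count), 2))
```
</preformat>
<p>For this row the symmetric formula gives about 49.19%, while dividing by the Slicerdicer count gives about 39.48%, the value listed in <xref ref-type="table" rid="T3">Table 3</xref>. The tabulated percentages therefore appear to be expressed relative to the Slicerdicer count; under either definition the row is far above the 2% threshold.</p>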
<p>A paired <italic>t</italic>-test was used to investigate the difference between the two counts from i2b2 and Epic Slicerdicer for the same query.</p>
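<p>For readers unfamiliar with the test, the sketch below shows how a paired <italic>t</italic> statistic is computed from several paired counts. It is an illustration under stated assumptions, not the procedure used in the study: the function name <italic>paired_t_statistic</italic> is hypothetical, and treating the four rows of Table 3 as paired observations is an assumption made only to have example input. The test compares the mean of the pairwise differences with its standard error, so it needs at least two pairs; with a single pair the standard deviation of the differences is not defined.</p>
<preformat>
```python
import math
import statistics


def paired_t_statistic(first, second):
    """Return (t, degrees_of_freedom) for two paired samples of equal length."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    differences = [b - a for a, b in zip(first, second)]
    n = len(differences)
    if n in (0, 1):
        raise ValueError("a paired t-test needs at least two pairs")
    mean_diff = statistics.mean(differences)
    sd_diff = statistics.stdev(differences)  # sample standard deviation
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    standard_error = sd_diff / math.sqrt(n)
    return mean_diff / standard_error, n - 1


# The four rows of Table 3, treated as paired observations (an assumption).
i2b2_counts = [14500, 20429, 1900, 1869]
slicerdicer_counts = [23958, 22265, 2202, 2025]

t_value, dof = paired_t_statistic(i2b2_counts, slicerdicer_counts)
print(t_value, dof)
```
</preformat>
<p>Turning the statistic into a <italic>p</italic>-value requires the <italic>t</italic> distribution with the returned degrees of freedom; a statistics library such as SciPy (scipy.stats.ttest_rel) performs both steps.</p>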
</sec>
<sec>
<title>Findings</title>
<p>All the results obtained from comparing the counts between Slicerdicer and i2b2 are listed in <xref ref-type="table" rid="T2">Tables 2</xref>, <xref ref-type="table" rid="T3">3</xref>.</p>
<p>When diagnoses were explored, larger discrepancies were noted than for demographics. There are two diagnosis fields in i2b2, one for billing diagnosis and one for diagnosis. Using J45<sup>&#x0002A;</sup> as the ICD-10 code for asthma resulted in 22,265 patients when using the billing diagnosis code in Slicerdicer but only 20,429 in i2b2. The discrepancy using diagnosis was even larger. Patient count results for the Type 1 diabetes diagnosis code (E10<sup>&#x0002A;</sup>) using both diagnosis and billing are also shown in <xref ref-type="table" rid="T3">Table 3</xref>.</p>
<p>To understand the reasons for this discrepancy, we first looked at the diagnosis options in Slicerdicer to build a hypothesis on where the discrepancy might come from. Next, we examined the SQL code for the Caboodle to i2b2 ETL process.</p>
</sec>
<sec>
<title>Hypotheses</title>
<p>The following hypotheses were considered:</p>
<p>H0: There is no discrepancy in the data elements used to pull the data.</p>
<p>H1: There is a discrepancy in the data elements used to pull the data.</p>
<p>A paired sample <italic>t</italic>-test was performed on the counts obtained from i2b2 and Slicerdicer using different data points. The <italic>p</italic>-value was equal to 0 [<italic>P</italic>(x &#x02264; &#x02013;Infinity) = 0] in all cases, which means that the chance of a type I error (rejecting a correct H0) is small: 0 (0%). The smaller the <italic>p</italic>-value, the more it supports H1. For example, the results of the paired <italic>t</italic>-test indicated a significant medium difference between i2b2 (<italic>M</italic> = 14,500, <italic>SD</italic> = 0) and Epic Slicerdicer (<italic>M</italic> = 23,958, <italic>SD</italic> = 0), t(0) = Infinity, <italic>p</italic> &#x0003C; 0.001, and a significant medium difference between i2b2 (<italic>M</italic> = 1,55,434, <italic>SD</italic> = 0) and Epic Slicerdicer (<italic>M</italic> = 1,579, <italic>SD</italic> = 0), t(0) = Infinity, <italic>p</italic> &#x0003C; 0.001.</p>
<p>Since the <italic>p</italic>-value &#x0003C; &#x003B1;, H0 is rejected, and the i2b2 population&#x00027;s average is considered to be not equal to the Epic Slicerdicer population&#x00027;s average. In other words, the difference between the averages of i2b2 and Epic Slicerdicer is big enough to be statistically significant.</p>
<p>The paired <italic>t</italic>-test results supported the alternative hypothesis and revealed that there is a discrepancy in the data elements used to pull the data.</p>
<p>In addition, the percentage difference was used to estimate the quality of the counts coming from the two tools. The majority of the results exceeded the threshold for accepted quality in this study (a difference below 2%), as shown in <xref ref-type="table" rid="T2">Tables 2</xref>, <xref ref-type="table" rid="T3">3</xref>. The percentage difference results provided strong evidence of a crucial quality issue in the counts obtained.</p>
<p>Examining the SQL code for the Caboodle to i2b2 ETL process showed that the code only looked at billing and encounter diagnoses, and everything that was not a billing diagnosis was labeled diagnosis. Slicerdicer and even Caboodle include other diagnosis sources such as medical history, hospital problem, and problem list. This was included in the data dictionary so that researchers would understand what sources i2b2 was using and that, if they wanted data beyond that, they would have to request data from Caboodle.</p>
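<p>The effect of restricting the ETL to billing and encounter diagnoses can be sketched with a small example. Everything in this sketch is hypothetical and for illustration only: the records, the source labels, and the names <italic>ETL_SOURCES</italic>, <italic>etl_label</italic>, and <italic>patients_with_code</italic> are assumptions and do not reproduce the SQL code or the Caboodle schema.</p>
<preformat>
```python
# Hypothetical diagnosis records: (patient_id, icd10_code, source).
records = [
    (1, "J45.909", "billing"),
    (1, "J45.909", "encounter"),
    (2, "J45.20", "encounter"),
    (3, "J45.40", "problem list"),
    (4, "J45.30", "medical history"),
    (5, "E10.9", "billing"),
]

# Sources read by the ETL, as described in the text.
ETL_SOURCES = {"billing", "encounter"}


def etl_label(source):
    """Label a loaded record: anything that is not billing becomes 'diagnosis'."""
    return "billing diagnosis" if source == "billing" else "diagnosis"


def patients_with_code(rows, prefix, sources=None):
    """Distinct patients with a code starting with prefix, optionally limited to some sources."""
    return {
        patient
        for patient, code, source in rows
        if code.startswith(prefix) and (sources is None or source in sources)
    }


all_sources = patients_with_code(records, "J45")
etl_only = patients_with_code(records, "J45", ETL_SOURCES)
labels = sorted({etl_label(s) for _, _, s in records if s in ETL_SOURCES})

print(len(all_sources), len(etl_only), sorted(all_sources - etl_only), labels)
```
</preformat>
<p>In this toy data, patients whose asthma code appears only in the problem list or medical history are counted when all sources are searched but are absent from the ETL-restricted count, which is the kind of gap a data dictionary entry needs to explain.</p>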
</sec>
</sec>
<sec sec-type="discussion" id="s5">
<title>Discussion</title>
<p>The discrepancies led to major information quality issues such as data inconsistency and data accuracy both affects the believability and the validity of the data which also are major data quality measures. The discrepancies noted above are likely due to several factors. First, Slicerdicer counts patients for every race selected instead of i2b2, which only takes the first race field, this because two data models were used to pattern race and ethnicity variables in i2b2 to the 1997 OMB race categories and the 2003 OMB variables, which contains a more granular set of race and ethnicity categories. The mapping then was done to &#x02018;bundle&#x0201D; the other races to a more general set of categories. This could be the reason why there is a reduction of concepts because maybe the map is incomplete.</p>
<p>Secondly, the purpose of the Extract-Transform-Load (ETL) process is to load the warehouse with integrated and cleansed data. Data quality focuses on the contents of the individual records to ensure that the data loaded into the target destination is accurate, reliable, and consistent, so the ETL code should be evaluated to ensure that the data extracted generally match what researchers want. In our case, this means understanding which diagnoses most researchers are interested in; they may want encounter diagnoses rather than problem list and medical history entries. Thirdly, data quality issues can be caused by format differences or conversion errors (Azeroual et al., <xref ref-type="bibr" rid="B3">2019</xref>; Souibgui et al., <xref ref-type="bibr" rid="B10">2019</xref>).</p>
<p>Lastly, data may be lost in the ETL process, a known challenge that stems from the nature of the source systems. Data losses arise from disparities among the source operational systems, which are diverse because of growing data volumes, changes to data formats, and the modification and derivation of data elements.</p>
<p>In general, data integration with heterogeneous systems is not an easy task. This is mainly because many data exchange channels must be developed to allow data to be exchanged between the systems (Berkhoff et al., <xref ref-type="bibr" rid="B4">2012</xref>) and to solve problems related to providing interoperability between systems at the data level (Macura, <xref ref-type="bibr" rid="B7">2014</xref>).</p>
<sec>
<title>Steps to Ensure Informatics Quality</title>
<p>To improve the quality of the data generated by the de-identified systems, which consists mainly of counts, and to solve data quality issues related to interoperability between the tools at the data level, we propose the following steps:</p>
<list list-type="simple">
<list-item><p>1. Make data &#x0201C;fit for use.&#x0201D;</p>
<p>To make data fit for use, data governance bodies must clearly define major data concepts/variables included in the de-identified systems and standardize their collection and monitoring processes; this can increase clinical data reliability and reduce the inconsistency of data quality among systems involved (Halimeh, <xref ref-type="bibr" rid="B6">2011</xref>; AbuHalimeh and Tudoreanu, <xref ref-type="bibr" rid="B1">2014</xref>).</p></list-item>
<list-item><p>2. Define data elements (data dictionary).</p>
<p>This is a fundamental part&#x02014;the lack of clear definitions of source data and controlled data collection procedures often raises concerns about the quality of data provided in such environments and, consequently, about the evidence level of related findings (Spengler et al., <xref ref-type="bibr" rid="B11">2020</xref>). Developing a data dictionary is essential to ensuring data quality, especially in de-identified systems where all data elements are aggregated in a specific way, and there are not enough details about each concept. A data dictionary will serve as a guidebook to define the major data concepts. To do this, organizations must determine what data about data (metadata) is helpful to the researchers when they use the de-identified data systems. In addition, identifying more targeted data concepts and process workflows can help reduce some of the time and effort for researchers when working with large amounts of data and ultimately improve overall data quality.</p></list-item>
<list-item><p>3. Apply good ETL practices, such as data cleansing mechanisms, to bring the data to a state that works well with data from other sources.</p></list-item>
<list-item><p>4. Choose a smart ETL architecture that allows components of the ETL process to be updated when data and systems change, to prevent data loss and to ensure data integrity and consistency.</p></list-item>
<list-item><p>5. Apply data lineage techniques. This helps in understanding where data originated, when it was loaded, and how it was transformed, and it is essential for the integrity of the downstream data and of the process that moves it to any of the de-identified systems.</p></list-item>
<list-item><p>6. Establish a process for cleansing and tracing suspicious data and unusual rows of data when they are revealed.</p></list-item>
<list-item><p>7. Users need to revise their queries and refine results as they combine data variables.</p></list-item>
<list-item><p>8. Having a clinical informaticist on board can also be beneficial to the process. They can ensure that your data reflects what is seen in clinical practice or help explain questionable data with their knowledge of clinical workflows and how that data is collected, especially if your analyst has no clinical background.</p></list-item>
</list>
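<p>Steps 2, 5, and 6 above can be sketched together in a few lines. This is a minimal illustration under stated assumptions, not a prescribed implementation: the class names DataDictionaryEntry and LineageRecord, the function flag_suspicious_rows, the field names, the load time, and the sample rows are all hypothetical. Only the system names and the diagnosis sources are taken from the text.</p>
<preformat><![CDATA[
```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class DataDictionaryEntry:
    """Step 2: what a concept means and which sources feed it."""
    concept: str
    definition: str
    included_sources: tuple


@dataclass
class LineageRecord:
    """Step 5: where a concept came from, when it was loaded, how it was transformed."""
    concept: str
    source_system: str
    target_system: str
    transformation: str
    loaded_at: datetime


def flag_suspicious_rows(rows, required_fields):
    """Step 6: return rows in which any required field is missing or empty."""
    return [
        row for row in rows
        if any(row.get(name) in (None, "") for name in required_fields)
    ]


diagnosis_entry = DataDictionaryEntry(
    concept="diagnosis",
    definition="Diagnoses drawn from billing and encounter records only.",
    included_sources=("billing", "encounter"),
)

diagnosis_lineage = LineageRecord(
    concept="diagnosis",
    source_system="Caboodle",
    target_system="i2b2",
    transformation="non-billing diagnoses labeled as diagnosis",
    loaded_at=datetime(2021, 4, 1, tzinfo=timezone.utc),  # hypothetical load time
)

# Hypothetical rows; the second and third are incomplete and should be flagged.
rows = [
    {"patient_id": 1, "race": "White"},
    {"patient_id": 2, "race": ""},
    {"patient_id": 3},
]
suspicious = flag_suspicious_rows(rows, required_fields=["patient_id", "race"])

print(diagnosis_entry.included_sources)
print(diagnosis_lineage.source_system, "->", diagnosis_lineage.target_system)
print([row["patient_id"] for row in suspicious])
```
]]></preformat>
<p>Keeping the dictionary entry and the lineage record next to each other lets a researcher see both what a count means and which transformation produced it before deciding whether to request data from the source system.</p>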
</sec>
</sec>
<sec sec-type="conclusions" id="s6">
<title>Conclusion</title>
<p>The success of any de-identified data tool depends largely on the quality of the data used and on the mapping process, which is intertwined with the extraction and transformation components. The ETL process is a crucial component in determining the quality of the data generated by an information system.</p>
<p>This study showed that the discrepancies in the data used in the data pull process led to major information quality issues, such as data inconsistency and inaccuracy, both of which affect the believability and validity of the data, which are themselves major data quality measures.</p>
<p>Our contribution in this paper is to propose a set of steps that together form guidelines for a method or automated procedures and tools to manage data quality and data governance in a multifaceted, diverse information environment such as healthcare organizations and to enhance the data quality among the de-identified data tools.</p>
<p>Future work will study more clinical informatics tools, such as TriNetX, and other sets of medical data to assess the quality of the counts obtained from these tools.</p>
</sec>
<sec id="s7">
<title>Author Contributions</title>
<p>The author confirms being the sole contributor of this work and has approved it for publication.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of Interest</title>
<p>The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s8">
<title>Publisher&#x00027;s Note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
</body>
<back>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>AbuHalimeh</surname> <given-names>A.</given-names></name> <name><surname>Tudoreanu</surname> <given-names>M. E.</given-names></name></person-group> (<year>2014</year>). <article-title>Subjective information quality in data integration: evaluation and principles</article-title>, in <source>Information Quality and Governance for Business Intelligence</source> (<publisher-loc>Pennsylvania</publisher-loc>: <publisher-name>IGI Global</publisher-name>), <fpage>44</fpage>&#x02013;<lpage>65</lpage>.</citation>
</ref>
<ref id="B2">
<citation citation-type="web"><person-group person-group-type="author"><collab>American Medical Informatics Association</collab></person-group> (<year>2022</year>). Available online at: <ext-link ext-link-type="uri" xlink:href="https://amia.org/about-amia/why-informatics/informatics-research-and-practice">https://amia.org/about-amia/why-informatics/informatics-research-and-practice</ext-link> (accessed April, 2021).</citation>
</ref>
<ref id="B3">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Azeroual</surname> <given-names>O.</given-names></name> <name><surname>Saake</surname> <given-names>G.</given-names></name> <name><surname>Abuosba</surname> <given-names>M.</given-names></name></person-group> (<year>2019</year>). <article-title>ETL best practices for data quality checks in RIS databases</article-title>, in <source>Informatics, Vol. 6</source> (<publisher-loc>Basel</publisher-loc>: <publisher-name>Multidisciplinary Digital Publishing Institute</publisher-name>), 10.</citation>
</ref>
<ref id="B4">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Berkhoff</surname> <given-names>K.</given-names></name> <name><surname>Ebeling</surname> <given-names>B.</given-names></name> <name><surname>L&#x000FC;bbe</surname> <given-names>S.</given-names></name></person-group> (<year>2012</year>). <article-title>Integrating research information into a software for higher education administration&#x02014;benefits for data quality and accessibility</article-title>, in <source>11th International Conference on Current Research Information Systems</source>, Prague.</citation>
</ref>
<ref id="B5">
<citation citation-type="web"><person-group person-group-type="author"><collab>Galaxy</collab></person-group> (<year>2021</year>). <source>Epic User Web</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://galaxy.epic.com/&#x00023;Search/searchWord=slicerdicer">https://galaxy.epic.com/&#x00023;Search/searchWord=slicerdicer</ext-link> (accessed April, 2021).</citation>
</ref>
<ref id="B6">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Halimeh</surname> <given-names>A. A.</given-names></name></person-group> (<year>2011</year>). <source>Integrating Information Quality in Visual Analytics</source>. <publisher-name>University of Arkansas at Little Rock</publisher-name>, <publisher-loc>Little Rock</publisher-loc>.</citation>
</ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Macura</surname> <given-names>M.</given-names></name></person-group> (<year>2014</year>). <article-title>Integration of data from heterogeneous sources using ETL technology</article-title>. <source>Comput. Sci.</source> <volume>15</volume>, <fpage>109</fpage>&#x02013;<lpage>132</lpage>. <pub-id pub-id-type="doi">10.7494/csci.2014.15.2.109</pub-id></citation>
</ref>
<ref id="B8">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Nahm</surname> <given-names>M.</given-names></name></person-group> (<year>2012</year>). <article-title>Data quality in clinical research</article-title>, in <source>Clinical Research Informatics</source> (<publisher-loc>London</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>175</fpage>&#x02013;<lpage>201</lpage>.</citation>
</ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pipino</surname> <given-names>L. L.</given-names></name> <name><surname>Lee</surname> <given-names>Y. W.</given-names></name> <name><surname>Wang</surname> <given-names>R. Y.</given-names></name></person-group> (<year>2002</year>). <article-title>Data quality assessment</article-title>. <source>Commun. ACM</source> <volume>45</volume>, <fpage>211</fpage>&#x02013;<lpage>218</lpage>. <pub-id pub-id-type="doi">10.1145/505248.506010</pub-id></citation>
</ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Souibgui</surname> <given-names>M.</given-names></name> <name><surname>Atigui</surname> <given-names>F.</given-names></name> <name><surname>Zammali</surname> <given-names>S.</given-names></name> <name><surname>Cherfi</surname> <given-names>S.</given-names></name> <name><surname>Yahia</surname> <given-names>S. B.</given-names></name></person-group> (<year>2019</year>). <article-title>Data quality in ETL process: a preliminary study</article-title>. <source>Proc. Comput. Sci.</source> <volume>159</volume>, <fpage>676</fpage>&#x02013;<lpage>687</lpage>. <pub-id pub-id-type="doi">10.1016/j.procs.2019.09.223</pub-id></citation>
</ref>
<ref id="B11">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Spengler</surname> <given-names>H.</given-names></name> <name><surname>Gatz</surname> <given-names>I.</given-names></name> <name><surname>Kohlmayer</surname> <given-names>F.</given-names></name> <name><surname>Kuhn</surname> <given-names>K. A.</given-names></name> <name><surname>Prasser</surname> <given-names>F.</given-names></name></person-group> (<year>2020</year>). <article-title>Improving data quality in medical research: a monitoring architecture for clinical and translational data warehouses</article-title>, in <source>2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS)</source> (<publisher-loc>Rochester, MN</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>415</fpage>&#x02013;<lpage>420</lpage>.</citation>
</ref>
<ref id="B12">
<citation citation-type="web"><person-group person-group-type="author"><collab>The i2b2 tranSMART Foundation</collab></person-group> (<year>2021</year>). Available online at: <ext-link ext-link-type="uri" xlink:href="https://www.i2b2.org/about/">https://www.i2b2.org/about/</ext-link> (accessed April, 2021).</citation>
</ref>
<ref id="B13">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zozus</surname> <given-names>M. N.</given-names></name> <name><surname>Kahn</surname> <given-names>M. G.</given-names></name> <name><surname>Weiskopf</surname> <given-names>N. G.</given-names></name></person-group> (<year>2019</year>). <article-title>Data quality in clinical research</article-title>, in <source>Clinical Research Informatics</source> (<publisher-loc>Cham</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>213</fpage>&#x02013;<lpage>248</lpage>.</citation>
</ref>
</ref-list>
</back>
</article>