Validation of automated data abstraction for SCCM discovery VIRUS COVID-19 registry: practical EHR export pathways (VIRUS-PEEP)

Background The gold standard for gathering data from electronic health records (EHR) has been manual data extraction; however, this requires vast resources and personnel. Automation of this process reduces resource burdens and expands research opportunities. Objective This study aimed to determine the feasibility and reliability of automated data extraction in a large registry of adult COVID-19 patients. Materials and methods This observational study included data from sites participating in the SCCM Discovery VIRUS COVID-19 registry. Important demographic, comorbidity, and outcome variables were chosen for manual and automated extraction for the feasibility dataset. We quantified the degree of agreement with Cohen’s kappa statistics for categorical variables. The sensitivity and specificity were also assessed. Correlations for continuous variables were assessed with Pearson’s correlation coefficient and Bland–Altman plots. The strength of agreement was defined as almost perfect (0.81–1.00), substantial (0.61–0.80), and moderate (0.41–0.60) based on kappa statistics. Pearson correlations were classified as trivial (0.00–0.30), low (0.30–0.50), moderate (0.50–0.70), high (0.70–0.90), and extremely high (0.90–1.00). Measurements and main results The cohort included 652 patients from 11 sites. The agreement between manual and automated extraction for categorical variables was almost perfect in 13 (72.2%) variables (Race, Ethnicity, Sex, Coronary Artery Disease, Hypertension, Congestive Heart Failure, Asthma, Diabetes Mellitus, ICU admission rate, IMV rate, HFNC rate, ICU and Hospital Discharge Status), and substantial in five (27.8%) (COPD, CKD, Dyslipidemia/Hyperlipidemia, NIMV, and ECMO rate). The correlations were extremely high in three (42.9%) variables (age, weight, and hospital LOS) and high in four (57.1%) of the continuous variables (Height, Days to ICU admission, ICU LOS, and IMV days). 
The average sensitivity and specificity for the categorical data were 90.7 and 96.9%. Conclusion and relevance Our study confirms the feasibility and validity of an automated process to gather data from the EHR.


Introduction
The pandemic of the coronavirus disease 2019 (COVID-19) has created a need to develop research resources rapidly (1). In response to the global demand for robust multicenter clinical data regarding patient care and outcomes, the Society of Critical Care Medicine (SCCM) Discovery Viral Infection and Respiratory Illness Universal Study (VIRUS) COVID-19 registry was created early in the pandemic (2-4).
Due to the surging nature of pandemic waves, and the subsequent workload and staffing burdens, clinical researchers have encountered difficulty in engaging in rapid, reliable manual data extraction from the electronic health record (EHR) (5). Manual chart review is the gold standard method for gathering data for retrospective research studies (6,7). This process, however, is time consuming and necessitates personnel resources not widely available at all institutions (8,9). Prior to the pandemic, automated data extraction from the EHR utilizing direct database queries was shown to be faster and less error-prone than manual data extraction (8,10). Nonetheless, data quality challenges related to high complexity or fragmentation of data across many EHR systems make automated extraction vulnerable (11-14). Both manual and automatic extraction rely on the EHR, which is an artifact with its own biases, mistakes, and subjectivity (15-20).
Although previous research has examined these issues, the best methods for obtaining data from EHR systems for research remain to be established. In response, we sought to assess the feasibility, reliability, and validity of an automated data extraction process using data for the VIRUS COVID-19 registry.

VIRUS COVID-19 registry
The SCCM Discovery VIRUS COVID-19 registry (Clinical Trials registration number: NCT04323787) is a multicenter, international database with over 80,000 patients from 306 health sites across 28 countries (21). The VIRUS COVID-19 registry is an ongoing prospective observational study that aims at real-time data gathering and analytics with a feedback loop to disseminate treatment and outcome knowledge to improve COVID-19 patient care (3). The Mayo Clinic Institutional Review Board authorized the SCCM Discovery VIRUS COVID-19 registry as exempt on March 23, 2020 (IRB number: 20-002610). No informed consent was deemed necessary for the study subjects. The procedures were followed in accordance with the Helsinki Declaration of 2013 (22). Among the participating sites, 30 individual centers are collaborating to rapidly develop tools and resources to optimize EHR data collection. These efforts are led by the VIRUS Practical EHR Export Pathways group (VIRUS-PEEP).

Data collection
The VIRUS COVID-19 registry includes over 500 variables, which represent the pandemic registry common data standards for critically ill patients adapted from the World Health Organization-International Severe Acute Respiratory and Emerging Infection Consortium (WHO-ISARIC) COVID-19 CRF v1.3 24 February 2020 (23). The VIRUS-PEEP validation cohort was developed in an iterative, consensus process by a group of VIRUS COVID-19 registry principal investigators to explore the feasibility of an automation process at each site. The validation cohort variable set was internally validated by seven core VIRUS COVID-19 investigators and subsequently validated by the principal investigators of the VIRUS-PEEP lead sites. Because of the timeline, the cohort could not be externally validated. A purposeful representative sample of the 25 most clinically relevant variables from each category (baseline demographic and clinical characteristics of patients, and ICU- and hospital-related outcomes) was selected and prioritized for this study (4). We focused on demographic data (age, sex, race, ethnicity, height, weight), comorbidities (coronary artery disease (CAD), hypertension (HTN), congestive heart failure (CHF), chronic obstructive pulmonary disease (COPD), asthma, chronic kidney disease (CKD), diabetes mellitus (DM), dyslipidemia/hyperlipidemia), and clinical outcomes (intensive care unit (ICU) admission, days to ICU admission, ICU length of stay (LOS), type of oxygenation requirement, extracorporeal membrane oxygenation (ECMO), ICU discharge status, hospital LOS, and in-hospital mortality).
To avoid data extraction errors, we utilized precise variable definitions [VIRUS COVID-19 registry code book, case report form (CRF), and Standard Operating Procedure (SOP)], which were already implemented in the registry and during the pilot phase of the automation implementation. Additionally, all manual and automated data extraction personnel were educated regarding the definitions and procedures needed to collect and report the data.

System description
De-identified data were collected through Research Electronic Data Capture software (REDCap, version 8.11.11, Vanderbilt University, Nashville, Tennessee) at Mayo Clinic, Rochester, MN, United States (24). The REDCap electronic data capture system is a secure, web-based application for research data capture that includes an intuitive interface for validated data entry; audit trails for tracking data manipulation and export procedures; automated export procedures for seamless data downloads to standard statistical packages; and a secure platform for importing data from external sources.

Manual abstraction
The VIRUS-PEEP group has implemented a comprehensive process for data extraction, which involves training manual data extractors. These data extractors are trained to identify, abstract, and collect patient data according to the project's SOP. Extractors follow each patient from hospitalization until discharge, ensuring that all relevant information is collected. The CRF used in this process includes two main sections, demographics and outcomes, composed of categorical and continuous variables. Extractors answer a mix of binary ("yes" or "no") and checkbox ("check all that apply") questions in the nominal variable portions of the CRF. They are instructed to avoid free text and use the prespecified units for continuous variables. In case of any disagreement, a trainer is available for guidance and correction. The manual extractors are unaware of the automated data extraction results.

Automated extraction
A package of Structured Query Language (SQL) scripts for the "Epic Clarity" database was developed at one institution and shared through the SCCM's Secure File Transfer Platform (SFTP) with participating sites. A second site offered peer coaching on the development and utility of end-user Epic™ reporting functions and how to adapt and modify the SQL scripts according to each site's EHR environment and security firewall. Other tools included R-Studio™ scripts, Microsoft Excel™ macros, STATA 16, and REDCap calculators for data quality checks at participating sites before data upload to the VIRUS Registry REDCap. These tools were designed to aid in data extraction, data cleaning, and adherence to data quality rules as provided in the VIRUS COVID-19 Registry SOPs. Institutions participated in weekly conference calls to discuss challenges and share successes in implementing automated data abstraction; additionally, lessons learned from adapting the SQL scripts and other data quality tools to their EHR environments were shared between individual sites and members of the VIRUS-PEEP group.

Statistical analysis
We summarized continuous variables from the manual and automated processes using mean ± SD and calculated the mean difference and SE by matched pair analysis. Pearson correlation coefficients (PCCs) and 95% confidence intervals (CI) were generated for continuous data as a measure of inter-class dependability (25). Pearson correlations were classified as trivial (0.00-0.30), low (0.30-0.50), moderate (0.50-0.70), high (0.70-0.90), and extremely high (0.90-1.00) (26). Bland-Altman mean-difference plots for continuous variables were also provided to aid in the understanding of agreement (27).
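The continuous-variable comparison described above can be illustrated with a minimal Python sketch. It is not the study's JMP analysis: the function names and the sample ages are hypothetical, the 1.96 × SD limits of agreement are the conventional Bland-Altman choice rather than something stated in the text, and because the published correlation bands share their endpoints, the sketch assumes a boundary value belongs to the higher band.

```python
import math
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def classify_correlation(r):
    """Map |r| to the bands used in the text (boundary goes to the higher band: an assumption)."""
    r = abs(r)
    if r >= 0.90:
        return "extremely high"
    if r >= 0.70:
        return "high"
    if r >= 0.50:
        return "moderate"
    if r >= 0.30:
        return "low"
    return "trivial"


def bland_altman(manual, automated):
    """Matched-pair mean difference (bias), its SE, and 95% limits of agreement."""
    diffs = [a - m for m, a in zip(manual, automated)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample SD of the paired differences
    return {
        "bias": bias,
        "se": sd / math.sqrt(len(diffs)),
        "lower_loa": bias - 1.96 * sd,
        "upper_loa": bias + 1.96 * sd,
    }


# Hypothetical ages (years) for six patients, extracted by each method.
manual_age = [34, 51, 67, 72, 45, 58]
automated_age = [34, 52, 67, 71, 45, 58]

r = pearson_r(manual_age, automated_age)
print(f"r = {r:.4f} ({classify_correlation(r)})")
print(bland_altman(manual_age, automated_age))
```

A Bland-Altman plot would place each pair's mean on the x-axis and its difference on the y-axis, with horizontal lines at the bias and the two limits of agreement returned here.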
For each categorical variable, percent agreement between the two extraction techniques was determined as:

$$\text{Percent agreement} = \frac{\text{Number of patients categorized identically by both sources}}{\text{Total number of cases examined by both sources}} \times 100\%$$
The total number of agreeing outcomes divided by the total number of results is the summary agreement for each variable. For categorical variables we used Cohen's kappa coefficient (28). We used the scale created by Landis et al. to establish the degree of agreement (29). This scale is divided into almost perfect (ϰ = 0.81-1.00), substantial (ϰ = 0.61-0.80), moderate (ϰ = 0.41-0.60), fair (ϰ = 0.21-0.40), slight (ϰ = 0.00-0.20), and poor (ϰ < 0.00). Additionally, the sensitivity and specificity were calculated by comparing the results of the automated data extraction method to the results of the manual data extraction method (gold standard). The 95% confidence intervals were calculated using an exact test for proportions. We used JMP statistical software version 16.2 for all data analysis.
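As an illustration of the categorical measures described above, the following Python sketch computes percent agreement, Cohen's kappa, sensitivity, and specificity for one binary variable, treating manual extraction as the reference. The function names and the ten-patient example are hypothetical, the sketch assumes both "yes" and "no" occur in the manual data, and the kappa labels are assigned by the upper bound of each published band.

```python
def agreement_stats(manual, automated):
    """Agreement measures for a binary variable; manual extraction is the reference.

    Assumes both classes are present in `manual` and that chance agreement is below 1,
    so that none of the denominators is zero.
    """
    pairs = list(zip(manual, automated))
    tp = sum(1 for m, a in pairs if m and a)          # both say "yes"
    tn = sum(1 for m, a in pairs if not m and not a)  # both say "no"
    fp = sum(1 for m, a in pairs if not m and a)      # automated "yes", manual "no"
    fn = sum(1 for m, a in pairs if m and not a)      # automated "no", manual "yes"
    n = len(pairs)

    observed = (tp + tn) / n
    # Chance agreement from the marginal "yes"/"no" rates of each method.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2
    return {
        "percent_agreement": 100.0 * observed,
        "kappa": (observed - expected) / (1 - expected),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def landis_label(kappa):
    """Strength-of-agreement label on the scale quoted in the text."""
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical comorbidity flags (1 = present) for ten patients.
manual_flags = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
automated_flags = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

stats = agreement_stats(manual_flags, automated_flags)
print(stats, landis_label(stats["kappa"]))
```

In this toy example the two methods agree on 8 of 10 patients, but kappa is lower than the raw agreement because part of that agreement is expected by chance.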

Results
Our cohort consisted of data from 652 patients from 11 sites (Figure 1). A total of 25 variables were collected for each patient by both the manual and automated methods. Of these 25 variables, 16 (64.0%) were nominal, 7 (28.0%) were continuous, and 2 (8.0%) were ordinal variables.
Table 1 summarizes the continuous variables. The automated results for three variables (age, weight, and hospital LOS) showed "extremely high" correlation (>0.90) with the manual extraction results. The correlation was "high" (0.70-0.90) for height, days to ICU admission, ICU LOS, and IMV days. Figure 2 presents the Bland-Altman plots for the seven continuous variables.
Tables 2 and 3 describe the ordinal and nominal variables. The agreement between manual and automated extraction was almost perfect in 13 (72.2%) of the studied variables, and substantial in five (27.8%). The comorbidity "dyslipidemia/hyperlipidemia" had the lowest degree of agreement (substantial, ϰ = 0.61); however, overall percent agreement was high (86.9%). The only variable that showed a kappa coefficient equal to 1 was "ICU discharge status." The average kappa coefficient was 0.81 for the eight comorbidities collected and 0.86 for outcome variables, both considered almost perfect. The automated electronic search strategy achieved an average sensitivity of 90.7% and a specificity of 96.9%. The sensitivity and specificity of each data-extraction method for all variables are presented in Table 3.

Discussion
The automated search strategy for EHR data extraction was highly feasible and reliable. Our investigation observed substantial or almost perfect agreement between automated and manual data extraction. There was almost perfect agreement in 13 of the 18 categorical variables (72.2%), and all continuous variables showed extremely high or high correlation.
The results of our validation study are similar to other studies that validated and evaluated automated data (30-33). Singh et al. (31) developed several algorithm queries to identify every component of the Charlson Comorbidity Index and found median sensitivity and specificity of 98-100% and 98-100%, respectively. In the validation cohort, the sensitivity of the automated digital algorithm ranged from 91 to 100%, and the specificity ranged from 98 to 100% compared to [...] COVID-19 pneumonia using ICD-10-based data compared with manual data collection. We also successfully compared seven continuous variables, three with extremely high and four with high agreement, whereas Brazeal et al. (35) compared two variables (age and BMI) for manual versus automated extraction in a study population of patients with histologically confirmed advanced adenomatous colorectal polyps. Manual data extractors can overcome diverse interface issues, read and analyze free text, and provide clinical judgment when retrieving and interpreting data; however, manual data extraction is limited by human resources and is prone to human error (7,32,36). In addition to requiring a considerable amount of time, manual data extraction also necessitates qualified personnel (30,33). During the COVID-19 pandemic, where real-time data is paramount, automated data extraction has proven valid and effective, and may free personnel for patient care and other vital tasks. Nonetheless, automated data extraction is not flawless. A significant limitation is the difficulty of finding a single algorithm that can be applied to every center. Variables collected as free-text fields are another challenge for such validations. Sites using automated extraction for the VIRUS COVID-19 registry have reported collecting a large majority of variables with this method. Currently, data for more than 60,000 patients in the registry have been collected through the efforts of the VIRUS-PEEP group, which has allowed for updated and complete data in the shortest possible time.

Challenges in automation
The environment for data collection is often a shared environment within an institution, and there are limitations on how much data may be extracted and processed in one job and how much post-abstraction processing is necessary. Microsoft SQL and T-SQL solutions process substantial amounts of data from many different tables and can take a long time to run on large populations. There are clinical documentation differences between the various sites, requiring additional coding when applying the data requirements and rules. Establishing logic for data elements within a given EHR can be time consuming up front, requiring close collaboration between clinician and analytics teams. Data may be stored differently between multiple medical centers in one institution, requiring processing to comply with data requirements for standardization. While sites with similar data storage structures can share coding experience in data abstraction, variable coding schemes pose challenges for direct

Strengths and limitations
To our knowledge, this is the first multicenter study to evaluate the feasibility of an automation process during the COVID-19 pandemic. This automation process should be applicable to any EHR vendor (EHR type agnostic), and these purposefully sampled representative data points would be relevant to any other clinical study/trial, which is a major strength of this study. Nonparticipation of 19 of the 30 sites in the VIRUS-PEEP group, which leads to a possibility of selection bias, is a major limitation. The time constraints in the ongoing pandemic at participating sites were the reason behind this non-participation in the validation process. However, extracting data across 11 different centers is one of the strengths of this study; it could also highlight the variations in staff, procedures, and patients at these institutions. Although the SQL queries could be applied at most sites, some sites required new SQL tailored to their data architecture. One key limitation for our group was that all sites found a portion of data extraction that could not be automated, including variables that are described in narrative, such as patient symptoms, estimated duration of onset of symptoms, and imaging interpretations. Another limitation is a notable discrepancy between manual and automated extraction for important outcomes like ICU LOS and IMV days. The automation process relies on the procedure order date (intubation/extubation), the ADT (hospital/ICU admission, discharge, transfer) order date and time, and the discontinuation date in the EHR; however, the manual extractors look for the first documented ICU stay or IMV in the EHR, which could account for the discrepancy in outcomes like ICU LOS and IMV days. Transferring a patient to a location that was not a usual ICU due to a COVID-19 surge may be another possible explanation for the observed lower sensitivity of ICU admission rate. Variation in the creation of makeshift ICUs at different institutions may have caused this discrepancy in the automation of ICU admissions documentation. This may partially explain the lower sensitivity and high specificity of ICU admission, IMV, NIMV, and ECMO rates by the automation process. Another noticeable issue was that the manual data extraction was done in real time, whereas automation was done when the patient was discharged and mainly relied on billing codes and manually verified data available in the EHR.
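The effect of using different source timestamps on a length-of-stay variable can be shown with a small Python sketch. All timestamps below are invented for illustration, and the variable names are hypothetical; the sketch only assumes, as described above, that the automated process reads ADT transfer times while a manual extractor records the first time ICU care is documented in the chart.

```python
from datetime import datetime


def los_days(start: datetime, end: datetime) -> float:
    """Length of stay in days between two timestamps."""
    if end < start:
        raise ValueError("end precedes start")
    return (end - start).total_seconds() / 86400.0


# Hypothetical patient: the ADT transfer is recorded late in the evening,
# but the first ICU documentation appears in the chart the next morning.
adt_icu_transfer_in = datetime(2020, 4, 1, 22, 30)
adt_icu_transfer_out = datetime(2020, 4, 9, 10, 0)
first_documented_icu_care = datetime(2020, 4, 2, 6, 15)

automated_los = los_days(adt_icu_transfer_in, adt_icu_transfer_out)
manual_los = los_days(first_documented_icu_care, adt_icu_transfer_out)

print(f"automated ICU LOS: {automated_los:.2f} days")
print(f"manual ICU LOS:    {manual_los:.2f} days")
print(f"difference:        {automated_los - manual_los:.2f} days")
```

Even when both methods describe the same stay, a gap of a few hours between the order timestamp and the first chart entry shifts the computed LOS, and such shifts accumulate across patients into the kind of discrepancy noted for ICU LOS and IMV days.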

Future direction
Future research on this topic could involve a thorough comparison of all patient records extracted using two methods: manual extraction and automated SQL queries. The data comparison could be done by aligning data points across a wide range of variables for each data extraction method and then statistically analyzing their consistency and discrepancies. This detailed comparison would verify the reliability of automated data extraction and provide insights into areas that could be improved for greater accuracy.
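One way such a record-level comparison might be set up is sketched below in Python. The record layout (a dictionary keyed by a de-identified patient ID), the variable names, and the values are all hypothetical; the sketch simply aligns the two extractions on shared patient IDs and reports, for each variable, how many values agree and which patients would need review.

```python
def compare_extractions(manual, automated, variables):
    """Align two extractions by patient ID and summarize agreement per variable.

    `manual` and `automated` map patient ID -> {variable name: value}.
    Only patients present in both extractions are compared.
    """
    shared_ids = sorted(set(manual) & set(automated))
    report = {}
    for var in variables:
        mismatched = [
            pid for pid in shared_ids
            if manual[pid].get(var) != automated[pid].get(var)
        ]
        report[var] = {
            "n_compared": len(shared_ids),
            "n_agree": len(shared_ids) - len(mismatched),
            "mismatched_ids": mismatched,
        }
    return report


# Hypothetical de-identified records from each extraction method.
manual_records = {
    "P001": {"sex": "F", "icu_admission": True, "hospital_los": 9},
    "P002": {"sex": "M", "icu_admission": False, "hospital_los": 4},
    "P003": {"sex": "M", "icu_admission": True, "hospital_los": 15},
}
automated_records = {
    "P001": {"sex": "F", "icu_admission": True, "hospital_los": 9},
    "P002": {"sex": "M", "icu_admission": False, "hospital_los": 5},
    "P003": {"sex": "M", "icu_admission": False, "hospital_los": 15},
    "P004": {"sex": "F", "icu_admission": False, "hospital_los": 2},
}

report = compare_extractions(
    manual_records, automated_records, ["sex", "icu_admission", "hospital_los"]
)
for variable, summary in report.items():
    print(variable, summary)
```

The lists of mismatched IDs give reviewers a concrete starting point for chart review, and the per-variable counts can feed directly into agreement statistics such as percent agreement or kappa.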

Conclusion
This study confirms the feasibility, reliability, and validity of an automated process to gather data from the EHR. The use of automated data is comparable to the gold standard. Automated data extraction provides an additional solution when a large volume of patient data needs to be extracted rapidly.

TABLE 1
Comparison of patients in automated versus manual reviews and measures of agreement for individual responses for continuous variables.

TABLE 2
Comparison of patients in automated versus manual reviews and measures of agreement for individual responses for categorical (ordinal) variables.

TABLE 3
Comparison of patients in automated versus manual reviews and measures of agreement for individual responses for categorical (nominal) variables.