Leveraging Open Electronic Health Record Data and Environmental Exposures Data to Derive Insights Into Rare Pulmonary Disease

Research on rare diseases has received increasing attention, in part due to the realized profitability of orphan drugs. Biomedical informatics holds promise in accelerating translational research on rare disease, yet challenges remain, including the lack of diagnostic codes for rare diseases and privacy concerns that prevent research access to electronic health records when few patients exist. The Integrated Clinical and Environmental Exposures Service (ICEES) provides regulatory-compliant open access to electronic health record data that have been integrated with environmental exposures data, as well as analytic tools to explore the integrated data. We describe a proof-of-concept application of ICEES to examine demographics, clinical characteristics, environmental exposures, and health outcomes among a cohort of patients enriched for phenotypes associated with cystic fibrosis (CF), idiopathic bronchiectasis (IB), and primary ciliary dyskinesia (PCD). We then focus on a subset of patients with CF, leveraging the availability of a diagnostic code for CF and serving as a benchmark for our development work. We use ICEES to examine select demographics, co-diagnoses, and environmental exposures that may contribute to poor health outcomes among patients with CF, defined as emergency department or inpatient visits for respiratory issues. We replicate current understanding of the pathogenesis and clinical manifestations of CF by identifying co-diagnoses of asthma, chronic nasal congestion, cough, middle ear disease, and pneumonia as factors that differentiate patients with poor health outcomes from those with better health outcomes. We conclude by discussing our preliminary findings in relation to other published work, the strengths and limitations of our approach, and our future directions.


INTRODUCTION
Rare diseases, while individually uncommon, collectively represent a large number of patients and place an enormous burden on healthcare systems, families, and caregivers. Drug discovery and drug repurposing for rare diseases have received increasing attention over the past decade, in part due to the realization that profits can be made from so-called orphan drugs, despite the relatively small market, and in part due to the advocacy of groups such as the International Rare Disease Consortium (Austin et al., 2018) and the National Organization for Rare Disorders (Dunkle, 2014). National Institutes of Health funding for rare disease and orphan drugs reflects this trend (National Institutes of Health, 2021). Advances in biomedical informatics promise to accelerate clinical and translational research, including research on rare diseases. Yet, many challenges remain, including regulatory and privacy concerns that hinder research access to the electronic health records (EHRs) of patients with rare disease when few patients exist within a healthcare system; the lack of definitive diagnostic codes for the majority of rare diseases; complexities related to data integration and semantic harmonization across disparate data sources; inconsistencies in the adoption of standardized ontologies, vocabularies, and terminologies; and domain-specific differences in terminologies and data representations (Wilkinson et al., 2016; Haixiang et al., 2017; Colbaugh et al., 2018; Shen et al., 2018; Cohen et al., 2020).
To begin to address these challenges, we have developed an open-source solution that can be used to explore EHR data on patients with rare diseases and identify factors that may contribute to health outcomes. Specifically, the Integrated Clinical and Environmental Exposures Service (ICEES) is an open service that exposes, in a regulatory-compliant manner, EHR data that have been integrated at the patient level with a variety of environmental exposures data (Pfaff et al., 2019; Xu et al., 2020). ICEES also provisions analytic tools to explore the integrated data. As such, ICEES provides a powerful open-source, regulatory-compliant solution that allows users to readily explore real-world clinical observations, including observations related to rare disease, and conduct basic analyses designed to investigate environmental exposures and other factors that may influence health outcomes.
Herein, we describe a proof-of-concept application of ICEES to examine demographics, clinical characteristics, environmental exposures, and health outcomes among a cohort of patients enriched for phenotypes (i.e., diagnoses and procedures) associated with rare pulmonary diseases, namely, cystic fibrosis (CF), idiopathic bronchiectasis (IB), and primary ciliary dyskinesia (PCD). We focus on a subset of patients with CF, leveraging the availability of a diagnostic code for CF and serving as a benchmark for our development work. We use ICEES to examine select demographics, co-diagnoses, and environmental exposures that may contribute to poor health outcomes among patients with CF, defined as emergency department (ED) or inpatient visits for respiratory issues, with the goal of replicating prior work. We then discuss our findings in the context of published work, the strengths and limitations of our approach, and our future directions.

METHODS
All study procedures were approved by the Institutional Review Board at the University of North Carolina at Chapel Hill (protocol #21-0099).

"Rare Pulmonary Disease" Cohort
Our overall goal was to define an ICEES "rare pulmonary disease" cohort to explore demographic factors, clinical characteristics, and environmental exposures that impact health outcomes. We focused on a 1-year study period (calendar year 2020) and the rare diseases CF, IB, and PCD (Orphanet rare disease resources, June 13, 2022). Given that a definitive diagnostic code was available only for CF (ICD-E84), we identified patients as having possible CF, IB, or PCD using broad, expert-defined inclusion criteria based on EHR data from UNC Health. Specifically, we included patients who met the following criteria: (1) hospital or clinic location: one or more visits to the adult pulmonary clinic, any pediatric clinic, the male infertility clinic, the general hospital inpatient service for adult patients, any hospital inpatient service for pediatric patients with CPT codes for respiratory therapy and/or physical therapy, or admission to the Neonatal Intensive Care Unit for term births (≥35 weeks gestation); (2) diagnosis or procedure: one or more diagnoses or procedures for congenital heart disease with situs inversus/laterality/dextrocardia/laterality defect, sinus surgery before age 6 years, tympanoplasty tubes before age 1 year, echocardiograms before age 6 months, Kartagener's syndrome (situs inversus), bronchiectasis, heterotaxy, semen analysis, or male infertility.

ICEES
ICEES provides regulatory-compliant open access to binned, observational clinical data that have been integrated at the patient level with public environmental exposures data. Key to the design of ICEES are "ICEES integrated feature tables." These tables are created using a complex custom software pipeline within a secure environment and under a protocol approved by the Institutional Review Board at the University of North Carolina at Chapel Hill. In brief, a custom, open-source application, Clinical Asset Mapping Program for Health Level 7 Fast Healthcare Interoperability Resources (CAMP FHIR), is used to extract and convert patient EHR data from the PCORnet common data model to FHIR files. A second open-source application, FHIR Patient data Integration Tool (FHIR PIT), then ingests the FHIR files and integrates the patient data with public sources of environmental exposures data, using geocodes and dates. The final step in the FHIR PIT pipeline is the binning or recoding of each feature variable and the de-identification of the integrated data per the Safe Harbor method outlined in the Health Insurance Portability and Accountability Act (HIPAA). The final ICEES integrated feature tables are then deployed behind an open application programming interface (OpenAPI).
The public exposures data are derived from several sources: the United States (US) Environmental Protection Agency's Fused Air Quality Surface Using Downscaling repository on airborne pollutants; the US Department of Transportation's repository on roadway and highway exposures (a proxy for airborne pollutant exposures); the US Census Bureau's American Community Survey data on socio-economic exposures; and the North Carolina Department of Environmental Quality's repository on landfills and concentrated animal feeding operations (CAFOs). The availability of these public sources of exposures data depends on the year of interest. In this study, we focused on the year 2020 as the study period of interest and exposure estimates for residential density and distance between primary residence and nearest major roadway or highway, landfill, or CAFO. Distance estimates were based on patient primary residential address, as listed in the EHR. Additional information on the sources of environmental exposures data can be found in Fecho et al. (2019) and Valencia et al. (2020).
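To illustrate what a "distance between primary residence and nearest landfill or CAFO" estimate involves, the following is a minimal sketch only. It assumes that residences and facilities are available as latitude/longitude pairs and that great-circle (haversine) distance is an acceptable approximation; the distance method actually used in the FHIR PIT pipeline is not described here, and the function names and coordinates are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # approximate mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def nearest_distance_m(residence, sites):
    """Distance in meters from a (lat, lon) residence to the closest (lat, lon) site."""
    if not sites:
        raise ValueError("at least one site is required")
    return min(haversine_m(*residence, *site) for site in sites)


# Hypothetical coordinates, for illustration only (not patient or facility data).
residence = (35.90, -79.05)
facilities = [(35.95, -79.00), (36.10, -78.80)]
print(round(nearest_distance_m(residence, facilities)))
```

In a pipeline of the kind described above, a distance computed this way would be binned before release rather than exposed as a raw value.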
The binning or recoding step in the ICEES pipeline is necessary per institutional mandate to abstract the data and thereby create an added layer of privacy beyond HIPAA Safe Harbor deidentification. This mandate largely reflects our institution's position that any data derived using protected health information (PHI) such as exposure estimates are treated as "secondary PHI" and therefore subject to privacy regulations. We based our binning or recoding approach on a combination of published models, subject matter expert recommendation, and mathematical algorithms. In brief, ages were calculated from day one of the 1-year study period and binned as <5, 5-17, 18-44, 45-64, 65-89 years, per our prior work (Fecho et al., 2008a,b, 2019), with 89 years treated as the ceiling per HIPAA. Sex was categorized as biological male or female, as listed in the EHR. Race was categorized as Caucasian, African American, Asian, and Other. Ethnicity was categorized as Hispanic or Latino vs. not Hispanic or Latino. Diagnoses were based on diagnostic codes and treated as binary (0, no; 1, yes), calculated as: 0, 1, >1 diagnoses over the 1-year study period. Residential density was based on the US Census Bureau's classification: rural area (<2,500 persons per Census block group); urban cluster (2,500 up to 50,000 persons per Census block group); and urbanized area (>50,000 persons per Census block group). Proximity from primary residence to a major roadway or highway was binned based on published work (Schurman et al., 2018): 0-49, 50-99, 100-149, 150-199, 200-249, ≥250 meters. Distance between primary residence and nearest landfill or CAFO was modeled using a combination of subject matter expert recommendation and published modeling approaches (Radon et al., 2007; Rasmussen et al., 2017; Njoku et al., 2019; Tomita et al., 2020): <500, 500-1,000, 1,000-2,000, 2,000-4,000, >4,000 meters.
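The age and roadway-proximity bins described above can be sketched as follows. This is an illustrative sketch, not the ICEES implementation: the function and constant names are hypothetical, the ASCII label ">=250" stands in for "≥250", and the handling of fractional values (for example, 49.5 meters falling in the 0-49 bin) is an assumption.

```python
from bisect import bisect_right

# Each edge is the lower bound of the next bin; the labels follow the bins listed in the text.
AGE_EDGES = [5, 18, 45, 65]
AGE_LABELS = ["<5", "5-17", "18-44", "45-64", "65-89"]

ROAD_EDGES = [50, 100, 150, 200, 250]
ROAD_LABELS = ["0-49", "50-99", "100-149", "150-199", "200-249", ">=250"]


def bin_value(value, edges, labels):
    """Return the label of the bin that contains value."""
    return labels[bisect_right(edges, value)]


def bin_age(age_years):
    """Bin an age in years, treating 89 as the ceiling as described in the text."""
    return bin_value(min(age_years, 89), AGE_EDGES, AGE_LABELS)


def bin_roadway_distance(meters):
    """Bin the distance in meters from residence to the nearest major roadway."""
    return bin_value(meters, ROAD_EDGES, ROAD_LABELS)


print(bin_age(4), bin_age(17), bin_age(95))
print(bin_roadway_distance(49), bin_roadway_distance(250))
```

The landfill and CAFO bins are omitted because the listed ranges share boundary values (for example, 1,000 meters appears in two bins), and the text does not say which bin a boundary value belongs to.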

Analytic Approach
In this study, we first queried ICEES to examine demographics, diagnoses, environmental exposures, and health outcomes among patients within the rare pulmonary disease cohort with possible CF, IB, or PCD. As noted above, ages were based on day one of the 1-year study period (calendar year 2020); diagnoses were based on one or more diagnostic codes for a condition over the 1-year study period; and environmental exposures were based on estimated residential density and nearest distance from primary residence to major roadway or highway, landfill, or CAFO.
We then used ICEES to create a subcohort of patients with one or more diagnostic codes for CF (ICD-E84) and an active EHR over the 1-year study period, meaning that the patient was seen by a UNC Health provider at least one time during the study period. Our primary health outcome was ED or inpatient visits for any respiratory issue, as defined in Fecho et al. (2019). We applied the ICEES multiple comparison functionality to identify demographic factors, co-diagnoses, and environmental exposures that differ between patients with CF and poor health outcomes, defined as one or more ED or inpatient visits for respiratory issues, and patients with CF and better health outcomes, defined as zero ED or inpatient visits for respiratory issues. We applied a chi-square test with multiple comparisons and a Bonferroni-corrected α = 0.05.
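The statistical comparison described above can be sketched as follows. This is a minimal, standard-library illustration of a Pearson chi-square test on a 2 x 2 table followed by a Bonferroni correction; it is not the ICEES implementation. The function names and cell counts are hypothetical, and only the group sizes (37 patients with poor outcomes and 126 with better outcomes) are taken from this study.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be nonzero")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


def bonferroni_significant(p_values, alpha=0.05):
    """Flag each test as significant at the Bonferroni-corrected threshold alpha / m."""
    threshold = alpha / len(p_values)
    return {name: p <= threshold for name, p in p_values.items()}


# Hypothetical counts: rows are poor vs. better outcomes, columns are
# co-diagnosis present vs. absent. Row totals match the 37 and 126 patients.
tables = {
    "feature_a": (20, 17, 20, 106),
    "feature_b": (10, 27, 30, 96),
}
p_values = {name: chi_square_2x2(*cells)[1] for name, cells in tables.items()}
print(bonferroni_significant(p_values))
```

The chi-square approximation becomes unreliable when expected cell counts are small, which is a practical concern for a subcohort of this size.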

RESULTS

Characterization of Rare Pulmonary Disease Cohort
We first queried ICEES to examine the characteristics of the broadly defined rare pulmonary disease cohort (N = 4,840) (Table 1).
The demographic profile indicated that most patients were middle age or older [5.58% (270/ We also examined select diagnoses based on their relevance to CF, IB, and PCD. Over the 1-year study period, ten percent or more of patients had one or more diagnoses for anxiety [

Health Outcomes Among Patients Within the Rare Pulmonary Disease Cohort
We then used ICEES to examine health outcomes for patients within the rare pulmonary disease cohort. We focused on ED or inpatient visits for respiratory issues over the 1-year study period (Figure 1). The majority of patients [77.44% (3,748/4,840)] did not have any ED or inpatient visits for respiratory issues. Of those who had at least one ED or inpatient visit for respiratory issues,

Health Outcomes Among Patients With CF
We next used ICEES to examine health outcomes among the 163 patients [3.37% (163/4,840)] with a diagnosis of CF. Thirty-seven patients with CF (22.70%) had one or more visits to the ED or an inpatient clinic for respiratory issues over the 1-year study period (maximum = 8) (Table 2).
The demographic composition of patients with CF who had poor health outcomes was similar to that among patients with CF and better health outcomes. Environmental exposures were likewise similar among patients with CF, regardless of health outcome. However, several co-diagnoses differentiated patients with CF and poor health outcomes from those with better health outcomes.

DISCUSSION

Summary of Findings and Relationship to Other Published Work
In this study, we defined a cohort of patients enriched for phenotypes (i.e., diagnoses and procedures) associated with rare pulmonary diseases, specifically, CF, IB, and PCD. We deployed the dataset behind an ICEES OpenAPI and used ICEES to examine demographics, clinical characteristics, environmental exposures, and health outcomes among patients within the cohort. We then used ICEES to focus on a subset of patients with a diagnostic code for CF and applied the ICEES multiple comparisons functionality to determine factors that significantly differed among patients with CF and poor health outcomes vs. those with better outcomes. We were able to replicate current understanding of the pathogenesis and clinical manifestations of CF (e.g., Turcios, 2020) by identifying several co-diagnoses that differentiated patients with CF and poor health outcomes from those with better outcomes: asthma; chronic nasal congestion; cough; middle ear disease; and pneumonia. Demographic factors and environmental exposures were not associated with health outcomes among patients with CF.
Several findings and other points are worth discussing in relation to their interpretation and previously published work. For instance, the inclusion criteria that we used to define the ICEES rare pulmonary disease cohort were intentionally broad, the goal being to create a dataset enriched in phenotypes associated with possible CF, IB, and PCD rather than definitive diagnostic codes, which do not exist for IB and PCD. The intent behind our broad inclusion criteria was to create a rare pulmonary disease dataset with multiple applications, including the one reported here and others such as enabling expert chart review to determine definitive diagnoses of CF, IB, and PCD and providing a training dataset to support supervised machine learning estimates of predicted diagnoses of CF, IB, and PCD. We have moved forward with both additional applications and plan to incorporate the definitive diagnoses and predictions into the next deployment of the ICEES rare pulmonary disease OpenAPI. While EHR data are intended to support healthcare administration and billing, not research, such records provide a valuable source of research data that, when used appropriately, can accelerate clinical and translational research, including research on rare diseases. For instance, classification and machine learning models have been developed and successfully applied by our group and others (Colbaugh et al., 2018; Shen et al., 2018; Cohen et al., 2020) to leverage available EHR data in a subject matter expert-informed manner and use that collective information to predict patients who have specific rare diseases. Moreover, to support the application of EHR data for research, numerous statistical techniques and software packages have been developed to account for the missing data and imbalances that are inherent in EHR data (e.g., Chawla et al., 2002; Wells et al., 2013; Haixiang et al., 2017). ICEES extends EHR data to include environmental exposures data and exposes the integrated data for open exploratory analysis. While the present proof-of-concept study did not identify significant environmental impacts on health outcomes among patients with CF, our prior work has (e.g., Fecho et al., 2019, 2022), and we expect the production version of the ICEES rare pulmonary disease instance to likewise reveal clinical and environmental influences on health outcomes.

FIGURE 2 | Co-diagnoses that significantly differed between patients with CF and poor health outcomes (defined as one or more ED or inpatient visits for respiratory issues) vs. patients with CF and better health outcomes (defined as zero ED or inpatient visits for respiratory issues), N = 163, P < 0.05. CF, cystic fibrosis; ED, emergency department.

Another consideration is that the co-diagnoses that we identified as related to poor health outcomes among patients with CF are all pulmonary and perhaps not unexpected. This reflects several factors. First, we considered the initial ICEES rare pulmonary disease instance as a proof-of-concept application of ICEES to the study of rare disease, and so, we focused on a relatively small subset of EHR data. In addition, we did not capture medications, laboratory measures, or procedures, which we have done for other ICEES cohorts. The CAMP FHIR/FHIR PIT data conversion and integration software pipeline relies on an enumeration file that is manually generated. The enumeration file that supports the ICEES PCD instance does not currently support medications, laboratory measures, or procedures. Having demonstrated proof of concept, we are now updating the file and correcting technical bugs in our software pipeline that were detected as part of the work described herein. We expect to continuously improve and extend the ICEES rare pulmonary disease instance, thus providing a unique open-source resource for exploratory analysis of clinical and environmental determinants of health.
A related point to consider is that the co-diagnoses we identified as affecting health outcomes among patients with CF may or may not represent comorbidities. The study design and available data did not allow us to differentiate between patients who simply became ill over the study period vs. those with chronic comorbidities.
Our findings on environmental exposures that differentiate patients with CF and poor health outcomes from those with better outcomes are also worth discussion. Specifically, we did not identify a relationship between proximity to a landfill or CAFO and poor health outcomes among patients with CF. In fact, most patients with CF and poor health outcomes resided >4,000 meters from a landfill or CAFO. These findings contradict published findings on the pulmonary effects of landfill and CAFO exposures (e.g., Radon et al., 2007;Rasmussen et al., 2017;Njoku et al., 2019;Tomita et al., 2020). Likewise, we did not find a relationship between proximity to a major roadway or highway and health outcomes among patients with CF, which contradicts prior findings on asthma exacerbations from our group (Fecho et al., 2022) and other groups (Perez et al., 2012;Schurman et al., 2018;Hauptman et al., 2020). Finally, we did not identify a relationship between rural residence and poor health outcomes among patients with CF, which contradicts our prior findings on asthma exacerbations (Fecho et al., 2022) and the well-established rural health disparities in North Carolina and elsewhere (North Carolina Institute of Medicine, 2014).
We believe that there are several explanations for these apparent discrepancies. First, our patient catchment area is largely rural, as we have reported previously (e.g., Fecho et al., 2022). As such, the current findings may simply reflect an inherent bias in our data that obscured rural disparities in health outcomes. Second, our geocoding was sparse. In fact, only 27.42% of patients were successfully geocoded, unlike our prior work, in which few patients lacked valid geocodes (e.g., Fecho et al., 2019; Xu et al., 2020). The sparse geocoding in the current effort reflects a change in our hospital's geocoding practices that introduced errors, since resolved. Third, our models for estimating landfill and CAFO exposures may need refinement. This is the first study in which we applied landfill and CAFO exposures to patient data as part of ICEES. We are considering several more sophisticated models for landfill and CAFO exposures (Bunton et al., 2007; Son et al., 2021). We are also working with groups such as the Environmental Health Language Collaborative (National Institute of Environmental Health Sciences, 2021) to standardize and harmonize environmental health languages, ontologies, and exposure models. Finally, it may be that the environmental exposures that influence health outcomes among patients with CF and rare pulmonary diseases, or the models that we applied to estimate those exposures, differ from those that influence health outcomes among patients with asthma and common pulmonary diseases. We plan to expand the current work to include data on machine learning predictions and expert-confirmed diagnoses of CF, IB, and PCD, which will allow us to refine our exposure models and estimates and further explore the impact of environmental exposures on rare pulmonary diseases.

Limitations
This study has several limitations, in addition to those discussed above, that should be considered when interpreting the results. First, our study focused on the year 2020, which limited the sources of environmental exposures data that we had access to. For example, public data on airborne pollutant exposures such as particulate matter were not available for year 2020. This is an ongoing challenge with ICEES, one that is largely out of our control. However, we continue to monitor public sources of environmental exposures data for new releases. Second, we did not capture certain clinical phenotypes of interest to our subject matter experts, including male infertility due to sperm tail dysfunction, low nasal nitric oxide levels, and defective ciliary ultrastructure. ICEES currently is limited to demographic data, diagnoses, medications (prescriptions or administrations), select laboratory measures and procedures, and select environmental exposures. We are working to expand the laboratory measures and procedures that are captured by ICEES, which will support richer analyses. Third, ICEES currently supports open multivariate analytic approaches such as logistic regression (Fecho et al., 2021). However, the approach introduces a certain amount of data loss and thus influences multivariate model robustness due to regulatory constraints surrounding small sample sizes. In the current study, the sample size for patients with CF was simply too small to invoke the ICEES multivariate approach, even without the data loss that the open multivariate approach introduces (Bujang et al., 2018). Small sample sizes are a challenge for research on rare disease. However, as demonstrated with this study, ICEES can be applied to gain insights into clinical and environmental determinants of health even with small sample sizes. 
Our proof-of-concept demonstration study revealed the need for additional analytic features, including Fisher's Exact Test, to adjust for small cell sizes, as well as more sophisticated analytics to support clustering, enrichment analysis, and machine learning algorithms. We are developing approaches to implement new analytic features and expose them to users in an open regulatory-compliant manner. A final consideration, but not necessarily a limitation, is that ICEES is under continual development. As such, users should be aware that the service continues to evolve, with new feature variables and functionalities introduced over time. Users can access ICEES through the ICEES OpenAPI and associated Swagger user interface and are encouraged to post any issues that are identified in the ICEES GitHub repository (see References list for URLs).
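To show why Fisher's Exact Test suits small cell sizes, the following is a minimal standard-library sketch of the two-sided test for a 2 x 2 table. It sums hypergeometric probabilities directly rather than relying on a large-sample approximation. It is illustrative only and does not represent a planned or existing ICEES feature; the function name and example counts are hypothetical.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with all margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum over every table that is no more probable than the observed table;
    # the small relative tolerance guards against floating-point ties.
    p = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )
    return min(p, 1.0)


# Hypothetical small-cell example: 3 of 4 vs. 1 of 4 patients with a co-diagnosis.
print(fisher_exact_2x2(3, 1, 1, 3))
```

Because the p-value is computed exactly from the table margins, the test remains valid when expected counts fall below the levels usually recommended for a chi-square test.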

CONCLUSIONS
Here, we demonstrate the application of an open clinical service, ICEES, to explore clinical and environmental determinants of rare pulmonary disease. We focus on a subset of patients with a diagnosis of CF, and we replicate current understanding of CF by identifying co-diagnoses that differentiate patients with poor health outcomes from those with better health outcomes. ICEES was able to replicate prior findings without the need for regulatory approvals, patient recruitment, or complex epidemiological study design. As an open, regulatory-compliant, disease-agnostic service, ICEES has applications in the study of rare disease and can be used to overcome regulatory challenges and accelerate research. Moreover, ICEES has broader applications as a tool to inform public health research and delivery of health care, for example, by allowing researchers and healthcare providers to quickly explore the impact of clinical factors and environmental exposures on health outcomes, including among patients with suspected rare disease, as was demonstrated in the work described herein.

DATA AVAILABILITY STATEMENT
The raw datasets presented in this article are not readily available because the data include sensitive patient data; rather, the data are available in a form consumable by the public via the ICEES rare pulmonary disease OpenAPI (see References list for URL).

ETHICS STATEMENT
All study procedures were reviewed and approved by the Institutional Review Board at the University of North Carolina at Chapel Hill (protocol #21-0099). Written informed consent for participation was not required for this study in accordance with federal legislation and institutional requirements.

AUTHOR CONTRIBUTIONS
KF, SA, MK, AK, and ML conceived the study. KF contributed to the design of the ICEES rare pulmonary disease OpenAPI, tested all software implementations, conducted all analyses, and prepared the first manuscript draft. MK and ML provided clinical subject matter expertise. KM, MW, and HY served as technical leads and led software development and deployment of the ICEES OpenAPI. MK, ML, and EP developed the inclusion criteria for selecting patients within the rare pulmonary disease cohort. EP co-led the design and implementation of the CAMP FHIR software application. All authors contributed to the study design, assisted with interpretation of the results, reviewed the first draft of the manuscript, and approved the final submission.

FUNDING
This work was supported by funding from the National Center for Advancing Translational Sciences (award numbers OT2TR003430, UL1TR002489, UL1TR002489-03S4, and OT3TR002020).