Original Research Article. Provisionally accepted.

Front. Big Data | doi: 10.3389/fdata.2019.00030

Significant EHR features driven T2D inference: Predictive Machine Learning and Networks

  • 1Center Computational Science, University of Miami, United States
  • 2University of Padova, Italy

Background. Electronic Health Records (EHR) play an important role in redefining phenotypes, given the wealth and heterogeneity of information now available from disparate data sources. A recent cross-sectional retrospective study described the potential of EHR for type 2 diabetes mellitus (T2D) screening when ad hoc models are used. About 10,000 US patients were analyzed through a variety of inference techniques applied to all records, which varied in their degree of completeness. The analyses conducted in the reference study indicated that EHR phenotypes significantly improved T2D detection.
Methods. Using the same US patients and the T2D data reported in the above study, we propose an integrative inference approach that leverages the predictive power of EHR features selected by two well-known methods, Random Forests and Lasso. The goal is twofold: to reduce the big data redundancies potentially harmful to the predictive learning task, and to exploit the inter-connectivity of EHR features. A mutual information (MI) network is the inference tool used to identify communities that help prioritize the significant T2D features underlying the similarity between patients.
Results. The communities detected after applying both methods differed in granularity and were centered mainly on T2D comorbidities and risk factors. As such, they appear highly relevant for assessing two main issues: T2D disease burden and prevention.
Conclusions. Our analytical approach offers a solution for managing the EHR scale factor in a complex disease context. EHR are rich sources of phenotypic diversity from which novel patient stratifications are expected. To enable these results, both pre-screening of variables and calibration of risk prediction methods become necessary steps in EHR analyses. We have presented networks that identify major T2D communities. The specific significance of comorbidities and risk factors in relation to T2D can be inferred accurately from a suitably reduced number of EHR features.
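
As an illustration of the network step named in the Methods, the following is a minimal sketch and not the authors' implementation. It assumes that feature pre-selection (for example by Random Forests or Lasso) has already been done, that nodes are EHR features, and that edges link feature pairs whose empirical mutual information exceeds a cutoff. Connected components are used here as a simple stand-in for a community detection algorithm. The feature names, the toy values, the 0.1 threshold, and all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in pxy.items():
        # p(a,b) * log( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (px[a] * py[b]))
    return mi


def mi_network(features, threshold):
    """Adjacency dict linking feature pairs whose MI exceeds the threshold."""
    adj = {name: set() for name in features}
    for u, v in combinations(features, 2):
        if mutual_information(features[u], features[v]) > threshold:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def communities(adj):
    """Connected components, a simple stand-in for community detection."""
    seen, groups = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(adj[node] - seen)
        groups.append(group)
    return groups


# Toy binary features for 8 hypothetical patients (invented, not study data).
toy = {
    "hypertension": [1, 1, 1, 1, 0, 0, 0, 0],
    "obesity":      [1, 1, 1, 0, 0, 0, 0, 1],
    "smoker":       [1, 0, 1, 0, 1, 0, 1, 0],
    "asthma":       [0, 1, 1, 0, 1, 0, 1, 0],
}
network = mi_network(toy, threshold=0.1)  # threshold chosen arbitrarily
groups = communities(network)
print(groups)
```

On this toy input, the features fall into two groups: hypertension with obesity, and smoker with asthma. On real data the cutoff would need calibration, and a dedicated community detection method would normally replace the connected-components step.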

Keywords: Electronic Health Records (EHR), Type 2 Diabetes (T2D), Feature Selection, Network Inference, Patient Stratification

Received: 25 Jan 2019; Accepted: 16 Aug 2019.

Edited by:

Nick Duffield, Texas A&M University, United States

Reviewed by:

José Machado, University of Minho, Portugal
Mohamed Mostafa, Cardiff University, United Kingdom  

Copyright: © 2019 Capobianco and Preo. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Dr. Enrico Capobianco, University of Miami, Center Computational Science, Coral Gables, 33136, Florida, United States