Unifying Diagnosis Identification and Prediction Method Embedding the Disease Ontology Structure From Electronic Medical Records

Objective: The reasonable classification of a large number of distinct diagnosis codes can clarify patient diagnostic information and help clinicians assign and target treatment for primary diseases. Our objective is to identify and predict a unifying diagnosis (UD) from electronic medical records (EMRs). Methods: We screened 4,418 sepsis patients from the public MIMIC-III database and extracted their diagnostic information for UD identification, and their demographic information, laboratory examination information, chief complaint, and history of present illness for UD prediction. We proposed a data-driven UD identification and prediction method (UDIPM) that embeds the disease ontology structure. First, we designed a set similarity measure method embedding the disease ontology structure to generate a patient similarity matrix. Second, we applied affinity propagation clustering to divide patients into different clusters, and extracted a typical diagnosis code co-occurrence pattern from each cluster. Third, we identified a UD by fusing visual analysis and a conditional co-occurrence matrix. Finally, we trained five classifiers in combination with feature fusion and feature selection methods to predict the UD. Results: The experimental results on a public electronic medical record dataset showed that the UDIPM could effectively extract a typical diagnosis code co-occurrence pattern, identify and predict a UD based on patients' diagnostic and admission information, and outperform other fusion methods overall. Conclusions: The accurate identification and prediction of the UD from a large number of distinct diagnosis codes and multi-source heterogeneous patient admission information in EMRs can provide a data-driven approach to assist better coding integration of diagnoses.


INTRODUCTION
In medical practice, clinicians are encouraged to seek a unifying diagnosis (UD) that could explain all the patient's signs and symptoms in preference to providing several explanations for the distress being presented (1). A UD is a critical pathway to identify the correct illness and craft a treatment plan; thus, clinical experience and knowledge play an important role in the science of diagnostic reasoning. Generally, from a brief medical history from a patient, clinicians can use the intuitive system in their brain and rapidly reason the disease types, whereas for complex and multi-type abnormal results, clinicians must use the more deliberate and time-consuming method of analytic reasoning to deduce the UD, raising the risk of diagnostic errors (2).
To increase the accuracy of a UD, enhancing individual clinicians' diagnostic reasoning skills and improving health care systems are regarded as two important approaches to support clinicians through the diagnostic process. The former requires professional knowledge training and lifelong learning, whereas the latter mainly involves the development of information technology (3). For an individual clinician, an intelligent clinical decision support system (CDSS) is likely to be accepted and can help improve unifying diagnostic decisions (4). Recently, along with the widespread adoption of electronic medical records (EMRs), an extremely large volume of electronic clinical data has been generated and accumulated (5,6). Meanwhile, artificial intelligence and big data analytic technology have been successfully applied to clinical diagnostic procedures and treatment regimen recommendation, which has created new opportunities for intelligent CDSSs that use data-driven knowledge discovery methods (7-10).
From the data mining perspective, a UD aims to classify a large number of distinct diagnosis codes reasonably according to the disease taxonomy and to adopt one disease to summarize or explain the various clinical manifestations. Therefore, the nature of a UD is diagnosis code assignment along with disease correlation exploitation. Diagnosis code assignment refers to the clinical decision process in which supervised methods are adopted to predict and annotate disease codes based on patients' medical history, signs and symptoms, and laboratory examination (11). According to the number of diagnosis codes assigned to patients, diagnosis code assignment can be divided into single-label (12), multi-class (13), multi-label (14), and multi-task learning methods (15). However, although many novel supervised learning models have been proposed and can achieve high performance in assigning diagnosis codes for new patients using frontier supervised methods, such as ensemble learning (16), reinforcement learning (17), and deep learning (18), they cannot further explore disease co-occurrence relations for UD identification and prediction.
Abbreviations: EMR, Electronic medical record; UDIPM, Unifying diagnosis identification and prediction method; CDSS, Clinical decision support system; ICU, Intensive care unit; IC, Information content; LCA, Least common ancestor; AP, Affinity propagation; SS, Sum of similarities; TDC, Typical diagnosis code; LCoP, LCA co-occurrence pattern; AOrd, Average order; TDCCoP, Typical diagnosis code co-occurrence pattern; CCoM, Conditional co-occurrence matrix; UD, Unifying diagnosis; Hadm-id, Hospital admission identifier; FM, Fusion method.
The coexistence of multiple diseases is pervasive in the clinical environment, particularly for patients in the intensive care unit (ICU) (19). According to the statistical results of the MIMIC-III database, which is a freely accessible critical care database, the average number of diagnosis codes for patients in the ICU is 11. Additionally, diagnosis codes are highly fine-grained, closely related, and extremely diverse (20). For example, the patient with admission identifier (ID) 100223 is assigned 28 ICD-9 codes, and many of these diagnosis codes are similar, such as 276.2 (Acidosis, order: 15), 276.0 (Hyperosmolality and/or hypernatremia, order: 18), and 276.6 (Hyperpotassemia, order: 26). Thus, it is tedious and difficult for clinicians to make a consistent, accurate, concise, and unambiguous diagnostic decision.
Furthermore, although the inter-relation of diagnosis codes was considered in previous studies, the researchers commonly used only the first three digits of ICD-9 codes to assign diagnosis codes for patients (21-23); hence, the complexity may increase and the prediction performance may decrease when all digits of the ICD-9 codes are considered. Additionally, in those studies, complicated and confusing diagnosis codes could not be reasonably classified into a UD using a data-driven method. A UD is the basic principle of clinical diagnostic thinking. Its basic idea is that when a patient has many symptoms that can be explained by one disease, the clinician should not explain them using multiple diseases (1). A UD reflects the integrity of the patient and the professionalism of clinicians; however, previous studies mainly focused on the UD of a single category of diseases from the clinical perspective, such as mood/mental disorders (24), intracranial mesenchymal tumor (25), and arrhythmogenic right ventricular cardiomyopathy (26). In this study, we fully consider the fine-grained diagnosis codes (i.e., all digits) of patients, identify the UD from a group of patients' diagnostic information using an unsupervised clustering method, and predict the UD for new unseen patients using multi-class learning methods.

Data Collection
We selected a dataset of sepsis patients from the MIMIC-III database, where sepsis is divided into general sepsis, severe sepsis, and septic shock (27,28). Figure 1 shows the detailed processes of data collection and preprocessing of sepsis patients, including the identification of sepsis patients, data extraction, data cleaning, and feature selection. Finally, we screened 4,418 sepsis patients and extracted their diagnostic information for UD identification, and their demographic information, laboratory examination information, chief complaint, and history of present illness for UD prediction.
First, the diagnostic information of the 4,418 sepsis patients mainly contained the patient hospital admission ID (Hadm-id), ICD-9 diagnosis code, order of the diagnosis code, and a brief definition of the diagnosis code, where the sum, maximum, minimum, and average numbers of diagnosis codes were 80,501, 39, 3, and 18.3, respectively. Additionally, for the visualization, ...
Then, for the health condition of patients admitted to hospital, we used the minimum, maximum, median, mean, and variance as the 5-tuple features of each laboratory indicator, and designed a symptom identification method based on text analysis of patient discharge reports, including rule setting, text segmentation, text extraction, abbreviation dictionary construction, negative word recognition, case unification, word segmentation, stop word removal, and external symptom dictionary embedding (Supplementary Figure 1). Additionally, we added related indicators to measure patients' severity, such as AIDS, hematologic malignancy, metastatic cancer, SOFA, SAPS, and SAPS-II. Finally, we obtained 120 features of the health condition of sepsis patients in the experimental dataset, as shown in Table 1.
Figure 2 shows the proposed UD identification and prediction method (UDIPM), which uses four types of information from EMRs. We adopt diagnostic information to identify the UD, and use demographic information, symptom information, and laboratory examination information to predict the UD. First, we apply a set similarity measure method to a large number of patients by embedding the semantic relations of the ICD classification system (Task 1 in Figure 2). Second, we apply a clustering algorithm to the similarity matrix to divide patients into different groups, and further obtain the exemplar and core patients of each cluster (Task 2 in Figure 2). Third, we extract the typical diagnosis code co-occurrence patterns (TDCCoPs) from each cluster by defining a threshold and a sorting function (Task 3 in Figure 2). Fourth, we combine the visual analysis and conditional co-occurrence matrix (CCoM) to identify the UD by selecting the optimal segmentation (Task 4 in Figure 2). Finally, after obtaining the health condition of the patient admitted to hospital, we obtain a UD prediction using multi-class classification methods (Task 5 in Figure 2).

Patient Similarity Measure Method
Many methods exist for measuring patient similarity (29,30). In this study, considering the semantic relations of diagnosis codes in the ICD ontology structure, we adopt a set similarity measure method. First, we define patient diagnostic information as a series of ordered diagnosis codes. Then we reconstruct the ontology structure based on a disease classification system to easily measure patient similarity. Finally, we describe the process of the set similarity method, including the information content (IC) measure of diagnosis codes, diagnosis code similarity measure, and diagnosis code set similarity measure.

Patient's Diagnostic Information Representation
Diagnostic information refers to a record of disease diagnosis made by clinicians based on the health condition of a patient admitted to hospital. It is stored in the patient's EMR data in the form of diagnosis codes (e.g., ICD-9 and ICD-10). Because of the prevalence of disease complications, a patient's EMR is typically annotated using multiple disease codes, and these codes have a certain priority (i.e., order). The higher the priority of a diagnosis code, the more central and important the disease is for the patient, and vice versa. Thus, patient diagnostic information can be represented as
$$D = \{(dc_1, \mathrm{Ord}(dc_1)), (dc_2, \mathrm{Ord}(dc_2)), \cdots, (dc_i, \mathrm{Ord}(dc_i)), \cdots\}, \tag{1}$$
where $dc_i$ and $\mathrm{Ord}(dc_i)$ represent the $i$-th diagnosis code and its order, respectively.

Ontology Structure Construction
We automatically construct a five-level ICD-9 ontology structure, shown in Figure 3, in which level-0 is the virtual root node, level-1 has 19 chapters, level-2 has 129 sections, level-3 has ∼1,300 categories (Supplementary

Set Similarity Measure

Information Content Measure of Diagnosis Codes
In the ICD-9 ontology structure, each code represents a concept, and there is semantic similarity between classification concepts. Additionally, concepts on the same branch are more similar than those on different branches. Thus, we use the level depth measure method of the hierarchical tree (29), that is, we assign a value to each level of the ICD-9 ontology structure; the deeper the concept level, the larger the value. For an ICD-9 code $dc_i$, the IC is defined as
$$\mathrm{IC}(dc_i) = \mathrm{level}(dc_i, \mathrm{Root}), \tag{2}$$
where Root is the virtual root node and the function level(·) denotes the level depth from the ICD-9 code $dc_i$ to the root node. Intuitively, the IC of the root node (level-0) is 0, and the ICs of a chapter (level-1), section (level-2), category (level-3), subcategory (level-4), and extension (level-5) are 1, 2, 3, 4, and 5, respectively.

Code-Level Similarity Measure
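Given the IC of codes, there are several approaches to measure code-level similarity. We use the least common ancestor (LCA) of two codes to measure the similarity of diagnosis codes, defined as
$$s(dc_i, dc_j) = \frac{2\,\mathrm{IC}[\mathrm{LCA}(dc_i, dc_j)]}{\mathrm{IC}(dc_i) + \mathrm{IC}(dc_j)}, \tag{3}$$
where $dc_i$ and $dc_j$ are two diagnosis codes, and $\mathrm{LCA}(dc_i, dc_j)$ is the LCA of $dc_i$ and $dc_j$. If $dc_i = dc_j$, then $\mathrm{LCA}(dc_i, dc_j) = dc_i = dc_j$ and $\mathrm{IC}[\mathrm{LCA}(dc_i, dc_j)] = \mathrm{IC}(dc_i) = \mathrm{IC}(dc_j)$, so the similarity is 1. If $dc_i \neq dc_j$ and $\mathrm{LCA}(dc_i, dc_j) = \mathrm{Root}$, then $\mathrm{IC}[\mathrm{LCA}(dc_i, dc_j)] = 0$, so the similarity is 0.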
To make this concept easier to understand, we provide a simple example in Figure 4A.
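The IC and LCA-based code similarity can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes each ICD-9 code is stored as its path of ancestors from the chapter down to the code itself, so that the IC equals the path length and the IC of the LCA equals the length of the shared prefix. The function names and the hard-coded example paths are hypothetical.

```python
from typing import Sequence

# Assumption: a code is represented by its path below the virtual root,
# e.g. chapter -> section -> category -> subcategory (-> extension).
ACIDOSIS = ("240-279", "270-279", "276", "276.2")
HYPERNATREMIA = ("240-279", "270-279", "276", "276.0")
RESP_FAILURE = ("460-519", "510-519", "518", "518.8", "518.81")


def ic(path: Sequence[str]) -> int:
    """Information content = level depth; the root (empty path) has IC 0."""
    return len(path)


def lca_ic(a: Sequence[str], b: Sequence[str]) -> int:
    """IC of the least common ancestor = length of the shared path prefix."""
    depth = 0
    for x, y in zip(a, b):
        if x != y:
            break
        depth += 1
    return depth


def code_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """s = 2 * IC(LCA) / (IC(a) + IC(b)), as in the code-level measure."""
    denom = ic(a) + ic(b)
    if denom == 0:  # both arguments are the root; treated as identical here
        return 1.0
    return 2 * lca_ic(a, b) / denom


print(code_similarity(ACIDOSIS, HYPERNATREMIA))  # share the category 276
print(code_similarity(ACIDOSIS, RESP_FAILURE))   # LCA is the root
```

Two codes in the same category score high because their shared prefix is long, while codes from different chapters score 0 because their only common ancestor is the root.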

Code Set-Level Similarity Measure
In the EMR dataset, patient diagnostic information is typically a set of diagnosis codes. Thus, patient similarity can be transformed into the similarity of diagnosis code sets. Generally, for binary code-level similarity, we can use classical methods, such as Dice, Jaccard, cosine, and overlap, to calculate set-level similarity. However, these methods cannot fully embed semantic similarity. Thus, we use the average value of the most similar concept pairs to measure the set-level similarity (29) (Eq. 4), where $D'_i$ and $D'_j$ are the diagnostic information of patient $i$ and patient $j$, respectively, without considering the order of diagnosis codes; that is, $D'_i = \{dc_{i1}, dc_{i2}, \ldots, dc_{ig}, \ldots\}$ and $D'_j = \{dc_{j1}, dc_{j2}, \ldots, dc_{jh}, \ldots\}$. $|D'_i|$ and $|D'_j|$ are the numbers of diagnosis codes for patient $i$ and patient $j$, and $dc_{ig}$ and $dc_{jh}$ are the $g$-th diagnosis code of patient $i$ and the $h$-th diagnosis code of patient $j$, respectively. Finally, we obtain the similarity $S_{ij}$ of the two patients (Figure 4B) and the similarity matrix $S$ for all patients in the EMRs using the set similarity measure method. The pseudocode of the patient similarity measure method is presented in Algorithm 1.
Frontiers in Public Health | www.frontiersin.org
..., $dc_{jh}$, ...), and the diagnosis code similarity $s(dc_{ig}, dc_{jh}) = 2\,\mathrm{IC}(\mathrm{LCA}(dc_{ig}, dc_{jh}))/(\mathrm{IC}(dc_{ig}) + \mathrm{IC}(dc_{jh}))$ based on the ICD ontology structure, compute the set similarity
3. Obtain the similarity matrix $S_{N \times N}$ for $N$ patients
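The set-level step can be sketched as follows. The exact set-level formula is not legible in the text above, so this sketch assumes one common "best-match average" form: each code is matched to its most similar code in the other patient's set, and the matched similarities are averaged over $|D'_i| + |D'_j|$. The function names are hypothetical, and the code-level similarity is the prefix-based measure with codes stored as ancestor paths.

```python
from typing import List, Sequence, Tuple

Code = Tuple[str, ...]  # assumed representation: path from chapter to code


def code_similarity(a: Code, b: Code) -> float:
    """2 * IC(LCA) / (IC(a) + IC(b)) with IC = path length."""
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return 2 * shared / (len(a) + len(b))


def set_similarity(di: Sequence[Code], dj: Sequence[Code]) -> float:
    """Best-match average over both directions (assumed form of Eq. 4)."""
    if not di or not dj:
        return 0.0
    best_i = sum(max(code_similarity(a, b) for b in dj) for a in di)
    best_j = sum(max(code_similarity(a, b) for a in di) for b in dj)
    return (best_i + best_j) / (len(di) + len(dj))


def similarity_matrix(patients: Sequence[Sequence[Code]]) -> List[List[float]]:
    """Symmetric N x N matrix of pairwise patient similarities."""
    n = len(patients)
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            s[i][j] = s[j][i] = set_similarity(patients[i], patients[j])
    return s


p1 = [("240-279", "270-279", "276", "276.2"),
      ("001-139", "030-041", "038", "038.9")]
p2 = [("240-279", "270-279", "276", "276.0")]
print(similarity_matrix([p1, p2]))
```

The pairwise loop is quadratic in the number of patients and, for each pair, in the number of codes, which is manageable for a few thousand patients but worth caching for larger cohorts.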

Patient Clustering Algorithm
A clustering algorithm aims to divide patients into multiple groups based on the similarity matrix S, requiring that patients in the same group are as similar as possible and patients in different groups are as dissimilar as possible (31,32). In this study, we adopt affinity propagation (AP) clustering (33,34) because of its advantages: the number of clusters need not be predefined, the exemplars are real patients, and the clustering error is much lower.
AP clustering determines the number of clusters by controlling the input exemplar preferences ($p$), where $p$ is more robust than $K$ because $p$ monotonically controls the perception granularity. Generally, $p$ depends on the similarity matrix $S_{N \times N}$, the number of input patients ($N$), and the $p$ coefficient ($p_{coe}$), which is represented as
$$p = \mathrm{median}(S) - p_{coe} \cdot N. \tag{5}$$
After patients are clustered, we identify $K$ clusters ($C_1, C_2, \ldots, C_K$), and define the popularity (i.e., support) of each cluster as
$$\mathrm{support}(C_k) = \frac{1}{N}\sum_{j=1}^{N} \mathbb{1}[C(D'_j) = C_k], \tag{6}$$
where $C(D'_j)$ represents the cluster to which patient $j$ belongs and $\mathbb{1}[\cdot]$ equals 1 if its argument holds and 0 otherwise. Additionally, we obtain the sum of similarities (SS), which is an important indicator used to evaluate clustering performance. The SS (Eq. 7) depends on the similarity matrix $S_{N \times N}$, the number of input patients ($N$), the number of clusters ($K$), and the corresponding exemplars. Generally, the larger the SS value, the better the clustering performance. The pseudocode of the patient clustering algorithm is presented in Algorithm 2.
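The quantities around the clustering step can be illustrated with a small pure-Python sketch that takes an already computed clustering as input. Two points are assumptions rather than statements from the text: the median in the preference is taken over off-diagonal similarities, and the SS is computed as the sum of similarities between each non-exemplar patient and the exemplar of its cluster. All function names are hypothetical.

```python
from collections import Counter
from statistics import median
from typing import Dict, List, Sequence


def preference(S: Sequence[Sequence[float]], p_coe: float) -> float:
    """p = median(S) - p_coe * N (median over off-diagonal entries: assumption)."""
    n = len(S)
    off_diag = [S[i][j] for i in range(n) for j in range(n) if i != j]
    return median(off_diag) - p_coe * n


def support(labels: Sequence[int]) -> Dict[int, float]:
    """Share of patients assigned to each cluster."""
    n = len(labels)
    return {k: c / n for k, c in Counter(labels).items()}


def sum_of_similarities(S: Sequence[Sequence[float]],
                        exemplar_of: Sequence[int]) -> float:
    """Assumed SS: similarity of every non-exemplar patient to its exemplar."""
    return sum(S[j][e] for j, e in enumerate(exemplar_of) if j != e)


S: List[List[float]] = [
    [1.0, 0.8, 0.2, 0.1],
    [0.8, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.6],
    [0.1, 0.2, 0.6, 1.0],
]
exemplar_of = [0, 0, 2, 2]  # patient index of each patient's exemplar
print(preference(S, p_coe=0.025))
print(support(exemplar_of))
print(sum_of_similarities(S, exemplar_of))
```

In practice the clustering itself would come from an AP implementation run on the precomputed similarity matrix with this preference value; the sketch only covers the bookkeeping before and after that call.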

TDCCoP Extraction Method
Algorithm 2 | Patient clustering algorithm.
Input: $S_{N \times N}$, $p_{coe}$, step size $\varepsilon$
Output: Optimal clustering number $K^*$, $E(C_k)$, support($C_k$), SS($K^*$)
Run the AP clustering algorithm with $S_{N \times N}$ and $p$ ($p = \mathrm{median}(S) - p_{coe}(\mu) \cdot N$)
3. Return the clustering number $K(\mu)$
Run the AP clustering algorithm with $S_{N \times N}$ and $p$
Return the clustering number $K(\mu)$ and $p_{coe}(\mu)$
Return the maximum $dp_{coe}(K)$ and the optimal clustering number $K^*$
Run the AP clustering algorithm with $S_{N \times N}$ and $p$ ($p = \mathrm{median}(S) - p_{coe} \cdot N$)
9. Return $E(C_k)$, support($C_k$) using Eq. 6, and SS($K^*$) using Eq. 7
In our previous studies, we showed that defining the core zone of a cluster is an effective approach to extract stable clustering results (35). Additionally, considering the complex semantic relations among different diagnosis codes, the features of a cluster cannot be fully described when the diagnostic information of only one patient (the cluster center or exemplar) is used. Thus, we also define the core zone of each cluster to select a group of patients (i.e., core patients) using the k-nearest neighbor method, and further extract typical diagnosis codes (TDCs). For cluster $C_k$, the core zone $Core_k$ is defined around the exemplar (Eq. 8), where $E(C_k)$ is the exemplar of cluster $C_k$ and $\tau$ is a similarity threshold defined in advance, which determines the number of core patients. Then, for cluster $C_k$, the occurrence probability of the diagnosis code $dc_h$ can be represented as
$$p_k(dc_h) = \frac{1}{|Core_k|}\sum_{D'_j \in Core_k} \mathbb{1}[dc_h \in D'_j], \quad h = 1, 2, \ldots, H, \tag{9}$$
where $|Core_k|$ denotes the number of core patients in cluster $C_k$, $\mathbb{1}[\cdot]$ equals 1 if its argument holds and 0 otherwise, and $H$ is the number of all diagnosis codes after duplicates are deleted.
After we calculate the probability of all diagnosis codes in the cluster $C_k$, we define the TDCs as
$$\mathrm{TDC}_k = \{dc_h \mid p_k(dc_h) \geq \delta_1,\ h = 1, 2, \ldots, H\}, \tag{10}$$
where $\delta_1$ is a threshold defined in advance to differentiate high-frequency and low-frequency diagnosis codes. Based on all TDCs of the cluster $C_k$, we further analyze the priority of TDCs by embedding the order of the patient diagnostic information, that is, for patient $j$, $D_j = \{[dc_{j1}, \mathrm{Ord}(dc_{j1})], [dc_{j2}, \mathrm{Ord}(dc_{j2})], \ldots, [dc_{jh}, \mathrm{Ord}(dc_{jh})], \ldots\}$ and $D'_j = \{dc_{j1}, dc_{j2}, \ldots, dc_{jh}, \ldots\}$. Thus, the average order (AOrd) of the TDC $\mathrm{Tdc}_h$, $\mathrm{AOrd}_k(\mathrm{Tdc}_h)$ with $h = 1, 2, \ldots, H'$, is defined by averaging its orders over the core patients (Eq. 11), where $H'$ is the number of TDCs in cluster $C_k$ and $\mathrm{Ord}_{D_j}(\mathrm{Tdc}_h)$ denotes the order of $\mathrm{Tdc}_h$ in the diagnostic information $D_j$ of patient $j$. Generally, the smaller the AOrd of a TDC, the more likely it is to be a primary disease. Finally, after obtaining the TDCs and their AOrds, we define a sorting function to determine the TDCCoP, which is represented as
$$\mathrm{TDCCoP}_k = \mathrm{Sort}\big((\mathrm{Tdc}_1, \mathrm{AOrd}_k(\mathrm{Tdc}_1)), \cdots, (\mathrm{Tdc}_{H'}, \mathrm{AOrd}_k(\mathrm{Tdc}_{H'}))\big) = \{(\mathrm{Tdc}_1, \mathrm{Ord}'(\mathrm{Tdc}_1)), \cdots, (\mathrm{Tdc}_{H'}, \mathrm{Ord}'(\mathrm{Tdc}_{H'}))\}, \tag{12}$$
where $\mathrm{Ord}'(\mathrm{Tdc}_h)$ is the new order of $\mathrm{Tdc}_h$. For example, if cluster $C_k$ has only three TDCs (e.g., $\mathrm{Tdc}_1$, $\mathrm{Tdc}_2$, and $\mathrm{Tdc}_3$) and their AOrds are 5.3, 7.8, and 3.8, respectively, then after sorting, $\mathrm{TDCCoP}_k$ is $\{(\mathrm{Tdc}_3, 1), (\mathrm{Tdc}_1, 2), (\mathrm{Tdc}_2, 3)\}$. The pseudocode of the TDCCoP extraction method is presented in Algorithm 3.
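The extraction steps (occurrence probability, thresholding, average order, and re-ranking) can be sketched as below. This is a hedged illustration: it assumes that the threshold test is inclusive, that the AOrd of a code is averaged over the core patients who carry that code, and that ties are broken by code string. The function names and the toy patients are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def rank_by_aord(aord: Dict[str, float]) -> List[Tuple[str, int]]:
    """Sort TDCs by ascending average order and assign new orders 1..H'."""
    ranked = sorted(aord, key=lambda code: (aord[code], code))
    return [(code, rank) for rank, code in enumerate(ranked, start=1)]


def extract_tdccop(core_patients: Sequence[Dict[str, int]],
                   delta1: float = 0.3) -> List[Tuple[str, int]]:
    """core_patients: one dict per core patient mapping code -> order."""
    n = len(core_patients)
    orders = defaultdict(list)
    for diagnosis in core_patients:
        for code, order in diagnosis.items():
            orders[code].append(order)
    # Keep codes whose occurrence probability reaches delta1 (assumed >=),
    # averaging the order over the patients who carry the code (assumption).
    aord = {code: sum(o) / len(o)
            for code, o in orders.items() if len(o) / n >= delta1}
    return rank_by_aord(aord)


core = [
    {"038.9": 1, "584.9": 2, "518.81": 3},
    {"038.9": 2, "518.81": 1},
    {"038.9": 1, "276.2": 5},
    {"584.9": 1, "038.9": 3},
]
print(extract_tdccop(core))
print(rank_by_aord({"Tdc1": 5.3, "Tdc2": 7.8, "Tdc3": 3.8}))
```

The second call reproduces the three-code sorting example from the text, where the code with the smallest AOrd receives the new order 1.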

UD Identification Method
To identify a UD, categorizing the TDCCoP of each cluster reasonably according to the disease taxonomy is a critical step.
In this study, we propose a UD identification method, as shown in Figure 5. Specifically, for the $\mathrm{TDCCoP}_k$ of cluster $k$, we first visualize all TDCs in the reconstructed ICD ontology structure and mark their orders. Then we use the LCA method to categorize these codes, and define their LCAs and the corresponding orders. Furthermore, we calculate the CCoM using patient diagnostic information to select the optimal segmentation between primary diseases and complications. Finally, we regard the identified primary diseases as the UD.
First, we define the LCA co-occurrence pattern (LCoP) of $\mathrm{TDCCoP}_k$, denoted $\mathrm{LCoP}_k$, using visual analysis of the ICD ontology structure (Eq. 13). Then we calculate the order of each $d_i$ in $\mathrm{LCoP}_k$ as
$$\mathrm{Ord}(d_i) = \min_{d_i = \mathrm{LCA}(\mathrm{Tdc}_1, \mathrm{Tdc}_2, \cdots, \mathrm{Tdc}_m)} \big(\mathrm{Ord}'(\mathrm{Tdc}_1), \mathrm{Ord}'(\mathrm{Tdc}_2), \cdots, \mathrm{Ord}'(\mathrm{Tdc}_m)\big), \tag{14}$$
where $m$ is the number of TDCs in $\mathrm{LCoP}_k$ whose LCA is $d_i$. Additionally, considering the causal relation between $d_i$ and $d_j$ in $\mathrm{LCoP}_k$, we define the conditional co-occurrence probabilities $p_k(d_j/d_i)$ and $p_k(d_i/d_j)$, which together form $\mathrm{CCoM}_k$. If $p_k(d_j/d_i) > p_k(d_i/d_j)$ ($i \neq j$), then $d_j$ is more prone to occur after the occurrence of $d_i$; thus, $d_i$ is more likely to be a primary disease, whereas $d_j$ will become a complication, and vice versa.
After analyzing the precedence relation of all diagnosis codes in $\mathrm{LCoP}_k$ using $\mathrm{CCoM}_k$, we obtain the optimal segmentation between primary diseases and complications, and define the UD of cluster $k$, $\mathrm{UD}_k$, as the resulting set of primary diseases. The pseudocode of the UD identification method is presented in Algorithm 4.
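A conditional co-occurrence matrix of this kind can be sketched in plain Python. The sketch assumes that each core patient has already been reduced to the set of LCoP disease groups present in the record, and that $p(d_j/d_i)$ is estimated as the number of patients with both groups divided by the number of patients with $d_i$. The function names and the toy data are hypothetical.

```python
from typing import Dict, Sequence, Set


def conditional_cooccurrence(
    patients: Sequence[Set[str]], diseases: Sequence[str]
) -> Dict[str, Dict[str, float]]:
    """Return ccom[d_i][d_j] = p(d_j / d_i), estimated from co-occurrence counts."""
    count = {d: sum(d in p for p in patients) for d in diseases}
    ccom: Dict[str, Dict[str, float]] = {}
    for di in diseases:
        ccom[di] = {}
        for dj in diseases:
            if di == dj:
                continue
            both = sum(di in p and dj in p for p in patients)
            ccom[di][dj] = both / count[di] if count[di] else 0.0
    return ccom


def more_likely_primary(ccom: Dict[str, Dict[str, float]],
                        di: str, dj: str) -> str:
    """Rule from the text: if p(d_j/d_i) > p(d_i/d_j), d_i is the likelier primary."""
    return di if ccom[di][dj] > ccom[dj][di] else dj


# Toy data: group "A" appears in every record, group "B" in half of them.
records = [{"A", "B"}, {"A", "B"}, {"A"}, {"A"}]
ccom = conditional_cooccurrence(records, ["A", "B"])
print(ccom)
print(more_likely_primary(ccom, "B", "A"))
```

The helper treats equal probabilities as favoring the second argument, which is an arbitrary choice; the text relies on a joint reading of the matrix and the orders rather than on a single pairwise rule.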

UD Prediction Method
After identifying the UD, we further study the prediction task based on the health condition of a patient admitted to hospital, exploring the important features to assign the most probable UD to new patients. Figure 6 shows the proposed UD prediction method. First, we extract three categories of features using time series feature representation and text analysis methods, and fuse them into structured data for further prediction. Then, after data pre-processing and feature selection, we label all patients with a UD. Finally, we adopt classical prediction models to perform the UD prediction task.

Patient's Health Condition Representation
The health condition of a patient admitted to hospital includes demographic information, symptom information, and laboratory examination information, which play crucial roles for clinicians in diagnosing disease types, evaluating disease severity, and designing a treatment regimen.

Demographic Information
Demographic information mainly includes the date of birth, age, gender, admission type, marital status, occupation, and residence, and is denoted De.

Symptom Information
Symptom information is recorded in the chief complaint and history of present illness in the form of text, where the chief complaint describes the most distressing part of the disease process, including the main symptoms and onset time. The history of present illness describes the entire process for the patient after suffering from diseases, including occurrence, development, evolution, diagnosis, and treatment. Thus, the patient's symptom information can be represented as a set of symptom features, denoted Sy.

Laboratory Examination Information
Laboratory examination refers to an indirect judgment of the health condition as a result of measuring specific components of blood and body fluids using instruments. Laboratory indicators typically have the characteristics of a time series, particularly for patients in the ICU. Thus, we use the minimum value, maximum value, median value, mean value, and variance of each laboratory indicator to represent its time series, and denote the resulting features LE. Finally, we obtain the health condition of a patient admitted to hospital using a feature fusion method, that is, X = {De; Sy; LE}.
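The 5-tuple representation of a laboratory time series can be sketched with the Python standard library. The sketch assumes the population variance, since the text does not say whether the population or sample variance is used; the function names and the indicator name `lactate` are hypothetical.

```python
from statistics import mean, median, pvariance
from typing import Dict, Sequence


def five_tuple(values: Sequence[float]) -> Dict[str, float]:
    """Summarize one laboratory time series as (min, max, median, mean, var)."""
    return {
        "min": min(values),
        "max": max(values),
        "median": median(values),
        "mean": mean(values),
        "var": pvariance(values),  # population variance: an assumption
    }


def lab_features(series_by_indicator: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Flatten all indicators into named features such as 'lactate_var'."""
    features: Dict[str, float] = {}
    for name, values in series_by_indicator.items():
        for stat, value in five_tuple(values).items():
            features[f"{name}_{stat}"] = value
    return features


print(lab_features({"lactate": [1.0, 2.0, 3.0, 4.0]}))
```

The flattened dictionary can then be concatenated with the demographic and symptom features to form the fused feature vector X.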

Information Gain-Based Feature Selection
Before predicting the UD, to remove noisy data, reduce the complexity and dimensionality of the dataset, and achieve accurate results, it is essential to apply feature selection methods to identify useful features. Therefore, feature selection is an important step that improves the clarity of the data and decreases the training time of prediction models (4). In this study, we use the information gain (IG) method to measure the importance of features and eliminate irrelevant features. We compute the IG of feature $x_i$ as
$$\mathrm{IG}(x_i) = H(Y) - H(Y/x_i), \quad H(Y) = -\sum_{k=1}^{K} P(y_k)\log P(y_k), \quad H(Y/x_i) = -\sum_{x} P(x)\sum_{k=1}^{K} P(y_k/x)\log P(y_k/x), \tag{20}$$
where $x_i \in X$, $Y = \{\mathrm{UD}_1, \ldots, \mathrm{UD}_k, \ldots, \mathrm{UD}_K\}$, and $y_k \in Y$; $H(Y)$ and $H(Y/x_i)$ denote the information entropy and the conditional information entropy given feature $x_i$ for UD classification; $P(y_k)$ and $P(y_k/x)$ denote the probability of $y_k$ and the conditional probability of $y_k$ given a value $x$ of feature $x_i$, respectively; and $P(x)$ is the probability of that value. Thus, we obtain the important features as
$$X' = \{x_i \mid \mathrm{IG}(x_i) \geq \delta_2,\ x_i \in X\}, \tag{21}$$
where $\delta_2$ is a threshold defined in advance to differentiate the important and unimportant features using the IG method.
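Information gain-based selection can be sketched as follows for discrete features. The sketch assumes that continuous features (such as laboratory statistics) have already been binned, uses base-2 logarithms, and treats the threshold test as inclusive; the function names and the toy feature table are hypothetical.

```python
from collections import Counter
from math import log2
from typing import Dict, Hashable, List, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """H(Y) = -sum_k P(y_k) * log2 P(y_k)."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def information_gain(feature: Sequence[Hashable],
                     labels: Sequence[Hashable]) -> float:
    """IG(x_i) = H(Y) - H(Y / x_i) for a discrete (pre-binned) feature."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional


def select_features(X: Dict[str, Sequence[Hashable]],
                    y: Sequence[Hashable],
                    delta2: float = 0.005) -> List[str]:
    """Keep the features whose IG reaches the threshold delta2."""
    return [name for name, column in X.items()
            if information_gain(column, y) >= delta2]


y = [1, 1, 2, 2]  # toy UD labels
X = {"informative": ["a", "a", "b", "b"], "noise": ["a", "b", "a", "b"]}
print({name: information_gain(col, y) for name, col in X.items()})
print(select_features(X, y))
```

A feature that splits the two UD labels perfectly recovers the full label entropy, whereas a feature that is independent of the labels has an IG of zero and is dropped.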

Prediction Model Establishment
After obtaining the feature representation and UD result of each patient, we generate a standard dataset ($Y$ and $X'$) and establish a prediction model [$Y = f(X')$]. In this study, we apply five classifiers to achieve a UD prediction: logistic regression, decision tree, random forest, SVM, and extreme gradient boosting (XGBoost). In the prediction process, we adopt the Z-fold cross-validation (CV) method, which randomly partitions the initial dataset into $Z$ mutually exclusive subsets, and perform training and testing $Z$ times. We set $Z$ to 5 or 10. Then we compute the average CV error over the $Z$ folds (Eq. 22) to determine the prediction model, where $L_z$ and $m_z$ are the CV error and the number of patients of the $z$-th testing subset, and $y_j$ and $\hat{y}_j$ are the real and predicted UDs of the $j$-th patient, respectively. Additionally, we identify distinctive features of different unifying diagnoses by analyzing the feature importance ranking results.
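The Z-fold procedure can be sketched without any machine learning library. The sketch assumes a 0-1 loss (the misclassification rate) for the per-fold error, which the text does not state explicitly, and requires at least Z patients so that no fold is empty. The `fit` and `predict` callables stand in for any of the five classifiers; all names are hypothetical, and the majority-class baseline is included only to make the example runnable.

```python
import random
from collections import Counter
from typing import Any, Callable, List, Sequence


def z_fold_indices(n: int, z: int, seed: int = 0) -> List[List[int]]:
    """Randomly partition indices 0..n-1 into z mutually exclusive folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::z] for i in range(z)]


def cv_error(X: Sequence[Any], y: Sequence[Any],
             fit: Callable[[list, list], Any],
             predict: Callable[[Any, Any], Any],
             z: int = 10, seed: int = 0) -> float:
    """Average misclassification rate over z folds (0-1 loss: assumption)."""
    errors = []
    for test in z_fold_indices(len(y), z, seed):
        held_out = set(test)
        train = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        wrong = sum(predict(model, X[i]) != y[i] for i in test)
        errors.append(wrong / len(test))
    return sum(errors) / z


# Majority-class baseline used only as a stand-in classifier.
def fit_majority(X_train: list, y_train: list) -> Any:
    return Counter(y_train).most_common(1)[0][0]


def predict_majority(model: Any, x: Any) -> Any:
    return model


X = list(range(20))
y = [int(x >= 10) for x in X]
print(cv_error(X, y, fit_majority, predict_majority, z=5))
```

Comparing the average error of each candidate classifier against such a baseline is one simple way to see whether the selected features carry any signal about the UD.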

Parameter Setting
In our experiment, we set 5 parameters in advance. First, we set p coe in Eq. 5 to select the number of clusters, and then τ in Eq. 8, which is a similarity threshold to determine the number of core patients (i.e., |Core|). We discuss both parameters based on the stability of the experimental results. We set δ 1 in Eq. 10 to 0.3 to obtain TDCs, and δ 2 in Eq. 21 to 0.005 to select the important features. We set the last parameter Z in Eq. 22 to 10 to perform the 10-fold CV method. In particular, before UD prediction, we used data pre-processing methods, that is, data normalization and smoothing for imbalanced classes.

Selection of the Cluster Number
After applying the set similarity measure based on the ontology structure to the 4,418 sepsis patients, we obtained the similarity matrix S and used the AP clustering algorithm to divide all the patients into multiple groups. Figure 7 shows the distribution of the number of clusters under different values of the preference coefficient $p_{coe}$. Generally, the number of clusters decreased as the preference coefficient increased. The most stable number of clusters was two, obtained when $p_{coe}$ ranged from 0.018 to 0.032. Thus, we selected two clusters ($p_{coe}$ = 0.025) to identify TDCs and extract TDCCoPs from each cluster.

Stability Analysis of TDCs
After applying the AP clustering algorithm, we first divided the 4,418 sepsis patients into two clusters, where clusters 1 and 2 contained 1,391 and 3,027 patients with supports of 31.48% and 68.52%, respectively. Then we analyzed the stability of the TDCs in Eq. 10 using different numbers of core patients in Eq. 8 (|Core| = 100, 200, 400, 500, 800, and all patients), as shown in Figure 8 and Supplementary Figure 3.
From the distribution of TDCs in Figure 8 and Supplementary Figure 3, the stable range of core patients was from 400 to 800 (five codes in cluster 1 and 12 codes in cluster 2) because the number of TDCs and their distributions were approximately coincident across this range. Specifically, compared with the stable TDCs, more TDCs were identified when the number of core patients was set to 100 and 200 (14 codes in cluster 2), such as digital number 71 (276, disorders of fluid electrolyte and acid-base balance) and digital number 490 [V58.610, long-term (current) use of anticoagulants] (Supplementary Figures 3A,B). When we used all patients in the two clusters to extract TDCs, digital number 99 (995.91, sepsis) was additionally identified in cluster 1, and three codes (486, 276.2, and 250) were no longer identified in cluster 2 (Supplementary Figure 3E). Thus, in the next experiment, we set the number of core patients to 800 to extract the TDCCoPs.

TDCCoP Extraction From Each Cluster
Using the clustering results, we finally determined two clusters, selected 800 core patients from each cluster, and set $\delta_1$ to 0.3 in Eq. 10 to identify TDCs and extract TDCCoPs. Figure 9 shows the co-occurrence relation and AOrd of all TDCs in the two TDCCoPs, and Table 2 provides a detailed description of all TDCs in the two TDCCoPs.
To summarize, the experimental results indicated that there were 12 types of TDCs in the two TDCCoPs, where TDCCoP 1 and TDCCoP 2 had 5 and 12 codes, respectively. Specifically, the two TDCCoPs had similarities and differences. There were three similarities: (1) Five types of TDCs were the same, that is, 518.81, 38.9, 785.52, 584.9, and 995.92. (2) The AOrds of the TDCs within the same TDCCoP were similar; for example, the AOrds of four TDCs in TDCCoP 1 were all below 6, whereas those of the TDCs in TDCCoP 2 were over 7. (3) The TDCs 38.9 (septicemia), 785.52 (septic shock), and 995.92 (severe sepsis) had the highest occurrence probabilities in the two TDCCoPs. There were also three differences: (1) TDCCoP 2 contained more TDCs than TDCCoP 1. (2) The occurrence probabilities of TDCs in TDCCoP 1 were larger than those in TDCCoP 2. (3) The AOrds of the same TDC were different in the two TDCCoPs; for example, the AOrds of 518.81 (acute respiratory failure) in the two TDCCoPs were 4.145 and 7.665, respectively. Additionally, septicemia (38.9) was a high-frequency and primary disease in sepsis patients; it is a life-threatening complication that can occur when bacteria from another infection enter the blood and spread throughout the body.
Furthermore, using Eq. 12 and Algorithm 3, we extracted the TDCCoPs of the two clusters described in Table 2.

UD Identification Based on TDCCoPs
After obtaining the TDCCoPs, we visualized all the TDCs in the ICD-9 ontology structure. First, we categorized them using the LCA method to identify LCoPs using Eq. 13. Consider TDCCoP 2 as an example. The visualization result is shown in Figure 10. We identified LCoP 2 with seven types of diseases, shown in light green, and computed the orders of the new diseases using Eqs. 13 and 14: diseases of the genitourinary system (580-629, order: 1), septicemia (38.9, order: 2), diseases of the respiratory system (460-519, order: 3), diseases of the circulatory system (390-459, order: 5), septic shock (785.52, order: 8), endocrine, nutritional, and metabolic diseases, and immunity disorders (240-279, order: 9), and severe sepsis (995.92, order: 10).
Then we calculated the CCoM 2 of the LCoP 2 based on the diagnostic information of the 800 core patients in cluster 2, as described in Table 3. First, the conditional probabilities p({390-459, 995.92}/{580-629, 38.9, 460-519}), colored red, were significantly larger than the values p({580-629, 38.9, 460-519}/{390-459, 995.92}), colored blue, which indicates that diseases of the genitourinary system (580-629, order: 1), septicemia (38.9, order: 2), and diseases of the respiratory system (460-519, order: 3) were more likely to be primary diseases, whereas diseases of the circulatory system (390-459, order: 5) and severe sepsis (995.92, order: 10) were probably complications. Second, the orders of septic shock (785.52, order: 8) and endocrine, nutritional, and metabolic diseases, and immunity disorders (240-279, order: 9) were also larger than those of the first three diseases. Thus, the boundary between diseases of the respiratory system (460-519, order: 3) and diseases of the circulatory system (390-459, order: 5) was likely to be the optimal segmentation between primary diseases and complications, and the first three diseases were considered to be the UD (UD 2) of cluster 2.

UD Prediction Based on Patient Admission Information
After applying feature fusion and IG-based feature selection, we trained five classifiers to predict a UD from patient admission information and to identify the important features of the constructed prediction models. Figure 11 shows the classification performance of the proposed UDIPM, including the area under the ROC curve (AUC), accuracy (Acc), precision (Pre), recall (Rec), and F1-score (F1), and Figure 12 presents the 10 most important features identified using the random forest method (Supplementary Figure 4). The experimental results indicated that the proposed UDIPM achieved good prediction performance: the AUC values were all above 0.8, except for the decision tree method. Similarly, XGBoost achieved the best Acc, Pre, Rec, and F1 among all classifiers, at ∼80%, followed by random forest, SVM, and logistic regression, whereas the decision tree was last, at ∼66%. Consider the random forest as an example. We obtained the feature importance results to better understand the prediction model. First, we found that demographic information (i.e., age) and laboratory examination information were more important than symptom information. Second, some disease severity indicators were very important, such as SAPS and SAPS-II. Finally, the variance (i.e., Var) of the laboratory examination indicators was more important than the mean, median, minimum, and maximum values. To summarize, the proposed UDIPM not only identified a UD from patient diagnostic information but also predicted a UD based on the health condition of a patient admitted to hospital.

FIGURE 12 | Ten most important features using the random forest method.
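As an illustration of the IG criterion used for feature selection, the sketch below computes the information gain of a discrete feature with respect to a class label. It assumes the standard entropy-based definition and already discretized feature values; the exact variant, the discretization, and any threshold used in the study are not reproduced here, and the toy features are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(feature_values, labels):
    """IG(Y; X) = H(Y) - H(Y | X) for one discrete feature X."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(
        (len(group) / total) * entropy(group) for group in groups.values()
    )
    return entropy(labels) - conditional


# Toy example: four patients, a binary UD label, and two invented features.
labels = [1, 1, 0, 0]
informative = ["high", "high", "low", "low"]  # separates the classes perfectly
uninformative = ["a", "b", "a", "b"]          # independent of the label

ig_informative = information_gain(informative, labels)
ig_uninformative = information_gain(uninformative, labels)
print(ig_informative, ig_uninformative)
```

Ranking the fused features by this score and keeping the highest-scoring ones is one common way to apply IG before training the classifiers.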

DISCUSSION
In this study, we conducted various experiments to demonstrate the effectiveness of the proposed UDIPM when compared with other methods. Specifically, the proposed UDIPM fused three components: a set similarity measure method, a clustering algorithm, and classification algorithms. For the set similarity measure method, we selected Dice, Jaccard, cosine, and overlap as comparative methods, and used SS in Eq. 7 as a performance metric based on the AP clustering results. For the classification algorithms, we selected logistic regression, decision tree, random forest, SVM, and XGBoost. Additionally, we used AUC, Acc, Pre, Rec, and F1 as performance metrics to measure the effectiveness of the classification algorithms. The evaluation methods and metrics are described in detail in Table 4. The detailed experimental results are shown in Figure 13 and Table 5. Specifically, for the set similarity measure, we first selected the optimal number of clusters using the AP clustering algorithm, and then computed the SS value based on the clustering results (Algorithm 2). The experimental results indicated that the optimal numbers of clusters for the four FMs were 2, 2, 2, and 3 (Supplementary Figure 5), and the proposed UDIPM achieved the second-highest SS value of 1997.86; it was only below that of FM4 (Figure 13). The reason is that the SS value increased as the cluster number increased. Interestingly, although the similarities of FM1 and FM2 were different, they had the same clustering results.
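For reference, the four comparative set similarity measures can be written in a few lines. The sketch below uses their standard definitions on plain sets of diagnosis codes; it does not embed the disease ontology structure, so it does not reproduce the similarity measure proposed in this study, and the two toy code sets are invented.

```python
import math


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|"""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a, b):
    """2 |A ∩ B| / (|A| + |B|)"""
    size = len(a) + len(b)
    return 2 * len(a & b) / size if size else 0.0


def cosine(a, b):
    """|A ∩ B| / sqrt(|A| |B|), the set form of cosine similarity."""
    return len(a & b) / math.sqrt(len(a) * len(b)) if a and b else 0.0


def overlap(a, b):
    """|A ∩ B| / min(|A|, |B|)"""
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0


# Two toy patients represented as sets of ICD-9 codes.
patient_a = {"38.9", "518.81", "995.92"}
patient_b = {"38.9", "995.92", "584.9", "785.52"}

similarities = {
    "jaccard": jaccard(patient_a, patient_b),
    "dice": dice(patient_a, patient_b),
    "cosine": cosine(patient_a, patient_b),
    "overlap": overlap(patient_a, patient_b),
}
for name, value in similarities.items():
    print(f"{name}: {value:.3f}")
```

All four measures depend only on the sizes of the two sets and of their intersection, so they treat every pair of distinct codes as equally dissimilar regardless of how close the codes are in the ICD-9 hierarchy.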
For the classification results obtained using the 10-fold CV method in Table 5, the proposed method achieved the second-highest performance using logistic regression, random forest, and SVM, and the third-highest performance using the decision tree and XGBoost. More importantly, all metrics of the proposed UDIPM were higher than those of FM4. Therefore, from the overall performance evaluation in combination with the set similarity measure, clustering, and classification, the UDIPM was an effective method for identifying and predicting a UD from EMRs.
Further, for all fusion methods, the performance comparison indicated that both XGBoost and random forest were superior to the other classification algorithms in terms of Acc, Pre, Rec, F1, and AUC. The main reason is that XGBoost and random forest are ensemble learning algorithms that combine multiple classifiers, which can often achieve better generalization performance than a single classifier. Specifically, XGBoost is an improved algorithm based on the gradient boosting decision tree, which can efficiently construct boosted trees and run in parallel. XGBoost works by combining a set of weaker machine learning algorithms to obtain an improved machine learning algorithm as a whole (36). XGBoost has been shown to perform exceptionally well in a variety of tasks in bioinformatics and medicine, such as the prediction of lysine glycation sites for Homo sapiens (37), the diagnosis of chronic kidney disease (38), and the risk prediction of incident diabetes (39). The random forest classifier is also an ensemble algorithm, which combines multiple decorrelated decision trees, each built on a subset of the data samples (40). In general, random forest shows better performance in disease diagnosis than many single classifiers (41).

CONCLUSION
In this study, we proposed a UDIPM embedding the disease ontology structure to identify and predict a UD from EMRs to assist better coding integration of diagnosis in the ICU. We discussed many critical issues, including a formal representation of multi-type patient information, symptom feature extraction from an unstructured discharge report, ICD ontology structure reconstruction for semantic relation embedding, a multi-level set similarity measure for generating a patient similarity matrix, selection of the number of clusters using AP clustering, stability of the extracted TDCs and TDCCoP from each cluster, optimal split line determination for identifying a UD based on visual analysis and the CCoM of the LCoP, feature fusion and selection using the IG-based method, and the performance evaluation of UD prediction using five classifiers. We verified the proposed UDIPM on 4,418 sepsis patients in the ICU extracted from the MIMIC-III database. The results showed that the highest stability cluster number and largest range of TDCs were 2 and 400-800, respectively; the UD of cluster 2 was diseases of the genitourinary system (580-629, order: 1), septicemia (38.9, order: 2), and diseases of the respiratory system (460-519, order: 3); and the best AUC, Acc, Pre, Rec, and F1 of the UD prediction were 0.866, 0.795, 0.803, 0.795, and 0.794, respectively, which were better than those of other fusion methods from the overall view of SS and prediction performance.

STUDY LIMITATIONS
The proposed UDIPM can identify and predict a UD from EMRs; however, there remain several topics for future work. First, the order of diagnosis codes should be considered in the patient similarity measure by way of different weights because of the importance of primary diseases. Second, some state-of-the-art feature selection and classification models should be implemented to improve the prediction accuracy of the UD. Additionally, we hope to make progress on many of the valuable suggestions made by clinicians regarding our implemented method and experimental results.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. These data can be found at: https://mimic.mit.edu/docs/iii/tables/.