Heterogeneous Information Network-Based Patient Similarity Search

Patient similarity search is a fundamental task in artificial intelligence-assisted medical services. It benefits medical diagnosis, for example by supporting accurate predictions for similar diseases and recommendations of personalized treatment plans. Existing patient similarity search methods retrieve medical events associated with patients from Electronic Health Record (EHR) data and map them to vectors. Patient similarity is then measured by calculating the similarity or dissimilarity between the corresponding vectors of medical events. However, the obtained vectors tend to be high-dimensional and sparse, which makes it hard to calculate patient similarity accurately. In addition, most existing methods cannot capture the time information in the EHR, which hinders analysis of the influence of time factors on patient similarity search. To solve these problems, we propose a patient similarity search method based on a heterogeneous information network. On the one hand, the proposed method uses a heterogeneous information network to connect patients, diseases, and drugs, which solves the problem of representing mixed information about patients, diseases, and drugs as vectors. Our method then measures the similarity between patients by calculating the similarity between nodes in the heterogeneous information network, which addresses the challenges caused by high-dimensional and sparse vectors. On the other hand, the proposed method encodes time information into an annotated heterogeneous information network, which addresses the inaccuracy caused by ignoring time information when measuring patient similarity. Experiments show that our method outperforms the compared baseline methods.


INTRODUCTION
Patient similarity search has been identified as one of the key techniques in artificial intelligence (AI) medical services; it benefits medical diagnosis, for example by supporting accurate predictions for similar diseases and recommendations of personalized treatment plans (Sharafoddini et al., 2017). Generally speaking, patient similarity analysis involves selecting certain clinical records as features of patients in a specific medical environment, then quantitatively analyzing the distance between them. A proper similarity measure should support various downstream applications, such as personalized medicine recommendation (Lee et al., 2015), target patient retrieval (Sun et al., 2012), medical diagnoses (Gottlieb et al., 2013), and cohort studies (Che et al., 2017).
The wide availability of Electronic Health Records (EHRs) makes it possible to quickly and accurately calculate the similarity between patients. Many similarity learning methods have been proposed on healthcare datasets (Tsevas and Iakovidis, 2011; Wang et al., 2012b; Barkhordari and Niamanesh, 2015; Wang and Sun, 2015; Sha et al., 2016; Zhan et al., 2016; Sharafoddini et al., 2017; Huai et al., 2018; Suo et al., 2018). Existing methods have successfully derived similarity measures from EHR data by mapping the medical events into vector spaces. However, EHRs contain a variety of data (diagnostics, drugs, etc.) and a large number of medical events, which usually results in high-dimensional embedding vectors.
A heterogeneous information network (HIN) contains rich structural and semantic information, and it can effectively address the problems caused by high-dimensional and sparse embedding vectors. For calculating the similarity of patients, the diseases of patients and the drugs they use provide essential information. The patient's disease is critical to the doctor's clinical decision. At the same time, the patient's disease is largely determined by the patient's clinical symptoms and clinical indicators, so the disease can be regarded as a comprehensive reflection of clinical indicators. The medicine is the doctor's response to the patient's disease and symptoms, and is the final manifestation of the doctor's clinical decision. Therefore, it is natural to connect patients, diseases, and drugs to form an HIN.
However, there are many duplicate diseases and drugs in the EHRs, meaning that if we were to use classic HIN modeling techniques with the above schema, we would lose the correlation information between patients and drugs. To address this problem, we propose a kind of HIN with annotation: in links connecting diseases and drugs, we add an annotation of patient information to enrich the original network with the information between patients and drugs. We call it an annotated HIN. On the annotated HIN, we propose a novel node similarity measure, S-PathSim, to calculate patient similarity. As a node similarity measure, S-PathSim enjoys some good properties, such as symmetry and self-maximum.
On the other hand, temporal information is crucial to understand the dynamics of medical expressions. To leverage the essential temporal information for patient similarity evaluation, we propose to use N-disease to encode temporal information into annotated HINs. N-disease is inspired by the N-grams model in natural language processing. Its basic idea is to arrange the patients' diseases into time series according to the time they are developed, sequentially collect the N-grams from the disease sequences, and then replace the disease object with the disease N-grams in the annotated HIN. The collected N-grams from the disease time series are called N-diseases.
Finally, two patient similarity search methods, MBH (method based on annotated HIN) and MBHT (method based on annotated HIN and temporal information), were defined according to S-PathSim and N-disease.
The remainder of this paper is structured as follows. The second section reviews related work on patient similarity analysis and heterogeneous information networks, while the third section provides some preliminaries on HINs and shows the limitation of HINs for calculating patient similarity. In the fourth section, we introduce our method in detail. The experimental results and comparative analysis are shown in section five. Finally, the last section summarizes this paper and discusses some possible avenues for future research.

RELATED WORK
In this section, we review related work on evaluating patient similarity and on heterogeneous information networks.
Studying patient similarity has practical significance in many applications (Lee et al., 2015; Li et al., 2015). Ng et al. provided a personalized predictive healthcare model by matching clinically similar patients with a locally supervised metric learning measure (Ng et al., 2015). An integrated method for personalized modeling (IMPM) was proposed to provide personalized treatment and personalized drug design (Kasabov and Hu, 2010). A data-driven clinical decision support system has also been combined with patient similarity (Xia et al., 2019).
At present, there are many studies on calculating the similarity of patients. Zhang et al. combined patient similarity and drug similarity analysis and proposed a heterogeneous label propagation method to identify which drug is likely to be effective for a given patient. Chan et al. proposed a patient similarity algorithm named SimSvm that uses a support vector machine to weight the similarity measures (Chan et al., 2010). Wang et al. proposed a patient similarity based disease prognosis strategy named SimProX (Wang et al., 2012a). This model uses a local spline regression based method to embed patient events into an intrinsic space, and then measures patient similarity by the Euclidean distance in the embedded space. However, these methods do not leverage temporal information to evaluate patient similarities, which limits their effectiveness. Cheng et al. (2016) took temporal information into consideration and proposed an adjustable temporal fusion scheme using CNN-extracted features. This method is a supervised model, but labeled data are not easy to obtain, which limits its use, and the method lacks interpretability. Zhu et al. proposed a method to address the problems of high-dimensional vectors and time series (Zhu et al., 2016). They embed medical events from EHRs into fixed-length vectors, but fixed-length vectors struggle to capture complete medical event information.
As mentioned above, the current method of measuring patient similarity is limited, and a better method is needed to calculate patient similarity.
Since Sun et al. proposed the concept of HIN (Sun and Han, 2010), and subsequently the meta path concept (Sun and Han, 2011), HIN analysis has rapidly become a hot topic in the fields of data mining, databases, and information retrieval. He et al. incorporated temporal information for similarity search in HINs by assigning different weights to the paths built at different times (He et al., 2014). But this method is not suitable for the annotated HIN proposed in this paper. In order to evaluate the relevance of different-typed objects, Shi et al. (2014) proposed HeteSim to measure the relevance of any object pairs under arbitrary meta paths. As an adaptation of HeteSim, LSH-HeteSim was proposed to mine drug-target interactions in heterogeneous biological networks where drugs and targets are connected with complicated semantic paths. In order to overcome the high computation and memory demands of HeteSim, Meng et al. (2014) proposed the AvgSim measure, which evaluates the similarity score through two random walk processes along the given meta path and the reversed meta path, respectively. In order to overcome the problem that a meta path can only express simple information, Cheng et al. (2017) proposed the meta structure to measure the similarity between objects. To date, HINs have been widely used in other fields (Wang et al., 2020; Zhang et al., 2020).
HINs rarely result in high-dimensional vectors, and most similarity calculation methods based on HINs have good interpretability. However, a classic HIN cannot be directly applied to patient similarity calculation, so in this paper, we propose an improved model, the annotated HIN, which is well suited to calculating the similarity of patients.

PRELIMINARIES
In this section, as preliminaries, we detail the HIN and its limitation in measuring patient similarity.

HIN
An information network is defined as a directed graph G = (V, E) with an object type mapping function ψ : V → A and a link type mapping function ϕ : E → R, in which each object v ∈ V belongs to a particular object type ψ(v) ∈ A while each link e ∈ E belongs to a particular relation ϕ(e) ∈ R. Different from the traditional network definition, we explicitly distinguish the object types and relationship types in these networks. When the types of objects |A| > 1 or the types of relations |R| > 1, the network is referred to as a heterogeneous information network; otherwise, it is a homogeneous information network.
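The definition above can be sketched as a small data structure. This is only an illustrative sketch: the class name `InformationNetwork`, its method names, and the relation labels `diagnosed_with` and `treated_by` are assumptions made for the example, not part of the original method.

```python
from dataclasses import dataclass, field


@dataclass
class InformationNetwork:
    """Directed graph G = (V, E) with an object type map (psi) and a link type map (phi)."""
    psi: dict = field(default_factory=dict)  # object v -> object type in A
    phi: dict = field(default_factory=dict)  # link (u, v) -> relation type in R

    def add_object(self, v, obj_type):
        self.psi[v] = obj_type

    def add_link(self, u, v, rel_type):
        if u not in self.psi or v not in self.psi:
            raise KeyError("both endpoints must be typed objects")
        self.phi[(u, v)] = rel_type

    def is_heterogeneous(self):
        # Heterogeneous when |A| > 1 or |R| > 1, otherwise homogeneous.
        return len(set(self.psi.values())) > 1 or len(set(self.phi.values())) > 1


g = InformationNetwork()
g.add_object("patient_231", "P")
g.add_object("atherosclerotic heart disease", "D")
g.add_object("clopidogrel", "M")
g.add_link("patient_231", "atherosclerotic heart disease", "diagnosed_with")
g.add_link("atherosclerotic heart disease", "clopidogrel", "treated_by")
print(g.is_heterogeneous())
```

With three object types (P, D, M) and two relation types, the network in this example is heterogeneous.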

Limitation of HIN
HIN can link patients, diseases, and drugs. As shown in Figure 1, we can get the network schema of the patient HIN. P, D, and M represent patient, disease, and medicine, respectively.
There may be many kinds of drugs to treat one disease, and one drug can also cure many diseases, which leads to some incorrect information in the traditional heterogeneous information network when connecting patients, diseases, and drugs. We use a specific example below to illustrate this problem. Table 1 presents three inpatient records for two patients, all of which were diagnosed with the same disease; patient 231 was hospitalized twice. From the data in Table 1, the HIN in Figure 2 is obtained. However, the HIN shown in Figure 2 has two problems. First of all, we need to know that patient 231 has been hospitalized twice, but this information cannot be obtained through Figure 2. Second, patient 231 does not use perindopril in treatment, but the information we get from the heterogeneous information network is that there is a relation between patient 231 and perindopril, which leads to the incorporation of misleading information. Therefore, traditional HIN-based measurement methods are not suitable for our problem.
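The two problems can be reproduced with a few lines of code. The sketch below is an illustration under assumptions: the drug list of the second admission of patient 231 is hypothetical (the text only states that clopidogrel was used in both admissions), and the helper names are invented for the example.

```python
from collections import defaultdict

DISEASE = "atherosclerotic heart disease"

# One tuple per admission: (patient, disease, drugs).
records = [
    ("231", DISEASE, ["atorvastatin", "bisoprolol", "clopidogrel"]),
    ("231", DISEASE, ["clopidogrel"]),  # hypothetical drug list for the second admission
    ("200", DISEASE, ["aspirin", "atorvastatin", "perindopril"]),
]

# A classic HIN stores only untyped-weight links: patient-disease and disease-drug.
patient_disease = defaultdict(set)
disease_drug = defaultdict(set)
for patient, disease, drugs in records:
    patient_disease[patient].add(disease)
    disease_drug[disease].update(drugs)


def reachable_drugs(patient):
    """Drugs connected to a patient through any P-D-M path in the classic HIN."""
    return {m for d in patient_disease[patient] for m in disease_drug[d]}


def actual_drugs(patient):
    """Drugs the patient really received according to the records."""
    return {m for p, _, drugs in records if p == patient for m in drugs}


spurious = reachable_drugs("231") - actual_drugs("231")
print(sorted(spurious))               # drugs wrongly associated with patient 231
print(len(patient_disease["231"]))    # the two admissions collapse into a single link
```

Patient 231 ends up connected to perindopril through the shared disease node, and the two admissions are no longer distinguishable.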

Annotated HIN
As mentioned in section 3, the classic HIN is not suitable for our problem.
In order to measure patient similarity, we propose a new graph model: the annotated HIN. Definition 1. Annotated Heterogeneous Information Network. An annotated HIN is a special heterogeneous information network G = (V, E, C). In the annotated HIN, one or more link types are annotated by < key, value > pairs. For each < key, value > pair, key is an object of a specific type, ψ(key) ∈ A, while value records the number of links.
As mentioned above, we regard the set of < key, value > pairs as the annotations of a heterogeneous information network, represented by C. The number of key-value pairs in the set is referred to as the length of the annotation, which is represented by L. Annotations can be added to one or more link types of the classic heterogeneous information network. These annotations record the source and number of connections and can thus represent more information. Figure 3 is a real example of an annotated HIN, which we name the patient-annotated HIN. It can be seen that the link to "Clopidogrel" has the annotation C_Clopidogrel = {< 231, 2 >}, and that the annotation length is L = 1. We can interpret this as follows: patient 231 was diagnosed with atherosclerotic heart disease in both hospitalizations, and clopidogrel was used in both treatments. Moreover, there is no record of patient 200 in the annotation, so it can be concluded that clopidogrel was not used in the treatment of patient 200. In this way, the two problems described in the previous section are solved.
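As a rough sketch of how such annotations could be built from admission records, the code below counts, for every disease-drug link, how many admissions of each patient used that drug. The function name `build_annotations` and the second admission's drug list are assumptions for illustration only.

```python
from collections import Counter, defaultdict

DISEASE = "atherosclerotic heart disease"

records = [
    ("231", DISEASE, ["atorvastatin", "bisoprolol", "clopidogrel"]),
    ("231", DISEASE, ["clopidogrel"]),  # hypothetical drug list for the second admission
    ("200", DISEASE, ["aspirin", "atorvastatin", "perindopril"]),
]


def build_annotations(records):
    """Map each (disease, drug) link to its annotation {key: value} = {patient: count}."""
    annotations = defaultdict(Counter)
    for patient, disease, drugs in records:
        for drug in set(drugs):  # count a drug at most once per admission
            annotations[(disease, drug)][patient] += 1
    return annotations


annotations = build_annotations(records)
clopidogrel = annotations[(DISEASE, "clopidogrel")]
print(dict(clopidogrel), "L =", len(clopidogrel))
```

The annotation on the clopidogrel link contains only patient 231 with value 2, so the annotation length is 1, and patient 200 is absent from it.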
For a given annotated HIN, in order to help readers better understand the object type, link type, and annotation type in the network, we provide its meta-description.
Definition 2. AHIN Network Schema. The network schema of AHIN is recorded as SG = (A, R, I). This is a meta template of AHIN G = (V, E, C). It has object type mapping ψ(v) ∈ A, relation type mapping ϕ(e) ∈ R, and annotation type mapping θ : C → I. It is defined on object type set A, relation type set R, and annotation type set I.

Weighted Meta Path and S-PathSim
The weighted meta path, designed to capture complex relationships between two annotated HIN objects, is based on the network expansion structure, which is defined as follows.
Definition 3. Network Expansion Structure. A network expansion structure S is a set of directed weighted graphs, defined on an annotated HIN schema SG = (A, R, I). It expands the annotated heterogeneous information network into an easy-to-process format. Formally, $S = (D_1, D_2, \ldots, D_n)$, where each $D_i = (V_i, E_i)$ is a directed weighted graph with $V_i$ and $E_i$ being the sets of nodes and edges, respectively. Any edge $e \in E_i$ is associated with a weight $w(e)$, whose default value is 1.
Below we use an example to introduce the expansion of the network structure. Figure 5A demonstrates the expansion from a given annotated heterogeneous information network into the network expansion structure. There are annotations $\{\langle P_1, 2\rangle, \langle P_2, 3\rangle\}$ and $\{\langle P_1, 3\rangle\}$ in graph G. The key-value pairs $\langle P_1, 2\rangle$ and $\langle P_1, 3\rangle$ correspond to the entity $P_1$, so we get the graph $D_1$, whose corresponding edge weights are 2 and 3, respectively. The key-value pair $\langle P_2, 3\rangle$ corresponds to the entity $P_2$, so we get the graph $D_2$, whose corresponding edge weight is 3. All other edges have the default weight 1.
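The expansion step can be sketched as a regrouping of annotations by key. In this illustrative sketch the function name `expand` is hypothetical, and placing the second annotation on a link to a drug called `M2` is an assumption, since the text does not name that link's endpoint.

```python
from collections import defaultdict


def expand(annotations):
    """Split an annotated HIN into one weighted graph per annotation key.

    Returns {key: {(disease, drug): weight}}. Links without an annotation
    (for example patient-disease links) keep the default weight 1 and are
    therefore not stored explicitly.
    """
    graphs = defaultdict(dict)
    for link, annotation in annotations.items():
        for key, value in annotation.items():
            graphs[key][link] = value
    return dict(graphs)


annotations = {
    ("D1", "M1"): {"P1": 2, "P2": 3},
    ("D1", "M2"): {"P1": 3},  # the endpoint M2 is assumed for illustration
}
structure = expand(annotations)
print(structure["P1"])  # weights of graph D_1
print(structure["P2"])  # weights of graph D_2
```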
After introducing the network expansion structure, we propose the concept of weighted meta path.
Definition 4. Weighted Meta Path. A weighted meta path P is a path defined on the network schema SG = (A, R, I) and based on the network expansion structure $S = (D_1, D_2, \ldots, D_n)$. A weighted meta path is denoted in the form $A_1 \xrightarrow{R_1,\, w(e_1)} A_2 \xrightarrow{R_2,\, w(e_2)} \cdots \xrightarrow{R_l,\, w(e_l)} A_{l+1}$, which defines a composite relation between object types $A_1$ and $A_{l+1}$, where $R_i$ represents the relationship between $A_i$ and $A_{i+1}$, and $w(e_i)$ represents the weight of that relationship.
Just like a meta path, a weighted meta path P is said to be symmetric if the relationship it defines is symmetric. Each weighted meta path has a template. If there are no multiple relationships between the same object types, we can use the type names to represent the template of the weighted meta path: $P = (A_1 A_2 \ldots A_{l+1})$. As shown in Figure 5B, $P_1$ and $P_2$ have the same template PDMDP, and both are symmetric weighted meta paths.
When $A_{l+1} = A'_1$, the weighted meta paths $P = (A_1 A_2 \ldots A_{l+1})$ and $P' = (A'_1 A'_2 \ldots A'_{l+1})$ are concatenable, so that a new weighted meta path $(P P') = (A_1 A_2 \ldots A_{l+1} A'_2 \ldots A'_{l+1})$ can be obtained. For each weighted meta path P, there is a score S(P), which is the product of the weights of the relationships in P. For example, for the weighted meta path $P_1$, $S(P_1) = 1 \times 2 \times 3 \times 1 = 6$. In fact, S(P) represents the weight of the relationship between the first and last objects in the weighted meta path P, and can also be understood as the number of connection paths between the two objects. As shown in Figure 6, for the weighted meta path $P_3$, $W(\{D_1, M_1\}_{P_3}) = 2$ represents that patient $P_1$ has used the drug $M_1$ twice because of disease $D_1$. Therefore, $S(P_3) = W(\{P_1, D_1\}_{P_3}) \times W(\{D_1, M_1\}_{P_3}) = 1 \times 2 = 2$, so the number of connection paths between $P_1$ and $M_1$ is 2. In the same way, $S(P_4) = W(\{M_1, D_1\}_{P_4}) \times W(\{D_1, P_2\}_{P_4}) = 3 \times 1 = 3$, so the number of connection paths between patient $P_2$ and drug $M_1$ is 3. $P_1$ can be obtained by concatenating $P_3$ and $P_4$, so the number of connection paths between patient $P_1$ and patient $P_2$ is $S(P_1) = S(P_3) \times S(P_4) = 6$.
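The arithmetic of path scores and concatenation can be checked with a short sketch. The function name `score` is an assumption; the weights are the ones given in the example above.

```python
from math import prod


def score(weights):
    """Score S(P) of a weighted meta path: the product of its edge weights."""
    return prod(weights)


s_p3 = score([1, 2])        # P1 -> D1 -> M1
s_p4 = score([3, 1])        # M1 -> D1 -> P2
s_p1 = score([1, 2, 3, 1])  # P1 -> D1 -> M1 -> D1 -> P2

# The score of a concatenated path equals the product of the scores of its parts.
print(s_p3, s_p4, s_p1, s_p1 == s_p3 * s_p4)
```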
Based on the annotated HIN and weighted meta path, we propose a new measure, named S-PathSim.
Definition 5. S-PathSim. Given a symmetric weighted meta path P, S-PathSim between two objects x and y of the same type is:
$$s(x, y) = \frac{2 \times S_{sum}(P_{x \to y})}{S_{sum}(P_{x \to x}) + S_{sum}(P_{y \to y})}$$
where $S_{sum}(P_{x \to y})$ is the sum of the scores of the weighted meta paths between x and y, $S_{sum}(P_{x \to x})$ is that between x and x, and $S_{sum}(P_{y \to y})$ is that between y and y. If there are two weighted meta paths $P_a$ and $P_b$ between x and y, with $S(P_a) = 4$ and $S(P_b) = 3$, then $S_{sum}(P_{x \to y}) = S(P_a) + S(P_b) = 7$.
Take the patients in Table 1 as an example. Patient 231 has two admissions. During his first hospitalization, he developed arteriosclerotic heart disease and received medicines including atorvastatin, bisoprolol, and clopidogrel. Patient 200 also developed arteriosclerotic heart disease and received aspirin, atorvastatin, and perindopril. From this information, we can get a heterogeneous information network G as shown in Figure 5A. According to Definition 5, $S_{sum}(\text{patient231} \to \text{patient200}) = 6$, $S_{sum}(\text{patient231} \to \text{patient231}) = 2^2 + 3^2 = 13$, and $S_{sum}(\text{patient200} \to \text{patient200}) = 3^2 = 9$; therefore $s(\text{patient231}, \text{patient200}) = 2 \times 6 / (13 + 9) = 6/11$.
As mentioned before, S(P) can be understood as the number of connecting paths between the first and last objects in the weighted meta path P. If there are more connection paths between two objects, we can consider them to have a higher similarity. However, using the number of paths alone as the criterion biases the result toward high-visibility objects. Therefore, we use the number of connection paths from each of the two objects to itself as a balance factor. This idea has been applied in PathSim; we extend it to the annotated HIN here and propose S-PathSim.
Properties of S-PathSim:
• (1) Symmetric: s(x, y) = s(y, x). Considering the semantics of $S_{sum}(P_{x \to y})$, it is easy to see that $S_{sum}(P_{x \to y}) = S_{sum}(P_{y \to x})$, so s(x, y) = s(y, x).
• (2) Self-maximum: s(x, y) ∈ [0, 1], and s(x, x) = 1.
The weighted meta path templates mn and nm can be concatenated into a new weighted meta path template mnm. Let $mnm_i$ be the ith path of the template mnm; as mentioned before, $S(mnm_i) = S(mn_i) \times S(nm_i)$. Denote the kth weighted meta path of the template mn by $a_k$ and the kth weighted meta path of the template nm by $b_k$. Then $S_{sum}(P_{x \to y}) = \sum_{k=1}^{p} S(a_k) \times S(b_k)$; similarly, $S_{sum}(P_{x \to x}) = \sum_{k=1}^{q} S(a_k)^2$ and $S_{sum}(P_{y \to y}) = \sum_{k=1}^{o} S(b_k)^2$. Since $2 \times S(a_k) \times S(b_k) \le S(a_k)^2 + S(b_k)^2$, it follows that s(x, y) ≤ 1. It is also easy to see that s(x, y) ≥ 0, so s(x, y) ∈ [0, 1] and s(x, x) = 1. In the above formulas, p represents the number of weighted meta paths between x and y, q represents the number of weighted meta paths between x and x, and o represents the number of weighted meta paths between y and y.
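A minimal sketch of S-PathSim for the PDMDP template follows, using the weights of the Figure 5A example. It assumes that patient-disease edges carry the default weight 1, so each patient is represented only by the weights on its (disease, drug) links; the function names and the drug label `M2` are hypothetical.

```python
from fractions import Fraction


def s_sum(gx, gy):
    """Sum of PDMDP path scores between two patients.

    gx and gy map (disease, drug) links to the patient's annotated weight.
    A path exists for every link both patients share; its score is the
    product of the two weights (patient-disease edges have weight 1).
    """
    return sum(w * gy[link] for link, w in gx.items() if link in gy)


def s_pathsim(gx, gy):
    """S-PathSim: twice the cross score, balanced by the two self scores."""
    denom = s_sum(gx, gx) + s_sum(gy, gy)
    return Fraction(2 * s_sum(gx, gy), denom) if denom else Fraction(0)


p1 = {("D1", "M1"): 2, ("D1", "M2"): 3}  # graph D_1 (M2 is an assumed label)
p2 = {("D1", "M1"): 3}                   # graph D_2

print(s_pathsim(p1, p2))  # 6/11
```

Exact fractions are used here only to make the comparison with the worked example unambiguous; floating-point values would serve equally well for ranking.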

Temporal Information Encoding
Temporal information is critical to understanding the patients' dynamics. However, the AHIN described previously cannot capture the temporal information, so for the problem to be solved in this article, we propose an N-disease method to embed temporal information into the AHIN.
N-disease is inspired by the N-grams model in natural language processing. Its basic idea is to arrange each patient's diseases into a time series according to the time when they developed, sequentially collect the N-grams from the disease sequences, and then replace the disease objects with the disease N-grams in the annotated HIN. Assuming that $P_1$ has the diseases $[D_1, D_2, D_3]$ and $P_2$ has the diseases $[D_2, D_3, D_4]$, the results obtained after the 2-disease operation and the 3-disease operation are shown in Figure 7. In fact, the patient-annotated HIN given in Figure 4 is essentially the patient-annotated HIN after the 1-disease operation.
It should be noted that as N becomes larger, the accuracy of the annotated disease-drug connections in the HIN gradually decreases. As shown in Figure 7A, the node $[D_1, D_2]$ is connected to a drug, so it is no longer clear whether this drug is used to treat disease $D_1$ or disease $D_2$. Fortunately, we can trade off accuracy against temporal information by changing N.
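The N-gram collection step can be sketched as a sliding window over the time-ordered disease sequence. The function name `n_disease` is an assumption, and how sequences shorter than N should be handled is not specified in the text; this sketch simply returns an empty list for them.

```python
def n_disease(sequence, n):
    """Collect the N-diseases (disease N-grams) of a time-ordered disease sequence."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


p1 = ["D1", "D2", "D3"]
p2 = ["D2", "D3", "D4"]

print(n_disease(p1, 2), n_disease(p2, 2))
print(n_disease(p1, 3), n_disease(p2, 3))

# N-diseases shared by both patients for N = 2 and N = 3.
shared_2 = set(n_disease(p1, 2)) & set(n_disease(p2, 2))
shared_3 = set(n_disease(p1, 3)) & set(n_disease(p2, 3))
print(shared_2, shared_3)
```

In this example the two patients share one 2-disease but no 3-disease, which illustrates how a larger N encodes more order information while leaving fewer common nodes.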

MBH and MBHT
Retrieving the top-k similar patients of a specified patient has practical significance. It allows doctors to analyze similar patients to provide better treatment options. Previously, we introduced the annotated HIN-based measure S-PathSim and the temporal information embedding method N-disease. In this section, we define two patient similarity search methods, MBH and MBHT, based on these definitions.
MBH is a method based on the annotated HIN. In detail, the annotated HIN is first constructed from the patients' medical record information. After a patient is specified, S-PathSim is used to calculate the patient similarity and return the top-k similar patients.
MBHT is a method based on the annotated HIN and temporal information. The difference between MBHT and MBH is that MBHT constructs the annotated HIN processed by N-disease from the patients' medical record information, thereby embedding the temporal information into the annotated HIN, and then uses S-PathSim to calculate the patient similarity and return the top-k similar patients.
MBHT is thus the combination of N-disease and MBH. When N = 1, MBHT is MBH. MBHT uses the temporal information in the patient's medical records, but it also loses some accuracy, so we need to make a trade-off between temporal information and accuracy.
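The search step shared by MBH and MBHT can be sketched as follows. This is an illustration under assumptions rather than the authors' implementation: patients are represented by the weights on their annotated links (patient-disease edges are assumed to have weight 1), the function names are hypothetical, and patient `P3` is invented for the example. For MBHT, the first element of each link key would be an N-disease tuple instead of a single disease.

```python
import heapq


def s_sum(gx, gy):
    """Sum of PDMDP path scores over the links two patients share."""
    return sum(w * gy[link] for link, w in gx.items() if link in gy)


def s_pathsim(gx, gy):
    denom = s_sum(gx, gx) + s_sum(gy, gy)
    return 2 * s_sum(gx, gy) / denom if denom else 0.0


def top_k_similar(query, graphs, k=10):
    """Return the k patients most similar to `query` as (patient, score) pairs."""
    scored = [(p, s_pathsim(graphs[query], g)) for p, g in graphs.items() if p != query]
    return heapq.nlargest(k, scored, key=lambda item: item[1])


graphs = {
    "P1": {("D1", "M1"): 2, ("D1", "M2"): 3},
    "P2": {("D1", "M1"): 3},
    "P3": {("D2", "M3"): 1},  # hypothetical patient with no shared links
}
result = top_k_similar("P1", graphs, k=2)
print(result)
```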

Data Description
We perform experiments on a real dataset, which primarily includes information about the medical treatments and drug details of each person. Each person has multiple records (n > 2). Moreover, each record contains a diagnosis (i.e., ICD10) and information about multiple drugs. To improve the experiment quality, we randomly divided the data into four sub-datasets. Table 2 shows the description of the divided datasets. In addition, we did not perform any other desensitization treatment (such as removing diseases with less than five patients), so our experiment is performed on a real-world dataset without any unjustifiable data manipulations.

Experimental Settings
In application, comparative analysis is often performed by retrieving the top-k similar patients of designated patients to support clinical decision making. In the experiment, we also evaluate the model by retrieving the top-k similar patients of the specified patients. We set k = 10. We used two metrics for quantitative evaluation. nDCG (normalized Discounted Cumulative Gain, with a value between 0 and 1, the higher the better) (Zhang et al., 2020) is an indicator used to measure the quality of a ranking. The main idea is that the products that the user likes should be ranked at the front of the recommendation list rather than at the back, so as to improve the user experience. It is obtained by normalizing DCG (Discounted Cumulative Gain), that is, nDCG = DCG / IDCG, where rel is the sorted list of relevance values, i is the position number of the current result, and IDCG is the largest DCG in the ideal state.
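The following sketch shows one common way to compute nDCG. It assumes the linear-gain form of DCG, in which the relevance at position i is divided by log2(i + 1); the text does not state which DCG variant was used in the experiments, so this choice and the function names are assumptions.

```python
from math import log2


def dcg(rels):
    """Discounted cumulative gain with linear gain: sum of rel_i / log2(i + 1)."""
    return sum(rel / log2(i + 1) for i, rel in enumerate(rels, start=1))


def ndcg(rels, k=10):
    """nDCG@k: DCG of the returned order divided by the DCG of the ideal order."""
    ideal = dcg(sorted(rels, reverse=True)[:k])
    return dcg(rels[:k]) / ideal if ideal > 0 else 0.0


print(ndcg([3, 2, 1]))  # already in ideal order
print(ndcg([1, 2, 3]))  # worst order of the same items
```

Here `rels` is the list of true relevance values in the order the method returned them.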
The HL (half-life utility) (Sarwar et al., 2001) index is proposed under the assumption that the probability that the user browses the product and the specific ranking value of the product in the recommendation list decrease exponentially. It measures the practicality of the recommendation system for a user. It is the difference between the user's actual rating and the model rating.
So HL can also be used to evaluate top-k search results.
Here, $r_{ua}$ represents the true similarity of patient u and patient a; d is the default score, which we set to the average similarity in the experiment; and $l_{ua}$ is the rank of patient a in the recommended list of patient u. h is the half-life of the system, that is, the position in the recommended list that the user has a 50% probability of browsing; we set h = 3. In order to verify the effectiveness of the proposed MBH based on S-PathSim, we set up a comparison experiment between MBH and the similarity search method based on PathSim. In addition, in order to explore the effect of N-disease on the results, N was set to 1, 2, 3, and 4, respectively, and the results of MBHT were collected for comparative analysis. Finally, we explored the effect of N-disease on algorithm efficiency. The experimental environment is as follows: Intel Core i5 CPU, 2.80 GHz; 4 GB memory.
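A sketch of the half-life utility for one query patient is given below. It assumes the standard form of this metric, in which each item contributes max(r − d, 0) discounted by 2 raised to (l − 1)/(h − 1); the exact formula used in the experiments is not shown in the text, so the form, the function name, and the omission of any normalization are assumptions.

```python
def half_life_utility(ranked_scores, d, h=3):
    """Half-life utility of one ranked list.

    ranked_scores: true similarities r_ua in the order the method ranked them,
                   so the item at index 0 has rank l_ua = 1.
    d:             default score (the text sets it to the average similarity).
    h:             half-life; an item at rank h contributes half its gain.
    """
    return sum(
        max(r - d, 0) / 2 ** ((rank - 1) / (h - 1))
        for rank, r in enumerate(ranked_scores, start=1)
    )


print(half_life_utility([1.0], d=0.0))            # rank 1: no discount
print(half_life_utility([0.0, 0.0, 1.0], d=0.0))  # rank 3 = h: gain is halved
```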

Comparison of Patient Similarity Search Method
This article proposes annotated HIN and S-PathSim, and defines MBH, a patient similarity search method based on the annotated HIN and S-PathSim. PathSim is an excellent object similarity measurement method based on HIN. PathSim can be used to retrieve the similarity of patients. Here, we compare MBH with PathSim-based methods to verify the effectiveness of MBH: (1) MBH: Map the patient information to the annotated HIN, the schema is shown in Figure 4, through the weighted meta path as shown in Figure 5B; the S-PathSim is used to measure the similarity of patients, and get the top-k similar patients of the specified patients.
(2) Baseline: Map patient information into HIN. The schema is shown in Figure 1. The meta path used is (PDMDP). PathSim is used to calculate the patient similarity, and the top-k search result of the specified patient is obtained.
It is worth mentioning that the above steps are run on all 4 datasets, which reduces the chance of accidental results. Figure 8 shows the experimental results of the two models on the 4 datasets. Figure 8A uses nDCG as the evaluation criterion, and it can be observed that MBH is superior to the baseline on all four datasets. Figure 8B uses HL as the evaluation criterion, which shows that MBH has better practicability than the baseline.

The Impact of N-Disease
We propose N-disease to embed temporal information into the annotated HIN, and the difference between MBH and MBHT is whether N-disease is used or not. In this section, we compare MBH with MBHT and explore the effect of N-disease on MBHT. We set N to 1, 2, 3, and 4, respectively. When N = 1, the annotated HIN does not contain temporal information, and MBHT is MBH. When N = 4, the annotated HIN contains the largest amount of temporal information. However, beyond a threshold, as N increases, the annotated HIN captures more temporal information while its patient similarity search performance decreases steadily. We should therefore choose N carefully to obtain the best results. The experimental results are shown in Table 3. In datasets A, C, and D, MBHT has the best results when N = 2; in dataset B, MBHT achieves the best results when N = 3. Averaged over the 4 datasets, N = 2 gives MBHT the best results. In general, N = 2 achieves the best results for MBHT and balances the time consumption and accuracy of the annotated HIN. At the same time, the experimental results also show that MBHT is better than MBH.
In the following, we explore the effect of N-disease on MBHT efficiency. We assume that when N-disease method is not used (i.e., N = 1), the running time of the program is unit 1. The experimental results are as follows. It can be seen from Table 4 that the efficiency of the algorithm is improved by using N-disease; especially when N = 2, the algorithm has the highest efficiency. The use of N-disease changes the number of annotated HIN nodes and the relationship between the nodes, which in turn changes the efficiency of the algorithm. Since N-disease will affect the efficiency of MBHT, this paper gives an explanation from a practical point of view.
When the program is implemented, we divide MBHT into two steps. The first step is data statistics, and the second step is S-PathSim calculation. The use of MBHT has more data statistics steps than the use of MBH alone, but we know from practice that the time consumed by the data statistics step is quite small and can even be ignored. When we calculate S-PathSim, we use a lot of multiplication, which takes most of the total running time. We found that when N = 2, the number of multiplication operations is significantly smaller than when MBH is used alone. This explains why the running time of the program when N = 2 is shorter than that when using MBH alone.
In short, we conclude that when N = 2, the annotated HIN achieves a balance between time consumption and accuracy, and can effectively improve the efficiency of the algorithm.

CONCLUSION
In this paper, a new method of patient similarity calculation is proposed that uses the disease and drug data of patients and models them with the annotated HIN proposed in this paper. The annotated HIN adds annotations of patient information to the links between diseases and drugs, which solves the problem of the classic HIN losing the information regarding these associations. At the same time, based on the annotated HIN, we propose S-PathSim to measure patient similarity. Furthermore, N-disease is proposed to encode temporal information into the annotated HIN. Our measure does not rely on high-dimensional and sparse vectors, and effectively captures the patient's medical events and the temporal information in EHRs. Finally, based on S-PathSim and N-disease, two patient similarity search methods, MBH and MBHT, are proposed. The experimental results show that the proposed method is superior to the baseline method.