Discovering and Summarizing Relationships Between Chemicals, Genes, Proteins, and Diseases in PubChem

The literature knowledge panels developed and implemented in PubChem are described. These help to uncover and summarize important relationships between chemicals, genes, proteins, and diseases by analyzing co-occurrences of terms in biomedical literature abstracts. Named entities in PubMed records are matched with chemical names in PubChem, disease names in Medical Subject Headings (MeSH), and gene/protein names in popular gene/protein information resources, and the most closely related entities are identified using statistical analysis and relevance-based sampling. Knowledge panels for the co-occurrence of chemical, disease, and gene/protein entities are included in PubChem Compound, Protein, and Gene pages, summarizing these in a compact form. Statistical methods for removing redundancy and estimating relevance scores are discussed, along with benefits and pitfalls of relying on automated (i.e., not human-curated) methods operating on data from multiple heterogeneous sources.


INTRODUCTION
increasing heterogeneity and variability in quality, demand a novel exploratory approach to rapidly retrieve relevant, nonredundant, and reliable information and present it in an easy-tocomprehend form, organized around the most useful content for biomedical-focused communities.
PubChem users often want to find and explore important relationships between chemicals, genes, proteins, and diseases, evidenced by peer-reviewed journal articles. This task is not easy, considering the size and scope of the data contained in PubChem. To meet this demand, the literature knowledge panels were developed and implemented in PubChem. For a given entity (i.e., a chemical, gene, or protein), the literature knowledge panels show a few most relevant "neighbors," which are chemicals, genes, proteins, or diseases mentioned together with the entity. The panels also provide a sample of PubMed records comentioning the entity and its neighbors. The information on the relationships between these entities needs to be extracted from public databases, then asserted and summarized. Such a complex collection of interlinked entities is frequently called a knowledge graph (Singhal, 2012;Ehrlinger and Wöß, 2016;Sullivan, 2020;Google, 2021;SciBite, 2021). To make the collection useful, pieces of data relevant to a query need to be found, organized, and presented to the user. In this paper, we describe the methodology that allows us to uncover and summarize important relationships between chemicals, genes, proteins, and diseases by analyzing co-occurrences of terms in the biomedical literature.
The first step in producing data for the literature knowledge panels is to identify relevant named entities in unstructured text. While available data includes trusted curated sets, experimental data provided by various depositors, as well as literature and biomedical publications that are annotated manually by indexers (MEDLINE, 2021); an abundance of data can be extracted from unstructured text using named-entity recognition software (Ratinov, 2009). Current named-entity recognition approaches include dictionary matching, use of rules to recognize specialized terminology, and context analysis using statistical and neural language models (Sayle et al., 2011;Vazquez et al., 2011;Jessop et al., 2012;Rocktäschel et al., 2012;Gurulingappa et al., 2013;Lowe and Sayle, 2015;Pletscher-Frankild et al., 2015;Song et al., 2018;Devlin et al., 2019;Lee et al., 2020;Tian et al., 2020). To produce data for the PubChem literature knowledge panels, entities are annotated in a PubMed record using a third-party named-entity recognition software, LeadMine (Lowe and Sayle, 2015), and matched to chemical synonyms in the PubChem Compound database and to gene, protein, and disease names, as described in Materials and Methods.
The most relevant information is identified through statistical analysis and relevance-based sampling and summarized in a compact form. For each query entity, a few most-relevant neighbors in the knowledge graph are shown, along with several most-relevant PubMed records for each query-neighbor pair, where the query is the entity for which a panel is built (i.e., a compound, gene, or protein) and the neighbor is the co-occurring entity (i.e., a compound, gene, protein, or disease). Additional information accompanying each sample of records as well as download links helps the user to examine the context of the identified relationship and its reliability. The links to examples of literature knowledge panels are listed in Table 1, with screenshots shown in Figures 1-3.
Redundancy and near-redundancy elimination are performed for PubChem compounds using exclusion rules. We discuss three different methods for scoring co-occurrences: the simplest being a score based on the number of records where two entities are comentioned, and the other two being more advanced informationbased scores that allow to correct for an abundance of the neighbor in the database. To allow some user flexibility, while assuring PubChem efficiency, we enable the user's selection from a limited number of options, with data precalculated for each option.
The details of our methodology, data sources, and implementation are described in Materials and Methods. The application of the methodology to the real data is discussed in Results. The limitations of the current approach plus opportunities to enhance and extend using information from trusted human-curated sources as well as a variety of heterogeneous data sources are discussed in Discussion.

MATERIALS AND METHODS
NCBI PubMed records (PubMed, 2021;Sayers et al., 2021) are downloaded, and annotated using LeadMine, an entity recognition software program that uses dictionary matching and rules for specialized chemical terminology ("grammar") to recognize relevant text entities (Lowe and Sayle, 2015), version 3.15. Annotation is performed in multiple categories (such as chemicals, genes, proteins, and diseases) using LeadMineprovided dictionaries as well as our in-house dictionaries. While nested annotations (such as a chemical name, like "salicylic acid," found in a protein name, like "salicylic acidbinding protein 2") are allowed, only the non-nested annotations are used to build the knowledge panels. Annotated entities nested inside other annotated entities are kept for internal use, such as quality control and disambiguation.
The entities identified in PubMed records are normalized and matched to PubChem Compound database synonyms. Similarly, gene names and protein names are matched to the respective PubChem Gene and PubChem Protein pages, respectively, through the process described below. Disease entities are matched to Medical Subject Headings (MeSH) headers and supplementary concepts (MeSH, 2021) using LeadMine. For each query entity (a compound, gene, or protein), several non-redundant neighbors (compounds, diseases, genes, or proteins) are selected based on the cooccurrence scores between the query and neighbors. The scores depend upon the counts of PubMed records comentioning the query-neighbor pair. PubMed records comentioning them are sampled based on the relevance score, which reflects the position and frequency of the co-mentioned entities in the PubMed records as well as the characteristics of the PubMed records (e.g., the article type and publication date recency). The matching algorithms and scoring schemes are discussed in more detail below.

Text Entity Matching
Disease text entities are matched to MeSH headers and supplementary concepts using dictionaries and resolvers provided by LeadMine (Lowe et al., 2016), with some corrections made to accommodate recent changes in MeSH. Other annotated entities are matched to the names of chemicals, genes, and proteins, using the matching algorithm described in the paragraphs below.
While the entities annotated using LeadMine's case-insensitive dictionaries are matched in a case-insensitive manner, capitalization is considered when matching entities annotated by LeadMine using case-sensitive dictionaries. There is a normalization step performed, where all brackets become round brackets. In addition, before matching, text entities and database entries are transformed to ASCII (from UTF-8 or Unicode character sets), when possible, using the functionality from the Open Parser for Systematic IUPAC Nomenclature (OPSIN) project (Lowe et al., 2011).
Entities are considered matched if they have the same alphanumeric string (also using case-sensitive matching for alphanumeric strings produced from the entities annotated with case-sensitive dictionaries, and case-insensitive otherwise) and, if there is a high alignment score, allowing some flexibility in nonalphanumeric symbols. A pair of the text entities with identical alphanumeric strings are aligned using the Needleman-Wunsch algorithm (Needleman and Wunsch, 1970) with weight: 1 for exact matches, and −1 for mismatches and gaps. For two aligned entities to be accepted, the number of matched characters, normalized by the maximum of the lengths of the entities, should be greater than or equal to an acceptance threshold of 0.9 for compounds and 0.7 for genes and proteins. These thresholds were established empirically after experimentation with various cases.
Currently, it is not yet possible to reliably connect gene or protein entities in an unstructured text (e.g., PubMed records) to organism information. When annotating gene and protein entities, LeadMine frequently resolves their names to ones in an obscure organism (e.g., old names of human genes and proteins are resolved to current names in other species). We decided to prioritize human genes and proteins. The following strategy has been implemented to resolve gene and protein text entities to the most reasonable gene, protein, or enzyme symbol (corresponding to human, when possible): -Try to find a match among Human Genome Organization (HUGO) Gene Nomenclature Committee (HGNC) names (Braschi et al., 2019;HUGO, 2021); -Try to find a match among names in The IUPHAR/BPS Guide to Pharmacology (Armstrong et al., 2020;IUPHAR/BPS, 2021); -Try to find matches among names in UniProt (Bateman et al., 2017); -Otherwise, try to match to an enzyme name and resolve to an EC number (Bairoch, 2000;Expassy, 2021).
In general, it is very difficult and often impossible to distinguish the name of a gene from the name of the protein encoded by that gene. Therefore, gene and protein names are not strictly distinguished from each other but considered as one category. Therefore, the annotations considered in this study can be grouped into three categories: chemicals, genes/proteins, and diseases.

Relevance Score for a Pair of Entities in a PubMed Record
The relevance score for two entities co-mentioned in a PubMed record is used to sample PubMed records that may provide a context as to their relationship. The relevance scoring scheme has been carefully crafted to reflect the following factors: -The occurrence of the query-neighbor pair in the title significantly increases the relevance of the publication; -Annotated entities that appear close in the text have more chance to be related (Manning and Schütze, 1999); -With all other factors being the same, a recent publication is probably more important than an old one; -The PubMed record corresponding to a "review" article slightly increases the relevance of the publication.
The relevance score r p ij for matched entities i and j in PubMed record p is calculated by the following empirical formula where, δT p i is 1 if the matched entity i is in the title of the record p, and is 0 otherwise (δT p j is defined similarly); δN p i is 1 if the matched entity i is present in the record p more than once, and is 0 otherwise (δN p j is defined similarly); δS p ij is 2 if both entities i and j are present in two or more sentences in the abstract, 1 if only in one sentence, and 0 otherwise; δR p is 1 if the record p is marked as corresponding to a review article, and is 0 otherwise; δA p is equal to: w A 1 if the publication age is within one year; w A 2 if the publication age is between one and two years; w A 5 if the publication age is between two and five years; w A 10 if the publication age is between five and 10 years; w A 15 if the publication age is between ten and 15 years; w A 20 if the publication age is between fifteen and 20 years; 0 if the publication age is more than 20 years.
Weights used in Eq. 1 were selected after careful consideration. The values currently used in PubChem production are: w T 50 w S 50; w M 10; w R 10; w A 1 25; w A 2 20; w A 5 15; w A 10 10; w A 15 5; w A 20 2.
Eq. 1 and the values of the weights have been established by looking into a variety of representative PubMed records and by FIGURE 1 | Chemical-chemical co-occurrence panel for ibuprofen (CID 3672), accessible at: https://pubchem.ncbi.nlm.nih.gov/compound/ 3672#section Chemical-Co-Occurrences-in-Literature.
Frontiers in Research Metrics and Analytics | www.frontiersin.org July 2021 | Volume 6 | Article 689059 4 establishing the relative importance of the contributing factors. It is difficult to objectively check the accuracy and reliability of the formula and weights because of the subjectivity of entity relationship interpretation and relative scarcity of curated data. Still, although the formula is heuristic and the weights values are subjective, they could be further optimized to handle specific use cases.

Selecting the Time Period for the Publication Dates
While a reconciliation of different relevance factors is an intricate problem, balancing the publication date with other relevance factors can be especially difficult, and strongly FIGURE 2 | A chemical-gene co-occurrence panel for ibuprofen (CID 3672), accessible at: https://pubchem.ncbi.nlm.nih.gov/compound/ 3672#section Chemical-Gene-Co-Occurrences-in-Literature.
Frontiers in Research Metrics and Analytics | www.frontiersin.org July 2021 | Volume 6 | Article 689059 5 depends on the user's needs. While operating with the precalculated data in the default setting, the user can select a preferred publication time period from a limited number of options (currently three options: since last year, within the past 5 years, or within the past 10 years). Based on the user's selection, the page view is formed within the web browser from the pre-calculated data. This approach allows some flexibility while assuring system efficiency.

Scoring the Co-occurrences
The co-occurrence score between two entities is used to select the most co-mentioned entities for a given entity. Three approaches have been tested to develop an appropriate formula for the cooccurrence score.
Consider the query entity i and the neighbor entity j, which belong to the categories I and II, respectively, and let Λ (I) and Λ (II) be sets of PubChem records that have mentions from the categories I and II, respectively, and Λ (I,II) Λ (I) ∩ Λ (II) . Let Λ (I) i 4 Λ (I) and Λ (II) j 4 Λ (II) be sets of PubChem records mentioning entities i and j, j be a set of PubChem records where entities i and j are co-mentioned (it is easy to see that Ω ij Λ (I) i ∩ Λ (II) j as well). We considered the following choices for the co-occurrence score S ij . First, we used where N ij Ω ij . This simple scoring scheme is suitable when the neighbor j is relatively rarely present in the dataset Λ (I,II) . However, when the neighbor j frequently occurs in the PubMed articles (e.g., the chemical name "water" or the disease term "cancer"), this scheme tends to bring it to the top of the neighbor list of entity i, even if the relationship between i and j is not very specific. This can be avoided by switching to the information gain-based cooccurrence score calculated by the formula where N DS Λ (I,II) is the size of the dataset, and N j Ω (II) j is the number of records within Λ (I,II) where entity j is mentioned. The score (3) is derived from the Kullback-Leibler divergence, also known as relative entropy (Kullback and Leibler, 1951;Manning et al., 2008). It can be considered as a variant of term frequency-inverse document frequency (TF-IDF) score (Aizawa, 2003;Robertson, 2004;Manning et al., 2008;Rajaraman and Ullman, 2011). At the time of writing, Eq. 3 is what is used by PubChem co-occurrence displays.
To define an even more advanced scoring formula, let us denote the relevance score of the entities i and j in the record p ∈ Ω ij as r p ij , Then the co-occurrence score is defined by the formula While co-occurrence scores defined by Eqs 2, 3 depend only on article counts, the score defined by Eq. 4 also depends on the distribution over relevance score. Further explanation on these scoring schemes is provided in Results.

Redundancy Elimination
Some compounds in PubChem are very similar (e.g., different salt forms of the same parent compound), and if such compounds happen to be neighbors to the query compound in the knowledge graph, the panel will be clogged with redundant information, decreasing its utility. This redundancy was removed by selecting a representative neighbor from each "group" of neighbors with either the same parent-connectivity group (Fu et al., 2015) or (more selectively) the same chemical name. The representative neighbors are selected using these rules: -All compounds that belong to the same parent-connectivity group or have the same name as the query compound are taken out of consideration; -The same rule is iteratively applied when PubChem compounds are added to the knowledge panel as neighbor compounds. At each iteration, the compound with the highest value of co-occurrence score (based on Eq. 3) is selected from the list of candidates (in the case of the same value of co-occurrence score, the selection is arbitrary). The selected compound is added to the knowledge panel as a representative neighbor compound, while all compounds that belong to the same parent-connectivity group as that compound or share a name with it are removed from the list (only names from the PubChem list of synonyms that matched PubMed records are taken into consideration).
The application of these rules results in a "non-redundant" neighbor list for a query entity. Note that these rules ensure that neighbors are not too similar to each other as well as to the query.

Implementation
The precalculated co-occurrence data are loaded into a set of databases and served to the knowledge panels presented in the respective literature sections of PubChem Compound, Gene, and Protein pages. When the summary page for a given PubChem record is created, these databases are queried to see if there is co-occurrence information for the specific record in question. If so, an appropriate heading is added to the summary (in the Literature section of the table of contents). When the user scrolls to that part of the summary page, the databases are queried to gather the information displayed in the panels.
The data underlying the literature knowledge panels are routinely updated on a weekly basis. The data presented in the next section was generated in late February 2021.

RESULTS
General statistics for the annotation and matching of PubMed records are shown in Table 2. Among 32.2M PubMed records (as of late February 2021), there are 14.0M that have a chemical annotation, with 11.42M records having a chemical annotation matched to a PubChem compound and 294.6K PubChem compounds matched to PubMed records. Note that nearly all disease terms in the disease dictionaries, including all levels of the MeSH trees, are resolved to MeSH headers and supplementary concept records (Lowe et al., 2016).
The distributions of the occurrences of compounds, genes/ proteins, and diseases in PubMed record annotations are illustrated in Figure 5. The five most frequently mentioned entities in PubMed records for the three categories are listed in Tables 3-5. Note that most of the annotations are for a small number of frequently mentioned entities. For example, 79.9% of unique CID-PMID pairs contain only 1% of the 294.6K CIDs. The five most frequently occurring chemicals are water (CID 962), D-glucose (CID 5793), oxygen (CID 977), ethanol (CID 702), and calcium (CID 5460341). Especially, water is annotated in 823.7K PubMed records, which corresponds to 5.9% of all the PubMed records annotated with chemicals and 2.6% of all PubMed records. The most frequently mentioned entities in the gene/protein and disease categories were insulin and neoplasms, respectively, which appeared in 329.4K and 2.46M PubMed records, respectively. Figure 6 shows the distributions of the number of neighbors for chemicals, genes/proteins, and diseases (the chemical neighbor counts are for non-redundant chemical neighbors, generated using the method described in Redundancy elimination). Because frequently mentioned entities are likely to occur with other entities, they often have tens of thousands of neighbors. For instance, the most mentioned chemical, water, has 47.2K non-redundant chemical neighbors, 15.9K gene/protein neighbors, and 4.5K disease neighbors ( Table 3). For the gene/protein category, insulin is most frequently mentioned, and has 13.4K non-redundant chemical neighbors, 12.3K gene/protein neighbors, and 3.9K disease neighbors ( Table 4). The most mentioned disease term, "neoplasms," appears with 47.1K non-redundant chemical neighbors, 28.3K gene/protein neighbors, and 6.0K disease neighbors (Table 5).
Importantly, the frequently mentioned entities tend to be ranked higher in a neighbor list when the co-occurrence scores are computed using Eq. 2. This bias is addressed by using the information gain-based co-occurrence scoring schemes, Eqs 3, 4. To illustrate the importance of correction, consider the chemical co-occurrence panel for acetone (CID: 180): https://pubchem.ncbi.nlm.nih.gov/compound/180#section Chemical-Co-Occurrences-in-Literature.
The three closest non-redundant chemical neighbors for acetone are methanol (CID 887), ethanol (CID 702), and water (CID 962). Note that, water is listed as the third closest, although it was more frequently co-mentioned with acetone than the other two neighbors (4.59K records for water, 3.30K for methanol, and 4.3K for ethanol). This is because of the correction term in Eq. 3. Figure 7 illustrates the distribution of relevance score values for the CID1-CID2-PMID triplets. Note that the group of columns on the right accounts for the most significant cooccurrences: score values above 240 are produced typically when two compounds are annotated in the title and are mentioned together in two or more sentences in the abstract.  , 2020), because all factors listed in Relevance score for a pair of entities in a PubMed record contribute to the relevance score, as illustrated by Figure 8. The two chemicals appear together in the title as well as in multiple sentences in the abstract. Besides, the paper is a recent review article published a year ago. Important, but fewer significant co-occurrence patterns produce relevance scores in the range 140-240, which correspond to the middle group in Figure 7. Score values in the range 140-240 are produced typically when two compounds are annotated in the title and are mentioned together in one sentence in the abstract.
To understand and compare Eqs 3, 4, let us rewrite them in an alternative form. Eq. 3 can be written as where the correction factor ϑ j is defined as and The value of the correction factor ϑ j in Eq. 5a is significantly below 1 when the neighbor j is well-presented in the dataset (i.e., frequently mentioned in PubMed articles) and log N j is comparable to log N DS . For N DS 11.42K (the number of PubMed records with matched chemical annotations; Table 3), ϑ j 1 2 when N j 3.38K, and ϑ j 1 3 when N j 50.7K. There are about 1.78K compounds whose ϑ j values are 1/2 or below. Among them, 82 compounds have ϑ j values 1/3 or below. Water has the smallest ϑ j value, equal to 0.161. The values of the correction factor ϑ j in Eq. 5a for chemical neighbors are shown in Figure 9.
Similarly to Eq. 3, Eq. 4, can be written in the form where the correction factor ϑ ij ′ is defined as and The rate ν ij of values S ij defined by Eqs 5a, 6b is equal to the rate of the corresponding correction factors ϑ ij ′ and ϑ j : Since 0 ≤ N j (r p ij ) ≤ N j for all p ∈ Ω ij , 0 < θ p ij ≤ 1 for all p ∈ Ω ij . Therefore, ] ij ≥ 1.
To illustrate calculation of the co-occurrence score using Eq.6a, consider D-glucose (CID: 5793) as a neighbor of PubMed records where D-glucose is mentioned. ε j value for D-glucose is 0.80, and the value of the correction factor is 0.20. The value of rate ν ij depends on the counts of PubMed records for the values of the relevance score in two sets of PubMed records: the set of records where D-glucose was co-mentioned with cholesterol and the set of records where D-glucose was mentioned with any PubChem compounds (if D-glucose is comentioned with several compounds, the maximal value of the relevance score is taken). Corresponding bar plots are shown in Figure 10. The resulting ν ij value is 1.18, and the value of the correction factor ϑ ' ij is 0.24. As explained in Redundancy elimination, a representative compound from each group of compounds with the same parent connectivity is selected to avoid clogging the knowledge panels with redundant information. Among the 294.6K CIDs matched in PubMed records, 101.1K CIDs (34%) have the same parent connectivity as another matched compound, forming 31.5K groups. The remaining 193.5K CIDs (66%) are singletons. This results in a total of 225.0K CID groups (i.e., 31.5K multi-CID groups plus 193.5K single-CID groups), from each of which a representative CID is selected to generate non-redundant chemical neighbors. The largest group contains 191 CIDs, which correspond to citric acid (CID 311) and its various salt forms (sodium citrate, calcium citrate, potassium citrate, and so on).

DISCUSSION
The PubChem literature knowledge panels, implemented in the PubChem Compound, Gene, and Protein pages, serve as an exploratory tool showing several most-related, non-redundant entities co-mentioned in the biomedical literature for the respective record being viewed (the query entity), along with a few most relevant PubMed records. The panels help the user to rapidly discover important relationships between chemicals, genes, proteins, and diseases, and quicky get a sense of the relationships in a set of papers. It is especially beneficial when a dataset is too big to examine. A sample of PubMed records comentioning the entities helps the user to understand the nature and reliability of the relationship. In addition, the user can download the list of the papers with a relationship of interest and read them to gain a deeper understanding.   The limitations of the approach used to develop the knowledge panels include the limitation of the current co-occurrence model itself as well as limitations of the technology employed for named-entity recognition and database matching. While an approach based on named entity co-occurrence in PubChem records is a useful data exploration tool, it is based on a simple well-known linguistic model. It is reasonable to think that even more sophisticated models may produce improved results.
A dictionary-based approach and the term matching procedures we use suffers from ambiguities when the same word could have multiple meanings and multiple matches. For example, lead has multiple meanings in common English besides being a synonym for the compound with CID: 5352425. Retinal is a synonym for the compound with CID: 638015 and also an anatomical term related to various retinal diseases under MeSH ID: D012164. CAT is a gene symbol for catalase (NCBI Gene ID: 847), a common name of organism domestic cat with NCBI taxonomy ID: 9685, and computer-assisted tomography under MeSH ID: D014057. MP2 is a synonym for the compound with CID: 15942661, a gene symbol for maturation polypeptide (NCBI Gene ID: 547827), an abbreviation for the second-order Møller-Plesset perturbation theory, as well a video file format, also known as MPEG-2. Approaches to mitigate ambiguities within the dictionary-based approaches include placing the term in a case-sensitive dictionary, deciding to always assign "the most common" meaning, or placing the term in a negative dictionary. However, in many situations the meaning is contextdependent, and novel disambiguation methods able to resolve context-sensitive situations are required. We are examing algorithms and methods that would enable us to better understand and utilize the contextual meaning of ambiguous terms. As with many other knowledge resources, the PubChem approach could benefit from incorporating information from more trusted, human-curated data sources. Additional curated information will allow to further cross-validate the data and promote trustful and reliable information through improved scoring.
Currently, the co-occurrence scores used for the knowledge panels are evaluated using Eq. 3, but we are working with more advanced scoring schemes, such as that given in Eq. 4, as well as incorporating validation scoring. Handling of (near-)redundancy  in the neighbor lists of an entity is also an important issue to address in future development. In particular, the chemical name-structure association is an important area for improvement. For various reasons, a chemical name is often associated with multiple chemical structures that slightly differ from each other (e.g., in terms of: stereochemistry, isotopic composition, resonance forms, tautomeric structures, mixture/ salt forms, etc.) (Hahnke et al., 2018). While PubChem chemical structure and chemical name processing attempts to handle such issues, it is imperfect. As a result, a single pair of chemical names, each of which can be mapped to multiple CIDs, often lead to many structurally similar CID-CID pairs, increasing the redundancy in neighbor relationships between chemicals. An improved algorithm to select good representative structures for chemical names would enhance the handling of this type of redundancy. It may also be interesting to consider whether to automatically annotate a broader disease term (e.g., cancer) when its more specific disease name (e.g., breast cancer) is annotated in a PubMed article.