This article was submitted to Pharmacogenetics and Pharmacogenomics, a section of the journal Frontiers in Pharmacology
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Text mining in biomedical literature is an emerging field which has already been shown to have a variety of implementations in many research areas, including genetics, personalized medicine, and pharmacogenomics. In this study, we describe a novel text-mining approach for the extraction of pharmacogenomics associations. The code was implemented in the R programming language, either through custom scripts, where needed, or by using functions from existing libraries. Articles (abstracts or full texts) that correspond to a specified query were extracted from PubMed, while concept annotations were derived from PubTator Central. Terms that denote a Mutation or a Gene, as well as Chemical compound terms corresponding to drug compounds, were normalized, and the sentences containing these terms were filtered and preprocessed to create appropriate training sets. Finally, after training and hyperparameter tuning, four text classifiers were created and evaluated (FastText, linear kernel SVMs, XGBoost, and Lasso and Elastic-Net Regularized Generalized Linear Models) with regard to their performance in identifying pharmacogenomics associations. Although further improvements are essential before this text-mining approach can be properly implemented in clinical practice, our study stands as a comprehensive, simplified, and up-to-date approach for the identification and assessment of research articles enriched in clinically relevant pharmacogenomics relationships. Furthermore, this work highlights a series of challenges concerning the effective application of text mining in biomedical literature, whose resolution could substantially contribute to the further development of this field.
Over the span of 10 years, technological achievements and advances have shifted the direction of pharmacogenomics (PGx) research from candidate gene PGx to large-scale PGx studies (
As previously shown, text mining has become a widely used approach for the identification and extraction of information from unstructured text (
Up to the submission of this study, only a limited number of published papers have implemented biomedical text-mining methodologies not only to retrieve important disease/drug-gene/polymorphism relationships but also to assess the sensitivity and specificity of these relationships. Moreover, most of the existing databases that curate pharmacogenomics relationships (e.g., OMIM, Human Gene Mutation Database (HGMD), CTD, and PharmGKB) rely on manual curation of biomedical literature to capture disease- or drug-related genetic associations in humans. Consequently, much of this information remains inaccessible in the unstructured text of biomedical publications. This observation further demonstrates the need for an accurate and automated process which will highlight important and clinically relevant PGx relationships.
In this study, we propose a novel biomedical text-mining system, which retrieves clinically relevant biomedical information, while also comparing the accuracy of the retrieved information with data extracted from PharmGKB. Text-mining annotation is performed not only for PubMed abstracts but also for full-text articles. The key feature of this study is the use of advanced text mining and natural language processing (NLP) to tabulate the most important and clinically relevant pharmacogenomics relationships by comparing these to gold standard datasets, thus further demonstrating their potential clinical utility.
The present biomedical text-mining approach includes the following common steps of natural language processing (NLP) (Figure 1): corpus creation; concept annotation and normalization; identification, extraction, and filtering of sentences of interest; and, finally, text classification with the purpose of discovering pharmacogenomics associations. The derived associations were subsequently compared with a gold standard dataset from PharmGKB. The entire project was created using custom code and available packages in the R programming language (version 4.0.0) (
The first step was the collection of published literature that is likely to contain information about pharmacogenomics associations and is relevant to humans, excluding review articles. These prerequisites were summed up in the following query: ‘(pharmacogen*[Text Word] AND (“humans” [MeSH Terms]) NOT (Review [ptyp]))’.
Querying the NCBI’s PubMed database (via API, easyPubMed R package) in May 2020 resulted in the extraction of 11,302 PMIDs (standard identifiers for articles present in PubMed). Those PMIDs were subsequently converted into their corresponding PubMed Central Identifiers or PMCIDs (i.e., standard identifiers for articles freely available as full text in the PubMed Central database) by using NCBI’s ID converter. Out of the 11,302 originally retrieved articles, only 3,165 were freely available as full text, while for the rest, we could access only the title and the abstract.
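As a rough illustration of how such a query can be submitted programmatically, the sketch below builds a request URL for the NCBI E-utilities ESearch endpoint. It is written in Python for illustration only (the study itself used the easyPubMed R package), and the function name `build_esearch_url` and the `retmax` value are assumptions rather than details taken from the study.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Public NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# The query reported in the text, with straight quotes.
QUERY = '(pharmacogen*[Text Word] AND ("humans" [MeSH Terms]) NOT (Review [ptyp]))'


def build_esearch_url(term, retmax=100):
    """Return an ESearch URL for PubMed (hypothetical helper, no request is sent)."""
    params = {
        "db": "pubmed",      # search the PubMed database
        "term": term,        # the search expression
        "retmode": "json",   # ask for a JSON response
        "retmax": retmax,    # number of PMIDs to return per request
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    url = build_esearch_url(QUERY)
    # Round-trip check: the encoded term decodes back to the original query.
    print(parse_qs(urlparse(url).query)["term"][0] == QUERY)
```

The URL is only constructed here, not requested; retrieving the full result set would additionally require paging through the results and respecting NCBI's usage limits.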
Concept annotation of the collected corpus with regard to biological entities (i.e., genes/proteins, genetic variants, diseases, chemicals, species, and cell lines) was performed with a custom function that programmatically accesses PubTator Central (PTC) through its RESTful API (
Since the results from PTC could potentially contain a significant amount of task-irrelevant terms, each category of Concepts of interest (Genes, Mutations, and Chemicals) was further evaluated, while only terms identified in the abstract or in paragraphs of an article were maintained (i.e., filtering out terms found in tables, titles, and references). Regarding Genes, only those corresponding to a human gene were kept, while entries corresponding to the pattern “
Finally, for the mapping of the provided Chemical MeSH IDs to PharmGKB IDs, the Chemical Vocabulary from the Comparative Toxicogenomics Database (CTD) was accessed on April 10, 2020. This vocabulary contains up-to-date MeSH IDs and is a subset of the MeSH dataset that excludes entries that are not molecular reagents, environmental chemicals, or clinical drugs. It also provides a list of DrugBank IDs, which can be used to connect to PharmGKB IDs. In order to keep mostly those chemicals that are actually drugs and eliminate any remaining noise, the data from CTD and PharmGKB’s chemical.tsv (downloaded on April 10, 2020) were combined based on the provided DrugBank ID (a common key between the two datasets) and were subsequently filtered to keep only the following Chemical Types (as defined in PharmGKB): Drug, Drug/Biological Intermediate, Prodrug, Drug/Ion, and Drug/Metabolite, leading to a list of 1,449 chemicals. This list was further manually curated, leaving 1,395 chemicals, and was used to remove “nondrug” chemicals from our data frame. Finally, the remaining Chemical IDs were queried against MeSH (via NCBI’s e-utilities) to get the corresponding MeSH terms for these compounds.
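The join-and-filter step described above can be sketched as follows. This is a minimal Python illustration, not the authors' R code: the record layout (`mesh_id`, `drugbank_ids`, `pharmgkb_id`, `drugbank_id`, `type`) and all identifiers in the example are hypothetical placeholders, and the manual curation step is not modeled.

```python
# Chemical Types kept in the study, as listed in the text.
DRUG_TYPES = {
    "Drug",
    "Drug/Biological Intermediate",
    "Prodrug",
    "Drug/Ion",
    "Drug/Metabolite",
}


def map_mesh_to_pharmgkb(ctd_rows, pharmgkb_rows):
    """Join CTD and PharmGKB records on DrugBank ID, keeping drug-like types only."""
    # Index PharmGKB records by their DrugBank ID (the shared key).
    by_drugbank = {}
    for row in pharmgkb_rows:
        by_drugbank.setdefault(row["drugbank_id"], []).append(row)

    mapping = {}
    for ctd in ctd_rows:
        for drugbank_id in ctd["drugbank_ids"]:
            for pgkb in by_drugbank.get(drugbank_id, []):
                if pgkb["type"] in DRUG_TYPES:
                    mapping[ctd["mesh_id"]] = pgkb["pharmgkb_id"]
    return mapping


if __name__ == "__main__":
    # Placeholder records; none of these identifiers are real.
    ctd = [
        {"mesh_id": "MESH:0001", "drugbank_ids": ["DB_A"]},
        {"mesh_id": "MESH:0002", "drugbank_ids": ["DB_B"]},
        {"mesh_id": "MESH:0003", "drugbank_ids": []},
    ]
    pharmgkb = [
        {"pharmgkb_id": "PA_1", "drugbank_id": "DB_A", "type": "Drug"},
        {"pharmgkb_id": "PA_2", "drugbank_id": "DB_B", "type": "Metabolite"},
    ]
    print(map_mesh_to_pharmgkb(ctd, pharmgkb))  # → {'MESH:0001': 'PA_1'}
```

Chemicals without a DrugBank ID, or whose PharmGKB type falls outside the kept set, simply drop out of the mapping, which is how the “nondrug” entries are removed in this sketch.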
One limitation of the tools used by PTC for the identification of biomedical entities is their inability to distinguish Star Alleles, which are of utmost importance in pharmacogenomics, and their resulting misclassification as Genes. To overcome this obstacle, a search based on regular expressions was performed on the texts that contained a term characterized by PTC as “Gene,” to identify those complying with the Star Allele nomenclature. This process was applied solely to terms regarding Genes that are already known to have Star Alleles (based on entries of PharmVar and the genes mentioned by Lee et al. (
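A regular-expression search of this kind might look like the sketch below. It is a Python illustration under stated assumptions: the gene list is a small illustrative subset (the study relied on PharmVar and a published gene list), the pattern covers only the simple form "GENE*number" with an optional suffix letter, and the actual expressions used by the authors are not given in the text.

```python
import re

# Illustrative subset of genes known to have Star Alleles (assumption).
STAR_ALLELE_GENES = ["CYP2D6", "CYP2C19", "CYP2C9", "TPMT"]

# Longest names first, so that CYP2C19 is tried before CYP2C9.
_GENES = "|".join(
    re.escape(g) for g in sorted(STAR_ALLELE_GENES, key=len, reverse=True)
)

# Gene symbol, optional space, an asterisk, then the allele number
# with an optional upper-case suffix letter (e.g. *4, *17, *2A).
STAR_ALLELE_RE = re.compile(rf"\b({_GENES})\s?\*\s?(\d+[A-Z]?)")


def find_star_alleles(text):
    """Return normalized 'GENE*allele' strings found in the text."""
    return [f"{gene}*{allele}" for gene, allele in STAR_ALLELE_RE.findall(text)]


if __name__ == "__main__":
    sentence = (
        "Patients carrying CYP2D6*4 or CYP2C19 *17 showed altered response; "
        "CYP2D6 expression was unchanged."
    )
    print(find_star_alleles(sentence))  # → ['CYP2D6*4', 'CYP2C19*17']
```

The plain mention of CYP2D6 without an asterisk is left as a Gene, which mirrors the intent of separating true gene mentions from Star Allele mentions.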
Subsequently, the sentences that contained a concept of interest (Mutation, StarAllele, or Chemical) were extracted from the corresponding paragraphs based on string matching and the provided coordinates of each term within the paragraph. Further filtering led to a subset of sentences containing at least one Mutation or StarAllele (both of which we included under the term “Variant”) and at least one Chemical compound. As expected, some of the derived sentences contain one Variant-Chemical pair (1-pair sentences), while others might contain multiple mentions of Variants, Chemicals, or both, leading to multiple Variant-Chemical pairs (n-pair sentences). Although these two categories of sentences received similar preprocessing, they were treated differently during classification and training set creation.
In order to extract the existing relationships (if any) between the Variants and the Chemicals present in each sentence, a subset from each sentence category (1-pair and n-pair) was manually curated and two distinct training sets were created. Subsequently, those training sets were used to train four different classification algorithms, the resulting models were applied to the remaining, unseen sentences, and the results were finally compared with a PharmGKB-derived Gold Standard set of Variant-Chemical pairs.
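The distinction between 1-pair and n-pair sentences can be made concrete with a short sketch. The Python below is illustrative only (the study used R); the function names are hypothetical, and it assumes that the Variant and Chemical mentions of a sentence are already available as lists of normalized strings.

```python
from itertools import product


def candidate_pairs(variants, chemicals):
    """All distinct (Variant, Chemical) pairs that a sentence can give rise to."""
    return sorted(set(product(set(variants), set(chemicals))))


def sentence_category(variants, chemicals):
    """Return '1-pair', 'n-pair', or None when no Variant-Chemical pair exists."""
    n_pairs = len(candidate_pairs(variants, chemicals))
    if n_pairs == 0:
        return None  # sentence is filtered out
    return "1-pair" if n_pairs == 1 else "n-pair"


if __name__ == "__main__":
    print(sentence_category(["CYP2D6*4"], ["tamoxifen"]))
    print(candidate_pairs(["CYP2C19*2", "CYP2C19*17"], ["clopidogrel"]))
```

Repeated mentions of the same Variant or Chemical within a sentence are collapsed before pairing, so a sentence that names one pair twice still counts as a 1-pair sentence in this sketch.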
Regarding sentences that discuss only one pair of Variant-Chemical, a sentence was classified as “Correlated,” if the context of that sentence implied a clear association between the pair of Chemical-Variant in question or as “Not Correlated” if the context of the sentence implied no association, unclear associations, conflicting results, or indirect associations. Therefore, the Variant-Chemical pair’s class (Correlated or Not Correlated) determines the class of the sentence that contains it. In the case of n-pair sentences, each possible pair of Variant-Chemical for a sentence was individually assessed, and the results were aggregated to create three classes. When all the pairs mentioned in a sentence were classified as “Correlated,” that sentence received the same classification (“Correlated”). The same logic applies for n-pair sentences where all the pairs were classified as “Not Correlated” (those sentences were also classified as “Not Correlated”). Finally, n-pair sentences that contained pairs classified as “Correlated” as well as pairs classified as “Not Correlated” were put under a new class (named “Both”). The training set for 1-pair sentences consists of 1,039 sentences (and a corresponding number of Variant-Chemical pairs), while the n-pair sentences training set consists of 600 distinct sentences, containing 1,880 Variant-Chemical pairs.
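The aggregation rule for n-pair sentences can be written down directly. The following Python sketch (illustrative, not the authors' R code; the function name is hypothetical) takes the curated labels of all pairs in a sentence and returns the sentence-level class.

```python
VALID_LABELS = {"Correlated", "Not Correlated"}


def sentence_class(pair_labels):
    """Aggregate per-pair labels into 'Correlated', 'Not Correlated', or 'Both'."""
    labels = set(pair_labels)
    if not labels:
        raise ValueError("a sentence needs at least one Variant-Chemical pair")
    if not labels <= VALID_LABELS:
        raise ValueError(f"unexpected labels: {sorted(labels - VALID_LABELS)}")
    if len(labels) == 1:
        # All pairs agree, so the sentence inherits their class.
        return next(iter(labels))
    # A mix of Correlated and Not Correlated pairs.
    return "Both"


if __name__ == "__main__":
    print(sentence_class(["Correlated", "Correlated"]))      # → Correlated
    print(sentence_class(["Correlated", "Not Correlated"]))  # → Both
```

For a 1-pair sentence the same function reduces to returning the single pair's label, which matches the statement that the pair's class determines the class of the sentence.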
Biomedical texts might contain a wide variety of words, numbers, URLs and links, names of genes, chemicals, diseases, species, and so on, as well as abbreviations or the names of the authors of different articles under discussion, all of which might add noise rather than aid the classification task. For this reason, the derived sentences were preprocessed before being used to train an algorithm, as well as before a trained model was applied to them. As a first step, and since the specific variants and chemicals found in a sentence were not of particular importance, the Variants found in a sentence of interest were replaced, using string matching, by the term “ClassVariant,” while Chemicals were replaced by the term “ClassChemical.” As a next step, all the words were converted to lowercase, while URLs, numbers, and punctuation were removed from the sentence. Additionally, words with three characters or fewer (with the exception of “no,” “not,” “nor,” “ae,” “aes,” and “adr”) were removed. Finally, a set of custom stopwords was removed in order to reduce the number of unique evaluated words. This set of stopwords consists of PubMed’s stopwords, after minor modifications (
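The preprocessing steps listed above can be sketched as a single function. The Python below is an illustration under assumptions, not the authors' R implementation: the stopword set is a tiny stand-in for the modified PubMed list, the URL pattern is a simple heuristic, and the order of the cleaning operations is one reasonable choice that the text does not fully specify.

```python
import re
import string

# Short words that are kept despite the length filter (from the text).
KEEP_SHORT = {"no", "not", "nor", "ae", "aes", "adr"}

# Tiny illustrative stopword set; the study used a modified PubMed list.
STOPWORDS = {"with", "were", "that", "this", "from"}

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(sentence, variants, chemicals):
    """Return the cleaned tokens of a sentence (hypothetical helper)."""
    text = sentence
    # 1. Mask entity mentions, longest first to avoid partial replacements.
    for mention in sorted(variants, key=len, reverse=True):
        text = text.replace(mention, " ClassVariant ")
    for mention in sorted(chemicals, key=len, reverse=True):
        text = text.replace(mention, " ClassChemical ")
    # 2. Drop URLs before punctuation removal would break them apart.
    text = URL_RE.sub(" ", text)
    # 3. Lowercase, then remove numbers and punctuation.
    text = text.lower()
    text = re.sub(r"\d+", " ", text)
    text = text.translate(PUNCT_TABLE)
    # 4. Remove short words (with exceptions) and stopwords.
    return [
        tok
        for tok in text.split()
        if (len(tok) > 3 or tok in KEEP_SHORT) and tok not in STOPWORDS
    ]


if __name__ == "__main__":
    example = (
        "The CYP2D6*4 allele was not associated with tamoxifen response "
        "(see https://example.org, n = 120)."
    )
    print(preprocess(example, ["CYP2D6*4"], ["tamoxifen"]))
    # → ['classvariant', 'allele', 'not', 'associated', 'classchemical', 'response']
```

Masking the entities before lowercasing and number removal matters here: a mention such as CYP2D6*4 would otherwise be broken into fragments by the later cleaning steps.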
Four algorithms were used to create text classification models for the purpose of identifying PGx-related relations (FastText, Linear kernel SVM, XGBoost, and Lasso and Elastic-Net Regularized Generalized Linear Models). FastText is a free, open-source library, written in C++, which performs both supervised (classification) and unsupervised (word representation) tasks on text and supports multiprocessing during training. Although it is a linear classifier (multinomial logistic regression), it has been shown to be efficient and comparable with deep learning classifiers in many tasks. In a nutshell, the initial word embeddings are averaged to create a sentence vector which is then used to train the linear model (
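The "average the word embeddings, then apply a linear model" idea can be illustrated with a toy forward pass. The Python sketch below is a conceptual illustration only and is not the FastText library or its API: the embeddings, weights, and function names are hypothetical, training is omitted, and a plain softmax output is assumed (the study selected negative sampling as the loss).

```python
import math


def sentence_vector(tokens, embeddings):
    """Average the embeddings of the known tokens of a sentence."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no known tokens in sentence")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(tokens, embeddings, weights, bias):
    """Linear layer on top of the averaged sentence vector, then softmax."""
    vec = sentence_vector(tokens, embeddings)
    scores = [
        sum(w_i * x_i for w_i, x_i in zip(w, vec)) + b
        for w, b in zip(weights, bias)
    ]
    return softmax(scores)


if __name__ == "__main__":
    toy_embeddings = {"classvariant": [1.0, 0.0], "associated": [0.0, 1.0]}
    toy_weights = [[2.0, 2.0], [-2.0, -2.0]]  # one row per class
    toy_bias = [0.0, 0.0]
    print(predict_proba(["classvariant", "associated"], toy_embeddings,
                        toy_weights, toy_bias))
```

Because the sentence vector is a simple average, word order is ignored; this is why word n-grams are commonly added as extra features when local order is expected to matter.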
The hyperparameter tuning, training, and computation of the corresponding performance metrics (for all four models) were performed with the caret R library (
The task of classification was approached independently for those two cases, thus leading to the training of eight models: four binary classifiers for 1-pair sentences and four multiclass classifiers for n-pair sentences. In order to assess the performance of those models, 10-fold Cross Validation was performed for the algorithms trained with the 1-pair sentences, while 5-fold Cross Validation was chosen for models trained with the smaller set of n-pair sentences. The optimal parameters are presented in
Presentation of the default and selected hyperparameter values for FastText algorithm.
Hyperparameter | Default value | Values used for both 1-pair and n-pair sentences |
---|---|---|
Size of vector (dim) | 100 | 200 |
Minimal number occurrences of a word (minCount) | 1 | 5 |
Size of the context window (ws) | 5 | 2 |
Learning rate (lr) | 0.1 | 0.1 |
Number of epochs (epoch) | 5 | 50 |
Maximum length of a word ngram (wordNgrams) | 1 | 2 |
Loss function (loss) | Softmax | ns (negative sampling) |
Presentation of the default hyperparameter values and of those selected after grid search for the SVM, XGBoost, and glmnet models.
Model | Hyperparameters | Default values | 1-Pair | n-Pairs |
---|---|---|---|---|
Linear SVM | Cost (C) | 1 | 1 | 1 |
XGBoost | Learning rate (eta) | 0.3 [0, 1] | 0.2 | 0.2 |
XGBoost | Maximum depth of a tree (max_depth) | 6 [0, ∞) | 4 | 6 |
XGBoost | Subsample ratio of the training instances (subsample) | 1 (0, 1] | 0.7 | 0.7 |
XGBoost | Number of boosting iterations (nrounds) | n/a | 50 | 50 |
glmnet | Mixing percentage (alpha) | 1 [0, 1] | 0.1 | 1 (lasso penalty) |
glmnet | Regularization parameter (lambda) | n/a | 0.02888342 | 0.02229455 |
Detailed explanations regarding the computation of performance metrics for binary classification are found in
The derived text classifiers were subsequently applied to the remaining uncurated literature (unseen sentences), and their performance was evaluated based on the gold standard dataset from PharmGKB. Before the application of the models, the “testing” sentences were preprocessed in a similar manner (see
The extracted pharmacogenomics relationships were further verified by using a PharmGKB gold standard dataset, consisting of manually curated pairs of variants (Mutations or Star Alleles) and chemicals, for which the evidence was annotated either as “associated” (these constitute the Positive pairs) or as “nonassociated” or “ambiguous” (Negative pairs). The computation of performance metrics is based on the determination of True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) outcomes and the computation of Accuracy, Sensitivity/Recall, Specificity, Precision/Positive Predictive Value, and Negative Predictive Value, as described in
11,302 published articles, available in PubMed, were extracted based on the custom query, as described in the
Total number of the 1) initially retrieved, 2) annotated by PubTator Central, and 3) filtered papers, based on the “pharmacogenomics-related” PubMed query, as described in the
Stage | Result |
---|---|
Papers resulting from query | 11,302 unique PMIDs (3,165 with PMCID) |
PTC-annotated papers | 5,307 unique PMIDs (2,257 with PMCID) |
PTC annotations | Chemicals: 187,850 (5,580 unique); Genes: 230,159 (8,853 unique); Mutations: 63,855 (13,610 unique); Species: 115,520 (433 unique); Strains: 54 (9 unique) |
Normalized terms | Genes: 5,463 remained; Chemicals: 805 remained; Mutations: 5,467 remained |
Star alleles | 11,201 entries (mistaken as gene entries) |
Sentences of interest | With 1 pair: 3,574; with multiple pairs: 1,987 (distinct sentences) |
PMIDs, PubMed identifiers; PMCIDs, PubMed Central identifiers.
The number of Star Alleles refers to mentions rather than unique Star Alleles, since some of these appear multiple times. This number reflects the number of Gene mentions that were actually Star Alleles.
The terms, as extracted by PTC, contain a significant amount of noise that could negatively affect the performance of the trained classifiers. Therefore, each category of Concepts of interest (Genes, Mutations, and Chemicals) was further evaluated, to keep only the most relevant terms. After performing Star Allele identification and sentence extraction, a subset of those sentences was manually curated to create training instances that would be used in the relation extraction step. In order to reduce noise added by redundant or very rare words, whose classification value is expected to be limited, each sentence was converted into a vector that contains its most representative words.
The best hyperparameters for FastText were determined through testing different values for specific parameters and after following thoroughly the provided guidelines (
The performance of the resulting classifiers was evaluated by computing the averaged classification metrics, after performing 10-fold Cross Validation. With regard to classifiers trained using the 1-pair sentences, we can observe (
Flowchart of the proposed automated text-mining approach and the validation steps for the retrieved literature relationships. PMIDs, PubMed identifiers; PMCIDs, PubMed Central identifiers (attributed to full-text articles only).
Presentation of the performance metrics, as calculated after using 10-fold Cross Validation with the training data, for all four models trained with sentences discussing one pair of Variant-Chemical (1-pair sentences).
Performance metrics, as calculated after using 10-fold Cross Validation with the training data, for all four models trained with sentences discussing multiple Variant-Chemical pairs (n-pair sentences). The resulting metrics are presented by model and by class, since this is a multiclass classification task, while finally, the by-class metrics for each model separately are weighted with the corresponding class prevalence and summed up to calculate the overall performance metrics.
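The prevalence-weighted aggregation mentioned in this caption can be sketched as follows. The Python below is illustrative only; the function name is hypothetical, and the class counts and per-class values in the example are made-up numbers, not results from the study.

```python
def prevalence_weighted(per_class_metric, class_counts):
    """Weight each class's metric by its prevalence and sum the results."""
    total = sum(class_counts.values())
    if total == 0:
        raise ValueError("class_counts must contain at least one instance")
    return sum(
        per_class_metric[label] * count / total
        for label, count in class_counts.items()
    )


if __name__ == "__main__":
    # Hypothetical class sizes and per-class sensitivities.
    counts = {"Correlated": 50, "Not Correlated": 30, "Both": 20}
    sensitivity = {"Correlated": 0.8, "Not Correlated": 0.6, "Both": 0.2}
    print(round(prevalence_weighted(sensitivity, counts), 3))  # → 0.62
```

With this weighting, a class that is rare in the training set contributes little to the overall figure, so a poor score on a small class can be partly hidden in the summary value.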
Owing to the poor performance of the classifiers on the n-pair sentences, only those trained with 1-pair sentences were compared with the PharmGKB gold standard. Since those sentences focus on only one Variant-Chemical pair, the Class attributed to a sentence by a classifier is also the one attributed to the candidate pair. However, one pair might appear in different sentences in which the Class might differ (e.g., as the result of conflicting findings in different studies). Consequently, pairs that were classified both as “Correlated” (Positive) and as “Not Correlated” (Negative) in different sentences were considered to belong only to the “Not Correlated” category. Furthermore, since the pairs appearing in PharmGKB were extracted after manual curation of the published literature, we filtered the gold standard set to keep only Variant-Chemical pairs derived from the same articles (based on PMID) as the ones comprising the set of sentences to which the classifiers were applied. More precisely, True Positive pairs are those classified as “Correlated” and found in the Positive pairs of the gold standard; True Negative pairs are those classified as “Not Correlated” and found in the Negative pairs of the gold standard; False Positive pairs are those classified as “Correlated” and found in the Negative pairs of the gold standard; and False Negative pairs are those classified as “Not Correlated” and found in the Positive pairs of the gold standard. Initially, the gold standard consisted of 10,121 curated Variant-Chemical pairs, which after filtering by PMID were reduced to 1,578 (1,337 of which belong to the “Correlated” class and 241 to the “Not Correlated” class) (
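The conflict rule and the TP/TN/FP/FN definitions given above can be expressed compactly. The Python sketch below is illustrative rather than the authors' R code; the function names, the data layout (sentence-level predictions as `(pair, label)` tuples and a gold standard dictionary with "Positive"/"Negative" values), and the example pairs are assumptions.

```python
def resolve_pair_classes(sentence_predictions):
    """Collapse sentence-level labels to one label per pair.

    A pair is 'Correlated' only if every sentence mentioning it was
    classified as 'Correlated'; any conflict yields 'Not Correlated'.
    """
    resolved = {}
    for pair, label in sentence_predictions:
        if label == "Not Correlated" or resolved.get(pair) == "Not Correlated":
            resolved[pair] = "Not Correlated"
        else:
            resolved[pair] = "Correlated"
    return resolved


def confusion_counts(resolved, gold):
    """Count TP, TN, FP, FN over the pairs present in the gold standard."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for pair, predicted in resolved.items():
        if pair not in gold:
            continue  # pair cannot be evaluated
        positive_gold = gold[pair] == "Positive"
        if predicted == "Correlated":
            counts["TP" if positive_gold else "FP"] += 1
        else:
            counts["FN" if positive_gold else "TN"] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical predictions and gold standard entries.
    predictions = [
        (("rs0001", "drugA"), "Correlated"),
        (("rs0001", "drugA"), "Not Correlated"),  # conflict
        (("GENE*2", "drugB"), "Correlated"),
    ]
    gold = {("rs0001", "drugA"): "Positive", ("GENE*2", "drugB"): "Positive"}
    print(confusion_counts(resolve_pair_classes(predictions), gold))
```

In the example, the conflicting pair is resolved to “Not Correlated” and therefore counted as a False Negative against a Positive gold standard entry, which shows how the simplifying rule can lower the measured sensitivity.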
Results stemming from the comparison of the classification results of the four models trained with 1-pair sentences compared with a gold standard dataset, extracted from PharmGKB.
Metric | XGBoost | Linear SVM | glmnet | FastText |
---|---|---|---|---|
Filtered unseen sentences | ||||
Accuracy | 0.577 | 0.526 | 0.538 | 0.526 |
Sensitivity/recall | 0.512 | 0.465 | 0.488 | 0.349 |
Specificity | 0.657 | 0.600 | 0.600 | 0.743 |
Precision/positive predictive value | 0.647 | 0.588 | 0.600 | 0.625 |
Negative predictive value | 0.523 | 0.477 | 0.488 | 0.481 |
Original unseen sentences | ||||
Accuracy | 0.538 | 0.529 | 0.577 | 0.577 |
Sensitivity/recall | 0.415 | 0.358 | 0.434 | 0.264 |
Specificity | 0.666 | 0.706 | 0.725 | 0.902 |
Precision/positive predictive value | 0.564 | 0.559 | 0.622 | 0.737 |
Negative predictive value | 0.523 | 0.514 | 0.552 | 0.541 |
TP, TN, FP, and FN were calculated by comparing the resulting classification of the unseen pairs with the pairs present in the Gold Standard and the corresponding metrics were calculated as described in
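For reference, the five reported metrics follow from the four outcome counts through the standard definitions, sketched below in Python (the study used R). The article does not list the raw counts; the counts in the example are an assumption, chosen so that they sum to the 78 gold standard pairs available after filtering and reproduce the XGBoost values reported for the filtered unseen sentences.

```python
def _ratio(numerator, denominator):
    """Safe division that returns NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics from outcome counts."""
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": _ratio(tp, tp + fn),   # recall
        "specificity": _ratio(tn, tn + fp),
        "precision": _ratio(tp, tp + fp),     # positive predictive value
        "npv": _ratio(tn, tn + fn),           # negative predictive value
    }


if __name__ == "__main__":
    # Assumed counts (not reported in the article), 78 pairs in total.
    metrics = classification_metrics(tp=22, tn=23, fp=12, fn=21)
    for name, value in metrics.items():
        print(f"{name}: {value:.3f}")
    # accuracy: 0.577, sensitivity: 0.512, specificity: 0.657,
    # precision: 0.647, npv: 0.523
```

With so few evaluated pairs, a change of a single pair shifts each metric by one to three percentage points, which is worth keeping in mind when comparing the four models.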
Personalized and translational medicine aims toward the discovery and integration of basic biological concepts into the clinical routine. The ever-increasing knowledge about the impact of genomic variation on drug response has given rise to research fields such as pharmacogenomics. Recent computational advances, including the creation of algorithms that retrieve literature information about the association of genes or genomic variants with drug response or adverse drug effects, are expected to progress alongside genome-guided medicine (
Another interesting work is that of Pharmspresso (
Therefore, the present approach can be exploited to generate PGx relationships published for administered medications among different disease phenotypes. Although further work is essential in order to be able to capture an increased number of PharmGKB biomarkers or biomarkers for which FDA guidelines exist, our text-mining approach can be applied to capture a variety of clinically relevant PGx relationships, for which PharmGKB and FDA guidelines already exist.
Our text-mining approach is not without limitations, however. To begin with, the collection of a complete and concise set of articles, free of irrelevant ones, relies heavily on the formulation of the query performed on PubMed. In addition, the majority of the extracted papers are only available as abstracts, thus reducing the available text that can be evaluated and leading to loss of associations. As highlighted elsewhere, being able to analyze the entire text of an article can add valuable information (
Finally, comparing with a PharmGKB gold standard set of Variants and Chemicals might not be a suitable option in this case. The status of such a pair in PharmGKB is determined after the manual curation and integration of results from a number of articles, some of which might conflict with others regarding a given Variant-Chemical pair. In contrast, this approach made the simplifying assumption that any pairs with conflicting classifications are “Not Correlated,” an assumption that could be mistaken. Furthermore, the number of gold standard Variant-Chemical pairs for the same articles (based on PMID) as the ones constituting the unseen sentences is substantially smaller than the number of tested pairs (gold standard: 104 pairs without filtering of the unseen sentences, or 78 after filtering; tested pairs: 795 without filtering, or 673 after filtering).
Regardless of the current limitations, the present study describes an automated text-mining system which extracts database-level annotations from PubMed abstracts and full texts. Such approaches will lead to the identification of clinically meaningful relationships in the era of big data analytics. Although manual curation of the relationships may still be needed to a certain extent, text-mining approaches can be particularly useful in the delineation and curation of clinically meaningful relationships, such as PGx associations.
The data analyzed in this study are subject to the following licenses/restrictions: Data are available upon request from the authors. Requests to access these datasets should be directed to
M-TP, MK, and GP conceived the study; M-TP and MK performed the experiments and wrote the scripts; M-TP, MK, and PS validated the analysis; and GP provided funding. All authors wrote and approved the manuscript.
GPP is Full Member and National Representative at the European Medicines Agency, Committee for Human Medicinal Products (CHMP)—Pharmacogenomics Working Party in Amsterdam, the Netherlands.
The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
We wish to thank the team of PubTator Central for kindly providing us with a list of updated MeSH ID terms.
The Supplementary Material for this article can be found online at: