Artificial Intelligence in Pharmacoepidemiology: A Systematic Review. Part 1—Overview of Knowledge Discovery Techniques in Artificial Intelligence

Sessa, Maurizio; Khan, Abdul Rauf; Liang, David; Andersen, Morten; Kulahci, Murat

doi:10.3389/fphar.2020.01028

SYSTEMATIC REVIEW article

Front. Pharmacol., 16 July 2020

Sec. Drugs Outcomes Research and Policies

Volume 11 - 2020 | https://doi.org/10.3389/fphar.2020.01028

Maurizio Sessa¹*†‡

Abdul Rauf Khan¹,²‡

David Liang¹

Morten Andersen¹§

Murat Kulahci²,³§

¹Department of Drug Design and Pharmacology, University of Copenhagen, Copenhagen, Denmark
²Department of Applied Mathematics and Computer Science, Technical University of Denmark, Lyngby, Denmark
³Department of Business Administration, Technology and Social Sciences, Luleå University of Technology, Luleå, Sweden

Aim: To perform a systematic review on the application of artificial intelligence (AI) based knowledge discovery techniques in pharmacoepidemiology.

Study Eligibility Criteria: Clinical trials, meta-analyses, narrative/systematic review, and observational studies using (or mentioning articles using) artificial intelligence techniques were eligible. Articles without a full text available in the English language were excluded.

Data Sources: Articles recorded from 1950/01/01 to 2019/05/06 in Ovid MEDLINE were screened.

Participants: Studies including humans (real or simulated) exposed to a drug.

Results: In total, 72 original articles and 5 reviews were identified via Ovid MEDLINE. Twenty different knowledge discovery methods were identified, mainly from the area of machine learning (66/72; 91.7%). Classification/regression (44/72; 61.1%), classification/regression + model optimization (13/72; 18.0%), and classification/regression + feature selection (12/72; 16.7%) were the three most frequent tasks that machine learning methods were applied to solve in the reviewed literature. The three most used techniques were artificial neural networks, random forest, and support vector machine models.

Conclusions: The use of artificial intelligence based knowledge discovery techniques has increased exponentially over the years, covering numerous sub-topics of pharmacoepidemiology.

Systematic Review Registration: Systematic review registration number in PROSPERO: CRD42019136552.

Introduction

By definition, artificial intelligence is “the theory and development of computer systems able to perform tasks normally requiring human intelligence” (Oxford, 2019). The earliest work in the field is attributed to the British logician Alan Turing in the second quarter of the 20th century. In 1935, Turing proposed the basic concept of an intelligent machine, commonly known as the universal Turing machine. He further elaborated his vision in 1947 by describing computer intelligence as “a machine that can learn from experience” (Turing, 1937). Just as human intelligence is a combination of diverse abilities (i.e., learning, reasoning, problem solving, perception, and using language), artificial (or machine) intelligence is a composite of methods and techniques from different disciplines of science and engineering that aim to reproduce these abilities in machines (Figure 1). It is worth noting that artificial intelligence is commonly confused with machine learning. Learning (machine/deep learning) is a subfield of artificial intelligence that deals with methods and techniques for giving machines the ability to learn. One reason why machine (or deep) learning has emerged as a dominant subfield of artificial intelligence is the considerable advancement in computer technologies, together with impressive achievements in learning algorithms. By definition, machine learning is a multidisciplinary field that draws on methods and techniques from mathematics, statistics, and computer science to learn from experience (historical data) with respect to some task (i.e., the nature of the problem), to measure performance (performance metric), and to improve it (reinforcement) (Michie et al., 1994). Today, machine learning algorithms based on the principle of reinforcement learning not only enhance the learning abilities of the machine but also complement the other aspects of intelligence, such as appropriate reasoning, efficient problem solving, and factual perception.
Traditionally, experimental design, observational data analysis (statistical data analysis), and computer science have been integral constituents of research in the biomedical sciences. In the past decade, however, the rapid rise of machine learning based knowledge discovery methods in artificial intelligence has markedly accelerated this trend. For numerous medical fields, the contribution of knowledge discovery techniques in artificial intelligence has been described extensively. However, the extent to which they have permeated pharmacoepidemiology is unknown. According to the International Society for Pharmacoepidemiology, this discipline may be defined as “the study of the utilization and effects of drugs in large numbers of people.” Considering this gap in knowledge, the objective of this systematic review is to provide an overview of the use of knowledge discovery techniques of artificial intelligence in pharmacoepidemiology.

Figure 1 Artificial intelligence abilities.

Methods

An independent author (MS) registered the protocol of the systematic review in the PROSPERO International Prospective Register of Systematic Reviews database (identifier CRD42019136552).

Eligibility Criteria for Considering Studies in This Review

We evaluated observational studies, meta-analyses, and clinical trials that used artificial intelligence techniques and for which the exposure or the outcome of the study was a drug. Drugs include any substance approved on the pharmaceutical market and having an Anatomical Therapeutic Chemical classification code as proposed by the World Health Organization (WHO). Only studies for which the full text was available in the English language were considered eligible. Abstracts sent to international or national conferences, letters to the editor, and case reports/series were considered ineligible, along with articles evaluating natural language processing techniques. Reviews describing the use of natural language processing techniques are available elsewhere (Dreisbach et al., 2019). The reference lists of the narrative and systematic reviews retrieved with our MEDLINE query were further screened for undetected records.

Outcome

The main outcome was the frequency of studies published per year from January 1950 to May 2019, a narrative overview of their findings, and a lay description of knowledge discovery methods of artificial intelligence that were used. Secondary outcomes included the evaluation of 1) the medical field in which the aforementioned techniques were used and 2) the number and the type of artificial intelligence techniques that were used. Additionally, we assessed the frequency distribution of articles by 3) the study design; 4) type of data sources (e.g. primary/secondary or simulated); 5) the specific data source; 6) the purpose for using artificial intelligence based knowledge discovery techniques, and 7) the level of evidence provided by the study.

The purpose of using artificial intelligence based knowledge discovery techniques (outcome no. 6) was categorized as follows: 1) To predict clinical response following a pharmacological treatment; 2) To predict the needed dosage given the patient’s characteristics; 3) To predict the occurrence/severity of adverse drug reactions; 4) To predict diagnosis leading to a drug prescription; 5) To predict drug consumption; 6) To predict the propensity score; 7) To predict drug-induced lengths of stay in hospital; 8) To predict adherence to pharmacological treatments; 9) To optimize treatment regimen; 10) To identify subpopulations more at risk of drug inefficacy; and 11) To predict drug-drug interactions.

Search Methods for the Identification of Studies

Ovid MEDLINE (from January 1950 to May 2019) was searched along with the references listed in the reviews identified with our research query (Supplementary Table 1). The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) checklist is provided in Supplementary Table 2.

Selection of Studies

In the first screening procedure, titles and abstracts of retrieved records were screened by two independent researchers (MS and DL) for obvious exclusions. All articles that were considered eligible at the first screening procedure underwent a full-text evaluation. If disagreements arose during the two-step evaluation process, they were resolved by consensus.

Data Extraction and Management

A data extraction form was developed for this systematic review and it is shown in Supplementary Table 3. The scale proposed by Merlin et al. (2009) was used to establish the level of evidence of each study.

Results

In total, 6,470 and 240 records were identified in Ovid MEDLINE and in the reference list of reviews retrieved with the search query, respectively. After title/abstract screening, 6,633 records were eliminated because of ineligibility and 77 articles (72 original articles and 5 reviews) underwent a full-text evaluation. The 77 articles were considered eligible to be included in this systematic review. The PRISMA flowchart of the selection process is shown in Figure 2 and the PRISMA checklist has been provided in Supplementary Table 2.

Figure 2 Study flow diagram.

We observed increased use of artificial intelligence based knowledge discovery techniques in pharmacoepidemiology over the years as seen in Figure 3. In all, 17 medical fields were identified. The top four most prevalent medical fields were pure pharmacoepidemiology (16/72; 22.2%), oncology (15/72; 20.8%), infective medicine (8/72; 11.1%), and neurology (6/72; 8.3%) (Supplementary Table 4).

Figure 3 The trend of pharmacoepidemiological studies using artificial intelligence by years. DL, deep learning; ML, machine learning.

Fifty-five out of 72 articles (76.4%) used artificial intelligence techniques in the setting of a cohort study (Supplementary Figure 1). Most of the studies provided a medium-low level of evidence, namely III-3 (4/72; 5.6%), III-2 (49/72; 68.1%), or III-1 (16/72; 22.2%), while a few articles provided a level of evidence of II (3/72; 4.1%).

In the 72 selected articles, the data sources included electronic health records (36.1%), ad-hoc databases from clinical studies (31.9%), administrative databases (29.2%), surveys (1.4%), and simulated data (1.4%). The data sources were mainly secondary (59.8%) and primary sources (31.8%). In only two articles (2.8%) did researchers use both secondary sources and simulated data. Analogously, in only two articles (2.8%) did researchers use simulated data. The specific data sources used in the selected articles are provided in Supplementary Table 5.

Main Applications of Knowledge Discovery Techniques in Pharmacoepidemiology

A narrative overview of the articles is provided in Table 1. A lay description of the knowledge discovery techniques used in the retrieved articles is provided in the section “Lay Description of the Knowledge Discovery Techniques of Artificial Intelligence Used in Pharmacoepidemiology.”

Table 1 Main applications of knowledge discovery methods of artificial intelligence (AI) in pharmacoepidemiology.

The main applications of artificial intelligence based knowledge discovery techniques in pharmacoepidemiology were classification/regression (44/72; 61.1%), classification/regression + model optimization (13/72; 18.0%), classification/regression + feature selection (12/72; 16.7%), classification/regression + feature interaction (1/72; 1.4%), and classification/regression + feature selection + model optimization (2/72; 2.8%).

Classification and regression are two different types of predictive modeling: in the former the prediction is a label (class), whilst in the latter it is a quantity. For example, in classification, a patient can be classified as belonging to one of two classes, “having the disease” and “not having the disease,” given a set of information from his/her medical history. In regression, instead, the researcher may try to predict the cholesterol level of a patient based on the patient’s weight. Feature (variable) selection is a type of modeling in which the researcher constructs and trains statistical models by selecting relevant features to reduce overfitting and training time, and to improve accuracy. The main reason for feature selection is to improve model performance, which may be negatively impacted by the inclusion of partially relevant or irrelevant features, as this leads to overfitting. Conversely, incorrectly excluding variables may lead to bias in the model prediction (Heinze et al., 2018). Feature interaction, instead, is said to be relevant when the impact of a feature changes based on the levels of the other features, hence rendering an additive model unsatisfactory. For a model with the lowest order interaction, the prediction is calculated based on a constant, a value for the first feature, a value for the second feature, and finally, a value for the interaction of the two features (Molnar, 2018).
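
The lowest order interaction model described above can be sketched in a few lines. This is only an illustration: the function name and the coefficient values are hypothetical and are not taken from any of the reviewed studies.

```python
def predict_with_interaction(x1, x2, b0, b1, b2, b12):
    """Prediction = constant + term for x1 + term for x2 + interaction term."""
    return b0 + b1 * x1 + b2 * x2 + b12 * x1 * x2


# Hypothetical coefficients, chosen only for illustration.
coef = dict(b0=1.0, b1=2.0, b2=0.5, b12=3.0)

# Effect of raising x1 from 0 to 1 when x2 = 0 and when x2 = 1.
effect_x1_at_x2_0 = predict_with_interaction(1, 0, **coef) - predict_with_interaction(0, 0, **coef)
effect_x1_at_x2_1 = predict_with_interaction(1, 1, **coef) - predict_with_interaction(0, 1, **coef)
```

With a non-zero interaction coefficient, the effect of the first feature differs between the two levels of the second feature, which is why a purely additive model would be unsatisfactory in this situation.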

In the retrieved articles, twenty different knowledge discovery techniques were used. Multiple techniques were often used in the same article, leading to a total of 122 applications. Random forest (30/122; 24.6%), artificial neural network (22/122; 18.0%), and support vector machine (19/122; 15.6%) models were the three most used techniques (Table 1, Supplementary Figure 2). The top five purposes of using artificial intelligence techniques were to predict: 1) the clinical response following a pharmacological treatment (42.7%); 2) the occurrence/severity of adverse drug reactions (19.4%); 3) the needed dosage given the patient’s characteristics (14.5%); 4) drug consumption (9.7%); and 5) the propensity score (4.8%) (Table 1).

Lay Description of the Knowledge Discovery Techniques of Artificial Intelligence Used in Pharmacoepidemiology

Artificial Neural Network

An artificial neural network is a machine learning technique that tries to mimic the way neurons process signals and is applicable to complex knowledge extraction tasks. In artificial neural networks, the input signals are the feature variables (e.g., covariates), each of which receives a different weight according to its importance in the knowledge extraction task (e.g., having or not having an adverse event). In its simplest form, as in the case of a single-layer network, features represent the input nodes of the artificial neural network, and all the input nodes are arranged in one layer (e.g., skip-layer units), while the outcome represents the output node (Zhang, 2016a). Artificial neural networks can be split into two broad categories based on network topology: feedforward and feedback artificial neural networks. The choice and applicability of a network topology depend on the nature of the problem. The convolutional neural network, based on the feedforward principle, is well suited to problems related to image analysis, whereas problems such as speech recognition are better suited to recurrent neural networks, which are based on the feedback network topology. For this reason, the convolutional neural network has been used widely for computer vision tasks such as the automatic identification of patterns in medical images (Yamashita et al., 2018). Among studies selected in this systematic review, the artificial neural network was primarily used for Auto Contractive Maps (ACM). The ACM differs from other artificial neural networks because it is able to learn from data without randomizing the weight of each variable. In this technique, the weight of each variable is calculated based on its convergence criterion, when all the output nodes become null. In particular, the model uses a data-driven mechanism to set up weights based on the Euclidean space given the topological properties of each variable.
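
As a minimal sketch of the single-layer case, the output node can be computed as a weighted sum of the input features passed through an activation function. The sigmoid activation, the function names, and the weights are assumptions made for illustration; they are not specified in the reviewed articles, and in practice the weights would be learned from data.

```python
import math


def sigmoid(z):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def single_layer_forward(features, weights, bias):
    """Weighted sum of the input nodes followed by a sigmoid output node."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical example: two covariates with weights that were not learned.
risk_low = single_layer_forward([0.0, 0.0], weights=[1.0, 1.0], bias=0.0)
risk_high = single_layer_forward([2.0, 1.0], weights=[1.0, 1.0], bias=0.0)
```

The output can be read as a score between 0 and 1 (for instance, for having an adverse event); features with larger weights move that score more strongly.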

Bayesian Additive Regression Trees (BART)

BART is a technique that combines several Bayesian regression trees. It starts by building an individual regression tree for each variable; the trees are subsequently summed. By definition, the BART model is flexible and able to evaluate non-linear effects and multi-way interactions automatically. At each node of a regression tree, the levels of the variable are separated into two sub-groups based on their predictive power for the outcome. By definition, Bayesian additive regression trees are able to capture additive effects among variables (Hernandez et al., 2018).

Bayesian Network

A Bayesian network is a special machine learning technique used in causal inference. Causal inference determines the probability of an outcome using evidence from prior observations. The model uses prior knowledge from a causal diagram (directed acyclic graph), which describes the underlying joint probability distribution among variables with conditional dependencies (Sesen et al., 2013). The model incorporates prior knowledge about the topic and then learns from the data how the variables interact with each other in the network.
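
To make the link between the graph and the joint probability distribution concrete, the sketch below encodes a hypothetical three-node network (age → drug, age → adverse drug reaction, drug → adverse drug reaction) and multiplies the conditional probability of each node given its parents. All variable names and probabilities are invented for illustration and, in a real application, would be learned from data.

```python
# Hypothetical conditional probability tables for the graph
# age -> drug, age -> adr, drug -> adr.
p_age = {"old": 0.3, "young": 0.7}
p_drug_given_age = {
    "old": {True: 0.6, False: 0.4},
    "young": {True: 0.2, False: 0.8},
}
p_adr_given_age_drug = {
    ("old", True): 0.20, ("old", False): 0.05,
    ("young", True): 0.10, ("young", False): 0.01,
}


def joint(age, drug, adr):
    """Joint probability as the product of each node given its parents."""
    p_adr = p_adr_given_age_drug[(age, drug)]
    return p_age[age] * p_drug_given_age[age][drug] * (p_adr if adr else 1.0 - p_adr)


def prob_adr_given_drug(drug):
    """P(adr | drug), obtained by summing the joint distribution over age."""
    numerator = sum(joint(age, drug, True) for age in p_age)
    denominator = sum(joint(age, drug, adr) for age in p_age for adr in (True, False))
    return numerator / denominator


total = sum(joint(a, d, r) for a in p_age for d in (True, False) for r in (True, False))
```

Note that `prob_adr_given_drug` is a conditional probability computed from the joint distribution; interpreting it causally would additionally require the graph to be a correct causal diagram.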

Ridge, ElasticNET, and LASSO

In the case of high-dimensional datasets, where the number of variables is larger than the number of observations, the least squares method (linear model) cannot be used. In such a scenario, the commonly used approach is to reduce dimensionality through regularization, and penalized regression can be the preferred choice to perform feature selection. In this case, the coefficients are obtained by minimizing the penalized residual sum of squares, where the penalty is imposed on the regression coefficients and its strength is used as a tuning parameter. If the penalty is imposed on the sum of the squared coefficients, the penalized regression is called Ridge regression. If the penalty is imposed on the sum of the absolute values of the coefficients, we have the Least Absolute Shrinkage and Selection Operator (LASSO) regression. The Elastic Net imposes the penalty on a combination of both the sum of the squared coefficients and the sum of their absolute values. LASSO shrinks to zero the coefficients of the variables with a poor contribution to the prediction and, therefore, these variables are excluded from the final model. The Elastic Net, instead, shrinks some of the coefficients towards zero but also preserves some of the variables with medium-low predictive power, providing a less aggressive feature selection strategy (Kyung et al., 2010).
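
The three penalties can be written as a single objective function. The sketch below is one possible parameterization, assumed here for illustration: `lam` is the penalty strength (the tuning parameter) and `alpha` is a hypothetical mixing weight, with `alpha = 0` giving Ridge, `alpha = 1` giving LASSO, and values in between giving the Elastic Net. Software packages may scale these terms differently.

```python
def penalized_rss(y, X, beta, lam, alpha):
    """Residual sum of squares plus a penalty on the coefficients.

    alpha = 0 -> Ridge (squared coefficients),
    alpha = 1 -> LASSO (absolute coefficients),
    0 < alpha < 1 -> Elastic Net (a mix of both).
    This parameterization is an assumption made for illustration.
    """
    residuals = [yi - sum(b * xij for b, xij in zip(beta, xi)) for yi, xi in zip(y, X)]
    rss = sum(r * r for r in residuals)
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return rss + lam * (alpha * l1 + (1.0 - alpha) * l2)


# Toy data: two observations, two variables, one candidate coefficient vector.
y = [1.0, 2.0]
X = [[1.0, 0.0], [0.0, 1.0]]
beta = [1.0, -2.0]

ridge_objective = penalized_rss(y, X, beta, lam=1.0, alpha=0.0)
lasso_objective = penalized_rss(y, X, beta, lam=1.0, alpha=1.0)
elastic_net_objective = penalized_rss(y, X, beta, lam=1.0, alpha=0.5)
```

A fitting procedure would search for the `beta` that minimizes this objective; the sketch only evaluates it for a fixed `beta` to show how the penalty terms differ.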

Naïve Bayes Classifier

The naïve Bayes classifier is an artificial intelligence technique used for classification that relies on Bayesian classification (Zhang, 2016c), which is based on the following principles. Given a hypothesis h, a set of data D, and a probability measure P, P(h) is the probability that h is true and represents the prior knowledge on h; P(D) is the probability that the data in D will be observed; P(D|h) is the probability of observing D given that h is true; and P(h|D) is the probability that h is true given the data D, i.e., the posterior probability of h. Bayes’ theorem can be formalized as follows: P(h|D) = P(D|h) P(h)/P(D). The theorem allows the posterior probability of h given D to be calculated from the prior probability of h, the probability of D, and the conditional probability of D given h. Consequently, it is possible to identify the maximum a posteriori (MAP) hypothesis, that is, the most probable hypothesis h given D. The naïve Bayes algorithm classifies new data by assigning the most probable target value, that is, the MAP value, given the sequence of attributes (a₁, a₂, …, a_n) that describe the new data, under the “naïve” assumption that the attributes are conditionally independent given the target value.
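
A minimal sketch of the MAP rule follows. It assumes that the attributes are conditionally independent given the class, so that the score of each hypothesis is its prior multiplied by the likelihood of each attribute value. The class names, attributes, and probabilities are hypothetical; in practice they would be estimated from training data.

```python
def naive_bayes_map(priors, likelihoods, attributes):
    """Return the MAP class and the unnormalized posterior score of each class.

    Score(h) = P(h) * product over i of P(a_i | h). The common factor 1/P(D)
    is omitted because it does not change which class has the highest score.
    """
    scores = {}
    for h, prior in priors.items():
        score = prior
        for i, a in enumerate(attributes):
            score *= likelihoods[h][i][a]
        scores[h] = score
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical probabilities. Attribute 0: elderly (yes/no); attribute 1: polypharmacy (yes/no).
priors = {"adr": 0.2, "no_adr": 0.8}
likelihoods = {
    "adr": [{"yes": 0.7, "no": 0.3}, {"yes": 0.6, "no": 0.4}],
    "no_adr": [{"yes": 0.3, "no": 0.7}, {"yes": 0.2, "no": 0.8}],
}

label_a, scores_a = naive_bayes_map(priors, likelihoods, ("yes", "yes"))
label_b, scores_b = naive_bayes_map(priors, likelihoods, ("no", "no"))
```

Dividing each score by the sum of the scores would give the posterior probabilities P(h|D) themselves.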

Discriminant Analysis

A discriminant analysis is used to group observations based on the similarities of their features. Suppose the observations come from g groups D₁, D₂, …, D_g. The objective of the discriminant analysis is to assign an individual to one of these groups given a set of observations x₁, x₂, …, x_p (where p is the number of variables). For example, we may want to discriminate between patients with or without diabetes mellitus type 2 (g = 2) based on observations of glycaemia, body weight, and age (p = 3) (in this case x₁ = blood glucose concentration, x₂ = body weight, and x₃ = age). Given the specific characteristics of the individuals of a group D_i, we can compute a probability that describes the likelihood of belonging to group i, given the observed variables. Linear discriminant analysis is a classification technique that uses linear combinations of features to categorize observations in groups. The model requires that the data are normally distributed and homoscedastic, that is, that the classes share an identical covariance matrix. Quadratic discriminant analysis, instead, relaxes the last assumption, that is, it does not require that the classes have the same covariance matrix.

Principal Component Analysis

Principal component analysis is a technique that reduces the dimensionality of the quantitative variables in a dataset through linear combinations of these variables, known as the principal components. The principal components are selected so that the first principal component (first linear combination) has the highest variance, the second principal component has the second highest variance but is also uncorrelated with the first principal component, and so on. When the original variables are highly correlated, only a few principal components are retained, as they still explain a large portion of the variation in the data.

Q-Learning

Q-learning is a reinforcement-learning algorithm used to optimize the solution of discrete-time stochastic processes. The technique is “model-free” and “goal-oriented.” At each stage of the process, it provides the optimal set of decisions to maximize a long-term reward. The algorithm is used in pharmacoepidemiology considering that many therapeutic processes are a set of actions that change over time and may be associated with a clinical outcome (i.e., a set of drugs administered over time and the occurrence of an adverse drug reaction) (Song et al., 2015; Krakow et al., 2017).
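
The core of tabular Q-learning is a single update rule, sketched below. The state and action names, the learning rate `alpha`, and the discount factor `gamma` are hypothetical values chosen for illustration; the reviewed studies may have used different formulations.

```python
def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """One Q-learning step.

    Q(s, a) <- Q(s, a) + alpha * (reward + gamma * max_a' Q(s', a') - Q(s, a))
    A next state that is absent from the table is treated as terminal (value 0).
    """
    best_next = max(q[next_state].values()) if q.get(next_state) else 0.0
    td_target = reward + gamma * best_next
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]


# Hypothetical two-stage treatment process with two possible drugs per stage.
q_table = {
    "stage_1": {"drug_a": 0.0, "drug_b": 0.0},
    "stage_2": {"drug_a": 1.0, "drug_b": 2.0},
}

updated = q_update(q_table, "stage_1", "drug_a", reward=1.0, next_state="stage_2")
terminal = q_update(q_table, "stage_2", "drug_a", reward=0.0, next_state="end")
```

The rule is “model-free” in the sense that it only needs observed transitions (state, action, reward, next state) and no explicit model of how the process evolves.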

Support Vector Machine and Sequential Minimal Optimization

Support vector machine (SVM) is a method used for classification. The SVM algorithm has three core components: i) a line or a hyperplane, i.e., the “boundary” that separates the data points; ii) a margin, i.e., the distance separating the boundary from the closest data points of the groups; and iii) support vectors, i.e., the data points lying on the edge of the margin, which determine the position of the hyperplane. In the presence of linearly separable data points, the algorithm finds, among all the straight lines or hyperplanes that separate the different groups, the one that maximizes the margin. In fact, a straight line or a hyperplane with the maximum margin helps to minimize the classification error. In non-linear classification, it is necessary to operate in two separate phases. In the first phase, data points are mapped onto a higher-dimensional space to make them linearly separable. Subsequently, the algorithm searches for a line or a hyperplane that maximizes the size of the margin, given that the instances are now linearly separable. The support vector machine usually uses data transformations to turn a non-linear relationship among variables into a linear one and thereby simplify the delineation of boundaries. These data transformations usually use a kernel function (Noble, 2006). Sequential minimal optimization, instead, is an algorithm used to train the support vector machine (Platt, 1998).
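
For the linear case, the boundary and the margin can be illustrated with a hand-picked hyperplane. The weights `w` and offset `b` below are hypothetical and were not obtained by training; the sketch assumes the usual scaling in which the support vectors satisfy w·x + b = ±1, so that the margin width is 2/‖w‖.

```python
import math


def decision_value(w, b, x):
    """Signed score of a point relative to the hyperplane w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Assign class +1 or -1 according to the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def margin_width(w):
    """Width of the margin when support vectors satisfy w.x + b = +1 or -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical hyperplane; (1, 1) and (2, 2) play the role of support vectors.
w, b = [1.0, 1.0], -3.0
value_negative_sv = decision_value(w, b, [1.0, 1.0])
value_positive_sv = decision_value(w, b, [2.0, 2.0])
```

Training an SVM amounts to choosing `w` and `b` so that this margin is as wide as possible while the groups remain on opposite sides of the boundary.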

Classification and Regression Tree

A classification and regression tree (CART) is a model constructed by recursively partitioning variables based on their predictive power for the study outcome. The model starts by identifying the variable with the strongest predictive power. This variable is included in the model as the root node, that is, the parent node from which all other splitting procedures are performed. In the regression tree, each node represents a variable. The decision tree splits each node into two levels so as to obtain the separation that maximizes the predictive power of the variable. With this model, the user does not need to make any assumptions about the statistical distribution of the data (e.g., normality assumption). The model can handle both categorical and numerical data (Kingsford and Salzberg, 2008). The boosted regression tree retains the important advantages of the tree-based methods described above. However, it overcomes the limitation of relying on a single tree by using boosting (a combination of simple models to improve the overall predictive performance) (Elith et al., 2008).

Decision Table

A decision table is a hierarchical (rule) table used for classification in which attributes of variables are paired. A decision table is composed of columns with the inputs and outputs of a decision and rows denoting rules. This technique allows for the detection of the interrelationship among variables and their attributes (Becker, 1998). Decision tables use the wrapper method that finds the best subset of features or rather it removes features with a poor contribution to the model. In this way, the algorithm reduces the probability of overfitting.

K-Means Cluster

The k-means clustering algorithm uses unlabeled data to generate a fixed number (k) of clusters of data with similar attributes. The centers of the k clusters are called centroids and are calculated by averaging the data allocated to each cluster. The algorithm is composed of two steps: 1) initialization, where the user sets the number of clusters, k; and 2) the application of an algorithm (e.g., Lloyd’s algorithm) by which each data point is assigned to its closest cluster (Bock, 2007). The centroids are then recalculated, and the process iterates until the within-cluster variation of the data points is minimized.
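
The two alternating steps of Lloyd’s algorithm can be sketched as follows. The function names and the toy data are hypothetical, and the initial centroids are supplied by the caller rather than chosen at random, which is an assumption made to keep the example deterministic.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(cluster):
    """Centroid of a non-empty cluster: coordinate-wise average."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def k_means(points, initial_centroids, max_iter=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = list(initial_centroids)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: recompute each centroid (an empty cluster keeps its old centroid).
        new_centroids = [mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, clusters = k_means(points, initial_centroids=[(0.0, 0.0), (10.0, 10.0)])
```

In practice the result can depend on the initial centroids, so the algorithm is often run several times from different starting points.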

K-Nearest Neighbors

K-nearest neighbors is a machine learning technique used for both regression and classification. The algorithm uses a training dataset with labeled data (e.g., having or not having a disease) to classify new data points without labels. The parameter k is the number of neighbors taken into account. The algorithm classifies a new data point by calculating its distance to the data points of the training set, identifying the k closest ones, and assigning the label that is most frequent among them; in regression, the prediction is the average of the values of the k closest points. The technique does not make any assumption about the distribution of data (Zhang, 2016b).
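
A minimal sketch of k-nearest neighbors classification is given below, in the usual formulation where k is the number of closest training points that vote on the label. The Euclidean distance, the function name, and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_classify(train, new_point, k=3):
    """Label a new point by majority vote among its k closest training points.

    `train` is a list of (point, label) pairs; distance is Euclidean.
    """
    neighbours = sorted(train, key=lambda item: math.dist(item[0], new_point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical labeled training data with two features per patient.
train = [
    ((1.0, 1.0), "no disease"), ((1.0, 2.0), "no disease"), ((2.0, 1.0), "no disease"),
    ((8.0, 8.0), "disease"), ((9.0, 8.0), "disease"), ((8.0, 9.0), "disease"),
]

label_near_first_group = knn_classify(train, (2.0, 2.0), k=3)
label_near_second_group = knn_classify(train, (8.5, 8.5), k=3)
```

Because distances drive the result, features measured on very different scales are usually standardized before applying the method.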

Fuzzy C-Means

The fuzzy c-means is an artificial intelligence technique for clustering based on the similarities in the features. The term fuzzy stands for indistinct, confused, and blurred. It is based on the assumption that the world around us is not dichotomous (e.g., black and white) but contains in itself all the infinite nuances that exist between these two extremes. This concept is expressed mathematically by a real number between zero and one that represents the degree of membership (membership function) of the object in question to one or the other group (e.g., how much a gray is white, or how much a gray is black).
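
The degree of membership can be illustrated with the membership formula commonly used in fuzzy c-means, in which the membership of a point in a cluster decreases with its distance from that cluster’s center relative to its distances from all centers. The one-dimensional data, the fixed cluster centers, and the fuzziness parameter `m = 2` are assumptions made for illustration; a full implementation would also update the centers iteratively.

```python
def fuzzy_memberships(point, centers, m=2.0):
    """Degrees of membership of a 1-D point in each cluster (they sum to 1).

    u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1)), where d_i is the distance
    from the point to center i. Centers are assumed to be distinct.
    """
    distances = [abs(point - c) for c in centers]
    if 0.0 in distances:
        # The point coincides with a center: full membership in that cluster.
        return [1.0 if d == 0.0 else 0.0 for d in distances]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in distances) for d_i in distances]


# Two hypothetical cluster centers ("white" at 0.0 and "black" at 4.0).
close_to_first = fuzzy_memberships(1.0, centers=[0.0, 4.0])
halfway = fuzzy_memberships(2.0, centers=[0.0, 4.0])
```

A point halfway between the two centers belongs equally to both groups, whereas a point closer to one center belongs mostly, but not exclusively, to that group.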

Random Forest and Random Survival Forest

Random forest is a machine learning method based on the principle of ensemble learning. The key idea behind the random forest is to improve the performance of the individual tree learners with the help of bootstrap aggregating (or bagging). The technique builds each tree by bootstrapping a random sample from the data. To select the variables to be split in the decision tree, the random forest randomly selects features and uses scores (e.g., the decrease in Gini impurity) as the splitting criterion. Gini impurity is a metric used in decision trees to determine on which variable and at what threshold the data should be split into smaller groups. It measures how often a randomly chosen record from the training data would be misclassified if it were labeled at random according to the distribution of labels in the group. To convey the importance of each variable for classification/regression, the random forest ranks variables using a parameter called the “variable importance measure,” which has however been noted to be biased. Alternative measures are available to overcome this limitation, such as partial dependence plots. These plots provide an overview of how each variable influences the prediction of the study outcome when related to other variables selected by the random forest. Crucial parameters for the random forest are the number of trees generated, the number of variables randomly selected for splitting in each decision tree, and the minimum size of each terminal node (Couronne et al., 2018).
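
The Gini impurity and the decrease obtained from a candidate split can be computed as in the sketch below. The function names and the labels are hypothetical; a tree-building algorithm would evaluate this decrease for many candidate variables and thresholds and keep the best one.

```python
from collections import Counter


def gini_impurity(labels):
    """1 minus the sum of squared class proportions; 0 means a pure group."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its two children."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted


# Hypothetical node with two patients per class.
parent = ["adr", "adr", "no_adr", "no_adr"]
perfect_split = gini_decrease(parent, ["adr", "adr"], ["no_adr", "no_adr"])
useless_split = gini_decrease(parent, ["adr", "no_adr"], ["adr", "no_adr"])
```

A split that separates the classes completely removes all impurity, while a split that leaves the class mix unchanged yields no decrease.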

Kernel Partial Least Squares

Kernel partial least squares is a nonlinear partial least squares (PLS) method. PLS is a dimensionality reduction technique that models independent variables using latent variables (also known as components, as in PCA). The aim is to find a few linear combinations of the original variables that are most correlated with the output. This technique is able to minimize multicollinearity among variables and is useful in the setting of high-dimensional datasets (Rosipal and Trejo, 2001).

Hierarchical Clustering

Hierarchical clustering is a technique that performs a hierarchical decomposition of the data based on group similarities. The model builds a distance matrix containing the distances among data points. In particular, given a set of N observations to be grouped and an N × N distance (or similarity) matrix, which defines the distance of the data points to each other, the basic process of hierarchical grouping is as follows:

1. The algorithm starts associating a cluster to each entity so it will have initially N clusters, each of which contains only one data point and then computes the distance (similarity) among the clusters.

2. Subsequently, it will look for the pair of clusters that are “close” to each other (more similar) and it will combine them in a single cluster. In this way, the number of clusters will be reduced by one unit.

3. It will calculate again the distance (similarity) between the new cluster and each of the old clusters.

4. It will repeat steps 2 and 3 until the entities are grouped in the desired cluster number (Johnson, 1967).
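
The four steps above can be sketched for one-dimensional data as follows. The distance between two clusters is taken here as the smallest distance between any of their members (single linkage); this choice, the function name, and the toy data are assumptions made for illustration, since other linkage rules are equally possible.

```python
def agglomerative_clustering(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    # Step 1: start with one cluster per data point.
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Steps 2 and 3: recompute the cluster distances and find the closest pair.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge the pair, which reduces the number of clusters by one.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    # Step 4: the loop stops at the desired number of clusters.
    return clusters


data = [1.0, 1.5, 5.0, 5.2, 9.0]
three_groups = agglomerative_clustering(data, n_clusters=3)
two_groups = agglomerative_clustering(data, n_clusters=2)
```

Recording the order of the merges and the distances at which they occur would give the information needed to draw a dendrogram.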

Discussion

In the last decade, there has been increased use of knowledge discovery techniques of artificial intelligence in pharmacoepidemiology. This result is in line with that of Koohy (2017), who showed an increased popularity of machine learning methods for biomedical research from 1990 to 2017. We strongly believe that one of the major reasons for the increased interest in applying machine learning techniques over the years is the dramatic growth in size and complexity of clinical and biological data, which has led to the necessity of combining mathematics, statistics, and computer science to extract actionable insight. By using advanced algorithms that are capable of self-learning from the data, machine learning techniques provide support for decision making to the final user (e.g., a researcher) without a pre-specified hypothesis (i.e., “hypothesis-free algorithms”). In this systematic review, we found that random forest, artificial neural network, and support vector machine were the most used techniques in the selected articles. The extensive use of artificial neural networks may be related to their early appearance in the scientific literature. In fact, this technique has existed for over 60 years (Jones et al., 2018). Random forest, instead, has rapidly gained popularity since its introduction in 2001 (Breiman, 2001), becoming a common “standard tool” to predict clinical outcomes, with the advantage of being easily usable by scientists without any strong knowledge in statistics or machine learning (Couronne et al., 2018). Similarly, the support vector machine is considered to be one of the most powerful techniques for the recognition of subtle patterns in complex datasets (Huang et al., 2018). Interestingly, we observed that in the majority of the articles, researchers used more than one knowledge discovery technique, which is a common approach in large data analytics.
In fact, it is usually not possible to know beforehand the best algorithm for a specific classification/regression progress, and data scientist should rely on “past experience from other scientists” or benchmark multiple algorithms in order to determine the one that maximizes the accuracy of the model, an approach also known as “use trial and error” (Brownlee, 2014).
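
As a rough illustration of benchmarking several algorithms on the same data, the sketch below compares candidate classifiers by k-fold cross-validated accuracy and selects the best one. It rests on several assumptions: the data are simulated, and three deliberately simple toy classifiers (majority class, nearest centroid, 1-nearest neighbour) stand in for the random forest, neural network, and support vector machine models discussed here. All function and variable names are hypothetical.

```python
import random
from math import dist
from statistics import mean


def majority_class(train, x):
    """Baseline: always predict the most frequent training label."""
    labels = [y for _, y in train]
    return max(sorted(set(labels)), key=labels.count)


def nearest_centroid(train, x):
    """Predict the label whose mean feature vector is closest to x."""
    centroids = {}
    for label in sorted({y for _, y in train}):
        members = [features for features, y in train if y == label]
        centroids[label] = tuple(mean(col) for col in zip(*members))
    return min(centroids, key=lambda label: dist(centroids[label], x))


def one_nearest_neighbour(train, x):
    """Predict the label of the single closest training example."""
    return min(train, key=lambda example: dist(example[0], x))[1]


def cross_val_accuracy(classifier, data, k=5):
    """Mean accuracy over k folds; each fold is held out once for testing."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i, test_fold in enumerate(folds):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        hits = sum(classifier(train, x) == y for x, y in test_fold)
        scores.append(hits / len(test_fold))
    return mean(scores)


# Simulated data (an assumption): two features and a binary outcome,
# e.g. responder (1) versus non-responder (0).
rng = random.Random(0)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(50)]
data += [((rng.gauss(3, 1), rng.gauss(3, 1)), 1) for _ in range(50)]
rng.shuffle(data)

candidates = {
    "majority class": majority_class,
    "nearest centroid": nearest_centroid,
    "1-nearest neighbour": one_nearest_neighbour,
}
# Score every candidate on the same folds, then keep the most accurate one.
results = {name: cross_val_accuracy(clf, data) for name, clf in candidates.items()}
best = max(results, key=results.get)
for name, score in results.items():
    print(f"{name}: {score:.2f}")
print("selected:", best)
```

Evaluating every candidate on identical folds keeps the comparison fair. Note that the accuracy of the selected model is optimistically biased by the selection itself, so a separate hold-out set would be needed for an unbiased estimate.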

It should be highlighted that secondary data were used in most of the selected articles. This is not surprising considering that electronic healthcare databases and administrative databases have revolutionized pharmacoepidemiology research in the last three decades. These data sources can be used by pharmacoepidemiologists to address clinical questions on drug use, drug effectiveness, and treatment optimization (Hennessy, 2006), with the advantage of being easier and less costly to reuse than primary data, which must be collected anew (Schneeweiss and Avorn, 2005).

As expected, the majority of selected articles provided a medium-low level of evidence according to the Merlin scale (Merlin et al., 2009), a natural consequence of the level of evidence attributed to observational studies (Murad et al., 2016). In fact, the majority of the selected articles used a cohort or a case-control design; therefore, independently of the technique used to predict the study outcome, the level of evidence was classified as medium-low.

In the selected articles, we identified 17 medical fields, of which the most prevalent were pure pharmacoepidemiology (mostly methodological studies in pharmacoepidemiology), oncology, infective medicine, and neurology. Clearly, the high frequency of articles investigating pure pharmacoepidemiology is related to the search query used for selecting the articles. Regarding the other medical fields, our findings are in accordance with the current scientific literature (Jiang et al., 2017). In fact, a recent article showed increased use of artificial intelligence in fields such as oncology, neurology, and cardiology, where diseases are highly prevalent and an early diagnosis may lead to a better prognosis or reduced disease progression.

Finally, it is not surprising that the main purpose of using artificial intelligence techniques in the articles included in this systematic review was the prediction of a clinical response to a treatment (i.e., supervised learning problems). Artificial intelligence and machine learning techniques have brought important methodological advancements in the analysis of “big data.” The utility of these techniques lies in their potential for analysing large and complex data to make predictions that can improve and personalize the management and treatment of a disease, and improve the overall well-being of an individual (Collins and Moons, 2019). A secondary purpose of using artificial intelligence techniques was the prediction of the occurrence/severity of adverse drug reactions. This may be related to the great impact of adverse drug reactions as iatrogenic diseases that often require treatment and represent a cost to the health-care system.

Conclusion

The use of knowledge discovery techniques from artificial intelligence has increased exponentially over the years, covering numerous sub-topics of pharmacoepidemiology. Random forest, artificial neural networks, and support vector machine models were the three most used techniques, applied mainly to secondary data. These techniques have been used mostly to predict the clinical response following a pharmacological treatment, the occurrence/severity of adverse drug reactions, and the dosage needed given the patient’s characteristics.

In the second part of this systematic review, we will summarize the evidence on the performance of artificial intelligence versus traditional pharmacoepidemiological techniques.

Author Contributions

All authors drafted the paper, revised it for important intellectual content, and approved the final version of the manuscript to be published. MS and MA developed the concept and designed the study. MS, DL, MA, MK, and AK analyzed or interpreted the data. MS, DL, MA, MK, and AK wrote the paper.

Funding

Maurizio Sessa, David Liang, and Morten Andersen belong to the Pharmacovigilance Research Center, Department of Drug Design and Pharmacology, University of Copenhagen, supported by a grant from the Novo Nordisk Foundation (NNF15SA0018404).

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fphar.2020.01028/full#supplementary-material

References

Albarakati, N., Abdel-Fatah, T. M. A., Doherty, R., Russell, R., Agarwal, D., Moseley, P., et al. (2015). Targeting BRCA1-BER deficient breast cancer by ATM or DNA-PKcs blockade either alone or in combination with cisplatin for personalized therapy. Mol. Oncol. 9, 204–217. doi: 10.1016/j.molonc.2014.08.001

Alzubiedi, S., Saleh, M. I. (2016). Pharmacogenetic-guided Warfarin Dosing Algorithm in African-Americans. J. Cardiovasc. Pharmacol. 67, 86–92. doi: 10.1097/FJC.0000000000000317

An, S., Malhotra, K., Dilley, C., Han-Burgess, E., Valdez, J. N., Robertson, J., et al. (2018). Predicting drug-resistant epilepsy - A machine learning approach based on administrative claims data. Epilep. Behav. 89, 118–125. doi: 10.1016/j.yebeh.2018.10.013