Social media and internet search data to inform drug utilization: A systematic scoping review

Introduction: Drug utilization is currently assessed through traditional data sources such as large electronic medical record (EMR) databases, surveys, and medication sales. Social media and internet data have been reported to provide more accessible and more timely information on medication utilization.
Objective: This review aims at providing evidence comparing web data on drug utilization to other sources before the COVID-19 pandemic.
Methods: We searched Medline, EMBASE, Web of Science, and Scopus until November 25th, 2019, using a predefined search strategy. Two independent reviewers conducted screening and data extraction.
Results: Of 6,563 (64%) deduplicated publications retrieved, 14 (0.2%) were included. All studies showed positive associations between drug utilization information from web and comparison data using very different methods. A total of nine (64%) studies found positive linear correlations in drug utilization between web and comparison data. Five studies reported associations using other methods: one study reported similar drug popularity rankings using both data sources; two studies developed prediction models for future drug consumption, including both web and comparison data; and two studies conducted ecological analyses but did not quantitatively compare data sources. According to the STROBE, RECORD, and RECORD-PE checklists, overall reporting quality was mediocre. Many items were left blank as they were out of scope for the type of study investigated.
Conclusion: Our results demonstrate the potential of web data for assessing drug utilization, although the field is still in a nascent period of investigation. Ultimately, social media and internet search data could be used to get a quick preliminary quantification of drug use in real time. Additional studies on the topic should use more standardized methodologies on different sets of drugs in order to confirm these findings. In addition, currently available checklists for study quality of reporting would need to be adapted to these new sources of scientific information.


Introduction
Drug utilization research has been defined as "an eclectic collection of descriptive and analytical methods for the quantification, the understanding and the evaluation of the processes of prescribing, dispensing and consumption of medicines, and for the testing of interventions to enhance the quality of these processes." (1). Accurate and timely estimates of pharmaceutical drug utilization patterns are considered critical for assessing drug safety, effectiveness, access to drugs, and patients' care (2,3). Higher than expected use of some medications in a specific country (e.g., opioids in the United States) should be flagged rapidly as it could point to potential drug abuse. Timely assessment of drug utilization could also be used to investigate the effectiveness and safety of medications for a new disease (4). Conversely, when detected early, suboptimal use of essential medicines or vaccines could trigger health policymaking to prevent the resurgence of preventable morbidity.
Traditional ways to retrieve data on the use of drugs based on surveys, prescription rates, and drug sales tend to be slow, expensive, difficult to obtain, limited in geographic scope, and may not accurately capture a representative sample of the population. Currently, accessing the appropriate databases and analyzing drug utilization can take up to a year (sometimes even more). These limitations in retrieving drug utilization data can affect the health of populations.
In the last decade, web data such as social media and internet search data have been shown to be useful for infectious disease surveillance. In 2009, a study based on Google Flu Trends showed that worldwide influenza virus activity could be monitored using the Google search engine (5). It was found that the frequency of influenza-associated search terms highly correlated with the number of physician visits for influenza-like symptoms (5). Similar approaches have also been used in pharmacovigilance-focused studies, which deal with detecting, comprehending, and preventing adverse drug events (6,7). Similarly, the potential of using social media data to detect adverse drug reactions (8) as well as its use for infectious disease surveillance (9-11) have been recognized in the literature, and an increasing number of studies utilize web data to assess drug utilization (12-14).
Therefore, studies on web data could provide evidence of a complementary way to access information on drug utilization compared to traditional methods. We conducted a systematic scoping review and aimed to assess the content and quality of existing research using social media and internet search data to study drug utilization volumes compared to other sources of drug utilization information. This review was performed before the start of the COVID-19 pandemic as we believe that the specific media attention on some medications during this period may not reflect the association that could be made between drug web data and drug utilization in more usual circumstances.

Reporting standards
We performed a systematic scoping review and followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) checklist (15) (Supplementary File S5). The review protocol is available in the online Supplementary Material (File S1).

Selection criteria
We included studies if they: (1) were primary research studies that involved web data including social media or search engine data such as Google Trends, Google Correlate, Google Insights for Search, Google search engine, Facebook, Twitter, and Instagram; (2) involved any kind of comparison data such as drug sales or drug prescription volumes acquired from surveys, registry data, physician databases, and others (not all of these data originated from validated sources); and (3) included any kind of drug utilization data such as utilization frequencies of vaccines, vitamins, supplements, nicotine alternatives, prescription drugs, and over-the-counter drugs for both data sources.
Articles were excluded if they: (1) focused on E-cigarettes; (2) involved incidence rates of diseases instead of drug utilization volumes; or (3) involved only web data sources but no other kind of comparison data source.
In addition, we excluded non-English study documents, literature reviews, posters, PowerPoint presentations, articles presented at doctoral colloquia, and articles whose full text was not accessible to the study authors (e.g., conference abstracts). Only peer-reviewed proceedings were included in this review.

Selection process
All identified references were downloaded into EndNote, where duplicates were removed. Two independent reviewers conducted the screening with the free online tool Cadima (15). First, titles and abstracts were screened, followed by screening of the articles' full texts. The reference lists of the included articles were checked for additional studies. Any remaining disagreements about study inclusion or exclusion were resolved by a third investigator.
[…] objective), (2) characteristics of the involved data sources (e.g., web data source), and (3) additional study items (e.g., conflict of interest). The full list can be accessed in the online Supplementary Material (File S4).
Additionally, the reporting quality of the included studies was assessed using the STROBE checklist (16) (Strengthening the Reporting of Observational Studies in Epidemiology) as well as the statement's extensions RECORD (Reporting of studies conducted using observational routinely collected data) (17) and RECORD-PE (Reporting of studies conducted using observational routinely collected data for pharmacoepidemiological research) (18). Items were excluded if they were considered out of scope for the investigated population of research studies. One reviewer subsequently reviewed the adherence of the articles to the checklists' items. The checklist items were marked "yes" if the item was described satisfactorily, "partly" if described partially, and "no" if it was not described at all. If an item was not applicable due to a study's nature or design, the item was marked "n/a".
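The rating scheme above can be turned into a per-item reporting rate. The following is a minimal sketch only: the function name, the example ratings, and the choice to count only "yes" ratings over applicable ("yes", "partly", "no") studies are assumptions made for illustration, since the text does not state how the percentages were computed.

```python
from collections import Counter


def item_reporting_rate(ratings):
    """Share of applicable studies rated "yes" for one checklist item.

    Assumption: "n/a" ratings are dropped from the denominator and only
    "yes" counts as reported. Returns None if no study is applicable.
    """
    counts = Counter(ratings)
    applicable = counts["yes"] + counts["partly"] + counts["no"]
    if applicable == 0:
        return None
    return counts["yes"] / applicable


# Hypothetical ratings of one checklist item across five studies.
example_ratings = ["yes", "partly", "no", "n/a", "yes"]
example_rate = item_reporting_rate(example_ratings)  # 2 of 4 applicable
```

Counting "partly" as half-reported would be an equally defensible convention; the sketch only makes the denominator choice explicit.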
One reviewer additionally reviewed the study authors' perceptions of the challenges of using web data for drug utilization estimation reported in the discussion sections of the papers. The abstracted data items were verified by a second reviewer, and any disagreements were resolved by consensus. The full list can be accessed in the online Supplementary Material (File S3). The extracted data were synthesized narratively. Descriptive statistics (e.g., frequencies and measures of central tendency) were computed using Microsoft Excel.

Risk of bias assessment
Risk of bias assessment was not conducted, which is consistent with the scoping review methods manual by the Joanna Briggs Institute (19).

Study flow
A total of 6,563 deduplicated citations from electronic databases were screened (Figure 1). Of these, 6,427 (98%) papers were excluded during the title- and abstract-screening process, leaving 137 (2%) articles eligible for full-text screening. A total of 123 (90%) full texts were found to be ineligible for study inclusion, the most common reason being wrong study design as they did not include relevant data sources or any comparison with drug utilization data [see exclusion criteria 2, n = 70 (57%)]. Ultimately, 14 (10%) papers were considered eligible for inclusion. A first search was conducted in September 2016, identifying eight eligible articles, and the updated search in November 2019 yielded six additional papers. The full list of included documents can be found in the online Supplementary Material (File S4).
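The full-text screening percentages reported above can be reproduced with simple arithmetic. The sketch below uses only counts stated in the text; the helper name `pct` and the variable names are illustrative.

```python
def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to the nearest integer."""
    return round(100 * part / whole)


full_text_screened = 137   # articles eligible for full-text screening
full_text_excluded = 123   # full texts found ineligible
wrong_design = 70          # most common exclusion reason
included = full_text_screened - full_text_excluded

excluded_share = pct(full_text_excluded, full_text_screened)
included_share = pct(included, full_text_screened)
wrong_design_share = pct(wrong_design, full_text_excluded)
```

Note that the share for the most common exclusion reason is taken relative to the excluded full texts, not to all screened full texts.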

Data source characteristics
Of all reviewed articles, the most employed web data source was Google Trends' search volumes assessed in eight (57%) studies (21-24, 26, 27, 32, 33). Two (14%) studies used Twitter posts (22, 34), and two (14%) other studies utilized search volumes from former Google services similar to Google Trends: specifically, the Google Health Trends API (30) and Google Insights for Search (25). One (7%) study utilized both Google Insights for Search' and Google Trends' search volume (20), and another (7%) study assessed the frequency of website hits where a certain keyword is found using the Google search engine (28).
Regarding the data sources used for comparison with web data, eleven (79%) studies used drug utilization estimates from public/government organizations as the comparator to the web data. Twelve (86%) out of fourteen studies provided the time of data collection for both the web and the comparison data source. In these studies, the web data were gathered for a median duration of 5.3 years (interquartile range of 3.9 to 8.6 years), while the comparative data were collected for a median duration of 5.0 years (interquartile range of 3.7 to 9.6 years). One (7%) study only reported the time of data collection for the comparison data source (21), while in another (7%) study, the time of data collection could not conclusively be identified (28).

Approaches used for comparisons
Nine (64%) of the fourteen studies quantitatively compared web-mined and comparison data using different types of correlation analyses (Pearson, Spearman, and cross-correlation) (20, 21, 23, 25, 26, 29, 31-33). Two studies (14%) quantitatively compared the performance of different prediction models (27, 30) using web and comparison data in terms of root mean squared error and mean absolute error. One study qualitatively compared different popularity ranking lists (28). Furthermore, two (14%) studies did not directly compare drug utilization volumes but reported the results of both data sources as part of an ecological analysis without statistical comparison (22, 24).
Keller et al. 10.3389/fdgth.2023.1074961 Frontiers in Digital Health
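For readers unfamiliar with the measures named in this section, the sketch below shows how Pearson and Spearman correlation and the two error metrics (root mean squared error and mean absolute error) are commonly computed. It is a generic illustration using only the Python standard library, not the code of any included study; all series values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def rmse(actual, predicted):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error between observed and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical monthly series: relative search volume vs. prescriptions.
web_searches = [10, 20, 30, 40, 50]
prescriptions = [12, 25, 31, 47, 58]
r_pearson = pearson(web_searches, prescriptions)
r_spearman = spearman(web_searches, prescriptions)
```

Spearman correlation depends only on the ordering of the values, which is why it can be preferable when the web measure is a relative index rather than an absolute count.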

Main findings
Overall, positive associations between drug utilization estimates reported in web data sources and comparison data sources were found in all studies, with significant results reported in eight of the nine studies that used correlation analyses (20, 21, 23, 25, 26, 29, 31, 33). Kamiński et al. found antibiotic consumption to be significantly associated with internet search data of probiotics but not antibiotics (32). Using ordinal regression analyses, Kalichman et al. found that the internet search term H1N1 independently predicted H1N1 vaccine coverage, while the search term vaccine independently predicted HPV vaccination coverage (25). Two studies built and evaluated models to predict future drug utilization and reported the best predictions when combining web and comparison data (27, 30). Jankowski et al. developed a drug popularity ranking list using internet search data and found the list to be similar to those reported by two […]. Three studies found similar seasonal patterns across the web and comparison data sources (21, 26, 31). Moreover, one study found correlations between internet search volumes and drug prescription volumes not only at the same time but also following a one-month time lag for the population aged 20 to 59 years, suggesting that people obtain health-related information from the internet, which may subsequently affect their behavior and medication requests (33).
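The one-month time lag mentioned above corresponds to a lagged (cross-) correlation, in which the web series at month t is compared with the comparison series at month t + lag. The following is a minimal sketch under that assumption; the function names and the toy series are hypothetical and do not reproduce the cited study's analysis.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def lagged_correlation(web, comparison, lag):
    """Correlate web[t] with comparison[t + lag] for a non-negative lag."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    if lag == 0:
        return pearson(web, comparison)
    # Drop the last `lag` web points and the first `lag` comparison points
    # so that each web value is paired with the value `lag` steps later.
    return pearson(web[:-lag], comparison[lag:])


# Toy monthly series in which prescriptions follow searches by one month.
searches = [1, 3, 2, 5, 4, 6]
prescriptions = [0, 1, 3, 2, 5, 4]
same_month = lagged_correlation(searches, prescriptions, 0)
one_month_later = lagged_correlation(searches, prescriptions, 1)
```

In this toy example the correlation is higher at a lag of one month than at lag zero, which is the kind of pattern that would suggest searches precede prescriptions.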

Reported challenges of using web data for drug utilization estimates
Several limitations and biases of using web-mined data for drug utilization estimation were discussed by the study authors. A total of five studies stated that there might be a selection bias as the web data source might not sufficiently represent the whole population and that important vulnerable populations such as the elderly might be underrepresented (21, 23, 29, 31, 33). Furthermore, unmeasured factors, such as users' search intents and attitudes as well as the potential impact of media attention, might influence web-mined drug utilization volumes (20, 25, 32). Additional challenges were identified resulting from low search volumes when web data are narrowed down to specific regions or populations (31, 32). In two studies, web data were considered inadequate for establishing causal relationships (20, 25), and it was also stated that web-mined data might generally be unreliable as they are based on self-reported experiences (29).
Four studies specifically addressed limitations of using web-mined data from Google Trends (21, 26, 32, 33). Of these, three studies highlighted that Google Trends only reported a normalized share of the number of searches in the form of "relative search volume" rather than an absolute number of total searches (21, 26, 32). Furthermore, Google Trends provided no details about how search words were recognized or aggregated (33).
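To illustrate why a relative search volume hides absolute counts, the sketch below applies a simplified normalization: each period's share of all searches is scaled so that the peak period equals 100. This is an assumption made for illustration only; as noted above, the details of Google's actual processing are not disclosed, and the function name and counts are hypothetical.

```python
def relative_search_volume(term_counts, total_counts):
    """Scale per-period search shares to a 0-100 index (peak = 100).

    Simplified illustration, not Google's actual algorithm.
    """
    shares = [term / total for term, total in zip(term_counts, total_counts)]
    peak = max(shares)
    if peak == 0:
        return [0 for _ in shares]
    return [round(100 * share / peak) for share in shares]


# Two hypothetical regions with very different absolute search counts.
small_region = relative_search_volume([10, 20, 40], [1000, 1000, 1000])
large_region = relative_search_volume([1000, 2000, 4000], [100000, 100000, 100000])
```

Both hypothetical regions yield the same index even though their absolute counts differ by a factor of 100, so the index alone cannot be converted back into numbers of searches or users.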

Discussion
This systematic scoping review identified 14 studies which compared drug utilization estimates from web data to another data source. While most studies (13) concluded that there were some similarities between the two data sources, the studies showed a lack of consensus on methodology, and only nine (64%) studies used a quantitative measure of correlation between the web and comparison data source.
To our knowledge, this is the only scoping review specifically focusing on the utility of web data for estimating drug utilization in comparison to other data sources. Other recent reviews focused on the use of social media data for pharmacovigilance (8, 34-36), surveillance of prescription medication abuse (37), and illicit drug use (38). Reviews investigating search engine data mostly focused on infectious disease surveillance (39, 40), but, to the best of our knowledge, did not cover the utility for drug utilization so far.
Ultimately, using web data in order to inform on drug utilization could have a significant public health impact. Research is likely to develop in this field showing more examples of association between web data and drug utilization (e.g., types of medication assessed, countries, web data sources used and speed of data obtained) that could confirm our findings.
Our findings are similar to those of a review investigating the utility of social media for pharmacovigilance: Tricco et al. reported consistent results in a majority of included studies which compared the frequency of drug adverse events detected from social media data sources against a regulatory database (8). In addition, our review found that all four included studies that reported on seasonal differences found similar seasonal drug utilization patterns between the two data sources. This finding not only shows that web data generally correlate with comparison data but also underpins the utility of web data to produce timely estimates of drug utilization.
Our review showed a great variety of comparison data sources commonly used for drug utilization studies that were used to validate the results from web data. Those comparison sources included many country-specific surveillance data sources, such as those from the US CDC and the US Medical Expenditure Panel Survey (MEPS), as well as data from private companies, such as the Japanese JMDC Inc. In these comparison data sources, drug utilization estimates were the most commonly used data measure, followed by prescription volumes and drug sales.

Web data sources
Twelve (86%) out of 14 included studies employed search engine data retrieved from various Google services such as Google Trends, Google Insights for Search, Google Health Trends, and the Google search engine. Connected to this, the total duration of access was very similar with a median duration of 5.3 years for the web and 5.0 years for the comparison data source. This is notably more than has previously been reported by a review focusing on the utility of social media for pharmacovigilance, where social media posts were followed for a median duration of 1.1 years (8). In addition, the predominance of search engine web data sources might be explained by the greater ease of accessing search engine data through services such as Google Trends compared to retrieving unstructured social media data, which typically involves a labor-intensive processing pipeline containing multiple steps (8) to extract datasets suitable for analysis and comparison to other sources. We recommend that research in this field use a wider range of web data (e.g., Facebook, Twitter, specific health forums) rather than focusing on only one type of search engine.

Drug classes and type of drug utilization investigated
Seven out of 14 (50%) studies focused on drug classes that belong to the field of infectious diseases, namely antibiotics (n = 3) and vaccines (n = 4). The remaining studies focused on drug classes of diverse other fields, such as diabetes, depression, and the misuse of psychoactive drugs. Studies included medications used either as short treatment (e.g., antibiotics or vaccines) or for chronic use (e.g., statins for lipid lowering, or antidepressants). However, as most studies used web search engines, they could only evaluate the prevalence of drug use, as it is not possible to differentiate former and new users from these data sources alone. Specific analyses of post content from Facebook, Twitter, or specific health forums would allow more information to be retrieved on drug utilization. For instance, one could screen for information on the length of time patients are on medications or on the concomitant use of other medications. Analysing the content of social media posts has already been used in the past for pharmacovigilance (41). Considering that the investigated studies found consistent positive results of using web data for estimating drug utilization across the vast majority of the investigated drug classes, we advise future studies to extend research to drug classes from other fields and to use a wider diversity of web data sources, such as those including specific user posts.

Reported challenges of using web data for drug utilization estimates
The mentioned limitations of the included primary research studies highlighted potential challenges of using web data for estimating drug utilization, such as the potential lack of representativeness between web data-creating users and the general population, difficulties identifying the populations who created the web data, difficulties interpreting relationships between web data and comparison drug utilization data (e.g., due to the presence of potentially unmeasured confounding factors such as users' search intent or effects of media attention), and problems dealing with low search volume if data is narrowed down to specific regions or populations. These critical aspects should be systematically targeted in further studies using web data to assess drug utilization.

Reporting quality
The overall reporting quality of the studies according to the STROBE, RECORD, and RECORD-PE checklists was mediocre and varied strongly between the different items. The most commonly reported items (>80%) were background/rationale, objectives, and outcome data. Items with low reporting (<20%) were other analyses, bias, and the accessibility of protocol, raw data, and programming code. Of particular relevance is the poor reporting of the two latter items, since both items were rated to be applicable for all reviewed studies and since these points are increasingly recommended as they target research transparency and reproducibility. The finding that articles tend to underreport biases has also been observed in two other studies that assessed the compliance of the articles with the STROBE checklist in different fields (42, 43). One of the issues may be that these guidelines are not specific to internet user content research.
Moreover, many items were rated to be out of scope for the type and design of the studies we included in our review. In many cases, this was due to the fact that the users who created the web data could not directly be regarded as study participants as, for example, eligibility criteria cannot be controlled and important information such as descriptive user characteristics can hardly be retrieved from web data.
In conclusion, the three checklists include all important items necessary to assess the reporting quality of the included studies. However, a variety of items were not applicable as they were out of scope for these types of studies. Therefore, we recommend utilizing a shortened and adapted version of the current STROBE, RECORD, and RECORD-PE checklists for future studies. For example, as web data were usually sourced through social media platforms and open-access websites for search analysis, no actual participant recruitment procedures took place in those studies. Therefore, all items relating to the recruitment and assessment of real-world participants could be omitted in a future version of this checklist (i.e., items: 6(a), 6(b), 6.1, 6.2, 6.3, 6.1.a, 13(a), 13(b), 13.1, 14(a), 14(b), 14(c)) and replaced by more suitable items, such as the type of web data (e.g., search term volumes, number of tweets/posts of interest…).

Strengths and limitations
This systematic scoping review was conducted and reported according to the standardized PRISMA guidelines (15). We conducted an extensive literature search, defined the study eligibility criteria, rigorously assessed studies that contained drug utilization information from web data sources, and compared it to other sources with drug utilization information. One limitation of this review was the heterogeneity of methodologies in terms of study objectives and analysis methods in the included studies, which made it impossible to draw more general conclusions. This, together with the relatively small number of identified studies, underlines the complexity and novelty of the field and justifies the selection of a scoping review approach.
Finally, in our assessment of the studies' reporting quality employing the STROBE, RECORD, and RECORD-PE checklists, a substantial number of items had to be considered out of scope for these types of studies. This calls for an adapted (standard) checklist.

Conclusion
While this study demonstrates the potential of social media and search engine data in assessing drug utilization, it also emphasizes the low level of evidence available in the literature. Generalization of this approach requires additional studies focusing on the validation of drug utilization estimates from traditional data sources as well as on using quantitative (such as correlation assessment or modelling) methodologies when comparing traditional sources to web data. The use of web data to estimate drug utilization is an emerging field, and future research should focus on fulfilling standardized reporting standards as well as developing new reporting guidelines that specifically target the characteristics of this type of research.

Publisher's note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.