Harnessing Big Data for Communicable Tropical and Sub-Tropical Disorders: Implications From a Systematic Review of the Literature

Aim: According to the World Health Organization (WHO), communicable tropical and sub-tropical diseases occur solely or mainly in the tropics, thriving in hot and humid conditions. Some of these disorders, termed neglected tropical diseases, are particularly overlooked. Communicable tropical/sub-tropical diseases represent a diverse group of communicable disorders occurring in 149 countries, favored by tropical and sub-tropical conditions, affecting more than one billion people and imposing a dramatic societal and economic burden.
Methods: A systematic review of the extant scholarly literature was carried out by searching PubMed/MEDLINE and Scopus. The search string included keywords such as big data, nontraditional data sources, social media, social networks, infodemiology, infoveillance, novel data streams (NDS), digital epidemiology, digital behavior, Google Trends, Twitter, Facebook, YouTube, Instagram, Pinterest, Ebola, Zika, dengue, Chikungunya, Chagas, and the other neglected tropical diseases.
Results: Forty-seven original, observational studies were included in the current systematic review: 1 focused on Chikungunya, 6 on dengue, 19 on Ebola, 2 on malaria, 1 on Mayaro virus, 2 on West Nile virus, and 16 on Zika. Fifteen were dedicated to developing and validating forecasting techniques for real-time monitoring of neglected tropical diseases, while the remaining studies investigated public reaction to infectious outbreaks. Most studies explored a single nontraditional data source, with Twitter being the most exploited tool (25 studies).
Conclusion: Even though some studies have shown the feasibility of utilizing NDS as an effective tool for predicting epidemic outbreaks and disseminating accurate, high-quality information concerning neglected tropical diseases, some gaps should be underlined. Out of the 47 articles included, only 7 focused on neglected tropical diseases, while all the others covered other communicable tropical/sub-tropical diseases; the main determinant of this unbalanced coverage seems to be media impact and resonance. Furthermore, efforts to integrate diverse NDS should be made. Taking these limitations into account, further research in the field is needed.

inference. For instance, this happened with "Google Flu Trends" (GFT), which failed to provide accurate predictions of influenza-like illness (ILI) cases. Indeed, GFT predicted more than double the proportion of doctor visits for ILI reported by the Centers for Disease Control and Prevention (CDC) (10). Due to these concerns, GFT no longer publishes influenza estimates. Similarly, Google Dengue Trends, a web-based tool for predicting dengue cases, is not currently available.
"Infodemiology" (a portmanteau of information and epidemiology) and "infoveillance" (a combination of information and surveillance) were coined by Gunther Eysenbach to indicate the emerging "science of distribution and determinants of information in an electronic medium, specifically the Internet, or in a population, with the ultimate aim to inform and improve public health and public policy" (11). Systematically tracking, monitoring, collecting, and analyzing health-related demand data generated by NDS could help predict events relevant to public health, such as epidemic outbreaks, and could also help investigate the effect of media coverage in terms of potential distortions, misinformation, and biases, the so-called "epidemics of fear" (12). Details are shown in Figure 1.
The aim of the current investigation was to systematically assess the feasibility of exploiting NDS for surveillance purposes and/or their potential for capturing public reaction to epidemic outbreaks. The main characteristics of NDS analyzed in this paper are briefly overviewed in Box 1.

MATERIALS AND METHODS
The present systematic review was conducted according to the "Preferred Reporting Items for Systematic Reviews and Meta-Analyses" (PRISMA) guidelines (13). The literature search was performed in July-August 2017 using pre-established ad hoc keywords and was updated in September-October 2017. The search strategy is detailed in Table 1.

Inclusion and Exclusion Criteria
Articles were included in the present systematic review if they met the following inclusion criteria: (i) full text available; (ii) original articles; (iii) focus on communicable tropical and sub-tropical disorders, including neglected tropical diseases; and (iv) assessment of novel sources of data, such as Twitter, web search monitoring tools like Google Trends, Facebook, Google Plus, Wikipedia access logs, and traffic tracking tools such as WikiTrends.
Exclusion criteria were: (i) studies without original data (abstracts, letters to the editor, editorials, comments, commentaries, expert opinions, reviews) and (ii) studies published in congress proceedings and gray literature.
No time or language filters were applied. Two researchers (NLB and VG) independently screened titles and abstracts to verify the articles' relevance. Disagreements were resolved through discussion or consultation with a third reviewer. The full text was downloaded only for the selected titles, and the reference lists of included studies were also checked to identify any other potentially relevant papers.

Data Extraction
The main information from the included studies was extracted independently by two authors (VG and NLB) and collected in a pre-defined ad hoc spreadsheet. The collected data included: (i) surname of the first author, (ii) year of publication, (iii) data source, (iv) disease studied, (v) study period, (vi) location searched, (vii) keywords used, (viii) aim of the study, and (ix) main findings.
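As an illustration of how such an extraction sheet could be laid out, the following minimal Python sketch defines one record with the nine items listed above and serializes it to CSV. The class name, field names, and the example values are hypothetical placeholders introduced here; they are not taken from the spreadsheet actually used in this review or from any included study.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class ExtractionRecord:
    """One row of the extraction sheet; field names are illustrative."""
    first_author: str
    year: int
    data_source: str
    disease: str
    study_period: str
    location: str
    keywords: str
    aim: str
    main_findings: str


def to_csv(records):
    """Serialize records to CSV text with one column per extracted item."""
    buffer = io.StringIO()
    column_names = [f.name for f in fields(ExtractionRecord)]
    writer = csv.DictWriter(buffer, fieldnames=column_names)
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


# Placeholder values only; not taken from any included study.
example = ExtractionRecord(
    first_author="Doe", year=2016, data_source="Twitter", disease="Zika",
    study_period="2016", location="Global", keywords="zika",
    aim="placeholder aim", main_findings="placeholder findings",
)
csv_text = to_csv([example])
```

Keeping one typed record per study makes it straightforward for two extractors to fill in the same columns independently and compare their sheets afterwards.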

RESULTS
A total of 17,945 articles were retrieved, two of which were found by means of extensive manual hand-searching and cross-referencing. After a preliminary screening, a total of 14,996 articles were excluded because they did not meet the inclusion criteria.
Two more articles were retrieved from additional sources; finally, the 57 remaining articles were analyzed in full. Ten of them were excluded with reasons, and the last 47 articles were included in the present systematic review. Results are synthesized in Table 2. The screening process is shown in Figure 2.
Out of the 47 articles included in the review, 1 focused on Chikungunya, 6 on dengue, 19 on Ebola, 2 on malaria, 1 on Mayaro virus, 2 on West Nile virus, and 16 on Zika [...] (Figure 3), in terms, for example, of sentiment analysis and the spreading of fake news related to tropical disorder outbreaks.

Chikungunya Virus
Only one study was related to Chikungunya. Roche et al. (14) harnessed tweets related to Chikungunya posted during the outbreak in Martinique (14) and, performing a regression analysis with epidemiological and environmental variables, found that integrating the model with tweet contents explained the epidemiological dynamics over time well.

Dengue
Five studies were related to dengue. Four of them relied on predictive models to forecast dengue outbreaks. In Brazil, Gomide et al. (18) exploited Twitter and performed extensive content, correlation, and spatiotemporal analyses (18). The authors found an excellent association between tweet production [...] (17). News-based models were found to correlate well with epidemiological cases. One article harnessed big data to explore the determinants of sharing tweets related to dengue. In particular, Nsoesie et al. (16), using machine learning techniques, found that sociodemographic variables played a major role in producing and sharing dengue-related tweets (16).

Ebola
Twenty articles were related to Ebola. All of them exploited big data sources to capture public reaction to Ebola outbreaks, in terms of both sentiments, fears, and concerns and knowledge, beliefs, and attitudes. In more detail, four studies exploited Twitter. van Lent et al. (37) investigated the predictors of Ebola-related tweet production and found a significant positive relation between proximity and fear of Ebola virus (37). Jin et al. (29) harnessed Twitter to understand the public reaction to misinformation related to the Ebola outbreak, performing an extensive geo-coded analysis, coding, and mathematical modeling (29). The authors found that some Ebola-related rumors were more popular than others. Lazard et al. (30) found that the public was mainly concerned with symptoms and lifespan of the virus, disease transfer and contraction, safe travel, and protection of one's body (30). Interestingly, Wong et al. (35) aimed at understanding the determinants of tweeting from local health departments. Approximately 60% of local health departments sent tweets (35).
Three studies utilized YouTube. Nagpal et al. (21) analyzed the most popular Ebola-related videos and found that the most relevant ones were those presenting clinical symptoms (21). Pathak et al. (24) found that the majority of the internet videos about Ebola were useful, even though some videos were misleading (24). Basch et al. (32) analyzed the 100 most viewed videos on YouTube, with more than 73 million views, and concluded that YouTube has a Yin-Yang nature, in that it could, on the one hand, enhance education and, on the other hand, spread misinformation (32).
Three studies utilized Facebook. Sastry and Lovari (26) analyzed the material posted on the official CDC and WHO pages (26). The following major themes were identified: (a) consulting and containment, (b) international concern, and (c) the possibility of an epidemic in the United States. Strekalova (22), reviewing the official CDC page, found that the CDC submitted fewer posts about Ebola than about non-Ebola topics, even though audience engagement was significantly higher (22). Furthermore, men were more interested in Ebola posts and submitted more comments per user. Moreover, Strekalova (22) found that there were differences in audience information behaviors in response to the emerging Ebola pandemic and health promotion posts (22).
Seven studies utilized more than one big data source. Fung et al. (33) used both Twitter and GT to understand the public reaction to the Ebola outbreak and the first US case (33). Fung et al. (34) combined Sina Weibo and Twitter to capture the reaction to misinformation related to the Ebola emergency (34). Liu et al. (27) harnessed Baidu and Sina Micro to investigate the public reaction to the Ebola outbreak in China, developing a mathematical model (27). Roberts et al. (25) mined both English-language websites and Twitter to qualitatively analyze the Ebola-related narrative, carrying out content and sentiment analysis (25). Househ (28), using Twitter and Google News Trend, found a significant correlation between media coverage and tweet production (28). Towers et al. (36) integrated Twitter and web searches to understand the impact of media coverage on the public reaction to the Ebola outbreak in the United States in terms of digital activities, developing a mathematical model (36). Wong et al. (35) exploited both Twitter and GT to understand the determinants of Ebola-related tweeting by local health departments, by means of a geospatial analysis (35). The authors found a weak, negative, non-significant correlation between online search activity and the per capita number of local health department Ebola tweets by state.
Besides capturing public reaction to the Ebola epidemic, three studies also attempted to build predictive models and analyses. Alicino et al. (31) explored the feasibility of exploiting GT for real-time monitoring and tracking of Ebola virus outbreaks, carrying out correlation and regression analyses with epidemiological cases (31). The authors found that the correlation was stronger at the global level but weaker at the national level, probably due to unbalanced, biased media coverage and to the digital divide. Odlum and Yoon (23) utilized Ebola-related tweets as a real-time method of Ebola outbreak surveillance to monitor information spread, capture early epidemic detection, and examine the content of public knowledge and attitudes (23). The authors found that tweets began to rise in Nigeria 3-7 days prior to the official announcement of the first probable Ebola case. Topics discussed included risk factors, prevention education, disease trends, and human compassion.
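The kind of correlation analysis described above, and the idea that a digital signal may lead official counts by some interval, can be sketched with a lagged Pearson correlation. The Python example below is a minimal illustration under stated assumptions: the function names are hypothetical, the two series are synthetic (the case curve is simply the search curve delayed by two time steps), and it is not the procedure used by any of the cited studies.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def best_lead(signal, cases, max_lag):
    """Return (lag, r) maximizing the correlation of signal[t] with cases[t + lag]."""
    results = []
    for lag in range(max_lag + 1):
        paired_signal = signal[: len(signal) - lag]  # drop the last `lag` points
        paired_cases = cases[lag:]                   # drop the first `lag` points
        results.append((lag, pearson(paired_signal, paired_cases)))
    return max(results, key=lambda item: item[1])


# Synthetic series: the case curve is the search curve delayed by two steps.
searches = [1, 2, 5, 9, 14, 10, 6, 3, 2, 1, 1, 1]
cases = [1, 1, 1, 2, 5, 9, 14, 10, 6, 3, 2, 1]
lag, r = best_lead(searches, cases, max_lag=4)
```

With real data, a high correlation at a positive lag would only suggest that the digital signal moves ahead of reported cases; as the studies above note, media coverage can drive search and tweet volumes independently of the epidemiology.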

Malaria
GT was used for forecasting malaria cases by Ocampo et al. in 2013 (40). This study was performed using data related to Thailand in the period 2005-2009. The authors developed four Google search query-based models: namely, the so-called "microscopy model" (which uses terms associated with official data), the "automatic model" (based on an automated selection algorithm), the "physician model" (generated from terms selected by surveyed Thai physicians), and the "stepwise model." GT-based models correlated well with epidemiological cases. Fung et al. (38,39) used Twitter and performed a content analysis of the malaria-related tweets (38). The main topics were prevention, control, and treatment, followed by advocacy, epidemiological information, and societal impact.

Mayaro Virus
Only one study was related to Mayaro virus, and it exploited GT. Adawi et al. (41) explored the feasibility of utilizing GT for real-time monitoring and tracking of Mayaro virus outbreaks (41). Correlation and regression analyses were performed with epidemiological cases and with other NDS, including Google News and PubMed/MEDLINE. The authors found that web searches were driven by media coverage rather than reflecting real epidemiological cases.

West Nile Virus
Two studies focused on the West Nile virus (42), and both of them used GT. Bragazzi et al. (42) aimed at exploiting the predictive power of GT (42) in Italy, performing a correlation analysis with epidemiological cases. The authors found a significant positive correlation between web searches and cases. Watad et al. (43) explored the predictive power of GT in the United States, carrying out correlation and regression analyses as well as mathematical modeling (43). Results showed a good correlation between web searches and real-world epidemiological figures. The best seasonal autoregressive integrated moving average model with explanatory variable (SARIMAX) computed was (0,1,1)×(0,1,1)₄, that is to say, a "seasonal exponential smoothing" model. Moreover, using data from 2004 to 2015, it was possible to predict data for 2016.
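The SARIMAX specification reported above can be written out explicitly. The following is a sketch in standard Box-Jenkins notation, not the authors' own formulation: the symbols $y_t$ (case counts), $x_t$ (the explanatory variable, for example GT search volume), $\beta$, $\theta$, $\Theta$, the backshift operator $B$, and the plus-sign convention for the moving-average terms are assumptions introduced here for illustration, and the seasonal period of 4 is read from the subscript of the reported model order.

```latex
\begin{align}
y_t &= \beta x_t + \eta_t, \\
(1 - B)(1 - B^{4})\,\eta_t &= (1 + \theta B)(1 + \Theta B^{4})\,\varepsilon_t, \\
\eta_t &= \eta_{t-1} + \eta_{t-4} - \eta_{t-5}
        + \varepsilon_t + \theta\,\varepsilon_{t-1}
        + \Theta\,\varepsilon_{t-4} + \theta\Theta\,\varepsilon_{t-5}.
\end{align}
```

The third line simply expands the second: one ordinary and one seasonal difference on the left, and one ordinary and one seasonal moving-average term on the right, which is the structure behind the "seasonal exponential smoothing" interpretation.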

Zika Virus
Sixteen studies focused on Zika, and nine of them used Twitter as a nonconventional data source. In most cases (4 papers), the analysis performed was content analysis (46-48,52), although carried out for various research purposes. In more detail, Miller et al. (52) analyzed tweets posted during the Olympic Games and captured public reaction in terms of sentiments and concerns related to the potential association between Zika infection, microcephaly, and Guillain-Barré syndrome, an association that was probable but not yet confirmed at that time. Although the total polarity was negative, the percentage of positive tweets was higher than expected. An imbalance in the volume of tweets focusing on treatment was found. Similarly, a study by Fu et al. (47) led to the emergence of five major themes: (1) government, private and public sector, and general public response to the outbreak; (2) transmission routes; (3) societal impacts of the outbreak; (4) case reports; and (5) pregnancy and microcephaly. Glowacki et al. (48) investigated the use of new ICTs by healthcare authorities and organizations and, for this purpose, collected tweets during an hour-long live CDC Twitter chat, identifying 10 major topics. Some of them were related to the virology of Zika, its spread, sequelae in infants and pregnant women, sexual transmission, and symptomatology. Dredze et al. (46) focused on the spreading of conspiracy theories and pseudo-scientific claims and found that tweets disseminating misleading information were concentrated almost entirely in the first week of the pandemic (46). Three studies used quantitative approaches, namely correlation and regression analysis (45,49,55), mathematical modeling (51), and spatiotemporal analysis (56). Southwell et al. (55) found strong positive correlations between news coverage, social media mentions, and online search behavior (55). Bragazzi et al. (45) found a constantly increasing public interest in Zika, with public opinion being particularly worried by the alert about the teratogenicity of the Zika virus (45). In particular, the most frequent queries were about symptoms, transmission, and possible sequelae, such as microcephaly. Lehnert et al. (49) performed a regression analysis in order to understand the determinants of social media usage by the obstetric community (49). The percentage of obstetric practice websites posting information about Zika virus increased over time; however, the proportion of practice sites posting Zika virus content on Facebook and Twitter declined. Practice websites related to university hospitals were more likely to post information on Zika virus than independent practice sites. McGough et al. (51), through a mathematical model, integrated different nonconventional surveillance data (51), such as Google searches, Twitter microblogs, and the HealthMap digital surveillance system, and found that models relying on Google and Twitter showed the best 2- and 3-week-ahead predictions. Last, Stefanidis et al. (56) performed a spatiotemporal analysis in order to characterize Zika-related tweets in terms of temporal variations of locations, actors, and concepts (56). The spatiotemporal analysis of the different Twitter contributions reflected the spread of interest in Zika from South America to North America and then across the globe. Healthcare institutional bodies, such as the CDC and the WHO, played a major role in tweet production.
Other types of big data sources explored in Zika studies were Facebook (54,58), Google Trends (50,57), and YouTube (44). Another big data source was GT. Two studies examined GT-generated volume data in order to build predictive models. Teng et al. (57) aimed at predicting the number of infection cases (57). The authors constructed an autoregressive integrated moving average (ARIMA) model (0, 1, 3) for the dynamic estimation of ZIKV outbreaks. Majumder et al. (50), using nontraditional digital data, such as HealthMap and Google Trends, tried to estimate the R0 and Robs parameters of Zika virus spread in Colombia. The authors observed an initially low, but increasing, awareness of and interest in Zika. Google search data were used to distribute cumulative reported case counts more realistically over time. The ranges for Robs estimated using digital data were comparable with, although slightly lower than, the figures calculated with the traditional method. Transmission parameters can thus be estimated in real time using digital surveillance data, especially when traditional methods are not available.
Only one study assessed the content of YouTube videos on Zika (44). Basch et al. (44) analyzed the 100 most viewed English ZIKV-related videos. Among them, the majority were consumer-generated and Internet-based news videos. In terms of content, the majority of the videos concerned babies and cases in Latin America and in Africa.
Pinterest and Instagram were also exploited; however, only two studies were conducted, and both of them performed a content analysis (39,53). Fung et al. (38,39) analyzed more than 600 posts and photos on Facebook and Pinterest, respectively (39). The most popular topics were prevention, pregnancy, and Zika-related deaths. Seltzer et al. (53) analyzed images posted on Instagram (53) and found that, even though the majority of posts focused on transmission and prevention, most of them conveyed negative feelings (such as fear and concern) and contained misleading information.
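To make the ARIMA (0, 1, 3) notation concrete: the series is differenced once, and each change is modeled as the current random shock plus a weighted sum of the three previous shocks. The Python sketch below computes a one-step-ahead forecast from such a model. It assumes no drift term, pre-sample shocks equal to zero, and coefficients supplied by the caller; the function name and the numbers in the usage line are hypothetical, and this is not the fitting procedure used in (57).

```python
def arima_013_forecast(series, thetas):
    """One-step-ahead forecast from an ARIMA(0,1,3) model without drift.

    series: observed values y_0..y_T (at least one value).
    thetas: (theta1, theta2, theta3), the moving-average coefficients.
    Shocks are rebuilt recursively; pre-sample shocks are assumed to be zero.
    """
    shocks = [0.0, 0.0, 0.0]  # assumed pre-sample shocks
    for previous, current in zip(series, series[1:]):
        # Pair theta1 with the most recent shock, theta2 with the one before, etc.
        ma_part = sum(t * e for t, e in zip(thetas, reversed(shocks[-3:])))
        predicted = previous + ma_part
        shocks.append(current - predicted)
    ma_part = sum(t * e for t, e in zip(thetas, reversed(shocks[-3:])))
    return series[-1] + ma_part


# Hypothetical weekly counts and coefficients, for illustration only.
forecast = arima_013_forecast([10.0, 12.0, 15.0], (0.5, 0.2, 0.0))
```

With all three coefficients set to zero the model reduces to a random walk, so the forecast is simply the last observed value; nonzero coefficients adjust that value by recent forecast errors.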

DISCUSSION
In the past years, there has been a growing interest from the scholarly community in big data sources and their impact on public health. This paralleled the interest in neglected and communicable tropical diseases. Currently, communicable tropical diseases, including the subset of neglected ones, represent re-emerging infections. However, re-emergence is not a completely new phenomenon confined to the past decades; it has in fact been happening for centuries. On the other hand, today the re-emergence and dispersion of infectious agents are more rapid and geographically extensive, mainly due to globalization and to the adaptation of arthropods and other vectors to its effects (59).
Novel data streams appear to be promising tools for predicting the spread of infectious agents, and, as such, can potentially aid and inform early decision support for when and how to employ public health interventions within a certain community. Emergency situations, being urgent scenarios, need accurate, reliable, and fast predictive models (60). Traditional surveillance systems are often plagued by a number of shortcomings and drawbacks, such as a significant delay in releasing official government-reported case counts (51). NDS seem to offer a real-time way to track and monitor outbreak dynamics, as well as to capture relevant information and parameters related to infection rates when these details are scarcely known or not available.
Novel data streams are also versatile tools in that they can be exploited to capture public reactions to epidemic outbreaks, in terms of emotions and fears, and of knowledge, attitudes, and practices. Some studies have harnessed big data sources to understand the spread of misinformation. Years of research in the field of health communication and psychology have shown that opinion change is a much more challenging issue than opinion formation, since, once people believe something wrong or misleading, it is difficult to dissuade them from such rooted beliefs (46). With respect to this topic, some studies have shown that NDS have a Yin-Yang nature, being, on the one hand, useful resources for promoting health education and, on the other hand, vehicles of potentially dangerous information and content. In the "post-truth" era, the dissemination of fake news, alleged claims, and non-evidence-based rumors could have serious implications for public health. Techniques of social bookmarking and the direct involvement of healthcare workers and practitioners (in producing health-related websites, posting and sharing online material, tweeting, chatting, and so on) could be useful strategies (61).
Stakeholders and health authorities should be aware of the new ICTs, in that they could usefully exploit Internet-based tools for collecting the concerns of public opinion and replying to them, reassuring the public, and disseminating accurate, high-quality information (45). However, some studies included in the current systematic review have stressed gaps in the usage of NDS by official healthcare organizations and bodies. Efforts should be made to deliver proper and effective health communication, utilizing ICTs and borrowing approaches from social marketing, and to make posted material and delivered information more appealing in terms of public outreach and engagement.
Another important point that should be stressed is that the papers included in the current systematic review are not of equal value with respect to the field of public health. For example, the studies by Gomide and co-workers, McGough and collaborators, Odlum and Yoon, Roche and co-workers, and Teng and coauthors are highly relevant to public health outcomes (14,18,23,43,51,57), while the others relate primarily to social networks. As such, only a few of the articles included in the present systematic review are directly relevant to public health outcomes. This definitely deserves further investigation and research in the field.
Our systematic review has some major strengths, including the breadth of the search performed. However, even though efforts were made to ensure completeness of the findings, alternate spellings/misspellings of keywords could have affected the results [for example, nine articles were recently returned on PubMed/MEDLINE for "chikugunya" (an incorrect spelling of the disease chikungunya)]. On the other hand, the reference lists of included articles were extensively hand-searched to increase the chance of capturing all potentially relevant studies. Relatedly, a variety of computational, "big data"-related terms (such as machine learning, collective intelligence, or deep learning) were not included. Ad hoc search strings are, of course, finite in length; however, we expect to have included all relevant investigations meeting the inclusion/exclusion criteria, on the basis that we carried out extensive cross-referencing and additional hand-searching.

CONCLUSION
Even though some studies have shown the feasibility of utilizing NDS as an effective tool for predicting epidemic outbreaks and disseminating accurate, high-quality information concerning communicable tropical diseases, some gaps should be underlined. Among the 47 studies included in our systematic review, only 7 focused on neglected tropical diseases (Chikungunya and dengue), while all the others focused on other communicable tropical diseases (19 on Ebola, 2 on malaria, 1 on Mayaro virus, 2 on West Nile virus, and 16 on Zika). In particular, out of the 17 groups of neglected tropical diseases identified by the WHO, only two types of infectious diseases (namely, dengue and Chikungunya) were covered, and the main determinant of this unbalanced coverage seems to be media impact and resonance, as well as the fear of epidemic agents spreading to Western countries. Furthermore, efforts to integrate diverse NDS should be made. Taking these limitations into account, further research in the field is needed.