Psychometric Analysis and Coupling of Emotions Between State Bulletins and Twitter in India during COVID-19 Infodemic

The COVID-19 infodemic has been spreading faster than the pandemic itself. The misinformation riding upon the infodemic wave poses a major threat to people's health and governance systems. Since social media is the largest source of information, managing the infodemic requires not only mitigating misinformation but also an early understanding of the psychological patterns resulting from it. During the COVID-19 crisis, Twitter alone has seen a sharp 45% increase in the usage of its curated events page, and a 30% increase in its direct messaging usage, since March 6th 2020. In this study, we analyze the psychometric impact and coupling of the COVID-19 infodemic with the official bulletins related to COVID-19 at the national and state level in India. We look at these two sources through a psycho-linguistic lens of emotions and quantify the extent and coupling between the two. We modified Empath, a deep skip-gram based open-sourced lexicon builder, for effective capture of health-related emotions. We were then able to capture the time-evolution of health-related emotions in social media and official bulletins. An analysis of lead-lag relationships between the time series of extracted emotions from official bulletins and social media using Granger's causality showed that state bulletins were leading the social media for some emotions such as Medical Emergency. Further insights that are potentially relevant for policymakers and communicators actively engaged in mitigating misinformation are also discussed. Our paper also introduces CoronaIndiaDataset, the first social media based COVID-19 dataset at national and state levels from India with over 5.6 million national and 2.6 million state-level tweets. Finally, we present our findings as COVibes, an interactive web application capturing psychometric insights from the CoronaIndiaDataset, at both a national and state level.


INTRODUCTION
"We're not just fighting an epidemic; we're fighting an infodemic". These words, spoken by the WHO Director-General at the Munich Security Conference on 15 February 2020, sum up the challenges faced by our society due to COVID-19. Infodemic comes from the root words information and epidemic. It refers to an excessive amount of publicly available information, both accurate and inaccurate, which makes it harder to find reliable, trustworthy and accurate guidance when needed. Social media is one of the most popular media for the diffusion of information during an infodemic. Twitter, a micro-blogging site, is one of the most widely used social media platforms, and during the COVID-19 emergency it has seen a sharp 45% increase in the usage of its curated events page, and a 30% increase in its direct messaging usage, since March 6th 2020. Social media platforms are not only easily accessible with a global reach; they also provide users with a virtual space distant from their real world. Because of this, they give people a medium to discuss many taboo topics that they might not raise in their real social circles, such as mental healthcare, domestic violence and sexual assault. This provides an opportunity to use data mining and natural language processing techniques to analyse web data from the standpoint of psychology.
The recent outbreak of COVID-19 (COrona VIrus Disease 2019) has the world in its grip. With exponentially rising cases, the WHO declared it a global pandemic on March 11, 2020. Following health advisories and the rapid spread of the virus, most countries have declared national emergencies, closed borders and restricted public movement. Most countries in the world have gone into national lockdown to contain the pandemic.
Naturally, such a strong event has affected the daily lives of people all over the world, which in turn has affected people's social media usage. In recent times of social distancing, social media has become a popular platform for people to express their thoughts and opinions. Social media is not only being used by people to express how their lives have been affected and to appeal to everyone to understand the importance of the situation; trending hashtags like #CoronaOutbreak and #COVID19 are also used by authorities to disseminate important information and health advisories.
A lot of research has previously been done on social media analysis related to pandemics. Ritterman et al. [14] showed how prediction market models and social media analysis can be used to model public sentiment on the spread of a pandemic. Signorini et al. [16] examined Twitter-based information in 2011 to track the swiftly evolving public sentiment regarding Swine Flu, as well as to correlate H1N1-related activity with reported disease levels in the US. Jain et al. [7] used Twitter as a surveillance system to track the spread of the 2015 H1N1 pandemic in India as well as the general public awareness of it. In this paper, we examine the use of social media during the ongoing n-CoV2019 pandemic. As the impact of this pandemic is growing at an exponential rate, we try to model the rapidly evolving public sentiment and dig deeper to understand how it changes daily, using time-series based analysis. To this end, we have curated a dataset of more than 5.6 million tweets and retweets specific to India. This dataset was collected using two approaches, Content-based and Location-based queries, as explained in Section 3. We model the public sentiment using sentiment analysis and the Empath psycho-linguistic features.

LITERATURE SURVEY
Prior work on understanding public sentiment during the Ebola medical crisis by Lazard et al. [10] detected the themes in the discourse of a live Twitter chat held by the Centers for Disease Control and Prevention. Wong et al. [17] further studied the Ebola-related tweets posted by local health departments. The work done by Kim et al. [9] presents a topic-based sentiment analysis of the Ebola virus on Twitter and in the news.
With the world's efforts focused on battling COVID-19, research efforts from many fields, including online social media and textual analysis, have vigorously started in this direction. Chen et al. [2] provide the first Twitter dataset by collecting tweets related to #COVID19. This is an ongoing collection, started from January 22, 2020. Haouari et al. [5] also present ArCov-19, a large Arabic Twitter dataset collected from January 27, 2020 to March 31, 2020.
During this infodemic, Zhao et al. [18] analyse the attention of the Chinese public to COVID-19 by analysing search trends on Sina Microblog and evaluating public opinion through word frequency and sentiment analysis. Cinelli et al. [3] provide insights into the evolution of the global COVID-19 discourse on Twitter, Instagram, Reddit, YouTube and Gab. Alhajji et al. [1] analysed public sentiment in Saudi Arabia by collecting up to 20,000 tweets within 48 hours of key events in Saudi Arabia's timeline, using transfer learning for sentiment analysis. Kayes et al. [8] attempt to measure community acceptance of social distancing in Australia, reporting that the majority of tweets were in favour of social distancing. Li et al. [12] try to extract psychological profiles of active Weibo users during the spread of COVID-19 in China to analyse linguistic, emotional and cognitive indicators.
Analysing user behaviour and response can provide a critical understanding of which policies and decisions worked. Hou et al. [6] assess the public's response to the situation and the government guidelines in terms of attention, risk perception, and emotional and behavioural response by analysing search trends, shopping trends and blog posts on popular Chinese services. Li et al. [11] analyse how information regarding COVID-19 was disseminated, offering useful insights into the need for information. Work by Schild et al. [15] shows how the current pandemic situation has caused an unfortunate rise in Sinophobic behaviour on the web.

Twitter Data Collection
The data was collected using two separate approaches in parallel: Content-based Query and Location-based Query.

Content-based Query:
To collect the relevant Twitter data, we explored the trending and most popular hashtags for each of the Indian states and manually curated a list of hashtags related to COVID-19. We also went through several n-CoV2019 related tweets manually to find, and subsequently mine, the most popular related hashtags (list given in the appendix), which may not be trending. Further, to automate the collection of relevant tweets, we formulated generic queries like 'corona <state>' and used them to collect the state-wise n-CoV2019 related Twitter data. This approach focuses on getting all tweets that talk about COVID-19 in the context of India or the Indian states. Multiple queries were built by joining terms related to COVID-19 with the name or common aliases of the region, and data was collected from 1st March to 23rd April 2020. Some of the terms used were 'corona', 'covid19', 'coronavirus' and 'lockdown', among others. In addition, popular hashtags like #coronavirusin<region>, #coronain<region>, #corona<region> and #<region>fightscorona were used, where '<region>' is replaced by the name or popular alias of India or various states. Examples of popular aliases are 'Orissa' for Odisha, 'TN' for Tamil Nadu and 'UP' for Uttar Pradesh, as well as common misspellings like 'chatisgarh' for Chhattisgarh.
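The query construction described above can be sketched as follows. This is a minimal illustration and not the authors' collection script: the term list, hashtag templates and alias table are small subsets taken from the examples in the text, and the names `REGION_ALIASES` and `build_queries` are hypothetical.

```python
from itertools import product

# Illustrative subsets only; the full lists used for collection are not given here.
COVID_TERMS = ["corona", "covid19", "coronavirus", "lockdown"]
HASHTAG_TEMPLATES = [
    "#coronavirusin{region}",
    "#coronain{region}",
    "#corona{region}",
    "#{region}fightscorona",
]
# Hypothetical alias table: official name, popular alias or common misspelling.
REGION_ALIASES = {
    "Odisha": ["odisha", "orissa"],
    "Tamil Nadu": ["tamil nadu", "tn"],
    "Chhattisgarh": ["chhattisgarh", "chatisgarh"],
}


def build_queries(region):
    """Return keyword queries and hashtags for one region and its aliases."""
    aliases = REGION_ALIASES[region]
    # Keyword queries join each COVID-19 term with each alias, e.g. 'corona orissa'.
    keyword_queries = [
        f"{term} {alias}" for term, alias in product(COVID_TERMS, aliases)
    ]
    # Hashtags cannot contain spaces, so multi-word aliases are collapsed.
    hashtags = [
        template.format(region=alias.replace(" ", ""))
        for template, alias in product(HASHTAG_TEMPLATES, aliases)
    ]
    return keyword_queries + hashtags
```

Each returned string would then be passed to whichever Twitter search interface is used for collection; that step is left out because the text does not specify it.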

Location-based Query:
Tweets were collected for globally trending COVID-19 related hashtags and then filtered based on April 2020. This resulted in a collection of a total of 12 million tweets from all over the world. We also created a list of location filters for the various states. These are the state names, aliases (as explained above) and names of popular cities in those states. Using these, we first select all tweets having 'india' in their user location, and then sort them based on keyword matches of tokens in the user location with the above list. The user location was lowercased before matching.
Using a chatterplot, we analysed the most frequent words in our dataset, as shown in Figure 1. The plot shows the top 200 words, arranged by their frequency and Bing sentiments [13]: +1 is positive, -1 is negative, and the group of words in the middle has no sentiment value associated. In this plot, we have removed the outlier 'India' for a better representation of the other terms.
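The location matching step can be sketched as below. This is an illustrative assumption of how the matching might be done, not the authors' code: the filter table `STATE_FILTERS` is a tiny hypothetical subset, and matching on unigrams and bigrams is one possible way to handle multi-word names such as 'tamil nadu'.

```python
import re

# Hypothetical location filters: state names, aliases and popular cities.
STATE_FILTERS = {
    "Delhi": {"delhi"},
    "Tamil Nadu": {"tamil nadu", "tn", "chennai"},
    "West Bengal": {"west bengal", "kolkata"},
    "Kerala": {"kerala", "kochi"},
}


def candidate_phrases(user_location):
    """Lowercase the location and return its unigrams and bigrams."""
    tokens = re.findall(r"[a-z]+", user_location.lower())
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return set(tokens) | set(bigrams)


def match_states(user_location):
    """Return matching states, or [] if 'india' is not a token of the location."""
    phrases = candidate_phrases(user_location)
    if "india" not in phrases:
        return []
    return sorted(
        state for state, names in STATE_FILTERS.items() if phrases & names
    )
```

Matching on whole tokens rather than substrings avoids false hits such as 'indiana' for 'india'. A location that mentions several states is returned under all of them, and how such ties are resolved is a design choice the text does not describe.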

Indian State Government n-CoV2019 related bulletins
The DataMeet Community has curated a database of COVID-19 related government bulletins from Indian states. These bulletins contain statistics about COVID-19 cases in the state, the government's response to them, advisories and other useful information. We have analysed all the reports that are in English and belong to the states of Delhi, West Bengal, Punjab, Tamil Nadu, Odisha and Kerala.

METHODOLOGY

Preprocessing Data
We pre-processed the data in several steps to reduce noise. We used the WordNet Lemmatiser instead of Porter Stemmer, since the latter can reduce words to forms that are no longer real words, whereas the former ensures that each word is reduced to a real word in the English dictionary. For example, Porter Stemmer reduces both 'studies' and 'studying' to 'studi', which is not an English word, while a lemmatiser maps both of them to the common lemma 'study'.

Quantitative Empath Analysis
Empath [4] is an open-vocabulary tool to generate and validate lexical categories. It uses a deep skip-gram model to draw correlations between many words and phrases, starting from a small set of seed words. It has some inbuilt categories, including emotions, which can be used to identify the emotion associated with a text.
Empath provides three types of datasets to build the lexicon from, 'reddit' (social media), 'nytimes' (news articles) and 'fiction', and models a category by finding the words closest to the "seed words" of that category. However, the text data used for each of them is outdated and does not carry enough information about the language used in the current scenario. Preliminary analysis using the Empath library showed that the existing lexicon was inadequate to properly analyse the current situation. A case in point is the word 'positive', which is associated with positive emotion in the Empath categories; in the COVID-19 scenario, however, it was often used in the context of 'tested positive', which by itself conveys neither a positive nor a negative emotion and instead refers to testing positive for COVID-19. To rectify this, we manually examined the most frequent unigrams and bigrams in the collected data, as well as some common bigrams in the given context that may be classified incorrectly, and manually annotated them into the most relevant categories or created new categories to better analyse the emotional content of the tweets. Some important modifications are shown in Table 2.
We analyse Empath scores of emotions related to Positive Sentiment, Negative Sentiment, Country and Government, the pandemic caused by COVID-19 and the fight against COVID-19. Details of the specific categories used can be found in Table 3.
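The 'tested positive' problem can be illustrated with a small self-contained sketch. It does not use the Empath library or the authors' modified lexicon: the toy word lists, the hypothetical category name `testing` and the function `category_counts` are assumptions made only to show how a bigram annotation can take precedence over a unigram match.

```python
from collections import Counter

# Toy lexicon; category names and word lists are illustrative assumptions.
BASE_LEXICON = {
    "positive_emotion": {"positive", "hope", "happy"},
    "medical_emergency": {"hospital", "emergency"},
}
# Manually annotated bigrams that override unigram matches.
# 'testing' is a hypothetical category, not one confirmed by the source.
BIGRAM_OVERRIDES = {
    ("tested", "positive"): "testing",
}


def category_counts(text):
    """Count category hits, checking annotated bigrams before unigrams."""
    tokens = text.lower().split()  # whitespace tokenisation, no punctuation handling
    counts = Counter()
    i = 0
    while i < len(tokens):
        bigram = tuple(tokens[i:i + 2])
        if bigram in BIGRAM_OVERRIDES:
            counts[BIGRAM_OVERRIDES[bigram]] += 1
            i += 2  # consume both tokens so 'positive' is not counted again
            continue
        for category, words in BASE_LEXICON.items():
            if tokens[i] in words:
                counts[category] += 1
        i += 1
    return counts
```

With this ordering, 'he tested positive today' contributes to the testing category only, while 'stay positive and hope' still counts towards positive emotion.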

ANALYSIS

Twitter Content

National Level.
Using Empath, we analysed the tweets collected in March 2020 along various psycho-linguistic attributes, as shown in Figure 3. The most common categories discussed in the tweets were government, health and medical emergency, which reflects that while discussing the pandemic, the public brings the government into the discourse, whether by referring to a government policy or to information released by the government. Another observation is that negative emotion is fairly high amongst these tweets. A positive indicator is that the confusion level indicated by the Empath analysis is significantly low, while the frequency of linguistic features related to positive emotions, healing and optimism is much higher.

State Level.
We analysed the psycho-linguistic features of COVID-19 related discourses on Twitter at a state level, as shown in Figure 4. It is interesting to note that although the magnitudes of different psycho-linguistic features vary across states, their structure remains very similar. We observed that in some states, such as Delhi, Punjab and Odisha, government-related words were the most frequent in tweets, whereas a few states, such as Tamil Nadu and West Bengal, talked more about medical emergency. Kerala, on the other hand, had an equal frequency of words related to medical emergency and government. West Bengal and Kerala also have a higher frequency of words related to negative emotion compared to other states.
West Bengal showed higher levels of healing and positive emotions in the tweets. Interestingly, it also showed an even higher frequency of war or fight related words.

Twitter Time Series
We analysed the collected Twitter data in the Indian context over a period of two months (March and April 2020) through the lens of various psycho-linguistic attributes, as shown in Figure 6. We observe that while the frequency of 'hygiene' and 'nervousness' related words has decreased since the start of the COVID-19 crisis in India, words related to 'business' and 'optimism' have become more frequent. The categories 'health' and 'government' have been among the most popular categories in the Indian Twitter data, and while the presence of the 'health' category in tweets shows a sharp dip on March 28, 'government' related words show a sharp rise on the same date.

Relation of public sentiment to the rapidly changing on-ground situation.
We observed that the presence of optimism-related keywords in the tweets has increased over time, with the highest frequency of optimistic words from 18th to 21st April. It is interesting to note that on 18th April, due to the imposition of a nationwide lockdown, the doubling of COVID-19 cases slowed from every 3 days to every 8 days. We also observe that the discussion of certain aspects of COVID-19 on Twitter has reduced over time, especially those related to 'hygiene' and 'movement'. These became very popular near the time when the lockdown was first imposed, but their frequency has since declined, possibly hinting at the normalisation of certain aspects of the COVID-19 narrative on Indian Twitter. The frequency of nervousness-related words has sharply declined over time, with the peak around the time when COVID-19 started becoming popular. We also observe that the frequency of business-related words increases over time, with the peak observed around 20th April, the day when the government allowed certain relaxations for shops and similar establishments to re-open for the first time after the COVID-19 lockdown. The frequency of 'health' related tweets takes a sharp dip near 28th March, when the frequency of tweets related to 'government' reaches its peak. It is interesting to note that on 28th March, India crossed a total of 1,000 confirmed COVID-19 cases. We thus observe that the rapidly evolving public sentiment reflects the public's response to the on-ground n-CoV2019 situation and the government response.

The changing discourse around n-CoV2019.
As can be observed from Figure 6, the discussion of certain aspects of COVID-19 on Twitter reduced over time, especially those related to hygiene, movement and nervousness, while the discussions regarding business and optimism increased. We observe that hygiene-related COVID-19 discourses on Twitter became very popular near the time when the lockdown was first imposed. However, over time discussions regarding hygiene have reduced on Twitter, which might be due to the topic being normalised over time.

Government Bulletins Content
We observe that the government bulletins (as described in Section 3.2) shared by the governments of all six states frequently use words related to medical emergency and health, as shown in Figure 7. The government bulletins released on n-CoV2019 by the Delhi and West Bengal governments have a higher frequency of linguistic features related to the topic 'healing' than those of other states. An interesting observation is that the state government bulletins in Odisha have a significantly higher inclination towards using words related to government, while for most other states, the primary focus is on medical emergency. Also, none of the government bulletins show 'fear' or 'confusion' related psycho-linguistic markers.

Granger's Causality Analysis
As a prerequisite for studying the causal mechanism between the time series on Delhi Bulletin and Delhi Tweets, both sets of data were subjected to the Augmented Dickey-Fuller (ADF) test of unit root, to see whether the series are stationary. In the formulation adopted for the ADF test, t stands for the time variable, Δ for the difference operator, and u_t for the disturbance term. The null and alternative hypotheses for the test are: H_0: τ = 0 (meaning that the series possesses a unit root and is, therefore, non-stationary); H_1: τ < 0 (meaning that the series does not possess a unit root and is, therefore, stationary).
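For reference, a commonly used general form of the ADF regression that is consistent with the symbols defined above is sketched below. The intercept \alpha, the trend coefficient \beta, the lag order p and the coefficients \gamma_i are assumptions here, since the deterministic terms actually included in the test are not stated; a variant without intercept and trend simply drops \alpha and \beta t.

```latex
\Delta Y_t = \alpha + \beta t + \tau\, Y_{t-1} + \sum_{i=1}^{p} \gamma_i\, \Delta Y_{t-i} + u_t
```

The test statistic is the t-ratio of the estimate of \tau, compared against Dickey-Fuller critical values rather than the usual t distribution; a sufficiently negative value rejects the unit-root hypothesis \tau = 0.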
For both the time series, the test was performed at levels as well as at first difference (Table 4). As per the table, the value of the test statistic τ for the time series (at level) on 'Help' in respect of Delhi Bulletin was computed to be -1.596, which failed to reach the critical values (-1.951 at 5% and -2.623 at 1% level of significance). Accordingly, we could not reject the null hypothesis of the presence of a unit root in the series. In other words, the series on 'Help' was non-stationary. However, at the first difference, the series was detected to be free from a unit root (τ = -6.081) and was, therefore, stationary in nature. A large majority of the time series on the rest of the variables of Delhi Bulletin (except for Fear, Sadness, Nervousness, Confusion, Fun, Positive Emotion and Economics, wherein stationarity was present at levels itself) showed this very type of behaviour. Notably, in respect of Delhi Tweet, the series on the entire set of variables were observed to be non-stationary at levels but stationary at the first difference. Consequently, for examining causality behaviour, we have uniformly considered the first-differenced series on all the variables in respect of both data sets. Here, we may mention that the ADF test could not be performed on the variable Sympathy in respect of the first set of data (because of the absence of variability), and this variable was therefore left out of the subsequent analysis on Granger's causality. For examining causality, each of the corresponding pairs of variables from the first-differenced data sets was subjected to the estimation of Equations 2 and 3, which were then compared for their predictive power through Wald's test. If Equation 3 turns out to be statistically superior to Equation 2 (thus implying that the current value of Y can be better predicted through its own past values as well as the past values of X than through the past values of Y alone), then we say X Granger causes Y.
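A standard pair of regressions matching this description, Y predicted from its own past values alone versus Y predicted from its own past values together with those of X, is sketched below. The coefficient names, the intercepts and the error terms e_t and v_t are assumptions introduced for illustration.

```latex
% Restricted model: Y explained by its own past values only
Y_t = a_0 + \sum_{i=1}^{p} a_i\, Y_{t-i} + e_t

% Unrestricted model: past values of X added
Y_t = b_0 + \sum_{i=1}^{p} b_i\, Y_{t-i} + \sum_{i=1}^{p} c_i\, X_{t-i} + v_t
```

Under this formulation, X Granger causes Y when the joint null hypothesis c_1 = \dots = c_p = 0 is rejected.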
The series Y and X were then interchanged and the process repeated so as to examine if Y Granger causes X. The optimum number p of lagged terms to be included was decided through the minimum Akaike Information Criterion (AIC), which, in the present analysis, generally turned out to be 1.
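The procedure above can be sketched in pure Python as follows. This is not the authors' code: it is a minimal illustration that first-differences both series, fits a restricted and an unrestricted least-squares regression, and compares them with an F-statistic on the residual sums of squares, which is a common way to carry out a Wald-type test of linear restrictions. The function names `ols_rss` and `granger_f`, the inclusion of an intercept and the default `lag=1` are assumptions.

```python
def ols_rss(rows, targets):
    """Residual sum of squares of a least-squares fit via the normal equations."""
    k = len(rows[0])
    # Augmented normal-equation matrix [X'X | X'y].
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda idx: abs(a[idx][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(k):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [v - factor * w for v, w in zip(a[row], a[col])]
    beta = [a[i][k] / a[i][i] for i in range(k)]
    return sum(
        (t - sum(b * v for b, v in zip(beta, r))) ** 2
        for r, t in zip(rows, targets)
    )


def granger_f(x, y, lag=1):
    """F-statistic for 'x Granger-causes y', computed on first differences."""
    dx = [cur - prev for prev, cur in zip(x, x[1:])]
    dy = [cur - prev for prev, cur in zip(y, y[1:])]
    times = range(lag, len(dy))
    targets = [dy[t] for t in times]
    # Restricted model: intercept plus own lags of dy.
    restricted = [[1.0] + [dy[t - i] for i in range(1, lag + 1)] for t in times]
    # Unrestricted model: additionally includes lags of dx.
    unrestricted = [
        row + [dx[t - i] for i in range(1, lag + 1)]
        for row, t in zip(restricted, times)
    ]
    rss_r = ols_rss(restricted, targets)
    rss_u = ols_rss(unrestricted, targets)
    dof = len(targets) - len(unrestricted[0])  # denominator degrees of freedom
    return ((rss_r - rss_u) / lag) / (rss_u / dof)
```

Calling `granger_f(bulletin, tweets)` and then `granger_f(tweets, bulletin)` on two equal-length daily series gives the statistic for each direction; the resulting value is compared against the F distribution with `lag` and `dof` degrees of freedom to obtain a p-value.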
As per the results of the analysis (Table 5), it was observed that in respect of the variable 'Help', the data set on Delhi Tweet (TWT) failed to induce any causality on the data set on Delhi Bulletin (BLT), because the F-value (= 1.194 at 1 & 30 d.f.) associated with Wald's test turned out to be statistically non-significant (p = 0.2832). On interchanging the two data sets, the finding remained virtually similar, thus implying that no causal linkage could be detected between the two data sets with respect to 'Help'.
In respect of 'Medical Emergency', the data set on Tweet induced significant causality (at 5% probability level) on the corresponding data set on Bulletin. On interchanging the two data sets, the strength of causality (from Bulletin to Tweet) became all the more robust (at 0.1% probability level). We may thus say that although there was an indication of bi-directional (or, equivalently, feedback) causality between the two sets of data in respect of 'Medical Emergency', the strength of causality was more pronounced from Bulletin to Tweet. Bi-directional causality (at 5% level) between the two sets of data was observed in respect of 'Health'. Very strong causality (at < 0.1% probability level) from Tweet to Bulletin was indicated in respect of the variables 'Hygiene', 'Leisure', 'Fun' and 'Government'. Significant (at 5% probability level) unidirectional causality (from Bulletin to Tweet) was also detected in respect of each of 'Death' and 'War'. For the rest of the variables, however, causal linkages between the two sets of data could not be established. Thus, on the whole, the direction and strength of causality between the two sets of data were peculiar to the variable under consideration.

CONCLUSION
We present a novel dataset consisting of more than 5.6 million n-CoV2019 related Indian tweets, with special emphasis on the tweets related to each of the Indian states. We further analyse the tweets to find the important topics and psycho-linguistic features discussed, and compare them both across states and over time using a time-series based approach. We also try to link the rapidly changing psycho-linguistic attributes of the public sentiment to the real-life on-ground situations arising due to COVID-19. In addition, we designed an interactive web portal, COVibes, which displays psychometric insights gained from the CoronaIndiaDataset at both a national and state level. This dataset and analysis technique can be used for further research into understanding public perceptions and taking more effective policy decisions. We restricted our work to the analysis of tweets and government bulletins in English. Future work in this direction can extend the scope of the analysis to various Indian languages.