Impact of COVID-19 on mental health in China: analysis based on sentiment knowledge enhanced pre-training and XGBoost algorithm

Coronavirus disease 2019 (COVID-19) is having a serious impact on people living in countries across the entire world. The global spread of this pandemic has led people to worry every day about losing their jobs or even being threatened by the virus. This pandemic has caused people to experience more serious psychological problems than we realized. However, there has been little research on how COVID-19 affects people's mental health. In this article, we use social text data about COVID-19 from Sina Weibo (the largest “tweet” platform in China; in the following, we also refer to Weibo posts as tweets) to explore the impact of COVID-19 on the mental health of Chinese people. First, we filter the tweet data by selecting examples that contain COVID-19 and COVID-19-related keywords. Next, we segment the filtered tweets, extract meaningful words, and construct a sparse word vector matrix as the measurement of every tweet. Then, for the model's labels, we use the sentiment knowledge enhanced pre-training model (SKEP), a deep learning framework published by Baidu, to measure the user's mental state. Through SKEP, we can obtain the probabilities of the user's positive and negative mental states. Finally, we use the XGBoost algorithm to study the relationship between the sparse word vector matrix and the mental health state of users. Our research shows that social text data can, indeed, reflect the mental health state of users to a large extent, and that social data can be used to explore the impact of COVID-19 on mental health, which can help frame public health policy.


. Introduction
COVID-19, caused by the coronavirus strain SARS-CoV-2, is currently an epidemic (1). The World Health Organization (WHO) declared that the COVID-19 outbreak has become a public health emergency of international concern (2). The epidemic not only has a direct impact on the physical health of millions of people but also has a huge influence on their mental health (3)(4)(5). The epidemic and lockdown will inevitably affect everyone's mental health, no matter how well the outbreak is contained.
A report released by the WHO in March 2022 shows that the COVID-19 epidemic has increased the mental pressure on people everywhere during the global pandemic. In 2020, the first year of the pandemic, there was a significant 25% increase in the global prevalence of anxiety and depression. Moreover, young people were particularly affected. In addition, women were more severely affected than men, and people with underlying diseases were more likely to have mental health problems. One of the main reasons for the widespread increase in mental health crises across the globe is that the social isolation brought about by the pandemic caused unprecedented pressure by limiting people's ability to work, to seek support from relatives, and to participate in community activities. According to the WHO, although 90% of countries have included mental health and psychosocial support in their COVID-19 response plans, huge gaps and concerns are still prevalent among different countries (6). Since the COVID-19 epidemic started 2 years ago, people in many countries have experienced more mental health problems. However, mental health services also had to face serious challenges and disruptions, leaving a huge unmet need to provide care and support to the most vulnerable people. During most of the COVID-19 pandemic, mental health services were the most severely disrupted of all the basic health services. Many countries also reported that some lifesaving mental health services, including suicide prevention, were seriously disrupted.
In Japan, one of the countries with the most detailed statistics and records of suicides in the world, the number of suicides in 2020 rose for the first time after 10 consecutive years of decline (7). Dr. Jia Wenting, a senior brain neurosurgeon and head of the Fangyuan Zaizai Clinic in Japan, said that the patients she met had various sources of psychological stress. These sources of stress included vaccine hesitancy, COVID-19 infection and isolation, loss of work and income, and family concerns.
Japan is not a special case. According to a survey conducted by the Ministry of Health of Singapore, in the first year of COVID-19 pandemic, 8.7% of the Singaporeans who were interviewed met the criteria of clinical depression, 9.4% met the criteria of anxiety, and 9.3% met the criteria of mild to severe stress (8). The main sources of psychological stress are the risk of COVID-19 infection among family members or friends, economic loss, and unemployment.
The cited examples illustrate the impact of the epidemic on mental health. To study this influence further, Gao et al. (9) and Hao et al. (10) collected information through questionnaires and surveys. However, such traditional methods face a major challenge: they cannot track mental health changes quickly and respond accordingly (11). Therefore, some scholars are now analyzing large amounts of social App data and using natural language processing (NLP) methods to infer the emotions of tweet publishers from text information (12). Through the analysis of emotional information, we can obtain indicators that reflect residents' mental health in a timely manner. These indicators range from 0 to 1. The closer the value is to 1, the stronger the optimism; the closer the value is to 0, the stronger the pessimism.
In this article, we chose to use XGBoost to explore the impact of COVID-19 on the mental health of Chinese residents. To obtain real-time text information, we chose to use the text information on social media Sina Weibo for analysis, which is the largest tweet platform in China. The data contain a total of 13 million blogs in China with geographical location tags. The data start date is 1 January 2020 and the end date is 1 March 2020. Furthermore, the detailed data description will be shown in the article. To make rational use of these data, XGBoost is used to analyze and process the data. We use XGBoost because of its strong interpretability and its suitability for exploring sparse data such as text data. These advantages come from its tree model structure. XGBoost can also provide us with the feature importance of the tree model, which we will analyze to check its rationality.

. Related work
Several studies have been conducted to investigate the impact of the COVID-19 pandemic on mental health and wellbeing. Li et al. (13), based on China's microblog platform, surveyed 17,865 active users and analyzed the data from 13 January 2020 to 26 January 2020, using the online ecological recognition model to obtain emotional indicators (e.g., anxiety, depression, identification, and Oxford happiness) and cognitive indicators (e.g., social risk judgment and life satisfaction). Zhang et al. (14) designed cross-sectional data using mobile phone App data and telephone interviews, and studied 263 individuals. Liu et al. (15)  In previous studies, some scholars used social media data to predict influenza activity (20) and outbreak (21). In the same way, we decided to use social media data as the data source for our experiments and evaluate the mental health status of Chinese people by these data.
Sina Weibo is the largest blogging platform in China, and millions of users are active on this platform every day (https://www.weibo.com/). Because of the huge and open nature of the platform, many Chinese people publicly publish their living conditions and inner thoughts on Sina Weibo. However, due to the design of the platform, it is very difficult for us to crawl data. Hu et al. previously collected data from Sina Weibo. The data period is from 00:00 (GMT + 8) on 1 December 2019 to 23:59 (GMT + 8) on 30 April 2020 (22). The data contain 33,519,644 pieces of data in total, of which 895,012 are geotagged data.

. Models
Using these social media data, we first apply the Sentiment Knowledge Enhanced Pre-training (SKEP) model, a sentiment analysis tool released by Baidu in 2020. It is an open-source Python library, and some studies have used it for mental health research (22). For each tweet, SKEP uses the input text information, together with prior sentiment knowledge, to return two probability values reflecting positive and negative emotions; the two probability values sum to 1. In this article, we use the probability of positive emotions as the expression of the user's mental health status corresponding to this blog.
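The labeling rule above can be illustrated with a minimal sketch. It does not call SKEP itself: it assumes, for illustration only, that a two-class sentiment model produces one raw score per class and that these are normalized with a softmax. The function names and the example scores are hypothetical.

```python
import math
from typing import Tuple


def softmax2(neg_score: float, pos_score: float) -> Tuple[float, float]:
    """Turn two raw class scores into (P(negative), P(positive)); they sum to 1."""
    m = max(neg_score, pos_score)  # subtract the max for numerical stability
    e_neg = math.exp(neg_score - m)
    e_pos = math.exp(pos_score - m)
    total = e_neg + e_pos
    return e_neg / total, e_pos / total


def mental_health_label(neg_score: float, pos_score: float) -> float:
    """Use the positive-class probability as the mental health indicator in [0, 1]."""
    _, p_pos = softmax2(neg_score, pos_score)
    return p_pos


# Hypothetical raw scores for one tweet (not real SKEP output).
p_neg, p_pos = softmax2(-0.4, 1.2)
print(p_neg + p_pos)  # the two probabilities sum to 1 (up to rounding)
```

A value of the label close to 1 then corresponds to a more positive state and a value close to 0 to a more negative one.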
Because the data contain geographic location tags, and because people in the same city should be subject to roughly the same anti-epidemic policy within a given time period, we analyze the geotagged blog information of the same city in the same time period together. We used XGBoost to analyze the impact of COVID-19 on the mental health of Chinese people. XGBoost is one of the boosting algorithms. The idea of boosting is to integrate many weak classifiers to form a strong classifier. Because XGBoost is a boosted tree model, it integrates many tree models to form a strong classifier. The tree model used is the Classification and Regression Tree (CART) model. XGBoost improves on the gradient boosting decision tree (GBDT) to make it more powerful and more widely applicable.
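To make the boosting idea concrete, the following is a minimal sketch of gradient boosting for squared error with one-split regression trees (stumps) on a single feature. It is a simplified stand-in, not XGBoost: it omits XGBoost's regularization and second-order terms, and the toy data, the number of rounds, and the learning rate are assumptions.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes the squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # no valid split: fall back to a constant prediction
        mean = sum(residuals) / len(residuals)
        return (float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


# Toy data: the target jumps from 0.2 to 0.8 when the feature exceeds 2.
xs = [0, 1, 2, 3, 4, 5]
ys = [0.2, 0.2, 0.2, 0.8, 0.8, 0.8]
model = boost(xs, ys)
print(predict(model, 0), predict(model, 5))
```

Each stump alone is a weak learner; the sum of many stumps, each correcting the errors left by the previous ones, approximates the target closely.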
For analysis using XGBoost, we collected the number of COVID-19 infection cases in each city during the period covered by the social data and constructed the dummy variable feature COVID_19_it, indicating whether the epidemic occurred in city i at time t, where 1 indicates occurrence and 0 indicates no occurrence. In addition, we constructed an epidemic number variable. We also use Jieba word segmentation (https://pypi.org/project/jieba/) on the Sina Weibo text to form a word vector matrix expressing people's mental health, and use sklearn's CountVectorizer (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to count the obtained words and select the 500 words that appear most frequently in the training set as features.

. Experiments and analysis
. . Data collection and description
As mentioned in Section 3.1, we use that data set for the experiment. The data span from 00:00 (GMT + 8) on 1 December 2019 to 23:59 (GMT + 8) on 30 April 2020, and a total of 895,012 geotagged tweets are included. The information description of each tweet is shown in Table 1. Furthermore, Table 2 shows a specific example of our experiment data.
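The dummy variable COVID_19_it described in the Models section can be sketched as follows. The sketch assumes that "occurrence" means at least one confirmed case reported for city i at time t; the city names, dates, and counts are made-up placeholders, not values from the data set.

```python
def build_covid_dummy(case_counts):
    """Map (city, date) -> 1 if any case was reported there and then, else 0."""
    return {key: 1 if count > 0 else 0 for key, count in case_counts.items()}


# Hypothetical case counts keyed by (city, date); placeholders only.
case_counts = {
    ("city_a", "2020-01-20"): 12,
    ("city_b", "2020-01-20"): 0,
    ("city_b", "2020-01-25"): 3,
}
covid_dummy = build_covid_dummy(case_counts)
print(covid_dummy[("city_b", "2020-01-20")])  # 0: no occurrence
print(covid_dummy[("city_b", "2020-01-25")])  # 1: occurrence
```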
In addition, we need to obtain the word frequency matrix of each tweet in the training set. The specific method has been given in Section 3. To have a general understanding of the data, we use Jieba to draw the word cloud. The drawing results are shown in Figure 2. It can be seen that the obtained word cloud reflects the mental health of the user who responded to this tweet to a certain extent.
After applying Jieba word segmentation (only the words corresponding to common nouns, proper nouns, verbs, gerunds, adjectives, and adverbs are retained as the cleaned data), we also use CountVectorizer to carry out word vector statistics and obtain the words with the highest frequency as features. The results are shown in Table 3.
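The feature construction can be illustrated with a small standard-library sketch. It assumes the tweets have already been segmented into tokens (the Jieba step is not reproduced), uses k = 2 instead of 500 so the result is easy to inspect, and builds dense lists rather than a sparse matrix. The function names and toy tokens are hypothetical.

```python
from collections import Counter


def top_k_vocabulary(tokenized_docs, k):
    """Return the k most frequent tokens across the training documents."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [token for token, _ in ranked[:k]]


def count_matrix(tokenized_docs, vocabulary):
    """Count, per document, how often each vocabulary word occurs."""
    index = {token: j for j, token in enumerate(vocabulary)}
    rows = []
    for doc in tokenized_docs:
        row = [0] * len(vocabulary)
        for token in doc:
            j = index.get(token)
            if j is not None:  # words outside the vocabulary are ignored
                row[j] += 1
        rows.append(row)
    return rows


# Toy, already-segmented training tweets.
train_docs = [["quarantine", "home", "home"], ["home", "food"], ["quarantine", "help"]]
vocab = top_k_vocabulary(train_docs, k=2)
X_train = count_matrix(train_docs, vocab)
# New text is vectorized with the vocabulary chosen on the training set.
X_new = count_matrix([["home", "travel"]], vocab)
print(vocab, X_train, X_new)
```

Fixing the vocabulary on the training set and reusing it for new text keeps the feature columns aligned between training and prediction.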

. . Hyperparameters and metrics
Metrics: The mean squared error (MSE) is the mean of the squared errors between the predicted data and the original data at corresponding points. It is calculated as follows:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

where $y_i$ is the original value, $\hat{y}_i$ is the predicted value, and $n$ is the number of points.

In this article, model training is carried out in a rolling manner. The training and testing processes are divided into the following steps:
1. Dataset division: each round of rolling determines the in-sample data set and the out-of-sample data set, and divides them in chronological order.
2. Feature and label generation: the in-sample text is vectorized, the features X and the labels y are generated, and the words used are recorded.
3. Training: k-fold cross-validation is conducted within the sample, and GridSearchCV from sklearn (sklearn.model_selection.GridSearchCV, scikit-learn 1.1.2 documentation) is used to find the optimal parameters.
4. Out-of-sample preprocessing: the out-of-sample text is vectorized based on the words used in the sample.
5. Prediction and factor construction: the optimal model obtained by cross-validation is used to predict the out-of-sample data, and the factor value describing the user's mental health is obtained.
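The metric and the chronological dataset division can be sketched as follows. The window lengths are not specified in the text, so the fixed-size window that rolls forward by one out-of-sample block is an assumption, as are the function names.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between original and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rolling_splits(n_periods, train_size, test_size):
    """Yield (in_sample, out_of_sample) index lists in chronological order."""
    start = 0
    while start + train_size + test_size <= n_periods:
        train = list(range(start, start + train_size))
        test = list(range(start + train_size, start + train_size + test_size))
        yield train, test
        start += test_size  # roll forward by one out-of-sample block


# Example: 10 time periods, 4 in-sample and 2 out-of-sample per round.
for train_idx, test_idx in rolling_splits(10, 4, 2):
    print(train_idx, test_idx)
```

In every round the out-of-sample periods come strictly after the in-sample periods, so no future information is used for training.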
Extreme gradient boosting (XGBoost) is a boosting ensemble algorithm: it builds a strong learner by combining multiple weak learners (such as decision trees) sequentially, so that each additional weak learner further reduces the loss function. We perform a grid search over all the hyperparameter combinations of the XGBoost model and use five-fold cross-validation to select the set of hyperparameters with the lowest average loss on the validation sets as the final hyperparameters of the model. The hyperparameter settings are shown in Table 4.
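The selection procedure can be sketched with the standard library alone. This is not the GridSearchCV call used in the study: the model here is a toy "shrunken mean" predictor standing in for XGBoost, and the parameter name `shrink`, the grid values, and the data are hypothetical.

```python
from itertools import product


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def grid_search_cv(fit, predict, X, y, param_grid, k=5):
    """Return the parameter combination with the lowest mean validation MSE."""
    names = sorted(param_grid)
    folds = k_fold_indices(len(y), k)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_losses = []
        for held_out in folds:
            held = set(held_out)
            train_idx = [i for i in range(len(y)) if i not in held]
            model = fit([X[i] for i in train_idx], [y[i] for i in train_idx], **params)
            preds = [predict(model, X[i]) for i in held_out]
            fold_losses.append(mse([y[i] for i in held_out], preds))
        score = sum(fold_losses) / len(fold_losses)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy stand-in model: predicts a mean shrunk toward 0 by the parameter `shrink`.
def fit_shrunken_mean(X, y, shrink):
    return sum(y) / (len(y) + shrink)


def predict_constant(model, x):
    return model


X = [[0]] * 10  # features are ignored by the toy model
y = [0.4, 0.6, 0.5, 0.5, 0.4, 0.6, 0.5, 0.5, 0.4, 0.6]
best_params, best_score = grid_search_cv(
    fit_shrunken_mean, predict_constant, X, y, {"shrink": [0, 5, 50]}, k=5
)
print(best_params)
```

The same loop structure applies to any model: every combination in the grid is scored by its average loss over the five held-out folds, and the combination with the lowest average is kept.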

. . Results
. . . The impact of COVID-19 epidemic on mental health
Figure 3 shows the user's mental health state inferred by XGBoost from the text information and the user's real mental health state obtained by SKEP. From the figure, we can conclude the following:
1. The COVID-19 pandemic has a significant negative impact on people's mental health, as shown by the clear downward trend in both the real and inferred mental health states over time.
2. The red-striped data in Figure 3 can be divided into two categories. First, Chinese traditional festivals show a significant decrease in users' mental health status, which may be due to the inability to visit relatives and friends during the pandemic. Second, when the number of people diagnosed with COVID-19 in the United States exceeded 150,000 in March, the users' mental health status also declined significantly. This observation may be due to the globalization of the pandemic and its impact on prevention and control measures.
Our explanation for this is given as follows: with the deepening of globalization, every resident on earth understands that the epidemic situation in other countries will have a significant impact on their own prevention and control measures. Therefore, the mental health status of residents also dropped sharply after the news emerged.

. . . Importance influence of words
In Figure 4, the importance of certain features is shown, with "stay at home and can not go outside," "confirmed cases," and "quarantine" being the top three important features. This fact reflects the significant impact of the policy of "home isolation and no going out" on the mental health of Chinese people, as well as the concern over the number of confirmed cases. Other words, such as "food," "Canada," "help," and "Spring Festival," are also representative of the living conditions in China during that time and can accurately reflect the mental health status of the Chinese people.

. Discussion
This study aimed to quantify the impact of the COVID-19 pandemic on people's mental health using social media data on Sina Weibo. By applying the natural language processing technology and a state-of-the-art deep learning framework (SKEP), as well as using the powerful XGBoost machine learning algorithm, we were able to analyze the results and provide empirical evidence of the impact of the pandemic on people's mental health.
Our findings indicate that the COVID-19 pandemic has had a significant negative impact on the mental health of people. This is consistent with previous research on the effects of the pandemic on mental health. We call on the general public to care more for the people around them and to work together to overcome the challenges presented by the pandemic.
We also found that Chinese traditional festivals are important for maintaining relationships between relatives and friends, and that the inability to visit loved ones during the pandemic had a significant impact on people's mental health. We recommend that policymakers take this into account when planning public health measures and suggest finding alternative ways for people to connect with each other during these festivals.
Overall, our study provides important implications for both the field of natural language processing and the field of public health, demonstrating how social text data can be used to measure and analyze the mental health of users during the pandemic, and how our findings can help public health policymakers to understand and improve the psychological wellbeing of the population.

. Conclusion and suggestions
. . Conclusion
In this article, our analysis of social media data from Sina Weibo shows that the COVID-19 pandemic has a significant negative impact on people's mental health. After applying the related natural language processing technology, XGBoost is used to analyze the results. The empirical results show that COVID-19 has a significant impact on people's mental health. Therefore, we call on the general public to care more for the people around them so that we can tide over the difficulties together. In addition, we also found that Chinese traditional festivals are important for maintaining relations between relatives and friends. Due to the pandemic, people cannot visit relatives and friends, and it is difficult for them to accept this truth. Therefore, their psychological state drops sharply. With the deepening of globalization, every resident on earth understands that the epidemic situation in other countries will have a significant impact on their own prevention and control measures. Therefore, serious news of the epidemic situation in foreign countries will also have an impact on the mental health of residents.

. . Strengths and limitations
This article has several strengths that contribute to the literature on the impact of COVID-19 on mental health. First, it uses a large and rich data source of social text data from Sina Weibo, which can capture the real-time and diverse opinions and emotions of users during the COVID-19 pandemic. Second, it applies a state-of-the-art deep learning framework (SKEP) to measure the user's mental state based on the sentiment knowledge, which can provide more accurate and fine-grained results than traditional methods. Third, it employs a powerful machine learning algorithm (XGBoost) to study the relationship between the word vector sparse matrix and the mental health state of the users, which can handle high-dimensional and sparse data efficiently and effectively.
However, this article also has some limitations that need to be addressed in future studies. First, it only focuses on one platform (Sina Weibo) and one country (China), which may limit the generalizability and applicability of the findings to other contexts and populations. Future studies should include more platforms and countries to compare and contrast the results. Second, it does not consider other factors that may affect the user's mental health, such as demographic variables, social support, and coping strategies, which may confound or moderate the impact of COVID-19. Future studies should take steps to control these factors or explore their interactions with COVID-19. Third, it uses a single indicator (the probability of positive or negative mental state) to measure the user's mental health, which may not capture the complexity and diversity of mental health issues. Future studies should use more professional and comprehensive mental health indicators, such as depression scales, anxiety scales, or stress scales.

. . Future vision
For future study, we plan to improve our research in three aspects. First, we aim to use more complete data sets that include not only Weibo, but also other platforms, such as WeChat, TikTok, and Kwai, where news messages are presented in text or short video form. This development will help us to further explore the impact of COVID-19 on mental health from different perspectives. Second, we intend to design our own model that can capture time and space information, such as the user's experience of isolation control and the user's location. We can use graph neural network to study the problem by considering the user information of the entire city. Third, we hope to use more professional mental health indicators for measuring the user's mental state, which will provide more value to study the problem. By doing so, we expect to have important implications for both the field of natural language processing and the field of public health, as our article demonstrates how social text data can be used to measure and analyze the mental health of users during the pandemic, and how our findings can help public health policymakers to understand and improve the psychological wellbeing of the population.

Data availability statement
The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.