ORIGINAL RESEARCH article

Front. Artif. Intell.

Sec. Natural Language Processing

Volume 8 - 2025 | doi: 10.3389/frai.2025.1623090

CoViNAR: A Context-Aware Social Media Dataset for Pandemic Severity Level Prediction and Analysis

Provisionally accepted
Soofi Shafiya1*, Mudasir Ahmad Wani2, Suraiya Jabin1, Mohammed ELAffendi2, J Jahiruddin1
  • 1Jamia Millia Islamia, Delhi, India
  • 2Prince Sultan University, Riyadh, Saudi Arabia

The final, formatted version of the article will be published soon.

The COVID-19 pandemic exposed critical weaknesses in global health management, particularly in resource allocation and demand forecasting. This work introduces an approach to improving pandemic preparedness through real-time social media analysis. Using SnScrape, we collected over 27.5 million tweets posted between November 2019 and March 2023 using COVID-19-related hashtags. Tweets from April 2021, a peak pandemic period, were selected to create the CoViNAR dataset. Context-aware filtering with BERTopic yielded a novel dataset of 14,000 annotated tweets categorized as "Need", "Availability", and "Not-relevant" (NAR). Various machine learning classifiers were trained on the CoViNAR dataset; the best classifier achieved 96.42% accuracy, 96.44% precision, 96.42% recall, and an F1-score of 96.43% on the test set. While training the NAR classifier, we experimented with three context-aware word embedding techniques, of which DistilBERT yielded the best performance. We then applied the NAR classifier in a temporal analysis of tweets from the US, UK, and India from November 2019 to March 2023. The strong correlation between NAR tweet counts and COVID-19 case surges highlights the potential of the proposed method as a proactive tool that health authorities could use for resource management during a pandemic.
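
The temporal analysis described above can be illustrated with a minimal sketch: count classified tweets per month and compute the Pearson correlation with reported case counts. This is not the authors' code. The function names `monthly_counts` and `pearson`, the monthly granularity, the choice to count only "Need" and "Availability" tweets, and all data values are assumptions made up for illustration.

```python
from collections import Counter
from math import sqrt


def monthly_counts(tweets, labels=("Need", "Availability")):
    """Count tweets per 'YYYY-MM' month whose predicted label is in `labels`.

    `tweets` is an iterable of (iso_date, predicted_label) pairs.
    Counting only Need/Availability tweets is an assumption of this sketch.
    """
    counts = Counter()
    for iso_date, label in tweets:
        if label in labels:
            counts[iso_date[:7]] += 1  # 'YYYY-MM-DD' -> 'YYYY-MM'
    return counts


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Toy data, invented for illustration only (not taken from CoViNAR).
tweets = [
    ("2021-03-14", "Need"),
    ("2021-04-02", "Need"),
    ("2021-04-10", "Availability"),
    ("2021-04-21", "Need"),
    ("2021-04-25", "Not-relevant"),
    ("2021-05-03", "Availability"),
    ("2021-05-19", "Need"),
]
cases = {"2021-03": 100, "2021-04": 300, "2021-05": 250}  # hypothetical case counts

counts = monthly_counts(tweets)
months = sorted(cases)
tweet_series = [counts.get(m, 0) for m in months]
case_series = [cases[m] for m in months]
r = pearson(tweet_series, case_series)
print(f"Pearson r between monthly NAR tweets and cases: {r:.3f}")
```

With real data, each country would get its own pair of series. A lagged correlation may also be worth checking, since tweets about needs could lead or trail reported cases.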

Keywords: BERTopic, COVID-19, Natural Language Processing, Social Media, DistilBERT, SVM

Received: 05 May 2025; Accepted: 31 Jul 2025.

Copyright: © 2025 Shafiya, Wani, Jabin, ELAffendi and Jahiruddin. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Soofi Shafiya, Jamia Millia Islamia, Delhi, India

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.