Hate speech detection in the Arabic language: corpus design, construction, and evaluation

Hate speech detection in Arabic presents a multifaceted challenge due to the language's broad and diverse linguistic terrain. With its multiple dialects and rich cultural subtleties, Arabic requires particular measures to address hate speech online successfully. To address this issue, academics and developers have used natural language processing (NLP) methods and machine learning algorithms adapted to the complexities of Arabic text. However, many proposed methods were hampered by the lack of a comprehensive dataset/corpus of Arabic hate speech. In this research, we propose a novel multi-class public Arabic dataset comprising 403,688 annotated tweets categorized as extremely positive, positive, neutral, or negative based on the presence of hate speech. Using our developed dataset, we additionally characterize the performance of multiple machine learning models for hate speech identification in Arabic Jordanian dialect tweets. Specifically, the Word2Vec, TF-IDF, and AraBERT text representation models were applied to produce word vectors, which provide the classification models with vector representations of the text. Seven machine learning classifiers were then evaluated: Support Vector Machine (SVM), Logistic Regression (LR), Naive Bayes (NB), Random Forest (RF), AdaBoost (Ada), XGBoost (XGB), and CatBoost (CatB). The experimental evaluation revealed that, in this challenging and unstructured setting, our gathered and annotated dataset was rather efficient and generated encouraging assessment outcomes. This will enable academics to delve further into this crucial field of study.


Introduction
In recent years, the spread, diversity, and ease of use of social media platforms (e.g., Facebook, Twitter, etc.) have facilitated the rapid dissemination of information and the quick growth of virtual communities (Kapoor et al., 2018). Social media has changed the typical daily routines of individuals, traditional business operations, and interaction patterns within various communities (Ngai et al., 2015). Despite the benefits of these advances, individuals and communities became vulnerable to new forms of harm and verbal aggression that were not common before. Hate speech has gained prominence as a form of discourse that targets individuals or groups based on race, religion, gender, sexual orientation, or other characteristics (Yalçınkaya, 2022). The number of content items on which Facebook took action due to hate speech worldwide between the 4th quarter of 2017 and the 1st quarter of 2023 is presented in Figure 1. Despite the decrease in numbers as governments worldwide relaxed COVID-19-related constraints, the number in the first quarter of 2023 is higher than in the corresponding interval of 2020 and more than double the number in the corresponding interval of 2019 (see footnote 1). Different forms of hate speech include harassment, cyberbullying, offense, and abuse (Fortuna et al., 2020; Omar et al., 2020). Harassment refers to persistent unwanted behavior that causes distress or fear, often involving repetitive and intrusive actions (Gilani et al., 2014). Cyberbullying specifically occurs in the digital realm, encompassing the use of technology to harass, intimidate, or demean others (Coban et al., 2023). Offense refers to actions or expressions that cause displeasure or upset, while abuse involves using power or authority to harm or control others (Husain, 2020).
Many recent studies have shown the relationship of hate speech to the increase in hate crimes worldwide (Nguyen, 2023). They have also shown its connection to the exposure of targeted individuals to discrimination, violations, and denial of their human rights (Chakraborty and Masud, 2022).
Social media can be very helpful for connecting people, increasing self-esteem (Zsila and Reyes, 2023), and serving as a platform for information exchange and self-expression (Siddiqui et al., 2016). Other uses of social media in societies include, but are not limited to, building communities and helping in emergencies (Akram and Kumar, 2017). On the other hand, social media may hurt mental health, as it may lead to stress, increased sadness and isolation (Zsila and Reyes, 2023), and addiction; it may also harm privacy and security, facilitate fraud (Akram and Kumar, 2017), and spread misinformation and hate speech. Social media has been used successfully in business for marketing and for identifying and engaging talent (Akram and Kumar, 2017). Other uses of social media in business include, but are not limited to, customer support, facilitating communication between employees, and employee self-development (Siddiqui et al., 2016). Furthermore, social media has been found to have positive value in higher education, particularly as a teaching and learning tool (Sobaih et al., 2016); it can increase peer interactivity and online knowledge-sharing behavior, which has a positive impact on students' engagement and can lead to better academic performance (Ansari and Khan, 2020). Also, the use of social media was found to increase students' acceptance of e-learning platforms (Alghizzawi et al., 2019).
The propagation of hate speech online continuously challenges policy-makers and the research community due to the difficulty of limiting the evolving cyberspace, the need to empower individuals to express their opinions, and the delay of manual checking (Jahan and Oussalah, 2023).
To reduce its risks and possible devastating effects on the lives of individuals, families, and communities, the NLP community has shown an increasing interest in developing tools that help in the automatic detection of hate speech on social media platforms (Husain and Uzuner, 2021), as the detection of hate speech can generally be modeled as a supervised learning problem (Schmidt and Wiegand, 2017). Several studies investigated the problem and contrasted various processing pipelines using various sets of features and classification algorithms [e.g., Naive Bayes, Support Vector Machine (SVM), deep learning architectures, and so on] (Jahan and Oussalah, 2023).
1 Annotation exam. Available online at: https://forms.gle/9e56l2j8vh9mnsiv9.
Fairly generic features, such as a bag of words or embeddings, resulted in reasonable classification performance, and character-level schemes outperformed token-level approaches (Schmidt and Wiegand, 2017). It is reported in the literature that even though information derived from text can be useful for detecting hate speech, it may be beneficial to use some meta-information or information from other media types (e.g., images attached to messages) (Jahan and Oussalah, 2023).
While several studies investigated anti-social behaviors such as abusive or offensive language, cyberbullying, and hate speech, a limited number of studies have contributed to hate speech detection in Arabic in general (Al-Hassan and Al-Dossari, 2019). At the time of writing, we are unaware of any study attempting to detect hate speech for the Arabic dialect used in Jordan. Compared to English, Arabic has fewer NLP resources. The existence of different dialects, combined with the richness and complexity of Arabic morphology, adds further challenges to Arabic NLP research (Albadi et al., 2018).
The main contributions of this work are as follows:
1. Constructing a public Arabic-Jordanian dataset of 403,688 annotated tweets labeled according to the appearance of hate speech as very positive, positive, neutral, or negative.
2. Comparing the performance of machine learning models for hate speech detection in Arabic Jordanian dialect tweets.
The remainder of this paper is organized as follows. Related work on Jordanian dialect datasets and Arabic hate speech detection is reviewed in Section 2. Section 3 details our methodology for constructing the new dataset, the preprocessing steps, and statistics. Section 4 describes in detail the architecture of the classification models, the conducted experiments, and the results, and then discusses and analyzes those results. Finally, Section 5 concludes our work and discusses future directions.

Literature review
With the rapid spread of social media platforms, the level of freedom has been elevated so that many people can give their opinions, advice, or criticism without borders. People with shy and conservative personalities have been able to speak up and give their opinions without fear of interruption or hesitation. The problem is that many people have abused this freedom by disregarding courtesy of speech and decent manners. Forms of hate speech, including cyberbullying, offensive talk, sarcasm, and harassment, are just a few examples of the abuse of freedom on social media (Omar et al., 2020).
This problem has motivated researchers to create methods to detect and stop such violations, which have a large negative influence on our societies, youth, and children. In this section, selected literature is introduced and discussed to illustrate the methods used in this area. The problem of hate speech has been considered from several perspectives: science, sociology, psychology, and even criminology. This research concentrates on the technical efforts in this area, i.e., Natural Language Processing (NLP) and Artificial Intelligence (AI), to detect such behavior.

Hate speech and related concepts
NLP is one of the common disciplines needed in the area of hate speech detection. Posts, tweets, comments, reviews, and most social contributions on social media are entered as text. People from all over the world can express their feelings in their own language and even dialect. No language standards are enforced on such platforms, and thus NLP tools have become essential in representing, understanding, and analyzing these inputs.
AI algorithms, either Machine Learning (ML) or Deep Learning (DL) algorithms, have been extensively applied as classification algorithms to detect hate speech in text extracted from social media (Husain and Uzuner, 2021; Yi and Zubiaga, 2023).
The most vital issue when tackling this problem is to work with high-quality hate speech corpora. In the literature, two streams can be distinguished. Many researchers use public corpora directed at hate speech in general or at a certain type of hate. Such corpora can be hard to find, especially in low-resource languages. Thus, most of the literature that adopts this stream works on English corpora, as in Mozafari et al. (2020), Aldjanabi et al. (2021), and Awal et al. (2021). Public Arabic hate speech corpora can also be found but rarely concentrate on particular Arabic dialects. Abuzayed and Elsayed (2020), Haddad et al. (2020), and Hassan et al. (2021), for example, have used the OffensEval 2020 dataset, which the shared task competition organizers provided. In another case, Alsafari et al. (2020b) proposed an Arabic hate speech corpus that they reused in further experiments in Alsafari et al. (2020a). Sections 2.2, 2.3 highlight literature that created hate speech corpora in different languages.
The next issue is representing the text (posts, tweets, comments, etc.) in a proper text representation (word embedding technique), enabling AI classifiers to handle it as proper input and thus produce the desired outputs. Any NLP task needs such text representation methods. In the literature, several word embedding techniques are used and, in some cases, compared in the same paper. Examples of such techniques are TF-IDF (Abuzayed and Elsayed, 2020), word2vec and some of its variations such as AraVec (Aref et al., 2020; Faris et al., 2020; Romim et al., 2021), and FastText (Alsafari et al., 2020a,b; Aref et al., 2020; Romim et al., 2021).
Using these corpora to detect hate speech on social media platforms is a classification problem that needs labeled data. Each text sample should be labeled using either a manual or an automatic annotation process. The number of classes varies from one study to another. Many papers use binary classes, labeling the samples with only two labels. Hate or Not hate is the most common binary labeling used in the literature, as in Alshaalan and Al-Khalifa (2020), Aref et al. (2020), Omar et al. (2020), Romim et al. (2021), Saeed et al. (2022), and Khezzar et al. (2023). Others used different labels for binary classification, such as clean or offensive (Alsafari et al., 2020b; Alsafari and Sadaoui, 2021) and hateful or normal (Salomon et al., 2022). Some researchers were more precise in identifying the labels according to the type of hate speech detected. For example, in Mursi et al. (2022), Islamic Radicalism is the type of hate to detect, and thus the binary labels are extremist or non-extremist. In Coban et al. (2023), the target was to detect whether cyberbullying terms exist in Facebook posts; thus, the binary labels used were cyberbullying or non-cyberbullying.
doi: 10.3389/frai.2024.1345445
Three-label corpora have also been introduced, with a third label that indicates either a neutral class (Faris et al., 2020) or an undecided one (Ameur and Aliane, 2021). Three labels have also been used to distinguish between hate and abusive classes in addition to clean or normal, as in Alsafari et al. (2020b), Alshalan and Al-Khalifa (2020), and Duwairi et al. (2021).
Other research used multi-label corpora, including fine-grained labels that identify the type of hate detected more specifically. In Anezi (2022), the arHate Dataset was created with the labels racism, against religion, gender inequality, violence, offensive, and bullying. These labels were selected to distinguish the precise type of hate speech. Additional labels, normal positive and normal negative, were added to indicate the existence of hate speech other than the ones mentioned previously. Other examples are Alsafari et al. (2020b), Ahmed et al. (2022), Beyhan et al. (2022), and Mollas et al. (2022), which used multiple labels to annotate their corpora, in some cases reaching eight different labels, depending on how much detail the labels were meant to express.
Several classifiers have been used in the literature for the classification task. Most studies compared different models to find the most suitable one(s) for the created or public corpora tested.
The classifiers used can be categorized into ML and DL algorithms. In Abuzayed and Elsayed (2020), both ML and DL classifiers were applied and compared. Fifteen traditional ML classifiers, such as SVM, RF, XGBoost, DT, LR, etc., were used, along with DL classifiers such as CNN and RNN. The comparison found that the best classifier was the hybrid CNN and RNN classifier.
In Althobaiti (2022), SVM and LR were compared with a BERT-based model, and the BERT model yielded the best results. That research also took a novel approach by including the emojis found in the tweets in hate speech detection. Mozafari et al. (2020)

Arabic hate speech corpora and detection systems
Arabic, as a low-resource language, needs more specialized hate speech corpora. The previous sub-section highlighted some research in this area. Nevertheless, hate speech datasets for Arabic dialects are not easily found in the literature.
This section discusses a sample of research that created Arabic hate speech corpora. Table 1 summarizes the main aspects of this sample.
Social media platforms have been considered a sanctuary where different types of people in society express their feelings. Many people post their social news and events, either happy or sad, to the public. Nevertheless, this publicity can encourage some indecent people to vent their negative feelings of hate, sarcasm, bullying, and others. Thus, social media platforms are considered the main sources of datasets (corpora) consisting of samples that can be used to train and test models for the hate speech detection task.
In the literature, it has been found that Facebook (Omar et al., 2020; Ahmed et al., 2022; Anezi, 2022), Twitter, Instagram, and YouTube (Omar et al., 2020) are some of the main sources of such data. As illustrated in Table 1, most of the research used Twitter as the social media source; this suggests that Twitter provides data to researchers more easily than other platforms, such as Facebook. Another reason researchers prefer to collect data from Twitter is that tweets mostly consist of short text, whereas other platforms, such as Instagram or YouTube, consist of data in the form of images and videos, which take more work to process. Also, some platforms, such as Facebook, Telegram, and Reddit, may have text content, but in most cases the text is long and can take longer to process.
The Arabic language poses many challenges for processing. Yet, standard Arabic has rules and grammar that can make text understanding and analysis easier. Arabic dialects, on the other hand, pose a hard problem for AI to distinguish and understand. Thus, collecting Arabic dialect data has been a hot research topic that Arab authors have considered when conducting NLP tasks, specifically hate speech detection.
As illustrated in Table 1, many researchers collected data written in Arabic script without concentrating on dialects, such as Aref et al. (2020), Faris et al. (2020), Omar et al. (2020), Ameur and Aliane (2021), and Khezzar et al. (2023). In other cases, researchers concentrated on dialects tied to a certain region or country within the Arab world. This helps researchers, when scraping social media, to search for keywords that are more closely related to the dialect. Levantine (Duwairi et al., 2021) and Gulf (Alsafari et al., 2020b; Alsafari and Sadaoui, 2021) are examples of dialects used by people in a wide region of the Arab world. So, when a researcher needs to collect data in the Levantine dialect, for example, they should add the desired locations to their query, including Jordan, Palestine, Syria, and Lebanon. If a researcher concentrates on a certain country, the location query includes only this country. Saudi (Alshaalan and Al-Khalifa, 2020; Alshalan and Al-Khalifa, 2020), Tunisian (Salomon et al., 2022), and Egyptian (Ahmed et al., 2022) are examples of such dialects. As for the Jordanian dialect, our work is, as far as we know, the first to tackle data in this dialect.
Another query parameter heavily used by researchers when scraping social media platforms is the time period. This allows researchers to study public opinion during a period when certain political or social events happened.
Since scraping social media platforms and annotating the data with proper labels is not easy, the size of such corpora is generally not large. Most of the listed corpora contain fewer than 10,000 annotated texts, while only three exceed this number. Thus, collecting and annotating over 400,000 tweets in our work is a vital contribution compared to other corpora proposed in the literature.
To evaluate the collected corpora, researchers have applied hate speech detection algorithms to them. Most of the literature has concentrated on DL algorithms, especially RNN and its variations, LSTM and GRU, in addition to DL transformer-based models, such as BERT, AraBERT, mBERT, and others. This is due to the special features of text data compared with other types, such as tabular data. Extracting text features depends on the relationships and associations between words in the same text, which are not necessarily adjacent. Thus, such classifiers can capture these features more efficiently than others. Nevertheless, ML classifiers have proven efficient when proper word embedding techniques are used to represent and extract the important features from a text. Table 1 summarizes the best classifiers used in the literature and their results.

Hate speech datasets for other languages
As mentioned above, many English hate speech corpora have been created for the hate speech detection task. Recent surveys review this literature, which proposed high-quality hate speech corpora that researchers can use for further investigation (Alkomah and Ma, 2022; Yi and Zubiaga, 2023). Nevertheless, corpora for low-resource languages other than Arabic can be hard to find.
Sizes, labeling processes, classes, and best classifiers are displayed in the table, summarizing the important aspects of this literature.
It is worth mentioning that in high-resource languages, such as English, researchers tend to concentrate on fine-grained classes that distinguish the types of hate speech, since the keywords indicating these classes can be classified more easily. Consequently, this enables researchers to create complex classifiers with multiple layers that may yield high performance (Mollas et al., 2022).
3 Methodology is section details the comprehensive methodology used to construct, annotate, and evaluate the Jordanian Hate Speech Corpus (JHSC) for detecting hate speech focused on the Jordanian dialect.Our approach includes rigorous data collection, careful pre-processing, manual annotation, exploratory data analysis, and performance assessment using hate speech detection models.e methodology used re ects the robustness, reliability, and applicability of the JSSC in developing research analyzing hate speech in the context of the Arabic language and dialects.Figure 2 illustrates the general methodology used to create JSSC and model hate speech based on it.

Data collection
The initial phase of constructing the Jordanian Hate Speech Corpus (JHSC) involved the collection of Arabic Jordanian dialect tweets from the Twitter platform. The tweets were collected from the beginning of 2014 to the end of 2022. To ensure the authenticity of the collected data, the following steps were applied:
1. Language filter: The search parameters were further refined by specifying the Arabic language, ensuring that only tweets written in Arabic were retrieved.
2. Location filter: After scraping a random sample of tweets, it was found that most tweets do not have the location field that was supposed to be populated in users' profiles. To overcome this issue, location-based filters from Twitter's advanced search were used. The searches focused on Jordan's main cities and regions, covering the 12 governorates of Jordan and including 20 cities and regions, as listed in Table 3.
3. Systematic temporal approach: The data collection process was organized over a period extending from the beginning of 2014 to the end of 2022. A monthly segmentation strategy was adopted, where the tweets for each year were extracted separately, month by month. This approach ensured the stability of the scraping process and the systematic accumulation of tweets spread over a longer period. Subsequently, the distinct groups from each month and year were combined into one dataset.
The initial dataset contained 2,034,005 tweets in the Jordanian Arabic dialect.
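The monthly segmentation strategy can be illustrated with a minimal sketch that enumerates one collection window per calendar month between 2014 and 2022. The paper does not publish its scraping code or queries, so the function names and the `build_query` helper are hypothetical, and the `near:`, `lang:`, `since:`, and `until:` operators are assumptions modeled on Twitter's advanced search syntax rather than the authors' actual queries.

```python
from datetime import date


def monthly_windows(first_year, last_year):
    """Yield (since, until) date pairs, one per calendar month.

    `until` is the first day of the following month, so windows do not overlap.
    """
    for year in range(first_year, last_year + 1):
        for month in range(1, 13):
            since = date(year, month, 1)
            until = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
            yield since, until


def build_query(city, since, until):
    # Hypothetical query string; the operator syntax is an assumption,
    # not the query used by the authors.
    return f'near:"{city}" lang:ar since:{since.isoformat()} until:{until.isoformat()}'


windows = list(monthly_windows(2014, 2022))
print(len(windows))                      # number of monthly windows
print(build_query("Amman", *windows[0]))
```

Collecting one window at a time keeps each request small, and the per-month results can be concatenated afterwards, as the text describes.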
It is worth mentioning that the collection process targeted public tweets only, so no invasion of privacy took place. As for the usage of these tweets, the collected dataset is used for exploratory study only, and no legal actions are entailed by the detection of hate speech, since such actions should be licensed and applied by official parties when the model is used in real life.

Data pre-processing and cleaning
To reduce noise in the data, several steps were performed to clean and process the dataset; these steps are illustrated in Figure 3. First, all duplicate and retweeted tweets were deleted, as recommended by Barbosa and Feng (2010) and Alayba et al. (2017). Next, the non-Arabic tweets were removed from the dataset, since we focused on Arabic-Jordanian tweets. Then, unnecessary tokens such as user tags, numbers, emails, URLs, HTML tags, and hashtags were removed because they might reduce the performance of the classifier (Refaee and Rieser, 2014; Al-Twairesh, 2016).
Although emoji convey feelings, they were removed from the dataset because keeping emoticons in dialectal Arabic tweets reduces the performance of the classifier (Refaee and Rieser, 2014; Al-Twairesh, 2016). This is because Arabic sentences are written from right to left, which reverses the emoticons, and because brackets in quotes can be confused with emoticons. After that, all extra whitespace, such as duplicate spaces, tabs, and newlines, was removed from the dataset.
Finally, very short tweets with two or fewer words were removed from the dataset. It is worth mentioning that stemming algorithms were not applied to the dataset because they do not work well with Arabic dialect words (Al-Twairesh, 2016). After applying the pre-processing and cleaning steps, the dataset has 1,824,220 tweets.
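The cleaning steps described in this section can be sketched with regular expressions, as below. This is an illustrative approximation, not the authors' implementation: the patterns, the `RT @` retweet convention, the emoji code-point ranges, and the choice to deduplicate on the cleaned text are all assumptions made for the example.

```python
import re

HTML_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
DIGIT_RE = re.compile(r"\d+")
# Approximate emoji ranges (assumption); a dedicated library would be more complete.
EMOJI_RE = re.compile("[\U0001F000-\U0001FAFF\u2600-\u27BF\uFE0F\u200D]")
ARABIC_RE = re.compile("[\u0600-\u06FF]")


def clean_tweet(text):
    """Strip HTML tags, URLs, emails, user tags, hashtags, numbers, and emoji."""
    for pattern in (HTML_RE, URL_RE, EMAIL_RE, MENTION_RE, HASHTAG_RE, DIGIT_RE, EMOJI_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())  # collapse duplicate spaces, tabs, and newlines


def preprocess(tweets, min_words=3):
    """Drop retweets, non-Arabic tweets, tweets with two or fewer words, and duplicates."""
    seen, kept = set(), []
    for raw in tweets:
        if raw.startswith("RT @"):  # assumed retweet marker
            continue
        cleaned = clean_tweet(raw)
        if not ARABIC_RE.search(cleaned):
            continue
        if len(cleaned.split()) < min_words:
            continue
        if cleaned in seen:
            continue
        seen.add(cleaned)
        kept.append(cleaned)
    return kept


sample = [
    "@user مرحبا بكم في عمان 123 https://t.co/abc #الأردن 😀",
    "مرحبا بكم في عمان",
    "RT @someone: مرحبا بكم في اربد اليوم",
    "hello world this is english",
    "مرحبا بكم",
]
print(preprocess(sample))
```

No stemming step appears in the sketch, in line with the decision above not to stem dialectal words.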

Data annotation
The annotation process is pivotal in creating the Jordanian Hate Speech Corpus (JHSC). It includes careful manual tagging of each tweet with sentiment categories specifically geared toward identifying instances of hate speech. This process contributes to the development and evaluation of hate speech detection models.

Annotation process stages
The annotation of tweets was done in two stages: a lexicon-based annotation stage and a manual annotation stage.
• Stage one-Lexicon-based annotation stage
In this stage, an Arabic hate lexicon from related research was used. This lexicon contains 357 terms that are considered hate or offensive terms (Mubarak et al., 2017); a sample of the lexicon terms is listed in Table 4. All tweets that contain any term from this lexicon were extracted to a separate sub-dataset. The new sub-dataset contains 557,551 tweets, around 30% of the cleaned dataset. The new sub-dataset was then processed through stage two of the annotation.
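A minimal sketch of the lexicon-based stage is given below. The matching rule (whole-token match after whitespace splitting) is an assumption, since the text only states that tweets containing any lexicon term were extracted; the lexicon entries and tweets are neutral placeholders, and `split_by_lexicon` is a hypothetical helper name.

```python
def split_by_lexicon(tweets, lexicon):
    """Return (candidates, rest): tweets with and without at least one lexicon term."""
    terms = set(lexicon)
    candidates, rest = [], []
    for tweet in tweets:
        if terms.intersection(tweet.split()):
            candidates.append(tweet)
        else:
            rest.append(tweet)
    return candidates, rest


# Placeholder data; the real lexicon of 357 terms is not reproduced here.
lexicon = ["term_a", "term_b"]
tweets = ["this has term_a inside", "a harmless tweet here", "term_b again today"]
candidates, rest = split_by_lexicon(tweets, lexicon)
print(len(candidates), len(rest))

# Share of the cleaned corpus routed to manual annotation, from the reported counts.
share = round(100 * 557_551 / 1_824_220, 1)
print(share)
```

Token-level matching avoids false hits inside longer words, but it would miss terms attached to Arabic clitics; substring matching is the obvious alternative with the opposite trade-off.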

• Stage two-Manual annotation stage
In this stage, the sub-dataset was labeled with four sentiment labels: negative, neutral, positive, and very positive. The meaning and examples of each label are given in Table 5.
The manual annotation process is designed to ensure accuracy and agreement between annotators. This stage was performed through the following tasks:

• Task one-Annotation guidelines
To enhance the reliability of annotations, a comprehensive annotation guideline was established with the cooperation of domain experts, specifically in Arabic language and linguistics. This guideline outlined specific criteria and linguistic indicators for each hate speech class, guiding annotators toward consistent and accurate labeling decisions.
• Task two-Hiring annotators team
The sub-dataset was manually annotated for hate speech by a team of annotators. Figure 4 illustrates the steps to perform this process.
The process started with an advertisement published on LinkedIn. The purpose was to find qualified personnel, mainly students, who could participate in the scraping and annotation parts of the project. Figure 5 displays a screenshot of this advertisement. The submitted applications were reviewed, and thirty applicants were interviewed. Twenty of them were selected after passing these interviews. The main researcher held a meeting with the interviewed annotators to clarify the project requirements and what was expected of them. Some candidates were directed to work on social media scraping, while others took a short training course to understand the required annotation task. Before starting the annotation task, the candidates took a test that assessed their understanding of the annotation guidelines presented during the training and of the annotation process. A link to the test is provided in Community standards enforcement report (footnote 2). Candidates who passed the test with a score of 70% or higher proceeded with the annotation process. To validate the annotation guidelines, the annotators, who were native Jordanian Arabic speakers, first participated in the following phases:
1. The annotators were given a training set of 100 tweets annotated by human experts.
2. The annotators then independently applied the guidelines to another test set of 100 tweets.
3. The annotators' annotations were compared with the experts' annotations. Differences were addressed through discussion, and the guidelines were modified as necessary.
In addition to the above, periodic meetings were held between the human experts and the annotators during the annotation process. Through these meetings, the work was closely followed, and issues that emerged during the annotation process, such as ambiguous expressions and discordant attitudes in the corpus, were discussed and addressed to ensure the quality of the sub-dataset. The inter-annotator agreement was computed to confirm the quality. Figure 6 briefly illustrates the main steps of the data annotation process.

Inter-annotator agreement
Figure 4. Annotation team selection.
Figure 5. Announcement for building annotators team.
Inter-annotator agreement (IAA) measures how well multiple annotators make the same annotation decision on the same data when doing the annotation task independently. It is an important metric in many natural language processing (NLP) tasks, such as text classification, sentiment analysis, and named entity recognition. Fleiss' kappa is a statistical IAA measure that takes into account the number of annotators and the number of classes. Fleiss' kappa was computed from a sample of 500 tweets, each annotated by three annotators who chose one of four classes: positive, neutral, negative, and very negative. The kappa value was 0.60, indicating a moderate level of agreement between the annotators (Landis and Koch, 1977).
In summary, Annotation Quality Control (AQC) helps ensure the quality of our annotations in three ways. First, the annotators were rigorously trained, as explained in the annotation team selection process, and IAA was used to measure the extent to which multiple annotators agree on the same annotation decision when working independently. Second, the expert team provided the annotators with well-defined annotation guidelines, as explained in the Annotation process stages section, so that the annotators understand the guidelines and can apply them consistently. Third, majority voting and expert evaluation were deployed in the annotation process: the majority decision is taken as the final annotation, and the expert evaluation provides feedback to the annotators. This helps to identify any discrepancies in the annotations and allows corrections to be made.
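For readers unfamiliar with the statistic, the following sketch computes Fleiss' kappa from a table of per-item category counts and maps the value to the Landis and Koch (1977) bands. It uses the standard definition, kappa = (P_bar - P_e) / (1 - P_e); the toy count table is invented for illustration and is not the paper's 500-tweet sample.

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of annotators who assigned item i to category j.

    Every row must sum to the same number of annotators.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Share of all assignments that went to each category.
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
    # Observed agreement for each item.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1)) for row in counts]
    p_bar = sum(p_i) / n_items        # mean observed agreement
    p_e = sum(p * p for p in p_j)     # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)


def landis_koch_label(kappa):
    """Agreement bands from Landis and Koch (1977)."""
    if kappa < 0:
        return "poor"
    for upper, label in ((0.20, "slight"), (0.40, "fair"), (0.60, "moderate"), (0.80, "substantial")):
        if kappa <= upper:
            return label
    return "almost perfect"


# Toy example: 5 items, 3 annotators, 4 classes (invented counts).
toy = [
    [3, 0, 0, 0],
    [2, 1, 0, 0],
    [0, 3, 0, 0],
    [0, 1, 2, 0],
    [0, 0, 0, 3],
]
print(round(fleiss_kappa(toy), 3))
print(landis_koch_label(0.60))
```

With three annotators per item, each row of the table sums to three, which matches the setup described above.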

Exploratory data analysis
The cleaned corpus has 1,824,220 tweets. Figure 7 shows the distribution of tweets from 2014 to 2022. The figure shows the high turnout of Jordanians on Twitter in 2014 and how it decreased by half in the following year. The past five years have been relatively stable despite the COVID-19 pandemic that swept the world between 2019 and 2022. In general, the pandemic is known to have increased participation on social media platforms. Still, the decline in the participation of Jordanians on Twitter can be attributed to the presence of other social communication platforms, in addition to the tightening of penalties in Jordan's cybercrime law.
The corpus is partitioned into two parts. Part one has 1,266,669 tweets, which will be used to build the Jordanian dialect language model. Part two has 557,551 tweets used to construct the hate speech Jordanian tweets dataset. Currently, this dataset consists of 403,688 annotated tweets, while the remainder is still undergoing the annotation process. Specifically, the dataset has 149,706 positive tweets, 126,297 offensive tweets, 7,034 very offensive tweets, and 120,651 neutral tweets.
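The reported counts can be cross-checked with a few lines of arithmetic, which also makes the class imbalance explicit. The snippet only restates numbers given in the text; the variable names are illustrative.

```python
# Counts as reported in the text.
part_one, part_two = 1_266_669, 557_551
class_counts = {
    "positive": 149_706,
    "offensive": 126_297,
    "very offensive": 7_034,
    "neutral": 120_651,
}

corpus_total = part_one + part_two
annotated_total = sum(class_counts.values())
shares = {label: 100 * n / annotated_total for label, n in class_counts.items()}

print(corpus_total, annotated_total)
for label, share in shares.items():
    print(f"{label}: {share:.2f}%")
```

The "very offensive" class makes up under 2% of the annotated tweets, which is worth keeping in mind when reading per-class evaluation results.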

Text representation
Building a hate speech detection system based on machine learning and deep learning requires numerical input features. Converting words into numbers allows machines to perceive and decode linguistic patterns, which is fundamental in most NLP tasks. This process is referred to as text representation. Even though it is an iterative process, it is crucial for selecting the features of any machine learning model or algorithm. Therefore, the input text must first be transformed into numerical features that can easily fit into machine learning algorithms.
Text representation can be divided into three categories: discrete text representation, distributed text representation, and advanced language models, as shown in Figure 8. Under each category of text representation, there are various techniques. In this paper, we focus on three popular techniques: Term Frequency-Inverse Document Frequency (TF-IDF) (Ramos et al., 2003), Word2Vec (Goldberg and Levy, 2014), and BERT text representation.
The idea behind TF-IDF is that each word's weight is determined by the word's frequency in a document and by how frequent that word is across the whole corpus. It takes the term frequency (TF) produced by a count vectorizer and multiplies it by the inverse document frequency (IDF) score. The resulting weights are low for highly frequent words such as stop-words. TF-IDF is simple and easy to understand and implement; unfortunately, it cannot capture the positional information of a word, and it is highly dependent on the corpus.
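To make the weighting concrete, here is a minimal sketch of TF-IDF in plain Python. It uses the textbook, unsmoothed form (relative term frequency times log of inverse document frequency); library implementations typically add smoothing and normalization, so their numbers will differ. The function name `tf_idf` and the English toy documents are assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs: list of token lists (already cleaned and tokenized).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / df[term]) for term in df}
    weights = []
    for doc in docs:
        tf = Counter(doc)
        # Relative term frequency multiplied by the IDF score
        weights.append({t: (tf[t] / len(doc)) * idf[t] for t in tf})
    return weights


docs = [
    ["the", "match", "was", "great"],
    ["the", "service", "was", "bad"],
    ["the", "great", "team"],
]
weights = tf_idf(docs)
```

In this toy corpus "the" appears in every document, so its IDF is log(1) = 0 and its weight vanishes, which mirrors the stop-word behaviour described above.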
Word2Vec is a word embedding model that generates a vector representation of a word (Alayba et al., 2017). Each word is represented by a vector of a defined size that captures its semantic and syntactic relationships with other words. The architecture of Word2Vec consists of an input layer, a single hidden layer, and an output layer. The network aims to learn the embedding vector for each word by learning the embedding and context weight matrices. There are two versions of Word2Vec. Continuous Bag of Words (CBOW) is an efficient choice for small datasets; its main idea is to predict the middle word from the surrounding context words. Skip-Gram, in contrast to CBOW, predicts the surrounding context words from a single word; it is suitable for large corpora but takes more training time (Alayba et al., 2017). The most important feature of Word2Vec is its ability to capture the syntactic and semantic relationships between words. Still, it handles out-of-vocabulary words poorly.
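The difference between the two Word2Vec variants is easiest to see in the training examples each one builds from a sentence. The sketch below only generates those examples; it does not train embeddings. The function name `training_pairs` and the window size are assumptions for illustration.

```python
def training_pairs(tokens, window=2):
    """Build CBOW and Skip-Gram training examples from one token sequence.

    CBOW: (context words, target word)  -> predict the middle word.
    Skip-Gram: (target word, context word) -> predict each neighbour.
    """
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        cbow.append((context, target))
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram


cbow, skipgram = training_pairs(["a", "b", "c", "d"], window=1)
```

CBOW yields one example per position, whereas Skip-Gram yields one example per (word, neighbour) pair, which is consistent with the longer training time noted above.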
Most recently, advanced text representation techniques have been proposed based on deep contextualized text representation, which allows the generated word vectors to capture the semantic meaning of the word in the text. The emergence of the Transformer and Attention model has sped up the presence of advanced text representations such as BERT and GPT models. In this paper, we used a version of the BERT model that was trained over a large corpus of the Arabic language.

Research methodology
Figure 9 depicts the research methodology conducted in this paper. In the first part of the methodology, the collected texts have been revised and filtered by first removing retweets to avoid redundancy. Text cleaning is essential in preparing text data for NLP and machine learning models. It involves preprocessing the text to remove noise, fix structural issues, and standardize the text format; this can help improve the classification model's performance and make the text easier to work with. Text data is often messy and unstructured and can contain a variety of issues that affect the performance of a classification model. These issues may include typos, misspellings, punctuation errors, and other irregularities that can confuse the model and make it difficult to understand the content of the text. Then, the URL addresses, emojis, and other unwanted symbols have been removed. We used regular expressions in Python to complete this job. Finally, the texts have been tokenized to prepare the data for text representation.
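The cleaning steps above can be sketched with Python's `re` module. The exact patterns used in the study are not given, so the ones below are assumptions: retweets are detected by a leading "RT @", user mentions are treated as unwanted symbols, and tokenization is a plain whitespace split.

```python
import re

# Illustrative patterns (assumptions, not the authors' exact expressions)
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_WORD_RE = re.compile(r"[^\w\s]")  # emojis, punctuation, other symbols
WHITESPACE_RE = re.compile(r"\s+")


def is_retweet(text):
    """Treat tweets starting with 'RT @' as retweets (assumed convention)."""
    return text.startswith("RT @")


def clean_tweet(text):
    """Remove URLs, mentions, emojis and symbols, then normalize whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_WORD_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


def tokenize(text):
    """Clean a tweet and split it on whitespace."""
    return clean_tweet(text).split()


raw_tweets = [
    "RT @someone: repeated content",
    "@user check https://t.co/abc 😀 hello!!",
]
tokenized = [tokenize(t) for t in raw_tweets if not is_retweet(t)]
```

Because `\w` in Python 3 matches Unicode letters, Arabic words pass through the symbol filter unchanged.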
In the second phase, the text of each message is transformed into a numeric vector using the text representation models discussed in the previous section. We have applied the TF-IDF and Word2Vec text representation techniques. In addition, we used the AraBert transformer to produce text representations. However, the latter is used only with neural network models.
For the AraBert model, we have used the pre-trained model as shown in Figure 10; we then fine-tune it on our corpus as shown in Figure 11. The transformer is trained over a large Arabic corpus during the pre-training process. The output of this process is the pre-trained transformer model, which is used later for fine-tuning on our collected dataset.
Numerous settings may be changed during the fine-tuning process (see Figure 11), including the optimizer, learning rate, number of epochs, and dropout value. As part of the fine-tuning procedure, we tested various optimizers, including SGD, ADAM, and AdamW. We experimented with several learning rate values, including 1 × 10⁻³, 1 × 10⁻⁴, and 1 × 10⁻⁵. We also experimented with the number of epochs, ranging from 1 to 5. To prevent wasting time and storage, the terminating conditions were carefully chosen. We have tried dropout values of 0.10, 0.25, and 0.50, with each value yielding a somewhat different outcome.
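For orientation, the search space described above can be enumerated as follows. The text does not state whether every combination was actually run, so the exhaustive product below is an assumption, as are the dictionary keys.

```python
from itertools import product

# Values taken from the text; the key names are illustrative
optimizers = ["SGD", "Adam", "AdamW"]
learning_rates = [1e-3, 1e-4, 1e-5]
epoch_counts = [1, 2, 3, 4, 5]
dropouts = [0.10, 0.25, 0.50]

# One dict per candidate fine-tuning configuration (full grid is an assumption)
grid = [
    {"optimizer": opt, "lr": lr, "epochs": ep, "dropout": dr}
    for opt, lr, ep, dr in product(optimizers, learning_rates, epoch_counts, dropouts)
]

print(len(grid), "candidate configurations")
```

A full grid over these values contains 3 × 3 × 5 × 3 = 135 configurations, which illustrates why carefully chosen terminating conditions matter for time and storage.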
Finally, the text representations aligned with extracted features are entered into the NN model, which is placed on top of the pre-trained AraBert model. Two dense layers with ReLU activation functions have been added to the NN model. Also, a dropout layer was added to avoid overfitting during the training process, and a linear layer was used to find a correlation between input vectors and output labels. The ReLU layer will reduce the computation time required for model training. Finally, we divided the entire dataset into 70% training data and 30% testing data for validation purposes.
Research methodology and the experiment framework.
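A 70/30 split of the kind described in the methodology can be sketched as below. The text does not say whether the split was stratified; the per-class sampling here is an assumption, chosen because it keeps the class proportions similar in both parts. The function name and seed are likewise illustrative.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_ratio=0.30, seed=42):
    """Return (train_indices, test_indices), sampling each class separately."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_ratio)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy labels: 100 tweets of one class, 50 of another
toy_labels = ["neutral"] * 100 + ["offensive"] * 50
train_idx, test_idx = stratified_split(toy_labels)
```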

Evaluation measures
Choosing proper evaluation metrics for classification problems is tricky, as every metric explains a specific part of the model performance. Wrong choices are likely to produce a poor explanation with deceptive performance. Therefore, five evaluation metrics that capture different aspects of classification model predictions have been used. Together, these metrics give a balanced view of the overall performance of the classification models. Since we have four class labels, we used weighted averages to aggregate all evaluation results. The most popular evaluation metrics are Recall and Precision. Recall, as shown in Equation 1, captures the proportion of actual hate speech that is correctly classified as hate speech. Precision, as shown in Equation 2, is defined as the proportion of tweets predicted as hate speech that are actually hate speech. The F1 metric, as shown in Equation 3, combines precision and recall into a single metric that works well with imbalanced data distributions. Finally, the accuracy metric shown in Equation 4 reflects the proportion of all correctly classified examples. In addition to the above metrics, we used Area Under Curve (AUC), which estimates the area under the ROC curve and is represented as a single value in the range [0, 1]. The ROC curve presents the trade-off between the true positive rate (recall) and the false positive rate. A model with a higher AUC is regarded as the superior model.

$$\mathrm{Recall} = \frac{tp}{tp + fn} \qquad (1)$$

$$\mathrm{Precision} = \frac{tp}{tp + fp} \qquad (2)$$

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (3)$$

$$\mathrm{Accuracy} = \frac{tp + tn}{tp + tn + fp + fn} \qquad (4)$$

Where tp (true positive) is the number of hate speech tweets predicted as hate speech, tn (true negative) is the number of non-hate speech tweets predicted as non-hate speech, fp (false positive) is the number of non-hate speech tweets predicted as hate speech, and fn (false negative) is the number of hate speech tweets predicted as non-hate speech.
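The per-class metrics and their weighted aggregation can be sketched in plain Python as follows. This is an illustrative implementation of the standard definitions, with each class treated as the "positive" class in turn; the function names and toy labels are assumptions.

```python
def per_class_scores(y_true, y_pred):
    """Precision, recall, F1 and support for every class (one-vs-rest)."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": tp + fn}
    return scores


def weighted_average(scores, metric):
    """Average a metric over classes, weighted by the number of true samples."""
    total = sum(s["support"] for s in scores.values())
    return sum(s[metric] * s["support"] for s in scores.values()) / total


def accuracy(y_true, y_pred):
    """Proportion of all examples that were classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = ["negative", "negative", "neutral", "positive"]
y_pred = ["negative", "neutral", "neutral", "positive"]
scores = per_class_scores(y_true, y_pred)
```

With support-based weights, the weighted-average recall coincides with overall accuracy, so the two figures are not independent pieces of evidence.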

Experiments and results
This section shows the empirical results of building the hate speech detection models. Three text representation models have been used to generate word vectors, namely AraVec, TF-IDF, and AraBert. AraVec is a Word2Vec model trained over a large Arabic corpus. AraBert is a BERT-like transformer that has been trained over a large Arabic corpus. These models give us the text representation as vectors to feed into classification models. We have used seven machine learning classifiers: Support Vector Machine (SVM), Logistic Regression (LR), Naive Bayes (NB), Random Forest (RF), AdaBoost (Ada), XGBoost (XGB), and CatBoost (CatB).

Experimental setup and hyperparameter tuning
Table 6 shows each classifier's search parameters and best parameters. We identified a list of values for each configuration parameter, then we used the grid search algorithm with 5-fold cross-validation to select the best configuration for each classifier.
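The selection procedure can be outlined as below: every parameter combination is scored on five folds and the combination with the best mean score is kept. This is a generic sketch, not the authors' code; the scoring callback, the parameter name `C`, and the dummy scores are hypothetical placeholders for training and evaluating a real classifier.

```python
from itertools import product


def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous (train, validation) folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, val))
        start += size
    return folds


def grid_search(param_grid, n_samples, score_fn, k=5):
    """Return the parameter combination with the highest mean fold score.

    score_fn(params, train_idx, val_idx) is a user-supplied callback that
    trains on train_idx and returns a score on val_idx (hypothetical here).
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = [score_fn(params, tr, va)
                       for tr, va in k_fold_indices(n_samples, k)]
        mean_score = sum(fold_scores) / len(fold_scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Dummy scorer standing in for real training: prefers C close to 1.0
def dummy_score(params, train_idx, val_idx):
    return -abs(params["C"] - 1.0)


best, score = grid_search({"C": [0.1, 1.0, 10.0]}, n_samples=12, score_fn=dummy_score)
```

In practice the data would be shuffled (or stratified) before folding; the contiguous folds here keep the example short.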

FIGURE 10
Pre-trained process of transformer.

FIGURE 11
Fine tuning process.

Results
To investigate the quality of the collected data, in addition to the quality of the annotation process, we conducted comprehensive experimentation on building a hate speech detection system using multiple machine learning algorithms and two main text representation techniques: Word2Vec (W2V) and TF-IDF. In addition, we used a BERT-based Arabic language model called AraBert. As explained in the research methodology, we split the dataset into 70% training and 30% testing; then, we used seven machine learning algorithms and the training dataset to build different classifiers on features extracted with the W2V and TF-IDF techniques. Finally, all constructed models have been evaluated on the testing set using multiple evaluation metrics.
To facilitate presenting the results, we organized all results into different tables based on the class labels in the dataset. Since we have four classes, we show the evaluation results for each class label. Each table shows the performance of each machine learning model with each text representation technique. Then, we added a table to summarize the overall results using the weighted average aggregation method. Table 6 shows the evaluation results for the Negative class label. We omitted the accuracy and AUC metrics because they are aggregated and not calculated individually for each class label. The bold text represents the best results between TF-IDF and W2V for each evaluation metric. The bold and red text represents the best machine learning model for each evaluation metric. From the table, we can generally observe that W2V is more suitable for our text than TF-IDF. It is widely acknowledged that W2V can produce good text representations when the corpus vocabulary contains over 25,000 words, as in our case. Therefore, the machine learning algorithms that use W2V produce better results than those that use TF-IDF. On the other hand, if we look at the machine learning algorithms, we notice instability in terms of performance, such that we cannot identify a single best model. For the Recall metric, SVM+TF-IDF is the best, whereas for Precision, NB+W2V is the best. This contradiction forces us to consider multiple options as good candidates. Finally, the best recall, obtained by SVM+TF-IDF, is 65%, which means that the model correctly identifies 65% of the tweets in the Negative class.
Table 7 presents results for the Neutral class label. Generally, the results are poor because the best recall or precision score is relatively low. Interestingly, the results here are more stable than for the Negative class label. Also, we found that W2V always produces good text representations for all machine learning models. If we compare the evaluation results of W2V and TF-IDF, we can see a big difference, suggesting that TF-IDF is inappropriate for this kind of hate speech corpus. Concerning the machine learning models, we cannot identify one best model but multiple ones, depending on the evaluation metric. For example, NB+W2V works well under the Recall metric, whereas CatBoost+W2V works well under the Precision metric. If we take the F1 metric as a compromise, we can see that NB and CatBoost with W2V are the best models.
Table 8 presents results for the Positive class label. We see the same trend as for the Neutral class label but with different best machine learning models. First, we can confirm that W2V is a good text representation across all models, and CatBoost is the most accurate and stable model under the three evaluation metrics. The overall results for the Positive label are good compared to the Neutral label.
Finally, the evaluation results for the "Very Positive" class label are very poor, as shown in Table 9. One reason is the imbalanced nature of the dataset, which means that there is a big difference in the number of samples in each class label. Figure 12 shows the class distribution of our dataset, where the imbalance between class labels is visible. The "Very Positive" class label is the minority class. Therefore, the performance of the machine learning models on this label was very poor. Also, there are no stable results across the evaluation metrics, so judging which machine learning model is superior is not easy. To get insights from the above results, we aggregate all evaluation results using a weighted average that considers the class distribution; the final calculation is shown in Table 10. We can see that W2V is generally the best text representation for our corpus. All machine learning models perform relatively well. Amongst them, CatBoost is the most stable and accurate model.
Concerning the transformer models, we fine-tuned three Arabic language transformers (AraBERT, MARBERT, and CamelBERT) on our Arabic hate speech corpus; then, we built a neural network model based on the CLS embedding. It is important to note that each transformer uses its own tokenizer and produces its text representation as output through the CLS embedding. Then, this embedding vector is fed into the fully connected neural network. The models have been evaluated on the testing corpus using the same evaluation measures, and the results are shown in Table 11. We can see that the transformers, especially CamelBERT, can learn text representations better than the W2V and TF-IDF techniques and produce good results compared to the weighted average results of the machine learning models.
To conclude, the collected data and the annotation process were appropriate, and the obtained evaluation results show good performance for this complex and unstructured domain. We should also pay attention to the complexity of processing natural Arabic text. For example, the spelling of a word can differ from one sentence to another, which changes the meaning, and there are many different Arabic dialects, even in the same country, which makes it harder to understand the meaning of a sentence; the word diary can also change the meaning.

Conclusion and future work
In this study, we address the intricate challenge of hate speech detection in Arabic, a language with a wide variety and nuanced cultural characteristics. This study intends to aid in the fight against hate speech in Arabic that is spread online. A notable resource in this field is the creation of a new multi-class Arabic dataset with over 400,000 annotated tweets that have been classified by sentiment. Additionally, using text representation techniques, including Word2Vec, TF-IDF, and AraBert, and seven machine learning classifiers, we evaluated the effectiveness of several machine learning models in detecting hate speech in tweets written in the Arabic Jordanian dialect. Our empirical findings indicated our dataset's usefulness and showed how hate speech can be identified in this difficult, unstructured environment. Although this work makes significant advancements in the field of Arabic hate speech detection, several areas remain open for further investigation.
In the future, we want to increase the size and diversity of our dataset by including Arabic dialects beyond those of Jordan and neighboring countries, improve contextual analysis, create real-time detection systems, look into user-specific detection, and address bias and fairness concerns. By promoting a safer online environment, these initiatives will help develop more effective and culturally relevant solutions for addressing hate speech in Arabic.
Another suggested line of future work is to examine multilingual and cross-lingual models. Nevertheless, several challenges should be considered when tackling such problems. These include the performance variability between high- and low-resource languages and the possibility of losing language-specific nuances, in Arabic for example, when using models pre-trained on other languages. There are also generalization challenges between languages that differ in structure, writing style, and other characteristics.

FIGURE 1
Number of content items actioned for hate speech on Facebook worldwide between 4th quarter 2017 and 1st quarter 2023.

FIGURE 3
Data pre-processing and cleaning steps.

FIGURE 7
Distribution of collected tweets throughout the years 2014 to 2022.

FIGURE 12
Class distribution of class label.

TABLE 2 Summary of selected related literature.
Methodology for creating JHSC and model hate speech based on it.