Investigating machine learning and natural language processing techniques applied for detecting eating disorders: a systematic literature review

Recent developments in the fields of natural language processing (NLP) and machine learning (ML) have shown significant improvements in automatic text processing. At the same time, the expression of human language plays a central role in the detection of mental health problems. Whereas spoken language is implicitly assessed during interviews with patients, written language can also provide interesting insights to clinical professionals. Existing work in the field often investigates mental health problems such as depression or anxiety. However, there is also work investigating how the diagnostics of eating disorders can benefit from these novel technologies. In this paper, we present a systematic overview of the latest research in this field. Our investigation encompasses four key areas: (a) an analysis of the metadata from published papers, (b) an examination of the sizes and specific topics of the datasets employed, (c) a review of the application of machine learning techniques in detecting eating disorders from text, and finally (d) an evaluation of the models used, focusing on their performance, limitations, and the potential risks associated with current methodologies.


Introduction
Recent reports in the broad media about the latest conversational chatbots, which can generate human-like texts in response to user questions, have made natural language processing (NLP) known to the general public. Yet the possibilities of this field go far beyond text generation and chatbots. Classifying texts into two (or more) groups and automatically extracting indicators that suggest that a text snippet belongs to one of the groups is also a common task. In particular, when using machine learning, this allows the identification of patterns that might differ from what a human might detect but that are nonetheless effective in separating the groups.
Meanwhile, in clinical practice in mental health, inventories with scaling questions are often used for diagnosis. Such inventories have limitations, including, for example, defensiveness (the denial of symptoms) or social bias, which can influence the results of the questionnaires (1). In these cases, an automated text analysis applied to specific open questions or interview transcripts can provide a further source of information about the patient's condition, one that is more resistant to manipulations such as those arising from defensiveness.
Defensiveness is common amongst those afflicted with eating disorders (EDs). Respondents to a survey investigating the denial and concealment of EDs (2) reported a variety of attempts to hide the respective ED. Furthermore, the authors of the study state that such methods were described as deliberate strategies. This makes it challenging to use clinical instruments in which an inventory item contains obvious indications of which options to choose in order to obtain a specific result.
EDs generally manifest as unhealthy eating habits and disturbances in behaviors, thoughts, and attitudes towards food, causing in some cases extreme weight loss or gain. These disorders not only impact mental health but also have physical effects (3). EDs are classified in category F50 of the ICD-10 and comprise different disorders, including anorexia, bulimia, and overeating 1 . A study conducted by Mohler-Kuo et al. (4) in Switzerland found that the lifetime prevalence of any ED is 3.5%. Another survey, investigating the lifetime prevalence of EDs in English and French studies from 2000 to 2018, found weighted means of 8.4% for women and 2.2% for men (5).
The power of natural language processing has already been applied to the field of mental health, especially in research. Feelings and written expression are closely correlated: an analysis of student essays has shown that students suffering from depression use more negatively valenced 2 words and more frequently use the word "I" (6). Different approaches have explored the use of automated text analysis for tasks such as the detection of burnout (7), depression (8, 9), the particular case of post-partum depression (10, 11), anxiety (12), and suicide risk assessment (13, 14). Often, such methods are based on anonymized, publicly available online data; only little work makes use of clinical data. Furthermore, the English language has been the primary focus, even though these methods can be highly language-dependent, meaning that data and methods should be carefully reviewed when adapting them to local languages. This is relevant, as it has been shown that adapting to the patient's language is beneficial in mental health diagnostics and treatment (15). In our view, one aim of such technologies should be to explore ways to support clinical practitioners in their daily work and provide them with additional sources of information to consider. Therefore, we often refer to such solutions as Augmented Intelligence 3 rather than Artificial Intelligence, as they aim to empower humans rather than replace them.
Despite existing work in the field of ML and NLP for depression, anxiety, and suicide risk assessment, a detailed systematic literature comparison on the automatic detection of EDs using NLP technologies, covering both clinical and non-clinical data, has been lacking. A recent survey (16) investigated the use of natural language processing for mental illness detection. The majority of the identified results (45%) concerned depression, whereas only 2% were about eating disorders in general and 3% about anorexia. Whereas the broad scope of that survey provides a generous overview of the research landscape, it does not examine the case of eating disorders in detail.
In this paper, we have undertaken a systematic literature review to address this research gap, following the Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA) guidelines (17) to ensure a well-structured and transparent methodology.
We contribute to the field by (a) analyzing the metadata of published papers to understand the current trends and methodologies, (b) examining the sizes and targeted topics of the datasets used in these studies, (c) reviewing how machine learning techniques are applied to detect eating disorders from textual data, and (d) evaluating the performance, limitations, and potential risks of the models deployed in this domain.
Our research is guided by specific questions, structured around four distinct perspectives, which collectively form the core of our investigative approach.
• Demographical questions (DemRQ), focusing on metadata aspects of the papers:
• DemRQ1: When was the paper published?
• DemRQ2: From which countries were the contributors of the papers included in this study?
• Input questions (InputRQ), focusing on the data used: the languages considered (InputRQ1), the dataset sizes (InputRQ2), the data sources (InputRQ3), and the targeted types of ED (InputRQ4).
• Architectural questions (ArchRQ), focusing on the feature extraction techniques (ArchRQ1) and machine learning methods (ArchRQ2) applied.
• Evaluation questions (EvalRQ), focusing on the reported performance (EvalRQ1) and the limitations of the proposed approaches (EvalRQ2).

The article is structured as follows: first, we describe our methodology, including the study design and the paper selection process. We then present the results of the literature search and the findings of our review. Finally, we summarize our results and describe perspectives for future research in the field.

Study design
To answer our research questions, we conducted a structured literature review (SLR) following the Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA) guidelines (17). These include standards for literature search strategies and criteria for the inclusion or exclusion of gathered works in the final review.

Literature search strategy
In accordance with PRISMA standards, we set an 8-year time span (2014-2022) for searching for documents related to our research scope. We chose 2014 as the starting year mainly because Bellows et al. (18) conducted a study on automatically detecting binge eating disorder using clinical data in that year, which we deem to be the initial research in the field. We then compiled a list of databases to be searched. In addition, to conduct our database search efficiently, we compiled a list of keywords and conditions. These keywords are relevant to the research topic of EDs and their detection using NLP and machine learning techniques. Furthermore, the list included specific terms related to social media and online social networks in order to identify studies that explore the use of social media for the early detection of EDs, which is an ongoing research interest. The final query is presented below: (eating disorder OR anorexia OR binge eating OR bulimia OR overeating) AND (natural language processing OR NLP OR text mining OR inventories OR machine learning OR artificial intelligence OR automatic detection OR early detection OR social media OR online social network OR clinical).
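For reproducibility, the boolean query above can be assembled programmatically from the two keyword groups; the following stdlib-only sketch simply reconstructs the query string used in this review:

```python
# The two keyword groups behind the final search query.
ed_terms = ["eating disorder", "anorexia", "binge eating", "bulimia", "overeating"]
method_terms = [
    "natural language processing", "NLP", "text mining", "inventories",
    "machine learning", "artificial intelligence", "automatic detection",
    "early detection", "social media", "online social network", "clinical",
]

def or_group(terms):
    """Join a list of terms into a parenthesized OR clause."""
    return "(" + " OR ".join(terms) + ")"

# Combine both OR groups with AND, as in the review's final query.
query = or_group(ed_terms) + " AND " + or_group(method_terms)
print(query)
```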
Using the aforementioned search keywords and conditions, we retrieved research articles in which NLP techniques have been used for the detection of EDs from clinical and non-clinical data. The detailed workflow is depicted in Figure 1, and the corresponding PRISMA flow diagram for this SLR is shown in Figure 2.
The initially proposed search query identified a large number of papers. Through manual analysis we explored options for defining a more restrictive query while still capturing the relevant papers, which turned out to be challenging. We therefore adapted our method to consider the first 100 elements returned by the search query on each database, sorted by relevance. This furthermore allowed us to apply the same methodology to all three data sources, including in particular Google Scholar, whose search functionalities are limited compared to databases like PubMed and for which we thus had to limit the number of items to be reviewed. Given the interdisciplinarity of our approach, we wanted to include Google Scholar to target a vast number of sources and ensure that the most relevant work could be included.
A Python script was used to screen the articles for duplicates. As a result, 1 article was excluded from further consideration, leaving a total of 299 articles for further analysis (see Figure 2). To refine the results further, a manual title scan was performed to exclude articles that were not pertinent to the research topic. This resulted in the exclusion of 237 articles, leaving a total of 62 for further analysis. Additionally, a manual scan of the abstracts of the remaining 62 articles was performed to exclude any that were not relevant to the study. This process resulted in the exclusion of an additional 30 articles, leaving a total of 32 for inclusion in the final analysis. After thoroughly reading and evaluating these 32 articles, 27 were selected as relevant to the researched topic (according to the criteria from Table 1). These chosen articles were deemed to possess high relevance and reliability for this SLR. Finally, we scanned the reference sections of the articles included in our survey and identified relevant literature that may have been missed in the initial database search. This added n=18 articles to the studies that were finally included in the review (n=45). The process is illustrated in Figure 2.
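The original duplicate-screening script is not published; a minimal sketch of such a script, under the assumption that records are identified by their titles, could look as follows:

```python
import re

def normalize(title):
    """Lowercase a title and collapse punctuation and whitespace,
    so that trivially differing duplicates compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def deduplicate(records):
    """Keep only the first record for each normalized title."""
    seen, unique = set(), []
    for rec in records:
        key = normalize(rec["title"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

# Invented example records: the second entry duplicates the first.
records = [
    {"title": "Detecting Anorexia on Social Media"},
    {"title": "detecting anorexia on social media."},
    {"title": "Early Detection of Eating Disorders"},
]
print(len(deduplicate(records)))  # the duplicate is dropped
```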

Inclusion and exclusion criteria
Table 1 outlines the predefined inclusion and exclusion criteria that guided the selection of related studies for the review. These criteria were established in advance to help simplify the process of identifying and selecting relevant papers. In particular, papers that focused solely on the psychological aspects of EDs and did not consider the use of automated text analysis technologies were excluded from the review. By adhering to these criteria, we were able to select the relevant papers more effectively and efficiently.

Results
In this section, we provide a thorough review and analysis of the research studies included in this systematic literature review.

Criteria | Decision
The predefined keywords appear in the title, keywords, or abstract of the paper | Inclusion
The paper is written in the English language | Inclusion
The paper targets languages other than English | Inclusion
Papers that are duplicated within the search documents | Exclusion
Papers that do not make use of automated text analysis | Exclusion
Papers that deal with other (non-textual) types of data | Exclusion
Papers published before 2014 | Exclusion

Demographical research questions
Figure 3 shows the yearly distribution of the selected research work (DemRQ1). The data suggests a growing interest in this topic in recent years. This is in line with the findings of Zhang et al. (16), who found an upward trend over recent years in using NLP and machine learning methods to detect mental health problems. Notably, we highlight a prominent peak in 2018 and 2019, which coincides with the emergence of tasks related to EDs in the eRisk competitions.
We also examined the geographical distribution of the authors' affiliations in the selected studies (DemRQ2). As visualized in the heat map in Figure 4, 7 of the selected studies each came from the USA and Spain, and 5 each from Mexico and France.
Of the 45 selected studies, 24 were results from the eRisk lab 4 , hosted by the CLEF conference since 2017. This academic research competition focuses on the development and evaluation of text-based risk prediction models for social media. Each year, the lab provides a shared task framework in which teams of participants develop NLP techniques to automatically identify and predict the risk of different mental illness behaviors from social media data, including eating disorders. Participants are provided with a training dataset and a test dataset, and the performance of their models is evaluated in two categories: performance and latency. The eRisk lab provides a unique opportunity for researchers to collaborate and innovate in the field of NLP and mental health, aiming to improve the detection and prevention of mental health issues in online communities. The datasets used in the eRisk lab are primarily sourced from the social media platform Reddit.
Since 2017, the challenge has included two tasks pertaining to the early detection of eating disorders. In both 2018 and 2019, the task involved the early detection of signs of anorexia [see, e.g., Losada et al. (26)]. In contrast, the 2022 iteration introduced a novel task centered on measuring the severity of eating disorders (27). This task diverged from the previous ones in that no labeled training data was supplied to participants, meaning that participants could not evaluate the quality of their models' predictions until test time. The task objective was to assess a user's level of eating disorder severity through analysis of their Reddit posting history. To achieve this, participants were required to predict users' responses to a standard eating disorder questionnaire (EDE-Q) 5 (28).

Input research questions
Our first input research question (InputRQ1) investigates the different languages considered in the studies included in this SLR. Research has shown that only a small number of the over 7,000 languages used worldwide are represented in recent technologies from the field of natural language processing (29). We wanted to investigate whether this is also the case for the detection of eating disorders. Text analysis naturally depends on the specific language and can typically not be transferred from one language to another without specific adaptations.
Table 2 indicates the language of the data used, its size, its source, and the type of eating disorder investigated in the selected studies (excluding studies from eRisk). 18 of the 21 studies used English data, 2 used Polish data, and 1 used Spanish data. The 24 papers from the eRisk lab challenges all relied on English data from the platform Reddit. Overall, only 3 out of 45 studies used a language other than English (7%). This confirms the need for further work in applying the latest technological developments to non-English texts.

FIGURE 5 Dataset sizes distribution based on Table 2, excluding articles from eRisk.

The dataset size is another crucial factor we took into account in our analysis (InputRQ2). As depicted in Figure 5, the distribution of dataset sizes used in the studies reveals that datasets ranging from 1K to 10K instances are the most frequently used.
The distribution of dataset sizes across different research topics, as illustrated in Figure 6, offers insightful perspectives. Notably, anorexia research displays the most significant variance in dataset sizes, spanning from less than 1K to over 1 million data points. In contrast, binge eating research predominantly employs datasets within a narrower range of 1K to 10K data points. For broader eating disorders, 6 studies leverage datasets between 10K and 100K data points, while 3 others operate with datasets in the 100K to 1 million range. Finally, research on mental disorders encompasses datasets varying from 1K to more than 1 million data points. The distribution of the primary focus of these studies is illustrated in Figure 7 (InputRQ4). The majority of the studies (n=29) focused on anorexia, while 12 studies investigated EDs in general rather than focusing on a specific type. Additionally, three studies had a more extensive scope, delving into various mental disorders including but not limited to EDs, while one study focused on binge eating.

Architectural and evaluation research questions

eRisk challenge
Table 3 summarizes all the papers identified following our strategy, including those from eRisk. In 2018 and 2019, the eRisk papers focused on a text classification task aimed at developing an early detection system for eating disorders on social media using the history of users' writings. The aim was to train a text classifier that could effectively identify and flag potential cases of anorexia based on users' social media content. For the eRisk challenge resulting in papers from 2022, the task was different: participants were provided with the social media history of specific users and had to predict their answers to questions 1-12 and 19-28 of the Eating Disorder Examination Questionnaire (EDE-Q) 7 (28).
(EvalRQ1) For the 2018-2019 eRisk papers, we report F1 values corresponding to the binary classification task, whereas for the 2022 papers we report the mean absolute error (MAE), corresponding to the average deviation between a user's predicted questionnaire responses and the ground-truth responses.
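Both evaluation measures are straightforward to compute; the following stdlib-only sketch (with invented labels and EDE-Q-style answers, not taken from any eRisk submission) illustrates the F1-score for the binary task and the MAE for the severity task:

```python
def f1_score(y_true, y_pred):
    """F1 for a binary task: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def mae(y_true, y_pred):
    """Mean absolute error between predicted and true questionnaire answers."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Invented binary anorexia labels vs. predictions.
print(f1_score([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
# Invented EDE-Q item answers (0-6 scale) vs. predictions.
print(mae([3, 0, 5, 2], [4, 0, 3, 2]))
```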

Non-eRisk studies
Table 3 shows the feature representations, tasks studied, machine learning techniques, and performance metrics of all studies included in this SLR. In this section we focus on the non-eRisk studies. We grouped these studies into categories with regard to the feature extraction techniques they apply (ArchRQ1), ranging from Bag of Words (BoW) and TF-IDF over word embeddings and contextualized embeddings to other feature representations. Furthermore, it is worth noting that the machine learning methods used in these studies span various categories (ArchRQ2), including:
• Classical machine learning (ML) methods such as Support Vector Machine (SVM), naive Bayes, logistic regression, etc.
• Deep learning (DL) methods, e.g., recurrent neural networks.
• Combinations of different methods from classical ML and DL.
• Large language models (LLMs), e.g., BERT.
• Other approaches.

FIGURE 6 Dataset sizes distribution by targeted ED based on Table 2, excluding articles from eRisk.
FIGURE 7 Research distribution of all research articles.

Additionally, the tasks addressed in these studies can be broadly grouped into several categories. In terms of feature extraction techniques employed across the 21 studies, a variety of methods were utilized. Among these, three studies (33, 46, 78) used TF-IDF-based representations. Moreover, Bag of Words (BoW) and various types of word embeddings, including GloVe (35, 48), FastText (35), and Word2Vec (35, 36), were widely employed as feature extraction techniques in these studies.
It is pertinent to note that some studies, like Chancellor et al. (79) and Benítez-Andrades et al. (38), did not provide comprehensive details on this aspect in their papers. Conversely, other articles adopted a more personalized approach to constructing their features. For instance, some represented each data point as a vector within certain categories (39, 40), while others used rule-based methods (18) or leveraged algorithms like decision trees (41) and topic modeling (42) for feature selection. Our results show that of the 21 studies, 8 make use of classical machine learning methods, 1 uses deep learning, 5 use a combination of classical ML and DL, 4 use large language models, and 3 use other approaches.
When using classical machine learning, some studies compare different methods. For example, López Úbeda et al. (33) apply 5 different supervised machine learning models: SVM, a multilayer perceptron classifier, naive Bayes, a decision tree, and logistic regression, while Villegas et al. (48) compare naive Bayes, random forest, logistic regression, and SVM. Along with the classical machine learning methods, the studies apply different feature representations, ranging from Bag of Words (BoW) and TF-IDF (33, 78) up to contextualized embeddings such as BERT (48).
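A comparison of this kind can be sketched as follows (an illustrative toy example with invented posts and scikit-learn defaults; it does not reproduce the data, preprocessing, or results of the cited studies):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Invented toy posts standing in for labeled social media data
# (1 = ED-related, 0 = unrelated), repeated to give the models samples.
texts = [
    "skipped every meal again today",
    "counting calories obsessively, feel guilty eating",
    "tried a new pasta recipe, it was delicious",
    "great dinner with friends tonight",
] * 5
labels = [1, 1, 0, 0] * 5

# Shared TF-IDF feature representation for all models.
X = TfidfVectorizer().fit_transform(texts)

# Compare several classical classifiers on the same features.
models = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(),
    "svm": LinearSVC(),
}
for name, model in models.items():
    model.fit(X, labels)
    print(name, model.score(X, labels))
```

In a real study, the comparison would of course use held-out test data rather than training accuracy.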
Other studies compared both classical machine learning and deep learning methods. For example, Tebar and Gopalan (42) use a so-called feature fusion model that includes both deep learning components (a convolutional neural network (CNN) and a BiGRU model) and a classical machine learning model (a logistic regression classifier with handcrafted features).
For the studies using transformer-based large language models, different models including BERT (19) and its variations have been used. For example, Benítez-Andrades et al. (32) applied five variations of the BERT model. The paper by Dinu and Moldovan (43) uses BERT, RoBERTa, and XLNet, whereas Jiang et al. (44) use BERT and REALM. The work by Zhang et al. (45), focusing on different mental illnesses, used the BERT model as well as the MBERT variation. (EvalRQ1) The performance of each study is also reported in Table 3.
(EvalRQ2) Finally, we investigated the limitations of the proposed studies in order to provide a structured outlook for future work in the field.
In many cases, there were limitations in terms of the datasets. For example, Yan et al. (78) cite the limited availability of labeled data; they used a dataset of 50 posts, which they expect to be labeled correctly. Zhou et al. (34) likewise mention that their study is limited by the number of collected tweets, which may introduce irrelevant topics arising from noise in their topic modeling task.
Many studies use social media data, the nature of which is seen as a potential limitation for the resulting methods (37). Other studies indicated as a limitation that only one social media platform was used to gather their data (38, 42). For example, one study (35) points out that their work did not take into account potential biases in the data, such as underrepresented populations or a lack of diverse perspectives. In addition, a notable constraint arises from the fundamental disparity between social media data and the traditional clinical text data often used in healthcare and medical research. Clinical records encompass detailed information on patients' medical histories, diagnoses, treatments, and outcomes, rendering them fundamentally distinct from the informal, user-generated content prevalent on social media platforms. Several studies point out that the involvement of clinical professionals would be beneficial. For example, Choudhury (30) states that their method could be more successful with the involvement of clinicians.
Different studies rely on anonymous data, which makes it difficult to ensure a good distribution within the training data over different populations and underrepresented groups. For example, Ragheb et al. (62) see potential to optimize the model for different use cases and populations. Manual labeling by humans is also considered a source of bias, since only limited information about the users writing the posts is available to the annotators. This limited information may not encompass the full context of the users' lives, beliefs, or backgrounds. Annotators may make subjective judgments based solely on the content of the post, which can be influenced by their own biases and interpretations. Thus, limited context can lead to misinterpretations or mislabeling, potentially distorting the research results (38).
The limitations also discuss how texts written by laypeople and ED promotional 8 and educational materials can be hard to classify (34). This can be partly explained by the short length of the texts, for example in the case of tweets, and by the semantic similarity of the two types of texts.
Whereas many studies achieved good performance in terms of accuracy or F1-score, some see a potential limitation in this regard. For example, Wang et al. (40) discuss that validation was performed on only a small sample of the data, and thus further validation with larger samples is required. In another study, the authors were concerned about the problem of overfitting (52).

Discussion
In this systematic literature survey we have discussed the use of machine learning and natural language processing methods for the detection of eating disorders. Our survey was conducted using the PRISMA framework (17). Our results have shown that many studies focus on the detection of anorexia, or on eating disorders in general (see Figure 7). We have also seen more work over the last couple of years, indicating a growing interest in the topic (as shown in Figure 3). Whereas most publications were from institutions in the USA and Spain, work from other countries including Mexico, France, and Canada was also identified, as shown in Figure 4. Nevertheless, our work has shown that most research efforts have only been applied to the English language. Given the relevance of local languages for mental health diagnostics and treatment (15), it is thus necessary for future research to address other languages.

With regard to the machine learning and feature extraction methods applied, a comparison turned out to be challenging due to the diverse nature of the datasets and approaches used. The proposed approaches were classified into different categories: classical machine learning, deep learning, a combination of classical and deep learning, the use of large language models, and other approaches. Several studies used the F1-score as a common measure, with performances ranging from 0.67 to 0.93. Overall, obtaining sufficient data quality and quantity was often seen as a major limitation of the approaches.
Given that both the eRisk lab and the SMHD dataset (50) consist predominantly of social media data, it is notable that an overwhelming majority (93%) of the studies in our analysis employ this data type. This underscores the widespread reliance on social media sources in modern research methodologies and confirms the results of Zhang et al. (16), who found that among 399 papers applying NLP methods to the identification of mental health problems, 81% used social media data.
It is worth mentioning that we came across two types of use cases in the studies. Many studies focus on an individual's expression of their behavior and feelings with regard to eating disorders. Some studies, namely Choudhury (30) and Chancellor et al. (49), instead investigate the wording of pro-anorexia or pro-eating-disorder communities on social media and online forums. Such communities promote disordered eating habits as acceptable alternative lifestyles (49). Whereas in many of the studies the technologies target support for clinical professionals, in these cases other applications such as content moderation are in the foreground.
In the realm of data collection for eating disorder research, manual labeling of datasets has been a common approach, with various strategies employed. For instance, Zhang et al. (45) relied on the voluntary efforts of 31 individuals to annotate 8,554 data points encompassing 38 symptoms related to mental disorders (MD). Other studies took different routes, combining expert knowledge with input from non-expert annotators 9 (38), or relying solely on domain experts (46). In some cases, researchers employed machine learning algorithms to automatically annotate their datasets and subsequently validated the results with input from human labelers (44). The majority of datasets were annotated by non-expert human annotators, as in the studies conducted by (79, 40, 34, 41).
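When multiple annotators label the same posts, their agreement is commonly quantified, for example with Cohen's kappa. The following stdlib-only sketch (with invented annotations, not from any of the cited studies) illustrates the computation for two annotators:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement between two annotators,
    corrected for the agreement expected by chance."""
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented labels from two annotators on ten posts (1 = ED-related).
ann1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
ann2 = [1, 0, 0, 0, 1, 0, 1, 1, 0, 1]
print(round(cohens_kappa(ann1, ann2), 2))
```

Here the annotators agree on 8 of 10 posts; kappa discounts the 50% agreement expected by chance for this label distribution.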
Our review revealed comparatively few applications of large language models (LLMs) (10, 11, 19, 30, 38, 43, 44, 45, 49, 50, 61, 67, 73, 74, 79, 80). Nevertheless, the rising adoption of technologies like MentalBERT (77) and MentaLLaMA (81), alongside traditional machine and deep learning approaches, is notable. This trend, driven by the impressive efficacy of LLMs in natural language processing, is expected to continue. As these technologies evolve and become more accessible, we anticipate their increased utilization in this field of research, enhancing computational model accuracy and efficiency.
Based on the identified limitations in the selected studies, we suggest the following focus topics for future work in the field of using natural language processing and machine learning in ED research:
• Data quantity and quality: How can more high-quality data be created and shared, while respecting the ethical and privacy limitations of such sensitive data?
• Involvement of clinical professionals: How can machine learning engineers and clinical professionals work together more closely?
• More diversity in data: How can the diversity of the population in the used datasets be increased to avoid bias in the classification?
• Local languages: How can the proposed methods be extended to local languages other than English?
In conclusion, based on the studies investigated in this literature survey, there is potential for further development and, in the long term, a novel tool supporting clinical professionals based on text data.

FIGURE 3 Yearly distribution of all research articles.

FIGURE 4 Geographic distribution of all institutions involved in the selected research articles.

TABLE 1 SLR study selection of literature using inclusion and exclusion criteria.

Table 2 also gives an overview of the data sources (InputRQ3); the datasets used in the 45 studies can be classified into four groups.

TABLE 3 Overview of machine learning methods and performance metrics of the studies included in this systematic literature review.