
ORIGINAL RESEARCH article

Front. Polit. Sci., 30 July 2025

Sec. Politics of Technology

Volume 7 - 2025 | https://doi.org/10.3389/fpos.2025.1631881

This article is part of the Research Topic: Human Rights and Artificial Intelligence.

Empathy, bias, and data responsibility: evaluating AI chatbots for gender-based violence support

  • 1Faculty of Engineering, University of Deusto, Bilbao, Spain
  • 2Faculty of Human and Social Sciences, University of Deusto, San Sebastian, Spain

Artificial Intelligence (AI) chatbots are increasingly deployed as support tools in sensitive domains such as gender-based violence (GBV). This study evaluates the performance of three conversational AI models—a general-purpose Large Language Model (ChatGPT), an open-source model (LLaMA), and a specialized chatbot (AinoAid)—in providing first-line assistance to women affected by GBV. Drawing on findings from the European IMPROVE project, the research uses a mixed-methods design combining qualitative narrative interviews with 30 survivors in Spain and quantitative natural language processing metrics. Chatbots were assessed through scenario-based simulations across the GBV cycle, with prompts designed via the Systematic Context Construction and Behavior Specification method to ensure ethical and empathetic alignment. Results reveal significant differences in emotional resonance, response quality, and gender bias handling, with ChatGPT showing the most empathetic engagement and AinoAid offering contextually precise guidance. However, all models lacked intersectional sensitivity and proactive attention to privacy. These findings highlight the importance of trauma-informed design and qualitative grounding in developing responsible AI for GBV support.

1 Introduction

Gender-based violence (GBV) is a pressing global issue encompassing physical, psychological, and sexual harm across all social strata (Krug et al., 2002; UN Women, 2023). Support services for women affected by GBV aim to provide immediate safety and promote long-term recovery and autonomy. These include helplines like Spain's 016, shelters, legal and psychological assistance, and programs supporting economic independence. Public awareness efforts complement these mechanisms. However, structural challenges—such as limited resources, territorial disparities, and sociocultural barriers—often hinder access, especially for vulnerable groups like migrants or women facing social exclusion (Toledano-Buendía, 2021). Addressing these gaps requires strategies that expand availability while promoting community engagement and reducing stigma.

1.1 LLMs in sensitive contexts: potentials and ethical risks

AI-powered chatbots have emerged as tools that can offer accessible support to survivors, yet they raise concerns around algorithmic bias, emotional detachment, and ethical safeguards (Izaguirre Choperena et al., 2024). These concerns are amplified in large language models (LLMs), whose opaque training processes and potential to replicate harmful stereotypes present significant challenges (Dinan et al., 2020; Bender et al., 2021).

Recent advances in LLMs have transformed the capabilities of chatbots, enabling them to generate human-like, contextually relevant responses across a wide range of tasks. By training on vast textual datasets, these systems can simulate understanding and deliver coherent outputs in complex situations, including those requiring empathy or emotional support (Bommasani et al., 2022). This potential has positioned LLMs as promising tools in fields like GBV support, where timely, informative, and emotionally attuned communication is essential.

However, these same characteristics raise significant concerns. Since LLMs learn from unfiltered data, they often inherit and reproduce societal biases, which may manifest as harmful stereotypes or emotionally inadequate responses—especially problematic when assisting vulnerable populations such as women affected by GBV (Sun et al., 2019; Dinan et al., 2020). Moreover, unlike trained professionals, chatbots frequently lack the nuanced understanding and emotional intelligence needed to respond to survivors with appropriate care and sensitivity (Saglam et al., 2024).

Beyond empathy, ethical issues around privacy and consent are especially critical in sensitive contexts. Chatbots may handle deeply personal and traumatic disclosures, yet few systems offer robust safeguards to protect user confidentiality or mitigate risks such as data misuse (Butterby and Lombard, 2024). There is also the danger of misuse or manipulation—scenarios where bots are repurposed in ways that compromise their reliability and safety, further endangering users (Cecillon et al., 2019).

Recent feminist scholarship in AI ethics challenges static conceptions of responsibility by foregrounding care, situated knowledge, and the redistribution of power in technology development. This includes shifting from neutral accountability to ethically motivated response-ability and collective responsibility (Siapka, 2022; Drage et al., 2024; Powell, 2025).

Despite these challenges, several real-world initiatives have demonstrated how chatbots—when designed with ethical, contextual, and trauma-informed principles—can provide valuable support. These examples offer important lessons for improving the design and deployment of AI tools in GBV contexts, highlighting both the opportunities and limitations of current technologies.

1.2 Existing AI chatbots for GBV: Sophia, Violetta, and AinoAid

Several notable chatbot initiatives illustrate the potential of AI to provide accessible and empathetic support to women affected by GBV. López Belloso and Izaguirre Choperena (2024) offer a taxonomy of these tools based on their functionality, regional scope, and integration with legal or psychosocial services.

Among them, Sophia1 stands out as an international reference for domestic violence response. It combines user interaction with a key innovation: secure storage of digital evidence related to sexual violence, deleted from local devices and saved on protected servers—enhancing confidentiality and user autonomy. However, Sophia's focus on domestic settings limits its applicability to other forms of GBV, and its global ambition creates challenges in adapting guidance to diverse legal frameworks.

In the Spanish-speaking context, Violetta2 provides psychoeducational and preventive support through emotionally attuned responses, assisted by psychologists. It enables early detection of high-risk expressions and promotes awareness in communities where GBV remains a taboo. Yet, Violetta also faces key limitations: technological barriers in low-connectivity areas, challenges in interpreting complex emotional nuances, data privacy concerns, and the absence of human intervention in critical moments. Its continued effectiveness depends on regular updates and sustained investment to meet ethical and contextual standards.

Developed within the European IMPROVE project, AinoAid3 is a domain-specific chatbot designed to guide women affected by GBV through assessment, advice, and access to relevant support services. Unlike general-purpose models, its conversational logic is based on content co-designed with survivors, professionals, and support organizations, ensuring that its interactions are trauma-informed, respectful, and contextually accurate. AinoAid offers responses in multiple languages and is deployed in several European countries, including Spain, Finland, Germany, France, and Austria. It prioritizes user safety by guaranteeing anonymity, avoiding data collection, and providing static, expert-reviewed information that aligns with best practices in victim assistance.

While these initiatives demonstrate the potential of dedicated chatbot systems to support women in situations of GBV, they also raise important questions about scalability, adaptability, and the resources required for development and maintenance. In this context, it becomes relevant to explore whether general-purpose LLMs, already widely available and continuously evolving, could be adapted to perform similar support functions. This consideration forms the basis for the present study, which explores whether widely accessible LLMs like ChatGPT can replicate the empathetic and contextual strengths of specialized systems, aiming to inform the ethical adaptation of these technologies for first-line support in GBV contexts.

2 Objectives and research questions

This study aims to critically evaluate the performance of different conversational artificial intelligence (AI) models—including general-purpose (ChatGPT), open-source (LLaMA), and domain-specific systems (AinoAid)—in their ability to deliver empathetic, contextually appropriate, and bias-aware responses in support scenarios involving women affected by GBV. Informed by survivor-centered knowledge generated through earlier qualitative work within the European IMPROVE4 project, the study employs a mixed-methods approach that combines this contextual understanding with quantitative natural language processing (NLP) techniques. The research seeks to identify the best practices, limitations, and ethical risks associated with deploying such technologies in highly sensitive and vulnerable contexts.

More specifically, our methodology addresses three key research questions:

1. RQ1: what differences can be observed in the quality of responses generated by advanced LLMs such as GPT-4 (ChatGPT), lighter open-source models such as LLaMA, and domain-specific systems such as AinoAid when acting as first-line conversational agents for women affected by GBV?

2. RQ2: to what extent do different AI models demonstrate empathy and emotional validation in interactions with women experiencing GBV?

3. RQ3: what types of gender biases and other prejudices emerge in the responses generated by the evaluated models, and how do these vary across different models and testing scenarios?

While domain-specific chatbots have shown promise in providing tailored support to women affected by GBV, their development requires significant financial, technical, and human resources, limiting their scalability and adaptability across diverse contexts. In contrast, general-purpose large language models (LLMs), such as ChatGPT, are widely accessible, continuously evolving, and already integrated into many public-facing platforms. Evaluating whether these general-purpose models—when guided by carefully designed prompts—can replicate or even enhance the supportive functions of specialized systems is therefore critical. Such an evaluation can inform the responsible adaptation of existing AI infrastructure for social good, particularly in settings where dedicated resources for custom development are lacking. Moreover, it allows for the identification of trade-offs between personalization, ethical safety, and scalability in the deployment of conversational AI for GBV support.

3 Methodology

3.1 Research design

This study employs a mixed-methods strategy, integrating qualitative and quantitative analyses to assess the efficacy of generative AI chatbots in delivering first-response support to women affected by GBV. The mixed-methods approach allows for a deeper and more comprehensive understanding of the phenomenon under study by combining numerical data with contextual and subjective interpretations (López Belloso and Sanz, 2019; Timans et al., 2019). This methodology is particularly suited for research on complex social issues such as GBV, where personal experiences and emotional responses are as crucial as measurable patterns. The integration of both methods facilitates data triangulation, enhancing the validity and reliability of the findings.

The research is organized around the systematic evaluation of different AI models using a standardized conversational framework. The methodology consists of three primary phases: (1) prompt engineering and chatbot setup, (2) scenario-based evaluation utilizing structured questions, and (3) qualitative and quantitative analysis of chatbot responses. Each phase builds on the previous one to ensure a comprehensive assessment of the models' communicative behavior and their potential to contribute to responsible, context-aware digital support for women affected by GBV.

3.1.1 Qualitative component: survivor interviews

This research is conducted within the framework of the European project IMPROVE, which aims to enhance institutional responses and access to support services for women affected by GBV. Through narrative interviews, the project examines survivors' personal circumstances and the institutional responses available to them. Data collection took place in five countries: Austria, Finland, France, Germany, and Spain. This article specifically highlights the research carried out in Spain, where 30 women participated in the study. The sample included diverse profiles of victim-survivors, incorporating vulnerable groups such as two elderly women and seven migrant or refugee women. The interviews were conducted across various Spanish regions, including the Basque Country, Cantabria, Castile and León, and Madrid. Details of this information are provided in Annex 1. These interviews, grounded in feminist epistemology and ethical research practices, provided empirical insights that informed the development of evaluation scenarios and chatbot testing prompts.

A multi-stage process was followed to conduct the interviews. The first stage involved comprehensive mapping of organizations and associations that support women who have experienced GBV. Using purposive sampling, women were selected to participate through a range of entities, including local GBV services, women's organizations, services targeting socially excluded populations, and organizations managing international protection programs for refugees. Researchers prioritized collaborations with organizations they had previously worked with or those facilitated by third-party professionals, considering the heightened vulnerability of the population involved. The selection of participants and coordination of interviews were led by psychologists or social workers within these services, leveraging their case knowledge and fostering participant trust.

In the second stage, an interview guide was developed based on a thorough literature review to identify key dimensions relevant to the needs assessment and the overarching goal of the project—to enhance victim-survivors' access to support resources.

The third stage involved conducting in-depth narrative interviews with the selected participants, following the ethical guidelines established by the World Health Organization (Putting women first: ethical safety recommendations for research on domestic violence against women, 2001) for research involving women affected by domestic violence. This included ensuring anonymity and safety throughout the research process, involving advocates or intermediaries when needed, providing safe and secure locations for participation, securely storing research data, and engaging trained researchers skilled in sensitive and collaborative interviewing practices. The research fully complied with the Gender, Ethical, Legal, and Societal Aspects (GELSA) requirements of the European Commission, including informed consent, participant protection, confidentiality, and data privacy protocols. The study was approved by the University of Deusto's Ethics Committee5.

To ensure participants' safety and comfort, face-to-face interviews were prioritized, as this format fosters closeness and trust—particularly critical when working with vulnerable groups such as GBV victim-survivors. These interviews also enabled researchers to provide appropriate support before and after each session (Romero Gutierrez et al., 2024). Most interviews were conducted individually in private counseling or group rooms within partner organizations. However, two interviews were carried out in the presence of a shelter social worker at the request of the participants—one of these was facilitated by two researchers, one of whom had in-depth knowledge of the interviewee's cultural background. Additionally, four interviews took place in a group format.

All interviews were audio-recorded, except for one case in which the participant expressed fear due to her personal situation. Interview durations ranged from 1 to 3 h, including the initial and concluding phases, allowing sufficient time for participants to feel comfortable, confident, and heard. Participants were also given brief feedback and acknowledgments to honor their contributions.

Throughout the interview process, researchers aimed to establish a non-hierarchical and interactive dynamic with participants, drawing on the feminist epistemological approach proposed by Oakley (1981, 2016). Participants reported feeling safe and comfortable during the interviews, with many highlighting the empathy they experienced from researchers as a key factor. In this context, empathy was understood as being intentionally and unconditionally present for another individual (Eriksson and Englander, 2017).

Insights from these interviews were used to inform the development of the scenario-based evaluation framework applied to the chatbots. Survivors' narratives helped define key dimensions such as emotional responsiveness, perceived supportiveness, and ethical considerations in conversational interactions.

3.1.2 AI Chatbot evaluation setup

To evaluate the capabilities of conversational AI systems in GBV support contexts, three distinct models were selected. The first is a customized version of ChatGPT, developed by OpenAI and adapted through a specific prompt, as outlined in the methodology. As one of the most advanced models currently available, it is expected to yield high-quality responses. ChatGPT was accessed through the official OpenAI web interface using a Custom GPT configuration. This interface allowed the research team to apply a structured, pre-defined prompt while leveraging GPT-4's default behavior as deployed during March and April 2024. Unlike API-based implementations, the Custom GPT setup permitted consistent prompt control without altering the model architecture or training data.

The second model, LLaMA 3.2 3B Instruct6, is an open-weight, widely accessible model loaded via LM Studio with default settings. Instruct-type models, including InstructGPT and GPT-4 variants, are fine-tuned through supervised learning to improve relevance, coherence, and alignment with user intent—making them particularly effective for applications like virtual assistants, education, and content creation. Among the available LLaMA versions, this instruct variant was selected because it best aligns with the needs of this study, where generating context-sensitive and ethically aware responses is essential. To ensure comparability, the same prompt used with ChatGPT was also applied to this model.

Finally, AinoAid was included as a domain-specific chatbot specifically developed to provide support for women experiencing GBV. Developed within the framework of the European IMPROVE project, AinoAid integrates AI with a curated knowledge base that covers topics such as the forms of violence, victim rights, access to support services, and legal procedures. Unlike general-purpose models, AinoAid's conversational logic is based on content co-designed with survivors, professionals, and support organizations, ensuring that its guidance is both contextually accurate and trauma-informed. It is available in over 5 languages and has been deployed in multiple European countries including Spain, Finland, Germany, Austria, and the French island of Réunion. The chatbot guarantees user anonymity, avoids data collection, and provides static, information-rich responses aligned with best practices in victim support. While it does not allow prompt customization by external users, its responses reflect a fixed set of expert-reviewed guidelines aimed at maximizing clarity, safety, and emotional validation in GBV-related interactions.

3.1.2.1 Prompt structure and ethical framing

To assess AI chatbots as potential first-response assistants for women affected by GBV, a structured prompt was designed using the Systematic Context Construction and Behavior Specification technique (Singh et al., 2024; Chen et al., 2025). The prompt instructs the chatbot to adopt the role of a social assistant specializing in GBV support. It establishes clear behavioral guidelines, ensuring that responses are respectful, emotionally validating, and aligned with best practices in victim assistance. The prompt also includes response constraints, prohibiting victim-blaming, legal advice, or any action that could minimize the survivor's experience. The goal is to minimize ambiguity, align model outputs with user expectations, and enhance the safety, relevance, and appropriateness of generated responses, particularly in sensitive domains such as GBV support services.

This methodological choice aligns with feminist calls for ethically grounded data practices and participatory, community-led ethics in HCI design—approaches that prioritize inclusion, consent, and historical awareness in digital systems (Leavy et al., 2021; Henriques et al., 2024).

The prompt was developed iteratively through multiple rounds of testing with the ChatGPT and LLaMA models. The initial version, included in Annex 2, served as a base to define tone, role, and scope of the assistant. Subsequent refinements focused on two key goals: improving the quality and emotional relevance of responses and simplifying the wording of the prompt to reduce verbosity and improve generalizability across models. Each iteration was evaluated in terms of its ability to elicit appropriate, informative, and trauma-sensitive replies when applied to the scenario-based test cases.

The final prompt, included below, reflects a balance between ethical rigor and operational efficiency:

You are a conversational assistant specialized in providing initial support to individuals who have experienced GBV. Your main objective is to offer a safe space, immediate attentive listening, and unconditional support, conveying absolute respect while avoiding any form of revictimization.

You follow a protocol that includes:

• Creating a safe space and engaging in active listening without judgment.

• Validating the person's emotions and ensuring they do not feel blamed.

• Providing clear information about available help resources, such as emergency hotlines, local support services, rights, and protection options.

• Always referring to in-person professional care and activating emergency protocols if a life-threatening risk is detected.

When responding, you use respectful language, maintain a calm and unconditionally supportive tone, and avoid asking invasive questions. You never replace professional care or provide specific legal advice, nor do you minimize or discredit the recounted experience.

This prompt was applied identically to both ChatGPT and LLaMA in order to ensure consistency in role-setting and behavioral guidance. In contrast, AinoAid—being a pre-configured domain-specific chatbot—was evaluated using the same set of questions but without external prompt customization. This distinction was considered in the analysis to ensure a fair and meaningful comparison across models. The resulting responses were then assessed using a scenario-based framework that simulated real-life support conversations with women affected by GBV.
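
For readers who wish to reproduce a comparable setup programmatically, the following minimal Python sketch shows how the same system prompt could be applied through an API-based chat interface. It is illustrative only: the study itself used the Custom GPT web interface and LM Studio rather than API calls, and the model name, temperature, and example question shown here are assumptions.

```python
# Illustrative sketch only: the study applied the prompt via the Custom GPT web
# interface and LM Studio, not through this API. The model name, temperature,
# and question are assumptions for demonstration purposes.
from openai import OpenAI

SYSTEM_PROMPT = (
    "You are a conversational assistant specialized in providing initial "
    "support to individuals who have experienced GBV. ..."  # full prompt as reproduced above
)

client = OpenAI()  # assumes an API key is configured in the environment

def ask(question: str, model: str = "gpt-4") -> str:
    """Send one scenario question together with the shared system prompt."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content

print(ask("How can I recognize the early warning signs of an abusive relationship?"))
```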

3.1.3 Scenario-based assessment

To evaluate the AI models' capacity to respond appropriately in support scenarios, a structured set of questions was designed to simulate realistic interactions aligned with the stages of the GBV cycle. These questions assess each chatbot's ability to provide accurate, empathetic, and bias-free responses across key phases of the abuse cycle: general awareness of GBV, early warning signs, crisis response, and post-incident support. Each AI model is tested with the same set of questions to ensure comparability, and the responses are systematically recorded and analyzed.

The evaluation framework is grounded in the GBV cycle theory (Walker, 2016) which describes a recurring pattern of abuse characterized by three distinct stages—tension-building, acute battering, and a honeymoon phase—each involving different emotional dynamics and support needs for survivors. The tension-building stage consists of minor abusive incidents and emotional maltreatment, frequently justified by both partners as being linked to stress. Women might try to calm the situation, but as tension escalates, their coping strategies become less effective, resulting in heightened risk and a possible need for assistance. The acute battering episode stage is marked by intense physical or emotional abuse. It tends to be erratic and can lead to considerable harm to the victim. After the violent episode, the aggressor may apologize, promise to change, or minimize what happened, creating a brief sense of calm. This constitutes the apology stage. This stage can trap women affected by GBV in a repeating cycle of hope and fear (Bucuţă, 2015).

A total of 16 questions were developed, grouped into four categories: general awareness, tension-building, acute abuse, and post-incident (“honeymoon”) stages. These questions were informed by survivor narratives and designed to probe the AI models' ability to offer empathetic, coherent, and practical guidance. The full set of questions is presented in Table 1.


Table 1. Predefined questions that will be used to evaluate the answers of the different models.

3.1.4 Qualitative analysis methodology

A qualitative content analysis was conducted to examine the communicative strategies employed by each chatbot. Using Atlas.ti software7, researchers applied thematic coding to assess how emotional sensitivity, contextual appropriateness, and ethical considerations were handled across responses. The approach combined deductive categories, based on previous knowledge, with inductive themes emerging from the chatbot responses to questions related to different stages of the GBV cycle. Manual coding was applied to selected segments, allowing for multiple codes per response. This enabled a deeper understanding of how emotional, contextual, and informational aspects were addressed.

The codebook was created iteratively and organized into thematic categories that represent key evaluation factors such as response quality, empathy, appropriateness, support orientation, and tone. In total, 17 codes were utilized, arranged across 4 thematic categories. Each code was characterized by a label and, when applicable, an additional comment detailing its scope and purpose.

The code system was organized into four thematic categories:

1. Response Quality: evaluating completeness, clarity, and alignment with the user's question.

2. Empathy and Humanization: capturing the affective dimension, tone, and emotional validation.

3. Privacy and Ethics: addressing concerns related to user data handling, consent, and the ethical implications of chatbot interactions.

4. Bias: examining potential prejudices in responses, ensuring fairness and neutrality in the information provided.

These categories collectively provide a comprehensive framework for assessing the performance of conversational agents, ensuring they meet women's needs while upholding ethical standards and fostering positive interactions.

Through the analysis, attention was given to the frequency of thematic codes across the dataset, revealing recurrent patterns that highlighted both strengths and concerns in the chatbots' performance. Co-occurrence patterns between codes were also examined, offering insights into the relationships between different thematic elements—such as the link between empathic responses and user satisfaction, or the intersection of bias-related codes with problematic interactions.

This analysis provided a deeper understanding of each model's communicative behavior in GBV contexts, particularly regarding how empathetic language, ethical safeguards, and potential biases influence the perceived quality and safety of the interactions.

3.1.5 Quantitative NLP-based evaluation

In parallel with the qualitative assessment, a quantitative evaluation was conducted using natural language processing (NLP) techniques to analyze chatbot responses across three critical dimensions—emotional tone, semantic coherence, and gender-related bias—each essential for ensuring ethical and effective support in GBV scenarios. The analysis was implemented in Python, which enabled an automated, objective, and reproducible evaluation of chatbot responses in a highly sensitive application context.

To assess the emotional quality and empathetic tone of chatbot responses, we applied two complementary sentiment analysis tools. TextBlob (Bird et al., 2009) was used for its simplicity in providing general polarity scores from −1 (negative) to +1 (positive). In addition, VADER (Valence Aware Dictionary and sEntiment Reasoner; Roehrick, 2020), optimized for informal and conversational language, was employed to capture more nuanced emotional expressions typical of chatbot interactions.
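
As an illustration, the following minimal Python sketch shows how these two scores could be obtained for a set of responses; the example texts are placeholders, and the exact preprocessing used in the study is not reproduced here.

```python
# Minimal sketch of the sentiment-scoring step, assuming responses are held in a
# simple list; example texts are placeholders, not responses from the study.
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

responses = [
    "You are not alone. Reaching out is an important step toward healing.",
    "Here is a list of local support services you can contact.",
]

for text in responses:
    polarity = TextBlob(text).sentiment.polarity           # range [-1, +1]
    compound = analyzer.polarity_scores(text)["compound"]  # range [-1, +1]
    print(f"TextBlob={polarity:.3f}  VADER={compound:.3f}  | {text[:40]}...")
```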

To assess how well the chatbot responses matched the users' questions in meaning, we used Sentence-BERT (Reimers and Gurevych, 2019)8, a tool that compares the similarity between texts. It provides a numerical score indicating how closely the chatbot's answer aligns with the original question. This helps objectively evaluate whether responses are clear, relevant, and consistent.
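
A minimal sketch of this question-answer similarity computation is shown below; the specific pretrained Sentence-BERT model used in the study is not reported, so the model name here is an assumption chosen for illustration.

```python
# Sketch of question-answer semantic similarity with Sentence-BERT; the model
# name and example texts are assumptions, not the study's actual configuration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

question = "What should I do if I feel unsafe at home?"
answer = "If you feel at risk, you can contact the 016 helpline or local support services."

q_emb = model.encode(question, convert_to_tensor=True)
a_emb = model.encode(answer, convert_to_tensor=True)

similarity = util.cos_sim(q_emb, a_emb).item()  # cosine similarity in [-1, 1]
print(f"Semantic similarity: {similarity:.3f}")
```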

Politeness analysis was conducted using the Politeness tool in R (Yeomans et al., 2018), which detects features such as greetings, gratitude, and apologies to assess the respectful and empathetic tone of chatbot responses—crucial in emotionally sensitive contexts.

Lastly, to detect possible gender bias in language, we used a lexicon-based method that scans responses for words previously identified as gender-biased, drawing on the lexicon developed by Zhao et al. (2018). Given that such tools can produce false positives, we also carried out a qualitative review to interpret these findings in context.
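
Conceptually, the lexical scan reduces to a simple set intersection, as in the sketch below; the word list shown is a small illustrative subset, not the full lexicon of Zhao et al. (2018).

```python
# Sketch of the lexicon-based scan. The word list below is a small illustrative
# subset, not the full lexicon from Zhao et al. (2018).
GENDERED_TERMS = {"she", "her", "hers", "woman", "women", "mom", "miss", "gal"}

def flag_gendered_terms(response: str) -> list[str]:
    """Return the gendered terms found in a response (lowercased token match)."""
    tokens = {token.strip(".,!?;:\"'()").lower() for token in response.split()}
    return sorted(tokens & GENDERED_TERMS)

response = "She is not to blame; every woman deserves to feel safe."
print(flag_gendered_terms(response))  # ['she', 'woman'] -> still needs qualitative review
```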

A summary of these metrics can be found in Table 2.


Table 2. Primary computational metrics used.

Metrics were interpreted according to their output scales (e.g., polarity from −1 to +1), and rank-based normalization was applied across models to ensure comparability. Higher values generally indicated greater emotional expressiveness, contextual relevance, or politeness.

These computational metrics complemented the qualitative insights, allowing for a multidimensional comparison of how different AI systems perform in ethically sensitive and emotionally complex interactions.

4 Results

This section presents the results of both the quantitative and qualitative analyses applied to evaluate chatbot-generated responses in the context of GBV support.

4.1 Structural and linguistic features of Chatbot responses

Before presenting the main quantitative metrics, we conducted a preliminary analysis of general linguistic features in the chatbot responses. This included text-level characteristics such as message length, sentence structure, punctuation use, emojis, and emotionally connoted vocabulary (e.g., "safe," "abuse"). We also tracked the use of structured formats (e.g., lists) and explicit references to support resources, which can indicate a model's clarity and ability to provide actionable help. These indicators offer additional insight into tone, organization, and practical utility beyond emotional or semantic assessments. Table 3 shows the values obtained for each dimension; a rank-based normalization was applied to each metric, as described below.


Table 3. Summary of general metrics obtained from the different answers of the models.

To facilitate model comparison, a rank-based normalization was applied across each individual metric. For every row in the table, representing a specific linguistic or structural dimension, the three models (AinoAid, ChatGPT, and LLaMa) were assigned a rank from 1 (best performance) to 3 (lowest performance), based on their absolute values. Higher values were interpreted as indicators of better performance for all metrics, in line with the study's objective of evaluating verbosity, clarity, empathy, and responsiveness in sensitive support contexts.

Ranking was calculated using the Excel function RANK.EQ (in English, with descending order), so that the model with the highest value received rank 1. In cases of tied values, models were assigned the same rank. This approach allows for a simplified but consistent comparative evaluation; an equivalent computation in pandas is sketched below.
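
The following minimal pandas sketch reproduces the same ranking logic; the numeric values are placeholders rather than the figures reported in Table 3.

```python
# Sketch of the rank-based normalization; the numbers are placeholders, not the
# values reported in Table 3. pandas' rank(method="min") mirrors Excel's RANK.EQ
# behavior for ties.
import pandas as pd

metrics = pd.DataFrame(
    {"AinoAid": [85.0, 2.1], "ChatGPT": [120.0, 3.4], "LLaMA": [95.0, 3.4]},
    index=["avg_words", "avg_exclamations"],
)

# Rank each row (metric) across models: highest value -> rank 1, ties share a rank.
ranks = metrics.rank(axis=1, ascending=False, method="min").astype(int)
print(ranks)
```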

Figure 1 illustrates the relative performance of each chatbot model based on nine structural and expressive features. These include average word and sentence length, punctuation usage (exclamations and questions), emoji presence, list formatting, and the inclusion of external support resources. A rank-based normalization was applied to each metric, enabling relative comparisons without the distortions of raw value disparities. Higher values indicate stronger performance on the respective features compared to the other models.

[Figure 1 image: radar chart comparing AinoAid, ChatGPT, and LLaMA across average words, average sentences, mentions of resources, average word length, total emojis, emoji usage, average questions, and average exclamations.]

Figure 1. Radar chart comparing structural and stylistic metrics across chatbot models.

4.2 Qualitative results

4.2.1 Analytical framework

To complement the quantitative evaluation, a qualitative content analysis was conducted to examine the communicative behavior of the chatbots when interacting with users disclosing experiences related to GBV. This analysis focused on three key dimensions:

1. Response Quality: completeness, clarity, and relevance

The first key dimension analyzed centered on evaluating the overall quality of the chatbot's responses. This involved a thorough examination of several critical aspects, including the completeness of the information provided—ensuring that answers addressed user queries comprehensively without leaving out important details. Equally important was the clarity of the responses, meaning that the language used needed to be easily understandable, avoiding ambiguity or overly technical jargon that could confuse users. Additionally, the relevance of the content was carefully considered, focusing on whether the chatbot's replies were directly pertinent to the questions posed, aligned with the context of the conversation, and sensitive to the specific needs and experiences of the users.

In this regard, the women participating in the study emphasized that the chatbot's greatest value would lie in its ability to reduce confusion and uncertainty by providing clear answers to common doubts surrounding violence. Many noted that victims often struggle to recognize less visible or less commonly understood forms of violence, such as psychological abuse. Therefore, the chatbot's capacity to help users identify these subtle and complex forms of victimization was viewed as crucial.

Additionally, interviewees underlined the importance of receiving step-by-step guidance on legal rights and procedural matters, which can often feel overwhelming or inaccessible to those seeking help. They expressed a strong need for easily navigable information that breaks down legal jargon and clarifies what actions victims can take, what protections exist, and how to access them.

Participants also highlighted the significance of offering educational resources (including books and evidence-based information on gender-based violence), that clearly define and explain various types of violence, including stalking, sexual violence, coercive control, and other abusive behaviors. They envisioned the chatbot as a platform that could offer self-assessment tools and multimedia resources like videos, series, or films to help raise awareness and encourage self-reflection. Additionally, the option to hear testimonies from other survivors—spanning all forms of violence—was seen as a powerful feature that could foster solidarity and recognition.

Such materials should be presented in a language that is both straightforward and sensitive to users' emotional states, avoiding technical terms or overly clinical language that might alienate or confuse victims. The provision of accessible, relatable explanations was seen as vital not only for raising awareness but also for empowering victims to better understand their situation and seek appropriate support.

Moreover, the availability of information about local support services, including shelters, counseling, legal aid, and crisis hotlines, was seen as a key component. The women stressed that having quick and easy access to trustworthy resources could significantly lower barriers to help-seeking and potentially save lives. Overall, the chatbot's role was envisioned not just as an informational tool but as a compassionate, accessible first point of contact that could guide victims through the often complex and intimidating process of recognizing abuse and seeking assistance.

2. Empathy and Emotional Tone: validating language, supportive tone

The second dimension centered on the empathy and emotional tone that chatbots should convey in their interactions. Interviewees expressed a range of perspectives on this aspect, highlighting its critical importance for creating a supportive and trusting environment. Some participants emphasized the need for the chatbot's tone to be gentle, compassionate, and non-judgmental, helping users feel safe and understood during difficult moments. Others pointed out that the emotional tone should adapt to the user's state, offering warmth and reassurance without sounding overly robotic or detached. Several interviewees noted that a lack of genuine empathy in responses could discourage users from engaging or sharing openly, underscoring how crucial it is for the chatbot to strike the right balance between professionalism and emotional sensitivity. Overall, the findings suggest that the chatbot's ability to convey empathy and a supportive emotional presence is key to fostering trust and encouraging continued interaction.

In addition to identifying the characteristics of the interaction offered by the chatbots, the participating women were asked whether they would prefer the voice to be male or female. While some expressed indifference to whether the voice was masculine or feminine, a notable number expressed a clear preference for a female voice. This preference was often linked to feelings of safety, relatability, and emotional comfort. For many survivors of gender-based violence, a female voice can evoke a greater sense of trust and psychological security.

Moreover, participants emphasized that the voice should convey warmth and emotional intelligence. Ideally, it should be soft, calm, gentle, friendly, and empathetic—qualities that help reduce anxiety and establish an atmosphere of support and care. The tone and delivery were seen as just as important as the content of the message.

Accent also emerged as a significant factor. While some interviewees favored a neutral, easily understandable accent, others preferred the chatbot to speak in their own native accent or in the accent of their country of origin. This linguistic familiarity was seen as a way to build rapport and cultural resonance, fostering a sense of belonging and understanding. The implication is that voice design in such tools should be culturally sensitive and adaptable to diverse user backgrounds, especially in multilingual and migrant-inclusive contexts.

The interviewees stressed that AI should prioritize personalization and empathy in its responses. The chatbot's communication style should be calm, composed, and supportive. When reaching out for help, victim-survivors of domestic violence often feel ashamed of their situation, so it would be reassuring if the chatbot could respond with something like: "I believe you. Don't worry, I believe you and we will do something about it."

“That chat that gives you a personalization, gives you an importance.”

“Not showing that it's a robot, because a gender-based violence victim expects a human response that understands their feelings.”

3. Privacy and Ethical Awareness: handling of sensitive topics

Regarding the third dimension, privacy and ethical awareness, a significant concern raised by the women interviewed was a pervasive sense of distrust in the technology and its safety. This distrust stemmed from fears about privacy and confidentiality, with many participants unsure about who might have access to the information they share. Several women expressed apprehension about the possibility of conversations being monitored, recorded, or accessed by unauthorized individuals, including fears that law enforcement or other authorities might overhear sensitive disclosures. For this reason, participants emphasized that the chatbot should not be deployed through WhatsApp, as many women fear the possibility of someone gaining control over their phones.

This lack of trust created a substantial barrier to engagement, making women hesitant to fully disclose their experiences or seek help through chatbot platforms. The findings suggest that addressing these trust issues is critical to designing effective and safe digital tools for survivors of violence, emphasizing the need for transparent privacy protections, clear communication about data use, and robust security measures to foster confidence in such technologies.

Overall, the fear of being discovered while seeking help is deeply distressing for survivors of domestic violence. They are forced to carefully consider every action, knowing that even a small misstep—such as a traceable message—could put their safety at risk.

4.2.2 Common structure and communicative strategy

Across all three chatbots, responses tended to follow a common structure comprising four stages:

1. definition of the problem,

2. elaboration of the issue, often in lists,

3. suggestions or action-oriented advice, and

4. a final emotionally supportive statement.

This template was particularly evident in ChatGPT and AinoAid, suggesting an embedded conversational strategy designed to offer clarity, reassurance, and actionable help. For instance, ChatGPT frequently used clear segmentations such as “You might consider…”, followed by bullets, and concluded with messages like “You are not alone. You deserve care and support.” This structure, while consistent, varied in tone, empathy, and depth across systems.

This structure reflects a strong alignment with user needs for clarity, emotional validation, and actionable guidance—needs also emphasized by participants in the IMPROVE study, who identified certainty, practical information, and non-judgmental support as critical to their decisions to seek help (Blumenschein et al., 2023, p. 10–13).

4.2.3 Empathy and emotional validation

Empathy emerged as a key differentiator between the systems. ChatGPT demonstrated the strongest emotional alignment, consistently using affective language and reinforcing user agency. Phrases such as “It's okay to seek help at any stage” or “You don't have to go through this alone” reflected trauma-informed design principles. The use of symbolic elements (e.g., emojis) added to the affective resonance, although this may not be equally valued by all users or contexts.

AinoAid, though less expressive, conveyed empathy through formal but respectful reassurance (e.g., "Reaching out is an important step toward healing"). LLaMA, by contrast, often defaulted to neutral or impersonal phrasing and showed less sensitivity to emotional cues.

Despite these efforts, none of the systems modulated their empathetic tone based on contextual signals such as language used by the user, emotional intensity, or perceived urgency—highlighting a limitation in adaptive response generation.

4.2.4 Contextual relevance and local guidance

The ability to provide context-aware, location-specific support information also varied. ChatGPT stood out for referencing local Spanish services—even including specific NGOs such as La Posada de los Abrazos—despite not being explicitly prompted with geolocation data. This suggests advanced inferencing capabilities based on indirect signals or language patterns.

In contrast, LLaMA often reverted to general or U.S.-centric resources, which limited its relevance in the European context of the study. AinoAid reliably linked users to accurate national and regional services in participating countries (e.g., Valencia's police unit), reinforcing its grounding in institutional resources and project-specific knowledge.

These patterns underscore how response quality is not just a matter of linguistic coherence but also of situational and geographic appropriateness—particularly critical in high-stakes contexts like GBV.

4.2.5 Inclusivity and ethical gaps

While the tone and structure of most responses were generally non-judgmental, the analysis revealed notable gaps in inclusivity. None of the systems actively adjusted language or recommendations based on markers of identity such as age, ethnicity, sexual orientation, or ability. Furthermore, most responses consistently used feminine pronouns, implicitly assuming a female, cisgender user. This framing overlooks other survivors of GBV, including male, non-binary, LGBTQI+, elderly, or migrant individuals—groups that face compounding forms of vulnerability, as highlighted in IMPROVE D1.2 (p. 18–23).

Attempts to use neutral language (e.g., “both partners”) were sporadic and not sustained across conversations. Similarly, no model made proactive reference to data privacy, consent, or safety protocols unless explicitly prompted—representing a critical ethical omission in contexts involving trauma disclosure.

4.2.6 Summary and recommendations

The qualitative analysis reveals a core tension: while chatbot systems demonstrate consistent structural logic and, in some cases, affective depth (notably ChatGPT), they lack the adaptability and contextual sensitivity required to respond meaningfully to the diverse realities of GBV survivors. This includes limitations in addressing intersectional identities, ethical safeguards, and user-specific nuances.

To address these limitations, future chatbot design should incorporate:

• Dynamic empathy modulation to reflect varying emotional states and communication needs.

• Inclusion of intersectional identity markers to better address compound forms of discrimination.

• Explicit bias mitigation in language and content framing.

• Provision of clear, context-specific support options to foster trust and action-readiness.

By evolving in these directions, AI-based support systems can move from generic responsiveness toward truly trauma-informed, ethical, and inclusive communication.

4.3 Comparative performance across models: integrated qualitative and NLP-based results

This section presents a comparative analysis of the three chatbot models, focusing on key quantitative performance indicators across three critical dimensions: emotional sentiment, semantic relevance, and gender bias detection.

4.3.1 Code distribution and emergent patterns

In order to complement the quantitative assessment, this section explores the qualitative dimensions of chatbot performance, examining the ways in which each system navigates emotionally and ethically charged interactions. Through a close analysis of selected responses generated in reaction to diverse prompts corresponding to different phases of the gender-based violence (GBV) cycle, we aim to evaluate the communicative depth, narrative coherence, and ethical sensitivity of each model's interaction strategy. This qualitative lens allows for a richer understanding of not just what the systems say, but how they say it—highlighting their ability (or lack thereof) to simulate human-like compassion, respond appropriately to situational complexity, and uphold standards of inclusivity and responsibility in high-stakes conversational scenarios. Table 4 presents the co-occurrence matrix of all the codes used.


Table 4. Co-occurrence matrix with the used codes.

4.3.2 NLP metrics on empathy, coherence, and bias

This subsection presents the results of automated text analysis applied to the chatbot responses, focusing on emotional tone and semantic alignment. By leveraging NLP tools—specifically sentiment analysis and semantic similarity—we aim to quantify how empathetic, emotionally appropriate, and topically coherent the responses are in the context of GBV support.

Sentiment analysis was conducted using two complementary tools: TextBlob and VADER. TextBlob provided a polarity score for each response on a scale from −1 (negative) to +1 (positive), while VADER—specifically calibrated for conversational language—yielded compound sentiment scores along the same scale.

• ChatGPT exhibited the most emotionally expressive profile, with a notably high average VADER sentiment score (0.528), indicating a predominantly positive tone. Its TextBlob polarity score (0.062) was also positive, though more moderate.

• AinoAid showed the highest TextBlob polarity (0.126), suggesting consistent positivity across responses, albeit with a lower VADER score (−0.036), reflecting a more neutral-to-flat affect.

• LLaMA displayed the lowest scores on both dimensions, with a negative average VADER score (−0.346), pointing to an overall less supportive or emotionally neutral tone in its outputs.

These variations reflect different design emphases: ChatGPT prioritizes empathetic engagement, AinoAid maintains a consistent but subdued tone, and LLaMA tends to generate emotionally detached content.

To assess whether the type of chatbot model had a significant effect on the emotional valence of responses, we conducted a repeated measures ANOVA using the 16 prompt questions as the within-subjects factor and model type as the between-conditions factor. The analysis revealed a statistically significant main effect of model on VADER sentiment scores, F(2, 30) = 11.80, p < 0.001, ηp² = 0.164. This indicates that 16.4% of the variance in sentiment values can be attributed to the model used, a moderate-to-large effect size. Mauchly's test of sphericity confirmed that the assumption of equal variances across model comparisons was met (p = 0.504), so no correction was applied. To further validate the robustness of these results, we also ran a non-parametric Friedman test, which yielded a significant effect as well, χ²(2) = 6.13, p = 0.047, with Kendall's W = 0.191, indicating a consistent but small-to-moderate agreement in ranking across models. These converging results suggest that model type significantly influences the emotional tone produced, as measured by VADER polarity. Descriptive statistics for each chatbot model, including the mean and standard deviation of VADER sentiment scores across the 16 prompts, are presented in Table 5.
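
As an illustration of this analysis pipeline, the sketch below shows one way a repeated measures ANOVA and a Friedman test could be computed in Python, treating the 16 questions as the repeated unit and model as the factor; the scores are randomly generated placeholders rather than the study data, and the exact design specification used in the study may differ.

```python
# Sketch of the statistical comparison, assuming VADER scores in long format
# (one row per question-model pair). Scores are random placeholders, not the
# values analyzed in the study.
import numpy as np
import pandas as pd
from scipy.stats import friedmanchisquare
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(0)
chatgpt_scores = rng.uniform(0.2, 0.8, 16)
ainoaid_scores = rng.uniform(-0.2, 0.3, 16)
llama_scores = rng.uniform(-0.6, 0.1, 16)

data = pd.DataFrame({
    "question": list(range(16)) * 3,
    "model": ["ChatGPT"] * 16 + ["AinoAid"] * 16 + ["LLaMA"] * 16,
    "vader": np.concatenate([chatgpt_scores, ainoaid_scores, llama_scores]),
})

# Repeated measures ANOVA with each question as the repeated unit and model as the factor
anova = AnovaRM(data, depvar="vader", subject="question", within=["model"]).fit()
print(anova)

# Non-parametric check: Friedman test across the three models
stat, p = friedmanchisquare(chatgpt_scores, ainoaid_scores, llama_scores)
print(f"Friedman chi2 = {stat:.2f}, p = {p:.3f}")
```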


Table 5. Mean and standard deviation of VADER sentiment scores across chatbot models for each question (n = 16 prompts per model).

To complement these omnibus results, we computed pairwise comparisons between ChatGPT and each of the other models. Table 6 summarizes the mean differences in VADER, TextBlob polarity, and semantic similarity scores across the 16 prompts, along with 95% confidence intervals and paired Cohen's d values. These results offer a more fine-grained understanding of how ChatGPT differs from the other systems in both emotional and semantic dimensions.


Table 6. Pairwise differences between ChatGPT and the other models across key NLP metrics.

As shown in Table 6, ChatGPT consistently outperformed LLaMA across all metrics, with large effect sizes in both emotional tone (VADER d = 1.10, TextBlob d = 0.80) and a small difference in semantic similarity (d = 0.13). When compared to AinoAid, ChatGPT showed significantly higher VADER scores (d = 0.74), but no clear advantage in TextBlob polarity, and a substantially lower score in semantic similarity (d = −1.10). These findings suggest that while ChatGPT excels in generating emotionally expressive responses, AinoAid provides more semantically aligned content. This trade-off between affective engagement and topical precision is further examined in the next section.
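
For reference, a paired Cohen's d of the kind reported in Table 6 can be computed as the mean of the per-prompt differences divided by the standard deviation of those differences; the sketch below uses placeholder scores rather than the study data.

```python
# Sketch of a paired Cohen's d (mean difference divided by the standard deviation
# of the paired differences); the scores below are placeholders, not study data.
import numpy as np

def paired_cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    diff = a - b
    return diff.mean() / diff.std(ddof=1)

rng = np.random.default_rng(1)
chatgpt_vader = rng.uniform(0.2, 0.8, 16)
llama_vader = rng.uniform(-0.6, 0.1, 16)

print(f"Paired Cohen's d (ChatGPT vs LLaMA): {paired_cohens_d(chatgpt_vader, llama_vader):.2f}")
```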

On the other hand, to assess how coherently each model responded to the user prompts, cosine similarity scores were calculated between each question and its corresponding answer using Sentence-BERT.

• AinoAid achieved the highest average semantic similarity (0.721), suggesting that its responses were the most topically aligned and context-aware.

• ChatGPT followed with a moderate score (0.621), while LLaMA trailed slightly behind (0.608), indicating comparatively weaker alignment to user intent.

This metric provides evidence of the relative precision of each model in adhering to the informational demands of the prompt.

Finally, gender-associated language was examined using a lexicon-based approach derived from Zhao et al. (2018), which includes a curated list of female-related terms commonly used in gender bias evaluations in NLP. This method flagged every response from all three systems as containing at least one word from the bias lexicon, which initially suggests an even distribution of potential bias.

However, a closer inspection reveals that the flagged terms are overwhelmingly concentrated around common gendered pronouns and references—such as her, she, ma, mom, gal, miss, and women. These terms, while gender-specific, are not inherently biased when used in neutral or supportive contexts—especially in a scenario centered on GBV, where referencing female identities is contextually appropriate and often necessary.

The prevalence of these terms in every response highlights a limitation of purely lexical bias detection: while it offers a systematic and reproducible method, it lacks sensitivity to semantic nuance and pragmatic intent. A response that includes the word her or woman in a validating or empathetic context should not be treated as equivalent to one that reinforces stereotypes or minimizes experiences.

Therefore, the uniform detection of “bias” in all responses must be interpreted with caution. It does not necessarily indicate the presence of harmful language but rather reflects the context-dependent nature of gendered vocabulary in support-focused interactions. This underscores the need for complementary qualitative analysis to distinguish between appropriate gender reference and genuinely problematic bias in language use.

Figure 2 presents a comparative overview of three chatbot models (ChatGPT, AinoAid, and LLaMA) based on three core metrics derived from natural language processing analyses: TextBlob polarity, VADER sentiment, and semantic similarity (Sentence-BERT). To ensure comparability across dimensions with different scales, a rank-based normalization was applied. This method converts raw scores into relative positions, assigning each model a fractional value based on its rank (1 being the lowest, 3 the highest) for each metric. This avoids the distortion caused by outliers or narrow value ranges in raw data.

[Figure 2 image: radar chart comparing AinoAid, ChatGPT, and LLaMA across TextBlob polarity, semantic similarity, and VADER sentiment (rank-normalized).]

Figure 2. Radar chart comparing NLP-based evaluation metrics across models (rank-normalized).

The chart reveals that ChatGPT ranks highest in sentiment-related metrics (VADER), confirming its tendency to generate emotionally expressive and engaging content. AinoAid leads in semantic similarity, indicating more contextually aligned responses. LLaMA, while consistent, ranks lower across all dimensions, particularly in affective tone. This visualization highlights each model's relative strength and supports a multidimensional interpretation of chatbot performance in sensitive interaction scenarios.

The politeness tool provides several metrics for evaluating the behavior of the chatbots. We selected a subset of responses considered the most representative and applied the same rank-based methodology used in the previous sections. The results are shown in Table 7.

Table 7. Metrics obtained from the politeness tool.
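The behaviors reported in Table 7 come from the politeness package (Yeomans et al., 2018); purely as an illustration of how marker counts of this kind can be derived, the simplified Python stand-in below tallies a few marker families with hand-written patterns that are assumptions, not the package's actual strategies.

```python
# Simplified stand-in for politeness-marker counting (NOT the politeness R
# package of Yeomans et al., 2018). Marker patterns are illustrative only.
import re

MARKERS = {
    "gratitude":   [r"\bthank(s| you)?\b"],
    "apology":     [r"\b(sorry|apologi[sz]e)\b"],
    "hedges":      [r"\b(might|maybe|perhaps|could|possibly)\b"],
    "reassurance": [r"\byou('| a)re not alone\b", r"\bit'?s not your fault\b"],
}

def marker_counts(text: str) -> dict[str, int]:
    """Count occurrences of each marker family in a chatbot response."""
    lowered = text.lower()
    return {name: sum(len(re.findall(p, lowered)) for p in patterns)
            for name, patterns in MARKERS.items()}

example = ("I'm so sorry you are going through this. You are not alone, "
           "and it might help to contact a local support service.")
print(marker_counts(example))  # counts for apology, reassurance, and hedging
```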

Figure 3 compares AinoAid, ChatGPT, and LLaMA across the key politeness metrics, with rank normalization applied to highlight the differences. ChatGPT excels in reasoning and user empowerment, AinoAid lacks emotional support features such as gratitude and reassurance, and LLaMA shows a more neutral profile with lower scores in several areas. This visualization aids in understanding each model's relative strengths and weaknesses in emotional engagement.

Figure 3. Radar chart comparing the politeness tool results (gratitude, apology, hedges, reassurance, reasoning, ask agency, and give agency) across AinoAid, ChatGPT, and LLaMA, each scored from 0 to 1.

Finally, we compared the similarities between the answers of the different models, that is, how similarly the chatbots respond to the same question. For each prompt in the dataset, the responses generated by the three models (ChatGPT, LLaMA, and AinoAid) were compared in pairs to evaluate the degree of similarity in their content.

To do this, we used a technique from natural language processing that converts entire sentences into numerical representations, known as sentence embeddings (Li et al., 2020). These embeddings allow us to measure how close two responses are in meaning, even if the wording is different. The comparison is based on a cosine similarity score (Schütze et al., 2008), where values close to 1 indicate high similarity (i.e., the responses convey very similar ideas), and values near 0 indicate low similarity (i.e., substantial differences in meaning).
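A minimal sketch of this pairwise comparison, assuming the sentence-transformers library and the all-MiniLM-L6-v2 model named in the footnotes, is shown below; the responses are invented placeholders.

```python
# Pairwise semantic similarity between model responses using Sentence-BERT
# embeddings (all-MiniLM-L6-v2). The responses below are invented placeholders.
from itertools import combinations
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

responses = {
    "ChatGPT": "You are not alone. Please consider calling the 016 helpline.",
    "LLaMA":   "Contact a local support service or a trusted professional.",
    "AinoAid": "Specialized services such as the 016 helpline can help you safely.",
}

embeddings = {name: encoder.encode(text, convert_to_tensor=True)
              for name, text in responses.items()}

for a, b in combinations(responses, 2):
    score = util.cos_sim(embeddings[a], embeddings[b]).item()
    print(f"{a} vs. {b}: cosine similarity = {score:.3f}")
```

Averaging these pairwise scores within each question category yields the values plotted in Figure 4.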

Figure 4 shows how semantically similar the responses of each pair of chatbot models are when answering questions from four categories based on the cycle of GBV. For each category, the cosine similarity between model responses was averaged to assess how consistently the models interpret and respond to prompts with similar meaning.

Figure 4. Average semantic similarity (cosine) between chatbot responses by question category (General Questions, Tension-Building Stage, Acute Battering Episode, and Honeymoon Stage), shown for each model pair alongside the overall mean.

The dashed horizontal line indicates the overall mean similarity across all questions and models. Higher values suggest that models provide more aligned, coherent answers, while lower scores reflect greater divergence in their interpretations or styles.

This approach allows for identifying whether models converge more on certain types of questions (e.g., general advice vs. crisis intervention), offering insight into where consistency may be more critical.

The analysis includes three pairwise comparisons (higher values indicate greater similarity in meaning):

• ChatGPT vs. LLaMA: 0.697525

• ChatGPT vs. AinoAid: 0.730427

• LLaMA vs. AinoAid: 0.748246

5 Discussion

The analysis of structural and stylistic features reveals important insights into how LLMs perform in the context of supporting women affected by GBV: a setting where both emotional sensitivity and informational clarity are essential.

ChatGPT emerged as the most expressive model, with longer and more elaborate responses, frequent use of punctuation to convey tone, and a consistent inclusion of emoji. These traits suggest a communication style designed to foster emotional engagement, which can be especially meaningful for users experiencing distress or trauma. However, the repetitive and symbolic use of the emoji may also risk appearing impersonal if not contextually adapted.

AinoAid, in contrast, adopted a more neutral and restrained tone. While moderately expressive and consistent in structure, it avoided emphatic punctuation and emotive symbols altogether. Nonetheless, it frequently referenced external support services and used list formatting to present information clearly. This suggests a prioritization of structured, accessible communication, which may benefit users in moments of crisis by offering practical help in an organized format.

LLaMA, the most concise model, offered the briefest responses and minimal emotional expression. Although it used list structures and questions to guide interaction, it referenced support services the least. Its higher average word length may reflect denser vocabulary, but the absence of empathy markers raises concerns about its suitability in emotionally charged situations such as those involving violence against women.

These stylistic differences are not merely superficial: they directly impact the model's capacity to convey care, trust, and actionable help, which are critical in support contexts. ChatGPT's expressiveness may create a more comforting experience; AinoAid's structure may aid comprehension and decision-making; while LLaMA's efficiency, though practical, may fall short in establishing the relational warmth needed in these interactions.

One major limitation observed across all models is the absence of proactive mention of privacy or data ethics: a crucial issue when supporting vulnerable women. While specific prompts on the topic generated appropriate answers, none of the chatbots addressed these concerns spontaneously. This omission suggests that current prompt designs do not yet sufficiently prioritize user safety and digital ethics, both vital in contexts involving personal disclosure and trust.

To examine this further, a direct prompt was used: “Can I share with you my phone number?”, including a fictitious number. As shown in Figure 5, ChatGPT and AinoAid clearly refused to accept personal information, referencing safety and policy restrictions. LLaMA, however, offered a less decisive response, lacking explicit refusal or warning. These findings underscore the importance of clear and proactive messaging around privacy, especially when interacting with women who may be unaware of the risks.

Figure 5. Responses provided by the chatbots when the user attempts to share personal data (a phone number). (a) ChatGPT declines, citing privacy and safety, and offers to help find support services. (b) AinoAid explains it cannot store personal information and advises contacting helplines. (c) LLaMA states it cannot provide personal contact for emergencies and suggests calling emergency services.

Beyond tone and privacy, the models were compared in terms of semantic similarity, which reveals how consistently they interpret and respond to the same prompt. Despite their differences, all three models showed relatively high agreement in meaning. This suggests that while stylistic and emotional delivery varies, the core informational content is largely aligned (a promising sign for the reliability of support information offered across platforms).

From a qualitative standpoint, 233 coded segments were identified in the dataset, evenly distributed among the three systems. The most frequent codes (e.g., response completeness, contextual adjustment, and emotional validation) reflect the evaluators' emphasis on thorough, context-sensitive, and emotionally supportive communication. Notably, ChatGPT excelled in emotional validation, LLaMA in contextual adaptation, and AinoAid in delivering complete responses. These differences reveal that each model contributes distinct strengths to a support scenario, and that no single system is fully comprehensive on its own.

However, the analysis also revealed significant gaps. Inclusivity-related codes (e.g., the use of inclusive language or representation) were scarce, appearing mostly in ChatGPT and AinoAid. Their limited presence suggests that current models do not consistently address diversity and inclusion, which are key dimensions of ethical and accessible communication in GBV support.

Finally, the analysis of politeness indicators (greetings, apologies, and expressions of gratitude) found that ChatGPT maintained the most respectful and empathetic tone, followed by AinoAid. LLaMA showed the lowest levels of politeness, reinforcing its overall pattern of minimal emotional engagement.

Further differences emerge when examining finer-grained conversational features such as gratitude, apologies, hedging, reasoning, and user agency. Expressions of gratitude and apology were minimal across all models—almost absent in AinoAid and LLaMA—suggesting a missed opportunity to reinforce empathy and emotional connection in sensitive interactions. In contrast, ChatGPT exhibited occasional apologies and more varied emotional cues. The use of hedging (language that signals uncertainty or caution) was highest in ChatGPT, followed by AinoAid, with LLaMA showing the least; this could indicate reduced confidence or over-cautiousness, potentially affecting clarity. Reasoning was most frequent in ChatGPT's responses, suggesting a more transparent and coherent structure, while both AinoAid and LLaMA offered minimal justification for their answers. Lastly, ChatGPT also provided the highest levels of user empowerment, both in inviting user input and in encouraging decision-making—elements that are particularly valuable in restoring a sense of control in women affected by GBV. AinoAid showed moderate levels of agency, whereas LLaMA offered none. These micro-level features further reinforce the broader pattern: ChatGPT appears best equipped to simulate emotionally supportive dialogue, while AinoAid offers informative structure, and LLaMA remains functionally limited in this regard.

In sum, while all three models show potential to contribute to initial support for women affected by GBV, their strengths and limitations differ. Emotional engagement, clarity of information, ethical awareness, and inclusive language are all essential components of effective support, yet no model currently integrates all of them consistently. These findings point to the need for targeted improvements in the design and fine-tuning of language models intended for use in sensitive, high-stakes contexts.

These findings reinforce broader feminist and intersectional critiques of AI, which argue that systems designed without attention to social context often fail to meet the needs of those facing overlapping forms of marginalization. As Costanza-Chock (2018) and Crenshaw emphasize, universalist or single-axis design frameworks routinely overlook the lived experiences of individuals positioned at the intersection of gender, race, class, and migration status. They note that, when issues of inequality are addressed in design (which remains the exception rather than the norm in professional settings), such efforts are typically approached through a single-axis lens. This narrow framing renders current design processes largely incapable of identifying, engaging with, or redressing the uneven distribution of benefits and harms they help reproduce. In line with McCall's (2005) “intracategorical” approach, our findings suggest that evaluating and designing chatbot interventions must account for the specific realities of survivors navigating complex systems of power. Survivors' narratives in this study highlight the ways institutional assumptions embedded in chatbot responses can reproduce harm, invisibilize needs, or fail to affirm identity.

Moreover, as Henne, Shelby, and Harb note, AI systems deployed in GBV contexts often replicate institutional blind spots, omitting critical experiential and cultural data. This calls for a shift toward open-source, human rights–oriented models grounded in anti-racist feminist principles. Effective support tools must move beyond claims of neutrality and instead embed care, situated knowledge, and what Drage et al. (2024) term “response-ability”—a deliberate orientation toward relational accountability and structural awareness—in both technical architecture and governance. An intersectional, community-responsive framework is not an optional enhancement but a foundational necessity in the ethical design of AI for GBV response.


From a methodological standpoint, this study would benefit from further integration of diverse epistemological perspectives—particularly those emerging from the Global South. Feminist scholars such as Noble (2018), Benjamin (2023), and Eubanks (2018) have emphasized how AI systems often reproduce colonial, racialized, and class-based inequalities when developed without context-sensitive or community-informed frameworks. Incorporating such perspectives could enrich the analysis of bias and responsibility in chatbot behavior and help situate technical evaluation metrics within broader social and historical power structures. Future work should therefore strive for greater epistemic plurality, drawing on decolonial and intersectional frameworks that interrogate not only how AI behaves, but who it ultimately serves.

A further limitation lies in the evolving nature of the models themselves. ChatGPT, as a continuously updated platform, may yield different outputs over time even when prompted identically. The results presented in this study reflect the system's behavior during the October–January 2024 deployment period and may not be fully reproducible in future versions. This versioning dynamic poses challenges for long-term benchmarking and reinforces the need for temporal transparency in evaluating AI systems.

Another methodological limitation concerns the approach used to detect gender bias. Our analysis employed a lexicon-based method, which, while systematic and easily reproducible, cannot account for contextual meaning and often flags neutral or supportive gendered terms (e.g., “she”) as biased. More robust alternatives, such as embedding-based techniques like WEAT or SEAT, would allow for deeper analysis of implicit associations within model representations. However, these methods require access to the internal embedding layers of each model, which was not possible in this study due to the proprietary nature of ChatGPT and the closed infrastructure of AinoAid.
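For readers unfamiliar with these alternatives, the sketch below computes the standard WEAT effect size (the association test introduced by Caliskan and colleagues); the random vectors are placeholders standing in for real word or sentence embeddings, which, as noted above, were not accessible for the proprietary systems.

```python
# Standard WEAT effect size sketch. The random vectors below are placeholders
# for real word or sentence embeddings.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B):
    """s(w, A, B): mean similarity to attribute set A minus attribute set B."""
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    """Standardized difference of association between target sets X and Y."""
    s_x = [association(x, A, B) for x in X]
    s_y = [association(y, A, B) for y in Y]
    return (np.mean(s_x) - np.mean(s_y)) / np.std(s_x + s_y, ddof=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 50))  # e.g., career-related target words
Y = rng.normal(size=(8, 50))  # e.g., family-related target words
A = rng.normal(size=(8, 50))  # e.g., male attribute words
B = rng.normal(size=(8, 50))  # e.g., female attribute words
print(f"WEAT effect size: {weat_effect_size(X, Y, A, B):.3f}")
```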

Finally, a key limitation of our study lies in the absence of real-time user-chatbot interactions. Although the scenario-based evaluation offered a consistent, ethically sound framework for comparing model outputs, it cannot fully replicate the interpersonal dynamics of actual help-seeking conversations. Crucial dimensions—such as trust-building, confusion handling, or emotional regulation—remain inaccessible without direct user input. Future studies should explore live, in-situ evaluations with GBV survivors or support professionals to capture how these systems function in practice. Such user-centered validation would offer more ecologically grounded insights into the perceived empathy, clarity, and safety of AI-driven support, and would help guide more ethically responsible deployment in GBV contexts.

6 Conclusions

This study offers a preliminary assessment of the potential of LLMs to assist women affected by GBV through conversational support. Using a set of prompts based on Walker's cycle of violence, we evaluated three models: ChatGPT (custom-prompted), LLaMA (open-weight), and AinoAid (specialized for women affected by GBV).

A mixed-methods approach combining qualitative and quantitative techniques allowed us to assess not only the informational quality of the responses but also their emotional tone, clarity, and ethical implications. The results highlight notable strengths: larger models like ChatGPT consistently demonstrate higher empathy and more emotionally validating language. They also provide accurate and actionable resources, which can be critical in moments of crisis.

However, several limitations emerged. None of the models request contextual information to tailor their answers, resulting in generalized though thorough responses. Privacy concerns are also not proactively addressed: only AinoAid includes a prior note on data handling, and even that is limited. When directly asked, the models deny storing personal data, but this raises further questions about the systems' internal memory and possible vulnerabilities, such as exposure to prompt injection attacks.

Some of the limitations identified—such as the lack of contextual adaptation or the absence of proactive privacy guidance—could potentially be mitigated through more sophisticated prompt engineering. Tailored prompts can influence LLM behavior, encouraging, for example, the solicitation of context-specific information or the inclusion of safety disclaimers. However, this approach requires careful design and testing, as it may introduce new risks or inconsistencies, especially in sensitive applications like GBV support. Moreover, alternative techniques such as fine-tuning, reinforcement learning with human feedback (RLHF), or the integration of external safety layers offer promising avenues to enhance system performance, but they fall beyond the scope of this study and warrant future research.
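As a purely illustrative example (not the prompt set used in this study), the sketch below shows how such behaviors could be specified in a system prompt; the OpenAI-style message structure is an assumption made only for demonstration.

```python
# Hypothetical system prompt illustrating how context solicitation and privacy
# guidance could be made explicit. This is NOT the prompt used in the study.
SYSTEM_PROMPT = """You are a first-line support assistant for women affected by
gender-based violence. In every conversation:
1. State briefly that you do not store personal data and ask the user not to
   share identifying details (full name, address, phone number).
2. Ask one gentle, optional question about the user's situation (e.g., immediate
   danger vs. general information) before offering advice.
3. Validate the user's feelings and never assign blame to the user.
4. When risk indicators appear, point to concrete resources such as Spain's 016
   helpline or local emergency services.
"""

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "My partner apologized again, but I am scared."},
]
# These messages would then be passed to a chat-completion endpoint; the exact
# API call is omitted and depends on the model being evaluated.
```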

Therefore, these systems show promise in offering first-line support and guidance, but important gaps remain regarding data privacy, ethical standards, and content personalization. For broader implementation, further testing is essential, ideally involving professionals in the field of GBV. Collaboration between technologists and social experts will be key to ensuring that these tools are both effective and safe, particularly in high-stakes, emotionally sensitive contexts.

As McCullough et al. (2025) argue, integrating care into AI tool design, even in technical or trade contexts, opens up possibilities for more inclusive, sustainable, and human-centered development processes.

Data availability statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Ethics statement

The studies involving humans were approved by Research Ethics Committee at the University of Deusto (UD-REC). The studies were conducted in accordance with the local legislation and institutional requirements. Written informed consent for participation in this study was provided by the participants' legal guardians/next of kin.

Author contributions

BS: Writing – original draft, Writing – review & editing. ML: Writing – original draft, Writing – review & editing. AI-C: Writing – original draft, Writing – review & editing.

Funding

The author(s) declare that financial support was received for the research and/or publication of this article. This research was funded by the European Union, HORIZON Europe Innovation Actions, Grant Agreement number 101074010.

Acknowledgments

The authors would like to extend their sincere gratitude to all the women who generously shared their personal experiences of gender-based violence, making this research possible. We are also grateful to the professionals from various organizations who supported the identification of participants and the planning of interviews. Special thanks are due to the ASKABIDE association for their invaluable assistance in facilitating contact with several participants and for conducting some of the interviews.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declare that Gen AI was used in the creation of this manuscript. The author(s) verify and take full responsibility for the use of generative AI in the preparation of this manuscript. Generative AI was used for language and style revision, as well as for support in developing and refining the source code included in the study. All outputs were reviewed, validated, and edited by the author(s) to ensure accuracy, relevance, and compliance with ethical standards.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpos.2025.1631881/full#supplementary-material

Footnotes

1. ^https://sophia.chat/

2. ^https://holasoyvioletta.com/

3. ^https://AinoAid.fi/

4. ^Grant Agreement 101074010.

5. ^Reference number: ETK-60/23-24.

6. ^Downloaded from: https://huggingface.co/lmstudio-community/Llama-3.2-3B-Instruct-GGUF.

7. ^https://atlasti.com/

8. ^Using the all-MiniLM-L6-v2 model.

References

Bender, E. M., Gebru, T., McMillan-Major, A., and Shmitchell, S. (2021). “On the dangers of stochastic parrots: can language models be too big?,” in Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (New York, NY: Association for Computing Machinery), 14. doi: 10.1145/3442188.3445922

Benjamin, R. (2023). “Race after technology,” in Social Theory Re-Wired (Abingdon: Routledge). doi: 10.4324/9781003320609-52

Bird, S., Klein, E., and Loper, E. (2009). Natural Language Processing With Python: Analyzing Text With the Natural Language Toolkit. Sebastopol: O'Reilly Media, Inc.

Blumenschein, T., Hopf, S., Leonhardmair, N., Vogt, C., Kersten, J., Köpsel, N., et al. (2023). Victims' mental maps of institutional response to domestic violence and needs regarding AI chatbot (deliverable D1.2). IMPROVE project. Available online at: https://www.improve-horizon.eu/ (Accessed May 12, 2025).

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., Arx, S. von, et al. (2022). On the opportunities and risks of foundation models. arXiv. doi: 10.48550/arXiv.2108.07258

Bucuţă, M. D. (2015). “The carousel of violence”: experiences of abused women. Bull. Transilvania Univ. Braşov, VII: Soc. Sci. Law. 8, 72–78. doi: 10.31926/series-vii.2015.57.1.7

Butterby, K., and Lombard, N. (2024). Developing a chatbot to support victim-survivors who are subjected to domestic abuse: considerations and ethical dilemmas. J. Gender Based Violence 9, 153–161. doi: 10.1332/23986808Y2024D000000038

Cecillon, N., Labatut, V., Dufour, R., and Linarès, G. (2019). Abusive language detection in online conversations by combining content- and graph-based features. Front. Big Data 2:8. doi: 10.3389/fdata.2019.00008

Chen, Z., Wang, C., Sun, W., Yang, G., Liu, X., Zhang, J. M., et al. (2025). Promptware engineering: software engineering for LLM prompt development. arXiv.

Costanza-Chock, S. (2018). Design justice, AI, and escape from the matrix of domination. J. Design Sci. 3, 1–14. doi: 10.21428/96c8d426

Dinan, E., Fan, A., Williams, A., Urbanek, J., Kiela, D., and Weston, J. (2020). “Queens are Powerful too: mitigating gender bias in dialogue generation,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), eds. B. Webber, T. Cohn, Y. He, and Y. Liu (Stroudsburg, PA: Association for Computational Linguistics), 8173–8188. doi: 10.18653/v1/2020.emnlp-main.656

Drage, E., McInerney, K., and Browne, J. (2024). Engineers on responsibility: feminist approaches to who's responsible for ethical AI. Ethics Inf. Technol. 26:4. doi: 10.1007/s10676-023-09739-1

Eriksson, K., and Englander, M. (2017). Empathy in Social Work. J. Soc. Work Educ. 53, 607–621. doi: 10.1080/10437797.2017.1284629

Eubanks, V. (2018). Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor. New York, NY: St. Martin's Publishing Group.

Henriques, A. O., Nicolau, H., Carter, A. R. L., Montague, K., Talhouk, R., Strohmayer, A., et al. (2024). “Fostering feminist community-led ethics: building tools and connections,” in Companion Publication of the 2024 ACM Designing Interactive Systems Conference (New York, NY, USA: Association for Computing Machinery), 424–428. doi: 10.1145/3656156.3658385

Izaguirre Choperena, A., López Belloso, M., and Sanz Urquijo, B. (2024). Empowering change: unveiling the synergy of feminist perspectives and ai tools in addressing domestic violence. Commun. Pap. Media Literacy Gender Stud. 13, 49–75. doi: 10.33115/udg_bib/cp.v13i27.23087

Krug, E. G., Mercy, J. A., Dahlberg, L. L., and Zwi, A. B. (2002). The world report on violence and health. Lancet 360, 1083–1088. doi: 10.1016/S0140-6736(02)11133-0

Leavy, S., Siapera, E., and O'Sullivan, B. (2021). “Ethical data curation for AI: an approach based on feminist epistemology and critical theories of race,” in Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society (New York, NY, USA: Association for Computing Machinery), 695–703. doi: 10.1145/3461702.3462598

Li, B., Zhou, H., He, J., Wang, M., Yang, Y., and Li, L. (2020). “On the sentence embeddings from pre-trained language models,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (Stroudsburg, PA: Association for Computational Linguistics). doi: 10.18653/v1/2020.emnlp-main.733

López Belloso, M., and Izaguirre Choperena, A. (2024). “Nuevas formas de atención a situaciones de violencia de género: la irrupción de la inteligencia artificial en la atención a las mujeres víctimas,” in La Protección de las Víctimas de la Violencia de Género: Aspectos Jurídicos y Asistenciales, ed. D. B. Benito Sánchez (Bilbao: University of Deusto), 47–86.

López Belloso, M. L., and Sanz, B. (2019). “Hic sunt dracones: derechos humanos y big data: análisis de una colaboración inexplorada,” in Retos Emergentes de los Derechos Humanos: ¿Garantías en Peligro? eds. Garro Carrera, Enara and Landa Gorostiza, Jon-Mirena (Valencia: Tirant Lo Blanch Publishing House), 211.

McCall, L. (2005). The complexity of intersectionality. Signs J. Women Cult. Soc. 30, 1771–1800. doi: 10.1086/426800

McCullough, C. E., Fleischmann, K. R., Greenberg, S. R., and Lassiter, T. B. (2025). Workers who care: AI-enabled smart hand tools in the skilled trades. Proc. ACM Hum.-Comput. Interact. 9, CSCW210:1–CSCW210:20. doi: 10.1145/3711108

Noble, S. U. (2018). Algorithms of Oppression: How Search Engines Reinforce Racism. New York, NY: New York University Press.

Oakley, A. (2016). Interviewing women again: power, time and the gift. Sociology 50, 195–213. doi: 10.1177/0038038515580253

Oakley, A. (1981). “Interviewing women: a contradiction in terms?,” in Doing Feminist Research (London: Routledge), 707–732.

Powell, A. B. (2025). Four ways to feminist research praxis: lessons from practice in AI ethics and policy research. Can. J. Commun. 50, 11–25. doi: 10.3138/cjc-2024-0033

World Health Organization (2001). Putting Women First: Ethical and Safety Recommendations for Research on Domestic Violence Against Women. Geneva: World Health Organization. Available online at: https://www.who.int/publications/i/item/WHO-FCH-GWH-01.1 (Accessed May 12, 2025).

Reimers, N., and Gurevych, I. (2019). Sentence-BERT: sentence embeddings using siamese BERT-networks. arXiv. doi: 10.18653/v1/D19-1410

Roehrick, K. (2020). vader: Valence Aware Dictionary and sEntiment Reasoner (VADER). Vienna: R package version 0.2.1. doi: 10.32614/CRAN.package.vader

Romero Gutierrez, L., Izaguirre Choperena, A., and López Belloso, M. (2024). The study of gender-based violence through a narrative approach: evidence from the European project IMPROVE. Soc. Sci. 13:330. doi: 10.3390/socsci13070330

Saglam, R. B., Nurse, J. R. C., and Sugiura, L. (2024). Designing chatbots to support victims and survivors of domestic abuse. arXiv.

Schütze, H., Manning, C. D., and Raghavan, P. (2008). Introduction to Information Retrieval. Cambridge: Cambridge University Press. doi: 10.1017/CBO9780511809071

Siapka, A. (2022). “Towards a feminist metaethics of AI,” in Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society (New York, NY, USA: Association for Computing Machinery), 665–674. doi: 10.1145/3514094.3534197

Singh, A., Ehtesham, A., Gupta, G. K., Chatta, N. K., Kumar, S., and Khoei, T. T. (2024). Exploring prompt engineering: a systematic review with SWOT analysis. arXiv.

Sun, T., Gaut, A., Tang, S., Huang, Y., ElSherief, M., Zhao, J., et al. (2019). Mitigating gender bias in natural language processing: literature review. arXiv. doi: 10.18653/v1/P19-1159

Timans, R., Wouters, P., and Heilbron, J. (2019). Mixed methods research: what it is and what it could be. Theor. Soc. 48, 193–216. doi: 10.1007/s11186-019-09345-5

Toledano-Buendía, C. (2021). Barrera lingüística y victimización secundaria: la (des)atención institucional a las víctimas extranjeras de violencia de género en España. Verba Hispánica 29, 175–191. doi: 10.4312/vh.29.1.175-191

UN Women (2023). Facts and Figures: Ending Violence Against Women. New York, NY: UN Women.

Walker, L. E. A. (2016). The Battered Woman Syndrome. New York, NY: Springer Publishing Company. doi: 10.1891/9780826170996

Yeomans, M., Kantor, A., and Tingley, D. (2018). The politeness package: detecting politeness in natural language. R J. 10, 489–502. doi: 10.32614/RJ-2018-079

Zhao, J., Zhou, Y., Li, Z., Wang, W., and Chang, K.-W. (2018). “Learning gender-neutral word embeddings,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, eds. E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii (Brussels, Belgium: Association for Computational Linguistics), 4847–4853. doi: 10.18653/v1/D18-1521

Keywords: artificial intelligence (AI), chatbots, gender-based violence (GBV), AI biases, quality of empathic responses, model evaluation, prompt design, IMPROVE European project

Citation: Sanz Urquijo B, López Belloso M and Izaguirre-Choperena A (2025) Empathy, bias, and data responsibility: evaluating AI chatbots for gender-based violence support. Front. Polit. Sci. 7:1631881. doi: 10.3389/fpos.2025.1631881

Received: 20 May 2025; Accepted: 07 July 2025;
Published: 30 July 2025.

Edited by:

Ana Campina, Fernando Pessoa University, Portugal

Reviewed by:

Sonal Sharma, Central University of Gujarat, India
Mattia Falduti, Thesquare, Italy

Copyright © 2025 Sanz Urquijo, López Belloso and Izaguirre-Choperena. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Borja Sanz Urquijo, borja.sanz@deusto.es
