DATA REPORT article

Front. Comput. Sci.

Sec. Human-Media Interaction

Establishing Reference Points for Artificial Social Agent Evaluation: The ASAQ Representative Set 2025

Provisionally accepted
  • 1Delft University of Technology, Delft, Netherlands
  • 2Universiteit Utrecht, Utrecht, Netherlands

The final, formatted version of the article will be published soon.

Normative data sets are well established for psychological measures such as intelligence (e.g., the Stanford-Binet Intelligence Scales (Roid and Pomplun, 2012) and a normative dataset (Stevens and Bernier, 2021)). Closer to home, in the evaluation of software, the System Usability Scale (SUS) (Brooke, 1996) also comes with a representative data set (Lewis and Sauro, 2018).

Creating a benchmark set to go along with the Artificial Social Agent Questionnaire (ASAQ) (Fitrianie et al., 2025b,c) allows us to benchmark people's experience with an ASA. This experience is measured on 24 constructs and dimensions covering an extensive part of our community's shared interests, such as the believability, likeability, and sociability of an ASA. The ASAQ has been published alongside the norm set "ASAQ representative set 2024", which includes the experience of 1,066 individuals with 29 agents. That set is based on a third-person perspective, i.e., participants filled out the questionnaire after watching a video of someone else interacting with an agent. Although pragmatic for validating the questionnaire, the ASAQ authors acknowledge possible limitations of this set regarding generalization towards experiences based on actual interaction (Fitrianie et al., 2025b).

A key question when developing a benchmark is what should constitute a benchmark: which people should be included in the sample, and which agents? For the ASAQ representative set 2024, the research platform Prolific was used, which allows data collection across the world. When using this platform to develop our benchmark set, we need to know which publicly available agents have a global reach and a sizeable user group. Therefore, our first step in building the benchmark set was to survey contemporary ASA usage.

We recruited participants for this study through the crowd-sourcing platform Prolific between November 30 and December 19, 2023. We applied the following inclusion criteria; eligible participants were those who: (1) had not taken part in prior ASAQ validation studies, (2) had a Prolific approval rate above 95%, and (3) were proficient in English. Recruitment spanned multiple time zones, with a staggered approach in six-hour intervals to achieve a global participant distribution. The study consisted of two sequential phases: (1) screening the population for familiarity with contemporary ASAs, and (2) establishing the ASAQ Representative Set 2025. For this study, we received approval from the university Human Research Ethics Committee (no. 2685, dated 13 January 2023), preregistered the study (Fitrianie et al., 2023), and made the analysis script and data publicly available (Fitrianie et al., 2025a). We compensated participants according to Prolific's payment guidelines.

To develop a benchmark set based on individuals' interaction experiences with widely known agents, we started by creating an initial agent list using input from the OSF working group on Artificial Social Agent Evaluation Instrument. Twelve workgroup members from around the world brainstormed on popular and widely used ASAs, selecting agents that various people (e.g., across age groups and locations) might have interacted with at home. This resulted in a pre-selection of 11 agents: Amazon's Alexa, Google's Bard chatbot, Microsoft's Bing chatbot, OpenAI's ChatGPT, Microsoft's CoPilot, Android's Google Assistant, IKEA's customer service chatbot, the Replika chatbot, Apple's Siri, iRobot's Roomba vacuum cleaner, and Microsoft's Xiaoice.
To further diversify the agent group, we included a dog, asking some participants to complete the questionnaire based on their interactions with a dog. Furthermore, with an eye on the future, we also incorporated an online version of the classic Eliza chatbot (Weizenbaum, 1966), making it possible to expose people in the future to the same agent. Finally, we included a non-existent agent, "Xonderfloip", as a distractor check, resulting in a list of 14 agents. Participants were asked to indicate the timing of their last interaction with each agent, with options ranging from "today" to "never". Of the 1,296 individuals initially recruited, 1,253 participants responded "never" to interactions with the distractor agent, meeting the criterion for inclusion in the subsequent phase of the study.

To allow people to compare their agent with the agents in the benchmark set, we aimed for a statistical power of 0.80 to detect at least a medium-sized effect in future independent t-tests with an alpha level of 0.05 (Cohen, 1992). Consequently, the benchmark set required a minimum of 64 samples per agent (see the power calculation sketched below). To ensure participants had interacted with the agents recently, we only retained agents used within the last six months, narrowing the agent group from 14 to 10. In addition to the Eliza chatbot and the dog, we selected the agents Alexa, Bard, Bing, ChatGPT, CoPilot, Google Assistant, Roomba, and Siri. Participants were assigned to evaluate a single agent they were familiar with, or to interact with the Eliza chatbot for five minutes before assessment to establish their own interaction experience with this agent. Exclusion criteria in this phase were: (1) failing more than 20% of attention checks; (2) providing incoherent responses to open-ended questions (e.g., unintelligible or nonsensical answers, or indicating no interaction with the assigned ASA); and (3) completing fewer than 10 dialogue turns for those assigned to the Eliza chatbot. Each participant was allowed to participate only once, with only their first completion included in the analysis.

Out of the 1,253 available participants, we invited 777 individuals until we ended up with 666 participants who met the inclusion criteria (per agent: M = 66, SD = 1, range = [64 .. 68]). Among the exclusions, 47 participants did not complete the study with their assigned ASA, five failed attention checks (providing [3 .. 7] incorrect answers out of 10), and one was removed due to an open-ended response indicating no interaction with the assigned agent. Additionally, 58 participants assigned to the Eliza chatbot were excluded for completing fewer than 10 dialogue turns. We also asked participants to describe, in their own words, their experiences with the ASA to which they were assigned, for use in future research.

The resulting dataset included participants from the two phases: Phase 1 (n = 1,253) and a subset of these participants in Phase 2 (n = 666). The majority of participants identified as male (Phase 1: 54.5%; Phase 2: 57.8%), followed by female (Phase 1: 44.9%; Phase 2: 41.9%), with a small proportion identifying as other (Phase 1: 0.6%; Phase 2: 0.3%). The mean age was similar across both phases (Phase 1: M = 30, SD = 9.2; Phase 2: M = 29.8, SD = 9.2), with the largest age groups being 18-25 (Phase 1: 38.9%; Phase 2: 39.8%) and 26-35 (Phase 1 and Phase 2: 39.6%). Education levels were comparable between groups, with the highest proportions holding an undergraduate degree (Phase 1 and Phase 2: 41.4%) or a graduate degree (Phase 1: 25%; Phase 2: 23.9%).
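As a minimal sketch of the a priori power analysis mentioned above (not the authors' actual analysis script), the required sample size per agent can be reproduced in Python with statsmodels, assuming Cohen's d = 0.5 as the medium effect size:

    from statsmodels.stats.power import TTestIndPower

    # A priori power analysis for a two-sided independent-samples t-test:
    # medium effect (Cohen's d = 0.5), alpha = 0.05, power = 0.80.
    n_per_group = TTestIndPower().solve_power(
        effect_size=0.5, alpha=0.05, power=0.80, alternative='two-sided')
    print(n_per_group)  # about 63.8, rounded up to 64 participants per agent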
Socioeconomic status, assessed via the MacArthur Scale (Adler et al., 2000) (1 = lowest, 10 = highest), was distributed across the scale, with the largest proportions in the middle ranges (e.g., at level 6, Phase 1: 25.9% at level 6; Phase 2: 28.4%). Geographically (based on the United Nations Regional Groups (United Nations, 2024)), most participants resided in Western Europe (Phase 1: 46.8%; Phase 2: 42.9%), followed by Africa (Phase 1: 21.1%; Phase 2: 22.8%) and Eastern Europe (Phase 1: 18.2%; Phase 2: 20.1%). Smaller proportions were from Latin America and the Caribbean (Phase 1: 11.7%; Phase 2: 12.2%), with limited presentation from the United States (Phase 1: 1.2%; Phase 2: 0.6%), and other regions. Users of this dataset might select sub-datasets based on these characteristics to study specific groups. Table 1 provides an overview of participant interactions with 12 ASAs and a dog. ChatGPT emerged as the most widely used agent, with 89.47% of 1,253 participants reporting interactions. Google Assistant (85.08%) and Siri (71.51%) also demonstrated high usage rates. In contrast, less commonly used agents included Replika (10.45%), Xiaoice (7.98%), and Eliza (2.23%).Among the ASAs, ChatGPT and Google Assistant exhibited the highest proportions of recent interactions (today and this week), reflecting their integration into daily life. For instance, 295 participants interacted with ChatGPT today, and 362 this week. As anticipated, agents such as Eliza showed minimal recent interactions, with the majority of participants reporting never having engaged with them (1,225).The study generated a representative set of nine ASAs and a dog, collecting 666 unique participant ratings on the 90 first-person perspective items of the ASAQ. Sample sizes per agent ranged from 64 to 68.Analysis of the ASAQ long version revealed variability in the ASAQ scores across agents, ranging from -30 (Eliza) to +30 (the dog). The data set, showing a detailed presentation of the scores of the ASAs on each of the 24 constructs and dimensions of the ASAQ, can be accessed publicly online (Fitrianie et al., 2025a).The ASAQ constructs and overall item content remained consistent with the ASAQ representative set 2024; the only difference is the participants' point of view, with the 2024 set collected from a third-person perspective (watching a video of a human-ASA interaction) and the 2025-set from a first-person perspective (interacting directly with an ASA). Items reflect the relevant perspective (e.g., "The user can rely on [the agent]" vs. "I can rely on [the agent]"). The ASAQ construct and dimension scores, derived from both the long and short versions of the ASAQ, for all agents in the Representative Set 2025 are provided in the Supplemental Data accompanying this article (see Supplementary Material, Table S1-S4).foot_2 The ASAQ Representative set 2025 extends the previously established ASAQ representative set 2024, offering an enhanced resource for researchers. The dataset highlights the varying interaction experiences people have in direct interaction with well-known agents. The reported use of contemporary ASAs (e.g., ChatGPT, Google Assistant, and Siri) demonstrates how rapidly conversational agents have become embedded in daily life. The inclusion of a non-artificial social agent (a dog) adds depth to the dataset, allowing for comparisons to other social experiences. 
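To illustrate the sub-dataset selection mentioned above, the following sketch filters the published data by demographic characteristics. The file and column names here are hypothetical stand-ins; consult the codebook of the published dataset (Fitrianie et al., 2025a) for the actual variable names:

    import pandas as pd

    # Hypothetical file and column names standing in for the public dataset.
    data = pd.read_csv('asaq_representative_set_2025.csv')

    # Example: ratings from Western European participants aged 18-25.
    subset = data[(data['region'] == 'Western Europe')
                  & (data['age_group'] == '18-25')]

    # Mean score per agent on one construct, e.g., likeability.
    print(subset.groupby('agent')['likeability'].mean())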
Additionally, the variability in ASAQ scores, ranging from -30 for Eliza to +30 for the dog, provides anchor points against which researchers can compare their own ASA when using the ASAQ. Furthermore, the dataset allows for the ranking of results on each ASAQ construct or dimension relative to the agents included in the ASAQ Representative Set. To facilitate analysis, researchers can utilise ASAQ charts, which offer a clear, at-a-glance visualisation of their ASA's scores across all 24 constructs/dimensions, enabling direct comparisons with the representative ASAs (see the comparison sketched below). This resource promotes robust and standardised reporting in studies focused on human-agent interaction, which advances methodological consistency in the field. With the dataset presented here, it is possible to create similar guidelines for first-person perspective use of the ASAQ.

Two limitations of this dataset should be noted. First, apart from Eliza, participants evaluated ASAs based on their most recent interaction, which relies on recall and may introduce bias due to differences in time since use, ASA version, and interaction context. Second, participants were recruited through Prolific, which may limit how representative the sample is of the broader population of ASA users.

Table 1. Summary of participants' usage of the 13 agents (12 ASAs and a dog) surveyed between November 30 and December 13, 2023 (n = 1,253). For each agent, the table reports the percentage of participants who reported any use and when this use last occurred. ASAQ scores are presented only for the ASAs we measured (n = 666).
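As a minimal sketch of the comparison described above, a researcher could test their own agent's construct scores against one representative agent and rank it among the set. The file and column names, and the stand-in scores, are hypothetical; the sample size of at least 64 per agent underpins the 0.80 power for a medium effect at alpha = 0.05:

    import pandas as pd
    from scipy.stats import ttest_ind

    # Stand-in for your own participants' scores on one ASAQ construct
    # (e.g., likeability), collected with the first-person ASAQ.
    my_scores = pd.Series([1.4, 0.9, 1.1, 1.6, 0.7])

    # Hypothetical file and column names standing in for the public dataset.
    benchmark = pd.read_csv('asaq_representative_set_2025.csv')
    alexa = benchmark.loc[benchmark['agent'] == 'Alexa', 'likeability']

    # Welch's t-test against one representative agent.
    t, p = ttest_ind(my_scores, alexa, equal_var=False)
    print(t, p)

    # Rank your agent's mean among the representative agents' means.
    means = benchmark.groupby('agent')['likeability'].mean()
    print((means > my_scores.mean()).sum() + 1)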

Keywords: artificial social agent, evaluation instrument, normative dataset, questionnaire, user study

Received: 15 Oct 2025; Accepted: 08 Dec 2025.

Copyright: © 2025 Fitrianie, Abdulrahman, Bruijnes and Brinkman. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Siska Fitrianie

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.