- 1 Vir Health Ltd., London, United Kingdom
- 2 Allurion Technologies, Natick, MA, United States
Developments in Machine Learning-based Conversational and Generative Artificial Intelligence (GenAI) have created opportunities for sophisticated Conversational Agents to augment elements of healthcare. While not a replacement for professional care, AI offers opportunities for scalability, cost-effectiveness, and automation of many aspects of patient care. However, to realize these opportunities and deliver AI-enabled support safely, interactions between patients and AI must be continuously monitored and evaluated against an agreed-upon set of performance criteria. This paper presents one such set of criteria, developed to evaluate interactions with an AI Health Coach designed to support patients receiving obesity treatment and deployed to an active patient user base. The evaluation framework evolved through an iterative process of development, testing, refining, training, reviewing, and supervision. The framework evaluates at both the individual message and overall conversation level, rating interactions as Acceptable or Unacceptable in four domains: Fidelity, Accuracy, Safety, and Tone (FAST), with a series of questions to be considered with respect to each domain. Processes to ensure consistent evaluation quality were established, and additional patient safety procedures were defined for escalations to healthcare providers based on clinical risk. The framework can be implemented by trained evaluators and offers a method by which healthcare settings deploying AI to support patients can review quality and safety, thus ensuring safe adoption.
1 Introduction
There has been an explosion of interest in the application of Machine Learning (ML)-based Generative and Conversational Artificial Intelligence (GenAI) in recent years. Recent advances in Large Language Models (LLMs) have resulted in the emergence of sophisticated Conversational Agents (CAs) which are changing the landscape of human-computer interaction (1). LLM-based CAs, for example "ChatGPT", can predict, recognize, translate, and generate content to answer questions on an extensive range of topics with high rates of accuracy (2, 3). While traditional rules-based chatbots typically facilitated task-oriented exchanges within a narrowly defined set of parameters, contemporary CAs can facilitate natural, open-ended conversations on a wide range of topics (4).
The application of CAs to healthcare offers significant potential to facilitate and enhance patient care. CAs offer a breadth and depth of information and guidance to complement existing healthcare expertise, thereby increasing efficiency. With 24/7 access and near-instant speed of response, CAs can support patients where and when they need it. While CAs are not a replacement for professional care, the reality is that there are simply not enough health professionals, time, or money to meet healthcare demands globally. CAs offer tremendous opportunities for scalability, cost-effectiveness, and automation in certain aspects of healthcare (5). An emerging use for these types of tools is AI Coaching, defined as "synchronous or asynchronous coaching using AI or a computer as a coach instead of a human coach" (6). While in some healthcare contexts the term "AI Health Coach" has been used synonymously with "AI Chatbot", we emphasize that health coaching is a specific discipline focused on empowering clients to attain their health goals through techniques such as open-ended questions, motivational interviewing, and active listening, which are aimed at fostering self-awareness and self-determination (7). CAs and AI Health Coaches are being applied in numerous contexts to support patients, for example in digital mental health applications (8), nutrition programs (9), public forums for health education (10), and interventions to increase physical activity and improve sleep (11).
As with any new technology, several ethical and safety considerations emerge, leading to many unanswered questions about how best to implement this technology. Concerns include those related to data privacy and security (12) and the potential for bias embedded in AI algorithms (13). LLMs are trained on broad datasets that may include unverified and biased information that is then presented to the user (14). Consequently, they can be prone to inaccuracies that present a safety concern. The well-documented "hallucination" phenomenon refers to CAs' occasional propensity to deliver what appears to be assured and credible information that is in fact nonsensical or inaccurate (15). In addition, LLMs such as ChatGPT use a probabilistic process to predict responses, meaning responses may be inconsistent and dependent on the wording of user queries (16), although analysis of the reproducibility of ChatGPT responses to nutrition questions related to Inflammatory Bowel Disease is promising (9). It is necessary for any healthcare setting in which CAs are deployed to continuously monitor and evaluate the safety and accuracy of interactions. Deploying CAs in healthcare prematurely, without thorough monitoring and evaluation, can lead to user disengagement at best and, at worst, could compromise patient safety.
To ensure consistent and reliable monitoring of safety and accuracy, conversations should be analyzed and evaluated, based on agreed performance and quality criteria, by experts in the specific healthcare field in which the CA is being deployed. Evaluation frameworks should encompass elements of accuracy and safety, in addition to elements related to user preferences for conversational style. For example, research suggests that patients prefer interactions that are goal-oriented (17) and human-like in nature (18). Several studies propose such evaluation frameworks, with metrics including accuracy (19–21); elements of tone such as human-likeness and narrative storytelling (22), fluency (20), and empathy (21); risk or likelihood of harm (23, 24); trust and safety (8); helpfulness (24); and effectiveness (19).
1.1 Purpose
We developed one such evaluation framework to monitor conversations between patients and a GenAI Conversational Agent. The CA was designed to support patients engaged in a variety of weight-loss treatment programs for overweight and obesity and was deployed to an active patient user base within a smartphone application. More details of this CA, "Coach Iris", are shown in Figure 1.

Figure 1. Details of “Coach Iris”, a GenAI conversational agent designed to support patients engaged in a variety of weight-loss treatment programs for overweight and obesity.
The evaluation framework was designed, tested, and iterated by a team of behavioral scientists and psychologists. The framework was implemented by a trained team of reviewers to continuously evaluate the quality of the CA and monitor patient safety. Here we describe the iterative process by which this framework was developed and continues to evolve. We call particular attention to the process of safety monitoring and clinical risk escalation. We then present the full framework and evaluation criteria.
2 Methods
The framework was developed, and continues to evolve, through an iterative process outlined in Figure 2. Below, we outline the stages of development. The full framework is detailed in the Results.
2.1 Development
An initial version of the evaluation framework was developed with the primary purpose of measuring the quality and safety of the CA and informing model improvements, before implementation in a real world patient care setting. The development was led by a Behavioral Science team which comprised Master's and Doctorate qualified individuals with expertise in health coaching, digital health evaluation, behavior change, and obesity care. The initial framework was further informed by literature review and consultation with subject matter experts in a range of areas, including data science, machine learning, and AI.
Early in development, the development team identified that the framework would require evaluation of both individual messages (per-turn evaluation) and the overall conversation (per-dialogue evaluation) to offer the most robust evaluation of CA-patient interactions. As outlined by Smith and colleagues (25), both levels of evaluation have unique advantages. Per-turn evaluations are fine-grained, encouraging reviewers to focus on nuances in interactions. However, the quality of an overall conversation is more than the sum of its parts, and per-dialogue evaluations capture overall quality better. By using both levels of evaluation, reviewers can distinguish between instances where low-quality turns exist but overall conversation quality is still high, and conversations where a single low-quality turn is of such concern that the whole conversation is unacceptable.
At the dialogue level, evaluations comprise elements of quality, rating the performance of the model and the acceptability of responses, and context, categorizing the topics covered in the responses. At the turn level, evaluations focus only on quality. Figure 3 outlines the levels of evaluation at a high level, with the full framework and scoring criteria detailed in Section 3.
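To make the two levels of evaluation concrete, the sketch below shows one possible way to represent per-dialogue and per-turn evaluation records in Python. The class and field names are our own illustrative choices, and the dichotomous Acceptable/Unacceptable scores and tag lists anticipate the full framework detailed in Section 3; this is a sketch, not the data model used by the evaluation software.

```python
from dataclasses import dataclass, field
from typing import Optional

# Dichotomous quality scores used throughout the framework (see Section 3):
# 1 = Unacceptable, 2 = Acceptable.
UNACCEPTABLE, ACCEPTABLE = 1, 2

@dataclass
class TurnEvaluation:
    """Per-turn evaluation (single message exchange): quality only."""
    turn_index: int
    quality_score: Optional[int] = None          # None => assumed acceptable
    turn_tags: list[str] = field(default_factory=list)   # explain the score

@dataclass
class DialogueEvaluation:
    """Per-dialogue evaluation: quality (FAST + overall) and context (tags)."""
    dialogue_id: str
    fidelity: int = ACCEPTABLE
    accuracy: int = ACCEPTABLE
    safety: int = ACCEPTABLE
    tone: int = ACCEPTABLE
    overall_score: int = ACCEPTABLE              # subjective, not an average
    dialogue_tags: list[str] = field(default_factory=list)  # topic/context
    turns: list[TurnEvaluation] = field(default_factory=list)
```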
2.2 Testing and refining
A team of five tested the initial framework: three Master's-level evaluators with expertise in health coaching, digital health evaluation, nutrition, and weight management; one Doctorate-level health psychologist with expertise in behavioral science; and one Doctorate-level behavioral science and weight management expert, all of whom provided feedback on the framework. They independently evaluated the same 12 dialogues using the framework. Results were compared, discrepancies discussed, and relevant framework changes were made and tested again. Following any iterations to the framework, the team tested the changes by evaluating dialogues independently, discussing results, and resolving any discrepancies. Examples of aspects iterated upon during continuous testing and improvement are outlined later in this section.
2.3 Evaluator training
Once the initial framework was finalized, a training process was developed to standardize the implementation of the framework across all evaluators. Training was led by a qualified health professional with 5 years' experience in health coaching of patients, who was also a key collaborator in the development of this framework. The team who was involved in the development and testing phase then progressed through a staged training process; they attended training sessions, shadowed the health coaching professional conducting conversation evaluations, and then reviewed with supervision. Written training resources and guidelines were created to support training and ongoing evaluations. Team members only evaluated independently when they had been certified to do so by the health coaching professional, and in some cases additional training or supervision was provided. After certification, conversation evaluations were regularly discussed and monitored to ensure consistent quality, as outlined below. As noted above, the framework was continuously iterated, and after any changes, the evaluation team was informed of the rationale and implications; and additional training was provided as needed.
2.4 Reviewing & supervision
Conversation evaluations were regularly reviewed and supervised to ensure a high standard was upheld. As all evaluations were conducted remotely, a discussion channel was created including all evaluators, where they could ask questions about the process or share specific dialogue examples for discussion or second opinion. Lead evaluators carried out regular spot-checks of a subset of evaluations. Where discrepancies were identified, they were discussed amongst the entire evaluator group to ensure continuous learning and modifications were agreed upon and made.
A process for Subject Matter Expert (SME) review was also established, whereby evaluations could be passed to a range of SMEs for a second opinion on the acceptability of the CA response. This was particularly useful when dialogues covered niche areas of knowledge. SMEs included Master's- and Doctorate-level experts in exercise, nutrition, psychology and wellbeing, mental health, machine learning, obesity treatment, and symptom management. Insights from SME reviews contributed to targeted refinements, including adjustments to response parameters, retrieval-augmented generation (RAG) data, and prompt engineering to improve conversational alignment. SMEs also provided a second review and shared any relevant knowledge or details that would support future evaluations of similar topics. This ensured high evaluation accuracy, mitigation of risk, and continuous learning and expansion of the evaluation team's knowledge base.
2.5 Safety monitoring
Given that the service interacts with patients in a healthcare weight management setting, all patient messages were continuously screened for any clinical or weight-loss risks. An escalation process was developed with a team of in-house medical experts, and a set of criteria was developed to determine when a patient dialogue needed to be escalated to their healthcare provider. Criteria included indications of clinical risk (e.g., persistent, severe, or unexpected symptoms; danger to patient safety; pregnancy) or weight-loss risk (e.g., significant weight gain or plateau for two or more weeks, lack of treatment response). Where these criteria were present, the patient's healthcare provider was alerted by email. Providers were instructed to follow their usual duty of care for all patients and were reminded that the CA was not a replacement for clinical follow-up support. Continuous monitoring and review of conversation evaluations also informed changes and updates to the CA and the framework. The framework was, and is, iterated as needs emerge, and these changes run through the full development process (testing, training, review, and monitoring).
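As an illustration only, the escalation criteria above could be expressed as a simple rule check; the flag names and thresholds below are our own simplification, and in practice the criteria were applied by trained human reviewers through the escalation process described here.

```python
def needs_provider_escalation(dialogue_flags: dict) -> bool:
    """Hypothetical check mirroring the escalation criteria described above.

    `dialogue_flags` is assumed to be produced upstream (by reviewers or a
    screening step) and is not part of the published framework.
    """
    clinical_risk = (
        dialogue_flags.get("persistent_severe_or_unexpected_symptoms", False)
        or dialogue_flags.get("danger_to_patient_safety", False)
        or dialogue_flags.get("pregnancy", False)
    )
    weight_loss_risk = (
        dialogue_flags.get("weight_gain_or_plateau_weeks", 0) >= 2
        or dialogue_flags.get("lack_of_treatment_response", False)
    )
    return clinical_risk or weight_loss_risk

# Example: a reported weight plateau lasting three weeks triggers escalation.
assert needs_provider_escalation({"weight_gain_or_plateau_weeks": 3})
```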
3 Results
Here we outline the full evaluation framework. We first present per dialogue evaluations in both quality (FAST Metrics, Overall Scores) and context (Dialogue Tags), followed by per-turn quality evaluations (Overall Turn Scores, Turn Tags).
3.1 Per dialogue evaluation
3.1.1 FAST metric scores (quality)
We identified that interactions need to be scored in four domains: Fidelity, Accuracy, Safety, and Tone. These reflected some key domains identified in related evaluation frameworks, as previously discussed, in addition to a novel domain of Fidelity. Fidelity encompasses elements of helpfulness (24) and effectiveness (19) with additional elements of adherence to Behavior Change techniques and Coaching Psychology practice, which is key to sustained health behavior change (26, 27). More details and questions to consider in each domain are outlined in Table 1.
Scores were dichotomous such that dialogues receive an overall score of 1, Unacceptable (CA response should not be presented to users in this way) or 2, Acceptable (CA is fulfilling requirements) on each of the four metrics. This dichotomy obliged reviewers to determine whether an interaction was acceptable or not, avoiding a middle ground and ensuring high quality of dialogues. In training, reviewers were encouraged to rate conservatively when in doubt, to avoid false positive ratings. When dialogues or turns were scored as unacceptable, feedback was shared with the Machine Learning team and relevant adjustments were made to fine-tune the model to address the unacceptable behavior observed.
3.1.2 Overall dialogue score (quality)
Overall scores were a subjective rating of whether the dialogue as a whole was 1, Unacceptable or 2, Acceptable. Though not calculated as an average of the FAST metrics, evaluators considered each of the FAST metrics above when assigning this score: if one or more FAST metrics was scored 1, Unacceptable, they considered whether this made the overall dialogue unacceptable. Similarly, if all four FAST metrics were 2, Acceptable, reviewers were likely to assign an overall score of 2.
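Because the overall score is a subjective judgement rather than a calculation, a review tool can at most prompt the evaluator. The sketch below, reusing the illustrative record and constants from the earlier sketch, shows one way such a prompt might be generated; it does not decide the score.

```python
ACCEPTABLE = 2  # as in the earlier illustrative sketch

def overall_score_prompt(ev: "DialogueEvaluation") -> str:
    """Surface the relevant consideration for the evaluator's overall rating.

    The framework leaves the final overall score to the evaluator's judgement;
    this helper only suggests, it never assigns.
    """
    fast_scores = [ev.fidelity, ev.accuracy, ev.safety, ev.tone]
    if all(score == ACCEPTABLE for score in fast_scores):
        return "All FAST metrics acceptable: an overall score of 2 is likely."
    return ("One or more FAST metrics unacceptable: consider whether this "
            "makes the whole dialogue unacceptable (overall score of 1).")
```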
3.1.3 Dialogue tags (context)
Dialogue-level tags were chosen from the predefined list of tags in Table 2. These descriptive tags supported internal analytics, training, and data review. At least one tag was chosen to describe the primary topic of the chat, with additional tags for secondary topics as relevant.
3.2 Per turn evaluation
3.2.1 Overall turn score (quality)
Turns were assigned a score of 1 or 2 for the overall quality of the turn, using the same scoring descriptions as for the Dialogue Quality scores. Turn-level quality scores were used to identify good or particularly bad examples of interactions. Not every turn needed to be rated; if no score was entered for a turn, it was assumed to be acceptable.
3.2.2 Turn tags (quality)
In cases where a turn received a quality score, a tag was assigned to explain the score. The list of turn-level tags can be seen in Table 3. Note this differs from the Dialogue Tags, which describe the context or topic of the dialogue; these tags explain the quality score assigned to the turn.
3.3 Evaluation system software
Software was developed to enable evaluations. Conversation transcripts were imported and presented to the evaluation team via an interactive online tool. Evaluators could filter and sort conversations by date, rating, etc. The system presented conversations to evaluators for monitoring and facilitated the tracking of evaluations and escalations. Screenshots of the evaluation system can be seen in Figure 4.
An example patient conversation is shown in Figure 5.
In each individual dialogue, some details of the overall conversation were summarized (e.g., patient and clinic ID, date of conversation, etc.). The transcript of the conversation was presented alongside a notes box where evaluators could enter their rating and tags.
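As a minimal, hypothetical illustration of the filtering and sorting described above, the snippet below selects dialogues for a review session; the record fields are assumptions for the example and do not describe the actual system's implementation.

```python
from datetime import date

def select_for_review(evaluations: list[dict], since: date,
                      only_unacceptable: bool = False) -> list[dict]:
    """Filter and sort dialogue evaluation records for a review session (illustrative)."""
    selected = [
        ev for ev in evaluations
        if ev["conversation_date"] >= since
        and (not only_unacceptable or ev["overall_score"] == 1)
    ]
    # Most recent conversations first, so emerging issues are reviewed promptly.
    return sorted(selected, key=lambda ev: ev["conversation_date"], reverse=True)
```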
3.4 Continuous improvement
As demonstrated in the iterative development process above, this framework continues to evolve as we identify relevant changes through reviewing and monitoring. To date, several changes have been made to the framework, for example:
• In addition to the FAST metrics, a fifth metric of Helpfulness/Supportiveness was originally added, before it was determined that this should be a core element of Fidelity rather than its own domain.
• The Overall Dialogue Score previously comprised an average of the FAST metrics, but it was changed to be a subjective rating instead. Assigning a subjective rating allowed evaluators to determine the overall acceptance of a dialogue rather than being constrained to an average calculation.
• Scores went through an iteration process, for example initially using a 3-point scale to represent an Unacceptable response (1), a Concerning response (2), and an Acceptable response (3). However, after testing, it was determined that dialogues are either acceptable or they are not, and removing the middle option encouraged evaluators to commit to one of the two categories for each dialogue.
Iterations will continue and we believe this flexibility ensures evaluations can continue as LLMs evolve and become more complex. This paper presents the current framework at time of writing. As noted by Schueller and Morris (1), several “unknown unknowns” might exist when it comes to safe application of CAs to healthcare, with all use cases and risks being impossible to foresee until the model is deployed in real-life settings and real-world behavior is closely monitored.
4 Discussion
This paper reports on the development and face validity of a framework to evaluate and monitor the safety and quality of conversations between patients and a GenAI CA designed specifically to support patients receiving treatment for overweight and obesity. We determined that interactions are best evaluated at both the Dialogue and Turn level, and that reviews should capture both Quality and Context. We identified four key metrics on which to evaluate conversations: Fidelity, Accuracy, Safety, and Tone. The framework has been used by a team of trained reviewers to evaluate real-world conversations between patients and the CA.
Keeping a "human in the loop" (28) is crucial, particularly as the application of GenAI to patient care is still in its infancy. Our framework offers a structured process which could be used in any setting deploying a CA with users, irrespective of the technical expertise or machine learning knowledge of evaluators. The advantages of using the framework to continuously monitor interactions are two-fold. First, it offers a way to quantify the performance of a CA, understand the types of interactions patients are having, and identify areas for model improvement. Second, and paramount, it offers a process to effectively monitor the safety of the model. Evaluators review conversations to ensure information provided to patients is appropriate, safe, accurate, and well presented, as captured by the Dialogue- and Turn-level evaluations. Crucially, evaluators also pay careful attention to patient reports indicating risk (as outlined in Section 2.5). In our case, any indications of clinical risk are escalated to the healthcare providers delivering the weight management treatment.
In practice, this framework can be implemented through regular review cycles where evaluators examine randomized conversation samples, enabling early detection of interaction patterns requiring improvement. Organizations can establish tiered response protocols tailored to their specific setting, e.g., scheduled model refinements based on severity levels identified during evaluations. This supports prior research highlighting the need to approach AI safety from a whole-system perspective, including user interactions, rather than focusing only on the algorithm (29). For optimal application of this framework, we recommend employing at least three independent reviewers. This structure allows two reviewers to evaluate the same conversations and assess inter-rater reliability, while the third reviewer serves as a tie-breaker when consensus is needed on challenging evaluations. This approach will help to ensure consistent application of the framework criteria while providing a systematic mechanism for resolving disagreements about conversation quality or safety concerns.
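For the two overlapping reviewers, agreement on the dichotomous scores could be quantified with Cohen's kappa. The sketch below is a minimal illustration of that calculation and is not part of the published framework.

```python
def cohens_kappa(ratings_a: list[int], ratings_b: list[int]) -> float:
    """Cohen's kappa for two raters assigning 1 (Unacceptable) or 2 (Acceptable)."""
    assert len(ratings_a) == len(ratings_b) and ratings_a
    n = len(ratings_a)
    categories = {1, 2}
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    expected = sum(
        (ratings_a.count(c) / n) * (ratings_b.count(c) / n) for c in categories
    )
    if expected == 1.0:  # both raters used a single category throughout
        return 1.0
    return (observed - expected) / (1 - expected)

# Example: two reviewers agree on 8 of 10 dialogue-level overall scores.
print(cohens_kappa([2, 2, 1, 2, 2, 1, 2, 2, 2, 1],
                   [2, 2, 1, 2, 1, 1, 2, 2, 2, 2]))  # ~0.52
```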
Passmore and Tee (6) ask the question on many of our minds: can CAs and LLMs replace human coaches? Their work proposes that while AI coaching can meet the requirements of many aspects of traditional coaching, AI Coaches lack the true empathy, humor, and opportunities for creativity and playfulness often seen in human coaching. While we must perhaps accept that closing this gap entirely may never be feasible, strides to narrow it are realistic, especially as AI Coaches become more sophisticated every day. Through continuous monitoring and evaluation of conversation quality and context, using frameworks such as this, we can identify areas for improvement and enhance the support and guidance CAs can provide, thus increasing efficiencies and enabling human coaches to focus their expertise where it is most needed. Coaching psychologists have a critical role in the development and evaluation of AI coaching delivery (6), and health coaching professionals were involved in the development of both our CA and this evaluation framework, from inception through to launch. We could envision a future not where AI replaces human coaches or providers, but where providers who embrace AI to support, augment, and improve their work replace providers who do not.
Training an evaluation team thoroughly and evaluating all conversations is undoubtedly resource intensive. Leveraging machine learning techniques could automate some elements of the evaluation, thereby creating a loop in which one AI tool helps to monitor another. For example, we have started work to train an AI model to flag indications of clinical risk that need escalation consideration. Flagging conversations of concern could triage conversations, helping human evaluators to use their time more efficiently. This could help overcome challenges of reviewer fatigue and free up evaluator time to focus on the crucial human components, such as aspects of tone and fidelity. In this way, AI could support, but not replace, human evaluators (30). Future work could explore fully automated risk escalations and elements of the FAST review. As noted by Liu and colleagues (21), this aligns with the fundamental principle of medical informatics, "where the goal is not to create computer systems that are superior to humans, but rather to create systems that augment human intelligence, such that the human and computer together perform better than the human alone".
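Purely as an illustrative sketch (and not the in-development model referred to above), a first-pass triage step might flag messages containing risk-related phrases so that human evaluators review them first; the phrase list and logic here are assumptions for the example.

```python
# Illustrative triage sketch: flag patient messages that may warrant
# escalation review. The keyword list is an assumption for the example,
# not the model described in the text.
RISK_PHRASES = (
    "severe pain", "can't keep anything down", "blood in", "pregnant",
    "chest pain", "fainted", "no weight loss for weeks",
)

def flag_for_escalation_review(patient_messages: list[str]) -> list[str]:
    """Return messages containing phrases suggesting clinical or weight-loss risk."""
    flagged = []
    for message in patient_messages:
        text = message.lower()
        if any(phrase in text for phrase in RISK_PHRASES):
            flagged.append(message)
    return flagged

# Flagged conversations would be queued first for human evaluators,
# who make the final escalation decision.
```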
Our work here offers one evaluation framework amongst others that currently exist. For example, the DISCOVER conceptual framework proposes considerations across the CA development life cycle providing a step-by-step guide for developing rule-based, smartphone-delivered CAs (31). In contrast, our work focuses solely on evaluation and offers a process which can be used by trained but non-technical professionals (i.e., reviewers should possess skills in health coaching and/or patient support, but do not need to be familiar with the more technical aspects of Software Development, AI, Machine Learning, or CAs). Denecke (32) proposes a set of concrete metrics, heuristics, and checklists for evaluating health CAs from four perspectives: general perspective, response generation, response understanding, and aesthetics. This framework was designed for rule-based CAs, which have a simple personality without any kind of embodiment, and intended for researchers and developers who design or develop health CAs. Our framework could be applied in any healthcare setting by healthcare professionals and coaches.
There are several limitations of this work. We note that this is not a measure validation study; the next steps are to validate the framework and conduct inter-rater reliability analyses. There are many domains which could be added to the framework to enhance its comprehensiveness. The framework does not account for some ethical dilemmas, for example, how do CAs respond sensitively to issues relating to gender, race, identity, etc.? Wider evidence from AI demonstrates that models can be heavily biased towards the characteristics of those involved in their development. Intersectionality should be considered from design and development onwards, involving collaboration with members of various socio-economic groupings, races, genders, and other important categories to test tools before they are released (6).
The framework proposes a “top down” (expert-driven) approach for evaluation, and we know that digital health evaluation models and rating systems rarely incorporate the views or needs of patients and consumers. More work is needed to ensure that consumer perspectives are central and that we also integrate “bottom up” (consumer-informed) processes in the evaluation (33). In addition to monitoring performance, collecting continuous user feedback directly from patients, via surveys or interviews, is recommended to refine the CA and identify evaluation metrics of interest to end-users. Future publications will present the results of the application of this framework to evaluate patient support and will also explore and compare the quality of support provided to patients undergoing different treatment types (e.g., intragastric balloon, bariatric surgery, obesity management medications).
While human evaluation and monitoring can help drive safe use of GenAI within healthcare settings, findings from the FAST framework can also inform deployment readiness, helping determine whether additional safety measures, human oversight, or further refinements are required before broader implementation. In addition to its role in evaluation, the framework can be actively applied to establish safety guardrails for AI health coaching interactions. These guardrails define acceptable and unacceptable response patterns based on predefined criteria, ensuring structured thresholds for intervention and escalation where necessary.
On the macro level, policy and regulation must be implemented rapidly to ensure GenAI is used responsibly. It is challenging to consider how to manage the ethical dilemmas posed by AI-delivered services, which range from data privacy to potential over-reliance, in a sector with growing regulatory constraints. Current regulations are evolving and are likely to have global ramifications for how AI is deployed. Policies may also influence how risk management structures and protocols must be implemented in healthcare settings using AI. These policy developments, along with individual healthcare settings' due diligence, are essential to ensure a responsible balance between innovation and ethics as we navigate the new landscape of Conversational AI.
Data availability statement
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author.
Author contributions
MN: Conceptualization, Data curation, Investigation, Methodology, Project administration, Visualization, Writing – original draft, Writing – review & editing. EF: Conceptualization, Methodology, Supervision, Writing – original draft, Writing – review & editing. VR: Conceptualization, Methodology, Supervision, Writing – original draft, Writing – review & editing. JW: Conceptualization, Methodology, Project administration, Writing – original draft, Writing – review & editing. ZG: Resources, Supervision, Writing – original draft, Writing – review & editing. RC: Writing – original draft, Writing – review & editing. PS: Conceptualization, Methodology, Resources, Supervision, Writing – original draft, Writing – review & editing.
Funding
The author(s) declare that no financial support was received for the research and/or publication of this article.
Acknowledgments
We would like to extend our gratitude to Dr. Ben Ainscough, who provided invaluable guidance during the development of the framework. We also thank Dr. Regina Barzilay for advising.
Conflict of interest
Authors MN, ZG and PS are employed by or consultant for company Vir Health Ltd (trading as Numan). All authors are either current or former employees of Allurion Technologies. ZG and RC are stock option holders in Allurion Technologies Inc. ZG and MN are stock option holders in Vir Health Ltd (trading as Numan).
Publisher's note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Abbreviations and definitions
We have listed below an explanation of the various technical terms used in this paper. We note that as Generative and Conversational AI are such new fields, definitions may differ slightly elsewhere and may evolve over time. The definitions below reflect our working definitions at the time of publication.
AI: Artificial Intelligence. A broad term to encompass non-human programs or models that can solve sophisticated tasks, either via rule-based systems or machine learning.
ML: Machine Learning. The field of study that is concerned with developing algorithms that can be used by programs or systems (models) to make useful predictions from input data.
LLM: Large Language Model. A system which creates, or generates, language-based content, which has been pre-trained on vast datasets. ChatGPT is an example of an LLM.
GenAI: Generative Artificial Intelligence. A system that creates, or generates, new original content which may include text, images, videos, etc. in response to a user prompt.
CAI: Conversational Artificial Intelligence. A type of Generative Artificial Intelligence that focuses specifically on human-like, language-based interactions.
CA: Conversational Agent. A dialogue system that interacts with a user, i.e., the entity through which users interact with Conversational Artificial Intelligence.
ChatGPT: Chat Generative Pre-Trained Transformer. A Conversational Agent developed by the company OpenAI that is designed to process and generate human-like text.
References
1. Schueller SM, Morris RR. Clinical science and practice in the age of large language models and generative artificial intelligence. J Consult Clin Psychol. (2023) 91(10):559–61. doi: 10.1037/ccp0000848
2. Kasneci E, Sessler K, Küchemann S, Bannert M, Dementieva D, Fischer F, et al. ChatGPT for good? On opportunities and challenges of large language models for education. Learn Individ Differ. (2023) 103:102274. doi: 10.1016/j.lindif.2023.102274
3. Li YF, Wang H, Sun M. ChatGPT-Like Large-Scale Foundation Models for Prognostics and Health Management: A Survey and Roadmaps. (2023). arXiv:2305.06472. doi: 10.48550/arXiv.2305.06472
4. Wahde M, Virgolin M. Conversational Agents: Theory and Applications (2022). Available at: https://arxiv.org/abs/2202.03164. doi: 10.48550/arXiv.2202.03164 (Accessed June 15, 2024).
5. Lewandowski T, Kučević E, Leible S, Poser M, Böhmann T. Enhancing conversational agents for successful operation: a multi-perspective evaluation approach for continuous improvement. Electron Mark. (2023) 33(1):39. doi: 10.1007/s12525-023-00662-3
6. Passmore J, Tee D. Can Chatbots replace human coaches? Issues and dilemmas for the coaching profession, coaching clients and for organisations. Coach Psychol. (2023) 19(1):47–54. doi: 10.53841/bpstcp.2023.19.1.47
7. Diller SJ, Passmore J. Defining digital coaching: a qualitative inductive approach. Front Psychol. (2023) 14:14. doi: 10.3389/fpsyg.2023.1148243
8. Sarkar S, Gaur M, Chen LK, Garg M, Srivastava B. A review of the explainability and safety of conversational agents for mental health to identify avenues for improvement. Front Artif Intell. (2023) 6:12. doi: 10.3389/frai.2023.1229805
9. Samaan JS, Issokson K, Feldman E, Fasulo C, Ng WH, Rajeev N, et al. Artificial Intelligence and Patient Education: Examining the Accuracy and Reproducibility of Responses to Nutrition Questions Related to Inflammatory Bowel Disease by GPT-4. medRxiv [Preprint]. (2023). Available at: https://www.medrxiv.org/content/10.1101/2023.10.28.23297723v2 (Accessed June 15, 2024).
10. Ayers JW, Poliak A, Dredze M, Leas EC, Zhu Z, Kelley JB, et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social Media forum. JAMA Intern Med. (2023) 183(6):589–96. doi: 10.1001/jamainternmed.2023.1838
11. Singh B, Olds T, Brinsley J, Dumuid D, Virgara R, Matricciani L, et al. Systematic review and meta-analysis of the effectiveness of chatbots on lifestyle behaviours. NPJ Digit Med. (2023) 6:118. doi: 10.1038/s41746-023-00856-1
12. Golda A, Mekonen K, Pandey A, Singh A, Hassija V, Chamola V, et al. Privacy and security concerns in generative AI: a comprehensive survey. IEEE Access. (2024) 12:48126–44. doi: 10.1109/ACCESS.2024.3381611
13. Shuford J. Examining ethical aspects of AI: addressing bias and equity in the discipline. J Artif Intell Gen Sci. (2024) 3(1):103–24. doi: 10.60087/jaigs.v3i1.119
14. Cascella M, Montomoli J, Bellini V, Bignami E. Evaluating the feasibility of ChatGPT in healthcare: an analysis of multiple clinical and research scenarios. J Med Syst. (2023) 47(1):33. doi: 10.1007/s10916-023-01925-4
15. Alkaissi H, McFarlane SI. Artificial hallucinations in ChatGPT: implications in scientific writing. Cureus. (2023) 15(2). doi: 10.7759/cureus.35179
16. Blease C, Torous J. ChatGPT and mental healthcare: balancing benefits with risks of harms. BMJ Ment Health. (2023) 26(1):e300884. doi: 10.1136/bmjment-2023-300884
17. Clark L, Pantidi N, Cooney O, Doyle P, Garaialde D, Edwards J, et al. What makes a good conversation? Challenges in designing truly conversational agents. Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (2019). p. 1–12. Available at: http://arxiv.org/abs/1901.06525
18. Bickmore T, Cassell J. Social dialogue with embodied conversational agents. In: van Kuppevelt JCJ, Dybkjær L, Bernsen NO, editors. Advances in Natural Multimodal Dialogue Systems. Dordrecht: Springer Netherlands (2005). p. 23–54. doi: 10.1007/1-4020-3933-6_2
19. Wang A, Pruksachatkun Y, Nangia N, Singh A, Michael J, Hill F, et al. SuperGLUE: a stickier benchmark for general-purpose language understanding systems. 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada. arXiv (2020). Available at: http://arxiv.org/abs/1905.00537
20. Zellers R, Holtzman A, Rashkin H, Bisk Y, Farhadi A, Roesner F, et al. Defending Against Neural Fake News. arXiv. (2020). doi: 10.48550/arXiv.1905.12616
21. Liu S, McCoy AB, Wright AP, Carew B, Genkins JZ, Huang SS, et al. Leveraging large language models for generating responses to patient messages. MedRxiv Prepr Serv Health Sci. (2023). doi: 10.1101/2023.07.14.23292669
22. See A, Pappu A, Saxena R, Yerukola A, Manning CD. Do massively pretrained language models make better storytellers? Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL) arXiv (2019). Available at: http://arxiv.org/abs/1909.10705
23. Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. Nature. (2023) 620(7972):172–80. doi: 10.1038/s41586-023-06291-2
24. Bai Y, Jones A, Ndousse K, Askell A, Chen A, DasSarma N, et al. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv. (2022). Available at: http://arxiv.org/abs/2204.05862 (Accessed June 15, 2024).
25. Smith EM, Hsu O, Qian R, Roller S, Boureau Y-L, Weston J. Human evaluation of conversations is an open problem: comparing the sensitivity of various methods for evaluating dialogue agents. In: Proceedings of the 4th Workshop on NLP for Conversational AI; 2022; Dublin, Ireland. Dublin: Association for Computational Linguistics (2022). p. 77–97.
26. Michie S, Richardson M, Johnston M, Abraham C, Francis J, Hardeman W, et al. The behavior change technique taxonomy (v1) of 93 hierarchically clustered techniques: building an international consensus for the reporting of behavior change interventions. Ann Behav Med Publ Soc Behav Med. (2013) 46(1):81–95. doi: 10.1007/s12160-013-9486-6
27. Palmer S, Whybrow A. Handbook of Coaching Psychology: A Guide for Practitioners. 2nd ed. New York, NY: Routledge (2018). p. 630.
28. Bakken S. AI In health: keeping the human in the loop. J Am Med Inform Assoc. (2023) 30(7):1225–6. doi: 10.1093/jamia/ocad091
29. Lyell D, Wang Y, Coiera E, Magrabi F. More than algorithms: an analysis of safety events involving ML-enabled medical devices reported to the FDA. J Am Med Inform Assoc JAMIA. (2023) 30(7):1227–36. doi: 10.1093/jamia/ocad065
30. Liu S, Wright AP, Patterson BL, Wanderer JP, Turer RW, Nelson SD, et al. Using AI-generated suggestions from ChatGPT to optimize clinical decision support. J Am Med Inform Assoc JAMIA. (2023) 30(7):1237–45. doi: 10.1093/jamia/ocad072
31. Dhinagaran DA, Martinengo L, Ho MHR, Joty S, Kowatsch T, Atun R, et al. Designing, developing, evaluating, and implementing a smartphone-delivered, rule-based conversational agent (DISCOVER): development of a conceptual framework. JMIR MHealth UHealth. (2022) 10(10):e38740. doi: 10.2196/38740
32. Denecke K. Framework for guiding the development of high-quality conversational agents in healthcare. Healthcare (Basel). (2023) 11(8):1061. doi: 10.3390/healthcare11081061
Keywords: GenAI, AI, health coach, weight-loss coach, AI evaluation, conversational generative AI evaluation, large language models, machine learning
Citation: Neary M, Fulton E, Rogers V, Wilson J, Griffiths Z, Chuttani R and Sacher PM (2025) Think FAST: a novel framework to evaluate fidelity, accuracy, safety, and tone in conversational AI health coach dialogues. Front. Digit. Health 7:1460236. doi: 10.3389/fdgth.2025.1460236
Received: 5 July 2024; Accepted: 30 April 2025;
Published: 18 June 2025.
Edited by:
Linda W.P. Peute, Amsterdam University Medical Center, Netherlands
Reviewed by:
Rüdiger Christoph Pryss, Julius Maximilian University of Würzburg, Germany
Trina Histon, Woebot Labs Inc., United States
Copyright: © 2025 Neary, Fulton, Rogers, Wilson, Griffiths, Chuttani and Sacher. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Martha Neary, marthaneary01@gmail.com