Use of large language model-based chatbots in managing the rehabilitation concerns and education needs of outpatient stroke survivors and caregivers

Background The utility of large language model-based (LLM) artificial intelligence (AI) chatbots in many aspects of healthcare is becoming apparent though their ability to address patient concerns remains unknown. We sought to evaluate the performance of two well-known, freely-accessible chatbots, ChatGPT and Google Bard, in responding to common questions about stroke rehabilitation posed by patients and their caregivers. Methods We collected questions from outpatients and their caregivers through a survey, categorised them by theme, and created representative questions to be posed to both chatbots. We then evaluated the chatbots' responses based on accuracy, safety, relevance, and readability. Interrater agreement was also tracked. Results Although both chatbots achieved similar overall scores, Google Bard performed slightly better in relevance and safety. Both provided readable responses with some general accuracy, but struggled with hallucinated responses, were often not specific, and lacked awareness of the possibility for emotional situations with the potential to turn dangerous. Additionally, interrater agreement was low, highlighting the variability in physician acceptance of their responses. Conclusions AI chatbots show potential in patient-facing support roles, but issues remain regarding safety, accuracy, and relevance. Future chatbots should address these problems to ensure that they can reliably and independently manage the concerns and questions of stroke patients and their caregivers.


Introduction
Artificial intelligence (AI) is a branch of machine learning which has experienced significant advancements over the past decade, including in the field of healthcare, where its use is being investigated in prediction and prognostication models, decisionmaking aids (1), and patient-facing interactions, among others (2).With the development of large language models (LLM) such as GPT-3.5 (Generative Pre-trained Transformer 3.5) and PaLM 2 (Pathways Language Model 2), and their subsequent incorporation into clinical, education, and research domains across the healthcare spectrum, both opportunities as well as challenges associated with use have been increasingly identified (3).There is even the potential for AI chatbots to be used to manage the concerns and questions of patients with common chronic conditions and their caregivers, though concerns about the factuality and safety of their advice remain unanswered (4,5).With the integration of LLM chatbots into desktop and mobile device interfaces such as Microsoft Bing and Google Search (6,7), the barriers to access for patients are constantly being lowered, and there is a real risk of them being exposed to inaccurate or unsafe advice which they may erroneously perceive to be professional or reliable (8).In the field of rehabilitation, patients with stroke, as well as their caregivers, are particularly susceptible, owing to complex, diverse, and evolving concerns ranging from cognitive symptom management to fear of recurrence and financial assistance (9,10).
We aimed to evaluate the accuracy, safety, relevance, and readability of 2 well-known and freely-accessible AI chatbots (ChatGPT and Google Bard) in providing responses to common questions about stroke rehabilitation posed by a local group of outpatients and their caregivers.ChatGPT (Chat Generative Pretrained Transformer) and Google Bard are freely-available general language chatbots based on the GPT-3.5 and PaLM 2 LLM frameworks respectively.Both chatbots were pre-trained on billionsto-trillions of primarily English-language tokens though the PaLM 2 framework's dataset is more recent.We hypothesised that they could provide standard responses to basic questions about stroke rehabilitation, but experience difficulties with answering local or context-specific concerns.Secondary aims were to investigate the incidence of patient and caregiver concerns in a local setting and observe for temporal trends, as well as to observe clinician concordance in evaluating the answers provided by these chatbots, given that physicians of different training and experience levels may view such answers differently.

Methods
This project was envisioned as a single-site two-phase mixedmethods (semi-qualitative + evaluation/assessment) study in an acute general hospital's specialist outpatient clinic, supported by anonymous questionnaires for generation of the question list.Reporting would conform to the Clinical AI Research (CAIR) guidelines as far as possible (11), given the study's mixed design with no other appropriate EQUATOR analogues (12).Ethical approval was sought to approach patients anonymously and an exemption was granted by our Institutional Review Board (SingHealth CIRB 2023/2542).

Phase 1: question generation
As studies on the concerns of patients with stroke were mainly published >20 years ago (13), we first intended for the list of questions to reflect our patient cohort's concerns in relevance and currency.To this end we created a questionnaire adapted from the model of challenges and coping behaviours after stroke (Supplementary Material S1), developed by an Australian team (14), that invited questions from patients with stroke, as well as their accompanying caregivers, encompassing domains such as realising physical limitations, engagement in activities, and psychological support.This would be opportunistically offered to all consecutive patients (or their caregivers) who met the single inclusion criteria (previous diagnosis of stroke) at the point of registration at the Rehabilitation Medicine specialist outpatient clinic for planned follow-up of their medical conditions.We sought to recruit 50 responses across a 2-month period, assuming a 20% response rate for an estimated 250 eligible patients.The questionnaire was anonymous and intended to be completed without training or explanation.There would be no interaction with study team members as the questionnaires were distributed by administrative staff, and implied consent would be assumed upon voluntary completion and return of the questionnaire.
Responses would be hand-arranged by thematic similarity and ranked by frequency into a list of top-10 questions, after review by study team members.This would form the basis of the questions to be fed into the AI chatbots.We would accept specific questions about local features (such as access to Day Rehabilitation Centres), to evaluate the chatbots' breadth of response, if they were asked frequently enough.
Singapore is a multi-ethnic urban country with a heterogeneous demographic that is primarily Chinese, Malay, and Indian in ethnicity.Patients with stroke are typically aged 65 years and above, although we often care for younger patients too.Caregivers tend to be direct relatives of patients but can sometimes be from the extended family (such as a nephew, niece, or grandchild).As an acute general hospital serving the entire eastern region of Singapore, our patient pool is representative of the nation's population characteristics.
The sample size of 50 was determined as a value of convenience that could generate 500 responses, which would provide sufficient data saturation to generate representative questions.Given that a 20% response rate is optimistic for untargeted surveys, we projected that at least 250 patient-caregiver dyads would have to be approached to meet the recruitment target.

Phase 2: response evaluation
Each question would be fed directly into both chatbots as unique instances, with only the leading statement "I am a patient living with stroke in Singapore.I have a quick question about my stroke rehabilitation", to set the stage.The first text answer to each question would be accepted with no repetitions, clarifications, or follow-up questions pursued.We assumed that a layman (patient or caregiver) may not be discerning enough to seek clarifications in the event of uncertainty.The chat history would then be cleared to prevent prior responses from influencing the next prompt's answer.
Three evaluators at different stages of training and experiencespecialist-in-residency (SIR, ½-year specialist experience), newlyqualified specialist (NQS, 4-years' experience), and senior specialist (SS, 20-years' experience)-would then evaluate each answer based on a 3-point Likert-like rubric for 4 domains of accuracy, safety, relevance, and readability (Supplementary Material S2) (15).The qualitative criteria for the 3 scoring levels of unsatisfactory, borderline, and satisfactory, were established beforehand.Free-text comments would be accepted for other aspects observed.The rubric was created by a single author with previous experience in mixed-methods education research (JREN), and vetted by the other two authors for consistency.Alignment was achieved through a sample completed marksheet that all authors used for standardisation.We took single review at the specialist level to be representative of the ground truth.

Data collection
The only data that would be collected from the patients and/or their caregivers would be an anonymous list of questions that they wished to ask about their stroke rehabilitation and recovery.Hardcopy responses would be stored in a locked cabinet and the list of top-10 questions would be stored in the department's Microsoft SharePoint database as a Microsoft Excel document with restricted access.

Statistical analysis
As this was a semi-qualitative study, numerical results would be presented only for summative head-to-head comparisons between the 2 AI chatbots in the 4 domains.Free-text comments and specific question breakdowns would be presented as-is.Interrater concordance would be reported using the Fleiss' Kappa (κ for >2 raters) which was available from IBM SPSS version 26.0 (IBM Corp. Armonk, NY, USA).Missing data would be reported as-is with no imputation attempted.No further interferential statistical work was planned.

Concerns and questions posed by patients and/or their caregivers
Recruitment was completed within the intended timeframe, upon receipt of 50 valid responses containing 280 unique questions.During sorting, 34 questions were classified as "incomplete, confusing, or not questions", and were excluded.Interestingly, a few responses, though out-of-context, requested to speak to a human/non-robot instead.
The remaining 246 questions were arranged and categorised into representative themes by a single author (JREN, and vetted by all authors), from which the 10 largest themes were used to form representative questions (Table 1).Themes with the most questions and concerns were "prognosis and recovery" (50 questions, 17.8%), "social support" (39 questions, 13.9%), and "psycho-emotional support" (37 questions, 13.2%).Owing to the range of questions within each theme and the use of multi-barrelled questions observed in many of the returned questionnaires, we curated the questions for the chatbots to likewise contain sub-questions (up to a maximum of 3).

Responses from the chatbots
Prompts were fed in on 5th February 2024 according to protocol.Mean response length was 328 words for ChatGPT and 352 words for Google Bard.Both chatbots always gave warnings about the generalisability of their advice with regard to an individual's health condition, as well as reminders to consult a healthcare provider for further details (Supplementary Material S3).

Accuracy of responses
For accuracy of responses, ChatGPT received 22 satisfactory grades (73.3%), 6 borderline grades (20%), and 2 unsatisfactory grades (6.7%), with Google Bard receiving the exact same scores.Explanations were considered accurate with the main problems arising when both chatbots attempted to list resources for patientsin ChatGPT's case it referred to a non-existent support group called the National Stroke Association of Singapore (NSAS), whereas Google Bard reported the existence of a fictitious financial support instrument called "Edusave for Medical Needs", and provided a live link for the website of Dietician's Association of Singapore, that turned out to belong to the Dyslexia Association of Singapore.
Most links to online resources were observed to be general, and linked to the homepages of organisations, rather than their specific resources.Interestingly, Google Bard's interface was observed to perform concurrent fact-checking, with some links removed after they had been provided, with the tag "invalid URL removed".The interface also incorporated a double-checking function, which when clicked, triggered a Google search to validate the factuality of the individual statements in each response.

Safety of responses
For safety of responses, ChatGPT received 14 satisfactory grades (46.7%), 13 borderline grades (43.3%), and 3 unsatisfactory grades The second was in safety-netting for psychological situations, in which both chatbots did not identify the possibility of patients entertaining thoughts of self-harm, and gave generic guidance for low mood.

Relevance of responses
For relevance of responses, ChatGPT received 16 satisfactory grades (53.3%), 9 borderline grades (30%), and 5 unsatisfactory grades (16.7%), with Google Bard receiving 24 satisfactory grades (80%), 3 borderline grades (10%), and 3 unsatisfactory grades (10%).Points were lost for specificity of answers-for example ChatGPT did not mention the need for driving recertification in patients looking to return to driving.Both chatbots stumbled at different points for questions with subquestions-ChatGPT lumped its responses to different post-stroke symptoms together, and Google Bard, in a break from its usual response style, explained that it could not provide medical advice in response to a prompt asking about treatments and therapies for getting better after stroke.

Readability of responses
For safety of responses, ChatGPT received 27 satisfactory grades (90%), 1 borderline grade (3.3%), and 2 unsatisfactory grades (6.7%), with Google Bard receiving 26 satisfactory grades (86.7%), 2 borderline grades (6.7%), and 2 unsatisfactory grades (6.7%).Most answers were well-signposted with appropriatelyworded terms, with avoidance of jargon and arranged in point form for easy reading.Only on a few occasions were answers considered too long or too short, with ChatGPT in one case providing a response that was considered "like a medical textbook".

Local contexts
We designated questions 3 and 5-10 as "local-context" questions, in which responses would be reasonably expected to contain some aspect of local relevance (e.g., named support group, available financial support scheme).Apart for a marginal increase in borderline scores for the safety domain for both chatbots, there were no particular differences as compared to the overall scores (Figure 4).

Inter-rater agreement
The overall Fleiss κ was 0.181 (p < 0.001) indicating slight interrater agreement between all 3 raters.The domain in which raters agreed the most on was relevance (κ = 0.297, p = 0.02) yet this was also only fair.Cohen's κ between individual rater pairs revealed more agreement between the NQS-SS pair (overall κ = 0.280, safety κ = 0.333, relevance κ = 0.320; all p < 0.05) than the other two pairings, but even then, agreement was also only fair at best.Overall scores for both chatbots.Scores for both chatbots for questions tagged as "soft-touch".

Questions patients and caregivers ask about their stroke
A British study found in 1990 that based on the inquiries made to Stroke Association Advice Centres over a 4-month period (13), after general information, the next most sought-after groups of advice were relating to home care support (9.4%), stroke clubs (9.2%), speech help (6.9%), and recovery/rehabilitation (6.0%).In 2013 when a follow-up comparison study was performed, after the "what is a stroke" and uncategorised questions, patients have started asking about specific medical enquiries (10.5%), but also still have concerns about therapy (8.4%), local services (5.2%), recovery timescales (4.9%), as well as benefits and financial assistance (4.4%).10 years later and in a different society, similar concerns hold true, with prognosis and recovery, social support, symptom management, and prevention/recurrence all ranking highly.Though the population was limited to our acute general hospital, it suggests consistency of content in the informationseeking behaviour of patients and caregivers which still remains to be well-addressed.Though not explored clearly in our study, the 1990 work found that elderly patients often required a younger counterpart to make inquiries on their behalf.This aligned with how many of our returned questionnaires were answered in the third person, and the role of the caregiver in advocating for their loved one and seeking better understanding of their care situation remains evidently relevant.

Role of the AI chatbot in patient-facing encounters
A side-by-side comparison suggests scores were fairly similar in general, though Google Bard did slightly better in the proportion of answers that were found to be relevant and satisfactorily safe.This result contrasts other teams' works in other specialties that had found ChatGPT (running a GPT-4 framework) produced more readable answers (16), and Google Bard produced more accurate answers (8).This, combined with our low inter-rater agreement, underlines the variability in acceptability of responses, which we expect could have many other factors such as training dataset, the randomness of the AI "black box" (17), and even the way questions are phrased by different patients.
It is instinctive that the accuracy of both chatbots' answers was often good-LLMs craft answers based on the likelihood of their individual words and phrases going together, and their training datasets would have included scientific texts, thus basic explanatory science would not be too difficult to repurpose and present.However, safety was a larger concern in our analyses, with both chatbots prone to "hallucination" (5), in the form of false support groups, resources, and hyperlinks.We observed that most answers were restrictively generic with limited reference to real-world entities, and the few forays into giving more information only resulted in website homepages rather than a specific subpage.As compared to a traditional search engine or a human-generated answer, an AI chatbot's response may be more readable and convincing, yet lack actually-reliable information specific to a patient's query.One such example is seen in a study comparing answers from ChatGPT to recommendations from clinical practice guidelines in decision-making for lumbosacral radicular pain, in which agreement between the LLM and the guidelines (taken as the ground truth) was slight, with a κ of only 0.13 (1).Questions with a complex nature, and those requiring contextual insight (such as in our "soft-touch" questions), may be beyond the capabilities of the test LLMs, though it is unlikely that this will remain a longstanding issue.Interestingly, Google Bard's fact-checking feature highlights phrases that its search engine is able to verify with a follow-up web search (18), demonstrating the potential of a joint chatbot-search engine to generate even more reliable answers.We Scores for both chatbots for questions tagged as "local-context".note with interest the introduction of a new class of LLM, retrievalaugmented language generators (RAG) (19), that augment the accuracy of their outputs through curated domain-specific datasets (for example, the PubMed database), which may be more applicable for the medical chatbot use-case scenario.One of the aspects of safety that we felt was crucial was safeguarding.Both chatbots did not read emotional undercurrents well.For example, a patient asking about support for low mood would have triggered a healthcare practitioner who was performing a text-based teleconsultation to evaluate them for risk of self-harm, and take the necessary steps to protect the patient.Both chatbots' answers were often closed generically without special attention paid to the patient's risk profile.More medical-trained LLMs, such as Med-PaLM (20), should include such features in future iterations to enhance the physician-machine partnership.

Ethical implementation
It is possible that we are not a long time away from patientfacing LLM chatbots that provide accuracy, safety, relevance, and readability in their answers, and are able to be deployed in both general as well as specialist medical and rehabilitation fields.Even now, the available range of AI tools with applications in healthcare education and research is staggering and everexpanding (3).Beyond the logistical, financial, and technological expertise required to operate a patient-facing LLM chatbot in healthcare settings, many ethical issues, some unknown, still abound (21).Ethical risks such as trust decay, data protection, and business logic driving healthcare, offer as many pitfalls as they do opportunities for innovation (21).It is hoped that medical ethics and professionalism may continue to evolve in tandem with the creep of digital solutions from other industries into healthcare (21), and efforts such as applying the fiveprinciple framework (with the addition of the AI-specific explicability principle) in riskier specialties such as psychiatry are welcome guideposts (22).

Limitations
Surveys were answered by patients and caregivers who wanted to answer or were able to answer, and hence this convenience sample may not be representative of our local population's true needs, which itself in turn is unlikely to represent the global demographic.Further work to investigate this however was not a study priority and should be explored separately.Also, we were limited by a lack of diversity of raters, with all 3 markers being medically-trained.Future assessments would do well to include patients and their caregivers, as they are the end-users of these platforms and directly affected by their quality.The poor interrater reliability between the 3 markers weakens the strength of our conclusions, but may also hint at unrevealed trends in the acceptability of LLM-generated answers depending on the level of familiarity and comfort with AI among different professional user groups.A fourth limitation is the use of more outdated LLM frameworks over more recent or specific ones such as GPT-4 and Med-PaLM.We chose ChatGPT and Google Bard (now rebranded as Google Gemini) as these were freely-accessible and more wellknown, with a higher likelihood of exposure to a lay patient.

Conclusion
We have explored the role of 2 freely-available, well-known AI chatbots, ChatGPT and Google Bard, in the context of responding to questions and concerns posed by patients with stroke and their caregivers.Both chatbots demonstrated good readability and were fairly accurate, though hallucination, generic responses, and lack of emotional sensitivity remain as barriers to widespread deployment.Our findings underline the need for more robust, domain-specific LLMs to be made publicly-accessible, yet also showcase their potential for employment in an important but oft-overlooked aspect of patient care.

TABLE 1
List of representative questions after compiling patient/caregiver responses.Can you give me an idea of how long it will take for me to recover after my stroke?I've been putting in effort with exercises and medications, so is there a chance I can fully recover?" Neo et al. 10.3389/fdgth.2024.1395501