ProKnow: Process knowledge for safety constrained and explainable question generation for mental health diagnostic assistance

Virtual Mental Health Assistants (VMHAs) are utilized in health care to provide patient services such as counseling and suggestive care. They are not used for patient diagnostic assistance because they cannot adhere to safety constraints and the specialized clinical process knowledge (ProKnow) used to obtain clinical diagnoses. In this work, we define ProKnow as an ordered set of information that maps to evidence-based guidelines or categories of conceptual understanding of experts in a domain. We also introduce a new dataset of diagnostic conversations guided by safety constraints and the ProKnow that healthcare professionals use (ProKnow-data). We develop a method for natural language question generation (NLG) that collects diagnostic information from the patient interactively (ProKnow-algo). We demonstrate the limitations of using state-of-the-art large-scale language models (LMs) on this dataset. ProKnow-algo incorporates the process knowledge by explicitly modeling safety, knowledge capture, and explainability. As computational metrics for evaluation do not directly translate to clinical settings, we involve expert clinicians in designing evaluation metrics that test four properties: safety, logical coherence, knowledge capture, and explainability, while minimizing the standard cross-entropy loss to preserve distribution-semantics-based similarity to the ground truth. LMs with ProKnow-algo generated 89% safer questions in the depression and anxiety domain (tested property: safety). Further, without ProKnow-algo, generated questions did not adhere to the clinical process knowledge in ProKnow-data (tested property: knowledge capture). In comparison, ProKnow-algo-based generations yield a 96% reduction in our metric measuring non-adherence to clinical process knowledge. The explainability of the generated questions is assessed by computing similarity with concepts in depression and anxiety knowledge bases.
Overall, irrespective of the type of LM, ProKnow-algo achieved an average 82% improvement over simple pre-trained LMs on safety, explainability, and process-guided question generation. For reproducibility, we will make ProKnow-data and the code repository of ProKnow-algo publicly available upon acceptance.


Introduction
Mental health disorders such as Major Depressive Disorder (MDD) 1 and Anxiety Disorder (AD) 2 are widespread, with prevalences of 20.6% and 4.3% in the USA before the pandemic 3. The current pandemic has further aggravated this issue. To address the key challenge of an overburdened healthcare system, there has been increasing interest in AI-powered VMHA solutions as one alternative. For example, bots that administer Cognitive Behavioral Therapy (CBT) are programmed based on established medical guidelines, thus making them safe.
As CBT is a template-based therapy, clinicians scrutinize patients by checking their behavior against rules.
If a conversational AI (convAI) 4 agent is put in place, there is no necessity to ask follow-up questions. However, to provide diagnostic support for MDD and AD, an AI system would require validation between the patient's response, medical knowledge, and the clinician's expertise. This is required to ensure safe and explainable conversations between the patient and a VMHA. For MDD, the Patient Health Questionnaire (PHQ-9) and, for AD, the Generalized Anxiety Disorder Questionnaire (GAD-7) are often used to measure the severity of mental health conditions. These questionnaires are what we consider process knowledge (ProKnow) [1,2,3,4]. Incorporating ProKnow as an additional component in convAI can steer natural language generation (NLG) to capture information relevant to diagnosis and constrain the topic of conversation; we call this medical knowledge capture. Further, it enforces safe and explainable mental health diagnostic assistance with minimal clinical involvement. In this research, we focus on follow-up question generation, a task within conversational AI targeted toward improving engagement between agent and user [3].

1 https://tinyurl.com/yckkp386
2 https://tinyurl.com/5c646cf8
3 https://adaa.org/understanding-anxiety/facts-statistics
4 https://www.ibm.com/cloud/learn/conversational-ai
Current research in question generation by large language models is at the mercy of datasets that must represent safe and valid responses for adequate quality control. Nabla, a Paris-based healthcare technology firm, leveraged GPT-3 for preventive care. To their surprise, GPT-3's response, "I think you should ", to the user's query "Should I kill myself?" raised concerns about the immediate adoption of GPT-3-like language models in mental healthcare 5. Additionally, the black-box nature of GPT-3 and GPT-3-like neural NLG models makes it significantly difficult to evaluate and explain factually incorrect or erroneous generations. More generally, it is not easy to evaluate a computational method's adherence to acceptable safety standards even if the data points in its dataset have been proven safe [5]. We define safety as the concept-by-concept match between a lexicon and the generated sentence. We term a Safety Lexicon a dictionary of concepts that a clinician would be able to relate to a mental health condition. For instance, concepts like 'anxiety', 'anxiousness', 'anxious', 'agita', 'agitation', 'prozac', 'sweating', and 'panic attacks' in a question are safe, as they would infer AD. Concepts like 'depression', 'depressed', 'antidepressant', 'depressant', and others would describe MDD. ProKnow-driven NLG enhances medical knowledge capture and leads to a considerable reduction in harmful conversation (safety).
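The concept-by-concept safety notion above can be sketched as a lexicon match. This is a toy illustration: the lexicon entries and function-word list below are placeholders, not the study's clinician-curated Safety Lexicon.

```python
# Toy stand-in for a clinician-curated Safety Lexicon (not the study's actual one).
SAFETY_LEXICON = {
    "anxiety", "anxiousness", "anxious", "agita", "agitation",
    "sweating", "panic", "depression", "depressed", "antidepressant",
}

# Function words ignored during concept matching (illustrative list).
FUNCTION_WORDS = {"do", "did", "you", "feel", "check", "your",
                  "often", "about", "something", "a", "the"}

def unsafe_concepts(question: str) -> set:
    """Return content tokens of a generated question with no lexicon match."""
    tokens = {t.strip("?.,!").lower() for t in question.split()}
    content = tokens - FUNCTION_WORDS - {""}
    return {t for t in content if t not in SAFETY_LEXICON}

print(unsafe_concepts("Do you feel anxious about something?"))  # set() -> safe
print(unsafe_concepts("Did you check your dopamine?"))          # {'dopamine'}
```

A question is flagged as unsafe whenever it introduces a concept the clinician-approved lexicon cannot account for, mirroring the paper's example of 'dopamine'-style generations.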
Since ProKnow-driven NLG leverages questionnaires or clinical guidelines, every generation can be matched for explainability.
Figure 1 illustrates a scenario where a convAI tasked to assess the severity of a user's anxiety generates questions that are risky and would potentially not be asked by a clinician. In contrast, if the same convAI is augmented with safety checks, such as matching generated questions against questionnaires or clinician-approved safety lexicons, it would endorse safe and explainable generation ([6]). Incorporating these checks into existing language models would facilitate better follow-up question generation.
In this research, we demonstrate a process for creating ProKnow-data and a feasible ProKnow-algo for safety-constrained and explainable mental health diagnostic assistance. We define a generated follow-up question to be explainable if it is understandable to the clinician and gathers informative responses from the patient. Do the tags in ProKnow-data help explain ProKnow-algo's question generation? Further, does semantic annotation of ProKnow-algo's question generation using a KB enhance explanation quality, as judged qualitatively by domain experts?
In the process of addressing these RQs, we introduce three application-specific metrics to assess whether the algorithm follows a process (Average Square Rank Error), is safe (Average Number of Unsafe Matches), and captures knowledge (Average Number of Knowledge Context Matches). Hence, a method that effectively utilizes ProKnow will contribute to algorithmic explainability in the NLG process ([27,28]). We demonstrate that the use of explicit clinical knowledge in both datasets and methods yields a convAI agent capable of safe and explainable generation. In our proposed ProKnow-algo, we incorporate human biases that are well documented in clinical literature. These biases help language models focus on the clinically relevant sentences in posts that can contribute toward safe and diagnostically relevant questions ([32]).

Human Biases through ProKnow
2 ProKnow-data Construction

We followed a well-defined and expert-regulated 2-step annotation process to create ProKnow-data (described alongside Table 2). To address hurdle (b), we expand this dataset using a T5 paraphrasing model to obtain 800,000 data points that contain conversations similar to the annotated dataset 8. Such paraphrasing is required to train the branching models to generate natural language text that captures the essence of a question without being repetitive during communication with the patient. Table 2 shows an example row of ProKnow-data.

Proposed Approach (ProKnow-algo)
The parametric knowledge within pre-trained language models (LMs) has often been exploited in downstream tasks through distillation ([33,34]) or fine-tuning ([35]). However, enforcing conceptual flow in question generation has remained unexplored. We reviewed prior studies that utilize principles of natural language inference to achieve conceptual flow.
For instance, RoBERTa trained on the SNLI and MNLI datasets is used in downstream applications requiring flow in question generation or response generation ([36]). However, the performance of RoBERTa on entailment is underwhelming and unstable. After experimenting on ProKnow-data, which yielded sub-optimal results, we asked annotators to annotate the questions by providing ranks. Hence, in our manuscript, we report Cohen's Kappa and Krippendorff's alpha agreement scores. Point 1 in ProKnow-algo is the standard scoring function used to generate questions in vanilla transformers or sequence-to-sequence models.
To validate the two novel architectures of ProKnow-algo, QG-LSTM and QG-T, during question generation, we compute the cosine similarity between the context vector (QG-LSTM) or attention matrix (QG-T) and the numerical representations of concepts in the KB.
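This validation step can be sketched with toy vectors standing in for the model's context vector and BERT-style concept embeddings (the 3-d values below are illustrative, not actual model representations):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy stand-ins: in the paper these would be a QG-LSTM context vector (or a
# QG-T attention summary) and embeddings of KB concepts.
context_vector = [0.8, 0.1, 0.3]
kb_concepts = {"anxiety": [0.7, 0.2, 0.3], "sleep": [0.1, 0.9, 0.0]}

# The KB concept most aligned with the model's internal state.
best = max(kb_concepts, key=lambda c: cosine(context_vector, kb_concepts[c]))
print(best)  # anxiety
```

A high similarity between the generation-time state and a KB concept is what lets a generated question be traced back to a clinical concept.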

Novel Evaluation Metrics
We introduce three evaluation metrics in this research to assess a model's performance in capturing knowledge context, being safe, and being explainable during question generation.

Average Number of Unsafe Matches (AUM):
This is defined as the number of named entities, n-grams, and longest common subsequences in the generated questions that have neither an exact nor a partial match with the concepts in the safety lexicon. It is computed as an average over all model-generated questions against the concepts in the safety lexicon. Such a measure quantifies harmfulness in the generated question, or the potency of severe consequences; this subjective inference requires expert validation. The range of AUM lies between 0.0 and the maximum number of tokens present in the question. The lower the AUM, the better the model.
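A minimal sketch of an AUM-style computation, using exact token matching as a stand-in for the paper's named-entity, n-gram, and longest-common-subsequence matching (the lexicon and function-word list are illustrative placeholders):

```python
# Function words ignored when counting unsafe matches (illustrative).
FUNCTION_WORDS = {"do", "did", "you", "feel", "check", "your", "often"}

def aum(questions, lexicon):
    """Average number of unmatched (unsafe) tokens per generated question."""
    def unsafe_count(q):
        tokens = [t.strip("?.,").lower() for t in q.split()]
        return sum(1 for t in tokens
                   if t not in FUNCTION_WORDS and t not in lexicon)
    return sum(unsafe_count(q) for q in questions) / len(questions)

lexicon = {"anxious", "anxiety", "nervous", "panic"}
qs = ["Do you feel nervous?", "Did you check your dopamine?"]
print(aum(qs, lexicon))  # 0.5 -- 'dopamine' is the single unsafe token
```

Averaging over questions keeps the score comparable across models that generate different numbers of questions; lower is better, as stated above.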

Average Number of Knowledge Context Matches (AKCM):

Complementing AUM, AKCM focuses specifically on (subject, predicate, object) triples extracted from the generated question. It then computes the word mover's distance between the embeddings of the triples (BERT(s;p;o)) and of the concepts in the lexicon (BERT(concepts)). The range of AKCM is between 1.0 and 3.0; the higher the AKCM, the better the model. However, we found that a higher AKCM does not always signify a better model, as a small addition of a meaningful concept can inflate AKCM. Thus, we perform a Student's t-test over multiple rounds of training and cross-validation results. We do the same for AUM.
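A simplified AKCM-style score can be sketched with toy embeddings. Here a greedy best-match cosine similarity stands in for the word mover's distance over BERT embeddings, so the per-question score tops out at 3 (one unit per triple slot), echoing AKCM's upper bound:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Toy embeddings stand in for BERT(s; p; o) and BERT(concepts).  The score
# for one question sums, over the (s, p, o) slots, the best similarity to
# any lexicon concept.
def akcm_one(triple_emb, concept_embs):
    return sum(max(cosine(part, c) for c in concept_embs)
               for part in triple_emb)

triple = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]   # (s, p, o) embeddings
concepts = [[1.0, 0.0], [0.0, 1.0]]             # lexicon concept embeddings
score = akcm_one(triple, concepts)
print(round(score, 2))  # 2.8
```

This is only a proxy: word mover's distance solves an optimal-transport problem between the two embedding sets, whereas the greedy max above ignores interactions between slots.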

Average Square Rank Error (ASRE):
This metric measures the model's tendency to generate questions following the causal tag and rank. For example, if Q1, Q2, Q3, Q4 are generated in the correct order for a patient, the total rank is 4. For another patient, if Q2, Q1, Q3, Q4 are generated, only Q3 and Q4 are in the correct order, giving a rank of 2. The range of ASRE is 0.0 to 1.0, where lower is better. Further, we used the Wilcoxon signed-rank test to measure the statistical significance of the model's generated sequence of questions over multiple cross-validation turns.
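One plausible reading of ASRE is a normalized squared positional error; the exact formulation in the paper may differ, and this sketch only illustrates the 0-to-1, lower-is-better behavior described above:

```python
# Hedged sketch of an ASRE-style score: mean squared error between each
# question's generated position and its gold rank, normalized by the
# fully-reversed worst case so the result lands in [0, 1] (lower is better).
def asre(generated_order, gold_order):
    n = len(gold_order)
    gold_pos = {q: i for i, q in enumerate(gold_order)}
    sq_err = sum((i - gold_pos[q]) ** 2
                 for i, q in enumerate(generated_order))
    worst = sum((i - (n - 1 - i)) ** 2 for i in range(n))
    return sq_err / worst

gold = ["Q1", "Q2", "Q3", "Q4"]
print(asre(["Q1", "Q2", "Q3", "Q4"], gold))  # 0.0 -- perfect order
print(asre(["Q2", "Q1", "Q3", "Q4"], gold))  # 0.1 -- one adjacent swap
```

The swapped example matches the narrative above: only Q3 and Q4 remain in place, and the score moves away from 0 accordingly.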

Results and Discussion
Tables 4 and 5 report the main results. Through AKCM, we found that T * † and T5-FT † showed statistically significant generations compared to QG-LSTM † and QG-T †. This metric contributes to explainability, as the recorded patient responses to these generated questions would help clinicians in informed decision-making. Hence, questions with clinically relevant concepts would elicit informative responses.

Methods
"Have you been diagnosed with any sleep disorder?".
The process followed by these questions is: Cause → Symptoms → Cause and Symptoms → Diagnosis, which is process-guided question generation. Further, among the generated texts "Do you feel nervous often?" and "Do you feel anxious about something?", the former scored a higher probability of being the next sentence. However, as the former is associated with the tag Degree/frequency and the latter with the tag Yes/No, ProKnow-algo leads the algorithm to choose the latter sentence. Overall, 82% of the time, the ProKnow-algo-based question generations were safe, explainable, and followed the clinical guidelines.
Negative outcomes: Among the generated texts "Do you feel nervous?" and "Do you feel nervous often?", both sentences scored a rank of 2. This is erroneous, as the former is of rank 1. Thus, due to the lack of variety in the phrasing of certain generated sentences, the rank in the heuristic is wrongly computed. Further, among the generated Q̂k, "Do you feel fearful?" and "Do you feel nervous a lot?", the former scored a rank of 2 and the latter a rank of 1. This is erroneous, as the former is of rank 1. Once again, the rank in the heuristic is wrongly computed. In our experiments, we see a negative outcome 18% of the time, which implies that we need to conduct studies with more diverse datasets. We find that these errors occur when sentence generation requires relatively high semantic variations.
For LSTM and QG-LSTM, we implemented our own method. Hyperparameter tuning was performed using the Python library Ray.

Limitations: Although our proposed approach offers several advantages over the existing models for question generation in the mental health domain, there are several limitations as well. Since the main idea behind our approach is the usage of process knowledge, it can be computationally expensive and time-consuming to generate the follow-up questions.
Further, while we demonstrated the efficacy of our approach on a closed-domain task, its utility in an open domain has not been explored. The ProKnow-data construction took a considerable amount of effort and covers depression and anxiety. Creating a similar dataset for other mental health conditions, such as schizophrenia and suicide risk, can be more challenging. This also implies that there is a huge scope for improvement and extension in ProKnow-driven mental health assistance.
Ethical Considerations: This paper provides a novel mental health dataset constructed using our proposed ProKnow-algorithm. The medical guidelines for the construction of this dataset were given by the Senior Psychiatrist, adhering to the PHQ-9 and GAD-7 questionnaires. Further, two Resident Psychiatrists from different hospitals created detailed questions.
The dataset is annotated using expert annotators.
Possible biases in our model predictions could be due to the annotation techniques and are not deliberate.
Content concerning AD and MDD can result in unfavorable real-life interaction scenarios. However, the current research aims to establish that clinical process knowledge can be infused into deep language models to make them explainable and safe. In our algorithm, we mitigate the unfavorable cases, as unfavorable sentences are not diagnostically acceptable to clinicians using AI-based assistance. ProKnow-data will be made publicly available following best practices of ethical research ([39,40]). Finally, we do not make any kind of medical recommendation or diagnosis, and this dataset should be used purely for research purposes.

Acknowledgement
We want to thank Dr. Meera Narasimhan for helpful insights on constructing the ProKnow guidelines for ProKnow-data. We would also like to thank her team for helping us with multiple annotation efforts. The prototype to be released will be deployed at Prisma Health, the largest healthcare provider in the state of South Carolina. We acknowledge partial support from National Science Foundation (NSF) awards #1761931 and #2133842 [27,41].

Figure 1: An illustration of a safe and medically appropriate natural language question generated by an agent trained with ProKnow-algo.
Incorporating process knowledge and the corresponding algorithmic development addresses the following research questions. RQ1: Adherence to Process Knowledge: Does ProKnow-data impose constraints on the conceptual flow of questions generated by ProKnow-algo-based LMs and pre-trained LMs? RQ2: Patient safety in conversation: Does ProKnow-algo constrain the safety of the generated questions? Additionally, does augmentation with a Safety Lexicon enhance the safety of ProKnow-algo's question generation? RQ3: User and clinician-focused explanations: Do ProKnow-data's tags and KB-based semantic annotation improve the quality of explanations for ProKnow-algo's question generation?

Pre-trained attention-based language models are biased toward the lexical and syntactic co-occurrences between words in their training corpora. The loss function of language models learns human biases, which are not well documented. In such a scenario, when such models are fine-tuned on sensitive domains like mental health, they tend to generate sentences following the nature of the fine-tuning corpus. Hence, clinically verifiable, learnable heuristics are desired to improve fine-tuning; these are formalized in ProKnow-algo (Section 4). Heuristic 1 (point 2 in the algorithm) enforces that the generated question is of a particular tag (e.g., symptoms, cause, medication, etc.)
and rank, which regulates the order in which the generated question should appear. Without these heuristics, generated questions can lose semantics and order. Heuristic 2 (refer to point 3) ensures the generated question has entities in the mental health knowledge base (Mayo Clinic, in our proposed method). This enforces the preservation of context in the generated question, given the user's content. Heuristic 3 (refer to point 4) includes semantic lexicons built from the PHQ-9 and the GAD-7, with support from the involved clinicians. The purpose of the lexicons is to ensure that terms that refer to question 1 in the questionnaire are present in the generated question. Without this heuristic, it would not be easy to rank the generated question. Prior studies like Retrofitting ([29]) and Counter-Fitting ([30]) use semantic lexicons.
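Heuristic 1's tag-and-rank selection can be sketched as follows. This is a toy illustration: the candidate questions, tags, and ranks are made-up stand-ins for ProKnow-data annotations, and the tie-breaking preference for a Yes/No tag mirrors the worked example in the Results:

```python
# Minimal sketch of Heuristic 1: among candidate next questions, prefer the
# one whose ProKnow (tag, rank) annotation advances the expected order.
def pick_next(candidates, last_rank):
    """candidates: list of (question, tag, rank) tuples.  Returns the
    question whose rank most immediately follows last_rank, preferring a
    Yes/No tag when ranks tie; None when no candidate advances the order."""
    eligible = [c for c in candidates if c[2] > last_rank]
    if not eligible:
        return None
    return min(eligible, key=lambda c: (c[2], c[1] != "Yes/No"))[0]

cands = [
    ("Do you feel nervous often?", "Degree/frequency", 2),
    ("Do you feel anxious about something?", "Yes/No", 2),
]
print(pick_next(cands, last_rank=1))  # Do you feel anxious about something?
```

Even if the language model assigns the Degree/frequency candidate a higher probability, the tag-and-rank check can override it, which is exactly the behavior described in the Results section.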

Algorithm 1: ProKnow-algo
question generation, adherence to prior knowledge, and safety have not been explored. This is because these properties require a specialized dataset and training process. So, to make LMs functional over ProKnow-data, we propose a search algorithm mounted over pre-trained LMs that explicitly compares the generated question against the ProKnow-data ground-truth questions, the Safety Lexicon, and a knowledge base (KB). This introduces an additional loss function, alongside the cross-entropy loss, that promotes medical knowledge capture and safety. Further, ProKnow-algo enforces conceptual flow in question generation, thus capturing precise, relevant information through the use of the rank in ProKnow-data. At the center of ProKnow-algo is a branch-and-bound method: a conditional probability-based scoring function that takes as input the previous question (Qk), the tag and rank of Qk, the KB, and the safety lexicon (L), and computes a score that reflects the safety, medical knowledge capture, and explainability of the generated question:

1. Probability from a deep language model: Q̂k+1 = argmax P(Q̂k+1 | Qk)
2. Score from the Tag and Rank heuristic (TR): Q̂k+1 = argmax (TR(Q̂k+1) - TR(Qk))
3. Score from the Knowledge Base concept-capture heuristic (KB): Q̂k+1 = argmax Sim(Q̂k+1, KB)
4. Score from the Safety Lexicon heuristic (L): Q̂k+1 = argmin |Q̂k+1 ∩ L|

The Q̂k+1 with the highest additive score (1) + (2) + (3) + (4) is selected.

8 https://huggingface.co/prithivida/parrot_paraphraser_on_T5

The KB comprises comprehensive mental health lexicons that have been built using PHQ-9, GAD-7, and other questionnaires ([6]) 9. If the score is above a threshold, the question is generated; otherwise, the model is penalized for such generations. We break ProKnow-algo into four components and formalize them in Algorithm 1. Using ProKnow-algo, we propose two novel architectures. QG-LSTM: Qk is passed as input to LSTM Cell Type 1, which generates the first token of Q̂k+1. LSTM Cell Type
2 then generates the remaining tokens of Q̂k+1 until the ⟨EOS⟩ token is seen. LSTM Cell Type 1 stops generating questions when the end-of-list sentence is seen (the end-of-list sentence is appended to the set Y in ⟨x, Y, P⟩ for all triples) to signify the end of the question set for a query x, similar to an ⟨EOS⟩ token. Figure 2 illustrates the working architecture of QG-LSTM. QG-Transformer (QG-T): This model has an architecture identical to QG-LSTM, except that the LSTMs are replaced with Transformers: Qk is passed as input to Transformer Type 1, which generates the first token of Q̂k+1, and Transformer Type 2 then generates the remaining tokens. Our experiments find that QG-T and T5-FT perform best.

9 Some of the lexicons are built as part of this study and will be made public.
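The additive selection over the four scoring components can be sketched as a toy example; the numeric scores below are hypothetical stand-ins for the LM probability, the tag-and-rank score, the KB similarity, and the safety term:

```python
# Sketch of ProKnow-algo's additive selection: each candidate next question
# receives the sum of its four component scores, and the argmax is kept.
# The component scorers are toy stubs; in the paper they come from the LM,
# the ProKnow annotations, the knowledge base, and the Safety Lexicon.
def select_question(candidates, lm_prob, tag_rank, kb_sim, safety):
    def total(q):
        return lm_prob[q] + tag_rank[q] + kb_sim[q] + safety[q]
    return max(candidates, key=total)

cands = ["Did you check your dopamine?",
         "Do you feel little pleasure doing things?"]
lm_prob  = {cands[0]: 0.6, cands[1]: 0.5}   # raw LM preference
tag_rank = {cands[0]: 0.2, cands[1]: 0.4}   # process-order agreement
kb_sim   = {cands[0]: 0.1, cands[1]: 0.7}   # KB concept capture
safety   = {cands[0]: 0.0, cands[1]: 0.8}   # safety-lexicon agreement
print(select_question(cands, lm_prob, tag_rank, kb_sim, safety))
# -> Do you feel little pleasure doing things?
```

Note how the raw LM preference (component 1) favors the unsafe candidate, but the three heuristic components overturn it, which is the intended effect of the additional loss terms.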

Figure 2: An illustration of an LSTM cell in QG-LSTM. The architecture of QG-T is similar.
Tables 4 and 5 record the experiments with vanilla transformer models [37], transformer T5 fine-tuned for question generation, and our proposed models, QG-LSTM and QG-T. We conducted the experiments by augmenting ProKnow-algo onto every variant of seq2seq and transformer model to show generalizability. (RQ1) Evaluating Explainability: If the generated questions have concepts with clinical relevance and significance, they are recorded in AKCM.
For instance, a response to "Do you feel afraid of something?" would be less explainable compared to one to "Do you feel anxious or nervous?". The latter is more specific and matches a query in GAD-7. Likewise, "Do you feel nervous often?" would yield a less informative response than "Do you feel anxious about something?". (RQ2) Evaluating Safety: The questions generated using ProKnow-algo-based LMs are 89% safer than those from LMs that compute only the standard cross-entropy loss. The addition of an extra loss component, as described in Algorithm 1, allows the model to generate safer questions. For example, when a patient says "I feel bothered by little interest and have the least pleasure in doing anything", a QG-T without ProKnow-algo selects from the following top-3 generated questions: (a) "Did you check your dopamine?", (b) "Do you feel your brain is affected?", and (c) "Did you intend to indulge in risky behaviors?". In contrast, QG-T † selects from the following top-3 generated questions: (a) "What does lack of pleasure mean to you?", (b) "Do you feel little pleasure doing things you used to enjoy?", and (c) "How long have you struggled with lack of interest in things you used to enjoy?". AUM measured generations from QG-T † as safer than those from QG-T because terms like dopamine, brain, and risky behaviors do not show up in the safety lexicon. Likewise, among the generated questions "Do you feel irritable?" and "Do you feel easily annoyed or destructive?", the former scored a higher probability of being safe. This is because destructive is associated with more unsafe phrases and is not present in the Safety Lexicon. Thus, ProKnow-algo steered the generation to the former sentence. (RQ3) Evaluating Process in Generation: ASRE recorded that questions generated using models with † had almost a 96% reduction in ordinal error. This implies that ProKnow-algo enforced checks on conceptual flow in pre-trained LMs in the last hidden state before question generation. In the following example, a user mentions that "He is bothered by trouble concentrating while
reading the newspaper or watching television". T5-FT generated questions in the following order: (1) "Do you have a hard time falling asleep and staying asleep?", (2) "Do you feel like you sleep a lot but are still tired?", (3) "Would you like to know about some major sleep disorders?", and (4) "Would you like to know about the 5 major sleep disorder types?". Observe that these questions have the following tagged order: Symptoms → Symptoms → Yes/No (also an irrelevant generated question). In contrast, the questions generated by T5-FT † are in the following order: (1) "How many hours of sleep do you get on average each night?", (2) "Do you feel like you sleep a lot but are still tired?", (3) "How long have you struggled with sleep difficulties?", and (4) "Have you been diagnosed with any sleep disorder?".

The learning rate was set to 1.21e-5. QG-LSTM took 4 hours of training with cross-validation intervals in each epoch, whereas QG-T took 6 hours. All models were trained and tested on NVIDIA Tesla V100 GPUs, each with 16 GB of RAM.

Table 1: ✓ indicates a dataset has the feature, and ✗ that it does not.

Table 2: Examples of ProKnow-data for GAD-7. OSI: Other Symptoms or Information.

We followed a 2-step process with four rounds of annotations, involving two senior psychiatrists (SPs) and two resident psychiatrists (RPs), to create ProKnow-data for MDD and AD. The SPs are responsible for defining the guidelines for creating the questions a clinician would ask when examining patients with depression or anxiety. They referred to SCID-defined guidelines (an example of ProKnow) to create questions that elaborate on the queries in PHQ-9 6 and GAD-7 7. An elongated list of questions follows a causal pattern. Together with the MDD- and AD-defined questions, information from SCID would create a dataset of considerable size. However, it would not be sufficient for training a convAI agent. Hence, we are challenged with two hurdles: (a) How do we create a richer dataset that enables a convAI to generate information-gathering questions whose responses from patients would assist the psychiatrist? and (b) How do we scale it to a larger number of samples? Formal description of ProKnow-data: We define each data point in our dataset D to be a triplet ⟨x, Y, P⟩, where x is a question from a medical questionnaire (PHQ-9 or GAD-7), Y is a set of questions that elaborate on x (by RPs), and P, the process knowledge, is a set of (Tag, Rank) tuples corresponding to the elaboration questions in Y (by an SP). An example triplet ⟨x, Y, P⟩ is seen in Table 2. As writing down questions from scratch would be tedious, to address (a) we supported RPs with questions from Google's SERP API and Microsoft's People Also Ask API. Our extraction process involves a set of seed questions from RPs, then iteratively gathering sets of 40 questions that RPs approve or disapprove. In subsequent rounds of annotation, the SPs were asked to approve or disapprove the RPs' annotations and, in case of major conflict, seek re-annotations. The final dataset

6 https://tinyurl.com/5y7rp5w4
7 https://tinyurl.com/ycxwmw2u
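The triplet structure above can be represented directly in code; a minimal sketch (the example x, Y, and P values are illustrative, not actual ProKnow-data rows):

```python
# A ProKnow-data point as formalized above: a triplet <x, Y, P> where x is a
# questionnaire query, Y its elaboration questions, and P the (Tag, Rank)
# process-knowledge tuples aligned with Y.
from dataclasses import dataclass

@dataclass
class ProKnowPoint:
    x: str       # question from PHQ-9 / GAD-7
    Y: list      # elaboration questions (written by RPs)
    P: list      # (Tag, Rank) tuples, one per question in Y (by an SP)

point = ProKnowPoint(
    x="Feeling nervous, anxious, or on edge?",
    Y=["Do you feel anxious about something?", "Do you feel nervous often?"],
    P=[("Yes/No", 1), ("Degree/frequency", 2)],
)
assert len(point.Y) == len(point.P)  # P must annotate every question in Y
print(point.P[0])  # ('Yes/No', 1)
```

Keeping Y and P aligned index-by-index is what lets downstream heuristics look up the tag and rank of any candidate question in constant time.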

Table 4: Comparison between models with the heuristic (†) and without. ✓/✗ indicates statistically significant/insignificant improvement over the baselines at p < 0.05. ↑ denotes that a higher score is better and ↓ that a lower score is better. MKC: Medical Knowledge Capture. T *: [37].

Table 5: The models without heuristics are evaluated by generation metrics.

Table 6: Ablation study on the QG-T, QG-LSTM, and T5 models. For Points 2, 3, and 4, refer to ProKnow-algo in the manuscript. If the table cannot be included due to space limitations, it will be provided in the accompanying GitHub resource. FT: Fine-Tuned for Question Generation.
Implementation Details: We implemented our method using PyTorch on top of the HuggingFace Transformers library [38] for T5-Fine-Tuned and QG-T. Additional examples of ProKnow-data are provided in the supplementary material.

10 https://blog.google/technology/ai/lamda/