Differentiating depression using facial expressions in a virtual avatar communication system

Depression has a major effect on quality of life. Thus, identifying an effective way to detect depression is important in the field of human-machine interaction. To examine whether a combination of a virtual avatar communication system and facial expression monitoring can potentially classify people as being with or without depression, this study pursued three research aims: 1) to understand the effect of different types of interviewers, such as human and virtual avatar interviewers, on people with depression symptoms, 2) to clarify the effect of neutral conversation topics on facial expressions and emotions in people with depression symptoms, and 3) to compare verbal and non-verbal information between people with and without depression. Twenty-seven participants (fifteen in the control group and twelve in the depression symptoms group) were recruited. They were asked to talk to a virtual avatar interviewer and a human interviewer on both neutral and negative conversation topics and to complete the PANAS; meanwhile, facial expressions were recorded by a web camera. Facial expressions were analyzed both manually and automatically. In the manual analysis, three annotators counted gaze directions and reacting behaviors. Automatic facial expression detection was conducted using OpenFace. The PANAS results suggested that there was no significant difference between the types of interviewers. Furthermore, in the control group, the frequency of look-downward was larger in negative conversation topics than in neutral conversation topics. The intensity of Dimpler was larger in the control group than in the depression symptoms group. Moreover, the intensity of Chin Raiser was larger in neutral conversation topics than in negative conversation topics in the depression symptoms group. However, in the control group, there was no significant difference between the types of conversation topics.
In conclusion, 1) there was no significant difference between human and virtual avatar interviewers in emotions, facial expressions, and eye gaze patterns, 2) neutral conversation topics induced less negative emotion in both the control and depression symptoms groups, and 3) different facial expression patterns between people with and without depression were observed in the virtual avatar communication system.


Introduction
Mental or mood disorders (1) are among the most common health problems negatively affecting the quality of life as well as the longevity of the global population; these problems include depression, schizophrenia, anxiety disorder, and bipolar disorder (2). Depression has a significant impact on the overall quality of life not only for younger adults but also for older adults (3).
Surveys using questionnaires (4-6) and clinical interviews (7,8) have been two of the main methods used to diagnose depression for several decades; however, recent studies have proposed many alternative approaches for detecting mental illnesses using social network data (9,10), brain activity (11,12), or behavioral patterns (13,14). Furthermore, support systems for patients with depression that use robots (15,16) or virtual avatars (17,18) instead of human support have been highlighted in recent scientific studies.
Human-machine interaction technology, such as virtual avatar or robot communication systems, has been reported as one of the main methods of intervention for people with depression for many decades (16,18). Pinto et al. (18) proposed a virtual avatar-based self-management intervention system, eSMART-MH, which provides a virtual health coach with whom patients with depression can practice communication. These patients reported a reduction in depressive symptoms after using eSMART-MH for three months (18). Hung et al. (16) reviewed the effectiveness of social robots in mental care, reporting that they reduce negative emotion, improve social engagement, and promote positive moods in patients with depression.
When classifying people as being with or without depression, not only verbal but also non-verbal information collected by cameras, such as facial expressions, eye movements, and behavioral patterns, has received increasing attention in recent decades. It was reported that speech styles, such as words or pause duration, are among the criteria used to identify people with depression symptoms (19,20). Cummins et al. (20) summarized that reduced pitch, pitch range, speaking intensity, and intonation, a slower speaking rate, and a lack of linguistic stress characterize the depressed speech style. On the other hand, Islam et al. (9) reported that social network data analysis by machine learning technology is one of the most effective approaches to detecting people with depression, highlighting three main factors that had an impact on detecting depression through social network data: linguistic style (e.g. adverbs, conjunctions, pronouns, verbs), emotional process (e.g. positive, negative, sad, anger, anxiety), and temporal process (present focus, past focus, and future focus). The accuracy of a machine learning model using each factor is greater than 60%; thus, comments posted on social networks have been established as one of the criteria used to detect depression (9). Furthermore, non-verbal information, especially facial expressions, can be effectively used to detect depression (3,21-24). People with depression display fewer positive facial expressions, such as smiling, and more negative facial expressions than people without depression whilst watching positive films; however, during negative films, people with depression make fewer negative facial expressions than people without depression (24). Girard et al. (21) developed a system to detect the severity of depression symptoms using the Facial Action Coding System (FACS) during a series of clinical interviews, which confirmed that patients with high-severity symptoms of depression exhibited more negative facial expressions, such as contempt, and smiled less.
Facial expressions or gaze patterns have been employed to understand brain activity or to detect mental conditions; on the other hand, the effects of different interviewers, or different conversation topics on gaze patterns, facial expressions, emotions, or other non-verbal information remain unclear.
Several studies have reported the effect of a robot or virtual avatar interviewer on human emotion, empathy, or action (25-29). People perceived a robot that exhibited a more empathetic attitude and verbal utterances as friendlier (27). In addition, Appel et al. (25) reported that congruence between robots' facial expressions and verbal information would lead to a more positive impression on users. A virtual avatar interviewer invokes as much stress as a human interviewer in various tasks (28). However, the effect of computer interaction on human emotion or action is still not clear in people with depression symptoms. Thus, the first research aim of this study is to understand the different effects that human and virtual avatar interviewers have on facial expressions and emotions in people with depression.
In past studies, clinical interviews or tasks and negative conversation topics, such as depression or military experiences, were primarily used for detecting mood or mental disorders, such as depression, post-traumatic stress disorder (PTSD), or attention deficit hyperactivity disorder (ADHD) (30-34). On the other hand, clinical interviews, including disclosure of feelings and problem-solving, induced more anxiety, depression, and behavioral fear than unrelated conversation topics (35), and it is still unclear whether neutral conversation topics are effective in differentiating people with or without depression symptoms. Thus, the second aim was to clarify the effect of neutral conversation topics (i.e., topics unrelated to clinical, military, or depression interviews) on facial expressions and emotions in people with depression.
Past studies have reported the effect of virtual avatars on people's emotions, rapport, or decision-making (36-38). Furthermore, eye gaze patterns and facial expressions are two of the main criteria used to identify people's emotions. Chen et al. (39) highlighted that a downward gaze was perceived as a negative social signal and enhanced the startle response magnitude. For facial expressions, it was reported that negative emotions, such as disgust, fear, boredom, or sadness, were reflected in the intensity of the Lip Corner Depressor and Brow Lowerer (40-43), and positive emotions, such as joy, were reflected in the intensity of the Cheek Raiser and Lip Corner Puller (40). However, communication systems for people with depression are still under development, and it is still not clear whether non-verbal information, such as eye gaze patterns or facial expressions while interacting with a virtual avatar, is effective in differentiating people with depression symptoms. The final aim of this study was to compare verbal and non-verbal information between people with and without depression symptoms while talking to the virtual avatar about non-clinical interview topics.
In this experiment, twenty-seven participants (fifteen in the control group and twelve in the depression symptoms group) were asked to talk to each type of interviewer (human or virtual avatar) on each type of conversation topic (negative or neutral) through a monitor; meanwhile, eye movements, heart rate, facial expressions, and verbal and non-verbal information were recorded.

Materials and methods
This research was pre-registered in the Open Science Framework (OSF) (Registration DOI: 10.17605/OSF.IO/B9DNE). In the experiments, participants interacted with two types of interviewers through a monitor on two types of conversation topics, performing conversation tasks with each interviewer. This study was approved by the Ethics Committee of the University of Latvia in accordance with the Declaration of Helsinki (approval number: 30-47/18). This research focused on the effect of the types of interviewers (human or virtual avatar) and conversation topics (neutral or negative) on facial expressions, eye gaze patterns, and verbal information in both types of participants (people with or without depression); thus, this paper analyzes the facial expression and verbal data collected by a web camera.

Positive and Negative Affect Schedule
The Positive and Negative Affect Schedule (PANAS) consists of twenty items that measure both positive and negative affect (44), and each item is rated from 1 (not at all) to 5 (very much). The reliability of this survey for measuring emotional effects has been reported in many different types of medical situations, such as rehabilitation or clinical interviews (45,46). Furthermore, the psychometric properties of the scale have been clarified in recent years in clinical samples with anxiety, depressive, and adjustment disorders (46). The effects of each experimental condition on participants' emotions were assessed by administering the Latvian version of PANAS (47) before and after each session.
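The scoring described above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: it assumes the item order of the standard English PANAS form (the Latvian version may order items differently), and the function names are hypothetical.

```python
from typing import Sequence

# ASSUMPTION: 1-based item numbers follow the standard English PANAS form.
# The Latvian version used in the study may order its items differently.
POSITIVE_ITEMS = (1, 3, 5, 9, 10, 12, 14, 16, 17, 19)
NEGATIVE_ITEMS = (2, 4, 6, 7, 8, 11, 13, 15, 18, 20)


def panas_scores(ratings: Sequence[int]) -> tuple[int, int]:
    """Return (positive affect, negative affect) sums for twenty 1-5 ratings."""
    if len(ratings) != 20:
        raise ValueError("PANAS requires exactly 20 item ratings")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("each item must be rated from 1 to 5")
    positive = sum(ratings[i - 1] for i in POSITIVE_ITEMS)
    negative = sum(ratings[i - 1] for i in NEGATIVE_ITEMS)
    return positive, negative


def panas_change(pre: Sequence[int], post: Sequence[int]) -> tuple[int, int]:
    """Return the (positive, negative) change scores, computed as post minus pre."""
    pre_pos, pre_neg = panas_scores(pre)
    post_pos, post_neg = panas_scores(post)
    return post_pos - pre_pos, post_neg - pre_neg
```

With ten items per subscale, each subscale score ranges from 10 to 50, and a negative change score means the affect decreased over the session.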

Patient Health Questionnaire-9
The Patient Health Questionnaire-9 (PHQ-9) consists of nine items, each scored from 0 (not at all) to 3 (nearly every day), and is used to screen for depression. Kroenke et al. (4) reported that a PHQ-9 score ≥10 has a specificity of 88% for major depression. Furthermore, cut-off scores between 8 and 11 show no significant differences in sensitivity and specificity (48). In this study, a score of 10, the most common cut-off score, was used. Participants filled in the PHQ-9 in Latvian (49) before starting the experiment.
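As a rough sketch of how the cut-off translates into group assignment, the snippet below sums the nine item scores and applies the threshold of 10 stated above. The function names and group labels are illustrative assumptions, not part of the study's software.

```python
from typing import Sequence

PHQ9_CUTOFF = 10  # cut-off score used in this study


def phq9_total(items: Sequence[int]) -> int:
    """Sum nine PHQ-9 item scores, each from 0 (not at all) to 3 (nearly every day)."""
    if len(items) != 9:
        raise ValueError("PHQ-9 requires exactly 9 item scores")
    if any(score not in (0, 1, 2, 3) for score in items):
        raise ValueError("each item must be scored from 0 to 3")
    return sum(items)


def classify_group(total: int, cutoff: int = PHQ9_CUTOFF) -> str:
    """Assign a participant to a group; the label strings are hypothetical."""
    return "depression symptoms" if total >= cutoff else "control"
```

For example, a participant who scores 1 on every item has a total of 9 and falls in the control group, whereas raising any single item by one point moves the total to the cut-off.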

Participants
All participants were native Latvian speakers and were recruited and screened using the PHQ-9 through a social networking service (SNS). An a priori power analysis (G*Power ver. 3.1 (59)) was conducted to determine the minimum sample size, which indicated that twelve participants were required in each of the control and depression symptoms groups. Participants answered the PHQ-9 again on the day of their experiment and were classified as being in the control group (PHQ-9 score lower than 10) or the depression symptoms group (PHQ-9 score of 10 or higher). Overall, participants in both the control group (N = 17) and the depression symptoms group (N = 13) took part in the experiment. Each participant provided written informed consent before the experiment and received a gift worth approximately 12 USD. One male participant in the depression symptoms group, whose PHQ-9 score reported through SNS was higher than the cut-off score but was lower on the day of the experiment, and one female and one male participant in the control group, who experienced a technical issue in the middle of the experiment, were excluded from all analyses.

Apparatus
The interviewers were presented on a monitor (Lenovo, 2880 × 1620 pixels, 34.31 × 19.30 cm) and controlled by a native Latvian member of the experiment team through a Unity game engine in the same room. The viewing distance was 60 cm. Eye movements were recorded by Tobii Pro Nano at 60 Hz and calibrated before each session by the Tobii Pro Lab. Facial expressions were monitored by a web camera (Logitech C270 HD Webcam, 720p/30 fps), and body language was recorded by an RGB camera (Canon EOS 1100 D, 25 fps). In addition, heart rates were monitored using a smartwatch (Fitbit Versa 2) every five seconds; however, this paper focuses on facial expression data, and thus the heart rate, eye tracking, and body language data were not used.

Experimental setup
The conversation task involved roughly structured dialogues between the participant and the interviewer (Figure 1A). Each session consisted of thirty trials, and each trial had two modes based on participants' behavior: a listening mode, in which the interviewer led the conversation with a closed-ended question (participants could answer "yes" or "no") based on the topic, and a reacting mode, in which the participant was asked to answer the question within five seconds (Figure 1A). Two members of the research team were in the same room as the participants and controlled the experimental system based on the participants' reactions. Participants were offered a break between sessions. In the case of the human interviewer, if participants took over ten seconds to answer a question, the video being played was automatically stopped until a member of the research team moved the system to the next trial. Each participant performed a total of four sessions: two types of conversation topics (neutral and negative) with two types of interviewers (human and virtual avatar). The order of the combinations of conversation topics and interviewers was assigned randomly to participants. Before starting the main sessions, participants practiced talking to the virtual avatar interviewer about animals in five trials. In order to clarify the effect of each experimental condition, participants were asked to fill in the PANAS before and after each session.

Interviewers
Two types of interviewers were prepared, an animated virtual avatar and a human, to clarify the differences between them. Several recent studies have reported that the type (e.g. human-like or animated), age, gender, and ethnicity of virtual avatars have no significant effect on frustration levels, preference, and the level of rapport (60-62). To select the virtual avatar, pictures of several different animated virtual avatars that resembled the general Latvian appearance (e.g. light skin tone, blue eyes, and blonde hair), with current casual clothing and hairstyles, were developed using Toon people ver. 3.1, a Unity asset produced by JBGarraza (63). Students in the Department of Psychology scored their impressions using a 5-point Likert scale (61), and the virtual avatar that made the most positive impression was used in the interaction experiment (Figure 1B). The voice data of the virtual avatar interviewer were produced by software (64) and Hugo.lv, which was used as the online text-to-speech application (65) to convert the written text into spoken words in Latvian. The virtual avatar interviewer was programmed to blink three or four times per ten seconds, based on the average natural human blinking rate (66,67), and to move its mouth according to the sentences.
In the listening mode, the interviewers talked to participants and asked questions while participants listened. In the reacting mode, the interviewers nodded while participants answered the questions. For the human interviewer, videos were prepared in which a native Latvian speaker spoke the same sentences as the virtual avatar and then nodded for approximately ten seconds, which was twice as long as the time participants were asked to talk. The videos were then played in order, and participants interacted with them through a monitor.

Conversation strategies
Two types of conversation topics were prepared in Latvian: negative topics (war and loneliness), as it has been reported that these topics have a high impact on the vocal, visual, and verbal features used to detect depression (32), and neutral topics (gardening and traveling).

Verbal and non-verbal behavior annotation
Several non-verbal behaviors of participants were annotated, such as gaze direction (up, down, side, and rotation) and reacting behaviors (smiling, nodding, and shaking the head), while participants were answering the interviewers' questions. Table 1 indicates the annotation criteria. Moreover, the number of words that participants used when talking to the interviewers was counted. Four students of the University of Latvia were hired as annotators, and all data were annotated by three annotators using ELAN (68,69). The average was computed across the three annotators' data. In the human interviewer's case, the number of words spoken before the video stopped was used for analysis.
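The averaging step can be illustrated with a short sketch. The behavior labels, annotator identifiers, and counts below are made-up placeholders, and treating a behavior an annotator did not mark as a count of zero is an assumption rather than something stated in the study.

```python
from statistics import mean


def average_annotations(counts_by_annotator: dict[str, dict[str, int]]) -> dict[str, float]:
    """Average each behavior count across annotators.

    ASSUMPTION: a behavior missing from one annotator's record counts as 0.
    """
    behaviors = set().union(*(c.keys() for c in counts_by_annotator.values()))
    return {
        behavior: mean(c.get(behavior, 0) for c in counts_by_annotator.values())
        for behavior in sorted(behaviors)
    }


# Hypothetical counts for one participant in one session.
example = {
    "annotator_1": {"look_downward": 4, "smile": 2},
    "annotator_2": {"look_downward": 5, "smile": 2},
    "annotator_3": {"look_downward": 6, "smile": 1},
}
averaged = average_annotations(example)
```

Averaging across annotators reduces the influence of any single annotator's judgment on the per-session frequencies used in the ANOVAs.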

Facial expression analysis using OpenFace
Facial landmarks, head poses, facial action units (Figure 2), and gaze directions were detected by OpenFace toolkit for emotion recognition using videos of the events that cause reaction (70-72).
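As a hedged illustration of how per-session AU intensities might be summarized from OpenFace output, the sketch below averages intensity columns over frames in which face tracking succeeded. It assumes a CSV with a `success` column and intensity columns named like `AU14_r`, which matches the commonly documented OpenFace output layout; the sample rows are invented, and this is not the authors' pipeline.

```python
import csv
import io
from statistics import mean

# Invented sample in the assumed OpenFace layout (header names may carry
# leading spaces in real output, so keys are stripped below).
SAMPLE_CSV = """frame, success, AU14_r, AU15_r, AU17_r
1, 1, 0.50, 0.00, 1.00
2, 0, 3.00, 3.00, 3.00
3, 1, 1.50, 0.20, 0.00
"""


def mean_au_intensity(csv_text: str,
                      au_columns=("AU14_r", "AU15_r", "AU17_r")) -> dict[str, float]:
    """Mean intensity (0-5 scale) per AU over successfully tracked frames."""
    values = {au: [] for au in au_columns}
    for raw in csv.DictReader(io.StringIO(csv_text)):
        row = {key.strip(): value.strip() for key, value in raw.items()}
        if int(float(row["success"])) != 1:
            continue  # skip frames where the face was not tracked
        for au in au_columns:
            values[au].append(float(row[au]))
    if not all(values.values()):
        raise ValueError("no successfully tracked frames")
    return {au: mean(v) for au, v in values.items()}


summary = mean_au_intensity(SAMPLE_CSV)
```

Dropping unsuccessful frames is a design choice made here for illustration; other summaries, such as medians or per-trial means, would be equally plausible.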

Statistical analysis
Three-way mixed-design analyses of variance (ANOVAs) were conducted with types of participants, interviewers, and conversation topics as the main factors. In the ANOVAs of this study, a Huynh-Feldt correction was applied when Mendoza's test indicated that the assumption of sphericity was not met. 95% confidence intervals (CIs) were computed based on Loftus and Masson's procedure.

Results
A post hoc analysis was conducted using G*Power (59) to confirm sufficient statistical power (Power = .945). Tables 2 and 3 indicate the socio-demographic data of each participant group. This section reports the effect of experimental conditions on participants' mood changes between before and after each session, and the facial expression differences analyzed by both manual and automatic methods, by examining types of participants (control vs depression symptoms group), interviewers (human vs virtual avatar), and conversation topics (neutral vs negative).

Comparison of Positive and Negative Affect Schedule (PANAS) between participants and experimental conditions in each type of participant's group
The changes in PANAS scores between before (pre) and after (post) each session, which measured the effect of types of conversation topics and interviewers, indicated that there was no significant interaction between types of participants, interviewers, and conversation topics in the scores of either positive (Figure 3A) or negative affect (Figure 3B). However, the main effect of types of conversation topics was significant in the change of both positive and negative affect scores (F(1, 25) = 8.1792, p = .0084, η² = .1183; F(1, 25) = 11.2158, p = .0026, η² = .1356, respectively). The change in the positive affect score was much lower in negative than in neutral conversation topics, and that in the negative affect score was higher in negative than in neutral conversation topics. The type of conversation topic had a large impact on both the control and depression symptoms groups.
Figure caption: The criteria of facial action units (73,74).

Comparison of annotation results in each participant group type
The frequency of look-downward and look-averted
With regard to the frequency of look-downward, there was no interaction among the three factors; however, the main effect of types of conversation topics was significant (F(1, 25) Figure 4A).
For the frequency of look-averted, there was neither a significant interaction between types of participants, interviewers, and conversation topics nor a significant main effect of any factor (Figure 4B).

The frequency of words
The number of words with which participants responded during the reacting mode is reported in Figure 5. There was no interaction between types of participants, types of interviewers, and types of conversation topics; however, there was a significant interaction between types of participants and conversation topics (F(1, 25) = 4.2526, p = .0497, η² = .0049).

Comparison of Facial Action Coding System (FACS) data in each participants' group type
The intensity (from 0 to 5) of seventeen Action Units (AUs), the gaze angle along the horizontal (averted) and vertical axes, and the differences in head rotation were all computed by OpenFace. The results showed significant effects involving types of participants in the intensity of Dimpler (AU 14), Lip Corner Depressor (AU 15), and Chin Raiser (AU 17).

Figure 6A indicates the intensity of Dimpler. There was an interaction between types of participants and interviewers (F(1, 25) = 4.5909, p = .0421, η² = .0045). Within this interaction, there was no significant difference between types of interviewers in either participant group; however, the intensity of Dimpler was higher in the control group than in the depression symptoms group for both types of interviewers. Next, with regard to the intensity of the Lip Corner Depressor, there was neither a significant interaction between conversation topics, types of participants, and interviewers nor a significant main effect of any factor (Figure 6B).

Discussion
In this section, the effects of types of interviewers and conversation topics on verbal, facial expression, and annotated gaze patterns in types of participants, such as control and depression symptoms groups, are interpreted based on the three aims presented in the Introduction section: (1) to understand the different effects of types of interviewers, (2) to clarify the effect of neutral conversation topics on facial expressions and emotions in people with depression, and (3) to compare verbal and non-verbal information between people with, or without depression.

Understanding of the effect of the different types of interviewers
With regard to the effect of the different types of interviewers, the virtual avatar had no impact on emotions, facial expressions, and eye gaze directions in people with or without depression. With regard to [...] (28) suggested. They reported that female participants followed the virtual avatar at the same level as the human interviewer and completed the various tasks. Both the virtual avatar and the human interviewers invoked feelings of stress to the same level. The PANAS results of this study indicated that the virtual avatar has the same level of emotional impact as the human interviewer in both types of conversation topics. It is, therefore, suggested that the virtual agent has the same authority as the human interviewer; thus, the virtual avatar would be effective in the interview process with depressed patients. However, an animated virtual avatar was used as the virtual avatar interviewer and a recorded video was used as the human interviewer in this study; thus, the emotional effects of a human-like virtual avatar and of real human interviewers on people with depression are still unclear.
FIGURE 4 Frequency of (A) look-downward and (B) look-averted in each interviewer. Striped pattern boxes indicate the data of the depressed symptoms group, and solid boxes are the control group. Error bars indicate 95% CI.
Figure caption: Word frequency in each interviewer. Striped pattern boxes indicate the data of the depressed symptoms group, and solid boxes are the control group. Error bars indicate 95% CI.

Clarifying the effect of the different types of conversation topics
The effect of the different types of conversation topics was shown in the PANAS score, the frequency of look-downward and words, and the intensity of Dimpler and Chin Raiser.
First, the PANAS score indicated the emotional effect of the different types of conversation topics on participants in both the control and the depression symptoms groups. In negative conversation topics, the post-scores of positive affect were much lower than the pre-scores in both the control and depression symptoms groups, and the opposite pattern was observed for negative affect. The results are consistent with the suggestions of Costanza et al. (35). Neutral conversation topics led to fewer negative emotions than negative conversation topics for both types of interviewers and both types of participants.
Secondly, the frequency of look-downward was higher in negative conversation topics than in neutral conversation topics in both types of participants. Gaze patterns are primarily social cues used to represent emotions. Look-downward shows more negative social signals than direct gaze in psychophysiology (39). The result of gaze patterns in types of conversation topics in this study is consistent with these past studies. It was concluded that negative conversation topics induced participants' negative emotions, thus participants tended to look downward more frequently than in neutral conversation topics.
The intensity of Dimpler was lower, and that of Chin Raiser was higher, in negative conversation topics than in neutral conversation topics in both the control and depression symptoms groups. [...] (40,42). The results of PANAS and the intensity of Dimpler and Chin Raiser in this study were consistent with these past studies. Types of conversation topics had a large effect on both the control and depression symptoms groups; namely, negative conversation topics induced negative emotion in both types of participants. Thus, it was concluded that negative emotions induced the lower intensity of Dimpler and the higher intensity of Chin Raiser. A limitation of this study is that closed-ended questions (participants could only answer "yes" or "no") were used to restrict participants' answers in order to control the experimental duration; thus, it is unclear whether open-ended questions, which cannot be answered with "yes" or "no", have any effect on non-verbal information or emotions in people with depression.

Comparison of verbal and non-verbal information between people with and without depression
With regard to types of participants, the frequency of words and the intensity of Dimpler and Chin Raiser differed between the control and depression symptoms groups.
First, the frequency of words was larger in neutral conversation topics than in negative conversation topics in the depression symptoms group. Several past studies reported that depression is characterized by speech style; in particular, people with depression tend to have fewer social interactions and have difficulty choosing words (19,20). However, to the best of our knowledge, no past studies have reported the behavioral pattern of people with depression in neutral conversation topics that are not clinically related. It is, therefore, concluded that people with depression had difficulty choosing words when answering questions on negative conversation topics; thus, people with depression symptoms talked less than people in the control group in negative conversation topics.
Secondly, the intensity of Dimpler was higher in the control group than in the depression symptoms group. The intensities of Dimpler, Chin Raiser, and Lip Corner Depressor were the main criteria for negative facial expressions (40,42). Several past studies reported that the intensity of Dimpler is higher in groups with low severity of depression symptoms than in groups with high severity (22); however, other studies found opposing results. Hsu et al. (23) and Rottenberg et al. (75) interpreted depression as being marked by reductions in facial expressions. Furthermore, past studies reported that the intensity of Dimpler increases when healthy people feel boredom (41,43). It is interpreted, therefore, that the participants in the control group felt boredom more than those in the depression symptoms group; thus, the intensity of Dimpler was higher in the control group. Furthermore, the intensity of Chin Raiser was higher in the depression symptoms group when they were talking about negative conversation topics. It was supposed that participants in the depression symptoms group were more negatively affected by negative conversation topics than those in the control group; thus, the intensity of Chin Raiser was higher. Another limitation is that classification experiments using computer science methodologies, such as machine learning or AI, were not conducted in this study.
In conclusion, based on the results of the manual (annotation) and automatic (OpenFace) non-verbal analyses, there was no significant difference between the types of interviewers, and people with depression symptoms would make more negative facial expressions, such as Chin Raiser, when they were interacting with virtual avatar interviewers on neutral conversation topics. In further studies, classification experiments using computer science methodologies would be required in order to clarify whether people with depression could be differentiated using facial expressions in virtual avatar communication with neutral conversation topics.

Data availability statement
The datasets from the current study are available from the corresponding author on reasonable request.

Ethics statement
The studies involving human participants were reviewed and approved by the Ethics Committee of the University of Latvia. The patients/participants provided their written informed consent to participate in this study.

Author contributions
AT conceived the experiment; AT, IA, and LN conducted the experiment; AT analyzed the results. AT and LD wrote the manuscript.