The future of automated capture of social kinesic signals for psychiatric purposes

Burgoon, Judee K.; Elkins, Aaron C.; Derrick, Douglas; Walls, Bradley; Metaxas, Dimitris

doi:10.3389/fcomp.2023.1168712

MINI REVIEW article

Front. Comput. Sci., 02 August 2023

Sec. Computer Vision

Volume 5 - 2023 | https://doi.org/10.3389/fcomp.2023.1168712

This article is part of the Research Topic Body Talks: Advances in Passive Visual Automated Body Analysis for Biomedical Purposes.

Judee K. Burgoon¹*

Aaron C. Elkins²

Douglas Derrick³

Bradley Walls⁴

Dimitris Metaxas⁵

¹Center for the Management of Information, University of Arizona, Tucson, AZ, United States
²Department of Management Information, San Diego State University, San Diego, CA, United States
³College of Information Science and Technology, University of Nebraska at Omaha, Omaha, NE, United States
⁴Discern Science International, Tucson, AZ, United States
⁵Center for Computational Biomedicine Imaging and Modeling (CBIM), Rutgers University, New Brunswick, NJ, United States

This article considers how computer vision can be enlisted for biomedical applications, specifically the measurement, data analytics and treatment of psychiatric disorders. Often, youngsters are too afraid or embarrassed to disclose their emotional and mental problems to human therapists. An AI system can be utilized not only to collect data in a non-threatening, ongoing manner and record a patient's psychophysiological state over time but also to analyze the data and output periodic results. It may thus be an efficient and effective means for therapists to plan treatments. We report on various tools for analyzing social kinesic signals for emotional and physiological states. Only one, AVATAR (and its predecessor SPECIES), both records a patient's state and outputs an analysis that flags problem areas for therapists. In this way, automated tools can augment human observation and judgment.

Introduction

Tools for detecting human emotional and cognitive states have undergone an exponential advancement in recent years. Tools developed for one purpose have shown utility in additional arenas, thus serving a multiplicity of purposes. That is the case with tools that originated in the field of fraud and deception detection. Noncontact tools meant to passively and surreptitiously detect states of cognitive and emotional arousal may also register disruptions in one's mental, emotional and physiological state. Here, we demonstrate their application to psychiatric disorders, such as bipolar disorder, anxiety, depression and suicidal tendencies. All of these disorders have linkages to arousal, anxiety and/or hidden emotional states. Drawing upon our research on deception and fraud detection, we demonstrate how cumulative signals from various sensors can be aggregated to correlate with, and predict, psychiatric states, which may aid in delivering useful treatment recommendations.

Background and foundations

The scope of human-computer interaction has been an ever-widening one, encompassing such domains as information technology design, entertainment technologies, cooperative work, medical care delivery, personality assessment and more (Salah et al., 2011) that relate to all manner of human intelligences, such as emotional intelligence, linguistic intelligence, logic and interpersonal intelligence (Salovey and Mayer, 1990; Gardner, 2011). This vast panorama exceeds our purview. Our goal in the current article is a more modest one, to take one slice out of the pie to propose augmenting human judgment with computer technology to detect and treat psychiatric disorders.

Many tools have been developed to assess humans' mental states. For example, VlogSense automatically measures and analyzes nonverbal conversational behavior shown while viewers watch YouTube videos (Biel et al., 2011). Computer vision measures such nonverbal behaviors as voice, gaze, facial expressions and head pose to assess team collaboration and personality traits (Jayagopi and Gatica-Perez, 2010; Jacques Junior et al., 2022). With sensors located in an interviewing kiosk, SPECIES (Special-Purpose, Embodied Conversational Intelligence with Environmental Sensors; Derrick, 2011) combines sensors to conduct interviews that detect respondents' veracity. Wearables have been designed to give public speakers feedback about the effectiveness of the non-verbal facets of their presentations (Mihoub and Lefebvre, 2019). Using a tripartite system of computer vision for gaze estimation, a taxonomy to tag the implicit semantics of gaze patterns and machine learning to correlate the semantics with the gaze behavior, Okada et al. (2019) found that social gaze distinctly recognized group leaders. Computer scientists are also applying computer vision approaches to extract personality impressions from faces, postures and other kinesic behaviors (Jacques Junior et al., 2022). CogStack uses Electronic Health Records to alert when patients are at risk for a psychotic episode (Wang et al., 2020). Mental illness can be diagnosed from social media posts (Zhang et al., 2023).

In the foregoing examples, most systems deal with one-way transmission of signals by the patient or cooperative discourse between two parties. When it comes to dealing with therapeutic and non-cooperative discourse, however, it is more difficult to model communication because a patient or interviewee may be managing their behavior so as to mask undesirable past behavior or current troubling mental states. The interaction can better be likened to a legal context in which an interrogating attorney questioning a suspect (Keatley, 2020) must discern which behaviors can be believed and which constitute deception. Such discourse is regarded as adversarial or non-cooperative.

For over a century, the preferred technology used to assess noncooperative discourse has been the polygraph. It has been regarded as the gold standard for gauging when deception is or is not indicated (Vrij, 2008), even though the polygraph is not a lie detector per se. Rather, deceit is inferred from respiratory, cardiac and skin conductance responses that measure arousal and thus predict an individual's truthfulness (Grubin and Madsen, 2005). Even were psychiatric issues related only to arousal, a device like the polygraph would still be an infeasible psychiatric aid for several reasons. First and most problematic, the polygraph is a contact tool; in other words, the patient must be connected to the device. For most patients, being hooked up to the polygraph is intimidating; patients have various fears of it, such as that it will deliver an electric shock or reveal something about their physiological state that they do not want to divulge. Second, the set-up is time-consuming, as each of the behavioral sensors must be properly secured to the interviewee and calibrated to measure the optimal amplitude. A pneumograph around the chest measures respiration, a cardiosphygmograph around the arm measures blood pressure and pulse, and various leads to the fingers measure palmar sweat (skin conductance). All of this calibration takes a significant amount of time per patient. Third, the standard interview protocol itself for a properly done polygraph is time-consuming. It begins with a pre-test that includes detailed definitions of the question terminology and an explanation of the process to be followed. This is followed by the main set of interview questions and then a post-test during which the questions may be repeated. Fourth, the instruction-giving is done by the examiner, who may unconsciously introduce bias by vocal tone, tempo and word choice (Mitchell et al., 2005). Fifth, the presence of a human conducting the interview introduces the interviewee's fear of evaluation by the examiner. Patients become embarrassed when having to address sensitive topics. Finally, an expert must be trained to review and interpret the results. Subjective interpretation always introduces the potential for variability in judgment across patients and across time.

In sum, completion of a polygraph for each individual patient absorbs extensive time and labor. And the end result is only an assessment of the physiological aspects of arousal, excluding cognitive arousal, emotional distress, depression or veracity. Its accuracy is quite variable, being the highest when judging single-incident, past-tense crimes and lowest when judging future intentions and repetitive proclivities (National Research Council, 2003), such as recurrent bouts of depression or habitual lying.

The shortcomings of the polygraph highlight some of the criteria of a system for gauging psychiatric disorders. Ideally, it should be noncontact; the patient should be free of any cuffs, wires or other connectors, which in addition to removing “scary” wires and connectors also gives the individual freedom of movement and freedom to gesture. An ideal system should entail brief, straightforward instructions to the patient, brevity being one of its hallmarks. It should be valid on its face (measuring what it is meant to measure) and reliable (producing the same results on subsequent administrations), while minimizing fatigue and boredom. Finally, computerized analysis of results would obviate the need for human, and possibly biased and unreliable, interpretation.

One category of computer-based tool used by mental health clinicians is the neuropsychological test conducted with a computer or tablet. One popular tool is the Cambridge Neuropsychological Test Automated Battery (CANTAB), which measures the correctness of, and reaction time for, responses to a series of computerized tests meant to assess visual memory, attention, working memory and planning (Fray et al., 1996; Smith et al., 2013). The patient sits in front of a computer with a touchscreen and is instructed to respond to the tasks presented on the screen. For example, one task called the Affective Go/No-go presents the patient with words differentiated by their valence (i.e., positive or negative), and the patient must identify the valence of each word. The patient's omission and commission errors as well as response delays are recorded and used to evaluate, diagnose, and support research on neuropsychological phenomena, for example by correlating performance on CANTAB with fMRI data. Similar to the polygraph, this tool requires physical contact as well as human administration and interpretation of the results. It has an advantage over traditional interviews because it has higher face validity and its tests directly measure performance rather than asking for a subjective evaluation.
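The scoring logic of a go/no-go task can be illustrated with a minimal sketch. This is not CANTAB's actual scoring procedure; the `Trial` record, the field names and the `score_go_no_go` function are hypothetical, and the sketch assumes the usual go/no-go definitions, in which an omission error is a missed response to a target word and a commission error is a response to a non-target word.

```python
from statistics import mean
from typing import List, NamedTuple, Optional


class Trial(NamedTuple):
    """One presented word (hypothetical record layout)."""
    is_target: bool                # True if the word has the target valence ("go" trial)
    response_ms: Optional[float]   # response latency in ms, or None if no response was made


def score_go_no_go(trials: List[Trial]) -> dict:
    """Count omission and commission errors and average the latency of correct responses."""
    # Omission: the word was a target but the patient did not respond.
    omissions = sum(1 for t in trials if t.is_target and t.response_ms is None)
    # Commission: the word was not a target but the patient responded anyway.
    commissions = sum(1 for t in trials if not t.is_target and t.response_ms is not None)
    # Response delay is averaged over correct responses to targets only.
    hit_latencies = [t.response_ms for t in trials
                     if t.is_target and t.response_ms is not None]
    return {
        "omission_errors": omissions,
        "commission_errors": commissions,
        "mean_latency_ms": mean(hit_latencies) if hit_latencies else None,
    }


example_trials = [
    Trial(True, 500.0),    # correct response to a target
    Trial(True, None),     # omission error
    Trial(False, 450.0),   # commission error
    Trial(False, None),    # correct withholding
    Trial(True, 700.0),    # correct response to a target
]
example_scores = score_go_no_go(example_trials)
```

Keeping errors and latencies as separate outputs mirrors the text, which treats omission errors, commission errors and response delays as distinct measures.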

The AVATAR, or Automated Virtual Agent for Truth-Assessment in Real-time, was developed with such criteria in mind (Patton, 2008; Derrick, 2011; Nunamaker et al., 2011; Burgoon and Nunamaker, 2013; Elkins et al., 2013, 2014; Twitchell et al., 2013). It originated in the field of credibility assessment, marrying sensors that measure signals of credibility with interviews conducted by a virtual agent. Studies employing automated interview systems such as the AVATAR have found that individuals being interviewed by a fully automated virtual agent feel less concerned about being evaluated and freely disclose more sadness, such as is associated with depression and suicidal tendencies, compared to interviews where they believed a virtual avatar was being operated by a human (Lucas et al., 2014; Rizzo et al., 2016). These results are part of a growing body of research suggesting that virtual human interactions reduce stigma by providing a safe context in which users may reveal sensitive information compared to situations where users anticipate negative judgments from a human interviewer. Additionally, automatic behavior detection seems to provide a more accurate window into the emotional state of the user than does self-report.

The AVATAR is designed to mimic human communication. A virtual interviewer that has the head and torso of a human conducts the interview while its various sensors register the interviewee's head pose, eye and facial movement, posture and gestures. (It also registers such features of the voice as pitch, loudness, tempo and fluency, but our interest here is in the kinesic, or nonverbal visual, movement features.) This allows it to sense visual signals from the interviewee, interpret those signals, and in turn, translate those signals to produce messages. Among non-verbal signals, kinesic visual cues account for the most variance, followed by vocalics (Burgoon et al., 2022b), in creating first impressions, conveying emotions, managing social interactions and persuading others. Ideally, most or all non-verbal signals can be captured unobtrusively so that measuring instruments are not distracting.

Materials and methods

Starting from the top of the interviewee, the AVATAR analyzes the head and face, the former for purposes of detecting orientation toward the interlocutor and the latter for purposes of detecting emotional states and relational messages. Tools that measure the face such as OpenFace (Baltrušaitis et al., 2016) also often measure the pitch, roll and yaw of the head. Pitch is the forward and backward movement of head pose, such as when nodding “yes” or when hanging the head forward and downward to convey emotional sadness. Yaw is the left and right turning, such as when shaking the head “no.” Roll is tilting the head sideways, as when listening. The head tilt is a common gesture to signal subordination; in the animal kingdom, it mimics exposure of the jugular vein of a vanquished foe as a substitute for an actual kill. The canting of the head sideways and downward can signal depression and emotional distress, despite the patient's words saying otherwise. Likewise, orienting the head and body indirectly toward an interlocutor can convey weakness and anxiety or lack of openness and rapport with an interlocutor. It can communicate “shutting down.” Contrariwise, sitting upright and facing an interlocutor straight on communicates directness and composure.
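As an illustration of how head-pose angles might be turned into coarse descriptive labels, consider the sketch below. It uses the standard convention in which pitch is a forward or backward nod, yaw is a left or right turn, and roll is a sideways tilt. The function name, the 15-degree threshold and the sign convention (positive pitch meaning head lowered) are assumptions made for illustration; they are not taken from OpenFace or the AVATAR, and real tools differ in units and sign conventions.

```python
from typing import List


def describe_head_pose(pitch_deg: float, yaw_deg: float, roll_deg: float,
                       threshold_deg: float = 15.0) -> List[str]:
    """Map head-pose angles (degrees) to coarse labels.

    Assumed convention (hypothetical): positive pitch = head lowered,
    yaw = left/right turn, roll = sideways tilt. The threshold is arbitrary.
    """
    labels = []
    if pitch_deg > threshold_deg:
        labels.append("head lowered")
    if abs(yaw_deg) > threshold_deg:
        labels.append("turned away")
    if abs(roll_deg) > threshold_deg:
        labels.append("tilted sideways")
    # No angle exceeded the threshold: treat the pose as direct and upright.
    return labels or ["upright and facing forward"]


neutral = describe_head_pose(0.0, 0.0, 0.0)
canted = describe_head_pose(20.0, 0.0, -18.0)
```

Labels of this kind are only descriptive; whether a lowered, canted head reflects distress in a given patient remains a matter for clinical interpretation.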

Additionally, many combinations of facial features convey specific emotions (Walls, 2020). Several software tools measure facial feature actions, the most frequently used being OpenFace (Baltrušaitis et al., 2016). Several landmarks are located on the face and computer vision links them, like a dot-to-dot puzzle, to measure different expressions (e.g., eyebrow raise, mouth tightener) and combinations that together express emotions (e.g., anger, fear). These expressions are represented by AUs, or facial action units. AUs related to emotional distress would include sadness depicted around the eyes, laxity in the cheek region, and downward turn of the lips. Anxiety would be shown through tightened forehead muscles, with eyebrows tightened above the bridge of the nose, crow's feet in the outer corners of the eyes, pursed lips, and downturned lips (Porter et al., 2012; Ten Brinke and Porter, 2012).

Also part of the face and head region are the eyes, the analysis of which is known as oculometrics. Eye trackers such as Tobii and EyeDetect (Cantoni et al., 2018) are used to track blinking, gaze direction, eye saccades and pupil dilation (Proudfoot et al., 2016). Depression and emotional distress are often signaled by suppression of blinking, gaze averted away from the interlocutor, and constricted (rather than dilated) pupils (Burgoon et al., 2017; Ceh et al., 2021). Masked (concealed) emotions are associated with more inconsistent expressions and a faster blink rate; neutralized (weakened) emotions instead show a decreased blink rate (Porter and Ten Brinke, 2008). Blinking and eye movements can predict vigilance during an interaction or task (Langhals et al., 2013).
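Blink rate, one of the oculometric measures mentioned above, can be sketched as a simple count of eye-closure onsets per minute. The sketch assumes that some upstream tracker has already produced a per-frame "eye closed" flag; the function name and the input format are hypothetical and do not correspond to the output of Tobii, EyeDetect or any other specific product.

```python
from typing import Sequence


def blink_rate_per_minute(eye_closed: Sequence[bool], fps: float) -> float:
    """Estimate blinks per minute from a per-frame 'eye closed' flag.

    A blink is counted at each transition from open to closed, so a closure
    spanning several consecutive frames counts once.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    if not eye_closed:
        return 0.0
    blinks = 0
    previously_closed = False
    for closed in eye_closed:
        if closed and not previously_closed:
            blinks += 1
        previously_closed = closed
    # Recording duration in minutes is len(eye_closed) / fps / 60.
    return blinks * fps * 60.0 / len(eye_closed)


# Two seconds of video at 30 frames per second containing two closures.
example_frames = [False] * 10 + [True] * 3 + [False] * 20 + [True] * 2 + [False] * 25
example_rate = blink_rate_per_minute(example_frames, fps=30.0)
```

A rate computed this way could then be compared against a patient's own baseline, since the text associates both raised and lowered blink rates with different kinds of emotional management.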

Moving to the torso, there are motion capture systems such as OpenPose and Kinect for measuring posture and gestures. A slumped posture, often with an averted gaze, is commonly associated with depression or anxiety. Kinect and similar commercial tools can be used to capture limb and gestural patterns. Alternatively, in contrast to the traditional method of manual coding, gestures can now be captured with computer measurement. An approach called Blob Analysis, for example, forms bounding boxes around hands, arms and shoulders. Ellipses are formed within the boxes and the x and y coordinates of the ellipses are then calculated. From these, concurrent and sequential nonverbal communication patterns can be calculated. For example, Meservy et al. (2005a,b) created measures of gestural location, expansiveness and velocity from the pixels on the screen. Gestural animation, shown by more expansiveness and faster velocity, is associated with emotional stability and positivity, whereas more gestural restrictedness and rigidity would likely be associated with emotional distress (Twyman et al., 2014; Pentland et al., 2017). Analysis of torso and gestures can be extended to dyads by examining the synchrony of behavior between sender and receiver over time (Dunbar, 2022).
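To make the idea of expansiveness and velocity measures concrete, the following sketch computes them from per-frame hand positions, such as the centers of the ellipses described above. The specific definitions used here (mean distance between the two hands as expansiveness, mean frame-to-frame displacement scaled by frame rate as velocity) and all names are illustrative assumptions; they are not the measures defined by Meservy et al.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (x, y) pixel coordinates of a hand's ellipse center


def gesture_features(left_hand: List[Point], right_hand: List[Point],
                     fps: float) -> Dict[str, float]:
    """Compute illustrative expansiveness and velocity features from hand tracks."""
    if len(left_hand) != len(right_hand) or len(left_hand) < 2:
        raise ValueError("need two tracks of equal length with at least two frames")

    # Expansiveness (assumed definition): average distance between the hands.
    spans = [math.dist(l, r) for l, r in zip(left_hand, right_hand)]

    def mean_speed(track: List[Point]) -> float:
        # Average displacement between consecutive frames, in pixels per second.
        steps = [math.dist(a, b) for a, b in zip(track, track[1:])]
        return sum(steps) / len(steps) * fps

    return {
        "mean_hand_span_px": sum(spans) / len(spans),
        "left_speed_px_per_s": mean_speed(left_hand),
        "right_speed_px_per_s": mean_speed(right_hand),
    }


# Left hand moves 5 pixels per frame; right hand stays still.
moving = gesture_features([(0, 0), (3, 4), (6, 8)],
                          [(10, 0), (10, 0), (10, 0)], fps=10.0)
still = gesture_features([(0, 0), (0, 0)], [(3, 4), (3, 4)], fps=10.0)
```

Because the features are expressed in pixels, they depend on camera distance and resolution; normalizing by a body measure such as shoulder width would be one way to compare across recordings.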

The analysis tools: putting it all together

The emergence of automated AI tools has naturally led not only to collections of multiple signals from multiple modalities, but also to the development of methods to analyze such signals in simultaneous and serial combinations. One such system, HireVue, is AI-driven software that combines facial affect, eye contact, vocal patterns and word choice to screen video and audio of potential employees. Other companies are Yobs Technologies, Talview Behavioral Insights and VCV.AI (Hinkle, 2020). For all of these, nonverbal behaviors and personality inventories play a large role among the metrics that are combined to predict which applicants will make good employees. For criminal investigations, multiple kinesic and vocal signals together produce a robust system for discriminating the “bad guys” from the “good guys.”

An advance in analysis is the use of time series methods like Recurrence Quantification Analysis and Multiscale Entropy (Duran et al., 2013) to measure dynamical movements. Multivariate analyses, machine learning and deep learning have all been used (Ding et al., 2019; Stathopoulos et al., 2021; Burgoon et al., 2022b). In Stathopoulos et al. (2021), the authors created a machine learning-based system that detects deceptive behavior in videos using facial Action Unit (AU) intensities as input. With the help of this system, the authors discovered specific micro-expression patterns that are known to be correlated with deceptive behavior. These include AU45 (eye blinks), AU20 (lip stretcher), AU13 (cheek puffer), AU9 (nose wrinkler), AU10 (upper lip raiser), and AU12 (lip corner puller). They occurred in deceptive videos across genders and ethnicities. As signs of discomfort and negative affect, such behaviors might prove to be good indicators of psychiatric distress.
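The core quantity in Recurrence Quantification Analysis, the recurrence rate, can be shown in a few lines. The sketch below is a deliberately simplified version that works on a one-dimensional series without time-delay embedding and excludes the main diagonal; the function name and the choice of radius are assumptions, and it is not the procedure used in the studies cited above.

```python
from typing import Sequence


def recurrence_rate(series: Sequence[float], radius: float) -> float:
    """Fraction of point pairs (i, j), i != j, whose values lie within `radius`.

    Simplification: no time-delay embedding is applied, so each state is a
    single scalar value rather than a reconstructed phase-space vector.
    """
    n = len(series)
    if n < 2:
        raise ValueError("need at least two observations")
    recurrent = 0
    for i in range(n):
        for j in range(n):
            # Skip the main diagonal, where every point trivially recurs with itself.
            if i != j and abs(series[i] - series[j]) <= radius:
                recurrent += 1
    return recurrent / (n * (n - 1))


# A strictly alternating signal revisits each of its two states regularly.
alternating_rate = recurrence_rate([0.0, 1.0, 0.0, 1.0], radius=0.1)
constant_rate = recurrence_rate([2.0, 2.0, 2.0], radius=0.1)
```

A higher recurrence rate indicates that a movement signal keeps returning to similar states, which is one way rigid or repetitive behavior might show up in such an analysis.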

An alternative approach for identifying hidden recurrent patterns among combined signals is software called THEME, developed by Magnusson (1996) (see also Magnusson, 2016; Burgoon et al., 2022a). An example would be discovering the dynamic head movement, eye gaze and gesture patterns correlated with anxiety. Burgoon et al. (2015) illustrated using this software to discover patterns of deception in group interaction. Several other methods for analyzing non-verbal dynamics can be found in Novotny and Bente (2022).

Finally, technologies such as AVATAR can make use of the cloud for data storage and security. Because the data need not be stored locally, they are no longer exposed to the theft or damage of a local device.

Discussion

The integration of psychiatry, computer vision, and non-verbal communication is a significant achievement in interdisciplinary research. By combining these fields, researchers are able to create a system that accurately captures and analyzes non-verbal behaviors to aid in psychiatric treatment. The use of computer vision provides an objective and automated method of detecting non-verbal behaviors, while psychiatry offers a framework for interpreting the models' predictions.

One of the most significant benefits of this approach is that it minimizes the risk of human bias. Human observers may have their own personal biases, and their subjective judgments may be influenced by factors such as gender, race, and culture. By using automated systems to capture and analyze non-verbal behaviors, researchers can obtain more reliable and objective data. This information can be used to guide the selection of appropriate treatment options for patients. This has significant implications for the field of psychiatry, as it allows for the development of more accurate and effective diagnostic tools and treatment methods.

One of the key benefits of using AVATARs in this context is their ability to serve as both sender and receiver. AVATARs can deliver verbal and non-verbal messages while simultaneously providing sympathetic listening. This is particularly useful in psychiatric treatment, where empathy and understanding are critical components of successful therapy. AVATARs can simulate human-human interaction without the distractions that often come with face-to-face interactions, creating a more controlled and focused environment for patients to receive treatment.

While there are certainly benefits to using automated systems in psychiatric treatment, there are also limitations to be considered. One potential limitation is the need for sensors to be calibrated and synchronized. Another limitation is that some patients may be distrustful or fearful of technology and may be unwilling to use such devices. However, for those who are comfortable using technology, AVATARs may offer a promising alternative to traditional face-to-face therapy.

Author contributions

JB wrote the first drafts of the paper. AE added significant new content. All authors contributed to the article and approved the submitted version.

Funding

Development of the AVATAR was partially supported by the National Science Foundation Human and Social Dynamics Program (Grant #0725895) on Interactive Deception and its Detection through Multimodal Analysis of Interviewer-Interviewee Dynamics and several grants to the NSF Center for Identification Technology (Grant #1068026) testing aspects of deception and non-contact detection tools.

Conflict of interest

JB, AE, and DD are founders of Discern Science International. BW is a consultant to DSI.

The remaining author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Baltrušaitis, T., Robinson, P., and Morency, L. P. (2016). “OpenFace: an open source facial behavior analysis toolkit,” in 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Piscataway, NJ: IEEE.

Biel, J.-I., Aran, O., and Gatica-Perez, D. (2011). You are known by how you vlog: personality impressions and nonverbal behavior in YouTube. Proceedings of the International AAAI Conference on Web and Social Media. 5, 446–449. doi: 10.1609/icwsm.v5i1.14160