ORIGINAL RESEARCH article

Front. Psychiatry

Sec. Digital Mental Health

Volume 16 - 2025 | doi: 10.3389/fpsyt.2025.1602650

This article is part of the Research Topic: AI Approach to the Psychiatric Diagnosis and Prediction, Volume II.

Predicting Depression by Using a Novel Deep Learning Model and Video-audio-text Multimodal Data

Provisionally accepted
Yifu Li1, Xueping Yang2, Meng Zhao1, Jiangtao Wang1, Yu-Dong Yao3, Wei Qian1, Shouliang Qi1*
  • 1Northeastern University, Shenyang, China
  • 2The People's Hospital of Liaoning Province, Shenyang, Liaoning Province, China
  • 3Stevens Institute of Technology, Hoboken, New Jersey, United States

The final, formatted version of the article will be published soon.

Objective: Depression is a prevalent mental health disorder affecting millions of people. Traditional diagnostic methods rely primarily on self-reported questionnaires and clinical interviews, which can be subjective and vary significantly between individuals. This paper introduces the Integrative Multimodal Depression Detection Network (IMDD-Net), a novel deep-learning framework designed to improve the accuracy of depression evaluation by leveraging both local and global features from video, audio, and text cues.

Methods: The IMDD-Net integrates these multimodal data streams using the Kronecker product for multimodal fusion, enabling deep interactions between modalities. Within the audio modality, Mel-Frequency Cepstral Coefficient (MFCC) and extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) features capture local and global acoustic properties, respectively. For video data, the TimeSformer network extracts both fine-grained and broad temporal features, while the text modality uses a pre-trained BERT model to obtain comprehensive contextual information. The IMDD-Net's architecture combines these diverse data types into a holistic analysis of depressive symptoms.

Results: Experiments on the AVEC 2014 dataset show that the IMDD-Net achieves state-of-the-art performance in predicting Beck Depression Inventory-II (BDI-II) scores, with a Root Mean Square Error (RMSE) of 7.55 and a Mean Absolute Error (MAE) of 5.75. A classification task for identifying potentially depressed subjects achieves an accuracy of 0.79.

Conclusion: These results underscore the robustness and precision of the IMDD-Net and highlight the importance of integrating local and global features across multiple modalities for accurate depression prediction.
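To illustrate the fusion step described in the Methods, the sketch below shows how a Kronecker (outer) product can combine per-modality feature vectors into a single joint interaction vector. This is a minimal illustration only: the embedding dimensions, pooling, and the appended constant term are assumptions for the example, not details taken from the IMDD-Net architecture.

```python
import numpy as np

# Hypothetical pooled per-modality embeddings (dimensions illustrative).
audio = np.random.rand(4)   # e.g. pooled MFCC/eGeMAPS features
video = np.random.rand(3)   # e.g. pooled TimeSformer features
text = np.random.rand(2)    # e.g. pooled BERT embedding

# Appending a constant 1 to each vector (as in tensor-fusion-style
# approaches) keeps unimodal and bimodal terms in the fused result
# alongside the full trimodal interactions.
audio_ = np.append(audio, 1.0)  # shape (5,)
video_ = np.append(video, 1.0)  # shape (4,)
text_ = np.append(text, 1.0)    # shape (3,)

# Chaining Kronecker products yields every cross-modality product term
# as one flat vector, which a downstream regressor can consume.
fused = np.kron(np.kron(audio_, video_), text_)
print(fused.shape)  # → (60,) since 5 * 4 * 3 = 60
```

In practice such a fused vector would feed a regression head predicting the BDI-II score; the appended-ones trick is one common way to let the fusion retain lower-order terms, though the paper's exact formulation may differ.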

Keywords: deep learning, depression, multimedia, information fusion, local and global features

Received: 30 Mar 2025; Accepted: 26 Aug 2025.

Copyright: © 2025 Li, Yang, Zhao, Wang, Yao, Qian and Qi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Shouliang Qi, Northeastern University, Shenyang, China

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.