AUTHOR=Li Yifu, Yang Xueping, Zhao Meng, Wang Jiangtao, Yao Yudong, Qian Wei, Qi Shouliang

TITLE=Predicting depression by using a novel deep learning model and video-audio-text multimodal data

JOURNAL=Frontiers in Psychiatry

VOLUME=16

YEAR=2025

URL=https://www.frontiersin.org/journals/psychiatry/articles/10.3389/fpsyt.2025.1602650

DOI=10.3389/fpsyt.2025.1602650

ISSN=1664-0640

ABSTRACT=Objective: Depression is a prevalent mental health disorder affecting millions of people. Traditional diagnostic methods rely primarily on self-reported questionnaires and clinical interviews, which can be subjective and vary significantly between individuals. This paper introduces the Integrative Multimodal Depression Detection Network (IMDD-Net), a novel deep-learning framework designed to improve the accuracy of depression evaluation by leveraging both local and global features from video, audio, and text cues. Methods: IMDD-Net integrates these multimodal data streams using the Kronecker product for multimodal fusion, facilitating deep interactions between modalities. Within the audio modality, Mel-frequency cepstral coefficient (MFCC) and extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) features capture local and global acoustic properties, respectively. For video data, the TimeSformer network extracts both fine-grained and broad temporal features, while the text modality uses a pre-trained BERT model to obtain comprehensive contextual information. The architecture combines these diverse data types to provide a holistic analysis of depressive symptoms. Results: Experiments on the AVEC 2014 dataset show that IMDD-Net achieves state-of-the-art performance in predicting Beck Depression Inventory-II (BDI-II) scores, with a root mean square error (RMSE) of 7.55 and a mean absolute error (MAE) of 5.75. In a classification task to identify potentially depressed subjects, it achieves an accuracy of 0.79. Conclusion: These results underscore the robustness and precision of IMDD-Net, highlighting the importance of integrating local and global features across multiple modalities for accurate depression prediction.
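
The abstract names Kronecker-product fusion and reports RMSE and MAE, so a minimal sketch of those three operations may help make them concrete. This is an illustration only, not the IMDD-Net implementation: the function names (`kronecker_fuse`, `rmse`, `mae`) and the toy feature vectors are hypothetical, and the paper's actual fusion may operate on learned, higher-dimensional embeddings with additional layers.

```python
import math


def kronecker_fuse(*vectors):
    """Flattened Kronecker product of 1-D feature vectors (illustrative)."""
    fused = [1.0]
    for vec in vectors:
        # Every existing entry is multiplied by every entry of the next
        # modality, so each output element is a cross-modal product.
        fused = [a * b for a in fused for b in vec]
    return fused


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    n = len(y_true)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n


# Hypothetical toy embeddings standing in for per-modality features.
video = [1.0, 2.0]
audio = [0.5, -1.0, 3.0]
text = [2.0, 1.0]

fused = kronecker_fuse(video, audio, text)
print(len(fused))  # 2 * 3 * 2 = 12 fused dimensions
```

The fused dimension is the product of the input dimensions, which is why Kronecker-style fusion is usually applied to compact embeddings rather than raw features.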