
ORIGINAL RESEARCH article

Front. Psychiatry, 27 November 2025

Sec. Digital Mental Health

Volume 16 - 2025 | https://doi.org/10.3389/fpsyt.2025.1694762

Depression diagnosis from patient interviews using multimodal machine learning

  • AI4Health Division, Carl von Ossietzky Universität Oldenburg, Oldenburg, Germany

Background: Depression is a major public health concern, affecting an estimated five percent of the global population. Early and accurate diagnosis is essential to initiate effective treatment, yet recognition remains challenging in many clinical contexts. Speech, language, and behavioral cues collected during patient interviews may provide objective markers that support clinical assessment.

Methods: We developed a diagnostic approach that integrates features derived from patient interviews, including speech patterns, linguistic characteristics, and structured clinical information. Separate models were trained for each modality and subsequently combined through multimodal fusion to reflect the complexity of real-world psychiatric assessment. Model validity was assessed with established performance metrics, and further evaluated using calibration and decision-analytic approaches to estimate potential clinical utility.

Results: The multimodal model achieved superior diagnostic accuracy compared to single-modality models, with an AUROC of 0.88 and a macro F1-score of 0.75. Importantly, the fused model demonstrated good calibration and offered higher net clinical benefit compared to baseline strategies, highlighting its potential to assist clinicians in identifying patients with depression more reliably.

Conclusion: Multimodal analysis of patient interviews using machine learning may serve as a valuable adjunct to psychiatric evaluation. By combining speech, language, and clinical features, this approach provides a robust framework that could enhance early detection of depressive disorders and support evidence-based decision-making in mental healthcare.

1 Introduction

1.1 Depression as a problem worldwide

Depression represents a significant public health issue, impacting approximately 322 million individuals worldwide and accounting for 7.5% of total years lived with disability World Health Organization (1). Untreated depression is associated with impaired quality of life, increased risk of comorbidities, and elevated mortality Voros et al. (2). Early and accurate diagnosis is essential to initiate effective treatment, yet recognition remains challenging in many clinical contexts due to subtle symptom presentation, variability across populations, and reliance on clinical judgment and commonly used self-report tools, which provide practical reference standards but are known to vary, particularly around diagnostic thresholds Montano (3). Moreover, many diagnostic tools rely on hard binary thresholds without capturing the severity of the condition. Objective tools that can support clinicians in identifying depression have the potential to reduce diagnostic delays and improve treatment outcomes Mao et al. (4). Speech, language, and behavioral cues collected during patient interviews represent promising sources of objective markers that may enhance clinical assessment Smith et al. (5).

1.2 Machine learning in neuropsychiatry

Machine learning (ML) has emerged as a transformative tool in neuropsychiatry, enabling analysis of complex, high-dimensional datasets to detect subtle patterns associated with psychiatric conditions for diagnosis Strodthoff et al. (6) as well as for adverse event prediction in well-defined populations Oloyede et al. (7). ML models have been applied to various data modalities individually, but they particularly benefit from multimodal integration, supporting impactful predictive modeling in clinical settings such as emergency care Alcaraz et al. (8). Advances in deep learning now allow more effective representation of longitudinal and multimodal data, capturing dependencies that are difficult to model with conventional statistical methods Durstewitz et al. (9). In this context, early studies indicate that applying ML to behavioral and clinical data, such as speech and structured interviews, may provide actionable insights to assist clinicians in diagnosing depression more reliably Li et al. (10).

1.3 Speech and text for depression detection

Speech and language are rich sources of behavioral and cognitive information that can reflect an individual’s mental state. Audio features have been shown to capture important physiological and cognitive signals relevant for medical assessment Henna et al. (11). In addition, lexical content and other linguistic characteristics correlate with depressive symptomatology Losada and Gamallo (12). Importantly, these data can be collected routinely during standard patient interviews, providing a non-invasive and cost-effective source of information Gumus et al. (13). When combined with structured clinical and demographic data through multimodal approaches, such features can yield complementary insights and enhance the reliability of psychiatric applications in clinical practice, with depression detection as a prominent task Sui et al. (14).

1.4 Contributions

In this work, we present a multimodal machine learning framework designed to support depression diagnosis during routine patient interviews. Our key contributions are: 1) We integrate speech, text, and structured clinical features to create a comprehensive representation of a patient’s mental state, leveraging data that can be collected non-invasively and without additional clinical burden. 2) We systematically evaluate both single-modality and multimodal models, assessing not only predictive performance but also calibration and potential clinical utility, ensuring the framework is informative for real-world decision-making. 3) We demonstrate that multimodal fusion enhances diagnostic accuracy over individual data sources, illustrating how combining complementary routine information can augment clinical assessment and support evidence-based psychiatric care.

2 Methods

2.1 Dataset

Figure 1 provides a visual overview of the pipeline from input data through preprocessing, modeling, and evaluation. Table 1 contains descriptive statistics of the investigated dataset. We used the Distress Analysis Interview Corpus Wizard of Oz (DAIC-WOZ) dataset Gratch et al. (15); DeVault et al. (16), developed to study verbal and nonverbal indicators of mental illness in structured interviews conducted by a virtual agent. The dataset includes 189 participants (102 males, 87 females), each with raw audio recordings (median length 15.9 min, IQR 6.9–72.7; total 50 hours) and transcribed interviews (median 14,108 characters, IQR 5,595–31,505). Acoustic features were extracted using the collaborative voice analysis repository (COVAREP) Degottex et al. (17) and five formants, which capture vocal tract resonance frequencies Boersma (18). Depression severity was assessed via the PHQ-8 (Patient Health Questionnaire-8) Kroenke et al. (19), with 30% of participants labeled as depressed and 70% as controls. We used train, validation, and test splits of 107, 35, and 47 patient interviews, respectively. Overall, the dataset provides rich multimodal information that can be leveraged for machine learning-based depression detection using routine interview data without additional invasive assessments.


Figure 1. Conceptual overview of the implemented pipeline. The model integrates three modalities: audio, text, and tabular features. Preprocessing involves aligning 30-second segments and engineering tabular features from speech and text. Wav2Vec2, BERT, and XGBoost models each output class probabilities, which are then combined through late fusion to produce the final binary prediction of depression.


Table 1. Descriptive statistics of the study population (n = 189).

2.2 Features

We grouped features into three categories: raw audio (16 kHz unprocessed speech), raw text (transcribed speech), and tabular features derived from audio (COVAREP and formants) and text (lexical metrics), totaling 550 features (543 audio, 7 text). For model training, we randomly cropped 30-second segments per interview, aligning corresponding text and tabular features, which allowed multiple samples per visit and improved model robustness. Audio features were extracted at 10 ms intervals and aggregated over the crop. To evaluate modality contributions, we tested seven configurations: (1) audio only, (2) text only, (3) tabular only, (4) audio + text, (5) audio + tabular, (6) text + tabular, and (7) full multimodal fusion. This setup enables systematic comparison of unimodal versus multimodal approaches. Full feature descriptions are provided in the Supplementary Material.
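The crop-and-align step described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the 160-sample hop (10 ms at 16 kHz) follows the text, while mean aggregation over the crop is an assumption.

```python
import random

def random_aligned_crop(audio, frame_features, sr=16000, crop_s=30, seed=None):
    # Illustrative sketch (not the authors' code): draw one random 30-second
    # window from an interview and aggregate the frame-level features,
    # extracted every 10 ms (160 samples at 16 kHz), over the same window.
    rng = random.Random(seed)
    crop_len = crop_s * sr
    start = rng.randrange(max(1, len(audio) - crop_len))
    audio_crop = audio[start:start + crop_len]
    hop = sr // 100  # 10 ms hop between feature frames
    frames = frame_features[start // hop:(start + crop_len) // hop]
    # Mean-aggregate each feature column over the crop (assumed aggregation)
    aggregated = [sum(col) / len(col) for col in zip(*frames)] if frames else []
    return audio_crop, aggregated
```

Drawing several such crops per interview yields multiple aligned training samples per visit, which is how the described setup improves robustness.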

2.3 Target

The dataset is labeled using the PHQ-8 score, which is based on participants’ responses to eight questions assessing symptoms such as mood and appetite Kroenke et al. (19). For analysis, a binary label was created using the standard cutoff of 10: participants with a score of 10 or higher were classified as exhibiting depressive symptoms, while those scoring below 10 were considered non-depressed. This threshold is widely used in clinical and research settings to identify individuals at risk for depression, enabling straightforward binary classification in subsequent modeling.
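As a minimal illustration of this labeling rule, using the standard PHQ-8 cutoff of 10:

```python
def binarize_phq8(scores, cutoff=10):
    # PHQ-8 totals range from 0 to 24; scores at or above the standard
    # cutoff of 10 are labeled as depressive symptoms (1), otherwise 0.
    return [1 if s >= cutoff else 0 for s in scores]
```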

2.4 Models

For the audio model, speech waveforms were processed at 16 kHz in 30-second chunks with 50% overlap. We used a pretrained Wav2Vec2 model, and each chunk produced a binary prediction (“depressed” or “not depressed”). Final patient-level predictions were computed by averaging the chunk-level outputs. For the text model, transcripts were concatenated into a single sequence per patient, tokenized with a maximum length of 256 tokens, and processed using a BERT model. The output was binary, as for the audio model. Tabular features included both audio-derived (COVAREP and formants) and text-derived features (extracted via SpaCy). These were combined into a single table per patient and modeled using XGBoost with cross-validation for evaluation. For multimodal modeling, we applied late fusion by computing a weighted average of the outputs from each modality, followed by calibration using logistic regression. This approach allowed systematic integration of complementary information from audio, text, and tabular features. Further model hyperparameter configurations are provided in the Supplementary Material.
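The late-fusion and calibration step can be sketched as below. The equal fusion weights and the plain gradient-descent Platt scaling are assumptions for illustration, not the paper's actual configuration.

```python
import math

def late_fusion(p_audio, p_text, p_tab, weights=(1 / 3, 1 / 3, 1 / 3)):
    # Weighted average of per-modality probabilities (hypothetical equal
    # weights; the actual weights are not specified here).
    w_a, w_t, w_b = weights
    return [w_a * a + w_t * t + w_b * b
            for a, t, b in zip(p_audio, p_text, p_tab)]

def platt_calibrate(p_fused, y, lr=0.1, steps=2000):
    # Fit a 1-D logistic regression p' = sigmoid(a*p + b) by gradient
    # descent: a minimal stand-in for the logistic-regression calibration.
    a, b = 1.0, 0.0
    n = len(y)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for p, t in zip(p_fused, y):
            q = 1.0 / (1.0 + math.exp(-(a * p + b)))
            grad_a += (q - t) * p
            grad_b += (q - t)
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return lambda p: 1.0 / (1.0 + math.exp(-(a * p + b)))
```

For example, equal-weight fusion of per-modality probabilities 0.75, 0.91, and 0.84 yields roughly 0.83.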

2.5 Performance evaluation

We assessed model performance primarily using the area under the receiver operating characteristic curve (AUROC), a widely adopted ranking-based metric that is robust to class imbalance. Specifically, we report the macro AUROC, which captures overall discriminative performance without requiring predefined decision thresholds. Recent studies (20) have highlighted AUROC’s advantages over alternatives such as the area under the precision-recall curve (AUPRC), particularly in imbalanced settings. Confidence intervals (95%) were estimated via bootstrapping on the test set. Complementary metrics, including precision, recall, and F1-score, were also reported to provide a nuanced assessment of predictive performance. Precision measures the proportion of true positives among all positive predictions, recall quantifies the ability to detect all actual positive cases, and the F1-score, as the harmonic mean of precision and recall, is particularly informative in class-imbalanced data. These metrics are critical in our context, where false negatives, missed cases of depression, carry important clinical consequences.
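For concreteness, the AUROC and its bootstrap confidence interval can be computed as follows; this is a generic sketch using the rank-based (Mann-Whitney) formulation, not the authors' evaluation code.

```python
import random

def auroc(y_true, y_score):
    # AUROC as the probability that a random positive scores above a
    # random negative (ties count one half).
    pos = [s for s, t in zip(y_score, y_true) if t == 1]
    neg = [s for s, t in zip(y_score, y_true) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    # Percentile bootstrap CI on a held-out test set; resamples lacking
    # one of the two classes are redrawn.
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        ids = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in ids]
        if 0 < sum(yt) < n:
            stats.append(auroc(yt, [y_score[i] for i in ids]))
    stats.sort()
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]
```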

We further evaluated model calibration using calibration curves to assess the agreement between predicted probabilities and observed outcomes. Clinical utility was quantified via decision curve analysis (net benefit), comparing model performance to two baseline strategies: referring all patients versus referring none. As age information was unavailable, we focused on gender as the primary demographic feature. Following previous work on fairness in machine learning Pessach and Shmueli (21), we performed a stratified analysis by gender, centered on the fundamental frequency (F0) as a gender-distinctive acoustic feature, and report Equal Opportunity and Equalized Odds metrics. These are based on the true positive rate (TPr, the proportion of actual positive cases correctly identified) and the false positive rate (FPr, the proportion of actual negative cases incorrectly classified as positive) for each gender, and provide complementary insights into how fairly the model performs across genders.
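The fairness metrics reduce to per-gender confusion counts; the following is an illustrative sketch rather than the study's code.

```python
def rates(y_true, y_pred):
    # True positive rate and false positive rate from binary labels.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)

def fairness_gaps(y_true_f, y_pred_f, y_true_m, y_pred_m):
    # Equal Opportunity: absolute TPr gap between groups.
    # Equalized Odds: worst of the TPr and FPr gaps.
    tpr_f, fpr_f = rates(y_true_f, y_pred_f)
    tpr_m, fpr_m = rates(y_true_m, y_pred_m)
    eo_gap = abs(tpr_f - tpr_m)
    eodds_gap = max(abs(tpr_f - tpr_m), abs(fpr_f - fpr_m))
    return eo_gap, eodds_gap
```

A gap of 0 on both measures would indicate equal treatment of the two groups at the chosen operating point.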

3 Results

3.1 Discriminative performance

Table 2 summarizes the discriminative results. Among unimodal models, text features performed best (F1 = 0.59, AUROC = 0.60), followed closely by tabular features (F1 = 0.47, AUROC = 0.57). The audio-only model showed the weakest performance (F1 = 0.44, AUROC = 0.42), mainly due to very low precision for the depressed class. Multimodal approaches consistently outperformed unimodal ones. The best results were obtained by combining all three modalities, reaching an AUROC of 0.88 and a macro F1 score of 0.75. Pairwise combinations also improved performance, with audio+tabular slightly stronger (AUROC = 0.84) than audio+text or text+tabular. Overall, multimodal fusion clearly enhanced both discrimination and balance across classes, mitigating the weaknesses of single modalities.


Table 2. Discriminative performance of models across different modalities.

Multimodal integration consistently improved performance compared to unimodal models, with the best results achieved by combining all three modalities, at an AUROC of 0.88 (Figure 2). This model showed strong detection of non-depressed individuals (F1 = 0.87) and moderate but improved detection of depressed cases (F1 = 0.64). Overall, multimodal fusion mitigates the weaknesses of unimodal inputs, though performance remains better for the non-depressed class. While this imbalance in per-class performance is a limitation, reliable identification of non-depressed individuals can still be clinically useful, as it helps reduce unnecessary referrals and ensures resources are focused on higher-risk cases.


Figure 2. Receiver operating characteristic (ROC) analysis. The red solid line shows the mean AUROC with 95% confidence intervals estimated via empirical bootstrap resampling. The black dashed line represents the performance of a random classifier (AUROC = 0.5) as a reference.

Table 3 shows the model’s performance in classification and regression tasks. For classification, AUROC values were 0.79, 0.88, and 0.80 at PHQ-8 thresholds of ≥5, ≥10, and ≥15, respectively, all exceeding the random baseline (0.50), with the standard clinical threshold (≥10) achieving the best performance. Regression analysis yielded an MAE of 5.381, lower than the training set’s standard deviation (6.403) and mean (5.468), indicating reliable estimation of continuous PHQ-8 scores.


Table 3. Model performance in classification (AUROC, higher is better) across PHQ-8 thresholds and regression (MAE, lower is better) compared to baseline values.

3.2 Stratified analysis

Figure 3 shows the distribution of F0 by gender, with male voices concentrated between 100–180 Hz and female voices between 180–240 Hz, highlighting F0’s discriminative power as a gender-related acoustic feature. Evaluating model performance separately by gender reveals notable differences: females show perfect precision (1.0) but low recall (0.14) and an AUROC of 0.57, whereas males achieve high recall (1.0), precision of 0.70, and an AUROC of 0.91. Although the model uses all features, this performance gap aligns with F0 differences: males’ lower baseline and wider spread provide stronger cues, while the narrower, higher-range female distribution contributes less. Following Pessach and Shmueli (21), we computed the true positive rate (TPr) and true negative rate (TNr) to assess fairness: females have a TPr of 0.14 and a TNr of 1.0, while males have a TPr of 1.0 and a TNr of 0.81. This indicates that the model is more likely to correctly identify positive male cases, while negative female cases are more accurately classified, reflecting a subgroup performance imbalance rather than equal treatment.


Figure 3. Distribution of the fundamental frequency (F0, i.e., pitch) across genders. Male voices (red) show a lower range (100–180 Hz), while female voices (blue) are generally higher (180–240 Hz).

3.3 Calibration

Figure 4 shows the calibration curve of our model, which is crucial for medical applications. Overall, the model is well calibrated. In the lower probability range, some points lie above the diagonal, meaning the model is underconfident and underestimates risk at these points. With a calibration error of 0.04, the average difference between the predicted probabilities and the observed frequencies is only 4%, indicating fundamentally sound calibration. The model can be trusted, but there is still room for improvement before it can be deployed in practice.
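The reported calibration error corresponds to the expected calibration error (ECE); a minimal implementation, assuming a 10-bin layout that the text does not specify, looks like:

```python
def expected_calibration_error(y_true, y_prob, n_bins=10):
    # ECE: occupancy-weighted average of |observed frequency - mean
    # predicted probability| over equal-width probability bins.
    bins = [[] for _ in range(n_bins)]
    for t, p in zip(y_true, y_prob):
        bins[min(int(p * n_bins), n_bins - 1)].append((t, p))
    n = len(y_true)
    ece = 0.0
    for b in bins:
        if b:
            obs = sum(t for t, _ in b) / len(b)
            pred = sum(p for _, p in b) / len(b)
            ece += len(b) / n * abs(obs - pred)
    return ece
```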


Figure 4. Calibration curve. The red solid line depicts the observed calibration performance of the model, while the black solid line represents a perfectly calibrated classifier (ideal reference). Overall, the model shows good calibration, with predictions in the higher probability range aligning almost perfectly with the true outcomes. This suggests particularly reliable performance for the positive (depressed) class in the upper probability range.

3.4 Decision analysis

Figure 5 presents the decision curve (net benefit analysis), which complements traditional performance measures such as the AUROC by incorporating the clinical consequences of different decision strategies. The black dashed line shows the net benefit of our model, while the blue and red dashed lines correspond to the strategies of referring all or referring none, respectively. Across a wide range of threshold probabilities, the model provides greater net benefit than either reference strategy, indicating its potential clinical usefulness. In particular, the net benefit of the “refer all” strategy (blue line) declines sharply, underscoring the potential harm of unnecessary referrals. The model achieves its highest net benefit at low threshold probabilities, suggesting that it may be especially valuable for identifying patients at risk of depression early, when a lower decision threshold for intervention is clinically appropriate.
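Net benefit at a threshold probability pt is defined as NB = TP/N - FP/N * pt/(1 - pt). A generic sketch of the model curve and the "refer all" baseline follows (the "refer none" baseline is zero by definition); this illustrates the standard formula, not the study's plotting code.

```python
def net_benefit(y_true, y_prob, threshold):
    # Net benefit of referring patients whose predicted probability
    # reaches the threshold: NB = TP/N - FP/N * pt / (1 - pt).
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, y_prob) if t == 1 and p >= threshold)
    fp = sum(1 for t, p in zip(y_true, y_prob) if t == 0 and p >= threshold)
    return tp / n - fp / n * threshold / (1 - threshold)

def refer_all_benefit(prevalence, threshold):
    # "Refer all" baseline: NB = prev - (1 - prev) * pt / (1 - pt).
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)
```

Sweeping the threshold over a grid and plotting both quantities reproduces the shape of a decision curve like the one in Figure 5.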


Figure 5. Decision curve analysis. The black dashed line shows the net benefit of the proposed model across a range of threshold probabilities. The blue dashed line represents the strategy of referring all individuals, while the red dashed line represents the strategy of referring none. The model achieves higher net benefit than both reference strategies across the entire range, indicating superior clinical usefulness.

4 Discussion

4.1 Methodological findings

Our results highlight the importance of multimodality in healthcare applications, particularly for psychological diagnosis. As shown in Table 2, combining modalities substantially outperforms unimodal models. The full multimodal setup improved the macro F1 score by 0.31 and the AUROC by 0.46 compared to the weakest unimodal baseline. Comparisons with prior work on the same dataset further emphasize the role of methodological rigor. Burdisso et al. (22) reported an F1-score of 0.90 but relied on validation set results rather than a held-out test set, inflating performance estimates. Similarly, Dai et al. (23) achieved an F1-score of 0.96 on the development set using a multimodal pipeline with audio, video, and semantic features. However, their score dropped to 0.67 on the test set, suggesting overfitting and limited generalization. In contrast, our models show more stable performance despite relying on fewer modalities. Among unimodal models, raw audio alone was less predictive than text or tabular features, likely because acoustic markers of depression (e.g., pitch, prosody) are subtle and variable across speakers. For instance, research has shown that depressed individuals often exhibit lower pitch variability and slower speech rates, cues that are challenging to capture consistently in raw audio alone Di et al. (24). However, when combined with other modalities, audio consistently enhanced performance and contributed significant complementary information. Text and tabular features, being more structured and explicit, provided stronger standalone discriminative power, but it was the integration of all three modalities that yielded the most robust and generalizable outcomes. This aligns with recent work in multimodal representation learning Yang et al. (25, 26), which highlights the importance of structured multimodal fusion for complex psychological states.
Similarly, advances in cross-modal feature learning Cui and Yang (27); Yang et al. (28) emphasize that complementary information across modalities is most beneficial when representations are aligned and robustly integrated.

4.2 Clinical findings

Our results demonstrate that significant diagnostic signals of depression can be extracted from routine data such as audio recordings, clinical text, and structured patient information. This finding has several important implications for clinical practice. First, multimodal AI models can support faster and lower-cost screening, reducing the reliance on extensive manual assessments Khanna et al. (29). By leveraging routinely collected data, such systems could provide early warnings during regular consultations or remote interactions, enabling earlier diagnosis and intervention. Second, such tools can help mitigate diagnostic bias. Unlike a single clinician’s evaluation, the models learn from patterns across a wide population of patients and evaluators, offering a more standardized and less subjective perspective. This could be especially valuable in settings where access to specialized mental health professionals is limited Saxena et al. (30). Finally, integration into routine practice can be envisioned in specific use cases: for example, as a decision support tool in primary care to flag at-risk patients for referral, as an adjunct in telemedicine platforms to enhance remote consultations, or as part of longitudinal monitoring systems that track patients’ risk levels over time Rony et al. (31). In all cases, the goal is not to replace clinical judgment but to provide an additional, reliable source of evidence that enhances the timeliness and equity of mental health care.

4.3 Limitations

This study has three main limitations. First, the sample size remains relatively modest, which restricts the robustness of the findings and may limit the ability to fully capture the heterogeneity of depressive symptoms across large populations despite our augmentation sampling approach. Larger datasets are needed to confirm the stability of the reported performance gains Collins et al. (32). Second, the models have not yet undergone external validation on independent cohorts. Without such testing, the generalizability of the results to different clinical settings, languages, or patient demographics remains uncertain. Future work should therefore emphasize replication across larger and more varied populations to ensure clinical applicability Riley et al. (33). Third, our analysis indicates that the model performs less effectively for female patients, particularly in identifying positive cases, highlighting a potential gender bias. Addressing this limitation will require incorporating additional features or strategies to improve fairness and ensure equitable model performance across genders.

4.4 Future work

Several avenues can be pursued to extend this work. First, incorporating additional modalities, such as physiological signals, could further improve predictive performance and provide complementary information. Second, enhancing explainability is a key goal. This includes analyzing patients classified as non-depressed or those with co-occurring conditions Ott et al. (34), as well as aiming for causal attributions rather than purely associational insights. Recent work on causal explanations in time series Alcaraz and Strodthoff (35) could be adapted to multimodal speech-based frameworks, although a new model design would be required. Third, moving beyond binary classification, future models could predict graded levels of depression, enabling risk stratification and more nuanced clinical decision-making for situations like treatment administration Duval et al. (36). Fourth, integrating additional demographic and clinical variables would allow for patient-specific predictions, supporting personalized mental health care. Finally, exploring alternative PHQ-8 thresholds for defining binary outcomes could inform how varying diagnostic criteria influence model performance and practical applicability. In our current approach, each modality was modeled independently, and alignment between text and speech was approximated at the level of interview crops. This inevitably leaves portions of audio and text that may not correspond directly, potentially limiting the effectiveness of multimodal fusion. Future work could leverage semantic alignment methods Yang et al. (37, 38) to explicitly map audio, text, and structured features into a shared latent space. Such techniques would ensure that information from different modalities is synchronized at a finer-grained semantic level, improving both robustness and interpretability of the predictions.

5 Conclusion

We demonstrate that multimodal analysis of routine patient interviews, combining audio, text, and structured clinical data, can effectively support depression detection. Multimodal fusion outperforms single modalities, enabling faster, low-cost, and objective screening while reducing reliance on a single clinician’s judgment. These results highlight the potential of ML-driven tools to enhance early diagnosis, support clinical decision-making, and pave the way for personalized mental health care.

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author.

Ethics statement

The DAIC-WOZ data were collected by the Institute for Creative Technologies (ICT) at the University of Southern California (USC). The dataset received approval from the Institutional Review Board (IRB) overseeing ICT at USC. The studies were conducted in accordance with the local legislation and institutional requirements. Written informed consent for participation was not required from the participants or the participants’ legal guardians/next of kin in accordance with the national legislation and institutional requirements.

Author contributions

JW: Conceptualization, Data curation, Investigation, Methodology, Validation, Visualization, Writing – original draft. MW: Conceptualization, Data curation, Investigation, Methodology, Validation, Visualization, Writing – review & editing. JA: Conceptualization, Methodology, Project administration, Resources, Supervision, Writing – original draft, Writing – review & editing.

Funding

The author(s) declare that no financial support was received for the research and/or publication of this article.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declare that no Generative AI was used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpsyt.2025.1694762/full#supplementary-material.

References

1. World Health Organization. Depression and other common mental disorders: global health estimates. Geneva: World Health Organization (2017).


2. Voros V, Fekete S, Tenyi T, Rihmer Z, Szili I, and Osvath P. Untreated depressive symptoms significantly worsen quality of life in old age and may lead to the misdiagnosis of dementia: a cross-sectional study. Ann Gen Psychiatry. (2020) 19:52. doi: 10.1186/s12991-020-00302-6


3. Montano CB. Recognition and treatment of depression in a primary care setting. J Clin Psychiatry. (1994) 55:18–34.


4. Mao K, Wu Y, and Chen J. A systematic review on automated clinical depression diagnosis. NPJ Ment Health Res. (2023) 2:20. doi: 10.1038/s44184-023-00040-z


5. Smith KM, Renshaw PF, and Bilello J. The diagnosis of depression: current and emerging methods. Compr Psychiatry. (2013) 54:1–6. doi: 10.1016/j.comppsych.2012.06.006


6. Strodthoff N, Lopez Alcaraz JM, and Haverkamp W. Prospects for artificial intelligence-enhanced electrocardiogram as a unified screening tool for cardiac and non-cardiac conditions: an explorative study in emergency care. Eur Heart Journal-Digital Health. (2024) 5:454–60. doi: 10.1093/ehjdh/ztae039


7. Oloyede E, Bachmann CJ, Dzahini O, Alcaraz JML, Singh SD, Vallianatu K, et al. Identifying clinically relevant agranulocytosis in people registered on the uk clozapine central non-rechallenge database: retrospective cohort study. Br J Psychiatry. (2024) 225:484–91. doi: 10.1192/bjp.2024.104

8. Alcaraz JML, Bouma H, and Strodthoff N. Enhancing clinical decision support with physiological waveforms—a multimodal benchmark in emergency care. Comput Biol Med. (2025) 192:110196. doi: 10.1016/j.compbiomed.2025.110196

9. Durstewitz D, Koppe G, and Meyer-Lindenberg A. Deep neural networks in psychiatry. Mol Psychiatry. (2019) 24:1583–98. doi: 10.1038/s41380-019-0365-9

10. Li Y, Kumbale S, Chen Y, Surana T, Chng ES, and Guan C. Automated depression detection from text and audio: A systematic review. IEEE J Biomed Health Inf. (2025) 29. doi: 10.1109/JBHI.2025.3570900

11. Henna S, Alcaraz JML, Rathnayake U, and Amjath M. An interpretable deep learning framework for medical diagnosis using spectrogram analysis. Healthcare Analytics. (2025) 8:100408. doi: 10.1016/j.health.2025.100408

12. Losada DE and Gamallo P. Evaluating and improving lexical resources for detecting signs of depression in text. Lang Resour Eval. (2020) 54:1–24. doi: 10.1007/s10579-018-9423-1

13. Gumus M, DeSouza DD, Xu M, Fidalgo C, Simpson W, and Robin J. Evaluating the utility of daily speech assessments for monitoring depression symptoms. Digital Health. (2023) 9:20552076231180523. doi: 10.1177/20552076231180523

14. Sui J, Zhi D, and Calhoun VD. Data-driven multimodal fusion: approaches and applications in psychiatric research. Psychoradiology. (2023) 3:kkad026. doi: 10.1093/psyrad/kkad026

15. Gratch J, Artstein R, Lucas G, Stratou G, Scherer S, Nazarian A, et al. The distress analysis interview corpus of human and computer interviews. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC) (2014). p. 3123–8.

16. DeVault D, Artstein R, Benn G, Dey T, Fast E, Gainer A, et al. SimSensei kiosk: a virtual human interviewer for healthcare decision support. In: Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems. International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC (2014). p. 1061–8.

17. Degottex G, Kane J, Drugman T, Raitio T, and Scherer S. COVAREP—a collaborative voice analysis repository for speech technologies. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE (2014). p. 960–4.

18. Boersma P. Praat, a system for doing phonetics by computer. Glot. Int. (2001) 5:341–5.

19. Kroenke K, Strine TW, Spitzer RL, Williams JB, Berry JT, and Mokdad AH. The PHQ-8 as a measure of current depression in the general population. J Affect Disord. (2009) 114:163–73. doi: 10.1016/j.jad.2008.06.026

20. McDermott MB, Zhang H, Hansen LH, Angelotti G, and Gallifant J. A closer look at AUROC and AUPRC under class imbalance. In: The Thirty-Eighth Annual Conference on Neural Information Processing Systems (2024). Available online at: https://openreview.net/forum?id=S3HvA808gk.

21. Pessach D and Shmueli E. A review on fairness in machine learning. ACM Computing Surveys (CSUR). (2022) 55:1–44. doi: 10.1145/3494672

22. Burdisso S, Reyes-Ramírez E, Villatoro-tello E, Sánchez-Vega F, Lopez Monroy A, and Motlicek P. DAIC-WOZ: On the validity of using the therapist’s prompts in automatic depression detection from clinical interviews. In: Naumann T, Ben Abacha A, Bethard S, Roberts K, and Bitterman D, editors. Proceedings of the 6th clinical natural language processing workshop. Association for Computational Linguistics, Mexico City, Mexico (2024). p. 82–90. doi: 10.18653/v1/2024.clinicalnlp-1.8

23. Dai Z, Zhou H, Ba Q, Zhou Y, Wang L, and Li G. Improving depression prediction using a novel feature selection algorithm coupled with context-aware analysis. J Affect Disord. (2021) 295:1040–8. doi: 10.1016/j.jad.2021.09.001

24. Di Y, Rahmani E, Mefford J, Wang J, Ravi V, Gorla A, et al. Unraveling the associations between voice pitch and major depressive disorder: a multisite genetic study. Mol Psychiatry. (2024) 30:1–10. doi: 10.1101/2024.10.12.24315366

25. Yang S, Liu S, Nie G, Wang L, Wang T, You J, et al. Fine-grained multimodal fusion for depression assisted recognition based on hierarchical knowledge enhanced prompt learning. Expert Syst Appl. (2025) 291:128532. doi: 10.1016/j.eswa.2025.128532

26. Yang S, Cui L, Wang L, Wang T, and You J. Enhancing multimodal depression diagnosis through representation learning and knowledge transfer. Heliyon. (2024) 10. doi: 10.1016/j.heliyon.2024.e25959

27. Cui L and Yang S. Enhancing multimodal sentiment recognition based on cross-modal contrastive learning. In: 2024 IEEE International Conference on Multimedia and Expo (ICME). IEEE (2024). p. 1–6.

28. Yang S, Cui L, Wang L, and Wang T. Cross-modal contrastive learning for multimodal sentiment recognition. Appl Intell. (2024) 54:4260–76. doi: 10.1007/s10489-024-05355-8

29. Khanna NN, Maindarkar MA, Viswanathan V, Fernandes JFE, Paul S, Bhagawati M, et al. Economics of artificial intelligence in healthcare: Diagnosis vs. treatment. Healthcare. (2022) 10:2493. doi: 10.3390/healthcare10122493

30. Saxena S, Thornicroft G, Knapp M, and Whiteford H. Resources for mental health: scarcity, inequity, and inefficiency. Lancet. (2007) 370:878–89. doi: 10.1016/S0140-6736(07)61239-2

31. Rony MKK, Das DC, Khatun MT, Ferdousi S, Akter MR, Khatun MA, et al. Artificial intelligence in psychiatry: A systematic review and meta-analysis of diagnostic and therapeutic efficacy. DIGITAL Health. (2025) 11:20552076251330528. doi: 10.1177/20552076251330528

32. Collins GS, Ogundimu EO, and Altman DG. Sample size considerations for the external validation of a multivariable prognostic model: a resampling study. Stat Med. (2016) 35:214–26. doi: 10.1002/sim.6787

33. Riley RD, Snell KI, Archer L, Ensor J, Debray TP, Van Calster B, et al. Evaluation of clinical prediction models (part 3): calculating the sample size required for an external validation study. BMJ. (2024) 384. doi: 10.1136/bmj-2023-074821

34. Ott G, Schaubelt Y, Lopez Alcaraz JM, Haverkamp W, and Strodthoff N. Using explainable ai to investigate electrocardiogram changes during healthy aging—from expert features to raw signals. PloS One. (2024) 19:e0302024. doi: 10.1371/journal.pone.0302024

35. Alcaraz JML and Strodthoff N. CausalConceptTS: causal attributions for time series classification using high fidelity diffusion models. arXiv preprint. (2024).

36. Duval F, Lebowitz BD, and Macher J-P. Treatments in depression. Dialogues Clin Neurosci. (2006) 8:191–206. doi: 10.31887/DCNS.2006.8.2/fduval

37. Yang S, Cui L, and Wang T. Semantic interaction fusion framework for multimodal sentiment recognition. In: 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE (2023). p. 2132–7.

38. Yang S, Xing L, Li Y, and Chang Z. Implicit sentiment analysis based on graph attention neural network. Eng Rep. (2022) 4:e12452. doi: 10.1002/eng2.12452

Keywords: depression diagnosis, digital biomarkers, multimodal analysis, machine learning, deep learning, clinical decision support

Citation: Weber J, Weber M and Lopez Alcaraz JM (2025) Depression diagnosis from patient interviews using multimodal machine learning. Front. Psychiatry 16:1694762. doi: 10.3389/fpsyt.2025.1694762

Received: 28 August 2025; Revised: 08 October 2025; Accepted: 07 November 2025;
Published: 27 November 2025.

Edited by:

Vanessa Panaite, James A Haley Veteran’s Hospital, United States

Reviewed by:

Dezon Finch, Hines VA Medical Center, United States
Shanliang Yang, Shandong University of Technology, China

Copyright © 2025 Weber, Weber and Lopez Alcaraz. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Juan Miguel Lopez Alcaraz, juan.lopez.alcaraz@uol.de
