SYSTEMATIC REVIEW article

Front. Psychiatry, 13 August 2025

Sec. Digital Mental Health

Volume 16 - 2025 | https://doi.org/10.3389/fpsyt.2025.1588963

Machine learning approaches in the therapeutic outcome prediction in major depressive disorder: a systematic review

Veronica Atemnkeng Ntam, Tatjana Huebner, Michael Steffens, Catharina Scholl*
  • Research Division, Federal Institute for Drugs and Medical Devices, Bonn, North Rhine-Westphalia, Germany

Background: Various factors impact treatment outcomes in major depressive disorder (MDD), complicating prediction of treatment success. Therefore, applying machine learning (ML) algorithms for therapeutic outcome prediction on the basis of individual patient data has become a promising approach to tailor the treatment strategy in MDD. However, the applicability of such decision support systems in clinical settings has not been sufficiently demonstrated yet. The objective of the evaluation was to assess applicability of currently published ML-approaches for clinical settings in the EU on the basis of quality, ethical, social, and legal criteria.

Methods: We performed a bibliographic search on PubMed and Google Scholar for studies from January 2016 to December 2024 on ML-applications predicting treatment outcomes in MDD. The ML-model applicability was evaluated via information on validation and performance criteria and the compliance with relevant ethical, social, and legal criteria in the EU.

Results: In the 29 publications reviewed, Random Forest (RF) and Support Vector Machine (SVM) were identified as the most frequently used ML-methods. Models integrating multiple categories of patient data demonstrated higher predictive accuracy than single-category models. However, external validation of the applied ML-approaches was limited and, due to the early stage of development, compliance with social, ethical and legal standards remains challenging.

Conclusion: The evaluated ML-approaches for treatment outcome prediction in MDD lack demonstrated generalizability and face challenges with regulatory compliance regarding relevant social, ethical and legal aspects. They therefore do not yet show sufficient applicability and utility for use in clinical settings in the EU.

Introduction

According to the World Health Organization (WHO), Major Depressive Disorder (MDD) is a highly widespread condition on a global scale, affecting an estimated 280 million individuals (1, 2). The rates of treatment success vary between evidence-based interventions and are influenced by factors such as the patient population, the stage of the disease and, for example in treatment with several selective serotonin reuptake inhibitors (SSRI), pharmacogenetic variants such as those of CYP2D6 and CYP2C19 (3–6). However, in general a majority of patients do not achieve remission from their depression (7–9). Currently, there is no acknowledged way to anticipate whether a medication will result in a positive response. Thus, present methods for managing MDD primarily depend on trial-and-error sequential treatment tactics (10–12).

Identifying treatment needs in individuals with MDD on the basis of patient indicators early could improve the efficiency of interventions by reducing time spent with unsuitable treatments (13). Personalized approaches including artificial intelligence based models thereby can support tailoring treatments based on each individual’s unique data sets (14). Thus, over the past years, there has been a notable increase in the utilization of machine learning in healthcare (15), including its use for forecasting the results of depression treatments (16, 17). However, a comprehensive framework for quality assessment across an ML-based prediction model to prevent errors in decision making is not yet in place (18).

A clear and transparent development and validation process helps prevent issues like bias and over-fitting, while ensuring the models remain understandable and trustworthy for healthcare professionals (19, 20). However, although the potential of ML in personalized medicine is vast, it also introduces significant challenges that necessitate robust regulation and guidelines (21, 22). In the EU, for the clinical applicability of ML-based decision support systems as diagnostic devices, quality and safety in terms of an adequate technical and clinical performance for the intended use have to be demonstrated according to the Medical Device Regulation (23). Furthermore, for an application in practice, social, ethical and other legal aspects have to be considered and the clinical utility with respect to MDD outcome prediction needs to be demonstrated (24–26).

To evaluate the clinical applicability and utility of AI approaches in MDD treatment, we conducted a thorough literature analysis of how machine learning is applied to guide therapeutic interventions in MDD. In this systematic review, we highlight the most commonly employed ML-methods and preferred parameters in this context. Additionally, we delineate the performed method validation, provided performance metrics and assess the potential of the ML-models to guide treatment selection on the basis of different categories of patient data. Furthermore, we assess whether compliance with international guidelines with regard to social and ethical considerations and current applicable EU legislation in this context is possible in the use of currently published ML-approaches.

Methods

Article search strategy

This systematic review search was conducted using the guidelines and guidance for Preferred Reporting Items for Systematic Reviews and Meta-Analysis: The PRISMA Statements (27). A bibliographical search was carried out using PubMed and Google Scholar. The search terms (antidepressant) AND (prediction response) AND (machine learning models OR machine learning methods) were applied for the publication period of 01st January 2016 to 31st December 2023. The aim was to identify original publications on detailed machine learning approaches for therapeutic outcome prediction in MDD. Thereby interventions such as pharmacological treatment and nonpharmacological treatment were considered. Outcomes considered were remission, response, reduction of symptoms and/or treatment resistant depression. The study designs included clinical trials and randomized controlled trials.

Therefore, further filters applied to PubMed were “Clinical Trial” and “Randomized Controlled Trial”, excluding all reviews, systematic reviews and meta-analyses. Given the different nature of filters in Google Scholar, only articles without the keywords “Review”, “Systematic Review” and “Meta Analysis” were taken into account. Most results from Google Scholar were eliminated as these were conference articles, website articles and posters. Further suitable articles were added through citation screening. To update the literature search, a further search with the same search terms and filters was performed in June 2025 to include the year 2024.

Study selection

The title and abstract from each identified publication were screened, making sure that they addressed MDD as diagnosis, and a machine learning method was used for generating a model on the basis of various patient data or data categories for predicting treatment outcome. All publications, which did not meet the inclusion criteria, were excluded from further evaluations. Studies reporting internal and/or external validation were included. No age or age of onset limit was set in the incidence of the depressive symptoms. Study design was either prospective or retrospective, open-label or controlled without restrictions to allocation blinding or randomization.

Data extraction

Table 1 details the profile of information extracted from publications taken into consideration for this review. The identified studies were screened for the ML-method applied, the applied outcome measures and predicted outcome, the medical intervention in MDD, the category of data, validation and technical considerations (class imbalance, missing data approaches, data preparation, feature documentation, source availability/open science applied). Furthermore, performance in treatment outcome prediction was extracted. An overview and definitions of ML-methods and performance metrics are provided in the Supplementary File 1 Tables 1, 2.

Table 1. Data extraction summary.

Quality and predictive performance of ML-models in treatment outcome analysis

After screening the selected publications for the type of data used, clinical and sociodemographic data were grouped into a single category for analyses similar to several of the reviewed studies. Furthermore, patient data in the identified publications were categorized in electroencephalography (EEG, including resting state EEG, EEG coherence), magnetic resonance imaging (MRI: structural, resting state, and task-based functional MRI) or molecular biomarker data (genetic, epigenetic, gene expression, metabolites) for this systematic review. Studies applying ML-models on the basis of either only single categories or additionally combined data categories were evaluated separately. Due to the importance of the technical and clinical performance for the applicability of ML-approaches in clinical settings, we focused on the described type of validation and prediction performance in treatment outcome reported in the screened literature.

Compliance of ML-models with ethical, social and legal aspects to consider for AI algorithms in health care

Relevant ethical and social aspects related to the requirements and necessary restrictions for the safe and beneficial use of AI in health care were identified on the basis of international guidelines such as the general ISO 26000 guidance on social responsibility (28) and the more specific guidance of the World Health Organization (WHO) on “Ethics and governance of artificial intelligence for health” (29). The aim was to evaluate to which extent ethical and social aspects relevant for the use of AI in health care already have been incorporated into EU legislation on the basis of current EU guidelines and which AI applications of our literature search are in accordance with the current regularized criteria.

The ethical key principles in terms of AI use in health specified by the WHO were in accordance with the social issues listed in the ISO 26000 guidance on social responsibility, which also includes the issue of sustainable resource use. To evaluate whether the corresponding principles were met by EU guidelines, criteria were extracted from the specified WHO definitions for each principle (Supplementary File 2 Table 1). EU guidelines relevant for the utilization of AI in the health sector, such as the General Data Protection Regulation (GDPR) (30), the Medical Device Regulation (MDR) (23), and the current version of the European Union Artificial Intelligence (EU AI) Act (31), were screened with regard to these overlapping key principles and criteria identified in the ISO and WHO guidance. A comprehensive list of principles and included criteria considered in these regulations is provided in the Supplementary File 2 Table 2.

The methods sections of publications identified in the present literature search were screened for applied methodology and assessed in terms of a possible compliance with the above-mentioned criteria relevant for social, ethical and legal considerations (Supplementary File 2 Table 2) for AI use in a clinical setting. Published evaluations that did not include external validation were thereby excluded due to insufficient quality.

Results

To get a broader overview of the ML-approaches that are currently used to predict therapy outcome of patients with MDD, a literature analysis was performed including different types of therapeutic interventions and outcome measures. This literature search yielded a total of 21,484 articles (Figure 1, Supplementary File 2 Table 3). After consideration of Google Scholar and PubMed filters and removal of duplicates, 34 articles from PubMed and 141 from Google Scholar were considered relevant and selected for further analysis. Screening of titles and abstracts excluded 151 articles due to not meeting the determined criteria or being posters, conference articles, presentations, or web articles, leaving 25 viable articles from PubMed and Google Scholar. The study by Yuelu Liu et al. (2019) (32) appeared to meet the inclusion criteria. However, it was excluded from the review after full text screening because, while it investigated the neural effects of acute dopaminergic enhancement on reward-related brain abnormalities in MDD using machine learning, it did not directly analyze or predict individual clinical responses to antidepressants. To update the evaluation, an additional literature search in June 2025 covering literature published in 2024 yielded a further 7,114 publications. Of these, only 3 articles from PubMed and 51 from Google Scholar were screened further, and only one PubMed article provided additional ML-based outcome predictions in MDD for further evaluation (Supplementary File 2 Table 3). An additional 5 articles were added through citation screening, resulting in 29 studies used for the systematic review. Tables 2, 3 synthesize a diverse array of studies that leverage machine learning techniques to predict outcomes in depression treatment. Treatment outcome was predicted with respect to TRD (10.3%, N=3) and/or response/change in depressive symptoms (86.2%, N=25) and/or remission/non-remission (20.7%, N=6). However, different benchmarks were often used as outcome measures, including different clinical scales and methods. The most frequently used psychometric scale for outcome prediction in the reviewed literature was the Hamilton Depression Rating Scale (HAMD) (48.3%, N=14). Overall, 19 different ML-methods were used for outcome prediction in the screened publications. The most frequently reported methods were Random Forests (RF) in 55.2% (N=16) of studies and Support Vector Machines (SVM, including SVM with radial basis function kernel) in 34.5% (N=10) of studies. However, in many evaluations several ML-methods were applied and tested.
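The percentages reported above are proportions of the 29 included studies. Because one study can report several outcomes or test several methods, the shares within a group can add up to more than 100%. The arithmetic can be sketched as follows, using only counts stated in the text; the function name `share` is illustrative and not part of the review.

```python
# Number of studies included in the review (from the text).
N_STUDIES = 29

def share(count, total=N_STUDIES):
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)

# Counts as reported in the text; a study may appear in more than one group.
outcome_counts = {"TRD": 3, "response/symptom change": 25, "remission": 6}
method_counts = {"SVM": 10, "RF": 16}

outcome_shares = {name: share(n) for name, n in outcome_counts.items()}
method_shares = {name: share(n) for name, n in method_counts.items()}

# The outcome counts sum to 34, more than 29, which reflects the overlap.
overlap = sum(outcome_counts.values()) - N_STUDIES
```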

Figure 1
Flowchart of a study selection process. Identification phase lists records from PubMed and Google Scholar, with forward/backward citation searches. Screening phase filters records, excluding duplicates and non-peer-reviewed articles. Eligibility phase includes full-text assessment, removing one article for no outcome prediction in MDD. Final analysis includes twenty-nine records.

Figure 1. Literature search and selection workflow of eligible records following the guidelines and guidance for Preferred Reporting Items for Systematic Reviews and Meta-Analysis (PRISMA).

Table 2. Data extracted from identified literature on ML-approaches reported for outcome prediction in pharmacological treatment of MDD.

Table 3. Data extracted from identified literature on ML-approaches reported for outcome prediction in non-pharmacological treatment of MDD.

Medical interventions primarily involve pharmacotherapy with antidepressants in 72.4% (N=21) of the screened literature. Thereby, selective serotonin reuptake inhibitors such as citalopram, escitalopram, fluoxetine, and/or sertraline are represented in all of these publications in mono- or combination pharmacotherapy (65.5%, N=19) or in combination with psychotherapy (6.9%, N=2). Further pharmacotherapy-based interventions identified were norepinephrine–dopamine reuptake inhibitors such as bupropion in mono- and combination pharmacotherapy (10.3%, N=3), serotonin and norepinephrine reuptake inhibitors (represented by venlafaxine) in mono- and combination pharmacotherapy (10.3%, N=3), tricyclic antidepressants (nortriptyline, N=1) and tetracyclic antidepressants (mirtazapine in mono- and combination pharmacotherapy; 13.8%, N=4). Furthermore, in one publication the psychedelic compound psilocybin, which is not an approved drug in the EU, was used for pharmacological therapy in MDD (Table 2). Non-pharmacological interventions only (Table 3) were mainly represented by different psychotherapy approaches identified in 17.2% (N=5) of the screened publications. Transcranial magnetic stimulation was reported as non-pharmacological intervention in only one publication.

Internal validation techniques enhance the reliability of the model performance evaluation (62). The validation techniques applied in the screened studies were predominantly described as rigorous and relied on cross-validation methods, most often k-fold cross-validation (62.1%, N=18), followed by nested cross-validation (20.7%, N=6) and leave-one-out cross-validation (17.2%, N=5). However, external validation, which is needed to ensure the generalizability of the models, was reported in only 6 studies.
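The three internal validation schemes named above differ mainly in how the samples are partitioned. The following is a minimal sketch in plain Python, not the procedure of any reviewed study. All function names are illustrative, and a real analysis would typically shuffle the samples first and rely on an established library.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds; yield (train, test) index lists."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds receive one additional sample.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

def leave_one_out_indices(n_samples):
    """Leave-one-out is k-fold with exactly one sample per test fold."""
    return k_fold_indices(n_samples, n_samples)

def nested_cv_indices(n_samples, outer_k, inner_k):
    """Yield (inner_splits, outer_test) pairs.

    The inner splits are drawn only from the outer training samples, so
    tuning (for example feature selection) never sees the outer test fold.
    """
    for outer_train, outer_test in k_fold_indices(n_samples, outer_k):
        inner_splits = []
        for train_pos, val_pos in k_fold_indices(len(outer_train), inner_k):
            inner_train = [outer_train[i] for i in train_pos]
            inner_val = [outer_train[i] for i in val_pos]
            inner_splits.append((inner_train, inner_val))
        yield inner_splits, outer_test
```

None of these schemes replaces external validation: every split is drawn from the same data set, so performance on an independent cohort remains untested.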

Most studies explicitly addressed missing data, commonly through exclusion or imputation. While implementation details are sometimes limited, the problem is clearly recognized across studies. Fewer studies discussed class imbalance. Of these, very few handled it by oversampling and most by stratified cross-validation (Tables 2, 3), but only rarely was the issue addressed in depth or through appropriate metrics such as the Area Under the Precision-Recall Curve (AUC-PR). This highlights a methodological gap. Across all studies, the description of data preprocessing steps and the features included in the models was consistently clear and well-documented. However, very few studies provided public access to code or data. Still, most described models and variables in sufficient detail to enable partial reproducibility. Although several studies described their methodology to prevent bias to some extent, they did not discuss potential biases in depth. Ensuring transparency, explainability and intelligibility, however, is a key principle to be considered for implementation in clinical settings (29, 63).
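Two of the class-imbalance measures mentioned above can be illustrated briefly: stratified fold assignment, which keeps the class ratio similar in every fold, and the area under the precision-recall curve, computed here in its step-wise form (average precision). This is a hedged sketch in plain Python with illustrative function names. It assumes binary labels coded 1 (minority class of interest) and 0, and prediction scores without ties.

```python
from collections import defaultdict

def stratified_fold_assignment(labels, k):
    """Assign each sample a fold number so that folds keep roughly the overall class ratio."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [None] * len(labels)
    for indices in by_class.values():
        # Deal the samples of each class round-robin across the k folds.
        for position, idx in enumerate(indices):
            folds[idx] = position % k
    return folds

def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (assumes untied scores)."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank samples from highest to lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at the rank of each positive
    return total / n_pos
```

Unlike accuracy, average precision falls toward the prevalence of the positive class when a model ranks patients at random, which is why it is informative when one outcome class is rare.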

Outcome prediction on the basis of a single data category

A majority of identified literature on ML-approaches for treatment outcome predictions was based on a single data category as defined for this review, such as EEG, MRI or clinical-sociodemographic data. Thereby, the data category clinical-sociodemographic was primarily represented. Overall, 34.5% (N=10) of identified publications reported approaches based on clinical-sociodemographic data in outcome prediction of pharmacotherapy or pharmacotherapy and psychotherapy (Table 2) and 17.24% (N=5) in psychotherapy only (Table 3). Figure 2 juxtaposes the performance metric AUC reported in publications on prediction models internally validated in terms of pharmacological treatment of MDD applying one data category only. Predominantly, AUC was reported in evaluations of the single data category clinical-sociodemographic in antidepressant treatment. In general, the reported average AUC for identified single data category prediction models applied in antidepressant treatment was approximately 0.7 or higher. However, the approach by Poirot et al. (2024) shows a clearly lower performance for response and remission (Figure 2, Supplementary File 3 Table 1). In non-pharmacological treatment, AUC values were only reported by Zandvakili et al. (2019) for outcome prediction in transcranial magnetic stimulation (AUC (Alpha coherence): 0.83; AUC (Theta coherence): 0.69) and by Kannampallil et al. (2022) for outcome predictions in psychotherapy (AUC (baseline clinical and patient-reported data): 0.719; AUC (baseline and data after 2 months of problem solving therapy): 0.744). Furthermore, performance metrics such as accuracy, sensitivity and specificity were often reported (Supplementary File 3 Table 1) (44, 46, 51, 59). F1 or R² were predominantly applied to report performance in models using the data category EEG or MRI, and therefore were identified less frequently (43, 45, 47, 48). Model performance based on different methods was often higher with a larger set of selected patient variables (35, 36). However, Nie et al. (2018) showed that, depending on the method applied, the increase of performance with additional variables reached a plateau at a differing number of top variables. Iniesta et al. (2016) also showed that additional variables can improve performance of outcome prediction only up to a certain point. However, this impact was dependent on the predicted outcome, the applied pharmacological therapy and the variable combination. Furthermore, in several publications on the basis of one data category, the combination of baseline data and data in the course of treatment increased performance such as AUC and/or accuracy (44, 51, 60).
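For readers less familiar with the AUC values compared above, the metric can be read as the probability that a randomly chosen patient with the outcome (for example remission) receives a higher predicted score than a randomly chosen patient without it. A value of 0.5 corresponds to chance. The sketch below computes it directly from that definition in plain Python. The function name and the example data are illustrative and not taken from any reviewed study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: 1 for patients with the outcome, 0 otherwise.
    scores: predicted scores, higher meaning the outcome is more likely.
    Tied scores count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical scores for four patients: three of the four
# positive/negative pairs are ranked correctly.
example_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2])
```

The pairwise form is quadratic in the number of patients, which is acceptable for the sample sizes typical of the reviewed studies but not for very large cohorts.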

Figure 2
Scatter plot showing the average AUC of models predicting outcomes of MDD pharmacological treatments. Data points are plotted for various drugs like Citalopram, Nortriptyline, and Sertraline, with labels indicating studies and sample sizes. Blue circles denote clinical-sociodemographic data, while a red square represents EEG data. Mean AUC values range from 0.20 to 1.00 on the vertical axis, with study references along the horizontal axis.

Figure 2. Mean Area Under the Curve (AUC) values across ML-methods internally validated on pharmacological treatment groups of various studies from 2016 to 2024. The diagram highlights the models applying a single patient data category. Chekroud et al. (2016): prediction of remission; feature selection applied. Iniesta et al. (2016): prediction of remission; largest predictor set with random patient allocation. Nie et al. (2018): prediction of treatment resistant depression; full set of features considered. Sheu et al. (2023): prediction of response; likelihood score and inclusion of deep-learning imputed labels not considered. Oakley et al. (2023): prediction of response; feature selection applied. Poirot et al. (2024): prediction of pre-treatment response, pre-treatment remission; largest predictor set without selection.

Effect of combining patient data categories in ML-approaches

Utilizing diverse patient data enables ML-models to capture complex patterns, leading to improved predictive accuracy and a more personalized approach to treatment for individuals with MDD (34, 64). In previous studies, adding extra data, such as combining different clinical and demographic patient variables (35) or combining measurements at baseline and in the course of treatment, in general improved performance metrics such as the AUC of predictive models, e.g. for depression remission (44, 51, 60). Several of the evaluated publications (Table 2) also described ML-models that analyzed more than one category of patient data (10, 49–54).

Joyce et al. (2021) assessed whether ML-approaches trained on data of patients with antidepressant monotherapy can improve outcome predictions in combination with antidepressant pharmacotherapy when additional multi-omics measures are incorporated in comparison to clinical-sociodemographic and metabolomic data (molecular biomarker data) only (52). The accuracy achieved in external validation of the metabolomics model was 75.3% (p = 0.026) for both penalized regression and XGBoost. When utilizing the multi-omics model, the accuracy decreased to 73.2% (p = 0.085) applying XGBoost and increased to 77.5% (p < 0.01) with penalized regression. For AUC similar tendencies were observed (Supplementary File 3 Table 2).

Thus, the extended multi-omics model slightly decreased or only slightly increased prediction performance in terms of accuracy and AUC when applied to patients with combination antidepressant pharmacotherapies. The additional molecular biomarker data (6 functionally validated SNPs) therefore did not show a high impact in the cross-trial replication experiment by Joyce et al. (2021), and the impact on performance was also low in internal validation.

However, the additional incorporation of different types of data categories and feature selection showed a higher impact on the improvement of predictive performance in the analyses by Jaworska et al. (2019) (10) and by Poirot et al. (2024) (51). Jaworska et al. (2019) showed that Model 3 which combined EEG and clinical-demographic data categories with selected most predictive features for week 12 treatment response/non-response performed better than models on the basis of single data categories with predictive feature selection (Supplementary File 3 Table 3). The model applying the most predictive variables from feature selection of each category achieved high sensitivity and specificity and overall accuracy of 88%. Thereby, the approach using Random Forest (RF) was superior in terms of quality metrics across all the machine learning methods.

Poirot et al. (2024) demonstrated that multimodal approaches in remission and response prediction with pre-treatment MRI and clinical data predominantly outperformed unimodal models. However, when early treatment data was incorporated the effect was not observed for unimodal models applying clinical-sociodemographic data only to predict remission on the basis of low scientific evidence predictors or response with a large predictor set without selection in response prediction. In general, the use of a selected set of high scientific evidence predictors in multimodal approaches showed a higher performance. Also, Chen et al. (2023) (53) showed that combining different data categories tends to improve prediction performance. RF performed overall best across the different models using clinical-sociodemographic and/or molecular biomarker data (methylation levels of 38 methylated sites of tryptophan hydroxylase 2 genes via SNPs). Specifically, it achieved the highest AUC and accuracy when combining the molecular biomarker and clinical-sociodemographic data (Supplementary File 3 Table 4, Model 3). Furthermore, feature selection applying Recursive Feature Elimination (RFE) with random forest in general improved performance in terms of accuracy in the single and the multiple category approach. However, significance was obtained only in the molecular biomarker model and the combination model.

Sajjadian et al. (2023) (54) performed single and multiple data category analyses on clinical-sociodemographic data, MRI (multimodal and functional) and molecular biomarker data with and without feature selection. Thereby, a set of 134 baseline variables preselected on the basis of relevant published evidence of predictive value and a large set of 1152 baseline variables without the requirement of such evidence were used. For single and multiple data category analyses, the approaches including the smaller evidence-based set of variables in general showed better mean balanced accuracy than the corresponding approaches with the large variable set. However, in the large variable set without prior evidence-based preselection, correlation-adjusted t-score feature selection led to a modest improvement of mean balanced accuracy. The inclusion of 80 variables representing measurements at week 2 of treatment in addition to the 134 evidence-based variables mainly improved mean accuracy in single and multiple data category analyses (Supplementary File 4 Table 1). Furthermore, an increase of mean accuracy was observed with the increase of combined data categories over all variable set approaches.

In the application of baseline variables only (134 preselected baseline variable set, feature selection), the best-performing ML-method in terms of accuracy was SVM in Model 4 combining all data categories and in Model 2 applying molecular biomarker data (Supplementary File 4 Table 1). For the combined data of baseline (134 preselected, evidence-based baseline variable set, feature selection) and week 2 measurements (80 variables) in the course of treatment, the most accurate predictive models were a Naive Bayes model (CAT score) that utilized clinical data, an Elastic Net model (Embedded + CAT score) applying all data categories and a random forest model (Embedded + CAT score) that incorporated clinical data.

In sum, the combination of additional data categories improved accuracy with a higher impact than the addition of variables of the same data category in the identified literature for this review (35, 51, 52, 54).

Compliance of published AI applications with current social, ethical and legal aspects

A majority of extracted and screened criteria representing WHO ethical key principles (Supplementary File 2 Table 1) agreeing with ISO issues on social responsibility are addressed by current EU regulations that are translated into national legislation in the EU (Supplementary File 2 Table 2). Results on external validation applying new independent data sets of different studies were provided by only six of the identified publications, and only five of these showed successful validation (Table 4). Although the five evaluated publications provide performance data on external validation, which is a necessary step toward an application in clinical settings, they still describe ML-models in the development stage. Therefore, several of the relevant principles that cover ethical and social aspects that need to be considered for a potential medical product according to EU legislation could not be assessed yet.

Table 4. Compliance of externally validated ML-models with current social, ethical and legal aspects applicable in the EU.

Although all ML-models developed and described for the prediction of either treatment response and/or TRD (37, 46), non-remission or remission (34, 60) prior to treatment initiation or in the early phase of treatment (60) in MDD aim to contribute to human well-being, their capability to provide a real health benefit and to be of public interest still needs to be demonstrated. In four publications it was mentioned that an informed consent for the use of the data was available (37, 46) or they referred to information on trial registration or previous publications of study protocols that confirmed an available written informed consent (60), while one publication (34) mentioned data use via a limited access data use certificate. Still, whether a consent policy will be applied with the use of the ML-applications could not be deduced. Only Joyce et al. (2021) and Kannampallil et al. (2022) indicate the inclusion of several ethnic groups in the training data set; however, the data sample was small. Chekroud et al. (2016) indicate “White” and “Black or African American” as applied variables. However, the aspect of diversity was not addressed comprehensively. All publications showed that the applied ML-models were methods of supervised learning and were concordant with the principles “Intelligibility/explainability” and “Transparency”. Furthermore, data sources, the process of obtaining the data, the method of data processing, data inclusion and exclusion and a discussion of the data bias and limitations of the ML-models were addressed. The aspect “Human safety” was addressed by an external validation and performance assessment of the ML-model in MDD. Performance was reported in all evaluated publications (34, 37, 46, 52, 60) at least in terms of accuracy, sensitivity and specificity in the prediction of the respective models. In most published ML-models the evaluated metrics showed above chance performance for prediction accuracy. However, the external validation by Chekroud et al. (2016) could not show an above chance performance in a venlafaxine plus mirtazapine treatment cohort in terms of accuracy. Furthermore, none of the models externally validated by Nie et al. (2018) reached a specificity higher than 44% in the prediction of citalopram treatment resistant depression, although accuracy and sensitivity were high. Only Chekroud et al. (2016) compared model prediction with treatment outcome prediction by clinicians. Here, prediction performance by 23 clinicians for 26 STAR*D patients was below chance and the ML-model showed a better performance.
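The combination of high accuracy and sensitivity with low specificity noted above is typical for imbalanced cohorts, where most patients belong to the predicted class. A minimal sketch in plain Python shows how the metrics relate. The confusion-matrix counts are hypothetical, chosen only for illustration, and are not data from Nie et al. (2018) or any other reviewed study.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute standard metrics from confusion-matrix counts.

    tp/fn: patients with the outcome predicted correctly/incorrectly.
    tn/fp: patients without the outcome predicted correctly/incorrectly.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Mean of the two class-wise rates; not inflated by the majority class.
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }

# Hypothetical cohort of 100 patients, 90 of whom have the outcome.
example = classification_metrics(tp=80, fn=10, fp=6, tn=4)
```

In this hypothetical cohort, accuracy is 0.84 and sensitivity about 0.89, while specificity is only 0.40 and balanced accuracy about 0.64. Reporting accuracy alone would therefore overstate how well patients without the outcome are identified.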

Discussion

The heterogeneity and complexity of MDD complicate the estimation of the course of treatment and thus make it necessary to include a variety of different patient data in its assessment (66). For such multifaceted analyses, however, clinicians will need tools that integrate the varying patient information to guide treatment decisions. In the EU, algorithms that are used as decision support systems in clinical settings have to be validated like diagnostic devices according to the Medical Device Regulation (MDR) (23). Furthermore, the General Data Protection Regulation (30) and, from August 2026, the new EU AI Act (31) also apply when such systems are used on patient data in patient care. Therefore, several aspects in terms of performance quality, data safety, transparency and further social and ethical considerations must be taken into account when using machine learning to guide treatment for MDD (28, 29). In this systematic review we evaluated how clinical applicability (in terms of technical and clinical performance and accordance with social, ethical and legal issues relevant in the EU) and the utility of AI models for therapeutic outcome prediction in MDD have been addressed and demonstrated so far in the published literature.

We could show that all evaluated publications addressed technical performance by applying some form of internal validation and that the majority also reported performance metrics such as accuracy and/or AUC, sensitivity and specificity, and in some cases other quality metrics such as R² or the F1 score. The performance metrics reported for various ML-methods across different data categories indicate that incorporating multiple categories of patient data mostly leads to enhanced model performance. Thus, across the reviewed studies, including a broader set of data categories (such as clinical-sociodemographic, EEG, MRI and molecular biomarker data) improved the accuracy of ML-model predictions of treatment outcomes in MDD (10, 35, 51–54, 60). However, some approaches were more successful, particularly when escitalopram or citalopram treatment was involved, indicating that the benefit of adding data may vary depending on the treatment being predicted (10, 34, 35). Another systematic literature evaluation by Lee et al. (2018) supports our findings, showing that predictive models integrating multiple categories of data from adults with depression performed better than models applying single data categories. Lee et al. (2018) also reported issues of heterogeneity consistent with our observations, especially in terms of validation approaches, applied data and feature selection (67). For other mental disorders, Pigoni et al. (2025) showed that the overall accuracy of predictive models increases when more than one data category is used (68). A previous scoping review by Kline et al. (2022) on ML in precision health showed that multimodal and unimodal approaches were compared in only a few publications. However, where a comparison was provided, predictive accuracy was higher with the multimodal approach (69).
The incorporation of multiple categories of patient data is also supported by studies focusing on other health conditions such as neurodegenerative diseases or cancer (70–72). Furthermore, similar to our observations on the basis of the reviewed ML-approaches, the use of feature selection techniques often further optimized model performance in predictions focusing on, e.g., cancer therapy (73–75).

Although evidence of the importance of genetic or pharmacogenetic information for MDD treatment outcome has increased (76–78), only 10.3% (N=3) of the reviewed publications incorporated such molecular biomarker data in their prediction models in MDD. Several of the published ML-model evaluations showed better performance or, depending on the applied method, a trend towards improved performance metrics when biological biomarkers such as genetic characteristics were considered (52, 53). However, none of the reviewed studies included data on relevant pharmacogenetic variants of, e.g., CYP2D6 or CYP2C19 in ML-models for outcome prediction in SSRI treatment, although previous studies have provided sufficient evidence of a potential impact on treatment with several SSRIs to support genotype-based treatment and dosing recommendations (79). While adding and combining various data categories may help to improve prediction performance, Sajjadian et al. (2023) (54) showed that adding a large amount of data without preselection on the basis of scientific evidence of relevance for prediction, and without feature selection, does not necessarily lead to better accuracy. Also, in some approaches with preselection and/or feature selection, applying only one data category such as clinical-sociodemographic data showed similar or even higher performance than combined models (51, 54).

According to the MDR, clinical performance should be assessed in addition to analytical or technical performance, and it should be suitable for the intended use of ML-models in MDD to ensure a safe and efficacious application by clinicians for therapeutic outcome prediction. However, only five publications provided information on a successful external validation to demonstrate robustness relevant for clinical performance (34, 37, 46, 52, 60). Of these, only two publications reported external validations with high accuracies of >75% and moderate to high sensitivity of >70%, and only one also reported a high specificity of 88% in outcome prediction for an independent data set (Table 4). However, the sample size of the training data set in these cases was small. Appropriate performance values for MDD were not discussed in any of these publications, and above-chance performance was still presented as successful (34). This review highlights that external validation, a crucial phase in the qualification process for a clinical application, is underrepresented in the current literature.
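Whether an externally validated accuracy counts as "above chance" depends on the reference level. As a minimal sketch, the Python example below contrasts a comparison against 50% with a comparison against the accuracy obtained by always predicting the more frequent outcome (often called the no-information rate), using an exact one-sided binomial tail. The cohort numbers are hypothetical, and this check is an illustration rather than a procedure reported in the reviewed publications.

```python
from math import comb


def binomial_upper_tail(successes: int, n: int, p: float) -> float:
    """Return P(X >= successes) for X ~ Binomial(n, p)."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(successes, n + 1)
    )


# Hypothetical external validation cohort (assumed numbers, not study data).
n_total = 150     # patients in the external cohort
n_majority = 90   # patients in the more frequent outcome class
n_correct = 97    # correct model predictions

model_accuracy = n_correct / n_total
no_information_rate = n_majority / n_total

# One-sided p-values for "accuracy is higher than the reference level".
p_vs_coin_flip = binomial_upper_tail(n_correct, n_total, 0.5)
p_vs_majority = binomial_upper_tail(n_correct, n_total, no_information_rate)

print(f"accuracy            = {model_accuracy:.3f}")
print(f"no-information rate = {no_information_rate:.3f}")
print(f"p (vs. 0.5)         = {p_vs_coin_flip:.4f}")
print(f"p (vs. majority)    = {p_vs_majority:.4f}")
```

Under these assumed numbers, the accuracy of about 65% is clearly above 50%, yet it cannot be distinguished from always predicting the more frequent class at a conventional 5% level. Reporting the reference level alongside accuracy therefore makes "above chance" claims easier to interpret.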

For clinical applicability, it should be furthermore demonstrated that the AI application can be utilized according to current legislation also taking into account social and ethical aspects to ensure public acceptance (63, 80). We identified and presented social and ethical aspects that need to be considered according to current EU legislation when AI models are used as diagnostic devices to assist MDD treatment outcome prediction as clinical decision support systems. However, as the ML-applications presented in current literature are still in the stage of development, many of these aspects cannot apply. Still, important aspects such as safety in terms of reporting performance metrics and to some extent also intelligibility, explainability, and transparency were met by all publications that presented results on successful external validation. This shows a potential for a future application in healthcare.

The demonstrated heterogeneity across the published studies in terms of study design, sample size, validation approaches, applied ML methods, number and type of applied data categories (Tables 2, 3) and approaches to feature selection limits generalizability (81). This also poses a barrier to the clinical implementation of such ML-methods, as regulators in the EU require evidence of consistent performance across settings and populations in the course of post-market surveillance. Furthermore, to implement such decision support systems in clinical settings and to maintain real-world clinical performance, ML-models need to be updated so that they continue to operate according to the intended use, and longitudinal validation and thereby consistent demonstration of compliance with current legislation are necessary (31).

Currently, there are no standards for the predictive performance of ML-models in therapeutic outcome prediction in major depressive disorder. Such standards could provide guidance to manufacturers in the development process of decision support systems and to regulators in terms of decision-making according to current legislation. However, given the high burden of MDD on affected individuals, their social environment and health care (82), a thorough and careful evaluation by medical experts, taking into account the intended use of the decision support system, is necessary prior to determining appropriate metrics and levels of predictive performance (18, 83). In doing so, the consequences of false positive and false negative results in the context of use and of specific predicted outcomes should be considered (84). However, the consequences of not using the application should also be taken into account if evidence from real-world practice shows that the decision support system provides a clinical benefit. Comparative evaluations showing whether externally validated ML-based approaches surpass current approaches by healthcare providers could deliver such evidence. This could support implementing thresholds that would constitute acceptable levels of predictive performance. However, none of the publications reporting at least an external validation of the respective ML-models sufficiently demonstrated that applying an ML-model in real-world settings is useful in MDD outcome prediction. Only Chekroud et al. (2016) compared AI-aided outcome prediction with AI-unaided prediction by 23 clinicians on a small sample of patients. Therefore, clinical utility has not been sufficiently addressed by an appropriate performance comparison with currently applied methods in real-world settings to show a value and benefit for patient care.
It needs to be assessed which prediction performance of an AI-based decision support system, compared to a prediction by clinicians, can be regarded as useful. The expertise level of the clinicians should also be taken into account, as should the question of whether human control can be maintained in the application of the decision support system despite the complexity of the underlying patient data. However, most of the published ML-models providing above-chance prediction performance in this systematic review are currently not suitable for use in clinical settings in the EU, as they do not meet all requirements in terms of clinical applicability and clinical utility. Furthermore, successful implementation of ML-models as decision support systems can also depend on stakeholder acceptability (85). Integrating patients’ and clinicians’ perspectives provides important insight into the end-user environment and helps to identify needs, challenges and barriers for application according to the intended use and for adoption in real-world settings. Therefore, it is crucial to consider stakeholder feedback in the design and development process (86). Usability, and whether the application can be applied in accordance with ethical, social and legal aspects such as explainability and transparency, can be tested by involving patients and clinicians in prototyping cycles (86–88). Furthermore, interviews or surveys could be integrated into the development or validation process or into pilot testing to evaluate user satisfaction and perceived utility (89, 90). However, none of the evaluated studies addressed the patients’ and clinicians’ perspectives on the application of such tools for MDD therapy management in clinical care.
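The weighing of false positive against false negative results discussed above can be made explicit in several ways. One option, shown here only as an illustrative sketch and not taken from the reviewed publications, is the net benefit measure used in decision curve analysis: true positives are credited and false positives are penalized by the odds of the probability threshold at which a clinician would act. All counts and the threshold below are hypothetical assumptions.

```python
def net_benefit(tp: int, fp: int, n: int, threshold: float) -> float:
    """Net benefit = TP/n - FP/n * (threshold / (1 - threshold)).

    `threshold` is the predicted probability at which acting on a positive
    prediction is considered worthwhile; its odds weight the harm of a
    false positive relative to the benefit of a true positive.
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    harm_weight = threshold / (1.0 - threshold)
    return tp / n - harm_weight * fp / n


# Hypothetical cohort of 250 patients, 200 of whom have the positive outcome.
n_patients = 250
threshold = 0.6  # assumed value; in practice set by clinical judgment

# Strategy 1: act on the model's positive predictions (assumed counts).
model_nb = net_benefit(tp=180, fp=30, n=n_patients, threshold=threshold)

# Strategy 2: treat every patient as positive (no model needed).
flag_all_nb = net_benefit(tp=200, fp=50, n=n_patients, threshold=threshold)

# Strategy 3: treat no patient as positive; net benefit is zero by definition.
flag_none_nb = 0.0

print(f"model:     {model_nb:.3f}")
print(f"flag all:  {flag_all_nb:.3f}")
print(f"flag none: {flag_none_nb:.3f}")
```

With these assumed inputs the model's net benefit (0.54) is slightly higher than that of treating every patient as positive (0.50). A comparison of this kind depends strongly on the chosen threshold, which is why the intended use and expert judgment need to be fixed before performance levels are set.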

For this literature evaluation, the search string was chosen to screen a large number of publications on ML-based outcome predictions in MDD. Still, the search terms used may not capture all current literature relevant to the objective of this systematic review. A limitation of the literature search was that only two databases, PubMed and Google Scholar, were searched, while databases such as EMBASE, Medline, Scopus or PsycINFO were not. To increase the quality of the assessment, only original publications of clinical trial evaluations were included. However, few publications met the required criteria and provided detailed descriptions of ML-approaches applied for therapeutic outcome predictions in MDD. Therefore, most publications identified had to be excluded. Results of models with low performance may have been less likely to be published. Also, several of the evaluated publications did not provide data on performance in terms of at least AUC or accuracy. Therefore, publication and reporting bias cannot be ruled out in the present systematic evaluation.

In conclusion, the additional integration of different data categories, or of additional data from such categories in the course of treatment, mostly improved the performance of machine learning models in predicting treatment outcomes in MDD. While this approach enhances accuracy and/or AUC, careful consideration of trade-offs between sensitivity and specificity remains crucial. Additionally, longitudinal data collection, as well as feature selection and the selection of appropriate ML-techniques tailored to the specific data types and intended use, could help to maximize predictive power and improve personalized treatment outcomes for individuals with MDD. While technical awareness is evident, consistent reporting and open science practices remain limited. We recommend that future studies systematically report on missingness patterns and employ resampling or cost-sensitive learning when imbalance is present. In the case of unbalanced data distributions, AUC-PR should be reported in addition to or instead of AUC-ROC to provide a more meaningful assessment of model performance. Making trained models and pipelines publicly available will foster reproducibility. The main obstacles to translating ML-based decision support into clinical practice remain a lack of generalizability of prediction models for clinical application in MDD and insufficient proof of clinical applicability and utility in real-world settings.
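The recommendation to report AUC-PR for unbalanced data can be illustrated with a small, self-contained Python sketch. The labels and scores below are invented for illustration; AUC-ROC is computed as the fraction of correctly ordered positive/negative pairs, and AUC-PR is approximated by average precision, which is one common estimator among several.

```python
def roc_auc(labels, scores):
    """AUC-ROC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive case.

    Assumes distinct scores; ties are not handled specially in this sketch.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("no positive cases present")
    return total / hits


# Hypothetical, strongly unbalanced sample: 2 positives among 20 patients,
# listed in order of decreasing predicted score.
labels = [0, 0, 1, 0, 1] + [0] * 15
scores = [1.0 - 0.05 * i for i in range(20)]

prevalence = sum(labels) / len(labels)
print(f"prevalence        = {prevalence:.2f}")
print(f"AUC-ROC           = {roc_auc(labels, scores):.3f}")
print(f"average precision = {average_precision(labels, scores):.3f}")
```

In this invented sample the AUC-ROC is about 0.86, while average precision is only about 0.37 against a prevalence of 0.10. The ROC-based value looks reassuring even though most of the top-ranked patients are negatives, which is the situation the recommendation above is meant to expose.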

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author.

Author contributions

VN: Data curation, Formal Analysis, Investigation, Methodology, Visualization, Writing – original draft, Writing – review & editing. TH: Conceptualization, Methodology, Supervision, Visualization, Writing – original draft, Writing – review & editing. MS: Conceptualization, Methodology, Project administration, Supervision, Writing – review & editing. CS: Conceptualization, Funding acquisition, Methodology, Project administration, Supervision, Writing – review & editing.

Funding

The author(s) declare that financial support was received for the research and/or publication of this article. This evaluation was funded by Federal Ministry of Health (Bundesministerium für Gesundheit) in Germany in the framework of ERAPerMed JTC2021. Grant number: ZMI5-522FSB900. The funders had no role in design, data collection and analysis of the evaluation, decision to publish, or preparation of the manuscript.

Acknowledgments

The authors thank the consortium of the project “Artificial intelligence for personalized medicine in depression - analysis and harmonization of clinical research data for robust multimodal patient profiling for the prediction of therapy outcome” (ArtiPro) for the professional exchange and collaboration within the project.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declare that no Generative AI was used in the creation of this manuscript.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpsyt.2025.1588963/full#supplementary-material

References

1. WHO. Depressive Disorder. Geneva, Switzerland: WHO Home page (2023).

2. Global Health Data Exchange. Global Burden of Disease (GBD). Seattle, Washington, USA: Institute for Health Metrics and Evaluation, University of Washington (2019).

3. Otte C, Gold SM, Penninx BW, Pariante CM, Etkin A, Fava M, et al. Major depressive disorder. Nat Rev Dis Primers. (2016) 2:16065. doi: 10.1038/nrdp.2016.65

4. Mahajna M, Abu Fanne R, Berkovitch M, Tannous E, Vinker S, Green I, et al. Effect of CYP2C19 pharmacogenetic testing on predicting citalopram and escitalopram tolerability and efficacy: A retrospective, longitudinal cohort study. Biomedicines. (2023) 11:3245. doi: 10.3390/biomedicines11123245

5. Islam F, Marshe VS, Magarbeh L, Frey BN, Milev RV, Soares CN, et al. Effects of CYP2C19 and CYP2D6 gene variants on escitalopram and aripiprazole treatment outcome and serum levels: results from the CAN-BIND 1 study. Trans Psychiatry. (2022) 12:366. doi: 10.1038/s41398-022-02124-4

6. Bharthi K, Zuberi R, Maruf AA, Shaheen SM, McCloud R, Heintz M, et al. Impact of cytochrome P450 genetic variation on patient-reported symptom improvement and side effects among children and adolescents treated with fluoxetine. J Child Adolesc Psychopharmacol. (2024) 34:21–7. doi: 10.1089/cap.2023.0039

7. Mekonen T, Ford S, Chan GCK, Hides L, Connor JP, and Leung J. What is the short-term remission rate for people with untreated depression? A systematic review and meta-analysis. J Affect Disord. (2022) 296:17–25. doi: 10.1016/j.jad.2021.09.046

8. Cuijpers P, Karyotaki E, Ciharova M, Miguel C, Noma H, and Furukawa TA. The effects of psychotherapies for depression on response, remission, reliable change, and deterioration: A meta-analysis. Acta psychiatrica Scandinavica. (2021) 144:288–99. doi: 10.1111/acps.13335

9. Verhoeven JE, Han LKM, Lever-van Milligen BA, Hu MX, Révész D, Hoogendoorn AW, et al. Antidepressants or running therapy: Comparing effects on mental and physical health in patients with depression and anxiety disorders. J Affect Disord. (2023) 329:19–29. doi: 10.1016/j.jad.2023.02.064

10. Jaworska N, de la Salle S, Ibrahim MH, Blier P, and Knott V. Leveraging machine learning approaches for predicting antidepressant treatment response using electroencephalography (EEG) and clinical data. Front Psychiatry. (2018) 9:768. doi: 10.3389/fpsyt.2018.00768

11. McIntyre RS, Alsuwaidan M, Baune BT, Berk M, Demyttenaere K, Goldberg JF, et al. Treatment-resistant depression: definition, prevalence, detection, management, and investigational interventions. World Psychiatry. (2023) 22:394–412. doi: 10.1002/wps.21120

12. Engelmann J, Wagner S, Solheid A, Herzog DP, Dreimüller N, Müller MB, et al. Tolerability of high-dose venlafaxine after switch from escitalopram in nonresponding patients with major depressive disorder. J Clin Psychopharmacol. (2021) 41:62–6. doi: 10.1097/JCP.0000000000001312

13. van Krugten F, Goorden M, van Balkom A, Spijker J, Brouwer W, Hakkaart-van Roijen L, et al. Indicators to facilitate the early identification of patients with major depressive disorder in need of highly specialized care: A concept mapping study. Depression Anxiety. (2018) 35:346–52. doi: 10.1002/da.22741

14. Wium-Andersen IK, Vinberg M, Kessing LV, and McIntyre RS. Personalized medicine in psychiatry. Nord J Psychiatry. (2017) 71:12–9. doi: 10.1080/08039488.2016.1216163

15. Habehh H and Gohel S. Machine learning in healthcare. Curr Genomics. (2021) 22:291–300. doi: 10.2174/1389202922666210705124359

16. Sajjadian M, Lam RW, Milev R, Rotzinger S, Frey BN, Soares CN, et al. Machine learning in the prediction of depression treatment outcomes: a systematic review and meta-analysis. Psychol Med. (2021) 51:2742–51. doi: 10.1017/S0033291721003871

17. Arnold PIM, Janzing JGE, and Hommersom A. Machine learning for antidepressant treatment selection in depression. Drug Discov Today. (2024) 29:104068. doi: 10.1016/j.drudis.2024.104068

18. de Hond AAH, Leeuwenberg AM, Hooft L, Kant IMJ, Nijman SWJ, van Os HJA, et al. Guidelines and quality criteria for artificial intelligence-based prediction models in healthcare: a scoping review. NPJ digital Med. (2022) 5:2. doi: 10.1038/s41746-021-00549-7

19. Shick AA, Webber CM, Kiarashi N, Weinberg JP, Deoras A, Petrick N, et al. Transparency of artificial intelligence/machine learning-enabled medical devices. NPJ digital Med. (2024) 7:21. doi: 10.1038/s41746-023-00992-8

20. Hanna M, Pantanowitz L, Jackson B, Palmer O, Visweswaran S, Pantanowitz J, et al. Ethical and bias considerations in artificial intelligence (AI)/machine learning. Modern Pathol. (2024) 38:100686. doi: 10.1016/j.modpat.2024.100686

21. European Commission. Proposal for a Regulation of the European Parliament and of the Council laying down harmonised rules on artificial intelligence (Artificial Intelligence Act) and amending certain Union legislative acts. Brussels, Belgium: European Commission (2021). COM(2021) 206 final, 2021/0106 (COD).

22. Jobin A, Ienca M, and Vayena E. The global landscape of AI ethics guidelines. Nat Mach Intell. (2019) 1:389–99. doi: 10.1038/s42256-019-0088-2

23. The European Parliament and the Council of the European Union. Consolidated text: Regulation (EU) 2017/745 of the European Parliament and of the Council of 5 April 2017 on medical devices, amending Directive 2001/83/EC, Regulation (EC) No 178/2002 and Regulation (EC) No 1223/2009 and repealing Council Directives 90/385/EEC and 93/42/EEC (Text with EEA relevance). Brussels, Belgium: European Union, EUR-Lex (2025).

24. Rost N, Binder EB, and Brückl TM. Predicting treatment outcome in depression: an introduction into current concepts and challenges. Eur Arch Psychiatry Clin Neurosci. (2023) 273:113–27. doi: 10.1007/s00406-022-01418-4

25. Lobig F, Graham J, Damania A, Sattin B, Reis J, and Bharadwaj P. Enhancing patient outcomes: the role of clinical utility in guiding healthcare providers in curating radiology AI applications. Front digital Health. (2024) 6:1359383. doi: 10.3389/fdgth.2024.1359383

26. Goldstein SP, Nebeker C, Ellis RB, and Oser M. Ethical, legal, and social implications of digital health: A needs assessment from the Society of Behavioral Medicine to inform capacity building for behavioral scientists. Trans Behav Med. (2024) 14:189–96. doi: 10.1093/tbm/ibad076

27. Moher D, Liberati A, Tetzlaff J, and Altman DG. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Med. (2009) 6:e1000097. doi: 10.1371/journal.pmed.1000097

28. ISO. ISO 26000:2010: Guidance on social responsibility. Geneva, Switzerland: International Organization for Standardization (2010). p. 106.

29. WHO. Ethics and governance of artificial intelligence for health: Guidance on large multi-modal models Licence: CC BY-NC-SA 3.0 IGO. Geneva: World Health Organization (WHO) (2024).

30. The European Parliament and the Council of the European Union. Consolidated text: Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation) (Text with EEA relevance). Brussels, Belgium: European Union, EUR-Lex (2016).

31. The European Parliament and the Council of the European Union. Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence and amending Regulations (EC) No 300/2008, (EU) No 167/2013, (EU) No 168/2013, (EU) 2018/858, (EU) 2018/1139 and (EU) 2019/2144 and Directives 2014/90/EU, (EU) 2016/797 and (EU) 2020/1828 (Artificial Intelligence Act) (Text with EEA relevance). Brussels, Belgium: European Union, EUR-Lex (2024). p. 144.

32. Liu Y, Admon R, Mellem MS, Belleau EL, Kaiser RH, Clegg R, et al. Machine learning identifies large-scale reward-related activity modulated by dopaminergic enhancement in major depression. Biol Psychiatry Cogn Neurosci Neuroimaging. (2020) 5:163–72. doi: 10.1016/j.bpsc.2019.10.002

33. Carrillo F, Sigman M, Fernández Slezak D, Ashton P, Fitzgerald L, Stroud J, et al. Natural speech algorithm applied to baseline interview data can predict which patients will respond to psilocybin for treatment-resistant depression. J Affect Disord. (2018) 230:84–6. doi: 10.1016/j.jad.2018.01.006

34. Chekroud AM, Zotti RJ, Shehzad Z, Gueorguieva R, Johnson MK, Trivedi MH, et al. Cross-trial prediction of treatment outcome in depression: a machine learning approach. Lancet Psychiatry. (2016) 3:243–50. doi: 10.1016/S2215-0366(15)00471-X

35. Iniesta R, Malki K, Maier W, Rietschel M, Mors O, Hauser J, et al. Combining clinical variables to optimize prediction of antidepressant treatment outcomes. J Psychiatr Res. (2016) 78:94–102. doi: 10.1016/j.jpsychires.2016.03.016

36. Kautzky A, Dold M, Bartova L, Spies M, Vanicek T, Souery D, et al. Refining prediction in treatment-resistant depression: results of machine learning analyses in the TRD III sample. J Clin Psychiatry. (2018) 79. doi: 10.4088/JCP.16m11385

37. Nie Z, Vairavan S, Narayan VA, Ye J, and Li QS. Predictive modeling of treatment resistant depression using data from STAR*D and an independent clinical study. PLoS One. (2018) 13:e0197268. doi: 10.1371/journal.pone.0197268

38. Webb CA, Trivedi MH, Cohen ZD, Dillon DG, Fournier JC, Goer F, et al. Personalized prediction of antidepressant v. placebo response: evidence from the EMBARC study. Psychol Med. (2019) 49:1118–27. doi: 10.1017/S0033291718001708

39. Rutherford BR, Wall MM, Brown PJ, Choo TH, Wager TD, Peterson BS, et al. Patient expectancy as a mediator of placebo effects in antidepressant clinical trials. Am J Psychiatry. (2017) 174:135–42. doi: 10.1176/appi.ajp.2016.16020225

40. Zilcha-Mano S, Brown PJ, Roose SP, Cappetta K, and Rutherford BR. Optimizing patient expectancy in the pharmacologic treatment of major depressive disorder. Psychol Med. (2019) 49:2414–20. doi: 10.1017/S0033291718003343

41. Furukawa TA, Debray TPA, Akechi T, Yamada M, Kato T, Seo M, et al. Can personalized treatment prediction improve the outcomes, compared with the group average approach, in a randomized trial? Developing and validating a multivariable prediction model in a pragmatic megatrial of acute treatment for major depression. J Affect Disord. (2020) 274:690–7. doi: 10.1016/j.jad.2020.05.141

42. Foster S, Mohler-Kuo M, Tay L, Hothorn T, and Seibold H. Estimating patient-specific treatment advantages in the ‘Treatment for Adolescents with Depression Study’. J Psychiatr Res. (2019) 112:61–70. doi: 10.1016/j.jpsychires.2019.02.021

43. Lorenzo-Luaces L, Rodriguez-Quintana N, Riley TN, and Weisz JR. A placebo prognostic index (PI) as a moderator of outcomes in the treatment of adolescent depression: Could it inform risk-stratification in treatment with cognitive-behavioral therapy, fluoxetine, or their combination? Psychother Res. (2021) 31:5–18. doi: 10.1080/10503307.2020.1747657

44. Zhdanov A, Atluri S, Wong W, Vaghei Y, Daskalakis ZJ, Blumberger DM, et al. Use of machine learning for predicting escitalopram treatment outcome from electroencephalography recordings in adult patients with depression. JAMA network Open. (2020) 3:e1918377. doi: 10.1001/jamanetworkopen.2019.18377

45. Oakley T, Coskuner J, Cadwallader A, Ravan M, and Hasey G. EEG biomarkers to predict response to sertraline and placebo treatment in major depressive disorder. IEEE Trans bio-medical Eng. (2023) 70:909–19. doi: 10.1109/TBME.2022.3204861

46. Schwartzmann B, Dhami P, Uher R, Lam RW, Frey BN, Milev R, et al. Developing an electroencephalography-based model for predicting response to antidepressant medication. JAMA network Open. (2023) 6:e2336094. doi: 10.1001/jamanetworkopen.2023.36094

47. Nguyen KP, Fatt CC, Treacher A, Mellema C, Trivedi MH, and Montillo A. Anatomically-informed data augmentation for functional MRI with applications to deep learning. Proc SPIE–the Int Soc Optical Eng. (2020) 11313. doi: 10.1117/12.2548630

48. Nguyen KP, Raval V, Minhajuddin A, Carmody T, Trivedi MH, Dewey RB Jr., et al. BLENDS: augmentation of functional magnetic resonance images for machine learning using anatomically constrained warping. Brain connectivity. (2023) 13:80–8. doi: 10.1089/brain.2021.0186

49. Rajpurkar P, Yang J, Dass N, Vale V, Keller AS, Irvin J, et al. Evaluation of a machine learning model based on pretreatment symptoms and electroencephalographic features to predict outcomes of antidepressant treatment in adults with depression: A prespecified secondary analysis of a randomized clinical trial. JAMA network Open. (2020) 3:e206653. doi: 10.1001/jamanetworkopen.2020.6653

50. Bartlett EA, DeLorenzo C, Sharma P, Yang J, Zhang M, Petkova E, et al. Pretreatment and early-treatment cortical thickness is associated with SSRI treatment response in major depressive disorder. Neuropsychopharmacology. (2018) 43:2221–30. doi: 10.1038/s41386-018-0122-9

51. Poirot MG, Ruhe HG, Mutsaerts HMM, Maximov II, Groote IR, Bjørnerud A, et al. Treatment response prediction in major depressive disorder using multimodal MRI and clinical data: secondary analysis of a randomized clinical trial. Am J Psychiatry. (2024) 181:223–33. doi: 10.1176/appi.ajp.20230206

52. Joyce JB, Grant CW, Liu D, MahmoudianDehkordi S, Kaddurah-Daouk R, Skime M, et al. Multi-omics driven predictions of response to acute phase combination antidepressant therapy: a machine learning approach with cross-trial replication. Transl Psychiatry. (2021) 11:513. doi: 10.1038/s41398-021-01632-z

53. Chen B, Jiao Z, Shen T, Fan R, Chen Y, and Xu Z. Early antidepressant treatment response prediction in major depression using clinical and TPH2 DNA methylation features based on machine learning approaches. BMC Psychiatry. (2023) 23:299. doi: 10.1186/s12888-023-04791-z

54. Sajjadian M, Uher R, Ho K, Hassel S, Milev R, Frey BN, et al. Prediction of depression treatment outcome from multimodal data: a CAN-BIND-1 report. Psychol Med. (2023) 53:5374–84. doi: 10.1017/S0033291722002124

55. van Bronswijk SC, DeRubeis RJ, Lemmens L, Peeters F, Keefe JR, Cohen ZD, et al. Precision medicine for long-term depression outcomes using the Personalized Advantage Index approach: cognitive therapy or interpersonal psychotherapy? Psychol Med. (2021) 51:279–89. doi: 10.1017/S0033291719003192

56. van Bronswijk SC, Bruijniks SJE, Lorenzo-Luaces L, DeRubeis RJ, Lemmens L, Peeters F, et al. Cross-trial prediction in psychotherapy: External validation of the Personalized Advantage Index using machine learning in two Dutch randomized trials comparing CBT versus IPT for depression. Psychother Res. (2021) 31:78–91. doi: 10.1080/10503307.2020.1823029

57. Zwerenz R, Becker J, Gerzymisch K, Siepmann M, Holme M, Kiwus U, et al. Evaluation of a transdiagnostic psychodynamic online intervention to support return to work: A randomized controlled trial. PLoS One. (2017) 12:e0176513. doi: 10.1371/journal.pone.0176513

58. Jacobson NC and Nemesure MD. Using artificial intelligence to predict change in depression and anxiety symptoms in a digital intervention: evidence from a transdiagnostic randomized controlled trial. Psychiatry Res. (2021) 295:113618. doi: 10.1016/j.psychres.2020.113618

59. Solomonov N, Lee J, Banerjee S, Flückiger C, Kanellopoulos D, Gunning FM, et al. Modifiable predictors of nonresponse to psychotherapies for late-life depression with executive dysfunction: a machine learning approach. Mol Psychiatry. (2021) 26:5190–8. doi: 10.1038/s41380-020-0836-z

60. Kannampallil T, Dai R, Lv N, Xiao L, Lu C, Ajilore OA, et al. Cross-trial prediction of depression remission using problem-solving therapy: A machine learning approach. J Affect Disord. (2022) 308:89–97. doi: 10.1016/j.jad.2022.04.015

61. Zandvakili A, Philip NS, Jones SR, Tyrka AR, Greenberg BD, and Carpenter LL. Use of machine learning in predicting clinical response to transcranial magnetic stimulation in comorbid posttraumatic stress disorder and major depression: A resting state electroencephalography study. J Affect Disord. (2019) 252:47–54. doi: 10.1016/j.jad.2019.03.077

62. Dinov ID. Model performance assessment, validation, and improvement. In: Data Science and Predictive Analytics: Biomedical and Health Applications using R. Cham, Switzerland: Springer (2023) p. 477–531.

63. WHO. WHO issues first global report on Artificial Intelligence (AI) in health and six guiding principles for its design and use. Geneva, Switzerland: WHO website (2021).

64. Kautzky A, Möller HJ, Dold M, Bartova L, Seemüller F, Laux G, et al. Combining machine learning algorithms for prediction of antidepressant treatment response. Acta Psychiatr Scand. (2021) 143:36–49. doi: 10.1111/acps.13250

65. Schwartzmann B, Dhami P, Uher R, Lam RW, Frey BN, Milev R, et al. Developing an electroencephalography-based model for predicting response to antidepressant medication. JAMA Netw Open. (2023) 6:e2336094. doi: 10.1001/jamanetworkopen.2023.36094

66. Fried EI and Nesse RM. Depression is not a consistent syndrome: An investigation of unique symptom patterns in the STAR*D study. J Affect Disord. (2015) 172:96–102. doi: 10.1016/j.jad.2014.10.010

67. Lee Y, Ragguett R-M, Mansur RB, Boutilier JJ, Rosenblat JD, Trevizol A, et al. Applications of machine learning algorithms to predict therapeutic outcomes in depression: A meta-analysis and systematic review. J Affect Disord. (2018) 241:519–32. doi: 10.1016/j.jad.2018.08.073

68. Pigoni A, Tesic I, Pini C, Enrico P, Di Consoli L, Siri F, et al. Multimodal machine learning prediction of 12-month suicide attempts in bipolar disorder. Bipolar Disord. (2025) 27:167–253. doi: 10.1111/bdi.70011

69. Kline A, Wang H, Li Y, Dennis S, Hutch M, Xu Z, et al. Multimodal machine learning in precision health: A scoping review. NPJ Digit Med. (2022) 5:171. doi: 10.1038/s41746-022-00712-8

70. Chen RJ, Lu MY, Williamson DFK, Chen TY, Lipkova J, Noor Z, et al. Pan-cancer integrative histology-genomic analysis via multimodal deep learning. Cancer Cell. (2022) 40:865–878.e6. doi: 10.1016/j.ccell.2022.07.004

71. Gupta Y, Lama RK, and Kwon GR. Prediction and classification of Alzheimer’s disease based on combined features from apolipoprotein-E genotype, cerebrospinal fluid, MR, and FDG-PET imaging biomarkers. Front Comput Neurosci. (2019) 13:72. doi: 10.3389/fncom.2019.00072

72. Xie Y, Sang Q, Da Q, Niu G, Deng S, Feng H, et al. Improving diagnosis and outcome prediction of gastric cancer via multimodal learning using whole slide pathological images and gene expression. Artif Intell Med. (2024) 152:102871. doi: 10.1016/j.artmed.2024.102871

73. Afrash MR, Mirbagheri E, Mashoufi M, and Kazemi-Arpanahi H. Optimizing prognostic factors of five-year survival in gastric cancer patients using feature selection techniques with machine learning algorithms: a comparative study. BMC Med Inform Decis Mak. (2023) 23:54. doi: 10.1186/s12911-023-02154-y

74. Tadist K, Mrabti F, Nikolov NS, Zahi A, and Najah S. SDPSO: Spark Distributed PSO-based approach for feature selection and cancer disease prognosis. J Big Data. (2021) 8:19. doi: 10.1186/s40537-021-00409-x

75. Alireza Z, Maleeha M, Kaikkonen M, and Fortino V. Enhancing prediction accuracy of coronary artery disease through machine learning-driven genomic variant selection. J Transl Med. (2024) 22:356. doi: 10.1186/s12967-024-05090-1

76. Flint J. The genetic basis of major depressive disorder. Mol Psychiatry. (2023) 28:2254–65. doi: 10.1038/s41380-023-01957-9

77. Corponi F, Fabbri C, and Serretti A. Pharmacogenetics and depression: A critical perspective. Psychiatry Investig. (2019) 16:645–53. doi: 10.30773/pi.2019.06.16

78. Wang X, Wang C, and Zhang Y. Effect of pharmacogenomics testing guiding on clinical outcomes in major depressive disorder: a systematic review and meta-analysis of RCT. BMC Psychiatry. (2023) 23:334. doi: 10.1186/s12888-023-04756-2

79. Bousman CA, Stevenson JM, Ramsey LB, Sangkuhl K, Hicks JK, Strawn JR, et al. Clinical pharmacogenetics implementation consortium (CPIC) guideline for CYP2D6, CYP2C19, CYP2B6, SLC6A4, and HTR2A genotypes and serotonin reuptake inhibitor antidepressants. Clin Pharmacol Ther. (2023) 114:51–68. doi: 10.1002/cpt.2903

80. European Parliament and Council of the European Union. Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence and amending Regulations (EC) No 300/2008, (EU) No 167/2013, (EU) No 168/2013, (EU) 2018/858, (EU) 2018/1139 and (EU) 2019/2144 and Directives 2014/90/EU, (EU) 2016/797 and (EU) 2020/1828 (Artificial Intelligence Act) (Text with EEA relevance). Brussels, Belgium: European Union, EUR-Lex (2024).

81. Ermers NJ, Hagoort K, and Scheepers FE. The predictive validity of machine learning models in the classification and treatment of major depressive disorder: state of the art and future directions. Front Psychiatry. (2020) 11:472. doi: 10.3389/fpsyt.2020.00472

82. Proudman D, Greenberg P, and Nellesen D. The growing burden of major depressive disorders (MDD): implications for researchers and policy makers. Pharmacoeconomics. (2021) 39:619–25. doi: 10.1007/s40273-021-01040-7

83. Onitiu D, Wachter S, and Mittelstadt B. How AI challenges the medical device regulation: patient safety, benefits, and intended uses. J Law Biosci. (2024). doi: 10.1093/jlb/lsae007

84. Monaghan TF, Rahman SN, Agudelo CW, Wein AJ, Lazar JM, Everaert K, et al. Foundational statistical principles in medical research: sensitivity, specificity, positive predictive value, and negative predictive value. Medicina (Kaunas). (2021) 57:503. doi: 10.3390/medicina57050503

85. Gunlicks-Stoessel M, Liu Y, Parkhill C, Morrell N, Choy-Brown M, Mehus C, et al. Adolescent, parent, and provider attitudes toward a machine-learning based clinical decision support system for selecting treatment for youth depression. Res Sq. (2023) 24. doi: 10.21203/rs.3.rs-3374103/v1

86. Cutillo CM, Sharma KR, Foschini L, Kundu S, Mackintosh M, Mandl KD, et al. Machine intelligence in healthcare: perspectives on trustworthiness, explainability, usability, and transparency. NPJ Digit Med. (2020) 3:47. doi: 10.1038/s41746-020-0254-2

87. Genes N, Kim MS, Thum FL, Rivera L, Beato R, Song C, et al. Usability evaluation of a clinical decision support system for geriatric ED pain treatment. Appl Clin Inform. (2016) 7:128–42. doi: 10.4338/ACI-2015-08-RA-0108

88. Deininger M, Daly SR, Lee JC, Seifert CM, and Sienko KH. Prototyping for context: exploring stakeholder feedback based on prototype type, stakeholder group and question type. Res Eng Design. (2019) 30:453–71. doi: 10.1007/s00163-019-00317-5

89. Ghorayeb A, Darbyshire JL, Wronikowska MW, and Watkinson PJ. Design and validation of a new Healthcare Systems Usability Scale (HSUS) for clinical decision support systems: a mixed-methods approach. BMJ Open. (2023) 13:e065323. doi: 10.1136/bmjopen-2022-065323

90. Hiltunen A-M, Haavisto I, Nuutinen M, Lahelma M, Salminen A, de Almeida Mello J, et al. Protocol of the pilot study to test and evaluate the iCARE tool: a machine learning-based e-platform tool to make health prognoses and support decision-making for the care of older persons with complex chronic conditions. BMJ Open. (2025) 15:e101234. doi: 10.1136/bmjopen-2025-101234

Keywords: major depressive disorder, machine learning model, outcome prediction, ELSI, decision support system

Citation: Ntam VA, Huebner T, Steffens M and Scholl C (2025) Machine learning approaches in the therapeutic outcome prediction in major depressive disorder: a systematic review. Front. Psychiatry 16:1588963. doi: 10.3389/fpsyt.2025.1588963

Received: 06 March 2025; Accepted: 30 June 2025;
Published: 13 August 2025.

Edited by:

Gábor Gazdag, Jahn Ferenc Dél-Pesti Kórház és Rendelőintézet, Hungary

Reviewed by:

Johanna Kreither, University of Talca, Chile
Renato Bulcao Neto, Federal University of Goias, Brazil

Copyright © 2025 Ntam, Huebner, Steffens and Scholl. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Catharina Scholl, Catharina.Scholl@bfarm-research.de

These authors have contributed equally to this work

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.