ORIGINAL RESEARCH article

Front. Big Data, 04 February 2026

Sec. Machine Learning and Artificial Intelligence

Volume 8 - 2025 | https://doi.org/10.3389/fdata.2025.1651290

Depression detection through dual-stream modeling with large language models: a fusion-based transfer learning framework integrating BERT and T5 representations

  • 1. Faculty of Engineering, Universiti Putra Malaysia, UPM Serdang, Serdang, Selangor, Malaysia

  • 2. School of Information and Physical Sciences, The University of Newcastle, Callaghan, NSW, Australia

  • 3. School of Automation, Guangdong Polytechnic Normal University, Guangzhou, Guangdong, China

  • 4. Faculty of Medicine and Health Sciences, Universiti Putra Malaysia, UPM Selangor, Selangor, Malaysia

  • 5. Department of Electrical Engineering, Faculty of Engineering, Universiti Malaya, Kuala Lumpur, Malaysia


Abstract

Millions of people around the world suffer from depression. While early diagnosis is essential for timely intervention, it remains a significant challenge due to limited access to clinically diagnosed data and privacy restrictions on mental health records. These limitations hinder the training of robust AI models for depression detection. To tackle this, this article proposes a parallel transfer learning framework for depression detection that integrates BERT and T5 through a fusion mechanism, combining the complementary advantages of these two large language models (LLMs). By integrating their semantic embeddings, the method captures a broader range of linguistic cues from transcribed speech. These embeddings are processed through a model with two parallel branches: a one-dimensional convolutional neural network and a dense neural network are used to construct each branch for preliminary prediction, which are then fused for final prediction. Evaluations on the E-DAIC dataset demonstrate that the proposed method outperforms baseline models, achieving a 3.0% increase in accuracy (91.3%), a 6.9% increase in precision (95.2%), and a 1.7% improvement in F1-score (90.0%). The experimental results verify the effectiveness of BERT and T5 fusion in enhancing depression detection performance and highlight the potential of transfer learning for scalable and privacy-conscious mental health applications.

1 Introduction

Depression, also referred to as major depressive disorder (MDD), is a widespread mental illness that impacts a significant percentage of people globally. According to the statistics of the World Health Organization, approximately 332 million people in the world have depression (Herrman et al., 2019). Depression is characterized by symptoms such as persistent sadness, loss of interest in daily activities, irritability, hopelessness, changes in appetite or weight, and low self-worth. In more severe cases, it may also result in suicidal thoughts or actions.

Diagnosing depression is inherently complex, typically requiring the expertise of trained mental health professionals. While early detection is essential for effective intervention, access to timely diagnosis remains limited due to high costs and inadequate access to mental health care, particularly in rural and underserved areas (Semrau et al., 2019). Artificial intelligence (AI) has shown promise in recent years in various medical applications (Uppal et al., 2023; Zafar et al., 2023; Wang et al., 2025; Zhao et al., 2023), including mental health assessment. However, the development of AI-based depression detection systems is hindered by several challenges, chief among them the shortage of clinically validated data for model training.

Most datasets that are accessible to the general public for depression detection are relatively small and exhibit severe class imbalance. For instance, the Extended Distress Analysis Interview Corpus (E-DAIC) includes only 275 participants, of whom just 66 (24%) are diagnosed with MDD. Similarly, the Chinese Multimodal Depression Corpus (CMDC) and the Multimodal Open Dataset for Mental-health Analytics (MODMA) contain merely 78 and 52 participants, respectively, with fewer than half representing depressed individuals. The potential of AI models to generalize across various populations and therapeutic scenarios is severely limited by these dataset size and representativeness constraints.

Among various methods of detecting depression, the investigation of linguistic features in patient speech has been regarded as especially beneficial. Text features offer a richer level of interpretability than auditory or visual modalities. Linguistic markers, particularly absolutist terms such as "never" (e.g., "I NEVER want to wake up"; Adam-Troian et al., 2022) and expressions of desperation or fatigue (e.g., "I am so tired all the time"), have been found to correlate with depressive symptoms. Conventional clinical instruments, including the Beck Depression Inventory (Beck et al., 1996), Self-Rating Depression Scale (Zung, 1965), and Patient Health Questionnaire (PHQ; Spitzer et al., 1999), rely strongly on patient feedback in verbal or written format, thereby highlighting the central role that textual information plays in depression evaluation.

In spite of these strengths, variation in patients' language abilities, emotional expressions, and cultural contexts poses significant obstacles to the construction of representative textual datasets. Without sufficiently diverse samples and populations, artificial intelligence systems cannot pick up on subtle indicators of depression and may not generalize to real-life situations. Overcoming these limitations is necessary to fully harness the potential of AI-based depression intervention systems.

To address the limitations of insufficient training data, we propose a transfer learning approach based on deep pre-trained language models to strengthen performance on small, clinically annotated datasets. Specifically, our model utilizes two large language models (LLMs) to take advantage of their pre-trained knowledge, together with two different neural network designs, namely a one-dimensional convolutional neural network (1DCNN) and a fully connected neural network built from dense layers with a dropout mechanism, for automatic depression detection. By fine-tuning the model using diagnostic labels and harnessing the linguistic knowledge encoded in the LLMs, the proposed architecture enhances the system's ability to identify depressive symptoms from textual input.

The following is a summary of this research's main contributions.

  • Introducing a dual-stream transfer learning fusion framework that leverages the complementary strengths of two pre-trained large language models (LLMs), BERT and T5, combined with 1D convolutional and dense neural networks, enabling robust depression detection by capturing diverse linguistic representations.

  • Designing a lightweight logical “AND fusion” strategy for integrating the outputs of all four branches as a conservative agreement based decision rule, with the aim of enhancing prediction reliability and precision rather than introducing a novel ensemble mechanism.

  • Conducting comprehensive ablation studies to evaluate the individual and combined contributions of BERT and T5 embeddings, as well as different architectural variants, providing deeper insights into their effectiveness for clinical depression detection.

  • Benchmarking the proposed method against common machine learning models (including traditional machine learning models and deep learning models), as well as state-of-the-art studies on the E-DAIC dataset. The findings of this study offer a reference for future research on the application of LLMs in related domains.

This paper's remaining sections are arranged as follows: Section 2 provides an overview of previous studies pertinent to this research; Section 3 describes the workflow, dataset, data processing steps, and model structure; Section 4 outlines the experiment configurations and evaluation criteria; Section 5 presents the experimental results; and finally, Section 6 concludes the study with final remarks.

2 Related work

Speech text provides valuable insights for depression evaluation, which motivates researchers to develop automatic depression detection methods using textual data. Among the deep learning approaches, 1DCNN and variants of Recurrent Neural Networks (RNNs), such as Long Short-Term Memory (LSTM) and BiLSTM, have been widely employed for these tasks.

Convolutional neural networks, initially proposed for image processing (LeCun et al., 1989), were applied to one-dimensional data in the mid-2010s, tackling problems in natural language processing, time-series analysis, and signal processing. An early milestone in this direction was the work of Kim, who applied 1DCNNs to text classification in 2014 (Kim, 2014).

Complementing CNNs, Long Short-Term Memory (LSTM) networks, proposed by Hochreiter and Schmidhuber (1997), addressed the vanishing gradient issue of traditional RNNs, allowing the effective learning of long- and short-term correlations. Building on LSTM, Graves and Schmidhuber (2005) proposed BiLSTM, which combines forward and backward LSTMs to capture bidirectional dependencies, making it suitable for sequential data processing tasks, e.g., speech recognition, automatic translation, and classifications.

In 2023, Wani et al. explored depression screening using Word2Vec (Church, 2017) and TF-IDF (Baena-García et al., 2011) features combined with CNN and LSTM models (Ahmad Wani et al., 2023). They achieved an accuracy of 99.01% with Word2Vec and a CNN + LSTM model on data sourced from Facebook, Twitter, and YouTube. However, one of the study's limitations is its dependence on non-clinically diagnosed data, raising concerns about its applicability in clinical settings.

To address limitations in data and feature extraction, transfer learning techniques have emerged as a promising approach across various research domains and have shown effectiveness in a range of language-related applications.

The BERT model (Devlin et al., 2019), pre-trained on massive corpora such as BooksCorpus (Zhu et al., 2015) and English Wikipedia, has become an essential building block of transfer learning techniques. Its proficiency in learning from long, uninterrupted texts through document-level training has been key to its applicability in tasks that require deep contextual understanding. For instance, Milintsevich et al. (2023) applied RoBERTa (Liu et al., 2019), an adaptation of BERT, to depression prediction based on clinical transcripts from the DAIC dataset. Their model achieved a macro-F1 score of 73.9 in a binary setting, and the authors cited the need for clinically validated datasets.

Zhang and Guo (2024) presented the MDSD-FGPL algorithm, integrating BERT and T5 encoders in fine-grained prompt learning. The multi-tier detection approach yielded an F1-score of 0.8276 for binary classification, highlighting the advantages of encoder integration. In another investigation, Hadikhah Mozhdehi and Eftekhari Moghadam (2023) applied Emotional BERT to emotion recognition tasks on the Wang and MELD datasets, further demonstrating the effectiveness of transfer learning in deriving contextual knowledge from small datasets.

The T5 model, described by Raffel et al. (2020), has been widely applied to a variety of text-centered analytical tasks (Bao et al., 2024). The model's flexibility is further demonstrated by the Sensory-T5 version presented by Zhao et al. (2025), which incorporates sensory information to enhance emotion classification accuracy. This methodology has led to significant precision and F1 score improvement on various datasets.

In a similar effort, Lu et al. (2025) fine-tuned T5 for aspect-based sentiment analysis (ABSA) and achieved remarkable improvement through data augmentation techniques and implicit rationale-driven information management. Chawla et al. (2024) also identified emotional dimensions that influence the negotiation process, particularly with regard to outcome predictions such as satisfaction and perceptions of partners. Combining the CaSiNo dataset with emoticons, Linguistic Inquiry and Word Count features, and T5 models, they conducted a comparison in which T5-Reddit outperformed other models in detecting subtle emotional expressions.

As impressive as the breakthroughs achieved by leveraging pre-trained models have been across text analytical applications, most existing works rely on a single model with a specific architecture or task configuration. Although these models have delivered strong performances, overreliance on a solitary model can compromise generalizability, especially in cases like clinical depression detection, where annotated data are sparse and context is central.

Considering these limitations, the current work proposes an innovative dual-language-model integration: a methodological strategy that combines the benefits of two large pre-trained language models, BERT and T5. The models have complementary characteristics in structural design and learning methods. BERT is an encoder-based model designed to handle bidirectional contexts; its use of contextual features within a sentence reinforces its ability to capture subtle semantic and complex syntactic relationships. In contrast, T5 is a text-to-text transformer built on an encoder-decoder architecture that is well suited to generating and reconstructing text, demonstrating a strong ability to generalize and abstract across diverse tasks. By combining the models, the current approach leverages the accuracy of BERT in producing contextual embeddings (Devlin et al., 2019) and the versatility of T5 in capturing task-agnostic patterns through its unified text-to-text framework (Raffel et al., 2020), ultimately leading to more accurate and efficient detection of depression from transcribed speech.

3 Methodology

The proposed framework begins with pre-processing the raw text transcripts. As illustrated in Figure 1, the framework takes advantage of the complementary strengths of transformer-based pre-trained large language models, specifically BERT and T5, both widely recognized for their effectiveness in natural language processing. These models can learn subtle linguistic patterns, grammatical structures, and other language attributes (Rodríguez-Ibánez et al., 2023), and are assumed to capture transferable knowledge.

Figure 1

To effectively utilize this pre-trained knowledge for our task, both models are fine-tuned using pre-processed data tailored to depression prediction. The processed text is fed into the fine-tuned BERT and T5 encoders, generating two distinct sets of embeddings. These embeddings are subsequently passed through parallel 1DCNN and Dense branches to extract task-relevant features and produce preliminary predictions.

The resulting predictions are then passed to a fusion module that performs a logical “AND” operation for the final output.

3.1 Dataset and pre-processing

Clinically diagnosed datasets are scarce, as most publicly available datasets are derived from social media posts where labels are self-claimed. While these datasets offer valuable insights, they lack the reliability of clinically confirmed ground truth. Some datasets are built from samples collected from clinically diagnosed MDD patients; however, the majority of these are not publicly accessible. In contrast, the E-DAIC dataset is not only clinically reliable but also publicly available, with baseline models provided to enable fair comparisons of model performance.

The E-DAIC dataset, developed from clinical interviews, is designed to aid in diagnosing psychological disorders including anxiety, depression, and post-traumatic stress disorder (PTSD). It includes recordings of the conversations between the interviewers and the research subjects, along with transcripts capturing both parties' spoken words, providing a rich linguistic context for analysis. Compared to the DAIC dataset, E-DAIC offers a larger sample size, which enhances its utility for research.

The dataset includes recordings from 275 subjects, 66 of whom (35 males, 31 females) were diagnosed with depression at the time of recording, while 209 (135 males, 74 females) were categorized as healthy controls (HCs). The PHQ-8 score of each subject is provided. A subset of the sessions in the E-DAIC dataset was collected semi-automatically, with a virtual interviewer controlled by a human; the rest were gathered using an AI-controlled agent that functions completely independently, using automated modules for perception and behavior generation. The subjects comprised U.S. military veterans and recruits from the general public in the Los Angeles area.

Detailed summaries of the data collection methods and participant demographics are provided in Tables 1, 2.

Table 1

| Attribute | Details |
| --- | --- |
| Disorders | Depression, PTSD |
| Diagnosis basis | PHQ-8, PCL-C |
| Number of participants | 275 |
| Data modalities | Visual, Audio, Text |

Dataset information of the E-DAIC.

PCL-C, PTSD Checklist - Civilian Version.

Table 2

| Gender | HC | MDD | Positive ratio |
| --- | --- | --- | --- |
| Male | 135 | 35 | 20.6% |
| Female | 74 | 31 | 29.5% |
| Total | 209 | 66 | 24.0% |

Gender and MDD proportion of the E-DAIC dataset.

The following preprocessing steps were implemented:

  • Trimming irrelevant sections: the interview scripts' initial 90 s and last 40 s were eliminated. These portions predominantly contained answers to introductory remarks, general questions (e.g., “Where are you from?”), greetings, and goodbyes, which were deemed irrelevant for the analysis.

  • Interviewer speech removal: only participant utterances were retained for textual analysis, while interviewer speech was excluded.

  • Neutral sentence removal: sentences consisting of one neutral word only, e.g., “yes” or “ok,” were excluded, as they lacked meaningful information for classification.

  • Labeling: the scripts were assigned labels based on PHQ-8 binary scores, with “1” indicating MDD and “0” indicating HC.

  • Balancing classes: to achieve an equal distribution between MDD and HC samples and prevent model bias during training, we adopted a random undersampling strategy. Among the available options, we chose undersampling to preserve the full set of clinically diagnosed MDD cases. Ensuring that all MDD samples originate from genuine, clinically diagnosed patients is crucial in medically sensitive contexts such as ours (Fernández et al., 2015). In contrast, upsampling methods typically introduce synthetic data, which may compromise the authenticity and clinical reliability of the dataset.
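As an illustration, the trimming, filtering, labeling, and undersampling steps above can be sketched in Python. The transcript representation (speaker tag, start time in seconds, text) is an assumption for this sketch; the actual E-DAIC CSV field names may differ.

```python
import random

NEUTRAL = {"yes", "no", "ok", "okay", "mhm"}  # single neutral words to drop

def preprocess_session(utterances, phq8_binary):
    """Filter one interview transcript following the steps above.

    `utterances` is assumed to be a list of (speaker, start_sec, text)
    tuples parsed from an E-DAIC transcript file (hypothetical format);
    `phq8_binary` is 1 for MDD, 0 for HC.
    """
    kept = []
    if not utterances:
        return kept
    end_time = max(t for _, t, _ in utterances)
    for speaker, t, text in utterances:
        if t < 90 or t > end_time - 40:      # trim first 90 s and last 40 s
            continue
        if speaker != "Participant":          # keep participant speech only
            continue
        if text.strip().lower() in NEUTRAL:   # drop one-word neutral replies
            continue
        kept.append((text, phq8_binary))      # label from PHQ-8 binary score
    return kept

def undersample(samples, seed=0):
    """Randomly undersample the majority (HC) class to balance labels."""
    mdd = [s for s in samples if s[1] == 1]
    hc = [s for s in samples if s[1] == 0]
    random.Random(seed).shuffle(hc)
    return mdd + hc[:len(mdd)]
```

Undersampling, rather than synthetic upsampling, keeps every retained MDD sample traceable to a clinically diagnosed participant, as motivated above.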

Table 3 provides examples of the preprocessed text and labels.

Table 3

| Text transcript | Label |
| --- | --- |
| No, it's just too rough trying to pick up all the pieces | 1 |
| Sleeping all the time, eating too much, arguing, screaming at this | 1 |
| I just couldn't find work, and so I had to settle for doing that for right now | 1 |
| I applied from anywhere and everywhere | 1 |
| My parents just buried their daughter six months ago; they don't want to bury their other daughter | 1 |
| I play sports: volleyball, softball, biking, walking | 0 |
| I'm a little more rigid than most people, but it was okay, not bad | 0 |
| Maybe more disciplined. I can take orders very well. I'm not afraid of most situations | 0 |
| He knows exactly what I'm going through, and he's one hundred percent behind me, and I love him very much | 0 |
| Yeah, I mean they've always gave me great advice | 0 |

Examples of the pre-processed text data.

The E-DAIC dataset was selected for the current study because it is clinically trustworthy and publicly accessible. However, its small sample size makes it difficult to train deep learning models from scratch. To tackle this weakness, transfer learning was used to improve not only performance but also generalization ability. By leveraging knowledge embedded in pre-trained models, transfer learning considerably expands what can be learned from the E-DAIC dataset. In the current project, transfer learning is applied by fine-tuning two transformer-based LLMs: BERT and T5.

3.2 BERT and T5

Both BERT and T5 are pre-trained large language models (LLMs) based on the Transformer architecture. They act as contextual feature extractors for input text, generating dense vector representations. Table 4 summarizes the architectural and training differences between the original Transformer, B-base (BERT-base), and T-base (T5-base).

Table 4

| Aspect | Transformer base | B-base (BERT) | T-base (T5) |
| --- | --- | --- | --- |
| Architecture | Encoder-Decoder | Encoder-only | Encoder-Decoder |
| Contextualization | Unidirectional encoder and decoder | Bidirectional encoder | Bidirectional encoder, unidirectional decoder |
| Pre-training objective | None (original) | MLM + NSP | Span corruption |
| Model size (parameters) | 65M | 110M | 220M |

Key differences between transformer, BERT, and T5.

Let the input sequence be represented as:

$$X = (x_1, x_2, \ldots, x_n) \in \mathbb{T}^n,$$

where $\mathbb{T}$ denotes the token space. The contextual embeddings extracted by each model are denoted by:

$$E_B = f_{\mathrm{BERT}}(X) \in \mathbb{R}^{n \times d_B}, \qquad E_T = f_{\mathrm{T5}}(X) \in \mathbb{R}^{n \times d_T},$$

where $d_B$ and $d_T$ denote the hidden dimensions of the BERT and T5 encoders (768 for both base models).

To obtain fixed-size embeddings, we apply mean pooling across each token sequence:

$$\mathbf{e}_B = \frac{1}{n} \sum_{i=1}^{n} E_B[i], \qquad \mathbf{e}_T = \frac{1}{n} \sum_{i=1}^{n} E_T[i],$$

where $\mathbf{e}_B$ and $\mathbf{e}_T$ denote the sentence-level embeddings for the input $X$ from BERT and T5, respectively.

These vectors are saved in NumPy format (.npy) and serve as inputs to downstream classifiers. The embedding extraction process is summarized in Algorithm 1.

Algorithm 1
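The mean-pooling step can be sketched with NumPy, assuming token-level embeddings of shape (n, 768) have already been produced by the fine-tuned BERT and T5 encoders (loading the encoders themselves, e.g., via the `transformers` library, is omitted here):

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray) -> np.ndarray:
    """Average token-level embeddings of shape (n, d) into one (d,) vector."""
    return token_embeddings.mean(axis=0)

# Hypothetical token embeddings for one transcript: n = 5 tokens, d = 768.
rng = np.random.default_rng(0)
e_bert = mean_pool(rng.normal(size=(5, 768)))   # from the BERT encoder
e_t5 = mean_pool(rng.normal(size=(5, 768)))     # from the T5 encoder

# Save sentence-level embeddings in NumPy format for downstream classifiers.
np.save("bert_embedding.npy", e_bert)
np.save("t5_embedding.npy", e_t5)
```

Saving the pooled vectors as `.npy` files decouples the expensive embedding extraction from classifier training, matching the pipeline described above.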

3.3 Classification and fusion

To classify the embeddings, we use two neural network classifiers:

  • MC: a 1D Convolutional Neural Network (1DCNN),

  • MD: a fully connected Dense Neural Network (DNN).

The 1DCNN $M_C$ operates on the input vector $\mathbf{x} \in \mathbb{R}^d$ with kernel $\mathbf{w} \in \mathbb{R}^m$, producing an output $y_i$ at position $i$ as:

$$y_i = \sigma\left( \sum_{j=1}^{m} w_j \, x_{i+j-1} + b \right),$$

where $b$ is a bias term and $\sigma$ is the activation function.

Each model computes a probability score indicating the likelihood of depression.
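As a minimal numeric check of the 1D convolution described above (bias and activation omitted for clarity), a "valid" convolution over an embedding vector can be written as:

```python
import numpy as np

def conv1d_valid(x: np.ndarray, w: np.ndarray, b: float = 0.0) -> np.ndarray:
    """One 'valid' 1D convolution channel: y_i = sum_j w_j * x_{i+j-1} + b."""
    d, m = len(x), len(w)
    return np.array([x[i:i + m] @ w + b for i in range(d - m + 1)])

x = np.array([1.0, 2.0, 3.0, 4.0])   # toy input vector (d = 4)
w = np.array([0.5, -0.5])            # toy kernel (m = 2)
y = conv1d_valid(x, w)               # -> [-0.5, -0.5, -0.5]
```

In the actual model, many such kernels are learned jointly and followed by an activation and pooling; the toy values here are illustrative only.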

A final prediction is made using a conservative logical AND fusion:

$$\hat{y} = \bigwedge_{k=1}^{4} \mathbb{I}\left[ p_k \geq \tau \right],$$

where $p_k$ is the probability score produced by the $k$-th branch, $\tau \in [0, 1]$ is the decision threshold (default $\tau = 0.5$), and $\mathbb{I}[\cdot]$ is the indicator function returning 1 if the condition is true. A sample is labeled as MDD only if all four models agree.

The classification and fusion procedure is summarized in Algorithm 2.

Algorithm 2
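The AND fusion rule can be sketched directly in NumPy; the column ordering of the four branch probabilities is an assumption for this sketch:

```python
import numpy as np

def and_fusion(probs: np.ndarray, tau: float = 0.5) -> np.ndarray:
    """Conservative AND fusion over the four branch probabilities.

    `probs` has shape (n_samples, 4); columns are assumed to hold the
    depression probabilities from BERT+1DCNN, BERT+Dense, T5+1DCNN,
    and T5+Dense. A sample is labeled MDD (1) only if every branch
    reaches the threshold tau.
    """
    return (probs >= tau).all(axis=1).astype(int)

probs = np.array([
    [0.9, 0.8, 0.7, 0.6],   # all four branches agree -> 1
    [0.9, 0.9, 0.9, 0.4],   # one branch below tau    -> 0
])
print(and_fusion(probs))     # [1 0]
```

Because every branch must agree, this rule trades some recall for precision, which is consistent with the conservative design goal stated in the contributions.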

4 Experiment configurations and evaluation criteria

The experiment's hardware is a workstation with an AMD Ryzen 9 CPU and 64 GB RAM. The software environment is the Windows 11 operating system, Jupyter IDE, and Python 3.11. The models were trained using a CPU.

The experimental results are assessed using the following metrics: accuracy, precision, recall, and F1 score. These metrics offer a thorough comprehension of the model's functionality. The equations for these metrics are defined as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP},$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$

where TP, TN, FP, and FN represent the counts of true positives, true negatives, false positives, and false negatives in the predictions, respectively. Additionally, the study involves plotting the Receiver Operating Characteristic (ROC) curves for the learning models for comparison purposes. The ROC curve depicts the true positive rate (TPR) vs. the false positive rate (FPR) at various threshold settings. Furthermore, the area under the curve (AUC) was calculated. An increased AUC signifies improved overall performance.
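The four metrics follow directly from the confusion counts; a minimal sketch with illustrative counts (not results from this study):

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative example: 6 TP, 4 TN, 2 FP, 2 FN
acc, p, r, f1 = classification_metrics(6, 4, 2, 2)
# precision = recall = F1 = 0.75; accuracy = 10/14
```

In practice, library implementations such as scikit-learn's metric functions compute the same quantities from predicted and true labels.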

5 Experimental results

In this section, we examine and compare the traditional text feature extraction algorithms and transfer learning-based text embeddings using both conventional ML and deep learning models. The study evaluates six conventional machine learning models (Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF), Naive Bayes (NB), K-Nearest Neighbor (KNN), and Decision Tree (DT)) alongside six deep learning models designed for one-dimensional text data analysis: a fully connected dense neural network, a 1DCNN, a recurrent neural network (RNN), a gated recurrent unit (GRU), an LSTM, and a bidirectional long short-term memory network (BiLSTM).

5.1 Experiments of traditional text feature extraction algorithms

To evaluate transfer learning's efficacy in improving models' performance in this task, we first adopted the commonly used traditional text feature extraction algorithms, namely term frequency-inverse document frequency (TF-IDF), Keras frequency-based tokenizer (KFT), Bag of Words (BoW), and N-grams on both conventional MLs and deep learning models.
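A baseline of this kind, e.g., TF-IDF features fed to a Random Forest, can be sketched with scikit-learn (the toy transcripts below are illustrative stand-ins for the pre-processed E-DAIC utterances, not real data):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Illustrative toy utterances with hypothetical PHQ-8 binary labels.
texts = ["I am so tired all the time", "I never want to wake up",
         "I play volleyball and go biking", "they gave me great advice"]
labels = [1, 1, 0, 0]

# TF-IDF feature extraction chained with a Random Forest classifier.
baseline = make_pipeline(TfidfVectorizer(),
                         RandomForestClassifier(random_state=0))
baseline.fit(texts, labels)
preds = baseline.predict(texts)
```

The other traditional features (BoW, N-grams, KFT) slot into the same pipeline by swapping the vectorizer.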

Figure 2 presents the ROC curves generated using traditional text features for both conventional ML models and deep learning models. Among the conventional ML models, RF and DT demonstrated superior performance, with RF achieving the highest AUC of 94.5% when using KFT features. In contrast, the deep learning models generally underperformed, with the notable exception of the fully connected Dense Neural Network, which achieved an AUC of 94.0% using BoW features.

Figure 2

The RF model excels with a higher true positive rate (TPR) at lower false positive rates (FPR), reflecting its efficiency in correctly identifying positive samples. This balance between sensitivity and specificity underscores its effectiveness as a classifier. Furthermore, RF consistently achieves the highest AUC across various feature extraction methods, demonstrating its robustness to diverse data representations and its superior ability to distinguish between classes.

The deep learning models, with the exception of the fully connected dense network, performed poorly when tested with traditional text feature extraction algorithms. The results of the other deep learning models are not reported, as their poor performance suggests that combining traditional text feature extraction algorithms with these models is not an effective approach for this task. Among the tested models, the dense neural network using BoW achieved an accuracy of 85.9%, outperforming all other deep learning models.

Table 5 reports the accuracy, precision, recall, and F1-score of the learning models using traditional text feature extraction algorithms. As shown, the highest accuracy was achieved by RF using TF-IDF, with an accuracy of 88.3%. This method (TF-IDF + RF) is selected as the baseline for evaluating the effectiveness of transfer learning. Overall, the results indicate that conventional ML models outperform deep learning models when traditional extraction algorithms are applied.

Table 5

| Model | Feature | Accuracy (%) | Precision (%) | Recall (%) | F1 (%) |
| --- | --- | --- | --- | --- | --- |
| LR | TF-IDF | 73.0 | 72.8 | 72.7 | 72.7 |
|  | KFT | 65.1 | 64.9 | 64.6 | 64.6 |
|  | BoW | 77.9 | 77.9 | 77.5 | 77.6 |
|  | N-gram | 81.0 | 80.9 | 80.8 | 80.8 |
| SVM | TF-IDF | 75.7 | 75.7 | 75.8 | 75.6 |
|  | KFT | 64.9 | 64.7 | 64.6 | 64.7 |
|  | BoW | 81.2 | 81.2 | 81.0 | 81.0 |
|  | N-gram | 83.5 | 83.4 | 83.3 | 83.4 |
| RF | TF-IDF | 88.3* | 88.3* | 88.3* | 88.3* |
|  | KFT | 86.8 | 86.8 | 86.8 | 86.8 |
|  | BoW | 86.1 | 86.1 | 86.1 | 86.1 |
|  | N-gram | 86.7 | 86.6 | 86.7 | 86.6 |
| NB | TF-IDF | 74.6 | 74.6 | 74.7 | 74.5 |
|  | KFT | 59.2 | 65.0 | 56.3 | 50.5 |
|  | BoW | 74.1 | 74.1 | 74.3 | 74.1 |
|  | N-gram | 73.7 | 73.5 | 73.5 | 73.5 |
| KNN | TF-IDF | 57.1 | 63.6 | 53.9 | 45.6 |
|  | KFT | 66.3 | 66.1 | 65.8 | 65.8 |
|  | BoW | 66.9 | 68.2 | 65.4 | 64.9 |
|  | N-gram | 65.4 | 66.9 | 63.8 | 62.9 |
| DT | TF-IDF | 85.3 | 85.6 | 85.7 | 85.3 |
|  | KFT | 86.1 | 86.1 | 86.4 | 86.1 |
|  | BoW | 86.0 | 86.1 | 86.3 | 85.9 |
|  | N-gram | 86.7 | 86.7 | 86.7 | 86.5 |
| Dense | TF-IDF | 85.2 | 85.4 | 85.3 | 85.2 |
|  | KFT | 67.0 | 70.1 | 67.6 | 67.0 |
|  | BoW | 85.9 | 86.0 | 85.9 | 85.9 |
|  | N-gram | 85.5 | 85.5 | 85.5 | 85.5 |

The performance of learning models with traditional text feature algorithms (Non-TL).

For each model, the highest scores are highlighted in bold.

*denotes the highest performance across all learning models and text features.

5.2 Experiments of transfer learning methods

Next, we investigate the effect of transfer learning by adopting two pre-trained models: BERT and T5. BERT is an encoder-only model optimized for bidirectional contextual understanding, while T5 follows a sequence-to-sequence, text-to-text pre-training paradigm based on span corruption. Including both models allows us to examine whether generative-style pre-training yields complementary representations for depression-related language beyond encoder-only architectures. For a fair comparison, the embeddings generated from BERT and T5 were used to train the same conventional ML and deep learning models.

In Table 6, we observe that the Dense model performs best across most metrics when using BERT embeddings, while the 1DCNN outperforms all models with T5 embeddings, achieving the highest accuracy, F1-score, and AUC. Overall, the 1DCNN emerges as the most robust model, excelling particularly with T5 embeddings, while the dense layer consistently delivers strong performance with both embeddings, showing a slight edge with BERT. In contrast, RNN, LSTM, and BiLSTM models consistently underperform compared to other models regardless of the embedding type.

Table 6

| Model | Embedding | Accuracy (%) | Precision (%) | Recall (%) | F1 (%) |
| --- | --- | --- | --- | --- | --- |
| LR | BERT | 69.04 | 66.31 | 67.03 | 66.67 |
|  | T5 | 72.47 | 70.01 | 70.68 | 70.34 |
| SVM | BERT | 69.79 | 67.25 | 67.43 | 67.34 |
|  | T5 | 71.29 | 68.87 | 69.05 | 68.96 |
| RF | BERT | 86.10 | 84.25 | 86.76 | 85.49 |
|  | T5 | 87.27 | 85.17 | 87.70 | 86.42 |
| NB | BERT | 61.55 | 57.23 | 66.35 | 61.45 |
|  | T5 | 63.17 | 58.87 | 67.30 | 62.80 |
| KNN | BERT | 70.66 | 68.24 | 68.24 | 68.24 |
|  | T5 | 69.41 | 66.32 | 68.65 | 67.46 |
| DT | BERT | 86.83 | 82.53 | 90.68 | 86.41 |
|  | T5 | 85.58 | 80.99 | 89.86 | 85.20 |
| Dense | BERT | 87.33 | 84.38 | 89.05 | 86.65 |
|  | T5 | 86.14 | 83.29 | 87.57 | 85.38 |
| 1DCNN | BERT | 89.45* | 88.02* | 89.32 | 88.67* |
|  | T5 | 88.45 | 86.56 | 88.78 | 87.66 |
| RNN | BERT | 85.46 | 80.88 | 89.73 | 85.07 |
|  | T5 | 77.65 | 82.15 | 65.95 | 73.16 |
| GRU | BERT | 85.21 | 78.74 | 93.11* | 85.33 |
|  | T5 | 85.02 | 81.65 | 87.16 | 84.31 |
| LSTM | BERT | 67.04 | 64.64 | 63.24 | 63.93 |
|  | T5 | 67.29 | 66.12 | 59.86 | 62.84 |
| BiLSTM | BERT | 52.25 | 48.33 | 48.78 | 48.55 |
|  | T5 | 72.97 | 67.79 | 79.05 | 72.99 |

Comparison of the models using transfer learning.

The highest value for each model is highlighted in bold.

Arrows (↑/↓) indicate the increase or decrease compared to the highest score of the corresponding non-TL method for the same learning model. *denotes the highest performance across all learning models using TL embeddings.

Traditional machine learning methods are routinely outperformed by deep learning models, especially the 1DCNN and Dense networks. The 1DCNN (BERT) achieves the highest accuracy (89.45%) and F1-score (88.67%), highlighting its robustness in this task. Among traditional ML models, RF and DT perform better than the others but do not improve over their non-transfer-learning counterparts. While GRU (BERT) excels in recall (93.11%), minimizing false negatives, models like RNN and LSTM show moderate performance. Simpler models such as Naive Bayes (NB) exhibit significant drops in F1-score.

The comparison between non-TL and TL models reveals distinct performance patterns. Traditional models such as RF and DT show competitive results in the non-TL setting, particularly when combined with TF-IDF features, with RF achieving the highest scores in accuracy, precision, recall, and F1-score. This suggests these models are well-suited to conventional feature representations.

In contrast, TL models using BERT and T5 embeddings consistently outperform non-TL approaches. Among them, 1DCNN with BERT embeddings achieves the highest accuracy (89.45%) and F1-score (88.67%), surpassing the best non-TL setup (RF + TF-IDF: 88.3%). This highlights the benefit of leveraging contextual embeddings for improved classification accuracy.

As illustrated in Figure 3, TL-based models produce ROC curves that shift further left, corresponding to higher AUC scores and improved true positive rates, especially at low false positive rates. This behavior underscores the increased sensitivity and robustness of transfer learning in text-based depression detection.

Figure 3

Given the effectiveness of 1DCNN and dense models using BERT and T5 embeddings in the above experiments, these models emerged as promising candidates for developing a superior approach to this task. As a result, we selected them for further experiments.

As illustrated in Figure 4, the confusion matrix of the proposed dual-stream model demonstrates relatively balanced performance across both classes. Notably, the number of false positives (13) is substantially lower than that of false negatives (113), indicating that the model prioritizes precision, consistent with the conservative AND fusion strategy. This is particularly advantageous in screening contexts where minimizing false alarms and unnecessary clinical referrals is critical.

Figure 4

In terms of ROC performance, the fusion approach, which integrates predictions from four branches (BERT + Dense, BERT + 1DCNN, T5 + Dense, and T5 + 1DCNN), achieved the highest AUC of 0.956, as shown in Figure 4, outperforming each individual model. By shifting the ROC curve further toward the top-left corner, the fusion method yields a higher true positive rate at a lower false positive rate.
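The precision/recall trade-off discussed above follows directly from the confusion-matrix definitions. A small sketch makes this concrete; the counts below are illustrative values chosen to roughly reproduce the reported precision and recall, not the paper's actual confusion matrix.

```python
# Precision, recall, and F1 from confusion-matrix counts (illustrative numbers).
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# A precision-leaning classifier: few false positives, more false negatives.
p, r, f1 = precision_recall(tp=260, fp=13, fn=41)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

With few false positives relative to false negatives, precision lands near 0.95 while recall sits lower, mirroring the pattern reported for the fused model.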

5.3 Ablation study

As observed, integrating BERT and T5 embeddings with 1DCNN and Dense architectures demonstrates strong discriminative ability, achieving high scores across all evaluation metrics and highlighting their potential as a promising approach. To further optimize the architecture, an ablation study was carried out to identify the best structure for the task.

To assess the embeddings themselves, free of architectural bias, baseline classifiers were first trained directly on the BERT and T5 representations. BERT performed strongly, achieving an accuracy of 88.8%, an F1-score of 88.0%, and an AUC of 92.8%; its recall of 89.5% indicates good ability to correctly identify positive cases, making it well suited to settings where false negatives must be minimized. T5, by contrast, performed far worse, with an accuracy of 64.1%, an F1-score of 63.1%, and an AUC of 65.6%. This gap shows that T5 embeddings alone struggle to separate the classes, motivating the architectural variants examined next.
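Both baselines consume fixed-length sentence embeddings derived from the encoders. One common recipe, and an assumption here since the text does not spell out its pooling, is mask-aware mean pooling of the encoder's last hidden states. Sketch below, with a random array standing in for actual BERT/T5 output:

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Mask-aware mean pooling: average token vectors, ignoring padding."""
    mask = attention_mask[..., None].astype(hidden_states.dtype)  # (B, T, 1)
    summed = (hidden_states * mask).sum(axis=1)                   # (B, H)
    counts = mask.sum(axis=1).clip(min=1e-9)                      # (B, 1)
    return summed / counts

rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 6, 768))     # stand-in for encoder last hidden states
mask = np.array([[1, 1, 1, 0, 0, 0],      # first sample: 3 real tokens + padding
                 [1, 1, 1, 1, 1, 1]])     # second sample: no padding
emb = mean_pool(hidden, mask)
print(emb.shape)  # (2, 768)
```

The masking step matters: without it, padded positions would dilute the average and make embedding quality depend on sequence length.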

Adding one-dimensional convolutional neural network (1DCNN) layers improved both models, most dramatically T5. BERT + 1DCNN achieved an accuracy of 89.5%, surpassing the 88.8% of the BERT baseline, together with an F1-score of 88.7%. Although its recall of 89.3% was marginally lower than that of several other models, its AUC of 95.1% indicates strong overall performance. Similarly, T5 + 1DCNN reached an accuracy of 88.5% and an F1-score of 87.7%, roughly on par with the BERT baseline, and its AUC of 95.5% was slightly higher than that of its BERT counterpart.

Dense neural networks, though not outperforming the 1DCNN models on every metric, still produced competitive results. BERT + Dense achieved an accuracy of 87.3%, an F1-score of 86.7%, and an AUC of 93.1%, while T5 + Dense attained an accuracy of 86.1%, an F1-score of 85.4%, and an AUC of 93.0%. Finally, our proposed approach sends the embeddings from BERT and T5 through two parallel streams, each comprising two branches, one using a 1DCNN and the other a dense network, before combining their outputs.
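Our reading of that four-branch layout can be sketched in PyTorch. Layer sizes, pooling choices, and the averaging fusion are illustrative assumptions; the paper's exact hyperparameters and fusion rule may differ.

```python
import torch
import torch.nn as nn

class DenseBranch(nn.Module):
    """Dense branch over a pooled sentence embedding."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, pooled):                 # (B, dim)
        return self.net(pooled)                # (B, 1) logit

class Conv1DBranch(nn.Module):
    """1DCNN branch over token-level embeddings."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv1d(dim, 64, kernel_size=3, padding=1)
        self.head = nn.Linear(64, 1)

    def forward(self, tokens):                 # (B, T, dim)
        x = self.conv(tokens.transpose(1, 2))  # Conv1d expects (B, dim, T)
        x = torch.relu(x).amax(dim=2)          # global max pool -> (B, 64)
        return self.head(x)                    # (B, 1) logit

class DualStreamFusion(nn.Module):
    """Four parallel branches (BERT/T5 x Dense/1DCNN), fused by averaging."""
    def __init__(self, bert_dim=768, t5_dim=768):
        super().__init__()
        self.bert_dense, self.t5_dense = DenseBranch(bert_dim), DenseBranch(t5_dim)
        self.bert_cnn, self.t5_cnn = Conv1DBranch(bert_dim), Conv1DBranch(t5_dim)

    def forward(self, bert_tokens, t5_tokens):
        probs = torch.stack([
            torch.sigmoid(self.bert_dense(bert_tokens.mean(dim=1))),
            torch.sigmoid(self.t5_dense(t5_tokens.mean(dim=1))),
            torch.sigmoid(self.bert_cnn(bert_tokens)),
            torch.sigmoid(self.t5_cnn(t5_tokens)),
        ])
        return probs.mean(dim=0)               # (B, 1) fused probability

model = DualStreamFusion()
out = model(torch.randn(4, 16, 768), torch.randn(4, 16, 768))
print(out.shape)
```

Averaging the four branch probabilities is the simplest agreement-style fusion; a learned fusion layer over the concatenated logits would be a natural variant.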

Integrating the four model combinations (BERT + 1DCNN, T5 + 1DCNN, BERT + Dense, and T5 + Dense) yielded superior performance compared to any single configuration. The combined model achieved an accuracy of 91.3%, an F1-score of 90.0%, and an AUC of 95.6%. Notably, its precision reached 95.2%, indicating a strong ability to reduce false positives and enhance the reliability of detected depressive cases. This emphasis on precision helps minimize unnecessary referrals and ensures that individuals identified as depressed are highly likely to require clinical attention. We acknowledge that the moderate decline in recall (from 89.5% to 86.4%) is the cost of this gain in precision; however, the improvements on the other metrics, particularly F1 and AUC, reflect more balanced and discriminative overall performance.

The confusion matrix and ROC curves of the proposed method are shown in Figure 4, while the accuracy, precision, recall, and F1-score are reported in Table 7. To provide further context on computational feasibility, we also report the training time (TM) and model size (MS) of each branch in Table 7. These values quantify the per-epoch cost and the number of trainable parameters. For the ensemble, TM and MS depend on the training strategy: if the branches are trained sequentially, the cost is the sum over all branches, whereas in a parallel setup it corresponds to the slowest branch. Detailed runtime per epoch and reproducibility logs are available in the accompanying GitHub repository.
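Using the per-branch TM values reported in Table 7, the ensemble's per-epoch cost under the two training strategies works out as follows (a small check; embedding extraction is treated as a shared preprocessing step and excluded):

```python
# Per-epoch training time (s) of the four classifier branches from Table 7
# (embeddings precomputed; extraction cost is shared and excluded here).
branch_tm = {
    "BERT + Dense": 161.5,
    "T5 + Dense": 161.2,
    "BERT + 1DCNN": 49.0,
    "T5 + 1DCNN": 48.8,
}

sequential_tm = sum(branch_tm.values())  # branches trained one after another
parallel_tm = max(branch_tm.values())    # branches trained concurrently

print(f"sequential: {sequential_tm:.1f} s, parallel: {parallel_tm:.1f} s")
```

So a sequential schedule costs about 420.5 s per epoch, while a fully parallel one is bounded by the slowest branch at 161.5 s.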

Table 7

| Model | Acc. | Prec. | Rec. | F1 | AUC | TM (s) | MS (M) |
|---|---|---|---|---|---|---|---|
| BERT | 88.8 | 86.6 | 89.5 | 88.0 | 92.8 | 1844.8 | 110 |
| T5 | 64.1 | 68.3 | 64.1 | 63.1 | 65.6 | 4519.0 | 223 |
| BERT + Dense | 87.3 | 84.4 | 89.1 | 86.7 | 93.1 | 161.5a | 116 |
| T5 + Dense | 86.1 | 83.3 | 87.6 | 85.4 | 93.0 | 161.2a | 229 |
| BERT + 1DCNN | 89.5 | 88.0 | 89.3 | 88.7 | 95.1 | 49.0a | 111 |
| T5 + 1DCNN | 88.5 | 86.6 | 88.8 | 87.7 | 95.5 | 48.8a | 223 |
| (Proposed) | 91.3 | 95.2 | 86.4 | 90.0 | 95.6 | N/A | N/A |

Ablation study.

The highest value for each model is highlighted in bold.

TM, training time; MS, model size.

aTraining time of Dense or 1DCNN using extracted embeddings from BERT or T5.

5.4 Comparison with the state-of-the-art methods

To provide a more comprehensive assessment of the proposed approach, we additionally experimented with a Generative Pre-trained Transformer (GPT)-based model and a temporal graph convolutional network (TGCN), while also replicating several state-of-the-art baselines for direct comparison. Table 8 summarizes the evaluation metrics.

Table 8

| Study | Method | Accuracy (%) | Precision (%) | Recall (%) | F1 (%) |
|---|---|---|---|---|---|
| Zhang and Guo (2024) | BERT, T5, Dense | 89.13 | 80.0 | 85.71 | 82.76 |
| Ours | BERT, T5, Dense | 88.2 | 88.2 | 88.4 | 88.2 |
| Milintsevich et al. (2023) | S-RoBERTa, BiLSTM | — | — | — | 80.6 |
| Ours | S-RoBERTa, BiLSTM | 86.1 | 86.1 | 86.3 | 86.1 |
| Villatoro-Tello et al. (2021) | Multi-layer Perceptron | — | 87.0 | 81.0 | 83.0 |
| Ours | Multi-layer Perceptron | 84.9 | 84.8 | 84.9 | 84.8 |
| Rai et al. (2024) | BERT, BiLSTM | — | 83.0 | 83.0 | 83.0 |
| Ours | BERT, BiLSTM | 73.5 | 72.0 | 69.9 | 71.0 |
| Li et al. (2024) | Heterogeneous graph Att. | — | 79.0 | 80.0 | 79.0 |
| Ours | Heterogeneous graph Att. | 77.7 | 78.4 | 77.0 | 77.2 |
| Ours | LLaMA (GPT-based) | 86.3 | 83.1 | 88.0 | 85.5 |
| Ours | TGCN + BERT | 74.3 | 71.2 | 74.6 | 72.8 |
| Ours | TGCN + T5 | 77.8 | 75.5 | 76.9 | 76.2 |
| Proposed | BERT, T5, Dense, 1DCNN | 91.3 | 95.2 | 86.4 | 90.0 |

Comparison with the state-of-the-art methods.

Att.: Attention mechanism.

Compared to previous works, our dual-stream fusion approach demonstrates stronger generalization by effectively combining contextual textual embeddings (from BERT and T5) with convolutional representations that capture local semantic patterns. The GPT-based model (LLaMA) also achieved competitive results (F1 = 85.5%), highlighting the potential of generative large language models for depression detection. The TGCN variants incorporating BERT and T5 embeddings achieved moderate but consistent performance (F1 = 72.8% and 76.2%, respectively), suggesting that temporal graph reasoning can capture sequential dependencies in dialogue-level or time-series contexts.

When comparing against other state-of-the-art methods, the following two observations can be drawn:

  • Transformers dominate: models such as BERT, T5, and S-RoBERTa continue to lead in performance owing to their superior contextual representation and transfer-learning capabilities.

  • Hybrid architectures remain strong: many competitive systems combine transformers with BiLSTM, dense, or convolutional layers to exploit both contextual depth and sequential or local feature learning.

6 Discussion and conclusion

This study presents a dual-stream transfer learning framework for text-based depression detection, leveraging transformer-based large language models.

Transfer learning, particularly through BERT and T5, offers substantial advantages over non-TL approaches, especially in tasks requiring nuanced interpretation of linguistic patterns. Although models such as RF and DT perform competitively in the non-TL setting, they are outperformed by TL-based models in terms of recall and F1. This highlights the strength of contextual embeddings in capturing subtle depressive cues.

When T5 embeddings were processed through a 1DCNN architecture, performance improved substantially. The T5 + 1DCNN model achieved an accuracy of 88.5%, an F1-score of 87.7%, and an AUC of 95.5%, a large gain over standalone T5, which showed much weaker performance (accuracy: 64.1%, F1: 63.1%, AUC: 65.6%). These results suggest that convolutional layers effectively complement T5 embeddings by extracting local sequential patterns that improve representation quality.

The integration of BERT and T5 embeddings through parallel Dense and 1DCNN branches led to further performance improvements. The resulting fusion model, which combines all four branches, achieved the best results across all evaluation metrics, including accuracy (91.3%), F1-score (90.0%), and AUC (95.6%). Its high precision (95.2%) indicates strong reliability in minimizing false positive predictions. Although high recall is critical for depression screening, the proposed model balances sensitivity with improved precision to reduce false alarms in decision-support use. It is important to emphasize that the combination of BERT and T5 is not intended to assert the theoretical optimality of a specific model pairing. Instead, this design aims to reduce reliance on a single pre-training bias by leveraging heterogeneous semantic encoders. Such an agreement-oriented fusion strategy enhances robustness when modeling subtle and implicitly expressed linguistic markers of depression.

Compared to the best-performing non-TL model (RF + TF-IDF, accuracy: 88.3%), the proposed fusion method improved accuracy by 3.0%, demonstrating its superior effectiveness in real-world, low-resource clinical datasets.

In summary, transfer learning with pre-trained LLMs significantly enhances the capability of automated depression detection systems. By combining BERT and T5 within a dual-stream architecture, this study demonstrates a robust approach that outperforms both traditional and deep learning baselines. Importantly, the proposed framework has potential clinical relevance: it may support early-stage depression screening, reduce the burden on clinicians by providing automated pre-assessment tools, and enable scalable integration into telehealth and digital platforms for mental health monitoring.

Looking ahead, future research could explore the generalizability of this method to other mental health conditions, its extension to multimodal settings (e.g., combining text, audio, and video), and its deployment in clinical environments for scalable and early-stage mental health assessment.

Statements

Data availability statement

The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.

Author contributions

NW: Conceptualization, Methodology, Software, Writing – original draft. WZ: Conceptualization, Methodology, Writing – original draft, Writing – review & editing. RK: Supervision, Writing – review & editing. IR: Writing – review & editing. SA: Supervision, Writing – review & editing. NI: Conceptualization, Methodology, Writing – original draft. ZZ: Software, Writing – review & editing.

Funding

The author(s) declared that financial support was not received for this work and/or its publication.

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

1. Adam-Troian, J., Bonetto, E., and Arciszewski, T. (2022). Using absolutist word frequency from online searches to measure population mental health dynamics. Sci. Rep. 12:2619. doi: 10.1038/s41598-022-06392-4

2. Ahmad Wani, M., ELAffendi, M. A., Shakil, K. A., Shariq Imran, A., and Abd El-Latif, A. A. (2023). Depression screening in humans with AI and deep learning techniques. IEEE Trans. Comput. Soc. Syst. 10, 2074–2089. doi: 10.1109/TCSS.2022.3200213

3. Baena-García, M., Carmona-Cejudo, J. M., Castillo, G., and Morales-Bueno, R. (2011). "TF-SIDF: term frequency, sketched inverse document frequency," in 2011 11th International Conference on Intelligent Systems Design and Applications (Cordoba: IEEE), 1044–1049. doi: 10.1109/ISDA.2011.6121796

4. Bao, E., Pérez, A., and Parapar, J. (2024). Explainable depression symptom detection in social media. Health Inform. Sci. Syst. 12:47. doi: 10.1007/s13755-024-00303-9

5. Beck, A. T., Steer, R. A., Ball, R., and Ranieri, W. F. (1996). Comparison of Beck Depression Inventories-IA and -II in psychiatric outpatients. J. Pers. Assess. 67, 588–597. doi: 10.1207/s15327752jpa6703_13

6. Chawla, K., Clever, R., Ramirez, J., Lucas, G. M., and Gratch, J. (2024). Towards emotion-aware agents for improved user satisfaction and partner perception in negotiation dialogues. IEEE Trans. Affect. Comput. 15, 433–444. doi: 10.1109/TAFFC.2023.3238007

7. Church, K. W. (2017). Word2Vec. Nat. Lang. Eng. 23, 155–162. doi: 10.1017/S1351324916000334

8. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: pre-training of deep bidirectional transformers for language understanding.

9. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., et al. (2019). RoBERTa: a robustly optimized BERT pretraining approach.

10. Fernández, A., García, S., Galar, M., Prati, R. C., Krawczyk, B., and Herrera, F. (2015). "Learning from imbalanced data sets," in Springer Handbook of Computational Intelligence (Heidelberg: Springer), 993–1014.

11. Graves, A., and Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 18, 602–610. doi: 10.1016/j.neunet.2005.06.042

12. Hadikhah Mozhdehi, M., and Eftekhari Moghadam, A. (2023). Textual emotion detection utilizing a transfer learning approach. J. Supercomput. 79, 13075–13089. doi: 10.1007/s11227-023-05168-5

13. Herrman, H., Kieling, C., McGorry, P., Horton, R., Sargent, J., and Patel, V. (2019). Reducing the global burden of depression: a Lancet-World Psychiatric Association commission. Lancet 393, e42–e43. doi: 10.1016/S0140-6736(18)32408-5

14. Hochreiter, S., and Schmidhuber, J. (1997). Long short-term memory. Neural Comput. 9, 1735–1780. doi: 10.1162/neco.1997.9.8.1735

15. Kim, Y. (2014). "Convolutional neural networks for sentence classification," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), eds A. Moschitti, B. Pang, and W. Daelemans (Doha: Association for Computational Linguistics), 1746–1751. doi: 10.3115/v1/D14-1181

16. LeCun, Y., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W., et al. (1989). Handwritten digit recognition with a back-propagation network. Adv. Neural Inf. Process. Syst. 2, 396–404.

17. Li, M., Sun, X., and Wang, M. (2024). Detecting depression with heterogeneous graph neural network in clinical interview transcript. IEEE Trans. Comput. Soc. Syst. 11, 1315–1324. doi: 10.1109/TCSS.2023.3263056

18. Lu, H., Liu, T., Cong, R., Yang, J., Gan, Q., Fang, W., et al. (2025). QAIE: LLM-based quantity augmentation and information enhancement for few-shot aspect-based sentiment analysis. Inform. Process. Manag. 62:103917. doi: 10.1016/j.ipm.2024.103917

19. Milintsevich, K., Sirts, K., and Dias, G. (2023). Towards automatic text-based estimation of depression through symptom prediction. Brain Inform. 10:4. doi: 10.1186/s40708-023-00185-9

20. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., et al. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 1–67.

21. Rai, B. K., Jain, I., Tiwari, B., and Saxena, A. (2024). Multimodal mental state analysis. Health Serv. Outcomes Res. Methodol. 25, 85–112. doi: 10.1007/s10742-024-00329-2

22. Rodríguez-Ibánez, M., Casánez-Ventura, A., Castejón-Mateos, F., and Cuenca-Jiménez, P.-M. (2023). A review on sentiment analysis from social media platforms. Expert Syst. Appl. 223:119862. doi: 10.1016/j.eswa.2023.119862

23. Semrau, M., Alem, A., Ayuso-Mateos, J. L., Chisholm, D., Gureje, O., Hanlon, C., et al. (2019). Strengthening mental health systems in low- and middle-income countries: recommendations from the Emerald programme. BJPsych Open 5:e73. doi: 10.1192/bjo.2018.90

24. Spitzer, R. L., Kroenke, K., and Williams, J. B. (1999). Validation and utility of a self-report version of PRIME-MD: the PHQ primary care study. JAMA 282, 1737–1744. doi: 10.1001/jama.282.18.1737

25. Uppal, M., Gupta, D., Juneja, S., Gadekallu, T. R., Bayoumy, I. E., Hussain, J., et al. (2023). Enhancing accuracy in brain stroke detection: multi-layer perceptron with Adadelta, RMSProp and AdaMax optimizers. Front. Bioeng. Biotechnol. 11:1257591. doi: 10.3389/fbioe.2023.1257591

26. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 1–15.

27. Villatoro-Tello, E., Ramírez-de-la Rosa, G., Gática-Pérez, D., Magimai-Doss, M., and Jiménez-Salazar, H. (2021). "Approximating the mental lexicon from clinical interviews as a support tool for depression detection," in Proceedings of the 2021 International Conference on Multimodal Interaction, ICMI '21 (New York, NY: Association for Computing Machinery), 557–566. doi: 10.1145/3462244.3479896

28. Wang, N., Kamil, R., Al-Haddad, S. A. R., Ibrahim, N., and Zhao, Z. (2025). Enhancing AI depression detection using transfer learning. Contemp. Math. 6, 3054–3080. doi: 10.37256/cm.6320256184

29. Zafar, A., Hussain, S. J., Ali, M. U., and Lee, S. W. (2023). Metaheuristic optimization-based feature selection for imagery and arithmetic tasks: an fNIRS study. Sensors 23:3714. doi: 10.3390/s23073714

30. Zhang, J., and Guo, Y. (2024). Multilevel depression status detection based on fine-grained prompt learning. Pattern Recognit. Lett. 178, 167–173. doi: 10.1016/j.patrec.2024.01.005

31. Zhao, Q., Xia, Y., Long, Y., Xu, G., and Wang, J. (2025). Leveraging sensory knowledge into text-to-text transfer transformer for enhanced emotion analysis. Inform. Process. Manag. 62:103876. doi: 10.1016/j.ipm.2024.103876

32. Zhao, Z., Chuah, J. H., Lai, K. W., Chow, C.-O., Gochoo, M., Dhanalakshmi, S., et al. (2023). Conventional machine learning and deep learning in Alzheimer's disease diagnosis using neuroimaging: a review. Front. Comput. Neurosci. 17:1038636. doi: 10.3389/fncom.2023.1038636

33. Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., et al. (2015). "Aligning books and movies: towards story-like visual explanations by watching movies and reading books," in Proceedings of the IEEE International Conference on Computer Vision (Santiago: IEEE), 19–27. doi: 10.1109/ICCV.2015.11

34. Zung, W. W. (1965). A self-rating depression scale. Arch. Gen. Psychiatry 12, 63–70. doi: 10.1001/archpsyc.1965.01720310065008

Keywords

1DCNN, BERT, depression, E-DAIC, T5, text, transfer learning, transformer

Citation

Wang N, Zhang W, Kamil R, Renner I, Abdul Rahman Al-Haddad S, Ibrahim N and Zhao Z (2026) Depression detection through dual-stream modeling with large language models: a fusion-based transfer learning framework integrating BERT and T5 representations. Front. Big Data 8:1651290. doi: 10.3389/fdata.2025.1651290

Received

25 June 2025

Revised

20 December 2025

Accepted

30 December 2025

Published

04 February 2026

Volume

8 - 2025

Edited by

Hanqi Zhuang, Florida Atlantic University, United States

Reviewed by

Himanshu Sharma, NIMS University, India

K.M. Poonam, Indian Institute of Technology Kharagpur, India

Copyright

*Correspondence: Weijia Zhang; Raja Kamil; Zhen Zhao; Syed Abdul Rahman Al-Haddad

