Development of Digitally Obtainable 10-Year Risk Scores for Depression and Anxiety in the General Population

The burden of depression and anxiety in the world is rising. Identification of individuals at increased risk of developing these conditions would help to target them for prevention and ultimately reduce the healthcare burden. We developed a 10-year predictive algorithm for depression and anxiety using the full cohort of over 400,000 UK Biobank (UKB) participants without pre-existing depression or anxiety using digitally obtainable information. From the initial 167 variables selected from UKB, processed into 429 features, iterative backward elimination using a Cox proportional hazards model was performed to select the predictors which account for the majority of its predictive capability. Baseline and reduced models were then trained for depression and anxiety using both Cox and DeepSurv, a deep neural network approach to survival analysis. The baseline Cox model achieved concordance of 0.7772 and 0.7720 on the validation dataset for depression and anxiety, respectively. For the DeepSurv model, respective concordance indices were 0.7810 and 0.7728. After feature selection, the depression model contained 39 predictors and the concordance index was 0.7769 for Cox and 0.7772 for DeepSurv. The reduced anxiety model, with 53 predictors, achieved concordance of 0.7699 for Cox and 0.7710 for DeepSurv. The final models showed good discrimination and calibration in the test datasets. We developed predictive risk scores with high discrimination for depression and anxiety using the UKB cohort, incorporating predictors which are easily obtainable via smartphone. If deployed in a digital solution, these scores would allow individuals to track their risk and would provide pointers on how to decrease it through lifestyle changes.


INTRODUCTION
Global prevalence of depression was estimated to be 280 million (1) in 2019. By 2030, depression is expected to be the second-largest contributor to worldwide loss of years of healthy life because of death or disability (2). Highly comorbid with depression, anxiety disorders are estimated to affect 301 million individuals globally (1). NICE guidelines currently recommend the use of validated questionnaires [e.g., PHQ-9, Patient Health Questionnaire (3); HADS, Hospital Anxiety and Depression Scale (4) and BDI, Beck Depression Inventory (5) for depression; GAD-2 or GAD-7 (6) for anxiety disorders] to diagnose patients and classify the severity of their symptoms (7). Whilst tools based on patients' self-reported feelings and mood changes are invaluable for tracking the progression of the disorder, multifactorial models including well-established risk factors are needed to successfully manage the disorder in the long term. Predictive scores can be used to effectively identify patients at highest risk of developing depression or anxiety and enroll them in preventative pathways, thus minimizing relapses and lowering the burden of the disease (8).
A comprehensive review of existing predictive scores is outside the scope of this paper and can be found elsewhere (9). There are several scores for specific populations at risk of depression, e.g., adolescents (10), elderly (11), traumatic head injury (12) or stroke patients (13), patients with diabetes (14), or immune-mediated inflammatory disorders (15). For the general population, the most widely used depression risk score is the PredictD score, developed using patient data from six European countries and externally validated on a population from Chile (16). The original score contains 10 risk factors (age, sex, country, education level, personal and family history of depression, physical and mental health disturbances, difficulties at work, and experience of discrimination). Other country-specific scores have been developed after PredictD to better account for cultural and socio-economic differences (17)(18)(19)(20), but little research has been conducted to develop risk scores aimed at predicting onset of generalized anxiety and panic disorders in the general population. The PredictA score was developed using the same dataset described above for identifying factors predicting depression, and it includes sex, age, lifetime depression, family history of psychological difficulties, physical health and mental health disturbances, unsupported difficulties in paid and/or unpaid work, country of residence and time of follow-up (21). Given the high comorbidity between anxiety and depressive disorders, it has also been suggested that, on top of disorder-specific risk factors, a set of common underlying risk factors for both disorders may exist (22).
The majority of the published risk scores for depression and anxiety relate to short-term predictions (between 6 and 24 months) that mainly involve non-modifiable factors (e.g., family history). Given the impact of recurring episodes and lifetime duration on the progression of the disorders (23,24), early identification of individuals at risk of depression or anxiety would be beneficial in devising effective preventative and therapeutic pathways. Evidence suggests that the risk of depression and anxiety can be decreased by modifying certain lifestyle factors, such as having a balanced diet and performing physical activity (25,26) or smoking cessation (27). With an increased availability of and need for telematics solutions in healthcare due to the COVID-19 pandemic (28)(29)(30), risk scores that highlight modifiable lifestyle factors have the potential to be of wide benefit.
The UK Biobank (UKB) is a prospective study of over half a million UK participants, recruited between 2006 and 2010. Available data includes primary care and hospital inpatient records, results of touchscreen questionnaires and verbal interviews about lifestyle, pre-existing conditions and family history, as well as a comprehensive battery of medical tests, imaging and physical assessments. While predictive models have been developed for many common diseases, such as cardiovascular diseases or diabetes (31,32), no long-term predictive models have been derived for depression and anxiety using the UKB cohort. This dataset contains information about many known predictors of severe mental illness, including modifiable lifestyle factors and other digitally-obtainable data such as comorbidities, socioeconomic factors, or early-life events (33,34).
The aim of this study is to devise a model of potential risk factors for depression and anxiety in the long term (10+ years) using the UKB data and with a focus on behavioral and health indices that can be potentially tracked by means of a remote digital solution.

Data Source and Study Design
The use of data for this study was approved by UKB, under the project title "Validation and comparative analysis of novel prediction models focused on modifiable lifestyle factors for the risks of common, preventable diseases and all-cause mortality: a cohort study" (application number 55668).
The aim of this study was to model the risk of being diagnosed with depression or anxiety for the first time over the next 10 years. The outcomes were defined as an occurrence of a depressive episode (ICD10 code F32) or an anxiety disorder (ICD10 code F41), respectively, after the date of assessment. These were derived from the UKB First Occurrences fields 130894 and 130906, which combine data from primary care and hospital inpatient diagnoses with conditions self-reported at the time of assessment. Primary and secondary care data contain the exact date of each diagnosis. Results from diagnostic mental health questionnaires were not used in the definition of the outcome because the participants were not systematically screened over the follow-up period.
Potential predictor variables include results from the touchscreen questionnaire administered at the initial assessment (instance 0) and pre-existing illnesses diagnosed by the time of assessment. Only fields which could be collected via a smartphone app (i.e., using the sensors available on smartphones or via user input) and which were recorded at or before the time of the initial assessment were included. The final set of potential predictors includes demographic data (age, sex, ethnicity), socioeconomic status (income, qualifications), physio-metric data (body mass index, height, weight, etc.), family history (illnesses of parents or siblings), medical history, lifestyle characteristics (physical activity, diet, sleep habits), mental health history, moods and overall perceived well-being (satisfaction with life, mood swings, feelings of worry, loneliness, etc.). Continuous variables with a high proportion of missing values (>10%) were excluded, as well as fields which could cause label leakage (Supplementary Table 1).

Data Preparation
Participants were excluded from the study if they were diagnosed with the corresponding outcome condition prior to the date of assessment. Survival time was measured in years from the date of assessment (instance 0 in the UK Biobank dataset) until the date of depression/anxiety diagnosis. For participants who were not diagnosed, data were right-censored at the data extraction date (30th September 2020), the date of death, or the date when they were lost to follow-up, whichever came first.
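The duration and censoring rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `survival_record` and its arguments are hypothetical, a diagnosis on the assessment date itself is assumed to count as pre-existing, and years are approximated as days divided by 365.25.

```python
from datetime import date
from typing import Optional, Tuple

EXTRACTION_DATE = date(2020, 9, 30)  # data extraction date stated in the text


def survival_record(assessed: date,
                    diagnosed: Optional[date] = None,
                    died: Optional[date] = None,
                    lost: Optional[date] = None) -> Optional[Tuple[float, int]]:
    """Return (duration_in_years, event_flag), or None if the participant is excluded."""
    # Assumption: a diagnosis on or before the assessment date is pre-existing
    if diagnosed is not None and diagnosed <= assessed:
        return None
    # Right-censoring happens at the earliest of extraction, death, or loss to follow-up
    censor_date = min(d for d in (EXTRACTION_DATE, died, lost) if d is not None)
    if diagnosed is not None and diagnosed <= censor_date:
        end, event = diagnosed, 1   # outcome observed
    else:
        end, event = censor_date, 0  # outcome not observed: censored
    return (end - assessed).days / 365.25, event


incident = survival_record(date(2008, 1, 1), diagnosed=date(2013, 1, 1))
censored = survival_record(date(2008, 1, 1), died=date(2010, 1, 1))
excluded = survival_record(date(2008, 1, 1), diagnosed=date(2007, 5, 1))
```

Each returned pair corresponds to the duration and event columns expected by standard survival analysis libraries.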
The binary variable "any_mental_issue" was derived from the set of First Occurrences fields for ICD10 diagnoses in the "Mental and behavioral disorders" category (F00-F99, excluding F32 and F33 in the Depression model and F40 and F41 in the Anxiety model). Only dates prior to the date of assessment were considered for this variable.
For CoxPH, all categorical features were one-hot encoded, followed by exclusion of categories containing <0.1% items; for DeepSurv, the categorical features were encoded using one-hot, target-encoding, and weight of evidence (the encoding being one of the parameters), followed by exclusion of sparsely populated categories. For six diet-related features, the values "Do not know" and "Prefer not to answer" were substituted with the mean. Continuous features were centered and scaled to unit variance (details in Supplementary Table 2). Finally, participants with any remaining missing continuous features were excluded (characteristics of excluded participants in Supplementary Tables 3, 4).
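The CoxPH preprocessing steps can be illustrated with a minimal standard-library sketch. The helper names (`one_hot_drop_rare`, `impute_mean`, `standardize`) are hypothetical and do not come from the study code; the 0.1% threshold is the one stated above for CoxPH, and the use of the population standard deviation for scaling is an assumption.

```python
from collections import Counter
from statistics import mean, pstdev


def one_hot_drop_rare(values, min_fraction=0.001):
    """One-hot encode a categorical column, dropping categories below min_fraction."""
    counts = Counter(values)
    n = len(values)
    kept = sorted(c for c, k in counts.items() if k / n >= min_fraction)
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return kept, rows


def impute_mean(xs, missing=("Do not know", "Prefer not to answer")):
    """Replace non-informative answers with the mean of the observed values."""
    mu = mean(x for x in xs if x not in missing)
    return [mu if x in missing else x for x in xs]


def standardize(xs):
    """Center to zero mean and scale to unit variance (population SD assumed)."""
    mu, sd = mean(xs), pstdev(xs)
    return [(x - mu) / sd for x in xs]


# Toy column: category "c" covers 0.05% of rows and is therefore dropped
column = ["a"] * 1500 + ["b"] * 499 + ["c"]
categories, encoded = one_hot_drop_rare(column)
servings = impute_mean([1, "Do not know", 3])
z = standardize([1.0, 2.0, 3.0])
```

A row whose category was dropped is encoded as all zeros, so it is absorbed into the reference level.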
A test set (25%) was set aside for internal validation using a stratified train-test split (preserving the ratios seen in the binary outcome field). From the train set, a further 25% was set aside as a validation set and used for feature selection and the optimisation of DeepSurv parameters. For the final models after feature selection, the train and validation datasets were combined for training, followed by evaluation on the test dataset (Supplementary Figure 1).
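The two-stage stratified split can be sketched as follows. This is an illustrative standard-library version, not the study's implementation; the function name `stratified_split`, the seed, and the toy outcome vector are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, holdout_fraction=0.25, seed=0):
    """Return (kept_idx, holdout_idx), preserving the share of each label value."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)
    kept, holdout = [], []
    for idx in by_label.values():
        rng.shuffle(idx)
        cut = round(len(idx) * holdout_fraction)
        holdout.extend(idx[:cut])
        kept.extend(idx[cut:])
    return sorted(kept), sorted(holdout)


# Toy binary outcome with 4% events
labels = [1] * 64 + [0] * 1536

# Stage 1: hold out 25% as the test set
trainval_idx, test_idx = stratified_split(labels)

# Stage 2: hold out 25% of the remainder as the validation set
inner_train, inner_val = stratified_split([labels[i] for i in trainval_idx])
train_idx = [trainval_idx[i] for i in inner_train]  # map back to original rows
val_idx = [trainval_idx[i] for i in inner_val]
```

With these fractions the data end up split roughly 56% train, 19% validation and 25% test, with the event rate preserved in each part.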

Baseline Model and Feature Selection
First, a Cox proportional hazards (CPH) model, implemented in the Python lifelines library (35), was trained using the full set of features. The number of features was then reduced to decrease the risk of overfitting, improve the explainability of the model and narrow down the number of inputs required from users in a potential digital solution. As a first step, univariate Cox analysis was performed for each feature and those with a p-value over 0.1 were excluded. Then, an iterative backward elimination algorithm was used to obtain the final set of features. In short, in every round of elimination, the CPH model was retrained without the set of features with the highest p-values. If the concordance index evaluated on the validation dataset decreased by more than 0.001, the features were kept and an additional round tested the elimination of a smaller number of features. Elimination of each remaining single feature was tested before the decision to keep it in the reduced model.
The reduced model was then trained using this final set of features on the combined train and validation datasets and its performance was evaluated on the unseen test cohort. Finally, this model was reviewed and variables with a problematic clinical explanation were manually removed from the model.
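The backward elimination loop described above can be sketched generically. This is one possible reading of the procedure, not the authors' code: the function names, the initial batch size, the halving of the batch after a failed round, and the comparison against the c-index of the previous accepted model are all assumptions. The `fit_score` callback stands in for fitting a Cox model (for example with lifelines) and returning the validation concordance index together with per-feature p-values; here it is replaced by a toy function so the sketch is self-contained.

```python
def backward_eliminate(features, fit_score, batch=4, tol=0.001):
    """Backward elimination guided by p-values and validation c-index.

    fit_score(feature_list) must return (c_index, {feature: p_value}).
    """
    selected = list(features)
    current_c, pvals = fit_score(selected)
    protected = set()  # features whose single removal cost more than tol
    while True:
        pool = [f for f in selected if f not in protected]
        if not pool:
            break
        # Try dropping the `batch` unprotected features with the highest p-values
        drop = sorted(pool, key=lambda f: pvals[f], reverse=True)[:batch]
        trial = [f for f in selected if f not in drop]
        # An empty model is treated as a failed round
        c, trial_p = fit_score(trial) if trial else (float("-inf"), {})
        if current_c - c <= tol:
            selected, current_c, pvals = trial, c, trial_p  # accept the removal
        elif batch > 1:
            batch = max(1, batch // 2)  # retry with a smaller set of features
        else:
            protected.add(drop[0])  # single feature is needed: keep it
    return selected


def toy_fit_score(feats):
    """Hypothetical stand-in for a Cox fit: 'signal' features drive the c-index."""
    signal = sum(f.startswith("signal") for f in feats)
    noise = len(feats) - signal
    c_index = 0.5 + 0.1 * signal + 0.0001 * noise
    p_values = {f: 0.001 if f.startswith("signal") else 0.5 + 0.01 * int(f[5:])
                for f in feats}
    return c_index, p_values


features = ["signal1", "signal2"] + [f"noise{i}" for i in range(1, 7)]
kept = backward_eliminate(features, toy_fit_score)
```

In the toy run the six noise features are removed in batches, while each signal feature is retained because its individual removal lowers the c-index by more than the tolerance.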

Deep Survival Analysis
The next model we tested was the Cox proportional hazards deep neural network (DeepSurv), using an implementation in the "pycox" package (36) based on the deep learning library PyTorch (37). Details of all libraries and their versions used in this study can be found in Supplementary Table 5.
Using either the full or reduced set of features, we searched the hyperparameter space using a set of parameters described in Supplementary Table 6. This was done using a Tree-Structured Parzen Estimator algorithm (38) from the Optuna library (39). In total, 500 configurations were tested for both depression and anxiety, separately for baseline and reduced models, each evaluated on the validation dataset. Feedforward neural networks with up to three hidden layers were tested. Classic Stochastic Gradient Descent with Momentum (40) and Adam (41) with optimal learning rate estimation were used for training. The best combination of hyperparameters was selected separately for the baseline and reduced models for depression and anxiety. The performance was then compared to a neural network with the same hyperparameters but using only the reduced set of features selected by backward elimination with the Cox model.

Statistical Analysis
Results from the analysis of demographic characteristics show participant numbers and percentages of total for categorical/ordinal variables, or medians and quartiles (Q1 and Q3) for continuous variables. Statistical comparisons were performed using the Chi-squared test for categorical/ordinal and Kruskal-Wallis test for continuous variables.
C-index was used as the metric for all models, with 95% confidence intervals calculated using the percentile bootstrap resampling method (50 resampling rounds). Where detailed analysis of the results of CPH models is provided, the logarithm of hazard ratios, log(HR), with 95% confidence intervals (CIs) is shown. P-values test the null hypothesis that the coefficient of each variable is equal to zero, and the significance level was set to 0.05. Calibration was evaluated at the 10-year time point using calibration plots and the Integrated Calibration Index (ICI), which is a weighted mean absolute difference between observed and predicted probabilities, implemented in the Python lifelines library (35).
Frontiers in Psychiatry | www.frontiersin.org
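To make the evaluation metric concrete, the following sketch computes Harrell's concordance index and a percentile bootstrap confidence interval with 50 resampling rounds. It is a plain-Python illustration on synthetic data, not the lifelines implementation used in the study; the function names, the nearest-rank percentile rule, the handling of tied risks, and the toy data are assumptions.

```python
import random


def concordance_index(durations, events, risks):
    """Harrell's C: share of comparable pairs in which the higher risk fails first."""
    concordant, comparable = 0.0, 0
    n = len(durations)
    for i in range(n):
        if not events[i]:
            continue  # a pair is comparable only if the earlier time is an event
        for j in range(n):
            if durations[i] < durations[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied risks count as half
    return concordant / comparable


def bootstrap_ci(durations, events, risks, rounds=50, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the concordance index."""
    rng = random.Random(seed)
    n = len(durations)
    stats = []
    for _ in range(rounds):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        stats.append(concordance_index([durations[i] for i in idx],
                                       [events[i] for i in idx],
                                       [risks[i] for i in idx]))
    stats.sort()
    lower = stats[int(alpha / 2 * rounds)]
    upper = stats[min(rounds - 1, int((1 - alpha / 2) * rounds))]
    return lower, upper


# Synthetic data: higher risk scores tend to go with shorter durations
rng = random.Random(1)
durations = [rng.uniform(0, 10) for _ in range(200)]
events = [1 if rng.random() < 0.3 else 0 for _ in range(200)]
risks = [-t + rng.gauss(0, 3) for t in durations]

c_index = concordance_index(durations, events, risks)
ci_low, ci_high = bootstrap_ci(durations, events, risks)
```

The pairwise loop is quadratic in the number of participants, so a cohort of this study's size would call for an optimized library routine.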

Study Population
From the initial set of 502,488 participants in the UK Biobank, 40,367 had pre-existing depression and 11,296 had pre-existing anxiety. These participants were excluded from the respective datasets. Further participants were excluded due to missing values in some continuous or ordinal variables. In the depression dataset, of the remaining 448,733 participants, 16,507 (3.68%) developed depression after assessment. In the anxiety dataset, 17,830 (3.74%) of the remaining 477,100 participants were diagnosed with anxiety after assessment. Details on the distribution of the outcomes in the train, validation and test datasets can be found in Supplementary Figure 2. There were no significant differences in any features between the train + validation and test datasets (Supplementary Tables 7, 8). The median follow-up time of 11.2 years and the maximum follow-up time of 13.8 years were the same for depression and anxiety. The distribution of the durations to depression or anxiety development can be found in Supplementary Figures 2A,B.
The dataset used in this study contains 46% men and 54% women, aged 56 ± 8 (range 37-73) at the time of the initial assessment (Supplementary Figures 3C,D). The ethnic background of the participants was 94.5% white, 2.1% Asian, 1.5% black, and 1.9% other or unknown. A summary of the variables in the final models for depression and anxiety is presented in Supplementary Tables 9, 10, respectively.

Cox Model and Feature Selection
The initially selected 167 UK Biobank variables were preprocessed into 429 feature columns for depression/anxiety (Supplementary Table 2). These were used to build a baseline Cox proportional hazards model for each of the two outcomes, achieving a concordance index of 0.7901 for depression and 0.7650 for anxiety in the validation cohort (Table 1).
After feature selection, the number of predictors for depression was narrowed down to 39 and for anxiety to 53. After manual review (excluded features and their coefficients can be found in Supplementary Table 11), the final number of predictors was 35 and 46, respectively. The decrease in the number of predictors was accompanied by a slight decrease in the concordance index of the depression model to 0.7796, while the anxiety model maintained its performance at 0.7668 (Table 1). The predictors in the final models, along with their coefficients and confidence intervals, are shown in Figure 1 and Supplementary Tables 12, 13.
The top three risk factors in the depression model were seeing a GP for nerves, anxiety, tension, or depression, not wanting to talk about seeing a psychiatrist about these conditions, and poor overall health. The most protective factors were annual household income over £100,000 and being a non-smoker. For anxiety, the top risk factors were not wanting to talk about father's illnesses, poor self-rated health and suffering from nerves; the protective factors were never visiting a GP or psychiatrist for nerves, anxiety, tension, or depression and annual household income over £100,000. The mean predicted 10-year risk of developing depression was 3.30% (95% CI 3.27-3.32) and the mean observed risk was 3.21% (95% CI 3.18-3.23). The anxiety model predicted an average risk of 3.26% (95% CI 3.23-3.28), while the observed probability was 3.18% (95% CI 3.16-3.20). The Cox models showed good calibration, particularly for the low probabilities which were abundant in the population, with slightly larger errors for the higher probabilities which were sparsely represented. The Integrated Calibration Index (ICI) was 0.0009 for depression and 0.0013 for anxiety (Supplementary Figure 4).

Machine Learning Models
The hyperparameter space for the DeepSurv model was explored by running 500 combinations, the best of which achieved a concordance index of 0.7878 (95% CI 0.7866-0.7894) for depression and 0.7728 (95% CI 0.7714-0.7737) for anxiety using the full set of variables (Table 2). Using the reduced set of features after feature selection, the depression model showed a concordance of 0.7863 (95% CI 0.7723-0.7873) and the anxiety model a concordance of 0.7710 (95% CI 0.7710-0.7719) (Table 2). The details of the hyperparameters for the best-performing models can be found in Supplementary Table 6.

Comparison of the Depression and Anxiety Models
Of the 35 features in the depression model and 46 features in the anxiety model, 8 features appeared in both models (Figure 2). These include poor or fair self-rated health, experience of mood swings, suffering from nerves, number of operations, having one's feelings easily hurt, and water intake. An annual household income over £100,000 had negative coefficients in both models.

Key Findings
The aim of the current study was to build prediction models for depression and anxiety, with a specific focus on factors linked to digitally-obtainable data. Using a data-driven approach to feature selection and model optimisation, we trained models for prediction of depression and anxiety using both traditional statistical and machine learning methods. The best-performing model for depression achieved a concordance index of 0.7863, which is comparable to other depression risk scores. For comparison, the current gold standard score PredictD achieved a concordance index of 0.790 using stepwise logistic regression (16). Rosellini et al. (19) used an ensemble machine learning algorithm with a concordance of 0.757, whereas Wang et al. (20) developed sex-specific logistic regression models with a concordance index of 0.795 for men and 0.767 for women. Our DeepSurv model for anxiety shows a concordance of 0.771, compared to 0.752 in the PredictA study (21). It is important to note that participants in our study were followed for more than 10 years while the prediction horizon in the other studies was shorter (1-4 years). Among the most similar studies developed on the UKB dataset, Zhou et al. achieved a concordance index of 0.778 for prediction of depressive moods using neuroimaging and questionnaire data, and Sarris et al. (42), in their study of lifestyle factors associated with frequency of depressive moods, built an ordinal logistic regression model but did not report its performance. To the best of our knowledge, this is the first study which used the UKB cohort to develop long-term prognostic risk scores for both depression and anxiety.
The DeepSurv model performed comparably to the Cox model. While DeepSurv has the capability to capture complex nonlinear relationships between factors, the comparable performance could point to a linear relationship for most variables included in the model. An alternative explanation could be based on the requirement of deep learning algorithms for very large datasets to model very complex problems. It is possible that the problem at hand is too complex to be modeled with the provided number of training examples and deep learning thus cannot outperform traditional models. For the use of this score in a digital healthcare setting, interpretability of the score is key. With DeepSurv, similar to many other black-box machine learning models, the direction and scale of each feature's contribution to the overall risk may not be easily understandable. From this perspective, Cox model coefficients provide a more intuitive understanding of how each feature could be changed to decrease the predicted risk, information which could potentially motivate the user toward the right lifestyle changes.
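The interpretability argument can be illustrated with the standard Cox relationship between coefficients and absolute risk, in which the 10-year risk equals 1 − S0(10)^exp(β·x), where S0(10) is the baseline 10-year survival. The sketch below uses entirely hypothetical values: the feature names, coefficients and baseline survival are placeholders chosen for illustration and are not taken from the fitted models.

```python
import math


def ten_year_risk(baseline_survival_10y, coefficients, features):
    """Cox model risk: 1 - S0(10) ** exp(sum of coefficient * feature value)."""
    linear_predictor = sum(beta * features.get(name, 0.0)
                           for name, beta in coefficients.items())
    return 1.0 - baseline_survival_10y ** math.exp(linear_predictor)


# Hypothetical log hazard ratios and baseline survival (illustration only)
coefficients = {"current_smoker": 0.30, "poor_self_rated_health": 0.60}
baseline_survival_10y = 0.97

risk_smoker = ten_year_risk(baseline_survival_10y, coefficients, {"current_smoker": 1})
risk_non_smoker = ten_year_risk(baseline_survival_10y, coefficients, {"current_smoker": 0})
risk_reduction = risk_smoker - risk_non_smoker
```

Because each coefficient enters the risk through a single exponent, the effect of changing one modifiable factor can be shown to a user directly, which is not straightforward for a neural network.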
Our prediction model includes many traditional risk factors for mental health illness, including smoking, alcohol consumption, employment status, overall health status, sleeping disturbances, social functioning, or education and income level (33). Interestingly, while age, sex, and ethnicity are among the risk factors in the existing predictive scores (16,18,21), they were eliminated from our models during the feature selection, with the exception of age in the anxiety model. We assume that the limited representation of all age and ethnic groups in the UKB dataset may be the reason some of these were not identified as important predictors. Nevertheless, the developed model provides a good discrimination capability for populations comprising predominantly white individuals and within the age range of ∼40-70. Its performance on other demographics must be further investigated.
From the comparison of the final set of predictors in the depression and anxiety models it is clear that some pre-existing mental health conditions adversely affect the risk: seeing or not seeing a GP/psychiatrist for nerves, anxiety, tension or depression in the past is among the most important features in both models. Other factors featuring in both depression and anxiety models include the self-reported overall health rating and the associated variable of number of operations. Our analysis also showed that household income and certain personality traits (being a moody person or easily hurt) are common predictors for depression and anxiety.
Some of the predictors in our final model are dynamic factors, changing over time (e.g., smoking status, alcohol consumption, feelings or tiredness over the past 2-week period). Regular reassessment of the score would therefore be beneficial and could be aided by deployment of the score as a digital solution. Such an application could also provide personalized recommendations on how to change one's lifestyle to decrease the risk of developing depression or anxiety.

Study Limitations
Among the limitations of this study is the selection bias in the UKB cohort used to train the prediction models. It has been reported that the participants of UKB are on average healthier and come from less deprived areas (43). Notably, the representation of ethnicities in the dataset deviates from the general population, with over 94% of participants being white, compared to 86% in the general population according to the latest UK census. Therefore, caution should be exercised when making predictions for individuals with a minority background, as they were underrepresented in the training dataset. This is an important limitation because they may be disproportionately affected by severe mental illnesses (44). The study also included only participants within a limited age range: 37-73 years at the time of assessment. Therefore, until these models have been validated on an appropriate cohort extending this range, predictions for users younger or older than this should be interpreted with caution. The results of the internal validation performed in this study using an unseen test cohort do not point to overfitting, but external validation will be necessary to confirm this.
Missing records or methods of variable encoding may also introduce bias into this study. Because the goal of the study was to develop two separate models for anxiety and depression, we decided to keep diagnosis of anxiety as a predictor for the depression model, and vice versa, given their high comorbidity. This setup would not be suitable if the goal were to build a single model predicting either depression or anxiety or both conditions simultaneously.

Implications of Our Findings
This tool is intended to be used as a digital solution for dynamic tracking of individuals' risk of developing depression or anxiety. Evidence suggests that most individuals value knowing their risk of depression, especially if ways of prevention are also indicated (45,46). Providing a personalized depression/anxiety risk score therefore appears as a safe preventative strategy with demonstrated benefits (46).
External validation of the developed model on data collected from another population will be necessary to be able to extend these findings to populations from geographies other than the United Kingdom. If these populations show significant differences in demographic characteristics, such as ethnicity, causing the accuracy of the predictions to drop, the model can be re-calibrated if a suitable dataset exists for this population.
A further improvement to the prediction models for mental health would be an expansion of the input space with other dynamic parameters, such as activity measures. Raw accelerometer data is available for a proportion of UK Biobank participants, and it could be used to provide an objective view of individuals' level of activity, sedentary time and sleep, which have known associations with the risk of developing mental illness (33,42,47). Because these would be obtainable via wearable devices, they bring the potential of truly dynamic monitoring of an individual's risk.

CONCLUSION
In summary, we present algorithms for prediction of depression and anxiety, developed using a large UK cohort of individuals followed for over 10 years. All factors in our models are easily acquirable via smartphone devices and thus can be used to support development of preventative digital solutions for mental health.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. The dataset was provided by the UK Biobank and can be found here: https://www.ukbiobank.ac.uk/.

AUTHOR CONTRIBUTIONS
DM designed the study. MC assisted with implementation of the used methods. DM and ND trained the models and performed data analysis. DM, ND, SP, and DP wrote the manuscript. All authors revised the manuscript.

FUNDING
Huma Therapeutics LTD funded this research.