Edited by: Jorge Lopez-Castroman, Centre Hospitalier Universitaire de Nîmes, France
Reviewed by: Glen Coppersmith, Qntfy, United States; Ayah Zirikly, National Institutes of Health (NIH), United States
This article was submitted to Psychopathology, a section of the journal Frontiers in Psychiatry
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Suicide is a serious public health issue worldwide, yet current clinical methods for assessing a person's risk of taking their own life remain unreliable and new methods for assessing suicide risk are being explored. The widespread adoption of electronic health records (EHRs) has opened up new possibilities for epidemiological studies of suicide and related behaviour amongst those receiving healthcare. These types of records capture valuable information entered by healthcare practitioners at the point of care. However, much recent work has relied heavily on the structured data of EHRs, whilst much of the important information about a patient's care pathway is recorded in the unstructured text of clinical notes. Accessing and structuring text data for use in clinical research, and particularly for suicide and self-harm research, is a significant challenge that is increasingly being addressed using methods from the fields of natural language processing (NLP) and machine learning (ML). In this review, we provide an overview of the range of suicide-related studies that have been carried out using the Clinical Record Interactive Search (CRIS): a database for epidemiological and clinical research that contains de-identified EHRs from the South London and Maudsley NHS Foundation Trust. We highlight the variety of clinical research questions, cohorts and techniques that have been explored for suicide and related behaviour research using CRIS, including the development of NLP and ML approaches. We demonstrate how EHR data provide comprehensive material to study the prevalence of suicide and self-harm in clinical populations. Structured data alone are insufficient, and NLP methods are needed to identify relevant information from EHR data more accurately. We also show how the text in clinical notes provides signals for ML approaches to suicide risk assessment.
We envision increased progress in the decades to come, particularly in externally validating findings across multiple sites and countries, both in terms of clinical evidence and in terms of NLP and machine learning method transferability.
The views expressed are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care.
Prior to the introduction of electronic health records (EHRs), the study of suicidality in Camberwell, the southeast London catchment area served by King's College Hospital, was undertaken by paper case note review, for example of all referrals to a self-harm team over a 6-month period (
Later, when Dutta et al. (
This methodology meant inclusion in the cohort was clearly and consistently defined, and the outcome of deaths by suicide and open verdicts up until March 31, 2007 according to the International Classification of Diseases (ICD) was identified by a direct case-tracing procedure with the Office for National Statistics (ONS) for England and Wales and the General Register Office (GRO) for Scotland. This enabled the study of early risk factors for suicide in the cohort (
OPCRIT+ (a redesigned version of OPCRIT for use in clinical settings with an expanded number of objectively rated items) facilitated access to structured symptom information entered by clinicians to generate diagnoses including “suicidal ideation” but not self-harm (
The widespread adoption of EHRs has meant that large-scale clinical data are now available for clinical research, although researchers have to contend with the large volume, complexity and heterogeneity of these “big data” resources. Typical EHR systems store patient data in both structured fields and as unstructured text (as well as other media types, such as medical images). Structured data fields, such as drop-down menus, forms and checkboxes, tend to be made available to clinical practitioners as a means to directly encode patient diagnoses, assessment results, etc. in a predetermined format. However, rates of completion can vary. Unstructured text entry allows for more nuanced documentation, providing context to assessments, patient status, and other information pertinent to the clinical interaction. The availability of these electronic health data has greatly facilitated mental health research. Investigators can now use EHRs to gather data about clinical populations, identify participants for clinical trials, carry out retrospective case-control studies, develop and trial predictive models, and guide the implementation of evidence-based practices (
In 2008, the South London and Maudsley National Health Service (NHS) Foundation Trust Biomedical Research Center (SLaM BRC) developed the Clinical Record Interactive Search (CRIS) application. Since then, CRIS has become an extensive UK-based repository of anonymised, structured and free-text data derived from the EHR system used by SLaM [See (
CRIS provides unprecedented information on mental disorders and outcomes in routine clinical care at scale, particularly through enhancements from the use of natural language processing (NLP) to extract previously inaccessible information, ranging from patients' cognitive function, smoking status and education, to antipsychotic medication profiles and substance misuse (
The availability of this type of large-scale data heralds the prospect of using statistical and data science approaches to analyse larger cohorts and better understand how these behaviours manifest in healthcare settings (
Over the last 10 years, researchers have used CRIS to conduct a number of epidemiological studies to examine suicidal behaviours across a range of mental health conditions (e.g., autism, psychotic disorders), and demographic groups (e.g., adults, children and adolescents, pregnant women). Methodologies have evolved, improving the accuracy of identifying suicidality-related constructs and predictive models of suicide risk. In the following sections, we review the evidence generated from CRIS on suicidal behaviours, the NLP methods used, and the value of the resulting cohorts and datasets created.
Suicide-related behaviour is the manifestation of a complex set of phenomena that depend on many contextual factors, which can change quickly from one day to another. Completed suicide remains relatively rare, meaning that tools to assess suicide risk must have a high predictive validity to be of use in a clinical setting (
A wide range of known risk and contributory factors are associated with suicide, with symptoms of mental illness being recognisable in more than 85% of people who die by suicide, according to psychological autopsy interviews with family, friends and medical professionals (
Summarised characteristics of clinical cohorts created using CRIS for the study of suicide and related behaviour.
| Study | Cohort | Patients (N) | Events (N) | Age | Study period | Identification of suicide-related outcomes |
|---|---|---|---|---|---|---|
| Polling et al. ( | Adults attending ED | 7,444 | 10,688 ED attendances | N/A | 01/04/2009–31/12/2011 | ICD-10 codes X60-X84, presence of keywords related to self-harm, suicide attempts and suicidality |
| Bogdanowicz et al. ( | Patients with opioid use disorder | 5,335 | N/A | 15–73 years; Mean (SD) = 37.6 (9.07) years | 01/04/2008–31/03/2014 | ICD-10 codes X409-X450, Y120, Y125, F119 |
| Lopez-Morinigo et al. ( | Patients with schizophrenia spectrum disorder | 426 (71 cases, 355 controls) | N/A | Mean (SD) = 44.9 (18.0) years | 01/01/2007–31/12/2013 | ICD-10 codes X64, X70, X71, X78, X80, X81, X84, Y10-34 |
| Lopez-Morinigo et al. ( | Patients accessing secondary mental healthcare | 13,758 | N/A | Mean (SD) = 41.3 (12.2) for suicide, 40.6 (11.5) for no suicide | 01/01/2007–01/04/2015 | ICD-10 codes X64, X70, X71, X78, X80, X81, X84, Y10-34 |
| Roberts et al. ( | Individuals with chronic fatigue syndrome | 2,147 | N/A | Mean = 39.1 years | 01/01/2007–31/12/2013 | ICD-10 codes X60-X84 |
| Taylor et al. ( | Perinatal women with SMI | 420 | N/A | Mean (SD) = 31.9 (6.2) years | 01/01/2007–31/12/2011 | Presence of keywords [from ( |
| Downs et al. ( | Children and adolescents with ASD | 1,906 | N/A | 14–18 years | 01/01/2008–31/12/2013 | NLP, manual classification of suicidality-related expressions |
| Velupillai et al. ( | Adolescents attending CAMHS | 23,455 | N/A | 11–17 years | 01/04/2009–31/03/2016 | Manual annotation of suicidality-related expressions, NLP |
| Bittar et al. ( | Patients accessing secondary mental healthcare | 17,640 (2,913 cases, 14,727 controls) | 21,175 admissions (4,235 cases, 16,940 controls) | Mean (SD) = 33.7 (15.6) years | 02/04/2006–31/03/2017 | X6 |
The Health of the Nation Outcome Scales (HoNOS) were introduced in 1996 to measure the health and social functioning of people with mental illness. Within SLaM, as with most UK mental health trusts, clinicians are expected to complete HoNOS for all patients receiving care. The non-accidental self-injury item of HoNOS has been shown to be the only individual item associated with higher mental health service costs (
Structured suicide and violence risk assessments in mental health services have been shown to have low predictive accuracy for all-cause mortality (
Research into mortality, including death by suicide, has typically utilised ICD-10 diagnostic codes (which must be completed as part of clinical assessment), linked with outcome data from the Office for National Statistics, ONS (
Polling et al. (
Using the self-harm-related terms identified by Polling et al. (
In a further study using the free-text search capabilities of CRIS, Borschmann et al. (
The first approaches that were developed to process CRIS data were pattern matching approaches to identify certain pieces of information (e.g., medication, smoking status, substance misuse) using the GATE framework (
In addition to these “integrated” NLP applications, clinicians have worked alongside NLP researchers to develop custom NLP tools to identify suicide-related constructs in specific population samples within CRIS. As we have seen, the focus of most work has been the epidemiology and prevalence of suicidal behaviour, with NLP tools that use both rule-based (
Using data from CRIS with an external linkage to ONS mortality data, Bogdanowicz et al. (
Today, the increasing body of research on suicide and related behaviour in CRIS, together with the diversity of clinical population groups under study, has created a need for more targeted methods of accessing the suicide-related data within the unstructured clinical narratives. NLP systems designed for this task need to identify the different types of suicide-related behaviour (suicide attempt, suicidal ideation, self-harm, etc.) and account for the linguistic variation that indicates whether a mention is attested, negated or uncertain, whether it relates to the patient or to a family member, and so on. These considerations have spurred the recent development of bespoke NLP tools. For example, Gkotsis et al. (
Identifying periods during which a patient is at elevated risk of making a suicide attempt is key to enabling timely intervention. However, information available to clinicians concerning the rapidly changing dynamic factors leading up to a suicide attempt has been limited. Bittar et al. (
The risk and conceptualisation of suicidal behaviour in children and adolescents can differ from those in adults (
Adolescence is associated with a high risk of suicide and self-harm compared to most other age groups, but few studies have examined the prevalence of suicidal behaviour in large adolescent patient cohorts. Downs et al. (
Using a subset of this cohort, Holden et al. (
Velupillai et al. (
Free-text mentions of depressive symptoms were used as outcome measures in the assessment of later-life depression in people from ethnic minorities by Mansour et al. (
Although EHR data are not created for research purposes, they provide a rich resource for large-scale retrospective research, allowing identification of diverse and comprehensive clinical study samples. One of the main challenges in suicide research is obtaining sufficiently large study samples to study an outcome with a high enough base rate for predictive modelling to have a meaningful positive predictive value. The low base rate of completed suicide limits the predictive value of any model, whether established statistical techniques or machine learning (
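The effect of a low base rate on positive predictive value can be made concrete with Bayes' rule. The following is a minimal sketch; the function name and the sensitivity, specificity and base-rate figures are illustrative assumptions chosen for the arithmetic, not values reported by any of the studies reviewed here.

```python
def positive_predictive_value(sensitivity: float, specificity: float, base_rate: float) -> float:
    """Return P(outcome | positive prediction) using Bayes' rule.

    All three arguments are proportions between 0 and 1.
    """
    for name, value in (("sensitivity", sensitivity),
                        ("specificity", specificity),
                        ("base_rate", base_rate)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")

    # Proportion of the whole population that is correctly flagged.
    true_positives = sensitivity * base_rate
    # Proportion of the whole population that is wrongly flagged.
    false_positives = (1.0 - specificity) * (1.0 - base_rate)

    flagged = true_positives + false_positives
    if flagged == 0.0:
        raise ValueError("no positive predictions; PPV is undefined")
    return true_positives / flagged


if __name__ == "__main__":
    # Hypothetical model: 80% sensitivity, 90% specificity (illustrative only).
    for rate in (0.001, 0.01, 0.1):
        ppv = positive_predictive_value(0.8, 0.9, rate)
        print(f"base rate {rate:.3f} -> PPV {ppv:.4f}")
```

Under these assumed figures, the same hypothetical model yields a positive predictive value of roughly 0.8% when the outcome affects 1 in 1,000 patients, and about 47% when it affects 1 in 10, which is why outcome definitions with higher base rates are more tractable for predictive modelling.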
One avenue of research being pursued in CRIS is comparison of suicide-related phenomena over a span of time, within the same hospital trust culture, but where mandatory changes have occurred with regard to how assessments are made and recorded. The focus on a single mental health trust for a review opens the opportunity for a different set of more detailed analyses than a review that covers multiple sites (
Furthermore, EHRs reflect real-world clinical practice. This means that the context of how, for example, structured risk assessment tools and other schedules, like HoNOS, are used in daily clinical work needs to be well understood when including them as variables in clinical research studies. Most of the relevant information is found in the free text, and appropriate NLP solutions are key components for enabling risk modelling.
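To illustrate what a rule-based approach to free text involves, the sketch below labels each suicide-related mention in a sentence as attested, negated or uncertain, and as relating to the patient or to someone else. It is a deliberately simplified, hypothetical example: the term lists, the fixed look-back window and the function name `classify_mentions` are assumptions made for illustration and do not reproduce any of the NLP tools developed on CRIS.

```python
import re

# Hypothetical, deliberately small lexicons (illustrative only).
TARGET = re.compile(
    r"\b(suicidal ideation|suicide attempt|self[- ]harm(?:ed|ing)?|overdose)\b",
    re.IGNORECASE,
)
NEGATION = re.compile(r"\b(no|not|denies|denied|without)\b", re.IGNORECASE)
UNCERTAIN = re.compile(r"\b(possible|possibly|unclear|might|may have)\b", re.IGNORECASE)
FAMILY = re.compile(
    r"\b(mother|father|brother|sister|aunt|uncle|family history)\b",
    re.IGNORECASE,
)


def classify_mentions(sentence: str, window: int = 40) -> list:
    """Label each target mention using cue words found just before it.

    Only the `window` characters preceding a mention are inspected,
    a crude stand-in for proper scope detection.
    """
    results = []
    for match in TARGET.finditer(sentence):
        left_context = sentence[max(0, match.start() - window):match.start()]

        if NEGATION.search(left_context):
            status = "negated"
        elif UNCERTAIN.search(left_context):
            status = "uncertain"
        else:
            status = "attested"

        experiencer = "other" if FAMILY.search(left_context) else "patient"

        results.append({
            "term": match.group(0).lower(),
            "status": status,
            "experiencer": experiencer,
        })
    return results


if __name__ == "__main__":
    for text in (
        "Patient denies suicidal ideation.",
        "Her mother made a suicide attempt last year.",
        "Admitted following self-harm.",
    ):
        print(text, "->", classify_mentions(text))
```

A fixed character window is easy to read but brittle: it misses cues that follow the mention and can wrongly apply a cue from an earlier clause, which is one reason such rules need validation against manually annotated notes in each target population.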
Looking to the future, replication studies of work based on SLaM CRIS, including the developed NLP applications, across other EHR systems and in other clinical catchment areas would provide insights into the generalisability of these particular models to new clinical settings. However, the portability of these NLP applications needs further scrutiny. The studies in this review have all developed their methodologies from the same CRIS system; clinical text within it may show higher internal homogeneity (e.g., in terminology) than is found across CRIS systems based in other health districts. Testing the generalisability of the NLP tools described in this review across other health organisations is essential and has only just begun. As described earlier, CRIS has also been implemented at other sites across the UK. One example is the Camden & Islington Research Database (
Other studies using EHR data for suicide-related research range from those that use rule-based approaches to e.g., estimate the use of diagnostic codes vs. information recorded in free text broadly in EHR data (
Furthermore, given the inherent variation across clinical populations, which is reflected in the language used in clinical reporting, NLP tools developed for one clinical subpopulation, such as working age adults, may not be reliably transferable to another group, such as school age children, without adaptation. NLP systems used to identify suicide-related constructs in clinical notes must, therefore, be developed for and validated within each target population. The same principle applies to the application of NLP tools across institutions and EHR systems. The studies in this review all use data from the same system, CRIS, in which language is likely to show a certain level of internal homogeneity (e.g., in terminology) compared with other systems. Testing the generalisability of the NLP tools described in this review has only just started.
When free text and NLP models are also included, as mentioned above, the extent to which internal homogeneity (e.g., in terminology) affects results across different institutions and clinical settings is an area that merits further study, both to advance the field and to provide evidence about the broader generalisability of findings. The culture, incentives, and structure of clinical systems outside the UK may induce further differences in the signals that NLP systems rely on to detect discussion of suicide. Collaborative efforts are currently being made to compare methodologies and NLP tools across healthcare institutions not just within the UK, but also with collaborators in the USA. We envision advances in ML and NLP methods, standards for interoperability, and infrastructures to enable such comparisons in the future.
Furthermore, advances in computational analysis of EHR data, e.g., machine learning in combination with NLP, will continue to develop, and provide novel solutions to suicide research (
Going beyond identification or prediction of those at risk, analysis of continuously collected data, and integration of EHR data with smartphone, wearable device and even social media data could allow collection of data across different time periods, not just at the time of clinical interactions, thus helping to understand suicidal crises and enabling delivery of targeted suicide prevention interventions (
In this review of a decade of research into suicide and related behaviour using CRIS, we have summarised the evolution of different methods employed to identify suicide and related behaviour, including linkages to mortality data, structured ICD-10 codes, manual review of clinical notes, keyword searching in free text and relevant mentions identified using NLP techniques. Cohorts under study have varied in size from several hundred to tens of thousands of patients and have covered adult and elderly patients as well as children and adolescents. A range of clinical conditions and populations has been described from the perspective of suicide and related behaviours, including pregnancy, severe mental illness and self-harm, opioid use disorder, chronic fatigue syndrome and autism spectrum disorders. Finally, some studies have identified and investigated specific clinical events, such as emergency department attendances or hospital admissions.
In conclusion, the breadth and depth of research on suicide and related behaviour conducted using CRIS over the past decade have advanced the field in ways unthinkable prior to the availability of EHR data. These studies not only add to the clinical evidence base, but also reflect an important evolution in the development and applicability of data-driven methods that is central to advancing this field further. We envision increased progress in the decades to come, particularly in externally validating findings across multiple sites and countries, both in terms of clinical evidence and in terms of NLP and machine learning method transferability.
The de-identified CRIS database has received ethical approval for secondary analysis: Oxford REC C, reference 18/SC/0372. The data is used in an anonymised and data-secure format under strict governance procedures. CRIS data is made available to researchers with appropriate credentials (provided by the South London and Maudsley NHS Trust) working on approved projects. Projects are approved by a CRIS Oversight Committee, a body set up by and reporting to the South London and Maudsley Caldicott Guardian. On request, and after appropriate credentials have been obtained as well as arrangements with the lead of the respective CRIS project, data presented in this study can be viewed within the secure system firewall.
RD and SV proposed the manuscript and its contents. AB wrote the first draft of the manuscript and compiled data pertaining to study design, cohorts and NLP systems used in the cited literature, and incorporated edits by other authors. Each author contributed to specific sections of the manuscript: RD and SV on introductory and historical overviews; RS on use of structured data fields; JD on child and adolescent populations; AB and SV on natural language processing; RD, SV, and JD on perspectives and conclusions. All authors contributed to editing and revising the manuscript and approved the final version.
RD and SV declare previous research funding received from Janssen. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
The authors acknowledge infrastructure support from the National Institute for Health Research (NIHR).