Classifying Characteristics of Opioid Use Disorder From Hospital Discharge Summaries Using Natural Language Processing

Poulsen, Melissa N.; Freda, Philip J.; Troiani, Vanessa; Davoudi, Anahita; Mowery, Danielle L.

doi:10.3389/fpubh.2022.850619

ORIGINAL RESEARCH article

Front. Public Health, 09 May 2022

Sec. Digital Public Health

Volume 10 - 2022 | https://doi.org/10.3389/fpubh.2022.850619

This article is part of the Research Topic: Computational Methods in Substance Use and Addiction Research

Melissa N. Poulsen¹*‡

Philip J. Freda²†‡

Vanessa Troiani³

Anahita Davoudi²

Danielle L. Mowery²,⁴

¹Department of Population Health Sciences, Geisinger, Danville, PA, United States
²Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, United States
³Autism and Developmental Medicine Institute, Geisinger, Danville, PA, United States
⁴Department of Biostatistics, Epidemiology and Informatics, Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, United States

Background: Opioid use disorder (OUD) is underdiagnosed in health system settings, limiting research on OUD using electronic health records (EHRs). Medical encounter notes can enrich structured EHR data with documented signs and symptoms of OUD and social risks and behaviors. To capture this information at scale, natural language processing (NLP) tools must be developed and evaluated. We developed and applied an annotation schema to deeply characterize OUD and related clinical, behavioral, and environmental factors, and automated the annotation schema using machine learning and deep learning-based approaches.

Methods: Using the MIMIC-III Critical Care Database, we queried hospital discharge summaries of patients with International Classification of Diseases (ICD-9) OUD diagnostic codes. We developed an annotation schema to characterize problematic opioid use, identify individuals with potential OUD, and provide psychosocial context. Two annotators reviewed discharge summaries from 100 patients. We randomly sampled patients with their associated annotated sentences and divided them into training (66 patients; 2,127 annotated sentences) and testing (29 patients; 1,149 annotated sentences) sets. We used the training set to generate features, employing three NLP algorithms/knowledge sources. We trained and tested prediction models for classification with a traditional machine learner (logistic regression) and deep learning approach (Autogluon based on ELECTRA's replaced token detection model). We applied a five-fold cross-validation approach to reduce bias in performance estimates.

Results: The resulting annotation schema contained 32 classes. We achieved moderate inter-annotator agreement, with F₁-scores across all classes increasing from 48 to 66%. Five classes had a sufficient number of annotations for automation; of these, we observed consistently high performance (F₁-scores) across training and testing sets for drug screening (training: 91–96; testing: 91–94) and opioid type (training: 86–96; testing: 86–99). Performance dropped from the training to the testing sets for other drug use (training: 52–65; testing: 40–48), pain management (training: 72–78; testing: 61–78), and psychiatric (training: 73–80; testing: 72). Autogluon achieved the highest performance.

Conclusion: This pilot study demonstrated that rich information regarding problematic opioid use can be manually identified by annotators. However, more training samples and features would improve our ability to reliably identify less common classes from clinical text, including text from outpatient settings.

Introduction

In 2020, 9.5 million Americans aged 12 years and older had misused opioids in the past year and 2.7 million had an opioid use disorder (OUD) (1). OUD is characterized by a loss of control of opioid use, risky opioid use, impaired social functioning, tolerance, and withdrawal, as defined by the Diagnostic and Statistical Manual of Mental Disorders-5 (DSM-5). Opioid misuse and OUD have a host of negative impacts on individuals' health and quality of life, including risk of overdose and death. In 2020, overdose deaths reached a new high of 93,000; of these deaths, approximately 70,000 were attributable to opioids, including prescription opioids, heroin, and fentanyl (2). The opioid epidemic presents an urgent public health crisis that warrants innovative research strategies to identify those at risk for opioid-related morbidity and mortality.

Opioid Use Disorder Research Using Electronic Health Records

Electronic health records (EHRs) have been widely used for population health research (3). Most studies rely upon structured data contained within EHRs—such as diagnostic codes, medication orders, or laboratory tests—to identify individuals with specific conditions. Regarding OUD, a review of studies through 2015 identified 15 algorithms developed to identify non-medical opioid use, the majority of which used medical claims data (4). Such algorithms that incorporate opioid prescriptions are particularly useful for identifying iatrogenic cases of OUD (stemming from prescription opioid dependence) (5). Given the underdiagnosis of OUD (6), structured EHR data has less utility for identifying OUD that may have arisen through illicit opioid use. The historic underdiagnosis of OUD may be due to several factors, including uncertainty in diagnosing the condition by providers lacking specialty training, as well as stigma that leads providers to avoid assigning diagnostic codes for opioid misuse or patients to hide their condition (6, 7). Unstructured data contained within EHRs, including clinical narratives within medical encounter notes, document signs and symptoms of OUD as well as social risks and behaviors that may not be captured with diagnostic codes, providing a useful source of data that can enrich structured EHR data.

Framework for Developing Natural Language Processing Tools

Efficiently synthesizing information from clinical text requires automated information extraction techniques such as natural language processing (NLP). An important first step to NLP is the development of a rigorous annotation process, which is critical to the reliability and performance of the NLP system (8). The standard approach to annotation includes multiple annotators reviewing and marking the same data and computing agreement across annotators, generally measured by inter-annotator agreement (IAA). IAA provides an indication of the difficulty and clarity of a task. To develop a high-quality corpus of annotated text, annotators follow a set of guidelines to ensure the process is consistent and objective (8).

Modern NLP methods include symbolic rules, machine learning, deep learning, and hybrid-based approaches. Validation processes are used to reduce biased performance estimates, particularly for studies with small sample sizes in which there is less statistical power for pattern recognition (9). Feature selection, cross-validation, and train/test split approaches have been shown to produce less biased performance estimates, even with a small sample (9). The performance of NLP tools is typically evaluated using measures of recall, precision, and F₁-score (10).

Natural Language Processing for Opioid Use Disorder Identification

Prior studies have utilized NLP to identify problematic opioid use (7, 11–15) and opioid overdose (16) from EHR and paramedic response documentation (17). However, several gaps remain in the development of NLP systems to identify problematic opioid use and OUD. Symbolic rule-based systems that rely on keyword lists, regular expressions, and term co-occurrence have been most commonly developed, such as nDepth™ (11) and MediClass (16), among other tools (7, 12, 13). More contemporary NLP approaches remain limited, with only three previous studies having applied machine learning methods to identify opioid misuse (14, 15, 17). Of these studies, only Lingeman and colleagues (14) described details of their annotation process, with annotation performed by a single annotator. Lingeman and colleagues (14) also expanded beyond keywords such as “opioid abuse” to capture a greater range of opioid-related aberrant behaviors. However, other clinical, behavioral, and environmental factors linked to OUD documented in clinical notes, such as other substance use disorders, psychiatric co-morbidities, chronic pain, overdose, and social determinants of health (e.g., homelessness), could prove useful in characterizing OUD. Finally, prior studies have primarily been conducted among patients on long-term prescription opioids, e.g., as therapy for chronic pain, with one exception (15), missing opportunities to identify and study OUD related to illicit opioid use in the population.

Thus, studies are needed that utilize rigorous annotation approaches to inform NLP systems that include individuals who developed OUD through illicit opioid use and that draw upon additional information contained in clinical text to deeply characterize problematic opioid use. Such efforts could inform development of an NLP tool that would facilitate more accurate case finding in EHR data, bolstering a range of research on OUD, including epidemiologic, clinical, and genetic studies (18). Our long-term objective is to develop an NLP system that identifies and characterizes cases of OUD arising from both prescription and illicit opioid use to conduct EHR-based studies to understand biological, patient, provider, and community factors associated with OUD. Our short-term objectives in this study were to develop and apply an annotation schema to deeply characterize OUD, and to automate the schema using machine learning and deep learning-based approaches. Herein, we describe our annotation process and schema, and then present the results of two supervised classification approaches.

Methods

We first developed an annotation schema to characterize problematic opioid use, identify individuals with potential OUD, and provide psychosocial context surrounding the condition. We applied the schema to clinical notes of de-identified patients with an OUD diagnosis. We then developed computational methods to automate the schema using machine and deep learning and evaluated the informativeness of features for predicting OUD and its contexts at the sentence level in hospital encounter documentation. The Geisinger and University of Pennsylvania Institutional Review Boards reviewed the protocol for this study and determined it met criteria for exempt human subjects research, as all data were fully de-identified.

Study Population

Individuals included in this study came from the MIMIC-III Critical Care Database, a publicly available, de-identified dataset that includes clinical data for roughly 60,000 patients with a hospital stay at Beth Israel Deaconess Medical Center in Boston, Massachusetts between 2001 and 2012 (19). From the MIMIC-III dataset, we downloaded discharge summaries from 762 patients who had an International Classification of Diseases, version 9 (ICD-9) code related to OUD (304.00–304.03, 304.7, 304.70–304.73, 304.8, 304.81, 304.82, 304.83, 305.50–305.53, 965.00, 965.01, 965.02, 965.09, E850.0, E935.0).
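
The cohort selection described above can be illustrated with a minimal Python sketch. It is not the study's query: the input format (pairs of patient identifier and ICD-9 code) and all function names are assumptions made for illustration, and codes are normalized by removing the decimal point so that dotted and undotted spellings compare equal.

```python
# Illustrative sketch only: the (patient_id, icd9_code) input format and the
# helper names are hypothetical, not taken from the study.

# OUD-related ICD-9 codes as listed in the text, with the ranges written out.
OUD_ICD9_CODES = [
    "304.00", "304.01", "304.02", "304.03",
    "304.7", "304.70", "304.71", "304.72", "304.73",
    "304.8", "304.81", "304.82", "304.83",
    "305.50", "305.51", "305.52", "305.53",
    "965.00", "965.01", "965.02", "965.09",
    "E850.0", "E935.0",
]


def normalize(code: str) -> str:
    """Drop the decimal point and whitespace so '304.00' and '30400' match."""
    return code.replace(".", "").strip().upper()


OUD_CODE_SET = {normalize(code) for code in OUD_ICD9_CODES}


def patients_with_oud_code(diagnoses):
    """Return the ids of patients with at least one OUD-related code."""
    return {pid for pid, code in diagnoses if normalize(code) in OUD_CODE_SET}


# Toy rows: patients 1 and 2 carry an OUD-related code, patient 3 does not.
example_rows = [(1, "30400"), (1, "4019"), (2, "E850.0"), (3, "25000")]
cohort = patients_with_oud_code(example_rows)
```

Discharge summaries would then be retrieved only for the patient identifiers in `cohort`.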

Annotation Schema Development

Initial development of the annotation schema included both deductive and inductive approaches to defining classes. We first drew upon prior research that used medical record review to identify OUD based on DSM-5 criteria (6), creating classes to reflect these criteria. We added to this initial set of classes by reading through discharge summaries and considering instances related to opioid and other drug use, as well as our knowledge of previously identified risk factors for OUD (e.g., psychiatric conditions). We then iteratively refined the initial schema through our first round of annotation of discharge summaries for five patients. We developed guidelines that defined each class and provided examples to ensure consistency between annotators. All authors were involved in the schema development.

The final annotation schema represented a deep characterization of OUD-related information documented in clinical notes, with 32 classes related to problematic opioid use, factors contributing to opioid use/misuse, substance use, and consequences of opioid misuse (Figure 1; Supplementary Table 1). Several classes contained attributes (e.g., drug screening types and results). We included a class labeled other contexts to capture details in discharge summaries that were potentially relevant for OUD, but for which we had not defined a specific class (e.g., “altered mental status,” “counseled on drug use”). This class was largely intended to inform future changes to the annotation schema. We also included a patient-level assertion of OUD status, which was annotated at the level of each discharge summary rather than at the sentence level. This was made based on the clinical writer's assertion of OUD status rather than the annotators' assessment and was classified as positive, negative, uncertain, or not-specified. For example, a discharge summary in which the clinical writer noted that the patient abused heroin or was receiving methadone treatment at a drug treatment facility was classified as “positive,” whereas a summary in which the clinical writer made no comments indicating whether or not the patient had an opioid use disorder was classified as “not-specified.”

FIGURE 1

Figure 1. Depiction of annotation schema characterizing OUD-related information contained in discharge summaries. Figure created using Coggle (https://coggle.it/).

Annotation Study

We leveraged an open-source text annotation tool called the extensible Human Oracle Suite of Tools (eHOST) (20) to annotate discharge summaries. Two authors (MP and PF) separately reviewed the full discharge summaries and annotated individual sentences for the same 40 patients over eight rounds of annotation (corpus 1). Annotation was completed at the sentence level, assigning full sentences to one or more relevant classes, with the exceptions of the class opioid type, for which we annotated phrases (i.e., the specific opioid name), and the patient-level OUD assertion. After completing a batch of five patients and their associated notes, we calculated F₁-scores to capture IAA among classes and types for overlapping spans (IAA was not calculated for the first batch because this batch was primarily used to refine the annotation schema; annotations from these five patients were also not included in the automation study). Annotations were adjudicated, with disagreements resolved through discussion with all study authors. Once the IAA was deemed sufficient to begin separate annotation work, the same two authors then annotated discharge summaries for a unique set of 30 patients each (total of 60 patients; corpus 2). We report the IAA over each batch as well as the frequency distribution and highest IAA achieved for each class.
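
As a rough illustration of F₁-based IAA, the sketch below compares two annotators' labels in Python. It simplifies the procedure described above: annotations are treated as exact (sentence identifier, class) pairs rather than overlapping spans, and the function name and toy data are hypothetical.

```python
# Illustrative sketch: pairwise F1 agreement between two annotators.
# Assumption: each annotation is an exact (sentence_id, class_label) pair;
# the study computed agreement over overlapping spans in eHOST.

def pairwise_f1(annotations_a, annotations_b):
    """F1 when one annotator is treated as the reference for the other.

    With sets A and B, precision = |A & B| / |B| and recall = |A & B| / |A|,
    so F1 = 2 * |A & B| / (|A| + |B|), which is symmetric in A and B.
    """
    a, b = set(annotations_a), set(annotations_b)
    if not a and not b:
        return 1.0  # both annotators marked nothing: trivial agreement
    return 2 * len(a & b) / (len(a) + len(b))


annotator_1 = {(1, "opioid type"), (2, "pain management"), (3, "psychiatric")}
annotator_2 = {(1, "opioid type"), (2, "psychiatric"), (3, "psychiatric")}
agreement = pairwise_f1(annotator_1, annotator_2)  # 2 matches out of 3 + 3
```

Because the measure is symmetric, it does not matter which annotator is taken as the reference.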

Automation Study

Experimental Design

We randomly sampled patients with their associated annotated sentences and divided them into training (65%) and testing (35%) sets. The training set was used to generate and select the most informative features for predicting each class and reduce the likelihood of overfitting. To ensure the comparability of the training and testing sets, we evaluated class distributions between the two data sets.
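
A patient-level split of this kind might look like the following Python sketch. The record format, function name, and seed are assumptions; the sketch also applies the training fraction to patients, whereas the text does not specify whether the 65%/35% proportions refer to patients or to sentences.

```python
import random

# Illustrative sketch: split annotated sentences by patient so that no
# patient contributes sentences to both sets. Names and the record format
# (dicts with a 'patient_id' key) are hypothetical.

def split_by_patient(sentences, train_fraction=0.65, seed=0):
    patient_ids = sorted({s["patient_id"] for s in sentences})
    rng = random.Random(seed)  # fixed seed for a reproducible shuffle
    rng.shuffle(patient_ids)
    n_train = round(train_fraction * len(patient_ids))
    train_ids = set(patient_ids[:n_train])
    train = [s for s in sentences if s["patient_id"] in train_ids]
    test = [s for s in sentences if s["patient_id"] not in train_ids]
    return train, test


# Toy corpus: 20 patients with 3 annotated sentences each.
sentences = [
    {"patient_id": pid, "text": f"sentence {i}"}
    for pid in range(20)
    for i in range(3)
]
train_set, test_set = split_by_patient(sentences)
```

Sampling patients rather than individual sentences keeps text from the same discharge summary out of both sets.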

Feature Generation and Selection

We leveraged the training set to generate and select features informative for training prediction models to classify sentences according to each class from the annotation schema. First, for each entry from the training dataset, we preprocessed the text by converting it to lowercase and adding spaces around punctuation to best encode terms.
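
The preprocessing step can be sketched in a few lines of Python. The exact rules used in the study are not given, so the regular expressions here are an assumption about what lowercasing and spacing punctuation could look like.

```python
import re

# Illustrative sketch of the preprocessing step; the patterns are assumed.

def preprocess(text: str) -> str:
    text = text.lower()
    # Put spaces around every character that is neither a word character
    # nor whitespace, so punctuation becomes its own token.
    text = re.sub(r"([^\w\s])", r" \1 ", text)
    # Collapse the resulting runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


cleaned = preprocess("Pt denies IV heroin use; UDS positive for opiates.")
# cleaned == "pt denies iv heroin use ; uds positive for opiates ."
```

One side effect of this simple rule is that decimal values such as doses are also split at the decimal point, which a production pipeline may want to handle separately.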

Next, we selected three open-source NLP systems/knowledge bases to encode semantic features from the annotated sentences: Empath, Unified Medical Language System (UMLS), and ConText (Table 1). We chose these approaches and systems to encode features based on their demonstrated informativeness in prior studies from the OUD literature [e.g., (7, 11, 14, 15)]. We applied Empath (21), a tool that draws connotations between words and phrases based on neural word embeddings from over 1.8 billion words of modern fiction, to generate semantic categories based on lay terms, including categories describing clinical, behavioral, and environmental factors such as pain, alcohol, crime, and family. We applied existing categories derived from Empath based on the Reddit corpus, which captures common, rather than clinical, language to describe these concepts. However, Empath has broader coverage of terms related to these topics, at the expense of semantic precision. Therefore, we removed existing, built-in categories that did not capture accurate semantics in the clinical text (e.g., the category of “heroic” spuriously encoded “heroine,” a misspelling of “heroin”). To overcome limitations in coverage of relevant concepts by Empath, we added novel categories, including “opioid,” “dosage,” “overdose,” “withdrawal,” “psychiatric,” and “substance abuse.” Next, we leveraged scispacy to encode clinical concepts from the UMLS, a standardized vocabulary of biomedical concepts (22). The UMLS contains a robust terminology for clinical conditions (sign or symptom, disease or syndrome, mental or behavioral dysfunction) and illicit and non-illicit drugs (clinical drug, pharmacologic substance), among other medical concepts mapped to concept unique identifiers (CUIs; e.g., “heroin” and “diamorphine” map to “C0011892”). Finally, we applied the Python version of the ConText algorithm to encode contextual information important for discerning historical from recent events, references to patients from references to family members, and negations from affirmed states (23, 24). We also encoded syntactic information including use of conjunctions, pseudo-negations, etc. Examples of features and their usage can be found in Table 1.
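
To make the idea of lexicon-based semantic categories concrete, the Python sketch below counts category hits per sentence. It does not use Empath's API or its term lists: the category names echo those mentioned above, but the terms assigned to each category are hypothetical, and the lookup is a plain dictionary match rather than an embedding-derived lexicon.

```python
from collections import Counter

# Illustrative sketch of category features from a small hand-made lexicon.
# The term lists are hypothetical and far smaller than a real lexicon.
LEXICON = {
    "opioid": {"heroin", "methadone", "oxycodone", "fentanyl"},
    "withdrawal": {"withdrawal", "detox"},
    "overdose": {"overdose", "narcan", "naloxone"},
}


def category_counts(tokens):
    """Count how many tokens fall into each lexicon category."""
    counts = Counter()
    for token in tokens:
        for category, terms in LEXICON.items():
            if token in terms:
                counts[category] += 1
    return dict(counts)


tokens = "patient admitted after heroin overdose , given narcan".split()
features = category_counts(tokens)
# features == {"opioid": 1, "overdose": 2}
```

Counts like these could then sit alongside concept identifiers and contextual flags (for example, negation) as inputs to a sentence classifier.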

TABLE 1

Table 1. Feature types used to train and test supervised classifiers, with examples of features and related annotation sentences.

We generated boxplot representations and applied an ANOVA to statistically compare the training and testing datasets based on the mean length of annotations (i.e., the number of words) in each class and the mean number of features per class for each feature type. We applied Chi-square feature selection to identify and retain only the most informative features for classifying each class. We graphed the frequency distribution of the reduced set of features by type and class. All graphs were generated using the R package.
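
For intuition about Chi-square feature selection, the Python sketch below scores binary features against a binary class using the textbook 2×2 Pearson statistic and keeps the top-ranked features. This is an assumption-laden illustration: the study does not state which implementation was used, and library implementations may compute the statistic differently (for example, from feature counts rather than presence/absence).

```python
# Illustrative sketch: 2x2 Pearson chi-square between a binary feature and
# a binary class label. Function names and toy data are hypothetical.

def chi_square_2x2(feature_present, labels):
    n = len(labels)
    pairs = list(zip(feature_present, labels))
    a = sum(1 for f, y in pairs if f and y)          # feature present, in class
    b = sum(1 for f, y in pairs if f and not y)      # feature present, not in class
    c = sum(1 for f, y in pairs if not f and y)      # feature absent, in class
    d = n - a - b - c                                # feature absent, not in class
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a constant feature or label carries no information
    return n * (a * d - b * c) ** 2 / denom


def top_k_features(feature_columns, labels, k):
    """Rank features by chi-square score and keep the k highest."""
    scores = {name: chi_square_2x2(col, labels) for name, col in feature_columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


labels = [1, 1, 1, 0, 0, 0]
feature_columns = {
    "opioid": [1, 1, 1, 0, 0, 0],   # perfectly aligned with the class
    "family": [1, 0, 1, 0, 1, 0],   # weakly related to the class
}
selected = top_k_features(feature_columns, labels, k=1)
```

In this toy example the "opioid" feature scores 6.0 and "family" scores about 0.67, so only "opioid" is retained.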

Sentence Classification

We developed prediction models for classifying each sentence according to class using scikit-learn (25) and Autogluon version 0.3.1 (26–28), two machine learning and data science packages for developing prediction models for binary classification tasks. We trained and tested two supervised machine and deep learning classifiers—logistic regression and Autogluon (28)—to classify each sentence according to an OUD class.

Each algorithm was trained using the default settings, as described below. No hyperparameter tuning was carried out.

• Logistic regression: This classifier uses a sigmoid function defined by linear transformation of the features to find the best model to describe the relationship between the target variable (output) and a given set of features (inputs). The default parameters were penalty = “l2”, dual = False, tol = 0.0001, C = 1.0, fit_intercept = True, intercept_scaling = 1, class_weight = None, random_state = None, solver = “lbfgs”, max_iter = 100, multi_class = “auto”, verbose = 0, warm_start = False, n_jobs = None, l1_ratio = None.

• Autogluon: We trained and tested TextPredictor, which fits a transformer neural network model using transfer learning from a pretrained model, ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately). ELECTRA is pretrained with replaced token detection rather than the masked language modeling used by models like BERT (Bidirectional Encoder Representations from Transformers) (29) and has been shown to produce superior results to BERT given the same model size, data, and compute (28, 30). The pretrained model is an electra base discriminator with a learning rate decay of 0.90. Class predictions were output through two additional dense layers. Each classifier was trained using ten epochs and 150 iterations.
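
The logistic regression decision function described in the first bullet can be written out directly. The Python sketch below only shows the prediction step (a sigmoid applied to a weighted sum of features); the weights and bias are hypothetical values chosen for illustration, not fitted parameters from the study, and no training or regularization is shown.

```python
import math

# Illustrative sketch of a logistic regression prediction for one sentence.
# Weights, bias, and feature names are hypothetical.

def predict_proba(features, weights, bias):
    """Probability of the positive class: sigmoid(bias + sum(w_i * x_i))."""
    z = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    # Numerically stable sigmoid that avoids overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


weights = {"opioid": 2.0, "negated": -1.5}
bias = -1.0
p = predict_proba({"opioid": 1, "negated": 0}, weights, bias)  # z = 1.0
is_positive = p >= 0.5  # default decision threshold for a binary classifier
```

Fitting the model amounts to choosing the weights and bias that best separate the annotated sentences of a class from the rest, subject to the L2 penalty listed among the defaults.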

Validation and Performance Evaluation

For both supervised learning classifiers, we trained and tested prediction models for classes with at least 100 annotations in the training set. We did not evaluate the other contexts class because it was not meaningful for OUD characterization. Classes with fewer annotations were not included due to concerns about overfitting, which could result in less robust and poorly generalizable prediction models.

We implemented a cross-validation approach using both the training and testing folds in an effort to reduce the likelihood of producing biased performance estimates due to small sample sizes. We applied a five-fold cross-validation approach to train the prediction models on the training set, reporting the average performance across validation folds. The testing set was separated into five folds to provide an additional external validation of the prediction models generated by the training set. We computed the standard performance metrics of recall (sensitivity) and precision (positive predictive value) to evaluate how well each classifier identified each class. We also computed the F₁-score (the harmonic mean of recall and precision) to select the classifier with the best performance (10). Training was optimized for F₁-score. We report the means and 95% confidence intervals for the five-fold cross-validation results.
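
The metrics above can be illustrated with a short Python sketch. The metric definitions are standard; the confidence interval, however, is an assumption, since the text does not say how the intervals were computed. Here a two-sided Student's t interval over the five fold scores is used, and the fold scores themselves are made-up numbers.

```python
import math
import statistics

# Illustrative sketch: per-class metrics from confusion counts, plus a mean
# and 95% confidence interval across five folds (t interval is an assumption).

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Two-sided 95% critical value of Student's t with 4 degrees of freedom.
T_CRIT_DF4 = 2.776


def mean_ci_five_folds(scores):
    if len(scores) != 5:
        raise ValueError("expected exactly five fold scores")
    mean = statistics.mean(scores)
    half_width = T_CRIT_DF4 * statistics.stdev(scores) / math.sqrt(5)
    return mean, mean - half_width, mean + half_width


precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
fold_f1 = [0.90, 0.92, 0.91, 0.93, 0.94]  # made-up fold scores
mean_f1, ci_low, ci_high = mean_ci_five_folds(fold_f1)
```

With 8 true positives, 2 false positives, and 4 false negatives, precision is 0.80, recall is about 0.67, and the F₁-score is about 0.73.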