<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Big Data</journal-id>
<journal-title>Frontiers in Big Data</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Big Data</abbrev-journal-title>
<issn pub-type="epub">2624-909X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fdata.2022.846930</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Big Data</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Topic Modeling for Interpretable Text Classification From EHRs</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Rijcken</surname> <given-names>Emil</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1498613/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Kaymak</surname> <given-names>Uzay</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Scheepers</surname> <given-names>Floortje</given-names></name>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/517977/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Mosteiro</surname> <given-names>Pablo</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1554983/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Zervanou</surname> <given-names>Kalliopi</given-names></name>
<xref ref-type="aff" rid="aff4"><sup>4</sup></xref>
<xref ref-type="aff" rid="aff5"><sup>5</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Spruit</surname> <given-names>Marco</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="aff" rid="aff4"><sup>4</sup></xref>
<xref ref-type="aff" rid="aff5"><sup>5</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1069645/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Jheronimus Academy of Data Science, Eindhoven University of Technology</institution>, <addr-line>Eindhoven</addr-line>, <country>Netherlands</country></aff>
<aff id="aff2"><sup>2</sup><institution>Department of Information and Computing Sciences, Utrecht University</institution>, <addr-line>Utrecht</addr-line>, <country>Netherlands</country></aff>
<aff id="aff3"><sup>3</sup><institution>University Medical Center Utrecht</institution>, <addr-line>Utrecht</addr-line>, <country>Netherlands</country></aff>
<aff id="aff4"><sup>4</sup><institution>Public Health and Primary Care (PHEG), Leiden University Medical Center, Leiden University</institution>, <addr-line>Leiden</addr-line>, <country>Netherlands</country></aff>
<aff id="aff5"><sup>5</sup><institution>Leiden Institute of Advanced Computer Science (LIACS), Faculty of Science, Leiden University</institution>, <addr-line>Leiden</addr-line>, <country>Netherlands</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Tru Cao, University of Texas Health Science Center at Houston, United States</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Feng Chen, Dallas County, United States; Sihong Xie, Lehigh University, United States</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Emil Rijcken  <email>e.f.g.rijcken&#x00040;tue.nl</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Data Mining and Management, a section of the journal Frontiers in Big Data</p></fn></author-notes>
<pub-date pub-type="epub">
<day>04</day>
<month>05</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>5</volume>
<elocation-id>846930</elocation-id>
<history>
<date date-type="received">
<day>31</day>
<month>12</month>
<year>2021</year>
</date>
<date date-type="accepted">
<day>31</day>
<month>03</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2022 Rijcken, Kaymak, Scheepers, Mosteiro, Zervanou and Spruit.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Rijcken, Kaymak, Scheepers, Mosteiro, Zervanou and Spruit</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license> </permissions>
<abstract>
<p>The clinical notes in electronic health records offer many possibilities for predictive tasks in text classification. The interpretability of these classification models for the clinical domain is critical for decision making. Using topic models for text classification of electronic health records for a predictive task allows for the use of topics as features, thus making the text classification more interpretable. However, selecting the most effective topic model is not trivial. In this work, we propose considerations for selecting a suitable topic model based on both predictive performance and an interpretability measure for text classification. We compare 17 different topic models in terms of both interpretability and predictive performance in an inpatient violence prediction task using clinical notes. We find no correlation between interpretability and predictive performance. In addition, our results show that although no model outperforms the other models on both variables, our proposed fuzzy topic modeling algorithm (FLSA-W) performs best in most settings for interpretability, whereas two state-of-the-art methods (ProdLDA and LSI) achieve the best predictive performance.</p></abstract>
<kwd-group>
<kwd>text classification</kwd>
<kwd>topic modeling</kwd>
<kwd>explainability</kwd>
<kwd>interpretability</kwd>
<kwd>electronic health records</kwd>
<kwd>psychiatry</kwd>
<kwd>natural language processing</kwd>
<kwd>information extraction</kwd>
</kwd-group>
<counts>
<fig-count count="7"/>
<table-count count="0"/>
<equation-count count="3"/>
<ref-count count="45"/>
<page-count count="11"/>
<word-count count="7351"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>Inpatient violence at psychiatry departments is a common and severe problem (van Leeuwen and Harte, <xref ref-type="bibr" rid="B43">2017</xref>). Typical adverse reactions that victims (professionals) face include emotional reactions, symptoms of post-traumatic stress disorder, and a negative impact on work functioning. Therefore, it is vital to assess the risk of a patient showing violent behavior and take preventive measures. The psychiatry department of the University Medical Center Utrecht uses questionnaires to predict the likelihood of patients becoming violent. However, filling out these forms is time-consuming and partly subjective. Instead, automated machine-learning approaches based on existing patient information could overcome the time burden and help make more objective predictions. Various automated text classification approaches utilizing clinical notes in the electronic health record allow for more accurate predictions than the questionnaires (Menger et al., <xref ref-type="bibr" rid="B26">2018a</xref>; Mosteiro et al., <xref ref-type="bibr" rid="B31">2021</xref>). In addition to accurate predictions, clinical providers and other decision-makers in healthcare consider the interpretability of model predictions a priority for implementation and utilization. As machine learning applications are increasingly being integrated into various parts of the continuum of patient care, the need for prediction explanation is imperative (Ahmad et al., <xref ref-type="bibr" rid="B1">2018</xref>). Yet, an intuitive understanding of the automated text classification approaches&#x00027; inner workings is currently missing, as the clinical notes are represented numerically by large dense matrices with unknown semantic meaning.</p>
<p>A more intuitive and potentially interpretable approach is text classification through topic modeling, where clinical notes can be represented as a collection of topics. To do so, a topic model is trained on all the written notes to find <italic>k</italic> topics. Each topic consists of the <italic>n</italic> most likely words associated with that topic and weights for each word. After training the topic model, all the documents associated with one patient can be represented by a <italic>k</italic>-length vector in which each cell indicates the extent to which that topic appears in the text. The assumption is that if the generated topics are readily interpretable, the model&#x00027;s decision making is more explainable. Several authors have focused on text classification through topic modeling in health care (Rumshisky et al., <xref ref-type="bibr" rid="B38">2016</xref>; Wang et al., <xref ref-type="bibr" rid="B45">2019</xref>). These studies use Latent Dirichlet Allocation (LDA) (Blei et al., <xref ref-type="bibr" rid="B5">2003</xref>). Yet, many other topic modeling algorithms exist, and selecting a model is not straightforward. A topic modeling algorithm for text classification should be selected based on predictive performance and interpretability. If a model performs well on predictions but is not interpretable, then there is no added value for our analysis in using topic models for this task. Similarly, if the predictive performance is low, but the interpretability is high, topic models should not be used for classification. We note that, to the best of our knowledge, no previous work focuses on both the predictive performance and interpretability of topic models for text classification.</p>
<p>In this article, we train seventeen different topic models and use their topic embeddings as input for text classification (violence prediction). Then, we analyze how each model&#x00027;s interpretability compares to its predictive value. From this analysis, we make the following contributions:</p>
<list list-type="order">
<list-item><p>We are the first to analyze both the predictive performance and the interpretability of topic models for text classification.</p></list-item>
<list-item><p>We compare 17 topic modeling algorithms based on both criteria.</p></list-item>
<list-item><p>We present considerations that can be used for the selection of a topic model for text classification.</p></list-item>
</list>
<p>The outline of the article is as follows. In Section 2, we describe how topic models work, how they can be used for text classification, how different algorithms relate to one another, and which measures are used for evaluation. In Section 3, we describe our comparison methodology and the data set that we used. In Section 4, we provide tables and graphs to illustrate how different topic modeling algorithms compare to each other. In Section 5, we discuss our findings and their implications, and we conclude the work in Section 6.</p></sec>
<sec id="s2">
<title>2. Topic Modeling Algorithms</title>
<p>We compare different topic modeling algorithms based on their interpretability and predictive performance for text classification. In this section, we describe the task of text classification, followed by a description of topic models. Then, we discuss the best-known topic modeling algorithms and how they have been used for text classification.</p>
<sec>
<title>2.1. Text Classification</title>
<p>Classification models are a set of techniques that map input data (in the feature space) to a fixed set of output labels (Flach, <xref ref-type="bibr" rid="B15">2012</xref>). Text classification is the task of assigning such a label to a text. Typically, an ML text classification pipeline contains two steps:</p>
<list list-type="order">
<list-item><p>representation step,</p></list-item>
<list-item><p>classification step.</p></list-item>
</list>
<p>In the first step, a text file is transformed from a string into a numeric representation, called an embedding. The classification algorithm in the next step then calculates the most likely label based on the embedding. The choice of technique depends on various aspects, such as the number of features, the size of the data set, and whether the technique should be interpretable. Typically, classification models are considered interpretable if they can indicate the weights that have been assigned to each input feature. Amongst classification models, the subset of commonly used interpretable models includes linear regression, logistic regression, decision trees, fuzzy systems, and association rules (Guillaume, <xref ref-type="bibr" rid="B17">2001</xref>; Alonso et al., <xref ref-type="bibr" rid="B2">2015</xref>).</p>
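<p>The two-step pipeline above can be sketched in a few lines of Python. The vocabulary, example text, and weights below are hypothetical illustrations, not data or models from this study:</p>

```python
# Minimal sketch of the two-step pipeline: (1) represent each text as a
# numeric vector, (2) map that vector to a label. Vocabulary, text, and
# weights are invented for illustration.
from collections import Counter

vocabulary = ["agitated", "calm", "shouting", "cooperative"]

def represent(text):
    """Step 1: turn a string into a fixed-length count vector (BOW)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def classify(vector, weights, bias=0.0):
    """Step 2: a linear classifier; a positive score yields label 1."""
    score = sum(w * x for w, x in zip(weights, vector)) + bias
    return 1 if score > 0 else 0

# Hypothetical weights that an interpretable model might learn:
weights = [1.0, -1.0, 1.5, -0.5]
embedding = represent("Patient was agitated and shouting")
label = classify(embedding, weights)
```

<p>Because the classifier exposes one weight per input feature, the contribution of each feature to the decision can be read off directly, which is what makes such models interpretable.</p>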
<sec>
<title>2.1.1. Representation Techniques</title>
<p>Early approaches for representing texts numerically used the bag-of-words approach (BOW) to represent each word as a one-hot encoding (Jurafsky and Martin, <xref ref-type="bibr" rid="B19">2009</xref>). BOW suffers from two significant limitations: (i) it is hard to scale, and (ii) it only considers the presence of a word in a text, not the word&#x00027;s position. Therefore, it does not capture syntactic information.</p>
<p>Neural embeddings such as Word2Vec (Mikolov et al., <xref ref-type="bibr" rid="B29">2013</xref>) do not suffer from BOW&#x00027;s limitations and have been used widely ever since being introduced in 2013. Through neural embeddings, words are represented as dense vectors in a high-dimensional space such that semantically similar words are located close to each other. Since Word2Vec&#x00027;s introduction, several neural embedding approaches have been used for text classification, such as BERT (Devlin et al., <xref ref-type="bibr" rid="B11">2018</xref>), Doc2Vec (Le and Mikolov, <xref ref-type="bibr" rid="B25">2014</xref>), GloVe (Pennington et al., <xref ref-type="bibr" rid="B33">2014</xref>), and ELMo (Peters et al., <xref ref-type="bibr" rid="B34">2018</xref>). These neural models have improved the performance of text classification significantly. However, relatively little is known about the information captured by these embeddings&#x00027; features. Therefore, there is still little understanding of the classification decisions in the subsequent step. Alternatively, the topics trained by topic models could serve as features for text classification. These topics are more interpretable than the features in neural representations and could help understand text classification decisions better.</p></sec></sec>
<sec>
<title>2.2. Topic Models</title>
<p>Topic models are a group of unsupervised natural language processing algorithms that calculate two quantities:</p>
<list list-type="order">
<list-item><p><italic>P</italic>(<italic>W</italic><sub><italic>i</italic></sub>|<italic>T</italic><sub><italic>k</italic></sub>), the probability of word <italic>i</italic> given topic <italic>k</italic>,</p></list-item>
<list-item><p><italic>P</italic>(<italic>T</italic><sub><italic>k</italic></sub>|<italic>D</italic><sub><italic>j</italic></sub>), the probability of topic <italic>k</italic> given document <italic>j</italic>,</p></list-item>
</list>
<p>with:</p>
<p><italic>i</italic> word index <italic>i</italic>&#x02208;{1, 2, 3, ..., <italic>M</italic>},</p>
<p><italic>j</italic> document index <italic>j</italic>&#x02208;{1, 2, 3, ..., <italic>N</italic>},</p>
<p><italic>k</italic> topic index <italic>k</italic>&#x02208;{1, 2, 3, ..., <italic>C</italic>},</p>
<p><italic>M</italic> the number of unique words in the data set,</p>
<p><italic>N</italic> the number of documents in the data set,</p>
<p><italic>C</italic> the number of topics.</p>
<p>The top-<italic>n</italic> words with the highest probability per topic are typically taken to represent a topic. Topic models aim to find topics in which these top-<italic>n</italic> words in each topic are coherent with each other so that the topic is interpretable and a common theme can be derived. Using topic embeddings for text classification, each input document is transformed into a vector of size <italic>C</italic>. Each cell indicates the extent to which the document belongs to a topic. After predictions are made for each input text, interpretable classification algorithms can reveal which topics were most important for performing classifications.</p></sec>
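<p>The two quantities above, and how they yield top-<italic>n</italic> word lists and a <italic>C</italic>-length document embedding, can be illustrated with a small sketch; the word probabilities below are invented for illustration:</p>

```python
# Sketch of how a trained topic model yields (i) a top-n word list per
# topic from P(W|T) and (ii) a C-length embedding per document from
# P(T|D). All probabilities are made-up illustrations.
p_word_given_topic = [
    {"medication": 0.4, "dose": 0.3, "mg": 0.2, "daily": 0.1},
    {"aggression": 0.5, "shouting": 0.3, "calm": 0.1, "staff": 0.1},
]

def top_n_words(topic, n=2):
    """Represent a topic by its n most probable words."""
    ranked = sorted(topic, key=topic.get, reverse=True)
    return ranked[:n]

# P(T_k | D_j) for one document: a vector of length C (here C = 2);
# each cell is the extent to which the document belongs to that topic.
p_topic_given_doc = [0.8, 0.2]

topics = [top_n_words(t) for t in p_word_given_topic]
```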
<sec>
<title>2.3. Topic Modeling Algorithms</title>
<p>We compare a set of state-of-the-art topic modeling algorithms as defined in Terragni et al. (<xref ref-type="bibr" rid="B40">2021</xref>), supplemented with topic modeling algorithms we have developed in an earlier study (Rijcken et al., <xref ref-type="bibr" rid="B36">2021</xref>). The different methods can be divided into two categories: methods based on dimensionality reduction and methods based on the Dirichlet distribution.</p>
<sec>
<title>2.3.1. Dimensionality Reduction Methods</title>
<p>The algorithms based on dimensionality reduction all start with a document-term matrix <bold>A</bold>. This could be a simple bag-of-words representation, but typically a weighting mechanism such as tf-idf is applied. The algorithms based on dimensionality reduction are the following.</p>
<sec>
<title>2.3.1.1. NMF</title>
<p>One of the oldest methods is non-negative matrix factorization (NMF) (F&#x000E9;votte and Idier, <xref ref-type="bibr" rid="B14">2011</xref>). Starting from matrix <bold>A</bold>, NMF returns two matrices, <bold>W</bold> and <bold>H</bold>. Since the vectors of the decomposed representations are non-negative, their coefficients are non-negative as well. <bold>W</bold> contains the found topics (topics &#x000D7; words) and <bold>H</bold> contains the coefficients (documents &#x000D7; topics). NMF iteratively modifies the initial values of <bold>W</bold> and <bold>H</bold> so that their product approaches <bold>A</bold>.</p></sec>
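<p>A minimal sketch of this factorization, using the standard multiplicative update rules and assuming NumPy is available, is given below; the toy document-term matrix is invented, and the factor names follow the convention above (<bold>H</bold>: documents &#x000D7; topics, <bold>W</bold>: topics &#x000D7; words):</p>

```python
import numpy as np

# Toy document-term matrix A (documents x words); values are invented.
A = np.array([[2.0, 1.0, 0.0, 0.0],
              [3.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 2.0, 3.0],
              [0.0, 1.0, 1.0, 2.0]])

rng = np.random.default_rng(0)
k = 2  # number of topics
H = rng.random((A.shape[0], k)) + 0.1   # documents x topics
W = rng.random((k, A.shape[1])) + 0.1   # topics x words

err_before = np.linalg.norm(A - H @ W)
eps = 1e-9  # avoids division by zero
for _ in range(200):
    # Multiplicative updates keep every entry non-negative while
    # driving the product H @ W toward A.
    H *= (A @ W.T) / (H @ W @ W.T + eps)
    W *= (H.T @ A) / (H.T @ H @ W + eps)
err_after = np.linalg.norm(A - H @ W)
```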
<sec>
<title>2.3.1.2. LSI</title>
<p>Other foundational work on topic modeling is latent semantic indexing (LSI)<xref ref-type="fn" rid="fn0001"><sup>1</sup></xref>, which uses singular value decomposition (SVD) for dimensionality reduction on matrix <bold>A</bold> (Landauer et al., <xref ref-type="bibr" rid="B23">1998</xref>). SVD&#x00027;s output is a decomposition of <bold>A</bold>, such that <bold>A</bold> = <bold>U&#x003A3;V</bold><sup><italic>T</italic></sup>. In this case, <bold>U</bold> emerges as the document-topic matrix <bold><italic>P</italic>(<italic>T</italic><sub><italic>k</italic></sub>|<italic>D</italic><sub><italic>j</italic></sub>)</bold>, <bold>V</bold> becomes the term-topic matrix <bold><italic>P</italic>(<italic>W</italic><sub><italic>i</italic></sub>|<italic>T</italic><sub><italic>k</italic></sub>)</bold>, and <bold>&#x003A3;</bold> contains the singular values on its diagonal.</p></sec>
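<p>The decomposition can be sketched with a toy document-term matrix, assuming NumPy is available; the matrix values are invented:</p>

```python
import numpy as np

# Sketch of LSI: the SVD of a toy document-term matrix A (documents x
# words) gives A = U S V^T, where U relates documents to latent topics
# and V relates words to topics.
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0]])

U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Keeping only the k largest singular values gives a rank-k topic space.
k = 2
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
```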
<sec>
<title>2.3.1.3. FLSA</title>
<p>Similar to LSI, fuzzy latent semantic analysis (FLSA) starts with matrix <bold>A</bold> and uses singular value decomposition for dimensionality reduction (Karami et al., <xref ref-type="bibr" rid="B20">2018</xref>). FLSA hypothesizes that singular value decomposition projects words into a lower-dimensional space in a meaningful way, such that semantically related words are located near each other. FLSA takes the <bold>U</bold> matrix from singular value decomposition (documents &#x000D7; number of singular values), then performs fuzzy c-means clustering (Bezdek, <xref ref-type="bibr" rid="B3">2013</xref>) to find the different topics, and lastly uses Bayes&#x00027; theorem and linear algebra to obtain the two output matrices.</p></sec>
<sec>
<title>2.3.1.4. FLSA-W</title>
<p>Since FLSA works with the <bold>U</bold> matrix, which contains a row for each document, its clustering is based on documents. Yet, topics are distributions over words, so clustering words makes more sense. Therefore, FLSA-W clusters on the <bold>V</bold> matrix instead of <bold>U</bold>, thereby clustering words directly.</p></sec>
<sec>
<title>2.3.1.5. FLSA-V</title>
<p>While FLSA and FLSA-W implicitly assume that the projection to a lower-dimensional space occurs in a meaningful way, there is no explicit step guaranteeing it. FLSA-V uses a projection method similar to multi-dimensional scaling (Borg and Groenen, <xref ref-type="bibr" rid="B6">2005</xref>) for embedding the words into a lower-dimensional manifold such that similar words (based on co-occurrence) are placed close together on the manifold (Van Eck and Waltman, <xref ref-type="bibr" rid="B41">2010</xref>). Then, the algorithm performs steps similar to FLSA-W to find the topics. We note that the projection step is very memory-intensive; the implementation we need (VOSviewer software, Van Eck and Waltman, <xref ref-type="bibr" rid="B41">2010</xref>) ran into memory issues with large corpora and required heavy pruning to perform its mapping.</p></sec></sec>
<sec>
<title>2.3.2. Dirichlet-Based Models</title>
<sec>
<title>2.3.2.1. LDA</title>
<p>Underlying the dimensionality reduction methods above is the &#x0201C;BOW assumption,&#x0201D; which states that the order of words in a document can be neglected. The irrelevance of order also holds for documents, as it does not matter in what order documents occur in a corpus for a topic model to be trained. De Finetti&#x00027;s representation theorem (De Finetti, <xref ref-type="bibr" rid="B10">2017</xref>) establishes that any collection of exchangeable random variables has a representation as a mixture distribution. Thus, to consider exchangeable representations for words and documents, mixture models that capture the exchangeability of both should be used. This line of thought paves the way to Latent Dirichlet Allocation (LDA) (Blei et al., <xref ref-type="bibr" rid="B5">2003</xref>), the best-known topic modeling algorithm, on which multiple other topic models are based. LDA posits that each document can be seen as a probability distribution over topics and that each topic can be seen as a probability distribution over words. From a Dirichlet distribution, which is a multivariate generalization of the beta distribution, a random sample is drawn to represent the topic distribution. Then, a random sample is selected from another Dirichlet distribution to represent the word distribution.</p></sec>
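<p>LDA&#x00027;s generative story can be sketched with standard-library sampling only; the vocabulary, document length, and hyperparameters below are hypothetical:</p>

```python
import random

random.seed(0)

def dirichlet(alpha):
    """Draw from a Dirichlet distribution via gamma sampling."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

# Hypothetical vocabulary and sizes; LDA's generative story for one document:
vocab = ["dose", "mg", "calm", "agitated"]
num_topics = 2

# 1. Draw the document's distribution over topics.
theta = dirichlet([0.5] * num_topics)
# 2. Draw each topic's distribution over words.
phi = [dirichlet([0.5] * len(vocab)) for _ in range(num_topics)]
# 3. For each word position: pick a topic, then a word from that topic.
document = []
for _ in range(10):
    topic = random.choices(range(num_topics), weights=theta)[0]
    word = random.choices(vocab, weights=phi[topic])[0]
    document.append(word)
```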
<sec>
<title>2.3.2.2. ProdLDA and NeuralLDA</title>
<p>Although the posterior distribution is intractable for exact inference, many approximate inference algorithms can be considered for LDA. Popular methods are mean-field methods and collapsed Gibbs sampling (Porteous et al., <xref ref-type="bibr" rid="B35">2008</xref>). However, both of these methods require a rederivation of the inference method when applied to new topic models, which can be time-consuming. This drawback has been the basis for black-box inference methods, which require only very limited and easy-to-compute information from the model and can be applied automatically to new models (Srivastava and Sutton, <xref ref-type="bibr" rid="B39">2017</xref>). Autoencoding variational Bayes (AEVB) is a natural choice for topic models, as it trains an inference network (Dayan et al., <xref ref-type="bibr" rid="B9">1995</xref>): a neural network that directly maps the BOW representation of a document onto a continuous latent representation. A decoder network then reconstructs the BOW by generating its words from the latent document representation (Kingma and Welling, <xref ref-type="bibr" rid="B22">2014</xref>). ProdLDA and NeuralLDA are the first topic modeling algorithms that use AEVB inference methods. In ProdLDA, the distribution over individual words is a product of experts (it models a probability distribution by combining the output from several simpler distributions) rather than the mixture model used in NeuralLDA.</p></sec>
<sec>
<title>2.3.2.3. ETM</title>
<p>Another problem with LDA is dealing with large vocabularies. To fit good topic models, practitioners must severely prune their vocabularies, typically done by removing the most and least frequent words. To this end, the embedded topic model (ETM) is proposed (Dieng et al., <xref ref-type="bibr" rid="B12">2020</xref>). ETM is a generative model of documents that combines traditional topic models with word embeddings. The ETM models each word with a categorical distribution whose natural parameter is the inner product between the word&#x00027;s embedding and an embedding of its assigned topic.</p></sec>
<sec>
<title>2.3.2.4. CTM</title>
<p>The topic models described above should all be trained on monolingual datasets. However, many data sets (e.g., reviews, forums, news, etc.) exist in multiple languages in parallel. They cover similar content, but the linguistic differences make it impossible to use traditional, BOW-based topic models. Models have to be either monolingual or suffer from a vast but highly sparse vocabulary. Both issues can be addressed with transfer learning. The cross-lingual contextualized topic model (CTM), a zero-shot cross-lingual topic model, learns topics in one language and predicts them for unseen documents in different languages. CTM extends ProdLDA and is trained with input document representations that account for word order and contextual information, overcoming one of the main limitations of BOW models (Bianchi et al., <xref ref-type="bibr" rid="B4">2020</xref>).</p></sec>
<sec>
<title>2.3.2.5. HDP</title>
<p>A different topic modeling algorithm based on the Dirichlet distribution is the Hierarchical Dirichlet Process (HDP), which is a Bayesian non-parametric model that can be used to model mixed-membership data with a potentially infinite number of components. In contrast to all the algorithms discussed in this section, HDP is the only algorithm that determines the number of topics itself (rather than being set by the user). Given a document collection, posterior inference is used to determine the number of topics needed and to characterize their distributions (Wang et al., <xref ref-type="bibr" rid="B44">2011</xref>).</p></sec></sec></sec></sec>
<sec id="s3">
<title>3. Study Design</title>
<p>In this section, we provide the details of our comparative study. We first describe the dataset that we have used, followed by the training of the topic models. Then, we explain the classifier we used. Finally, we provide details of our comparison and evaluation methodology.</p>
<sec>
<title>3.1. Data</title>
<p>The data for this research consists of clinical notes, written in Dutch by nurses and physicians in the University Medical Center (UMC) Utrecht&#x00027;s psychiatry ward between 2012-08-01 and 2020-03-01, as used in previous studies (Mosteiro et al., <xref ref-type="bibr" rid="B30">2020</xref>, <xref ref-type="bibr" rid="B31">2021</xref>; Rijcken et al., <xref ref-type="bibr" rid="B36">2021</xref>). The 834,834 available notes are de-identified for patient privacy using DEDUCE (Menger et al., <xref ref-type="bibr" rid="B27">2018b</xref>). Since the goal of the topic models is to increase the understanding of the decisions made by the subsequent text classification algorithm, we maintain the same structure as in previous studies. Each patient can be admitted to the psychiatry ward multiple times. In addition, an admitted patient can spend time in various sub-departments of psychiatry. The time a patient spends in each sub-department is called an admission period. In the data set, each admission period is a data point. For each admission period, all notes collected between 28 days before and 1 day after the start of the admission period are concatenated and considered as a single period note. We preprocess the text by converting it to lowercase and removing all accents, stop words, and single characters. Admission periods having fewer than 101 words are discarded, similar to previous work (Menger et al., <xref ref-type="bibr" rid="B26">2018a</xref>, <xref ref-type="bibr" rid="B28">2019</xref>; Van Le et al., <xref ref-type="bibr" rid="B42">2018</xref>). This results in 4,280 admission periods with an average length of 1,481 words. The dataset is highly imbalanced: amongst the 4,280 admission periods, 425 are associated with violent patients and 3,855 with non-violent patients.</p></sec>
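<p>The preprocessing described above can be sketched with the standard library; the stop-word list below is a tiny illustrative stand-in, not the list used in the study:</p>

```python
import unicodedata

# Sketch of the preprocessing: lowercase, strip accents, drop stop words
# and single characters, and keep only admission periods of at least 101
# words. STOP_WORDS is a tiny illustrative stand-in (Dutch examples).
STOP_WORDS = {"de", "het", "een", "en", "van"}

def strip_accents(text):
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def preprocess(note):
    tokens = strip_accents(note.lower()).split()
    # Remove stop words and single characters.
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 1]

def keep_admission_period(tokens, minimum=101):
    """Discard admission periods with fewer than `minimum` words."""
    return len(tokens) >= minimum

tokens = preprocess("De pati\u00ebnt was rustig en co\u00f6peratief")
```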
<sec>
<title>3.2. Training Topic Models</title>
<p>For the comparison of topic models, we have used the OCTIS Python package (Terragni et al., <xref ref-type="bibr" rid="B40">2021</xref>) and FuzzyTM<xref ref-type="fn" rid="fn0002"><sup>2</sup></xref>. In total, we train and compare 17 different algorithms: LDA, NeuralLDA, ProdLDA, NMF, CTM, ETM, LSI, HDP, and three variations each of FLSA, FLSA-W, and FLSA-V. The three variations of the FLSA-based algorithms differ in the fuzzy clustering algorithm used. We apply fuzzy c-means clustering (Bezdek, <xref ref-type="bibr" rid="B3">2013</xref>), FST-PSO clustering (Nobile et al., <xref ref-type="bibr" rid="B32">2018</xref>; Fuchs et al., <xref ref-type="bibr" rid="B16">2019</xref>), and Gustafson-Kessel clustering (Gustafson and Kessel, <xref ref-type="bibr" rid="B18">1979</xref>). Since the number of topics can influence a topic model&#x00027;s coherence significantly (Rijcken et al., <xref ref-type="bibr" rid="B36">2021</xref>), we train all the topic models with five to 100 topics in steps of five. Since HDP automatically selects the number of topics, we did not include it in this grid search. To account for randomness in topic model initialization, we run each combination of topic model and number of topics ten times. Consequently, we trained a total of 3,210 topic models (16 algorithms &#x000D7; 20 numbers of topics &#x000D7; 10 runs, plus 10 runs of the HDP model<xref ref-type="fn" rid="fn0003"><sup>3</sup></xref>).</p></sec>
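<p>The structure of this training grid can be sketched as follows; the algorithm names are placeholders:</p>

```python
# Sketch of the training grid: 16 algorithms over 5..100 topics in steps
# of five, each repeated ten times, plus ten runs of HDP (which chooses
# its own number of topics).
topic_counts = list(range(5, 105, 5))          # 5, 10, ..., 100
algorithms = [f"algo_{i}" for i in range(16)]  # placeholder names

runs = []
for algorithm in algorithms:
    for k in topic_counts:
        for seed in range(10):                 # account for random init
            runs.append((algorithm, k, seed))

# HDP selects its own number of topics, so it only varies by seed.
runs += [("HDP", None, seed) for seed in range(10)]
total_models = len(runs)
```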
<sec>
<title>3.3. Classification Model</title>
<p>We use two different approaches to create a topic embedding for each document. For the first approach, we include all the words of a topic&#x00027;s distribution and use the vectors from <italic>P</italic>(<italic>T</italic>|<italic>D</italic>) for the classification of each document. For the second approach, we use a topic embedding based on the top-<italic>n</italic> words per topic, since topics are typically represented by their top-<italic>n</italic> words. In previous research, we found topics of 20 words to be most interpretable (Rijcken et al., <xref ref-type="bibr" rid="B36">2021</xref>); hence, we use <italic>n</italic> &#x0003D; 20 in this paper. For each topic, the probabilities associated with the words that are present in both the document and the topic&#x00027;s top-20 words are aggregated.</p>
<p>There are many machine learning methods that can be used for classification. One of the most popular and simplest models for binary prediction is logistic regression. In this paper, we use logistic regression with 10-fold cross validation as the prediction model because of its simplicity and fast training time. A visual impression of the modeling pipeline is depicted in <xref ref-type="fig" rid="F1">Figure 1</xref>.</p>
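<p>The classification step can be sketched as follows, assuming scikit-learn is available; the topic embeddings and labels below are synthetic stand-ins, not data from the study:</p>

```python
# Sketch of the classification step: logistic regression with 10-fold
# cross-validation on topic embeddings. X and y are synthetic stand-ins
# for the C-length topic vectors and violence labels of the study.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n_docs, n_topics = 200, 10
X = rng.random((n_docs, n_topics))            # stand-in P(T|D) vectors
y = (X[:, 0] + rng.normal(0, 0.1, n_docs) > 0.5).astype(int)  # toy labels

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=10)    # 10-fold cross-validation
mean_accuracy = scores.mean()
```

<p>Because logistic regression assigns one coefficient per topic, the fitted model reveals which topics drove the predictions.</p>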
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Visual representation of the modeling pipeline per algorithm.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fdata-05-846930-g0001.tif"/>
</fig></sec>
<sec>
<title>3.4. Evaluation</title>
<p>The evaluation of the topic models depends on the evaluation goals, which are operationalized through various metrics. In this paper, we evaluate the different algorithms along two criteria: the quality of the topic model itself and the prediction performance obtained by using the topic model. The ideal way to evaluate the quality of a topic model is human evaluation, and various methods have been suggested for this purpose (Chang et al., <xref ref-type="bibr" rid="B7">2009</xref>). However, human evaluation is costly in terms of effort and is not feasible for a large number of models. Since we are training and comparing 3,210 models, we use quantitative measures for comparison. In particular, we use an interpretability score and classification performance as the aspects along which our comparison is made. In this section, we define these metrics.</p>
<sec>
<title>Interpretability Score</title>
<p>Interpretability is an abstract concept that could be considered along a number of aspects. From the perspective of modeling from EHR data, our interactions with the clinicians have shown that two aspects are very important. Firstly, the words within each topic (intra-topic assessment) must be semantically related. We use the coherence score (<italic>c</italic><sub><italic>v</italic></sub>) to quantify this. Secondly, different topics should focus on different themes and be diverse; for this, we use the diversity score (inter-topic assessment). Then, we formulate the interpretability score as the product between coherence and diversity, similar to Dieng et al. (<xref ref-type="bibr" rid="B12">2020</xref>).</p>
<p>Amongst the quantitative measures for intra-topic assessment, such as perplexity or held-out likelihood, methods based on Normalized Pointwise Mutual Information (NPMI) correlate best with human interpretation (Lau et al., <xref ref-type="bibr" rid="B24">2014</xref>). One measure that incorporates NPMI is the coherence score. This score indicates how well words support each other and can be divided into four dimensions: the segmentation of words, the calculation of probabilities, the confirmation measure, and the aggregation of the topic scores. R&#x000F6;der et al. (<xref ref-type="bibr" rid="B37">2015</xref>) tested each possible combination on six open data sets and compared them to human evaluation. Based on this extensive setup, <italic>c</italic><sub><italic>v</italic></sub> was found to correlate most strongly with human evaluation amongst all the coherence scores. With <italic>c</italic><sub><italic>v</italic></sub>, the Normalized Pointwise Mutual Information (2) is calculated for all combinations of the top-<italic>n</italic> words in a topic. For the calculation of the NPMI, the probabilities are estimated using a sliding window of size 110. Then, the arithmetic mean aggregates the scores of the different topics.</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M1"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:mi>P</mml:mi><mml:mi>M</mml:mi><mml:mi>I</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mo>=</mml:mo></mml:mtd><mml:mtd><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>g</mml:mi><mml:mfrac><mml:mrow><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003F5;</mml:mi></mml:mrow><mml:mrow><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E2"><label>(2)</label><mml:math id="M2"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:mi>N</mml:mi><mml:mi>P</mml:mi><mml:mi>M</mml:mi><mml:mi>I</mml:mi><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x003B3;</mml:mi></mml:mrow></mml:msup></mml:mtd><mml:mtd><mml:mo>=</mml:mo></mml:mtd><mml:mtd><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>g</mml:mi><mml:mfrac><mml:mrow><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003F5;</mml:mi></mml:mrow><mml:mrow><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mrow><mml:mrow><mml:mo>-</mml:mo><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>g</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>P</mml:mi><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003F5;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x003B3;</mml:mi></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
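<p>Equations (1) and (2) can be computed as in the following sketch (an illustrative implementation, not the exact <italic>c</italic><sub><italic>v</italic></sub> pipeline); the co-occurrence probabilities are estimated from sliding windows over the corpus, with the window size (110 for <italic>c</italic><sub><italic>v</italic></sub>) as a parameter.</p>

```python
import math

def npmi(tokens, w1, w2, window=110, eps=1e-12):
    # Estimate P(w) and P(wi, wj) from boolean sliding windows, then apply
    # Eqs. (1)-(2) with gamma = 1. Unlike Eq. (1), eps is also added to the
    # denominator product here, to guard against words unseen in the corpus.
    n_windows = max(1, len(tokens) - window + 1)
    windows = [set(tokens[i:i + window]) for i in range(n_windows)]
    p1 = sum(w1 in w for w in windows) / n_windows
    p2 = sum(w2 in w for w in windows) / n_windows
    p12 = sum(w1 in w and w2 in w for w in windows) / n_windows
    pmi = math.log((p12 + eps) / (p1 * p2 + eps))   # Eq. (1)
    return pmi / (-math.log(p12 + eps))             # Eq. (2), in [-1, 1]
```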
<p>The coherence score ranges between zero and one, where one means perfect coherence and zero means no coherence whatsoever. Since the coherence score focuses on the support of words within topics, it only focuses on intra-topic quality and ignores inter-topic quality.</p>
<p>A measure for inter-topic quality is topic diversity (Dieng et al., <xref ref-type="bibr" rid="B12">2020</xref>), which measures the proportion of unique words among the top-<italic>n</italic> words of all topics (3). Mathematically, we calculate the topic diversity as follows. Let <italic>W</italic><sup>&#x0002A;</sup> be the set of top-<italic>n</italic> words that have been identified for <italic>C</italic> topics. Then, the diversity score <italic>D</italic> is defined as</p>
<disp-formula id="E3"><label>(3)</label><mml:math id="M3"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:mi>D</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo>|</mml:mo><mml:msup><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msup><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mi>C</mml:mi></mml:mrow></mml:mfrac><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>If the topic diversity equals one, different topics do not share any common words, whereas a value of <inline-formula><mml:math id="M4"><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mfrac></mml:math></inline-formula> indicates that all topics contain the same <italic>n</italic> words.</p></sec>
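<p>Equation (3) amounts to the short sketch below; multiplying such a diversity score by a coherence score yields the interpretability score used in this paper.</p>

```python
def topic_diversity(topic_top_words, n=20):
    # Eq. (3): the number of unique words across the top-n words of all
    # C topics, divided by n * C.
    tops = [words[:n] for words in topic_top_words]
    unique_words = set().union(*tops)
    return len(unique_words) / (n * len(tops))
```

<p>Fully disjoint topics give a diversity of one, while identical topics give 1/<italic>C</italic>.</p>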
<sec>
<title>Predictive Performance</title>
<p>To assess the predictive performance of the topic models, we use both the area under the ROC curve (AUC) (Fawcett, <xref ref-type="bibr" rid="B13">2006</xref>) and the area under the Kappa curve (AUK) (Kaymak et al., <xref ref-type="bibr" rid="B21">2012</xref>). The AUC is one of the most commonly used scalars for ranking model performance and was also used in previous work (Menger et al., <xref ref-type="bibr" rid="B26">2018a</xref>; Mosteiro et al., <xref ref-type="bibr" rid="B30">2020</xref>, <xref ref-type="bibr" rid="B31">2021</xref>). The AUC is independent of class priors, but it ignores misclassification costs. For violence risk assessment, misclassification costs may be asymmetric, since false positives are less problematic than false negatives. The AUK is based on Cohen&#x00027;s Kappa (Cohen, <xref ref-type="bibr" rid="B8">1960</xref>), which corrects a model&#x00027;s accuracy for chance agreement. The main difference between AUC and AUK is that AUK is more sensitive to class imbalance than AUC.</p></sec></sec></sec>
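<p>A sketch of the AUK computation, assuming (as in Kaymak et al., 2012) that the Kappa curve plots Cohen&#x00027;s Kappa against the false positive rate while the decision threshold is swept; this is an illustrative implementation, not the exact one used in the experiments.</p>

```python
import numpy as np

def cohens_kappa(y_true, y_pred):
    # Chance-corrected agreement (Cohen, 1960).
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    po = float(np.mean(y_true == y_pred))
    pe = (np.mean(y_true) * np.mean(y_pred)
          + (1 - np.mean(y_true)) * (1 - np.mean(y_pred)))
    return (po - pe) / (1 - pe) if pe < 1 else 0.0

def auk(y_true, scores):
    # Sweep the threshold, collect (false positive rate, kappa) points,
    # and integrate with the trapezoidal rule.
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    points = [(0.0, 0.0), (1.0, 0.0)]
    for t in np.unique(scores):
        y_pred = (scores >= t).astype(int)
        fpr = float(np.mean(y_pred[y_true == 0]))
        points.append((fpr, cohens_kappa(y_true, y_pred)))
    points.sort()
    return sum((x2 - x1) * (k1 + k2) / 2
               for (x1, k1), (x2, k2) in zip(points, points[1:]))
```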
<sec sec-type="results" id="s4">
<title>4. Results</title>
<p><xref ref-type="fig" rid="F2">Figure 2</xref> shows the performance (AUC) against the interpretability index for the trained topic models. <xref ref-type="fig" rid="F3">Figure 3</xref> shows the same, but with the predictive performance measured by the AUK, for both the top-20 words and the entire topic distribution. The subscript of each performance metric indicates the number of words considered for the prediction: <italic>20</italic> means the top 20 words only, and <italic>all</italic> means the entire topic distribution is considered. The patterns in <xref ref-type="fig" rid="F2">Figures 2</xref>, <xref ref-type="fig" rid="F3">3</xref> look similar because the Kappa curve is a nonlinear transformation of the ROC curve, but there are also differences. For example, LSI results are clearly separated from LDA results according to AUC and interpretability, but the separation is much smaller when considering AUK; AUK thus suggests that the performance of LSI and LDA models is more similar than AUC indicates.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Model interpretability vs. predictive performance per trained model, as measured by the AUC_20 (only the top 20 words per topic are considered).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fdata-05-846930-g0002.tif"/>
</fig>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Two graphs showing how a model&#x00027;s interpretability relates to its predictive performance, as measured by the AUK per trained model. <bold>(A)</bold> Shows predictions based on a topic&#x00027;s first 20 words only, while <bold>(B)</bold> takes an entire topic&#x00027;s distribution into account.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fdata-05-846930-g0003.tif"/>
</fig>
<p>Therefore, we will focus on the AUC only for the rest of this analysis. From the graphs, it can be seen that there is no correlation between a model&#x00027;s interpretability and predictive power. Also, no model outperforms the other models on both indicators for all parameter settings. When basing the predictions on the top 20 words per topic only, FLSA-W (with FCM clustering) and ProdLDA seem to perform best. ProdLDA has many instances with the highest predictive performance (and average interpretability) and a few instances with the highest interpretability (and suboptimal predictive performance). In contrast, almost all instances of FLSA-W have high predictive performance and high interpretability, but no instance has a maximum for either of the variables. It seems that FLSA-W operates at a different trade-off point between performance and interpretability than ProdLDA.</p>
<p><xref ref-type="fig" rid="F4">Figure 4</xref> shows for each model the effect of the number of topics on interpretability. Each data point in this graph is the average of ten models trained with that setting. For clarity, we only show each FLSA model with its best clustering method, i.e., the clustering method that scored highest on interpretability for the largest number of topic settings. To allow for comparison, we keep these settings for the following graphs. Except for the lowest number of topics, FLSA-W (with FCM clustering) scores best on interpretability for all numbers of topics. Since the interpretability consists of both the coherence and the diversity score, both give relevant insights. <xref ref-type="fig" rid="F5">Figure 5</xref> shows the graphs for these variables. It can be seen that LDA scores the highest on topic coherence and the second-lowest on diversity. This means that the words in the topics support each other, but that many topics share most of their words. FLSA-W has an almost perfect diversity score for all numbers of topics, and its coherence score is average. ProdLDA&#x00027;s coherence score is slightly higher than LDA&#x00027;s, but its diversity is much lower and decreases as the number of topics increases.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>The effect of the number of topics on interpretability&#x02014;each data point is a mean score based on 10 runs.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fdata-05-846930-g0004.tif"/>
</fig>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>The effect of the number of topics on coherence, based on the top 20 words <bold>(A)</bold>, and on diversity <bold>(B)</bold>&#x02014;each data point is a mean score based on 10 runs.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fdata-05-846930-g0005.tif"/>
</fig>
<p><xref ref-type="fig" rid="F6">Figure 6</xref> shows the effect of different numbers of topics on the predictive performance. Note that for the FLSA-based models, only the clustering methods with the highest interpretability scores are shown. Since the interpretability and predictive performances do not correlate, these models do not necessarily have the highest predictive performance. The graph shows that some models&#x00027; predictive performance is better when only the top-20 words are considered (CTM, FLSA-W, FLSA-V, ProdLDA, and NeuralLDA), whereas other models perform better when the entire topic distribution is considered (LDA, LSI, NMF). ProdLDA has the highest predictive performance based on the top 20 words only, whereas LSI has the highest predictive power based on the entire topic distribution.</p>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>The effect of number of topics on AUC. <bold>(A)</bold> Shows predictions based on 20 words only and <bold>(B)</bold> shows predictions based on the entire topic distribution&#x02014;each data point is a mean score based on 10 runs.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fdata-05-846930-g0006.tif"/>
</fig></sec>
<sec sec-type="discussion" id="s5">
<title>5. Discussion</title>
<p>We study the behavior of different topic modeling algorithms based on their interpretability and predictive performance. Earlier text classification approaches with topic embeddings used LDA. However, <xref ref-type="fig" rid="F2">Figure 2</xref> shows that LDA has the lowest interpretability and predictive performance amongst all topic models. Although LDA has high coherence scores, many topics contain the same words, and therefore its interpretability is low. Our results show that selecting a topic model for text classification is not straightforward, as no model outperforms the other models in both interpretability and predictive performance. If interpretability is to be maximized, FLSA-W is the preferred model based on the interpretability index, whereas ProdLDA or LSI is preferred for maximal predictive performance.</p>
<p>Whether ProdLDA or LSI is preferred depends on the number of words per topic on which the predictions are based. We argue that, for the sake of interpretability, predictions based on a topic&#x00027;s top-<italic>n</italic> (20 in our case) words are preferable to those based on the entire distribution, on two counts. Firstly, we use topic embeddings for interpretable text classification, and it is more intuitive to interpret a set of <italic>n</italic> words than complete distributions in which all distributions contain the same words but the probabilities per word vary. Secondly, no meaningful coherence score can be calculated on a full distribution, as coherence considers the words of a topic and not their probabilities; if the full topic distribution were taken, the coherence score would be the same for all topic models. Note that the best LSI model, based on all words, performs almost on par (with an AUC slightly below 0.8) with the best predictive performance in earlier work (Mosteiro et al., <xref ref-type="bibr" rid="B31">2021</xref>); hence, we recommend considering topic embeddings for future text classification approaches.</p>
<p>Surprisingly, <xref ref-type="fig" rid="F2">Figure 2</xref>, which is based on all models, shows no correlation between model interpretability and predictive performance. <xref ref-type="fig" rid="F7">Figure 7</xref> shows the same data as <xref ref-type="fig" rid="F2">Figure 2</xref>, but zoomed in on FLSA_W_FCM and FLSA_W_GK. We observe a positive correlation between the two variables for these two models. In contrast, ETM, NMF, LSI, LDA, FLSA_W_fst_pso, FLSA_V_fst_pso, and FLSA_V_fcm show a slightly negative correlation between interpretability and predictive performance. The lack of correlation between these two variables raises the question of what information the classifier uses for its decisions. If the words in a topic do not support each other, the topic is considered noisy. Yet, the predictive performance of the models is reasonably high, implying there might be some tension between topic coherence and the predictive ability of the models.</p>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption><p>Interpretability vs. AUC_20, zoomed in on FLSA_W_FCM and FLSA_W_GK, shows a positive correlation between interpretability and predictive performance&#x02014;each data point represents a trained model.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fdata-05-846930-g0007.tif"/>
</fig>
<p>A limitation of our work is that we work with a private, specific, and imbalanced dataset with relatively long texts. Therefore, it is unknown whether our results extend to other datasets. Another limitation is that we formulate interpretability as the product of a topic model&#x00027;s coherence and diversity, following recent work (Dieng et al., <xref ref-type="bibr" rid="B12">2020</xref>). However, coherence does not always correlate with human interpretation (Rijcken et al., <xref ref-type="bibr" rid="B36">2021</xref>). Furthermore, human evaluations could shed more light on topic interpretability, but this is infeasible in our current setup due to the large number of topic models that we have trained (3,210). Lastly, interpretability is a complex concept that cannot be reduced to a single number, but a single metric can serve as a proxy for topic comparison at large scale.</p></sec>
<sec sec-type="conclusions" id="s6">
<title>6. Conclusion</title>
<p>There are many applications of text classification based on electronic health records in the clinical domain. For these tasks, classification interpretability is imperative. Using topic modeling algorithms as topic embeddings for text classification might make a model more explainable. Therefore, this work studies both the topics&#x00027; interpretability and the predictive performance for interpretable text classification. Comparing all models, we have not found a model that outperforms the others on both interpretability and predictive performance. Based on our findings, FLSA-W (with FCM clustering) seems to be the best model for interpretability, whereas ProdLDA seems the best choice for predictive performance. However, this finding is based on one dataset only, and future work should assess its generalizability to other datasets. We found no correlation between a model&#x00027;s interpretability and predictive performance. More specifically, we observed that some topic models&#x00027; predictive performance correlates positively with interpretability, whereas others show an inverse correlation. Also, we found that some models&#x00027; predictions are better when the entire topic distribution is used, whereas others score better with the top 20 words only. This work demonstrates that selecting a topic modeling algorithm for text classification is not straightforward and requires careful consideration. Future work will investigate on which information classifiers base their decisions. This insight could explain why different models&#x00027; predictive performance correlates differently with interpretability, and why some models&#x00027; predictions are better when based on the entire distribution while others are better when based on the top-<italic>n</italic> words only. Since we assess topic interpretability quantitatively, future work should also focus on assessing it qualitatively. If topics are found to be interpretable based on the qualitative assessment, the decisions made by the text classification algorithm become more interpretable. With interpretable algorithms, we are one step closer to implementing automated methods for violence risk assessment and other text classification tasks in the mental health setting.</p></sec>
<sec sec-type="data-availability" id="s7">
<title>Data Availability Statement</title>
<p>The datasets presented in this article are not readily available because this is a dataset with pseudonymized clinical notes from the EHR. Requests to access the datasets should be directed to <email>e.f.g.rijcken&#x00040;tue.nl</email>.</p></sec>
<sec id="s8">
<title>Author Contributions</title>
<p>ER and UK planned the experiment. The experiment was carried out by ER. PM provided support with the coding. The manuscript was written by ER, with support from UK, FS, MS, and KZ. The dataset was provided by FS. All authors contributed to the article and approved the submitted version.</p></sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p></sec>
<sec sec-type="disclaimer" id="s9">
<title>Publisher&#x00027;s Note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p></sec>
</body>
<back>
<ack><p>We acknowledge the COmputing VIsits DAta (COVIDA) funding provided by the strategic alliance of TU/e, WUR, UU, and UMC Utrecht.</p>
</ack>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ahmad</surname> <given-names>M. A.</given-names></name> <name><surname>Eckert</surname> <given-names>C.</given-names></name> <name><surname>Teredesai</surname> <given-names>A.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Interpretable machine learning in healthcare,&#x0201D;</article-title> in <source>Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics</source> (<publisher-loc>New York, NY</publisher-loc>), <fpage>559</fpage>&#x02013;<lpage>560</lpage>. <pub-id pub-id-type="doi">10.1145/3233547.3233667</pub-id></citation>
</ref>
<ref id="B2">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Alonso</surname> <given-names>J. M.</given-names></name> <name><surname>Castiello</surname> <given-names>C.</given-names></name> <name><surname>Mencar</surname> <given-names>C.</given-names></name></person-group> (<year>2015</year>). <source>Interpretability of Fuzzy Systems: Current Research Trends and Prospects</source>. <publisher-loc>Berlin</publisher-loc>: <publisher-name>Springer</publisher-name>. <pub-id pub-id-type="doi">10.1007/978-3-662-43505-2_14</pub-id></citation>
</ref>
<ref id="B3">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bezdek</surname> <given-names>J. C.</given-names></name></person-group> (<year>2013</year>). <source>Pattern Recognition with Fuzzy Objective Function Algorithms</source>. <publisher-loc>Berlin</publisher-loc>: <publisher-name>Springer Science &#x00026; Business Media</publisher-name>.</citation>
</ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bianchi</surname> <given-names>F.</given-names></name> <name><surname>Terragni</surname> <given-names>S.</given-names></name> <name><surname>Hovy</surname> <given-names>D.</given-names></name> <name><surname>Nozza</surname> <given-names>D.</given-names></name> <name><surname>Fersini</surname> <given-names>E.</given-names></name></person-group> (<year>2020</year>). <article-title>Cross-lingual contextualized topic models with zero-shot learning</article-title>. <source>arXiv preprint arXiv:2004.07737</source>. <pub-id pub-id-type="doi">10.18653/v1/2021.eacl-main.143</pub-id></citation>
</ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Blei</surname> <given-names>D. M.</given-names></name> <name><surname>Ng</surname> <given-names>A. Y.</given-names></name> <name><surname>Jordan</surname> <given-names>M. I.</given-names></name></person-group> (<year>2003</year>). <article-title>Latent Dirichlet Allocation</article-title>. <source>J. Mach. Learn. Res</source>. <volume>3</volume>, <fpage>993</fpage>&#x02013;<lpage>1022</lpage>. <pub-id pub-id-type="doi">10.5555/944919.944937</pub-id></citation>
</ref>
<ref id="B6">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Borg</surname> <given-names>I.</given-names></name> <name><surname>Groenen</surname> <given-names>P. J.</given-names></name></person-group> (<year>2005</year>). <source>Modern Multidimensional Scaling: Theory and Applications</source>. <publisher-loc>Berlin</publisher-loc>: <publisher-name>Springer Science &#x00026; Business Media</publisher-name>.</citation>
</ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chang</surname> <given-names>J.</given-names></name> <name><surname>Gerrish</surname> <given-names>S.</given-names></name> <name><surname>Wang</surname> <given-names>C.</given-names></name> <name><surname>Boyd-Graber</surname> <given-names>J.</given-names></name> <name><surname>Blei</surname> <given-names>D.</given-names></name></person-group> (<year>2009</year>). <article-title>&#x0201C;Reading tea leaves: how humans interpret topic models,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems, Vol. 22</source>, <fpage>288</fpage>&#x02013;<lpage>296</lpage>.</citation>
</ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cohen</surname> <given-names>J.</given-names></name></person-group> (<year>1960</year>). <article-title>A coefficient of agreement for nominal scales</article-title>. <source>Educ. Psychol. Meas</source>. <volume>20</volume>, <fpage>37</fpage>&#x02013;<lpage>46</lpage>. <pub-id pub-id-type="doi">10.1177/001316446002000104</pub-id></citation>
</ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dayan</surname> <given-names>P.</given-names></name> <name><surname>Hinton</surname> <given-names>G. E.</given-names></name> <name><surname>Neal</surname> <given-names>R. M.</given-names></name> <name><surname>Zemel</surname> <given-names>R. S.</given-names></name></person-group> (<year>1995</year>). <article-title>The Helmholtz machine</article-title>. <source>Neural Comput</source>. <volume>7</volume>, <fpage>889</fpage>&#x02013;<lpage>904</lpage>. <pub-id pub-id-type="doi">10.1162/neco.1995.7.5.889</pub-id><pub-id pub-id-type="pmid">7584891</pub-id></citation></ref>
<ref id="B10">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>De Finetti</surname> <given-names>B.</given-names></name></person-group> (<year>2017</year>). <source>Theory of Probability: A Critical Introductory Treatment, Vol. 6</source>. <publisher-loc>Hoboken, NJ</publisher-loc>: <publisher-name>John Wiley &#x00026; Sons</publisher-name>. <pub-id pub-id-type="doi">10.1002/9781119286387</pub-id></citation>
</ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Devlin</surname> <given-names>J.</given-names></name> <name><surname>Chang</surname> <given-names>M.-W.</given-names></name> <name><surname>Lee</surname> <given-names>K.</given-names></name> <name><surname>Toutanova</surname> <given-names>K.</given-names></name></person-group> (<year>2018</year>). <article-title>Bert: pre-training of deep bidirectional transformers for language understanding</article-title>. <source>arXiv preprint arXiv:1810.04805</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1810.04805</pub-id></citation>
</ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dieng</surname> <given-names>A. B.</given-names></name> <name><surname>Ruiz</surname> <given-names>F. J.</given-names></name> <name><surname>Blei</surname> <given-names>D. M.</given-names></name></person-group> (<year>2020</year>). <article-title>Topic modeling in embedding spaces</article-title>. <source>Trans. Assoc. Comput. Linguist</source>. <volume>8</volume>, <fpage>439</fpage>&#x02013;<lpage>453</lpage>. <pub-id pub-id-type="doi">10.1162/tacl_a_00325</pub-id><pub-id pub-id-type="pmid">31443000</pub-id></citation></ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fawcett</surname> <given-names>T.</given-names></name></person-group> (<year>2006</year>). <article-title>An introduction to ROC analysis</article-title>. <source>Pattern Recogn. Lett</source>. <volume>27</volume>, <fpage>861</fpage>&#x02013;<lpage>874</lpage>. <pub-id pub-id-type="doi">10.1016/j.patrec.2005.10.010</pub-id></citation>
</ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>F&#x000E9;votte</surname> <given-names>C.</given-names></name> <name><surname>Idier</surname> <given-names>J.</given-names></name></person-group> (<year>2011</year>). <article-title>Algorithms for nonnegative matrix factorization with the &#x003B2;-divergence</article-title>. <source>Neural Comput</source>. <volume>23</volume>, <fpage>2421</fpage>&#x02013;<lpage>2456</lpage>. <pub-id pub-id-type="doi">10.1162/NECO_a_00168</pub-id></citation>
</ref>
<ref id="B15">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Flach</surname> <given-names>P.</given-names></name></person-group> (<year>2012</year>). <source>Machine Learning: The Art and Science of Algorithms That Make Sense of Data</source>. <publisher-loc>Cambridge</publisher-loc>: <publisher-name>Cambridge University Press</publisher-name>. <pub-id pub-id-type="doi">10.1017/CBO9780511973000</pub-id></citation>
</ref>
<ref id="B16">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Fuchs</surname> <given-names>C.</given-names></name> <name><surname>Spolaor</surname> <given-names>S.</given-names></name> <name><surname>Nobile</surname> <given-names>M. S.</given-names></name> <name><surname>Kaymak</surname> <given-names>U.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;A swarm intelligence approach to avoid local optima in fuzzy c-means clustering,&#x0201D;</article-title> in <source>2019 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE)</source> (<publisher-loc>Piscataway, NJ</publisher-loc>), <fpage>1</fpage>&#x02013;<lpage>6</lpage>. <pub-id pub-id-type="doi">10.1109/FUZZ-IEEE.2019.8858940</pub-id></citation></ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Guillaume</surname> <given-names>S.</given-names></name></person-group> (<year>2001</year>). <article-title>Designing fuzzy inference systems from data: an interpretability-oriented review</article-title>. <source>IEEE Trans. Fuzzy Syst</source>. <volume>9</volume>, <fpage>426</fpage>&#x02013;<lpage>443</lpage>. <pub-id pub-id-type="doi">10.1109/91.928739</pub-id></citation></ref>
<ref id="B18">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Gustafson</surname> <given-names>D. E.</given-names></name> <name><surname>Kessel</surname> <given-names>W. C.</given-names></name></person-group> (<year>1979</year>). <article-title>&#x0201C;Fuzzy clustering with a fuzzy covariance matrix,&#x0201D;</article-title> in <source>1978 IEEE Conference on Decision and Control Including the 17th Symposium on Adaptive Processes</source> (<publisher-loc>Piscataway, NJ</publisher-loc>), <fpage>761</fpage>&#x02013;<lpage>766</lpage>. <pub-id pub-id-type="doi">10.1109/CDC.1978.268028</pub-id></citation>
</ref>
<ref id="B19">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Jurafsky</surname> <given-names>D.</given-names></name> <name><surname>Martin</surname> <given-names>J. H.</given-names></name></person-group> (<year>2009</year>). <source>Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition</source>. <publisher-loc>Hoboken, NJ</publisher-loc>: <publisher-name>Pearson/Prentice Hall</publisher-name>.</citation>
</ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Karami</surname> <given-names>A.</given-names></name> <name><surname>Gangopadhyay</surname> <given-names>A.</given-names></name> <name><surname>Zhou</surname> <given-names>B.</given-names></name> <name><surname>Kharrazi</surname> <given-names>H.</given-names></name></person-group> (<year>2018</year>). <article-title>Fuzzy approach topic discovery in health and medical corpora</article-title>. <source>Int. J. Fuzzy Syst</source>. <volume>20</volume>, <fpage>1334</fpage>&#x02013;<lpage>1345</lpage>. <pub-id pub-id-type="doi">10.1007/s40815-017-0327-9</pub-id></citation>
</ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kaymak</surname> <given-names>U.</given-names></name> <name><surname>Ben-David</surname> <given-names>A.</given-names></name> <name><surname>Potharst</surname> <given-names>R.</given-names></name></person-group> (<year>2012</year>). <article-title>The AUK: a simple alternative to the AUC</article-title>. <source>Eng. Appl. Artif. Intell</source>. <volume>25</volume>, <fpage>1082</fpage>&#x02013;<lpage>1089</lpage>. <pub-id pub-id-type="doi">10.1016/j.engappai.2012.02.012</pub-id></citation>
</ref>
<ref id="B22">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kingma</surname> <given-names>D.</given-names></name> <name><surname>Welling</surname> <given-names>M.</given-names></name></person-group> (<year>2014</year>). <article-title>&#x0201C;Auto-encoding variational Bayes,&#x0201D;</article-title> in <source>The International Conference on Learning Representations</source> (<publisher-loc>La Jolla, CA</publisher-loc>).</citation></ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Landauer</surname> <given-names>T. K.</given-names></name> <name><surname>Foltz</surname> <given-names>P. W.</given-names></name> <name><surname>Laham</surname> <given-names>D.</given-names></name></person-group> (<year>1998</year>). <article-title>An introduction to latent semantic analysis</article-title>. <source>Discour. Process</source>. <volume>25</volume>, <fpage>259</fpage>&#x02013;<lpage>284</lpage>. <pub-id pub-id-type="doi">10.1080/01638539809545028</pub-id></citation>
</ref>
<ref id="B24">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lau</surname> <given-names>J. H.</given-names></name> <name><surname>Newman</surname> <given-names>D.</given-names></name> <name><surname>Baldwin</surname> <given-names>T.</given-names></name></person-group> (<year>2014</year>). <article-title>&#x0201C;Machine reading tea leaves: automatically evaluating topic coherence and topic model quality,&#x0201D;</article-title> in <source>Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics</source> (<publisher-loc>Stroudsburg, PA</publisher-loc>), <fpage>530</fpage>&#x02013;<lpage>539</lpage>.</citation>
</ref>
<ref id="B25">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Le</surname> <given-names>Q.</given-names></name> <name><surname>Mikolov</surname> <given-names>T.</given-names></name></person-group> (<year>2014</year>). <article-title>&#x0201C;Distributed representations of sentences and documents,&#x0201D;</article-title> in <source>International Conference on Machine Learning</source> (<publisher-loc>San Diego, CA</publisher-loc>), <fpage>1188</fpage>&#x02013;<lpage>1196</lpage>.</citation>
</ref>
<ref id="B26">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Menger</surname> <given-names>V.</given-names></name> <name><surname>Scheepers</surname> <given-names>F.</given-names></name> <name><surname>Spruit</surname> <given-names>M.</given-names></name></person-group> (<year>2018a</year>). <article-title>Comparing deep learning and classical machine learning approaches for predicting inpatient violence incidents from clinical text</article-title>. <source>Appl. Sci</source>. <volume>8</volume>:<fpage>981</fpage>. <pub-id pub-id-type="doi">10.3390/app8060981</pub-id></citation>
</ref>
<ref id="B27">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Menger</surname> <given-names>V.</given-names></name> <name><surname>Scheepers</surname> <given-names>F.</given-names></name> <name><surname>van Wijk</surname> <given-names>L. M.</given-names></name> <name><surname>Spruit</surname> <given-names>M.</given-names></name></person-group> (<year>2018b</year>). <article-title>Deduce: a pattern matching method for automatic de-identification of Dutch medical text</article-title>. <source>Telem. Inform</source>. <volume>35</volume>, <fpage>727</fpage>&#x02013;<lpage>736</lpage>. <pub-id pub-id-type="doi">10.1016/j.tele.2017.08.002</pub-id></citation>
</ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Menger</surname> <given-names>V.</given-names></name> <name><surname>Spruit</surname> <given-names>M.</given-names></name> <name><surname>Van Est</surname> <given-names>R.</given-names></name> <name><surname>Nap</surname> <given-names>E.</given-names></name> <name><surname>Scheepers</surname> <given-names>F.</given-names></name></person-group> (<year>2019</year>). <article-title>Machine learning approach to inpatient violence risk assessment using routinely collected clinical notes in electronic health records</article-title>. <source>JAMA Netw. Open</source> <volume>2</volume>:<fpage>e196709</fpage>. <pub-id pub-id-type="doi">10.1001/jamanetworkopen.2019.6709</pub-id><pub-id pub-id-type="pmid">31268542</pub-id></citation></ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mikolov</surname> <given-names>T.</given-names></name> <name><surname>Chen</surname> <given-names>K.</given-names></name> <name><surname>Corrado</surname> <given-names>G.</given-names></name> <name><surname>Dean</surname> <given-names>J.</given-names></name></person-group> (<year>2013</year>). <article-title>Efficient estimation of word representations in vector space</article-title>. <source>arXiv preprint arXiv:1301.3781</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1301.3781</pub-id></citation>
</ref>
<ref id="B30">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Mosteiro</surname> <given-names>P.</given-names></name> <name><surname>Rijcken</surname> <given-names>E.</given-names></name> <name><surname>Zervanou</surname> <given-names>K.</given-names></name> <name><surname>Kaymak</surname> <given-names>U.</given-names></name> <name><surname>Scheepers</surname> <given-names>F.</given-names></name> <name><surname>Spruit</surname> <given-names>M.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;Making sense of violence risk predictions using clinical notes,&#x0201D;</article-title> in <source>International Conference on Health Information Science</source> (<publisher-loc>Berlin</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>3</fpage>&#x02013;<lpage>14</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-030-61951-0_1</pub-id></citation>
</ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mosteiro</surname> <given-names>P.</given-names></name> <name><surname>Rijcken</surname> <given-names>E.</given-names></name> <name><surname>Zervanou</surname> <given-names>K.</given-names></name> <name><surname>Kaymak</surname> <given-names>U.</given-names></name> <name><surname>Scheepers</surname> <given-names>F.</given-names></name> <name><surname>Spruit</surname> <given-names>M.</given-names></name></person-group> (<year>2021</year>). <article-title>Machine learning for violence risk assessment using Dutch clinical notes</article-title>. <source>J. Artif. Intell. Med. Sci</source>. <volume>2</volume>, <fpage>44</fpage>&#x02013;<lpage>54</lpage>. <pub-id pub-id-type="doi">10.2991/jaims.d.210225.001</pub-id></citation>
</ref>
<ref id="B32">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nobile</surname> <given-names>M. S.</given-names></name> <name><surname>Cazzaniga</surname> <given-names>P.</given-names></name> <name><surname>Besozzi</surname> <given-names>D.</given-names></name> <name><surname>Colombo</surname> <given-names>R.</given-names></name> <name><surname>Mauri</surname> <given-names>G.</given-names></name> <name><surname>Pasi</surname> <given-names>G.</given-names></name></person-group> (<year>2018</year>). <article-title>Fuzzy self-tuning PSO: a settings-free algorithm for global optimization</article-title>. <source>Swarm Evol. Comput</source>. <volume>39</volume>, <fpage>70</fpage>&#x02013;<lpage>85</lpage>. <pub-id pub-id-type="doi">10.1016/j.swevo.2017.09.001</pub-id></citation>
</ref>
<ref id="B33">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Pennington</surname> <given-names>J.</given-names></name> <name><surname>Socher</surname> <given-names>R.</given-names></name> <name><surname>Manning</surname> <given-names>C. D.</given-names></name></person-group> (<year>2014</year>). <article-title>&#x0201C;GloVe: global vectors for word representation,&#x0201D;</article-title> in <source>Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source> (<publisher-loc>Stroudsburg, PA</publisher-loc>), <fpage>1532</fpage>&#x02013;<lpage>1543</lpage>. <pub-id pub-id-type="doi">10.3115/v1/D14-1162</pub-id></citation>
</ref>
<ref id="B34">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Peters</surname> <given-names>M. E.</given-names></name> <name><surname>Neumann</surname> <given-names>M.</given-names></name> <name><surname>Iyyer</surname> <given-names>M.</given-names></name> <name><surname>Gardner</surname> <given-names>M.</given-names></name> <name><surname>Clark</surname> <given-names>C.</given-names></name> <name><surname>Lee</surname> <given-names>K.</given-names></name> <etal/></person-group>. (<year>2018</year>). <article-title>Deep contextualized word representations</article-title>. <source>CoRR, abs/1802.05365</source>. <pub-id pub-id-type="doi">10.18653/v1/N18-1202</pub-id></citation>
</ref>
<ref id="B35">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Porteous</surname> <given-names>I.</given-names></name> <name><surname>Newman</surname> <given-names>D.</given-names></name> <name><surname>Ihler</surname> <given-names>A.</given-names></name> <name><surname>Asuncion</surname> <given-names>A.</given-names></name> <name><surname>Smyth</surname> <given-names>P.</given-names></name> <name><surname>Welling</surname> <given-names>M.</given-names></name></person-group> (<year>2008</year>). <article-title>&#x0201C;Fast collapsed Gibbs sampling for latent Dirichlet allocation,&#x0201D;</article-title> in <source>Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source> (<publisher-loc>New York, NY</publisher-loc>), <fpage>569</fpage>&#x02013;<lpage>577</lpage>.</citation>
</ref>
<ref id="B36">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Rijcken</surname> <given-names>E.</given-names></name> <name><surname>Scheepers</surname> <given-names>F.</given-names></name> <name><surname>Mosteiro</surname> <given-names>P.</given-names></name> <name><surname>Zervanou</surname> <given-names>K.</given-names></name> <name><surname>Spruit</surname> <given-names>M.</given-names></name> <name><surname>Kaymak</surname> <given-names>U.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;A comparative study of fuzzy topic models and LDA in terms of interpretability,&#x0201D;</article-title> in <source>Proceedings of the 2021 IEEE Symposium Series on Computational Intelligence (SSCI)</source> (<publisher-loc>Piscataway, NJ</publisher-loc>). <pub-id pub-id-type="doi">10.1109/SSCI50451.2021.9660139</pub-id></citation></ref>
<ref id="B37">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>R&#x000F6;der</surname> <given-names>M.</given-names></name> <name><surname>Both</surname> <given-names>A.</given-names></name> <name><surname>Hinneburg</surname> <given-names>A.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;Exploring the space of topic coherence measures,&#x0201D;</article-title> in <source>Proceedings of the Eighth ACM International Conference on Web Search and Data Mining</source> (<publisher-loc>New York, NY</publisher-loc>), <fpage>399</fpage>&#x02013;<lpage>408</lpage>. <pub-id pub-id-type="doi">10.1145/2684822.2685324</pub-id></citation>
</ref>
<ref id="B38">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rumshisky</surname> <given-names>A.</given-names></name> <name><surname>Ghassemi</surname> <given-names>M.</given-names></name> <name><surname>Naumann</surname> <given-names>T.</given-names></name> <name><surname>Szolovits</surname> <given-names>P.</given-names></name> <name><surname>Castro</surname> <given-names>V.</given-names></name> <name><surname>McCoy</surname> <given-names>T.</given-names></name> <etal/></person-group>. (<year>2016</year>). <article-title>Predicting early psychiatric readmission with natural language processing of narrative discharge summaries</article-title>. <source>Transl. Psychiatry</source> <volume>6</volume>:<fpage>e921</fpage>. <pub-id pub-id-type="doi">10.1038/tp.2015.182</pub-id><pub-id pub-id-type="pmid">27754482</pub-id></citation></ref>
<ref id="B39">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Srivastava</surname> <given-names>A.</given-names></name> <name><surname>Sutton</surname> <given-names>C.</given-names></name></person-group> (<year>2017</year>). <article-title>Autoencoding variational inference for topic models</article-title>. <source>arXiv preprint arXiv:1703.01488</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1703.01488</pub-id></citation></ref>
<ref id="B40">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Terragni</surname> <given-names>S.</given-names></name> <name><surname>Fersini</surname> <given-names>E.</given-names></name> <name><surname>Galuzzi</surname> <given-names>B. G.</given-names></name> <name><surname>Tropeano</surname> <given-names>P.</given-names></name> <name><surname>Candelieri</surname> <given-names>A.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;OCTIS: comparing and optimizing topic models is simple!,&#x0201D;</article-title> in <source>Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations</source> (<publisher-loc>Stroudsburg, PA</publisher-loc>), <fpage>263</fpage>&#x02013;<lpage>270</lpage>. <pub-id pub-id-type="doi">10.18653/v1/2021.eacl-demos.31</pub-id></citation>
</ref>
<ref id="B41">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Van Eck</surname> <given-names>N. J.</given-names></name> <name><surname>Waltman</surname> <given-names>L.</given-names></name></person-group> (<year>2010</year>). <article-title>Software survey: VOSviewer, a computer program for bibliometric mapping</article-title>. <source>Scientometrics</source> <volume>84</volume>, <fpage>523</fpage>&#x02013;<lpage>538</lpage>. <pub-id pub-id-type="doi">10.1007/s11192-009-0146-3</pub-id><pub-id pub-id-type="pmid">20585380</pub-id></citation></ref>
<ref id="B42">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Van Le</surname> <given-names>D.</given-names></name> <name><surname>Montgomery</surname> <given-names>J.</given-names></name> <name><surname>Kirkby</surname> <given-names>K. C.</given-names></name> <name><surname>Scanlan</surname> <given-names>J.</given-names></name></person-group> (<year>2018</year>). <article-title>Risk prediction using natural language processing of electronic mental health records in an inpatient forensic psychiatry setting</article-title>. <source>J. Biomed. Inform</source>. <volume>86</volume>, <fpage>49</fpage>&#x02013;<lpage>58</lpage>. <pub-id pub-id-type="doi">10.1016/j.jbi.2018.08.007</pub-id><pub-id pub-id-type="pmid">30118855</pub-id></citation></ref>
<ref id="B43">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>van Leeuwen</surname> <given-names>M. E.</given-names></name> <name><surname>Harte</surname> <given-names>J. M.</given-names></name></person-group> (<year>2017</year>). <article-title>Violence against mental health care professionals: prevalence, nature and consequences</article-title>. <source>J. Forens. Psychiatry Psychol</source>. <volume>28</volume>, <fpage>581</fpage>&#x02013;<lpage>598</lpage>. <pub-id pub-id-type="doi">10.1080/14789949.2015.1012533</pub-id></citation></ref>
<ref id="B44">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>C.</given-names></name> <name><surname>Paisley</surname> <given-names>J.</given-names></name> <name><surname>Blei</surname> <given-names>D.</given-names></name></person-group> (<year>2011</year>). <article-title>&#x0201C;Online variational inference for the hierarchical Dirichlet process,&#x0201D;</article-title> in <source>Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics</source> (<publisher-loc>New Jersey, NJ</publisher-loc>), <fpage>752</fpage>&#x02013;<lpage>760</lpage>.</citation></ref>
<ref id="B45">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>L.</given-names></name> <name><surname>Sha</surname> <given-names>L.</given-names></name> <name><surname>Lakin</surname> <given-names>J. R.</given-names></name> <name><surname>Bynum</surname> <given-names>J.</given-names></name> <name><surname>Bates</surname> <given-names>D. W.</given-names></name> <name><surname>Hong</surname> <given-names>P.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>Development and validation of a deep learning algorithm for mortality prediction in selecting patients with dementia for earlier palliative care interventions</article-title>. <source>JAMA Network Open</source> <volume>2</volume>:<fpage>e196972</fpage>. <pub-id pub-id-type="doi">10.1001/jamanetworkopen.2019.6972</pub-id><pub-id pub-id-type="pmid">31298717</pub-id></citation></ref>
</ref-list>
<fn-group>
<fn id="fn0001"><p><sup>1</sup>Also referred to as Latent Semantic Analysis (LSA).</p></fn>
<fn id="fn0002"><p><sup>2</sup><ext-link ext-link-type="uri" xlink:href="https://pypi.org/project/FuzzyTM/">https://pypi.org/project/FuzzyTM/</ext-link></p></fn>
<fn id="fn0003"><p><sup>3</sup>Each trained 10 times.</p></fn>
</fn-group>
</back>
</article>