Discovering key topics from short, real-world medical inquiries via natural language processing and unsupervised learning

Millions of unsolicited medical inquiries are received by pharmaceutical companies every year. It has been hypothesized that these inquiries represent a treasure trove of information, potentially giving insight into matters regarding medicinal products and the associated medical treatments. However, due to the large volume and specialized nature of the inquiries, it is difficult to perform timely, recurrent, and comprehensive analyses. Here, we propose a machine learning approach based on natural language processing and unsupervised learning to automatically discover key topics in real-world medical inquiries from customers. This approach requires neither ontologies nor annotations. The discovered topics are meaningful and medically relevant, as judged by medical information specialists, thus demonstrating that unsolicited medical inquiries are a source of valuable customer insights. Our work paves the way for the machine-learning-driven analysis of medical inquiries in the pharmaceutical industry, which ultimately aims at improving patient care.


I. INTRODUCTION
Every day, pharmaceutical companies receive numerous medical inquiries related to their products from patients, healthcare professionals, research institutes, or public authorities through a variety of channels (e.g. websites, e-mail, phone, social media channels, company personnel, telefax). These medical inquiries may relate to drug-drug interactions, availability of products, side effects of pharmaceuticals, clinical trial information, product quality issues, comparison with competitor products, storage conditions, dosing regimen, and the like. On the one hand, a single medical inquiry is simply a question from a given person searching for specific information related to a medicinal product. On the other hand, a plurality of medical inquiries from different persons may provide useful insight into matters related to medicinal products and associated medical treatments. Examples of these insights could be early detection of product quality or supply chain issues, anticipation of treatment trends and market events, improvement of educational material and standard answers/frequently asked question coverage, potential changes in treatment pattern, or even suggestions on new possible indications to investigate. From a strategic perspective, this information could enable organizations to make better decisions, drive organization results, and more broadly create benefits for the healthcare community. However, obtaining high-level general insights is a complicated task since pharmaceutical companies receive copious amounts of medical inquiries every year. Machine learning and natural language processing represent a promising route to automatically extract insights from these large amounts of unstructured (and noisy) medical text. Natural language processing and text mining techniques have been widely used in the medical domain 1,2, with particular emphasis on electronic health records [3][4][5][6].
In particular, deep learning has been successfully applied to medical text, with the overwhelming majority of works in supervised learning, or representation learning (in a supervised or self-supervised setting) to learn specialized word vector representations (i.e. word embeddings) [7][8][9][10][11]. Conversely, the literature on unsupervised learning for medical text is scarce despite the bulk of real-world medical text being unstructured, without any labels or annotations. Unsupervised learning from unstructured medical text is mainly limited to the development of topic models based on latent Dirichlet allocation (LDA) 12. Examples of applications in the medical domain are clinical event identification in brain cancer patients from clinical reports 13, modeling diseases 14 and predicting clinical order patterns 15 from electronic health records, or detecting cases of noncompliance to drug treatment from patient forums 16. Only recently have word embeddings and unsupervised learning techniques been combined to analyze unstructured medical text to study the concept of diseases 17, medical product reviews 18, or to extract informative sentences for text summarization 19.
In this work, we combine biomedical word embeddings and unsupervised learning to discover topics from real-world medical inquiries received by Bayer™. A real-world corpus of medical inquiries presents numerous challenges. From the perspective of an inquirer (e.g. healthcare professional or patient), the goal is often to convey the information requested in as few words as possible to save time. This leads to an extensive use of acronyms, sentences with atypical syntactic structure (occasionally missing a verb or subject), or inquiries comprising exclusively a single noun phrase. Moreover, since medical inquiries come from different sources, it is common to find additional (not relevant) information related to the text source; examples are references to internal computer systems, form frames (i.e. textual instructions) alongside the actual form content, lot numbers, email headers and signatures, and city names. The corpus contains a mixture of layman and medical language depending (mostly) on the inquirer being either a patient or a healthcare professional. The style and content of medical inquiries vary quite substantially according to the therapeutic area (e.g. cardiovascular vs oncology) to which a given medicinal product belongs.
As already mentioned, medical inquiries are short. More specifically, they comprise fewer than fifteen words in the vast majority of cases. Standard techniques for topic modelling based on LDA 12 do not apply, since the main assumption (each document/text is a distribution over topics) clearly does not hold given that the text is short 20. Approaches based on pseudo-documents 21 or using auxiliary information 22,23 are also not suitable since no meaningful pseudo-documents or auxiliary information are available for medical inquiries. Moreover, these models aim to learn semantics (e.g. meaning of words) directly from the corpus of interest. However, the recent success of pretrained embeddings 24,25 shows that it is beneficial to include semantics learned on a general (and thus orders of magnitude larger) corpus, thus providing semantic information difficult to obtain from smaller corpora. This is particularly important for limited data and short text settings. To this end, there has recently been some work aimed at incorporating word embeddings into probabilistic models that are similar to LDA (Dirichlet multinomial mixture model 26) and that, contrary to LDA, satisfy the single topic assumption (i.e. one document/text belongs to only one topic) 27,28. Even though these models include (some) semantic information in the topic model, it is not evident how to choose the required hyperparameters, for example how to determine an appropriate threshold when filtering semantically related word pairs 28. Concurrently with our work, document-level embeddings and hierarchical clustering have been combined to obtain topic vectors from news articles and a question-answer corpus 29.
Here, we propose an approach based on specialized biomedical word embeddings and unsupervised learning to discover topics from short, unstructured, real-world medical inquiries. This approach (schematically depicted in Fig. 1) is then used to discover topics in medical inquiries received by Bayer™ Medical Information regarding the oncology medicinal product Stivarga™.

II. RESULTS
A. Machine learning approach to discover topics in medical inquiries

Text representation
One of the main challenges of topic discovery in short text is sparseness: it is not possible to extract semantic information from word co-occurrences because words rarely appear together since the text is short. In our case, the sparseness problem is exacerbated by the following two aspects. First, the amount of data available is limited: most medicinal products receive fewer than 4,000 medical inquiries yearly. Second, medical inquiries are sent by patients as well as healthcare professionals (e.g. physicians, pharmacists, nurses): this leads to inquiries with widely different writing styles, containing a mixture of common and specialized medical text. The sparseness problem can be tackled by leveraging word embedding models trained on large corpora; these embeddings have been shown to learn semantic similarities directly from data, even for specialized biomedical text [7][8][9]30. Specifically, we use the scispaCy word embedding model 7, which was trained on a large corpus containing scientific abstracts from medical literature (PubMed) as well as web pages (OntoNotes 5.0 corpus 31). This assorted training corpus enables the model to treat specialized medical terminology and layman terms on the same footing, so that medical topics are discovered regardless of the writing style.
One of the main disadvantages of word vector (word2vec) models, like the scispaCy model used in this work, is their inability to handle out-of-vocabulary (oov) words: if a word appearing in the text is not included in the model vocabulary, it is effectively excluded from the analysis (i.e. a vector of all zeros is assigned to it). To tackle this issue, several models have been proposed, initially based on character n-gram embeddings (FastText 32), and more recently contextual embeddings based on character (ELMo 24) or byte pair encoding 33 representations (BERT 25). Even though other advancements (namely word polysemy handling and the use of attention 34) were arguably the decisive factors, improvements in oov word handling also contributed to making ELMo and BERT the de facto gold standard for natural language processing, at least for supervised learning tasks.
Even though the use of contextual word embeddings is generally beneficial and can be readily incorporated in our approach (simply substituting the word representation), we note that, given the large amount of noise present and the purely unsupervised setting, a word2vec model is actually advantageous for the task of extracting medical topics from real-world medical inquiries. Indeed, using a model with a limited yet comprehensive vocabulary (the scispaCy model used in this work includes 600k word vectors) constitutes a principled, data-driven, efficient, and effective way to filter relevant information from the noise present in the corpus. This filtering is principled and data-driven because the words (and vectors) included in the model vocabulary are automatically determined in the scispaCy training procedure by optimizing the performance on biomedical text benchmarks 7. This also leads to harmonization of the medical inquiry corpus by eliminating both non-relevant region-specific terms and noise introduced by machine translation (words or expressions are sometimes not translated but simply copied in the original language 35). Clearly, in this context it is of paramount importance to use specialized biomedical embeddings so that the word2vec model has a comprehensive knowledge of medical terms despite its relatively limited vocabulary. Table II A 1 presents a qualitative comparison of a standard embedding (en_core_web_lg, trained on the Common Crawl) and a specialized biomedical embedding (scispaCy en_core_sci_lg, trained also on PubMed). Specifically, for a given probe word (i.e. leukemia, carcinoma, blood), the words most semantically similar to it, as measured by the cosine similarity between word vectors, are retrieved, together with their similarity with the probe word (shown in parentheses, 1.0 being the highest possible similarity). It is evident that the biomedical embedding returns much more relevant and medically specific terms.
For instance, given the probe word leukemia, the standard embedding returns generic terms like cancer, tumor, chemotherapy which are broadly related to oncology, but not necessarily to leukemia. In contrast, the biomedical embedding returns more specialized (and medically relevant) terms like lymphoblastic, myelomonocytic, myelogenous, myeloid, promyelocytic: acute lymphoblastic, chronic myelomonocytic, chronic myelogenous, adult acute myeloid, and acute promyelocytic are all types of leukemia.
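The probe-word comparison described above amounts to ranking vocabulary words by cosine similarity to the probe vector. The following is a minimal sketch of that ranking; the three-dimensional vectors and the helper names `cosine_similarity` and `most_similar` are invented for illustration and are not the scispaCy vectors or API.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(probe, vectors, top_n=3):
    """Rank all other vocabulary words by cosine similarity to the probe word."""
    probe_vec = vectors[probe]
    scored = [
        (word, cosine_similarity(probe_vec, vec))
        for word, vec in vectors.items()
        if word != probe
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]


# Toy 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "leukemia": [0.9, 0.8, 0.1],
    "myeloid": [0.8, 0.9, 0.2],
    "cancer": [0.7, 0.3, 0.4],
    "breakfast": [0.0, 0.1, 0.9],
}
neighbors = most_similar("leukemia", toy_vectors)
```

With real embeddings, the same ranking would be run over the full model vocabulary rather than a four-word dictionary.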

Clustering similar medical inquiries via hierarchical clustering
We have shown in the previous section that word embeddings provide a natural way to include semantic information (i.e. meaning of individual words) in the modeling. Medical inquiries comprise multiple words, and therefore a semantic representation for each inquiry needs to be computed from the word-level embeddings. We accomplish this by simply averaging the embeddings of the words belonging to the inquiry, thus obtaining one vector for each inquiry. Since these vectors capture semantic information, medical inquiries bearing similar meaning are mapped to nearby vectors in the high-dimensional embedding space. To group similar inquiries, clustering is performed in this embedding space, separately for each medicinal product.
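The averaging step can be sketched as follows. This is an illustrative sketch only: `inquiry_vector` is a hypothetical helper, the two-dimensional vectors are toy values, and out-of-vocabulary tokens are skipped here rather than contributing a zero vector (an assumption that changes the length of the average but not its direction).

```python
def inquiry_vector(tokens, word_vectors, dim):
    """Average the vectors of in-vocabulary tokens; all zeros if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) iterates over vector components across all known tokens.
    return [sum(component) / len(known) for component in zip(*known)]


# Toy 2-dimensional vectors, invented for illustration only.
toy_vectors = {"dose": [1.0, 0.0], "fatigue": [0.0, 1.0]}
vec = inquiry_vector(["dose", "fatigue", "xyz123"], toy_vectors, dim=2)
```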
Usually, it is not practical to define an appropriate number of clusters a priori. A reasonable number of clusters depends on various interdependent factors: number of incoming inquiries, therapeutic area of the medicine, time frame of the analysis, and intrinsic amount of information (i.e. variety of the medical inquiries). For a given medicinal product, typically a handful of frequently asked questions covers a large volume of inquiries, accompanied by numerous low-volume and less cohesive inquiry clusters. These low-volume clusters often contain valuable information, which might not even be known to medical experts: their low volume makes it difficult to detect them via manual inspection. To perform clustering in the embedding space, we use the hierarchical, density-based clustering algorithm HDBSCAN [36][37][38]. As customary in unsupervised learning tasks, one needs to provide some information on the desired granularity, i.e. how fine or coarse the clustering should be. In HDBSCAN, this is accomplished by specifying a single, intuitive hyperparameter (min_cluster_size). In our case, the objective is to obtain approximately 100 clusters so that the results can be easily analyzed by medical experts. Thus, the main factor in defining min_cluster_size is the number of inquiries for a given medicinal product: the larger the medical inquiry volume, the larger the parameter min_cluster_size. Note that min_cluster_size is not a strict controller of cluster size (and thus of how many clusters should be formed), but rather guidance provided to the algorithm regarding the desired clustering granularity. It is also possible to combine different values of min_cluster_size for the same dataset, e.g. using a finer granularity for more recent inquiries, thus enabling the discovery of new topics when only few inquiries are received, at the price however of an increase in noise given the low data volume.
Moreover, min_cluster_size varies very slowly with data (medical inquiry) volume, which facilitates its determination (see Methods).
In practice, a non-linear dimensionality reduction is applied to lower the dimensionality of the text representation before clustering is performed. We utilize the UMAP algorithm 39,40 because of its firm mathematical foundations from manifold learning and fuzzy topology, ability to meaningfully project to any number of dimensions (not only two or three like t-SNE 41 ), and computational efficiency. Reducing the dimensionality considerably improves the clustering computational performance, greatly easing model deployment to production, especially for products with more than 5,000 inquiries. The dimensionality reduction can in principle be omitted, especially for smaller datasets. At the end of this step, for each medicinal product a set of clusters is returned, each containing a collection of medical inquiries. A given medical inquiry is associated to one topic only, in accordance with the single topic assumption.
In order to convey the cluster content to users, a name (or headline) needs to be determined for each cluster. To this end, the five most recurring words for each cluster are concatenated, provided that they appear in at least 20% of the inquiries belonging to that cluster; this frequency threshold is set to avoid including in the topic name words that appear very infrequently but are still among the top five words. Thus, if a word does not fulfill the frequency requirement, it is not included in the topic name (resulting in topic names with fewer than five words). Through this naming (topic creation), the clusters are represented by a set of words which summarize their semantic content. Finally, topics with similar names are merged in order to limit the number of topics to be presented to medical experts (see Methods). After the topics are merged, new topic names are generated according to the procedure outlined above. The final result is a list of topics defined by a given name, each containing a set of similar medical inquiries. The list of discovered topics is then output and presented to medical experts.
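The naming rule can be illustrated with a short sketch. It assumes that "most recurring" is measured by the number of inquiries containing a word (each word counted once per inquiry) and that ties are broken alphabetically; both choices, as well as the function name `topic_name` and the toy cluster, are assumptions made for illustration.

```python
from collections import Counter


def topic_name(inquiries, top_k=5, min_fraction=0.2):
    """Name a cluster from its most frequent words.

    `inquiries` is a list of token lists, one per inquiry in the cluster.
    Assumption: a word is counted once per inquiry (document frequency).
    """
    n = len(inquiries)
    if n == 0:
        return ""
    doc_freq = Counter()
    for tokens in inquiries:
        doc_freq.update(set(tokens))
    # Sort by frequency, then alphabetically so that ties are deterministic.
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    # Keep a top-k word only if it appears in at least min_fraction of inquiries.
    kept = [word for word, count in ranked[:top_k] if count / n >= min_fraction]
    return " ".join(kept)


# Toy cluster of six tokenized inquiries, invented for illustration only.
cluster = [
    ["fat", "meal", "high"],
    ["fat", "meal", "low"],
    ["fat", "diet"],
    ["meal", "diet", "high"],
    ["fat", "breakfast"],
    ["fat", "meal"],
]
name = topic_name(cluster)
```

In this toy cluster, breakfast is among the five most frequent words but appears in only one of six inquiries, so it falls below the 20% threshold and the resulting name has four words.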
Since the goal is to extract as much knowledge as possible from incoming medical inquiries, a relatively large number of topics (typically around 100) is returned to medical experts for each medicinal product. To facilitate topic exploration and analysis, topics are visualized on a map that reflects the similarity between topics (Fig. 3a): topics close to each other in this map are semantically similar. To obtain this semantic map, first topic vectors are computed by averaging the text representation of all inquiries belonging to a given topic; then, a dimensionality reduction to two dimensions via UMAP is performed.

Topic evaluation: topic semantic compactness and name saliency
Once topics are discovered, it is desirable to provide medical experts with information regarding the quality of a given topic. To this end, a score is calculated for each topic. Since no available metric applies to the case of medical topic discovery (see Methods), we introduce two new quantities to evaluate topics discovered in an unsupervised setting. These quantities -which we term topic semantic compactness and name saliency -fully leverage semantic similarity at the sentence (medical inquiry) level; they are also intuitive, computationally efficient, and intrinsic, i.e. do not require any gold labels.
One way of quantifying the quality of a discovered topic is to determine how similar the inquiries grouped in a topic are: intuitively, the more similar the inquiries in a topic are, the higher the quality of this topic. As described in Sec. II A 1-II A 2, we map inquiries (via word2vec and averaging of word vectors) to a semantic space where clustering is performed. From a geometrical point of view, topic quality can be estimated by calculating the similarity among medical inquiries within a given topic in this semantic space; those similarities are then summed to evaluate the semantic compactness of a topic. The topic semantic compactness is between zero and one, one indicating the highest possible compactness.
The topic name is one of the main pieces of information shown to the users to summarize the semantic content of a discovered medical topic. It is therefore of interest to quantify how representative the name is for a given medical topic. This is tackled by answering the following question: how similar is the name to the inquiries grouped in the topic it represents? This is quantified by the name saliency, a score between zero and one: the higher the value, the more salient the name with respect to the topic it represents. Topic quality is then obtained by averaging semantic compactness and name saliency for each topic. Additional information on topic evaluation can be found in Methods.
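To make the two scores concrete, the sketch below uses one plausible form for each. The exact definitions are given in Methods and are not reproduced here, so the formulas are assumptions: compactness is taken as the mean pairwise cosine similarity between inquiry vectors, name saliency as the mean cosine similarity between the name vector and the inquiry vectors, and both are clipped to the interval from zero to one. Only the final averaging into topic quality follows the text directly.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def clip01(x):
    """Clip a score to the interval [0, 1]."""
    return max(0.0, min(1.0, x))


def semantic_compactness(inquiry_vecs):
    """Assumed form: mean pairwise cosine similarity within the topic."""
    pairs = list(combinations(inquiry_vecs, 2))
    if not pairs:
        return 1.0  # a topic with a single inquiry is trivially compact
    return clip01(sum(cosine(u, v) for u, v in pairs) / len(pairs))


def name_saliency(name_vec, inquiry_vecs):
    """Assumed form: mean cosine similarity between name and inquiries."""
    if not inquiry_vecs:
        return 0.0
    total = sum(cosine(name_vec, v) for v in inquiry_vecs)
    return clip01(total / len(inquiry_vecs))


def topic_quality(name_vec, inquiry_vecs):
    """Average of compactness and name saliency, as stated in the text."""
    return 0.5 * (semantic_compactness(inquiry_vecs)
                  + name_saliency(name_vec, inquiry_vecs))


# Toy 2-dimensional vectors, invented for illustration only.
tight_topic = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
loose_topic = [[1.0, 0.0], [0.0, 1.0]]
```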
B. A real-world example of topic discovery: the oncology product Stivarga™
As a real-world example of topic discovery, we present the results for medical inquiries on the oncology product Stivarga™. Stivarga™ is an oral multikinase inhibitor which inhibits various signal pathways responsible for tumor growth.
In this work, all unsolicited medical inquiries received by Bayer™ worldwide in the time frame July 2019-June 2020 are considered. All non-English inquiries are translated to English using machine translation. These inquiries are then preprocessed: acronyms are resolved; non-informative phrases, words or patterns are removed; text is tokenized and lemmatized. Additional details are provided in Methods. Then, the topic discovery algorithm introduced in Sec. II A is applied.
The semantic map with the discovered topics is shown in Fig. 3a. These topics span a relatively large variety of themes, ranging from interactions with food and adverse drug reactions to purchase costs and literature requests. The topics are judged as meaningful and medically relevant by medical information specialists, on the basis of their expert knowledge of the medicinal product.
Topics are also specific: the unsupervised learning approach allows information to emerge directly from the data, without resorting to predefined lists of keywords or classes, as required when using ontologies or supervised learning. An example of a very specialized topic for inquiries on scientific literature is treatment role bruix grothey evolve: 12 requests related to the review article on the treatment of advanced cancer with regorafenib (active ingredient of Stivarga™) published in February 2020 42. Other examples are the five topics fat diet high meal low, eat breakfast day drink, precaution diet, eat time, contraindication diet. Even though all these topics relate to nutrition, they address different aspects. It is quite advantageous that they are identified as distinct since medical recommendations will likely differ across these five topics.
Thanks to the inclusion of semantics via word embedding, the algorithm is able to group together inquiries having similar meaning, even though the actual words in them are distinct. For instance, the topic pain side effect foot fatigue comprises 21 inquiries on medical issues (which may or may not be related to the medicine), in which the following words appear: pain (seven times), side effect (six times), nausea (three times), fatigue (five times), dysphonia (two times). The algorithm is able to cluster these inquiries together because similar inquiries are mapped close to each other in the high-dimensional semantic space where clustering is performed. This is corroborated by the relatively high similarity scores between the terms appearing in these inquiries (pain-nausea: 0.66, nausea-fatigue: 0.61, pain-fatigue: 0.71, dysphonia-pain: 0.55, dysphonia-fatigue: 0.49), scores much higher than zero, the value expected for unrelated terms (cf. pain-day: 0.05, nausea-sun: 0.08). Conversely, if there is a moderate number of inquiries on a specific medical matter, the algorithm is generally able to detect that signal, as in the case of mucositis and hoarse in the two topics daily care method rash mucositis and daily care method rash hoarse.
As shown in Fig. 3a, the automatically generated topic names (see Sec. II A 2) provide a reasonably good insight into the semantic content of their respective topics. However, one needs to be mindful that the topic might (and usually will) contain additional information of relevance. To convey this information in a simple yet effective way to the users, wordclouds are generated for each topic; examples of wordclouds are shown in Fig. 3b. For example, in the wordcloud of topic compassionate program (Fig. 3b, 1st column-2nd row), concepts not included in the topic name (e.g. assistance, interested, access, status) appear, thus giving further insight into the topic content. In some cases, even the wordcloud might not convey the topic meaning: users will then resort to manually inspecting the inquiries belonging to the topic. For instance, the content of topic chinese is not clear, neither from the topic name nor from the wordcloud; however, inspection of the actual inquiries quickly reveals that they refer to the interaction between Chinese medicine and Stivarga™ (the word medicine does not appear since it is a stopword). Other examples are al et clin and long, which group together requests for scientific articles and product durability, respectively.
Topic quality provides useful guidance when exploring topics. If topic quality is close to one, medical inquiries in that topic are all very similar, and the topic name is expected to summarize the topic content well. Conversely, topics with low quality will contain inquiries that might differ quite substantially, yet are similar enough to be clustered together by the algorithm. In these cases, manual inspection of the underlying medical inquiries may be a good strategy. From Fig. 3a, it appears that smaller topics tend to have higher topic scores, although no clear trend emerges.
Finally, in addition to grouping similar inquiries within topics, the model captures semantic similarities between topics. This is apparent from Fig. 3a: similar topics tend to be close to each other in the semantic map. Even though this feature does not influence the topics discovered, from a user perspective it provides a clear advantage when exploring topics (e.g. compared to reading them as a simple list).

III. DISCUSSION
This study introduces an unsupervised machine learning approach to automatically discover topics from medical inquiries. After the initial (one-time) effort for preprocessing (e.g. abbreviation definition, stopword refinement) and hyperparameter determination, the algorithm runs without requiring any human intervention, discovering key topics as medical inquiries are received. Topics can be discovered even if only a small number of inquiries is present, and are generally specific, thus enabling targeted, informed decisions by medical experts. Being completely unsupervised, the algorithm can discover topics that were neither known nor expected in advance, which are often the most valuable. This is in stark contrast with ontology-based or supervised approaches, where topics need to be defined a priori (as collections of keywords or classes), and incoming text can be associated only with these predefined lists of topics, thus hindering the discovery of a priori unknown topics. The machine learning approach introduced here does not use ontologies (which are costly and hard to build, validate, and maintain, and difficult to apply when layman and specialized medical terms are combined), and instead it incorporates domain knowledge via specialized biomedical word embeddings. This allows the topic discovery algorithm to be readily applied to different medicinal products, without the burden of having to develop specialized ontologies for each product or therapeutic area. Indeed, the algorithm periodically analyzes medical inquiries for a total of sixteen Bayer™ medicinal products, encompassing cardiology, oncology, gynecology, hematology, and ophthalmology.
Our approach has several limitations. First, it can happen that a small fraction of inquiries associated with a given topic are actually extraneous to it, especially for semantically broad topics. This is because, due to the noise present in this real-world dataset, the soft clustering HDBSCAN algorithm must be applied with a low probability threshold for cluster assignment to avoid the majority of inquiries being considered as outliers (see Methods). Second, even though the topic names are generally quite informative, a medical expert needs to read the actual inquiries to fully grasp the topic meaning, especially if a decision will be made on the grounds of the discovered topics. This is however not burdensome because inspection is limited to the inquiries associated with a given topic (and not all inquiries). Last, some discovered topics are judged by medical experts, based on their expert knowledge, to be so similar that they could have been merged into a single topic, but are considered distinct by the algorithm. In these cases, manual topic grouping might be required to determine the top topics by inquiry volume. Still, these similar topics very often appear close to each other in the topic map.
Despite these limitations, this study demonstrates that medical inquiries contain useful information, and that machine learning can extract this information in an automatic way, discovering topics that are judged by medical information specialists as meaningful and valuable. The hope is that this will stimulate mining of medical inquiries, and more generally the use of natural language processing and unsupervised learning in the medical industry. Interesting future directions are the inclusion of a priori expert knowledge (e.g. a list of expected topics) while at the same time maintaining the ability to discover new and previously unknown topics, and grouping topics in meta-topics through a clustering hierarchy.

A. Preprocessing
Since our dataset comprises real-world medical inquiries, preprocessing is a crucial step to limit the amount of noise in the corpus. The corpus contains numerous acronyms: a first step is thus acronym resolution, i.e. substituting a given acronym with its extended form. A dictionary for the most recurring acronyms (∼40 per product) is compiled with the help of medical experts. Acronym resolution is performed via a curated dictionary for two reasons. First, the data is too scarce and noisy to train a reliable, custom-made word embedding to learn the acronym meanings from the corpus. Second, in pretrained word embeddings typically there is no suitable representation for the acronym, or the acronym in our corpus is used to indicate something different than in natural language (it can even be company-specific). For example, in our corpus lode does not refer to a vein of a metal, but stands for lack of product effect. Regular expressions are then used to remove non-informative strings (e.g. lot numbers, references to internal systems). Next, text is split into sentences, tokenized and lemmatized using the scispaCy library 7 (which is built on top of spaCy). We disable the scispaCy parser; this gives a significant speed-up without affecting the topic discovery outcome. Finally, stopwords (i.e. non-informative words) are removed. In addition to standard English stopwords, there are non-standard stopwords which arise from the dataset being composed of medical inquiries (e.g. ask, request, email, inquiry, patient, doctor), and product-dependent stopwords, typically the brand and chemical name of the medicinal product to which the inquiries refer. It is also the case that in the medical inquiry corpus single words bear value, but when combined they are no longer relevant for medical topic discovery.
For example, the words years and old are generally of relevance, but if contiguous (years old) they are no longer significant since this expression simply originates from a medical information specialist logging the age of the patient to which the inquiry refers. Another example is the word morning: when appearing alone it is of relevance, but when it is preceded by the word good it loses its relevance since the expression good morning does not bear any significance for medical topic discovery. We compile a short list of stop n-grams (∼20) and remove them from the corpus.
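The preprocessing steps above can be sketched with the standard library alone. This is a simplified illustration under several assumptions: the lot-number pattern is hypothetical (real patterns are corpus-specific), the stopword list is a small excerpt, tokenization is a plain regular expression, and lemmatization (performed with scispaCy in this work) is omitted. The acronym lode and the stop n-grams years old and good morning are taken from the text.

```python
import re

# Curated acronym dictionary (example from the text).
ACRONYMS = {"lode": "lack of product effect"}
# Small excerpt of standard and inquiry-specific stopwords.
STOPWORDS = {"the", "a", "is", "of", "ask", "request", "email",
             "inquiry", "patient", "doctor"}
# Stop n-grams: contiguous expressions that carry no topical information.
STOP_NGRAMS = ["years old", "good morning"]
# Hypothetical pattern for lot numbers such as "lot bx1234".
NOISE_PATTERN = re.compile(r"\blot\s*#?\s*[a-z]*\d+\b")


def preprocess(text):
    """Return the list of informative tokens of a raw inquiry."""
    text = text.lower()
    # 1. Acronym resolution via the curated dictionary.
    for acronym, expansion in ACRONYMS.items():
        text = re.sub(rf"\b{re.escape(acronym)}\b", expansion, text)
    # 2. Removal of non-informative strings.
    text = NOISE_PATTERN.sub(" ", text)
    # 3. Removal of stop n-grams while the words are still contiguous.
    for ngram in STOP_NGRAMS:
        text = re.sub(rf"\b{re.escape(ngram)}\b", " ", text)
    # 4. Tokenization (lemmatization omitted in this sketch).
    tokens = re.findall(r"[a-z0-9]+", text)
    # 5. Stopword removal.
    return [t for t in tokens if t not in STOPWORDS]


tokens = preprocess("Good morning, patient 65 years old reports LODE, lot BX1234")
```

Stop n-grams are removed before single-word stopword filtering so that words such as morning or years survive when they appear on their own.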

B. Text representation
To represent medical inquiries, the scispaCy 7 word embedding model en_core_sci_lg-0.2.5 is used. No model re-training or fine-tuning is performed because of the small amount of data and the sparsity problem; since no labels are available, one would need to train a language model on noisy and short text instances, which would likely lead the model to forget the semantics learned by the scispaCy model. For each token, the (200-dimensional) scispaCy embedding vector is retrieved; the sentence representation is then obtained simply by calculating the arithmetic average of the vectors of all tokens belonging to a given sentence.

C. Out-of-vocabulary words
Even though the overwhelming majority of out-of-vocabulary (oov) words are not of interest for medical topic discovery, a very small (but relevant) subset of important oov words would be missed if one were to simply use the word2vec model. We thus devise a strategy to overcome this, as described below. For each product, the most recurring oov words are automatically detected; these words need to be included in the word2vec model so that they can be represented by a vector which accurately captures their meaning. Training a new embedding to include these new terms is not a good approach given the sparseness problem described above. To overcome this, we combine a definition mapping and embedding strategy. Specifically, each of the relevant oov terms is first manually mapped to a short definition; for example, the oov ReDOS is mapped to dose optimization study, since ReDOS refers to a dose-optimisation phase 2 study on regorafenib 43 . Then, using the text from these definitions, a meaningful vector representation for the oov words is obtained with the embedding strategy described above (scispaCy word embedding model and arithmetic average of word vectors, the word vectors averaged now being those of the words in the oov word definition). This procedure has two main benefits. First, it does not require any training data or training effort. Second, it ensures by construction that the added word vectors are compatible with the word representation model in use. Pharmaceutical product trade names are oov words of particular interest for medical topic discovery. However, they are generally not included in the scispaCy model. Thus, a slightly different procedure is used to ensure that all trade names appearing in medical inquiries are added to the model, regardless of whether they belong to the most recurring oov words. Luckily, international non-proprietary names (INNs) of drugs are included.
For instance, the oncology product trade name Stivarga™ is not present, while its corresponding INN (regorafenib) is. Thus, to automatically detect drug trade names we utilize the scispaCy named entity recognizer (NER) and the scispaCy UmlsEntityLinker as follows. First, the NER is used to extract entities from the text; then, for each entity, the UmlsEntityLinker performs a linking with the Unified Medical Language System (UMLS) 44 by searching within a knowledge base of approximately 2.7 million concepts via string overlap, as described in Ref. 7. To limit the number of false positive matches, we increase the UmlsEntityLinker threshold to 0.85 from the default of 0.7. For each entity that has been successfully linked to UMLS, the UmlsEntityLinker returns several pieces of information about the identified concept: concept unique identifier (CUI), concept preferred name, concept definition, concept aliases, and concept type unique identifier (TUI). In particular, the latter defines the semantic group to which the linked concept belongs 45,46; an up-to-date list of semantic type mappings can be found at 47 . A TUI value of T121 indicates that the concept found is a Pharmacologic Substance. Extracting the entities with TUI equal to T121 allows drug trade names to be identified automatically. Each drug trade name is then mapped to the concept preferred name; if that is not present, the concept definition is used; if that is also not present, the drug trade name is replaced by the phrase pharmaceutical medication drug. Once this mapping is performed, the same embedding strategy used for the other oov words is followed in order to obtain semantically meaningful word vector representations.
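
The fallback chain used for trade names (preferred name, then definition, then a generic phrase) can be sketched as below. The dictionary layout with the keys `preferred_name`, `definition`, and `tuis` is a hypothetical simplification of the linker output, and the example records are hand-written illustrations rather than actual UMLS entries.

```python
FALLBACK_PHRASE = "pharmaceutical medication drug"
PHARMACOLOGIC_SUBSTANCE_TUI = "T121"


def replacement_text(concept):
    """Text used to embed a trade name: preferred name, else definition, else fallback."""
    return (concept.get("preferred_name")
            or concept.get("definition")
            or FALLBACK_PHRASE)


def map_trade_names(linked_entities):
    """Keep only entities typed as Pharmacologic Substance and map them to text.

    `linked_entities` maps an entity string to a hypothetical concept record.
    """
    mapping = {}
    for surface_form, concept in linked_entities.items():
        if PHARMACOLOGIC_SUBSTANCE_TUI in concept.get("tuis", []):
            mapping[surface_form] = replacement_text(concept)
    return mapping


# Hand-written illustrative records (not real linker output).
EXAMPLE_ENTITIES = {
    "stivarga": {"preferred_name": "regorafenib", "tuis": ["T121"]},
    "brandx": {"definition": "an oral anticoagulant", "tuis": ["T121"]},
    "brandy": {"tuis": ["T121"]},
    "headache": {"preferred_name": "headache", "tuis": ["T184"]},  # some other semantic type
}
```

The resulting replacement text would then be embedded by averaging the vectors of its words, in the same way as the definitions of the other oov words.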

D. Hierarchical clustering
The HDBSCAN algorithm 36-38 starts by defining a mutual reachability distance based on a density estimation; the data is then represented as a weighted graph where vertices are data points and edges have weight equal to the mutual reachability distance between points. The minimum spanning tree is built and converted to a hierarchy of connected components via a union-find data structure: starting from an initial cluster containing all points, the data is subsequently split at each level of the hierarchy according to the distance, ultimately returning as many clusters as data points when the threshold distance approaches zero. This cluster hierarchy is commonly depicted as a dendrogram. To obtain a meaningful set of clusters, this hierarchy needs to be condensed. The crucial point is to discern, at any given split, whether two new meaningful clusters are formed by splitting their parent cluster, or whether the parent cluster is simply losing points (in the latter case one wishes to keep the parent cluster undivided). In HDBSCAN, this decision is governed by the minimum cluster size hyperparameter (min_cluster_size): a cluster split is accepted only if both newly formed clusters have at least min_cluster_size points. The final clusters are then chosen from this set of condensed clusters by means of a measure of stability as defined in Ref. 36. The main factor in defining min_cluster_size is the number of inquiries for a given product: we want to obtain (approximately) 100 clusters so that results can be easily analyzed by medical experts. It is important to point out that min_cluster_size does not strictly specify the number of clusters that will be formed, but rather gives the algorithm an indication of the desired granularity, as outlined above. In our case, min_cluster_size ranges only between 5 and 10 depending on the number of inquiries. This small range of variation substantially facilitates the hyperparameter search.
Moreover, we noticed that, for approximately the same number of inquiries and the same min_cluster_size, the number of returned clusters increases with data variety, where data variety is qualitatively evaluated by manual inspection: for products with more diverse inquiries HDBSCAN tends to return a higher number of clusters, ceteris paribus. We utilize the leaf cluster selection method instead of the excess of mass algorithm because the former is known to return more homogeneous clusters 37 .
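
Two ingredients of the description above lend themselves to a short sketch: the mutual reachability distance and the rule for accepting a split. The formula used here is the commonly stated HDBSCAN definition (the maximum of the two core distances and the direct distance); taking the core distance as the distance to the k-th nearest neighbour excluding the point itself is an assumption of this sketch, and implementations may differ in that detail.

```python
import math


def core_distance(points, i, k):
    """Distance from point i to its k-th nearest neighbour (itself excluded)."""
    dists = sorted(math.dist(points[i], p)
                   for j, p in enumerate(points) if j != i)
    return dists[k - 1]


def mutual_reachability(points, i, j, k):
    """Largest of the two core distances and the direct distance between i and j."""
    return max(core_distance(points, i, k),
               core_distance(points, j, k),
               math.dist(points[i], points[j]))


def accept_split(size_left, size_right, min_cluster_size):
    """A split is kept only if both child clusters are large enough."""
    return size_left >= min_cluster_size and size_right >= min_cluster_size
```

Points in sparse regions have large core distances, so their mutual reachability distance to any other point is inflated; this pushes them away from dense clusters. When `accept_split` returns False, the parent cluster is regarded as merely losing points and is kept undivided.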
Due to the noise in the dataset, using the standard (hard) HDBSCAN clustering results in a large portion of the dataset (30-60%) being considered as outliers, consistently across all products. To overcome this, we use the soft HDBSCAN clustering 37 , which returns the probability that an inquiry belongs to a given cluster instead of a (hard) cluster assignment. We then define a probability threshold under which a point is considered to be an outlier; all points above this threshold are assigned to the cluster with the highest probability through an argmax operation. This probability threshold ranges between $10^{-3}$ and $10^{-2}$ and is chosen such that approximately 10% of the inquiries are classified as outliers. As mentioned in the main text, for computational reasons, we project via UMAP to a lower-dimensional space before clustering is performed. Specifically, we project to 100 dimensions for products with fewer than 15,000 inquiries, and to 20 dimensions for products with more than 15,000 inquiries. Moreover, inquiries longer than 800 characters are also considered as outliers, because the text representation (average of word vectors) degrades for long sentences. These inquiries are gathered in the outlier cluster and made available to medical experts for manual inspection.
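
The conversion from soft cluster memberships to final labels can be sketched as follows. The membership matrix is assumed to be given (in the hdbscan Python library such a matrix can be obtained with `all_points_membership_vectors` on a clusterer fitted with `prediction_data=True`; whether this exact call was used here is an assumption). The quantile-style rule for picking the threshold is likewise one illustrative way of reaching a target outlier fraction.

```python
def assign_clusters(membership, threshold):
    """Argmax assignment with outlier rejection.

    membership[i][c] is the probability that inquiry i belongs to cluster c.
    Returns one label per inquiry; -1 marks an outlier.
    """
    labels = []
    for probs in membership:
        best = max(range(len(probs)), key=probs.__getitem__)
        labels.append(best if probs[best] >= threshold else -1)
    return labels


def threshold_for_outlier_fraction(membership, fraction):
    """Threshold below which roughly `fraction` of the inquiries fall."""
    top = sorted(max(probs) for probs in membership)
    n_outliers = int(round(fraction * len(top)))
    # Points whose best probability is strictly below the returned value are outliers.
    return top[n_outliers] if n_outliers < len(top) else float("inf")
```

With a target fraction of 0.1, the threshold is placed just above the lowest 10% of the per-inquiry maximum probabilities, so that about one inquiry in ten is sent to the outlier cluster (ties can change the exact count).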

E. Topic merging
Given a topic, the vector representation for each word in the topic name is calculated; the topic name vector is then obtained by averaging the word vectors of the words present in the topic name. Topics are merged if their similarity, evaluated as the cosine similarity between their topic name vectors, is larger than a threshold. Threshold values range between 0.8 and 0.95 depending on the medicinal product considered.
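
A minimal sketch of the merging step is given below. Topic name vectors are assumed to be precomputed by averaging word vectors; the toy vectors are hypothetical. Grouping topics transitively with a union-find structure (if A is similar to B and B to C, all three are merged) is an assumption of this sketch, since the text only states the pairwise criterion.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_topics(name_vectors, threshold):
    """Group topics whose name vectors have cosine similarity above `threshold`.

    `name_vectors` maps a topic name to its (averaged) name vector.
    Returns a sorted list of groups, each a sorted list of topic names.
    """
    names = list(name_vectors)
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the representative, compressing the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(name_vectors[a], name_vectors[b]) > threshold:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(group) for group in groups.values())
```

Raising the threshold toward 0.95 merges only near-identical topic names, while values near 0.8 merge more aggressively.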

F. Topic evaluation: topic semantic compactness and name saliency
The most popular topic evaluation metrics for topic modelling on long text are UCI 48 and UMass 49 . However, neither UCI nor UMass is a good indicator of topic quality in short text topic modelling, due to the sparseness problem 50 . In Ref. 50, a purity measure is introduced to evaluate short text topic modelling; however, it requires pairs of short and long documents (e.g. abstract and corresponding full text article), and thus it is not applicable here because there is no long document associated with a given medical inquiry. Indeed, evaluation of short text topic modelling is an open research problem 20 . An additional challenge is the absence of labels. Performing annotations would require substantial manual effort by specialized medical professionals, and would be of limited use because one of the main goals is to discover previously unknown topics as new inquiries are received. The absence of labels precludes the use of the metrics based on purity and normalized mutual information proposed in Refs. 51, 52, 26.

Ref. 53 brings forward the valuable idea of using distributional semantics to evaluate topic coherence, exploiting the semantic similarity learned by word2vec models. Topic coherence is assessed by calculating the similarity among the top n-words of a given topic: semantically similar top n-words lead to higher topic coherence. While this might be desirable in general, in the case of discovering medical topics it is actually detrimental: interesting (and potentially previously unknown) topics are often characterized by top n-words which are not semantically similar. For example, a medical topic having as top 2-words rivaroxaban (an anticoagulant medication) and glutine is clearly relevant from a medical topic discovery standpoint. However, rivaroxaban and glutine are not semantically similar, and thus the metric proposed in Ref. 53 would consider this a low coherence (and thus low quality) topic, in stark contrast with human expert judgment. Analogous considerations apply to the indirect confirmation measures in Ref. 54: words emerging in novel topics would have rarely appeared before in a shared context.

For this reason, we introduce a new measure of topic compactness which takes into account the semantics of the inquiries and does not require any labeled data. Specifically, we compute the similarity of all inquiries belonging to a given topic with each other (excluding self-similarity), sum the elements of the resulting similarity matrix, and divide by the total number of elements in this matrix. The topic semantic compactness $\gamma_\alpha$ of topic $\alpha$ reads

$$\gamma_\alpha = \frac{1}{|C_\alpha|\,(|C_\alpha| - 1)} \sum_{q_i \in C_\alpha} \; \sum_{\substack{q_j \in C_\alpha \\ j \neq i}} S(q_i, q_j) \tag{1}$$

where $|C_\alpha|$ is the cardinality of topic $\alpha$ (how many inquiries are in topic $\alpha$), $q_i$ (and $q_j$) is the word vector representing inquiry $i$ ($j$), and $S$ is a function quantifying the semantic similarity between inquiries $q_i$ and $q_j$, taking values between 0 and 1 ($S = 1$ when $q_i$ and $q_j$ are identical, and $S = 0$ being the lowest possible similarity). Given the chosen normalization factor (i.e. the denominator in Eq. 1), $0 \leq \gamma_\alpha \leq 1$ and thus $\gamma_\alpha$ can be directly used as (a proxy for) a topic quality score. The topic compactness maximum ($\gamma_\alpha = 1$) is attained if and only if every sentence (after preprocessing) contains exactly the same words. It is important to point out that $\gamma_\alpha$ automatically takes semantics into account: different but semantically similar medical inquiries would still have a high similarity score, and thus would lead (as desired) to a high topic semantic compactness, despite these inquiries using different words to express similar content. Contrary to Ref. 53, the topic semantic compactness $\gamma_\alpha$ introduced in Eq. 1 does not artificially penalize novel topics just because they associate semantically different words appearing in the same inquiry.
To come back to the previous example, if numerous inquiries in a discovered topic contain the words rivaroxaban and glutine, the topic semantic compactness would be high (as desired), regardless of the fact that the top 2-words are not semantically similar, since the similarity is evaluated at the inquiry level (by $S(q_i, q_j)$ in Eq. 1).
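
The topic semantic compactness, i.e. the average pairwise similarity between distinct inquiries of a topic, can be computed with a few lines of code. This is a sketch under the assumption that inquiry vectors are already available and that cosine similarity is used as $S$; the function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_compactness(inquiry_vectors):
    """Mean similarity over all ordered pairs (i, j) with i != j."""
    n = len(inquiry_vectors)
    if n < 2:
        raise ValueError("at least two inquiries are needed")
    total = sum(cosine(inquiry_vectors[i], inquiry_vectors[j])
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))
```

A topic whose inquiries all map to the same vector reaches the maximum value of 1, while a topic made of mutually orthogonal inquiry vectors scores 0.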
It is also beneficial to evaluate how representative the topic name is for the topic it represents. To this end, we calculate the name saliency $\tau_\alpha$ for medical topic $\alpha$ by calculating the similarity of the word vector representing the topic name with the word vectors representing the inquiries in the topic, summing these similarity values, and dividing by the total number of inquiries in the topic. This reads

$$\tau_\alpha = \frac{1}{|C_\alpha|} \sum_{q_i \in C_\alpha} S(t_\alpha, q_i) \tag{2}$$

where $|C_\alpha|$ is the cardinality of topic $\alpha$ (how many inquiries are in topic $\alpha$), $t_\alpha$ is the word vector representing the name of topic $\alpha$, and $q_i$ is the vector representing inquiry $i$. This returns a score ($0 \leq \tau_\alpha \leq 1$) which quantifies how representative (salient) the name is for the topic it represents. As in the case of the topic semantic compactness, the name saliency $\tau_\alpha$ natively takes semantics (e.g. synonyms) into account via $S(t_\alpha, q_i)$ in Eq. 2. In both Eq. 1 and Eq. 2, the cosine similarity is used as the similarity measure.
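
The name saliency can be sketched analogously, as the mean cosine similarity between the topic name vector and each inquiry vector of the topic. The vectors are assumed to be precomputed, and the function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def name_saliency(name_vector, inquiry_vectors):
    """Mean similarity between the topic name and the inquiries of the topic."""
    if not inquiry_vectors:
        raise ValueError("the topic contains no inquiries")
    return sum(cosine(name_vector, q) for q in inquiry_vectors) / len(inquiry_vectors)
```

A name vector that coincides with every inquiry vector gives a saliency of 1; a name that matches only half of the inquiries and is orthogonal to the rest gives 0.5.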

V. COMPETING INTERESTS
Financial support for the research was provided by Bayer AG. The authors report a patent application on Topic Modelling of Short Medical Inquiries submitted on April 21st, 2020 (application number EP20170513.4).