Models and Processes to Extract Drug-like Molecules From Natural Language Text

Researchers worldwide are seeking to repurpose existing drugs or discover new drugs to counter the disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). A promising source of candidates for such studies is molecules that have been reported in the scientific literature to be drug-like in the context of viral research. However, this literature is too large for human review and features unusual vocabularies for which existing named entity recognition (NER) models are ineffective. We report here on a project that leverages both human and artificial intelligence to detect references to such molecules in free text. We present 1) an iterative model-in-the-loop method that makes judicious use of scarce human expertise in generating training data for an NER model, and 2) the application and evaluation of this method to the problem of identifying drug-like molecules in the COVID-19 Open Research Dataset Challenge (CORD-19) corpus of 198,875 papers. We show that by repeatedly presenting human labelers only with samples for which an evolving NER model is uncertain, our human-machine hybrid pipeline requires only modest amounts of non-expert human labeling time (tens of hours to label 1,778 samples) to generate an NER model with an F-1 score of 80.5%, on par with that of non-expert humans, and, when applied to CORD-19, identifies 10,912 putative drug-like molecules. This enriched the computational screening team's targets by 3,591 molecules, of which 18 ranked in the top 0.1% of all 6.6 million molecules screened for docking against the 3CLPro protein.


INTRODUCTION
The Coronavirus Disease 2019 (COVID-19) pandemic, caused by transmissible infection of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has resulted in tens of millions of diagnosed cases and over 1,450,000 deaths worldwide (Dong et al., 2020), straining healthcare systems and disrupting key aspects of society and the wider economy. It is thus important to identify effective treatments rapidly via discovery of new drugs and repurposing of existing drugs. Here, we leverage advances in natural language processing to enable automatic identification of drug candidates being studied in the scientific literature.
The magnitude of the pandemic has resulted in an enormous number of academic publications related to COVID-19 research since early 2020. Many of these articles are collated in the COVID-19 Open Research Dataset Challenge (CORD-19) collection (Allen Institute For AI, 2020; Wang et al., 2020). With 198,875 articles at the time of writing, that collection is far too large for humans to read. Thus, tools are needed to automate the process of extracting relevant data, such as drug names, testing protocols, and protein targets. Such tools can save domain experts significant time and effort.
Extracting named entities from scientific texts has been studied for more than two decades. Prior work relied on matching tokens (words) in text to entries in existing databases or ontologies (Rindflesch et al., 1999; Furrer et al., 2019). However, for our task, existing drug databases like DrugBank cover both too much, in that they include entities of many types (for example, "rabbit" is in DrugBank as an allergen), and too little: because the creation of such databases is time-consuming, they cannot keep up with the new entities in the latest COVID-19-related publications (∼200k papers in the CORD-19 corpus), which are exactly what we want to extract here.
Machine learning (ML) and deep learning (DL) methods have also been used for identifying named entities from free text. While less dependent on a comprehensive ontology, they need labeled data for training. Traditional training data collection employs a brute-force approach, collecting a large corpus and labeling each and every word indiscriminately (Bada et al., 2012). The time and human resources involved are enormous. We, in contrast, had access to just a few assistants who worked on labeling only during free time in their day job. Thus we needed a more careful approach of selective labeling to maximize their effectiveness.
Towards this goal, we describe here how we have tackled two important problems: creating labeled training data via judicious use of scarce human expertise, and applying a named entity recognition (NER) model to automatically identify drug-like molecules in text. We are looking not only for novel drugs under development, but also for any small molecule drug that has been used to treat patients with COVID-19 or similar infectious diseases like SARS and MERS. In the absence of expert-labeled data for the growing COVID literature, we employ an iterative model-in-the-loop collection process inspired by our previous work (Tchoua et al., 2019a,b) and demonstrate that it can build a high quality training set without input from domain scientists. We first assemble a small bootstrap set of human-verified examples to train a model for identifying similar examples. We then iteratively apply the model, use human reviewers to verify the predictions for which the model is least confident, and retrain the model until the improvement in performance is less than a threshold. (The human reviewers were administrative staff without scientific backgrounds, with time available for this task due to the pandemic.) Having collected adequate training data via this model-guided human annotation process, we then use the resulting labeled data to re-train an NER model originally developed to identify polymer names in materials science publications (Hong et al., 2020b) and apply this trained model to CORD-19. We show that the labeled data produced by our approach are of sufficiently high quality that, when used to train NER models, they yield a best F-1 score of 80.5%, roughly equivalent to that achieved by non-expert humans.
The labeled data, model, and model results are all available online, as described in Section 5.

MATERIALS AND METHODS
We aim to develop and apply new computational methods to mine the scientific literature to identify small molecules that have been investigated or found useful as antiviral therapeutics. For example, processing the following sentence should allow us to determine that the drug sofosbuvir has been found effective against the Zika virus: "Sofosbuvir, an FDA-approved nucleotide polymerase inhibitor, can efficiently inhibit replication and infection of several ZIKV strains, including African and American isolates." (Bullard-Feibelman et al., 2017).
This problem of identifying drug-like molecules in text can be divided into two linked problems: 1) identifying references to small therapeutic molecules ("drugs") and 2) determining what the text says about those molecules. In this work, we consider potential solutions to the first problem.
A simple way to identify entities in text that belong to a specialized class (e.g., drug-like molecules) is to refer to a curated list of valid names, if one is available. In the case of drugs, we might think to use DrugBank (Wishart et al., 2018) or the FDA Drug Database (Center for Drug Evaluation and Research, 2020), both of which in fact list sofosbuvir. However, such databases are not in themselves an adequate solution to our problem, for at least two reasons. First, they are rarely complete. The tens of thousands of entity names in DrugBank and the FDA Drug Database together are just a tiny fraction of the billions of molecules that could potentially be used as drugs. Second, such databases may be overly general: DrugBank, for example, includes the terms "rabbit" and "calcium," neither of which has value as an antiviral therapeutic. In general, the use of any such list to identify entities will lead to both false negatives and false positives. We need instead to employ the approach that a human reader might follow in this situation, namely to scan text for words that appear in contexts in which a drug name is likely to appear. In the following, we explain how we combine human and artificial intelligence for this purpose.

Automated Drug Entity Extraction From Literature
Finding strings in text that refer to drug-like molecules is an example of Named Entity Recognition (NER) (Nadeau and Sekine, 2007), an important NLP task. Both grammatical and statistical (e.g., neural network-based) methods have been applied to NER; the former can be more accurate, but require much effort from trained linguists to develop. Statistical methods use supervised training on labeled examples to learn the contexts in which entities of interest (e.g., drug-like molecules) are likely to occur, and then classify previously unseen words as such entities if they appear in similar contexts. For instance, a training set may contain the sentence "Ribavirin was administered once daily by the i. p. route" (Oestereich et al., 2014), with ribavirin labeled as Drug. With sufficient training data, the model may learn to assign the label Drug to arbidol in the sentence "Arbidol was administered once daily per os using a stomach probe" (Oestereich et al., 2014). This learning approach can lead to general models capable of finding previously unseen candidate molecules in natural language text.
The development of effective statistical NER models is complicated by the many contexts in which names can occur. For example, while the contexts just given for ribavirin and arbidol are similar, both are quite different from that quoted for sofosbuvir earlier. Furthermore, authors may use different wordings and sentence structures: e.g., "given by i. p. injection once daily" rather than "administered once daily by the i. p. route." Thus, statistical NER methods need to do more than learn template word sequences: they need to learn more abstract representations of the context(s) in which words appear. Modern NLP and NER systems do just that (Chiu and Nichols, 2016).

SpaCy and Keras-Long-Short Term Memory Models
We consider two NER models in this paper, SpaCy and a Keras long-short term memory (LSTM) model. Both models are publicly available on DLHub (Li et al., 2021) and GitHub, as described in Section 5.
SpaCy (Honnibal and Montani, 2020a; Honnibal et al., 2020) is an open source NLP library that provides a pretrained entity recognizer that can recognize 18 types of entities, including PERSON, ORGANIZATION, LOCATION, and PRODUCT. Its model calculates a probability distribution of a word over the entity types, and outputs the type with the highest probability as the predicted type for that word. When pre-trained on the OntoNotes 5 dataset of over 1.5 million labeled words (Weischedel et al., 2013), the SpaCy entity recognizer can identify supported entities with 85.85% accuracy. However, it does not include drug names as a supported entity class, and thus we would need to retrain the SpaCy model on a drug-specific training corpus. Unfortunately, there is no publicly available corpus of labeled text for drug-like molecules in context. Thus, we need to use other methods to retrain this model (or other NER models), as we describe in Section 4.
While SpaCy is easy to use, it lacks flexibility: its end-to-end encapsulation does not expose many tunable parameters. Thus we also explore the use of a Keras-LSTM model that we developed in previous work for identification of polymers in materials science literature (Hong et al., 2020b). This model is based on the Bidirectional LSTM network with a conditional random field (CRF) layer added on top. It takes training data labeled according to the "IOB" schema. The first word in an entity is given the label "B" (Beginning), the following words in the same entity are labeled "I" (Inside), and non-entity words are labeled "O" (outside). During prediction, the Bi-LSTM network tries to assign one of "IOB" to each word in the input sentence, but it has no awareness of the validity of the label sequence. The CRF layer is used on top of Bi-LSTM to lower the probability of invalid label sequences (e.g., "OIO").
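As an illustration only, the "IOB" schema and the notion of an invalid label sequence can be sketched in Python as follows. The helper names `spans_to_iob` and `is_valid_iob` and the token-span input format are hypothetical and are not taken from the Keras-LSTM code; the validity check is a hand-written rule, whereas the CRF layer learns to make such sequences improbable.

```python
from typing import List, Tuple


def spans_to_iob(tokens: List[str], entity_spans: List[Tuple[int, int]]) -> List[str]:
    """Label tokens with IOB tags, given (start, end) token spans (end exclusive)."""
    tags = ["O"] * len(tokens)
    for start, end in entity_spans:
        tags[start] = "B"              # first word of the entity
        for i in range(start + 1, end):
            tags[i] = "I"              # following words of the same entity
    return tags


def is_valid_iob(tags: List[str]) -> bool:
    """An "I" tag is valid only directly after a "B" or another "I"."""
    previous = "O"
    for tag in tags:
        if tag == "I" and previous == "O":
            return False
        previous = tag
    return True


tokens = ["Ribavirin", "was", "administered", "once", "daily"]
tags = spans_to_iob(tokens, [(0, 1)])
print(list(zip(tokens, tags)))
print(is_valid_iob(["O", "I", "O"]))   # the invalid "OIO" sequence from the text
```

A Bi-LSTM that scores each word independently has no way to apply a rule such as `is_valid_iob`, which is the gap that the CRF layer is meant to fill.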
We compare the performance of SpaCy and Keras-LSTM models under various conditions in Section 2.2.

Model-In-The-Loop Annotation Workflow
We address the lack of labeled training data by using Algorithm 1 (and see Figure 1) to assemble a set of human- and machine-labeled data from CORD-19 (Wang et al., 2020). In describing this process, we refer to paragraphs labeled automatically via a heuristic or model as silver and to silver paragraphs for which labels have been corrected by human reviewers as gold. We use the Prodigy machine learning annotation tool to manage the review process: reviewers are presented with a silver paragraph, with putative drug entities highlighted; they click on false negative and false positive words to add or remove the highlights and thus produce a gold paragraph. Prodigy saves the corrected labels in standard NER training data format.
Our algorithm involves three main phases, as follows. In the first bootstrap phase, we assemble an initial test set of gold paragraphs for use in subsequent data acquisition. We create a first set of silver paragraphs by using a simple heuristic: we select N_0 paragraphs from CORD-19 that contain one or more words in DrugBank with an Anatomical Therapeutic Chemical Classification System (ATC) code, label those words as drugs, and ask human reviewers to correct both false positives and false negatives in our silver paragraphs, creating gold paragraphs. In the subsequent build test set phase, we repeatedly use all gold paragraphs obtained so far to train an NER model, use that model to identify and label additional silver paragraphs, and engage human reviewers to correct false positives and false negatives, creating additional gold paragraphs. We repeat this process until we have N_t initial gold paragraphs.
In the third build labeled set phase, we repeatedly use an NER model trained on all human-validated labels obtained to date, with the N_t gold paragraphs assembled in the two preceding phases used as a test set, to identify and label promising paragraphs in CORD-19 for additional human review. To maximize the utility of this human effort, we present the reviewers only with paragraphs that contain one or more uncertain words, i.e., words that the NER model identifies as drug/non-drug with a confidence in the range (min, max). We continue this process of model retraining, paragraph selection and labeling, and human review until the F-1 score improves by less than ϵ.
The behavior of this algorithm is influenced by six parameters: N_0, N, N_t, ϵ, min, and max. N_0 and N are the number of paragraphs that are assigned to human reviewers in the first and subsequent steps, respectively. N_t is the number of examples in the test set. ϵ is a threshold that determines when to stop collecting data. The min and max parameters determine the confidence range from which words are selected for human review. In the experimental studies described below, we used N_0 = 278, N = 120, N_t = 500, ϵ = 0, min = 0.45, and max = 0.55.
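The selection of paragraphs containing uncertain words can be sketched as follows. This is a minimal illustration under stated assumptions: the input format (a paragraph as a list of word and drug-probability pairs) and the function names are hypothetical, and we assume that the range (min, max) is an open interval. In the actual workflow the probabilities come from the SpaCy model.

```python
from typing import Iterable, List, Tuple

# Hypothetical format: each paragraph is a list of (word, P(word is a drug)) pairs.
ScoredParagraph = List[Tuple[str, float]]


def has_uncertain_word(paragraph: ScoredParagraph, lo: float = 0.45, hi: float = 0.55) -> bool:
    """True if at least one word has a drug probability strictly inside (lo, hi)."""
    return any(lo < prob < hi for _, prob in paragraph)


def select_for_review(paragraphs: Iterable[ScoredParagraph], n: int,
                      lo: float = 0.45, hi: float = 0.55) -> List[ScoredParagraph]:
    """Collect up to n paragraphs that contain one or more uncertain words."""
    selected: List[ScoredParagraph] = []
    for paragraph in paragraphs:
        if has_uncertain_word(paragraph, lo, hi):
            selected.append(paragraph)
            if len(selected) == n:
                break
    return selected


# Made-up scores for illustration.
scored = [
    [("ribavirin", 0.98), ("was", 0.01)],       # confident: skipped
    [("dithiothreitol", 0.50), ("for", 0.02)],  # uncertain: sent to reviewers
    [("lactoferrin", 0.45)],                    # on the boundary: skipped
]
to_review = select_for_review(scored, n=120)
print(len(to_review))
```

Narrowing or widening the (min, max) window trades the number of paragraphs shown to reviewers against how informative each one is expected to be.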
The NER model used in the model-in-the-loop annotation workflow to score words might also be viewed as a parameter. In the work reported here, we use SpaCy exclusively for that purpose, as it integrates natively with the Prodigy annotation tool and trains more rapidly. However, as we show below, the Keras-LSTM model is ultimately somewhat more accurate when trained on all of the labeled data generated, and thus is preferred when processing the entire CORD-19 dataset: see Section 3.1.1 and Section 3.2.
This semi-automated method saves time and effort for human reviewers because they are only asked to verify labels that have already been identified by our model to be uncertain, and thus worth processing. Furthermore, as we show below, we find that we do not need to engage biomedical professionals to label drugs in text: untrained people, armed with contextual information (and online search engines), can spot drug names in text with accuracy comparable to that of experts.
We provide further details on the three phases of the algorithm in the following, with numbers in the list referring to line numbers in Algorithm 1.

A) Bootstrap
1. We start with the 2020-03-20 release version of the CORD-19 corpus, which contains 44,220 papers (Wang et al., 2020). We create C, a random permutation of its paragraphs from which we will repeatedly fetch paragraphs via next(C).
2. We bootstrap the labeling process by identifying as D the 2,675 items in the DrugBank ontology with an Anatomical Therapeutic Chemical Classification System (ATC) code attached (eliminating many, but not all, drug-like molecule entities).
3. We create an initial set of silver paragraphs, P_0, by selecting N_0 paragraphs from C that include a word from D.
4. We engage human reviewers to remove false positives and label false negatives in P_0, yielding an initial set of gold paragraphs, B.

B) Build test set
5. We expand the test set that we will use to evaluate the model created in the next phase, until we have N_t validated examples.
6. We train the NER model on 60% of the data collected to date and evaluate it on the remaining 40%, to create a new trained model, M, with improved knowledge of the types of entities that we seek.
7. We use the probabilities over entities returned by the model to select, as our N new silver paragraphs, P, paragraphs that contain at least one uncertain word (see above).
8. We engage human reviewers to convert these new silver paragraphs, P, to gold, V.
9. We add the new gold paragraphs, V, to the bootstrap set B.

C) Build labeled set
13. We assemble a training set G, using the test set T assembled in the previous phases for testing. This process continues until the F-1 score stops improving; see Section 2.2.
14-17. Same as Steps 6-9, except that we train on G and test on T. Human reviewers are engaged to review new silver paragraphs and produce new gold paragraphs, which are then added to G instead of T.

FIGURE 1 | Overview of the training data collection workflow, showing the three phases described in the text and with the parameter values used in this study. Each phase pulls paragraphs from the CORD-19 dataset (blue dashed line) according to the Select criteria listed (yellow shaded box). Phases B and C repeatedly update the weights for the NER model (green arrows) that they use to identify and label uncertain paragraphs; human review (yellow and gold arrows) corrects those silver paragraphs to yield gold paragraphs. Total human review work is ∼278 + 600 + 960 = 1,838 paragraphs.

Frontiers in Molecular Biosciences | www.frontiersin.org August 2021 | Volume 8 | Article 636077
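The control flow of the build labeled set phase can be sketched as below. This is not Algorithm 1 itself: the function `build_labeled_set` and its callable arguments (`train`, `evaluate`, `select_uncertain`, `review`) are hypothetical placeholders for model training, F-1 evaluation, uncertain-paragraph selection, and human review, and we assume that the improvement is measured against the previous round.

```python
from typing import Callable, Iterator, List


def build_labeled_set(
    corpus: Iterator[str],
    gold: List[str],
    test_set: List[str],
    train: Callable[[List[str]], object],
    evaluate: Callable[[object, List[str]], float],
    select_uncertain: Callable[[object, Iterator[str], int], List[str]],
    review: Callable[[List[str]], List[str]],
    n: int = 120,
    epsilon: float = 0.0,
) -> List[str]:
    """Grow the gold training set until the F-1 score improves by less than epsilon."""
    previous_f1 = None
    while True:
        model = train(gold)                      # retrain on all gold data so far
        f1 = evaluate(model, test_set)           # score on the fixed test set T
        if previous_f1 is not None and f1 - previous_f1 < epsilon:
            break                                # improvement below threshold
        previous_f1 = f1
        silver = select_uncertain(model, corpus, n)
        if not silver:                           # nothing left to review
            break
        gold = gold + review(silver)             # human review: silver -> gold
    return gold


# Toy stand-ins so that the sketch runs end to end (all values are made up).
f1_scores = iter([0.60, 0.70, 0.75, 0.74])
toy_corpus = iter([f"paragraph {i}" for i in range(1000)])
labeled = build_labeled_set(
    toy_corpus,
    gold=[],
    test_set=["held-out paragraph"],
    train=lambda g: len(g),
    evaluate=lambda model, test: next(f1_scores),
    select_uncertain=lambda model, corpus, n: [next(corpus) for _ in range(n)],
    review=lambda silver: silver,
    n=2,
)
print(len(labeled))
```

With ϵ = 0, as in this study, the loop in this sketch stops at the first round in which the F-1 score declines.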

Data-Performance Tradeoffs in Named-Entity Recognition Models
As noted in Section 2.1.2, our model-in-the-loop annotation workflow requires repeated retraining of a SpaCy model. Thus we conducted experiments to understand how SpaCy prediction performance is influenced by model size, quantity of training data, and amount of training performed.
As the training data produced by the model-in-the-loop evaluation workflow are to be used to train an NER model that we will apply to the entire CORD-19 dataset, we also evaluate the Keras-LSTM model from the perspectives of big data accuracy and training time.

Model Size
We first need to decide which SpaCy model to use for model-in-the-loop annotation. Model size is a primary factor that affects training time and prediction performance. In general, larger models tend to perform better, but require both more data and more time to train effectively. As our model-in-the-loop annotation strategy requires frequent model retraining, and furthermore will (initially at least) have little data, we hypothesize that a smaller model may be adequate for our purposes.
To explore this hypothesis, we study the performance achieved by the SpaCy medium and large models (Honnibal and Montani, 2020b) on our initial training set of 278 labeled paragraphs. We show in Figure 2 the performance achieved by the two models as a function of number of training epochs. Focusing on the harmonic mean of precision and recall, the F-1 score (a good measure of a model's ability to find true positives while avoiding both false positives and false negatives), we see that the two models achieve similar prediction performance, with the largest difference in F-1 score being around 2%. As the large model takes over eight times longer to train per epoch, we select the medium model for model-in-the-loop data collection.

Word Embedding Models
The Keras LSTM model requires external word vectors since, unlike SpaCy, it does not include a word embedding model. To explore the effect of different word embedding models, we trained both BERT (Devlin et al., 2018), a top-performing language model developed by Google, and FastText (Bojanowski et al., 2016), a model shown to have outperformed traditional Word2Vec models such as CBOW and Skipgram in our previous work (Hong et al., 2020b). While Google has released pre-trained BERT models, and researchers often build upon these models by "fine-tuning" them with additional training on small external datasets, this approach is not suitable for our problem, as the vocabulary used in CORD-19 is very different from that of the datasets used to train these models. Rather than use a pre-trained BERT model, we trained a BERT model on the CORD-19 corpus using a distributed neural network training framework from our previous work (Pauloski et al., 2020). As the CORD-19 corpus is approximately 20% of the size of the training data used by Google to train BERT, we reduced the word embedding size proportionally from 768 to 128 to avoid over-fitting. For FastText word embeddings we set the size to the default 120. We used word embeddings derived from both models to train the Keras LSTM model on the same training and testing data collected in Section 2.1.2. The model using FastText embeddings achieved a slightly higher F-1 score (80.5%) than the model trained with BERT embeddings (78.7%). This result is likely due to the limited training data and embedding size. In short, the very large BERT model requires a correspondingly large amount of data to achieve its best performance, and without such data it will not necessarily outperform much smaller and less computationally intensive word embedding models. In the remainder of the paper, we use the FastText word embedding model.

Amount of Training Data
As data labeling is expensive in both human time and model training time, it is valuable to explore the tradeoff between time spent collecting data and prediction performance. To this end, we manually labeled a set of 500 paragraphs selected at random from CORD-19 (Wang et al., 2020) as a test set. Then, we used that test set to evaluate the results of training the SpaCy and Keras-LSTM models of Section 2.1.1 on increasing numbers of the paragraphs produced by our human-in-the-loop annotation process. Figure 3 shows their F-1 score curves as we scale from 0 to 1,000 training samples. With only 100 training examples, SpaCy and Keras-LSTM achieve F-1 scores of 57 and 66%, respectively. SpaCy performs better than Keras-LSTM with fewer training examples (i.e., fewer than 300), after which Keras-LSTM overtakes it and maintains a steady 2-3% advantage as the number of examples increases. This result motivates our choice of Keras-LSTM for the CORD-19 studies in Section 3.2.
We stopped collecting training data after 1,000 examples. We see in Figure 3 that the performance of the SpaCy and Keras-LSTM models is essentially the same with 1,000 training examples as with 700 examples, with the F-1 score even declining when the number of available examples increases to 800 or 900. At 1,000 examples the F-1 score is greatest for both models. We conclude that the 1,000 training examples, along with the other 500 withheld as the test set, are best suited to train our models. There are 4,244 and 1,861 entities in the training and test sets, respectively.

Training Epochs
Prediction performance is also influenced by the number of epochs spent in training. The cost of training is particularly important in a model-in-the-loop setup, as human reviewers cannot work while a model is offline for training. Figure 4 shows the progression of the loss, precision, recall, and F-1 values of the SpaCy model during 100 epochs of training with the initial 278 examples. We can see that the best F-1 score is achieved within 10-20 epochs. Increasing the number of epochs does not result in any further improvement. Admittedly, the F-1 score does not tell us everything about the model's performance. Sometimes training for more epochs leads to lower loss values while other metrics (such as precision, recall, or F-1) no longer improve. That would still be desirable because it means the model is now more "confident," in a sense, about its predictions. However, that is not the case here. As shown in Figure 4, after around 40 epochs the loss begins to oscillate instead of continuing downwards, suggesting that in this case training for 100 epochs does not result in a better model than training for only 20 epochs.
Figure 5 shows the progression of accuracy and loss value for the Keras-LSTM model with the initial 278 examples. In Figure 5A, we see that validation accuracy improves as training accuracy increases during the first 50 epochs. After around epoch 50, the training and validation accuracy curves diverge: the training accuracy continues to increase but the validation accuracy plateaus. This trend is suggestive of overfitting, which is corroborated by Figure 5B. After about 50 epochs, the validation loss curve turns upwards. Hence we choose to limit the training epochs to 64. After each epoch, if a lower validation loss is achieved, the current model state is saved. After 64 epochs, we test the model with the lowest validation loss on the withheld test set.
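The checkpointing rule described for the Keras-LSTM model (save the model state whenever a lower validation loss is reached, then test the saved state) can be illustrated with the following minimal sketch. The function name `best_checkpoint` is hypothetical and the loss values are made up. Keras offers a `ModelCheckpoint` callback with a `save_best_only` option for this purpose; the text does not state whether that callback was used.

```python
from typing import List, Tuple


def best_checkpoint(val_losses: List[float]) -> Tuple[int, float]:
    """Return (epoch, loss) of the state that would be kept; epochs are 1-indexed."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:            # strictly lower loss: overwrite the saved state
            best_epoch, best_loss = epoch, loss
    return best_epoch, best_loss


# Made-up validation losses that fall and then turn upwards, as in overfitting.
val_losses = [0.90, 0.62, 0.48, 0.41, 0.43, 0.47]
print(best_checkpoint(val_losses))      # (4, 0.41)
```

Because only the lowest-loss state is kept, training past the point where validation loss turns upwards costs time but does not degrade the model that is finally tested.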

RESULTS
We present the results of experiments in which we first evaluate the performance of our models from various perspectives and then apply the models to the CORD-19 dataset.

Evaluating Model and Human Performance
We conducted experiments to compare the performance of the SpaCy and Keras-LSTM NER models; compare the performance of the models against humans; determine how training data influences model performance; and analyze human and model errors.

Performance of SpaCy and Keras Named-Entity Recognition Models
We used the collected data of Section 2.1.2 to train both the SpaCy and Keras-LSTM NER models of Section 2.1.1 to recognize and extract drug-like molecules in text. The SpaCy model used is the medium English language model en_core_web_md (Honnibal and Montani, 2020b). For the Keras model, the input embedding size is 120, the LSTM and fully connected hidden layers have a size of 32, and the dropout rate is 0.1. The model is trained for 64 epochs with a batch size of 64. We find that the trained SpaCy model achieved a best F-1 score of 77.3%, while the trained Keras-LSTM model achieved a best F-1 score of 80.5%, somewhat outperforming SpaCy.
As shown in Figure 3, the SpaCy model performs better than the Keras-LSTM model when trained with small amounts of training data, perhaps because of the different mechanisms employed by the two methods to generate numerical representations for words. SpaCy's built-in language model, pre-trained on a general corpus of blog posts, news, comments, etc., gives it some knowledge about commonly used words in English, which are likely also to appear in a scientific corpus. On the other hand, the Keras-LSTM model uses custom word embeddings trained solely on an input corpus, which provides it with better understanding of multi-sense words, especially those that have quite different meanings in a scientific corpus. However, without enough raw data to draw contextual information from, custom word embeddings cannot accurately reflect the meaning of words.

Comparison Against Human Performance
Recognizing drug-like molecules is a difficult task even for humans, especially non-medical professionals (such as our non-expert annotators). To assess the accuracy of the annotators, we asked three people to examine 96 paragraphs, with their associated labels, selected at random from the labeled examples. Two of these reviewers had been involved in creating the labeled dataset; the third had not. For each paragraph, each reviewer decided independently whether each drug molecule entity was labeled correctly (a true positive), was labeled as a drug when it was not (a false positive), or was missed (a false negative). This process revealed a total of 257 drug molecule entities in the 96 paragraphs, of which the annotators labeled 201 correctly (true positives), labeled 49 incorrectly (false positives), and missed 34 (false negatives). The numbers of true positives and false negatives do not sum to the total number of drug molecules because in some cases an annotator labeled not just a drug entity but the entity plus an extra preceding or succeeding word or punctuation mark (e.g., "sofosbuvir," instead of "sofosbuvir"), and we count such occurrences as false positives rather than false negatives. In this evaluation, the non-expert annotators achieved an F-1 score of 82.9%, which is comparable to the 80.5% achieved by our automated models, as shown in Figure 3. In other words, our models have performance on par with that of non-expert humans.
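The annotators' F-1 score follows directly from the counts reported above (201 true positives, 49 false positives, 34 false negatives). The short calculation below reproduces it; the function name `precision_recall_f1` is ours and purely illustrative.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and their harmonic mean (the F-1 score)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = precision_recall_f1(tp=201, fp=49, fn=34)
print(f"precision={precision:.3f} recall={recall:.3f} F-1={f1:.3f}")
# precision=0.804 recall=0.855 F-1=0.829
```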

Effects of Training Data Quality on Model Performance
We described in the previous section how review of 96 paragraphs labeled by the non-expert annotators revealed an error rate of about 20%. This raises the question of whether model performance could be improved with better training data. To examine this question, we compare the performance of our models when trained on original vs. corrected data. As we only have 96 corrected paragraphs, we restrict our training sets to those 96 paragraphs in each case.
We sorted the 96 paragraphs in both datasets so that they are considered in the same order. Then, we split each dataset into five subsets for K-fold cross validation (K = 5), with the first four subsets having 19 paragraphs each and the last subset having 20. Since K is set to five, the SpaCy and Keras models are trained five times. In the ith round, each model is trained on four subsets (excluding the ith) of each dataset. The ith subset of the corrected dataset is used as the test set. The ith subset of the original dataset is not used in the ith round.
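The split described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function names are hypothetical, the paragraphs are replaced by placeholder strings, and we assume that the folds are contiguous in the shared sort order.

```python
from typing import Iterator, List, Sequence, Tuple


def contiguous_folds(n_items: int, k: int) -> List[List[int]]:
    """Split range(n_items) into k contiguous folds; the last fold takes the remainder."""
    size = n_items // k
    folds = [list(range(i * size, (i + 1) * size)) for i in range(k - 1)]
    folds.append(list(range((k - 1) * size, n_items)))
    return folds


def cross_validation_rounds(
    original: Sequence[str], corrected: Sequence[str], k: int = 5
) -> Iterator[Tuple[List[str], List[str], List[str]]]:
    """Yield (train on original, train on corrected, test on corrected) for each round."""
    folds = contiguous_folds(len(corrected), k)
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield (
            [original[j] for j in train_idx],    # ith original subset is unused
            [corrected[j] for j in train_idx],
            [corrected[j] for j in test_idx],    # both models are tested on corrected labels
        )


original = [f"original-{i}" for i in range(96)]
corrected = [f"corrected-{i}" for i in range(96)]
fold_sizes = [len(fold) for fold in contiguous_folds(96, 5)]
print(fold_sizes)  # [19, 19, 19, 19, 20]
```

Testing both variants on the corrected subset keeps the comparison fair: the only difference between the two runs in a round is the quality of the training labels.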
We present the K-fold cross validation results in Tables 1 and 2. The models performed reasonably well when trained on the original dataset, with an average F-1 score only 2% less than that achieved with the corrected labels. Given that the expert input required for validation is hard to come by, we believe that using non-expert reviewers is an acceptable tradeoff and probably the only practical way to gather large amounts of training data.

Applying the Trained Models
After training the models with the labeled examples, we applied the trained models to the entire CORD-19 corpus (2020-10-04 version with 198,875 articles) to identify potential drug-like molecules. Processing a single article takes only a few seconds; we adapted our models to use data parallelism to enable rapid processing of these many articles.
We ran the SpaCy model on two Intel Skylake 6148 processors with a total of 40 CPU cores; this run took around 80 core-hours and extracted 38,472 entities. We ran the Keras model on four NVidia Tesla V100 GPUs; this run took around 40 GPU-hours and extracted 121,680 entities. We recorded for each entity the number of times that it was recognized by each model, and used those numbers as a voting mechanism to further determine which entities are the most likely to be actual drugs. In our experiments, "balanced" entities (i.e., those whose numbers of detections by the two models are within a factor of 10 of each other) are most likely to appear in the DrugBank list. As shown in Figure 6, we sorted all extracted entities in descending order by their total number of detections by both models and compared the top 100 entities to DrugBank. We found that only 77% were exact matches to drug names or aliases, or 86% if we included partial matches (i.e., the extracted entity is a word within a multi-word drug name or alias in DrugBank). In comparison, among the top 100 "balanced" entities, 88% were exact matches to DrugBank, or 91% with partial matches.
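The voting mechanism can be sketched as follows. The function name `balanced_entities` and all counts are made up for illustration, and we assume here that an entity detected by only one of the two models is not "balanced"; the text does not say how such entities were handled.

```python
from collections import Counter
from typing import List, Mapping


def balanced_entities(spacy_counts: Mapping[str, int],
                      keras_counts: Mapping[str, int],
                      factor: int = 10) -> List[str]:
    """Entities found by both models whose detection counts are within `factor`
    of each other, sorted by total number of detections (descending)."""
    totals = {}
    for entity in spacy_counts.keys() & keras_counts.keys():
        a, b = spacy_counts[entity], keras_counts[entity]
        if max(a, b) <= factor * min(a, b):
            totals[entity] = a + b
    return sorted(totals, key=lambda e: (-totals[e], e))


# Made-up detection counts.
spacy_counts = Counter({"ribavirin": 120, "silver": 300, "arbidol": 15})
keras_counts = Counter({"ribavirin": 200, "silver": 4, "arbidol": 40, "lactoferrin": 9})
ranked = balanced_entities(spacy_counts, keras_counts)
print(ranked)  # ['ribavirin', 'arbidol']
```

In this toy example "silver" is excluded because one model reports it 75 times more often than the other, which mirrors the kind of "imbalanced" entity discussed in the text.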
Although DrugBank provides a reference metric to evaluate the results, it is not an exhaustive list of known drugs. For instance, remdesivir, a drug that has been proposed as a potential cure for COVID-19, is not in DrugBank. We manually checked via Google searches the top 50 "balanced" and top 50 "imbalanced" entities not matched to DrugBank, and found that 70% in the "balanced" list are actual drugs, but only 26% in the "imbalanced" list. Looking at the false positives in these top 50 lists, the "balanced" false positives are often understandable. For example, in the sentence "ELISA plate was coated with . . . and then treated for 1 h at 37.8°C with dithiothreitol . . . ", the model mistook the redox reagent dithiothreitol for a drug entity, probably due to its context "treated with." On the other hand, we found no such plausible explanations for the false positives in the "imbalanced" list, where most false positives are chemical elements (e.g., silver, sodium), amino acids (e.g., cysteine, glutamine), or proteins (e.g., lactoferrin, cystatin).
Finally, we compared our extraction results to the drugs being used in clinical trials, as listed on the United States National Library of Medicine website (National Institutes of Health, 2020). We queried the website with "covid" as the keyword and manually screened the returned drugs in the "Interventions" column to remove stopwords (e.g., tablet, injection, capsule) and dosage information (e.g., 2.5 mg, 2.5%), keeping only the drug names. We then compared the 50 most frequently appearing drugs to the drugs automatically extracted from the literature. The "balanced" entities extracted by both models matched 64% of the top 50 drugs in clinical trials, whereas the "imbalanced" entities matched only 6% of the same list.
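The clinical-trial comparison was performed by manual screening. An automated approximation might look like the sketch below, which is offered only as an illustration: the regular expression, the stopword set (the three examples from the text plus their plurals), the function names, and the sample intervention strings are all assumptions rather than part of the original workflow.

```python
import re
from collections import Counter
from typing import Iterable, List

# Dosage-form words to drop; the plurals are an added assumption.
FORM_WORDS = {"tablet", "tablets", "injection", "injections", "capsule", "capsules"}

# Matches dosage strings such as "2.5 mg" or "2.5%" (assumed unit list).
DOSAGE = re.compile(r"\b\d+(?:\.\d+)?\s*(?:%|(?:mg|ml|mcg|g)\b)", re.IGNORECASE)


def clean_intervention(raw: str) -> str:
    """Lowercase an intervention string and strip dosage and form words."""
    text = DOSAGE.sub(" ", raw.lower())
    words = [word for word in text.split() if word not in FORM_WORDS]
    return " ".join(words)


def top_drugs(interventions: Iterable[str], n: int = 50) -> List[str]:
    """Return the n most frequent cleaned drug names."""
    counts = Counter(clean_intervention(raw) for raw in interventions)
    counts.pop("", None)  # drop entries that were entirely dosage/form words
    return [name for name, _ in counts.most_common(n)]


def coverage(reference: List[str], extracted: Iterable[str]) -> float:
    """Fraction of reference drug names found among the extracted entities."""
    extracted_lower = {entity.lower() for entity in extracted}
    if not reference:
        return 0.0
    hits = sum(1 for name in reference if name in extracted_lower)
    return hits / len(reference)


# Hypothetical "Interventions" entries, for illustration only.
interventions = [
    "Hydroxychloroquine 200 mg Tablet",
    "hydroxychloroquine",
    "Remdesivir Injection",
    "Ivermectin 2.5% capsule",
]
reference = top_drugs(interventions, n=3)
matched = coverage(reference, {"Remdesivir", "hydroxychloroquine"})
```

A real "Interventions" column contains combination therapies and free-text descriptions that this sketch does not handle, which is one reason a manual pass remains useful.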
The results discussed here are available in the repository described in Section 5.

Validating the Utility of Identified Molecules
We are interested in evaluating the relevance to COVID research of the molecules that we extracted from CORD-19. To that end, we compared our extracted molecule list against ZINC and DrugBank sets that a group of scientists at Argonne National Laboratory had used for computational screening. We found that our list contained an additional 3,591 molecules not found in their screening sets (filtered by their canonical SMILES strings). Applying their methods to screen those 3,591 molecules for docking against a main coronavirus protease, 3CLPro, revealed that 18 had docking scores in the top 0.1% of the 6.6 M ZINC molecules that they had screened previously (Saadi et al., 2020), significantly more than the four that we would expect by chance.
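The figure of four expected by chance follows from simple arithmetic: 0.1% of 3,591 molecules is about 3.6. The sketch below reproduces that number and, as an added illustration that is not part of the original analysis, estimates how unlikely 18 or more hits would be under a binomial model. That model assumes each molecule independently has a 0.1% chance of landing in the top 0.1%, which is a simplification introduced here.

```python
from math import comb


def expected_hits(n_molecules: int, top_fraction: float) -> float:
    """Expected number of molecules in the top fraction by chance alone."""
    return n_molecules * top_fraction


def binomial_tail(n: int, p: float, k: int) -> float:
    """P(X >= k) for X ~ Binomial(n, p), computed as 1 - P(X <= k - 1)."""
    cdf = sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k))
    return 1.0 - cdf


n_new = 3591          # molecules added by the NER pipeline
top_fraction = 0.001  # top 0.1% of docking scores
observed = 18         # molecules that actually ranked in the top 0.1%

expected = expected_hits(n_new, top_fraction)       # about 3.6, i.e. roughly four
tail_probability = binomial_tail(n_new, top_fraction, observed)
enrichment = observed / expected                    # about a five-fold enrichment
```

Under these assumptions the tail probability is far below conventional significance thresholds, which is consistent with the statement in the text.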
As reported by Babuji et al. (2020a), those researchers have used the outputs of our models in a computational screening pipeline that leverages HPC resources at scale, coupled with multiple artificial intelligence and simulation-based approaches (including leads from NLP), to identify high-quality therapeutic compounds to screen experimentally. A further manuscript, detailing the end-to-end process from data collection to simulation, incorporation of these results, cellular assays, and identification of high-performing therapeutic compounds, is in preparation.

DISCUSSION: ANALYSIS OF HUMAN AND MODEL ERRORS
Finally, we explore the contexts in which human reviewers and models make mistakes. Specifically, we study the tokens that appear most frequently near incorrectly labeled entities. To investigate the effects of immediate and long-distance context, we control, as window size, the maximum distance between a token and an entity for that token to be considered as "context" for that entity.
One difficulty with this analysis is that the most frequent tokens identified in this way were mostly stop words or punctuation marks. For instance, when the window size is set to three, the 10 most frequent tokens around mislabeled words are, in descending order, "comma (,)," "and," "mg," "period (.)," "right parenthesis ())," "with," "of," "left parenthesis (()," "is," and "or." Only "mg" is neither a stop word nor punctuation mark.
Those tokens provide little insight as to why human reviewers might have made mistakes, and furthermore are unlikely to have influenced reviewer decisions. Thus we exclude stopwords and punctuation marks when providing, in Table 3, lists of the 10 most frequent tokens within varying window sizes of words that were incorrectly identified as molecules by human reviewers.
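The window-based analysis can be sketched as follows. This is an illustrative reconstruction under stated assumptions: entity spans are given as half-open token index ranges, the stopword set is a small hypothetical stand-in for whatever list was actually used, and a token is treated as punctuation if it contains no alphanumeric character.

```python
from collections import Counter
from typing import List, Sequence, Tuple

# Small hypothetical stopword list; the list actually used is not specified.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "or", "the", "with"}

Span = Tuple[int, int]  # half-open token index range [start, end)


def context_tokens(tokens: Sequence[str], span: Span, window: int) -> List[str]:
    """Tokens at most `window` positions to the left or right of the span."""
    start, end = span
    left = list(tokens[max(0, start - window):start])
    right = list(tokens[end:end + window])
    return left + right


def frequent_context(
    samples: Sequence[Tuple[Sequence[str], Sequence[Span]]],
    window: int,
    top_n: int = 10,
    drop_stopwords: bool = True,
) -> List[Tuple[str, int]]:
    """Most frequent context tokens around the given (mislabeled) spans."""
    counts: Counter = Counter()
    for tokens, spans in samples:
        for span in spans:
            for token in context_tokens(tokens, span, window):
                token = token.lower()
                is_punct = not any(ch.isalnum() for ch in token)
                if drop_stopwords and (token in STOPWORDS or is_punct):
                    continue
                counts[token] += 1
    return counts.most_common(top_n)


# Hypothetical sentence in which "placebo" (token 6) was mislabeled as a drug.
tokens = ["Patients", "received", "an", "oral", "dose", "of",
          "placebo", ",", "500", "mg", "daily", "."]
samples = [(tokens, [(6, 7)])]

window_3 = frequent_context(samples, window=3)
window_1 = frequent_context(samples, window=1)
```

With a window of three, the surviving context tokens in this toy sentence are "oral," "dose," "500," and "mg"; with a window of one, only a stopword and a comma are adjacent, so nothing survives the filter.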
FIGURE 6 | Percentage of detected entities that are also found in DrugBank, when running either on all words found by our model or on just the balanced subset, and with "found" defined as either a full or partial match.
We see that there are indeed several deceptive contextual words. With a window size of one, the 10 most frequent tokens include "oral," "dose," and "intravenous." It is understandable that an untrained reviewer might label as drugs words that immediately precede or follow such context words. Similar patterns can be seen for window sizes of three and five. Without background knowledge to draw from, non-experts are more likely to rely on their experience gained from labeling previous paragraphs. One may hypothesize that after the reviewers have seen a few dozen to a few hundred paragraphs, those deceptive contextual words must have left a deep impression, so that when those words re-appear they are likely to label the strange unknown word close to them as a drug.
To investigate this hypothesis, we also explored the most frequent words around drug entities that are correctly labeled by human reviewers: see Table 4. Interestingly, we found overlaps between the lists in Tables 3, 4: in all, three, four, and two overlaps for window sizes of one, three, and five, respectively, when treating all numerical values as identical. This finding supports our hypothesis that those frequent words around real drug entities may confuse human reviewers when they appear around non-drug entities.
We repeat this comparison of context words around human and model errors while considering stopwords and punctuation marks. Tables 5, 6 show the 20 most frequent tokens in each case. We see that 20-25% of the tokens in Table 5, but only 5-10% of those in Table 6, are not stop words or punctuation marks. As the model only learns its word embeddings from the input text, if a token often co-occurs with drug entities in the training corpus the model will treat it as an indication of drug entities near its presence, regardless of whether or not it is a stopword. This apparently leads the model to make incorrect inferences. Humans, on the other hand, are unlikely to think that a stopword such as "the" is indicative of drug entities, no matter how frequently they appear together.

DATA AVAILABILITY AND FORMATS
We have made our annotated training data, trained models, and the results of applying the models to the CORD-19 corpus publicly available online (Babuji et al., 2020b).
In order to facilitate training of various models, we published the training data in two formats: an unsegmented version in line-delimited JSON (JSONL) format, and a segmented version in Comma Separated Value (CSV) format. The JSONL format contains the most comprehensive information that we have collected on the paragraphs in the dataset. We chose JSONL rather than a JSON list because it allows objects to be retrieved without parsing the entire file. Each JSON object in the JSONL file has the following structure:
• text: The original paragraph stored as a string without any modification.
• tokens: The list of tokens from text after tokenization. Each token has the following fields:
    • text: The text of the token as a string.
    • start: The index of the first character of the token in text.
    • end: The index of the first character after the token in text.
    • id: Zero-based numbering of the token.
• spans: The list of spans (sequences of tokens) that are labeled as named entities (drugs). Each span has the following fields:
    • start: The index of the first character of the span in text.
    • end: The index of the first character after the span in text.
    • token_start: The index of the first token of the span in text.
    • token_end: The index of the last token of the span in text.
    • label: The label of the span ("drug").
Another commonly adopted labeling scheme for NER datasets is the "IOB" labeling scheme, in which the original text is first tokenized and each token is assigned a label "I," "O," or "B." The label "B(eginning)" means that the corresponding token is the first in a named entity. The label "I(nside)" is given to every token in a named entity except the first. All other tokens get the label "O(utside)," which means that they are not part of any named entity. The aforementioned JSONL data are converted according to the IOB scheme and stored in CSV files with one training example per line. Each line consists of two columns: the first contains the tokens that make up the original text, and the second contains the corresponding IOB labels for those tokens. In addition to using a different labeling scheme, the samples in the CSV files are segmented, meaning that each sentence, rather than an entire paragraph, is treated as a training sample. This structure aligns with that used in standard NER training sets such as CoNLL03 (Sang and De Meulder, 2003).
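To make the relationship between the two formats concrete, the following sketch parses one JSONL record and derives IOB labels from its spans. The field names follow the structure described above, with token_end taken as inclusive, as stated. The example record and the function names are hypothetical, and the sketch neither performs the sentence segmentation used for the CSV files nor reproduces their exact column serialization, which is not specified here.

```python
import json
from typing import Dict, Iterable, Iterator, List


def read_jsonl(lines: Iterable[str]) -> Iterator[Dict]:
    """Parse line-delimited JSON one object at a time, skipping blank lines."""
    for line in lines:
        if line.strip():
            yield json.loads(line)


def iob_labels(record: Dict) -> List[str]:
    """Assign 'B', 'I', or 'O' to every token of a record.

    Assumes token_start and token_end are inclusive token indices.
    """
    labels = ["O"] * len(record["tokens"])
    for span in record["spans"]:
        first, last = span["token_start"], span["token_end"]
        labels[first] = "B"
        for i in range(first + 1, last + 1):
            labels[i] = "I"
    return labels


# A hypothetical record following the documented structure.
example = {
    "text": "Treated with interferon beta and ribavirin.",
    "tokens": [
        {"text": "Treated", "start": 0, "end": 7, "id": 0},
        {"text": "with", "start": 8, "end": 12, "id": 1},
        {"text": "interferon", "start": 13, "end": 23, "id": 2},
        {"text": "beta", "start": 24, "end": 28, "id": 3},
        {"text": "and", "start": 29, "end": 32, "id": 4},
        {"text": "ribavirin", "start": 33, "end": 42, "id": 5},
        {"text": ".", "start": 42, "end": 43, "id": 6},
    ],
    "spans": [
        {"start": 13, "end": 28, "token_start": 2, "token_end": 3, "label": "drug"},
        {"start": 33, "end": 42, "token_start": 5, "token_end": 5, "label": "drug"},
    ],
}

# Round-trip through a JSONL line, as if it had been read from the file.
jsonl_lines = [json.dumps(example)]
records = list(read_jsonl(jsonl_lines))
words = [token["text"] for token in records[0]["tokens"]]
labels = iob_labels(records[0])
```

Because each line is an independent JSON object, `read_jsonl` can stream a large file without loading it whole, which is the advantage of JSONL noted above.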
The trained SpaCy and Keras models and the results of applying the models to the 198,875 articles in the CORD-19 corpus are also available in this GitHub repo. Additionally, the pre-trained SpaCy model is provided as a cloud service via DLHub (Hong et al., 2020a;Li et al., 2021). (The Keras model could not be hosted there due to compatibility issues with DLHub.) This cloud service allows researchers to apply the model to any texts they provide with as few as four lines of code.

CONCLUSION AND FUTURE DIRECTIONS
We have presented a human-machine hybrid pipeline for collecting training data for named entity recognition (NER) models. We applied this pipeline to create a NER model for identifying drug-like molecules in COVID-19-related research papers. Our pipeline facilitated efficient use of valuable human resources by presenting human labelers only with samples that were most likely to confuse the NER model. We explored various trade-offs, including model size, number of training samples, and training epochs, to find the right balance between model performance and time-to-result. In total, human reviewers working with our pipeline validated labels for 278 bootstrap samples, 1,000 training samples, and 500 test samples. As this work was performed in conjunction with other tasks, we cannot accurately quantify the total effort taken to collect and annotate the above training and test samples, but it was likely around 100 person-hours.
NER models trained with these training data achieved a best F-1 score of 80.5% when evaluated on our collected test set. Our models correctly identified 64% of the top 50 drugs that are in clinical trials for COVID-19, and when applied to 198,875 articles in the CORD-19 collection, identified 10,912 molecules with potential therapeutic effects against the SARS-CoV-2 coronavirus. The code, model, and extraction results are publicly available. Our work provided an additional 3,591 SMILES strings to scientists at Argonne National Laboratory to be used in computational screening pipelines, of which 18 ranked in the top 0.1% of the molecules screened. Babuji et al. (2020a) have used the outputs of our models in a computational screening pipeline that leverages HPC resources at scale to identify high-quality therapeutic compounds to screen experimentally. A further manuscript detailing the end-to-end process of identifying high-performing therapeutic compounds is in preparation.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: https://github.com/globus-labs/covid-nlp/tree/master/drug-ner.

AUTHOR CONTRIBUTIONS
ZH and JGP conducted the computational experiments. BB and LW contributed to discussions of ideas. ZH wrote the first draft. KC and IF revised the manuscript. All authors contributed to the article and approved the submitted version.

FUNDING
This research was supported by the DOE Office of Science through the National Virtual Biotechnology Laboratory, a consortium of DOE national laboratories focused on response to COVID-19, with funding provided by the Coronavirus CARES Act. This research used resources of the Argonne Leadership Computing Facility, a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357. This work was also supported by financial assistance award 70NANB19H005 from United States Department of Commerce, National Institute of Standards and Technology as part of the Center for Hierarchical Materials Design (CHiMaD).