Neural text generation in regulatory medical writing

Background: A steep rise in new drug applications has increased the overhead of writing technical documents such as medication guides. Natural language processing can contribute to reducing this burden. Objective: To generate medication guides from texts that relate to prescription drug labeling information. Materials and Methods: We collected official drug label information from the DailyMed website. We focused on drug labels containing medication guide sections to train and test our model. To construct our training dataset, we aligned "source" text from the document with similar "target" text from the medication guide using three families of alignment techniques: global, manual, and heuristic alignment. The resulting source-target pairs were provided as input to a Pointer Generator Network, an abstractive text summarization model. Results: Global alignment produced the lowest ROUGE scores and relatively poor qualitative results, as running the model frequently resulted in mode collapse. Manual alignment also resulted in mode collapse, albeit with higher ROUGE scores than global alignment. Within the family of heuristic alignment approaches, we compared different methods and found BM25-based alignments to produce significantly better summaries (at least 6.8 ROUGE points above the other techniques). This alignment surpassed both the global and manual alignments in terms of ROUGE and qualitative scoring. Conclusion: The results of this study indicate that a heuristic approach to generating inputs for an abstractive summarization model increases ROUGE scores compared to global or manual approaches when automatically generating biomedical text. Such methods hold the potential to significantly reduce the manual labor burden in medical writing and related disciplines.


Appendix B: Heuristic Alignment of Document and Subsections
Since both splitting and heuristic calculation can be done in an unsupervised manner for any text, the heuristic alignment method is generalizable to any domain. The general pairing procedure is as follows:

    pairs = []
    for each source subsection d_i in {d_0, d_1, ..., d_n}:
        best_target_subsection = None
        best_similarity_score = 0
        for each target t_j in {t_0, t_1, ..., t_k}:
            similarity = heuristic(d_i, t_j)
            if similarity > best_similarity_score:
                best_similarity_score = similarity
                best_target_subsection = t_j
        pairs.append((d_i, best_target_subsection))
    return pairs
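A minimal Python rendering of this procedure is sketched below; the function and argument names are illustrative rather than taken from the paper, and any unsupervised similarity measure can be plugged in as the heuristic.

    from typing import Callable, List, Tuple

    def align_subsections(sources: List[str], targets: List[str],
                          heuristic: Callable[[str, str], float]) -> List[Tuple[str, str]]:
        # Pair each source subsection with its most similar target subsection.
        pairs = []
        for d in sources:
            best_target, best_score = None, 0.0
            for t in targets:
                score = heuristic(d, t)  # any unsupervised similarity measure
                if score > best_score:
                    best_score, best_target = score, t
            if best_target is not None:
                pairs.append((d, best_target))
        return pairs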

TF-IDF + L2-Normalized Euclidean: "Best hits"
In the best-hits method, target and source text are represented as TF-IDF vectors and the L2-normalized Euclidean distance is calculated between each pair. For each target text section, the closest source by this distance metric is chosen as the best hit, and this (target, source) pair is returned as a match.
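A minimal sketch of the best-hits method using scikit-learn follows; the function name best_hits is our own, and we assume a shared TF-IDF vocabulary fit over both sides.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import pairwise_distances
    from sklearn.preprocessing import normalize

    def best_hits(sources, targets):
        # Fit a shared vocabulary, then L2-normalize the TF-IDF rows.
        vectorizer = TfidfVectorizer().fit(sources + targets)
        S = normalize(vectorizer.transform(sources))
        T = normalize(vectorizer.transform(targets))
        # Distance matrix of shape (len(targets), len(sources)).
        dist = pairwise_distances(T, S, metric="euclidean")
        # For each target, the closest source is its "best hit".
        return [(targets[i], sources[int(dist[i].argmin())])
                for i in range(len(targets))]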

TF-IDF + L2-Normalized Euclidean: Average Contributions
In the average contributions method, target and source text are represented as TF-IDF vectors and the L2-normalized Euclidean distance is calculated between each pair. The scores for each pairing are then averaged across all drug labels, and a ranked list of possible source matches is generated for each target, based on lowest average distance. To generate matches for each target from each drug label, the target's highest-ranking source (from the ranked averages list) available in that drug label is chosen, and this (target, source) pair is returned as a match.

TF-IDF + L2-Normalized Euclidean: Median Contributions
In the median contributions method, target and source text are represented as TF-IDF vectors and the L2-normalized Euclidean distance is calculated between each pair. The scores for each pairing are then collected across all drug labels and the median is found for each possible (target, source) pair. Based on lowest median distance, a ranked list of possible source matches is generated for each target. To generate matches for each target from each drug label, the target's highest-ranking source (from the ranked medians list) available in that drug label is chosen, and this (target, source) pair is returned as a match.
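Since the average and median contribution methods differ only in the aggregation function, one sketch covers both. Here each drug label is assumed to be a dict mapping section names to texts; that data layout is our assumption, not the paper's.

    import numpy as np
    from collections import defaultdict
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import pairwise_distances
    from sklearn.preprocessing import normalize

    def rank_sources(labels, aggregate=np.mean):
        # Collect the distance for every (target section, source section)
        # pairing across all drug labels.
        scores = defaultdict(list)
        for label in labels:
            t_names, t_texts = zip(*label["targets"].items())
            s_names, s_texts = zip(*label["sources"].items())
            vec = TfidfVectorizer().fit(t_texts + s_texts)
            dist = pairwise_distances(normalize(vec.transform(t_texts)),
                                      normalize(vec.transform(s_texts)),
                                      metric="euclidean")
            for i, tn in enumerate(t_names):
                for j, sn in enumerate(s_names):
                    scores[(tn, sn)].append(dist[i, j])
        # Rank candidate sources per target by lowest aggregated distance;
        # pass aggregate=np.median for the median contributions variant.
        ranking = defaultdict(list)
        for (tn, sn), dists in scores.items():
            ranking[tn].append((aggregate(dists), sn))
        return {tn: [sn for _, sn in sorted(cands)]
                for tn, cands in ranking.items()}

    def match_label(label, ranking):
        # Within one drug label, pair each target with its highest-ranked
        # source section that actually appears in that label.
        available = set(label["sources"])
        return {tn: next((sn for sn in ranking[tn] if sn in available), None)
                for tn in label["targets"]}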

LSH Forest / Jaccard Distance
In an attempt to reduce the time taken to find source-target pairs across all documents, we explored MinHash Locality-Sensitive Hashing (LSH). In this approach, each source text is considered a document, and each document is encoded with the MinHash algorithm. Using the datasketch package's MinHashLSHForest (Zhu, 2021), we added all of the encoded documents to a bucket or index. At query time, the query itself (which is a target text) is hashed to the same bucket, and we request the top two candidate source documents for the current target document. Under this algorithm, the best match is the candidate whose encoded hash is closest to the query hash. While this allows for quicker computation, we still need to ensure that the top match is indeed a suitable choice.
To do so, the query results were preprocessed and Jaccard distances were then calculated between each query result and the target text. The lower the Jaccard distance, the higher the similarity; therefore, the query result with the lower distance was chosen as the top match for the target text. This process was repeated for all target texts in the dataset.
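A sketch of this pipeline with the datasketch package is shown below. The permutation count and whitespace tokenizer are our assumptions, and where the paper computes Jaccard distance on preprocessed query results, we approximate it here with the MinHash Jaccard estimate.

    from datasketch import MinHash, MinHashLSHForest

    NUM_PERM = 128  # number of hash permutations (an assumed setting)

    def minhash(text):
        m = MinHash(num_perm=NUM_PERM)
        for token in text.lower().split():  # simple whitespace tokenization
            m.update(token.encode("utf8"))
        return m

    def lsh_match(sources, targets, top_k=2):
        # Index MinHashed sources in an LSH Forest, then re-rank the
        # top-k candidates for each target by Jaccard distance.
        hashes = {i: minhash(s) for i, s in enumerate(sources)}
        forest = MinHashLSHForest(num_perm=NUM_PERM)
        for i, m in hashes.items():
            forest.add(i, m)
        forest.index()  # must be called before querying

        pairs = []
        for t in targets:
            q = minhash(t)
            candidates = forest.query(q, top_k)  # approximate top-k source keys
            if not candidates:
                continue
            # Jaccard distance = 1 - Jaccard similarity; keep the closest.
            best = min(candidates, key=lambda i: 1.0 - q.jaccard(hashes[i]))
            pairs.append((t, sources[best]))
        return pairs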

BERT-Cosine
Another approach to speeding up accurate source-target pairing is to use sentence-transformer networks based on pretrained BERT models. The sentences or documents are first preprocessed and then encoded by the sentence-transformer model. The encoded source texts are compared with the encoded target texts using cosine similarity; the higher the cosine similarity, the better the match. Hence, the source text with the highest cosine score becomes the top match for the target text.
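A minimal sketch with the sentence-transformers library follows; the checkpoint name is an assumption, since the paper does not state which pretrained model was used.

    from sentence_transformers import SentenceTransformer, util

    # Checkpoint name is an assumption, not taken from the paper.
    model = SentenceTransformer("all-MiniLM-L6-v2")

    def bert_cosine_match(sources, targets):
        # Encode both sides, then pair each target with the source
        # whose embedding has the highest cosine similarity.
        src_emb = model.encode(sources, convert_to_tensor=True)
        tgt_emb = model.encode(targets, convert_to_tensor=True)
        sims = util.cos_sim(tgt_emb, src_emb)      # shape (|targets|, |sources|)
        best = sims.argmax(dim=1).tolist()         # best source index per target
        return [(t, sources[best[i]]) for i, t in enumerate(targets)]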