Validating GAN-BioBERT: A Methodology for Assessing Reporting Trends in Clinical Trials

Background The aim of this study was to validate a three-class sentiment classification model for clinical trial abstracts combining adversarial learning and the BioBERT language processing model as a tool to assess trends in biomedical literature in a clearly reproducible manner. We then assessed the model's performance for this application and compared it to previous models used for this task. Methods Using 108 expert-annotated clinical trial abstracts and 2,000 unlabeled abstracts this study develops a three-class sentiment classification algorithm for clinical trial abstracts. The model uses a semi-supervised model based on the Bidirectional Encoder Representation from Transformers (BERT) model, a much more advanced and accurate method compared to previously used models based upon traditional machine learning methods. The prediction performance was compared to those previous studies. Results The algorithm was found to have a classification accuracy of 91.3%, with a macro F1-Score of 0.92, significantly outperforming previous studies used to classify sentiment in clinical trial literature, while also making the sentiment classification finer grained with greater reproducibility. Conclusion We demonstrate an easily applied sentiment classification model for clinical trial abstracts that significantly outperforms previous models with greater reproducibility and applicability to large-scale study of reporting trends.


Introduction
Publication bias is a widespread phenomenon of systematic under or overreporting of research findings dependent on the direction of the results found [1].As a result of this phenomenon, systematic reviews of clinical guidelines may reach incorrect conclusions [2], which creates the potential for harm to patients with treatments that have an otherwise poor evidence base.Despite this potential for harm and its widespread presence within clinical literature [1], there have been limited efforts to develop and utilize methods to characterize publication bias on a large scale.In 2016 Hedin et.al. found that only fifty-five percent of meta-analysis in anesthesiology journals discussed publication bias, and only forty-three percent actually used tools to assess the arXiv:2106.00665v1[cs.CL] 1 Jun 2021 phenomenon [3].Furthermore, currently used methods for assessing publication bias such as funnel-plot based methods and selection models [4,5], are criticized as unintuitive to interpret within the literature's context [4].These methods are also focused on the quantitative findings expressed in the studies in question in the form of effect sizes and p-values, and are therefore limited to those studies that express these types of findings.
Historically, systematic assessment of the qualitative interpretation of the findings has been limited to rating systems performed by human raters [6,7,8].This has changed with the development of sentiment analysis and natural language processing as a toolset capable of understanding the qualitative statements made in a body of text.Several studies have explored the assessment of citation sentiment analysis in academic literature [9,10,11,12] with the goal of examining the sentiment towards particular papers cited in the body of another article as an assessment of article impact.Sentiment analysis has also been applied to the analysis of clinical notes in the electronic health record with the goal of prognostication [13,14].In contrast, there have been limited attempts to use sentiment analysis to characterize the qualitative findings authors express towards their own publication's findings [16,17].However, both of these studies were limited by the availability of labeled abstract data for training of the algorithms developed, as well as the algorithms classes being limited to the two class tasks of positive/neutral [16] or positive/not positive [17] respectively.The methods used in these studies also did not take advantage of newer natural language processing architectures such as the Bidirectional Encoder Representations from Transformers (BERT) model [18], or similar newer models for the analyses, instead opting for the use of a support vector machine [16] and a sequential neural network [17] respectively.This study's goal is to develop and validate a sentiment analysis model for clinical trial abstracts that addresses the limitations of previous studies such that it can be applied to large-scale assessment of clinical literature.This was done by fine tuning a BERT model for three-class sentiment classification [18] pretrained on PubMed text [19] in a semi-supervised fashion with the assistance of a generative adversarial network (GAN) [20].Training was performed using clinical trial abstracts with sentiment labeled by expert raters supplemented with a large amount of unlabeled abstracts from the PubMed database.The final performance of this algorithm was then determined and compared to previous methodologies used for this type of sentiment analysis task.

Methods
The overall goal of this study was the creation of a semi-supervised three-class sentiment classifier for clinical trial abstracts that classifies clinical trial abstracts as expressing positive, negative, or neutral sentiment.This approach was chosen so that the final algorithm could be both sufficiently accurate for application in future studies, while also providing enough information to delineate publication trends in clinical trial literature over time.

Creation of the Labeled & Unlabeled Training Sets
There are no publicly available annotated datasets specific to sentiment analysis of clinical trials, so for the purposes of this study an appropriate annotated dataset had to be created.Given that the most appropriate raters for the sentiment rating of clinical trials are trained clinician experts, creation of a fully annotated dataset for this study was determined to be particularly resource intensive, a problem inherent to clinically related natural language processing (NLP) tasks [22].With this in mind, this study elected to use a semi-supervised approach, combining expert-annotated clinical trial abstracts and large amounts of unlabeled data to minimize the resources required to create the final algorithm.
Data Gathering All abstracts gathered for this study were from the National Library of Medicine's (NLM) PubMed database, filtered specifically to publications classified as clinical trials.The collection of these abstracts was automated using the NCBI's Entrez search and retrieval system with a data mining tool built by the authors using the BioPython toolkit [23].This tool is able to gather all MEDLINE data that is reported for a particular PubMed query, and is able to search in a specific medical field by cross referencing journal ID numbers with a NLM catalog query.
For the creation of the labeled dataset, 12 abstracts each from clinical trials in the fields of Obstetrics & Gynecology, Orthopedics, Pediatrics, Anesthesiology, General Surgery, Internal Medicine, Thoracic Surgery, Critical Care, and Cardiology were randomly selected, for a total of 108 labeled abstracts.These abstracts were then stripped of all information other than the abstract text, and provided to a panel of four clinician experts to independently label the abstracts as having positive, negative, or neutral sentiment.The ground truth class of these abstracts was then defined by the most common rating assigned to the abstract by three of the raters.A single rater's scores were held out to serve as an accuracy measure to compare with the validation results Fig. 1 A visual representation of the GAN-BERT algorithm as described by the original developers [20] of the final algorithm.The smallest and largest class of the labeled data was then randomly oversampled or undersampled to equal the number of samples from the median class in order to create an algorithm with a maximally balanced accuracy between classes [15].The unlabeled dataset was a collection of 2000 clinical trial abstracts selected from PubMed in the same manner described above, excluding those used in the labeled dataset.The unlabeled data is then given a label of UNK UNK so that when it's used to train the classification algorithm the label is appropriately masked.

Data Preprocessing
The conclusion sentences of the labeled and unlabeled abstracts to be used for training and validation were then extracted.This was done as it was found in previous study that using solely the concluding sentences led to an increase in classification accuracy [16,17].Using the Natural Language Tool Kit (NLTK) Python toolkit [24], concluding sentences were identified as those following the conclusion heading for structured abstracts.For unstructured abstracts, the conclusion sentences were determined to be the last n sentences of an abstract based on the number of sentences in the abstract using equation 1 below, where S t is the total number of sentences.
The relative value of 0.125 was determined empirically in a previous study based on the analysis of 2000 structured abstracts [16].
Tokenization Following extraction of the conclusion sentences, all sentences were tokenized using the BERT tokenizer available as part of the HuggingFace Transformers toolkit [25], which tokenizes each word.The BERT tokenizer begins by tagging the first token of each sentence with the token [CLS], then converting each token to its corresponding ID that is defined in the pre-trained BERT model.The end of each sentence is then padded with the tag [PAD] to a fixed sentence length, as the BERT model requires a fixed length sentence as an input [18].

GAN-BioBERT Workflow
Generally, the GAN-BERT architecture consists of a generator function G based on the Semi-Supervised generalized adversarial network (GAN) architecture that generates fake samples F using a noise vector as input [26], the pre-trained BERT model, which is given the labeled data, and a discriminator function D that is a BERT-based k-class classifier that is fine-tuned to the particular classification task [20].This workflow is shown graphically in Figure 1, with further discussion of each element to follow.

BERT Architecture
Before discussing the details of the algorithm used in this study it's key to first discuss the general BERT architecture.Bidirectional Encoder Representations from Transformers or BERT model is a method for language processing first described in 2018 by Devlin et.al. that achieved state of the art performance on a variety of natural language processing tasks and has since become a heavily used tool in natural language processing research [18].BERT functions using 2 sequential workflows, a semi-supervised language modeling task that develops a general language model, then a supervised learning step specific to the language processing task the model is being applied to such as text classification.For developing the pre-trained language model BERT is provided with a very large corpus from a particular domain, such as publications in PubMed [19], documents from a particular language [27], or English Wikipedia and BooksCorpus as in the original BERT model [18].BERT then develops a complete language model from the provided corpus using both masked language modeling, which determine the meaning of individual words within the sentence's context, and next sentence prediction, which works to understand the relationship between sentences.The result of this process is a trained context-sensitive general language model for the specific domain being studied that can then be disseminated for a wide variety of applications.The pretrained language model from the semi-supervised stage of BERT is then fine-tuned for a specific language task by providing task-specific inputs and outputs and then adjusting the parameters of the model accordingly to create the complete task-specific algorithm [18].
BERT Pretrained Model Selection Given the important role of the pretrained model in the BERT architecture, and the relative complexity of biomedical literature, general language models are likely to encounter lower accuracy when applied to a biomedical application such as the one in this study due to a change in the word distributions between general and biomedical corpora [19].As such, in this study the pretrained BioBERT model was used as the general language model to be fine-tuned for sentiment classification [19].BioBERT is a 2020 pretrained BERT model by Lee et.al. that is specific to the biomedical domain that was trained on PubMed abstracts and PubMed Central full-text articles, as well as English Wikipedia and BooksCorpus as was done in the original BERT model [18,19].As a result of this domain specific training, BioBERT has shown improved performance on a variety of biomedical NLP tasks when compared to the standard BERT models [19].

GAN-BERT
While BERT and its derivatives have been able to achieve state of the art performance on a variety of tasks, one major limitation of the model is that fully trained models typically require thousands of annotated examples to achieve these results [20].In particular, significant drops in performance were observed when less than 200 annotated examples are used [20].In order to address this limitation, Croce, Castellucci, and Basili developed the GAN-BERT model in 2020 as a semi-supervised approach to fine tuning BERT models that achieves performance competitive with fully supervised settings [20].Specifically, GAN-BERT expands upon the BERT architecture by the introduction of a Semi-Supervised Generative Adversarial Network (SS-GAN) to the finetuning step of the BERT architecture [26].In a SS-GAN, a "generator" is trained to produce samples resembling the data distribution of the training data i.e. the labeled abstracts in this study.This process is dependent on a "discriminator", a BERT-based classifier in the case of this study, which in an SS-GAN is trained to classify the data into their true classes, in addition to identifying whether the sample was created by the generator or not.When trained in this manner, the labeled abstract data was used to train the discriminator, while both the unlabeled abstracts and the generated data is used to improve the model's inner representations of the classes, which subsequently increases the model's generalizability to new data [20].As a result of this approach the minimum number of annotated samples to train a BERT model is reduced from thousands, to a few dozen [20].Because of this effect, this study uses GAN-BERT to minimize the resource intensive process of creating an expert-annotated corpus of clinical trial abstracts.

GAN-BioBERT Architecture
Having now discussed the individual elements of the GAN-BioBERT workflow and the reasoning for their use in this study, the specific methods and parameters used in this study, including the details of the GAN-BERT architecture used, will be described in greater detail.The generator, G, used is a Multi Layer Perceptron (MLP) that takes a 100 dimension noise vector drawn from the normal distribution N(µ, σ 2 ) and produces a fake vector h fake ∈ R D of length 256.Similarly, the discriminator, D, is a MLP that takes as an input a vector h * ∈ R d which can be either the h fake produced by the generator, or h real , for the real labeled and unlabeled examples.The last layer of D is a softmaxactivated layer, whose output is a vector of logits corresponding to the categories positive, negative, neutral or fake.For forward propagation through the discriminator, it should classify h * into the categories positive, negative, or neutral if h * = h real or the category fake if h * = h fake .During the training process, the goal is to optimize the two competing losses L D and L G corresponding to the loss functions for the discriminator and generator respectively.These loss functions are described more formally by Salimans et al. [26] as follows: Given that p m (ŷ = y|x, y = real) is the probability provided by the model that a sample x is considered a real example and that p m (ŷ = y|x, y = f ake) is the probability given by the model that a sample x is considered a fake (generated) example, the loss function L D of the discriminator is defined as: Where L Dsup is the error in assigning the wrong class to a real example defined as: and L Dunsup is the error assigned to incorrectly assigning an unlabeled examples as fake or failing to recognize a fake example defined as: Meanwhile, the loss function L G of the generator is defined as: Where L G feature matching is the feature matching loss, which leads the generator to produce examples that are increasingly similar to real inputs.If f(x) is the activation on an intermediate layer of discriminator D, This loss is further defined as: and L G unsup is the error induced by fake examples being correctly identified by the discriminator, i.e. it penalizes the generator for the discriminator succeeding in identifying a fake example.This is further defined as: During back-propagation, the unlabeled examples only contribute to L Dunsup , such that they only contribute to the loss if they are falsely considered real, otherwise, their contribution is masked [20].In contrast, the labeled examples contribute to the supervised loss L Dsup [20].It is important to note that in the GAN-BERT architecture the generated examples contribute to both the generator and discriminator loss [20].During back propagation, both the BERT weights and the discriminator's weights are updated, thus both the labeled and unlabeled data contribute to the final classification algorithm.After training the generator is discarded, leaving the BERT model for inference, with no additional computational cost when compared to using a standard BERT model [20].
In short, in this study GAN-BioBERT takes the BERT architecture pretrained on biomedical text using BioBERT [19], and fine-tunes it for sentiment classification of clinical trial abstracts in a semi-supervised manner by using adversarial learning in the form of an SS-GAN architecture known as GAN-BERT [20,26].The training data used consisted of a set of clinical trial abstracts annotated by three expert raters as positive, negative, or neutral, where the least common class was upsampled and the most common class was downsampled in order to create a balanced training set, as well as 2000 unlabeled clinical trial abstracts.The validation accuracy and F1-scores of the resulting algorithm were then determined and compared to previous attempts at this type of application, the performance of a fourth expert rater on the same labeled data used to train and validate the algorithm, and performance using the GAN-BERT algorithm without BioBERT (i.e. with the standard BERT pretrained model).

Results and Discussion
Of the 108 abstracts labeled by the expert raters, twentysix were classified as positive, sixty-nine were classified as neutral, and thirteen were classified as negative by the raters.As such, the negative samples were upsampled, and the neutral samples were downsampled so that each class contained twenty-six examples, for a final labeled dataset of seventy-eight abstracts for training purposes.From this 30% of the samples were held out as the validation set for determining the performance of the algorithm.
After completion of training, the final GAN-BioBERT algorithm was found to have an accuracy of 91.3%, and a macro F1-Score of 0.92.The training of the algorithm took forty-five minutes using the Google Colaboratory Environment using 35 GB of RAM with TPU hardware acceleration.The confusion matrix associated with these results is shown in Figure 2.
As a point of comparison, the performance of the standard GAN-BERT approach was assessed as well as the performance of an expert rater on the same dataset.GAN-BERT using the base uncased BERT pretrained model was found to have an accuracy of 82.6% and a macro F1-score of 0.824.Using the same dataset, an expert rater had an accuracy of 62% with a macro F1-Score of 0.603.These results, alongside the results of the two previous studies investigating sentiment analysis of clinical trial abstracts, are summarized in Table 1.
From a technical perspective, these results show that GAN-BioBERT is a significant step forward for assessing the sentiment in clinical trial literature, with an 8.7% improvement in performance over GAN-BERT for the same classification task.When compared to previous studies' attempts at classifying sentiment in clinical trial abstracts [16,17], this improvement is even more significant as there is an absolute accuracy improvement of 15.3%, while also expanding the classification task to positive, negative, and neutral, as opposed to positive/not positive [17], or positive/neutral [16].This significant improvement in accuracy and expansion of the number of classifiers make GAN-BioBERT much  The performance of the manual expert's rating also merits further discussion of its implications, especially when compared to the performance of GAN-BioBERT.The expert raters used in this study were board-certified physicians with many years working within the setting of academic medicine, and thus are likely to be some of the most qualified individuals to interpret the sentiment in the abstracts of clinical research papers.Knowing this, the expert rater's accuracy of 62% on this classification task underpins both the difficulty of the task of classifying sentiment in clinical literature, as well as the importance of developing a more accurate and standardized measure to assess this sentiment.This finding also brings into question the validity of the historical practice [6,7,8] of large-scale qualitative assessment of clinical literature using manual raters.In short, the comparative performance of expert rating to GAN-BioBERT shows that developing a systematic, reproducible method for assessing clinical literature is a large step forward for the field of academic medicine, and in creating a systematic, unbiased method for assessing the statements made in clinical literature.

Conclusion
This study presents GAN-BioBERT, a sentiment analysis classifier for the assessment of the sentiment expressed in clinical trial abstracts.GAN-BioBERT was shown to significantly outperform both previous attempts to classify sentiment in clinical trial abstracts using sentiment analysis as well as manual expert rating with regards to accuracy and number of sentiment classes.Considering this high multi-class accuracy, and the reproducible results GAN-BioBERT generates, this study posits GAN-BioBERT as a viable tool for large-scale assessment of the findings expressed in clinical trial literature.By using a tool such as GAN-BioBERT the large scale assessment of qualitative reporting trends in clinical trial literature becomes significantly more feasible with more reproducible findings when compared to previously utilized methods.

Table 1
Performance metric results for both this study & previous studies