ORIGINAL RESEARCH article

Front. Artif. Intell., 24 January 2022
Sec. Language and Computation
Volume 4 - 2021 | https://doi.org/10.3389/frai.2021.783778

Augmenting Semantic Lexicons Using Word Embeddings and Transfer Learning

Thayer Alshaabi1,2* Colin M. Van Oort2,3 Mikaela Irene Fudolig2 Michael V. Arnold2 Christopher M. Danforth2,4 Peter Sheridan Dodds2,5
  • 1Advanced Bioimaging Center, University of California, Berkeley, Berkeley, CA, United States
  • 2Vermont Complex Systems Center, University of Vermont, Burlington, VT, United States
  • 3The MITRE Corporation, McLean, VA, United States
  • 4Department of Mathematics & Statistics, University of Vermont, Burlington, VT, United States
  • 5Department of Computer Science, University of Vermont, Burlington, VT, United States

Sentiment-aware intelligent systems are essential to a wide array of applications. These systems are driven by language models which broadly fall into two paradigms: Lexicon-based and contextual. Although recent contextual models are increasingly dominant, we still see demand for lexicon-based models because of their interpretability and ease of use. For example, lexicon-based models allow researchers to readily determine which words and phrases contribute most to a change in measured sentiment. A challenge for any lexicon-based approach is that the lexicon needs to be routinely expanded with new words and expressions. Here, we propose two models for automatic lexicon expansion. Our first model establishes a baseline employing a simple and shallow neural network initialized with pre-trained word embeddings using a non-contextual approach. Our second model improves upon our baseline, featuring a deep Transformer-based network that brings to bear word definitions to estimate their lexical polarity. Our evaluation shows that both models are able to score new words with a similar accuracy to reviewers from Amazon Mechanical Turk, but at a fraction of the cost.

1. Introduction

In computational linguistics and natural language processing (NLP), sentiment analysis involves extracting emotion and opinion from text data. There is an increasing demand for sentiment-aware intelligent systems. The growth of sentiment-aware frameworks in online services can be seen across a vast, multidisciplinary set of applications (Nasukawa and Yi, 2003; Medhat et al., 2014; Bakshi et al., 2016).

With the modern volume of text data—which has long rendered human annotation infeasible—automated sentiment analysis is used, for example, by businesses in evaluating customer feedback to make informed decisions regarding product development and risk management (Turney, 2002; Cabral and Hortacsu, 2010). Combined with recommender systems, sentiment analysis has also been used with the intent to improve consumer experience through aggregated and curated feedback from other consumers, particularly in retail (Kumar and Lee, 2006; Tang et al., 2009; Yu et al., 2013), e-commerce (Bhatt et al., 2015; Haque et al., 2018), and entertainment (Terveen et al., 1997; Pang et al., 2002).

Beyond applications in industry, sentiment analysis has been widely applied in academic research, particularly in the social and political sciences (Chen et al., 2021). Public opinion, e.g., support for or opposition to policies, can be potentially gauged from online political discourse, giving policymakers an important window into public awareness and attitude (Laver et al., 2003; Thomas et al., 2006). Sentiment analysis tools have shown mixed results in forecasting elections (Tumasjan et al., 2010) and monitoring inflammatory discourse on social media, with vital relevance to national security (Pang and Lee, 2008). Sentiment analysis has also been used in the public health domain (Coppersmith et al., 2014; Yadollahi et al., 2017; Gohil et al., 2018), with recent studies analyzing social media discourse surrounding mental health (Bathina et al., 2021; Stupinski et al., 2021), disaster response and emergency management (Beigi et al., 2016).

The growing number of applications of sentiment-aware systems has led the NLP community in the past decade to develop end-to-end models to examine short- and medium-length text documents (Wilson et al., 2005; Feldman, 2013), particularly for social media (Pak and Paroubek, 2010; Agarwal et al., 2011; Korkontzelos et al., 2016). Some researchers have considered the many social and political implications of using AI for sentiment detection across media (Crawford, 2019; Crawford and Paglen, 2021). Recent studies highlight some of the implicit hazards of crowdsourcing text data (Shmueli et al., 2021), especially in light of the latest advances in NLP and emerging ethical concerns (Conway and O'Connor, 2016; Hovy and Spruit, 2016). Identifying potential racial and gender disparity in NLP models is essential to develop better models (Tatman, 2017).

Sentiment analysis tools fall into one of two groups, depending on their definition of sentiment and their model for its estimation. One of the more popular paradigms is discrete classification, where sentiment is divided into several classes (e.g., positive, negative) and pieces of text are associated with each class. However, sometimes a continuous measure is desired, requiring a spectrum of sentiment scores rather than sentiment classes (Thelwall et al., 2010). This more nuanced sentiment scoring paradigm has been widely adopted for e-commerce, movies, and restaurant reviews (Snyder and Barzilay, 2007).

Sentiment analysis models largely derive from two major paradigms: 1. Lexicon-based models and 2. Contextual models. Lexicon-based models compute sentiment scores based on sentiment dictionaries (sentiment lexicons) typically constructed by human annotators (Taboada et al., 2011; Dodds et al., 2015; Augustyniak et al., 2016). A sentiment lexicon contains not only terms that express a particular sentiment/emotion, but also terms that are associated with a particular sentiment/emotion (denotation vs. connotation). Contextual models, on the other hand, extrapolate semantics by converting words to vectors in an embedding space, and learning from large-scale annotated datasets to predict sentiment based on co-occurrence relationships between words (Wilson et al., 2005; Pak and Paroubek, 2010; Agarwal et al., 2011; Feldman, 2013; Socher et al., 2013b). Contextual models have the advantage in differentiating multiple meanings, as in the case of “The dog is lying on the beach” vs. “I never said that—you are lying,” while lexicon-based models usually have a single score for each word, regardless of usage. Despite the flexibility of contextual models, their results can be difficult to interpret, as the high-dimensional latent space in which they are embedded renders explanation difficult. The ease of use and transparent comprehension of lexicon-based models help explain their continued popularity (Pang and Lee, 2008; Taboada et al., 2011; Dodds et al., 2015). For example, while the linguistic mechanisms leading to change in sentiment may be hard to explain with word embeddings, one can straightforwardly use lexicon scores to reveal the words contributing to shifted sentiment (Dodds et al., 2011; Reagan et al., 2017; Gallagher et al., 2021).

A major challenge for the simpler and more interpretable lexicon-based models, however, is the time and financial investment associated with maintaining them. Sentiment lexicons must be updated regularly to mitigate the out-of-vocabulary (OOV) problem: words and phrases that were either not considered or did not exist when the dictionaries were originally constructed (Riloff, 1996). While researchers have shown that general sentiment trends remain observable as long as the lexicon has sufficient coverage, a versatile dictionary that includes specialized and rarely used words improves the signal (Dodds and Danforth, 2010; Reagan et al., 2017). Notably, language is an evolving sociotechnical phenomenon. New words and phrases are created constantly, especially on social media, while old words lose popularity and the meanings of existing words change (Alshaabi et al., 2021a). For example, the word "covid" grew to be the most narratively trending n-gram in reference to the global Coronavirus outbreak during February and March 2020 (Alshaabi et al., 2021b).

Sentiment analysis applications are often developed to investigate bipolar relationships (e.g., positive–negative, happy–sad, excited–bored). These bipolar relationships are conveniently handled by binary classification systems; however, such a formalization leads to multiple varieties of neutral sentiment (Colhon et al., 2017). Many sentiment analysis applications avoid, ignore, or remove text with neutral sentiment. Excluding neutral sentiment text during training can have significant impacts on trained models, which are often confused by or uncertain of neutral sentiment text (Koppel and Schler, 2006). For classification-based applications, explicitly representing neutral sentiment as a third class can improve model performance (Ribeiro et al., 2016). Humans process emotionally charged words differently than neutral words; sentiment analysis models may therefore find success via similar processes (Kissler and Herbert, 2013).

In this work, we propose an automated framework extending sentiment for semantic lexicons to OOV words, reducing the need for crowdsourcing scores from human annotators, a process that can be time-consuming and expensive. Although our framework can be used in a more general sense, we focus on predicting happiness scores based on the labMT dataset (Dodds et al., 2015). This dataset was constructed from human ratings of the “happiness” of words on a continuous scale, averaging scores from multiple annotators for more than 10,000 words. We discuss this dataset in detail in section 3.1. In section 2, we discuss recent developments using deep learning in NLP, and how they relate to our work. We introduce two models, demonstrating accuracy on par with human performance (see section 3 for technical details). We first introduce a baseline model—a neural network initialized with pre-trained word embeddings—to gauge happiness scores. Second, we present a deep Transformer-based model that uses word definitions to estimate their sentiment scores. We will refer to our models as the “Token” and “Dictionary” models, respectively. We present our results and model evaluation in section 4, highlighting how the models perform compared with reviewers from Amazon's Mechanical Turk. Finally, we highlight key limitations of our approach, and outline some potential future developments in concluding remarks.

2. Related Work

Word embeddings are abstract numerical representations of the relationships between words, derived from statistics on individual corpora, and encoding language patterns so that concepts with similar semantics have similar representations (Bengio et al., 2003). Researchers have shown that efficient representations of words can both express meanings and preserve context (Maas et al., 2011; Hollis and Westbury, 2016; Hollis et al., 2017; Li et al., 2017). While there are many ways to construct word embedding models (e.g., matrix factorization), we often use the term to refer to a specific class of word embeddings that are learnable via neural networks.

Word2Vec is one of the key breakthroughs in NLP, introducing an efficient way for learning word embeddings from a given text corpus (Mikolov et al., 2013a,b). At its core, it builds off of a simple idea borrowed from linguistics and formally known as the “distributional hypothesis”—words that are semantically similar are also used in similar ways, and likely to appear with similar context words (Harris, 1954).

Starting from a fixed vocabulary, we can learn a vector representation for each word via a shallow network with a single hidden layer trained in one of two fashions (Mikolov et al., 2013a,b). Both approaches formalize the task as an unsupervised prediction problem, whereby an embedding is learned jointly with a network that is trained either to predict an anchor word given the words around it [continuous bag-of-words (CBOW)] or to predict context words given an anchor word (skip-gram) (Mikolov et al., 2013a). Both approaches, however, are limited to local context bounded by the size of the context window. Global Vectors (GloVe) addresses that problem by capturing corpus-wide statistics with a word co-occurrence probability matrix (Pennington et al., 2014).

While Word2Vec and GloVe offer substantial improvements over previous methods, they both fail to encode unfamiliar words—tokens that were not processed in the training corpora. FastText refines word embeddings by supplementing the learned embedding matrix with subwords to overcome the challenge of OOV tokens (Bojanowski et al., 2017; Joulin et al., 2017). This is achieved by training the network with character-level n-grams (n∈{3, 4, 5, 6}), then taking the sum of all subwords to construct a vector representation for any given word. Although the idea behind FastText is rather simple, it presents an elegant solution to account for rare words, allowing the model to learn more general word representations.
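
To make the subword idea concrete, the following is a minimal Python sketch rather than FastText's actual implementation: a vector for an out-of-vocabulary word is assembled by summing the vectors of its character-level n-grams. Here, subword_vectors stands in for a hypothetical lookup table of learned n-gram vectors, and the boundary markers and n-gram range follow the description above.

```python
import numpy as np

def char_ngrams(word, ns=(3, 4, 5, 6)):
    """Character-level n-grams of a word wrapped in boundary markers."""
    token = f"<{word}>"
    return [token[i:i + n] for n in ns for i in range(len(token) - n + 1)]

def oov_vector(word, subword_vectors, dim=300):
    """Sum the vectors of all known n-grams to embed an unseen word."""
    grams = [g for g in char_ngrams(word) if g in subword_vectors]
    if not grams:
        return np.zeros(dim)
    return np.sum([subword_vectors[g] for g in grams], axis=0)
```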

A major shortcoming of the earlier models is their inability to capture contextual descriptions of words as they all produce a fixed vector representation for each word. In building context-aware models, researchers often use fundamental building blocks such as recurrent neural networks (RNN) (Rumelhart et al., 1986)—particularly long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997)—that are designed to process sequential data. Many methods have provided incremental improvements over time (Chen et al., 2017; Lee et al., 2017; Peters et al., 2017). ELMo is one of the key milestones toward efficient contextualized models, using deep bi-directional LSTM language representations (Peters et al., 2018).

In late 2017, the advent of deep attention-based models, dubbed transformers, rapidly changed the landscape in the NLP community (Vaswani et al., 2017). The encoder-decoder framework, powered by attention blocks, enables faster processing of the input sequence while also preserving context (Vaswani et al., 2017). Recent adaptations of the building blocks of Transformers continue to break records, improving the state-of-the-art across all NLP benchmarks with recent applications to computer vision and pattern recognition (Dosovitskiy et al., 2021).

Exploiting the versatile nature of Transformers, we observe the emergence of a new family of language models widely known as "self-supervised," including bidirectional encoders [e.g., BERT (Devlin et al., 2019)] and left-to-right decoders [e.g., GPT (Radford et al., 2018)]. Self-supervised language models are pre-trained by masking random tokens in the unlabeled input data and training the model to predict these tokens. Researchers leverage recent subword tokenization techniques, such as WordPiece (Wu et al., 2016), SentencePiece (Kudo and Richardson, 2018), and Byte Pair Encoding (BPE) (Sennrich et al., 2016), to overcome the challenge of rare and OOV words. Subtle contextualized representations of words can be learned by predicting whether sentence B follows sentence A (Devlin et al., 2019). Pre-trained language models can then be fine-tuned using labeled data for downstream NLP tasks, such as named entity recognition, question answering, text summarization, and sentiment analysis (Radford et al., 2018; Devlin et al., 2019).

Recent advances in NLP continue to improve the language facility of Transformer-based models. The introduction of XLNet (Yang et al., 2019) is another remarkable breakthrough that combines the bi-directionality of BERT (Devlin et al., 2019) and the autoregressive pre-training scheme from Transformer-XL (Dai et al., 2019). While the current trend of making ever-larger and deeper language models shows an impressive track record, it is arguably unfruitful to maintain unreasonably large models that only giant corporations can afford to use due to hardware limitations (Thompson et al., 2020). Vitally, less expensive language models need to be both computationally efficient and exhibit performance on par with larger models. Addressing that challenge, researchers proposed clever techniques of leveraging knowledge distillation (Hinton et al., 2015) to train smaller and faster models [e.g., DistilBERT (Sanh et al., 2019)]. Similarly, efficient parameterization strategies via sharing weights across layers can also reduce the size of the model while maintaining state-of-the-art results [e.g., ALBERT (Lan et al., 2020)].

Previous work on automatic sentiment lexicon generation (ASLG) has used a variety of heuristics to assign sentiment scores to OOV words. Most ASLG methods start with a seed lexicon containing words of known sentiment, then use a distance function to propagate sentiment scores from known words to unknown words. Word co-occurrence frequencies (Turney and Littman, 2003; Kiritchenko et al., 2014) and shortest path distances within a semantic word graph (Qiu et al., 2009; Baccianella et al., 2010; San Vicente et al., 2014) [such as WordNet (Fellbaum, 1998)] were common distance functions in earlier work. More recently, distance functions based on learned word embeddings have gained popularity (Tang et al., 2014; Wang et al., 2016; Ljubešić et al., 2018; Thavareesan and Mahesan, 2020). The outputs of word embedding models usually need to be projected into a lower dimension before they can be used for ASLG. This can be done using a variety of machine learning models, though linear models are likely one of the most popular options (Qiu et al., 2009; Amir et al., 2015; Wang et al., 2016; Li et al., 2017; Alshari et al., 2018; Ljubešić et al., 2018; Thavareesan and Mahesan, 2020). Amir et al. (2015) proposed the use of a support vector regressor (SVR) trained with CBOW (Mikolov et al., 2013a) or GloVe (Pennington et al., 2014) word embeddings, finding that the SVR model outperformed various linear models [e.g., Lasso (Yuan and Lin, 2006), Ridge (Hoerl and Kennard, 1970), ElasticNet (Zou and Hastie, 2005) regressors] on the labMT lexicon. However, their models only predicted a binary sentiment polarity (ŷ∈[0, 1]), rather than continuous scores. Li et al. (2017) extended their work, proposing a class of linear regression models trained with word embeddings to predict affective meanings in several sentiment lexicons such as ANEW (Bradley and Lang, 1999) and VAD (Mohammad, 2018). Darwich et al. (2019) present an excellent review of ASLG.
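
As a rough sketch of this embedding-plus-regressor style of ASLG (not the exact setup of Amir et al. or Li et al.), one might compare a support vector regressor against a linear baseline on a seed lexicon. Here X and y are assumed to be prepared beforehand: X holds pre-trained word vectors for the seed lexicon, and y the corresponding known sentiment scores.

```python
from sklearn.svm import SVR
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# X: (n_words, embedding_dim) pre-trained word vectors for the seed lexicon
# y: (n_words,) known sentiment scores for those words
for name, model in [("SVR", SVR(kernel="rbf")), ("Ridge", Ridge(alpha=1.0))]:
    mae = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    print(f"{name}: mean MAE = {mae.mean():.3f}")
```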

Many of the human engineered heuristics used in previous work on ASLG can be largely automated via clever application of new machine learning techniques. Sentiment analysis knowledge bases can be constructed using graph-mining and multi-dimensional scaling techniques (Bajpai et al., 2016). Once constructed, these knowledge bases allow for the application of a host of additional methods. Neural tensor networks can be used for knowledge base completion, inferring relationships that were missed during construction (Socher et al., 2013a). Graph neural networks can create rich features from the relationships captured in knowledge bases, allowing sentiment analysis models to handle complex context-based problems (Dowlagar and Mamidi, 2021; Liao et al., 2021; Yang et al., 2021). Ensembles of symbolic and sub-symbolic AI can be used to cover the individual weaknesses of each method (Cambria et al., 2020).

Building on many of the models discussed above, we develop a framework for augmenting semantic lexicons using word embeddings and pre-trained large language models. Our models output continuous-valued sentiment scores that can represent degrees of negative, neutral, and positive sentiment. Our tool reduces the need for crowdsourcing scores from human annotators while still providing similar, and often better, results compared with random reviewers from Amazon Mechanical Turk at a fraction of the cost.

3. Materials and Methods

We propose two models for predicting happiness scores for the labMT lexicon (Dodds et al., 2015)—a general-purpose sentiment lexicon used to measure happiness in text corpora (see section 3.1 for more details).

Our first model is a neural network initialized with pre-trained FastText word embeddings. The model uses fixed word representations to gauge the happiness score for a given expression, enabling us to augment the labMT dataset at a low cost. For simplicity, we will refer to this model as the Token model.

Bridging the link between lexicon-based and contextualized models, we also propose a deep Transformer-based model that uses word definitions to estimate their happiness scores—namely, the Dictionary model. The contextualized nature of the input data allows our model to accurately estimate the expressed happiness score for a given word based on its lexical meaning.

We implement our models using TensorFlow (Abadi et al., 2016) and Transformers (Wolf et al., 2020). See sections 3.2 and 3.3 for additional details of our Token and Dictionary models, respectively. Our source code, along with pre-trained models, is publicly available via our GitLab repository (https://gitlab.com/compstorylab/sentiment-analysis).

3.1. Data

In this study, we use the labMT dataset as an example sentiment lexicon to test and evaluate our models (Dodds et al., 2015). The labMT lexicon contains roughly ten thousand unique words—combining the five thousand most frequently used words from New York Times articles, Google Books, Twitter messages, and music lyrics (Dodds et al., 2015). It is a lexicon designed to gauge changes in the happiness (i.e., valence or hedonic tone) of text corpora. Happiness is defined on a continuous scale h ∈ [1, 9], where 1 bounds the most negative (sad) side of the spectrum, and 9 is the most positive (happy). Ratings for each word are crowdsourced via Amazon Mechanical Turk (AMT), taking the average score havg from 50 reviewers to set a happiness score for any given word. For example, the words “suicide,” “terrorist,” and “coronavirus” have the lowest happiness scores, while the words “laughter,” “happiness,” and “love” have the highest scores. Function and stop words along with numbers and names tend to have neutral scores (havg ≈ 5), such as “the,” “fourth,” “where,” and “per.”
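
As a simplified illustration of how such a lexicon is typically applied (the Hedonometer itself adds refinements such as frequency weighting), the happiness of a text can be estimated by averaging the labMT scores of the words it contains. The sketch below assumes the lexicon is available as a plain dictionary mapping words to their average scores.

```python
def text_happiness(text, lexicon):
    """Average happiness of a text under a labMT-style lexicon.

    lexicon: dict mapping word -> average happiness score h_avg in [1, 9].
    Words missing from the lexicon are skipped.
    """
    scores = [lexicon[w] for w in text.lower().split() if w in lexicon]
    return sum(scores) / len(scores) if scores else None
```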

The labMT dataset also powers the Hedonometer, an instrument quantifying daily happiness on Twitter (Dodds et al., 2011). Over the past few years, the labMT lexicon was updated to include new words that were not found in the original survey [e.g., terms related to the COVID-19 pandemic (Alshaabi et al., 2021b)].

We are particularly interested in this dataset because it also provides the standard deviation of human ratings for each word, which we use to evaluate our models. In this work, we propose two models to estimate havg using word embeddings, and thus provide an automated tool to augment the labMT dataset both reliably and efficiently.

In Figure 1, we display a 2D histogram of the human rated happiness scores in the labMT dataset. The figure highlights the degree of uncertainty in human ratings of the emotional valence of words. For example, the word “the” has an average happiness score of havg = 4.98, with standard deviation of σ = 0.91, while the word “hahaha” has a happier score with havg = 7.94 and σ = 1.56. Some words also have a relatively large standard deviation such as “church” (havg = 5.48, σ = 1.85), and “cigarettes” (havg = 3.31, σ = 2.6).

Figure 1. Emotional valence of words and uncertainty in human ratings of lexical polarity. A 2D histogram of happiness havg and standard deviation of human ratings for each word in the labMT dataset. Happiness is defined on a continuous scale from 1 to 9, where 1 is the least happy and 9 is the most. Words with a score between 4 and 6 are considered neutral. While the vast majority of words are neutral, there is a positive bias in human language (Dodds et al., 2015). The average standard deviation of human ratings for estimating the emotional valence of words in the labMT dataset is 1.38.

While the majority of words are neutral, with a score between 4 and 6, we still observe a human positivity bias in the English language (Dodds et al., 2015; Aithal and Tan, 2021). On average, the standard deviation of human ratings is 1.38. In our evaluation (section 4), we show how our models perform relative to the uncertainty observed in human ratings.

3.2. Token Model

Our first model uses a neural network that learns to map words from the labMT lexicon to their corresponding sentiment scores. The model considers only the individual words as input, learning a non-linear mapping between words and their happiness scores and enriching its internal representation with subword information to estimate the happiness score.

The input word is first processed into a token embedding by breaking it into its character-level n-grams, where n ∈ {3, 4, 5} (see Figure 2 for an illustration). English words have an average length of 5 characters (Miller et al., 1958; Mayzner and Tresselt, 1965), which would yield 6 unique character-level n-grams given our tokenization scheme. Although we tried shorter and longer sequences, we fix the length of the input sequence at 50 and pad shorter sequences to ensure a universal input size. We choose a longer sequence length to allow us to encode longer n-grams and rare words.
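
A small sketch of this preprocessing step is shown below, following the layout in Figure 2: the original word is placed first, followed by its character-level n-grams, and the sequence is padded to a fixed length of 50. The padding token name is illustrative.

```python
PAD, MAX_LEN = "<PAD>", 50

def token_sequence(word, ns=(3, 4, 5), max_len=MAX_LEN):
    """Word followed by its character-level n-grams, padded to max_len."""
    grams = [word] + [word[i:i + n] for n in ns for i in range(len(word) - n + 1)]
    grams = grams[:max_len]
    return grams + [PAD] * (max_len - len(grams))

token_sequence("coronavirus")[:8]
# ['coronavirus', 'cor', 'oro', 'ron', 'ona', 'nav', 'avi', 'vir']
```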

Figure 2. Input sequence embeddings. We use two encoding schemes to prepare input sequences for our models: token embeddings (blue) and dictionary embeddings (orange) for our Token and Dictionary models, respectively. Given an input word (e.g., “coronavirus”), we first break the input token into character-level n-grams (n∈{3, 4, 5}). The resulting sequence of n-grams, with the original word at the beginning, is used in our Token model. Sequences shorter than a specified length are appended with PAD, a padding token ensuring a universal input size. For our Dictionary model, we first look up a dictionary definition for the given input. We then process the input word along with its definition into subwords using WordPiece (Wu et al., 2016). Uncommon and novel words are broken into subwords, with double hashtags indicating that the given token is not a full word.

We then pass the token embeddings to a 300-dimensional embedding layer. We initialize the embedding layer with weights trained with subword information on Common Crawl and Wikipedia using FastText (Bojanowski et al., 2017). In particular, we use weights from a pre-trained model using CBOW with character-level n-grams of length 5 and a window size of 5 and 10 (https://fasttext.cc/docs/en/english-vectors.html).

The output of the embedding layer is pooled down and passed to a sequence of three dense layers of decreasing sizes: 128, 64, and 32, respectively. We use a rectified linear activation function (ReLU) for all dense layers. We also add a dropout layer after each dense layer, with a 50% dropout rate to add stochasticity to the model, allowing for a simple estimate of uncertainty using the standard deviation of the network's predictions (Srivastava et al., 2014).

We experimented with a few different layout configurations, finding that making the network either wider or deeper has minimal effect on the network performance. Therefore, we choose to keep our model rather simple with roughly 10 million trainable parameters. The output of the last dense layer is finally passed over to a single output layer with a linear activation function to regress a sentiment score between 1 and 9. See Figure 3 for a simple diagram of the model architecture.
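
A minimal Keras sketch of this architecture follows. The vocabulary size, the matrix of pre-trained FastText weights, and the exact pooling operation are assumptions on our part, so the snippet should be read as one plausible configuration rather than the published implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_token_model(vocab_size, fasttext_weights, max_len=50):
    """Token model sketch: FastText-initialized embedding + small dense head."""
    inputs = layers.Input(shape=(max_len,), dtype="int32")
    x = layers.Embedding(
        vocab_size, 300,
        embeddings_initializer=tf.keras.initializers.Constant(fasttext_weights),
    )(inputs)
    x = layers.GlobalAveragePooling1D()(x)  # pooling choice is an assumption
    for units in (128, 64, 32):
        x = layers.Dense(units, activation="relu")(x)
        x = layers.Dropout(0.5)(x)
    score = layers.Dense(1, activation="linear")(x)  # happiness score in [1, 9]
    return tf.keras.Model(inputs, score)
```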

Figure 3. Model architectures. Our first model is a neural network initialized with pre-trained word embeddings to estimate happiness scores. Our second model is a deep Transformer-based model that uses word definitions to estimate their sentiment scores. See sections 3.2 and 3.3 for further technical details of each model, respectively. Note that the Token model is considerably smaller, with roughly 10 million trainable parameters, compared with the Dictionary model, which has a little over 66 million parameters.

3.3. Dictionary Model

Historically, lexicon-based models have only considered simple statistical methods to estimate the emotional valence of words. Here, we try to bridge the gap between these conventional techniques and recent advances in NLP.

For our second model, we use a contextualized Transformer-based language model to estimate the sentiment score for a given word based on its dictionary definition. While still predicting scores for individual words, we now do so by augmenting each word with its expressed meaning(s) from a general dictionary. Given an input word, we look up its definition via a free online dictionary API available at https://dictionaryapi.dev.

The average length of definitions for the words found in labMT is roughly 38 words. We choose a maximum definition length of 50 words, which covers the 75th percentile of that distribution, to ensure that words with multiple definitions are adequately represented. Increasing the sequence length beyond 50 did not improve our accuracy, but it did increase model complexity, slowing our training and inference time substantially. Therefore, we fix the length of word definitions to a maximum of 50 words, padding shorter sequences and truncating anything beyond the 50th word to ensure a fixed input size.

We estimate the sentiment of each labMT word as follows. The word, along with its definition, is processed into dictionary embeddings by breaking each word into subwords based on their frequency of usage using WordPiece (Wu et al., 2016), a widely adopted tokenization technique that breaks uncommon and novel words into subwords, reducing the vocabulary size of language models and enabling them to handle OOV tokens. Other tokenization models should give similar results (Kudo and Richardson, 2018). For terms without definitions, we use only the word itself as input to our model.

In principle, the dictionary embeddings can be passed to a vanilla Transformer model [e.g., BERT (Devlin et al., 2019), XLNet (Yang et al., 2019)]. However, we prefer more manageable (i.e., smaller and faster) models that maintain state-of-the-art results. We tried both ALBERT (Lan et al., 2020) and DistilBERT (Sanh et al., 2019); both have equivalent performance on our task. The output of the model's pooling layer is passed to a sequence of three dense layers of decreasing sizes with dropout applied after each layer, similar to our approach in the Token model. Finally, the output of the last dense layer is projected down to a single output value that serves as the sentiment score prediction.
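
For concreteness, the sketch below outlines one plausible realization of the Dictionary model using the DistilBERT variant from the Transformers library. The pooling operation, the head sizes (mirroring the Token model), and the example definition string are assumptions, not the exact published configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers
from transformers import DistilBertTokenizerFast, TFDistilBertModel

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
backbone = TFDistilBertModel.from_pretrained("distilbert-base-uncased")

def build_dictionary_model(max_len=50):
    """Dictionary model sketch: DistilBERT encoder + small regression head."""
    ids = layers.Input(shape=(max_len,), dtype="int32", name="input_ids")
    mask = layers.Input(shape=(max_len,), dtype="int32", name="attention_mask")
    hidden = backbone(input_ids=ids, attention_mask=mask).last_hidden_state
    x = layers.GlobalAveragePooling1D()(hidden)  # pooling choice is an assumption
    for units in (128, 64, 32):                  # head sizes mirror the Token model
        x = layers.Dense(units, activation="relu")(x)
        x = layers.Dropout(0.5)(x)
    score = layers.Dense(1, activation="linear")(x)
    return tf.keras.Model([ids, mask], score)

# Input: the word followed by its dictionary definition, truncated/padded to 50 tokens.
encoded = tokenizer(
    "coronavirus: any of a group of RNA viruses that cause disease in humans and animals",
    truncation=True, max_length=50, padding="max_length", return_tensors="tf")
```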

The Token model is considerably lighter in terms of memory usage, and faster in terms of training and inference time than the Dictionary model. Our current configuration of the Token model results in roughly 10 million trainable parameters compared with the Dictionary model that has over 66 million parameters.

4. Results

4.1. Ensemble Learning and k-Fold Cross-Validation

In the deep learning community, particularly in the NLP domain, it is common to scale up the number of parameters in successful models to eke out additional performance gains. The effectiveness of this approach tends to be correlated with the amount of training data available (i.e., larger models are more effective when trained on larger data sets). With the limited size of our training set, we needed alternative techniques to increase the performance of our models. Ensemble learning is a widely known and adopted family of methods in which the average performance of an ensemble is shown to be both less biased and better than the individual models (Hansen and Salamon, 1990; Krogh and Vedelsby, 1994).

First, we randomly subsample our dataset, taking a 20% subset as our holdout set for testing. Using a 5-fold cross-validation strategy, we break the remaining samples into 5 distinct subsets using an 80/20 split for training/validation. We train one model per fold for a maximum of 500 epochs each, and combine the 5 trained models to form an ensemble. While there are many gradient descent optimization algorithms, we use Adam (Kingma and Ba, 2015) as a popular and well-established optimizer, keeping its default configuration and setting our initial learning rate to 0.001. In Figure 4, we show a breakdown of our ensemble pipeline whereby the blue squares highlight the validation subset for each fold. Note that the holdout set is removed before training the ensemble and is only used for testing a complete ensemble.
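
A sketch of this pipeline, under stated assumptions (X and y are the encoded words and their scores, build_model is either of the model constructors sketched above, and mean absolute error is used as the training loss, which is our assumption), might look as follows.

```python
import tensorflow as tf
from sklearn.model_selection import train_test_split, KFold

# Hold out 20% for testing; the remainder is used for 5-fold cross-validation.
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

ensemble = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_dev):
    model = build_model()  # Token or Dictionary model constructor (see sketches above)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="mae")  # loss choice is an assumption
    model.fit(X_dev[train_idx], y_dev[train_idx],
              validation_data=(X_dev[val_idx], y_dev[val_idx]),
              epochs=500)  # 500-epoch cap per fold
    ensemble.append(model)
```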

Figure 4. Ensemble learning and k-fold cross-validation. Using an 80/20 split for training/validation, we train our models for a maximum of 500 epochs per fold for a total of 5 folds. We use the model trained from each fold to build an ensemble because the average performance of an ensemble is less biased and better than the individual models.

To estimate the happiness score for a given word, we take a Monte Carlo approach by sampling 100 predictions per model in the ensemble. We use the training setting for the dropout layers in each model, rather than the test time averaging that is commonly used, so that these predictions are heterogeneous. The mean over these predictions becomes the proposed happiness score, while the standard deviation serves as an estimate of model uncertainty (Gal and Ghahramani, 2016). Providing a point estimate along with an uncertainty band allows us to compare and contrast the level of model uncertainty in our ensembles with the uncertainty observed between human annotators.
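
In code, the Monte Carlo step amounts to repeated stochastic forward passes with dropout left active, pooled over the ensemble. Below is a sketch assuming Keras-style models such as those built above.

```python
import numpy as np

def predict_with_uncertainty(ensemble, x, n_samples=100):
    """Mean and standard deviation over stochastic forward passes (dropout on)."""
    draws = np.concatenate([
        np.stack([model(x, training=True).numpy() for _ in range(n_samples)])
        for model in ensemble
    ])
    return draws.mean(axis=0), draws.std(axis=0)
```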

4.2. Comparison With Other Methods and Human Annotators

Although both of our proposed strategies—namely using character-level n-grams and word definitions—performed well, the Dictionary model outperforms the Token model. To evaluate our models, we train 10 replicates of each and then investigate the error distributions obtained on the test set. We report the mean absolute error (MAE) as an estimate of overall performance, along with a selection of percentiles to compare tail behavior across models. Each of these statistics is averaged over the 10 replicates. This process provides us with a strong estimate of the generalization performance for our proposed models.
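
The summary statistics themselves are straightforward to compute; a small sketch follows, where the particular percentiles shown are placeholders rather than the exact columns reported in Table 1.

```python
import numpy as np

def error_summary(y_true, y_pred, percentiles=(50, 90, 95, 99)):
    """Mean absolute error and selected percentiles of the absolute errors."""
    abs_err = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    summary = {"MAE": abs_err.mean()}
    summary.update({f"p{p}": np.percentile(abs_err, p) for p in percentiles})
    return summary
```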

Table 1 summarizes the results of this evaluation process for our proposed models and ensembles. We provide baseline comparisons to models from previous work (Amir et al., 2015; Li et al., 2017), including popular linear models, random forests, and support vector machines trained with three different flavors of word embeddings: Word2Vec (Mikolov et al., 2013a), GloVe (Pennington et al., 2014), and FastText (Bojanowski et al., 2017). These results indicate that our Token model outperforms all prior baselines, our Dictionary model outperforms our Token model, and both of our proposed models benefited from ensemble learning. Though the ensembles outperformed the individual models in both cases, it is interesting to note that they also had longer tails for their error distributions.

Table 1. Summary statistics of the testing subset comparing our models to the annotated ratings reported in labMT.

We further examine the error distributions to investigate whether the models have a bias toward high or low happiness scores. In Figures 5, 6, we display a breakdown of our MAE distributions for the Token and Dictionary models, respectively. For ease of interpretation and visualization, we categorize the happiness scores into three groups: negative (havg ∈ [1,4)), neutral (havg ∈ [4,6]), and positive (havg ∈ (6,9]). While the distributions show our models operate well on all words, particularly neutral expressions, we note a relatively higher MAE for negative words, whereby our predictions for these terms are more positive than the annotations.

Figure 5. Error distributions for the Token model. We display mean absolute errors for predictions using the Token model on all words in labMT. We arrange the happiness scores into three groups: negative (havg ∈ [1,4), orange), neutral (havg ∈ [4,6], gray), and positive (havg ∈ (6,9], green). Most words have an MAE less than 1 with the exception of a few outliers. We see a relatively higher MAE for negative and positive terms compared to neutral expressions.

Figure 6. Error distributions for the Dictionary model. We display mean absolute errors for predictions using the Dictionary model on all words in labMT. Again, we categorize the happiness scores into three groups: negative (havg ∈ [1,4), orange), neutral (havg ∈ [4,6], gray), and positive (havg ∈ (6,9], green). Similar to the Token model, most words have an MAE less than 1 with the exception of a few outliers. While the Dictionary model outperforms the Token model, we still observe a higher MAE for negative and positive terms compared to neutral expressions.

We also compare our predictions to the ground-truth ratings, examining the degree to which the models either overshoot or undershoot the happiness scores crowdsourced via AMT. Words in the labMT lexicon were scored by taking the average happiness score of distinct evaluations from 50 different individuals (see Table S2, Dodds et al., 2015). Since the variance of human ratings and our model MAEs are on the same scale, we can use the observed average variance of the ratings (1.17) as a baseline to assess rater confidence in the reported scores. Comparing our models to that baseline, we note that all models offer consistent predictions with similar expectations to a random and reliable reviewer from AMT. See Table 1 for further statistical details.

In Figures 7, 8, we display the top-50 words with the highest mean absolute error for the Token and Dictionary models, respectively. While the models always predict the right emotional attitude for each word based on its lexical polarity, they are biased toward neutral, undershooting scores for happy words and overshooting scores for sad expressions.

Figure 7. Token model: Top-50 words with the highest mean absolute error. Model predictions are shown in blue and the crowdsourced annotations are displayed in gray. While still maintaining relatively low MAE, most of our predictions are conservative—marginally underestimating words with extremely high happiness scores, and overestimating words with low happiness scores.

Figure 8. Dictionary model: Top-50 words with the highest mean absolute error. Model predictions are shown in blue and the crowdsourced annotations are displayed in gray. Note that the vast majority of words with relatively high MAE also have high standard deviations of AMT ratings. Words that have multiple definitions will have a neutral score (e.g., lying). A neutral happiness score is also often predicted for words for which we are unable to obtain good definitions to use as input. Although we have definitions for most words in our dataset, we still have a little over 1,500 words with missing definitions. Most of these words are names (e.g., “Burke”) and slang (e.g., “xmas” and “ta”).

One possible explanation of this systematic behavior is the lack of words with extreme happiness scores in the labMT lexicon. It is possible to train models with a smaller but balanced subset of the dataset to overcome that challenge. Doing so, however, would reduce the size of training/validation samples substantially. Still, our margin of error is relatively low compared to human ratings. Future investigations may test and improve the models by examining larger sentiment lexicons.

Another key factor that plays a big role in our prediction error is the availability, or lack, of good word definitions to use as input for our Dictionary model. Surprisingly, sourcing definitions from online dictionaries for a large set of words is rather challenging, especially if one opts out of reliable but paid services. In our work, we choose not to use urban dictionaries or any services with paid APIs. We use a free online dictionary API that is available at https://dictionaryapi.dev.

While we do have definitions for most words in our dataset, a total of 1,518 words have missing definitions. Most of these words are names, abbreviations, and slang terms (e.g., “xmas,” “foto,” “nvm,” and “lmao”). Words with multiple definitions can also have meanings whose scores cancel each other out (e.g., “lying”).

Notably, the vast majority of words with high MAE also have high AMT standard deviations. To further investigate prediction accuracy, we examine the overlap between the predictions and human ratings. In particular, we compute the intersection over union (IOU) between the interval formed by the predicted happiness score plus or minus the model uncertainty and the corresponding interval havg ± σ from the annotated ratings.
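
Concretely, the overlap measure for a single word can be computed as the IOU of two one-dimensional intervals; a small sketch:

```python
def interval_iou(pred_mean, pred_std, human_mean, human_std):
    """IOU of the intervals (pred_mean ± pred_std) and (human_mean ± human_std)."""
    lo1, hi1 = pred_mean - pred_std, pred_mean + pred_std
    lo2, hi2 = human_mean - human_std, human_mean + human_std
    intersection = max(0.0, min(hi1, hi2) - max(lo1, lo2))
    union = (hi1 - lo1) + (hi2 - lo2) - intersection
    return intersection / union if union > 0 else 0.0
```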

The Token model underestimates the happiness score for “win”—the only word with a prediction that falls outside the range of human annotated happiness scores. The remaining predicted happiness scores fall well within the range of scores crowdsourced via AMT. Similarly, the Dictionary model slightly underestimates the happiness scores for “mamma” while overestimating the scores for “lying,” and “coronavirus.”

5. Discussion

As demand for sentiment-aware intelligent systems grows, we will continue to see improvements to both lexicon-based models and contextual language models. While contextualized models are suitable for a wide set of applications, lexicon-based models are used by computational linguists, journalists, and data scientists who are interested in studying how individual words contribute to sentiment trends.

Sentiment lexicons, however, have to be updated periodically to support new words and expressions that were not considered when the dictionaries were assembled. In this paper, we proposed two models for predicting sentiment scores to augment semantic dictionaries using word embeddings and pre-trained large language models. Our first model establishes a baseline using a neural network initialized with pre-trained word embeddings, while our second model features a deep Transformer-based network that brings into play word definitions to estimate their lexical polarity. Our results and evaluation of both models demonstrate human-level performance on a state-of-the-art human annotated list of words.

Although both models can predict scores for novel words, we acknowledge a few shortcomings. Our Token model relies on subword information to estimate a happiness score for any given word. For example, using subwords for “coronavirus” yields a good estimate given that it contains “virus.” By contrast, parsing character-level n-grams for other words (e.g., “covid”) may not reveal any further information. We can overcome that hurdle by using the word definition as input to our Dictionary model to gauge its happiness score. Words, however, often have different meanings based on context. Finding good definitions may be challenging, especially for slang, informal expressions, and abbreviations. We recommend using the Dictionary model whenever it is possible to outsource a good definition of the word.

A natural next step would be to develop similar models for other languages, for example by building a model for each language, or a multilingual model. Fortunately, FastText (Bojanowski et al., 2017) provides pre-trained word embeddings for over 100 languages. Therefore, it is easy to upgrade the Token model to support other languages. Updating the Dictionary model is also a straightforward task by simply adopting a multilingual Transformer-based model pre-trained with several languages [e.g., Multilingual BERT (Devlin et al., 2019)]. We caution against translating words and using the same English scores because most words do not have a one-to-one mapping into other languages, and are often used to express different meanings by the native speakers of any given language (Dodds et al., 2015).

Another vast space of improvement lies in adapting our proposed strategies to develop prediction models for other semantic dictionaries. Researchers can further fine-tune these models to predict other sentiment scores. For example, the happiness scores in the labMT (Dodds et al., 2015) dataset are closely aligned with the valence scores in the NRC-VAD lexicon (Mohammad, 2018). We envision future work developing similar models to predict other semantic differentials such as arousal and dominance (Mohammad, 2018), EPA (Osgood, 1962), and SocialSent (Hamilton et al., 2016). Our primary goal is to provide an easy and robust method to augment semantic dictionaries to empower researchers to maintain and expand them at a relatively low cost using today's state-of-the-art NLP methods.

Data Availability Statement

Our source code, along with pre-trained models, is publicly available on our GitLab repository (https://gitlab.com/compstorylab/sentiment-analysis).

Author Contributions

TA designed and developed the methods. TA and CV verified and analyzed data. TA, CV, MF, MA, CD, and PD edited the manuscript. CD and PD supervised the project. All authors provided critical feedback and helped shape the research, analysis and manuscript.

Funding

We are grateful for the computing resources provided by the Vermont Advanced Computing Core and financial support from Google and the Massachusetts Mutual Life Insurance Company. Computations were performed on the Vermont Advanced Computing Core supported in part by NSF award No. OAC-1827314.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Acknowledgments

We thank Anne Marie Stupinski, Julia Zimmerman and our colleagues at the Computational Story Lab for their insightful discussion and suggestions on this project.

References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., et al. (2016). “Tensorflow: a system for large-scale machine learning,” in Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation OSDI'16 (Berkeley, CA: USENIX Association), 265–283.

Agarwal, A., Xie, B., Vovsha, I., Rambow, O., and Passonneau, R. (2011). “Sentiment analysis of Twitter data,” in Proceedings of the Workshop on Language in Social Media (LSM 2011) (Portland: Association for Computational Linguistics), 30–38.

Aithal, M., and Tan, C. (2021). “On positivity bias in negative reviews,” in Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021) (Association for Computational Linguistics).

Alshaabi, T., Adams, J. L., Arnold, M. V., Minot, J. R., Dewhurst, D. R., Reagan, A. J., et al. (2021a). Storywrangler: a massive exploratorium for sociolinguistic, cultural, socioeconomic, and political timelines using Twitter. Sci. Adv. 7:eabe6534. doi: 10.1126/sciadv.abe6534

Alshaabi, T., Arnold, M. V., Minot, J. R., Adams, J. L., Dewhurst, D. R., Reagan, A. J., et al. (2021b). How the world's collective attention is being paid to a pandemic: COVID-19 related n-gram time series for 24 languages on Twitter. PLoS One 16:e0244476. doi: 10.1371/journal.pone.0244476

Alshari, E. M., Azman, A., Doraisamy, S., Mustapha, N., and Alkeshr, M. (2018). “Effective method for sentiment lexical dictionary enrichment based on Word2Vec for sentiment analysis,” in 2018 Fourth International Conference on Information Retrieval and Knowledge Management (CAMP) (Kota Kinabalu), 1–5.

Amir, S., F. Astudillo, R., Ling, W., Martins, B., Silva, M. J., and Trancoso, I. (2015). “INESC-ID: a regression model for large scale Twitter sentiment lexicon induction,” in Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015) (Denver, CO: Association for Computational Linguistics), 613–618.

Augustyniak, Ł., Szymański, P., Kajdanowicz, T., and Tuligłowicz, W. (2016). Comprehensive study on lexicon-based ensemble classification sentiment analysis. Entropy 18:4. doi: 10.3390/e18010004

Baccianella, S., Esuli, A., and Sebastiani, F. (2010). “SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining,” in Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10) (Valletta: European Language Resources Association (ELRA)).

Bajpai, R., Ho, D., and Cambria, E. (2016). “Developing a concept-level knowledge base for sentiment analysis in singlish,” in International Conference on Intelligent Text Processing and Computational Linguistics (Konya: Springer), 347–361.

Bakshi, R. K., Kaur, N., Kaur, R., and Kaur, G. (2016). “Opinion mining and sentiment analysis,” in 2016 3rd international Conference on Computing for Sustainable Global Development (INDIACom) (New Delhi: IEEE), 452–455.

Bathina, K. C., Ten Thij, M., Lorenzo-Luaces, L., Rutter, L. A., and Bollen, J. (2021). Individuals with depression express more distorted thinking on social media. Nat. Human Behav. 5, 458–466. doi: 10.1038/s41562-021-01050-7

Beigi, G., Hu, X., Maciejewski, R., and Liu, H. (2016). “An overview of sentiment analysis in social media and its applications in disaster relief,” in Sentiment Analysis and Ontology Engineering (Cham: Springer), 313–340.

Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. (2003). A neural probabilistic language model. J. Mach. Learn. Res. 3, 1137–1155. doi: 10.5555/944919.944966

Bhatt, A., Patel, A., Chheda, H., and Gawande, K. (2015). Amazon review classification and sentiment analysis. Int. J. Comput. Sci. Inf. Technol. 6, 5107–5110.

Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2017). Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146. doi: 10.1162/tacl_a_00051

Bradley, M. M., and Lang, P. J. (1999). Affective Norms for English Words (ANEW): Instruction Manual and Affective Ratings. Technical Report, Technical Report C-1, Gainesville, FL: Center for Research in Psychophysiology.

Cabral, L., and Hortacsu, A. (2010). The dynamics of seller reputation: evidence from eBay. J. Ind. Econ. 58, 54–78. doi: 10.1111/j.1467-6451.2010.00405.x

Cambria, E., Li, Y., Xing, F. Z., Poria, S., and Kwok, K. (2020). “Senticnet 6: ensemble application of symbolic and subsymbolic ai for sentiment analysis,” in Proceedings of the 29th ACM international conference on information & knowledge management 105–114.

Chen, H., Yang, C., Zhang, X., Liu, Z., Sun, M., and Jin, J. (2021). From Symbols to Embeddings: a Tale of Two Representations in Computational Social Science. Available online at: https://arxiv.org/abs/2106.14198

Chen, Q., Zhu, X., Ling, Z.-H., Wei, S., Jiang, H., and Inkpen, D. (2017). “Enhanced LSTM for natural language inference,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Vancouver, BC: Association for Computational Linguistics), 1657–1668.

Colhon, M., Vlăduţescu, Ş., and Negrea, X. (2017). How objective a neutral word is? a neutrosophic approach for the objectivity degrees of neutral words. Symmetry 9:280. doi: 10.3390/sym9110280

Conway, M., and O'Connor, D. (2016). Social media, big data, and mental health: current advances and ethical implications. Curr. Opin. Psychol. 9, 77–82. doi: 10.1016/j.copsyc.2016.01.004

Coppersmith, G., Dredze, M., and Harman, C. (2014). “Quantifying mental health signals in Twitter,” in Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality (Baltimore, MD), 51–60.

Crawford, K. (2019). Halt the use of facial-recognition technology until it is regulated. Nature 572, 565–566. doi: 10.1038/d41586-019-02514-7

Crawford, K., and Paglen, T. (2021). Excavating ai: the politics of images in machine learning training sets. AI & SOCIETY 9, 1–12. doi: 10.1007/s00146-021-01162-8

Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q., and Salakhutdinov, R. (2019). “Transformer-XL: attentive language models beyond a fixed-length context,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Florence: Association for Computational Linguistics), 2978–2988.

Darwich, M., Mohd Noah, S. A., Omar, N., and Osman, N. A. (2019). Corpus-based techniques for sentiment lexicon generation: a review. J. Digit. Inf. Manag. 17, 296. doi: 10.6025/jdim/2019/17/5/296-305

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). “BERT: pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (Minneapolis: Association for Computational Linguistics), 4171–4186.

Dodds, P. S., Clark, E. M., Desu, S., Frank, M. R., Reagan, A. J., Williams, J. R., et al. (2015). Human language reveals a universal positivity bias. Proc. Natl. Acad. Sci. U.S.A. 112, 2389–2394. doi: 10.1073/pnas.1411678112

Dodds, P. S., and Danforth, C. M. (2010). Measuring the happiness of large-scale written expression: songs, blogs, and presidents. J. Happiness Stud. 11, 441–456. doi: 10.1007/s10902-009-9150-9

Dodds, P. S., Harris, K. D., Kloumann, I. M., Bliss, C. A., and Danforth, C. M. (2011). Temporal patterns of happiness and information in a global social network: hedonometrics and Twitter. PLoS One 6:e26752. doi: 10.1371/journal.pone.0026752

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., et al. (2021). “An image is worth 16x16 words: transformers for image recognition at scale,” in Proceedings of the International Conference on Learning Representations ICRL'21.

Dowlagar, S., and Mamidi, R. (2021). “Graph convolutional networks with multi-headed attention for code-mixed sentiment analysis,” in Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages (Kyiv), 65–72.

Feldman, R. (2013). Techniques and applications for sentiment analysis. Commun. ACM 56, 82–89. doi: 10.1145/2436256.2436274

Fellbaum, C. editor (1998). “Language, Speech, and Communication. A Bradford Book,” in WordNet: An Electronic Lexical Database (Cambridge, MA: A Bradford Book).

Gal, Y., and Ghahramani, Z. (2016). “Dropout as a bayesian approximation: representing model uncertainty in deep learning,” in Balcan, M. F., and Weinberger, K. Q., editors. Proceedings of The 33rd International Conference on Machine Learning, Proceedings of Machine Learning Research Vol. 48. (New York, NY: PMLR), 1050–1059.

Gallagher, R. J., Frank, M. R., Mitchell, L., Schwartz, A. J., Reagan, A. J., Danforth, C. M., et al. (2021). Generalized word shift graphs: a method for visualizing and explaining pairwise comparisons between texts. EPJ Data Sci. 10:4. doi: 10.1140/epjds/s13688-021-00260-3

Gohil, S., Vuik, S., and Darzi, A. (2018). Sentiment analysis of health care tweets: review of the methods used. JMIR Public Health Surveillance 4:e43. doi: 10.2196/publichealth.5789

Hamilton, W. L., Clark, K., Leskovec, J., and Jurafsky, D. (2016). “Inducing domain-specific sentiment lexicons from unlabeled corpora,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (Austin, TX: Association for Computational Linguistics), 595–605.

Hansen, L. K., and Salamon, P. (1990). Neural network ensembles. IEEE Trans. Pattern Anal. Mach. Intell. 12, 993–1001. doi: 10.1109/34.58871

Haque, T. U., Saber, N. N., and Shah, F. M. (2018). “Sentiment analysis on large scale Amazon product reviews,” in 2018 IEEE International Conference on Innovative Research and Development (ICIRD) (Bangkok: IEEE), 1–6.

Harris, Z. S. (1954). Distributional structure. Word 10, 146–162. doi: 10.1080/00437956.1954.11659520

Hinton, G., Vinyals, O., and Dean, J. (2015). “Distilling the knowledge in a neural network,” in NIPS Deep Learning and Representation Learning Workshop. Montreal, QC.

Hochreiter, S., and Schmidhuber, J. (1997). Long short-term memory. Neural Comput. 9, 1735–1780.

Hoerl, A. E., and Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, 55–67. doi: 10.2307/1271436

Hollis, G., and Westbury, C. (2016). The principals of meaning: extracting semantic dimensions from co-occurrence models of semantics. Psychon. Bull. Rev. 23, 1744–1756. doi: 10.3758/S13423-016-1053-2

Hollis, G., Westbury, C., and Lefsrud, L. (2017). Extrapolating human judgments from skip-gram vector representations of word meaning. Quart. J. Exp. Psychol. 70, 1603–1619. doi: 10.1080/17470218.2016.1195417

Hovy, D., and Spruit, S. L. (2016). “The social impact of natural language processing,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (Berlin: Association for Computational Linguistics), 591–598.

Joulin, A., Grave, E., Bojanowski, P., and Mikolov, T. (2017). “Bag of tricks for efficient text classification,” in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers (Valencia: Association for Computational Linguistics), 427–431.

Kingma, D. P., and Ba, J. (2015). “Adam: a method for stochastic optimization,” in Bengio, Y., and LeCun, Y., editors, 3rd International Conference on Learning Representations, ICLR Conference Track Proceedings (San Diego, CA: ICLR).

Kiritchenko, S., Zhu, X., and Mohammad, S. M. (2014). Sentiment analysis of short informal texts. J. Artif. Intell. Res. 50, 723–762. doi: 10.1613/jair.4272

Kissler, J., and Herbert, C. (2013). Emotion, etmnooi, or emitoon?—faster lexical access to emotional than to neutral words during reading. Biol. Psychol. 92, 464–479. doi: 10.1016/j.biopsycho.2012.09.004

Koppel, M., and Schler, J. (2006). The importance of neutral examples for learning sentiment. Comput. Intell. 22, 100–109. doi: 10.1111/j.1467-8640.2006.00276.x

Korkontzelos, I., Nikfarjam, A., Shardlow, M., Sarker, A., Ananiadou, S., and Gonzalez, G. H. (2016). Analysis of the effect of sentiment analysis on extracting adverse drug reactions from tweets and forum posts. J. Biomed. Informat. 62, 148–158. doi: 10.1016/j.jbi.2016.06.007

Krogh, A., and Vedelsby, J. (1994). “Neural network ensembles, cross validation and active learning,” in Proceedings of the 7th International Conference on Neural Information Processing Systems NIPS'94 (Cambridge, MA: MIT Press), 231–238.

Kudo, T., and Richardson, J. (2018). “SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (Brussels: Association for Computational Linguistics), 66–71.

Kumar, A., and Lee, C. M. (2006). Retail investor sentiment and return comovements. J. Finance 61, 2451–2486. doi: 10.1111/j.1540-6261.2006.01063.x

Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., and Soricut, R. (2020). “ALBERT: a lite BERT for self-supervised learning of language representations,” in Proceedings of the International Conference on Learning Representations.

Laver, M., Benoit, K., and Garry, J. (2003). Extracting policy positions from political texts using words as data. Amer. Polit. Sci. Rev. 97, 311–331. doi: 10.1017/S0003055403000698

Lee, K., He, L., Lewis, M., and Zettlemoyer, L. (2017). “End-to-end neural coreference resolution,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (Copenhagen: Association for Computational Linguistics), 188–197.

Li, M., Lu, Q., Long, Y., and Gui, L. (2017). Inferring affective meanings of words from word embedding. IEEE Trans. Affect. Comput. 8, 443–456. doi: 10.1109/TAFFC.2017.2723012

Liao, W., Zeng, B., Liu, J., Wei, P., Cheng, X., and Zhang, W. (2021). Multi-level graph neural network for text sentiment analysis. Comput. Elect. Eng. 92:107096. doi: 10.1016/j.compeleceng.2021.107096

Ljubešić, N., Fišer, D., and Peti-Stantić, A. (2018). “Predicting concreteness and imageability of words within and across languages via word embeddings,” in Proceedings of The Third Workshop on Representation Learning for NLP (Melbourne, VIC: Association for Computational Linguistics), 217–222.

Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. (2011). “Learning word vectors for sentiment analysis,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (Portland, OR: Association for Computational Linguistics), 142–150.

Mayzner, M. S., and Tresselt, M. E. (1965). Tables of single-letter and digram frequency counts for various word-length and letter-position combinations. Psychonomic Monograph Supplements 1, 13–32.

Medhat, W., Hassan, A., and Korashy, H. (2014). Sentiment analysis algorithms and applications: a survey. Ain Shams Eng. J. 5, 1093–1113. doi: 10.1016/j.asej.2014.04.011

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). “Efficient estimation of word representations in vector space,” in Bengio, Y., and LeCun, Y., editors, 1st International Conference on Learning Representations, ICLR 2013, Workshop Track Proceedings (Scottsdale, AZ).

Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013b). “Distributed representations of words and phrases and their compositionality,” in Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13 (Red Hook, NY: Curran Associates Inc.), 3111–3119.

Miller, G. A., Newman, E. B., and Friedman, E. A. (1958). Length-frequency statistics for written English. Inf. Control 1, 370–389. doi: 10.1016/S0019-9958(58)90229-8

Mohammad, S. (2018). “Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Melbourne, VIC: Association for Computational Linguistics), 174–184.

Nasukawa, T., and Yi, J. (2003). “Sentiment analysis: capturing favorability using natural language processing,” in Proceedings of the 2nd International Conference on Knowledge Capture (New York, NY), 70–77.

Osgood, C. E. (1962). Studies on the generality of affective meaning systems. Amer. Psychol. 17:10. doi: 10.1037/h0045146

Pak, A., and Paroubek, P. (2010). “Twitter as a corpus for sentiment analysis and opinion mining,” in Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10) (Valletta: European Language Resources Association (ELRA)).

Pang, B., and Lee, L. (2008). Opinion mining and sentiment analysis. Found. Trends Inf. Retrieval 2, 1–135. doi: 10.1561/1500000011

Pang, B., Lee, L., and Vaithyanathan, S. (2002). “Thumbs up? sentiment classification using machine learning techniques,” in Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10 EMNLP '02 (Philadelphia, PA: Association for Computational Linguistics), 79–86.

Pennington, J., Socher, R., and Manning, C. (2014). “GloVe: Global vectors for word representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (Doha: Association for Computational Linguistics), 1532–1543.

Peters, M., Ammar, W., Bhagavatula, C., and Power, R. (2017). “Semi-supervised sequence tagging with bidirectional language models,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Vancouver, BC: Association for Computational Linguistics), 1756–1765.

Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). “Deep contextualized word representations,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (New Orleans: Association for Computational Linguistics), 2227–2237.

Qiu, G., Liu, B., Bu, J., and Chen, C. (2009). “Expanding domain sentiment lexicon through double propagation,” in Proceedings of the 21st International Joint Conference on Artificial Intelligence IJCAI'09 (Pasadena, CA), 1199–1204.

Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018). Improving language understanding by generative pre-training. Available online at: https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf

Reagan, A. J., Danforth, C. M., Tivnan, B., Williams, J. R., and Dodds, P. S. (2017). Sentiment analysis methods for understanding large-scale texts: a case for using continuum-scored words and word shift graphs. EPJ Data Sci. 6, 1–21. doi: 10.1140/epjds/s13688-017-0121-9

Ribeiro, F. N., Araújo, M., Gonçalves, P., Gonçalves, M. A., and Benevenuto, F. (2016). SentiBench - a benchmark comparison of state-of-the-practice sentiment analysis methods. EPJ Data Sci. 5, 1–29. doi: 10.1140/epjds/s13688-016-0085-1

Riloff, E. (1996). An empirical study of automated dictionary construction for information extraction in three domains. Artif. Intell. 85, 101–134. doi: 10.1016/0004-3702(95)00123-9

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). “Learning internal representations by error propagation,” in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1 (Cambridge, MA: MIT Press), 318–362.

San Vicente, I., Agerri, R., and Rigau, G. (2014). “Simple, robust and (almost) unsupervised generation of polarity lexicons for multiple languages,” in Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (Gothenburg: Association for Computational Linguistics), 88–97.

Sanh, V., Debut, L., Chaumond, J., and Wolf, T. (2019). “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter,” in 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing at NeurIPS 2019 (Vancouver, BC).

Sennrich, R., Haddow, B., and Birch, A. (2016). “Neural machine translation of rare words with subword units,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Berlin: Association for Computational Linguistics), 1715–1725.

Shmueli, B., Fell, J., Ray, S., and Ku, L.-W. (2021). “Beyond fair pay: Ethical implications of NLP crowdsourcing,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Association for Computational Linguistics), 3758–3769.

Snyder, B., and Barzilay, R. (2007). “Multiple aspect ranking using the good grief algorithm,” in Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference (Rochester, NY: Association for Computational Linguistics), 300–307.

Socher, R., Chen, D., Manning, C. D., and Ng, A. (2013a). “Reasoning with neural tensor networks for knowledge base completion,” in Advances in Neural Information Processing Systems (Lake Tahoe, NV: Curran Associates Inc.), 926–934.

Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A., et al. (2013b). “Recursive deep models for semantic compositionality over a sentiment treebank,” in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (Seattle, WA: Association for Computational Linguistics), 1631–1642.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958. doi: 10.5555/2627435.2670313

Stupinski, A. M., Alshaabi, T., Arnold, M. V., Adams, J. L., Minot, J. R., Price, M., et al. (2021). Quantifying Language Changes Surrounding Mental Health on Twitter. Available online at: https://arxiv.org/abs/2106.01481

Taboada, M., Brooke, J., Tofiloski, M., Voll, K., and Stede, M. (2011). Lexicon-based methods for sentiment analysis. Comput. Linguist. 37, 267–307. doi: 10.1162/COLI_a_00049

Tang, D., Wei, F., Qin, B., Zhou, M., and Liu, T. (2014). “Building large-scale Twitter-specific sentiment lexicon: a representation learning approach,” in Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers (Dublin: Dublin City University and Association for Computational Linguistics), 172–182.

Tang, H., Tan, S., and Cheng, X. (2009). A survey on sentiment detection of reviews. Exp. Syst. Appl. 36, 10760–10773. doi: 10.1016/j.eswa.2009.02.063

Tatman, R. (2017). “Gender and dialect bias in YouTube's automatic captions,” in Proceedings of the First ACL Workshop on Ethics in Natural Language Processing (Valencia: Association for Computational Linguistics), 53–59.

Terveen, L., Hill, W., Amento, B., McDonald, D., and Creter, J. (1997). PHOAKS: a system for sharing recommendations. Commun. ACM 40, 59–62. doi: 10.1145/245108.245122

Thavareesan, S., and Mahesan, S. (2020). “Sentiment lexicon expansion using word2vec and fastText for sentiment prediction in Tamil texts,” in 2020 Moratuwa Engineering Research Conference (MERCon) (Moratuwa), 272–276.

Thelwall, M., Buckley, K., Paltoglou, G., Cai, D., and Kappas, A. (2010). Sentiment strength detection in short informal text. J. Amer. Soc. Inf. Sci. Technol. 61, 2544–2558. doi: 10.1002/asi.21416

Thomas, M., Pang, B., and Lee, L. (2006). “Get out the vote: determining support or opposition from congressional floor-debate transcripts,” in Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (Sydney, NSW: Association for Computational Linguistics), 327–335.

Thompson, N. C., Greenewald, K., Lee, K., and Manso, G. F. (2020). The Computational Limits of Deep Learning. Available online at: https://arxiv.org/abs/2007.05558

Tumasjan, A., Sprenger, T., Sandner, P., and Welpe, I. (2010). “Predicting elections with Twitter: what 140 characters reveal about political sentiment,” in Proceedings of the International AAAI Conference on Web and Social Media, Vol. 4 (Washington, DC).

Turney, P. D. (2002). “Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics ACL '02 (Philadelphia, PA: Association for Computational Linguistics), 417–424.

Turney, P. D., and Littman, M. L. (2003). Measuring praise and criticism: Inference of semantic orientation from association. ACM Trans. Inf. Syst. 21, 315–346. doi: 10.1145/944012.944013

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). “Attention is all you need,” in Advances in Neural Information Processing Systems, eds I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Vol. 30. (Long Beach, CA: Curran Associates, Inc.).

Wang, J., Yu, L.-C., Lai, K. R., and Zhang, X. (2016). Community-based weighted graph model for valence-arousal prediction of affective words. IEEE/ACM Trans. Audio Speech Lang. Process. 24, 1957–1968. doi: 10.1109/TASLP.2016.2594287

Wilson, T., Wiebe, J., and Hoffmann, P. (2005). “Recognizing contextual polarity in phrase-level sentiment analysis,” in Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing HLT '05 (Vancouver, BC: Association for Computational Linguistics), 347–354.

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., et al. (2020). “Transformers: state-of-the-art natural language processing,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (Association for Computational Linguistics), 38–45.

Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., et al. (2016). Google's Neural Machine Translation System: Bridging the Gap Between Human and Machine Translation. Available online at: https://arxiv.org/abs/1609.08144

Yadollahi, A., Shahraki, A. G., and Zaiane, O. R. (2017). Current state of text sentiment analysis from opinion to emotion mining. ACM Comput. Surveys 50, 1–33. doi: 10.1145/3057270

Yang, S., Xing, L., Li, Y., and Chang, Z. (2021). Implicit sentiment analysis based on graph attention neural network. Eng. Rep. e12452. doi: 10.1002/eng2.12452

Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. R., and Le, Q. V. (2019). “XLNet: generalized autoregressive pretraining for language understanding,” in Advances in Neural Information Processing Systems, Vol. 32 (Vancouver, BC: Curran Associates, Inc.).

Yu, Y., Duan, W., and Cao, Q. (2013). The impact of social and conventional media on firm equity value: a sentiment analysis approach. Decis. Support Syst. 55, 919–926. doi: 10.1016/j.dss.2012.12.028

Yuan, M., and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. Roy. Stat. Soc. B (Stat. Methodol.) 68, 49–67. doi: 10.1111/j.1467-9868.2005.00532.x

Zou, H., and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. Roy. Stat. Soc. B (Stat. Methodol.) 67, 301–320. doi: 10.1111/j.1467-9868.2005.00503.x

Keywords: sentiment analysis, semantic lexicons, transformers, BERT, FastText, word embedding, labMT

Citation: Alshaabi T, Van Oort CM, Fudolig MI, Arnold MV, Danforth CM and Dodds PS (2022) Augmenting Semantic Lexicons Using Word Embeddings and Transfer Learning. Front. Artif. Intell. 4:783778. doi: 10.3389/frai.2021.783778

Received: 26 September 2021; Accepted: 20 December 2021;
Published: 24 January 2022.

Edited by:

Preslav Nakov, Qatar Computing Research Institute, Qatar

Reviewed by:

Svetlana Kiritchenko, National Research Council Canada (NRC-CNRC), Canada
Erik Cambria, Nanyang Technological University, Singapore

Copyright © 2022 Alshaabi, Van Oort, Fudolig, Arnold, Danforth and Dodds. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Thayer Alshaabi, thayeralshaabi@berkeley.edu
