Edited by: Yiming Ying, University at Albany, United States

Reviewed by: Shao-Bo Lin, Xi'an Jiaotong University (XJTU), China; Sijia Liu, Mayo Clinic, United States

This article was submitted to Mathematics of Computation and Data Science, a section of the journal Frontiers in Applied Mathematics and Statistics

†These authors have contributed equally to this work

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Word embeddings have introduced a compact and efficient way of representing text for further downstream natural language processing (NLP) tasks. Most word embedding algorithms are optimized at the word level. However, many NLP applications require text representations of groups of words, like sentences or paragraphs. In this paper, we propose a supervised algorithm that produces a task-optimized weighted average of word embeddings for a given task. Our proposed text embedding algorithm combines the compactness and expressiveness of the word-embedding representations with the word-level insights of a BoW-type model, where weights correspond to actual words. Numerical experiments across different domains show the competence of our algorithm.

Word embeddings, or a learned mapping from a vocabulary to a vector space, are essential tools for state-of-the-art Natural Language Processing (NLP) techniques. Dense word vectors, like Word2Vec [

Most downstream tasks, like sentiment analysis and information retrieval (IR), are used to analyze groups of words, like sentences or paragraphs. For this paper, we refer to this more general embedding as a “text embedding.”

In this paper, we propose a supervised algorithm that produces sentence-level embeddings consisting of a weighted average of an available pre-trained word-level embedding. The resulting sentence-level embedding is optimized for the corresponding supervised learning task. The weights that the proposed algorithm produces can be used to estimate the importance of the words with respect to the supervised task. For example, when classifying movie reviews into one of two classes, action movies or romantic movies, words like “action,” “romance,” “love,” and “blood” will get precedence over words like “movie,” “i,” and “theater.” This shifts the text-level vector toward the words with larger weights, as can be seen in

Unweighted (UAEm)

When we use an unweighted averaged word embedding (

Our algorithm has many advantages over simpler text embedding approaches, like bag-of-words (BoW) and the averaged word-embedding schemes discussed in section 2. In section 4, we show results from experiments on different datasets. In general, we observed that our algorithm is competitive with other methods. Unlike the simpler algorithms, our approach finds a

It is important to note that other deep-learning-based approaches for text classification also implicitly optimize the text-level representation from word-level embeddings in the top layers of the neural network. However, training such models requires large datasets. Our empirical results show that our proposed representation is in general competitive with traditional deep-learning-based text classification approaches and outperforms them when the training data is relatively small.

Additionally, by assigning an importance weight to each word in the vocabulary, our algorithm yields a more interpretable result than inspecting the weights corresponding to the word-embedding dimensions, which have no human-interpretable meaning. Effectively, our text embedding algorithm combines the compactness and expressiveness of the word-embedding representations with the human-interpretability of a BoW-type model.

Furthermore, in contrast with some deep-learning-based approaches, our approach does not impose constraints or require special processing (trimming, padding) with respect to the length of the sentence or text to be classified. In summary, the contributions of the paper are as follows:

Our algorithm provides a task-optimized text embedding from word-level embeddings.

Our algorithm outperforms other, more complex algorithms when the training data is relatively small.

Our algorithm can be implemented in a straightforward way by leveraging existing libraries, as it only requires access to an SVM implementation.

Our resulting task-specific text embeddings are as compact as the original word-level embedding while providing word-level insights similar to a BoW-type model.

The rest of the paper is organized as follows: in section 2, we discuss related work. In section 3, we present a detailed explanation and mathematical justification to support our proposed algorithm. In section 4, we present and describe our proposed algorithm.

Various representation techniques for text have been introduced over time. In recent years, none of these representations has been as popular as word embeddings, such as Word2Vec [

Text embedding has been a more challenging problem than word embedding due to the variability of phrases, sentences, and text. Le and Mikolov [

Methods have been developed that use word embeddings to generate text embeddings without having to train on whole texts. These methods are less costly than the ones that train directly on whole text, and can be implemented faster.

Unweighted average word embedding [

The unsupervised text embedding methods face the problem of importance-allocation of words while computing the embedding. This is important, as word importance determines how biased the text embedding needs to be toward the more informative words. DeBoom et al. [

Our method generates weights based on the importance of the words perceived through a supervised approach. We use classifiers to determine the weights of the words based on their importance captured through the procedure. The advantage of this method over other methods is that we keep the simplicity of Wieting's algorithm [

A sentence, paragraph, or document can be represented using a given word-level embedding (

where,

we will assume that _{i} is the ^{m×n} containing a collection of

λ_{j} ∈ _{j} ∈ _{j} = 1, ∀

δ_{ij} is a normalized occurrence count. It is the number of times

Our proposed algorithm assumes that we have a supervised classification problem for which we want to find an optimal representation at the document (text) level from the word embeddings.

More concretely, we consider the problem of classifying ^{n}, represented by the _{i} in the classes +1 or −1 as specified by a given

In general, this linear classification problem can be formulated as follows:

where,

^{m×1} is a column of ones;

^{m×1} is a slack vector;

(^{(n+1) ×1} represents the separating hyperplane.

Note that, if

which we will denote by

From (1), we can rewrite

That is,

where Δ ∈ ℝ^{m×k} is the matrix of occurrence counts, with δ_{ij} in the (i, j) position, and the k × n matrix is the one whose rows are all the word2vec vectors considered in the word2vec corpus or dictionary.

From (3) and (6),

where Λ = diag(λ_{1}, …, λ_{k}).
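As a rough illustration of the weighted-average representation discussed in this section, the following NumPy sketch builds a matrix of normalized occurrence counts and forms text embeddings as the product of the counts, a diagonal weight matrix, and the word-vector matrix. The toy vocabulary, the random word vectors, the weight values, the choice to normalize by the number of in-vocabulary words, and the names `occurrence_matrix`, `embed_texts`, `delta`, `lam`, and `W` are all assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def occurrence_matrix(texts, vocab):
    """Normalized occurrence counts: entry (i, j) is the number of times word j
    appears in text i, divided by the number of in-vocabulary words in text i
    (the exact normalization is an assumption of this sketch)."""
    index = {word: j for j, word in enumerate(vocab)}
    delta = np.zeros((len(texts), len(vocab)))
    for i, text in enumerate(texts):
        for token in text.lower().split():
            if token in index:
                delta[i, index[token]] += 1.0
        total = delta[i].sum()
        if total > 0:
            delta[i] /= total
    return delta

def embed_texts(delta, lam, W):
    """Text embeddings delta @ diag(lam) @ W, one row per text."""
    # Multiplying column j of delta by lam[j] is the same as delta @ diag(lam).
    return (delta * lam) @ W

# Hypothetical toy data: 4 words with random 5-dimensional word vectors.
vocab = ["action", "love", "movie", "blood"]
rng = np.random.default_rng(0)
W = rng.normal(size=(len(vocab), 5))
texts = ["action movie blood action", "love movie"]

delta = occurrence_matrix(texts, vocab)
unweighted = embed_texts(delta, np.ones(len(vocab)), W)  # all weights equal to 1
lam = np.array([2.0, 2.0, 0.1, 1.5])                     # made-up word weights
weighted = embed_texts(delta, lam, W)
```

With all weights set to 1, the sketch reduces to the unweighted average of the word vectors in each text; with unequal weights, each text vector moves toward the words that carry larger weights.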

Formulation (7) is a biconvex optimization problem, which can be solved using alternate optimization [_{1}, …, λ_{k}) which can be used to interpret classification results for the specific tasks at hand. Though we could have restricted the λ_{i} to be positive, we choose to leave them unconstrained in order to make our algorithm more scalable and computationally efficient. Another interesting consideration would be to add a relative-importance constraint, in addition to the non-negativity bounds, of the form:

but again, we choose not to for computational efficiency. We will explore this option in the future.

In (7), if we fix Λ to a constant

We can obtain the corresponding optimal solution for (

On the other hand, if we fix

where

Similarly, from (7) and (10) and making

since

We can obtain an approximate optimal (

We are ready now to describe our proposed alternate optimization (AO) algorithm to solve formulation (7).
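The alternating scheme can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: the function name `fit_optimal_embedding`, the use of `LinearSVC` with its default squared hinge loss, re-fitting the intercept in both steps, the stopping rule, and the synthetic data are all choices made here for illustration. The key observation is that, for a fixed hyperplane direction w, the decision value of text i is the sum over j of λ_j δ_ij (v_j · w), where v_j is the j-th word vector, so it is linear in the word weights.

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_optimal_embedding(delta, W, y, n_iter=10, C=1.0, tol=1e-4):
    """Alternate between fitting the hyperplane and fitting the word weights.

    delta : (m, k) normalized occurrence counts
    W     : (k, n) pre-trained word vectors, one row per vocabulary word
    y     : (m,) labels in {-1, +1}
    """
    lam = np.ones(delta.shape[1])   # start from the unweighted average
    history = []                    # mean absolute change of lam per iteration
    for _ in range(n_iter):
        # Step 1: word weights fixed, so this is a linear SVM on text embeddings.
        A = (delta * lam) @ W
        svm_w = LinearSVC(C=C, dual=False, max_iter=10000).fit(A, y)
        w = svm_w.coef_.ravel()
        # Step 2: hyperplane direction fixed. The decision value is linear in
        # lam, with features delta[i, j] * (W[j] @ w).
        Z = delta * (W @ w)
        svm_lam = LinearSVC(C=C, dual=False, max_iter=10000).fit(Z, y)
        new_lam = svm_lam.coef_.ravel()
        history.append(float(np.mean(np.abs(new_lam - lam))))
        lam = new_lam
        if history[-1] < tol:
            break
    # Refit the hyperplane on the final embedding so it matches the returned weights.
    clf = LinearSVC(C=C, dual=False, max_iter=10000).fit((delta * lam) @ W, y)
    return lam, clf, history

# Hypothetical synthetic problem: 40 texts, 6 words, 4-dimensional word vectors.
rng = np.random.default_rng(0)
counts = rng.integers(0, 4, size=(40, 6)).astype(float)
counts[:, 0] += 1.0                                  # avoid empty texts
delta = counts / counts.sum(axis=1, keepdims=True)
W = rng.normal(size=(6, 4))
y = np.where(delta[:, 1] > delta[:, 2], 1, -1)       # label depends on two words

lam, clf, history = fit_optimal_embedding(delta, W, y)
pred = clf.predict((delta * lam) @ W)
```

The `history` list records the average change in the weights between consecutive iterations, which is one simple way to monitor whether the alternation is settling down.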

One of the advantages of the algorithm is that it can be easily implemented by using existing open-source SVM libraries, like the ones included in scikit-learn [

The optimal text embedding algorithm, then, inherits the convergence properties and characteristics of the AO problems [

In order to further reduce the computational complexity of the proposed algorithm, we can consider a simplified loss function

Fixing

This formulation corresponds to a least-squares or Proximal SVM formulation [

If

On the other hand, fixing

since

Furthermore,

From (17) and (18),

For some problems, Δ^{T}Δ can be ill-conditioned, which may lead to incorrect values for

where ϵ is a very small value.

Note that (Δ^{T}Δ + ϵI)^{−1} involves calculating the inverse of a

with ^{T}. Then (Δ^{T}Δ + ϵI)^{−1} becomes

which involves inverting an
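The rewriting described here appears to be an application of the Sherman-Morrison-Woodbury formula. Assuming so, and reading the sizes off Δ ∈ ℝ^{m×k}, the direct route inverts a k × k matrix, while the identity (Δ^{T}Δ + ϵI)^{−1} = (1/ϵ)(I − Δ^{T}(ϵI + ΔΔ^{T})^{−1}Δ) only inverts an m × m matrix, which is cheaper when there are fewer texts than vocabulary words. The sketch below checks the identity numerically; the function names and the random test matrix are assumptions made for illustration.

```python
import numpy as np

def regularized_inverse_direct(delta, eps):
    """Invert the k x k matrix delta.T @ delta + eps * I directly."""
    k = delta.shape[1]
    return np.linalg.inv(delta.T @ delta + eps * np.eye(k))

def regularized_inverse_woodbury(delta, eps):
    """Same matrix via the Sherman-Morrison-Woodbury identity.

    Only the m x m matrix eps * I + delta @ delta.T is inverted.
    """
    m, k = delta.shape
    small_inverse = np.linalg.inv(eps * np.eye(m) + delta @ delta.T)
    return (np.eye(k) - delta.T @ small_inverse @ delta) / eps

# Hypothetical sizes: 5 texts and a 50-word vocabulary, so m is much smaller than k.
rng = np.random.default_rng(0)
delta = rng.random((5, 50))
eps = 1e-2

direct = regularized_inverse_direct(delta, eps)
woodbury = regularized_inverse_woodbury(delta, eps)
same = np.allclose(direct, woodbury, atol=1e-6)
```

In practice one would usually solve the linear system rather than form an explicit inverse; the explicit inverses are used here only to make the comparison direct.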

The

Algorithm 1 can be modified to consider formulation (3) instead of (13) by making two simple changes:

Substitute line 6 of Algorithm 1 by: Solve Equation (15) to obtain (_{i}, γ);

Substitute line 8 of Algorithm 1 by: Solve Equation (19) to obtain _{i};

Optimal Text Embedding

We used binary classification as the task for evaluating our algorithm performance by comparing it to the following methods:

Note that in both CNN experiments we initialized the embedding layer with pre-trained word2vec models and kept these vectors static.

We implemented two versions of our Algorithm 1: SVM-based (

In SVM-OptEm, we used a support vector machine (SVM) [

In LSSVM-OptEm, we used a least squares support vector machine (LS-SVM) [

To showcase the performance of our model, we chose fifteen different binary classification tasks over the subsets of different datasets. Twelve public datasets are briefly described in

Brief description of the public datasets used for our experiments.

Dataset | Description | Positive class | Reference
20NEWSGRP-SCI | 20 Newsgroup documents | Science-related documents | [
AMZN-EX | Amazon reviews | Electronics review | [
AMZNBK-SENT | Amazon book reviews | Positive review | [
BBC | BBC news articles | Sports article | [
BLOG-GENDER | Blog articles | Male writer | [
DBPEDIA | Wikipedia articles | Artist article | [
IMDB | IMDB movie reviews | Positive review | [
SCIPAP | Sentences from scientific papers | Owner-written sentence | [
SST | Movie reviews | Positive sentiment | [
YAHOO-ANS | Questions from Yahoo's question-answer dataset | Health-related question | [
YELP-REST | Yelp restaurant reviews | Restaurant-related review | [
YELP-STAR | Yelp reviews | Positive review | [

We also performed experiments on three datasets belonging to the insurance domain.

We chose to work with different word2vec-based word embeddings. These embeddings were either pre-trained models or models trained in-house. Each embedding was paired with the datasets for which it was contextually relevant.

We used different word2vec models to verify that our model works well independently of the underlying embedding representation. Moreover, choosing a contextually relevant model gives a better contextual representation of the words in these datasets.

The text was processed in a way similar to the processing used for training the word2vec models. This ensured that word occurrences in the dataset were consistent with the model that would be used for mapping the words.

Different word2vec models had different processing procedures, such as substitutions based on regular expressions, removal of non-alphabetical words, and lowercasing the text. Accordingly, text-processing was done for the training data.
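The kinds of processing steps listed above can be sketched as a small Python function. The specific regular expressions (stripping URLs and punctuation), the order of the steps, and the name `preprocess` are assumptions made for illustration; the actual procedures varied with the word2vec model and are not specified in detail here.

```python
import re

def preprocess(text):
    """Lowercase, apply regex substitutions, and drop non-alphabetical tokens."""
    text = text.lower()
    # Example regex substitutions (assumed): remove URLs, then punctuation.
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^\w\s]", " ", text)
    # Keep only purely alphabetical tokens.
    return [token for token in text.split() if token.isalpha()]

tokens = preprocess("Great movie!!! Visit http://x.com 10/10")
# tokens == ["great", "movie", "visit"]
```

Applying the same function to the training texts and to the corpus used for the embedding keeps the token inventory consistent between the two.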

To compare the performance of the algorithms tested, we used the area under the curve (AUC) for evaluation. This metric was chosen so that unbalanced datasets would not distort the comparison, as they can when accuracy is used.
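A small example may help show why AUC is less affected by class imbalance than accuracy. In the sketch below, a classifier that ignores its input and always favors the majority class reaches high accuracy on an unbalanced dataset, while its AUC stays at chance level. The 5/95 class split and the variable names are made up for illustration; `accuracy_score` and `roc_auc_score` are the standard scikit-learn metrics.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical unbalanced dataset: 5 positive and 95 negative examples.
y_true = np.array([1] * 5 + [-1] * 95)

# A classifier that ignores its input: same score and same label for every example.
constant_scores = np.zeros(100)
constant_labels = -np.ones(100, dtype=int)

acc = accuracy_score(y_true, constant_labels)   # 0.95, driven by the majority class
auc = roc_auc_score(y_true, constant_scores)    # 0.5, no ranking ability
```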

The performance of our models for the experiments can be seen in

Binary text classification AUC and accuracy results for test data for:

20NEWSGRP-SCI | Google-news | 86 | 3,000 | 2,000 | 0.904 | 0.9053 | 0.9040 | 0.9081 | 0.9150 | 0.9139
AMZN-EX | Wikipedia | 100 | 10,000 | 10,000 | 0.9914 | 0.9897 | 0.9887 | 0.9822 | 0.9838 | 0.9921
AMZNBK-SENT | Amzn | 5 | 10,000 | 10,000 | 0.9294 | 0.9218 | 0.9344 | 0.9269 | 0.9294 | 0.9273
BBC | Google-news | 458 | 1,850 | 500 | 0.9978 | 0.9973 | 0.9959 | 0.9946 | 0.9921 | 0.9948
BLOG-GENDER | Wikipedia | 422 | 2,000 | 1,000 | 0.7668 | 0.7536 | 0.7813 | 0.7580 | 0.7428 | 0.7569
DBPEDIA | Wikipedia | 48 | 10,000 | 10,000 | 0.9921 | 0.9870 | 0.9930 | 0.9935 | 0.9934 | 0.9974
IMDB | Wikipedia | 237 | 5,000 | 2,500 | 0.9116 | 0.8935 | 0.9209 | 0.9206 | 0.8981 | 0.9102
SCIPAP | Wikipedia | 26 | 1,500 | 750 | 0.8630 | 0.8515 | 0.9208 | 0.9205 | 0.8973 | 0.9105
SST | Google-news | 11 | 10,000 | 10,000 | 0.9016 | 0.8990 | 0.9040 | 0.8967 | 0.8722 | 0.9168
YAHOO-ANS | Wikipedia | 12 | 20,000 | 10,000 | 0.9316 | 0.9293 | 0.9287 | 0.9248 | 0.8819 | 0.9280
YELP-REST | Yelp | 117 | 40,000 | 40,000 | 0.9733 | 0.9709 | 0.9696 | 0.9627 | 0.9342 | 0.9773
YELP-STAR | Yelp | 125 | 20,000 | 10,000 | 0.9707 | 0.9652 | 0.9707 | 0.9665 | 0.9567 | 0.9747
BI-1 | Claim-notes | 128 | 1,508 | 561 | 0.8850 | 0.8270 | 0.8852 | 0.9014 | 0.7023 | 0.7907
BI-2 | Claim-notes | 137 | 1,081 | 238 | 0.7653 | 0.666 | 0.8007 | 0.5403 | 0.4875 | 0.5640
TRANSCRIPTS | Transcript | 828 | 5,000 | 3,000 | 0.9616 | 0.9604 | 0.9638 | 0.9620 | 0.9617 | 0.9736

Our algorithm provides better or comparable performance against

AUC scores of test data over iterations.

It is important to note that our proposed algorithm tends to achieve better AUC performance when the training data is small, which is the case for many scenarios in the insurance domain, where labels are difficult and expensive to obtain. This makes the algorithm a good choice for active learning frameworks, where labels are scarce, especially in the early iterations of such approaches.

In general, our algorithm approached an “equilibrium” stage for the vector

Average of weight difference over iteration for three datasets. This difference approaches zero over iterations.

One of the advantages our model holds over

Weights of top 15 words identified by OptEm for two of the datasets used in our experiments. The words appear to be very informative; some can easily be associated with the corresponding class.

λ^{*} learned from the algorithm and rank them in descending order. To the human eye, these words clearly make sense with respect to the given classification task. For example,

Top 10 words with highest absolute weights for AMZN-EX, YAHOO-ANS, and DBPEDIA.

AMZN-EX | YAHOO-ANS | DBPEDIA
Book | Period | Born
Sound | Profile | Author
Product | Mushrooms | Singer
Player | Medicare | Directed
Use | Daily | Album
Unit | Youngest | Artist
Price | Longest | Writer
Quality | Anger | Known
Lens | Aerobics | Musician
Radio | Confirm | Novelist

We also found that words that are least informative about the given task have weights(λ^{*}) close to zero.
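Ranking words by the magnitude of their learned weights is straightforward once the weight vector is available. The sketch below assumes the weights are stored in a NumPy array aligned with the vocabulary; the function name `top_words`, the toy vocabulary, and the weight values are made up for illustration and are not results from the experiments.

```python
import numpy as np

def top_words(vocab, lam, n=10):
    """Return the n words with the largest absolute weights, in descending order."""
    order = np.argsort(-np.abs(lam), kind="stable")[:n]
    return [(vocab[j], float(lam[j])) for j in order]

# Hypothetical vocabulary and learned weights.
vocab = ["lens", "movie", "radio", "i", "price"]
lam = np.array([1.8, 0.02, -1.2, 0.0, 0.9])

ranking = top_words(vocab, lam, n=3)
# ranking == [("lens", 1.8), ("radio", -1.2), ("price", 0.9)]
```

Sorting by absolute value keeps words with large negative weights near the top, since they can be just as informative for the opposite class; words with weights close to zero fall to the bottom of the ranking.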

Following our results presented in

Our method is competitive with more sophisticated models. In fact, it achieves the best result on 7 out of 15 text classification tasks from various domains.

Our method seems to significantly outperform other approaches when the dataset is relatively small. This may be very relevant in situations where labeled data is expensive to obtain, which is often the case in many industrial applications.

Our paper provides an alternative sentence/document-level representation for supervised text classification, based on optimizing the weights of the words in the text to be classified. This approach takes labels into consideration when generating the optimal weights for these words. Numerical experiments show that our proposed algorithm is competitive with other state-of-the-art techniques and outperformed CNNs when the training data was small. We also show that this approach is not sensitive to document length.

Our model also brings additional benefits to the table. It provides a ranking of the relevance of the words with respect to the text classification problem at hand. This ranking of words by importance can be used for different NLP applications related to the same task, such as extraction-based summarization, context-matching, and text cleaning. By learning the optimal weights of the words, our model also tends to remove or ignore less informative words, thus performing its own version of feature selection. Our text embedding algorithm combines the compactness and expressiveness of the word-embedding representations with the human-interpretability of a BoW-type model.

We intend to extend this work to make the proposed algorithm more scalable in order to incorporate larger, more complex classification models and tasks, such as multi-label, multi-class classification and summarization.

We want to explore using other normalizations and constraints on the weight vector. One possibility is to explore 1-norm regularization of the weight vector to make it sparser and obtain a more aggressive feature (word) selection. Another interesting direction is to consider group regularization similar [

The datasets generated for this study are available on request to the corresponding author.

GF was the technical advisor and the central figure for driving this project to completion. SG was responsible for running all the initial set of experiments and dataset preparation. TK was responsible for finishing all of the remaining experiments and manuscript writing. DC was part of research discussions and brainstorming.

Authors were employed by the company American Family Insurance. The authors declare that this study received funding from American Family Insurance. The funder was not involved in the study design, collection, analysis, interpretation of data, the writing of this article or the decision to submit it for publication.