Task-Optimized Word Embeddings for Text Classification Representations

Word embeddings have introduced a compact and efficient way of representing text for further downstream natural language processing (NLP) tasks. Most word embedding algorithms are optimized at the word level. However, many NLP applications require text representations of groups of words, like sentences or paragraphs. In this paper, we propose a supervised algorithm that produces a task-optimized weighted average of word embeddings for a given task. Our proposed text embedding algorithm combines the compactness and expressiveness of the word-embedding representations with the word-level insights of a BoW-type model, where weights correspond to actual words. Numerical experiments across different domains show the competence of our algorithm.


INTRODUCTION
Word embeddings, or a learned mapping from a vocabulary to a vector space, are essential tools for state-of-the-art Natural Language Processing (NLP) techniques. Dense word vectors, like Word2Vec [1] and GloVe [2], are compact representations of a word's semantic meaning, as demonstrated in analogy tasks [3] and part-of-speech tagging [4].
Most downstream tasks, like sentiment analysis and information retrieval (IR), operate on groups of words, like sentences or paragraphs. In this paper, we refer to this more general embedding as a "text embedding." We propose a supervised algorithm that produces sentence-level embeddings consisting of a weighted average of an available pre-trained word-level embedding. The resulting sentence-level embedding is optimized for the corresponding supervised learning task. The weights that the proposed algorithm produces can be used to estimate the importance of the words with respect to the supervised task. For example, when classifying movie reviews into one of two classes, action movies or romantic movies, words like "action," "romance," "love," and "blood" will get precedence over words like "movie," "i," and "theater." This shifts the text-level vector toward words with larger weights, as can be seen in Figure 1.
When we use an unweighted averaged word embedding (UAEm) [5] to represent the two reviews, all the words get the same importance, so the reviews "I like action movies" and "I prefer romance flicks" end up close to each other in the vector space. Our algorithm, on the other hand, identifies "romance" and "action" as two important words in the vocabulary for the supervised task, and assigns weights with high absolute value to these words. This shifts the representation of the two reviews toward their respective important words in the vector space, increasing the distance between them. This indicates that, for the task of differentiating an action movie review from a romantic movie review, our algorithm produces a review-level representation better suited to discriminating between the two kinds of reviews.

FIGURE 1 | Unweighted (UAEm) (left) and Optimal embeddings (OptEm) (right) of two movie reviews in feature space. Distance between the two reviews increases for OptEm representation of the text.
Our algorithm has many advantages over simpler text embedding approaches, like bag-of-words (BoW) and the averaged word-embedding schemes discussed in section 2. In section 4, we show results from experiments on different datasets. In general, we observed that our algorithm is competitive with other methods. Unlike the simpler algorithms, our approach finds a task-specific representation. While BoW and some weighting schemes, like tf-idf, rely only on word frequencies to determine word importance, our algorithm computes how important the word is to a specific task. We believe that for some applications, this task-specific representation is important for performance; one would expect the importance of words to differ considerably depending on whether the task is topic modeling or sentiment analysis.
It is important to note that other deep-learning-based approaches for text classification also implicitly optimize the text-level representation from word-level embedding in the top layers of the neural network. However, in order to train such models large datasets are needed. Our empirical results show that our proposed representation is in general competitive with traditional deep learning based text classification approaches and outperforms them when the training data is relatively small.
Additionally, by generating an importance weight for each word in the vocabulary, our algorithm yields a more interpretable result than looking at the weights corresponding to the word-embedding dimensions, which have no human-interpretable meaning. Effectively, our text embedding algorithm combines the compactness and expressiveness of the word-embedding representations with the human-interpretability of a BoW-type model.
Furthermore, in contrast with some deep-learning-based approaches, our approach does not impose constraints or require special processing (trimming, padding) with respect to the length of the sentence or text to be classified. In summary, the contributions of the paper are as follows:
• Our algorithm provides a task-optimized text embedding from word-level embeddings.
• Our algorithm outperforms other more complex algorithms when the training data is relatively small.
• Our algorithm can be implemented easily by leveraging existing libraries, as it only requires access to an SVM implementation.
• Our resulting task-specific text embeddings are as compact as the original word-level embeddings while providing word-level insights similar to a BoW-type model.
The rest of the paper is organized as follows: in section 2, we discuss related work. In section 3, we present our proposed algorithm together with a detailed explanation and mathematical justification. In section 4, we present and discuss our experiments.

RELATED WORK
Various representation techniques for text have been introduced over time. In recent years, none of these representations have been as popular as word embeddings, such as Word2Vec [1] and GloVe [2], which take the contextual usage of words into consideration. This has led to very robust word and text representations. Text embedding has been a more challenging problem than word embedding due to the variability of phrases, sentences, and text. Le and Mikolov [6] developed a method to generate text embeddings that outperforms the traditional bag-of-words approach [7]. More recently, deeper neural architectures have been developed to generate these embeddings and to perform text classification tasks [8]; some of these architectures use the sequential information of text, such as LSTMs [9], BERT [10], and XLNet [11]. Furthermore, recently developed attention models can also provide insights about word importance; however, they require large amounts of training data.
Methods have been developed that use word embeddings to generate text embeddings without having to train on whole texts. These methods are less costly than the ones that train directly on whole text, and can be implemented faster.
Unweighted average word embedding [5] generates text embeddings by computing the average of the embeddings of all the words occurring in the text. This is one of the most popular methods of computing text embeddings from trained word embeddings and, though simple, has been known to outperform more complex text embedding models, especially in out-of-domain scenarios. Arora et al. [12] provided a simple method to enhance the performance of text embeddings generated from simple averaged embeddings by the application of PCA.
The unsupervised text embedding methods face the problem of allocating importance to words while computing the embedding. This is important, as word importance determines how biased the text embedding needs to be toward the more informative words. De Boom et al. [13] introduced a method that assigns importance to the words based on their tf-idf scores in the text.
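As a rough illustration of frequency-based weighting, the sketch below averages word vectors with tf-idf weights. It is not the exact procedure of [13]: the smoothed idf variant, the function name `tfidf_embeddings`, and the toy inputs are assumptions made only for this example.

```python
import numpy as np

def tfidf_embeddings(counts, V):
    """Tf-idf weighted average of word vectors (illustrative variant).

    counts: (m, k) raw term counts, one row per document (each row non-empty).
    V:      (k, n) word vectors, one row per vocabulary word.
    """
    m = counts.shape[0]
    df = (counts > 0).sum(axis=0)               # document frequency of each word
    idf = np.log((1 + m) / (1 + df)) + 1.0      # smoothed idf, one common choice
    weights = counts * idf                      # tf-idf score per (document, word)
    weights = weights / weights.sum(axis=1, keepdims=True)  # rows sum to one
    return weights @ V

# Toy data: word 0 occurs in both documents, words 1 and 2 in one each.
counts = np.array([[1.0, 1.0, 0.0],
                   [1.0, 0.0, 1.0]])
V = np.eye(3)                                   # identity vectors expose the weights
emb = tfidf_embeddings(counts, V)
```

With identity word vectors, each row of `emb` is just the normalized weight vector, so the rarer word in a document receives the larger share. The weights depend only on corpus frequencies, not on any label.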
Our method generates weights based on the importance of the words perceived through a supervised approach. We use classifiers to determine the weights of the words based on their importance captured through the procedure. The advantage of this method over other methods is that we keep the simplicity of Wieting's algorithm [5], while incorporating the semantically agreeable weights for the words.

OPTIMAL WORD EMBEDDINGS
A sentence, paragraph, or document can be represented using a given word-level embedding (wle) as follows:

$$A_i = \sum_{j=1}^{k} \lambda_j \, \delta_{ij} \, v_j \tag{1}$$

where,
• $A_i \in \mathbb{R}^n$ is a vectorial representation or embedding at the sentence, text, or document level (we will refer to it as tle in the rest of the paper) of the ith sample; we will assume that $A_i$ is the ith row of a matrix $A \in \mathbb{R}^{m \times n}$ containing a collection of m documents;
• k is the number of words in the wle corpus V;
• $\lambda_j \in \mathbb{R}$ is a weighting factor associated with the jth word $v_j \in V$. Note that for the widely used averaged tle (text2vec) representation [5], $\lambda_j = 1, \forall j$;
• $\delta_{ij}$ is a normalized occurrence count. It is the number of times the jth word appears in document i divided by the total number of words in document i.
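A minimal sketch of Equation (1) for a single tokenized text is given below. The function name `text_embedding`, the toy vocabulary, and the two-dimensional vectors are hypothetical; skipping out-of-vocabulary tokens is also an assumption, since the text does not say how they are handled.

```python
from collections import Counter
import numpy as np

def text_embedding(tokens, vocab, V, lam=None):
    """Weighted average of word vectors for one text, following Equation (1).

    vocab maps a word to its row index in V (k x n); lam holds one weight
    per vocabulary word. lam=None uses all ones, i.e., plain averaging.
    """
    k, n = V.shape
    if lam is None:
        lam = np.ones(k)
    known = [t for t in tokens if t in vocab]   # assumption: drop unknown tokens
    if not known:
        return np.zeros(n)
    total = len(known)
    a = np.zeros(n)
    for word, count in Counter(known).items():
        j = vocab[word]
        delta_ij = count / total                # normalized occurrence count
        a += lam[j] * delta_ij * V[j]
    return a

# Hypothetical toy vocabulary and vectors.
vocab = {"i": 0, "like": 1, "action": 2, "movies": 3}
V = np.array([[0.0, 0.1], [0.2, 0.0], [1.0, -1.0], [0.1, 0.1]])
tokens = "i like action movies".split()

uaem = text_embedding(tokens, vocab, V)                       # all weights equal to one
weighted = text_embedding(tokens, vocab, V,
                          lam=np.array([0.1, 0.1, 3.0, 0.1])) # emphasize "action"
```

With all weights equal to one the result is the plain average of the four word vectors; raising the weight of "action" pulls the text vector toward that word's vector.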
Our proposed algorithm assumes that we have a supervised classification problem for which we want to find an optimal representation at the document (text) level from the word embeddings. More concretely, we consider the problem of classifying m points in the n-dimensional real space $\mathbb{R}^n$, represented by the m × n matrix A, according to membership of each point $A_i$ in the classes +1 or −1 as specified by a given m × m diagonal matrix D with ones or minus ones along its diagonal.
In general, this linear classification problem can be formulated as follows:

$$\min_{(w,\gamma,y)} \; c\,L(y) + R(w,\gamma) \quad \text{s.t.} \quad D(Aw - e\gamma) + y \ge e \tag{2}$$

where,
• $e \in \mathbb{R}^{m \times 1}$ is a column of ones;
• $y \in \mathbb{R}^{m \times 1}$ is a slack vector;
• $(w, \gamma) \in \mathbb{R}^{(n+1) \times 1}$ represents the separating hyperplane;
• L is a loss function that is used to minimize the misclassification error;
• R is a regularization function used to improve generalization;
• c is a constant that controls the trade-off between error and generalization.
Note that if $L(\cdot) = \|(\cdot)_+\|_2^2$ and $R(\cdot) = \|\cdot\|_2^2$, then Equation (2) corresponds to an SVM formulation [14]. The corresponding unconstrained convex optimization problem is given as:

$$\min_{(w,\gamma)} \; c\,\|(e - D(Aw - e\gamma))_+\|_2^2 + \|(w,\gamma)\|_2^2 \tag{3}$$

From (1), we can rewrite each row of A as $A_i = \sum_{j=1}^{k} \delta_{ij}\,\lambda_j\,v_j$. That is,

$$A = \Delta \Lambda V \tag{6}$$

where $\Delta \in \mathbb{R}^{m \times k}$ is a matrix of occurrence counts with $\delta_{ij}$ in the (i, j) position, $\Lambda = \mathrm{diag}((\lambda_1, \ldots, \lambda_k)) \in \mathbb{R}^{k \times k}$, and $V \in \mathbb{R}^{k \times n}$ is the matrix whose rows are all the word2vec vectors considered in the word2vec corpus or dictionary. From (3) and (6),

$$\min_{(w,\gamma,\lambda)} \; c\,\|(e - D(\Delta \Lambda V w - e\gamma))_+\|_2^2 + \|(w,\gamma)\|_2^2 \tag{7}$$

where $\lambda = (\lambda_1, \ldots, \lambda_k)$. Formulation (7) is a biconvex optimization problem, which can be solved using alternate optimization [15]. By solving this problem, not only do we obtain an SVM-type classifier, but we also learn the optimal importance weights for each word in our corpus $(\lambda_1, \ldots, \lambda_k)$, which can be used to interpret classification results for the specific task at hand. Though we could have restricted the $\lambda_i$ to be positive, we choose to leave them unconstrained in order to make our algorithm more scalable and computationally efficient. Another interesting consideration would be to add a relative-importance constraint in addition to the nonnegativity bounds, but again, we choose not to for computational efficiency. We will explore this option in the future. In (7), if we fix $\Lambda$ to a constant $\bar{\Lambda}$ and write $\tilde{A} = \Delta \bar{\Lambda} V$, we have:

$$\min_{(w,\gamma)} \; c\,\|(e - D(\tilde{A} w - e\gamma))_+\|_2^2 + \|(w,\gamma)\|_2^2 \tag{9}$$

We can obtain the corresponding optimal solution for $(w, \gamma)$ by solving $(w^*, \gamma^*) = \mathrm{SVM}(\tilde{A}, D, c)$.
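The biconvex structure can be checked numerically. The sketch below, on random data, verifies that the row-wise weighted average agrees with the matrix product of the occurrence matrix, the diagonal weight matrix, and the word-vector matrix, and that the classifier scores are linear in the weights when w is fixed. The name `Phi` is introduced here only as shorthand for the occurrence matrix with column j scaled by the jth entry of Vw; the sizes and random inputs are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, n = 5, 8, 3                        # documents, vocabulary size, embedding dimension

counts = rng.integers(0, 4, size=(m, k)).astype(float)
counts[:, 0] += 1.0                      # guarantee that every document is non-empty
Delta = counts / counts.sum(axis=1, keepdims=True)   # normalized occurrence counts
lam = rng.normal(size=k)                 # one weight per vocabulary word
V = rng.normal(size=(k, n))              # rows are word vectors
w = rng.normal(size=n)                   # an arbitrary hyperplane direction

# Row-by-row weighted average versus the matrix product.
A_rows = np.array([sum(lam[j] * Delta[i, j] * V[j] for j in range(k))
                   for i in range(m)])
A = Delta @ np.diag(lam) @ V

# For fixed w, the scores are linear in lam ...
Phi = Delta * (V @ w)                    # equals Delta @ np.diag(V @ w) by broadcasting
scores_from_lam = Phi @ lam
# ... and for fixed lam, they are linear in w.
scores_from_w = A @ w
```

Both score vectors coincide, which is why fixing either block of variables leaves a convex subproblem in the other.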
We are ready now to describe our proposed alternate optimization (AO) algorithm to solve formulation (7).
One of the advantages of the algorithm is that it can be easily implemented by using existing open-source SVM libraries, like the ones included in scikit-learn [16] or a more recent GPU-based fast SVM implementation like ThunderSVM [17].
The optimal text embedding algorithm, then, inherits the convergence properties and characteristics of AO problems [15]. It is important to note that the set of possible solutions to which Algorithm 1 can converge can include certain types of saddle points (i.e., points that behave like a local minimizer only when projected along a subset of the variables). However, it is stated in [15] that it is extremely difficult to find examples where convergence occurs to a saddle point rather than to a local minimizer.
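In order to further reduce the computational complexity of the proposed algorithm, we can consider a simplified loss function $L(\cdot) = \|\cdot\|_2^2$ and $R(\cdot) = \|\cdot\|_2^2$. Then formulation (7) becomes the corresponding unconstrained convex optimization problem:

$$\min_{(w,\gamma,\lambda)} \; c\,\|e - D(\Delta \Lambda V w - e\gamma)\|_2^2 + \|(w,\gamma)\|_2^2 \tag{12}$$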

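Fixing $\Lambda = \tilde{\Lambda}$, from (9) and (12), with $\tilde{A} = \Delta \tilde{\Lambda} V$, we have

$$\min_{(w,\gamma)} \; c\,\|e - D(\tilde{A} w - e\gamma)\|_2^2 + \|(w,\gamma)\|_2^2 \tag{13}$$

This formulation corresponds to a least-squares or Proximal SVM formulation [18,19], and its solution can be obtained by solving a simple system of linear equations. We will denote formulation (13) by

$$(w^*, \gamma^*) = \mathrm{LSSVM}(\tilde{A}, D, c) \tag{14}$$

If $\bar{A} = [\tilde{A} \;\; -e]$, then the solution to (13) is given by

$$\begin{bmatrix} w \\ \gamma \end{bmatrix} = \left( \frac{I}{c} + \bar{A}^T \bar{A} \right)^{-1} \bar{A}^T D e \tag{15}$$

On the other hand, fixing $(w, \gamma) = (\bar{w}, \bar{\gamma})$, we have

$$\min_{\lambda} \; c\,\|e - D(\Delta \Lambda V \bar{w} - e\bar{\gamma})\|_2^2 \tag{16}$$

since the regularization term is constant once $\bar{w}$ and $\bar{\gamma}$ are fixed. Hence, writing $\Phi = \Delta\,\mathrm{diag}(V\bar{w}) \in \mathbb{R}^{m \times k}$,

$$\Delta \Lambda V \bar{w} = \Delta\,\mathrm{diag}(V\bar{w})\,\lambda = \Phi \lambda \tag{17}$$

Furthermore, since $D^2 = I$,

$$\|e - D(\Phi \lambda - e\bar{\gamma})\|_2^2 = \|De + e\bar{\gamma} - \Phi \lambda\|_2^2 \tag{18}$$

From (17) and (18),

$$\lambda = (\Phi^T \Phi)^{-1} \Phi^T (De + e\bar{\gamma}) \tag{19}$$

Frontiers in Applied Mathematics and Statistics | www.frontiersin.org

For some problems, $\Phi^T \Phi$ can be ill-conditioned, which may lead to incorrect values for $\lambda$. In order to improve conditioning, we add a Tikhonov regularization perturbation [20]. (19) becomes

$$\lambda = (\Phi^T \Phi + \epsilon I)^{-1} \Phi^T (De + e\bar{\gamma}) \tag{20}$$

where $\epsilon$ is a very small value. Note that $(\Phi^T \Phi + \epsilon I)^{-1}$ involves calculating the inverse of a k × k matrix, where k is the number of words in the word2vec dictionary. In some cases, k can be much larger than m, the number of training set examples. If this is the case, we can use the Sherman-Morrison-Woodbury formula [21]:

$$(Z + u v^T)^{-1} = Z^{-1} - Z^{-1} u \,(I + v^T Z^{-1} u)^{-1} v^T Z^{-1} \tag{21}$$

with $Z = \epsilon I$, $u = v = \Phi^T$. Then $(\Phi^T \Phi + \epsilon I)^{-1}$ becomes

$$\frac{1}{\epsilon} \left( I - \Phi^T (\epsilon I + \Phi \Phi^T)^{-1} \Phi \right) \tag{22}$$

which involves inverting an m × m matrix with m << k. The $\lambda$ we obtained is a vector of weights of the words that would be used in (1) to calculate text2vec of a given sample.

Algorithm 1: Optimal Text Embedding
Input: Training vocabulary matrix (V); scaled word occurrence matrix ($\Delta$); vector of labels diag(D); max number of iterations maxiter; tolerance tol; regularization parameters $c_1$ and $c_2$
Output: optimal word weight vector $\lambda^*$; classification hyperplane $(w^*, \gamma^*)$
Initialize: $\lambda_j = 1 \; \forall j$; $\Lambda_0 = \mathrm{diagonal}(\lambda)$; i = 0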
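The alternation between the two least-squares steps can be sketched in NumPy as follows. This is an illustrative reading of the procedure under stated assumptions, not the authors' implementation: the function names, the stopping rule based on the change in the weight vector, the use of a single trade-off constant `c` and a ridge term `eps`, and the synthetic data are all choices made here. `d` is the label vector diag(D), and `Phi` denotes the occurrence matrix with column j scaled by the jth entry of Vw.

```python
import numpy as np

def solve_hyperplane(A_tilde, d, c):
    """Least-squares step for (w, gamma) with the word weights held fixed."""
    m, n = A_tilde.shape
    A_bar = np.hstack([A_tilde, -np.ones((m, 1))])        # [A_tilde, -e]
    z = np.linalg.solve(np.eye(n + 1) / c + A_bar.T @ A_bar, A_bar.T @ d)
    return z[:n], z[n]

def solve_weights(Delta, V, w, gamma, d, eps):
    """Ridge-regularized least-squares step for lambda with (w, gamma) fixed."""
    m, k = Delta.shape
    Phi = Delta * (V @ w)                 # Delta @ diag(V w), via broadcasting
    b = Phi.T @ (d + gamma)               # right-hand side Phi^T (D e + gamma e)
    if k > m:
        # Sherman-Morrison-Woodbury: solve an m x m system instead of k x k.
        inner = np.linalg.solve(eps * np.eye(m) + Phi @ Phi.T, Phi @ b)
        return (b - Phi.T @ inner) / eps
    return np.linalg.solve(Phi.T @ Phi + eps * np.eye(k), b)

def optimal_text_embedding(Delta, V, d, c=1.0, eps=1e-3, maxiter=20, tol=1e-6):
    """Alternate between the two closed-form steps until lambda stabilizes."""
    lam = np.ones(Delta.shape[1])         # start from plain averaging
    for _ in range(maxiter):
        w, gamma = solve_hyperplane((Delta * lam) @ V, d, c)
        new_lam = solve_weights(Delta, V, w, gamma, d, eps)
        change = np.linalg.norm(new_lam - lam)
        lam = new_lam
        if change < tol:                  # assumed stopping rule
            break
    w, gamma = solve_hyperplane((Delta * lam) @ V, d, c)
    return lam, w, gamma

# Synthetic example: 6 documents, 10 vocabulary words, 4-dimensional vectors.
rng = np.random.default_rng(0)
counts = rng.integers(0, 3, size=(6, 10)).astype(float)
counts[:, 0] += 1.0                       # keep every document non-empty
Delta = counts / counts.sum(axis=1, keepdims=True)
V = rng.normal(size=(10, 4))
d = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])

lam, w, gamma = optimal_text_embedding(Delta, V, d)
scores = (Delta * lam) @ V @ w - gamma    # the sign of each score is the predicted class
```

Replacing `solve_hyperplane` with a call to an off-the-shelf SVM solver would give the squared-hinge variant; the weight step is unchanged in this sketch.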
Algorithm 1 can be modified to consider formulation (3) instead of (13) by making two simple changes: 1. Substitute line 6 of Algorithm 1 by: Solve Equation (15)

EXPERIMENTS
We used binary classification as the task for evaluating our algorithm's performance, comparing it to the following methods:
1. UAEm: Unweighted average of the word vectors that comprise the sentence or document [5].
2. WAEm: Weighted averaged text representations. We computed WAEm using tf-idf coefficients as the weights, as described in De Boom et al. [13].
3. FastText [22]: an open-source, free library that allows users to learn text representations and text classifiers. The classifiers are based on a simple shallow model instead of a deep one, which allows the framework to train models quickly.
4. AdvCNN [8]: a CNN-based deep network comprising parallel convolutional layers with varying filter widths; it achieves state-of-the-art performance on sentiment analysis and question classification.
5. VanillaCNN: a custom CNN architecture we designed, similar to Kim [8] except that there is only one convolutional layer instead of parallel layers.
Note that in both the CNN experiments we have initialized the embedding layer with pre-trained word2vec models and these vectors are kept static.
In SVM-OptEm, we used a support vector machine (SVM) [23] as the classifier. We used a scikit-learn [24] implementation of SVM for the experiments.
In LSSVM-OptEm, we used a least square support vector machine (LS-SVM) [23] as the classifier.

Datasets
To showcase the performance of our model, we chose fifteen different binary classification tasks over the subsets of different datasets. Twelve public datasets are briefly described in Table 1.
We also performed experiments on three datasets belonging to the insurance domain.
• BI-1 and BI-2: These datasets consist of claim notes with binary classes based on the topic of the phone conversation. These notes were taken by a call representative of the company after the phone call was completed. For BI-1, we classified the call notes into two categories based on claim complexity: simple and complex. For BI-2, we wanted to identify notes that documented a failed attempt made by the call representative to get in touch with the customer. It is important to note that the corpus is the same for these two datasets but the classification task is different.
• TRANSCRIPTS: This dataset consists of phone transcripts with two classes: pay-by-phone calls and others. These transcripts were generated inside the company for the calls received at the call center. Each call was assigned a class based on the purpose of the call.

Word Embeddings
We chose to work on different word2vec-based word embeddings. These word embeddings have either been pretrained models or in-house trained models. These embeddings were used on the datasets based on their contextual relevance.
• wikipedia [38]: The skip-gram model was trained on English articles in Wikipedia by FastText [39].
• google-news [40]: The model was trained on Google News data, and is available on the Google Code website [41].
• amzn: The skip-gram model was trained in-house on Amazon reviews [27,28]. Gensim [42] was used to train the model.
• yelp: The skip-gram model was trained in-house on Yelp reviews [37]. Gensim was used to train the model.
• transcript: The continuous bag-of-words model was trained in-house on the transcripts generated from calls at the call centers. Gensim was used to train the model over approximately 3 million transcripts.
• claim-notes: The continuous bag-of-words model was trained in-house on the notes taken by call representatives after the call was completed. C-based code from the Google Word2vec website [41] was used to train the model over approximately 100 million notes.
We used different word2vec models to verify that our model works well independently of the underlying embedding representation. Moreover, it also gives better contextual representation of words for these datasets.

TABLE 1 | Public datasets.
Dataset       Description                                      Positive class             Reference
AMZN-EX       Amazon reviews                                   Electronics review         [26]
AMZNBK-SENT   Amazon book reviews                              Positive review            [27,28]
BBC           BBC news articles                                Sports article             [29]
BLOG-GENDER   Blog articles                                    Male writer                [30]
DBPEDIA       Wikipedia articles                               Artist article             [31-33]
IMDB          IMDB movie reviews                               Positive review            [34]
SCIPAP        Sentences from scientific papers                 Owner-written sentence     [35]
SST           Movie reviews                                    Positive sentiment         [36]
YAHOO-ANS     Questions from Yahoo's question-answer dataset   Health-related question    [33]
YELP-REST     Yelp restaurant reviews                          Restaurant-related review  [37]
YELP-STAR     Yelp reviews                                     Positive review            [37]

Text Processing
The text was processed in the same way as the text used for training the corresponding word2vec model. This ensured that the words occurring in the dataset were consistent with the vocabulary of the model used for mapping them. Different word2vec models had different processing procedures, such as substitutions based on regular expressions, removal of non-alphabetical words, and lowercasing the text. The training data was processed accordingly.

Results
To compare the performance of the algorithms tested, we used the area under the curve (AUC) for evaluation. This metric was chosen so that class imbalance in the datasets would not distort the comparison, as it can when accuracy is used.
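To illustrate why AUC is less affected by class imbalance than accuracy, the sketch below computes it as the fraction of positive-negative pairs that the classifier ranks correctly. It assumes the ROC interpretation of AUC and labels in {+1, −1}; the function name `auc_score` and the toy scores are made up for this example.

```python
import numpy as np

def auc_score(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == -1]
    # Compare every positive with every negative; ties count as half a win.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Imbalanced toy set: 2 positives, 6 negatives.
labels = np.array([1, 1, -1, -1, -1, -1, -1, -1])
scores = np.array([0.9, 0.4, 0.5, 0.3, 0.2, 0.1, 0.05, 0.0])

auc = auc_score(labels, scores)
# Accuracy of a trivial classifier that always predicts the majority class.
majority_accuracy = np.mean(labels == -1)
```

Here 11 of the 12 positive-negative pairs are ordered correctly, while the trivial majority-class predictor already reaches 75% accuracy without ranking anything.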
The performance of our models for the experiments can be seen in Table 2.
Our algorithm provides better or comparable performance against UAEm and WAEm. This performance is achieved over multiple iterations, as seen in Figure 2. The number of iterations required to reach the best performance for our model varies with the dataset and training size.
It is important to note that our proposed algorithm tends to achieve better AUC performance when the training data is small, which is the case in many scenarios in the insurance domain, where labels are difficult and expensive to obtain. This makes the algorithm a good choice for active learning frameworks, where labels are scarce, especially at the early iterations of such approaches.
In general, our algorithm approached an "equilibrium" stage for the vector λ, as seen in Figure 3. In other words, as the algorithm iterates, the norm of the difference between the current word weights and the weights from the previous iteration approaches zero. This behavior is seen consistently for all the experimental cases, and shows that our algorithm exhibits good convergence behavior, as expected.

Text Representation
One of the advantages our model holds over UAEm and WAEm is that it can be used to extract the most important words in the training set. As our model reconfigures the weights of the words at each iteration, it also indirectly reassigns the degree of importance to these words. We can obtain these words by taking the absolute values of the weights assigned to them at the end of the final iteration. This information can be used for improving different algorithms, such as visual representation of text and topic discovery, and as features for other models. Figure 4 shows the weights of the top 15 words for three of our datasets. Weights assigned to the words are based on the role they play in helping the classifier determine the class of any given sample. Table 3 shows the top 10 words for three of our datasets. The words are determined by taking the absolute values of the weights λ* learned by the algorithm and ranking them in descending order. To the human eye, these words clearly make sense with respect to the given classification task. For example:
1. The AMZN-EX classification task is to predict items belonging to the electronics category based on reviews.
2. The YAHOO-ANS classification task is to predict health-related questions.
3. The DBPEDIA classification task is to predict artist articles.
We also found that words that are least informative about the given task have weights (λ*) close to zero.
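The ranking step described above can be sketched as follows. The word list and weight values are invented for illustration and do not come from the experiments; `top_words` is a hypothetical helper name.

```python
import numpy as np

def top_words(lam, words, top=10):
    """Return the `top` words with the largest absolute weight, with their weights."""
    order = np.argsort(-np.abs(lam))[:top]      # indices sorted by descending |weight|
    return [(words[j], float(lam[j])) for j in order]

# Made-up vocabulary and learned weights.
words = ["movie", "battery", "charger", "the", "screen"]
lam = np.array([0.02, -1.7, 2.3, 0.0, 0.9])

ranking = top_words(lam, words, top=3)
```

Negative weights are kept in the output because the sign indicates the direction in which a word moves the text vector, while the ranking itself uses only the magnitude.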
Following our results presented in Table 2, we want to highlight the following observations:
• Our method is competitive with more sophisticated models. In fact, it achieves the best score on 7 out of 15 text classification tasks from various domains.
• Our method seems to significantly outperform other approaches when the dataset is relatively small. This might be very relevant in situations where labeled data is expensive to obtain, which is often the case in many industrial applications.

The highest score for each evaluation metric is in boldface.

CONCLUSIONS AND FUTURE WORK
Our paper provides an alternative way of sentence/document-level representation for supervised text classification, based on optimization of the weights of words in the corresponding text to be classified. This approach takes labels into consideration when generating optimal weights for these words. Numerical experiments show that our proposed algorithm is competitive with other state-of-the-art techniques and outperformed CNNs when the training data was small, and we even show that this approach is not sensitive to document lengths. Our model also brings additional benefits to the table. It provides a ranking of the relevance of the words with respect to the text classification problem at hand. This ranking of words by importance can be used for different NLP applications related to the same task, such as extraction-based summarization, context-matching, and text cleaning. By learning the optimal weights of the words, our model also tends to remove or ignore less informative words, thus performing its own version of feature selection. Our text embedding algorithm combines the compactness and expressiveness of the word-embedding representations with the human-interpretability of a BoW-type model. We intend to extend this work to make the proposed algorithm more scalable in order to incorporate larger, more complex classification models and tasks, such as multi-label, multi-class classification and summarization.

FIGURE 3 | Average of weight difference over iteration for three datasets. This difference approaches zero over iterations.

For a human eye, most of these words make sense given the classification task.
We want to explore using other normalizations and constraints on the weight vector. One possibility is to explore 1-norm regularization for the weight vector to make it more sparse and obtain a more aggressive feature (word) selection. Another interesting direction is to consider group regularization similar to [43], where the groups of words are suggested by a graph naturally defined by the distances between the words provided by the word embedding. In this way, semantically similar words would be weighted similarly and the result of the algorithm would be a clustering of terms by semantic meaning or topics that are relevant to the classification problem at hand.

DATA AVAILABILITY STATEMENT
The datasets generated for this study are available on request to the corresponding author.

AUTHOR CONTRIBUTIONS
GF was the technical advisor and the central figure for driving this project to completion. SG was responsible for running all the initial set of experiments and dataset preparation. TK was responsible for finishing all of the remaining experiments and manuscript writing. DC was part of research discussions and brainstorming.