Unsupervised Word Embedding Learning by Incorporating Local and Global Contexts

Word embedding has benefited a broad spectrum of text analysis tasks by learning distributed word representations that encode word semantics. Word representations are typically learned by modeling the local contexts of words, under the assumption that words sharing similar surrounding words are semantically close. We argue that local contexts can only partially define word semantics in unsupervised word embedding learning. Global contexts, referring to broader semantic units such as the document or paragraph in which a word appears, capture different aspects of word semantics and complement local contexts. We propose two simple yet effective unsupervised word embedding models that jointly model local and global contexts to learn word representations. We provide theoretical interpretations of the proposed models to demonstrate how local and global contexts are jointly modeled, assuming a generative relationship between words and contexts. We conduct a thorough evaluation on a wide range of benchmark datasets. Our quantitative analysis and case study show that, despite their simplicity, our two proposed models achieve superior performance on word similarity and text classification tasks.


APPENDIX
Lemma 1. The definite integral of the $p$-th power of $\sin$ on the interval $[0, \pi]$ is given by
$$J_p = \int_0^\pi \sin^p\theta \, d\theta = \frac{\sqrt{\pi}\,\Gamma\!\left(\frac{p+1}{2}\right)}{\Gamma\!\left(\frac{p}{2}+1\right)},$$
where $\Gamma(x) = \int_0^\infty \exp(-t)\, t^{x-1}\, dt$ is the gamma function.

PROOF. Writing $\sin^p\theta = \sin^{p-1}\theta \sin\theta$ and integrating by parts gives $J_p = (p-1)\int_0^\pi \sin^{p-2}\theta \cos^2\theta \, d\theta = (p-1)(J_{p-2} - J_p)$, which yields the recurrence $J_p = \frac{p-1}{p} J_{p-2}$ for $p \geq 2$.
Using the above recurrence and the property of the gamma function $\Gamma(x+1) = x\Gamma(x)$, we write $J_p$ in terms of gamma functions.

• When $p$ is an even integer, iterating the recurrence down to $J_0$ gives
$$J_p = \frac{\Gamma\!\left(\frac{p+1}{2}\right)}{\Gamma\!\left(\frac{1}{2}\right)} \cdot \frac{\Gamma(1)}{\Gamma\!\left(\frac{p}{2}+1\right)}\, J_0.$$
Plugging in the base case $J_0 = \pi$ and $\Gamma\!\left(\frac{1}{2}\right) = \sqrt{\pi}$, $\Gamma(1) = 1$, we prove that
$$J_p = \frac{\sqrt{\pi}\,\Gamma\!\left(\frac{p+1}{2}\right)}{\Gamma\!\left(\frac{p}{2}+1\right)}.$$

• When $p$ is an odd integer, iterating the recurrence down to $J_1$ gives
$$J_p = \frac{\Gamma\!\left(\frac{p+1}{2}\right)}{\Gamma(1)} \cdot \frac{\Gamma\!\left(\frac{3}{2}\right)}{\Gamma\!\left(\frac{p}{2}+1\right)}\, J_1.$$
Plugging in the base case $J_1 = 2$ and $\Gamma\!\left(\frac{3}{2}\right) = \frac{\sqrt{\pi}}{2}$, $\Gamma(1) = 1$, we prove that
$$J_p = \frac{\sqrt{\pi}\,\Gamma\!\left(\frac{p+1}{2}\right)}{\Gamma\!\left(\frac{p}{2}+1\right)}.$$
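As a quick sanity check (not part of the original proof), the closed form of Lemma 1 can be compared against numerical quadrature. The snippet below is a minimal sketch that assumes NumPy and SciPy are available; `J_closed_form` is an illustrative helper name.

```python
# Numerically verify Lemma 1: J_p = sqrt(pi) * Gamma((p+1)/2) / Gamma(p/2 + 1).
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

def J_closed_form(p):
    """Closed form of J_p from Lemma 1."""
    return np.sqrt(np.pi) * gamma((p + 1) / 2) / gamma(p / 2 + 1)

for p in range(6):
    # Numerical quadrature of the defining integral over [0, pi].
    J_numeric, _ = quad(lambda t: np.sin(t) ** p, 0, np.pi)
    assert np.isclose(J_numeric, J_closed_form(p))
    print(f"p={p}: numeric={J_numeric:.6f}, closed form={J_closed_form(p):.6f}")
```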
Theorem 1. When the corpus size and vocabulary size are infinite (i.e., $|D| \to \infty$ and $|V| \to \infty$) and all word vectors and document vectors are assumed to be unit vectors, generalizing the proportionality relationships assumed in Equations (2), (4), (7) and (9) to the continuous case results in the vMF distribution with the corresponding prior vector as the mean direction and constant 1 as the concentration parameter; e.g., for Equation (7),
$$\lim_{|V| \to \infty} p(w_j \mid w_i) = \mathrm{vMF}_p(v_{w_j}; u_{w_i}, 1) = c_p(1) \exp(u_{w_i}^\top v_{w_j}).$$

PROOF. We give the proof for Equation (18); the proofs for Equations (16), (17) and (19) can be derived similarly.

We generalize the proportionality relationship $p(w_j \mid w_i) \propto \exp(u_{w_i}^\top v_{w_j})$ in Equation (7) to the continuous case and obtain the following probability density:
$$p(v_w \mid w_i) = \frac{\exp(u_{w_i}^\top v_w)}{\int_{\mathbb{S}^{p-1}} \exp(u_{w_i}^\top v)\, dv}, \tag{20}$$
where we denote the integral in the denominator as $Z$. Our goal becomes to prove the equality $\frac{1}{Z} = c_p(1)$.
To evaluate the integral $Z$, we transform to polar coordinates. Let $t = Qv_w$, where $Q \in \mathbb{R}^{p \times p}$ is an orthogonal transformation, so that $dt = dv_w$. Moreover, let the first row of $Q$ be $u_{w_i}$, so that $t_1 = u_{w_i}^\top v_w$. We then use $(r, \theta_1, \ldots, \theta_{p-1})$ to represent the polar coordinates of $t$, where $r = 1$ and $\cos\theta_1 = u_{w_i}^\top v_w$. The transformation from Euclidean to polar coordinates is obtained by computing the determinant of the Jacobian matrix of the coordinate change (Sra, 2007), which yields
$$Z = \int_0^{2\pi}\!\! \int_0^\pi\!\! \cdots \int_0^\pi \exp(\cos\theta_1) \prod_{k=1}^{p-2} (\sin\theta_k)^{p-1-k}\, d\theta_1 \cdots d\theta_{p-2}\, d\theta_{p-1} = 2\pi \left( \prod_{m=1}^{p-3} J_m \right) \int_0^\pi \exp(\cos\theta_1)(\sin\theta_1)^{p-2}\, d\theta_1.$$
By Lemma 1, the product telescopes:
$$\prod_{m=1}^{p-3} J_m = \prod_{m=1}^{p-3} \frac{\sqrt{\pi}\,\Gamma\!\left(\frac{m+1}{2}\right)}{\Gamma\!\left(\frac{m}{2}+1\right)} = \frac{\pi^{(p-3)/2}}{\Gamma\!\left(\frac{p-1}{2}\right)}.$$
According to Definition 4, the remaining integral in $Z$ can be expressed with $I_{p/2-1}(1)$ as
$$\int_0^\pi \exp(\cos\theta_1)(\sin\theta_1)^{p-2}\, d\theta_1 = \frac{\Gamma\!\left(\frac{p-1}{2}\right)\Gamma\!\left(\frac{1}{2}\right)}{2^{1-p/2}}\, I_{p/2-1}(1).$$
Therefore, with the fact that $\Gamma\!\left(\frac{1}{2}\right) = \sqrt{\pi}$,
$$Z = 2\pi \cdot \frac{\pi^{(p-3)/2}}{\Gamma\!\left(\frac{p-1}{2}\right)} \cdot \frac{\Gamma\!\left(\frac{p-1}{2}\right)\Gamma\!\left(\frac{1}{2}\right)}{2^{1-p/2}}\, I_{p/2-1}(1) = (2\pi)^{p/2}\, I_{p/2-1}(1) = \frac{1}{c_p(1)}.$$
Plugging $Z$ back into Equation (20), we finally arrive at
$$p(v_w \mid w_i) = c_p(1) \exp(u_{w_i}^\top v_w) = \mathrm{vMF}_p(v_w; u_{w_i}, 1),$$
which completes the proof.
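The value of $Z$ can likewise be checked numerically. The sketch below (an illustration assuming NumPy and SciPy, not part of the original proof) estimates $Z = \int_{\mathbb{S}^{p-1}} \exp(u^\top v)\, dv$ by Monte Carlo, exploiting rotational symmetry to set $u = e_1$, and compares the estimate with $(2\pi)^{p/2} I_{p/2-1}(1)$.

```python
# Monte Carlo check that Z = (2*pi)^(p/2) * I_{p/2-1}(1), i.e., 1/Z = c_p(1).
import numpy as np
from scipy.special import gamma, iv  # iv(nu, z) is the modified Bessel function I_nu(z)

rng = np.random.default_rng(0)
p, n = 5, 1_000_000

# Uniform samples on the unit sphere S^{p-1}: normalized Gaussian vectors.
x = rng.standard_normal((n, p))
v = x / np.linalg.norm(x, axis=1, keepdims=True)

# Z = area(S^{p-1}) * E[exp(u^T v)]; by symmetry we may take u = e_1.
area = 2 * np.pi ** (p / 2) / gamma(p / 2)
Z_mc = area * np.exp(v[:, 0]).mean()

Z_exact = (2 * np.pi) ** (p / 2) * iv(p / 2 - 1, 1.0)
print(f"Monte Carlo Z = {Z_mc:.4f}, closed form Z = {Z_exact:.4f}")
```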
Lemma 2. Let $\mathcal{X}$ be a set of $n$ unit vectors drawn independently from the vMF distribution $\mathrm{vMF}_p(\mu, \kappa)$, i.e., $x_i \sim \mathrm{vMF}_p(\mu, \kappa)$ for $i = 1, \ldots, n$. The maximum likelihood estimate of the parameter $\mu$ is given by the normalized sum of the $n$ vectors, i.e.,
$$\hat{\mu} = \frac{\sum_{i=1}^n x_i}{\left\| \sum_{i=1}^n x_i \right\|}.$$

PROOF. The log-likelihood is $\log P(\mathcal{X} \mid \mu, \kappa) = n \log c_p(\kappa) + \kappa \mu^\top s$, where $s = \sum_{i=1}^n x_i$. Since $\|\mu\| = 1$, we introduce a Lagrange multiplier $\eta$ to account for the constraint
$$\mu^\top \mu = 1 \tag{21}$$
and maximize the Lagrangian objective function below:
$$\mathcal{L}(\mu, \kappa, \eta; \mathcal{X}) = n \log c_p(\kappa) + \kappa \mu^\top s + \eta(1 - \mu^\top \mu).$$
After setting the partial derivative with respect to $\mu$ to zero, we obtain
$$\mu = \frac{\kappa}{2\eta}\, s. \tag{22}$$
Substituting Equation (22) into Equation (21) gives $\eta = \frac{\kappa \|s\|}{2}$, and we finally arrive at the maximum likelihood estimate of the mean direction:
$$\hat{\mu} = \frac{s}{\|s\|}.$$
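As an illustration of Lemma 2 (a sketch assuming NumPy, not part of the original proof), the snippet below confirms that the normalized sum $\hat{\mu} = s/\|s\|$ maximizes the data-dependent term $\kappa \mu^\top s$ of the log-likelihood over unit vectors: by the Cauchy-Schwarz inequality, no other unit direction attains a larger value.

```python
# Verify that mu_hat = s / ||s|| maximizes mu^T s over unit vectors mu.
import numpy as np

rng = np.random.default_rng(0)
p, n = 10, 100

# Any collection of unit vectors suffices to illustrate the claim.
x = rng.standard_normal((n, p))
x /= np.linalg.norm(x, axis=1, keepdims=True)

s = x.sum(axis=0)
mu_hat = s / np.linalg.norm(s)  # the MLE from Lemma 2

for _ in range(1000):
    mu = rng.standard_normal(p)
    mu /= np.linalg.norm(mu)
    # mu^T s never exceeds mu_hat^T s = ||s|| (Cauchy-Schwarz).
    assert mu @ s <= mu_hat @ s + 1e-12
print("mu_hat =", np.round(mu_hat, 3))
```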