<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Big Data</journal-id>
<journal-title>Frontiers in Big Data</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Big Data</abbrev-journal-title>
<issn pub-type="epub">2624-909X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fdata.2020.00009</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Big Data</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Unsupervised Word Embedding Learning by Incorporating Local and Global Contexts</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Meng</surname> <given-names>Yu</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/814695/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Huang</surname> <given-names>Jiaxin</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/894464/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Wang</surname> <given-names>Guangyuan</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Wang</surname> <given-names>Zihan</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Zhang</surname> <given-names>Chao</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Han</surname> <given-names>Jiawei</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c002"><sup>&#x0002A;</sup></xref>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Department of Computer Science, University of Illinois at Urbana-Champaign</institution>, <addr-line>Champaign, IL</addr-line>, <country>United States</country></aff>
<aff id="aff2"><sup>2</sup><institution>School of Computational Science and Engineering, College of Computing, Georgia Institute of Technology</institution>, <addr-line>Atlanta, GA</addr-line>, <country>United States</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Huan Liu, Arizona State University, United States</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Georgios Kollias, IBM, United States; Shuhan Yuan, Utah State University, United States</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Yu Meng <email>yumeng5&#x00040;illinois.edu</email></corresp>
<corresp id="c002">Jiawei Han <email>hanj&#x00040;illinois.edu</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Data Mining and Management, a section of the journal Frontiers in Big Data</p></fn>
</author-notes>
<pub-date pub-type="epub">
<day>11</day>
<month>03</month>
<year>2020</year>
</pub-date>
<pub-date pub-type="collection">
<year>2020</year>
</pub-date>
<volume>3</volume>
<elocation-id>9</elocation-id>
<history>
<date date-type="received">
<day>06</day>
<month>12</month>
<year>2019</year>
</date>
<date date-type="accepted">
<day>21</day>
<month>02</month>
<year>2020</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2020 Meng, Huang, Wang, Wang, Zhang and Han.</copyright-statement>
<copyright-year>2020</copyright-year>
<copyright-holder>Meng, Huang, Wang, Wang, Zhang and Han</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract><p>Word embedding has benefited a broad spectrum of text analysis tasks by learning distributed word representations to encode word semantics. Word representations are typically learned by modeling local contexts of words, assuming that words sharing similar surrounding words are semantically close. We argue that local contexts can only partially define word semantics in unsupervised word embedding learning. Global contexts, referring to broader semantic units such as the document or paragraph where the word appears, can capture different aspects of word semantics and complement local contexts. We propose two simple yet effective unsupervised word embedding models that jointly model both local and global contexts to learn word representations. We provide theoretical interpretations of the proposed models to demonstrate how local and global contexts are jointly modeled, assuming a generative relationship between words and contexts. We conduct a thorough evaluation on a wide range of benchmark datasets. Our quantitative analysis and case study show that despite their simplicity, our two proposed models achieve superior performance on word similarity and text classification tasks.</p></abstract>
<kwd-group>
<kwd>word embedding</kwd>
<kwd>unsupervised learning</kwd>
<kwd>word semantics</kwd>
<kwd>local contexts</kwd>
<kwd>global contexts</kwd>
</kwd-group>
<contract-num rid="cn001">W911NF-09-2-0053</contract-num>
<contract-num rid="cn002">W911NF-17-C-0099</contract-num>
<contract-num rid="cn002">FA8750-19-2-1004</contract-num>
<contract-num rid="cn003">IIS 16-18481</contract-num>
<contract-num rid="cn003">IIS 17-04532</contract-num>
<contract-num rid="cn003">IIS 17-41317</contract-num>
<contract-num rid="cn004">HDTRA11810026</contract-num>
<contract-num rid="cn005">1U54GM114838</contract-num>
<contract-sponsor id="cn001">Army Research Laboratory<named-content content-type="fundref-id">10.13039/100006754</named-content></contract-sponsor>
<contract-sponsor id="cn002">Defense Advanced Research Projects Agency<named-content content-type="fundref-id">10.13039/100000185</named-content></contract-sponsor>
<contract-sponsor id="cn003">National Science Foundation<named-content content-type="fundref-id">10.13039/100000001</named-content></contract-sponsor>
<contract-sponsor id="cn004">Defense Threat Reduction Agency<named-content content-type="fundref-id">10.13039/100000774</named-content></contract-sponsor>
<contract-sponsor id="cn005">National Institute of General Medical Sciences<named-content content-type="fundref-id">10.13039/100000057</named-content></contract-sponsor>
<counts>
<fig-count count="4"/>
<table-count count="8"/>
<equation-count count="28"/>
<ref-count count="29"/>
<page-count count="12"/>
<word-count count="7974"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>Unsupervised word representation learning, or word embedding, has shown remarkable effectiveness in various text analysis tasks, such as named entity recognition (Lample et al., <xref ref-type="bibr" rid="B12">2016</xref>), text classification (Kim, <xref ref-type="bibr" rid="B10">2014</xref>) and machine translation (Cho et al., <xref ref-type="bibr" rid="B4">2014</xref>). Words and phrases, which are originally represented as one-hot vectors, are embedded into a continuous low-dimensional space. Typically, the mapping function is learned based on the assumption that words sharing similar local contexts are semantically close. For instance, the famous word2vec algorithm (Mikolov et al., <xref ref-type="bibr" rid="B22">2013a</xref>,<xref ref-type="bibr" rid="B23">b</xref>) learns word representations from each word&#x00027;s local context window (i.e., surrounding words) so that the local contextual similarity of words is preserved. The Skip-Gram architecture of word2vec uses the center word to predict its local context, and the CBOW architecture uses the local context to predict the center word. GloVe (Pennington et al., <xref ref-type="bibr" rid="B24">2014</xref>) factorizes a global word-word co-occurrence matrix, but the co-occurrence is still defined upon local context windows.</p>
<p>In this paper, we argue that apart from local context, another important type of word context&#x02014;which we call <italic>global context</italic>&#x02014;has been largely ignored by unsupervised word embedding models. Global context refers to the larger semantic unit that a word belongs to, such as a document or a paragraph. While local context reflects the local semantic and syntactic features of a word, global context encodes general semantic and topical properties of words in the document, which complements local context in embedding learning. Neither local context nor global context alone is sufficient for encoding the semantics of a word. For example, <xref ref-type="fig" rid="F1">Figure 1</xref> shows a text snippet from the 20 Newsgroup dataset. When we only look at the local context window (the transparent part of <xref ref-type="fig" rid="F1">Figure 1</xref>) of the word &#x0201C;harmful,&#x0201D; it is hard to predict whether the center word should have a positive or negative meaning. On the other hand, if we only know that the entire document is about car robbery but have no information about the local context, there is also no way to predict the center word. This example demonstrates that local and global contexts provide complementary information about the center word&#x00027;s semantics, and using either alone may not be enough to capture complete word semantics.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>A text snippet from the 20 Newsgroup dataset. The transparent part represents the local context of the word &#x0201C;harmful.&#x0201D; The semitransparent part denotes the remainder of the document.</p></caption>
<graphic xlink:href="fdata-03-00009-g0001.tif"/>
</fig>
<p>To the best of our knowledge, there is no previous study that <italic>explicitly</italic><xref ref-type="fn" rid="fn0001"><sup>1</sup></xref> models both local and global contexts to learn word representations. Topic models (Hofmann, <xref ref-type="bibr" rid="B8">1999</xref>; Blei et al., <xref ref-type="bibr" rid="B2">2003</xref>) essentially use global contexts to discover latent topics, by modeling documents as mixtures of latent topics and topics as distributions over words. In topic modeling, however, local contexts are completely ignored because word ordering information is discarded. Some studies along the embedding line learn word embeddings based on global contexts <italic>implicitly</italic>. HSMN (Huang et al., <xref ref-type="bibr" rid="B9">2012</xref>), PTE (Tang et al., <xref ref-type="bibr" rid="B26">2015</xref>), and Doc2Cube (Tao et al., <xref ref-type="bibr" rid="B27">2018</xref>) take the average of the word embeddings in a document as the document representation and encourage similarity between word embeddings and document embeddings for co-occurring words and documents. However, these methods do not model global contexts explicitly because the document representations are essentially aggregated word representations and thus are not tailored for contextual representation. Moreover, both PTE and Doc2Cube require additional class information for text classification and thus are not unsupervised word embedding frameworks.</p>
<p>We propose two models that incorporate both local and global contexts for unsupervised word embedding learning. Our proposed models are surprisingly simple extensions of the Skip-Gram and CBOW architectures of word2vec, extending their objective functions to include a loss term corresponding to the global context. Despite our models&#x00027; simplicity, we use a spherical generative model to show that they have a theoretical basis: Under the assumption that there is a generative relationship between words and their contexts, our models essentially perform maximum likelihood estimation on the corpus with word representations as the parameters to be estimated.</p>
<p>Our contributions are summarized below:</p>
<list list-type="order">
<list-item><p>We propose two unsupervised models that incorporate both local and global word contexts in word embedding learning, allowing them to provide complementary information for capturing word semantics.</p></list-item>
<list-item><p>We provide theoretical interpretations of the proposed models based on a spherical generative model, which shows equivalence between our models&#x00027; objectives and maximum likelihood estimation on the corpus where word representations are parameters to be estimated.</p></list-item>
<list-item><p>We conduct a thorough evaluation of word embedding quality on benchmark datasets. The two proposed models outperform their word2vec counterparts and achieve superior performance on word similarity and text classification tasks. We also perform case studies to understand the properties of our models.</p></list-item>
</list>
</sec>
<sec id="s2">
<title>2. Related Work</title>
<p>In this section, we review related studies on word embedding, and categorize them into three classes according to the type of word context captured by the model.</p>
<sec>
<title>2.1. Local Context Word Embedding</title>
<p>Most unsupervised word embedding frameworks learn word representations by preserving the local context similarity of words. The underlying assumption is that similar surrounding words imply similar semantics of center words. Distributed word representations were first proposed in Bengio et al. (<xref ref-type="bibr" rid="B1">2000</xref>) to maximize the conditional probability of the next word given the previous few words, which act as the local context. The definition of the local context is later extended in Collobert et al. (<xref ref-type="bibr" rid="B5">2011</xref>) to include not only the preceding words but also the succeeding ones. Afterwards, the most famous word embedding framework, word2vec (Mikolov et al., <xref ref-type="bibr" rid="B22">2013a</xref>,<xref ref-type="bibr" rid="B23">b</xref>), proposes two models that capture local context similarity. Specifically, word2vec&#x00027;s Skip-Gram model (Mikolov et al., <xref ref-type="bibr" rid="B23">2013b</xref>) maximizes the probability of using the center word to predict its surrounding words; word2vec&#x00027;s CBOW model (Mikolov et al., <xref ref-type="bibr" rid="B22">2013a</xref>), by symmetry, uses the local context to predict the center word. It is also shown in Levy and Goldberg (<xref ref-type="bibr" rid="B14">2014</xref>) that word2vec&#x00027;s Skip-Gram model with negative sampling is equivalent to factorizing a shifted PMI matrix. Another word embedding framework, GloVe (Pennington et al., <xref ref-type="bibr" rid="B24">2014</xref>), learns embeddings by factorizing a so-called global word-word co-occurrence matrix. However, the co-occurrence is still defined upon local context windows, so GloVe essentially captures local context similarity of words as well.</p>
</sec>
<sec>
<title>2.2. Global Context Word Embedding</title>
<p>There have been previous studies that incorporate global context, i.e., the document a word belongs to, into word embedding learning. Doc2Vec (Le and Mikolov, <xref ref-type="bibr" rid="B13">2014</xref>) learns a representation for a paragraph or document by training document embeddings to predict the words in the document. Although word embeddings are trained simultaneously with the document embeddings, the final goal of Doc2Vec is to obtain document embeddings rather than word embeddings, and documents are treated as the representation learning target, not as context for words.</p>
<p>A few recent papers incorporate global context implicitly into network structures where word embeddings are learned. PTE (Tang et al., <xref ref-type="bibr" rid="B26">2015</xref>) and Doc2Cube (Tao et al., <xref ref-type="bibr" rid="B27">2018</xref>) construct a word-document network and encode word-document co-occurrence frequency in the edge weights to enforce embedding similarity between co-occurring words and documents. However, PTE and Doc2Cube do not explicitly model global context because the document representations are simply averaged word embeddings. Another notable difference from unsupervised word embedding is that they also rely on a word-label network, which requires class-related information, to optimize the word embeddings for text classification. Hence, the embeddings are trained under semi-supervised/weakly-supervised settings and do not generalize well to other tasks.</p>
</sec>
<sec>
<title>2.3. Joint Context Word Embedding</title>
<p>There have been a few attempts to incorporate both local and global contexts in word embedding. Huang et al. (<xref ref-type="bibr" rid="B9">2012</xref>) propose a neural language model which uses global context to disambiguate upon local context. Specifically, the framework conducts word sense discrimination for polysemous words by learning multiple embeddings per word according to the document that the word token appears in. However, the document embedding is directly computed as the weighted average of word embeddings and is not tailored for contextual representation. In this paper, we explicitly learn document embeddings as global context representations, so that local and global context representations clearly capture different aspects of word contexts. Topic word embeddings (Liu et al., <xref ref-type="bibr" rid="B15">2015</xref>) and the Collaborative Language Model (Xun et al., <xref ref-type="bibr" rid="B29">2017</xref>) share a similar idea that topic modeling [e.g., LDA (Blei et al., <xref ref-type="bibr" rid="B2">2003</xref>)] benefits word embedding learning by relating words with topical information. However, these frameworks suffer from the same major problems as topic modeling does: (1) They require prior knowledge about the number of latent topics in the corpus, which may not always be available under unsupervised settings; (2) Because topic modeling inference algorithms only reach locally optimal solutions, instability in topic discovery results in instability in the word embeddings as well. Our proposed models learn document embeddings to represent global context and do not rely on topic modeling. The framework most relevant to our design is Spherical Text Embedding (Meng et al., <xref ref-type="bibr" rid="B18">2019a</xref>), which jointly models word-word and word-paragraph co-occurrence statistics on the sphere.</p>
</sec>
</sec>
<sec id="s3">
<title>3. Definitions and Preliminaries</title>
<p>In this section, we summarize the notations used in this paper in <xref ref-type="table" rid="T1">Table 1</xref> and introduce the preliminaries necessary for understanding the design and interpretation of the proposed models.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Notations and meanings.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left"><bold>Notation</bold></th>
<th valign="top" align="left"><bold>Meaning</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left"><bold><italic>u</italic></bold><sub><italic>w</italic></sub>, <bold><italic>v</italic></bold><sub><italic>w</italic></sub></td>
<td valign="top" align="left">The &#x0201C;input&#x0201D; and &#x0201C;output&#x0201D; vector representation of word <italic>w</italic>.</td>
</tr>
<tr>
<td valign="top" align="left"><bold><italic>d</italic></bold></td>
<td valign="top" align="left">The vector representation of document <italic>d</italic>.</td>
</tr>
<tr>
<td valign="top" align="left">|<italic>d</italic>|</td>
<td valign="top" align="left">The length of document <italic>d</italic>.</td>
</tr>
<tr>
<td valign="top" align="left"><inline-formula><mml:math id="M1"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">C</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>w</mml:mi><mml:mo>,</mml:mo><mml:mi>d</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">C</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>G</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>w</mml:mi><mml:mo>,</mml:mo><mml:mi>d</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="left">The local and global context of a word <italic>w</italic> &#x02208; <italic>d</italic>.</td>
</tr>
<tr>
<td valign="top" align="left"><inline-formula><mml:math id="M2"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>}</mml:mo></mml:mrow><mml:msubsup><mml:mrow><mml:mstyle mathsize="1.19em"><mml:mrow></mml:mrow></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow><mml:mo>|</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula></td>
<td valign="top" align="left">The text corpus represented by the set of documents.</td>
</tr>
<tr>
<td valign="top" align="left"><inline-formula><mml:math id="M3"><mml:mi>V</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>}</mml:mo></mml:mrow><mml:msubsup><mml:mrow><mml:mstyle mathsize="1.19em"><mml:mrow></mml:mrow></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mi>V</mml:mi><mml:mo>|</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula></td>
<td valign="top" align="left">The corpus vocabulary represented by the set of unique word tokens.</td>
</tr>
<tr>
<td valign="top" align="left"><italic>h</italic></td>
<td valign="top" align="left">Local context window size.</td>
</tr>
<tr>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left">Word and document vector dimension.</td>
</tr>
<tr>
<td valign="top" align="left">&#x1D54A;<sup><italic>p</italic>&#x02212;1</sup></td>
<td valign="top" align="left">The unit sphere in &#x0211D;<sup><italic>p</italic></sup>.</td>
</tr>
</tbody>
</table>
</table-wrap>
<p><bold>Definition 1</bold> (Local Context). We represent each document <italic>d</italic> as a sequence of words <italic>d</italic> &#x0003D; <italic>w</italic><sub>1</sub><italic>w</italic><sub>2</sub>&#x02026;<italic>w</italic><sub><italic>n</italic></sub>. The local context <inline-formula><mml:math id="M4"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">C</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>d</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> of a word <italic>w</italic><sub><italic>i</italic></sub> &#x02208; <italic>d</italic> refers to all other words appearing in the local context window of <italic>w</italic><sub><italic>i</italic></sub> (i.e., <italic>h</italic> words before and after <italic>w</italic><sub><italic>i</italic></sub>) in document <italic>d</italic>. Formally, <inline-formula><mml:math id="M5"><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">C</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>d</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> if <italic>w</italic><sub><italic>j</italic></sub> &#x02208; <italic>d, i</italic> &#x02212; <italic>h</italic> &#x02264; <italic>j</italic> &#x02264; <italic>i</italic> &#x0002B; <italic>h, i</italic> &#x02260; <italic>j</italic>.</p>
<p><bold>Definition 2</bold> (Global Context). The global context <inline-formula><mml:math id="M6"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">C</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>G</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>w</mml:mi><mml:mo>,</mml:mo><mml:mi>d</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> of a word <italic>w</italic> with regard to <italic>d</italic> refers to the relationship that <italic>w</italic> appears in <italic>d</italic>. Formally, <inline-formula><mml:math id="M7"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">C</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>G</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>w</mml:mi><mml:mo>,</mml:mo><mml:mi>d</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula> if <italic>w</italic> &#x02208; <italic>d</italic>, and <inline-formula><mml:math id="M8"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">C</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>G</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>w</mml:mi><mml:mo>,</mml:mo><mml:mi>d</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>&#x02205;</mml:mi></mml:math></inline-formula> otherwise.</p>
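<p>As a concrete illustration (ours, not part of the original formulation), the two context definitions above can be sketched in Python. The function names and the toy document are hypothetical:</p>

```python
def local_context(doc, i, h):
    # C_L(w_i, d): all words w_j in d with i - h <= j <= i + h and j != i
    return [doc[j] for j in range(max(0, i - h), min(len(doc) - 1, i + h) + 1)
            if j != i]

def global_context(word, doc):
    # C_G(w, d) = {d} if w appears in d, else the empty set
    return {doc} if word in doc else set()

doc = ("the", "robber", "fired", "a", "harmful", "shot", "at", "the", "driver")
print(local_context(doc, 4, 2))                  # ['fired', 'a', 'shot', 'at']
print(global_context("harmful", doc) == {doc})   # True
```

<p>Note that near the document boundaries the window is simply truncated, so a word may have fewer than 2<italic>h</italic> local context words.</p>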
<p><bold>Definition 3</bold> (The von Mises Fisher (vMF) distribution). A unit random vector <bold><italic>x</italic></bold> &#x02208; &#x1D54A;<sup><italic>p</italic>&#x02212;1</sup> &#x02282; &#x0211D;<sup><italic>p</italic></sup> has the <italic>p</italic>-variate von Mises Fisher distribution <italic>vMF</italic><sub><italic>p</italic></sub>(<italic><bold>&#x003BC;</bold></italic>, &#x003BA;) if its probability density function is</p>
<disp-formula id="E1"><mml:math id="M9"><mml:mi>f</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo>;</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>&#x003BC;</mml:mi></mml:mstyle><mml:mo>,</mml:mo><mml:mi>&#x003BA;</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x003BA;</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mtext>&#x000A0;exp</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x003BA;</mml:mi><mml:msup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo>&#x022A4;</mml:mo></mml:msup><mml:mtext>&#x000A0;</mml:mtext><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>&#x003BC;</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo></mml:math></disp-formula>
<p>where &#x003BA; &#x02265; 0 is the concentration parameter, ||<italic><bold>&#x003BC;</bold></italic>|| &#x0003D; 1 is the mean direction and the normalization constant <italic>c</italic><sub><italic>p</italic></sub>(&#x003BA;) is given by</p>
<disp-formula id="E2"><mml:math id="M10"><mml:msub><mml:mi>c</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x003BA;</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msup><mml:mi>&#x003BA;</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mo>/</mml:mo><mml:mn>2</mml:mn><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>2</mml:mn><mml:mi>&#x003C0;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo>/</mml:mo><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mo>/</mml:mo><mml:mn>2</mml:mn><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x003BA;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:math></disp-formula>
<p>where <italic>I</italic><sub><italic>r</italic></sub>(&#x000B7;) represents the modified Bessel function of the first kind at order <italic>r</italic>, as defined in Definition 4.</p>
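<p>As an illustrative sketch (ours), the vMF log-density can be evaluated numerically with SciPy, whose <monospace>scipy.special.iv</monospace> implements the modified Bessel function of the first kind:</p>

```python
import numpy as np
from scipy.special import iv  # modified Bessel function of the first kind, I_r

def vmf_log_density(x, mu, kappa):
    # log f(x; mu, kappa) = log c_p(kappa) + kappa * x^T mu, for unit vectors x, mu
    p = len(x)
    log_c = ((p / 2 - 1) * np.log(kappa)
             - (p / 2) * np.log(2 * np.pi)
             - np.log(iv(p / 2 - 1, kappa)))
    return log_c + kappa * np.dot(x, mu)

# The density peaks in the mean direction mu and decays with angular distance:
mu = np.array([1.0, 0.0, 0.0])
off = np.array([0.0, 1.0, 0.0])
print(vmf_log_density(mu, mu, 5.0) > vmf_log_density(off, mu, 5.0))  # True
```

<p>Working in log space avoids overflow of exp(&#x003BA;) for large concentration values.</p>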
<p><bold>Definition 4</bold> (Modified Bessel Function of the First Kind). The modified Bessel function of the first kind of order <italic>r</italic> can be defined as (Mardia and Jupp, <xref ref-type="bibr" rid="B16">2009</xref>):</p>
<disp-formula id="E3"><mml:math id="M11"><mml:msub><mml:mi>I</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x003BA;</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msup><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x003BA;</mml:mi><mml:mo>/</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mi>r</mml:mi></mml:msup></mml:mrow><mml:mrow><mml:mo>&#x00393;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>r</mml:mi><mml:mo>+</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x00393;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:mstyle displaystyle='true'><mml:mrow><mml:msubsup><mml:mo>&#x0222B;</mml:mo><mml:mn>0</mml:mn><mml:mi>&#x003C0;</mml:mi></mml:msubsup><mml:mrow><mml:mtext>&#x000A0;exp</mml:mtext></mml:mrow></mml:mrow></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x003BA;</mml:mi><mml:mtext>&#x000A0;cos&#x000A0;</mml:mtext><mml:mi>&#x003B8;</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:msup><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mtext>sin&#x000A0;</mml:mtext><mml:mi>&#x003B8;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mi>r</mml:mi></mml:mrow></mml:msup><mml:mi>d</mml:mi><mml:mi>&#x003B8;</mml:mi><mml:mo>,</mml:mo></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M12"><mml:mi>&#x00393;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mo>&#x0222B;</mml:mo></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>&#x0221E;</mml:mi></mml:mrow></mml:msubsup><mml:mo class="qopname">exp</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>-</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msup><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mi>d</mml:mi><mml:mi>t</mml:mi></mml:math></inline-formula> is the gamma function.</p>
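<p>As a numerical sanity check (ours), the integral definition above can be compared against SciPy&#x00027;s closed-form implementation <monospace>scipy.special.iv</monospace>:</p>

```python
import math
from scipy.integrate import quad
from scipy.special import iv, gamma

def bessel_iv_integral(r, kappa):
    # I_r(kappa) = (kappa/2)^r / (Gamma(r + 1/2) Gamma(1/2))
    #              * integral_0^pi exp(kappa cos t) (sin t)^(2r) dt
    prefactor = (kappa / 2) ** r / (gamma(r + 0.5) * gamma(0.5))
    integral, _ = quad(lambda t: math.exp(kappa * math.cos(t)) * math.sin(t) ** (2 * r),
                       0.0, math.pi)
    return prefactor * integral

# Agrees with scipy's implementation to numerical precision:
print(abs(bessel_iv_integral(1.0, 2.5) - iv(1.0, 2.5)) < 1e-6)  # True
```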
</sec>
<sec id="s4">
<title>4. Models</title>
<p>In this section, we introduce the two models built upon the word2vec framework that incorporate both global and local contexts in unsupervised word embedding learning.</p>
<sec>
<title>4.1. Joint CBOW Model</title>
<p>The <sans-serif>Joint CBOW</sans-serif> model adopts an idea similar to word2vec&#x00027;s CBOW model (Mikolov et al., <xref ref-type="bibr" rid="B22">2013a</xref>): the model tries to predict the current word given its contexts. The objective has two components: the loss incurred when predicting from the local context and the loss incurred when predicting from the global context.</p>
<p>We define the local context loss as follows; it encourages the model to correctly predict a word from its local context window:</p>
<disp-formula id="E4"><label>(1)</label><mml:math id="M13"><mml:mtable columnalign='left'><mml:mtr><mml:mtd><mml:msub><mml:mi>&#x02112;</mml:mi><mml:mrow><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi mathvariant='script'>D</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x02264;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mo>&#x0007C;</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x0007C;</mml:mo></mml:mrow></mml:munder><mml:mrow><mml:mtext>log&#x000A0;</mml:mtext></mml:mrow></mml:mstyle></mml:mrow></mml:mstyle><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mi mathvariant='script'>C</mml:mi><mml:mi>L</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mi>d</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi mathvariant='script'>D</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mstyle 
displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x02264;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mo>&#x0007C;</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x0007C;</mml:mo></mml:mrow></mml:munder><mml:mrow><mml:mtext>log&#x000A0;</mml:mtext></mml:mrow></mml:mstyle></mml:mrow></mml:mstyle><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Following Mikolov et al. (<xref ref-type="bibr" rid="B22">2013a</xref>), we define the conditional probability to be</p>
<disp-formula id="E5"><label>(2)</label><mml:math id="M14"><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mtext>&#x000A0;exp</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mover accent='true'><mml:mi>u</mml:mi><mml:mo>&#x000AF;</mml:mo></mml:mover></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>&#x022A4;</mml:mo></mml:msubsup><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mstyle displaystyle='true'><mml:msub><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:msup><mml:mi>w</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo>&#x02208;</mml:mo><mml:mi>V</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mtext>&#x000A0;exp</mml:mtext></mml:mrow></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mover 
accent='true'><mml:mi>u</mml:mi><mml:mo>&#x000AF;</mml:mo></mml:mover></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>&#x022A4;</mml:mo></mml:msubsup><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:msup><mml:mi>w</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M15"><mml:msub><mml:mrow><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mi>u</mml:mi></mml:mstyle></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mo>-</mml:mo><mml:mi>h</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mi>h</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x02260;</mml:mo><mml:mn>0</mml:mn></mml:mrow></mml:munder><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mi>u</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>/</mml:mo><mml:mo>&#x02016;</mml:mo><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mo>-</mml:mo><mml:mi>h</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mi>h</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x02260;</mml:mo><mml:mn>0</mml:mn></mml:mrow></mml:munder><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mi>u</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>&#x02016;</mml:mo></mml:math></inline-formula> is the normalized sum of vector representations of words in the local context window of <italic>w</italic><sub><italic>i</italic></sub>.</p>
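<p>The normalized context sum above can be sketched in a few lines (a toy illustration with hypothetical names; the paper does not specify window handling at document boundaries, so this sketch simply clips the window):</p>

```python
import numpy as np

def local_context_vector(embeddings, words, i, h):
    """u-bar for w_i: the sum of input vectors u_w over the size-h window
    around position i (center word excluded), divided by its L2 norm.
    `embeddings` maps each word to its input vector."""
    window = [words[i + j] for j in range(-h, h + 1)
              if j != 0 and 0 <= i + j < len(words)]
    s = np.sum([embeddings[w] for w in window], axis=0)
    return s / np.linalg.norm(s)
```

<p>By construction the returned vector has unit L2 norm, which matches the normalization in the definition of <inline-formula><mml:math id="M15b"><mml:msub><mml:mover accent="true"><mml:mstyle mathvariant="bold"><mml:mi>u</mml:mi></mml:mstyle><mml:mo>&#x000AF;</mml:mo></mml:mover><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:msub></mml:math></inline-formula>.</p>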
<p>We define the global context loss as follows; it encourages the model to correctly predict a word from the document it belongs to:</p>
<disp-formula id="E6"><label>(3)</label><mml:math id="M16"><mml:mtable columnalign='left'><mml:mtr><mml:mtd><mml:msub><mml:mi>&#x02112;</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>b</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi mathvariant='script'>D</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x02264;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mo>&#x0007C;</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x0007C;</mml:mo></mml:mrow></mml:munder><mml:mrow><mml:mtext>log&#x000A0;</mml:mtext></mml:mrow></mml:mstyle></mml:mrow></mml:mstyle><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mi mathvariant='script'>C</mml:mi><mml:mi>G</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mi>d</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi mathvariant='script'>D</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mstyle 
displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x02264;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mo>&#x0007C;</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x0007C;</mml:mo></mml:mrow></mml:munder><mml:mrow><mml:mtext>log&#x000A0;</mml:mtext></mml:mrow></mml:mstyle></mml:mrow></mml:mstyle><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x02223;</mml:mo><mml:mi>d</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>We define the conditional probability to be</p>
<disp-formula id="E7"><label>(4)</label><mml:math id="M17"><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x02223;</mml:mo><mml:mi>d</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mtext>&#x000A0;exp</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>&#x022A4;</mml:mo></mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>d</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mstyle displaystyle='true'><mml:msub><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:msup><mml:mi>w</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo>&#x02208;</mml:mo><mml:mi>V</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mtext>&#x000A0;exp</mml:mtext></mml:mrow></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:msup><mml:mi>w</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo>&#x022A4;</mml:mo></mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>d</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac><mml:mo>.</mml:mo></mml:math></disp-formula>
<p>The final objective is the sum of the local context loss and the global context loss, weighted by a hyperparameter &#x003BB;:</p>
<disp-formula id="E8"><label>(5)</label><mml:math id="M18"><mml:msub><mml:mi>&#x02112;</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mi>o</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x02112;</mml:mi><mml:mrow><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>&#x003BB;</mml:mi><mml:msub><mml:mi>&#x02112;</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>b</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>.</mml:mo></mml:math></disp-formula>
<p>We note that when &#x003BB; &#x0003D; 1, the model places equal emphasis on local and global contexts; when &#x003BB; &#x0003C; 1, the local context matters more, and when &#x003BB; &#x0003E; 1, the global context matters more.</p>
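<p>Putting Equations (1)&#x02013;(5) together, the <sans-serif>Joint CBOW</sans-serif> objective for a single document can be sketched with full-softmax probabilities as below (a minimal illustration on toy word ids; function and variable names are ours, and negative sampling, which a practical implementation would use, is omitted):</p>

```python
import numpy as np

def log_softmax_at(scores, idx):
    """log of the softmax probability of entry idx (Equations 2 and 4)."""
    s = scores - scores.max()            # shift for numerical stability
    return s[idx] - np.log(np.exp(s).sum())

def joint_cbow_loss(U, V, doc_vec, doc, h, lam):
    """L_local + lam * L_global for one document `doc` (a list of word ids),
    following Equations (1)-(5); rows of U are input vectors u_w,
    rows of V are output vectors v_w, doc_vec is the document vector d."""
    local_loss = global_loss = 0.0
    for i, w in enumerate(doc):
        ctx = [doc[i + j] for j in range(-h, h + 1)
               if j != 0 and 0 <= i + j < len(doc)]
        u_bar = U[ctx].sum(axis=0)
        u_bar = u_bar / np.linalg.norm(u_bar)          # normalized context sum
        local_loss -= log_softmax_at(V @ u_bar, w)     # -log p(w_i | local window)
        global_loss -= log_softmax_at(V @ doc_vec, w)  # -log p(w_i | d)
    return local_loss + lam * global_loss
```

<p>Note that both terms share the output vectors <bold><italic>v</italic></bold><sub><italic>w</italic></sub>, which is how the two kinds of context jointly shape the same embedding space.</p>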
</sec>
<sec>
<title>4.2. Joint Skip-Gram Model</title>
<p>The <sans-serif>Joint Skip-Gram</sans-serif> model mirrors the <sans-serif>Joint CBOW</sans-serif> model with the inputs and outputs swapped: the model now tries to predict the contexts given the current word. Again, the objective consists of a local context loss and a global context loss.</p>
<p>We define the local context loss as follows; it encourages the model to correctly predict a word&#x00027;s local context window:</p>
<disp-formula id="E9"><label>(6)</label><mml:math id="M19"><mml:mtable columnalign='left'><mml:mtr><mml:mtd><mml:msub><mml:mi>&#x02112;</mml:mi><mml:mrow><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi mathvariant='script'>D</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x02264;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mo>&#x0007C;</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x0007C;</mml:mo></mml:mrow></mml:munder><mml:mrow><mml:mtext>log&#x000A0;</mml:mtext></mml:mrow></mml:mstyle></mml:mrow></mml:mstyle><mml:mtext>&#x000A0;</mml:mtext><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi mathvariant='script'>C</mml:mi><mml:mi>L</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mi>d</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi mathvariant='script'>D</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x02264;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mo>&#x0007C;</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x0007C;</mml:mo></mml:mrow></mml:munder><mml:mrow><mml:mstyle 
displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mi>h</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mi>h</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x02260;</mml:mo><mml:mn>0</mml:mn></mml:mrow></mml:munder><mml:mrow><mml:mtext>log&#x000A0;&#x000A0;</mml:mtext></mml:mrow></mml:mstyle></mml:mrow></mml:mstyle></mml:mrow></mml:mstyle><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Following Mikolov et al. (<xref ref-type="bibr" rid="B23">2013b</xref>), we define the conditional probability to be</p>
<disp-formula id="E10"><label>(7)</label><mml:math id="M20"><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mtext>&#x000A0;exp</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>u</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>&#x022A4;</mml:mo></mml:msubsup><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mstyle displaystyle='true'><mml:msub><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:msup><mml:mi>w</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo>&#x02208;</mml:mo><mml:mi>V</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mtext>&#x000A0;exp</mml:mtext></mml:mrow></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>u</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>&#x022A4;</mml:mo></mml:msubsup><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:msup><mml:mi>w</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:math></disp-formula>
<p>We define the global context loss as follows; it encourages the model to correctly predict the document a word belongs to:</p>
<disp-formula id="E11"><label>(8)</label><mml:math id="M21"><mml:mtable columnalign='left'><mml:mtr><mml:mtd><mml:msub><mml:mi>&#x02112;</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>b</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi mathvariant='script'>D</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x02264;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mo>&#x0007C;</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x0007C;</mml:mo></mml:mrow></mml:munder><mml:mrow><mml:mtext>log&#x000A0;</mml:mtext></mml:mrow></mml:mstyle></mml:mrow></mml:mstyle><mml:mtext>&#x000A0;</mml:mtext><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi mathvariant='script'>C</mml:mi><mml:mi>G</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mi>d</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi mathvariant='script'>D</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mstyle 
displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x02264;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mo>&#x0007C;</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x0007C;</mml:mo></mml:mrow></mml:munder><mml:mrow><mml:mtext>log&#x000A0;</mml:mtext></mml:mrow></mml:mstyle></mml:mrow></mml:mstyle><mml:mtext>&#x000A0;</mml:mtext><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>We define the conditional probability to be</p>
<disp-formula id="E12"><label>(9)</label><mml:math id="M22"><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mtext>&#x000A0;exp</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>u</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>&#x022A4;</mml:mo></mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>d</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mstyle displaystyle='true'><mml:msub><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:msup><mml:mi>d</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo>&#x02208;</mml:mo><mml:mi mathvariant='script'>D</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mtext>&#x000A0;exp</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>u</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>&#x022A4;</mml:mo></mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:msup><mml:mi>d</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup></mml:mstyle><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mstyle></mml:mrow></mml:mfrac><mml:mo>.</mml:mo></mml:math></disp-formula>
<p>The final objective is the sum of the local context loss and the global context loss, weighted by a hyperparameter &#x003BB;:</p>
<disp-formula id="E13"><label>(10)</label><mml:math id="M23"><mml:msub><mml:mi>&#x02112;</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mi>o</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x02112;</mml:mi><mml:mrow><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>&#x003BB;</mml:mi><mml:msub><mml:mi>&#x02112;</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>b</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>.</mml:mo></mml:math></disp-formula>
<p>We will study the effect of &#x003BB; in the experiments section.</p>
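<p>Analogously to the <sans-serif>Joint CBOW</sans-serif> case, Equations (6)&#x02013;(10) can be sketched as a full-softmax objective over one document (again a toy illustration with our own names; the document id indexes a matrix of document vectors, and negative sampling is omitted):</p>

```python
import numpy as np

def log_softmax_at(scores, idx):
    """log of the softmax probability of entry idx (Equations 7 and 9)."""
    s = scores - scores.max()            # shift for numerical stability
    return s[idx] - np.log(np.exp(s).sum())

def joint_skipgram_loss(U, V, D, doc_id, doc, h, lam):
    """L_local + lam * L_global for one document, following Equations (6)-(10).
    Rows of U are input vectors u_w, rows of V are output vectors v_w,
    rows of D are document vectors; doc_id indexes the current document in D."""
    local_loss = global_loss = 0.0
    for i, w in enumerate(doc):
        for j in range(-h, h + 1):
            if j != 0 and 0 <= i + j < len(doc):
                # -log p(w_{i+j} | w_i), Equation (7)
                local_loss -= log_softmax_at(V @ U[w], doc[i + j])
        # -log p(d | w_i), Equation (9): softmax over all documents
        global_loss -= log_softmax_at(D @ U[w], doc_id)
    return local_loss + lam * global_loss
```

<p>The key difference from <sans-serif>Joint CBOW</sans-serif> is the direction of prediction: here the center word&#x00027;s input vector <bold><italic>u</italic></bold><sub><italic>w<sub>i</sub></italic></sub> scores both its context words and its document.</p>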
</sec>
</sec>
<sec id="s5">
<title>5. Interpreting the Models</title>
<p>In this section, we propose a novel generative model to analyze the two models introduced in the previous section and show how they jointly incorporate global and local contexts. Overall, we assume there is <italic>a generative relationship</italic> between center words and contexts, i.e., either center words are generated from both local and global contexts (<sans-serif>Joint CBOW</sans-serif>), or local and global contexts are generated by center words (<sans-serif>Joint Skip-Gram</sans-serif>), as shown in <xref ref-type="fig" rid="F2">Figure 2</xref>. The generative model uses a spherical distribution in which word vectors are treated as the parameters to be estimated.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p><sans-serif>Joint CBOW</sans-serif> and <sans-serif>Joint Skip-Gram</sans-serif> as generative models.</p></caption>
<graphic xlink:href="fdata-03-00009-g0002.tif"/>
</fig>
<sec>
<title>5.1. The Spherical Generative Model</title>
<p>Before explaining <sans-serif>Joint Skip-Gram</sans-serif> and <sans-serif>Joint CBOW</sans-serif>, we first define the spherical distribution used in the generative model and show how it connects to the conditional probabilities used in <sans-serif>Joint Skip-Gram</sans-serif> and <sans-serif>Joint CBOW</sans-serif>.</p>
<p><bold>Theorem 1</bold>. When the corpus size and vocabulary size are infinite (i.e., <inline-formula><mml:math id="M24"><mml:mo>|</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow><mml:mo>|</mml:mo><mml:mo>&#x02192;</mml:mo><mml:mi>&#x0221E;</mml:mi></mml:math></inline-formula> and |<italic>V</italic>| &#x02192; &#x0221E;) and all word vectors and document vectors are assumed to be unit vectors<xref ref-type="fn" rid="fn0002"><sup>2</sup></xref>, generalizing the relationship of proportionality assumed in Equations (2), (4), (7), and (9) to the continuous case yields the vMF distribution with the corresponding prior vector as the mean direction and constant 1 as the concentration parameter, i.e.,</p>
<disp-formula id="E14"><mml:math id="M25"><mml:mtable columnalign='left'><mml:mtr><mml:mtd><mml:munder><mml:mrow><mml:mi>lim</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x0007C;</mml:mo><mml:mi>V</mml:mi><mml:mo>&#x0007C;</mml:mo><mml:mo>&#x02192;</mml:mo><mml:mi>&#x0221E;</mml:mi></mml:mrow></mml:munder><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mi mathvariant='script'>C</mml:mi><mml:mi>L</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mi>d</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mi>v</mml:mi><mml:mi>M</mml:mi><mml:msub><mml:mi>F</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mover accent='true'><mml:mi>u</mml:mi><mml:mo>&#x000AF;</mml:mo></mml:mover></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo><mml:mtext>&#x000A0;&#x000A0;exp</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>&#x022A4;</mml:mo></mml:msubsup><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mover accent='true'><mml:mi>u</mml:mi><mml:mo>&#x000AF;</mml:mo></mml:mover></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo 
stretchy='false'>)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:munder><mml:mrow><mml:mi>lim</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x0007C;</mml:mo><mml:mi>V</mml:mi><mml:mo>&#x0007C;</mml:mo><mml:mo>&#x02192;</mml:mo><mml:mi>&#x0221E;</mml:mi></mml:mrow></mml:munder><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x02223;</mml:mo><mml:mi>d</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mi>v</mml:mi><mml:mi>M</mml:mi><mml:msub><mml:mi>F</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>d</mml:mi></mml:mstyle><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo><mml:mtext>&#x000A0;&#x000A0;exp</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>&#x022A4;</mml:mo></mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>d</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:munder><mml:mrow><mml:mi>lim</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x0007C;</mml:mo><mml:mi>V</mml:mi><mml:mo>&#x0007C;</mml:mo><mml:mo>&#x02192;</mml:mo><mml:mi>&#x0221E;</mml:mi></mml:mrow></mml:munder><mml:mi>p</mml:mi><mml:mo 
stretchy='false'>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mi>v</mml:mi><mml:mi>M</mml:mi><mml:msub><mml:mi>F</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>u</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo><mml:mtext>&#x000A0;&#x000A0;exp</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow><mml:mo>&#x022A4;</mml:mo></mml:msubsup><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>u</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:munder><mml:mrow><mml:mi>lim</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x0007C;</mml:mo><mml:mi mathvariant='script'>D</mml:mi><mml:mo>&#x0007C;</mml:mo><mml:mo>&#x02192;</mml:mo><mml:mi>&#x0221E;</mml:mi></mml:mrow></mml:munder><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo 
stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mi>v</mml:mi><mml:mi>M</mml:mi><mml:msub><mml:mi>F</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>u</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo><mml:mtext>&#x000A0;exp</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>d</mml:mi></mml:mstyle><mml:mo>&#x022A4;</mml:mo></mml:msup><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>u</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>See <xref ref-type="supplementary-material" rid="SM1">Appendix</xref> for proof.</p>
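<p>For intuition, the vMF<sub><italic>p</italic></sub>(&#x000B7;, 1) densities in Theorem 1 can be checked numerically in the simplest case <italic>p</italic> = 2 (the unit circle), where the normalizer reduces to <italic>c</italic><sub>2</sub>(&#x003BA;) = 1/(2&#x003C0;<italic>I</italic><sub>0</sub>(&#x003BA;)). A small stdlib-only sketch (helper names are ours):</p>

```python
import math

def bessel_i0(x, terms=40):
    """Modified Bessel function I_0 via its power series."""
    return sum((x / 2) ** (2 * k) / math.factorial(k) ** 2 for k in range(terms))

def vmf_density_circle(theta, mu_angle, kappa=1.0):
    """vMF density for p = 2 (the unit circle), parametrized by angles:
    c_2(kappa) * exp(kappa * cos(theta - mu_angle)),
    with c_2(kappa) = 1 / (2 * pi * I_0(kappa))."""
    c2 = 1.0 / (2.0 * math.pi * bessel_i0(kappa))
    return c2 * math.exp(kappa * math.cos(theta - mu_angle))
```

<p>Integrating this density over one full period gives 1, and larger &#x003BA; concentrates the mass around the mean direction; Theorem 1 fixes &#x003BA; = 1, so the concentration is entirely carried by the learned mean directions.</p>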
</sec>
<sec>
<title>5.2. <sans-serif>Joint CBOW</sans-serif> as Words Generation</title>
<p>In this subsection, we show that <sans-serif>Joint CBOW</sans-serif> performs maximum likelihood estimation of the corpus assuming <italic>local and global contexts generate words</italic>. This assumption naturally mirrors how humans write articles: we first form a general idea of what the document is going to talk about, and then write down each word so that it is coherent with both the meaning of the entire document (global context) and its surrounding words (local context).</p>
<p>We describe the details of the generative model below:</p>
<list list-type="order">
<list-item><p>Underlying assumptions of local and global contexts.</p>
<p>The global context representation <bold><italic>d</italic></bold> (equivalent to the document vector) encodes general semantic and topical information of the entire document and should be a constant vector; the local context representation <bold><italic>l</italic></bold><sub><italic>i</italic></sub> encodes local semantic and syntactic information around <italic>w</italic><sub><italic>i</italic></sub> and should keep drifting slowly as the local context window shifts.</p>
<p>Based on the above intuition, we assume <bold><italic>d</italic></bold> is fixed for each document <italic>d</italic>, while <bold><italic>l</italic></bold><sub><italic>i</italic></sub> drifts slowly on the unit sphere in the embedding space with a small displacement between consecutive words. Finally, <italic>w</italic><sub><italic>i</italic></sub> is generated based on both <bold><italic>d</italic></bold> and <bold><italic>l</italic></bold><sub><italic>i</italic></sub>, i.e.,
<disp-formula id="E15"><mml:math id="M26"><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mtext>&#x000A0;gets&#x000A0;generated</mml:mtext><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>l</mml:mi></mml:mstyle><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x000B7;</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x02223;</mml:mo><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>d</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo><mml:mo>.</mml:mo></mml:math></disp-formula></p></list-item>
<list-item><p>Contexts generate words.</p>
<p>Given the local context representation <bold><italic>l</italic></bold><sub><italic>i</italic></sub> of <italic>w</italic><sub><italic>i</italic></sub> and global context representation <bold><italic>d</italic></bold>, we assume the probability of a word being generated as <italic>the center word</italic> is given by the vMF distribution with the context representation as the mean direction and 1 as the concentration parameter:
<disp-formula id="E16"><label>(11)</label><mml:math id="M27"><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>l</mml:mi></mml:mstyle><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mi>v</mml:mi><mml:mi>M</mml:mi><mml:msub><mml:mi>F</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>l</mml:mi></mml:mstyle><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo><mml:mtext>&#x000A0;&#x000A0;exp</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>&#x022A4;</mml:mo></mml:msubsup><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>l</mml:mi></mml:mstyle><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo></mml:math></disp-formula>
<disp-formula id="E17"><label>(12)</label><mml:math id="M28"><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x02223;</mml:mo><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>d</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mi>v</mml:mi><mml:mi>M</mml:mi><mml:msub><mml:mi>F</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>d</mml:mi></mml:mstyle><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo><mml:mtext>&#x000A0;exp</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>&#x022A4;</mml:mo></mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>d</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo><mml:mo>.</mml:mo></mml:math></disp-formula></p>
<p>The explicit expression of the local context representation <bold><italic>l</italic></bold><sub><italic>i</italic></sub> will be derived later.</p>
<p>Recall that in <sans-serif>Joint CBOW</sans-serif>, each word plays two roles: (1) center word and (2) context word for other words. Given the local context representation <bold><italic>l</italic></bold><sub><italic>i</italic></sub> of <italic>w</italic><sub><italic>i</italic></sub>, we assume the probability of a word being generated as <italic>the context word</italic> (we use <italic>u</italic><sub><italic>i</italic></sub> to denote that the word is viewed as a context word rather than as a center word) is also given by the vMF distribution:
<disp-formula id="E18"><label>(13)</label><mml:math id="M29"><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>l</mml:mi></mml:mstyle><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mi>v</mml:mi><mml:mi>M</mml:mi><mml:msub><mml:mi>F</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>l</mml:mi></mml:mstyle><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo><mml:mtext>&#x000A0;exp</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>u</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>&#x022A4;</mml:mo></mml:msubsup><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>l</mml:mi></mml:mstyle><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo></mml:math></disp-formula></p></list-item>
</list>
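The product assumption p(w<sub>i</sub> gets generated) = p(w<sub>i</sub> | <bold><italic>l</italic></bold><sub><italic>i</italic></sub>) · p(w<sub>i</sub> | <bold><italic>d</italic></bold>) can be made concrete with a small numerical sketch, assuming plain Python lists as embedding vectors and a hypothetical three-word vocabulary; the vMF constant c<sub>p</sub>(1) cancels once the product is normalized over the vocabulary:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def word_generation_probs(word_vecs, l_i, d):
    """p(w | l_i) * p(w | d) for every word in a toy vocabulary,
    normalized over the vocabulary so the vMF constant c_p(1) cancels.
    (Hypothetical helper; the paper states the assumption, not this code.)"""
    scores = [math.exp(dot(v, l_i)) * math.exp(dot(v, d)) for v in word_vecs]
    z = sum(scores)
    return [s / z for s in scores]

# hypothetical 3-word vocabulary in 2 dimensions (unit vectors)
vocab = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
l_i = [1.0, 0.0]   # local context representation
d = [0.0, 1.0]     # global (document) context representation
probs = word_generation_probs(vocab, l_i, d)
```

Words whose vectors align with both the local and the global context receive the highest probability, matching the intuition that a written word should be coherent with its surroundings and with the document topic.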
<p>Now we are ready to use the above generative model to explain the relationship between <sans-serif>Joint CBOW</sans-serif> and text generation. We begin by deriving the explicit representation of <bold><italic>l</italic></bold><sub><italic>i</italic></sub>. Let <inline-formula><mml:math id="M30"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">U</mml:mi></mml:mrow></mml:math></inline-formula> be the set of embeddings of the context words around word <italic>w</italic><sub><italic>i</italic></sub>, i.e.,</p>
<disp-formula id="E19"><mml:math id="M31"><mml:mi mathvariant='script'>U</mml:mi><mml:mo>=</mml:mo><mml:mo>&#x0007B;</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>u</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mo>&#x1D54A;</mml:mo><mml:mrow><mml:mi>p</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>u</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mtext>&#x000A0;follows&#x000A0;</mml:mtext><mml:mi>v</mml:mi><mml:mi>M</mml:mi><mml:msub><mml:mi>F</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>l</mml:mi></mml:mstyle><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mi>h</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mi>h</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x02260;</mml:mo><mml:mn>0</mml:mn><mml:mo>&#x0007D;</mml:mo><mml:mo>,</mml:mo></mml:math></disp-formula>
<p>then we use the maximum likelihood estimate (see <xref ref-type="supplementary-material" rid="SM1">Appendix</xref>) to find <inline-formula><mml:math id="M32"><mml:mover accent="true"><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mi>l</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:math></inline-formula>:</p>
<disp-formula id="E20"><mml:math id="M33"><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mover accent='true'><mml:mi>l</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover></mml:mstyle><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mstyle displaystyle='true'><mml:msub><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mi>h</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mi>h</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x02260;</mml:mo><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>u</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mrow></mml:mstyle></mml:mrow><mml:mrow><mml:mo>&#x02016;</mml:mo><mml:mstyle displaystyle='true'><mml:msub><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mi>h</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mi>h</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x02260;</mml:mo><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>u</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mrow></mml:mstyle><mml:mo>&#x02016;</mml:mo></mml:mrow></mml:mfrac><mml:mo>.</mml:mo></mml:math></disp-formula>
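The estimate above is simply the sum of the context-word embeddings in the window, renormalized to unit length. A minimal sketch in plain Python (the helper name `local_context_estimate` and the toy vectors are hypothetical):

```python
import math

def local_context_estimate(context_embs):
    """MLE of the vMF mean direction from unit-norm observations:
    the normalized sum of the context-word embeddings u_{w_{i+j}}
    in the window. (Hypothetical helper name.)"""
    dim = len(context_embs[0])
    total = [sum(u[k] for u in context_embs) for k in range(dim)]
    norm = math.sqrt(sum(x * x for x in total))
    return [x / norm for x in total]

# toy check: the estimate for two orthogonal unit vectors bisects them
l_hat = local_context_estimate([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
```

Note that only the direction of the sum matters; the window size h merely determines how many context embeddings enter the sum.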
<p>Now we view word vector representations <bold><italic>v</italic></bold><sub><italic>w</italic><sub><italic>i</italic></sub></sub>, <bold><italic>u</italic></bold><sub><italic>w</italic><sub><italic>i</italic>&#x0002B;<italic>j</italic></sub></sub> and document representation <bold><italic>d</italic></bold> as parameters of the text generation model to be estimated, and write the likelihood of the corpus <inline-formula><mml:math id="M34"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> as:</p>
<disp-formula id="E21"><mml:math id="M35"><mml:mi>P</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi mathvariant='script'>D</mml:mi><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>u</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>d</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x0220F;</mml:mo><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi mathvariant='script'>D</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x0220F;</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x02264;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mo>&#x0007C;</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x0007C;</mml:mo></mml:mrow></mml:munder><mml:mi>p</mml:mi></mml:mstyle></mml:mrow></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mover accent='true'><mml:mi>l</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover></mml:mstyle><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mo>&#x000B7;</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x02223;</mml:mo><mml:mstyle mathvariant='bold-italic' 
mathsize='normal'><mml:mi>d</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo><mml:mo>.</mml:mo></mml:math></disp-formula>
<p>When the corpus size is finite, we have to turn the equality in Equations (11) and (12) into proportionality, i.e., <inline-formula><mml:math id="M36"><mml:mi>p</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mi>l</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0221D;</mml:mo><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo class="qopname">exp</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mi>v</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo>&#x022A4;</mml:mo></mml:mrow></mml:msubsup><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mi>l</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M37"><mml:mi>p</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02223;</mml:mo><mml:mstyle mathvariant="bold"><mml:mi>d</mml:mi></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0221D;</mml:mo><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mo
stretchy="false">)</mml:mo></mml:mrow><mml:mo class="qopname">exp</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mi>v</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo>&#x022A4;</mml:mo></mml:mrow></mml:msubsup><mml:mstyle mathvariant="bold"><mml:mi>d</mml:mi></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>. Then the explicit expression of <inline-formula><mml:math id="M38"><mml:mi>p</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02223;</mml:mo><mml:mover accent="true"><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mi>l</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> and that of <italic>p</italic>(<italic>w</italic><sub><italic>i</italic></sub>&#x02223;<bold><italic>d</italic></bold>) become Equations (2) and (4), respectively.</p>
<p>The log-likelihood of the corpus <inline-formula><mml:math id="M39"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> is:</p>
<disp-formula id="E22"><mml:math id="M40"><mml:mtable columnalign='left'><mml:mtr><mml:mtd><mml:mtext>log&#x000A0;&#x000A0;</mml:mtext><mml:mi>P</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi mathvariant='script'>D</mml:mi><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>u</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>d</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi mathvariant='script'>D</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x02264;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mo>&#x0007C;</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x0007C;</mml:mo></mml:mrow></mml:munder><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>log&#x000A0;&#x000A0;</mml:mtext><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x02223;</mml:mo><mml:mover accent='true'><mml:mrow><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>l</mml:mi></mml:mstyle><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo stretchy='true'>&#x0005E;</mml:mo></mml:mover><mml:mo stretchy='false'>)</mml:mo><mml:mo>+</mml:mo><mml:mtext>log&#x000A0;&#x000A0;</mml:mtext><mml:mi>p</mml:mi><mml:mo 
stretchy='false'>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x02223;</mml:mo><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>d</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mstyle></mml:mrow></mml:mstyle></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi mathvariant='script'>D</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x02264;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mo>&#x0007C;</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x0007C;</mml:mo></mml:mrow></mml:munder><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>log&#x000A0;</mml:mtext><mml:mfrac><mml:mrow><mml:mtext>&#x000A0;exp</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>&#x022A4;</mml:mo></mml:msubsup><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mover accent='true'><mml:mi>l</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover></mml:mstyle><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mstyle displaystyle='true'><mml:msub><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:msup><mml:mi>w</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup></mml:mrow></mml:msub><mml:mrow><mml:mtext>&#x000A0;exp</mml:mtext></mml:mrow></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:msup><mml:mi>w</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo>&#x022A4;</mml:mo></mml:msubsup><mml:msub><mml:mstyle mathvariant='bold-italic' 
mathsize='normal'><mml:mover accent='true'><mml:mi>l</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover></mml:mstyle><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac><mml:mo>+</mml:mo><mml:mtext>log&#x000A0;</mml:mtext><mml:mfrac><mml:mrow><mml:mtext>&#x000A0;exp</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>&#x022A4;</mml:mo></mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>d</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mstyle displaystyle='true'><mml:msub><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:msup><mml:mi>w</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup></mml:mrow></mml:msub><mml:mrow><mml:mtext>&#x000A0;exp</mml:mtext></mml:mrow></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:msup><mml:mi>w</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo>&#x022A4;</mml:mo></mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>d</mml:mi></mml:mstyle><mml:mo 
stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mstyle></mml:mrow></mml:mstyle></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>&#x02112;</mml:mi><mml:mrow><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x02112;</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>b</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M41"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula><mml:math id="M42"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>g</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>b</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> correspond to the local context loss (Equation 1) and the global context loss (Equation 3) of the <sans-serif>Joint CBOW</sans-serif> model, respectively. The only difference between the log-likelihood here and the <sans-serif>Joint CBOW</sans-serif> objective is that the log-likelihood assumes equal weights on the local and global contexts (&#x003BB; &#x0003D; 1 in Equation 5).</p>
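In the finite-corpus softmax form, each corpus position contributes log p(w<sub>i</sub> | l̂<sub>i</sub>) + log p(w<sub>i</sub> | <bold><italic>d</italic></bold>) to the log-likelihood, each term a log-softmax of dot products over the vocabulary. A minimal per-position sketch in plain Python (the toy vocabulary and vectors are hypothetical):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_softmax(scores, idx):
    # log( exp(scores[idx]) / sum_{w'} exp(scores[w']) ), always <= 0
    return scores[idx] - math.log(sum(math.exp(s) for s in scores))

# hypothetical toy setup: 3-word vocabulary, center word index 1
vocab = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # word vectors v_w
l_hat = [0.6, 0.8]                             # estimated local context
d = [0.0, 1.0]                                 # document vector

# per-position contribution to the corpus log-likelihood; summed over
# all positions it equals -(L_local + L_global) with equal weights
ll = (log_softmax([dot(v, l_hat) for v in vocab], 1)
      + log_softmax([dot(v, d) for v in vocab], 1))
```

Maximizing this quantity over the word and document vectors is exactly what the equal-weight (λ = 1) <sans-serif>Joint CBOW</sans-serif> objective does.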
<p>Therefore, <sans-serif>Joint CBOW</sans-serif> performs maximum likelihood estimation on the text corpus with the assumption that words are generated by their contexts.</p>
</sec>
<sec>
<title>5.3. <sans-serif>Joint Skip-Gram</sans-serif> as Contexts Generation</title>
<p>In this subsection, we show that <sans-serif>Joint Skip-Gram</sans-serif> performs maximum likelihood estimation of the corpus assuming <italic>center words generate their local and global contexts</italic>, reversing the generation relationship assumed in <sans-serif>Joint CBOW</sans-serif>.</p>
<p>We describe the details of the generative model below:</p>
<list list-type="order">
<list-item><p>Underlying assumptions of local and global contexts.</p>
<p>The local context of a word <italic>w</italic><sub><italic>i</italic></sub> carries its local semantic and syntactic information and is assumed to be generated according to the semantics of <italic>w</italic><sub><italic>i</italic></sub>. Further, we assume each context word in the local context window is generated independently, i.e.,
<disp-formula id="E23"><mml:math id="M43"><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi mathvariant='script'>C</mml:mi><mml:mi>L</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mi>d</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mtext>&#x000A0;gets&#x000A0;generated</mml:mtext><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x0220F;</mml:mo><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msub><mml:mi mathvariant='script'>C</mml:mi><mml:mi>L</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mi>d</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:munder><mml:mi>p</mml:mi></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>.</mml:mo></mml:math></disp-formula></p>
<p>The global context of word <italic>w</italic><sub><italic>i</italic></sub> carries the global semantics of the entire document <italic>d</italic> that <italic>w</italic><sub><italic>i</italic></sub> belongs to and is assumed to be generated collectively by all the words in <italic>d</italic>, i.e.,
<disp-formula id="E24"><mml:math id="M44"><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi mathvariant='script'>C</mml:mi><mml:mi>G</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mi>d</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mtext>&#x000A0;gets&#x000A0;generated</mml:mtext><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x0220F;</mml:mo><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>&#x02208;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:munder><mml:mi>p</mml:mi></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>.</mml:mo></mml:math></disp-formula></p></list-item>
<list-item><p>Words generate contexts.</p>
<p>Given the word representation <bold><italic>u</italic></bold><sub><italic>w</italic><sub><italic>i</italic></sub></sub> of <italic>w</italic><sub><italic>i</italic></sub>, we assume the local and global contexts of <italic>w</italic><sub><italic>i</italic></sub> are generated from the vMF distribution with <bold><italic>u</italic></bold><sub><italic>w</italic><sub><italic>i</italic></sub></sub> as the mean direction and 1 as the concentration parameter:
<disp-formula id="E25"><label>(14)</label><mml:math id="M45"><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mi>v</mml:mi><mml:mi>M</mml:mi><mml:msub><mml:mi>F</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>u</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo><mml:mtext>&#x000A0;&#x000A0;exp</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow><mml:mo>&#x022A4;</mml:mo></mml:msubsup><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>u</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo></mml:math></disp-formula>
<disp-formula id="E26"><label>(15)</label><mml:math id="M46"><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mi>v</mml:mi><mml:mi>M</mml:mi><mml:msub><mml:mi>F</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>u</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo><mml:mtext>&#x000A0;&#x000A0;exp</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>d</mml:mi></mml:mstyle><mml:mo>&#x022A4;</mml:mo></mml:msup><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>u</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>.</mml:mo></mml:math></disp-formula></p></list-item>
</list>
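With a finite vocabulary and a finite set of documents, the vMF forms in Equations (14) and (15) again become softmax distributions over dot products. A minimal sketch in plain Python of the two distributions a center word w<sub>i</sub> induces (all vectors and sizes are hypothetical toys):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores]

# hypothetical toy setup: the center word generates context words via
# Eq. (14) and its document via Eq. (15)
u_wi = [1.0, 0.0]                                  # u_{w_i}
ctx_vecs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # context vectors v_w
doc_vecs = [[0.8, 0.6], [-0.6, 0.8]]               # document vectors d

p_context = softmax([dot(v, u_wi) for v in ctx_vecs])  # p(w_j | w_i)
p_doc = softmax([dot(d, u_wi) for d in doc_vecs])      # p(d | w_i)

# each word in C_L(w_i, d) is drawn independently from p_context, so
# the probability of a window is the product over its words
```

The probability of the full local context window then factorizes as the product of the `p_context` entries of its words, mirroring the independence assumption above.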
<p>Now we are ready to write out the likelihood of the collection of local and global contexts in the entire corpus <inline-formula><mml:math id="M47"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">C</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">C</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0222A;</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">C</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>G</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>:</p>
<disp-formula id="E27"><mml:math id="M48"><mml:mtable columnalign='left'><mml:mtr><mml:mtd><mml:mi>P</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi mathvariant='script'>C</mml:mi><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>u</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>d</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mi>P</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi mathvariant='script'>C</mml:mi><mml:mi>L</mml:mi></mml:msub><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>u</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>d</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x000B7;</mml:mo><mml:mi>P</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi mathvariant='script'>C</mml:mi><mml:mi>G</mml:mi></mml:msub><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>u</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' 
mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>d</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x0220F;</mml:mo><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi mathvariant='script'>D</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x0220F;</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x02264;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mo>&#x0007C;</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x0007C;</mml:mo></mml:mrow></mml:munder><mml:mrow><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x0220F;</mml:mo><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msub><mml:mi mathvariant='script'>C</mml:mi><mml:mi>L</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mi>d</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:munder><mml:mi>p</mml:mi></mml:mstyle></mml:mrow></mml:mstyle></mml:mrow></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo 
stretchy='false'>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x0220F;</mml:mo><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi mathvariant='script'>D</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x0220F;</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x02264;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mo>&#x0007C;</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x0007C;</mml:mo></mml:mrow></mml:munder><mml:mi>p</mml:mi></mml:mstyle></mml:mrow></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>When the corpus size is finite, we have to turn the equality in Equations (14) and (15) into a proportionality, i.e., <inline-formula><mml:math id="M49"><mml:mi>p</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0221D;</mml:mo><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo class="qopname">exp</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mi>v</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo>&#x022A4;</mml:mo></mml:mrow></mml:msubsup><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mi>u</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M50"><mml:mi>p</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0221D;</mml:mo><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo class="qopname">exp</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mi>d</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mo>&#x022A4;</mml:mo></mml:mrow></mml:msup><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mi>u</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>. Then the explicit expressions of <italic>p</italic>(<italic>w</italic><sub><italic>j</italic></sub>&#x02223;<italic>w</italic><sub><italic>i</italic></sub>) and <italic>p</italic>(<italic>d</italic>&#x02223;<italic>w</italic><sub><italic>i</italic></sub>) become Equations (7) and (9), respectively.</p>
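<p>To make the normalized forms concrete, the softmax in Equations (7) and (9) can be sketched numerically. The toy embedding matrices below are hypothetical random stand-ins, not trained vectors:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, p = 6, 3, 4              # vocabulary size, number of documents, embedding dim
u = rng.normal(size=(V, p))    # center (input) word embeddings u_w
v = rng.normal(size=(V, p))    # context (output) word embeddings v_w
d = rng.normal(size=(D, p))    # document embeddings

def softmax(scores):
    e = np.exp(scores - scores.max())  # subtract max for numerical stability
    return e / e.sum()

w_i = 2
p_word = softmax(v @ u[w_i])   # p(w_j | w_i), Equation (7): normalize over vocabulary
p_doc = softmax(d @ u[w_i])    # p(d | w_i), Equation (9): normalize over documents

assert np.isclose(p_word.sum(), 1.0) and np.isclose(p_doc.sum(), 1.0)
```

<p>Both distributions are softmaxes of the same center-word vector against two different sets of targets, which is exactly what makes the local and global losses share the embedding <bold>u</bold><sub><italic>w</italic><sub><italic>i</italic></sub></sub>.</p>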
<p>The log-likelihood of the contexts <inline-formula><mml:math id="M51"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">C</mml:mi></mml:mrow></mml:math></inline-formula> is:</p>
<disp-formula id="E28"><mml:math id="M52"><mml:mtable columnalign='left'><mml:mtr><mml:mtd><mml:mtext>log&#x000A0;</mml:mtext><mml:mi>P</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi mathvariant='script'>C</mml:mi><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>u</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>d</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi mathvariant='script'>D</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x02264;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mo>&#x0007C;</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x0007C;</mml:mo></mml:mrow></mml:munder><mml:mrow><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mi>h</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mi>h</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x02260;</mml:mo><mml:mn>0</mml:mn></mml:mrow></mml:munder><mml:mrow><mml:mtext>log&#x000A0;</mml:mtext></mml:mrow></mml:mstyle></mml:mrow></mml:mstyle></mml:mrow></mml:mstyle><mml:mi>p</mml:mi><mml:mo 
stretchy='false'>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mo>+</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi mathvariant='script'>D</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x02264;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mo>&#x0007C;</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x0007C;</mml:mo></mml:mrow></mml:munder><mml:mrow><mml:mtext>log&#x000A0;</mml:mtext></mml:mrow></mml:mstyle></mml:mrow></mml:mstyle><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:mstyle 
displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi mathvariant='script'>D</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x02264;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mo>&#x0007C;</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x0007C;</mml:mo></mml:mrow></mml:munder><mml:mrow><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mi>h</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mi>h</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x02260;</mml:mo><mml:mn>0</mml:mn></mml:mrow></mml:munder><mml:mrow><mml:mtext>log&#x000A0;</mml:mtext></mml:mrow></mml:mstyle></mml:mrow></mml:mstyle></mml:mrow></mml:mstyle><mml:mfrac><mml:mrow><mml:mtext>exp</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>u</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>&#x022A4;</mml:mo></mml:msubsup><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mstyle displaystyle='true'><mml:msub><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:msup><mml:mi>w</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo>&#x02208;</mml:mo><mml:mi>V</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mtext>exp</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>u</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>&#x022A4;</mml:mo></mml:msubsup><mml:msub><mml:mstyle mathvariant='bold-italic' 
mathsize='normal'><mml:mi>v</mml:mi></mml:mstyle><mml:msup><mml:mi>w</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mstyle></mml:mrow></mml:mfrac></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mo>+</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi mathvariant='script'>D</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x02264;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mo>&#x0007C;</mml:mo><mml:mi>d</mml:mi><mml:mo>&#x0007C;</mml:mo></mml:mrow></mml:munder><mml:mrow><mml:mtext>log&#x000A0;</mml:mtext></mml:mrow></mml:mstyle></mml:mrow></mml:mstyle><mml:mfrac><mml:mrow><mml:mtext>exp</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>u</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>&#x022A4;</mml:mo></mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>d</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mstyle displaystyle='true'><mml:msub><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:msup><mml:mi>d</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo>&#x02208;</mml:mo><mml:mi mathvariant='script'>D</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mtext>exp</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mstyle 
mathvariant='bold-italic' mathsize='normal'><mml:mi>u</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>&#x022A4;</mml:mo></mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:msup><mml:mi>d</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup></mml:mstyle><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mstyle></mml:mrow></mml:mfrac></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>&#x02112;</mml:mi><mml:mrow><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x02112;</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>b</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M53"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula><mml:math id="M54"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>g</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>b</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> correspond to the local context loss (Equation 6) and the global context loss (Equation 8) of the <sans-serif>Joint Skip-Gram</sans-serif> model, respectively. The only difference between the log-likelihood here and the <sans-serif>Joint Skip-Gram</sans-serif> objective is that the log-likelihood assumes equal weights on local and global contexts (&#x003BB; &#x0003D; 1 in Equation 10).</p>
<p>Therefore, <sans-serif>Joint Skip-Gram</sans-serif> performs maximum likelihood estimation on the text corpus with the assumption that contexts are generated by words.</p>
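<p>A minimal sketch of the weighted combination described above; the additive form is our reading of Equation (10), with &#x003BB; &#x0003D; 1 recovering the equal-weight log-likelihood derived here:</p>

```python
def joint_objective(loss_local, loss_global, lam=1.0):
    """Weighted combination of local and global context losses.

    lam = 1 recovers the equal-weight maximum-likelihood view;
    lam = 0 keeps only the local context loss.
    """
    return loss_local + lam * loss_global

# With lam = 1 the two losses contribute equally, matching the
# log-likelihood above; other values of lam rebalance the two contexts.
assert joint_objective(2.0, 3.0, lam=1.0) == 5.0
assert joint_objective(2.0, 3.0, lam=0.0) == 2.0
```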
</sec>
</sec>
<sec id="s6">
<title>6. Experiments</title>
<p>In this section, we empirically evaluate the quality of the word embeddings trained by our proposed models and conduct a set of case studies to understand their properties.</p>
<sec>
<title>6.1. Datasets</title>
<p>We use the following benchmark datasets for both word embedding training and text classification evaluation. The dataset statistics are summarized in <xref ref-type="table" rid="T2">Table 2</xref>.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Dataset statistics.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left"><bold>Dataset</bold></th>
<th valign="top" align="center"><bold>&#x00023; Train/&#x00023; Test</bold></th>
<th valign="top" align="center"><bold>&#x00023; Classes</bold></th>
<th valign="top" align="center"><bold>Avg. doc. length</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left"><bold>20News</bold></td>
<td valign="top" align="center">11,314/7,532</td>
<td valign="top" align="center">20</td>
<td valign="top" align="center">396</td>
</tr>
<tr>
<td valign="top" align="left"><bold>Reuters</bold></td>
<td valign="top" align="center">5,485/2,189</td>
<td valign="top" align="center">8</td>
<td valign="top" align="center">105</td>
</tr>
</tbody>
</table>
</table-wrap>
<list list-type="bullet">
<list-item><p><bold>20News</bold>: The 20 Newsgroup dataset<xref ref-type="fn" rid="fn0003"><sup>3</sup></xref> contains newsgroup documents partitioned nearly evenly across 20 different newsgroups. We follow the same train/test split of the &#x0201C;bydate&#x0201D; version.</p></list-item>
<list-item><p><bold>Reuters</bold>: We use the 8-class version of the Reuters-21578 dataset<xref ref-type="fn" rid="fn0004"><sup>4</sup></xref> following (Kusner et al., <xref ref-type="bibr" rid="B11">2015</xref>; Xun et al., <xref ref-type="bibr" rid="B29">2017</xref>) with the same train/test split as described in Sebastiani (<xref ref-type="bibr" rid="B25">2002</xref>).</p></list-item>
</list>
</sec>
<sec>
<title>6.2. Baselines and Ablations</title>
<p>We compare our models with the following baseline methods:</p>
<list list-type="bullet">
<list-item><p><bold>Skip-Gram</bold> (Mikolov et al., <xref ref-type="bibr" rid="B23">2013b</xref>) and <bold>CBOW</bold> (Mikolov et al., <xref ref-type="bibr" rid="B22">2013a</xref>): The two models of the word2vec<xref ref-type="fn" rid="fn0005"><sup>5</sup></xref> framework. <bold>Skip-Gram</bold> uses the center word to predict its local context, and <bold>CBOW</bold> uses local context to predict the center word.</p></list-item>
<list-item><p><bold>GloVe</bold> (Pennington et al., <xref ref-type="bibr" rid="B24">2014</xref>): <bold>GloVe</bold><xref ref-type="fn" rid="fn0006"><sup>6</sup></xref> learns word embedding by factorizing a global word-word co-occurrence matrix where the co-occurrence is defined upon a fixed-size context window.</p></list-item>
<list-item><p><bold>DM</bold> and <bold>DBOW</bold> (Le and Mikolov, <xref ref-type="bibr" rid="B13">2014</xref>): The two models of the Doc2Vec<xref ref-type="fn" rid="fn0007"><sup>7</sup></xref> framework. <bold>DM</bold> uses the concatenation of word embeddings and document embedding to predict the next word, and <bold>DBOW</bold> uses the document embedding to predict the words in a window. Although Doc2Vec was originally designed for learning paragraph/document representations, it also learns word embeddings simultaneously. We evaluate the word embedding trained by Doc2Vec.</p></list-item>
<list-item><p><bold>HSMN</bold> (Huang et al., <xref ref-type="bibr" rid="B9">2012</xref>): <bold>HSMN</bold><xref ref-type="fn" rid="fn0008"><sup>8</sup></xref> uses both local and global contexts to predict the next word in the sequence. The local context representation is obtained by concatenating the embeddings of the words preceding the next word, and the global context representation is simply the weighted average of all word embeddings in the document.</p></list-item>
<list-item><p><bold>PTE</bold> (Tang et al., <xref ref-type="bibr" rid="B26">2015</xref>): Predictive Text Embedding (<bold>PTE</bold>)<xref ref-type="fn" rid="fn0009"><sup>9</sup></xref> constructs heterogeneous networks that encode word-word and word-document co-occurrences as well as class label information. It is originally trained in a semi-supervised setting (i.e., labeled documents are required). We adapt it to the unsupervised setting by pruning its word-label network.</p></list-item>
<list-item><p><bold>TWE</bold> (Liu et al., <xref ref-type="bibr" rid="B15">2015</xref>): Topical word embedding (<bold>TWE</bold>)<xref ref-type="fn" rid="fn0010"><sup>10</sup></xref> has three models for incorporating topical information into word embedding with the help of topic modeling. <bold>TWE</bold> requires prior knowledge about the number of latent topics in the corpus and we provide it with the correct number of classes of the corresponding corpus. We run all three models of <bold>TWE</bold> and report the best performance.</p></list-item>
</list>
<p>We compare our models with the following ablations:</p>
<list list-type="bullet">
<list-item><p><bold>Concat Skip-Gram</bold> and <bold>Concat CBOW</bold>: The ablation of <sans-serif>Joint Skip-Gram</sans-serif> and <sans-serif>Joint CBOW</sans-serif>, respectively. We train <sans-serif>Joint Skip-Gram</sans-serif> and <sans-serif>Joint CBOW</sans-serif> twice with &#x003BB; &#x0003D; 0 (only local context is captured) and &#x003BB; &#x0003D; &#x0221E; (only global context is captured). Then we concatenate the two embeddings so that the resulting embedding contains both local and global context information, but with two types of contexts trained independently. For fair comparison, the embedding dimension of &#x003BB; &#x0003D; 0 and &#x003BB; &#x0003D; &#x0221E; cases is set to be <inline-formula><mml:math id="M55"><mml:mfrac><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:math></inline-formula> so that the embedding dimension after concatenation is <italic>p</italic>, equal to that of <sans-serif>Joint Skip-Gram</sans-serif> and <sans-serif>Joint CBOW</sans-serif>.</p></list-item>
</list>
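<p>The <bold>Concat</bold> ablation above can be sketched as follows; the two half-dimension matrices are random placeholders standing in for embeddings trained with &#x003BB; &#x0003D; 0 and &#x003BB; &#x0003D; &#x0221E;:</p>

```python
import numpy as np

p = 100                      # final embedding dimension
V = 1000                     # vocabulary size
rng = np.random.default_rng(1)

# Stand-ins for two independently trained embeddings: one with lambda = 0
# (local context only) and one with lambda = inf (global context only),
# each of dimension p/2.
emb_local = rng.normal(size=(V, p // 2))
emb_global = rng.normal(size=(V, p // 2))

# The Concat ablation concatenates them, giving dimension p -- the same
# as Joint Skip-Gram / Joint CBOW, so the comparison is fair; the two
# context types are captured but trained independently.
emb_concat = np.concatenate([emb_local, emb_global], axis=1)
assert emb_concat.shape == (V, p)
```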
</sec>
<sec>
<title>6.3. Implementation Details and Settings</title>
<p>Because the full softmax in Equations (2), (4), (7), and (9) results in computational complexity proportional to the vocabulary size, we adopt the negative sampling strategy (Mikolov et al., <xref ref-type="bibr" rid="B23">2013b</xref>) for efficient approximation.</p>
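<p>As a sketch of the negative sampling approximation (Mikolov et al., 2013b), each (center, context) pair is scored against <italic>k</italic> sampled negatives instead of the full softmax; the vectors below are random placeholders:</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(u_center, v_context, v_negatives):
    """Negative-sampling objective for one (center, context) pair:
    -log sigma(u^T v) - sum_k log sigma(-u^T v_k)."""
    pos = np.log(sigmoid(u_center @ v_context))
    neg = np.sum(np.log(sigmoid(-v_negatives @ u_center)))
    return -(pos + neg)

rng = np.random.default_rng(0)
u = rng.normal(size=4)
v = rng.normal(size=4)
negs = rng.normal(size=(5, 4))   # k = 5 negative samples, matching section 6.3
loss = neg_sampling_loss(u, v, negs)
assert loss > 0                  # both terms are negative log-probabilities
```

<p>The cost per training pair is proportional to <italic>k</italic> rather than to the vocabulary size, which is what makes this approximation efficient.</p>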
<p>We first pre-process the corpus by removing infrequent words that appear fewer than 5 times in the corpus. For fair comparison, we set the hyperparameters as below for all methods: word embedding dimension<xref ref-type="fn" rid="fn0011"><sup>11</sup></xref> <italic>p</italic> &#x0003D; 100, local context window size <italic>h</italic> &#x0003D; 5, number of negative samples <italic>k</italic> &#x0003D; 5, and number of training iterations over the corpus <italic>iter</italic> &#x0003D; 10. Other parameters (if any) are set to the default values of the corresponding algorithm. Our method has an additional hyperparameter &#x003BB; that balances the importance of local and global contexts. We empirically find &#x003BB; &#x0003D; 1.5 to be the optimal choice in general, so we report the performance of our models with &#x003BB; &#x0003D; 1.5 for all tests.</p>
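<p>The pre-processing step above (dropping words with corpus frequency below 5) can be sketched as:</p>

```python
from collections import Counter

def filter_infrequent(tokens, min_count=5):
    """Drop words appearing fewer than min_count times in the corpus."""
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] >= min_count]

# Toy corpus: "rare" occurs only twice, so it is removed.
corpus = ["the"] * 6 + ["model"] * 5 + ["rare"] * 2
kept = filter_infrequent(corpus)
assert "rare" not in kept and "model" in kept
```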
</sec>
<sec>
<title>6.4. Word Similarity Evaluation</title>
<p>In the first set of evaluations, we are interested in how well the word embedding captures similarity between word pairs. We use the following test datasets for evaluation: WordSim-353 (Finkelstein et al., <xref ref-type="bibr" rid="B6">2001</xref>), MEN (Bruni et al., <xref ref-type="bibr" rid="B3">2014</xref>), and SimLex-999 (Hill et al., <xref ref-type="bibr" rid="B7">2015</xref>). These datasets contain word pairs with human-assigned similarity scores. We first train word embedding on the <bold>20News</bold> dataset<xref ref-type="fn" rid="fn0012"><sup>12</sup></xref>, and then rank word pairs according to their cosine similarity in the embedding space. Finally, we compare the ranking given by the word embedding with the ranking given by human ratings. We use both Spearman&#x00027;s rank correlation &#x003C1; and Kendall&#x00027;s rank correlation &#x003C4; as measures, with out-of-vocabulary word pairs excluded from the test sets.</p>
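<p>A minimal sketch of this evaluation protocol, using hypothetical toy embeddings and ratings in place of the trained vectors and the benchmark data:</p>

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical embeddings for a few words, plus human similarity ratings
# for word pairs, standing in for WordSim-353-style benchmark data.
emb = {
    "cat": np.array([1.0, 0.2, 0.0]),
    "dog": np.array([0.9, 0.3, 0.1]),
    "car": np.array([0.0, 1.0, 0.8]),
    "bus": np.array([0.1, 0.9, 0.9]),
}
pairs = [("cat", "dog"), ("cat", "car"), ("car", "bus")]
human = [9.0, 2.0, 8.5]                       # human similarity ratings

model = [cosine(emb[a], emb[b]) for a, b in pairs]
rho, _ = spearmanr(human, model)              # Spearman's rank correlation
tau, _ = kendalltau(human, model)             # Kendall's rank correlation
```

<p>Out-of-vocabulary pairs would simply be skipped before building the two ranked lists, as described above.</p>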
<p>The word similarity evaluation results are shown in <xref ref-type="table" rid="T3">Table 3</xref>. We observe that <sans-serif>Joint Skip-Gram</sans-serif> and <sans-serif>Joint CBOW</sans-serif> achieve the best performance under both metrics across all three test sets. The fact that <sans-serif>Joint Skip-Gram</sans-serif> and <sans-serif>Joint CBOW</sans-serif> outperform <bold>Skip-Gram</bold>, <bold>CBOW</bold>, and <bold>GloVe</bold> demonstrates that by capturing global context in addition to local context, our model is able to rank word similarity more concordantly with human ratings. Comparing <sans-serif>Joint Skip-Gram</sans-serif> and <sans-serif>Joint CBOW</sans-serif> with <bold>DM</bold>, <bold>DBOW</bold>, and <bold>PTE</bold>, we show that our models are more effective in leveraging global context to capture word similarity. Our models also outperform <bold>HSMN</bold>, <bold>TWE</bold>, <bold>Concat Skip-Gram</bold>, and <bold>Concat CBOW</bold>, demonstrating the benefit of jointly incorporating local and global contexts.</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Word similarity evaluation.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left"><bold>Methods</bold></th>
<th valign="top" align="center" colspan="2" style="border-bottom: thin solid #000000;"><bold>WordSim-353</bold></th>
<th valign="top" align="center" colspan="2" style="border-bottom: thin solid #000000;"><bold>MEN</bold></th>
<th valign="top" align="center" colspan="2" style="border-bottom: thin solid #000000;"><bold>SimLex-999</bold></th>
</tr>
<tr>
<th/>
<th valign="top" align="center"><bold>&#x003C1;</bold></th>
<th valign="top" align="center"><bold>&#x003C4;</bold></th>
<th valign="top" align="center"><bold>&#x003C1;</bold></th>
<th valign="top" align="center"><bold>&#x003C4;</bold></th>
<th valign="top" align="center"><bold>&#x003C1;</bold></th>
<th valign="top" align="center"><bold>&#x003C4;</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left"><bold>Skip-Gram</bold></td>
<td valign="top" align="center">0.430</td>
<td valign="top" align="center">0.293</td>
<td valign="top" align="center">0.303</td>
<td valign="top" align="center">0.206</td>
<td valign="top" align="center">0.153</td>
<td valign="top" align="center">0.104</td>
</tr>
<tr>
<td valign="top" align="left"><bold>CBOW</bold></td>
<td valign="top" align="center">0.410</td>
<td valign="top" align="center">0.284</td>
<td valign="top" align="center">0.349</td>
<td valign="top" align="center">0.241</td>
<td valign="top" align="center">0.109</td>
<td valign="top" align="center">0.074</td>
</tr>
<tr>
<td valign="top" align="left"><bold>GloVe</bold></td>
<td valign="top" align="center">0.207</td>
<td valign="top" align="center">0.140</td>
<td valign="top" align="center">0.196</td>
<td valign="top" align="center">0.134</td>
<td valign="top" align="center">0.042</td>
<td valign="top" align="center">0.028</td>
</tr>
<tr>
<td valign="top" align="left"><bold>DBOW</bold></td>
<td valign="top" align="center">0.378</td>
<td valign="top" align="center">0.257</td>
<td valign="top" align="center">0.341</td>
<td valign="top" align="center">0.234</td>
<td valign="top" align="center">0.116</td>
<td valign="top" align="center">0.078</td>
</tr>
<tr>
<td valign="top" align="left"><bold>DM</bold></td>
<td valign="top" align="center">0.367</td>
<td valign="top" align="center">0.254</td>
<td valign="top" align="center">0.305</td>
<td valign="top" align="center">0.209</td>
<td valign="top" align="center">0.116</td>
<td valign="top" align="center">0.079</td>
</tr>
<tr>
<td valign="top" align="left"><bold>HSMN</bold></td>
<td valign="top" align="center">0.103</td>
<td valign="top" align="center">0.070</td>
<td valign="top" align="center">0.146</td>
<td valign="top" align="center">0.100</td>
<td valign="top" align="center">0.027</td>
<td valign="top" align="center">0.018</td>
</tr>
<tr>
<td valign="top" align="left"><bold>PTE</bold></td>
<td valign="top" align="center">0.312</td>
<td valign="top" align="center">0.209</td>
<td valign="top" align="center">0.177</td>
<td valign="top" align="center">0.120</td>
<td valign="top" align="center">0.162</td>
<td valign="top" align="center">0.108</td>
</tr>
<tr>
<td valign="top" align="left"><bold>TWE</bold></td>
<td valign="top" align="center">0.227</td>
<td valign="top" align="center">0.155</td>
<td valign="top" align="center">0.210</td>
<td valign="top" align="center">0.144</td>
<td valign="top" align="center">0.140</td>
<td valign="top" align="center">0.093</td>
</tr>
<tr>
<td valign="top" align="left"><bold>Concat Skip-Gram</bold></td>
<td valign="top" align="center">0.369</td>
<td valign="top" align="center">0.248</td>
<td valign="top" align="center">0.324</td>
<td valign="top" align="center">0.221</td>
<td valign="top" align="center">0.163</td>
<td valign="top" align="center">0.111</td>
</tr>
<tr>
<td valign="top" align="left"><bold>Concat CBOW</bold></td>
<td valign="top" align="center">0.413</td>
<td valign="top" align="center">0.283</td>
<td valign="top" align="center">0.350</td>
<td valign="top" align="center">0.240</td>
<td valign="top" align="center">0.110</td>
<td valign="top" align="center">0.073</td>
</tr>
<tr>
<td valign="top" align="left"><sans-serif>Joint Skip-Gram</sans-serif></td>
<td valign="top" align="center">0.464</td>
<td valign="top" align="center">0.319</td>
<td valign="top" align="center"><bold>0.375</bold></td>
<td valign="top" align="center"><bold>0.256</bold></td>
<td valign="top" align="center">0.181</td>
<td valign="top" align="center">0.121</td>
</tr>
<tr>
<td valign="top" align="left"><sans-serif>Joint CBOW</sans-serif></td>
<td valign="top" align="center"><bold>0.473</bold></td>
<td valign="top" align="center"><bold>0.326</bold></td>
<td valign="top" align="center">0.374</td>
<td valign="top" align="center"><bold>0.256</bold></td>
<td valign="top" align="center"><bold>0.192</bold></td>
<td valign="top" align="center"><bold>0.131</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>Bold values denote the best performance among all methods</italic>.</p>
</table-wrap-foot>
</table-wrap>
</sec>
<sec>
<title>6.5. Text Classification Evaluation</title>
<p>In the second set of evaluations, we use a classical downstream NLP task, text classification, to evaluate word embedding quality. For each of the two datasets described in section 6.1, we train a one-vs-rest logistic regression classifier on the training set and apply it to the test set. Document features are obtained by averaging all word embedding vectors in the document, where the word embedding is trained on the training set of the corresponding dataset. We use Micro-F1 and Macro-F1 scores as classification metrics, as in (Meng et al., <xref ref-type="bibr" rid="B20">2018</xref>, <xref ref-type="bibr" rid="B21">2019c</xref>).</p>
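<p>The classifier-independent steps of this pipeline, averaging word vectors into a document feature and computing Macro-F1/Micro-F1, can be sketched as follows (a minimal illustrative sketch with toy inputs, not the evaluation code used in our experiments):</p>

```python
from collections import Counter

def average_embedding(tokens, emb):
    """Document feature: the mean of the embedding vectors of the
    document's in-vocabulary words (zeros if none are in vocabulary)."""
    dim = len(next(iter(emb.values())))
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def f1_scores(y_true, y_pred, labels):
    """Macro-F1: unweighted mean of per-class F1 scores.
    Micro-F1: F1 over pooled counts; for single-label multi-class
    predictions this equals accuracy."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    per_class = []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        per_class.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    macro = sum(per_class) / len(labels)
    micro = sum(tp.values()) / len(y_true)
    return macro, micro
```

<p>For single-label multi-class predictions, Micro-F1 reduces to accuracy, while Macro-F1 weights rare classes equally; this is why the two metrics can diverge sharply on imbalanced datasets such as <bold>Reuters</bold>.</p>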
<p>The text classification performance is reported in <xref ref-type="table" rid="T4">Table 4</xref>. In all cases, the best performance is achieved by either <sans-serif>Joint Skip-Gram</sans-serif> or <sans-serif>Joint CBOW</sans-serif>. <sans-serif>Joint Skip-Gram</sans-serif> and <sans-serif>Joint CBOW</sans-serif> give consistently better results than <bold>Skip-Gram</bold> and <bold>CBOW</bold>, respectively, showing that global context enriches word embedding with topical semantics that benefit text classification. Beyond the fact that our joint models achieve state-of-the-art performance as unsupervised word embeddings for text classification, another interesting finding is that <bold>Concat Skip-Gram</bold> and <bold>Concat CBOW</bold> are fairly strong baselines (outperforming <bold>Skip-Gram</bold> and <bold>CBOW</bold>) but are always inferior to <sans-serif>Joint Skip-Gram</sans-serif> and <sans-serif>Joint CBOW</sans-serif>. This indicates that combining local and global contexts indeed improves word embedding quality for classification tasks, but how the two types of contexts are incorporated also matters: training jointly on local and global contexts is more effective than training independently on each context and then concatenating the resulting word embeddings.</p>
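<p>For clarity, the <bold>Concat</bold> baselines can be sketched as follows: two embedding tables are trained independently on local and global contexts, then concatenated per word. The L2-normalization of each half here is an illustrative assumption, not necessarily the exact post-processing used.</p>

```python
def concat_embeddings(local_emb, global_emb):
    """Post-hoc combination baseline: for each word present in both
    vocabularies, concatenate its independently trained local-context
    and global-context vectors. Each half is L2-normalized first so
    that neither half dominates cosine similarity (an illustrative
    choice)."""
    def l2norm(v):
        n = sum(x * x for x in v) ** 0.5
        return [x / n for x in v] if n else list(v)
    shared = set(local_emb) & set(global_emb)
    return {w: l2norm(local_emb[w]) + l2norm(global_emb[w]) for w in shared}
```

<p>Because the two halves are trained independently, no interaction between local and global signals is captured during training, which is consistent with the <bold>Concat</bold> variants trailing the joint models above.</p>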
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Text classification evaluation.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left"><bold>Methods</bold></th>
<th valign="top" align="center" colspan="2" style="border-bottom: thin solid #000000;"><bold>20News</bold></th>
<th valign="top" align="center" colspan="2" style="border-bottom: thin solid #000000;"><bold>Reuters</bold></th>
</tr>
<tr>
<th/>
<th valign="top" align="center"><bold>Macro-F1</bold></th>
<th valign="top" align="center"><bold>Micro-F1</bold></th>
<th valign="top" align="center"><bold>Macro-F1</bold></th>
<th valign="top" align="center"><bold>Micro-F1</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left"><bold>Skip-gram</bold></td>
<td valign="top" align="center">0.681</td>
<td valign="top" align="center">0.699</td>
<td valign="top" align="center">0.750</td>
<td valign="top" align="center">0.953</td>
</tr>
<tr>
<td valign="top" align="left"><bold>CBOW</bold></td>
<td valign="top" align="center">0.653</td>
<td valign="top" align="center">0.668</td>
<td valign="top" align="center">0.866</td>
<td valign="top" align="center">0.965</td>
</tr>
<tr>
<td valign="top" align="left"><bold>GloVe</bold></td>
<td valign="top" align="center">0.526</td>
<td valign="top" align="center">0.548</td>
<td valign="top" align="center">0.725</td>
<td valign="top" align="center">0.944</td>
</tr>
<tr>
<td valign="top" align="left"><bold>DBOW</bold></td>
<td valign="top" align="center">0.687</td>
<td valign="top" align="center">0.703</td>
<td valign="top" align="center">0.796</td>
<td valign="top" align="center">0.950</td>
</tr>
<tr>
<td valign="top" align="left"><bold>DM</bold></td>
<td valign="top" align="center">0.594</td>
<td valign="top" align="center">0.610</td>
<td valign="top" align="center">0.837</td>
<td valign="top" align="center">0.955</td>
</tr>
<tr>
<td valign="top" align="left"><bold>HSMN</bold></td>
<td valign="top" align="center">0.385</td>
<td valign="top" align="center">0.431</td>
<td valign="top" align="center">0.200</td>
<td valign="top" align="center">0.736</td>
</tr>
<tr>
<td valign="top" align="left"><bold>PTE</bold></td>
<td valign="top" align="center">0.700</td>
<td valign="top" align="center">0.718</td>
<td valign="top" align="center">0.776</td>
<td valign="top" align="center">0.957</td>
</tr>
<tr>
<td valign="top" align="left"><bold>TWE</bold></td>
<td valign="top" align="center">0.608</td>
<td valign="top" align="center">0.632</td>
<td valign="top" align="center">0.616</td>
<td valign="top" align="center">0.916</td>
</tr>
<tr>
<td valign="top" align="left"><bold>Concat Skip-Gram</bold></td>
<td valign="top" align="center">0.759</td>
<td valign="top" align="center">0.772</td>
<td valign="top" align="center">0.764</td>
<td valign="top" align="center">0.958</td>
</tr>
<tr>
<td valign="top" align="left"><bold>Concat CBOW</bold></td>
<td valign="top" align="center">0.680</td>
<td valign="top" align="center">0.695</td>
<td valign="top" align="center">0.873</td>
<td valign="top" align="center">0.961</td>
</tr>
<tr>
<td valign="top" align="left"><sans-serif>Joint Skip-Gram</sans-serif></td>
<td valign="top" align="center"><bold>0.773</bold></td>
<td valign="top" align="center"><bold>0.785</bold></td>
<td valign="top" align="center">0.854</td>
<td valign="top" align="center">0.962</td>
</tr>
<tr>
<td valign="top" align="left"><sans-serif>Joint CBOW</sans-serif></td>
<td valign="top" align="center">0.736</td>
<td valign="top" align="center">0.753</td>
<td valign="top" align="center"><bold>0.885</bold></td>
<td valign="top" align="center"><bold>0.966</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>Bold values denote the best performance among all methods</italic>.</p>
</table-wrap-foot>
</table-wrap>
</sec>
<sec>
<title>6.6. Parameter Study</title>
<p>In the previous subsections, we fixed &#x003BB; for both the <sans-serif>Joint Skip-Gram</sans-serif> and <sans-serif>Joint CBOW</sans-serif> models across all evaluation tasks. In this subsection, we explore the trade-off between local and global contexts in embedding learning. Specifically, we vary &#x003BB; for the <sans-serif>Joint Skip-Gram</sans-serif> and <sans-serif>Joint CBOW</sans-serif> models over the range [0, 3] at intervals of 0.5, plus &#x003BB; &#x0003D; &#x0221E; (shown as horizontal dotted lines), and conduct word similarity evaluation on the WordSim-353 dataset and text classification evaluation on the <bold>20News</bold> dataset. The performances under different &#x003BB;&#x00027;s for both models are shown in <xref ref-type="fig" rid="F3">Figure 3</xref>. We observe that the optimal setting for both models is &#x003BB; &#x0003D; 1.5 for word similarity and &#x003BB; &#x0003D; 2.0 for text classification. This verifies our argument that combining both types of contexts achieves the best performance.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Hyperparameter study on word similarity <bold>(left)</bold> and text classification <bold>(right)</bold>.</p></caption>
<graphic xlink:href="fdata-03-00009-g0003.tif"/>
</fig>
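<p>This parameter study amounts to a one-dimensional grid search over &#x003BB;; a minimal sketch, where train_fn and eval_fn are hypothetical placeholders for training a joint model with a given &#x003BB; and scoring the resulting embedding on a validation task:</p>

```python
def sweep_lambda(train_fn, eval_fn,
                 lambdas=(0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0)):
    """Grid search over the local/global trade-off weight lambda.
    train_fn(lam) returns an embedding trained with weight lam;
    eval_fn(emb) returns a validation score. Both are placeholders
    standing in for the actual training and evaluation code."""
    scores = {lam: eval_fn(train_fn(lam)) for lam in lambdas}
    best = max(scores, key=scores.get)
    return best, scores[best]
```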
</sec>
<sec>
<title>6.7. Running Time Study</title>
<p>We report the per-iteration training time of all methods on the <bold>20News</bold> dataset in <xref ref-type="table" rid="T5">Table 5</xref> to compare training efficiency. All models are run on a machine with 20 cores of Intel(R) Xeon(R) CPU E5-2680 v2 &#x00040; 2.80 GHz. <sans-serif>Joint Skip-Gram</sans-serif> and <sans-serif>Joint CBOW</sans-serif> have training times similar to their original counterparts and are more efficient than all other baselines.</p>
<table-wrap position="float" id="T5">
<label>Table 5</label>
<caption><p>Running time evaluation on <bold>20News</bold> dataset.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left"><bold>Methods</bold></th>
<th valign="top" align="center"><bold>Skip-Gram</bold></th>
<th valign="top" align="center"><bold>CBOW</bold></th>
<th valign="top" align="center"><bold>GloVe</bold></th>
<th valign="top" align="center"><bold>DBOW</bold></th>
<th valign="top" align="center"><bold>DM</bold></th>
<th valign="top" align="center"><bold>HSMN</bold></th>
<th valign="top" align="center"><bold>PTE</bold></th>
<th valign="top" align="center"><bold>TWE</bold></th>
<th valign="top" align="center"><bold>JSG</bold></th>
<th valign="top" align="center"><bold>JCBOW</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Running time (s)</td>
<td valign="top" align="center">29.8</td>
<td valign="top" align="center">24.7</td>
<td valign="top" align="center">31.2</td>
<td valign="top" align="center">41.9</td>
<td valign="top" align="center">35.7</td>
<td valign="top" align="center">44.6</td>
<td valign="top" align="center">48.8</td>
<td valign="top" align="center">&#x0003E;1,000</td>
<td valign="top" align="center">30.1</td>
<td valign="top" align="center">25.4</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec>
<title>6.8. Case Studies</title>
<p>In this subsection, we perform a set of case studies to understand the properties of our models and why incorporating both local and global contexts leads to better word embedding. We conduct all the case studies on the <bold>20News</bold> dataset unless stated otherwise.</p>
<sec>
<title>6.8.1. Effect of Global Context</title>
<p>We are interested in why and how global context helps capture more complete word semantics. We set &#x003BB; &#x0003D; &#x0221E; and &#x003BB; &#x0003D; 0 in Equation (10) so that the embedding trained by <sans-serif>Joint Skip-Gram</sans-serif> captures only the global or only the local context of words. We select a set of acronyms (e.g., <italic>CMU</italic> stands for Carnegie Mellon University) and use their embeddings to retrieve the most similar words (measured by cosine similarity in the embedding space). In <xref ref-type="table" rid="T6">Table 6</xref>, we list five university acronyms and show the top words retrieved by the embedding trained with only global context and with only local context, respectively. We observe that the local context embedding retrieves nothing meaningfully related to the acronyms, while the global context embedding successfully finds the original word components of the acronyms. The reason is that each original word component usually does not share similar local context with the acronym (e.g., <italic>CMU</italic> and the single word &#x0201C;Carnegie&#x0201D; obviously have different surrounding words) despite their semantic similarity. However, the original word components and the acronyms usually appear in the same or similar documents, resulting in higher global context similarity. The insight gained from this case study generalizes to other cases where words are semantically similar but syntactically dissimilar: global context is effective in discovering the semantic and topical similarity of words without enforcing syntactic similarity.</p>
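<p>The retrieval used in this case study ranks the vocabulary by cosine similarity to the query embedding; a minimal sketch with toy vectors (the words and values below are illustrative, not trained embeddings):</p>

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def most_similar(query, emb, k=5):
    """Return the k words most cosine-similar to the query word."""
    q = emb[query]
    scored = [(w, cosine(q, v)) for w, v in emb.items() if w != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [w for w, _ in scored[:k]]
```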
<table-wrap position="float" id="T6">
<label>Table 6</label>
<caption><p>Effect of global context on interpreting acronyms.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left"><bold>Acronyms</bold></th>
<th valign="top" align="left"><bold>Global (<bold>&#x003BB; &#x0003D; &#x0221E;</bold>)</bold></th>
<th valign="top" align="left"><bold>Local (<bold>&#x003BB; &#x0003D; 0</bold>)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">CMU</td>
<td valign="top" align="left"><bold>mellon</bold>, <bold>carnegie</bold>, andrew, pa, pittsburgh</td>
<td valign="top" align="left">andrew, kfnjyea00uh, am2x, mr47, devineni</td>
</tr>
<tr>
<td valign="top" align="left">UIUC</td>
<td valign="top" align="left"><bold>urbana</bold>, <bold>illinois</bold>, uxa, <bold>univ</bold>, uchicago</td>
<td valign="top" align="left">uxa, ux4, ux1, mrcnext, cka52397</td>
</tr>
<tr>
<td valign="top" align="left">UNC</td>
<td valign="top" align="left"><bold>chapel</bold>, <bold>carolina</bold>, astro, images, usc</td>
<td valign="top" align="left">launchpad, gibbs, umr, lambada, jge</td>
</tr>
<tr>
<td valign="top" align="left">Caltech</td>
<td valign="top" align="left"><bold>california</bold>, gap, <bold>institute</bold>, keith, <bold>technology</bold></td>
<td valign="top" align="left">juliet, jafoust, lmh, henling, bdunn</td>
</tr>
<tr>
<td valign="top" align="left">JHU</td>
<td valign="top" align="left"><bold>johns</bold>, camp, <bold>hopkins</bold>, nation, grand</td>
<td valign="top" align="left">pablo, hasch, iglesias, davidk, atlantis</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>Bold words are the original word components of the acronyms</italic>.</p>
</table-wrap-foot>
</table-wrap>
</sec>
<sec>
<title>6.8.2. Different Contexts Capture Different Aspects of Word Similarity</title>
<p>Word similarity has different aspects: words can be semantically similar but syntactically dissimilar, and vice versa. For example, antonyms have opposite semantics (e.g., <italic>good</italic> vs. <italic>bad</italic>) but are syntactically similar and may occur with similar short surrounding contexts. We list a set of antonyms and report the cosine similarity of their embeddings when different types of context are captured by <sans-serif>Joint Skip-Gram</sans-serif> (Global, &#x003BB; &#x0003D; &#x0221E;; Local, &#x003BB; &#x0003D; 0), as shown in <xref ref-type="table" rid="T7">Table 7</xref>.</p>
<table-wrap position="float" id="T7">
<label>Table 7</label>
<caption><p>Cosine similarity of antonym embeddings trained with different contexts.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left"><bold>Antonyms</bold></th>
<th valign="top" align="center"><bold>Global (<bold>&#x003BB; &#x0003D; &#x0221E;</bold>)</bold></th>
<th valign="top" align="center"><bold>Local (<bold>&#x003BB; &#x0003D; 0</bold>)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Good&#x02014;bad</td>
<td valign="top" align="center">0.3150</td>
<td valign="top" align="center">0.7127</td>
</tr>
<tr>
<td valign="top" align="left">Happy&#x02014;unhappy</td>
<td valign="top" align="center">0.3911</td>
<td valign="top" align="center">0.6178</td>
</tr>
<tr>
<td valign="top" align="left">Large&#x02014;small</td>
<td valign="top" align="center">0.4871</td>
<td valign="top" align="center">0.7265</td>
</tr>
<tr>
<td valign="top" align="left">Increase&#x02014;decrease</td>
<td valign="top" align="center">0.2663</td>
<td valign="top" align="center">0.7308</td>
</tr>
<tr>
<td valign="top" align="left">Enter&#x02014;exit</td>
<td valign="top" align="center">0.2756</td>
<td valign="top" align="center">0.5553</td>
</tr>
<tr>
<td valign="top" align="left">Save&#x02014;spend</td>
<td valign="top" align="center">&#x02212;0.0388</td>
<td valign="top" align="center">0.4792</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>It can be observed that all antonym pairs have high cosine similarity when only local context is captured (&#x003BB; &#x0003D; 0). In contrast, antonym embeddings trained on global context only (&#x003BB; &#x0003D; &#x0221E;) have relatively low cosine similarity. These results verify our intuition that local context focuses more on syntactic similarity, while global context emphasizes the semantic or topical similarity of words. Our joint model strikes a balance between local and global contexts, and thus reflects both syntactic and semantic aspects of word similarity.</p>
</sec>
<sec>
<title>6.8.3. Global Context Embedding Quality</title>
<p>In the third set of case studies, we qualitatively evaluate the global context embedding by visualizing document vectors together with word embeddings. We select five documents from five different topics of the <bold>20News</bold> dataset, and then select several topically related words for each document. The five topics are: <italic>electric, automobiles, guns, christian</italic>, and <italic>graphics</italic>. We apply t-SNE (van der Maaten and Hinton, <xref ref-type="bibr" rid="B28">2008</xref>) to visualize both the document and word embeddings in <xref ref-type="fig" rid="F4">Figure 4</xref>, where green stars represent document embeddings and red dots represent word embeddings. Documents are indeed embedded close to their topically related words, implying that global context embeddings appropriately encode topical semantic information, which in turn benefits word embedding learning.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Word and document embedding visualization.</p></caption>
<graphic xlink:href="fdata-03-00009-g0004.tif"/>
</fig>
</sec>
<sec>
<title>6.8.4. Weakly-Supervised Text Classification</title>
<p>In the previous case study, we showed that word embeddings and document embeddings can be jointly trained in an unsupervised manner. It is then natural to consider performing text classification without labeled documents. When only weak supervision, such as class surface names (e.g., <italic>politics, sports</italic>), is available, unsupervised word embedding quality becomes essential for text classification because there is no labeled data for fine-tuning the word embedding. WeSTClass (Meng et al., <xref ref-type="bibr" rid="B20">2018</xref>, <xref ref-type="bibr" rid="B21">2019c</xref>) models class semantics as vMF distributions in the word embedding space and applies a pretrain-refine neural approach to perform text classification under weak supervision. Doc2Cube (Tao et al., <xref ref-type="bibr" rid="B27">2018</xref>) leverages word-document co-occurrences to embed class labels, words, and documents in the same space and performs classification by comparing embedding similarity. We adopt the two frameworks and replace their original embeddings with those trained by our <sans-serif>Joint Skip-Gram</sans-serif> and <sans-serif>Joint CBOW</sans-serif> models. We perform weakly-supervised text classification on the training set of <bold>Reuters</bold> with class names as the only supervision and report the Macro-F1 and Micro-F1 scores in <xref ref-type="table" rid="T8">Table 8</xref>.</p>
<table-wrap position="float" id="T8">
<label>Table 8</label>
<caption><p>Weakly-supervised text classification on Reuters.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left"><bold>Methods</bold></th>
<th valign="top" align="center"><bold>Macro-F1</bold></th>
<th valign="top" align="center"><bold>Micro-F1</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">WeSTClass</td>
<td valign="top" align="center">0.554</td>
<td valign="top" align="center">0.593</td>
</tr>
<tr>
<td valign="top" align="left">Doc2Cube</td>
<td valign="top" align="center">0.435</td>
<td valign="top" align="center">0.446</td>
</tr>
<tr>
<td valign="top" align="left">Doc2Cube w/<sans-serif>Joint Skip-Gram</sans-serif></td>
<td valign="top" align="center">0.585</td>
<td valign="top" align="center">0.717</td>
</tr>
<tr>
<td valign="top" align="left">Doc2Cube w/<sans-serif>Joint CBOW</sans-serif></td>
<td valign="top" align="center">0.570</td>
<td valign="top" align="center">0.700</td>
</tr>
<tr>
<td valign="top" align="left">WeSTClass w/<sans-serif>Joint Skip-Gram</sans-serif></td>
<td valign="top" align="center"><bold>0.717</bold></td>
<td valign="top" align="center"><bold>0.801</bold></td>
</tr>
<tr>
<td valign="top" align="left">WeSTClass w/<sans-serif>Joint CBOW</sans-serif></td>
<td valign="top" align="center">0.691</td>
<td valign="top" align="center">0.698</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>Bold values denote the best performance among all methods</italic>.</p>
</table-wrap-foot>
</table-wrap>
<p>We show that reasonably good text classification performance can be achieved even without labeled documents, by fully leveraging the context information to capture more complete semantics in word embedding.</p>
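<p>To make the similarity-based assignment concrete, a heavily simplified sketch is given below: it labels a document with the class whose surface-name embedding is closest to the document&#x00027;s average word vector. This toy version omits most of what WeSTClass and Doc2Cube actually do.</p>

```python
import math

def classify_by_name(doc_tokens, word_emb, class_names):
    """Simplified sketch of similarity-based weakly-supervised
    classification: assign the document to the class whose surface-name
    embedding has the highest cosine similarity to the document's
    average word vector."""
    dim = len(next(iter(word_emb.values())))
    vecs = [word_emb[t] for t in doc_tokens if t in word_emb]
    doc = ([sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
           if vecs else [0.0] * dim)

    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    return max(class_names, key=lambda c: cos(doc, word_emb[c]))
```

<p>Since no labels are involved, the quality of the assignment rests entirely on how well the embedding places class names near topically related words, which is exactly where joint local-global training helps.</p>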
</sec>
</sec>
</sec>
<sec id="s7">
<title>7. Discussions</title>
<p>In this section, we discuss several open issues and interesting directions for further exploration.</p>
<list list-type="bullet">
<list-item><p>How to choose appropriate global contexts in practice?</p>
<p>In Definition 2, we defined the global context of a word to be the document in which it appears. In practice, however, the global context can flexibly refer to the enclosing paragraph, or to several surrounding sentences, depending on the application scenario. For example, for short documents such as review text, it is appropriate to use the entire document as the global context of a word. In long news articles or research papers, it may be more suitable to define the global context as the paragraph or subsection in which a word appears. We therefore recommend that practitioners experiment with different global context settings for different types of text.</p></list-item>
<list-item><p>Global context for other embedding training settings.</p>
<p>In this work, we showed that using global contexts in addition to local contexts improves unsupervised word embedding quality since the two types of contexts capture complementary information about a word. Based on this observation, we may consider incorporating global contexts into other embedding learning settings. For example, in CatE (Meng et al., <xref ref-type="bibr" rid="B17">2020</xref>) we improve the discriminative power of the embedding model over a specific set of user-provided categories with the help of global contexts, based on which a topic mining framework (Meng et al., <xref ref-type="bibr" rid="B19">2019b</xref>) is further developed. We believe that there are many other tasks where global contexts can complement local contexts in training and fine-tuning embeddings.</p></list-item>
<list-item><p>Embedding learning in the spherical space.</p>
<p>It has been shown that directional similarity is more effective than Euclidean distance in word similarity and clustering tasks. Therefore, it might be beneficial to model both local and global contexts in the spherical space to train text embeddings of even better quality, as in JoSE (Meng et al., <xref ref-type="bibr" rid="B18">2019a</xref>). Further exploration might involve using Riemannian optimization on the unit sphere or enforcing vector norm constraints to fine-tune text embeddings in downstream tasks.</p></list-item>
</list>
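<p>Regarding the first point above, switching the granularity of global context only changes how (word, global context) training pairs are generated. A minimal sketch, using a hypothetical nested document structure:</p>

```python
def global_contexts(doc, granularity="document"):
    """Generate (word, global-context id) training pairs under different
    definitions of global context. `doc` is a hypothetical nested
    structure: a list of paragraphs, each a list of sentences, each a
    list of tokens."""
    for p_idx, paragraph in enumerate(doc):
        for s_idx, sentence in enumerate(paragraph):
            for token in sentence:
                if granularity == "document":
                    ctx = "doc"
                elif granularity == "paragraph":
                    ctx = "para-%d" % p_idx
                elif granularity == "sentence":
                    ctx = "sent-%d-%d" % (p_idx, s_idx)
                else:
                    raise ValueError("unknown granularity: %s" % granularity)
                yield token, ctx
```

<p>Finer granularity yields more, smaller global contexts, which shifts the global signal closer to the local one; the appropriate level depends on document length, as discussed above.</p>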
</sec>
<sec id="s8">
<title>8. Conclusions and Future Work</title>
<p>We propose two simple yet effective unsupervised word embedding learning models that jointly capture complementary word contexts: local context focuses more on syntactic and local semantic aspects, whereas global context provides information about the general and topical semantics of words. Experiments show that incorporating both types of contexts achieves state-of-the-art performance on word similarity and text classification tasks. We provide a novel generative perspective to theoretically interpret the two proposed models. This interpretation may pave the way for several future directions:</p>
<list list-type="bullet">
<list-item><p>The global context need not always be defined as the document in which a word appears, because the generative relationship between a word and its enclosing sentence or paragraph may be stronger than that between the word and the entire document.</p></list-item>
<list-item><p>Our current models (and the original word2vec framework) assume that the vMF distribution for generating words/contexts has a constant concentration parameter &#x003BA; &#x0003D; 1. However, the most appropriate &#x003BA; may depend on the vocabulary size, the average document length in the corpus, and other corpus statistics, and can vary across datasets. It would be interesting to explore how to set &#x003BA; appropriately for even better word embedding quality.</p></list-item>
</list>
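<p>For reference, the vMF density on the unit sphere takes the following standard form (textbook material, not reproduced from our derivations):</p>

```latex
f(\mathbf{x}; \boldsymbol{\mu}, \kappa)
  = C_d(\kappa)\, \exp\!\big(\kappa\, \boldsymbol{\mu}^{\top}\mathbf{x}\big),
\qquad
C_d(\kappa) = \frac{\kappa^{d/2 - 1}}{(2\pi)^{d/2}\, I_{d/2 - 1}(\kappa)},
```

<p>where <bold>x</bold> lies on the unit sphere, <bold>&#x003BC;</bold> is the mean direction, &#x003BA; &#x02265; 0 is the concentration parameter (fixed to 1 in our models), and <italic>I</italic> denotes the modified Bessel function of the first kind. Larger &#x003BA; concentrates generated context vectors more tightly around the word direction, which is why its best value may vary with corpus statistics.</p>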
</sec>
<sec sec-type="data-availability-statement" id="s9">
<title>Data Availability Statement</title>
<p>Publicly available datasets were analyzed in this study. This data can be found here: <ext-link ext-link-type="uri" xlink:href="http://qwone.com/&#x0007E;jason/20Newsgroups/">http://qwone.com/&#x0007E;jason/20Newsgroups/</ext-link>, <ext-link ext-link-type="uri" xlink:href="http://www.daviddlewis.com/resources/testcollections/reuters21578/">http://www.daviddlewis.com/resources/testcollections/reuters21578/</ext-link>.</p>
</sec>
<sec id="s10">
<title>Author Contributions</title>
<p>YM and JHu contributed to the design of the models. YM, JHu, GW, and ZW implemented the models and conducted the experiments. YM, JHu, CZ, and JHa wrote the manuscript. All authors contributed to the manuscript revision, read, and approved the submitted version.</p>
<sec>
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
</sec>
</body>
<back>
<sec sec-type="supplementary-material" id="s11">
<title>Supplementary Material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fdata.2020.00009/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fdata.2020.00009/full#supplementary-material</ext-link></p>
<supplementary-material xlink:href="Presentation_3.pdf" id="SM1" mimetype="application/pdf" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bengio</surname> <given-names>Y.</given-names></name> <name><surname>Ducharme</surname> <given-names>R.</given-names></name> <name><surname>Vincent</surname> <given-names>P.</given-names></name></person-group> (<year>2000</year>). <article-title>&#x0201C;A neural probabilistic language model,&#x0201D;</article-title> in <source>Conference on Neural Information Processing Systems</source> (<publisher-loc>Denver, CO</publisher-loc>).</citation>
</ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Blei</surname> <given-names>D. M.</given-names></name> <name><surname>Ng</surname> <given-names>A. Y.</given-names></name> <name><surname>Jordan</surname> <given-names>M. I.</given-names></name></person-group> (<year>2003</year>). <article-title>Latent dirichlet allocation</article-title>. <source>J. Mach. Learn. Res.</source> <volume>3</volume>, <fpage>993</fpage>&#x02013;<lpage>1022</lpage>. <pub-id pub-id-type="doi">10.5555/944919.944937</pub-id></citation>
</ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bruni</surname> <given-names>E.</given-names></name> <name><surname>Tran</surname> <given-names>N.-K.</given-names></name> <name><surname>Baroni</surname> <given-names>M.</given-names></name></person-group> (<year>2014</year>). <article-title>Multimodal distributional semantics</article-title>. <source>J. Artif. Intell. Res.</source> <volume>49</volume>, <fpage>1</fpage>&#x02013;<lpage>47</lpage>. <pub-id pub-id-type="doi">10.1613/jair.4135</pub-id></citation>
</ref>
<ref id="B4">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Cho</surname> <given-names>K.</given-names></name> <name><surname>van Merrienboer</surname> <given-names>B.</given-names></name> <name><surname>G&#x000FC;l&#x000E7;ehre</surname> <given-names>&#x000C7;.</given-names></name> <name><surname>Bahdanau</surname> <given-names>D.</given-names></name> <name><surname>Bougares</surname> <given-names>F.</given-names></name> <name><surname>Schwenk</surname> <given-names>H.</given-names></name> <etal/></person-group>. (<year>2014</year>). <article-title>&#x0201C;Learning phrase representations using rnn encoder-decoder for statistical machine translation,&#x0201D;</article-title> in <source>Conference on Empirical Methods in Natural Language Processing</source> (<publisher-loc>Doha</publisher-loc>).</citation>
</ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Collobert</surname> <given-names>R.</given-names></name> <name><surname>Weston</surname> <given-names>J.</given-names></name> <name><surname>Bottou</surname> <given-names>L.</given-names></name> <name><surname>Karlen</surname> <given-names>M.</given-names></name> <name><surname>Kavukcuoglu</surname> <given-names>K.</given-names></name> <name><surname>Kuksa</surname> <given-names>P. P.</given-names></name></person-group> (<year>2011</year>). <article-title>Natural language processing (almost) from scratch</article-title>. <source>J. Mach. Learn. Res.</source> <volume>12</volume>, <fpage>2493</fpage>&#x02013;<lpage>2537</lpage>. <pub-id pub-id-type="doi">10.5555/1953048.2078186</pub-id></citation>
</ref>
<ref id="B6">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Finkelstein</surname> <given-names>L.</given-names></name> <name><surname>Gabrilovich</surname> <given-names>E.</given-names></name> <name><surname>Matias</surname> <given-names>Y.</given-names></name> <name><surname>Rivlin</surname> <given-names>E.</given-names></name> <name><surname>Solan</surname> <given-names>Z.</given-names></name> <name><surname>Wolfman</surname> <given-names>G.</given-names></name> <etal/></person-group>. (<year>2001</year>). <article-title>&#x0201C;Placing search in context: the concept revisited,&#x0201D;</article-title> in <source>WWW&#x00027;01: Proceedings of the 10th International Conference on World Wide Web</source> (<publisher-loc>Hong Kong</publisher-loc>).</citation>
</ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hill</surname> <given-names>F.</given-names></name> <name><surname>Reichart</surname> <given-names>R.</given-names></name> <name><surname>Korhonen</surname> <given-names>A.</given-names></name></person-group> (<year>2015</year>). <article-title>SimLex-999: evaluating semantic models with (genuine) similarity estimation</article-title>. <source>Comput. Linguist.</source> <volume>41</volume>, <fpage>665</fpage>&#x02013;<lpage>695</lpage>. <pub-id pub-id-type="doi">10.1162/COLI_a_00237</pub-id></citation>
</ref>
<ref id="B8">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Hofmann</surname> <given-names>T.</given-names></name></person-group> (<year>1999</year>). <article-title>&#x0201C;Probabilistic latent semantic indexing,&#x0201D;</article-title> in <source>SIGIR&#x00027;99: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</source> (<publisher-loc>Berkeley, CA</publisher-loc>).</citation>
</ref>
<ref id="B9">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Huang</surname> <given-names>E. H.</given-names></name> <name><surname>Socher</surname> <given-names>R.</given-names></name> <name><surname>Manning</surname> <given-names>C. D.</given-names></name> <name><surname>Ng</surname> <given-names>A. Y.</given-names></name></person-group> (<year>2012</year>). <article-title>&#x0201C;Improving word representations via global context and multiple word prototypes,&#x0201D;</article-title> in <source>Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics</source> (<publisher-loc>Jeju Island</publisher-loc>).</citation>
</ref>
<ref id="B10">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kim</surname> <given-names>Y.</given-names></name></person-group> (<year>2014</year>). <article-title>&#x0201C;Convolutional neural networks for sentence classification,&#x0201D;</article-title> in <source>Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing</source> (<publisher-loc>Doha</publisher-loc>).</citation>
</ref>
<ref id="B11">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kusner</surname> <given-names>M. J.</given-names></name> <name><surname>Sun</surname> <given-names>Y.</given-names></name> <name><surname>Kolkin</surname> <given-names>N. I.</given-names></name> <name><surname>Weinberger</surname> <given-names>K. Q.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;From word embeddings to document distances,&#x0201D;</article-title> in <source>Proceedings of the 32nd International Conference on Machine Learning</source> (<publisher-loc>Lille</publisher-loc>).</citation>
</ref>
<ref id="B12">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lample</surname> <given-names>G.</given-names></name> <name><surname>Ballesteros</surname> <given-names>M.</given-names></name> <name><surname>Subramanian</surname> <given-names>S.</given-names></name> <name><surname>Kawakami</surname> <given-names>K.</given-names></name> <name><surname>Dyer</surname> <given-names>C.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Neural architectures for named entity recognition,&#x0201D;</article-title> in <source>Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source> (<publisher-loc>San Diego, CA</publisher-loc>).</citation>
</ref>
<ref id="B13">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Le</surname> <given-names>Q. V.</given-names></name> <name><surname>Mikolov</surname> <given-names>T.</given-names></name></person-group> (<year>2014</year>). <article-title>&#x0201C;Distributed representations of sentences and documents,&#x0201D;</article-title> in <source>Proceedings of the 31st International Conference on Machine Learning</source> (<publisher-loc>Beijing</publisher-loc>).</citation>
</ref>
<ref id="B14">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Levy</surname> <given-names>O.</given-names></name> <name><surname>Goldberg</surname> <given-names>Y.</given-names></name></person-group> (<year>2014</year>). <article-title>&#x0201C;Neural word embedding as implicit matrix factorization,&#x0201D;</article-title> in <source>NIPS&#x00027;14: Proceedings of the 27th International Conference on Neural Information Processing Systems</source> (<publisher-loc>Montreal, QC</publisher-loc>).</citation>
</ref>
<ref id="B15">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>Y.</given-names></name> <name><surname>Liu</surname> <given-names>Z.</given-names></name> <name><surname>Chua</surname> <given-names>T.-S.</given-names></name> <name><surname>Sun</surname> <given-names>M.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;Topical word embeddings,&#x0201D;</article-title> in <source>AAAI&#x00027;15: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence</source> (<publisher-loc>Austin, TX</publisher-loc>).</citation>
</ref>
<ref id="B16">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Mardia</surname> <given-names>K. V.</given-names></name> <name><surname>Jupp</surname> <given-names>P. E.</given-names></name></person-group> (<year>2009</year>). <source>Directional Statistics</source>, <volume>Vol. 494</volume>. <publisher-name>John Wiley &#x00026; Sons</publisher-name>.</citation>
</ref>
<ref id="B17">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Meng</surname> <given-names>Y.</given-names></name> <name><surname>Huang</surname> <given-names>J.</given-names></name> <name><surname>Wang</surname> <given-names>G.</given-names></name> <name><surname>Wang</surname> <given-names>Z.</given-names></name> <name><surname>Zhang</surname> <given-names>C.</given-names></name> <name><surname>Zhang</surname> <given-names>Y.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>&#x0201C;Discriminative topic mining via category-name guided text embedding,&#x0201D;</article-title> in <source>Proceedings of The Web Conference 2020 (WWW20)</source> (<publisher-loc>Taipei</publisher-loc>).</citation>
</ref>
<ref id="B18">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Meng</surname> <given-names>Y.</given-names></name> <name><surname>Huang</surname> <given-names>J.</given-names></name> <name><surname>Wang</surname> <given-names>G.</given-names></name> <name><surname>Zhang</surname> <given-names>C.</given-names></name> <name><surname>Zhuang</surname> <given-names>H.</given-names></name> <name><surname>Kaplan</surname> <given-names>L.</given-names></name> <etal/></person-group>. (<year>2019a</year>). <article-title>&#x0201C;Spherical text embedding,&#x0201D;</article-title> in <source>33rd Conference on Neural Information Processing Systems (NeurIPS 2019)</source> (<publisher-loc>Vancouver, BC</publisher-loc>).</citation>
</ref>
<ref id="B19">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Meng</surname> <given-names>Y.</given-names></name> <name><surname>Huang</surname> <given-names>J.</given-names></name> <name><surname>Wang</surname> <given-names>Z.</given-names></name> <name><surname>Fan</surname> <given-names>C.</given-names></name> <name><surname>Wang</surname> <given-names>G.</given-names></name> <name><surname>Zhang</surname> <given-names>C.</given-names></name> <etal/></person-group>. (<year>2019b</year>). <article-title>&#x0201C;TopicMine: user-guided topic mining by category-oriented embedding,&#x0201D;</article-title> in <source>ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD)</source> (<publisher-loc>Anchorage, AK</publisher-loc>).</citation>
</ref>
<ref id="B20">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Meng</surname> <given-names>Y.</given-names></name> <name><surname>Shen</surname> <given-names>J.</given-names></name> <name><surname>Zhang</surname> <given-names>C.</given-names></name> <name><surname>Han</surname> <given-names>J.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Weakly-supervised neural text classification,&#x0201D;</article-title> in <source>ACM International Conference on Information and Knowledge Management (CIKM)</source> (<publisher-loc>Torino</publisher-loc>).</citation>
</ref>
<ref id="B21">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Meng</surname> <given-names>Y.</given-names></name> <name><surname>Shen</surname> <given-names>J.</given-names></name> <name><surname>Zhang</surname> <given-names>C.</given-names></name> <name><surname>Han</surname> <given-names>J.</given-names></name></person-group> (<year>2019c</year>). <article-title>&#x0201C;Weakly-supervised hierarchical text classification,&#x0201D;</article-title> in <source>AAAI Conference on Artificial Intelligence (AAAI)</source> (<publisher-loc>Honolulu, HI</publisher-loc>).</citation>
</ref>
<ref id="B22">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Mikolov</surname> <given-names>T.</given-names></name> <name><surname>Chen</surname> <given-names>K.</given-names></name> <name><surname>Corrado</surname> <given-names>G. S.</given-names></name> <name><surname>Dean</surname> <given-names>J.</given-names></name></person-group> (<year>2013a</year>). <article-title>Efficient estimation of word representations in vector space</article-title>. <source>CoRR</source> abs/1301.3781.</citation>
</ref>
<ref id="B23">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Mikolov</surname> <given-names>T.</given-names></name> <name><surname>Sutskever</surname> <given-names>I.</given-names></name> <name><surname>Chen</surname> <given-names>K.</given-names></name> <name><surname>Corrado</surname> <given-names>G. S.</given-names></name> <name><surname>Dean</surname> <given-names>J.</given-names></name></person-group> (<year>2013b</year>). <article-title>&#x0201C;Distributed representations of words and phrases and their compositionality,&#x0201D;</article-title> in <source>NIPS&#x00027;13: Proceedings of the 26th International Conference on Neural Information Processing Systems</source> (<publisher-loc>Lake Tahoe, NV</publisher-loc>).</citation>
</ref>
<ref id="B24">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Pennington</surname> <given-names>J.</given-names></name> <name><surname>Socher</surname> <given-names>R.</given-names></name> <name><surname>Manning</surname> <given-names>C. D.</given-names></name></person-group> (<year>2014</year>). <article-title>&#x0201C;GloVe: global vectors for word representation,&#x0201D;</article-title> in <source>Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source> (<publisher-loc>Doha</publisher-loc>).</citation>
</ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sebastiani</surname> <given-names>F.</given-names></name></person-group> (<year>2002</year>). <article-title>Machine learning in automated text categorization</article-title>. <source>ACM Comput. Surv.</source> <volume>34</volume>, <fpage>1</fpage>&#x02013;<lpage>47</lpage>. <pub-id pub-id-type="doi">10.1145/505282.505283</pub-id></citation>
</ref>
<ref id="B26">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Tang</surname> <given-names>J.</given-names></name> <name><surname>Qu</surname> <given-names>M.</given-names></name> <name><surname>Mei</surname> <given-names>Q.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;PTE: predictive text embedding through large-scale heterogeneous text networks,&#x0201D;</article-title> in <source>KDD &#x00027;15: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source> (<publisher-loc>Sydney, NSW</publisher-loc>).</citation>
</ref>
<ref id="B27">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Tao</surname> <given-names>F.</given-names></name> <name><surname>Zhang</surname> <given-names>C.</given-names></name> <name><surname>Chen</surname> <given-names>X.</given-names></name> <name><surname>Jiang</surname> <given-names>M.</given-names></name> <name><surname>Hanratty</surname> <given-names>T.</given-names></name> <name><surname>Kaplan</surname> <given-names>L. M.</given-names></name> <etal/></person-group>. (<year>2018</year>). <article-title>&#x0201C;Doc2Cube: allocating documents to text cube without labeled data,&#x0201D;</article-title> in <source>2018 IEEE International Conference on Data Mining (ICDM)</source> (<publisher-loc>Singapore</publisher-loc>).</citation>
</ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>van der Maaten</surname> <given-names>L.</given-names></name> <name><surname>Hinton</surname> <given-names>G. E.</given-names></name></person-group> (<year>2008</year>). <article-title>Visualizing data using t-SNE</article-title>. <source>J. Mach. Learn. Res.</source> <volume>9</volume>, <fpage>2579</fpage>&#x02013;<lpage>2605</lpage>.</citation>
</ref>
<ref id="B29">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Xun</surname> <given-names>G.</given-names></name> <name><surname>Li</surname> <given-names>Y.</given-names></name> <name><surname>Gao</surname> <given-names>J.</given-names></name> <name><surname>Zhang</surname> <given-names>A.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Collaboratively improving topic discovery and word embeddings by coordinating global and local contexts,&#x0201D;</article-title> in <source>23rd ACM SIGKDD Conference on Knowledge Discovery and Data Mining</source> (<publisher-loc>Halifax, NS</publisher-loc>).</citation>
</ref>
</ref-list>
<fn-group>
<fn id="fn0001"><p><sup>1</sup>&#x0201C;Explicitly&#x0201D; means local context and global context have explicit and independent vector representations.</p></fn>
<fn id="fn0002"><p><sup>2</sup>This is similar to the constraint introduced in Meng et al. (<xref ref-type="bibr" rid="B18">2019a</xref>).</p></fn>
<fn id="fn0003"><p><sup>3</sup><ext-link ext-link-type="uri" xlink:href="http://qwone.com/&#x0007E;jason/20Newsgroups/">http://qwone.com/&#x0007E;jason/20Newsgroups/</ext-link></p></fn>
<fn id="fn0004"><p><sup>4</sup><ext-link ext-link-type="uri" xlink:href="http://www.daviddlewis.com/resources/testcollections/reuters21578/">http://www.daviddlewis.com/resources/testcollections/reuters21578/</ext-link></p></fn>
<fn id="fn0005"><p><sup>5</sup><ext-link ext-link-type="uri" xlink:href="https://code.google.com/archive/p/word2vec/">https://code.google.com/archive/p/word2vec/</ext-link></p></fn>
<fn id="fn0006"><p><sup>6</sup><ext-link ext-link-type="uri" xlink:href="https://nlp.stanford.edu/projects/glove/">https://nlp.stanford.edu/projects/glove/</ext-link></p></fn>
<fn id="fn0007"><p><sup>7</sup><ext-link ext-link-type="uri" xlink:href="https://radimrehurek.com/gensim/models/doc2vec.html">https://radimrehurek.com/gensim/models/doc2vec.html</ext-link></p></fn>
<fn id="fn0008"><p><sup>8</sup><ext-link ext-link-type="uri" xlink:href="http://ai.stanford.edu/&#x0007E;ehhuang/">http://ai.stanford.edu/&#x0007E;ehhuang/</ext-link></p></fn>
<fn id="fn0009"><p><sup>9</sup><ext-link ext-link-type="uri" xlink:href="https://github.com/mnqu/PTE">https://github.com/mnqu/PTE</ext-link></p></fn>
<fn id="fn0010"><p><sup>10</sup><ext-link ext-link-type="uri" xlink:href="https://github.com/largelymfs/topical_word_embeddings">https://github.com/largelymfs/topical_word_embeddings</ext-link></p></fn>
<fn id="fn0011"><p><sup>11</sup>Since the datasets used in our experiments are relatively small-scale, using higher embedding dimensions (e.g., <italic>p</italic> &#x0003D; 200, 300) does not lead to noticeably different results, so we only report the results with <italic>p</italic> &#x0003D; 100.</p></fn>
<fn id="fn0012"><p><sup>12</sup>In this work, we are interested in embedding quality when embeddings are trained on the local corpus where downstream tasks are carried out. In Meng et al. (<xref ref-type="bibr" rid="B18">2019a</xref>), we report the word similarity evaluation of embeddings trained on the Wikipedia dump.</p></fn>
</fn-group>
<fn-group>
<fn fn-type="financial-disclosure"><p><bold>Funding.</bold> Research was sponsored in part by U.S. Army Research Lab. under Cooperative Agreement No. W911NF-09-2-0053 (NSCTA), DARPA under Agreements Nos. W911NF-17-C-0099 and FA8750-19-2-1004, National Science Foundation IIS 16-18481, IIS 17-04532, and IIS 17-41317, DTRA HDTRA11810026, and grant 1U54GM114838 awarded by NIGMS through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative (www.bd2k.nih.gov). Any opinions, findings, and conclusions or recommendations expressed in this document are those of the author(s) and should not be interpreted as the views of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.</p></fn>
</fn-group>
</back>
</article>