<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Big Data</journal-id>
<journal-title>Frontiers in Big Data</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Big Data</abbrev-journal-title>
<issn pub-type="epub">2624-909X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fdata.2019.00005</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Big Data</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Location Prediction for Tweets</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Huang</surname> <given-names>Chieh-Yang</given-names></name>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/661818/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Tong</surname> <given-names>Hanghang</given-names></name>
<uri xlink:href="http://loop.frontiersin.org/people/559530/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>He</surname> <given-names>Jingrui</given-names></name>
<uri xlink:href="http://loop.frontiersin.org/people/560852/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Maciejewski</surname> <given-names>Ross</given-names></name>
</contrib>
</contrib-group>
<aff><institution>CIDSE, Arizona State University</institution>, <addr-line>Tempe, AZ</addr-line>, <country>United States</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Brian D. Davison, Lehigh University, United States</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Ovidiu Dan, Lehigh University, United States; Shuhan Yuan, University of Arkansas, United States</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Chieh-Yang Huang <email>chiehyang.huang&#x00040;asu.edu</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Data Mining and Management, a section of the journal Frontiers in Big Data</p></fn></author-notes>
<pub-date pub-type="epub">
<day>24</day>
<month>05</month>
<year>2019</year>
</pub-date>
<pub-date pub-type="collection">
<year>2019</year>
</pub-date>
<volume>2</volume>
<elocation-id>5</elocation-id>
<history>
<date date-type="received">
<day>26</day>
<month>12</month>
<year>2018</year>
</date>
<date date-type="accepted">
<day>26</day>
<month>04</month>
<year>2019</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2019 Huang, Tong, He and Maciejewski.</copyright-statement>
<copyright-year>2019</copyright-year>
<copyright-holder>Huang, Tong, He and Maciejewski</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract><p>Geographic information provides important insight for many data mining and social media systems. However, users are often reluctant to provide such information due to various concerns, such as inconvenience and privacy. In this paper, we aim to develop a deep learning based solution to predict geographic information for tweets. Current approaches suffer from two major limitations: (a) they struggle to model long-term information, and (b) it is hard to explain to end users what the model learns. To address these issues, our proposed model embraces three key ideas. First, we introduce a multi-head self-attention model for text representation. Second, to further improve the results on informal language, we treat subwords as features in our model. Lastly, the model is trained jointly on city and country labels to incorporate the information coming from different labels. Experiments on the W-NUT 2016 Geo-tagging shared task show that our proposed model is competitive with state-of-the-art systems in terms of accuracy, while achieving a better distance measure than existing approaches.</p></abstract> <kwd-group>
<kwd>data mining</kwd>
<kwd>location prediction</kwd>
<kwd>multi-head self-attention mechanism</kwd>
<kwd>joint training</kwd>
<kwd>deep learning</kwd>
<kwd>tweets</kwd>
</kwd-group>
<counts>
<fig-count count="6"/>
<table-count count="4"/>
<equation-count count="29"/>
<ref-count count="43"/>
<page-count count="12"/>
<word-count count="8577"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>Nowadays, many technology systems (e.g., social media platforms) emit a variety of digital information, such as texts, times, logs, and so on. Among these, geographic information has been receiving much attention lately. In fact, a large number of applications benefit from geographic information, ranging from marketing recommendation systems (Bao et al., <xref ref-type="bibr" rid="B2">2012</xref>; Savage et al., <xref ref-type="bibr" rid="B35">2012</xref>; Yin et al., <xref ref-type="bibr" rid="B41">2013</xref>; Cheng and Shen, <xref ref-type="bibr" rid="B8">2014</xref>) to event detection systems (Sakaki et al., <xref ref-type="bibr" rid="B33">2010</xref>, <xref ref-type="bibr" rid="B34">2013</xref>; Watanabe et al., <xref ref-type="bibr" rid="B40">2011</xref>; Li et al., <xref ref-type="bibr" rid="B25">2012a</xref>). Although the technology that allows users to share their geographic information has matured, many users are reluctant to do so due to various concerns such as inconvenience and privacy. As Sloan et al. (<xref ref-type="bibr" rid="B36">2013</xref>) illustrated, only &#x0003C;1% of tweets have a geographic tag attached, which in turn limits the growth of related applications. Therefore, researchers have tried to automatically identify the location of a user or post on social media sites. In this paper, we target the location prediction problem on Twitter, one of the largest social media sites.</p>
<p>Our proposed model incorporates three concepts: the multi-head attention mechanism, subword features, and a joint training technique. The first two are designed to improve the text representation, while the joint training concerns the overall architecture of our model. In this work, we mainly focus on using only text information rather than the other metadata provided by Twitter, since our goal is to develop a generic social media location prediction method that could be further applied to other platforms (e.g., online news, where user information is usually not available). Accordingly, we introduce different methods to enhance both the text representation and the overall model architecture.</p>
<p>Text representation is one of the most important building blocks for Natural Language Processing (NLP) applications. In recent studies, with the help of deep learning techniques, many NLP tasks, including text classification, question answering, sentiment analysis, and translation, start from designing a good module for capturing useful and meaningful text information. One well-known family of approaches is the Recurrent Neural Network (RNN) based models, such as the vanilla Recurrent Neural Network (vanilla RNN), the Long Short-Term Memory network (LSTM), and the Gated Recurrent Unit network (GRU). Existing work has shown the power of RNN-based models in handling language modeling tasks (Bengio et al., <xref ref-type="bibr" rid="B3">2003</xref>; Mikolov et al., <xref ref-type="bibr" rid="B29">2010</xref>; Sundermeyer et al., <xref ref-type="bibr" rid="B37">2012</xref>). An RNN iterates over the text step by step and, at each step, propagates information forward to form the sentence representation. RNNs have achieved great success in many applications. However, RNN-based models usually suffer from extremely long training times because every word depends on all the previous words, which makes them hard to parallelize and accelerate. Another branch of studies focuses on Convolutional Neural Network (CNN) based models (Kim, <xref ref-type="bibr" rid="B20">2014</xref>). Though CNNs were originally proposed to solve problems on images, Kim (<xref ref-type="bibr" rid="B20">2014</xref>) successfully introduced them into the NLP field. The idea of a CNN is to capture specific n-gram patterns in the text by using many filters of various lengths. However, due to the length limitation of the kernel filters, CNNs work better at modeling local information. Therefore, we instead adopt the multi-head self-attention model (Vaswani et al., <xref ref-type="bibr" rid="B38">2017</xref>) to model the text information. 
The multi-head self-attention model (Vaswani et al., <xref ref-type="bibr" rid="B38">2017</xref>) utilizes only the attention mechanism, yet it enjoys the advantages of both RNNs and CNNs. That is to say, computation can be parallelized over all words at the same time, while long-term information is still encoded.</p>
<p>Subword features have been shown to be very useful for tasks built on social media, since people tend to use a lot of informal language there (Zhang et al., <xref ref-type="bibr" rid="B42">2015</xref>; Vylomova et al., <xref ref-type="bibr" rid="B39">2016</xref>). One simple but common example is the use of &#x0201C;Good&#x0201D; with a varying number of &#x0201C;o&#x0201D;s, which produces words like &#x0201C;Goooooood&#x0201D;. Another example is user-created words such as &#x0201C;Linsanity&#x0201D;, which is a combination of &#x0201C;Jeremy Lin&#x0201D; (an NBA player) and &#x0201C;insanity.&#x0201D; Therefore, if we start from subword features such as characters, we can potentially infer the subtle meaning of these words. Many applications (Zhang et al., <xref ref-type="bibr" rid="B42">2015</xref>; Vylomova et al., <xref ref-type="bibr" rid="B39">2016</xref>) have already introduced subword features and achieved remarkable results. Therefore, in our task, we also treat subwords as an important feature.</p>
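<p>As a minimal illustration of why subword features help (a hypothetical sketch, not part of our model), character n-grams let informal variants share features with the canonical spelling:</p>

```python
def char_ngrams(word, n=3):
    """Extract the set of character trigrams, padding word boundaries with '#'."""
    padded = "#" + word.lower() + "#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

# The informal "Goooooood" still shares its boundary trigrams with "Good",
# so a subword-based model can relate the two spellings.
overlap = char_ngrams("Good") & char_ngrams("Goooooood")
```

<p>A word-level model would treat the two spellings as unrelated vocabulary entries, whereas the shared character n-grams provide a bridge between them.</p>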
<p>Multitask Learning (Caruana, <xref ref-type="bibr" rid="B5">1997</xref>; Zhang and Yang, <xref ref-type="bibr" rid="B43">2017</xref>) is a method for training a learning model with several different targets. Under multitask learning, the model can learn to extract features that are meaningful for all of the tasks, which often leads to a more robust result. In this task, our goal is to identify the city of a given tweet. It is worth noticing that relations do exist between cities. For example, two cities can be located within the same country, share the same time zone, or be closer than a specific distance. We believe that using the hierarchical relation between cities enables the model to learn extra information and thus improves its inference ability. Therefore, we introduce the joint training method into our model in order to take the relation between cities into consideration.</p>
<p>In the remainder of this paper, we introduce related work in section 2. The problem definition and the details of our proposed model are described in sections 3 and 4, respectively. In section 5, we introduce the W-NUT 2016 Geo-tagging task and provide an in-depth analysis to illustrate the pros and cons of our proposed model.</p>
</sec>
<sec id="s2">
<title>2. Related Work</title>
<p>Location prediction has been studied for decades, but most of the work focuses on predicting a user&#x00027;s location. Recently, with the help of deep neural networks, analyzing pure text has become more feasible, and researchers have thus started trying to predict the location of a single post, such as a tweet. In the following, we introduce the related location prediction tasks for users and posts, respectively.</p>
<p>For location inference of users, one well-known approach is to infer the location from a graph structure (Backstrom et al., <xref ref-type="bibr" rid="B1">2010</xref>; Davis et al., <xref ref-type="bibr" rid="B11">2011</xref>; Li et al., <xref ref-type="bibr" rid="B26">2012b</xref>,<xref ref-type="bibr" rid="B27">c</xref>; Jurgens, <xref ref-type="bibr" rid="B18">2013</xref>; Rout et al., <xref ref-type="bibr" rid="B32">2013</xref>; Compton et al., <xref ref-type="bibr" rid="B10">2014</xref>; Kong et al., <xref ref-type="bibr" rid="B22">2014</xref>; Jurgens et al., <xref ref-type="bibr" rid="B19">2015</xref>). In these approaches, the main assumption is that friends are very likely to live in the same location. Therefore, we could predict a user&#x00027;s location based on their relationships with other users. Among these works, Backstrom et al. (<xref ref-type="bibr" rid="B1">2010</xref>) were the first to notice the interaction between geographic information and social relationships. They carefully examined the interaction and proposed a maximum likelihood approach to identify a user&#x00027;s location given the geographic information of the user&#x00027;s friends. Davis et al. (<xref ref-type="bibr" rid="B11">2011</xref>) built a following-follower network on Twitter and inferred a user&#x00027;s location based on a voting mechanism with three adjusting parameters. Li et al. (<xref ref-type="bibr" rid="B27">2012c</xref>) applied a Gaussian distribution to model a node&#x00027;s (friend&#x00027;s or tweet&#x00027;s) location as well as its influence scope. This network was then used to predict a user&#x00027;s location by maximizing the probability of building edges between the user and its friends or tweets. Li et al. (<xref ref-type="bibr" rid="B25">2012a</xref>) further extended the model to capture the property that a user has multiple related locations, such as a home location as well as a college location. 
Their model is a revised version of the Latent Dirichlet Allocation (LDA) model where the latent variables are locations. Rout et al. (<xref ref-type="bibr" rid="B32">2013</xref>) formulated the problem as a classification task and solved it by applying a Support Vector Machine (SVM) with features extracted from a Twitter follower-based network. Jurgens (<xref ref-type="bibr" rid="B18">2013</xref>) extended the label propagation method with a spatial property. As a semi-supervised learning method, spatial label propagation can iteratively infer all users&#x00027; locations starting from only a few ground truth data points. SPOT (Kong et al., <xref ref-type="bibr" rid="B22">2014</xref>) took the social relation as a continuous feature instead of a binary feature (friends or not) by measuring social closeness. The authors also introduced a confidence-based iteration method to overcome the data sparsity problem. Compton et al. (<xref ref-type="bibr" rid="B10">2014</xref>) formulated the social network geo-location inference task as a convex optimization problem and applied a total variation-based algorithm to solve it. These works rely on the information behind the social network, and hence building a user relationship network is inevitable. This becomes a limitation if we want to work on data other than social media.</p>
<p>Another kind of method focuses on prediction using content and metadata provided by the user (Cheng et al., <xref ref-type="bibr" rid="B7">2010</xref>; Eisenstein et al., <xref ref-type="bibr" rid="B12">2010</xref>; Chandra et al., <xref ref-type="bibr" rid="B6">2011</xref>; Roller et al., <xref ref-type="bibr" rid="B31">2012</xref>; Mahmud et al., <xref ref-type="bibr" rid="B28">2014</xref>). In Eisenstein et al. (<xref ref-type="bibr" rid="B12">2010</xref>)&#x00027;s work, they presented a generative model to capture the relation between latent topics and geographic regions, as they found that high-level topics such as &#x0201C;sport&#x0201D; and &#x0201C;entertainment&#x0201D; are rendered differently in different locations. Chandra et al. (<xref ref-type="bibr" rid="B6">2011</xref>) utilized only the content information, but instead of using only a user&#x00027;s tweets, they augmented them with reply tweets from other users by assuming that a reply tweet has the same topic as the original tweet. A probability distribution model that captures the relation between terms and locations was then applied to predict a user&#x00027;s location based on the corresponding augmented tweet set. Mahmud et al. (<xref ref-type="bibr" rid="B28">2014</xref>)&#x00027;s work focused on building a hierarchical classifier to integrate tweet contents, different categories of metadata, users&#x00027; tweeting behaviors, and external location knowledge such as a geographic gazetteer dictionary. They also examined the impact of frequently traveling users and found that these users usually introduce noise into the model. This led to the conclusion that eliminating frequently traveling users could improve the prediction accuracy. Roller et al. 
(<xref ref-type="bibr" rid="B31">2012</xref>) proposed an information retrieval method whose idea was to build a grid over the earth and then generate reference documents for each cell by selecting location-related documents from the training set. To overcome the problem of uniform grids, they constructed the grid using a k-d tree algorithm to dynamically adapt the cell size to the training data. Cheng et al. (<xref ref-type="bibr" rid="B7">2010</xref>)&#x00027;s work focused on using content alone to predict a user&#x00027;s location, under the assumption that language differs across locations. Although these approaches mainly use content information, what they use is a collection of posts provided by the user. As Cheng et al. (<xref ref-type="bibr" rid="B7">2010</xref>) and Chandra et al. (<xref ref-type="bibr" rid="B6">2011</xref>) revealed, given more posts, the accuracy improves. This fact also suggests that predicting the location of a single post is much more difficult than predicting that of a user.</p>
<p>The task of predicting the location of a single post was proposed more recently. After Han et al. (<xref ref-type="bibr" rid="B14">2016</xref>) built a dataset from Twitter and proposed a shared task, researchers started digging into it. Several approaches were proposed in the shared task. Chi et al. (<xref ref-type="bibr" rid="B9">2016</xref>) applied a Naive Bayes classifier to many selected features, including location-indicative words, user metadata, and so on. CSIRO (Jayasinghe et al., <xref ref-type="bibr" rid="B17">2016</xref>) designed an ensemble method that incorporated heuristics, time zone text classifiers, and an information retrieval approach. Miura et al. (<xref ref-type="bibr" rid="B30">2016</xref>) proposed a variant of the FastText model which can take a user&#x00027;s metadata into account. After the shared task, Huang and Carley (<xref ref-type="bibr" rid="B16">2017</xref>) designed a model with the help of a CNN layer. Lau et al. (<xref ref-type="bibr" rid="B24">2017</xref>), on the other hand, proposed DeepGeo, which utilizes a character-level recurrent convolutional network to further capture subword features within tweets. Most of these works tried to apply deep learning frameworks to capture the language differences among tweets. As we can see, with the help of deep learning frameworks, though only limited information is available in a single post, results are still improving year by year.</p>
</sec>
<sec id="s3">
<title>3. Problem Definition</title>
<p>We use a bold capital letter to represent a matrix (e.g., <bold>A</bold>), a bold lowercase letter to represent a vector (e.g., <bold>a</bold>), and a normal lowercase letter to represent a scalar (e.g., a). Furthermore, a tweet consisting of <italic>n</italic> words is represented as <bold>S</bold> &#x0003D; {<bold>w</bold><sub>1</sub>, <bold>w</bold><sub>2</sub>, &#x000B7;&#x000B7;&#x000B7;&#x02009;, <bold>w</bold><sub><italic>n</italic></sub>}, where <bold>S</bold> is the tweet matrix and <bold>w</bold> is the word embedding. A tweet can also be represented as a sequence of <italic>m</italic> characters <bold>C</bold> &#x0003D; {<bold>c</bold><sub>1</sub>, <bold>c</bold><sub>2</sub>, &#x000B7;&#x000B7;&#x000B7;&#x02009;, <bold>c</bold><sub><italic>m</italic></sub>}, where <bold>C</bold> is the character matrix and <bold>c</bold> is the character embedding. Other general naming conventions are provided in <xref ref-type="table" rid="T1">Table 1</xref>.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Notations and naming convention.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left"><bold>Symbols</bold></th>
<th valign="top" align="left"><bold>Definitions</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left"><bold>w</bold></td>
<td valign="top" align="left">A word embedding <inline-formula><mml:math id="M1"><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula>.</td>
</tr>
<tr>
<td valign="top" align="left"><bold>S</bold> &#x0003D; {<bold>w</bold><sub>1</sub>, <bold>w</bold><sub>2</sub>, &#x02026;, <bold>w</bold><sub><italic>n</italic></sub>}</td>
<td valign="top" align="left">A text matrix consisting of <italic>n</italic> word embeddings. The dimension is <inline-formula><mml:math id="M2"><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula>.</td>
</tr>
<tr>
<td valign="top" align="left"><bold>c</bold></td>
<td valign="top" align="left">A character embedding <inline-formula><mml:math id="M3"><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula>.</td>
</tr>
<tr>
<td valign="top" align="left"><bold>C</bold> &#x0003D; {<bold>c</bold><sub>1</sub>, <bold>c</bold><sub>2</sub>, &#x02026;, <bold>c</bold><sub><italic>m</italic></sub>}</td>
<td valign="top" align="left">A character matrix consisting of <italic>m</italic> character embeddings. The dimension is <inline-formula><mml:math id="M4"><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula>.</td>
</tr>
<tr>
<td valign="top" align="left"><bold>E</bold></td>
<td valign="top" align="left">Embedding matrix.</td>
</tr>
<tr>
<td valign="top" align="left"><bold>H</bold>, <bold>M</bold>, <bold>h</bold></td>
<td valign="top" align="left">Hidden matrices and hidden vector.</td>
</tr>
<tr>
<td valign="top" align="left"><bold>v</bold></td>
<td valign="top" align="left">Output vector.</td>
</tr>
<tr>
<td valign="top" align="left"><bold>W</bold></td>
<td valign="top" align="left">Trainable weight matrix.</td>
</tr>
<tr>
<td valign="top" align="left"><bold>b</bold></td>
<td valign="top" align="left">Trainable bias vector.</td>
</tr>
<tr>
<td valign="top" align="left"><italic>m</italic>, <italic>n</italic></td>
<td valign="top" align="left">Sequence length.</td>
</tr>
<tr>
<td valign="top" align="left"><italic>d</italic></td>
<td valign="top" align="left">Dimension.</td>
</tr>
<tr>
<td valign="top" align="left"><bold>y</bold><sub><italic>city</italic></sub>, <bold>y</bold><sub><italic>country</italic></sub></td>
<td valign="top" align="left">True city labels and true country labels.</td>
</tr>
<tr>
<td valign="top" align="left"><inline-formula><mml:math id="M5"><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>y</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, <inline-formula><mml:math id="M6"><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>y</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula></td>
<td valign="top" align="left">Predicted city labels and predicted country labels.</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>With the notation provided above, the problem definition could be described as:</p>
<table-wrap position="float">
<table rules="none">
<tbody>
<tr>
<td valign="top" align="left" colspan="2">P<sc>roblem</sc> D<sc>efinition</sc> 1.</td>
</tr>
<tr>
<td valign="top" align="right"><bold>Given:</bold></td>
<td valign="top" align="left">A tweet and its corresponding representations, <bold>S</bold> and <bold>C</bold>.</td>
</tr>
<tr>
<td valign="top" align="right"><bold>Predict:</bold></td>
<td valign="top" align="left">The label of the given tweet. This could be either <italic>y</italic><sub><italic>city</italic></sub> or <italic>y</italic><sub><italic>country</italic></sub>.</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4">
<title>4. Location Prediction for Tweets</title>
<p>In this section, we describe our model in several steps. We first introduce the high-level architecture of our model, and then describe its separate modules: the multi-head self-attention mechanism, subword features, and the joint training method.</p>
<sec>
<title>4.1. Model Overview</title>
<p>As shown in <xref ref-type="fig" rid="F1">Figure 1</xref>, the proposed model contains several small modules, but it can be mainly separated into two parts: text representation and joint training. The text representation module consists of a word representation and a character representation. Both representations are encoded by a multi-head self-attention layer, but for the character representation we first apply a CNN layer and a pooling layer to reduce the dimensionality and extract meaningful information. The word representation and the character representation are then concatenated into a vector that represents the given tweet. In the second module, to utilize the relation between cities, we use the same concatenated vector but two different output layers to predict the country and the city at the same time. However, the country classifier is used only for training; in the testing phase, we use only the city part of the model for prediction.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Overview of our proposed model.</p></caption>
<graphic xlink:href="fdata-02-00005-g0001.tif"/>
</fig>
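<p>The joint-output idea above can be sketched as follows. This is a simplified NumPy sketch with made-up toy dimensions and labels, not the exact architecture: a shared tweet vector feeds two softmax output heads, the training loss sums the city and country cross-entropies, and inference reads only the city head.</p>

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax for a 1-D logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, n_city, n_country = 8, 5, 3  # toy sizes (assumed, not the paper's)

# Tweet representation: stands in for the concatenated word + character vector.
tweet_vec = rng.standard_normal(d)

# Two output heads share the same tweet representation.
W_city = rng.standard_normal((n_city, d))
W_country = rng.standard_normal((n_country, d))

p_city = softmax(W_city @ tweet_vec)
p_country = softmax(W_country @ tweet_vec)

# Joint training loss: sum of the two cross-entropies
# (the country head is used only during training).
y_city, y_country = 2, 1  # toy ground-truth labels
loss = -np.log(p_city[y_city]) - np.log(p_country[y_country])

# At test time, only the city head is read.
predicted_city = int(np.argmax(p_city))
```

<p>Because gradients from both heads flow back into the shared representation, the model is encouraged to extract features that are informative at both the city and the country level.</p>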
</sec>
<sec>
<title>4.2. Multi-Head Self-Attention Mechanism</title>
<p>The multi-head self-attention model was proposed by Vaswani et al. (<xref ref-type="bibr" rid="B38">2017</xref>) for the language translation task. Here, we introduce the multi-head self-attention model as a module for text representation.</p>
<sec>
<title>4.2.1. Self-Attention</title>
<p>Let&#x00027;s start by defining the self-attention layer. In the standard attention mechanism, the model takes three inputs: a query <bold>Q</bold>, a key <bold>K</bold>, and a value <bold>V</bold>, where <bold>Q</bold> and <bold>K</bold> are used to compute the weights for <bold>V</bold>. The formal definition (Vaswani et al., <xref ref-type="bibr" rid="B38">2017</xref>) is written as follows:</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M7"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>A</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>Q</mml:mtext></mml:mstyle><mml:mo>,</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>K</mml:mtext></mml:mstyle><mml:mo>,</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>V</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>S</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>Q</mml:mtext></mml:mstyle><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>K</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:msqrt><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msqrt></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>V</mml:mtext></mml:mstyle></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>d</italic> is the word embedding dimension. In the self-attention layer, however, all three inputs are the same matrix <bold>S</bold>, the text matrix {<bold>w</bold><sub>1</sub>, <bold>w</bold><sub>2</sub>, &#x000B7;&#x000B7;&#x000B7;&#x02009;, <bold>w</bold><sub><italic>n</italic></sub>}, where <bold>S</bold> &#x02208; &#x0211D;<sup><italic>n</italic>&#x000D7;<italic>d</italic></sup>, <inline-formula><mml:math id="M8"><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>w</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x000D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, <italic>n</italic> is the text length, and <italic>d</italic> is the word embedding dimension. By definition, the self-attention is:</p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M9"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtable style="text-align:axis;" equalrows="false" columnlines="none" equalcolumns="false" class="array"><mml:mtr><mml:mtd><mml:mi>S</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi><mml:mi>f</mml:mi><mml:mi>A</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi><mml:mi>n</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>S</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>A</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>S</mml:mtext></mml:mstyle><mml:mo>,</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>S</mml:mtext></mml:mstyle><mml:mo>,</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>S</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:mi>S</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>S</mml:mtext></mml:mstyle><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>S</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:msqrt><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msqrt></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mstyle 
mathvariant="bold"><mml:mtext>S</mml:mtext></mml:mstyle></mml:mtd></mml:mtr></mml:mtable></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>We can state this more clearly at the word level. Each <bold>h</bold><sub><italic>i</italic></sub> below is the transformation of <bold>w</bold><sub><italic>i</italic></sub> obtained by a weighted sum over the sentence.</p>
<disp-formula id="E3"><label>(3)</label><mml:math id="M10"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mstyle mathvariant="bold"><mml:mtext>H</mml:mtext></mml:mstyle><mml:mo>=</mml:mo><mml:mi>S</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi><mml:mi>f</mml:mi><mml:mi>A</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi><mml:mi>n</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>S</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>h</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>h</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x000B7;&#x000B7;&#x000B7;</mml:mo><mml:mspace width="0.3em" class="thinspace"/><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>h</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <bold>H</bold> &#x02208; &#x0211D;<sup><italic>n</italic>&#x000D7;<italic>d</italic></sup> and <inline-formula><mml:math id="M12"><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>h</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x000D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>. <bold>h</bold><sub><italic>i</italic></sub> is computed as follows:</p>
<disp-formula id="E5"><label>(4)</label><mml:math id="M13"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>h</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>w</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>&#x003B1;<sub><italic>ij</italic></sub> is the weight for each <bold>w</bold><sub><italic>j</italic></sub> and is computed by the softmax function with the scaling term <inline-formula><mml:math id="M14"><mml:msqrt><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msqrt></mml:math></inline-formula>:</p>
<disp-formula id="E6"><label>(5)</label><mml:math id="M15"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msup><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>w</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x000B7;</mml:mo><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>w</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:mo>/</mml:mo><mml:msqrt><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msqrt></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:mstyle displaystyle="true"><mml:msubsup><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msubsup></mml:mstyle><mml:msup><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>w</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x000B7;</mml:mo><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>w</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:mo>/</mml:mo><mml:msqrt><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msqrt></mml:mrow></mml:msup></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>We also illustrate this idea in <xref ref-type="fig" rid="F2">Figure 2</xref>. In this figure, we first use the top part (green) and the left part (blue) to compute the weight for each cell. Equation (5) tells us to normalize each row so that it sums to one. Then, for each row, the new vector is constructed by summing the vectors along the top, each multiplied by its corresponding weight.</p>
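As a minimal NumPy sketch of Equations (2)-(5) (not the authors' implementation; the function and variable names here are illustrative), the self-attention transform can be written as:

```python
import numpy as np

def self_attention(S):
    """SelfAttn(S) = Softmax(S S^T / sqrt(d)) S, as in Equation (2)."""
    n, d = S.shape
    scores = S @ S.T / np.sqrt(d)                    # pairwise w_i . w_j / sqrt(d)
    # Row-wise softmax: each row of alpha sums to one, matching Equation (5)
    alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)
    return alpha @ S                                 # h_i = sum_j alpha_ij w_j, Equation (4)

rng = np.random.default_rng(0)
S = rng.normal(size=(5, 8))                          # toy sentence: n = 5 words, d = 8
H = self_attention(S)                                # H has the same shape as S
```

Each row of `H` is the weighted sum described in Equation (4); the scaling by the square root of d keeps the softmax from saturating when d is large.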
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>The idea of the multi-head self-attention model. <inline-formula><mml:math id="M16"><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>w</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> will be reconstructed from {<bold>w</bold><sub>1</sub>, <bold>w</bold><sub>2</sub>, &#x02026;, <bold>w</bold><sub><italic>n</italic></sub>} with the corresponding weights computed from <bold>w</bold><sub><italic>i</italic></sub> and {<bold>w</bold><sub>1</sub>, <bold>w</bold><sub>2</sub>, &#x02026;, <bold>w</bold><sub><italic>n</italic></sub>}. Notice that the formula for the weight is actually <inline-formula><mml:math id="M17"><mml:mfrac><mml:mrow><mml:msup><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>w</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>w</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>/</mml:mo><mml:msqrt><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msqrt></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:munder><mml:msup><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>w</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>w</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>/</mml:mo><mml:msqrt><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msqrt></mml:mrow></mml:msup></mml:mrow></mml:mfrac></mml:math></inline-formula>. 
The figure illustrates the idea of the weights, so <inline-formula><mml:math id="M18"><mml:msqrt><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msqrt></mml:math></inline-formula> is not shown.</p></caption>
<graphic xlink:href="fdata-02-00005-g0002.tif"/>
</fig>
</sec>
<sec>
<title>4.2.2. Multi-Head Self-Attention</title>
<p>In the above definition, the attention mechanism is performed only once, resulting in only a single aspect vector. To give the model the power to learn multiple aspects of information, a multi-head self-attention mechanism is proposed. In the multi-head self-attention mechanism, we first apply a linear transformation <bold>W</bold> to <bold>S</bold>, producing <bold>S</bold>&#x02032; &#x0003D; <bold>SW</bold>, where <bold>S</bold>&#x02032; &#x02208; &#x0211D;<sup><italic>n</italic>&#x000D7;<italic>d</italic>&#x02032;</sup>, <bold>S</bold> &#x02208; &#x0211D;<sup><italic>n</italic>&#x000D7;<italic>d</italic></sup>, and <bold>W</bold> &#x02208; &#x0211D;<sup><italic>d</italic>&#x000D7;<italic>d</italic>&#x02032;</sup>. Notice that <italic>d</italic>&#x02032; &#x0003C; <italic>d</italic>, which means we are reducing the dimension. We then apply the self-attention model to <bold>S</bold>&#x02032;:</p>
<disp-formula id="E7"><label>(6)</label><mml:math id="M19"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>H</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi>S</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi><mml:mi>f</mml:mi><mml:mi>A</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi><mml:mi>n</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>S</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>The idea of the multi-head self-attention mechanism is to perform the above operation <italic>h</italic> times and then concatenate the resulting <italic>h</italic> matrices. This gives the model the capability of learning <italic>h</italic> kinds of information. With <inline-formula><mml:math id="M20"><mml:msup><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:mfrac></mml:math></inline-formula>, this functionality is defined as follows:</p>
<disp-formula id="E8"><label>(7)</label><mml:math id="M21"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mstyle mathvariant="bold"><mml:mtext>M</mml:mtext></mml:mstyle><mml:mo>=</mml:mo><mml:mi>M</mml:mi><mml:mi>u</mml:mi><mml:mi>l</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>H</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>d</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>S</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:mi>C</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>H</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>H</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x000B7;&#x000B7;&#x000B7;</mml:mo><mml:mspace width="0.3em" class="thinspace"/><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>H</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:mi>C</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mrow><mml:mi>S</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi><mml:mi>f</mml:mi><mml:mi>A</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi><mml:mi>n</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>S</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mi>S</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi><mml:mi>f</mml:mi><mml:mi>A</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi><mml:mi>n</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>S</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x000B7;&#x000B7;&#x000B7;</mml:mo><mml:mspace width="0.3em" class="thinspace"/><mml:mo>,</mml:mo><mml:mi>S</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi><mml:mi>f</mml:mi><mml:mi>A</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi><mml:mi>n</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>S</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:mi>C</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mrow><mml:mi>S</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi><mml:mi>f</mml:mi><mml:mi>A</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi><mml:mi>n</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>S</mml:mtext></mml:mstyle><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>W</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mi>S</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi><mml:mi>f</mml:mi><mml:mi>A</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi><mml:mi>n</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>S</mml:mtext></mml:mstyle><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>W</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x000B7;&#x000B7;&#x000B7;</mml:mo><mml:mspace width="0.3em" class="thinspace"/><mml:mo>,</mml:mo><mml:mi>S</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi><mml:mi>f</mml:mi><mml:mi>A</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi><mml:mi>n</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>S</mml:mtext></mml:mstyle><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>W</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>As we can see, after performing the multi-head self-attention mechanism, the shape of the output matrix <bold>M</bold> is still &#x0211D;<sup><italic>n</italic>&#x000D7;<italic>d</italic></sup>, the same as that of the input matrix <bold>S</bold>. Therefore, we can further apply the residual network (He et al., <xref ref-type="bibr" rid="B15">2016</xref>) to the multi-head self-attention model. We revise Equation (7) as follows:</p>
<disp-formula id="E10"><label>(8)</label><mml:math id="M23"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mstyle mathvariant="bold"><mml:mtext>M</mml:mtext></mml:mstyle><mml:mo>=</mml:mo><mml:mi>M</mml:mi><mml:mi>u</mml:mi><mml:mi>l</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>H</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>d</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>S</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:mi>C</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>H</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>H</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x000B7;&#x000B7;&#x000B7;</mml:mo><mml:mspace width="0.3em" class="thinspace"/><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>H</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>S</mml:mtext></mml:mstyle></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>The idea of the residual network is to add the input vector to the output vector. Since the original model learns a mapping function <italic>F</italic>(<bold>x</bold>), we change it to <italic>F</italic>&#x02032;(<bold>x</bold>) &#x0003D; <italic>F</italic>(<bold>x</bold>) &#x02212; <bold>x</bold>. The output <bold>y</bold> &#x0003D; <italic>F</italic>&#x02032;(<bold>x</bold>) &#x0002B; <bold>x</bold> remains the same, but the gradient can also flow through the residual path, which mitigates the vanishing gradient problem.</p>
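Under the same caveat as before (an illustrative NumPy sketch rather than the authors' code, assuming <italic>h</italic> heads with <italic>d</italic>&#x02032; &#x0003D; <italic>d</italic>/<italic>h</italic>), Equations (6)-(8) can be expressed as:

```python
import numpy as np

def self_attn(S):
    # Softmax(S S^T / sqrt(d)) S, as in Equation (2)
    scores = S @ S.T / np.sqrt(S.shape[1])
    a = np.exp(scores - scores.max(axis=1, keepdims=True))
    return (a / a.sum(axis=1, keepdims=True)) @ S

def multi_head(S, Ws):
    # Equation (8): run self-attention on h projections of S,
    # concatenate the heads back to (n, d), and add the residual input
    heads = [self_attn(S @ W) for W in Ws]
    return np.concatenate(heads, axis=1) + S

rng = np.random.default_rng(1)
n, d, h = 5, 8, 4                                   # d' = d / h = 2
S = rng.normal(size=(n, d))
Ws = [rng.normal(size=(d, d // h)) for _ in range(h)]
M = multi_head(S, Ws)                               # same shape as S, so residuals stack
```

Because the output shape matches the input shape, several such blocks can be stacked, as the model does in Section 4.2.5.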
</sec>
<sec>
<title>4.2.3. Position-Wise Feed-Forward Network</title>
<p>The position-wise feed-forward network serves as a fully connected layer after the multi-head self-attention layer. The idea is to apply two linear transformations, with a ReLU activation in between, to the input matrix. The mechanism can be described as follows:</p>
<disp-formula id="E11"><label>(9)</label><mml:math id="M24"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>M</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi>F</mml:mi><mml:mi>e</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi><mml:mi>F</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>w</mml:mi><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mi>d</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>M</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:mo class="qopname">max</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mstyle mathvariant="bold"><mml:mtext>M</mml:mtext></mml:mstyle><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>W</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>b</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>W</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>b</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M25"><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>W</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mn>4</mml:mn><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, <inline-formula><mml:math id="M26"><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>W</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mn>4</mml:mn><mml:mi>d</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> are transformation matrices and <inline-formula><mml:math id="M27"><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>b</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mn>4</mml:mn><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, <inline-formula><mml:math id="M28"><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>b</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> are bias vectors. The transformation dimensions are suggested by Vaswani et al. (<xref ref-type="bibr" rid="B38">2017</xref>). We also apply the residual network here, so Equation (9) is revised as:</p>
<disp-formula id="E12"><label>(10)</label><mml:math id="M29"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>M</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi>F</mml:mi><mml:mi>e</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi><mml:mi>F</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>w</mml:mi><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mi>d</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>M</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo class="qopname">max</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mstyle mathvariant="bold"><mml:mtext>M</mml:mtext></mml:mstyle><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>W</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>b</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>W</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>b</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mstyle 
mathvariant="bold"><mml:mtext>M</mml:mtext></mml:mstyle></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where the last term <bold>M</bold> is identical to the input matrix <bold>M</bold>.</p>
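A sketch of Equation (10) in NumPy, again with illustrative names and the 4<italic>d</italic> inner dimension mentioned above:

```python
import numpy as np

def feed_forward(M, W1, b1, W2, b2):
    # Equation (10): two linear maps with a ReLU in between,
    # plus the residual connection (+ M) at the end
    return np.maximum(0.0, M @ W1 + b1) @ W2 + b2 + M

rng = np.random.default_rng(2)
n, d = 5, 8
M = rng.normal(size=(n, d))
W1, b1 = rng.normal(size=(d, 4 * d)), np.zeros(4 * d)   # inner dimension 4d
W2, b2 = rng.normal(size=(4 * d, d)), np.zeros(d)
M_prime = feed_forward(M, W1, b1, W2, b2)               # same (n, d) shape as M
```

Note that the transform is applied to every row (position) with the same weights, which is what "position-wise" refers to.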
</sec>
<sec>
<title>4.2.4. Position Encoding</title>
<p>One important relation that the multi-head self-attention model cannot capture is the positional relation between words. In RNN-based models, word order is well preserved because the output vectors are computed step by step. In CNN-based models, word order is partially preserved because they extract information from n-grams. The multi-head self-attention model, however, computes a weighted sum over the sequence of vectors, so word order is ignored entirely. To fix this problem, Vaswani et al. (<xref ref-type="bibr" rid="B38">2017</xref>) introduced the idea of injecting the position information into the word vector. Accordingly, we build a position embedding matrix <bold>E</bold><sub><italic>position</italic></sub> (Gehring et al., <xref ref-type="bibr" rid="B13">2017</xref>) and add the corresponding position vector to each word vector. <bold>E</bold><sub><italic>position</italic></sub> works like a word embedding layer that turns a position into a vector. This can be described by the following equation:</p>
<disp-formula id="E13"><label>(11)</label><mml:math id="M30"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>w</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>w</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>E</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mo>:</mml:mo></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <bold>w</bold><sub><italic>i</italic></sub> is the <italic>i</italic>-th word from <bold>S</bold> and <italic>i</italic> is its position. By introducing the position embedding, we expect <bold>E</bold><sub><italic>position</italic></sub> to learn how to represent the position information. For example, one of the vectors, <italic>e</italic><sub><italic>position</italic>1</sub>, in <bold>E</bold><sub><italic>position</italic></sub> could learn the meaning of being the first word. If we add <italic>e</italic><sub><italic>position</italic>1</sub> to a word embedding, the resulting vector should contain both the position information (first word) and the meaning of the word. The position encoding is applied at the beginning of the whole model. As a result, after obtaining <bold>w</bold>&#x02032;, we replace <bold>S</bold> by <inline-formula><mml:math id="M31"><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>w</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>w</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x000B7;&#x000B7;&#x000B7;</mml:mo><mml:mspace width="0.3em" class="thinspace"/><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>w</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>.</p>
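Equation (11) amounts to a row-wise addition. In the sketch below (illustrative names; the random matrix stands in for the learned <bold>E</bold><sub><italic>position</italic></sub>), each word vector receives the embedding of its position:

```python
import numpy as np

rng = np.random.default_rng(3)
n_max, d = 140, 8                          # up to 140 positions, embedding dim d
E_position = rng.normal(size=(n_max, d))   # learned in practice; random stand-in here
S = rng.normal(size=(5, d))                # word vectors w_1, ..., w_n for n = 5
# Equation (11): w'_i = w_i + E_position[i, :]
S_prime = S + E_position[:S.shape[0], :]
```

`S_prime` then replaces `S` as the input to the first multi-head self-attention layer.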
</sec>
<sec>
<title>4.2.5. Text Representation</title>
<p>In our model, we stack the multi-head self-attention layer and the position-wise feed-forward network twice. Notice that after passing through these layers, the output is still a matrix in &#x0211D;<sup><italic>n</italic>&#x000D7;<italic>d</italic></sup>. To obtain the text representation, we need to reduce the output matrix to a one-dimensional vector. We therefore simply sum along the sequence dimension <italic>n</italic>, producing the text representation as follows:</p>
<disp-formula id="E14"><label>(12)</label><mml:math id="M32"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>v</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>x</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>M</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mo>:</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>The resulting <bold>v</bold><sup><italic>text</italic></sup> &#x02208; &#x0211D;<sup><italic>d</italic></sup> serves as the text representation.</p>
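The reduction in Equation (12) is a single sum along the sequence dimension, for example:

```python
import numpy as np

# Toy (n = 5, d = 3) output of the stacked layers; values chosen for readability
M_prime = np.arange(15, dtype=float).reshape(5, 3)
v_text = M_prime.sum(axis=0)   # Equation (12): sum over the n positions, leaving a d-vector
```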
</sec>
</sec>
<sec>
<title>4.3. Subword Feature</title>
<p>The idea of adding a subword feature is to infer the meaning of low-frequency words. Though the simplest approach would be to apply the above model directly to the character-level sequence, this is infeasible because the complexity of the multi-head self-attention model is <italic>O</italic>(<italic>n</italic><sup>2</sup> &#x000B7; <italic>d</italic>), where <italic>n</italic> is the sequence length. Although the maximum number of characters allowed in Twitter is only 140, it would still cause a serious computation bottleneck in our model. Therefore, we first apply a one-dimensional convolutional neural network to extract the n-gram information, and then use maximum pooling over a small window to extract meaningful information while reducing the sequence length. The detailed procedure is described in the following paragraphs.</p>
<p>Given the character matrix <bold>C</bold>, we first apply a convolutional neural network layer to it. Each element of the resulting matrix <bold>H</bold><sup><italic>conv</italic></sup> could be described as:</p>
<disp-formula id="E15"><label>(13)</label><mml:math id="M33"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>v</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>C</mml:mtext></mml:mstyle><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>-</mml:mo><mml:mi>k</mml:mi><mml:mo>:</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mo>:</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002A;</mml:mo><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>W</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x0002B;</mml:mo><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>b</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where &#x0002A; is the convolution operator, <italic>k</italic> is half of the kernel size, <italic>i</italic> is the index of the character sequence ranging from 1 to the character length <italic>m</italic>, and <italic>j</italic> is the index of the filter ranging from 1 to the number of filters <italic>f</italic>.</p>
<p>After this, we apply max pooling with a window size equal to the kernel size 2<italic>k</italic> and a stride of 2<italic>k</italic> &#x02212; 1. The elements of the resulting matrix <bold>H</bold><sup><italic>pool</italic></sup> are given by:</p>
<disp-formula id="E16"><label>(14)</label><mml:math id="M34"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtable style="text-align:axis;" equalrows="false" equalcolumns="false" class="array"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mi>o</mml:mi><mml:mi>o</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mo class="qopname">max</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>H</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>v</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>l</mml:mi><mml:mo>-</mml:mo><mml:mi>k</mml:mi><mml:mo>:</mml:mo><mml:mi>l</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mi>j</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>l</mml:mi><mml:mo>=</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x000B7;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>2</mml:mn><mml:mi>k</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mtext>&#x02003;</mml:mtext><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x000B7;&#x000B7;&#x000B7;</mml:mo><mml:mspace width="0.3em" class="thinspace"/><mml:mo>,</mml:mo><mml:mfrac><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mi>k</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>As we can see, the first dimension reduces from <italic>m</italic> to <inline-formula><mml:math id="M35"><mml:mfrac><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mi>k</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac></mml:math></inline-formula> and thus <bold>H</bold><sup><italic>pool</italic></sup> is <inline-formula><mml:math id="M36"><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mfrac><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mi>k</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac><mml:mo>&#x000D7;</mml:mo><mml:mi>f</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>. We then apply the multi-head self-attention model on <bold>H</bold><sup><italic>pool</italic></sup> and get <bold>v</bold><sup><italic>char</italic></sup> &#x02208; &#x0211D;<sup><italic>f</italic></sup> as:</p>
<disp-formula id="E17"><label>(15)</label><mml:math id="M37"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>M</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi><mml:mi>a</mml:mi><mml:mi>r</mml:mi></mml:mrow></mml:msup></mml:mtd><mml:mtd><mml:mo>=</mml:mo><mml:mi>M</mml:mi><mml:mi>u</mml:mi><mml:mi>l</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>H</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>d</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>H</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mi>o</mml:mi><mml:mi>o</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E18"><label>(16)</label><mml:math id="M38"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>M</mml:mtext></mml:mstyle><mml:msup><mml:mrow><mml:mo>&#x0200A;</mml:mo></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi><mml:mi>a</mml:mi><mml:mi>r</mml:mi></mml:mrow></mml:msup></mml:mtd><mml:mtd><mml:mo>=</mml:mo><mml:mi>F</mml:mi><mml:mi>e</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi><mml:mi>F</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>w</mml:mi><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mi>d</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>M</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi><mml:mi>a</mml:mi><mml:mi>r</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E19"><label>(17)</label><mml:math id="M39"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>v</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi><mml:mi>a</mml:mi><mml:mi>r</mml:mi></mml:mrow></mml:msup></mml:mtd><mml:mtd><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mfrac><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mi>k</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac></mml:mrow></mml:munderover></mml:mstyle><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>M</mml:mtext></mml:mstyle><mml:msup><mml:mrow><mml:mo>&#x0200A;</mml:mo></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi><mml:mi>a</mml:mi><mml:mi>r</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mo>:</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
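<p>As a hedged sketch (variable names and dimensions are ours, not the paper's), the convolution and pooling steps of Equations (13)&#x02013;(14) can be written in NumPy as follows; the point is that pooling shrinks the sequence length from <italic>m</italic> to <italic>m</italic>/(2<italic>k</italic> &#x02212; 1), which is what makes the subsequent self-attention affordable.</p>

```python
import numpy as np

# Illustrative sketch of Eqs. (13)-(14); all dimensions below are hypothetical.
m, d, f, k = 140, 8, 16, 2           # chars, char-embedding dim, filters, half kernel size
rng = np.random.default_rng(0)
C = rng.standard_normal((m, d))      # character matrix

# Convolution (Eq. 13): each output row is a window of 2k character rows
# multiplied elementwise by the filter weights and summed. Padding keeps
# the output length at m.
W = rng.standard_normal((2 * k, d, f))
b = np.zeros(f)
C_pad = np.pad(C, ((k, k - 1), (0, 0)))
H_conv = np.stack([(C_pad[i:i + 2 * k, :, None] * W).sum(axis=(0, 1)) + b
                   for i in range(m)])                      # shape (m, f)

# Max pooling (Eq. 14): window 2k, stride 2k - 1, so m shrinks to m // (2k - 1).
stride = 2 * k - 1
H_pool = np.stack([H_conv[i * stride:i * stride + 2 * k].max(axis=0)
                   for i in range(m // stride)])            # shape (m // (2k-1), f)
```

<p>With <italic>m</italic> = 140 and <italic>k</italic> = 2, the self-attention input drops from 140 to 46 positions, cutting the <italic>O</italic>(<italic>n</italic><sup>2</sup> &#x000B7; <italic>d</italic>) attention cost by roughly a factor of nine.</p>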
</sec>
<sec>
<title>4.4. Joint Training</title>
<p>The <bold>v</bold><sup><italic>text</italic></sup> and <bold>v</bold><sup><italic>char</italic></sup> vectors are then concatenated to form the tweet representation vector.</p>
<disp-formula id="E20"><label>(18)</label><mml:math id="M40"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>v</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mi>w</mml:mi><mml:mi>e</mml:mi><mml:mi>e</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi>C</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>v</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>x</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>v</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi><mml:mi>a</mml:mi><mml:mi>r</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <bold>v</bold><sup><italic>tweet</italic></sup> &#x02208; &#x0211D;<sup><italic>d</italic>&#x0002B;<italic>f</italic></sup>.</p>
<p>By applying two different transformations <bold>W</bold><sub><italic>city</italic></sub> and <bold>W</bold><sub><italic>country</italic></sub>, we obtain two different vectors <bold>v</bold><sub><italic>city</italic></sub> and <bold>v</bold><sub><italic>country</italic></sub> for prediction.</p>
<disp-formula id="E21"><label>(19)</label><mml:math id="M41"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>v</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>v</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mi>w</mml:mi><mml:mi>e</mml:mi><mml:mi>e</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msup><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>W</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E22"><label>(20)</label><mml:math id="M42"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>v</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>v</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mi>w</mml:mi><mml:mi>e</mml:mi><mml:mi>e</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msup><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>W</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M43"><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>f</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x000D7;</mml:mo><mml:msub><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula>, <inline-formula><mml:math id="M44"><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>f</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x000D7;</mml:mo><mml:msub><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula>, <inline-formula><mml:math id="M45"><mml:msub><mml:mrow><mml:mstyle 
mathvariant="bold"><mml:mtext>v</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula>, <inline-formula><mml:math id="M46"><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>v</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula>, and <italic>m</italic><sub><italic>city</italic></sub>, <italic>m</italic><sub><italic>country</italic></sub> are city size and country size respectively.</p>
<p>We then apply the softmax function to get the probability for each city <inline-formula><mml:math id="M47"><mml:msubsup><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and each country <inline-formula><mml:math id="M48"><mml:msubsup><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>.</p>
<disp-formula id="E23"><label>(21)</label><mml:math id="M49"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup></mml:mtd><mml:mtd><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msup><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:mstyle displaystyle="true"><mml:msubsup><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msubsup></mml:mstyle><mml:msup><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup></mml:mrow></mml:msup></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E24"><label>(22)</label><mml:math id="M50"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup></mml:mtd><mml:mtd><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msup><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:mstyle displaystyle="true"><mml:msubsup><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msubsup></mml:mstyle><mml:msup><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup></mml:mrow></mml:msup></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M51"><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> is the <italic>l</italic>-th element of <bold>v</bold><sub><italic>city</italic></sub> and <inline-formula><mml:math id="M52"><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> is the <italic>l</italic>-th element of <bold>v</bold><sub><italic>country</italic></sub>.</p>
<p>The predicted city <inline-formula><mml:math id="M53"><mml:msubsup><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and country <inline-formula><mml:math id="M54"><mml:msubsup><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> are the labels with the highest probability.</p>
<disp-formula id="E25"><label>(23)</label><mml:math id="M55"><mml:mtable columnalign='left'><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi></mml:mrow><mml:mo>&#x02032;</mml:mo></mml:msubsup></mml:mrow></mml:mtd><mml:mtd columnalign='left'><mml:mrow><mml:mo>=</mml:mo><mml:munder><mml:mrow><mml:mtext>argmax</mml:mtext></mml:mrow><mml:mi>l</mml:mi></mml:munder><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi></mml:mrow><mml:mn>1</mml:mn></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x000B7;&#x000B7;&#x000B7;</mml:mo><mml:mtext>&#x02009;</mml:mtext><mml:mo>,</mml:mo><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msubsup></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E26"><label>(24)</label><mml:math id="M56"><mml:mtable columnalign='left'><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>y</mml:mi></mml:mrow><mml:mo>&#x02032;</mml:mo></mml:msubsup></mml:mrow></mml:mtd><mml:mtd columnalign='left'><mml:mrow><mml:mo>=</mml:mo><mml:munder><mml:mrow><mml:mtext>argmax</mml:mtext></mml:mrow><mml:mi>l</mml:mi></mml:munder><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>y</mml:mi></mml:mrow><mml:mn>1</mml:mn></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>y</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x000B7;&#x000B7;&#x000B7;</mml:mo><mml:mtext>&#x02009;</mml:mtext><mml:mo>,</mml:mo><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>m</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msubsup></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
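<p>The two prediction heads (Equations 19&#x02013;24) can be sketched minimally as below, with randomly initialized weights and illustrative dimensions rather than trained values; only the label counts (3,362 cities, 175 countries) are taken from the dataset statistics.</p>

```python
import numpy as np

# Hypothetical dimensions for the tweet representation; label counts from the dataset.
d, f = 32, 16
m_city, m_country = 3362, 175

rng = np.random.default_rng(0)
v_tweet = rng.standard_normal(d + f)              # Concat(v_text, v_char), Eq. (18)
W_city = rng.standard_normal((d + f, m_city))
W_country = rng.standard_normal((d + f, m_country))

def softmax(v):
    e = np.exp(v - v.max())                       # numerically stabilized softmax
    return e / e.sum()

p_city = softmax(v_tweet @ W_city)                # Eqs. (19), (21)
p_country = softmax(v_tweet @ W_country)          # Eqs. (20), (22)
y_city = int(np.argmax(p_city))                   # Eq. (23)
y_country = int(np.argmax(p_country))             # Eq. (24)
```

<p>Both heads read the same shared tweet vector; only the final linear transformations differ.</p>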
<p>In our task, we have two kinds of labels: city and country. Since some cities are located in the same country, the two prediction tasks share common information. Therefore, we propose a joint training framework that models city and country at the same time. Using cross-entropy as the loss function, the joint learning loss is as follows:</p>
<disp-formula id="E27"><label>(25)</label><mml:math id="M57"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>L</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:mi>s</mml:mi><mml:mo>=</mml:mo><mml:mo>-</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munderover></mml:mstyle><mml:msubsup><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mo class="qopname">log</mml:mo><mml:msubsup><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mo>-</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munderover></mml:mstyle><mml:msubsup><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mo 
class="qopname">log</mml:mo><mml:msubsup><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M58"><mml:msubsup><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula><mml:math id="M59"><mml:msubsup><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> are binary indicators that equal 1 only when the label is the correct one, and 0 otherwise.</p>
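<p>Because the indicators are one-hot, each sum in Equation (25) collapses to the negative log-probability of the true label. A small sketch (helper names are ours):</p>

```python
import numpy as np

def cross_entropy(p, true_label):
    # With one-hot targets, the inner sum of Eq. (25) reduces to -log p[true_label].
    return -np.log(p[true_label])

def joint_loss(p_city, city_label, p_country, country_label):
    # Joint training: the city loss and the country loss are simply added.
    return cross_entropy(p_city, city_label) + cross_entropy(p_country, country_label)
```

<p>Perfect predictions (probability 1 on both true labels) give a loss of 0; the loss grows without bound as either head's probability on its true label approaches 0.</p>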
</sec>
</sec>
<sec id="s5">
<title>5. Experiments and Results</title>
<p>In this section, we describe our experiments on the W-NUT 2016 Geo-tagging task<xref ref-type="fn" rid="fn0001"><sup>1</sup></xref> and the benchmark approaches used for comparison. Several metrics are employed to provide insights from different perspectives.</p>
<sec>
<title>5.1. Data</title>
<p>We use the geolocation prediction shared task dataset (Han et al., <xref ref-type="bibr" rid="B14">2016</xref>) directly in our experiment. Although the shared task comprises two subtasks, predicting locations for tweets and for users, we focus only on the tweet prediction part. The dataset was collected from 2013 to 2016 via the Twitter Streaming API, and only tweets whose language is identified by Twitter as English are retained. Owing to Twitter policy restrictions, the dataset provides only the IDs of the collected tweets rather than the original tweets and their associated information. As a result, although the dataset specifies 12M tweets for training and 10k for development, we could only retrieve about 8M and 8k tweets, respectively, since users may delete the tweets they posted, making some tweets no longer available. However, the testing data containing 10k tweets is provided in full, so comparison with previous benchmarks remains possible. Detailed statistics are provided in <xref ref-type="table" rid="T2">Table 2</xref>.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Statistics of the W-NUT geo-tagging task dataset.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left"><bold>Data</bold></th>
<th valign="top" align="center"><bold>Amount</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Training Set</td>
<td valign="top" align="center">8,492,598</td>
</tr>
<tr>
<td valign="top" align="left">Validation Set</td>
<td valign="top" align="center">7,214</td>
</tr>
<tr>
<td valign="top" align="left">Testing Set</td>
<td valign="top" align="center">10,000</td>
</tr>
<tr>
<td valign="top" align="left">City Label</td>
<td valign="top" align="center">3,362</td>
</tr>
<tr>
<td valign="top" align="left">Country Label</td>
<td valign="top" align="center">175</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
<sec id="s6">
<title>6. Evaluation Metrics</title>
<p>Three metrics are adopted in the W-NUT 2016 Geo-tagging task: one hard metric and two soft metrics. The first is the classification <bold>accuracy</bold> over the city prediction. This is regarded as a hard metric because there is no tolerance for a wrong prediction. The distance-based metrics, on the other hand, are regarded as soft metrics since they measure the distance between the true and predicted locations; only a wrong prediction with a large error is heavily penalized. Two distance-based metrics are utilized: <bold>median error distance</bold> and <bold>mean error distance</bold>. Given the evaluation result <italic>R</italic> &#x0003D; <italic>d</italic><sub>1</sub>, <italic>d</italic><sub>2</sub>, &#x000B7;&#x000B7;&#x000B7;&#x02009;, <italic>d</italic><sub><italic>n</italic></sub>, where <italic>d</italic><sub><italic>i</italic></sub> is the error distance (in kilometers) between the predicted and the gold geographic coordinate, the median and mean error distance are computed as follows:</p>
<disp-formula id="E28"><label>(26)</label><mml:math id="M60"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>M</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mi>E</mml:mi><mml:mi>r</mml:mi><mml:mi>r</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>D</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>e</mml:mi><mml:mo>=</mml:mo><mml:mi>M</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E29"><label>(27)</label><mml:math id="M61"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>M</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mi>E</mml:mi><mml:mi>r</mml:mi><mml:mi>r</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>D</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>e</mml:mi><mml:mo>=</mml:mo><mml:mi>M</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
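<p>Both soft metrics (Equations 26&#x02013;27) are one-liners over the list of per-tweet error distances. The toy values below (hypothetical, in kilometers) illustrate why the median is the more robust of the two:</p>

```python
import numpy as np

# Hypothetical per-tweet error distances in km; the last one is a gross outlier.
R = np.array([0.0, 5.0, 12.0, 4000.0])

median_err = float(np.median(R))   # Eq. (26): barely affected by the outlier
mean_err = float(np.mean(R))       # Eq. (27): dominated by the outlier
```

<p>Here the median error is 8.5 km while the mean error exceeds 1,000 km, which is why both are typically reported together.</p>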
<sec>
<title>6.1. Benchmark Model</title>
<p>Several benchmarks are selected for comparison. Since the original dataset provides metadata such as the user-specified location description, time zone, self-introduction, and so on, most prior work utilizes all of this information in their models. However, our proposed model focuses on modeling the information carried by the content itself. We therefore implemented text-content-only versions of these models by removing the modules or layers that handle metadata. In the following subsections, we describe each benchmark and how we removed its metadata-related modules.</p>
<sec>
<title>6.1.1. DeepGeo</title>
<p>DeepGeo (Lau et al., <xref ref-type="bibr" rid="B24">2017</xref>) utilizes the character-level recurrent convolutional network (Lai et al., <xref ref-type="bibr" rid="B23">2015</xref>) for text modeling. In the recurrent convolutional network, the character matrix is passed through a bi-directional LSTM layer, producing a hidden state matrix. The hidden state matrix is then passed through a CNN layer followed by a max-over-time pooling layer to generate the subword features. After acquiring the subword features, an attention network is applied to merge the subword feature matrices into a single vector. In addition to the text representation module, DeepGeo introduces an RBF network for modeling time-related features, such as tweet creation time and account creation time. All of these vectors, including the text representation and the meta features, are then concatenated and passed through two dense layers for classification. To assess the model&#x00027;s ability to handle text information alone, we remove all layers other than the character-level RCNN.</p>
</sec>
<sec>
<title>6.1.2. <sc>FUJIXEROX</sc></title>
<p>This approach was proposed by <sc>FUJIXEROX</sc> (Miura et al., <xref ref-type="bibr" rid="B30">2016</xref>), one of the participating teams in the W-NUT 2016 Geo-tagging task. The model is a variant of the original FastText model (Bojanowski et al., <xref ref-type="bibr" rid="B4">2016</xref>), whose idea is to represent a word by the sum of its n-gram embeddings. Therefore, for an out-of-vocabulary word, the model can still infer a word vector from its subword (n-gram) features. <sc>FUJIXEROX</sc> applied the FastText model not only to the tweet text but also to the user-specified location and the user profile description. The three feature vectors and the time zone embedding vector are concatenated and then passed into a dense layer for prediction. To restrict the model to the text information from tweets, the metadata features are removed; the resulting model is a supervised FastText model.</p>
</sec>
<sec>
<title>6.1.3. CNN Model</title>
<p>A CNN-based model is provided by Huang and Carley (<xref ref-type="bibr" rid="B16">2017</xref>). Their approach uses a CNN layer (Kim, <xref ref-type="bibr" rid="B20">2014</xref>) to model the tweet content, the user profile description, the user-specified location, and the user name. These four vectors are then concatenated with four one-hot vectors: the tweet language, the user language, the time zone, and the tweet creation time. The concatenated vector is passed through a dense layer to form a classifier. Unlike the previous two approaches, this task was performed on a self-built dataset, so we implemented this approach ourselves for comparison. After removing the metadata features, the model reduces to a plain CNN model.</p>
</sec>
<sec>
<title>6.1.4. CSIRO</title>
<p>Jayasinghe et al. (<xref ref-type="bibr" rid="B17">2016</xref>) utilize an ensemble approach to overcome the weaknesses of the individual components. They also handle many kinds of metadata and integrate them with external information such as a gazetteer and an IP-lookup table. These features then feed a label propagation approach, an information retrieval approach, and a text classification approach. By examining different ensemble strategies, they found that the full cascade outperforms the other strategies. As their approach relies heavily on the metadata, we list it only as a reference.</p>
</sec>
<sec>
<title>6.1.5. Naive Bayes Methods</title>
<p>This method was proposed by Chi et al. (<xref ref-type="bibr" rid="B9">2016</xref>) and applies naive Bayes to a number of selected features. However, only features extracted from the text data are considered, such as location-indicative words and hashtags.</p>
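<p>A minimal sketch of this kind of text-only naive Bayes classifier follows; the vocabulary of location-indicative words, the labels, and the counts are invented for illustration and do not come from Chi et al.'s feature set.</p>

```python
from collections import Counter
from math import log

# Toy training data: (tokens, city label). The location-indicative
# words and labels here are invented for illustration.
train = [
    (["beach", "sunset", "hollywood"], "los angeles-ca037-us"),
    (["snow", "jazz", "downtown"], "salt lake city-ut035-us"),
    (["hollywood", "traffic"], "los angeles-ca037-us"),
]

# Per-class priors and per-class word counts.
class_counts = Counter(label for _, label in train)
word_counts = {}
for tokens, label in train:
    word_counts.setdefault(label, Counter()).update(tokens)

def predict(tokens, alpha=1.0):
    """Multinomial naive Bayes with Laplace smoothing over text features."""
    vocab = {w for c in word_counts.values() for w in c}
    best, best_lp = None, float("-inf")
    for label, prior in class_counts.items():
        lp = log(prior / len(train))
        total = sum(word_counts[label].values())
        for w in tokens:
            lp += log((word_counts[label][w] + alpha) / (total + alpha * len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

print(predict(["hollywood", "beach"]))   # los angeles-ca037-us
```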
</sec>
</sec>
<sec>
<title>6.2. Experiment Setting</title>
<p>The parameters used for our model are listed in <xref ref-type="table" rid="T3">Table 3</xref>. If we regard a multi-head self-attention layer followed by a feed-forward layer as one attention layer, then a stack number of 2 means we stack two such attention layers, producing the sequence multi-head self-attention, feed-forward, multi-head self-attention, feed-forward. We use Adam (Kingma and Ba, <xref ref-type="bibr" rid="B21">2014</xref>) for optimization. The model is trained for 10 epochs, and the checkpoint with the best validation result is kept for testing. To compute the distance error, we do not use the model to predict latitude and longitude directly. Instead, we map the predicted city into its corresponding latitude and longitude and take that as our prediction for the geographic coordinate. For example, if the predicted city label is &#x0201C;los angeles-ca037-us,&#x0201D; we query GeoNames<xref ref-type="fn" rid="fn0002"><sup>2</sup></xref> with &#x0201C;Los Angeles US&#x0201D; (city name and country name). The returned geographic coordinate (N 34&#x000B0;3&#x02032; 8&#x02032;&#x02032;, W 118&#x000B0;14&#x02032; 37&#x02032;&#x02032;) is used as the predicted result.</p>
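<p>The city-to-coordinate evaluation described above amounts to a lookup followed by a great-circle distance. A minimal sketch, with a hypothetical two-entry table standing in for the GeoNames query:</p>

```python
from math import radians, sin, cos, asin, sqrt

# Hypothetical lookup table standing in for a GeoNames query
# ("city name + country name" -> latitude/longitude in degrees).
CITY_COORDS = {
    "los angeles-ca037-us": (34.0522, -118.2437),
    "salt lake city-ut035-us": (40.7608, -111.8910),
}

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def distance_error(predicted_label, true_lat, true_lon):
    lat, lon = CITY_COORDS[predicted_label]
    return haversine_km(lat, lon, true_lat, true_lon)

# If the true tweet location is Salt Lake City but the model predicts
# Los Angeles, the error is the distance between the two city centers
# (roughly 930 km).
err = distance_error("los angeles-ca037-us", 40.7608, -111.8910)
print(round(err, 1))
```

<p>The mean and median error distances reported below are computed by applying this per-tweet error over the whole test set.</p>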
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Detail of the parameter setting. Setting<sub>2</sub> uses only the single-head self-attention model. Setting<sub>3</sub> is trained without the country label.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left"><bold>Network</bold></th>
<th valign="top" align="center"><bold>Parameter</bold></th>
<th valign="top" align="center"><bold>Size<sub><bold>1</bold></sub></bold></th>
<th valign="top" align="center"><bold>Size<sub><bold>2</bold></sub></bold></th>
<th valign="top" align="center"><bold>Size<sub><bold>3</bold></sub></bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Overall</td>
<td valign="top" align="center">Batch Size</td>
<td valign="top" align="center">512</td>
<td valign="top" align="center">512</td>
<td valign="top" align="center">512</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">Epochs</td>
<td valign="top" align="center">10</td>
<td valign="top" align="center">10</td>
<td valign="top" align="center">10</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">Dropout</td>
<td valign="top" align="center">0.3</td>
<td valign="top" align="center">0.3</td>
<td valign="top" align="center">0.3</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">Learning Rate</td>
<td valign="top" align="center">0.0005</td>
<td valign="top" align="center">0.001</td>
<td valign="top" align="center">0.001</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">Min Word Frequency</td>
<td valign="top" align="center">10</td>
<td valign="top" align="center">10</td>
<td valign="top" align="center">10</td>
</tr>
<tr style="border-top: thin solid #000000;">
<td valign="top" align="left">Text</td>
<td valign="top" align="center">Max Length</td>
<td valign="top" align="center">30</td>
<td valign="top" align="center">30</td>
<td valign="top" align="center">30</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">Heads</td>
<td valign="top" align="center">10</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">2</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">Stack Number</td>
<td valign="top" align="center">2</td>
<td valign="top" align="center">2</td>
<td valign="top" align="center">2</td>
</tr>
<tr>
<td/>
<td valign="top" align="center"><inline-formula><mml:math id="M62"><mml:msup><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>E</mml:mi></mml:mrow><mml:mrow><mml:mi>w</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula></td>
<td valign="top" align="center">200</td>
<td valign="top" align="center">200</td>
<td valign="top" align="center">200</td>
</tr>
<tr>
<td/>
<td valign="top" align="center"><italic>d</italic><sup><italic>h</italic></sup></td>
<td valign="top" align="center">200</td>
<td valign="top" align="center">200</td>
<td valign="top" align="center">200</td>
</tr>
<tr style="border-top: thin solid #000000;">
<td valign="top" align="left">Character</td>
<td valign="top" align="center">Max Length</td>
<td valign="top" align="center">140</td>
<td valign="top" align="center">140</td>
<td valign="top" align="center">140</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">Heads</td>
<td valign="top" align="center">8</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">2</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">Stack Number</td>
<td valign="top" align="center">2</td>
<td valign="top" align="center">2</td>
<td valign="top" align="center">2</td>
</tr>
<tr>
<td/>
<td valign="top" align="center"><inline-formula><mml:math id="M63"><mml:msup><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>E</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi><mml:mi>a</mml:mi><mml:mi>r</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula></td>
<td valign="top" align="center">100</td>
<td valign="top" align="center">100</td>
<td valign="top" align="center">100</td>
</tr>
<tr>
<td/>
<td valign="top" align="center"><italic>d</italic><sup><italic>h</italic></sup></td>
<td valign="top" align="center">100</td>
<td valign="top" align="center">100</td>
<td valign="top" align="center">100</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">CNN filter size</td>
<td valign="top" align="center">3, 4, 5, 6, 7</td>
<td valign="top" align="center">3, 4, 5</td>
<td valign="top" align="center">3, 4, 5</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">filter number</td>
<td valign="top" align="center">64</td>
<td valign="top" align="center">64</td>
<td valign="top" align="center">64</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>In the pre-processing step, words that appear fewer than the min word frequency times are turned into the &#x0003C;UNK&#x0003E; token. Max length is the maximum sequence length; words exceeding it are removed. If the text sequence is shorter than the max length, we pad &#x0003C;nan&#x0003E; in front of the text</italic>.</p>
</table-wrap-foot>
</table-wrap>
</sec>
<sec>
<title>6.3. Result</title>
<p>The results are listed in <xref ref-type="table" rid="T4">Table 4</xref>. We separate the results into two sections because we mainly focus on the setting without metadata. Within this setting, our model outperforms all the other models according to Acc<sub>1</sub>, the city-level accuracy. <sc>FUJIXEROX</sc>&#x00027;s fastText model performs relatively well on both distance measurements, but our proposed approach is competitive. For the remaining methods, Proposed Method<sub>1</sub> outperforms DeepGeo by 8.98% and 22.07%, CNN by 14.35% and 27.49%, and Naive Bayes by 18.08% and 43.09% in mean and median error distance, respectively. This suggests that our proposed method better captures the relations between locations. To better understand the behavior of the model, we also examine the country-wise prediction. Here, we turn a city label into a country label by extracting the country from the city label; for example, the country label for &#x0201C;los angeles-ca037-<bold>us</bold>&#x0201D; is &#x0201C;us.&#x0201D; We then compute the accuracy and report it in <xref ref-type="table" rid="T4">Table 4</xref> as <bold>Acc</bold><sub>2</sub>. As we can see, there is no large difference in Acc<sub>2</sub> across the models, which suggests that our proposed method gives closer city predictions within the same country. Turning to the setting with metadata: these results are taken from the respective papers, so part of the table is missing. It is, however, easy to see that using metadata improves the results considerably. This is foreseeable since some of the metadata provide very strong signals. For instance, the time zone feature essentially acts as a geographic constraint, and some users explicitly reveal their home location in the profile location description, which is also metadata. As stated before, we focus on extracting information from pure text content, so it is reasonable for us to ignore the metadata.</p>
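<p>The city-to-country label conversion and the resulting Acc<sub>2</sub> are straightforward; a minimal sketch with invented example labels:</p>

```python
def country_label(city_label):
    """A city label like 'los angeles-ca037-us' ends with its country code."""
    return city_label.rsplit("-", 1)[-1]

def country_accuracy(predictions, gold):
    """Acc_2: fraction of predictions whose country matches the gold country."""
    hits = sum(country_label(p) == country_label(g)
               for p, g in zip(predictions, gold))
    return hits / len(gold)

preds = ["los angeles-ca037-us", "toronto-08-ca", "chicago-il031-us"]
gold  = ["salt lake city-ut035-us", "moncton-04-ca", "chicago-il031-us"]
print(country_accuracy(preds, gold))   # 1.0 -- all three countries match
```

<p>Note how a model can score 1.0 on this country-level measure while getting only one of the three cities right, which is exactly the gap between Acc<sub>1</sub> and Acc<sub>2</sub> in the table.</p>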
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Results of the models using only text information and, for reference, with metadata. Acc<sub>1</sub> is the city-level accuracy and Acc<sub>2</sub> the country-level accuracy.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left"><bold>Model</bold></th>
<th valign="top" align="center"><bold>Acc<sub><bold>1</bold></sub></bold></th>
<th valign="top" align="center"><bold>Acc<sub><bold>2</bold></sub></bold></th>
<th valign="top" align="center"><bold>Mean</bold></th>
<th valign="top" align="center"><bold>Median</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left" style="background-color:#bbbdc0" colspan="5"><bold>Without Metadata</bold></td>
</tr>
<tr>
<td valign="top" align="left">Naive Bayes<xref ref-type="table-fn" rid="TN1"><sup>&#x0002A;</sup></xref></td>
<td valign="top" align="center">0.146</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">5338.9</td>
<td valign="top" align="center">3424.6</td>
</tr>
<tr>
<td valign="top" align="left"><sc>FUJIXEROX</sc></td>
<td valign="top" align="center">0.168</td>
<td valign="top" align="center">0.566</td>
<td valign="top" align="center">4441.5</td>
<td valign="top" align="center"><bold>1900.5</bold></td>
</tr>
<tr>
<td valign="top" align="left">CNN Model</td>
<td valign="top" align="center">0.207</td>
<td valign="top" align="center">0.581</td>
<td valign="top" align="center">5106.8</td>
<td valign="top" align="center">2687.6</td>
</tr>
<tr>
<td valign="top" align="left">DeepGeo<sup>&#x0002B;</sup></td>
<td valign="top" align="center">0.202</td>
<td valign="top" align="center"><bold>0.597</bold></td>
<td valign="top" align="center">4805.5</td>
<td valign="top" align="center">2500.9</td>
</tr>
<tr>
<td valign="top" align="left">Proposed Method<sub>1</sub></td>
<td valign="top" align="center"><bold>0.218</bold></td>
<td valign="top" align="center">0.590</td>
<td valign="top" align="center"><bold>4373.7</bold></td>
<td valign="top" align="center">1948.9</td>
</tr>
<tr>
<td valign="top" align="left">Proposed Method<sub>2</sub></td>
<td valign="top" align="center">0.215</td>
<td valign="top" align="center"><bold>0.597</bold></td>
<td valign="top" align="center">4449.2</td>
<td valign="top" align="center">1970.6</td>
</tr>
<tr>
<td valign="top" align="left">Proposed Method<sub>3</sub></td>
<td valign="top" align="center">0.216</td>
<td valign="top" align="center">0.581</td>
<td valign="top" align="center">4697.2</td>
<td valign="top" align="center">2088.4</td>
</tr>
<tr>
<td valign="top" align="left" style="background-color:#bbbdc0" colspan="5"><bold>With Metadata</bold></td>
</tr>
<tr>
<td valign="top" align="left"><sc>FUJIXEROX</sc></td>
<td valign="top" align="center">0.409</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">1792.5</td>
<td valign="top" align="center">69.5</td>
</tr>
<tr>
<td valign="top" align="left">DeepGeo</td>
<td valign="top" align="center">0.428</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">&#x02013;</td>
</tr>
<tr>
<td valign="top" align="left">CSIRO</td>
<td valign="top" align="center">0.436</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">2538.2</td>
<td valign="top" align="center">74.7</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="TN1">
<label>&#x0002A;</label>
<p><italic>Notice that the result of Naive Bayes is reported in Chi et al. (<xref ref-type="bibr" rid="B9">2016</xref>), where Acc<sub>2</sub> is not available. <sup>&#x0002B;</sup>Though the reported Acc<sub>1</sub> in Lau et al. (<xref ref-type="bibr" rid="B24">2017</xref>) is 0.217, our experiment gives only 0.202. The results with metadata are taken from the original papers, so some numbers are missing. Bold values denote the best performance in each column of the &#x0201C;without metadata&#x0201D; setting</italic>.</p></fn>
</table-wrap-foot>
</table-wrap>
<p>When comparing the different settings of our proposed method, we can see that Proposed Method<sub>1</sub> performs the best. In Proposed Method<sub>2</sub>, where we set the head number to 1 and obtain the single-head self-attention model, the performance generally decreases in both Acc<sub>1</sub> and the distance measurements. However, reducing the head number to one also greatly reduces the training time. In Proposed Method<sub>3</sub>, we trained the model without the country label. As we can see, the distance measurements increase, especially the mean error distance. This indicates that the country label indeed helps the model learn the geographic relations between cities.</p>
</sec>
<sec>
<title>6.4. Analysis</title>
<p>In this section, we analyze several cases by visualizing the attention weight matrix. We focus only on the word representation module of Proposed Method<sub>2</sub>, since the multi-head attention model has several attention weight matrices and is thus hard to illustrate. Also, as the character representation module contains CNN and pooling layers, it is hard to tell which subword feature is kept in the attention layer. Therefore, we focus on analyzing the word representation module to see what Proposed Method<sub>2</sub> has learned.</p>
<p>In <xref ref-type="fig" rid="F3">Figure 3</xref>, there are two attention weight matrices because we stack two self-attention layers in our model. In the figure, each row represents the set of weights used to construct the new vector. For example, in the last row, where the word is &#x0201C;morning,&#x0201D; only &#x0201C;all&#x0201D; and &#x0201C;moncton&#x0201D; have high weights. In both the first and the second layer, &#x0201C;moncton&#x0201D; receives a very high weight from most of the words. Since this word reveals the location directly, DeepGeo, CNN, and our proposed model give the correct prediction. <sc>FUJIXEROX</sc>, on the other hand, predicts <bold>los angeles-ca037-us</bold> and fails to give the correct label. Notice that in the first layer (<xref ref-type="fig" rid="F3">Figure 3A</xref>), the weights of the first <monospace>&#x0003C;nan&#x0003E;</monospace> are distributed evenly over the words. As a result, after the first layer, the vector of <monospace>&#x0003C;nan&#x0003E;</monospace> can be seen as a sentence representation. This is why in the second layer (<xref ref-type="fig" rid="F3">Figure 3B</xref>), many of the words also attend strongly to the first column (<monospace>&#x0003C;nan&#x0003E;</monospace>).</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p><bold>(A,B)</bold> The attention weight matrix of our model. <monospace>&#x0003C;nan&#x0003E;</monospace> is the padding term since the word length is &#x0003C;30. In this case, DeepGeo, CNN and the proposed method correctly predict <bold>moncton-04-ca</bold> but <sc>FUJIXEROX</sc> predicts <bold>los angeles-ca037-us</bold>.</p></caption>
<graphic xlink:href="fdata-02-00005-g0003.tif"/>
</fig>
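<p>The row-wise reading of these weight matrices follows from how a self-attention layer builds each output vector. A small NumPy sketch of one single-head layer (toy sizes and random weights, not our trained parameters) shows that each row of the weight matrix is a softmax distribution over the input words:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, d = 5, 4                       # toy sequence length and model size

X = rng.normal(size=(n_words, d))       # input word vectors
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Single-head scaled dot-product self-attention.
Q, K, V = X @ Wq, X @ Wk, X @ Wv
A = softmax(Q @ K.T / np.sqrt(d))       # the attention weight matrix we plot
out = A @ V                             # row i of A mixes the vectors for word i

# Each row of A is a probability distribution over the input words,
# which is why each row of the plotted matrix sums to one.
print(np.allclose(A.sum(axis=1), 1.0))  # True
```

<p>Stacking two such layers yields the two matrices shown in the figures, with the second layer attending over the outputs of the first.</p>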
<p>In <xref ref-type="fig" rid="F4">Figure 4</xref>, since the topic is basketball, &#x0201C;crean&#x0201D; becomes a very important word; it refers to Tom Crean, a basketball coach at Indiana University. In the first layer, we can see that both &#x0201C;turnovers&#x0201D; and &#x0201C;crean&#x0201D; receive high weights, meaning that the model successfully captures the relation between the basketball term &#x0201C;turnovers&#x0201D; and the person name &#x0201C;crean.&#x0201D; In the second layer, <monospace>&#x0003C;nan&#x0003E;</monospace> and &#x0201C;crean&#x0201D; receive high weights; recall that the vector of <monospace>&#x0003C;nan&#x0003E;</monospace> can also be seen as a sentence representation. In conclusion for this case, our proposed model successfully captures the hidden relations and gives a correct prediction, whereas DeepGeo, CNN, and <sc>FUJIXEROX</sc> all fail.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p><bold>(A,B)</bold> The attention weight matrix of our model. <monospace>&#x0003C;nan&#x0003E;</monospace> is the padding term since the word length is &#x0003C;30. In this case, only our proposed method correctly predict <bold>indianapolis-in097-us</bold>. DeepGeo, CNN, and <sc>FUJIXEROX</sc> predict <bold>city of london-enggla-gb</bold>, <bold>chicago-il031-us</bold>, and <bold>toronto-08-ca</bold> respectively.</p></caption>
<graphic xlink:href="fdata-02-00005-g0004.tif"/>
</fig>
<p>In <xref ref-type="fig" rid="F5">Figure 5</xref>, all four models fail to recognize the location correctly, since the post does not give any useful information. We find that in the first layer, all the weights are similar. Also, in the second layer, all of the words attend to the first column (<monospace>&#x0003C;nan&#x0003E;</monospace>), whose vector does not contain any useful information. This means that our model could not find any meaningful or helpful information for prediction.</p>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p><bold>(A,B)</bold> The attention weight matrix of our model. <monospace>&#x0003C;nan&#x0003E;</monospace> is the padding term since the word length is &#x0003C;30. In this case, the correct city is <bold>kisumu-07-ke</bold> but DeepGeo, CNN, <sc>FUJIXEROX</sc>, and the proposed method predict <bold>lagos-05-ng</bold>, <bold>lagos-05-ng</bold>, <bold>quezon city-ncrf2-ph</bold>, and <bold>kano-29-ng</bold> respectively.</p></caption>
<graphic xlink:href="fdata-02-00005-g0005.tif"/>
</fig>
<p>In <xref ref-type="fig" rid="F6">Figure 6</xref>, DeepGeo, CNN, and <sc>FUJIXEROX</sc> all predict the correct city, <bold>salt lake city-ut035-us</bold>, but our model predicts <bold>atlanta-ga121-us</bold>. When investigating the weight matrix, we find that both &#x0201C;utah&#x0201D; and &#x0201C;atlanta&#x0201D; receive high attention, representing the two labels <bold>salt lake city-ut035-us</bold> and <bold>atlanta-ga121-us</bold> respectively. This gives conflicting information to our model, and in the end our model predicts the wrong label. In short, our proposed model fails to capture the semantic meaning of &#x0201C;leaving.&#x0201D;</p>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p><bold>(A,B)</bold> The attention weight matrix of our model. <monospace>&#x0003C;nan&#x0003E;</monospace> is the padding term since the word length is &#x0003C;30. In this case, DeepGeo, CNN, and <sc>FUJIXEROX</sc> correctly predict <bold>salt lake city-ut035-us</bold> but our proposed method predicts <bold>atlanta-ga121-us</bold>.</p></caption>
<graphic xlink:href="fdata-02-00005-g0006.tif"/>
</fig>
<p>The above four cases give us a brief understanding of the behavior of our proposed model. Our model can capture the hidden relations between different terms. However, it still struggles with the semantic meaning of words and therefore sometimes gives wrong predictions. In general, the information captured by the model is easy to interpret and quite meaningful.</p>
</sec>
</sec>
<sec sec-type="conclusions" id="s7">
<title>7. Conclusions</title>
<p>In this paper, we have proposed a new deep learning model to predict locations for tweets. Our model integrates three key concepts: the multi-head self-attention mechanism, subword features, and a joint training technique with the country label. The experiment on the W-NUT geo-tagging task shows that our model is competitive with or better than the state-of-the-art methods w.r.t. different measurements. The analysis of the attention weight matrix also illustrates that our model can capture the hidden relations between different words. In the future, we will further consider the semantic information of the sentences to better capture the meaning of the tweet.</p>
</sec>
<sec id="s8">
<title>Author Contributions</title>
<p>C-YH designed the model, performed the experiments, analyzed the data, and wrote the paper. HT gave suggestions for the model as well as the experiments and wrote the paper. JH and RM contributed in the early discussion and the problem identification.</p>
<sec>
<title>Conflict of Interest Statement</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
</sec>
</body>
<back>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Backstrom</surname> <given-names>L.</given-names></name> <name><surname>Sun</surname> <given-names>E.</given-names></name> <name><surname>Marlow</surname> <given-names>C.</given-names></name></person-group> (<year>2010</year>). <article-title>Find me if you can: improving geographical prediction with social and spatial proximity</article-title>, in <source>Proceedings of the 19th International Conference on World Wide Web</source> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>61</fpage>&#x02013;<lpage>70</lpage>. <pub-id pub-id-type="doi">10.1145/1772690.1772698</pub-id></citation></ref>
<ref id="B2">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bao</surname> <given-names>J.</given-names></name> <name><surname>Zheng</surname> <given-names>Y.</given-names></name> <name><surname>Mokbel</surname> <given-names>M. F.</given-names></name></person-group> (<year>2012</year>). <article-title>Location-based and preference-aware recommendation using sparse geo-social networking data</article-title>, in <source>Proceedings of the 20th International Conference on Advances in Geographic Information Systems</source>, SIGSPATIAL &#x02032;12 (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>199</fpage>&#x02013;<lpage>208</lpage>. <pub-id pub-id-type="doi">10.1145/2424321.2424348</pub-id></citation></ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bengio</surname> <given-names>Y.</given-names></name> <name><surname>Ducharme</surname> <given-names>R.</given-names></name> <name><surname>Vincent</surname> <given-names>P.</given-names></name> <name><surname>Jauvin</surname> <given-names>C.</given-names></name></person-group> (<year>2003</year>). <article-title>A neural probabilistic language model</article-title>. <source>J. Mach. Learn. Res.</source> <volume>3</volume>, <fpage>1137</fpage>&#x02013;<lpage>1155</lpage>. <pub-id pub-id-type="doi">10.1007/3-540-33486-6_6</pub-id></citation></ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bojanowski</surname> <given-names>P.</given-names></name> <name><surname>Grave</surname> <given-names>E.</given-names></name> <name><surname>Joulin</surname> <given-names>A.</given-names></name> <name><surname>Mikolov</surname> <given-names>T.</given-names></name></person-group> (<year>2016</year>). <article-title>Enriching word vectors with subword information</article-title>. <source>arXiv</source>. <volume>arXiv</volume>:<fpage>1607.04606</fpage> <pub-id pub-id-type="doi">10.1162/tacl-a-00051</pub-id></citation></ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Caruana</surname> <given-names>R.</given-names></name></person-group> (<year>1997</year>). <article-title>Multitask learning</article-title>. <source>Mach. Learn.</source> <volume>28</volume>, <fpage>41</fpage>&#x02013;<lpage>75</lpage>.</citation></ref>
<ref id="B6">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chandra</surname> <given-names>S.</given-names></name> <name><surname>Khan</surname> <given-names>L.</given-names></name> <name><surname>Muhaya</surname> <given-names>F. B.</given-names></name></person-group> (<year>2011</year>). <article-title>Estimating twitter user location using social interactions&#x02013;a content based approach</article-title>, in <source>2011 IEEE Third International Conference on Privacy, Security, Risk and Trust and 2011 IEEE Third International Conference on Social Computing</source> (<publisher-loc>Boston, MA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>838</fpage>&#x02013;<lpage>843</lpage>. <pub-id pub-id-type="doi">10.1109/PASSAT/SocialCom.2011.120</pub-id></citation></ref>
<ref id="B7">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Cheng</surname> <given-names>Z.</given-names></name> <name><surname>Caverlee</surname> <given-names>J.</given-names></name> <name><surname>Lee</surname> <given-names>K.</given-names></name></person-group> (<year>2010</year>). <article-title>You are where you tweet: a content-based approach to geo-locating twitter users</article-title>, in <source>Proceedings of the 19th ACM International Conference on Information and Knowledge Management</source> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>759</fpage>&#x02013;<lpage>768</lpage>. <pub-id pub-id-type="doi">10.1145/1871437.1871535</pub-id></citation></ref>
<ref id="B8">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Cheng</surname> <given-names>Z.</given-names></name> <name><surname>Shen</surname> <given-names>J.</given-names></name></person-group> (<year>2014</year>). <article-title>Just-for-me: an adaptive personalization system for location-aware social music recommendation</article-title>, in <source>Proceedings of International Conference on Multimedia Retrieval</source>, ICMR &#x02032;14 (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>185</fpage>&#x02013;<lpage>192</lpage>. <pub-id pub-id-type="doi">10.1145/2578726.2578751</pub-id></citation></ref>
<ref id="B9">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chi</surname> <given-names>L.</given-names></name> <name><surname>Lim</surname> <given-names>K. H.</given-names></name> <name><surname>Alam</surname> <given-names>N.</given-names></name> <name><surname>Butler</surname> <given-names>C. J.</given-names></name></person-group> (<year>2016</year>). <article-title>Geolocation prediction in twitter using location indicative words and textual features</article-title>, in <source>Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</source> (<publisher-loc>Osaka</publisher-loc>), <fpage>227</fpage>&#x02013;<lpage>234</lpage>.</citation></ref>
<ref id="B10">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Compton</surname> <given-names>R.</given-names></name> <name><surname>Jurgens</surname> <given-names>D.</given-names></name> <name><surname>Allen</surname> <given-names>D.</given-names></name></person-group> (<year>2014</year>). <article-title>Geotagging one hundred million twitter accounts with total variation minimization</article-title>, in <source>2014 IEEE International Conference on Big Data (Big Data)</source> (<publisher-name>IEEE</publisher-name>), <fpage>393</fpage>&#x02013;<lpage>401</lpage>. <pub-id pub-id-type="doi">10.1109/BigData.2014.7004256</pub-id></citation></ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Davis</surname> <given-names>C. A.</given-names> <suffix>Jr.</suffix></name> <name><surname>Pappa</surname> <given-names>G. L.</given-names></name> <name><surname>de Oliveira</surname> <given-names>D. R. R.</given-names></name> <name><surname>de L. Arcanjo</surname> <given-names>F.</given-names></name></person-group> (<year>2011</year>). <article-title>Inferring the location of twitter messages based on user relationships</article-title>. <source>Trans. GIS</source> <volume>15</volume>, <fpage>735</fpage>&#x02013;<lpage>751</lpage>. <pub-id pub-id-type="doi">10.1111/j.1467-9671.2011.01297.x</pub-id></citation></ref>
<ref id="B12">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Eisenstein</surname> <given-names>J.</given-names></name> <name><surname>O&#x00027;Connor</surname> <given-names>B.</given-names></name> <name><surname>Smith</surname> <given-names>N. A.</given-names></name> <name><surname>Xing</surname> <given-names>E. P.</given-names></name></person-group> (<year>2010</year>). <article-title>A latent variable model for geographic lexical variation</article-title>, in <source>Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing</source> (<publisher-loc>Cambridge, MA</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>), <fpage>1277</fpage>&#x02013;<lpage>1287</lpage>.</citation></ref>
<ref id="B13">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Gehring</surname> <given-names>J.</given-names></name> <name><surname>Auli</surname> <given-names>M.</given-names></name> <name><surname>Grangier</surname> <given-names>D.</given-names></name> <name><surname>Yarats</surname> <given-names>D.</given-names></name> <name><surname>Dauphin</surname> <given-names>Y. N.</given-names></name></person-group> (<year>2017</year>). <article-title>Convolutional sequence to sequence learning</article-title>, in <source>Proceedings of the 34th International Conference on Machine Learning-Volume 70</source> (<publisher-loc>Sydney, NSW</publisher-loc>: <publisher-name>JMLR. org</publisher-name>), <fpage>1243</fpage>&#x02013;<lpage>1252</lpage>.</citation></ref>
<ref id="B14">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Han</surname> <given-names>B.</given-names></name> <name><surname>Rahimi</surname> <given-names>A.</given-names></name> <name><surname>Derczynski</surname> <given-names>L.</given-names></name> <name><surname>Baldwin</surname> <given-names>T.</given-names></name></person-group> (<year>2016</year>). <article-title>Twitter geolocation prediction shared task of the 2016 workshop on noisy user-generated text</article-title>, in <source>Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</source> (<publisher-loc>Osaka</publisher-loc>), <fpage>213</fpage>&#x02013;<lpage>217</lpage>.</citation></ref>
<ref id="B15">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>He</surname> <given-names>K.</given-names></name> <name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Ren</surname> <given-names>S.</given-names></name> <name><surname>Sun</surname> <given-names>J.</given-names></name></person-group> (<year>2016</year>). <article-title>Deep residual learning for image recognition</article-title>, in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Las Vegas, NV</publisher-loc>), <fpage>770</fpage>&#x02013;<lpage>778</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2016.90</pub-id></citation></ref>
<ref id="B16">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Huang</surname> <given-names>B.</given-names></name> <name><surname>Carley</surname> <given-names>K. M.</given-names></name></person-group> (<year>2017</year>). <article-title>On predicting geolocation of tweets using convolutional neural networks</article-title>, in <source>International Conference on Social Computing, Behavioral-Cultural Modeling and Prediction and Behavior Representation in Modeling and Simulation</source> (<publisher-loc>Washington, DC</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>281</fpage>&#x02013;<lpage>291</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-60240-0_34</pub-id></citation></ref>
<ref id="B17">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Jayasinghe</surname> <given-names>G.</given-names></name> <name><surname>Jin</surname> <given-names>B.</given-names></name> <name><surname>Mchugh</surname> <given-names>J.</given-names></name> <name><surname>Robinson</surname> <given-names>B.</given-names></name> <name><surname>Wan</surname> <given-names>S.</given-names></name></person-group> (<year>2016</year>). <article-title>Csiro data61 at the wnut geo shared task</article-title>, in <source>Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</source> (<publisher-loc>Osaka</publisher-loc>), <fpage>218</fpage>&#x02013;<lpage>226</lpage>.</citation></ref>
<ref id="B18">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jurgens</surname> <given-names>D.</given-names></name></person-group> (<year>2013</year>). <article-title>That&#x00027;s what friends are for: inferring location in online social media platforms based on social relationships</article-title>. <source>ICWSM</source> <volume>13</volume>, <fpage>273</fpage>&#x02013;<lpage>282</lpage>.</citation></ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jurgens</surname> <given-names>D.</given-names></name> <name><surname>Finethy</surname> <given-names>T.</given-names></name> <name><surname>McCorriston</surname> <given-names>J.</given-names></name> <name><surname>Xu</surname> <given-names>Y. T.</given-names></name> <name><surname>Ruths</surname> <given-names>D.</given-names></name></person-group> (<year>2015</year>). <article-title>Geolocation prediction in twitter using social networks: a critical analysis and review of current practice</article-title>. <source>ICWSM</source> <volume>15</volume>, <fpage>188</fpage>&#x02013;<lpage>197</lpage>.</citation></ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kim</surname> <given-names>Y.</given-names></name></person-group> (<year>2014</year>). <article-title>Convolutional neural networks for sentence classification</article-title>. <source>arXiv</source>. <volume>arXiv</volume>:<fpage>1408.5882</fpage>. <pub-id pub-id-type="doi">10.3115/v1/d14-1181</pub-id></citation></ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kingma</surname> <given-names>D. P.</given-names></name> <name><surname>Ba</surname> <given-names>J.</given-names></name></person-group> (<year>2014</year>). <article-title>Adam: a method for stochastic optimization</article-title>. <source>arXiv</source>. <volume>arXiv</volume>:<fpage>1412.6980</fpage>.</citation></ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kong</surname> <given-names>L.</given-names></name> <name><surname>Liu</surname> <given-names>Z.</given-names></name> <name><surname>Huang</surname> <given-names>Y.</given-names></name></person-group> (<year>2014</year>). <article-title>Spot: locating social media users based on social network context</article-title>. <source>Proc. VLDB Endowm.</source> <volume>7</volume>, <fpage>1681</fpage>&#x02013;<lpage>1684</lpage>. <pub-id pub-id-type="doi">10.14778/2733004.2733060</pub-id></citation></ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lai</surname> <given-names>S.</given-names></name> <name><surname>Xu</surname> <given-names>L.</given-names></name> <name><surname>Liu</surname> <given-names>K.</given-names></name> <name><surname>Zhao</surname> <given-names>J.</given-names></name></person-group> (<year>2015</year>). <article-title>Recurrent convolutional neural networks for text classification</article-title>. <source>AAAI</source> <volume>333</volume>, <fpage>2267</fpage>&#x02013;<lpage>2273</lpage>.</citation></ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lau</surname> <given-names>J. H.</given-names></name> <name><surname>Chi</surname> <given-names>L.</given-names></name> <name><surname>Tran</surname> <given-names>K.-N.</given-names></name> <name><surname>Cohn</surname> <given-names>T.</given-names></name></person-group> (<year>2017</year>). <article-title>End-to-end network for twitter geolocation prediction and hashing</article-title>. <source>arXiv</source>. <volume>arXiv</volume>:<fpage>1710.04802</fpage></citation></ref>
<ref id="B25">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>R.</given-names></name> <name><surname>Lei</surname> <given-names>K. H.</given-names></name> <name><surname>Khadiwala</surname> <given-names>R.</given-names></name> <name><surname>Chang</surname> <given-names>K. C.-C.</given-names></name></person-group> (<year>2012a</year>). <article-title>Tedas: a twitter-based event detection and analysis system</article-title>, in <source>2012 IEEE 28th International Conference on Data Engineering (ICDE)</source> (<publisher-loc>Washington, DC</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1273</fpage>&#x02013;<lpage>1276</lpage>.</citation></ref>
<ref id="B26">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>R.</given-names></name> <name><surname>Wang</surname> <given-names>S.</given-names></name> <name><surname>Chang</surname> <given-names>K. C.-C.</given-names></name></person-group> (<year>2012b</year>). <article-title>Multiple location profiling for users and relationships from social network and content</article-title>. <source>Proc. VLDB Endowm.</source> <volume>5</volume>, <fpage>1603</fpage>&#x02013;<lpage>1614</lpage>.</citation></ref>
<ref id="B27">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>R.</given-names></name> <name><surname>Wang</surname> <given-names>S.</given-names></name> <name><surname>Deng</surname> <given-names>H.</given-names></name> <name><surname>Wang</surname> <given-names>R.</given-names></name> <name><surname>Chang</surname> <given-names>K. C.-C.</given-names></name></person-group> (<year>2012c</year>). <article-title>Towards social user profiling: unified and discriminative influence model for inferring home locations</article-title>, in <source>Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>1023</fpage>&#x02013;<lpage>1031</lpage>.</citation></ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mahmud</surname> <given-names>J.</given-names></name> <name><surname>Nichols</surname> <given-names>J.</given-names></name> <name><surname>Drews</surname> <given-names>C.</given-names></name></person-group> (<year>2014</year>). <article-title>Home location identification of twitter users</article-title>. <source>ACM Trans. Intell. Syst. Technol.</source> <volume>5</volume>:<fpage>47</fpage>. <pub-id pub-id-type="doi">10.1145/2528548</pub-id></citation></ref>
<ref id="B29">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Mikolov</surname> <given-names>T.</given-names></name> <name><surname>Karafi&#x000E1;t</surname> <given-names>M.</given-names></name> <name><surname>Burget</surname> <given-names>L.</given-names></name> <name><surname>&#x0010C;ernock&#x000FD;</surname> <given-names>J.</given-names></name> <name><surname>Khudanpur</surname> <given-names>S.</given-names></name></person-group> (<year>2010</year>). <article-title>Recurrent neural network based language model</article-title>, in <source>Eleventh Annual Conference of the International Speech Communication Association</source> (<publisher-loc>Makuhari</publisher-loc>).</citation></ref>
<ref id="B30">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Miura</surname> <given-names>Y.</given-names></name> <name><surname>Taniguchi</surname> <given-names>M.</given-names></name> <name><surname>Taniguchi</surname> <given-names>T.</given-names></name> <name><surname>Ohkuma</surname> <given-names>T.</given-names></name></person-group> (<year>2016</year>). <article-title>A simple scalable neural networks based model for geolocation prediction in twitter</article-title>, in <source>Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</source> (<publisher-loc>Osaka</publisher-loc>), <fpage>235</fpage>&#x02013;<lpage>239</lpage>.</citation></ref>
<ref id="B31">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Roller</surname> <given-names>S.</given-names></name> <name><surname>Speriosu</surname> <given-names>M.</given-names></name> <name><surname>Rallapalli</surname> <given-names>S.</given-names></name> <name><surname>Wing</surname> <given-names>B.</given-names></name> <name><surname>Baldridge</surname> <given-names>J.</given-names></name></person-group> (<year>2012</year>). <article-title>Supervised text-based geolocation using language models on an adaptive grid</article-title>, in <source>Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning</source> (<publisher-loc>Jeju Island</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>), <fpage>1500</fpage>&#x02013;<lpage>1510</lpage>.</citation></ref>
<ref id="B32">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Rout</surname> <given-names>D.</given-names></name> <name><surname>Bontcheva</surname> <given-names>K.</given-names></name> <name><surname>Preo&#x00163;iuc-Pietro</surname> <given-names>D.</given-names></name> <name><surname>Cohn</surname> <given-names>T.</given-names></name></person-group> (<year>2013</year>). <article-title>Where&#x00027;s &#x00040;wally? a classification approach to geolocating users based on their social ties</article-title>, in <source>Proceedings of the 24th ACM Conference on Hypertext and Social Media</source> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>11</fpage>&#x02013;<lpage>20</lpage>. <pub-id pub-id-type="doi">10.1145/2481492.2481494</pub-id></citation></ref>
<ref id="B33">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sakaki</surname> <given-names>T.</given-names></name> <name><surname>Okazaki</surname> <given-names>M.</given-names></name> <name><surname>Matsuo</surname> <given-names>Y.</given-names></name></person-group> (<year>2010</year>). <article-title>Earthquake shakes twitter users: real-time event detection by social sensors</article-title>, in <source>Proceedings of the 19th International Conference on World Wide Web</source>, WWW &#x00027;10 (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>851</fpage>&#x02013;<lpage>860</lpage>. <pub-id pub-id-type="doi">10.1145/1772690.1772777</pub-id></citation></ref>
<ref id="B34">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sakaki</surname> <given-names>T.</given-names></name> <name><surname>Okazaki</surname> <given-names>M.</given-names></name> <name><surname>Matsuo</surname> <given-names>Y.</given-names></name></person-group> (<year>2013</year>). <article-title>Tweet analysis for real-time event detection and earthquake reporting system development</article-title>. <source>IEEE Trans. Knowled. Data Eng.</source> <volume>25</volume>, <fpage>919</fpage>&#x02013;<lpage>931</lpage>. <pub-id pub-id-type="doi">10.1109/TKDE.2012.29</pub-id></citation></ref>
<ref id="B35">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Savage</surname> <given-names>N. S.</given-names></name> <name><surname>Baranski</surname> <given-names>M.</given-names></name> <name><surname>Chavez</surname> <given-names>N. E.</given-names></name> <name><surname>H&#x000F6;llerer</surname> <given-names>T.</given-names></name></person-group> (<year>2012</year>). <article-title>I&#x00027;m feeling loco: a location based context aware recommendation system</article-title>, in <source>Advances in Location-Based Services</source>, eds <person-group person-group-type="editor"><name><surname>Gartner</surname> <given-names>G.</given-names></name> <name><surname>Ortag</surname> <given-names>F.</given-names></name></person-group> (<publisher-loc>Vienna</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>37</fpage>&#x02013;<lpage>54</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-642-24198-7_3</pub-id></citation></ref>
<ref id="B36">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sloan</surname> <given-names>L.</given-names></name> <name><surname>Morgan</surname> <given-names>J.</given-names></name> <name><surname>Housley</surname> <given-names>W.</given-names></name> <name><surname>Williams</surname> <given-names>M.</given-names></name> <name><surname>Edwards</surname> <given-names>A.</given-names></name> <name><surname>Burnap</surname> <given-names>P.</given-names></name> <etal/></person-group>. (<year>2013</year>). <article-title>Knowing the tweeters: deriving sociologically relevant demographics from twitter</article-title>. <source>Soc. Res. Online</source> <volume>18</volume>, <fpage>1</fpage>&#x02013;<lpage>11</lpage>. <pub-id pub-id-type="doi">10.5153/sro.3001</pub-id></citation></ref>
<ref id="B37">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sundermeyer</surname> <given-names>M.</given-names></name> <name><surname>Schl&#x000FC;ter</surname> <given-names>R.</given-names></name> <name><surname>Ney</surname> <given-names>H.</given-names></name></person-group> (<year>2012</year>). <article-title>Lstm neural networks for language modeling</article-title>, in <source>Thirteenth Annual Conference of the International Speech Communication Association</source> (<publisher-loc>Portland, OR</publisher-loc>).</citation></ref>
<ref id="B38">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Vaswani</surname> <given-names>A.</given-names></name> <name><surname>Shazeer</surname> <given-names>N.</given-names></name> <name><surname>Parmar</surname> <given-names>N.</given-names></name> <name><surname>Uszkoreit</surname> <given-names>J.</given-names></name> <name><surname>Jones</surname> <given-names>L.</given-names></name> <name><surname>Gomez</surname> <given-names>A. N.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>Attention is all you need</article-title>, in <source>Advances in Neural Information Processing Systems</source> (<publisher-loc>Long Beach, CA</publisher-loc>), <fpage>5998</fpage>&#x02013;<lpage>6008</lpage>.</citation></ref>
<ref id="B39">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Vylomova</surname> <given-names>E.</given-names></name> <name><surname>Cohn</surname> <given-names>T.</given-names></name> <name><surname>He</surname> <given-names>X.</given-names></name> <name><surname>Haffari</surname> <given-names>G.</given-names></name></person-group> (<year>2016</year>). <article-title>Word representation models for morphologically rich languages in neural machine translation</article-title>. <source>arXiv</source>. <volume>arXiv</volume>:<fpage>1606.04217</fpage>. <pub-id pub-id-type="doi">10.18653/v1/W17-4115</pub-id></citation></ref>
<ref id="B40">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Watanabe</surname> <given-names>K.</given-names></name> <name><surname>Ochi</surname> <given-names>M.</given-names></name> <name><surname>Okabe</surname> <given-names>M.</given-names></name> <name><surname>Onai</surname> <given-names>R.</given-names></name></person-group> (<year>2011</year>). <article-title>Jasmine: a real-time local-event detection system based on geolocation information propagated to microblogs</article-title>, in <source>Proceedings of the 20th ACM International Conference on Information and Knowledge Management</source> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>2541</fpage>&#x02013;<lpage>2544</lpage>. <pub-id pub-id-type="doi">10.1145/2063576.2064014</pub-id></citation></ref>
<ref id="B41">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Yin</surname> <given-names>H.</given-names></name> <name><surname>Sun</surname> <given-names>Y.</given-names></name> <name><surname>Cui</surname> <given-names>B.</given-names></name> <name><surname>Hu</surname> <given-names>Z.</given-names></name> <name><surname>Chen</surname> <given-names>L.</given-names></name></person-group> (<year>2013</year>). <article-title>Lcars: a location-content-aware recommender system</article-title>, in <source>Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>221</fpage>&#x02013;<lpage>229</lpage>. <pub-id pub-id-type="doi">10.1145/2487575.2487608</pub-id></citation></ref>
<ref id="B42">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Zhao</surname> <given-names>J.</given-names></name> <name><surname>LeCun</surname> <given-names>Y.</given-names></name></person-group> (<year>2015</year>). <article-title>Character-level convolutional networks for text classification</article-title>, in <source>Advances in Neural Information Processing Systems</source> (<publisher-loc>Montreal, QC</publisher-loc>), <fpage>649</fpage>&#x02013;<lpage>657</lpage>.</citation></ref>
<ref id="B43">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>Y.</given-names></name> <name><surname>Yang</surname> <given-names>Q.</given-names></name></person-group> (<year>2017</year>). <article-title>A survey on multi-task learning</article-title>. <source>arXiv</source>. <volume>arXiv</volume>:<fpage>1707.08114</fpage>.</citation></ref>
</ref-list>
<fn-group>
<fn id="fn0001"><p><sup>1</sup><ext-link ext-link-type="uri" xlink:href="https://noisy-text.github.io/2016/geo-shared-task.html">https://noisy-text.github.io/2016/geo-shared-task.html</ext-link></p></fn>
<fn id="fn0002"><p><sup>2</sup><ext-link ext-link-type="uri" xlink:href="https://www.geonames.org/">https://www.geonames.org/</ext-link></p></fn>
</fn-group>
<fn-group>
<fn fn-type="financial-disclosure"><p><bold>Funding.</bold> This work is supported by NSF (IIS-1651203 and IIS-1715385), ARO (W911NF-16-1-0168), and DHS (2017-ST-061-QA0001).</p>
</fn>
</fn-group>
</back>
</article>