Location Prediction for Tweets

Geographic information provides an important insight into many data mining and social media systems. However, users are reluctant to provide such information due to various concerns, such as inconvenience, privacy, etc. In this paper, we aim to develop a deep learning based solution to predict geographic information for tweets. The current approaches bear two major limitations, including (a) hard to model the long term information and (b) hard to explain to the end users what the model learns. To address these issues, our proposed model embraces three key ideas. First, we introduce a multi-head self-attention model for text representation. Second, to further improve the result on informal language, we treat subword as a feature in our model. Lastly, the model is trained jointly with the city and country to incorporate the information coming from different labels. The experiment performed on W-NUT 2016 Geo-tagging shared task shows our proposed model is competitive with the state-of-the-art systems when using accuracy measurement, and in the meanwhile, leading to a better distance measure over the existing approaches.


INTRODUCTION
Nowadays, many technology systems (e.g., a social media platform) emit a variety of digital information, such as texts, times, logs, and so on. Geographic information has been receiving much attention lately. In fact, there are a large amount of applications benefiting from geographic information, ranging from marketing recommendation systems (Bao et al., 2012;Savage et al., 2012;Yin et al., 2013;Cheng and Shen, 2014) to event detection systems (Sakaki et al., 2010(Sakaki et al., , 2013Watanabe et al., 2011;Li et al., 2012a). Although technology that allows the user to share his/her geographic information has matured, many users are reluctant to do so due to various concerns such as inconvenience, privacy and so on. As Sloan et al. (2013) illustrated, only <1% of tweets have a geographic tag attached which in turn limits the growth of related applications. Therefore, researchers have tried to automatically identify the location of the user or post on social media sites. In this paper, we target our location prediction problem on Twitter, which is one of the largest social media sites.
Our proposed model takes three concepts into account, multi-head attention mechanism, subword feature, and joint training technique. The first two parts are proposed for better modeling the text representation and the joint training part is related to the whole architecture of our model. In this work, we mainly focus on using only text information instead of other metadata provided by Twitter since our goal is to develop a generic social media location prediction method, which could be further applied to other platforms (e.g., online news where the user information is usually not available). As a result, we introduce different methods to enhance the text representation part as well as the whole model architecture.
Text representation, however, is one of the most important tasks for Natural Language Processing (NLP) applications. In recent studies with the help of deep learning techniques, many NLP tasks, including text classification problem, question answering problem, sentiment analysis, translation, start from designing a good module for capturing useful and meaningful text information. One of the well-known approach is the Recurrent Neural Network (RNN) based models such as vanilla Recurrent Neural Network (vanilla RNN), Long Short-term Memory Neural Network (LSTM), and Gated Recurrent Unit Network (GRU). Some existing works have shown the RNNbased models' power of handling the language modeling tasks (Bengio et al., 2003;Mikolov et al., 2010;Sundermeyer et al., 2012). In RNN, the model iterates all of the text step-by-step and at the same time propagates the information to the next step to form the sentence representation. RNN has achieved a big success in many applications. However, the RNN-based model usually suffers from the extremely long training time because every word depends on all the previous words which makes it hard to parallelize and accelerate. Another branch of studies focuses on the Convolutional Neural Network (CNN) based models (Kim, 2014). Though CNN was originally proposed to solve problems on images, Kim (2014) successfully introduced it into the NLP field. The idea of CNN is to capture some specific n-gram patterns of the text by using lots of filters with various lengths. However, due to the length limitation of the kernel filters, CNN works better in modeling the local information. Therefore, we instead adopt the multi-head self-attention model (Vaswani et al., 2017) to model the text information. Multi-head self-attention model (Vaswani et al., 2017) utilizes only attention mechanism, yet it enjoys the advantages of both RNN and CNN. That is to say, we can perform parallel computing for all text at the same time, and in the meanwhile, long term information is also encoded.
The subword feature was shown to be very useful for tasks built on social media since people tend to use lots of informal language on social media (Zhang et al., 2015;Vylomova et al., 2016). One simple but common example is the use of "Good" with a various number of "o" which produces words like "Goooooood". Another example is the user-created word such as "Linsanity" which is the combination of "Jeremy Lin" (an NBA player) and "insanity." Therefore, if we start from subword feature such as character, we could potentially infer the subtle meaning of these words. Many applications (Zhang et al., 2015;Vylomova et al., 2016) have already introduced the subword feature and achieved a remarkable result. Therefore, in our task, we also treat subword as an important feature.
Multitask Learning (Caruana, 1997;Zhang and Yang, 2017) is a method to train a learning model with different targets. When applying to multitask learning, the model could learn to extract features that are meaningful for both tasks and thus often lead to a more robust result. In this task, our goal is to identify the city of the given tweet. It is worth noticing that there does exist some relations between different cities. For example, two cities can locate within the same country, share the same time zone, or be closer than a specific distance. We believe using the hierarchical relation between cities could enable the model to learn extra information and thus improve the inference ability. Therefore, we introduce the joint training method into our model in order to take the relation between cities into consideration.
In the remainder of this paper, we will introduce related works in section 2. The problem definition and the detail of our proposed model will be described in sections 3, 4 respectively. In section 5, we will introduce the W-NUT 2016 Geo-tagging task and the in-depth analysis to illustrate the pros and cons of our proposed model.

RELATED WORK
Location predicting has been studied for decades, but most of the work focuses on predicting a user's location. Recently, with the help of deep neural network, analyzing pure text is more feasible and thus researchers start trying to predict the location for a post, such as a tweet. In the following section, we will introduce the related location prediction tasks for user and post respectively.
For location inference of users, one well known approach is to infer the location from a graph structure (Backstrom et al., 2010;Davis et al., 2011;Li et al., 2012b,c;Jurgens, 2013;Rout et al., 2013;Compton et al., 2014;Kong et al., 2014;Jurgens et al., 2015). In these approaches, the main assumption is that friends will be very likely to live in the same location. Therefore, we could predict a user's location based on his relationship to other users. Among these works, Backstrom et al. (2010) is the first one noticing the interaction between geographical information and social relationship. They carefully examined the interaction and proposed a maximum likelihood approach to identify a user's location given the geographic information of the user's friends. Davis et al. (2011) built a following-follower network on Twitter and inferred a user's location based on a voting mechanism with three adjusting parameters. Li et al. (2012c) applied a Gaussian distribution to model a node's (friends or tweets) location as well as its influence scope. This network was then used to predict a user's location by maximizing the probability of building edges between the user and its friends or tweets. Li et al. (2012a) further extended the model to capture the property of a user having multiple related locations such as the home location as well as the college location. Their model is a revised version of Latent Dirichlet Allocation (LDA) model where the latent variables are locations. Rout et al. (2013) formulated the problem as a classification task and solved it by applying Support Vector Machine (SVM) with the features extracted from a Twitter's follower-based network. Jurgens (2013) extended label propagation method with the spatial property. As a semi-supervised learning method, spatial label propagation could iteratively inference all the user's location starting from only a few ground truth data. SPOT (Kong et al., 2014) took the social relation as a continuous feature instead of a binary feature (friends or not) by measuring the social closeness. The authors also introduced a confidence-based iteration method to overcome the data sparsity problem. Compton et al. (2014) formulated the social network geo-location inference task as a convex optimization problem and applied a total variationbased algorithm to solve it. These works rely on the information behind the social network and hence building a user relationship network is inevitable. This becomes a limitation if we want to work on data other than social media.
Another kind of method focuses on predicting using content and metadata provided by the user (Cheng et al., 2010;Eisenstein et al., 2010;Chandra et al., 2011;Roller et al., 2012;Mahmud et al., 2014). In Eisenstein et al. (2010)'s work, they presented a generative model to capture the relation between latent topics and geographical regions as they found that high-level topics such as "sport" and "entertainment" are rendered differently according to different location. Chandra et al. (2011) utilized only the content information but instead of using only a user's tweets, they augmented it with the replied tweets from other users by assuming that a reply tweet would have the same topic as the original tweet. A probability distribution model which could capture the relation between terms and locations was then applied to predict a user's location based on the corresponding augmented tweets set. Mahmud et al. (2014)'s work focused on building a hierarchical classification to integrate tweet contents, different categories of metadata, user's tweeting behaviors, and external location knowledge such as a geographic gazetteer dictionary. They also examined the impact of frequently traveling users and found that these users usually introduce noise into the model. This lead to a conclusion that eliminating frequently traveling users could improve the prediction accuracy. Roller et al. (2012) proposed an information retrieval method where the idea was to build a grid on the earth and then generate reference documents for each grid by selecting the location-related documents from training set. To overcome the problem of uniform grids, they constructed the grid using a k-d tree algorithm to dynamically adapt the grid size of the training data. Cheng et al. (2010)'s work focused on using purely content to predict the user's location with the assumption of location language difference. Although these approaches use mainly the content information, what they used is a bunch of posts provided by the user. As Cheng et al. (2010) and Chandra et al. (2011) revealed, given more posts, the accuracy would improve. This fact also suggests that predicting location for a single post is much difficult than for a user.
The tasks of predicting the location for a post were proposed much recently. After Han et al. (2016) built the dataset from Twitter and then proposed a shared task, researchers started digging into it. There are several approaches proposed in the shared task. Chi et al. (2016) applied a Naive Bayes classifier on many selected features, including location-indicative words, user meta data and so on. CSIRO (Jayasinghe et al., 2016) designed an ensemble method that incorporated heuristics, time zone text classifiers and an information retrieval approach. Miura et al. (2016) proposed a variant version of FastText Model which can take user's meta data into account. After the shared task, Huang and Carley (2017) designed a model with the help of the CNN layer. Lau et al. (2017), on the other hand, proposed the DeepGeo which utilized the characterlevel recurrent convolutional network to further capture the subword feature within the tweets. Most of these works tried to apply the deep learning framework to capture the language difference among the tweets. As we can see, with the help y city , y country True city labels and true country labels.
Predicted city labels and predicted country labels.
of deep learning framework, though we only have limited information in a single post, the result is still improving year by year.

PROBLEM DEFINITION
We use a bold capital letter to represent a matrix (e.g., A) , a bold lowercase letter to represent a vector (e.g., a), and a normal lowercase letter to represent a scalar (e.g., a). Furthermore, a tweet which consists of n words is represented as S = {w 1 , w 2 , · · · , w n } where S is the tweet matrix and w is the word embedding. A tweet could also be represented as a sequence of m characters C = {c 1 , c 2 , · · · c m } where C is the character matrix and c is the character embedding. Other general naming conventions are provided in Table 1.
With the notation provided above, the problem definition could be described as: Given: A tweet and its corresponding representations, S and C. Predict: The label of the given tweet. This could be either y city or y country .

LOCATION PREDICTION FOR TWEETS
In this chapter, we describe our model in several steps. We first introduce the high level architecture of our model. Then describe its separate modules, multi-head self-attention mechanism, subword features, and joint training method.

Model Overview
As shown in Figure 1, the proposed model contains several small modules, but can be mainly separated into two parts including text representation and joint training. The text representation module consists of word representation and character representation. Both of the representation are encoded by multi-head self-attention layer but for character representation, we further use a CNN layer and pooling layer first to reduce the dimension and extract meaningful information.
The word representation and character representation are then concatenated as a vector which represents the given tweet. In the second module, to utilize the relation between cities, we use the same concatenated vector but two different output layers to predict the country and city at the same time. However, the country classification is used only for training. In the testing phase, we use only the city part of the model for prediction.

Multi-Head Self-Attention Mechanism
The multi-head self-attention model is proposed by Vaswani et al. (2017) and is designed for language translation task. Here, we introduce the multi-head self-attention model as a module for text representation.

Self-Attention
Let's start by defining the self-attention layer. In the normal attention mechanism, the model usually takes three inputs, a query Q, a key K, and a value V, where Q and K are used to compute the weights for V. The formal definition (Vaswani et al., 2017) is written as follows: where d is the word embedding dimension. However, in the selfattention layer, all the three inputs are the same matrix S, the text matrix {w 1 , w 2 , · · · , w n }, where S ∈ R n×d , w i ∈ R 1×d , n is the text length, and d is the word embedding dimension. By definition, the self-attention is: However, we can make it clearer by defining it in word level.
The h i below is the transformation of w i by weighted sum over the sentence.
where H ∈ R n×d and h i ∈ R 1×d . h i is computed as follows: α ij is the weight for each w j and is computed by the softmax function with the scaling term √ d: We also illustrate this idea in Figure 2. In this figure, we first use both the top part (green) and the left part (blue) to compute the corresponding weight for each cell. Equation (5) tells us that we need to normalize each row so the sum over each row is one. Then for each row, the new vector is constructed by summing up the vector in the top side (blue) multiplying by the corresponding weight.

Multi-Head Self-Attention
In the above definition, the attention mechanism is performed only once, resulting only a single aspect vector. To equip the model with the power to learn multiple aspects information, a multi-head self-attention mechanism is proposed. In multi-head self-attention mechanism, we first apply a linear transformation W on S, producing S ′ = SW where S ′ ∈ R n×d ′ , S ∈ R n×d , and W ∈ R d×d ′ . Notice that d ′ < d, which means we are reducing the dimension. We then apply the self-attention model on S ′ : The idea of multi-head self-attention mechanism is to perform the above work h times and then concatenate the resulting h vectors together. This gives the model the capability of learning h kinds of information. This functionality is defined as follows if h is set up as d ′ = d h : As we can see, after performing the multi-head self-attention mechanism, the shape of the output matrix M is still R n×d which is the same as the input matrix S. Therefore, we could further apply the residual network ) on the multi-head self-attention model. We revised Equation (7) as follow: The idea of residual network is to add the input vector to the output vector. Since the original model is to learn a mapping function F(x), we change it to F ′ (x) = F(x) − x. The output y = F ′ (x) + x will still be the same but we could compute the gradient from the residual path and then reduce the gradient vanishing problem.

Position-Wise Feed-Forward Network
The position-wise feed-forward network is introduced as the function of fully connected layer after the multi-head selfattention layer. The idea is to apply two linear transformations with a ReLU activation function on the input matrix. The mechanism could be described as follow: where W 1 ∈ R d×4d , W 2 ∈ R 4d×d are transformation matrices and b 1 ∈ R 4d , b 1 ∈ R d are bias vectors. The transformation dimensions are suggested by Vaswani et al. (2017). We also apply the residual network here so Equation (9) will be revised as: where the last term M is identical to the input vector M.

Position Encoding
One important thing that multi-head self-attention model can not model is the position relation between words. In RNN based model, the word order is well preserved since the output vector is calculated step by step. In the CNN based model, the word order is somehow preserved because it works on extracting information from the n-grams. However, the multi-head selfattention model utilizes weighted sum over the sequence of vectors where word order is totally ignored. Therefore, to fix this problem, Vaswani et al. (2017) introduced an idea of injecting the position information into the word vector. As a result, we build a position embedding matrix E position (Gehring et al., 2017) and add the corresponding position vector to the word vector. E position works as a word embedding layer which turns a position into a vector. This could be described as the following Equation: where w i is the i-th word from S and i is the position. By introducing the position embedding, we expect that E position can learn how to represent the position information. For example, one of the vector e position1 in E position could learn the meaning of the first word. If we add e position1 to a word embedding, then the resulting vector should contain the information of the position (first word) and the meaning of the word. The position encoding is applied in the beginning of the whole model. As a result, after obtaining w ′ , we replace S by {w ′ 1 , w ′ 2 , · · · , w ′ n }.

Text Representation
In our model, we stack the multi-head self-attention layer and the position-wise feed-forward network twice. Notice that after passing through these layers, the output is still a matrix of R n×d . In order to get the text representation, we need to reduce the output matrix into a one dimensional vector. As a result, we simply sum up alone the sequence dimension n, producing the text representation as follow: the resulting v text ∈ R d .

Subword Feature
The idea of adding a subword feature is to infer the meaning of the low frequency word. Though the most simple way here is to apply the above model directly to the character level sequence, it will be infeasible since the complexity of multi-head selfattention model is O(n 2 · d), where n is the sequence length. Although the maximum number of characters allowed in Twitter is only 140, it will still cause a huge computation bottleneck in our model. Therefore, we first apply one-dimensional convolutional neural network to extract the n-gram information, and then use maximum pooling over a small window to extract meaningful information and at the same time reduce the sequence length. The detailed procedure is described in the following paragraph. Given the character matrix C, we first apply a convolutional neural network layer to it. Each element of the resulting matrix H conv could be described as: where, * is the convolution operator, k is half of the kernel size, i is the index of the character sequence ranging from 1 to the character length m, and j is the index of the filter ranging from 1 to the number of filters f . After this, we apply the maximum pooling with a window size equal to the kernel size 2k and then slide 2k − 1 to next step. Therefore, the element of the resulting matrix H pool is given by: As we can see, the first dimension reduces from m to m 2k−1 and thus H pool is R m 2k−1 ×f . We then apply the multi-head selfattention model on H pool and get v char ∈ R f as:

Joint Training
The v text and v char are then concatenated as the tweet representation vector.
where v tweet ∈ R d+f . By applying different transformation W city and W country , we could get two different vector v city and v country for prediction.
where W city ∈ R (d+f )×m city , W country ∈ R (d+f )×m country , v city ∈ R m city , v country ∈ R m country , and m city , m country are city size and country size respectively. We then apply the softmax function to get the probability for each city p l city and each country p l country .
where v l city is the l-th element of v city and v l country is the l-th element of v country .
The prediction of city y ′ city and country y ′ country is the label with the highest probability.
y ′ city = argmax l (p 1 city , p 2 city , · · · , p m city city ) y ′ country = argmax l (p 1 country , p 2 country , · · · , p m country country ) In our task, we have two kinds of labels; city and country. As we could see here, some cities are actually in the same country and thus shared some common information. Therefore, we proposed the joint training framework for modeling city and country at the same time. Since we use cross-entropy as our loss function, the joint learning loss function is as follows: where y l city and y l country are binary indicators (0, 1) which will give 1 only if the label is the correct one.

EXPERIMENTS AND RESULTS
In this section, we describe the experiment on the W-NUT 2016 Geo-tagging task 1 and some benchmark approaches for comparison. Different metrics are utilized in this experiment to provide insights from different aspects.

Data
We directly use the geolocation prediction shared task dataset (Han et al., 2016) in our experiment. Though they provide two tasks, predicting locations for tweets and users, we only focus on the tweet prediction part. The dataset is collected from 2013 to 2016 by Twitter Streaming API. Besides, only tweets whose language are identified by Twitter as English are retained. Due to the limitation of Twitter policy, the dataset provides only the ID of the collected tweets instead of the original tweets and the corresponding information. As a result, although the dataset provides 12M and 10k tweets for training and developing, we could only collect about 8M and 8k tweets respectively since users could delete the tweets they posted and thus some tweets are no longer available. However, the testing data containing 10k tweets is shared comprehensively so comparing with the previous benchmark is possible. The detail statistic information is provided in Table 2.

EVALUATION METRICS
There are three metrics adopted in the W-NUT 2016 Geo-tagging task, including one hard metric and two soft metrics. The first way is the classification accuracy over the city prediction. This is regarded as a hard metric because there is no tolerance for a wrong prediction. The distance-based metric, on the other hand, is regarded as a soft metric as it measures the distance between the true value and the predicted value. Therefore, only a wrong prediction with large error will be penalized a lot. Two distancebased metric is utilized, median error distance and mean error distance. Given the evaluation result R = d 1 , d 2 , · · · , d n , where d n is the error distance (kilometers) between the predicted and 1 https://noisy-text.github.io/2016/geo-shared-task.html  In the pre-processing step, words that appear less then the min word frequency times will be turned into <UNK> token. Max length means the maximum sequence length so exceeding words will be removed. If the text sequence doesn't meet the max length, we will pad <nan> in front of the text.
the standard geographic coordinate, median error distance and mean error distance are computed as follows:

Benchmark Model
Several benchmarks are selected for comparison. Since the original dataset provides metadata like user specified location description, timezone, self-introduction, and so on, most of the work utilizes all of these information in their model. However, our proposed model focuses on modeling the information behind the content itself. As a result, we implemented the text-content-only version for these models by removing the modules or layers that deal with the metadata. In the following section, we describe several benchmarks and how we remove the metadata-related modules.  Chi et al. (2016) where Acc 2 is not available. + Though the reported accuracy of Acc 1 in Lau et al. (2017) is 0.217, our experiment gives only 0.202. The results with metadata are provided in the original paper and therefore some numbers are missing. The bold values mean the best performances for each column in the "without metadata" setting.
In the recurrent convolutional network, the character matrix is passed through a bi-directional LSTM layer, producing a hidden state matrix. The hidden state matrix is then passed through a CNN layer followed by a max-over-time pooling layer to generate the subword features. After acquiring the subword features, a attention network is applied to merge the subword feature matrices into a single vector. In addition to the text representation module, deepgeo introduce a RBF network for modeling the time-related features, such as Tweet creation time and account creation time. All of these vectors including the text representation and meta feature are then concatenated and go through two dense layers for classification. To understand the model's ability of handling the text information, we remove the layers other than the characterlevel RCNN.

FUJIXEROX
This approach is proposed by FUJIXEROX (Miura et al., 2016), one of the participated team in W-NUT 2016 Geo-tagging task. This model is a variant version of the original FastText Model (Bojanowski et al., 2016). The idea is to represent a word by the sum of its n-gram embeddings. Therefore, for a outof-vocabulary word, the model could still inference its word vector according to the subword features (n-gram). FUJIXEROX applied FastText model on not only the tweet text but also the user specified location and the user profile description. The three feature vectors and the time zone embedding vector are then concatenated then passed into a dense layer for prediction. To have it use only the text information from tweets, the metadata features are removed and the resulting model is a supervised FastText model.

CNN Model
A CNN-based model is provided by Huang and Carley (2017). Their approach is to use a CNN layer (Kim, 2014) for modeling the tweet content, the user profile description, the user specified location, and the user name. Then, these four vectors are concatenated with four one-hot vectors, tweet language, user language, time zone and the tweet creation time. The concatenated vector is then passed through a dense layer and form a classifier. Unlike the previous two approaches, this task is performed on a self-built dataset. Therefore, we implemented this approach for comparison. The model after removing the metadata feature is actually a CNN model.

CSIRO
Jayasinghe et al. (2016) utilize ensemble approaches to overcome the weakness of each component. They also handle many kinds of metadata and integrate them with external information like gazetteer, IP-Lookup table, and so on. They then apply these features to label propagation approach, information retrieval approach, and text classification approach. By examining different ensemble strategies, they found that the full cascade one outperforms the other strategies. As their approach heavily relies on the metadata, we only list it as a reference.

Naive Bayes Methods
This method is proposed by Chi et al. (2016) with the use of naive Bayes methods on many selected features. However, only features extracted from text data are considered such as location-indicative words, hashtags and so on.

Experiment Setting
The parameters used for our model are listed in the Table 3. If we combine a multi-head self-attention layer and a feed-forward layer as an attention layer, then the stack number 2 means we stack two attention layers and produce a series of layers as multihead self-attention, feed-forward, multi-head self-attention, feedforward. We use Adam (Kingma and Ba, 2014) for optimization. The model is trained for 10 epochs and then the one with the best validation result is kept for testing. To compute the distance error, we didn't use the model to predict latitude and longitude. However, we map the predicted city into the corresponding latitude and longitude and then take it as our prediction for the geographic coordinate. For example, if the predicted city label is "los angeles-ca037-us," we search on GeoNames 2 by the query "Los Angeles US" (city name and country name). The geographic coordinate (N 34 • 3 ′ 8 ′′ , W 118 • 14 ′ 37 ′′ ) is used as the predicted result.

Result
The results are listed in Table 4. We separate the result into two sections because we mainly focus on the setting without metadata. Within this setting, our model outperforms all the other models according to Acc 1 , the accuracy of the city. FUJIXEROX's fastText model performs relatively well in both of the distance measurements but our proposed approach is competitive. For the rest of the methods, Proposed Method 1 outperforms DeepGeo 8.98% and 22.07%, CNN 14.35% and FIGURE 3 | (A,B) The attention weight matrix of our model. <nan> is the padding term since the word length is <30. In this case, DeepGeo, CNN and the proposed method correctly predict moncton-04-ca but FUJIXEROX predicts los angeles-ca037-us.

FIGURE 4 | (A,B)
The attention weight matrix of our model. <nan> is the padding term since the word length is <30. In this case, only our proposed method correctly predict indianapolis-in097-us. DeepGeo, CNN, and FUJIXEROX predict city of london-enggla-gb, chicago-il031-us, and toronto-08-ca respectively.
27.49%, Naive Bayes 18.08% and 43.09% in mean error distance and median error distance respectively. This phenomenon suggests that our proposed method could better capture the location relation. To better understand the behavior of the model, we try to examine the country-wise prediction. Here, we turn a city label into a country label by extracting the country from the city label. For example, the country label for "los angeles-ca037-us" is "us." We then compute the accuracy and report it also in Table 4 as Acc 2 . As we can see, there is no huge difference between Acc 2 which suggests that our proposed method gives a closer city prediction within the same country. Let's move to the setting with metadata. The results are reported by their papers so part of the table is missing. It is, however, easy to see that the results of using metadata improve a lot. This is foreseeable since some of the metadata provide very strong information. For instance, the timezone feature basically acts as a geographic constraint. Not to mention that some users explicitly reveal their home location in the profile location description which becomes another metadata. As we state before, we focus on extracting the information The attention weight matrix of our model. <nan> is the padding term since the word length is <30. In this case, the correct city is kisumu-07-ke but DeepGeo, CNN, FUJIXEROX, and the proposed method predict lagos-05-ng, lagos-05-ng, quezon city-ncrf2-ph, and kano-29-ng respectively.

FIGURE 6 | (A,B)
The attention weight matrix of our model. <nan> is the padding term since the word length is <30. In this case, DeepGeo, CNN, and FUJIXEROX correctly predict salt lake city-ut035-us but our proposed method predicts atlanta-ga121-us.
from pure text content, so it is reasonable for us to ignore the metadata. When comparing different settings of our proposed method, we can see that Proposed Method 1 performs the best. In Proposed Method 2 , where we set the head number to 1 and get the single-head self-attention model, the performance generally decrease in both Acc 1 and the distance measurements. However, when reducing the head number to one, the training time also reduces a lot. In Proposed Method 3 , we tried to train the model without the country label. As we could see, the distance measurements increase, especially the mean error distance. This means that using country label could really help the model learn the geographic relation between cities.

Analysis
In this section, we analyze some cases by printing the attention weight matrix. We only focus on the word representation module of the Proposed Method 2 since the multi-head attention model have several attention weight matrices and thus is hard to illustrate. Also, as the character representation module contains the CNN layer and pooling layer, it is hard to understand which subword feature is kept in the attention layer. Therefore, we focus on analyzing the word representation module to see what the Proposed Method 2 learned.
In Figure 3, we can see there are two attention weight matrices because we have two stacked self-attention layers in our model. In the figure, each row represents a set of weights to construct the new vector. For example, in the last row where the word is "morning," only "all" as well as "moncton" have higher weights. As we can see in both first layer and second layer, "moncton" get a very high weight in most of the words. Since this word reveals the location directly, DeepGeo, CNN and our proposed model give the correct prediction. FUJIXEROX, on the other hand, predicts los angeles-ca037-us and fails to give the correct label. Notice that in the first layer (Figure 3A), the weights of the first <nan> distribute evenly over the words. As a result, after the first layer, the vector of <nan> could be seen as a sentence representation. This is why in the second layer (Figure 3B), lots of the words also highly attend on the first column (<nan>).
In Figure 4, since the topic is about basketball, "crean" then becomes a very important word. Actually, "crean" stands for a basketball coach, Tom Crean, in Indiana University. In the first layer, we could see that both "turnovers" and "crean" get a high weight meaning that the model successfully capture the relation between a basketball term "turnovers" and a person name "crean." In the second layer, <nan> and "crean" get high weights. Notice that the vector of <nan> could also be seen as a sentence representation. As a conclusion for this case, our proposed model successfully captures the hidden relations and gives a correct prediction but DeepGeo, CNN, and FUJIXEROX all fail in this case.
In Figure 5, all the four models fail to recognize the location correctly, since this post does not give any useful information. We could find that in the first layer, all the weights are similar. Also, in the second layer, all of the word attends on the first column (<nan>) where the vector does not contain any useful information. This means that our model could not find any meaningful and helpful information for prediction.
In Figure 6, both DeepGeo, CNN, and FUJIXEROX predict the correct city, salt lake city-ut035-us, but our model predicts atlanta-ga121-us. When investigating the weight matrix, we can find that both "utah" and "atlanta" get high attention which somehow represents the two label salt lake city-ut035us and atlanta-ga121-us respectively. This gives a controversial information to our model and thus in the end our model predicts the wrong label. In conclusion, our proposed model fails to capture the semantic meaning of "leaving." The above four cases give us a brief understanding of the behavior of our proposed model. Our model could capture the hidden relations between different terms. However, it is still suffering from understanding the semantic meaning of words so it gives wrong predictions sometimes. Generally, the information captured by the model is easy to understand and quite meaningful.

CONCLUSIONS
In this paper, we have proposed a new deep learning model to predict location for tweets. Our model integrates three key concepts, including multi-head self-attention mechanism, subword feature, and joint training technique with country label. The experiment on W-NUT geo-tagging task shows our model is competitive or better than the state-of-the-art methods w.r.t. different measurements. The analysis on attention weight matrix also illustrates that our model can capture the hidden relations between different words. In the future, we will further consider the semantic information of the sentences to better capture the meaning of the tweet.

AUTHOR CONTRIBUTIONS
C-YH designed the model, performed the experiments, analyzed the data, and wrote the paper. HT gave suggestions for the model as well as the experiments and wrote the paper. JH and RM contributed in the early discussion and the problem identification.