ORIGINAL RESEARCH article

Front. Phys., 08 November 2022
Sec. Interdisciplinary Physics
Volume 10 - 2022 | https://doi.org/10.3389/fphy.2022.1021176

Generation of individual daily trajectories by GPT-2

  • 1National Institute of Informatics, Tokyo, Japan
  • 2Department of Economic Informatics, Kanazawa Gakuin University, Kanazawa, Japan

We propose a new method to convert individual daily trajectories into token time series by applying the tokenizer "SentencePiece" to a geographic space divided using the Japan regional grid code "JIS X0410." Furthermore, we build a highly accurate generator of individual daily trajectories by learning the token time series with the neural language model GPT-2. The model-generated individual daily trajectories reproduce five realistic properties: 1) the distribution of the hourly moving distance of the trajectories has a fat tail that follows a logarithmic function, 2) the autocorrelation function of the moving distance exhibits short-time memory, 3) in long-distance travel, the directions of consecutive hourly moves are positively correlated, 4) the final location is often near the initial location in each individual daily trajectory, and 5) the diffusion of people depends on the time scale of their moving.

1 Introduction

Big data on individual daily trajectories is important for addressing issues involving disasters, terrorism, public safety, infectious diseases, spatial segregation, marketing, and traffic congestion. By analyzing big data on human mobility, we can detect the causes of traffic congestion [1] and find efficient traffic control strategies to balance economic activity with infection control [2]; [3]; [4]. We are also able to monitor the evacuation of people in natural disasters and mass protests through telecommunication providers [5]; [6]. By developing models that satisfy the statistical properties of trajectories, we can simulate changes in urban mobility in the presence of new infrastructure, the spread of epidemics, terrorist attacks, and international events such as an Expo [7]; [8]; [9,10]. In addition, generative models are valuable for protecting the geo-privacy of trajectory data [11]; [12]; [13]. While it is difficult to control the trade-off between uncertainty and utility when disclosing real data, synthetic trajectories that preserve statistical properties have the potential to achieve performance comparable to real data on multiple tasks.

The modeling of human mobility can be classified into four types [14]. The first is the Trajectory Generation model, which generates realistic individual spatial-temporal trajectories [15]; [16]; [17]; [18]; [19]. The purpose of this model is to generate realistic individual trajectories for ordinary and extraordinary days. This model is also required to reproduce trajectories from home to destination and from destination back to home. The second type is the Flow Generation model, which generates realistic Origin-Destination matrices [20]; [21]; [22]. This model is often used to find the relationships between POIs (Points of Interest) and human mobility networks. The third is the Next-Location Prediction model, which predicts an individual’s future location [23]; [24]; [25]; [26]; [27]; [28]; [29]; [30]; [31]; [32]; [33]. Models of this type take weather, transportation, and other factors as inputs to capture the spatiotemporal patterns that characterize human habits. The fourth type is the Crowd Flow Prediction model, which predicts aggregated in/out crowd flows [34]; [35]; [36]; [37]; [38]; [39]; [40]; [41]; [42]; [43]; [44]; [45]; [46]; [47]. This type is used to understand the relationship between external factors, such as weather, weekly and daily cycles, and events (e.g., festivals), and the flow network structure. Our research falls into the first type, Trajectory Generation modeling, but the resulting model can also perform Next-Location prediction.

Both physics and machine learning approaches have been taken to develop generative models. The physics approach includes the gravity model, the preferential selection model, the Markov chain, and the autoregressive model (e.g., ARIMA) [48]; [49]. While these models are simple and intuitive, they have limitations in generating realistic individual trajectories. On the other hand, the machine learning approach includes language models and autoregressive-type neural networks [14]; [50]. This approach generates highly realistic individual trajectories by building complex models with many parameters. In this study, we build a model to generate individual daily trajectories using GPT-2 [51], one of the Transformer models that are becoming successful alternatives to recurrent neural networks in natural language generation. The model takes the initial locations in the morning (e.g., around the home) as input and outputs the individual daily trajectory (e.g., the coordinates of the route taken by public transportation to a sightseeing spot, sightseeing and eating, and then returning home).

To apply language models such as GPT-2 to individual trajectories, we need to index the locations as words [14]. We utilize the Japan regional grid code “JIS X0410” for location indexing [52]. This code consists of several subcodes. The first-level subcode represents the absolute location of each grid, where geographic space is divided into squares with a latitude difference of 40 min and a longitude difference of 1 degree. Each grid is divided recursively until the desired resolution is achieved. The second-level and higher subcodes represent relative locations within a divided grid. We do not need a huge number of unique subcodes, even when the geographic space is large and the resolution is high. We can index many locations with subcode combinations. The grid subcodes, codes, and trajectories (i.e., grid code time series) correspond to characters, words, and sentences in natural language.

In language models such as GPT-2, the introduction of subwords between words and letters, such as "un" and "ing", increases the accuracy of text generation. Words, subwords, and characters are called tokens, and the process of identifying tokens in text is called tokenization. Using tokenizers such as "SentencePiece" [53]; [54], we can find frequent combinations of subcodes, i.e., frequently occurring substrings, in individual daily trajectories expressed by "JIS X0410." To apply GPT-2 to individual trajectories, we propose a new method to convert individual daily trajectories into token time series by applying the tokenizer "SentencePiece".

Trajectory generation requires capturing the temporal and spatial patterns of individual human movements simultaneously. A realistic generative model should reproduce the tendency of individuals to move preferentially within short distances [55]; [56], the heterogeneity of characteristic distances [55]; [56] and their scales [57], the tendency of individuals to split into returners and explorers [58], the routinary and predictable nature of human displacement [59], and the fact that individuals visit a number of locations that are constant in time [60].

Subsequent sections are organized as follows. Section 2 introduces the big data on individual daily trajectories used for training the model. Section 3 describes the proposed methods: a geospatial tokenizer based on SentencePiece, the GPT-2 individual daily trajectory generator, and its comparison models. Section 4 presents the results: we show the statistical spatial-temporal properties that the model-generated individual daily trajectories must satisfy and discuss the accuracy of the models in predicting an individual’s future location. Section 5 offers our conclusions.

2 Data

We used minute-order location data (280 million logs) from a total of 1.7 million smartphones (about 28,000 per day) that passed through the Kyoto Station area (Shimogyo-ku, Kyoto) in November 2021 and January 2022, provided by Agoop Corp [61]. Kyoto is one of the most famous tourist destinations in Japan, and many people from all over Japan visit Kyoto for sightseeing. Location information includes latitude and longitude. GPS accuracy depends on the smartphone model and the communication environment, but it is usually within 20 m. We coarsened each trajectory to 250-m spatial and 30-min temporal resolution using a sliding 1-min window. This sliding window converts 1.7 million trajectories with a 1-min time resolution into 51 (= 1.7 × 30) million trajectories with a 30-min time resolution. By removing the home grid for each user, we protected geo-privacy and focused only on the trajectory when the user is away from home. The total number of daily trajectories of individuals who were out of their homes for over 10 h is 8.4 million time series.
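
To make the coarsening step concrete, the following is a minimal sketch (our illustration, not the authors' code) of how a sliding 1-min window turns one 1-min-resolution trajectory into 30 trajectories at 30-min resolution; the DataFrame layout with a datetime "timestamp" column is an assumption.

```python
import pandas as pd

def coarsen_trajectory(df: pd.DataFrame, step_min: int = 30) -> list:
    """Split one 1-min-resolution trajectory into `step_min` trajectories
    at `step_min`-minute resolution via a sliding 1-min window.
    Assumes columns: 'timestamp' (datetime64), 'lat', 'lon'."""
    df = df.sort_values("timestamp").reset_index(drop=True)
    minute = df["timestamp"].dt.hour * 60 + df["timestamp"].dt.minute
    # Offset k keeps the samples recorded at minutes k, k+30, k+60, ...
    return [df[minute % step_min == k] for k in range(step_min)]
```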

We indexed each location at 250-m grid resolution using the Japan regional grid code "JIS X0410" (see Appendix A for the definition) [52]. The region analysed in this paper is Japan, but if a region outside of Japan were targeted, the extended JIS X0410 [62] would be used. The Japan regional grid is a code assigned when subdividing the Japanese landscape into rectangular subregions by latitude and longitude. A grid code is represented by a combination of five subcodes, such as "5235/36/80/2/3". The first-level subcode (e.g., 5235) is a four-digit number representing a unique location enclosed by a square with a 40-min difference in latitude and a 1-degree difference in longitude. All land areas of Japan can be represented using 176 first-level subcodes, which cover the whole country. The second-level subcode (e.g., 36) is a two-digit number indicating the area created by dividing the first-level grid into eight equal parts in the latitudinal and longitudinal directions. The third-level subcode (e.g., 80) is a two-digit number describing the area obtained by dividing the second-level grid into ten equal parts in the latitudinal and longitudinal directions. The fourth-level subcode (e.g., 2) bisects the third-level grid by latitude and longitude. The fifth-level subcode (e.g., 3) bisects the fourth-level grid by latitude and longitude. The length of one side is then about 250 m. In total, about 18 million unique grid codes on land in Japan, at a resolution of 250 m, can be represented by combinations of only 348 subcodes from the first to the fifth level.

Travel from Kyoto Station "5235368023" to Kiyomizu Temple "5235369224" can be described as a subcode time series, such as "5235/36/80/2/3/_/5235/36/92/2/4/". We apply the language model GPT-2 to the generation of individual trajectories by mapping regional grid codes to words, subcodes to characters, and trajectories represented by grid time series to sentences.
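
As an illustration of this encoding, the sketch below (ours, written from the grid definitions above and in Appendix A) computes a fifth-level grid code from WGS84 latitude and longitude; it reproduces the Kyoto Station example "5235368023" for approximate station coordinates.

```python
def grid_code(lat: float, lon: float) -> str:
    """Encode WGS84 latitude/longitude into a fifth-level (250 m)
    JIS X0410 regional grid code."""
    # First level: 40-min latitude x 1-degree longitude squares.
    p, u = int(lat * 1.5), int(lon) - 100
    lat_min = lat * 60 - p * 40          # residual latitude in minutes
    lon_min = (lon - int(lon)) * 60      # residual longitude in minutes
    # Second level: 8 x 8 division (5-min x 7.5-min cells, ~10 km).
    q, v = int(lat_min / 5), int(lon_min / 7.5)
    lat_min -= q * 5; lon_min -= v * 7.5
    # Third level: 10 x 10 division (0.5-min x 0.75-min cells, ~1 km).
    r, w = int(lat_min / 0.5), int(lon_min / 0.75)
    lat_min -= r * 0.5; lon_min -= w * 0.75
    code = f"{p:02d}{u:02d}{q}{v}{r}{w}"
    # Fourth and fifth levels: successive bisections (SW=1, SE=2, NW=3, NE=4).
    cell_lat, cell_lon = 0.5, 0.75
    for _ in range(2):
        cell_lat /= 2; cell_lon /= 2
        i, j = int(lat_min / cell_lat), int(lon_min / cell_lon)
        code += str(2 * i + j + 1)
        lat_min -= i * cell_lat; lon_min -= j * cell_lon
    return code

print(grid_code(34.9855, 135.7588))  # -> 5235368023 (Kyoto Station example above)
```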

3 Methods

First, to apply GPT-2 to individual trajectories, we build a geospatial tokenizer to identify tokens derived from the individual trajectories expressed by grid codes. Next, we introduce four models that generate individual daily trajectories: GPT-2, 2-gram, 3-gram, and Multi-Output CatBoost.

3.1 Geospatial tokenizer

Tokenization is a way of separating a piece of text into smaller units called tokens. Tokens can be words, characters, or subwords. Tokenization is essential for a language model to efficiently learn the structure of a natural language from a given text. For example, the Oxford English Dictionary contains approximately 600,000 English words. Consider the case in which all tokens are words. Statistically estimating the probability of a word w_j occurring after a word w_i from a given finite text is difficult because of the combinatorial explosion of word combinations. In particular, it is nearly impossible to estimate the probability of rare word combinations. One way to solve this problem is to introduce subwords, the units into which words decompose in natural language processing. Words are often composed of subwords, such as "un-relax", "relax", "relax-es", "relax-ed", "relax-ing", and "un-relax-ed". Rare words often consist of a combination of common subwords. By setting the subwords as tokens, we can often statistically estimate the probability that a sentence containing the rare word w_j will occur, based on the subword-combination probability.

One tokenizer that automatically identifies subwords in a given text is SentencePiece [53]; [54]. SentencePiece treats the concatenation c_i c_j of strings c_i and c_j as a subword if the joint probability p(c_i c_j) is statistically significantly higher than the product of the individual probabilities p(c_i)p(c_j). This method finds subwords such as "un", "es", "ed", "ing", etc. Tokens in SentencePiece are subwords and characters, and this tokenizer decomposes text into a minimum number of tokens.

We apply SentencePiece to individual trajectories represented by grid time series. First, as a technical step, we make a "grid subcode from/to byte-character translation map" by assigning a unique byte character to each subcode. For example, the subcodes 5235/36/36/2/3/ are converted to the byte characters ß/İ/ě/A/1. Note that we assign different byte characters to "36" in the second-level subcode and "36" in the third-level subcode because they have different meanings. Then, using SentencePiece, we set all 348 byte characters (first level: 176, second level: 64, third level: 100, fourth level: 4, fifth level: 4) and frequent byte-character combinations (i.e., frequent byte subwords) as tokens for the Japanese land area until the total reaches 50,000 tokens. Such an algorithm for finding byte subwords is called "Byte Pair Encoding". We used 42 GB of GPU memory for 50,000 tokens; the maximum number of tokens could be increased depending on available GPU memory. Detecting values of p(c_i c_j) that are statistically significantly higher than p(c_i)p(c_j) depends on the sample size. Densely populated areas are visited by many people, so samples concentrate there. Hence, for trajectories through densely populated areas, tokens consisting of byte subwords are frequently chosen.

Finally, we add a comma token "," at each temporary return home and a period token "." at the last return home of each day. This geospatial tokenizer based on SentencePiece transforms a grid code time series into a token time series as follows.

Grid code time series: …_/5235149412/_/5235034923/_/5235030422/…


to


Grid subcode time series: …_/5235/14/94/1/2/_/5235/03/49/2/3/_/5235/03/04/2/2/…


to


Byte-character time series: _/ß//ί/C/3/_/ß/Ĵ/Å/A/1/_/ß/Ĵ/Ė/A/3/


to


Token time series: /_ßίC/3/_ßĴÅA/1/_ßĴĖA/3/
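
This training can be reproduced, under assumptions, with the SentencePiece Python API; the sketch below trains a BPE model on byte-character trajectory "sentences" (one daily trajectory per line). The input file name, model prefix, and sample string are hypothetical; the 50,000-token vocabulary and BPE model type follow the description above.

```python
import sentencepiece as spm

# Train a BPE tokenizer on byte-character trajectories, one per line
# (the input file name and model prefix are hypothetical).
spm.SentencePieceTrainer.train(
    input="trajectories_bytechars.txt",
    model_prefix="geo_tokenizer",
    vocab_size=50000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="geo_tokenizer.model")
print(sp.encode("_/ß/Ĵ/Å/A/1/_/ß/Ĵ/Ė/A/3/", out_type=str))  # token strings
print(sp.encode("_/ß/Ĵ/Å/A/1/_/ß/Ĵ/Ė/A/3/"))                # token ids
```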

3.2 Individual daily trajectory generator

We randomly split the 8.4 million individual daily trajectories described in the Data section 4:1, under the constraint that no user’s trajectories appear in both parts. We use 4/5 to build the machine learning models and the remaining 1/5 to compare statistical properties and prediction accuracy between the original and model-generated trajectories. The split is common to all models.
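
A minimal sketch of such a user-grouped split (our illustration; scikit-learn’s GroupShuffleSplit is one way to enforce that no user appears in both parts, and the variable contents are toy stand-ins):

```python
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-ins for the 8.4 million daily trajectories and their owners.
trajectories = [["5235368023"], ["5235369224"], ["5235043342"],
                ["5235154531"], ["5235044321"]]
user_ids = ["u1", "u1", "u2", "u3", "u3"]

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(trajectories, groups=user_ids))
train = [trajectories[i] for i in train_idx]  # 4/5 for model building
test = [trajectories[i] for i in test_idx]    # 1/5 for evaluation
```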

GPT-2 is a Transformer-based deep neural network with multiple Transformer layers, each consisting of self-attention and projection sublayers. Another well-known Transformer model is BERT [63]. By using attention in place of earlier recurrence- and convolution-based architectures in natural language generation tasks, Transformer models are becoming successful alternatives to RNNs (Recurrent Neural Networks) and CNNs (Convolutional Neural Networks). GPT-2 is an autoregressive neural network that sequentially predicts the next token from the previous tokens, i.e., the next location from the past locations, by attending only to the input tokens preceding the position being processed in the Transformer layers. Using the geospatial tokenizer, we convert the grid codes obtained from the input location coordinates (or the input grid codes) into byte characters according to the translation map and then tokenize those byte characters. Given these tokens as input, GPT-2 recursively generates the next tokens. The generated tokens are then converted back into grid codes according to the translation map. Through these processes, an individual daily trajectory is generated (see Appendix B for a concrete example). In this paper, we use GPT-2 SMALL proposed by OpenAI, which consists of 12 attention heads and 12 Transformer layers, with 768 dimensions for the embedding and hidden states [51]; [64]. The other hyperparameters are set to their default values. Training takes about 90 h on one NVIDIA RTX A6000. Figure 1 shows the training and validation losses for each iteration, using the cross-entropy loss function. Training and validation losses at epoch = 10 are 1.74 and 1.98, respectively.

FIGURE 1. (□) Training and (⧫) validation losses of GPT-2.
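
For orientation, here is a minimal sketch (assuming the Hugging Face transformers library [64]; the token ids are hypothetical) of a GPT-2 SMALL sized model over the 50,000-token geospatial vocabulary and of greedy next-token generation as used in Section 4.1; the actual training loop is omitted.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# GPT-2 SMALL geometry over the 50,000-token geospatial vocabulary.
config = GPT2Config(
    vocab_size=50000,          # geospatial tokens from the tokenizer
    n_positions=1024,          # default context length (see Section 4.1)
    n_embd=768, n_layer=12, n_head=12,
)
model = GPT2LMHeadModel(config)

# Greedy decoding: recursively append the most probable next token.
input_ids = torch.tensor([[101, 102, 103, 104, 105]])  # hypothetical initial tokens
generated = model.generate(input_ids, max_length=40, do_sample=False)
```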

We introduce three non-neural-network models to compare with GPT-2 on the accuracy of generating individual daily trajectories. The first is a 2-gram model described by a first-order Markov chain as follows:

$$\Pr(X_t = x \mid X_{t-1} = x_{t-1}, \ldots, X_1 = x_1, X_0 = x_0) = \Pr(X_t = x \mid X_{t-1} = x_{t-1}), \tag{1}$$

where x_t is the grid code of the location that a user visited at time t. The second model is a 3-gram model described by a second-order Markov chain as follows:

$$\Pr(X_t = x \mid X_{t-1} = x_{t-1}, \ldots, X_1 = x_1, X_0 = x_0) = \Pr(X_t = x \mid X_{t-1} = x_{t-1}, X_{t-2} = x_{t-2}). \tag{2}$$

Conditional probabilities in these n-gram models are estimated from combinations of grid codes that occur at least 30 times in the training texts.
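
A minimal sketch (ours) of estimating the 2-gram transition probabilities of Eq. 1 with the 30-occurrence threshold; the 3-gram case is analogous, with (x_{t-2}, x_{t-1}) pairs as keys.

```python
from collections import Counter, defaultdict

def fit_bigram(trajectories, min_count=30):
    """First-order Markov (2-gram) transition probabilities over grid codes,
    keeping only grid-code combinations observed at least `min_count` times."""
    counts = defaultdict(Counter)
    for traj in trajectories:
        for prev, nxt in zip(traj, traj[1:]):
            counts[prev][nxt] += 1
    probs = {}
    for prev, nxts in counts.items():
        kept = {g: c for g, c in nxts.items() if c >= min_count}
        total = sum(kept.values())
        if total:
            probs[prev] = {g: c / total for g, c in kept.items()}
    return probs
```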

The third model is Multi-Output CatBoost, a multi-output regression-tree model with gradient boosting [65]; [66]; [67]. Multi-Output CatBoost is an extension of supervised machine learning with decision trees. In multi-output regression with decision trees, the multidimensional space of the explanatory variables is partitioned by the trees, and a regression model is constructed to predict representative values, such as the average of the objective variables, in each partition. Learning minimizes a loss function such as the Multi Root Mean Squared Error (MultiRMSE) on the training data. In this study, we build Multi-Output CatBoost models that predict the location vector v_t = (long_t, lat_t), defined by longitude and latitude at time t on a given day, from the t location vectors from time 0 to time t − 1 as follows:

$$\mathbf{v}_t^{*} = f(\mathbf{v}_{t-1}, \ldots, \mathbf{v}_1, \mathbf{v}_0), \tag{3}$$

where v_t^* is the location vector predicted by the model. In this study, the unit of time resolution is 30 min, and the maximum of t is 20. The five locations covering the initial 2.5 h of each individual trajectory are given as input; that is, the range of t in the prediction is 5 ≤ t ≤ 20. We therefore use 16 (= 20 − 5 + 1) Multi-Output CatBoost models, one per prediction time, built with the official CatBoost Python package [67]. In learning, we used the default hyperparameters: MultiRMSE as the loss function, a maximum of 1,000 trees, and a tree depth of 6. The other hyperparameters are also set to their default values. The final training and validation losses for the model predicting the location at t = 5 are 0.2638 and 0.2662, respectively; at t = 20, the losses are 0.222 and 0.219.
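
A minimal sketch (ours, with toy data in place of the real coordinates) of one of the 16 horizon-specific models, using the CatBoost package’s MultiRMSE loss:

```python
import numpy as np
from catboost import CatBoostRegressor

# Toy data: predict v_5 = (long_5, lat_5) from the ten flattened
# coordinates of v_0 ... v_4 (cf. Eq. 3).
rng = np.random.default_rng(0)
X = rng.random((1000, 10))   # 5 input locations x (long, lat)
y = rng.random((1000, 2))    # target (long_5, lat_5)

model = CatBoostRegressor(
    loss_function="MultiRMSE",  # multi-output RMSE, as in the text
    iterations=1000,            # default maximum number of trees
    depth=6,                    # default tree depth
    verbose=False,
)
model.fit(X, y)
v5_pred = model.predict(X[:1])  # predicted (long, lat) at t = 5
```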

4 Results

First, we plot a typical example of the trajectories generated by each model on a map to build intuition about their characteristics. Next, we statistically clarify the similarities and differences between the characteristics of the original and model-generated trajectories. Finally, we evaluate the performance of the models in predicting individual daily trajectories and show that fine-tuning further improves GPT-2’s prediction accuracy.

4.1 Typical output examples

Figure 2A shows an example of the input and output locations of the GPT-2 trajectory generator. In this example, the generator output sixteen grid codes from the following five input grid codes. We manually verified that these grid codes correspond to the following locations on OpenStreetMap.

FIGURE 2. Examples of model-generated individual daily trajectories from the same input: (A) GPT-2, (B) 2-gram, (C) 3-gram, (D) Multi-Output CatBoost. Blue and red icons represent inputs and outputs, respectively. The maps were created using OpenStreetMap.

Input: Three locations around Osaka Castle → Daito Tsurumi IC (Kinki Highway) → Katano Kita IC (Daini Keihan Highway)

Output of GPT-2: Ritto IC (Meishin Highway) → five locations in downtown Kusatsu → five locations at the AEON shopping mall in Kusatsu → Otsu City Hall → one location in downtown Kusatsu → Kyoto Station → Osaka Station → one location around Osaka Castle.

The trajectory generated by GPT-2 is very different from a random walk and is human-like. For long-distance travel, expressways and bullet trains are used, and the location coordinates at 30-min intervals are spatially sparse due to the high moving speed. Landmarks and commercial areas are chosen as destinations, where people stay for a long time. They leave in the morning and return in the evening. The same route is likely to be chosen for the outbound and inbound trips.

We input the same five initial grid codes as the GPT-2 trajectory generator into the 2-gram, 3-gram, and Multi-Output CatBoost trajectory generators. Figures 2B,C,D show each of the sixteen outputs generated sequentially by the generators.

Output of 2-gram: Fushimi-Momoyama Castle Athletic Park → Kyoto Station → Mt. Hiei Sakamoto Station → Mt. Hiei cable car → Enryakuji Temple on Mt. Hiei → one location around Mt. Hiei Sakamoto Station → two locations around Horikawa Gojo → eight locations around Kitayama Omiya.

Output of 3-gram: Kyotanabe TB (Second Keihan Highway) → one location around Fushimi → Nishioji Gojo → Enmachi Station → Horikawa Kitaoji → Kamogawa Junior High School → Nishimarutamachi → Oguraike IC (Second Keihan Highway) → Higashi-Osaka City → Kashiba City → Kashihara City → Yamatokoriyama City → Momoyama → Keihan Ishiyama → two locations around Ritto City.

Output of Multi-Output CatBoost: Kugayama → 11 locations around Shimotoba → Yoko-oji → Oyamazaki JCT (Meishin Highway) → Kaminomaki area → southern Takatsuki City.

Typical output examples of these models do not reproduce the return home. In the example of the 2-gram model, the generated locations are trapped in a specific area and cannot escape it. With n-grams, we cannot statistically estimate the probability of the rare trajectories that would escape from the trap because the combinatorial explosion of grid-code combinations is unavoidable. In this paper, GPT-2 avoids this problem by using geospatial tokens. In the example of the 3-gram model, inefficient trajectories are generated, such as multiple trips going back and forth. This indicates that the memory length of past trajectories in the 3-gram model is not sufficient to generate realistic individual trajectories. GPT-2’s default context length of 1,024 tokens provides a sufficient memory length, and because GPT-2 memorizes the initial location, it can generate trajectories that return to that location. In the example of Multi-Output CatBoost, the generated location is often far from landmarks and major roads. CatBoost adopts an ensemble method that averages the predicted coordinates of multiple regression trees. If different regression trees predict different destinations, the output will be their intermediate coordinates: if either x_i or x_j is the destination, Multi-Output CatBoost will predict the intermediate location between them. For GPT-2, we do not average in this way; instead, we adopt greedy decoding, sequentially generating the token with the highest probability.

4.2 Statistical properties of individual daily trajectories

We investigated the statistical properties of the model-generated individual daily trajectories using five types of statistics: 1) the distribution of moving distance, 2) the autocorrelation of moving distance, 3) the relationship between moving distance and next moving angle, 4) the recurrence probability to the initial location, and 5) the diffusion of people. We measure the distance between two points as the shortest (geodesic) distance on the surface of the WGS84 Earth ellipsoid [68]. The inputs for each model are the five locations of the initial 2.5 h of the original trajectory.
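
The geodesic distance can be computed, for example, with pyproj (our choice of tool; the paper cites Karney’s algorithms [68], which underlie it, and the coordinates below are approximate):

```python
from pyproj import Geod

geod = Geod(ellps="WGS84")  # geodesics on the WGS84 ellipsoid [68]
# Approximate coordinates: Kyoto Station -> Kiyomizu Temple (lon, lat order).
_, _, dist_m = geod.inv(135.7588, 34.9855, 135.7850, 34.9948)
print(dist_m / 1000)  # geodesic distance in km
```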

For the first type of statistics, the moving distance distribution, Figure 3 shows the cumulative probability distribution of the hourly moving distance in a straight line for the original and model-generated trajectories. Note that the horizontal axis is on a logarithmic scale. The distribution of the original trajectories is approximated by a logarithmic function with R² = 0.995. Individuals tend to prefer moving within short distances; half of all moves are shorter than 4 km. Using the Jensen-Shannon divergence with the base-2 logarithm, D_JS, we measure the similarity of the distance distributions between the original and each model-generated trajectory. For 2-gram, 3-gram, CatBoost, and GPT-2, D_JS is 0.0037, 0.0064, 0.030, and 0.049, respectively. D_JS ∼ 0 means that a model reproduces the statistical property that the distribution of the hourly moving distance follows a logarithmic function, as in the original trajectories.

FIGURE 3. Distribution of hourly moving distance in a straight line for the (□) original trajectories and trajectories generated by the (▴) 2-gram, (▾) 3-gram, (•) Multi-Output CatBoost, and (⧫) GPT-2 models. The dashed line represents the logarithmic function (R² = 0.995).
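
A minimal sketch (ours) of the D_JS computation, assuming the two distance distributions have been binned on common edges; note that SciPy’s jensenshannon returns the square root of the divergence:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def d_js(p, q):
    """Jensen-Shannon divergence with base-2 logarithm.
    SciPy returns the JS *distance* (the square root), so square it."""
    return jensenshannon(np.asarray(p), np.asarray(q), base=2) ** 2

# Toy binned hourly-distance distributions on common bin edges.
p = [0.50, 0.30, 0.15, 0.05]
q = [0.45, 0.33, 0.15, 0.07]
print(d_js(p, q))  # values near 0 indicate similar distributions
```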

The second type is the autocorrelation function of the 30-min moving distance (here "min" denotes time, not minutes of arc). As shown in Figure 4, the original autocorrelation function decays exponentially, so the dynamics of the original movement follow a short-term memory process. The 3-gram and GPT-2 models reproduce autocorrelations that follow an exponential function. On the other hand, CatBoost reproduces this property less well.

FIGURE 4. Autocorrelation function of the 30-min moving distance for the (□) original trajectories and trajectories generated by the (▴) 2-gram, (▾) 3-gram, (•) Multi-Output CatBoost, and (⧫) GPT-2 models. The dashed line represents the exponential function (R² = 0.9919).

The third type of statistics is the relationship between moving distance and next moving angle. Most people move toward their destinations and thus are not random walkers. In Figure 5, we show the relationship between the length |X_t| of the hourly moving vector X_t = x_t − x_{t−1} and the cosine of the angle between consecutive moving vectors:

$$\cos\theta = \frac{\mathbf{X}_t \cdot \mathbf{X}_{t+1}}{|\mathbf{X}_t|\,|\mathbf{X}_{t+1}|}, \tag{4}$$

where x_t is the position vector representing the location coordinates at time t.

FIGURE 5. Relationship between hourly moving distance and next moving angle for the (□) original trajectories and trajectories generated by the (▴) 2-gram, (▾) 3-gram, (•) Multi-Output CatBoost, and (⧫) GPT-2 models. The vertical axis is the conditional mean of the cosine between two consecutive one-hour moves. The dashed line marks a moving distance of 10 km.

In the original trajectories, for moves of less than 10 km per hour, the conditional mean of the next moving angle is ⟨cos θ | |X_t| < 10 km⟩ ≃ 0. On the other hand, the conditional mean is positive, ⟨cos θ | |X_t| ≥ 10 km⟩ > 0, for moves of more than 10 km per hour. If the distance to the destination is less than 10 km, people can arrive within an hour. If the destination is more than 10 km away, people may not arrive within one hour; in that case, they continue to move toward their destination over the next hour. Figure 5 illustrates these characteristics of human mobility, which only the GPT-2 and 3-gram models reproduce.
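
A minimal sketch (ours) of the conditional means of cos θ in Eq. 4, using planar coordinates in km as an approximation of the geodesic treatment above:

```python
import numpy as np

def angle_stats(xy, threshold_km=10.0):
    """Conditional mean of cos(theta) between consecutive hourly moves (Eq. 4),
    split at `threshold_km`. `xy` is an (n, 2) array of planar coordinates
    in km, one row per hour."""
    v = np.diff(xy, axis=0)                                 # moving vectors X_t
    dot = (v[:-1] * v[1:]).sum(axis=1)
    norm = np.linalg.norm(v[:-1], axis=1) * np.linalg.norm(v[1:], axis=1)
    ok = norm > 0                                           # drop zero-length moves
    cos, d = dot[ok] / norm[ok], np.linalg.norm(v[:-1], axis=1)[ok]
    return cos[d < threshold_km].mean(), cos[d >= threshold_km].mean()
```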

The fourth type of statistics is the recurrence probability to the initial location. Most people leave their homes in the morning to go to their destinations and return home after completing their errands. In this trajectory dataset, coordinates are recorded only when the smartphone is more than 100 m away from the home. Therefore, in many cases, the initial coordinate of the daily trajectory is near the home or place of stay. Figure 6 shows the recurrence probability within 3 km of the initial coordinate for the 5 h until the final time (i.e., homecoming time) of the individual daily trajectory. In the original trajectories, the recurrence probability increases from 2 h before the final time. Only GPT-2 reproduces this property.

FIGURE 6. Recurrence probability to the initial location for the 5 h until the final time (i.e., homecoming time) of the individual daily trajectory: (□) original trajectories and trajectories generated by the (▴) 2-gram, (▾) 3-gram, (•) Multi-Output CatBoost, and (⧫) GPT-2 models. Time = 0 represents the final time.
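
A minimal sketch (ours, again using planar km coordinates as an approximation) of the recurrence probability at a given number of hours before the final time:

```python
import numpy as np

def recurrence_probability(trajs_km, hours_before_end, radius_km=3.0):
    """Fraction of trajectories whose location `hours_before_end` hours before
    the final time lies within `radius_km` of the initial location.
    `trajs_km` holds (n_i, 2) planar coordinate arrays at 30-min steps."""
    steps = int(hours_before_end * 2)  # 30-min resolution
    hits = sum(np.linalg.norm(xy[-1 - steps] - xy[0]) < radius_km
               for xy in trajs_km)
    return hits / len(trajs_km)
```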

As the fifth type of statistics, we investigated the time-scale-dependent properties of trajectories by observing the diffusion of people. In Figure 7, we plot the elapsed time from the initial time (i.e., the time scale) on the horizontal axis and the mean square of the distance from the initial location on the vertical axis. The four plots for the initial 2 h on the left side of the figure are initial values of the models, so they are common to the original and model-generated trajectories. If an individual trajectory follows a two-dimensional random walk, the mean square distance is proportional to the time scale; if people move linearly away from their initial locations, it is proportional to the square of the time scale. The exponent of this power-law relationship characterizes the diffusion of people. For the original trajectories, the exponent is around 2 up to the 4-h time scale and around 1 beyond it. These results suggest that the upper limit of moving time from home to destination for Kyoto tourism is about 4 h. The 3-gram, CatBoost, and GPT-2 models successfully reproduce these diffusion properties.

FIGURE 7. People’s diffusion for the (□) original trajectories and trajectories generated by the (▴) 2-gram, (▾) 3-gram, (•) Multi-Output CatBoost, and (⧫) GPT-2 models. The horizontal axis indicates the elapsed time from the initial time. The vertical axis represents the mean square of the distance from the initial location. The dotted and dashed guidelines show the mean square distance proportional to the elapsed time and to the square of the elapsed time, respectively. The four plots for the initial 2 h on the left side are initial values of the models.
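
A minimal sketch (ours) of the mean square displacement and the local power-law exponent read off Figure 7; planar km coordinates are again an approximation:

```python
import numpy as np

def diffusion_exponent(trajs_km, t_steps):
    """Mean square displacement (MSD) from the initial location at each
    elapsed step, plus the local log-log slope between consecutive scales:
    ~2 for ballistic motion, ~1 for ordinary diffusion."""
    t_steps = np.asarray(t_steps)
    msd = np.array([
        np.mean([np.sum((xy[t] - xy[0]) ** 2) for xy in trajs_km])
        for t in t_steps
    ])
    exponents = np.diff(np.log(msd)) / np.diff(np.log(t_steps))
    return msd, exponents
```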

4.3 Prediction accuracy

We confirmed the predictive performance of the GPT-2, n-gram, and Multi-Output CatBoost models on individual daily trajectories using test data not used for training. The five initial coordinates covering 2.5 h were input to predict the coordinates for the next half hour, one hour, two hours, four hours, and the final time (i.e., homecoming time) of the individual daily trajectory. The probability that the prediction is within 1 km (10 km) of the actual location coordinates is shown in Table 1. For all forecasts, GPT-2 outperforms the other models. In particular, for the last location of the day, GPT-2 is eight times more accurate than the other models.

TABLE 1. Probability that the prediction is within 1 km (10 km) of the actual location coordinates for the next half hour, one hour, two hours, four hours, and the final time of the day. We performed 24,247 realizations of each model to estimate the probabilities.

4.4 Fine-tuning GPT-2

Many tourists visit the best places to enjoy viewing the autumn leaves. The autumn leaves season in Kyoto is short, only about two weeks. In 2021, the weekend of November 27 and 28 was the best time to see the autumn leaves. With only two days as modeling targets, it is difficult to collect enough trajectory data for a model to learn the characteristics of trajectories from scratch. By fine-tuning the GPT-2 parameters learned in the previous section with the trajectories for November 27 and 28, we adapted GPT-2 to generate the individual daily trajectories for this weekend. Table 2 compares the prediction accuracy of the GPT-2 model before and after fine-tuning. To measure this accuracy, we focused on the probability that the prediction is within 1 km (10 km) of the actual location coordinates for the next half hour, one hour, two hours, four hours, and the final time (i.e., homecoming time) of the individual daily trajectories for November 27 and 28. Fine-tuning the GPT-2 parameters significantly improves the accuracy of trajectory prediction on the given days.

TABLE 2. Probability that the prediction is within 1 km (10 km) of the actual location coordinates for the next half hour, one hour, two hours, four hours, and the end of the day for November 27 and 28.
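
A minimal sketch (ours, assuming the Hugging Face Trainer API; the dataset object, output path, and epoch count are hypothetical, as the paper does not report the fine-tuning schedule) of continuing training on the two target days:

```python
from transformers import Trainer, TrainingArguments

# Continue training the pretrained trajectory GPT-2 on the Nov 27-28 subset
# (`model` is the pretrained model from Section 3.2; `nov_27_28_dataset` is
# a hypothetical tokenized dataset of the two target days).
args = TrainingArguments(
    output_dir="gpt2-kyoto-autumn",  # hypothetical path
    num_train_epochs=3,              # assumed; not stated in the paper
    learning_rate=5e-5,
)
trainer = Trainer(model=model, args=args, train_dataset=nov_27_28_dataset)
trainer.train()  # fine-tuned weights are then evaluated as in Table 2
```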

5 Conclusion

We proposed a method to convert individual daily trajectories into token time series by applying the tokenizer SentencePiece to a geographic space divided using the Japan regional grid code JIS X0410. We built a highly accurate generator of individual daily trajectories by learning the token time series with the neural language model GPT-2. The model-generated individual daily trajectories reproduced the following five realistic properties. The first property is that the cumulative distribution of the hourly moving distance follows a logarithmic function. The second is that the autocorrelation function of the moving distance exhibits short-time memory. The third is a positive correlation between the directions of consecutive hourly moves in long-distance trips. The fourth is that the last location is often near the initial location in each individual daily trajectory. The fifth is the time-scale dependence of people’s diffusion: on larger time scales, the diffusion is slower. In particular, the generators based on n-grams and CatBoost could not reproduce the recurrence probability to the initial location.

We also investigated the prediction accuracy of each model for individual daily trajectories. GPT-2 outperformed the n-gram and CatBoost models. Moreover, we showed that fine-tuning the parameters of GPT-2 with a part of the individual trajectories on given days significantly improves the accuracy of trajectory prediction for those days.

As a final point, we propose three important tasks to be tackled in the future. The first task is to generate trajectories that take into account individual attributes such as gender and age. Since a neural language model can generate text about a given category by training on both various texts and their categories, this method could be applied to the generation of trajectories that depend on individual attributes. The second task is to develop a next-location predictor that handles sequences of locations and timestamps. The time resolution used in this paper is fixed at 30 min, so we do not generate the temporal dimension (e.g., Fushimi-Momoyama Castle Athletic Park at 10:00 a.m. → Kyoto Station at 11:15 a.m. → Mt. Hiei Sakamoto Station at 11:20 a.m.). To generate the temporal dimension, it is necessary to develop a model that trains on both timestamps and location coordinates. The third task is to generate collective trajectories. In this paper, we introduced models in which individuals do not interact with each other; as part of our future challenges, we plan to develop methods for models to learn such interactions. Generating highly accurate synthetic trajectories from models would contribute fundamental knowledge to areas such as urban planning, what-if analysis, and computational epidemiology.

Data availability statement

Publicly available datasets were analyzed in this study. The data can be purchased from Agoop Corp., https://www.agoop.co.jp/.

Author contributions

All authors carried out the conceptualization, methodology, investigation, and validation. TM and SF carried out the formal analysis. TM prepared and wrote the original draft. TM carried out the project administration and funding acquisition. All authors contributed to the article and approved the submitted version.

Funding

This work was supported by Strategic Research Project grant from ROIS (Research Organization of Information and Systems), JST CREST Grant Number JPMJCR20D3 and JSPS KAKENHI Grant Numbers JP19K22852, JP21H01569, and JP21K04557.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

1. Zhu L, Yu FR, Wang Y, Ning B, Tang T. Big data analytics in intelligent transportation systems: A survey. IEEE Trans Intell Transp Syst (2019) 20:383–98. doi:10.1109/tits.2018.2815678

2. Chang S, Pierson E, Koh PW, Gerardin J, Redbird B, Grusky D, et al. Mobility network models of Covid-19 explain inequities and inform reopening. Nature (2021) 589:82–7. doi:10.1038/s41586-020-2923-3

3. Deb P, Furceri D, Ostry JD, Tawk N. The economic effects of Covid-19 containment measures. Open Econ Rev (2022) 33:1–32. doi:10.1007/s11079-021-09638-2

4. Mizuno T, Ohnishi T, Watanabe T. Visualizing social and behavior change due to the outbreak of Covid-19 using mobile phone location data. New Gener Comput (2021) 39:453–68. doi:10.1007/s00354-021-00139-x

5. Sudo A, Kashiyama T, Yabe T, Kanasugi H, Song X, Higuchi T, et al. Particle filter for real-time human mobility prediction following unprecedented disaster. In: SIGSPACIAL ’16: Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems; October 2016 (2016). p. 1–10. vol 5.

6. Rotman A, Shalev M. Using location data from mobile phones to study participation in mass protests. Sociol Methods Res (2020) 51:1357–412. doi:10.1177/0049124120914926

7. Cutter SL, Ahearn JA, Amadei B, Crawford P, Eide EA, Galloway GE, et al. Disaster resilience: A national imperative. Environ Sci Pol Sust Dev (2013) 55:25–9. doi:10.1080/00139157.2013.768076

8. WMO-UNISDR. Disaster risk and resilience. Thematic think piece, UN System Task Force on the Post-2015 UN Development Agenda (2012).

9. Yabe T, Tsubouchi K, Sekimoto Y. Cityflowfragility: Measuring the fragility of people flow in cities to disasters using gps data collected from smartphones. Proc ACM Interact Mob Wearable Ubiquitous Technol (2017) 1:1–17. doi:10.1145/3130982

10. Yabe T, Tsubouchi K, Sudo A, Sekimoto Y. A framework for evacuation hotspot detection after large scale disasters using location data from smartphones: Case study of kumamoto earthquake. In: Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPACIAL ’16); October 2016 (2016). p. 44.

11. Fiore M, Katsikouli P, Zavou E, Cunche M, Fessant F, Hello DL, et al. Privacy in trajectory micro-data publishing: A survey (2019). arXiv, 1903.12211.

12. Mir DJ, Isaacman S, Caceres R, Martonosi M, Wright RN. Dp-where: Differentially private modeling of human mobility. In: Proceedings of the 2013 IEEE International Conference on Big Data; January 2013 (2013). p. 580–8.

13. Pellungrini R, Pappalardo L, Simini F, Monreale A. Modeling adversarial behavior against mobility data privacy. IEEE Trans Intell Transp Syst (2020) 23:1145–58. doi:10.1109/tits.2020.3021911

14. Luca M, Barlacchi G, Lepri B, Pappalardo L. A survey on deep learning for human mobility. ACM Comput Surv (2023) 55:1–44. doi:10.1145/3485125

15. Wang X, Liu X, Lu Z, Yang H. Large scale gps trajectory generation using map based on two stage gan. J Data Sci (2021) 19:126–41. doi:10.6339/21-jds1004

16. Feng J, Yang Z, Xu F, Yu H, Wang M, Li Y. Learning to simulate human mobility. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; August 2020 (2020). p. 3426–33.

17. Yin D, Yang Q. Gans based density distribution privacy-preservation on mobility data. Security Commun Networks (2018) 2018:1–13. doi:10.1155/2018/9203076

18. Kulkarni V, Tagasovska N, Vatter T, Garbinato B. Generative models for simulating mobility trajectories (2018). arXiv, 1811.12801.

19. Huang D, Song X, Fan Z, Jiang R, Shibasaki R, Zhang Y, et al. Autoencoder based generative model of urban human mobility. In: Proceeding of the 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR); March 2019; San Jose, CA, USA. IEEE (2019). p. 425–30.

20. Liu Z, Miranda F, Xiong W, Yang J, Wang Q, Silva C. Learning geo-contextual embeddings for commuting flow prediction. Proc AAAI Conf Artif Intelligence (2020) 34:808–16. doi:10.1609/aaai.v34i01.5425

21. Yao X, Gao Y, Zhu D, Manley E, Wang J, Liu Y. Spatial origin-destination flow imputation using graph convolutional networks. IEEE Trans Intell Transp Syst (2021) 22:7474–84. doi:10.1109/tits.2020.3003310

22. Simini F, Barlacchi G, Luca M, Pappalardo L. Deep gravity: Enhancing mobility flows generation with deep neural networks and geographic information (2020). arXiv, cs.LG/2012.00489.

23. Tang J, Liang J, Yu T, Xiong Y, Zeng G. Trip destination prediction based on a deep integration network by fusing multiple features from taxi trajectories. IET Intell Trans Sys (2021) 15:1131–41. doi:10.1049/itr2.12075

24. Brebisson AD, Simon E, Auvolat A, Vincent P, Bengio Y. Artificial neural networks applied to taxi destination prediction. In: Proceedings of the 2015th International Conference on ECML PKDD Discovery Challenge; September 2015 (2015). p. 40–51.

25. Yao D, Zhang C, Huang J, Bi J. Serm: A recurrent model for next location prediction in semantic trajectories. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management; November 2017; Singapore (2017). p. 2411–4.

26. Liu Q, Wu S, Wang L, Tan T. Predicting the next location: A recurrent model with spatial and temporal contexts. In: Thirtieth AAAI conference on artificial intelligence (2016). p. 194–200.

27. Rossi A, Barlacchi G, Bianchini M, Lepri B. Modelling taxi drivers’ behaviour for the next destination prediction. IEEE Trans Intell Transp Syst (2019) 21:2980–9. doi:10.1109/tits.2019.2922002

28. Gao Q, Zhou F, Trajcevski G, Zhang K, Zhong T, Zhang F. Predicting human mobility via variational attention. In: The world wide web conference (2019). p. 2750–6.

29. Kong D, Wu F. Hst-lstm: A hierarchical spatial-temporal long-short term memory network for location prediction. In: Ijcai (2018). p. 2341–7.

30. Chen Y, Long C, Cong G, Li C. Context-aware deep model for joint mobility and time prediction. In: Proceedings of the 13th International Conference on Web Search and Data Mining; Feb 2022; Houston, TX, USA (2022). p. 106–14.

31. Feng J, Li Y, Zhang C, Sun F, Meng F, Guo A, et al. Deepmove: Predicting human mobility with attentional recurrent networks. In: Proceedings of the 2018 world wide web conference; April 2018; New York, NY, USA (2018). p. 1459–68.

32. Bao Y, Huang Z, Li L, Wang Y, Liu Y. A bilstm-cnn model for predicting users’ next locations based on geotagged social media. Int J Geographical Inf Sci (2020) 2020:639–60. doi:10.1080/13658816.2020.1808896

33. Lv J, Li Q, Sun Q, Wang X. T-Conv: A convolutional neural network for multi-scale taxi trajectory prediction. In: Proceedings of the 2018 IEEE international conference on big data and smart computing (bigcomp) (2018). p. 82–9.

34. Dai G, Hu X, Ge Y, Ning Z, Liu Y. Attention based simplified deep residual network for citywide crowd flows prediction. Front Comput Sci (2021) 15:152317–2. doi:10.1007/s11704-020-9194-x

35. Wang S, Cao J, Chen H, Peng H, Huang Z. Seqst-gan: Seq2seq generative adversarial nets for multi-step urban crowd flow prediction. ACM Trans Spat Algorithms Syst (2020) 6:1–24. doi:10.1145/3378889

36. Yang B, Kang Y, Li H, Zhang Y, Yang Y, Zhang L. Spatio-temporal expand-and-squeeze networks for crowd flow prediction in metropolis. IET Intell Trans Sys (2020) 14:313–22. doi:10.1049/iet-its.2019.0377

37. Ren Y, Chen H, Han Y, Cheng T, Zhang Y, Chen G. A hybrid integrated deep learning model for the prediction of citywide spatio-temporal flow volumes. Int J Geographical Inf Sci (2020) 34:802–23. doi:10.1080/13658816.2019.1652303

38. Tian C, Zhu X, Hu Z, Ma J. Deep spatial-temporal networks for crowd flows prediction by dilated convolutions and region-shifting attention mechanism. Appl Intell (Dordr) (2020) 2020:3057–70. doi:10.1007/s10489-020-01698-0

39. Mourad L, Qi H, Shen Y, Yin B. Astir: Spatio-temporal data mining for crowd flow prediction. IEEE Access (2019) 7:175159–65. doi:10.1109/access.2019.2950956

40. Lin Z, Feng J, Lu Z, Li Y, Jin D. Deepstn+: Context-aware spatial-temporal neural network for crowd flow prediction in metropolis. Proc AAAI Conf Artif Intelligence (2019) 33:1020–7. doi:10.1609/aaai.v33i01.33011020

41. Li W, Tao W, Qiu J, Liu X, Zhou X, Pan Z. Densely connected convolutional networks with attention lstm for crowd flows prediction. IEEE Access (2019) 7:140488–98. doi:10.1109/access.2019.2943890

42. Du B, Peng H, Senzhang W, Bhuiyan MZA, Wang L, Gong Q, et al. Deep irregular convolutional residual lstm for urban traffic passenger flows prediction. IEEE Trans Intell Transp Syst (2019) 21:972–85. doi:10.1109/tits.2019.2900481

43. Yao H, Tang X, Wei H, Zheng G, Li Z. Revisiting spatial-temporal similarity: A deep learning framework for traffic prediction. Proc AAAI Conf Artif intelligence (2019) 33:5668–75. doi:10.1609/aaai.v33i01.33015668

44. Zhang J, Zheng Y, Qi D. Deep spatio-temporal residual networks for citywide crowd flows prediction. AAAI’17: Proc Thirty-First AAAI Conf Artif Intelligence (2017) 31:1655–61. doi:10.1609/aaai.v31i1.10735

45. Liu L, Zhen J, Li G, Zhan G, He Z, Du B, et al. Dynamic spatial-temporal representation learning for traffic flow prediction. IEEE Trans Intell Transp Syst (2021) 22:7169–83. doi:10.1109/tits.2020.3002718

46. Ai Y, Li Z, Gan M, Zhang Y, Yu D, Chen W, et al. A deep learning approach on short-term spatiotemporal distribution forecasting of dockless bike-sharing system. Neural Comput Appl (2019) 31:1665–77. doi:10.1007/s00521-018-3470-9

47. Zonoozi A, jae Kim J, Li X-L, Cong G. Periodic-crn: A convolutional recurrent model for crowd density prediction with recurring periodic patterns. In: Ijcai (2018). p. 3732–8.

48. Schlapfer M, Dong L, O’Keeffe K, Santi P, Szell M, Salat H, et al. The universal visitation law of human mobility. Nature (2021) 593:522–7. doi:10.1038/s41586-021-03480-9

49. Song C, Koren T, Wang P, Barabási A-L. Modelling the scaling properties of human mobility. Nat Phys (2010) 6:818–23. doi:10.1038/nphys1760

50. Toch E, Lerner B, Ben-Zion E, Ben-Gal I. Analyzing large-scale human mobility data: A survey of machine learning methods and applications. Knowl Inf Syst (2019) 58:501–23. doi:10.1007/s10115-018-1186-x

51. Radford A, Wu J, Child R, Amodei D, Sutskever I. Language models are unsupervised multitask learners. OpenAI blog (2019) 1:9.

52. Statistics Bureau, Ministry of Internal Affairs and Communications. Overview of grid square statistics (2015). Available at: http://www.stat.go.jp/english/data/mesh/index.htm (Accessed on November 1, 2022).

53. Sennrich R, Haddow B, Birch A. Neural machine translation of rare words with subword units. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics; Berlin, Germany; August 2016. Association for Computational Linguistics (2016). p. 1715–25. vol 1. Available at: https://aclanthology.org/P16-1162/.

54. Kudo T, Richardson J. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations; November 2018; Brussels, Belgium. Association for Computational Linguistics (2018). p. 66–71. Available at: https://aclanthology.org/D18-2012/.

55. Gonzalez MC, Hidalgo CA, Barabasi A-L. Understanding individual human mobility patterns. Nature (2008) 453:779–82. doi:10.1038/nature06958

56. Pappalardo L, Rinzivillo S, Qu Z, Pedreschi D, Giannotti F. Understanding the patterns of car travel. Eur Phys J Spec Top (2013) 215:61–73. doi:10.1140/epjst/e2013-01715-5

57. Alessandretti L, Aslak U, Lehmann S. The scales of human mobility. Nature (2020) 587:402–7. doi:10.1038/s41586-020-2909-1

58. Pappalardo L, Simini F, Rinzivillo S, Pedreschi D, Giannotti F, Barabási A-L. Returners and explorers dichotomy in human mobility. Nat Commun (2015) 6:8166. doi:10.1038/ncomms9166

59. Song C, Qu Z, Blumm N, Barabási A-L. Limits of predictability in human mobility. Science (2010) 327:1018–21. doi:10.1126/science.1177170

60. Alessandretti L, Sapiezynski P, Sekara V, Lehmann S, Baronchelli A. Evidence for a conserved quantity in human mobility. Nat Hum Behav (2018) 2:485–91. doi:10.1038/s41562-018-0364-x

61. Agoop Corp. Agoop Corp (2022). Available at: https://www.agoop.co.jp/ (Accessed on August 7, 2022).

62. Sato A-H, Nishimura S, Tsubaki H. World grid square codes: Definition and an example of world grid square data. In: Proceedings of the 2017 IEEE International Conference on Big Data (Big Data); December 2017 (2017). p. 4238–47.

63. Devlin J, Chang M-W, Lee K, Toutanova K. Bert: Pre-training of deep bidirectional transformers for language understanding (2018). arXiv, 1810.04805.

64. Hugging Face. OpenAI GPT-2 (2022). Available at: https://huggingface.co/docs/transformers/model_doc/gpt2 (Accessed on August 18, 2022).

65. Dorogush AV, Ershov V, Gulin A. Catboost: Gradient boosting with categorical features support (2018). arXiv, 1810.11363.

66. Hancock JT, Khoshgoftaar TM. Catboost for big data: An interdisciplinary review. J Big Data (2020) 7:94. doi:10.1186/s40537-020-00369-8

67. CatBoost. CatBoost (2022). Available at: https://catboost.ai/en/docs/ (Accessed on August 7, 2022).

68. Karney CFF. Algorithms for geodesics. J Geod (2013) 87:43–55. doi:10.1007/s00190-012-0578-z

Appendix A:

In this paper, we utilize the Japan regional grid code "JIS X0410" for location indexing [52]. This code consists of several subcodes: the first-level subcode represents the absolute location of each grid, and the second-level and higher subcodes represent relative locations within a divided grid.

The first-level subcode (e.g., 5235) is a four-digit number representing a unique location enclosed by a square with a 40-min difference in latitude and a 1-degree difference in longitude, as shown in Figure 8. Throughout Japan, one side of this square is about 80 km. All land areas in Japan can be represented using 176 first-level subcodes. The first two digits of the subcode represent the latitude (multiplied by 1.5 and rounded down to the nearest integer), and the last two digits represent the longitude minus 100 degrees. The first-level subcode is calculated from the latitude and longitude of the southwest corner of the grid by

$$\text{First-level subcode} = \lfloor \text{latitude} \times 1.5 \rfloor \times 100 + \lfloor \text{longitude} \rfloor - 100. \tag{A1}$$

FIGURE 8. Grid subcode relationships.

The second-level subcode (e.g., 36) is a two-digit number that represents the area created by dividing the first-level grid into eight equal areas in the latitudinal and longitudinal directions. There are 64 second-level subcodes. The length of one side is about 10 km. The first digit of the second-level subcode indicates the direction of latitude and the last digit indicates the direction of longitude. This is connected to the first-level subcode as “5235/36”.

The third-level subcode (e.g., 80) is a two-digit number that represents the area created by dividing the second-level grid into ten equal areas in the latitudinal and longitudinal directions. There are 100 third-level subcodes. The length of one side is about 1 km. The first digit of the third-level subcode indicates the direction of latitude and the last digit indicates the direction of longitude. This is connected to the first-level and second-level subcodes as “5235/36/80”.

The fourth-level subcode (e.g., 2) bisects the third-level grid by latitude and longitude. The length of one side is about 500 m. The southwest area is represented as 1, the southeast as 2, the northwest as 3, and the northeast as 4, as in “5235/36/80/2”.

The fifth-level subcode (e.g., 3) bisects the fourth-level grid by latitude and longitude. The length of one side is about 250 m. The southwest area is represented as 1, the southeast as 2, the northwest as 3, and the northeast as 4, as in “5235/36/80/2/3”.

Appendix B:

We explain how the geospatial tokenizer of Section 3.1 and the GPT-2 model of Section 3.2 generate a daily trajectory from the input initial coordinates. First, as shown in the following example, we input five initial coordinates (latitude and longitude) that represent the initial 2.5 h of a moving trajectory into the geospatial tokenizer.

(34.716, 135.586) → (34.696, 135.548) → (34.697, 135.535) → (34.695, 135.529) → (34.788, 135.689)

The initial coordinates are converted into grid codes according to the rules of "JIS X0410" as follows.

5235045644/_/5235043342/_/5235043242/_/5235043214/_/5235154531/

The grid codes are converted into byte characters according to the translation map in the geospatial tokenizer, and then their byte characters are tokenized as follows.

ß/γ/B/2/_ßℏδB/3/_ßℏλB/3/_ßℏλC/2/_ßμξD/4/

These initial tokens are input into GPT-2 to generate the next token.

ß/γ/B/2/_ßℏδB/3/_ßℏλB/3/_ßℏλC/2/_ßμξD/4/_ΨηαD/

By recursively inputting the initial and generated tokens into GPT-2, tokens are generated successively. GPT-2 stops the recursion when the end token "." is generated.

ß/γ/B/2/_ßℏδB/3/_ßℏλB/3/_ßℏλC/2/_ßμξD/4/_ΨηαD/1/…/ßωA/4/.

Finally, these tokens are reversely converted into grid codes based on the translation map.

5235045644/_/5235043342/_/5235043242/_/5235043214/_/5235154531/_/5236402033/…/_/5235044321/.

These grid codes are output as a generated daily trajectory. In Figure 2A, we plot the location coordinates of these grid codes. In Section 4.1, we manually identified the location names of these grid codes.

Keywords: human mobility, trajectory generation, mobility model, GPT-2, statistical property, nlp

Citation: Mizuno T, Fujimoto S and Ishikawa A (2022) Generation of individual daily trajectories by GPT-2. Front. Phys. 10:1021176. doi: 10.3389/fphy.2022.1021176

Received: 17 August 2022; Accepted: 18 October 2022;
Published: 08 November 2022.

Edited by:

Víctor M. Eguíluz, Institute of Interdisciplinary Physics and Complex Systems (CSIC), Spain

Reviewed by:

Massimiliano Luca, Bruno Kessler Foundation (FBK), Italy
Jorge P. Rodríguez, CSIC-UIB, Spain

Copyright © 2022 Mizuno, Fujimoto and Ishikawa. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Takayuki Mizuno, mizuno@nii.ac.jp
