Unsupervised vessel trajectory reconstruction

A trajectory is a sequence of observations in time and space, for examples, the path formed by maritime vessels, orbital debris, or aircraft. It is important to track and reconstruct vessel trajectories using the Automated Identification System (AIS) data in real-world applications for maritime navigation safety. In this project, we use the National Science Foundation (NSF)'s Algorithms for Threat Detection program (ATD) 2019 Challenge AIS data to develop novel trajectory reconstruction method. Given a sequence of N unlabeled timestamped observations X = {x1,x2,…,xN}, the goal is to track trajectories by clustering the AIS points with predicted positions using the information from the true trajectories X. It is a natural way to connect the observed point xî with the closest point that is estimated by using the location, time, speed, and angle information from a set of the points under consideration xi ∀ i ∈ {1, 2, …, N}. The introduced method is an unsupervised clustering-based method that does not train a supervised model which may incur a significant computational cost, so it leads to a real-time, reliable, and accurate trajectory reconstruction method. Our experimental results show that the proposed method successfully clusters vessel trajectories.

. Introduction to trajectory reconstruction The Automatic Identification System (AIS) is an automatic tracking system which all ships over 300 gross tonnage and passenger ships are required to be installed aboard according to a mandate for maritime security according to the International Convention for the Safety of Life at Sea issued by the International Maritime Organization (IMO) to avoid ship collisions [1,2]. To address the challenges of tracking moving vessels using both space and time information to detect anomalous trajectories, the National Geospatialintelligence Agency (NGA) has collaborated with the National Science Foundation (NSF)'s Algorithms for Threat Detection program (ATD) for providing the ATD 2019 Challenge. The ATD 2019 AIS data [3] contain time-stamped information about a maritime vessel's movement including latitude, longitude, course over ground (angle), and speed over ground. The ATD 2019 Challenge is tracking the vessel trajectories in real time even when the AIS data may not have completely recorded vessel ID information due to technical issues or operational concerns. In this situation, there is no training set for applying supervised methods to identify the vessel and predict trajectories, and hence unsupervised methods are required. Although the existing unsupervised clustering methods can be used for predicting trajectories of vessels, they may not be able to provide desired prediction accuracy [4]. We propose an unsupervised trajectory reconstruction method can be used for space debris path prediction since space debris typically lack known labels for model training [5], and analyze and investigate three AIS datasets provided by NSF's ATD program and collected from the 1st of June to the 31st of July, 2019 (see Table 1).
. /fams. . We use the term trajectory reconstruction for estimating the AIS positions and connecting them as trajectories [6]. The existing works of trajectory reconstruction include linear interpolation, curvilinear interpolation [7], and its improvements [8,9], and Recurrent Neural Networks (RNNs) [10]. Some of these methods employ physical models of movement information such as speeds, directions, and time, and typically use the speed over ground and course over ground, and others assume a distribution of vessel trajectories and train it from historical records [11,12]. The stateof-the-art methods for trajectory reconstruction [13][14][15] generally have the following three steps: (1) apply a clustering method [16,17] to group trajectories data according to their route patterns, (2) assign the vessel to one of these clusters, and (3) interpolate or predict the vessel trajectory based on the route pattern of the assigned cluster. However, these methods requires a training set of stationary patterns such as paths in long time and distances, and hence they are not applicable to the three AIS datasets that we analyzed consist of short-term and distances trajectories which lack the long-term patterns.
Our main contributions include: (1) The design of a novel big-data-compliant unsupervised algorithm which automatically learns and extracts useful spatiotemporal information from AIS data; (2) The proposed spatiotemporal features improve the accuracy of clustering the AIS points and reconstructing trajectories; (3) The proposed method has been successfully applied to reconstructed vessel trajectories with the real AIS data collected nearby Norfolk, Virginia, and simulated data. The highlights of this paper are summarized as follows. The proposed vessel trajectory reconstruction method utilizing the spatiotemporal characteristics of AIS data is unsupervised, and therefore it does not require a training set. The experimental results demonstrate the advantages of the proposed method when the training set is insufficient. Unlike the traditional clustering method, the proposed method uses the points with features represented by its projected positions based on speeds and angles, so the computation only involve local information and thus runs fast.

. Next-point nearest neighbor clustering method
We first introduce the next-point concept with nearest neighbor classification method and then develop the nearest neighborhood clustering (NNC) when the vessel IDs are unknown using the proposed next-point method. We introduce a basic NNC method and design an advanced NNC trajectory reconstruction in this section. We will compare results of all these methods in the next section.

. . Next-point connection
We convert the longitude and latitude into the Universal Transverse Mercator (UTM) coordinates, and then group the AIS points by the proposed nearest-neighbor clustering method. The next-point connection (NPC) clustering algorithm uses the distance defined as where K is the index set of the AIS points in interval [t 0 , t] and d st is the space-time distance which is the Euclidean distance using all spatial and temporal features, [t 0 , t] is a preset search range of time (the interval length 1, 000 s used in our analysis), E and O stand for the estimated location and observed location at time t, respectively, and s is the set of variables used for finding the closest training points. The proposed clustering method contains the following steps: • Step 1. Project each point's next location using its speed, direction, and the time differences between the point and its neighboring points. Although the NPC method is similar to the minimum spanning tree (MST) and single linkage cluster analysis (SLCA) [18,19] that combine two clusters with the closest pair of points, NPC uses the estimated position E to measure the distance instead of the observed positions and NPC only searches AIS points in a nearby time interval. When the labels of the AIS points in K are known, the NPC method becomes a classification method and some points from the same vessel can be removed and only the AIS point with time closest to t will be used. The NPC methods that use the nearest neighbors to predict a vessel VID at each time t, they have the weaknesses: (1) NPC classification requires known labels which may not be available; (2) NPC clustering may merge different vessels and some feature with large values may dominate the distance. Therefore, we focus on the clustering method and propose the following algorithm to solve these issues.

. . Trajectory-based clustering
We propose an clustering algorithm which is based on trajectory reconstruction and thus called CBTR, which builds the . /fams. . trajectories of vessels by using local physical information. CBTR is based on an NPC clustering method which uses doubly checked distance to improve accuracy. For each point is the data set, we select another point as its best possible next point (BPNP) and put them in the same cluster. Like MST and SLCA, BPNP groups all points into a dendrogram with several tree-type clusters because two distinct points might have the same BPNP. The trajectory can be visualized by connecting all points with its BPNP with a line segment. See Figure 6 for an illustration. Given an AIS data set, in which points are ordered by their recording time t i , we denote each point by x i . The two-dimensional positions of x i on the earth are denoted as where LAT and LON stand for latitude and longitude of the AIS point x, and their speeds and courses are denoted by v i and c i , respectively. For every x i , we use its velocity, namely speed and course, to predict its future position and look for the best possible next point x j of it. If there is no point inside a reasonable searching range, then we consider x i as an endpoint of a trajectory. We trace each trail till an endpoint occurs and thus finish the clustering. So the algorithm of CBTR is designed as follows: • Step 1. For a given point x i at time t i , we derive the linearly approximate future trajectory γ i within 1,000 s by using its instant speed v i and course θ , i.e., the predicted position is defined by where T is the time period which will be chosen in Step 2. • Step 2. Collect all points appearing in the time zone t ∈ (t i , t i + 1000) and denote the collection as N i . Consider the closeness of γ i and each point x in N i by computing a bi-directional distance D between γ i and x, where T is chosen to be the time difference between x i and x. Impose a spatiotemporal angle condition to exclude points with exaggerated turning course and denote the rest points as N i . Let the BPNP of x i be the one in the collection N i which has smallest D and satisfies the angle condition. Denote this smallest D as D i , namely, Sort all AIS points x i according toD i in descending order. Compute the ratiō D i /D i+2 and find the first i whose ratio is less than a threshold (1.2 was used in our experiments). TakeD i+1 as a threshold and treat an AIS point as an endpoint of a projected line if its D is larger than the threshold. At last, cluster vessels by using these endpoints.

. . . Bi-directional distance
The bi-directional distance D and the turning angle condition are the most crucial elements in CBTR, so we provide details of them as follows.
We compute the (squared) distance between γ i and points in 2 is small, then x j is probably the next AIS point of x i . However, as shown in Figure 1, there might be another vessel (colored in red) appearing in the direction of γ i (the black arrow). In order to catch the correct BPNP x m for x i , we use the information of x m and x n to do double check. Precisely,

FIGURE
The direction of γ i is indicated by the black arrow. The directions of σ m and σ n are indicated by the gray arrow and the red arrow, respectively. d (σ m , x i ), the (squared) distance between x i and the end of gray arrow, is much smaller than d (σ n , x i ). Hence x m is a better prediction than x n .
we compute the backward locations σ m and σ n of x m and x n , respectively, namely the gray arrow and red arrow in Figure 1. One sees that the inconsistency between the black arrow and the red arrow can exclude x n as a BPNP of x i . On the other hand, x m is much possible to be the BPNP of x i because x i lies around the region that the gray arrow indicates. Therefore, we consider the bi- Step 2. This (squared) bi-directional distance can resolve intersection problem in trajectory analysis.
On the other hand, to prevent the endpoint of a trajectory connecting to another vessel, we have to impose a turning angle condition, which involves both space and time information. Roughly speaking, if the trajectory has to make a sudden unreasonable turn to connect its BPNP, then the trajectory should terminate right there. We cannot just measure the spatial angle because a vessel sometimes makes a large turn in a reasonable time period. So we need to consider a spatiotemporal angle. However, there is no natural exchange rate for temporal and spatial scales and we shall define a suitable one.
It is important to balance the scales of different spatiotemporal features for obtaining a meaningful space-time distance. The pooled normalization (a feature's values dividing by the range) and standardization (a feature's values divided by its standard error) are not suitable here, since the ranges of the spatiotemporal features of the vessels vary a lot. Consequently, we propose a dynamic scale conversion rate according to the vessel's speed and direction.
Considering that 1 knot is about 5 · 10 −4 km/sec and the length of the diagonal of a longitude unit square is about 124.45 km in our data set, we choose τ = 4 · 10 −6 ≈ 5 · 10 −4 /124.45 and α = 110.57/(111.32 · cos θ ), which are ratio estimators [20] to resale the data. The scaling factor α is used to convert unit of distance from degree of latitude into degree longitude so that they are comparable; the factor τ is used to normalize the time scale so that the temporal number looks in similar scale as spacial distance. Namely, for any two AIS points x i and x j , the spatiotemporal vector form x i to x j is defined by where T(x) means the time of the AIS point x. When the angle ϕ between − → x i x j and γ i is too large, say cos ϕ < 0.1, then we remove x j of candidates of BPNP of x i . Definition 2.1 (Turning Angle Condition). The trajectory shall not make a sudden and unreasonable turn in which the spatiotemporal angle ϕ is greater than cos −1 (0.1).
Thus we obtain N i from N i in Step 2. On the other hand, steady vessels with very slow movement which may be anchored float around with water currents and thus have randomly changing courses [21]. Therefore, for those steady pairs x i and x j with average speed smaller than 0.15 knots [22], we do not use the forwardbackward distance and simply measure their D( The spatiotemporal vector representation in Equation (2) of the AIS points induces a linear model for the next point x ′ i . Suppose that at the current time t 0 the point is x i and at time s ∈ (t 0 , t) the , we use the current speed speed(x i ) and angel θ (x i ) to approximate the dynamic speed and angle from time t 0 to time s so that the moving distance that has the true value from an integral of the dynamic speed over time (t 0 , s) is estimated by the product of moving time and speed, and y 2 and y 3 come from the first-order Taylor polynomial of the angle around x i that is used for cosine and sine. Consequently, we have the following regression models where In pursuit of better performance, we consider different values of parameter τ according to the types of vessels. However, the types of vessels are not provided in our AIS data. Alternatively, we adjust the value of τ based on the speed of vessels. This method is essentially a ratio estimator in cluster sampling [23]. For faster vessels with speed larger than 4 knots, we use larger τ , 4 · 10 −5 , so that the time difference is scaled to be comparable to the spatial difference in − → x i x j . For slower vessels with speed smaller than four knots, we use τ = 4 · 10 −5 as proposed in the above paragraph. To demonstrate the performance of CBTR on different types of vessels, we present individual results of four categories of vessels according to their speeds: (1) x i and x j are called a high speed pair if the average speed (in knots) S of them is larger than or equal to 16 knots; (2) fast pair if 4 ≤ S < 16; (3) slow pair if 0.15 ≤ S < 4; (4) steady pair if S < 0.15. We use τ = 4 · 10 −5 for vessels of the fist two categories, which have faster speeds, and use τ = 4 · 10 −6 for vessels of the last two categories. The results are shown in Table 5.
The proposed CBTR algorithm can be viewed as a special case of the weighted-average plug-in classifier [24,25]  The clustering results of data set . The numbers in the horizontal axis are ordered by the vessels' VIDs. The red lines are the predicted labels and the blue lines are the true labels. Most points of vessel no. are merged with vessel no. and the rest points are split into another cluster independent from other vessels. Vessel no.
is merged with vessel no. and no. , and contributes jumps.

. Comparison of algorithms
We evaluated the results by the correct-neighbor rate that is defined as n i=1 I(Y i = Y j )/n, where Y j is the label of the closest neighbor of Y i . In the CBTR algorithm, every AIS point is either assigned a BPNP or determined as an endpoint of a trajectory. We sum up all mistakes made in this process, say M, and compute the correct-neighbor rate as 1 − M/n. The proposed method does not aim to find a correct sequential pattern of a trajectory. The definition of the accuracy used in this article only considers the correct clustered labels in the beginning of Section 3.1. It means that it is possible that the proposed method groups one vessel's AIS points in the order of (1, 3, 2) although the true order is (1, 2, 3). However, in this case, it is considered as a correct clustering result.
We compare the CBTR with other methods including the LSTM recurrent neural network (RNN) architecture [27][28][29] and the EM clustering algorithm [30,31] which assumes mixed Gaussian distributed clusters. The Expectation-Maximization (EM) algorithm using a Gaussian mixture model estimates the probability of each observation iteratively through the E-step and M-step. Each  EM cluster is determined by its mean and variance, so that it is suitable for vessels that are anchored or moving randomly in a fixed location. Since most vessels are moving with varying speeds and directions, the EM clustering does not perform well in the datasets. The comparisons of their correct-neighbor rates are listed in Tables 2, 3. The time complexity of the proposed CBTR method is O(nr) with the sample size n and the neighborhood size r. See the computational time for each method in Table 4.

. . Results of CBTR and experiments by sampling
In Figures 2, 3, one sees that CBTR is able to regroup most of the trajectories correctly. We leave the detailed explanation of these plots in the Supplementary material. One may evaluate the performance of CBTR by two numbers: jumps and merges. The former counts the total breaks of trajectories done wrongly by CBTR and the later counts how many wrong groupings CBTR makes. Instead of counting how many points are connected to wrong next point, the sum of jumps and merges shows the performance of CBTR more faithfully. Since each jump creates a new clustering and each merge cancels a group, the difference between them is exactly the difference between the number of vessels of our data and the number of clusters via CBTR. Namely we have the following identity: merges − jumps = #{predicted clusters} − #{actual number of vessels}.
In order to evaluate the robustness of CBTR, we conducted experiments by removing points in the data sets so that the trajectories become harder to be tracked. Indeed, we consider validation sets by method 1: removing each fifth point of every five points (i.e., the fifth, tenth, etc.) and method 2 removing each second point of every two points (i.e., the second, fourth, etc.). In sum, we take out 20 and 50% points, respectively, in each validation set and apply CBTR to predict the trajectories. For the downsampled AIS datasets 1, 2, 3 using method 1, the correct rates of the estimated neighbors are 0.9977, 0.9977, and 0.9966, respectively. For the downsampled AIS datasets 1, 2, 3 using method 2, the correct rates of estimated neighbors are 0.9947, 0.9943, and 0.9913, respectively. As we anticipated, the more points are removed, the lower the correct rates of estimated neighbors are. However, CBTR still performs very well whereas large amounts of points are removed. Furthermore, we remark that there is a trade-off between the reduction of the number of jumps and the increment of the number of mergers. If the upper bound for time interval is lager than 1,000 in Step 1, it may lead to more candidates used for selecting BPNP and fewer jumping points while increases the number of merges.

. . Discussion on the performance of CBTR
The predicted trajectories of all vessels in data set 1 by using CBTR are shown in Figure 4. From the left-hand boundary of this picture, we know the data set contains some incomplete trajectories and it is impossible to cluster them correctly. One can see a zoomed-in picture of this boundary phenomenon in Figure 5. Figure 6 shows another mistake made by CBTR. This kind of mistakes happens at the endpoint of some trajectories. To be precise, when a vessel goes back and parks at a pier, it will turn off the AIS signal transmitter. The last position reported shall be the endpoint of the trajectory. But sometimes CBTR finds a false next point for this endpoint and continues the trial. For example, in Figure 6, vessel no. 7 (colored purple) left toward west, came back, and parked to the east of vessel no. 5. At that moment, vessel no. 5 was reporting its last location before turning off its signal transmitter. CBTR found some point of vessel no. 7 to be a possible next point of the last point of vessel no. 5. So it makes a wrong connection from the circled point to the squared point. This is called a terminal-type mistake and counted as a merge.
These terminal-type mistakes only happen when two vessels are anchored close to each other. We can prevent this terminal-type mistakes by using more restrictive connecting criterion, but this will break some trajectories of moving vessels because AIS points in moving trajectories are much sparser than AIS points in steady vessels moored to the piers. In this case, the speed and angle of a vessel randomly change by wave drift forces, so the variances of the white noises in our model (3) may be larger than the signals (speed). These terminal-type mistakes are not that serious because the AIS data is mainly used to recognize moving vessels. Except the boundary phenomenon and the terminal-type mistakes, CBTR to vessel no. is due to the limit of boundary of the dataset. Vessel no.
is connected to no. , which is outside the plot, due to the same boundary e ect.  performs perfectly on generic situations and is a reliable method to predict trajectories. Table 5 shows the performance of CBTR for vessels with different speeds. One can see that slow vessels are the most challenging ones for CBTR.
. . Discussion on the performance of the LSTM path prediction Long Short-Term Memory [27] is a type of Recurrent Neural Networks (RNNs). LSTMs are an example of a recurrent neural network which has feedback loops allowing time-dependent problems to be solved. That is, the outputs (i.e., previous outputs) can be used as an input to help model the current output. More generally, problems that have a fundamental order can be solved. LSTMs are capable of modeling sequences of different lengths, and this is ideal as vessel paths often have a different number of points [32].
LSTMs have been used for predicting vessel trajectories with AIS data [33][34][35] as they can naturally be adapted to multi-target learning and are capable of learning both simple and complex patterns. Here we can think of the timestamp, latitude, longitude, speed, and direction, all at time t, as response variables whereas the predictor variables (i.e., inputs to the LSTM) are the timestamp, latitude, longitude, speed, and direction at time t−1, t−2, · · · , t−k. We train an LSTM using lagged versions of the timestamp, latitude, longitude, speed, and direction (i.e., time t − 1, t − 2. · · · , t − k) in order to predict the timestamp, latitude, longitude, speed, and direction at one time point in the future (i.e., time t). The goal here is to attempt to predict all characteristics of a vessel automatically using previous information. The architecture and tuning were accomplished via trial an error using a random 20% validation sample. The characteristics of the LSTM are the following: an input dimension of 5 (i.e., timestamp, latitude, longitude, speed, and direction are lagged by k = 1 time unit), 1 hidden layer, 250 hidden units using the Rectified Linear Unit (ReLU) activation function: max(0, x), and 5 output nodes (i.e., timestamp, latitude, longitude, speed, and direction at time t). Additional values for the number of lags were tried, but the performance was essentially unchanged and different activation functions were tried and tended to produce inferior results. The software used was the keras library in Python [36].
The results from the LSTM using all five variables as outputs seem to indicate that this approach is unable to distinguish the different vessel trajectories due to several reasons including the initial value and the training set of LSTM [33], the changes of courses and speeds [34] in the given prediction time range, and the normalization method x−min max − min [35] which may over-compress . /fams. . trajectory data since the some trajectories have large ranges but others do not. The performance of the LSTM next point prediction method is fundamentally dependent on the historical trajectories with labels used to train the LSTM model to predict the properties of the node at the next time point [32,37]. That is, the training set with labeled trajectories are needed to accurately predict the timestamp, latitude, longitude, speed, and direction at some future point in time. However, to make a fair comparison, only the current AIS point is used in training a LSTM model for predicting the next point, and this makes the recurrent neurons not able to sufficiently learn the latent features in the AIS datasets and leads to inaccurate prediction [38,39]. LSTM models are known to require a large amount of data in order to be effective, so the relatively small size of the individual AIS training datasets also is a contributing factor to the LSTM's performance.
An inspection of the LSTM predictions and the resulting nearest neighbor search indicate that most of the errors are related primarily to two factors: some vessels rapidly change their speed and direction while simultaneously other vessels that were previously similar to the rapidly changing vessel do not change their speed or direction suddenly and this results in misclassification, for example, vessels no. 5-8. The second source of error may be that the predicted AIS points by LSTM have large variations [37] and in combination with a larger number of candidates within each time window (i.e., the time window in the nearest neighbor search), mistakes are accumulated.

. Conclusions
The proposed CBTR method successfully cluster AIS points and track a trajectory without knowing the true labels of AIS points.
Step 2 of the proposed CBTR is the essence of our method, which integrates the forward and backward estimated positions into measuring the differences between two adjacent points. This step evaluates how good the fitted path is dynamically instead of using the static point information by measuring the mutual distances between points. Thus, CBTR algorithm is able to distinguish intersecting trajectories. The second feature in Step 2 is to define a suitable parameter τ to exchange time and space scales. Therefore, CBTR is applicable to various kinds of moving-point data lacking in labels, and its spatiotemporal features can be used with other methods [40] to select a safe maneuver crossing scenario with two target ships.

Data availability statement
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found at: https://gitlab.com/ algorithms-for-threat-detection/2019/atd2019.

Author contributions
H-HH and C-WC reviewed literature and designed the proposed methods. C-WC wrote and ran the Matlab code for the proposed CBTR method and edited the proposed methodology and the analysis results. H-HH drafted the manuscript. All authors proofread the manuscript. All authors contributed to the article and approved the submitted version.