Assignment Matrix Free Algorithms for On-line Estimation of Dynamic Origin-Destination Matrices

Dynamic Traffic Assignment (DTA) models represent fundamental tools to forecast traffic flows on road networks, assessing the effects of traffic management and transport policies. As biased models lead to incorrect predictions, which can cause inaccurate evaluations and huge social costs, the calibration of DTA models is an established and active research field. When it comes to estimating Origin-Destination (OD) demand flows, perhaps the most important input for DTA models, one algorithm has been suggested to outperform all the others for real-time applications: the Kalman Filter (KF). This paper introduces a non-linear Kalman Filter framework for online dynamic OD estimation that reduces the number of variables and can easily incorporate heterogeneous data sources to better explain the non-linear relationship between traffic data and time-dependent OD flows. Specifically, we propose a model that takes advantage of Principal Component Analysis (PCA) to capture spatial correlations between variables and better exploit the local nature of a specific KF recently proposed in the literature, the Local Ensemble Transformed Kalman filter (LETKF). The main advantage of the LETKF is that the Kalman gain is not explicitly formulated, which means that, unlike other approaches proposed in the literature, there is no need to compute the assignment matrix or its approximation. The paper shows that the LETKF can easily incorporate different data sources, such as traffic counts and link speeds. Additionally, thanks to the PCA, the model can identify local patterns within the data and better explain the correlation between variables and data. The effectiveness of the proposed methodology is demonstrated first through synthetic experiments, where non-linear functions are used to benchmark the model in different conditions, and then on the real-world network of Vitoria, Spain (2,884 nodes, 5,799 links) using the mesoscopic simulator Aimsun. Results show that the proposed method leads to better state estimation performance than other Ensemble-based Kalman filters, providing improvements as high as 64% in terms of traffic data reproduction with a 17-fold problem dimensionality reduction.

INTRODUCTION
Due to the rapid growth of road traffic, all major cities of the world are facing severe congestion problems. It is clear that simply increasing the infrastructure supply is not physically or economically feasible. The ever-growing pressure of travel demand on existing transport facilities generates significant societal, environmental and economic losses, due to the increase of pollutant emissions and travel times. To tackle this issue, practitioners and researchers in the transportation field rely on traffic simulation models to evaluate and implement efficient measures, ranging from off-line planning to real-time traffic management solutions. However, the efficiency of these tools depends on their ability to correctly predict traffic conditions. Travel demand is arguably the most important input for traffic simulation models, as both the planning and management of traffic solutions require good knowledge of current and forecasted demand. In dynamic models, demand is usually discretized in the form of Origin-Destination (OD) matrices, where each cell represents all trips from an origin zone to a destination zone started during a specific time interval. A widely adopted procedure uses traffic counts to estimate these matrices, seeking the approximation of OD flows that minimizes the error between simulated and available traffic data. The relation between OD matrices and traffic counts is generally expressed through assignment matrices, which are hard to obtain and impose a simple linear relationship between link and demand flows (Frederix et al., 2013). Traditionally, this procedure can be applied off-line (for medium to long term planning) or on-line (for real-time traffic prediction). The dynamic OD demand estimation problem can be tackled through two different approaches: simultaneous or sequential (Cascetta et al., 1993; Cantelmo and Viti, 2020).
In the latter approach, matrices are individually estimated, from the first time interval to the last one, while keeping the matrices calculated in the previous intervals fixed. By pursuing a sequential approach, the computational complexity decreases, as the problem can be split into a set of simpler sub-problems and every estimated matrix can be used as the starting estimate for the next time interval. Thus, the sequential approach is a must for online applications, where computational times are an important constraint in the estimation process.
One algorithm that has been shown to outperform all the others for the Online Dynamic Demand Estimation (ODDE) problem is the Kalman Filter (Chang and Wu, 1994; Van der Zijpp and Hammerslag, 1994; Ashok, 1996; Zhou and Mahmassani, 2007). The Kalman Filter uses a series of measurements obtained over time to estimate the most likely state of an unknown variable. In the case of on-line estimation of demand, time-dependent OD flows are the unknown variables and observed traffic conditions are the data. However, despite more than 20 years of intense research efforts, the ODDE has not yet been solved effectively enough to be applied to real problems. Based on the literature, we can distinguish between three main sources of complexity in ODDE:
1. Number of variables: the ODDE procedure generally returns effective results when the number of observations (traffic data) is similar to the number of unknowns (OD flows). However, as this usually does not occur in practice, dimensionality reduction techniques should be deployed to avoid this issue (Marzano et al., 2009; Djukic et al., 2012; Xia et al., 2018).
2. Non-linear relationships between variables: there are two main ways of considering non-linearity within the ODDE, which ideally should be jointly considered. The first is to include different data sources in order to increase the observability of the system. For instance, jointly considering speeds, densities and counts can help to better understand traffic phenomena (Balakrishna et al., 2007; Frederix et al., 2011; Yang et al., 2017). The second is to deploy non-linear models, as describing non-linear systems requires them. The conventional Kalman Filter is a simple linear model, thus several non-linear extensions have been proposed.
3. Demand structure: mobility demand derives from the demand for activities, and thus it has a structure. Different models should be used to target different components of the demand, including random fluctuations, structural and seasonal trends, as well as regular trends (Zhou and Mahmassani, 2007; Cantelmo et al., 2019; Behara et al., 2020).
This paper introduces a non-linear Kalman Filter framework for ODDE that considers the three sources of complexity reported above. Through the adoption of Principal Component Analysis (Jolliffe, 2002), it exploits the demand structure and reduces the number of variables. It is assignment-matrix free, which means that it can easily incorporate heterogeneous data sources to better explain the non-linear relationship between traffic data and time-dependent OD flows. The remainder of the paper is structured as follows. Section Literature Review provides a brief literature review of the ODDE problem solved through the Kalman filter and its non-linear extensions. The model, an extension of the Local Ensemble Transformed Kalman Filter (LETKF) proposed in Carrese et al. (2017), is then presented in section The Model: PCA-Local Transformed Ensemble Kalman Filter. Section Applications and Results shows the numerical results on a synthetic experiment and on the real-world network of the city of Vitoria, Spain. Lastly, section Conclusions provides some concluding statements and remarks.

Online Dynamic Demand Estimation
Travel demand refers to the entirety of trips between all the traffic zones of a transport network, taking into account the different travel purposes, time frames and modes of transport (Cascetta, 1984). From a modeling perspective, travel demand consists mainly of an origin-destination (OD) matrix with each cell representing all trips from an origin zone O to a destination zone D.
In the dynamic case, the temporal OD demand is usually represented by a sequence of matrix "slices", where each demand slice corresponds to a departure time interval. Estimation of temporal OD demand can be performed for both within-day (intra-period) and day-to-day (inter-period) dynamic frameworks, as well as for off-line (medium-long term planning and design) and on-line (real-time management) applications.
In the on-line within-day formulation, OD flows are estimated as a sequence of separate intervals adopting a rolling-horizon approach. This is called the sequential approach, and it is formulated as the minimization problem (1). The objective in (1) is to find the matrix $d^*_{n_h}$ for each time interval $n_h$ of the planning horizon $T$, minimizing: (i) a measure of distance $f_1$ between the unknown ODs $x_{n_h}$ and some a priori target matrix $x^*_{n_h}$ (seed matrix); (ii) a measure of distance $f_2$ between simulated traffic data $t_{n_h}$ and traffic measurements $t^*_{n_h}$. Simulated traffic data derive from the Dynamic Traffic Assignment (DTA) of the unknown ODs, with the ODs of previous time slices kept fixed. The first component $f_1$ in (1) makes it possible to overcome the non-uniqueness of the solution of the demand estimation problem; in on-line dynamic formulations, off-line dynamic matrices are usually adopted as the target ones.
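Since the display form of (1) is not reproduced here, the following is a sketch of what the description implies, written in the notation of this section. The unweighted sum of the two distance terms is an assumption; the original formulation may include weights or constraints (for example, non-negativity of the OD flows) that cannot be recovered from the text.

```latex
d^{*}_{n_h} \;=\; \arg\min_{x_{n_h}}
\Big[\, f_1\!\left(x_{n_h},\, x^{*}_{n_h}\right) \;+\; f_2\!\left(t_{n_h},\, t^{*}_{n_h}\right) \Big],
\qquad n_h \in T
```

Here $t_{n_h}$ depends on $x_{n_h}$ through the DTA, with the matrices of the previous intervals held at their estimated values.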

Kalman Filter and Non-linear Extensions for ODDE
One of the most established approaches for ODDE is to reformulate the problem as a state-space model and, subsequently, to adopt a Kalman Filtering (KF) approach (Okutani and Stephanedes, 1984) to solve it. The state-space model is a useful abstraction for dynamic systems that describes the behavior of said systems through three equations: 1. The transition equations, which capture the evolution of the system over time; 2. The measurement equations, which map the state variables to the observed data; 3. The analysis equation, which corrects the estimate derived by the transition equations through the results of the measurement equations and the Kalman gain.
The KF algorithm (Kalman, 1960) is based on the incremental solution of a least-squares cost function, which allows the OD flows to be updated when new traffic data become available. In order to include the structure of the demand within the estimation framework, Ashok and Ben-Akiva (1993) formulated the KF in terms of deviations between the actual and the historical OD flows. The KF algorithm represents one of the most widely adopted solution frameworks for the online OD estimation problem (Barcelo and Montero, 2015; Zhang et al., 2017; Marzano et al., 2018; Krishnakumari et al., 2019; Liu et al., 2020). However, its application to the online OD estimation problem has several drawbacks. Firstly, both the transition equation and the measurement equation assume a linear relation between variables. In the case of the transition equation, this relation is usually represented in the form of an autoregressive process (Ashok, 1996), while assignment matrices are usually used to feed the measurement equation. Secondly, the assignment matrices are assumed to be fixed in the measurement equations to simplify the process. Lastly, the standard KF is not able to handle a large number of variables, as it requires intensive linear algebra computations (Bierlaire and Crittin, 2004).
As these formulations poorly represent traffic dynamics, non-linear models need to be deployed in real cases. This paper focuses on the family of Ensemble Kalman filters (EnKF), introduced by Evensen (2003). The EnKF chooses an ensemble of initial conditions around the current estimate and propagates each ensemble member based on a non-linear model. Thus, the uncertainty of the estimation is propagated from one time interval to the next and the ensemble is used to parametrize the distribution of the state variables.
The size of the ensemble has to be chosen so that it is statistically representative of the model (Kalnay, 2002) and spans the model sub-space adequately (Oke et al., 2007). Otherwise, the system may be undersampled, leading to unwanted errors such as inbreeding, filter divergence, and spurious correlations (Whitaker and Hamill, 2002; Lorenc, 2003; Furrer and Bengtsson, 2007). Several approaches have been implemented to avoid the problems caused by undersampling, such as covariance inflation (Anderson and Anderson, 1999) and localization (Hamill et al., 2001).
The size of the ensemble also affects the computational time of the EnKF. All the Ensemble-based Kalman filters tend to be computationally expensive, as the state variables ensemble must be maintained throughout the entire time horizon. Furthermore, computing the dependency between OD flows and observed measurements requires k runs of the DTA model for each time interval, where k is the number of elements in the ensemble. Since the DTA performance is highly dependent on the number of OD pairs on the network, obtaining a sustainable prediction time for Ensemble-based Kalman filter algorithms requires some approach to reduce the computational time, such as accelerating the solution algorithm of the DTA or reducing the dimensionality of the problem.
An interesting extension of the EnKF is the Local Ensemble Transformed Kalman filter (LETKF) (Hunt et al., 2007), which efficiently deals with non-linear problems, large-scale models and datasets by combining the advantages of two ensemble-based filters: the Local Ensemble Kalman filter and the Ensemble Transformed Kalman filter (ETKF) (Bishop et al., 2001; Wang et al., 2004).
The LETKF adds two extensions to the basic EnKF. First, it minimizes the Kalman filter cost function in the ensemble space, thus reducing the dimension and the complexity of the problem (it is "transformed"). Second, it provides a framework for data assimilation that allows a system-dependent localization strategy, breaking down the problem into sub-problems that can be solved in parallel (it is "local"). As in all the EnKFs, the LETKF avoids the linearization of the dependency between OD flows and observed measurements by implicitly capturing it through a traffic simulator rather than through an analytic formula.
As pointed out in Hunt et al. (2007) and Carrese et al. (2017), the LETKF provides a framework for data assimilation that allows a localization strategy, i.e., cutting off longer range correlations at a specified distance. Generally, localization is performed by applying a Schur product with a correlation function with local support, meaning that the function depends on the Euclidean distance between variables and is non-zero only in a small local region. However, this idea holds true only for the prediction process of selected dynamic systems, such as weather forecasting, where the spatial correlation between variables does depend on the Euclidean distance between them. Since traffic spreads over the whole network, the same does not hold for dynamic traffic systems, as OD flows are not directly correlated by Euclidean distance. For the ODDE, the "local" approach means dividing the network into subnetworks and the demand matrix into submatrices, each submatrix containing the ODs that most affect the traffic measurements in the corresponding subnetwork. This has already been tested for the off-line OD estimation problem (Cantelmo et al., 2014; Antoniou et al., 2015). However, similarly to methods that explicitly use the assignment matrix, this entails developing procedures to explicitly map the relative weight of the information (e.g., how ODs and link flows are correlated).

Contribution Statement
It is well known that the ODDE is a highly non-linear, highly underdetermined, and computationally demanding problem. The analysis of the state of the art shows a clear need for efficient and robust solution methods able to handle large networks and heterogeneous data sources.
The LETKF represents a recent development of the EnKF for the ODDE problem, as it is suitable for highly non-linear problems and large-scale networks and datasets. So far, the model has been used for small networks only. Thus, the "local" peculiarity of the LETKF, which has been mentioned but not implemented in Carrese et al. (2017), remains one of the main and most interesting research lines for the applicability of the model to large-scale real-world networks. To fill this gap, this paper introduces a methodology that combines the Principal Component Analysis and the LETKF (PCA-LETKF). Unlike the methods proposed in Cantelmo et al. (2014) and Antoniou et al. (2015), the Principal Component Analysis (PCA) captures spatial correlations between variables without the need to explicitly map these relationships. The PCA is a powerful tool for data analysis that aims to identify patterns in high-dimensional datasets and reduce the number of dimensions without much loss of information. This is achieved by transforming the dataset into a new set of variables, the Principal Components (PCs), which are uncorrelated and ordered so that the first few retain most of the variation present in all of the original variables (Jolliffe, 2002). PCA is an assumption-free procedure that already calculates how ODs are correlated (Prakash et al., 2018; Qurashi et al., 2019). As shown by Djukic et al. (2012) and Prakash et al. (2018), replacing the OD demand with its Principal Components reduces the problem complexity, thus making the ODDE problem simpler. This paper shows that the PCA also identifies the variables that are highly correlated and provides a good classification for the localization framework. Thus, by finding the spatial correlation between variables, the PCA empowers the LETKF model to better exploit its "local" nature.
We show that, for transport applications, the proposed PCA-LETKF outperforms conventional EnKFs, including the LETKF. The reason is that the PCA finds spatial correlations between variables, so that fewer ensembles than in the conventional EnKF are required to achieve better estimates.

THE MODEL: PCA-LOCAL TRANSFORMED ENSEMBLE KALMAN FILTER
The road traffic network is modeled by a directed graph $G = (n, l)$, where $n$ is the set of nodes and $l$ is the set of links. The simulated horizon $T$ is divided into equal time intervals $h = 1, \dots, t$.

Creation of the Starting Dataset
According to Prakash et al. (2018), multiple estimates of the OD flows are required to obtain the principal components of an OD flow vector. One reasonable approach would be to use an offline calibration procedure to estimate the OD flows over multiple days and compose an $m \times n_{OD}$ data matrix $X$, where $m$ is the number of estimates used and $n_{OD}$ is the number of OD pairs in the network. In practice, every estimate $x_i$ is created from the seed matrix $x$ (having dimension $n_{OD} \times n_h$) through a random perturbation, where $R$ is a normally distributed random vector with mean 0 and values between 0 and 1, and $\Delta$ is a ±1 Bernoulli distribution random vector used to randomize the increase or decrease of each estimate $x_i$. Lastly, $q_i$ is a random coefficient that specifies the scale of randomization.
A centered data matrix $\tilde{X}$ is obtained by subtracting the mean $\bar{X}$ from each column of the matrix $X$, since having a zero-mean dataset simplifies the problem.
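The construction of the data matrix can be sketched as follows. This is only an illustration under stated assumptions: the exact perturbation formula is not recoverable from the text, so a multiplicative form `x_i = x * (1 + q_i * delta * r)` is assumed, the seed is taken as a single OD vector rather than an $n_{OD} \times n_h$ matrix, and the names `build_data_matrix`, `center` and `max_scale` are hypothetical.

```python
import numpy as np

def build_data_matrix(seed_od, m=100, max_scale=0.3, rng=None):
    """Stack m randomly perturbed copies of a seed OD vector into an (m, n_od) matrix.

    Assumption (not from the paper): multiplicative perturbation
    x_i = seed * (1 + q_i * delta * r), with magnitudes r clipped to [0, 1].
    """
    rng = np.random.default_rng(rng)
    seed_od = np.asarray(seed_od, dtype=float)
    n_od = seed_od.size
    X = np.empty((m, n_od))
    for i in range(m):
        # R: zero-mean normal draws folded and clipped so that values lie in [0, 1]
        r = np.clip(np.abs(rng.normal(0.0, 1.0 / 3.0, n_od)), 0.0, 1.0)
        # Delta: +/-1 Bernoulli signs deciding increase or decrease per OD pair
        delta = rng.choice([-1.0, 1.0], size=n_od)
        # q_i: random scale of the randomization for this estimate
        q_i = rng.uniform(0.0, max_scale)
        X[i] = np.maximum(seed_od * (1.0 + q_i * delta * r), 0.0)  # keep flows non-negative
    return X

def center(X):
    """Subtract the column means so that each OD pair has zero mean across estimates."""
    mean = X.mean(axis=0)
    return X - mean, mean
```

For example, `X = build_data_matrix(seed, m=100)` followed by `X_centered, X_mean = center(X)` yields the zero-mean matrix on which the PCA is computed.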

Generation of the PC Components
The centered data matrix is factorized through singular value decomposition, $\tilde{X} = U \Sigma V^{T}$, where $\Sigma$ is an $m \times n_{OD}$ matrix whose non-negative diagonal entries are called singular values, $U$ is an $m \times m$ matrix with orthogonal column vectors called the left singular vectors, and $V$ is an $n_{OD} \times n_{OD}$ matrix with orthogonal column vectors called the right singular vectors. The columns of the matrix $V$ are the principal component directions, which represent the eigenvectors of the sample covariance matrix $\frac{1}{m}\tilde{X}^{T}\tilde{X}$. The first $r$ PC directions (sorted from highest to lowest according to the corresponding singular values) that explain more than 95% of the variance of the data matrix are selected to form the $n_{OD} \times r$ matrix $V = [v_1, v_2, \dots, v_r]$,
where $v_1$ represents the PC direction with the largest sample variance, $v_2$ represents the PC direction with the largest sample variance that is orthogonal to $v_1$, and so on. Then, the PCs for the time interval $h-1$, $z^a_{h-1}$, are generated by projecting onto the selected PC directions the a priori OD flows $x^a_{h-1}$ (of the seed matrix), which contain all the trips departing during the time interval $h-1$.
It must be taken into account that the principal components, as computed above, capture the structural spatial relationship between the OD flows and not their temporal relationships. As OD demand can change through the day, it would be desirable to compute several matrices $V_h$ relative to the specific time interval $h$. However, calculating time-interval specific principal components can have high data requirements, which is why the feature vector $V$ is considered constant across intervals, implying that the statistical correlation between OD flows is constant through the chosen time frame. This is considered to be an acceptable approximation when, for example, the estimation time frame corresponds to only the morning hours (i.e., from 7:00 a.m. to 12:00 p.m., as in the case studies discussed in the following sections). If the entire day demand is considered, issues may arise, as the afternoon peak hours may affect the correlation in the morning peak hours and vice versa.
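The selection of the PC directions and the mapping between OD flows and PCs described in this section can be sketched with NumPy as follows. The function names are hypothetical, and the handling of the mean in `to_pcs` and `to_od` (subtracting it before projecting and adding it back afterwards) is an assumption made so that the sketch is a self-consistent PCA; the paper's exact projection formula is not recoverable from the text.

```python
import numpy as np

def pc_directions(X, variance_share=0.95):
    """Return (V_r, mean): the first r right singular vectors of the centered data
    matrix, with r the smallest count whose cumulative variance share reaches the target."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Economy SVD: Xc = U @ diag(s) @ Vt, singular values s sorted from high to low
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = np.cumsum(s**2) / np.sum(s**2)
    r = min(int(np.searchsorted(explained, variance_share)) + 1, s.size)
    return Vt[:r].T, mean  # V_r has shape (n_od, r)

def to_pcs(x, V_r, mean):
    """Project an OD vector onto the retained PC directions (assumed: centered first)."""
    return V_r.T @ (np.asarray(x, dtype=float) - mean)

def to_od(z, V_r, mean):
    """Approximate reconstruction of OD flows; discarded directions are lost."""
    return mean + V_r @ np.asarray(z, dtype=float)
```

Because only r of the $n_{OD}$ directions are kept, `to_od(to_pcs(x, V_r, mean), V_r, mean)` returns x exactly only when x lies in the retained subspace, which is the approximation discussed for the back-transformation to OD flows.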

Generation of the a priori Ensemble
Given the a priori PCs $z^a_{h-1}$, an ensemble $\{z^i_{h-1},\ i = 1, \dots, k\}$ is generated, where $k$ is the number of members of the ensemble. In ensemble-based filters, the ensemble can be obtained "offline," from the last matrices estimated during iterations of an off-line adjustment process or from several matrices obtained from off-line adjustments conducted for several days. In this paper, as we are generating an ensemble of PCs, the ensemble is generated by perturbing the a priori PCs $z^a_{h-1}$ through a randomization vector $S_i$ (containing normally distributed values from 0 to the arbitrarily chosen maximum percentage error given to the PCs) and a Bernoulli distributed vector.
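A minimal sketch of this ensemble generation is given below. The multiplicative form of the perturbation and the way the normal draws are folded and clipped to stay between 0 and the maximum percentage error are assumptions, and `perturb_ensemble` and `max_error` are hypothetical names.

```python
import numpy as np

def perturb_ensemble(z_prior, k, max_error=0.2, rng=None):
    """Generate k ensemble members around the a priori PCs; returns shape (r, k).

    Assumption (not from the paper): member i is z_prior * (1 + sign_i * S_i).
    """
    rng = np.random.default_rng(rng)
    z_prior = np.asarray(z_prior, dtype=float)
    r = z_prior.size
    # S_i: folded normal magnitudes clipped to [0, max_error]
    S = np.clip(np.abs(rng.normal(0.0, max_error / 3.0, (r, k))), 0.0, max_error)
    # Bernoulli +/-1 vector deciding the direction of each perturbation
    signs = rng.choice([-1.0, 1.0], size=(r, k))
    return z_prior[:, None] * (1.0 + signs * S)  # one column per ensemble member
```

With `max_error=0.2`, every component of every member stays within 20% of the corresponding a priori PC.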

Transition Equations
From each member of the a priori ensemble $z^i_{h-1}$, the background state estimate for the following time interval $h$ is obtained through $F_{h|h-1}$, the propagation map that captures the evolution of the system over time through an autoregressive process, as proposed in Ashok (1996). Several approaches can be found in the literature, one of them being a polynomial approximation that interpolates each a priori vector by the average of the a priori ensemble from one time interval to the next. Furthermore, as each ensemble member is forecasted into the future time interval independently, this formulation is well-suited for parallel processing. The mean value $\bar{z}_{h|h-1}$, the deviation $\Delta Z^i_{h|h-1}$ and the covariance matrix $P_{h|h-1}$ are then computed.
The PCs background ensemble $z_{h|h-1}$ is then transformed back into OD flows to perform the assignment through a DTA simulator. The formulation (10) used for this transformation is an approximation, as the dimensionality of the PC-directions matrix $V$ has been reduced. Thus, when reconstructing the data, the dimensions that have been discarded are lost.
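The propagation of the ensemble and the statistics computed from it can be sketched as below. A first-order linear map `F` (the identity by default, i.e., a random walk) stands in for the autoregressive process, and the covariance uses the usual unbiased $1/(k-1)$ normalization; both choices are assumptions, and the function names are hypothetical.

```python
import numpy as np

def forecast_ensemble(Z_prev, F=None):
    """Propagate each member (column of Z_prev, shape (r, k)) with an (r, r) map F.

    Assumption: linear first-order transition; F defaults to the identity.
    """
    Z_prev = np.asarray(Z_prev, dtype=float)
    F = np.eye(Z_prev.shape[0]) if F is None else np.asarray(F, dtype=float)
    return F @ Z_prev

def ensemble_statistics(Z):
    """Return the ensemble mean (r,), the deviations (r, k) and the sample covariance (r, r)."""
    k = Z.shape[1]
    z_mean = Z.mean(axis=1, keepdims=True)
    dZ = Z - z_mean             # deviation of each member from the mean
    P = dZ @ dZ.T / (k - 1)     # unbiased sample covariance
    return z_mean[:, 0], dZ, P
```

Since each column is propagated independently, the matrix product in `forecast_ensemble` could equally be split across workers, one member per worker.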

Measurement Equations
The state variables until time interval $h$ are mapped onto the simulated measurements at time $h$ through the measurement equations (11), where $y^i_h$ is a vector that contains all the chosen measurements (link flows, speeds, etc.) and $H$ represents the non-linear model that maps the OD flows to the traffic measurements. $H$ is not required to have an analytic formulation, as in the LETKF algorithm the Kalman gain is not explicitly formulated. For the linear Kalman filter and other ensemble-based Kalman filters, $H$ consists of the assignment matrix, whereas for the LETKF model $H$ is considered a "black box" that contains a DTA procedure used to simulate the traffic measurements that are the outputs of the measurement equations (11).
Then, the mean $\bar{y}_h$ and the deviation $\Delta Y_h$ are computed. The covariance matrix of the measurements $R$ is usually assumed constant across time intervals.

Coordinate Change
A coordinate change from the PCs space ($r$-dimensional) to the ensemble space ($k$-dimensional) takes place. The covariance matrix for the analysis state in the $k$-dimensional space, $\tilde{P}_h$, is built from the $k \times k$ identity matrix $I$, the measurement deviations $\Delta Y_h$ and the measurement covariance $R$. The average of the analysis state in the $k$-dimensional space is then obtained from $y^0_h$, the observed traffic counts for the time interval $h$, and from $\tilde{P}_h (\Delta Y_h)^{T} R^{-1}$, which represents the Kalman gain. This transformation yields a Kalman gain that is not a function of either $H$ or its Jacobian, which is usually required in other KF approaches.
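For reference, in the LETKF of Hunt et al. (2007), on which the model is based, these two quantities take the following form when no covariance inflation is applied. Whether the paper's own equations include an inflation factor or other variations cannot be recovered from the text, so this should be read as a sketch in the notation of this section rather than as the authors' exact formulas; the symbol $\bar{w}_h$ for the mean analysis weights is introduced here for illustration.

```latex
\tilde{P}_h = \left[ (k-1)\, I + (\Delta Y_h)^{T} R^{-1}\, \Delta Y_h \right]^{-1},
\qquad
\bar{w}_h = \tilde{P}_h\, (\Delta Y_h)^{T} R^{-1} \left( y^{0}_h - \bar{y}_h \right)
```

Both expressions involve only the simulated measurement ensemble and the observations, which is why neither $H$ nor its Jacobian appears.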

Return to the PCs Space
The background PCs ensemble $z_{h|h-1}$ is corrected, and the a priori ensemble for the next time interval $z_h$ and its covariance matrix $P_h$ are obtained.
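The whole analysis step, from the coordinate change to the return to the PCs space, can be sketched as follows. The sketch follows the standard ETKF/LETKF update of Hunt et al. (2007) without covariance inflation or localization, which is an assumption about the exact variant used here; `letkf_analysis` and its argument names are hypothetical.

```python
import numpy as np

def letkf_analysis(Z_b, Y_b, y_obs, R):
    """One ensemble-space analysis step (no inflation, no localization).

    Z_b: (r, k) background PCs, Y_b: (p, k) simulated measurements per member,
    y_obs: (p,) observations, R: (p, p) symmetric measurement covariance.
    Returns the analysis ensemble (r, k) and its covariance (r, r).
    """
    k = Z_b.shape[1]
    z_mean = Z_b.mean(axis=1, keepdims=True)
    y_mean = Y_b.mean(axis=1, keepdims=True)
    dZ = Z_b - z_mean
    dY = Y_b - y_mean
    C = np.linalg.solve(R, dY).T                      # (k, p): dY^T R^{-1}, R symmetric
    P_tilde = np.linalg.inv((k - 1) * np.eye(k) + C @ dY)
    innovation = np.asarray(y_obs, dtype=float).reshape(-1, 1) - y_mean
    w_mean = P_tilde @ C @ innovation                 # (k, 1) mean weights
    # Symmetric square root of (k - 1) * P_tilde gives the weight perturbations
    evals, evecs = np.linalg.eigh((k - 1) * P_tilde)
    W = evecs @ np.diag(np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T
    Z_a = z_mean + dZ @ (W + w_mean)                  # w_mean is added to every column of W
    P_a = dZ @ P_tilde @ dZ.T
    return Z_a, P_a
```

Note that `Y_b` is obtained by running the simulator once per member, so the function never needs the measurement map itself. For a linear, scalar test case the result coincides with the classical Kalman update, which is a convenient sanity check.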

Experimental Design
The proposed PCA-LETKF has been tested first on a synthetic network and then on the real-world network of the city of Vitoria, Spain. For the application of the PCA-LETKF algorithm, a data matrix of 100 previous estimates of the starting demand is generated; the PC directions are calculated through PCA and then reduced until the remaining principal components contain 95% of the variance of the data matrix.
The synthetic network consists of 3,249 OD pairs and 395 detectors. A starting demand resulting in an uncongested network has been considered first, and different degrees of randomization of the starting demand have then been tested. Traffic counts are the only measurements available, and the number of ensembles varies from 5 to 50. The synthetic model uses two possible non-linear functions to map the complex relationship between ODs and traffic counts. In the first one, function (17), $Y_i$ represents the generic detector $i$ and $X_n$ represents the demand for the OD pair $n$. The two matrices $w_1$ and $w_2$ are randomly generated weights that relate link and demand flows for each OD/detector pair.
The second function, (18), increases the non-linearity and complexity of the relation between OD flows and link flows. It has been obtained by incorporating the random weights $w_1$ and $w_2$ in the Styblinski-Tang function (Styblinski and Tang, 1990), a commonly used benchmark objective function for optimization methods. In the synthetic experiment, PCA-LETKF results have been compared with those obtained with the standard EnKF and LETKF models.
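For readers unfamiliar with it, the standard (unweighted) Styblinski-Tang function is sketched below. How the random weights $w_1$ and $w_2$ enter function (18) is not recoverable from the text, so the weighted variant is deliberately not reproduced; only the well-known base function is shown.

```python
import numpy as np

def styblinski_tang(x):
    """Standard Styblinski-Tang function: 0.5 * sum(x_i^4 - 16 x_i^2 + 5 x_i).

    Note: this is the unweighted benchmark, not the paper's function (18).
    """
    x = np.asarray(x, dtype=float)
    return 0.5 * np.sum(x**4 - 16.0 * x**2 + 5.0 * x, axis=-1)
```

The function is non-convex with several local minima, and its global minimum lies at approximately $x_i = -2.9035$ in every dimension, which makes it a demanding test for estimation methods.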
Concerning the real-world experiment, the city of Vitoria is the capital of the Basque Autonomous Community in northern Spain and represents the typical middle-sized European city in terms of dimension and structure, composed of a city center, a motorway, and suburban areas. Its network consists of 57 traffic zones, 2,884 nodes, and 5,799 links (Figure 1); 395 detectors provide traffic count data. The mesoscopic simulator Aimsun (2017), a commercial software used by practitioners all over the world for planning and real-time management, has been adopted to map the OD flows to the traffic measurements through a Dynamic Traffic Assignment process. The simulations are run with a stochastic route choice scenario and path assignment fixed through dynamic user equilibrium.
In this experiment, the morning demand has been considered (from 7:00 a.m. to 12:00 p.m.), for a total of 20 time intervals. Two different demand scenarios are considered, one resulting in an uncongested network (84,089 vehicles) and one resulting in a congested network (158,644 vehicles).
In the real-world experiment, PCA-LETKF results have been compared with those obtained with the LETKF model.
For the evaluation of the performance of the tested algorithms, the Normalized Root Mean Square (RMSN) error (19) is used, which has been chosen as the evaluation criterion in many research papers (Ashok and Ben-Akiva, 2000; Prakash et al., 2018).
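As a point of reference, the RMSN as commonly defined in the OD estimation literature (the root of N times the sum of squared errors, divided by the sum of the observed values) can be computed as in the sketch below. It is assumed that (19) follows this common definition; the function and argument names are hypothetical.

```python
import numpy as np

def rmsn(observed, estimated):
    """Normalized root mean square error: sqrt(N * sum((est - obs)^2)) / sum(obs)."""
    observed = np.asarray(observed, dtype=float).ravel()
    estimated = np.asarray(estimated, dtype=float).ravel()
    n = observed.size
    return np.sqrt(n * np.sum((estimated - observed) ** 2)) / np.sum(observed)
```

For instance, with observed counts `[10, 10]` and estimates `[12, 8]` the squared errors sum to 8, so the RMSN is `sqrt(2 * 8) / 20 = 0.2`.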

Synthetic Experiments Results
Results in terms of estimated link flows and OD flows at the end of the model runs adopting the non-linear function (17) in the uncongested case are shown in Figure 2, where the red dashed line represents the initial error.
In the figure, the x-axis represents the number of ensembles used by each model. Intuitively, larger values are associated with better predictions but also higher computational times. The y-axis indicates the quality of the prediction associated with a specific number of ensembles in terms of RMSN. As expected, both the LETKF and PCA-LETKF outperform the traditional EnKF. Despite being a quite advanced model capable of handling non-linearity, even when using 50 ensembles, the EnKF only reduces the error from 0.45 to 0.34 (black dotted line). It is also important to point out that-in order to capture non-linear phenomena-each ensemble requires to perform an objective function evaluation. This entails running the map  linking OD flows to traffic counts 50 times for each time interval, one for each ensemble member. The main reason is that we only have 395 detectors to explain 3,249 variables. In similar conditions, the LETKF provides better results in terms of link flows, however, the PCA-LETKF (black dashdotted line) performs better already with 5 ensembles, while the normal LETKF only provide good results (RMSN ≈ 0.2) for more than 25 ensembles, where again more ensembles mean more computational time. The reason is that as of now the LETKF is not exploiting any localization strategy. This means that more ensembles are needed to learn the structure of the data. Additionally, the PCA-LETKF also performs better in terms of OD flows for most of the cases. This usually happens for a low number of ensembles-between 5 and 25-which suggests that larger ensample values increase the probability of overfitting.
Comparing the previous results with those obtained with the non-linear function (18), a similar pattern can be observed (Figure 3), with the PCA-LETKF model maintaining a constant RMSN value of ∼0.2 (56% decrease from initial error) for the first function and of ∼0.45 (65% decrease from initial error) for the second function. The estimated link flows error for the LETKF model decreases as the number of ensemble members increases, until it converges to the results of PCA-LETKF for more than 20 ensemble members. This suggests that LETKF and PCA-LETKF potentially converge on similar results, but also that LETKF is four times more demanding in terms of computational time. Finally, Figures 2 and 3 show a final error of 15-20%, which is too high for real-time applications. Real-time applications usually start from a good historical matrix, which is then corrected using models such as the Kalman Filter. In Figures 2 and 3, however, the initial error is 45%. This is unrealistic, as the Kalman filter will hardly manage to correct such a large error.
To consider the impact of the initial (seed) matrix on performance, the models have also been tested with function (17) under varying degrees of randomization of the initial error between the real matrix and the seed matrix. A series of seed matrices has been created from the real matrix by applying a random perturbation to each OD pair, defined through two random vectors. R is a normally distributed random vector with mean 0 and values ranging between 0 and the arbitrarily chosen parameter E, which represents the maximum percentage error given to each OD pair. ∆ is a ±1 Bernoulli-distributed random vector, which randomizes whether the value of each OD pair is reduced or increased. Figure 4 shows the results obtained with a 5-member ensemble and the maximum percentage error E given to the OD pairs ranging from 100 to 10%.
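The seed-matrix construction can be sketched in Python as follows. Since the exact perturbation formula is not reproduced here, this is only one plausible reading and rests on several assumptions: the seed value is taken as `flow * (1 + delta * r)`, the magnitude `r` is drawn as the absolute value of a zero-mean normal variate and clipped to `[0, max_error]`, and the standard deviation `max_error / 3` is an arbitrary choice. The function name `make_seed_matrix` is hypothetical.

```python
import random


def make_seed_matrix(real_od, max_error, rng):
    """Perturb each OD flow by a random relative error of at most max_error.

    Assumed form (not confirmed by the paper): seed = real * (1 + delta * r).
    """
    seed = []
    for flow in real_od:
        # r: error magnitude, absolute value of a zero-mean normal draw,
        # clipped to max_error (the standard deviation is an assumption).
        r = min(abs(rng.gauss(0.0, max_error / 3.0)), max_error)
        # delta: +1 or -1 with equal probability, i.e. increase or reduction.
        delta = rng.choice((-1, 1))
        seed.append(flow * (1.0 + delta * r))
    return seed


rng = random.Random(0)  # fixed seed so the sketch is reproducible
real_od = [12.0, 40.0, 7.5, 130.0]  # illustrative OD flows, not paper data
for max_error in (1.0, 0.5, 0.1):
    print(max_error, make_seed_matrix(real_od, max_error, rng))
```

With this form, each seed value stays within ±E of the real value, so E directly controls how unreliable the starting matrix is.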
It can be observed that the PCA-LETKF outperforms the other models, both in terms of link flows and OD flows, when 5 ensemble members are used in the estimation. In particular, in terms of link flows, the LETKF results show a 22.5% improvement when the maximum percentage error E between real and seed matrix is 100%, and a 42% improvement when E is 10%. The PCA-LETKF results, by contrast, show a 49.5% improvement for E = 100% and a 67% improvement for E = 10% in terms of link flows, and a 35% improvement in terms of OD flows.
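The percentage improvements quoted in this section read as relative reductions of the RMSN with respect to the initial (seed) error. The short Python sketch below makes that assumption explicit and checks it against two values reported for Figure 2 (initial error 0.45, EnKF at 0.34, PCA-LETKF at about 0.2). The function name `relative_improvement` is illustrative.

```python
def relative_improvement(initial_error, final_error):
    """Relative reduction of the error with respect to its initial value."""
    if initial_error <= 0:
        raise ValueError("the initial error must be positive")
    return (initial_error - final_error) / initial_error


# EnKF with 50 ensemble members: RMSN goes from 0.45 to 0.34.
print(round(100 * relative_improvement(0.45, 0.34)))  # 24
# PCA-LETKF: RMSN goes from 0.45 to about 0.2, the 56% decrease quoted above.
print(round(100 * relative_improvement(0.45, 0.20)))  # 56
```

A negative value would indicate that the filter has made the estimate worse than the seed.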
The scatter chart of the results obtained for the three models using a 5-member ensemble and a starting maximum percentage error E of 15%, resulting in an initial link flows error of ∼0.2, is shown in Figure 5. Figure 5A compares the values of the real OD matrix and the seed OD matrix, while Figure 5B compares the values of the observed link flows and the simulated link flows obtained from the assignment of the initial seed matrix. Then,

Vitoria Network Results
We tested the PCA-LETKF for both congested and uncongested conditions. The results of the tests performed on the uncongested scenario are presented first, in Figures 6 and 7. The seed OD matrix has been created assuming a maximum percentage error E of 50% for each OD pair with respect to the real OD matrix, resulting in a starting RMSN of 1.29 between seed and real OD flows and of 0.34 between observed and simulated link flows.
In terms of link flows, the LETKF algorithm shows a 9% improvement for a 5-member ensemble, which increases up to 16% for a 25-member ensemble, whereas the PCA-LETKF's improvement stays in the 30-33% range regardless of the ensemble size. As for the OD flows, the LETKF algorithm presents unsatisfactory results, although the error seems to decrease for larger ensembles. The PCA-LETKF, on the other hand, shows an improvement in terms of OD flows between 24 and 27%. Lastly, the LETKF algorithm presents an improvement in terms of observed speeds that ranges between 5 and 8%, whereas the PCA-LETKF presents an 11-14% improvement for the uncongested scenario.
The scatter chart of the results obtained for the models using a 5-member ensemble and a starting maximum percentage error E of 50%, resulting in an initial link flows error of 0.24, is shown in Figure 8. Figure 8A compares the values of the real OD matrix and the seed OD matrix, while Figure 8B compares the values of the observed link flows and the simulated link flows obtained from the assignment of the initial seed matrix. Then, Figures 8C,D show the distance between the observed link flows and the estimated link flows obtained through PCA-LETKF.
The models have then been tested with varying degrees of reliability of the seed OD matrix, considering values of the maximum percentage error E between OD pairs that vary between 50 and 10%. The results are shown in Figures 9 and 10. PCA-LETKF presents an improvement in terms of link flows that varies between 28% (for a higher error between the real OD matrix and the seed OD matrix) and 47%, whereas the LETKF algorithm barely shows any improvement (4-9%), as 5 ensemble members are far too few to statistically represent a large network; the same applies to speeds. It follows that the LETKF also fails to provide satisfactory results when it comes to OD flows, while the PCA-LETKF provides an improvement in terms of OD flows ranging between 24 and 35% and an improvement in terms of speeds between 11 and 14%.
The same experiments have been repeated on the congested network, as shown in Figures 11 and 12. The seed OD matrix has been created from the real OD matrix assuming a maximum percentage error of 15% for each OD flow. Even so, the initial error between the observed counts and the simulated link flows obtained from assigning the seed matrix onto the network is much higher than what has been observed for the uncongested scenario.
Consistently with what has been observed on the synthetic network, the PCA-LETKF performance is more or less constant regardless of the size of the ensemble, whereas the LETKF results improve as more ensemble members are used. In terms of link flows, the PCA-LETKF improves the RMSN by 34-36%, whereas the LETKF shows an improvement that goes from only 7% up to 28% when 25 ensemble members are used. In terms of OD flows, both algorithms show unsatisfactory results: the PCA-LETKF neither improves nor worsens the results compared to the seed matrix, while the LETKF shows a 10-16% increase of the RMSN between OD flows. As for speeds, the LETKF goes from a 2% improvement when 5 ensemble members are used up to a 10% improvement when 25 are used, whereas the PCA-LETKF shows a 14-15% improvement. The scatter chart of the results obtained using 5-member ensembles is shown in Figure 13.
Lastly, LETKF and PCA-LETKF have been tested with varying degrees of reliability of the seed OD matrix, with values of the maximum percentage error E between OD pairs starting from 15% and then going down to 3%, while the number of ensemble members is kept constant (k = 5) throughout the tests. Results are shown in Figures 14 and 15.
In terms of link flows, the PCA-LETKF results improve by 36-37% for E = 15% and E = 10% and by 20-22% for E = 5% and E = 3%, whereas the LETKF improvements stay in the 6-8% range throughout. As for OD flows, the LETKF shows unsatisfactory results, whereas for the PCA-LETKF the results vary depending on the seed matrix: in some instances, the error between real OD flows and estimated OD flows is neither higher nor lower than the starting error, while in other instances improvements of up to 22% are observed. As for speeds, the error decreases by between 2 and 6% for the LETKF algorithm and by between 9 and 14% for the PCA-LETKF algorithm.

CONCLUSIONS
This paper introduces an ODDE approach that combines Principal Component Analysis and the Local Ensemble Transformed Kalman Filter (PCA-LETKF). The advantages of this approach are two-fold:
- PCA-LETKF does not require an analytical formulation of the assignment matrix, which is implicitly captured by the DTA simulator. This means that multiple data sources can be utilized in the measurement equations. In this work, traffic counts and speeds have been utilized, but the approach allows for other sources, such as FCD.
- PCA-LETKF exploits the "local" peculiarity of the LETKF algorithm by finding the spatial correlation that exists between the variables, through the implementation of Principal Component Analysis to reduce the dimensionality of the problem. In the case studies presented in this paper, the initial 3,249 variables x to estimate have been reduced to 195 variables z, which contain 95% of the variance. The dimensionality of the problem has thus been reduced by a factor of about 17.
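To illustrate how a variance threshold translates into a number of retained components, the Python sketch below picks the smallest number of leading principal components whose eigenvalues account for a target share of the total variance. It assumes the eigenvalues of the covariance matrix of the historical OD samples are already available. The function name `components_for_variance` and the eigenvalues are illustrative, not values from the case studies.

```python
def components_for_variance(eigenvalues, target=0.95):
    """Smallest number of leading components reaching the target variance share."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("the eigenvalues must have a positive sum")
    cumulative = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= target:
            return count
    return len(ordered)


# Illustrative spectrum: a few dominant components and a small tail.
toy_eigenvalues = [6.0, 3.0, 0.7, 0.2, 0.1]
print(components_for_variance(toy_eigenvalues))

# Reduction factor reported above: 3,249 variables x down to 195 variables z.
print(3249 / 195)
```

The steeper the eigenvalue spectrum, that is, the stronger the correlation between OD pairs, the fewer components are needed to reach the same threshold.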
Reducing the dimensionality of the problem tackles two major issues when it comes to the applicability of ensemble Kalman filters to online applications: computational times and undersampling. Computational time, for online dynamic OD estimation, is a major constraint which generally makes ensemble-based filters impractical for anything but very small networks. For a medium-large sized network like that of the city of Vitoria, both the EnKF and LETKF require a large ensemble to correctly explain the system dynamics, making them computationally infeasible for online applications, as every ensemble member requires a run of the DTA simulator, the most computationally expensive part of the model. Additionally, deciding the right number of ensemble members is a challenging task. Both the EnKF and LETKF show that a small ensemble is not sufficient to generate accurate predictions, while a large one leads to overfitting. The PCA-LETKF, on the other hand, is much more robust in terms of outputs and less prone to overfitting.
The PCA-LETKF approach shows positive results that outperform those of the EnKF and LETKF with as few as 5 ensemble members, with the estimation of the OD flows over a 5-h long time horizon taking about 5 h as well, meaning that the estimation of the OD flows of a single time interval takes roughly the same amount of time as the interval itself.
One of the limits of ensemble-based Kalman filters is that, if the chosen ensemble is too small to statistically represent the state and adequately span the sub-space, the system will be undersampled, which generally leads to a series of unwanted errors affecting the performance of the ensemble filters (spurious correlation, inbreeding, filter divergence). This explains the unsatisfactory results of EnKF and LETKF when it comes to correctly estimating the OD flows using small ensembles. As for PCA-LETKF, instead, its variables z (the Principal Components) already contain the information of 100 data samples computed beforehand (i.e., the outputs of 100 previous offline calibrations), which is why, in all the tests conducted on both the synthetic and the Vitoria network, PCA-LETKF outperforms both EnKF and LETKF. The tests show good results when it comes to predicting the actual traffic conditions on the network. As for OD flows, the EnKF and LETKF show unsatisfactory results for small ensembles, whereas the PCA-LETKF's results are mixed.
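The spurious correlations caused by undersampling can be illustrated with a small, self-contained Python experiment. It is not one of the experiments of this paper: it simply draws pairs of independent Gaussian variables and measures how large their sample correlation appears when estimated from a 5-member or a 100-member ensemble. All names and sizes are illustrative.

```python
import math
import random


def sample_correlation(a, b):
    """Pearson correlation between two equal-length samples."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def mean_spurious_correlation(members, pairs, rng):
    """Average |correlation| between independent variables for a given ensemble size."""
    total = 0.0
    for _ in range(pairs):
        a = [rng.gauss(0.0, 1.0) for _ in range(members)]
        b = [rng.gauss(0.0, 1.0) for _ in range(members)]
        total += abs(sample_correlation(a, b))
    return total / pairs


rng = random.Random(42)  # fixed seed for reproducibility
print("5 members:  ", mean_spurious_correlation(5, 200, rng))
print("100 members:", mean_spurious_correlation(100, 200, rng))
```

Although the true correlation is zero in both cases, the small ensemble reports a much larger apparent correlation on average, and this is the kind of sampling error a filter then turns into incorrect updates of unrelated variables.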
The way traffic counts and speeds, whose variances can be very different, are jointly inserted in the equations could be fine-tuned to better exploit the role that speeds play in representing actual traffic conditions in congested networks. All kinds of data can be incorporated in the proposed model; so far, however, we have only addressed how to incorporate link-level data, whereas the incorporation of point-to-point data, such as travel times, could be a future development. Furthermore, the assignment part of the algorithm could also be sped up further. In the transition equations, a simple polynomial interpolation has been used to forecast the variables from one time interval to the next, as the definition of the best possible transition equation is outside the scope of this paper. However, several autoregressive models have already been proposed in the literature, and many more, such as Gaussian processes, have yet to be applied to the OD estimation problem specifically. Therefore, future developments will cover an evaluation of the best approach to maximize the predictive performance of PCA-LETKF. Lastly, an optimal covariance inflation factor could be defined for the Vitoria network, as it has only been tested for the synthetic experiment.
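For readers unfamiliar with covariance inflation, the Python sketch below shows multiplicative inflation, a common variant in which each member's deviation from the ensemble mean is scaled by a factor slightly larger than 1. Whether this is the scheme used in the synthetic experiment is an assumption, and the function name `inflate_ensemble` and the numbers are illustrative.

```python
def inflate_ensemble(ensemble, factor):
    """Scale each member's deviation from the ensemble mean by `factor`.

    `ensemble` is a list of members, each a list of state variables.
    The ensemble mean is preserved; the sample variance is multiplied by factor**2.
    """
    n_members = len(ensemble)
    n_vars = len(ensemble[0])
    mean = [sum(member[j] for member in ensemble) / n_members for j in range(n_vars)]
    return [
        [mean[j] + factor * (member[j] - mean[j]) for j in range(n_vars)]
        for member in ensemble
    ]


# Two members, two state variables (illustrative values only).
ensemble = [[1.0, 10.0], [3.0, 14.0]]
print(inflate_ensemble(ensemble, 1.5))
```

A factor of 1 leaves the ensemble unchanged, while larger factors widen the spread and counteract the tendency of small ensembles to underestimate the forecast uncertainty.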

DATA AVAILABILITY STATEMENT
The original contributions generated for the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author/s.