Dynamic and interpretable hazard-based models of traffic incident durations

Understanding and predicting the duration or "return-to-normal" time of traffic incidents is important for system-level management and optimisation of road transportation networks. The increasing real-time availability of multiple data sources characterising the state of urban traffic networks, together with advances in machine learning, offers the opportunity for new and improved approaches to this problem that go beyond static statistical analyses of incident duration. In this paper we consider two such improvements: dynamic updating of incident duration predictions as new information about incidents becomes available, and automated interpretation of the factors responsible for these predictions. For our use case, we take one year of incident data and traffic state time-series from the M25 motorway in London. We use it to train models that predict the probability distribution of incident durations, utilising both time-invariant and time-varying features of the data. The latter allow predictions to be updated as an incident progresses and more information becomes available. For dynamic predictions, time-series features are fed into the Match-Net algorithm, a temporal convolutional hitting-time network recently developed for dynamical survival analysis in clinical applications. The predictions are benchmarked against static regression models for survival analysis and against an established dynamic technique known as landmarking, and are found to perform favourably by several standard comparison measures. To provide interpretability, we utilise the concept of Shapley values from the domain of interpretable artificial intelligence to rank the features most relevant to the model predictions at different time horizons. For example, the time of day is always a significantly influential time-invariant feature, whereas the time-series features strongly influence predictions at 5 and 60-minute horizons.

it every day and 1 billion tonnes of freight being transported across it each year [see Highways England, 2017]. There is little chance that, in the short term, the SRN will see significant infrastructure changes; however, data describing the traffic state across the network is already being collected and made available for analysis. Although the SRN is a significant component of the UK transport infrastructure, congestion remains a major problem on the network, with the cited report suggesting that 75% of businesses consider tackling congestion on the SRN to be important or critical to their business.
Traffic congestion can be broadly separated into two types: recurrent and non-recurrent. Recurrent congestion is simply the result of demand regularly exceeding capacity on busy sections of road during 'rush hour' periods. Non-recurrent congestion, on the other hand, is mainly caused by traffic incidents and other rare events [Li et al., 2018]. To better manage traffic during these incidents, traffic operators require reliable estimates of how long a particular incident will last. Whilst there is significant existing work on modelling incident duration, many fundamental challenges remain that are both of practical interest to traffic management centres and active areas of academic research. A review of existing work on this problem is found in Li et al. [2018], where six future challenges for incident duration prediction are listed. These are: combining multiple data sources, time-sequential prediction models, outlier prediction, improvement of prediction methods through machine learning or alternative frameworks, incorporating recovery times, and accounting for unobserved factors. Our work aims to address five of these challenges, combining time-series from sensor networks with incident reports to issue dynamically updated duration predictions. Inspired by approaches adopted in medical applications, we consider classical survival approaches and non-linear, machine learning methods for prediction, understanding where gains in performance are attained. Finally, as we are able to observe traffic behaviour over long periods of time through the sensor network, we are able to judge not just when a traffic incident has been cleared, but when the traffic state has returned to normal operating conditions, thereby incorporating recovery times into our modelling approach.
The rest of this paper is structured as follows. Firstly, we overview existing work on incident duration prediction, specifying how the duration is defined, existing methods and offer more detail on challenges highlighted in the literature. Secondly, we detail our dataset, its collection and processing and an initial exploratory analysis of it. We then describe the considered modelling approaches, both for static and dynamic predictions, and results for our dataset. Finally, we consider what variables are important for the models and end by summarising our main findings.
Note that throughout this paper, an 'event' is defined as the traffic state on a section of road returning to a baseline behaviour. As such, when we say an event has 'occurred', we mean the traffic state has recovered. This terminology is standard in the survival analysis literature.

Background
In this section, we summarise existing work on traffic incident duration analysis that is relevant to our own, along with relevant research from other disciplines that has influenced our approach.
Before any methodologies are considered, it is first important to define exactly what is being modelled. A traffic incident is considered to have four different time-phases: the time taken to detect and report an incident, the time to dispatch an operator to the scene, the travel time of the operator to the scene, and finally the time to clear the incident. Such a framework is described in Li et al. [2018]. We consider an incident to have 'started' when the incident is reported by the human operator, who, in the case of the SRN, would use a series of cameras to observe the state of the road. Our focus is to model the time from this point until both the incident has been physically cleared and traffic behaviour on the road has returned to some sense of 'normal' behaviour. We describe exactly how we determine normal behaviour in section 3.1. In brief, we use a seasonal model of the speed time-series to estimate when the traffic speed on a section of road has recovered to a level close to what would normally be expected for that section at that time of day. The idea of using such a speed profile is also considered in Hojati et al. [2014], where the 'total incident duration' is defined as the time from incident start until the speed has recovered to the profile.
Whatever explicit definition of duration is used, there is an enormous amount of work on predicting incident durations. An initial step of many of these is to determine an appropriate distributional form that the durations take. These are typically heavy tailed and empirically show significant variation. Examples of this include modelling the distribution of incident durations as log-normal in Golob et al. [1987] and Chung and Yoon [2012], log-logistic in Zhang and Khattak [2010] and Junhua et al. [2013], Weibull in Hojati et al. [2014] and Kaabi et al. [2012], and generalised F in Ghosh et al. [2014]. In the latter, it is noted that the generalised F distribution can be equivalent to many other distributional forms for particular parameter choices, including the exponential, Weibull, log-normal, log-logistic and gamma distributions. Hence, it offers more freedom than choosing any single one of these forms. Indeed, the authors state that the increased flexibility it offers allows it to fit the data better. Even further flexibility in the distributional choice is given in Zou et al. [2016], where it was shown that modelling the distribution as a mixture, that is the sum of multiple components, may improve model performance. Specifically, they consider a 2-component log-logistic mixture model, where the final distribution is the weighted average of two log-logistic distributions. Finally, other authors Smith and Smith [2001] have had difficulty finding statistically defensible distributional fits to their data, although it should be noted that different definitions of incident durations will likely impact this.
Using a common probability distribution is appealing as it limits a model's freedom and can be easier to fit to data. However, it is clear that authors are exploring more complex distributional forms to better model the data, and seeing better results when they do so, as in Ghosh et al. [2014] and Zou et al. [2016]. Mixture distributions are a naturally appealing form, as one assumes the data is generated by multiple sub-populations, within which covariates can have different effects. We incorporate ideas from this section of the literature by considering models that assume log-normal and Weibull distributions on incident durations, as well as mixture distributions, where one supposes the data is generated by one of many sub-populations, each of which has a parametric form. Recent applications of survival analysis in healthcare Lee et al. [2018] have removed distributional assumptions entirely, and instead formulate models that output distributions with no closed form. This is done by treating the output space as discrete and treating the model output as a probability mass function (PMF) defined over it, allowing for construction of a fully non-parametric estimate. Such an approach offers even more freedom, and as more complex distributions are being used in the traffic literature to provide such freedom, one could ask whether removing the distributional assumption entirely can improve model performance.
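As a simple illustration of how such a mixture is evaluated, the sketch below (our own, with arbitrary parameter values, not taken from any cited work) computes the density of a 2-component log-logistic mixture of the kind considered in Zou et al. [2016]:

```python
import math

def loglogistic_pdf(t, alpha, beta):
    """PDF of a log-logistic distribution with scale alpha and shape beta."""
    z = (t / alpha) ** beta
    return (beta / alpha) * (t / alpha) ** (beta - 1) / (1.0 + z) ** 2

def mixture_pdf(t, w, params1, params2):
    """Weighted average of two log-logistic components, weight w on the first."""
    return w * loglogistic_pdf(t, *params1) + (1 - w) * loglogistic_pdf(t, *params2)

# Hypothetical components: a 'short incident' population (scale 20 minutes)
# and a heavier-tailed 'long incident' population (scale 90, smaller shape).
density = mixture_pdf(30.0, w=0.7, params1=(20.0, 2.5), params2=(90.0, 1.5))
```

Fitting the weight and component parameters to data would typically be done by maximum likelihood, for example via expectation-maximisation.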
After a distribution is determined, many works apply methods from survival analysis, a common choice being the Accelerated Failure Time (AFT) model. Example applications are given in Nam and Mannering [2000], Chung et al. [2010] and Chung et al. [2015]. Such a model assumes that each covariate either accelerates or decelerates the life-time of a particular individual. These models are widely used and offer an interpretable means of investigating which factors strongly or weakly influence incident duration. However, it can be difficult to incorporate time-series features into them. Whilst it is possible to model time-varying effects of covariates, for example in Zeng and Lin [2007], it is more complex to derive optimal features from a time-series that are also interpretable. Clearly, AFT models are useful in that they produce interpretable outputs and relationships between variables, and can model the non-Gaussian distributions empirically observed in incident duration data. The other well-known classical survival model one could apply is the Cox regression model Cox [1972]. Such a model assumes a baseline hazard function for the population, describing the instantaneous rate of events, from which survival probabilities can be calculated; see section 4.2 for more details. Covariate vectors for individuals shift this baseline hazard, allowing for individualised predictions. Applications of this to transportation problems are given in Oralhan and Göktolga [2018], Wu et al. [2013] and Li and Guo [2014].
Whilst these two methods are widely used, a number of alternatives exist. One such method is a sequential regression approach, an early example being Khattak et al. [1995]. Importantly, the authors identify that more information describing an incident will become available over time, and hence consider a series of models to make sequential predictions. However, their sequential information was more descriptive from an operational stand-point, for example identifying damage to the road and response time of rescue vehicles. We do not have this, and instead focus on the minute-to-minute updates provided through traffic time-series recorded by sensors along the road, and how to engineer features from these. Truncated regression approaches were discussed in Zhang and Khattak [2010], where there was a specific effort to model 'cascading' incidents (referred to as primary and secondary incidents in some literature). The thought here was that incidents that occur nearby in time and space would lead to a significantly longer clearance time for the road segment, and hence this should be accounted for in modelling. We performed an extensive analysis of primary and secondary incidents in our dataset in Kalair et al. [2021]. Building on this and the previous cited work, we include a cascade variable in the models, allowing this to influence duration predictions.
Further regression approaches are explored in Garib et al. [1997] and Weng et al. [2015], and switching regression models are used in Ding et al. [2015]. Note that in Weng et al. [2015], the authors first cluster the incident data, then use this clustering as additional features for a model, further suggesting that there is some element of sub-population structure in the data. A final relevant regression-based work is Khattak et al. [2016], where quantile regression is used to model incident durations. This is a natural choice, as there is a clear skew in the empirically observed duration distributions, and if one does not want to assume a particular distributional form, one can instead model properties of the distribution, in this case quantiles of the data.
Further methodologies to note are those based on tree models or ensembles of them, particularly because we apply such a method later in this work. Tree-based models are discussed in Smith and Smith [2001], where the authors compare models that assume particular incident duration distributions, a k-nearest-neighbour approach and a classification tree method based on predicting 'short', 'medium' and 'long' incidents. They concluded that no model provided accurate enough results on their dataset to warrant industrial implementation, but found the classification tree was the preferred model of those considered. Further, Zhan et al. [2011] considered a regression tree approach, where the terminal nodes of each tree were themselves multivariate linear models. Such an approach avoids binning of incidents into pre-defined categories, and achieved 42.70% mean absolute percentage error, better than the compared reference models. From an interdisciplinary setting, alternative tree methods have been considered, namely 'random survival forests' Ishwaran et al. [2008], an extension of random forests to the survival analysis setting. In such a framework, the terminal nodes of each tree specify cumulative hazard functions for all data-points that fall into that node, and these hazards are combined across many trees to determine an ensemble hazard. There is no defined distributional assumption in such a model, again leaning towards the side of freedom in allowing the data to construct its own hazard function estimate rather than parametrizing an estimated form.
Neural networks are a rapidly developing methodology in machine learning; they have been used extensively in incident duration prediction and form the basis for some of our considered approaches. Examples include Wang et al. [2005], Wei and Lee [2005] and Lee and Wei [2010]. Each of these applies feed-forward neural networks to determine estimates of incident durations, and in Lee and Wei [2010] in particular, sequential prediction was considered, using two models. The first took standard inputs, and the second took these along with detector information near the incident. These were input into feed-forward neural networks and used to generate point predictions. Additional neural network applications are given in Valenti et al. [2010], where their performance is compared to that of linear regressions, decision trees, support vector machines and k-nearest-neighbour methods. The authors find that different models have optimal performance at different incident durations, suggesting there is still much to improve on feed-forward networks. A final point to note is that neural networks have been applied to survival analysis problems in healthcare a number of times. Examples include Katzman et al. [2018], which develops a Cox model, replacing linear regression with a neural network output; Lee et al. [2018], which removes any distributional assumptions; and Jarrett et al. [2020], which uses a sliding window mechanism and temporal convolutions for dynamic predictions. We consider whether the latter two are useful in the application to traffic incidents later in this work. Specifically, using Jarrett et al. [2020] offers an automated way to engineer features from our sensor network data, whilst being able to model a parametric or non-parametric output.
Whilst we have discussed a number of different methodological approaches, the actual features used to make these predictions, regardless of approach, appear quite consistent across different works. In Garib et al. [1997], the authors state that using number of lanes affected, number of vehicles involved, truck involvement, time of day, police response time and weather condition, one can explain 81% of variation in incident duration. An overview of various feature types is given in Li et al. [2018], identifying incident characteristics, environmental conditions, temporal characteristics, road characteristics, traffic flow measurements, operator reactions and vehicle characteristics as important factors when modelling incident durations.
We note that we are not the first to use sensor data in incident duration analysis. Speed data collected from roads was used in Ghosh et al. [2014], where the authors included two features based on the speed series: whether the difference between the 15th and 85th percentiles of the speed data was greater than 7 mph, and whether the 85th percentile for speed was less than 70 mph. Further, in Wei and Lee [2007] and Lee and Wei [2010] the authors train two feed-forward neural networks, with input features that include the speed and flow for detectors near the incident. The first model provided a forecast just before the incident occurred, whereas new data was fed into the second whenever available, updating predictions as time progressed. In the second paper, the focus was on reducing the dimensionality of the problem through feature selection via a genetic algorithm. We fundamentally differ from these works for multiple reasons. With regard to Ghosh et al. [2014], this was an analysis of which factors impact incident duration the most, and whereas they used hazard-based models for analysis, we use the sensor data to engineer dynamic features, either manually or through temporal convolutions. Additionally, we differ from Wei and Lee [2007] and Lee and Wei [2010] in how we determine features and network structure, and in predicting an output distribution, not just a point estimate.
With dynamic prediction being highlighted as an area to address in the literature, some recent papers have looked at this problem through methods different to those already discussed. One example used a topic model to interpret written report messages, with predictions made as new textual information arrived. Further, multiple regression models were built in Ghosh et al. [2019], where, as different features became available, data-points were assigned to clusters and a prediction was generated using a regression model tailored to each cluster. Lastly, Li et al. [2020] considered a five-stage approach, where a prediction was made at each stage and different features were available defining these stages. These included vehicles involved, agency response time and number of agencies responding. While this structured approach addresses some aspects of dynamic prediction, the purely data-driven approach which we present in this paper provides much more flexibility.

Data Collection & Processing
As discussed, various traffic data for the SRN is provided by a system called the National Traffic Information Service (NTIS), comprising both historic and real-time data. Whilst the SRN includes all motorways and major A-roads in the UK, we focus our analysis on one of the busiest UK motorways, the M25 London Orbital, pictured in Figure 1.

Figure 1: The M25 London Orbital, roughly 180 kilometres in length. The Dartford crossing, located in the east, is a short segment of road that is not included in the data set.

Inside NTIS,
roads are represented by a directed graph, with edges (referred to as links from now on) being segments of road that have a constant number of lanes and no slip roads joining or leaving. We extract all links that lie on the M25 in the clockwise direction, yielding a subset of the road network to collect data for. Placed along links are physical sensors called 'loops', which record passing vehicles and report averaged quantities each minute.
The most relevant components of NTIS to our work are incident flags that are manually entered by traffic operators. These flags specify an incident type, for example accident or obstruction, the start and end time, the date and the link the incident occurred on. Accompanying this information are further details on the incident, for example what vehicles it involved and how many lanes were blocked, if any. We extract all incidents that occurred on the chosen links between September 1st 2017 and September 30th 2018. Along with this, NTIS provides time-series data for each link, recording the average travel time, speed and flow for vehicles and publishing it at 1-minute intervals. These values are determined by a combination of floating vehicle data and loop sensors. As well as the incident flags, we extract the time-series of these quantities in the specified time period. In total, our dataset has 4415 incidents that we train on, and 2011 incidents that we use for out-of-sample validation of the models. Note that of these 4415 training incidents, we use 1324 as hold-out data to judge when to stop training our machine learning models.

Establishing A Data-Driven Baseline
As discussed, incident duration consists of 4 distinct phases, and we want to model the time it takes for a link to recover to some baseline state. How to determine this baseline, ensuring it is robust to outliers whilst retaining important features of the data, is an open problem. However, a natural way to approach it is to develop some seasonal model of behaviour on a link and use this seasonality as a baseline behaviour. We define such a baseline by first taking the speed time-series and pre-filtering it, removing the periods impacted by incidents. After, we account for any potentially missing incident flags by further removing any remaining periods with a severity higher than 0.3. Severity is defined as in Kalair and Connaughton [2020], which, in short, considers the joint behaviour of the speed-flow time-series and asks which points correspond to large fluctuations from a region of typical behaviour in this relationship. We then take this filtered series and extract the seasonal components, in our case daily and weekly components, to capture natural variability on the link. Note that inspection of the data shows no trend.
To construct a seasonal estimate, we consider simple phase averaging, taking the median of data collected at a given time-point in a week, and STL decomposition Cleveland et al. [1990]. We see little difference between the two methods, so choose the phase-average baseline for simplicity. We define one speed baseline for each link, establishing a robust profile describing the speed behaviour in a 'typical week', and replicate this over the entire data period. It is robust in the sense that we have pre-filtered the extreme outliers. It also captures the clear seasonality in the problem and can be applied to new test data without any difficulty. An example of this profile for a single link, along with the residuals from it, is given in Figure 2A, and an example incident in Figure 2B. We specifically mark where the NTIS flag was raised, where it was closed, and where the speed behaviour returned to the baseline. (We are trying to predict the time until the link returns to normal operating conditions, rather than the time at which the NTIS flag is turned off.) Using this methodology, we process our dataset such that we have a set of records with the start time of each incident and the time at which the link returned to normal. We include a safety margin in this baseline to account for any persistent but minor problems, shifting it down by 8 km/h (≈ 5 mph). A link is considered to have returned to normal when its speed is above this shifted baseline for at least 3 consecutive minutes, acting as a persistence check.
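The phase-average baseline and persistence check can be sketched as follows; this is an illustrative simplification (pure-Python lists, configurable period) rather than our exact implementation, and the function names are our own:

```python
from statistics import median

def phase_average_baseline(speeds, period=7 * 24 * 60):
    """Median speed at each minute-of-week position, forming a 'typical week'.

    speeds: per-minute speeds, with incident-affected points already removed
    (set to None) by the pre-filtering step.
    """
    phases = [[] for _ in range(period)]
    for i, s in enumerate(speeds):
        if s is not None:
            phases[i % period].append(s)
    return [median(p) if p else None for p in phases]

def recovery_time(speeds, baseline, start, margin=8.0, persistence=3):
    """First minute >= start at which the speed stays above (baseline - margin)
    for `persistence` consecutive minutes; margin in km/h."""
    run = 0
    for t in range(start, len(speeds)):
        if speeds[t] > baseline[t % len(baseline)] - margin:
            run += 1
            if run == persistence:
                return t - persistence + 1
        else:
            run = 0
    return None  # link never recovered within the observed window
```

With the real data, `period` is one week of minutes (10080) and `margin` is the 8 km/h safety shift described above.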

Methodology
Our modelling approach compares multiple methodologies to predict incident durations. A concise summary with all models considered and the main points of note about them is available in the supplementary material.

Incident Features
As discussed, there is much work in the literature identifying the most influential features for incidents. However, we are restricted in two ways. The first is that at any time-step of a prediction, we only incorporate features that would be known at that step. The second is that our dataset does not comprehensively capture some features that are likely to be highly informative of the duration; for example, we do not have in-depth injury reports or the arrival time of police forces. The features we do use are separated into time-invariant and time-varying categories. The time-varying features are derived from time-series of speed, flow and travel-time provided at a 1-minute resolution, which we take for the link an incident occurs on. We first remove the seasonality from each of these series by determining 'typical weeks', just as in the case of the speed baseline, then subtracting this from the time-series to generate a set of residual series. We then hand-engineer features that may be of use for some simple dynamic models, computing the gradient of the residuals at some time t using the previous 5 minutes of data from t, as well as simply recording the value of the series. These are used as initial features from the time-series as they provide a sensible and intuitive summary of an incident from the sensors; of course, more complex features might be derived from the series. We consider models that do just this by applying temporal convolutions across the residual series, and compare the modelling results in section 5.
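A minimal sketch of these hand-engineered dynamic features — the residual value at time t and its gradient over the previous 5 minutes — might look like the following; this is our own illustration, using a least-squares slope for the gradient:

```python
def residual_series(series, typical_week):
    """Subtract the seasonal 'typical week' profile from a per-minute series."""
    period = len(typical_week)
    return [x - typical_week[i % period] for i, x in enumerate(series)]

def gradient_feature(residuals, t, window=5):
    """Least-squares slope of the residuals over the `window` minutes up to t."""
    ys = residuals[max(0, t - window + 1): t + 1]
    n = len(ys)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2.0
    y_mean = sum(ys) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(range(n), ys))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den

# At prediction time t, the dynamic features for one series are the residual
# value itself and its recent gradient:
# features_t = [residuals[t], gradient_feature(residuals, t)]
```

The same two quantities are computed for the speed, flow and travel-time residual series.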
The time-invariant features, on the other hand, are detailed in Table 1, and are a combination of what an operator might know using existing camera and phone coverage of the SRN. For completeness, the time-varying features are also detailed there. Whilst not exhaustive, the features in Table 1 offer a combination of contextual information in time, space and specific to the incident. Our choice of bins for time of day reflects typical commuting patterns in the UK. Some authors use day and night time as a separation, as in Zou et al. [2016], whereas others account for peak times in their binning, as in Nam and Mannering [2000] and Khattak et al. [1995], as we have.

Survival Analysis Methods
We first offer more detail on models in the vein of classic survival analysis that we will consider for our problem. We remind the reader that, following the convention in the survival analysis literature, we use the word "event" to mean the occurrence of the outcome of interest. In our case, this is the end of a traffic incident, as determined by the return of the speed to within some threshold difference from the profile value. Survival analysis methods aim to model some property of the duration distribution. Let f(t) be the PDF of incident durations, and F(t) be the cumulative distribution function (CDF). A key component in survival analysis is the survival function S(t), which in our context describes the probability that an incident has not ended by time t, where time is measured from the start of the incident. Denoting the event time (the incident end time) by T, we formally write:

S(t) = P(T > t) = 1 − F(t).   (1)

Further, many survival analysis methods are concerned with the hazard function λ(t), describing the instantaneous rate of occurrence of events. One can show that:

λ(t) = f(t) / S(t).   (2)

In practice, this means that the instantaneous rate of events is equal to the density of events at that time divided by the probability of surviving to that time. A final concept of note is the cumulative hazard function Λ(t), which is the integral of the hazard function between time 0 and t:

Λ(t) = ∫_0^t λ(s) ds.   (3)

Using these concepts, the first model we apply is a Cox regression model, reviewed in Kumar and Klefsjö [1994]. Suppose some 'individual' i (an incident in this application) has covariate vector x_i. A Cox model specifies the hazard function for individual i as:

λ_i(t) = λ_0(t) exp(x_i^T β),   (4)

where λ_0(t) is some baseline hazard at time t, and β is a vector of regression coefficients. The baseline hazard describes the hazard function for an individual with all covariates equal to 0, and it is adjusted for a particular individual by the exponential of the regression term.
In this original formulation, the covariate effect is constant in time, but the baseline hazard varies in time. Various methods exist for estimating a baseline hazard function, with more details found in Kleinbaum and Klein [2012]. In short, the baseline hazard is assumed to be piecewise constant and determined without any distributional assumptions, allowing the data to construct an approximation. One can determine β by optimizing the partial likelihood:

L(β) = ∏_{i=1}^{n} [ exp(x_i^T β) / Σ_{l=1}^{n} Y_l(τ_i) exp(x_l^T β) ]^{δ_i},   (5)

where δ_i is 1 if the event time is observed and 0 if it is censored, and Y_l(τ_i) is 1 if individual l is still at risk at time τ_i. Here τ_i represents the recorded incident duration for incident i. The baseline hazard function can be determined using the Breslow estimator:

λ̂_0(τ_i) = d_i / Σ_{l ∈ R(τ_i)} exp(x_l^T β),   (6)

where R(τ_i) is the set of at-risk individuals at time τ_i and d_i is the number of events that occurred at the i-th time. We use the implementation of Cox models provided in the R package 'survival' Therneau [2015]. Ties in event times are handled using the 'Efron' method, detailed in Efron [1977], altering the likelihood in Eq. (5).
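To make the partial likelihood in Eq. (5) concrete, the sketch below evaluates its negative logarithm directly for a toy dataset; it is our own illustration, assumes no tied event times, and omits the 'Efron' tie correction used by the 'survival' package:

```python
import math

def cox_neg_log_partial_likelihood(beta, X, durations, observed):
    """Negative log of the Cox partial likelihood.

    beta: coefficient list; X: covariate vectors; durations: recorded
    durations tau_i; observed: delta_i flags (1 = event seen, 0 = censored).
    """
    def risk(x):
        # exp(x^T beta), the relative risk for covariate vector x
        return math.exp(sum(b * v for b, v in zip(beta, x)))

    nll = 0.0
    for i, (tau_i, delta_i) in enumerate(zip(durations, observed)):
        if not delta_i:
            continue  # censored incidents contribute only through risk sets
        # Risk set: individuals whose duration is at least tau_i (Y_l = 1)
        denom = sum(risk(X[l]) for l in range(len(X)) if durations[l] >= tau_i)
        nll -= math.log(risk(X[i]) / denom)
    return nll
```

Minimising this quantity over β (for example by Newton's method) recovers the Cox coefficient estimates; note that censored incidents enter only through the risk-set denominators.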
The next model we apply is an accelerated failure time model (recall AFT from the literature review). Such a model supposes relationships between the survival and hazard functions of the form:

S_i(t) = s_0(t exp(−x_i^T β)),   (7)

λ_i(t) = exp(−x_i^T β) λ_0(t exp(−x_i^T β)).   (8)

Here, s_0(t) and λ_0(t) represent assumed baseline survival and hazard forms, and covariates 'accelerate' or 'decelerate' the survival time of particular individuals. Given an assumed form, for example Weibull, log-normal or so on with parameters θ, one can then fit this model through maximum likelihood, optimizing:

L(θ, β) = ∏_{i=1}^{n} f(τ_i | θ, β)^{δ_i} S(τ_i | θ, β)^{1−δ_i}.   (9)

A common way to interpret the AFT model is as a regression on the log of the durations:

log(τ_i) = x_i^T β + ε_i,   (10)

where ε_i is noise, with some assumed form. We implement the models using the R package 'flexsurv' detailed in Jackson [2016].
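As an illustration of how such a model is fit, the sketch below writes out the censored log-likelihood for a log-normal AFT specification; this is our own minimal version, not the 'flexsurv' implementation, and the function names are ours:

```python
import math

def lognormal_logpdf(t, mu, sigma):
    """log f(t) for a log-normal distribution with log-scale mu, shape sigma."""
    z = (math.log(t) - mu) / sigma
    return -math.log(t * sigma * math.sqrt(2 * math.pi)) - 0.5 * z * z

def lognormal_logsurv(t, mu, sigma):
    """log S(t) = log(1 - Phi(z)), via the complementary error function."""
    z = (math.log(t) - mu) / sigma
    return math.log(0.5 * math.erfc(z / math.sqrt(2)))

def aft_log_likelihood(beta, mu0, sigma, X, durations, observed):
    """Censored log-likelihood of a log-normal AFT model:
    log T_i = mu0 + x_i . beta + sigma * eps_i, with eps_i standard normal."""
    ll = 0.0
    for x, tau, delta in zip(X, durations, observed):
        mu_i = mu0 + sum(b * v for b, v in zip(beta, x))
        if delta:   # event observed: density contribution f(tau)
            ll += lognormal_logpdf(tau, mu_i, sigma)
        else:       # censored: survival contribution S(tau)
            ll += lognormal_logsurv(tau, mu_i, sigma)
    return ll
```

Maximising this over (β, mu0, sigma) gives the AFT fit; swapping the two log-normal helpers for Weibull counterparts changes the assumed distribution without altering the structure.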
As Cox and AFT models involve, in some way, a linear regression on covariates of interest, they are unable to account for potentially non-linear, complex interactions and effects of variables without manual investigation and specification. One way to account for this is to instead use random survival forests (recall RSF from the literature review), which are non-linear models based on an ensemble of individual tree models. The basic idea is as follows. Firstly, one takes a training set and generates B bootstrap samples from it, that is, samples with replacement. Each of these samples is used to grow a decision tree; however, randomness is introduced in the growing of the tree by selecting a set of candidate split variables at each point at which the tree needs to be split. The optimal split variable from this set is chosen to optimise some survival criterion, one of the most commonly used being the log-rank splitting rule [Ishwaran et al., 2008]. The tree is then grown until some criterion is met, either a maximal size or a minimum number of cases remaining, and the output at the end of any branch is the cumulative hazard function for all data-points that are placed into that branch when passed through the tree. This process is repeated B times, and the collection of trees is referred to as a forest. Each decision tree is a non-linear mapping from input covariates to an output cumulative hazard function, and the collection of many trees acts as an ensemble learner. Ensemble models are known to show promising performance in a range of tasks, and this, in addition to the non-linear decision tree models, suggests such models may improve upon Cox and AFT models for certain datasets. For our work, we use the R implementation found in Ishwaran and Kogalur [2007].
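The terminal-node hazards and their ensemble can be sketched as follows, using the Nelson-Aalen estimator for each node's cumulative hazard (a standard choice, as in Ishwaran et al. [2008]) and a simple average across trees; tree growing and the log-rank splitting rule are omitted from this illustration:

```python
def nelson_aalen_chf(durations, observed):
    """Nelson-Aalen cumulative hazard estimate for one terminal node:
    at each event time, add (events at t) / (individuals still at risk)."""
    event_times = sorted({t for t, d in zip(durations, observed) if d})
    chf, cum = {}, 0.0
    for t in event_times:
        at_risk = sum(1 for tau in durations if tau >= t)
        events = sum(1 for tau, d in zip(durations, observed) if d and tau == t)
        cum += events / at_risk
        chf[t] = cum
    return chf

def ensemble_chf(node_chfs, t):
    """Average the per-tree cumulative hazards at time t (the RSF ensemble)."""
    def eval_chf(chf, t):
        past = [v for u, v in chf.items() if u <= t]
        return max(past) if past else 0.0
    return sum(eval_chf(c, t) for c in node_chfs) / len(node_chfs)
```

Each tree routes a new incident to one terminal node; averaging the resulting per-tree cumulative hazards yields the ensemble prediction.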

Deep Learning Methods
Alternative approaches in survival analysis have focused on applying methods from deep learning to incorporate non-linear covariate effects and behaviours. One of the first such methods was in Faraggi and Simon [1995], where a Cox model was considered, but the term x_iβ was replaced with g(x_i), the output of a neural network given input x_i. Similarly, Katzman et al. [2018] and Luck et al. [2017] extended the Cox model to a neural network setting; however, such models are fundamentally still somewhat restrictive in that they assume a form of the hazard function. More recently, Lee et al. [2018] suggested making far fewer assumptions, and instead training a network to directly model the function F(t | x) = 1 − P(T > t), referred to as the failure function. To avoid specifying any particular form of this function, the output space was treated as discrete, defined on times {t_1, t_2, …, t_max}. A single output value in this discrete space at time t_j gives P(t_j | x_i), and hence we derive the failure function as the cumulative sum F(t_j | x) = Σ_{k≤j} P(t_k | x). However, we still need to enforce that the output vector actually defines a discrete probability distribution. A natural way to enforce this is to apply a softmax function on the output layer, normalizing the values to sum to 1. In particular, Lee et al. [2018] considered an application with competing risks, where individuals experienced one of many possible events. Here we consider a simpler case, with only one event (the traffic state returning to normal); however, the methodology remains consistent in this application. As we are only able to measure whether an event has occurred each minute from our sensor data, the discrete nature of the model is not restrictive in our context, yet the non-parametric output is appealing, as we have seen in an exploratory analysis (described in the supplementary material) that the data does not appear to be generated from any particular simple closed-form distribution.
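The softmax-and-cumulative-sum construction above can be sketched as follows (illustrative only, not the authors' code):

```python
import numpy as np

def discrete_failure_curve(logits):
    """Map raw network outputs on the grid {t_1, ..., t_max} to a PMF and CDF.

    The softmax enforces a valid discrete distribution P(t_j | x); the
    cumulative sum then gives the failure function F(t_j | x).
    """
    z = logits - logits.max()          # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()    # P(t_j | x), non-negative, sums to 1
    return p, np.cumsum(p)             # (PMF, failure function F)
```

By construction the resulting failure function is non-decreasing and reaches 1 at the final grid point.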
We adapt the implementation found in Lee [2020].
For our implementation, we specify a t_max value equal to the longest duration plus a 20% margin, as in the original implementation, and define the output grid at a 1-minute resolution. However, doing so leads to a very large output space for the model, and could potentially lead to over-fitting. To combat this, we apply dropout after every fully connected layer in the network, elastic net regularization on the weights, and early stopping based on holdout data. We further consider whether a parametric distribution may be sufficiently flexible, when attached to a neural network model, to perform well in the prediction task. To test this, we build another model as described above, but remove the softmax output layer and replace it with a mixture distribution layer, influenced by Zou et al. [2016]. We choose our mixture components to be log-normal, avoiding the other specified distributions for numerical stability. This alters the output size to 3 × N_m, where N_m is the number of mixture components, a hyper-parameter to tune.
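A minimal sketch of such a mixture head, assuming the 3 × N_m outputs are arranged as (weight logit, mean, log standard deviation) per component — an assumed parametrisation for illustration:

```python
import numpy as np

def lognormal_mixture_pdf(t, params):
    """Density of a log-normal mixture parametrised by a 3 x N_m network head.

    params : (N_m, 3) array of (weight logit, mu, log sigma) per component.
    The weight logits are passed through a softmax to give mixture weights.
    """
    w = np.exp(params[:, 0] - params[:, 0].max())
    w /= w.sum()                                   # mixture weights via softmax
    mu, sigma = params[:, 1], np.exp(params[:, 2])
    t = np.asarray(t, dtype=float)[:, None]
    comp = np.exp(-(np.log(t) - mu) ** 2 / (2 * sigma ** 2)) \
        / (t * sigma * np.sqrt(2 * np.pi))         # per-component log-normal pdf
    return comp @ w                                # weighted mixture density
```

Exponentiating the log-sigma output keeps the scale parameters strictly positive, one reason this parametrisation is numerically stable in practice.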
A final alternative, compromising between the fully non-parametric discrete output and the parametric mixture, is to allow the output layer of the network to define a set of weights, and construct a probability distribution from these weights using kernel smoothing. Kernel smoothing is a non-parametric technique that constructs a distribution by summing a set of kernel functions evaluated at given data-points. Formally, we can write a kernel smoothed estimate at some desired point z, with kernel centres Z_i and weights ω_i, as f(z) = Σ_i ω_i K((z − Z_i)/h_bw)/h_bw. Here h_bw is the smoothing bandwidth and the kernel K(·) is often taken to be Gaussian; the resulting estimate essentially builds a distribution as a weighted, smoothed sum over all kernel centres. A point with high weighting will result in a significant amount of mass near this location, and a wide bandwidth will smooth this mass out to the surrounding area. Applying this to our problem allows us to avoid treating the output space as discrete: instead we place a kernel centre at each point in the formerly discrete grid, and treat the neural network output (with a softmax applied) as defining the weights ω_i. Doing this also enforces some amount of smoothness in the output distribution, determined by the choice of h_bw. We choose a bandwidth of 3 minutes, which still allows significant freedom to the distribution.
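The kernel smoothed output can be sketched directly from the formula above (a minimal illustration, assuming a Gaussian kernel):

```python
import numpy as np

def kernel_smoothed_density(z, centres, weights, h_bw=3.0):
    """Gaussian-kernel smoothing of softmax weights into a continuous density.

    centres : the (formerly discrete) output grid Z_i
    weights : softmax network outputs omega_i, summing to 1
    h_bw    : smoothing bandwidth in minutes (3 in our case)
    """
    z = np.asarray(z, dtype=float)
    u = (z[:, None] - np.asarray(centres)[None, :]) / h_bw
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)   # Gaussian kernel
    return (K * np.asarray(weights)[None, :]).sum(axis=1) / h_bw
```

Because each kernel integrates to 1 and the weights sum to 1, the smoothed result remains a valid density while spreading each grid point's mass over its neighbourhood.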
The function proposed in Lee et al. [2018] to optimize in order to train the network is a combination of two loss values, the first accounting for the likelihood of the observed data and the second enforcing ordering. The likelihood loss is expressed in terms of f̂, the PDF (or PMF in the discrete case) implied by the model output, and F̂, the CDF (or cumulative mass function in the discrete case) implied by the model output, given a particular input x_i. This is exactly as in Eq. (8), but taking logs, describing the likelihood of survival data. The second loss penalizes the incorrect ordering of pairs in terms of the cumulative probability at their event time. If an incident i ends before j, then we would expect F̂(τ_i | x_i) to be larger than F̂(τ_i | x_j), and if so this pair is considered correctly ordered. Large deviations from correct ordering are penalized by a function η(x, y). The total loss function is then the sum of L_1 and L_2. The hyper-parameter grids used for all machine learning models can be found in the supplementary material, and all models are trained using 100 instances of random search.
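Up to constants, and given that our incident data contains no censoring, these two losses take the following standard form (a reconstruction following the general formulation of Lee et al. [2018] rather than a verbatim quotation; σ is a scale hyper-parameter of the ranking penalty):

```latex
L_1 = -\sum_i \log \hat{f}(\tau_i \mid x_i), \qquad
L_2 = \sum_{i \neq j} \mathbb{1}\{\tau_i < \tau_j\}\,
      \eta\!\left(\hat{F}(\tau_i \mid x_i),\, \hat{F}(\tau_i \mid x_j)\right),
\qquad \eta(x, y) = \exp\!\left(\frac{y - x}{\sigma}\right).
```

With this choice of η, correctly ordered pairs (x comfortably larger than y) contribute a penalty near zero, while badly mis-ordered pairs are penalized exponentially.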

Dynamic Methods
To this point, all models discussed in this section have been static: an individual has a covariate vector x_i, it is passed through some model, and an estimate of its hazard function, failure function or similar is obtained. In practice, however, our specific application contains a significant amount of information useful in determining the duration that is not available at the start of the incident. Such information in the traffic domain includes police reports made at the scene, recovery information and, of specific use to us, the time-series provided by the sensors along the road. A significant incident on a road network could lead to speed drops, flow breakdown and travel time spikes, all of which will be evident when we inspect the time-series as the incident progresses. The recovery of the link to normal operating conditions is closely tied to these time-series, firstly through the level of the speed series (as this defines how far from a baseline we are), but one could imagine much richer indicators of traffic state being mined from them. Recalling Figure 2B, we see significant structure in the series: a linear drop near the start of the incident, an unstable oscillation period at lower speeds, then a recovery to normal conditions. Further examples of these series for incident periods can be found in the supplementary material. A number of methods have been suggested to handle dynamic predictions in a survival analysis setting.

Landmarking
With any dynamic prediction approach, the goal is to provide estimates of a hazard function, survival function or similar at some time t, conditioned on the fact that the individual has survived to time t, together with any covariates they provide. A simple method to do this is known as 'landmarking' and is discussed in Dafni [2011]. We note from the outset that landmarking is similar to the truncated regression discussed in section 2; however, this terminology is consistent with the wider survival analysis literature. To carry out landmarking, one first specifies a set of 'landmark times' {t_LM1, t_LM2, …, t_LMK} at which we want to make dynamic predictions. One then chooses some survival model, for example a Cox model, and fits it to give the hazard function at landmark time t_LMj, valid for t_LMj ≤ t < t_LMj + ∆t, for some ∆t defining how far ahead we are interested in looking. Compared to Eq. (4), the covariate values x_i are replaced with those known at time t_LMj, and the regression coefficients and baseline hazard can vary based on the landmark time. At each landmark time, only incidents that are still ongoing are retained, so the model is conditioned on survival up to this landmark time. To account for potential time-varying effects and avoid misspecification of the regression parameters, events that occur after t_LMj + ∆t are administratively censored, that is, they are marked as censored if they survived past the look-ahead time of interest. Such a model is simple to implement, as one can refine the dataset at different times to produce dynamic models. However, as the landmark time grows and less data becomes available, some power may be lost when drawing statistical conclusions.
To implement these models, we choose landmark times t_LM of {0, 15, 30, 45, 60, 120} minutes and horizons ∆t of {5, 15, 30, 45, 60, 120, 180, 240} minutes, and display results throughout section 5. Finally, the landmarking framework can be applied with models other than a Cox model, so we consider both Cox and RSF landmarking models as two candidate dynamic prediction models, with RSF offering a non-linear alternative. The same was done to compare to dynamic models in Lee et al. [2020].
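The dataset construction for a single landmark time, including the administrative censoring step, can be sketched as follows (an illustration, not our R pipeline):

```python
def landmark_subset(durations, t_lm, dt):
    """Build the administratively censored dataset for one landmark time.

    Incidents already finished by t_lm are dropped (the fitted model is
    conditioned on survival to t_lm); incidents lasting beyond the
    look-ahead time t_lm + dt are marked as censored at that time.
    Returns (index, observed time, event indicator) triples.
    """
    out = []
    for i, dur in enumerate(durations):
        if dur <= t_lm:                       # not ongoing at the landmark time
            continue
        ended_in_horizon = dur <= t_lm + dt
        out.append((i, min(dur, t_lm + dt), int(ended_in_horizon)))
    return out
```

For example, with durations of 10, 50 and 200 minutes, a landmark at 30 minutes and a 60-minute horizon, the first incident is dropped, the second is an observed event at 50 minutes, and the third is censored at 90 minutes.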

MATCH-Net Based Sliding Window Model
The models considered so far require us to manually engineer features from the time-series variables to incorporate them as covariates. As discussed, we use the level and gradient of the residual series, as these indicate both how close the link is to reaching standard behaviour and whether the situation is getting better or worse. However, gradients computed in short windows can be noisy, yet computed over large windows they may lead to significant delay in identifying features. Instead, we would like some automated method that, given a time-series, is able to learn meaningful features from it and incorporate them into predictions. Such a method is proposed in Jarrett et al. [2020], where the authors detail a sliding window model which they name MATCH-Net. We note that the algorithm is designed to make dynamic predictions accounting for missing data; in our application we have no missing data, so we are interested in the dynamic prediction aspect only. In this model, a window of longitudinal measurements is fed through a convolutional neural network (CNN), with the convolutions learning features from the time-series that are then used for prediction of risk in some look-ahead window. The model slides across the data, shifting the window each time, meaning features are updated as time progresses and predictions are also shifted forward. It is upon this that we base our sliding window methodology.
Specifically, we take a historical window of length w, and at time t feed the time-series from time t − w to t through a CNN, where the filters aim to derive features from the series without any manual specification of what they should be. We then concatenate the features output by the CNN with the time-invariant features, and pass these through a series of fully connected layers. In Jarrett et al. [2020], the output layer was a discrete space upon which a softmax activation was applied, and we again consider this, alongside a mixture of log-normal distributions and a kernel smoothed output distribution. At each input time, we consider a look-ahead window of the same length as in the fixed case, and treat w as a hyper-parameter to optimize. Since this model is more complex and has far more parameters than the former case, we consider the discrete distribution to be piecewise constant over 5-minute intervals. As a result, the output space decreases in size by 80% without sacrificing too much freedom. A schematic of the network architecture is given in Figure 3, with the output layer left intentionally vague to make clear that we consider multiple different forms of output.
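The construction of the historical input windows can be sketched as follows (illustrative only; shapes are an assumption about how the series is stored):

```python
import numpy as np

def sliding_windows(series, w):
    """Stack length-w historical windows of a multivariate series.

    series : (T, n_vars) array of residual time-series measurements.
    Returns an array of shape (T - w + 1, w, n_vars), one window per
    prediction time t >= w, each ready to be fed through the CNN.
    """
    return np.stack([series[t - w:t] for t in range(w, len(series) + 1)])
```

Each row of the result is the window [t − w, t) that the convolutional filters see at prediction time t; shifting to the next row advances the window by one measurement interval.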

Figure 3: Network schematic for the sliding window model. We pass filters across the residual time-series to engineer features from each time-series, then concatenate these with the time-invariant features to create a feature vector, which is passed through a series of fully connected layers, and some output layer is applied to the result. The example shown is for a single traffic incident being passed through the network. The number of boxes for features is not to scale. The window of time-series represents 3 variables and a window size of 7 in this simple example.

Results
A point infrequently discussed in the context of traffic incidents is that there are multiple criteria defining a 'good' hazard model, and multiple ways to measure this in the dynamic setting. We discuss some of these below. We note also that elastic net regularization is applied to all deep learning methods, and the optimal Cox and AFT models are selected through inspection of the sample-size adjusted Akaike information criterion (AIC) to avoid over-fitting.

Discriminative Performance -Concordance Index
The concordance index (C-index) has different definitions in the static and dynamic settings. In the static setting, Eq. (16) gives the so-called 'time dependent' definition used in Lee et al. [2018], accounting for the fact that we care about the entire function F̂ and not a single point value. In the dynamic setting, the C-index is instead computed at a given prediction time t and evaluation horizon ∆t, rather than over the entire dataset; this is the definition given in Lee et al. [2020]. In computing it, we take the F̂ values at time t + ∆t and compare pairs where an incident i actually ended within the horizon.
As described in Lee et al. [2020], such a measure compares the ordering of pairs. If individual i experienced an event before individual j, then we would expect a good model to correctly assign more chance of an event to individual i than j. A model with perfect C-index, given N traffic incidents, will perfectly predict the order in which the incidents will end. This idea stems from viewing survival analysis as a ranking problem, and since we compare the CDF for two events, we see that it incorporates the entire history from a prediction time up to an evaluation time, not just a single point measurement. A random model will achieve a C-index on average of 0.5, and a perfect model will attain a value of 1.0, so these are reference values to consider when interpreting this measure.
We compute Eq. (17) for our dataset by evaluating empirically how often the ordering is correct, conditioned on the stated requirements; the same is done in the static case. If two incidents happen to give exactly the same CDF values when evaluating, we take the convention of adding 0.5 to the total rather than 0 or 1, following Harrell Jr et al. [1996].
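The empirical computation can be sketched as follows (an illustration of the pair-counting logic, with hypothetical callables standing in for the fitted model CDFs):

```python
def dynamic_c_index(cdfs, durations, t, dt):
    """Empirical dynamic C-index at prediction time t and horizon dt.

    cdfs[i] is a callable giving the model CDF F(. | x_i). Pairs are
    counted where incident i ends within (t, t + dt] and incident j
    outlives it; ties in the CDF contribute 0.5, following Harrell Jr
    et al. [1996].
    """
    conc, total = 0.0, 0
    for i, d_i in enumerate(durations):
        if not (t < d_i <= t + dt):          # i must end within the horizon
            continue
        for j, d_j in enumerate(durations):
            if d_j <= d_i:                   # j must outlive i
                continue
            total += 1
            f_i, f_j = cdfs[i](d_i), cdfs[j](d_i)
            conc += 1.0 if f_i > f_j else (0.5 if f_i == f_j else 0.0)
    return conc / total if total else float("nan")
```

A model whose CDFs rank every pair correctly scores 1.0; random predictions score 0.5 on average, matching the reference values discussed above.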

Calibration Performance -Brier Score
The Brier score measures how well calibrated a model is, comparing the binary label (1 if an event has happened by some time, 0 otherwise) with the model prediction at that time. We sum over all incidents still active at time t, and ask whether incident i ended before t + ∆t. If it did, then we would expect a good model to have a high CDF value at this point, with 1 being perfect (i.e. predicting with certainty that the incident would end by t + ∆t). On the other hand, if the incident did not end before t + ∆t, then we would expect a low CDF value. This definition is that proposed in the supplementary material of Lee et al. [2020]. In a sense, this measures the mean square error of a probabilistic forecast of a binary outcome. In terms of reference values, a model that outputs a survivor function value equal to 0.5 at a particular time will have a Brier score of 0.25, so values lower than this are desirable, and a perfect model will achieve a score of 0.
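A minimal sketch of this computation (again with hypothetical callables standing in for the fitted model CDFs):

```python
def dynamic_brier(cdfs, durations, t, dt):
    """Dynamic Brier score at prediction time t and horizon dt.

    Over incidents still active at time t, this is the mean squared error
    between the indicator of the event occurring by t + dt and the model
    CDF evaluated at t + dt.
    """
    active = [i for i, d in enumerate(durations) if d > t]
    return sum(
        (float(durations[i] <= t + dt) - cdfs[i](t + dt)) ** 2 for i in active
    ) / len(active)
```

A model that always outputs a CDF value of 0.5 scores exactly 0.25, the uninformative reference value noted above.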

Point-Wise Performance -Mean Absolute Percentage Error
Whilst the C-index and Brier score are used throughout the survival literature, we note that mean absolute percentage error (MAPE) is used throughout the traffic literature and has some practical relevance to our application. Since we have no censoring, we know the true duration of each traffic incident. As such, we can evaluate, for every data-point, the error between a point-prediction and the true duration. A natural choice for such a point prediction is the median of each model's output distribution [Yuan, 2008, Qi and Teng, 2008], as the distribution of traffic incident durations is known to be heavy tailed. We can then ask what the point-wise error is for each model. Note that the C-index and Brier score concern the accuracy of the output distribution, whereas this concerns a single point taken from the distribution. Highways England currently states that NTIS is measured on its prediction of the 'likely delay associated with an event.' Specifically, NTIS is scored as follows. One aggregates all incidents that lasted over 1 hour and had a predicted return-to-profile time made at their half-way point. The MAPE between these predictions at the half-way point and the true values is computed. The target for NTIS predictions is for this to be below 35%, and it is stated in IBI Group UK [2019] that the current value in practice is 35.49%.
There is of course a problem with this criterion, in that a 'perfect' model by this standard would just always predict double the current duration, which would optimise the prediction at the mid-point, but be of no practical use aside from this. Regardless, this rough measure allows us to frame our work in the context of the practical considerations traffic operators are currently working towards.
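The NTIS-style criterion, including its degeneracy, can be made concrete with a short sketch (illustrative; `predict_at` is a hypothetical point-prediction function):

```python
def halfway_mape(predict_at, durations):
    """NTIS-style criterion: MAPE of point predictions made at the
    half-way point, over incidents lasting more than 60 minutes."""
    errs = [abs(predict_at(0.5 * d) - d) / d for d in durations if d > 60]
    return 100.0 * sum(errs) / len(errs)
```

The degenerate predictor described above, `predict_at = lambda t: 2 * t`, scores a perfect 0% under this criterion despite carrying no information beyond the elapsed time.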

Static Prediction Models
We begin by considering how a range of models perform in the static sense. For this, we use only the fixed covariate information available at the start of the incident to fit the models. We apply each of the discussed models, and show results for all metrics in Table 2. We see the ordering of incident durations, measured by the C-index, attains values  between 0.624 and 0.676. All models are informative, beating the 0.5 reference value, and the biggest gains in C-index are seen when we go from the linear to non-linear modelling frameworks. The RSF achieves the optimal C-index, followed by the neural network with a mixture of log-normals.
In terms of point-wise error, we do not make predictions at the half-way point of incidents in this setting; we make them only at the start of an incident. Doing so and measuring over all incidents longer than 60 minutes, we see that all models achieve a MAPE between 37% and 41%, with the best model being the non-parametric neural network. No model achieves a MAPE of less than 35%. Finally, the optimal Brier score is always achieved by the RSF method, with the most noticeable differences observed at horizons of 120 and 180 minutes. There is not much to distinguish many of these models, and ultimately one might suggest that in a static setting a RSF offers a good compromise between performance measured by C-index, Brier score and MAPE; however, if MAPE is the single desired criterion, a non-parametric neural network model would be preferred.

Dynamic Prediction Models
We now consider the models in a dynamic setting. We consider the C-index as defined in Eq. (18), and show results at various prediction times and horizons in Table 3. Initially, the RSF achieves the optimal C-index across all horizons when predicting at t = 0. As the time of prediction increases, we see a strong favouring of neural network models; specifically, the sliding window model with a kernel smoothed output achieves the optimal C-index in most cases. There is a systematic preference for the non-parametric sliding window models over the others at all prediction horizons when considering prediction times of 30 minutes or greater. Even at a prediction time of 15 minutes, the non-parametric sliding window models are preferred at the 180 and 240 minute horizons. As a general summary of Table 3, out of all 47 (prediction time, prediction horizon) pairs considered, the optimal model in terms of C-index is the RSF roughly 34% of the time, the sliding window neural network with kernel output 43% of the time, and the sliding window neural network with non-parametric output the remainder of the time.
Averaged over all horizons, we see that one would prefer the RSF model when initially making predictions, but all prediction times after 15 minutes favour the kernel smoothed output, with the non-parametric neural network often similar in performance. The neural network model parametrising a mixture of log-normals achieved the highest C-index of the neural network models in the static case, closely following the RSF model; however, it never wins in the dynamic case, suggesting that when we provide the time-series features, the RSF makes better use of them initially, and the other neural network models make better use of them as time passes. All models achieve C-index values higher than the reference value of 0.5 across all prediction times and horizons, showing their predictions remain informative.
One point of note from Table 3 is that the Cox model has quite a poor C-index compared to the alternatives when predicting at a horizon of 5 minutes. We believe this is due to the amount of administrative censoring introduced at such a short horizon. If we look to van Houwelingen and Putter [2012], an assumption of the Cox landmarking model is that there is not too much censoring at the horizon time. For a very short horizon of 5 minutes, almost all incidents last longer than this; hence, when we apply our administrative censoring, this assumption becomes invalid, and we suspect this is why the Cox model performs poorly at this horizon.
We further show the Brier scores for each model in Table 4. Again, we observe that initially the random survival forest achieves optimal scores across all horizons; however, as the time of prediction increases, we gradually see the sliding window neural network with kernel smoothed output achieve better Brier scores for short prediction horizons. This is again systematic: for a prediction time of 120 minutes, the optimal model is the sliding window neural network with kernel smoothed output at horizons up to and including 45 minutes, but for a prediction time of 45 minutes it is only optimal at horizons up to and including 15 minutes. One could postulate that initially the time-series provide less useful information than the fixed features: at the very start of an incident, we see only the state of the link before the incident, which may have been largely seasonal. As time progresses, however, we attain more informative features specific to individual incidents, and in this case the fact that the sliding window method engineers its own features, rather than the noisy gradient and level values we manually input to the RSF model, may prove more useful. Despite this, if a duration is very long, say 4 hours, and we make a prediction 60 minutes into it, how much do we truly expect to gain from inspecting the time-series so far? It may be that there is simply no sign of recovery, and all we can really conclude is that speed has been slow for a long time and shows no other clear features.
Another point of note when considering Brier score is that RSF appeared to perform well compared to a non-parametric neural network model in other works. If we look to the supplementary material of Lee et al. [2020], we see that a neural network with a non-parametric output did not consistently improve upon the Brier score achieved by a RSF model (see table VI in the supplementary material of the cited reference). It is unclear therefore if there is some fundamental reason for this in the modelling framework, as two entirely different datasets and applications appear to have observed the same behaviour. Despite this, all models achieve Brier scores below the reference value of 0.25.
Finally, we show the error in a point prediction made at various times throughout incidents in Table 5. For reference we also include the value achieved by the fixed model, to get an idea of what we gain from making dynamic predictions. From Table 5, we see that when making a prediction after 30% of the duration of an incident has passed, we can expect between 30% and 33% MAPE in that prediction. This is around a 5-10% improvement over the prediction made by the corresponding static models at the start of the incident. If we predict half way through an incident, the neural network models all achieve significantly better MAPE values than the landmarking models, with an optimal MAPE of 21.6% achieved by the mixture of log-normals model, followed by 22% for the non-parametric model. The discrepancy between the sliding window and landmarking models grows as we make predictions later and later, with the sliding window models achieving a MAPE of between 16.5% and 17.3%, compared to a value of 26.5% for the optimal landmarking model (RSF). Additionally, the prediction error shows very little improvement moving beyond the 50th percentile of an incident's duration for the RSF model, and actually increases for the Cox model, suggesting that they are not sufficiently capturing signs in the time-series that indicate the end is near. A key point of practical interest is that with the dynamic models, we do indeed achieve a MAPE value below 35%, as desired by Highways England. Of course, we would need to attain data for all incidents across the UK to truly ensure that we are able to maintain this on a wider scale, but as far as we are able to measure, we achieve what would be considered industrially satisfactory error rates with the dynamic models.
Whilst the landmarking models appear to plateau in point-wise performance here, we note that this is partly due to the plateauing or noisy error they exhibit when making predictions at large landmarking times, when only very few incidents remain active. We visualize the error per minute into incidents in the supplementary material, which shows this. The plots within the supplementary material are more akin to those found in other dynamic works, for example Figure 2 in .

Do Temporal Convolutions Improve Predictions?
As we have applied methods from Jarrett et al. [2020] to formulate a sliding window model, we have naturally included temporal convolutions to generate information from the time-series. However, a valid question is whether these are required in our application, or whether the simple levels and gradients retain enough information to make informative predictions. We test this by implementing a model without the CNN structure, instead feeding an input vector consisting of the time-invariant features and the levels and gradients computed as in the landmarking case to a feed-forward network. Doing so results in a model that achieves a worse Brier score and C-index across all (prediction time, horizon) pairs, and a worse error at the half-way point. Exact results are given in the supplementary material.

Feature Importance and Model Interpretability
Variable importance is a topic often addressed in the literature, and we offer some discussion of it here for both the RSF model and the non-parametric neural network model, the latter chosen for simplicity compared to the kernel model.

Random Survival Forest Variable Importance
As RSFs are adaptations of random forest methods, standard variable importance metrics are well explored and readily implemented. Recall that trees are trained on bootstrap-sampled datasets, meaning a set of 'out of bag' observations remains for each tree, which can be used for measuring variable importance. Given a trained forest and some variable of interest x, we drop the out-of-bag data for each tree down the tree and, whenever a split on x is encountered, assign a daughter node at random instead of evaluating based on the value of x. We then compute the estimates from the model in this way, and the variable importance for x is the prediction error of the original ensemble subtracted from the prediction error of the new ensemble ignoring the x value. A large variable importance suggests that a variable is influential in the model's predictions.

From Figure 4, we see that the speed value is the most important variable for horizons of 5, 30 and 60 minutes. At 180 minutes, the most important variable becomes the time of day. This makes intuitive sense, as we expect the time-varying features to provide more useful information about the immediate future than about times far ahead. Similar reasoning applies to the importance of travel time and flow. It is interesting to note that flow is often less important than other, time-invariant features across all horizons. The gradients are always less important than the residual values themselves, which may be a consequence of noise when estimating them, or of the fact that we can see short-term rises and falls in the traffic variables that do not indicate the incident is actually near ending, but rather that the traffic state is just unstable, as in Figure 4. The location of an incident is always somewhat important, ranking between fourth and sixth across all horizons. This suggests clear heterogeneity in durations by location.
Note that with a location and length, a model should be able to identify single links in the network, so predictions can be specific to these if it improves performance. However, the importance of length decays over horizons, and is always less than the location itself, so the coarse segmentation we have introduced for location seems more important than the specific link an incident occurs on. We see that the season becomes increasingly important as horizon increases. The type of incident varies between the sixth, fifth, eighth and sixth most important variable going from horizons of 5, 30, 60 and 180 minutes respectively.

Neural Network Variable Importance
Recently, there has been a significant effort to improve the interpretability of prediction models, both those involving neural networks and more general frameworks. Examples include Ribeiro et al. [2016] and Shrikumar et al. [2017]. In the first, the general idea is to build a simpler 'explainer' model g that locally approximates some complex model f. One optimizes g by penalizing complexity and assigning more weight to data-points near the one whose prediction we wish to explain, hence resulting in local accuracy. It was then shown in Lundberg and Lee [2017] that many existing model interpretability methods could be phrased in terms of a concept from game theory known as Shapley values. In short, they proposed explainer models of the form:

g(z') = φ_0 + Σ_{i=1}^{M} φ_i z'_i,  (20)

where z'_i is a binary value indicating the inclusion or exclusion of a particular feature and we have M features. They then specified three properties that one might desire in a feature attribution method: local accuracy (the explainer matches the model's output for the instance being explained), missingness (no attributed impact for missing features) and consistency (if one model's output is more sensitive to a particular feature than another's, that feature achieves a higher attribution value). The authors showed that under these properties and Eq. (20), the φ_i values actually coincide with Shapley values from game theory. This approach unified many existing methods, including Ribeiro et al. [2016] and Shrikumar et al. [2017]. They denoted φ_i a 'SHAP value'; these are appealing as they are additive, showing how particular features shift a model's prediction away from some mean φ_0 to the final result for a particular data-instance. Their absolute value shows the size of a particular feature's importance; one can also go deeper and ask whether, for a given feature, this value is increasing or decreasing the final prediction of the model. One computes the SHAP value for feature i as:

φ_i = Σ_{S ⊆ M \ {i}} [ |S|! (|M| − |S| − 1)! / |M|! ] (f(S ∪ {i}) − f(S)),  (21)

where M is the set of all features. In Eq. (21), we sum over all subsets of feature vectors that do not include feature i. It is upon this that we base our feature importance exploration for the neural network model. We compute the SHAP values of the network, for the fixed features and the features output from the CNN deriving information from the time-series. We use the implementation provided by the original authors, specifically the permutation method for its computational speed and its incorporation of structured inputs. More details on SHAP values are given in the supplementary material.
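The permutation approach can be illustrated with a minimal sketch. This is an illustrative numpy re-implementation, not the authors' library; `model`, `x` and `background` are placeholder names, and 'missing' features keep values drawn from a background row:

```python
import numpy as np

def permutation_shap(model, x, background, n_perm=100, seed=0):
    """Approximate SHAP values for one instance x by sampling random
    feature orderings. Features not yet 'added' keep values drawn from
    a row of the background dataset, standing in for 'missing'."""
    rng = np.random.default_rng(seed)
    n_features = x.shape[0]
    phi = np.zeros(n_features)
    for _ in range(n_perm):
        order = rng.permutation(n_features)
        z = background[rng.integers(len(background))].copy()
        prev = model(z)
        for i in order:
            z[i] = x[i]           # reveal feature i
            cur = model(z)
            phi[i] += cur - prev  # marginal contribution of i
            prev = cur
    return phi / n_perm
```

For a linear model the estimate is exact: with an all-zero background and f(z) = z·w, the recovered φ_i equals w_i x_i, and the values sum to f(x) minus the baseline, matching the additivity property above.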
A point of note for this method of feature importance is that we are computing values for each output neuron in the network, corresponding to particular horizons, and measuring how this output value changes, not how some performance metric changes. As a result, we can ask whether different features have more or less impact on different parts of the output distribution for a single input data-point. First, we consider raw feature importance, that is, whether a variable has a large or small impact on the output of the model at particular horizons, showing results in Figure 5.

Figure 5: Variable importance, as measured for the sliding window neural network model, for a subset of the important variables. Each plot is a different prediction horizon; all shown are at a prediction time of t = 30. The importance at each prediction time is computed by taking the average of the absolute SHAP values for each feature across all query instances, and has then been normalized such that the largest importance at any horizon is 1 and all others are relative to this. The rank of each variable at a given time is written beside each bar. Due to the scaling, one should focus on the ranking and relative difference between bars in each plot. TSF stands for time-series feature, extracted by passing temporal convolutions across the data.

From Figure 5, we see that for very short horizons (5 minutes) there are a large number of time-series features with high importance. This makes intuitive sense for the same reasoning as in the RSF case. After the time-series features, we see the time of day, location and incident type are the features with the highest impact. Moving to a horizon of 30 minutes, the time-series features become less important, and location and time of day dominate the other features.
Note here that at the 5 minute horizon there were many features with quite high importance, showing that a good number influenced the model's output, whereas at a horizon of 30 minutes we see two with large importance and many others with far less. At a horizon of 60 minutes, we again see the importance of the time-series features increase, but the time of day and location are still the two most important of the time-invariant features, ranking first and fifth respectively. One might ask why the time-series features resurge in importance here, and we explore this further in Figure 6 and its analysis. At a long horizon of 180 minutes, the time of day is by far the most important feature and the location is second, though it is relatively less important than it was at a horizon of 30 minutes. A natural interpretation of this might be that for long horizons into the future, knowing whether an incident will overlap with rush hour or run into lunch time or the night is a good indication of whether we believe it might last a long time.
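The normalization used for Figure 5 can be sketched as follows. A minimal illustration, assuming SHAP values for one horizon are held as an (instances × features) array:

```python
import numpy as np

def normalized_importance(shap_vals):
    """shap_vals: (n_instances, n_features) SHAP values at one output
    horizon. Returns mean |SHAP| per feature, scaled so the largest
    importance equals 1, matching the normalization of Figure 5."""
    imp = np.abs(shap_vals).mean(axis=0)
    return imp / imp.max()
```

The per-horizon ranks written beside each bar then follow from sorting this vector in descending order.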
Having inspected the magnitude of SHAP values, we now ask how actual feature values shift the network output, either increasing or decreasing it. Visualizing this is more complex, since we want to plot the impact of various features, the value they attain, and whether this shifted the prediction up or down. A standard way to do this for SHAP values is to make a 'beeswarm' plot, in which each data-instance is plotted as a single dot per feature. Examples of such plots for our dataset are given in Figure 6. One should read these plots as follows. Firstly, along the y-axis are the features the model used, where categorical ones have been split into their one-hot encoded states. Secondly, the x-axis displays the SHAP values, not normalized as they were in the previous analysis, showing how the particular feature shifts the model output either up or down. Thirdly, the colour indicates the feature value; for binary features such as 'Morning Rush' a high value indicates the data-point was in the morning rush. Fourthly, where many data-points had similar SHAP values for the same feature, points are expanded outwards along the y-axis, so a large vertical strip of points for a single feature indicates a high density of points at that SHAP value. The horizon is indicated by h in each sub-caption, which corresponds to looking at a particular output node of the network. We split the fixed and series features to aid readability. Note that pushing the output 'up' (a feature with a positive SHAP value) indicates increasing the probability mass function's value at this time-horizon.
Whilst there is a large amount of information in Figure 6, we break down the main points here. Firstly, we see that at a horizon of 5 minutes (Figures 6a, 6b) there is clear coherence in the features, shown by there not being a random assortment of colours across single features. Earlier, we saw that the time-series features, the time of day and location were influential factors at this horizon. Considering the time of day more finely, we see from Figure 6a that when incidents fall in the morning rush period, the value of the PMF at this time is decreased, suggesting the model believes it is unlikely that incidents at this time of day will end very quickly. Inspecting horizons of 30 and 60 minutes, in Figures 6c, 6e, we see that when incidents occur in the morning rush, this increases the output values at these horizons. Finally, there appears to be more complex behaviour at a horizon of 180 minutes, as incidents occurring in the morning rush sometimes increase, and sometimes decrease, the output at this time. If we turn instead to how a location impacts the result, for example inspecting the SHAP values for 'West', we see that attaining a value of 1 here decreases the model output at a horizon of 5 minutes (Figure 6a), increases it at horizons of 30 and 60 minutes (Figures 6c, 6e) and then decreases it again at a horizon of 180 minutes (Figure 6g). Note however that since some of these features are in fact categorical and have been one-hot encoded for use with a neural network, care must be taken in interpreting the impact of such values. In this analysis we attain a SHAP value for each encoded feature, yet every data-point has a single location value equal to 1 and the rest equal to 0. Each encoded location feature here will alter the neural network output, but the total impact of a data-point having a particular location is the sum of the SHAP values for that data-point over all encoded categories.
As such, we can better visualize the impact of categorical variables, for example location, by first summing the SHAP values for a data-point for all encoded groups of a particular feature, then visualizing how the overall feature impacted predictions. We do so in Figure 7.
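The summing step can be sketched as follows. A minimal illustration rather than our exact pipeline; the feature names and `location_` prefix are hypothetical placeholders:

```python
import numpy as np

def combine_onehot_shap(shap_vals, feature_names, prefix):
    """Sum SHAP values over all one-hot columns sharing a prefix
    (e.g. 'location_'), giving the overall per-instance impact of the
    underlying categorical variable."""
    cols = [j for j, name in enumerate(feature_names) if name.startswith(prefix)]
    return shap_vals[:, cols].sum(axis=1)
```

Because SHAP values are additive, this per-instance sum is exactly the shift in model output attributable to the categorical variable as a whole.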
The more intuitive description offered in Figure 7 allows one to see the overall impact of the location variable. An example interpretation one could read from Figure 7 is: for data-points where the spatial location was east, the overall impact of having this location on the model output was an increase in the output at horizons of 5 and 30 minutes, and a decrease at 60 and 180 minutes. From this more refined view, we can see that incidents in the north east are quite varied, as their SHAP values sometimes shift up and sometimes shift down for all horizons. Incidents in the north typically have the output increased at horizons of 5 and 30 minutes (Figures 7a, 7b) and then decreased at horizons of 60 and 180 minutes (Figures 7c, 7d). This suggests that, for example, the model has learnt incidents in the north are of shorter duration than incidents in the south, however this can then be adjusted further by other observed features. Locations can be compared in this way for all possible pairs.
A further question of interpretability relates to the features engineered from the time-series: what effect do these have on the model? We previously saw that they were highly important at 5 and 60 minute horizons, and we can now use Figure 6 to consider what impact they are having. Starting with a horizon of 5 minutes, we see from Figure 6b that when the time-series features attain a high value, the model output at this horizon is increased. We do not have an interpretable explanation of these features, but we see that they can provide quite significant upward shifts in the output if their values are high. Moving to a horizon of 30 minutes, we see from Figure 6d that the series features become less coherent: some attain a high value and shift the output up, others down, and the overall impact for many of them is small compared to the impact of the fixed features. This does indicate that the features learnt from the series are distinct in some sense, providing different impacts on the model output. At a horizon of 60 minutes, we see from Figure 6f that the features attaining a high value indicates a decrease in the model output at this point. From this we can interpret that when high values of the features derived from the series are attained, they make the model put more mass in the immediate future and less at horizons of 60 minutes or longer. This is perhaps an intuitive result: the time-series provide information that can significantly increase the model output at short-term horizons, and attaining these same values shifts the predictions down at long horizons.
Of course, the sheer amount of data available here is somewhat overwhelming; however, using SHAP values one can gain a significant understanding of why the machine learning model is outputting particular values. Further plots for the overall impact of categorical features are given in the supplementary material for interested readers, and are omitted here for brevity. Care must always be taken to ensure that feature importance and causality are not conflated; rather, we are able to ask why the model gives a set of outputs for a particular set of inputs.

Conclusion
In this work, we have addressed a number of issues raised in the literature regarding traffic incident duration prediction. Firstly, we considered a method to determine incident duration not only as the time at which an operator declared the incident cleared, but also as the time at which the traffic state had returned to some typical behaviour. This ensured that our predictions reflected when commuters could expect normal traffic conditions to resume if an incident had a significant impact even after it was cleared. Secondly, we considered a range of models, some based on classic survival analysis and others based on machine learning, and assessed how they performed on our dataset in both a static and a dynamic setting. In particular, we took inspiration from work in the domain of healthcare and applied emerging methods used there to problems in traffic incident analysis. We saw that in a static setting there was little to choose between the models, but generally either a neural network or a random survival forest method would be preferred to the others considered. We moved into the dynamic setting by utilising landmarking and sliding window neural networks that applied temporal convolutions to the time-series, both of which were inspired by success in healthcare applications. We saw that non-parametric methods making no distributional assumptions were preferred to methods that parametrized mixture distributions, and saw some benefit in applying kernel smoothing to enforce minimal structure on a non-parametric output. Utilizing models with non-parametric distributional forms was influenced by the increasingly complex distributions being considered for this problem in the traffic literature, and showed significant promise in our results.
We assessed how each model performed using three different scoring criteria, and in the dynamic setting we saw clear structure in the results: the kernel-smoothed neural network model achieved the best C-index; for the Brier score, whether the random survival forest or the kernel-smoothed neural network was preferred depended on the prediction time and horizon; and the neural network models showed much better performance in terms of point-wise error than the comparison models. We saw that all score criteria were improved by engineering features from windows of sensor data through temporal convolutions, compared to feeding in local levels and gradients, suggesting future work should continue to explore methods of deriving features from sensor data rather than inputting raw values.
We then considered variable importance for our model in the dynamic sense, assessing how the random survival forest and neural network models were influenced by the derived features. Whilst we are aware of variable importance being studied in the traffic literature previously, we are not aware of SHAP values being applied to neural network models for incident duration analysis. Time of day and location were generally important across models, particularly at long horizons, and the time-series features were shown to have significant impact on the neural network output at 5 and 60 minute horizons.
Finally, our suggestion to use the output of a neural network to define kernel weights, and from this construct a non-parametric distribution through smoothing, was novel in that we are not aware of it having been applied to the considered model before. It has relevance to other applications, specifically where one requires a continuous distribution as output but does not want to make strong parametric assumptions about its form. Further work could consider alternative methods to select a bandwidth when applying this type of model, but the freedom it provides appeared promising on our data. Other avenues for further work include incorporating more features in the dynamic prediction models. In particular, traffic operators will be able to report the arrival of recovery vehicles, police involvement and details from on-site reports. Further, social media and weather data could be collected in real-time. It is likely these will have some predictive power for the duration of an incident, and considering how best to incorporate them is an interesting problem. Finally, from our analysis in section 5 we suspect that if one were able to derive robust, complex features from the time-series and feed them into a random survival forest model, we might see further improvements in that model. One way to do this may be through an auto-encoder framework in which we pre-train a model to determine hidden representations of the series; however, it is unclear whether these would offer the same predictive power as we have observed from the time-series here.

Funding
This work was supported by the EPSRC (grant number EP/L015374/1).

Acknowledgments
We thank Dr. Steve Hilditch, Thales UK for sharing expertise on NTIS and UK transportation systems.

A Distribution Information
In Table 6 we summarize the various probability distributions discussed throughout this work.

B Incident Visualization
We saw in the main text that some structure clearly existed in the time-series. One could ask what features we might aim to discover in the series, and how varied their behaviour is during incidents. To aid in this, we visualize a number of windows containing incidents, plotting the speed for some representative incident series in Figure 8. From Figure 8, we see that different incidents show quite different representations in the speed time-series. Figures 8A, 8B, 8C show short-lived but large drops in speed associated with incidents, each at a different time of day. The speed profile is reasonably symmetric in time and the traffic state starts to recover soon after reaching its minimal speed. Figures 8D, 8E, 8F on the other hand show very asymmetric incidents. In Figures 8D, 8E there is quite a sudden drop in speed, but a comparatively slow recovery. In particular, Figure 8E shows two periods where the speed recovers somewhat but then stalls at particular values, later recovering again. Figure 8F shows the opposite case: a slow fall in speed to 85 km/hr and then a comparatively rapid recovery. Finally, Figures 8G, 8H, 8I show extreme incidents, with a long period of very low speed. In the case of Figure 8H, the speed recovers somewhat to 40 km/hr but remains around this for an extended period of time, whereas in Figures 8G, 8I the speed remains around 20 km/hr for a significant period of time.

[Table 6: parameters, PDF and CDF of each distribution considered. For a mixture, each component has a PDF f_i(x; θ_i) and CDF F_i(x; θ_i) that depend on a parameter set θ_i, together with a component weight π_i; the weighted sum of all components generates the final distribution.]

C Exploratory Analysis

C.1 Incident Duration Analysis
An initial exploratory analysis of our data reveals a number of properties. As discussed, an initial step for many works in this field is to take all of the durations and determine whether they are well modelled by a particular statistical distribution. Our review of the literature suggested fitting log-normal, log-logistic, Weibull and generalized F distributions, so we do exactly this and show our results in Figure 9. Specifically, Figure 9A shows the probability density functions (PDFs) compared to the data and Figure 9B shows a quantile-quantile plot. An overview of each distribution considered is given in the supplementary material. From Figure 9A, we see that the Weibull PDF follows the data most closely, but inspecting the quantiles we see only the generalized F distribution has reasonable estimates of the extreme values. When we apply a Kolmogorov-Smirnov test with estimated parameters (see Babu and Rao [2004]), we find no significant evidence that any of these candidate distributions is a statistically valid fit. Note that in Smith and Smith [2001] a chi-squared test was applied to determine the statistical significance of distributional fits; however, doing so requires binning the data, which we avoid here. These results suggest that to properly describe the data we may want to consider more complex distributional representations, either non-parametric or mixtures of many components. Note also that the tail of the durations contains around 6 outliers, corresponding to very rare incidents where the link was at a reduced speed for almost an entire day; the bulk of the incidents have a duration of less than 300 minutes.
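A minimal sketch of a Kolmogorov-Smirnov test with estimated parameters, calibrated by parametric bootstrap in the spirit of Babu and Rao [2004]. This is an illustration with scipy, not our exact procedure; `dist` is any scipy continuous distribution:

```python
import numpy as np
from scipy import stats

def ks_fit_pvalue(data, dist, n_boot=200, seed=0):
    """KS goodness-of-fit when parameters are estimated from the data.
    Standard KS p-values are then invalid, so the observed statistic is
    calibrated against a parametric bootstrap of refitted samples."""
    rng = np.random.default_rng(seed)
    params = dist.fit(data)
    d_obs = stats.kstest(data, dist.cdf, args=params).statistic
    d_boot = np.empty(n_boot)
    for b in range(n_boot):
        sim = dist.rvs(*params, size=len(data), random_state=rng)
        d_boot[b] = stats.kstest(sim, dist.cdf, args=dist.fit(sim)).statistic
    return float((d_boot >= d_obs).mean())
```

Refitting the parameters on each simulated sample is the key step: it reproduces the optimism introduced by estimating parameters from the same data being tested.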

C.2 Incident Clustering Analysis
We saw throughout the literature that time-invariant features were sometimes clustered to reveal different sub-populations of incidents, and that these provided useful features in modelling. This was particularly evident in Weng et al. [2015] and Araghi et al. [2014], for example. We ask whether these clusters are reflected also in the time-series for the data. To do so, we require a distance metric between time-series. A common choice is dynamic time warping (DTW), discussed in Berndt and Clifford [1994] and used in, for example, Izakian et al. [2015]. We specifically use the implementation in Montero and Vilar [2014]. The idea is to stretch or squeeze a pair of time-series such that they are as similar as possible. If one time-series has the same shape as another, it will have a low DTW distance, whereas if two are fundamentally different, they will have a large distance. We first standardize all link data by its mean and standard deviation, then compute the DTW distance between all pairs of series that represent a window where an incident occurred. From this we attain a distance matrix, and we use hierarchical agglomerative clustering (HAC) with a Ward linkage function (Ward Jr. [1963]) to construct a dendrogram visualizing distances between elements. The results are shown in Figure 10, where Figure 10A clusters the data by their time-invariant features and Figure 10B clusters the data by their time-series. From Figure 10, there is clear structure in the distance matrix, suggesting that incidents cluster both by time-invariant features and by the observed traffic metrics on the links. The number of clusters is somewhat subjective, but if we inspect the average within-cluster sum of squared distances, we see that the reduction in this begins to diminish after around 6 clusters are identified for both clustering instances.
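The standardize/DTW/Ward pipeline can be sketched as follows. A small numpy/scipy illustration; our analysis used the cited implementations rather than this code:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def dtw(a, b):
    """Classic O(nm) dynamic-time-warping distance between 1-D series."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def cluster_series(series, n_clusters):
    """Standardize each series, build a pairwise DTW distance matrix,
    then cut a Ward-linkage dendrogram into n_clusters groups."""
    std = [(s - s.mean()) / s.std() for s in series]
    k = len(std)
    dist = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            dist[i, j] = dist[j, i] = dtw(std[i], std[j])
    Z = linkage(squareform(dist), method="ward")
    return fcluster(Z, n_clusters, criterion="maxclust")
```

Standardizing first means the clustering responds to the shape of the speed profile rather than the absolute speed level of each link.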
Examples of different series observed in incident windows are given in the supplementary material, where we observe differences in the severity, symmetry, time-scales and stability of speed behaviour during incidents, and more details on the clustering discussed.

C.3 Clustering Details
Whilst we saw that the dataset appeared to cluster on both the time-series and the time-invariant features, we offer more explanation of the clustering here. When clustering the data by their time-invariant features, we use the 'daisy' dissimilarity measure (Kaufman and Rousseeuw [1990]), as it is able to handle features of varying types. To do so, it uses the Gower dissimilarity coefficient (Gower [1971]). Inspecting the average within-cluster sum of squared distances for both clustering results, we construct a so-called 'elbow plot', shown in Figure 11. From Figure 11 we see diminishing reduction in the within-cluster sum of squared distances after around 6 clusters in each case, suggesting that one should not segment the data further than this.

D Models Summary
In Table 7, we summarize the models considered and note some key points one should consider when comparing them. They have been chosen to reflect existing work in the transportation literature and recent advancements in other disciplines.

E Hyper Parameter Grids
For each of our deep learning models, we consider a range of hyper-parameters and apply 100 iterations of random search to determine optimal choices. In Table 8 we detail the possible choices for various hyper-parameters. For all fitting, we apply early stopping based on the loss on a holdout set, and we train using a batch size of 128. Learning rate for all models was decayed when learning stalled, and the Adam optimizer was used. Models are implemented in Keras.

F Joint Models
Here we discuss the main alternative for dynamic predictions in survival analysis found in the literature: joint models. A more detailed discussion for interested readers can be found in Rizopoulos [2012]. The primary idea behind joint modelling is to couple two models: a linear mixed effects model for the available time-series, and a survival model for the incident times. A general linear mixed effects model is written as:

y_i(t) = x_i(t)^T β + z_i(t)^T b_i + ε_i(t),  b_i ~ N(0, D),  ε_i(t) ~ N(0, σ^2),  (22)

Here, β describes the average longitudinal evolution of the population through time (fixed effects) and b_i represents the deviation of a particular individual i from the population average (random effects). The random effect regression coefficients are assumed to be Gaussian with covariance matrix D, the error terms are assumed to be Gaussian with some fixed variance σ^2 across the population, and x_i and z_i are covariates of interest, repeatedly measured through time.
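To make the linear mixed effects structure concrete, a minimal simulation sketch. The design matrices here are hypothetical choices (a random intercept and slope shared between fixed and random parts):

```python
import numpy as np

def simulate_lmm(n_groups, n_obs, beta, D, sigma2, seed=0):
    """Simulate y_i(t) = x_i(t)^T beta + z_i(t)^T b_i + eps_i(t) with
    random effects b_i ~ N(0, D) and noise eps ~ N(0, sigma2). Here the
    fixed and random designs are both an intercept and a time trend."""
    rng = np.random.default_rng(seed)
    out = []
    for i in range(n_groups):
        b = rng.multivariate_normal(np.zeros(len(D)), D)   # individual deviation
        t = np.linspace(0.0, 1.0, n_obs)
        X = np.column_stack([np.ones(n_obs), t])           # fixed-effect design
        y = X @ beta + X @ b + rng.normal(0.0, np.sqrt(sigma2), n_obs)
        out.append((i, t, y))
    return out
```

Each simulated individual shares the population trend β but carries its own intercept and slope perturbation b_i, which is exactly the structure the survival sub-model later conditions on.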
The survival model used in joint models is commonly a Cox model, meaning the full specification of the 'basic' joint model can be written as:

h_i(t | M_i(t)) = h_0(t) exp(γ^T w_i + α m_i(t)),  (23)

where m_i(t) is the value of the longitudinal process at time t, M_i(t) = {m_i(s) : 0 ≤ s < t} represents its history up to time t, and the covariates' impact in the Cox model is augmented by the term α m_i(t), representing the influence of the longitudinal data. Given Eq. (23), we can incorporate the impact of time-varying covariates and uncertainty into the survival model; however, we are really interested in dynamic predictions: determining the probability of surviving to a particular time u > t, conditioned on having survived up to t. It is detailed in Rizopoulos [2012] how to do this, requiring Monte Carlo sampling to integrate over all possible random effect values.
We do not consider such a model for comparison in this work for a number of reasons. Firstly, particularly in extreme accident scenarios, the behaviour of traffic metrics is likely highly non-linear as a function of the known covariates. Secondly, we would be required to specify the model for the multivariate time-series, the survival outcome and how to link the two models when applying this, leaving significant room for model misspecification to accumulate.

H SHAP Value Details

To explain a complex model, in our case a neural network, one writes a simpler 'explainer' model of the form:

g(z') = φ_0 + Σ_{i=1}^{M} φ_i z'_i,  (24)

Here, φ_0 represents the baseline model output, and each feature i shifts this output up or down until a final value is reached, with M features in total. This output can be measured for each neuron in the output layer. Recall from the main text that one computes the SHAP value for feature i as:

φ_i = Σ_{S ⊆ M \ {i}} [ |S|! (|M| − |S| − 1)! / |M|! ] (f(S ∪ {i}) − f(S)),  (25)

where M is the set of all features. We can intuitively consider how Eq. (25) arises as follows. Imagine at first we have an empty feature vector. Then, at random, we start adding features to it. Whenever we happen to add feature i to a set S of already-added features, the change in output will be:

f(S ∪ {i}) − f(S).

Now consider, at the point we add i to S, how many ways could S have formed into its current state? There are |S|! ways this could have happened. Further, there are (|M| − |S| − 1)! ways that the remaining features could be added after i has been. This suggests that for a single set S the contribution is weighted as:

|S|! (|M| − |S| − 1)! (f(S ∪ {i}) − f(S)).

However, this is just for a single S, so we then sum over all possible sets S and average the value by dividing by the total number of possible orderings of features, |M|!. Combining all of this, we attain Eq. (25).
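The subset sum in Eq. (25) can be computed exactly for small feature counts by brute-force enumeration. An illustrative sketch, where the hypothetical `value_fn` plays the role of f evaluated on a feature subset:

```python
import numpy as np
from itertools import combinations
from math import factorial

def exact_shap(value_fn, n_features):
    """Exact Shapley values via the subset sum of Eq. (25): each
    marginal contribution f(S + {i}) - f(S) is weighted by
    |S|! (M - |S| - 1)! / M!, where M is the number of features."""
    M = n_features
    phi = np.zeros(M)
    for i in range(M):
        others = [j for j in range(M) if j != i]
        for r in range(M):
            for S in combinations(others, r):
                w = factorial(r) * factorial(M - r - 1) / factorial(M)
                phi[i] += w * (value_fn(set(S) | {i}) - value_fn(set(S)))
    return phi
```

For a game where features 0 and 1 are jointly needed to produce any value, symmetry and efficiency force the credit to split equally between them, with the dummy feature receiving zero.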
However, there is a clear problem in that neural networks cannot take arbitrary missing features as an input, so using Eq. (25) alone one cannot compute the SHAP values for such models. To address this, one considers some 'reference' or 'background' dataset, from which values are taken as replacements when considering alterations of feature vectors. Such an approach is discussed in Lundberg and Lee [2017]. The idea is that, since we cannot omit a feature fully, we instead set its value to a reasonable reference value. The appropriate choice of background is an open problem in applying these methods. For image tasks it might be clear that one can use a blank image as a reference, and then we are evaluating a pixel's importance relative to it being blank. However, in many domains no reasonable and intuitive reference value exists. Instead, it is common to provide a dataset with many records as the background and average over it, which is the approach we take in our work. Due to computational constraints, we set the background dataset to 10000 random instances from the training dataset, and compute the SHAP values for 1000 randomly selected points in the unseen data at specific prediction times.
A second complication of note is that there is often structure in feature vectors, whose elements are not simply a random collection of possible values for each entry. One can see this structure most clearly in a one-hot encoded example. Suppose our feature vector contained three binary values indicating morning, afternoon or evening, corresponding to the time of day being discretised and then one-hot encoded. In the raw methodology for SHAP in Lundberg and Lee [2017], one would go through each feature, consider setting it to a reference value and use the change in model output as a measure of importance. However, suppose we had our time of day encoding and an entry occurred during the morning, meaning the subset of our feature vector was [1, 0, 0]. We may then consider altering the entry corresponding to the binary label for afternoon, leading us to consider a feature vector of [1, 1, 0]. Of course, such an input is not practically possible, as a data-point can only come from one time of day, and so the model output for it is irrelevant to our problem. To avoid this, recent work in Lundberg [2017] incorporates the structure in the data before performing any perturbations. The data is partitioned into highly correlated components, and then when considering feature importance, instead of altering a single element, a whole group is altered together. This is a natural requirement for our problem, where locations, time of day and incident types have been one-hot encoded. We do however note that we computed the SHAP values both incorporating this structure and neglecting it, and saw similar results in both cases.

I SHAP Values for Categorical Features
In the main text, we considered SHAP values as a measure of variable importance and gave example interpretations of why the neural network model was outputting particular predictions. Here we plot further examples of what the categorical values contribute overall to the model. In Figure 12, we show the results for the time of day variable. Generally, we see that the time of day variable attaining the value 'afternoon' pushes down predictions at short horizons (5 and 30 minutes, Figures 12A, 12B). At a horizon of 60 minutes (Figure 12C), we see that the model result is sometimes decreased and sometimes increased, and at a horizon of 180 minutes (Figure 12D) every instance had the model output increased. From this we might be led to believe that incidents occurring in the afternoon can spill over into the evening rush hour and last a significant amount of time. The same analysis can be done for alternative times of day; for example, we see that incidents in the night have the output increased at horizons of 5, 30 and 60 minutes (Figures 12A, 12B, 12C) and decreased at a horizon of 180 minutes (Figure 12D). If we consider instead the incident type, we see results as shown in Figure 13. We first note that the NTIS incident data is dominated by abnormal traffic incidents, clearly seen by the random sample of 1000 incidents selected for explanation containing a large number of abnormal traffic cases. When we inspect the data, the abnormal traffic cases are quite heavy tailed, suggesting either that they are not always turned off by the operator in a timely manner, or that there are some significant drops in link performance not directly attributed to a physical incident on the link. We generally see that general obstructions increase the model output at a horizon of 30 minutes (Figure 13B) but decrease it after this.
Abnormal traffic incidents generally increase the model output at higher horizons and decrease it at lower ones, however the time-series and other information could then account for some of the short-lived abnormal traffic incidents we observe, or indeed some of the heavy tailed aspects we see.
Finally, we can also inspect the season's impact on the model, with results shown in Figure 14. It is first important to note that we only have one year of data, so the season variable here could actually capture some long-term systematic change in response time or incident severity; however, industry experts have not indicated that we should expect this. When we consider incidents in the winter, we see the output is mixed at a 5 minute horizon (Figure 14A) and increased at a horizon of 30 minutes (Figure 14B), whereas it is decreased at the later horizons. The effect of spring appears to be systematically smaller in size than that of the other seasons.

J AUROC Validation
Whilst we have looked at the C-index and Brier score for different (prediction time, prediction horizon) pairs, as was done in Lee et al. [2020], we can equally look at the area under the receiver operating characteristic curve as a performance metric, as was done in Jarrett et al. [2020]. We show results for this, at particular horizons, averaged over all input windows, in Table 11, computed using the methods discussed in Pölsterl [2020] and Lambert and Chevret [2016].
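As an illustration of the underlying computation, a time-dependent AUROC at a single (prediction time, horizon) pair might look as follows. This is a sketch, not the implementation of Pölsterl [2020]; handling of censored durations is omitted:

```python
import numpy as np

def horizon_auroc(scores, durations, t, h):
    """AUROC at prediction time t and horizon h: among incidents still
    ongoing at t, the event is 'ends within (t, t + h]', scored by the
    model's predicted probability of that interval. Uses the rank
    (Mann-Whitney) formulation; censoring handling is omitted."""
    at_risk = durations > t
    y = (durations[at_risk] <= t + h)
    s = np.asarray(scores)[at_risk]
    pos, neg = s[y], s[~y]
    gt = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return gt + 0.5 * ties
```

Averaging this quantity over all input windows at a fixed horizon gives a number comparable in spirit to those reported in Table 11.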

K MAPE as a Function of Time
We have looked at the error at different percentiles into an incident because of the national criteria set by Highways England when considering incident duration modelling. However, we note that some other papers consider the error at different minutes into an incident. These are clearly two different methods of aggregation. The first averages across different events, taking predictions from potentially very different time points that nonetheless represent the same fraction of the entire incident. The second averages across the same prediction time for every incident, but disregards the fact that this might be a very short time relative to the total duration of some events, and a very long time relative to others. We present results for the second averaging here to give clear comparison and to show that our results align generally with the expected behaviour of other dynamic prediction works. Absolute percentage error (APE), computed as a function of minutes into an incident, is given in Figure 15 for some of the models considered, for events longer than 60 minutes. This behaviour is generally consistent with that seen in Figure 2 in .

Figure 15: APE (both mean and median shown) for some dynamic models as a function of the minutes into an incident. We show 30 minutes onwards, which is the point where only incident-related information is being fed into the sliding window model, and the landmarking models are no longer fit with minor events, as these have already ended.
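The minutes-into-incident aggregation can be sketched as follows. A minimal illustration with a hypothetical array layout: `preds[k, m]` is the duration predicted for event k at elapsed time `minutes[m]`:

```python
import numpy as np

def ape_by_minute(preds, durations, minutes):
    """Mean and median absolute percentage error of duration point
    predictions made at fixed minutes into each incident. Only events
    still ongoing at each elapsed time are included in the average."""
    mean_ape, median_ape = [], []
    for m_idx, m in enumerate(minutes):
        ongoing = durations > m
        ape = 100.0 * np.abs(preds[ongoing, m_idx] - durations[ongoing]) / durations[ongoing]
        mean_ape.append(ape.mean())
        median_ape.append(np.median(ape))
    return np.array(mean_ape), np.array(median_ape)
```

Restricting each column to still-ongoing events is what distinguishes this aggregation from the percentile-based one used in the main text.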