A Literature Review of Performance Metrics of Automated Driving Systems for On-Road Vehicles

The article presents a review of recent literature on the performance metrics of Automated Driving Systems (ADS). More specifically, performance indicators of environment perception and motion planning modules are reviewed as they are the most complicated ADS modules. The need for the incorporation of the level of threat an obstacle poses in the performance metrics is described. A methodology to quantify the level of threat of an obstacle is presented in this regard. The approach involves simultaneously considering multiple stimulus parameters (that elicit responses from drivers), thereby not ignoring multivariate interactions. Human-likeness of ADS is a desirable characteristic as ADS share road infrastructure with humans. The described method can be used to develop human-like perception and motion planning modules of ADS. In this regard, performance metrics capable of quantifying human-likeness of ADS are also presented. A comparison of different performance metrics is then summarized. ADS operators have an obligation to report any incident (crash/disengagement) to safety regulating authorities. However, precrash events/states are not being reported. The need for the collection of the precrash scenario is described. A desirable modification to the data reporting/collecting is suggested as a framework. The framework describes the precrash sequences to be reported along with the possible ways of utilizing such a valuable dataset (by the safety regulating authorities) to comprehensively assess (and consequently improve) the safety of ADS. The framework proposes to collect and maintain a repository of precrash sequences. 
Such a repository can be used to 1) comprehensively learn and model the precrash scenarios, 2) learn the characteristics of precrash scenarios and eventually anticipate them, 3) assess the appropriateness of the different performance metrics in precrash scenarios, 4) synthesize a diverse dataset of precrash scenarios, 5) identify the ideal configuration of sensors and algorithms to enhance safety, and 6) monitor the performance of perception and motion planning modules.


INTRODUCTION
About 90% of road accident fatalities are attributed to human errors such as distraction, fatigue, violation of traffic rules, and poor judgements (Treat et al., 1979;Katrakazas, 2017;Collet and Musicant, 2019;Wood et al., 2019). Automation of the driving task offers an excellent opportunity to reduce such errors and consequently improve road safety, accident costs, productivity, mobility, and convenience. Automated Driving Systems (ADS) are rigorously being developed across the globe, realizing these tremendous potentials. ADS (SAE level 3, 4, or 5) can simultaneously handle lateral and longitudinal motions of the vehicles (SAE, 2018).
Well-conceived performance metrics shall be practicable and ideally not involve subjective terms (e.g., thresholds). As safety is being quantified, supporting evidence from field evaluations and simulations is necessary. According to the National Highway Traffic Safety Administration (NHTSA), it is premature to regulate the safety standard for ADS (NHTSA, 2020). There is no clear consensus about the performance metrics among the researchers or ADS developers. Ill-conceived (as opposed to well-conceived) performance metrics may deter the development or progression of ADS, or worse, provide a false sense of security/performance. Hence, NHTSA is presently seeking inputs from researchers, regulators, and ADS developers to formulate the safety standards for ADS. Its counterpart at the United Nations Economic Commission for Europe, the World Forum for Harmonization of Vehicle Regulations, is also actively working on this matter. This emphasizes the need for a literature review of the available performance metrics to gauge the performance of ADS.
However, ADS have safety-of-life critical applications. This characteristic necessitates appropriate guidelines, rules, and regulations to ensure technological advancement without compromising road safety. Performance requirements to ensure the safety of ADS are therefore to be standardized and regulated. The performance metrics used for such a task shall be practicable and objective, meaning that the metrics shall be computed based on scientific measurements (not opinion-based) and be consistent.
Nevertheless, in the absence of standards/regulations, some ADS developers have resorted to voluntary assessment (self-assessment) of safety aspects of ADS. Self-developed performance metrics are being employed to improve the ADS, using data from limited field deployments. Several other research studies related to the safety of ADS are already published (e.g., Alkhoury, 2017;Every et al., 2017;Fraade-blanar et al., 2018;Nistér et al., 2019;Wood et al., 2019;Berk et al., 2020;Riedmaier et al., 2020;Weng et al., 2020;Wishart et al., 2020;Bansal et al., 2021;Elli et al., 2021;Huang and Kurniawati, 2021;Luiten et al., 2021;Wang et al., 2021). Researchers across the globe are making a considerable effort to quantify the safety of ADS and consequently improve it.
The development of performance metrics demands the understanding of driving tasks. Successful execution of driving tasks by human drivers depends on 1) knowing the current state of self (such as location, speed, acceleration, and steering angle), 2) perceiving the states of surrounding obstacles, 3) planning the future course of action ensuring safety, and 4) controlling the vehicle using the steering wheel, throttle, and brakes. Analogously, ADS can be considered to have four primary modules (Figure 1): 1) localization, 2) perception, 3) motion planning, and 4) vehicle control. ADS can also have an additional module dedicated to wireless communications.
Ego vehicle (EV) localization involves measuring the state of the EV, such as position, velocity, and acceleration. Global Navigation Satellite Systems (GNSS) such as GPS, Galileo, GLONASS, BeiDou or regional navigation satellite systems (RNSS) such as IRNSS or QZSS can be used for rough state estimation. Localization accuracy of such standalone systems is generally not suitable for safety-of-life-critical applications. Integration of GNSS/RNSS positioning information with GIS road maps and other sensors such as accelerometers and gyroscopes can enhance the localization accuracy (e.g., Yeh et al., 2016;Li et al., 2017;Sharath et al., 2019;Wang et al., 2019). However, GNSS/RNSS availability diminishes in tunnels and under forest cover. Urbanization introduces multi-path errors, which may deteriorate the quality of localization. Lane-level localization using visual cues such as lane markings or other road signs is also possible (e.g., Li et al., 2010;Alkhorshid et al., 2015;Gaoya;Kamijo et al., 2015;Qu et al., 2015;Cao et al., 2016;Kim et al., 2017). But such systems can suffer inaccuracies due to occlusion.
Environment perception involves abstracting information from the surroundings. It involves measuring the states (e.g., position, velocity, acceleration, and type/class) of surrounding obstacles. A combination of RADARs, LiDARs, and cameras is used to detect, classify, and track the surrounding obstacles (Zhu et al., 2017). Computer vision is a popular approach due to the low cost of cameras and the ability to classify the obstacles accurately (e.g., Mohamed et al., 2018;Janai et al., 2020). Machine learning approaches for environment abstraction are on the rise and appear promising (e.g., Yang et al., 2019;Fayyad et al., 2020;Sligar, 2020).
Information about the surroundings is used to plan the EV's future actions to safely navigate in a dynamic environment. Motion planning (trajectory planning to be more precise) of autonomous vehicles is another very challenging task. Extreme care is to be exercised to ensure safety. The process involves deciding the EV's future states (position, velocity, acceleration) in the dynamic traffic environment. Humans make such decisions based on multiple parameters (see Human-likeness and ADS). Multivariate interactions are to be considered in the human-like motion planning of autonomous vehicles.
The final task is to execute the planned motion. The vehicle control module performs this action. Wireless communication between every entity on the road can substantially simplify the arduous task of the environment perception module. However, such a situation could occur only when all the vehicles plying on the road are equipped with a wireless communication module.
The most challenging tasks are assigned to the perception and motion planning modules. Proprioceptive sensors (such as speedometer, accelerometer, and gyroscope) and exteroceptive sensors (such as cameras, LiDARs, and RADARs) fetch data from the surroundings. Understanding/abstracting the surroundings by processing data received from such sensors is the perception module's primary task. The perception module deals with the detection, classification, and tracking of obstacles. It also anticipates the future states of the obstacles. This forms the basis for planning the future motion of the EV. The EV's safe and efficient movement in the dynamic traffic is made possible by the motion planning module using the current and future states of surrounding obstacles. Motion planning involves making high-level decisions (such as overtaking, lane changing, turning, and following) and low-level decisions (such as deciding instantaneous speed, acceleration, braking, and steering). Errors in any of these tasks may get cascaded and eventually result in an unsafe situation.
The safety of the ADS thus depends on the performance of these primary modules. The environment perception sensors used by different ADS are different. Some developers use cameras as primary sensors, while others make use of LiDARs. As such, the perceived environment will inherently depend upon the configuration of sensors used. The software (algorithms and sensor fusion) used for processing/analysis of data perceived by sensors has a pivotal role in determining the performance of ADS. Furthermore, human drivers and ADS will coexist for the next several decades. The driving behavior of ADS shall be similar to that of human drivers to gain public acceptance. These aspects present a unique challenge to the regulatory authority. The performance metrics shall be able to incorporate all the points mentioned above.
This article attempts to review the metrics used to quantify the performance of perception and motion planning modules (two of the most complicated modules). Introduction and need for the current work are presented in Section 1 and Section 2. Section 3 provides the performance metrics for environment perception and motion planning modules. Furthermore, the need for metrics to quantify human-like perception and driving behavior is elaborated in Section 4. The advantages and limitations of the existing performance metrics are summarized in Section 5. Lastly, a framework for safety regulating authorities to collect information regarding scenarios resulting in an incident is presented in Section 6. The regulatory authorities may use this repository of benchmark scenarios/datasets to compare different ADS objectively. More specifically, a repository of edge cases (critical scenarios) where the ADS tend (or are observed) to perform poorly may be used for selecting/formulating the performance metrics and eventually specifying the performance requirements. The work is summarized in Section 7.

RESEARCH CONTRIBUTIONS
This article makes the following contributions: 1) A literature review on the safety-quantifying metrics of environment perception and motion planning algorithms is presented. 2) Obstacles posing a high-level risk to the safety of the subject vehicle need to be accurately perceived and proper action taken. On the contrary, erroneous perception of an obstacle that poses no threat may be acceptable. The need for the inclusion of threat levels of obstacles in the performance metric is identified. A novel multivariate cumulative distribution approach to assess (human-like) threat levels is presented. A similar approach can be used for human-like motion planning. 3) A suggestion to the safety regulating authority in the form of a framework is presented. The framework focuses on collecting the states of subject vehicles and the obstacles resulting in incidents. Such a repository can be used for quantifying, monitoring, and evaluating the safety of different ADS.
obstacles). Based on such understandings, the future states of the EV would be determined to ensure safety. The states of the EV (e.g., speed, acceleration, position) and that of other traffic entities dictate the performance of the ADS. ADS may drive the EV into a precarious situation due to inappropriate hardware and/or software implementation. The threat to safety can also arise purely from external sources (other traffic entities). Several manufacturers/organizations are independently developing ADS. The hardware and software components influencing performance thus significantly vary between different ADS developers. As such, a unified metric to quantify the performance of an ADS may not be possible. Furthermore, SAE level 3 vehicles require human drivers' intervention in case of a fallback. As humans are in the loop, performance metrics should include human factors as well. These aspects further complicate the task of setting up safety standards by regulatory authorities.
The performance of an ADS depends on that of the EV localization, perception, motion planning, and vehicle control modules (Berk et al., 2020). Perception and motion planning modules are the most complicated and influential parts of an ADS. Hence, the performance metrics or indicators for these two modules are reviewed in this article.

Performance Metrics for Environment Perception
Environment perception involves understanding/measuring the state of surrounding (dynamic) obstacles. State includes position, velocity, acceleration, and class/type. Cameras are generally used for object (obstacle) detection and tracking in ADS. Data from other sensors (e.g. point cloud data from LiDARs) can also be used for object detection and tracking. A comparison of the three major sensors used for environment perception is provided in Table 1, which is compiled by reviewing multiple sources (Hasch et al., 2012;Murad et al., 2013;Patole et al., 2017;Campbell et al., 2018;Lin and Zhang, 2020;Lu et al., 2020;Wang et al., 2020;Zaarane et al., 2020;Yeong et al., 2021).
Cameras are the ubiquitous sensors in ADS. Monocular cameras tend to have a longer range compared to stereo cameras. Thermal/infrared cameras are also used to detect objects in low-lighting conditions (e.g., Korthals et al., 2018;John and Mita, 2021). The field of view depends on the focal length of the lens used. Multi-input multi-output RADARs are being extensively used in ADS due to their high angular resolution and smaller size (Sun et al., 2020). Cameras and LiDARs complement each other in adverse weather conditions. LiDARs are accurate sensors with a few caveats. They are very expensive, computationally challenging and cannot perceive visual cues. RADARs and LiDARs are both active sensors (emit electromagnetic radiation and analyze the scattered/reflected signals) and hence could suffer from interference when multiple such sensors are placed in close proximity. GNSS receivers are used to locate the vehicle on a road map through a process called map-matching (e.g., Quddus, 2006, 2013;Velaga et al., 2009;Sharath et al., 2019). The positioning accuracy of GNSS receivers is approximately 5-20 m, and GNSS availability may be compromised under forest cover and in tunnels. Integration of the inertial sensors such as accelerometers and gyroscopes with GNSS receivers can mitigate the issue of unavailability and poor positioning accuracy to some extent. Visual cues such as road markings perceived from cameras can also be used to localize the EV. Multiple sensors are to be fused/integrated to achieve sufficient redundancy in safety-of-life-critical applications. Object detection involves estimating the states of the vehicles at a time step based on data received from sensors. Figure 2 depicts one such instance where the black bounding box is the estimated position. Tracking (also called association) is the process of detecting multiple obstacles and associating a unique identifier to the corresponding obstacles in different time steps (Figure 3).
In the figure, Class indicates the type of the obstacle (e.g., bike, car, truck, and pedestrian). Cameras are popularly used for obstacle classification.
X and Y represent the true coordinates of the vehicle in the global Cartesian plane. They can be used to determine the lateral and longitudinal positions of a vehicle in a local coordinate system; X̂ and Ŷ provide the estimated position of an obstacle; V and V̂ are the true and estimated velocities of an obstacle; τ is the time step with step size Δt.
Data from multiple perception sensors such as LiDARs and RADARs can be used to estimate X̂, Ŷ, and V̂.
Environment perception happens using multiple sensors such as cameras, LiDARs, RADARs, SONARs, and microphones. Cameras are prevalent because of their low cost. Visual cues such as lane markings and traffic signs can be perceived using cameras (Pollard et al., 2011;Yogamani et al., 2019). However, range measurements are less precise. Cameras are susceptible to weather conditions, and their ability drastically drops in inclement weather. Multiple cameras are generally used to perceive the environment in all directions. Thermal infrared cameras may also be used to sense the environment in the dark (Miethig et al., 2019;Dai et al., 2021).
LiDARs, though expensive, are suitable for precise range measurements. They are less susceptible to weather conditions. Hence, they are ideal for classification and tracking (Gao et al., 2018). RADARs can accurately detect and track metallic objects. They are less sensitive to weather conditions. Short-range RADARs can be used to detect vulnerable road users (pedestrians and bicyclists) by analyzing micro-Doppler signatures (Steinhauser et al., 2021). However, micro-Doppler effects are not pronounced for stationary objects, and hence they may not be detected. Both RADARs and LiDARs are active sensors, meaning they emit electromagnetic radiation and perceive reflected/scattered radiation. This aspect makes them vulnerable to interference when multiple active sensors are in close proximity. Researchers are working to mitigate interference (Goppelt et al., 2010;Alland et al., 2019). Ultrasonic range measurement sensors are popular in detecting closer objects. Microphones are necessary to respond to audio cues such as those from emergency vehicles.

Traditional Metrics or Performance Indicators
Cameras serve as convenient object detection and tracking sensors. A frame extracted from a video would have multiple objects (obstacles) of interest. First, objects are to be detected and segmented. Then the detected objects are to be identified/classified. Last, an application such as ADS requires that the objects be tracked (i.e., to understand the association of detected objects between the successive frames). These complex tasks are handled by computer vision algorithms. True Positive (TP), False Positive (FP), and False Negative (FN) are the three basic indicators traditionally used in the context of ADS (Visa et al., 2011;Girshick et al., 2014;Flach and Kull, 2015;Yu and Dai, 2019;Powers, 2020). True positive is when an algorithm detects an object correctly. False positive is when an algorithm detects a nonexistent object. False negative is when an algorithm misses the detection of an existing object. These indicators are used to define the following metrics:

Recall (r), also called Sensitivity: It is the ratio of true positive instances to the actual number of positive instances. This metric is suitable when false negatives are of high importance.

$$r = \frac{TP}{TP + FN} \tag{1}$$

Precision (p), also called Confidence: It is described as the ratio of true positive instances to the predicted number of positive instances. This metric is useful when false positive instances are important.

$$p = \frac{TP}{TP + FP} \tag{2}$$

F1 score: It is described as the harmonic mean of precision p and recall r, and is obtained as:

$$F_1 = \frac{2\,p\,r}{p + r} \tag{3}$$

Jaccard distance (Volk et al., 2020;Luiten et al., 2021) uses both FP and FN instances and is described as:

$$J = \frac{TP}{TP + FP + FN} \tag{4}$$

None of the above-described metrics considers the quality of detection/classification/tracking, as a binary decision is made (based on a threshold). Tightly bound segmentation of an object is the desired quality apart from its correct detection. The Intersection over Union (IoU) metric addresses this aspect and is given by:

$$IoU = \frac{|D \cap G|}{|D \cup G|} \tag{5}$$

where D is the detected bounding box of an object and G is the actual (ground truth) bounding box of the corresponding object. The numerator is the area of intersection of the two bounding boxes while the denominator is the area of their union. Figure 4 depicts the concept of IoU, which serves as a similarity indicator based on object detection (Luiten et al., 2021).
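The indicators above can be sketched in a few lines of Python. This is a minimal illustration, not code from any of the reviewed studies: the function names, the `(x_min, y_min, x_max, y_max)` box layout, and the sample numbers are assumptions, and the Jaccard value is computed in its index form TP / (TP + FP + FN).

```python
from typing import Dict, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in any consistent unit.
Box = Tuple[float, float, float, float]


def detection_scores(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Recall, precision, F1 and Jaccard index from raw TP/FP/FN counts."""
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    pr_sum = precision + recall
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / pr_sum if pr_sum > 0 else 0.0
    total = tp + fp + fn
    jaccard = tp / total if total > 0 else 0.0
    return {"recall": recall, "precision": precision, "f1": f1, "jaccard": jaccard}


def iou(detected: Box, truth: Box) -> float:
    """Intersection over Union of two axis-aligned bounding boxes."""
    ix_min = max(detected[0], truth[0])
    iy_min = max(detected[1], truth[1])
    ix_max = min(detected[2], truth[2])
    iy_max = min(detected[3], truth[3])
    # Overlap is zero when the boxes do not intersect.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_d = (detected[2] - detected[0]) * (detected[3] - detected[1])
    area_g = (truth[2] - truth[0]) * (truth[3] - truth[1])
    union = area_d + area_g - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    print(detection_scores(tp=8, fp=2, fn=2))
    print(iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)))
```

In practice a detection is usually counted as a true positive only when its IoU with a ground-truth box exceeds a chosen threshold, which is where the subjectivity noted in the text enters.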

CLEAR Metrics for Evaluation of Object Detection and Tracking
Traditional metrics described above place emphasis on object detection. Tracking, which is an association of detected objects in successive time steps, is of equal importance. Hence, metrics have been developed to quantify the detection as well as tracking quality (Stiefelhagen et al., 2006). CLassification of Events, Activities and Relationships (CLEAR) is one of the popular studies that described the metrics for quantifying object detection and tracking accuracy (Stiefelhagen et al., 2006;Volk et al., 2020). These metrics can be used for the detection and tracking of obstacles such as pedestrians and vehicles. The metrics are described below:

Multiple-Object-Tracking Accuracy (MOTA): The numerator is constructed by an additive combination of false negatives, false positives, and association error (e). This metric does not indicate localization quality (ability to segment/bound the objects).

Multiple-Object-Tracking Precision (MOTP): This metric solely indicates the localization accuracy. It is a measure of conformity between the estimated and actual segmentation of the obstacle. The numerator can be considered to indicate the similarity between the estimated and actual obstacle locations. MOTP is described as the arithmetic mean of similarity scores.

Multiple-Object-Detection Accuracy (MODA) and Multiple-Object-Detection Precision (MODP): A weighted sum of false negative and false positive instances is considered in MODA. On the other hand, MODP considers a similarity score similar to that used in MOTP. However, the tracking/association aspect is ignored in these metrics. Detection quality in a frame (or at a time step) is quantified.

In these metrics, $IoU_{i,t}$ is the IoU for obstacle i at time t; $TP_t$ (True Positive) is the number of correctly identified/tracked objects in the frame at time t; $FN_t$ is the number of missed detections (False Negative) at time t; $e_t$ is the number of objects erroneously tracked/associated at time t; $FP_t$ is the number of false positives at time t; $g_t$ is the number of objects actually present in the frame at time t (ground truth); $N_t$ is the number of detections at time t; w and (1 − w) are the weights to respectively dictate the relative importance of $FN_t$ and $FP_t$.

Frontiers in Future Transportation | www.frontiersin.org November 2021 | Volume 2 | Article 759125
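Using the symbols defined above, the CLEAR metrics can be sketched as follows. The expressions in the docstrings are the commonly used CLEAR forms and are an assumption about the formulas this section refers to; in particular, $N_t$ is taken here to be the number of matched detections in the frame. `FrameStats` and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class FrameStats:
    """Per-frame counts; field names mirror the symbols in the text."""
    fn: int                # FN_t: missed detections
    fp: int                # FP_t: false positives
    e: int                 # e_t: association (identity) errors
    g: int                 # g_t: ground-truth objects present
    ious: Sequence[float]  # IoU_{i,t} for each matched (true positive) object


def mota(frames: Sequence[FrameStats]) -> float:
    """Assumed form: 1 - sum_t(FN_t + FP_t + e_t) / sum_t(g_t)."""
    total_gt = sum(f.g for f in frames)
    if total_gt == 0:
        raise ValueError("no ground-truth objects in the sequence")
    errors = sum(f.fn + f.fp + f.e for f in frames)
    return 1.0 - errors / total_gt


def motp(frames: Sequence[FrameStats]) -> float:
    """Assumed form: sum_{i,t} IoU_{i,t} / sum_t TP_t (mean similarity of matches)."""
    matched = sum(len(f.ious) for f in frames)
    if matched == 0:
        raise ValueError("no matched objects in the sequence")
    return sum(sum(f.ious) for f in frames) / matched


def moda(frame: FrameStats, w: float = 0.5) -> float:
    """Assumed per-frame form: 1 - (w * FN_t + (1 - w) * FP_t) / g_t."""
    if frame.g == 0:
        raise ValueError("no ground-truth objects in the frame")
    return 1.0 - (w * frame.fn + (1.0 - w) * frame.fp) / frame.g


def modp(frame: FrameStats) -> float:
    """Assumed per-frame form: sum_i IoU_{i,t} / N_t, N_t = matched detections."""
    if not frame.ious:
        raise ValueError("no matched objects in the frame")
    return sum(frame.ious) / len(frame.ious)


if __name__ == "__main__":
    frames = [
        FrameStats(fn=1, fp=0, e=0, g=4, ious=[0.8, 0.6, 0.7]),
        FrameStats(fn=0, fp=1, e=1, g=4, ious=[0.9, 0.5, 0.7, 0.7]),
    ]
    print(mota(frames), motp(frames), moda(frames[0]), modp(frames[0]))
```

Note that MOTA and MODA can become negative when the number of errors exceeds the number of ground-truth objects.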

Higher Order Tracking Accuracy (HOTA) Metric
CLEAR metrics are constituted by multiple indicators, which can be a hindrance for real-time applications (Volk et al., 2020). Also, detection performance takes precedence over tracking/association performance. Hence, Luiten et al. (2021) have adapted the basic indicators to incorporate the tracking aspect. TPA, FNA, and FPA respectively are described as true positive, false negative, and false positive instances in terms of association/tracking. TPA is when an object is correctly tracked in subsequent time steps. FNA and FPA occur when detection is correct and association between the frames is erroneous. Association score for object c, $A_c$, is computed as

$$A_c = \frac{|TPA(c)|}{|TPA(c)| + |FNA(c)| + |FPA(c)|}$$

Detection accuracy, $DetctA_\alpha$, indicates the proportion of aligning detections and is described as

$$DetctA_\alpha = \frac{|TP|}{|TP| + |FN| + |FP|}$$

Association accuracy, $AssocA_\alpha$, is given by

$$AssocA_\alpha = \frac{1}{|TP|} \sum_{c \in TP} A_c$$

Finally, the HOTA score at a localization threshold value of α is computed as follows:

$$HOTA_\alpha = \sqrt{DetctA_\alpha \cdot AssocA_\alpha}$$

HOTA metric unifies detection and association metrics. Thus, it provides a balanced emphasis on detection and association/tracking. The metric has been thoroughly analyzed and validated (Luiten et al., 2021).
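As a rough sketch of how these quantities combine, the snippet below computes the HOTA score at a single localization threshold from pre-counted association statistics. It assumes the definitions of Luiten et al. (2021): the association score of a true positive c is TPA / (TPA + FNA + FPA), detection accuracy is TP / (TP + FN + FP), association accuracy is the mean association score over true positives, and HOTA is the geometric mean of the two. The function name and input layout are hypothetical, and matching detections to ground truth at the threshold α is assumed to have been done beforehand.

```python
import math
from typing import Sequence, Tuple


def hota_at_threshold(
    assoc_counts: Sequence[Tuple[int, int, int]],
    fn: int,
    fp: int,
) -> float:
    """HOTA at one localization threshold.

    assoc_counts holds one (TPA, FNA, FPA) triple per true-positive
    detection c, so len(assoc_counts) is the number of true positives.
    fn and fp are the detection-level false negatives and false positives.
    """
    tp = len(assoc_counts)
    if tp == 0:
        return 0.0
    # Association score A_c for each true positive.
    scores = []
    for tpa, fna, fpa in assoc_counts:
        denom = tpa + fna + fpa
        scores.append(tpa / denom if denom > 0 else 0.0)
    det_a = tp / (tp + fn + fp)   # detection accuracy
    ass_a = sum(scores) / tp      # association accuracy
    return math.sqrt(det_a * ass_a)


if __name__ == "__main__":
    counts = [(4, 0, 0), (4, 0, 0), (2, 1, 1), (2, 1, 1)]
    print(hota_at_threshold(counts, fn=1, fp=0))
```

In Luiten et al. (2021) the final score is obtained by averaging this value over a range of localization thresholds, a step that is left out here.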

Performance Metrics for Motion Planning
Motion planning involves deciding future states of the vehicles at trajectory level and planning maneuvers (Katrakazas et al., 2015). This section includes the performance metrics used both at the trajectory level and for maneuver planning.

Traditional Metrics
Time-To-Collision (TTC): It is the time required to observe a collision between an EV and an obstacle if both of them continue to travel without changing velocities (Minderhoud and Bovy, 2001;Vogel, 2003;Forkenbrock and Snyder, 2015;Johnsson et al., 2018;Li et al., 2021;Wang et al., 2021). It is one of the most popular safety indicators of longitudinal motion of the EV and is given by Hou et al. (2014), where $X$ is the longitudinal position and $\dot{X}$ is the longitudinal speed at time t. Suffix f indicates the follower, while l represents the leader.
Time Exposed Time-to-Collision (TET): It is the cumulative duration for which TTC remains lower than a specified threshold. Both TTC and TET are suitable to quantify risks of collisions like rear-end, turning, and weaving (Mahmud et al., 2017). Usually, a threshold level is set to compute the duration for which a violation occurs. TET can be computed from the TTC series, where Δt is the step size and T is the threshold.

Post Encroachment Time (PET): It is the time gap between the arrival of two vehicles in the area of potential conflict. PET can be used to quantify the safety risk at intersections, weaving, and merging sections (Wishart et al., 2020). Figure 5(1) depicts the time instant $t_1$ when a vehicle exits the area of potential conflict while Figure 5(2) shows the time instant $t_2$ at which another vehicle enters the same area of potential conflict. PET is computed as (Razmpa, 2016):

$$PET = t_2 - t_1$$

These traditional metrics primarily consider the one-dimensional motion (longitudinal) of the EV. However, in the real world, multiple obstacles can simultaneously interact with the EV (pose a threat to the safety of the EV). As such, the EV's two-dimensional (lateral and longitudinal) motion is to be considered in quantifying the safety.
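The three traditional indicators can be sketched as below. The forms used are the textbook ones and are assumptions about the expressions this section refers to: TTC is the gap divided by the closing speed (vehicle length is ignored), TET sums the step size Δt over the time steps in which TTC is at or below the threshold T, and PET is the difference between the entry time of the second vehicle and the exit time of the first. All function and parameter names are hypothetical.

```python
import math
from typing import Iterable


def time_to_collision(
    x_leader: float, x_follower: float, v_leader: float, v_follower: float
) -> float:
    """TTC in seconds if both vehicles keep their current speeds.

    Returns infinity when the follower is not closing in on the leader.
    """
    closing_speed = v_follower - v_leader
    if closing_speed <= 0:
        return math.inf
    return (x_leader - x_follower) / closing_speed


def time_exposed_ttc(ttc_series: Iterable[float], threshold: float, dt: float) -> float:
    """TET: total time during which 0 <= TTC <= threshold."""
    exposed_steps = sum(1 for value in ttc_series if 0 <= value <= threshold)
    return exposed_steps * dt


def post_encroachment_time(t_exit_first: float, t_enter_second: float) -> float:
    """PET: gap between the first vehicle leaving and the second entering the conflict area."""
    return t_enter_second - t_exit_first


if __name__ == "__main__":
    print(time_to_collision(x_leader=50.0, x_follower=20.0, v_leader=10.0, v_follower=15.0))
    print(time_exposed_ttc([5.0, 2.5, 1.5, math.inf, 2.9], threshold=3.0, dt=0.1))
    print(post_encroachment_time(t_exit_first=12.0, t_enter_second=13.5))
```

The threshold passed to `time_exposed_ttc` is exactly the kind of subjective parameter that the introduction cautions against.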
There are other relatively less popular safety indicators such as Time Integrated Time-to-Collision, J-value, standard deviation of lateral position, time-to-intersection, time-to-obstacle (Mahmud et al., 2017).

Responsibility-Sensitive Safety (RSS) Metrics
Specific popular metrics used for indicating/improving the safety of ADS such as 1) miles driven, 2) total number of near-collision incidents/disengagements, 3) simulation, and 4) scenario-based approaches have severe drawbacks (Shalev-Shwartz et al., 2017). To address the drawbacks, Shalev-Shwartz et al. (2017) have described several metrics or indicators to ascertain the safety of an ADS. They are 1) safe longitudinal distance, 2) safe lateral distance, 3) longitudinal danger threshold, and 4) lateral danger threshold. Safe longitudinal distance is the longitudinal separation necessary between an EV and an obstacle to stop the EV without collisions. Safe longitudinal distance is described for the case of 1) EV following another vehicle (traveling in the same direction) and 2) when EV and obstacle are moving toward each other (traveling in opposite directions). Safe lateral distance is the lateral separation necessary to ascertain no lateral collision. When the prevailing separation between the EV and an obstacle is smaller than the safe distance, the situation is considered dangerous. The time instant at which lateral safety is compromised is called lateral danger threshold (similar is the case for longitudinal danger threshold). Using these metrics, proper responses in lateral and longitudinal directions are described in terms of permissible lateral and longitudinal accelerations to ensure safety. Proper responses for routes of different geometry and operational domains are also explained. 
The three distance measures used are (Shalev-Shwartz et al., 2017;Volk et al., 2020): where $d_{\min}^{long,same}$ is the safe longitudinal distance between the EV and an obstacle when they are traveling in the same direction; $d_{\min}^{long,opp}$ is the safe longitudinal distance between the EV and an obstacle when they are traveling in opposite directions; $d_{\min}^{lat}$ is the safe lateral distance between the EV and an obstacle; $v_r$ and $v_f$ are the longitudinal velocities of rear and front vehicles, respectively; $v_i$ is the speed of vehicle i; $a_{AccMax}$ indicates the maximum acceleration; $a_{DecMax}$ and $a_{DecMin}$ respectively indicate the maximum and minimum deceleration; superscripts long and lat indicate the longitudinal and lateral directions, respectively; and t is the step size.

Proper responses for different scenarios are described. However, there are some limitations as the scenario description cannot be exhaustive. Koopman et al. (2019) have identified edge cases or scenarios that cannot be addressed by the RSS approach presented by Shalev-Shwartz et al. (2017). For example, as per $d_{\min}^{long,same}$, the following vehicle with a better braking efficiency can be "ahead" of the leader. Parameters such as slope of the road, road curvature, and contact friction that affect the minimum separation are spatio-dynamic and not comprehensively considered (Koopman et al., 2019). One of the major limitations of scenario-based approaches is the assumption of deterministic motion of the other traffic entities. When human drivers are involved, their responses and the consequent motions would be stochastic (Phillips et al., 2017;Xin et al., 2018;Berntorp et al., 2019).
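For illustration, the same-direction safe longitudinal distance can be sketched as follows. The expression follows the form given by Shalev-Shwartz et al. (2017): the rear vehicle is assumed to accelerate at its maximum rate during the response time and then brake at its minimum deceleration, while the front vehicle brakes at its maximum deceleration, and the result is clipped at zero. Mapping that form onto the symbols of this section (with t as the response time) is an assumption, as are the function name, parameter names, and sample values.

```python
def rss_safe_longitudinal_distance_same_direction(
    v_rear: float,
    v_front: float,
    response_time: float,
    a_acc_max: float,
    a_dec_min: float,
    a_dec_max: float,
) -> float:
    """Minimum safe gap (m) between a rear and a front vehicle, same direction.

    v_rear, v_front : longitudinal speeds (m/s)
    response_time   : response time of the rear vehicle (s)
    a_acc_max       : maximum acceleration of the rear vehicle (m/s^2)
    a_dec_min       : minimum braking deceleration of the rear vehicle (m/s^2)
    a_dec_max       : maximum braking deceleration of the front vehicle (m/s^2)
    """
    # Speed of the rear vehicle at the end of the response time.
    v_rear_after = v_rear + response_time * a_acc_max
    # Distance covered by the rear vehicle: response phase plus braking phase.
    rear_travel = (
        v_rear * response_time
        + 0.5 * a_acc_max * response_time ** 2
        + v_rear_after ** 2 / (2.0 * a_dec_min)
    )
    # Stopping distance of the front vehicle under hardest braking.
    front_travel = v_front ** 2 / (2.0 * a_dec_max)
    # A negative value means no separation is required; clip at zero.
    return max(0.0, rear_travel - front_travel)


if __name__ == "__main__":
    gap = rss_safe_longitudinal_distance_same_direction(
        v_rear=20.0, v_front=20.0, response_time=1.0,
        a_acc_max=2.0, a_dec_min=4.0, a_dec_max=8.0,
    )
    print(gap)
```

The clipping step is where the unclipped expression would otherwise turn negative, for instance when the rear vehicle is much slower than the front vehicle.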
Another study has formulated a "safety score" by adapting the RSS approach. They have modified the metric to reduce the computation time. Such improvements are necessary for real-time applications. The MPrISM metric may be considered a generalized TTC indicator. The TTC between EV and the surrounding obstacles is computed when the EV performs an evasive maneuver and obstacles try to collide with EV (Kamikaze approach). The motion of EV and the surrounding vehicles is considered continuous and governed by ordinary differential equations. The analytical solution is available, which makes it appealing to employ in real-time applications. Vehicle kinematics and pedestrian kinematics are provided in detail. The performance of ADS under several traffic scenarios such as the presence of static obstacles, dynamic obstacles, weaving, and lane change operations is evaluated. NHTSA recommends research and development of metrics similar to MPrISM to assess the safety of ADS (NHTSA, 2020).

Other Metrics Used by ADS Developers
The concept of artificial potential fields is popularly used for collision avoidance and motion planning (Latombe, 1991;Xiong et al., 2016). This approach is further improved by Nistér et al. (2019) to develop the "Safety Force Field." Actions of the dynamic obstacles and the EV are expected to follow specific driving policies to ensure safety. If not, the EV could experience a safety risk. Hence, corrective measures are to be dynamically taken to ascertain continuous safety. The prediction of future states/actions of the dynamic obstacles and of the EV has certain benefits. Foreseeing safety risk is the obvious one. Another major advantage is the possibility of learning the driving policies from the field experiments. A metric comparing observed states and predicted states may be formulated for such a purpose. A consortium of eleven ADS developers/manufacturers has compiled a document providing a framework for developing safe ADS (Wood et al., 2019). Twelve principles governing the safety of ADS are presented in the report. The concepts of safety by design, verification, and validation are the foundation of the proposed framework for ADS development. The required properties of ADS are categorized as fail-safe capabilities and fail-degraded capabilities. Fail-safe and fail-degraded operations are generically described. It is argued that fail-degraded capabilities should take priority over fail-safe capabilities while designing ADS. Fraade-blanar et al. (2018) have developed a generic framework to quantify the safety of ADS. The report provides desirable qualities of safety indicators/metrics. Suitable safety indicators at the development, demonstration, and deployment stages are mentioned. However, the formulation of performance metrics used by different ADS developers is not provided in either of the reports.

Concept of "Threat" in the Performance Metrics
Obstacles can pose different magnitudes of threats to the EV based on their state (e.g., position, velocity, acceleration, and vehicle type). Perception errors associated with low-threat obstacles (e.g., an obstacle that is far away) may not be as critical as those for high-threat obstacles (Volk et al., 2020). Therefore, performance metrics for a perception system first need to quantify the potential threat. Missed detection or wrong classification of low-threat obstacles may be acceptable. On the other hand, erroneous perception/classification of obstacles results in erroneous predictions of future states of the obstacles. The repercussion would be erroneous motion planning that can be fatal in safety-of-life critical applications (Volk et al., 2020). Therefore, there is a need to incorporate the "threat level" of obstacles in defining the performance of a perception system. The metrics mentioned above do not incorporate the level of threat an obstacle poses to the EV. Those metrics are formulated to assess the quality of detection and association. However, erroneous perception of objects (obstacles) that pose a very low threat to EV safety may be permissible. On the contrary, instances of an inaccurate perception of objects that pose a very high risk to EV safety shall be minimized/eliminated. Such a process requires a comprehensive and objective description of the threat. Quantification of the level of threat of an obstacle is a leap forward in improving the safety of ADS. Algorithms may be enhanced to detect and track high-risk obstacles with greater accuracy. Furthermore, it may also be possible to assess the safety of the EV at any given instant. The same approach may be employed for analyzing data from other perception sensors such as LiDARs (Lang et al., 2019; Volk et al., 2020).

HUMAN-LIKENESS AND ADS
ADS and human-driven vehicles will coexist for several decades, forming a mixed traffic environment (Litman, 2020). ADS would receive public acceptance only if they exhibit driving behavior similar to that of humans (Guo et al., 2018). This is necessary to gain the trust of EV occupants and other road users. Cooperation and coordination between the vehicles in the mixed traffic are crucial to prevent deterioration of the safety and traffic flow parameters. Humans' driving behavior may be characterized by distributions of microscopic traffic parameters such as headways, relative velocities, and accelerations (Zhu et al., 2018). ADS should be developed to mimic human-like driving behavior, resulting in human-like distributions of microscopic traffic parameters. The performance metrics/indicators mentioned earlier do not address this need. Hence, they do not evaluate human-like driving behavior.  2017) incorporates proximity, braking time, and prevailing weather conditions. These are implicit attempts to comprehensively model human driving behavior and can be considered positive steps forward in developing human-like ADS. However, several factors can elicit a reaction in human drivers. These stimulus parameters include 1) velocity of EV, 2) velocity of surrounding obstacles, 3) proximity of an obstacle to the EV, 4) θ (0° ≤ θ ≤ 180°), the enclosed angle between the heading of the EV and the line joining the obstacle and the EV (θ represents the relative position of an obstacle with respect to the EV), 5) relative velocity, 6) relative acceleration, 7) lane offset of EV, 8) type of obstacle, 9) type of EV, and 10) weather conditions (e.g., rain, snow, mud, dust, smoke, day, night, etc.).
Such parameters may not independently influence driving behavior. It is not easy to model human-like driving behavior incorporating the interaction between multiple stimulus parameters. However, such interactions can be learned from observation. Human-driven trajectories (NGSIM, 2007; Krajewski et al., 2018) can be used for such a purpose. Multivariate cumulative distribution function(s) (CDF) can be constructed from those trajectories. Please refer to  for a detailed description and justification on using multivariate CDF to model human response. Figure 6 presents a five-dimensional CDF constructed using human-driven trajectories obtained from NGSIM (2007). The five dimensions are 1) θ, 2) relative velocity between EV and obstacle, 3) proximity between EV and obstacle, 4) type of EV, and 5) type of obstacle. CDF can be considered to indicate the EV's potential (or magnitude) to respond to a given situation. The darker the color, the greater is the potential to respond. Negative relative velocity indicates that the EV and the obstacle are moving toward each other. A sharp gradient in color can be observed when relative velocity turns negative, implying that humans are sensitive to relative velocity. Smaller proximities (smaller headways) also result in a greater response. As the θ value increases (θ = 0°, the obstacle is in front of the EV; θ = 90°, the obstacle is at a right angle to the EV; θ = 180°, the obstacle is behind the EV), the magnitude of response decreases. All these observations are very intuitive.
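The enclosed angle θ described above can be computed from planar positions and the EV heading. The following is a minimal sketch assuming a flat two-dimensional coordinate frame with the heading given in radians; the function and argument names are hypothetical and not taken from the cited studies.

```python
import math


def enclosed_angle_deg(ev_xy, ev_heading_rad, obstacle_xy):
    """Angle (0 to 180 degrees) between the EV heading and the line
    joining the EV to the obstacle: 0 is straight ahead, 180 is directly behind."""
    dx = obstacle_xy[0] - ev_xy[0]
    dy = obstacle_xy[1] - ev_xy[1]
    if dx == 0 and dy == 0:
        raise ValueError("EV and obstacle positions coincide")
    bearing = math.atan2(dy, dx)
    diff = bearing - ev_heading_rad
    # Wrap the difference into [-pi, pi] before taking the magnitude.
    diff = math.atan2(math.sin(diff), math.cos(diff))
    return abs(math.degrees(diff))
```

Because only the magnitude is kept, obstacles to the left and to the right of the EV at the same offset receive the same θ, which matches the 0 to 180 range used in the text.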
However, developing a nonlinear formulation to model human drivers' responses (with interacting parameters) is not a trivial task. Multivariate CDF could be a way forward in such cases. Note that not all the stimulus parameters mentioned above are used in the example presented in Figure 6, as the visual representation becomes difficult. In reality, there is no limit to the number of stimulus parameters used to construct multivariate CDF. But, as the number of stimulus parameters used increases, the sample size (human-driven trajectories) needed would increase exponentially, which is a limitation of this approach.
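As an illustration of how a multivariate CDF might be built from observations, the sketch below evaluates a plain empirical joint CDF, F(q) = P(X1 ≤ q1, ..., Xd ≤ qd), over a list of observed stimulus vectors. It is a simplified assumption-laden example: how each stimulus parameter should be oriented or transformed so that larger CDF values correspond to a stronger response is not specified here and is left to the user, and the function name is hypothetical.

```python
def empirical_multivariate_cdf(samples, query):
    """Fraction of observed stimulus vectors that are <= query in every dimension.

    samples: non-empty list of equal-length tuples (one per observation).
    query:   tuple of the same length.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    count = 0
    for sample in samples:
        if len(sample) != len(query):
            raise ValueError("sample and query dimensions differ")
        # A sample counts only if it is dominated by the query in all dimensions,
        # which is how interactions between the parameters are retained.
        if all(s <= q for s, q in zip(sample, query)):
            count += 1
    return count / len(samples)
```

Each added dimension makes the "all dimensions" condition harder to satisfy, which is one way to see why the required sample size grows quickly with the number of stimulus parameters.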
The following subsections present a direction to use this multivariate CDF approach to improve human-like perception and motion planning modules of an ADS.

Human-like Perception
Human-like threat perception is essential to model human-like driving behavior. Human drivers may perceive threats from surrounding obstacles based on several stimulus parameters mentioned earlier. The objective is to detect and track the obstacles that pose a high risk with greater accuracy. It may be acceptable to erroneously detect/track the obstacles that pose low or no risk to the safety of the EV (which might reduce computational requirements). Except for Volk et al. (2020), none of the existing performance metrics considers human-like threat perception. However, the threat quantification metric used by Volk et al. (2020) does not comprehensively consider all these stimulus parameters. Hence, there is room to incorporate all the stimulus parameters in quantifying the performance (human-likeness) of the perception module of ADS.
FIGURE 6 | An example of multivariate CDF (EV is Car; Obstacle is Bike).
Frontiers in Future Transportation | www.frontiersin.org November 2021 | Volume 2 | Article 759125

The multivariate CDF approach seems to be feasible to quantify threat levels of different obstacles by learning from human-driven trajectories. Such an approach also has the inherent ability to accommodate the interaction between multiple parameters. A nonlinear relationship between the perceived level of threat and the stimulus parameters can be constructed from the observed human-driven trajectories. Every detected obstacle can then be assigned a (human perceived) human-like threat level (which is a continuous value between 0 (very low-level threat) and 1 (very high-level threat)). This objective threat level can be used as a weighting factor in traditional or CLEAR metrics to quantify detection and tracking quality appropriately. Thus, false positives and false negatives of low-threat obstacles incur a smaller penalty than those of high-threat obstacles. Multivariate CDFs are constructed from the observed data. This implies that temporal and spatial variations in driving behaviors and the subsequent perception of threat can be dynamically adapted by human intervention. Threat levels can also be quantified at different operational environments and weather conditions. The perception model (and consequent driving behavior model) can be customized for a human driver.
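One possible way to use the threat level as a weighting factor is sketched below for precision and recall. This is an assumed weighting chosen for illustration, not the formulation of the CLEAR metrics or of Volk et al. (2020); the function name is hypothetical, and each input is a list of threat levels in [0, 1] for the corresponding detection outcomes.

```python
def threat_weighted_precision_recall(tp_threats, fp_threats, fn_threats):
    """Precision and recall where each outcome counts in proportion to its threat.

    tp_threats: threat levels of correctly detected obstacles.
    fp_threats: threat levels attributed to false detections.
    fn_threats: threat levels of missed obstacles.
    """
    for group in (tp_threats, fp_threats, fn_threats):
        for threat in group:
            if not 0.0 <= threat <= 1.0:
                raise ValueError("threat levels must lie in [0, 1]")
    w_tp = sum(tp_threats)
    w_fp = sum(fp_threats)
    w_fn = sum(fn_threats)
    # Guard against empty denominators (illustrative convention: return 0.0).
    precision = w_tp / (w_tp + w_fp) if (w_tp + w_fp) > 0 else 0.0
    recall = w_tp / (w_tp + w_fn) if (w_tp + w_fn) > 0 else 0.0
    return precision, recall
```

With this weighting, missing one obstacle with a threat level of 0.9 lowers recall far more than missing one with a threat level of 0.2, which is the behavior the paragraph above argues for.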

Human-like Driving Behavior
A trajectory is the time series of states/actions. As mentioned earlier, human driving behavior is characterized by several microscopic traffic parameters. Some metrics/indicators are available to quantify the human-likeness of a generated trajectory. Human-driven trajectories are necessary for comparison. The initial position of one of the human-driven trajectories (HDT) is considered to be that of an EV. The motion of all the surrounding obstacles is replayed from human-driven trajectories. The movement of the EV is determined according to a policy/model, which results in a model predicted trajectory (MPT). The human-likeness of the generated trajectory can then be quantified by comparing HDT and MPT. Comparison can happen for variables such as longitudinal positions, lateral positions, lateral speeds, longitudinal speeds, lateral accelerations, longitudinal accelerations, headways, and lane offsets. In general, the metric can be root-weighted squared error (Kuefler et al., 2017): where m is the number of trajectories used, and v is any of the above-mentioned variables under consideration. Multiple metrics may be necessary for targeted improvement of specific parts of the ADS. The longitudinal error may be obtained as (Ossen and Hoogendoorn, 2011; Zhang et al., 2019): The lateral error may be computed as (Kesting and Treiber, 2008; Zhang et al., 2019): Model error, which is a combination of lateral error and longitudinal error, can be determined as (Zhang et al., 2019): where x is the lateral position, y is the longitudinal position, v is the longitudinal speed, and G indicates the gap.
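A simplified root-weighted squared error between HDT and MPT values can be sketched as follows. The sketch assumes one model-predicted value per human-driven trajectory at a chosen time horizon, so that the error is averaged over the m trajectories only; the formulation in Kuefler et al. (2017) may differ in details such as averaging over several simulated rollouts. The function name is hypothetical.

```python
import math


def root_weighted_squared_error(hdt_values, mpt_values):
    """RWSE of a variable v (e.g., speed or lane offset) over m trajectories.

    hdt_values[i] is the human-driven value and mpt_values[i] the
    model-predicted value for trajectory i at the same time horizon.
    """
    if len(hdt_values) != len(mpt_values) or not hdt_values:
        raise ValueError("inputs must be non-empty and of equal length")
    m = len(hdt_values)
    squared_error = sum((h - p) ** 2 for h, p in zip(hdt_values, mpt_values))
    return math.sqrt(squared_error / m)
```

Evaluating the function separately for longitudinal and lateral variables gives the kind of targeted, per-component view that the text recommends.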

ADVANTAGES AND DISADVANTAGES OF PERFORMANCE METRICS
The metrics used for performance evaluation of environment perception and motion planning are provided in this section. Objectivity of a performance metric is a desirable quality. A metric is said to be objective when it does not contain any subjective term. Performance would be quantified based on measurements/computations that are not subjective. Table 2 summarizes the advantages and disadvantages of metrics used to quantify the performance of environment perception. Table 3 provides the summary of metrics used for the evaluation of motion planning algorithms.

Framework for Safety Regulation of ADS
In June 2021, NHTSA issued a standing general order mandating ADS developers/operators to report incidents (crashes) (NHTSA, 2021). The order seeks the following information pertaining to an incident: 1) EV information (e.g., model, make, and mileage), 2) incident information (date, time), 3) incident scene (location, pavement characteristics, speed limit, lighting, and weather conditions), 4) crash description (e.g., injury severity, precrash speed, etc.), and 5) postcrash information. However, the scenario leading to the crash is not requested. Precrash information, or the states of traffic participants that resulted in crashes/collisions/incidents, is vital to identify the flaws in the existing system.

Reasons for Crashes/Incidents/ Disengagements
A crash is a result of the failure of one or more of the four basic modules of an ADS. This paper's scope is limited to the examination of the perception and motion planning modules (as the likelihood of failure of other modules is much smaller). Failure of the perception module (erroneous scene abstraction) can result in improper motion planning. However, the erroneous motion of an EV may not always result in collisions as the other human-driven entities respond (react) to the actions of the EV. But crashes can happen due to a combination of imperfect environment perception and motion planning, as shown in Table 4. Also, erroneous environment perception or motion planning for a short duration may not result in a crash. The reaction of other traffic entities may prevent incidents. Furthermore, the future states of the EV (and of the surrounding traffic entities) are sensitive to the current (initial and previous) state. This butterfly effect may either dampen or magnify the safety risk posed by improper environment perception and motion planning. It is a complex phenomenon to analyze, and significant efforts must be made in this aspect to improve ADS. If a crash or disengagement occurs, 1) it could be solely attributed to improper motion planning, 2) it could be solely attributed to erroneous environment perception, or 3) it could be the result of imperfect environment perception and imperfect motion planning.

Framework for Collecting Precrash Scenarios
Metric | Description | Advantages | Drawbacks
Traditional r (e.g., Aly et al. (2016); Powers (2020)

Incident reporting is mandatory for ADS developers/operators (NHTSA, 2021). However, precrash information is not being collected by NHTSA. The sequence of precrash events/states may hold valuable lessons in improving ADS. It is necessary to identify the specific cases resulting in crashes as it helps in the targeted development of ADS. The first step in this direction is to understand the "scenarios" culminating in an incident. "Scenarios" are a sequence of states (e.g., position, velocity, and acceleration) of the EV and that of the surrounding traffic participants. Future states of the EV (and that of the other traffic entities) are sensitive to initial states. State evolution is a complex phenomenon and complicated to model. More specifically, the scenarios that culminate in crashes are infrequent but critical. Human-driving behavior under safe ("normal") driving conditions is extensively studied and modeled. Comprehensive simulation models are available to model the driving behavior under normal conditions. Such models can be calibrated and validated with experimental/empirical data. However, modeling human-driving behavior under "precarious" driving conditions presents three significant challenges: 1) any attempt to model such precarious driving conditions (and subsequent driving behavior) cannot be justified by empirical validation, 2) precarious driving conditions are scarce and present a problem of "class-imbalance" (Jeong et al., 2018; Elamrani Abou Elassad et al., 2020), and 3) behavior of multiple agents under precarious (extreme) scenario is challenging to hypothesize, let alone model it.
Class-imbalance exists when instances of one (or a few) class severely outnumber those of the other classes (Vluymans, 2019). In the present context, the two classes can be 1) normal scenario and 2) precarious scenario, where the former outnumbers the latter. If the under-represented scenario is of major concern (as in the present study), metrics shall be able to appropriately quantify the performance. Approaches to mitigate the issue of imbalance (e.g., synthetic minority oversampling technique, adaptive synthetic sampling) require the generation of precarious scenarios (Vluymans, 2019; Elamrani Abou Elassad et al., 2020; Fujiwara et al., 2020). Simply put, simulation environments may not mimic precrash scenarios due to complexities in comprehending and modeling multivariate multi-agent interactions. Hence, synthesizing under-represented scenarios is extremely difficult.
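The effect of class-imbalance on a performance metric can be illustrated with a small sketch that compares accuracy against precision, recall, and F1 for the minority class. The class labels "normal" and "precarious" follow the text; the function name and the choice of these particular metrics are assumptions made for illustration.

```python
def imbalance_aware_scores(actual, predicted, positive="precarious"):
    """Accuracy plus minority-class precision, recall, and F1 for two label lists."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    # Zero denominators are mapped to 0.0 (illustrative convention).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

A classifier that labels every one of 98 normal and 2 precarious scenarios as normal scores 0.98 on accuracy but 0 on recall for the precarious class, which is the kind of false sense of performance that an imbalance-aware metric is meant to expose.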
Recognizing (and predicting) the transition from normal to precarious driving scenarios is extremely important in ensuring the safety of ADS. The vital task of comprehending (and subsequently modeling/synthesizing) precarious scenarios can be initiated from empirical observation. Hence, it is extremely important to collect and analyze precrash scenarios. Precrash scenario simulation can be enhanced using such a dataset, and ADS advancement would follow. Figure 7 furnishes a framework for the safety regulatory authority to collect precrash scenarios, along with the possible usage of the collected database.
ADS developers may be mandated to record the following data: Sensor Data (S): Raw data from perception sensors such as cameras, LiDARs, and RADARs may be recorded. Recordings

Metrics | Description | Advantages | Drawbacks
Traditional TTC (e.g., Minderhoud and Bovy (2001); Wang et al., 2021) | Time to an impending collision | Easy to interpret | Response of the other traffic entities is ignored
- Mahmud et al. (2017); Wang et al., 2021) | Duration for which gap maintained was lesser than a threshold

Metrics used for performance evaluation of the environment perception module and the motion planning module vary between the ADS developers/operators. Appropriateness of performance metrics in the precrash scenario is a research question to be assessed. Reporting of precrash scenarios can help assess the quality/appropriateness of different performance metrics. Furthermore, scenario-specific (dynamic) performance metrics may be conceptualized.
ADS developers may be asked to anonymize and submit S, E, and M datasets for a short period (say, approximately 5 min) leading to an incident. Not all ADS employ the same sensor configuration. Also, there can be a variety of sensor fusion and environment perception algorithms. Hence, information about the perceived environment is also necessary. Last, the planned motion of the EV is necessary to evaluate the correctness of the planned motion.
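Retaining only the most recent window of S, E, and M data can be sketched with a rolling buffer. The example below is a hypothetical illustration and not a prescribed recording format: the class and field names are invented, E is read here as the perceived environment and M as the planned motion, and the 300 s default mirrors the approximately 5 min window suggested above.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any


@dataclass
class Frame:
    timestamp_s: float
    sensor: Any        # S: raw sensor data (placeholder type)
    environment: Any   # E: perceived environment (assumed meaning)
    motion_plan: Any   # M: planned motion of the EV (assumed meaning)


class PrecrashRecorder:
    """Keeps only the frames that fall within the last window_s seconds."""

    def __init__(self, window_s=300.0):
        self.window_s = window_s
        self._frames = deque()

    def record(self, frame):
        self._frames.append(frame)
        cutoff = frame.timestamp_s - self.window_s
        # Drop frames older than the retention window.
        while self._frames and self._frames[0].timestamp_s < cutoff:
            self._frames.popleft()

    def snapshot(self):
        """Return the retained frames, oldest first, e.g., when an incident occurs."""
        return list(self._frames)
```

Calling `snapshot()` at the moment of an incident would yield the sequence to be anonymized and submitted; anonymization itself is outside the scope of this sketch.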
FIGURE 7 | Framework for collecting precrash sequences and its analysis.

The responsibilities of regulatory authorities could include:
1) Collection and storing of precrash sequences: The pressing need for precrash sequences (S, E, M) is described above. Regulatory authorities should aim at collecting and storing the same.
2) Modeling the precarious scenarios: Hypothesizing the driving behavior of the EVs and that of the involved traffic entities in precarious scenarios based on empirical observation is an important task. Modeling involves both calibration and validation using empirical sequences. Classical driving behavior models (e.g., Treiber and Kesting, 2013; Kala, 2016) may not comprehensively address both the normal and the precarious sequences. Two separate models may be necessary (or different calibration parameters) to address the two distinct sequences. Alternatively, machine learning approaches that are gaining prominence may be employed to learn the precarious scenario, which is a time series of states (e.g., Kuderer et al., 2015; Gu et al., 2016; Paden et al., 2016; Rehder et al., 2017; Mohanan and Salgoankar, 2018; Schwarting et al., 2018; Wang et al., 2018; Zyner et al., 2018; Zhang et al., 2019). Performance metrics suitable for the imbalanced problem are to be used for the development of such models.
3) Prediction of transition from normal to precarious scenario: Once the capability to model the precarious situation is achieved, methods to determine the state transition from normal to precarious scenario (and eventually forecast) are to be developed. Such forecasting could be used to prevent an incident. One possible way to achieve this goal is by developing metrics/indicators considering the time series of states (of EV and that of the surrounding entities). Such metrics would account for both spatial and temporal variation in the states.
4) Evaluation of existing performance metrics under precarious scenarios: The quality of existing performance metrics is to be assessed on the dataset of precarious sequences. This is to ascertain that the performance metrics/indicators would not suffer from the problem of class-imbalance.
5) Generation of a comprehensive database of precarious scenarios: Precarious scenarios are very rare, and the reported scenarios would not be comprehensive. As such, it is necessary to synthesize and build up a database of precarious scenarios. Such a synthesized database is a precious source of information toward targeted learning. Hence, the same may be shared with the ADS developers/operators to accelerate the development of ADS.
6) Assessment of safety performance of different ADS: The database of synthetic precarious trajectories could be used to assess ADS of different developers/operators.
Suitable performance metric(s) can then be used to assess the mapping between S and E, which is an indicator of the performance of the perception module. Furthermore, the mapping between E and M can be analyzed to quantify the correctness of a motion planner.
States of the obstacles can be replayed from the synthetic dataset, and the EV can be made to navigate in precarious scenarios. The database (and the metrics) can also be used to evaluate the individual improvement of either the perception module or motion planning module.
Such an approach helps in targeted learning. The configuration of sensors and the type of algorithms (perception and motion planning) ideal for enhancing ADS safety can be determined. Such a collaboration of ADS developers can accelerate the development of ADS. This database of critical scenarios can be used to identify performance metrics that give a false sense of superior performance (a crucial aspect of a performance metric). The quality of different performance metrics under different critical scenarios can be analyzed, with the potential to recognize scenario-specific performance metrics. Last, the repository would also contain human-driving behaviors (trajectories) leading to incidents. This information may be used to quantify the driving performance of drivers and further predict the onset of a precarious situation (and intervene).

SUMMARY AND CONCLUSION
Automated Driving Systems (ADS) will soon become prevalent and start sharing the road infrastructure with human drivers (leading to a mixed traffic environment). Safety regulatory authorities are therefore trying to formulate suitable performance metrics to quantify the safety of ADS. At this juncture, it is highly appropriate to review the literature on metrics used to quantify the performance of ADS.
The present article limits its scope to reviewing the metrics related to environment perception and motion planning modules of ADS. It is recognized that the existing metrics on environment perception are formulated to quantify the detection and tracking performance. Usage of such metrics might result in a driving behavior dissimilar to that of human drivers. Such scenarios are unacceptable in a mixed environment. Human-like environment perception and motion planning are therefore essential.
To address this issue, a method to quantify the threat an obstacle poses to the safety of ADS is presented. This novel approach is capable of modeling threats as perceived by human drivers. Human-perceived threats are due to several stimulus parameters such as 1) velocity of subject vehicle, 2) velocity of surrounding obstacles, 3) proximity of an obstacle to the EV, 4) θ, which represents the relative position of an obstacle with respect to the EV, 5) relative velocity, 6) relative acceleration, 7) lane offset of EV, 8) type of obstacle, 9) type of EV, and 10) weather conditions (e.g., rain, snow, mud, dust, smoke, day, night etc.). There may be complex interactions between these stimulus parameters. Multivariate cumulative distributions of the stimulus parameters can be appropriately used to quantify human-like threats.
Imperfect perception of obstacles posing low-level threats may not be a severe issue. On the other hand, it can be fatal to erroneously perceive obstacles that pose a greater risk. The human-like threat perception model suggested in the article can be used to identify threat levels and, consequently, develop a human-like environment perception algorithm. The metrics necessary to quantify the human-likeness of the motion planning algorithm are also presented.
Additionally, a framework is provided to suggest desirable changes to the incident reporting scheme. Currently, ADS operators/developers are mandated to report postcrash information. As thoroughly described, there is an immense potential for utilization of precrash scenarios. It is, hence, desirable to collect the same along with postcrash information. The framework focuses on collecting and managing the information regarding the scenarios that result in incidents. The states of subject vehicles and the obstacles for a small duration before the incident are necessary. Such a database of edge cases, collected from all the ADS developers, can be used to quantify and monitor the performance of environment perception and motion planning modules. The framework also outlines the different ways in which the repository of precrash scenarios could be used. The repository would help in accelerating the development of ADS.
Future research can focus on the development of human-like perception algorithms and human-like motion planning algorithms. A human-like threat level quantification method provided in this article may be employed for such a purpose. Furthermore, it is required to identify traits of the metrics that give a false sense of superior performance. Extensive research is necessary to appropriately model and evaluate the precrash scenarios. Such a study would allow for prediction (and mitigation) of crashes. Safety regulating authorities could objectively and comprehensively assess ADS based on such models.
Redundancy and integrity monitoring are necessary to prevent catastrophe in the event of an individual sensor (or system) failure. Future research can also focus on the conception of performance metrics where system redundancy and integrity are quantified.

AUTHOR CONTRIBUTIONS
MNS and BM: Study conception and design. MNS and BM: Draft manuscript preparation. BM: supervision. All authors reviewed and approved the final version of the manuscript.