Motion Smoothness Metrics for Cannulation Skill Assessment: What Factors Matter?

Medical training simulators have the potential to provide remote and automated assessment of skill vital for medical training. Consequently, there is a need to develop “smart” training devices with robust metrics that can quantify clinical skills for effective training and self-assessment. Recently, metrics that quantify motion smoothness such as log dimensionless jerk (LDLJ) and spectral arc length (SPARC) are increasingly being applied in medical simulators. However, two key questions remain about the efficacy of such metrics: how do these metrics relate to clinical skill, and how to best compute these metrics from sensor data and relate them with similar metrics? This study addresses these questions in the context of hemodialysis cannulation by enrolling 52 clinicians who performed cannulation in a simulated arteriovenous (AV) fistula. For clinical skill, results demonstrate that the objective outcome metric flash ratio (FR), developed to measure the quality of task completion, outperformed traditional skill indicator metrics (years of experience and global rating sheet scores). For computing motion smoothness metrics for skill assessment, we observed that the lowest amount of smoothing could result in unreliable metrics. Furthermore, the relative efficacy of motion smoothness metrics when compared with other process metrics in correlating with skill was similar for FR, the most accurate measure of skill. These results provide guidance for the computation and use of motion-based metrics for clinical skill assessment, including utilizing objective outcome metrics as ideal measures for quantifying skill.


INTRODUCTION
The health of populations is directly related to a well-trained healthcare workforce; therefore, attention must be given to training our clinical professionals efficiently and safely. There is mounting evidence that training not only results in better clinical outcomes but also decreases costs and procedural times [Farnworth et al. (2001)]. To facilitate training, simulators have been gaining increasing popularity in medical education due to their ability to quantify skill in a simulated environment while providing feedback on performance. Further, simulators often enable self-paced learning via metrics that track the performance of trainees over time. In recent years, simulator training has demonstrated positive results in arthroscopy, laparoscopy, and endovascular surgery [Rosen (2008), Hafford et al. (2013), Duran et al. (2015), Goyal et al. (2016)]. Many simulators used in these studies provided skill assessment objectively (i.e., through sensor-based metrics) and not the "subjective" assessment of a trainer. Simulator-based training does not require live animal models or expert trainers and can be done remotely. Also, studies have brought to attention the fact that novice medical students can tend to overestimate their ability [MacDonald et al. (2003)]. Thus, tools must be provided that accurately and reliably help assess a trainee's skill and move students towards proficiency.
In the last few decades, numerous studies have reported on various simulator-based metrics used to distinguish the skilled performance of a simulated medical procedure, among which time (T) and path length (PL), the length traversed by a medical instrument during task performance, are frequently used. For instance, significant differences in completion time have been reported after training on a simulator [Judkins et al. (2009)] and after 1 h of training in a clinical environment [Pedowitz et al. (2002)]. With current simulators featuring state-of-the-art sensors, numerous types of data can be recorded and used for computing metrics. In a recent study using an arthroscopic "box" simulator, time and errors were used to successfully differentiate medical students from novices in two tasks [Braman et al. (2015)], whereas the Fundamentals of Arthroscopic Surgery Training (FAST) saw similar results with the same metrics across four tasks. PL is used in various simulators to determine significant differences among experts and nonexperts [Pedowitz et al. (2002), Jacobsen et al. (2015)]. Metrics such as PL and T, while useful for skill assessment, are somewhat rudimentary measures of skill since they focus on basic aspects of clinical skill.
Dexterity has long been regarded as one of the hallmarks of a good clinician since many medical procedures involve precise and deft handling of instruments. Towards quantifying dexterity, motion smoothness metrics have been recently applied in medical simulators. The primary facet that these metrics capture is the "smoothness" of a motion while performing the task, leading to an understanding of in-the-process task performance. Since they were first proposed about two decades ago with rudimentary metrics such as the number of peaks in the velocity profile (Pks) or various jerk formulations, motion smoothness metrics have evolved in their robustness for skill assessment. Currently, two motion smoothness metrics, log dimensionless jerk (LDLJ) and spectral arc length (SPARC), are being explored for quantifying medical skills. One difficulty encountered by researchers seeking to incorporate these metrics is their dependence on computing derivatives from often noisy sensor data. Depending on how derivatives are computed, noise present in the raw sensor data may be heavily magnified with each order of derivative. To remedy erratic data, smoothing is often performed on sensor data. However, "oversmoothing" the data can filter out important motion features that may be necessary for skill assessment. Therefore, it is vital to examine the effect of smoothing parameters on the computation of motion smoothness metrics and their relative efficacy in quantifying skill in comparison with "traditional" metrics (e.g., T and PL). This work seeks to contribute towards a greater understanding of these two issues.
The context of the current work is evaluating the skill of cannulating a simulated arteriovenous (AV) fistula, a vascular access into the patient's bloodstream, for hemodialysis. This is an operation that is done in dialysis clinics around the world by nurses or patient care technicians tens of times each week. Though this is a relatively simple procedure involving inserting a 14-17 gauge needle into a blood vessel, the quality of cannulation is extremely important for patient health as multiple failures and attempts can lead to several complications such as hematoma, infection, and aneurysm formation that could lead to eventual death [Van Loon et al. (2009)]. One of the main reasons for miscannulation is infiltration: where the clinician punctures through the AV fistula and causes blood to leak out [Brouwer (2011)]. A frequent result of severe miscannulation is a need for surgical treatment of the vascular access, presenting increased risk to the patient. Hospital readmissions can lead to exposure of patients to pathogens that present a greater risk of mortality because of patient comorbidities. Our team has created a simulator for practicing cannulation for hemodialysis in a safe environment with the ability to provide objective metrics based on motion, force, and time data [Zhang et al. (2019), Liu et al. (2020a), Liu et al. (2020b)]. In this study, data from nurses and patient care technicians with various degrees of experience are analyzed for performance characteristics on the cannulation simulator, which yields insights into what constitutes skilled cannulation for hemodialysis.
In addition to systematically quantifying the strength of motion smoothness metrics for cannulation skill, another key contribution of this study is the use of an objective metric to measure the outcome of cannulation. Our simulator features relevant hardware and software to track whether the needle is inserted into the fistula accurately (i.e., for blood withdrawal) and if, during the process, any degree of infiltration was encountered [Liu et al. (2020b)]. By using an outcome metric that is objective regarding the success of the task, we reduce or eliminate the need for other, more "traditional" measures used as surrogates for skill that may inaccurately appraise skill. Two such commonly used measures are years of experience (or the number of cases performed, e.g., in the case of surgeons) and the rating of experts regarding the performance of a task/procedure. Both of these metrics have inherent limitations. In the case of clinical experience, when this measure is used as a surrogate for skill, the implicit assumption is that greater experience results in improved skill; however, this may not be the case as some studies suggest [Hung et al. (2019), ]. By classifying expertise based on clinical experience, we may be inaccurately estimating skill. The other metric commonly used to measure skill is a Likert-scale rating assessed by an expert who witnesses the task being performed. While there is value to these evaluations by experts, the limitations of such a method include the inherent subjectivity of raters and the fact that raters often give one "global" rating for the whole task. In this work, we not only introduce a new outcome metric but also provide insights into how this metric compares with both traditional measures used to gauge skill.
In various specialized as well as everyday tasks, precise and controlled motion is required for successful execution. Some domains where this is particularly the case are sports, surgery, and rehabilitation. To measure skilled performance in these cases, one cannot simply rely on metrics like time to completion or economy of motion. While these have proven useful for assessing skill in many cases, they are not designed to quantify the smoothness (or lack thereof) of movement since they do not measure in-the-process motion. As an example, PL, a commonly used metric, only measures the total length of the path traversed during a motion; it does not quantify how smooth that motion was. Some studies demonstrate moderate utility for Pks to examine smoothness of motion, with limited usage due to lack of generalizability and robustness [Balasubramanian et al. (2012), Balasubramanian et al. (2015), Estrada et al. (2016), Gulde and Hermsdörfer (2018a)]. In the following subsections, we detail more advanced formulations of motion smoothness metrics that are based on higher-order derivatives of position data. It should be noted that these metrics have evolved in their complexity and applicability since they were first devised for use in rehabilitative treatments [Flash and Hogan (1985)]. We discuss the particular strengths and weaknesses of each metric along with their use cases.

Jerk
In an ideal, smooth motion, acceleration would not have any discontinuities, as could be determined by the derivative of acceleration, jerk. This notion has served as the key idea for quantifying motion smoothness. However, computing "pure" jerk is too inconsistent to be used as a measure of motion smoothness [Hogan and Sternad (2009)]. For example, some studies did not detect significant differences in jerk values among motor movements in unhealthy and healthy patients [Goldvasser et al. (2001), Wininger et al. (2009)], while other studies demonstrate otherwise [Teulings et al. (1997), Smith et al. (2000), Rohrer et al. (2002)]. From these studies, it was observed that jerk should be normalized as it depends heavily on movement duration and range of motion and that minimizing jerk is essential for smooth motion quantification. Flash and Hogan (1985) proposed minimizing the cost function of jerk by squaring and integrating the value as a viable metric for motion smoothness. This measure, known as integrated square jerk, and others based on this measure were used in several studies to quantify smooth motion [Smith et al. (2000), Goldvasser et al. (2001), Rohrer et al. (2002), Wininger et al. (2009)]. Eventually, these metrics were termed dimensioned jerk metrics since they rely on the duration and amplitude of movement. Later, Hogan and Sternad (2009) created a new metric that eliminated such reliance. This metric, known as dimensionless jerk, accounted for measuring the intermittency in motion regardless of its duration or amplitude. Intermittency in a discrete motion can arise from the lack of controlled movement, characterized by a period of deceleration preceding a point of acceleration, or can be due to finite periods of no motion from uncertainty.
Balasubramanian and colleagues noted that for a motion smoothness metric to be valid, it must have the following features: it must be dimensionless, monotonically responsive to motion, sensitive to changes in movement, and feasible for computation [Balasubramanian et al. (2012)].
Dimensionless jerk (DLJ) has been used in several recent studies to assess clinical skills. In one recent study, participants performed a pegboard placement task, with DLJ able to differentiate among surgeons and nonsurgeons as well as among the tasks tested [Ghasemloonia et al. (2017)]. Further, DLJ was also employed in a simulated shoulder arthroscopy test, where significant differences between experts and novices were evidenced in DLJ values [Kholinne et al. (2018)]. This metric has also been used in a Fundamentals of Endovascular Skills (FEVS) trainer [Estrada et al. (2016), O'Malley et al. (2019)], where it differentiated between novice and expert skill. One notable limitation, however, is that the values for DLJ vary widely, differing by the thousands between users in some cases. To eliminate this wide variability, the use of the natural log of dimensionless jerk (LDLJ) has been proposed [Balasubramanian et al. (2012), Balasubramanian et al. (2015), Gulde and Hermsdörfer (2018b), Melendez-Calderon et al. (2020)]. Studies report more robust measurements and better sensitivity of this metric to the physiological hand motion range. Balasubramanian and others emphasize that LDLJ is often affected by signal noise since calculating jerk involves computing the third derivative [Balasubramanian et al. (2012)]. This observation leads to one of the research questions addressed in this study.
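To make the normalization concrete, the following is a minimal sketch of LDLJ computed from a sampled velocity profile. It assumes a commonly cited velocity-based form, in which integrated squared jerk is scaled by the cube of the movement duration and the inverse square of the peak speed before taking the negative natural log; this is an assumption for illustration and not necessarily the exact formulation used in this study. The function name `ldlj`, the finite-difference derivatives, and the synthetic test signals are all hypothetical.

```python
import numpy as np


def ldlj(velocity, fs):
    """Log dimensionless jerk from velocity samples (illustrative form).

    velocity: array of shape (n_samples,) or (n_samples, n_dims).
    fs: sampling rate in Hz.
    Assumed form: -ln( T**3 / v_peak**2 * integral(|d2v/dt2|**2 dt) ).
    """
    v = np.asarray(velocity, dtype=float)
    if v.ndim == 1:
        v = v[:, None]
    dt = 1.0 / fs
    duration = (len(v) - 1) * dt
    # Jerk is the second derivative of velocity (third derivative of position).
    jerk = np.gradient(np.gradient(v, dt, axis=0), dt, axis=0)
    v_peak = np.linalg.norm(v, axis=1).max()
    # Integrate the squared jerk magnitude with a simple rectangle rule.
    integral = np.sum(jerk ** 2) * dt
    return -np.log(duration ** 3 / v_peak ** 2 * integral)


# Synthetic example: a bell-shaped speed profile and a tremulous variant.
t = np.linspace(0.0, 1.0, 101)                          # 1 s sampled at 100 Hz
smooth = 30 * t ** 2 * (1 - t) ** 2                     # smooth, single-peaked profile
jerky = smooth * (1 + 0.6 * np.sin(2 * np.pi * 6 * t))  # added 6 Hz oscillation

smooth_ldlj = ldlj(smooth, fs=100.0)
jerky_ldlj = ldlj(jerky, fs=100.0)
print(smooth_ldlj, jerky_ldlj)
```

Under this form, the smoother profile yields a value closer to zero, while the oscillating profile yields a more negative value. Because the result depends on a second derivative of velocity, it is sensitive to how noise in the underlying signal is handled.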

Spectral Arc Length
In 2012, a new metric to quantify motion smoothness that was more robust to noise was formulated, known as spectral arc length (SPARC) [Balasubramanian et al. (2012), Balasubramanian et al. (2015)]. The metric is derived from the arc length of the amplitude of the frequency-normalized Fourier magnitude spectrum of the velocity profile. This metric is based on the observation that smooth hand movements will yield small magnitudes of low-frequency profiles, whereas "unsmooth" movements will yield large magnitudes of different higher-frequency profiles. The larger the magnitudes of different frequency movements are, the more the arc length of the profile increases. This idea is analogous to minimizing the cost function of jerk. Since this metric relies on analyzing motion via the frequency domain, it is more robust to noise and sensitive to changes in smaller movements [Balasubramanian et al. (2012)]. SPARC is being increasingly used to measure skilled or smooth motion, including in the previously mentioned FEVS studies [Duran et al. (2015), O'Malley et al. (2019), Belvroy et al. (2020)], in which it consistently demonstrates strong correlations to skill between experts and novices. In this study, we systematically compare the efficacy of both SPARC and LDLJ for quantifying cannulation skill, another clinical skill that requires smooth motion. A summarized list of metrics is shown in Table 1.

[Table 1: summary of metrics, with columns "Metric" and "Equation". Surviving entries: Time; Peaks in the velocity profile (Σ v maxima).]

Frontiers in Robotics and AI | www.frontiersin.org | April 2021 | Volume 8 | Article 625003
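The arc-length idea described above can be sketched in a few lines. The example below is a simplified illustration rather than the implementation used in this study: the zero-padding level, the 10 Hz cutoff, and the 0.05 amplitude threshold are assumed values chosen for illustration, and the function name `sparc` and the synthetic speed profiles are hypothetical.

```python
import numpy as np


def sparc(speed, fs, pad_level=4, fc=10.0, amp_threshold=0.05):
    """Spectral arc length of a speed profile (illustrative parameters)."""
    speed = np.asarray(speed, dtype=float)
    # Zero-pad to a power of two for a finer frequency grid.
    nfft = int(2 ** (np.ceil(np.log2(len(speed))) + pad_level))
    freqs = np.arange(nfft) * fs / nfft
    mag = np.abs(np.fft.fft(speed, nfft))
    mag = mag / mag.max()  # amplitude-normalized magnitude spectrum
    # Keep only the band below the fixed cutoff frequency.
    band = freqs <= fc
    f_sel, m_sel = freqs[band], mag[band]
    # Trim to the span between the first and last bins above the threshold.
    above = np.nonzero(m_sel >= amp_threshold)[0]
    f_sel = f_sel[above[0]:above[-1] + 1]
    m_sel = m_sel[above[0]:above[-1] + 1]
    # Arc length of the spectrum with frequency rescaled to [0, 1];
    # negated so that smoother movements give values closer to zero.
    df = np.diff(f_sel) / (f_sel[-1] - f_sel[0])
    dm = np.diff(m_sel)
    return -float(np.sum(np.sqrt(df ** 2 + dm ** 2)))


# Synthetic example: a bell-shaped speed profile and a tremulous variant.
t = np.linspace(0.0, 1.0, 101)                          # 1 s sampled at 100 Hz
smooth = 30 * t ** 2 * (1 - t) ** 2
jerky = smooth * (1 + 0.6 * np.sin(2 * np.pi * 6 * t))  # added 6 Hz oscillation

smooth_sparc = sparc(smooth, fs=100.0)
jerky_sparc = sparc(jerky, fs=100.0)
print(smooth_sparc, jerky_sparc)
```

The oscillating profile adds spectral content around 6 Hz, which lengthens the spectrum's arc and produces a more negative score. No time-domain derivative is required, which is consistent with the robustness to noise noted above.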

MOTIVATION OF STUDY
In this study, we examine three research questions relevant to the use of motion smoothness metrics for clinical skill assessment and training. We present each question, followed by a rationale motivating the question.
To summarize this section, the three research questions addressed in this work are as follows:
• Which skill indicator metric best accounts for dexterity measures?
We define two types of metrics in this study: skill indicator metrics and process metrics. Skill indicator metrics [e.g., global rating sheet (GRS)] seek to appraise the skill of clinicians while process metrics (e.g., LDLJ) quantify certain characteristics of task performance. To better differentiate between the two, process metrics are abbreviated with an italicized font, whereas skill indicator metrics are abbreviated with a bold font. We introduce a novel metric for objectively assessing skill by determining the degree of success in cannulation task outcome. However, of the three skill indicator metrics used in this study (GRS, clinical experience, and task outcome), which one best characterizes clinical skill? The answer to this question has important implications for the way clinical skill is classified in studies. Further, many metrics are task-specific by definition. That is, they are effective to the degree that they accurately reflect the task. Motion smoothness measures are known to be task-specific: motion smoothness values indicative of skill are not equivalent across different tasks [Balasubramanian et al. (2012)]. Thus, we examine which of the three skill indicator metrics is best correlated with the suite of process measures in this study. Another aspect of how skill is classified in our study needs to be mentioned here. In many studies, skill is binarized as either expert or novice, greatly simplifying the notion of skill. Our work treats skill as a continuum wherein each skill indicator metric can take on a range of values, thus enabling a fine-grained classification of skill.
• Does the degree of smoothing of sensor data significantly affect the computation of motion smoothness metrics?
Studies that involve assessing or training clinical skills on simulators most often use data from sensors to provide feedback to trainees. To extract motion smoothness metrics, however, the inherent noise in sensor data poses a problem while computing higher-order derivatives. As such, it is important to understand the role of data smoothing (both the type and the degree of smoothing used) in the computation of metrics. To our knowledge, no study thus far has presented the effect of the degree of smoothing for measuring clinical skill. A related study demonstrated that, for computing SPARC and PL, optimal filtering was obtained for a specific window span range and for a specific type of filter [Gulde and Hermsdörfer (2018a)]. This study, however, did not include computation of LDLJ (or related metrics) wherein obtaining stable higher-order derivatives of position is critical for accuracy and interpretability. The Savitzky-Golay (SG) filter is an established and widely used method for derivative estimation from noisy sensor data. The SG filter yields substantially better results than discrete finite difference methods for computing higher-order derivatives [Ahnert and Abel (2007)]. Thus, this study explores degrees of SG smoothing for metric computation. Figure 1 illustrates the undesirable effect of noise on the computation of higher-order derivatives, motivating the need for this study.
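The noise amplification described above can be illustrated with a short numerical sketch. The example estimates the third derivative (jerk) of a noisy synthetic position signal in three ways: repeated finite differences, a third-order SG filter with a 5-sample window, and the same filter with a 31-sample window. The test signal, the noise level, and the 31-sample span are assumptions chosen for illustration; `savgol_derivative` is a hypothetical helper that builds SG weights from a least-squares polynomial fit and is not the code used in this study.

```python
from math import factorial

import numpy as np


def savgol_derivative(x, window, polyorder, deriv, dt):
    """Savitzky-Golay estimate of the deriv-th derivative (interior samples only)."""
    half = window // 2
    offsets = np.arange(-half, half + 1, dtype=float)
    # Least-squares polynomial fit over the window: row `deriv` of the
    # pseudo-inverse gives the weights for that polynomial coefficient.
    vander = np.vander(offsets, polyorder + 1, increasing=True)
    weights = np.linalg.pinv(vander)[deriv] * factorial(deriv) / dt ** deriv
    # np.convolve flips its kernel, so reverse the weights to correlate.
    return np.convolve(x, weights[::-1], mode="valid")


fs = 100.0
dt = 1.0 / fs
t = np.arange(0.0, 2.0, dt)
rng = np.random.default_rng(0)
position = np.sin(np.pi * t) + rng.normal(0.0, 1e-3, t.size)  # noisy signal


def true_jerk(time):
    # Analytic third derivative of sin(pi * t).
    return -np.pi ** 3 * np.cos(np.pi * time)


def rms(values):
    return float(np.sqrt(np.mean(values ** 2)))


# SG output covers only samples that have a full window around them.
err_narrow = rms(savgol_derivative(position, 5, 3, 3, dt) - true_jerk(t[2:-2]))
err_wide = rms(savgol_derivative(position, 31, 3, 3, dt) - true_jerk(t[15:-15]))
# Repeated finite differences; each estimate sits midway between two samples.
err_fd = rms(np.diff(position, 3) / dt ** 3 - true_jerk(t[1:-2] + dt / 2))
print(err_fd, err_narrow, err_wide)
```

In this sketch, the wider window reduces the jerk error by orders of magnitude relative to both the narrow window and plain finite differences. A very wide window would eventually suppress genuine motion features, which is the trade-off this research question examines.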
• Are motion smoothness metrics superior to other process metrics in correlating with skill indicator metrics?
In studies seeking to assess clinical skill in a simulator or otherwise, a suite of metrics is typically employed. However, are some metrics more powerful than others in discerning clinical skills? If so, this might have implications for the design of hardware as well as in creating training curricula that involve the most sensitive of metrics. For example, a study on a robotic laparoscopic skill trainer employed a few rudimentary metrics to assess performance [Judkins et al. (2009)] and reported significant differences between the five medical students and five laparoscopic surgeons who participated. In contrast, several studies utilize rudimentary metrics and more complicated motion smoothness metrics to distinguish between skill levels [Gulde and Hermsdörfer (2018b), Kholinne et al. (2018), Belvroy et al. (2020)]. The argument commonly made for using more sophisticated metrics is either that they grasp an aspect of skill not captured by rudimentary metrics or that they do it better. In this study, we examine if any of the process metrics, including both the more sophisticated motion smoothness metrics and the rudimentary metrics, are superior to the other process metrics in correlating with the cannulation skill.

The Cannulation Simulator
This study collected and analyzed data from clinicians on a novel simulator for hemodialysis cannulation [Zhang et al. (2019), Liu et al. (2020a), Liu et al. (2020b)]. The simulator comprises four synthetic arteriovenous (AV) fistulas, each outfitted with a vibration motor to simulate turbulent blood flow (termed "thrill" by clinicians) in a fistula. Figure 2 illustrates a sketch of the simulator hardware and the experimental setup. There are four sensors present in the system: an electromagnetic (EM) position sensor (trakSTAR, Northern Digital Inc.) located inside the needle; a force sensing system to record forces applied by the fingers (FingerTPS, Pressure Profile Systems Inc.); the Leap Motion sensor (Ultraleap Inc.) for tracking finger position; and infrared (IR) emitters and detectors for determining whether the needle is inside the fistula. The Leap Motion sensor is affixed above the simulator while the FingerTPS sensors are fit onto the user's thumb, index, and middle fingers. The IR sensors are embedded within the needle tip and each fistula located in the simulator. An external camera (RealSense, Intel Inc.) records video of participants performing the cannulation task. Custom software was written in C++ for integrating all sensors to enable data collection at sensor-specific sample rates [Liu et al. (2020b)].

Experimental Design
All participants provided informed consent to participate in the study. Data were recorded from 52 participants comprising nurses, nurse practitioners, and dialysis technicians with 0 to 38 years of experience. These participants performed four cannulation attempts on each of the four fistulas in the simulator for a total of 16 trials per participant. Data on 53 trials were excluded due to LED failure, lack of expert rating, failure to complete the trial, or data saving errors, resulting in a total of 779 viable trials for analysis. Before performing the task, each participant filled out a questionnaire that included the participant's cannulation experience. Once the questionnaire was answered, participants were briefed about the experimental procedure using a self-advanced PowerPoint presentation. Following this, participants performed cannulation on the simulator following clinical guidelines as closely as possible. For each attempt, only one (out of the four) fistula's vibration motor was activated randomly. The subject was instructed to first palpate the skin surface to locate the correct fistula (i.e., the fistula with a "thrill"). They then attempted to insert the needle into the fistula to achieve successful cannulation. If the needle tip successfully entered the fistula, a red LED located inside the cannula was turned on to simulate the blood flashback visible in a clinical setting. Following standard guidelines, if a stable blood flashback was obtained, the participant was instructed to "level out" (lower the angle of the needle) to allow for taping of the cannula during dialysis.

Data Segmentation
Sensor data were collected at a rate of 100 Hz and synchronized in Visual Studio 2017 (Microsoft Inc.), and data segmentation and metric calculations were conducted in MATLAB R2020a (MathWorks Inc.). We computed the position of the needle tip based on the location of the EM sensor inside the needle and needle geometry using a pivot calibration. Following the procedure outlined in Liu et al. (2020b), data were extracted and segmented into specific cannulation subtasks: insertion, flashback, and leveling out (Figure 3). This figure plots the x, y, and z needle positions of a sample cannulation trial. Dotted lines separate the trial into subtasks, and each subtask is named in the title of its segment, with a corresponding visual sketch above the segmented plot. By comparing the z-position of the electromagnetic sensor with the height of the surface of the skin, we determined the timestamp at which the needle punctured the surface of the skin (denoted as t_entry). Movement data were segmented from t_entry until the end of the task (t_end). This segmentation also allowed for multiple reinsertion attempts that a participant may have used. As has been noted in studies that use motion smoothness metrics, it is essential to constrain the task since motion smoothness metrics are task-dependent. That is, without a consistent start and end point, evaluation of motion smoothness profiles would be meaningless. Once these data are segmented, derivatives are approximated through a third-order Savitzky-Golay (SG) smoothing filter. To compare various degrees of smoothing, SG window spans of minimum (5 samples

Process Metrics
We define dexterity process metrics calculated in our study as follows:
• Time (T): The total time from t_entry to t_end of the task.
$$T = t_{entry \to end} = t_{end} - t_{entry} \tag{1}$$
• Peaks (Pks): The number of local maxima (peaks) in the velocity profile, computed using the built-in MATLAB function findpeaks.
$$Pks = \mathrm{findpeaks}\left(\frac{dX}{dt}\right)$$
• Path length (PL): The sum of Euclidean distances between points traversed by the needle tip.
• Spectral arc length (SPARC): As defined in Balasubramanian et al. (2012), SPARC is the arc length of the Fourier transform of the velocity profile, computed from the provided MATLAB code.
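The three simpler process metrics can be sketched directly from their verbal definitions. The example below is illustrative only: the function names and sample values are hypothetical, and the peak counter is a simplified stand-in for MATLAB's findpeaks (it counts strict local maxima and ignores options such as minimum prominence).

```python
import math


def task_time(t_entry, t_end):
    """Time (T): duration from skin entry to the end of the task."""
    return t_end - t_entry


def path_length(points):
    """Path length (PL): sum of Euclidean distances between consecutive points."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


def count_velocity_peaks(speed):
    """Peaks (Pks): number of strict local maxima in a sampled speed profile."""
    triples = zip(speed, speed[1:], speed[2:])
    return sum(1 for prev, cur, nxt in triples if prev < cur > nxt)


# Hypothetical needle-tip samples (x, y, z) and a hypothetical speed profile.
tip_positions = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
tip_speed = [0.0, 1.0, 3.0, 2.0, 2.5, 1.0, 0.0]

t_metric = task_time(t_entry=1.25, t_end=4.75)
pl_metric = path_length(tip_positions)
pks_metric = count_velocity_peaks(tip_speed)
print(t_metric, pl_metric, pks_metric)
```

In practice, the speed profile would come from the smoothed derivative of needle-tip position, so the peak count inherits any sensitivity to the chosen smoothing window.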

Skill Indicator Metric Definition
For data analysis, we created three statistical models, one for each skill indicator metric, to examine their effectiveness in quantifying skill. A general sense of the descriptive statistics can be gleaned from Figure 5. We defined the three skill indicator metrics as follows:
• Cannulation experience (Exp): Subjects were asked to fill out a questionnaire that included the amount of clinical cannulation experience the participant had. Our participants' years of experience cannulating ranged from 0 to 38 years, with a mean of 11 years and a standard deviation of 8.6 years.
• Global rating sheet (GRS): A commonly used method to determine skill level in various medical fields is by experts observing and rating the performance of a task on a Likert-scale questionnaire. Participants were rated by one of three experts on a Likert scale from 1 to 7 on the following aspects: palpation skill, needle holding, needle movement, flashback quality, and overall quality. Ignoring palpation skill, as it was deemed unrelated to the motion of the needle, we summed the scores of the remaining categories for an overall rating for GRS. Summed expert scores ranged from 16 to 35, with a mean of 28 and a standard deviation of 6.
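• Flash ratio (FR): The objective outcome metric developed to measure the quality of task completion. Subjects' scores ranged from 0 to 1, with a mean of 0.79 and a standard deviation of 0.30.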

Statistical Analysis
Linear regressions were performed individually for each process metric against each skill indicator metric. Due to the inherent skew of the process metrics, all but LDLJ were log-transformed to enable regression modeling. After log transformation, all process metrics were standardized to have zero mean and unit variance to allow for direct comparison of estimated coefficients. We then regressed each skill indicator metric onto each process metric, recording the estimated slopes and associated standard errors. This process was repeated for each window size. After model fitting, pairwise comparisons between estimated coefficients were made. The t-test was chosen for comparing differences between parameters due to ease of interpretation, and since the number of comparisons was large, Tukey's multiple comparison adjustment was computed for assessing significance at the α = 0.05 level based on the equation below. This procedure was repeated comparing association with skill between the various window spans for LDLJ and SPARC individually.
In the confidence interval for Tukey's method, α corresponds to the significance level (0.05), β is the regression coefficient, q denotes the critical value of the studentized range distribution, n corresponds to the total number of observations (779), r is the total number of groups (5 for each analysis), and i and j are group indicators.
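The analysis steps above (log transformation, standardization, single-predictor regression, and pairwise comparison of slopes) can be sketched as follows. This is a hedged illustration, not the authors' procedure: the data are invented, the function names are hypothetical, and the interval uses a generic Tukey-style form, (β_i − β_j) ± (q/√2)·sqrt(SE_i² + SE_j²), which assumes independent slope estimates and may differ from the equation used in this study. The critical value 3.86 is an approximate tabulated value for five groups and large degrees of freedom at α = 0.05.

```python
import math
import statistics


def standardize(values):
    """Rescale values to zero mean and unit (sample) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def slope_with_se(x, y):
    """Ordinary least-squares slope of y on x and its standard error."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return slope, math.sqrt(rss / (n - 2) / sxx)


def pairwise_interval(b_i, se_i, b_j, se_j, q_crit):
    """Tukey-style interval for b_i - b_j (assumed generic form)."""
    diff = b_i - b_j
    half_width = q_crit / math.sqrt(2) * math.sqrt(se_i ** 2 + se_j ** 2)
    return diff - half_width, diff + half_width


# Invented example data: two process metrics and one skill indicator.
time_s = [4.0, 5.5, 7.0, 9.5, 12.0, 20.0]
path_mm = [30.0, 42.0, 35.0, 60.0, 55.0, 110.0]
flash_ratio = [0.95, 0.90, 0.80, 0.70, 0.50, 0.20]

z_time = standardize([math.log(v) for v in time_s])
z_path = standardize([math.log(v) for v in path_mm])
b_time, se_time = slope_with_se(z_time, flash_ratio)
b_path, se_path = slope_with_se(z_path, flash_ratio)
low, high = pairwise_interval(b_time, se_time, b_path, se_path, q_crit=3.86)
print(b_time, b_path, (low, high))
```

Because each predictor is standardized, the two slopes are on a common scale and can be compared directly; an interval that contains zero would indicate no significant difference in association between the two process metrics.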

RESULTS
The first part of our analysis examined which of the skill indicator metrics best accounted for the dexterity process metrics. We begin by summarizing model fits associated with each regression model. The absolute value of the Pearson correlation coefficient between each process metric and skill indicator metric across all the window sizes and the mean of R² for all regression models are presented in Figure 6. It can be noted that FR has a much higher fit than GRS (approximately 25 vs. 8%) and that Exp has an extremely poor model fit (about 0.5%). It is also worth noting that none of the process metrics are very strongly correlated with Exp (as is evidenced in subplot (c) in Figure 7, which demonstrates each process metric's correlation to the subplot's indicator metric). Therefore, any conclusions made for Exp are not reliable owing to the poor fit. The most salient observation from this analysis is the superiority of FR, the objective outcome metric, in accounting for dexterity process metrics.
[Figure caption fragment: The objective outcome metric, FR, best accounted for the process metrics used in the study to measure skill.]
The second question we examined in this work is whether the degree of smoothing affected the computation of LDLJ and SPARC, the two motion smoothness metrics used here. Figure 8 reveals the changes in the values of LDLJ and SPARC when plotted as a function of window span (indicative of the degree of smoothing). This result indicates that motion smoothness values are affected by the degree of smoothing. However, does this change result in a significant difference in the correlation of LDLJ or SPARC with each skill indicator metric? This question is important in the context of this study, which seeks to investigate the power of metrics to quantify skill. As seen in Figures 9 and 10, the window size does not affect the correlation of LDLJ and SPARC with FR. Even the "noisiest" span of five correlated with FR as well as the other spans did. For GRS, however, a window span of five is significantly less associated with GRS than the other window spans for both LDLJ and SPARC. That is, when minimal smoothing is applied, metric values do not correlate with GRS as well.
The third research question we examined is whether motion smoothness metrics are superior to other process metrics in correlating with skill indicator metrics. Figures 11-13 show the confidence intervals of the significant differences of association with skill among the process metrics across the five window spans for each skill indicator metric. It is important to note that the formulations of PL, T, and Pks have a negative modifier added to the value to enable comparisons between the slopes. As a result, a decrease in any of the process metric values denotes a worse performance. Our pairwise comparison results demonstrate that no process metric is significantly better than any other metric in association with FR. This holds true across all window spans. While some significant differences were observed in association with GRS, we conclude that these differences are erratic, since none of the process metrics demonstrates a consistently superior association relative to the other process metrics across window spans. Note from the earlier discussion that GRS has a low R² of about 8%. PL indicates a consistently higher association with Exp in comparison with the other metrics. However, due to the model's exceptionally poor fit, it is difficult to make any meaningful assertions regarding Exp.

Which Skill Indicator Metric Best Accounts for Dexterity Measures?
Methods for classifying the level of skill must be robust for effective application in medical training simulators. In this study, we presented an objective outcome metric for assessing the degree of success in a simulated clinical task and examined its power for quantifying skill in comparison with the more generally used skill indicator metrics. In addition, unlike many studies that binarize expertise into "expert" or "novice," we collected years of clinical experience as a finer-grained measure for analysis. This detail in our experimental design enabled the investigation of the research questions presented in this work. From our results, the objective outcome metric FR better accounted for the process metrics used in the study to capture skilled movement. The metrics traditionally used as surrogates for clinical skill were inferior at accounting for dexterity. Furthermore, the correlation coefficients for each process metric in Figure 7 changed drastically depending on the skill indicator metric, with FR having the best correlations. In Figure 6, the overall fit for each indicator metric, although comparatively different, is relatively low (about 25% for FR). This can be attributed to the process metrics examined accounting primarily for dexterity. We would likely see improvements in model fit if other aspects of skill, such as force, needle positioning and angle, and decision making, were measured and incorporated. Despite being a regularly used skill indicator metric in the field of medical skills training, Exp performs poorly for the cannulation task, accounting for only about 0.5% of the variation in the model. It is important to take into account the current task: although cannulation is an important medical procedure, it is not as complicated and multifaceted as surgery. As a result, it is possible that simply having experience in cannulation does not necessarily result in increased skills for successful cannulation.
We can surmise that Exp is not a useful measure of skill in our cannulation simulator. GRS yields a better fit than Exp, but the fit is still relatively poor at about 8.2%. This increase is likely due to the expert's knowledge of skill as he or she rates cannulation performance by direct observation. This metric, therefore, provides a better assessment of skill than Exp. Nevertheless, the fit is still relatively small and can be attributed to two primary reasons: (1) GRS evaluates each subject as a whole, rather than on a trial-by-trial basis, and (2) expert raters lack true knowledge of the success of a task from mere observation. In contrast, FR measures the true success of the trial and, consequently, sees a significant improvement in model fit at about 25%. This result may encourage the formulation of more outcome metrics to measure the degree of success in a clinical task objectively. Some examples of this are measuring the leakage after a surgeon sutures a vascular anastomosis or measuring the degree of motion after orthopedic surgery. Such metrics may yield a truer measure of skill, potentially impacting skill assessment and training in a positive way.

Does the Degree of Smoothing Significantly Affect the Computation of Motion Smoothness Metrics?
As mentioned earlier, there has been some discussion in recent literature on the relative benefits of SPARC and LDLJ in measuring motion smoothness. One limitation pointed out is LDLJ's sensitivity to noise due to its use of the third derivative of position. In comparison, SPARC uses velocity, requiring only one derivative of position with respect to time. The effects of derivative computation and sensor data smoothing have not been systematically explored in the literature. We hypothesized that the degree of smoothing affects the computation of both motion smoothness metrics. As seen in Figure 8, both LDLJ and SPARC see significant differences in means as a function of window span. Nevertheless, as also evidenced in our results, the significant differences of means between window spans did not affect the discerning power of motion smoothness metrics in our task, calculated using correlation coefficients and regression slope comparisons across a variety of SG window spans. In comparison with the other process metrics, LDLJ and SPARC see the largest increase in correlation when moving from a window span of 5 to 25 for each skill indicator metric in Figure 7. This is most likely due to the noise present in sensor data not being adequately filtered in the small window span. After the window span of 25 (some smoothing), the stability of LDLJ and SPARC is similar.
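The sensitivity described above can be illustrated with a minimal sketch, which is not the study's own pipeline. It assumes a one-dimensional position signal, a fourth-order Savitzky-Golay (SG) fit, and a commonly used form of LDLJ, -ln(T^3 / v_peak^2 * integral of jerk^2 dt); the function names (`solve_linear`, `sg_coeffs`, `sg_filter`, `ldlj`), the synthetic minimum-jerk reach, and the noise level are all hypothetical choices made for illustration. For a noise-free minimum-jerk reach the analytic value is -ln(204.8), about -5.32, which gives a reference point for the printed numbers.

```python
import math
import random


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def sg_coeffs(span, order, deriv, dt):
    """SG weights estimating the deriv-th derivative at the centre of an odd
    window (span > order) from a least-squares polynomial fit."""
    half = span // 2
    u = [i / half for i in range(-half, half + 1)]  # offsets scaled to [-1, 1]
    # Normal equations (A^T A) x = e_deriv with A[i][j] = u_i ** j.
    ata = [[sum(ui ** (j + k) for ui in u) for k in range(order + 1)]
           for j in range(order + 1)]
    e = [1.0 if j == deriv else 0.0 for j in range(order + 1)]
    x = solve_linear(ata, e)
    scale = math.factorial(deriv) / (half * dt) ** deriv
    return [scale * sum(x[j] * ui ** j for j in range(order + 1)) for ui in u]


def sg_filter(y, span, order, deriv, dt):
    """Apply the SG weights at every interior sample (edge samples are dropped)."""
    w = sg_coeffs(span, order, deriv, dt)
    return [sum(wk * y[i + k] for k, wk in enumerate(w))
            for i in range(len(y) - span + 1)]


def ldlj(position, dt, span, order=4):
    """Log dimensionless jerk of a 1-D position trace (assumed formulation)."""
    vel = sg_filter(position, span, order, 1, dt)
    jerk = sg_filter(position, span, order, 3, dt)
    duration = (len(jerk) - 1) * dt
    v_peak = max(abs(v) for v in vel)
    integral = sum(j * j for j in jerk) * dt
    return -math.log(duration ** 3 / v_peak ** 2 * integral)


# Synthetic data: unit-amplitude, 1 s minimum-jerk reach sampled at 1 kHz.
n = 1001
dt = 1.0 / (n - 1)
clean = [10 * t ** 3 - 15 * t ** 4 + 6 * t ** 5
         for t in (i / (n - 1) for i in range(n))]
rng = random.Random(0)
noisy = [x + rng.gauss(0.0, 1e-4) for x in clean]  # hypothetical sensor noise

for span in (5, 25, 55):
    print(span, round(ldlj(clean, dt, span), 3), round(ldlj(noisy, dt, span), 3))
```

In this sketch the noisy trace yields a far more negative LDLJ at a span of 5 than at a span of 25, because the third derivative amplifies unfiltered noise. Dropping the edge samples also shortens the integration interval as the span grows, so values from very wide windows should be read with that caveat in mind.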
When comparing the effect of the degree of smoothing (via window spans) on the strength of correlation between indicator metrics and motion smoothness metrics (Figures 9, 10), we see no significant differences in association with FR for both LDLJ and SPARC across all window spans. As a result, the level of smoothing does not ultimately affect the motion smoothness metrics' level of association with the objective outcome metric in our study. On the other hand, a window span of five is significantly less associated with GRS than all other window spans for both motion smoothness metrics. Most of these differences become insignificant as the degree of smoothing increases. This is likely due to the noise present in the data at window span five, causing motion smoothness metric values to be more erratic. These results follow the trend that a moderate amount of smoothing, particularly when calculating derivatives, is important for more robust results. We refrain from making any claims for Exp since this skill indicator metric has a poor model fit.
FIGURE 10 | Confidence intervals of each skill indicator metric's association with SPARC between the tested window spans. If the confidence limits are both positive, the minuend is more associated with the corresponding skill indicator metric, and vice versa if the confidence limits are both negative. If the confidence interval passes through zero, there is no significant difference in the association.
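The way such difference intervals are read (both limits positive, both negative, or spanning zero) can be sketched in code. The example below is a stand-in rather than the statistical model used in the study: it substitutes a percentile bootstrap on the difference of two Pearson correlations, and the function names (`pearson`, `diff_ci`, `interpret`) and the synthetic data are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def diff_ci(skill, minuend, subtrahend, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for r(skill, minuend) - r(skill, subtrahend).

    Assumption: trials are resampled jointly, so the pairing between the
    skill indicator and both process metrics is preserved.
    """
    rng = random.Random(seed)
    n = len(skill)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        s = [skill[i] for i in idx]
        a = [minuend[i] for i in idx]
        b = [subtrahend[i] for i in idx]
        diffs.append(pearson(s, a) - pearson(s, b))
    diffs.sort()
    k = int(alpha / 2 * n_boot)
    return diffs[k], diffs[n_boot - 1 - k]


def interpret(lo, hi):
    """Read the interval the way the figure captions describe."""
    if lo > 0:
        return "minuend more strongly associated"
    if hi < 0:
        return "subtrahend more strongly associated"
    return "no significant difference"


# Synthetic illustration only; these are not the study's data.
rng = random.Random(1)
skill = [rng.random() for _ in range(52)]
strong = [s + rng.gauss(0.0, 0.1) for s in skill]  # metric that tracks skill
weak = [rng.gauss(0.0, 1.0) for _ in skill]        # metric unrelated to skill

lo, hi = diff_ci(skill, strong, weak)
print(round(lo, 3), round(hi, 3), interpret(lo, hi))
```

The minuend and subtrahend labels follow the captions: the interval describes the first metric's association minus the second's, so swapping the arguments flips the sign of both limits.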
Few studies in the literature examine derivative formulations and smoothing for motion smoothness calculations. An in-depth study on the effects of noise on various motion smoothness metrics raises concern with LDLJ's sensitivity to noise, though it was deemed not as sensitive as Pks or DLJ [Balasubramanian et al. (2012)]. In contrast, Balasubramanian et al. (2012) presented SPARC as a viable metric due to its robustness to noise. However, motion smoothness metrics are task-dependent; a recent study reported that SPARC may not be as effective as LDLJ in a specific application [Melendez-Calderon et al. (2020)]. The choice of motion smoothness metrics used in a study should depend on the task performed. Our work also brings to light the importance of the quality of smoothing. Gulde and Hermsdörfer (2018a) reported that minimal window spans yielded the largest relative deviations, whereas window spans between 280 and 690 ms had the lowest relative deviations. One limitation of that study is that the effect of derivative calculations was not examined, since jerk-based motion smoothness metrics were not computed. To examine the effects of smoothing parameters on derivative approximations, we chose the SG method due to its superiority over finite difference methods [Ahnert and Abel (2007)]. For LDLJ, we saw no significant differences in association with FR, the best-fitting skill indicator metric, after a reasonable degree of smoothing. Similar to Gulde and colleagues' findings, for LDLJ and SPARC, the window span of five had significantly less association with GRS than the other window spans. Consequently, a reasonable level of smoothing enables meaningful use of motion smoothness metrics for skill assessment.
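For comparison with the jerk-based measure, the following minimal sketch computes a spectral arc length from a speed profile. It is not the study's implementation. It assumes the widely used default settings of a 10 Hz maximum cutoff and a 0.05 amplitude threshold, and it evaluates the Fourier magnitude directly on a frequency grid instead of using a zero-padded FFT. The function names (`sparc`, `bell`) and the two synthetic speed profiles are hypothetical.

```python
import cmath
import math


def sparc(speed, fs, fc_max=10.0, amp_th=0.05, df=0.05):
    """Spectral arc length of a speed profile sampled at fs Hz (assumed defaults)."""
    freqs = [k * df for k in range(round(fc_max / df) + 1)]
    # Fourier magnitude evaluated directly at each grid frequency.
    mags = []
    for f in freqs:
        w = -2j * math.pi * f / fs
        mags.append(abs(sum(v * cmath.exp(w * i) for i, v in enumerate(speed))))
    top = max(mags)
    mags = [m / top for m in mags]
    # Adaptive cutoff: last grid frequency whose magnitude reaches the threshold.
    last = max(i for i, m in enumerate(mags) if m >= amp_th)
    if last == 0:
        return 0.0
    band = freqs[last] - freqs[0]
    arc = 0.0
    for i in range(last):
        d_freq = (freqs[i + 1] - freqs[i]) / band  # frequency axis scaled to [0, 1]
        d_mag = mags[i + 1] - mags[i]
        arc += math.hypot(d_freq, d_mag)
    return -arc


def bell(n):
    """Bell-shaped speed profile, 30 tau^2 (1 - tau)^2, with n samples."""
    return [30 * (i / (n - 1)) ** 2 * (1 - i / (n - 1)) ** 2 for i in range(n)]


fs = 200.0
one_reach = bell(201)                            # one continuous 1 s movement
two_reaches = bell(81) + [0.0] * 39 + bell(81)   # two 0.4 s sub-movements, 0.2 s pause

print(round(sparc(one_reach, fs), 3), round(sparc(two_reaches, fs), 3))
```

Because only first-derivative information (speed) enters the computation, measurement noise is amplified far less than in a jerk-based metric. The interrupted movement produces ripples in the magnitude spectrum and therefore a longer arc, that is, a more negative value.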

Are Motion-Based Process Metrics Superior in Correlation to Skill?
When LDLJ and SPARC are used alongside other metrics like T and PL in the literature, there is a lack of direct comparison between the relative powers of these metrics to discern skill. For example, PL was unable to demonstrate a significant difference between novices and experts, while both SPARC and idle time distinguished between the groups [Belvroy et al. (2020)]. One reason for this may be that PL may not effectively quantify surgical skill in the FEVS simulator. Even so, is there a benefit to using one metric over another (e.g., SPARC over idle time) if they can both determine significant differences in the groups? Similarly, when comparing hand movements of elderly vs. young patients, though all metrics demonstrated significance, Pks normalized per meter demonstrated a higher multiple linear regression R-squared than SPARC [Gulde and Hermsdörfer (2018b)]. However, in another study on the FEVS, Pks did not have a significant correlation to skill, whereas DLJ and SPARC did, with SPARC having the higher correlation to skill out of the two [Estrada et al. (2016)]. In our results, T had a higher correlation to FR, as expected due to the nature of how FR is formulated. Nonetheless, PL had similar or higher correlations when compared to SPARC and LDLJ.
FIGURE 11 | Confidence intervals of pairwise comparisons of process metrics with respect to FR plotted against all tested window spans. To visualize the relationships between effect sizes of process metrics and window spans, a line is plotted through the point estimates. A horizontal line is drawn through 0. If the confidence limits are both above zero, the minuend is more strongly associated with FR; if the limits pass through zero, the differences are insignificant; and if the limits are below zero, the subtrahend is more strongly associated with FR.
By directly comparing how the process metrics associate with skill, we sought to determine whether any process metric was significantly superior to the others in its association with skill, given their varying results. As observed in Figure 11, there are no significant differences in the pairwise comparisons of the strength of the process metrics' association with FR. Note that FR has the highest reliability for assessing skill in the study. In contrast, significant differences in association with GRS were observed in some cases in Figure 12. The differences were generally seen when the degree of smoothing was minimal (window span 5). One may infer from this that, after some smoothing is applied to position sensor data, both motion smoothness metrics may have a similar ability to discern skill. This result is reiterated by noting the correlation plots in Figure 7. Correlation coefficients are lowest at the smallest window span, which applies only minimal smoothing (and leaves the greatest amount of potential noise). After some smoothing, however, the correlation coefficients become relatively stable.
Based on our results, we conclude that having a strong measure of skill like the objective outcome metric FR yielded the most stable set of process metric comparisons. GRS, on the other hand, yielded less explainable results as a function of metrics and degree of smoothing. One key finding of this work is the need for a robust amount of smoothing and an accurate metric for measuring skill outcome.
FIGURE 12 | Confidence intervals of significant differences of pairwise comparisons of process metrics with respect to GRS plotted against all tested window spans. A line is drawn through the intercepts to better visualize the trend of the confidence intervals of each window span increase. If the confidence interval is above zero, the minuend is more strongly associated with GRS; if it passes through zero, the differences are insignificant; and if it is below zero, the subtrahend is more strongly associated with GRS.
Our study does not test whether each process metric can predict the value of each skill indicator metric, but rather each process metric's association with that skill indicator metric. Future work involving such prediction models could provide insight into the superiority of the process metrics. It is also important to take into account the dependence of motion smoothness metrics on task constraints when making these analyses: the cannulation procedure consists of a simple motion from $t_{entry}$ to $t_{end}$. The task may not be complex enough to see these differences in association with skill. Another aspect we wish to highlight is that the efficacy and superiority of motion smoothness metrics depend on the task being studied. As mentioned previously, motion smoothness metrics capture the dexterity of hand movements, examining precise movements. Our cannulation simulator task does not require precise motion, possibly causing the lack of significant difference in process metric association with FR. Previous studies report varying results on process metrics differentiating between experts and novices, yet motion smoothness metrics tend to consistently demonstrate significant differences if effective calculations are performed. Therefore, despite their task dependency, we can conclude that motion smoothness metrics are at least as effective in their skill evaluation as other process metrics.
Future work on a simulator involving testing across tasks requiring different hand movements would allow for a study on the robustness of motion smoothness metrics vs. other process metrics.

Conclusion
Clinical skills training is critical for sustaining an efficient workforce. The use of remote and automated simulators for skills training is especially appealing. In this study, we demonstrate that commonly used skill indicator metrics may be limited in their assessment. In contrast, an objective metric for measuring the degree of task success proved to be a superior skill indicator metric, exhibiting stronger goodness of fit to process metrics. Moreover, the lack of significant differences among the process metrics in their association with FR, combined with the much higher model fit, may demonstrate the robustness of this measure. Our results also demonstrated that the degree of smoothing of sensor data affects the computation of motion smoothness metrics under certain conditions. These results directly inform the design and use of simulator-based training methods. The flexibility of training simulators is an excellent asset towards effective training of medical practitioners and students, but it is vital to optimize simulators for efficient use in remote, automated assessment without the need for expert raters or onsite training.
FIGURE 13 | Confidence intervals of significant differences of pairwise comparisons of process metrics with respect to Exp plotted against all tested window spans. A line is drawn through the intercepts to better visualize the trend of the confidence intervals of each window span increase. If the confidence interval is above zero, the minuend is more strongly associated with Exp; if it passes through zero, the differences are insignificant; and if it is below zero, the subtrahend is more strongly associated with Exp.

DATA AVAILABILITY STATEMENT
The datasets presented in this article are not readily available because of restrictions on data access placed by the relevant IRB. Inquiries may be directed to joseph@clemson.edu.

ETHICS STATEMENT
Participants provided informed consent to participate in the study. This study was reviewed and approved by the Institutional Review Board at Prisma Health.

AUTHOR CONTRIBUTIONS
SS and RS contributed to the conception of the experiment, the experimental data collection, data and statistical analysis, and writing of the manuscript; JB contributed to the statistical analysis and writing of the manuscript; and ZL and ZZ contributed to the design of the experimental model, experimental data collection, and writing of the manuscript.

FUNDING
Research reported in this publication was supported by an NIH/NIDDK K01 award (K01DK111767). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.