ORIGINAL RESEARCH article

Front. Comput. Sci., 28 April 2026

Sec. Theoretical Computer Science

Volume 8 - 2026 | https://doi.org/10.3389/fcomp.2026.1746591

Toward a novel measure of user trust in XAI systems

  • 1. Unitat de Gràfics i Visió per Ordinador (UGiVIA) Research Group, Department of Mathematics and Computer Science, University of the Balearic Islands, Palma, Spain

  • 2. Laboratory for Artificial Intelligence Applications (LAIA@UIB), Department of Mathematics and Computer Science, University of the Balearic Islands, Palma, Spain

  • 3. Artificial Intelligence Research Institute of the Balearic Islands (IAIB), University of the Balearic Islands, Palma, Spain

  • 4. Soft Computing, Image Processing and Aggregation (SCOPIA) Research Group, Department of Mathematical Sciences and Computer Science, University of the Balearic Islands, Palma, Spain

  • 5. Institute for Health Research of the Balearic Islands (IdISBa), Palma, Spain

  • 6. Univ Rouen Normandie, ESIGELEC, Normandie Univ, IRSEEM UR 4353, Rouen, France

  • 7. Hospital Universitari Son Espases, Palma, Spain

Abstract

The increasing reliance on Deep Learning models, combined with their inherent lack of transparency, has spurred the development of a novel field of study known as eXplainable AI (XAI). XAI methods aim to enhance end-users' trust in automated systems by providing insights into the rationale behind their decisions. This paper presents a novel measure of user trust in XAI systems, allowing their refinement. Our proposed metric combines performance metrics and trust indicators from an objective perspective. To validate this novel methodology, we conducted three case studies showing an improvement with respect to the state-of-the-art, with an increased sensitivity to different scenarios.

1 Introduction

Since the seminal work of Krizhevsky et al. (2012), machine learning models, and in particular Deep Learning (DL) models, have become pervasive across multiple and diverse fields of study. This ubiquity of DL approaches is due to their markedly better results compared to non-deep-learning methods. The improvement offered by these methods is achieved through their high complexity; however, this same complexity makes their inner workings harder to understand. The fact that the causes behind a decision are unknown can be ignored in non-sensitive fields; nonetheless, it is crucial in sensitive ones, such as medical-related tasks (Miró-Nicolau et al., 2022).

To address this issue, eXplainable Artificial Intelligence (XAI) emerges, according to Adadi and Berrada, aiming to “create a suite of techniques that produce more explainable models whilst maintaining high performance levels” (Adadi and Berrada, 2018). The growing dynamic around XAI has been reflected in several scientific events and in the increase of publications as indicated in several recent reviews about the topic (Adadi and Berrada, 2018; Došilović et al., 2018; Murdoch et al., 2019; Anjomshoae et al., 2019; Minh et al., 2022; Arrieta et al., 2020). In particular, its importance is crucial in the sensitive field of health and wellbeing (Eitel et al., 2019; Miró-Nicolau et al., 2022; Van der Velden et al., 2022; Chaddad et al., 2023).

Miller (2019) identified the need to measure different aspects of explanations in order to make an objective evaluation, noting that "most of the research and practice in this area seems to use the researchers' intuitions of what constitutes a 'good' explanation." With this paper, the author started a trend of measuring different aspects of the explanation and of using social science knowledge to evaluate XAI techniques objectively.

Multiple aspects of an explanation can be evaluated. Bodria et al. (2023) propose a distinction between quantitative and qualitative evaluation. Similarly, Amengual-Alcover et al. (2025) categorize XAI aspects into machine-centered (quantitative) and human-centered (qualitative) dimensions. Machine-centered aspects refer to properties that are independent of the user and can thus be assessed through algorithmic or computational methods. In contrast, human-centered aspects depend on users' perceptions and interactions with the XAI system, requiring evaluation through user studies or other qualitative approaches. Nauta et al. (2023) reviewed the state-of-the-art of XAI evaluation and likewise identified the divide between machine-centered and user-centered analyses as a main issue in the field; they also identified 12 distinct elements that can be used to evaluate explanations. For a comprehensive analysis of these evaluation features, we refer the reader to their work (Nauta et al., 2023). Doshi-Velez and Kim (2017) make a similar distinction; these authors propose a taxonomy of XAI evaluation approaches, dividing them into Application Grounded Evaluation, Human Grounded Metrics, and Functionally Grounded Evaluation. The latter includes machine-centered approaches, while the first two are human-centered measures. Vilone and Longo (2021) also review the state-of-the-art and propose dividing XAI measures into human-centered and objective evaluation. As can be seen, all these authors highlight the primary distinction among XAI aspects as whether they are machine-centered or human-centered.

While machine-centered aspects are widely studied, human-centered approaches need further research. Arrieta et al. (2020), after reviewing the state-of-the-art, identified trust as one of the primary goals of an XAI model from a user's point of view. According to Miller (2019), trust must be prioritized and used as a basic criterion of explanation correctness.

The evaluation of user trust in both AI and XAI is an extensively studied topic in the literature. From blockchain (Ressi et al., 2024) to Human-Computer Interaction (HCI) (Perrig et al., 2023), trust is considered a pivotal attribute of AI models. In the field of HCI, this importance is evidenced by the significant number of reviews published recently (Perrig et al., 2023; Kaufman et al., 2025; Ueno et al., 2022; Wischnewski et al., 2023). These reviews converge on similar findings regarding the state-of-the-art in trust evaluation: a lack of consensus on the definition of trust, the use of ad-hoc measures, and the predominance of scales. Most of these scales were not specifically proposed for XAI or AI but are used across the broader automation field. Hoffman et al. (2018) analyzed, identified, and combined multiple scales used in the state-of-the-art for measuring trust in automation: Jian et al. (2000), Cahour and Forzy (2009), Merritt (2011), and Wang et al. (2009). Perrig et al. (2023) identified Trust between People and Automation (TPA) (Jian et al., 2000) as the most used scale in the AI context and, for XAI, the proposal of Hoffman et al. (2018), known as the Trust Scale for Explainable AI (TXAI). Mohseni et al. (2021) identified scales and interviews as subjective measurements. According to Scharowski et al. (2022), these subjective measures capture the attitudinal (subjective) perception of an agent toward a system; they criticize the use of questionnaires: "the data collected using survey scales is inherently subjective, given that it reflects participants' own perspectives". These authors proposed shifting the measurement of trust from an attitudinal approach to a behavioral approach, in other words, from a subjective approach to an objective one.

The behavioral approach is usually a single-item evaluation of trust. Multiple authors have made these kinds of proposals (Yu et al., 2017; Buçinca et al., 2020; Lai and Tan, 2019). Yu et al. (2017) proposed a synthetic proxy task to analyze human trust in the system, asking the user to self-assess their trust using a Likert scale ranging from 1 (distrust) to 7 (trust). Buçinca et al. (2020) similarly proposed a proxy task with self-assessment of trust but, instead of a scale, defined a binary trust-or-distrust mechanism, allowing a simpler analysis of the results. Finally, Lai and Tan (2019) proposed a behavioral measurement of trust, defining the trust level as the percentage of times that the end-user relies on the prediction and explanation.

The influence of AI model performance on user trust has been widely studied in the state-of-the-art. In particular, Glikson and Woolley (2020) reviewed the literature on trust and identified that trust is "prone to change based on the behavior of the trusted agent". Their main conclusion was that multiple authors (de Visser et al., 2012; Dietvorst et al., 2015; Manzey et al., 2012) found that high initial trust in an AI system tends to decrease as a result of erroneous AI outcomes. Lai and Tan (2019) also identified that their measure was affected by prediction performance: "We find that humans tend to trust correct machine predictions more than incorrect ones, which suggests that humans can somewhat effectively identify cases where machines are wrong." Yu et al. (2017) demonstrated that trust is modified by performance. Therefore, it is clear from the state of the art that there is a relationship between AI performance and user trust.

The existing state-of-the-art concludes both that an objective evaluation of trust is needed and that AI performance has a large effect on trust. Consequently, building on this knowledge, we propose a novel behavioral measure of user trust in an automated system that combines the prediction performance and the user's trust in the system in a simplified and objective approach. Our proposal is based on the work of Lai and Tan (2019), one of the first behavioral approaches to measuring trust. We propose a new set of measures based on the well-established classification measures, true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), into which we incorporate trust information. We carried out a case study in the medical context to test the proposed measurement.

The rest of this paper is organized as follows. In the next section, we describe our proposed measure of user trust in an automatic system. In Section 3, we define a set of case studies to verify the proposed measure. Finally, in Section 4, we present the conclusions of this study.

2 Materials and methods

Because multiple definitions of trust exist, any research in this field must begin by clarifying which definition underpins its proposals and findings. In this article, we adopt the definition of trust proposed by Mayer et al. (1995) and Lee and See (2004), who define trust as "the willingness of a party to be vulnerable to the actions of another party based on the expectation that the other will perform a particular action important to the trustor, irrespective of the ability to monitor or control that other party." This definition is among the most widely used in the state-of-the-art (Ueno et al., 2022). In addition to defining trust itself, it is also necessary to specify a trust model that explains how human trust is formed. Hoff and Bashir (2015) proposed the de facto standard model, which identifies three factors that shape human-automation trust: dispositional trust (stemming from the human), situational trust (arising from the environment), and learned trust (based on the automated system's behavior).

The goal of this paper is to propose a novel measure of user trust in an XAI system. Knowing that user trust is highly dependent on the model's behavior, we propose a novel approach based on the existing relation between correct predictions and the user's trust in the system. Muir and Moray (1996) state that "Results showed that operators' subjective ratings of trust in the automation were based mainly upon their perception of its competence. Trust was significantly reduced by any sign of incompetence in the automation, even one that had no effect on overall system performance." This conclusion is shared by other authors (de Visser et al., 2012; Dietvorst et al., 2015; Manzey et al., 2012), as we showed in the previous section. Building on these insights, our goal is to unify performance and trust into a single, intuitive measure that remains consistent with the three key dimensions identified in the Hoff and Bashir (2015) trust model.

In the context of classification tasks, prediction results are evaluated using a set of established measures, including the number of TPs, FPs, FNs, and TNs. These fundamental measures serve as the foundation for computing a variety of more intricate metrics, allowing for a comprehensive and objective analysis of performance across various dimensions. We can simplify these four measures into a binary one: true predictions (True Positives and True Negatives) or false predictions (False Positives and False Negatives). On the other hand, the trust of a user in a system is subjective; different users can have completely different levels of trust in the same system. However, as discussed by Scharowski et al. (2022), we can objectively measure this subjective feature. Based on the work of Lai and Tan (2019), we propose to combine performance information with a behavioral question for the user: whether the user would employ the system for a particular sample, taking into account whether the sample is correctly classified.

We proposed four fundamental sample-based measures that combine both the information of the trust and the correctness of the prediction:

  • Trust True (TT). The number of correct predictions for which the user trusts the corresponding explanation.

  • Untrust True (UT). The number of correct predictions for which the user does not trust the corresponding explanation.

  • Trust False (TF). The number of incorrect predictions for which the user trusts the corresponding explanation.

  • Untrust False (UF). The number of incorrect predictions for which the user does not trust the corresponding explanation.

The previous sample-based measures are calculated by counting the number of occurrences of each case. These measures are summarized in Table 1. This is the main contribution of our article: a novel confusion matrix that combines trust and performance information for a particular AI system, in a well-known structure, allowing the use of existing metrics.

Table 1

Trust/prediction | True prediction | False prediction
Yes | TT | TF
No | UT | UF

Basic measures proposed in a confusion matrix format.
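The four counts can be computed directly from the recorded interactions. The following is a minimal sketch in Python (variable and function names are illustrative, not part of the original study), assuming three parallel lists: ground-truth labels, model predictions, and the user's binary trust response per sample.

    def trust_confusion_matrix(y_true, y_pred, trusted):
        """Count the proposed measures (TT, UT, TF, UF) for one user."""
        tt = ut = tf = uf = 0
        for gt, pred, trust in zip(y_true, y_pred, trusted):
            correct = (gt == pred)
            if correct and trust:
                tt += 1  # Trust True: correct prediction, trusted explanation
            elif correct:
                ut += 1  # Untrust True: correct prediction, not trusted
            elif trust:
                tf += 1  # Trust False: incorrect prediction, trusted
            else:
                uf += 1  # Untrust False: incorrect prediction, not trusted
        return tt, ut, tf, uf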

Our proposed measures allow us to differentiate between well-known problems of human trust in AI models: over-trust (also known as over-reliance), captured by the TF measure; and under-trust (also known as under-reliance), captured by the UT measure. Moreover, similarly to classification measures, our proposals can be combined to obtain multiple higher-level, well-established metrics. For example, we can adapt these two metrics:

  • Precision. In the trust context, it is the proportion of TT among the total of trusted predictions. This measure penalizes user over-trust. See Equation 1 for more details.

  • Recall. In the trust context, it is the fraction of TT among the total of correct predictions. This measure penalizes user under-trust. See Equation 2 for more details.

These two metrics, as is already known from the classification context, are not enough on their own to determine whether a user really trusts a system. For example, the Precision will be 1, the maximum value, if the user trusts only one correct prediction and distrusts the rest of the samples (both correct and incorrect); however, in this hypothetical case, the Recall will have a very low value. Conversely, if the user trusts all predictions, whether correct or incorrect, the Recall will be perfect, while the Precision will be much lower. To avoid these extreme and misleading situations, we can use the F1-Score, in which Precision and Recall are combined using the harmonic mean. This metric penalizes disproportionate values between the two components, ensuring that a high score is only achieved when both Precision and Recall are simultaneously high. Consequently, the F1-Score provides a more reliable and stable indicator of the degree to which user trust aligns with the system's correct behavior. This metric can be seen in Equation 3.
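Written in terms of the proposed sample-based measures, the adapted metrics take the standard classification forms; the following is a reconstruction consistent with the definitions above (Equations 1-3):

    \text{Precision} = \frac{TT}{TT + TF} \tag{1}

    \text{Recall} = \frac{TT}{TT + UT} \tag{2}

    \text{F1-Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3}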

We calculated this metric, as well as Precision and Recall, using our novel sample-based measures. The well-known properties of these metrics therefore carry over: the results lie in the range [0, 1]; they are easily interpretable, with a value of 1 indicating a perfect result; they are monotonic with respect to both user trust and model performance; and they are ill-defined when the denominator is 0. Additionally, our proposed measures (TT, UT, TF, UF) can be used in a much broader set of metrics, not being limited to the ones presented here. Therefore, our proposal takes into account both the performance of the model and the user's trust in the explanation, allowing the use of already known, tested, and defined metrics.
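As an illustration of these properties, the following sketch (Python; names are illustrative) computes the three adapted metrics from the four measures and returns None for the ill-defined cases mentioned above:

    def trust_metrics(tt, ut, tf, uf):
        """Precision, Recall, and F1-Score over the trust confusion matrix."""
        # uf is not needed by these three metrics, but it completes the matrix.
        precision = tt / (tt + tf) if (tt + tf) > 0 else None  # ill-defined if nothing is trusted
        recall = tt / (tt + ut) if (tt + ut) > 0 else None     # ill-defined if nothing is correct
        if precision is not None and recall is not None and (precision + recall) > 0:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = None
        return {"precision": precision, "recall": recall, "f1": f1}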

In the following section, we present a set of case studies. Their goal is to show the expressivity of our approach in completely different contexts, with different trust and performance levels.

3 Results

In the previous section, we introduced a novel trust measure that leverages performance data through the use of a confusion matrix. Our goal in this section is to demonstrate how this approach behaves across a range of diverse scenarios. Specifically, we define three distinct case studies, each characterized by varying levels of trust and performance. Importantly, our goal with these scenarios was not to obtain perfect performance or perfect trust, but to verify that our approach allows the detection of different behaviors.

First, we defined a set of hypothetical results covering extreme cases. Second, we applied our approach to real results obtained from a state-of-the-art machine learning study by Petrović et al. (2020). Finally, we tested our method on a machine learning model we trained, assessing the trust that medical experts placed in it. Due to the limited number of users (only two radiologists), we framed this last experiment as a pilot study. In all three studies, we compared our proposal to that of Lai and Tan (2019): the proportion of trusted samples.

3.1 Case study 1: hypothetical trust, hypothetical machine learning

In this first case study, we proposed three extreme scenarios. We referred to these scenarios as users because they represent extreme user behavior related to trust. These three extreme cases were the following:

  • Perfect system user. This first case depicts a user who trusts the correct predictions and does not trust the incorrect ones.

  • Overtrusting user. This case illustrates a user who consistently places unwavering trust in any prediction.

  • Never-trust user. This case depicts a user who never trusts the outcome of the AI model.

Once these different users were defined, we calculated the values of our proposed measures. For all three cases, we assumed different trust levels with the same hypothetical model performance: half of the 100 test samples were correctly classified. The existence of both correct and incorrect classifications allows us to identify the behavior of the proposed measures on completely different samples. To avoid numerical instability (as discussed in the previous section, these measures are ill-defined when the denominator is 0), we always set at least one TT.

The trust sample measures of these hypothetical users can be seen in Tables 2-4. Table 5 shows the results of higher-level metrics based on the previous measures. The fact that performance remains unchanged while our measure varies according to the trust value demonstrates the expressiveness of our approach in capturing the user's trust in the AI system.

Table 2

Trust/prediction | True prediction | False prediction
Yes | 50 | 0
No | 0 | 50

Trust confusion matrix for the perfect system user in the first case study.

Table 3

Trust/prediction | True prediction | False prediction
Yes | 50 | 50
No | 0 | 0

Trust confusion matrix for the overtrusting user in the first case study.

Table 4

Trust/prediction | True prediction | False prediction
Yes | 1 | 0
No | 49 | 50

Trust confusion matrix for the never-trust user in the first case study.

Table 5

Metric | Perfect system user | Overtrusting user | Never-trust user
Correct predictions | 50 | 50 | 50
Trusted samples | 50 | 100 | 1
Precision | 1 | 0.5 | 1.00
Recall | 1 | 1.00 | 0.02
F1-score | 1 | 0.66 | 0.04
(Lai and Tan, 2019) | 0.5 | 1 | 0.01

Results across all hypothetical users for case study 1.
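As a consistency check, the values in Table 5 follow directly from the matrices in Tables 2-4. The sketch below (Python, illustrative) reproduces them up to rounding, with the Lai and Tan (2019) measure computed as the proportion of trusted samples:

    users = {
        "perfect system user": dict(tt=50, ut=0, tf=0, uf=50),
        "overtrusting user":   dict(tt=50, ut=0, tf=50, uf=0),
        "never-trust user":    dict(tt=1, ut=49, tf=0, uf=50),
    }
    for name, m in users.items():
        total = m["tt"] + m["ut"] + m["tf"] + m["uf"]
        precision = m["tt"] / (m["tt"] + m["tf"])
        recall = m["tt"] / (m["tt"] + m["ut"])
        f1 = 2 * precision * recall / (precision + recall)
        lai_tan = (m["tt"] + m["tf"]) / total  # proportion of trusted samples
        print(f"{name}: P={precision:.2f} R={recall:.2f} F1={f1:.2f} Lai&Tan={lai_tan:.2f}")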

These findings demonstrate the effectiveness of the proposed metrics in capturing trust levels across distinct contexts. For the first user, all metrics achieved perfect scores, indicating optimal behavior. For the second user, Recall attained a perfect score, while the remaining metrics produced significantly lower values, highlighting the sensitivity of the TF measure, and therefore of the Precision metric, to over-trusting behaviors. Finally, the third user is the reverse of the previous one, with a general lack of trust in the system; therefore, we see a perfect Precision and an almost-zero Recall. We can also see that the F1-Score is more sensitive to under-trust than to over-trust; nonetheless, both behaviors are captured by our proposed confusion matrix.

The results of our trust measure contrast with the approach proposed by Lai and Tan (2019), particularly in the evaluation of the first two users. Our method considers a lack of trust in incorrect predictions to be a desirable behavior, as demonstrated in the first scenario. In contrast, Lai and Tan (2019) penalize this behavior. For the over-trusting user, the opposite occurs: while our approach identifies this behavior as incorrect, Lai and Tan's method considers it valid. This highlights a key limitation that our measure aims to address: distrust in incorrect predictions should not be penalized, but rather recognized as appropriate and even desirable.

3.2 Case study 2: hypothetical trust, real machine learning

In this second case study, we used the proposed trust measures with a real machine learning model. Particularly, we used the results from (Petrović et al. 2020). These authors proposed a novel approach to select and train an AI model to identify and classify peripheral blood smear images of red blood cells, depending on their morphology. Specifically, their classification problem includes three categories: elongated, circular, and others. Figure 1 depicts examples from each class. While the authors compared various machine learning algorithms, we selected their best-performing method, Gradient Boosting, to evaluate our trust measure. The performance of this model can be seen in the confusion matrix depicted in Table 6.

Figure 1. Examples of peripheral blood smear images from each of the three classes.

Table 6

 | Circular | Elongated | Other
Circular | 488 | 6 | 5
Elongated | 8 | 194 | 8
Other | 20 | 5 | 75

Confusion matrix of the Gradient Boosting method proposed by (Petrović et al. 2020).

We recalculated and repeated the three hypothetical trust levels using the results obtained from this model. This approach allowed us to analyze different models under the same user trust assumptions, enabling us to assess whether our trust measures are also sensitive to model performance. For simplicity, we did not distinguish between classes: a correct prediction was considered valid regardless of the predicted class. However, our approach can easily be extended to enable class-specific analyses.
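For reference, the binary correct/incorrect totals used in this case study follow directly from Table 6, since the diagonal of the confusion matrix holds the correctly classified samples (a small illustrative sketch in Python):

    table6 = [
        [488, 6, 5],   # circular
        [8, 194, 8],   # elongated
        [20, 5, 75],   # other
    ]
    correct = sum(table6[i][i] for i in range(3))  # 757 correct predictions
    total = sum(sum(row) for row in table6)        # 809 samples in total
    incorrect = total - correct                    # 52 incorrect predictions
    print(correct, incorrect, total)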

The trust sample measures of the same hypothetical users can be seen in Tables 7-9. Table 10 shows the results of higher-level metrics based on the previous measures. From these tables, we can see that the sole difference between the first case study and this second one is the over-trusting user. This case demonstrates that using an AI model with high performance values can hide poor trust results; however, the confusion matrix can be utilized to detect this unwanted behavior. Comparing our approach to the one from Lai and Tan (2019), we can see, once again, that according to their approach the over-trusting user obtains a perfect result, while clearly the trust behavior is not the desired one.

Table 7

Trust/prediction | True prediction | False prediction
Yes | 757 | 0
No | 0 | 52

Trust confusion matrix for the perfect system user in the second case study.

Table 8

Trust/prediction | True prediction | False prediction
Yes | 757 | 52
No | 0 | 0

Trust confusion matrix for the overtrusting user in the second case study.

Table 9

Trust/prediction | True prediction | False prediction
Yes | 1 | 0
No | 756 | 52

Trust confusion matrix for the never-trust user in the second case study.

Table 10

Metric | Perfect system user | Overtrusting user | Never-trust user
Correct predictions | 757 | 757 | 757
Trusted samples | 757 | 809 | 1
Precision | 1 | 0.94 | 1
Recall | 1 | 1 | 0
F1-Score | 1 | 0.97 | 0
(Lai and Tan, 2019) | 0.94 | 1 | 0

Results across all hypothetical users, computed from the results obtained by Petrović et al. (2020).

These hypothetical results showed the ability of our proposal to identify different trust behaviors. Additionally, the goal of this section was to define a set of known behaviors, allowing us to compare future results with a baseline. We therefore used the knowledge obtained in this section to analyze the results of a real case study. In the following section, we test our measures with a real AI model.

3.3 Case study 3: real trust, real machine learning

In this third study, we tested the proposed trust measures in a real scenario: using a real model with expert users. It is important to mention that the goal of this study, similarly to the previous ones, is to test the utility of our proposed measures, not to develop an XAI model. The limited number of real users, while not allowing us to fully conclude the overall human trust in the tested AI, allowed us to identify the benefits and limitations of our trust measure, working as a pilot study for our proposal.

We investigated how our proposed measures can assess the trust of medical doctors in a real XAI approach for detecting pneumonia caused by COVID-19 from X-ray images. From X-ray images, doctors can evaluate whether there is pulmonary involvement and its extent. In this case, the only lung disease present in the dataset was COVID-19, which allowed us to classify any lung involvement as this specific disease. In this section, we first introduce the main elements of the pipeline (dataset, AI model, XAI method, and data collection strategy) and then present the pilot study results.

The image dataset used in this investigation was provided by the University Hospital Son Espases (HUSE), situated in Palma, Spain. In total, 2040 chest X-ray images from patients with and without COVID-19 pneumonia were analyzed; Figure 2 shows samples from this dataset. This dataset and experimentation were authorized by the Research Commission of HUSE (Ref: 3959). As the AI model, we used a ResNet18 (He et al., 2016), a well-known DL model for image classification. Arias-Duart et al. (2022) proposed an objective benchmark for post-hoc XAI methods and identified GradCAM (Selvaraju et al., 2017) as "consistently reliable" in contrast with other widely used XAI techniques; we used this method based on their results. The performance of the trained model can be seen in Table 11. We consider the training details to be outside the scope of this article, which is centered on the measurement of trust. Nonetheless, to allow for better reproducibility, we uploaded the model's weights to a public repository1.

Figure 2. Sample chest X-ray images from the dataset.

Table 11

TP | FP | FN | TN | Prec. | Recall | F1-Sc. | Acc.
46 | 127 | 11 | 490 | 0.266 | 0.807 | 0.4 | 0.8

Metrics obtained with the AI model used in our experimentation.

The resulting explanation from GradCAM (Selvaraju et al., 2017) is a saliency map. While saliency maps are widely used, this visualization can be hard to interpret for a non-expert AI user. To simplify them, we highlighted the most important parts of the image using a set of four different thresholds (0.9, 0.75, 0.50, 0.25), depicting only the pixels with an importance higher than the threshold. An example of the resulting visualization can be seen in Figure 3. We consider the explanations resulting from the same saliency map but with different threshold values as completely independent samples.
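A minimal sketch of this thresholding step (Python/NumPy; the function name is illustrative), assuming the saliency map is normalized to [0, 1]:

    import numpy as np

    def threshold_explanations(saliency, thresholds=(0.9, 0.75, 0.50, 0.25)):
        """Return one masked saliency map per threshold; each map is treated as an independent sample."""
        return {t: np.where(saliency >= t, saliency, 0.0) for t in thresholds}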

Figure 3. Example of the thresholded saliency-map visualizations.

To evaluate trust in the system, we recruited two senior radiologists. Data was collected via an interactive Graphical User Interface (GUI) displaying the model's predictions, explanations, and corresponding X-ray images (Figure 4). The ground truth (GT) of the prediction is never shown to the user. The GUI design was refined based on radiologists' feedback; detailed user instructions are provided in Appendix B. The radiologists examined a total of 120 chest X-rays. A subset of 40 images was reviewed by both participants to allow for comparison, while the remaining 80 were divided between them. The order of the images was randomized. Participants were asked to indicate their agreement with the provided predictions and explanations. The users had only two options: agreeing or not agreeing with the combination of prediction and explanation.

Figure 4. Graphical User Interface used for data collection.

The trust sample measures of the users can be seen in Tables 12, 13. Table 14 shows the results of higher-level metrics based on the previous measures, disaggregated by user. Notably, there is a significant divergence between the two users, who yielded F1-Scores of 0.17 and 0.03, respectively. However, both values are sufficiently low to indicate that neither user trusted the explanations. We attribute these low scores to the 'trust trajectory', whereby incorrect system predictions negatively influence subsequent trust measurements, even for correctly predicted instances. These findings align with results from our previous case studies, bearing a strong resemblance to the never-trust user profile defined earlier. Furthermore, the subset of shared images, shown in Tables 15, 16, revealed a lack of consensus between the two users. Despite having similar backgrounds and viewing the same images, their differing responses underscore the inherently user-dependent nature of trust measurement.

Table 12

Trust/prediction | True prediction | False prediction
Yes | 7 | 2
No | 57 | 14

User 1 trust confusion matrix in the third case study.

Table 13

Trust/prediction | True prediction | False prediction
Yes | 1 | 1
No | 63 | 15

User 2 trust confusion matrix in the third case study.

Table 14

Metric | User 1 (All imgs) | User 2 (All imgs) | User 1 (Shared imgs) | User 2 (Shared imgs)
Correct predictions | 64 | 64 | 40 | 40
Trusted samples | 21 | 14 | 6 | 0
Precision | 0.33 | 0.06 | 0.67 | 0
Recall | 0.11 | 0.02 | 0.13 | 0
F1-score | 0.17 | 0.03 | 0.19 | 0
(Lai and Tan, 2019) | 0.26 | 0.20 | 0.15 | 0

Results obtained from Users 1 and 2 with all images and only with the images that both users have measured.

Table 15

Trust/prediction | True prediction | False prediction
Yes | 4 | 2
No | 28 | 6

User 1 (only with the shared images with the second user) trust confusion matrix in the third case study.

Table 16

Trust/prediction | True prediction | False prediction
Yes | 0 | 0
No | 32 | 8

User 2 (only with the shared images with the first user) trust confusion matrix in the third case study.

To capture these user differences, we calculated Cohen's Kappa (κ), a well-known inter-rater reliability statistic. The analysis yielded a κ value of 0, with a 95% confidence interval of [−0.103, 0.103], detailed in the contingency table (see Table 17).

Table 17

 | User 1: trust | User 1: no trust
User 2: trust | 0 | 0
User 2: no trust | 6 | 34

Contingency table for the shared set of 40 images.
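The reported value can be reproduced directly from Table 17 with the standard two-rater kappa computation (a sketch in Python; the confidence interval requires an additional variance estimate not shown here):

    # Rows: User 2 (trust / no trust); columns: User 1 (trust / no trust).
    table17 = [[0, 0],
               [6, 34]]
    n = sum(sum(row) for row in table17)                      # 40 shared images
    p_o = (table17[0][0] + table17[1][1]) / n                 # observed agreement: 0.85
    u1_trust = table17[0][0] + table17[1][0]                  # User 1 "trust" marginal: 6
    u2_trust = table17[0][0] + table17[0][1]                  # User 2 "trust" marginal: 0
    p_e = (u1_trust * u2_trust + (n - u1_trust) * (n - u2_trust)) / n**2  # chance agreement: 0.85
    kappa = (p_o - p_e) / (1 - p_e)                           # 0.0
    print(p_o, p_e, kappa)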

While a kappa of 0 typically suggests agreement no better than chance, interpreting this as a lack of consensus is misleading in this context. The observed raw agreement between the users was actually high, at 85% (34/40 cases). Instead, the low kappa score is heavily influenced by prevalence: as shown in the contingency table, User 2 is highly skewed toward the "no trust" category, categorizing all 40 cases as "no trust", whereas User 1 categorized 6 cases as "trust". Because of this extreme prevalence and bias, the kappa statistic must be interpreted cautiously: the skewed distribution of responses artificially depresses it and obscures the high raw consensus.

By analyzing the results obtained by (Lai and Tan 2019) and comparing them with our proposal, we observe that both approaches yield relatively low levels of trust. However, their method reports slightly higher trust levels than ours. This difference arises primarily because their approach penalizes instances where the user did not trust incorrect predictions.

In Figure 5, we can see the results obtained for all three metrics with different threshold values. From these three plots, we can see the difference between the two users, with an overall higher trust of the first user than the second. We can also see that there was a decrease in the trust when more pixels, including less important ones, were shown to the users (see Figure 3 for examples of different visualizations).

Figure 5. Results of the three metrics for the different threshold values.

To understand the reason behind this lack of trust and, therefore, verify our proposal, we created a questionnaire to be answered by each user. We asked three distinct questions, each with a different goal, in order to determine the source of the lack of trust in the system. Each question had an attached image to be analyzed by the users. Both the questions and the answers can be seen in Appendix A.

The findings of the questionnaire indicate that the radiologists did not trust the system, highlighting the fragile nature of user trust in AI. The questionnaire responses and conclusions were consistent with the metric results. Both users showed a lack of trust in either the prediction or the explanation, demonstrating the capacity of the proposed measures to assess user trust in an XAI system.

This pilot study depicts both the benefits and the limitations of our proposal. We objectively identified a lack of trust of both users in the system and, moreover, quantified it. We did so with a behavioral approach to trust, as discussed by Scharowski et al. (2022), allowing an objective measurement and avoiding attitudinal (subjective) measures of trust such as questionnaires. Our proposal improves on previous work, mainly that of Lai and Tan (2019), maintaining its simplicity of analysis while adding an improved capacity to identify possible problems and enabling a more granular analysis of the results, thanks to the addition of performance information. We verified this increased granularity with a further analysis of the collected data in the form of a posterior questionnaire.

4 Discussion

In this study, we proposed a novel behavioral measure of user trust in an XAI system. Our proposal combines information about the performance of the predictive model (an objective feature) and the user's confidence in the explanation (a subjective feature), allowing a straightforward interpretation of the results and aiming to quantify the effect of system performance on the trust placed in it. In particular, we proposed to combine classification measures (True Positives, False Positives, False Negatives, and True Negatives), obtained from the objective comparison between the ground truth and the prediction, with the user's choice to trust or not trust an explanation. The result of this combination is the following set of measures: Trust True (TT), Untrust True (UT), Trust False (TF), and Untrust False (UF).

These four measures are grounded in well-known concepts from social science, such as the trust trajectory and algorithm aversion. Additionally, from these four measures, we can use any of the existing high-level metrics to obtain an easy-to-understand result. The key advantage of our solution is the ease with which the data may be analyzed. In contrast to other existing trust measures, which are based on questionnaires with multi-item scales, we used non-subjective results, allowing a simpler and more straightforward interpretation. This objective evaluation is a substantial improvement over the existing literature, as discussed by Miller (2019): until now, most evaluations in the XAI context have depended on researchers' intuitions of what constitutes a "good" explanation. Furthermore, the increase in granularity, in comparison to other objective proposals to measure trust (Lai and Tan, 2019), allowed us to identify the reason behind the lack of trust.

We defined three different studies to test our proposed measures. The first is defined by both hypothetical performance and hypothetical trust results. This first case study allowed us to check the sensitivity of our proposal in three extreme cases: the perfect system user, the overtrusting user, and the never-trust user. Second, we tested the same hypothetical trust values with a real machine learning model proposed by Petrović et al. (2020). Similarly to the first one, this case study showed the limitation of the previous work (Lai and Tan, 2019) and the increased ability of our proposal to detect different trust behaviors; in particular, our sample-based measures allowed us to clearly identify, in these two studies, both under-trusting and over-trusting user behaviors. Additionally, this allowed us to verify that our measures are sufficiently sensitive to compare the performance of different models under the same trust levels and to identify distinct behavioral patterns. This suggests a potential future application of our measure: comparing different machine learning models with similar users. However, the aggregation metrics employed in this study, derived from conventional classification frameworks, show limited sensitivity in scenarios characterized by pronounced over-trust. This limitation highlights a clear direction for future research: the development of novel, high-level trust metrics capable of more faithfully capturing over-trust effects. Finally, we defined a pilot study with real performance and trust values. This pilot study is based on an XAI pipeline to detect COVID-19 in chest X-ray images. We tested the trust of two radiologists in this pipeline via a Graphical User Interface. The results of the metrics indicated an overall lack of trust in the system.

Statements

Data availability statement

The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: https://doi.org/10.5281/zenodo.12623862.

Ethics statement

The studies involving humans were approved by Research Commission from HUSE (Hospital Universitari Son Espases) (Ref: 3959). The studies were conducted in accordance with the local legislation and institutional requirements. The participants provided their written informed consent to participate in this study. Written informed consent was obtained from the individual(s) for the publication of any potentially identifiable images or data included in this article.

Author contributions

MM-N: Software, Writing – original draft, Formal analysis, Visualization. GM-A: Software, Writing – review & editing, Resources, Formal analysis, Methodology, Visualization, Supervision, Conceptualization. AJ-i-C: Funding acquisition, Supervision, Writing – review & editing, Formal analysis, Project administration, Resources, Methodology, Conceptualization. MG-H: Investigation, Supervision, Methodology, Writing – review & editing, Formal analysis, Project administration. AG: Validation, Project administration, Writing – review & editing, Supervision. MS: Validation, Data curation, Visualization, Formal analysis, Writing – review & editing. JP: Data curation, Visualization, Validation, Writing – review & editing, Formal analysis.

Funding

The author(s) declared that financial support was received for this work and/or its publication. This study is part of the Project PID2023-149079OB-I00 funded by MICIU/AEI/10.13039/501100011033 and by ERDF/EU. The work of Maria Gemma Sempere and Manuel González-Hidalgo was partially supported by the R+D+i Project PID2020-113870GB-I00-"Desarrollo de herramientas de Soft Computing para la Ayuda al Diagnóstico Clínico y a la Gestión de Emergencias (HESOCODICE)" funded by MICIU/AEI/10.13039/501100011033/.

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fcomp.2026.1746591/full#supplementary-material

References

  • Adadi, A., and Berrada, M. (2018). Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access 6, 52138-52160. doi: 10.1109/ACCESS.2018.2870052

  • Amengual-Alcover, E., Jaume-i-Capó, A., Miró-Nicolau, M., Moyà-Alcover, G., and Paniza-Fullana, A. (2025). "Towards an evaluation framework for explainable artificial intelligence systems for health and well-being," in Proceedings of the 20th International Conference on Evaluation of Novel Approaches to Software Engineering - ENASE (SciTePress), 530-540. doi: 10.5220/0013289600003928

  • Anjomshoae, S., Najjar, A., Calvaresi, D., and Främling, K. (2019). "Explainable agents and robots: results from a systematic literature review," in 18th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2019) (Montreal, QC: International Foundation for Autonomous Agents and Multiagent Systems), 1078-1088. doi: 10.65109/KCZB5817

  • Arias-Duart, A., Parés, F., Garcia-Gasulla, D., and Giménez-Ábalos, V. (2022). "Focus! Rating XAI methods and finding biases," in 2022 IEEE International Conference on Fuzzy Systems (FUZZ) (Padua: IEEE), 1-8. doi: 10.1109/FUZZ-IEEE55066.2022.9882821

  • Arrieta, A. B., Díaz-Rodríguez, N., Del Ser, J., Bennetot, A., Tabik, S., Barbado, A., et al. (2020). Explainable artificial intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible AI. Inform. Fus. 58, 82-115. doi: 10.1016/j.inffus.2019.12.012

  • Bodria, F., Giannotti, F., Guidotti, R., Naretto, F., Pedreschi, D., and Rinzivillo, S. (2023). Benchmarking and survey of explanation methods for black box models. Data Min. Knowl. Discov. 37, 1719-1778. doi: 10.1007/s10618-023-00933-9

  • Buçinca, Z., Lin, P., Gajos, K. Z., and Glassman, E. L. (2020). "Proxy tasks and subjective measures can be misleading in evaluating explainable AI systems," in Proceedings of the 25th International Conference on Intelligent User Interfaces, IUI '20 (New York, NY: Association for Computing Machinery), 454-464. doi: 10.1145/3377325.3377498

  • Cahour, B., and Forzy, J.-F. (2009). Does projection into use improve trust and exploration? An example with a cruise control system. Saf. Sci. 47, 1260-1270. doi: 10.1016/j.ssci.2009.03.015

  • Chaddad, A., Peng, J., Xu, J., and Bouridane, A. (2023). Survey of explainable AI techniques in healthcare. Sensors 23:634. doi: 10.3390/s23020634

  • de Visser, E. J., Krueger, F., McKnight, P., Scheid, S., Smith, M., Chalk, S., et al. (2012). "The world is not enough: trust in cognitive agents," in Proceedings of the Human Factors and Ergonomics Society Annual Meeting, Vol. 56 (Los Angeles, CA: Sage Publications), 263-267. doi: 10.1177/1071181312561062

  • Dietvorst, B. J., Simmons, J. P., and Massey, C. (2015). Algorithm aversion: people erroneously avoid algorithms after seeing them err. J. Exp. Psychol. Gen. 144:114. doi: 10.1037/xge0000033

  • Doshi-Velez, F., and Kim, B. (2017). Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608. doi: 10.48550/arXiv.1702.08608

  • Došilović, F. K., Brčić, M., and Hlupić, N. (2018). "Explainable artificial intelligence: a survey," in 2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO) (Opatija: IEEE), 210-215. doi: 10.23919/MIPRO.2018.8400040

  • Eitel, F., Ritter, K., and the Alzheimer's Disease Neuroimaging Initiative (ADNI) (2019). "Testing the robustness of attribution methods for convolutional neural networks in MRI-based Alzheimer's disease classification," in Interpretability of Machine Intelligence in Medical Image Computing and Multimodal Learning for Clinical Decision Support: Second International Workshop, iMIMIC 2019, and 9th International Workshop, ML-CDS 2019, Held in Conjunction with MICCAI 2019, Shenzhen, China, October 17, 2019, Proceedings 9 (Shenzhen: Springer), 3-11. doi: 10.1007/978-3-030-33850-3_1

  • Glikson, E., and Woolley, A. W. (2020). Human trust in artificial intelligence: review of empirical research. Acad. Manag. Ann. 14, 627-660. doi: 10.5465/annals.2018.0057

  • He, K., Zhang, X., Ren, S., and Sun, J. (2016). "Deep residual learning for image recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Las Vegas, NV), 770-778. doi: 10.1109/CVPR.2016.90

  • Hoff, K. A., and Bashir, M. (2015). Trust in automation: integrating empirical evidence on factors that influence trust. Hum. Factors 57, 407-434. doi: 10.1177/0018720814547570

  • Hoffman, R. R., Mueller, S. T., Klein, G., and Litman, J. (2018). Metrics for explainable AI: challenges and prospects. arXiv preprint arXiv:1812.04608. doi: 10.48550/arXiv.1812.04608

  • Jian, J.-Y., Bisantz, A. M., and Drury, C. G. (2000). Foundations for an empirically determined scale of trust in automated systems. Int. J. Cogn. Ergon. 4, 53-71. doi: 10.1207/S15327566IJCE0401_04

  • Kaufman, R. A., Lee, E., Bedmutha, M. S., Kirsh, D., and Weibel, N. (2025). "Predicting trust in autonomous vehicles: modeling young adult psychosocial traits, risk-benefit attitudes, and driving factors with machine learning," in Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (Yokohama: Association for Computing Machinery), 1-24. doi: 10.1145/3706598.3713188

  • Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Adv. Neural Inform. Process. Syst. 25.

  • Lai, V., and Tan, C. (2019). "On human predictions with explanations and predictions of machine learning models: a case study on deception detection," in Proceedings of the Conference on Fairness, Accountability, and Transparency (Atlanta, GA: Association for Computing Machinery), 29-38. doi: 10.1145/3287560.3287590

  • Lee, J. D., and See, K. A. (2004). Trust in automation: designing for appropriate reliance. Hum. Factors 46, 50-80. doi: 10.1518/hfes.46.1.50.30392

  • Manzey, D., Reichenbach, J., and Onnasch, L. (2012). Human performance consequences of automated decision aids: the impact of degree of automation and system experience. J. Cogn. Eng. Decis. Mak. 6, 57-87. doi: 10.1177/1555343411433844

  • Mayer, R. C., Davis, J. H., and Schoorman, F. D. (1995). An integrative model of organizational trust. Acad. Manag. Rev. 20, 709-734. doi: 10.2307/258792

  • Merritt, S. M. (2011). Affective processes in human-automation interactions. Hum. Factors 53, 356-370. doi: 10.1177/0018720811411912

  • Miller, T. (2019). Explanation in artificial intelligence: insights from the social sciences. Artif. Intell. 267, 1-38. doi: 10.1016/j.artint.2018.07.007

  • Minh, D., Wang, H. X., Li, Y. F., and Nguyen, T. N. (2022). Explainable artificial intelligence: a comprehensive review. Artif. Intell. Rev. 55, 1-66. doi: 10.1007/s10462-021-10088-y

  • Miró-Nicolau, M., Moyà-Alcover, G., and Jaume-i-Capó, A. (2022). Evaluating explainable artificial intelligence for X-ray image analysis. Appl. Sci. 12:4459. doi: 10.3390/app12094459

  • Mohseni, S., Zarei, N., and Ragan, E. D. (2021). A multidisciplinary survey and framework for design and evaluation of explainable AI systems. ACM Trans. Interact. Intell. Syst. 11, 1-45. doi: 10.1145/3387166

  • Muir, B. M., and Moray, N. (1996). Trust in automation. Part II. Experimental studies of trust and human intervention in a process control simulation. Ergonomics 39, 429-460. doi: 10.1080/00140139608964474

  • Murdoch, W. J., Singh, C., Kumbier, K., Abbasi-Asl, R., and Yu, B. (2019). Definitions, methods, and applications in interpretable machine learning. Proc. Natl. Acad. Sci. 116, 22071-22080. doi: 10.1073/pnas.1900654116

  • Nauta, M., Trienes, J., Pathak, S., Nguyen, E., Peters, M., Schmitt, Y., et al. (2023). From anecdotal evidence to quantitative evaluation methods: a systematic review on evaluating explainable AI. ACM Comput. Surv. 55, 1-42. doi: 10.1145/3583558

  • Perrig, S. A. C., Scharowski, N., and Brühlmann, F. (2023). "Trust issues with trust scales: examining the psychometric quality of trust measures in the context of AI," in Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems (Hamburg: Association for Computing Machinery), 1-7. doi: 10.1145/3544549.3585808

  • Petrović, N., Moyà-Alcover, G., Jaume-i-Capó, A., and González-Hidalgo, M. (2020). Sickle-cell disease diagnosis support selecting the most appropriate machine learning method: towards a general and interpretable approach for cell morphology analysis from microscopy images. Comput. Biol. Med. 126:104027. doi: 10.1016/j.compbiomed.2020.104027

  • Ressi, D., Romanello, R., Piazza, C., and Rossi, S. (2024). AI-enhanced blockchain technology: a review of advancements and opportunities. J. Netw. Comput. Applic. 225:103858. doi: 10.1016/j.jnca.2024.103858

  • Scharowski, N., Perrig, S. A., von Felten, N., and Brühlmann, F. (2022). Trust and reliance in XAI - distinguishing between attitudinal and behavioral measures. arXiv preprint arXiv:2203.12318. doi: 10.48550/arXiv.2203.12318

  • Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. (2017). "Grad-CAM: visual explanations from deep networks via gradient-based localization," in 2017 IEEE International Conference on Computer Vision (ICCV) (Venice), 618-626. doi: 10.1109/ICCV.2017.74

  • Ueno, T., Sawa, Y., Kim, Y., Urakami, J., Oura, H., and Seaborn, K. (2022). "Trust in human-AI interaction: scoping out models, measures, and methods," in CHI Conference on Human Factors in Computing Systems Extended Abstracts (New Orleans, LA: Association for Computing Machinery), 1-7. doi: 10.1145/3491101.3519772

  • Van der Velden, B. H., Kuijf, H. J., Gilhuijs, K. G., and Viergever, M. A. (2022). Explainable artificial intelligence (XAI) in deep learning-based medical image analysis. Med. Image Anal. 79:102470. doi: 10.1016/j.media.2022.102470

  • Vilone, G., and Longo, L. (2021). Notions of explainability and evaluation approaches for explainable artificial intelligence. Inform. Fus. 76, 89-106. doi: 10.1016/j.inffus.2021.05.009

  • Wang, L., Jamieson, G. A., and Hollands, J. G. (2009). Trust and reliance on an automated combat identification system. Hum. Factors 51, 281-291. doi: 10.1177/0018720809338842

  • Wischnewski, M., Krämer, N., and Müller, E. (2023). "Measuring and understanding trust calibrations for automated systems: a survey of the state-of-the-art and future directions," in Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (Hamburg: Association for Computing Machinery), 1-16. doi: 10.1145/3544548.3581197

  • Yu, K., Berkovsky, S., Taib, R., Conway, D., Zhou, J., and Chen, F. (2017). "User trust dynamics: an investigation driven by differences in system performance," in Proceedings of the 22nd International Conference on Intelligent User Interfaces, IUI '17 (Limassol: Association for Computing Machinery), 307-317. doi: 10.1145/3025171.3025219

Keywords

human-centered evaluation, measure, medical image, trust, XAI

Citation

Miró-Nicolau M, Moyà-Alcover G, Jaume-i-Capó A, González-Hidalgo M, Ghazel A, Sempere Campello MG and Palmer Sancho JA (2026) Toward a novel measure of user trust in XAI systems. Front. Comput. Sci. 8:1746591. doi: 10.3389/fcomp.2026.1746591

Received

14 November 2025

Revised

03 March 2026

Accepted

23 March 2026

Published

28 April 2026

Volume

8 - 2026

Edited by

Giuseppe Perelli, Sapienza University of Rome, Italy

Reviewed by

Sabina Rossi, Ca' Foscari University of Venice, Italy

Carlos Bustamante Orellana, Arizona State University, United States

Copyright

*Correspondence: Miquel Miró-Nicolau,
