Empathy: Assessment Instruments and Psychometric Quality – A Systematic Literature Review With a Meta-Analysis of the Past Ten Years

Objective: To verify the psychometric qualities and adequacy of the instruments available in the literature from 2009 to 2019 to assess empathy in the general population. Methods: The following databases were searched: PubMed, PsycINFO, Web of Science, SciELO, and LILACS using the keywords “empathy” AND “valid*” OR “reliability” OR “psychometr*.” A qualitative synthesis was performed with the findings, and meta-analytic measures were used for reliability and convergent validity. Results: Fifty studies were assessed, comprising 23 assessment instruments. Of these studies, 13 proposed new instruments, 18 investigated the psychometric properties of previously developed instruments, and 19 reported cross-cultural adaptations. The Empathy Quotient, Interpersonal Reactivity Index, and Questionnaire of Cognitive and Affective Empathy were the instruments most frequently addressed. They presented good meta-analytic indicators of internal consistency [reliability generalization meta-analyses (Cronbach’s alpha): 0.61 to 0.86] but weak evidence of validity [weak structural validity; low to moderate convergent validity (0.27 to 0.45)]. Few studies analyzed standardization, prediction, or responsiveness for either the new or the older instruments. The new instruments proposed few innovations, and their psychometric properties did not improve on those of existing measures. In general, cross-cultural studies reported adequate adaptation processes and equivalent psychometric indicators, though studies addressing cultural invariance were lacking. Conclusion: Despite the diversity of instruments assessing empathy and the many associated psychometric studies, limitations remain, especially in terms of validity. Thus far, we cannot nominate a gold-standard instrument.


INTRODUCTION
In recent years, a consensus has grown among researchers that empathy is a multidimensional phenomenon, one that necessarily includes cognitive and emotional components (Davis, 2018). Reniers et al. (2011), for instance, consider that empathy comprises both an understanding of other people's experiences (cognitive empathy) and an ability to vicariously experience their emotional states (affective empathy).
Hong and Han (2020) recently conducted a systematic review to identify scales assessing empathy among health workers in general. Eleven studies were included in the review, among which were the Consultation and Relational Empathy measure, the Jefferson Scale of Physician Empathy, and the Therapist Empathy Scale (TES). These scales stood out in terms of psychometric quality; however, as in previous reviews, the conclusion was that no instrument had psychometric qualities desirable enough to be considered the gold standard. Additionally, none of the measures was specifically developed for professionals working with the elderly, which indicates an important gap in the field.
To our knowledge, no systematic reviews focus on instruments that measure empathic ability in the general population. Hence, this review aimed to describe the psychometric quality and adequacy of instruments available in the literature from 2009 to 2019 to assess empathy in the general population.

MATERIALS AND METHODS
This study complied with the recommendations proposed by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses – PRISMA (Moher et al., 2009) and the methodological guidelines established by the Brazilian Ministry of Health (BRASIL. Ministério da Saúde, 2014). The following databases were searched: PubMed, PsycINFO, Web of Science, SciELO, and LILACS using the keywords “empathy” AND “valid*” OR “reliability” OR “psychometr*”. Inclusion criteria were studies: (a) addressing individuals aged 18 years or older in the general population, of both sexes; (b) published between 2009 and 2019, regardless of the language; and (c) with the objective of developing and/or assessing the psychometric quality of instruments measuring empathic response in the general population. Figure 1 presents the exclusion criteria and the entire process used to select the studies.
Two mental health workers experienced in psychological and psychometric assessments (FFL, FLO) independently decided on the studies' eligibility; divergences were resolved by consensus. A standard form was developed to extract the following variables: (a) year of publication; (b) study's objective; (c) sample characteristics (i.e., country of origin, sample size, sex, age, and education); (d) instrument's characteristics (objective, number of items, application format, and scoring); and (e) psychometric indicators concerning validity and reliability.
The framework proposed by Andresen (2000) was used to assess the psychometric quality of the papers included in this review. It rates different criteria on a nominal scale ranging from A (strong adequacy) to C (weak or no adequacy), namely: Norms, Standard values; Measurement model; Item/instrument bias; Respondent burden; Administrative burden; Reliability; Validity; Responsiveness; Alternate/accessible forms, and Culture/language adaptations. This review's authors independently assessed the studies' psychometric quality and resolved divergences by achieving a consensus. The definitions of psychometric qualities and assessment criteria are presented in Supplementary Material 1.
FIGURE 1 | Flowchart describing the inclusion and exclusion criteria processes, based on the PRISMA protocol (*Bora and Baysan, 2009; Innamorati et al., 2015).

A qualitative synthesis of the results was performed for each instrument. Additionally, for those instruments with more than two studies, meta-analytic measures of reliability and convergent validity were produced using the jamovi software. We conducted reliability generalization meta-analyses of Cronbach's alpha (for the total scale and/or subscales) (Pentapati et al., 2020), and intraclass correlation coefficients (ICC) were pooled for the computation of test-retest meta-analytic measures (Macchiavelli et al., 2020). To pool data concerning convergent validity with empathy measures and correlated constructs, we used the Pearson/Spearman correlation coefficient (r) as the effect size measure (Duckworth and Kern, 2011). In the case of multiple indicators, the largest indicator in absolute value was chosen. Untransformed estimates and inverse-variance weighting were used (Hedges and Olkin, 1985). An average coefficient and a 95% confidence interval (95% CI) were calculated for each meta-analysis. Heterogeneity of the measures between studies was verified using the Q-statistic and the I² index. Funnel plots were used to assess publication bias (Egger et al., 1997).
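The pooling procedure described above can be sketched as follows. This is a minimal illustration, not the authors' actual script; it assumes fixed-effect inverse-variance weighting of untransformed correlations, with the large-sample variance approximation var(r) = (1 − r²)²/(n − 1), which the paper does not specify:

```python
import numpy as np

def pool_correlations(rs, ns):
    """Inverse-variance pooling of untransformed correlation coefficients,
    with Cochran's Q and the I^2 heterogeneity index, plus a 95% CI.

    rs: per-study correlation coefficients
    ns: per-study sample sizes
    """
    rs = np.asarray(rs, dtype=float)
    ns = np.asarray(ns, dtype=float)
    # Large-sample variance of an untransformed r (assumed estimator).
    variances = (1.0 - rs**2) ** 2 / (ns - 1.0)
    w = 1.0 / variances                        # inverse-variance weights
    r_bar = np.sum(w * rs) / np.sum(w)         # weighted average coefficient
    se = np.sqrt(1.0 / np.sum(w))              # standard error of pooled r
    ci = (r_bar - 1.96 * se, r_bar + 1.96 * se)
    q = np.sum(w * (rs - r_bar) ** 2)          # Cochran's Q statistic
    df = len(rs) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0  # I^2 in percent
    return r_bar, ci, q, i2
```

For instance, pooling three hypothetical coefficients (0.30, 0.45, 0.50 from samples of 100, 250, and 400) yields a weighted average between the smallest and largest inputs, its 95% CI, and Q and I² values that quantify between-study heterogeneity.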

RESULTS
Fifty studies were selected, and 23 different instruments were identified. The instruments most frequently addressed were the Empathy Quotient (EQ; n = 11), the Interpersonal Reactivity Index (IRI; n = 10), and the Questionnaire of Cognitive and Affective Empathy (QCAE; n = 5). Only one or two studies assessed each of the remaining instruments. A total of 60.9% of the instruments were developed in the period covered by this review (2009 to 2019); the remaining instruments were developed before 2009, and the corresponding studies assessed new aspects of their psychometric qualities and/or cross-cultural adaptations. Table 1 presents an overview of each instrument. Table 1 shows that most instruments are self-report scales (n = 21) rated on a Likert scale (70% use five-point scales), with the number of items ranging from one to 80 (median = 23). Three instruments present alternative versions with fewer items [EQ, IRI, and Empathy Assessment Index (EAI)]. In most cases, data were collected face-to-face (n = 12). The Active-Empathic Listening Scale (AELS) (Drollinger et al., 2006) was the only instrument with an other-report version, and two instruments consisted of computational tasks presenting photorealistic stimuli: the Multifaceted Empathy Test (MET) (Dziobek et al., 2008) and the Pictorial Empathy Test (PET) (Lindeman et al., 2018). Data concerning the samples used by the different studies are presented in Table 2.
As shown in Table 2, the smallest sample comprised 50 participants and the largest 5,724 (mean = 1,036.6 ± 1,577.5).

TABLE 1 | Overview of the instruments analyzed to assess general empathic capacity in the general population (ranked from most studied to least studied).
The instruments' psychometric properties were assessed according to the parameters proposed by Andresen (2000). The results are presented in Tables 3 and 4; Supplementary Material 2 presents raw data concerning these indicators based on reliability and validity criteria (construct and criterion).

Empathy Quotient
Eleven studies (22.0%) assessed the EQ's psychometric properties: six applied the instrument's complete version (60 items); three applied the 40-item version (with filler items removed); one applied the 22-item version; and one the 15-item version. The Respondent burden criterion was considered satisfactory in all the studies and received grade A; all the versions were brief and well accepted by the target population. Administrative burden also received grade A because the EQ is easy to apply, score, and interpret. None of the studies presented specific normative indicators, such as T scores or percentile distributions; only data concerning the mean score were reported (n = 8), which resulted in grade B.
Regarding the Measurement model criterion, only Kim and Lee (2010; EQ-40) and Kosonogov (2014; EQ-60) presented kurtosis and skewness indicators to show the normality of the data distribution. The remaining studies (81.8%) did not report analyses for this purpose, revealing a weakness regarding this psychometric indicator.
Seven studies conducted the EQ's cross-cultural adaptation into Korean, Portuguese (Portugal and Brazil), Russian, Turkish, and Chinese. Rodrigues et al. (2011), Gouveia et al. (2012), Zhang et al. (2018), and Zhao et al. (2018) obtained grade A in the Item/instrument bias criterion, as they adopted the recommended guidelines for face validity, namely: translation, back-translation, peer review, and a pretest applied in the target population (Beaton et al., 2000). The Korean and Turkish versions (Kim and Lee, 2010; Kose et al., 2018) received grade B because they did not report a pretest. The Russian version (Kosonogov, 2014) obtained grade C because it did not report its procedures. Two other studies (Preti et al., 2011; Senese et al., 2016) assessed the psychometric quality of the Italian version, using the version previously adapted by Baron-Cohen (2004), while Redondo and Herrero-Fernández (2018) addressed the Spanish version.
Regarding convergent validity, the studies used the IRI, Self-Assessed Empathizing, Questionnaire Measure of Emotional Empathy, and Quotient of Empathic Abilities as references and found weak to moderate correlations (most obtained grade B, with correlations between 0.30 and 0.60). The estimated average correlation coefficient was 0.44 (95% CI: 0.36-0.52; I² = 87.8%; Q = 77.398, p < 0.001; Egger's: p = 0.59). Other studies adopted instruments that assess correlated constructs such as alexithymia, social desirability, autism symptoms, and theory of mind (grade B predominated). The pooled correlation estimate was 0.38 (95% CI: 0.30-0.46; I² = 93.8%; Q = 194.799, p < 0.001; Egger's: p < 0.001) (see Supplementary Material 3.3).
Divergent validity was mainly verified using instruments assessing specific psychiatric symptoms such as hallucinations, delusions, and hypomania, as well as systemizing (an individual's ability to construct a system and analyze its variables, considering the underlying rules that guide the system's behavior) (N = 4). The values found in these studies ranged from −0.33 to 0.24 and obtained grade A.
Still in search of evidence of validity based on other variables, most studies (N = 9) assessed gender differences; women tended to score higher in empathy than men, especially in the emotional factor (grade A predominated). Only Gouveia et al. (2012) investigated the EQ's scores in relation to age and education. The authors verified that older age was accompanied by a decline in the EQ's emotional and social subscales, while more education was associated with more frequent expressions of empathy in the instrument's cognitive, emotional, and social subscales. Only one study tested and verified the instrument's invariance regarding gender (Senese et al., 2016). Predictive validity/responsiveness was not investigated, revealing a gap in the literature.
The exploratory factor analyses presented models with varying numbers of factors, which, however, did not explain a substantial percentage of the data variance (<47.4%) (Hair et al., 2009). The well-established models proposed by Lawrence et al. (2004; 3 factors: Cognitive Empathy, Emotional Reactivity, and Social Skills – 28 items) and Muncer and Ling (2006; 3 factors: Cognitive Empathy, Emotional Reactivity, and Social Skills – 15 items) were the ones most frequently tested in confirmatory analyses. The results signaled goodness-of-fit problems in most of the studies; only one-third were rated A in this regard. These two models' unidimensionality was also tested, with contradictory results, while the one-factor model for the 40- and 60-item versions was considered inadequate by the three studies assessing it.
Regarding the instrument's format, Wright and Skagerberg (2012), Senese et al. (2016), and Zhao et al. (2018) tested the online format, whose psychometric indicators were similar to those of the original pencil-and-paper format. However, the invariance between the versions was not objectively tested.
Note that Wright and Skagerberg (2012) tested an alternative version of the EQ-40, rewriting negative statements as positive ones to test the hypothesis that the original format was syntactically more complex and challenging. They verified that response time was shorter with the alternative format; however, the remaining psychometric findings did not match those of the original version, so the authors did not recommend its use.

Interpersonal Reactivity Index
Ten studies assessed the psychometric properties of the IRI's original and alternative versions (with 26, 16, and 15 items). The instrument was considered adequate in terms of Respondent burden and Administrative burden, due both to its brevity and to its ease of application and interpretation; grade A was obtained.
In terms of normative aspects, as previously observed with the EQ, 70% of the studies only presented data concerning the samples' mean scores and their respective standard deviations (grade B).
As for the Measurement Model criterion, only the studies by Koller and Lamm (2014) and Ingoglia et al. (2016) investigated it. The first study assessed floor and ceiling effects, while the latter reported kurtosis and skewness indicators to verify the normality of the data distribution. The remaining studies (80%) did not perform analyses for this purpose, so there is a lack of item-level analyses.
Half of the studies addressing the IRI presented its cross-cultural adaptation into different languages (Spanish, Portuguese from Brazil, French, and Russian), in general presenting adequate methodology to assess face validity.
Reliability was verified through internal consistency and temporal stability. Moderate meta-analytic measures of internal consistency were found for each subscale (Empathic Concern, Fantasy, Perspective Taking, and Personal Distress).
Regarding validity, most studies focused on analyzing the scale's factorial structure, with various models tested using exploratory (N = 5) and confirmatory (N = 8) factor analyses. Davis's (1983) original model (4 factors: Empathic Concern, Fantasy, Perspective Taking, and Personal Distress) was the most frequently tested, though results were controversial and in general unsatisfactory. Four-factor alternative models were also investigated, with slightly superior results [e.g., Braun et al. (2015) – 15 items and Formiga et al. (2015) – 26 items]. Unidimensional and bidimensional models (N = 3) were assessed and also presented controversial results.
Convergent validity was assessed against three other instruments measuring general empathy and against instruments measuring correlated constructs such as positive and negative affect, self-esteem, anxiety, aggression, social desirability, social avoidance, emotional fragility, emotional intelligence, gender roles, and sense of identity. The correlations with correlated constructs tended to be higher. In most cases, validity based on other variables was assessed in terms of gender. However, Gilet et al. (2013) also investigated age differences, reporting that younger individuals tended to be more empathic than older individuals, especially in the Fantasy and Personal Distress subscales. The studies addressing the IRI did not investigate predictive validity or responsiveness. The only alternative to the instrument's original pencil-and-paper format was a computer version addressed by Chrysikou and Thompson (2016), who compared the equivalence between the two (not invariance).

Questionnaire of Cognitive and Affective Empathy
The studies addressing the QCAE involved its original proposition (Reniers et al., 2011) and well-conducted cross-cultural adaptations into French, Portuguese (Portugal), and Chinese, except that they did not use a pilot study to check for face validity (grade B predominated).
The QCAE was considered a brief instrument; its application takes no more than 15 min. Scoring is manual and easy to interpret, so the instrument rated highest in terms of Respondent burden and Administrative burden. In general, these studies involved basic constructs of classical psychometric theory, and no data concerning standardization or item analysis were reported.
Test-retest reliability was satisfactory (r = 0.76), though it was reported only by Liang et al. (2019).
Regarding validity, studies investigating the scale's factorial structure predominated (n = 5). The original study reported a five-factor structure (Perspective Taking, Emotion Contagion, Online Simulation, Peripheral Responsivity, and Proximal Responsivity) with few goodness-of-fit problems (grade A). Later, most studies (75%) confirmed this structure and reported adequate indexes (grade A). Some alternative models were investigated, and findings indicated acceptable goodness of fit for the QCAE's first- and second-order four-factor structures (Liang et al., 2019), though the instrument's unidimensionality was not confirmed.
Regarding convergent validity with other empathy measures (n = 2 studies: Basic Empathy Scale and IRI), correlations ranged from 0.27 to 0.76 and obtained grade A (according to the criterion established, only one correlation above 0.60 was necessary to obtain the highest grade). Instruments were also used to assess correlated constructs (e.g., aggressiveness, alexithymia, impulsivity, interpersonal competence, psychopathy, and social anhedonia, among others). The coefficients in these studies were moderate, and most were graded B. The estimated average correlation coefficient was 0.27 (95% CI: 0.20-0.35; I² = 92.12%; Q = 183.846, p < 0.001; Egger's: p = 0.04) – see Supplementary Material 5.2.
Studies addressing known-groups analyses (gender) predominated (N = 3), reinforcing previous findings that women have greater empathic ability than men. None of the studies addressing the QCAE investigated predictive validity or responsiveness.
One of the studies (Di Girolamo et al., 2017) assessed the equivalence between the pencil-and-paper and online formats; both presented similar psychometric indexes and measurement invariance. Invariance was also verified for sex (Myszkowski et al., 2017; Liang et al., 2019).

Active-Empathic Listening Scale
The AELS was proposed in 2006 (Drollinger et al., 2006) for the specific context of the relationship established between seller and customer, but Bodie (2011) proposed its expanded use for interpersonal relationships in general. Even though Bodie (2011) reported that the original items had to be changed and adapted, no information concerning item analysis was provided, so the Item/instrument bias criterion obtained grade C.
Later, Gearhart and Bodie (2011) expanded the psychometric studies of the adapted version, which presented well-assessed Respondent burden and Administrative burden (grade A). Only Bodie (2011) assessed internal consistency; the coefficients for the instrument as a whole (>0.86) were considered excellent (grade A).
From the perspective of factorial structure, the three-factor model (Sensing, Processing, and Responding) was considered appropriate, specifically for the self-report version (grade A) (Bodie, 2011), and was later confirmed by Gearhart and Bodie (2011).
The convergent validity indexes concerning the self-report version indicated that correlations with correlated constructs (conversational adequacy, interaction implications, social skills; −0.16 to 0.67 – grade A) were more robust than those with the general empathy construct (0.15 to 0.44), which were considered moderate (grade B). Only correlations with correlated constructs (conversational adequacy, conversational effectiveness, nonverbal immediacy) were investigated for the other-report version, ranging from 0.15 to 0.75, and were considered adequate (grade A). The self-report and other-report versions evidenced measurement invariance.
Studies addressing validity with other variables investigated the relationship between empathy scores and whether an individual is considered a good or poor listener (i.e., having an active and emotional interaction or not). Good listeners tended to score higher in empathy. There are no studies addressing the AELS's normative data or predictive validity, nor studies conducting cross-cultural adaptations.

Toronto Empathy Questionnaire
The Toronto Empathy Questionnaire (TEQ) was addressed by two studies between 2009 and 2019: the study that originally proposed it (Spreng et al., 2009) and the study of its cross-cultural adaptation into Turkish (Totan et al., 2012). The process of developing the TEQ was adequately described, but no pilot test was reported in the original study; a pilot test was implemented only during its cross-cultural adaptation. This prevented the Item/instrument bias criterion from achieving the maximum grade. On the other hand, due to the instrument's ease of use and application, the Respondent burden and Administrative burden criteria obtained grade A.
The TEQ's reliability was assessed using internal consistency (α = 0.79 to 0.87; grade A predominated) and temporal stability (0.73; grade B), both of which were adequate. In terms of factorial structure, Spreng et al. (2009) conducted two exploratory analyses, and a unidimensional structure was found in both, with factor loadings above 0.37 (grade B). Totan et al. (2012) replicated the TEQ's unifactorial structure but found problems in three specific items, which led them to retest the model after excluding these items. Both the 16-item and 13-item versions appeared satisfactory in the confirmatory analysis, and the shorter version was recommended.
The TEQ's convergent validity was verified through comparisons with other instruments measuring empathy and with instruments assessing correlated constructs such as autism symptoms, the ability to understand the mental states of others, and interpersonal perception. As expected, most correlations between the TEQ and other instruments assessing empathy were higher (0.29 to 0.80; grade A) than correlations with correlated constructs (−0.33 to 0.35; grade B).
Finally, other evidence of validity was analyzed using gender as a reference, showing that women scored higher than men. The TEQ studies did not investigate predictive validity or responsiveness and did not report alternative formats or cross-cultural adaptations.

Empathy Assessment Index
The study that originally proposed the EAI described the process of instrument development and the theoretical conceptualization of each of the five factors composing it (Affective Response, Perspective Taking, Self-Awareness, Emotion Regulation, and Empathic Attitudes); grade A was granted to the Item/instrument bias criterion. A later study assessed its psychometric properties.
Like the other instruments presented thus far, the EAI was considered easy to apply, and therefore the Respondent burden and Administrative burden criteria were rated with the highest grade. Its internal consistency coefficients ranged from 0.30 to 0.83, and temporal stability ranged from 0.59 to 0.85; the retest was applied with a 1-week interval (grade B).
The original study reported that the exploratory factor analysis indicated a 34-item, 6-factor structure (Empathic Attitudes, Affective Response – happy, Perspective Taking, Affective Response – sad, Perspective Taking–Affective Response, and Emotion Regulation), which explained 43.19% of the data variance (grade B). Later, based on literature reviews and feedback provided by specific community groups and experts in empathy, a subsequent study performed factor analyses for a new 48-item version, concluding that the model presenting the best goodness of fit comprised 17 items and five factors (Affective Response, Perspective Taking, Self-Awareness, Emotion Regulation, and Empathic Attitudes) (grade A). The convergent validity of the 34-item version was verified against the IRI, with coefficients ranging from 0.48 to 0.75 (grade A). Convergent validity was also investigated by comparing the EAI with correlated constructs, such as attention and cognitive emotion regulation; as expected, moderate coefficients were found (−0.40 to 0.51; grade B). The EAI's validity was also verified in relation to different sociodemographic variables. The results suggest differences concerning race (Afro- and Latin-Americans tended to present greater empathic behavior than Caucasians); educational background (social workers presented greater empathy than individuals from the criminal justice, sociology, education, or nursing fields); and the family of origin's socioeconomic status (poor/working-class individuals presented greater empathic behavior than middle/high-class individuals). The studies addressing the EAI did not address predictive validity or responsiveness, nor did they report alternative formats.

Affective and Cognitive Measure of Empathy
The Affective and Cognitive Measure of Empathy (ACME) was proposed by Vachon and Lynam (2016), and new validity evidence was later presented by Murphy et al. (2018). Vachon and Lynam did not report the procedures concerning the instrument's development, so the Item/instrument bias criterion obtained grade C. However, the instrument presented the characteristics necessary to receive grade A in the Respondent burden and Administrative burden criteria. Note that Vachon and Lynam's (2016) study obtained grade B in the Norms and standard values criterion because only the means and standard deviations of each of the instrument's scales, broken down by sex and race, were reported for the entire sample.
Its reliability was only investigated through internal consistency (>0.85; Vachon and Lynam, 2016), indicating a gap concerning temporal stability indicators.
Investigations addressed the instrument's factorial structure and convergent validity. Vachon and Lynam (2016) suggested a three-factor structure (Cognitive Empathy, Affective Resonance, and Affective Dissonance) with satisfactory goodness-of-fit indexes (grade A) and invariance between genders. However, Murphy et al. (2018) were unable to replicate this model, obtaining unsatisfactory goodness-of-fit indexes. Hence, they proposed a five-factor model (two factors based on the items' polarity – positive and negative items – in addition to the Cognitive Empathy, Affective Resonance, and Affective Dissonance factors), whose fit indexes obtained grade A.
Convergent validity was verified in relation to the IRI (−0.24 to 0.80) and the Basic Empathy Scale (0.40 to 0.65); these criteria obtained grade A. Both studies also obtained grade A for the indexes concerning correlated constructs, such as aggressive behavior, externalizing disorders, and personality pathologies (−0.83 to 0.77).

Remaining Instruments
The other 16 instruments were each analyzed by a single study. Seven of these studies intended to propose new instruments, five to obtain additional validity evidence, and four to conduct a cross-cultural adaptation of existing instruments.
Except for the MET, all the instruments obtained grade A in the Respondent burden and Administrative burden criteria because they were brief, well accepted, and easy to apply, score, and interpret. The MET obtained grade B in both criteria because it comprises 80 items and its application/scoring requires specific software.
Analysis of the new instruments showed that no specific normative indicators were reported for any of them (Norms, standard values: grade B). Reliability verified through internal consistency was investigated in 85.7% of the studies and, in general, presented satisfactory results (grade B). Temporal stability was verified in only 28.6% of the studies and presented positive results (grades A and B).
As for the existing instruments, there is a lack of normative data (data reported by 33.3% of the studies were restricted to the mean and standard deviation of the total score). As verified in the studies previously presented, no specific between-group comparison indicators (e.g., T scores or percentiles) were reported.
Note that only two studies addressing these new instruments (Segal et al., 2013 – Interpersonal and Social Empathy Index; Batchelder et al., 2017 – Empathy Components Questionnaire) reported information concerning how the instruments were developed and obtained grade A. Convergent (n = 7), factorial (n = 6), discriminant (n = 5), and predictive (n = 2) analyses were performed to investigate the instruments' validity. Regarding factorial structure, both the instruments proposed before 2009 and those proposed after 2009 obtained grade B, showing that this group of instruments' factorial structures were confirmed with few goodness-of-fit problems.
In general, the quality of the results concerning convergent validity was considered moderate (grade B). Correlations with other instruments measuring empathy were similar to those found with instruments measuring correlated constructs. Validity studies with other variables were also restricted to the investigation of gender, corroborating the literature: women tend to be more empathic than men. Hollar (2017) expanded the variables of interest (age and ethnicity) but did not obtain satisfactory results.
Among this group of studies, Oceja Fernández et al. (2009) investigated the predictive validity of the Vicarious Experience Scale; the Sympathy and Vicarious Distress subscales did not present satisfactory indexes for the prediction of elicited empathy and personal distress. Batchelder et al. (2017), in turn, reported the predictive ability of the Empathy Components Questionnaire concerning scores obtained on the Social Interests Index (grade B).
Within this group of instruments, note that the Pictorial Empathy Test (Lindeman et al., 2018) differs from the rest: it presents higher ecological validity because it is composed of images of people. The Single Item Trait Empathy Scale stands out because it is composed of a single item. In general, both presented satisfactory psychometric properties.
As for the cross-cultural studies, face validity procedures were generally in line with the guidelines recommended by Beaton et al. (2000), and most (75%) obtained grade A. These studies' psychometric properties were considered satisfactory, and the Culture/language adaptations criterion obtained grade B.
The instruments' reliability (internal consistency and temporal stability) was considered acceptable in most studies (grades A and B). However, the cross-cultural study addressing the MET obtained indexes below expectations for the instrument's cognitive factor, even after reducing the scale's number of items.
Six studies investigated the instruments' factorial structure. The results showed acceptable indexes for the Measure of Empathy and Sympathy, Multidimensional Emotional Empathy Scale, Basic Empathy Scale, The Mexican Empathy Scale, Positive Empathy Scale, and Cognitive, Affective, and Somatic Empathy Scales.
Among this set of studies, Park et al. (2019) conducted a cross-cultural adaptation of the Cognitive, Affective, and Somatic Empathy Scales. This instrument presents a specific scale to assess somatic empathy, which, according to the authors, can be defined as a tendency to imitate and automatically synchronize other peoples' facial expressions, vocalizations, behaviors, and movements; it was the only instrument to include such a measure. In general, its psychometric qualities were considered satisfactory.
On the other hand, the results indicated problems concerning convergent validity and factorial structure, with little progress made toward resolving or discussing these impasses. For example, low to moderate correlations were found between the EQ, IRI, and QCAE and other instruments assessing the empathy construct (meta-analytic correlations between 0.31 and 0.44), while similar or higher values were found when correlating these with different correlated constructs (meta-analytic correlations between 0.27 and 0.45), indicating that the instruments' clinical validity was greater than their theoretical validity. Studies published before the period addressed in this review also indicated these limits concerning convergent validity (e.g., EQ vs. IRI: Lawrence et al., 2004; De Corte et al., 2007; IRI vs. Hogan Empathy Scale: Davis, 1983; AELS vs. IRI: Drollinger et al., 2006; BES vs. IRI: Jolliffe and Farrington, 2006; MET vs. IRI: Dziobek et al., 2008) and factorial structure (e.g., EQ: Lawrence et al., 2004; Muncer and Ling, 2006; IRI: Siu and Shek, 2005; De Corte et al., 2007; BES: Jolliffe and Farrington, 2006; EI: Falcone et al., 2008).
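Meta-analytic correlation estimates like those reported above are commonly obtained by pooling study-level correlations through Fisher's r-to-z transformation, weighting each study by sample size. The sketch below illustrates a simple fixed-effect version of that procedure; the study correlations and sample sizes are hypothetical, and the review's exact pooling procedure may differ:

```python
import math

def pooled_correlation(studies):
    """Fixed-effect pooled r from (r, n) pairs via Fisher's z,
    weighting each study by n - 3 (inverse variance of z)."""
    num = sum((n - 3) * math.atanh(r) for r, n in studies)
    den = sum(n - 3 for _, n in studies)
    return math.tanh(num / den)

# Hypothetical convergent-validity correlations: (r, sample size)
studies = [(0.31, 120), (0.44, 85), (0.38, 200)]
print(round(pooled_correlation(studies), 2))  # → 0.37
```

Back-transforming the weighted mean z with the hyperbolic tangent yields a pooled r on the original correlation scale, comparable to the 0.27–0.45 ranges reported in the text.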
It is important to note that for most of the instruments analyzed here, no consensus was reached regarding the best factorial structure, considering that various models were tested. The goodness-of-fit results suggest problems related to both the base model (Comparative Fit Index and Tucker-Lewis Index below expected values) and population covariance (Root Mean Square Error of Approximation above the expected value) (Bentler, 1990; Hu and Bentler, 1999; Thompson, 2004). These divergences are possibly reflected in the analyses of convergent validity with the different instruments measuring empathy, most of which obtained values within moderate limits.
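The fit problems described above are judged against conventional cutoffs, commonly those of Hu and Bentler (1999): CFI and TLI at or above roughly .95 and RMSEA at or below roughly .06. A small helper applying these rules is sketched below; the fit values in the example are hypothetical and merely reproduce the pattern described in the text, not results from any reviewed study:

```python
def flag_fit_problems(cfi, tli, rmsea,
                      cfi_min=0.95, tli_min=0.95, rmsea_max=0.06):
    """Return the fit indices that miss conventional Hu & Bentler (1999) cutoffs."""
    problems = []
    if cfi < cfi_min:
        problems.append("CFI below cutoff")
    if tli < tli_min:
        problems.append("TLI below cutoff")
    if rmsea > rmsea_max:
        problems.append("RMSEA above cutoff")
    return problems

# Hypothetical CFA solution: weak baseline fit (CFI/TLI) and
# poor approximation error (RMSEA), as described for several instruments
print(flag_fit_problems(cfi=0.90, tli=0.88, rmsea=0.09))
# → ['CFI below cutoff', 'TLI below cutoff', 'RMSEA above cutoff']
```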
We believe that controversies concerning the multidimensional nature of empathy (Murphy et al., 2018) are reflected in the analyses, especially when the target instruments' subscales are analyzed more specifically. Murphy et al. (2018) discuss these aspects at length and note a lack of consensus regarding the empathy construct and differences in how authors adopt the concept when developing instruments. They observe greater consensus regarding the presence of affective and cognitive components; however, the analyses of the bifactor models assessed here (e.g., concerning the IRI) also failed to present satisfactory factor indexes. Given this lack of consensus, Surguladze and Bergen-Cico (2020) stress the need to reconsider and discuss the construct, considering its different dimensions and its directly and indirectly related mechanisms.
In addition to what Murphy et al. (2018) state regarding the lack of consensus, this review's findings indicate that some studies do not specify the conceptual model of empathy that grounded the development of the instruments and that would theoretically ground the empirical analysis of the instruments' internal structure, especially in more complex second-order models with a varied number of factors. This was the case for both new instruments proposed during the period covered by this review (e.g., QCAE: Reniers et al., 2011; QoE: Miguel et al., 2018; TEQ: Spreng et al., 2009) and older instruments (EQ: Baron-Cohen and Wheelwright, 2004; IRI: Davis, 1983; MxES: Díaz-Loving et al., 1986; EI: Falcone et al., 2008). The lack of theoretical models to properly ground the empathy construct and its dimensions possibly explains the restrictions concerning structural validity and the lack of convergence between the different instruments.
Regarding the statistical techniques used to investigate the instruments' structures, Marsh (2018) highlights that newer approaches, such as structural equation models (Gefen et al., 2000), can capture the complexity of the empathy construct more deeply and also resolve a series of problems encountered with the CFA approach (e.g., restricted factor loadings) (Marsh, 2018). Nonetheless, most of the studies opted for confirmatory and exploratory factor analyses, so future studies should invest in these newer techniques of analysis. Note that studies based on Item Response Theory (Pasquali, 2020) can also help resolve this impasse, considering that the studies addressed here attempted to improve the factor model by removing specific items.
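Item-level refinement of the kind the reviewed studies attempted (removing specific items to improve the model) is often guided in practice by statistics such as alpha-if-item-deleted. The sketch below uses fabricated data in which one item is pure noise and flags the items whose removal would raise Cronbach's alpha; it is an illustration of the general technique, not any reviewed study's procedure:

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents, n_items) score matrix."""
    k = items.shape[1]
    return (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum()
                            / items.sum(axis=1).var(ddof=1))

def items_to_drop(items):
    """Indices of items whose removal increases internal consistency."""
    base = cronbach_alpha(items)
    return [j for j in range(items.shape[1])
            if cronbach_alpha(np.delete(items, j, axis=1)) > base]

# Fabricated responses: four items load on a common trait,
# while item 2 (third column) is unrelated noise
rng = np.random.default_rng(0)
trait = rng.normal(size=200)
data = np.column_stack(
    [trait + rng.normal(scale=0.5, size=200) for _ in range(2)]
    + [rng.normal(size=200)]
    + [trait + rng.normal(scale=0.5, size=200) for _ in range(2)]
)
print(items_to_drop(data))
```

Only the noise item should be flagged: dropping it raises alpha markedly, while dropping any trait-related item lowers it.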
On the other hand, the construct's clinical/empirical validity seems to be consensual. Even though the studies were conducted with non-clinical samples, associations with different correlated constructs are adequate and reinforce the relationship of empathy with different psychopathological and behavioral indicators (e.g., autism: Komeda et al., 2019; post-traumatic stress disorder: Feldman et al., 2019; and borderline personality disorder: Flasbeck et al., 2019). However, for these instruments to be used in a clinical setting, aspects related to predictive evidence, which remain scarce, need to be explored. In this context, normative studies, which were not the target of the psychometric studies addressed here, are also needed.

Cross-cultural studies were an important focus of interest among researchers within this topic. These studies are relevant because they enable generating and/or reinforcing psychological theories that take the cultural context into account (Gomes et al., 2018). Additionally, they enable applying the same instrument to individuals belonging to different contexts and facilitate understanding the similarities and characteristics these groups share (Borsa et al., 2012), which is essential, especially in clinical research.
In general, cross-cultural studies reported psychometric qualities comparable to the original versions, though cultural invariance was not assessed for any of the target instruments. Investigating invariance between different cultural groups answering an instrument is essential to identify whether there are significant differences between scores and, if so, to verify whether these differences reflect actual differences at the latent-trait level or non-equivalence of the instruments' parameters (Damásio et al., 2016).
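Measurement invariance is typically tested by fitting a sequence of increasingly constrained multigroup models (configural, metric, scalar, strict) and comparing their fit; one common decision rule (Cheung and Rensvold, 2002) retains a more constrained level when CFI drops by no more than .01 relative to the previous step. The sketch below encodes only that stepwise comparison; the CFI values are hypothetical, and real invariance testing requires fitting the multigroup models themselves:

```python
def invariance_steps(cfis, max_cfi_drop=0.01):
    """Walk configural -> metric -> scalar -> strict and stop at the first
    step whose CFI drop exceeds the threshold (Cheung and Rensvold, 2002)."""
    levels = ["configural", "metric", "scalar", "strict"]
    supported = [levels[0]]
    for i in range(1, len(cfis)):
        if cfis[i - 1] - cfis[i] > max_cfi_drop:
            break  # invariance fails at this level of constraint
        supported.append(levels[i])
    return supported

# Hypothetical CFIs for increasingly constrained multigroup models:
# scalar invariance fails (CFI drop of .017), so only metric holds
print(invariance_steps([0.962, 0.958, 0.941, 0.930]))
# → ['configural', 'metric']
```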
Regarding new instruments, note that they added little in terms of Respondent burden and Administrative burden: except for the MET, these aspects obtained grade A, though they were little discriminant. These instruments also innovated little in format and structure; most were based on self-report items rated on a Likert scale. These studies also do not seem to overcome the critical psychometric points mentioned earlier.
The Interpersonal Reactivity Index and, more recently, the EQ have been widely used in clinical studies and applied to different target populations (e.g., Feeser et al., 2015; Fitriyah et al., 2020). However, despite their popularity, they present weaknesses concerning structural validity and limitations regarding responsiveness, standardization, and bias.
We conclude that, despite the diversity of instruments available to assess empathy and the many associated psychometric studies, limitations stand out, especially in terms of validity. Hence, as noted by previous reviews that evaluated specific empathy instruments and/or their performance in specific populations (Hemmerdinger et al., 2007; Yu and Kirk, 2009; Hong and Han, 2020), no instrument can currently be appointed as the gold standard.
Therefore, this field needs to advance in conceptual and theoretical terms; such an advance will enable the establishment of more robust models to be empirically reproduced by the instruments. Additionally, problems with the internal structure of various instruments can be minimized or resolved using more sophisticated techniques for the analysis and refinement of items. Normative and predictive studies can improve the evidence of validity for existing instruments, favoring greater clinical applicability. Complementary studies of invariance, testing the effect of culture, and alternative forms of administration (especially those using technological resources, such as online and computerized applications) are desirable and can expand the reach of the instruments. Regarding the proposition of new instruments, there remains a need for instruments with alternative formats that minimize response bias, especially social desirability, a recurrent problem in self-report instruments.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.