- 1Krembil Centre for Neuroinformatics, Centre for Addiction and Mental Health (CAMH), Toronto, ON, Canada
- 2Department of Psychiatry, University of Toronto, Toronto, ON, Canada
- 3Institute of Medical Sciences, University of Toronto, Toronto, ON, Canada
- 4Department of Psychology, University of Toronto, Toronto, ON, Canada
Cognitive sciences are grappling with the reliability paradox: measures that robustly produce within-group effects tend to have low test-retest reliability, rendering them unsuitable for studying individual differences. Despite growing awareness of this paradox, its full extent remains underappreciated. Specifically, most research focuses exclusively on how reliability affects correlational analyses of individual differences, while largely ignoring its effects on studying group differences. Moreover, some studies explicitly and erroneously suggest that poor reliability does not pose problems for studying group differences, possibly due to conflating within- and between-group effects. In this brief report, we aim to clarify this misunderstanding. Using both data simulations and mathematical derivations, we show how observed group differences are attenuated by imperfect measurement reliability. We consider multiple scenarios, including when groups are created by thresholding a continuous measure (e.g., patients vs. controls, or a median split), when groups are defined exogenously (e.g., treatment vs. control groups, or male vs. female), and how the observed effect sizes are further affected by between-group differences in measurement reliability and between-subject variance. We provide a set of equations for calculating attenuation effects across these scenarios. This has important implications for biomarker research and clinical translation, as well as any other area of research that relies on group comparisons to inform policy and real-world applications.
1 Introduction
An influential paper by Hedge et al. (2018) has highlighted the “reliability paradox”: cognitive tasks that produce robust within-group effects tend to have poor test-retest reliability, undermining their use for studying individual differences. Many studies have followed, demonstrating the prevalence of low test-retest reliability and emphasizing its implications for studying individual differences across various research contexts, including neuroimaging, computational modeling, psychiatric disorders, and clinical translation (Enkavi et al., 2019; Elliott et al., 2020, 2021; Fröhner et al., 2019; Nikolaidis et al., 2022; Kennedy et al., 2022; Nitsch et al., 2022; Blair et al., 2022; Zuo et al., 2019; Milham et al., 2021; Feng et al., 2022; Haines et al., 2023; Parsons et al., 2019; Hedge et al., 2020; Enkavi and Poldrack, 2021; Zorowitz and Niv, 2023; Gell et al., 2023; Rouder et al., 2023; Karvelis et al., 2023, 2024; Clayson, 2024; Vrizzi et al., 2025).
However, the studies on this topic tend to focus exclusively on how test-retest reliability affects correlational individual differences analyses without making it clear that it is just as relevant for studying group differences (although see LeBel and Paunonen, 2011; Zuo et al., 2019). Not only that, some studies incorrectly suggest that poor test-retest reliability is not problematic for studying group differences. For example: “Low reliability scores are problematic only if we were interested in differences between individuals (within a group) rather than between groups” (De Schryver et al., 2016); “although improved reliability is critical for understanding individual differences in correlational research, it is not very relevant or informative for studies comparing conditions or groups” (Zhang and Kappenman, 2024); “On a more positive note, insufficient or unproven test-retest reliability does not directly imply that one cannot reliably assess group differences (e.g., clinical vs. control)” (Schaaf et al., 2023); “while many cognitive tasks (including those presented here) have been well validated in case-control studies (e.g., comparing MDD and healthy individuals) where there may be large group differences, arguably these tests may be less sensitive at detecting individual differences” (Foley et al., 2024); “The reliability paradox... implies that many behavioral paradigms that are otherwise robust at the group-level (e.g., those that produce highly replicable condition- or group-wise differences) are unsuited for testing and building theories of individual differences” (Haines et al., 2020); “Many tasks clearly display robust between-group or between-condition differences, but they also tend to have sub-optimal reliability for individual differences research” (Parsons et al., 2019). Sometimes the opposite mistake is made by suggesting that poor reliability is equally detrimental for studying both between-group differences and within-group effects (e.g., see Figure 1 in Zuo et al., 2019).
An apparent common thread across these examples is the conflation of within-group effects and between-group effects, treating both simply as “group effects.” However, within- and between-group effects are often in tension. If an instrument is designed to produce strong within-group effects (i.e., robust changes across conditions or time points), it will typically do so by minimizing between-subject variability – which in turn reduces its ability to reliably detect individual or between-group differences. This trade-off lies at the heart of the reliability paradox. The key insight here is that both group and individual differences live on the same dimension of between-subject variability and are, therefore, affected by measurement reliability in the same way.
The aim of this brief report is therefore (1) to clarify and highlight the relevance of the reliability paradox for studying group differences, (2) to present simulation-based illustrations to make the implications of the reliability paradox more intuitive, and (3) to provide a set of mathematical formulae for effect size attenuation that cover different scenarios of group comparisons.
2 Methods
2.1 Simulated data
To simulate data, we sampled from a normal distribution, independently varying the between-subject (σb) and error (σe) variances. To represent repeated measurements of task performance, we generated two distributions (“test” and “retest”) with the same mean μ = 0. To simulate one-sample effects, we simply generated another distribution that was shifted upward by a constant offset (μ = 2). To simulate paired-sample effects, we generated two distributions (corresponding to Condition 1 and Condition 2), one at μ = 0 and the other at μ = 2. Finally, to illustrate relationships with external traits, we generated additional datasets with fixed between-subject variance and no error variance (σe = 0). We specified true population correlations of rtrue = 0.5 and rtrue = 0.9 to represent different levels of association between task performance and symptom/trait measures.
Note, while we refer to these data distributions as representing “task performance” and “traits/symptoms” to make this analysis more intuitive, these datasets are generated at a high level of abstraction and do not assume any specific data-generating process—i.e., we are not simulating trial-level or item-level data, we are simply generating distributions of individual-level scores.
Patients vs. controls groups were created by splitting the datasets such that 10% of the distribution with the highest symptom scores were assigned to the patient group while the remaining 90% were assigned to the control group. For creating high vs. low trait groups, we simply performed a median split across the datasets.
To achieve sufficient stability of the test-retest reliability and effect size estimates, we used a sample size of 10,000 for each combination of σb and σe, each of which was varied between 0.3 and 2 for most of the analysis. To further increase the stability of our effects, when investigating the relationship between true and observed effect size metrics as a function of reliability, we increased the sample size to 1,000,000. We also kept the between-subject variance fixed at σb = 0.5 and only varied the error variance in the range σe ∈ [0.01, 3]. When comparing how the different statistical metrics fare when it comes to significance testing, we used a sample of N = 60 (a typical sample size seen in practice); to achieve stable estimates of p-values, we averaged results over 20,000 repetitions.
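To make the simulation setup concrete, the following is a minimal Python sketch of the logic described in this section. It is an illustration under the stated assumptions (numpy only, arbitrary example parameter values), not the authors' implementation, which is available in the repository listed under the Data availability statement.

```python
# A minimal sketch of the simulation logic described above (not the authors' code;
# their implementation is linked in the Data availability statement). Requires numpy.
import numpy as np

rng = np.random.default_rng(0)
n, sigma_b, sigma_e, r_true = 10_000, 0.5, 0.5, 0.7  # example values from the ranges above

# Correlated true scores for "task performance" and "symptoms/traits"
cov = sigma_b**2 * np.array([[1.0, r_true], [r_true, 1.0]])
task_true, symptoms = rng.multivariate_normal([0.0, 0.0], cov, n).T  # symptoms: no error added

# Two noisy measurements of task performance ("test" and "retest")
test = task_true + rng.normal(0, sigma_e, n)
retest = task_true + rng.normal(0, sigma_e, n)
icc_approx = np.corrcoef(test, retest)[0, 1]  # Pearson r approximates the two-variance ICC

# One-sample ("within-group") effect: a distribution shifted by a constant offset (mu = 2)
condition = 2 + task_true + rng.normal(0, sigma_e, n)
d_within = condition.mean() / condition.std(ddof=1)

# Patients vs. controls: the 10% of the population with the highest symptom scores
cutoff = np.percentile(symptoms, 90)
patients, controls = test[symptoms >= cutoff], test[symptoms < cutoff]
d_star = (patients.mean() - controls.mean()) / np.sqrt(
    (patients.var(ddof=1) + controls.var(ddof=1)) / 2)  # non-pooled Cohen's d*

print(f"ICC ~ {icc_approx:.2f}, within-group d ~ {d_within:.2f}, patients vs. controls d* ~ {d_star:.2f}")
```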
2.2 Statistical analysis
To assess test-retest reliability, we used the intraclass correlation coefficient (ICC) (Fleiss, 2011; Koo and Li, 2016; Liljequist et al., 2019):
where for clarity we omit the within-subject variance term in the denominator because throughout our analysis it was kept at 0 between test and retest measurements.
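For reference, in this simplified form the ICC is the ratio of between-subject variance to total variance:
\[
\mathrm{ICC} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_e^2},
\]
where σb² is the between-subject variance and σe² is the error variance.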
To measure group effects, we used Cohen's d as the main effect size metric for both within- and between-group effects. To account for unequal variances between groups when performing 90/10% split (for controls vs. patients), we used d*, a variant of Cohen's d that does not assume equal variances:
Equation 3 is the standard method for calculating Cohen's d using the pooled standard deviation, where the numerator is the difference between the means of the two groups (μ1 and μ2), with n1 and n2 denoting the sample sizes and σ1² and σ2² denoting the variances of each group. In contrast, Cohen's d* (Equation 4) is based on the non-pooled standard deviation. While the use of a non-pooled standard deviation somewhat complicates the interpretation of the resulting effect size metric, empirical investigations have shown that it possesses robust inferential properties and may be a more practical option given that the equal variance requirement is rarely met in practice (Delacre et al., 2021).
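For reference, with group means μ1 and μ2, sample sizes n1 and n2, and variances σ1² and σ2², the standard pooled and non-pooled forms described here are:
\[
d = \frac{\mu_1 - \mu_2}{\sqrt{\dfrac{(n_1 - 1)\,\sigma_1^2 + (n_2 - 1)\,\sigma_2^2}{n_1 + n_2 - 2}}},
\qquad
d^{*} = \frac{\mu_1 - \mu_2}{\sqrt{\dfrac{\sigma_1^2 + \sigma_2^2}{2}}}.
\]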
To be more comprehensive in our analysis, alongside Cohen's d, we also report a non-parametric alternative, the rank-biserial correlation coefficient (rrb). To perform the associated statistical significance tests, we use the t-test and the Mann-Whitney U test, respectively.
To quantify the impact of reliability on statistical power, we calculated the required sample sizes for each effect size metric to achieve 80% power at α = 0.05 (Cohen, 2013). The critical values zα/2 and zβ are defined as:
where Φ−1 is the inverse cumulative distribution function of the standard normal distribution, α = 0.05 is the two-sided significance level, and β = 0.20 (corresponding to 80% power) is the Type II error rate. With these parameters, zα/2 = 1.96 and zβ = 0.84.
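In standard notation, these critical values are:
\[
z_{\alpha/2} = \Phi^{-1}\left(1 - \frac{\alpha}{2}\right), \qquad z_{\beta} = \Phi^{-1}\left(1 - \beta\right),
\]
which for α = 0.05 and β = 0.20 gives Φ⁻¹(0.975) ≈ 1.96 and Φ⁻¹(0.80) ≈ 0.84, matching the values above.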
For Pearson correlation, we used the Fisher z-transform approach (Cohen, 2013):
where |robs| is the absolute value of the observed correlation attenuated by measurement error. For Cohen's d from median split analysis, we used the standard two-sample t-test power formula (Cohen, 2013):
where dobs is the observed effect size attenuated by reliability. For the rank-biserial correlation with equal group sizes, we used (Noether, 1987):
where rrb, obs is the observed rank-biserial correlation coefficient.
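For reference, the standard normal-approximation expressions consistent with these descriptions are given below; the rank-biserial version uses the correspondence rrb = 2p − 1 between the rank-biserial correlation and the probability of superiority p, which is our assumption rather than a formula taken from the text.
\[
n_{r} = \left(\frac{z_{\alpha/2} + z_{\beta}}{\tfrac{1}{2}\ln\frac{1 + |r_{\mathrm{obs}}|}{1 - |r_{\mathrm{obs}}|}}\right)^{2} + 3,
\qquad
n_{d} = \frac{2\left(z_{\alpha/2} + z_{\beta}\right)^{2}}{d_{\mathrm{obs}}^{2}} \ \ \text{(per group)},
\]
\[
N_{\mathrm{total}} = \frac{\left(z_{\alpha/2} + z_{\beta}\right)^{2}}{3\left(p - \tfrac{1}{2}\right)^{2}}
= \frac{4\left(z_{\alpha/2} + z_{\beta}\right)^{2}}{3\, r_{rb,\mathrm{obs}}^{2}} \quad \text{(equal group sizes; Noether, 1987)}.
\]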
3 Results
3.1 The reliability paradox
First, we performed data simulations to illustrate the reliability paradox—namely, that strong within-group effects are inherently at odds with high test-retest reliability. We generated multiple sets of synthetic data by independently varying between-subject variance (σb) and measurement error variance (σe), and explored how this affects test-retest reliability and observed within-group effects (Figure 1). The key takeaway here is that while test-retest reliability is determined by the proportion of between-subject variance relative to total variance (σb²/(σb² + σe²)), within-group effects depend on the total variance (σb² + σe²). In other words, increasing error variance will reduce both reliability and within-group effect sizes, whereas increasing between-subject variance will improve reliability but will reduce within-group effects, since a fixed mean difference becomes smaller relative to the larger total variance.

Figure 1. The reliability paradox. Top panels (A–C) illustrate the statistical tests under consideration: (A) test-retest correlation (same measure obtained twice), (B) one-sample test (mean of a single condition compared to zero), and (C) paired-sample test (mean difference between two conditions). Bottom panels (D–F) show how the observed outcomes of these tests depend on the relative contributions of error variance and between-subject variance. Test-retest reliability (D) increases when error variance is minimized and between-subject variance is maximized, whereas observed one-sample and paired-sample effect sizes (E, F) increase when both error and between-subject variances are minimized.
Note that for simplicity here we assumed the between-subject variance in condition 1 and condition 2 to be uncorrelated (Figure 1C). Under this assumption, the variance of the difference scores (i.e., the individual-level condition differences used to compute paired-sample Cohen's d) is equal to the sum of the variances in each condition. Due to this linear relationship, Figure 1F can therefore be interpreted as referring to the between-subject variance of difference scores. In practice, however, task conditions are often positively correlated, which reduces the variance of the difference scores—thereby inflating effect sizes while reducing the reliability of the underlying scores (Cronbach and Furby, 1970; Hedge et al., 2018; Draheim et al., 2019). This would introduce non-linearities in how the between-subject variance of each condition relates to the observed effect size, but the relationship between the between-subject variance of difference scores and the observed effect size, which we aim to convey here, still holds.
3.2 Group differences: data simulations
3.2.1 Data simulations for groups created by dichotomizing continuous measures
Next, using data simulations we investigated how measurement reliability affects group differences when the groups are derived by thresholding a continuous measure (e.g., symptoms or traits). To make this more intuitive we considered two common scenarios: (1) mental disorders, which can generally be thought of as occupying the low end of the wellbeing distribution (Huppert et al., 2005) or of any specific symptom dimension, and (2) “low” vs. “high” cognitive traits formed by a median split (Figure 2). Considering these scenarios using simulations helps illustrate one key insight: raw group differences scale together with between-subject variance (Figures 2B, C). Hence, unlike with within-group effects, reducing between-subject variance does not lead to larger group effects. Adding measurement error can further increase raw group differences, but it also leads to misclassification (Figures 2B, C), which ultimately attenuates the observed group differences in any other measures of interest, as we will see next.

Figure 2. Group differences as a function of population variance. (A) The dimensional view of mental disorders; adapted from Huppert et al. (2005). (B) The relationship between patients vs. controls group differences and population variance, assuming that patients are defined as the 10% of the population with the poorest mental health. (C) A more general case illustrating the group differences resulting from a median split of the data (based on some cognitive measure) as a function of population variance. In both (B) and (C) we see that as the true population variance increases, raw group differences increase too, while adding measurement error to the true scores results in misclassification of some individuals—which ultimately attenuates the observed group differences in the measures of interest.
We generated correlated “symptoms/traits” and “task performance” datasets such that they had a true Pearson correlation of rtrue = 0.7. To derive the groups, we defined patients as the 10% of the population with the poorest mental health (Figures 3A, B), with the rest of the population serving as controls; using a median split along the trait dimension (Figure 3A), we grouped individuals into “low” and “high” trait groups (Figure 3C).

Figure 3. Test-retest reliability effects on observed group differences. The top row panels (A–C) illustrate the different analysis scenarios, the middle row panels (D–F) show the corresponding observed effects for different error and between-subject variance values of task performance, and the bottom row panels (G–I) show the corresponding observed effects for different error and between-subject variance values of symptoms/traits. (A) An illustration of the correlation between traits or symptoms and task performance. The vertical dashed lines indicate how the data were split for the two analysis scenarios. (B) An illustration of patient and control groups created by assigning the 10% of the population with the poorest mental health to the patient group and the remaining 90% to the control group. (C) An illustration of “low” and “high” trait groups created by performing a median split. Overall, the results show that both the observed correlation strength and the observed group differences increase with increasing test-retest reliability (i.e., with decreasing error variance and/or increasing between-subject variance), both when varying the between-subject and error variances of the task performance measures (D–F) and of the symptoms/traits measures (G–I).
We then examined how the observed effect size metrics were affected by independently varying σb and σe of task performance and then of traits/symptoms (see Methods for more details). In both cases, we find the same result: reducing reliability in either the task performance measures (Figures 3D–F) or the symptoms/traits measures (Figures 3G–I) leads to an attenuation of observed effect sizes that mirrors that seen in correlational analyses of individual differences (Equation 9).
3.2.2 Comparing attenuation effects across effect size metrics
In a correlational analysis, the true correlation strength between a measure x and a measure y is attenuated by their respective reliabilities following (Spearman, 1904):
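With reliabilities expressed as ICCs (as used throughout this report), this classical disattenuation relationship reads:
\[
r_{\mathrm{obs}} = r_{\mathrm{true}} \sqrt{\mathrm{ICC}_x \cdot \mathrm{ICC}_y}.
\]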
We compared the observed effect sizes from our simulations to the predicted attenuation relationship and found that both parametric (Cohen's d) and non-parametric (rank-biserial rrb) estimates closely followed the same reliability-based scaling as Pearson's r, especially for moderate true correlations (rtrue = 0.5; Figure 4A). However, Cohen's d deviated more substantially when rtrue = 0.9 (Figure 4A, inset) due to increasing deviations from normality caused by dichotomization. Thus, when the assumptions of the effect size metric hold, observed between-group differences can be approximated as:
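By direct analogy with Equation 9 (this is our reading of the approximation referred to here, with ICCx and ICCy denoting the reliabilities of the outcome measure and of the measure used to define the groups, respectively):
\[
d_{\mathrm{obs}} \approx d_{\mathrm{true}} \sqrt{\mathrm{ICC}_x \cdot \mathrm{ICC}_y}.
\]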
Although attenuation similarly affects both correlations and group differences, it is important to keep in mind that correlational analyses generally retain greater statistical power. Figure 4B illustrates that p-values for group comparisons of dichotomized data are larger than those for correlation tests (N = 60, rtrue = 0.5), and Figure 4C similarly illustrates that the required sample sizes (to achieve 80% power at α = 0.05) for dichotomized data are substantially larger than those for correlational analysis, especially when reliability is low. That is simply because the variance discarded during dichotomization results in information loss. This is well documented in previous work (e.g., MacCallum et al., 2002; Royston et al., 2006; Naggara et al., 2011; Streiner, 2002); therefore, we will not go into further detail here. Just, please, avoid dichotomizing your continuous data as much as you can.

Figure 4. Test-retest reliability effects across different effect size metrics and statistical tests. (A) The observed effect sizes as a function of reliability for rtrue = 0.5, comparing group differences to correlational strength. Note, because the effect sizes among the tests are not directly comparable, each effect size is normalized by its own maximum value at ICC = 1. The inset shows the results for rtrue = 0.9. The dashed line denotes the theoretically predicted attenuation (Equation 9). (B) The p-value as a function of reliability for rtrue = 0.5 and a total sample size of N = 60. Dichotomizing data substantially increases p-values, especially when reliability is low. (C) The required sample size to achieve 80% statistical power at α = 0.05 as a function of reliability for the three effect size metrics. Dichotomizing data substantially increases the required sample size to detect the same true effect, especially when reliability is low.
3.3 Group differences: mathematical derivations
3.3.1 Attenuation for exogenously defined groups
In previous sections, we used simulations to illustrate how poor reliability attenuates group differences when groups are derived from a noisy continuous measure. These visualizations were meant to provide an intuitive understanding of the attenuation effect. Here, we derive the same effect mathematically, but this time considering exogenously defined groups (e.g., male vs. female or treatment vs. control), which are categorical and not subject to measurement error.
For such externally defined groups, random measurement error does not systematically bias the raw mean difference (Lord and Novick, 1968; Nunnally and Bernstein, 1994) but inflates the total variance (σb² + σe²). Consequently, the raw difference scales with between-subject variance rather than total variance (Cohen, 1988; Ellis, 2010):
The expression for δobserved will then depend on total variance, such that:
where we used Equations 11 and 2 to arrive at the final expression.
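A sketch of this derivation, assuming the two groups share the same true between-subject standard deviation σb and error standard deviation σe, and writing μ1 and μ2 for the group means:
\[
\delta_{\mathrm{true}} = \frac{\mu_1 - \mu_2}{\sigma_b}, \qquad
\delta_{\mathrm{obs}} = \frac{\mu_1 - \mu_2}{\sqrt{\sigma_b^2 + \sigma_e^2}}
= \delta_{\mathrm{true}}\sqrt{\frac{\sigma_b^2}{\sigma_b^2 + \sigma_e^2}}
= \delta_{\mathrm{true}}\sqrt{\mathrm{ICC}}.
\]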
Notably, this same attenuation mechanism applies to both externally defined groups and groups formed via dichotomization, although the latter additionally suffers from misclassification effects.
3.3.2 Attenuation when the groups have different reliabilities
Equation 14 assumes both groups have the same measurement reliability, but this is not always the case. Here, we derive a more general formula that accounts for differing reliabilities. If the groups have reliabilities ICC1 and ICC2, their observed standard deviations are:
where we assume both groups share the same true between-subject standard deviation. Using the non-pooled variance estimation (see Methods for more details), the observed standard deviation is given by:
Thus, the observed standardized difference is:
In the special case where ICC1 = ICC2 = ICC, this expression simplifies to Equation 14, consistent with the earlier result.
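A sketch of this case, under the stated assumptions (shared true between-subject standard deviation σb, group-specific reliabilities ICC1 and ICC2, and the non-pooled standardizer of Equation 4):
\[
\sigma_{\mathrm{obs},1} = \frac{\sigma_b}{\sqrt{\mathrm{ICC}_1}}, \qquad
\sigma_{\mathrm{obs},2} = \frac{\sigma_b}{\sqrt{\mathrm{ICC}_2}}, \qquad
\sigma_{\mathrm{obs}}^{*} = \sigma_b\sqrt{\frac{1}{2}\left(\frac{1}{\mathrm{ICC}_1} + \frac{1}{\mathrm{ICC}_2}\right)},
\]
\[
\delta_{\mathrm{obs}}^{*} = \frac{\mu_1 - \mu_2}{\sigma_{\mathrm{obs}}^{*}}
= \delta_{\mathrm{true}}\sqrt{\frac{2\,\mathrm{ICC}_1\,\mathrm{ICC}_2}{\mathrm{ICC}_1 + \mathrm{ICC}_2}},
\]
which indeed reduces to δtrue √ICC when ICC1 = ICC2 = ICC.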
3.3.3 Attenuation when true variances are also unequal
Thus far, we assumed that both groups share the same underlying between-subject standard deviation. Here, we relax that assumption and allow the two groups to have different true variances, such that the total true variance is
and the observed variances are
and so the total observed unpooled variance is
Thus, the observed standardized difference is
Substituting Equations 21, 23 into Equation 24 yields
In the special case where σb, 1 = σb, 2 = σb, Equation 25 simplifies to Equation 20.
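A sketch of this more general case, with group-specific true between-subject standard deviations σb,1 and σb,2 and reliabilities ICC1 and ICC2 (again using the non-pooled standardizer):
\[
\delta_{\mathrm{true}} = \frac{\mu_1 - \mu_2}{\sqrt{\tfrac{1}{2}\left(\sigma_{b,1}^2 + \sigma_{b,2}^2\right)}}, \qquad
\sigma_{\mathrm{obs},i}^2 = \frac{\sigma_{b,i}^2}{\mathrm{ICC}_i} \ \ (i = 1, 2),
\]
\[
\delta_{\mathrm{obs}}^{*} = \frac{\mu_1 - \mu_2}{\sqrt{\tfrac{1}{2}\left(\frac{\sigma_{b,1}^2}{\mathrm{ICC}_1} + \frac{\sigma_{b,2}^2}{\mathrm{ICC}_2}\right)}}
= \delta_{\mathrm{true}}\sqrt{\frac{\sigma_{b,1}^2 + \sigma_{b,2}^2}{\frac{\sigma_{b,1}^2}{\mathrm{ICC}_1} + \frac{\sigma_{b,2}^2}{\mathrm{ICC}_2}}},
\]
which reduces to the previous expression when σb,1 = σb,2.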
3.3.4 Attenuation by classification reliability
We can further extend these equations to take into account the reliability of group labels (e.g., patients vs. controls). Note that we have already demonstrated via simulations that, when group classification is error prone, the observed group differences are attenuated according to the reliability of the underlying continuous measure. However, when comparing two groups, the measure of classification reliability more likely to be reported is Cohen's kappa (κ) (Cohen, 1960), which quantifies the reliability of categorical labels (and is often used to quantify the inter-rater reliability of clinical diagnoses). The relationship between κ and the underlying reliability of the continuous measure, ICC, can be shown to be (Kraemer, 1979):
Rearranging this for ICC gives
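The exact form of this mapping depends on the assumed population model and the classification threshold; the expressions below show only an illustrative special case (a median split of bivariate-normal scores), offered as a sketch rather than as the exact relation given by Kraemer (1979):
\[
\kappa = \frac{2}{\pi}\arcsin(\mathrm{ICC})
\qquad\Longleftrightarrow\qquad
\mathrm{ICC} = \sin\!\left(\frac{\pi\kappa}{2}\right).
\]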
Now, Equation 10 can be re-expressed in terms of classification reliability, while Equations 20 and 25 can be further extended to account for it:
We summarize all the attenuation equations in Box 1.
Box 1. Attenuation of observed group differences in different scenarios.
Continuous outcome measured with reliability ICC; classification is error-free.
Continuous outcome measured with reliability ICC, while classification reliability is κ.
Continuous outcome with group-specific reliabilities ICC1 and ICC2; classification is error-free.
Continuous outcome measured with group-specific reliabilities ICC1 and ICC2, while classification reliability is κ.
Continuous outcome with group-specific variances σb,1², σb,2², and reliabilities ICC1, ICC2; classification is error-free.
Continuous outcome measured with group-specific variances σb,1², σb,2², and reliabilities ICC1, ICC2, while classification reliability is κ.
4 Discussion
This report extends the implications of the reliability paradox beyond its original focus on individual differences (Hedge et al., 2018), demonstrating that it presents the same problems when studying group differences. When groups are formed by thresholding continuous measures (e.g., patients vs. controls), the resulting loss of statistical power makes detecting group differences (vs. individual differences) even harder when reliability is low. We hope that this work will help raise awareness of measurement reliability implications in group differences research and that the provided mathematical expressions will help researchers better account for the magnitude of the effect size attenuation in their studies.
4.1 Implications for clinical translation
Poor reliability leads to small observed effects, which severely impedes clinical translation (Karvelis et al., 2023; Nikolaidis et al., 2022; Gell et al., 2023; Tiego et al., 2023; Moriarity and Alloy, 2021; Paulus and Thompson, 2019; Hajcak et al., 2017). For example, for a measure to have diagnostic utility—defined as ≥80% sensitivity and ≥80% specificity—it must show a group difference of d≥1.66 (Loth et al., 2021). Note that d≥0.8 is considered “large” and it is rarely seen in practice. This may also explain why treatment response prediction research, where it is common to dichotomize symptom change into responders vs. non-responders, has so far shown limited success (Karvelis et al., 2022). Improving the reliability of measures to uncover the landscape of large effects is therefore of paramount importance (DeYoung et al., 2025; Nikolaidis et al., 2022; Zorowitz and Niv, 2023). This applies not only to cognitive performance measures—where the reliability paradox discussion originates—but equally to other instruments including clinical rating scales and diagnostic criteria (Regier et al., 2013; Shrout, 1998), self-report questionnaires (Enkavi et al., 2019; Vrizzi et al., 2025), and experience sampling methods (ESM) (Dejonckheere et al., 2022; Csikszentmihalyi and Larson, 1987). To begin uncovering large effect sizes, however, reliability analysis and reporting must first become a routine research practice (Karvelis et al., 2023; Parsons et al., 2019; LeBel and Paunonen, 2011). While some guidelines such as APA's JARS for psychological research (Appelbaum et al., 2018) and COSMIN for health measurement instruments (Mokkink et al., 2010) do encourage routine reporting of reliability, others, such as PECANS for cognitive and neuropsychological studies (Costa et al., 2025), do not mention reliability or psychometric quality at all, underscoring the need to continue raising awareness of measurement reliability issues.
4.2 Double bias: reliability attenuation and small-sample inflation
Correct interpretation of observed effects requires considering not only the attenuation effects we describe here, but also sampling error. While low measurement reliability attenuates observed effect sizes, small samples produce unstable estimates that are often selectively reported, leading to systematic inflation of reported effects—known as the winner's curse (Sidebotham and Barlow, 2024; Button et al., 2013; Ioannidis, 2008). Currently, research in cognitive neuroscience and psychology is dominated by small samples, with an estimated 50% of studies reporting false positive results (Szucs and Ioannidis, 2017); also see Schäfer and Schwarz (2019). While the attenuation of effect sizes can be addressed by the equations we provide, inflation due to the winner's curse can be mitigated by collecting larger samples, preregistering analyses, and applying bias-aware estimation or meta-analytic techniques (Button et al., 2013; Nosek et al., 2018; Zöllner and Pritchard, 2007; Vevea and Hedges, 1995).
4.3 Broader implications for real-world impact
Although we presented our statistical investigation with psychiatry and cognitive sciences in mind, the implications of our results are quite general and could inform any area of research that relies on group comparisons, including comparisons based on education, sex, gender, age, race, and ethnicity (e.g., Hyde, 2016; Ones and Anderson, 2002; Roth et al., 2001; Rea-Sandin et al., 2021; Perna, 2005; Vedel, 2016). The reliability of measures is rarely considered in such studies, yet the observed effect sizes are often treated as proxies for practical importance (Cook et al., 2018; Funder and Ozer, 2019; Kirk, 2001, 1996; Olejnik and Algina, 2000) and are used to inform clinical practice (e.g., Ferguson, 2009; McGough and Faraone, 2009) and policy (e.g., Lipsey et al., 2012; Pianta et al., 2009; McCartney and Rosenthal, 2000). Not accounting for the reliability of measures can therefore create a very misleading scientific picture and lead to damaging real-world consequences.
4.4 Limitations and caveats
Our derivations of effect size attenuation are based on parametric assumptions and may not give precise estimates when the data are highly non-normal or contaminated with outliers. By extension, they may not give precise estimates for non-parametric group difference metrics, although they should still provide a good approximation. Furthermore, we should highlight once again that our derivations rely on using the non-pooled variance for calculating standardized mean differences, which allows dropping the assumption of equal variance. Thus, when the true variances are indeed not equal between the groups, it is important to use the non-pooled variance version, Cohen's d* (see Delacre et al., 2021, for further details), when applying the attenuation equations. However, if the true variances are roughly equal, the attenuation relationships derived here will hold just as well for the standard Cohen's d, which uses the pooled variance.
Data availability statement
The datasets presented in this study can be found in online repositories. The code for producing the data simulations and figures is available at: https://github.com/povilaskarvelis/clarifying_the_reliability_paradox.
Author contributions
PK: Conceptualization, Formal analysis, Funding acquisition, Investigation, Visualization, Writing – original draft, Writing – review & editing. AD: Funding acquisition, Supervision, Writing – review & editing.
Funding
The author(s) declare that financial support was received for the research and/or publication of this article. PK is supported by CIHR Fellowship (472369). AD was supported by NSERC Discovery Fund (214566) and the Krembil Foundation (1000824).
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
The author(s) declared that they were an editorial board member of Frontiers, at the time of submission. This had no impact on the peer review process and the final decision.
Generative AI statement
The author(s) declare that no Gen AI was used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher's note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
Appelbaum, M., Cooper, H., Kline, R. B., Mayo-Wilson, E., Nezu, A. M., and Rao, S. M. (2018). Journal article reporting standards for quantitative research in psychology: the APA publications and communications board task force report. Am. Psychol. 73:3. doi: 10.1037/amp0000191
Blair, R. J. R., Mathur, A., Haines, N., and Bajaj, S. (2022). Future directions for cognitive neuroscience in psychiatry: recommendations for biomarker design based on recent test re-test reliability work. Curr. Opin. Behav. Sci. 44:101102. doi: 10.1016/j.cobeha.2022.101102
Button, K. S., Ioannidis, J. P., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S., et al. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nat. Rev. Neurosci. 14, 365–376. doi: 10.1038/nrn3475
Clayson, P. E. (2024). The psychometric upgrade psychophysiology needs. Psychophysiology 61:e14522. doi: 10.1111/psyp.14522
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20, 37–46. doi: 10.1177/001316446002000104
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. Lawrence Erlbaum Associates, Hillsdale, NJ, 2nd edition.
Cohen, J. (2013). Statistical Power Analysis for the Behavioral Sciences. Routledge: New York. doi: 10.4324/9780203771587
Cook, B. G., Cook, L., and Therrien, W. J. (2018). Group-difference effect sizes: Gauging the practical importance of findings from group-experimental research. Learn. Disabil. Res. Pract. 33, 56–63. doi: 10.1111/ldrp.12167
Costa, C., Pezzetta, R., Toffalini, E., Grassi, M., Cona, G., Miniussi, C., et al. (2025). Enhancing the quality and reproducibility of research: preferred evaluation of cognitive and neuropsychological studies-the PECANS statement for human studies. Behav. Res. Methods 57:182. doi: 10.3758/s13428-025-02705-3
Cronbach, L. J., and Furby, L. (1970). How we should measure “change”: or should we? Psychol. Bull. 74:68. doi: 10.1037/h0029382
Csikszentmihalyi, M., and Larson, R. (1987). Validity and reliability of the experience-sampling method. J. Nerv. Ment. Dis. 175, 526–536. doi: 10.1097/00005053-198709000-00004
De Schryver, M., Hughes, S., Rosseel, Y., and De Houwer, J. (2016). Unreliable yet still replicable: a comment on LeBel and Paunonen (2011). Front. Psychol. 6:2039. doi: 10.3389/fpsyg.2015.02039
Dejonckheere, E., Demeyer, F., Geusens, B., Piot, M., Tuerlinckx, F., Verdonck, S., et al. (2022). Assessing the reliability of single-item momentary affective measurements in experience sampling. Psychol. Assess. 34:1138. doi: 10.1037/pas0001178
Delacre, M., Lakens, D., Ley, C., Liu, L., and Leys, C. (2021). Why Hedges' g*s based on the non-pooled standard deviation should be reported with Welch's t-test. doi: 10.31234/osf.io/tu6mp. [Epub ahead of print].
DeYoung, C. G., Hilger, K., Hanson, J. L., Abend, R., Allen, T. A., Beaty, R. E., et al. (2025). Beyond increasing sample sizes: optimizing effect sizes in neuroimaging research on individual differences. J. Cogn. Neurosci. 37, 1–12. doi: 10.1162/jocn_a_02297
Draheim, C., Mashburn, C. A., Martin, J. D., and Engle, R. W. (2019). Reaction time in differential and developmental research: a review and commentary on the problems and alternatives. Psychol. Bull. 145:508. doi: 10.1037/bul0000192
Elliott, M. L., Knodt, A. R., and Hariri, A. R. (2021). Striving toward translation: strategies for reliable fmri measurement. Trends Cogn. Sci. 25, 776–787. doi: 10.1016/j.tics.2021.05.008
Elliott, M. L., Knodt, A. R., Ireland, D., Morris, M. L., Poulton, R., Ramrakha, S., et al. (2020). What is the test-retest reliability of common task-functional MRI measures? New empirical evidence and a meta-analysis. Psychol. Sci. 31, 792–806. doi: 10.1177/0956797620916786
Ellis, P. D. (2010). The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results. Cambridge University Press, Cambridge, UK. doi: 10.1017/CBO9780511761676
Enkavi, A. Z., Eisenberg, I. W., Bissett, P. G., Mazza, G. L., MacKinnon, D. P., Marsch, L. A., et al. (2019). Large-scale analysis of test-retest reliabilities of self-regulation measures. Proc. Nat. Acad. Sci. 116, 5472–5477. doi: 10.1073/pnas.1818430116
Enkavi, A. Z., and Poldrack, R. A. (2021). Implications of the lacking relationship between cognitive task and self-report measures for psychiatry. Biol. Psychiatry Cogn. Neurosci. Neuroimaging 6, 670–672. doi: 10.1016/j.bpsc.2020.06.010
Feng, C., Thompson, W. K., and Paulus, M. P. (2022). Effect sizes of associations between neuroimaging measures and affective symptoms: a meta-analysis. Depress. Anxiety 39, 19–25. doi: 10.1002/da.23215
Ferguson, C. J. (2009). An effect size primer: a guide for clinicians and researchers. Profess. Psychol. Res. Pract. 40, 532–538. doi: 10.1037/a0015808
Foley, É. M., Slaney, C., Donnelly, N. A., Kaser, M., Ziegler, L., and Khandaker, G. M. (2024). A novel biomarker of interleukin 6 activity and clinical and cognitive outcomes in depression. Psychoneuroendocrinology 164:107008. doi: 10.1016/j.psyneuen.2024.107008
Fröhner, J. H., Teckentrup, V., Smolka, M. N., and Kroemer, N. B. (2019). Addressing the reliability fallacy in FMRI: similar group effects may arise from unreliable individual effects. Neuroimage 195, 174–189. doi: 10.1016/j.neuroimage.2019.03.053
Funder, D. C., and Ozer, D. J. (2019). Evaluating effect size in psychological research: sense and nonsense. Adv. Methods Pract. Psychol. Sci. 2, 156–168. doi: 10.1177/2515245919847202
Gell, M., Eickhoff, S. B., Omidvarnia, A., Kueppers, V., Patil, K. R., Satterthwaite, T. D., et al. (2023). The burden of reliability: How measurement noise limits brain-behaviour predictions. bioRxiv. doi: 10.1101/2023.02.09.527898
Haines, N., Kvam, P. D., Irving, L., Smith, C., Beauchaine, T. P., Pitt, M. A., et al. (2020). Learning from the reliability paradox: how theoretically informed generative models can advance the social, behavioral, and brain sciences. PsyArXiv. doi: 10.31234/osf.io/xr7y3
Haines, N., Sullivan-Toole, H., and Olino, T. (2023). From classical methods to generative models: tackling the unreliability of neuroscientific measures in mental health research. Biol. Psychiatry Cogn. Neurosci. Neuroimaging. 8, 822–831. doi: 10.31234/osf.io/ax34v
Hajcak, G., Meyer, A., and Kotov, R. (2017). Psychometrics and the neuroscience of individual differences: internal consistency limits between-subjects effects. J. Abnorm. Psychol. 126:823. doi: 10.1037/abn0000274
Hedge, C., Bompas, A., and Sumner, P. (2020). Task reliability considerations in computational psychiatry. Biol. Psychiatry Cogn. Neurosci. Neuroimaging. 5, 837–839. doi: 10.1016/j.bpsc.2020.05.004
Hedge, C., Powell, G., and Sumner, P. (2018). The reliability paradox: why robust cognitive tasks do not produce reliable individual differences. Behav. Res. Methods 50, 1166–1186. doi: 10.3758/s13428-017-0935-1
Huppert, F. A., Baylis, N., and Keverne, B. (2005). The Science of Well-Being. Oxford: Oxford University Press. doi: 10.1093/acprof:oso/9780198567523.001.0001
Hyde, J. S. (2016). Sex and cognition: gender and cognitive functions. Curr. Opin. Neurobiol. 38, 53–56. doi: 10.1016/j.conb.2016.02.007
Ioannidis, J. P. (2008). Why most discovered true associations are inflated. Epidemiology 19, 640–648. doi: 10.1097/EDE.0b013e31818131e7
Karvelis, P., Charlton, C. E., Allohverdi, S. G., Bedford, P., Hauke, D. J., and Diaconescu, A. O. (2022). Computational approaches to treatment response prediction in major depression using brain activity and behavioral data: a systematic review. Netw. Neurosci. 6, 1–52. doi: 10.1162/netn_a_00233
Karvelis, P., Hauke, D. J., Wobmann, M., Andreou, C., Mackintosh, A., de Bock, R., et al. (2024). Test-retest reliability of behavioral and computational measures of advice taking under volatility. PLoS ONE 19:e0312255. doi: 10.1371/journal.pone.0312255
Karvelis, P., Paulus, M. P., and Diaconescu, A. O. (2023). Individual differences in computational psychiatry: a review of current challenges. Neurosci. Biobehav. Rev. 148:105137. doi: 10.1016/j.neubiorev.2023.105137
Kennedy, J. T., Harms, M. P., Korucuoglu, O., Astafiev, S. V., Barch, D. M., Thompson, W. K., et al. (2022). Reliability and stability challenges in ABCD task fMRI data. Neuroimage 252:119046. doi: 10.1016/j.neuroimage.2022.119046
Kirk, R. E. (1996). Practical significance: a concept whose time has come. Educ. Psychol. Meas. 56, 746–759. doi: 10.1177/0013164496056005002
Kirk, R. E. (2001). Promoting good statistical practices: some suggestions. Educ. Psychol. Meas. 61, 213–218. doi: 10.1177/00131640121971185
Koo, T. K., and Li, M. Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J. Chiropr. Med. 15, 155–163. doi: 10.1016/j.jcm.2016.02.012
Kraemer, H. C. (1979). Ramifications of a population model for κ as a coefficient of reliability. Psychometrika 44, 461–472. doi: 10.1007/BF02296208
LeBel, E. P., and Paunonen, S. V. (2011). Sexy but often unreliable: the impact of unreliability on the replicability of experimental findings with implicit measures. Pers. Soc. Psychol. Bull. 37, 570–583. doi: 10.1177/0146167211400619
Liljequist, D., Elfving, B., and Skavberg Roaldsen, K. (2019). Intraclass correlation-a discussion and demonstration of basic features. PLoS ONE 14:e0219854. doi: 10.1371/journal.pone.0219854
Lipsey, M. W., Puzio, K., Yun, C., Hebert, M. A., Steinka-Fry, K., Cole, M. W., et al. (2012). Translating the Statistical Representation of the Effects of Education Interventions into More Readily Interpretable Forms. Washington, DC: National Center for Special Education Research.
Lord, F. M., and Novick, M. R. (1968). Statistical Theories of Mental Test Scores. Addison-Wesley, Reading, MA.
Loth, E., Ahmad, J., Chatham, C., López, B., Carter, B., Crawley, D., et al. (2021). The meaning of significant mean group differences for biomarker discovery. PLoS Comput. Biol. 17:e1009477. doi: 10.1371/journal.pcbi.1009477
MacCallum, R. C., Zhang, S., Preacher, K. J., and Rucker, D. D. (2002). On the practice of dichotomization of quantitative variables. Psychol. Methods 7:19. doi: 10.1037//1082-989X.7.1.19
McCartney, K., and Rosenthal, R. (2000). Effect size, practical importance, and social policy for children. Child Dev. 71, 173–180. doi: 10.1111/1467-8624.00131
McGough, J. J., and Faraone, S. V. (2009). Estimating the size of treatment effects: moving beyond p values. Psychiatry 6:21.
Milham, M. P., Vogelstein, J., and Xu, T. (2021). Removing the reliability bottleneck in functional magnetic resonance imaging research to achieve clinical utility. JAMA Psychiatry 78, 587–588. doi: 10.1001/jamapsychiatry.2020.4272
Mokkink, L. B., Terwee, C. B., Patrick, D. L., Alonso, J., Stratford, P. W., Knol, D. L., et al. (2010). The COSMIN checklist for assessing the methodological quality of studies on measurement properties of health status measurement instruments: an international Delphi study. Qual. Life Res. 19, 539–549. doi: 10.1007/s11136-010-9606-8
Moriarity, D. P., and Alloy, L. B. (2021). Back to basics: the importance of measurement properties in biological psychiatry. Neurosci. Biobehav. Rev. 123, 72–82. doi: 10.1016/j.neubiorev.2021.01.008
Naggara, O., Raymond, J., Guilbert, F., Roy, D., Weill, A., and Altman, D. G. (2011). Analysis by categorizing or dichotomizing continuous variables is inadvisable: an example from the natural history of unruptured aneurysms. Am. J. Neuroradiol. 32, 437–440. doi: 10.3174/ajnr.A2425
Nikolaidis, A., Chen, A. A., He, X., Shinohara, R., Vogelstein, J., Milham, M., et al. (2022). Suboptimal phenotypic reliability impedes reproducible human neuroscience. bioRxiv. doi: 10.1101/2022.07.22.501193
Nitsch, F. J., Lüpken, L. M., Lüschow, N., and Kalenscher, T. (2022). On the reliability of individual economic rationality measurements. Proc. Nat. Acad. Sci. 119:e2202070119. doi: 10.1073/pnas.2202070119
Noether, G. E. (1987). Sample size determination for some common nonparametric tests. J. Am. Stat. Assoc. 82, 645–647. doi: 10.1080/01621459.1987.10478478
Nosek, B. A., Ebersole, C. R., DeHaven, A. C., and Mellor, D. T. (2018). The preregistration revolution. Proc. Nat. Acad. Sci. 115, 2600–2606. doi: 10.1073/pnas.1708274114
Nunnally, J. C., and Bernstein, I. H. (1994). Psychometric Theory. McGraw-Hill, New York, NY, 3rd edition.
Olejnik, S., and Algina, J. (2000). Measures of effect size for comparative studies: applications, interpretations, and limitations. Contemp. Educ. Psychol. 25, 241–286. doi: 10.1006/ceps.2000.1040
Ones, D. S., and Anderson, N. (2002). Gender and ethnic group differences on personality scales in selection: some British data. J. Occup. Organ. Psychol. 75, 255–276. doi: 10.1348/096317902320369703
Parsons, S., Kruijt, A.-W., and Fox, E. (2019). Psychological science needs a standard practice of reporting the reliability of cognitive-behavioral measurements. Adv. Methods Pract. Psychol. Sci. 2, 378–395. doi: 10.1177/2515245919879695
Paulus, M. P., and Thompson, W. K. (2019). The challenges and opportunities of small effects: the new normal in academic psychiatry. JAMA Psychiatry 76, 353–354. doi: 10.1001/jamapsychiatry.2018.4540
Perna, L. W. (2005). The benefits of higher education: sex, racial/ethnic, and socioeconomic group differences. Rev. High. Educ. 29, 23–52. doi: 10.1353/rhe.2005.0073
Pianta, R. C., Barnett, W. S., Burchinal, M., and Thornburg, K. R. (2009). The effects of preschool education: what we know, how public policy is or is not aligned with the evidence base, and what we need to know. Psychol. Sci. Public Interest 10, 49–88. doi: 10.1177/1529100610381908
Rea-Sandin, G., Korous, K. M., and Causadias, J. M. (2021). A systematic review and meta-analysis of racial/ethnic differences and similarities in executive function performance in the United States. Neuropsychology 35:141. doi: 10.1037/neu0000715
Regier, D. A., Narrow, W. E., Clarke, D. E., Kraemer, H. C., Kuramoto, S. J., Kuhl, E. A., et al. (2013). DSM-5 field trials in the United States and Canada, part II: test-retest reliability of selected categorical diagnoses. Am. J. Psychiatry 170, 59–70. doi: 10.1176/appi.ajp.2012.12070999
Roth, P. L., Bevier, C. A., Bobko, P., Switzer, F. S. III, and Tyler, P. (2001). Ethnic group differences in cognitive ability in employment and educational settings: a meta-analysis. Pers. Psychol. 54, 297–330. doi: 10.1111/j.1744-6570.2001.tb00094.x
Rouder, J. N., Kumar, A., and Haaf, J. M. (2023). Why many studies of individual differences with inhibition tasks may not localize correlations. Psychon. Bull. Rev. 30, 2049–2066. doi: 10.3758/s13423-023-02293-3
Royston, P., Altman, D. G., and Sauerbrei, W. (2006). Dichotomizing continuous predictors in multiple regression: a bad idea. Stat. Med. 25, 127–141. doi: 10.1002/sim.2331
Schaaf, J. V., Weidinger, L., Molleman, L., and van den Bos, W. (2023). Test-retest reliability of reinforcement learning parameters. Behav. Res. Methods 56, 1–18. doi: 10.3758/s13428-023-02203-4
Schäfer, T., and Schwarz, M. A. (2019). The meaningfulness of effect sizes in psychological research: differences between sub-disciplines and the impact of potential biases. Front. Psychol. 10:442717. doi: 10.3389/fpsyg.2019.00813
Shrout, P. E. (1998). Measurement reliability and agreement in psychiatry. Stat. Methods Med. Res. 7, 301–317. doi: 10.1191/096228098672090967
Sidebotham, D., and Barlow, C. (2024). The winner's curse: why large effect sizes in discovery trials always get smaller and often disappear completely. Anaesthesia 79, 86–90. doi: 10.1111/anae.16161
Spearman, C. (1904). The proof and measurement of association between two things. Am. J. Psychol. 15, 72–101. doi: 10.2307/1412159
Streiner, D. L. (2002). Breaking up is hard to do: the heartbreak of dichotomizing continuous data. Can. J. Psychiatry 47, 262–266. doi: 10.1177/070674370204700307
Szucs, D., and Ioannidis, J. P. (2017). Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature. PLoS Biol. 15:e2000797. doi: 10.1371/journal.pbio.2000797
Tiego, J., Martin, E. A., DeYoung, C. G., Hagan, K., Cooper, S. E., Pasion, R., et al. (2023). Precision behavioral phenotyping as a strategy for uncovering the biological correlates of psychopathology. Nat. Ment. Health 1, 304–315. doi: 10.1038/s44220-023-00057-5
Vedel, A. (2016). Big five personality group differences across academic majors: a systematic review. Pers. Individ. Dif. 92, 1–10. doi: 10.1016/j.paid.2015.12.011
Vevea, J. L., and Hedges, L. V. (1995). A general linear model for estimating effect size in the presence of publication bias. Psychometrika 60, 419–435. doi: 10.1007/BF02294384
Vrizzi, S., Najar, A., Lemogne, C., Palminteri, S., and Lebreton, M. (2025). Behavioral, computational and self-reported measures of reward and punishment sensitivity as predictors of mental health characteristics. Nat. Ment. Health 3, 1–13. doi: 10.1038/s44220-025-00427-1
Zhang, W., and Kappenman, E. S. (2024). Maximizing signal-to-noise ratio and statistical power in ERP measurement: single sites versus multi-site average clusters. Psychophysiology 61:e14440. doi: 10.1111/psyp.14440
Zöllner, S., and Pritchard, J. K. (2007). Overcoming the winner's curse: estimating penetrance parameters from case-control data. Am. J. Hum. Genet. 80, 605–615. doi: 10.1086/512821
Zorowitz, S., and Niv, Y. (2023). Improving the reliability of cognitive task measures: a narrative review. Biol. Psychiatry Cogn. Neurosci. Neuroimaging. 8, 789–797. doi: 10.31234/osf.io/phzrb
Keywords: reliability paradox, test-retest reliability, individual differences, group differences, group effects, measurement reliability, effect size attenuation, clinical translation
Citation: Karvelis P and Diaconescu AO (2025) Clarifying the reliability paradox: poor measurement reliability attenuates group differences. Front. Psychol. 16:1592658. doi: 10.3389/fpsyg.2025.1592658
Received: 12 March 2025; Accepted: 05 September 2025;
Published: 15 October 2025.
Edited by:
Jason W. Osborne, Miami University, United States
Reviewed by:
Nikolaus Bezruczko, The Chicago School of Professional Psychology, United States
Ron Jonathan Pat-El, Open University of the Netherlands, Netherlands
Copyright © 2025 Karvelis and Diaconescu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Povilas Karvelis, povilas.karvelis@camh.ca