Rank–Polyserial Correlation: A Quest for a “Missing” Coefficient of Correlation

In the typology of coefficients of correlation, we seem to be missing such estimators of correlation as the rank-polyserial (R RPS) and rank-polychoric (R RPC) coefficients of correlation. This article discusses a set of options for R RP, including both R RPS and R RPC. A new coefficient JT gX based on the Jonckheere-Terpstra test statistic is derived, and it is shown to carry the essence of R RP. Such traditional estimators of correlation as Goodman-Kruskal gamma (G) and Somers delta (D) as well as dimension-corrected gamma (G2) and delta (D2) are shown to have a strict connection to JT gX, and, hence, they also fulfil the criteria for being relevant options to be taken as R RP. These estimators with a directional nature suit ordinal-scaled variables as well as an ordinal- vs. interval-scaled variable. The behaviour of the estimators of R RP is studied within measurement modelling settings by using the point-polyserial, coefficient eta, polyserial, and polychoric correlation coefficients as benchmarks. The statistical properties, differences, and limitations of the coefficients are discussed.


INTRODUCTION
Over the years, scholars have developed many estimators of the association of two variables X and Y, depending on their scale properties. Usually, these are based either on the covariance between X and Y (e.g., Pearson's tetrachoric, biserial, polyserial, point-biserial, point-polyserial, or polychoric correlation) or the ratio of the favourable combinations and all combinations (e.g., Cureton's rank-biserial correlation, Goodman-Kruskal tau, lambda, gamma, Kendall's tau family, Pearson's eta and phi, or Somers' delta). These coefficients are divided into coefficients of observed and inferred association (see [1]). The observed association is estimated for the manifested variables and the inferred association for the latent variables or for the combination of an observed and a latent variable (see Figure 1). This difference between the latent and observed variables is discussed first, after which the factual estimators are discussed.

Statistical Model Latent to Correlations
Assume two continuous variables ξ and η with an unknown joint distribution. For the later use of variables with different scales related to the proposed rank-polyserial coefficients of correlation, let these latent variables be manifested as two observed variables: g with a narrower scale with x i, i = 1, . . . , r distinctive ordinal categories, and X with a wider scale with y j, j = 1, . . . , c distinctive categories with metric properties (ordinal, interval, pseudo-continuous 1, or continuous scales), with r << c. The variable g is related to ξ and X to η with the class limits or thresholds γ i and τ j so that

g = x i, if γ i−1 ≤ ξ < γ i, i = 1, 2, . . . , r

and

X = y j, if τ j−1 ≤ η < τ j, j = 1, 2, . . . , c.

For the observed values, x 1 < x 2 < ... < x r and y 1 < y 2 < ... < y c, and, for convenience, γ 0 = τ 0 = −∞ and γ r = τ c = +∞. The relation of the variables with the related symbols is illustrated in Figure 1, where n gX denotes the number of times the observation (g, X) is obtained in the sample, and the latent variables are assumed to be normally distributed.

FIGURE 1 | Two latent variables η and ξ manifested as X and g with ordinal or interval scales.
In the measurement modelling settings used in the numerical examples, an item g and a score variable X compiled by a set of test items share the common latent variable θ, such as achievement in mathematics, which is manifested in two variables, item g and the test score X. As above, the threshold values of θ for each category in g and X are denoted by γ i and τ j, respectively. Hence, g with i = 1, . . . , r distinctive ordinal categories and the score variable X with j = 1, . . . , c distinctive categories with a metric scale are related to θ so that g = x i, if γ i−1 ≤ θ < γ i, i = 1, 2, . . . , r, and X = y j, if τ j−1 ≤ θ < τ j, j = 1, 2, . . . , c, with γ 0 = τ 0 = −∞ and γ r = τ c = +∞. Often, the scoring starts from zero, which is illustrated in Figure 2. Then, the degrees of freedom are df(g) = r − 1 and df(X) = c − 1.

1 The contemporary measurement modelling settings result in score variables that are, factually, categorical ones, either ordinal, interval-scaled, or of pseudo-continuous type, with a limited number of categories (see the discussion in, e.g., [2]). Obtaining a truly continuous scale necessitates a truly continuous scale in at least one test item and a very large number of test takers; such a scale always leads to an even distribution because each test taker will get a unique score. Truly continuous scales are very rare in testing settings. A raw score usually forms a categorical (ordinal or interval-scaled) score, and one-parameter item response theory (IRT) modelling or Rasch modelling forms a pseudo-continuous score where the number of categories in the score variable equals the number of raw-score categories, but the "names" of the categories come from a continuous scale. The scores by factor analysis and two- or three-parameter IRT models produce scales with more categories, although these, too, are pseudo-continuous scales where the number of categories is strictly bound to the number of test takers, the number of items, and the number of categories in the items.

Multitudes of Coefficients of Association
The estimators of the association are really many. Olsson et al. [1] collected some estimators as a typology, and their work is elaborated and rethought in what follows (see Table 1).
At the beginning of the twentieth century, Karl Pearson initiated and developed many coefficients for the observed association that are still in use. The mechanics of the product-moment correlation coefficient between two observed variables with a metric scale (PMC; Pearson [4] onwards, based on Bravais [5]) is used in the point-biserial correlation (R PB = ρ gX) between an observed dichotomized or binary g and a metric-scaled X and in the point-polyserial correlation (R PP = ρ gX) between a polytomous ordinal g and a metric-scaled X. These are classic estimators of the item-score association in the measurement modelling settings.
Pearson also presented coefficient phi [6], suitable for two observed nominal-scaled variables, and coefficient eta [7,8], suitable for an observed nominal- or ordinal-scaled variable and a metric variable. Later, such robust, non-parametric coefficients were developed for the observed association for ordinal-scaled variables as Goodman's and Kruskal's lambda, tau, and gamma (G) [9,10], the family of Kendall's tau ([11] onwards), and the family of Somers' D [12], including Cureton's rank-biserial correlation (R RB) [3,13,14]; R RB is a special case of D in the case of a binary variable g and an ordinal-scaled X (see Newson [15]). This relationship is deepened later.

Notes to Table 1: (a) These are directional coefficients. (b) Tetrachoric and bichoric correlations are special cases of polychoric correlation, and biserial correlation is a special case of polyserial correlation. (c) Bichoric is a new term to cover estimators of the observed correlation between dichotomous and polytomous variables. It is a special case of the polychoric coefficient, which is also to be invented. (d) The rank-polyserial correlation discussed in this article is, factually, a rank-polychoric correlation in the same manner as the traditional rank-biserial correlation coefficient by Cureton [3] is, factually, a rank-bichoric coefficient. However, there are no technical reasons why the coefficients could not be used with interval-scaled (or better) variables, although the factual values in the scale are not used in the analysis.

For the coefficients for the inferred association between the latent variables ξ and η, the best known is the polychoric correlation (R PC = ρ ξη) and its special case, the tetrachoric correlation, suitable for two latent, dichotomized variables (R TC = ρ ξη) [16,17]. Pearson also initiated the polyserial correlation (R PS = ρ ξX) between a latent variable related to the variable with a shorter scale (ξ) and an observed X, and its special case, the biserial correlation for a dichotomized ξ and an observed X (R BS) [17,18]. Common to all of these is that we intend to infer what the correlation between the variables could be if they were measured in their latent, unobservable form.
When it comes to the factual estimation of the inferred association of two observed ordinal or interval-scaled variables with (theoretical, unobservable) latent variables, we have established routines for estimating R PC (e.g., [19][20][21]; see also [22]), as well as R BS and R PS (see [23]). The traditional routines of calculating the estimates of R BS and R PS led, however, to the practical problem that the estimates reach out-of-range values (R BS, R PS >> +1) if the embedded PMC and the item variance are high to start with (e.g., [24]; see the discussion in, e.g., [25,26]; see the computational form in Eq. 34). One of the best alternatives for R BS and R PS, if not the best, is a coefficient called r-bireg and r-polyreg correlation (R REG; see [27,28]; see Eq. 37), which has behaved quite optimally in simulations (see, e.g., [29]).

Missing Coefficients of Correlation
As highlighted in Table 1, we seem to be missing a set of coefficients for ordinal variables: rank-polyserial and rank-polychoric correlation coefficients (R RP) for the observed association between ordinal-scaled or metric variables. In the ERIC database with more than 1.2 million articles and research papers, we found no hits with the fixed keywords "rank polyserial" or "rank polychoric" at the time of finalising the article (April 2022). Nevertheless, some possible options for R RP are available. These are discussed in this article.

Research Questions
This article discusses and studies the characteristics of a set of coefficients of correlation that could be called either rank-polyserial or rank-polychoric correlation coefficients. In what follows, the name rank-polyserial is preferred because of its connection to the rank-biserial (R RB) correlation by Cureton. Although the options for R RP discussed here are not restricted to item analysis settings, their characteristics are studied in the framework of measurement modelling. After all, estimators of the association have a central role to play, for example, as estimators of the item-score association and embedded in estimators of reliability (see discussion in, e.g., [26,29,30]). This perspective leads us to compare the options for R RP with their traditional alternatives: R PP = ρ gX, often called item-total correlation (Rit); Henrysson's corrected item-total correlation (R PPH) [31], also known as item-rest correlation (Rir); coefficient eta directed so that X explains the order in g or "g given X" 2, that is, η(g|X); as well as the polyserial and polychoric coefficients of correlation (R PS, R PC).

2 A specific peculiarity in the naming of the directions may be necessary to discuss here (see also [2,29,32]) because the directional estimators are used in what follows. With the truly directional estimators D and eta, in widely used software packages such as IBM SPSS, SAS, and R libraries, this specific direction is traditionally named "X dependent" (see, e.g., [15,[33][34][35][36][37]). This naming is relevant in the general linear modelling (GLM) settings related to eta squared, where the score cannot explain, for example, the sex of the test takers (g), and, hence, the score X must be a dependent variable. However, the opposite thinking for the same direction is used in the measurement modelling settings, where the score variable explains the behaviour in the test item and the other direction does not make sense (see, e.g., [38,39]). This is the case also in nonparametric testing with the U-test or the Jonckheere-Terpstra test, as examples, where the idea is to first order the cases by X, after which the order of the cases in the item is analyzed. Hence, the score explains the order of the observations in g, and then this direction should be named "g dependent" or "g given X". In this article, the notations D(g|X) and η(g|X) refer to the direction of "g given X", which, in the widely used statistical packages, would be named "X dependent".

After introducing a set of possible coefficients relevant to be taken as R RP, the following questions are asked:


Rank-Biserial Correlation and U-Test Statistic

Assume two sub-samples i = 0 and j = 1, where i and j could be males and females or incorrect and correct answers in a test item. The standard procedure of the U-test produces two statistics, where U 1 refers to the higher values and U 2 refers to the lower values (see the estimation of the U-test in, e.g., [33,37]). Wendt's [14] modification of ρ RB is related to the lower of the groups i, j regardless of the statistics U 1 and U 2; this is discussed later. The original idea by Cureton was based on the proportion of favourable cases (f) and unfavourable cases (u), and this idea is used later in deriving the corresponding rank-polyserial correlation. To compute the proportion of favourable cases, the mechanics and heuristics of the U-test statistics can be used. The heuristic of the observed U statistic is to compute the number of "favourable" incidents, that is, how many observations from the subsample i = 0 fall below each observation from the subsample j = 1 after the variable of interest g is ordered by a metric variable X. If no tied pairs are obtained, the observed U statistic related to the sub-population j = 1 (U Obs gX) is the sum of those sums (see, e.g., [33,37]). With tied cases, Wilcoxon's method [42] produces the correct value; this is discussed later.
The maximal value of the U statistic is reached when all cases in the subsample i = 0 (altogether, n 0 cases) are ranked lower than all cases in the subsample j = 1 (altogether, n 1 cases):

U Max gX = n 0 × n 1 (4)

In the binary (ordinal) case, the proportion of "favourable" cases is the ratio of the observed and maximal U statistics, U Obs gX / U Max gX, which varies between 0 and 1. This ratio is rescaled by multiplying it by 2 and relocated by −1, and we get values ranging from −1 to +1, as is standard in coefficients of correlation:

R RB = 2 × U Obs gX / U Max gX − 1 (5)

Notably, Eq. (5) is identical to Eq. (1), while Wendt's formula (Eq. 2) is based on the subsample i = 0 and Eq. (5) is based on the subsample j = 1. All in all, R RB is the rescaled and relocated proportion of logically (ascending) located cases within the categorical (ordinal) variable g after they are ordered by the metric variable X. Notably, Eq. (5) strictly corresponds with Cureton's idea (R RB = 2 × f − 1; see Eq. 3). The further the erroneous locations are from the deterministic position and the more of these erroneous locations there are, the lower the value of the estimate. This is illustrated later with a numerical example.
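The mechanic above can be sketched in code (an illustrative Python sketch; the function names and toy data are mine, not from the article). Tied pairs are counted as 0.5, following Wilcoxon's convention mentioned above:

```python
def u_statistic(x0, x1):
    """Observed U for the subsample j = 1: count how many cases in x0
    (subsample i = 0) fall below each case in x1; a tied pair counts 0.5."""
    return sum(1.0 if a < b else 0.5 if a == b else 0.0
               for b in x1 for a in x0)

def rank_biserial(x0, x1):
    """Cureton's R_RB as the rescaled proportion of favourable pairs:
    R_RB = 2 * U_obs / U_max - 1, with U_max = n0 * n1."""
    u_max = len(x0) * len(x1)
    return 2.0 * u_statistic(x0, x1) / u_max - 1.0

# Deterministic ordering: every case in group 1 exceeds every case in group 0
print(rank_biserial([1, 2, 3], [4, 5, 6]))   # 1.0
# Fully reversed ordering
print(rank_biserial([4, 5, 6], [1, 2, 3]))   # -1.0
```

With the deterministic pattern the ratio U Obs gX / U Max gX is 1, so the rescaled coefficient reaches +1, matching the interpretation in the text.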

Rank-Polyserial Correlation and JT-Test Statistic
The Jonckheere-Terpstra test statistic (JT) [43,44], also known as the Jonckheere trend test [43], with a directional nature, generalises the U-test statistic and its heuristic to polytomous ordinal cases (see [33,37]). This connection is used in proposing a new estimator of correlation, carrying characteristics relevant to R RP.
Frontiers in Applied Mathematics and Statistics | www.frontiersin.org

Assume an ordinal variable g with observed categories i, j = 1, . . . , r, with i < j, and the metric variable X. Then, n i and n j are the numbers of cases in the subsamples i and j in variable g. In the 5-point Likert scale, as an example, one pair of subsamples is i = 1 and j = 4. In general, we have r(r − 1)/2 possible values for the U Obs gXij statistics. In the case of the 5-point Likert scale, as an example, we obtain 5 × 4/2 = 10 values: U 12, U 13, U 14, U 15, U 23, U 24, U 25, U 34, U 35, and U 45. In the computational procedure in what follows, the sum of the ranks in the higher of the subsamples i and j is used. In the same manner as with R RB, the essence of the new coefficient is the ratio of the observed and maximal JT statistics (JT Obs gX and JT Max gX, respectively). JT Obs gX can be expressed by using the U statistic (see [33,37]):

JT Obs gX = Σ i<j U Obs gXij (6)

where U Obs gXij refers to the observed U statistics related to all the pairs of subsamples i and j. This statistic indicates the number of "favourable" incidents where, after ordering by X, the cases with a higher value in X also have a higher value in g. The observed U statistic for the observed JT statistic can be computed by using Wilcoxon's W statistic [42]:

U Obs gXij = W j − n j (n j + 1)/2 (7)

where W j is the sum of the ranks of the higher of the subsamples i and j. The maximal value of each U statistic is reached when all the test-takers in the subsample j (altogether, n j cases) are ranked higher than all the test-takers in the subsample i (altogether, n i cases). Hence, with each pair of subsamples,

U Max gXij = n i × n j (8)

and

JT Max gX = Σ i<j n i n j (9)

where r refers to the number of categories in the variable with the narrower scale (g), j refers to the higher of the subsamples i and j in g, and W j is the sum of the ranks in the higher of the subsamples i and j.
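The Wilcoxon route of Eq. (7) can be checked numerically (an illustrative Python sketch; `midranks` and the toy data are mine). With ties handled by average ranks, W j − n j(n j + 1)/2 agrees with the direct pairwise counting in which a tied pair contributes 0.5:

```python
def midranks(values):
    """Rank all values 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j + 2) / 2.0          # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def u_from_w(xi, xj):
    """U for the higher subsample j via Wilcoxon's W:
    U = W_j - n_j * (n_j + 1) / 2, where W_j sums the ranks of subsample j."""
    ranks = midranks(xi + xj)
    w_j = sum(ranks[len(xi):])
    n_j = len(xj)
    return w_j - n_j * (n_j + 1) / 2.0

def u_direct(xi, xj):
    """Direct pair counting; a tied pair counts 0.5."""
    return sum(1.0 if a < b else 0.5 if a == b else 0.0
               for b in xj for a in xi)

xi, xj = [1, 2, 2, 5], [2, 3, 5, 6]      # toy data with ties across subsamples
print(u_from_w(xi, xj), u_direct(xi, xj))   # both 12.5
```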
The core of the coefficient JT gX is the ratio of the observed and maximal JT statistics, JT Obs gX / JT Max gX, that is, the proportion of "favourable" cases in the spirit of Cureton, which varies between 0 and 1. This ratio is rescaled by multiplying it by 2 and relocated by −1 to the same scale as the Pearson correlation:

JT gX = 2 × JT Obs gX / JT Max gX − 1 (10)

This coefficient could be called either rank-polyserial correlation, as a legacy of Cureton's rank-biserial correlation, or rank-polychoric coefficient, as a robust counterpart to the classic polychoric correlation; here, the former is used, but the abbreviation R RP is used to keep both interpretations open.
The value JT gX = +1 indicates the positive deterministic pattern in g: after being ordered by X, all the observations in the higher subsample(s) j are ordered higher than those in the lower subsample(s) i. By using the concept of "concordant pairs", familiar from many robust coefficients of association such as Somers' D and Goodman-Kruskal G, JT gX = +1 refers to the situation where all the pairs of observations are concordant. The further the erroneous locations are from the deterministic position and the more of these erroneous locations there are, the closer the magnitude of JT gX is to zero. The value JT gX = 0 refers to a situation where the observations are randomly spread in variable g after being ordered by X. The value JT gX = −1 indicates the ultimate situation that all the cases in the lower subsample(s) i are ranked higher than those in the higher subsample(s) j. By using the concept of "discordant pairs", the latter refers to the situation in which all the pairs of observations are discordant. The interpretation of the magnitude of the estimates by JT gX is the same as that of R PP (ρ gX), with the note that, in real-life datasets, R PP cannot reach a perfect +1 or −1 (see, e.g., [29,45]) while JT gX can.
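The whole coefficient can be sketched end to end for a polytomous g (an illustrative Python sketch; the function name and data are mine): sum the pairwise U statistics over all r(r − 1)/2 category pairs, divide by Σ i<j n i n j, and rescale:

```python
from itertools import combinations

def jt_gx(g, x):
    """JT_gX = 2 * JT_obs / JT_max - 1, where JT_obs sums the U statistics
    over all ordered category pairs i < j of g (ties in X count 0.5)."""
    cats = sorted(set(g))
    groups = {c: [xv for gv, xv in zip(g, x) if gv == c] for c in cats}
    jt_obs = jt_max = 0.0
    for ci, cj in combinations(cats, 2):        # all i < j category pairs
        xi, xj = groups[ci], groups[cj]
        jt_obs += sum(1.0 if a < b else 0.5 if a == b else 0.0
                      for b in xj for a in xi)
        jt_max += len(xi) * len(xj)
    return 2.0 * jt_obs / jt_max - 1.0

g = [1, 1, 2, 2, 3, 3]
print(jt_gx(g, [10, 11, 12, 13, 14, 15]))   # deterministic ascending: 1.0
print(jt_gx(g, [15, 14, 13, 12, 11, 10]))   # fully reversed: -1.0
```

The two extreme patterns reproduce the boundary interpretations given above, which a real-life R PP could not reach.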
In the specific case that there are only two categories in g, for example, when only two categories are obtained in a Likert type of scale (see the numerical example later) or in a binary case, U Obs gXij includes only one U statistic, U ij, and JT Obs gX is reduced to the ordinary U-test statistic related to the higher of the subsamples i, j. Hence, because of Eqs. (10), (2), (4), and (5), in the binary or dichotomous case, R RB is a special case of JT gX:

JT gX = R RB (11)

Identity of JT gX and Somers D
JT gX has the identity of Somers' D directed so that "g given X", that is, D(g|X), henceforth plainly D. Assume two variables: an ordinal g with subsamples i = 1, . . . , r with observed values x i, and an ordinal X with subsamples j = 1, . . . , c with observed values y j, sampled from the same bivariate distribution forming an r × c cross-tabulation.
The number of cases in the subsamples related to g is n i, and the total number of cases is n. The computational form of D directed so that "g given X" can be expressed as (e.g., [32,33,37])

D(g|X) = 2(P − Q) / (n² − Σ i n i²) (12)

where P is the sum of the concordant pairs of two observations x i and x j and, correspondingly, y i and y j, and Q is the sum of the discordant pairs. The denominator in Eq. (12),

D g = n² − Σ i n i² (13)

can be expressed by using the sizes of the pairs of subsamples (see [32]):

Σ i<j n i n j = (n² − Σ i n i²)/2 (14)

Hence, because of Eqs. (13) and (14), D can be rewritten as

D(g|X) = (P − Q) / Σ i<j n i n j (15)

When all the pairs are concordant, Q equals 0, and

D(g|X) = P Max / Σ i<j n i n j = 1 (16)

and, consequently,

P Max = Σ i<j n i n j (17)

The statistic Q is strictly related to P Max:

Q = P Max − P = Σ i<j n i n j − P (18)

Hence, because of Eqs. (17) and (18),

P − Q = 2P − Σ i<j n i n j (19)

and D(g|X) can be rewritten as

D(g|X) = 2P / Σ i<j n i n j − 1 (20)

Notably, from the JT statistic viewpoint, the observed JT statistic is the number of concordant pairs in the positive direction:

JT Obs gX = P (21)

Because of Eqs. (10), (21), and (20), JT gX can be expressed as

JT gX = 2 × JT Obs gX / JT Max gX − 1 = 2P / Σ i<j n i n j − 1 = D(g|X) (22)

Hence, because both JT gX (= R RP) and Cureton's (and Glass's and Wendt's) R RB are special cases of Somers' D (for the derivation for R RB, see [15]), these coefficients form a family related to Somers' D(g|X); see other coefficients and test statistics related to the same family in, e.g., Metsämuuronen [32]. Although JT gX is a specific case of Somers' D(g|X), its advantage over D in measurement modelling settings is that it leads us strictly to the correct one of the three alternatives produced by the standard procedures of calculating Somers' D.
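The identity derived above can be checked numerically (an illustrative Python sketch; the helper names and data are mine, not from the article): D(g|X) computed from concordant and discordant pairs coincides with 2 × JT Obs gX / JT Max gX − 1 when X has no tied values:

```python
from itertools import combinations

def somers_d_g_given_x(g, x):
    """D(g|X) = 2(P - Q) / (n^2 - sum n_i^2); P and Q counted once per pair."""
    p = q = 0
    for (g1, x1), (g2, x2) in combinations(list(zip(g, x)), 2):
        if (g1 - g2) * (x1 - x2) > 0:
            p += 1
        elif (g1 - g2) * (x1 - x2) < 0:
            q += 1
    n = len(g)
    counts = {c: g.count(c) for c in set(g)}
    denom = (n * n - sum(v * v for v in counts.values())) / 2.0
    return (p - q) / denom

def jt_gx(g, x):
    """JT_gX as the rescaled ratio of observed and maximal JT statistics."""
    cats = sorted(set(g))
    groups = {c: [xv for gv, xv in zip(g, x) if gv == c] for c in cats}
    obs = mx = 0.0
    for ci, cj in combinations(cats, 2):
        xi, xj = groups[ci], groups[cj]
        obs += sum(1.0 if a < b else 0.5 if a == b else 0.0
                   for b in xj for a in xi)
        mx += len(xi) * len(xj)
    return 2.0 * obs / mx - 1.0

g = [1, 2, 1, 3, 2, 3, 1]
x = [3, 8, 1, 9, 4, 7, 5]                # no ties in X
print(somers_d_g_given_x(g, x), jt_gx(g, x))   # both 0.75
```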
Because JT gX has the identity of D, it carries the same advantages and weaknesses as D does. One of the advantages is that the sampling variance and asymptotic standard errors of JT gX are known (see, e.g., [32,46,47]; see Supplementary Appendix 1). One of the weaknesses of D is that it tends to underestimate the item-score association in an obvious manner when the number of categories exceeds 3 (see [25,48,49]). This characteristic is discussed later.

Relation of JT gX and Goodman-Kruskal G
Although it is not a generally known fact, Goodman-Kruskal gamma (G) is a directional measure in the same direction as D(g|X) (see the proof in [32]), although G is usually taken as a symmetric measure (see, e.g., [50]) because it produces only one estimate of correlation. However, if there are no tied pairs in the dataset, as is the case when the metric variable has no tied cases (see other cases in [32]), G equals D(g|X) and not D(X|g) nor D (Symmetric).
Except for the special case of no tied cases, the estimates by G are more liberal than those by D. The main difference between G and D is how they treat the tied pairs. By using the concepts of concordant and discordant pairs (P and Q) as with D, G is computed by

G = (P − Q) / (P + Q) (23)

The number of all possible pairs of subsamples can be expressed as (see, e.g., [32])

2 Σ l<h n l n h = 2(P + Q) + 4T g (24)

where 4T g refers to the number of tied pairs, that is, the pairs where the direction is not known, and the magnitudes of P, Q, and T g are thought of as double-sized compared with what the simplified formulae indicate 3. In computing the probability by G, the tied pairs are omitted, and, hence, the number of pairs used in the estimation is

P + Q = Σ l<h n l n h − 2T g (25)

Then, because of Eq. (14) and Eq. (25), Q = Σ l<h n l n h − P − 2T g, and

G = (2P − Σ l<h n l n h + 2T g) / (Σ l<h n l n h − 2T g) (26)

and, because of Eq. (21),

G = 2 × JT Obs gX / (Σ l<h n l n h − 2T g) − 1 = 2 × JT Obs gX / (P + Q) − 1 (27)

(see footnote 3 for the discussion of the double content of P and, consequently, of T g) [32]. Hence, in G, the core is the probability measure JT Obs gX / (Σ l<h n l n h − 2T g) = JT Obs gX / (P + Q), referring to the proportion of "favourable" cases of logically (ascending) ordered observations in g after they are ordered by X, considering only those cases for which we know the order; the pairs of which the direction is not known are omitted. Notably, the same logic of computing the probability is used in such famous procedures as the sign test (traced to [52]; see [33]) and the Wilcoxon signed-rank test [42]. In the specific case that there are no tied pairs, T g = 0 and G = D = JT gX. Simulations within the measurement modelling settings have shown that the coefficients G and D have a major deficiency: they underestimate the association between an item and a score in an obvious manner with polytomous items having more than three categories (in D) or four categories (in G) (see Metsämuuronen [25,45,49]; see also [48]) 4.

Dimension-Corrected G and D as Options for the Coefficient of Rank-Polyserial Correlation
Because of the obviously conservative nature of G and D with polytomous items in measurement settings with wide-ish scales, Metsämuuronen [45,49] proposed two new estimators specific to the measurement modelling settings: dimension-corrected G and D (G 2 and D 2) 5. The computational form of G 2 is given in [45]; in it, G is the observed value of G, abs(G) is the absolute value of G, and A is a correction factor based on df(g) = (number of categories in variable g − 1). Correspondingly, the computational form of D 2 (originally in [49], corrected in [45]) uses the observed value D of D(g|X) and the same factor A as in Eq. (30). Sampling variances and asymptotic standard errors of these estimators are discussed in Metsämuuronen [45] (see also Supplementary Appendix 1).
Inherited from G and D, G 2 and D 2 are based on probability, but, because of the third-order element A, they are described as semi-trigonometric in nature (see [29]). The coefficients G 2 and D 2 can be used both with binary and polytomous items, and, when D = G = 1 and with binary items, G 2 = G and D 2 = D. In simulations [29,45], G 2 tends to give estimates with magnitudes close to those by R PC. Of G 2 and D 2, D 2 gives the more conservative estimates; this is inherited from the behaviour of D relative to G.
To condense the discussion so far: on the one hand, the new coefficient of correlation JT gX can be taken as a coefficient of rank-polyserial correlation with a directional nature of the same manner and the same direction as D, G, eta, and R PP are directed 6. JT gX is the general case of the classic rank-biserial coefficient of correlation R RB. On the other hand, JT gX has the identity of Somers' D directed so that "g given X" (or "X dependent" in the GLM settings), and, in the case that there are no tied pairs in the dataset, JT gX also has the identity of Goodman-Kruskal G. The statistical properties of JT gX are identical to those of D. Because the underlying coefficients G and D can be taken as rank-polyserial coefficients of correlation, G 2 and D 2 could be taken as dimension-corrected rank-polyserial coefficients of correlation.

NUMERICAL EXAMPLES OF COMPUTING DIFFERENT OPTIONS FOR R RP AND RELATED BENCHMARKING COEFFICIENTS
In what follows, the behaviour of JT gX = D, G, G 2 , and D 2 is studied in comparison with relevant benchmarking estimators in the context of measurement modelling: item-total correlation (Rit = R PP ), Henrysson's item-rest correlation (Rir = R PPH ), coefficient eta, polyserial correlation (R PS ), and polychoric correlation (R PC ). The computational forms of these estimators are discussed with numerical examples. First, a simple example is given with which the computation of the estimates is discussed in Section Simple Comparison of the Options for R RP . Second, a published dataset of 6,932 polytomous items from a real-world test is used to study the differences between the estimators in Section Comparison of the Estimates With a Larger Dataset. Third, their tendency to resist deflation in the estimates is discussed in Section Deflation in the Estimates.

Simple Comparison of the Options for R RP
Assume a simple dataset with four items with a Likert type of scale and the score X, as in Table 2. Two of the items (g 1 and g 2) represent a deterministically discriminating response pattern, while the two others (g 3 and g 4) include stochastic error either to a minor extent (g 3) or to a wider extent (g 4).
Item g 1 represents items with an extreme "difficulty" level, where we expect to see obvious underestimation by Rit, Rir, and coefficient eta (see [2]). With these kinds of items, G and D, and consequently JT gX, detect the deterministic pattern as G = D = JT gX = 1. Item g 2 is a deterministic one, but it includes a minor tie in X, and, hence, the estimate by D = JT gX is expected to be slightly more conservative in comparison with that by G. Items g 3 and g 4 have both tied cases and error in the order, and, hence, both G and D are expected to give estimates with magnitudes D̂ < Ĝ < 1. In all cases, the estimates by G 2 and D 2 are expected to be higher than those by G and D.
The manual calculation of the estimates of the coefficients is discussed by using item g 4 as an example. The statistics related to Wilcoxon's statistics are seen in Table 3 and the contingency table of g 4 × X in Table 4.

JT gX
By using the heuristics of the U-test statistic, assuming no ties, the sum of the statistics U g4Xij equals 49, and the ties add 1.5, totalling JT Obs g4X = 50.5 (Table 2). The same is obtained strictly by using the routine of Wilcoxon (Eq. 7; see Table 3). The maximal value is JT Max g4X = Σ i<j n i n j = 57 (Eq. 9; Table 2). This can also be obtained by using Eq. (14) and Table 4: the maximum value is JT Max g4X = (n² − Σ i n i²)/2 = (144 − 30)/2 = 57. The core of the estimator, JT Obs g4X / JT Max g4X = 50.5/57 = 0.886, indicates that 88.6% of the observations in item g 4 are logically (ascending) located after they are ordered by the score X.
For Table 2, the estimates by JT gX were computed manually by using the information from Table 2. However, in real-life settings, JT gX has the identity of D(g|X). Then, it is easy to use traditional software packages, such as IBM SPSS, Stata, SAS, or R packages, for the calculation. For instance, with IBM SPSS, the syntax for D is CROSSTABS /TABLES = item BY Score /STATISTICS = D. In Stata, a module by Newson [54] is available. In SAS, the command PROC FREQ provides D by specifying the TEST statement with the SMDCR option. Correspondingly, in R, D can be computed by SomersDelta(x, y = NULL, direction = c("row", "column"), conf.level = NA, ...) in the DescTools package (see https://rdrr.io/cran/DescTools/man/). From the output, the option "X dependent" is selected.

D and G
When it comes to the coefficients D and G, the statistics P and Q are needed for the manual calculation. These can be computed by using a contingency table (Table 4). By using the strategy of "count all entries that lie to the 'Southeast' of the particular entry" (see the manual calculation in, e.g., [33,37]), the number of pairs in the same direction is P = (2×) 49. In parallel, the number of pairs in the opposite direction is counted by using the strategy of "count all entries that lie to the 'Southwest' of the particular entry": Q = (2×) 5. Consequently, 2 × (P − Q) = 2 × 44 and 2 × (P + Q) = 2 × 54. The number of all pairs in the direction of "g given X" is D g = 12² − (2² + 2² + 3² + 3² + 2²) = 144 − 30 = 114. Then, G = 44/54 = 0.815 and D(g 4 |X) = (2 × 44)/114 = 0.772. Notably, the latter equals the estimate by JT g4X because of Eq. (22).
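The "Southeast/Southwest" counting strategy can be sketched for an arbitrary r × c table (an illustrative Python sketch; the toy table is hypothetical, not Table 4 from the article):

```python
def concordant_discordant(table):
    """P: for each cell, sum all entries strictly to the 'Southeast';
    Q: sum all entries strictly to the 'Southwest' (counted once per pair)."""
    r, c = len(table), len(table[0])
    p = q = 0
    for i in range(r):
        for j in range(c):
            se = sum(table[k][l] for k in range(i + 1, r) for l in range(j + 1, c))
            sw = sum(table[k][l] for k in range(i + 1, r) for l in range(j))
            p += table[i][j] * se
            q += table[i][j] * sw
    return p, q

table = [[3, 1, 0],
         [1, 2, 1],
         [0, 1, 3]]               # rows: categories of g, columns: categories of X
p, q = concordant_discordant(table)
n = sum(map(sum, table))
rows = [sum(r_) for r_ in table]
gamma = (p - q) / (p + q)                       # Goodman-Kruskal G
d_gx = 2 * (p - q) / (n * n - sum(m * m for m in rows))   # D(g|X), Eq. (12)
print(p, q, round(gamma, 3), round(d_gx, 3))
```

As in the worked example above, G uses only the untied pairs (P + Q), whereas D(g|X) divides by all pairs untied on g, so G ≥ D(g|X).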
For Table 2, the estimates by D and G were calculated manually based on contingency tables. In traditional software packages, such as IBM SPSS, for instance, the syntax for G is CROSSTABS /TABLES = item BY Score /STATISTICS = GAMMA, and the syntax for D is CROSSTABS /TABLES = item BY Score /STATISTICS = D. In Stata, the command tabulate g X [if] [in] [weight] [, gamma] produces G (see [55]), and Newson's module [54] produces D. In SAS, the command PROC FREQ provides G and D by specifying the TEST statement with the GAMMA and SMDCR options. Correspondingly, in R, G can be computed by GoodmanKruskalGamma(x, y = NULL, conf.level = NA, ...) and D by SomersDelta(x, y = NULL, direction = c("row", "column"), conf.level = NA, ...) in the DescTools package (see https://rdrr.io/cran/DescTools/man/). From the output related to D, the option "X dependent" is selected.
The dimension-corrected rank-polyserial coefficients D 2 and G 2 are based on the observed values of D and G and knowledge of the number of categories in the item. In the case of g 4, df(g) = 5 − 1 = 4. For Table 2, these were computed manually by using traditional spreadsheet software.

Rit and Rir
When it comes to the benchmarking estimators, the mechanics of PMC are used in the point-polyserial correlation, that is, in the item-total correlation for the observed association of the item and the score,

Rit = ρ gX = σ gX /(σ g σ X)

and in Henrysson's item-rest correlation,

Rir = ρ gP = σ gP /(σ g σ P)

where σ gX, σ g, and σ X are the covariance and the standard deviations of the item (g) and the score (X), and σ gP and σ P are the covariance and the standard deviation related to the item g and the modified score P, where the item of interest has been omitted from the compilation. For item g 4, the estimates are Rit = 0.799 and Rir = 0.639. In real-life testing settings, the magnitude of the estimates by Rir is always lower than that by Rit (see the algebraic reasons in, e.g., [56]), and both estimators underestimate the item-score association when the scales are not equal (e.g., [29]), as is the case with g 4 and X. From this viewpoint, it is known that, in the hypothetical example in Table 2, D necessarily underestimates the association in an obvious manner, as 0.772 < 0.799; this type of obvious underestimation was the reason why the dimension-corrected estimators D 2 and G 2 were developed (see the discussion in [25,45,49]). In the case of g 4, the estimate by G (0.815) exceeds the one by Rit. However, this is not true in general; when the number of categories in an item exceeds 4, as here, in real-life testing settings G tends to give estimates that are lower in magnitude than those by Rit (see [29]; see also Figure 3 later).
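The two benchmarking indices can be sketched directly from their definitions (an illustrative Python sketch; the item and score vectors are hypothetical, not the data of Table 2):

```python
from math import sqrt

def pearson(a, b):
    """Product-moment correlation (PMC) between two numeric vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b)) / n
    sa = sqrt(sum((u - ma) ** 2 for u in a) / n)
    sb = sqrt(sum((v - mb) ** 2 for v in b) / n)
    return cov / (sa * sb)

def rit(item, score):
    """Item-total correlation: PMC between the item and the total score."""
    return pearson(item, score)

def rir(item, score):
    """Item-rest correlation: the item of interest is removed from the score."""
    rest = [s - g for g, s in zip(item, score)]
    return pearson(item, rest)

item = [0, 1, 1, 2, 2, 3]
score = [2, 4, 5, 6, 8, 9]        # total score that includes the item
print(round(rit(item, score), 3), round(rir(item, score), 3))
```

Because the rest score removes the item's own variance from the total, the sketch reproduces the ordering Rir < Rit noted above.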
For Table 2, the estimates by Rit and Rir were computed manually by using standard spreadsheet software. Both indices are defaults for the classical item analysis in the widely used general software packages, such as IBM SPSS [50], SAS (e.g., [57]), and Stata [55], and in some libraries of R (e.g., [58,59]).
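The two indices are plain Pearson correlations against the total and the rest score. A minimal Python sketch, assuming an (n, k) matrix of item scores and a raw-sum score (the function name is illustrative):

```python
import numpy as np

def item_total_and_rest(items, g_index):
    """Item-total (Rit) and item-rest (Rir) correlation for one item.
    `items` is an (n, k) array of item scores; the score X is the row sum."""
    items = np.asarray(items, dtype=float)
    g = items[:, g_index]
    X = items.sum(axis=1)          # total score, item included
    P = X - g                      # rest score, the item of interest omitted
    rit = np.corrcoef(g, X)[0, 1]
    rir = np.corrcoef(g, P)[0, 1]
    return rit, rir
```

Because X contains the item itself while P does not, Rit is inflated relative to Rir, reproducing the Rit > Rir pattern noted in the text.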

Eta
Coefficient eta is a close sibling to Rit; with binary items, eta = Rit (see [60,61]), and with polytomous items, Rit follows closely the direction of η(g|X), henceforth just eta, usually denoted as "X dependent", which is the same direction as in D(g|X), and not the opposite direction η(X|g) (see [2]). This is its traditional direction in settings related to GLM ("X dependent"). One of the advantages of eta over Rit is that, unlike Rit, eta can detect a possible non-linear pattern in the item, and, hence, in polytomous settings, the magnitudes of the estimates by eta are always somewhat higher than those by Rit (see the algebraic reasons in, for example, [2]).
The traditional form of eta is (e.g., [62]) η(g|X) = sqrt(Σ g n g (X̄ g − GM X )² / Σ i (X i − GM X )²), where X̄ g = Σ j y j / n g refers to the mean of X in a category of g, and GM X = Σ g n g X̄ g / Σ g n g is the grand mean of X. However, Metsämuuronen [2] suggests that a better form, also considering the possible negative values of eta, would be η(g|X) = sign(R PP ) × sqrt(SS between (g|X) / SS total (g|X)). This is a relevant modification because using variances in the estimation of eta automatically leads to a positive outcome even if the true association were negative.
In Table 2, the identity of eta and Rit is seen in g 1 (0.740), which is, factually, a dichotomous item because only two categories are obtained. In the case of g 4 , the magnitude of the estimate by eta is notably higher (η(g|X) = 0.866) than that by Rit (ρ gX = 0.799).
For Table 2, the estimates by eta were computed manually by using standard spreadsheet software.
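The sign-corrected form of eta described above can be sketched as follows; the sign is taken from the Pearson correlation, as suggested in the text, and the function name is illustrative:

```python
import numpy as np

def eta_g_given_x(g, x):
    """Coefficient eta in the direction "X dependent", eta(g|X), with the
    sign taken from the Pearson correlation (the sign-corrected form)."""
    g = np.asarray(g); x = np.asarray(x, dtype=float)
    gm = x.mean()                                   # grand mean of X
    ss_total = ((x - gm) ** 2).sum()
    ss_between = sum(len(x[g == c]) * (x[g == c].mean() - gm) ** 2
                     for c in np.unique(g))
    sign = np.sign(np.corrcoef(g, x)[0, 1])
    return sign * np.sqrt(ss_between / ss_total)
```

With a binary g, the magnitude of this eta coincides with Rit, the identity noted at the start of the section; with a descending pattern the sign correction yields a negative value, which the variance-based form cannot produce.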

R PS
The parametric polyserial coefficients of correlation are the natural counterparts of the robust rank-polyserial coefficients of correlation. While the previous estimators are intended to estimate the observed correlation, R PS is intended to estimate the inferred correlation between a latent g and an observed X. In the early years of item analysis, the traditional R PS was the most used estimator of the item-score association (see [24]). However, from early on, it was known that the traditional way of estimating R PS leads to obvious overestimation with out-of-range values (R PS >> 1.000) if the embedded ρ gX and σ g are high to start with. Over the years, Clemans [24], Turnbull [63], Brogden [64], and Henrysson [65], as examples, offered solutions to the challenge of overestimation (see the history in [28]). By far the most promising option in this family is a coefficient called r-polyreg correlation (R REG ; see [27,28]), which produces estimates that do not exceed 1, nor does it rely on bivariate normality assumptions. It has been shown to be very resistant to deflation, although, with short scales in X, it seems to give underestimates (see [29]; see also later Section Comparison of the Estimates With a Larger Dataset).
For the general interest, the traditional estimates by R PS were computed for the items in Table 2, although they are not seen in the table. The estimates are ρ PS_g1 = 1.334, ρ PS_g2 = 1.120, ρ PS_g3 = 0.945, and ρ PS_g4 = 0.841. The two first ones are, obviously, out of range in magnitude, and there is a reasonable doubt also with the other estimates. In Table 2, the estimates by R REG are seen. By using R REG in estimating the inferred association related to the item g 1 , the estimates are notably higher in magnitude (0.879) than those by Rit, Rir, and eta (≤0.740), although the magnitude is far lower than those by G, G2, and R PC (1.000), which are known to detect the deterministic pattern accurately (see [29,45]). With g 2 , also with a deterministic pattern, the estimate by R REG (0.991) is very close to those by G, G2, and R PC (1.000). With g 3 , the estimate by R REG (0.909) is close to that by D (0.906) but lower than those by G (0.960) and R PC (0.980). Notably, with item g 4 , R REG seems to underestimate association in an obvious manner (0.487).
The traditional R PS can be obtained by the two-step procedure introduced by Olsson et al. ([1], see also [23]) where, in the first phase, the marginal proportions of the categorical ordinal variable (p j ) are used to obtain the threshold estimates (γ j ), and these are used, in the second phase, to give the estimates of R PS . The estimate by R PS can be obtained by (e.g., [23]) R PS = ρ gX × σ g / Σ j φ(γ j )(g j+1 − g j ), where ρ gX is the point-polyserial correlation between g and X, σ g is the standard deviation of the categorical ordinal variable, γ j = Φ −1 (p j ) is the inverse of the standard normal distribution function evaluated at the cumulative proportion p j , φ(γ j ) = (2π) −1/2 exp(−γ j ²/2) is the standard normal density, and g j is the jth category in the ordinal variable g. The spacing term g j+1 − g j is not needed when all the categories are met, as they are in items g 2 –g 4 ; in these cases, g j+1 − g j = 1. However, in item g 1 , g j+1 − g j = 5 − 1 = 4.
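The two-step formula above can be sketched in Python. This is an illustrative implementation under the assumptions stated in the text (thresholds from the marginal proportions, population-form standard deviation); the function name is hypothetical:

```python
import numpy as np
from scipy.stats import norm

def polyserial_two_step(g, x):
    """Two-step polyserial estimate following the formula in the text:
    R_PS = rho_gX * sigma_g / sum_j phi(gamma_j) * (g_{j+1} - g_j),
    where gamma_j are thresholds from the marginal proportions of g."""
    g = np.asarray(g, dtype=float); x = np.asarray(x, dtype=float)
    rho_gx = np.corrcoef(g, x)[0, 1]          # point-polyserial correlation
    sigma_g = g.std()                         # population form (ddof = 0)
    cats = np.sort(np.unique(g))
    # thresholds gamma_j = Phi^{-1}(cumulative proportion below category j+1)
    cum = np.cumsum([np.mean(g == c) for c in cats])[:-1]
    gammas = norm.ppf(cum)
    spacing = np.diff(cats)                   # g_{j+1} - g_j; 1 when all categories met
    denom = np.sum(norm.pdf(gammas) * spacing)
    return rho_gx * sigma_g / denom
```

In a simulation where g is a discretised version of one member of a bivariate normal pair, the estimate recovers the latent correlation closely, which is what the formula is designed to do.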
For computing R REG , a statistic β i is needed. This is the slope parameter of the probit regression model P(x i ≤ 1 | y) = Φ(a i − β i y), where Φ is the standard normal cumulative distribution function and a i and β i are the intercept and slope parameters. The β value can be computed, for example, in IBM SPSS software by using the probit regression procedure. After the estimates of β and the population variance of the score variable X (β̂ and σ̂ 2 X , respectively) are computed, R REG is estimated as R REG = β̂σ̂ X / sqrt(β̂ 2 σ̂ 2 X + 1). For Table 2, the β values were computed by using SPSS software, and the estimates of σ 2 X and R REG were computed manually by using a spreadsheet software package. Note that the standard outputs of generally known software packages usually produce the sample variance s 2 X . This is transformed into the "population" variance by multiplying the outcome by (N − 1)/N. If using MS Excel, as an example, we can select whether the sample variance (VAR.S) or the population variance (VAR.P) is used. The latter is used in Table 2.
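The probit-slope idea can be sketched for the simplest, binary case. This is only an illustration of the mapping from the slope to a correlation (the article uses an ordinal probit for polytomous items, and the SPSS procedure is not reproduced here); the function name and the parameterisation Φ(a + bX) are assumptions of the sketch:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def r_polyreg_binary(g, x):
    """Sketch of the r-polyreg idea for a binary item: fit the probit
    model P(g = 1 | X) = Phi(a + b*X) by maximum likelihood, then map
    the slope to a correlation, R_REG = b*sd_X / sqrt(b^2*var_X + 1)."""
    g = np.asarray(g); x = np.asarray(x, dtype=float)

    def nll(theta):
        a, b = theta
        p = norm.cdf(a + b * x).clip(1e-12, 1 - 1e-12)
        return -(g * np.log(p) + (1 - g) * np.log(1 - p)).sum()

    a_hat, b_hat = minimize(nll, x0=[0.0, 0.0], method="Nelder-Mead").x
    sigma_x = x.std()                     # population form, ddof = 0
    return b_hat * sigma_x / np.sqrt(b_hat**2 * sigma_x**2 + 1.0)
```

Under a latent bivariate normal model with correlation ρ, the probit slope equals ρ/(σ X sqrt(1 − ρ²)), so the mapping above returns ρ itself, which is why the estimate stays within [−1, +1] by construction.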

R PC
As R PS above, also the parametric polychoric coefficient of correlation is a natural counterpart for the nonparametric or robust rank-polyserial (or rank-polychoric) coefficients of correlation. R PC differs from the previous ones in that no closed-form expression for the relation between R PP and R PC is available. Instead, several alternatives for the process of obtaining the estimates have been suggested, and these produce slightly different estimates (see [23]). As an estimator of correlation, R PC is advantageous over R PP , specifically with ordinal datasets (e.g., [66][67][68][69]), and it is very resistant against several sources of deflation (see [29,45]). Also, it is robust in the accurate reproduction of measurement models with unbiased standard errors even with small sample sizes (e.g., [66][67][68]).
One practical challenge from the item analysis viewpoint is that we do not know what kind of composite the item discrimination refers to; the estimates refer to hypothetical composites that the researcher is not privy to (see [25,70]). Also, the computational challenges are well-known; the estimation needs complicated procedures (e.g., [71]). Additionally, the established routines for estimating ρ PC (e.g., [19][20][21][22]) cannot reach the extreme values +1 and −1 because the deterministic patterns lead to computational problems. The last challenge is easy to solve, though (see below the restrictions used in the estimation).
In reference to Table 2, R PC accurately detects the deterministic patterns in items g 1 and g 2 . This is expected from its behaviour in simulations (e.g., [29,45]). Also, the magnitudes of the estimates in g 3 and g 4 (0.980 and 0.897, respectively) seem to be quite close to those by G 2 (0.976 and 0.879, respectively). Hence, if the estimates by R PC are taken as the closest approximation of the latent variables manifested in an ordinal or interval-scaled form, it seems that, of the options for rank-polyserial coefficients of correlation, G 2 could be taken as a quite close match to R PC . In-depth studies are needed to confirm this. Some light on this matter is given in the next section, with a comparison with a larger dataset.
In IBM SPSS [50], syntax for R PC is not available, although some macros have been published (e.g., [72]). In SAS, the command PROC CORR provides R PC . Correspondingly, in R, R PC can be computed by CorPolychor(x, y, ML = FALSE, control = list(), std.err = FALSE, maxcor = 0.9999) (see https://rdrr.io/cran/DescTools/man/CorPolychor.html). In computing the estimates in Table 2, the two-step estimator by Martinson and Hamdan [20] was used. As simplified by Zaionts [73], the task is to find an estimate of R PP = ρ gX that maximises the log-likelihood function LL = Σ i Σ j n ij × ln(π ij ), where n ij is the observed frequency and π ij the expected probability of the cell (i, j) under the bivariate normal model. In the first step, the threshold coefficients γ i and τ j are estimated for both g and X, and, in the second step, we find by iteration the R PP that maximises LL. The estimates were computed manually by modifying Zaionts' [73] procedure for MS Excel: R PP was restricted to R PP < 0.99999999, and, in each operation with a logarithm, an additional 0.000000001 was added to the cell value. The latter is for the cases of deterministic patterns causing the value 0 in a cell; the logarithm of zero is not defined. Hence, technically, R PC cannot reach the ultimate correlation. In Table 2, this is denoted by the value 1.000 instead of 1 as with G and G 2 .
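The two-step logic above can be sketched in Python; this is an illustrative implementation, not the Excel procedure used in the article. The function name is hypothetical, the infinite outer thresholds are clamped to ±8 standard deviations, and the cell probabilities π ij are rectangle probabilities of the bivariate normal CDF:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import minimize_scalar

def polychoric_two_step(g, x):
    """Two-step polychoric sketch: thresholds from the marginals, then a
    one-dimensional search for the rho maximising LL = sum_ij n_ij*ln(pi_ij)
    under the bivariate normal model."""
    g = np.asarray(g); x = np.asarray(x)

    def thresholds(v):
        cats = np.sort(np.unique(v))
        cum = np.cumsum([np.mean(v == c) for c in cats])[:-1]
        return np.concatenate(([-8.0], norm.ppf(cum), [8.0]))  # clamp +-inf

    tg, tx = thresholds(g), thresholds(x)
    gc, xc = np.sort(np.unique(g)), np.sort(np.unique(x))
    n = np.array([[np.sum((g == a) & (x == b)) for b in xc] for a in gc])

    def neg_ll(rho):
        mvn = multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]])
        F = lambda a, b: mvn.cdf([a, b])
        ll = 0.0
        for i in range(len(gc)):
            for j in range(len(xc)):
                # rectangle probability of cell (i, j)
                pij = (F(tg[i + 1], tx[j + 1]) - F(tg[i], tx[j + 1])
                       - F(tg[i + 1], tx[j]) + F(tg[i], tx[j]))
                ll += n[i, j] * np.log(max(pij, 1e-12))
        return -ll

    return minimize_scalar(neg_ll, bounds=(-0.999, 0.999), method="bounded").x
```

The floor of 1e-12 on the cell probability plays the same role as the small additive constant in the Excel procedure described above: it keeps the logarithm defined for cells that are empty under a deterministic pattern.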

Comparison of the Estimates With a Larger Dataset
The coefficients of R RP are compared with relevant estimators by using 6,932 real-world items from 1,440 datasets. The estimators are studied from three viewpoints. First, how the factual estimates differ from each other in different situations by varying the number of categories in the item and the score, varying the item difficulty, and varying the number of observations in the sample. Second, the estimators are compared from the viewpoint of how efficiently they reflect the population value. Third, the estimators are briefly compared related to their tendency to resist deflation as estimators of the item-score association.

Empirical Real-Life Datasets Used in the Comparison
The dataset used in the comparison is a public one [74] with N = 4,023 test-takers. The items and scores formed 1,440 tests with different numbers of test-takers (n), test lengths (k), difficulty levels (p), reliabilities (α), and numbers of categories in the items and scores [df (g) and df (X), respectively].

Comparison of the Estimates in Varying the Conditions
First, an obvious outcome of the comparison is that the estimates by JT gX = D and G underestimate the item-score association in an obvious manner when the number of categories exceeds 3 (in D) and 4 (in G) (see Figure 3, which also includes the binary items; see Table 1 in Supplementary Appendix 2). This is also known from previous simulations (e.g., [29,45]). The phenomenon follows from the fact that the estimates by R PP are always deflated whenever the scales of the two variables are not identical (see the algebraic reasons in [75,76]), as is always the case with an item and a score. Notably, the magnitude of the estimates by R PP tends to exceed those by D and G with items with a wide scale. Second, the estimates by D 2 tend to be close to those by R REG and R PC , and the estimates by G 2 tend to be slightly higher than those by R PC regardless of the number of categories in the items. These are not general characteristics, though. It is to be seen that, more frequently, the magnitude of the estimates by D 2 tends to be close to those by R REG , and the magnitude of the estimates by G 2 tends to be close to those by R PC .
Third, the binary case is given as a benchmark here; the estimates by R PP and eta are notably deflated with the binary dataset. In the special case suitable for the point-biserial correlation, all estimators other than R PP and eta would produce closely the same estimate. The differences between the estimators come with items with a wide scale [df (g) > 3]. In what follows, the binary cases are omitted, and just the k = 6,932 items with a polytomous nature are used in the study.
Regarding the number of categories in the score, it seems that the estimates are stable if the score has more than 15 categories (Figure 4; see Table 2 in Supplementary Appendix 2). If the number of categories is less than 16, all the estimators tend to produce unstable estimates. The instability in the estimates with tests with a narrow scale in a score is discussed later with the efficiency of the estimators to reflect the population value. When it comes to the magnitude of the estimate, the estimators seem to form three groups. With scores wider than 15 categories, R RP based on D tends to give estimates with notably lower magnitude than the other estimators. R RP based on G tends to give estimates that are at the same level as those by R PP and eta. R RP based on G 2 and D 2 tends to give estimates that are at the same level as those by R REG and R PC . Of these estimators, the magnitude of the estimates by D 2 tends to follow closely those by R REG, and the magnitude of the estimates by G 2 tends to follow those by R PC .
When it comes to the difficulty levels of the items, the traditional R PP and eta include obvious deflation in the estimates with extremely easy and difficult items, and in the binary case, as discussed with Figure 3 (Figure 5; see Table 3 in Supplementary Appendix 2). The phenomenon is known from previous simulations (e.g., [25,29,45,49]). In the given dataset, this phenomenon is indicated by the notably lower magnitude of the estimates for the very easy items; extremely difficult polytomous items were not obtained in the polytomous dataset. The magnitude of the estimates by JT gX = D and G tends to be lower than by the other estimators with items of medium difficulty levels, but, with items with extreme difficulty levels, they tend not to differ from those by the other estimators. This is caused by the fact that the probability of obtaining deterministic patterns is high with items with extreme difficulty levels. Then, the magnitude of the estimates by D and G tends to get closer to those by D 2 and G 2 . Correspondingly, the magnitude of the estimates by D 2 and G 2 does not differ notably from the benchmarking estimators R REG and R PC .
When it comes to the sample size, all estimators tend to give stable estimates when the sample size n = 50 is reached (Figure 6; see also Table 3 in Supplementary Appendix 2). When the sample size is very small (n = 25 in the dataset), the magnitudes of the estimates are deflated. As above, the magnitude of the estimates by D 2 tends to follow closely those by R REG , and the magnitude of the estimates by G 2 tends to follow those by R PC . The estimates by D, G, R PP , and eta tend to be deflated in comparison with R PC and G 2 . The estimators of R RP form four distinguishable groups when it comes to the magnitude of the estimates. Coefficient D is known to be the most conservative of the options for R RP , and, hence, the magnitude of the estimates by JT gX and D is the lowest in the comparison. Coefficient G is more liberal than D, and the magnitudes of the estimates tend to follow the tendency of R PP and eta when the sample size exceeds n = 50. Coefficient D 2 is somewhat more conservative in comparison with G 2 , but the magnitudes of the estimates are notably higher than those by D and G. Notably, with a very small sample size (n = 25 in the dataset), the magnitude of the estimates by D 2 seems to be very close to those by eta, and, when the sample size reaches n = 50, the magnitude starts to follow the tendency of R REG . The highest magnitudes of the estimates are given by G 2 , and its trend follows closely the tendency of R PC .
To condense the results so far: with items with a wide or wide-ish scale (more than 3-4 categories), the estimators JT gX = D and G tend to underestimate the item-score association in an obvious manner, while the magnitude of the estimates by D 2 tends to be close to those by R REG , and the magnitude of the estimates by G 2 tends to be close to those by R PC . The magnitudes of the estimates tend to be ordered as D < G < D 2 < G 2 . This order is expected because of the known characteristics of the estimators; the estimates by D are more conservative than those by G (see, e.g., [45]), and the estimates by D 2 are more conservative than those by G 2 (e.g., [29]).

Efficiency of the Estimators to Reflect the Population Value
The efficiency of the estimators of R RP and related benchmarking estimators to reflect the population value is studied by comparing the sample and "population" estimates computed from the original real-world dataset of 4,023 test-takers. As a combination of different sets of items, the procedure for producing the simulation dataset came up with 137 different score variables related to polytomous items. These all produce slightly different population values. The average estimates are collected in Table 5.
Because we do not know the real population value of the item-score association in the real-life datasets, each estimator races its own race against itself: D from the sample is compared with the corresponding D from the population, as an example. A simple and straightforward statistic is computed: the difference (d) between the sample estimate and the population estimate. If d > 0, the population value was overestimated; if d = 0, the estimate was equal in the sample and the population; and if d < 0, the population value was underestimated. This statistic is denoted by d in the name of the coefficient: dD refers to the difference between the sample D and the population D.

When it comes to the number of categories in the item, the first point to make is that the estimates by D 2 and G 2 are the least effective in reaching the population value; they tend to underestimate the population value the most (Figure 7; see Table 5 in Supplementary Appendix 2). In contrast, second, the estimates by eta tend to be overestimated with items with wide scales. This is an interesting phenomenon, knowing that eta tends to give obvious underestimates in the same manner as R PP does. It means that the wider the scale of the item gets, the more probable it is that the association detected by eta in the population (or in a large sample) is lower in magnitude in comparison with the sample estimates. This seems to be the opposite with D 2 and G 2 . Third, except for R REG , all estimators in the comparison share the common characteristic that the narrower the scale in the item (up to 8 categories) is, the more the estimator tends to underestimate the population parameter, up to 0.02 units of correlation. R REG seems to be surprisingly robust against the effect of the scale of the item; regardless of the length of the scale, the estimates tend to be very close to the population value.
When it comes to the number of categories in the score, above, it was noted that, when the number of categories in the score exceeds 15, the estimates tend to be stable. From the perspective of reflecting the population estimate, except R REG , all estimators tend to notably underestimate item-score association with tests with a narrow scale in the score, that is, when df (X) < 15 (Figure 8, see Table 6 in Supplementary Appendix 2). In this respect, there seem to be no differences between the estimators except that R REG produces robust estimates and eta does not underestimate the association as much as the other estimators.
When it comes to the difficulty of the items, all estimators tend to underestimate the population correlation with difficult items (Figure 9; see Table 7 in Supplementary Appendix 2), and the underestimation may be notable, up to 0.09 units of correlation. Notably, the dataset used in the comparison does not include polytomous items with an extreme difficulty level; it is possible that, with more extreme (difficult) items, the underestimation would be more drastic. Notably, with extremely easy items, the underestimation is not as radical as with extremely difficult items (<0.02 units of correlation). Systematic studies in this respect would be beneficial. In this respect, there are no notable differences between the estimators of R RP ; all are conservative. Again, it seems that R REG is more robust than the other estimators.
Finally, when it comes to the sample size, except for eta, all estimators tend to be conservative; they underestimate the population estimate (Figure 10; see also Table 8 in Supplementary Appendix 2).

General Measurement Model Related to Deflation in Estimators of Correlation
It is a well-known fact that PMC is prone both to attenuation caused by errors in measurement modelling and to radical deflation caused by technical or mechanical errors in the calculation process. These concepts are discussed, amongst others, by Chan [77], Gadermann et al. [78], Lavrakas [79], and Metsämuuronen [26,30,80].
Both attenuation and deflation in PMC are artificial and systematic. Sometimes, attenuation has been connected to the phenomenon called restriction of range or range restriction (see literature, e.g., in [81][82][83][84]). Pearson [5] himself was the first to offer a solution to the attenuation problem, and many solutions have been offered to correct the attenuation in the X variable (see the typology in [82]). However, even if there is no manifestation of range restriction in X in the sense discussed by Sackett et al. [82], PMC is very vulnerable to several sources of mechanical error in the estimates of correlation causing deflation. Metsämuuronen [29,45] found seven such sources: (1) the division of subpopulations in g (or item difficulty in the measurement modelling settings), (2) the discrepancy in the scales of the variables, (3) the distribution of the latent variable, (4) the number of categories in g, (5) the number of categories in X, (6) the number of items forming the score, and (7) the number of tied cases in the score. In practical terms, the deflation is obvious when assuming two identical normally distributed variables with an obvious perfect correlation. If we dichotomize one variable (g) and polytomize the other (X), PMC and many other estimators based on covariation cannot reach the (obvious) perfect (latent) correlation (see the algebraic reasons for PMC in [75,76] and for coefficient eta in [2]). The deflation, that is, the underestimation of the true latent association because of technical or mechanical reasons, is the greater the more extreme the division (or the difficulty level) in g is.
While the effect of attenuation may be nominal in a dataset, the deflation in PMC = R PP may be radical; it approximates 100% if the variance in the item is small, that is, with an item with an extreme difficulty level causing a small item-total covariation; this is strictly inherited from the formula of PMC (see Eq. 32). To make this radical deflation visible in the measurement model, Metsämuuronen [26,29,30] has proposed a general measurement model combining a latent variable (θ), the observed values of an item (x i ), a weight factor w iθ that links θ with item i, and the measurement error e iθ : x i = w iθ θ + e iθ (40), generalised from the traditional measurement model (see, e.g., [85,86]). In the general model, the unobservable θ may be manifested as varying types of relevantly formed compilations of items, including a theta score formed by the raw score (θ X ), a principal component score (θ PC ), a factor score (θ FA ), item response theory (IRT) or Rasch modelling (θ IRT ), or various non-linear combinations of the items (θ Non-Linear ). The weight factor w iθ is a coefficient of correlation in some form, also including principal component and factor loadings (λ i ). In a normal case, w iθ varies within −1 ≤ w iθ ≤ +1; values higher than +1 or smaller than −1, sometimes obtained by bi- and polyserial correlation or by factor loadings, are taken as out-of-range values.
The mechanical error in the estimation of correlation leading to deflation in the estimates has been re-conceptualised in the measurement model by an element e wiθ_MEC (e.g., [29]), where the notation refers to the fact that the magnitude of deflation caused by the mechanical error in the estimation (MEC) depends on the weighting factor w, item i, and the score variable θ. This characteristic of the rank-polyserial coefficients of correlation is discussed in what follows.

Deflation in the Estimates of R RP
Based on a simulation of 11 sources of mechanical error causing deflation in the estimates of the item-score association [29], of the options for R RP , the estimators D and D 2 were noticed to be the most prone to deflation, while the estimators G and G 2 tended to be close to deflation-free. Of the benchmarking estimators, R PC appeared to be close to deflation-free, R REG is mildly affected by the number of categories in the item and the score and by the distribution of the latent variable, and R PP is severely affected by several sources of deflation. That D and D 2 were ranked lower in the comparison is caused by the fact that, in a theoretical dataset with identical latent variables, D and D 2 tend to be sensitive to the number of categories in the item and the score as well as to the distribution of the latent variable (see Table 6). However, in real-life datasets, as seen in the previous sections, the factual magnitude of the estimates by D 2 tends to follow the magnitude of those by R REG (see Section Comparison of the Estimates With a Larger Dataset).
Simulations have shown that such coefficients of correlation in Table 1 related to PMC as R PP and coefficient eta include a notable magnitude of deflation in the estimates (see [2,29,45]). (Notes to the scales used in Table 6: a: +2 = no effect = MEC-free, +1 = insignificant effect, 0 = unknown effect, −1 = notable effect, −2 = remarkable effect lowering the estimate; b: +1 = trigonometric/directional nature, 0 = unknown, −1 = linear/symmetric nature; c: +1 = stable in reflecting the population parameter, 0 = instable only in very small samples, −1 = notably instable; d: +1 = no tendency for overestimation, 0 = very small overestimation, −1 = notable tendency for overestimation.) The deflation can be easily verified by using a simple example. Take two identical variables with a continuous scale; they (obviously) have a perfect correlation (ρ θθ = 1). In the case that we truncate one into 5 categories, as in Section Simple Comparison of the Options for R RP (categories 1-5; item g), and the other into more than two categories (score X), these estimators of association cannot reach the factual latent correlation. Depending on the cut-offs of the values in g, that is, the "difficulty level" of the item, and the number of categories in X, the deflation in the estimates may be notable, approximating 100% with extremely difficult items.
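The thought experiment above is easy to reproduce numerically. A short Python sketch, with illustrative cut-offs chosen for the example (an extreme dichotomisation of g and an 8-category X), shows that the observed Pearson correlation falls well below the perfect latent correlation:

```python
import numpy as np

# Two identical latent variables have a perfect correlation by construction.
# After one is dichotomised at an extreme cut-off (a "very difficult item")
# and the other is polytomised into 8 categories, the Pearson correlation of
# the observed variables can no longer reach 1 -- the mechanical deflation
# discussed in the text.
rng = np.random.default_rng(42)
theta = rng.normal(size=100_000)

g = (theta > 1.5).astype(int)                        # extreme dichotomisation
cuts = np.quantile(theta, np.linspace(0, 1, 9)[1:-1])
x = np.digitize(theta, cuts)                         # 8 roughly equal categories

r_latent = 1.0                                       # by construction
r_observed = np.corrcoef(g, x)[0, 1]
print(round(r_observed, 3))  # notably below 1 despite the perfect latent correlation
```

Moving the cut-off closer to the centre of the distribution (an item of medium difficulty) lessens the deflation, which mirrors the dependence on the "difficulty level" of g described above.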
In contrast, such estimators as R PC , R REG , G, and D include notably less mechanical error (MEC) than Rit, if not being MEC-free (see [29,45]). This phenomenon of deflation is illustrated in Figure 11. Let us assume an item close to g 4 in Table 2. Two identical variables with 1,000 cases with a normal distribution are truncated so that one includes 5 categories [df (g) = 4] and the other 8 categories [df (X) = 7], and the cut-offs in the item are selected so that p = 0.6167. In the figure, the magnitude of the estimates by D is lower than those by G, for example.
All in all, referring to the measurement model in Eq. (40) and the element e wiθ_MEC related to the deflation caused by the mechanical error in the estimating process (MEC), we end up with the following relation of the deflation in the estimators of R RP : e JTgX,i_MEC = e D,i_MEC > e D2,i_MEC > e G,i_MEC = e G2,i_MEC ≈ 0.

Main Results in a Nutshell
This article started with the note that we seem to miss a coefficient of correlation that could be used as the rank-polyserial coefficient of the observed correlation between a categorical ordinal variable and an interval- or ordinal-scaled variable. The quest for finding the "missing" coefficient led us, first, to a new coefficient of correlation between an ordinal variable g and a metric variable X (JT gX ) that was derived by generalising the rank-biserial correlation to polytomous ordinal cases by using the Jonckheere-Terpstra test statistic; hence, the name JT gX . It was shown that two traditional coefficients of correlation, Somers' D and Goodman-Kruskal G, are strictly related to JT gX and, hence, these also could be considered rank-polyserial coefficients. Furthermore, two related estimators of correlation, the dimension-corrected D and G (D 2 and G 2 ), are strictly related to D and G, and, hence, those could be considered dimension-corrected coefficients of rank-polyserial correlation.
To conclude the outcomes from the empirical section, by using different estimators carrying the characteristics of a rank-polyserial coefficient, the estimates by JT gX = D, G, D 2 , and G 2 tend to follow the following pattern: the magnitude of the estimates by JT gX and D is the lowest; D is the most affected by several sources of deflation. The magnitude of the estimates by G tends to be higher than those by D, and the estimates tend to follow the tendency of R PP and eta when the sample size exceeds n = 50. The magnitude of the estimates by D 2 is higher than those by D and G, and the estimates seem to follow the tendency of R REG when the sample size exceeds n = 50. The highest magnitudes of the estimates are given by G 2 , and its trend follows closely the tendency of the estimates by R PC .
Hence, on the one hand, if the estimates by R REG and R PC are taken as accurate reflections of the latent item-score correlations, rank-polyserial coefficients based on D 2 and G 2 seem to be relevant options to use as R RP and to study more. These are based on observed variables because the underlying estimators D and G are based on observed variables. On the other hand, rank-polyserial coefficients based on D and G give the possibility of an interesting practical interpretation because of Eqs. (20) and (28): 0.5 × D + 0.5 and 0.5 × G + 0.5 strictly indicate the proportion of logically (ascending) ordered observations in item g after they are ordered by the score. Hence, if D = 0.90, 95% of the observations (0.5 × 0.90 + 0.5 = 0.95) are logically ordered in the item.
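The interpretation above can be checked numerically at the level of observation pairs. In the sketch below (the function name is illustrative, and counting pairs tied only in g as half-ordered is one way of making the claim exact), the identity 0.5 × D + 0.5 = (C + 0.5 × T g )/(C + D + T g ) holds by simple algebra, so the transformed coefficient is exactly the pairwise share of logically ordered comparisons:

```python
import numpy as np

def somers_d_and_ordered_share(g, x):
    """Somers' delta D(g|X) from pair counts, together with the pairwise
    share of 'logically ordered' comparisons; pairs tied in g (but not in
    X) count as half-ordered, and the share equals 0.5*D + 0.5 exactly."""
    g = np.asarray(g); x = np.asarray(x)
    C = D = Tg = 0
    n = len(g)
    for i in range(n):
        for j in range(i + 1, n):
            dg, dx = g[i] - g[j], x[i] - x[j]
            if dx == 0:
                continue                  # pairs tied in X are ignored in D(g|X)
            if dg == 0:
                Tg += 1
            elif dg * dx > 0:
                C += 1
            else:
                D += 1
    delta = (C - D) / (C + D + Tg)
    share = (C + 0.5 * Tg) / (C + D + Tg)
    return delta, share
```

The algebra is immediate: 0.5 × (C − D)/(C + D + T g ) + 0.5 = (C + 0.5 × T g )/(C + D + T g ).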
Characteristic of all the estimators proposed as estimators of R RP is that they all are directional; the same also holds for the rank-biserial correlation. They all indicate to what extent the variable with a wider scale (X), whether of an ordinal, interval, pseudo-continuous, or continuous scale, explains the ordinal pattern in the variable with a narrower ordinal scale (g). In the measurement modelling settings, this can be taken as an advantage. After all, the whole apparatus in testing settings is based on the idea that the latent variable manifested as the score variable explains the behaviour in a test item, and the other direction does not make sense. All the options for R RP discussed in this article are directed to favour this direction. A possible advantage of the new coefficient JT gX is that it leads us strictly to the correct form of the three alternatives produced by the standard procedures of calculating Somers' D.

Known Limitations
An obvious limitation of the study is that the characteristics of the behaviour of the coefficients were illustrated only by using a limited real-world dataset with very limited sample sizes. The sample sizes may be taken as relevant from the practical testing viewpoint (after all, arguably, most tests in the world are administered in classroom situations or lectures with a very limited number of test-takers). However, controlled simulations with a known true association and with large sample sizes would be beneficial. In this, the characteristic of D and G to strictly indicate the proportion of logically ordered observations after being ordered by the score could be utilised.
The simulation in Section Comparison of the Estimates With a Larger Dataset did not include very difficult items; this is clearly a deficiency in the original dataset. Hence, systematic studies of the behaviour of the options for R RP with items with extreme difficulty levels would be beneficial.

ETHICS STATEMENT
Ethical approval was not provided for this study on human participants because the dataset is collected as part of national assessment and evaluation by the National Authority. Dataset is anonymized and used by permission. Written informed consent from the participants' legal guardian/next of kin was not required to participate in this study in accordance with the national legislation and the institutional requirements.

AUTHOR CONTRIBUTIONS
The author confirms being the sole contributor of this work and has approved it for publication.

FUNDING
Open access support was provided by Centre for Learning Analytics, University of Turku.