The Joint Distribution Criterion and the Distance Tests for Selective Probabilistic Causality

A general definition and a criterion (a necessary and sufficient condition) are formulated for an arbitrary set of external factors to selectively influence a corresponding set of random entities (generalized random variables, with values in arbitrary observation spaces), jointly distributed at every treatment (a set of factor values containing precisely one value of each factor). The random entities are selectively influenced by the corresponding factors if and only if the following condition, called the joint distribution criterion, is satisfied: there is a jointly distributed set of random entities, one entity for every value of every factor, such that every subset of this set that corresponds to a treatment is distributed as the original variables at this treatment. The distance tests (necessary conditions) for selective influence previously formulated for two random variables in a two-by-two factorial design (Kujala and Dzhafarov, 2008, J. Math. Psychol. 52, 128–144) are extended to arbitrary sets of factors and random variables. The generalization turns out to be the simplest possible one: the distance tests should be applied to all two-by-two designs extractable from a given set of factors.


What Is Selective Influence?
Consider a simple double-detection experiment: there are two stimuli each of which may possess or lack a certain feature (signal property), and an observer has to respond Yes (signal present) or No (signal absent) to each of the two stimuli. For instance, the stimuli may be two spatially separated line segments in a frontal plane each of which may be either vertical (signal absent) or tilted by a fixed small angle (signal present); the observer says Yes-Yes if both lines appear to be tilted, Yes-No if the left line appears tilted and the right one not, etc. These responses are random variables: A (response to the left stimulus) and B (response to the right one), each with two possible values {Yes, No} occurring with some probabilities. They are jointly distributed, in the sense that by virtue of co-occurring in the same trial the values of A and B are naturally paired, enabling one to meaningfully pose questions like "What is the joint probability of A = Yes and B = No?". The joint distribution of A and B may change depending on the values of the following external factors: α = Tilt of the left line, with two possible values, {absent, present}, and β = Tilt of the right line, with the same two values. The combination of factor values chosen, one for each of the factors, is traditionally referred to as a treatment.
A system's behavior, be the system biological, social, or technological, can be thought of as a network of stochastically interdependent random entities. The external world provides inputs (influences, interventions, conditions) presumably affecting some of the components of the network and not affecting the others. The question arises therefore as to how, based on the joint distributions of all these random entities, to distinguish the components affected and not affected by each of these external inputs.
The notion of selective influence under stochastic interdependence was introduced and systematically analyzed in the behavioral context by Townsend (1984), although implicitly it had been used before (Lazarsfeld, 1965; Bloxom, 1972; Schweickert, 1982). Townsend's approach to selective influence (further developed in Townsend and Thomas, 1994, and mathematically characterized in Dzhafarov, 1999) is, however, very different from the present one. In fact, in all non-trivial cases they are incompatible. Our approach gradually developed starting with Dzhafarov (2001), based on Dzhafarov's earlier work on response time analysis (see Dzhafarov, 1997, for an overview). In Dzhafarov (2003) the definition of selective influence adopted in the present paper was given for finite systems of random entities. This notion was put on a more solid probabilistic foundation in Dzhafarov and Gluhovsky (2006), and further developed in Kujala and Dzhafarov (2008). In the latter work, for the first time, workable tests for selective influence were formulated.
The present paper continues this line of research at a higher level of mathematical rigor and arguably the highest possible level of generality. The abstract nature of the mathematical theory makes it rather difficult reading, with notation which, though carefully […]
What does this mean? The meaning of the relation is only obvious if A and B are stochastically independent for all four treatments, i.e., if $p_{ij} = p_{i\cdot}\, p_{\cdot j}$, $q_{ij} = q_{i\cdot}\, q_{\cdot j}$, etc., where i, j ∈ {1, 2}. In this case all one has to establish to prove (A, B) ↫ (α, β) is that the marginal distribution of A is not affected by changes in β and the marginal distribution of B is not affected by changes in α.
To look at this in detail, let the pair of our random variables (A, B) at the four treatments be denoted $(A, B)_{11}$, $(A, B)_{12}$, $(A, B)_{21}$, and $(A, B)_{22}$, where 1, 2 denote "absent" and "present," respectively. If A and B are independent at all four treatments, then the selectiveness (A, B) ↫ (α, β) simply means that the marginal distribution of A does not depend on β (i.e., $p_{1\cdot} = q_{1\cdot}$ and $r_{1\cdot} = s_{1\cdot}$) and the marginal distribution of B does not depend on α (i.e., $p_{\cdot 1} = r_{\cdot 1}$ and $q_{\cdot 1} = s_{\cdot 1}$). The problem arises when A and B are not independent for at least one of the treatments: how should one determine then if (A, B) ↫ (α, β)? This is the problem addressed in this paper, only we do not confine the consideration to the case of two factors and two random variables. Rather we generalize the problem to an arbitrary set of external factors and an arbitrary (but one-to-one corresponding to the set of factors) set of random entities. 1
For a finite set of random variables the definition of selective influence was given in Dzhafarov (2003) and then refined in Dzhafarov and Gluhovsky (2006) and Kujala and Dzhafarov (2008). Applying it to our example, (A, B) is selectively influenced by (α, β) if and only if one can find functions f and g and a random entity C whose distribution does not depend on α, β, such that
$(A, B)_{\alpha\beta} \sim (f(\alpha, C),\, g(\beta, C))$, (1)
where ∼ stands for "is distributed as". 2 That is, denoting f(α = 1, C) by $f_1(C)$, g(β = 2, C) by $g_2(C)$, etc.,
$(A, B)_{11} \sim (f_1(C), g_1(C))$, $(A, B)_{12} \sim (f_1(C), g_2(C))$, $(A, B)_{21} \sim (f_2(C), g_1(C))$, $(A, B)_{22} \sim (f_2(C), g_2(C))$.
As an example, let C be a random vector $(C_0, C_1, C_2)$ with stochastically independent components having the following interpretation: $C_0$ is a random entity representing the general level of visual attention, while $C_1$ and $C_2$ are stimulus-specific sources of randomness (which, with no loss of generality, can be taken to be uniformly distributed between 0 and 1).
With this terminology and notation, the population-level (idealized) results of the experiment in question can be presented in the form of four matrices, one per treatment, with entries $p_{ij}$, $q_{ij}$, $r_{ij}$, $s_{ij}$. The letters p, q, r, s here represent theoretical probabilities, with the usual meaning of the subscripts: $p_{1\cdot} = p_{11} + p_{12}$, $p_{2\cdot} = p_{21} + p_{22}$, $p_{1\cdot} + p_{2\cdot} = 1$, etc. It is natural to surmise that, unless the observer does not look at the stimuli at all, the random variable A should depend on (be influenced by) the value of α, and B should be influenced by the value of β. It is not obvious, however, whether factor α only (or selectively) influences random variable A, without affecting B, and whether factor β only (selectively) influences random variable B, without affecting A, as opposed to the possibilities that α also influences B, that β also influences A, or both.
Thus, we will have one of the latter scenarios if the "present" value of α visually masks or enhances the salience of the "present" value of β, or if the values of β somehow affect the level of attention the observer pays to the factor α. We denote the case when (α, β) selectively influence (A, B), respectively, by (A, B) ↫ (α, β).
1 As explained in Section 2, we distinguish random entities and their special case, random variables. Random entities take on values in arbitrary measurable spaces, while random variables map, or can be redefined to map, into real numbers endowed with the Borel sigma-algebra. Note also that the notion of a random entity (or variable) should always be taken to include deterministic entities (variables) as a special case, the same as the notion of stochastic interdependence, unless otherwise indicated, should be taken to include stochastic independence as a special case.

2 It is usually the case that the possibility of selectiveness is considered when it is known that the factors are effective in their influence upon (A, B), meaning that for at least one value of either of the two factors the change of the other factor from 1 to 2 changes the joint distribution of (A, B). This aspect of the dependence of (A, B) on (α, β) being relatively trivial, we do not include it in the definition of selective influence. In other words, (A, B) ↫ (α, β) is taken to mean that β does not influence A and α does not influence B, leaving open the question of whether α influences A and/or β influences B (see Dzhafarov, 2003, p. 10).
Marginal selectivity is an obvious consequence of (1). The reverse is not true, as illustrated by examples in Dzhafarov (2003) and, more systematically, in Kujala and Dzhafarov (2008). Other examples are given in this paper: in fact, in all our examples where the selective influence relation does not hold, marginal selectivity is satisfied. Third, the selective influence relation satisfies the nestedness property: if some random variables are selectively influenced by corresponding factors (say, (A, B, C) ↫ (α, β, γ); we need more than two factor-variable pairs for this property to be non-trivial), then any subset of these variables is selectively influenced by the corresponding subset of factors: (A, B) ↫ (α, β), (A, C) ↫ (α, γ), and (B, C) ↫ (β, γ). This property is obvious as soon as (1) is generalized to larger sets.
In this paper the three properties of selective influence will be demonstrated on the maximal level of generality, for arbitrary sets of random entities and corresponding sets of factors of arbitrary nature.

Distance Tests for Selective Influence
How can one determine that (A, B) ↫ (α, β)? In Kujala and Dzhafarov (2008) two types of necessary conditions for selective influence were formulated, termed cosphericity tests and distance tests. As we only generalize in this paper the latter class of tests, we need not discuss the former. To apply a distance test to our example means to do the following. First, the values of A and B have to be encoded by real numbers. In accordance with what we know about the transformations we can use any functions f(α, A) and g(β, B) with numerical values. Second, one chooses a number r ≥ 1. Third, for each of the four treatments αβ = 11, 12, 21, 22 one computes the quantity
$D\alpha\beta = \left(E\left[\,|A_{\alpha\beta} - B_{\alpha\beta}|^{r}\,\right]\right)^{1/r}$,
where E denotes expected value and $(A_{\alpha\beta}, B_{\alpha\beta})$ is an alternative (and more convenient) way of designating $(A, B)_{\alpha\beta}$. Note that Dαβ (the same as αβ, 21, etc.) is a string of symbols, with no multiplication involved. It has been shown in Kujala and Dzhafarov (2008) that if (A, B) ↫ (α, β), then, considering each random variable at each value of the corresponding factor as a point (this yields four points, $A_1$, $A_2$, $B_1$, $B_2$), these points can be placed in a metric space in which the values D11, D12, D21, D22 are, with some caveats, distances between A points and B points (D11 between $A_1$ and $B_1$, D12 between $A_1$ and $B_2$, etc.). As these distances, by definition, should satisfy the triangle inequality, we conclude with a bit of algebra (see Section 5) that

$\max\{D11, D12, D21, D22\} \le \tfrac{1}{2}\,(D11 + D12 + D21 + D22)$.
A distance test consists in checking if this inequality is satisfied: if not (at least for one choice of the numerical values and the exponent r), then the selective influence relation is ruled out.
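The procedure just described can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it takes the test inequality to be max{D11, D12, D21, D22} ≤ (D11 + D12 + D21 + D22)/2, the form implied by the triangle inequality, and the function names and the probabilities in `joints` are hypothetical. Yes and No are encoded as 0 and 1.

```python
from itertools import product

def distance(joint, r=1.0):
    """D = (E[|A - B|^r])^(1/r) for a joint pmf given as {(a, b): prob}."""
    return sum(p * abs(a - b) ** r for (a, b), p in joint.items()) ** (1.0 / r)

def passes_distance_test(joints, r=1.0, tol=1e-12):
    """joints maps each treatment (i, j), with i, j in {1, 2}, to a joint pmf of (A, B)."""
    d = [distance(joints[t], r) for t in product((1, 2), repeat=2)]
    return max(d) <= sum(d) / 2 + tol

# Hypothetical numbers (not from the paper): all marginals are (0.5, 0.5),
# so marginal selectivity holds at every treatment.
correlated = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}
anticorrelated = {(0, 0): 0.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.0}
joints = {(1, 1): correlated, (1, 2): correlated,
          (2, 1): correlated, (2, 2): anticorrelated}

# Here D11 = D12 = D21 = 0 and D22 = 1, so the inequality fails
# and selective influence is ruled out despite marginal selectivity.
violated = not passes_distance_test(joints)
```

A failed test for any admissible encoding and any r ≥ 1 is enough to reject selective influence; a passed test proves nothing by itself.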
In this paper we generalize this test to arbitrary sets of random entities selectively influenced by arbitrary sets of external factors. As it turns out, all one has to do to prove that all random entities, taken one for each value of the corresponding factor, can be embedded in a metric space is to apply the test just described to all two-by-two designs extractable from the given set of factors.
[…] where $h_1$, $h_2$ are some measurable functions from the set of possible values of C into interval [0, 1]. One can see that A and B are generally stochastically interdependent by virtue of depending on one and the same random entity, $C_0$, but that A does not depend on β, in the sense that for any given values of the other arguments, $C_0 = c_0$, $C_1 = c_1$, $C_2 = c_2$, and α = 1 or 2, the value of A does not change as a function of β; and B does not depend on α in the analogous sense.
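A small simulation may make this kind of construction concrete. It is only a sketch under stated assumptions: the thresholds inside `simulate` are hypothetical stand-ins for the unspecified functions of the example, chosen so that A is computed from (α, C0, C1) and B from (β, C0, C2), with C = (C0, C1, C2) having independent uniform components.

```python
import random

def simulate(alpha, beta, n=20000, seed=0):
    """Draw n pairs (A, B) with A = f(alpha, C) and B = g(beta, C)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # C = (C0, C1, C2): one shared source and two stimulus-specific sources.
        c0, c1, c2 = rng.random(), rng.random(), rng.random()
        # Hypothetical response rules; beta never enters a, alpha never enters b.
        a = int(c1 < 0.2 + 0.3 * c0 + 0.2 * (alpha == 2))
        b = int(c2 < 0.1 + 0.5 * c0 + 0.2 * (beta == 2))
        samples.append((a, b))
    return samples

# With the same seed the same values of C are reused at every treatment,
# so A is unchanged when only beta changes, and B when only alpha changes.
a_at_11 = [a for a, _ in simulate(1, 1, n=1000)]
a_at_12 = [a for a, _ in simulate(1, 2, n=1000)]
b_at_11 = [b for _, b in simulate(1, 1, n=1000)]
b_at_21 = [b for _, b in simulate(2, 1, n=1000)]
```

Because both responses depend on the shared component C0, A and B are in general stochastically interdependent within a treatment even though each is influenced by its own factor only.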
The definition of selective influence can also be looked at in a simpler and more fundamental way. The fact that for any given treatment A and B are stochastically related (i.e., paired, whether independent or interdependent) means in Kolmogorov's probability theory that A and B are measurable functions of one and the same random entity. It is always true therefore that
$A_{\alpha\beta} = f(\alpha, \beta, C_{\alpha\beta})$, $B_{\alpha\beta} = g(\alpha, \beta, C_{\alpha\beta})$.
The random entities $C_{11}$, $C_{12}$, $C_{21}$, $C_{22}$ can always be replaced with a single C, e.g., by putting $C = (C_{11}, C_{12}, C_{21}, C_{22})$ and redefining the functions f, g accordingly:
$(A, B)_{\alpha\beta} \sim (f(\alpha, \beta, C),\, g(\alpha, \beta, C))$.
Comparing this universal representation with (1) we see that the assumption of selective influence is that β in f and α in g are dummy arguments.

Main Properties of Selective Influence
There are three main properties of the selective influence relation. First, selective influence is invariant with respect to all (measurable) transformations of the random variables A, B, even if transformations of A are allowed to depend on values of factor α and transformations of B are allowed to depend on values of factor β. In our example the values of A and B are denoted yes and no. Clearly, we can encode them by 0 and 1, respectively, or by any other two numbers or words. Moreover, we can, if we so choose, denote […]. The selective influence relation, if it holds for the original values of A and B, must also hold after any such transformations. This follows from the fact that if (1) holds then after any factor-value-specific transformations F(α, A) and G(β, B) we have
$(F(\alpha, A_{\alpha\beta}),\, G(\beta, B_{\alpha\beta})) \sim (F(\alpha, f(\alpha, C)),\, G(\beta, g(\beta, C)))$.
Second, selective influence implies marginal selectivity, the term coined by Townsend and Schweickert (1989) for the situation when the marginal distribution of A does not depend on β and the marginal distribution of B does not depend on α.
[…] where P ij $x^\alpha y^\beta$ is a string of symbols, with no multiplication implied. To ascertain if (A, B) ↫ (α, β) using (4), we have to see if we can find 16 probabilities […] such that […] for all i, j, k, l ∈ {yes, no}. Indeed, […] which shows that the first of the equations (7) is equivalent to the application of (4) to $x^\alpha y^\beta = 1^\alpha 1^\beta$; and analogously for the other three equations.
Note that (7) implies marginal selectivity. For instance, it follows from (7) that […]
The generalized test is applied to all pairs of 2 × 2 treatments for all pairs of factors. To present this result in an unambiguous form we have to introduce some notation that may appear cumbersome at first: since a value of a factor generally does not itself indicate which factor it is a value of (e.g., absent or 1 can be a value of both α and β), we superscript each factor value by the corresponding factor name. In our example it would be $1^\alpha$, $2^\alpha$, $1^\beta$, $2^\beta$. We call these pairs, factor value with factor name, factor points. The four distances will now be written $D1^\alpha 1^\beta$, $D1^\alpha 2^\beta$, $D2^\alpha 1^\beta$, $D2^\alpha 2^\beta$. Note that we could only get away with the previous notation because the identity of the factors in it was encoded by the order of their values within pairs: in D11 the first 1 belonged to α and the second one to β. This convention cannot work, of course, for more than two factors. In the new notation the distance test acquires the form of the same inequality written for $D1^\alpha 1^\beta$, $D1^\alpha 2^\beta$, $D2^\alpha 1^\beta$, $D2^\alpha 2^\beta$.

The Joint Distribution Criterion for Selective Influence
Compliance with a given set of distance tests is only a necessary condition for selective influence. Is there a way to definitively prove that selective influence (A, B) ↫ (α, β) does hold if it is not ruled out by distance tests? As it turns out, the answer is affirmative, and it is an almost immediate consequence of our definition of selective influence, if presented at a sufficiently high level of mathematical rigor. Stated in intuitive terms and applied to our example, consider four hypothetical random variables, one for each of our factor points, $1^\alpha$, $2^\alpha$, $1^\beta$, $2^\beta$. Suppose that they are jointly distributed, i.e., we can speak of co-occurring quadruples of values. There are six pairwise combinations of the four factor points but only four of them, those of the form $x^\alpha y^\beta$ (x, y ∈ {1, 2}), form treatments, whereas the remaining two, $1^\alpha 2^\alpha$ and $1^\beta 2^\beta$, do not. The four treatments correspond to pairs $(A_{x^\alpha y^\beta}, B_{x^\alpha y^\beta})$ whose joint distributions are well defined. Suppose now that for all those cases when a pair of factor points forms a treatment we have […]
[…] $x^\alpha y^\beta$ are represented by four probabilities each, denoted in the four matrices introducing our example by $p_{ij}$, $q_{ij}$, $r_{ij}$, $s_{ij}$. We now switch to a more convenient notation (although again, more cumbersome at first glance): […]
In the factor-point notation the distance test reads
$\max\{D1^\alpha 1^\beta, D1^\alpha 2^\beta, D2^\alpha 1^\beta, D2^\alpha 2^\beta\} \le \tfrac{1}{2}\,(D1^\alpha 1^\beta + D1^\alpha 2^\beta + D2^\alpha 1^\beta + D2^\alpha 2^\beta)$.
In this paper the joint distribution criterion is formulated in complete generality, for arbitrary sets of random entities and corresponding sets of external factors.
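For the two-by-two binary example, checking a proposed joint distribution of the four hypothetical variables against the criterion reduces to marginalizing it over each treatment. The sketch below assumes a candidate distribution is already in hand (finding one is a separate linear feasibility problem, not shown); the function names, the ordering of the quadruple, and the numbers are hypothetical.

```python
from itertools import product

def pair_marginal(joint4, x, y):
    """Joint pmf of the variables at factor points x^alpha and y^beta.

    joint4 is a pmf over quadruples ordered as
    (value at 1^alpha, value at 2^alpha, value at 1^beta, value at 2^beta).
    """
    out = {}
    for (h1a, h2a, h1b, h2b), p in joint4.items():
        a = h1a if x == 1 else h2a
        b = h1b if y == 1 else h2b
        out[(a, b)] = out.get((a, b), 0.0) + p
    return out

def satisfies_criterion(joint4, observed, tol=1e-9):
    """True if every treatment marginal of joint4 matches the observed pmf of (A, B)."""
    for x, y in product((1, 2), repeat=2):
        m = pair_marginal(joint4, x, y)
        for key in product((0, 1), repeat=2):
            if abs(m.get(key, 0.0) - observed[(x, y)].get(key, 0.0)) > tol:
                return False
    return True

# Hypothetical data: A and B independent and uniform at every treatment.
uniform_pair = {key: 0.25 for key in product((0, 1), repeat=2)}
observed = {t: uniform_pair for t in product((1, 2), repeat=2)}

candidate_ok = {q: 1 / 16 for q in product((0, 1), repeat=4)}
candidate_bad = {(0, 0, 0, 0): 0.5, (1, 1, 1, 1): 0.5}
```

Here `candidate_ok` reproduces all four observed distributions, so it witnesses selective influence for these data, while `candidate_bad` does not match them and only shows that this particular candidate fails.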

The Need for Generalization
In a controlled experiment or systematic survey we usually focus on a small number of random entities, such as which of several responses is given and how long it has taken, and try to selectively target some of them by experimental manipulations, or selectively relate them to concomitant factors. Relatively small networks of random entities and external factors are therefore of paramount practical importance. But a network of random entities and the set of external factors that may be thought to affect them selectively can be quite large, even infinitely large, in theoretical considerations dealing with complex observable behaviors, such as a person's activities within a typical day, or unobservable "mental networks" behind even relatively simple tasks, such as pushing a key in response to a stimulus varying in two binary properties (see Dzhafarov, Schweickert, and Sung, 2004, for an example). Random processes are routinely used in modeling simple forms of decision making (see, e.g., Diederich and Busemeyer, 2003). Any random process can be viewed as a system of stochastically interdependent random entities indexed by "intervention values" (including "no intervention") at every moment of time. An intervention α at moment t 1 can be thought to selectively affect a portion of the random process in some interval [t 1 , t 2 ] (perhaps even with t 2 = t 1 ), and the problem arises as to how to identify such an interval from the observed joint distribution of the random entities constituting the process. It is important therefore to be able to apply the notion of selective influence to arbitrary, finite and infinite, systems of random entities, and external factors.

Conventions and Notation
A factor is defined as a non-empty set of factor points (a dummy factor can be defined as a set containing a single point). Denoting factors by lowercase Greek letters, α, β, γ, …, the factor points of, say, factor α are formally pairs (x, 'α') consisting of a factor value (or level), x, and a unique factor name, 'α' (read: value/level x of factor α). This ensures that no two distinct factors have common points: e.g., level 1 of factor 'size', (1, 'size'), is distinct from level 1 of factor 'shape', (1, 'shape'). It is convenient to write $x^\alpha$ in place of (x, 'α'): $1^{\text{shape}}$, $(50\ \text{dB})^{\text{intensity}}$, $\text{present}^{\text{left stimulus}}$, etc.
Let Φ be a non-empty set of factors. A set φ containing precisely one factor point $x^\alpha$ for each factor α in Φ is called a treatment. 3
It is easy to check that this distribution satisfies (6) and (7), hence also (4). By the joint distribution criterion, we conclude that {A, B} ↫ {α, β}.
Note that $A_\varphi$ and $A_{\varphi'}$ are defined on different sample spaces: they do not possess a joint distribution. In particular, they are not mutually independent.
A set of random entities $\{A_\omega\}_{\omega\in\Omega}$ on one and the same sample space is a random entity whose observation space (A, Σ) is the conventionally understood product of the observation spaces $(A_\omega, \Sigma_\omega)$ for $A_\omega$, ω ∈ Ω. If the set of random entities $\{A_\omega\}_{\omega\in\Omega}$ depends on Φ, we present $\{A_\omega\}_{\omega\in\Omega}$ at a treatment φ as $\{A^\varphi_\omega\}_{\omega\in\Omega}$ instead of the more correct but less convenient $(\{A_\omega\}_{\omega\in\Omega})_\varphi$.

Selective Influence
In accordance with the previous section, given a set of factors Φ, a corresponding set of random entities is denoted $\{A_\alpha\}_{\alpha\in\Phi}$. For each α ∈ Φ, the entity $A_\alpha$ may in fact be a shortcut notation for a set of stochastically unrelated random entities indexed by different treatments, $\{A^\varphi_\alpha\}_{\varphi\in\Pi\Phi}$. In other words, $A^\varphi_\alpha$ is treated as a random entity A corresponding to factor α and taken at treatment φ. The complete notation for the set of random entities $\{A_\alpha\}_{\alpha\in\Phi}$ then is
$\{\{A^\varphi_\alpha\}_{\alpha\in\Phi}\}_{\varphi\in\Pi\Phi}$, (8)
where the elements of $\{A^\varphi_\alpha\}_{\alpha\in\Phi}$, for a given φ, are stochastically interrelated (possess a joint distribution), while the sets $\{A^\varphi_\alpha\}_{\alpha\in\Phi}$ and $\{A^{\varphi'}_\alpha\}_{\alpha\in\Phi}$, for distinct φ, φ′, are stochastically unrelated. It is more convenient, however, not to use this explicit notation and to speak instead of $\{A_\alpha\}_{\alpha\in\Phi}$ depending on Φ.
Definition 3.1. Let a set of random entities $\{A_\alpha\}_{\alpha\in\Phi}$ indexed by a set of factors Φ depend on this set of factors (i.e., be presentable as (8)). We say that the dependence of $\{A_\alpha\}_{\alpha\in\Phi}$ on Φ is marginally selective (satisfies the property of marginal selectivity) if, for any subset $\Phi_1 \subset \Phi$ and any $\varphi_1 \in \Pi\Phi_1$, the distribution of $\{A^\varphi_\alpha\}_{\alpha\in\Phi_1}$ is the same for all treatments φ containing $\varphi_1$ (that is, it does not depend on $\{x^\beta \in \varphi : \beta \in \Phi - \Phi_1\}$).
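For a finite design with finitely many values, Definition 3.1 can be checked mechanically. The following is a minimal sketch, assuming each treatment is a tuple of factor levels and each joint distribution is a dictionary over value tuples, with position k of both tuples belonging to the k-th factor and its corresponding variable; the function names and the example numbers are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def marginal(joint, idx):
    """Marginal pmf of the variables at positions idx."""
    out = defaultdict(float)
    for values, p in joint.items():
        out[tuple(values[i] for i in idx)] += p
    return dict(out)

def is_marginally_selective(design, tol=1e-9):
    """design: {treatment tuple: {value tuple: prob}}."""
    n = len(next(iter(design)))
    for size in range(1, n):                    # proper non-empty subsets Phi_1
        for idx in combinations(range(n), size):
            seen = {}
            for phi, joint in design.items():
                key = tuple(phi[i] for i in idx)  # phi_1: the part of phi on Phi_1
                m = marginal(joint, idx)
                ref = seen.setdefault(key, m)
                support = set(m) | set(ref)
                if any(abs(m.get(v, 0.0) - ref.get(v, 0.0)) > tol for v in support):
                    return False
    return True

# Hypothetical 2 x 2 design with binary responses.
same = {(0, 0): 0.5, (1, 1): 0.5}
flip = {(0, 1): 0.5, (1, 0): 0.5}
design = {(1, 1): same, (1, 2): same, (2, 1): same, (2, 2): flip}
broken = dict(design)
broken[(1, 2)] = {(0, 0): 1.0}   # the marginal of A now changes with beta
```

The check compares distributions only; it never equates random variables taken at different treatments, since those have no joint distribution.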
The notion of marginal selectivity was introduced by Townsend and Schweickert (1989), for two random variables. In Dzhafarov (2003) it was generalized to a finite set of random variables under the name of complete marginal selectivity.
When the set of factors Φ is finite, treatments will be presented as strings of factor points, without commas or parentheses: $x^\alpha y^\beta z^\gamma$, $x_1^{\mu_1} x_2^{\mu_2} \ldots x_k^{\mu_k}$, etc.
A random entity A is a triad consisting of a measurable function f : A′ → A, a sample (probability) space (A′, Σ′, M′), and an observation (measurable) space (A, Σ), on which f induces a probability measure M. Traditionally, A is simply identified with f, the sample space and the observation space being assumed implicitly, or A is viewed as the identity function on A, with (A′, Σ′, M′) = (A, Σ, M). The latter view is often the only practical one, as we almost never know anything about a sample space as separate from the observation space.
A random variable is a random entity whose observation space is a subset of reals endowed with the Borel sigma-algebra. 4 Given an arbitrary indexing set Ω, any set of random entities whose measurable functions $\{f_\omega : A' \to A_\omega\}_{\omega\in\Omega}$ map from one and the same sample space (A′, Σ′, M′) into respective observation spaces $\{(A_\omega, \Sigma_\omega)\}_{\omega\in\Omega}$ possesses a joint distribution, i.e., a probability measure M induced by M′ on the product space $\bigotimes_{\omega\in\Omega}(A_\omega, \Sigma_\omega)$. 5 Two random entities A and B defined on different sample spaces are called (stochastically or probabilistically) unrelated (see Dzhafarov and Gluhovsky, 2006). They do not possess a joint distribution. Note that two unrelated random variables can be identically distributed, if they map into one and the same observation space on which they induce one and the same probability measure.
Throughout this paper we deal with a set of probabilistically unrelated random entities $\{A_\varphi\}_{\varphi\in\Pi\Phi}$ indexed by treatments φ ∈ ΠΦ, with measures $\{M_\varphi\}_{\varphi\in\Pi\Phi}$ induced on one and the same observation space (A, Σ). For convenience, we refer to $A_\varphi$ as "a random entity A at φ", as if $A_\varphi$ and $A_{\varphi'}$ for φ ≠ φ′ were "a single" entity A at two different treatments.
3 Strictly speaking, an element of the Cartesian product ΠΦ is a choice function, $\{(\alpha, x^\alpha)\}_{\alpha\in\Phi}$, whereas a treatment φ is the range of a choice function, $\{x^\alpha\}_{\alpha\in\Phi}$. We conveniently confuse the two notions. Also for convenience only, in this paper we assume "completely crossed design", i.e., that every member of ΠΦ is a possible treatment. With only slight modifications ΠΦ can be replaced with any nonempty subset thereof.
4 A random entity A with A a finite or infinite denumerable set and Σ the set of all its subsets can also be (and traditionally is) considered a random variable, because such an A can always be injectively mapped into the set of reals, or into a partition of an interval of reals.
5 Recall that in the product measurable space $\bigotimes_{\omega\in\Omega}(A_\omega, \Sigma_\omega) = (A, \Sigma)$ the set A is the Cartesian product $\prod_{\omega\in\Omega} A_\omega$, while $\Sigma = \bigotimes_{\omega\in\Omega}\Sigma_\omega$ is the smallest sigma algebra containing all sets of the form $a_\omega \times \prod_{\omega'\in\Omega-\{\omega\}} A_{\omega'}$, with $a_\omega \in \Sigma_\omega$ and ω ∈ Ω.
6 We could have extended the scope of this definition by allowing $A_\varphi$ to be a function […], i.e., by allowing the set and the sigma algebra, not only the measure $M_\varphi$, to depend on treatment φ. This would have, however, made our abuse of language (in treating different $A_\varphi$'s as a single A at different φ's) even more abusive. Moreover, this general approach can always be reduced to the set-up with a φ-independent (A, Σ) by putting $A = \bigcup_{\varphi\in\Pi\Phi}(A_\varphi \times \{\varphi\})$ and Σ the sigma algebra consisting of all countable unions of the sets a × {φ} for all $a \in \Sigma_\varphi$ and all φ ∈ ΠΦ.
[…] where ⊥ indicates stochastic independence and ¬ negation (i.e., $A_{2^\alpha 2^\beta}$ and $B_{2^\alpha 2^\beta}$ are not stochastically independent).
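If now we attempt to use in these relations the abridged indexing, we will run into a contradiction: from […]
Definition 3.4. A set of random entities $\{A_\alpha\}_{\alpha\in\Phi}$ is said to be selectively influenced by Φ (in symbols, $\{A_\alpha\}_{\alpha\in\Phi}$ ↫ Φ) if, for some random entity C and every $x^\alpha \in \alpha \in \Phi$ there is a measurable function $f_{x^\alpha}$ such that, for every treatment φ,
$\{A^\varphi_\alpha\}_{\alpha\in\Phi} \sim \{f_{x^\alpha}(C)\}_{x^\alpha\in\varphi}$.
The adjective "complete" in the term complete marginal selectivity (omitted in the present paper for simplicity) distinguishes this notion from a weaker and less useful generalization of Townsend and Schweickert's term: for any factor α ∈ Φ and any treatment φ, the distribution of $A^\varphi_\alpha$ does not depend on $\{x^\beta \in \varphi : \beta \in \Phi - \{\alpha\}\}$. Note that Definition 3.1 does not mean that for distinct treatments φ and φ′ which include $\varphi_1 = \{x^\beta \in \varphi : \beta \in \Phi_1\} = \{x^\beta \in \varphi' : \beta \in \Phi_1\}$,
$\{A^\varphi_\alpha\}_{\alpha\in\Phi_1} = \{A^{\varphi'}_\alpha\}_{\alpha\in\Phi_1}$.
This equality is not legitimate as the two sets of random variables do not possess a joint distribution. One can only say that
$\{A^\varphi_\alpha\}_{\alpha\in\Phi_1} \sim \{A^{\varphi'}_\alpha\}_{\alpha\in\Phi_1}$,
where, as before, ∼ means "is distributed as."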

The Joint Distribution Criterion
Definition 3.4 suggests a way of looking at the selective influence relation directly in terms of the (product) observation space for the system of the random entities involved, making the overt reconstruction of C and the functions $f_{x^\alpha}$ unnecessary (or trivial, as in the proof of the theorem below).

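Theorem 4.1. A necessary and sufficient condition for $\{A_\alpha\}_{\alpha\in\Phi}$ ↫ Φ is that there exist a jointly distributed set of random entities, one for every factor point $x^\alpha$ of every factor α ∈ Φ, such that for every set φ of factor points that forms a treatment (i.e., belongs to ΠΦ), the subset of these entities indexed by the points of φ is distributed as $\{A^\varphi_\alpha\}_{\alpha\in\Phi}$.
Remark 4.2. We call this the joint distribution criterion for selective influence.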
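Proof. The necessity is proved by observing that if $\{A_\alpha\}_{\alpha\in\Phi}$ ↫ Φ, then the system $\{f_{x^\alpha}(C)\}$, taken over all factor points $x^\alpha$, is a jointly distributed system of random entities. To prove the sufficiency, define C to be the jointly distributed system posited by the criterion and, for every $x^\alpha$, define $f_{x^\alpha} = \mathrm{Proj}_{x^\alpha}$, where $\mathrm{Proj}_{x^\alpha}$ denotes the $x^\alpha$th coordinate projection.
As a very simple application of the joint distribution criterion we prove the following (intuitively quite obvious) statement.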
Remark 3.5. Alternatively, one could posit, for every treatment φ, $\{A^\varphi_\alpha\}_{\alpha\in\Phi} \sim \{f_{x^\alpha}(C_\varphi)\}_{x^\alpha\in\varphi}$, where $\{C_\varphi\}_{\varphi\in\Pi\Phi}$ is a set of pairwise unrelated random entities all distributed as C. This formulation is more cumbersome but it correctly emphasizes the stochastic unrelatedness of $\{A^\varphi_\alpha\}_{\alpha\in\Phi}$ for different treatments φ. Definition 3.4, however, is more parsimonious, as the stochastic unrelatedness property is known from the context.
Remark 3.6. If applied to finite sets Φ, Definition 3.4 becomes equivalent to the formulations of selective influence given in Dzhafarov (2003), Dzhafarov and Gluhovsky (2006), and Kujala and Dzhafarov (2008). Even for the finite case, however, the present definition is mathematically more rigorous, and it profits from the precision offered by the notation x α = (x, 'α') for factor points. More importantly, it can be seen more immediately than the previous definitions to be reformulable into the joint distribution criterion for selective influence, as discussed in the next section.
The following statements are obvious.
That is (refer to Section 1.2), selective influence has the nestedness property and implies marginal selectivity.
The next lemma says that if a set of random entities {A α } α∈Φ is selectively influenced by Φ, then the set of individually transformed versions of these random variables is also selectively influenced by Φ (refer to the first property in Section 1.2). "Individual transformations" of A α can be different for different factor points x α .
where, for any α ∈ Φ, any $x^\alpha \in \alpha$, and any treatment φ containing $x^\alpha$, […] which implies […]
In any case, the exceptional simplicity of these tests makes it worthwhile to always consider them before applying the joint distribution criterion.

Distance Tests
In Kujala and Dzhafarov (2008) the distance tests were formulated for two variables influenced by two factors in a two-by-two factorial design. In this section we generalize these tests to arbitrary random variables {A α } α∈Φ whose dependence on factors Φ is marginally selective. Perhaps surprisingly, we show that this generalization requires nothing more and nothing less than applying the original tests to all possible two-by-two factorial designs one can extract from Φ. Distance tests can be applied to non-numerical random entities only after they have been numerically transformed (thus, for the distance test applied to Example 1.2 we transformed yes into 0 and no into 1). In this section therefore we confine our discussion to random variables.
We will need some auxiliary notions and notation conventions. Any finite sequence of factor points $(x_1^{\alpha_1}, \ldots, x_n^{\alpha_n})$ is called a chain. Chains will be written as strings, $x_1^{\alpha_1} \ldots x_n^{\alpha_n}$, without commas and parentheses (this generalizes the convention we have already used for chains which are finite treatments). Chains can be denoted by capital Roman letters, $X = x_1^{\alpha_1} \ldots x_n^{\alpha_n}$ (from the second half of the alphabet, to distinguish them from random variables and entities for which we use the first half). A chain X may be empty or consist of a single element (factor point), $x^\alpha$. A subsequence of points belonging to a chain forms its subchain.
A concatenation of two chains X and Y is written as XY. So, we can have chains $x^\alpha X y^\beta$, $x^\alpha XY y^\beta$, $X x^\alpha y^\beta Y$, $x^\alpha X y^\beta Z$, etc.
The number of points in a chain X is its cardinality, | X |, and any chain with the smallest cardinality within a set of chains is referred to as a minimal chain (in this set). In particular, one can speak of a minimal subchain of a chain among all subchains with a certain property (this notion is used in the proof of Theorem 5.11 below).
Definition 5.1. Let the dependence of a set of random variables $\{A_\alpha\}_{\alpha\in\Phi}$ on factors Φ be marginally selective. Let r ≥ 1 be fixed. For any $(x^\alpha, y^\beta)$ with α ≠ β, we define
$Dx^\alpha y^\beta = \left\| A_\alpha^{x^\alpha y^\beta} - A_\beta^{x^\alpha y^\beta} \right\|_r$,
where $\|A - B\|_r$ for any jointly distributed A and B is defined as
$\|A - B\|_r = \left(E\left[\,|A - B|^r\,\right]\right)^{1/r}$ for 1 ≤ r < ∞, and $\|A - B\|_\infty = \operatorname{ess\,sup} |A - B|$.
Remark 5.2. Here ess sup is the essential supremum, the lowest upper bound that holds almost surely; it is the limit of $\|A - B\|_r$ as r → ∞.
Remark 5.3. Note that $Dx^\alpha y^\beta$ is well-defined only under the assumption that the dependence of a set of random variables $\{A_\alpha\}_{\alpha\in\Phi}$ on factors Φ is marginally selective. Otherwise $\| A_\alpha^{x^\alpha y^\beta} - A_\beta^{x^\alpha y^\beta} \|_r$ would not be determined by $x^\alpha y^\beta$ only, and it would not even be legitimate to index the two variables by $x^\alpha y^\beta$ alone.
For the next theorem, recall that we are following Convention 5.6.
Theorem 5.11. Every contravening chain X contains a contravening tetradic subchain X′ of the form $x^\alpha y^\beta v^\alpha u^\beta$.
Proof. Let $X' = x^\alpha P u^\beta$ be a minimal contravening subchain of X. Then α ≠ β, and by Lemma 5.9, | X′ | ≥ 4. If for some $z^\gamma$ in X′ we had α ≠ γ ≠ β, then the subchains $x^\alpha Q z^\gamma$ and $z^\gamma R u^\beta$ with $Q z^\gamma R = P$ would have to be compliant (otherwise X′ would not be minimal). Then, by Lemma 5.8, we would have a contravening triadic chain $x^\alpha z^\gamma u^\beta$, which is impossible by Lemma 5.9. For every $z^\gamma$ in X′ therefore, either γ = α or γ = β. Since a contravening chain cannot contain repeating superscripts, X′ is of the form $x^\alpha y^\beta v^\alpha S u^\beta$.
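Theorem 5.11 suggests a simple exhaustive check: it is enough to examine tetradic chains of the form x^α y^β v^α u^β over all pairs of factors. The sketch below assumes, as a labeled assumption, that D of a chain is the sum of the D values of its successive pairs and that D is symmetric in its two factor points; the function name, the dictionary layout, and the numbers are hypothetical.

```python
from itertools import combinations, permutations

def contravening_tetrads(D, factors, tol=1e-12):
    """List tetradic chains x^a y^b v^a u^b whose summed length is below D(x^a, u^b).

    D: {frozenset({(x, a), (y, b)}): distance} for factor points of distinct factors.
    factors: {factor name: list of values}.
    """
    def d(p, q):
        return D[frozenset((p, q))]

    bad = []
    for a, b in combinations(factors, 2):
        for x, v in permutations(factors[a], 2):
            for y, u in permutations(factors[b], 2):
                chain = ((x, a), (y, b), (v, a), (u, b))
                # Assumed chain length: sum over successive pairs of the chain.
                length = sum(d(chain[i], chain[i + 1]) for i in range(3))
                if length < d(chain[0], chain[3]) - tol:
                    bad.append(chain)
    return bad

# Hypothetical distances for a 2 x 2 design: only D 2^alpha 2^beta is non-zero.
values = {(1, 1): 0.0, (1, 2): 0.0, (2, 1): 0.0, (2, 2): 1.0}
D = {frozenset({(x, "alpha"), (y, "beta")}): dist for (x, y), dist in values.items()}
factors = {"alpha": [1, 2], "beta": [1, 2]}
bad = contravening_tetrads(D, factors)
```

For two binary factors the four chains examined correspond to the four triangle-type inequalities behind the two-by-two distance test; with more factors or more values the same loop covers every extractable two-by-two design.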
We will rely on the following result, whose proof we omit as its only non-trivial part follows from the Minkowski inequality (a somewhat abridged proof can be found in Kujala and Dzhafarov, 2008).
Lemma 5.4. Given a sample space, let R be a set of all random variables A, B, … (jointly distributed) on this space. For any r ≥ 1, $\|A - B\|_r$ is an extended metric on R, provided we do not distinguish A, B identical on a set of measure 1.
Remark 5.5. The adjective "extended" means that ∞ is included in the set of possible values. The norms $\left(E\left[\,|A - B|^r\,\right]\right)^{1/r}$ and ess sup | A − B |, as they only involve non-negative values, always exist, finite or infinite.
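The metric property in Lemma 5.4 can be illustrated numerically for finitely supported variables. This sketch assumes a joint pmf over triples of real values (three jointly distributed random variables) and simply evaluates the triangle inequality for a chosen r, with r = ∞ handled as the essential supremum; the names and the example distribution are hypothetical.

```python
import math

def norm_diff(joint, i, j, r):
    """||V_i - V_j||_r for a finite joint pmf {value tuple: prob}, r >= 1 or math.inf."""
    pairs = [(abs(v[i] - v[j]), p) for v, p in joint.items() if p > 0]
    if math.isinf(r):
        # Essential supremum: largest difference occurring with positive probability.
        return max(diff for diff, _ in pairs)
    return sum(p * diff ** r for diff, p in pairs) ** (1.0 / r)

def triangle_holds(joint, r, tol=1e-12):
    """Check ||V0 - V2||_r <= ||V0 - V1||_r + ||V1 - V2||_r."""
    d01 = norm_diff(joint, 0, 1, r)
    d12 = norm_diff(joint, 1, 2, r)
    d02 = norm_diff(joint, 0, 2, r)
    return d02 <= d01 + d12 + tol

# Hypothetical joint distribution of three random variables.
joint = {(0, 1, 3): 0.25, (2, 2, 0): 0.25, (1, 0, 1): 0.5}
checks = [triangle_holds(joint, r) for r in (1, 2, 5, math.inf)]
```

The inequality holds for every r ≥ 1 here, as the Minkowski inequality guarantees; the check is an illustration, not a proof.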
Convention 5.6. In the remainder of this section we will tacitly assume that the dependence of $\{A_\alpha\}_{\alpha\in\Phi}$ on Φ is marginally selective. We will also tacitly assume that r in the definition of $\|\ldots\|_r$ and D is fixed.
Definition 5.7. A chain $x^\alpha X y^\beta$ is said to be compliant with the chain inequality (or simply, compliant) if $Dx^\alpha X y^\beta \ge Dx^\alpha y^\beta$. The chain is said to be contravening (the chain inequality) if $Dx^\alpha X y^\beta < Dx^\alpha y^\beta$.
It follows from this definition that if $x^\alpha X y^\beta$ is contravening or compliant, then α ≠ β (otherwise $Dx^\alpha y^\beta$ is not defined), and no factor in $x^\alpha X y^\beta$ occurs twice in succession. For a chain to be contravening, in addition, X must be non-empty (i.e., $|x^\alpha X y^\beta| \ge 3$; Lemma 5.9 below shows that in fact $|x^\alpha X y^\beta| \ge 4$). A non-contravening chain need not be compliant: it may, e.g., be any chain with fewer than 3 elements, or it can be any chain of the form $x^\alpha X y^\alpha$. Analogously, a non-compliant chain is not necessarily contravening. 7
Lemma 5.8. Let $U = X y^\beta Y z^\gamma Z$ be a contravening chain with a compliant subchain $y^\beta Y z^\gamma$. Then $U^* = X y^\beta z^\gamma Z$ (i.e., U without Y) is a contravening subchain of U.
Proof. Let $x^\alpha$ and $u^\delta$ be the first and the last elements of U, respectively (then necessarily α ≠ δ). Note that $x^\alpha$ may coincide with $y^\beta$ or $u^\delta$ with $z^\gamma$ (but not both). From […]
7 We cannot resist mentioning at this point a surprising mathematical similarity between the conceptual apparatus of (hence also the notation adopted in) the present theory, especially in this section, and that of the completely unrelated theory of "regular well-matched spaces" developed in Dzhafarov and Dzhafarov (2010) for comparative judgments. In particular, factors and factor points seem to be formally homologous to "stimulus areas" and "stimuli," respectively, and the contravening chains of the present theory essentially mirror the "soritical" sequences for comparative judgments, so that the proof of Theorem 5.11 below is almost identical to that of Lemma 3.3 of Dzhafarov and Dzhafarov (2010).
8 One can easily generalize this reasoning to show that every chain $x_1^{\alpha_1} \ldots x_n^{\alpha_n}$ with pairwise distinct $\{\alpha_1, \ldots, \alpha_n\}$ is compliant. As will be apparent from the proof of Theorem 5.11, however, in the present development we should only be concerned with n = 3.